# TED Data Analysis

In [1]:
from IPython.display import Image
Image(url='TED.gif') 

Founded in 1984 by Richard Saul Wurman as a nonprofit organisation that aimed to bring together experts from the fields of Technology, Entertainment and Design, TED Conferences have gone on to become the Mecca of ideas from virtually all walks of life. As of 2015, TED and its sister TEDx chapters have published more than 2000 talks for free consumption by the masses, and its speaker list boasts the likes of Al Gore, Jimmy Wales, Shahrukh Khan and Bill Gates.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from scipy import stats 
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### TED Dataset

In [3]:
df = pd.read_csv('ted_main.csv')

FileNotFoundError: [Errno 2] File b'ted_main.csv' does not exist: b'ted_main.csv'

### Features Available

* **name**: The official name of the TED Talk. Includes the title and the speaker.
* **title**: The title of the talk.
* **description**: A blurb of what the talk is about.
* **main_speaker**: The first named speaker of the talk.
* **speaker_occupation**: The occupation of the main speaker.
* **num_speaker**: The number of speakers in the talk.
* **duration**: The duration of the talk in seconds.
* **event**: The TED/TEDx event where the talk took place.
* **film_date**: The Unix timestamp of the filming.
* **published_date**: The Unix timestamp for the publication of the talk on TED.com.
* **comments**: The number of first-level comments made on the talk.
* **tags**: The themes associated with the talk.
* **languages**: The number of languages in which the talk is available.
* **ratings**: A stringified dictionary of the various ratings given to the talk.
* **related_talks**: A list of dictionaries of recommended talks to watch next.
* **url**: The URL of the talk.
* **views**: The number of views on the talk.
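
Because `ratings` is stored as a string, it has to be parsed before it can be analysed. The following is a minimal sketch, assuming each value is a stringified list of dictionaries with `name` and `count` keys. The sample string and the helper name `parse_ratings` are hypothetical and not taken from the dataset.

```python
import ast

# Hypothetical sample in the shape the column is assumed to have.
sample_ratings = "[{'id': 7, 'name': 'Funny', 'count': 120}, {'id': 1, 'name': 'Beautiful', 'count': 45}]"

def parse_ratings(raw):
    """Turn a stringified list of rating dicts into a {name: count} mapping."""
    entries = ast.literal_eval(raw)  # evaluates Python literals only, unlike eval()
    return {entry['name']: entry['count'] for entry in entries}

rating_counts = parse_ratings(sample_ratings)
print(rating_counts)
```

If the real column matches this shape, the same helper could be applied with `df['ratings'].apply(parse_ratings)`.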

In [None]:
df.head()

Reorder the columns

In [None]:
df = df[['name', 'title', 'description', 'main_speaker', 'speaker_occupation', 'num_speaker', 
         'duration', 'event', 'film_date', 'published_date', 
         'comments', 'tags', 'languages', 'ratings', 'related_talks', 
         'url', 'views']]


Convert timestamps into a human-readable format

In [None]:
import datetime
df['film_date'] = df['film_date'].apply(lambda x: datetime.datetime.fromtimestamp( int(x)).strftime('%d-%m-%Y'))
df['published_date'] = df['published_date'].apply(lambda x: datetime.datetime.fromtimestamp( int(x)).strftime('%d-%m-%Y'))
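
Note that `datetime.fromtimestamp` without a `tz` argument converts to the machine's local time, so a timestamp close to midnight can map to different dates on different machines. A minimal sketch of a time-zone-explicit variant follows. The helper name `unix_to_date_string` and the sample timestamp are hypothetical.

```python
from datetime import datetime, timezone

def unix_to_date_string(ts, fmt='%d-%m-%Y'):
    """Format a Unix timestamp (in seconds) as a date string in UTC."""
    # Passing tz explicitly avoids depending on the local time zone.
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime(fmt)

# Hypothetical timestamp, for illustration only.
print(unix_to_date_string(1140825600))  # 25-02-2006
```

It could be used in place of the lambdas above, for example `df['film_date'].apply(unix_to_date_string)`.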

In [None]:
df['published_date'].iloc[0]

The dataset covers all the talks posted on the TED platform up to September 21, 2017.

In [None]:
len(df)

### Most Viewed Talks of All Time

These are the 15 most viewed TED Talks of all time. The number of views gives us a good idea of the popularity of a talk.

In [None]:
ted_talks = df[['title', 'main_speaker', 'views', 'film_date']].sort_values(by = 'views', ascending = False)[:15]

In [None]:
ted_talks

* Ken Robinson's talk on **Do Schools Kill Creativity?** is the most popular TED Talk of all time, with 47.2 million views.
* Robinson's talk is closely followed by Amy Cuddy's talk on **Your Body Language May Shape Who You Are**.

In [None]:
sns.set_style('whitegrid')

In [None]:
ted_talks['fname'] = ted_talks['main_speaker'].apply(lambda x : x[:3])
plt.figure(figsize = (10, 6))
plt.title('Speaker and Views', fontsize = 20)
sns.barplot(x = 'fname', y = 'views', data = ted_talks)

In [None]:
plt.title('Distribution of Views', fontsize = 20)
sns.distplot(df['views'])

In [None]:
df['views'].describe()

The average number of views on TED Talks is **1.6 million** and the median number of views is **1.12 million**. This suggests a very high average level of popularity of TED Talks.

### Comments 

In [None]:
df['comments'].describe()

* On average, there are **191.5 comments** on every TED Talk.
* There is a **huge standard deviation** associated with the comments.
* The minimum number of comments on a talk is **2** and the maximum is **6404**.

In [None]:
plt.title('Distribution of Comments', fontsize = 20)
sns.distplot(df['comments'])

In [None]:
plt.figure(figsize = (10, 6))
plt.title('Comments Less Than 500', fontsize = 20)
sns.distplot(df[df['comments']<500]['comments'])

From the plot above, we can see that the bulk of the talks have **fewer than 500 comments**. The mean has been heavily influenced by outliers.
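
The effect of a single extreme value on the mean can be illustrated with a small example. The counts below are hypothetical, apart from the maximum of 6404 reported above.

```python
from statistics import mean, median

# Hypothetical comment counts: five typical talks plus one extreme outlier.
comment_counts = [50, 80, 100, 120, 150, 6404]

print(mean(comment_counts))    # pulled far above the typical values
print(median(comment_counts))  # stays close to the typical values
```

This is why the median is often a more representative summary than the mean for heavily skewed counts such as these.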

Is the number of views correlated with the number of comments? We would expect so, since more popular videos tend to attract more comments. Let's find out.

In [None]:
sns.jointplot(x = 'views', y = 'comments', data = df)

In [None]:
df[['views','comments']].corr()

As the scatterplot and the correlation matrix show, the Pearson coefficient is slightly **more than 0.5**. This result is in line with what we expected.
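
For reference, `DataFrame.corr()` uses the Pearson method by default: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. A minimal hand-written sketch follows. The function name `pearson_r` and the toy values are hypothetical and not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, for illustration only.
toy_views = [1, 2, 3, 4, 5]
toy_comments = [2, 1, 4, 3, 5]
print(pearson_r(toy_views, toy_comments))  # 0.8
```
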
Let us now check the number of views and comments on the 10 most commented TED Talks of all time.

In [None]:
df[['title', 'main_speaker', 'views', 'comments']].sort_values('comments', ascending = False).head(10)

As can be seen above, Richard Dawkins' talk on **Militant Atheism** generated the greatest amount of discussion and opinions despite having significantly fewer views than Ken Robinson's talk, which is second in the list.

We will define a new feature, the discussion quotient, which is simply the ratio of the number of comments to the number of views. We will then check which talks have the largest discussion quotient.

In [None]:
df['discussion_quo'] = df['comments']/df['views']

In [None]:
df[['title', 'main_speaker', 'views', 'comments', 'discussion_quo', 'film_date']].sort_values('discussion_quo', ascending = False).head(10)

The most discussed talk, by this measure, is **The case for same sex marriage**.

### Analysing TED Talks by the Month and the Year

In [None]:
# film_date is formatted as 'dd-mm-YYYY', so the month is the second field
df['month'] = df['film_date'].apply(lambda x: x.split('-')[1])

In [None]:
month_df = pd.DataFrame(df['month'].value_counts()).reset_index()
month_df.columns = ['month', 'talks']

In [None]:
plt.title('Talks Per Month', fontsize = 20)
s = sns.barplot(x='month', y='talks', data = month_df)
s.set(xlabel = 'Months', ylabel = 'Number of talks')
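
Since `film_date` was formatted as `dd-mm-YYYY`, month names can be recovered from the second field and listed in calendar order. The following is a minimal sketch using only the standard library. The names `MONTH_ORDER` and `month_abbr_from_date`, and the sample dates, are hypothetical.

```python
from collections import Counter

MONTH_ORDER = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def month_abbr_from_date(date_str):
    """Return the month abbreviation for a 'dd-mm-YYYY' string."""
    month_number = int(date_str.split('-')[1])  # second field is the month
    return MONTH_ORDER[month_number - 1]

# Hypothetical dates, for illustration only.
sample_dates = ['25-02-2006', '27-02-2006', '14-07-2009', '01-12-2011']
counts = Counter(month_abbr_from_date(d) for d in sample_dates)

# List the counts in calendar order rather than by frequency.
talks_per_month = [(m, counts.get(m, 0)) for m in MONTH_ORDER]
print(talks_per_month)
```

With the DataFrame, the helper could be applied as `df['film_date'].apply(month_abbr_from_date)`, and `order=MONTH_ORDER` could be passed to `sns.barplot` so the bars appear in calendar order.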

**February** is the most popular month for TED conferences, whereas **August** and **December** are the least popular. February's popularity is largely due to the fact that the official TED Conferences are held in February. Let us check the distribution for TEDx talks only.