# 💻 Data Science Best Practices with pandas
---
# 📝 Introduction to the TED Talks dataset

> This kernel follows `Kevin Markham`'s talk at `PyCon 2019`.
> - 📍 PyCon: [Full Conference](https://www.youtube.com/watch?v=ZjrUmNq41Eo&t=3778s)
> - 📍 Youtube channel: [Data School](https://www.youtube.com/user/dataschool)
---

# 🔍 Exploratory Data Analysis
> Exploratory Data Analysis is all about answering specific questions. In this notebook we will try to answer the following questions:
> 1. Which talks provoke the most online discussion?
> 2. What were the "best" events in TED history to attend?
> 3. Which occupations deliver the funniest TED talks on average?

In [None]:
!pip install hvplot

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import hvplot.pandas

In [None]:
ted = pd.read_csv("/kaggle/input/ted-talks/ted_main.csv")
ted.head()

In [None]:
# rows, columns
ted.shape

In [None]:
# object columns are usually strings, but can also be arbitrary Python objects (lists, dictionaries)
ted.dtypes

In [None]:
ted.info()

In [None]:
# count the number of missing values in each column
ted.isna().sum()

# ❌ Missing Values
> In `machine learning`, we need to handle `missing values`. There are several types of missing values:
> - `Standard Missing Values`: missing values that pandas detects on its own (such as `NaN`).
> - `Non-Standard Missing Values`: missing values recorded in some other format (for example a placeholder string such as `--`), which pandas does not detect by default.
> - `Unexpected Missing Values`: values of the wrong type. For example, if a feature is expected to be a string but contains a number, then technically this is also a missing value.
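
As a minimal sketch of the difference between standard and non-standard missing values, the toy Series below (made-up values, not taken from the TED data) mixes a real `None` with the placeholder strings `--` and `unknown`. Treating those two strings as missing is an assumption made purely for illustration.

```python
import pandas as pd

# Toy data, not from the TED dataset
occupations = pd.Series(["Author", None, "--", "unknown", "Designer"])

# Standard missing values: pandas detects the None on its own
standard_missing = int(occupations.isna().sum())

# Non-standard missing values: we have to say which strings mean "missing"
placeholders = ["--", "unknown"]  # assumption for this toy example
cleaned = occupations.mask(occupations.isin(placeholders))
all_missing = int(cleaned.isna().sum())

print(standard_missing, all_missing)
```

When loading from a file, the `na_values` parameter of `pd.read_csv` can declare such placeholders up front.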

---

> It’s important to understand these different types of missing data from a `statistics point of view`. The type of missing data will influence how you deal with filling in the missing values.
> - Sometimes you’ll simply want to delete those rows, other times you’ll replace them.
> - A very common way to replace missing values is to use the mode (for object columns) or the mean or median (for numeric columns).
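
A rough sketch of these replacement strategies on a toy DataFrame (the column names and values are made up for illustration): the numeric column is filled with its median and the text column with its mode.

```python
import pandas as pd

# Toy data, not from the TED dataset
df = pd.DataFrame({
    "duration": [10.0, 12.0, None, 50.0],
    "occupation": ["Writer", None, "Writer", "Artist"],
})

# Numeric column: the median is less sensitive to outliers than the mean
df["duration"] = df["duration"].fillna(df["duration"].median())

# Text column: mode() returns a Series, so take its first entry
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])

print(df)
```

Here the median (12.0) is used rather than the mean because the value 50.0 would pull the mean upward.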

---

> But these are weak approaches. Sometimes we need `domain knowledge` about the data and a `statistical study` to fill in the missing values.

In [None]:
# fill the missing values. In the general case, if the column is numeric
# we fill it with the mean; if it is an object column we fill it with the mode.
ted['speaker_occupation'] = ted.speaker_occupation.fillna(ted.speaker_occupation.mode()[0])

In [None]:
ted.isna().sum()

In [None]:
ted.describe()

# 📢 Which talks provoke the most online discussion?

In [None]:
ted.hvplot.hist(subplots=True, height=250, width=250, shared_axes=False, value_label='Rate').cols(4)

In [None]:
plt.figure(figsize=(12, 10))
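# numeric_only=True skips the string columns, which raise an error in pandas >= 2.0
sns.heatmap(ted.corr(numeric_only=True), annot=True)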

From the heatmap, the number of views correlates well with `languages` and `comments`.

In [None]:
ted.hvplot.hist(y='languages', height=350, width=450)

In [None]:
# sort by the number of first-level comments, though this is biased in favor of older talks
ted.sort_values('comments').tail()

## 📌 Limitations of this approach:
> 1. Sub-comments (nested replies) are not counted.
> 2. It ignores how long each talk has been online.

## 📏 To correct for this bias, one solution is to normalise comments by views.

In [None]:
# creating a new column 'comments_per_view'
ted['comments_per_view'] = ted.comments / ted.views

# interpretation: for every view of the same-sex marriage talk, there are 0.002 comments
ted.sort_values('comments_per_view').tail()

In [None]:
# make this more interpretable by inverting the calculation
ted['views_per_comment'] = ted.views / ted.comments

# interpretation: 1 out of every 450 viewers leaves a comment
ted.sort_values('views_per_comment').head()

In [None]:
ted.hvplot.line(y='views', height=350, width=550)

## 🧾 **Lessons:**

> 1. Consider the limitations and biases of your data when analyzing it
> 2. Make your results understandable

# 📊 Visualize the distribution of comments

In [None]:
# line plot is not appropriate here (use it to measure something over time)
ted.hvplot.hist(y='comments', bins=50, height=350, width=550)

In [None]:
# check how many talks have 1000 or more comments (the observations we exclude from the next plot)
ted[ted.comments >= 1000].shape

Only a small share of talks has 1000 or more comments, so filtering them out loses little data. This process is called excluding outliers.
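
To put a number on "a small amount", one option is to compute the share of rows above the cutoff. The sketch below uses made-up comment counts rather than the TED data. Only the 1000 threshold is taken from the cell above.

```python
import pandas as pd

# Toy comment counts, not from the TED dataset
comments = pd.Series([12, 85, 130, 240, 310, 975, 1500, 6404])
threshold = 1000  # same cutoff as in the cell above

is_outlier = comments >= threshold
n_removed = int(is_outlier.sum())          # number of rows above the cutoff
share_removed = float(is_outlier.mean())   # mean of booleans = fraction of True

print(n_removed, share_removed)
```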

In [None]:
ted[ted.comments < 1000].hvplot.hist(y='comments', bins=50, height=350, width=550)

## 🧾 **Lessons:**

> 1. Choose your plot type based on the question you are answering and the data type(s) you are working with
> 2. Use pandas one-liners to iterate through plots quickly
> 3. Try modifying the plot defaults
> 4. Creating plots involves decision-making

# 📉 Plot the number of talks that took place each year

In [None]:
# event column does not always include the year
ted.event.sample(10)

We can't rely on the `event` column, because not every event name includes a year.

In [None]:
# dataset documentation for film_date says "Unix timestamp of the filming"
ted.film_date.head()

In [None]:
# results don't look right
pd.to_datetime(ted.film_date).head()

In [None]:
# now the results look right
pd.to_datetime(ted.film_date, unit='s').head()
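
To see why `unit='s'` matters, here is a small sketch with two illustrative Unix timestamps. Without a unit, pandas interprets integers as nanoseconds since 1970-01-01, so every value lands within a few seconds of the epoch.

```python
import pandas as pd

# Illustrative Unix timestamps, in seconds
film_date = pd.Series([1140825600, 1340236800])

# Default unit is nanoseconds: both values end up in 1970
wrong = pd.to_datetime(film_date)

# Declaring the unit as seconds gives plausible filming dates
right = pd.to_datetime(film_date, unit="s")

print(wrong.dt.year.tolist())
print(right.dt.year.tolist())
```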

In [None]:
ted['film_datetime'] = pd.to_datetime(ted.film_date, unit='s')

# verify that event name matches film_datetime for a random sample
ted[['event', 'film_datetime']].sample(5)

In [None]:
# new column uses the datetime data type (this was an automatic conversion)
ted.dtypes

In [None]:
# datetime columns have convenient attributes under the dt namespace
ted.film_datetime.dt.year.head()

In [None]:
# count the number of talks each year using value_counts()
ted.film_datetime.dt.year.value_counts()

In [None]:
# points are plotted and connected in the order you give them to pandas
ted.film_datetime.dt.year.value_counts().hvplot.line(height=350, width=550)

In [None]:
# need to sort the index before plotting
ted.film_datetime.dt.year.value_counts().sort_index().hvplot.line(height=350, width=550)
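
The reason the first line plot zigzags is that `value_counts()` orders its result by count, not by year. A minimal sketch with made-up years shows the two orderings:

```python
import pandas as pd

# Toy years, not from the TED dataset
years = pd.Series([2012, 2010, 2012, 2011, 2012, 2010])

by_count = years.value_counts()   # most frequent year first
by_year = by_count.sort_index()   # chronological order, suitable for a line plot

print(by_count.index.tolist())
print(by_year.index.tolist())
```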

In [None]:
# we only have partial data for 2017
ted.film_datetime.max()

## 🧾 **Lessons:**

> 1. Read the documentation
> 2. Use the datetime data type for dates and times
> 3. Check your work as you go
> 4. Consider excluding data if it might not be relevant

# 🥇 What were the "best" events in TED history to attend?

In [None]:
def clean_event(event):
    # replace any event name that contains a year (1990 to 2017) with "TED_<year>"
    for year in range(1990, 2018):
        if str(year) in event:
            return f"TED_{year}"
    # no year found: keep the original event name
    return event
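
A quick way to check what this function does is to run it on a few names. The event names below are hypothetical and chosen only to illustrate the behavior. The function is repeated so the sketch runs on its own.

```python
def clean_event(event):
    """Map any event name containing a year from 1990 to 2017 to 'TED_<year>'."""
    for year in range(1990, 2018):
        if str(year) in event:
            return f"TED_{year}"
    return event

# Hypothetical event names, for illustration only
examples = ["TED2009", "TEDGlobal 2011", "TEDxExample 2012", "TEDxExample"]
cleaned = [clean_event(name) for name in examples]

print(cleaned)
```

Note the side effect: different event series held in the same year are merged into one `TED_<year>` label, while names without a year are left unchanged.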

In [None]:
print(ted.event.nunique())
print(ted.event.apply(clean_event).nunique())

In [None]:
ted['event'] = ted.event.apply(clean_event)

In [None]:
# count the number of talks (great if you value variety, but they may not be great talks)
ted.event.value_counts().head()

In [None]:
# use views as a proxy for "quality of talk"
ted.groupby('event').views.mean().head()

In [None]:
# find the largest values, but we don't know how many talks are being averaged
ted.groupby('event').views.mean().sort_values().tail()

In [None]:
# show the number of talks along with the mean (events with the highest means had only 1 or 2 talks)
ted.groupby('event').views.agg(['count', 'mean']).sort_values('mean').tail()
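
One possible way to guard against small sample sizes is to require a minimum number of talks before ranking by the mean. The sketch below uses a toy DataFrame, and the threshold `min_talks` is a hypothetical choice, not something taken from the talk.

```python
import pandas as pd

# Toy data, not from the TED dataset
talks = pd.DataFrame({
    "event": ["A", "A", "A", "B", "C", "C"],
    "views": [100, 200, 300, 5000, 400, 600],
})

stats = talks.groupby("event")["views"].agg(["count", "mean"])

# Event B has the highest mean, but it rests on a single talk
min_talks = 2  # hypothetical threshold
reliable = stats[stats["count"] >= min_talks].sort_values("mean", ascending=False)

print(reliable)
```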

In [None]:
# calculate the total views per event
ted.groupby('event').views.agg(['count', 'mean', 'sum']).sort_values('sum').tail()

In [None]:
ted.event.value_counts()[:20].hvplot.barh()

In [None]:
ted.groupby('event').views.mean().sort_values(ascending=False)[:20].hvplot.barh()

## 🧾 **Lessons:**

> 1. Think creatively about how you can use the data you have to answer your question
> 2. Watch out for small sample sizes

# 📤 Unpack the ratings data

In [None]:
# previously, users could tag talks on the TED website (funny, inspiring, confusing, etc.)
ted.ratings.head()

In [None]:
# two ways to examine the ratings data for the first talk
ted.loc[0, 'ratings']
ted.ratings[0]

In [None]:
# this is a string not a list
type(ted.ratings[0])

In [None]:
# convert this into something useful using Python's ast module (Abstract Syntax Tree)
import ast

# literal_eval() allows you to evaluate a string containing a Python literal or container
ast.literal_eval('[1, 2, 3]')

# if you have a string representation of something, you can retrieve what it actually represents
type(ast.literal_eval('[1, 2, 3]'))

In [None]:
# unpack the ratings data for the first talk
ast.literal_eval(ted.ratings[0])

In [None]:
# now we have a list (of dictionaries)
type(ast.literal_eval(ted.ratings[0]))

In [None]:
# define a function to convert an element in the ratings Series from string to list
def str_to_list(ratings_str):
    return ast.literal_eval(ratings_str)

In [None]:
# test the function
str_to_list(ted.ratings[0])

In [None]:
# Series apply method applies a function to every element in a Series and returns a Series
ted.ratings.apply(str_to_list).head()

In [None]:
# lambda is a shorter alternative
ted.ratings.apply(lambda x: ast.literal_eval(x)).head()

In [None]:
# an even shorter alternative is to apply the function directly (without lambda)
ted.ratings.apply(ast.literal_eval).head()
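
`ast.literal_eval` raises an exception when a string is not a valid Python literal. If some rows might be malformed, a defensive wrapper is one option. The helper name `parse_ratings` and the sample strings below are made up for this sketch.

```python
import ast

def parse_ratings(ratings_str):
    """Return the parsed list, or an empty list if the string is not a valid list literal."""
    try:
        parsed = ast.literal_eval(ratings_str)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []

good = "[{'id': 7, 'name': 'Funny', 'count': 10}]"  # made-up record
truncated = "[{'id': 7, 'name': 'Funny'"            # unbalanced brackets
not_literal = "print('hello')"                      # code, not a literal

print(parse_ratings(good))
print(parse_ratings(truncated))
print(parse_ratings(not_literal))
```

Unlike `eval`, `literal_eval` never executes the `print` call. It only accepts literals such as strings, numbers, lists and dictionaries.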

In [None]:
ted['ratings_list'] = ted.ratings.apply(lambda x: ast.literal_eval(x))

In [None]:
# check that the new Series looks as expected
ted.ratings_list[0]

In [None]:
# each element in the Series is a list
type(ted.ratings_list[0])

In [None]:
# data type of the new Series is object
ted.ratings_list.dtype

In [None]:
# object is not just for strings
ted.dtypes

## 🧾 **Lessons:**

> 1. Pay attention to data types in pandas
> 2. Use apply any time it is necessary

# 🧮 Count the total number of ratings received by each talk

**Bonus exercises:**

> - for each talk, calculate the percentage of ratings that were negative
> - for each talk, calculate the average number of ratings it received per day since it was published
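
As a starting point for the first exercise, here is one possible sketch. Which rating names count as "negative" is an assumption made here, and the sample record is made up. Adjust the set to your own reading of the data.

```python
# Assumption: these rating names are treated as negative
NEGATIVE = {"Confusing", "Unconvincing", "Obnoxious", "Longwinded"}

def negative_share(list_of_dicts):
    """Return the fraction of all ratings whose name is in NEGATIVE."""
    total = sum(d["count"] for d in list_of_dicts)
    if total == 0:
        return 0.0  # avoid dividing by zero for talks without ratings
    negative = sum(d["count"] for d in list_of_dicts if d["name"] in NEGATIVE)
    return negative / total

# Made-up record in the same shape as one element of ratings_list
sample = [
    {"id": 7, "name": "Funny", "count": 60},
    {"id": 2, "name": "Confusing", "count": 30},
    {"id": 26, "name": "Obnoxious", "count": 10},
]

print(negative_share(sample))
```

Once `ratings_list` exists, the function could be applied with `ted.ratings_list.apply(negative_share)`.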

In [None]:
ted.ratings_list[0]

In [None]:
# start by building a simple function
def get_num_ratings(list_of_dicts):
    return list_of_dicts[0]

In [None]:
# pass it a list, and it returns the first element in the list, which is a dictionary
get_num_ratings(ted.ratings_list[0])

In [None]:
# modify the function to return the vote count
def get_num_ratings(list_of_dicts):
    return list_of_dicts[0]['count']

In [None]:
# pass it a list, and it returns a value from the first dictionary in the list
get_num_ratings(ted.ratings_list[0])

In [None]:
# modify the function to get the sum of count
def get_num_ratings(list_of_dicts):
    num = 0
    for d in list_of_dicts:
        num = num + d['count']
    return num

In [None]:
# looks about right
get_num_ratings(ted.ratings_list[0])

In [None]:
# check with another record
ted.ratings_list[1]

In [None]:
# looks about right
get_num_ratings(ted.ratings_list[1])

In [None]:
# apply it to every element in the Series
ted.ratings_list.apply(get_num_ratings).head()

In [None]:
# another alternative is to use a generator expression
sum(d['count'] for d in ted.ratings_list[0])

In [None]:
# use lambda to apply this method
ted.ratings_list.apply(lambda x: sum(d['count'] for d in x)).head()

In [None]:
# another alternative is to use pd.DataFrame()
pd.DataFrame(ted.ratings_list[0])['count'].sum()

In [None]:
# use lambda to apply this method
ted.ratings_list.apply(lambda x: pd.DataFrame(x)['count'].sum()).head()

In [None]:
ted['num_ratings'] = ted.ratings_list.apply(get_num_ratings)

In [None]:
# do one more check
ted.num_ratings.describe()

## 🧾 **Lessons:**

> 1. Write your code in small chunks, and check your work as you go
> 2. Lambda is best for simple functions

# 🤹 Which occupations deliver the funniest TED talks on average?

**Bonus exercises:**

> - for each talk, calculate the most frequent rating
> - for each talk, clean the occupation data so that there's only one occupation per talk
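
Both exercises can be approached with small helper functions. The sketch below is one possibility. The helper names are hypothetical, the sample record is made up, and splitting occupations on `/`, `,` or `;` is an assumption about how multiple occupations are written.

```python
import re

def most_frequent_rating(list_of_dicts):
    """Return the name of the rating with the highest count (None for an empty list)."""
    if not list_of_dicts:
        return None
    return max(list_of_dicts, key=lambda d: d["count"])["name"]

def first_occupation(occupation):
    """Keep only the text before the first '/', ',' or ';' separator."""
    return re.split(r"[/,;]", occupation)[0].strip()

# Made-up record in the same shape as one element of ratings_list
sample = [
    {"name": "Funny", "count": 5},
    {"name": "Inspiring", "count": 12},
    {"name": "OK", "count": 3},
]

print(most_frequent_rating(sample))
print(first_occupation("Author/educator"))
```

Keeping only the first listed occupation is a simplification: it discards information, but it gives each talk exactly one label to group by.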

## ✔️ Step 1: Count the number of funny ratings

In [None]:
# "Funny" is not always the first dictionary in the list
ted.ratings_list.head()

In [None]:
# check ratings (not ratings_list) to see if "Funny" is always a rating type
ted.ratings.str.contains('Funny').value_counts()

In [None]:
# write a custom function
def get_funny_ratings(list_of_dicts):
    for d in list_of_dicts:
        if d['name'] == 'Funny':
            return d['count']

In [None]:
# examine a record in which "Funny" is not the first dictionary
ted.ratings_list[3]

In [None]:
# check that the function works
get_funny_ratings(ted.ratings_list[3])

In [None]:
# apply it to every element in the Series
ted['funny_ratings'] = ted.ratings_list.apply(get_funny_ratings)
ted.funny_ratings.head()

In [None]:
# check for missing values
ted.funny_ratings.isna().sum()
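
`get_funny_ratings` implicitly returns `None` when a talk has no "Funny" entry, which is why the missing-value check above is worthwhile. A possible variant, sketched here with the hypothetical helper name `get_rating_count` and made-up records, returns an explicit default instead.

```python
def get_rating_count(list_of_dicts, name, default=0):
    """Return the count for the given rating name, or `default` if it is absent."""
    return next((d["count"] for d in list_of_dicts if d["name"] == name), default)

# Made-up records in the same shape as elements of ratings_list
with_funny = [{"name": "Inspiring", "count": 12}, {"name": "Funny", "count": 5}]
without_funny = [{"name": "Inspiring", "count": 7}]

print(get_rating_count(with_funny, "Funny"))
print(get_rating_count(without_funny, "Funny"))
```

Whether 0 is the right default depends on the analysis: it treats "no Funny entry" as "no funny votes", which is an assumption worth stating.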

## ✔️ Step 2: Calculate the percentage of ratings that are funny

In [None]:
ted['funny_rate'] = ted.funny_ratings / ted.num_ratings

In [None]:
# "gut check" that this calculation makes sense by examining the occupations of the funniest talks
ted.sort_values('funny_rate').speaker_occupation.tail(20)

In [None]:
# examine the occupations of the least funny talks
ted.sort_values('funny_rate').speaker_occupation.head(20)

## ✔️ Step 3: Analyze the funny rate by occupation

In [None]:
# calculate the mean funny rate for each occupation
ted.groupby('speaker_occupation').funny_rate.mean().sort_values().tail()

In [None]:
# however, most of the occupations have a sample size of 1
ted.speaker_occupation.describe()

## ✔️ Step 4: Focus on occupations that are well-represented in the data

In [None]:
# count how many times each occupation appears
ted.speaker_occupation.value_counts()

In [None]:
# value_counts() outputs a pandas Series, thus we can use pandas to manipulate the output
occupation_counts = ted.speaker_occupation.value_counts()
type(occupation_counts)

In [None]:
# show occupations which appear at least 10 times
occupation_counts[occupation_counts >= 10].hvplot.barh()

In [None]:
# save the index of the occupations that appear at least 5 times
top_occupations = occupation_counts[occupation_counts >= 5].index
top_occupations

## ✔️ Step 5: Re-analyze the funny rate by occupation (for top occupations only)

In [None]:
# filter DataFrame to include only those occupations
ted_top_occupations = ted[ted.speaker_occupation.isin(top_occupations)]
ted_top_occupations.shape

In [None]:
# redo the previous groupby; tail(20) of the ascending sort keeps the 20 funniest occupations
ted_top_occupations.groupby('speaker_occupation').funny_rate.mean().sort_values().tail(20).hvplot.barh()

## 🧾 **Lessons:**

> 1. Check your assumptions about your data
> 2. Check whether your results are reasonable
> 3. Take advantage of the fact that pandas operations often output a DataFrame or a Series
> 4. Watch out for small sample sizes
> 5. Consider the impact of missing data
> 6. Data scientists are hilarious