In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval

# Cleaning the Movies_Metadata csv

If you follow the comments as we clean the data, you will notice a few minor changes from Banik's code, primarily that we chain some of the functions he calls separately. Chaining isn't necessary, but it does simplify the code, so we include it as another option. It's one of the nifty features of pandas.
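
To make the chaining idea concrete, here is a minimal sketch on a made-up toy Series (the values and the name `raw` are illustrative and not taken from the movies data). It writes the same conversion step by step and as one chained expression; both should give the same result.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from movies_metadata.csv)
raw = pd.Series(["100", "2500", "not a number", None])

def to_float(x):
    """Convert x to a float, or NaN if it cannot be converted."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

# Step by step: one intermediate variable per operation
step1 = raw.apply(to_float)
unchained = step1.astype("float")

# Chained: apply returns a Series, so astype can follow it directly
chained = raw.apply(to_float).astype("float")

# Series.equals treats NaN values in the same positions as equal
print(chained.equals(unchained))
```

Chaining works here because each method returns a new Series, so there is no need to name the intermediate result.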

In [3]:
#read in the data
df = pd.read_csv('../data/movies_metadata.csv')

#print the shape of the dataframe
print(f"The shape is {df.shape}")

#get the column info
df.info()

#####################
# Helper Functions
#####################
#converts ints & string representations of numbers to floats; anything else becomes NaN
def to_float(x):
    try:
        x = float(x)
    except (TypeError, ValueError):
        x = np.nan
    return x

#Helper function to convert NaT to 0 and all other years to integers.
def convert_int(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return 0

#we can run both apply and astype in one line by chaining them
df['budget'] = df['budget'].apply(to_float).astype('float')

#Convert release_date into pandas datetime format
df['release_date'] = pd.to_datetime(df['release_date'],errors='coerce')

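#Extract year from the datetime and convert to integer. (Again, we're chaining functions)
#NaN never compares equal to anything, so test for missing dates with pd.notna rather than `x != np.nan`
df['year'] = df['release_date'].apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan).apply(convert_int)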

#convert vote_count to integer (missing or non-numeric values become 0)
df['vote_count'] = df['vote_count'].apply(convert_int)

#Convert all NaN into stringified empty lists and apply literal eval and convert to list by chaining functions
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

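#filter to just the relevant columns
df = df[['id','title','budget', 'genres', 'overview', 'revenue', 'runtime', 'vote_average', 'vote_count', 'year']]

#save the cleaned data, then preview the first rows (head() only displays when it is the last line of the cell)
df.to_csv('movies_metadata_clean.csv', index=False)
df.head()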

The shape is (5000, 24)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  5000 non-null   bool   
 1   belongs_to_collection  825 non-null    object 
 2   budget                 5000 non-null   int64  
 3   genres                 5000 non-null   object 
 4   homepage               311 non-null    object 
 5   id                     5000 non-null   int64  
 6   imdb_id                5000 non-null   object 
 7   original_language      5000 non-null   object 
 8   original_title         5000 non-null   object 
 9   overview               4979 non-null   object 
 10  popularity             5000 non-null   float64
 11  poster_path            4979 non-null   object 
 12  production_companies   5000 non-null   object 
 13  production_countries   5000 non-null   object 
 14  release_date           4996 non-

# Cleaning the Ted Talks

This is straight out of the book. `apply` is a handy pandas method that lets you run a function on each element of a Series, or on each row or column of a DataFrame. You're seeing examples here of using a lambda (inline) function as well as a separately defined function (convert_int).

The lambda function grabs the year from the published date by converting the date to a string and splitting it on the '-' character. This creates a list, and we take its first item, which is the year when the date is valid. A missing date has no year to extract, so convert_int turns that entry into 0.
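
The sketch below walks through that lambda on two hand-made values, one valid timestamp and one unparseable string (both are illustrative, not rows from the TED data). One caveat worth knowing: NaN never compares equal to anything, including itself, so a check like `x != np.nan` is always true, and `pd.notna` is the reliable test for missing values. The `.dt.year` version at the end is an alternative we are assuming you might prefer, not the book's approach.

```python
import numpy as np
import pandas as pd

# Illustrative values: one valid date and one that cannot be parsed
dates = pd.to_datetime(pd.Series(["2006-06-27 00:11:00", "not a date"]),
                       errors="coerce")

# str() of a Timestamp looks like '2006-06-27 00:11:00';
# splitting on '-' gives ['2006', '06', '27 00:11:00'], so item 0 is the year
parts = str(dates[0]).split("-")
print(parts[0])

# NaN is never equal to anything, so this comparison cannot detect missing values
print(np.nan != np.nan)      # True
print(pd.notna(dates[1]))    # False: the unparseable date became NaT

def convert_int(x):
    """Return x as an int, or 0 when x is missing or not a number."""
    try:
        return int(x)
    except (TypeError, ValueError):
        return 0

# Same pattern as the notebook: take the year as a string, then convert to int
years = dates.apply(
    lambda x: str(x).split("-")[0] if pd.notna(x) else np.nan
).apply(convert_int)

# Alternative (an assumption, not the book's method): use the .dt accessor
years_dt = dates.dt.year.fillna(0).astype(int)
```

Both routes give the year for the valid date and 0 for the missing one; the `.dt` accessor simply avoids the round trip through strings.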

In [5]:
import pandas as pd
import numpy as np

ted = pd.read_csv('../data/ted-talks/ted_main.csv')
#Convert published_date (stored as Unix seconds) into pandas datetime format
ted['published_date'] = pd.to_datetime(ted['published_date'],
                                       errors='coerce', unit='s')

#see what the new date looks like
print("This is what the datetime string looks like:")
display(ted['published_date'].head())


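#Extract year from the datetime
#NaN never compares equal to anything, so test for missing dates with pd.notna rather than `x != np.nan`
ted['published_year'] = ted['published_date'].apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan)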

#Helper function to convert NaT to 0 and all other years to integers.
def convert_int(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return 0

#Apply convert_int to the year feature
ted['published_year'] = ted['published_year'].apply(convert_int)


ted.head()

This is what the datetime string looks like:


0   2006-06-27 00:11:00
1   2006-06-27 00:11:00
2   2006-06-27 00:11:00
3   2006-06-27 00:11:00
4   2006-06-27 20:38:00
Name: published_date, dtype: datetime64[ns]

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,published_year
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,2006-06-27 00:11:00,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,2006
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,2006-06-27 00:11:00,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2006
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,2006-06-27 00:11:00,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,2006
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,2006-06-27 00:11:00,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,2006
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,2006-06-27 20:38:00,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,2006


This is also straight from the book. When we use the literal_eval function on the ratings column, each entry becomes a list of dictionaries that we can manipulate. The "name" key of each dictionary holds the part of the rating that we care about. We convert these words to lower case and collect them in a list. If there were no ratings, we end up with an empty list.
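
As a small illustration of that step, the sketch below parses one hand-written string shaped like a `ratings` cell (it is shortened, and the second entry's id and count are made up for the example).

```python
from ast import literal_eval

# A shortened, hand-written string in the same shape as one 'ratings' cell
# (the second entry's id and count are illustrative values)
raw = ("[{'id': 7, 'name': 'Funny', 'count': 19645}, "
       "{'id': 1, 'name': 'Beautiful', 'count': 100}]")

# literal_eval parses the string into Python objects:
# here, a list whose items are dictionaries
parsed = literal_eval(raw)
print(type(parsed).__name__, type(parsed[0]).__name__)

# Keep only the lower-cased 'name' of each rating
names = [item["name"].lower() for item in parsed]
print(names)

# A missing value is first replaced by the string '[]',
# which parses to an empty list
empty = literal_eval("[]")
print(empty)
```

Unlike `eval`, `literal_eval` only accepts Python literals such as strings, numbers, lists, and dictionaries, which makes it a safer choice for parsing data read from a file.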

In [6]:
#Import the literal_eval function from ast
from ast import literal_eval

#Convert all NaN into stringified empty lists
ted['ratings'] = ted['ratings'].fillna('[]')

#Apply literal_eval to convert the stringified lists into list objects
ted['ratings'] = ted['ratings'].apply(literal_eval)


#Convert list of dictionaries to a list of strings
ted['ratings'] = ted['ratings'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else [])

ted.to_csv('ted_clean.csv', index=False)
#See how 'ratings' has changed?
ted.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,published_year
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,2006-06-27 00:11:00,"[funny, beautiful, ingenious, courageous, long...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,2006
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,2006-06-27 00:11:00,"[funny, courageous, confusing, beautiful, unco...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2006
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,2006-06-27 00:11:00,"[funny, courageous, ingenious, beautiful, unco...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,2006
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,2006-06-27 00:11:00,"[courageous, beautiful, confusing, funny, inge...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,2006
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,2006-06-27 20:38:00,"[ingenious, funny, beautiful, courageous, long...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,2006
