# Import the required libraries

In [1]:
import pandas as pd
import math
import re

Note that the file path will be different for you.

The dataset can be downloaded at https://data.world/owentemple/ted-talks-complete-list. I removed a the number of additional features in the dataset with the transcription since I won't be using them with TF-IDF. See the column headings below to see which ones I have kept.

# Load the data and remove null values

In [2]:
def remove_null_vals(df):
    df = df.dropna()
    return df

def load_data(path):
    # Will infer the headings of each column
    dataframe = pd.read_csv(path)
    return dataframe

In [3]:
PATH = 'C:\\Users\\maxel\\OneDrive\\Search_Engine\\Version_1\\TED_Talks_dataset.csv'
df = load_data(PATH)
df = remove_null_vals(df)

df.head()

Unnamed: 0,id,speaker,headline,URL,description,year_filmed,duration,views_as_of_06162017,tags,transcript
0,1,Al Gore,Averting the climate crisis,http://www.ted.com/talks/view/id/1,With the same humor and humanity he exuded in ...,2006,00:16:17,3177001.0,"cars,alternative energy,culture,politics,scien...","0:14\r\r\rThank you so much, Chris.\rAnd it's ..."
1,2,Amy Smith,Simple designs to save a life,http://www.ted.com/talks/view/id/2,Fumes from indoor cooking fires kill more than...,2006,00:15:06,1379328.0,"MacArthur grant,simplicity,industrial design,a...","0:11\r\r\rIn terms of invention,\rI'd like to ..."
2,3,Ashraf Ghani,How to rebuild a broken state,http://www.ted.com/talks/view/id/3,Ashraf Ghani's passionate and powerful 10-minu...,2005,00:18:45,790536.0,"corruption,poverty,economics,investment,milita...","0:12\r\r\rA public, Dewey long ago observed,\r..."
3,4,Burt Rutan,The real future of space exploration,http://www.ted.com/talks/view/id/4,"In this passionate talk, legendary spacecraft ...",2006,00:19:37,1985119.0,"aircraft,flight,industrial design,NASA,rocket ...","0:11\r\r\rI want to start off by saying, Houst..."
4,5,Chris Bangle,Great cars are great art,http://www.ted.com/talks/view/id/5,American designer Chris Bangle explains his ph...,2002,00:20:04,859487.0,"cars,industrial design,transportation,inventio...","0:12\r\r\rWhat I want to talk about is, as bac..."


The columns that we will consider is the decription, tags, heading and transcription. However we must account for the heading, description and tags to carry more weight per word when processing the search query since they give a better description of what the talk will be about.

To reduce the complexity of the search, we shall make a new feature which combines the description, tags and the title to make a more highly weighted description of the talk compared to the transcripted (in terms of description the individual words carry). We shall call this feature exposition.

In [4]:
def combine_tags_description_heading(df):
    s = ' '
    df['exposition'] = [row['headline'] +s+ row['description'] +s+ row['tags'] for index, row in df.iterrows()]
    
    return df

In [5]:
df = combine_tags_description_heading(df)
df['exposition'][0]

'Averting the climate crisis With the same humor and humanity he exuded in "An Inconvenient Truth," Al Gore spells out 15 ways that individuals can address climate change immediately, from buying a hybrid to inventing a new, hotter brand name for global warming. cars,alternative energy,culture,politics,science,climate change,environment,sustainability,global issues,technology'

We have added a space when combining the headline, description and tags to ensure the words are spilt when tokenizing

# Pre-processing the data

The data must be pre-processed in order to ensure we achieve maximum accuracy with the search. We shall conduct stemming and lemmetisation on the the transcripts, decription and heading of the talk.

In [6]:
from nltk.corpus import stopwords

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
def remove_stop_words(text):
    text = re.split(r'[;,.\s]\s*', text)
    stop_words = stopwords.words('english')
    index_to_remove = []
    
    for i in range(len(text)):
        
        # Make the word lower case
        text[i] = text[i].lower()
        
        # Remove the word if the word is a stop word
        if text[i] in stop_words:
            index_to_remove.append(i)
            continue

    for index in index_to_remove[::-1]:
        text.pop(index)
    
    return text

def remove_symbols(text):
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n\r0123456789"
    for symbol in symbols:
        text = text.replace(symbol, ' ')
        
    return text

def remove_apostrophes(text):
    return [word.replace('`', '') for word in text]

We must be careful about certain words such as the conversion of U.S to us or apostrophes. "Won't" will be converted to "wont" so won't be removed with the stop words. We must take out the apstrophes separately afterwards.

We are also going to remove numbers since the transcript files contain timings which are not relavent to the search.

# Stemming and Lemmatization

Stemming converts words to its stem. For example running and ran is converted to run based on some set of rules. This is what we want since it does really make a difference which tense the word is in for our search query. We are going to use a library for this because coding a stemmer seems like it would be boring and not much would be gained from it. The library will be Porter-Stemmer which identifies and removes the suffix or affix of a word (the attachments on the words).

In [8]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'), stemmer.stem('played'), stemmer.stem('undecided')) 

run play undecid


Lemmetisation is reducing a word to its root synonym. Unlike stemming, lemmatisation will produce a word that is in a set dictionary. We shall use stemming for simplicity.

In [9]:
def stemming(text):
    stemmer = PorterStemmer()
    
    for i in range(len(text)):
        text[i] = stemmer.stem(text[i])
    
    return text

Putting this all together in a function we get

In [10]:
def preprocess(df):
    for target_text in ['exposition', 'transcript']:
        for index, row in df.iterrows():
            text = row[target_text]
            
            text = remove_symbols(text)
            text = remove_stop_words(text)
            text = remove_apostrophes(text)
            text = stemming(text)
            
            df[target_text][index] = text
    
    return df        

May take a while to run depending on the dataset size

In [11]:
df = preprocess(df)
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


Unnamed: 0,id,speaker,headline,URL,description,year_filmed,duration,views_as_of_06162017,tags,transcript,exposition
0,1,Al Gore,Averting the climate crisis,http://www.ted.com/talks/view/id/1,With the same humor and humanity he exuded in ...,2006,00:16:17,3177001.0,"cars,alternative energy,culture,politics,scien...","[, thank, much, chri, truli, great, honor, opp...","[avert, climat, crisi, humor, human, exud, inc..."
1,2,Amy Smith,Simple designs to save a life,http://www.ted.com/talks/view/id/2,Fumes from indoor cooking fires kill more than...,2006,00:15:06,1379328.0,"MacArthur grant,simplicity,industrial design,a...","[, term, invent, i'd, like, tell, tale, one, f...","[simpl, design, save, life, fume, indoor, cook..."
2,3,Ashraf Ghani,How to rebuild a broken state,http://www.ted.com/talks/view/id/3,Ashraf Ghani's passionate and powerful 10-minu...,2005,00:18:45,790536.0,"corruption,poverty,economics,investment,milita...","[, public, dewey, long, ago, observ, constitut...","[rebuild, broken, state, ashraf, ghani', passi..."
3,4,Burt Rutan,The real future of space exploration,http://www.ted.com/talks/view/id/4,"In this passionate talk, legendary spacecraft ...",2006,00:19:37,1985119.0,"aircraft,flight,industrial design,NASA,rocket ...","[, want, start, say, houston, problem, we'r, e...","[real, futur, space, explor, passion, talk, le..."
4,5,Chris Bangle,Great cars are great art,http://www.ted.com/talks/view/id/5,American designer Chris Bangle explains his ph...,2002,00:20:04,859487.0,"cars,industrial design,transportation,inventio...","[, want, talk, background, idea, car, art, act...","[great, car, great, art, american, design, chr..."


We don't want to run this every time we want to calculate TF-IDF so we will save if to a csv file after converting the exposition and transcription into lists.

In [12]:
df['exposition'] = [','.join(row['exposition']) for i, row in df.iterrows()]

df['transcript'] = [','.join(row['transcript']) for i, row in df.iterrows()]

In [13]:
df.to_csv('processed_TED_Talks.csv', index = False)

PermissionError: [Errno 13] Permission denied: 'processed_TED_Talks.csv'