## Preprocessing Data
The TED Talk data comes as two separate CSV files, which we'll need to combine.

In [38]:
import pandas as pd
ted_main = pd.read_csv('./data/ted_main.csv')
transcripts = pd.read_csv('./data/transcripts.csv')

print(ted_main.columns)
print('===========================================================')
print(transcripts.columns)

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views'],
      dtype='object')
Index(['transcript', 'url'], dtype='object')


"ted_main" contains the features relevant to our summarizer, "description" and "main_speaker", while "transcripts" contains "transcript". Both CSV files contain "url", which we can join on.

In [39]:
merged = pd.merge(ted_main, transcripts, on="url")
merged = merged.iloc[:, 1:]  # Drops the first column; per the column listing above, that is 'comments'.
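
The default inner join silently drops any talk whose URL has no match in the other file. As a minimal sketch on toy frames (the values and the names `talks`, `texts`, and `unmatched` are hypothetical, not taken from the dataset), an outer join with `indicator=True` shows which rows would be lost:

```python
import pandas as pd

# Toy stand-ins for ted_main and transcripts (hypothetical values).
talks = pd.DataFrame({
    "title": ["Talk A", "Talk B", "Talk C"],
    "url": ["u/a", "u/b", "u/c"],
})
texts = pd.DataFrame({
    "transcript": ["text a", "text b"],
    "url": ["u/a", "u/b"],
})

# Inner join (the pd.merge default) keeps only URLs present in both frames.
inner = pd.merge(talks, texts, on="url")

# An outer join with indicator=True adds a "_merge" column that records
# whether each row came from the left frame, the right frame, or both.
outer = pd.merge(talks, texts, on="url", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner))                  # number of talks that have a transcript
print(unmatched["url"].tolist())   # URLs that an inner join would drop
```

Running the same check on the real frames would show how many talks have no transcript before committing to the inner join.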

In [40]:
merged.head(5)

Unnamed: 0,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,transcript
0,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,Good morning. How are you?(Laughter)It's been ...
1,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,"Thank you so much, Chris. And it's truly a gre..."
2,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,"(Music: ""The Sound of Silence,"" Simon & Garfun..."
3,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,If you're here today — and I'm very happy that...
4,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,"About 10 years ago, I took on the task to teac..."


There are a few columns we don't need for the purposes of our project. However, we'll keep them for now in case some of these features help the notes/summary generator in terms of ROUGE-1/ROUGE-2 scores. For now, we'll remove features specific to TED Talk transcripts, such as the use of parentheses to denote sound, e.g. "(applause)", "(Music: Song Title, Artist)", "(Laughter)".

In [42]:
import re

def strip_parens(transcript):
    """
    Replace "audio" parentheses such as "(Laughter)" with a single space.

    These markers appear both with and without surrounding spaces, so we
    substitute a space here and collapse repeated whitespace in a later step.
    """
    pattern = r'\((.*?)\)'
    return re.sub(pattern, ' ', transcript)

merged['transcript'].head(5)

0    Good morning. How are you?(Laughter)It's been ...
1    Thank you so much, Chris. And it's truly a gre...
2    (Music: "The Sound of Silence," Simon & Garfun...
3    If you're here today — and I'm very happy that...
4    About 10 years ago, I took on the task to teac...
Name: transcript, dtype: object

We can see audio markers in record 0, "How are you?(Laughter)It's been...", and in record 2, "(Music: "The Sound of Silence," Simon & Garfun...".

In [43]:
merged['transcript'] = merged['transcript'].apply(strip_parens)
merged['transcript'].head(5)

0    Good morning. How are you? It's been great, ha...
1    Thank you so much, Chris. And it's truly a gre...
2     Hello voice mail, my old friend. I've called ...
3    If you're here today — and I'm very happy that...
4    About 10 years ago, I took on the task to teac...
Name: transcript, dtype: object

These "audio parentheses" are often stacked next to each other, so replacing each one with a space leaves runs of extra whitespace between tokens (note the leading space in record 2 above). We'll collapse those as well.

In [44]:
def condense_ws(transcript):
    pattern = r'\s+'
    # strip() removes any whitespace at the beginning or end
    return re.sub(pattern, ' ', transcript.strip())

merged['transcript'] = merged['transcript'].apply(condense_ws)

In [45]:
merged['transcript'].head(5)

0    Good morning. How are you? It's been great, ha...
1    Thank you so much, Chris. And it's truly a gre...
2    Hello voice mail, my old friend. I've called f...
3    If you're here today — and I'm very happy that...
4    About 10 years ago, I took on the task to teac...
Name: transcript, dtype: object
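
To make the combined effect of the two steps concrete, here is a small self-contained sketch that applies the same two regular expressions to a made-up sentence (the `sample` string is hypothetical, not taken from the dataset):

```python
import re

def strip_parens(transcript):
    # Replace every parenthesized span with a single space.
    return re.sub(r'\((.*?)\)', ' ', transcript)

def condense_ws(transcript):
    # Trim the ends, then collapse any run of whitespace into one space.
    return re.sub(r'\s+', ' ', transcript.strip())

sample = "How are you?(Laughter)It's been great. (Applause) (Music) Thank you."
cleaned = condense_ws(strip_parens(sample))
print(cleaned)
```

Note that the pattern removes every parenthesized span, so an ordinary aside written in parentheses by the speaker would be removed along with the audio markers.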

Next we need to do some text cleaning: map contractions to their expanded forms, remove possessive endings (e.g. 's), and strip punctuation and special characters. Some of the transcripts have been translated from another language and introduce unusual special characters. We'll also remove stop words.

There are a few contraction mappings available, either through packages or on GitHub as individual Python files.

If the package is not installed, run `!pip3 install contractions` in the notebook, or run the same command without the "!" in your terminal.

In [46]:
#!pip3 install contractions #uncomment if needed
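
As an alternative to the package, a standalone mapping file of the kind mentioned above can be sketched with a dictionary and one regular expression. The mapping below is a deliberately tiny, hypothetical subset, and the names `CONTRACTION_MAP` and `expand_contractions` are illustrative rather than part of any library:

```python
import re

# Hypothetical, deliberately tiny mapping; a real file would list many more.
CONTRACTION_MAP = {
    "i've": "I have",
    "i'm": "I am",
    "you're": "you are",
    "don't": "do not",
    "can't": "cannot",
}

# One alternation over all keys, matched case-insensitively on word boundaries.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Look up the lower-cased match so "I've" and "i've" share one entry.
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

expanded = expand_contractions("I've heard you're busy, so don't worry.")
print(expanded)
```

This sketch does not restore capitalization at the start of a sentence and does not handle curly apostrophes. It also leaves out ambiguous forms such as "it's", which can mean "it is" or "it has".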

#### Mapping Contractions

In [47]:
import contractions
def map_contractions(transcript):
    words = transcript.split()
    expanded_words = map(contractions.fix, words)
    return ' '.join(expanded_words)

merged['expanded_transcript']  = merged['transcript'].apply(map_contractions)
merged['expanded_transcript'].head()

0    Good morning. How are you? it is been great, h...
1    Thank you so much, Chris. And it is truly a gr...
2    Hello voice mail, my old friend. I have called...
3    If you are here today — and I am very happy th...
4    About 10 years ago, I took on the task to teac...
Name: expanded_transcript, dtype: object

In [48]:
merged['transcript'].head(5)

0    Good morning. How are you? It's been great, ha...
1    Thank you so much, Chris. And it's truly a gre...
2    Hello voice mail, my old friend. I've called f...
3    If you're here today — and I'm very happy that...
4    About 10 years ago, I took on the task to teac...
Name: transcript, dtype: object

We can see that the contractions have been expanded, although ambiguous ones can be expanded incorrectly: "It's been" in record 0 becomes "it is been". When the package expands words like "I've" or "I'd", it restores the upper-case "I", so we have held off on converting all characters to lower case until now. That is the next step.

In [53]:
merged['expanded_transcript'] = merged['expanded_transcript'].str.lower()
merged['expanded_transcript'].head(5)

0    good morning. how are you? it is been great, h...
1    thank you so much, chris. and it is truly a gr...
2    hello voice mail, my old friend. i have called...
3    if you are here today — and i am very happy th...
4    about 10 years ago, i took on the task to teac...
Name: expanded_transcript, dtype: object

#### Removing stop words

In [58]:
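from nltk.corpus import stopwords  # requires nltk.download('stopwords') on first use

stop = stopwords.words('english')
# Build the stop-word pattern once rather than on every call.
stop_pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, stop))))

def clean_transcript(transcript):
    transcript = re.sub(r"'s\b", "", transcript)        # remove possessive 's
    transcript = stop_pattern.sub('', transcript)       # remove stop words
    transcript = re.sub(r'[^a-zA-Z]', ' ', transcript)  # keep letters only
    return re.sub(r'\s+', ' ', transcript).strip()      # collapse whitespace

merged['processed_transcript'] = merged['expanded_transcript'].apply(clean_transcript)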
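
To see what each substitution contributes, here is a self-contained sketch of the same cleaning steps on a single made-up sentence. It uses a short hard-coded stop-word list (a hypothetical subset, so the example runs without downloading the NLTK corpus):

```python
import re

STOP = ["the", "is", "a", "of", "and", "it"]  # hypothetical subset of stop words
STOP_PATTERN = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, STOP))))

def clean_transcript(transcript):
    transcript = re.sub(r"'s\b", "", transcript)        # "speaker's" -> "speaker"
    transcript = STOP_PATTERN.sub('', transcript)       # drop stop words
    transcript = re.sub(r'[^a-zA-Z]', ' ', transcript)  # digits and punctuation -> space
    return re.sub(r'\s+', ' ', transcript).strip()      # collapse whitespace

result = clean_transcript("the speaker's idea is simple, and it changed 2 million lives.")
print(result)
```

Because the third substitution keeps only ASCII letters, numbers and accented characters are removed as well. Sentence boundaries are also lost, which matters if a later step needs to score individual sentences.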

Now we'll save the data as a pickle file to preserve the exact state of the dataframe.

In [59]:
import pickle

# The with-statement closes the file even if pickling raises an error.
with open('./data/processed_ted', 'wb') as outfile:
    pickle.dump(merged, outfile)

# Loading Data Back In

In [None]:
import pickle

# Load the same file that was written in the save step.
with open('./data/processed_ted', 'rb') as infile:
    merged = pickle.load(infile)
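
pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same save and load steps. The following is a minimal round-trip sketch on a toy frame, written to a temporary directory so it does not touch `./data`. The file name and values are hypothetical:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"url": ["u/a", "u/b"], "views": [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_ted.pkl")  # hypothetical file name
    df.to_pickle(path)               # save
    restored = pd.read_pickle(path)  # load

# equals() compares values, shape, and dtypes.
print(restored.equals(df))
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that you created yourself or otherwise trust.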

### Preliminary Results
It's difficult to show quantitative preliminary results at this stage, but it may be useful to show the first steps toward our notes generator using TED Talk transcripts. For this deliverable, we'll calculate a weighted sum for each sentence after cleaning the data, then use a word cloud as an initial visualization.

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet') on first use

lemmatizer = WordNetLemmatizer()

def lemmatize(transcript):
    """Lemmatize each whitespace-separated token (treated as a noun by default)."""
    words = transcript.split()
    return ' '.join(map(lemmatizer.lemmatize, words))

merged['lemmatized_transcript'] = merged['transcript'].apply(lemmatize)

In [None]:
merged['lemmatized_transcript'].head(5)
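
The sentence-weighting step mentioned above is not implemented in this notebook yet. One common frequency-based approach, shown here only as an assumption about what that step could look like, scores each sentence by summing the normalized frequencies of its non-stop-word tokens. The stop-word set, the sample text, and the name `score_sentences` are all hypothetical:

```python
import re
from collections import Counter

STOP = {"the", "is", "a", "of", "and", "to", "in", "it"}  # hypothetical subset

def score_sentences(text):
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Lower-case, keep alphabetic tokens, and drop stop words.
    tokenized = [
        [w for w in re.findall(r'[a-z]+', s.lower()) if w not in STOP]
        for s in sentences
    ]
    # Word frequencies over the whole text, normalized by the most frequent word.
    freq = Counter(w for words in tokenized for w in words)
    top = max(freq.values()) if freq else 1
    scores = [sum(freq[w] / top for w in words) for words in tokenized]
    return list(zip(sentences, scores))

text = ("Schools teach children. "
        "Creativity matters to children in schools. "
        "The weather is nice.")
ranked = sorted(score_sentences(text), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # highest-scoring sentence
```

A plain sum favors longer sentences, so dividing by sentence length or capping the number of counted words are typical adjustments. Scoring also needs the punctuated transcript, since the fully cleaned text no longer has sentence boundaries.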

# Loading the Gigaword TensorFlow Dataset

In [None]:
import tensorflow_datasets as tfds
tfds.list_builders()

The one we'll be focusing on is Gigaword, a large corpus of documents paired with corresponding summaries. We hope to use this dataset to train our model, then fine-tune it to capture the important sentiment present in TED Talk transcripts.

In [None]:
train_gw, test_gw = tfds.load('gigaword', split=['train[:10%]', 'test[:10%]']) # 10% from training and 10% from testing

In [None]:
# Print the source document of the first five training examples.
for example in train_gw.take(5):
    print(example['document'])