# NLP on TED Talk transcripts

Slides and code are located at

https://github.com/1fmusic/jean_bartik_computing_symposium_rankin.git


# Explore, Clean and Pre-process text
In this notebook we will:

1. Clean
2. Tokenize
3. Stem/lemmatize
4. Normalize (remove stopwords, unwanted characters, punctuation, lowercase)


### Create a new conda environment with the correct packages 

To create a new environment in (Ana)conda, which you should do for each project so you don't break stuff, download the `environment.yml` file and follow these directions.

Open a conda prompt (Windows) or a terminal window (Linux/Mac):<br>
            `$ cd ~/Documents/path_where_i_put_the_yml_file/`<br>
            `$ conda env create -f environment.yml`<br>

Activate the environment (the name is in the yml file)<br>
            `$ conda activate jbcs2020`<br>
            `$ jupyter notebook` 

Then click on the Jupyter notebook titled `ted_clean_jbcs.ipynb`.

# Install and load libraries

In [None]:
import nltk
import re
import pickle
import os
import pandas as pd
import numpy as np


### ONLY the first time you use the environment, download these packages from NLTK

In [None]:
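nltk.download('wordnet')    # used by the WordNet lemmatizer
nltk.download('stopwords')  # used for the English stop word list
nltk.download('punkt')      # used by word_tokenize
# Note: NLTK 3.9 and later look for 'punkt_tab' instead of 'punkt';
# if word_tokenize raises a LookupError, run nltk.download('punkt_tab') as well.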

# Import Data
We import the CSV of transcripts and URLs into a pandas DataFrame.

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

https://chrisalbon.com/

In [None]:
talks = pd.read_csv('./data/ted_trans.csv', encoding = "UTF-8")  

In [None]:
# print the first 5 rows using pandas 'head()' method
talks.head(5)

Keep only the transcript column 

In [None]:
talks = talks.loc[:,'transcript']

TODO: print portions from 3 different transcripts (**talks**)

In [None]:
talks[0][:521]

In [None]:
##TODO print talk 1

In [None]:
##TODO print talk number 15

In [None]:
##TODO print another talk

#### Number of transcripts you want to analyze (also creates a range of index numbers for iteration)

In [None]:
fileids = range(0,len(talks))
fileids

# Tokenize (split) into words
Typically, you would go straight to word tokenization if you are planning to do topic modeling. There are MANY, MANY ways to tokenize text into words. I will show just a few, but feel free to explore the possibilities.

## Method 1
`wordpunct_tokenize` from NLTK splits the text into words and punctuation as separate tokens, which makes the punctuation easy to remove.

In [None]:
tokenized_talks = [nltk.wordpunct_tokenize(talks[fileid]) \
             for fileid in fileids]

# view a few tokens from the first talk
print('\n-----\n'.join(nltk.wordpunct_tokenize(talks[0][500:560])))
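To see why punctuation ends up as separate tokens, it helps to know that NLTK's `WordPunctTokenizer` is built on a single regular expression. The sketch below reproduces that behavior using only the standard library. The function name `wordpunct_sketch` and the sample sentence are made up for illustration, and the pattern is assumed to match the one NLTK uses.

```python
import re

# Assumed to mirror NLTK's WordPunctTokenizer pattern:
# a run of word characters, OR a run of non-word, non-space characters
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Split text into alternating word and punctuation tokens."""
    return WORDPUNCT_PATTERN.findall(text)

sample = "Don't stop, it's fine..."
tokens = wordpunct_sketch(sample)
print(tokens)
```

Notice that the apostrophe splits contractions into three tokens (`Don`, `'`, `t`), which is one reason to compare this method against other tokenizers.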

## Method 2
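`word_tokenize` from NLTK keeps contractions in more meaningful pieces (e.g. `don't` becomes `do` and `n't`) and relies on the `punkt` data downloaded above.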

In [None]:
doc_words_word_tok = [nltk.word_tokenize(talks[fileid]) \
             for fileid in fileids]

print('\n-----\n'.join(nltk.word_tokenize(talks[0][500:560])))

# Normalization
## Lemmatize

+ A method for getting the word root.
+ It will replace the ending with the correct letters instead of chopping it off like some of the stemming functions. This leaves us with a few non-stemmed words (by default, `lemmatize` treats each word as a noun).
        e.g. children -> child,   capacities -> capacity, but also, unpredictability -> unpredictability
        
## Lowercase
+ We also lowercase each word using **.lower()**

In [None]:
lemmizer = nltk.WordNetLemmatizer()


my_text = "With our capabilities, we will educate the children. They are all associated with various playgrounds."


for word in nltk.wordpunct_tokenize(my_text):
    print(word, lemmizer.lemmatize(word.lower()))

## Stem
Now we will see how stemming the tokenized words with the Porter stemmer cuts off word endings to get to the root. For example, we get `recently -> recent`, but also `associated -> associ`.

We can print out the original word next to the stemmed word to check.

In [None]:
stemmer = nltk.stem.porter.PorterStemmer()


my_text = "With our capabilities, we will educate the children. They are all associated with various playgrounds."


for word in nltk.wordpunct_tokenize(my_text):
    print(word, stemmer.stem(word.lower()))

# Remove Stopwords, punctuation, or other non-letter/numbers
+ NLTK has a list of common words that add little semantic information to our text. We will use this list and add our own items to it:
        + punctuation
        + music notes

In [None]:
stop = nltk.corpus.stopwords.words('english')
stop[:15]

Add our own terms or characters to the list.

In [None]:
# tokens never carry surrounding whitespace, so the dashes are listed without spaces
stop += ['.', ',', ':', '...', '!"', '?"', "'", '"', '-', '—', ',"', '."', ';', '♫♫', '♫']
stop = set(stop)

Write a function to remove the stop words from a document using our list. Print a few talks and see if there are still a few words in there that are not giving us any information. If so, add them to the **stop** list.
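One possible shape for such a function is sketched below. It is only a starting point: the name `remove_stopwords` and the tiny `demo_stop` set are hypothetical stand-ins, and in the notebook you would pass in the `stop` set built above along with the tokens from one talk.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Hypothetical mini stop set, standing in for the notebook's `stop` set
demo_stop = {"the", "a", "of", "and", ",", ".", "♫"}
demo_tokens = ["The", "sound", "of", "music", ",", "♫", "and", "a", "pause", "."]

kept = remove_stopwords(demo_tokens, demo_stop)
print(kept)
```

Because `stop` is a set, each membership check is fast, which matters when you loop over every word in every talk.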

# Non-speech sounds, events

In [None]:
# remove parenthetical non-speech sounds from text using a regular expression
clean_parens_talks = [re.sub(r'\([^)]*\)', ' ', talks[fileid]) for fileid in fileids]

# print one talk
clean_parens_talks[1][:400]

In [None]:
talks[1][:400]
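The pattern `\([^)]*\)` matches an opening parenthesis, any run of characters that are not a closing parenthesis, and then the closing parenthesis. Here is a small sketch of it on a made-up sentence; the helper name `strip_parentheticals` is hypothetical, and the extra whitespace cleanup is an addition that the cell above does not do.

```python
import re

# Same pattern as in the cell above
PAREN_PATTERN = re.compile(r'\([^)]*\)')

def strip_parentheticals(text):
    """Replace each (...) group with a space, then collapse repeated whitespace."""
    no_parens = PAREN_PATTERN.sub(' ', text)
    return ' '.join(no_parens.split())

demo = "Thank you. (Applause) So I was saying (Laughter) that it works."
print(strip_parentheticals(demo))

# Limitation: the pattern stops at the first closing parenthesis,
# so nested parentheses leave a fragment behind
print(strip_parentheticals("a (b (c) d) e"))
```

The second call shows the main limitation: nested parentheses are not fully removed, so it is worth scanning a few cleaned talks for leftover `)` characters.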

# Define a cleaning function that combines the methods from above
1. clean (remove parentheticals)
2. tokenize into words using wordpunct
3. lowercase and remove stop words
4. lemmatize or stem
5. remove stop words again (a root word can itself be a stop word)
6. join the words back into a document and put into a list of cleaned documents

In [None]:
def clean_text(text):
    
    """ 
    Takes in a corpus of documents and cleans each one. Needs multiple docs.
    
    IN: corpus of documents (an iterable of strings)
    
    OUT: cleaned text = a list of documents, each a single string of
         space-joined cleaned words
    """

    lemmizer = nltk.WordNetLemmatizer()

    stop = ## TODO: import and/or create your list of stopwords
   

    cleaned_text = []
    
    for doc in text:
        cleaned_words = []
        
        # remove parentheticals
        clean_parens = re.sub(r'\([^)]*\)', ' ', doc)
        
        
        # tokenize into words
        for word  in nltk.wordpunct_tokenize(clean_parens):  
            low_word = word.lower()

            # throw out any words in stop words (doing it here and later makes it faster)
            if low_word not in stop:

                # lemmatize  to roots
                root_word = lemmizer.lemmatize(low_word)  

                # keep if not in stopwords (yes, again)
                ## TODO: remove stopwords again

                    # put into a list of words for each document
                    cleaned_words.append(root_word)
        
        # keep corpus of cleaned words for each document    
        cleaned_text.append(' '.join(cleaned_words))
    
    return cleaned_text
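If you get stuck on the TODOs, the sketch below shows one way the pieces could fit together. It is not the reference solution. To keep it runnable without any NLTK downloads, it makes a few assumptions: the tokenizer is a regular expression standing in for `nltk.wordpunct_tokenize`, the `normalize` argument is a hypothetical hook where you would pass `lemmizer.lemmatize`, and `demo_stop`, `demo_roots` and `demo_docs` are made-up data.

```python
import re

def clean_text_sketch(docs, stop_words, normalize=lambda w: w):
    """Clean each document and return a list of space-joined cleaned strings."""
    cleaned_docs = []
    for doc in docs:
        # remove parentheticals such as (Laughter)
        no_parens = re.sub(r'\([^)]*\)', ' ', doc)

        cleaned_words = []
        # regex stand-in for nltk.wordpunct_tokenize
        for word in re.findall(r"\w+|[^\w\s]+", no_parens):
            low_word = word.lower()
            # first stop word check, before the more expensive normalization
            if low_word in stop_words:
                continue
            root_word = normalize(low_word)
            # second check: the root itself may be a stop word
            if root_word not in stop_words:
                cleaned_words.append(root_word)

        cleaned_docs.append(' '.join(cleaned_words))
    return cleaned_docs

# Made-up data: "thing" is a custom stop word, and demo_roots fakes a lemmatizer
demo_stop = {"the", "a", ".", ",", "thing"}
demo_roots = {"things": "thing", "children": "child"}
demo_docs = ["The children (Laughter) like things.", "A thing, the playground."]

result = clean_text_sketch(demo_docs, demo_stop,
                           normalize=lambda w: demo_roots.get(w, w))
print(result)
```

In the demo, `things` passes the first check but its root `thing` is caught by the second one, which is why the stop words are filtered twice.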

In [None]:
cleaned_talks = clean_text(talks)

In [None]:
# TODO: print a few of our cleaned words from talk 1


In [None]:
# TODO: print a few of our cleaned words from talk 15


# Save 
Save as a pickle file (or CSV) for topic modeling in the next notebook.

In [None]:
with open('./data/cleaned_talks.pkl', 'wb') as picklefile:
    pickle.dump(cleaned_talks, picklefile)
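Before moving on, it can be worth checking that the file loads back correctly. The sketch below does a round trip with a small made-up list in a temporary directory, so it does not touch `./data/`. The file name `cleaned_talks_demo.pkl` is hypothetical, and in the next notebook you would open `./data/cleaned_talks.pkl` with mode `'rb'` instead.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the cleaned_talks list
demo_cleaned = ["child like", "playground"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_talks_demo.pkl")

    # write in binary mode, as in the cell above
    with open(path, "wb") as picklefile:
        pickle.dump(demo_cleaned, picklefile)

    # read back in binary mode
    with open(path, "rb") as picklefile:
        reloaded = pickle.load(picklefile)

print(reloaded == demo_cleaned)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.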