# Natural Language Processing Workflow (2): EDA

For text data, exploratory data analysis (EDA) means summarizing the main characteristics of the dataset and spotting obvious patterns before using machine learning techniques to uncover hidden ones. For example, for each actor we can look at the following:
1. **Most common words**
2. **Size of vocabulary** - the number of unique words, and also how quickly someone speaks
3. **Amount of profanity**

But before we dive into any text EDA, we need to identify the specific questions we want to answer.

## 1. Most Common Words

In [None]:
import pandas as pd

# read in the document-term matrix
data = pd.read_pickle('dtm.pkl')

# Transpose so that each actor becomes a column;
# this makes it easier to find the top 30 words each actor uses
data = data.transpose()
data.head()

In [None]:
# Find the top 30 words for each actor
top_dict = {}

for c in data.columns:
    # sort_values returns a Series (index = words, values = counts); keep the top 30
    top = data[c].sort_values(ascending = False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

# top_dict is a dictionary:
# each key is an actor's name
# each value is a list of (word, count) tuples, e.g. [('a', 1), ('b', 2)]
top_dict

In [2]:
# Example to show list(zip(a,b))
a=['ab','cd']
value = [3,5]
list(zip(a,value))

[('ab', 3), ('cd', 5)]

In [None]:
# Print the top 15 words said by each actor
for actor, top_words in top_dict.items():
    print(actor)
    # top_words[:15] keeps 15 items; the original slice [0:14] kept only 14
    print(', '.join([word for word, count in top_words[:15]]))
    print('---')
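
To make the shape of `top_dict` concrete, here is a minimal sketch on made-up data. It uses `collections.Counter` rather than the pandas DataFrame from the cells above, purely so that it runs on its own. The speaker names and counts in `toy_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical word counts per speaker (stand-in for the columns of the transposed DTM)
toy_counts = {
    'speaker_a': {'like': 9, 'know': 7, 'dog': 4, 'pizza': 1},
    'speaker_b': {'like': 8, 'right': 6, 'know': 5, 'mom': 2},
}

# For each speaker, keep the n most frequent words as (word, count) tuples
n = 3
toy_top = {name: Counter(counts).most_common(n) for name, counts in toy_counts.items()}

for name, top_words in toy_top.items():
    print(name, ', '.join(word for word, _ in top_words))
```

Each value in `toy_top` is a list of `(word, count)` tuples sorted by descending count, the same structure that `list(zip(top.index, top.values))` produces above.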

If some of the top words carry little meaning, we should add them to the stop word list.

In [None]:
from collections import Counter

# first pull out the top 30 words for each actor
words = []
for actor in data.columns:
    top = [word for (word,count) in top_dict[actor]]
    for t in top:
        words.append(t)      
words

# Aggregate the words list and identify most common words
Counter(words).most_common()

# If more than half of the actors (12 in total) have a word among their top words, add it to the stop word list
add_stop_words = [word for word,count in Counter(words).most_common() if count > 6]
add_stop_words
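
The threshold rule can be sketched on a small, hypothetical example: three speakers, with a word flagged when it appears in the top list of more than half of them. The names `toy_top_words` and `shared_stop_words` are illustrative only.

```python
from collections import Counter

# Hypothetical top-word lists for three speakers
toy_top_words = {
    'speaker_a': ['like', 'know', 'dog'],
    'speaker_b': ['like', 'right', 'know'],
    'speaker_c': ['like', 'people', 'time'],
}

# Count in how many speakers' top lists each word appears
appearances = Counter(word for top in toy_top_words.values() for word in top)

# Flag words that are a top word for more than half of the speakers
threshold = len(toy_top_words) / 2
shared_stop_words = [word for word, count in appearances.most_common() if count > threshold]
print(shared_stop_words)
```

Deriving the threshold from the number of speakers, instead of hard-coding a value such as 6, keeps the rule valid if the number of transcripts changes.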

In [None]:
# Update the document-term matrix with the new list of stop words
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

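# Recreate document-term matrix
# (recent scikit-learn versions expect stop_words to be a list rather than a frozenset)
cv = CountVectorizer(stop_words=list(stop_words))
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
data_dtm_update1 = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())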
data_dtm_update1.index = data_clean.index

# Pickle it for later use
import pickle
# pickle.dump and DataFrame.to_pickle both serialize an object to a file
with open('cv_stop.pkl', 'wb') as f:
    pickle.dump(cv, f)
data_dtm_update1.to_pickle('dtm_stop.pkl')
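
The stop word update relies on `union` accepting any iterable, so a plain list of extra words can be merged into an existing set. Below is a minimal sketch in plain Python, assuming the built-in stop word collection behaves like a `frozenset`. The tiny `base_stop_words` set and the example sentence are made up.

```python
# Stand-in for a built-in stop word collection (hypothetical, far smaller than a real one)
base_stop_words = frozenset({'the', 'a', 'and'})
extra_stop_words = ['like', 'know']  # e.g. words shared by most speakers

# frozenset.union accepts any iterable and returns a new frozenset
toy_stop_words = base_stop_words.union(extra_stop_words)

# Drop every token that is in the combined stop word set
tokens = 'the dog and i like pizza'.split()
kept = [t for t in tokens if t not in toy_stop_words]
print(kept)
```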


Once we have the top words for each actor, one of the best ways to communicate them visually is a **word cloud**. Looking at the word clouds also lets us check whether the data makes sense. If it does not, we have to go back to the previous step and clean the data again.

In [None]:
from wordcloud import WordCloud

wc = WordCloud(stopwords = stop_words, background_color = 'white', max_font_size = 150, random_state = 42)

# to give each subplot a title, create a full_names list
full_names = ['...', '...',...]

import matplotlib.pyplot as plt
%matplotlib inline

# Create a subplot for each actor
for index, actor in enumerate(data.columns):
    wc.generate(data_clean.transcript[actor])
    
    # since we have 12 actors, use a 3x4 grid of subplots
    plt.subplot(3,4, index+1)
    plt.imshow(wc, interpolation = 'bilinear')
    plt.axis('off')
    plt.title(full_names[index]) 

## 2. Size of Vocabulary

In [None]:
# Find the number of unique words each actor uses
unique_list = []
for actor in data.columns:
    uniques = data[actor].to_numpy().nonzero()[0].size
    unique_list.append(uniques)


data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns = ['actor', 'unique_words'])
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort

In [None]:
# Calculate the speech speed of each actor

# First find the total number of words each actor uses
total_list = []
for actor in data.columns:
    totals = sum(data[actor])
    total_list.append(totals)
    
# Run times in minutes, looked up on IMDb
run_times = [**, **, ...]

data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words']/data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by ='words_per_minute')
data_wpm_sort
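
As a sanity check on the arithmetic, the sketch below computes vocabulary size and words per minute for a single hypothetical speaker in plain Python. The counts in `toy_column` and the 60-minute run time are made up.

```python
# Hypothetical word counts for one speaker (one column of the transposed DTM)
toy_column = {'like': 120, 'know': 90, 'dog': 30, 'pizza': 0}
toy_run_time = 60  # minutes, made up for illustration

# Vocabulary size: how many words occur at least once
unique_words = sum(1 for count in toy_column.values() if count > 0)

# Total words spoken and speaking speed
total_words = sum(toy_column.values())
words_per_minute = total_words / toy_run_time

print(unique_words, total_words, words_per_minute)
```

Words with a count of zero belong to the shared vocabulary of the matrix but not to this speaker, which is why only nonzero entries are counted.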

In [None]:
# Visualize the results
import numpy as np

# np.arange(n) returns 0, 1, ..., n-1: one bar position per actor
y_pos = np.arange(len(data_words))

plt.subplot(1,2,1)
plt.barh(y_pos, data_unique_sort.unique_words, align = 'center')
plt.yticks(y_pos, data_unique_sort.actor)
plt.title('Number of Unique Words', fontsize = 20)

plt.subplot(1,2,2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align = 'center')
plt.yticks(y_pos, data_wpm_sort.actor)
plt.title('Number of Words Per Minute', fontsize = 20)

plt.tight_layout() 

## 3. Amount of Profanity

In [None]:
# Let's isolate just these bad words
data_bad_words = data.transpose()[['fucking', 'fuck', 'shit']]
data_profanity = pd.concat([data_bad_words.fucking + data_bad_words.fuck, data_bad_words.shit], axis=1)
data_profanity.columns = ['f_word', 's_word']
data_profanity
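
The grouping step simply adds the counts of word variants together. Here is a minimal sketch of the same idea in plain Python, using mild stand-in words and made-up counts. The names `toy_counts`, `groups`, and `toy_profanity` are hypothetical.

```python
# Hypothetical per-speaker counts for a few word variants
toy_counts = {
    'speaker_a': {'darn': 4, 'darned': 2, 'heck': 5},
    'speaker_b': {'darn': 0, 'darned': 1, 'heck': 9},
}

# Map each output column to the variants that should be summed into it
groups = {'d_word': ['darn', 'darned'], 'h_word': ['heck']}

# For every speaker, sum the counts of the variants in each group
toy_profanity = {
    speaker: {label: sum(counts[w] for w in variants) for label, variants in groups.items()}
    for speaker, counts in toy_counts.items()
}
print(toy_profanity)
```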

In [None]:
# Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8]

for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)

# Set the x-axis range once, after all points are drawn
plt.xlim(-5, 155)
    
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)

plt.show()