# Exploratory Data Analysis

## Introduction

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find some of the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each comedian:

1. **Most common words** - find these and create word clouds
2. **Size of vocabulary** - look at the number of unique words and also how quickly someone speaks
3. **Amount of profanity** - most common terms
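
For the numerical case mentioned in the introduction, the three summaries (average, distribution, most common value) can be sketched with the standard library alone. The sample values below are hypothetical and serve only as an illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical numeric sample, used only for illustration
values = [3, 5, 5, 7, 10]

# Average of the data set
average = mean(values)

# Distribution: how often each value occurs
distribution = Counter(values)

# Most common value and its frequency
most_common_value, frequency = distribution.most_common(1)[0]

print(average, most_common_value, frequency)
```

The text analysis in this notebook follows the same pattern, with word counts taking the place of numeric values.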

### Assignment 1:
#### Find `Most Common Words` and create word cloud.

#### Read in the document-term matrix

In [None]:
# Import necessary libraries
import pandas as pd
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Load the Document-Term Matrix (dtm.pkl)
data_dtm = pd.read_pickle("dtm.pkl")

# Load the cleaned data (data_clean.pkl)
data_clean = pd.read_pickle("data_clean.pkl")

In [None]:
data_dtm

#### Find and print the top 30 words said by each comedian


In [None]:
# Find and print the top 30 words said by each comedian
for comedian in data_dtm.index:
    print(f"\nTop 30 words for {comedian}:")
    print(data_dtm.loc[comedian].sort_values(ascending=False).head(30))

#### By looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that. Look at the most common top words and add them to the stop word list.



In [None]:
# Stop words list for adel_karam
stop_words_adel_karam = ['said', 'like', 'know', 'dont', 'dani','hes','youre', 'just', 'got', 'come', 'theres', 'want', 'im', 'thats']

# Stop words list for amy_schumer
stop_words_amy_schumer = ['like', 'im', 'just', 'know', 'guys', 'dont', 'right', 'thats', 'oh', 'theyre', 'youre', 'thank', 'cause', 'gonna', 'guy', 'okay', 'did', 'shes', 'think', 'yeah']

# Stop words list for beth_stelling
stop_words_beth_stelling = ['like', 'just', 'im', 'dont', 'know', 'youre', 'think', 'going', 'people', 'thats', 'feel', 'sex', 'theres', 'hes', 'got', 'theyre', 'maybe', 'mom', 'time', 'good']

# Stop words list for big_jay_oakerson
stop_words_big_jay_oakerson = ['like', 'just', 'im', 'know', 'thats', 'dont', 'right', 'youre', 'gonna', 'guy', 'shes', 'shit', 'got', 'man', 'dude', 'oh', 'big', 'good', 'yeah', 'guys']

# Stop words list for chelsea_handler
stop_words_chelsea_handler = ['like', 'im', 'said', 'dont', 'gonna', 'just', 'people', 'thats', 'youre', 'know', 'hes', 'going', 'time', 'didnt', 'okay', 'think', 'want', 'dan', 'cause', 'little']

# Stop words list for chris_rock
stop_words_chris_rock = ['like', 'thats', 'right', 'im', 'got', 'shit', 'know', 'okay', 'fuck', 'dont', 'just', 'man', 'cause', 'fucking', 'black', 'people', 'kids', 'theyre', 'everybody', 'white']

# Stop words list for david_cross
stop_words_david_cross = ['right', 'im', 'dont', 'thats', 'know', 'people', 'yeah', 'like', 'want', 'just', 'gonna', 'tell', 'think', 'hes', 'oh', 'god', 'thing', 'theres', 'theyre', 'okay']

# Stop words list for dave_chappelle
stop_words_dave_chappelle = ['know', 'im', 'like', 'said', 'man', 'tell', 'everybody', 'ngga', 'didnt', 'just', 'dont', 'time', 'right', 'shit', 'thats', 'got', 'dream', 'people', 'did', 'life']

# Stop words list for dylan_moran
stop_words_dylan_moran = ['know', 'people', 'just', 'thats', 'dont', 'going', 'im', 'look', 'youre', 'like', 'cause', 'time', 'really', 'oh', 'thing', 'didnt', 'need', 'theyre', 'okay', 'yeah']

# Stop words list for george_carlin
stop_words_george_carlin = ['like', 'im', 'know', 'youre', 'right', 'passed', 'guys', 'running', 'come', 'cloth', 'sergeant', 'ox', 'attack', 'life', 'loirn', 'limping', 'feather', 'dance', 'make', 'lot']

# Stop words list for iliza_shlesinger
stop_words_iliza_shlesinger = ['like', 'youre', 'im', 'dont', 'okay', 'know', 'just', 'right', 'thats', 'want', 'women', 'yeah', 'time', 'wedding', 'theyre', 'love', 'got', 'hes', 'going', 'think']

# Stop words list for kevin_hart
stop_words_kevin_hart = ['im', 'dont', 'fcking', 'said', 'got', 'know', 'thats', 'fck', 'like', 'man', 'just', 'right', 'shit', 'gonna', 'gotta', 'good', 'time', 'kids', 'house', 'people']

# Stop words list for kevin_james
stop_words_kevin_james = ['like', 'know', 'just', 'dont', 'im', 'thats', 'right', 'goes', 'youre', 'got', 'time', 'did', 'hes', 'good', 'yeah', 'shes', 'theyre', 'say', 'cause', 'gonna']

# Stop words list for louis_c_k
stop_words_louis_c_k = ['just', 'im', 'like', 'thats', 'dont', 'people', 'kids', 'want', 'shit', 'theres', 'know', 'youre', 'fuck', 'going', 'really', 'thing', 'good', 'time', 'theyre', 'kid']

# Stop words list for matt_rife
stop_words_matt_rife = ['like', 'know', 'im', 'just', 'thats', 'dont', 'fucking', 'oh', 'think', 'want', 'man', 'okay', 'people', 'youre', 'good', 'going', 'life', 'hes', 'time', 'fuck']

# Stop words list for pete_davidson
stop_words_pete_davidson = ['like', 'know', 'im', 'goes', 'dont', 'just', 'thats', 'mom', 'right', 'shes', 'youre', 'yeah', 'fuck', 'cause', 'fucking', 'got', 'okay', 'want', 'little', 'going']

# Stop words list for ricky_gervais
stop_words_ricky_gervais = ['just', 'like', 'dont', 'know', 'oh', 'right', 'im', 'got', 'okay', 'yeah', 'thats', 'fat', 'god', 'think', 'ive', 'people', 'going', 'theyre', 'said', 'fucking']

# Stop words list for sarah_cooper
stop_words_sarah_cooper = ['im', 'sarah', 'just', 'like', 'oh', 'cooper', 'nice', 'okay', 'yeah', 'gonna', 'youre', 'great', 'know', 'fine', 'good', 'got', 'right', 'thats', 'dont', 'little']

# Stop words list for tom_segura
stop_words_tom_segura = ['indians', 'indian', 'like', 'passed', 'everybody', 'im', 'know', 'youre', 'right', 'dont', 'goes', 'yeah', 'think', 'thats', 'say', 'want', 'got', 'gonna', 'shit', 'fucking']

# Stop words list for trevor_noah
stop_words_trevor_noah = ['like', 'know', 'dont', 'people', 'white', 'yeah', 'thats', 'youre', 'anthem', 'man', 'just', 'im', 'time', 'think', 'hes', 'germany', 'right', 'world', 'america', 'thing']

#### Let's aggregate this list and identify the most common words along with how many routines they occur in


In [None]:
from collections import Counter

# Aggregate stop words lists
all_stop_words = (
    stop_words_adel_karam +
    stop_words_amy_schumer +
    stop_words_beth_stelling +
    stop_words_big_jay_oakerson +
    stop_words_chris_rock +
    stop_words_chelsea_handler +
    stop_words_david_cross +
    stop_words_dave_chappelle +
    stop_words_dylan_moran +
    stop_words_george_carlin +
    stop_words_iliza_shlesinger +
    stop_words_kevin_hart +
    stop_words_kevin_james +
    stop_words_louis_c_k +
    stop_words_matt_rife +
    stop_words_pete_davidson +
    stop_words_ricky_gervais +
    stop_words_sarah_cooper +
    stop_words_tom_segura +
    stop_words_trevor_noah
)

# Count occurrences of each word
word_counts = Counter(all_stop_words)

# Identify the most common words and their occurrences
most_common_words = word_counts.most_common()

# Print the results
print("Most common words and their occurrences:")
for word, count in most_common_words:
    print(f"{word}: {count} occurrences")

#### If more than half of the comedians have it as a top word, exclude it from the list


In [None]:
# Identify comedians count
num_comedians = 20

# Keep the words that appear in more than half of the comedians' lists;
# these are the words that will be excluded as stop words
filtered_words = {word: count for word, count in word_counts.items() if count > num_comedians / 2}

# Print the filtered words
print("Words that more than half of the comedians have as a top word:")
for word, count in filtered_words.items():
    print(f"{word}: {count} occurrences")
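
A minimal sketch of the "more than half" rule on hypothetical data. It uses a strict comparison, so a word that appears in exactly half of the lists is not treated as a stop word. The comedian names and word lists are invented for illustration.

```python
from collections import Counter

# Hypothetical top-word lists for four comedians (illustration only)
top_words = {
    "comedian_a": ["like", "know", "dog"],
    "comedian_b": ["like", "know", "cat"],
    "comedian_c": ["like", "time", "cat"],
    "comedian_d": ["like", "time", "mom"],
}

# Count in how many lists each word appears (set() guards against repeats within a list)
list_counts = Counter(word for words in top_words.values() for word in set(words))

# Keep only the words that appear in strictly more than half of the lists
threshold = len(top_words) / 2
shared_words = sorted(word for word, n in list_counts.items() if n > threshold)

print(shared_words)
```

Here "know", "time" and "cat" each appear in exactly two of the four lists, so only "like" passes the strict threshold.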

In [None]:
# Store these words as the final stop-word list
final_stop_words = list(set(filtered_words))

print(final_stop_words)

In [None]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Load the cleaned data (data_clean.pkl)
data_clean = pd.read_pickle("data_clean.pkl")

stop_words = list(text.ENGLISH_STOP_WORDS.union(final_stop_words))
# Add new stop words
cv = CountVectorizer(stop_words=stop_words)

# Recreate document-term matrix
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm_NLP3 = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm_NLP3.index = data_clean.index
data_dtm_NLP3

# Pickle it for later use
with open('data_dtm_NLP3.pkl', 'wb') as file:
    pickle.dump(data_dtm_NLP3, file)

In [None]:
data_dtm_NLP3

In [None]:
# Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
# !pip install wordcloud
from wordcloud import WordCloud

In [None]:
# Reset the output dimensions
plt.rcParams['figure.figsize'] = [16, 12]

# Create subplots for each comedian
comedians = data_dtm_NLP3.index
for index, comedian in enumerate(comedians):
    wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2", max_font_size=150, random_state=42)
    wc.generate(data_clean.transcript[comedian])
    
    plt.subplot(4, 5, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(comedian)

plt.show()

## Observations
1. George Carlin's top words are quite different from the others, with specific references like "indians," "passed," and "running," suggesting a tendency towards social commentary and observational humor. They also indicate that George Carlin is talking about an Indian warrior or sergeant in this routine.

2. Amy Schumer frequently uses words like "guy," "girl," and "mom," which might reflect that she is talking about her relationships.

3. Ricky Gervais uses words like "fat," "gay," and "god," which may indicate dark comedy involving body shaming or homophobic jokes.

4. Sarah Cooper's references to "president" and "news" reflect her political satire.

Each comedian has a unique set of words that they frequently use, indicating their individual style and preferred topics.

### Assignment 2:
#### Find the number of unique words that each comedian uses.

To find the words unique to each comedian, I copy `data_dtm_NLP3` into a new binary dataframe in which every non-zero count becomes 1. A word is unique to a comedian when its cell is 1 and its column sum is also 1, that is, when no other comedian uses it.

In [None]:
# Create a binary dataframe where each cell is 1 if the word is used by the comedian, 0 otherwise
# (a vectorized comparison; DataFrame.applymap is deprecated in recent pandas versions)
binary_data_dtm_NLP3 = (data_dtm_NLP3 > 0).astype(int)

# Calculate the column sum for each word to check if it's unique to a comedian
unique_words_sum = binary_data_dtm_NLP3.sum(axis=0)

# Filter out words that are unique to a comedian (sum is 1)
unique_words_per_comedian = binary_data_dtm_NLP3.loc[:, unique_words_sum == 1]

# Calculate the row sum for each comedian to find the number of unique words
number_of_unique_words_per_comedian = unique_words_per_comedian.sum(axis=1)

# Create a new dataframe with the number of unique words per comedian
unique_words_count_df = pd.DataFrame({'Comedian': number_of_unique_words_per_comedian.index, 'UniqueWordsCount': number_of_unique_words_per_comedian.values})

# Print the dataframe with the number of unique words per comedian
print("Number of Unique Words Said by Each Comedian:")
print(unique_words_count_df)
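
"Unique words" can be read in two ways: the size of a comedian's vocabulary (distinct words used) or the words that no other comedian uses. The following sketch computes both on a hypothetical toy document-term matrix; the comedian names and counts are invented, and `toy_dtm` is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy document-term matrix: rows are comedians, columns are words
toy_dtm = pd.DataFrame(
    {"dog": [2, 0, 1], "cat": [0, 3, 0], "mom": [1, 1, 1], "time": [0, 0, 4]},
    index=["comedian_a", "comedian_b", "comedian_c"],
)

# True where a comedian says the word at least once
used = toy_dtm > 0

# Vocabulary size: number of distinct words each comedian uses
vocabulary_size = used.sum(axis=1)

# Exclusive words: words used by exactly one comedian, counted per comedian
exclusive_words = used.loc[:, used.sum(axis=0) == 1].sum(axis=1)

print(pd.DataFrame({"vocabulary_size": vocabulary_size, "exclusive_words": exclusive_words}))
```

In the toy data, "cat" is exclusive to `comedian_b` and "time" to `comedian_c`, while "mom" is shared by everyone and counts only towards vocabulary size.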

##### Before finding the words per minute (wpm) for each comedian, I will total the number of words each comedian said

In [None]:
# Summing the values across each row to get the total words
total_words_by_comedian = data_dtm.sum(axis=1)

# Creating a new DataFrame df_nlp3
df_nlp3 = pd.DataFrame({'Comedian': total_words_by_comedian.index, 'Total_Words': total_words_by_comedian.values})

# Displaying the new DataFrame
print(df_nlp3)

In [None]:
# List of comedians, in the same row order as df_nlp3
comedians = df_nlp3['Comedian'].tolist()
# Run time of each routine in minutes, taken from IMDb (must follow the row order above)
durations = [59,61,58,68,68,70,56,56,74,6,58,57,62,59,63,49,60,47,57,55]
# Add the run times as the 'run_times' column
df_nlp3['run_times'] = durations
# Displaying the updated DataFrame
print(df_nlp3)

In [None]:
# Calculate the words per minute of each comedian
df_nlp3['words_per_minute'] = df_nlp3['Total_Words'] / df_nlp3['run_times']
df_nlp3
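
Words per minute is the total word count divided by the run time in minutes. Assigning run times from a positional list only works if the list follows the row order exactly, so the sketch below looks them up by name instead. The names, totals and run times are hypothetical, and `run_time_by_name` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical totals, deliberately not in alphabetical order
totals = pd.DataFrame(
    {"Comedian": ["comedian_b", "comedian_a"], "Total_Words": [4500, 3000]}
)

# Hypothetical run times in minutes, keyed by comedian name
run_time_by_name = {"comedian_a": 60, "comedian_b": 50}

# Look up each run time by name, so the row order does not matter
totals["run_times"] = totals["Comedian"].map(run_time_by_name)

# Words per minute = total words / run time in minutes
totals["words_per_minute"] = totals["Total_Words"] / totals["run_times"]

print(totals)
```

A name missing from the dictionary would show up as `NaN` in `run_times`, which makes a mismatch visible instead of silently shifting values to the wrong comedian.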

In [None]:
# Sorting the DataFrame by 'words_per_minute'
df_sorted = df_nlp3.sort_values(by='words_per_minute', ascending=False)
df_sorted

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Sorting the DataFrame by 'words_per_minute'
df_sorted = df_nlp3.sort_values(by='words_per_minute', ascending=False)

# Set up the subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Plotting the total number of words
axes[0].barh(df_sorted['Comedian'], df_sorted['Total_Words'], color='skyblue')
axes[0].set_yticks(df_sorted['Comedian'])
axes[0].set_xlabel('Total Words')
axes[0].set_title('Number of Total Words', fontsize=16)

# Plotting the Number of Words Per Minute
axes[1].barh(df_sorted['Comedian'], df_sorted['words_per_minute'], color='salmon')
axes[1].set_yticks(df_sorted['Comedian'])
axes[1].set_xlabel('Words Per Minute')
axes[1].set_title('Number of Words Per Minute', fontsize=16)

# Adjusting layout
plt.tight_layout()

# Show the plot
plt.show()

## Observations

* **Talking Speed:**
  * George Carlin and Kevin Hart speak quickly, averaging around 91 and 93 words per minute respectively.

* **Total Words:**
  * George Carlin has the smallest word count, only about 550 words, which reflects the short length of his routine rather than a limited vocabulary, while Dylan Moran has an expansive one, with over 5000 words.

* **Consistency in Talking Speed:**
  * Matt Rife and Dylan Moran maintain steady talking speeds.

### Assignment 3: 
#### Check the profanity by analysing the common bad words, like `fucking`, `fuck`, `shit`, etc.

In [None]:
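# Take a look at the most common words across all comedians
# (Counter(data_dtm) would only count the column names, once each, so sum the columns instead)
word_totals = data_dtm.sum(axis=0).sort_values(ascending=False)
most_common_words = word_totals.index.tolist()
print(most_common_words)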

In [None]:
# Isolate just these bad words (each word listed once, so no count is doubled)
profanity_words = ['fucking', 'fuck', 'shit', 'bitch', 'dickhead', 'dicks', 'dicksucker', 'fucked', 'fuckers', 'fuckin', 'fucks', 'ngga', 'nggas', 'ngger']
# Work on a copy so that later changes do not touch data_dtm
data_bad_words = data_dtm[profanity_words].copy()
data_bad_words
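
Selecting columns with a list of labels raises a `KeyError` if any listed word is missing from the matrix, and a repeated word produces a repeated column. A hedged alternative, sketched here on a hypothetical toy matrix, is to drop repeats first and use `reindex` so that missing words become all-zero columns. The names `toy_dtm` and `wanted` are illustrative only.

```python
import pandas as pd

# Hypothetical toy document-term matrix
toy_dtm = pd.DataFrame(
    {"shit": [3, 0], "fuck": [1, 2], "dog": [5, 1]},
    index=["comedian_a", "comedian_b"],
)

# Word list with a repeat ("fuck") and a word absent from the matrix ("bitch")
wanted = ["fuck", "shit", "bitch", "fuck"]

# Drop repeats while keeping the original order
unique_wanted = list(dict.fromkeys(wanted))

# reindex keeps the requested columns and fills missing ones with 0
toy_bad_words = toy_dtm.reindex(columns=unique_wanted, fill_value=0)

print(toy_bad_words)
```

Whether a missing word should be silently filled with zeros or reported as an error is a design choice; filling is convenient for plots, while an error catches spelling mistakes in the word list.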

In [None]:
%pip install adjustText

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from adjustText import adjust_text


# Create a single square figure
plt.figure(figsize=(6, 6))

# Group the profanity words by starting letter (done once, outside the loop)
f_words = [word for word in profanity_words if word.startswith('f')]
s_words = [word for word in profanity_words if word.startswith('s')]

# Loop through each comedian
texts = []
for comedian in data_bad_words.index:
    # Total count of f-words and s-words for this comedian
    x = data_bad_words[f_words].sum(axis=1).loc[comedian]
    y = data_bad_words[s_words].sum(axis=1).loc[comedian]

    # Create a scatter plot
    plt.scatter(x, y, color='blue')

    # Add text annotations to the list
    texts.append(plt.text(x + 1.5, y + 0.5, comedian, fontsize=8))

# Adjust text positions to avoid overlapping
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='red'))

# Set plot title and labels
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)

# Show the plot
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(data_bad_words, annot=True, fmt="d", cmap="YlGnBu", xticklabels=profanity_words)
plt.title("Profanity Words Count for Each Comedian")
plt.xlabel("Profanity Words")
plt.ylabel("Comedian Name")
plt.show()

## Observations

Frequency of Profanity:

- Matt Rife uses the most profanity overall, with high counts of "fucking," "fuck," and "shit," suggesting a tendency towards more explicit language in his comedy.
- Ricky Gervais also uses a significant amount of profanity, particularly the word "fuck," along with variations like "fucking" and "fucks."
- George Carlin doesn't use any of the listed profane words, indicating a cleaner style of comedy.

Variety of Profanity:

- Chris Rock and Big Jay Oakerson use a wide range of profanity, including "fucking," "fuck," "shit," and others, suggesting a more diverse use of explicit language in their routines.
- Kevin Hart's count for "fuck" is zero, although variations like "fucking" and "fucks" do appear. His transcript spells these words in censored form (`fck` and `fcking` are among his top words), so the listed spellings likely undercount his profanity.

Use of Specific Profanity:

- Pete Davidson and Louis C.K. use the word "fuck" most frequently, with high counts for "fucking" and "fuckers," indicating a preference for this particular profanity in their comedy.
- Chelsea Handler uses the word "fuck" and its variations frequently but doesn't use the word "bitch" as much as some other comedians, suggesting a focus on certain types of profanity over others.

Absence of Profanity:

- George Carlin and Kevin James don't use any of the listed profane words, indicating a clean or family-friendly style of comedy that avoids explicit language.
- Sarah Cooper also refrains from using profanity, further indicating a cleaner comedic style.

### Chris Rock and Dave Chappelle use the N-word (spelled `ngga` in the transcripts) more than any other comedian in the set

### Assignment 4:(optional)
What other word counts do you think would be interesting to compare instead of the f-word and s-word? Create a scatter plot comparing them.

Ans: Besides the f-word and the s-word, we can compare the d-words and the b-words.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from adjustText import adjust_text


# Create a figure with subplots
plt.figure(figsize=(6, 6))

# Loop through each comedian
texts = []
# Group the profanity words by starting letter (done once, outside the loop)
b_words = [word for word in profanity_words if word.startswith('b')]
d_words = [word for word in profanity_words if word.startswith('d')]

for comedian in data_bad_words.index:
    # Total count of b-words and d-words for this comedian
    x = data_bad_words[b_words].sum(axis=1).loc[comedian]
    y = data_bad_words[d_words].sum(axis=1).loc[comedian]

    # Create a scatter plot
    plt.scatter(x, y, color='blue')

    # Add text annotations to the list
    texts.append(plt.text(x + 1.5, y + 0.5, comedian, fontsize=8))

# Adjust text positions to avoid overlapping
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='red'))

# Set plot title and labels
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of B Bombs', fontsize=15)
plt.ylabel('Number of D Words', fontsize=15)

# Show the plot
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from adjustText import adjust_text


# Create a figure with subplots
plt.figure(figsize=(6, 6))

# Loop through each comedian
texts = []
# Group the profanity words by starting letter (done once, outside the loop)
f_words = [word for word in profanity_words if word.startswith('f')]
d_words = [word for word in profanity_words if word.startswith('d')]

for comedian in data_bad_words.index:
    # Total count of f-words and d-words for this comedian
    x = data_bad_words[f_words].sum(axis=1).loc[comedian]
    y = data_bad_words[d_words].sum(axis=1).loc[comedian]

    # Create a scatter plot
    plt.scatter(x, y, color='blue')

    # Add text annotations to the list
    texts.append(plt.text(x + 1.5, y + 0.5, comedian, fontsize=8))

# Adjust text positions to avoid overlapping
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='red'))

# Set plot title and labels
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of D Words', fontsize=15)

# Show the plot
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from adjustText import adjust_text


# Create a figure with subplots
plt.figure(figsize=(6, 6))

# Loop through each comedian
texts = []
# Group the profanity words by starting letter (done once, outside the loop)
b_words = [word for word in profanity_words if word.startswith('b')]
f_words = [word for word in profanity_words if word.startswith('f')]

for comedian in data_bad_words.index:
    # Total count of b-words and f-words for this comedian
    x = data_bad_words[b_words].sum(axis=1).loc[comedian]
    y = data_bad_words[f_words].sum(axis=1).loc[comedian]

    # Create a scatter plot
    plt.scatter(x, y, color='blue')

    # Add text annotations to the list
    texts.append(plt.text(x + 1.5, y + 0.5, comedian, fontsize=8))

# Adjust text positions to avoid overlapping
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='red'))

# Set plot title and labels
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of B Bombs', fontsize=15)
plt.ylabel('Number of F Words', fontsize=15)

# Show the plot
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from adjustText import adjust_text


# Create a figure with subplots
plt.figure(figsize=(6, 6))

# Loop through each comedian
texts = []
# Group the profanity words by starting letter (done once, outside the loop)
f_words = [word for word in profanity_words if word.startswith('f')]
n_words = [word for word in profanity_words if word.startswith('n')]

for comedian in data_bad_words.index:
    # Total count of f-words and n-words for this comedian
    x = data_bad_words[f_words].sum(axis=1).loc[comedian]
    y = data_bad_words[n_words].sum(axis=1).loc[comedian]

    # Create a scatter plot
    plt.scatter(x, y, color='blue')

    # Add text annotations to the list
    texts.append(plt.text(x + 1.5, y + 0.5, comedian, fontsize=8))

# Adjust text positions to avoid overlapping
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='red'))

# Set plot title and labels
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of N Words', fontsize=15)

# Show the plot
plt.show()
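
The scatter-plot cells above differ only in the pair of starting letters being compared. One way to avoid repeating the grouping logic is a small helper that sums all columns starting with a given prefix. This is a sketch under stated assumptions: `totals_by_prefix` is a hypothetical helper, and `toy_counts` is invented data, not the notebook's matrix.

```python
import pandas as pd


def totals_by_prefix(counts: pd.DataFrame, prefix: str) -> pd.Series:
    """Sum, per row, all columns whose name starts with `prefix`."""
    columns = [name for name in counts.columns if name.startswith(prefix)]
    return counts[columns].sum(axis=1)


# Hypothetical word counts: rows are comedians, columns are words
toy_counts = pd.DataFrame(
    {"fuck": [4, 0], "fucking": [2, 1], "shit": [3, 5], "bitch": [0, 2]},
    index=["comedian_a", "comedian_b"],
)

# Per-comedian totals for two word groups
f_totals = totals_by_prefix(toy_counts, "f")
s_totals = totals_by_prefix(toy_counts, "s")

print(pd.DataFrame({"f_words": f_totals, "s_words": s_totals}))
```

With such a helper, each comparison needs only two calls with different prefixes, and the resulting series can be passed straight to `plt.scatter`.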

In [None]:
# Calculate the total f-word count for each comedian
# (select the columns explicitly so that only words starting with 'f' are summed, each once)
f_words = sorted({word for word in profanity_words if word.startswith('f')})
total_f_bomb = data_dtm[f_words].sum(axis=1)

# Plotting
plt.figure(figsize=(12, 8))
plt.bar(total_f_bomb.index, total_f_bomb, color='skyblue')
plt.xlabel('Comedian')
plt.ylabel('Frequency of F-Bombs')
plt.title('Frequency of F-Bombs by Comedian')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()