# Exploratory Data Analysis

This section looks at some statistics derived from the cleaned data, such as the most common words for a specific song or the size of the vocabulary used to describe it. The goal is to check whether the data cleaning worked or whether words that carry no meaning are still present. For example, if the word cloud shows that "one" is among the most common words for a specific song, more cleaning is needed to remove such words so that the natural language model gets the best possible input.

In [14]:
import pandas as pd
# in anaconda prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
data = pd.read_pickle('Songs_DBpedia_DTM.pkl')
data = data.transpose() # for easier aggregation
data.head()

| | '97_Bonnie_&_Clyde | 'Cuz_I_Can_(Pink_song) | 'Round_Midnight_(song) | 'S_Wonderful | 'The_Half_of_It,_Dearie'_Blues | 'Till_I_Collapse | 'Tis_Harry_I'm_Plannin'_to_Marry | 'n_Beetje | 't_Is_Genoeg | 't_Is_OK | ... | A_Day_in_the_Life_of_a_Tree | A_Fellow_Needs_a_Girl | A_Great_Day_for_Freedom | A_Guy_Is_a_Guy | A_Guy_Like_You | A_Hard_Rain's_a-Gonna_Fall | A_Kinder_Eye | A_Little_Bit_Longer_(song) | A_Love_That_Will_Never_Grow_Old | A_Man_Without_Love_(Kenneth_McKellar_song) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| abbey | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abbreviation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abolition | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| absent | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| academy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
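
The vocabulary size mentioned in the introduction can be read directly off a matrix laid out like this one (words as rows, songs as columns). The following is only a minimal sketch: the small `toy` frame and its song names are made up for illustration and stand in for the real pickle.

```python
import pandas as pd

# Hypothetical stand-in for `data`: rows are words, columns are songs.
toy = pd.DataFrame(
    {"song_a": [2, 0, 1], "song_b": [0, 0, 3]},
    index=["album", "abbey", "song"],
)

# Vocabulary size per song: number of distinct words with a non-zero count.
vocab_size = (toy > 0).sum(axis=0)

# Total number of word occurrences per song, for comparison.
total_words = toy.sum(axis=0)

print(vocab_size.to_dict())
print(total_words.to_dict())
```

A song with a very small vocabulary has a short summary, which is worth knowing before interpreting its top words.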


In [34]:
# top 30 words in each summary
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))

top_dict

{"'97_Bonnie_&_Clyde": [('song', 5),
  ('shady', 3),
  ('slim', 3),
  ('kim', 2),
  ('daughter', 2),
  ('dark', 1),
  ('throw', 1),
  ('pier', 1),
  ('sea', 1),
  ('seat', 1),
  ('dada', 1),
  ('whilst', 1),
  ('covered', 1),
  ('rapper', 1),
  ('album', 1),
  ('wife', 1),
  ('baby', 1),
  ('couple', 1),
  ('just', 1),
  ('fashion', 1),
  ('little', 1),
  ('taking', 1),
  ('strange', 1),
  ('ex', 1),
  ('undertone', 1),
  ('like', 1),
  ('going', 1),
  ('let', 1),
  ('car', 1),
  ('strap', 1)],
 "'Cuz_I_Can_(Pink_song)": [('album', 3),
  ('song', 1),
  ('dead', 1),
  ('pink', 1),
  ('single', 1),
  ('written', 1),
  ('martin', 1),
  ('promotional', 1),
  ('fourth', 1),
  ('abbey', 0),
  ('pollution', 0),
  ('polish', 0),
  ('police', 0),
  ('poco', 0),
  ('player', 0),
  ('pleasure', 0),
  ('pope', 0),
  ('play', 0),
  ('platinum', 0),
  ('place', 0),
  ('pipe', 0),
  ('piggy', 0),
  ('pierrot', 0),
  ('pop', 0),
  ('positive', 0),
  ('popular', 0),
  ('predominantly', 0),
  ('prince',
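
In the output above, songs with short summaries are padded with words whose count is 0 (for example `('abbey', 0)`), because `head(30)` always returns 30 rows. One possible way to avoid this is to drop zero counts before taking the top entries. The sketch below runs on a made-up `toy` matrix, and the name `top_dict_nonzero` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for `data`: rows are words, columns are songs.
toy = pd.DataFrame(
    {"song_a": [3, 0, 1, 0], "song_b": [0, 0, 2, 5]},
    index=["album", "abbey", "song", "jazz"],
)

top_n = 30
top_dict_nonzero = {}
for c in toy.columns:
    counts = toy[c]
    # Keep only words that actually occur in this song's summary.
    top = counts[counts > 0].sort_values(ascending=False).head(top_n)
    top_dict_nonzero[c] = list(zip(top.index, top.values))

print(top_dict_nonzero)
```

With this filter, a song whose summary contains fewer than 30 distinct words simply gets a shorter list, so zero-count words no longer leak into later tallies.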

In [35]:
# Print the top words for each song (top_words[0:14] yields 14 words; use [:15] for 15)
for song, top_words in top_dict.items():
    print(song)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')

'97_Bonnie_&_Clyde
song, shady, slim, kim, daughter, dark, throw, pier, sea, seat, dada, whilst, covered, rapper
---
'Cuz_I_Can_(Pink_song)
album, song, dead, pink, single, written, martin, promotional, fourth, abbey, pollution, polish, police, poco
---
'Round_Midnight_(song)
jazz, midnight, variety, pianist, quintet, hall, quickly, wide, standard, monk, musician, fame, version, added
---
'S_Wonderful
song, especially, broadway, stage, jazz, standard, frank, popular, composed, musical, written, funny, year, considered
---
'The_Half_of_It,_Dearie'_Blues
lady, musical, good, song, half, dearie, blues, composed, piggy, pink, pipe, place, pop, play
---
'Till_I_Collapse
rapper, song, album, chart, fourth, hook, despite, track, billion, numerous, collapse, studio, till, single
---
'Tis_Harry_I'm_Plannin'_to_Marry
song, sung, film, garter, sammy, jane, finally, near, times, end, musical, brown, marry, music
---
'n_Beetje
song, dutch, faithful, bit, contest, little, previous, pronunciation, po

In [37]:
# Let's first pull out the top 30 words for each song
words = []
for song in data.columns:
    top = [word for (word, count) in top_dict[song]]
    for t in top:
        words.append(t)
        
Counter(words).most_common()

[('song', 82),
 ('album', 59),
 ('play', 55),
 ('platinum', 53),
 ('place', 52),
 ('player', 51),
 ('pleasure', 49),
 ('pipe', 49),
 ('poco', 47),
 ('polish', 43),
 ('police', 43),
 ('pink', 42),
 ('pollution', 41),
 ('pop', 41),
 ('popular', 41),
 ('piggy', 39),
 ('pope', 32),
 ('written', 30),
 ('band', 29),
 ('pierrot', 27),
 ('previously', 27),
 ('previous', 26),
 ('abbey', 25),
 ('number', 24),
 ('music', 23),
 ('primarily', 22),
 ('pretty', 22),
 ('popularity', 22),
 ('pier', 21),
 ('prince', 21),
 ('version', 21),
 ('studio', 19),
 ('single', 18),
 ('track', 18),
 ('debut', 18),
 ('time', 16),
 ('rock', 16),
 ('positive', 15),
 ('piece', 15),
 ('present', 14),
 ('live', 13),
 ('prior', 13),
 ('composed', 12),
 ('chart', 12),
 ('featured', 12),
 ('love', 12),
 ('title', 12),
 ('prelude', 11),
 ('posse', 11),
 ('night', 10),
 ('included', 9),
 ('contest', 9),
 ('billboard', 9),
 ('tour', 9),
 ('later', 9),
 ('recording', 9),
 ('piano', 9),
 ('produced', 9),
 ('musical', 8),
 ('fil
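
A word that shows up in the top list of many songs says little about any single song. One way to turn a tally like the one above into a list of candidates for removal is to keep every word whose count exceeds a threshold. The sketch below uses a made-up `top_dict_example` and an arbitrary threshold. Both are assumptions, not values taken from this notebook.

```python
from collections import Counter

# Hypothetical top-word lists for three songs: (word, count in that song).
top_dict_example = {
    "song_a": [("song", 5), ("album", 3), ("rapper", 1)],
    "song_b": [("song", 2), ("jazz", 4), ("album", 1)],
    "song_c": [("song", 1), ("dutch", 2)],
}

# Count in how many songs each word appears among the top words.
song_frequency = Counter(
    word for top_words in top_dict_example.values() for word, _ in top_words
)

# Arbitrary cut-off for illustration: flag words found in more than half of the songs.
threshold = len(top_dict_example) / 2
candidates = sorted(w for w, n in song_frequency.items() if n > threshold)

print(candidates)
```

The right threshold depends on the corpus, so the flagged words are best reviewed by hand before they are removed.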

## After the first run
When running the program for the first time, I saw many Russian and Swedish words containing letters that are not part of the English alphabet. The first step was therefore to remove those letters: I used the ASCII letters as a reference and removed every non-ASCII letter from the whole corpus. This step removed nearly all foreign letters from the text and left only words made of ASCII symbols.
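
The cleaning code itself is not part of this section, so the following is only a sketch of how such a filter might look. The function names, the choice to keep whitespace, and the sample sentence are all assumptions made for illustration.

```python
import string

# Characters to keep: ASCII letters plus whitespace (an assumed cleaning rule).
ALLOWED = set(string.ascii_letters + string.whitespace)

def keep_ascii_letters(text):
    """Drop every character that is not an ASCII letter or whitespace."""
    return "".join(ch for ch in text if ch in ALLOWED)

def drop_non_ascii_words(text):
    """Alternative: drop whole words that contain any non-ASCII character."""
    return " ".join(tok for tok in text.split() if tok.isascii())

sample = "Björk sang про любовь in Reykjavík"
letters_removed = keep_ascii_letters(sample)
words_removed = drop_non_ascii_words(sample)

print(letters_removed.split())
print(words_removed.split())
```

Removing single letters leaves fragments of partly accented words in the text, while removing whole words discards them completely. Which behaviour is preferable depends on how many such words the corpus contains.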

## After the second run 
After the second run, I found that most of the words at the top of the list, such as "perry" and "eminem", are not actual English words. My goal was therefore to go back to the original data_clean program and add a function that removes these unnecessary words. The total word count went from over 2,500 words to nearly 1,300.
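
The function added to data_clean is not shown here. A minimal sketch of the idea, with a hypothetical word list, function name, and two made-up summaries, could look like this:

```python
# Hypothetical list of words to drop; the real list would come from
# inspecting the top words of the previous run.
UNWANTED_WORDS = {"perry", "eminem"}

def remove_unwanted_words(text, unwanted=UNWANTED_WORDS):
    """Remove whole words that appear in `unwanted` (case-insensitive)."""
    kept = [tok for tok in text.split() if tok.lower() not in unwanted]
    return " ".join(kept)

corpus = ["song by eminem from the album", "perry released the single"]
cleaned_corpus = [remove_unwanted_words(doc) for doc in corpus]

# Vocabulary before and after, to check the effect of the removal.
vocab_before = {tok for doc in corpus for tok in doc.split()}
vocab_after = {tok for doc in cleaned_corpus for tok in doc.split()}

print(cleaned_corpus)
print(len(vocab_before), len(vocab_after))
```

Comparing the vocabulary size before and after the removal gives a quick check that the function did what was intended. When the document-term matrix is rebuilt with `CountVectorizer`, passing the unwanted words as a list to its `stop_words` parameter is another option.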