# Lab Assignment 2 for CSE 7324 Fall 2017

___Members___: Hongning Yu, Hui Jiang, Hao Pan

## 1. Business Understanding
The dataset we use is a lyrics dataset (lyrics from MetroLyrics), which can be downloaded from Kaggle for free: https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics. By exploring this dataset, we hope to learn the key features of each song genre and to predict the genre of new songs.


In this dataset, there are 362237 records and 5 features (song name, year, artist, genre, and lyrics). It consists entirely of text documents, one per song. Because song genres can be predicted from the lyrics, it meets the requirements for Lab 2.


For this project, our main purpose is to find the characteristic features of different song genres by analyzing the most frequent words in their lyrics. Visualizing these features should reveal more about them, and may help us figure out the relationships among features, which might benefit our genre prediction as well.


The statistics and prediction results can be applied to applications for searching or displaying songs. For example, a song-search application, such as the one Siri may use when you ask her "What song is it?", can narrow the search scope by classifying songs according to lyric features. A song-displaying application could recommend songs by analyzing the lyrics of users' favorite songs.


To make sure our predictions are reliable, we will set a target for a prediction metric (accuracy or AUC), such as 80%, and measure it with standard evaluation functions. We will use other, more informative evaluation metrics and functions if needed.


## 2. Data Encoding
First let's load the data into a dataframe. The data is already in a csv file, but all of the lyrics are raw text in different formats. Our goal is to predict genre based on lyrics, so we still need to clean all the lyrics.

In [72]:
import pandas as pd
import nltk
import numpy as np
import string

pd.set_option('display.max_columns', 60)

In [2]:
df = pd.read_csv("./lyrics.csv", encoding="utf-8")
df.head()

Unnamed: 0,index,song,year,artist,genre,lyrics
0,0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."


### check null values in dataset.

In [3]:
df.isnull().sum()

index         0
song          2
year          0
artist        0
genre         0
lyrics    95680
dtype: int64

Looks like there are null values in lyrics and song. Just drop them.

In [4]:
df.dropna(inplace=True)
df.isnull().sum()

index     0
song      0
year      0
artist    0
genre     0
lyrics    0
dtype: int64

### check genre

In [5]:
df.genre.value_counts()

Rock             109235
Pop               40466
Hip-Hop           24850
Not Available     23941
Metal             23759
Country           14387
Jazz               7970
Electronic         7966
Other              5189
R&B                3401
Indie              3149
Folk               2243
Name: genre, dtype: int64

As we can see, some genres have way more records than others. For our genre-predicting classification problem, we could sample the dataset and choose subsets of some genres to avoid bias. But let's now keep it as it is and deal with this later.

Check the dataframe summary:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 266556 entries, 0 to 362236
Data columns (total 6 columns):
index     266556 non-null int64
song      266556 non-null object
year      266556 non-null int64
artist    266556 non-null object
genre     266556 non-null object
lyrics    266556 non-null object
dtypes: int64(2), object(4)
memory usage: 14.2+ MB


### 2.1 Read in data and check data quality

### Change to ASCII
First let's try to get rid of all non-ASCII characters, since we only want English characters.

**Takes too much time**

In [7]:
# %%time
# import re
# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = df.loc[row, 'lyrics'].encode('ascii', errors='ignore').decode()

# for row in df.index[:1000]:
#     df.loc[row, 'lyrics'] = re.sub(r'[^\x00-\x7f]',
#                                    r'', 
#                                    df.loc[row, 'lyrics']) 
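
The row-by-row loop above is one likely reason the cell is slow. As a minimal sketch (the toy `lyrics` Series below is an assumption standing in for `df['lyrics']`), the same ASCII filtering can be expressed with the vectorized pandas string accessors:

```python
import pandas as pd

# Toy stand-in for df['lyrics']; the real column holds full song lyrics.
lyrics = pd.Series(["caf\u00e9 del mar", "na\u00efve love", "plain ascii"])

# Encode to ASCII bytes, silently dropping characters that cannot be encoded,
# then decode the bytes back to ordinary strings.
ascii_lyrics = lyrics.str.encode("ascii", errors="ignore").str.decode("ascii")

print(ascii_lyrics.tolist())
```

Applied to the notebook, this would be a single assignment to `df['lyrics']` instead of a loop over `df.index`; we have not timed it on the full dataset.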

### English Filter
We want to focus on songs with English lyrics, so let's delete all non-English records if they exist.

I tried to build an English-ratio detector to eliminate all non-English songs.
Reference: https://github.com/rasbt/musicmood/blob/master/code/collect_data/data_collection.ipynb

However, the set computations inside the loop **take too much time**, so this approach needs improvement.

In [8]:
# %%time
# def eng_ratio(text):
#     ''' Returns the share of distinct words in a text that are not in the English vocabulary '''

#     english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 
#     text_vocab = set(w.lower() for w in text.split('-') if w.lower().isalpha()) 
#     unusual = text_vocab.difference(english_vocab)
#     diff = len(unusual)/(len(text_vocab)+1)
#     return diff

    
# # first let's eliminate non-english songs by their names
# before = df.shape[0]
# for row_id in range(100):
#     text = df.loc[row_id]['song']
#     diff = eng_ratio(text)
#     if diff >= 0.5:
#         df = df[df.index != row_id]
# after = df.shape[0]
# rem = before - after
# print('%s have been removed.' %rem)
# print('%s songs remain in the dataset.' %after)
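
One probable cause of the slowness is that the commented-out function rebuilds the whole English vocabulary on every call. The sketch below builds the vocabulary once and reuses it. `ENGLISH_VOCAB` is a hypothetical miniature word set used only for illustration; in the notebook it would be created once from `nltk.corpus.words.words()`.

```python
# Hypothetical miniature vocabulary, built ONCE outside the function.
ENGLISH_VOCAB = frozenset(
    {"then", "tell", "me", "you", "are", "my", "rock", "black", "culture"}
)

def non_english_ratio(title, vocab=ENGLISH_VOCAB):
    """Share of distinct alphabetic tokens in a hyphen-separated title missing from vocab."""
    tokens = {w.lower() for w in title.split("-") if w.isalpha()}
    unusual = tokens - vocab
    # The +1 avoids division by zero for empty titles, as in the original function.
    return len(unusual) / (len(tokens) + 1)

print(non_english_ratio("then-tell-me"))
print(non_english_ratio("amor-prohibido"))
```

With the vocabulary held in a set, each title costs only a handful of hash lookups.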

### English Filter Ver.2
This is another approach, which uses a package from https://github.com/saffsd/langid.py. This package detects the language considerably faster, but 260k records still take around 50 minutes.

In [9]:
# # package from https://github.com/saffsd/langid.py
# import langid

# before = df.shape[0]
# for row in df.index:
#     lang = langid.classify(df.loc[row]['lyrics'])[0]
#     if lang != 'en':
#         df = df[df.index != row]
# after = df.shape[0]

# rem = before - after
# print('%s have been removed.' %rem)
# print('%s songs remain in the dataset.' %after)

23693 have been removed.
242863 songs remain in the dataset.
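
The loop above re-creates the dataframe every time a row is removed. A minimal sketch of an alternative is to classify each row once and filter with a single boolean mask. `toy_classify` is a hypothetical stand-in for `langid.classify(text)[0]`, so that the example runs without the package:

```python
import pandas as pd

def toy_classify(text):
    """Hypothetical stand-in for langid.classify(text)[0]; returns a language code."""
    return "es" if "corazon" in text else "en"

df_demo = pd.DataFrame({"lyrics": ["love me tender", "mi corazon", "hello again"]})

# Classify every row once, then keep the English rows with one boolean mask.
is_english = df_demo["lyrics"].map(toy_classify) == "en"
df_english = df_demo[is_english]

print("{} have been removed.".format(len(df_demo) - len(df_english)))
print("{} songs remain in the dataset.".format(len(df_english)))
```

The language detection itself would still dominate the running time; the mask only removes the repeated dataframe copies.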


### save english songs to a new csv

In [10]:
# df.to_csv('lyrics_new.csv',index_label='index')

*****
### Re-read csv file as df
Now only English songs exist in our dataset.

In [73]:
df = pd.read_csv("./lyrics_new.csv", encoding="utf-8").drop('index.1', axis=1)
df.genre.value_counts()

Rock             102619
Pop               34919
Hip-Hop           23042
Metal             22249
Not Available     18654
Country           14307
Jazz               7498
Electronic         7374
Other              3951
R&B                3362
Indie              3010
Folk               1878
Name: genre, dtype: int64

### Resampling  df --> df_sample
The full dataset (about 243k records after filtering) easily runs out of memory, so I resampled it and drew an equal number of songs from each genre.

In [74]:
grouped = df.groupby('genre')
df_sample = grouped.apply(lambda x: x.sample(n=1800, random_state=7))

print("Size of dataframe: {}".format(df_sample.shape[0]))
      
df_sample.genre.value_counts()

Size of dataframe: 21600


Country          1800
Indie            1800
Hip-Hop          1800
Rock             1800
Folk             1800
Electronic       1800
Pop              1800
Other            1800
Metal            1800
Jazz             1800
Not Available    1800
R&B              1800
Name: genre, dtype: int64

In [75]:
# reset index means remove index (and change index to a column if not drop)
df_sample.reset_index(drop=True, inplace=True)
df_sample.head(10)

Unnamed: 0,index,song,year,artist,genre,lyrics
0,104901,it-s-great-to-be-single-again,2007,david-allan-coe,Country,No more dirty dishes in the sink when I come h...
1,216767,how-can-you-buy-killarney,2007,charlie-landsborough,Country,An American landed on Erin's green isle\nHe ga...
2,126582,sawing-on-the-strings,2007,alison-krauss,Country,Way back in the mountains\nWay back in the hil...
3,129927,i-don-t-believe-you-ve-met-my-baby,2006,dolly-parton,Country,"Last night, my tears they were fallin'\nI went..."
4,218507,please-don-t-hurry-your-heart,2008,caitlin-cary,Country,"Oh, when you're leaving for the hundredth time..."
5,80092,new-dug-grave,2007,gillian-welch,Country,I left home when I was twenty\nJust to see wha...
6,310224,no-memories-hangin-round,2015,bobby-bare,Country,You don't want no more heartaches\nAnd I don't...
7,68325,that-s-how-much-i-love-you,2014,eddy-arnold,Country,Well if I had a nickel I know what I would do\...
8,215612,i-m-fine-either-way,2007,bobby-pinson,Country,Come on\nMouth full of blood one eye swoll shu...
9,191238,c-mon,2014,amber-hayes,Country,"Hey, hey I'm lookin' at you\nBoy I gotta tell ..."


### Check the lyrics' quality

In [76]:
# check lyrics with length less than 100
less_than_100 = 0
for row in df_sample.index[:1000]:
    if len(df_sample.loc[row]['lyrics'])<=100:
        print(df_sample.loc[row]['lyrics'])
        less_than_100 += 1
print("\nNum of lyrics with length less than 100 in first 1000: {}".format(less_than_100))

instrumental
This track is an instrumental and has no lyrics.
guitars and cadilacs
hillbilly music
only thing that keeps me hanging on
instrumental
INSTRUMENTAL

Num of lyrics with length less than 100 in first 1000: 5


It looks like many songs don't have meaningful lyrics (instrumental music, or something went wrong during crawling).

So we simply drop all song records whose lyrics are 100 characters or fewer.

### df_sample --> df_clean

In [77]:
print("Deleting records with lyric length < 100")

len_before = df_sample.shape[0]

df_clean = df_sample.copy()

for row in df_clean.index:
    if len(df_clean.loc[row]['lyrics']) <= 100:
        df_clean.drop(row, inplace=True)

len_after = df_clean.shape[0]

print("Before: {}\nAfter : {}\nDeleted: {}".format(len_before, len_after, len_before-len_after))

Deleting records with lyric length < 100
Before: 21600
After : 20954
Deleted: 646
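
The same filter can also be written without a row loop. This is a small sketch on a toy dataframe (`df_demo` is an assumption, not the notebook's data) that keeps only lyrics longer than 100 characters:

```python
import pandas as pd

df_demo = pd.DataFrame({"lyrics": ["instrumental", "la " * 50, "x" * 100]})

# Keep only rows whose lyrics are longer than 100 characters,
# which matches dropping rows with len(lyrics) <= 100.
keep = df_demo["lyrics"].str.len() > 100
df_kept = df_demo[keep].copy()

print("Before: {}\nAfter : {}".format(len(df_demo), len(df_kept)))
```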


In [78]:
df_clean.genre.value_counts()

Country          1791
R&B              1788
Other            1783
Pop              1779
Hip-Hop          1771
Indie            1768
Jazz             1758
Rock             1756
Metal            1723
Not Available    1694
Electronic       1686
Folk             1657
Name: genre, dtype: int64

***
### transfer lyrics to list  
### df_clean --> x & y

In [79]:
x = df_clean['lyrics'].values
y = df_clean['genre'].values
print('Size of x: {}\nSize of y: {}'.format(x.size, y.size))

x = x.tolist()

x[1]

Size of x: 20954
Size of y: 20954


"An American landed on Erin's green isle\nHe gazed on killarny with a rapturous smile\nHow can I buy it he said to the guy\nI'll tell you how with a smile he replied\nHow can you buy all the stars in the sky\nHow can you buy two Blue Irish eyes\nWhen you can purchase a fine mothers heart\nThen you can buy killarny\nNature restore on her guilt's with a smile\nMe and Rose the shamrock and the barley\nWhen you can buy all those wonderful things\nThen you can buy killarny\nOver in Killarny, Many years ago,\nthere's a song my mother sang to me\nin a voice so sweet and low.\nJust a simple Irish ditty,\nIn her sweet ould fashion way,\nAnd I'd give the world if I could hear\nThat song of hers today.\nToo-ra-loo-ra-loo-ral,\nToo-ra-loo-ra-li,\nToo-ra-loo-ra-loo-ral,\nHush, now don't you cry!\nToo-ra-loo-ra-loo-ral,\nToo-ra-loo-ra-li,\nToo-ra-loo-ra-loo-ral,\nThat's an Irish lullaby."

### removing punctuation and \n

reference: https://stackoverflow.com/questions/13970203/how-to-count-average-sentence-length-in-words-from-a-text-file-contains-100-se

In [80]:
# def count_sentence_len(lyric):
#     """count average sentence len for a lyric"""
#     sents_list = lyric.split('\n')
#     avg_len = sum(len(x.split()) for x in sents_list) / len(sents_list)
#     return avg_len

# sentence_length_avg = []

x_clean = []

translator = str.maketrans('', '', string.punctuation)
for l in x:
    l = l.translate(translator)
#     sentence_len = count_sentence_len(l)
#     sentence_length_avg.append(sentence_len)
    l = l.replace('\n', ' ')
    
    x_clean.append(l)

In [81]:
# randomly print 5 lyrics
import random
for i in random.sample(range(len(x_clean)), 5):
    print(x_clean[i])
    print("=============================")

Verse 1 Another night stuck at home all alone Please help me out I want to roam PreChorus Cause living life through a window aint no life at all trapped in tiny room thats way to small its way to small Verse 2 I think about you from 9 to 3 Thats when theres nothing on TV PreChorus 1X Chorus I need someonewho cares yeah theyll be thereyeah theyll be there to help me outwithout a doubt yeah theyll be thereyeah theyll be there Verse 3 I used to be an ok guy now im a home fly So much to tell my rooms my cell PreChorus 1X Chorus1X Bridge Cause its so lame when no one knows your name and its so sad that my life is so bad Chorus 1X
Youll be on my side Somethings burning my way I would kill for you No way to trap your game I was so faithful I wanna tear your lies My fevers rising torn inside This is the noise I break the future Emotional I nearly died Ill strip away Im tired of violence This is the noise I break the future I can predict what drives you on Floating out to wonderland Be back to 

In [82]:
print(len(x_clean))

20954


### 2.2 Removing stop words
The nltk package has a built-in library of stop words. Here I build my own stop-word list based on sklearn's built-in stop-word list.

In [83]:
%%time
x_clean = [x.lower() for x in x_clean]

x_clean_new = []
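# import from the text module; the sklearn.feature_extraction.stop_words module is gone in newer releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS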
stop_words = list(ENGLISH_STOP_WORDS)
stop_words = stop_words + ['will', 'got', 'ill', 'im', 'let']

for text in x_clean:
    text = ' '.join([word for word in text.split() if word not in stop_words])
    x_clean_new.append(text)
    
x_clean = x_clean_new

CPU times: user 19.9 s, sys: 136 ms, total: 20 s
Wall time: 20.3 s
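
Much of the 20 seconds above probably comes from testing membership in a Python list, which scans the list for every word. A hedged sketch of the same filtering with a set (`remove_stop_words` is a helper name introduced here for illustration):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Same extra words as in the notebook; a set gives constant-time membership tests.
stop_set = set(ENGLISH_STOP_WORDS) | {"will", "got", "ill", "im", "let"}

def remove_stop_words(text, stop_words=stop_set):
    """Return text with every whitespace-separated stop word removed."""
    return " ".join(word for word in text.split() if word not in stop_words)

result = remove_stop_words("im gonna love you and ill let you know")
print(result)
```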


### 2.3 Bag-of-words representation

Here I used an English dictionary from https://github.com/eclarson/MachineLearningNotebooks/tree/master/data

In [84]:
with open('./ospd.txt', encoding='utf-8', errors='ignore') as f1:
    vocab1 = f1.read().split("\n")

print(len(vocab1))

79340


In [85]:
from sklearn.feature_extraction.text import CountVectorizer



# CountVectorizer can automatically change words into lower case
cv = CountVectorizer(stop_words='english',
                    encoding='utf-8',
                    lowercase=True,
                    vocabulary=vocab1)

bag_words = cv.fit_transform(x_clean)

print('Shape of bag words: {}'.format(bag_words.shape))
print("Length of Vocabulary: {}".format(len(cv.vocabulary_)))

Shape of bag words: (20954, 79340)
Length of Vocabulary: 79340


Let's create a pandas dataframe containing the bag-of-words (bow) model.

In [86]:
# columns follow the fixed vocabulary order; get_feature_names() no longer exists in scikit-learn >= 1.2
df_bow = pd.DataFrame(data=bag_words.toarray(), columns=vocab1)
df_bow.head()

Unnamed: 0,aa,aah,aahed,aahing,aahs,aal,aalii,aaliis,aals,aardvark,aardwolf,aargh,aarrgh,aarrghh,aas,aasvogel,aba,abaca,abacas,abaci,aback,abacus,abacuses,abaft,abaka,abakas,abalone,abalones,abamp,abampere,...,zygoid,zygoma,zygomas,zygomata,zygose,zygoses,zygosis,zygosity,zygote,zygotene,zygotes,zygotic,zymase,zymases,zyme,zymes,zymogen,zymogene,zymogens,zymogram,zymology,zymosan,zymosans,zymoses,zymosis,zymotic,zymurgy,zyzzyva,zyzzyvas,Unnamed: 61
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [87]:
%%time
word_freq = df_bow.sum().sort_values(ascending=False)

CPU times: user 44.1 s, sys: 6.87 s, total: 51 s
Wall time: 53.2 s
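
Summing a dense 20954 x 79340 dataframe is slow and memory hungry. As a sketch under the assumption that only the per-word totals are needed, the columns of the sparse matrix can be summed directly. The toy documents and the four-word vocabulary below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["love love baby", "baby come back", "love time"]
vocab = ["baby", "come", "love", "time"]

cv_demo = CountVectorizer(vocabulary=vocab)
counts = cv_demo.fit_transform(docs)   # scipy sparse matrix, shape (3, 4)

# Sum each column of the sparse matrix; the result is a 1 x n_terms matrix,
# flattened here into a plain 1-D array.
totals = np.asarray(counts.sum(axis=0)).ravel()
word_freq_demo = pd.Series(totals, index=vocab).sort_values(ascending=False)

print(word_freq_demo)
```

In the notebook the equivalent would sum `bag_words` and use `vocab1` as the index, without ever calling `toarray()`.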


In [88]:
word_freq[:30]

love      31574
like      27405
know      26740
just      24935
oh        19478
time      15466
baby      13818
want      13161
come      12529
cause     12482
way       12077
say       11882
make      11646
yeah      11150
life       9547
heart      9446
right      9295
feel       9116
away       9067
need       8847
day        8622
night      8189
tell       8186
man        8107
girl       7367
world      7097
good       6845
think      6827
theres     6812
little     6764
dtype: int64

### 2.4 Tf-idf representation

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english',
                             encoding='utf-8',
                             lowercase=True,
                             vocabulary=vocab1)

tfidf_mat = tfidf_vect.fit_transform(x_clean)

print('Shape of tf-idf matrix: {}'.format(tfidf_mat.shape))
print("Length of Vocabulary: {}".format(len(tfidf_vect.vocabulary_)))

Shape of tf-idf matrix: (20954, 79340)
Length of Vocabulary: 79340


In [None]:
# columns follow the fixed vocabulary order; get_feature_names() no longer exists in scikit-learn >= 1.2
df_tfidf = pd.DataFrame(data=tfidf_mat.toarray(), columns=vocab1)
df_tfidf.head()

Unnamed: 0,aa,aah,aahed,aahing,aahs,aal,aalii,aaliis,aals,aardvark,aardwolf,aargh,aarrgh,aarrghh,aas,aasvogel,aba,abaca,abacas,abaci,aback,abacus,abacuses,abaft,abaka,abakas,abalone,abalones,abamp,abampere,...,zygoid,zygoma,zygomas,zygomata,zygose,zygoses,zygosis,zygosity,zygote,zygotene,zygotes,zygotic,zymase,zymases,zyme,zymes,zymogen,zymogene,zymogens,zymogram,zymology,zymosan,zymosans,zymoses,zymosis,zymotic,zymurgy,zyzzyva,zyzzyvas,Unnamed: 61
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
%%time
word_score = df_tfidf.sum().sort_values(ascending=False)

In [None]:
word_score[:30]

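We can also calculate a similarity matrix (stored as `corr`), where the number in each position (i,j) represents the similarity between song i and song j. Because TfidfVectorizer L2-normalizes each row by default, this product is the cosine similarity of the two songs' tf-idf vectors rather than a statistical correlation.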

In [None]:
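# .toarray() is the explicit form of the .A shorthand, which newer SciPy versions no longer provide
corr = (tfidf_mat * tfidf_mat.T).toarray()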

In [None]:
corr.shape
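
To make the meaning of this matrix concrete, here is a small sketch on three made-up documents. It relies on the default `norm='l2'` of `TfidfVectorizer`, so each entry of the product is a cosine similarity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["love you baby", "love you baby", "metal thunder night"]
tfidf_demo = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized by default

# Row-by-row dot products of unit-length vectors are cosine similarities.
sim = (tfidf_demo @ tfidf_demo.T).toarray()

print(np.round(sim, 2))
```

Identical documents score 1, and documents that share no words score 0.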

## 3. Data Visualization
### 3.1 Summary

In [None]:
df_clean.head()

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline


plt.style.use('ggplot')
freq = pd.DataFrame(word_freq, columns = ['frequency'])
fig = freq[:20].plot(kind = 'barh', figsize=(9,8), fontsize=18)
# plt.legend('number of occurrences', loc = 'upper right')



plt.gca().invert_yaxis()
plt.title('words frequencies', fontsize=20)

As we can see in this bar chart, the most frequent words are "love", "like", "know" and so on. Among the top 20 words shown, the frequency of the top 4 words (love, like, know, just) is roughly triple that of the last 3 words, i.e. there is a considerable difference between the frequencies of different words. One more thing we notice is that there is an interjection in the list, "oh", and it is the fifth most frequent word. We hadn't realized artists used so many "oh"s in their lyrics!

In [None]:
score = pd.DataFrame(word_score, columns = ['Score'])
ax = score[:20].plot(kind = 'barh', figsize=(9,8), fontsize=18)
# pass the label as a list; a bare string would be split into one legend entry per character
plt.legend(['score'], loc = 'lower right', fontsize=15)
plt.gca().invert_yaxis()
plt.title('tf-idf score')

To figure out the most characteristic words for each genre, TF-IDF may be more appropriate (given that TF-IDF reflects how important a word is to a document). 
From the plot above, we can see that the top-scoring words differ markedly from those ranked by term frequency. 

We can also see that some words, like "al", "bo", "dor" and "la", have high TF-IDF scores. This may be because these words appear in only a few documents (songs), which makes them "special" and highlights them as important words for those documents.

TF-IDF analysis for each genre is needed.
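
One possible way to do this, sketched on four made-up documents, is to average the tf-idf score of every term within each genre and look at the largest values. The documents, genre labels, and variable names below are assumptions for illustration, not results from the dataset:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["truck road whiskey", "road truck home",
        "money street flow", "street money night"]
genres = ["Country", "Country", "Hip-Hop", "Hip-Hop"]

vect = TfidfVectorizer()
mat = vect.fit_transform(docs)

# Column labels come from the fitted vocabulary_ mapping (term -> column index).
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
df_demo = pd.DataFrame(mat.toarray(), columns=terms)

# Mean tf-idf score of every term within each genre.
genre_scores = df_demo.groupby(pd.Series(genres, name="genre")).mean()

for genre, row in genre_scores.iterrows():
    print(genre, row.nlargest(2).index.tolist())
```

On the real data the dense conversion would be expensive, so this is best tried on a subset of terms or songs first.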

In [None]:
# code example from https://www.kaggle.com/carrie1/drug-of-choice-by-genre-using-song-lyrics
df_clean['word_count'] = df_clean['lyrics'].str.split().str.len()
df_clean.info()
f, ax = plt.subplots(figsize=(10, 9))
sns.violinplot(x = df_clean.word_count)
plt.xlim(-100, 1000)
plt.title('Word count distribution', fontsize=26)

The violin plot shows the distribution of all songs by the number of words in their lyrics.

The figure shows that most songs have lyrics of 100 to 300 words. The median lyric length is around 200 words, and only a small share of songs have lyrics longer than 400 words.

This makes sense for real lyrics. After all, people may get tired of songs with too many words, and are less likely to fall in love with songs that have only a few words. 

The plot above covers all lyrics, without separating them by genre, so we still cannot see the characteristic length for each genre.

In [None]:
f, ax = plt.subplots(figsize=(10, 9))
sns.boxplot(x = "genre", y = "word_count", data = df_clean, palette = "Set1")
plt.ylim(1,2000)

To figure out the typical lyric length for each genre, we group the data by genre and draw a box plot for each one. 

According to the plot, the medians of most boxes are under 250 (around 200). Only the median for Hip-Hop is around 500, more than double that of the others. As for the maximum, Electronic, Rock and Hip-Hop have the 3 longest lyrics. There is no big difference in the minimum across genres.

In general, the 5 named genres with the longest lyrics are Hip-Hop, Pop, R&B, Electronic and Indie, and the 3 named genres with the shortest lyrics are Jazz, Metal and Country. It seems that up-tempo genres are more likely to have longer lyrics, and vice versa. But there are exceptions we need to pay attention to: Metal songs are up-tempo, yet they mostly have shorter lyrics than other up-tempo songs. Thus, lyric length can serve as a reference for genre classification but should not be the deciding metric.
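
The medians read off the box plot can also be computed directly. A minimal sketch on a toy dataframe (the rows are invented; in the notebook the same `groupby` would run on `df_clean`):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "genre": ["Hip-Hop", "Hip-Hop", "Jazz", "Jazz", "Jazz"],
    "lyrics": ["a b c d e f", "a b c d", "a b", "a b c", "a"],
})

# Same word-count definition as in the notebook: split on whitespace and count.
df_demo["word_count"] = df_demo["lyrics"].str.split().str.len()

# Median word count per genre, longest first.
median_words = (
    df_demo.groupby("genre")["word_count"].median().sort_values(ascending=False)
)
print(median_words)
```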

### distribution across time

In [None]:
mpl.rc("figure", figsize=(12,12))
sns.violinplot(x='genre', y='year', data=df_clean)

Looks like the distribution is biased with extreme values. So let's check outliers.

In [None]:
df_clean[df_clean['year'] <= 2000].shape[0]

Drop songs from 2000 or earlier and plot again.

In [None]:
for row in df_clean[df_clean['year'] <= 2000].index:
    df_clean.drop(row, inplace=True)

In [None]:
mpl.rc("figure", figsize=(15, 25))
sns.violinplot(x='year', y='genre', data=df_clean, inner="quartile")

We can see that the distributions are quite different. Country, Metal, Pop, R&B and Rock have a more concentrated distribution, with most songs created during 2005~2010. Other genres have a more stretched distribution. Other songs (songs not labeled with a genre) were mostly composed after 2012, probably because new songs don't have labels yet. 

Several genres had a big bang around 2006~2009. We wonder whether this distribution reflects reality or is just a crawling artifact.

### top artist

In [None]:
top_artist = df.artist.value_counts().head(8).index.tolist()

# df_clean['artist'].isin(top_artist)
# df_clean.loc[df_clean['artist'] in]

df_top_artist = df_clean.loc[df_clean['artist'].isin(top_artist), :]
df_top_artist.head()

In [None]:
df_top_artist.info()

In [None]:
mpl.rc("figure", figsize=(25, 15))
sns.violinplot(x='artist', y='year', data=df_top_artist, inner="quartile")
sns.set(font_scale=3)

For the top 8 artists, we plot this figure to explore their most productive years.
For the artists eddy-arnold, dolly-parton, eminem, barba-streisan and bee-gees, most of their songs were composed during 2005~2010. For cris-crown and bob-dylan, it seems they kept creating for a long time. However, bob-dylan's works appear to be somewhat "ahead of time", which may be due to incorrectly entered information.

In [None]:
print(df_bow.shape)
print(len(y))

### length of songs

In [None]:
df_bow['length'] = df_bow.sum(axis=1)

In [None]:
# the dataframe now gets two extra columns ('length' was added in the previous cell): 
# @ length: length of documents based on the bag-of-words model
# @ genre: genre of the record

df_bow['genre'] = pd.Series(y).values

In [None]:
mpl.rc("figure", figsize=(25, 15))
sns.violinplot(x='length', y='genre', data=df_bow, inner="quartile")
sns.set(font_scale=3)

This is another way to calculate lyric length, based on the bag-of-words model. The violin plot of lyric length for each genre corresponds to the box plot above.

Next we want to check the top 10 frequent words of each genre.

In [None]:
genre_count = df_bow.groupby('genre').sum()
genre_count.drop('length', axis=1, inplace=True)
genre_count.head()

In [None]:
genre_count_new = genre_count.transpose()
genre_list = df_clean.genre.unique().tolist()

In [None]:
for genre in genre_list:
    t = genre_count_new.nlargest(10, genre, keep='first')[genre]
    
    fig = plt.figure(figsize=(6,4))
    fig.suptitle(genre, fontsize=20)
    plt.xticks(rotation='vertical')
    sns.barplot(x=t.values, y=t.index, alpha=0.8)  # keyword arguments work across seaborn versions
sns.set(font_scale=3)

In the bar charts above, we list the top 10 most frequent words for each genre. Different genres have many of their top-10 words in common. This information can also be visualized as word clouds in Part 4.

From those bar charts, it is pretty clear that 'love' is something almost every type of music cares about. The genres also share other words, such as 'know', 'time' and 'oh', and many of those words are verbs.
It looks like hip-hop music has quite a different set of frequent words, distinct from other genres.

## 4. Word Cloud
Now it is 'wordcloud' time.
A word cloud is a visual representation of text data, and it is a very efficient way to present word frequencies.

First let's try to draw the overall word cloud based on term frequency.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
plt.style.use('ggplot')

# join all cleaned lyrics into one string, separated by spaces
all_lyrics = ' '.join(x_clean)

In [None]:
# code example from https://amueller.github.io/word_cloud/index.html
wordcloud = WordCloud(max_font_size=60).generate(all_lyrics)
import matplotlib.pyplot as plt
plt.figure(figsize=(15,15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

We can clearly see that the most frequently used word overall is 'love', followed by 'got'.

In [None]:
word_freq[:30]

As we can see, the word cloud presents word frequency in a visual way.

Let's try plotting word clouds for different genres.

In [None]:
d = {'genre': y.tolist(), "lyric": x_clean}
df_plot = pd.DataFrame(d)
df_plot.head(10)

Now let's separate those lyrics into different genres.

In [None]:
# create a dictionary and store all lyrics basing on their genre
lyrics = {}
for genre in df_plot.genre.unique().tolist():
    lyrics[genre] = ' '
    for row in (df_plot[df_plot['genre'] == genre].index):
        lyrics[genre] = lyrics[genre] + ' ' + df_plot.loc[row, 'lyric']
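
The nested loop above builds each genre's text by repeated string concatenation. A shorter sketch of the same idea uses `groupby` with a join. The toy `df_demo` mirrors the `genre` and `lyric` columns of `df_plot`, and `lyrics_by_genre` is a name introduced here:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "genre": ["Pop", "Jazz", "Pop"],
    "lyric": ["baby love", "blue heart", "dance night"],
})

# One space-joined string per genre, collected into a plain dict.
lyrics_by_genre = df_demo.groupby("genre")["lyric"].agg(" ".join).to_dict()

print(lyrics_by_genre)
```

The resulting dict can be iterated with `.items()` exactly like the `lyrics` dictionary in the notebook.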

In [None]:
for genre, lyric in lyrics.items():
    wordcloud = WordCloud(max_font_size=60).generate(lyric)
    
    fig = plt.figure(figsize=(10,8))
    fig.suptitle(genre, fontsize=24)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout()

In those word clouds, 'love' is nearly always the most frequent word in each genre. Words such as 'life', 'know' and 'time' are frequently used as well.
But there are also some differences among the genres. For example, in jazz the word 'heart' is used more than in other genres,
and there are more dirty words in hip-hop, which makes sense.
After exploring all the lyrics, we can conclude that most lyrics have some words in common, but, depending on the genre, they also have distinctive words. Based on these results, we can attempt genre prediction in the future.

# References
Raschka, S. (2015). Python machine learning. Packt Publishing Ltd.

https://www.kaggle.com/carrie1/drug-of-choice-by-genre-using-song-lyrics

https://github.com/eclarson/MachineLearningNotebooks/tree/master/data

https://amueller.github.io/word_cloud/index.html