# Text Mining of Disney Billboard Top 100 songs using NLTK
The objective of this task is to determine the words most frequently used in the lyrics of Disney animation songs that have charted on the Billboard Top 100 chart.

To achieve this, we will go through the following steps:

 - Import the required libraries
 - Load the CSV file
 - Preprocess the data
 - Text Mining
   - Text Normalization
     - Tokenization
   - Story Generation and Visualization from lyrics
     - Understanding the common words used in the lyrics: WordCloud
   - Bag-of-Words Features
   - TF-IDF
   - Word2Vec Embeddings
   - Prepare Vector for Lyrics

In [1]:
import nltk
import pandas as pd
import numpy as np
import string
import warnings
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('disney_songs.csv', encoding='cp1252')

In [3]:
df.head()

Unnamed: 0,No.,Song,Movie,Year,Lyrics,Highest Rank
0,1.0,We Don’t Talk About Bruno,Encanto,2022.0,"We don't talk about Bruno, no, no, no\nWe don'...",1.0
1,2.0,Can You Feel the Love Tonight,The Lion King,1994.0,I can see what's happening\nWhat\nAnd they don...,4.0
2,3.0,A Whole New World,Aladdin,1993.0,"I can show you the world\nShining, shimmering,...",1.0
3,4.0,Let It Go,Frozen,2014.0,The snow glows white on the mountain tonight\n...,5.0
4,5.0,Colors of the Wind,Pocahontas,1995.0,You think you own whatever land you land on\nT...,4.0


- The lyrics contain special characters, punctuation, and numbers, none of which are needed for this analysis, so they were removed.

- Words shorter than four letters were also removed from the lyrics.

In [10]:
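# Replace newlines, digits and the listed punctuation with a space.
# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default.
df['tidy_lyrics'] = df['Lyrics'].str.replace(r"[\n:.?<_&=>0-9;,\-()]", " ", regex=True)
# Keep only words longer than three characters.
df['tidy_lyrics'] = df['tidy_lyrics'].apply(lambda x: " ".join([w for w in str(x).split() if len(w) > 3]))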
df.head(10)

Unnamed: 0,No.,Song,Movie,Year,Lyrics,Highest Rank,tidy_lyrics
0,1.0,We Don’t Talk About Bruno,Encanto,2022.0,"We don't talk about Bruno, no, no, no\nWe don'...",1.0,don't talk about Bruno don't talk about Bruno ...
1,2.0,Can You Feel the Love Tonight,The Lion King,1994.0,I can see what's happening\nWhat\nAnd they don...,4.0,what's happening What they don't have clue The...
2,3.0,A Whole New World,Aladdin,1993.0,"I can show you the world\nShining, shimmering,...",1.0,show world Shining shimmering splendid Tell pr...
3,4.0,Let It Go,Frozen,2014.0,The snow glows white on the mountain tonight\n...,5.0,snow glows white mountain tonight footprint se...
4,5.0,Colors of the Wind,Pocahontas,1995.0,You think you own whatever land you land on\nT...,4.0,think whatever land land earth just dead thing...
5,6.0,Beauty and the Beast,Beauty and the Beast,1992.0,Tale as old as time\nTrue as it can be\nBarely...,9.0,Tale time True Barely even friends Then somebo...
6,7.0,Surface Pressure,Encanto,2022.0,"I’m the strong one, I’m not nervous\nI’m as to...",8.0,strong nervous tough crust Earth move mountain...
7,8.0,Circle of Life,The Lion King,1994.0,Nants ingonyama bagithi baba\nSithi uhm ingony...,18.0,Nants ingonyama bagithi baba Sithi ingonyama N...
8,9.0,Go the Distance,Hercules,1997.0,I have often dreamed of a far off place\nWhere...,24.0,have often dreamed place Where hero's welcome ...
9,10.0,You’ll Be in My Heart,Tarzan,1999.0,Come stop your crying\nIt will be alright\nJus...,21.0,Come stop your crying will alright Just take h...
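
To make the two cleaning steps concrete, here is a minimal sketch that applies them to a single string with the standard `re` module instead of pandas. The helper name `tidy` and its `min_len` parameter are illustrative and not part of the notebook.

```python
import re

# Same character class as in the pandas call above: newlines, digits
# and a handful of punctuation marks are each replaced by a space.
PUNCT_RE = re.compile(r"[\n:.?<_&=>0-9;,\-()]")

def tidy(text, min_len=4):
    """Hypothetical helper: strip unwanted characters, then drop short words."""
    no_punct = PUNCT_RE.sub(" ", text)
    return " ".join(w for w in no_punct.split() if len(w) >= min_len)

sample = "We don't talk about Bruno, no, no, no\nWe don't talk about Bruno"
print(tidy(sample))
```

The apostrophe is not in the character class, so contractions such as `don't` survive intact, while short words such as `We` and `no` are dropped by the length filter.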


### Text Normalization
We use the __nltk__ `PorterStemmer` class to normalize the lyrics. Before stemming, however, the lyrics have to be tokenized. __Tokenization__ is the first step in text analytics: it is the process of breaking a piece of text down into smaller chunks such as words or sentences. A token is a single unit that serves as a building block of a sentence or paragraph.

In [11]:
#Word tokenization
tokenized_lyrics = df['tidy_lyrics'].apply(lambda x: x.split())
tokenized_lyrics.head()

0    [don't, talk, about, Bruno, don't, talk, about...
1    [what's, happening, What, they, don't, have, c...
2    [show, world, Shining, shimmering, splendid, T...
3    [snow, glows, white, mountain, tonight, footpr...
4    [think, whatever, land, land, earth, just, dea...
Name: tidy_lyrics, dtype: object
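
Tokens can be words or sentences. The sketch below shows both granularities using only the standard library. The sample text and both regular expressions are simplifying assumptions for illustration, and they are far cruder than NLTK's own tokenizers.

```python
import re

# Illustrative sample text, not taken verbatim from the dataset.
text = "I can show you the world. Shining, shimmering, splendid! Tell me, princess."

# Sentence tokens: split after ., ! or ? when followed by whitespace (a rough heuristic).
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokens: runs of letters, optionally with an internal apostrophe (e.g. "don't").
words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

print(sentences)
print(words)
```

In the notebook itself a plain `x.split()` is sufficient, because punctuation was already stripped during preprocessing and only whitespace separates the remaining words.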

In [12]:
# We can now normalize the tokenized lyrics by stemming each token
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenized_lyrics = tokenized_lyrics.apply(lambda x: [stemmer.stem(i) for i in x])

In [13]:
# Now let's stitch these tokens back together into one string per song.
# A plain ' '.join is enough here, because the tokens came from a whitespace split.
tokenized_lyrics = tokenized_lyrics.apply(' '.join)
df['tidy_lyrics'] = tokenized_lyrics

In [14]:
df.head()

Unnamed: 0,No.,Song,Movie,Year,Lyrics,Highest Rank,tidy_lyrics
0,1.0,We Don’t Talk About Bruno,Encanto,2022.0,"We don't talk about Bruno, no, no, no\nWe don'...",1.0,don't talk about bruno don't talk about bruno ...
1,2.0,Can You Feel the Love Tonight,The Lion King,1994.0,I can see what's happening\nWhat\nAnd they don...,4.0,what' happen what they don't have clue they'll...
2,3.0,A Whole New World,Aladdin,1993.0,"I can show you the world\nShining, shimmering,...",1.0,show world shine shimmer splendid tell princes...
3,4.0,Let It Go,Frozen,2014.0,The snow glows white on the mountain tonight\n...,5.0,snow glow white mountain tonight footprint see...
4,5.0,Colors of the Wind,Pocahontas,1995.0,You think you own whatever land you land on\nT...,4.0,think whatev land land earth just dead thing c...
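
To see what stemming does at the word level, the sketch below runs `PorterStemmer` on a few words taken from the lyrics above. It assumes NLTK is installed. The stemmer itself needs no corpus download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words picked from the tidy_lyrics column shown earlier.
words = ["happening", "Shining", "shimmering", "glows", "whatever", "Bruno"]

for word in words:
    # stem() lowercases its input by default and strips common suffixes.
    print(f"{word:>10} -> {stemmer.stem(word)}")
```

The resulting stems are not always dictionary words (`whatever` becomes `whatev`). That is expected with the Porter algorithm, which applies suffix-stripping rules rather than looking words up in a vocabulary.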
