# Topic Modeling with Red Hot Chili Peppers Lyrics

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Reading in RHCP Song Data

In [3]:
rhcp_song_data = pd.read_excel('rhcp_songs.xlsx')
rhcp_song_data

|  | Album | Song | Lyrics |
|---|---|---|---|
| 0 | I'm With You | Monarchy of Roses | The crimson tide is flowing through your finge... |
| 1 | I'm With You | Factory of Faith | All my life I was swinging for the fence,\nI w... |
| 2 | I'm With You | Brendan's Death Song | If I die before I get it done\nWill you decide... |
| 3 | I'm With You | Ethiopa | We're rolling everybody, it starts with bass"\... |
| 4 | I'm With You | Annie Wants A Baby | Lucy Rebar, she's a friend of mine\nLater she'... |
| 5 | I'm With You | Look Around | Stiff club, it's my nature,\nCustom love is th... |
| 6 | I'm With You | The Adventures of Raindance Maggie | Lipstick junkie\nDebunked the all in one\nShe ... |
| 7 | I'm With You | Did I Let You Know | m comin' for you 'cause I adore you and\nI'd l... |
| 8 | I'm With You | Goodbye Hooray | Wooh, wooh, ha\nJunior paints that old cafe\nH... |


In [36]:
songs = []
for song in rhcp_song_data['Song']:  # iterate over the column instead of hard-coding range(71)
    print(song)
    songs.append(song)
songs

Monarchy of Roses
Factory of Faith
Brendan's Death Song
Ethiopa
Annie Wants A Baby
Look Around
The Adventures of Raindance Maggie
Did I Let You Know
Goodbye Hooray
Happiness Loves Company
Police Station
Even You Brutus?
Meet Me At The Corner
Dance, Dance, Dance
By The Way
Universally Speaking
This Is The Place
Dosed
Don't Forget Me
This Zephyr Place
Can't Stop
I Could Die For You
Midnight
Throw Away Your Television
Cabron
Tear
On Mercury
Minor Thing
Warm Tape
Venice Queen
Bicycle Song
Runaway
The Getaway
Dark Necessities
We turn red
The Longest Wave
Goodbye Angels
Sick Love
Go Robot
Feasting on flowers
Detroit
This is ticonderoga
Encore
The hunter
Dreams of a samurai
Dani California
Snow
Charlie
Stadium Arcadium
Hump De Bump
She's Only 18
Slow Cheetah
Torture Me
Strip My Mind
Especially in Michigan
Warlocks
C'mon Girl
Get On Top
All Around The World
Scar Tissue
Otherside
Californication
Easily
Porcelain
Emit Remmus
This Velvet Glove
Savior
Purple Stain
Road Trippin'
Fat Dance
Right on 

['Monarchy of Roses',
 'Factory of Faith',
 "Brendan's Death Song",
 'Ethiopa',
 'Annie Wants A Baby',
 'Look Around',
 'The Adventures of Raindance Maggie',
 'Did I Let You Know',
 'Goodbye Hooray',
 'Happiness Loves Company',
 'Police Station',
 'Even You Brutus?',
 'Meet Me At The Corner',
 'Dance, Dance, Dance',
 'By The Way',
 'Universally Speaking',
 'This Is The Place',
 'Dosed',
 "Don't Forget Me",
 'This Zephyr Place',
 "Can't Stop",
 'I Could Die For You',
 'Midnight',
 'Throw Away Your Television',
 'Cabron',
 'Tear',
 'On Mercury',
 'Minor Thing',
 'Warm Tape',
 'Venice Queen',
 'Bicycle Song',
 'Runaway',
 'The Getaway',
 'Dark Necessities',
 'We turn red',
 'The Longest Wave',
 'Goodbye Angels',
 'Sick Love',
 'Go Robot',
 'Feasting on flowers',
 'Detroit',
 'This is ticonderoga',
 'Encore',
 'The hunter',
 'Dreams of a samurai',
 'Dani California',
 'Snow',
 'Charlie',
 'Stadium Arcadium',
 'Hump De Bump',
 "She's Only 18",
 'Slow Cheetah',
 'Torture Me',
 'Strip My Mind

In [41]:
line_list = []
songs = []
albums = []
for i in range(len(rhcp_song_data)):
    lines = rhcp_song_data.loc[i, 'Lyrics'].split('\n')
    for line in lines: 
        line_list.append(line)
        songs.append(rhcp_song_data['Song'].loc[i])
        albums.append(rhcp_song_data['Album'].loc[i])
        
rhcp_song_lines = pd.DataFrame({'Album': albums, 'Song': songs, 'Line': line_list})
rhcp_song_lines

|  | Album | Line | Song |
|---|---|---|---|
| 0 | I'm With You | The crimson tide is flowing through your finge... | Monarchy of Roses |
| 1 | I'm With You | The promise of a clean regime are promises we ... | Monarchy of Roses |
| 2 | I'm With You | Do you like it rough I ask and are you up to t... | Monarchy of Roses |
| 3 | I'm With You | The catacombs of bet and bone where cultures c... | Monarchy of Roses |
| 4 | I'm With You | Several of my best friends wear, | Monarchy of Roses |
| 5 | I'm With You | The colors of the crown | Monarchy of Roses |
| 6 | I'm With You | And Mary wants to fill it up, | Monarchy of Roses |
| 7 | I'm With You | And Sherry wants to tear it all back down, girl | Monarchy of Roses |
| 8 | I'm With You | The savior of your light | Monarchy of Roses |
| 9 | I'm With You | The monarchy of roses | Monarchy of Roses |
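
The loop above builds one row per lyric line. As a rough sketch of the same reshaping in vectorized pandas, the toy frame below (hypothetical albums, songs, and lyrics, not the real data) is split on newlines and expanded with `explode`:

```python
import pandas as pd

# Toy stand-in for rhcp_song_data; the values are made up for illustration.
toy = pd.DataFrame({
    "Album": ["Album A", "Album A"],
    "Song": ["Song 1", "Song 2"],
    "Lyrics": ["first line\nsecond line", "only line"],
})

toy_lines = (
    toy.assign(Line=toy["Lyrics"].str.split("\n"))  # list of lines per song
       .explode("Line")                             # one row per line
       .drop(columns="Lyrics")
       .reset_index(drop=True)
)
print(toy_lines)
```

Each song's album and title are repeated for every one of its lines, which matches the shape of `rhcp_song_lines`.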


## Preprocessing the Raw Text Data

In [4]:
import re # regex library
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text) # removes HTML markup tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) # finds emoticons such as :) or ;-D
    # lowercases the text, collapses runs of non-word characters into single spaces, and appends the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = re.sub(r"[^A-Za-z]", " ", text) # replaces every non-letter character (digits, emoticon punctuation) with a space
    return text

In [42]:
rhcp_song_lines['Processed Lyrics'] = rhcp_song_lines['Line'].apply(preprocessor)
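
To see what `preprocessor` does to a single line, the sketch below repeats the function (with raw-string patterns) and runs it on a made-up input; the sample text is an assumption, not a line from the dataset.

```python
import re

def preprocessor(text):
    # Copy of the notebook's function, written with raw-string regex patterns.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = re.sub(r"[^A-Za-z]", " ", text)
    return text

sample = "Don't stop, <b>we'll</b> dance 24/7 :)"  # hypothetical input
tokens = preprocessor(sample).split()
print(tokens)  # ['don', 't', 'stop', 'we', 'll', 'dance']
```

Contractions are split into fragments such as `don` and `ll`, which is likely why those fragments are added to the stop-word list later in the notebook.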

## Running Latent Dirichlet Allocation with a Count Vectorizer

**Latent Dirichlet Allocation** is an unsupervised algorithm for topic modeling that can be used to generate topics from any **corpus**. Here are some basic terms and concepts behind LDA:
- A **document** is defined as a sequence of words. In this case, a document is a single line of lyrics.
- A **corpus** is a collection of documents, which in this case is the collection of all lyric lines across the songs.
- A **topic** is characterized by a particular probability distribution over different words. For example, in the topic of **soccer** certain words such as **free kick** and **offside** are more likely to appear than others.
- A document contains a **mixture of different topics** with some topics being more relevant or prominent than others.
- LDA uses the **Dirichlet distribution** as a prior over both the mixture of topics within each document and the distribution of words within each topic.
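
The terms above can be made concrete with a small simulation of the generative story that LDA assumes. This is only a sketch: the vocabulary, the two topics, and the `doc_alpha` values are invented for illustration, and the topic-word distributions are fixed by hand, whereas full LDA would also draw them from a Dirichlet.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw is a set of independent gamma draws, normalized to sum to 1.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    theta = sample_dirichlet(doc_alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]      # pick a word from that topic
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["goal", "offside", "guitar", "chorus"]
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # a soccer-like topic
    [0.05, 0.05, 0.45, 0.45],  # a music-like topic
]
theta, words = generate_document(topic_word, vocab, doc_alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture.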

LDA not only extracts the major topics from a corpus; the fitted model can also infer which of those topics are most prevalent in any given document, even one that was not part of the original corpus.
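
A minimal sketch of inference on an unseen document, assuming scikit-learn is installed. The toy corpus and the names `toy_corpus`, `toy_lda`, and `unseen` are made up for illustration and are separate from the lyrics model fitted below.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical four-document corpus with two rough themes.
toy_corpus = [
    "sun ocean wave sand sun",
    "ocean wave surf sun",
    "city night street light",
    "night street car city light",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# The unseen document must go through the same fitted vectorizer.
unseen = ["surf the ocean at night"]
topic_mix = toy_lda.transform(vectorizer.transform(unseen))
print(topic_mix.round(3))  # one row, one weight per topic
```

Words the vectorizer never saw during fitting (here "the" and "at") are simply ignored when the new document is transformed.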

In [116]:
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS = STOP_WORDS.union(set(['girl', 'll', 'don', 'let', 'gon']))

In [121]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

num_topics = 5
num_features = 5000

print('Running CountVectorizer...')
# stop_words is passed as a list, which recent scikit-learn releases expect
tf_vectorizer = CountVectorizer(max_df=0.90, min_df=10, ngram_range=(1,1), max_features=num_features, stop_words=list(STOP_WORDS))
tf = tf_vectorizer.fit_transform(rhcp_song_lines['Processed Lyrics'])
tf_feature_names = tf_vectorizer.get_feature_names_out() # replaces the older get_feature_names()

print('Running Latent Dirichlet Allocation...')
lda = LatentDirichletAllocation(n_components=num_topics, # this parameter was formerly named n_topics
                                max_iter=5, 
                                learning_method='online', 
                                learning_offset=100.,
                                random_state=0).fit(tf)

Running CountVectorizer...
Running Latent Dirichlet Allocation...
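
The `min_df=10` setting above drops any word that appears in fewer than 10 lyric lines. The effect is easier to see on a tiny example; the four lines below are hypothetical, and `min_df=2` is used only so the filtering is visible at this scale.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy lines, not taken from the dataset.
toy_lines = [
    "give it away",
    "give it away now",
    "under the bridge",
    "away we go",
]
vec = CountVectorizer(min_df=2)  # keep words found in at least 2 documents
vec.fit(toy_lines)
kept = sorted(vec.vocabulary_)
print(kept)  # ['away', 'give', 'it']
```

`max_df=0.90` works from the other side, discarding words that occur in more than 90% of the documents.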


In [123]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

num_top_words=10
display_topics(lda, tf_feature_names, num_top_words)

Topic 0:
right world ayo coming friend mirror light think cabron fine
Topic 1:
like want love said look inside know told day sun
Topic 2:
oh away life long night dream baby californication feel wanna
Topic 3:
know time got come ve tell little mind good sure
Topic 4:
dance yeah phfat cause way find hey share need lonely
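
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is compact but easy to misread. A small sketch with invented weights and words shows that it returns the indices of the largest weights, highest first:

```python
import numpy as np

# Made-up weights for a four-word vocabulary.
topic = np.array([0.1, 3.0, 0.5, 2.0])
feature_names = ["sun", "love", "night", "dance"]
no_top_words = 2

order = topic.argsort()              # indices sorted by ascending weight: [0, 2, 3, 1]
top = order[:-no_top_words - 1:-1]   # walk backwards from the end: [1, 3]
top_words = [feature_names[i] for i in top]
print(top_words)  # ['love', 'dance']
```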


In [124]:
lda.perplexity(tf)

131.7363488835548
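
Perplexity is the exponential of the negative log-likelihood per word, so lower values mean the model is less "surprised" by the text. scikit-learn's `perplexity` reports an approximation based on a variational bound; the sketch below shows only the underlying definition, using made-up per-word probabilities rather than anything from the fitted model.

```python
import math

def perplexity(word_probs):
    # exp of the negative average log-probability assigned to each word
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))

# A model that spreads probability evenly over 4 words has perplexity 4.
uniform = perplexity([0.25] * 8)
# A model that assigns high probability to each observed word scores lower.
confident = perplexity([0.9] * 8)
print(uniform, confident)
```

Note that the value above is computed on `tf`, the same data the model was fitted on.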

In [133]:
from sklearn.model_selection import train_test_split
X = rhcp_song_lines['Line']
y = rhcp_song_lines['Album']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=100)

In [146]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('tf', TfidfVectorizer(preprocessor=preprocessor, 
                                            ngram_range=(1,2),
                                            stop_words='english')), 
                    ('mlp', MLPClassifier(hidden_layer_sizes=[1000,1000,1000]))])

In [None]:
pipeline.fit(X_train, y_train)

In [145]:
from sklearn.metrics import accuracy_score
pred = pipeline.predict(X_test)
print(accuracy_score(y_test, pred))

0.5835616438356165


## Topic Modeling with LDA2VEC