In [1]:
import pandas as pd
import pickle
import scipy.sparse
from gensim import matutils, models
from nltk import word_tokenize, pos_tag

# Topic Modeling

This is perhaps the most analysis from all notebooks - but also one of the more difficult to achieve some nice result. Let's try it!

We start by loading our Document Term Matrix and cleaned text.

In [2]:
data = pd.read_pickle('dtm_stop.pkl')
data_cleaned = pd.read_pickle('data_clean.pkl')
term_doc_matrix = data.transpose()
term_doc_matrix.head()

Unnamed: 0,2Pac,Cardi B,Eminem,J. Cole,Joyner Lucas,Juice WRLD,Kanye West,Lil Pump,Logic,Mac Miller,Nas,Nicki Minaj,Notorious B.I.G.
aa,0,0,1,0,0,0,1,0,1,0,0,0,0
aaaaaaaaaa,0,0,0,0,0,0,0,0,0,0,0,1,0
aaaaaaack,0,0,1,0,0,0,0,0,0,0,0,0,0
aaaaah,0,0,0,0,0,0,0,0,0,0,0,1,0
aaaaahhh,0,0,0,0,0,0,1,0,0,0,0,0,0


We need to do some mino preprocessing again. We will remove the stop words and only take nouns, adjectives and verbs into account. If you want to do your topic modeling, you can always play around with your own choice. Maybe only nouns and adjectives achieve a better result?

In [3]:
def nouns_adj_verbs(text):
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ' or pos[:2] == 'VV'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)

In [4]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

add_stop_words = ['im', 'got', 'like',
                 'dont', 'know', 'just',
                 'fuck', 'shit', 'yeah',
                 'aint', 'thats', 'make',
                 'bitch', 'love', 'wanna', 
                 'cause', 'n*ggas', 'n*gga', 
                 'time', 'em', 'man', 
                  'want', 'let', 'come']

stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

cv = CountVectorizer(stop_words=stop_words)

In [5]:
data_nouns_adj_verbs = pd.DataFrame(data_cleaned.lyrics.apply(nouns_adj_verbs))
data_nouns_adj_verbs

Unnamed: 0,lyrics
2Pac,aint nothin gangsta party eh light ahh nothin ...
Cardi B,whores house whores house whores house whores ...
Eminem,yeah i guess huh obvious eye eye funny much i ...
J. Cole,work growth famous important anything anything...
Joyner Lucas,fall fall i more i i werent cant picture someo...
Juice WRLD,nahnahnahnahnahnah smoke cigarettes cancer che...
Kanye West,hour hour power minute minute lord second seco...
Lil Pump,lyrics first snippet elliot dinner brr man ben...
Logic,lyrics song please song welcome pressure progr...
Mac Miller,youre young much matters something night dream...


Even with only these three kinds of word and without stopwords the text still does make some sense to us humans. Hopefully the same goes for the machine! But let's create a DTM again of our intermediate result.

In [6]:
cv = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cv = cv.fit_transform(data_nouns_adj_verbs.lyrics)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = data_nouns_adj_verbs.index
data_dtm

Unnamed: 0,aa,aaaaaaack,aaaaahhh,aaaaayyyyooooo,aaaah,aaaahh,aaaand,aaahhh,aaand,aaass,...,世界中で聴いてる,帰っていただいて結構,彼の行動が気になって仕方ないはず,感謝しています,最高だったでしょう,本当はロジックを愛してやまないんでしょう,楽しんでいただけたことを願っています,毎日,私たちは共に歴史を刻んできた,耳を塞ぐか
2Pac,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Cardi B,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Eminem,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
J. Cole,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Joyner Lucas,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Juice WRLD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Kanye West,0,0,1,0,3,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Lil Pump,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Logic,1,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
Mac Miller,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we can perform the topic modeling algorithm. We will use [Laten Dirichlect Allocation](https://de.wikipedia.org/wiki/Latent_Dirichlet_Allocation) (LDA). I recommend you to read [this excellent article](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/) on how LDA works.

We will give an overview for each rapper, which of the words have the highest probability (eahc word individually in a topic) to appear together with other words. We set the boundary for number of topics to 5.

In [1]:
corpus = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtm.transpose()))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

NameError: name 'matutils' is not defined

In [8]:
lda = models.LdaModel(corpus=corpus, num_topics=5, id2word=id2word, passes=80)
lda.print_topics()

[(0,
  '0.014*"brr" + 0.010*"chyeah" + 0.010*"racks" + 0.009*"yuh" + 0.008*"vroom" + 0.007*"slatt" + 0.006*"goyard" + 0.005*"esskeetit" + 0.005*"pinky" + 0.005*"fasho"'),
 (1,
  '0.009*"biggie" + 0.007*"je" + 0.006*"nicki" + 0.006*"funk" + 0.004*"cole" + 0.004*"cardi" + 0.004*"que" + 0.004*"dem" + 0.003*"combs" + 0.003*"buck"'),
 (2,
  '0.011*"buck" + 0.008*"joyner" + 0.005*"yayo" + 0.004*"wha" + 0.003*"jurisdiction" + 0.003*"yup" + 0.003*"cha" + 0.002*"bah" + 0.002*"blat" + 0.002*"isis"'),
 (3,
  '0.004*"nas" + 0.004*"woah" + 0.003*"shady" + 0.002*"cmon" + 0.002*"demons" + 0.002*"outlaw" + 0.002*"pac" + 0.002*"codeine" + 0.001*"dre" + 0.001*"dig"'),
 (4,
  '0.004*"logic" + 0.002*"bam" + 0.002*"kanye" + 0.002*"monster" + 0.002*"miller" + 0.002*"sinatra" + 0.002*"yeezy" + 0.001*"cuz" + 0.001*"rhymes" + 0.001*"roc"')]

In [9]:
corpus_transformed = lda[corpus]
list(zip([a[0][0] for a in corpus_transformed], data_dtm.index))

[(3, '2Pac'),
 (1, 'Cardi B'),
 (3, 'Eminem'),
 (1, 'J. Cole'),
 (2, 'Joyner Lucas'),
 (3, 'Juice WRLD'),
 (4, 'Kanye West'),
 (0, 'Lil Pump'),
 (4, 'Logic'),
 (4, 'Mac Miller'),
 (3, 'Nas'),
 (1, 'Nicki Minaj'),
 (1, 'Notorious B.I.G.')]

In [10]:
from sklearn.decomposition import LatentDirichletAllocation as LDA
lda_sk = LDA(n_components=5, n_jobs=-1)
lda_sk.fit(data_dtm)

LatentDirichletAllocation(n_components=5, n_jobs=-1)

Let's do this again but with another Library called [pyLDAvis](https://github.com/bmabey/pyLDAvis). The nice thing about this LDA lib is that it already creates an interactive visualization of the results and saves it in an html file. Data Science wouldn't be complete without any visualizations after all!

In [11]:
from pyLDAvis import sklearn as sklearn_lda
import os
import pyLDAvis

LDAvis_data_filepath = os.path.join('./ldavis')

LDAvis_prepared = sklearn_lda.prepare(lda_sk, data_cv, cv)
with open(LDAvis_data_filepath, 'wb') as f:
    pickle.dump(LDAvis_prepared, f)
        
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
    
pyLDAvis.save_html(LDAvis_prepared, './ldavis.html')

The results might be a bit difficult. In general rappers tend to talk about themselves, which is why we often see their name in the result.

However topic 3 has words like 'codeine' and 'demons' which might be referred to lyrics about mental state and depression - probably Juice WRLD?

But overall this section certainly needs some tweaking, the topics are quite difficult to interprete.

In [12]:
from IPython.display import IFrame

words = cv.get_feature_names()
for topic_idx, topic in enumerate(lda_sk.components_):
    print("\nTopic #%d:" % topic_idx)
    print(" ".join([words[i] for i in topic.argsort()[:-11:-1]]))

IFrame(src='./ldavis.html', width=1080, height=720)


Topic #0:
shady cole buck dre miller thoughts superman duh rhyme joyner

Topic #1:
cardi bam kanye monster yeezy roc hoo hype cmon chi

Topic #2:
nas cmon outlaw pac row heaven dogg soldier queensbridge mob

Topic #3:
woah racks codeine demons brr yuh dig percs choppa skrrt

Topic #4:
biggie je nicki funk logic dem buck combs que cmon


  and should_run_async(code)


Finally, let's have a look on the distance between the topic. Note that the numbers i of one bubble in this visualization represents the topic n-1 from previous cell.

As mentioned earlier, rappers tend to talk about themselves - Topic 1, 2 and 5 (0, 1 and 4 in previous cell) most often contain the name of these artists. And in the visualization these bubbles are very close. So my assumption might be true (at least partially).