# <h1 align = 'center'> Data collection, Transmission and Security part 4</h1> 
#### <center> Marvel VS DC  </center>
#### <center> Achraf BELLA - Ecole Centrale - Casablanca - January 2022 </center>
***

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we walk through Latent Dirichlet Allocation (LDA), one of many topic modeling techniques. It was designed specifically for text data.
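For reference, the generative story that LDA assumes can be written compactly as below. This is the standard textbook formulation rather than anything specific to this notebook, and the notation is introduced here as an assumption: $K$ is the number of topics, $D$ the number of documents, $\alpha$ and $\beta$ are Dirichlet priors, $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, and $z_{d,n}$ and $w_{d,n}$ are the topic and the word at position $n$ of document $d$.

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),  & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Fitting the model amounts to inferring $\theta_d$ and $\phi_k$ from the observed words; the topic listings printed later in this notebook are the largest entries of each $\phi_k$.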

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
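To make the first input concrete, here is a minimal pure-Python sketch of what a document-term matrix is. It is only an illustration: the function `build_dtm` and the toy corpus are hypothetical, and the tokenization (lowercase, split on whitespace) is simpler than what `CountVectorizer` does, although `min_df` plays the same role of dropping terms that appear in too few documents.

```python
from collections import Counter

def build_dtm(docs, min_df=1):
    """Return (vocabulary, matrix) for whitespace-tokenized documents.

    vocabulary: sorted list of terms appearing in at least `min_df` documents.
    matrix: one row per document, one column per vocabulary term (raw counts).
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_df)
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            if term in column:
                row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical toy corpus, not taken from the tweet data.
docs = ["batman movie was great", "great batman comic", "superman movie"]
vocab, dtm = build_dtm(docs, min_df=2)
print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column a term, which is the layout the notebook builds for the four tweet datasets.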

Once the topic modeling technique is applied, your job as a human is to interpret the results and see whether the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


# Data Acquisition

In [2]:
marvel_usa = pd.read_pickle('marvel_data_usa_df.pkl')

In [3]:
marvel_europe = pd.read_pickle('marvel_data_europe_df.pkl')

In [4]:
DC_usa = pd.read_pickle('DC_data_usa_df.pkl')

In [5]:
DC_europe = pd.read_pickle('DC_data_europe_df.pkl')

In [6]:
print('shape of marvel europe data {}'.format(marvel_europe.shape))
print('shape of marvel usa data {}'.format(marvel_usa.shape))
print('shape of DC europe data {}'.format(DC_europe.shape))
print('shape of DC usa data {}'.format(DC_usa.shape))


shape of marvel europe data (5867, 13)
shape of marvel usa data (32391, 13)
shape of DC europe data (8774, 13)
shape of DC usa data (29062, 13)


### Document-Term-Matrix

Let's build a document-term matrix for each of the four datasets.

In [7]:
cv1 = CountVectorizer(stop_words='english', min_df=3)
data_cv1 = cv1.fit_transform(marvel_europe.cleaned_tweet)
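# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onwards
marvel_europe_dtm = pd.DataFrame(data_cv1.toarray(), columns=cv1.get_feature_names_out())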
marvel_europe_dtm.index = marvel_europe.index
marvel_europe_dtm.head(5)

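| | able | absolut | absolute | absolutely | abt | acc | accent | accept | access | according | ... | youngronin | youre | youtub | youtube | youve | zack | zelma | zendaya | zero | zhao |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |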


In [8]:
cv2 = CountVectorizer(stop_words='english', min_df=3)
data_cv2 = cv2.fit_transform(marvel_usa.cleaned_tweet)
marvel_usa_dtm = pd.DataFrame(data_cv2.toarray(), columns=cv2.get_feature_names_out())
marvel_usa_dtm.index = marvel_usa.index
marvel_usa_dtm.head(5)

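| | aarnft | aaron | aaronlopresti | aaronmeyers | aarons | abandonedlizard | ability | able | abo | abomination | ... | zip | zips | zircher | zombie | zombies | zombiesquadhq | zone | zoom | zsjl | zynirel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |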


In [9]:
cv3 = CountVectorizer(stop_words='english', min_df=3)
data_cv3 = cv3.fit_transform(DC_usa.cleaned_tweet)
DC_usa_dtm = pd.DataFrame(data_cv3.toarray(), columns=cv3.get_feature_names_out())
DC_usa_dtm.index = DC_usa.index
DC_usa_dtm.head(5)

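| | aamaadmipay | aap | aaron | aaronbaileya | aaronlopresti | aaronmeyers | abc | abigaildmain | abilities | ability | ... | zombies | zone | zoom | zorro | zoë | zsharf | zsjl | zsyadf | zurenarrh | zynirel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |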


In [10]:
cv4 = CountVectorizer(stop_words='english', min_df=3)
data_cv4 = cv4.fit_transform(DC_europe.cleaned_tweet)
DC_europe_dtm = pd.DataFrame(data_cv4.toarray(), columns=cv4.get_feature_names_out())
DC_europe_dtm.index = DC_europe.index
DC_europe_dtm.head(5)

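| | abbey | able | abo | absolute | absolutely | abt | abuser | accent | accept | accidentally | ... | zammitmarc | zatanna | zdarsky | zero | zod | zodiac | zone | zoë | zsjl | zumypowa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |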


# Topic modeling

In [11]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



In [12]:
id2word1 = dict((v, k) for k, v in cv1.vocabulary_.items())
id2word2 = dict((v, k) for k, v in cv2.vocabulary_.items())
id2word3 = dict((v, k) for k, v in cv3.vocabulary_.items())
id2word4 = dict((v, k) for k, v in cv4.vocabulary_.items())
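
The dictionaries above invert `vocabulary_`, which maps each term to its column index, into the index-to-term mapping that gensim expects. The sketch below shows that inversion on a tiny hypothetical vocabulary, together with the bag-of-words shape of a gensim-style corpus: one list of `(term_id, count)` pairs per document. The names `vocabulary` and `dtm_rows` and all values are made up for illustration.

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_:
# term -> column index in the document-term matrix.
vocabulary = {"batman": 0, "movie": 1, "superman": 2}

# gensim expects the opposite direction: column index -> term.
id2word = {index: term for term, index in vocabulary.items()}

# One row of counts per document (same column order as `vocabulary`).
dtm_rows = [
    [2, 1, 0],
    [0, 1, 1],
]

# Bag-of-words corpus: for each document, the (term_id, count) pairs
# of the terms that actually occur in it.
corpus = [
    [(j, count) for j, count in enumerate(row) if count > 0]
    for row in dtm_rows
]

for doc in corpus:
    print([(id2word[j], count) for j, count in doc])
```

Because only non-zero counts are stored, this representation stays small even when the vocabulary has thousands of columns.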


### Topics from Marvel in Europe

In [13]:
# Convert the document-term matrix to a gensim corpus: DataFrame --> sparse term-document matrix (hence the transpose) --> gensim corpus

sparse_counts = scipy.sparse.csr_matrix(marvel_europe_dtm.transpose())
corpus = matutils.Sparse2Corpus(sparse_counts)

lda = models.LdaModel(corpus=corpus, id2word=id2word1, num_topics=3, passes=80)
lda.print_topics()

[(0,
  '0.072*"mcu" + 0.046*"spiderman" + 0.038*"way" + 0.028*"home" + 0.021*"movie" + 0.020*"eternals" + 0.017*"film" + 0.010*"good" + 0.009*"watched" + 0.009*"best"'),
 (1,
  '0.068*"mcu" + 0.039*"mcudirect" + 0.019*"like" + 0.015*"dont" + 0.014*"marvel" + 0.008*"make" + 0.008*"think" + 0.008*"thats" + 0.007*"mcusource" + 0.007*"character"'),
 (2,
  '0.077*"man" + 0.074*"iron" + 0.059*"marvel" + 0.032*"captain" + 0.026*"comics" + 0.011*"america" + 0.009*"great" + 0.009*"like" + 0.008*"new" + 0.007*"comic"')]

Topics: Spider-Man: No Way Home and other MCU films, general MCU discussion, and Iron Man / Captain America comics.
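
To read a topic more systematically than by eye, the printed string can be split into `(word, weight)` pairs. The sketch below does this with a regular expression on topic 2 of the output above; the helper `parse_topic` and the 0.02 cut-off are assumptions chosen for illustration. gensim's `show_topic` method returns such pairs directly, so parsing is only needed when all you have is the printed output.

```python
import re

def parse_topic(topic_string):
    """Turn a gensim print_topics() string into [(word, weight), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 2 from the Marvel Europe output above, copied verbatim.
topic_2 = ('0.077*"man" + 0.074*"iron" + 0.059*"marvel" + 0.032*"captain" + '
           '0.026*"comics" + 0.011*"america" + 0.009*"great" + 0.009*"like" + '
           '0.008*"new" + 0.007*"comic"')

terms = parse_topic(topic_2)
# Keep only the words whose weight reaches an arbitrary threshold.
top_words = [word for word, weight in terms if weight >= 0.02]
print(top_words)
```

Words below the threshold ("great", "like", "new") are generic and contribute little to naming the topic.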

### Topics from Marvel in USA

In [14]:
sparse_counts = scipy.sparse.csr_matrix(marvel_usa_dtm.transpose())
corpus = matutils.Sparse2Corpus(sparse_counts)

lda = models.LdaModel(corpus=corpus, id2word=id2word2, num_topics=3, passes=80)
lda.print_topics()

[(0,
  '0.129*"thor" + 0.044*"jonrfleming" + 0.035*"archerbm" + 0.033*"michaeltmcc" + 0.030*"tipct" + 0.029*"kelledin" + 0.027*"bsherrle" + 0.024*"trektheglobe" + 0.020*"johnfloridaman" + 0.018*"edmundzavada"'),
 (1,
  '0.048*"mcu" + 0.021*"like" + 0.014*"avengers" + 0.010*"movie" + 0.009*"think" + 0.008*"good" + 0.008*"mcudirect" + 0.008*"dont" + 0.008*"really" + 0.007*"movies"'),
 (2,
  '0.039*"marvel" + 0.038*"man" + 0.035*"spiderman" + 0.034*"iron" + 0.032*"way" + 0.024*"home" + 0.020*"avengers" + 0.014*"captain" + 0.013*"comics" + 0.008*"watch"')]

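Topics: Thor (a topic otherwise dominated by what appear to be Twitter usernames), general MCU and Avengers discussion, and Spider-Man: No Way Home together with Iron Man.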
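
Topic 0 above consists mostly of tokens that look like Twitter handles, which says more about who is being mentioned than about what is being discussed. One possible remedy, sketched below, is to drop `@mentions` before the text is cleaned and vectorized. This assumes the raw tweet text with the `@` prefix intact is still available, which this notebook does not show; the helper `strip_mentions` and the sample tweet are hypothetical.

```python
import re

MENTION = re.compile(r"@\w+")

def strip_mentions(text):
    """Remove @handles and collapse the whitespace left behind."""
    without = MENTION.sub(" ", text)
    return " ".join(without.split())

# Hypothetical raw tweet; the handles are borrowed from topic 0 above.
raw = "@jonrfleming @archerbm Thor is the best avenger"
print(strip_mentions(raw))
```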

### Topics from DC in Europe

In [15]:
sparse_counts = scipy.sparse.csr_matrix(DC_europe_dtm.transpose())
corpus = matutils.Sparse2Corpus(sparse_counts)

lda = models.LdaModel(corpus=corpus, id2word=id2word4, num_topics=3, passes=80)
lda.print_topics()

[(0,
  '0.036*"dccomics" + 0.025*"green" + 0.021*"wonder" + 0.019*"woman" + 0.017*"arrow" + 0.012*"aquaman" + 0.009*"dont" + 0.009*"like" + 0.008*"gotham" + 0.008*"lantern"'),
 (1,
  '0.136*"batman" + 0.019*"new" + 0.017*"movie" + 0.010*"thebatman" + 0.008*"best" + 0.008*"comic" + 0.007*"love" + 0.006*"film" + 0.006*"book" + 0.006*"dark"'),
 (2,
  '0.107*"superman" + 0.015*"like" + 0.011*"man" + 0.009*"hes" + 0.009*"lois" + 0.009*"amp" + 0.009*"justice" + 0.008*"aquaman" + 0.007*"good" + 0.007*"theme"')]

Topics: Wonder Woman, the Batman movie, and Superman, together with Lois, Superman's wife in the new series.

### Topics from DC in USA

In [16]:
sparse_counts = scipy.sparse.csr_matrix(DC_usa_dtm.transpose())
corpus = matutils.Sparse2Corpus(sparse_counts)

lda = models.LdaModel(corpus=corpus, id2word=id2word3, num_topics=3, passes=80)
lda.print_topics()

[(0,
  '0.036*"superman" + 0.031*"dccomics" + 0.024*"amp" + 0.018*"green" + 0.018*"comics" + 0.013*"lois" + 0.011*"lantern" + 0.010*"new" + 0.009*"watching" + 0.009*"comic"'),
 (1,
  '0.104*"batman" + 0.018*"movie" + 0.014*"like" + 0.009*"dont" + 0.008*"time" + 0.008*"new" + 0.008*"best" + 0.008*"really" + 0.007*"good" + 0.007*"know"'),
 (2,
  '0.084*"superman" + 0.022*"aquaman" + 0.020*"wonder" + 0.019*"woman" + 0.016*"like" + 0.010*"man" + 0.009*"think" + 0.009*"favorite" + 0.008*"youre" + 0.006*"played"')]

Topics: Superman, Wonder Woman, and Batman.

The topics would likely make more sense if we filtered the vocabulary to keep only nouns and adjectives.
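
A minimal sketch of that noun-and-adjective filter is shown below. It assumes the tweets have already been part-of-speech tagged with Penn Treebank tags, for example by a tagger such as `nltk.pos_tag`; the helper `keep_nouns_and_adjectives` is hypothetical and the tags in the example were written by hand.

```python
def keep_nouns_and_adjectives(tagged_tokens):
    """Keep tokens whose Penn Treebank tag starts with NN (noun) or JJ (adjective)."""
    return [token for token, tag in tagged_tokens if tag.startswith(("NN", "JJ"))]

# Hypothetical pre-tagged tweet in (token, tag) format; tags written by hand.
tagged = [("batman", "NN"), ("is", "VBZ"), ("really", "RB"),
          ("dark", "JJ"), ("movie", "NN")]

filtered = keep_nouns_and_adjectives(tagged)
print(" ".join(filtered))
```

The filtered strings could then be passed to `CountVectorizer` in place of `cleaned_tweet` before refitting the LDA models.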