
# Topic modelling

The data is a corpus of ABC news headlines, published on Harvard Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL

In [14]:
!wget https://raw.githubusercontent.com/franciscadias/data/master/abcnews-date-text.csv

--2019-04-16 16:05:50--  https://raw.githubusercontent.com/franciscadias/data/master/abcnews-date-text.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54096356 (52M) [text/plain]
Saving to: ‘abcnews-date-text.csv’


2019-04-16 16:05:51 (160 MB/s) - ‘abcnews-date-text.csv’ saved [54096356/54096356]



In [22]:
import pandas as pd

headlines_df = pd.read_csv("abcnews-date-text.csv")
headlines_df.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [16]:
headlines_df = headlines_df.sample(frac=.1, random_state=0)
headlines_df.shape

(108217, 2)

## Preprocess

In [35]:
import gensim
import nltk

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:

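lemmatizer = nltk.stem.WordNetLemmatizer()

def preprocess(text):
  """Tokenize a headline, drop stopwords, and lemmatize each token as a verb."""
  tokens = []

  # simple_preprocess lowercases the text and splits it into tokens
  for token in gensim.utils.simple_preprocess(text):
    if token not in gensim.parsing.preprocessing.STOPWORDS:
      # pos='v' reduces verb forms to their base form ("decides" -> "decide")
      tokens.append(lemmatizer.lemmatize(token, pos='v'))

  return tokens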

In [39]:
headlines_df["headline_processed"] = headlines_df["headline_text"].map(preprocess)
headlines_df.head()


In [41]:
headlines = headlines_df["headline_processed"].values
headlines[:3]

array([list(['aba', 'decide', 'community', 'broadcast', 'licence']),
       list(['act', 'witness', 'aware', 'defamation']),
       list(['call', 'infrastructure', 'protection', 'summit'])],
      dtype=object)

In [0]:
dictionary = gensim.corpora.Dictionary(headlines)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
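
The `filter_extremes` call prunes the vocabulary by document frequency. As a rough illustration of the idea (a pure-Python sketch, not gensim's implementation; the helper name `filter_vocab` and the toy documents are made up for this example), a token survives if it occurs in at least `no_below` documents and in no more than a fraction `no_above` of all documents, after which only the `keep_n` most frequent survivors are kept:

```python
from collections import Counter

def filter_vocab(docs, no_below=2, no_above=0.5, keep_n=100000):
    """Hypothetical helper: prune a vocabulary by document frequency."""
    # Document frequency: number of documents each token appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    # Drop tokens that are too rare or too common
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens with the highest document frequency
    # (ties are broken alphabetically, an arbitrary choice for this sketch)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

toy_docs = [
    ["police", "charge", "man"],
    ["police", "court", "man"],
    ["police", "rain", "flood"],
    ["police", "court", "flood"],
]
vocab = filter_vocab(toy_docs, no_below=2, no_above=0.5)
print(vocab)
```

In this toy corpus "police" appears in every document and is dropped as too common, while "charge" and "rain" appear only once and are dropped as too rare.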

## Bag of Words

In [45]:
X = [dictionary.doc2bow(headline) for headline in headlines]
X[0]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
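
Each entry of `X` is a list of `(token_id, count)` pairs. A minimal pure-Python sketch of that representation follows. It assumes ids are handed out in order of first appearance, which is a simplification (gensim assigns ids in its own way), and `build_token2id` and `to_bow` are illustrative names rather than gensim functions:

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [
    ["aba", "decide", "community", "broadcast", "licence"],
    ["act", "witness", "aware", "defamation"],
]
token2id = build_token2id(toy_docs)
bow = to_bow(["community", "licence", "licence", "unknown"], token2id)
print(bow)
```

Tokens missing from the vocabulary are silently dropped, which is also why words removed by `filter_extremes` never show up in the bag-of-words vectors.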

In [0]:
lda = gensim.models.LdaMulticore(X, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [49]:
for index, topic in lda.print_topics(-1):
    print("\nTOPIC %d - most popular words" % (index+1))
    print(topic)


TOPIC 1 - most popular words
0.033*"trump" + 0.029*"queensland" + 0.012*"close" + 0.011*"force" + 0.011*"island" + 0.010*"john" + 0.009*"david" + 0.008*"million" + 0.008*"christmas" + 0.007*"sea"

TOPIC 2 - most popular words
0.019*"win" + 0.015*"year" + 0.015*"interview" + 0.011*"world" + 0.011*"break" + 0.011*"league" + 0.010*"afl" + 0.010*"end" + 0.010*"donald" + 0.009*"final"

TOPIC 3 - most popular words
0.039*"man" + 0.034*"police" + 0.023*"charge" + 0.021*"court" + 0.016*"murder" + 0.015*"attack" + 0.015*"kill" + 0.014*"crash" + 0.013*"woman" + 0.013*"die"

TOPIC 4 - most popular words
0.018*"sa" + 0.017*"country" + 0.017*"rural" + 0.016*"fund" + 0.016*"wa" + 0.016*"help" + 0.015*"health" + 0.011*"indigenous" + 0.010*"nsw" + 0.009*"centre"

TOPIC 5 - most popular words
0.023*"south" + 0.020*"house" + 0.018*"north" + 0.015*"china" + 0.014*"west" + 0.014*"australia" + 0.012*"australian" + 0.011*"new" + 0.011*"turnbull" + 0.010*"talk"

TOPIC 6 - most popular words
0.025*"gov
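
Each topic is printed as a single string of `weight*"word"` terms joined by `+`. When the weights are needed as numbers, one option is to parse that string; the sketch below does so with a regular expression (`parse_topic` is a hypothetical helper written for this example). gensim models also provide `show_topic`, which returns `(word, weight)` pairs directly and is usually the simpler route.

```python
import re

# Matches one term such as 0.039*"man" and captures the weight and the word
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_str):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_str)]

topic = '0.039*"man" + 0.034*"police" + 0.023*"charge"'
terms = parse_topic(topic)
print(terms)
```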

## TF-IDF

In [0]:
tfidf = gensim.models.TfidfModel(X)
X = tfidf[X]
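
As a rough sketch of what this weighting does: each term count is multiplied by an inverse document frequency, and the resulting document vector is scaled to unit length. The pure-Python version below assumes a base-2 logarithm, `log2(N / df)`, and L2 normalization, which should correspond to gensim's default settings but is stated here as an assumption. `tfidf_weights` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def tfidf_weights(bow_corpus):
    """Hypothetical helper: re-weight (token_id, count) vectors with TF-IDF."""
    n_docs = len(bow_corpus)
    # Document frequency of each token id
    doc_freq = Counter(tid for doc in bow_corpus for tid, _ in doc)
    weighted = []
    for doc in bow_corpus:
        # Term count times inverse document frequency
        vec = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        # Terms present in every document get weight 0 and are dropped
        vec = [(tid, w) for tid, w in vec if w > 0]
        # Scale the vector to unit L2 norm
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

toy_bow = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (3, 1)],
    [(0, 1)],
]
weighted = tfidf_weights(toy_bow)
for doc in weighted:
    print(doc)
```

Token 0 occurs in every toy document, so it carries no information and disappears from all vectors; rarer tokens end up with the larger weights.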

In [0]:
lda = gensim.models.LdaMulticore(X, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [54]:
for index, topic in lda.print_topics(-1):
    print("\nTOPIC %d - most popular words" % (index+1))
    print(topic)


TOPIC 1 - most popular words
0.014*"rural" + 0.013*"market" + 0.011*"news" + 0.009*"podcast" + 0.009*"abc" + 0.008*"price" + 0.007*"share" + 0.006*"turnbull" + 0.006*"queensland" + 0.006*"rise"

TOPIC 2 - most popular words
0.006*"coal" + 0.005*"capital" + 0.005*"hill" + 0.005*"hobart" + 0.005*"andrew" + 0.004*"fix" + 0.004*"western" + 0.004*"data" + 0.004*"exchange" + 0.003*"new"

TOPIC 3 - most popular words
0.006*"monday" + 0.006*"july" + 0.006*"mental" + 0.005*"islamic" + 0.005*"say" + 0.005*"northern" + 0.004*"australia" + 0.004*"png" + 0.004*"state" + 0.003*"clarke"

TOPIC 4 - most popular words
0.015*"trump" + 0.012*"interview" + 0.009*"grandstand" + 0.007*"nrl" + 0.007*"royal" + 0.006*"afl" + 0.006*"tuesday" + 0.006*"september" + 0.005*"commission" + 0.004*"malcolm"

TOPIC 5 - most popular words
0.007*"christmas" + 0.007*"august" + 0.007*"november" + 0.006*"april" + 0.006*"syria" + 0.006*"age" + 0.006*"street" + 0.006*"jam" + 0.004*"chris" + 0.004*"online"
TOPIC 6 - paro

## Visualize the model

In [55]:
!pip install pyldavis

Collecting pyldavis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
    100% |████████████████████████████████| 1.6MB 8.2MB/s
Collecting funcy (from pyldavis)
  Downloading https://files.pythonhosted.org/packages/47/a4/204fa23012e913839c2da4514b92f17da82bf5fc8c2c3d902fa3fa3c6eec/funcy-1.11-py2.py3-none-any.whl
Building wheels for collected packages: pyldavis
  Building wheel for pyldavis (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
Successfully built pyldavis
Installing collected packages: funcy, pyldavis
Successfully installed funcy-1.11 pyldavis-2.1.2


In [57]:
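# Works with the pyLDAvis 2.1.2 release installed above;
# newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim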

pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda, X, dictionary, mds='tsne')

