# Topic Modeling
## This notebook outlines the concepts involved in Topic Modeling


Topic modeling is a family of statistical techniques for **discovering** the abstract "topics" that occur in a collection of documents.

It is commonly applied to collections of text documents. More recently, topic modeling has also become an emerging research area in social media analysis.

One of the most popular algorithms is **Latent Dirichlet Allocation** (LDA), which was proposed by
David Blei et al. in 2003.
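
To make the idea behind LDA concrete, here is a minimal sketch of its generative story: each document draws a topic mixture from a Dirichlet distribution, and each word is produced by first picking a topic from that mixture and then picking a word from that topic. The two topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are made up for illustration; they are not part of this notebook's pipeline or of gensim.

```python
import random

# Hypothetical toy topics: each maps words to probabilities (illustrative values only).
TOPICS = {
    "sports": {"player": 0.5, "match": 0.3, "game": 0.2},
    "finance": {"credit": 0.4, "fraud": 0.3, "transaction": 0.3},
}


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(num_words, alpha, rng):
    """Generate one document following the LDA generative story."""
    topic_names = list(TOPICS)
    # 1. Per-document topic mixture: theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(num_words):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        # 3. Pick a word from the chosen topic's word distribution.
        vocab = list(TOPICS[topic])
        weights = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return theta, words


rng = random.Random(0)
theta, words = generate_document(num_words=8, alpha=[0.5, 0.5], rng=rng)
print("topic mixture:", [round(t, 3) for t in theta])
print("document:", words)
```

Training an LDA model runs this story in reverse: given only the observed words, it estimates the per-document topic mixtures and the per-topic word distributions.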

Dataset: 
https://raw.githubusercontent.com/subashgandyer/datasets/main/kaggledatasets.csv

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Tokenize
    - Stop words removal
    - Non-alphabetic words removal
    - Lowercase
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Visualize the topics

### Install the necessary library

In [1]:
# ! pip install gensim

In [2]:
import nltk
nltk.download('stopwords')  # stop word lists used during preprocessing
nltk.download('wordnet')    # lexical database required by WordNetLemmatizer


### Import the necessary libraries

In [3]:
from pprint import pprint

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

### Download the dataset

In [4]:
# ! wget https://raw.githubusercontent.com/subashgandyer/datasets/main/kaggledatasets.csv

### Load the dataset

In [5]:
df = pd.read_csv("data/kaggledatasets.csv")
df.head()

Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1241,"Version 2,2016-11-05|Version 1,2016-11-03",crime\nfinance,CSV,144 MB,ODbL,"442,136 views","53,128 downloads","1,782 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1046,"Version 10,2016-10-24|Version 9,2016-10-24|Ver...",association football\neurope,SQLite,299 MB,ODbL,"396,214 views","46,367 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1024,"Version 2,2017-09-28",film,CSV,44 MB,Other,"446,255 views","62,002 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,789,"Version 2,2017-07-19|Version 1,2016-12-08",crime\nterrorism\ninternational relations,CSV,144 MB,Other,"187,877 views","26,309 downloads",608 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."
4,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,Zielak,618,"Version 11,2018-01-11|Version 10,2017-11-17|Ve...",history\nfinance,CSV,119 MB,CC4,"146,734 views","16,868 downloads",68 kernels,13 topics,https://www.kaggle.com/mczielinski/bitcoin-his...,Context\nBitcoin is the longest running and mo...


### Explore the dataset

### Extract the data for topic modeling

In [6]:
for _, description in df['Description'].items():
    raw = str(description).lower()
    print(raw)

the datasets contains transactions made by credit cards in september 2013 by european cardholders. this dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. the dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.
it contains only numerical input variables which are the result of a pca transformation. unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. features v1, v2, ... v28 are the principal components obtained with pca, the only features which have not been transformed with pca are 'time' and 'amount'. feature 'time' contains the seconds elapsed between each transaction and the first transaction in the dataset. the feature 'amount' is the transaction amount, this feature can be used for example-dependant cost-senstive learning. feature 'class' is the response variable and it takes value 1 in case of 

### Pre-process the dataset
- Lowercase
- Tokenize (keeping only alphabetic tokens)
- Remove stop words
- Lemmatize
- Remove single-character tokens

### Define the pattern, tokenizer, stop words and lemmatizer

In [7]:
pattern = r'\b[^\d\W]+\b'
tokenizer = RegexpTokenizer(pattern)
en_stop = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
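
As a quick check of what this pattern keeps, the sketch below applies the same regular expression with Python's built-in `re` module instead of `RegexpTokenizer`; the sample sentence is made up for illustration. The character class `[^\d\W]` matches word characters that are not digits (letters and the underscore), so purely numeric tokens and mixed tokens such as `v1` are dropped entirely.

```python
import re

# Same pattern as in the notebook: runs of non-digit word characters
# delimited by word boundaries.
pattern = r'\b[^\d\W]+\b'

# Made-up sample text (already lowercased, as in the preprocessing step).
sample = "features v1, v2 are pca components; 492 frauds out of 284,807"

tokens = re.findall(pattern, sample)
print(tokens)
# ['features', 'are', 'pca', 'components', 'frauds', 'out', 'of']
```

Words such as `are`, `out`, and `of` survive tokenization here; they are only removed later by the stop word filter.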

### Preprocess

In [8]:
texts = []


for _, description in df['Description'].items():
    # lowercase the document string and tokenize it
    raw = str(description).lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # lemmatize tokens
    lemma_tokens = [lemmatizer.lemmatize(token) for token in stopped_tokens]

    # remove tokens consisting of a single character
    new_lemma_tokens = [token for token in lemma_tokens if len(token) != 1]

    # add the cleaned token list for this document
    texts.append(new_lemma_tokens)


print(texts[0])

['datasets', 'contains', 'transaction', 'made', 'credit', 'card', 'september', 'european', 'cardholder', 'dataset', 'present', 'transaction', 'occurred', 'two', 'day', 'fraud', 'transaction', 'dataset', 'highly', 'unbalanced', 'positive', 'class', 'fraud', 'account', 'transaction', 'contains', 'numerical', 'input', 'variable', 'result', 'pca', 'transformation', 'unfortunately', 'due', 'confidentiality', 'issue', 'cannot', 'provide', 'original', 'feature', 'background', 'information', 'data', 'feature', 'principal', 'component', 'obtained', 'pca', 'feature', 'transformed', 'pca', 'time', 'amount', 'feature', 'time', 'contains', 'second', 'elapsed', 'transaction', 'first', 'transaction', 'dataset', 'feature', 'amount', 'transaction', 'amount', 'feature', 'used', 'example', 'dependant', 'cost', 'senstive', 'learning', 'feature', 'class', 'response', 'variable', 'take', 'value', 'case', 'fraud', 'otherwise', 'given', 'class', 'imbalance', 'ratio', 'recommend', 'measuring', 'accuracy', 'usi

### Create a dictionary

In [9]:
dictionary = Dictionary(texts)

### Filter low frequency words

In [10]:
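# drop tokens that appear in fewer than 10 documents or in more than 50% of all documents
dictionary.filter_extremes(no_below=10, no_above=0.5)
# convert each tokenized document into a bag-of-words vector of (token_id, count) pairs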
corpus = [dictionary.doc2bow(text) for text in texts]
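
To illustrate what these two calls do, here is a rough pure-Python sketch of the same idea on a made-up three-document corpus: count in how many documents each token appears, keep tokens whose document frequency is at least `no_below` and at most `no_above` times the number of documents, and then turn each document into `(token_id, count)` pairs. The helper names `filter_vocabulary` and `to_bow` and the toy thresholds are assumptions for illustration; gensim's `Dictionary` additionally caps the vocabulary size (`keep_n`) and may assign ids differently.

```python
from collections import Counter

# Made-up tokenized documents standing in for `texts`.
toy_texts = [
    ["credit", "card", "fraud", "data"],
    ["soccer", "match", "data", "player"],
    ["credit", "fraud", "fraud", "data"],
]


def filter_vocabulary(texts, no_below, no_above):
    """Keep tokens that occur in at least `no_below` documents and in at most
    `no_above * len(texts)` documents; return a token -> id mapping."""
    doc_freq = Counter(token for text in texts for token in set(text))
    max_docs = no_above * len(texts)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(text, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


token2id = filter_vocabulary(toy_texts, no_below=2, no_above=0.7)
toy_corpus = [to_bow(text, token2id) for text in toy_texts]
print(token2id)    # {'credit': 0, 'fraud': 1}
print(toy_corpus)  # [[(0, 1), (1, 1)], [], [(0, 1), (1, 2)]]
```

In this toy corpus `data` is dropped because it appears in every document, while `card`, `soccer`, `match`, and `player` are dropped because each appears in only one.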

### Create an index to word dictionary

In [11]:
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

### Train the Topic model

In [12]:
ldamodel = LdaModel(corpus, num_topics=15, id2word = id2word, passes=20)

### Display the topics

In [13]:
pprint(ldamodel.top_topics(corpus,topn=5))

[([(0.0181141, 'player'),
   (0.014698765, 'match'),
   (0.0146566825, 'game'),
   (0.01349743, 'time'),
   (0.012353201, 'inspiration')],
  -0.7918305939969578),
 ([(0.011850098, 'text'),
   (0.009736271, 'file'),
   (0.007820071, 'contains'),
   (0.007774869, 'http'),
   (0.007116005, 'use')],
  -0.8615363689063186),
 ([(0.01577201, 'cell'),
   (0.015328783, 'instance'),
   (0.010367524, 'name'),
   (0.010325557, 'learning'),
   (0.010141488, 'group')],
  -1.794857334941512),
 ([(0.013969049, 'year'),
   (0.011342459, 'number'),
   (0.0103551345, 'information'),
   (0.010077194, 'state'),
   (0.007924146, 'crime')],
  -1.8012512390817939),
 ([(0.030116025, 'image'),
   (0.027993483, 'column'),
   (0.015135894, 'activity'),
   (0.015083765, 'label'),
   (0.01414586, 'csv')],
  -1.9007929103337424),
 ([(0.03862689, 'csv'),
   (0.021309918, 'score'),
   (0.021087777, 'weapon'),
   (0.016905691, 'name'),
   (0.01678431, 'time')],
  -1.9570779605794606),
 ([(0.054189567, 'model'),
   (0.0
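
Each entry in this output is a pair: the topic's top words with their probabilities, followed by a coherence score for that topic. By default `top_topics` uses the UMass coherence measure, where values closer to zero indicate words that tend to occur together in the same documents. The sketch below shows the idea on made-up documents; the function name `umass_coherence`, the `+1` smoothing, and the averaging over word pairs are assumptions for illustration, and gensim's exact smoothing and aggregation may differ.

```python
import math
from itertools import combinations

# Made-up tokenized documents.
toy_docs = [
    ["player", "match", "game"],
    ["player", "game"],
    ["credit", "fraud"],
    ["credit", "fraud", "player"],
]


def umass_coherence(top_words, documents):
    """Average UMass-style score over word pairs of one topic.

    top_words: topic words ordered from most to least probable;
    every word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        # Smoothed log-estimate of how often `later` appears given `earlier`.
        scores.append(math.log((doc_count(earlier, later) + 1) / doc_count(earlier)))
    return sum(scores) / len(scores)


coherent = umass_coherence(["player", "game", "match"], toy_docs)
mixed = umass_coherence(["player", "fraud", "match"], toy_docs)
print(round(coherent, 3), round(mixed, 3))
```

The topic whose words share documents scores closer to zero than the topic that mixes unrelated words, which matches how `top_topics` orders its results from most to least coherent.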

### Display the 15 topics with words

In [14]:
for idx in range(15):
    print("Topic #%s:" % idx, ldamodel.print_topic(idx, 10))

Topic #0: 0.389*"university" + 0.072*"state" + 0.048*"college" + 0.022*"california" + 0.020*"texas" + 0.015*"institute" + 0.011*"north" + 0.011*"new" + 0.010*"solar" + 0.010*"technology"
Topic #1: 0.016*"cell" + 0.015*"instance" + 0.010*"name" + 0.010*"learning" + 0.010*"group" + 0.010*"file" + 0.009*"company" + 0.009*"classification" + 0.008*"software" + 0.008*"attribute"
Topic #2: 0.012*"text" + 0.010*"file" + 0.008*"contains" + 0.008*"http" + 0.007*"use" + 0.007*"date" + 0.006*"available" + 0.006*"time" + 0.006*"language" + 0.005*"inspiration"
Topic #3: 0.018*"day" + 0.016*"back" + 0.014*"woman" + 0.013*"number" + 0.012*"lower" + 0.012*"inspiration" + 0.011*"set" + 0.011*"city" + 0.011*"health" + 0.011*"risk"
Topic #4: 0.018*"player" + 0.015*"match" + 0.015*"game" + 0.013*"time" + 0.012*"inspiration" + 0.012*"see" + 0.011*"others" + 0.011*"research" + 0.011*"world" + 0.011*"get"
Topic #5: 0.058*"description" + 0.052*"yet" + 0.033*"time" + 0.026*"tweet" + 0.014*"season" + 0.014*"many

### LSI Model

In [15]:
from gensim.models import LsiModel
lsamodel = LsiModel(corpus, num_topics=10, id2word = id2word)
pprint(lsamodel.print_topics(num_topics=10, num_words=10))

[(0,
  '0.970*"university" + 0.174*"state" + 0.076*"college" + 0.051*"texas" + '
  '0.049*"california" + 0.039*"institute" + 0.031*"new" + 0.028*"technology" + '
  '0.027*"florida" + 0.027*"north"'),
 (1,
  '-0.389*"player" + -0.247*"team" + -0.221*"shot" + -0.200*"number" + '
  '-0.177*"time" + -0.173*"file" + -0.159*"year" + -0.156*"csv" + '
  '-0.146*"goal" + -0.126*"ice"'),
 (2,
  '0.437*"player" + 0.307*"shot" + 0.259*"team" + -0.250*"integer" + '
  '-0.224*"strongly" + 0.175*"ice" + 0.174*"goal" + -0.154*"file" + '
  '0.151*"attempt" + -0.133*"csv"'),
 (3,
  '0.595*"integer" + 0.535*"strongly" + 0.263*"interested" + 0.261*"enjoy" + '
  '0.119*"much" + 0.116*"player" + -0.098*"file" + -0.093*"year" + '
  '0.090*"shot" + -0.088*"csv"'),
 (4,
  '0.402*"year" + -0.325*"date" + -0.265*"element" + -0.199*"tag" + '
  '-0.192*"registration" + -0.186*"zero" + -0.180*"end" + -0.174*"start" + '
  '-0.171*"one" + -0.165*"application"'),
 (5,
  '0.535*"csv" + -0.436*"year" + -0.193*"number" +

In [16]:
for idx in range(10):
    print("Topic #%s:" % idx, lsamodel.print_topic(idx, 10))
print("=" * 20)

Topic #0: 0.970*"university" + 0.174*"state" + 0.076*"college" + 0.051*"texas" + 0.049*"california" + 0.039*"institute" + 0.031*"new" + 0.028*"technology" + 0.027*"florida" + 0.027*"north"
Topic #1: -0.389*"player" + -0.247*"team" + -0.221*"shot" + -0.200*"number" + -0.177*"time" + -0.173*"file" + -0.159*"year" + -0.156*"csv" + -0.146*"goal" + -0.126*"ice"
Topic #2: 0.437*"player" + 0.307*"shot" + 0.259*"team" + -0.250*"integer" + -0.224*"strongly" + 0.175*"ice" + 0.174*"goal" + -0.154*"file" + 0.151*"attempt" + -0.133*"csv"
Topic #3: 0.595*"integer" + 0.535*"strongly" + 0.263*"interested" + 0.261*"enjoy" + 0.119*"much" + 0.116*"player" + -0.098*"file" + -0.093*"year" + 0.090*"shot" + -0.088*"csv"
Topic #4: 0.402*"year" + -0.325*"date" + -0.265*"element" + -0.199*"tag" + -0.192*"registration" + -0.186*"zero" + -0.180*"end" + -0.174*"start" + -0.171*"one" + -0.165*"application"
Topic #5: 0.535*"csv" + -0.436*"year" + -0.193*"number" + 0.175*"file" + -0.166*"date" + -0.155*"total" + -0.1
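
LSI (latent semantic indexing) derives its topics from a truncated singular value decomposition of the document-term matrix rather than from a probabilistic model, which is why the weights above can be negative and do not sum to one. As a minimal sketch of the underlying idea, the code below recovers only the strongest topic of a made-up count matrix by power iteration; the matrix, the term names, and the helper `first_lsi_topic` are assumptions for illustration, and gensim's `LsiModel` uses a different, scalable SVD algorithm.

```python
import math

# Made-up document-term count matrix: rows are documents, columns are terms.
terms = ["player", "team", "csv"]
doc_term = [
    [2.0, 1.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def first_lsi_topic(rows, iterations=50):
    """Approximate the dominant right singular vector of `rows`
    (the term weights of the strongest topic) by power iteration."""
    n_terms = len(rows[0])
    weights = [1.0] * n_terms
    for _ in range(iterations):
        # Project documents onto the current topic: scores = A @ weights.
        scores = [sum(a * w for a, w in zip(row, weights)) for row in rows]
        # Map back to term space: weights = A^T @ scores.
        weights = [sum(row[j] * s for row, s in zip(rows, scores))
                   for j in range(n_terms)]
        # Normalize to unit length so the values stay bounded.
        norm = math.sqrt(sum(w * w for w in weights))
        weights = [w / norm for w in weights]
    return weights


topic = first_lsi_topic(doc_term)
print(" + ".join(f'{w:.3f}*"{t}"' for w, t in zip(topic, terms)))
```

The sign of a singular vector is arbitrary, so a topic and its negation describe the same direction; this is one reason the same words can appear with opposite signs in different LSI topics.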

## Visualize the topics and documents with the trained Topic Model
- Use pyLDAvis, a separate visualization package that provides an adapter for gensim models

In [17]:
# !pip3 install pyLDAvis
# Note: recent pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim


### Enable the notebook for visualization

In [18]:
pyLDAvis.enable_notebook()

### Visualize the Topic model

In [19]:

pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

