# <b>Topic Modelling Pipeline with spaCy and Gensim</b>

## Introduction
This notebook is part of the team Merseus spike on a custom chronotope visualisation.<br>
The goal is to take a large dataset that has user-defined labels (from the cluster graph tool), find the main sub-topics within those clusters, and write these back to the dataset at the individual post level.<br>
To achieve this, we decided that a topic modelling pipeline built with spaCy and Gensim might be the best approach.<br>
<br>
<blockquote>
    In machine learning and natural language processing, a <b>topic model</b> is a type of statistical model for discovering the<br>
    abstract “topics” that occur in a collection of documents.
</blockquote>
<br>
This notebook walks through building a topic modelling pipeline with spaCy and Gensim, as laid out in a 
<a href="https://towardsdatascience.com/building-a-topic-modeling-pipeline-with-spacy-and-gensim-c5dc03ffc619" title="Towards Data Science">Medium blog post</a>.<br>

## Requirements
### Packages:
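```
pip install pandas
pip install gensim
# The code below uses the spaCy 2.x API (spacy.lemmatizer, function-based add_pipe)
pip install "spacy<3"
pip install tqdm
# pprint is part of the Python standard library, so it needs no install
```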

### Models:
Pre-trained spaCy models can be downloaded from: https://spacy.io/models/en#en_core_web_lg<br>
This example uses the large English model, so run the following command:
- `python -m spacy download en_core_web_lg`



-----------

In [3]:
import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import re

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.stop_words import STOP_WORDS

# tqdm allows you to create progress bars to track how long your code is taking to process
from tqdm.notebook import tqdm

# pprint is to make our topics formatted a little nicer when we take a look
from pprint import pprint

## Loading the Data and Model
In this notebook we are using a CSV generated from a discover search as follows:
<br>
`{ query: "xbox OR nintendo OR PS5", published_after: "16 Jun 2020 23:00" , published_before: "18 Jun 2020 09:05" }`

In [2]:
df = pd.read_csv('dataset_xbox_nintendo_ps5.csv')

In [3]:
# Drop rows with null text and make sure all values are strings. This shouldn't be needed in discover: the pipeline that
# prepares the data for the cluster graph removes all empty posts, and posts should be strings by default from switchboard.
df.dropna(subset = ["text"], inplace=True)
df['text'] = df['text'].astype(str)

## Data Preprocessing

In [4]:
def preprocess(sentence):
    '''
    Removes any unwanted characters from the sentence such as special characters, numbers, excess whitespace, newlines, etc.
    '''
    # Keeps only English (A-Z, a-z) and Chinese characters (can add more languages by expanding the regex)
    filtered_sentence = re.sub('[^A-Za-z\u4E00-\u9FFF ]+', ' ', sentence)
    # Collapses runs of whitespace into single spaces
    return ' '.join(filtered_sentence.split())
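
As a quick sanity check, the sketch below re-declares the same function and runs it on two made-up strings (the sample inputs are illustrative, not taken from the dataset). It also shows a side effect worth knowing about: accented letters fall outside `A-Za-z`, so they are replaced by spaces and can split a word in two.

```python
import re

def preprocess(sentence):
    # Same regex as above: keep A-Z, a-z, CJK ideographs U+4E00-U+9FFF and spaces
    filtered_sentence = re.sub('[^A-Za-z\u4E00-\u9FFF ]+', ' ', sentence)
    return ' '.join(filtered_sentence.split())

# Punctuation, digits and newlines are dropped; Chinese characters survive
sample_mixed = preprocess("Xbox Series X: 499 USD!!\n任天堂 Switch")

# "\u00e9" is an accented e, which the regex treats as an unwanted character
sample_accent = preprocess("Pok\u00e9mon")

print(sample_mixed)   # Xbox Series X USD 任天堂 Switch
print(sample_accent)  # Pok mon
```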

In [5]:
df['cleaned_text'] = df['text'].map(preprocess)

In [6]:
newest_doc = df['cleaned_text']

In [7]:
newest_doc

0        NOT CIAS SMARTPHONE Warner Bros Games e NetEas...
1        Scarlet Nexus es la nueva entrega de Bandai Na...
2                  Twitter Instagram Xbox Steam ThisJoeLee
3        Pathfinder Kingmaker anunciado para PS e Xbox ...
4        GAMEPLAY CRIS TALES Un PRECIOSO JRPG desarroll...
                               ...                        
37556    Mengatur atau diatur art lfl flatdesign playst...
37557    FUSER la prochaine volution des jeux musicaux ...
37559    Sony ha comprato la licenza di Oodle Texture u...
37560    Nintendo Switch 數位下載版 ZUMBA Burn It Up 終於在今天 月...
37561    AMR has recently added a new study titled Smar...
Name: cleanText, Length: 31402, dtype: object

## <b>The spaCy Part</b>

In [8]:
# 'en_core_web_lg' is the pre-trained large spaCy model we downloaded before.
# Here we load it so that it can be used to process the posts.
nlp = spacy.load('en_core_web_lg')

## Defining a list of stop words
<blockquote>
    Stop words are basically common words that don’t really add a lot of predictive value to your model. If you don’t remove the word “the” from your corpus<br>
    for example, it will likely show up in every topic you generate, given how often “the” is used in the English language.<br>
    Note that no word, even one as ubiquitous as “the” is automatically a stop word. Stop words are usually determined by the particular task at hand.
</blockquote>
<br>
If you notice lots of words coming out in your topics that don't convey much meaning, or that you don't particularly want, add them to the `custom_stop_list` list below.

In [9]:
custom_stop_list = []
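
One way to find candidates for this list is to look for tokens that occur in a large share of the documents. The following is a minimal sketch under that assumption; `stop_word_candidates`, its `min_doc_fraction` parameter and the toy documents are all hypothetical and not part of the original pipeline.

```python
from collections import Counter

def stop_word_candidates(tokenised_docs, min_doc_fraction=0.5):
    """Return tokens that appear in at least `min_doc_fraction` of the documents."""
    doc_freq = Counter()
    for tokens in tokenised_docs:
        # Count each token once per document (document frequency, not raw count)
        doc_freq.update(set(tokens))
    threshold = min_doc_fraction * len(tokenised_docs)
    return sorted(tok for tok, n in doc_freq.items() if n >= threshold)

# Made-up example documents, already split into lowercase tokens
toy_docs = [
    ["https", "xbox", "series", "reveal"],
    ["https", "nintendo", "switch", "game"],
    ["https", "playstation", "game", "reveal"],
    ["xbox", "game", "https"],
]

candidates = stop_word_candidates(toy_docs, min_doc_fraction=0.75)
print(candidates)  # ['game', 'https']
```

The result is only a list of suggestions. A frequent word such as `game` may still be meaningful for your topics, so each candidate needs a human decision before it goes into `custom_stop_list`.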

In [10]:
# This list was taken from the comments in https://gist.github.com/sebleier/554280
standard_stop_list = ["0o", "0s", "3a", "3b", "3d", "6b", "6o", "a", "a1", "a2", "a3", "a4", "ab", "able", "about", "above", "abst", "ac", "accordance", "according", "accordingly", "across", "act", "actually", "ad", "added", "adj", "ae", "af", "affected", "affecting", "affects", "after", "afterwards", "ag", "again", "against", "ah", "ain", "ain't", "aj", "al", "all", "allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "ao", "ap", "apart", "apparently", "appear", "appreciate", "appropriate", "approximately", "ar", "are", "aren", "arent", "aren't", "arise", "around", "as", "a's", "aside", "ask", "asking", "associated", "at", "au", "auth", "av", "available", "aw", "away", "awfully", "ax", "ay", "az", "b", "b1", "b2", "b3", "ba", "back", "bc", "bd", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "bi", "bill", "biol", "bj", "bk", "bl", "bn", "both", "bottom", "bp", "br", "brief", "briefly", "bs", "bt", "bu", "but", "bx", "by", "c", "c1", "c2", "c3", "ca", "call", "came", "can", "cannot", "cant", "can't", "cause", "causes", "cc", "cd", "ce", "certain", "certainly", "cf", "cg", "ch", "changes", "ci", "cit", "cj", "cl", "clearly", "cm", "c'mon", "cn", "co", "com", "come", "comes", "con", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn", "couldnt", "couldn't", "course", "cp", "cq", "cr", "cry", "cs", "c's", "ct", "cu", "currently", "cv", "cx", "cy", "cz", "d", "d2", "da", "date", "dc", "dd", "de", "definitely", "describe", "described", "despite", "detail", "df", "di", "did", "didn", "didn't", 
"different", "dj", "dk", "dl", "do", "does", "doesn", "doesn't", "doing", "don", "done", "don't", "down", "downwards", "dp", "dr", "ds", "dt", "du", "due", "during", "dx", "dy", "e", "e2", "e3", "ea", "each", "ec", "ed", "edu", "ee", "ef", "effect", "eg", "ei", "eight", "eighty", "either", "ej", "el", "eleven", "else", "elsewhere", "em", "empty", "en", "end", "ending", "enough", "entirely", "eo", "ep", "eq", "er", "es", "especially", "est", "et", "et-al", "etc", "eu", "ev", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "ey", "f", "f2", "fa", "far", "fc", "few", "ff", "fi", "fifteen", "fifth", "fify", "fill", "find", "fire", "first", "five", "fix", "fj", "fl", "fn", "fo", "followed", "following", "follows", "for", "former", "formerly", "forth", "forty", "found", "four", "fr", "from", "front", "fs", "ft", "fu", "full", "further", "furthermore", "fy", "g", "ga", "gave", "ge", "get", "gets", "getting", "gi", "give", "given", "gives", "giving", "gj", "gl", "go", "goes", "going", "gone", "got", "gotten", "gr", "greetings", "gs", "gy", "h", "h2", "h3", "had", "hadn", "hadn't", "happens", "hardly", "has", "hasn", "hasnt", "hasn't", "have", "haven", "haven't", "having", "he", "hed", "he'd", "he'll", "hello", "help", "hence", "her", "here", "hereafter", "hereby", "herein", "heres", "here's", "hereupon", "hers", "herself", "hes", "he's", "hh", "hi", "hid", "him", "himself", "his", "hither", "hj", "ho", "home", "hopefully", "how", "howbeit", "however", "how's", "hr", "hs", "http", "hu", "hundred", "hy", "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ibid", "ic", "id", "i'd", "ie", "if", "ig", "ignored", "ih", "ii", "ij", "il", "i'll", "im", "i'm", "immediate", "immediately", "importance", "important", "in", "inasmuch", "inc", "indeed", "index", "indicate", "indicated", "indicates", "information", "inner", "insofar", "instead", "interest", "into", "invention", "inward", "io", "ip", "iq", "ir", "is", 
"isn", "isn't", "it", "itd", "it'd", "it'll", "its", "it's", "itself", "iv", "i've", "ix", "iy", "iz", "j", "jj", "jr", "js", "jt", "ju", "just", "k", "ke", "keep", "keeps", "kept", "kg", "kj", "km", "know", "known", "knows", "ko", "l", "l2", "la", "largely", "last", "lately", "later", "latter", "latterly", "lb", "lc", "le", "least", "les", "less", "lest", "let", "lets", "let's", "lf", "like", "liked", "likely", "line", "little", "lj", "ll", "ll", "ln", "lo", "look", "looking", "looks", "los", "lr", "ls", "lt", "ltd", "m", "m2", "ma", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "mightn", "mightn't", "mill", "million", "mine", "miss", "ml", "mn", "mo", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "ms", "mt", "mu", "much", "mug", "must", "mustn", "mustn't", "my", "myself", "n", "n2", "na", "name", "namely", "nay", "nc", "nd", "ne", "near", "nearly", "necessarily", "necessary", "need", "needn", "needn't", "needs", "neither", "never", "nevertheless", "new", "next", "ng", "ni", "nine", "ninety", "nj", "nl", "nn", "no", "nobody", "non", "none", "nonetheless", "noone", "nor", "normally", "nos", "not", "noted", "nothing", "novel", "now", "nowhere", "nr", "ns", "nt", "ny", "o", "oa", "ob", "obtain", "obtained", "obviously", "oc", "od", "of", "off", "often", "og", "oh", "oi", "oj", "ok", "okay", "ol", "old", "om", "omitted", "on", "once", "one", "ones", "only", "onto", "oo", "op", "oq", "or", "ord", "os", "ot", "other", "others", "otherwise", "ou", "ought", "our", "ours", "ourselves", "out", "outside", "over", "overall", "ow", "owing", "own", "ox", "oz", "p", "p1", "p2", "p3", "page", "pagecount", "pages", "par", "part", "particular", "particularly", "pas", "past", "pc", "pd", "pe", "per", "perhaps", "pf", "ph", "pi", "pj", "pk", "pl", "placed", "please", "plus", "pm", "pn", "po", "poorly", "possible", "possibly", "potentially", "pp", "pq", "pr", "predominantly", "present", 
"presumably", "previously", "primarily", "probably", "promptly", "proud", "provides", "ps", "pt", "pu", "put", "py", "q", "qj", "qu", "que", "quickly", "quite", "qv", "r", "r2", "ra", "ran", "rather", "rc", "rd", "re", "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "research-articl", "respectively", "resulted", "resulting", "results", "rf", "rh", "ri", "right", "rj", "rl", "rm", "rn", "ro", "rq", "rr", "rs", "rt", "ru", "run", "rv", "ry", "s", "s2", "sa", "said", "same", "saw", "say", "saying", "says", "sc", "sd", "se", "sec", "second", "secondly", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "sf", "shall", "shan", "shan't", "she", "shed", "she'd", "she'll", "shes", "she's", "should", "shouldn", "shouldn't", "should've", "show", "showed", "shown", "showns", "shows", "si", "side", "significant", "significantly", "similar", "similarly", "since", "sincere", "six", "sixty", "sj", "sl", "slightly", "sm", "sn", "so", "some", "somebody", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "sp", "specifically", "specified", "specify", "specifying", "sq", "sr", "ss", "st", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", "sufficiently", "suggest", "sup", "sure", "sy", "system", "sz", "t", "t1", "t2", "t3", "take", "taken", "taking", "tb", "tc", "td", "te", "tell", "ten", "tends", "tf", "th", "than", "thank", "thanks", "thanx", "that", "that'll", "thats", "that's", "that've", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof", "therere", "theres", "there's", "thereto", "thereupon", "there've", "these", "they", "theyd", "they'd", "they'll", "theyre", "they're", "they've", 
"thickv", "thin", "think", "third", "this", "thorough", "thoroughly", "those", "thou", "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "ti", "til", "tip", "tj", "tl", "tm", "tn", "to", "together", "too", "took", "top", "toward", "towards", "tp", "tq", "tr", "tried", "tries", "truly", "try", "trying", "ts", "t's", "tt", "tv", "twelve", "twenty", "twice", "two", "tx", "u", "u201d", "ue", "ui", "uj", "uk", "um", "un", "under", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "uo", "up", "upon", "ups", "ur", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ut", "v", "va", "value", "various", "vd", "ve", "ve", "very", "via", "viz", "vj", "vo", "vol", "vols", "volumtype", "vq", "vs", "vt", "vu", "w", "wa", "want", "wants", "was", "wasn", "wasnt", "wasn't", "way", "we", "wed", "we'd", "welcome", "well", "we'll", "well-b", "went", "were", "we're", "weren", "werent", "weren't", "we've", "what", "whatever", "what'll", "whats", "what's", "when", "whence", "whenever", "when's", "where", "whereafter", "whereas", "whereby", "wherein", "wheres", "where's", "whereupon", "wherever", "whether", "which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "who'll", "whom", "whomever", "whos", "who's", "whose", "why", "why's", "wi", "widely", "will", "willing", "wish", "with", "within", "without", "wo", "won", "wonder", "wont", "won't", "words", "world", "would", "wouldn", "wouldnt", "wouldn't", "www", "x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y", "y2", "yes", "yet", "yj", "yl", "you", "youd", "you'd", "you'll", "your", "youre", "you're", "yours", "yourself", "yourselves", "you've", "yr", "ys", "yt", "z", "zero", "zi", "zz"]

In [11]:
# Adding the list of custom user defined stop words to the standard list of stop words
stop_list = custom_stop_list + standard_stop_list

# Updates spaCy's default stop words list with additional words. 
nlp.Defaults.stop_words.update(stop_list)

# Iterates over the updated stop words list and sets the "is_stop" flag on each word's lexeme.
for word in nlp.Defaults.stop_words:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

In [12]:
def lemmatizer(doc):
    # This takes in a doc of tokens (after the NER step) and lemmatizes them.
    # Pronouns (like "I" and "you") get lemmatized to '-PRON-', so I'm removing those.
    doc = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
    doc = u' '.join(doc)
    return nlp.make_doc(doc)

In [13]:
def remove_stopwords(doc):
    # This will remove stopwords and punctuation.
    # Use token.text to return strings, which we'll need for Gensim.
    doc = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return doc

In [14]:
# The add_pipe function adds our functions to the default pipeline (passing plain functions like this is the spaCy 2.x API).
nlp.add_pipe(lemmatizer,name='lemmatizer',after='ner')
nlp.add_pipe(remove_stopwords, name="stopwords", last=True)

In [15]:
doc_list = []

# Iterates through each article in the corpus.
for doc in tqdm(newest_doc):
    # Passes that article through the pipeline and adds to a new list.
    pr = nlp(doc)
    doc_list.append(pr)


In [20]:
# This is what each post now looks like
print(' '.join(map(str, doc_list[0])))

CIAS SMARTPHONE Warner Bros Games NetEase anunciam Lord Rings rise War Chegar para Android iOS ainda gigante chinesa NetEase Warner Bros Interactive Entertainment WB Games anunciaram uma parceria para produ Lord Rings rise war novo jogo baseado famosa saga JRR Tolkien TLoTR rise War ser desenvolvido por dio NetEase sede Hong Kong momento infelizmente muitos detalhes sobre projeto parceria permitir NetEase trabalhar uma marca alto prest gio para mercado ocidental depois colabora Marvel Games Pokemon Company dos ltimos anos deram origem jogos para celular sucesso NetEase ganhando crescente participa mercado Ocidente depois conquistar China objetivo agora expandir para Europa ricas Nas ltimas semanas empresa tamb fundou dio desenvolvimento Jap especializado produ jogos AAA para Xbox Series PlayStation novo jogo para celular Lord Rings sendo trabalhado para Android iOS deve ser lan ado ainda este ano gratuitamente suporte para microtransa Fonte Eye
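
Truncated tokens in this output, such as `produ`, `ltimos` and `prest gio`, are a side effect of `preprocess`: the regex replaces accented letters with spaces, so Portuguese, Spanish or French words get cut apart. If that matters for your data, one possible alternative (an assumption, not part of the original pipeline) is a Unicode-aware pattern that keeps letters from any script. `preprocess_unicode` is a hypothetical name:

```python
import re
import unicodedata

# [\W\d_] matches anything that is not a letter: non-word characters, digits and underscores
NON_LETTERS = re.compile(r'[\W\d_]+')

def preprocess_unicode(sentence):
    # Compose characters first so that "e" + combining accent becomes a single letter
    normalised = unicodedata.normalize('NFC', sentence)
    return ' '.join(NON_LETTERS.sub(' ', normalised).split())

# Escapes are used for the accented letters: \u00e9 = e acute, \u00e7 = c cedilla, \u00e3 = a tilde
cleaned = preprocess_unicode("Pok\u00e9mon: produ\u00e7\u00e3o de jogos AAA em 2020!")
print(cleaned)  # Pokémon produção de jogos AAA em
```

This keeps every script, which is broader than the original regex, and the English spaCy model and stop word list remain English-only, so non-English posts would likely still need separate handling.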


## <b>The Gensim Part</b>
To work on text documents, Gensim requires the words (aka tokens) to be converted to unique ids. To achieve that, Gensim lets you create a Dictionary object<br>
that maps each word to a unique id.<br>
<br>
The Dictionary object is typically used to create a ‘bag of words’ Corpus. This Dictionary and the bag-of-words Corpus are the inputs to topic modelling and<br>
other models that Gensim specializes in.
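
To make the two ideas concrete, here is a minimal pure-Python sketch of what a token-to-id dictionary and a bag of words look like. It is an illustration only: `build_dictionary` and `doc_to_bow` are hypothetical helpers, and Gensim's own `Dictionary` may number the tokens differently.

```python
from collections import Counter

def build_dictionary(tokenised_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in tokenised_docs:
        for token in tokens:
            # len(token2id) is the next free id; existing tokens keep their id
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the dictionary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Made-up documents, already tokenised
toy_docs = [["xbox", "game", "game"], ["nintendo", "switch", "game"]]
token2id = build_dictionary(toy_docs)
bows = [doc_to_bow(doc, token2id) for doc in toy_docs]

print(token2id)  # {'xbox': 0, 'game': 1, 'nintendo': 2, 'switch': 3}
print(bows)      # [[(0, 1), (1, 2)], [(1, 1), (2, 1), (3, 1)]]
```

Word order is discarded: each document is reduced to how many times each id occurs, which is the form the topic model consumes.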

In [16]:
# Creates a mapping of word IDs to words.
words = corpora.Dictionary(doc_list)

# Turns each document into a bag of words.
corpus = [words.doc2bow(doc) for doc in doc_list]

## <b>Run the model</b>

In [17]:
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=words,
    num_topics=10, 
    random_state=2,
    update_every=1,
    passes=10,
    alpha='auto',
    per_word_topics=True
)
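
The stated goal is to write a topic back to each post. Once the model is trained, Gensim's `get_document_topics` returns, for a single bag of words, a list of `(topic_id, probability)` pairs. A minimal sketch of picking the dominant topic from such a list follows; `dominant_topic` and its `min_prob` parameter are hypothetical, and the probabilities are made up rather than real model output.

```python
def dominant_topic(topic_probs, min_prob=0.0):
    """Return the (topic_id, probability) pair with the highest probability,
    or None if the list is empty or the best probability is below min_prob."""
    if not topic_probs:
        return None
    best = max(topic_probs, key=lambda pair: pair[1])
    return best if best[1] >= min_prob else None

# Made-up distributions in the shape of (topic_id, probability) pairs
example_posts = [
    [(0, 0.10), (3, 0.72), (5, 0.18)],
    [(1, 0.55), (2, 0.45)],
    [],
]

labels = [dominant_topic(p, min_prob=0.6) for p in example_posts]
print(labels)  # [(3, 0.72), None, None]
```

With the real model, this could presumably be applied as `dominant_topic(lda_model.get_document_topics(bow))` for each `bow` in `corpus`, with the resulting topic ids stored in a new DataFrame column. The `min_prob` cut-off is a design choice: posts with no clearly dominant topic are left unlabelled rather than forced into one.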

## Print the resulting topics

In [18]:
# Print the keywords in the 10 topics
pprint(lda_model.print_topics(num_words=10))

[(0,
  '0.035*"Xbox" + 0.026*"Laptop" + 0.023*"Bluetooth" + 0.019*"Black" + '
  '0.018*"Android" + 0.015*"Windows" + 0.015*"Blue" + 0.015*"iPhone" + '
  '0.015*"Mic" + 0.015*"usb"'),
 (1,
  '0.020*"Xbox" + 0.013*"Series" + 0.012*"para" + 0.008*"PlayStation" + '
  '0.008*"yang" + 0.008*"Sony" + 0.007*"che" + 0.007*"mais" + 0.007*"jogo" + '
  '0.007*"Microsoft"'),
 (2,
  '0.031*"mon" + 0.030*"Pok" + 0.025*"para" + 0.015*"Nintendo" + 0.014*"una" + '
  '0.013*"del" + 0.013*"por" + 0.011*"las" + 0.008*"su" + 0.008*"Switch"'),
 (3,
  '0.043*"game" + 0.029*"Nintendo" + 0.017*"Pokemon" + 0.017*"Switch" + '
  '0.015*"Xbox" + 0.015*"PlayStation" + 0.012*"release" + 0.011*"console" + '
  '0.010*"Sony" + 0.009*"reveal"'),
 (4,
  '0.059*"https" + 0.015*"facebook" + 0.013*"live" + 0.012*"youtube" + '
  '0.012*"XBOX" + 0.011*"fortnite" + 0.010*"twitch" + 0.009*"instagram" + '
  '0.007*"pubg" + 0.007*"Facebook"'),
 (5,
  '0.054*"Nintendo" + 0.041*"Switch" + 0.018*"https" + 0.016*"Xbox" + '
  '0.015*"H