# Gensim

It would be ideal if we could extract topics from the dataset automatically, without having to come up with important queries ourselves.
According to its website, Gensim allows us to:
* Bring out hidden structure in the corpus, discover relationships between words, and use them to describe the documents in a new and (hopefully) more semantic way.
* Make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

Its core concepts are the usual NLP ones:

* **Document:** some text.
* **Corpus:** a collection of documents.
* **Vector:** a mathematically convenient representation of a document.
* **Model:** an algorithm for transforming vectors from one representation to another.


In [83]:
import pandas as pd
import numpy as np
from pprint import pprint  # pretty-printer
from collections import defaultdict
import matplotlib.pyplot as plt
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
import string
import joblib
from pathlib import Path
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from gensim import corpora
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)
pd.set_option('max_colwidth', 180)
pd.set_option('display.max_colwidth', 0)


nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

#Define the name of the csv file you are loading

file_name = 'Eluvio_DS_Challenge.csv'

[nltk_data] Downloading package stopwords to /home/george/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/george/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/george/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Preprocessing Document 

* Tokenize the documents and lowercase the tokens using nltk.
* Remove words that appear in nltk's stopword list, and remove punctuation. Stopwords are words that add little meaning to a sentence, so we can safely remove them.
* Lemmatize the remaining tokens.
* Remove one-letter lemmas, because I had an issue with the letter "u" popping up. A lot of the time, some words will still need to be removed manually.

In [84]:
import time
from smart_open import open
from gensim import models
wordnet_lemmatizer = WordNetLemmatizer()

def thorough_filter(words):
    """Drop tokens made up entirely of punctuation, e.g. '...' or '--'."""
    filtered_words = []
    for word in words:
        # Flag each character as punctuation or not
        pun = [letter in string.punctuation for letter in word]
        if not all(pun):
            filtered_words.append(word)
    return filtered_words

def preprocess_document(document):
    # Build the exclusion set once per document instead of once per token
    excluded = set(stopwords.words('english')) | set(string.punctuation)

    words = nltk.word_tokenize(document.lower())  # tokenize the lowercased document
    words = [word for word in words if word not in excluded]  # drop stopwords and single punctuation marks
    words = thorough_filter(words)  # drop multi-character punctuation tokens such as '...'
    words = [wordnet_lemmatizer.lemmatize(word) for word in words]  # we chose to lemmatize instead of stemming
    words = [word for word in words if len(word) > 1]  # drop one-letter lemmas
    return words

def serialize_corpus(corpus, path):
    # Save the corpus to disk in Matrix Market format
    corpora.MmCorpus.serialize(path, corpus)

docs = pd.read_csv(file_name)  # for large files, pd.read_csv can also read in chunks
texts = [preprocess_document(doc) for doc in docs["title"]]
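
To make the role of the extra punctuation filter clearer, here is a minimal standalone sketch. The helper name drop_punctuation_tokens and the sample tokens are made up for illustration. It shows why checking single punctuation characters is not enough: a tokenizer can emit multi-character tokens such as "..." or "``", which are only caught by testing every character of the token.

```python
import string

def drop_punctuation_tokens(tokens):
    """Keep tokens that contain at least one non-punctuation character."""
    return [
        token for token in tokens
        if not all(ch in string.punctuation for ch in token)
    ]

# Hypothetical tokens, similar to what a tokenizer may produce for a news title
tokens = ["u.s.", "...", "--", "air", "strike", "''", "``"]
filtered = drop_punctuation_tokens(tokens)
print(filtered)
```

Tokens that mix letters and punctuation, like "u.s.", are kept, while tokens that consist only of punctuation are dropped.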


* The dictionary maps each unique token to an integer id, and the "bag-of-words" representation of each document is built on top of it.

**(Wikipedia)**:

The **bag-of-words** model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

To convert the tokenized documents to vectors we use the dictionary's doc2bow function. We can also save the corpus so that it is easy to load again.
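
As a rough illustration of what the bag-of-words conversion produces, here is a plain-Python sketch that does not use gensim. The names token2id and to_bow and the two toy documents are assumptions made for this example, and the ids gensim assigns to tokens may differ from the ones below.

```python
from collections import Counter

# Two already-tokenized toy documents (hypothetical data)
texts = [
    ["police", "attack", "city"],
    ["air", "strike", "syria", "air"],
]

# Token -> integer id mapping, the role played by the dictionary
token2id = {}
for doc in texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in texts]
print(bow_corpus)
```

Word order is lost, but multiplicity is kept: "air" appears twice in the second document, so its id is paired with a count of 2.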

In [85]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in texts]
#serialize_corpus(corpus,"~/Downloads/eluvio_gensim/corpus_01.mm")

### pyLDAvis
pyLDAvis is a tool for visualizing LDA models, and it integrates with gensim. Latent Dirichlet Allocation (LDA) is basically a transformation of the initial vector space into a space of lower dimensionality (topics). Each topic is a probability distribution over words, so the highest-probability words in a topic show us related terms to extract information from. For this dataset, this way of finding relationships seems much more efficient than spaCy's dependency matches (although, as mentioned before, those can be very useful for specific queries). We are not just finding relationships within one sentence; we relate words across the whole dataset. Visualization methods like pyLDAvis would be very useful for business purposes, and the process is automated, without the need for queries.
According to gensim:
https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py

**gensim uses a fast implementation of online LDA parameter estimation based on [2], modified to run in distributed mode on a cluster of computers.**

So LDA would be very handy for large companies like Eluvio that are able to run the algorithm on a cluster of computers, as they need to find relationships in very large datasets.

**Latent Semantic Indexing**
Latent Semantic Indexing (LSI) is another similar method that can be used, and it also allows for "online training". It would be very useful and interesting to write a custom visualization method similar to pyLDAvis if, after testing, LSI provides better associations for Eluvio's business use cases.

In [86]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=20)
pyLDAvis.enable_notebook()
# corpus_lsi = lsi_model[corpus_tfidf]
# To save or load models we can use the following commands:
# lsi_model.save("name.lsi")
# loaded_lsi_model = models.LsiModel.load("name.lsi")



In [87]:
lda_viz = gensimvis.prepare(lda_model, corpus, dictionary)

In [88]:
pyLDAvis.display(lda_viz)

I experimented with many different values for the number of topics. Very high values threw an error, and values above 40 gave clusters that sat very close together in the intertopic distance map, so I decided to reduce the number of topics further. A value of around 20 topics grouped the data into clearly identifiable clusters.
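
To give some intuition for why overlapping topics end up close together on the map, here is a small sketch. As far as I understand, the intertopic distance map is based by default on the Jensen-Shannon divergence between the topics' word distributions, which is then projected to two dimensions. The three distributions below are made up, and the code only illustrates the distance itself, not the projection or pyLDAvis's actual implementation.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded distance-like measure."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a 4-word vocabulary
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]  # almost the same words as topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]  # dominated by different words

near = js_divergence(topic_a, topic_b)
far = js_divergence(topic_a, topic_c)
print(f"a vs b: {near:.4f}, a vs c: {far:.4f}")
```

When the number of topics is too high, many topics share most of their probable words, so their pairwise divergences are small and their circles overlap.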

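* Saliency measures how informative a term is overall, i.e. how much seeing the term tells us about which topic is being discussed.
* Relevance measures how relevant term w is to topic t. The λ slider weights the term's probability within the topic against its lift (that probability divided by the term's overall probability in the corpus).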
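
The relevance metric can be written as λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), following the definition used by the LDAvis authors. Below is a minimal sketch of how λ changes the ranking of terms. The probabilities and the name toy_terms are made up for illustration and do not come from the model trained above.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term to a topic for a given lambda in [0, 1]."""
    lift = p_w_given_t / p_w
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(lift)

# Made-up probabilities: term -> (p(w | t), p(w))
toy_terms = {
    "said": (0.050, 0.040),   # frequent in the topic, but frequent everywhere
    "syria": (0.030, 0.002),  # less frequent, but specific to the topic
}

rankings = {}
for lam in (1.0, 0.77, 0.0):
    rankings[lam] = sorted(
        toy_terms, key=lambda w: relevance(*toy_terms[w], lam), reverse=True
    )
    print(lam, rankings[lam])
```

With these numbers, λ = 1 ranks the generic term "said" first, while λ = 0.77 and λ = 0 rank the topic-specific term "syria" first. Lowering λ is what pushes topic-specific terms to the top of the bar chart.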

### Some good example cluster results (for a lambda relevance metric of around 0.77):
* **Topic 1:** syria, pakistan, iran, and air strike get clustered together, so we can clearly see an association with air strikes occurring in these countries.
* **Topic 2:** police and attack; probably a lot of posts about police and attacks.
* **Topic 7:** election, power, want, take; closely related terms.
* **Topic 20:** EU and refugees get clustered together.

### gensim-data
Another interesting idea is to use gensim's repository of pretrained models and datasets to extract more information from a number of relevant terms for each topic.

In [89]:
import gensim.downloader as api
info = api.info()
model = api.load("word2vec-google-news-300")  # download the model and return as object ready for use


In [90]:
terms=["iraq","un","obama","quake","jewish","hamas","troops","minister"]
for term in terms:
    print("Term: "+term)
    print(model.most_similar(term))
    print("")

Term: iraq
[('afghanistan', 0.7293226718902588), ('iraqi', 0.6724763512611389), ('Afganistan', 0.6447621583938599), ('afganistan', 0.6384879350662231), ('Iraq', 0.6277869939804077), ('iran', 0.6082879900932312), ('libya', 0.6052849292755127), ('vietnam', 0.6033740639686584), ('cheney', 0.6009483337402344), ('iraqis', 0.5914426445960999)]

Term: un
[('software_libero_rilasciato', 0.529174268245697), ('questionably', 0.4934689998626709), ('sotto_licenza_GNU_GPL', 0.4834793508052826), ('dis', 0.46554914116859436), ('superfluously', 0.46098390221595764), ('dubiously', 0.4599626958370209), ('très', 0.4597243368625641), ('revoltingly', 0.45808321237564087), ('um', 0.4571697413921356), ('de_merde', 0.4546312391757965)]

Term: obama
[('mccain', 0.7319012880325317), ('hillary', 0.7284600138664246), ('obamas', 0.7229632139205933), ('george_bush', 0.7205674648284912), ('barack_obama', 0.7045838832855225), ('palin', 0.7043113708496094), ('clinton', 0.6934447884559631), ('clintons', 0.6816835403442

For example, if the token quake starts showing up, we can expect aftershocks in the region, as the word quake is related to aftershock. We can also see that hamas is related to words like palestinian, Israel, and gaza. These words also show up in our visualization, which supports our clustering.
We can also relate these tokens back to their original posts, so we would be able to assign tags to these posts using the most related terms and the similar keywords produced by the pretrained models from gensim-data.
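
The similarity scores above are cosine similarities between word vectors. Here is a minimal sketch of that ranking using made-up 3-dimensional vectors (the real embeddings have 300 dimensions). The names toy_vectors and most_similar_toy are assumptions for this example and not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen so that quake and aftershock point the same way
toy_vectors = {
    "quake": [0.9, 0.1, 0.0],
    "aftershock": [0.8, 0.2, 0.1],
    "election": [0.0, 0.2, 0.9],
}

def most_similar_toy(term, topn=2):
    """Rank all other terms by cosine similarity to `term`."""
    query = toy_vectors[term]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in toy_vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_toy("quake"))
```

Tagging a post could then amount to looking up the most similar terms for each of its salient tokens and keeping those above some similarity cutoff, which would have to be tuned on real data.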

In [91]:
glove_model = api.load("glove-wiki-gigaword-300")  # GloVe vectors trained on Wikipedia + Gigaword
for term in terms:
    print("Term: "+term)
    print(glove_model.most_similar(term))
    print("")

Term: iraq
[('iraqi', 0.8006436228752136), ('baghdad', 0.7278226017951965), ('iraqis', 0.7095191478729248), ('saddam', 0.7028290629386902), ('afghanistan', 0.6704175472259521), ('kuwait', 0.6445066332817078), ('hussein', 0.6407447457313538), ('u.s.-led', 0.5943576097488403), ('troops', 0.5887258052825928), ('iran', 0.5829001665115356)]

Term: un
[('u.n.', 0.8638535141944885), ('annan', 0.6368896961212158), ('nations', 0.6264333128929138), ('kofi', 0.5853826403617859), ('peacekeeping', 0.5739530920982361), ('u.n', 0.5699392557144165), ('peacekeepers', 0.5560168027877808), ('envoy', 0.5389721393585205), ('humanitarian', 0.5304235816001892), ('resolution', 0.5192335247993469)]

Term: obama
[('barack', 0.9254721999168396), ('mccain', 0.7590768337249756), ('bush', 0.7570988535881042), ('clinton', 0.7085603475570679), ('hillary', 0.6497915387153625), ('kerry', 0.6144053339958191), ('rodham', 0.6138635277748108), ('biden', 0.5940852165222168), ('gore', 0.5885976552963257), ('democrats', 0.560

# Conclusion
**Clustering** data with machine learning is one of the most powerful tools available to large companies, as these companies have access to huge amounts of data to train and test their models. This requires a considerable amount of computational resources (a cluster of computers), so finding methods that reduce these costs while maintaining, or even improving, the ability to extract important information from a dataset must be a top priority.

Another problem is what to do with new data coming in. LSI partly solves this problem, as it can receive new data and keep functioning, but I haven't seen anything similar being done for vision datasets. My thesis on spiking neural networks (on event-based sensors) aims to use a method called **evolving spiking neural networks** (with training algorithms from the book: https://www.springer.com/gp/book/9783662577134), which gives a neural network the ability to automatically "create" new classification nodes as data come in. First, it compares the incoming data with the previously learned data, and if the new data is not similar to any of it, new nodes are generated. It would be very interesting to test this method on larger amounts of data and see how it performs.

Gensim is better than spaCy at reducing the dimensionality of the input space when using algorithms to cluster topics. spaCy also uses vectors and neural network models, but unfortunately it doesn't offer topic clustering.
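
The "create a new node when the data looks unfamiliar" idea mentioned above can be sketched in a few lines. This is not a spiking neural network and not the algorithm from the book. It is only a simplified prototype-based illustration, and the distance measure, the threshold, and the running-mean merge rule are all assumptions chosen for this example.

```python
import math

def assign_or_create(prototypes, counts, sample, threshold):
    """Merge `sample` into its nearest node, or create a new node if none is close."""
    best_index, best_distance = None, math.inf
    for index, prototype in enumerate(prototypes):
        distance = math.dist(prototype, sample)
        if distance < best_distance:
            best_index, best_distance = index, distance

    if best_index is None or best_distance > threshold:
        # Nothing similar has been seen before: create a new node
        prototypes.append(list(sample))
        counts.append(1)
        return len(prototypes) - 1

    # Similar to an existing node: update that node's running mean
    counts[best_index] += 1
    n = counts[best_index]
    prototypes[best_index] = [
        p + (s - p) / n for p, s in zip(prototypes[best_index], sample)
    ]
    return best_index

prototypes, counts = [], []
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
assignments = [
    assign_or_create(prototypes, counts, sample, threshold=1.0)
    for sample in stream
]
print(assignments, len(prototypes))
```

In this toy stream the third sample is far from everything seen so far, so a second node is created, while the remaining samples are absorbed by the existing nodes.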