# Gensim

Ideally, we would extract topics from the dataset automatically instead of having to come up with important queries ourselves.
According to its website, Gensim allows us to:
* Bring out hidden structure in the corpus, discover relationships between words, and use them to describe the documents in a new and (hopefully) more semantic way.
* Make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

Its core concepts are the usual NLP ones:

* **Document:** some text.
* **Corpus:** a collection of documents.
* **Vector:** a mathematically convenient representation of a document.
* **Model:** an algorithm for transforming vectors from one representation to another.
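To make the first three concepts concrete, here is a minimal, dependency-free sketch of how documents become a corpus and then sparse bag-of-words vectors. It is only a simplified stand-in for what gensim's `Dictionary` and `doc2bow` do later in this notebook; the toy documents and the helper `to_bow` are hypothetical, and the integer ids gensim assigns may differ.

```python
from collections import Counter

# Hypothetical toy documents, not taken from the dataset
documents = [
    "human machine interface",
    "machine learning survey",
    "human learning",
]

# Corpus: every document as a list of tokens
corpus_tokens = [doc.split() for doc in documents]

# Map each distinct token to an integer id (the role a dictionary plays)
token2id = {}
for tokens in corpus_tokens:
    for token in tokens:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Return a sparse bag-of-words vector as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Vector: one sparse list of (token_id, count) pairs per document
bow_corpus = [to_bow(tokens) for tokens in corpus_tokens]
print(token2id)
print(bow_corpus)
```

A model, in this vocabulary, is whatever then maps these count vectors to another representation, for example per-document topic weights.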


In [37]:
import pandas as pd
import numpy as np
from pprint import pprint  # pretty-printer
from collections import defaultdict
import matplotlib.pyplot as plt
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
import string
import joblib
from pathlib import Path
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from gensim import corpora
from gensim import models
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)
pd.set_option('display.max_colwidth', 0)


nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

#Define the name of the csv file you are loading

file_name = 'Eluvio_DS_Challenge.csv'

[nltk_data] Downloading package stopwords to /home/george/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/george/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/george/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Preprocessing Document 

* Tokenize the documents and lowercase the tokens using nltk.
* Remove words found in the nltk stopword list, and remove punctuation. Stopwords add little meaning to a sentence, so we can safely remove them.
* Lemmatize the remaining tokens.
* Remove one-letter lemmas, because the letter "u" kept popping up. Sometimes words have to be removed manually like this.
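The punctuation and one-letter steps above can be illustrated on their own. The sketch below uses hypothetical tokens, not real tokenizer output, to show why a second punctuation pass is useful: a membership test against `string.punctuation` only catches tokens that are exactly one punctuation character, so multi-character tokens such as `...` or `--` survive it.

```python
import string

# Hypothetical tokens resembling what a tokenizer might emit for a headline
tokens = ["breaking", "...", "u", "``", "u.s.", "--", "talks", "!"]

single_marks = set(string.punctuation)

# Pass 1: drops only tokens that are exactly one punctuation character
pass1 = [t for t in tokens if t not in single_marks]

# Pass 2: drops tokens in which every character is punctuation
pass2 = [t for t in pass1 if not all(ch in single_marks for ch in t)]

# Final step: drop one-character tokens such as "u"
cleaned = [t for t in pass2 if len(t) > 1]

print(pass1)
print(cleaned)
```

Tokens that merely contain punctuation, such as `u.s.`, are kept, because not all of their characters are punctuation.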

In [38]:
from gensim.models import Phrases

wordnet_lemmatizer = WordNetLemmatizer()

# Build the exclusion set once; set lookups are much cheaper than
# rebuilding and scanning a list for every token
stop_words = set(stopwords.words('english')) | set(string.punctuation)


def thorough_filter(words):
    """Drop tokens made up entirely of punctuation, e.g. '...' or '--'."""
    return [word for word in words
            if not all(letter in string.punctuation for letter in word)]


bigram = Phrases()  # its vocabulary is filled incrementally in preprocess_document


def preprocess_document(document):
    words = nltk.word_tokenize(document.lower())  # tokenize and lowercase
    words = [word for word in words if word not in stop_words]  # drop stopwords and single punctuation marks
    words = thorough_filter(words)  # drop multi-character punctuation tokens
    words = [wordnet_lemmatizer.lemmatize(word) for word in words]  # we chose to lemmatize instead of stemming
    words = [word for word in words if len(word) > 1]  # drop one-letter lemmas such as "u"
    bigram.add_vocab([words])  # update the bigram counts with this document
    return words


def serialize_corpus(corpus, path):
    """Save a bag-of-words corpus to disk in Matrix Market format."""
    corpora.MmCorpus.serialize(path, corpus)
 



We could also have split the data by age, but I found it more important to keep all of the data to reduce the risk of underfitting.

In [39]:
docs = pd.read_csv(file_name)  # the file could also be read in chunks
texts = [preprocess_document(doc) for doc in docs["title"]]

In [40]:
bigrams=list(bigram[texts])
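To give a feel for what the bigram step does, here is a dependency-free sketch of count-based phrase detection: adjacent tokens are joined with `_` when a co-occurrence score exceeds a threshold. This is a simplified stand-in, not gensim's implementation. The sentences, the helpers `bigram_score` and `merge_bigrams`, and the `min_count=1` and `threshold=1.0` values are all hypothetical and chosen so the toy data produces a merge; gensim's exact vocabulary bookkeeping and its defaults differ.

```python
from collections import Counter

# Hypothetical tokenized titles, not taken from the dataset
sentences = [
    ["north", "korea", "talks"],
    ["north", "korea", "missile", "test"],
    ["north", "korea", "sanctions"],
    ["south", "africa", "talks"],
]

unigrams = Counter(tok for sent in sentences for tok in sent)
pairs = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))


def bigram_score(a, b, min_count=1):
    """(count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    if unigrams[a] == 0 or unigrams[b] == 0:
        return 0.0  # unseen tokens can never form a phrase
    return (pairs[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])


def merge_bigrams(sent, threshold=1.0, min_count=1):
    """Greedily join adjacent tokens whose score exceeds the threshold."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and bigram_score(sent[i], sent[i + 1], min_count) > threshold:
            out.append(sent[i] + "_" + sent[i + 1])
            i += 2  # skip the token that was just merged
        else:
            out.append(sent[i])
            i += 1
    return out


print([merge_bigrams(sent) for sent in sentences])
```

Pairs that co-occur only once score zero here, so only the frequent pair is merged. On short titles with the library defaults, relatively few pairs may pass the threshold, which is worth checking before relying on the bigrams.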

In [41]:
dictionary = corpora.Dictionary(bigrams)
corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in bigrams]
#serialize_corpus(corpus, "~/Downloads/eluvio_gensim/corpus_02.mm")

### pyLDAvis


In [42]:
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=20)
pyLDAvis.enable_notebook()
# To save or load the model we can use the following commands:
# lda_model.save("name.lda")
# loaded_lda_model = models.LdaModel.load("name.lda")

In [43]:
lda_viz = gensimvis.prepare(lda_model, corpus, dictionary)

In [44]:
pyLDAvis.display(lda_viz)

Hugging Face could also have been used, because it supports numerous transformers as well as pre-trained models. As I understand it, spaCy implements these too. Training transformers for this dataset seems excessive, though, and transformers are expensive to train in general.

I also decided not to do any predictive analysis on upvotes because, in my opinion, upvotes are not that important here; it is not as if we were trying to predict sales. For example, we might predict that a post will get few upvotes, perhaps because its title is boring. However, a company might be very interested in the type of event the post is about, and excluding that post from the analysis because of its low predicted upvotes could be detrimental for a certain use case. Upvotes would also always have to be compared to the total number of users, but that metric is vague too: only a small number of users might be interested in a certain "topic", yet that does not mean the subject is unimportant. If we want to see which topic keywords are "hot", we can simply sort topic keywords by upvotes, or aggregate the upvotes of the posts containing each topic and sort on that.

Transformers could have been very useful for sentiment analysis, combined with splitting the corpus by age. However, I think that would be more appropriate for a larger dataset.
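The upvote aggregation mentioned above could look roughly like the sketch below. It assumes each post has already been assigned a single dominant topic id, for instance by taking the highest-probability entry from `lda_model.get_document_topics(bow)`; the `posts` pairs here are hypothetical values, not results from the dataset.

```python
from collections import defaultdict

# Hypothetical (dominant_topic_id, up_votes) pairs, one per post
posts = [(0, 120), (1, 15), (0, 30), (2, 400), (1, 5)]

totals = defaultdict(int)  # summed upvotes per topic
counts = defaultdict(int)  # number of posts per topic
for topic_id, up_votes in posts:
    totals[topic_id] += up_votes
    counts[topic_id] += 1

# Rank topics by total upvotes, highest first
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for topic_id, total in ranking:
    print(topic_id, total, round(total / counts[topic_id], 1))
```

Printing the mean next to the total helps separate topics that are "hot" because of many posts from those driven by a few highly upvoted ones.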