<a href="https://colab.research.google.com/github/CALDISS-AAU/sdsphd21/blob/master/notebooks/Unsupervised_ML_and_NLP_caldissNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Moving on to text-vectorization

Before we dive into text vectorization and representation, a few words on spaCy and part-of-speech tagging. We will be using spaCy as our main pre-processing tool, as it combines most of the things that we need in one convenient API.

![](https://assets.vogue.com/photos/613fd623fdb2d0f49a50f0a5/master/w_2560%2Cc_limit/GettyImages-1340129801.jpeg)


In [None]:
!pip install pandas --upgrade

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Here is a long article text - again
article = """Lil Nas X—Montero, Monte, or Nas to the friends he still has—knew before I even arrived at the San Vicente Bungalows in West Hollywood that he was where he was meant to be. No one had to tell him who I was. The universe had informed him, as he arrived and saw his lucky numbers—7 and 9—on the license plate of the car in front of him, that his future depended on him being here, right now. I, unfortunately, am not fluent in the universe. I’ve never felt I heard it speak to me, even as so many of my intuitions led me down paths toward great fortune. So as I arrived to the elite social club, which at one point was a haven for WeHo’s seediest gay hookups, I had no idea where I was meant to be. I only knew that after getting on an airplane, hungover in a post–Tony Awards weekend haze, I didn’t particularly want to be anywhere, let alone with our country’s biggest pop star, a young man a decade younger than me, interviewing him about his many successes."""

In [None]:
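import spacy
# spaCy v3 removed the "en" shortcut, so we load the small English model by its full name
# (if it is missing, install it with: !python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")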

# Let's apply the model to the article (as easy as that)
article_nlp = nlp(article)

In [None]:
# spaCy also splits the text into sentences
list(article_nlp.sents)[:10]

In [None]:
# And: it will also annotate them with POS-labels
sentance = [sentance for sentance in article_nlp.sents][4]
[(token.text, token.pos_) for token in sentance]

In [None]:
# We can use this to extract only tokens that we think bear most of the meaning
[token.text for token in sentance if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] and not token.is_stop] 

In [None]:
# Also, we can create lemmas, thus reducing heterogeneity in the vocabulary without sacrificing much meaning

[tok.lemma_ for tok in nlp("a text about innovations and all kinds things which goes nowhere")]

In [None]:
# Isn't that great?

[token.lemma_.lower() for token in sentance if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] and not token.is_stop] 

Thus we have created a representation of a text that is probably as "minimal" as possible, maximising meaning and minimising "noise".

In the last part of this notebook we will use such representations to explore the content of text collections.

### Bag of words model

For a computer to work with text, we need to find a useful numerical representation.

If you need to compare different texts, e.g. articles, you will probably go for keywords. These keywords may come from a keyword list with, for example, 200 different keywords. In that case you could represent each document as a (sparse) vector with 1 for "keyword present" and 0 for "keyword absent".

We can also get a bit more sophisticated and count the number of times each word from our dictionary occurs. For a corpus of documents, that gives us a document-term matrix.

![example](https://i.stack.imgur.com/C1UMs.png)
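
To make the idea concrete, here is a minimal sketch that builds a small document-term matrix by hand, using only the standard library. The `tokenize` helper and its rule (lowercase, alphabetic tokens of at least two characters) are assumptions made for illustration, not the exact rules any particular library applies.

```python
from collections import Counter
import re

docs = [
    "A text about cats.",
    "A text about dogs.",
    "And another text about a dog.",
]

def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep alphabetic runs of 2+ characters
    return re.findall(r"[a-z]{2,}", text.lower())

tokenized = [tokenize(d) for d in docs]

# The vocabulary defines the columns of the matrix
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term, cell = term count
doc_term_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    doc_term_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

Most cells are zero even in this tiny example, which is why such matrices are usually stored in a sparse format.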

Let's try creating a bag of words model from our initial example.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    
     'A text about cats.',
     'A text about dogs.',
     'And another text about a dog.',
     'Why always writing about cats and dogs, always dogs?',
   ]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

In [None]:
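# X is a sparse matrix: convert it to a dense array and label the columns with the vocabulary
# (get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out())
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())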

#### TF-IDF - Term Frequency - Inverse Document Frequency

A token is important for a document if it appears there very often.

A token becomes less important for comparison across a corpus if it appears all over the place in the corpus.

For example, *innovation* in a corpus of abstracts talking about innovation is not that important.

\begin{equation*}
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
\end{equation*}

- $w_{i,j}$ = the TF-IDF score for a term i in a document j
- $tf_{i,j}$ = number of occurrences of term i in document j
- $N$ = number of documents in the corpus
- $df_i$ = number of documents with term i
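
As a worked example, the formula can be computed directly in a few lines. This is a minimal sketch on a made-up toy corpus, using the natural logarithm. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalises each document vector, so its numbers will differ from this plain version.

```python
import math

# Toy corpus of already tokenized documents (made up for illustration)
docs = [
    ["text", "cats"],
    ["text", "dogs"],
    ["text", "dogs", "dogs"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                     # tf_{i,j}: occurrences of the term in this document
    df = sum(1 for d in docs if term in d)   # df_i: number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)     # N = len(docs)

w_text = tf_idf("text", docs[2], docs)  # appears in every document, so log(3/3) = 0
w_dogs = tf_idf("dogs", docs[2], docs)  # 2 * log(3/2)
w_cats = tf_idf("cats", docs[0], docs)  # 1 * log(3/1)

print(w_text, w_dogs, w_cats)
```

The term that appears everywhere gets a weight of zero, while the rare term "cats" outweighs "dogs" even though "dogs" occurs twice in its document.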


We will use TF-IDF to transform our corpus. However, first we need to fit the TF-IDF model.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [None]:
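# Same as before, but the cells now hold TF-IDF weights instead of raw counts
# (get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out())
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())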

In [None]:
# let's first install this nice visualizer
!pip install -qq pyLDAvis

In [None]:
# and import it
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
%matplotlib inline
pyLDAvis.enable_notebook()

We will be using a dataset from EU Cordis which describes H2020 research projects. No tweets for now.

http://data.europa.eu/euodp/en/data/dataset/cordisH2020projects

In [None]:
# I put a little sample (500 observations) of the data on github

reports = pd.read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/cordis-h2020reports.gz')

In [None]:
reports.info()

In [None]:
# reindex so that the row labels run from 0 to len(reports) - 1
reports.index = range(len(reports))

In [None]:
# now, let's combine everything that we learned about preprocessing in a few lines of code

tokens = []

for summary in nlp.pipe(reports['summary'], disable=["parser", "ner"]):
  proj_tok = [token.lemma_.lower() for token in summary 
              if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] 
              and not token.is_stop
              and not token.is_punct] 
  tokens.append(proj_tok)


In [None]:
# Let's bring the tokens back in

reports['tokens'] = tokens

In [None]:
reports['tokens'][3]

Another library that you have to know when doing NLP is gensim (probably less so once you progress to deep learning and more recent methods, but for now it is essential).

https://radimrehurek.com/gensim/

This library handles all kinds of statistical NLP tasks and goes as far as implementing (super efficient) embedding model training (next class).

But: with NLP today being all BERT, ELMo and transformers, gensim is probably declining in importance. Back in 2013 gensim was a major discovery and breakthrough helper when I was working on my PhD. One more reason to have a look at it.



In [None]:
!pip install -qq -U gensim

In [None]:
# Import the dictionary builder
from gensim.corpora.dictionary import Dictionary

In [None]:
# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(reports['tokens'])

In [None]:
# filter out low-frequency / high-frequency stuff, also limit the vocabulary to max 1000 words
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=1000)
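
To see what this kind of filtering does, here is a minimal pure-Python sketch of the idea: keep tokens that appear in at least `no_below` documents and in at most a `no_above` share of all documents, then keep the `keep_n` most frequent ones. The helper `filter_by_document_frequency` and the toy documents are hypothetical; this is an approximation of the logic, not gensim's implementation.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below=2, no_above=0.5, keep_n=None):
    """Return the set of tokens that survive document-frequency filtering."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    kept = [
        tok for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    ]
    # Most frequent first (ties broken alphabetically), then truncate to keep_n
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [
    ["project", "energy", "solar"],
    ["project", "energy", "wind"],
    ["project", "health", "data"],
    ["project", "health", "energy"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "project" is dropped because it appears in every document, and "solar", "wind" and "data" are dropped because they appear only once.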

In [None]:
# construct corpus using this dictionary
corpus = [dictionary.doc2bow(doc) for doc in reports['tokens']]

In [None]:
# That's how the corpus looks
corpus[3][:10]
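
The bag-of-words format shown above can be reproduced by hand. The following sketch uses two hypothetical helpers, `build_token2id` and `to_bow`, on a made-up toy corpus; the ids it assigns will generally differ from the ones gensim's `Dictionary` assigns, but the structure of the result is the same: one list of `(token_id, count)` pairs per document.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    # Assign each new token the next free integer id, in order of first appearance
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    # Count only tokens that are in the vocabulary; unknown tokens are ignored
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

toy_docs = [["energy", "solar", "energy"], ["health", "data", "energy"]]
token2id = build_token2id(toy_docs)
bow_corpus = [to_bow(doc, token2id) for doc in toy_docs]

print(token2id)
print(bow_corpus)
```

Tokens that were filtered out of the vocabulary simply disappear from the representation, which is why filtering the dictionary first also shrinks the corpus.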

### Topic modelling - NLP meets unsupervised ML

The corpus is a list of documents, each represented as a list of tuples holding a word id and the number of times that word occurs in the document.

LDA: https://youtu.be/DWJYZq_fQ2A

We will start with a topic modelling approach that is good for producing interpretable topics, but less suited as input for further processing.

![alt text](https://miro.medium.com/max/1600/1*pZo_IcxW1GVuH2vQKdoIMQ.jpeg)


In [None]:
# we'll use the faster multicore version of LDA

from gensim.models import LdaMulticore

In [None]:
# Training the model
lda_model = LdaMulticore(corpus, id2word=dictionary, num_topics=10, workers = 4, passes=10)

In [None]:
# Check out topics
lda_model.print_topics(-1)

In [None]:
# Where does a text belong to?
lda_model[corpus][0]

In [None]:
reports['summary'][0]

In [None]:
# Let's try to visualize
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)


In [None]:
# Let's visualize
pyLDAvis.display(lda_display)

In [None]:
# In case you run a website and want to publish it...or embed it in a blogpost...
pyLDAvis.save_html(lda_display, 'lda.html')

In [None]:
# And that's how you get the topic-number that's ranked highest

print(sorted([(2, 0.121567), (9, 0.8610384)], key=lambda x: -x[1]))
print(sorted([(2, 0.121567), (9, 0.8610384)], key=lambda x: -x[1])[0][0])

From here, you can assign topics to texts...do some EDA, explore how topics evolve over time etc.
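
Assigning one topic per text could look like the following sketch. The helper `dominant_topic` is hypothetical, and the topic distributions are made-up values shaped like the output of `lda_model[corpus]`.

```python
def dominant_topic(topic_distribution):
    """Return the id of the topic with the highest probability, or None if the list is empty."""
    if not topic_distribution:
        return None
    return max(topic_distribution, key=lambda pair: pair[1])[0]

# Made-up per-document topic distributions: lists of (topic_id, probability) pairs
doc_topics = [
    [(2, 0.121567), (9, 0.8610384)],
    [(0, 0.55), (4, 0.40)],
    [],
]

assigned = [dominant_topic(dist) for dist in doc_topics]
print(assigned)  # [9, 0, None]
```

Assuming `lda_model`, `corpus` and `reports` from above are in scope, something like `reports['topic'] = [dominant_topic(d) for d in lda_model[corpus]]` would then store the result next to each text.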

Finally, let's try out LSA (an older topic-modelling approach similar to NMF), which is another form of unsupervised ML.

More on LDA: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/


### Your Turn:

![alt text](https://media.giphy.com/media/1zjRp3fs05jhjTuwr3/giphy.gif)

Perform an LDA analysis of the #Grammy's dataset

- Filter the corpus using `tweet-preprocessor` - try to figure out how to use it from its documentation
- Clean up further with SpaCy (keep only ADV, ADJ, NOUN)
- Use Gensim to build a Dictionary (Filter extremes) and Corpus
- Use Gensim to run LDA
- Identify 10 topics
- Plot topic counts in 2-minute intervals

### LSI models


In [None]:
# Import the TfidfModel from Gensim
from gensim.models.tfidfmodel import TfidfModel

In [None]:
# Create and fit a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

In [None]:
# Now we can transform the whole corpus
tfidf_corpus = tfidf[corpus]

In [None]:
# Just like before, we import the model
from gensim.models.lsimodel import LsiModel

# And we fit it on the tfidf_corpus, pointing to the dictionary as reference and setting the number of topics.
# In more serious settings one would pick between 300 and 400
lsi = LsiModel(tfidf_corpus, id2word=dictionary, num_topics=100)

In [None]:
lsi.show_topics(num_topics=10)

In [None]:
# And just as before, we can use the trained model to transform the corpus
lsi_corpus = lsi[tfidf_corpus]

In [None]:
# Load the MatrixSimilarity
from gensim.similarities import MatrixSimilarity

# Create the document-topic-matrix
document_topic_matrix = MatrixSimilarity(lsi_corpus)
document_topic_matrix_ix = document_topic_matrix.index

In [None]:
# this now allows us to perform similarity-queries

sims = document_topic_matrix[lsi_corpus[0]]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
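
The scores returned by such a query are cosine similarities between the query document and every document in the index. Here is a minimal sketch of that computation in plain Python, using made-up three-dimensional "topic" vectors in place of the real LSI vectors.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up document vectors in a 3-topic space
doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

# Compare document 0 against all documents and rank by similarity, highest first
query = doc_vectors[0]
ranked = sorted(
    ((i, cosine_similarity(query, vec)) for i, vec in enumerate(doc_vectors)),
    key=lambda item: -item[1],
)
print(ranked)
```

As in the query above, the document itself comes out on top with a similarity of (about) 1, followed by the documents that load on the same topics.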

We will go deeper into how that works next time.

The last bit is a quick bonus and should be super familiar from M1.

Since we now have a matrix with observations and features, why not try applying the unsupervised ML that we know from M1?

In [None]:
!pip install -q umap-learn

In [None]:
# dimensionality reduction for plotting
import umap

embeddings = umap.UMAP(n_neighbors=15, metric='cosine').fit_transform(document_topic_matrix_ix)

#------------------------------
# we could use that too

#from sklearn.decomposition import PCA

#reduced = PCA(n_components = 10).fit_transform(document_topic_matrix_ix)

In [None]:
# Nothing new here
from sklearn.cluster import KMeans
clusterer = KMeans(n_clusters = 10)
clusterer.fit(document_topic_matrix_ix)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Plotting things
sns.set_style("darkgrid")

plt.rcParams.update({'font.size': 12})
plt.figure(figsize=(12,12))
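# recent seaborn versions require x and y to be passed as keyword arguments
g = sns.scatterplot(x=embeddings[:, 0], y=embeddings[:, 1],
                    # x=reduced[:, 0], y=reduced[:, 1],
                    hue=clusterer.labels_,
                    palette="Paired",
                    legend='full')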

In [None]:
# Let's explore the clusters ... that should actually correlate with topics found by LDA
reports['cluster'] = clusterer.labels_

In [None]:
reports[reports['cluster'] == 2]['teaser']

In [None]:
from gensim.matutils import corpus2dense

In [None]:
# Let's check out the topics by getting the "top-tfidf" terms for the different clusters
# (corpus2dense returns terms x documents, so we need to transpose)
tfidf_matrix = corpus2dense(tfidf_corpus, len(dictionary)).T

In [None]:
# write cluster-numbers into our data
reports['cluster'] = clusterer.labels_

In [None]:
# Get indices to subset the matrix
cluster_index = reports[reports['cluster'] == 2].index

In [None]:
tfidf_matrix[cluster_index,:].shape

In [None]:
# Use numpy to sum up columns for tfidf, get the indices of the sorted values, and flip (descending), and only top 10
topk = np.flip(np.argsort(np.sum(tfidf_matrix[cluster_index,:], axis=0)))[:10]

In [None]:
# Use dictionary to get the words from indices
[dictionary[x] for x in topk]

In [None]:
# Let's loop it

for i in set(clusterer.labels_):
  cluster_index = reports[reports['cluster'] == i].index
  topk = np.flip(np.argsort(np.sum(tfidf_matrix[cluster_index,:], axis=0)))[:10]

  print(str(i) + str([dictionary[x] for x in topk]))