# An interactive introduction to word embeddings

## Goals

- Demystify text-based AI models
- Convince you that this is very cool!

## Applications

- Translation (e.g. Google Translate)
- Text recommendation (autocomplete)
- Chatbots (automatic customer service)
- Much much more!

- [See here for state of the art on tasks](https://github.com/sebastianruder/NLP-progress)

In [None]:
# setup for the lecture
import pandas as pd
import scipy as sc
import sklearn
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
import statsmodels.api as sm
import sys
### Gensim is outside the anaconda distribution ###
### uncomment to install Gensim ###
#!{sys.executable} -m pip install gensim
import gensim
import gensim.downloader as model_api

# Load pretrained word embeddings
# This downloads roughly 60 MB of data the first time it is loaded
word_vectors = model_api.load("glove-wiki-gigaword-50")

First, a **magic trick**!

$(Paris - France) + Russia = x$ 

$Paris + Russia - France = x$

which should give us $x = Moscow$

In [None]:
# Get the most similar word to an expression
word_vectors.most_similar_cosmul(positive=['paris', 'russia'], negative=['france'])

# The numbers returned next to each word are similarity scores derived from cosine similarity
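To make the trick less magical, here is a minimal sketch of the same idea with made-up 2-dimensional vectors. The numbers in `toy_vectors` are invented for illustration (the real GloVe vectors loaded above have 50 dimensions), and the helper names `cosine` and `analogy` are hypothetical. It uses the simple additive form of the analogy, whereas `most_similar_cosmul` combines the similarities multiplicatively, so the scores will not match gensim's.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy embeddings (assumption, not real GloVe values).
toy_vectors = {
    "france": [1.0, 0.0],
    "paris":  [1.0, 1.0],
    "russia": [3.0, 0.2],
    "moscow": [3.0, 1.2],
    "berlin": [2.0, 1.0],
    "banana": [-1.0, 0.5],
}

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones,
    then return the closest remaining word by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Skip the query words themselves, as gensim does.
    exclude = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

best = analogy(["paris", "russia"], ["france"], toy_vectors)
print(best)
```

Cosine similarity only compares directions, which is why the query words are excluded: they often point in nearly the same direction as the answer.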

**NLP ADVANTAGE:** It's easy to generate datasets in NLP if you're clever!

In [None]:
word_vectors.most_similar_cosmul(positive=['queen', 'man'], negative=['woman'])

In [None]:
df = pd.read_csv('../data/airline_tweets.csv')
df.head()

## Fundamental Problem

_If we want the text to produce predictions or suggestions, we first need to translate it to a mathematical form._

**Naive solution:** Have each document (here, each tweet) become a list of the words it contains.
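As a rough sketch of what this naive solution means, the snippet below builds a bag-of-words table by hand for two invented sentences (they are not from the airline data). Splitting on whitespace is a simplifying assumption; `CountVectorizer` tokenizes more carefully.

```python
from collections import Counter

# Two invented example documents (assumption, for illustration only).
docs = ["the flight was late", "the crew was great great"]

# Lowercase and split on whitespace to get the words of each document.
tokenized = [d.lower().split() for d in docs]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocab = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word, each cell a count.
rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)
```

Word order is thrown away: each document is reduced to how many times each vocabulary word occurs in it.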

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In [None]:
# Naive solution
vectorizer = CountVectorizer(max_features=1000)
# Setting max_features limits the vocabulary to the 1000 most frequent terms
# (ranked by term frequency across the corpus), which reduces dimensionality

X = df['text']
y = df['airline_sentiment']

X = vectorizer.fit_transform(X)
wordLabels = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# Print an example of the bag-of-words matrix: one row per tweet, one column per word, each cell a word count
pd.DataFrame(data=X.toarray(), columns=wordLabels).head()

# Remember the term: bag-of-words

# It is easy to try to **predict a tweet's sentiment** with this approach:

In [None]:
X = df['text']
y = df['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

X_train = vectorizer.fit_transform(X_train) #Only fit to Train - prevent data leakage
X_test = vectorizer.transform(X_test)

log_model = LogisticRegression(max_iter=1000).fit(X_train,y_train)

preds = log_model.predict(X_test)

print(classification_report(y_test,preds))

#NOTE:
# To improve BoW, more text cleaning is needed during pre-processing,
# i.e. stop-word removal and/or stemming/lemmatization.
# Otherwise, words that appear frequently in our documents but have little to no predictive power ('a', 'the', etc.) stay in the vocabulary.
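A minimal sketch of stop-word removal, assuming a tiny hand-picked stop list (`STOP_WORDS` and `remove_stop_words` are hypothetical names, and real stop lists are much longer). In scikit-learn, `CountVectorizer` also accepts a `stop_words='english'` argument that applies a built-in list.

```python
# Tiny illustrative stop list (assumption; real lists have hundreds of entries).
STOP_WORDS = {"a", "an", "the", "is", "was", "to", "and"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop words in the stop list."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

tokens = remove_stop_words("The flight was delayed and the crew was rude")
print(tokens)
```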

## TF-IDF

You can also augment the classic bag-of-words with [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

TF-IDF:
- Looks at word frequency within a document (in this case, a single tweet) and also at how often the word appears across all the documents (rows).

- Assigns each word a weight that reflects how important or informative it is. A word like "the" is used so often that it reveals very little.

- Inverse document frequency is based on a ratio: the total number of documents divided by the number of documents that contain the term.
- Term frequency is the number of times the term appears in a document.
- Together they decrease the weight of a word that occurs in many documents of the set, while increasing the weight of words that occur in only a few.
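The weighting described above can be sketched by hand. The snippet below uses the plain textbook definition (term frequency times the log of the document ratio) on three invented, pre-tokenized documents. scikit-learn uses a smoothed and normalized variant, so its numbers will differ; the function names here are hypothetical.

```python
import math

# Three invented, already-tokenized documents (assumption, for illustration).
docs = [
    ["the", "flight", "was", "late"],
    ["the", "crew", "was", "great"],
    ["the", "food", "was", "cold"],
]

def tf(term, doc):
    """Term frequency: share of the document's words that are this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(total docs / docs containing the term).
    Assumes the term occurs in at least one document."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" is in every document, so its idf is log(1) = 0 and its weight vanishes.
score_the = tf_idf("the", docs[0], docs)
# "late" is in only one document, so it receives a positive weight.
score_late = tf_idf("late", docs[0], docs)

print(score_the, score_late)
```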

In [None]:
# tf-idf: term frequency - inverse document frequency

vectorizer = CountVectorizer()
tf = TfidfTransformer()

X = df['text']
y = df['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)


X_train = vectorizer.fit_transform(X_train) #fit_transform CountVectorizer on training data
X_train = tf.fit_transform(X_train)         #fit_transform TfidfTransformer on training data

X_test = vectorizer.transform(X_test)       #transform CountVectorizer on testing data
X_test = tf.transform(X_test)               #transform TfidfTransformer on testing data

In [None]:
log_model= LogisticRegression(max_iter=1000)#Create instance of our model

log_model.fit(X_train,y_train)              #Fit the model on the data

preds = log_model.predict(X_test)

print(classification_report(y_test,preds))

## TfidfVectorizer - CountVectorizer & TfidfTransformer in one step!

In [None]:
tf = TfidfVectorizer()                #Create an instance of our TfidfVectorizer

X = df['text']
y = df['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

X_train = tf.fit_transform(X_train)  #fit_transform on training data
X_test = tf.transform(X_test)        #transform on testing data 

In [None]:
log_model = LogisticRegression(max_iter=1000)

log_model.fit(X_train,y_train)           

preds = log_model.predict(X_test)

print(classification_report(y_test,preds))

Some fixes for the bag-of-words approach are detailed in the first week of [this free NLP course](https://www.coursera.org/learn/language-processing/).

## Dimensionality Reduction with some PCA 

In [None]:
X = df['text']
y = df['airline_sentiment']

In [None]:
#Perform Train Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
# Vectorize the text with TfidfVectorizer, then call toarray(): PCA does not support sparse input
tf = TfidfVectorizer()

tf_X_train = tf.fit_transform(X_train).toarray()
tf_X_test = tf.transform(X_test).toarray()

In [None]:
COMPRESSED_SIZE = 200            #First 200 principal components

pca_model = PCA(COMPRESSED_SIZE) # create instance of our PCA model 

In [None]:
# - ensure input is not sparse 

pca_model.fit(tf_X_train)

In [None]:
#transform both the tf_X_train and tf_X_test - ensure input is not sparse 

pca_train = pca_model.transform(tf_X_train) 
pca_test = pca_model.transform(tf_X_test)

In [None]:
tf_X_train.shape, pca_train.shape #Same number of rows, feature dimensions are reduced 

#NOTE:
# Each principal component is a linear combination of the original features

# All of the original features together explain 100% of the variance in the data
# We trade off some of the explained variance for fewer dimensions
# This can be a significant saving for data sets with MANY dimensions but only a few strong features
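To illustrate the trade-off in the note above, here is a small sketch that computes the explained variance ratio by hand for invented 2-D points that lie almost on a line. It relies on the closed-form eigenvalues of a symmetric 2x2 covariance matrix, which only works in two dimensions; scikit-learn's `PCA` handles the general case.

```python
import math

# Invented 2-D points that fall close to the line y = x (assumption).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Sample variances and covariance (the entries of the 2x2 covariance matrix).
var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

# Eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]]: the variance along
# each principal component.
half_trace = (var_x + var_y) / 2
spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
lambda_1 = half_trace + spread
lambda_2 = half_trace - spread

# Share of the total variance captured by each component.
total = lambda_1 + lambda_2
ratios = [lambda_1 / total, lambda_2 / total]
print(ratios)
```

Because the points are nearly collinear, the first component carries almost all of the variance, so dropping the second one loses very little information.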

In [None]:
#Time for modelling - create a logistic_reg model with max_iter set to 1000 

model = LogisticRegression(max_iter=1000)
model.fit(pca_train,y_train)

In [None]:
#Generate predictions based on our pca_test set
preds = model.predict(pca_test)

In [None]:
print(classification_report(y_test,preds))

In [None]:
#Fraction of explained variance for each of the first 10 new dimensions

pca_model.explained_variance_ratio_[:10]

# What are the problems with this approach?

- Doesn't associate similar/same words


- No information about words themselves


- No word importance information


There are many better methods to generate embeddings.

- The most popular are [word2vec](https://www.tensorflow.org/tutorials/representation/word2vec) and [GloVe](https://nlp.stanford.edu/projects/glove/).
- There are also methods based on [matrix factorization](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), like the PCA step we applied above.
- Modern techniques use recurrent neural net models [predicting words](https://thegradient.pub/nlp-imagenet/) to generate better embeddings.