# Task 2

### Use a pretrained word-embedding (word2vec, glove or fasttext) for featurization instead of the bag-of-words model. Does this improve classification? How about combining the embedded words with the BoW model?

### Using Word2Vec

In [47]:
import pandas as pd
import numpy as np

import gensim
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import string

import re

from nltk.tokenize import RegexpTokenizer

In [48]:
df = pd.read_csv('./data/wine_clean.csv')

First, we modify the cleaned dataframe so that the descriptions are all lower case, contain only letters, and do not contain any stop words. We then turn each description into a list of tokens to work with later. Some of the code applied in this section was adapted from the following article on cleaning data: https://medium.com/@chaimgluck1/have-messy-text-data-clean-it-with-simple-lambda-functions-645918fcc2fc

In [49]:
df['description']= df['description'].str.lower()
df['description']= df['description'].apply(lambda elem: re.sub('[^a-zA-Z]',' ', elem)) 

tokenizer = RegexpTokenizer(r'\w+')
df['description'] = df['description'].apply(tokenizer.tokenize)

# A set gives constant-time membership checks, which matters when filtering every token
stopword_set = set(stopwords.words('english'))
df['description'] = df['description'].apply(lambda elem: [word for word in elem if word not in stopword_set])
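
To make the effect of these cleaning steps concrete, here is a minimal sketch that applies the same sequence (lower-casing, replacing non-letters, tokenizing, dropping stop words) to one sentence. The `clean_description` helper, the sample sentence, and the tiny `SAMPLE_STOPWORDS` set are hypothetical stand-ins so the example runs without the NLTK corpus download; the real NLTK stop word list is much longer.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the NLTK list is much longer.
SAMPLE_STOPWORDS = {"this", "is", "a", "with", "of", "and", "the"}

def clean_description(text, stopword_set=SAMPLE_STOPWORDS):
    """Lower-case, replace non-letters with spaces, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub('[^a-z]', ' ', text)   # digits and punctuation become spaces
    tokens = text.split()                # only letters and spaces remain, so this matches \w+ tokenizing
    return [tok for tok in tokens if tok not in stopword_set]

sample = "This is a ripe, fruity wine with 14% alcohol and notes of black-cherry."
print(clean_description(sample))
# ['ripe', 'fruity', 'wine', 'alcohol', 'notes', 'black', 'cherry']
```

Note that the number `14` disappears entirely and `black-cherry` is split into two tokens, which is a side effect of keeping letters only.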

Next, we load our pretrained model. For this task we selected the GoogleNews pretrained word2vec model, which has a vocabulary of about 3 million words and phrases. (The model file is not included in the download since it is 1.5GB, so it must be downloaded separately before running this notebook.)

In [50]:
model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin', binary=True)

After the model is loaded into memory, we go through each description and vectorize it. To do this, we sum the word vectors of all the words in the description, skipping any word that is not in the model's vocabulary. Another option is to average the vectors instead of summing them.

In [51]:
vectorized_sentences = []
for description in df["description"]:
    sum_vector = np.zeros((300,))
    for word in description:
        if word in model:  # membership test on KeyedVectors; model.vocab was removed in gensim 4
            sum_vector += model[word]
    vectorized_sentences.append(sum_vector)
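
As a rough illustration of the difference between summing and averaging, the sketch below pools a few made-up 3-dimensional vectors. The `toy_vectors` dictionary and the `pool` helper are hypothetical and only stand in for the 300-dimensional word2vec lookup used above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional word2vec vectors.
toy_vectors = {
    "ripe":   np.array([1.0, 0.0, 2.0]),
    "fruity": np.array([3.0, 2.0, 0.0]),
}

def pool(tokens, vectors, dim, average=False):
    """Sum (or average) the vectors of in-vocabulary tokens; return zeros if none match."""
    found = [vectors[tok] for tok in tokens if tok in vectors]
    if not found:
        return np.zeros(dim)
    total = np.sum(found, axis=0)
    return total / len(found) if average else total

tokens = ["ripe", "fruity", "zzz"]   # "zzz" is out of vocabulary and is skipped
print(pool(tokens, toy_vectors, dim=3))                # [4. 2. 2.]
print(pool(tokens, toy_vectors, dim=3, average=True))  # [2. 1. 1.]
```

With summing, the magnitude of the vector grows with the number of matched words, so description length is implicitly encoded in the features. Averaging removes that length signal, which may or may not help depending on whether length is predictive of the score.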

We reshape our dataframe so that the first column holds the points label for each wine and the next 300 columns hold the vectorized wine description.

In [52]:
df["description"] = vectorized_sentences
df = df[["description", "points"]]
df2 = pd.DataFrame(df.description.values.tolist(), index= df.index)
df_final = pd.concat([df["points"], df2], axis=1)
df_final.head()

Unnamed: 0,points,0,1,2,3,4,5,6,7,8,...,290,291,292,293,294,295,296,297,298,299
0,87,-0.364258,0.786743,-0.038315,1.797501,-0.749458,-0.481781,2.20575,-2.917053,0.027161,...,-2.048325,-1.840363,-1.04303,-3.078369,1.085205,0.373444,0.347908,-0.657639,1.683472,2.301544
1,87,0.922775,1.643341,-1.00119,2.650461,-0.556315,1.640244,1.762878,-2.678284,0.507172,...,-2.07682,-0.422913,-0.375568,0.862671,-0.212234,0.177979,2.442276,-1.439209,2.908058,2.351929
2,87,-0.886719,-0.114258,-2.253662,2.624756,-0.5224,-0.135101,2.016785,-2.015633,2.668503,...,-2.481445,-1.066376,-0.811798,0.28064,0.609894,0.722351,2.144394,-1.166077,1.962616,-0.028076
3,87,0.643204,1.847412,-0.722839,2.535858,-0.22052,-0.943481,3.13623,-2.898071,0.966766,...,-0.702271,-1.211792,-0.134789,-1.066437,0.064209,-1.379456,2.903175,-1.199646,1.380554,1.718262
4,87,0.75354,0.717773,-0.650528,2.113831,-1.660583,-0.490204,0.650208,-0.766724,1.1362,...,-1.064392,-1.62381,0.3255,0.238525,0.733032,-0.449524,1.335938,-0.23053,1.998627,1.014282


From here on out it is a traditional regression task. We focus primarily on linear models and compare their performance using the mean cross-validated R² (the default score for scikit-learn regressors). From an initial performance check, OLS and Ridge are the best of these models; the much lower Lasso and ElasticNet scores suggest that this data doesn't respond well to L1 regularization at the default regularization strength.

In [53]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet


In [54]:
X = df_final.iloc[:,1:]
y = df_final.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [55]:
print("OLS average cv score: ", np.mean(cross_val_score(LinearRegression(), X_train, y_train, cv=5)))
print("Ridge average cv score: ", np.mean(cross_val_score(Ridge(), X_train, y_train, cv=5)))
print("Lasso average cv score: ", np.mean(cross_val_score(Lasso(), X_train, y_train, cv=5)))
print("ElasticNet average cv score: ", np.mean(cross_val_score(ElasticNet(), X_train, y_train, cv=5)))

OLS average cv score:  0.6044429152817308
Ridge average cv score:  0.6044452894820764
Lasso average cv score:  0.2588743254322418
ElasticNet average cv score:  0.37803716421264333


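We can slightly improve performance by using a CatBoostRegressor. Note that this cell uses 3-fold rather than 5-fold cross-validation, so the score is not strictly comparable to the linear models above.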

In [56]:
from catboost import CatBoostRegressor

In [59]:
print("CatBoost average cv score: ", np.mean(cross_val_score(CatBoostRegressor(silent=True), X_train, y_train, cv=3)))

CatBoost average cv score:  0.6373447621897614


### Combining the Two Approaches
We will now try to combine the word2vec features from this task with the features from part 1.

In [60]:
df2 = pd.read_csv("./clean_wine2.csv")
df2 = df2.drop(['Unnamed: 0'], axis=1)

In [61]:
df_combined = pd.concat([df2,df_final.iloc[:,1:]],axis=1)
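
Because `pd.concat(..., axis=1)` aligns rows on the index rather than on position, the concatenation above assumes both dataframes share the same index in the same row order. A minimal sketch of what happens otherwise, using hypothetical miniature frames (`bow` for the part 1 features, `emb` for the embedding columns):

```python
import pandas as pd

# Hypothetical miniature frames: `bow` stands in for the part 1 features,
# `emb` for the embedding columns.
bow = pd.DataFrame({"points": [87, 90, 85], "bow_cherry": [1, 0, 2]})
emb = pd.DataFrame({0: [0.1, 0.2, 0.3], 1: [0.4, 0.5, 0.6]})

# Same index on both sides: rows line up one-to-one.
aligned = pd.concat([bow, emb], axis=1)
print(aligned.shape)   # (3, 4)

# Mismatched index: concat takes the union of labels and fills the gaps with NaN.
shifted = emb.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([bow, shifted], axis=1)
print(misaligned.shape, int(misaligned.isna().any(axis=1).sum()))   # (4, 4) 2
```

A quick check such as `df_combined.isna().any().any()` or comparing `len(df_combined)` with `len(df2)` would catch this kind of mismatch before training.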

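After concatenating both dataframes, we try Ridge and CatBoost. Both show a clear improvement over using the word2vec features alone: the Ridge cross-validation score rises from about 0.60 to 0.69, and the CatBoost score from about 0.64 to 0.72. We finalize by scoring CatBoost on the held-out test set.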

In [62]:
X = df_combined.iloc[:,1:]
y = df_combined.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [63]:
print("Ridge average cv score: ", np.mean(cross_val_score(Ridge(), X_train, y_train, cv=5)))

Ridge average cv score:  0.6862580662234377


In [64]:
print("CatBoost average cv score: ", np.mean(cross_val_score(CatBoostRegressor(silent=True), X_train, y_train, cv=3)))

CatBoost average cv score:  0.7171070452227163


In [None]:
print(CatBoostRegressor(silent=True).fit(X_train, y_train).score(X_test, y_test))