<a href="https://colab.research.google.com/github/AceroMike/Natural-Language-Processing/blob/main/NLP_Modeling_with_Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
import nltk
from nltk.corpus import gutenberg
import gensim
import warnings
warnings.filterwarnings("ignore")

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

nltk.download('gutenberg')
!python -m spacy download en

In [2]:
# Text Cleaning function
def text_cleaner(text):
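    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'. Replace it with a space.
    text = re.sub(r'--', ' ', text)
    # Remove bracketed text that sits on a single line
    text = re.sub(r"[\[].*?[\]]", "", text)
    # Remove standalone numbers (integers and decimals)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Collapse runs of whitespace, including newlines, into single spaces
    text = ' '.join(text.split())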
    return text
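
To make the effect of the cleaning steps concrete, here is a small sketch that runs the same function on a made-up string. The sample sentence is invented for illustration and does not come from either novel.

```python
import re

def text_cleaner(text):
    # Same steps as the notebook's cleaner, repeated so this block runs on its own
    text = re.sub(r'--', ' ', text)
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    text = ' '.join(text.split())
    return text

# Hypothetical input: a bracketed header, double dashes, and a number
sample = "[Persuasion by Jane Austen 1818] Alice--the girl--saw 12 cards."
cleaned = text_cleaner(sample)
print(cleaned)  # Alice the girl saw cards.
```

The bracketed header, the double dashes, and the number are all gone, and the leftover whitespace is collapsed.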

In this notebook, I will be modeling with **word2vec**. In other notebooks, we worked with Bag-of-Words and TF-IDF to get numerical representations of text data. Both of those methods depend on how often words occur; word2vec is different.

word2vec is an algorithm that takes in a corpus of text and does what its name suggests: it creates vectors from words. For each word W in the corpus, word2vec creates a vector of values. It then looks at the words that appear close to W in the text and adjusts the vectors so that words appearing in similar contexts end up with vectors that are also close.
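
One common way to measure whether two word vectors are "close" is cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are invented, not taken from a trained model, and the function name is my own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, made up for illustration only
rabbit = [0.9, 0.1, 0.3]
hare = [0.8, 0.2, 0.4]
captain = [-0.5, 0.9, 0.1]

print(cosine_similarity(rabbit, hare))     # close to 1: similar direction
print(cosine_similarity(rabbit, captain))  # negative: pointing apart
```

In a trained model, words that share contexts would be expected to score higher with each other than with unrelated words.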

To see word2vec in action, I will train classification models on the books Alice's Adventures in Wonderland by Lewis Carroll and Persuasion by Jane Austen. These are available through the Natural Language Toolkit (nltk) by importing the gutenberg corpus. Let's get started by loading and cleaning the text data.

In [3]:
# Loading the data
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# Cleaning up Chapter Headings
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

# Applying the text cleaner    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [4]:
# Parsing the cleaned novels
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

Now that we have cleaned the text, we want to split it into sentences. Each sentence, labeled with its author, will be one observation for training a classification model. Then we combine the sentences into a DataFrame.

In [7]:
# Group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one DataFrame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

Unnamed: 0,text,author
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


Now that we have our sentences in token form, we want to remove uninformative tokens. We remove stop words and punctuation and lemmatize the remaining tokens to prepare the data for analysis.

In [8]:
# Get rid of stop words and punctuation,
# and lemmatize the tokens
sentences["text"] = sentences["text"].apply(
    lambda sentence: [token.lemma_ for token in sentence
                      if not token.is_punct and not token.is_stop]
)

Now we are ready to model with word2vec. Before we do, let's look at some of the parameters we can set. The descriptions are adapted from the gensim Word2Vec documentation, and the values shown are my starting values.

* `workers=4`: Use this many worker threads to train the model (faster training on multicore machines).
* `min_count=1`: Ignore all words that appear fewer than this many times. Setting it to 1 keeps every word.
* `window=6`: Maximum distance between the current and predicted word within a sentence.
* `sg=0`: Training algorithm: 1 for skip-gram; otherwise CBOW. We use CBOW because our corpus is small.
* `sample=1e-3`: Threshold for randomly downsampling high-frequency words, which penalizes frequent words.
* `size=100`: Set the word vector length to 100.
* `hs=1`: If 1, hierarchical softmax will be used for model training. If 0, and `negative` is non-zero, negative sampling will be used. We use hierarchical softmax.

Optional, for reproducible runs. The downside is that `workers` has to be set to 1.
*   `seed=44`: Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (`workers=1`), to eliminate ordering jitter from OS thread scheduling.
*   `workers=1`: Needed for reproducibility. Hopefully this does not make the code run too slowly.
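
To get a feel for what `window` controls, here is a simplified sketch that lists the (target, context) pairs a fixed window would produce for one lemmatized sentence. The function name `context_pairs` is my own, and this is not gensim's implementation; actual training differs in its details (for example, the effective window may be shrunk at random for each target word).

```python
def context_pairs(tokens, window):
    """Yield (target, context) for every token within `window` positions."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, tokens[j]

# First lemmas of the opening sentence of Alice, as shown in the DataFrame
tokens = ["Alice", "begin", "tired", "sit", "sister", "bank"]

print(list(context_pairs(tokens, window=1)))
print(sum(1 for _ in context_pairs(tokens, window=2)))  # 18 pairs
```

A larger window gives each word more context words, which is the lever the experiments below pull on.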

First, we will use the settings listed at the top. When I build models, I like to change one thing at a time while holding everything else constant. Since I am building these models to learn from them, I want to find out what effect altering a parameter actually has on the predictions. If they get better or worse, can we identify why? I will start with the window distance, trying one window higher and one lower than the starting value. So I will train three models, one for each specification.

**Messing with the window parameter**

Let's define our models



In [9]:
# Word2Vec Models

# window=4
model1 = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=4,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)

# window=6
model2 = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=6,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)

# window=8
model3 = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=8,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)
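
The three definitions above differ only in `window`. As a sketch of the "change one thing, hold the rest constant" idea, the shared settings can live in one dictionary. The names `base_params` and `configs` are my own, and the commented-out training line assumes the same gensim version as this notebook, where the vector length keyword is `size` (gensim 4 renamed it to `vector_size`).

```python
# Settings shared by all three models (taken from the cell above)
base_params = dict(workers=4, min_count=1, sg=0, sample=1e-3, size=100, hs=1)

# One configuration per window value; everything else is held constant
configs = {w: {**base_params, "window": w} for w in (4, 6, 8)}

for window, params in configs.items():
    print(window, params)

# Training would then look like this (not run here, needs gensim and the data):
# models = {w: gensim.models.Word2Vec(sentences["text"], **p)
#           for w, p in configs.items()}
```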

Now that we have the models, we want to turn each sentence into a numerical vector by averaging the vectors of its words. First, we create empty arrays to hold the results.

In [11]:
# Empty array of Zeros
word2vec1 = np.zeros((sentences.shape[0],100))
word2vec2 = np.zeros((sentences.shape[0],100))
word2vec3 = np.zeros((sentences.shape[0],100))

In [12]:
word2vec1

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Now to fill these arrays with some data from our sentences. 
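
Before filling the arrays, here is a toy, dependency-free sketch of the averaging step. The two word vectors are invented, `average_vectors` is a hypothetical helper, and skipping unknown words is my own assumption (with `min_count=1` every word in this corpus has a vector).

```python
import math

def average_vectors(tokens, vectors, size):
    """Element-wise mean of the token vectors, or NaNs if there are none."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        # Nothing left to average, e.g. a sentence of only stop words
        return [math.nan] * size
    return [sum(column) / len(rows) for column in zip(*rows)]

# Hypothetical 3-dimensional word vectors, made up for illustration
toy_vectors = {
    "oh": [1.0, 0.0, 2.0],
    "dear": [3.0, 2.0, 0.0],
}

print(average_vectors(["oh", "dear"], toy_vectors, size=3))  # [2.0, 1.0, 1.0]
print(average_vectors([], toy_vectors, size=3))              # all NaN
```

The second call shows why some rows can end up as NaN: a sentence with no tokens left after cleaning has nothing to average.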

In [13]:
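# Represent each sentence as the mean of its word vectors.
# Sentences left empty after cleaning produce NaN rows, which are dropped later.
for i, sentence in enumerate(sentences["text"]):
    word2vec1[i,:] = np.mean([model1.wv[lemma] for lemma in sentence], axis=0)
    word2vec2[i,:] = np.mean([model2.wv[lemma] for lemma in sentence], axis=0)
    word2vec3[i,:] = np.mean([model3.wv[lemma] for lemma in sentence], axis=0)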

# Creating a Dataframe for each word2vec
word2vec1 = pd.DataFrame(word2vec1)
word2vec2 = pd.DataFrame(word2vec2)
word2vec3 = pd.DataFrame(word2vec3)

word2vec1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.593090,0.125660,0.102273,-0.002626,-0.153981,0.120061,0.197181,0.159078,-0.351577,-0.525317,0.197541,0.518661,-0.082424,0.088284,-0.176848,0.294890,0.066775,-0.368475,0.185509,0.063295,-0.077567,0.016506,-0.055045,0.133580,0.361915,-0.015884,0.022807,0.272579,-0.120107,0.026279,0.142475,0.342468,-0.057785,0.153448,-0.323533,-0.039495,0.189363,0.058391,0.042092,0.231203,...,-0.311339,-0.359907,0.311202,-0.191088,-0.241105,0.013136,-0.058113,0.072874,0.071527,-0.082586,0.223800,-0.286902,-0.003710,-0.118313,-0.072899,-0.015766,0.289712,0.312455,0.336744,-0.119179,-0.112208,0.004392,0.063863,-0.033848,0.390154,0.286367,-0.193072,-0.018720,-0.212526,-0.135100,-0.144363,-0.096167,-0.033384,0.118025,0.172062,-0.009105,-0.231549,-0.145739,0.166962,0.019277
1,0.499260,0.112178,0.072917,0.002343,-0.139124,0.098222,0.185955,0.142321,-0.274334,-0.469089,0.171020,0.426292,-0.077282,0.079499,-0.177450,0.241441,0.052851,-0.287148,0.161990,0.046533,-0.060256,0.020956,-0.049840,0.103143,0.320767,-0.000888,0.024832,0.244554,-0.086759,0.004027,0.117350,0.278075,-0.044566,0.135592,-0.281328,-0.029153,0.161841,0.038714,0.041794,0.175230,...,-0.263096,-0.316396,0.252927,-0.157610,-0.201413,0.007385,-0.052907,0.070242,0.060646,-0.062690,0.192317,-0.242210,0.004048,-0.096226,-0.036455,-0.018636,0.241959,0.261674,0.276624,-0.077251,-0.100240,-0.003521,0.036731,-0.029395,0.321417,0.236663,-0.156490,0.000020,-0.169704,-0.099070,-0.128151,-0.073530,-0.010956,0.104788,0.169803,-0.019622,-0.219825,-0.103355,0.138659,0.042323
2,0.726160,0.160354,0.095932,-0.013199,-0.195994,0.141977,0.244198,0.203197,-0.432478,-0.649302,0.230697,0.644938,-0.090413,0.107013,-0.224174,0.350699,0.090947,-0.456160,0.234558,0.077450,-0.102349,0.020014,-0.058745,0.156582,0.466838,-0.019526,-0.000245,0.339621,-0.153757,0.033871,0.191802,0.427233,-0.085512,0.191244,-0.405802,-0.052305,0.239530,0.064413,0.050893,0.290403,...,-0.390933,-0.449513,0.374585,-0.231309,-0.296132,-0.006417,-0.077322,0.076574,0.085414,-0.109791,0.276065,-0.373793,0.011528,-0.150279,-0.093985,-0.023399,0.362994,0.371862,0.425999,-0.138355,-0.142786,0.011233,0.091384,-0.048948,0.490272,0.368527,-0.247509,-0.016721,-0.265918,-0.161803,-0.174043,-0.116018,-0.019934,0.135186,0.208702,-0.014391,-0.291085,-0.175176,0.223361,0.027897
3,0.642033,0.150461,0.102995,-0.008951,-0.196570,0.127856,0.185137,0.193834,-0.383852,-0.589803,0.221474,0.558878,-0.087437,0.099326,-0.166237,0.298137,0.097000,-0.391374,0.205905,0.068253,-0.072592,-0.000542,-0.048091,0.119714,0.411945,-0.049757,0.018556,0.277931,-0.120092,0.034928,0.164277,0.375009,-0.048728,0.160510,-0.368117,-0.041199,0.172903,0.080095,0.036951,0.224742,...,-0.347010,-0.396360,0.329009,-0.193281,-0.276706,-0.005203,-0.067640,0.088278,0.100690,-0.088125,0.233926,-0.311733,0.002029,-0.117116,-0.073733,-0.050910,0.328872,0.315253,0.343059,-0.116519,-0.124305,0.000739,0.073612,-0.025891,0.420431,0.321188,-0.215922,0.009740,-0.234625,-0.175428,-0.184770,-0.114257,-0.010815,0.155462,0.199806,-0.021368,-0.256647,-0.165393,0.214295,0.017613
4,0.495741,0.115527,0.076642,0.001828,-0.120230,0.106972,0.170313,0.132831,-0.291191,-0.424512,0.160836,0.421108,-0.073010,0.084443,-0.143384,0.248903,0.046370,-0.316512,0.140131,0.039509,-0.060905,0.013710,-0.045126,0.099644,0.293832,-0.017368,0.013536,0.222836,-0.097310,0.034687,0.118884,0.272684,-0.050687,0.116679,-0.272645,-0.030114,0.142990,0.053429,0.036903,0.192358,...,-0.262491,-0.290761,0.271875,-0.136660,-0.215397,-0.003056,-0.055246,0.052250,0.073315,-0.080253,0.188585,-0.240069,-0.002263,-0.092237,-0.061709,-0.017633,0.237478,0.251863,0.282271,-0.087030,-0.083744,0.000591,0.058069,-0.023724,0.322232,0.237114,-0.153073,-0.014703,-0.171192,-0.104445,-0.123495,-0.084454,-0.025412,0.093222,0.139940,-0.000588,-0.200263,-0.127221,0.151134,0.020295
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5843,0.550520,0.123531,0.085342,-0.004564,-0.157699,0.100730,0.182447,0.163725,-0.313352,-0.504850,0.189059,0.480296,-0.079930,0.086134,-0.177661,0.266636,0.069486,-0.326882,0.177167,0.061077,-0.075401,0.005924,-0.054985,0.116749,0.350982,-0.009995,0.025124,0.258137,-0.101006,0.019791,0.139517,0.314711,-0.050073,0.151783,-0.315080,-0.038057,0.174074,0.046279,0.045448,0.198315,...,-0.292389,-0.348844,0.284930,-0.178565,-0.229020,0.004481,-0.058743,0.078119,0.072421,-0.073642,0.207777,-0.273681,-0.001055,-0.111282,-0.057597,-0.021924,0.275710,0.292648,0.308795,-0.103248,-0.112356,0.004885,0.058712,-0.029196,0.357402,0.264174,-0.171449,-0.002537,-0.199154,-0.133387,-0.137322,-0.075204,-0.012754,0.120464,0.174506,-0.025587,-0.230657,-0.119528,0.161885,0.023388
5844,0.632281,0.077009,0.196799,0.029385,-0.121222,0.154380,0.181994,0.157856,-0.427112,-0.413381,0.243967,0.560073,-0.059366,0.098407,-0.051147,0.413415,0.051677,-0.502991,0.128840,0.068116,-0.107713,-0.033806,-0.105611,0.190561,0.291399,0.022433,0.053567,0.215851,-0.127223,0.105848,0.158875,0.411351,-0.053292,0.101892,-0.314701,-0.058538,0.195481,0.084126,0.042007,0.293032,...,-0.314670,-0.372374,0.403059,-0.187628,-0.266738,0.047984,-0.035017,0.104417,0.087666,-0.103232,0.182720,-0.311880,-0.110534,-0.131663,-0.163434,0.049741,0.316707,0.347860,0.392011,-0.210202,-0.081645,0.057662,0.099600,-0.022488,0.451159,0.240024,-0.179797,-0.044203,-0.232817,-0.211197,-0.123126,-0.079111,-0.110097,0.069903,0.107465,0.046320,-0.175851,-0.202311,0.154608,-0.027259
5845,0.348791,0.083772,0.058605,0.008339,-0.098083,0.071398,0.120888,0.101017,-0.196383,-0.312920,0.121047,0.297962,-0.055693,0.058185,-0.106911,0.170782,0.037285,-0.211749,0.105903,0.031494,-0.043486,0.009631,-0.031994,0.069489,0.215218,-0.008992,0.021445,0.160968,-0.061668,0.016421,0.080642,0.196041,-0.028709,0.082852,-0.195428,-0.019639,0.104236,0.035161,0.028593,0.123926,...,-0.183117,-0.215457,0.186344,-0.099919,-0.144893,0.005009,-0.033539,0.047289,0.051513,-0.046396,0.130817,-0.164600,-0.011340,-0.066149,-0.029347,-0.009993,0.169899,0.182248,0.187937,-0.058759,-0.062350,0.000901,0.029508,-0.018416,0.219531,0.163779,-0.107103,-0.003651,-0.117623,-0.077279,-0.092449,-0.051767,-0.014389,0.076672,0.110581,-0.009370,-0.146343,-0.081085,0.100576,0.018713
5846,0.249486,0.062772,0.041905,0.002417,-0.068607,0.051631,0.085007,0.071377,-0.135219,-0.228775,0.087432,0.207068,-0.042975,0.041657,-0.082010,0.118597,0.023817,-0.142956,0.077513,0.019678,-0.027345,0.007915,-0.023020,0.047241,0.150307,-0.011736,0.017221,0.118449,-0.042509,0.010208,0.053487,0.135068,-0.016912,0.060847,-0.137600,-0.011406,0.069618,0.026216,0.020540,0.083837,...,-0.129360,-0.150727,0.130534,-0.074428,-0.104039,0.002013,-0.024912,0.032716,0.037925,-0.033899,0.094335,-0.112884,-0.001695,-0.044062,-0.015514,-0.013850,0.118661,0.130088,0.131075,-0.038092,-0.044497,-0.005368,0.016491,-0.011215,0.149955,0.115514,-0.073954,0.000039,-0.082708,-0.053678,-0.067780,-0.040849,-0.012277,0.060344,0.081635,-0.009071,-0.105625,-0.057781,0.069353,0.017358


Great! Now we have a dataframe of sentence vectors, each the average of the word vectors in a sentence. These will be the features of our models. However, before modeling we want to add the author and text information, again creating a dataframe for each model.

In [14]:
sentences1 = pd.concat([sentences[["author", "text"]],word2vec1], axis=1)
sentences1.dropna(inplace=True)

sentences2 = pd.concat([sentences[["author", "text"]],word2vec2], axis=1)
sentences2.dropna(inplace=True)

sentences3 = pd.concat([sentences[["author", "text"]],word2vec3], axis=1)
sentences3.dropna(inplace=True)

sentences1

Unnamed: 0,author,text,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,Carroll,"[Alice, begin, tired, sit, sister, bank, have,...",0.593090,0.125660,0.102273,-0.002626,-0.153981,0.120061,0.197181,0.159078,-0.351577,-0.525317,0.197541,0.518661,-0.082424,0.088284,-0.176848,0.294890,0.066775,-0.368475,0.185509,0.063295,-0.077567,0.016506,-0.055045,0.133580,0.361915,-0.015884,0.022807,0.272579,-0.120107,0.026279,0.142475,0.342468,-0.057785,0.153448,-0.323533,-0.039495,0.189363,0.058391,...,-0.311339,-0.359907,0.311202,-0.191088,-0.241105,0.013136,-0.058113,0.072874,0.071527,-0.082586,0.223800,-0.286902,-0.003710,-0.118313,-0.072899,-0.015766,0.289712,0.312455,0.336744,-0.119179,-0.112208,0.004392,0.063863,-0.033848,0.390154,0.286367,-0.193072,-0.018720,-0.212526,-0.135100,-0.144363,-0.096167,-0.033384,0.118025,0.172062,-0.009105,-0.231549,-0.145739,0.166962,0.019277
1,Carroll,"[consider, mind, hot, day, feel, sleepy, stupi...",0.499260,0.112178,0.072917,0.002343,-0.139124,0.098222,0.185955,0.142321,-0.274334,-0.469089,0.171020,0.426292,-0.077282,0.079499,-0.177450,0.241441,0.052851,-0.287148,0.161990,0.046533,-0.060256,0.020956,-0.049840,0.103143,0.320767,-0.000888,0.024832,0.244554,-0.086759,0.004027,0.117350,0.278075,-0.044566,0.135592,-0.281328,-0.029153,0.161841,0.038714,...,-0.263096,-0.316396,0.252927,-0.157610,-0.201413,0.007385,-0.052907,0.070242,0.060646,-0.062690,0.192317,-0.242210,0.004048,-0.096226,-0.036455,-0.018636,0.241959,0.261674,0.276624,-0.077251,-0.100240,-0.003521,0.036731,-0.029395,0.321417,0.236663,-0.156490,0.000020,-0.169704,-0.099070,-0.128151,-0.073530,-0.010956,0.104788,0.169803,-0.019622,-0.219825,-0.103355,0.138659,0.042323
2,Carroll,"[remarkable, Alice, think, way, hear, Rabbit, ...",0.726160,0.160354,0.095932,-0.013199,-0.195994,0.141977,0.244198,0.203197,-0.432478,-0.649302,0.230697,0.644938,-0.090413,0.107013,-0.224174,0.350699,0.090947,-0.456160,0.234558,0.077450,-0.102349,0.020014,-0.058745,0.156582,0.466838,-0.019526,-0.000245,0.339621,-0.153757,0.033871,0.191802,0.427233,-0.085512,0.191244,-0.405802,-0.052305,0.239530,0.064413,...,-0.390933,-0.449513,0.374585,-0.231309,-0.296132,-0.006417,-0.077322,0.076574,0.085414,-0.109791,0.276065,-0.373793,0.011528,-0.150279,-0.093985,-0.023399,0.362994,0.371862,0.425999,-0.138355,-0.142786,0.011233,0.091384,-0.048948,0.490272,0.368527,-0.247509,-0.016721,-0.265918,-0.161803,-0.174043,-0.116018,-0.019934,0.135186,0.208702,-0.014391,-0.291085,-0.175176,0.223361,0.027897
3,Carroll,"[oh, dear]",0.642033,0.150461,0.102995,-0.008951,-0.196570,0.127856,0.185137,0.193834,-0.383852,-0.589803,0.221474,0.558878,-0.087437,0.099326,-0.166237,0.298137,0.097000,-0.391374,0.205905,0.068253,-0.072592,-0.000542,-0.048091,0.119714,0.411945,-0.049757,0.018556,0.277931,-0.120092,0.034928,0.164277,0.375009,-0.048728,0.160510,-0.368117,-0.041199,0.172903,0.080095,...,-0.347010,-0.396360,0.329009,-0.193281,-0.276706,-0.005203,-0.067640,0.088278,0.100690,-0.088125,0.233926,-0.311733,0.002029,-0.117116,-0.073733,-0.050910,0.328872,0.315253,0.343059,-0.116519,-0.124305,0.000739,0.073612,-0.025891,0.420431,0.321188,-0.215922,0.009740,-0.234625,-0.175428,-0.184770,-0.114257,-0.010815,0.155462,0.199806,-0.021368,-0.256647,-0.165393,0.214295,0.017613
4,Carroll,"[shall, late]",0.495741,0.115527,0.076642,0.001828,-0.120230,0.106972,0.170313,0.132831,-0.291191,-0.424512,0.160836,0.421108,-0.073010,0.084443,-0.143384,0.248903,0.046370,-0.316512,0.140131,0.039509,-0.060905,0.013710,-0.045126,0.099644,0.293832,-0.017368,0.013536,0.222836,-0.097310,0.034687,0.118884,0.272684,-0.050687,0.116679,-0.272645,-0.030114,0.142990,0.053429,...,-0.262491,-0.290761,0.271875,-0.136660,-0.215397,-0.003056,-0.055246,0.052250,0.073315,-0.080253,0.188585,-0.240069,-0.002263,-0.092237,-0.061709,-0.017633,0.237478,0.251863,0.282271,-0.087030,-0.083744,0.000591,0.058069,-0.023724,0.322232,0.237114,-0.153073,-0.014703,-0.171192,-0.104445,-0.123495,-0.084454,-0.025412,0.093222,0.139940,-0.000588,-0.200263,-0.127221,0.151134,0.020295
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5843,Austen,"[spring, felicity, glow, spirit, friend, Anne,...",0.550520,0.123531,0.085342,-0.004564,-0.157699,0.100730,0.182447,0.163725,-0.313352,-0.504850,0.189059,0.480296,-0.079930,0.086134,-0.177661,0.266636,0.069486,-0.326882,0.177167,0.061077,-0.075401,0.005924,-0.054985,0.116749,0.350982,-0.009995,0.025124,0.258137,-0.101006,0.019791,0.139517,0.314711,-0.050073,0.151783,-0.315080,-0.038057,0.174074,0.046279,...,-0.292389,-0.348844,0.284930,-0.178565,-0.229020,0.004481,-0.058743,0.078119,0.072421,-0.073642,0.207777,-0.273681,-0.001055,-0.111282,-0.057597,-0.021924,0.275710,0.292648,0.308795,-0.103248,-0.112356,0.004885,0.058712,-0.029196,0.357402,0.264174,-0.171449,-0.002537,-0.199154,-0.133387,-0.137322,-0.075204,-0.012754,0.120464,0.174506,-0.025587,-0.230657,-0.119528,0.161885,0.023388
5844,Austen,"[Anne, tenderness, worth, Captain, Wentworth, ...",0.632281,0.077009,0.196799,0.029385,-0.121222,0.154380,0.181994,0.157856,-0.427112,-0.413381,0.243967,0.560073,-0.059366,0.098407,-0.051147,0.413415,0.051677,-0.502991,0.128840,0.068116,-0.107713,-0.033806,-0.105611,0.190561,0.291399,0.022433,0.053567,0.215851,-0.127223,0.105848,0.158875,0.411351,-0.053292,0.101892,-0.314701,-0.058538,0.195481,0.084126,...,-0.314670,-0.372374,0.403059,-0.187628,-0.266738,0.047984,-0.035017,0.104417,0.087666,-0.103232,0.182720,-0.311880,-0.110534,-0.131663,-0.163434,0.049741,0.316707,0.347860,0.392011,-0.210202,-0.081645,0.057662,0.099600,-0.022488,0.451159,0.240024,-0.179797,-0.044203,-0.232817,-0.211197,-0.123126,-0.079111,-0.110097,0.069903,0.107465,0.046320,-0.175851,-0.202311,0.154608,-0.027259
5845,Austen,"[profession, friend, wish, tenderness, dread, ...",0.348791,0.083772,0.058605,0.008339,-0.098083,0.071398,0.120888,0.101017,-0.196383,-0.312920,0.121047,0.297962,-0.055693,0.058185,-0.106911,0.170782,0.037285,-0.211749,0.105903,0.031494,-0.043486,0.009631,-0.031994,0.069489,0.215218,-0.008992,0.021445,0.160968,-0.061668,0.016421,0.080642,0.196041,-0.028709,0.082852,-0.195428,-0.019639,0.104236,0.035161,...,-0.183117,-0.215457,0.186344,-0.099919,-0.144893,0.005009,-0.033539,0.047289,0.051513,-0.046396,0.130817,-0.164600,-0.011340,-0.066149,-0.029347,-0.009993,0.169899,0.182248,0.187937,-0.058759,-0.062350,0.000901,0.029508,-0.018416,0.219531,0.163779,-0.107103,-0.003651,-0.117623,-0.077279,-0.092449,-0.051767,-0.014389,0.076672,0.110581,-0.009370,-0.146343,-0.081085,0.100576,0.018713
5846,Austen,"[glory, sailor, wife, pay, tax, quick, alarm, ...",0.249486,0.062772,0.041905,0.002417,-0.068607,0.051631,0.085007,0.071377,-0.135219,-0.228775,0.087432,0.207068,-0.042975,0.041657,-0.082010,0.118597,0.023817,-0.142956,0.077513,0.019678,-0.027345,0.007915,-0.023020,0.047241,0.150307,-0.011736,0.017221,0.118449,-0.042509,0.010208,0.053487,0.135068,-0.016912,0.060847,-0.137600,-0.011406,0.069618,0.026216,...,-0.129360,-0.150727,0.130534,-0.074428,-0.104039,0.002013,-0.024912,0.032716,0.037925,-0.033899,0.094335,-0.112884,-0.001695,-0.044062,-0.015514,-0.013850,0.118661,0.130088,0.131075,-0.038092,-0.044497,-0.005368,0.016491,-0.011215,0.149955,0.115514,-0.073954,0.000039,-0.082708,-0.053678,-0.067780,-0.040849,-0.012277,0.060344,0.081635,-0.009071,-0.105625,-0.057781,0.069353,0.017358


Great! Now we have 3 dataframes that are ready for modeling! We will compare 3 different classification models: Logistic Regression, Random Forest Classifier, and Gradient Boosting Classifier. Hopefully we find an optimal window size. It would be a bonus if we see a direct relationship between window size and model performance.

In [18]:
# Defining Xs and Ys

# For our Xs we want to remove the text and author columns. 
X1 = np.array(sentences1.drop(['text','author'], axis=1))
X2 = np.array(sentences2.drop(['text','author'], axis=1))
X3 = np.array(sentences3.drop(['text','author'], axis=1))

# For our Ys, we want to see if we can predict the author who wrote the sentence
Y1 = sentences1['author']
Y2 = sentences2['author']
Y3 = sentences3['author']

# The obligatory Train Test Split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, Y1, test_size=0.4, random_state=44)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, Y2, test_size=0.4, random_state=44)
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, Y3, test_size=0.4, random_state=44)

# Initializing the Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()


# Results Model 1
print('---------Word2Vec Model 1: window = 4------------')
# Fitting model 1
lr.fit(X_train1, y_train1)
rfc.fit(X_train1, y_train1)
gbc.fit(X_train1, y_train1)
print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train1, y_train1))
print('\nTest set score:', lr.score(X_test1, y_test1))

print("--------------------------Random Forest Scores------------------------")
print('Training set score:', rfc.score(X_train1, y_train1))
print('\nTest set score:', rfc.score(X_test1, y_test1))

print("------------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train1, y_train1))
print('\nTest set score:', gbc.score(X_test1, y_test1))
print('\n') #New line to make nice!

# Results Model 2
print('---------Word2Vec Model 2: window = 6------------')
# Fitting model 2
lr.fit(X_train2, y_train2)
rfc.fit(X_train2, y_train2)
gbc.fit(X_train2, y_train2)
print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train2, y_train2))
print('\nTest set score:', lr.score(X_test2, y_test2))

print("--------------------------Random Forest Scores------------------------")
print('Training set score:', rfc.score(X_train2, y_train2))
print('\nTest set score:', rfc.score(X_test2, y_test2))

print("------------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train2, y_train2))
print('\nTest set score:', gbc.score(X_test2, y_test2))
print('\n')

# Results Model 3
print('---------Word2Vec Model 3: window = 8------------')
# Fitting model 3
lr.fit(X_train3, y_train3)
rfc.fit(X_train3, y_train3)
gbc.fit(X_train3, y_train3)
print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train3, y_train3))
print('\nTest set score:', lr.score(X_test3, y_test3))

print("--------------------------Random Forest Scores------------------------")
print('Training set score:', rfc.score(X_train3, y_train3))
print('\nTest set score:', rfc.score(X_test3, y_test3))

print("------------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train3, y_train3))
print('\nTest set score:', gbc.score(X_test3, y_test3))

---------Word2Vec Model 1: window = 4------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.7832572298325723

Test set score: 0.7534246575342466
--------------------------Random Forest Scores------------------------
Training set score: 0.9939117199391172

Test set score: 0.8082191780821918
------------------------Gradient Boosting Scores----------------------
Training set score: 0.8952815829528158

Test set score: 0.8105022831050228


---------Word2Vec Model 2: window = 6------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.7866057838660578

Test set score: 0.7639269406392694
--------------------------Random Forest Scores------------------------
Training set score: 0.9939117199391172

Test set score: 0.8054794520547945
------------------------Gradient Boosting Scores----------------------
Training set score: 0.9059360730593607

Test set score: 0.8164383561643835


---------Word2Vec Mo

Each model is different, so let's discuss how model performance changes with increasing window size.

**Logistic Regression**
Logistic Regression seems to perform better as we increase the window size, so if I were to continue with it I would try increasing the window further. Most importantly, the gap between training and test scores does not seem to widen as performance improves. I would say that model 3 is the best Logistic Regression model.

**Random Forest Classifier**
We have pretty severe overfitting in our Random Forest models: for models 1 and 2, the training score is about 0.99 while the test score is about 0.81. Model 3 is still the best, but I would say all of these models have issues with overfitting.

**Gradient Boosting Classifier**
Like the Random Forest models, these models overfit, though less severely, and they also score better on the test set. The models are not too different from one another, but I would argue that model 2 is the best GBC model because of its slightly higher test score. Still, not much separates the models.
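
One way to put a number on the overfitting discussed above is the gap between training and test scores. The sketch below uses the scores printed above for windows 4 and 6; model 3 is left out because its scores are cut off in the output. The `scores` and `gaps` names are my own.

```python
# (train score, test score) copied from the printed results above
scores = {
    4: {
        "Logistic Regression": (0.7832572298325723, 0.7534246575342466),
        "Random Forest": (0.9939117199391172, 0.8082191780821918),
        "Gradient Boosting": (0.8952815829528158, 0.8105022831050228),
    },
    6: {
        "Logistic Regression": (0.7866057838660578, 0.7639269406392694),
        "Random Forest": (0.9939117199391172, 0.8054794520547945),
        "Gradient Boosting": (0.9059360730593607, 0.8164383561643835),
    },
}

# A larger train-minus-test gap suggests more overfitting
gaps = {
    window: {name: train - test for name, (train, test) in models.items()}
    for window, models in scores.items()
}

for window, model_gaps in gaps.items():
    for name, gap in model_gaps.items():
        print(f"window={window} {name}: gap={gap:.3f}")
```

For both window sizes, Random Forest has the largest gap and Logistic Regression the smallest.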

From here I could proceed in many directions. Although the Random Forest classifier overfits the most, I will keep it in, since changing other parameters may improve it. So far, I think Logistic Regression model 3 is the most promising, even though Logistic Regression has the lowest test scores: it is not yet overfitting, so we may be able to keep improving it without overfitting too much.

**We will now set window to 8**

Next: Let's change another variable and compare results again! 