<a href="https://colab.research.google.com/github/AlexBB999/NLP/blob/master/31_6_Assignment_NLP_features_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
#pip install gensim

In [2]:
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
import nltk
from nltk.corpus import gutenberg
import gensim
import warnings
warnings.filterwarnings("ignore")

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


**UTITLITY FUNCTION FOR STANDARD TEXT CLEANING**

In [0]:
# utility function for standard text cleaning
def text_cleaner(text):
    # visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    text = ' '.join(text.split())
    return text

**LOAD AND CLEAN THE DATA**

In [0]:
# load and clean the data
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# the chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

**PARSE THE CLEANED DATA**

In [0]:
# parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

**GROUP INTO SENTENCES**

In [6]:
# group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# combine the sentences from the two novels into one data frame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

Unnamed: 0,text,author
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


**GET RID OF STOP WORDS AND PUNCTUATION**

In [0]:
# get rid off stop words and punctuation
# and lemmatize the tokens
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop]

**NOW READY TO VECTORIZE USING WORD2VEC**



**The Word2Vec class has several parameters**.

We set the following parameters:

**workers**=4: We set the number of threads to run in parallel to 4 (make sense if your computer has available computing units).

**min_count**=1: We set the minimum word count threshold to 1.

**window**=6: We set the number of words around target word to consider to 6.

**sg**=0: We use CBOW because our corpus is small.

**sample**=1e-3: We penalize frequent words.

**size**=100: We set the word vector length to 100.

**hs**=1: We use hierarchical softmax.

In [0]:
# train word2vec on the the sentences
model = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=6,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)

Before jumping in the machine learning model for prediction,

 **let's play with our word2vec word representation we just trained**. 
 
 **specifically, we'll look into**:

The first five words that are closer to lady.

The word that doesn't fit in list: dad dinner mom aunt uncle.

The similarity score of woman and man.

The similarity score of horse and cat.

Note that all of the above calculations are based on the word2vec representations of the words we just trained above.

In [9]:
print(model.most_similar(positive=['lady', 'man'], negative=['woman'], topn=5))
print(model.doesnt_match("dad dinner mom aunt uncle".split()))
print(model.similarity('woman', 'man'))
print(model.similarity('horse', 'cat'))

[('heart', 0.9986150860786438), ('head', 0.9980312585830688), ('receive', 0.9979260563850403), ('send', 0.9978547096252441), ('compare', 0.9978461265563965)]
dinner
0.99793494
0.91907066


Well, the results make sense to some degree but it's obvious that our representations aren't perfect.

 **This is because our corpus is small**.
 
In order to get more meaningful results, we need to train word2vec representations using much larger corpuses.


Now, **let's create our numerical features using the word2vec representations of the words**. 

**In the following, we get the word2vec vectors of each word in a sentence and take the average of all the vectors in the high dimensional space** (in our case it's 100).

**So, as a result, we'll have a vector of 100 dimensions as the feature for a sentence**.
 
 **We then use each dimension as a separate feature**
 
  **which means that in our final dataset we'll have 100 numerical feature**

In [10]:
word2vec_arr = np.zeros((sentences.shape[0],100))

for i, sentence in enumerate(sentences["text"]):
    word2vec_arr[i,:] = np.mean([model[lemma] for lemma in sentence], axis=0)

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]],word2vec_arr], axis=1)
sentences.dropna(inplace=True)

sentences.head()

Unnamed: 0,author,text,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,Carroll,"[Alice, begin, tired, sit, sister, bank, have,...",-0.210979,-0.06258,0.157565,-0.139776,-0.300331,0.030455,0.07755,-0.294559,0.40114,0.053366,0.024731,0.196388,0.190764,0.308147,-0.063027,0.412527,0.139875,-0.267728,0.283771,-0.213635,-0.280361,0.244138,0.130352,-0.276069,-0.276648,-0.039348,0.227371,0.139667,0.073735,0.31896,0.166239,0.384208,-0.207219,0.036184,0.017213,0.101088,0.002001,-0.128893,...,0.265528,0.080171,0.069522,-0.210469,-0.218688,-0.071758,-0.008416,-0.210619,-0.424323,0.063695,0.327052,0.050011,-0.039167,0.440649,-0.325632,-0.036917,0.013718,0.249279,-0.154799,0.150525,-0.568831,0.460826,0.001292,0.688294,0.550359,0.074429,-0.07923,-0.062074,0.422813,0.180806,0.015324,0.042619,0.001097,0.012137,0.023959,-0.027405,0.36184,-0.03483,0.025005,-0.131617
1,Carroll,"[consider, mind, hot, day, feel, sleepy, stupi...",-0.14891,-0.049488,0.153844,-0.123738,-0.238799,0.019646,0.064064,-0.233655,0.317688,0.043966,0.001401,0.168656,0.153475,0.232277,-0.047278,0.338224,0.12262,-0.215447,0.23822,-0.164725,-0.229579,0.206839,0.102739,-0.233687,-0.229472,-0.01965,0.185433,0.122994,0.062729,0.273803,0.125416,0.322074,-0.176348,0.046174,0.005779,0.084155,0.006412,-0.092547,...,0.210656,0.070094,0.061898,-0.178471,-0.168583,-0.054268,0.004315,-0.163944,-0.350977,0.057233,0.272216,0.044247,-0.035227,0.374482,-0.268617,-0.022345,0.014725,0.196182,-0.134059,0.128452,-0.461159,0.401367,-0.002373,0.556046,0.462653,0.04941,-0.073997,-0.072971,0.3464,0.149298,0.009316,0.039262,-0.005203,0.011666,0.011551,-0.033964,0.284794,-0.020304,0.015079,-0.098131
2,Carroll,"[remarkable, Alice, think, way, hear, Rabbit, ...",-0.246879,-0.068721,0.18159,-0.154642,-0.352796,0.028803,0.064865,-0.357944,0.468414,0.064683,0.037273,0.22039,0.234696,0.367887,-0.059296,0.487883,0.1759,-0.325933,0.332101,-0.234599,-0.332688,0.266216,0.162187,-0.324126,-0.330758,-0.048694,0.252641,0.166388,0.083734,0.372707,0.201353,0.436548,-0.249705,0.05397,0.028631,0.128952,0.004195,-0.161062,...,0.300029,0.089528,0.102752,-0.254452,-0.235483,-0.071454,-0.009369,-0.244593,-0.498783,0.082906,0.381511,0.0492,-0.05088,0.504575,-0.389542,-0.05589,0.026004,0.290325,-0.190111,0.179262,-0.668596,0.547679,-0.001092,0.811608,0.642291,0.086922,-0.090486,-0.079367,0.495971,0.21719,0.012852,0.054913,0.003135,0.016187,0.026923,-0.038698,0.422641,-0.051065,0.027931,-0.150146
3,Carroll,"[oh, dear]",-0.198752,-0.055472,0.189465,-0.127368,-0.291132,0.009599,0.060696,-0.32327,0.389814,0.072731,0.030511,0.188868,0.196643,0.344093,-0.058606,0.410212,0.165069,-0.258981,0.308424,-0.193313,-0.292446,0.235932,0.147183,-0.302798,-0.306351,-0.023146,0.227181,0.175893,0.038049,0.337455,0.177974,0.361876,-0.216765,0.05785,0.007536,0.126324,-0.004724,-0.125484,...,0.280289,0.080346,0.10359,-0.197471,-0.192529,-0.036362,0.019838,-0.21062,-0.419918,0.048789,0.322911,0.028338,-0.049335,0.437095,-0.343182,-0.032262,0.02231,0.241956,-0.186074,0.153712,-0.577593,0.49017,0.013413,0.715294,0.542974,0.09705,-0.079153,-0.072409,0.398773,0.176581,0.013658,0.047693,-0.003861,0.023843,0.029919,-0.019725,0.360995,-0.019371,0.030002,-0.113956
4,Carroll,"[shall, late]",-0.170687,-0.064303,0.120265,-0.105524,-0.256076,0.023947,0.064427,-0.234864,0.308412,0.032545,0.007218,0.161663,0.152126,0.23769,-0.047349,0.336657,0.101153,-0.218439,0.222103,-0.166517,-0.208742,0.180475,0.116417,-0.210406,-0.219503,-0.041713,0.174058,0.119881,0.07181,0.25376,0.136691,0.306267,-0.166559,0.015745,0.01022,0.073358,-0.01418,-0.101173,...,0.197397,0.067974,0.041263,-0.165409,-0.17777,-0.067592,-0.001795,-0.164673,-0.335656,0.062644,0.26383,0.043861,-0.023529,0.337201,-0.256994,-0.019061,0.008255,0.201162,-0.123421,0.119953,-0.445364,0.372372,-0.002701,0.538392,0.44292,0.059975,-0.058816,-0.054449,0.32614,0.151117,0.005,0.026683,-0.000501,0.011815,0.020486,-0.02136,0.287478,-0.018363,0.004685,-0.101913


**This is a dataset format that we like**.

 **Now, we're ready to jump into modeling step with our features**. 

##**Word2vec in action**

**Notice that we now have a dataset where the columns named from 0 to 99 are the features we'll use in the following models**. 

We use the same models that we built in the previous checkpoints to predict the author of a sentence:

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], 1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.7485540334855403

Test set score: 0.7593607305936073
----------------------Random Forest Scores----------------------
Training set score: 0.9914764079147641

Test set score: 0.7986301369863014
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8867579908675799

Test set score: 0.8082191780821918


The scores aren't great compared to the scores of the previous checkpoints.

**The main reason is the small size of our corpus**.

**So, let's use word2vec vectors that are trained on a very large corpus**.

 For this, we use pre-trained vectors released by Google. Google released a quite large word2vec vectors that are trained on around 100 billion words from Google News dataset. 
 
 
**Their corpus contains 3 million words and the word vectors they trained have 300 features each**.

We'll download the **pre-trained vectors** from this address: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz. Note that the download and the following codes take some time. So, we recommend running the following cells in Google Colab

In [0]:
# Load Google's pre-trained Word2Vec model.
model_pretrained = gensim.models.KeyedVectors.load_word2vec_format(
    'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

**Now, we have the pre-trained vectors in a variable called model_pretrained**. 

**Next, we look for the vector representations of the words in our corpus**.

 **For simplicity, if a word in a sentence can't be found in the vocabulary of these pre-trained vectors,**
 
 **we just simply drop those sentences from our dataset**.
 
 ou can follow alternative approaches if you like

In [0]:
word2vec_arr = np.zeros((sentences.shape[0],300))

for i, sentence in enumerate(sentences["text"]):
  try:
    word2vec_arr[i,:] = np.mean([model_pretrained[lemma] for lemma in sentence], axis=0)
  except KeyError:
    word2vec_arr[i,:] = np.full((1,300), np.nan) #fill in with nan -- this vector will be dropped on line #12
    continue

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]],word2vec_arr], axis=1)
sentences.dropna(inplace=True)

print("Shape of the dataset: {}".format(sentences.shape))
sentences.head()

Shape of the dataset: (4610, 302)


Unnamed: 0,author,text,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,...,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299
0,Carroll,"[Alice, begin, tired, sit, sister, bank, have,...",0.046265,0.016199,-0.036288,0.08241,-0.010284,0.015515,0.005437,-0.035947,0.067871,0.040186,0.002303,-0.071809,-0.002277,0.035602,-0.087659,0.067581,0.083479,0.109125,0.038206,-0.112296,0.021118,0.067197,0.000285,-0.046423,0.030869,0.000274,-0.084494,0.078152,0.047164,-0.023407,-0.105367,-0.03999,-0.110767,-0.065475,0.023956,0.010934,0.094955,0.015027,...,-0.055796,0.055115,-0.117415,-0.030527,-0.015355,0.163165,-0.034854,0.015172,-0.106117,0.035062,0.086723,0.159433,0.103741,0.062915,0.097021,-0.047644,-0.026648,-0.071558,0.01808,-0.039431,0.121521,-0.125867,0.006816,0.029865,0.046413,0.018112,-0.087307,0.042181,-0.015435,0.128412,-0.066516,0.029852,-0.042609,-0.044208,-0.056998,-0.063269,0.000244,-0.085071,-0.00034,-0.064371
1,Carroll,"[consider, mind, hot, day, feel, sleepy, stupi...",0.046331,0.020463,-0.002012,0.101565,-0.066478,-0.035698,0.045293,-0.068695,0.04405,0.079996,0.010562,-0.09824,-0.024309,0.042576,-0.078658,0.026042,-0.025208,0.128391,0.054481,-0.081564,-0.022604,0.060187,0.014813,-0.00264,0.089216,0.010905,-0.080477,0.078742,0.071459,-0.042953,-0.011639,0.026516,-0.042924,-0.028997,-0.010134,-0.033885,0.051852,0.018926,...,0.037855,0.004276,-0.073813,0.033909,0.053077,0.063299,-0.044852,-0.004278,-0.053132,-0.035156,0.04793,0.12634,0.125036,0.04657,0.049766,-0.076279,-0.069141,-0.122912,-0.052948,0.055787,0.081729,0.011096,0.005422,0.050716,-0.050148,-0.008294,-0.072707,-0.002824,0.021307,0.035784,0.05594,0.085838,-0.067052,-0.013628,-0.027802,-0.033665,-0.023586,0.00962,0.030316,0.000908
2,Carroll,"[remarkable, Alice, think, way, hear, Rabbit, ...",0.061646,-0.006958,-0.013023,0.147003,-0.052933,-0.077866,0.033997,-0.06189,0.104706,0.151611,-0.083191,-0.102318,-0.043243,-0.060654,-0.060211,0.105164,0.127869,0.207825,-0.009186,0.009155,0.005402,0.077332,0.129974,-0.026632,0.149017,0.04354,-0.082504,0.020443,0.117149,-0.014988,-0.064789,-0.023331,-0.06897,0.002205,0.015739,0.018581,0.110168,0.057068,...,-0.073837,-0.021027,0.002594,0.025757,-0.004457,0.067825,-0.060242,-0.063232,-0.079094,0.098316,0.021147,0.124046,0.078278,0.056248,0.099792,-0.106703,0.034882,-0.111328,-0.009624,-0.011642,0.088547,-0.059265,-0.041046,0.069794,-0.002939,0.018978,-0.025116,-0.057938,0.007706,0.120476,-0.006882,0.030754,-0.073837,-0.010359,-0.086411,-0.156464,-0.000771,-0.000549,-0.003784,0.029114
3,Carroll,"[oh, dear]",0.073975,0.134277,0.141357,0.256348,-0.147949,0.09967,0.077148,-0.093628,0.108887,0.281738,-0.201172,-0.020752,-0.266602,0.000732,-0.036865,0.294434,0.158203,0.287109,-0.114624,0.03833,0.141357,-0.046021,0.407227,0.047852,0.322266,0.213379,-0.090576,0.022812,0.171265,-0.283203,0.193848,0.092285,-0.122803,0.02977,-0.116943,0.026123,0.137451,0.055298,...,-0.014648,0.112793,0.071716,-0.133911,-0.091553,-0.079041,-0.15625,-0.029053,-0.024719,0.102844,-0.084473,0.163086,-0.031738,-0.084473,0.14917,-0.082031,-0.023438,-0.199219,-0.253418,0.206055,0.160156,-0.05603,-0.138184,0.208496,0.030762,0.033447,-0.06189,-0.022461,-0.14624,-0.032959,0.058228,0.000854,-0.094971,-0.052668,-0.091919,-0.142456,-0.053711,-0.112671,-0.148193,0.186798
4,Carroll,"[shall, late]",0.095215,0.084473,0.206787,0.211182,0.043579,-0.155762,0.088379,-0.038574,0.065613,0.001221,-0.144287,0.001465,-0.000771,0.189453,-0.05835,-0.062134,0.045898,0.130127,0.211426,0.074341,-0.056122,-0.111145,0.104355,0.069946,0.191895,0.057404,-0.003906,0.107666,-0.040039,0.082275,-0.046707,-0.150635,-0.006226,0.04895,-0.088745,0.088501,-0.081573,-0.180542,...,0.133301,0.074219,0.049438,0.092743,0.077618,0.084229,-0.100586,-0.022217,0.043579,-0.029785,0.212158,0.073242,0.10022,0.062256,0.16748,0.010693,-0.139923,-0.013805,-0.127014,0.001465,-0.120972,0.06308,-0.024597,0.027847,0.010254,-0.073547,0.100098,0.023438,0.107178,0.065918,-0.021667,-0.103516,-0.038578,-0.007385,0.020264,0.134155,-0.177246,-0.254639,-0.212158,0.087646


As a result, we have a dataset of 4114 rows and 300 features (excluding the text and the author columns).

Now, we can run our classifiers using this dataset

**SEEMS EACH RUN PRODUCES A SLIGHTLY DIFF NUMBER OF ROWS**

**OR ELSE CORPUS HAS BEEN UPDATED**

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], 1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [0]:
print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.8897324656543746

Test set score: 0.8449023861171366
----------------------Random Forest Scores----------------------
Training set score: 0.9924078091106291

Test set score: 0.7977223427331888
----------------------Gradient Boosting Scores----------------------
Training set score: 0.9508315256688359

Test set score: 0.8156182212581344


**IMPROVED RESULTS**

#**ASSIGNMENTS**

**Train your own word2vec representations** as we did in our first example in the checkpoint.

But, you need to experiment with the hyperparameters of the vectorization step

**Modify the hyperparameters and run the classification models again**

**Can you wrangle any improvements**?

**MAYBE, ONCE  I UNDERSTAND THE PARAMETERS**

**HERE ARE ORIGINAL PARAMETERS:**



The Word2Vec class has several parameters.

We set the following parameters:

**workers**=4: We set the number of threads to run in parallel to 4 (make sense if your computer has available computing units).

**min_count**=1: We set the minimum word count threshold to 1.

**window**=6: We set the number of words around target word to consider to 6.

**sg**=0: We use CBOW because our corpus is small.

**sample**=1e-3: We penalize frequent words.

**size**=100: We set the word vector length to 100.

**hs**=1: We use hierarchical softmax.

I WILL CHANGE A FEW:

SG=1  SKIP-GRAM

HS=0

In [0]:
# train word2vec on the the sentences
model = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=6,
    sg=1,
    sample=1e-3,
    size=100,
    hs=0
)

In [14]:
print(model.most_similar(positive=['lady', 'man'], negative=['woman'], topn=5))
print(model.doesnt_match("dad dinner mom aunt uncle".split()))
print(model.similarity('woman', 'man'))
print(model.similarity('horse', 'cat'))

[('brother', 0.9980732798576355), ('come', 0.99798583984375), ('sight', 0.9979526400566101), ('join', 0.997667133808136), ('Laconia', 0.9976547360420227)]
uncle
0.998033
0.9990655


**QUITE DIFFERENT**

In [15]:
word2vec_arr = np.zeros((sentences.shape[0],100))

for i, sentence in enumerate(sentences["text"]):
    word2vec_arr[i,:] = np.mean([model[lemma] for lemma in sentence], axis=0)

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]],word2vec_arr], axis=1)
sentences.dropna(inplace=True)

sentences.head()

Unnamed: 0,author,text,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,Carroll,"[Alice, begin, tired, sit, sister, bank, have,...",-0.086738,-0.066973,0.194718,-0.038142,-0.092576,0.025404,-0.097356,-0.069098,0.344794,0.014709,-0.113208,0.086345,0.065842,0.111541,0.020347,0.383489,0.04294,-0.022669,0.145239,-0.206149,-0.35893,0.064531,-0.004138,-0.196842,-0.094288,-0.043455,0.080698,0.168811,0.099269,0.063211,-0.107292,0.240973,-0.115087,0.120025,-0.068237,-0.044756,-0.075345,-0.164414,...,0.067016,-0.042553,-0.051143,-0.095098,-0.14428,-0.175392,0.13998,-0.138948,-0.326689,-0.007645,0.30148,-0.021835,-0.026336,0.185951,-0.297374,0.01758,0.043338,0.190664,-0.153402,0.026305,-0.474025,0.226737,0.074406,0.277472,0.239781,0.005632,-0.151593,-0.079029,0.165642,-0.024022,-0.016095,0.008199,-0.024706,-0.034492,-0.004901,-0.14116,0.090249,-0.147267,-0.071003,-0.095955
1,Carroll,"[consider, mind, hot, day, feel, sleepy, stupi...",-0.079642,-0.058659,0.176495,-0.034304,-0.08051,0.019714,-0.08896,-0.059914,0.30689,0.012223,-0.099544,0.077402,0.054977,0.097927,0.02027,0.345209,0.039257,-0.023266,0.131896,-0.183358,-0.318768,0.059231,-0.004875,-0.175451,-0.087784,-0.040046,0.072015,0.147458,0.089864,0.056492,-0.095106,0.215921,-0.101821,0.109779,-0.060913,-0.042119,-0.066199,-0.149445,...,0.059179,-0.036995,-0.044467,-0.08412,-0.129202,-0.154671,0.122542,-0.125052,-0.292851,-0.003369,0.269887,-0.019103,-0.024409,0.165669,-0.263451,0.019385,0.041925,0.171544,-0.134998,0.018321,-0.423333,0.200755,0.067218,0.249756,0.21581,0.0049,-0.138936,-0.070444,0.152848,-0.022898,-0.015348,0.002264,-0.023232,-0.031777,-0.004464,-0.125953,0.078503,-0.132127,-0.062781,-0.082511
2,Carroll,"[remarkable, Alice, think, way, hear, Rabbit, ...",-0.085359,-0.064368,0.194125,-0.031936,-0.089395,0.025563,-0.096264,-0.063209,0.33673,0.016716,-0.111266,0.084056,0.063879,0.109655,0.018697,0.376483,0.044901,-0.022722,0.145158,-0.201431,-0.35295,0.067148,-0.003106,-0.197068,-0.095209,-0.041725,0.078653,0.169217,0.097441,0.058873,-0.107858,0.235408,-0.116406,0.116216,-0.064614,-0.039719,-0.072207,-0.165831,...,0.066653,-0.040617,-0.048163,-0.091763,-0.140608,-0.172839,0.136259,-0.134173,-0.323789,-0.006166,0.296192,-0.02586,-0.025603,0.179698,-0.293718,0.02102,0.036755,0.194217,-0.151813,0.027326,-0.470922,0.226293,0.075123,0.27233,0.237724,0.002797,-0.143957,-0.076928,0.16065,-0.023743,-0.017215,0.008949,-0.021102,-0.03556,-0.002433,-0.14129,0.092959,-0.143728,-0.068994,-0.095584
3,Carroll,"[oh, dear]",-0.096373,-0.06453,0.197045,-0.045525,-0.086379,0.018699,-0.110723,-0.070469,0.354138,0.009603,-0.111456,0.087074,0.070432,0.119396,0.014106,0.390554,0.052142,-0.016757,0.151679,-0.209539,-0.366393,0.06742,-0.008687,-0.207316,-0.105511,-0.045703,0.078528,0.172799,0.101191,0.057001,-0.10292,0.239332,-0.118886,0.128884,-0.070392,-0.034584,-0.078401,-0.168955,...,0.069063,-0.037024,-0.043654,-0.100157,-0.145141,-0.170359,0.144143,-0.145218,-0.33638,-0.005551,0.303418,-0.031028,-0.029905,0.195296,-0.305234,0.015979,0.049964,0.195738,-0.159774,0.02984,-0.489543,0.222922,0.074329,0.288482,0.246239,0.008297,-0.150923,-0.076451,0.164616,-0.026882,-0.013977,0.011674,-0.020571,-0.029737,-0.004082,-0.149948,0.089315,-0.156816,-0.069571,-0.104728
4,Carroll,"[shall, late]",-0.090212,-0.071698,0.193818,-0.039499,-0.095114,0.027362,-0.101201,-0.071547,0.348967,0.008608,-0.111273,0.093129,0.060946,0.111183,0.022335,0.393584,0.043653,-0.030082,0.154541,-0.207357,-0.360797,0.070157,-0.00758,-0.199858,-0.099084,-0.048763,0.078787,0.168459,0.106232,0.064824,-0.103579,0.24188,-0.114393,0.124432,-0.068226,-0.049527,-0.08043,-0.165246,...,0.065275,-0.046945,-0.053838,-0.096982,-0.148763,-0.175016,0.139333,-0.143031,-0.327956,0.001885,0.306482,-0.026886,-0.021027,0.184712,-0.301301,0.020448,0.053619,0.1943,-0.153969,0.023008,-0.480453,0.224664,0.070125,0.285326,0.247936,0.003149,-0.161779,-0.081726,0.17144,-0.021733,-0.021447,0.003302,-0.025618,-0.036305,-0.004083,-0.138965,0.085015,-0.151768,-0.067145,-0.094273


**USE IN MODELS**

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], 1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.7367906066536204

Test set score: 0.7451076320939335
----------------------Random Forest Scores----------------------
Training set score: 0.9944553163731246

Test set score: 0.812133072407045
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8956294846705806

Test set score: 0.8082191780821918


**OLD SCORES:**

----------------------Logistic Regression Scores----------------------
Training set score: 0.7485540334855403

Test set score: 0.7593607305936073

----------------------Random Forest Scores----------------------
Training set score: 0.9914764079147641

Test set score: 0.7986301369863014

----------------------Gradient Boosting Scores----------------------
Training set score: 0.8867579908675799

Test set score: 0.8082191780821918

**OVERALL ABOUT THE SAME**

LOGITIC REGRESSION TEST SCORE IMPROVED

RANDOM FOREST TEST WORSE

GRADIENT BOOST -- IDENTICAL TEST SCORE

**///////////////////////////////////////////////////////////////////**

**TRY ONCE MORE WITH A FEW CHANGES**

INCREASE WINDOW 

INCREASE SIZE

BACK TO HIERARCHICAL



In [0]:
# train word2vec on the the sentences
model = gensim.models.Word2Vec(
    sentences["text"],
    workers=4,
    min_count=1,
    window=7,
    sg=1,
    sample=1e-3,
    size=150,
    hs=1
)

In [18]:
print(model.most_similar(positive=['lady', 'man'], negative=['woman'], topn=5))
print(model.doesnt_match("dad dinner mom aunt uncle".split()))
print(model.similarity('woman', 'man'))
print(model.similarity('horse', 'cat'))

[('drawing', 0.941796600818634), ('state', 0.9320310950279236), ('gallant', 0.9194947481155396), ('remain', 0.9146274924278259), ('Harvilles', 0.9136101007461548)]
dinner
0.9044591
0.55610454


**CREATE DATASET WITH 100 NUMERICAL FEATURES**

In [20]:
word2vec_arr = np.zeros((sentences.shape[0],150))

for i, sentence in enumerate(sentences["text"]):
    word2vec_arr[i,:] = np.mean([model[lemma] for lemma in sentence], axis=0)

word2vec_arr = pd.DataFrame(word2vec_arr)
sentences = pd.concat([sentences[["author", "text"]],word2vec_arr], axis=1)
sentences.dropna(inplace=True)

sentences.head()

Unnamed: 0,author,text,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,...,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149
0,Carroll,"[Alice, begin, tired, sit, sister, bank, have,...",-0.084378,0.018499,0.260216,0.046439,-0.221608,0.14642,0.08537,-0.027138,0.134531,-0.035061,0.029242,-0.004889,0.137644,0.224693,0.063269,0.220398,0.095539,-0.001302,0.128979,-0.10036,-0.069317,0.061253,0.054233,-0.220689,-0.097462,0.006515,-0.018772,0.130268,-0.031855,0.203759,-0.111214,0.19311,-0.128819,-0.00149,-0.059513,0.059552,-0.084287,-0.223194,...,0.006753,0.091144,-0.224683,0.108219,0.18038,-0.008803,0.12389,-0.188165,-0.156035,0.07936,0.062038,0.129331,-0.149438,-0.059564,-0.014594,-0.175428,0.04318,0.112494,0.000294,-0.296572,-0.012595,-0.111965,0.055118,-0.094838,0.202652,0.12611,0.083071,0.154961,-0.123595,-0.040393,-0.274358,0.014741,0.187632,0.10808,0.156315,-0.066466,0.02222,0.040301,0.129662,-0.171606
1,Carroll,"[consider, mind, hot, day, feel, sleepy, stupi...",-0.028472,0.028775,0.262261,0.07293,-0.180799,0.096593,0.080446,-0.003522,0.092186,-0.034481,0.002959,-0.011663,0.120195,0.201413,0.080199,0.207363,0.09701,-0.010622,0.132023,-0.07478,-0.064657,0.066582,0.067479,-0.197422,-0.080487,0.02071,0.017442,0.124838,-0.045598,0.193651,-0.101412,0.185312,-0.12177,-0.012521,-0.048765,0.007971,-0.082453,-0.17878,...,0.002039,0.090419,-0.223923,0.080785,0.19626,0.011093,0.153335,-0.165216,-0.121276,0.046865,0.074955,0.136277,-0.134411,-0.039464,-0.008681,-0.162899,0.011784,0.133724,-0.009869,-0.27096,0.007105,-0.128114,0.017706,-0.05869,0.190532,0.146836,0.099117,0.118884,-0.137521,-0.068001,-0.215993,-0.015412,0.128415,0.091065,0.10646,-0.051644,0.018346,0.023013,0.11906,-0.125782
2,Carroll,"[remarkable, Alice, think, way, hear, Rabbit, ...",-0.060983,0.032306,0.260433,0.126023,-0.193101,0.115526,0.076531,-0.001532,0.092028,9e-06,0.017373,-0.018527,0.145436,0.24844,0.077122,0.213051,0.097073,0.000576,0.107587,-0.067797,-0.053518,0.029185,0.083014,-0.218593,-0.079607,0.001452,0.00835,0.17161,-0.052628,0.173561,-0.141169,0.206908,-0.187256,-0.017524,-0.028762,0.088768,-0.100092,-0.228569,...,0.108609,0.097236,-0.242244,0.103491,0.198991,-0.0071,0.182838,-0.221555,-0.103729,0.0568,0.095354,0.177517,-0.172848,-0.074063,0.01542,-0.11962,-0.005409,0.151671,0.083081,-0.386724,-0.031066,-0.181881,0.018626,-0.07958,0.222341,0.110591,0.104018,0.15073,-0.129412,-0.092924,-0.271747,-0.023699,0.183407,0.084136,0.112303,-0.035835,-0.01341,0.053706,0.13078,-0.130038
3,Carroll,"[oh, dear]",-0.116564,0.104293,0.158561,0.05531,-0.157164,0.085965,-0.000582,-0.058927,0.095363,-0.006801,0.034306,-0.0423,0.187039,0.215381,0.050706,0.230333,0.078028,0.066463,0.095453,-0.040032,-0.073373,0.048078,0.045447,-0.180907,-0.10731,-0.020279,-0.061278,0.095666,-0.005257,0.133415,-0.072658,0.142433,-0.157838,0.002671,-0.032974,0.177357,-0.052596,-0.200444,...,0.068565,0.101806,-0.195483,0.136282,0.145833,-0.021141,0.14925,-0.254835,-0.10625,0.055685,0.025446,0.109745,-0.160365,-0.054904,-0.003928,-0.147603,-0.015027,0.057542,0.140618,-0.338251,-0.005533,-0.111485,0.119561,-0.067606,0.240044,0.074989,0.076062,0.143343,-0.104508,-0.06445,-0.292905,0.010636,0.189469,0.120415,0.133286,-0.063201,-0.067698,0.044902,0.112977,-0.170898
4,Carroll,"[shall, late]",-0.111882,0.044768,0.22298,0.069981,-0.172694,0.114725,0.000695,-0.051019,0.135941,-0.038545,-0.001035,-0.007677,0.139189,0.210357,0.089719,0.274147,0.079383,-0.021677,0.135527,-0.067735,-0.087572,0.104314,0.029106,-0.194809,-0.107184,0.003298,0.00582,0.063873,-0.029725,0.201526,-0.102864,0.16658,-0.116645,0.044853,-0.018541,0.037376,-0.074647,-0.17817,...,-0.06837,0.07202,-0.200865,0.101388,0.10945,0.006601,0.144892,-0.183364,-0.162378,0.075774,0.016419,0.140988,-0.11858,-0.041544,0.013988,-0.190777,-0.059691,0.10971,0.009864,-0.256601,0.004386,-0.072144,0.039087,-0.030055,0.268397,0.04366,0.072215,0.122976,-0.095808,-0.026498,-0.231246,0.001003,0.176714,0.089559,0.116098,-0.100573,0.039992,0.007634,0.120263,-0.150806


In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], 1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.8338590956887487

Test set score: 0.8170347003154574
----------------------Random Forest Scores----------------------
Training set score: 0.9922888187872415

Test set score: 0.8338590956887487
----------------------Gradient Boosting Scores----------------------
Training set score: 0.9120224325271644

Test set score: 0.8375394321766562


OLD SCORES:

----------------------Logistic Regression Scores---------------------- Training set score: 0.7485540334855403

Test set score: 0.7593607305936073

----------------------Random Forest Scores---------------------- Training set score: 0.9914764079147641

Test set score: 0.7986301369863014

----------------------Gradient Boosting Scores---------------------- Training set score: 0.8867579908675799

Test set score: 0.8082191780821918

**BETTER FOR LOGISTIC REGRESSION**

**LOWER FOR RF and GB**