
##**Bag of words (BoW)**

Our first feature generation approach will be something called Bag of Words (BoW). 

BoW is quite simple: 

**our goal is to create a feature matrix such that the rows are observations and there is a column for each unique word in our vocabulary**.

 **Then we'll fill in this matrix by counting how many times each word appears in each observation**.
 
  **We will then use those counts as features**.

In scikit-learn, BoW features are generated with the **CountVectorizer** class.
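To make the counting concrete, here is a minimal hand-rolled sketch of the same idea on two toy sentences. It uses only the standard library rather than CountVectorizer, and the names toy_docs, vocabulary and bow_matrix are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter

# two toy "documents" (assumed examples, not taken from the novels)
toy_docs = ["the cat sat on the mat", "the dog sat"]

# tokenize naively on whitespace and count the words in each document
doc_counts = [Counter(doc.split()) for doc in toy_docs]

# the vocabulary is the sorted set of all unique words
vocabulary = sorted(set(word for counts in doc_counts for word in counts))

# one row per document, one column per vocabulary word
bow_matrix = [[counts[word] for word in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row is one observation and each column is one vocabulary word, which is the layout CountVectorizer produces for the real data further down.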

In [10]:
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
from nltk.corpus import gutenberg
import nltk
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


Now, we write a **helper function called text_cleaner** for cleaning the text.

 Specifically, it replaces double dashes with spaces, drops bracketed text and standalone numbers, and collapses repeated whitespace.

In [0]:
# utility function for standard text cleaning
def text_cleaner(text):
    # visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  better get rid of it now!
    text = re.sub(r'--',' ',text)
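    # drop anything inside square brackets
    text = re.sub(r"[\[].*?[\]]", "", text)
    # drop standalone integers and decimals
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # collapse runs of whitespace (including newlines) into single spaces
    text = ' '.join(text.split())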
    return text
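As a quick sanity check, the helper can be run on a short made-up string. The sample below is a hypothetical input chosen to exercise each rule; the function body repeats the same regular expressions so that the snippet runs on its own.

```python
import re

def text_cleaner(text):
    # same steps as the helper above
    text = re.sub(r'--', ' ', text)
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    return ' '.join(text.split())

# hypothetical sample: bracketed note, a number, a double dash, extra spaces
sample = "Alice [sic] ate 3 cakes--then   slept."
cleaned = text_cleaner(sample)
print(cleaned)
```

The bracketed note, the number and the double dash are removed, and the leftover whitespace is collapsed to single spaces.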

**Our unit of observation will be the sentences of these novels.**

 In other words, **our documents will be the sentences**.

In [0]:
# load and clean the data
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# the Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

Our cleaned texts are stored in two variables called alice and persuasion.

 **Note that we didn't split the texts into sentences so far**.
 
**We'll do that using SpaCy**.

 For that purpose, **we first load the English module of SpaCy and we parse both alice and persuasion texts with SpaCy**:

In [0]:
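# parse the cleaned novels. this can take a bit.
# 'en' is the shortcut link created by the download step above; spaCy 3.x
# removed shortcut links, so use spacy.load('en_core_web_sm') there.
nlp = spacy.load('en')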
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [81]:
alice_doc[0].is_alpha

True

In [82]:
alice_doc[:10]

Alice was beginning to get very tired of sitting by

In [21]:
len(alice_doc)

34363

In [22]:
type(alice_doc)

spacy.tokens.doc.Doc

**We can split our texts into sentences now**.

 This is easy with SpaCy. Since we already parsed our documents, **SpaCy takes care of deriving the sentences from the texts**.

**All we need to do is iterate over the .sents attribute of each parsed document**.

In the following, **we iterate using a list comprehension**:

In [112]:
# group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# combine the sentences from the two novels into one data frame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

Unnamed: 0,text,author
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


In [84]:
len(sentences.loc[0][0])

67

In [85]:
sentences.shape

(5848, 2)

**As a result, we ended up having a dataset that consists of two columns**.

**In the first column, we have the sentences and in the second column**, **we have the authors**. 

**Before jumping into BoW, we need to remove stop words and punctuation**.
 
**Then we should convert our tokens to lemmas or stems**.

**In this example, we prefer to lemmatize our tokens**.
 
  Again, we'll make use of the attributes that SpaCy attaches to the parsed tokens:

In [0]:
# get rid of stop words and punctuation
# and lemmatize the tokens
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_stop and not token.is_punct]
    )

In [114]:
sentences.head()

Unnamed: 0,text,author
0,Alice begin tired sit sister bank have twice p...,Carroll
1,consider mind hot day feel sleepy stupid pleas...,Carroll
2,remarkable Alice think way hear Rabbit oh dear,Carroll
3,oh dear,Carroll
4,shall late,Carroll


In [91]:
list(sentences)

['text', 'author']

**Now we can start converting the text in the first column of our dataset into a numerical form**.

As we said before, we'll use the BoW approach. For this purpose, we'll use **CountVectorizer** from scikit-learn as follows:

**ADD SENTENCE LENGTH AS A FEATURE**

In [0]:
#[len(sentence) for sentence in sentences['text']]

In [0]:
sentence_length=pd.Series([len(sentence) for sentence in sentences['text']])
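Note that len() applied to a string counts characters, not words, so the feature above is the character length of each lemmatized sentence. The small sketch below illustrates the difference on two of the cleaned sentences shown earlier; word_counts is a hypothetical alternative feature, not something the notebook computes.

```python
# two lemmatized sentences copied from the sentences.head() output above
examples = ["oh dear", "shall late"]

# len() on a string counts characters, spaces included
char_lengths = [len(s) for s in examples]

# splitting on whitespace first gives the number of tokens instead
word_counts = [len(s.split()) for s in examples]

print(char_lengths)
print(word_counts)
```

The character lengths (7 and 10) match the last column of the bow_df output for rows 3 and 4.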

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(sentences["text"])
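# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
# add the character length of each lemmatized sentence as an extra feature
bow_df["sentence_length"] = sentence_length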
sentences = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)


In [96]:
bow_df

Unnamed: 0,1st,29th,abbreviation,abdication,abide,ability,able,abode,abominable,abominate,abroad,absence,absent,absolute,absolutely,abstraction,absurd,absurdity,abundance,abuse,abydos,accent,accept,acceptable,acceptance,accession,accident,accidental,accidentally,accommodate,accommodation,accompany,accomplish,accomplished,accomplishment,accord,accordingly,accost,account,accounting,...,wrap,wrapt,wreck,wretched,wretchedly,wretchedness,wriggle,wrinkle,wrist,write,writhing,writing,wrong,wrought,yard,yarmouth,yawn,ye,year,yearly,yell,yelp,yeoman,yer,yes,yesterday,yestermorn,yield,yielding,you,young,younker,youth,youthful,zeal,zealand,zealous,zealously,zigzag,5848
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,129
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,136
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5843,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,52
5844,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,49
5845,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,63
5846,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,114


In [97]:
bow_df.shape

(5848, 4866)

As you can see, we have a dataset in a form that is familiar from the rest of the bootcamp.

It's in tabular form:

 **observations sit on rows** and **we have our features as columns**.
 
More importantly, we converted text into a numerical form so that we can apply machine learning algorithms using these as input.
  
  **This enables us to move to the modeling phase** as demonstrated below

##**BoW in action**

Now let's give the bag of words features a whirl by trying some **machine learning algorithms**

In [99]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9424173318129989

Test set score: 0.8636752136752137
----------------------Random Forest Scores----------------------
Training set score: 0.9732041049030786

Test set score: 0.8581196581196581
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8443557582668187

Test set score: 0.8235042735042735


In [0]:
----------------------Logistic Regression Scores----------------------
Training set score: 0.9261687571265679

Test set score: 0.8641025641025641
----------------------Random Forest Scores----------------------
Training set score: 0.9723489167616876

Test set score: 0.832905982905983
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8275370581527937

Test set score: 0.8128205128205128

**It looks like logistic regression and random forest overfit.**

**Overfitting is a known problem when using bag of words**, **since it involves throwing a massive number of features at a model**: **some of those features** (in this case, word frequencies) **will capture noise in the training set**.

Since overfitting is also a known problem with random forests, the divergence between training score and test score is expected. On the other hand, gradient boosting's training and test scores are close to each other.
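The size of the train-test gap can be made explicit with a few lines of arithmetic. The sketch below copies the scores from the first results block above; treating the raw difference as an overfitting indicator is a rough heuristic, not a formal test.

```python
# training and test accuracies copied from the first set of scores above
scores = {
    "Logistic Regression": (0.9424173318129989, 0.8636752136752137),
    "Random Forest": (0.9732041049030786, 0.8581196581196581),
    "Gradient Boosting": (0.8443557582668187, 0.8235042735042735),
}

# the train-minus-test gap is a rough indicator of overfitting
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```

Random forest shows the widest gap (about 0.115), logistic regression sits in the middle (about 0.079), and gradient boosting has the narrowest (about 0.021).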

**//////////////////////////////////////////////////////////////**

#**N-grams: words in context**

Consider the word ‘vain’ in these two sentences:

“She labored in vain, the rock would not move.” 

“She was so vain, her bathroom mirror was covered in lip prints.”

In both sentences, ‘vain’ is an adjective. In sentence 1, it signals a lack of success. In sentence 2, the same word means vanity.

**Since the two usages can’t be distinguished by their part of speech, how can we tell them apart?**

**N-grams incorporate context information by creating features made up of a series of consecutive words**.

The ‘N’ refers to the number of words included in the series. For example, the 2-gram representation of sentence 1 would be:

(She labored), (labored in), (in vain), (vain the), (the rock), (rock would), (would not), (not move).

The 3-gram representation of sentence 2 would be:

(She was so), (was so vain), (so vain her), (vain her bathroom), (her bathroom mirror), (bathroom mirror was), (mirror was covered), (was covered in), (covered in lip), (in lip prints).

**Each of the word sets can then operate as its own feature**.

**N-grams can be used to create term-document matrices** (though it would now be ngram-document matrices), or used in topic modeling.

In addition, **n-grams are useful for text prediction**, **as they can be used to determine what words are most likely to follow in a sentence, phrase, or search query**.

**For a sentence with $X$ words, there will be $X-(N-1)$ n-grams**. For example, the 9-word sentence 1 yields $9-(2-1)=8$ bigrams, as listed above. 2-gram phrases are also called ‘bigrams,’ 3-gram phrases are called ‘trigrams,’ etc.
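The count formula can be checked with a small pure-Python sketch. The ngrams helper below is a hypothetical illustration (not a scikit-learn function), and the sentences are tokenized by simple whitespace splitting with punctuation already removed.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

sentence_1 = "She labored in vain the rock would not move".split()
sentence_2 = ("She was so vain her bathroom mirror was covered "
              "in lip prints").split()

bigrams = ngrams(sentence_1, 2)
trigrams = ngrams(sentence_2, 3)

print(bigrams[:3])
# compare the observed counts with X - (N - 1)
print(len(bigrams), len(sentence_1) - (2 - 1))
print(len(trigrams), len(sentence_2) - (3 - 1))
```

Sentence 1 has 9 words and gives 8 bigrams; sentence 2 has 12 words and gives 10 trigrams, in line with the formula.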

##**Why use words alone**

Given the benefits of incorporating word context for distinguishing between different meanings of a word, **why would any NLP practitioner worth their salt ever use simple word features**?

**Because models based on single words have several advantages**:

First, n-gram models are considerably more sparse than single-word models.

 The two ‘vain’ sentences above share three words (‘she’, ‘in’, ‘vain’) but zero n-grams of length two or more. Sparseness does mean that an n-gram model can be stored in a more memory-efficient way (for example, in a dict that only lists the n-grams that are present in each sentence rather than a set of columns with 1 if an n-gram is present and 0 otherwise). However, it also means that a larger corpus may be needed to detect any shared patterns across documents.
 
 **In other words, n-gram models may need more documents before they start to give good results**.
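The overlap between the two ‘vain’ sentences is easy to check directly. The sketch below uses a simplified tokenizer (lowercase, letters only, which is an assumption made for illustration) and compares the sets of single words and of bigrams.

```python
import re

def tokenize(sentence):
    # lowercase and keep alphabetic runs only (a simplifying assumption)
    return re.findall(r"[a-z]+", sentence.lower())

def bigram_set(tokens):
    # set of consecutive word pairs
    return set(zip(tokens, tokens[1:]))

s1 = tokenize("She labored in vain, the rock would not move.")
s2 = tokenize("She was so vain, her bathroom mirror was covered in lip prints.")

shared_words = set(s1) & set(s2)
shared_bigrams = bigram_set(s1) & bigram_set(s2)

print(sorted(shared_words))
print(shared_bigrams)
```

The single-word sets overlap while the bigram sets do not, which is the sparsity effect described above.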

**Second, single-word models are straightforward to implement**, **while models incorporating n-grams are more sensitive to fine distinctions of meaning.**

 **Which to choose depends on the goals of the NLP project and the trade-offs in time and performance for the specific corpus we are modeling**.

**2-gram example**

Implementing n-grams is quite easy using scikit-learn's CountVectorizer.

**All we need to do is pass a tuple (min_n, max_n) to CountVectorizer's ngram_range parameter**.

 In the code below, we set ngram_range=(2,2) inside CountVectorizer, so the vectorizer produces only 2-gram features. With ngram_range=(1,2), it would produce both 1-grams and 2-grams together; we leave that as an assignment for you.
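Before running this on the novels, the effect of ngram_range can be seen on a single toy sentence. This is a minimal sketch with an assumed input string; it reads the learned features from the vocabulary_ attribute, which maps each feature string to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_doc = ["she labored in vain"]  # assumed toy input

bigram_only = CountVectorizer(analyzer='word', ngram_range=(2, 2)).fit(toy_doc)
uni_and_bi = CountVectorizer(analyzer='word', ngram_range=(1, 2)).fit(toy_doc)

# vocabulary_ maps each feature string to its column index
bigram_features = sorted(bigram_only.vocabulary_)
mixed_features = sorted(uni_and_bi.vocabulary_)

print(bigram_features)
print(mixed_features)
```

With (2, 2) only the three bigrams become features; with (1, 2) the four single words are added alongside them.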

Let's generate our 2-grams and see what they look like:

In [0]:
# we'll use 2-grams
vectorizer = CountVectorizer(analyzer='word', ngram_range=(2,2))
X = vectorizer.fit_transform(sentences["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)
sentences.head()

Unnamed: 0,29th september,abbreviation living,abdication neighbour,abide consequence,abide figure,ability affection,ability awkwardness,ability difficulty,able attempt,able avail,able avoid,able bear,able convince,able devise,able eat,able far,able feign,able join,able judge,able leave,able letter,able live,able marry,able persuade,able regard,able remain,able return,able ring,able rise,able set,able shew,able speak,able tell,able trace,able turn,able watch,abominable promise,abominate young,abroad intention,abroad supposition,...,young old,young parson,young people,young person,young sister,young squire,young thoughtless,young woman,young young,youth beauty,youth bloom,youth early,youth father,youth fine,youth hardly,youth hope,youth jaw,youth kill,youth learn,youth like,youth mention,youth possibly,youth restore,youth say,youth spring,youth value,youth vigour,youthful infatuation,zeal business,zeal common,zeal dwell,zeal sport,zeal think,zealand australia,zealous officer,zealous subject,zealously discharge,zigzag go,text,author
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,shall late,Carroll


**As you see, our new features are 2-grams**. 

Next, let's build the same machine learning models that we built before for the 1-gram case, but this time using 2-grams as our features.

In [0]:
Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.895381984036488

Test set score: 0.782051282051282
----------------------Random Forest Scores----------------------
Training set score: 0.9304446978335233

Test set score: 0.6188034188034188
----------------------Gradient Boosting Scores----------------------
Training set score: 0.7528506271379704

Test set score: 0.7538461538461538


The results are worse than in the 1-gram case! The gap between training and test scores for logistic regression and random forest is also larger than before.

**That's because in the 2-gram case, we have more features than in the 1-gram case**.

**One possible way to improve the performance of the models is to use 1-grams and 2-grams together as features**. 

This will be one of your tasks in the assignments.

#**ASSIGNMENTS**
Your task is to increase the performance of the models we implemented in the BoW example. Suggested avenues of investigation include:

- Other modeling techniques and models
- Making more features that take advantage of the SpaCy information (include grammar, phrases, POS, etc)
- Making sentence-level features (number of words, amount of punctuation)
- Including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc)
- Or anything else your heart desires.

Compare your models' performances with those of the example.

In the 2-gram example above, we only used 2-grams as our features. This time, use both 1-gram and 2-gram features together as your feature set. Run the same models in the example and compare the results.
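As a possible starting point for the sentence-level features suggested above, here is a small sketch. The sentence_features helper is hypothetical; it works on raw sentence text, so it would need to run before stop words and punctuation are stripped.

```python
import string

def sentence_features(text):
    """Hypothetical helper: simple sentence-level features for raw text."""
    return {
        # number of whitespace-separated tokens
        "n_words": len(text.split()),
        # number of ASCII punctuation characters
        "n_punct": sum(ch in string.punctuation for ch in text),
    }

features = sentence_features("Oh dear! I shall be late!")
print(features)
```

The resulting values could be added as extra columns next to the BoW counts, in the same way sentence length is added in the notebook.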

##**TRY SVM**

In [0]:
Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

In [0]:
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [0]:
print("----------------------SVM Scores----------------------")
print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

----------------------SVM Scores----------------------
Training set score: 0.9327251995438997

Test set score: 0.8542735042735042


SOME OVERFITTING

**TRY RBF**

In [0]:
#Create a svm Classifier
rbf = svm.SVC(kernel='rbf') # RBF kernel

#Train the model using the training sets
rbf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = rbf.predict(X_test)

In [0]:
print("----------------------RBF Scores----------------------")
print('Training set score:', rbf.score(X_train, y_train))
print('\nTest set score:', rbf.score(X_test, y_test))

----------------------SVM Scores----------------------
Training set score: 0.8993728620296465

Test set score: 0.8277777777777777


**NOT ANY BETTER**

TRY AGAIN WITH C=10

In [0]:
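# refit the RBF model with C=10 before scoring
rbf = svm.SVC(kernel='rbf', C=10)
rbf.fit(X_train, y_train)

print("----------------------RBF Scores----------------------")
print('Training set score:', rbf.score(X_train, y_train))
print('\nTest set score:', rbf.score(X_test, y_test))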

----------------------RBF Scores----------------------
Training set score: 0.9689281641961232

Test set score: 0.8196581196581196


**MUCH WORSE (MORE OVERFITTING)**

**REGULARIZATION STRENGTH IS INVERSELY PROPORTIONAL TO C**
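The relationship between C and regularization can be seen on a small synthetic problem. The sketch below is only an illustration under stated assumptions: it uses make_classification data rather than the novels, and a linear kernel so that the learned weights are available through coef_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

weight_norms = {}
for C in (0.01, 10):
    model = SVC(kernel='linear', C=C).fit(X_toy, y_toy)
    # coef_ is only available for the linear kernel
    weight_norms[C] = np.linalg.norm(model.coef_)

print(weight_norms)
```

A small C penalizes large weights more heavily, so the weight norm shrinks; a large C lets the model fit the training data more closely, which is why raising C tends to increase overfitting.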

TRY WITH C=.1

In [0]:
#Create a SVM Classifier
from sklearn import svm
rbf = svm.SVC(kernel='rbf',C=.1) # rbf KERNEL

#Train the model using the training sets
rbf.fit(X_train, y_train)

#Predict the response for test dataset

y_pred = rbf.predict(X_test)

In [16]:
print("----------------------RBF Scores----------------------")
print('Training set score:', rbf.score(X_train, y_train))
print('\nTest set score:', rbf.score(X_test, y_test))

----------------------RBF Scores----------------------
Training set score: 0.7397377423033067

Test set score: 0.7423076923076923


**OVERFITTING FIXED**

**SCORE IS MEDIOCRE**

**WILL ADD FEATURE(S)**

In [0]:
from sklearn.pipeline import FeatureUnion

**THIS FEATURE HAS ALREADY BEEN ADDED -- SENTENCE LENGTH**

In [0]:
sentence_length=pd.Series([len(sentence) for sentence in sentences['text']])

**LET'S SEE IF IT IMPROVES RBF SCORE**

In [0]:
#Create a svm Classifier
rbf = svm.SVC(kernel='rbf') # RBF kernel

#Train the model using the training sets
rbf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = rbf.predict(X_test)

In [102]:
print("----------------------RBF Scores----------------------")
print('Training set score:', rbf.score(X_train, y_train))
print('\nTest set score:', rbf.score(X_test, y_test))

----------------------RBF Scores----------------------
Training set score: 0.6590649942987458

Test set score: 0.6675213675213675


**MUCH WORSE**

**TRY WITH LINEAR KERNEL**

In [0]:
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [104]:
print("----------------------SVM Scores----------------------")
print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

----------------------SVM Scores----------------------
Training set score: 0.9538198403648803

Test set score: 0.8572649572649572


**ACCURACY IS UP -- STILL OVERFITTING**

**CHANGE C TO REDUCE OVERFITTING**

In [0]:
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear',C=.5) # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [108]:
print("----------------------SVM Scores----------------------")
print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

----------------------SVM Scores----------------------
Training set score: 0.9293044469783353

Test set score: 0.8542735042735042


**DID NOT HELP**


**TRY C=2**

In [0]:
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear',C=2) # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [110]:
print("----------------------SVM Scores----------------------")
print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

----------------------SVM Scores----------------------
Training set score: 0.9569555302166477

Test set score: 0.8508547008547008


**UGH**

##**N-GRAMS**

WE'LL USE 1-GRAM AND 2-GRAM TOGETHER

In [115]:
# we'll use 1-grams and 2-grams together
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2))
X = vectorizer.fit_transform(sentences["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)
sentences.head()

Unnamed: 0,1st,29th,29th september,abbreviation,abbreviation living,abdication,abdication neighbour,abide,abide consequence,abide figure,ability,ability affection,ability awkwardness,ability difficulty,able,able attempt,able avail,able avoid,able bear,able convince,able devise,able eat,able far,able feign,able join,able judge,able leave,able letter,able live,able marry,able persuade,able regard,able remain,able return,able ring,able rise,able set,able shew,able speak,able tell,...,young young,younker,youth,youth beauty,youth bloom,youth early,youth father,youth fine,youth hardly,youth hope,youth jaw,youth kill,youth learn,youth like,youth mention,youth possibly,youth restore,youth say,youth spring,youth value,youth vigour,youthful,youthful infatuation,zeal,zeal business,zeal common,zeal dwell,zeal sport,zeal think,zealand,zealand australia,zealous,zealous officer,zealous subject,zealously,zealously discharge,zigzag,zigzag go,text,author
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,shall late,Carroll


In [116]:
Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9444127708095781

Test set score: 0.8662393162393163
----------------------Random Forest Scores----------------------
Training set score: 0.9723489167616876

Test set score: 0.8350427350427351
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8292474344355758

Test set score: 0.8158119658119658


**IMPROVEMENT OVER 2-GRAM ALONE**