# Notebook for Sentiment Classifiers

This notebook contains the code to train a sentiment classifier for the selected data set.

We randomly selected 2000 sentences from the body texts of the downloaded articles as the data set. After manually labelling the sentences as *positive*, *negative*, or *neutral*, we randomly selected 80% of the sentences in the data set as training data to train 8 different kinds of classification algorithms. We then compared the algorithms' performances to select the best classifier for our data set.


The notebook consists of 4 main sections:
1. [Construct the data set](#section_1)
2. [Prepare the data for modeling](#section_2)
3. [Classifier selection](#section_3)
> - 3.1. [K-Nearest Neighbours](#section_3.1)
> - 3.2. [Logistic Regression](#section_3.2)
> - 3.3. [Multinomial Naive Bayes](#section_3.3)
> - 3.4. [Decision Tree](#section_3.4)
> - 3.5. [Random Forest](#section_3.5)
> - 3.6. [Linear Support Vector Machines](#section_3.6)
> - 3.7. [Kernelized Support Vector Machines](#section_3.7)
> - 3.8. [Neural Network](#section_3.8)
4. [Test the selected model](#section_4)


By comparing their performances in terms of test accuracy and other metrics, we selected `Random Forest` as the best classification algorithm for our data set, and tested it on 10 new sentences.


## <a id="section_1">1. Construct the data set</a>

To construct a data set for training the sentiment classifier, we first split the body text of every downloaded article into sentences and collected all of them in one large list. Then we randomly selected 2000 sentences from the list to be used as the data set to train the classifiers.

In [1]:
import pandas as pd

# Load the html_metadata.csv as df
df = pd.read_csv("outcome/html_metadata.csv", sep="\t")

# Explore the number of rows and columns of the dataframe
df.shape

(8964, 5)

In [3]:
import nltk
from nltk.tokenize import sent_tokenize

# Create a new dataframe that contains only the body_text of each article
# (.copy() makes it independent of df and avoids pandas' SettingWithCopyWarning)
text = df[["body_text"]].copy()

# Tokenize the body text into sentences
text["body_text"] = text["body_text"].astype(str)
text["sentences"] = text["body_text"].apply(sent_tokenize)


In [4]:
# Concatenate all sentences from each article into a list
all_sentences = [sentence for sublist in text["sentences"] for sentence in sublist]
all_sentences

['Peas, in and out of the pod     The pea plant (Pisum sativum) is a member of the the Legume family (Fabaceae), which also contains beans, peanuts and alfalfa.',
 'Peas have been consumed in the Mediterranean and Middle East for at least 6,000 years, first being cultivated in Turkey 3,000 years ago.',
 'These plants are loved by farmers and gardeners for their ability to add nitrogen to the soil, thus enriching it for other crops.',
 'Today there are several subspecies of peas, including garden peas, snow peas, snap peas and field peas.',
 'Advertisement          Garden Pea            A dish of steamed garden peas     Garden peas (Pisum sativum var.',
 'sativum), also known as English peas, are the most common kind of pea plant.',
 'Like all other peas, garden pea plants are twining vines 1 to 1.5 feet in length, covered with tiny tendrils.',
 'They bear bright green, elongated oval-shaped leaves and five-petaled, irregular pink or white flowers which give way long green pods containi

We noticed that many sentences are not meaningful for sentiment analysis; most of them are advertisements or image credits. Therefore, we removed those sentences from the list and, consequently, from the analysis.

In [5]:
# Filter out non-meaningful sentences (image credits and advertisements) from the list
sentences = [
    sentence
    for sentence in all_sentences
    if not sentence.startswith(("Image Credit:", "Advertisement"))
]

From the cleaned sentence list, we randomly selected 2000 sentences and stored them in a dataframe as **labeled_sentences.csv** in the **outcome** folder. 

We created an empty column titled `sentiment` in the dataframe, where we can manually label the sentiment of each sentence. 

In [6]:
# Randomly select 2000 sentences
import random
sample_size = 2000
random.seed(42)

sample_sentences = random.sample(sentences, sample_size)

# Transform the list into a dataframe
df_sample = pd.DataFrame(sample_sentences, columns=["sentence"])
# Create the column of sentiment with no values
df_sample["sentiment"] = ""


In [14]:
df_sample

Unnamed: 0,sentence,sentiment
0,If the adhesive is not soft enough to remove w...,
1,Add 2 drops of biodegradable dishwashing liqui...,
2,Press the primer bulb back into the mount and ...,
3,Doing so will decrease the amount of fruit pro...,
4,Ensure the edges are aligned with the edges of...,
...,...,...
1995,There's also a shock-guard bumper that absorbs...,
1996,Let the poultice sit on the stain for 24 to 48...,
1997,Super glue only requires a small dollop and dr...,
1998,Reattach the screws.,


In [7]:
# Save the dataframe as a .csv file
df_sample.to_csv("outcome/labeled_sentences.csv")

## <a id="section_2">2. Prepare data for modeling</a>

Before this step, we downloaded the file **labeled_sentences.csv** to our local drive and manually labeled the sentiment of each sentence as *positive*, *neutral*, or *negative*.

It should be noted that the website __[Home Sweet Home](https://www.ehow.com/home-sweet-home)__ mainly consists of articles about home improvement. Therefore, most of our sentences are (i) imperatives that give plain instructions or (ii) declaratives that objectively describe a method or product, all of which would normally be labeled as *neutral*.

For the purpose of the classification task, however, we approached the labelling with a slightly different strategy, so that we could obtain a more balanced distribution among the three sentiments:
- **Label plain instructions (imperative sentences) as *neutral*.**<br>
e.g. *Reattach the screws.* <br>
e.g. *Slightly overlap each session every time you start a new one.*<br>
<br/>

- **Label declarative sentences that describe the features or characteristics of a product/method that are not necessarily good or bad as *neutral*.** <br>
e.g. *Found primarily in the South, these trees are distinctive for their long needles and seed cones.* <br>
e.g. *A nontoxic latex undercoating is now used ton metal sinks and has been used since the late '70s.*

<br/>

- **For questions, missing values, or unparsable phrases, label them as *neutral*.** <br>
e.g. *dry whiting, bolted20 lbs.* <br>
e.g. *Motion sensor?* <br>

<br/>

- **Label declarative sentences that describe the advantages of a product/method as *positive*.** <br>
e.g. *A swimming pool is an amenity that can bring hours of family enjoyment.* <br>
e.g. *Three round feet on the bottom help balance and keep the roll holder in place.*

<br/>

- **Label declarative sentences that describe the disadvantages of a product/method as *negative*.** <br>
e.g. *Others, including mosquitoes, fleas and sticks, will seek you out as a source food.* <br>
e.g. *Washing windows takes a long time, especially if you have a second story.* <br>

<br/>


- **For sentences that have two clauses, one negative and one positive, label the sentence with the sentiment of the clause that carries the focus.** <br>
e.g. *While PVC is very durable, it does occasionally get stained.*  -- negative <br>
e.g. *Even though the design of this Crate & Barrel fireplace screen is stunning to look at, its minimalism design makes it the perfect addition to any mid-century modern style home.* -- positive <br>

<br/> 

- **For rhetorical questions, label them by the meaning intended by the author.** <br>
e.g. *But did you know you can make natural fertilizers from everyday kitchen waste?* -- positive <br>

<br/>

After manually labelling all sentences, we uploaded the labelled data set to the **outcome** folder as **labeled_sentences_1.csv** for analysis.

In [18]:
# Load the data set with the 2000 labelled sentences
df = pd.read_csv("outcome/labeled_sentences_1.csv")
df = df[["sentence", "sentiment"]]
df

Unnamed: 0,sentence,sentiment
0,If the adhesive is not soft enough to remove w...,u
1,Add 2 drops of biodegradable dishwashing liqui...,u
2,Press the primer bulb back into the mount and ...,u
3,Doing so will decrease the amount of fruit pro...,n
4,Ensure the edges are aligned with the edges of...,u
...,...,...
1996,Let the poultice sit on the stain for 24 to 48...,u
1997,Super glue only requires a small dollop and dr...,u
1998,Reattach the screws.,u
1999,Place a second box fan in another window with ...,u


Count the number of each label in the data set.


In [26]:
p = len(df[df.sentiment=="p"])
n = len(df[df.sentiment=="n"])
u = len(df[df.sentiment=="u"])

print(f"The number of positive cases: {p}, negative cases: {n}, neutral cases: {u}.")


The number of positive cases: 352, negative cases: 189, neutral cases: 1459.


Although we employed an alternative labelling strategy, our data set is still heavily skewed towards *neutral* (about 73% of the sentences) because of the nature of the articles. We believe this will greatly influence the performance of the classification algorithms, especially on new sentences.
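
To put the imbalance in numbers, the following sketch takes the label counts printed above and computes each label's share, together with the accuracy of a trivial classifier that always predicts the most frequent label. The variable names are illustrative and not part of the notebook's own cells.

```python
# Label counts copied from the output above
counts = {"p": 352, "n": 189, "u": 1459}
total = sum(counts.values())

# Share of each label in the data set
shares = {label: count / total for label, count in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent label
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.3f}")
print(f"Majority label: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

A classifier therefore needs an accuracy clearly above roughly 0.73 on this data set before it can be said to have learned more than "predict *neutral*".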

Another mistake we noticed is that we forgot to filter out sentences with missing values before selecting the 2000 sentences. Therefore, we need to remove them now, before transforming the sentences into the form required for modelling.

In [27]:
# Check the missing values in the sentence values of the data set
for i in df["sentence"]:
    if pd.isna(i):
        print(i)


nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan


In total, 13 sentences in our data set are missing. Although it would have been better to remove them before sampling (so that we would still have 2000 sentences to work with), relabelling 2000 sentences is too time-consuming.

However, losing only 13 sentences should not affect the accuracy of the models much.

In [28]:
# Collect the row indices whose sentence is missing
miss_list = []
for i in range(len(df)):
    if pd.isna(df["sentence"][i]):
        miss_list.append(i)
print(miss_list)

# Drop those rows from the data set
df = df.drop(df.index[miss_list])

df

[188, 242, 330, 373, 405, 733, 1293, 1298, 1411, 1437, 1766, 1857, 2000]


Unnamed: 0,sentence,sentiment
0,If the adhesive is not soft enough to remove w...,u
1,Add 2 drops of biodegradable dishwashing liqui...,u
2,Press the primer bulb back into the mount and ...,u
3,Doing so will decrease the amount of fruit pro...,n
4,Ensure the edges are aligned with the edges of...,u
...,...,...
1995,There's also a shock-guard bumper that absorbs...,p
1996,Let the poultice sit on the stain for 24 to 48...,u
1997,Super glue only requires a small dollop and dr...,u
1998,Reattach the screws.,u


Then, we prepared the data for modelling. 

The target value y is the labeled sentiment of each sentence, and the predictors are a $1988 \times 5890$ TF-IDF matrix X (one row per sentence, one column per vocabulary term).
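
To illustrate how such a matrix is built, here is a minimal pure-Python sketch of TF-IDF weighting on a toy corpus. It assumes the smoothed IDF formula and L2 row normalisation documented for scikit-learn's default settings, and it deliberately ignores the stop-word removal, `max_df` filtering, and tokenization rules that `TfidfVectorizer` also applies. The toy sentences and all names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, already lowercased and split on whitespace)
docs = [
    "reattach the screws".split(),
    "remove the screws carefully".split(),
    "pools bring family enjoyment".split(),
]

n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Return the L2-normalised TF-IDF vector of one tokenized document."""
    term_counts = Counter(doc)
    weights = [term_counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

X_toy = [tfidf_row(doc) for doc in docs]
print(len(X_toy), len(vocabulary))  # rows = sentences, columns = vocabulary terms
```

Terms that occur in many sentences (here "the") receive a lower weight than terms that occur in only one, which is the property the classifiers rely on.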

In [34]:
y = df.sentiment

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.7)
X = vectorizer.fit_transform(df.sentence)
X

<1988x5890 sparse matrix of type '<class 'numpy.float64'>'
	with 18526 stored elements in Compressed Sparse Row format>

In [35]:
X.shape, y.shape

((1988, 5890), (1988,))

We then randomly split up the data set, used 80% of the data set as the training data to fit the models, and the remaining 20% as test data to evaluate the models' performances. 
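
As a quick sanity check on the split sizes, the arithmetic can be sketched as follows. It assumes that the test share is rounded up and the training set receives the remainder, which is consistent with the shapes this notebook reports for the split.

```python
import math

n_samples = 1988   # rows left after dropping the missing sentences
test_size = 0.2

# The test share is rounded up; the training set gets the remainder
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
```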

In [36]:
# Split the data set into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [12]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1590, 5890), (1590,), (398, 5890), (398,))

## <a id="section_3">3. Classifier selection</a>

We trained 8 different classifiers on our training data. For models that can be fitted with various parameters, we used 5-fold cross-validation to select the best parameter. 
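
The idea behind 5-fold cross-validation can be sketched in a few lines of plain Python. This is a simplified illustration only: the helper `k_fold_indices` is hypothetical, it splits the rows in order without shuffling, and it ignores the stratification by label that scikit-learn applies by default when cross-validating a classifier.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for plain, unshuffled k-fold splitting."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# 1590 is the size of the training set in this notebook
folds = list(k_fold_indices(1590, n_folds=5))
for train, validation in folds:
    print(len(train), len(validation))
```

Each model is fit five times, each time on four folds and scored on the remaining one, so every training sentence is used for validation exactly once.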

For each classifier, we recorded its test accuracy, precision, recall, f1-score, and running time in order to select the best model.

In [37]:
from sklearn.metrics import confusion_matrix, classification_report

### <a id="section_3.1">3.1. K-Nearest Neighbors (K-NNs)</a> 

The K-NN model requires us to select the best K.

Cross-validation results suggest the model performs best when K = 10.

In [38]:
from sklearn.neighbors import KNeighborsClassifier 

# Cross-validate to select the best model
from sklearn.model_selection import cross_val_score
score_max = 0                      # score_max is a temporary variable to store the max score

for param in [1, 3, 10, 30]:
    model = KNeighborsClassifier(n_neighbors=param)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"k = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
       
    # updating the temporary variable param_best if the param variable yields a better performance
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param         

# Print the parameter with the best performance 
print(f"Highest score : {round(score_max, 3)} when k = {param_best}")



k = 1: [0.7327044  0.72641509 0.72641509 0.72327044 0.72955975]
0.728, 0.003

k = 3: [0.72641509 0.71698113 0.72641509 0.71698113 0.72327044]
0.722, 0.004

k = 10: [0.72012579 0.71698113 0.74842767 0.73899371 0.72641509]
0.73, 0.012

k = 30: [0.72955975 0.72641509 0.72955975 0.72955975 0.72327044]
0.728, 0.003

Highest score : 0.73 when k = 10
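

The selection can be re-checked from the fold accuracies printed above. The sketch below, which uses only the standard library and illustrative variable names, recomputes the best k and compares the gap between the best and worst mean accuracy with the fold-to-fold spread of the winning setting.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross-validation output above
cv_scores = {
    1: [0.7327044, 0.72641509, 0.72641509, 0.72327044, 0.72955975],
    3: [0.72641509, 0.71698113, 0.72641509, 0.71698113, 0.72327044],
    10: [0.72012579, 0.71698113, 0.74842767, 0.73899371, 0.72641509],
    30: [0.72955975, 0.72641509, 0.72955975, 0.72955975, 0.72327044],
}

# Pick the k with the highest mean fold accuracy
best_k = max(cv_scores, key=lambda k: mean(cv_scores[k]))

# Gap between the best and the worst mean, compared with the fold-to-fold spread
means = {k: mean(scores) for k, scores in cv_scores.items()}
gap = max(means.values()) - min(means.values())
spread_best = pstdev(cv_scores[best_k])  # population std, the same convention as numpy's default

print(best_k, round(gap, 3), round(spread_best, 3))
```

The gap between the best and worst setting is smaller than the standard deviation across folds for k = 10, so the preference for k = 10 is only weakly supported by these results.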




A K-NN (K=10) model was fit to the training data and tested on the test data. 

In [39]:
# Helper that fits a classifier and reports its performance on the training and test data
def train_test(X_train, X_test, y_train, y_test, cls): # cls is the model to be fit
    cls.fit(X_train, y_train)
    
    y_true = y_test
    y_pred = cls.predict(X_test)
    
    print(f"Train accuracy score: {round(cls.score(X_train, y_train), 3)}")
    print(f"Test accuracy score: {round(cls.score(X_test, y_test), 3)}\n")
    print(confusion_matrix(y_true, y_pred))
    print()
    print(classification_report(y_true, y_pred, zero_division=0))

In [40]:
# fit the knn with best parameter to the training data and test it on test data
print(f"k = {param_best}")
knc = KNeighborsClassifier(n_neighbors=param_best) # initiate the knn model
%time train_test(X_train, X_test, y_train, y_test, knc)

k = 10
Train accuracy score: 0.745
Test accuracy score: 0.731

[[  0   0  34]
 [  0   0  70]
 [  1   2 291]]

              precision    recall  f1-score   support

           n       0.00      0.00      0.00        34
           p       0.00      0.00      0.00        70
           u       0.74      0.99      0.84       294

    accuracy                           0.73       398
   macro avg       0.25      0.33      0.28       398
weighted avg       0.54      0.73      0.62       398

CPU times: user 123 ms, sys: 21.4 ms, total: 145 ms
Wall time: 138 ms




The precision, recall, and f1-score for *negative* and *positive* sentences in the test data are all zero: the model labeled almost every test sentence as *neutral*. This is arguably because our data set is skewed towards *neutral* and does not contain enough *positive* and *negative* examples for the classifier to be properly trained.
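
The figures in the classification report follow directly from the confusion matrix. As a minimal sketch, the per-class metrics of the K-NN model can be recomputed by hand from the matrix printed above; the helper `per_class_metrics` is hypothetical and written only for this illustration.

```python
# Confusion matrix of the K-NN model, copied from the output above
# (rows = true labels, columns = predicted labels, both in the order n, p, u)
labels = ["n", "p", "u"]
cm = [
    [0, 0, 34],
    [0, 0, 70],
    [1, 2, 291],
]

def per_class_metrics(cm, index):
    """Return (precision, recall, f1) for the class at the given row/column index."""
    true_pos = cm[index][index]
    predicted = sum(row[index] for row in cm)  # column sum: predicted as this class
    actual = sum(cm[index])                    # row sum: truly this class
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

metrics = {label: per_class_metrics(cm, i) for i, label in enumerate(labels)}
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The zero rows for *negative* and *positive* arise because none of the 104 sentences with those labels was predicted correctly.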

The overall test accuracy of the K-NN model is 0.73. We store the K-NN test accuracy in the dictionary `summary` for further comparison.

In [41]:
# store the KNN performance into summary
summary = {}
summary["k-NNs"] = round(knc.score(X_test, y_test), 3)




### <a id="section_3.2">3.2. Logistic Regression </a> 

We used the logistic regression model with its default settings, so there is no parameter to select.

Cross-validation shows that the average accuracy of the model on the training data is 0.725.

In [42]:
from sklearn.linear_model import LogisticRegression
# initiate the lr
lr = LogisticRegression(random_state=0)

# cross-validate to see the average performance
scores = cross_val_score(lr, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.72955975 0.72641509 0.72955975 0.72012579 0.71698113]
0.725, 0.005


We then trained the logistic regression model on the whole training data and tested on the test data.

In [43]:
# fit the model to the training data and test it with test data
%time train_test(X_train, X_test, y_train, y_test, lr)

Train accuracy score: 0.79
Test accuracy score: 0.739

[[  0   0  34]
 [  0   1  69]
 [  1   0 293]]

              precision    recall  f1-score   support

           n       0.00      0.00      0.00        34
           p       1.00      0.01      0.03        70
           u       0.74      1.00      0.85       294

    accuracy                           0.74       398
   macro avg       0.58      0.34      0.29       398
weighted avg       0.72      0.74      0.63       398

CPU times: user 5.19 s, sys: 26.2 s, total: 31.4 s
Wall time: 3.95 s


The recall and f1-score for *negative* and *positive* sentences in the test data are very low, arguably because our data set is skewed towards *neutral* and does not contain enough *positive* and *negative* examples for the classifier to be properly trained. The precision of 1.00 for *positive* rests on a single correctly predicted sentence.

The overall test accuracy of the logistic regression model is 0.74, slightly better than the K-NN model. However, it also took longer than the K-NN model to fit.

We store the logistic regression test accuracy in the dictionary `summary` for further comparison.

In [23]:
# store the test accuracy in summary
summary["Logistic Regression"] = round(lr.score(X_test, y_test), 3)

### <a id="section_3.3">3.3. Multinomial Naive Bayes </a> 

We used the Multinomial Naive Bayes model with its default settings, so there is no parameter to select.

Cross-validation shows that the average accuracy of the model on the training data is 0.725.

In [45]:
from sklearn.naive_bayes import MultinomialNB
# initiate the multinomial naive bayes
mnb = MultinomialNB()

# cross-validation to see the average performance
scores = cross_val_score(mnb, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002


We then fit the model to the whole training data and tested its performance on the test data.

In [47]:
# Train the model on the whole training data and test it on test data
%time train_test(X_train, X_test, y_train, y_test, mnb)

Train accuracy score: 0.747
Test accuracy score: 0.739

[[  0   0  34]
 [  0   0  70]
 [  0   0 294]]

              precision    recall  f1-score   support

           n       0.00      0.00      0.00        34
           p       0.00      0.00      0.00        70
           u       0.74      1.00      0.85       294

    accuracy                           0.74       398
   macro avg       0.25      0.33      0.28       398
weighted avg       0.55      0.74      0.63       398

CPU times: user 13.6 ms, sys: 1.14 ms, total: 14.7 ms
Wall time: 14 ms


The precision, recall, and f1-score for *negative* and *positive* sentences in the test data are all zero because the model labeled every test sentence as *neutral*. This is arguably because our data set is skewed towards *neutral* and does not contain enough *positive* and *negative* examples for the classifier to be properly trained.

The overall test accuracy of the Multinomial Naive Bayes model is 0.74, better than the K-NN model and identical to logistic regression. However, it took far less time to fit than logistic regression.

We store the Multinomial Naive Bayes test accuracy in the dictionary `summary` for further comparison.

In [49]:
# store the test accuracy
summary["Multinomial Naive Bayes"] = round(mnb.score(X_test, y_test), 3)


### <a id="section_3.4">3.4. Decision Tree </a> 

We used the Decision Tree model with its default settings, so there is no parameter to select.

Cross-validation shows that the average accuracy of the model on the training data is 0.695.

In [50]:
from sklearn.tree import DecisionTreeClassifier
# initiate the model
dtc = DecisionTreeClassifier(random_state=0)

# cross-validate to see the average performance
scores = cross_val_score(dtc, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.70125786 0.6918239  0.68867925 0.70440252 0.68867925]
0.695, 0.007


We then fit the model to the whole training data and tested its performance on the test data.

In [51]:
# fit the model to the whole training data and test it on test data
%time train_test(X_train, X_test, y_train, y_test, dtc)

Train accuracy score: 1.0
Test accuracy score: 0.698

[[ 11   5  18]
 [  6  21  43]
 [ 23  25 246]]

              precision    recall  f1-score   support

           n       0.28      0.32      0.30        34
           p       0.41      0.30      0.35        70
           u       0.80      0.84      0.82       294

    accuracy                           0.70       398
   macro avg       0.50      0.49      0.49       398
weighted avg       0.69      0.70      0.69       398

CPU times: user 206 ms, sys: 13.4 ms, total: 220 ms
Wall time: 219 ms


The precision, recall, and f1-score for *negative* and *positive* sentences in the test data are better than those of the previous models, but still low, arguably because our data set is skewed towards *neutral* and does not contain enough *positive* and *negative* examples for the classifier to be properly trained. The recall and f1-score for *neutral* sentences are, however, lower than those of the previous models. The training accuracy of 1.0 against a test accuracy of 0.698 also indicates that the tree overfits the training data.

The overall test accuracy of the Decision Tree model is 0.70, the worst among the models compared so far.

We store the Decision Tree test accuracy in the dictionary `summary` for further comparison.

In [63]:
# record the test accuracy 
summary["Decision Trees"] = round(dtc.score(X_test, y_test), 3)

### <a id="section_3.5">3.5. Random Forest </a> 

We used the Random Forest model with its default settings, so there is no parameter to select.

Cross-validation shows that the average accuracy of the model on the training data is 0.75.

In [53]:
from sklearn.ensemble import RandomForestClassifier
# initiate the model
rfc = RandomForestClassifier(random_state=0)

# cross-validate to see the average performance
scores = cross_val_score(rfc, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.75157233 0.74528302 0.74528302 0.75471698 0.75471698]
0.75, 0.004


We then fit the model to the whole training data and tested its performance on the test data.

In [54]:
# train the model on the training data and test its performance on the test data
%time train_test(X_train, X_test, y_train, y_test, rfc)

Train accuracy score: 1.0
Test accuracy score: 0.751

[[  3   1  30]
 [  1   7  62]
 [  1   4 289]]

              precision    recall  f1-score   support

           n       0.60      0.09      0.15        34
           p       0.58      0.10      0.17        70
           u       0.76      0.98      0.86       294

    accuracy                           0.75       398
   macro avg       0.65      0.39      0.39       398
weighted avg       0.71      0.75      0.68       398

CPU times: user 1.67 s, sys: 107 ms, total: 1.77 s
Wall time: 1.77 s


Random Forest had fairly balanced precision across the three sentiments; however, recall and f1-score were still much higher for *neutral* sentences than for the other two.

The overall test accuracy of the Random Forest model is 0.75, the best among the models constructed so far. On the other hand, this model takes longer to fit than most of the previous models.

We store the Random Forest test accuracy in the dictionary `summary` for further comparison.

In [55]:
summary["Random Forest"] = round(rfc.score(X_test, y_test), 3)

### <a id="section_3.6">3.6. Linear Support Vector Machines (SVMs)</a>

A linear SVM requires us to select the best value of the regularization parameter C.

Cross-validation results suggest the model performs best when C = 1.


In [56]:
from sklearn.svm import LinearSVC

# Cross-validate the SVM to select the best C

score_max = 0

for param in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    model = LinearSVC(C=param, random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"C = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print(f"Highest score : {round(score_max, 3)} when C = {param_best}")

C = 0.01: [0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002

C = 0.03: [0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002

C = 0.1: [0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002

C = 0.3: [0.72955975 0.72641509 0.7327044  0.7327044  0.72955975]
0.73, 0.002

C = 1: [0.74213836 0.7327044  0.73899371 0.73584906 0.73899371]
0.738, 0.003

C = 3: [0.72641509 0.73584906 0.75157233 0.73899371 0.72327044]
0.735, 0.01

C = 10: [0.72641509 0.74528302 0.74528302 0.7327044  0.71383648]
0.733, 0.012

Highest score : 0.738 when C = 1


A linear SVM with C = 1 was fit to the training data and tested on the test data.

In [57]:
print(f"C = {param_best}")
lsvc = LinearSVC(C=param_best) # initiate the SVM
# fit the model and test it on test data
%time train_test(X_train, X_test, y_train, y_test, lsvc)

C = 1
Train accuracy score: 1.0
Test accuracy score: 0.744

[[  5   2  27]
 [  1   5  64]
 [  4   4 286]]

              precision    recall  f1-score   support

           n       0.50      0.15      0.23        34
           p       0.45      0.07      0.12        70
           u       0.76      0.97      0.85       294

    accuracy                           0.74       398
   macro avg       0.57      0.40      0.40       398
weighted avg       0.68      0.74      0.67       398

CPU times: user 18.3 ms, sys: 2.81 ms, total: 21.1 ms
Wall time: 19.6 ms


The linear SVM had fairly balanced precision across the three sentiments; however, recall and f1-score were still much higher for *neutral* sentences than for the other two.

The overall test accuracy of the linear SVM is 0.74, the second best among the models constructed so far. Compared to the currently best model (Random Forest), the linear SVM is also much faster to fit.

We store the linear SVM test accuracy in the dictionary `summary` for further comparison.

In [60]:
# record the test accuracy
summary["Linear SVMs"] = round(lsvc.score(X_test, y_test), 3)

### <a id="section_3.7"> 3.7. Kernelized Support Vector Machines (KSVMs)</a>

A kernelized SVM requires us to select the best C parameter.

Cross-validation results suggest the model performs best when C = 1, although C = 3 and C = 10 reach practically the same average accuracy (0.726).

In [61]:
from sklearn.svm import SVC

# Cross-validate to select the best C

score_max = 0

for param in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    model = SVC(C=param, kernel="rbf", gamma="scale", random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"C = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print(f"Highest score : {round(score_max, 3)} when C = {param_best}")

C = 0.01: [0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002

C = 0.03: [0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002

C = 0.1: [0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002

C = 0.3: [0.72641509 0.72641509 0.72641509 0.72327044 0.72327044]
0.725, 0.002

C = 1: [0.72955975 0.72641509 0.72641509 0.72327044 0.72641509]
0.726, 0.002

C = 3: [0.72955975 0.72327044 0.72641509 0.72955975 0.72327044]
0.726, 0.003

C = 10: [0.72955975 0.72327044 0.72641509 0.72955975 0.72327044]
0.726, 0.003

Highest score : 0.726 when C = 1


A kernelized SVM with C = 1 was fit to the training data and tested on the test data.

In [62]:
print(f"C = {param_best}")
# initiate the KSVM
svc = SVC(C=param_best)
# fit the model to the training data and test it on test data
%time train_test(X_train, X_test, y_train, y_test, svc)

C = 1
Train accuracy score: 0.947
Test accuracy score: 0.741

[[  0   0  34]
 [  0   1  69]
 [  0   0 294]]

              precision    recall  f1-score   support

           n       0.00      0.00      0.00        34
           p       1.00      0.01      0.03        70
           u       0.74      1.00      0.85       294

    accuracy                           0.74       398
   macro avg       0.58      0.34      0.29       398
weighted avg       0.72      0.74      0.63       398

CPU times: user 709 ms, sys: 42.3 ms, total: 751 ms
Wall time: 749 ms


All metrics for identifying *negative* sentences are zero.

The precision for identifying *positive* sentences is at ceiling (1.00); however, the recall and f1-score are very low. Although the one sentence predicted as *positive* is indeed *positive*, the remaining 69 *positive* sentences were not detected by the model.
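
For reference, the *positive* row of the report can be worked out from the confusion matrix above, where one *positive* sentence was predicted correctly ($TP = 1$), no other sentence was predicted as *positive* ($FP = 0$), and 69 *positive* sentences were missed ($FN = 69$):

$$
\text{precision}_{p} = \frac{TP}{TP + FP} = \frac{1}{1 + 0} = 1.00, \qquad
\text{recall}_{p} = \frac{TP}{TP + FN} = \frac{1}{1 + 69} \approx 0.01, \qquad
F_{1,p} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2}{71} \approx 0.03
$$

A perfect precision is therefore compatible with a classifier that almost never predicts the class.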

The model performed moderately well at identifying *neutral* sentences.

The overall test accuracy of the kernelized SVM is 0.74. We store the test accuracy in the dictionary `summary` for further comparison.

In [64]:
summary["Kernelized SVMs"] = round(svc.score(X_test, y_test), 3)

### <a id="section_3.8"> 3.8. Neural Networks </a>

The neural network model requires us to select the best hidden layer size.

Cross-validation results suggest the model performs best with a single hidden layer of 100 units.
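
In scikit-learn, `hidden_layer_sizes=(100,)` describes one hidden layer with 100 units rather than 100 layers. The sketch below makes this concrete by listing the weight-matrix shapes of a fully connected network for a given setting. The helper `mlp_weight_shapes` is hypothetical, and the second call shows a two-layer variant that is not used in this notebook.

```python
def mlp_weight_shapes(n_features, hidden_layer_sizes, n_outputs):
    """Return the shape of each weight matrix in a fully connected network."""
    layer_widths = [n_features, *hidden_layer_sizes, n_outputs]
    return list(zip(layer_widths[:-1], layer_widths[1:]))

# 5890 TF-IDF features and 3 sentiment classes, as in this notebook
one_hidden_layer = mlp_weight_shapes(5890, (100,), 3)
two_hidden_layers = mlp_weight_shapes(5890, (100, 100), 3)  # hypothetical variant, not used here

print(one_hidden_layer)
print(two_hidden_layers)
```

The values 10, 30, and 100 tried in the cross-validation thus change the width of a single hidden layer, not the depth of the network.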

In [65]:
################################################
# Don't run this chunk during presentation
################################################

from sklearn.neural_network import MLPClassifier

# Cross-validate to select the best hidden layer size

score_max = 0

for param in [10, 30, 100]:
    model = MLPClassifier(hidden_layer_sizes=(param, ), random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"hidden_layer_size = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print(f"Highest score : {round(score_max, 3)} when hidden_layer_sizes = {param_best}")



hidden_layer_size = 10: [0.70754717 0.72012579 0.73584906 0.73899371 0.72327044]
0.725, 0.011

hidden_layer_size = 30: [0.73584906 0.72327044 0.7327044  0.72955975 0.72955975]
0.73, 0.004

hidden_layer_size = 100: [0.73584906 0.72327044 0.74842767 0.73584906 0.72012579]
0.733, 0.01

Highest score : 0.733 when hidden_layer_sizes = 100
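
To judge how decisive this selection is, the fold scores printed above can be summarized again with the standard library alone. The sketch below copies the fold accuracies from the output and compares the spread between settings with the spread between folds; the dictionary `cv_scores` and the other names are ours.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross-validation output above.
cv_scores = {
    10: [0.70754717, 0.72012579, 0.73584906, 0.73899371, 0.72327044],
    30: [0.73584906, 0.72327044, 0.7327044, 0.72955975, 0.72955975],
    100: [0.73584906, 0.72327044, 0.74842767, 0.73584906, 0.72012579],
}

# Mean and population standard deviation per setting
# (the population form matches NumPy's default std).
stats = {size: (mean(s), pstdev(s)) for size, s in cv_scores.items()}
best = max(stats, key=lambda size: stats[size][0])

for size, (m, sd) in stats.items():
    print(f"{size:>3}: mean={m:.3f}, std={sd:.3f}")

# Gap between the best and the worst mean, to compare with the fold-to-fold spread.
gap = stats[best][0] - min(m for m, _ in stats.values())
print(f"best = {best}, gap to worst = {gap:.3f}")
```

The gap between the best and worst settings is smaller than the fold-to-fold standard deviation of either of them, so the preference for 100 units is a weak one.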


Fit a neural network with one hidden layer of 100 units to the training data and test it on the test data.

In [67]:
print(f"hidden_layer_size = {param_best}")
# instantiate the neural network
mlpc = MLPClassifier(hidden_layer_sizes=(param_best, ), random_state=0)
# fit the model to the training data and test it on test data
%time train_test(X_train, X_test, y_train, y_test, mlpc)

hidden_layer_size = 100
Train accuracy score: 1.0
Test accuracy score: 0.724

[[  8   5  21]
 [  3   8  59]
 [  9  13 272]]

              precision    recall  f1-score   support

           n       0.40      0.24      0.30        34
           p       0.31      0.11      0.17        70
           u       0.77      0.93      0.84       294

    accuracy                           0.72       398
   macro avg       0.49      0.42      0.44       398
weighted avg       0.66      0.72      0.68       398

CPU times: user 1min, sys: 4min 45s, total: 5min 45s
Wall time: 43.2 s


The model still performed best at identifying *neutral* sentences. However, it appears to be the second best at recognizing *negative* and *positive* sentences (f1-scores of 0.30 and 0.17), following Decision Trees.

The overall test accuracy of the model is 0.72. We store its test accuracy in the dictionary `summary` for comparison.
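
The per-class figures in the report can be recovered directly from the confusion matrix. The sketch below does this by hand, assuming the matrix printed by `train_test` follows scikit-learn's convention (rows are true labels, columns are predicted labels, in the order n, p, u); the names `confusion` and `report` are ours.

```python
labels = ["n", "p", "u"]
# Rows are true labels, columns are predicted labels (assumed convention).
confusion = [
    [8, 5, 21],
    [3, 8, 59],
    [9, 13, 272],
]

report = {}
for i, label in enumerate(labels):
    true_positive = confusion[i][i]
    predicted_as_label = sum(row[i] for row in confusion)  # column sum
    actually_label = sum(confusion[i])                     # row sum
    precision = true_positive / predicted_as_label
    recall = true_positive / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Accuracy is the share of sentences on the diagonal.
accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
print(f"accuracy={accuracy:.3f}")
```

Reading the matrix this way also shows where the errors go: most misclassified *negative* and *positive* sentences end up in the *neutral* column.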

In [68]:
summary["Neural Networks"] = round(mlpc.score(X_test, y_test), 3)

In [70]:
# Choose the classifier that yields the best performance.
summary

{'k-NNs': 0.731,
 'Multinomial Naive Bayes': 0.739,
 'Decision Trees': 0.698,
 'Random Forest': 0.751,
 'Linear SVMs': 0.744,
 'Kernelized SVMs': 0.741,
 'Neural Networks': 0.724}

Comparing the test accuracy of all the classification algorithms, we select **Random Forest** (0.751) as the best model to predict sentence sentiment for our data set.
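
The same comparison can be done programmatically. The sketch below ranks the accuracies copied from the `summary` output and expresses the lead of the best model as an approximate number of test sentences, assuming the 398-sentence test set reported in the classification reports; `reported` and the other names are ours, and the values are the rounded ones shown above.

```python
# Test accuracies copied from the `summary` dictionary printed above.
reported = {
    "k-NNs": 0.731,
    "Multinomial Naive Bayes": 0.739,
    "Decision Trees": 0.698,
    "Random Forest": 0.751,
    "Linear SVMs": 0.744,
    "Kernelized SVMs": 0.741,
    "Neural Networks": 0.724,
}

# Rank the classifiers from highest to lowest test accuracy.
ranking = sorted(reported.items(), key=lambda item: item[1], reverse=True)
(best_name, best_acc), (second_name, second_acc) = ranking[:2]

# Express the lead over the runner-up as a number of test sentences (398 in total).
n_test = 398
lead_in_sentences = round((best_acc - second_acc) * n_test)

print(best_name, second_name, lead_in_sentences)
```

The lead of the top model over the runner-up amounts to only a handful of test sentences, so the ranking among the best few classifiers should be read with some caution.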

## <a id="section_4"> 4. Test Random Forest </a>

In this section, we use ten new sentences to test the performance of **Random Forest** in predicting sentence sentiment.

We semi-randomly selected the 10 sentences from Amazon reviews and online articles, aiming for a balanced distribution of positive, negative, and neutral sentiments.

In [71]:

text1 = "This sink is perfect for the industrial/studio look that we have at home. It's a great size for our bathroom sink.\
So far, we love it. The finish is nice and it appears to be made well. \
I don't know how I'll like the ffaucet being off to the side but I'll update my review when it's futexlly installed and running."

text2 = "The sprayer pullout and button is perfect! We couldn't be more pleased with this faucet!\
It looks so much more expensive than it is! And it feels quality - the metal is thick and heavy, the installation parts are nice...\
I adjusted the placement of my counterweight because the way the plumbing is set up wehn I pull down the sprayer and release it back \
into the spout, the counterweight gets stuck on one of the pipes. So, I just moved it up a little bit sacrificing some of the \
pull down length. My sink is really deep and I still get a good reach to the bottom."

text3 = "I wish I never purchased this oven. The functions are too complicataed to use, especially when the print on the buttons \
wears off within a couple months."

text4 = "I like this cute little machin, and I know my father has used them for years without incident. HOWEVER, within about \
20 uses, it started turning on and trying to brew by itself half of the time that I plugged it in without the power button being pressed.\
I did not get it wet or drop it or damage it in any way."

text5 = "I hate this little device from hell. I used to love it. I bought one for every room in my house, my office at work, and they \
were good. But now, despite Amazon's alleged spying, some genius at Amazon decided to allow this thing to wake you up all hours\
 of the night with a loud obnoxious noise and a bright yellow light."

text6 = "Very disppointed with these chairs. Hasn't been quite 5 months and t he wicker is unraveling on one of the chairs."

text7 = "Adorable. Exactly as pictured. Looks so nice between my two rockers on the porch and a perfect landing place for a cup\
 of coffee and a book."

text8 = "When coding categorical variables, there are a variety of coding systems we can choose for testing different set of \
linear hypotheses. On this page, we will cover some of the coding schemes for categorical variables. \
We will show how these coding schemes are constructed and interpreted."

text9 = "We bought this a month ago and we absolutely love it! Super high quality and pretty easy to put together.\
 I would definitely recommend! And my dog clearly loves it."

text10 = "Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. \
It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to \
every other computational science, whether that be in stochastic optimisation, control theory, operations research, \
econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic \
thinking is essential."

Prepare the sentences for the model.

The sentences are transformed into a 10 × 5890 matrix to be used as the predictors.
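
The key point of `vectorizer.transform` is that it reuses the vocabulary learned from the training data, so new texts are mapped onto the same 5890 columns and any word the vectorizer never saw is dropped. The toy sketch below illustrates that idea with plain word counts and a made-up five-word vocabulary. It is not the notebook's actual `vectorizer`, whose type and settings are defined earlier; the function `transform` and the vocabulary are hypothetical.

```python
import re

# Hypothetical vocabulary "learned" from training data: word -> column index.
vocabulary = {"faucet": 0, "love": 1, "oven": 2, "perfect": 3, "quality": 4}

def transform(texts, vocabulary):
    """Map each text to a fixed-width count vector; out-of-vocabulary words are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

X_toy = transform(
    ["We love this faucet, it is perfect!", "Probabilistic thinking is essential."],
    vocabulary,
)
print(X_toy)
```

The width of each row depends only on the vocabulary, not on the text, and a text with no known words becomes an all-zero row. This is why the matrix for the ten new sentences can be very sparse.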

In [72]:
new_texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9, text10]
X_new = vectorizer.transform(new_texts)
X_new

<10x5890 sparse matrix of type '<class 'numpy.float64'>'
	with 142 stored elements in Compressed Sparse Row format>

Predict the sentiments of these sentences with the fitted Random Forest model.

In [73]:
# Predict the sentiments of the new sentences with the fitted random forest
rfc.predict(X_new)

array(['p', 'p', 'u', 'u', 'u', 'u', 'p', 'u', 'u', 'u'], dtype=object)

In summary, the Random Forest model trained on our training data did a fairly good job of predicting positive and neutral sentiments. However, it failed to predict any negative sentiment, which is not surprising given how poorly the algorithm performed on negative sentences in our test data.
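
A quick tally of the predicted labels makes the missing class explicit. The sketch below copies the predictions from the output above into a plain list (the name `predictions` is ours) and counts each label.

```python
from collections import Counter

# Predictions copied from the output of rfc.predict(X_new) above.
predictions = ["p", "p", "u", "u", "u", "u", "p", "u", "u", "u"]

counts = Counter(predictions)
print(counts)
print("negative predictions:", counts["n"])
```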