# Assignment 3 Ruisi_Gu -- Sub-Notebook One

# Abstract

This report tackles the Kaggle competition "Sentiment Analysis on Movie Reviews". The aim of the competition is to classify the sentiment of phrases from the Rotten Tomatoes dataset. Submissions are evaluated on classification accuracy (the percentage of labels predicted correctly) over every parsed phrase. The sentiment labels are: 0 - negative, 1 - somewhat negative, 2 - neutral, 3 - somewhat positive, and 4 - positive. I therefore evaluate my models by accuracy, which can be read off the confusion matrix.<br><br>
To get a better result, I roughly divide the report into three parts. The first is **Cleaning and Preprocessing Data**, the second is **Fit in Models and Comparison**, and the third uses the H2O library.<br><br>
Because the first two parts use TensorFlow and sklearn while the last one uses the H2O library, I have split the notebook into two sub-notebooks. This is the first one.
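
As a minimal sketch of how accuracy relates to a confusion matrix: accuracy is the sum of the diagonal (the correct predictions) divided by the total number of predictions. The labels below are made-up toy values, and `confusion_matrix` and `accuracy_from_confusion` are hypothetical helper names, not part of the competition code.

```python
NUM_CLASSES = 5  # sentiment labels 0..4


def confusion_matrix(y_true, y_pred, num_classes=NUM_CLASSES):
    """Return matrix[i][j] = number of samples with true label i predicted as j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy_from_confusion(matrix):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels, invented for illustration only.
y_true = [0, 1, 2, 2, 3, 4, 2, 3]
y_pred = [1, 1, 2, 2, 3, 3, 2, 2]

matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_from_confusion(matrix)
print(accuracy)  # 5 of 8 correct -> 0.625
```

The off-diagonal cells show which classes are confused with each other, which a single accuracy number hides.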

## Part One -- Cleaning and Preprocessing Data

#### Loading and Cleaning Data

In [2]:
import pandas


train_df = pandas.read_csv("data/train.tsv", sep="\t")
test_df = pandas.read_csv("data/test.tsv", sep="\t")
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


Stem the data using nltk's SnowballStemmer, which reduces words such as "pythoner" and "pythoning" to "python", and lowercase all phrases.

In [3]:
import re
import nltk


STEMMER = nltk.stem.SnowballStemmer("english")


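# Same token pattern as CountVectorizer's default: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def custom_tokenizer(document):
    """Split a phrase into lowercase, stemmed word tokens."""
    words = TOKEN_PATTERN.findall(document)
    return [STEMMER.stem(word.lower()) for word in words]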
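
To see what the token pattern keeps before stemming, here is a small sketch using only the standard library. The sample phrase is made up, and stemming is deliberately left out so the example needs only `re`.

```python
import re

# Same pattern as in custom_tokenizer: runs of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

sample_phrase = "A gripping movie, isn't it?"  # made-up example phrase
tokens = [word.lower() for word in token_pattern.findall(sample_phrase)]
print(tokens)  # ['gripping', 'movie', 'isn', 'it']
```

Single-character tokens such as "A" and the "t" of "isn't" are dropped, and punctuation acts as a token boundary.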

#### Build Model

In [4]:
x_train, y_train = train_df["Phrase"].copy(), train_df["Sentiment"].copy()
ids, x_test = test_df["PhraseId"].copy(), test_df["Phrase"].copy()

In [5]:
from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer(min_df=20, tokenizer=custom_tokenizer)
x_train_vect = vectorizer.fit_transform(x_train)
x_train_vect

<156060x4873 sparse matrix of type '<class 'numpy.int64'>'
	with 912378 stored elements in Compressed Sparse Row format>
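
The effect of `min_df` can be sketched on a toy corpus. With an integer value, `CountVectorizer` keeps only terms that appear in at least that many documents, which is what limits the vocabulary above to 4873 terms. The three documents below are invented, and `min_df=2` is used instead of 20 so the effect is visible on such a small sample; the default tokenizer is used rather than `custom_tokenizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus, for illustration only.
toy_docs = ["a good movie", "a bad movie", "good fun"]

# Keep only terms that occur in at least 2 documents.
toy_vectorizer = CountVectorizer(min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)        # ['good', 'movie']
print(toy_matrix.shape)  # (3, 2)
```

"bad" and "fun" each occur in only one document and are dropped; "a" is dropped earlier because the default token pattern ignores single-character tokens.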

In [6]:
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB


estimator = MultinomialNB()
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25, random_state=0)
cv_scores = cross_val_score(estimator, x_train_vect, y_train, cv=cv)
mean, std = cv_scores.mean(), cv_scores.std()

print(f"Mean cross-validation score: {mean:.3f} (+/- {std:.3f})")

Mean cross-validation score: 0.613 (+/- 0.002)


In [7]:
from sklearn.pipeline import make_pipeline


pipe = make_pipeline(vectorizer, estimator)
pipe.fit(x_train, y_train)

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
 ...   vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

#### Generate New Data frame

Use the fitted model to predict on both the training and test datasets, and save the predictions in a new column, `SentimentM1`. These two datasets are the ones explored further. The new training and test datasets are saved as train_submission and submission06, respectively.

In [20]:
predictions = pipe.predict(x_test)

submission = pandas.DataFrame(data={
    "PhraseId": ids,
    "SentimentM1": predictions
})
submission.to_csv("submission06.csv", index=False)
submission.head(n=10).T

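| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| PhraseId | 156061 | 156062 | 156063 | 156064 | 156065 | 156066 | 156067 | 156068 | 156069 | 156070 |
| SentimentM1 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 2 |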


In [21]:
predictions_train = pipe.predict(x_train)

In [22]:
predictions_train

array([3, 2, 2, ..., 2, 2, 2])

In [23]:
train_submission = pandas.read_csv("data/train.tsv", sep="\t")

In [24]:
train_submission['SentimentM1'] = predictions_train

In [25]:
train_submission.to_csv('train_submission.csv', sep=',', index=False)