## CS5785 Fall 2021 Applied Machine Learning Homework 2: PROGRAMMING EXERCISE 2 - Sentiment analysis of online reviews

### By Hao Geng (hg457),  Siyi Chen(sc2358)

In [1]:
import numpy as np
import pandas
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

import sklearn
from sklearn import preprocessing
# from sklearn.naive_bayes import GaussianNB #高斯分布型
# from sklearn.naive_bayes import BernoulliNB #伯努利型
# from sklearn.naive_bayes import MultinomialNB #多项式型
from sklearn import linear_model
# from sklearn.metrics import mean_squared_error, r2_score

### (a) Download [Sentiment Labelled Sentences Data Set](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). There are three data ﬁles under the root folder. yelp_labelled.txt, amazon_cells_labelled.txt and imdb_labelled.txt. Parse each ﬁle with the speciﬁcations in readme.txt. Are the labels balanced? If not, what’s the ratio between the two labels? Explain how you process these ﬁles.

In [2]:
def read_df(filename):
  return pandas.read_csv('amazon_cells_labelled.txt', sep = '\t', 
                        header=None, names = ['sentences','scores'])

df_amazon = read_df('amazon_cells_labelled.txt')
df_imdb = read_df('imdb_labelled.txt')
df_yelp = read_df('yelp_labelled.txt')

In [3]:
print(sum(df_amazon['scores']))
print(sum(df_imdb['scores']))
print(sum(df_yelp['scores']))

500
500
500


### Summary:
Each dataframe has 500 negative and positive labels respectively. So they are balanced.

### (b) *Pick your preprocessing strategy*. Since these sentences are online reviews, they may contain signiﬁcant amounts of noise and garbage. You *may or may not* want to do one or all of the following. Explain the reasons for each of your decision (*why or why not*).
- Lower case all of the words.
- Lemmatization of all the words (i.e.,convert every word to its root so that all of “running,” “run,” and “runs” are converted to “run” and all of“good,” “well,” “better,” and “best” are converted to “good”; this is easily done using [nltk.stem](http://www.nltk.org/api/nltk.stem.html)).
- Strip punctuation.
- Strip the stop words,e.g., “the”, “and”, “or”.
- Something else? Tell us about it.

### Summary
- Lower case all of the words.
    - Most time words mean the same no matter their cases. Lowering case them can reduce the word count (feature count) and unify the intepretations of words with different cases. In this task, lower cases would be good.
- Lemmatization of all the words (i.e.,convert every word to its root so that all of “running,” “run,” and “runs” are converted to “run” and all of“good,” “well,” “better,” and “best” are converted to “good”; this is easily done using [nltk.stem](http://www.nltk.org/api/nltk.stem.html)).
    - Most every (content) word in English can take on several forms. Sometimes these changes are meaningful, and sometimes they are just to serve a certain grammatical context. Lemmatization words helps extricate high-quality information from text so that all variants are consistent across documents. But Lemmatization can lose accuracy, like 'best' should have a higher indication then 'good' for positive reviews. Not lemmatizing is the conservative approach, and should be favored unless there is a significant performance gain. In this task, lemmatize words would be good.
- Strip punctuation.
- Strip the stop words,e.g., “the”, “and”, “or”.
    - Stop words and punctuations are the most frequent words in any slice of text, and mostly seemingly contentless. But they may be relevant to the meaning and function of similar words in rare cases. And these stop words and punctuations may indicate the length of the text. Considering the number of them is limited, this task would not eliminate them.

### (c) Split training and testing set. In this assignment, for each ﬁle, please use the ﬁrst 400 instances for each label as the training set and the remaining 100 instances as testing set. In total,there are 2400 reviews for training and 600 reviews for testing.

In [4]:
def split_df_4_1(df_amazon, negative_index):
  df_amazon_positive = df_amazon[df_amazon['scores'] == 1]
  df_amazon_negative = df_amazon[df_amazon['scores'] == 0]

  df_amazon_train = df_amazon_positive[:negative_index].append(df_amazon_negative[:negative_index]).reset_index()
  df_amazon_test = df_amazon_positive[negative_index:].append(df_amazon_negative[negative_index:]).reset_index()

  return df_amazon_train, df_amazon_test

In [5]:
# Split training and testing set respectively
df_amazon_train, df_amazon_test = split_df_4_1(df_amazon, 400)
df_imdb_train, df_imdb_test = split_df_4_1(df_imdb, 400)
df_yelp_train, df_yelp_test = split_df_4_1(df_yelp, 400)

# combine the splited sets to a whole set
df_all_train = df_amazon_train.append(df_imdb_train).append(df_yelp_train).reset_index()
df_all_test = df_amazon_test.append(df_imdb_test).append(df_yelp_test).reset_index()

In [6]:
df_all_train

Unnamed: 0,level_0,index,sentences,scores
0,0,1,"Good case, Excellent value.",1
1,1,2,Great for the jawbone.,1
2,2,4,The mic is great.,1
3,3,7,If you are Razr owner...you must have this!,1
4,4,10,And the sound quality is great.,1
...,...,...,...,...
2395,795,814,Battery has no life.,0
2396,796,815,I checked everywhere and there is no feature f...,0
2397,797,818,Doesn't do the job.,0
2398,798,824,Awkward to use and unreliable.,0


In [7]:
df_all_test

Unnamed: 0,level_0,index,sentences,scores
0,0,778,This is a great deal.,1
1,1,787,It is simple to use and I like it.,1
2,2,788,"It's a great tool for entertainment, communica...",1
3,3,791,I own 2 of these cases and would order another.,1
4,4,792,Great Phone.,1
...,...,...,...,...
595,195,995,The screen does get smudged easily because it ...,0
596,196,996,What a piece of junk.. I lose more calls on th...,0
597,197,997,Item Does Not Match Picture.,0
598,198,998,The only thing that disappoint me is the infra...,0


### (d) Bag of Words model. Extract features and then represent each review using bag of words model, i.e., every word in the review becomes its own element in a feature vector. In orderto do this, ﬁrst, make one pass through all the reviews in the training set (Explain why we can’t use testing set at this point) and build a dictionary of unique words. Then, make another pass through the review in both the training set and testing set and count up the occurrences of each word in your dictionary. The ith element of a review’s feature vector is the number of occurrences of the ith dictionary word in the review. Implement the bag of words model and report feature vectors of any two reviews in the training set.

In [8]:
#call the nltk downloader
# nltk.download() # Click on Models tab and select punkt and click Download

In [9]:
#create an object of class PorterStemmer
porter = PorterStemmer()

In [10]:
def get_stemed_word_df(df):

    df_words = pandas.DataFrame()

    for sentence in df['sentences']:
        token_dict = {}

        token_words=word_tokenize(sentence)
        for word in token_words:
            stemed = porter.stem(word)
            token_dict[stemed] = token_dict.get(stemed, 0) + 1
    #     print(token_dict)

        df_words = df_words.append(token_dict, ignore_index=True)
        
    return df_words.fillna(0)

In [11]:
# df of training data with words as features
df_all_train_words = get_stemed_word_df(df_all_train)
dict_all_train_words = df_all_train_words.to_dict()
matrix_all_train = np.array(df_all_train_words)

# df of test data with words as features
df_all_test_words = get_stemed_word_df(df_all_test)

# exclude the columns of features not includeed in training data
to_drop = []
for col in df_all_test_words.columns.values.tolist():
    if col not in df_all_train_words.columns.values.tolist():
        to_drop.append(col)

df_all_test_words = df_all_test_words.drop(columns = to_drop)
matrix_all_test = np.array(df_all_test_words)

# df of all data with words as features, combine training and test data
df_all_words = df_all_train_words.append(df_all_test_words).fillna(0)
dict_all_words = df_all_words.to_dict()
matrix_all = np.array(df_all_words)

In [12]:
df_all_train_words

Unnamed: 0,",",.,case,excel,good,valu,for,great,jawbon,the,...,shout,telephon,wind,7.44,grtting,until,v3c,improp,everywher,awkward
0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2395,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2396,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2397,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2398,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [13]:
matrix_all_train

array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [14]:
df_all_words

Unnamed: 0,",",.,case,excel,good,valu,for,great,jawbon,the,...,shout,telephon,wind,7.44,grtting,until,v3c,improp,everywher,awkward
0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
596,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
597,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
matrix_all

array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

### Summary:
- Explain why we can’t use testing set at this point
    - We need to make one pass in the training set because we need to collect all the words(features). If we only go through the whole dataset, we will collect all the words, including those do not show up in the training dataset. Words not in the training dataset should not be included in the features since we can not draw information of their relevance to result based on training date. Any values we set to those words will be assumptions, which we should not do.

### (e) Pick your postprocessing strategy. Since the vast majority of English words will not appear in most of the reviews, most of the feature vector elements will be 0. This suggests that we need a postprocessing or normalization strategy that combats the huge variance of the elements in the feature vector. You may want to use one of the following strategies. Whatever choices you make, explain why you made the decision. 
- log-normalization. For each element of the feature vector x, transform it into f(x) = log(x+1).
- l1 normalization. Normalize the l1 norm of the feature vector, ˆ x= x / |x| .
- l2 normalization. Normalize the l2 norm of the feature vector, ˆ x= x / ||x||.
- Standardize the data by subtracting the mean and dividing by the variance. 

In [16]:
matrix_all_train_all_features = matrix_all[:2400] # training set
matrix_all_test_all_features = matrix_all[2400:] # test set

In [17]:
# log-normalization
matrix_all_train_log = np.log(matrix_all_train_all_features + 1)

In [18]:
# l1 normalization
matrix_all_train_normalized1 = preprocessing.normalize(matrix_all_train_all_features, 'l1')
matrix_all_test_normalized1 = preprocessing.normalize(matrix_all_test_all_features, 'l1')

In [19]:
matrix_all_train_normalized1

array([[0.16666667, 0.16666667, 0.16666667, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.2       , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.2       , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.16666667, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.16666667, 0.        , ..., 0.        , 0.        ,
        0.16666667],
       [0.        , 0.125     , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [20]:
# l2 normalization
matrix_all_train_normalized2 = preprocessing.normalize(matrix_all_train_all_features, 'l2')
matrix_all_test_normalized2 = preprocessing.normalize(matrix_all_test_all_features, 'l2')

In [21]:
matrix_all_train_normalized2

array([[0.40824829, 0.40824829, 0.40824829, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.4472136 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.4472136 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.40824829, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.40824829, 0.        , ..., 0.        , 0.        ,
        0.40824829],
       [0.        , 0.31622777, 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [22]:
# Standardize
matrix_all_train_scaled = preprocessing.scale(matrix_all_train_all_features)
matrix_all_test_scaled = preprocessing.scale(matrix_all_test_all_features)

In [23]:
matrix_all_train_scaled

array([[ 1.17085051,  0.31700361,  5.45611865, ..., -0.03537746,
        -0.03537746, -0.03537746],
       [-0.48114316,  0.31700361, -0.18328047, ..., -0.03537746,
        -0.03537746, -0.03537746],
       [-0.48114316,  0.31700361, -0.18328047, ..., -0.03537746,
        -0.03537746, -0.03537746],
       ...,
       [-0.48114316,  0.31700361, -0.18328047, ..., -0.03537746,
        -0.03537746, -0.03537746],
       [-0.48114316,  0.31700361, -0.18328047, ..., -0.03537746,
        -0.03537746, 28.26658805],
       [-0.48114316,  0.31700361, -0.18328047, ..., -0.03537746,
        -0.03537746, -0.03537746]])

In [24]:
matrix_all_train_scaled.mean(axis=0)

array([-7.77156117e-18,  3.25665421e-17,  5.03301104e-17, ...,
        5.92118946e-18,  5.92118946e-18,  5.92118946e-18])

In [25]:
matrix_all_train_scaled.std(axis=0)

array([1., 1., 1., ..., 1., 1., 1.])

### Summary:
L1 normalization is considered good for slack matrix.

### (f) Sentiment prediction. Train a naive Bayes model on the training set and test on the testing set. Report the classiﬁcation accuracy and confusion matrix. 

In [42]:
def naiva_Bayers(X_train, X_test, Y_train):
    Y_union_size = 2
    sample_size, feature_num = X_train.shape # rows, columns
    ps = np.zeros([Y_union_size, feature_num])
    ph = np.zeros([Y_union_size])    
    for i in range(Y_union_size):
        x_i = X_train[Y_train == i]
        ps[i] = np.mean(x_i, axis=0)
        ph[i] = x_i.shape[0]/float(sample_size) # how many y = i
    
    sample_size, feature_num = X_test.shape # rows, columns
    ps = ps.reshape(Y_union_size, 1, feature_num)
    ps = ps.clip(1e-32, 1-1e-32) # avoid log 0
    logp_y1 = np.log(ph).reshape([Y_union_size, 1])    
    logp_y2 = X_test * np.log(ps) + (1-X_test) * np.log(1-ps)
    logp_y = logp_y1 + logp_y2.sum(axis = 2)
    Y_pred = logp_y.argmax(axis = 0).flatten()
    
    return Y_pred

In [43]:
X_train = matrix_all_train_normalized1
X_test = matrix_all_test_normalized1
Y_train = df_all_train['scores']
Y_test = df_all_test['scores']

In [44]:
score_pred = naiva_Bayers(X_train, X_test, Y_train)

# average accuracy
score = sklearn.metrics.accuracy_score(score_pred, Y_test)
print('accuracy_score =', score)

# # confusion matrix
cf = sklearn.metrics.confusion_matrix(Y_test, score_pred)
pandas.DataFrame(cf, ('true %i'%x for x in range(2)), ('pred %i'%x for x in range(2)))

accuracy_score = 0.765


Unnamed: 0,pred 0,pred 1
true 0,255,45
true 1,96,204


### (g) Logistic regression. Now repeat using logistic regression classiﬁcation, and compare performance (you can use existing packages here). Try using both L2 (ridge) regularization and L1 (lasso) regularization and report how these affect the classiﬁcation accuracy and the coefﬁcient vectors (hint: sklearn has a method called LogisticRegressionCV; also note that sklearn doesn’t actually have an implementation of unregularized logisticregression). Inspecting the coefﬁcient vectors, what are the words that play the most important roles in deciding the sentiment of the reviews? 

In [46]:
def train_Logistic(X_train, X_test, Y_train):
    regr = linear_model.LogisticRegressionCV()
    # regr = linear_model.LogisticRegression(C=100)
    regr.fit(X_train, Y_train)
    Y_pred = regr.predict(X_test)
    
    return Y_pred, regr.coef_

In [47]:
def most_impacted_coef(coef):
    # get coef with most impact
    max_coef = np.max(coef)
    max_coef_index = np.where(coef==max_coef)[1][0]
    column_name = df_all_words.columns.values[max_coef_index]
    print("max coef: %.2f, column name (word): %s, frequency: %d"
        % (max_coef, column_name, sum(df_all_words[column_name])))

    min_coef = np.min(coef)
    min_coef_index = np.where(coef==min_coef)[1][0]
    column_name = df_all_words.columns.values[min_coef_index]
    print("min coef: %.2f, column name (word): %s, frequency: %d"
        % (min_coef, column_name, sum(df_all_words[column_name])))

In [48]:
def Logistic_result(X_train, X_test, Y_train, Y_test):
    Y_pred, coef = train_Logistic(X_train, X_test, Y_train)

    # average accuracy
    score = sklearn.metrics.accuracy_score(Y_pred, Y_test)
    print('accuracy_score =', score)
    
    # most impacted coef
    most_impacted_coef(coef)

In [49]:
Y_train = df_all_train['scores']
Y_test = df_all_test['scores']

# L1
X_train = matrix_all_train_normalized1
X_test = matrix_all_test_normalized1
Logistic_result(X_train, X_test, Y_train, Y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

accuracy_score = 0.835
max coef: 52.98, column name (word): best, frequency: 69
min coef: -51.16, column name (word): not, frequency: 366


In [50]:
# L2
X_train = matrix_all_train_normalized2
X_test = matrix_all_test_normalized2
Logistic_result(X_train, X_test, Y_train, Y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

accuracy_score = 0.83
max coef: 16.80, column name (word): best, frequency: 69
min coef: -17.15, column name (word): not, frequency: 366


### Summary
- For L1:
    - accuracy_score = 0.835
    - max coef: 52.98, column name (word): best, frequency: 69
    - min coef: -51.16, column name (word): not, frequency: 366

- For L2:
    - accuracy_score = 0.83
    - max coef: 16.80, column name (word): best, frequency: 69
    - min coef: -17.15, column name (word): not, frequency: 366

The accuracy does not differ significantly with L1 and L2. Though the coeffcients are not the same, the most important words in deciding the sentiment of the reviews are "best" and "not" for both algorithms.


### (h) N-gram model. Similar to the bag of words model, but now you build up a dictionary of n-grams, which are contiguous sequences of words. For example, “Alice fell down the rabbit hole” would then map to the 2-grams sequence: ["Alice fell", "fell down", "down the", "the rabbit","rabbit hole"], and all ﬁve of those symbols would be members of the n-gram dictionary. Try n=2, repeat (d)-(g) and report your results. 

In [51]:
def get_stemed_2_words_df(df):
    df_words = pandas.DataFrame()

    previous_word = ''
    for sentence in df['sentences']:
        token_dict = {}

        token_words=word_tokenize(sentence)
        for word in token_words:
            stemed = porter.stem(word)
            if previous_word != '':
                token_dict[previous_word + stemed] = token_dict.get(stemed, 0) + 1
            previous_word = stemed + ' '
    #     print(token_dict)

        df_words = df_words.append(token_dict, ignore_index=True)
        
    return df_words.fillna(0)

In [53]:
# df of training data with words as features
df_all_train_words = get_stemed_2_words_df(df_all_train)

# df of test data with words as features
df_all_test_words = get_stemed_2_words_df(df_all_test)

# exclude the columns of features not includeed in training data
to_drop = []
for col in df_all_test_words.columns.values.tolist():
    if col not in df_all_train_words.columns.values.tolist():
        to_drop.append(col)

df_all_test_words = df_all_test_words.drop(columns = to_drop)

# df of all data with words as features, combine training and test data
df_all_words = df_all_train_words.append(df_all_test_words).fillna(0)
matrix_all = np.array(df_all_words)

In [55]:
df_all_train_words.shape

(2400, 5961)

In [56]:
matrix_all_train_all_features = matrix_all[:2400]
matrix_all_test_all_features = matrix_all[2400:]

In [57]:
# l1 normalization
matrix_all_train_normalized1 = preprocessing.normalize(matrix_all_train_all_features, 'l1')
matrix_all_test_normalized1 = preprocessing.normalize(matrix_all_test_all_features, 'l1')

# l2 normalization
matrix_all_train_normalized2 = preprocessing.normalize(matrix_all_train_all_features, 'l2')
matrix_all_test_normalized2 = preprocessing.normalize(matrix_all_test_all_features, 'l2')

In [58]:
Y_train = df_all_train['scores']
Y_test = df_all_test['scores']

In [59]:
# L1
X_train = matrix_all_train_normalized1
X_test = matrix_all_test_normalized1

# Bayes
score_pred = naiva_Bayers(X_train, X_test, Y_train)

# average accuracy
score = sklearn.metrics.accuracy_score(score_pred, Y_test)
print(score)

# confusion matrix
cf = sklearn.metrics.confusion_matrix(Y_test, score_pred)
pandas.DataFrame(cf, ('true %i'%x for x in range(2)), ('pred %i'%x for x in range(2)))

0.82


Unnamed: 0,pred 0,pred 1
true 0,240,60
true 1,48,252


In [61]:
# L1
# Logistic Regression
Logistic_result(X_train, X_test, Y_train, Y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

accuracy_score = 0.75
max coef: 11.81, column name (word): . great, frequency: 78
min coef: -8.23, column name (word): not work, frequency: 42


In [62]:
# L2
X_train = matrix_all_train_normalized2
X_test = matrix_all_test_normalized2

# Bayes
score_pred = naiva_Bayers(X_train, X_test, Y_train)

# average accuracy
score = sklearn.metrics.accuracy_score(score_pred, Y_test)
print(score)

# confusion matrix
cf = sklearn.metrics.confusion_matrix(Y_test, score_pred)
pandas.DataFrame(cf, ('true %i'%x for x in range(2)), ('pred %i'%x for x in range(2)))

0.8


Unnamed: 0,pred 0,pred 1
true 0,225,75
true 1,45,255


In [63]:
# L2
Logistic_result(X_train, X_test, Y_train, Y_test)

accuracy_score = 0.745
max coef: 4.80, column name (word): . great, frequency: 78
min coef: -2.91, column name (word): not work, frequency: 42


### Summary
- for accracy: 
  - L1:
    - Bayes: accuracy_score = 0.82, logistic regression: 0.75
  - L2:
    - Bayes: accuracy_score = 0.8, logistic regression: 0.745
- for coef of Logistic Regression:
    - L1:
        - max coef: 11.81, column name (word): . great, frequency: 78
        - min coef: -8.23, column name (word): not work, frequency: 42
    - L2:
        - max coef: 4.80, column name (word): . great, frequency: 78
        - min coef: -2.91, column name (word): not work, frequency: 42
    - Though the coeffcients are not the same, the most important words are ". great" and "not work" for both algorithms. 

### (i) Algorithms comparison and analysis. According to the above results, compare the performances of naive Bayes, logistic regression, naive Bayes with 2-grams, and logistic regression with 2-grams. Which method performs best in the prediction task and why? What do you learn about the language that people use in online reviews (e.g., expressions that will make the posts positive/negative)? Hint: Inspect the weights learned from logistic regression.

Accuracy wise, for 1-gram, logistic regression has higher accuracy scores. For 2-grams, native Bayes has higher accuracy scores. The accuracy difference for using different algorithms is not significate. 
2-grams keeps more context closer to the original sentences than 1-gram does, which is why native Bayers works better in 2-grams.
But logistic regression takes a significate longer time to run than naive Bayes, which is not ideal in practice.

In 1-gram and 2-grams, the maximun and minmum coefficients of L1's logistic regression are larger than the L2's. This means L1 highlights the key features better than L2 does.
1-gram gives 'best' and 'not' as most impactful words. 2-grams gives ". great" and "not work" as most impactful words. Both of the results include obviously positive and negative words, which makes sense.