### CS 506 HW4 code
The only required data file is `train.csv` from the class Kaggle website. Put this CSV file in the current working directory.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy as np
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560804 entries, 0 to 560803
Data columns (total 9 columns):
HelpfulnessDenominator    560804 non-null int64
HelpfulnessNumerator      560804 non-null int64
Id                        560804 non-null int64
ProductId                 560804 non-null object
Score                     460804 non-null float64
Summary                   560777 non-null object
Text                      560804 non-null object
Time                      560804 non-null int64
UserId                    560804 non-null object
dtypes: float64(1), int64(4), object(4)
memory usage: 38.5+ MB


In [4]:
df.head()

Unnamed: 0,HelpfulnessDenominator,HelpfulnessNumerator,Id,ProductId,Score,Summary,Text,Time,UserId
0,0,0,130058,B000CQIDHY,5.0,A worthy and welcome replacement,I don't know what has happened to formulation ...,1337817600,A3VZR9TPF2GERB
1,0,0,91622,B004YV80OE,4.0,"It was okay, good flavor",Kraft's a safe brand. They will produce food f...,1317254400,A1B1QMGK8VYG80
2,10,6,699,B000G6MBX2,1.0,"The ""Organic"" Label is Misleading","""Yeast Extract"" is listed as an ingredient. So...",1195084800,A1AQ2W2R4SOVGN
3,0,0,265935,B0001GDC4O,5.0,Fresh/Stale,Some of these espresso pods were fresh and som...,1272499200,A2IVH1D3GLACL3
4,1,1,199932,B000EDG430,5.0,Baked to perfection in my bread machine!,"I am not one to write reviews, but this bread ...",1336953600,AEOINN8F4D9DQ


Split the dataframe into a training set (rows that have a `Score`) and a testing set (rows whose `Score` is missing):

In [3]:
# NaN never equals itself, so notna()/isna() select the same rows as the self-comparison trick
train = df[df['Score'].notna()]
test = df[df['Score'].isna()]
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 460804 entries, 0 to 460803
Data columns (total 9 columns):
HelpfulnessDenominator    460804 non-null int64
HelpfulnessNumerator      460804 non-null int64
Id                        460804 non-null int64
ProductId                 460804 non-null object
Score                     460804 non-null float64
Summary                   460782 non-null object
Text                      460804 non-null object
Time                      460804 non-null int64
UserId                    460804 non-null object
dtypes: float64(1), int64(4), object(4)
memory usage: 35.2+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 460804 to 560803
Data columns (total 9 columns):
HelpfulnessDenominator    100000 non-null int64
HelpfulnessNumerator      100000 non-null int64
Id                        100000 non-null int64
ProductId                 100000 non-null object
Score                     0 non-null float64
Summary                   99996 non-
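
The split above relies on the fact that a missing `Score` is stored as NaN, and NaN never compares equal to itself. A minimal sketch on a made-up three-row frame (the data and the names `toy`, `labeled`, `unlabeled` are only for illustration) shows that the self-comparison and `notna()` give the same mask:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): the middle row has no score, like the rows to predict.
toy = pd.DataFrame({"Id": [1, 2, 3], "Score": [5.0, np.nan, 1.0]})

# NaN != NaN, so comparing a column with itself is False exactly where it is missing.
self_equal = toy["Score"] == toy["Score"]

# notna() expresses the same mask more explicitly.
assert (self_equal == toy["Score"].notna()).all()

labeled = toy[toy["Score"].notna()]    # rows that can be used for training
unlabeled = toy[toy["Score"].isna()]   # rows whose score must be predicted

print(labeled["Id"].tolist(), unlabeled["Id"].tolist())
```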

In [6]:
import warnings
warnings.filterwarnings('ignore')

Calculate tf-idf features for both the **text review** and the **summary**:

In [4]:
vectorizer = TfidfVectorizer(stop_words='english', min_df=2, ngram_range=(1,2))
text_vec_tr = vectorizer.fit_transform(train.Text)
text_vec_test = vectorizer.transform(test.Text)
print('vec 1 finish')
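vectorizer_sum = TfidfVectorizer(stop_words='english', min_df=2, ngram_range=(1,2))
# use the summary vectorizer here (not the text one); astype('U') turns missing summaries into the string 'nan'
sum_vec_tr = vectorizer_sum.fit_transform(train.Summary.values.astype('U'))
sum_vec_test = vectorizer_sum.transform(test.Summary.values.astype('U'))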
print('vec 2 finish')
y = train.Score.values
X_train = hstack([sum_vec_tr,text_vec_tr])
X_test = hstack([sum_vec_test,text_vec_test])

vec 1 finish
vec 2 finish
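
To make the feature layout concrete, here is a small sketch of the same idea on a made-up corpus: one tf-idf vectorizer per field, then `hstack` to put the two sparse blocks side by side. The toy reviews and the names `vec_sum`, `vec_text`, `combined` are assumptions for illustration, and `min_df=1` is used only because the corpus is tiny.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews: one summary and one full text per review.
summaries = ["great coffee", "stale pods", "great bread"]
texts = [
    "this coffee tastes great every morning",
    "some pods were stale and some were fresh",
    "the bread baked great in my machine",
]

# One vectorizer per field so each keeps its own vocabulary and idf weights.
# min_df=1 here only because the toy corpus is tiny.
vec_sum = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
vec_text = TfidfVectorizer(ngram_range=(1, 2), min_df=1)

sum_mat = vec_sum.fit_transform(summaries)
text_mat = vec_text.fit_transform(texts)

# hstack places the two blocks side by side: same rows, column counts added up.
combined = hstack([sum_mat, text_mat]).tocsr()

print(sum_mat.shape, text_mat.shape, combined.shape)
```

Keeping the fields in separate vectorizers means a word such as "great" gets one weight in the summary block and another in the text block, so the classifier can treat them differently.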


3-fold cross-validation to choose the logistic regression regularization parameter `C`:

In [8]:

best_C = 0.1
max_score = 0
# candidate values: 0.1, 0.4, 0.7 followed by 1, 3, ..., 19
for reg in list(np.arange(0.1, 1.0, 0.3)) + list(np.arange(1, 21, 2)):
    print('reg:',reg)
    clf = LogisticRegression(C=reg, solver='saga')
    # mean accuracy over the 3 folds (accuracy is the default score for classifiers)
    scores = cross_val_score(clf, X_train, y, cv=3)
    s = scores.mean()
    if (s > max_score):
        best_C = reg
        max_score = s

print('the best C:', best_C)
print('max score:',max_score)

reg: 0.1
reg: 0.4
reg: 0.7
reg: 1
reg: 3
reg: 5
reg: 7
reg: 9
reg: 11
reg: 13
reg: 15
reg: 17
reg: 19
the best C: 9
max score: 0.827037080628
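
The same kind of search can also be written with scikit-learn's `GridSearchCV`, which runs the cross-validation for every candidate and keeps track of the best one. This is only a sketch on synthetic data, not the run that produced the numbers above; `X_toy`, `y_toy`, and the shortened grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tf-idf matrix (hypothetical data, 3 classes).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Candidate values of C; smaller C means stronger regularization.
param_grid = {"C": [0.1, 0.4, 0.7, 1, 3, 5]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),  # default solver is enough for toy data
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",  # the default metric for classifiers, stated explicitly
)
search.fit(X_toy, y_toy)

print("best C:", search.best_params_["C"])
print("mean CV accuracy:", search.best_score_)
```

By default `GridSearchCV` also refits the best model on all the data it was given, so `search.predict` could be used directly afterwards.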


Refit logistic regression on the full training set with the best `C` and save the predictions:

In [9]:
lr = LogisticRegression(C=best_C, solver='saga')
lr.fit(X_train, y)
y_pred_lr = lr.predict(X_test)
res = pd.DataFrame(data={'Id':test.Id, 'Score':y_pred_lr})
res.to_csv('result.csv', index = False)

Use Naive Bayes for classification:

In [5]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
y_pred_nb = nb.fit(X_train, y).predict(X_test)
res = pd.DataFrame(data={'Id':test.Id, 'Score':y_pred_nb})
res.to_csv('result_nb.csv', index = False)
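
One thing worth keeping in mind with `MultinomialNB`: it expects non-negative features, which tf-idf weights satisfy. The sketch below, on made-up numbers, shows that it accepts such input and raises a `ValueError` once the features contain negative values (as a projection like SVD can produce). The arrays and names here are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

y_toy = np.array([1, 1, 5, 5])

# Non-negative features (like tf-idf weights) are accepted.
X_nonneg = np.array([[0.9, 0.0], [0.7, 0.1], [0.0, 0.8], [0.2, 0.6]])
nb_toy = MultinomialNB().fit(X_nonneg, y_toy)
pred = nb_toy.predict(np.array([[0.8, 0.0]]))

# Features with negative entries are rejected at fit time.
X_signed = X_nonneg - 0.5
try:
    MultinomialNB().fit(X_signed, y_toy)
    rejected = False
except ValueError:
    rejected = True

print(pred, rejected)
```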

Apply SVD to the tf-idf vectors:

Here I reduce the summary features to 50 dimensions and the text review features to 500 dimensions.
The resulting `X_train` and `X_test` are then fed back into the logistic regression cells above.

In [12]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=500)
svd_sum = TruncatedSVD(n_components=50)
# fit the SVD on the training tf-idf matrices only
dm_text_vec_tr = svd.fit_transform(text_vec_tr)
dm_sum_vec_tr = svd_sum.fit_transform(sum_vec_tr)

# project the test matrices with the already fitted SVD
dm_sum_vec_test = svd_sum.transform(sum_vec_test)
dm_text_vec_test = svd.transform(text_vec_test)

# concatenate feature vec after dimension reduction (dense arrays, so np.concatenate rather than hstack)
X_train = np.concatenate((dm_sum_vec_tr,dm_text_vec_tr), axis=1)
X_test = np.concatenate((dm_sum_vec_test,dm_text_vec_test), axis=1)
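
A scaled-down sketch of the same reduction on random sparse matrices may help to check the shapes. The matrices, sizes, and names (`train_mat`, `svd_toy`, and so on) are made up for illustration and much smaller than the real tf-idf features.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrices standing in for tf-idf features (hypothetical data).
train_mat = sparse_random(100, 40, density=0.1, format="csr", random_state=0)
test_mat = sparse_random(20, 40, density=0.1, format="csr", random_state=1)

svd_toy = TruncatedSVD(n_components=5, random_state=0)

# Fit on the training matrix only, then reuse the same projection for the test matrix.
train_red = svd_toy.fit_transform(train_mat)
test_red = svd_toy.transform(test_mat)

# The outputs are dense NumPy arrays with one column per component.
print(train_red.shape, test_red.shape)

# Fraction of the variance kept by the 5 components.
print(svd_toy.explained_variance_ratio_.sum())
```

Looking at `explained_variance_ratio_` is one way to judge whether choices such as 50 and 500 components keep enough of the signal.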

Stemming could also be applied, but it took too long to run:

In [61]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize, sent_tokenize


# build the stemmer once instead of once per word
stemmer = SnowballStemmer("english", ignore_stopwords=True)

stemmed_train_text = [" ".join(stemmer.stem(word)
        for sent in sent_tokenize(message)
        for word in word_tokenize(sent))
        for message in train.Text]

stemmed_test_text = [" ".join(stemmer.stem(word)
        for sent in sent_tokenize(message)
        for word in word_tokenize(sent))
        for message in test.Text]

KeyboardInterrupt: 
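
One possible way to cut the stemming time is to stem each distinct word only once and reuse the result, since reviews repeat the same words many times. The sketch below shows the caching idea with `functools.lru_cache`. Note that `toy_stem` is a hypothetical stand-in for `SnowballStemmer.stem` so the example runs without NLTK data, and I have not timed this on the full dataset.

```python
from functools import lru_cache

def toy_stem(word):
    """Hypothetical stand-in for SnowballStemmer.stem: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

@lru_cache(maxsize=None)
def cached_stem(word):
    # Each distinct word is stemmed once; repeats are served from the cache.
    return toy_stem(word)

def stem_message(message):
    # Whitespace split keeps the sketch dependency-free; the cell above uses NLTK tokenizers.
    return " ".join(cached_stem(w) for w in message.lower().split())

reviews = ["baked breads baked", "baking breads"]
stemmed = [stem_message(r) for r in reviews]
print(stemmed)
print(cached_stem.cache_info())
```

With a real stemmer, `cached_stem` would simply wrap the `stem` method of a single stemmer object created once.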