### Background

At Shopee, we strive to maximise customer satisfaction. For every product sold on Shopee, we aim to provide the best user experience from product search to delivery, including product packaging and product quality. Once a product is delivered, we encourage customers to rate it and describe their overall experience on the product landing page.

The ratings and comments that buyers leave on a product are very important to us. These reviews help us understand our customers' needs and quickly adapt our services to provide a better experience on their next order. Comments cover aspects such as delivery service, product packaging, product quality, product specifications, and payment method. It is therefore important to build an accurate system for understanding these reviews, because it has a large impact on Shopee’s overall user experience. This system is called the "Shopee Product Review Sentiment Analyser".

### Task

In this competition, the task is to build a multi-class sentiment classification model for product reviews. There are ~150k product reviews from different categories, including electronics, furniture, home & living products such as air conditioners, and fashion products such as T-shirts and rings. For data security purposes, the review IDs are desensitized. The evaluation metric is top-1 accuracy.
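
As a minimal sketch of the metric named above: top-1 accuracy is the fraction of reviews whose single predicted rating equals the true rating. The function name `top1_accuracy` and the example ratings are hypothetical and are not taken from the competition data.

```python
def top1_accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted rating equals the true rating."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty prediction set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical ratings, for illustration only (not from the dataset).
example_true = [1, 3, 4, 5, 2]
example_pred = [1, 3, 5, 5, 3]
example_score = top1_accuracy(example_true, example_pred)
print(example_score)
```

For a classifier that outputs one label per review, this is the same quantity that a scikit-learn classifier's `score` method reports.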

In [1]:
# importing libraries
import numpy as np
import random
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


# NLP
import re
from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from nltk.tokenize import WhitespaceTokenizer 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import train_test_split

# Classifier libraries
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [2]:
df_train = pd.read_csv('train.csv',index_col=0)
df_test = pd.read_csv('test.csv',index_col=0)

In [3]:
df_train.head()

review_id,review,rating
0,Ga disappointed neat products .. Meletot Hilsn...,1
1,"Rdtanya replace broken glass, broken chargernya",1
2,Nyesel bngt dsni shopping antecedent photo mes...,1
3,Sent a light blue suit goods ga want a refund,1
4,Pendants came with dents and scratches on its ...,1


In [4]:
df_test.head()

review_id,review
1,"Great danger, cool, motif and cantik2 jg model..."
2,One of the shades don't fit well
3,Very comfortable
4,Fast delivery. Product expiry is on Dec 2022. ...
5,it's sooooo cute! i like playing with the glit...


In [5]:
df_train.isnull().sum()

review    0
rating    0
dtype: int64

In [6]:
df_test.isnull().sum()

review    0
dtype: int64

# EDA

In [7]:
df_train['rating'].value_counts(normalize=True)

4    0.285163
5    0.282779
3    0.244811
1    0.100708
2    0.086540
Name: rating, dtype: float64

In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146811 entries, 0 to 146810
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   review  146811 non-null  object
 1   rating  146811 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.4+ MB


# Preprocessing

In [None]:
def stem_text(raw_text):
    
    # Remove HTML tags (name the parser explicitly to avoid bs4's parser-guessing warning)
    review_text = BeautifulSoup(raw_text, "html.parser").get_text()
    
    # Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    
    # Convert words to lower case and split each word up
    words = letters_only.lower().split()
    
    # Convert stopwords to a set
    stops = set(stopwords.words('english'))
    
    # Adding stopwords that occurred frequently across all ratings
    stops.update(['good','product','quality','delivery','delivering'])
    
    # Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    
    # Instantiate PorterStemmer
    p_stemmer = PorterStemmer()
    
    # Stem words
    meaningful_words = [p_stemmer.stem(w) for w in meaningful_words]
    
    # Join words back into one string
    return(" ".join(meaningful_words))

In [None]:
# Pre-process raw text
df_train['review_clean'] = df_train['review'].map(stem_text)

In [None]:
df_test['review_clean'] = df_test['review'].map(stem_text)

In [None]:
df_test.head()

In [None]:
df_train.head()

# Vader

In [None]:
#instantiating Vader
vader = SentimentIntensityAnalyzer()

In [None]:
#creating functions to pull the scores
def neg_score(text):
    score = vader.polarity_scores(text)
    return score['neg']

def neu_score(text):
    score = vader.polarity_scores(text)
    return score['neu']
    
def positive_score(text):
    score = vader.polarity_scores(text)
    return score['pos']
    
def compound_score(text):
    score = vader.polarity_scores(text)
    return score['compound']    
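
The four helpers above each call `polarity_scores`, so every review is scored four times when all four columns are built. A minimal sketch of scoring each text once and splitting the result into columns follows. `score_once` and `toy_polarity_scores` are hypothetical names; the toy scorer is only a stand-in so the sketch runs without vaderSentiment installed, and it does not reproduce VADER's scores.

```python
def score_once(texts, polarity_scores):
    """Call the scorer once per text and collect each key into its own column."""
    columns = {"neg": [], "neu": [], "pos": [], "compound": []}
    for text in texts:
        scores = polarity_scores(text)  # one call per review instead of four
        for key in columns:
            columns[key].append(scores[key])
    return columns


def toy_polarity_scores(text):
    """Hypothetical stand-in scorer; returns the same four keys as the helpers above use."""
    words = text.lower().split()
    neg = sum(w in {"broken", "bad"} for w in words)
    pos = sum(w in {"great", "cute"} for w in words)
    total = max(len(words), 1)
    return {
        "neg": neg / total,
        "neu": (len(words) - neg - pos) / total,
        "pos": pos / total,
        "compound": (pos - neg) / total,
    }


columns = score_once(["great cute bag", "broken glass"], toy_polarity_scores)
print(columns["compound"])
```

In the notebook, `vader.polarity_scores` could be passed as the second argument, and the returned dictionary of lists could then be turned into DataFrame columns.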

In [None]:
# # Commenting out due to long processing time
# # adding the vader sentiment to the dataframe
# df_train['vader_negative'] = df_train['review'].apply(neg_score)
# df_train['vader_neutral'] = df_train['review'].apply(neu_score)
# df_train['vader_positive'] = df_train['review'].apply(positive_score)
# df_train['vader_compound'] = df_train['review'].apply(compound_score)

# df_train.to_csv('df_train.csv')

In [15]:
df_train = pd.read_csv('df_train.csv', index_col=0)

In [16]:
df_train.head()

review_id,review,rating,review_clean,vader_negative,vader_neutral,vader_positive,vader_compound
0,Ga disappointed neat products .. Meletot Hilsn...,1,ga disappoint neat product meletot hilsnyaa speed,0.172,0.5,0.328,0.4215
1,"Rdtanya replace broken glass, broken chargernya",1,rdtanya replac broken glass broken chargernya,0.608,0.392,0.0,-0.7351
2,Nyesel bngt dsni shopping antecedent photo mes...,1,nyesel bngt dsni shop anteced photo messag pic...,0.0,0.82,0.18,0.8834
3,Sent a light blue suit goods ga want a refund,1,sent light blue suit good ga want refund,0.0,0.874,0.126,0.0772
4,Pendants came with dents and scratches on its ...,1,pendant came dent scratch surfac coat look lik...,0.0,0.872,0.128,0.3612


In [17]:
df_train.groupby('rating')['vader_compound'].mean()

rating
1   -0.155839
2    0.132514
3    0.453286
4    0.635983
5    0.639142
Name: vader_compound, dtype: float64

In [18]:
# # Commenting out due to long processing time
# # adding the vader sentiment to the dataframe
# df_test['vader_negative'] = df_test['review'].apply(neg_score)
# df_test['vader_neutral'] = df_test['review'].apply(neu_score)
# df_test['vader_positive'] = df_test['review'].apply(positive_score)
# df_test['vader_compound'] = df_test['review'].apply(compound_score)

# df_test.to_csv('df_test.csv')

In [19]:
df_test = pd.read_csv('df_test.csv', index_col=0)

In [20]:
df_test.head()

review_id,review,review_clean,vader_negative,vader_neutral,vader_positive,vader_compound
1,"Great danger, cool, motif and cantik2 jg model...",great danger cool motif cantik jg model cepet ...,0.106,0.562,0.331,0.7357
2,One of the shades don't fit well,one shade fit well,0.44,0.56,0.0,-0.4449
3,Very comfortable,comfort,0.0,0.218,0.782,0.5563
4,Fast delivery. Product expiry is on Dec 2022. ...,fast expiri dec wrap properli damag item,0.0,0.851,0.149,0.3875
5,it's sooooo cute! i like playing with the glit...,sooooo cute like play glitter better brows pho...,0.0,0.548,0.452,0.9823


In [21]:
df_test.groupby('vader_compound').mean()

vader_compound,vader_negative,vader_neutral,vader_positive
-0.9999,0.731,0.265,0.004
-0.9995,0.872,0.007,0.121
-0.9988,0.71,0.229,0.061
-0.9982,0.602,0.398,0.0
-0.998,0.702,0.298,0.0
-0.9968,0.909,0.052,0.039
-0.9948,0.576,0.346,0.078
-0.9943,0.459,0.479,0.062
-0.994,0.529,0.375,0.096
-0.9937,0.704,0.162,0.135


# Train validate split

In [22]:
df_train = df_train[df_train['review_clean'].notnull()]

In [23]:
df_test.isnull().sum()

review              0
review_clean      123
vader_negative      0
vader_neutral       0
vader_positive      0
vader_compound      0
dtype: int64

In [24]:
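# Only review_clean has missing values; fill with an empty string so the column stays text
df_test['review_clean'] = df_test['review_clean'].fillna('')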

In [25]:
df_test.isnull().sum()

review            0
review_clean      0
vader_negative    0
vader_neutral     0
vader_positive    0
vader_compound    0
dtype: int64

In [26]:
X = df_train.drop('rating', axis=1)
y = df_train['rating']

In [27]:
# Holdout data split
X_train, X_validate, y_train, y_validate = train_test_split(X,y,stratify=y, random_state=42, test_size=0.2)

In [28]:
print(X_train.shape)
print(X_validate.shape)
print(y_train.shape)
print(y_validate.shape)

(115244, 6)
(28812, 6)
(115244,)
(28812,)


In [29]:
# Reassign instead of dropping in place to avoid pandas' SettingWithCopyWarning
X_train = X_train.drop(['review'], axis=1)
X_validate = X_validate.drop(['review'], axis=1)


In [30]:
X_train.head()

review_id,review_clean,vader_negative,vader_neutral,vader_positive,vader_compound
66074,great servic cma good lost courier final refun...,0.119,0.67,0.211,0.4215
111969,cat like eat lot cool work wait eat hous locat...,0.0,0.764,0.236,0.9543
60740,consum may love ya ku,0.0,0.488,0.512,0.6369
93412,alhamdulillah mantab good kuuh boss joss tenant,0.0,1.0,0.0,0.0
127795,receiv yet tri,0.0,1.0,0.0,0.0


In [31]:
X_validate.head()

review_id,review_clean,vader_negative,vader_neutral,vader_positive,vader_compound
88684,awesom awesom good ship speed,0.0,0.379,0.621,0.8481
52120,bagussss udh x orderrr bnyakk sllu cma somewha...,0.0,0.556,0.444,0.9325
19012,bag tweezer tightli pattern beauti easi slip s...,0.0,0.527,0.473,0.9127
3129,rotten fish ined piti money,0.524,0.476,0.0,-0.6705
100278,uda langganan pengiriman cpt n neat,0.0,0.625,0.375,0.4588


In [32]:
y_train

review_id
66074     4
111969    5
60740     3
93412     4
127795    5
139666    5
32194     3
86571     4
77837     4
65735     4
90785     4
48258     3
70036     4
26110     2
32427     3
93033     4
41861     3
74334     4
122931    5
23336     2
382       1
42412     3
26787     2
59258     3
70676     4
134333    5
85598     4
37596     3
32764     3
138127    5
94274     4
61601     3
73920     4
89399     4
18798     2
28510     3
90435     4
5323      1
98323     4
44543     3
5890      1
48946     3
53525     3
103266    4
98427     4
6215      1
136617    5
132941    5
45682     3
129031    5
73474     4
62740     3
23490     2
67251     4
92354     4
77960     4
128933    5
104484    4
54443     3
48516     3
40752     3
42118     3
86159     4
77814     4
24981     2
120402    5
86954     4
74255     4
90520     4
12010     1
31230     3
35380     3
95880     4
49978     3
111364    5
84266     4
35939     3
9313      1
21370     2
5485      1
72168     4
40667     3
...

# LS

# XGBoost

## XGB

In [None]:
pipe_xgb = Pipeline([
    ('smote', SMOTE()),
    ('ss', StandardScaler()),
    ('xgb', XGBClassifier())
])

pipe_xgb_params = {
    'smote__sampling_strategy': ['auto','minority'],
    'smote__k_neighbors': [5], #[3, 5, 7]
    'xgb__max_depth': [13],
    'xgb__learning_rate' : [0.1],
    'xgb__n_estimators' : [150],
    'xgb__objective' : ['multi:softprob'], # five rating classes, so a multi-class objective
    'xgb__gamma': [0.5], #1
    'xgb__min_child_weight': [1], #[1,5]
    'xgb__subsample': [1.0],
    'xgb__colsample_bytree': [1.0],
    'xgb__random_state': [42]
}

gs_xgb = GridSearchCV(pipe_xgb,pipe_xgb_params,cv=5,verbose=1)
gs_xgb.fit(X_train.drop(['review_clean'],axis=1), np.ravel(y_train))

In [None]:
# best parameters for gs
gs_xgb.best_params_

In [None]:
# gs_xgb score of training data
gs_xgb.score(X_train.drop(['review_clean'],axis=1), y_train)

In [None]:
# gs_xgb score of validation data
gs_xgb.score(X_validate.drop(['review_clean'],axis=1), y_validate)

In [None]:
# prediction of gs_xgb model on validation data
predict_xgb = gs_xgb.predict(X_validate.drop(['review_clean'],axis=1))

In [None]:
# confusion matrix (scikit-learn expects y_true first, then y_pred)
confusion_matrix(y_validate, predict_xgb)
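
scikit-learn's `confusion_matrix` takes the true labels as its first argument and places them on the rows, with predictions on the columns. The following is a minimal pure-Python sketch of that layout and of the per-class recall that can be read from it. The helper names and the example ratings are hypothetical and are not results from this notebook.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where row i is the true label and column j the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_recall(matrix):
    """Recall per class: diagonal count divided by the row total (true count)."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


# Hypothetical ratings, for illustration only.
labels = [1, 2, 3, 4, 5]
y_true_demo = [1, 1, 2, 3, 4, 4, 5, 5]
y_pred_demo = [1, 2, 2, 3, 4, 5, 5, 5]
matrix = confusion_counts(y_true_demo, y_pred_demo, labels)
recalls = per_class_recall(matrix)
print(matrix)
print(recalls)
```

Reading recall per rating is useful here because ratings 1 and 2 are the smallest classes, so overall accuracy can hide weak performance on them.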

## CVEC->XGB

In [None]:
#Constructing a pipeline for countvectorizer
pipe_cvec = Pipeline([
    ('cvec', CountVectorizer()),
    ('xgb', XGBClassifier())
])

#Pipe parameters for cvec
pipe_cvec_params = {
    'cvec__max_features':[25], #10
    'cvec__min_df':[2], #3
    'cvec__max_df':[0.85], #0.95
    'cvec__ngram_range':[(1,1)] # (1,2)
}

#Gridsearching 
gs_cvec = GridSearchCV(pipe_cvec,pipe_cvec_params,cv=5,verbose=1)

#fitting the initial pipeline to get the best params for CVEC
gs_cvec.fit(X_train['review_clean'], np.ravel(y_train))

#Best params for CVEC
best_params = gs_cvec.best_params_
print('Best params are: ',best_params)

#Instantiating the CVEC with the best params
cvec= CountVectorizer(max_features=best_params['cvec__max_features'],
                     min_df=best_params['cvec__min_df'],
                     max_df=best_params['cvec__max_df'],
                     ngram_range=best_params['cvec__ngram_range'],
                     stop_words='english')

#transforming the sparse matrix into a dataframe and merging it back
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
train_sparse = pd.DataFrame(cvec.fit_transform(X_train['review_clean']).toarray(),
                            index=X_train.index, 
                            columns=cvec.get_feature_names_out())

validate_sparse = pd.DataFrame(cvec.transform(X_validate['review_clean']).toarray(), 
                               index=X_validate.index,
                               columns=cvec.get_feature_names_out())

#Final dataframe for modelling
X_train_cvec = X_train.drop(['review_clean'], axis=1).join(train_sparse)
X_validate_cvec = X_validate.drop(['review_clean'], axis=1).join(validate_sparse)

In [None]:
pipe_xgb = Pipeline([
    ('smote', SMOTE()),
    ('ss', StandardScaler()),
    ('xgb', XGBClassifier())
])

pipe_xgb_params = {
    'smote__sampling_strategy': ['auto','minority'],
    'smote__k_neighbors': [3], #[3, 5, 7]
    'xgb__max_depth': [13], #[5,9,13]
    'xgb__learning_rate' : [0.1],
    'xgb__n_estimators' : [150],
    'xgb__objective' : ['multi:softprob'], # five rating classes, so a multi-class objective
    'xgb__gamma': [0.5], #1
    'xgb__min_child_weight': [5], #[1,5]
    'xgb__subsample': [1.0],
    'xgb__colsample_bytree': [1.0],
    'xgb__random_state': [42]
}

gs_1 = GridSearchCV(pipe_xgb,pipe_xgb_params,cv=5,verbose=1)
gs_1.fit(X_train_cvec, np.ravel(y_train))

In [None]:
# best parameters for gs_1
gs_1.best_params_

In [None]:
# gs_1 score of training data
gs_1.score(X_train_cvec, y_train)

In [None]:
# gs_1 score of validation data
gs_1.score(X_validate_cvec, y_validate)

In [None]:
# prediction of gs_1 model on validation data
predict_1 = gs_1.predict(X_validate_cvec)

In [None]:
# confusion matrix (scikit-learn expects y_true first, then y_pred)
confusion_matrix(y_validate, predict_1)

## TFIDF->XGB

In [None]:
#Constructing a pipeline for tfidf
pipe_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('xgb', XGBClassifier())
])

#Pipe parameters for tfidf
pipe_tfidf_params = {
    'tfidf__max_features':[25], #10
    'tfidf__min_df':[2,3], #3
    'tfidf__max_df':[0.85], #0.95
    'tfidf__ngram_range':[(1,1)],
}

#Gridsearching 
gs_tfidf = GridSearchCV(pipe_tfidf,pipe_tfidf_params,cv=5,verbose=1)

#fitting the initial pipeline to get the best params for tfidf
gs_tfidf.fit(X_train['review_clean'], np.ravel(y_train))

#Best params for tfidf
best_params = gs_tfidf.best_params_
print('Best params are: ',best_params)

#Instantiating the tfidf with the best params
tfidf= TfidfVectorizer(max_features=best_params['tfidf__max_features'],
                     min_df=best_params['tfidf__min_df'],
                     max_df=best_params['tfidf__max_df'],
                     ngram_range=best_params['tfidf__ngram_range'],
                     stop_words='english')

#transforming the sparse matrix into a dataframe and merging it back
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
train_sparse = pd.DataFrame(tfidf.fit_transform(X_train['review_clean']).toarray(),
                            index=X_train.index, 
                            columns=tfidf.get_feature_names_out())

validate_sparse = pd.DataFrame(tfidf.transform(X_validate['review_clean']).toarray(), 
                               index=X_validate.index,
                               columns=tfidf.get_feature_names_out())

#Final dataframe for modelling
X_train_tfidf = X_train.drop(['review_clean'], axis=1).join(train_sparse)
X_validate_tfidf = X_validate.drop(['review_clean'], axis=1).join(validate_sparse)
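
To make the TF-IDF features above more concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn documents as the `TfidfVectorizer` default: raw term count times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf_rows` and the three example documents are hypothetical, and whitespace splitting is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows), where each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in documents]  # simplification: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Hypothetical cleaned reviews, for illustration only.
vocabulary, rows = tfidf_rows(["broken glass broken", "fast wrap", "broken item"])
print(vocabulary)
print(rows[1])
```

A term that appears in fewer documents gets a larger IDF, so in the third example document `item` ends up with more weight than `broken`.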

In [None]:
pipe_xgb = Pipeline([
    ('smote', SMOTE()),
    ('ss', StandardScaler()),
    ('xgb', XGBClassifier())
])

pipe_xgb_params = {
    'smote__sampling_strategy': ['minority'],
    'smote__k_neighbors': [7], #[3, 5, 7]
    'xgb__max_depth': [13],
    'xgb__learning_rate' : [0.1],
    'xgb__n_estimators' : [150],
    'xgb__objective' : ['multi:softprob'], # five rating classes, so a multi-class objective
    'xgb__gamma': [1], #0.5
    'xgb__min_child_weight': [1], #[1,5]
    'xgb__subsample': [1.0],
    'xgb__colsample_bytree': [1.0],
    'xgb__random_state': [42]
}

gs_2 = GridSearchCV(pipe_xgb,pipe_xgb_params,cv=5,verbose=2)
gs_2.fit(X_train_tfidf, np.ravel(y_train))

In [None]:
# best parameters for gs_2
gs_2.best_params_

In [None]:
# gs_2 score of training data
gs_2.score(X_train_tfidf, y_train)

In [None]:
# gs_2 score of validation data
gs_2.score(X_validate_tfidf, y_validate)

In [None]:
# prediction of gs_2 model on validation data
predict_2 = gs_2.predict(X_validate_tfidf)

In [None]:
# confusion matrix (scikit-learn expects y_true first, then y_pred)
confusion_matrix(y_validate, predict_2)

# Best model

In [33]:
df_train.head()

review_id,review,rating,review_clean,vader_negative,vader_neutral,vader_positive,vader_compound
0,Ga disappointed neat products .. Meletot Hilsn...,1,ga disappoint neat product meletot hilsnyaa speed,0.172,0.5,0.328,0.4215
1,"Rdtanya replace broken glass, broken chargernya",1,rdtanya replac broken glass broken chargernya,0.608,0.392,0.0,-0.7351
2,Nyesel bngt dsni shopping antecedent photo mes...,1,nyesel bngt dsni shop anteced photo messag pic...,0.0,0.82,0.18,0.8834
3,Sent a light blue suit goods ga want a refund,1,sent light blue suit good ga want refund,0.0,0.874,0.126,0.0772
4,Pendants came with dents and scratches on its ...,1,pendant came dent scratch surfac coat look lik...,0.0,0.872,0.128,0.3612


In [34]:
df_test.head()

review_id,review,review_clean,vader_negative,vader_neutral,vader_positive,vader_compound
1,"Great danger, cool, motif and cantik2 jg model...",great danger cool motif cantik jg model cepet ...,0.106,0.562,0.331,0.7357
2,One of the shades don't fit well,one shade fit well,0.44,0.56,0.0,-0.4449
3,Very comfortable,comfort,0.0,0.218,0.782,0.5563
4,Fast delivery. Product expiry is on Dec 2022. ...,fast expiri dec wrap properli damag item,0.0,0.851,0.149,0.3875
5,it's sooooo cute! i like playing with the glit...,sooooo cute like play glitter better brows pho...,0.0,0.548,0.452,0.9823


In [35]:
Y = df_train['rating']
X = df_train.drop(['review', 'rating'],axis=1)
test = df_test.drop(['review'], axis=1)

In [36]:
test.head()

review_id,review_clean,vader_negative,vader_neutral,vader_positive,vader_compound
1,great danger cool motif cantik jg model cepet ...,0.106,0.562,0.331,0.7357
2,one shade fit well,0.44,0.56,0.0,-0.4449
3,comfort,0.0,0.218,0.782,0.5563
4,fast expiri dec wrap properli damag item,0.0,0.851,0.149,0.3875
5,sooooo cute like play glitter better brows pho...,0.0,0.548,0.452,0.9823


In [37]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60427 entries, 1 to 60427
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   review_clean    60427 non-null  object 
 1   vader_negative  60427 non-null  float64
 2   vader_neutral   60427 non-null  float64
 3   vader_positive  60427 non-null  float64
 4   vader_compound  60427 non-null  float64
dtypes: float64(4), object(1)
memory usage: 2.8+ MB


In [38]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144056 entries, 0 to 146810
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   review_clean    144056 non-null  object 
 1   vader_negative  144056 non-null  float64
 2   vader_neutral   144056 non-null  float64
 3   vader_positive  144056 non-null  float64
 4   vader_compound  144056 non-null  float64
dtypes: float64(4), object(1)
memory usage: 6.6+ MB


## LS

In [None]:
pipe_ls = Pipeline([
    ('smote', SMOTE()),
    ('ls', LogisticRegression())
])

pipe_ls_params = {
    'smote__sampling_strategy': ['auto'],
    'smote__k_neighbors': [3], #[3, 5, 7]
    'ls__penalty': ['l2'],
    'ls__C': [1] #.1

}

gs_ls = GridSearchCV(pipe_ls,pipe_ls_params,cv=5,verbose=2)
gs_ls.fit(X.drop(['review_clean'], axis=1), np.ravel(Y))

In [None]:
# prediction of gs_ls model on test data
predict_ls = gs_ls.predict(test.drop(['review_clean'], axis=1))

In [None]:
predict_ls

In [None]:
# Build the submission in a separate frame so `test` keeps its feature columns for later cells
submission = pd.DataFrame({'rating': predict_ls}, index=test.index)

submission.to_csv('submission_ls.csv')

## Tfidf LS

In [39]:
#Constructing a pipeline for tfidf
pipe_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('ls', LogisticRegression())
])

#Pipe parameters for tfidf
pipe_tfidf_params = {
    'tfidf__max_features':[1000], #10
    'tfidf__min_df':[5], #3
    'tfidf__max_df':[0.85], #0.95
    'tfidf__ngram_range':[(1,1)],
}

#Gridsearching 
gs_tfidf = GridSearchCV(pipe_tfidf,pipe_tfidf_params,cv=5,verbose=1)

#fitting the initial pipeline to get the best params for tfidf
gs_tfidf.fit(X_train['review_clean'], np.ravel(y_train))

#Best params for tfidf
best_params = gs_tfidf.best_params_
print('Best params are: ',best_params)

#Instantiating the tfidf with the best params
tfidf= TfidfVectorizer(max_features=best_params['tfidf__max_features'],
                     min_df=best_params['tfidf__min_df'],
                     max_df=best_params['tfidf__max_df'],
                     ngram_range=best_params['tfidf__ngram_range'],
                     stop_words='english')

#transforming the sparse matrix into a dataframe and merging it back
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
train_sparse = pd.DataFrame(tfidf.fit_transform(X['review_clean']).toarray(),
                            index=X.index, 
                            columns=tfidf.get_feature_names_out())

test_sparse = pd.DataFrame(tfidf.transform(test['review_clean'].astype('str')).toarray(), 
                               index=test.index,
                               columns=tfidf.get_feature_names_out())

#Final dataframe for modelling
X_tfidf = X.drop(['review_clean'], axis=1).join(train_sparse)
X_test_tfidf = test.drop(['review_clean'], axis=1).join(test_sparse)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Best params are:  {'tfidf__max_df': 0.85, 'tfidf__max_features': 1000, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 1)}


In [40]:
pipe_ls = Pipeline([
    ('smote', SMOTE()),
    ('ss', StandardScaler()),
    ('ls', LogisticRegression())
])

pipe_ls_params = {
    'smote__sampling_strategy': ['auto'],
    'smote__k_neighbors': [3], #[3, 5, 7]
    'ls__penalty': ['l2'],
    'ls__C': [1], #.1
    'ls__max_iter':[1000]

}

gs_ls = GridSearchCV(pipe_ls,pipe_ls_params,cv=5,verbose=2)
gs_ls.fit(X_tfidf, np.ravel(Y))

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto, total=28.3min
[CV] ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 28.3min remaining:    0.0s


[CV]  ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto, total=28.5min
[CV] ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto 
[CV]  ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto, total=27.5min
[CV] ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto 
[CV]  ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto, total=27.8min
[CV] ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto 
[CV]  ls__C=1, ls__max_iter=1000, ls__penalty=l2, smote__k_neighbors=3, smote__sampling_strategy=auto, total=27.8min


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 139.9min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('smote', SMOTE()),
                                       ('ss', StandardScaler()),
                                       ('ls', LogisticRegression())]),
             param_grid={'ls__C': [1], 'ls__max_iter': [1000],
                         'ls__penalty': ['l2'], 'smote__k_neighbors': [3],
                         'smote__sampling_strategy': ['auto']},
             verbose=2)

In [41]:
# prediction of gs_ls model on test data
predict_ls = gs_ls.predict(X_test_tfidf)

In [42]:
# Build the submission in a separate frame so `test` keeps its feature columns for later cells
submission = pd.DataFrame({'rating': predict_ls}, index=test.index)

submission.to_csv('submission_ls_xgb.csv')

## CVEC XGB

In [None]:
#Constructing a pipeline for countvectorizer
pipe_cvec = Pipeline([
    ('cvec', CountVectorizer()),
    ('xgb', XGBClassifier())
])

#Pipe parameters for cvec
pipe_cvec_params = {
    'cvec__max_features':[25], #10
    'cvec__min_df':[2], #3
    'cvec__max_df':[0.85], #0.95
    'cvec__ngram_range':[(1,1)] # (1,2)
}

#Gridsearching 
gs_cvec = GridSearchCV(pipe_cvec,pipe_cvec_params,cv=5,verbose=1)

#fitting the initial pipeline to get the best params for CVEC
gs_cvec.fit(X['review_clean'], np.ravel(Y))

#Best params for CVEC
best_params = gs_cvec.best_params_
print('Best params are: ',best_params)

#Instantiating the CVEC with the best params
cvec= CountVectorizer(max_features=best_params['cvec__max_features'],
                     min_df=best_params['cvec__min_df'],
                     max_df=best_params['cvec__max_df'],
                     ngram_range=best_params['cvec__ngram_range'],
                     stop_words='english')

#transforming the sparse matrix into a dataframe and merging it back
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
train_sparse = pd.DataFrame(cvec.fit_transform(X['review_clean']).toarray(),
                            index=X.index, 
                            columns=cvec.get_feature_names_out())

test_sparse = pd.DataFrame(cvec.transform(test['review_clean'].astype('str')).toarray(), 
                               index=test.index,
                               columns=cvec.get_feature_names_out())

#Final dataframe for modelling
X_cvec = X.drop(['review_clean'], axis=1).join(train_sparse)
X_test_cvec = test.drop(['review_clean'], axis=1).join(test_sparse)

In [None]:
pipe_xgb = Pipeline([
    ('smote', SMOTE()),
    ('ss', StandardScaler()),
    ('xgb', XGBClassifier())
])

pipe_xgb_params = {
    'smote__sampling_strategy': ['minority'],
    'smote__k_neighbors': [3], #[3, 5, 7]
    'xgb__max_depth': [7],
    'xgb__learning_rate' : [0.1],
    'xgb__n_estimators' : [150],
    'xgb__objective' : ['multi:softprob'], # five rating classes, so a multi-class objective
    'xgb__gamma': [0.5], #1
    'xgb__min_child_weight': [1], #[1,5]
    'xgb__subsample': [1.0],
    'xgb__colsample_bytree': [1.0],
    'xgb__random_state': [42]
}

gs_best = GridSearchCV(pipe_xgb,pipe_xgb_params,cv=5,verbose=2)
gs_best.fit(X_cvec, np.ravel(Y))

In [None]:
gs_best.score(X_cvec,Y)

In [None]:
# prediction of gs_best model on test data
predict_best = gs_best.predict(X_test_cvec)

In [None]:
predict_best

In [None]:
# Build the submission in a separate frame so `test` keeps its feature columns
submission = pd.DataFrame({'rating': predict_best}, index=test.index)

submission.to_csv('submission.csv')

# BERT

In [4]:
submission_bert = pd.read_csv('submission_bert.csv', index_col=0)

In [5]:
submission_bert.head()

Unnamed: 0,0
0,[[-3.8078973 -1.5928369 1.8021883 1.58582...
1,[[-4.2543488 -2.2719142 0.7868131 2.54184...
2,[[ 1.4216733e+00 1.9333155e+00 8.8708407e-01...
3,[[-3.9170368e+00 -2.0339422e+00 1.8974890e+00...
4,[[ 1.9015703 0.5106638 -0.60099244 -0.49058...
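
The column above holds each review's raw BERT outputs as a bracketed string rather than a rating. A minimal sketch of turning one such string into a 1-5 rating is given below. It assumes that each row contains five logits and that position 0 corresponds to a 1-star rating; neither assumption is confirmed by the file shown here. The function name `logits_to_rating` and the example string are hypothetical, because the displayed values are truncated.

```python
def logits_to_rating(raw):
    """Parse a string such as '[[-3.8 -1.6 1.8 1.6 0.2]]' and return argmax + 1."""
    cleaned = raw.replace("[", " ").replace("]", " ")
    values = [float(token) for token in cleaned.split()]
    if not values:
        raise ValueError("no logits found in %r" % raw)
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1  # assumption: index 0 is the 1-star class


# Hypothetical logits string, for illustration only.
demo_raw = "[[-3.81 -1.59 1.80 1.59 0.25]]"
demo_rating = logits_to_rating(demo_raw)
print(demo_rating)
```

Applied to the whole column, for example with `submission_bert['0'].map(logits_to_rating)`, this would give one predicted rating per review under the same assumptions.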
