# Problem Statement 

Sentiment analysis can improve a recommendation system. A recommendation algorithm on its own predicts items from a user's past behaviour, but the recommended items may not be well liked by other users. By adding sentiment analysis, we can recommend products based on how other users have perceived them.

This notebook focuses on building a sentiment prediction model using various Machine Learning Algorithms.
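
As a rough illustration of how predicted sentiment could feed back into recommendations, the sketch below re-orders a list of candidate products by their share of positive reviews. The function name `rerank_by_sentiment`, its inputs, and the toy data are hypothetical and are not part of this notebook's pipeline.

```python
from collections import defaultdict

def rerank_by_sentiment(candidates, review_sentiments, top_k=5):
    """Order candidate products by their share of positive reviews.

    candidates: product ids proposed by a recommender (hypothetical input).
    review_sentiments: iterable of (product_id, label) pairs, label 1 = positive.
    """
    positive = defaultdict(int)
    total = defaultdict(int)
    for product, label in review_sentiments:
        total[product] += 1
        positive[product] += int(label == 1)

    def positive_share(product):
        # Products without any review get a share of 0.0.
        return positive[product] / total[product] if total[product] else 0.0

    # sorted() is stable, so ties keep the recommender's original order.
    return sorted(candidates, key=positive_share, reverse=True)[:top_k]

# Toy data, made up for illustration only.
candidates = ["gel", "album", "snack"]
reviews = [("album", 1), ("album", 1), ("gel", 0), ("gel", 0), ("snack", 1), ("snack", 0)]
top = rerank_by_sentiment(candidates, reviews, top_k=2)
print(top)
```

In practice the labels would come from the sentiment model trained in the rest of this notebook rather than from hand-written pairs.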

In [23]:
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# import the module itself: later cells call models.NaiveBayes(), models.LRClassification(), ...
import models

In [None]:
import importlib.util

def module_from_file(module_name, file_path):
    """Load a Python module from an explicit file path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

In [None]:
models = module_from_file("models","models.py")

In [4]:
#!pip install xgboost

In [6]:
import pandas as pd
df = pd.read_csv("pre_process_data.csv")
df.head()

Unnamed: 0,lemmatized_review,user_sentiment
0,love album good hip hop current pop sound hype...,1
1,good flavor review collect promotion,1
2,good flavor,1
3,read review look buy couple lubricant ultimate...,0
4,husband buy gel gel cause irritation feel like...,0


In [7]:
df.dropna(inplace=True)

In [8]:
X=df['lemmatized_review']
y=df['user_sentiment']

In [9]:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=42)

In [10]:
vec = CountVectorizer(stop_words='english')

In [11]:
# fit the vocabulary on the training split only, then transform both splits to a BOW representation
X_train=vec.fit_transform(X_train).toarray()
X_test=vec.transform(X_test).toarray()
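
Calling `.toarray()` turns the vectorizer output into a dense matrix, which can use a lot of memory for a large vocabulary. Many scikit-learn estimators accept SciPy sparse matrices directly. The sketch below shows this on a toy corpus with `MultinomialNB`; whether the wrappers in `models.py` accept sparse input is not shown in this notebook, so treat it as an assumption to verify.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, made up for illustration only.
train_docs = ["good flavor", "love album good sound",
              "gel cause irritation", "bad lubricant irritation"]
train_labels = [1, 1, 0, 0]
test_docs = ["good album", "irritation gel"]

toy_vec = CountVectorizer(stop_words='english')
X_train_sparse = toy_vec.fit_transform(train_docs)  # SciPy sparse matrix, no .toarray()
X_test_sparse = toy_vec.transform(test_docs)

# MultinomialNB is used here as a stand-in classifier that accepts sparse input.
clf = MultinomialNB().fit(X_train_sparse, train_labels)
predictions = clf.predict(X_test_sparse)
print(sparse.issparse(X_train_sparse), predictions)
```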

## Training the model using BOW Representation
### Naive Bayes

In [12]:
# training naive bayes without hyperparameter tuning
nb = models.NaiveBayes()
naive_bayes,metrics=nb.train_model_without_hp(X_train,y_train,X_test,y_test)

2023-02-08 18:39:09,522 - root - INFO - Training the model without hyperparameter tuning
2023-02-08 18:39:23,538 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=18, tm_min=39, tm_sec=23, tm_wday=2, tm_yday=39, tm_isdst=0)


In [13]:
model_performance={}
model_performance['naive_bayes_bow_without_hp']=metrics

In [14]:
# training naive bayes with hyperparameter tuning
naive_bayes_hp,metrics=nb.train_model_with_hp(X_train,y_train,X_test,y_test)

2023-02-08 18:39:41,231 - root - INFO - Started training naive bayes with hyperparameter tuning
2023-02-08 18:41:22,624 - root - INFO - Best params {'alpha': 1e-07} 
2023-02-08 18:41:31,396 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=18, tm_min=41, tm_sec=31, tm_wday=2, tm_yday=39, tm_isdst=0)
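
The log above reports a best `alpha`, which suggests some form of search over the smoothing parameter. `models.py` is not shown here, so the following is only a sketch of how such a search could look with `GridSearchCV` and `MultinomialNB`. The grid values, the scoring metric, the number of folds, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features standing in for a BOW matrix (illustration only).
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 5, size=(60, 8))
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

# Hypothetical grid over the additive smoothing parameter.
param_grid = {"alpha": [1e-7, 1e-3, 1e-1, 1.0]}
search = GridSearchCV(MultinomialNB(), param_grid, scoring="accuracy", cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
```

`search.best_estimator_` is then refit on the full data passed to `fit` and can be evaluated on the held-out split.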


In [15]:
model_performance['naive_bayes_bow_with_hp']=metrics

### Logistic Regression

In [16]:
# training the model using logistic regression
lr = models.LRClassification()
lr_model,metrics = lr.train_model_without_hp(X_train,y_train,X_test,y_test)

2023-02-08 18:41:49,662 - root - INFO - Training the model without hyperparameter tuning
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
2023-02-08 18:42:30,091 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=18, tm_min=42, tm_sec=30, tm_wday=2, tm_yday=39, tm_isdst=0)
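
The warning above means the `lbfgs` solver hit its iteration limit (`max_iter=100`) before converging. One of the remedies it suggests is raising `max_iter`. The sketch below shows that on a toy corpus using scikit-learn's `LogisticRegression` directly; whether `models.LRClassification` exposes `max_iter` is not visible in this notebook, and the value 1000 is an arbitrary choice.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, made up for illustration only.
docs = ["good flavor", "love album good sound", "great product love",
        "gel cause irritation", "bad lubricant irritation", "bad smell"]
labels = [1, 1, 1, 0, 0, 0]
X_toy = CountVectorizer().fit_transform(docs)

with warnings.catch_warnings():
    # Turn the convergence warning into an error so an early stop cannot go unnoticed.
    warnings.simplefilter("error", ConvergenceWarning)
    lr_toy = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
    lr_toy.fit(X_toy, labels)

# n_iter_ holds the number of iterations the solver actually used.
print(lr_toy.n_iter_)
```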


In [17]:
lr_model.get_params()

{'C': 1.0,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [18]:
model_performance['lr_bow_without_hp']=metrics

In [19]:
# tuning on a smaller subset: the first 10,000 training rows, evaluated on the first 100 test rows
lr_model_hp,metrics = lr.train_model_with_hp(X_train[0:10000],y_train[0:10000],X_test[0:100],y_test[0:100])

2023-02-08 18:44:31,842 - root - INFO - Started training logistic regression with hyperparameter tuning
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
2023-02-08 18:52:54,971 - root - INFO - Best params {'tol': 0.01, 'C': 1} 

In [20]:
model_performance['lr_bow_with_hp']=metrics

In [21]:
model_performance

{'naive_bayes_bow_without_hp': {'training_accuracy': 0.8921699371308821,
  'training_precision': 0.9141554824006077,
  'training_recall': 0.9694398195391805,
  'test_accuracy': 0.8865429492165796,
  'test_precision': 0.9059342701196145,
  'test_recall': 0.9736645032451323},
 'naive_bayes_bow_with_hp': {'training_accuracy': 0.9448942655743952,
  'training_precision': 0.9606901646264246,
  'training_recall': 0.9778720661689672,
  'test_accuracy': 0.9005445049449939,
  'test_precision': 0.9230769230769231,
  'test_recall': 0.9690464303544682},
 'lr_bow_without_hp': {'training_accuracy': 0.9395122880548676,
  'training_precision': 0.9976478687396019,
  'training_recall': 0.9339921585477201,
  'test_accuracy': 0.901211245693966,
  'test_precision': 0.9823919815793039,
  'test_recall': 0.9052670993509735},
 'lr_bow_with_hp': {'training_accuracy': 0.9425,
  'training_precision': 0.9975947083583885,
  'training_recall': 0.9372881355932203,
  'test_accuracy': 0.91,
  'test_precision': 0.9883720

### XGBoost

In [22]:
# keep the wrapper (xgbc) and the fitted model (xgb) under separate names
xgbc = models.XGBoost()
xgb,metrics=xgbc.train_model_without_hp(X_train,y_train,X_test,y_test)

2023-02-08 18:53:31,172 - root - INFO - Training the model without hyperparameter tuning
2023-02-08 19:05:54,376 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=19, tm_min=5, tm_sec=54, tm_wday=2, tm_yday=39, tm_isdst=0)


In [25]:
model_performance['xgb_without_hp'] = metrics

In [26]:
model_performance


### Evaluating performance of different ML algorithms trained on BOW model

In [27]:
import pandas as pd
bow_performance=pd.DataFrame(model_performance)
bow_performance

| | naive_bayes_bow_without_hp | naive_bayes_bow_with_hp | lr_bow_without_hp | lr_bow_with_hp | xgb_without_hp |
|---|---|---|---|---|---|
| training_accuracy | 0.89217 | 0.944894 | 0.939512 | 0.9425 | 0.934416 |
| training_precision | 0.914155 | 0.96069 | 0.997648 | 0.997595 | 0.936197 |
| training_recall | 0.96944 | 0.977872 | 0.933992 | 0.937288 | 0.99377 |
| test_accuracy | 0.886543 | 0.900545 | 0.901211 | 0.91 | 0.919547 |
| test_precision | 0.905934 | 0.923077 | 0.982392 | 0.988372 | 0.926698 |
| test_recall | 0.973665 | 0.969046 | 0.905267 | 0.913978 | 0.987768 |
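
The rows of this table are accuracy, precision, and recall on the training and test splits. How `models.py` computes them is not shown, so the sketch below is only one plausible way to build a dictionary with the same keys using `sklearn.metrics`. The helper name `summarize_split` and the toy labels are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def summarize_split(prefix, y_true, y_pred):
    """Return accuracy, precision and recall for one split, keyed by prefix."""
    return {
        f"{prefix}_accuracy": accuracy_score(y_true, y_pred),
        f"{prefix}_precision": precision_score(y_true, y_pred),
        f"{prefix}_recall": recall_score(y_true, y_pred),
    }

# Toy labels and predictions, made up for illustration only.
toy_metrics = {}
toy_metrics.update(summarize_split("training", [1, 0, 1, 0], [1, 0, 1, 1]))
toy_metrics.update(summarize_split("test", [1, 1, 1, 0], [1, 1, 0, 0]))
print(toy_metrics)
```

Precision is the share of predicted positives that are truly positive, and recall is the share of true positives that the model finds, so a model can score high on one and lower on the other, as the logistic regression and XGBoost columns show.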


## Training the model using TF-IDF

In [28]:
vec = TfidfVectorizer(stop_words='english')

In [29]:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=42)

In [31]:
# transforming X_train and X_test to a TF-IDF representation
X_train=vec.fit_transform(X_train).toarray()
X_test=vec.transform(X_test).toarray()
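
For intuition on how TF-IDF differs from raw counts: a term that occurs in many documents receives a lower inverse-document-frequency weight than a rarer term. The sketch below compares the two vectorizers on a three-document toy corpus (made up for illustration) with scikit-learn's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# "good" appears in every document, "flavor" in only one.
docs = ["good flavor", "good album", "good gel irritation"]

count_vec = CountVectorizer().fit(docs)
tfidf_vec = TfidfVectorizer().fit(docs)

count_row = count_vec.transform(["good flavor"]).toarray()[0]
tfidf_row = tfidf_vec.transform(["good flavor"]).toarray()[0]

# Look up each term's column through the fitted vocabulary.
good_count = count_row[count_vec.vocabulary_["good"]]
flavor_count = count_row[count_vec.vocabulary_["flavor"]]
good_weight = tfidf_row[tfidf_vec.vocabulary_["good"]]
flavor_weight = tfidf_row[tfidf_vec.vocabulary_["flavor"]]

print(good_count, flavor_count)    # raw counts treat both terms equally
print(good_weight, flavor_weight)  # TF-IDF weights the rarer term higher
```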

### Naive Bayes

In [32]:
#training using naive bayes
nb = models.NaiveBayes()
naive_bayes,metrics=nb.train_model_without_hp(X_train,y_train,X_test,y_test)

2023-02-08 19:08:49,122 - root - INFO - Training the model without hyperparameter tuning
2023-02-08 19:08:55,048 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=19, tm_min=8, tm_sec=55, tm_wday=2, tm_yday=39, tm_isdst=0)


In [33]:
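# start a fresh results dict for the TF-IDF runs;
# some keys below still carry the 'bow' label from the previous section
model_performance={}
model_performance['naive_bayes_without_hp'] = metrics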

In [34]:
naive_bayes_hp,metrics=nb.train_model_with_hp(X_train,y_train,X_test,y_test)

2023-02-08 19:09:26,408 - root - INFO - Started training naive bayes with hyperparameter tuning
2023-02-08 19:10:28,223 - root - INFO - Best params {'alpha': 1e-07} 
2023-02-08 19:10:33,703 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=19, tm_min=10, tm_sec=33, tm_wday=2, tm_yday=39, tm_isdst=0)


In [35]:
model_performance['naive_bayes_bow_with_hp']=metrics

### Logistic Regression

In [36]:
# training the model using logistic regression
lr = models.LRClassification()
lr_model,metrics = lr.train_model_without_hp(X_train,y_train,X_test,y_test)

2023-02-08 19:11:05,513 - root - INFO - Training the model without hyperparameter tuning
2023-02-08 19:11:16,611 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=19, tm_min=11, tm_sec=16, tm_wday=2, tm_yday=39, tm_isdst=0)


In [37]:
model_performance['lr_bow_without_hp']=metrics

In [38]:
lr_model_hp,metrics = lr.train_model_with_hp(X_train[0:10000],y_train[0:10000],X_test[0:100],y_test[0:100])

2023-02-08 19:11:43,943 - root - INFO - Started training logistic regression with hyperparameter tuning
2023-02-08 19:17:07,502 - root - INFO - Best params {'tol': 0.01, 'C': 1} 
2023-02-08 19:17:12,923 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=19, tm_min=17, tm_sec=12, tm_wday=2, tm_yday=39, tm_isdst=0)


In [39]:
model_performance['lr_bow_with_hp']=metrics

### XGBoost

In [40]:
xgbc = models.XGBoost()
xgb,metrics=xgbc.train_model_without_hp(X_train,y_train,X_test,y_test)

2023-02-08 19:17:28,729 - root - INFO - Training the model without hyperparameter tuning
2023-02-08 19:29:56,330 - root - INFO - Finished training at time.struct_time(tm_year=2023, tm_mon=2, tm_mday=8, tm_hour=19, tm_min=29, tm_sec=56, tm_wday=2, tm_yday=39, tm_isdst=0)


In [41]:
model_performance['xgb_without_hp'] = metrics

In [42]:
model_performance

{'naive_bayes_without_hp': {'training_accuracy': 0.8921699371308821,
  'training_precision': 0.9141554824006077,
  'training_recall': 0.9694398195391805,
  'test_accuracy': 0.8865429492165796,
  'test_precision': 0.9059342701196145,
  'test_recall': 0.9736645032451323},
 'naive_bayes_bow_with_hp': {'training_accuracy': 0.9448942655743952,
  'training_precision': 0.9606901646264246,
  'training_recall': 0.9778720661689672,
  'test_accuracy': 0.9005445049449939,
  'test_precision': 0.9230769230769231,
  'test_recall': 0.9690464303544682},
 'lr_bow_without_hp': {'training_accuracy': 0.9016479329396075,
  'training_precision': 0.9962230215827338,
  'training_recall': 0.8924754283259037,
  'test_accuracy': 0.8782086898544282,
  'test_precision': 0.9867680180180181,
  'test_recall': 0.8749375936095857},
 'lr_bow_with_hp': {'training_accuracy': 0.8995,
  'training_precision': 0.9957032730949071,
  'training_recall': 0.8902824858757062,
  'test_accuracy': 0.89,
  'test_precision': 0.9880952380

In [43]:
# persist the XGBoost model trained without hyperparameter tuning
import pickle
with open('model.pkl','wb') as f:
    pickle.dump(xgb,f)
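
A pickled classifier alone cannot score raw review text, because new text has to pass through the same fitted vectorizer first. The sketch below saves a vectorizer and a model together and reloads them for prediction. It uses `LogisticRegression` as a stand-in for the XGBoost model, and the file name `sentiment_bundle.pkl` and the dictionary layout are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, made up for illustration only.
docs = ["good flavor", "love album good sound",
        "gel cause irritation", "bad gel irritation"]
labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vec.fit_transform(docs), labels)

# Save the vectorizer and the model in one file (hypothetical name and layout).
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": toy_vec, "model": toy_model}, f)

# Later: reload both and score new text with the same vocabulary.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_reviews = ["good album", "irritation gel"]
restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(new_reviews))
print(restored_pred)
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.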

### Evaluating performance of different ML algorithms trained on TF-IDF model

In [44]:
tf_idf_performance=pd.DataFrame(model_performance)

In [45]:
tf_idf_performance

| | naive_bayes_without_hp | naive_bayes_bow_with_hp | lr_bow_without_hp | lr_bow_with_hp | xgb_without_hp |
|---|---|---|---|---|---|
| training_accuracy | 0.89217 | 0.944894 | 0.901648 | 0.8995 | 0.945656 |
| training_precision | 0.914155 | 0.96069 | 0.996223 | 0.995703 | 0.947374 |
| training_recall | 0.96944 | 0.977872 | 0.892475 | 0.890282 | 0.993931 |
| test_accuracy | 0.886543 | 0.900545 | 0.878209 | 0.89 | 0.923325 |
| test_precision | 0.905934 | 0.923077 | 0.986768 | 0.988095 | 0.932027 |
| test_recall | 0.973665 | 0.969046 | 0.874938 | 0.892473 | 0.985771 |
