# Sentiment Analysis on the Yelp dataset

## Loading libraries


In [2]:
import numpy as np
import pandas as pd


## Loading the dataset


In [3]:
IO_TRAIN = "../input/yelp-sent-analysis-preprocess/train_processed.csv"
ylp_processed = pd.read_csv(IO_TRAIN)
ylp_processed.head()


Unnamed: 0,sentiment,review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,2,Been going to Dr. Goldberg for over 10 years. ...
2,1,I don't know what Dr. Goldberg was like before...
3,1,I'm writing this review to give you a heads up...
4,2,All the food is great here. But the best thing...


---

## Benchmark model

We start with a naive Bayes classifier on unigram TF-IDF features as the benchmark.

Naive Bayes is one of the oldest and fastest classifiers for text classification, which makes it a cheap baseline to compare later models against.
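To make the unigram setup concrete, here is a minimal sketch on a made-up toy corpus. The sentences and the names `toy_reviews`, `toy_labels`, and `toy_model` are hypothetical, and the label convention (1 = negative, 2 = positive) is an assumption based on the sample rows shown above. `ngram_range=(1, 1)` is the `TfidfVectorizer` default and is spelled out only to show where the unigram choice lives.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; labels assumed to mean 1 = negative, 2 = positive.
toy_reviews = [
    "great food and friendly staff",
    "terrible service and cold food",
    "friendly staff great service",
    "cold food terrible staff",
]
toy_labels = [2, 1, 2, 1]

# Unigram TF-IDF features feeding a multinomial naive Bayes classifier.
toy_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), MultinomialNB())
toy_model.fit(toy_reviews, toy_labels)

# Each vocabulary entry is a single word, i.e. a unigram.
vocab = sorted(toy_model.named_steps["tfidfvectorizer"].vocabulary_)
print(vocab)

toy_pred = toy_model.predict(["great friendly service", "terrible cold food"])
print(toy_pred)
```

Switching to `ngram_range=(1, 2)` would add bigrams to the vocabulary, at the cost of a much larger feature space on the full dataset.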


In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report


In [14]:
X = ylp_processed["review"]
y = ylp_processed["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

benchmark = make_pipeline(TfidfVectorizer(), MultinomialNB())


In [15]:
scores = cross_validate(benchmark, X, y, cv=5, n_jobs=-1, verbose=1)
np.mean(scores["test_score"])


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.6min finished


0.8749453136757473

The benchmark model reaches a mean 5-fold cross-validated accuracy of about $0.875$.
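A single mean hides how much the score varies between folds. The sketch below reports mean and standard deviation for two metrics. It runs on a made-up toy corpus so that it is self-contained; `toy_reviews`, `toy_labels`, and `summary` are hypothetical names, and on the real data they would be replaced by `X` and `y`. The explicit `StratifiedKFold` and the extra `f1_macro` metric are illustrative choices, not part of the original run.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 20 reviews, balanced between the two classes.
toy_reviews = ["great food and friendly staff", "terrible service and cold food"] * 10
toy_labels = [2, 1] * 10

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_validate(
    toy_model, toy_reviews, toy_labels, cv=cv, scoring=["accuracy", "f1_macro"]
)

# Summarise each metric as (mean, standard deviation) over the folds.
summary = {}
for metric in ("accuracy", "f1_macro"):
    fold_scores = cv_scores[f"test_{metric}"]
    summary[metric] = (float(np.mean(fold_scores)), float(np.std(fold_scores)))
    print(f"{metric}: {summary[metric][0]:.4f} +/- {summary[metric][1]:.4f}")
```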


In [16]:
_ = benchmark.fit(X_train, y_train)
y_pred = benchmark.predict(X_test)
# Display the sorted labels 1 and 2 as NEG and POS in the report.
print(classification_report(y_test, y_pred, target_names=["NEG", "POS"], digits=4))


              precision    recall  f1-score   support

         NEG     0.8790    0.8778    0.8784     55986
         POS     0.8780    0.8791    0.8786     55996

    accuracy                         0.8785    111982
   macro avg     0.8785    0.8785    0.8785    111982
weighted avg     0.8785    0.8785    0.8785    111982



In [None]:
import joblib

joblib.dump(benchmark, "benchmark.joblib")
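The saved pipeline can be restored later with `joblib.load`. The following sketch round-trips a small model through a temporary file and checks that predictions are unchanged. The toy data and the file name `toy_benchmark.joblib` are hypothetical; with the real model, the path would be `"benchmark.joblib"`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the Yelp reviews.
toy_reviews = [
    "great food and friendly staff",
    "terrible service and cold food",
    "friendly staff great service",
    "cold food terrible staff",
]
toy_labels = [2, 1, 2, 1]
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(toy_reviews, toy_labels)

# Save and reload the whole pipeline (vectorizer and classifier together).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_benchmark.joblib")
    joblib.dump(toy_model, model_path)
    restored = joblib.load(model_path)

# The restored pipeline accepts raw text, exactly like the original.
new_reviews = ["great friendly service", "terrible cold food"]
same_predictions = bool((restored.predict(new_reviews) == toy_model.predict(new_reviews)).all())
print(same_predictions)
```

Because the vectorizer is stored inside the pipeline, no separate vocabulary file is needed. The file should be loaded with a scikit-learn version compatible with the one used for saving.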
