# Sentiment Analysis on Yelp dataset

## Loading libraries


In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report


## Loading the dataset


In [2]:
IO_TRAIN = "../input/yelp-review-dataset/yelp_review_polarity_csv/train.csv"
ylp = pd.read_csv(IO_TRAIN, header=None)
ylp.columns = ["sentiment", "review"]
# Map the numeric polarity labels to readable names (1 = negative, 2 = positive)
ylp["sentiment"] = ylp["sentiment"].map({1: "NEG", 2: "POS"}).astype("category")
ylp.head()


|   | sentiment | review |
|---|-----------|--------|
| 0 | NEG | Unfortunately, the frustration of being Dr. Go... |
| 1 | POS | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | NEG | I don't know what Dr. Goldberg was like before... |
| 3 | NEG | I'm writing this review to give you a heads up... |
| 4 | POS | All the food is great here. But the best thing... |


In [3]:
!cp ../input/yelp-sent-analysis-preprocess/* ./


In [4]:
features = np.load("features.npy")
labels = ylp["sentiment"]
features.shape


## More classic ML models

As noted in the EDA, there are two labels (classes), positive and negative, which makes the task a binary classification problem. That calls for a familiar model, logistic regression, which is among the fastest to train and well suited to binary classification.


In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

lr = LogisticRegression(n_jobs=-1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)


In [6]:
_ = lr.fit(X_train, y_train)
lr.score(X_train, y_train)


0.8935647321428571

In [7]:
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred, digits=4))


              precision    recall  f1-score   support

         NEG     0.8911    0.8955    0.8933     56000
         POS     0.8950    0.8906    0.8928     56000

    accuracy                         0.8930    112000
   macro avg     0.8930    0.8930    0.8930    112000
weighted avg     0.8930    0.8930    0.8930    112000



After using word embeddings to transform the reviews into vectors and fitting a logistic regression, the resulting accuracy (and F1-score) of $0.89$ is slightly higher than the benchmark model at $0.875$ and close to, though slightly below, the bigram variant at $0.9$.
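
Accuracy and F1-score are quoted interchangeably above. With perfectly balanced classes, accuracy equals the mean of the per-class recalls, which is why the `accuracy` and `macro avg` rows of the report agree. A minimal check, using only the recall and support values printed in the report (the variable names are just for illustration):

```python
# Per-class recall and support, copied from the logistic regression report
recall_neg, recall_pos = 0.8955, 0.8906
support_neg, support_pos = 56000, 56000

# Correct predictions per class = recall * support (approximate, since the
# report rounds recall to 4 digits)
correct = recall_neg * support_neg + recall_pos * support_pos
accuracy = correct / (support_neg + support_pos)

# With equal supports this is the plain average of the two recalls
macro_recall = (recall_neg + recall_pos) / 2

print(f"accuracy ~ {accuracy:.4f}, macro recall ~ {macro_recall:.4f}")
```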

As a side note, the transformer mapped every review to a non-zero vector, even though the initial preprocessing had flagged $93$ reviews as useless. That is a small number compared to $560000$, so those reviews are likely to make only a small difference in accuracy, and not necessarily an improvement.
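
One way to verify the "non-zero vectors" claim, and to drop rows while keeping features and labels aligned, is a boolean mask over the rows. This is only a sketch on a tiny made-up array: `demo_features` and `demo_labels` are hypothetical stand-ins for `features` and `labels`, and the way the $93$ reviews were originally flagged is not shown in this notebook.

```python
import numpy as np

# Hypothetical stand-ins for the real embedding matrix and label column
demo_features = np.array(
    [
        [0.1, -0.2, 0.0],
        [0.0, 0.0, 0.0],  # an all-zero row, i.e. an "empty" embedding
        [0.3, 0.0, 0.5],
    ]
)
demo_labels = np.array(["NEG", "POS", "POS"])

# True for rows that contain at least one non-zero component
nonzero_mask = np.any(demo_features != 0, axis=1)
n_zero = int((~nonzero_mask).sum())

# Apply the same mask to both arrays so rows and labels stay aligned
kept_features = demo_features[nonzero_mask]
kept_labels = demo_labels[nonzero_mask]

print(f"{n_zero} all-zero row(s) dropped, {kept_features.shape[0]} kept")
```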

---

Another model that could be helpful is `SVM` (support vector machine), which is common for classification tasks, especially on data with a low to medium number of features; after transforming the data, we have vectors of size $768$.


In [8]:
import joblib

_ = joblib.dump(lr, "logistic-regression.joblib")


In [9]:
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

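# for a dataset this large, approximating the kernel (Nystroem, RBF by default)
# and fitting a linear SVM is recommended over a kernelized SVC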
nystroem = Nystroem(random_state=42, n_components=features.shape[1])
svm = LinearSVC(random_state=42)
pipeline = make_pipeline(nystroem, svm)


In [10]:
_ = pipeline.fit(X_train, y_train)
pipeline.score(X_train, y_train)


0.8925669642857142

In [11]:
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, digits=4))


              precision    recall  f1-score   support

         NEG     0.8922    0.8903    0.8912     56000
         POS     0.8906    0.8924    0.8915     56000

    accuracy                         0.8914    112000
   macro avg     0.8914    0.8914    0.8914    112000
weighted avg     0.8914    0.8914    0.8914    112000



The `SVM` classifier is close to the logistic regression, at $0.89$.

Note the warning raised while fitting, which asks to increase the `max_iter` parameter.
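
`LinearSVC` defaults to `max_iter=1000`, so the usual response to that warning is to raise the limit. The sketch below does this on a small synthetic dataset and counts any convergence warnings still raised. `X_demo`, `y_demo`, and the value `5000` are illustrative assumptions, not settings tuned for the Yelp features.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Small synthetic binary problem, standing in for the real features
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

# Raise the iteration limit above the default of 1000 (5000 is an arbitrary choice)
svm_more_iter = LinearSVC(random_state=42, max_iter=5000)

# Record warnings so we can see whether the solver still fails to converge
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    svm_more_iter.fit(X_demo, y_demo)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print(f"convergence warnings: {n_convergence_warnings}")
```

In the notebook itself, the same parameter would go on the `LinearSVC` step of the pipeline.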

> TODO: perform a randomized search (`RandomizedSearchCV`) for hyper-parameter tuning, to get the best possible results from the classical models.
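
A rough sketch of what that search could look like for the logistic regression, using scikit-learn's `RandomizedSearchCV` on a small synthetic dataset. The search range for `C`, the number of iterations, and the fold count are assumptions chosen to keep the example fast, not recommendations for the full dataset.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary problem, standing in for the real features
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=42)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    # Sample the regularization strength C on a log scale (assumed range)
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=5,  # number of sampled candidates
    cv=3,  # 3-fold cross-validation
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)

best_c = search.best_params_["C"]
best_score = search.best_score_
print(f"best C: {best_c:.4g}, mean CV accuracy: {best_score:.4f}")
```

The same pattern would apply to the SVM pipeline, with parameter names prefixed by the step name (for example `linearsvc__C`).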


In [12]:
_ = joblib.dump(pipeline, "linearsvm.joblib")
