# Sentiment Analysis on the Yelp dataset

> NOTE: This notebook requires a GPU, unless you are willing to wait 14 hours for the output.

## Loading libraries


In [1]:
import numpy as np
import pandas as pd


## Loading the dataset


In [2]:
IO_TRAIN = "../input/yelp-review-dataset/yelp_review_polarity_csv/train.csv"
ylp = pd.read_csv(IO_TRAIN, header=None)
ylp.columns = ["sentiment", "review"]
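# Map the numeric labels to readable names on the sentiment column only,
# so that values in the review column are never touched.
ylp["sentiment"] = ylp["sentiment"].replace({1: "NEG", 2: "POS"}).astype("category")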
ylp.head()


| | sentiment | review |
|---|---|---|
| 0 | NEG | Unfortunately, the frustration of being Dr. Go... |
| 1 | POS | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | NEG | I don't know what Dr. Goldberg was like before... |
| 3 | NEG | I'm writing this review to give you a heads up... |
| 4 | POS | All the food is great here. But the best thing... |


In [3]:
!cp '../input/yelp-sent-analysis-preprocess/train_processed.csv' './'


## Leveraging word embeddings

As the variations on the benchmark model showed, the dataset produces a large number of features. The next step is to use embeddings, which reduce the number of features while still preserving context.
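
To make the dimensionality argument concrete, the sketch below compares the width of a bag-of-words matrix, which grows with the vocabulary, against a fixed-width embedding. The toy corpus and the helper `bag_of_words` are hypothetical and exist only for illustration. The value 768 is the hidden size of DistilBERT-base and is assumed here to be the output width of a mean-pooled sentence embedding.

```python
import numpy as np

# Hypothetical toy corpus, standing in for the Yelp reviews.
toy_corpus = [
    "the food was great",
    "the service was terrible",
    "great food but terrible parking",
]


def bag_of_words(texts):
    """Return a (n_texts, vocab_size) count matrix and the sorted vocabulary."""
    vocab = sorted({word for text in texts for word in text.split()})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(texts), len(vocab)), dtype=np.int64)
    for row, text in enumerate(texts):
        for word in text.split():
            counts[row, index[word]] += 1
    return counts, vocab


bow, vocab = bag_of_words(toy_corpus)

# Assumption: DistilBERT-base has a hidden size of 768, so a mean-pooled
# sentence embedding has 768 columns regardless of vocabulary size.
EMBEDDING_DIM = 768

print("bag-of-words shape:", bow.shape)
print("embedding shape:   ", (len(toy_corpus), EMBEDDING_DIM))
```

On three short sentences the bag-of-words matrix is narrower than the embedding, but its width keeps growing with every new word in the corpus, while the embedding width stays fixed.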


In [4]:
!python3 -m pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=594ce1787c94970924760cc312687669386b404fdccec4af90f30d2631e0e2b2
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2

In [5]:
from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer("distilbert-base-nli-mean-tokens")
corpus = ylp["review"].values



In [6]:
# Encode every review into a fixed-size vector; this is the slow step.
features = transformer.encode(corpus)


Batches:   0%|          | 0/17500 [00:00<?, ?it/s]
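
Because encoding the full corpus takes a long time, it can help to encode in chunks so that an interruption does not lose all progress. The following is a minimal sketch, not the approach used in this notebook: `encode_in_chunks` and `dummy_encode` are hypothetical names, and `dummy_encode` merely stands in for `transformer.encode` so that the example runs without downloading a model.

```python
import numpy as np


def encode_in_chunks(encode_fn, texts, chunk_size=1000):
    """Encode `texts` chunk by chunk and stack the results row-wise.

    `encode_fn` is any callable that maps a sequence of strings to a
    2-D array with one row per string.
    """
    parts = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        parts.append(np.asarray(encode_fn(chunk)))
        # A real run could save each part to disk here as a checkpoint.
    return np.vstack(parts)


def dummy_encode(texts):
    """Hypothetical stand-in encoder: one row per text, [length, word count]."""
    return np.array([[len(t), len(t.split())] for t in texts], dtype=np.float32)


sample = ["good food", "bad", "would come back again", "meh", "ok place"]
features_demo = encode_in_chunks(dummy_encode, sample, chunk_size=2)
print(features_demo.shape)
```

Swapping `dummy_encode` for a real encoder should give the same rows as a single call, assuming the encoder treats each text independently.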

In [7]:
np.save("features.npy", features)
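
The saved array can be read back with `np.load` in a later session, which avoids repeating the encoding step. The sketch below demonstrates the round trip on a small random array; the file name `features_demo.npy` and the array contents are placeholders, not the notebook's real features.

```python
import os
import tempfile

import numpy as np

# Placeholder data standing in for the real embedding matrix.
rng = np.random.default_rng(0)
demo = rng.random((4, 768), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features_demo.npy")
    np.save(path, demo)
    # Load the array back fully into memory before the directory is removed.
    restored = np.load(path)

same_shape = restored.shape == demo.shape
same_values = bool(np.array_equal(restored, demo))
print(same_shape, same_values)
```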
