### File: Data Preparation

#### Goals and objectives of this file:

##### 1. Clean and pre-process the dataset
##### => Basic Cleaning Process => duplicate removal => checking missing labels => removing dates
##### => Pre-Processing data => stemming => removing stop words => lemmatization

##### 2. Basic Sentiment Analysis
##### => Sentiment Polarity

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from pathlib import Path
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
# Uncomment on first run to download the NLTK resources this notebook relies on.
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')
#nltk.download('vader_lexicon')
#nltk.download('averaged_perceptron_tagger')

In [3]:
df = pd.read_csv("smaller_dataset/yelp coffee/raw_yelp_review_data.csv")

In [4]:
df.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating


In [5]:
df.shape

(7616, 3)

In [6]:
df.describe()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
count,7616,7616,7616
unique,79,6915,5
top,Epoch Coffee,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
freq,400,4,3780


### 1.1 Duplicate Removal

In [7]:
df.drop_duplicates(inplace = True)

### 1.2 Checking Missing Labels

In [8]:
df.isnull().sum()

coffee_shop_name    0
full_review_text    0
star_rating         0
dtype: int64

### 1.3 Removing Dates and "check-in" Text at the Beginning of Each Review

In [9]:
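# Drop the first 8 characters, which covers most of the leading date (e.g. "11/25/20").
df['full_review_text'] = [new_text[8:] for new_text in df['full_review_text']]
# Remove every "check-in" marker, then strip leftover digits, dots, dashes and spaces on the left.
df['full_review_text'] = [new_text.replace("check-in","") for new_text in df['full_review_text']]
df['full_review_text'] = [new_text.lstrip('0123456789.- ') for new_text in df['full_review_text']]
# Strip the "s" left over from "check-ins".
df['full_review_text'] = [new_text.lstrip('s') for new_text in df['full_review_text']]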
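
The slice-and-strip steps above also trim a review that happens to start with a digit or a lowercase "s". A regex anchored at the start of the string is one possible alternative. The sketch below assumes every review begins with a date of the form `m/d/yyyy`, optionally followed by `<n> check-in` or `<n> check-ins`; the helper name `strip_review_prefix` is hypothetical and not part of the notebook.

```python
import re

# Assumed prefix shape: "<m>/<d>/<yyyy>", optionally followed by "<n> check-in(s)".
_PREFIX = re.compile(r"^\s*\d{1,2}/\d{1,2}/\d{4}\s*(?:\d+\s+check-ins?\s*)?")

def strip_review_prefix(text: str) -> str:
    """Remove a leading date and optional check-in counter; leave the rest untouched."""
    return _PREFIX.sub("", text, count=1)

print(strip_review_prefix("11/25/2016 1 check-in Love love loved the atmosphere!"))
print(strip_review_prefix("12/2/2016 Listed in Date Night: Austin"))
print(strip_review_prefix("1/3/2016 5 stars for cleanliness"))  # made-up review text
```

In the notebook this could be applied with `df['full_review_text'].map(strip_review_prefix)`. Unlike the cell above, it does not leave a leading space in front of the review text.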

In [10]:
df.head(50)

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,Love love loved the atmosphere! Every corner o...,5.0 star rating
1,The Factory - Cafe With a Soul,"Listed in Date Night: Austin, Ambiance in Aust...",4.0 star rating
2,The Factory - Cafe With a Soul,Listed in Brunch Spots I loved the eclectic an...,4.0 star rating
3,The Factory - Cafe With a Soul,Very cool decor! Good drinks Nice seating How...,2.0 star rating
4,The Factory - Cafe With a Soul,They are located within the Northcross mall sh...,4.0 star rating
5,The Factory - Cafe With a Soul,Very cute cafe! I think from the moment I step...,4.0 star rating
6,The Factory - Cafe With a Soul,"Listed in ""Nuptial Coffee Bliss!"", Anderson L...",4.0 star rating
7,The Factory - Cafe With a Soul,Love this place! 5 stars for cleanliness 5 s...,5.0 star rating
8,The Factory - Cafe With a Soul,"Ok, let's try this approach... Pros: Music Se...",3.0 star rating
9,The Factory - Cafe With a Soul,This place has been shown on my social media ...,5.0 star rating


### 1.4 Removing "star rating" from labels

In [11]:
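# Keep only the leading integer of labels such as "5.0 star rating"; skip if already numeric.
if not pd.api.types.is_integer_dtype(df['star_rating']):
    df['star_rating'] = df['star_rating'].str.extract(r'(\d+)', expand=False)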

In [12]:
df['star_rating'] = [int(rating) for rating in df['star_rating']]

In [13]:
type(df['star_rating'][0])

numpy.int64

In [14]:
df['star_rating']

0       5
1       4
2       4
3       2
4       4
       ..
7611    4
7612    5
7613    4
7614    3
7615    4
Name: star_rating, Length: 6915, dtype: int64

### 1.5 General Data Pre-Processing

### Note: A script in the "datasets" folder automates the following section.
### The steps are kept here as a test on the original dataset that I am working on, and for continuity.

#### The NLTK stop-word list was not sufficient to filter out contractions and greetings. Therefore, some extra stop words were collected from the web (sources at the end of this file) to filter these edge cases and produce a cleaner corpus.

1. Text is made lowercase

2. Punctuation is removed

3. Tokenization

4. POS tagging

5. Lemmatization

In [15]:
df['full_review_text'][32]

" The Factory Cafe is overall such a beautiful and really cool place to just hang out with friends or work on homework. You'll probably see the people around you taking the time to perfect their aesthetic Instagram pictures and Snapchat stories (myself included). Although the cafe itself doesn't provide wifi, it's within range of other places with wifi so that shouldn't deter you from going! This place looks like it literally came straight off of Pinterest! So cute! "

In [16]:
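def process_corpus(text):
    # Lowercase and strip punctuation (this also turns "doesn't" into "doesnt").
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    # pos_tag returns (word, tag) pairs; the tags are not passed to the lemmatizer,
    # so every word is lemmatized with the default noun POS.
    tokens = nltk.pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token[0]) for token in tokens]
    return ' '.join(tokens)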
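
`process_corpus` computes POS tags but calls `lemmatizer.lemmatize(token[0])` without them, so words are treated as nouns. If POS-aware lemmatization is wanted, the Penn Treebank tags from `nltk.pos_tag` first have to be mapped to the single-letter POS codes that `WordNetLemmatizer.lemmatize(word, pos=...)` accepts. The helper below is a minimal sketch of that mapping; the name `penn_to_wordnet` is hypothetical, and the tagged pairs are written by hand so the block runs without NLTK data.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
tagged = [("taking", "VBG"), ("pictures", "NNS"), ("cute", "JJ"), ("literally", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Inside `process_corpus`, the list comprehension would then become `[lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tokens]`. This changes the resulting corpus, since verb forms such as "taking" should then be reduced with verb rules rather than left as-is.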

In [17]:
corpus = [process_corpus(review_corpus) for review_corpus in df['full_review_text']]

In [18]:
corpus[32]

'the factory cafe is overall such a beautiful and really cool place to just hang out with friend or work on homework youll probably see the people around you taking the time to perfect their aesthetic instagram picture and snapchat story myself included although the cafe itself doesnt provide wifi it within range of other place with wifi so that shouldnt deter you from going this place look like it literally came straight off of pinterest so cute'
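
The processed review above still contains stop words ("the", "is", "such") and contractions that lost their apostrophes ("youll", "doesnt"), which a stop-word entry spelled with an apostrophe would not match. A rough sketch of the extended stop-word filtering mentioned in the note could look like the following. The two small sets are hypothetical stand-ins: in the notebook they would presumably be `stopwords.words('english')` and the words gathered from the extra lists.

```python
# Hypothetical stand-ins for the NLTK list and the extra stop words.
base_stop_words = {"the", "is", "a", "and", "to", "or", "on", "with", "such"}
extra_stop_words = {"youll", "doesnt", "shouldnt"}  # contractions after punctuation removal
stop_words = base_stop_words | extra_stop_words

def remove_stop_words(text: str, stop_words: set) -> str:
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(token for token in text.split() if token not in stop_words)

sample = "the factory cafe is overall such a beautiful and really cool place"
print(remove_stop_words(sample, stop_words))
# factory cafe overall beautiful really cool place
```

Using a `set` keeps each membership test cheap, which matters when the filter runs over several thousand reviews.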

In [19]:
corpus_texts = corpus
corpus_labels = df['star_rating']

### 2. Basic Sentiment Analysis

In [20]:
sid = SentimentIntensityAnalyzer()

sent_polarity_info = [sid.polarity_scores(review) for review in df['full_review_text']]

sent_polarity_info

[{'neg': 0.0, 'neu': 0.821, 'pos': 0.179, 'compound': 0.9283},
 {'neg': 0.0, 'neu': 0.727, 'pos': 0.273, 'compound': 0.9187},
 {'neg': 0.004, 'neu': 0.815, 'pos': 0.181, 'compound': 0.9936},
 {'neg': 0.096, 'neu': 0.714, 'pos': 0.19, 'compound': 0.8047},
 {'neg': 0.016, 'neu': 0.833, 'pos': 0.15, 'compound': 0.9393},
 {'neg': 0.024, 'neu': 0.754, 'pos': 0.222, 'compound': 0.9852},
 {'neg': 0.017, 'neu': 0.848, 'pos': 0.135, 'compound': 0.9843},
 {'neg': 0.038, 'neu': 0.787, 'pos': 0.175, 'compound': 0.9919},
 {'neg': 0.052, 'neu': 0.733, 'pos': 0.215, 'compound': 0.997},
 {'neg': 0.053, 'neu': 0.826, 'pos': 0.121, 'compound': 0.8516},
 {'neg': 0.036, 'neu': 0.872, 'pos': 0.092, 'compound': 0.9474},
 {'neg': 0.173, 'neu': 0.712, 'pos': 0.115, 'compound': -0.6927},
 {'neg': 0.02, 'neu': 0.888, 'pos': 0.092, 'compound': 0.9023},
 {'neg': 0.0, 'neu': 0.785, 'pos': 0.215, 'compound': 0.7639},
 {'neg': 0.0, 'neu': 0.878, 'pos': 0.122, 'compound': 0.8176},
 {'neg': 0.027, 'neu': 0.76, 'pos': ...

### 2.1 Sentiment Polarity

In [21]:
def classify_sentiment(score):
    if score['neg'] > score['pos']:
        return "Negative Sentiment"
    elif score['neg'] < score['pos']:
        return "Positive Sentiment"
    else:
        return "Neutral Sentiment"
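
`classify_sentiment` compares the `neg` and `pos` proportions directly. A common alternative convention for VADER is to threshold the `compound` score at ±0.05, which labels near-zero scores as neutral. The sketch below shows that variant for comparison only; the function name `classify_by_compound` and the `threshold` parameter are assumptions, not part of the notebook. The first two score dictionaries are copied from the output above, and the third is a made-up neutral example.

```python
def classify_by_compound(score: dict, threshold: float = 0.05) -> str:
    """Label a VADER score dict using its compound value and a symmetric threshold."""
    compound = score["compound"]
    if compound >= threshold:
        return "Positive Sentiment"
    if compound <= -threshold:
        return "Negative Sentiment"
    return "Neutral Sentiment"

scores = [
    {'neg': 0.0, 'neu': 0.821, 'pos': 0.179, 'compound': 0.9283},
    {'neg': 0.173, 'neu': 0.712, 'pos': 0.115, 'compound': -0.6927},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},  # hypothetical neutral review
]
print([classify_by_compound(s) for s in scores])
```

The two rules can disagree on individual reviews, so it may be worth comparing both against `star_rating` before choosing one.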

In [22]:
def extract_sent_polarity(score):
    return score['compound']

In [23]:
review_sentiment = [classify_sentiment(scores) for scores in sent_polarity_info]

sent_polarity = [extract_sent_polarity(scores) for scores in sent_polarity_info]


df['str_sent'] = review_sentiment

df['sent_polarity'] = sent_polarity

Sources for the extra stop words:

1. https://www.ranks.nl/stopwords

2. https://countwordsfree.com/stopwords