### File: Data Preparation

#### Goals and objectives of this file:

##### 1. Clean and pre-process the dataset
##### => Basic Cleaning Process => duplicate removal => checking missing labels => removing dates
##### => Pre-Processing data => removing stop words => lemmatization

##### 2. Basic Sentiment Analysis
##### => Sentiment Polarity

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from pathlib import Path
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')
#nltk.download('vader_lexicon')
#nltk.download('averaged_perceptron_tagger')

In [3]:
df = pd.read_csv("datasets/yelp coffee/raw_yelp_review_data.csv")

In [4]:
df.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,str_sent,sent_polarity
0,The Factory - Cafe With a Soul,1 check-in Love love loved the atmosphere! Ev...,5,Positive Sentiment,0.9283
1,The Factory - Cafe With a Soul,"Listed in Date Night: Austin, Ambiance in Aust...",4,Positive Sentiment,0.9187
2,The Factory - Cafe With a Soul,1 check-in Listed in Brunch Spots I loved the...,4,Positive Sentiment,0.9936
3,The Factory - Cafe With a Soul,Very cool decor! Good drinks Nice seating Ho...,2,Positive Sentiment,0.8047
4,The Factory - Cafe With a Soul,1 check-in They are located within the Northcr...,4,Positive Sentiment,0.9393


In [5]:
df.shape

(6915, 5)

In [6]:
df.describe()

Unnamed: 0,star_rating,sent_polarity
count,6915.0,6915.0
mean,4.175127,0.794652
std,1.06166,0.385443
min,1.0,-0.9954
25%,4.0,0.8313
50%,4.0,0.9439
75%,5.0,0.978
max,5.0,0.9996


### 1.1 Duplicate Removal

In [7]:
df.drop_duplicates(inplace = True)

### 1.2 Checking Missing Labels

In [8]:
df.isnull().value_counts()

coffee_shop_name  full_review_text  star_rating  str_sent  sent_polarity
False             False             False        False     False            6915
dtype: int64

### 1.3 Removing Dates

In [9]:
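# Intended to drop a leading date by slicing off the first 11 characters.
# Caution: this is a fixed-width slice, so running it on text that has no
# date prefix (e.g. a CSV saved after an earlier run) removes real content.
df['full_review_text'] = df['full_review_text'].str[11:]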

In [10]:
df.head(20)

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,str_sent,sent_polarity
0,The Factory - Cafe With a Soul,Love love loved the atmosphere! Every corner ...,5,Positive Sentiment,0.9283
1,The Factory - Cafe With a Soul,"ate Night: Austin, Ambiance in Austin BEAUTIFU...",4,Positive Sentiment,0.9187
2,The Factory - Cafe With a Soul,Listed in Brunch Spots I loved the eclectic a...,4,Positive Sentiment,0.9936
3,The Factory - Cafe With a Soul,decor! Good drinks Nice seating However... J...,2,Positive Sentiment,0.8047
4,The Factory - Cafe With a Soul,They are located within the Northcross mall sh...,4,Positive Sentiment,0.9393
5,The Factory - Cafe With a Soul,Very cute cafe! I think from the moment I ste...,4,Positive Sentiment,0.9852
6,The Factory - Cafe With a Soul,"s Listed in ""Nuptial Coffee Bliss!"", Anderson ...",4,Positive Sentiment,0.9843
7,The Factory - Cafe With a Soul,Love this place! 5 stars for cleanliness 5 s...,5,Positive Sentiment,0.9919
8,The Factory - Cafe With a Soul,"Ok, let's try this approach... Pros: Music S...",3,Positive Sentiment,0.997
9,The Factory - Cafe With a Soul,s This place has been shown on my social media...,5,Positive Sentiment,0.8516
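
In the output above, rows 1 and 3 have lost the start of their text ("ate Night", "decor!"), which suggests that the fixed-width slice ran on reviews that no longer had a date prefix. A minimal sketch of a pattern-based alternative follows. It assumes the prefix is a date of the form `MM/DD/YYYY` at the start of the string, which is an assumption because the raw format is not shown in this notebook. The name `strip_leading_date` is hypothetical.

```python
import re

# Assumed prefix format: optional whitespace, MM/DD/YYYY, optional whitespace.
LEADING_DATE = re.compile(r"^\s*\d{1,2}/\d{1,2}/\d{4}\s*")

def strip_leading_date(text):
    """Remove a leading date if present; otherwise return the text unchanged."""
    return LEADING_DATE.sub("", text)

with_date = strip_leading_date(" 11/25/2016 1 check-in Love love loved the atmosphere!")
without_date = strip_leading_date("Very cool decor! Good drinks")
print(with_date)
print(without_date)
```

On the DataFrame this could be applied with `df['full_review_text'].map(strip_leading_date)`. Rows without a date prefix are left untouched, so rerunning the cell would be harmless.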


### 1.4 Removing "star rating" from labels

In [11]:
#df['star_rating'] = df['star_rating'].str[:2]

# This line was executed before, and the result was saved in the csv.
# Therefore, running it again would cause an error, as this column is no longer of type string.

In [12]:
df['star_rating'] = df['star_rating'].astype('int64')

In [13]:
type(df['star_rating'][0])

numpy.int64

### 1.5 General Data Pre-Processing

#### The NLTK stop-word list was not sufficient to filter out contractions and greetings. Therefore, extra stop words were scraped from the internet to cover these edge cases and produce a cleaner corpus.

In [14]:
more_stop_words = [
"a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves"
]

In [15]:
even_more_stop_words = ["able","about","above","abroad","according","accordingly","across","actually","adj","after","afterwards","again","against","ago","ahead","ain't","all","allow","allows","almost","alone","along","alongside","already","also","although","always","am","amid","amidst","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","a's","aside","ask","asking","associated","at","available","away","awfully","back","backward","backwards","be","became","because","become","becomes","becoming","been","before","beforehand","begin","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","came","can","cannot","cant","can't","caption","cause","causes","certain","certainly","changes","clearly","c'mon","co","co.","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","c's","currently","dare","daren't","definitely","described","despite","did","didn't","different","directly","do","does","doesn't","doing","done","don't","down","downwards","during","each","edu","eg","eight","eighty","either","else","elsewhere","end","ending","enough","entirely","especially","et","etc","even","ever","evermore","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","fairly","far","farther","few","fewer","fifth","first","five","followed","following","follows","for","forever","former","formerly","forth","forward","found","four","from","further","furthermore","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","had","hadn't","half","happens","hardly","has","hasn't","have","haven't","having","he","he'd","he'll","hello","help","hence","her","here","hereafter","hereby","herein","here's","hereupon","hers","herself","he's","hi","him","himself",
"his","hither","hopefully","how","howbeit","however","hundred","i'd","ie","if","ignored","i'll","i'm","immediate","in","inasmuch","inc","inc.","indeed","indicate","indicated","indicates","inner","inside","insofar","instead","into","inward","is","isn't","it","it'd","it'll","its","it's","itself","i've","just","k","keep","keeps","kept","know","known","knows","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","likewise","little","look","looking","looks","low","lower","ltd","made","mainly","make","makes","many","may","maybe","mayn't","me","mean","meantime","meanwhile","merely","might","mightn't","mine","minus","miss","more","moreover","most","mostly","mr","mrs","much","must","mustn't","my","myself","name","namely","nd","near","nearly","necessary","need","needn't","needs","neither","never","neverf","neverless","nevertheless","new","next","nine","ninety","no","nobody","non","none","nonetheless","noone","no-one","nor","normally","not","nothing","notwithstanding","novel","now","nowhere","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","one's","only","onto","opposite","or","other","others","otherwise","ought","oughtn't","our","ours","ourselves","out","outside","over","overall","own","particular","particularly","past","per","perhaps","placed","please","plus","possible","presumably","probably","provided","provides","que","quite","qv","rather","rd","re","really","reasonably","recent","recently","regarding","regardless","regards","relatively","respectively","right","round","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","shan't","she","she'd","she'll","she's","should","shouldn't","since","six","so","some","somebody","someday","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying",
"still","sub","such","sup","sure","take","taken","taking","tell","tends","th","than","thank","thanks","thanx","that","that'll","thats","that's","that've","the","their","theirs","them","themselves","then","thence","there","thereafter","thereby","there'd","therefore","therein","there'll","there're","theres","there's","thereupon","there've","these","they","they'd","they'll","they're","they've","thing","things","think","third","thirty","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","till","to","together","too","took","toward","towards","tried","tries","truly","try","trying","t's","twice","two","un","under","underneath","undoing","unfortunately","unless","unlike","unlikely","until","unto","up","upon","upwards","us","use","used","useful","uses","using","usually","v","value","various","versus","very","via","viz","vs","want","wants","was","wasn't","way","we","we'd","welcome","well","we'll","went","were","we're","weren't","we've","what","whatever","what'll","what's","what've","when","whence","whenever","where","whereafter","whereas","whereby","wherein","where's","whereupon","wherever","whether","which","whichever","while","whilst","whither","who","who'd","whoever","whole","who'll","whom","whomever","who's","whose","why","will","willing","wish","with","within","without","wonder","won't","would","wouldn't","yes","yet","you","you'd","you'll","your","you're","yours","yourself","yourselves","you've","zero","a","how's","i","when's","why's","b","c","d","e","f","g","h","j","l","m","n","o","p","q","r","s","t","u","uucp","w","x","y","z","I","www","amount","bill","bottom","call","computer","con","couldnt","cry","de","describe","detail","due","eleven","empty","fifteen","fifty","fill","find","fire","forty","front","full","give","hasnt","herse","himse","interest","itse","mill","move","myse","part","put","show","side","sincere","sixty","system","ten","thick","thin","top","twelve","twenty","abst","accordance","act","added","adopted","affected",
"affecting","affects","ah","announce","anymore","apparently","approximately","aren","arent","arise","auth","beginning","beginnings","begins","biol","briefly","ca","date","ed","effect","et-al","ff","fix","gave","giving","heres","hes","hid","home","id","im","immediately","importance","important","index","information","invention","itd","keys","kg","km","largely","lets","line","'ll","means","mg","million","ml","mug","na","nay","necessarily","nos","noted","obtain","obtained","omitted","ord","owing","page","pages","poorly","possibly","potentially","pp","predominantly","present","previously","primarily","promptly","proud","quickly","ran","readily","ref","refs","related","research","resulted","resulting","results","run","sec","section","shed","shes","showed","shown","showns","shows","significant","significantly","similar","similarly","slightly","somethan","specifically","state","states","stop","strongly","substantially","successfully","sufficiently","suggest","thered","thereof","therere","thereto","theyd","theyre","thou","thoughh","thousand","throug","til","tip","ts","ups","usefully","usefulness","'ve","vol","vols","wed","whats","wheres","whim","whod","whos","widely","words","world","youd","youre"
]

In [16]:
extra_stop_words = more_stop_words + even_more_stop_words
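
Because `process_corpus` strips punctuation before filtering, a token such as "don't" reaches the filter as "dont" and no longer matches the apostrophe form stored in the lists. One possible remedy, sketched here with a hypothetical helper `normalize_stop_words`, is to keep the stop words in a set (for constant-time lookup) and add a punctuation-free variant of each entry. The short `sample` list stands in for `extra_stop_words`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def normalize_stop_words(words):
    """Return a set holding each stop word plus its punctuation-free form."""
    normalized = set()
    for word in words:
        word = word.lower()
        normalized.add(word)
        stripped = word.translate(_STRIP_PUNCT)
        if stripped:  # skip entries that consist of punctuation only
            normalized.add(stripped)
    return normalized

sample = ["don't", "I", "co.", "she's"]  # stand-in for extra_stop_words
stop_set = normalize_stop_words(sample)
print(sorted(stop_set))
```

One trade-off to keep in mind: stripped forms can collide with ordinary words (for example, "he'll" becomes "hell"), so the resulting set filters slightly more aggressively than the original lists.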

In [22]:
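def process_corpus(text):
    # Lowercase and strip ASCII punctuation. Because this runs before stop-word
    # filtering, contractions such as "don't" reach the filter as "dont".
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    # Combine NLTK's English stop words with the extra lists in one set for fast lookup.
    stop_words = set(stopwords.words('english')) | set(extra_stop_words)
    tokens = [token for token in tokens if token not in stop_words]
    # POS tags are computed but not passed to the lemmatizer,
    # so every token is lemmatized as a noun (the default).
    tokens = nltk.pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word, _tag in tokens]
    return ' '.join(tokens)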

In [23]:
corpus = [process_corpus(review_corpus) for review_corpus in df['full_review_text']]

In [24]:
corpus[0]

'love love loved atmosphere corner coffee shop style swing ordered matcha latte muy fantastico ordering drink pretty streamlined ordered ipad included beverage selection ranged coffee wine desired level sweetness checkout latte minute hoping typical heart feather latte listing possibility art idea'
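
In the sample output, "loved" and "ordered" are left unchanged because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed, and the tags produced by `nltk.pos_tag` are not forwarded to it. A minimal sketch of the missing step maps Penn Treebank tags to the single-letter WordNet codes that `lemmatize(word, pos=...)` accepts. The helper name `penn_to_wordnet` is hypothetical, and the mapping covers only the four broad word classes.

```python
# WordNet part-of-speech codes accepted by WordNetLemmatizer.lemmatize(word, pos=...)
ADJ, NOUN, ADV, VERB = "a", "n", "r", "v"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code."""
    if tag.startswith("J"):
        return ADJ
    if tag.startswith("V"):
        return VERB
    if tag.startswith("RB"):
        return ADV
    return NOUN  # fall back to noun, matching the lemmatizer's own default

# Hand-written (word, tag) pairs in the shape that nltk.pos_tag returns.
tagged = [("loved", "VBD"), ("atmosphere", "NN"), ("pretty", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Inside `process_corpus`, the lemmatization step could then call `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair. Note that tagging after stop-word removal gives the tagger less context, so tagging the full token sequence first may yield better tags.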

In [None]:
corpus_texts = corpus
corpus_labels = df['star_rating']

### 2. Basic Sentiment Analysis

In [None]:
sid = SentimentIntensityAnalyzer()

sent_polarity_info = [sid.polarity_scores(review) for review in df['full_review_text']]

sent_polarity_info

### 2.1 Sentiment Polarity

In [None]:
def classify_sentiment(score):
    if score['neg'] > score['pos']:
        return "Negative Sentiment"
    elif score['neg'] < score['pos']:
        return "Positive Sentiment"
    else:
        return "Neutral Sentiment"

In [None]:
def extract_sent_polarity(score):
    return score['compound']

In [None]:
review_sentiment = [classify_sentiment(scores) for scores in sent_polarity_info]

sent_polarity = [extract_sent_polarity(scores) for scores in sent_polarity_info]


df['str_sent'] = review_sentiment

df['sent_polarity'] = sent_polarity
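
The rule above compares only the `neg` and `pos` proportions, so a review is labeled neutral only when the two are exactly equal. An alternative convention, suggested in the VADER documentation, thresholds the `compound` score at ±0.05. The sketch below assumes that convention. `classify_by_compound` is a hypothetical name, and the sample dictionaries are made-up values for illustration, not scores from this dataset.

```python
def classify_by_compound(score, threshold=0.05):
    """Label a VADER score dict using its 'compound' value."""
    compound = score["compound"]
    if compound >= threshold:
        return "Positive Sentiment"
    if compound <= -threshold:
        return "Negative Sentiment"
    return "Neutral Sentiment"

# Illustrative score dicts in the shape returned by sid.polarity_scores(...)
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.93},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.61},
    {"neg": 0.1, "neu": 0.8, "pos": 0.1, "compound": 0.01},
]
labels = [classify_by_compound(s) for s in examples]
print(labels)
```

Either rule can fill `str_sent`; the compound-based one keeps the label consistent with the value stored in `sent_polarity`.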

### Extra: Saving Corpus as CSV, and Adding New Columns to Raw Dataset

In [None]:
# data_to_be_saved = {'corpus_text': corpus, 'corpus_labels': corpus_labels}
# to_csv_df = pd.DataFrame(data = data_to_be_saved, index=None)

In [None]:
# filepath = Path('datasets/yelp coffee/yelp_coffee_corpus.csv') 
# to_csv_df.to_csv(filepath,index=False)

In [None]:
# filepath_raw = Path('datasets/yelp coffee/raw_yelp_review_data.csv') 
# df.to_csv(filepath_raw,index=False)

Source for `more_stop_words`: https://www.ranks.nl/stopwords

Source for `even_more_stop_words`: https://countwordsfree.com/stopwords