### File: Classification

#### Goals and objectives of this file:

##### 1. Clean and pre-process the dataset
##### => Basic Cleaning Process => duplicate removal => checking missing labels => removing dates
##### => Pre-Processing data => stemming => removing stop words => bag of words => feature extraction/word vectorization

##### 2. Feature Engineering and extra model optimization steps
##### => Feature Engineering => feature selection => word embeddings => outlier detection => over/under sampling the classes
##### => Model Optimization => (Hyper)parameter Tuning => tests of different algorithms => tests of different neural network architectures

##### 3. Training and Results
##### => Training and Testing => accuracy => precision => confusion matrix => roc/auc curves => learning curve


##### 4. Sentiment Analysis
##### => Sentiment Polarity => convert words into emotions => Visualize emotions and words 

##### 5. Desktop Application
##### => Model Pipeline => GUI => feedback loop

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# nltk.download('vader_lexicon')  # uncomment on first run; SentimentIntensityAnalyzer needs this lexicon

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
df = pd.read_csv("../datasets/yelp coffee/raw_yelp_review_data.csv")

In [4]:
df.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating


In [5]:
df.shape

(7616, 3)

In [6]:
df.describe()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
count,7616,7616,7616
unique,79,6915,5
top,Epoch Coffee,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
freq,400,4,3780


### 1.1 Duplicate Removal

In [7]:
df.drop_duplicates(inplace = True)

### 1.2 Checking Missing Labels

In [8]:
df.isnull().value_counts()

coffee_shop_name  full_review_text  star_rating
False             False             False          6915
dtype: int64

### 1.3 Removing Dates

In [9]:
df['full_review_text'] = df['full_review_text'].str[11:]
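The fixed slice `str[11:]` only removes the date cleanly when the date prefix is exactly 11 characters long. A minimal sketch of a pattern-based alternative follows, assuming each raw review starts with optional whitespace and an M/D/YYYY date. The helper name `strip_leading_date` and the last two sample rows are hypothetical.

```python
import re

# Assumption: a raw review begins with optional whitespace, then an M/D/YYYY date.
DATE_PREFIX = re.compile(r"^\s*\d{1,2}/\d{1,2}/\d{4}\s*")

def strip_leading_date(text: str) -> str:
    """Remove a leading date and the whitespace around it, if one is present."""
    return DATE_PREFIX.sub("", text, count=1)

samples = [
    " 11/25/2016 1 check-in Love love loved the atmosphere!",
    " 12/2/2016 Listed in Date Night: Austin",
    " 1/2/2016 Short date prefix",   # hypothetical row with a one-digit month and day
    "No date at the start",          # hypothetical row without a date
]
for sample in samples:
    print(repr(strip_leading_date(sample)))
```

On the DataFrame, the same pattern could be applied with `df['full_review_text'].str.replace(r'^\s*\d{1,2}/\d{1,2}/\d{4}\s*', '', regex=True)`, which leaves reviews without a date prefix untouched.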

In [10]:
df.head(20)

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,1 check-in Love love loved the atmosphere! Ev...,5.0 star rating
1,The Factory - Cafe With a Soul,"Listed in Date Night: Austin, Ambiance in Aust...",4.0 star rating
2,The Factory - Cafe With a Soul,1 check-in Listed in Brunch Spots I loved the...,4.0 star rating
3,The Factory - Cafe With a Soul,Very cool decor! Good drinks Nice seating Ho...,2.0 star rating
4,The Factory - Cafe With a Soul,1 check-in They are located within the Northcr...,4.0 star rating
5,The Factory - Cafe With a Soul,1 check-in Very cute cafe! I think from the m...,4.0 star rating
6,The Factory - Cafe With a Soul,"2 check-ins Listed in ""Nuptial Coffee Bliss!""...",4.0 star rating
7,The Factory - Cafe With a Soul,2 check-ins Love this place! 5 stars for clea...,5.0 star rating
8,The Factory - Cafe With a Soul,"1 check-in Ok, let's try this approach... Pr...",3.0 star rating
9,The Factory - Cafe With a Soul,3 check-ins This place has been shown on my s...,5.0 star rating


### 1.4 Removing "star rating" from labels

In [11]:
df['star_rating'] = df['star_rating'].str[:2]
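Slicing with `str[:2]` keeps the labels as strings (the output below reports `dtype: object`) and carries over any leading whitespace. The sketch below shows one way to turn a label into an integer instead. The helper name `parse_star_rating` is hypothetical, and the leading space in the first sample is an assumption about the raw data.

```python
import re

# Matches the numeric part of labels such as "5.0 star rating".
RATING = re.compile(r"(\d+(?:\.\d+)?)\s*star rating")

def parse_star_rating(label: str) -> int:
    """Extract the numeric rating from a label and return it as an int."""
    match = RATING.search(label)
    if match is None:
        raise ValueError(f"unrecognised rating label: {label!r}")
    return int(float(match.group(1)))

labels = [" 5.0 star rating", "4.0 star rating", "2.0 star rating"]
print([parse_star_rating(label) for label in labels])  # [5, 4, 2]
```

Integer labels sort naturally and make it easier to group ratings later, for example into negative (1 to 2), neutral (3), and positive (4 to 5) classes.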

In [12]:
df['star_rating']

0        5
1        4
2        4
3        2
4        4
        ..
7611     4
7612     5
7613     4
7614     3
7615     4
Name: star_rating, Length: 6915, dtype: object

### 1.5 General Data Pre-Processing

#### The NLTK stop-word list was not sufficient to filter out contractions and greetings. Extra stop words were therefore scraped from the internet to cover these edge cases and produce a cleaner corpus.

In [13]:
more_stop_words = [
"a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves"
]

In [14]:
even_more_stop_words = ["able","about","above","abroad","according","accordingly","across","actually","adj","after","afterwards","again","against","ago","ahead","ain't","all","allow","allows","almost","alone","along","alongside","already","also","although","always","am","amid","amidst","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","a's","aside","ask","asking","associated","at","available","away","awfully","back","backward","backwards","be","became","because","become","becomes","becoming","been","before","beforehand","begin","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","came","can","cannot","cant","can't","caption","cause","causes","certain","certainly","changes","clearly","c'mon","co","co.","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","c's","currently","dare","daren't","definitely","described","despite","did","didn't","different","directly","do","does","doesn't","doing","done","don't","down","downwards","during","each","edu","eg","eight","eighty","either","else","elsewhere","end","ending","enough","entirely","especially","et","etc","even","ever","evermore","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","fairly","far","farther","few","fewer","fifth","first","five","followed","following","follows","for","forever","former","formerly","forth","forward","found","four","from","further","furthermore","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","had","hadn't","half","happens","hardly","has","hasn't","have","haven't","having","he","he'd","he'll","hello","help","hence","her","here","hereafter","hereby","herein","here's","hereupon","hers","herself","he's","hi","him",
"himself","his","hither","hopefully","how","howbeit","however","hundred","i'd","ie","if","ignored","i'll","i'm","immediate","in","inasmuch","inc","inc.","indeed","indicate","indicated","indicates","inner","inside","insofar","instead","into","inward","is","isn't","it","it'd","it'll","its","it's","itself","i've","just","k","keep","keeps","kept","know","known","knows","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","likewise","little","look","looking","looks","low","lower","ltd","made","mainly","make","makes","many","may","maybe","mayn't","me","mean","meantime","meanwhile","merely","might","mightn't","mine","minus","miss","more","moreover","most","mostly","mr","mrs","much","must","mustn't","my","myself","name","namely","nd","near","nearly","necessary","need","needn't","needs","neither","never","neverf","neverless","nevertheless","new","next","nine","ninety","no","nobody","non","none","nonetheless","noone","no-one","nor","normally","not","nothing","notwithstanding","novel","now","nowhere","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","one's","only","onto","opposite","or","other","others","otherwise","ought","oughtn't","our","ours","ourselves","out","outside","over","overall","own","particular","particularly","past","per","perhaps","placed","please","plus","possible","presumably","probably","provided","provides","que","quite","qv","rather","rd","re","really","reasonably","recent","recently","regarding","regardless","regards","relatively","respectively","right","round","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","shan't","she","she'd","she'll","she's","should","shouldn't","since","six","so","some","somebody","someday","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify",
"specifying","still","sub","such","sup","sure","take","taken","taking","tell","tends","th","than","thank","thanks","thanx","that","that'll","thats","that's","that've","the","their","theirs","them","themselves","then","thence","there","thereafter","thereby","there'd","therefore","therein","there'll","there're","theres","there's","thereupon","there've","these","they","they'd","they'll","they're","they've","thing","things","think","third","thirty","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","till","to","together","too","took","toward","towards","tried","tries","truly","try","trying","t's","twice","two","un","under","underneath","undoing","unfortunately","unless","unlike","unlikely","until","unto","up","upon","upwards","us","use","used","useful","uses","using","usually","v","value","various","versus","very","via","viz","vs","want","wants","was","wasn't","way","we","we'd","welcome","well","we'll","went","were","we're","weren't","we've","what","whatever","what'll","what's","what've","when","whence","whenever","where","whereafter","whereas","whereby","wherein","where's","whereupon","wherever","whether","which","whichever","while","whilst","whither","who","who'd","whoever","whole","who'll","whom","whomever","who's","whose","why","will","willing","wish","with","within","without","wonder","won't","would","wouldn't","yes","yet","you","you'd","you'll","your","you're","yours","yourself","yourselves","you've","zero","a","how's","i","when's","why's","b","c","d","e","f","g","h","j","l","m","n","o","p","q","r","s","t","u","uucp","w","x","y","z","I","www","amount","bill","bottom","call","computer","con","couldnt","cry","de","describe","detail","due","eleven","empty","fifteen","fifty","fill","find","fire","forty","front","full","give","hasnt","herse","himse","interest","itse”","mill","move","myse”","part","put","show","side","sincere","sixty","system","ten","thick","thin","top","twelve","twenty","abst","accordance","act","added","adopted",
"affected","affecting","affects","ah","announce","anymore","apparently","approximately","aren","arent","arise","auth","beginning","beginnings","begins","biol","briefly","ca","date","ed","effect","et-al","ff","fix","gave","giving","heres","hes","hid","home","id","im","immediately","importance","important","index","information","invention","itd","keys","kg","km","largely","lets","line","'ll","means","mg","million","ml","mug","na","nay","necessarily","nos","noted","obtain","obtained","omitted","ord","owing","page","pages","poorly","possibly","potentially","pp","predominantly","present","previously","primarily","promptly","proud","quickly","ran","readily","ref","refs","related","research","resulted","resulting","results","run","sec","section","shed","shes","showed","shown","showns","shows","significant","significantly","similar","similarly","slightly","somethan","specifically","state","states","stop","strongly","substantially","successfully","sufficiently","suggest","thered","thereof","therere","thereto","theyd","theyre","thou","thoughh","thousand","throug","til","tip","ts","ups","usefully","usefulness","'ve","vol","vols","wed","whats","wheres","whim","whod","whos","widely","words","world","youd","youre"
]

In [15]:
extra_stop_words = more_stop_words + even_more_stop_words

In [16]:
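# Build the combined stop-word set and the lemmatizer once, not on every call.
STOP_WORDS = set(stopwords.words('english')) | set(extra_stop_words)
LEMMATIZER = WordNetLemmatizer()

def create_corpus(text):
    """Lowercase, strip punctuation, drop stop words, and lemmatize one review."""
    text = text.lower()
    # Punctuation is removed before filtering, so "don't" becomes "dont" and
    # only matches stop-word entries that are spelled without an apostrophe.
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]
    return ' '.join(tokens)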
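Because `create_corpus` strips punctuation before it filters stop words, a contraction such as "don't" reaches the filter as "dont" and no longer matches an entry written with an apostrophe. A minimal sketch of one possible remedy is to normalize the stop words the same way as the text. The helper `normalize` and the sample list are hypothetical, and plain whitespace splitting stands in for `nltk.word_tokenize`.

```python
import string

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(text: str) -> str:
    """Lowercase the text and delete ASCII punctuation, as create_corpus does."""
    return text.lower().translate(_STRIP_PUNCT)

# Hypothetical sample of stop words that contain apostrophes.
sample_stop_words = ["don't", "it's", "the", "i've"]
raw_stop_set = set(sample_stop_words)
normalized_stop_set = {normalize(word) for word in sample_stop_words}

tokens = normalize("I've heard it's the best, don't miss it").split()

# Filtering against the raw list lets the stripped contractions through.
kept_raw = [t for t in tokens if t not in raw_stop_set]
# Filtering against the normalized list removes them.
kept_normalized = [t for t in tokens if t not in normalized_stop_set]

print(kept_raw)         # ['ive', 'heard', 'its', 'best', 'dont', 'miss', 'it']
print(kept_normalized)  # ['heard', 'best', 'miss', 'it']
```

The alternative is to filter stop words before stripping punctuation, which requires a tokenizer that keeps contractions intact.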

In [17]:
# Length in characters of the longest review (avoids shadowing the built-in max).
max_review_length = df['full_review_text'].str.len().max()
print(max_review_length)

5067


In [18]:
corpus = [create_corpus(review_corpus) for review_corpus in df['full_review_text']]

In [19]:
corpus[0]

'1 checkin love love loved atmosphere corner coffee shop style swing ordered matcha latte muy fantastico ordering drink pretty streamlined ordered ipad included beverage selection ranged coffee wine desired level sweetness checkout latte minute hoping typical heart feather latte listing possibility art idea'

In [20]:
corpus_texts = corpus
corpus_labels = df['star_rating']

In [21]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    corpus_texts, corpus_labels, test_size=0.2, random_state=42
)

### 2. Feature Engineering (TF-IDF Vectorizer)

In [22]:
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)
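To make the fit/transform split concrete, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies under its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. It is an illustration, not the library's implementation. The function names and toy documents are hypothetical, and whitespace splitting replaces the library's tokenizer.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight a document with the learned IDF and L2-normalize the result."""
    counts = Counter(t for t in doc.split() if t in idf)  # unseen terms are dropped
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["coffee latte love", "coffee slow service", "latte art love love"]
idf = fit_idf(train_docs)

# "espresso" never appeared in training, so it gets no weight at test time.
vec = transform("love coffee espresso", idf)
print(vec)
```

Fitting on the training texts alone, as the cell above does, keeps vocabulary and document frequencies from the test set out of the features.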

### 3. Test Model Building

In [23]:
# Logistic regression on the TF-IDF features.
model = LogisticRegression(C=10, max_iter=1000)
model.fit(train_features, train_labels)
# test_features was already computed in the TF-IDF cell, so it is reused here.
accuracy = model.score(test_features, test_labels)
print("Model accuracy: {:.2f}%".format(accuracy * 100))

Model accuracy: 54.16%
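The `df.describe()` output above shows that 3780 of the 7616 raw rows carry the 5-star label (about 49.6%), so accuracy is easier to interpret next to a majority-class baseline. The sketch below computes that baseline. The function name and toy labels are hypothetical, and the class proportions after duplicate removal may differ from the raw counts.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Hypothetical toy labels, stored as strings like the notebook's star_rating column.
toy_train = ["5", "5", "5", "4", "4", "3", "1"]
toy_test = ["5", "4", "5", "2"]

label, acc = majority_baseline_accuracy(toy_train, toy_test)
print(label, acc)  # 5 0.5
```

In the notebook this would be called with `train_labels` and `test_labels`. scikit-learn's `DummyClassifier(strategy="most_frequent")` gives the same kind of reference point.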


In [24]:
# Multinomial naive Bayes on the same TF-IDF features.
model_NB = MultinomialNB()
model_NB.fit(train_features, train_labels)
# test_features was already computed in the TF-IDF cell, so it is reused here.
accuracy = model_NB.score(test_features, test_labels)
print("Model accuracy: {:.2f}%".format(accuracy * 100))

Model accuracy: 47.79%
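The goals list also names precision and a confusion matrix, neither of which a single accuracy figure shows. Below is a small sketch of both, written in plain Python so the arithmetic is visible. The function names and toy predictions are hypothetical and are not results from the models above.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def precision_per_class(y_true, y_pred):
    """For each label: correct predictions of it / all predictions of it."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    result = {}
    for label in labels:
        predicted = sum(counts[(true, label)] for true in labels)
        result[label] = counts[(label, label)] / predicted if predicted else 0.0
    return result

# Hypothetical toy labels, stored as strings like the star_rating column.
y_true = ["5", "5", "4", "3", "4", "5"]
y_pred = ["5", "4", "4", "5", "4", "5"]

print(confusion_counts(y_true, y_pred))
print(precision_per_class(y_true, y_pred))
```

With the fitted models, `metrics.confusion_matrix(test_labels, model.predict(test_features))` and `metrics.classification_report(...)` from the already imported `sklearn.metrics` produce the same information for all five classes.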


### 4. Sentiment Analysis

In [25]:
sid = SentimentIntensityAnalyzer()

sent_polarity_info = [sid.polarity_scores(review) for review in df['full_review_text']]

sent_polarity_info

[{'neg': 0.0, 'neu': 0.823, 'pos': 0.177, 'compound': 0.9283},
 {'neg': 0.0, 'neu': 0.727, 'pos': 0.273, 'compound': 0.9187},
 {'neg': 0.004, 'neu': 0.816, 'pos': 0.18, 'compound': 0.9936},
 {'neg': 0.096, 'neu': 0.714, 'pos': 0.19, 'compound': 0.8047},
 {'neg': 0.016, 'neu': 0.835, 'pos': 0.149, 'compound': 0.9393},
 {'neg': 0.024, 'neu': 0.755, 'pos': 0.221, 'compound': 0.9852},
 {'neg': 0.017, 'neu': 0.849, 'pos': 0.134, 'compound': 0.9843},
 {'neg': 0.038, 'neu': 0.788, 'pos': 0.174, 'compound': 0.9919},
 {'neg': 0.052, 'neu': 0.734, 'pos': 0.214, 'compound': 0.997},
 {'neg': 0.053, 'neu': 0.828, 'pos': 0.12, 'compound': 0.8516},
 {'neg': 0.036, 'neu': 0.873, 'pos': 0.091, 'compound': 0.9474},
 {'neg': 0.173, 'neu': 0.712, 'pos': 0.115, 'compound': -0.6927},
 {'neg': 0.02, 'neu': 0.889, 'pos': 0.091, 'compound': 0.9023},
 {'neg': 0.0, 'neu': 0.785, 'pos': 0.215, 'compound': 0.7639},
 {'neg': 0.0, 'neu': 0.878, 'pos': 0.122, 'compound': 0.8176},
 {'neg': 0.027, 'neu': 0.76, 'pos': 0

In [26]:
def classify_sentiment(score):
    if score['neg'] > score['pos']:
        return "Negative Sentiment"
    elif score['neg'] < score['pos']:
        return "Positive Sentiment"
    else:
        return "Neutral Sentiment"
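`classify_sentiment` compares the `neg` and `pos` shares directly, so a review is labelled neutral only when the two are exactly equal. A common alternative with VADER is to threshold the `compound` score at ±0.05, the cut-offs suggested in the VADER documentation. The sketch below shows that variant. The function name `classify_by_compound` is hypothetical and this is not the rule the notebook uses.

```python
def classify_by_compound(score: dict, threshold: float = 0.05) -> str:
    """Label a VADER score dict by its compound value."""
    compound = score["compound"]
    if compound >= threshold:
        return "Positive Sentiment"
    if compound <= -threshold:
        return "Negative Sentiment"
    return "Neutral Sentiment"

# Two score dicts copied from the output above, plus one hypothetical neutral case.
examples = [
    {'neg': 0.0, 'neu': 0.823, 'pos': 0.177, 'compound': 0.9283},
    {'neg': 0.173, 'neu': 0.712, 'pos': 0.115, 'compound': -0.6927},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},  # hypothetical
]
for example in examples:
    print(classify_by_compound(example))
```

The two rules can disagree on mixed reviews, so it is worth checking which one lines up better with the star ratings before settling on either.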

In [27]:
def extract_sent_polarity(score):
    return score['compound']

In [28]:
review_sentiment = [classify_sentiment(scores) for scores in sent_polarity_info]

sent_polarity = [extract_sent_polarity(scores) for scores in sent_polarity_info]


df['str_sent'] = review_sentiment

df['sent_polarity'] = sent_polarity

In [29]:
df.head(20)

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,str_sent,sent_polarity
0,The Factory - Cafe With a Soul,1 check-in Love love loved the atmosphere! Ev...,5,Positive Sentiment,0.9283
1,The Factory - Cafe With a Soul,"Listed in Date Night: Austin, Ambiance in Aust...",4,Positive Sentiment,0.9187
2,The Factory - Cafe With a Soul,1 check-in Listed in Brunch Spots I loved the...,4,Positive Sentiment,0.9936
3,The Factory - Cafe With a Soul,Very cool decor! Good drinks Nice seating Ho...,2,Positive Sentiment,0.8047
4,The Factory - Cafe With a Soul,1 check-in They are located within the Northcr...,4,Positive Sentiment,0.9393
5,The Factory - Cafe With a Soul,1 check-in Very cute cafe! I think from the m...,4,Positive Sentiment,0.9852
6,The Factory - Cafe With a Soul,"2 check-ins Listed in ""Nuptial Coffee Bliss!""...",4,Positive Sentiment,0.9843
7,The Factory - Cafe With a Soul,2 check-ins Love this place! 5 stars for clea...,5,Positive Sentiment,0.9919
8,The Factory - Cafe With a Soul,"1 check-in Ok, let's try this approach... Pr...",3,Positive Sentiment,0.997
9,The Factory - Cafe With a Soul,3 check-ins This place has been shown on my s...,5,Positive Sentiment,0.8516


In [30]:
emotions_df = pd.read_csv("../datasets/yelp coffee/emotions.csv")
emotions_df

Unnamed: 0,word,emotion
0,victimized,cheated
1,accused,cheated
2,acquitted,singled out
3,adorable,loved
4,adored,loved
...,...,...
512,uncomfortable,anxious
513,underestimated,belittled
514,unhappy,sad
515,vindicated,singled out
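One way to approach the "convert words into emotions" goal with this table is a plain dictionary lookup over the tokens of a review. The sketch below uses a handful of rows copied from the output above. The function name `emotions_in` and the sample sentence are hypothetical.

```python
from collections import Counter

# A few word -> emotion pairs taken from the emotions table shown above.
word_to_emotion = {
    "victimized": "cheated",
    "accused": "cheated",
    "adorable": "loved",
    "adored": "loved",
    "unhappy": "sad",
    "uncomfortable": "anxious",
}

def emotions_in(text, lookup):
    """Count the emotions whose trigger words appear in the text."""
    return Counter(lookup[token] for token in text.lower().split() if token in lookup)

print(emotions_in("adorable cafe adored latte unhappy barista", word_to_emotion))
# Counter({'loved': 2, 'sad': 1})
```

For the full table, the lookup could be built with `dict(zip(emotions_df['word'], emotions_df['emotion']))` and applied to each entry of `corpus`.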


more stop words source: https://www.ranks.nl/stopwords

even more stop words source: https://countwordsfree.com/stopwords