### File: Classification

#### Goals and objectives of this file:

##### 1. Clean and pre-process the dataset
##### => Basic Cleaning Process => duplicate removal => checking missing labels => removing dates
##### => Pre-Processing data => stemming => removing stop words => bag of words => feature extraction/word vectorization

##### 2. Feature engineering and extra model optimization steps
##### => Feature Engineering => feature selection => word embeddings => outlier detection => over/under sampling the classes
##### => Model Optimization => (hyper)parameter tuning => testing different algorithms => testing different neural network architectures

##### 3. Training and Results
##### => Training and Testing => accuracy => precision => confusion matrix => ROC/AUC curves => learning curve

##### 4. Desktop Application
##### => Model Pipeline => GUI => feedback loop

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

In [2]:
df = pd.read_csv("../datasets/yelp coffee/raw_yelp_review_data.csv")

In [3]:
df.head()

| | coffee_shop_name | full_review_text | star_rating |
|---|---|---|---|
| 0 | The Factory - Cafe With a Soul | 11/25/2016 1 check-in Love love loved the atm... | 5.0 star rating |
| 1 | The Factory - Cafe With a Soul | 12/2/2016 Listed in Date Night: Austin, Ambia... | 4.0 star rating |
| 2 | The Factory - Cafe With a Soul | 11/30/2016 1 check-in Listed in Brunch Spots ... | 4.0 star rating |
| 3 | The Factory - Cafe With a Soul | 11/25/2016 Very cool decor! Good drinks Nice ... | 2.0 star rating |
| 4 | The Factory - Cafe With a Soul | 12/3/2016 1 check-in They are located within ... | 4.0 star rating |


In [4]:
df.shape

(7616, 3)

In [5]:
df.describe()

| | coffee_shop_name | full_review_text | star_rating |
|---|---|---|---|
| count | 7616 | 7616 | 7616 |
| unique | 79 | 6915 | 5 |
| top | Epoch Coffee | 11/25/2016 1 check-in Love love loved the atm... | 5.0 star rating |
| freq | 400 | 4 | 3780 |


### 1.1 Duplicate Removal

In [6]:
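# Currently disabled. Note that drop_duplicates() returns a new DataFrame,
# so enabling it requires assigning the result: df = df.drop_duplicates()
#df.drop_duplicates()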
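
The `describe()` output above reports 6915 unique review texts among 7616 rows, which suggests that some rows may be repeated. A minimal sketch of how the removal could be applied and checked, using a small made-up frame (`toy`, `deduped`, and `n_removed` are illustrative names, and the rows are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the review data; these rows are illustrative only.
toy = pd.DataFrame({
    "coffee_shop_name": ["Shop A", "Shop A", "Shop B"],
    "full_review_text": ["Great latte", "Great latte", "Too loud"],
    "star_rating": ["5.0 star rating", "5.0 star rating", "2.0 star rating"],
})

# Drop rows that are identical in every column and rebuild a clean index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Report how many rows were removed so the effect is visible.
n_removed = len(toy) - len(deduped)
print(f"removed {n_removed} duplicate row(s), {len(deduped)} remain")
```

On the real data, the same pattern would be `df = df.drop_duplicates().reset_index(drop=True)`; the row counts shown later in this notebook were produced without that step.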

### 1.2 Checking Missing Labels

In [7]:
df.isnull().value_counts()

coffee_shop_name  full_review_text  star_rating
False             False             False          7616
dtype: int64
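
`isnull().value_counts()` counts distinct row-wise missingness patterns; a per-column count can be easier to read when some values are missing. The following is a sketch on made-up rows (the frame `toy` and the names `missing_per_column` and `complete` are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Toy frame with deliberately missing entries; not taken from the dataset.
toy = pd.DataFrame({
    "full_review_text": ["Great latte", None, "Too loud"],
    "star_rating": ["5.0 star rating", "4.0 star rating", None],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Keep only rows that have both a review text and a label.
complete = toy.dropna(subset=["full_review_text", "star_rating"])
print(len(complete), "complete row(s)")
```

For the dataset loaded here, the output above already shows that all 7616 rows are complete, so no rows would be dropped.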

### 1.3 Removing Dates

In [8]:
df['full_review_text'] = df['full_review_text'].str[11:]
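
The fixed-width slice assumes every date prefix occupies exactly 11 characters, yet dates such as `12/2/2016` are shorter than `11/25/2016`. A pattern-based removal may be more robust. The sketch below assumes the date sits at the start of the text, optionally preceded by whitespace (the raw strings are not fully visible in this notebook, so the leading space in the toy data is an assumption), and the names `toy`, `date_prefix`, and `cleaned` are illustrative:

```python
import pandas as pd

# Illustrative strings with date prefixes of different lengths.
toy = pd.Series([
    " 11/25/2016 1 check-in Love love loved the atmosphere!",
    " 12/2/2016 Listed in Date Night",
    "1/3/2017 Very cool decor!",
])

# Optional leading whitespace, then M/D/YYYY with a 1- or 2-digit month and
# day, then any whitespace that follows the date.
date_prefix = r"^\s*\d{1,2}/\d{1,2}/\d{4}\s*"

# Strip the matched prefix; text without a leading date is left unchanged.
cleaned = toy.str.replace(date_prefix, "", regex=True)
print(cleaned.tolist())
```

Applied to the real column, this would read `df['full_review_text'] = df['full_review_text'].str.replace(date_prefix, "", regex=True)`.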

In [9]:
df.head(20)

| | coffee_shop_name | full_review_text | star_rating |
|---|---|---|---|
| 0 | The Factory - Cafe With a Soul | 1 check-in Love love loved the atmosphere! Ev... | 5.0 star rating |
| 1 | The Factory - Cafe With a Soul | Listed in Date Night: Austin, Ambiance in Aust... | 4.0 star rating |
| 2 | The Factory - Cafe With a Soul | 1 check-in Listed in Brunch Spots I loved the... | 4.0 star rating |
| 3 | The Factory - Cafe With a Soul | Very cool decor! Good drinks Nice seating Ho... | 2.0 star rating |
| 4 | The Factory - Cafe With a Soul | 1 check-in They are located within the Northcr... | 4.0 star rating |
| 5 | The Factory - Cafe With a Soul | 1 check-in Very cute cafe! I think from the m... | 4.0 star rating |
| 6 | The Factory - Cafe With a Soul | 2 check-ins Listed in "Nuptial Coffee Bliss!"... | 4.0 star rating |
| 7 | The Factory - Cafe With a Soul | 2 check-ins Love this place! 5 stars for clea... | 5.0 star rating |
| 8 | The Factory - Cafe With a Soul | 1 check-in Ok, let's try this approach... Pr... | 3.0 star rating |
| 9 | The Factory - Cafe With a Soul | 3 check-ins This place has been shown on my s... | 5.0 star rating |
