# Dataset and Question
---
Dataset: https://www.kaggle.com/benroshan/online-food-delivery-preferencesbangalore-region

With so many restaurants in Singapore, having an edge in attracting customers is lucrative.  
One way a restaurant can naturally attract more customers is through **reviews** from its customers.  
The many online food delivery services act as a review bank for restaurants and are hence a good place to start exploring.  
We want to find out what would make customers come back and order again if we were ever to set up our own restaurant with its own online delivery service.  

>**What are the optimal factors for a restaurant to attract consumers via food delivery service?**

Our dataset is set in Bangalore, a metropolitan city in India.  
It was gathered in response to the recent (2021) rise in demand for online delivery there.

# Essential Libraries
---
    > NumPy : Library for Numeric Computations in Python
    > Pandas : Library for Data Acquisition and Preparation
    > Matplotlib : Low-level library for Data Visualization
    > Seaborn : Higher-level library for Data Visualization
    > Scikit Learn : Regressions and Classification

In [3]:
import gmaps

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import datacleaner as dc
import gmaps
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.tree import plot_tree
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import mean_squared_error
# from sklearn.metrics import confusion_matrix
# from sklearn.metrics import roc_curve
# from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
sb.set()

AttributeError: module 'collections' has no attribute 'MutableMapping'
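
This error is typical of an older dependency running on Python 3.10 or newer, where deprecated aliases such as `collections.MutableMapping` were removed in favour of `collections.abc.MutableMapping`. The output does not show which of the imports above triggers it, so the following is only a possible workaround sketch: it restores the old alias and would need to run before the failing import. Upgrading or removing the offending package is the cleaner fix.

```python
import collections
import collections.abc

# Assumption: one of the imported packages still refers to collections.MutableMapping,
# which Python 3.10 removed. Restore the alias so that the import can succeed.
if not hasattr(collections, "MutableMapping"):
    collections.MutableMapping = collections.abc.MutableMapping
```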

In [21]:
foodDelivery = pd.read_csv("onlinedeliverydata.csv")

# EDA
--------

## Dataset Analysis
---

Categorical variables far outnumber numerical ones, as seen from the many `object` columns in the dataset info.
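
The split between categorical and numerical columns can also be computed rather than read off `info()`. The sketch below uses a hypothetical three-row sample with a handful of the survey's column names. On the real data, `foodDelivery` would take the place of `sample`.

```python
import pandas as pd

# Hypothetical three-row sample with a few of the survey's columns
sample = pd.DataFrame({
    "Age": [20, 24, 22],
    "Gender": ["Female", "Female", "Male"],
    "Occupation": ["Student", "Student", "Student"],
    "Family size": [4, 3, 3],
    "latitude": [12.9766, 12.9770, 12.9551],
})

# Numerical columns are ints and floats; everything else is treated as categorical
numerical_cols = sample.select_dtypes(include = "number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude = "number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```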

In [9]:
foodDelivery.head(None)

Unnamed: 0,Age,Gender,Marital Status,Occupation,Monthly Income,Educational Qualifications,Family size,latitude,longitude,Pin code,...,Less Delivery time,High Quality of package,Number of calls,Politeness,Freshness,Temperature,Good Taste,Good Quantity,Output,Reviews
0,20,Female,Single,Student,No Income,Post Graduate,4,12.9766,77.5993,560001,...,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Yes,Nil\r\n
1,24,Female,Single,Student,Below Rs.10000,Graduate,3,12.9770,77.5773,560009,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Yes,Nil
2,22,Male,Single,Student,Below Rs.10000,Post Graduate,3,12.9551,77.6593,560017,...,Important,Very Important,Moderately Important,Very Important,Very Important,Important,Very Important,Moderately Important,Yes,"Many a times payment gateways are an issue, so..."
3,22,Female,Single,Student,No Income,Graduate,6,12.9473,77.5616,560019,...,Very Important,Important,Moderately Important,Very Important,Very Important,Very Important,Very Important,Important,Yes,nil
4,22,Male,Single,Student,Below Rs.10000,Post Graduate,4,12.9850,77.5533,560010,...,Important,Important,Moderately Important,Important,Important,Important,Very Important,Very Important,Yes,NIL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
383,23,Female,Single,Student,No Income,Post Graduate,2,12.9766,77.5993,560001,...,Important,Important,Important,Important,Important,Important,Important,Important,Yes,Nil
384,23,Female,Single,Student,No Income,Post Graduate,4,12.9854,77.7081,560048,...,Moderately Important,Very Important,Moderately Important,Moderately Important,Moderately Important,Moderately Important,Very Important,Very Important,Yes,Nil
385,22,Female,Single,Student,No Income,Post Graduate,5,12.9850,77.5533,560010,...,Important,Very Important,Important,Important,Very Important,Very Important,Very Important,Very Important,Yes,Nil
386,23,Male,Single,Student,Below Rs.10000,Post Graduate,2,12.9770,77.5773,560009,...,Important,Very Important,Important,Very Important,Very Important,Important,Very Important,Very Important,Yes,Language barrier is also one major issue. Mosl...


In [10]:
foodDelivery.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388 entries, 0 to 387
Data columns (total 55 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Age                                        388 non-null    int64  
 1   Gender                                     388 non-null    object 
 2   Marital Status                             388 non-null    object 
 3   Occupation                                 388 non-null    object 
 4   Monthly Income                             388 non-null    object 
 5   Educational Qualifications                 388 non-null    object 
 6   Family size                                388 non-null    int64  
 7   latitude                                   388 non-null    float64
 8   longitude                                  388 non-null    float64
 9   Pin code                                   388 non-null    int64  
 10  Medium (P1)               

In [11]:
foodDelivery.columns

Index(['Age', 'Gender', 'Marital Status', 'Occupation', 'Monthly Income',
       'Educational Qualifications', 'Family size', 'latitude', 'longitude',
       'Pin code', 'Medium (P1)', 'Medium (P2)', 'Meal(P1)', 'Meal(P2)',
       'Perference(P1)', 'Perference(P2)', 'Ease and convenient',
       'Time saving', 'More restaurant choices', 'Easy Payment option',
       'More Offers and Discount', 'Good Food quality', 'Good Tracking system',
       'Self Cooking', 'Health Concern', 'Late Delivery', 'Poor Hygiene',
       'Bad past experience', 'Unavailability', 'Unaffordable',
       'Long delivery time', 'Delay of delivery person getting assigned',
       'Delay of delivery person picking up food', 'Wrong order delivered',
       'Missing item', 'Order placed by mistake', 'Influence of time',
       'Order Time', 'Maximum wait time', 'Residence in busy location',
       'Google Maps Accuracy', 'Good Road Condition', 'Low quantity low time',
       'Delivery person ability', 'Influence of 

## Univariate Analysis
---

#### For our dataset, since there are 55 variables, we have broken down the dataset into a few smaller categories:
- Consumer Demographics:
    - Basic Information:
        - Age
        - Gender
    - Family:
        - Marital Status
        - Occupation
        - Monthly Income
        - Educational Qualifications
        - Family size
    - Residence:
        - Latitude
        - Longitude
        - Pin code
    - Delivery Preferences:
        - Medium of order (Preference 1)
        - Medium of order (Preference 2)
        - Meal-of-the-day of order (Preference 1)
        - Meal-of-the-day of order (Preference 2)
        - General Type of Food (Preference 1)
        - General Type of Food (Preference 2)
        - Order Time (Time of day to order)
        - Maximum Wait Time (Before cancelling the order)


- Location:
    - Residence in busy location
    - Google Maps Accuracy
    - Good Road Condition  
    

- Customer Experience:
    - Time Factors:
        - Saves Time
        - Good Tracking System
        - Late Delivery
        - Long delivery time
        - Delay of delivery person getting assigned
        - Delay of delivery person picking up food
        - Low Quantity Low Time (Quantity of food affects delivery time)
    - Food Factors:
        - More restaurant choices available
        - Good Food Quality
        - Health concern
        - Poor Hygiene
        - Unavailability
        - Unaffordability
    - Others:
        - Ease and Convenience
        - Ease of Payment Option
        - More Offers and Discounts
        - Self-cooking (Customer cooks)
        - Bad Past Experiences
        - Delivery person ability
        - Wrong order delivered
        - Missing item
        - Order placed by mistake
        - Influence of Time (Order affects delivery time)
        - Influence of rating (Current restaurant rating affects order)
        - **Output** (Is the customer satisfied with the food order?)
        - Reviews


- Customer's Demands Importance:
    - Less Delivery Time
    - High Quality of Package
    - Number of calls
    - Politeness
    - Freshness
    - Temperature
    - Good Taste
    - Good Quantity

### Consumer Demographics - Basic Information
---
From the following plots,
- Respondents are concentrated between the ages of 22 and 25
- About 50 more males than females  

We can conclude that,
- Data came from mostly young people
- Not much disparity in gender representation

In [None]:
f, axes = plt.subplots(2, 1, figsize = (10, 10))
f = sb.countplot(x = "Age", data = foodDelivery, ax = axes[0])
f = sb.countplot(x = "Gender", data = foodDelivery, ax = axes[1], order = foodDelivery["Gender"].value_counts().index)

### Consumer Demographics - Family 
--- 
From the following plots,
- Most are single
- Most are either students or employed
- Most have no income
- Most are graduates or post-graduates
- Most have family size of 3 or 2  

We can conclude that,
- Data came from university students




In [None]:
f, axes = plt.subplots(5, 1, figsize = (10, 30))
f = sb.countplot(x = "Marital Status", data = foodDelivery, ax = axes[0])
f = sb.countplot(x = "Occupation", data = foodDelivery, ax = axes[1])
f = sb.countplot(x = "Monthly Income", data = foodDelivery, ax = axes[2], order = foodDelivery["Monthly Income"].value_counts().index)
f = sb.countplot(x = "Educational Qualifications", data = foodDelivery, ax = axes[3], order = foodDelivery["Educational Qualifications"].value_counts().index)
f = sb.countplot(x = "Family size", data = foodDelivery, ax = axes[4], order = foodDelivery["Family size"].value_counts().index)

### Consumer Demographics - Residence
---
The geolocation of each client is recorded in the survey, so we can use the gmaps API to plot their locations as a heatmap on a map.

We can see that clients are spread throughout Bangalore but are more concentrated towards the centre of the city. 

Hence, it may be beneficial for future restaurant owners to position their restaurants near darker spots on the heatmap to minimise delivery times. 

Since Latitude, Longitude, and Pin code only show us where the customers come from, they cannot help us answer our question, so we drop them.

In [22]:
gmaps.configure(api_key='YOUR_API_KEY') # Fill in with your own Google Maps API key; never publish a real key in a notebook

loc = pd.DataFrame(foodDelivery[['latitude','longitude']])

loc.head()

   latitude  longitude
0   12.9766    77.5993
1   12.9770    77.5773
2   12.9551    77.6593
3   12.9473    77.5616
4   12.9850    77.5533

In [23]:
fig = gmaps.figure()
heatmap = gmaps.heatmap_layer(loc)

heatmap.max_intensity = 0.03
heatmap.point_radius = 0.014
heatmap.dissipating = False


##fig = gmaps.figure(map_type='SATELLITE')
fig.add_layer(heatmap)
fig



Figure(layout=FigureLayout(height='420px'))

In [None]:
foodDelivery = foodDelivery.drop(columns = ["latitude", "longitude", "Pin code"])

### Consumer Demographics - Delivery Preferences
---
We do not need to predict Medium of order, Meal-of-the-day of order, or Order Time, because a restaurant can simply be available on all mediums and stay open all day. We therefore drop these variables.  
From the following plots,
- Most prefer Non-veg foods
- Maximum wait time for most is either 45 or 30 minutes

We can conclude that,
- University students mostly eat Non-veg foods
- Delivery time should not exceed 45 minutes


In [None]:
foodDelivery = foodDelivery.drop(columns = ["Medium (P1)", "Medium (P2)", "Meal(P1)", "Meal(P2)", "Order Time"])

f, axes = plt.subplots(3, 1, figsize = (15, 30))
f = sb.countplot(x = "Perference(P1)", data = foodDelivery, ax = axes[0], order = foodDelivery["Perference(P1)"].value_counts().index)
f = sb.countplot(x = "Perference(P2)", data = foodDelivery, ax = axes[1], order = foodDelivery["Perference(P2)"].value_counts().index)
f = sb.countplot(x = "Maximum wait time", data = foodDelivery, ax = axes[2], order = foodDelivery["Maximum wait time"].value_counts().index)

### Location
---
Since we cannot control these variables in real-time, we drop these variables.

In [14]:
# errors = "ignore" keeps this cell safe to re-run after the columns have already been dropped
foodDelivery = foodDelivery.drop(columns = ["Residence in busy location", "Google Maps Accuracy", "Good Road Condition"], errors = "ignore")

### Customer Experience: Time Factors
---
We will be performing Text Analysis on the Reviews.  
From the following plots,
- Most agree that online delivery saves time
- Most agree that there is a good delivery tracking system
- Most agree that their deliveries are late
- Most agree that their deliveries take a long time
- Most agree that there is a delay in delivery person getting assigned
- Most agree that there is a delay in delivery person picking up food
- Most agree that the lower the quantity of food they buy, the quicker their food is delivered

In [None]:
f, axes = plt.subplots(7, 1, figsize = (15, 50))
f = sb.countplot(x = "Time saving", data = foodDelivery, ax = axes[0], order = foodDelivery["Time saving"].value_counts().index)
f = sb.countplot(x = "Good Tracking system", data = foodDelivery, ax = axes[1], order = foodDelivery["Good Tracking system"].value_counts().index)
f = sb.countplot(x = "Late Delivery", data = foodDelivery, ax = axes[2], order = foodDelivery["Late Delivery"].value_counts().index)
f = sb.countplot(x = "Long delivery time", data = foodDelivery, ax = axes[3], order = foodDelivery["Long delivery time"].value_counts().index)
f = sb.countplot(x = "Delay of delivery person getting assigned", data = foodDelivery, ax = axes[4], order = foodDelivery["Delay of delivery person getting assigned"].value_counts().index)
f = sb.countplot(x = "Delay of delivery person picking up food", data = foodDelivery, ax = axes[5], order = foodDelivery["Delay of delivery person picking up food"].value_counts().index)
f = sb.countplot(x = "Low quantity low time", data = foodDelivery, ax = axes[6], order = foodDelivery["Low quantity low time"].value_counts().index)


### Customer Experience: Food Factors
---
From the following plots,
- Most agree that there are many restaurant choices
- Most agree that food quality is good
- Equal number of people agree and disagree they are concerned with health when ordering food online
- Equal number of people agree and disagree the restaurant has poor hygiene
- Most people disagree there is unavailability when ordering food
- Most disagree that online delivered food is unaffordable

In [None]:
f, axes = plt.subplots(6, 1, figsize = (15, 50))
f = sb.countplot(x = "More restaurant choices", data = foodDelivery, ax = axes[0], order = foodDelivery["More restaurant choices"].value_counts().index)
f = sb.countplot(x = "Good Food quality", data = foodDelivery, ax = axes[1], order = foodDelivery["Good Food quality"].value_counts().index)
f = sb.countplot(x = "Health Concern", data = foodDelivery, ax = axes[2], order = foodDelivery["Health Concern"].value_counts().index)
f = sb.countplot(x = "Poor Hygiene", data = foodDelivery, ax = axes[3], order = foodDelivery["Poor Hygiene"].value_counts().index)
f = sb.countplot(x = "Unavailability", data = foodDelivery, ax = axes[4], order = foodDelivery["Unavailability"].value_counts().index)
f = sb.countplot(x = "Unaffordable", data = foodDelivery, ax = axes[5], order = foodDelivery["Unaffordable"].value_counts().index)



### Customer Experience: Others
---
We will be performing Text Analysis on the Reviews.  
From the following plots,
- Most agree that food deliveries provide ease and convenience
- Most agree payment options are easy 
- Most agree there are more offers and discounts
- Almost an equal number of people cook at home and order food online
- Most disagree they had bad past experiences with ordering food online
- Most agree the delivery person was good at delivering food
- Most disagree they had the wrong order delivered to them
- Most disagree they had a missing item in their orders
- Most disagree they place their orders by mistake
- Most said delivery time influences their order 
- Most said rating of restaurant influences their order
- Most people are satisfied with their food order  

We conclude that,
- ???


In [None]:
f, axes = plt.subplots(12, 1, figsize = (15, 70))
f = sb.countplot(x = "Ease and convenient", data = foodDelivery, ax = axes[0], order = foodDelivery["Ease and convenient"].value_counts().index)
f = sb.countplot(x = "Easy Payment option", data = foodDelivery, ax = axes[1], order = foodDelivery["Easy Payment option"].value_counts().index)
f = sb.countplot(x = "More Offers and Discount", data = foodDelivery, ax = axes[2], order = foodDelivery["More Offers and Discount"].value_counts().index)
f = sb.countplot(x = "Self Cooking", data = foodDelivery, ax = axes[3], order = foodDelivery["Self Cooking"].value_counts().index)
f = sb.countplot(x = "Bad past experience", data = foodDelivery, ax = axes[4], order = foodDelivery["Bad past experience"].value_counts().index)
f = sb.countplot(x = "Delivery person ability", data = foodDelivery, ax = axes[5], order = foodDelivery["Delivery person ability"].value_counts().index)
f = sb.countplot(x = "Wrong order delivered", data = foodDelivery, ax = axes[6], order = foodDelivery["Wrong order delivered"].value_counts().index)
f = sb.countplot(x = "Missing item", data = foodDelivery, ax = axes[7], order = foodDelivery["Missing item"].value_counts().index)
f = sb.countplot(x = "Order placed by mistake", data = foodDelivery, ax = axes[8], order = foodDelivery["Order placed by mistake"].value_counts().index)
f = sb.countplot(x = "Influence of time", data = foodDelivery, ax = axes[9], order = foodDelivery["Influence of time"].value_counts().index)
f = sb.countplot(x = "Influence of rating", data = foodDelivery, ax = axes[10], order = foodDelivery["Influence of rating"].value_counts().index)
f = sb.countplot(x = "Output", data = foodDelivery, ax = axes[11], order = foodDelivery["Output"].value_counts().index)

### Customer's Demands Importance
---
From the following plots,
- Most think less delivery time is important for satisfaction
- Most think high quality of packaging is important for satisfaction
- Most think number of calls is important for satisfaction
- Most think politeness of the delivery person is important for satisfaction
- Most think freshness of food is very important for satisfaction
- Most think temperature of food is important for satisfaction
- Most think good taste of food is very important for satisfaction
- Most think good quantity of food is very important for satisfaction


In [None]:
f, axes = plt.subplots(8, 1, figsize = (15, 50))
f = sb.countplot(x = "Less Delivery time", data = foodDelivery, ax = axes[0], order = foodDelivery["Less Delivery time"].value_counts().index)
f = sb.countplot(x = "High Quality of package", data = foodDelivery, ax = axes[1], order = foodDelivery["High Quality of package"].value_counts().index)
f = sb.countplot(x = "Number of calls", data = foodDelivery, ax = axes[2], order = foodDelivery["Number of calls"].value_counts().index)
f = sb.countplot(x = "Politeness", data = foodDelivery, ax = axes[3], order = foodDelivery["Politeness"].value_counts().index)
f = sb.countplot(x = "Freshness ", data = foodDelivery, ax = axes[4], order = foodDelivery["Freshness "].value_counts().index)
f = sb.countplot(x = "Temperature", data = foodDelivery, ax = axes[5], order = foodDelivery["Temperature"].value_counts().index)
f = sb.countplot(x = "Good Taste ", data = foodDelivery, ax = axes[6], order = foodDelivery["Good Taste "].value_counts().index)
f = sb.countplot(x = "Good Quantity", data = foodDelivery, ax = axes[7], order = foodDelivery["Good Quantity"].value_counts().index)

# Multivariate Analysis?
---

# Data Cleaning
---

## Missing Values
---
Here, we check for NaN values in our dataset

In [None]:
foodDelivery.isnull().sum()

## Encoding 
---
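To start, since most of the categories are ordinal (ordered categories with uneven intervals), we encode the different categorical levels as numbers. Nominal variables such as Gender and Occupation are also mapped to integer codes, although their numeric order carries no meaning.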

In [None]:
cleanup_nums = {"Gender": {"Male": 0, "Female": 1},
                "Marital Status": {"Single": 0, "Married": 1, "Prefer not to say": 2},
                "Occupation": {"Student": 0, "Employee": 1, "House wife": 2, "Self Employeed": 3},
                "Monthly Income": {"No Income": 0, "Below Rs.10000":1, "10001 to 25000": 2, "25001 to 50000": 3, "More than 50000": 4}, 
                "Educational Qualifications": {"Uneducated": 0, "School": 1, "Graduate": 2, "Post Graduate": 3, "Ph.D": 4},
                "Perference(P1)": {"Non Veg foods (Lunch / Dinner)": 0, "Veg foods (Breakfast / Lunch / Dinner)": 1, "Sweets": 2, "Bakery items (snacks)": 3},
                "Perference(P2)": {"Non Veg foods (Lunch / Dinner)": 0, " Veg foods (Breakfast / Lunch / Dinner)": 1, " Sweets": 2, " Bakery items (snacks)": 3, " Ice cream / Cool drinks": 4},
                "Ease and convenient": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Time saving": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "More restaurant choices": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Easy Payment option": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "More Offers and Discount": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Good Food quality": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Good Tracking system": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Self Cooking": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Health Concern": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Late Delivery": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Poor Hygiene": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Bad past experience": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Unavailability": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Unaffordable": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Long delivery time": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Delay of delivery person getting assigned": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Delay of delivery person picking up food": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Wrong order delivered": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Missing item": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Order placed by mistake": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4},
                "Influence of time": {"No": 0, "Maybe": 1, "Yes": 2},
#                "Order Time": {"Anytime (Mon-Sun)": 0, "Weekdays (Mon-Fri)": 1, "Weekend (Sat & Sun)": 2},
                "Maximum wait time": {"15 minutes": 0, "30 minutes": 1, "45 minutes": 2, "60 minutes": 3, "More than 60 minutes": 4},
#                "Residence in busy location": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly Agree": 4},
#                "Google Maps Accuracy": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly Agree": 4},
#                "Good Road Condition": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly Agree": 4},
                "Low quantity low time": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly Agree": 4},
                "Delivery person ability": {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly Agree": 4},
                "Influence of rating": {"No": 0, "Maybe": 1, "Yes": 2},
                "Less Delivery time": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "High Quality of package": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "Number of calls": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "Politeness": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "Freshness ": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "Temperature": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "Good Taste ": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "Good Quantity": {"Unimportant": 0, "Slightly Important": 1, "Important": 2, "Moderately Important": 3, "Very Important": 4},
                "Output": {"No": 0, "Yes": 1}
               }

In [None]:
foodDelivery = foodDelivery.replace(cleanup_nums)
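
Most of the dictionary above repeats the same five-point scale. As a minimal sketch, assuming the labels are spelled identically in every column, the scale can be defined once and reused. The column subset and sample rows below are hypothetical.

```python
import pandas as pd

# Shared five-point agreement scale, from strongest disagreement to strongest agreement
agreement_scale = {"Strongly disagree": 0, "Disagree": 1, "Neutral": 2, "Agree": 3, "Strongly agree": 4}

# Hypothetical subset of the agreement columns; the full list is in cleanup_nums
agreement_cols = ["Time saving", "Late Delivery", "Unaffordable"]

# Reuse one scale for every column instead of retyping it
mapping = {col: agreement_scale for col in agreement_cols}

sample = pd.DataFrame({
    "Time saving": ["Agree", "Strongly agree", "Neutral"],
    "Late Delivery": ["Disagree", "Neutral", "Agree"],
    "Unaffordable": ["Strongly disagree", "Disagree", "Neutral"],
})

encoded = sample.copy()
for col, scale in mapping.items():
    # map() turns any label missing from the scale into NaN, which makes typos easy to spot
    encoded[col] = sample[col].map(scale)

print(encoded)
```

Checking `encoded.isnull().sum()` afterwards is a quick way to catch labels that differ only in capitalisation or stray spaces.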

## Text Analysis
---
Since models cannot train and test on raw text, we need to convert text into either numbers or arrays.    
To convert text into numbers, we can use lists of positive and negative words and add +1 or -1 to a sentiment count for each match (sentiment analysis based on the connotation of words).   
To convert text into arrays, Sklearn provides a few ways:
>One-hot Encoding, which is used more for categorical variables and is not suitable for text data  
>Bag of Words, which converts the Reviews data into an array of numbers, one per word
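
The word-list approach to a sentiment count can be sketched as follows. The two word sets are hypothetical and far too small for real use. An actual analysis would load a published sentiment lexicon.

```python
# Hypothetical miniature word lists, for illustration only
positive_words = {"good", "great", "fast", "fresh", "tasty"}
negative_words = {"bad", "late", "cold", "rude", "issue"}

def sentiment_count(review):
    """Return (number of positive words) - (number of negative words) in a review."""
    score = 0
    for word in review.lower().split():
        word = word.strip(".,!?")  # drop punctuation attached to the word
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
    return score

print(sentiment_count("Good food, fast delivery!"))
print(sentiment_count("Food was cold and delivery was late."))
```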

Sklearn provides 3 ways to build a Bag of Words: CountVectorizer, TfidfVectorizer, and HashingVectorizer. These weigh words by how often or how rarely they occur and do not measure sentiment by themselves.    
>CountVectorizer tokenizes the text and counts how many times each word occurs in each review.
>TfidfVectorizer (Term-Frequency Inverse-Document-Frequency) builds on CountVectorizer by down-weighting words that occur in many reviews.
>HashingVectorizer counts words like CountVectorizer but hashes them to column indices instead of storing a vocabulary. This is useful for very large sets of words.  

We will be using TfidfVectorizer since, unlike CountVectorizer, it down-weights common and uninformative words, and our dataset is small enough to store the full vocabulary.

TfidfVectorizer calculates the Term Frequency (the number of times a word appears in a review) and the Inverse Document Frequency (how rare a word is across all reviews) to derive TF-IDF. With the default settings used here:  
Term Frequency tf(t, d) = Number of times term t appears in document d  
Inverse Document Frequency idf(t) = ln( (1 + Number of documents) / (1 + df(t)) ) + 1, where df(t) is the number of documents containing term t   
TF-IDF(t, d) = tf(t, d) * idf(t), after which each document's vector is scaled to unit Euclidean length  
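
As a sanity check on the formula, TF-IDF can be computed by hand and compared with scikit-learn. The sketch assumes `TfidfVectorizer` is left at its defaults, which means raw counts as term frequency, the smoothed idf `ln((1 + n) / (1 + df(t))) + 1`, and L2 normalisation of each row. The three short reviews are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up reviews standing in for the survey's Reviews column
docs = ["good food fast delivery", "late delivery cold food", "good taste"]

# Term frequency: raw count of each word in each document
counts = CountVectorizer().fit_transform(docs).toarray()

# Smoothed inverse document frequency
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis = 0)            # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF, then scale every document vector to unit Euclidean length
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis = 1, keepdims = True)

# Both vectorizers sort their vocabulary alphabetically, so the columns line up
tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```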

In [None]:
reviews = foodDelivery[['Reviews']].copy() # x, copied so that foodDelivery is left untouched

# Method 1: Datacleaner Method
# def datacleaner(dataframe):
#     dc.autoclean(dataframe)
# datacleaner(reviews)
# reviews.head()

# Method 2: 
# Data Cleaning Reviews: remove punctuation and convert to lowercase before building the matrix of words
def clean_text(text):
    # keep only letters and spaces, in lowercase
    return "".join(char.lower() for char in text if char.isalpha() or char == " ")

# assigning the whole column back avoids chained indexing (dataset.iloc[i][0] = ...), which may silently modify a copy
reviews['Reviews'] = reviews['Reviews'].apply(clean_text)
review_x = reviews['Reviews'] # extract the column of cleaned sentences
tfidf = TfidfVectorizer(max_features = 1000, stop_words = None) # can limit the maximum number of words tracked and exclude stopwords (common, non-sentimental words)
review_x = tfidf.fit_transform(review_x) # transform the column into a sparse matrix of TF-IDF weights
# name each column after its word (needs scikit-learn >= 1.0); the prefix avoids clashes with survey columns
review_x = pd.DataFrame(review_x.toarray(), columns = ["tfidf_" + word for word in tfidf.get_feature_names_out()])

In [None]:
foodDelivery = pd.concat([foodDelivery, review_x], axis = 1)
foodDelivery.head()

# Regression Model
---
To answer our question, we shall use Logistic Regression.

In [None]:
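# Drop the target and the raw review text, which the TF-IDF columns already represent
Input = foodDelivery.drop(columns = ["Output", "Reviews"])
Input.columns = Input.columns.astype(str) # scikit-learn expects column names of a single type
Output = foodDelivery['Output']

X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(Input, Output, test_size = 0.20, random_state = 0)
# Fit the scaler on the training split only, so that test-set statistics do not leak into training
sc = StandardScaler()
X_train_log = sc.fit_transform(X_train_log)
X_test_log = sc.transform(X_test_log)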
logreg = LogisticRegression(random_state=0)
logreg.fit(X_train_log, y_train_log)

y_train_pred_log = logreg.predict(X_train_log)
y_test_pred_log = logreg.predict(X_test_log)

print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", logreg.score(X_train_log, y_train_log)) # score() is the mean accuracy for a classifier, not R^2
# For 0/1 labels, the MSE equals the misclassification rate (1 - accuracy)
print("Mean Squared Error (MSE) \t:", metrics.mean_squared_error(y_train_log, y_train_pred_log))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(metrics.mean_squared_error(y_train_log, y_train_pred_log)))
print()


# Unpack the 2x2 confusion matrix once: rows are true classes, columns are predicted classes
TN_train_log, FP_train_log, FN_train_log, TP_train_log = metrics.confusion_matrix(y_train_log, y_train_pred_log).ravel()

FPRate_train_log = FP_train_log / (TN_train_log + FP_train_log)
FNRate_train_log = FN_train_log / (TP_train_log + FN_train_log)
print("False Positive Rate \t\t:", FPRate_train_log)
print("True Positive Rate \t\t:", 1 - FNRate_train_log)

print()

print("Accuracy:",metrics.accuracy_score(y_train_log, y_train_pred_log))
print("Precision:",metrics.precision_score(y_train_log, y_train_pred_log))
print("Recall:",metrics.recall_score(y_train_log, y_train_pred_log))

print()
print()
print()

print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", logreg.score(X_test_log, y_test_log)) # score() is the mean accuracy for a classifier, not R^2
# For 0/1 labels, the MSE equals the misclassification rate (1 - accuracy)
print("Mean Squared Error (MSE) \t:", metrics.mean_squared_error(y_test_log, y_test_pred_log))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(metrics.mean_squared_error(y_test_log, y_test_pred_log)))
print()



# Check the Goodness of Fit (on Test Data)
# Unpack the 2x2 confusion matrix once: rows are true classes, columns are predicted classes
TN_test_log, FP_test_log, FN_test_log, TP_test_log = metrics.confusion_matrix(y_test_log, y_test_pred_log).ravel()

FPRate_test_log = FP_test_log / (TN_test_log + FP_test_log)
FNRate_test_log = FN_test_log / (TP_test_log + FN_test_log)
print("False Positive Rate \t\t:", FPRate_test_log)
print("True Positive Rate \t\t:", 1 - FNRate_test_log)
print()

print("Accuracy:",metrics.accuracy_score(y_test_log, y_test_pred_log))
print("Precision:",metrics.precision_score(y_test_log, y_test_pred_log))
print("Recall:",metrics.recall_score(y_test_log, y_test_pred_log))

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(metrics.confusion_matrix(y_train_log, y_train_pred_log),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(metrics.confusion_matrix(y_test_log, y_test_pred_log), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
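
The train and test blocks above repeat the same confusion-matrix arithmetic. A minimal sketch of how that arithmetic could be wrapped in one helper is shown here. The function name `summarize_binary_classifier` and the toy labels are assumptions made for illustration and are not part of the original notebook. The sketch assumes binary labels coded 0/1, with at least one sample of each class and at least one positive prediction (otherwise a rate would divide by zero).

```python
import numpy as np
from sklearn import metrics

def summarize_binary_classifier(y_true, y_pred):
    """Return common binary-classification metrics as a dict (hypothetical helper)."""
    # With labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "false_positive_rate": fp / (tn + fp),
        "true_positive_rate": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Toy labels, made up purely for illustration
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

summary = summarize_binary_classifier(y_true, y_pred)
for name, value in summary.items():
    print(f"{name:<20}: {value:.3f}")
```

In the notebook, the same helper could be called once with `(y_train_log, y_train_pred_log)` and once with `(y_test_log, y_test_pred_log)`, which keeps the train and test reports consistent.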

# Random Forest
---

In [None]:
Input = foodDelivery.drop(['Output'], axis = 1)
#sc = StandardScaler()
#Input = sc.fit_transform(Input)
Output = foodDelivery['Output']

X_train_forest, X_test_forest, y_train_forest, y_test_forest = train_test_split(Input, Output, test_size = 0.20, random_state = 0)
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train_forest, y_train_forest)

y_train_pred_forest = forest.predict(X_train_forest)
y_test_pred_forest = forest.predict(X_test_forest)

print("Goodness of Fit of Model \tTrain Dataset")
# score() on a classifier returns mean accuracy, not R^2
print("Classification Accuracy \t:", forest.score(X_train_forest, y_train_forest))
print("Mean Squared Error (MSE) \t:", metrics.mean_squared_error(y_train_forest, y_train_pred_forest))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(metrics.mean_squared_error(y_train_forest, y_train_pred_forest)))
print()


# Binary labels: ravel() flattens the 2x2 matrix to TN, FP, FN, TP
TN_train_forest, FP_train_forest, FN_train_forest, TP_train_forest = metrics.confusion_matrix(y_train_forest, y_train_pred_forest).ravel()

FPRate_train_forest = FP_train_forest / (TN_train_forest + FP_train_forest)
FNRate_train_forest = FN_train_forest / (TP_train_forest + FN_train_forest)
print("False Positive Rate \t\t:", FPRate_train_forest)
print("True Positive Rate \t\t:", 1 - FNRate_train_forest)

print()

print("Accuracy:",metrics.accuracy_score(y_train_forest, y_train_pred_forest))
print("Precision:",metrics.precision_score(y_train_forest, y_train_pred_forest))
print("Recall:",metrics.recall_score(y_train_forest, y_train_pred_forest))

print()
print()
print()

print("Goodness of Fit of Model \tTest Dataset")
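# Score the random forest here (the original line scored logreg by mistake);
# score() on a classifier returns mean accuracy, not R^2
print("Classification Accuracy \t:", forest.score(X_test_forest, y_test_forest))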
print("Mean Squared Error (MSE) \t:", metrics.mean_squared_error(y_test_forest, y_test_pred_forest))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(metrics.mean_squared_error(y_test_forest, y_test_pred_forest)))
print()



# Check the Goodness of Fit (on Test Data)
# Binary labels: ravel() flattens the 2x2 matrix to TN, FP, FN, TP
TN_test_forest, FP_test_forest, FN_test_forest, TP_test_forest = metrics.confusion_matrix(y_test_forest, y_test_pred_forest).ravel()

FPRate_test_forest = FP_test_forest / (TN_test_forest + FP_test_forest)
FNRate_test_forest = FN_test_forest / (TP_test_forest + FN_test_forest)
print("False Positive Rate \t\t:", FPRate_test_forest)
print("True Positive Rate \t\t:", 1 - FNRate_test_forest)
print()

print("Accuracy:",metrics.accuracy_score(y_test_forest, y_test_pred_forest))
print("Precision:",metrics.precision_score(y_test_forest, y_test_pred_forest))
print("Recall:",metrics.recall_score(y_test_forest, y_test_pred_forest))

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(metrics.confusion_matrix(y_train_forest, y_train_pred_forest),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(metrics.confusion_matrix(y_test_forest, y_test_pred_forest), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
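
Since the question is which factors matter most, one possible follow-up to the random forest is to rank its impurity-based feature importances. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` and hypothetical column names (`feature_0` ... `feature_4`) because the notebook's `foodDelivery` frame is not available here, and `forest_demo` is a stand-in for the fitted `forest` above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 5 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(5)])

forest_demo = RandomForestClassifier(random_state=0)
forest_demo.fit(X_demo, y_demo)

# feature_importances_ holds one impurity-based score per column; scores sum to 1
importances = pd.Series(
    forest_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

With the real data, replacing `X_demo` with `X_train_forest` and `forest_demo` with `forest` would give a ranking over the survey columns. Impurity-based importances can favour high-cardinality features, so the ranking is best read as a rough guide.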




