Our client is Mama Tee Restaurant, one of the biggest restaurants in Lagos, Nigeria.


The objective of this regression task is to predict the tip (gratuity, in Nigerian naira) given to a food server at Mama Tee Restaurant, based on total_bill, gender, smoker (whether anyone in the party smokes), day (day of the week), time (lunch or dinner), and size (size of the party).


Label: The label for this problem is tip.


Features: There are 6 features: total_bill, gender, smoker, day, time, and size.

In [None]:
# ! pip install pandas==0.25.3
# ! pip install numpy==1.16.5
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pandas_profiling import ProfileReport

In [None]:

tip = pd.read_csv('tips.csv')

In [None]:
tip.head()

In [None]:
tip.shape

In [None]:
#ProfileReport(tip)

### Relationships with Categorical Variables

### Gender

In [None]:
sns.boxplot(x = 'gender', y = 'tip', data = tip)
plt.ylabel('Amount of tip')

### Smoker

In [None]:
sns.boxplot(x = 'smoker', y = 'tip', data = tip)
plt.ylabel('Amount of tip')

Smokers and non-smokers tipped almost the same amount.

### Time

In [None]:
sns.boxplot(x = 'time', y = 'tip', data = tip)
plt.ylabel('Amount of tip')

Lunch and dinner parties tipped almost the same amount.
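A boxplot comparison like the ones above can be backed with numbers by computing the quartiles per group. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up values that are not from `tips.csv`); the same `groupby` call should work on `tip` with `'gender'`, `'smoker'`, or `'time'` as the grouping column.

```python
import pandas as pd

# Hypothetical toy rows for illustration only, not taken from tips.csv
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "tip": [2.0, 3.0, 4.0, 2.5, 3.0, 5.0],
})

# A boxplot draws the quartiles, so compare the same statistics per group
summary = toy.groupby("time")["tip"].describe()[["25%", "50%", "75%"]]
print(summary)
```

If the medians (the `50%` column) are close and the boxes overlap heavily, the feature on its own probably separates tips only weakly.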

## Let's train the model

In [None]:
from sklearn import metrics #For evaluating the model built
from sklearn.model_selection import train_test_split

## We need to split the data into features and label

In [None]:
X = tip.drop(["tip"], axis = "columns") 
y = tip["tip"]

In [None]:
X.head(2)

In [None]:
X.head()

In [None]:
y.head()

#### We need to one-hot encode all the categorical features because scikit-learn's LinearRegression cannot work with categorical data directly.
#### It requires all input and output variables to be numeric.
#### In this case, we will one-hot encode gender, smoker, day, and time using `pd.get_dummies()`.

In [None]:
X = pd.get_dummies(X)
X.head()

In [None]:
X.shape
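To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (`toy`, made-up values). It shows that `pd.get_dummies()` leaves numeric columns alone and expands each text column into one indicator column per category. The `drop_first=True` variant is not used in this notebook; it is shown only as an option that drops one redundant indicator per feature, which can make the coefficients easier to interpret.

```python
import pandas as pd

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0],
    "smoker": ["Yes", "No", "No"],
    "time": ["Dinner", "Lunch", "Dinner"],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Optional variant: drop the first category of each feature
reduced = pd.get_dummies(toy, drop_first=True)
print(reduced.columns.tolist())
```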

#### We will split the dataset (features `X` and label `y`) into training and test sets using the `train_test_split()` function from sklearn. The training set will be 80% of the rows and the test set 20%. Setting `random_state` to 42 fixes the shuffle, so everyone who runs the notebook gets the same split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
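The effect of `test_size` and `random_state` can be checked on a small example. The sketch below uses hypothetical arrays (`X_toy`, `y_toy`) rather than the tips data and splits them twice with the same seed to confirm the 80/20 proportions and that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42)

print(a_train.shape, a_test.shape)     # 8 training rows, 2 test rows
print(np.array_equal(a_test, b_test))  # same seed gives the same test rows
```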

In [None]:

from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Predict tips for the held-out test set and preview the first ten
predictions = model.predict(X_test)
predictions[:10]

In [None]:
# f1_score is a classification metric; for regression, report R^2 instead
metrics.r2_score(y_test, predictions)

#### Let's compare the actual values and predictions

In [None]:
comparison = pd.DataFrame({'Actual Values':y_test, 'Predictions':predictions})

In [None]:
comparison.shape

In [None]:
comparison.head(30)

In [None]:
# Residuals: actual tip minus predicted tip
y_test - predictions

In [None]:
mse = metrics.mean_squared_error(y_test, predictions)
print("Mean squared error:", round(mse, 3))

rmse = np.sqrt(mse)
print("Root mean squared error:", round(rmse, 3))
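For readers who want to see what these two numbers measure, the sketch below recomputes them by hand on hypothetical values (`actual` and `predicted` are made up) and checks the result against `metrics.mean_squared_error`. MSE is the mean of the squared residuals, and RMSE is its square root, which puts the error back in the same unit as the tip.

```python
import numpy as np
from sklearn import metrics

# Hypothetical values for illustration only
actual = np.array([2.0, 3.0, 5.0, 4.0])
predicted = np.array([2.5, 2.5, 4.0, 4.0])

# Residuals, then mean of their squares, then the square root
errors = actual - predicted
mse_manual = np.mean(errors ** 2)
rmse_manual = np.sqrt(mse_manual)

# The library function should agree with the manual calculation
mse_sklearn = metrics.mean_squared_error(actual, predicted)
print(mse_manual, mse_sklearn, rmse_manual)
```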

In [None]:
print("Maximum tip:", np.max(tip['tip']))
print("Minimum tip:", np.min(tip['tip']))
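The minimum and maximum tip give a scale against which to read the RMSE. Another common reference point, which is not part of the original notebook, is a baseline that always predicts the mean tip of the training set. The sketch below illustrates the idea on synthetic data (`X_toy` and `y_toy` are generated, not read from `tips.csv`) using scikit-learn's `DummyRegressor`; the same comparison could be run with `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration: tips loosely proportional to the bill
rng = np.random.default_rng(0)
X_toy = rng.uniform(5, 50, size=(100, 1))
y_toy = 0.15 * X_toy[:, 0] + rng.normal(0, 1, 100)

# Simple 80/20 split by position
X_tr, X_te = X_toy[:80], X_toy[80:]
y_tr, y_te = y_toy[:80], y_toy[80:]

# Baseline: always predict the mean of the training labels
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
rmse_linear = np.sqrt(mean_squared_error(y_te, linear.predict(X_te)))
print("Baseline RMSE:", round(rmse_baseline, 3))
print("Linear regression RMSE:", round(rmse_linear, 3))
```

A model is only worth keeping if its RMSE is clearly below the baseline's.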