## Restaurant Revenue Prediction


#### This notebook explores the "Restaurant revenue" dataset, which contains 100 observations of restaurants with 8 attributes.

Dataset Attributes:

1. Id - Restaurant ID
2. Name - Name of the restaurant
3. Franchise - Whether the restaurant has a franchise (Yes/No)
4. Category - Type of category offered by the restaurant
5. City - City in which the restaurant is located
6. No_Of_Item - Number of different items offered by the restaurant
7. Order_Placed - Orders placed by customers with the restaurant (in lacs)
8. Revenue - Total income generated by the restaurant

The task is to predict restaurant revenue from the independent features using a Linear Regression algorithm.
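As a rough illustration of the task, the sketch below fits an ordinary least-squares line to the five rows shown by `df.head()` further down, using only the two numeric predictors. It uses NumPy's least-squares solver as a stand-in; the helper name `fit_linear_regression` is hypothetical, and the modelling part of the project may well use a different library and more features.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head() in this notebook; the full CSV has 100 rows.
sample = pd.DataFrame({
    "No_Of_Item":   [55, 72, 25, 18, 48],
    "Order_Placed": [5.5, 6.8, 1.9, 2.5, 4.2],
    "Revenue":      [5953753, 7223131, 2555379, 2175511, 4816715],
})

def fit_linear_regression(frame, feature_cols, target_col):
    """Ordinary least squares with an intercept, solved via np.linalg.lstsq."""
    X = frame[feature_cols].to_numpy(dtype=float)
    y = frame[target_col].to_numpy(dtype=float)
    X_design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    predictions = X_design @ coef
    ss_res = float(np.sum((y - predictions) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot  # fit quality on the training rows only
    return coef, predictions, r2

coef, predictions, r2 = fit_linear_regression(
    sample, ["No_Of_Item", "Order_Placed"], "Revenue"
)
print("intercept and slopes:", coef)
print("R^2 on the five sample rows:", round(r2, 4))
```

The R² printed here is computed on the same five rows used for fitting, so it says nothing about predictive quality; a held-out split of the full 100 rows would be needed for that.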

### 2.1 Import Data and Required Packages
#### Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries.

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [6]:
df = pd.read_csv('data/revenue_prediction.csv')

#### Show Top 5 Records

In [7]:
df.head()

|   | Id | Name | Franchise | Category | City | No_Of_Item | Order_Placed | Revenue |
|---|---|---|---|---|---|---|---|---|
| 0 | 101 | HungryHowie'sPizza | Yes | Mexican | Bengaluru | 55 | 5.5 | 5953753 |
| 1 | 102 | CharleysPhillySteaks | No | Varied Menu | Gurugram | 72 | 6.8 | 7223131 |
| 2 | 103 | Chuy's | Yes | Chicken | Pune | 25 | 1.9 | 2555379 |
| 3 | 104 | O'Charley's | Yes | Italian/Pizza | Mumbai | 18 | 2.5 | 2175511 |
| 4 | 105 | PolloTropical | Yes | Pizza | Noida | 48 | 4.2 | 4816715 |


#### Shape of the dataset

In [8]:
df.shape

(100, 8)

### 2.2 Dataset information

- Id : restaurant identifier -> integers from 101 to 200
- Name : name of the restaurant -> 100 distinct names
- Franchise : whether the restaurant has a franchise -> (Yes/No)
- Category : type of category offered by the restaurant -> 20 categories (e.g. Mexican, Varied Menu, Chicken, Italian/Pizza, Pizza)
- City : city of the restaurant -> (Bengaluru, Gurugram, Pune, Mumbai, Noida)
- No_Of_Item : number of different items offered -> 18 to 126
- Order_Placed : orders placed by customers, in lacs -> 1.0 to 13.0
- Revenue : total income generated by the restaurant (target) -> about 0.85 million to about 19.70 million

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check various categories present in the different categorical column
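The first four checks in this list can be bundled into one helper, which is convenient when the same checks are rerun after cleaning. This is a minimal sketch: `run_data_checks` is a hypothetical name, and the toy frame is invented purely to show a missing value and a duplicated row.

```python
import pandas as pd

def run_data_checks(frame):
    """Collect missing values, duplicates, dtypes and unique counts in one dict."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }

# Hypothetical toy frame with one missing value and one duplicated row.
toy = pd.DataFrame({
    "Franchise": ["Yes", "No", "No", None],
    "No_Of_Item": [55, 72, 72, 25],
})
report = run_data_checks(toy)
for check, result in report.items():
    print(check, "->", result)
```

On the notebook's own data the call would be `run_data_checks(df)`; the statistics and category checks are left to `df.describe()` and `unique()` as in the sections that follow.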

### 3.1 Check Missing values

In [9]:
df.isna().sum()

Id              0
Name            0
Franchise       0
Category        0
City            0
No_Of_Item      0
Order_Placed    0
Revenue         0
dtype: int64

#### There are no missing values in the data set

### 3.2 Check Duplicates

In [10]:
df.duplicated().sum()

0

#### There are no duplicate rows in the data set

### 3.3 Check data types

In [11]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            100 non-null    int64  
 1   Name          100 non-null    object 
 2   Franchise     100 non-null    object 
 3   Category      100 non-null    object 
 4   City          100 non-null    object 
 5   No_Of_Item    100 non-null    int64  
 6   Order_Placed  100 non-null    float64
 7   Revenue       100 non-null    int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 6.4+ KB


### 3.4 Checking the number of unique values of each column

In [12]:
df.nunique()

Id              100
Name            100
Franchise         2
Category         20
City              5
No_Of_Item       53
Order_Placed     55
Revenue         100
dtype: int64
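Unique counts are mainly useful for spotting identifier-like columns, which take a different value on every row and therefore carry no information a regression can generalise from. The sketch below reuses the counts printed above; treating `Revenue` as the target is taken from the task description, and the variable names are illustrative.

```python
import pandas as pd

# Unique-value counts copied from the df.nunique() output above.
unique_counts = pd.Series({
    "Id": 100, "Name": 100, "Franchise": 2, "Category": 20,
    "City": 5, "No_Of_Item": 53, "Order_Placed": 55, "Revenue": 100,
})
n_rows = 100          # from df.shape
target = "Revenue"    # the column to be predicted

# Share of distinct values per column (1.0 means every row is different).
cardinality_ratio = unique_counts / n_rows

# Non-target columns with one distinct value per row behave like identifiers.
identifier_like = [
    col for col, ratio in cardinality_ratio.items()
    if ratio == 1.0 and col != target
]
print(identifier_like)
```

`Revenue` also has 100 distinct values, but that is expected for a continuous target and is why it is excluded from the identifier test.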

### 3.5 Check statistics of data set

In [13]:
df.describe()

|   | Id | No_Of_Item | Order_Placed | Revenue |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 150.5 | 49.08 | 4.086 | 4395161.0 |
| std | 29.011492 | 22.370923 | 2.055101 | 2659932.0 |
| min | 101.0 | 18.0 | 1.0 | 849870.0 |
| 25% | 125.75 | 34.75 | 2.75 | 2688328.0 |
| 50% | 150.5 | 45.0 | 3.65 | 3911401.0 |
| 75% | 175.25 | 57.25 | 5.1 | 5330084.0 |
| max | 200.0 | 126.0 | 13.0 | 19696940.0 |


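#### Insight
- Id is an identifier, so its statistics carry no meaning.
- For No_Of_Item, Order_Placed and Revenue the mean is above the median (49.08 vs 45, 4.086 vs 3.65, about 4.40 million vs about 3.91 million), which points to right-skewed distributions.
- The maxima (126 items, 13.0 lacs of orders, about 19.70 million in revenue) lie far above the 75th percentiles (57.25, 5.1 and about 5.33 million), so a few restaurants are much larger than the rest.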
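A quick way to quantify how far the mean sits from the median is Pearson's second skewness coefficient, 3 × (mean − median) / std, which needs only the numbers printed by `df.describe()`. The sketch below is an assumption-laden shortcut rather than a full skewness test; the values are copied from the table above.

```python
import pandas as pd

# Mean, median (50%) and std copied from the df.describe() output above.
summary = pd.DataFrame(
    {
        "mean":   [49.08, 4.086, 4395161.0],
        "median": [45.0, 3.65, 3911401.0],
        "std":    [22.370923, 2.055101, 2659932.0],
    },
    index=["No_Of_Item", "Order_Placed", "Revenue"],
)

# Positive values suggest a longer right tail, negative values a longer left tail.
summary["pearson_skew"] = 3 * (summary["mean"] - summary["median"]) / summary["std"]
print(summary["pearson_skew"].round(3))
```

With the full data loaded, `df[['No_Of_Item', 'Order_Placed', 'Revenue']].skew()` gives the moment-based skewness, which may differ in size from this coefficient.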

### 3.6 Exploring Data

In [14]:
df.head()

|   | Id | Name | Franchise | Category | City | No_Of_Item | Order_Placed | Revenue |
|---|---|---|---|---|---|---|---|---|
| 0 | 101 | HungryHowie'sPizza | Yes | Mexican | Bengaluru | 55 | 5.5 | 5953753 |
| 1 | 102 | CharleysPhillySteaks | No | Varied Menu | Gurugram | 72 | 6.8 | 7223131 |
| 2 | 103 | Chuy's | Yes | Chicken | Pune | 25 | 1.9 | 2555379 |
| 3 | 104 | O'Charley's | Yes | Italian/Pizza | Mumbai | 18 | 2.5 | 2175511 |
| 4 | 105 | PolloTropical | Yes | Pizza | Noida | 48 | 4.2 | 4816715 |


In [15]:
print("Categories in 'Franchise' variable:     ",end=" " )
print(df['Franchise'].unique())

print("Categories in 'City' variable:  ",end=" ")
print(df['City'].unique())

Categories in 'Franchise' variable:      ['Yes' 'No']
Categories in 'City' variable:   ['Bengaluru' 'Gurugram' 'Pune' 'Mumbai' 'Noida']
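Because Linear Regression needs numeric inputs, text categories such as these have to be encoded before modelling. One common option is one-hot encoding with `pd.get_dummies`; the sketch below applies it to the five rows from `df.head()`. Dropping the first level of each feature is an assumption made here to avoid redundant columns, not a step taken from the notebook.

```python
import pandas as pd

# Franchise, City and No_Of_Item for the five rows shown by df.head().
sample = pd.DataFrame({
    "Franchise": ["Yes", "No", "Yes", "Yes", "Yes"],
    "City": ["Bengaluru", "Gurugram", "Pune", "Mumbai", "Noida"],
    "No_Of_Item": [55, 72, 25, 18, 48],
})

# One 0/1 column per category level; drop_first removes the first level
# of each feature, which then acts as the baseline.
encoded = pd.get_dummies(
    sample, columns=["Franchise", "City"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
print(encoded)
```

Category has 20 levels, so encoding it the same way would add 19 columns to a 100-row data set; grouping rare categories first may be worth considering.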


In [16]:
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 4 numerical features : ['Id', 'No_Of_Item', 'Order_Placed', 'Revenue']

We have 4 categorical features : ['Name', 'Franchise', 'Category', 'City']
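The same split can be written with `DataFrame.select_dtypes`, which selects on the numeric/non-numeric distinction and so does not depend on text columns being stored with dtype `object`. A small sketch on the first three rows from `df.head()`:

```python
import pandas as pd

# First three rows of the data set, as shown by df.head().
sample = pd.DataFrame({
    "Id": [101, 102, 103],
    "Name": ["HungryHowie'sPizza", "CharleysPhillySteaks", "Chuy's"],
    "Franchise": ["Yes", "No", "Yes"],
    "Category": ["Mexican", "Varied Menu", "Chicken"],
    "City": ["Bengaluru", "Gurugram", "Pune"],
    "No_Of_Item": [55, 72, 25],
    "Order_Placed": [5.5, 6.8, 1.9],
    "Revenue": [5953753, 7223131, 2555379],
})

# Numeric columns (int and float) versus everything else.
numeric_features = sample.select_dtypes(include="number").columns.tolist()
categorical_features = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```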


### 4. Exploring Data ( Visualization )
#### 4.1 Visualize the revenue distribution to draw some conclusions.
- Histogram
- Kernel Density Estimate (KDE)

#### 4.1.1 Histogram & KDE

In [17]:
# Revenue distribution: overall (left) and split by franchise status (right)
fig, axs = plt.subplots(1, 2, figsize=(15, 7))
sns.histplot(data=df, x='Revenue', bins=30, kde=True, color='g', ax=axs[0])
sns.histplot(data=df, x='Revenue', kde=True, hue='Franchise', ax=axs[1])
plt.show()

In [18]:
# Order volume distribution: overall (left) and split by franchise status (right)
fig, axs = plt.subplots(1, 2, figsize=(15, 7))
sns.histplot(data=df, x='Order_Placed', bins=30, kde=True, color='g', ax=axs[0])
sns.histplot(data=df, x='Order_Placed', kde=True, hue='Franchise', ax=axs[1])
plt.show()

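#####  Insights
- The summary statistics in section 3.5 point to a right-skewed Revenue distribution: the mean (about 4.40 million) exceeds the median (about 3.91 million), and the maximum (about 19.70 million) lies far above the 75th percentile (about 5.33 million).
- Order_Placed follows the same pattern, with a mean of 4.086 against a median of 3.65 and a maximum of 13.0.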

In [19]:
# Revenue by city: all restaurants, franchise only, non-franchise only
fig, axs = plt.subplots(1, 3, figsize=(25, 6))
sns.histplot(data=df, x='Revenue', kde=True, hue='City', ax=axs[0])
sns.histplot(data=df[df.Franchise == 'Yes'], x='Revenue', kde=True, hue='City', ax=axs[1])
sns.histplot(data=df[df.Franchise == 'No'], x='Revenue', kde=True, hue='City', ax=axs[2])
plt.show()

#####  Insights
- With 100 restaurants spread over 5 cities (20 per city on average), each per-city curve rests on few observations.
- The franchise and non-franchise panels split those groups further, so differences between cities in these panels should be treated as tentative.


#### 4.2 Distribution of the numerical features

In [22]:
# Violin plots of the three numerical features (Id is an identifier and is skipped)
fig, axs = plt.subplots(1, 3, figsize=(18, 8))
axs[0].set_title('NUMBER OF ITEMS')
sns.violinplot(y='No_Of_Item', data=df, color='red', linewidth=3, ax=axs[0])
axs[1].set_title('ORDERS PLACED (IN LACS)')
sns.violinplot(y='Order_Placed', data=df, color='green', linewidth=3, ax=axs[1])
axs[2].set_title('REVENUE')
sns.violinplot(y='Revenue', data=df, color='blue', linewidth=3, ax=axs[2])
plt.show()

#### Insights
- According to the quartiles in section 3.5, the middle half of the restaurants offer between 34.75 and 57.25 items, receive between 2.75 and 5.1 lacs of orders, and earn between about 2.69 and 5.33 million in revenue.

#### 4.4 Feature Wise Visualization
#### 4.4.1 FRANCHISE COLUMN
- How is Franchise distributed?
- Does franchise status have any impact on a restaurant's revenue?

#### UNIVARIATE ANALYSIS ( How is Franchise distributed? )

In [23]:
# Bar chart (left) and pie chart (right) of franchise vs. non-franchise counts
franchise_counts = df['Franchise'].value_counts()
f, ax = plt.subplots(1, 2, figsize=(20, 10))
sns.countplot(x='Franchise', data=df, ax=ax[0], saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container, color='black', size=20)

# Take the labels from value_counts() so they always match the slice order
ax[1].pie(x=franchise_counts, labels=franchise_counts.index, explode=[0, 0.1], autopct='%1.1f%%', shadow=True, colors=['#ff4d4d', '#ff8000'])
plt.show()

#### Insights 
- The bar labels give the number of franchise ('Yes') and non-franchise ('No') restaurants among the 100 observations, and the pie chart gives the same split as percentages.

#### 4.4.6 CHECKING OUTLIERS

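In [24]:
# Boxplots of the three numerical features
fig, axs = plt.subplots(1, 3, figsize=(16, 5))
sns.boxplot(y=df['No_Of_Item'], color='skyblue', ax=axs[0])
sns.boxplot(y=df['Order_Placed'], color='hotpink', ax=axs[1])
sns.boxplot(y=df['Revenue'], color='yellow', ax=axs[2])
plt.show()

#### Insights
- By the 1.5 × IQR rule applied to the quartiles in section 3.5, each of the three features has at least one high-side outlier: the maxima (126 items, 13.0 lacs of orders, about 19.70 million in revenue) exceed the upper fences (91 items, 8.625 lacs, about 9.29 million).
- None of the minima falls below its lower fence, so there are no low-side outliers.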
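Boxplot whiskers follow the 1.5 × IQR rule, so the same outlier check can be done numerically from the quartiles that `df.describe()` reports in section 3.5. The sketch below copies those quartiles by hand; the column names `lower_fence`, `upper_fence` and the two flags are illustrative.

```python
import pandas as pd

# min, 25%, 75% and max copied from the df.describe() output in section 3.5.
quartiles = pd.DataFrame(
    {
        "min": [18.0, 1.0, 849870.0],
        "q1":  [34.75, 2.75, 2688328.0],
        "q3":  [57.25, 5.1, 5330084.0],
        "max": [126.0, 13.0, 19696940.0],
    },
    index=["No_Of_Item", "Order_Placed", "Revenue"],
)

# Fences sit 1.5 interquartile ranges beyond the first and third quartiles.
iqr = quartiles["q3"] - quartiles["q1"]
quartiles["lower_fence"] = quartiles["q1"] - 1.5 * iqr
quartiles["upper_fence"] = quartiles["q3"] + 1.5 * iqr

# A feature has outliers on a side if its extreme value lies beyond that fence.
quartiles["has_high_outlier"] = quartiles["max"] > quartiles["upper_fence"]
quartiles["has_low_outlier"] = quartiles["min"] < quartiles["lower_fence"]
print(quartiles[["lower_fence", "upper_fence", "has_high_outlier", "has_low_outlier"]])
```

With the full data loaded, the individual outlier rows can be listed with a filter such as `df[df['Revenue'] > upper_fence]`, where `upper_fence` is the value computed for Revenue.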

### 5. Conclusions
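- The data set is clean: there are no missing values and no duplicate rows among the 100 restaurants.
- Id and Name take a different value for every restaurant, so they identify rows rather than explain revenue.
- No_Of_Item, Order_Placed and Revenue each have a mean above the median and a maximum beyond the 1.5 × IQR upper fence, which indicates right-skewed distributions with high-side outliers.
- Franchise (2 levels), City (5 levels) and Category (20 levels) are categorical and need to be encoded numerically before fitting a Linear Regression model.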