# Restaurant Revenue Prediction

![](https://storage.googleapis.com/kaggle-competitions/kaggle/4272/media/TAB_banner2.png)

## Description

With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures. 

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred. 

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

TFI would love to hire an expert Kaggler like you to head up their growing data science team in Istanbul or Shanghai. You'd be tackling problems like the one featured in this competition on a global scale.

Source: https://www.kaggle.com/competitions/restaurant-revenue-prediction/overview

## Data Description

TFI has provided a dataset with 137 restaurants in the training set and a test set of 100,000 restaurants. The data columns include the open date, location, city type, and three categories of obfuscated data: demographic data, real estate data, and commercial data. The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis.

### File descriptions

- train.csv - the training set. Use this dataset for training your model.
- test.csv - the test set. To deter manual "guess" predictions, Kaggle has supplemented the test set with additional "ignored" data. These are not counted in the scoring.
- sampleSubmission.csv - a sample submission file in the correct format.

### Data fields

- Id: Restaurant id.
- Open Date: Opening date of the restaurant.
- City: City that the restaurant is in. Note that the names contain Unicode characters.
- City Group: Type of the city: Big Cities or Other.
- Type: Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile.
- P1, P2 - P37: There are three categories of these obfuscated data. Demographic data are gathered from third-party providers with GIS systems. These include population in any given area, age and gender distribution, and development scales. Real estate data mainly relate to the m2 of the location, the front facade of the location, and car park availability. Commercial data mainly include the existence of points of interest, including schools, banks, and other QSR operators.
- Revenue: The (transformed) revenue of the restaurant in a given year, which is the target of predictive analysis. Please note that the values are transformed, so they do not represent real dollar values.

Source: https://www.kaggle.com/competitions/restaurant-revenue-prediction/data

In [None]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Read datasets

In [None]:
train_data = pd.read_csv('/kaggle/input/restaurant-revenue-prediction/train.csv.zip')
test_data = pd.read_csv('/kaggle/input/restaurant-revenue-prediction/test.csv.zip')
train_data.drop('Id', axis=1, inplace=True)

In [None]:
train_data

In [None]:
train_data.info()

In [None]:
train_data[['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10']].describe()

In [None]:
train_data[['P11', 'P12', 'P13', 'P14', 'P15', 'P16', 'P17', 'P18', 'P19', 'P20']].describe()

In [None]:
train_data[['P21', 'P22', 'P23', 'P24', 'P25', 'P26', 'P27', 'P28', 'P29', 'P30']].describe()

In [None]:
train_data[['P31', 'P32', 'P33', 'P34', 'P35', 'P36', 'P37']].describe()

In [None]:
test_data

## Data Processing

### Convert the string Open Date to date format

In [None]:
train_data['Open Date'] = pd.to_datetime(train_data['Open Date'], format='%m/%d/%Y')
test_data['Open Date'] = pd.to_datetime(test_data['Open Date'], format='%m/%d/%Y')

In [None]:
train_data['Open Date'].min(), train_data['Open Date'].max()

In [None]:
test_data['Open Date'].min(), test_data['Open Date'].max()

#### Calculate the number of open days since the restaurant was opened

In [None]:
# Reference date (01/01/2015) used to measure how long each restaurant has been open.
# One row per restaurant so that the subtraction below aligns on the index.
date_last_train = pd.DataFrame({'Date': pd.to_datetime(['01/01/2015'] * len(train_data), format='%m/%d/%Y')})
date_last_test = pd.DataFrame({'Date': pd.to_datetime(['01/01/2015'] * len(test_data), format='%m/%d/%Y')})

In [None]:
# Elapsed time between the reference date and each opening date, in whole days.
# `.dt.days` is used because recent pandas versions reject astype('timedelta64[D]').
train_data['OpenDays'] = (date_last_train['Date'] - train_data['Open Date']).dt.days
test_data['OpenDays'] = (date_last_test['Date'] - test_data['Open Date']).dt.days

#### Calculate the number of open years

In [None]:
# Whole years open, approximated from elapsed days and the mean Gregorian year
# length (365.2425 days). Recent pandas versions reject astype('timedelta64[Y]').
train_open_days = (date_last_train['Date'] - train_data['Open Date']).dt.days
test_open_days = (date_last_test['Date'] - test_data['Open Date']).dt.days

train_data['OpenYears'] = (train_open_days / 365.2425).astype(int)
test_data['OpenYears'] = (test_open_days / 365.2425).astype(int)

In [None]:
train_data = train_data.drop('Open Date', axis=1)
test_data = test_data.drop('Open Date', axis=1)
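
The date features above can be checked on a few hand-picked dates. The following is a minimal sketch, not part of the original pipeline: the `toy` frame and `reference_date` name are made up for illustration, and it assumes that a scalar `pd.Timestamp` is an acceptable stand-in for the per-row reference-date column.

```python
import pandas as pd

# Toy opening dates in the same MM/DD/YYYY format as the dataset (illustrative values).
toy = pd.DataFrame({'Open Date': ['07/17/1999', '02/14/2008', '12/31/2014']})
toy['Open Date'] = pd.to_datetime(toy['Open Date'], format='%m/%d/%Y')

# A scalar reference date broadcasts across the whole column.
reference_date = pd.Timestamp('2015-01-01')
elapsed = reference_date - toy['Open Date']

# Whole days open.
toy['OpenDays'] = elapsed.dt.days
# Whole years open, approximated with the mean Gregorian year length.
toy['OpenYears'] = (toy['OpenDays'] / 365.2425).astype(int)

print(toy)
```

A restaurant opened on 12/31/2014 ends up with one open day and zero open years, which is a quick sanity check on the direction of the subtraction.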

#### Convert City Group into two boolean columns

In [None]:
citygroup_train = pd.get_dummies(train_data['City Group'])
train_data = train_data.join(citygroup_train)

citygroup_test = pd.get_dummies(test_data['City Group'])
test_data = test_data.join(citygroup_test)

train_data = train_data.drop('City Group', axis=1)
test_data = test_data.drop('City Group', axis=1)
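
Encoding the training and test sets separately only works if both contain every category. The sketch below uses made-up toy frames to show one way of guarding against a category missing from one set, by reindexing the test dummies to the training columns. It is an optional safeguard under that assumption, not something the notebook above requires.

```python
import pandas as pd

# Toy frames (illustrative): the test frame has no 'Big Cities' rows.
toy_train = pd.DataFrame({'City Group': ['Big Cities', 'Other', 'Big Cities']})
toy_test = pd.DataFrame({'City Group': ['Other', 'Other']})

train_dummies = pd.get_dummies(toy_train['City Group'], dtype=int)
test_dummies = pd.get_dummies(toy_test['City Group'], dtype=int)

# Give the test dummies exactly the training columns, in the same order,
# filling any category absent from the test set with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies)
```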

#### Convert Type column into a numeric column

In [None]:
# Map each restaurant type to an integer code (FC=0, IL=1, DT=2, MB=3).
type_codes = {'FC': 0, 'IL': 1, 'DT': 2, 'MB': 3}
train_data['Type_int'] = train_data['Type'].map(type_codes)
test_data['Type_int'] = test_data['Type'].map(type_codes)

#### Remove City column

In [None]:
train_data.drop('City', axis=1, inplace=True)
test_data.drop('City', axis=1, inplace=True)

In [None]:
train_data.head(10)

## Charts

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=(12,6))
ax = sns.histplot(data=train_data, x='revenue', kde=True)
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Restaurant - Revenue', fontsize=20, fontweight='bold')
plt.xlabel('Revenue', fontsize=16, fontweight='bold')
plt.ylabel('Number of Restaurants', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

### Revenue by City Group

In [None]:
plt.figure(figsize=(12,6))
ax = sns.histplot(data=train_data.query('`Big Cities` == 1'), x='revenue', kde=True)
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Revenue of Restaurants from Big Cities', fontsize=20, fontweight='bold')
plt.xlabel('Revenue', fontsize=16, fontweight='bold')
plt.ylabel('Number of Restaurants', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
ax = sns.histplot(data=train_data.query('`Other` == 1'), x='revenue', kde=True)
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Revenue of Restaurants from Other Cities', fontsize=20, fontweight='bold')
plt.xlabel('Revenue', fontsize=16, fontweight='bold')
plt.ylabel('Number of Restaurants', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

### Revenue by Open Time

In [None]:
plt.figure(figsize=(12,6))
ax = sns.histplot(data=train_data, x='OpenDays', kde=True)
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Restaurant - Open Days', fontsize=20, fontweight='bold')
plt.xlabel('Open Days', fontsize=16, fontweight='bold')
plt.ylabel('Number of Restaurants', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
ax = sns.histplot(data=train_data, x='OpenYears', kde=True)
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Restaurant - Open Years', fontsize=20, fontweight='bold')
plt.xlabel('Open Years', fontsize=16, fontweight='bold')
plt.ylabel('Number of Restaurants', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
ax = sns.histplot(data=train_data, x='revenue', y='OpenDays')
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Restaurant - Revenue by Open Days', fontsize=20, fontweight='bold')
plt.xlabel('Revenue', fontsize=16, fontweight='bold')
plt.ylabel('Open Days', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
ax = sns.histplot(data=train_data, x='revenue', y='OpenYears')
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Restaurant - Revenue by Open Years', fontsize=20, fontweight='bold')
plt.xlabel('Revenue', fontsize=16, fontweight='bold')
plt.ylabel('Open Years', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

### Revenue by Type

In [None]:
plt.figure(figsize=(8,6))
ax = sns.histplot(data=train_data, x='Type', y='revenue')
ax.get_yaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
plt.title('Revenue by Restaurant Type', fontsize=20, fontweight='bold')
plt.xlabel('Restaurant Type', fontsize=16, fontweight='bold')
plt.ylabel('Revenue', fontsize=16, fontweight='bold')
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

### Analyzing correlation

In [None]:
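plt.figure(figsize=(25, 10))
# Correlation is not causation: https://www.tylervigen.com/spurious-correlations
# numeric_only=True skips the string 'Type' column, which would otherwise raise an error.
corr = train_data.corr(numeric_only=True)
# Mask the upper triangle so each pair of columns is shown only once.
mask = np.triu(np.ones_like(corr, dtype=bool))
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation', fontdict={'fontsize':18}, pad=16)
plt.show()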
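
With around forty columns, the heatmap can be hard to read. A ranked list of each feature's correlation with `revenue` is a compact complement. The sketch below runs on synthetic data (the `toy` frame and its coefficients are invented for illustration), so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training frame (illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({'P1': rng.normal(size=50), 'P2': rng.normal(size=50)})
toy['revenue'] = 3 * toy['P1'] + rng.normal(scale=0.1, size=50)
toy['Type'] = 'FC'  # a string column, ignored by numeric_only=True

# Correlation of every numeric feature with the target,
# ordered by absolute strength (strongest first).
corr_with_revenue = (
    toy.corr(numeric_only=True)['revenue']
    .drop('revenue')
    .sort_values(key=abs, ascending=False)
)

print(corr_with_revenue)
```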

## Predict Restaurant Revenues

In [None]:
# Exclude restaurants with revenue above 10,000,000 before fitting.
train_filtered = train_data.query('revenue <= 10000000')
x_train = train_filtered.drop(['Type', 'revenue'], axis=1)
y_train = train_filtered['revenue']

In [None]:
x_test = test_data.drop(['Id', 'Type'], axis=1)

In [None]:
from sklearn import tree
# A fixed random_state makes the fitted tree reproducible between runs.
clf = tree.DecisionTreeRegressor(random_state=0)
clf = clf.fit(x_train, y_train)

In [None]:
predict = clf.predict(x_test)
predict
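
The notebook fits the tree on all filtered training rows and predicts directly, so there is no local estimate of how well it generalizes. With only 137 labeled restaurants, k-fold cross-validation is one inexpensive way to get such an estimate. The sketch below is an illustration on synthetic data, assuming root mean squared error as the metric of interest. `X_toy`, `y_toy`, and `max_depth=3` are arbitrary choices for the example, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with the same number of rows as the training set (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(137, 5))
y_toy = 4_000_000 + 1_000_000 * X_toy[:, 0] + rng.normal(scale=500_000, size=137)

model = DecisionTreeRegressor(max_depth=3, random_state=0)

# scikit-learn reports errors as negative scores so that "higher is better";
# negate them to recover the RMSE of each of the 5 folds.
scores = cross_val_score(model, X_toy, y_toy, cv=5,
                         scoring='neg_root_mean_squared_error')
rmse_per_fold = -scores

print(rmse_per_fold.mean(), rmse_per_fold.std())
```

On the real data, `x_train` and `y_train` would take the place of the synthetic arrays. A large spread across folds would suggest that the single tree is sensitive to which restaurants it sees.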

In [None]:
submission = pd.DataFrame({'Id': test_data['Id'], 'Prediction': predict})
submission.to_csv('submission.csv', header=True, index=False)
submission.head()