# Prediction Code

This file demonstrates the code we used to predict the missing values in our dataset.

We used a machine learning technique (an XGBoost regressor, `XGBRegressor`) to fill in missing values in the restaurant section. The restaurant section has 13 columns and 504 data records, extracted from our cleaned database with [COL N] and 504 records.


In [1]:
import warnings
warnings.simplefilter(action='ignore')
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
import matplotlib.pyplot as plt 
import numpy as np
from sklearn.model_selection import GridSearchCV
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.base import BaseEstimator


The data set for the first layer of predictions consists of the ID column, the REST section, and three related fields from the LRC section [WHY LRC?] (i.e., 17 columns in total: 1 ID column, 13 columns from the restaurant section, and 3 related columns from the LRC section). We refer to this data as REST1 for the rest of the article. After we split REST1 into a training set and a test set, we trained our XGBRegressor model on the training set and tuned its parameters using grid search and cross-validation.

In [2]:
def SplitSet(target: str, filename: str) -> tuple:
    """
    Input: the target column name, the file name of the original dataset
    Output: x, y, xtrain, xtest, ytrain, ytest
    This function splits the original file into a training set and a test set.
    """
    df = pd.read_csv(filename)
    df = df.drop(['ID'], axis=1)
    # Drop rows that are entirely empty, then rows with no value for the target.
    df = df.dropna(axis=0, how='all')
    df = df.dropna(axis=0, subset=[target])
    y = df[target]
    x = df.drop([target], axis=1)
    # Hold out 15% of the rows as the test set.
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15, random_state=10)
    return x, y, xtrain, xtest, ytrain, ytest

def XgbModel(xtrain: pd.DataFrame, ytrain: pd.Series, x: pd.DataFrame, y: pd.Series) -> BaseEstimator:
    """
    Input: the training set for x, the training set for y, the full x, and the full y
    Output: the model
    This function finds the best XGBoost model using grid search and cross-validation, then returns it.
    """
    # Hyperparameter grid explored by the grid search.
    params = {
        'learning_rate': [0.1, 0.01, 0.05],
        'n_estimators': [100, 600, 1000],
        'max_depth': [1, 3],
        'lambda': [0, 0.5, 1],
        'alpha': [0, 0.5, 1]
    }

    xgbr = xgb.XGBRegressor(seed=20)
    clf = GridSearchCV(estimator=xgbr,
                       param_grid=params,
                       scoring='neg_mean_absolute_percentage_error',
                       verbose=1,
                       cv=2,
                       )
    clf.fit(xtrain, ytrain)
    print("Best parameters:", clf.best_params_)
    print("Best scores:", clf.best_score_)

    xgb_model = clf.best_estimator_
    # Note: this 5-fold score is computed for the base estimator `xgbr`
    # on the full x and y, not for the tuned `xgb_model`.
    scores = cross_val_score(xgbr, x, y, cv=5, scoring='neg_mean_absolute_percentage_error')
    # The scorer returns negated MAPE values, so take the absolute value.
    scores = np.absolute(scores)
    print('MAPE CV Score: %.3f (%.3f)' % (scores.mean(), scores.std()))

    return xgb_model



def Predict(previous_file: str, target: str, xgbr: BaseEstimator, new_file_name: str, p_v: str) -> None:
    """
    Input: the previous file name, the name of the target column, the XGBoost model, the name of the new file, and the label string to append
    Output: None
    This function adds the predictions for a target column to the previous file and then saves the result in a new file.
    """
    previous_df = pd.read_csv(previous_file)
    original_df = pd.read_csv('./data/REST_data_1.csv')

    # Select the rows whose target value is missing; copy to avoid modifying a view.
    predict_df = original_df[pd.isnull(original_df[target])].copy()
    true_x_for_prediction = predict_df.drop(columns=['ID', target])
    ypred = xgbr.predict(true_x_for_prediction)

    # Format each prediction to two decimals and append the label (e.g. '(p1)').
    predict_df[target] = [f"{elem:.2f}{p_v}" for elem in ypred]

    # Align on ID and copy the non-missing values of the predicted rows into the previous file.
    previous_df.set_index(['ID'], inplace=True)
    previous_df.update(predict_df.set_index(['ID']))
    previous_df.reset_index(inplace=True)
    previous_df.to_csv(new_file_name, index=False)
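
As a rough check on the cost of the grid search in `XgbModel`, the sketch below enumerates the parameter grid using only the standard library. It is an illustration rather than part of the original notebook; the names `candidates`, `n_candidates`, `cv_folds`, and `n_fits` are introduced here for the example.

```python
from itertools import product

# Same grid as the `params` dictionary in XgbModel.
params = {
    "learning_rate": [0.1, 0.01, 0.05],
    "n_estimators": [100, 600, 1000],
    "max_depth": [1, 3],
    "lambda": [0, 0.5, 1],
    "alpha": [0, 0.5, 1],
}

# One candidate is one combination of a single value per parameter.
candidates = [dict(zip(params, values)) for values in product(*params.values())]
n_candidates = len(candidates)

cv_folds = 2  # the `cv` argument passed to GridSearchCV in XgbModel
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 162 324
```

These counts agree with the logged line "Fitting 2 folds for each of 162 candidates, totalling 324 fits". With its default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set.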

After training and tuning the model on REST1, we predicted the missing values in the 13 fields of the restaurant section in REST1. MAPE (mean absolute percentage error) is a metric of the model's performance and of the accuracy of the predicted data: the higher the MAPE, the less accurate the prediction. If the MAPE for a particular field was > 0.35, we dropped the predictions for that field; if it was < 0.35, we filled in the missing values of that field in the format *(p1). The fields with filled-in missing values are: "REST: Staff size MOD", "REST: No Wtr MOD", "REST: No Ktch Wkr MOD", "REST: No tables MOD", "REST: Monthly Expenses MOD". The detailed MAPE values can be checked in the Perdiction.ipynb file.
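
To make the threshold rule concrete, here is a minimal sketch of how MAPE is computed and compared with the layer-1 cutoff. It is not taken from the notebook: the helper names `mape` and `keep_prediction` and the sample numbers are hypothetical, and treating a score of exactly 0.35 as "dropped" is an assumption, since the text only covers the strict cases.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.35 means 35%)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the true value.
    # A true value of zero would raise ZeroDivisionError in this simple version.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def keep_prediction(score, threshold=0.35):
    """Keep a field's predictions only when its MAPE is below the threshold."""
    return score < threshold


# Hypothetical observed values and predictions for one field.
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 30.0]

score = mape(y_true, y_pred)
print(round(score, 3), keep_prediction(score))
```

For non-zero targets this is the same quantity that `sklearn.metrics.mean_absolute_percentage_error` reports; the scorer string `'neg_mean_absolute_percentage_error'` used in the grid search returns its negation.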

After the first layer of prediction, we conducted the second layer. The data set for the second layer is REST1 with the missing data filled in for the columns that had MAPE < 0.35. We refer to this data as REST2 for the rest of the article. We then applied to REST2 the same procedure we used for REST1, and predicted the missing values for the 8 remaining fields (13 minus the 5 fields predicted in layer 1) of the restaurant section in REST2. In the second layer, if the MAPE for a particular field was > 0.5, we dropped the predictions for that field; if it was < 0.5, we filled in the missing values of that field in the format *(p2). The fields with filled-in missing values are: "REST: Monthly Expenses MOD", "REST: Total Partners MOD", "REST: No Act Part MOD", "REST:  No Pass Part MOD". The detailed MAPE values can be checked in the Perdiction.ipynb file.


In [27]:
#doing first round of prediction
Target = "REST: Staff size MOD" 
print("First round prediction start for:",Target)
x,y,xtrain, xtest, ytrain, ytest = SplitSet(Target,'./data/PredictionData/REST_data_1_Cleaned.csv')
xgbr = XgbModel(xtrain,ytrain, x,y)
Predict('./data/PredictionData/REST_data_1.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v1_without_label.csv','')
Predict('./data/PredictionData/REST_data_1.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v1.csv','(p1)')
print("-----Prediction END-------")

Targetlist = [ "REST: No Wtr MOD","REST: No Ktch Wkr MOD","REST: No tables MOD","REST: Monthly Expenses MOD"]
for Target in Targetlist:
    print("First round prediction start for:",Target)
    x,y,xtrain, xtest, ytrain, ytest = SplitSet(Target,'./data/PredictionData/REST_data_1_Cleaned.csv')
    xgbr = XgbModel(xtrain,ytrain, x,y)
    Predict('./data/PredictionData/REST_data_1_predicted_v1_without_label.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v1_without_label.csv','')
    Predict('./data/PredictionData/REST_data_1_predicted_v1.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v1.csv','(p1)')
    print("-----Prediction END-------")


#doing second round of prediction
Target = "REST: Monthly Expenses MOD"
print("Second round prediction start for:",Target)
x,y,xtrain, xtest, ytrain, ytest = SplitSet(Target,'./data/PredictionData/REST_data_1_predicted_v1_without_label.csv')
xgbr = XgbModel(xtrain,ytrain, x,y)
Predict('./data/PredictionData/REST_data_1_predicted_v1_without_label.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v2_without_label.csv','')
Predict('./data/PredictionData/REST_data_1_predicted_v1.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v2.csv','(p2)')
print("-----Prediction END-------")

Targetlist = ["REST: Total Partners MOD","REST: No Act Part MOD","REST:  No Pass Part MOD"]
for Target in Targetlist:
    print("Second round prediction start for:",Target)
    x,y,xtrain, xtest, ytrain, ytest = SplitSet(Target,'./data/PredictionData/REST_data_1_predicted_v1_without_label.csv')
    xgbr = XgbModel(xtrain,ytrain, x,y)
    Predict('./data/PredictionData/REST_data_1_predicted_v2_without_label.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v2_without_label.csv','')
    Predict('./data/PredictionData/REST_data_1_predicted_v2.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v2.csv','(p2)')
    print("-----Prediction END-------")

First round prediction start for: REST: Staff size MOD
Fitting 2 folds for each of 162 candidates, totalling 324 fits
Best parameters: {'alpha': 0, 'lambda': 1, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
Best scores: -0.4712055346919227
MAPE CV Score: 0.346 (0.154)
-----Prediction END-------
First round prediction start for: REST: No Wtr MOD
Fitting 2 folds for each of 162 candidates, totalling 324 fits
Best parameters: {'alpha': 0.5, 'lambda': 1, 'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 100}
Best scores: -0.31172121821608
MAPE CV Score: 0.222 (0.089)
-----Prediction END-------
First round prediction start for: REST: No Ktch Wkr MOD
Fitting 2 folds for each of 162 candidates, totalling 324 fits
Best parameters: {'alpha': 0, 'lambda': 0, 'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 600}
Best scores: -0.233649934857513
MAPE CV Score: 0.297 (0.193)
-----Prediction END-------
First round prediction start for: REST: No tables MOD
Fitting 2 folds for 

After the second layer of prediction, we conducted the third layer. The data set for the third layer is REST2 with the missing data filled in for the columns that had MAPE < 0.5. We refer to this data as REST3 for the rest of the article. We then applied to REST3 the same procedure we used for REST1, and predicted the missing values for the 4 remaining fields (13 minus the 5 fields predicted in layer 1 and the 4 fields predicted in layer 2) of the restaurant section in REST3. In the third layer, we filled in all the missing values of these fields in the format *(p3). ATTENTION: the MAPE values for the fields in the third layer are > 0.5, meaning that the XGBRegressor does not predict these fields well, so be careful when using the predictions from the third layer. The fields with filled-in missing values are: "REST: Monthly Sales MOD", "REST: Dividends MOD", "REST: Monthly rent MOD", "REST: Total Capital MOD". The detailed MAPE values can be checked in the Perdiction.ipynb file.
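
Because predicted cells carry their layer as a suffix, a reader who wants to exclude the less reliable third-layer values can separate the number from the label. The sketch below shows one way to do that with the standard library. It is not part of the notebook: `LABEL_RE`, `parse_cell`, and `drop_layer` are hypothetical helpers, the sample cells are made up, and the pattern assumes cells look like `12.50` or `12.50(p3)`, as written by `Predict`.

```python
import re

# A number, optionally followed by a layer label such as "(p1)", "(p2)" or "(p3)".
LABEL_RE = re.compile(r"^(?P<value>-?\d+(?:\.\d+)?)(?:\((?P<layer>p[123])\))?$")


def parse_cell(cell):
    """Split a cell such as '12.50(p3)' into (12.5, 'p3').

    Observed (unlabelled) values are returned with layer None.
    """
    match = LABEL_RE.match(str(cell).strip())
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return float(match.group("value")), match.group("layer")


def drop_layer(cells, layer="p3"):
    """Return numeric values, replacing cells predicted in `layer` with None."""
    cleaned = []
    for cell in cells:
        value, cell_layer = parse_cell(cell)
        cleaned.append(None if cell_layer == layer else value)
    return cleaned


# Made-up example column: one observed value, one layer-1 and one layer-3 prediction.
column = ["1500.00", "1320.55(p1)", "980.10(p3)"]
cleaned = drop_layer(column)
print(cleaned)  # [1500.0, 1320.55, None]
```

Passing `layer="p2"` instead would mask second-layer predictions in the same way.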

In [41]:
#third round
Target = "REST: Monthly Sales MOD"
print("Third round prediction start for:",Target)
x,y,xtrain, xtest, ytrain, ytest = SplitSet(Target,'./data/PredictionData/REST_data_1_predicted_v2_without_label.csv')
xgbr = XgbModel(xtrain,ytrain, x,y)
Predict('./data/PredictionData/REST_data_1_predicted_v2.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v3.csv','(p3)')

print("-----Prediction END-------")

Targetlist = ["REST: Dividends MOD","REST: Monthly rent MOD","REST: Total Capital MOD"]
for Target in Targetlist:
    print("Third round prediction start for:",Target)
    x,y,xtrain, xtest, ytrain, ytest = SplitSet(Target,'./data/PredictionData/REST_data_1_predicted_v2_without_label.csv')
    xgbr = XgbModel(xtrain,ytrain, x,y)
    Predict('./data/PredictionData/REST_data_1_predicted_v3.csv',Target, xgbr,'./data/PredictionData/REST_data_1_predicted_v3.csv','(p3)')
    print("-----Prediction END-------")

Third round prediction start for: REST: Monthly Sales MOD
Fitting 2 folds for each of 162 candidates, totalling 324 fits
Best parameters: {'alpha': 1, 'lambda': 1, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
Best scores: -0.5222423824244933
MAPE CV Score: 0.576 (0.180)
-----Prediction END-------
Third round prediction start for: REST: Dividends MOD
Fitting 2 folds for each of 162 candidates, totalling 324 fits
Best parameters: {'alpha': 1, 'lambda': 0.5, 'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 100}
Best scores: -0.6039485981551684
MAPE CV Score: 0.901 (0.249)
-----Prediction END-------
Third round prediction start for: REST: Monthly rent MOD
Fitting 2 folds for each of 162 candidates, totalling 324 fits
Best parameters: {'alpha': 0, 'lambda': 1, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
Best scores: -0.5886857927641571
MAPE CV Score: 0.721 (0.330)
-----Prediction END-------
Third round prediction start for: REST: Total Capital MOD
Fittin