## Target Prediction
<div class="alert alert-block alert-info"> 
In this notebook we predict the target variable with a **Pipeline** built around the best model obtained in a previous notebook.
    
- We first apply the pre-processing to the dataset without the target
    
- Then we apply the model with the best parameters, and predict our target

- Finally, we save our target in a JSON file

<b> Now the real stuff begins!</b> 🥁🤹‍♀️  
</div>

<div class="alert alert-success">
<b>Importing the necessary libraries</b>

</div>

In [1]:
# Import basic required libraries
import pandas as pd
import numpy as np 
import json

## Plotting Libraries
import matplotlib.pyplot as plt
import matplotlib_inline
from IPython.display import Image ##use the IPython Image object to display an Image
##To Plot interactively within an IPython notebook
%matplotlib inline 
import matplotlib.image as mpimg
import seaborn as sns 

## Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import basic modules from sklearn
from scipy import stats ## to check normality and variance of columns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler, LabelBinarizer, OrdinalEncoder

## Import classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

## Save pipeline
import joblib


<div class="alert alert-success">
<b>Load the dataset for training</b>

- And do a basic check
 
</div>

In [119]:
# df = pd.read_csv('test_NEW.csv')
df_train = pd.read_csv('train_NEW.csv')
# df.head()

In [120]:
## Printing the shape of the df
print(df_train.shape)

## Showing df information
df_train.info()

(276982, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276982 entries, 0 to 276981
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   Date                   276982 non-null  object
 1   Origin Country         276982 non-null  object
 2   Origin Continent       276982 non-null  object
 3   Destination Country    276982 non-null  int64 
 4   Destination Continent  276982 non-null  object
 5   Total flights          276982 non-null  object
 6   Total seats            276982 non-null  object
 7   Total ASKs             276982 non-null  object
dtypes: int64(1), object(7)
memory usage: 16.9+ MB


In [121]:
## Display training dataset
df_train

Unnamed: 0,Date,Origin Country,Origin Continent,Destination Country,Destination Continent,Total flights,Total seats,Total ASKs
0,Jul 2009,United Kingdom,Europe,13,Europe,9032,1531683,2447559137
1,Apr 2008,Lebanon,Middle East,9,Europe,5,760,2389940
2,Apr 2005,Switzerland,Europe,11,Europe,1471,158661,66533450
3,Aug 2016,Israel,Middle East,19,Europe,117,23366,61557637
4,Feb 2019,Albania,Europe,8,Europe,80,12854,9837347
...,...,...,...,...,...,...,...,...
276977,Jul 2012,Iraq,Middle East,21,Europe,4,744,2635372
276978,Jun 2007,Cyprus,Europe,4,Europe,64,9519,19409611
276979,Sep 2007,Panama,Central America,17,North America,442,63061,161243990
276980,Nov 2015,Russian Federation,Europe,9,Europe,90,13422,30147391


<div class="alert alert-success">
<b>Pre-processing of the training dataset</b>

- Here, we apply to the dataset all the processing and pre-processing steps detailed in previous notebooks.

- In summary, we split the date into Year and Month (without encoding them, because encoding gave worse results), make sure all numeric columns are floats, and encode the Origin Country.

</div>

In [122]:
# Convert the 'Date' column (e.g. 'Jul 2009') to datetime
df_train['Date'] = pd.to_datetime(df_train['Date'], format='%b %Y')
## Parsing adds a day (01) that is not in the data, so we keep only
## the year and the month, each in its own column
df_train['Year'] = df_train['Date'].dt.year
df_train['Month'] = df_train['Date'].dt.month
## Drop the 'Date' column
df_train = df_train.drop(columns=['Date'])

## Finally, reorder the columns so that Year and Month come first
df_train_columns = df_train.columns.to_list()
df_train = df_train[df_train_columns[-2:] + df_train_columns[:-2]]

## Convert the numeric columns stored as strings to float.
## After the reordering, these are the last 3 columns
## ('Total flights', 'Total seats', 'Total ASKs')
for column in df_train.columns[-3:]:
    ## Remove the commas from the number strings
    df_train[column] = df_train[column].str.replace(',', '')
    ## Convert to float; values that cannot be parsed become NaN and are filled with 0
    df_train[column] = pd.to_numeric(df_train[column], errors='coerce').fillna(0).astype(float)


## Display df_train
df_train


Unnamed: 0,Year,Month,Origin Country,Origin Continent,Destination Country,Destination Continent,Total flights,Total seats,Total ASKs
0,2009,7,United Kingdom,Europe,13,Europe,9032.0,1531683.0,2.447559e+09
1,2008,4,Lebanon,Middle East,9,Europe,5.0,760.0,2.389940e+06
2,2005,4,Switzerland,Europe,11,Europe,1471.0,158661.0,6.653345e+07
3,2016,8,Israel,Middle East,19,Europe,117.0,23366.0,6.155764e+07
4,2019,2,Albania,Europe,8,Europe,80.0,12854.0,9.837347e+06
...,...,...,...,...,...,...,...,...,...
276977,2012,7,Iraq,Middle East,21,Europe,4.0,744.0,2.635372e+06
276978,2007,6,Cyprus,Europe,4,Europe,64.0,9519.0,1.940961e+07
276979,2007,9,Panama,Central America,17,North America,442.0,63061.0,1.612440e+08
276980,2015,11,Russian Federation,Europe,9,Europe,90.0,13422.0,3.014739e+07
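
The clean-up in the cell above is repeated later for the test set. As a minimal sketch, it could be wrapped in one reusable function. The function name `split_date_and_parse_numbers` and the toy values below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def split_date_and_parse_numbers(df, numeric_columns):
    """Return a copy with 'Date' split into Year/Month and numeric strings parsed."""
    out = df.copy()
    dates = pd.to_datetime(out.pop('Date'), format='%b %Y')
    # Put Year and Month at the front, as the notebook does
    out.insert(0, 'Year', dates.dt.year)
    out.insert(1, 'Month', dates.dt.month)
    for column in numeric_columns:
        # Remove commas, then coerce; values that cannot be parsed become 0
        cleaned = out[column].astype(str).str.replace(',', '', regex=False)
        out[column] = pd.to_numeric(cleaned, errors='coerce').fillna(0).astype(float)
    return out

# Toy rows, for illustration only
toy = pd.DataFrame({
    'Date': ['Jul 2009', 'Apr 2008'],
    'Origin Country': ['United Kingdom', 'Lebanon'],
    'Total flights': ['9,032', '5'],
})
clean = split_date_and_parse_numbers(toy, ['Total flights'])
print(clean)
```

Calling the same function on both the training and the test frame would keep the two clean-up steps from drifting apart.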


In [123]:
## Manually encode the Origin Country with ordinal codes
## Read the JSON file with the country codes into a dictionary
with open('encode_countries.json', 'r') as f:
    encodedCountries = json.load(f)


## Get the unique country labels
unique_OriginContry = df_train['Origin Country'].unique()

## Loop through the country labels
for label in unique_OriginContry:
    ## If the label is not in the dictionary yet, assign it the current maximum code plus 1
    if label not in encodedCountries:
        encodedCountries[label] = max(encodedCountries.values()) + 1

## Map the Origin Country column to its codes
df_train['Origin Country'] = df_train['Origin Country'].map(encodedCountries)


## Display df_train
df_train


Unnamed: 0,Year,Month,Origin Country,Origin Continent,Destination Country,Destination Continent,Total flights,Total seats,Total ASKs
0,2009,7,15,Europe,13,Europe,9032.0,1531683.0,2.447559e+09
1,2008,4,25,Middle East,9,Europe,5.0,760.0,2.389940e+06
2,2005,4,14,Europe,11,Europe,1471.0,158661.0,6.653345e+07
3,2016,8,26,Middle East,19,Europe,117.0,23366.0,6.155764e+07
4,2019,2,27,Europe,8,Europe,80.0,12854.0,9.837347e+06
...,...,...,...,...,...,...,...,...,...
276977,2012,7,76,Middle East,21,Europe,4.0,744.0,2.635372e+06
276978,2007,6,70,Europe,4,Europe,64.0,9519.0,1.940961e+07
276979,2007,9,177,Central America,17,North America,442.0,63061.0,1.612440e+08
276980,2015,11,2,Europe,9,Europe,90.0,13422.0,3.014739e+07
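
The loop above gives every unseen country the next free integer code. A minimal sketch of that logic as a standalone function is shown here; `extend_mapping` and the toy country codes are hypothetical names and values chosen for illustration.

```python
def extend_mapping(mapping, labels):
    """Return a copy of `mapping` with a new integer code for each unseen label."""
    extended = dict(mapping)
    next_code = max(extended.values(), default=-1) + 1
    for label in labels:
        if label not in extended:
            extended[label] = next_code
            next_code += 1
    return extended

# Toy mapping and labels, for illustration only
codes = {'Spain': 0, 'France': 1}
codes = extend_mapping(codes, ['France', 'Peru', 'Chile'])
print(codes)  # {'Spain': 0, 'France': 1, 'Peru': 2, 'Chile': 3}
```

Because the notebook reloads `encode_countries.json` before processing the test set, codes added during training are not carried over. Reusing the returned dictionary, or saving it back to disk, is one possible way to keep the codes consistent between the two sets; the original notebook does not do this.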


<div class="alert alert-success">
<b>Pipeline pre-processing of the training dataset</b>

- Here, we include the whole workflow in the pipeline.
    1. Preprocessing of the training set
        - Robust scaler for the 'Total flights', 'Total seats', 'Total ASKs' columns
        - One-hot encoder for the 'Origin Continent', 'Destination Continent' columns
        - MinMax scaler for the 'Origin Country', 'Year', 'Month' columns (the code below also passes 'Total flights', 'Total seats' and 'Total ASKs' through this scaler, so those three columns reach the model twice)


</div>

In [124]:
numeric_features = ['Total flights', 'Total seats', 'Total ASKs']
numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy = 'median')),
        ('robust', RobustScaler())
    ]
)


categorical_features = ['Origin Continent', 'Destination Continent']
categorical_transformer = Pipeline(
    [
        ('imputer_cat', SimpleImputer(strategy = 'most_frequent')),  ## fill_value only applies to strategy = 'constant'
        ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
    ]
)

numeric_features_minmax = ['Origin Country', 'Year', 'Month', 'Total flights', 'Total seats', 'Total ASKs']
numeric_transformer_minmax = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy = 'median')),
        ('minmax', MinMaxScaler())
    ]
)

<div class="alert alert-success">
<b></b>

2. Put all these pre-processing to the object preprocessor


</div>

In [125]:
preprocessor = ColumnTransformer(
    [
        ('categoricals', categorical_transformer, categorical_features),
        ('numericals', numeric_transformer, numeric_features),
        ('numericals_minmax', numeric_transformer_minmax, numeric_features_minmax)
        
    ],
    remainder = 'drop'
)
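
To see what a `ColumnTransformer` like the one above produces, it can help to run it on a few made-up rows and look at the output shape. The sketch below rebuilds the three transformers under the name `toy_preprocessor`; the rows are invented for illustration and the variable names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler

numeric_features = ['Total flights', 'Total seats', 'Total ASKs']
categorical_features = ['Origin Continent', 'Destination Continent']
minmax_features = ['Origin Country', 'Year', 'Month'] + numeric_features

toy_preprocessor = ColumnTransformer(
    [
        ('categoricals',
         Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                   ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
         categorical_features),
        ('numericals',
         Pipeline([('imputer', SimpleImputer(strategy='median')),
                   ('robust', RobustScaler())]),
         numeric_features),
        ('numericals_minmax',
         Pipeline([('imputer', SimpleImputer(strategy='median')),
                   ('minmax', MinMaxScaler())]),
         minmax_features),
    ],
    remainder='drop',
)

# Made-up rows, only to inspect the output shape
toy = pd.DataFrame({
    'Year': [2009, 2008, 2005, 2016],
    'Month': [7, 4, 4, 8],
    'Origin Country': [15, 25, 14, 26],
    'Origin Continent': ['Europe', 'Middle East', 'Europe', 'Middle East'],
    'Destination Continent': ['Europe', 'Europe', 'North America', 'Europe'],
    'Total flights': [9032.0, 5.0, 1471.0, 117.0],
    'Total seats': [1531683.0, 760.0, 158661.0, 23366.0],
    'Total ASKs': [2.4e9, 2.4e6, 6.7e7, 6.2e7],
})

transformed = toy_preprocessor.fit_transform(toy)
# 2 + 2 one-hot columns, 3 robust-scaled columns, 6 min-max-scaled columns
print(transformed.shape)  # (4, 13)
```

The 13 output columns make the overlap visible: the three traffic columns appear once robust-scaled and once min-max-scaled.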

<div class="alert alert-success">
<b></b>

3. Include the preprocessor object and the model to the pipeline


</div>

In [126]:
pipeline = Pipeline(
    [
        ('preprocessing', preprocessor),
        ('model', RandomForestClassifier(n_estimators= 200, 
                                 min_samples_split = 8,
                                 min_samples_leaf = 4, 
                                 max_features = 'sqrt',  ## 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
                                 max_depth = None,
                                 bootstrap = False, 
                                 random_state = 42))
    ]
)
pipeline

<div class="alert alert-success">
<b>Train the Pipeline</b>

- First separate the training data into features and target

</div>

In [127]:
## Create features and target, X and y, from df_train
## Features: every column except the target
X_train = df_train.drop(['Destination Country'], axis=1)

## Target
y_train = df_train['Destination Country']

## Release the full training dataframe; only X_train and y_train are needed from here on
del df_train

## Display X_train
X_train


Unnamed: 0,Year,Month,Origin Country,Origin Continent,Destination Continent,Total flights,Total seats,Total ASKs
0,2009,7,15,Europe,Europe,9032.0,1531683.0,2.447559e+09
1,2008,4,25,Middle East,Europe,5.0,760.0,2.389940e+06
2,2005,4,14,Europe,Europe,1471.0,158661.0,6.653345e+07
3,2016,8,26,Middle East,Europe,117.0,23366.0,6.155764e+07
4,2019,2,27,Europe,Europe,80.0,12854.0,9.837347e+06
...,...,...,...,...,...,...,...,...
276977,2012,7,76,Middle East,Europe,4.0,744.0,2.635372e+06
276978,2007,6,70,Europe,Europe,64.0,9519.0,1.940961e+07
276979,2007,9,177,Central America,North America,442.0,63061.0,1.612440e+08
276980,2015,11,2,Europe,Europe,90.0,13422.0,3.014739e+07


In [128]:
## Display y_train
y_train

0         13
1          9
2         11
3         19
4          8
          ..
276977    21
276978     4
276979    17
276980     9
276981    13
Name: Destination Country, Length: 276982, dtype: int64

In [129]:
## Train the pipeline with the training data
pipeline.fit(X_train,y_train)
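
`joblib` is imported at the top of the notebook for saving the pipeline, but no save step is shown. As a minimal sketch, a fitted pipeline can be written to disk and restored with `joblib.dump` and `joblib.load`. The tiny stand-in pipeline, the toy data and the file name `toy_pipeline.joblib` are assumptions made for this example.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in pipeline and data, only to demonstrate the round trip
toy_pipeline = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', RandomForestClassifier(n_estimators=10, random_state=42)),
])
X_toy = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y_toy = [0, 1, 0, 1]
toy_pipeline.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.joblib')  # hypothetical file name
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should predict exactly like the original one
same_predictions = list(restored.predict(X_toy)) == list(toy_pipeline.predict(X_toy))
print(same_predictions)
```

Saving the fitted pipeline this way would allow predicting later without retraining on the full training set.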

In [130]:
### Release the training data to free up memory
### (slicing with .iloc[0:0] only returns an empty view and frees nothing)
del X_train, y_train


<div class="alert alert-success">
<b>Predict the target with Pipeline</b>

- First, load the data
- Then apply the same initial processing to it
- Use the pipeline to predict the target


</div>

In [131]:
## Loading test data
XX_test = pd.read_csv('test_NEW.csv')

In [132]:
## Create features for test data

# Convert the 'Date' column to datetime
XX_test['Date'] = pd.to_datetime(XX_test['Date'], format='%b %Y')
## Parsing adds a day (01) that is not in the data, so we keep only
## the year and the month, each in its own column
XX_test['Year'] = XX_test['Date'].dt.year
XX_test['Month'] = XX_test['Date'].dt.month
## Drop the 'Date' column
XX_test = XX_test.drop(columns=['Date'])

## Finally, reorder the columns so that Year and Month come first
XX_test_columns = XX_test.columns.to_list()
XX_test = XX_test[XX_test_columns[-2:] + XX_test_columns[:-2]]

## Convert the numeric columns stored as strings to float.
## After the reordering, these are the last 3 columns
for column in XX_test.columns[-3:]:
    ## Remove the commas from the number strings
    XX_test[column] = XX_test[column].str.replace(',', '')
    ## Convert to float; values that cannot be parsed become NaN and are filled with 0
    XX_test[column] = pd.to_numeric(XX_test[column], errors='coerce').fillna(0).astype(float)


## Manually encode the Origin Country with ordinal codes
## Read the JSON file with the country codes into a dictionary
with open('encode_countries.json', 'r') as f:
    encodedCountries = json.load(f)


## Get the unique country labels
unique_OriginContry = XX_test['Origin Country'].unique()

## Loop through the country labels
for label in unique_OriginContry:
    ## If the label is not in the dictionary yet, assign it the current maximum code plus 1
    if label not in encodedCountries:
        encodedCountries[label] = max(encodedCountries.values()) + 1

## Map the Origin Country column to its codes
XX_test['Origin Country'] = XX_test['Origin Country'].map(encodedCountries)

## Display XX_test
XX_test


Unnamed: 0,Year,Month,Origin Country,Origin Continent,Destination Continent,Total flights,Total seats,Total ASKs
0,2006,2,2,Europe,Europe,168.0,18269.0,30375905.0
1,2017,4,19,Europe,Europe,211.0,38974.0,63519202.0
2,2014,9,15,Europe,Asia,459.0,129105.0,924093500.0
3,2004,8,15,Europe,Europe,3607.0,530792.0,680683678.0
4,2010,12,25,Middle East,Europe,35.0,5860.0,21011724.0
...,...,...,...,...,...,...,...,...
69241,2017,9,24,Middle East,Europe,120.0,39753.0,202770716.0
69242,2020,9,20,Europe,Africa,1.0,174.0,635379.0
69243,2013,4,106,Africa,Middle East,51.0,12792.0,64930542.0
69244,2007,12,13,Europe,Africa,70.0,10842.0,35771177.0


In [133]:
## predict results with the pipeline and test data
y_pred = pipeline.predict(XX_test)

In [134]:
## Display predicted data
y_pred

array([ 4, 13,  1, ...,  3, 10,  1], dtype=int64)
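
Before writing the predictions to disk, two quick checks can catch obvious problems: that there is one prediction per test row, and that every predicted label is one the model saw during training. The sketch below shows both checks on a small random stand-in model; the data and the names `toy_model` and `toy_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random stand-in data, for illustration only
rng = np.random.default_rng(0)
X_toy_train = rng.random((40, 3))
y_toy_train = rng.integers(0, 4, size=40)
X_toy_test = rng.random((10, 3))

toy_model = RandomForestClassifier(n_estimators=10, random_state=42)
toy_model.fit(X_toy_train, y_toy_train)
toy_pred = toy_model.predict(X_toy_test)

# One prediction per test row
one_per_row = len(toy_pred) == X_toy_test.shape[0]
# Every predicted label belongs to the classes learned during fit
known_labels = set(toy_pred).issubset(set(toy_model.classes_))
print(one_per_row, known_labels)
```

In the notebook, the same comparisons would use `y_pred`, the number of rows in `XX_test` and the classes of the fitted model.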

In [135]:
## Release the test dataframe to free up memory
## (slicing with .iloc[0:0] only returns an empty view and frees nothing)
del XX_test


<div class="alert alert-success">
<b>Write the prediction to a JSON file!</b>

- The instructions are:
    - The file to submit must be a JSON file.
    - It must contain the key "target", whose value is a JSON object with the model's predictions. In that object, each key is the row's test_idx and each value is the prediction made by the model for that row.

    
</div>

In [136]:
## Define a function to write the JSON file with the prediction as required
def write_to_json(array, filename):
    data = {"target": {str(i): int(v) for i, v in enumerate(array)}}
    with open(filename, "w") as f:
        json.dump(data, f)

In [137]:
write_to_json(y_pred, "predictions_CBiCa_x.json")

In [138]:
with open('predictions_CBiCa_x.json', 'r') as f:
    prediction = json.load(f)

In [118]:
prediction

{'target': {'0': 4,
  '1': 8,
  '2': 1,
  '3': 7,
  '4': 19,
  '5': 7,
  '6': 4,
  '7': 10,
  '8': 8,
  '9': 11,
  '10': 21,
  '11': 1,
  '12': 10,
  '13': 18,
  '14': 7,
  '15': 2,
  '16': 2,
  '17': 19,
  '18': 16,
  '19': 10,
  '20': 10,
  '21': 20,
  '22': 19,
  '23': 22,
  '24': 22,
  '25': 14,
  '26': 6,
  '27': 5,
  '28': 17,
  '29': 6,
  '30': 17,
  '31': 8,
  '32': 7,
  '33': 8,
  '34': 4,
  '35': 13,
  '36': 15,
  '37': 0,
  '38': 10,
  '39': 2,
  '40': 15,
  '41': 15,
  '42': 21,
  '43': 7,
  '44': 18,
  '45': 23,
  '46': 12,
  '47': 23,
  '48': 9,
  '49': 16,
  '50': 2,
  '51': 13,
  '52': 1,
  '53': 22,
  '54': 22,
  '55': 12,
  '56': 5,
  '57': 15,
  '58': 6,
  '59': 4,
  '60': 12,
  '61': 13,
  '62': 19,
  '63': 8,
  '64': 6,
  '65': 12,
  '66': 20,
  '67': 14,
  '68': 2,
  '69': 15,
  '70': 2,
  '71': 21,
  '72': 15,
  '73': 14,
  '74': 23,
  '75': 2,
  '76': 11,
  '77': 11,
  '78': 15,
  '79': 17,
  '80': 11,
  '81': 10,
  '82': 11,
  '83': 2,
  '84': 17,
  '85': 13,
 

In [76]:
## Check that the number of entries in the dictionary matches the number of rows in the test set
inner_dict = prediction['target']
shape = len(inner_dict)
print(shape)

69246
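
The length check above can be extended to the keys and value types of the file. The following is a minimal sketch under the assumption that test_idx is simply the row position (0, 1, 2, ...), which is what `write_to_json` produces through `enumerate`. The function name `validate_submission` and the toy file are hypothetical.

```python
import json
import os
import tempfile

def validate_submission(path, n_rows):
    """Check for a 'target' object keyed '0'..str(n_rows - 1) with integer values."""
    with open(path, 'r') as f:
        data = json.load(f)
    target = data['target']
    expected_keys = [str(i) for i in range(n_rows)]
    keys_ok = list(target.keys()) == expected_keys
    values_ok = all(isinstance(v, int) for v in target.values())
    return keys_ok and values_ok

# Toy predictions written in the same layout as write_to_json
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_predictions.json')
    with open(path, 'w') as f:
        json.dump({'target': {str(i): int(v) for i, v in enumerate([4, 13, 1])}}, f)
    is_valid = validate_submission(path, n_rows=3)
    wrong_length = validate_submission(path, n_rows=4)

print(is_valid, wrong_length)  # True False
```

For the real file, the call would pass the path of the predictions file and the number of rows in the test set.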
