# Feature Engineering and Modelling

---

1. Import packages
2. Load data
3. Modelling

---

## 1. Import packages

In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt

# Shows plots in jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)

---
## 2. Load data

In [3]:
df = pd.read_csv('/content/data_for_predictions.csv')
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

Unnamed: 0,id,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,24011ae4ebbe3035111d65fa7c15bc57,0.0,4.739944,0.0,0.0,0.0,0.444045,0.114481,0.098142,40.606701,...,2,6,0,0,1,0,0,0,0,1
1,d29c2c54acc38ff3c0614d0a653813dd,3.668479,0.0,0.0,2.28092,0.0,1.237292,0.145711,0.0,44.311378,...,76,4,1,0,0,0,0,1,0,0
2,764c75f661154dac3a6c254cd082ea7d,2.736397,0.0,0.0,1.689841,0.0,1.599009,0.165794,0.087899,44.311378,...,68,8,0,0,1,0,0,1,0,0
3,bba03439a292a1e166f80264c16191cb,3.200029,0.0,0.0,2.382089,0.0,1.318689,0.146694,0.0,44.311378,...,69,9,0,0,0,1,0,1,0,0
4,149d57cf92fc41cf94415803a877cb4b,3.646011,0.0,2.721811,2.650065,0.0,2.122969,0.1169,0.100015,40.606701,...,71,9,1,0,0,0,0,1,0,0


---

## 3. Modelling

We now have a dataset containing features that we have engineered and we are ready to start training a predictive model. Remember, we only need to focus on training a `Random Forest` classifier.

In [4]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

### Data sampling

The first thing we want to do is split our dataset into training and test samples. We do this to simulate a real-life situation: the model generates predictions for the test sample without ever having seen those data points during training. This shows how well the model generalises to new data, which is critical.

A typical share of the data to dedicate to testing is 20-30%. For this example we will use a 75/25 split between train and test respectively.

In [5]:
# Make a copy of our data
train_df = df.copy()

# Separate target variable from independent variables
y = df['churn']
X = df.drop(columns=['id', 'churn'])
print(X.shape)
print(y.shape)

(14606, 61)
(14606,)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10954, 61)
(10954,)
(3652, 61)
(3652,)
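
If churners turn out to be a small minority of customers, a purely random split can leave the train and test samples with slightly different churn rates. A minimal sketch of a stratified split follows. It runs on synthetic data, and `X_demo`, `y_demo` and the 10% positive rate are illustrative assumptions, not values taken from the churn dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, roughly 10% positives (assumed)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.1).astype(int)

# stratify=y_demo keeps the class ratio (almost) identical in both samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print("positive rate overall:", y_demo.mean())
print("positive rate train:  ", y_tr.mean())
print("positive rate test:   ", y_te.mean())
```

On the notebook's data the same idea would amount to passing `stratify=y` to the existing `train_test_split` call.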


### Model training

Once again, we are using a `Random Forest` classifier in this example. A Random Forest sits within the category of `ensemble` algorithms because internally the `Forest` refers to a collection of `Decision Trees` which are tree-based learning algorithms. As the data scientist, you can control how large the forest is (that is, how many decision trees you want to include).

An `ensemble` algorithm is powerful because of averaging: combining many imperfect learners cancels out much of their individual error. If we take a single decision tree and give it a sample of data and some parameters, it will learn patterns from the data. It may overfit or it may underfit, but that single model is all we have to rely on.

With `ensemble` methods, instead of banking on a single trained model, we can train thousands of decision trees, each using a different sample of the data and learning different patterns. It is like asking 1000 people to learn how to code: you end up with 1000 people with different answers, methods and styles! The weak learner notion applies here too. If each learner picks up only weak patterns in the data rather than overfitting, and you have many such learners, together they form a highly predictive pool of knowledge. This is a real-life application of "many brains are better than one".

Instead of relying on a single decision tree for prediction, the random forest combines the views of the entire collection of decision trees. Some ensemble algorithms use a voting approach to decide which prediction is best, others use averaging.

As we increase the number of learners, the random forest's performance should converge towards the best it can achieve, with each additional tree adding less.
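
To make the "combined views" idea concrete, here is a small sketch showing that scikit-learn's `RandomForestClassifier` averages the class probabilities of its individual trees. It uses synthetic data from `make_classification`; the names `X_demo`, `y_demo` and `forest` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to illustrate the mechanism
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Ask every tree in the forest for its class probabilities, then average them
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_probs = tree_probs.mean(axis=0)

# The class with the highest average probability is the forest's prediction
manual_pred = forest.classes_[mean_probs.argmax(axis=1)]

print(np.allclose(mean_probs, forest.predict_proba(X_demo)))
print((manual_pred == forest.predict(X_demo)).all())
```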

Some additional advantages of the random forest classifier include:

- The random forest uses a rule-based approach instead of a distance calculation, so features do not need to be scaled.
- It can capture non-linear relationships between the features and the target better than linear models.
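
The scaling point can be checked with a quick experiment. The sketch below trains two forests with the same seed, one on raw synthetic features and one on the same features multiplied by very different constants, and compares their predictions. The data and the `scales` values are arbitrary assumptions; in practice the predictions should agree on all or nearly all rows, with any difference coming down to floating-point rounding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; each column is then multiplied by a different (arbitrary) constant
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=1)
scales = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
X_rescaled = X_demo * scales

# Same seed, same rows: the only difference is the scale of the features
raw_forest = RandomForestClassifier(n_estimators=30, random_state=1)
raw_forest.fit(X_demo[:450], y_demo[:450])
scaled_forest = RandomForestClassifier(n_estimators=30, random_state=1)
scaled_forest.fit(X_rescaled[:450], y_demo[:450])

# Share of held-out rows on which both forests make the same prediction
agreement = np.mean(
    raw_forest.predict(X_demo[450:]) == scaled_forest.predict(X_rescaled[450:])
)
print(f"Share of identical predictions: {agreement:.3f}")
```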

On the flip side, some disadvantages of the random forest classifier include:

- Training a random forest on a large dataset needs a lot of computational power, since we have to build a whole ensemble of estimators.
- Training time can be longer due to the increased complexity and size of the ensemble.

In [8]:
# Train a random forest with scikit-learn's default parameters.
# No random_state is set, so results can vary slightly between runs.
model = RandomForestClassifier()
model.fit(X_train, y_train)
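
The model above relies on the defaults. As a rough sketch of how parameters could be passed, the example below sets `n_estimators` (the size of the forest) and `random_state` (for reproducible results) and compares a few forest sizes on synthetic data. The specific values are illustrative assumptions, not tuned settings for the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the engineered churn features
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Train forests of different sizes and record the test accuracy of each
scores = {}
for n_trees in (10, 100, 300):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    forest.fit(X_tr, y_tr)
    scores[n_trees] = forest.score(X_te, y_te)

print(scores)
```

Fixing `random_state` makes a run repeatable, which helps when comparing settings like these.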

### Evaluation

Now let's evaluate how well this trained model is able to predict the values of the test dataset.

In [10]:
# Generate predictions for the test set
model_prediction = model.predict(X_test)


In [11]:
# Calculate accuracy; scikit-learn expects the true labels first, then the predictions
model_accuracy = metrics.accuracy_score(y_test, model_prediction)
print(model_accuracy)

0.9033406352683461


In [18]:
# Convert the accuracy to a percentage, rounded to two decimal places
model_accuracy_percent = round(metrics.accuracy_score(y_test, model_prediction) * 100, 2)
print("The accuracy of the model is", model_accuracy_percent, "%")

The accuracy of the model is 90.33 %
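
Accuracy alone can look flattering when one class dominates, because a model that rarely predicts the minority class still scores well. A hedged sketch of a fuller evaluation follows, using a confusion matrix, precision and recall. It runs on synthetic, deliberately imbalanced data; the 90/10 class weights are an assumption for illustration and say nothing about the real churn rate.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 90% class 0 and 10% class 1 (assumed ratio)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_te, y_pred)
accuracy = metrics.accuracy_score(y_te, y_pred)
precision = metrics.precision_score(y_te, y_pred, zero_division=0)
recall = metrics.recall_score(y_te, y_pred, zero_division=0)

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Applied to the notebook's own variables, the same calls would take `y_test` and `model_prediction` as arguments.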


In [20]:
X.head()

Unnamed: 0,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,has_gas,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,0.0,4.739944,0.0,0.0,0.0,0.444045,0.114481,0.098142,40.606701,1,...,2,6,0,0,1,0,0,0,0,1
1,3.668479,0.0,0.0,2.28092,0.0,1.237292,0.145711,0.0,44.311378,0,...,76,4,1,0,0,0,0,1,0,0
2,2.736397,0.0,0.0,1.689841,0.0,1.599009,0.165794,0.087899,44.311378,0,...,68,8,0,0,1,0,0,1,0,0
3,3.200029,0.0,0.0,2.382089,0.0,1.318689,0.146694,0.0,44.311378,0,...,69,9,0,0,0,1,0,1,0,0
4,3.646011,0.0,2.721811,2.650065,0.0,2.122969,0.1169,0.100015,40.606701,0,...,71,9,1,0,0,0,0,1,0,0


## Prediction testing

### Prediction 1

**Let's check whether the model predicts correctly for individual customers, using the same variables it was trained on. We need to pick a row: I was born on the 9th, so we will predict the churn status of the row with index 9.**

---


**First step**

First, we make sure the values we pass to the model are arranged in the same column order as the training data. I did this in Excel.

In [22]:
X.iloc[8:11]

Unnamed: 0,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,has_gas,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
8,3.471732,0.0,0.0,2.648731,0.0,1.2266,0.145711,0.0,44.311378,0,...,51,3,0,0,0,0,1,1,0,0
9,4.416058,0.0,3.340246,3.437608,0.0,2.118695,0.115761,0.099419,40.606701,0,...,8,7,0,0,0,1,0,0,0,1
10,4.034709,0.0,3.493179,3.081196,0.0,1.341237,0.164637,0.087381,44.311378,0,...,53,5,0,0,0,0,1,1,0,0


In [25]:
input_data = (4.416057729,0,3.340245762,3.437607888,0,2.118694508,0.115761,0.099419,40.606701,0,2.343585821,33.42,33.42,1,329.6,31.5,1.71E-05,3.38E-06,3.08E-06,0,0,0,1.71E-05,3.38E-06,3.08E-06,1.11E-05,2.90E-06,4.86E-10,0,0,0,1.11E-05,2.90E-06,4.86E-10,-0.007137,0,0.02042725,0.02916625,0.0495935,16.29155496,8.14577508,24.43733004,0.022671,0.031988,0.054659,16.29155496,8.14577508,24.43733004,6,67,4,8,7,0,0,0,1,0,0,0,1,
)

# Convert the input data to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape to (1, n_features) because we are predicting for a single instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

prediction = model.predict(input_data_reshaped)
print(prediction)

if prediction[0] == 0:
    print("There is no churn")
else:
    print("There is a churn")

[0]
There is no churn
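
Typing 61 values by hand is error-prone, and a single misplaced value silently changes the prediction. A possible alternative is to select the row straight from the feature DataFrame, which keeps the column order and names intact. The sketch below shows the idea on a small synthetic frame; `X_demo`, `y_demo`, `clf` and the `feature_i` column names are hypothetical stand-ins for the notebook's `X`, `y` and `model`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic frame standing in for the real feature table
features, target = make_classification(n_samples=200, n_features=5, random_state=42)
X_demo = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(5)])
y_demo = pd.Series(target, name="churn")

clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Double brackets return a one-row DataFrame (2D), which predict() accepts directly
row = X_demo.iloc[[9]]
pred = clf.predict(row)[0]

label = "There is no churn" if pred == 0 else "There is a churn"
print(pred, label)
```

With the notebook's objects this would read `model.predict(X.iloc[[9]])`, with no copying of values through a spreadsheet.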




**The prediction is that there is no churn. Let us check the row with index 9 in the churn column to confirm the prediction.**

In [27]:
y.iloc[8:11]

8     0
9     0
10    0
Name: churn, dtype: int64
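
One thing worth checking when testing single rows: a correct prediction only says something about generalisation if the row was not part of the training sample. Since `train_test_split` preserves the DataFrame index, membership can be looked up directly. The sketch below demonstrates this on a toy frame; all names in it are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 20 rows, indexed 0..19 like the notebook's default index
X_demo = pd.DataFrame({"feature": np.arange(20, dtype=float)})
y_demo = pd.Series(np.arange(20) % 2, name="churn")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# The original index labels survive the split, so we can look a row up
row_label = 9
in_train = row_label in X_tr.index
in_test = row_label in X_te.index
print(f"row {row_label}: in train={in_train}, in test={in_test}")
```

In the notebook, the equivalent check would be `9 in X_train.index`.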

## **Voilà!**
**Our prediction is correct.**

The churn status for the row with index 9 is 0, which means there is no churn.



---



### Prediction 2


Let's try another row, say the row with index 2961.

In [29]:
X.iloc[2960:2963]

Unnamed: 0,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,has_gas,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
2960,4.202161,0.0,2.973128,3.235198,0.0,2.117139,0.116509,0.101397,40.606701,0,...,7,2,0,0,1,0,0,0,0,1
2961,4.594503,0.0,3.526856,3.393561,0.0,2.151523,0.115237,0.100123,40.606701,0,...,25,1,0,0,1,0,0,0,0,1
2962,3.710963,0.0,0.0,2.753031,0.0,1.878752,0.16464,0.087382,44.311378,0,...,6,4,1,0,0,0,0,0,0,1


In [30]:
input_data = (4.594503044,0,3.526855987,3.393561165,0,2.151523068,0.115237,0.100123,40.606701,0,2.333245699,3.24,3.24,1,279.39,17.321,1.63E-05,3.48E-06,1.52E-06,0.004343068,0.001563573,0.000694959,0.004359408,0.001567049,0.000696481,1.02E-05,3.46E-06,2.52E-06,0.007077419,0.002548121,0.001132637,0.007087623,0.002551584,0.001135153,-0.007016,0.1629156,0.020016364,0.030025909,0.050042273,16.2382394,8.119117298,24.35735669,0.022138,0.031942,0.05408,16.29155496,8.14577508,24.43733004,6,65,10,25,1,0,0,1,0,0,0,0,1
)

# Convert the input data to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape to (1, n_features) because we are predicting for a single instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

prediction = model.predict(input_data_reshaped)
print(prediction)

if prediction[0] == 0:
    print("There is no churn")
else:
    print("There is a churn")

[1]
There is a churn




**Again, let's confirm against the churn column that our prediction is correct.**

In [31]:
y.iloc[2961]

1

## **And voilà!**
Our prediction is correct again!