<p style="text-align:center;">
<img src="https://noodle.digitalfutures.com/studentuploads/Data_Cygnets_logo.png" alt="Data Cygnets logo" width="150" height="150" />
</p>

# YDF on Swan Teleco data

Using Gradient Boosted Trees 🌲 and Random Forest 🌲🌲 algorithms from YDF
### by Data Cygnets
🦢 Jamie M   
🦢 Muqadas   
🦢 Sennan   
🦢 Maarja

### Data load and prep 🗂️

In [None]:
# install YDF - browser only currently!
!pip install ydf -U

In [None]:
# import required modules
import ydf
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [None]:
# load in the data
df = pd.read_excel("1 - Project Data.xlsx")

In [None]:
# check the first couple of lines
df.head()

In [None]:
# check all columns names
df.columns

In [None]:
# drop columns we don't need - CustomerID has been recommended for dropping by the YDF docs
df.drop(columns = ['Count', 'Zip Code','CustomerID', 'Country', 'State', 'Lat Long', 'Latitude', 'Longitude', 'Churn Label', 'Total Charges', 'Churn Reason'], inplace = True)

See the YDF docs here: [How to improve models](https://ydf-legacy.readthedocs.io/improve_model.html)

In [None]:
# check columns to verify that columns have been dropped
df.columns

In [None]:
# create features and target
features = list(df.columns)
features.remove('Churn Value')

y = df['Churn Value'] # Target
X = df[features] # Features

In [None]:
# do a stratified train-test split, then bring back the target
# stratify=y keeps the churn / no-churn ratio the same in Train and Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1123, stratify=y
)

# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning
X_train = X_train.assign(**{'Churn Value': y_train})  # add the label/target back to X_train
X_test = X_test.assign(**{'Churn Value': y_test})     # add the label/target back to X_test
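
As a quick sanity check on the stratified split above, the share of each class can be compared between the two sets. The helper below is a minimal sketch in plain Python; `class_share` is a hypothetical name, not part of pandas, scikit-learn or YDF, and the toy labels are made up for illustration.

```python
from collections import Counter


def class_share(labels):
    """Return the fraction of rows belonging to each class label."""
    labels = list(labels)
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# toy labels for illustration only (1 = churned, 0 = stayed)
toy_train = [0, 0, 0, 1, 0, 0, 0, 1]
toy_test = [0, 0, 0, 1]

print(class_share(toy_train))  # share of each class in the toy training labels
print(class_share(toy_test))   # share of each class in the toy test labels
```

In this notebook the same call would be `class_share(y_train)` and `class_share(y_test)`; with `stratify=y` the two results should be nearly identical.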

In [None]:
# check and verify that training data shapes match
X_train.shape[0] == y_train.shape[0]

## Modelling using Gradient Boosted Trees (GBT) 🌲




### Create and train the model 🚂

In [None]:
# create a model
model_gbt = ydf.GradientBoostedTreesLearner(label='Churn Value', num_trees = 500, use_hessian_gain=True) # default num_trees is 300

# train the model
model_gbt = model_gbt.train(X_train)

### Evaluate the model

In [None]:
# evaluate on train
model_gbt.evaluate(X_train)

In [None]:
# evaluate on test
model_gbt.evaluate(X_test)

In [None]:
# plot the tree
model_gbt.plot_tree()

### Model summary

In [None]:
# comprehensive summary
model_gbt.describe()

### Optimise and re-train the model 🚅


In [None]:
# model optimisation: random search over hyperparameters, 100 trials
tuner = ydf.RandomSearchTuner(num_trials=100)
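
For intuition, random search samples hyperparameter combinations at random and keeps the best-scoring one. The sketch below is a toy, pure-Python illustration of that idea, not YDF's implementation; `random_search`, `toy_score` and the search space values are all hypothetical.

```python
import random


def random_search(score_fn, search_space, num_trials, seed=0):
    """Sample random hyperparameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        # draw one value per hyperparameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# hypothetical search space, for illustration only
toy_space = {"max_depth": [3, 4, 6, 8], "shrinkage": [0.02, 0.05, 0.1]}


def toy_score(params):
    # made-up score that peaks at max_depth=4, shrinkage=0.05 by construction;
    # in practice this would be a validation metric from a trained model
    return -abs(params["max_depth"] - 4) - abs(params["shrinkage"] - 0.05)


best_params, best_score = random_search(toy_score, toy_space, num_trials=100)
print(best_params, best_score)
```

More trials cover more of the search space, at the cost of training one model per trial.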

In [None]:
# fit the new model
new_model_gbt = ydf.GradientBoostedTreesLearner(tuner=tuner, label='Churn Value', num_trees = 100, use_hessian_gain=True)
new_model_gbt = new_model_gbt.train(X_train)

In [None]:
# evaluate improved model on Train
new_model_gbt.evaluate(X_train)

In [None]:
# evaluate improved model on Test
new_model_gbt.evaluate(X_test)

## Modelling using Random Forests (RF) 🌲🌲

### Create and train the model 🚂




In [None]:
# create a model - use the Random Forest learner here, not the GBT learner
model_rf = ydf.RandomForestLearner(label='Churn Value', num_trees = 500)

# train the model
model_rf = model_rf.train(X_train)

### Evaluate the model

In [None]:
# evaluate on train
model_rf.evaluate(X_train)

In [None]:
# evaluate on test
model_rf.evaluate(X_test)

In [None]:
# plot the tree
model_rf.plot_tree()

### Model summary

In [None]:
# comprehensive summary
model_rf.describe()

### Optimise and re-train the model 🚅

In [None]:
# model optimisation: random search over hyperparameters, 100 trials
tuner = ydf.RandomSearchTuner(num_trials=100)

In [None]:
# fit the new model - Random Forest learner with the tuner attached
new_model_rf = ydf.RandomForestLearner(tuner=tuner, label='Churn Value', num_trees = 500)
new_model_rf = new_model_rf.train(X_train)

In [None]:
# evaluate "improved" model on Train
new_model_rf.evaluate(X_train)

In [None]:
# evaluate "improved" model on Test
new_model_rf.evaluate(X_test)

# Conclusion:

YDF did not produce a viable model here, despite trying different hyperparameters, tuner optimisation and feature selection 😞    
The model overfits: it scores about 10% better on the training data than on the test data.
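
One way to make the overfitting figure concrete is to compare the same metric on the training and test sets. The sketch below assumes accuracy is the metric; `overfit_gap` is a hypothetical helper and the numbers are placeholders, not results from this notebook.

```python
def overfit_gap(train_metric, test_metric):
    """Return the absolute and relative drop from train to test performance."""
    absolute = train_metric - test_metric
    relative = absolute / train_metric if train_metric else float("nan")
    return absolute, relative


# placeholder values for illustration only, not results from this notebook
absolute, relative = overfit_gap(train_metric=0.90, test_metric=0.80)
print(f"absolute gap: {absolute:.2f}, relative gap: {relative:.1%}")
```

In this notebook the inputs could be read from the objects returned by `model_gbt.evaluate(X_train)` and `model_gbt.evaluate(X_test)`, assuming the evaluation object exposes the metric as an attribute such as `accuracy`.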