<a href="https://colab.research.google.com/github/Mechatronian/MachineLearninig/blob/master/Predictive_Analytics_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="center">
<img src="https://github.com/Darek-github/CoE-Academy-Data-preparation/blob/master/logo_DP.png?raw=True" alt = "DataProc icon" width="50%">
</p>
<br><br>

## **Predictive Analytics**

Welcome to this hands-on training for aspiring machine learning practitioners. Using Python with `pandas` and `scikit-learn`, we'll learn how to process data for machine learning and create predictions on a telecom case study. In this session you will learn:
- How to apply data preprocessing for machine learning including feature engineering.
- The different types of machine learning and when to use them.
- How to apply supervised machine learning models to generate predictions.

## **The Dataset**

The dataset to be used in this webinar is a CSV file named `telco.csv`, which contains data on telecom customers churning and some of their key behaviors. It contains the following columns:

**Features**:

- `customerID`: Unique identifier of a customer.
- `gender`: Gender of customer.
- `SeniorCitizen`: Binary variable indicating if customer is senior citizen.
- `Partner`: Binary variable if customer has a partner.
- `Dependents`: Binary variable if customer has dependent.
- `tenure`: Number of weeks as a customer.
- `PhoneService`: Whether customer has phone service.
- `MultipleLines`: Whether customer has multiple lines.
- `InternetService`: What type of internet service customer has (`"DSL"`, `"Fiber optic"`, `"No"`).
- `OnlineSecurity`: Whether customer has online security service.
- `OnlineBackup`: Whether customer has online backup service.
- `DeviceProtection`: Whether customer has device protection service.
- `TechSupport`: Whether customer has tech support service.
- `StreamingTV`: Whether customer has TV streaming service.
- `StreamingMovies`: Whether customer has movies streaming service.
- `Contract`: Customer Contract Type (`'Month-to-month'`, `'One year'`, `'Two year'`).
- `PaperlessBilling`: Whether paperless billing is enabled.
- `PaymentMethod`: Payment method.
- `MonthlyCharges`: Amount of monthly charges in $.
- `TotalCharges`: Amount of total charges so far.

**Target Variable**:

- `Churn`: Whether customer `'Stayed'` or `'Churned'`.


In [None]:
# Import libraries
import pandas as pd             #data handling, manipulation, and analysis
import matplotlib.pyplot as plt #data visualisation
import seaborn as sns           #data visualisation
import numpy as np              #for any work with matrices, especially math operations
import sklearn                  #machine learning

*Tip: To run a block of code, use the mouse or press CTRL+Enter.*

# **Data Exploration**

Sample of source csv data file
```
,customerID,gender,SeniorCitizen,Partner ...
0,7590-VHVEG,Female,No,Yes ...
1,5575-GNVDE,Male,No,No ...
2,3668-QPYBK,Male,No ...
3,7795-CFOCW,Male,No,No ...
...
```

In [None]:
# Read in dataset
url = 'https://raw.githubusercontent.com/Darek-github/CoE-Academy-Data-preparation/master/telcom.csv'
telco = pd.read_csv(url, index_col = "Unnamed: 0")
pd.set_option('display.max_columns', None) # Display all columns

In [None]:
# Check type of object telco


In [None]:
# Print shape of dataset


**Observations:**


In [None]:
# Print header


**In general there are broadly 3 types of data:**
- Continuous data _(e.g. age)_.
- Categorical data _(e.g. marriage status)_.
- Other data _(e.g. images, tweets, etc.)_.

**Terminology:**
- **Column:** Variable = Feature -> Dimension -> Customer features
- **Row:** Case = Point -> Customer record





**Observations:**


In [None]:
# Print names of variables


In [None]:
# Show statistics of numeric variables


In [None]:
# Show statistics of categoric variables


**Observations:**


In [None]:
# Select 1 variable (Series) - method 1


In [None]:
# Select 1 variable (Series) - method 2


In [None]:
# Select multiple variables


In [None]:
# Show statistics of numerical variable -> Min, max, mean, median, std, ...


In [None]:
# Categorical variable -> Count unique values


In [None]:
# Categorical variable -> Cross-tabulation (or "crosstab" for short)


**Note**: Try the parameters `margins=True` and `normalize=True`.

## **Data Filtering**

In [None]:
#Single filter (checking "Zero" values)


**Observations**:


In [None]:
# Multiple filters


In [None]:
# Print variable types


**Observations:**

In [None]:
# Take a look at unique values in telco


**Observations:**

In [None]:
# Unique values of internet service


**Observations:** 

## <center> Summary
- The data table is a Pandas **dataframe**
- The data column is a Pandas **series**
- We can obtain information of the **dataset**:
  - Top 10 rows: `df.head(10)`
  - Number of rows & variables: `df.shape`
  - Names of variables: `df.columns`
  - Types of variables: `df.info()`
  - Stats of numerical variables: `df.describe()`
  - Stats of categorical variables: `df.describe(include=['object', 'bool'])`
- Information of **numerical variables**:
  - Minimum `df.num_var.min()`
  - Maximum `df.num_var.max()`
  - Mean `df.num_var.mean()`
  - Median `df.num_var.median()`
  - Std `df.num_var.std()`
- Information of **categorical variables**:
  - Unique values (count): `df.cat_var.value_counts()`
  - Unique values (percent): `df.cat_var.value_counts(normalize=True)`
  - 2 categorical vars (count): `pd.crosstab(df.cat_var1, df.cat_var2, margins=True)`
  - 2 categorical vars (percent): `pd.crosstab(df.cat_var1, df.cat_var2, margins=True, normalize=True)`
- **Filtering rows**:
  - One filter: `df[ condition ]`
  - Multiple filters: `df[ (condition1) & (condition2) | (condition3)]`
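
The calls summarized above can be tried on a small made-up table before running them on `telco`. This is only a sketch: `toy_df` and its values are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy data standing in for `telco`; all values are made up
toy_df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year", "Month-to-month"],
    "Churn": ["Churned", "Stayed", "Churned", "Stayed", "Stayed"],
    "MonthlyCharges": [70.0, 55.5, 89.9, 20.0, 45.0],
})

print(toy_df.shape)                                    # (rows, columns)
print(toy_df["MonthlyCharges"].median())               # one numeric statistic
print(toy_df["Contract"].value_counts(normalize=True)) # share of each category

# Cross-tabulation of two categorical variables, with row and column totals
toy_table = pd.crosstab(toy_df["Contract"], toy_df["Churn"], margins=True)
print(toy_table)

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
toy_filtered = toy_df[(toy_df["Churn"] == "Churned") & (toy_df["MonthlyCharges"] > 80)]
print(toy_filtered)
```

The parentheses around each condition matter because `&` and `|` bind more tightly than comparison operators in Python.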

---
<center><h1> Q&A 1</h1> </center>

---

# **Data Cleaning**

### **Task 1: Dropping** `customerID` **column**


In [None]:
# Drop customer ID column


### **Task 2: Converting** `TotalCharges` **column**


In [None]:
# Convert TotalCharges to numeric


In [None]:
# Print info to confirm conversion


### **Task 3: Handling Missing Data**


In [None]:
# Print number of missing values


**Observations:** 

In [None]:
# Get distribution of TotalCharges


In [None]:
# Visualize distribution of TotalCharges


#### **Optional: Replace missing values**<BR>
```
df.loc[row condition, column label]
```

See also [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)
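
As a rough illustration of the `df.loc[row condition, column label]` pattern, the sketch below fills missing values with the column median. The toy column and the choice of the median are assumptions made for this example; whether imputing or dropping rows is better for `telco` is left to the exercise.

```python
import pandas as pd

# Toy column stored as text, with one blank entry (made-up values)
toy_charges = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})

# Strings that cannot be parsed as numbers become NaN
toy_charges["TotalCharges"] = pd.to_numeric(toy_charges["TotalCharges"], errors="coerce")
n_missing = int(toy_charges["TotalCharges"].isna().sum())
print("Missing values:", n_missing)

# Option 1: select the rows with a condition and the column by label
median_value = toy_charges["TotalCharges"].median()
toy_charges.loc[toy_charges["TotalCharges"].isna(), "TotalCharges"] = median_value

# Option 2 (same result):
# toy_charges["TotalCharges"] = toy_charges["TotalCharges"].fillna(median_value)
print(toy_charges)
```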

#### **Drop rows with missing values**

In [None]:
# Remove missing values

# Reset index


In [None]:
# Make sure no more NaN 


### **Task 4: Collapse** values of `InternetService` **column**

```
df['col_A'] = df['col_A'].replace({old_value : new_value})
```
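
A minimal sketch of the `replace` pattern on a made-up column (the values below are invented, not read from `telco`):

```python
import pandas as pd

# Toy column with two spellings of the same category
toy_service = pd.DataFrame({"InternetService": ["DSL", "dsl", "Fiber optic", "No", "dsl"]})

# Map the old value to the new one; values not in the dictionary stay unchanged
toy_service["InternetService"] = toy_service["InternetService"].replace({"dsl": "DSL"})

print(toy_service["InternetService"].unique())
```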


In [None]:
# Collapse 'dsl' into 'DSL'

# Confirm changes


---
<center><h1> Q&A 2</h1> </center>

---

# **Exploratory Data Analysis**

Let's visualize how continuous and categorical data in `telco` behave with `Churn`.

In [None]:
# Grab a look at the header



> ```
> # Create a list of the values from 0 to 9
> my_list = [i for i in range(0,10)]
>
> print(my_list)
>
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
> ```

> ```
> # Create a list of the values from 0 to 9 that are greater than 3
> my_list = [i for i in range(0,10) if i > 3]
>
> print(my_list)
>
> [4, 5, 6, 7, 8, 9]
> ```
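
The same list-comprehension pattern can be used to split column names by type. The sketch below uses a made-up table and `pd.api.types.is_numeric_dtype`; the names `toy_cols`, `toy_categorical` and `toy_numeric` are hypothetical and chosen so they do not clash with the exercise variables.

```python
import pandas as pd

# Toy table with two text columns and two numeric columns (made-up values)
toy_cols = pd.DataFrame({
    "gender": ["Female", "Male"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "Churn": ["Stayed", "Churned"],
})

# All feature names, i.e. every column except the target
toy_features = [col for col in toy_cols.columns if col != "Churn"]

# Keep a name only if the condition after `if` holds
toy_numeric = [col for col in toy_features if pd.api.types.is_numeric_dtype(toy_cols[col])]
toy_categorical = [col for col in toy_features if col not in toy_numeric]

print(toy_categorical)
print(toy_numeric)
```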

In [None]:
# Get dtypes of column


In [None]:
# Get all features


# Get all categorical features


# Get all numeric columns


In [None]:
# Print them out and make sure


### **Target variable**



To visualize the count of different categorical values by Churn, we can use the `sns.countplot(x, hue, data)` function which takes in:

    x: The column name being counted.
    hue: The column name used for grouping the data.
    data: The DataFrame being visualized.



In [None]:
# Plot the percentage values of the Churn attribute


We have a **binary classification** problem with a slightly unbalanced target:
```
print(telco['Churn'].value_counts(normalize=True) * 100)
```


In [None]:
# Let us verify how the distribution of Churn is shaped


### **Visualizing categorical features**


> #### **Data Visualization Refresher** 
> 
> A `matplotlib` visualization is made of 3 components:
> - A **figure** which houses one or many subplots (or axes).
> - The **axes** objects ~ the subplots within the figure.
> - The plot inside each subplot or axes.
>
> We can generate a figure with subplots using the following function:
>
> `fig, axes = plt.subplots(nrows, ncols)`

To visualize the count of different categorical values by `Churn`, we can again use the `sns.countplot(x, hue, data, ax)` function, which takes in:
- `x`: The column name being counted.
- `hue`: The column name used for grouping the data.
- `data`: The DataFrame being visualized.
- **`ax`: Which axes in the figure to assign the plot.**
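
To see how the figure, the axes and `ax=` fit together, here is a small sketch on made-up data. `toy_plot_df` and `toy_plot_columns` are hypothetical names; the real exercise iterates over the categorical columns of `telco` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data with two categorical features and the target (made-up values)
toy_plot_df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "Churn": ["Churned", "Stayed", "Churned", "Stayed"],
})
toy_plot_columns = ["Contract", "Partner"]

# One figure holding a 1 x 2 grid of axes
toy_fig, toy_axes = plt.subplots(1, 2, figsize=(10, 4))

# Draw one countplot per axes, grouped by the target
for ax, column in zip(toy_axes.flatten(), toy_plot_columns):
    sns.countplot(x=column, hue="Churn", data=toy_plot_df, ax=ax)
    ax.set_title(column)
```

`zip` stops at the shorter of the two sequences, so a grid with more axes than columns simply leaves the remaining subplots empty.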

In [None]:
# Setting aesthetics for better viewing
plt.rcParams["axes.labelsize"] = 5
sns.set(font_scale=5) 

# Create figure and axes
fig, axes = plt.subplots(4, 4, figsize = (100, 100))

# Iterate over each axes, and plot a countplot with categorical columns
for ax, column in zip(axes.flatten(), categorical):
    
    # Create countplot
    
    
    # Set the title of each subplot
    ax.set_title(column)

    # Improve legends
    handles, labels = ax.get_legend_handles_labels()
    fig.legend(handles, labels, loc='right', fontsize = 48)
    ax.get_legend().remove()

**Observations:**


### **Visualizing continuous features**


It can be visualized as such:

- `sns.boxplot(x=, y=, data=)`
  - `x`: Categorical variable we want to group our data by.
  - `y`: Numeric variable being observed by group.
  - `data`: The DataFrame being used.
  

In [None]:
# Setting aesthetics for better viewing
plt.rcParams["axes.labelsize"] = 1
sns.set(font_scale=1) 
 
# Create figure and axes
fig, axes = plt.subplots(1, 3, figsize = (16, 6))

# Iterate over each axes, and plot a boxplot with numeric columns
for ax, column in zip(axes.flatten(), numeric):
    
    # Create a boxplot
    
    
    # Set title
    ax.set_title(column)

**Observations:**


The probability density distribution can be estimated using the seaborn `kdeplot` function.

It can be visualized as such:

- `sns.kdeplot(data=, color=, label=, ax=)`
  - `data`: Input data.
  - `color`: Color of plot.
  - `label`: Name of label.
  - `ax`: Axes to plot on.

In [None]:
# Setting aesthetics for better viewing
plt.rcParams["axes.labelsize"] = 1
sns.set(font_scale=1) 

# Create figure and axes
fig, axes = plt.subplots(3, 1, figsize = (12, 12))

# Iterate over each axes, and plot a density with numeric columns
for ax, column in zip(axes.flatten(), numeric):
    
    # Create a density plot
    ax0 = sns.kdeplot(          , color= 'navy', label= 'Stayed',ax = ax)
    ax1 = sns.kdeplot(          , color= 'orange', label= 'Churned',ax = ax)
    # Set title
    ax.set_title(column)

**Observations:** 



### **Handling Outliers**

In [None]:
# Handle Outliers for TotalCharges

# Let's filter those outliers


## **Correlation**

In [None]:
# Plot correlation of variables


## **Bonus - Pandas Profiling**

In [None]:
# Install the latest version of pandas-profiling
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [None]:
#import libraries
from pandas_profiling import ProfileReport

In [None]:
# Create report


In [None]:
# Display report in html format


In [None]:
# Save report


---
<center><h1> Q&A 3</h1> </center>

---

# **Feature Engineering**


## **Continuous/numeric data**

#### **Binning**



In [None]:
# Bin the "tenure" values into 6 groups
new_tenure = [] #List

for i in telco["tenure"]:
  if i <= 12:
    
  elif i <= 24:
    
  elif i <= 36:
    
  elif i <= 48:
    
  elif i <= 60:
    
  else:
    

telco["tenure"] = new_tenure

In [None]:
#Print the head of dataset


We can get the same effect with the `cut` method:
```
telco["tenure"] = pd.cut(x=telco["tenure"], bins=[-np.inf, 12, 24, 36, 48, 60, np.inf], labels=[0, 1, 2, 3, 4, 5])
```
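
`pd.cut` needs one more bin edge than labels, so six groups require seven edges. The sketch below checks on a few made-up tenure values that `pd.cut` with open-ended outer edges gives the same groups as threshold comparisons like those in the loop; the group numbers 0 to 5 are an assumption taken from the `labels` argument.

```python
import numpy as np
import pandas as pd

# Toy tenure values, including values exactly on the thresholds
toy_tenure = pd.Series([1, 12, 13, 24, 30, 48, 60, 61, 72])
thresholds = [12, 24, 36, 48, 60]

# Loop-style binning: the group number is the count of thresholds exceeded
loop_groups = [sum(value > edge for edge in thresholds) for value in toy_tenure]

# pd.cut uses right-closed intervals (a, b], which matches the `<=` comparisons
cut_groups = pd.cut(
    toy_tenure,
    bins=[-np.inf, 12, 24, 36, 48, 60, np.inf],
    labels=[0, 1, 2, 3, 4, 5],
)

print(loop_groups)
print(cut_groups.astype(int).tolist())
```

Note that `pd.cut` returns a categorical column; `astype(int)` converts it back to plain integers if a model needs numeric input.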

#### **Standardization**

We can do this easily in `sklearn` with `StandardScaler()`. Many operations in `sklearn` follow the `.fit()` $\rightarrow$ `.transform()` paradigm, and `StandardScaler()` is no different:

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on data (my_column must be a list of column names so that df[my_column] is 2-D)
scaler.fit(df[my_column])

# Transformed
column_scaled = scaler.transform(df[my_column])

# Replace column
df[my_column] = column_scaled
```
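
A short sketch of the same paradigm with a train/test split, on made-up numbers. The scaler is fitted on the training rows only and then applied to both sets, so no information from the test rows leaks into the scaling; the names `toy_train`, `toy_test` and `toy_numeric_cols` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training and test data (made-up values)
toy_train = pd.DataFrame({"MonthlyCharges": [20.0, 40.0, 60.0, 80.0],
                          "TotalCharges": [100.0, 400.0, 900.0, 1600.0]})
toy_test = pd.DataFrame({"MonthlyCharges": [50.0], "TotalCharges": [650.0]})
toy_numeric_cols = ["MonthlyCharges", "TotalCharges"]

toy_scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
toy_scaler.fit(toy_train[toy_numeric_cols])

# Apply the same transformation to both sets
toy_train[toy_numeric_cols] = toy_scaler.transform(toy_train[toy_numeric_cols])
toy_test[toy_numeric_cols] = toy_scaler.transform(toy_test[toy_numeric_cols])

print(toy_train.mean().round(4))
print(toy_test)
```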

In [None]:
# Split data between X and label


In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split data into train test splits


In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Initialize a scaler
scaler = 

# Fit on training data


# Transform training and test data
train_numeric_transform = 
test_numeric_transform = 

In [None]:
# Replace columns in training and testing data accordingly
train_X[numeric] = 
test_X[numeric] = 

## **Categorical data**

### **Features Encoding**
Using dummy encoding in `pandas` is actually very easy - we can use the `pd.get_dummies()` function which takes:

- The DataFrame being converted.
- `columns`: The name of the categorical columns to be converted.
- `drop_first`: Boolean to indicate onehot encoding (`False`) or dummy encoding (`True`).
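
The effect of `drop_first` is easiest to see on a tiny made-up table (the values below are invented for illustration):

```python
import pandas as pd

# Toy table with one categorical and one numeric column
toy_contract = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 14, 40],
})

# One-hot encoding: one indicator column per category
toy_one_hot = pd.get_dummies(toy_contract, columns=["Contract"], drop_first=False)

# Dummy encoding: the first category is dropped and acts as the reference level
toy_dummy = pd.get_dummies(toy_contract, columns=["Contract"], drop_first=True)

print(toy_one_hot.columns.tolist())
print(toy_dummy.columns.tolist())
```

With dummy encoding, a row with all indicator columns equal to zero belongs to the dropped category.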




In [None]:
# One hot encode cat variables
train_X = 
test_X = 

In [None]:
# Re-add Churn to the training and testing data
train_X['Churn'] = train_Y
test_X['Churn'] = test_Y

In [None]:
# Check out header again


### **Creating New Features**



In [None]:
# Service columns
service_columns = ['OnlineSecurity_Yes', 'OnlineBackup_Yes', 'DeviceProtection_Yes', 'TechSupport_Yes']

# Create in_ecosystem column
train_X['in_ecosystem'] = 

# Visualize churn by number of services subscribed
sns.countplot



In [None]:
# Create feature that is 1 if 2 or more services subscribed, 0 otherwise
train_X['in_ecosystem'] = 

# Apply the same on test_X
test_X['in_ecosystem'] = 
test_X['in_ecosystem'] = 

# Visualize churn by number of services subscribed
sns.countplot

**Observations:**

In [None]:
# Print crosstab for new feature
pd.crosstab

**Observations:**

In [None]:
# Drop target variable from training and testing data again 


In [None]:
# Print shape of training and testing data


## **Summary** 

Other feature engineering ideas:
*   `Responsibility_Score`: Average of Partner, Dependents and SeniorCitizen, a score for the amount of life-related responsibility a customer has.
*   `Phone_Reliance`: Average of PhoneService and MultipleLines, a score for how much a customer relies on phone service.
*   `Support`: Average of OnlineSecurity, OnlineBackup, DeviceProtection and TechSupport, a score for the amount of support services a customer has.
*   `Online_Services`: Average of InternetService, StreamingTV and StreamingMovies, a score for the amount of online services a customer has.
*   `Duration`: Average of tenure and Contract, an aggregate score of time-related factors.

```
Responsibility_Score = (Partner + Dependents + SeniorCitizen)/3
Phone_Reliance = (PhoneService + MultipleLines)/2
Support = (OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport)/4
Online_Services = (InternetService + StreamingTV + StreamingMovies)/3
Duration = (tenure + Contract)/2
```
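
As a sketch of how two of these scores could be computed, assume the underlying columns have already been encoded as 0/1. That encoding, the toy values and the name `toy_encoded` are assumptions made for this example.

```python
import pandas as pd

# Toy data with binary columns already encoded as 0/1 (made-up values)
toy_encoded = pd.DataFrame({
    "Partner": [1, 0, 1],
    "Dependents": [1, 0, 0],
    "SeniorCitizen": [0, 0, 1],
    "PhoneService": [1, 1, 0],
    "MultipleLines": [1, 0, 0],
})

# Row-wise average of the selected columns
toy_encoded["Responsibility_Score"] = toy_encoded[["Partner", "Dependents", "SeniorCitizen"]].mean(axis=1)
toy_encoded["Phone_Reliance"] = toy_encoded[["PhoneService", "MultipleLines"]].mean(axis=1)

print(toy_encoded[["Responsibility_Score", "Phone_Reliance"]])
```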



---
<center><h1> Q&A 4</h1> </center>

---

# **Modelling**

**Accuracy** has the following definition:

<br>


$$\large{accuracy = \frac{Number \space of \space correct \space predictions}{Total \space number \space of \space predictions}}$$



In [None]:
# Check the distribution of labels


The **"baseline model"** always predicts the most frequent class, here `"Stayed"`.
<BR>

$$\large{baseline \space accuracy = \frac{\# \space of \space customers \space who \space actually \space "Stayed"}{total \space number \space of \space predictions}}$$

In [None]:
# Find the baseline model


In this particular instance, the **baseline model** (always predicting `"Stayed"`) reaches an accuracy of 73.4%, so any meaningful model has to beat that score.
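
The idea of a majority-class baseline can be sketched on made-up labels (7 of 10 toy customers stayed; these numbers are not the `telco` distribution):

```python
import pandas as pd

# Toy labels: 7 customers stayed, 3 churned
toy_labels = pd.Series(["Stayed"] * 7 + ["Churned"] * 3)

# The baseline always predicts the most frequent class
majority_class = toy_labels.value_counts().idxmax()
baseline_predictions = pd.Series([majority_class] * len(toy_labels))

# Accuracy = share of predictions that match the true label
baseline_accuracy = (baseline_predictions == toy_labels).mean()
print(majority_class, baseline_accuracy)
```

The baseline accuracy equals the share of the majority class, which is why `value_counts(normalize=True)` is enough to find it.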

## **Classification**


## **Using K-Nearest Neighbors to Generate Predictions**

The `KNeighborsClassifier()` needs to be instantiated and follows the `.fit()` $\rightarrow$ `.predict()` paradigm as such:

```
# Import algorithm and accuracy metric
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Instantiate it
knn = KNeighborsClassifier(n_neighbors = k)

# Fit on training data
knn.fit(train_X, train_Y)

# Create predictions
predictions_Y = knn.predict(test_X)

# Calculate accuracy score on testing data
test_accuracy = accuracy_score(test_Y, predictions_Y)
```

In [None]:
# Import K-Nearest Neighbor Classifier and accuracy_score
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score,classification_report

# Instantiate K Nearest Neighbors with 6 neighbors
knn = 

# Fit on training data
knn.fit()

# Create Predictions
pred_test_Y = 
pred_train_Y = 

# Calculate accuracy score on testing data
test_accuracy = 
train_accuracy = 

# Print test accuracy score rounded to 4 decimals
print('Test accuracy:',     )
print('Train accuracy:',    )

## **Model Performance Metrics**

### **Confusion Matrix**
<p align="center">
<img src="https://raw.githubusercontent.com/Darek-github/CoE-Academy-Data-preparation/master/ConfusionMatrxi.jpg" width="60%">
</p>

<br>
To create the Confusion Matrix using pandas, you’ll need to apply the `pd.crosstab` as follows:

```
pd.crosstab(y_true, y_pred)
```
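
A small sketch with made-up labels shows how to read the resulting table. Naming the two Series gives the rows and columns readable headers; `toy_true` and `toy_pred` are hypothetical.

```python
import pandas as pd

# Toy true labels and predictions (made-up values)
toy_true = pd.Series(["Stayed", "Stayed", "Stayed", "Stayed", "Churned", "Churned", "Churned", "Stayed"], name="Actual")
toy_pred = pd.Series(["Stayed", "Stayed", "Stayed", "Churned", "Churned", "Churned", "Stayed", "Stayed"], name="Predicted")

# Rows are the actual classes, columns are the predicted classes
toy_confusion = pd.crosstab(toy_true, toy_pred)
print(toy_confusion)
```

The diagonal cells count correct predictions, and the off-diagonal cells count the two kinds of errors.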

In [None]:
# Print confusion matrix of kNN
confusion_matrix = 
print( )

### **Other metrics**
The `classification_report` function builds a text report showing the main classification metrics. 
```
classification_report(y_true, y_pred, target_names=target_names)
```
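
As an illustration on made-up labels, the report can also be returned as a dictionary with `output_dict=True`, which makes single metrics easy to pick out:

```python
from sklearn.metrics import classification_report

# Toy true labels and predictions (made-up values)
toy_y_true = ["Stayed", "Stayed", "Stayed", "Stayed", "Churned", "Churned", "Churned", "Stayed"]
toy_y_pred = ["Stayed", "Stayed", "Stayed", "Churned", "Churned", "Churned", "Stayed", "Stayed"]

# Text report with precision, recall and F1 per class
print(classification_report(toy_y_true, toy_y_pred))

# Same metrics as a nested dictionary
toy_report = classification_report(toy_y_true, toy_y_pred, output_dict=True)
print(toy_report["Churned"]["precision"], toy_report["Churned"]["recall"])
```

Here 2 of the 3 predicted churners really churned (precision) and 2 of the 3 actual churners were found (recall).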

In [None]:
# Print other metrics
print(classification_report( ))

## **Using Decision Trees and Random Forests to Generate Predictions**


The `DecisionTreeClassifier()` and `RandomForestClassifier()` objects also follow the `.fit()` $\rightarrow$ `.predict()` paradigm.

In [None]:
# Import relevant packages
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,classification_report

# Instantiate decision tree and random forest classifiers
dec_tree = 
rand_forest = 

# Fit decision tree and random forest on data



# Create Predictions on test and train data using decision tree
pred_test_Y_tree = 
pred_train_Y_tree = 

# Create Predictions on test and train data using random forest
pred_test_Y_forest = 
pred_train_Y_forest = 

# Calculate test and train accuracy score on decision tree
test_accuracy_tree = accuracy_score()
train_accuracy_tree = accuracy_score()

# Calculate test and train accuracy score on random forest
test_accuracy_forest = accuracy_score()
train_accuracy_forest = accuracy_score()

# Print test accuracy score rounded to 4 decimals
print('Tree test accuracy:', )
print('Tree train accuracy:', )

# Print test accuracy score rounded to 4 decimals
print('\nForest test accuracy:', )
print('Forest train accuracy:', )

In [None]:
# Print confusion matrix of Decision Tree
confusion_matrix = 
print( )

### **Feature importance and Feature selection**




In [None]:
# Access the feature_importances_ attribute for DT


In [None]:
# Plot features importances
imp = pd.Series(data=dec_tree.feature_importances_, index=train_X.columns).sort_values(ascending=False)
plt.figure(figsize=(10,12))
plt.title("Feature importance")
ax = sns.barplot(y=imp.index, x=imp.values, palette="Blues_d", orient='h')

In [None]:
# Access the feature_importances_ attribute for RF
rand_forest.feature_importances_

In [None]:
# Plot features importances
imp = pd.Series(data=rand_forest.feature_importances_, index=train_X.columns).sort_values(ascending=False)
plt.figure(figsize=(10,12))
plt.title("Feature importance")
ax = sns.barplot(y=imp.index, x=imp.values, palette="Blues_d", orient='h')

**Observations:**

In [None]:
# Decision tree visualisation
from sklearn import tree
fig, axes = plt.subplots(nrows = 1,ncols = 1, dpi=150)
tree.plot_tree(dec_tree, max_depth=2, fontsize=5)
plt.show()

## **Overfitting, the bias-variance tradeoff and cross validation**
**Model Variance**
A model is said to have high variance if it creates an elaborate decision boundary around data points for different sets of training data. 

<ins> It can be diagnosed if **training accuracy** >>> **test accuracy**. </ins>

**Model Bias**
A model underfits the data, or is said to have high bias if the decision boundary does not fit the data - and generates non-accurate predictions on both training and testing data.

<ins> It can be diagnosed if both **training accuracy** and **test accuracy** are low. </ins>
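
The gap between training and test accuracy can be observed on synthetic data. The sketch below uses `make_classification` with noisy labels rather than `telco`, and compares a fully grown tree with a depth-limited one; the data, the depth of 3 and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels (noise)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# A fully grown tree can memorize the training data (high variance)
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_toy_train, y_toy_train)

# A depth-limited tree is less flexible
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy_train, y_toy_train)

for name, model in [("deep", deep_tree), ("shallow", shallow_tree)]:
    train_acc = model.score(X_toy_train, y_toy_train)
    test_acc = model.score(X_toy_test, y_toy_test)
    print(name, "train:", round(train_acc, 4), "test:", round(test_acc, 4))
```

A large drop from training to test accuracy for the deep tree points to overfitting; similar but low scores on both sets would point to underfitting.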


**Cross Validation**
Cross-validation can be done by using the `cross_val_score()` in `sklearn` - it takes in as arguments the following:

- The instantiated model in question.
- The training data and label.
- `cv`: The number of cross validation folds.
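
A minimal sketch of `cross_val_score` on synthetic data (the data, `max_depth=3` and `cv=5` are arbitrary choices for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the training set
X_cv_toy, y_cv_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_model = DecisionTreeClassifier(max_depth=3, random_state=0)

# One accuracy score per fold; the model is refitted for every fold
toy_cv_scores = cross_val_score(toy_model, X_cv_toy, y_cv_toy, cv=5)

print(toy_cv_scores)
print("Mean cross-val score:", round(np.mean(toy_cv_scores), 4))
```

`cross_val_score` works on clones of the model, so `toy_model` itself is still unfitted afterwards and must be fitted separately before calling `.predict()`.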


In [None]:
# Import relevant modules 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Instantiate decision tree
dec_tree = 

# Get cross validation scores
cv_scores = 

# Fit on training data and get predictions
dec_tree.fit( )
y_pred = 

# Print accuracy scores
print(cv_scores)
print("\nMean cross-val score:", round(np.mean(cv_scores), 4))
print("\nTest score:", round(accuracy_score(y_pred, test_Y), 4))

---
<center><h1> Q&A 5</h1> </center>

---

## **Hyperparameter Tuning and grid-search**


In [None]:
# Get all parameters of a decision tree
dec_tree = DecisionTreeClassifier()
dec_tree.get_params()

**Tuning maximum depth**

See the `sklearn` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

 Let's try a `max_depth` of 4.

In [None]:
# Import relevant modules
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score

# Instantiate a decision tree with max_depth = 4
dec_tree = 

# Get cross validation scores
cv_scores = 

# Fit on training data and get predictions
dec_tree.fit()
y_pred = 

# Print accuracy scores
print(cv_scores)
print("\nMean cross-val score:", round(np.mean(cv_scores), 4))
print("\nTest score:", round(accuracy_score(y_pred, test_Y), 4))

**Tuning maximum features**

See `sklearn` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
# Import relevant modules
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score

# Instantiate a decision tree with max_depth = 4 and max_features = 25
dec_tree = 

# Get cross validation scores
cv_scores = 

# Fit on training data and get predictions
dec_tree.fit()
y_pred = 

# Print accuracy scores
print(cv_scores)
print("\nMean cross-val score:", round(np.mean(cv_scores), 4))
print("\nTest score:", round(accuracy_score(y_pred, test_Y), 4))

**Using grid-search**

Grid-search can be done using the `GridSearchCV()` function - it takes in as arguments:

- The model being used.
- The possible parameters to test - inputted as a dictionary. 
- `cv`: The number of cross-validation folds.
- `verbose`: More detailed output if `2`.
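
A sketch of a grid search on synthetic data; the grid values and `cv=5` below are arbitrary and are not the values intended for the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared training and test sets
X_gs, y_gs = make_classification(n_samples=300, n_features=8, random_state=0)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.3, random_state=0
)

# Every combination of these values is evaluated with cross-validation
toy_grid = {"max_depth": [2, 4, 6], "max_features": [2, 4, 8]}

toy_search = GridSearchCV(DecisionTreeClassifier(random_state=0), toy_grid, cv=5)
toy_search.fit(X_gs_train, y_gs_train)

print("Best parameters:", toy_search.best_params_)
print("Best cross-val score:", round(toy_search.best_score_, 4))

# By default the best model is refitted on the whole training set
toy_test_accuracy = accuracy_score(y_gs_test, toy_search.predict(X_gs_test))
print("Test accuracy:", round(toy_test_accuracy, 4))
```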

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define parameter grid
params = {'max_depth': [       ],
          'max_features': [       ]}

# Instantiate a decision tree classifier 
dec_tree = 

# Instantiate a GridSearchCV classifier with 10 fold cross-validation
clf = 

# Fit clf on training data
clf

In [None]:
# Generate predictions and calculate accuracy score
y_pred = 
print('Best parameters: ', )
print('\n',round(accuracy_score(y_pred, test_Y), 4))

---
<center><h1> Q&A 6</h1> </center>

---


<center><h1>Homework</h1> </center>

Try to break the **80%** accuracy threshold on the test data.

*Tips:* <br>

- Use different models (logistic regression, SVM and more)
- Try hyperparameter-tuning these models, and make sure you read the sklearn documentation for each model.
- Investigate engineering new features for your model.

*Submission details:*<br>

Share with us a code snippet with your output.

