<a href="https://colab.research.google.com/github/LesCavesdAlbert/lescavesdalbert.github.io/blob/main/TP3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Session #3 : Different types of ML and Data, Feature Engineering

After completing our first three lessons, we have gained the foundational knowledge necessary to take the first steps in developing a machine learning project. We are now able to understand and manipulate data, perform essential preprocessing tasks, and apply basic techniques to prepare datasets for model training.

Throughout this learning journey, we have also deepened our understanding of the different types of machine learning, particularly supervised and unsupervised learning. We now recognize their distinct purposes and how they can be applied to different problems. Our objective here is also to combine these two approaches in a real-world business use case and explore how they complement each other. Doing so shows how using both techniques together can lead to more powerful and insightful solutions.

# **Objective : Churn Analysis/Prediction**
Your task is to analyze customer data and build predictive models to identify customers at risk of churning. You will explore both unsupervised and supervised learning techniques to gain insights and optimize predictions.

# **Instructions**
**Data Import & Initial Analysis**

- Download the customers_raw.csv file from Blackboard.
  
  The dataset provides information about customers, their use of telecommunication services (minutes, calls, charges), and their engagement (account duration, customer service interactions, subscribed plans).
- Load the dataset and explore its structure.
- Generate key metrics (KPIs) and visualizations to understand the data distribution, trends, and patterns.
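
The exploration steps above can be sketched as follows. This is a minimal illustration, not a reference solution: `df_toy` is a hypothetical six-row frame that borrows a few column names from `customers_raw.csv`, and in the notebook you would start from `pd.read_csv('customers_raw.csv')` instead.

```python
import pandas as pd

# Hypothetical toy rows (assumption): stand-in for pd.read_csv('customers_raw.csv')
df_toy = pd.DataFrame({
    "State": ["OH", "FL", "GA", "IL", "OH", "MI"],
    "Day Mins": [195.36, 149.99, 284.33, 232.79, 277.44, 170.67],
    "CustServ Calls": [5, 6, 4, 2, 9, 9],
    "Churn?": [False, True, None, None, True, True],
})

# Structure: shape, column types, and missing values per column
print(df_toy.shape)
print(df_toy.dtypes)
missing_per_column = df_toy.isna().sum()
print(missing_per_column)

# KPI 1: churn rate, computed on labelled customers only
labelled = df_toy.dropna(subset=["Churn?"])
is_churner = labelled["Churn?"].astype(bool)
churn_rate = is_churner.mean()
print(f"Churn rate among labelled customers: {churn_rate:.2%}")

# KPI 2: average number of customer-service calls, churners vs. non-churners
calls_by_churn = labelled.groupby(is_churner)["CustServ Calls"].mean()
print(calls_by_churn)
```

The same KPIs can then be plotted (for instance with `calls_by_churn.plot(kind="bar")`, which requires matplotlib) to compare churners and non-churners visually.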

**Data Cleaning & Preparation**

- Identify and handle missing values, outliers, and any inconsistencies in the dataset.
- Rename columns if necessary for clarity and consistency.
- Perform any other preprocessing steps that could improve data quality and model performance.
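
One possible way to carry out these cleaning steps is sketched below. The toy frame `raw`, the snake_case column names, the median imputation, and the 1.5 × IQR clipping rule are all illustrative assumptions. Other choices (dropping rows, a different outlier rule) may suit the real data better.

```python
import pandas as pd

# Hypothetical toy frame (assumption) reusing column names from customers_raw.csv
raw = pd.DataFrame({
    "Int'l Plan": ["no", "yes", "no", "no", "yes"],
    "Day Mins": [195.36, 149.99, None, 232.79, 2770.44],  # one gap, one implausible value
    "Churn?": [False, True, None, None, True],
})

# 1. Rename columns to consistent names (the naming scheme is an assumption)
clean = raw.rename(columns={
    "Int'l Plan": "intl_plan",
    "Day Mins": "day_mins",
    "Churn?": "churn",
})

# 2. Fill missing numeric values with the median, which is less sensitive to outliers than the mean
clean["day_mins"] = clean["day_mins"].fillna(clean["day_mins"].median())

# 3. Clip values outside the 1.5 * IQR fences instead of dropping the rows
q1, q3 = clean["day_mins"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["day_mins"] = clean["day_mins"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(clean)
```

Rows with a missing `churn` label are kept here on purpose: they cannot be used for training, but they are the customers a trained model would later score.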

**Unsupervised Learning: Customer Clustering**

- Implement a clustering algorithm (e.g., K-Means or another suitable model) to segment customers based on relevant features.
- Analyze the clusters to identify patterns and insights that could help with churn prediction.
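
As a rough illustration of the clustering step, the sketch below runs K-Means on a hypothetical toy frame `usage` with two deliberately well-separated groups. The feature choice, the scaling step, and `n_clusters=2` are assumptions made for the example, not recommendations for the real dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (assumption): light users first, heavy users last
usage = pd.DataFrame({
    "Day Mins": [120.0, 130.5, 125.2, 290.1, 300.4, 285.7],
    "CustServ Calls": [1, 0, 2, 7, 8, 6],
})

# K-Means relies on Euclidean distances, so put features on a comparable scale first
scaled = StandardScaler().fit_transform(usage)

# Fit K-Means and attach the cluster label to each customer
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
usage["cluster"] = kmeans.fit_predict(scaled)

# Profile each cluster by its average feature values
cluster_profile = usage.groupby("cluster").mean()
print(cluster_profile)
```

Comparing the churn rate across clusters, or passing the cluster label to the classifier as an extra feature, is one way to connect this step to churn prediction.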

**Supervised Learning: Churn Prediction**

- Train a binary classification model to predict whether a customer is likely to churn.
- Experiment with different algorithms and evaluate their effectiveness.
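
Trying several algorithms can be as simple as looping over a dictionary of candidate models. The sketch below uses synthetic data generated with NumPy, and the rule linking service calls to churn is invented for the example, so the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features (assumption): column names borrowed from the dataset
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "Day Mins": rng.normal(200, 50, n),
    "CustServ Calls": rng.integers(0, 10, n),
})
# Invented labelling rule: more service calls means more churn, plus some noise
y = (X["CustServ Calls"] + rng.normal(0, 1.5, n) > 5).astype(int)

# Stratify so that train and test sets keep the same churn proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

# Fit each candidate on the same split and record its test accuracy
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```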

**Feature Engineering & Hyperparameter Tuning**

- Extract and engineer relevant features before training your model(s), for example by encoding categorical variables.
- Fine-tune the hyperparameters of both your clustering and classification models.
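
Encoding and tuning can be combined in a single scikit-learn `Pipeline`, so that the same preprocessing is refitted inside every cross-validation fold. The sketch below is one possible setup on synthetic data. The labelling rule, the choice of logistic regression, and the grid of `C` values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (assumption): one categorical and one numeric feature
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Int'l Plan": rng.choice(["yes", "no"], n),
    "Day Mins": rng.normal(200, 50, n),
})
# Invented labelling rule, only there to give the model something to learn
y = ((X["Int'l Plan"] == "yes") & (X["Day Mins"] > 180)).astype(int)

# One-hot encode the categorical column and standardize the numeric one
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Int'l Plan"]),
    ("numeric", StandardScaler(), ["Day Mins"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Search over the regularization strength C with 3-fold cross-validation
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The `clf__C` key follows the `<step name>__<parameter>` convention that `GridSearchCV` uses to reach parameters inside a pipeline.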

# **Guidelines**
- Focus on feature engineering and hyperparameter tuning to enhance your models.
- Do not aim to maximize a specific metric yet; this will be covered later in the course.
- Clearly document your approach, justifying the choices you make for preprocessing, feature selection, and model tuning.
- Use appropriate visualizations to support your analysis.
- The steps above are not exhaustive. Feel free to explore and test any feature that is relevant to this use case.




**This project is an opportunity to apply various machine learning techniques and develop a structured approach to predictive modeling. Take the time to explore your data and iterate on your models for better insights and performance.**

**In-class demo**

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # churn is a classification task, so use the classifier variant
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [10]:
df = pd.read_csv('customers_raw.csv')
df.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Eve Mins,Eve Calls,Night Mins,Night Calls,Intl Mins,Intl Calls,CustServ Calls,Day Charge,Eve Charge,Night Charge,Intl Charge,Churn?
0,OH,24,415,654-3228,no,no,35,195.36,90,233.62,69,269.55,47,5.38,6,5,33.21,19.86,12.13,1.45,False
1,FL,149,408,600-1596,no,yes,20,149.99,119,252.17,105,169.48,111,9.83,8,6,25.5,21.43,7.63,2.65,True
2,GA,138,408,254-4606,no,no,27,284.33,56,211.35,121,239.46,62,15.02,10,4,48.34,17.96,10.78,4.06,
3,IL,110,510,726-4889,no,yes,0,232.79,87,148.46,68,180.93,71,13.17,10,2,39.57,12.62,8.14,3.56,
4,OH,78,510,534-2395,no,no,13,277.44,131,234.29,148,333.01,54,7.06,1,9,47.16,19.91,14.99,1.91,True


In [11]:
print(df['Churn?'].isna().sum())

1090


In [12]:
df_demo = df.loc[df['Churn?'].notna()]
df_demo.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Eve Mins,Eve Calls,Night Mins,Night Calls,Intl Mins,Intl Calls,CustServ Calls,Day Charge,Eve Charge,Night Charge,Intl Charge,Churn?
0,OH,24,415,654-3228,no,no,35,195.36,90,233.62,69,269.55,47,5.38,6,5,33.21,19.86,12.13,1.45,False
1,FL,149,408,600-1596,no,yes,20,149.99,119,252.17,105,169.48,111,9.83,8,6,25.5,21.43,7.63,2.65,True
4,OH,78,510,534-2395,no,no,13,277.44,131,234.29,148,333.01,54,7.06,1,9,47.16,19.91,14.99,1.91,True
5,MI,101,510,662-8529,no,no,19,170.67,94,202.82,113,275.09,107,11.25,12,9,29.01,17.24,12.38,3.04,True
6,TX,146,408,211-1452,no,no,8,218.22,58,145.26,92,180.18,79,9.47,8,8,37.1,12.35,8.11,2.56,False


In [25]:
columns_to_keep = ['Account Length', 'Day Mins','Day Calls', 'Day Charge']
X_train, X_test, y_train, y_test = train_test_split(
    df_demo[columns_to_keep], df_demo['Churn?'], test_size=0.15, random_state=42)

y_train = y_train.astype(int)
y_test = y_test.astype(int)

In [27]:
X_train

Unnamed: 0,Account Length,Day Mins,Day Calls,Day Charge
4095,106,243.60,122,41.41
4723,198,258.97,62,44.02
4879,186,239.07,126,40.64
5071,124,184.56,129,31.38
1239,70,122.39,123,20.81
...,...,...,...,...
1429,205,219.93,91,37.39
1641,46,304.28,109,51.73
1091,116,194.58,62,33.08
4423,122,189.86,138,32.28


In [17]:
# Create Models
model_1 = LogisticRegression()

In [26]:
# Training the models

model_1.fit(X_train, y_train)

In [28]:
prediction_test = model_1.predict(X_test)

In [29]:
model_accuracy = accuracy_score(y_test, prediction_test)

In [30]:
print(model_accuracy)

0.5740131578947368


In [31]:
# Prediction

model_1.predict(df.loc[df['Churn?'].isna(), columns_to_keep])


array([0, 0, 0, ..., 0, 0, 0])
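
The visible entries of the array above are all `0`. A quick way to check whether a model has collapsed to a single class is to count its predictions and compare its accuracy with a majority-class baseline. The sketch below does this on hypothetical labels (`y_true` and `X_dummy` are made up for the example) using scikit-learn's `DummyClassifier`. In the notebook the same idea would apply to `y_test` and `prediction_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels (assumption): 5 non-churners and 3 churners
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_pred = baseline.predict(X_dummy)
baseline_accuracy = accuracy_score(y_true, baseline_pred)

# Count how often each class is predicted
values, counts = np.unique(baseline_pred, return_counts=True)
prediction_counts = dict(zip(values.tolist(), counts.tolist()))

print(prediction_counts)
print(baseline_accuracy)
```

If a trained model's accuracy is no higher than this baseline, its features or preprocessing probably need more work before any hyperparameter tuning.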