
# **Coding Exercise - ML Basics**

# **Part 2: Predicting Customer Churn**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report


# Load sample customer data
# (generated with ChatGPT prompts, 100+ rows)
data_url = 'https://raw.githubusercontent.com/KennethLengo/MISTEST2026/refs/heads/main/customer_churn_dataset.csv'
df = pd.read_csv(data_url)

# Features and target
X = df[['age', 'monthly_usage_hours', 'purchase_amount', 'customer_service_calls',
        'region']]
y = df['churn']

# Preprocessing: Scale numerical features and one-hot encode categorical features
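preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'monthly_usage_hours', 'purchase_amount',
                                   'customer_service_calls']),
        # handle_unknown='ignore' encodes a region unseen during training as all zeros
        # instead of raising an error at prediction time
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['region'])
    ])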

# Create pipeline with preprocessing and model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Evaluate the model by comparing the test-set predictions with the true labels
print(f"Accuracy on Test Set: {accuracy_score(y_test, y_pred):.2f}")
print("Classification Report on Test Set:")
print(classification_report(y_test, y_pred))

# Predict churn probability for a new customer
new_customer = pd.DataFrame({
    'age': [35],
    'monthly_usage_hours': [20],
    'purchase_amount': [150],
    'customer_service_calls': [5],
    'region': ['West']
})

# Probability of churn (class 1)
churn_probability = model.predict_proba(new_customer)[0][1]

# Classify based on threshold (0.5)
threshold = 0.5
churn_prediction = 1 if churn_probability > threshold else 0
print(f"Churn Probability For New Customer: {churn_probability:.2f}")
print(f"Churn Prediction (1 = churn, 0 = no churn): {churn_prediction}")

# Display model coefficients
num_features = ['age', 'monthly_usage_hours', 'purchase_amount', 'customer_service_calls']

cat_features = model.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(['region']).tolist()
feature_names = num_features + cat_features
coefficients = model.named_steps['classifier'].coef_[0]

print("\nModel Coefficients:")
for feature, coef in zip(feature_names, coefficients):
    print(f"{feature}: {coef:.2f}")

# Explanations And Interpretations of Data
print('''\nThe churn rate basically refers to the percentage of customers who stop buying a product or service within a certain time frame. A lower churn probability is desirable,
as it means the business has a higher chance of retaining the customer. In this scenario, the churn threshold of 0.5 means that a customer is believed to leave if the churn
probability is above 0.5 and to stay otherwise. \n''')


print('''The classification report shows how well the model predicts churn on the held-out test set. Accuracy is 100%, but the test set contains only 19 customers
(13 who stayed and 6 who churned), so the model needs further testing and cross-validation to confirm that it performs consistently on new data.''')

print('''\nBecause the model estimates a separate coefficient for each region, it shows the business which locations to target in order to retain customers. The region
coefficients are small compared with those of the numerical features, so region plays only a minor role in increasing or decreasing churn. Customers in the East, North and West
regions show a slightly higher likelihood of churn because their coefficients are positive, while customers in the South region show a lower likelihood because their coefficient is negative.

The East region's coefficient of 0.20 suggests a slight increase in churn, while the North (0.03) and West (0.04) coefficients are close to zero and indicate almost no effect.
The South region's coefficient of -0.28 suggests that the chance of churn is slightly reduced.

The numerical features are standardized, so each coefficient describes the effect of a one-standard-deviation increase. Age has a modest coefficient of 0.24, so older customers
are slightly more likely to churn. Monthly usage hours (-1.52) and purchase amount (-1.41) have large negative coefficients, so heavier usage and higher spending strongly reduce
churn. Customer service calls has the largest coefficient (1.85), so more calls strongly increase the chance of churn.''')

print('''\nThe improvement made is the addition of a classification report, which allows the business to assess the model's performance.''')


Accuracy on Test Set: 1.00
Classification Report on Test Set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      1.00      1.00         6

    accuracy                           1.00        19
   macro avg       1.00      1.00      1.00        19
weighted avg       1.00      1.00      1.00        19
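
A perfect score on 19 test rows comes from a single random split, so it is worth checking how stable the accuracy is across several splits. The block below is a minimal sketch of stratified 5-fold cross-validation. It does not use the churn CSV: `X_demo` and `y_demo` are synthetic stand-in data from `make_classification`, and `demo_model` is a simplified, numeric-only version of the pipeline. All of these names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 100 rows, 4 numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=100, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# Simplified pipeline: scaling plus logistic regression, no categorical column
demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42)),
])

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {np.round(cv_scores, 2)}")
print(f"Mean accuracy: {cv_scores.mean():.2f} (std {cv_scores.std():.2f})")
```

With the real data, the same call should work by passing the full `model` pipeline together with `X` and `y`, since the preprocessing is refit inside each fold.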

Churn Probability For New Customer: 0.12
Churn Prediction (1 = churn, 0 = no churn): 0
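
The probability and the 0/1 label above are two separate steps: logistic regression turns a linear score into a probability with the sigmoid function, and the threshold then turns that probability into a class. The sketch below illustrates both steps on synthetic stand-in data. `X_demo`, `y_demo`, `clf` and the helper `classify_with_threshold` are hypothetical names for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption), not the churn dataset
X_demo, y_demo = make_classification(
    n_samples=100, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)


def classify_with_threshold(probabilities, threshold=0.5):
    """Return 1 where the probability is strictly above the threshold, else 0."""
    return (np.asarray(probabilities) > threshold).astype(int)


# Step 1: linear score -> probability via the sigmoid function
scores = clf.decision_function(X_demo[:5])
manual_proba = 1.0 / (1.0 + np.exp(-scores))
library_proba = clf.predict_proba(X_demo[:5])[:, 1]
print("Sigmoid matches predict_proba:", np.allclose(manual_proba, library_proba))

# Step 2: probability -> class label, for several candidate thresholds
for threshold in (0.3, 0.5, 0.7):
    print(threshold, classify_with_threshold(library_proba, threshold))
```

Lowering the threshold flags more customers as likely churners, which may be preferable when missing a churner costs more than a wasted retention offer.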

Model Coefficients:
age: 0.24
monthly_usage_hours: -1.52
purchase_amount: -1.41
customer_service_calls: 1.85
region_East: 0.20
region_North: 0.03
region_South: -0.28
region_West: 0.04
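
Logistic regression coefficients are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio: the factor by which the odds of churn are multiplied when a standardized numerical feature rises by one standard deviation, or when a region indicator is switched on. The sketch below applies this to the rounded coefficients printed above; the dictionary is copied by hand from that output and is not read from the fitted model.

```python
import math

# Rounded coefficients copied from the printed model output (not read from the model)
coefficients = {
    "age": 0.24,
    "monthly_usage_hours": -1.52,
    "purchase_amount": -1.41,
    "customer_service_calls": 1.85,
    "region_East": 0.20,
    "region_North": 0.03,
    "region_South": -0.28,
    "region_West": 0.04,
}

# exp(coefficient) is the multiplicative change in the odds of churn
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

# Sort from strongest churn-increasing to strongest churn-reducing effect
for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: odds ratio {ratio:.2f}")
```

An odds ratio above 1 raises the odds of churn and one below 1 lowers them, so values close to 1, such as those for the North and West regions, indicate almost no effect.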

The churn rate basically refers to the percentage of customers who stop buying a product or service within a certain time frame. A lower churn probability is desirable,
as it means the business has a higher chance of retaining the customer. In this scenario, the churn threshold of 0.5 means that a customer is believed to leave if the ch