In [1]:
# --- Essential Imports (always at the top) ---
import pandas as pd
import io
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score

Customer Churn Prediction and Data Quality Analysis
Problem Statement


This project aims to demonstrate fundamental data science and machine learning skills by addressing a common business problem: customer churn. Customer churn, or customer attrition, refers to the rate at which customers stop doing business with an entity. Predicting churn allows businesses to proactively identify at-risk customers and implement retention strategies, which is often more cost-effective than acquiring new customers.

Specifically, this project will:

Perform essential data cleaning and preparation on a raw customer dataset.
Build and train a basic machine learning model (Logistic Regression) to predict whether a customer is likely to churn.
Evaluate the model's performance.
This serves as a foundational portfolio piece showcasing skills in data handling, basic machine learning, and results interpretation.

In [2]:
# --- 1. Data Loading and Initial Inspection ---

# Mock CSV data representing raw customer information
csv_data = """customer_id,age,gender,annual_income,has_loyalty_card,last_purchase_amount
101,34,Female,55000,True,120.50
102,45,Male,70000,False,23.10
103,28,Female,48000,True,300.00
104,52,Male,85000,True,75.25
105,,Female,60000,False,
106,30,Male,null,True,50.00
107,39,Female,62000,False,180.75
108,60,Male,90000,True,15.99
"""

# Load the data into a Pandas DataFrame
df = pd.read_csv(io.StringIO(csv_data))

# read_csv already parses the literal 'null' as NaN (it is one of pandas' default NA markers),
# so the two lines below are a defensive no-op that still guarantees a numeric column
# if those defaults are ever changed. A numeric dtype is needed for calculations like the mean.
df['annual_income'] = df['annual_income'].replace('null', np.nan)
df['annual_income'] = pd.to_numeric(df['annual_income'])

print("--- Original DataFrame Head (First 5 Rows) ---")
print(df.head())
print("\n--- Original Missing Values Count per Column ---")
print(df.isnull().sum())
print("\n--- Original DataFrame Info (Data Types and Non-Nulls) ---")
df.info() # Displays data types and non-null counts

--- Original DataFrame Head (First 5 Rows) ---
   customer_id   age  gender  annual_income  has_loyalty_card  \
0          101  34.0  Female        55000.0              True   
1          102  45.0    Male        70000.0             False   
2          103  28.0  Female        48000.0              True   
3          104  52.0    Male        85000.0              True   
4          105   NaN  Female        60000.0             False   

   last_purchase_amount  
0                120.50  
1                 23.10  
2                300.00  
3                 75.25  
4                   NaN  

--- Original Missing Values Count per Column ---
customer_id             0
age                     1
gender                  0
annual_income           1
has_loyalty_card        0
last_purchase_amount    1
dtype: int64

--- Original DataFrame Info (Data Types and Non-Nulls) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column                

Data Description

The dataset used in this project is a mock customer dataset containing 8 entries (rows) and 6 features (columns). Each row represents a unique customer with attributes such as:

customer_id: Unique identifier for each customer.
age: Age of the customer.
gender: Gender of the customer (categorical: Female/Male).
annual_income: Customer's annual income.
has_loyalty_card: Boolean indicating if the customer has a loyalty card.
last_purchase_amount: Amount of the customer's last purchase.
Initial inspection shows one missing value each in the age, annual_income (written as the string 'null' in the raw CSV), and last_purchase_amount columns. These need cleaning before model building.
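How the 'null' entry becomes a missing value is worth seeing in isolation. The sketch below uses a two-row excerpt of the mock data; the names sample_csv, parsed, raw, and converted are illustrative and not part of the notebook. It assumes pandas' default NA markers, which include the literal string null, and shows what changes when those defaults are switched off with keep_default_na=False.

```python
import io
import pandas as pd

# Two-row excerpt of the mock data (illustrative only)
sample_csv = "customer_id,annual_income\n105,60000\n106,null\n"

# Default behaviour: 'null' is one of pandas' built-in NA markers,
# so the column is parsed as numeric with one NaN.
parsed = pd.read_csv(io.StringIO(sample_csv))
print(parsed["annual_income"].dtype)         # float64
print(parsed["annual_income"].isna().sum())  # 1

# With the default NA markers disabled, 'null' stays as text and the
# column needs an explicit conversion.
raw = pd.read_csv(io.StringIO(sample_csv), keep_default_na=False)
converted = pd.to_numeric(raw["annual_income"], errors="coerce")
print(converted.isna().sum())                # 1
```

Under the default settings the explicit replace/to_numeric step in the loading cell therefore changes nothing, but it keeps the notebook robust if the CSV is later read with different NA options.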

In [3]:
# --- 2. Data Cleaning Steps ---

# Explanation of Data Cleaning:
# Real-world data is often messy. Missing values can cause errors in models or lead to biased results.
# Here, we employ common strategies to handle them:

# 2.1 Drop rows where 'age' is missing.
# Rationale: 'age' is a key demographic feature that is hard to impute reliably from so few rows.
# Dropping the row is the simplest option, although here it costs 1 of only 8 rows
# (customer 105, whose last_purchase_amount is also missing).
df.dropna(subset=['age'], inplace=True)
print("\n--- After dropping rows with missing 'age' ---")
print(df.isnull().sum()) # Verify remaining missing values
print(df.head()) # See updated DataFrame


# 2.2 Fill missing 'annual_income' values with the MEAN of the 'annual_income' column.
# Rationale: Annual income is numerical. Mean imputation is a simple strategy that keeps the row
# and leaves the column mean unchanged, though it slightly reduces the column's variance.
mean_annual_income = df['annual_income'].mean()
df['annual_income'] = df['annual_income'].fillna(mean_annual_income) # Assign back to avoid FutureWarning
print(f"\n--- After filling missing 'annual_income' with mean ({mean_annual_income:.2f}) ---")
print(df.isnull().sum())
print(df.head())


# 2.3 Fill missing 'last_purchase_amount' values with 0.
# Rationale: A missing purchase amount might mean no recent purchase, so imputing 0 is a
# plausible business assumption. In this dataset the only missing value was in the row
# dropped in step 2.1, so this fill changes nothing here.
df['last_purchase_amount'] = df['last_purchase_amount'].fillna(0) # Assign back to avoid FutureWarning
print("\n--- After filling missing 'last_purchase_amount' with 0 ---")
print(df.isnull().sum())
print(df.head())

print("\n--- Final Cleaned DataFrame Head ---")
print(df.head())
print("\n--- Final Missing Values Count (Should all be 0) ---")
print(df.isnull().sum())


--- After dropping rows with missing 'age' ---
customer_id             0
age                     0
gender                  0
annual_income           1
has_loyalty_card        0
last_purchase_amount    0
dtype: int64
   customer_id   age  gender  annual_income  has_loyalty_card  \
0          101  34.0  Female        55000.0              True   
1          102  45.0    Male        70000.0             False   
2          103  28.0  Female        48000.0              True   
3          104  52.0    Male        85000.0              True   
5          106  30.0    Male            NaN              True   

   last_purchase_amount  
0                120.50  
1                 23.10  
2                300.00  
3                 75.25  
5                 50.00  

--- After filling missing 'annual_income' with mean (68333.33) ---
customer_id             0
age                     0
gender                  0
annual_income           0
has_loyalty_card        0
last_purchase_amount    0
dtype: int64
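The mean reported above (68333.33) and the effect of mean imputation can be checked with a short standalone sketch. The income values are the seven annual_income entries that remain after the missing-age row is dropped; the variable names are illustrative. It shows that filling with the mean leaves the column mean unchanged while shrinking the standard deviation.

```python
import numpy as np
import pandas as pd

# annual_income for the seven rows left after dropping the missing-age row
income = pd.Series([55000, 70000, 48000, 85000, np.nan, 62000, 90000], dtype="float64")

mean_income = income.mean()           # NaN values are skipped by default
filled = income.fillna(mean_income)   # mean imputation

print(f"mean before: {mean_income:.2f}, after: {filled.mean():.2f}")
print(f"std  before: {income.std():.2f}, after: {filled.std():.2f}")
```

The imputed value sits exactly at the mean, so it adds nothing to the spread while increasing the row count. That is why the standard deviation drops.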

Data Preparation for Machine Learning

After cleaning, the data needs further preparation to be suitable for a machine learning model. This involves:

Handling Categorical Data: Converting text-based categories (like 'gender') into numerical representations that models can understand.
Adding Target Variable: Introducing the churn column, which is the variable our model will learn to predict.
Defining Features (X) and Target (y): Separating the input variables from the output variable.
Splitting Data: Dividing the dataset into training and testing sets so the model is evaluated on data it has not seen, which makes overfitting detectable.
Feature Scaling: Standardizing numerical features so they sit on comparable scales, which matters for regularized models such as scikit-learn's Logistic Regression.
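The splitting and scaling steps above can be sketched on their own. This example uses only the age and annual_income columns of the seven cleaned rows together with the dummy churn labels. It also adds stratify=y, which the notebook itself does not use, to keep both classes represented in the tiny test set. Treat it as an illustration under those assumptions rather than the notebook's exact procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# age and annual_income for the seven cleaned rows, plus the dummy churn labels
X = np.array([[34, 55000], [45, 70000], [28, 48000], [52, 85000],
              [30, 68333.33], [39, 62000], [60, 90000]], dtype=float)
y = np.array([0, 1, 0, 0, 1, 0, 1])

# stratify=y (an addition for this sketch) keeps class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)

# Fit the scaler on the training rows only, then reuse its statistics for the test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0)
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(manual, X_test_scaled))
```

Fitting the scaler on the training rows alone keeps information about the test rows out of the preprocessing step.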

In [4]:
# --- 3. Data Preparation for Machine Learning ---

# 3.1 Handle Categorical Data: Convert 'gender' to numerical (Label Encoding)
# Rationale: ML models require numerical input. For binary categories, mapping 'Female' to 0 and 'Male' to 1 is a simple encoding.
df['gender_encoded'] = df['gender'].map({'Female': 0, 'Male': 1})
df.drop('gender', axis=1, inplace=True) # Drop the original 'gender' column after encoding
print("\n--- After Gender Encoding ---")
print(df.head())


# 3.2 Add a dummy 'churn' column (0=No Churn, 1=Churn)
# Rationale: This simulates the target variable we want to predict. In a real scenario, this column would be part of the raw data.
# The number of values here MUST match the number of rows in your 'df' after cleaning!
# (Original 8 rows - 1 dropped row = 7 rows)
df['churn'] = [0, 1, 0, 0, 1, 0, 1]
print("\n--- DataFrame with Churn Column ---")
print(df.head())


# 3.3 Define X (features) and y (target)
# X contains the independent variables (features) used for prediction.
# y contains the dependent variable (target) that the model predicts.
X = df[['age', 'annual_income', 'last_purchase_amount', 'has_loyalty_card', 'gender_encoded']]
y = df['churn']
print("\n--- Features (X) Head ---")
print(X.head())
print("\n--- Target (y) Head ---")
print(y.head())


# 3.4 Split Data for Training and Testing
# Rationale: This is critical for evaluating model performance on unseen data.
# test_size=0.2 requests a 20% test set; with 7 rows that rounds up to 2 test rows, leaving 5 for training.
# random_state ensures the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nShape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")


# 3.5 Scale the Features (X_train and X_test)
# Rationale: Scaling ensures that features with larger numerical ranges (like annual_income)
# don't disproportionately influence the model compared to features with smaller ranges (like age).
# StandardScaler transforms data to have a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit scaler on training data, then transform
X_test_scaled = scaler.transform(X_test)     # Use the *same* fitted scaler to transform test data
print("\nFeatures scaled for training and testing.")


--- After Gender Encoding ---
   customer_id   age  annual_income  has_loyalty_card  last_purchase_amount  \
0          101  34.0   55000.000000              True                120.50   
1          102  45.0   70000.000000             False                 23.10   
2          103  28.0   48000.000000              True                300.00   
3          104  52.0   85000.000000              True                 75.25   
5          106  30.0   68333.333333              True                 50.00   

   gender_encoded  
0               0  
1               1  
2               0  
3               1  
5               1  

--- DataFrame with Churn Column ---
   customer_id   age  annual_income  has_loyalty_card  last_purchase_amount  \
0          101  34.0   55000.000000              True                120.50   
1          102  45.0   70000.000000             False                 23.10   
2          103  28.0   48000.000000              True                300.00   
3          104  52.0 

Model Building: Logistic Regression

For this binary classification problem (predicting churn: Yes/No), we chose Logistic Regression.

Logistic Regression is a statistical model that, despite its "regression" in the name, is primarily used for binary classification. It estimates the probability of an instance belonging to a particular class. It's a simple yet powerful and interpretable algorithm, making it a good starting point for many classification tasks.
The model is initialized with random_state for reproducibility and with max_iter raised to 1000 (the scikit-learn default is 100) to give the solver enough iterations to converge and avoid convergence warnings.
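To make the "estimates the probability" point concrete, the sketch below fits a Logistic Regression on a tiny made-up, already-scaled feature (not the notebook's data; X_demo, y_demo, and clf are illustrative names). It then checks that predict_proba matches the sigmoid of the linear score built from the fitted coefficients. It assumes a binary problem, which is the case for churn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up, already-scaled feature matrix (not the notebook's data)
X_demo = np.array([[-1.2], [-0.7], [-0.1], [0.3], [0.9], [1.4]])
y_demo = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression(random_state=42, max_iter=1000).fit(X_demo, y_demo)

# Probability of class 1 from the fitted parameters: sigmoid(w . x + b)
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class; column 1 is P(y = 1)
sklearn_proba = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# predict() corresponds to thresholding that probability at 0.5
thresholded = (sklearn_proba > 0.5).astype(int)
print(np.array_equal(clf.predict(X_demo), thresholded))
```

For churn work the probabilities are often more useful than the hard 0/1 labels, because the decision threshold can be moved to trade precision against recall.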

In [5]:
# --- 4. Build and Train Your Machine Learning Model ---

print("\n--- Building and Training the Model ---")

# Initialize the Logistic Regression model
# max_iter increased to help the solver converge and avoid warnings
model = LogisticRegression(random_state=42, max_iter=1000)

# Train (fit) the model using your SCALED TRAINING data
# The model learns patterns from X_train_scaled to predict y_train
model.fit(X_train_scaled, y_train)
print("Model training complete!")


# Make a prediction for one customer from your X_test data
# This demonstrates the model's ability to predict on unseen, scaled features.
sample_customer_index = 0
sample_customer_features_scaled = X_test_scaled[sample_customer_index].reshape(1, -1)

# Get the actual churn value for this sample customer from y_test for comparison
actual_churn_value = y_test.iloc[sample_customer_index]

# Make the prediction for the single sample
predicted_churn_sample = model.predict(sample_customer_features_scaled)

print(f"\nFeatures of the sample test customer (scaled array):\n{X_test_scaled[sample_customer_index]}")
print(f"Original Features of the sample test customer:\n{X_test.iloc[sample_customer_index]}") # Show original for context
print(f"Actual Churn for this customer: {actual_churn_value}")
print(f"Predicted Churn for this customer: {predicted_churn_sample[0]}") # [0] to get the single value from the array



--- Building and Training the Model ---
Model training complete!

Features of the sample test customer (scaled array):
[-0.62740222 -1.0226094  -0.03760807  0.5        -1.22474487]
Original Features of the sample test customer:
age                        34.0
annual_income           55000.0
last_purchase_amount      120.5
has_loyalty_card           True
gender_encoded                0
Name: 0, dtype: object
Actual Churn for this customer: 0
Predicted Churn for this customer: 0


Results and Insights

After training the Logistic Regression model, we evaluate its performance on the unseen test set.

Sample Prediction Insight:
The model predicted a churn status of 0 for the sample test customer (age 34, annual_income 55000, last_purchase_amount 120.5, has_loyalty_card True, gender_encoded 0), whose actual churn status was 0. This single correct prediction is only a snapshot of how the model behaves on an individual case.

Overall Model Performance:
To assess how well the model generalizes, we evaluate it on the entire test set (the scaled version of X_test).

Accuracy: Measures the proportion of correctly predicted instances out of the total instances.
Classification Report: Provides more detailed metrics for each class (0=No Churn, 1=Churn):
Precision: Of all instances predicted as a certain class, how many were actually that class? (e.g., of all predicted churners, how many actually churned?)
Recall: Of all instances that actually belong to a certain class, how many did the model correctly identify? (e.g., of all actual churners, how many did the model catch?)
F1-Score: The harmonic mean of precision and recall, offering a balance between the two.
Support: The number of actual occurrences of each class in the test set.
(Note: The test set here contains only 2 rows, so the accuracy and classification report are not statistically meaningful and can show extreme values. When the model never predicts a class, precision for that class is undefined and is reported as 0.00. In real-world scenarios, these metrics are crucial for understanding a model's strengths and weaknesses.)
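The metric definitions above can be worked through by hand on a two-row case. The labels below are reconstructed from the classification report in this notebook (one non-churner, one churner, both predicted as 0); y_true and y_hat are illustrative names. The zero_division=0 argument is an addition for this sketch: it reports the undefined precision as 0.0 without raising a warning.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Reconstructed from the classification report: support is 1 per class,
# and the model predicted class 0 for both test rows.
y_true = [0, 1]
y_hat = [0, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 1 0 1 0

# Class 1 (churn): nothing was predicted as 1, so precision is 0/0.
# zero_division=0 reports that as 0.0 instead of warning.
p1 = precision_score(y_true, y_hat, pos_label=1, zero_division=0)
r1 = recall_score(y_true, y_hat, pos_label=1, zero_division=0)
print(p1, r1)  # 0.0 0.0

# Class 0 (no churn): precision = 1 / (1 + 1), recall = 1 / (1 + 0)
p0 = precision_score(y_true, y_hat, pos_label=0)
r0 = recall_score(y_true, y_hat, pos_label=0)
f0 = f1_score(y_true, y_hat, pos_label=0)
print(p0, r0, round(f0, 2))  # 0.5 1.0 0.67
```

A model that predicts "no churn" for everyone scores 0.50 accuracy here while catching no churners, which is why recall on class 1 is usually the number to watch in churn problems.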

In [6]:
# --- 5. Model Evaluation ---

print("\n--- Model Evaluation on Test Set ---")

# Predict churn for the entire X_test_scaled set
# This provides predictions for all customers in the test set, which we compare to y_test.
y_pred = model.predict(X_test_scaled)

# Calculate and print overall accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy on Test Set: {accuracy:.2f}")

# Print a detailed classification report
# This shows precision, recall, and f1-score for each class (0 and 1)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\n--- Project Complete! ---")


--- Model Evaluation on Test Set ---
Model Accuracy on Test Set: 0.50

Classification Report:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2


--- Project Complete! ---




Future Work/Improvements

This project provides a foundational understanding. For a more robust and production-ready solution, several improvements could be considered:

More Data: Real-world datasets are much larger and more diverse.
Feature Engineering: Creating new, more informative features from existing ones (e.g., days_since_last_purchase).
Advanced Data Cleaning: Handling outliers, more sophisticated imputation methods for missing data.
Other Models: Experimenting with different machine learning algorithms (e.g., RandomForestClassifier, GradientBoostingClassifier) which might perform better.
Hyperparameter Tuning: Optimizing the settings of the chosen model for better performance.
Cross-Validation: A more robust method for evaluating model performance on small datasets.
Deployment: Turning the model into an API or integrating it into a business application.
This project serves as an excellent starting point for further exploration and skill development in AI and machine learning.
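As one possible take on the cross-validation idea listed above, the sketch below scores a scaler-plus-model pipeline with stratified 3-fold cross-validation on the seven cleaned rows. The fold count, the use of make_pipeline, and the variable names are choices made for this illustration, not part of the original notebook. With so few rows the scores will still be noisy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The seven cleaned rows from this notebook:
# age, annual_income, last_purchase_amount, has_loyalty_card, gender_encoded
X = np.array([
    [34, 55000.00, 120.50, 1, 0],
    [45, 70000.00,  23.10, 0, 1],
    [28, 48000.00, 300.00, 1, 0],
    [52, 85000.00,  75.25, 1, 1],
    [30, 68333.33,  50.00, 1, 1],
    [39, 62000.00, 180.75, 0, 0],
    [60, 90000.00,  15.99, 1, 1],
])
y = np.array([0, 1, 0, 0, 1, 0, 1])  # dummy churn labels

# Putting the scaler inside the pipeline refits it on each training fold,
# so no statistics leak from the held-out fold.
pipeline = make_pipeline(StandardScaler(),
                         LogisticRegression(random_state=42, max_iter=1000))

# 3 folds is the most this data allows: the smaller class has only 3 rows
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(scores, scores.mean())
```

Every row is used for testing exactly once, so the averaged score rests on all seven rows rather than the two that happened to land in a single test split.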