# Customer Churn Prediction: Data Preprocessing

This notebook covers the data preprocessing steps for our customer churn prediction project: loading the data, cleaning it, handling missing values, engineering features, encoding categorical variables, scaling numeric features, and saving the result.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## 1. Load the Dataset

In [8]:
# Load the dataset
df = pd.read_csv('../data/raw/customer-dataset.csv')

# Display the first few rows and basic information about the dataset
print(df.head())
df.info()  # info() prints its summary directly and returns None, so no print() is needed

   CustomerID  Age      Gender     Location SubscriptionStart SubscriptionEnd  \
0        1001   32      Female     New York        2023-01-15      2024-01-14   
1        1002   45        Male  Los Angeles        2023-02-01      2023-08-01   
2        1003   28  Non-Binary      Chicago        2023-03-10      2024-03-09   
3        1004   55      Female      Houston        2023-01-01      2023-12-31   
4        1005   39        Male        Miami        2023-04-05      2024-04-04   

  Churn  TotalPurchases LastPurchaseDate  AvgOrderValue LoginFrequency  \
0    No              18       2024-09-10          75.50          Daily   
1   Yes               5       2023-07-15         120.00         Weekly   
2    No              25       2024-09-20          50.25          Daily   
3   Yes              12       2023-11-30          85.75        Monthly   
4    No              30       2024-09-18          95.00          Daily   

   SupportInteractions  
0                    2  
1                 

## 2. Data Cleaning

In [13]:
# Convert date columns to datetime
date_columns = ['SubscriptionStart', 'SubscriptionEnd', 'LastPurchaseDate']
for col in date_columns:
    df[col] = pd.to_datetime(df[col])


# Separate numeric and non-numeric columns
numeric_columns = df.select_dtypes(include=[np.number]).columns
categorical_columns = df.select_dtypes(exclude=[np.number]).columns

# Fill missing values in numeric columns with the mean
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

# Fill missing values in categorical columns with the mode (most frequent value)
df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])

# Verify that missing values have been handled
print(df.isnull().sum())


CustomerID             0
Age                    0
Gender                 0
Location               0
SubscriptionStart      0
SubscriptionEnd        0
Churn                  0
TotalPurchases         0
LastPurchaseDate       0
AvgOrderValue          0
LoginFrequency         0
SupportInteractions    0
dtype: int64
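
To make the imputation rule concrete, here is a minimal sketch on a toy frame. The values are invented for illustration and are not taken from the project dataset; the fill logic mirrors the cell above (column mean for numeric gaps, most frequent value for categorical gaps).

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0],
    "LoginFrequency": ["Daily", None, "Daily"],
})

numeric_cols = toy.select_dtypes(include=[np.number]).columns
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns

# Numeric gaps get the column mean; categorical gaps get the most frequent value
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
toy[categorical_cols] = toy[categorical_cols].fillna(toy[categorical_cols].mode().iloc[0])

print(toy)
```

The missing age becomes the mean of the two observed ages, and the missing login frequency becomes the most common label. Mean imputation is sensitive to outliers, so the median may be a reasonable alternative for skewed columns.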


## 3. Feature Engineering

In [None]:
# Calculate customer tenure in days
df['Tenure'] = (df['SubscriptionEnd'] - df['SubscriptionStart']).dt.days

# Calculate days since last purchase, measured from a fixed snapshot date
current_date = datetime(2024, 9, 25)  # Reference date, fixed so the feature is reproducible
df['DaysSinceLastPurchase'] = (current_date - df['LastPurchaseDate']).dt.days

# Convert LoginFrequency to an approximate number of logins per month
# (labels missing from the map become NaN)
login_frequency_map = {'Daily': 30, 'Bi-weekly': 2, 'Weekly': 4, 'Monthly': 1}
df['LoginFrequencyNumeric'] = df['LoginFrequency'].map(login_frequency_map)

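# Calculate average purchases per month (30-day months);
# a zero-day tenure is treated as missing to avoid division by zero
df['AvgPurchasesPerMonth'] = df['TotalPurchases'] / (df['Tenure'].replace(0, np.nan) / 30)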

# Create a loyalty score
df['LoyaltyScore'] = (df['Tenure'] / 365 * 0.5 +
                      df['AvgPurchasesPerMonth'] * 0.3 +
                      df['LoginFrequencyNumeric'] * 0.2)

# Display the first few rows of the updated dataset
print(df.head())
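
As a worked example of the formulas above, the sketch below recomputes the engineered features for one row of the preview (CustomerID 1002). It also includes a hypothetical zero-tenure customer (ID 9999, invented for illustration) to show how a zero-length subscription can be kept from producing an infinite purchase rate. The names `toy`, `reference_date`, and `logins_per_month` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Row 1 of the preview above (CustomerID 1002), plus a hypothetical
# zero-tenure customer (ID 9999, invented) to show the edge case
toy = pd.DataFrame({
    "CustomerID": [1002, 9999],
    "SubscriptionStart": pd.to_datetime(["2023-02-01", "2024-01-01"]),
    "SubscriptionEnd": pd.to_datetime(["2023-08-01", "2024-01-01"]),
    "LastPurchaseDate": pd.to_datetime(["2023-07-15", "2024-01-01"]),
    "TotalPurchases": [5, 1],
    "LoginFrequency": ["Weekly", "Daily"],
})

reference_date = pd.Timestamp("2024-09-25")
logins_per_month = {"Daily": 30, "Bi-weekly": 2, "Weekly": 4, "Monthly": 1}

toy["Tenure"] = (toy["SubscriptionEnd"] - toy["SubscriptionStart"]).dt.days
toy["DaysSinceLastPurchase"] = (reference_date - toy["LastPurchaseDate"]).dt.days
toy["LoginFrequencyNumeric"] = toy["LoginFrequency"].map(logins_per_month)

# Treat a zero-day tenure as missing so the division yields NaN instead of inf
months = toy["Tenure"].replace(0, np.nan) / 30
toy["AvgPurchasesPerMonth"] = toy["TotalPurchases"] / months

# Same weights as the notebook: 0.5 tenure (years), 0.3 purchase rate, 0.2 logins
toy["LoyaltyScore"] = (toy["Tenure"] / 365 * 0.5
                       + toy["AvgPurchasesPerMonth"] * 0.3
                       + toy["LoginFrequencyNumeric"] * 0.2)

print(toy[["CustomerID", "Tenure", "DaysSinceLastPurchase",
           "AvgPurchasesPerMonth", "LoyaltyScore"]])
```

For customer 1002 the tenure is 181 days. With a login value of 4 and a weight of 0.2, the login term alone contributes 0.8 to the score, so login frequency dominates the other two components unless the inputs are rescaled first.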

## 4. Encoding Categorical Variables

In [None]:
# Convert categorical variables to numeric using one-hot encoding
df = pd.get_dummies(df, columns=['Gender', 'Location'])

# Display the first few rows and updated info of the dataset
print(df.head())
df.info()  # info() prints its summary directly and returns None
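
The cell above one-hot encodes `Gender` and `Location`, while `Churn` still holds `Yes`/`No` strings. The sketch below shows the one-hot step on three rows from the preview and, as an assumption that is not part of the notebook, maps the target to a binary label. Passing `dtype=int` is also an illustrative choice that yields 0/1 columns rather than booleans.

```python
import pandas as pd

# Three rows using values from the preview above
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Non-Binary"],
    "Location": ["New York", "Los Angeles", "Chicago"],
    "Churn": ["No", "Yes", "No"],
})

# One-hot encode the nominal columns; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Gender", "Location"], dtype=int)

# Assumption: the target is encoded as a binary label (Yes -> 1, No -> 0)
encoded["Churn"] = encoded["Churn"].map({"Yes": 1, "No": 0})

print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own indicator column named `<column>_<value>`, for example `Gender_Female` or `Location_Chicago`.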

## 5. Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numeric columns for scaling, leaving out the identifier and the target
numeric_columns = (df.select_dtypes(include=['int64', 'float64']).columns
                   .drop(['CustomerID', 'Churn'], errors='ignore'))

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Display the first few rows of the scaled dataset
print(df.head())
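
One common refinement, which goes beyond what the cell above does, is to fit the scaler on a training split only and reuse its statistics for held-out rows, so that the test data does not influence the scaling. The sketch below illustrates this on invented toy data; the split size, the `feature_cols` list, and the choice to skip `CustomerID` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature table, invented for illustration only
toy = pd.DataFrame({
    "CustomerID": range(1001, 1011),
    "Age": [32, 45, 28, 55, 39, 41, 23, 60, 35, 48],
    "TotalPurchases": [18, 5, 25, 12, 30, 9, 14, 3, 22, 17],
})

# Identifiers are left out of scaling so they stay usable as keys
feature_cols = ["Age", "TotalPurchases"]
toy[feature_cols] = toy[feature_cols].astype(float)

# Hold out 3 of the 10 rows
train, test = train_test_split(toy, test_size=3, random_state=42)
train, test = train.copy(), test.copy()

# Fit on the training rows only, then reuse the same statistics for the test rows
scaler = StandardScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
test[feature_cols] = scaler.transform(test[feature_cols])

print(train[feature_cols].mean().round(6))
print(train[feature_cols].std(ddof=0).round(6))
```

After scaling, the training columns have zero mean and unit population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses. The test columns will generally not be exactly centered, which is expected.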

## 6. Save Processed Dataset

In [None]:
# Save the processed dataset
df.to_csv('../data/processed/processed_customer_data.csv', index=False)
print("Processed data saved successfully.")
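
CSV files do not store column types, so datetime columns come back as plain strings unless they are parsed again on load. The sketch below shows a save-and-reload round trip on two rows based on the preview; the temporary directory is a stand-in so the example runs anywhere, and the notebook itself writes to `../data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows based on the preview above (tenure computed from the shown dates)
toy = pd.DataFrame({
    "CustomerID": [1001, 1002],
    "SubscriptionStart": pd.to_datetime(["2023-01-15", "2023-02-01"]),
    "Tenure": [364, 181],
})

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in path for illustration only
    out_path = Path(tmp) / "processed_customer_data.csv"
    toy.to_csv(out_path, index=False)

    # Without parse_dates the date column would come back as plain strings
    reloaded = pd.read_csv(out_path, parse_dates=["SubscriptionStart"])

print(reloaded.dtypes)
```

If dtype fidelity matters for later notebooks, a typed format such as Parquet may be worth considering in place of CSV.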

## 7. Summary of Preprocessing Steps

1. Loaded the raw dataset
2. Converted date columns to datetime format
3. Filled missing values (mean for numeric columns, mode for categorical columns)
4. Performed feature engineering:
   - Calculated customer tenure
   - Calculated days since last purchase
   - Converted login frequency to numeric
   - Calculated average purchases per month
   - Created a loyalty score
5. Encoded categorical variables using one-hot encoding
6. Scaled numeric features using StandardScaler
7. Saved the processed dataset

The preprocessed data is now ready for exploratory data analysis and model building.