# Customer Churn Prediction: Data Preprocessing

This notebook preprocesses the data for our customer churn prediction project: we load the raw dataset, clean it, handle missing values, engineer features, encode categorical variables, scale numeric features, and save the result.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## 1. Load the Dataset

In [6]:
# Load the dataset
df = pd.read_csv('../data/raw/customer_dataset.csv')

# Display the first few rows and basic information about the dataset
print(df.head())
df.info()  # info() prints its report directly and returns None

FileNotFoundError: [Errno 2] No such file or directory: '../data/raw/customer_dataset.csv'
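
A `FileNotFoundError` on a relative path often means the notebook's working directory is not the one the path assumes. The following is a minimal sketch of a more defensive load, not part of the original notebook: the helper name `load_raw_data` is hypothetical, and it only adds an early check that reports the resolved absolute path.

```python
from pathlib import Path

import pandas as pd


def load_raw_data(path):
    """Load a CSV, failing early with the resolved absolute path (hypothetical helper)."""
    csv_path = Path(path).expanduser().resolve()
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected dataset at {csv_path} (working directory: {Path.cwd()})"
        )
    return pd.read_csv(csv_path)


# Usage, with the path from the cell above:
# df = load_raw_data('../data/raw/customer_dataset.csv')
```

Reporting the working directory alongside the resolved path makes it easier to see whether the notebook was started from an unexpected folder.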

## 2. Data Cleaning

In [None]:
# Convert date columns to datetime
date_columns = ['SubscriptionStart', 'SubscriptionEnd', 'LastPurchaseDate']
for col in date_columns:
    df[col] = pd.to_datetime(df[col])

# Check for missing values
print(df.isnull().sum())

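# Handle missing values (if any): fill numeric columns with their column mean.
# numeric_only=True skips the datetime and text columns, which have no meaningful mean here.
df = df.fillna(df.mean(numeric_only=True))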

# Verify that missing values have been handled
print(df.isnull().sum())
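
Mean imputation only applies to numeric columns, so text and date columns need their own treatment. The sketch below illustrates one possible per-type strategy on a toy frame; the column values are made up, and filling categoricals with the most frequent value is an assumption rather than the notebook's stated method.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for the real dataset
toy = pd.DataFrame({
    'TotalPurchases': [10.0, np.nan, 30.0],
    'Gender': ['F', None, 'F'],
    'LastPurchaseDate': pd.to_datetime(['2024-01-01', None, '2024-03-01']),
})

# Numeric columns: fill with the column mean
numeric_cols = toy.select_dtypes(include='number').columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Categorical (text) columns: fill with the most frequent value
categorical_cols = toy.select_dtypes(exclude=['number', 'datetime']).columns
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

# Dates are left as NaT here; how to impute them is a modeling decision
print(toy.isnull().sum())
```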

## 3. Feature Engineering

In [None]:
# Calculate customer tenure in days
df['Tenure'] = (df['SubscriptionEnd'] - df['SubscriptionStart']).dt.days

# Calculate days since last purchase
current_date = datetime(2024, 9, 25)  # Fixed reference date so results are reproducible
df['DaysSinceLastPurchase'] = (current_date - df['LastPurchaseDate']).dt.days

# Convert LoginFrequency to an approximate number of logins per month
login_frequency_map = {'Daily': 30, 'Weekly': 4, 'Bi-weekly': 2, 'Monthly': 1}
df['LoginFrequencyNumeric'] = df['LoginFrequency'].map(login_frequency_map)

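# Calculate average purchases per month
# (a zero-length tenure becomes NaN instead of producing inf from division by zero)
df['AvgPurchasesPerMonth'] = df['TotalPurchases'] / (df['Tenure'] / 30).replace(0, np.nan)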

# Create a loyalty score
df['LoyaltyScore'] = (df['Tenure'] / 365 * 0.5 +
                      df['AvgPurchasesPerMonth'] * 0.3 +
                      df['LoginFrequencyNumeric'] * 0.2)

# Display the first few rows of the updated dataset
print(df.head())
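
To make the loyalty score arithmetic concrete, here is a small worked sketch on two made-up customers. It reuses the weights from the cell above; the zero-tenure guard (mapping a zero divisor to NaN) is one possible choice, shown here as an assumption.

```python
import numpy as np
import pandas as pd

# Two hypothetical customers; the second has zero tenure
toy = pd.DataFrame({
    'Tenure': [365, 0],
    'TotalPurchases': [24, 3],
    'LoginFrequencyNumeric': [4, 30],
})

# Guard against division by zero: a zero-length tenure yields NaN, not inf
tenure_months = (toy['Tenure'] / 30).replace(0, np.nan)
toy['AvgPurchasesPerMonth'] = toy['TotalPurchases'] / tenure_months

# Same weights as the notebook: 0.5 tenure, 0.3 purchases, 0.2 logins
toy['LoyaltyScore'] = (toy['Tenure'] / 365 * 0.5 +
                       toy['AvgPurchasesPerMonth'] * 0.3 +
                       toy['LoginFrequencyNumeric'] * 0.2)
print(toy)
```

For the first customer the three terms are 0.5, about 0.59, and 0.8, giving a score of about 1.89. Because the inputs are on different scales, a daily login frequency alone contributes 30 × 0.2 = 6.0, so the nominal weights do not reflect each component's actual influence unless the inputs are normalized first.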

## 4. Encoding Categorical Variables

In [None]:
# Convert categorical variables to numeric using one-hot encoding
df = pd.get_dummies(df, columns=['Gender', 'Location'])

# Display the first few rows and updated info of the dataset
print(df.head())
df.info()  # info() prints its report directly and returns None
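
The effect of one-hot encoding is easiest to see on a tiny example. In the sketch below the category values are made up, and `drop_first=True` and `dtype=int` are optional choices that the cell above does not use: the first drops one redundant level per feature, the second yields 0/1 integers instead of booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Location': ['North', 'South', 'East'],  # hypothetical category values
})

# One indicator column per remaining category level
encoded = pd.get_dummies(toy, columns=['Gender', 'Location'],
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Dropping one level per feature mainly matters for linear models, where a full set of indicator columns is perfectly collinear with the intercept.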

## 5. Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numeric columns for scaling; ID and target columns, if present, should be excluded first
numeric_columns = df.select_dtypes(include='number').columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Display the first few rows of the scaled dataset
print(df.head())
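
Fitting the scaler on the full dataset lets statistics from future test rows leak into training. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. The target name `Churn` and the feature list are assumptions for illustration; the real dataset's columns may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features and a binary target named 'Churn' (hypothetical)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Tenure': rng.integers(1, 1000, size=100),
    'TotalPurchases': rng.integers(0, 50, size=100),
    'Churn': rng.integers(0, 2, size=100),
})

feature_cols = ['Tenure', 'TotalPurchases']  # the target is deliberately excluded
X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols], toy['Churn'], test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```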

## 6. Save Processed Dataset

In [None]:
# Save the processed dataset
df.to_csv('../data/processed/processed_customer_data.csv', index=False)
print("Processed data saved successfully.")
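
`to_csv` does not create missing directories, so the save fails if the output folder does not exist yet. A minimal sketch of creating the folder first and checking the round trip follows; a temporary directory stands in for `../data/processed`, and the toy frame is made up.

```python
from pathlib import Path
import tempfile

import pandas as pd

toy = pd.DataFrame({'Tenure': [365, 30], 'Churn': [0, 1]})  # hypothetical data

# A temporary directory stands in for '../data/processed' in this sketch
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'processed' / 'processed_customer_data.csv'
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)  # read back to confirm the file is usable

print(reloaded.equals(toy))
```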

## 7. Summary of Preprocessing Steps

1. Loaded the raw dataset
2. Converted date columns to datetime format
3. Handled missing values by filling numeric columns with their column means
4. Performed feature engineering:
   - Calculated customer tenure
   - Calculated days since last purchase
   - Converted login frequency to numeric
   - Calculated average purchases per month
   - Created a loyalty score
5. Encoded categorical variables using one-hot encoding
6. Scaled numeric features using StandardScaler
7. Saved the processed dataset

The preprocessed data is now ready for exploratory data analysis and model building.