**Assignment Title: Customer Churn Prediction**

Objective:
Develop a machine learning model to predict customer churn based on historical customer data. You
will follow a typical machine learning project pipeline, from data preprocessing to model deployment.

**1. Data Preprocessing:**
*Load the Dataset and Initial Data Exploration*

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the dataset from an Excel file
file_path = '/content/customer_churn_large_dataset.xlsx'
data = pd.read_excel(file_path)

# Display basic information about the dataset
print("Dataset Information:")
print(data.info())

# Display the first few rows of the dataset
print("\nFirst Few Rows of the Dataset:")
print(data.head())


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   CustomerID                  100000 non-null  int64  
 1   Name                        100000 non-null  object 
 2   Age                         100000 non-null  int64  
 3   Gender                      100000 non-null  object 
 4   Location                    100000 non-null  object 
 5   Subscription_Length_Months  100000 non-null  int64  
 6   Monthly_Bill                100000 non-null  float64
 7   Total_Usage_GB              100000 non-null  int64  
 8   Churn                       100000 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 6.9+ MB
None

First Few Rows of the Dataset:
   CustomerID        Name  Age  Gender     Location  \
0           1  Customer_1   63    Male  Los Angeles   
1           2  Customer

*Step 1.1: Handle Missing Data*

In [2]:
# Check for missing values in each column
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)

Missing Values:
CustomerID                    0
Name                          0
Age                           0
Gender                        0
Location                      0
Subscription_Length_Months    0
Monthly_Bill                  0
Total_Usage_GB                0
Churn                         0
dtype: int64


*Step 1.2: Handling Outliers*

In [3]:
# Example: Detect and handle outliers using z-score
from scipy.stats import zscore
z_scores = np.abs(zscore(data.select_dtypes(include=[np.number])))
data_no_outliers = data[(z_scores < 3).all(axis=1)]

In [4]:
# Display the number of rows before and after removing outliers
print("\nNumber of Rows before Removing Outliers:", len(data))
print("Number of Rows after Removing Outliers:", len(data_no_outliers))


Number of Rows before Removing Outliers: 100000
Number of Rows after Removing Outliers: 100000
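The filter above scores every numeric column, including `CustomerID` and `Churn`, which are an identifier and a binary target rather than measurements. A minimal sketch of the same z-score rule restricted to feature columns is shown below. The data is synthetic and the injected outlier is made up purely for illustration; `demo` and `feature_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic stand-in for the customer table (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CustomerID": np.arange(1, 201),
    "Monthly_Bill": rng.uniform(30, 100, size=200),
    "Total_Usage_GB": rng.integers(50, 500, size=200),
    "Churn": rng.integers(0, 2, size=200),
})
demo.loc[0, "Monthly_Bill"] = 10_000.0  # inject one obvious outlier

# Score only measured features; an ID or a binary target has no meaningful z-score
feature_cols = ["Monthly_Bill", "Total_Usage_GB"]
z = np.abs(zscore(demo[feature_cols]))
demo_no_outliers = demo[(z < 3).all(axis=1)]

print(len(demo), "->", len(demo_no_outliers))
```

Only the row with the injected bill is dropped; every other row stays within three standard deviations on both features.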


*Step 1.3: Prepare Data for Machine Learning*

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Step 1.3: Prepare Data for Machine Learning

# Encoding Categorical Variables
categorical_columns = data.select_dtypes(include=['object']).columns
label_encoders = {}  # Store LabelEncoder instances for future use

for col in categorical_columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

# Display the mapping of encoded labels
for col, le in label_encoders.items():
    print(f"Encoded labels for {col}: {dict(zip(le.classes_, le.transform(le.classes_)))}")



Encoded labels for Name: {'Customer_1': 0, 'Customer_10': 1, 'Customer_100': 2, 'Customer_1000': 3, 'Customer_10000': 4, 'Customer_100000': 5, 'Customer_10001': 6, 'Customer_10002': 7, 'Customer_10003': 8, 'Customer_10004': 9, 'Customer_10005': 10, 'Customer_10006': 11, 'Customer_10007': 12, 'Customer_10008': 13, 'Customer_10009': 14, 'Customer_1001': 15, 'Customer_10010': 16, 'Customer_10011': 17, 'Customer_10012': 18, 'Customer_10013': 19, 'Customer_10014': 20, 'Customer_10015': 21, 'Customer_10016': 22, 'Customer_10017': 23, 'Customer_10018': 24, 'Customer_10019': 25, 'Customer_1002': 26, 'Customer_10020': 27, 'Customer_10021': 28, 'Customer_10022': 29, 'Customer_10023': 30, 'Customer_10024': 31, 'Customer_10025': 32, 'Customer_10026': 33, 'Customer_10027': 34, 'Customer_10028': 35, 'Customer_10029': 36, 'Customer_1003': 37, 'Customer_10030': 38, 'Customer_10031': 39, 'Customer_10032': 40, 'Customer_10033': 41, 'Customer_10034': 42, 'Customer_10035': 43, 'Customer_10036': 44, 'Custo
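The mapping printed above shows that `Name` has one distinct label per customer, so its encoded value carries no more information than a row number. One possible alternative, sketched here on a tiny made-up sample, is to drop identifier columns and label-encode only the low-cardinality columns. The city names other than Los Angeles are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up sample that reuses the dataset's column names
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Name": ["Customer_1", "Customer_2", "Customer_3", "Customer_4"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Location": ["Los Angeles", "New York", "Miami", "Los Angeles"],
})

# Identifier columns are unique per row, so drop them instead of encoding them
features = sample.drop(columns=["CustomerID", "Name"])

encoders = {}  # keep the fitted encoders so new data can be encoded the same way
for col in ["Gender", "Location"]:
    encoders[col] = LabelEncoder()
    features[col] = encoders[col].fit_transform(features[col])

for col, le in encoders.items():
    print(col, dict(zip(le.classes_, range(len(le.classes_)))))
```

`LabelEncoder` assigns integers in sorted order of the class labels, which is why `Female` maps to 0 and `Male` to 1.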

*Split the data into features and target variable*

In [6]:
# Split the data into features and target variable
X = data.drop('Churn', axis=1)  # Features
y = data['Churn']  # Target variable

# Split the data into training and testing sets (adjust test_size and random_state as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the split datasets
print("\nShapes of Split Datasets:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


Shapes of Split Datasets:
X_train shape: (80000, 8)
X_test shape: (20000, 8)
y_train shape: (80000,)
y_test shape: (20000,)
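When the target classes are imbalanced, a plain random split can give the training and test sets different churn rates. A minimal sketch of a stratified split follows, using synthetic data with an assumed 30% churn rate; `X_demo` and `y_demo` are hypothetical names, and whether stratification matters for this dataset depends on its actual class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a target with 30% positives (assumed for illustration)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 300 + [0] * 700)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train churn rate:", y_tr.mean())
print("test churn rate:", y_te.mean())
```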


**2. Feature Engineering:**
Generate relevant features from the dataset that can help improve the model's prediction
accuracy.

In [9]:
# Step 2.1: Feature Engineering

# Calculate 'Usage_Per_Bill'
data['Usage_Per_Bill'] = data['Total_Usage_GB'] / data['Monthly_Bill']

# Create binary 'High_Usage' flag
data['High_Usage'] = (data['Total_Usage_GB'] > data['Total_Usage_GB'].median()).astype(int)

# Combine 'Age' and 'Location' to create a new feature
data['Age_Location'] = data['Age'].astype(str) + '_' + data['Location'].astype(str)

# Calculate 'Usage_Per_Month'
data['Usage_Per_Month'] = data['Total_Usage_GB'] / data['Subscription_Length_Months']


# Create 'Age_Group' feature based on age ranges
age_bins = [0, 18, 30, 50, 100]
age_labels = ['Youth', 'Adult', 'Middle_Age', 'Senior']
data['Age_Group'] = pd.cut(data['Age'], bins=age_bins, labels=age_labels)


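# Target encoding for categorical feature 'Location'
# Note: the churn rate is computed on the full dataset, before the split below,
# so this feature carries target information from the test rows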
location_churn_rate = data.groupby('Location')['Churn'].mean()
data['Location_Encoded'] = data['Location'].map(location_churn_rate)


# Display the first few rows of the data with new features
print("\nFirst Few Rows of the Dataset with New Features:")
print(data.head())

# Split the data into features and target variable
X = data.drop('Churn', axis=1)  # Features
y = data['Churn']  # Target variable

# Split the data into training and testing sets (adjust test_size and random_state as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


First Few Rows of the Dataset with New Features:
   CustomerID   Name  Age  Gender  Location  Subscription_Length_Months  \
0           1      0   63       1         2                          17   
1           2  11112   62       0         4                           1   
2           3  22223   24       0         2                           5   
3           4  33334   36       0         3                           3   
4           5  44445   46       0         3                          19   

   Monthly_Bill  Total_Usage_GB  Churn  Usage_Per_Bill  High_Usage  \
0         73.36             236      0        3.217012           0   
1         48.76             172      0        3.527482           0   
2         85.47             460      0        5.382005           1   
3         97.94             297      1        3.032469           1   
4         58.14             266      0        4.575163           0   

  Age_Location  Usage_Per_Month  Location_Encoded   Age_Group  
0         63_2
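The target encoding in the cell above is computed on all rows before the train/test split. A common way to avoid leaking test labels into a feature is to learn the per-location churn rate from the training rows only and then map it onto both splits. The sketch below does this on a small made-up table; `toy`, `rate`, and `global_rate` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up table with an already label-encoded Location column
toy = pd.DataFrame({
    "Location": [0, 0, 1, 1, 2, 2, 0, 1, 2, 0],
    "Churn":    [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})
train, test = train_test_split(toy, test_size=0.3, random_state=42)
train, test = train.copy(), test.copy()

# Learn the per-location churn rate from the training rows only
rate = train.groupby("Location")["Churn"].mean()
global_rate = train["Churn"].mean()

train["Location_Encoded"] = train["Location"].map(rate)
# Locations unseen in training fall back to the overall training churn rate
test["Location_Encoded"] = test["Location"].map(rate).fillna(global_rate)

print(test)
```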

*2.2 Feature Scaling (Normalization)*

In [12]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define categorical and numerical columns
categorical_columns = ['Age_Group', 'Location_Encoded']  # Adjust with actual categorical column names
numerical_columns = ['Age', 'Subscription_Length_Months', 'Monthly_Bill', 'Total_Usage_GB', 'Usage_Per_Bill', 'High_Usage', 'Usage_Per_Month']

# Initialize transformers for categorical and numerical columns
categorical_transformer = OneHotEncoder(drop='first')
numerical_transformer = StandardScaler()

# Use ColumnTransformer to apply transformers to the appropriate columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_columns),
        ('num', numerical_transformer, numerical_columns)
    ])

# Fit and transform the preprocessor on the training data
X_train_scaled = preprocessor.fit_transform(X_train)

# Transform the testing data using the same preprocessor
X_test_scaled = preprocessor.transform(X_test)




**3. Model Building:**
*Choose appropriate machine learning algorithms (e.g., logistic regression, random forest, or neural networks)*

The accuracy you achieve depends on the dataset and on how well the features are preprocessed and engineered. Among the three options of logistic regression, random forest, and neural networks, **neural networks have the potential to provide the highest accuracy**, although this is not guaranteed on any particular dataset, so a neural network is used in the steps that follow.
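Before committing to one algorithm, it can help to score the simpler candidates under the same cross-validation setup so there is a baseline to compare against. The sketch below does this for logistic regression and a random forest on synthetic data from `make_classification`. It is not the churn dataset, and the estimator settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for the real features
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # Same folds and metric for every candidate, so the means are comparable
    scores = cross_val_score(estimator, X_demo, y_demo, cv=3, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```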


3.1 *Train and validate the selected model on the training dataset*

In [25]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score
from keras.models import save_model
import joblib

# Define the neural network architecture
model = Sequential()
model.add(Dense(128, activation='relu', input_dim=X_train_scaled.shape[1]))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Implement early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train the model on the training data
model.fit(X_train_scaled, y_train, epochs=20, batch_size=64, verbose=1, validation_data=(X_test_scaled, y_test), callbacks=[early_stopping])

# Save the trained model
save_model(model, 'trained_model.h5')

# Save the fitted preprocessor (the ColumnTransformer from step 2.2) using joblib
joblib.dump(preprocessor, 'scaler.pkl')

# Predict on the validation data
y_val_pred_probs = model.predict(X_test_scaled)
y_val_pred = (y_val_pred_probs > 0.5).astype(int)

# Calculate accuracy on the validation data
val_accuracy = accuracy_score(y_test, y_val_pred)

# Display the validation accuracy
print("Validation Accuracy:", val_accuracy)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Validation Accuracy: 0.49955


3.2 *Evaluate the model's performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score).*

In [26]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predict on the validation data
y_val_pred_probs = model.predict(X_test_scaled)
y_val_pred = (y_val_pred_probs > 0.5).astype(int)

# Calculate different evaluation metrics
accuracy = accuracy_score(y_test, y_val_pred)
precision = precision_score(y_test, y_val_pred)
recall = recall_score(y_test, y_val_pred)
f1 = f1_score(y_test, y_val_pred)

# Display the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_val_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

Accuracy: 0.49955
Precision: 0.49745399837981713
Recall: 0.8665457111178309
F1-Score: 0.6320626401499835

Confusion Matrix:
[[1394 8685]
 [1324 8597]]
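The metrics above can be reproduced by hand from the confusion matrix, which also helps explain why recall is high while accuracy sits near 0.5. The check below uses only the four counts printed above, read in scikit-learn's layout `[[TN, FP], [FN, TP]]`; `predicted_churn_share` is a name introduced here for illustration.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
conf_matrix = np.array([[1394, 8685],
                        [1324, 8597]])
tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of test customers that the model labels as churners
predicted_churn_share = (tp + fp) / total

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Predicted churn share:", predicted_churn_share)
```

The model predicts churn for roughly 86% of the test customers, so it catches most actual churners (high recall) while about half of its positive predictions are wrong (precision near 0.5).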


**4. Model Optimization:**
*Fine-tune the model parameters to improve its predictive performance*

In [27]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Adam
from sklearn.metrics import make_scorer, accuracy_score

# Define a function to create the neural network model
def create_model(learning_rate=0.001, activation='relu'):
    model = Sequential()
    model.add(Dense(128, activation=activation, input_dim=X_train_scaled.shape[1]))
    model.add(Dense(64, activation=activation))
    model.add(Dense(32, activation=activation))
    model.add(Dense(16, activation=activation))
    model.add(Dense(1, activation='sigmoid'))

    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Create a KerasClassifier for grid search
model = KerasClassifier(build_fn=create_model, verbose=0)

# Define hyperparameters and values to search
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'activation': ['relu', 'tanh']
}

# Initialize grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring=make_scorer(accuracy_score))

# Fit the grid search to the data
grid_result = grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and accuracy
best_params = grid_result.best_params_
best_accuracy = grid_result.best_score_

print("Best Parameters:", best_params)
print("Best Accuracy:", best_accuracy)

# Initialize the neural network model with the best parameters
best_model = create_model(learning_rate=best_params['learning_rate'], activation=best_params['activation'])

# Implement early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the best model on the training data
best_model.fit(X_train_scaled, y_train, epochs=20, batch_size=64, verbose=1, validation_data=(X_test_scaled, y_test), callbacks=[early_stopping])



Best Parameters: {'activation': 'relu', 'learning_rate': 0.1}
Best Accuracy: 0.5024749821880509
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20


<keras.callbacks.History at 0x7e66e2a5e470>

*Explore techniques like cross-validation and hyperparameter tuning*

In [28]:
from sklearn.model_selection import cross_val_score, KFold
from keras.models import Sequential
from keras.layers import Dense

# Define the neural network architecture
def create_model():
    model = Sequential()
    model.add(Dense(128, activation='relu', input_dim=X_train_scaled.shape[1]))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Create a KerasClassifier for cross-validation
model = KerasClassifier(build_fn=create_model, epochs=20, batch_size=64, verbose=0)

# Define cross-validation strategy (e.g., K-Fold)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='accuracy')

# Display cross-validation scores
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())


Cross-Validation Scores: [0.5053125 0.496     0.493625  0.49675   0.4976875]
Mean Accuracy: 0.49787499999999996
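To judge whether these fold scores differ meaningfully from guessing, it helps to look at their spread as well as their mean. The short check below uses the five scores printed above. The 0.5 reference is an assumption based on the test split, where 9,921 of 20,000 customers churned, so the target is close to balanced.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above
scores = np.array([0.5053125, 0.496, 0.493625, 0.49675, 0.4976875])

mean = scores.mean()
std = scores.std(ddof=1)  # sample standard deviation across the 5 folds
chance_level = 0.5        # assumed: accuracy of guessing on a near-balanced target

print(f"Mean accuracy: {mean:.4f} +/- {std:.4f}")
print("Within one std of chance:", abs(mean - chance_level) < std)
```

The mean lies within one sample standard deviation of 0.5, which is consistent with the model not having learned a usable signal from these features.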


**5. Model Deployment:**
*Once satisfied with the model's performance, deploy it into a production-like environment (you can simulate this in a development environment). Ensure the model can take new customer data as input and provide churn predictions.*

In [29]:
from flask import Flask, request, jsonify
import numpy as np
import pandas as pd
from keras.models import load_model
import joblib  # If you are using joblib to save the scaler

# Load the model and preprocessing steps
model = load_model('trained_model.h5')
scaler = joblib.load('scaler.pkl')

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    try:
        data = request.get_json()
        # 'features' is expected to map each column name used in step 2.2 to its value;
        # a one-row DataFrame keeps those names for the saved preprocessing step
        features = pd.DataFrame([data['features']])
        scaled_features = scaler.transform(features)
        # Cast the NumPy float32 to a Python float so jsonify can serialize it
        prediction = float(model.predict(scaled_features)[0][0])
        return jsonify({'prediction': prediction})
    except Exception as e:
        # Report failures with an error status instead of a 200 response
        return jsonify({'error': str(e)}), 400

if __name__ == '__main__':
    app.run(debug=True)


 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat
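Since the endpoint has to accept new customer data, one option is to validate the request body before it reaches the preprocessing step, so a missing field produces a clear message rather than a scaler error. The helper below is a sketch only. `parse_features` and `FEATURE_ORDER` are hypothetical names, and the listed fields are an assumed subset that would need to match whatever the saved preprocessing object was fitted on.

```python
# Hypothetical list of required fields; it must match the fitted preprocessing step
FEATURE_ORDER = ["Age", "Subscription_Length_Months", "Monthly_Bill", "Total_Usage_GB"]

def parse_features(payload):
    """Return the required feature values from a request body as a dict of floats."""
    if not isinstance(payload, dict) or "features" not in payload:
        raise ValueError("request body must be a JSON object with a 'features' key")
    features = payload["features"]
    missing = [name for name in FEATURE_ORDER if name not in features]
    if missing:
        raise ValueError(f"missing features: {missing}")
    # Build the result in FEATURE_ORDER so downstream code sees a fixed column order
    return {name: float(features[name]) for name in FEATURE_ORDER}

example = parse_features({
    "features": {
        "Age": 63,
        "Subscription_Length_Months": 17,
        "Monthly_Bill": 73.36,
        "Total_Usage_GB": 236,
    }
})
print(example)
```

The validated dict can then be turned into a one-row pandas DataFrame or a NumPy row, depending on what the saved preprocessing object expects.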
