### Step 1: Set Up the Environment

In this step, I will import the libraries needed for the analysis: `pandas` and `numpy` for data manipulation, `matplotlib` for visualization, and scikit-learn (`sklearn`) for preprocessing, clustering, model training, and evaluation.

In [51]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


### Step 2: Load the Dataset

In this step, I will load the dataset from the provided path into a pandas DataFrame. I will also print the first few rows to understand its structure and identify the features I need to work with.

In [52]:
data = pd.read_csv('/Users/darienprall/Documents/GitHub/School/Capella/CSC-4030_Introduction_to_Machine_Learning/Assessment_9/credit_card_customers.csv')
print(data.head())

   CLIENTNUM     Attrition_Flag  Customer_Age Gender  Dependent_count  \
0  768805383  Existing Customer            45      M                3   
1  818770008  Existing Customer            49      F                5   
2  713982108  Existing Customer            51      M                3   
3  769911858  Existing Customer            40      F                4   
4  709106358  Existing Customer            40      M                3   

  Education_Level Marital_Status Income_Category Card_Category  \
0     High School        Married     $60K - $80K          Blue   
1        Graduate         Single  Less than $40K          Blue   
2        Graduate        Married    $80K - $120K          Blue   
3     High School        Unknown  Less than $40K          Blue   
4      Uneducated        Married     $60K - $80K          Blue   

   Months_on_book  ...  Credit_Limit  Total_Revolving_Bal  Avg_Open_To_Buy  \
0              39  ...       12691.0                  777          11914.0   
1       
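Beyond `head()`, a quick check of column types and target balance can help plan the preprocessing that follows. The sketch below is illustrative only: `sample` is a hypothetical stand-in frame built from a few values visible in the output above, not the real CSV, and the variable names are my own.

```python
import pandas as pd

# Hypothetical stand-in frame; values are copied from the head() output above.
sample = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 5,
    "Customer_Age": [45, 49, 51, 40, 40],
    "Gender": ["M", "F", "M", "F", "M"],
    "Marital_Status": ["Married", "Single", "Married", "Unknown", "Married"],
})

# Split columns by dtype: non-numeric columns will need encoding,
# numeric columns will need scaling.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

# Count each target class; on the full dataset this shows the churn balance.
target_counts = sample["Attrition_Flag"].value_counts()

print(categorical_cols)
print(numerical_cols)
print(target_counts)
```

Running the same three checks on `data` would list every column that has to go into either the categorical or the numerical group in the next step.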

### Step 3: Data Preprocessing

In this step, I will preprocess the data by:
- Mapping `Attrition_Flag` to a binary `Churn` target and separating the features (`X`) from the target variable (`y`).
- Encoding categorical features such as `Gender` and `Card_Category` using `OneHotEncoder`.
- Scaling numerical features such as `Customer_Age` and `Credit_Limit` using `StandardScaler`.

I will use a `ColumnTransformer` to apply these transformations.


In [None]:
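# Encode the target: 1 = churned (attrited), 0 = still a customer.
data['Churn'] = data['Attrition_Flag'].map({
    'Attrited Customer': 1,
    'Existing Customer': 0
})

# Drop both the original label and its encoded copy so neither leaks into X.
X = data.drop(['Attrition_Flag', 'Churn'], axis=1)
y = data['Churn']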

categorical_features = [
    'Gender',
    'Education_Level',
    'Marital_Status',
    'Income_Category',
    'Card_Category'
]

numerical_features = [
    'Customer_Age',
    'Dependent_count',
    'Months_on_book',
    'Total_Relationship_Count',
    'Months_Inactive_12_mon',
    'Contacts_Count_12_mon',
    'Credit_Limit',
    'Total_Revolving_Bal',
    'Avg_Open_To_Buy',
    'Total_Amt_Chng_Q4_Q1',
    'Total_Trans_Amt',
    'Total_Trans_Ct',
    'Total_Ct_Chng_Q4_Q1',
    'Avg_Utilization_Ratio',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
]

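# Scale the numerical columns and one-hot encode the categorical ones.
# Columns in neither list (here, CLIENTNUM) pass through unchanged
# and are appended as the last column of the output.
preprocess = make_column_transformer(
    (StandardScaler(), numerical_features),
    (OneHotEncoder(), categorical_features),
    remainder='passthrough'
)
X_processed = preprocess.fit_transform(X)
print(X_processed[:5])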

[[-1.65405580e-01  5.03368127e-01  3.84620878e-01  7.63942609e-01
  -1.32713603e+00  4.92403766e-01  4.46621903e-01 -4.73422218e-01
   4.88970818e-01  2.62349444e+00 -9.59706574e-01 -9.73895182e-01
   3.83400260e+00 -7.75882235e-01 -4.37753814e-01  4.37763128e-01
   0.00000000e+00  1.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  7.68805383e+08]
 [ 3.33570383e-01  2.04319867e+00  1.01071482e+00  1.40730617e+00
  -1.32713603e+00 -4.11615984e-01 -4.13666521e-02 -3.66666822e-01
  -8.48598788e-03  3.56329284e+00 -9.16432607e-01 -1.35734038e+00
   1.26085729e+01 -6.16275655e-01 -4.37853975e-01  4.37845257e-01
   1.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   1.0000

### Step 4: Split the Dataset into Training and Test Sets

I will split the preprocessed dataset into training and testing sets. This will help me evaluate the model's performance on unseen data.

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.3, random_state=42)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

X_train shape: (7088, 40)
X_test shape: (3039, 40)
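If the churn classes turn out to be imbalanced, which is common in churn data, a plain random split can give the training and test sets slightly different churn rates. A minimal sketch of a stratified split, using synthetic data rather than the real dataset; `X_demo`, `y_demo`, and the 16% positive rate are assumptions made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 16% positive class (assumed rate).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 160 + [0] * 840)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Train positive rate: {y_tr.mean():.3f}")
print(f"Test positive rate:  {y_te.mean():.3f}")
```

Applied to this notebook, the only change would be passing `stratify=y` to the existing `train_test_split` call.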


### Step 5: Apply K-Means Clustering

In this step, I will apply the K-Means clustering algorithm to segment customers into distinct clusters. These clusters can reveal underlying patterns in customer behavior that may help improve the prediction accuracy of our churn model.

In [55]:
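# Fit the clusters on the training set only, so the test set stays unseen.
kmeans = KMeans(n_clusters=4, random_state=42).fit(X_train)

# Append the cluster label as one extra feature column;
# test rows are assigned to the nearest training centroid.
X_train_with_clusters = np.c_[X_train, kmeans.labels_]
X_test_with_clusters = np.c_[X_test, kmeans.predict(X_test)]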
print(f"X_train_with_clusters shape: {X_train_with_clusters.shape}")
print(f"X_test_with_clusters shape: {X_test_with_clusters.shape}")

X_train_with_clusters shape: (7088, 41)
X_test_with_clusters shape: (3039, 41)
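The choice of four clusters is not justified in this step. One common way to sanity-check it is to compare silhouette scores across several values of `k`. This is a sketch on synthetic blobs, not the credit card data; `X_demo`, the blob centers, and the range of `k` are all assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (hypothetical centers).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit K-Means for each candidate k and record the mean silhouette score
# (closer to 1 means tighter, better-separated clusters).
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    scores[k] = silhouette_score(X_demo, model.labels_)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

On the real data, the same loop could run over `X_train`. Because K-Means relies on Euclidean distance, any unscaled column with large values would dominate the result, so it is worth confirming that every feature is on a comparable scale first.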
