## Python For Data Analytics
### Final Project
#### Notebook by: rmabano

In [None]:
# Importing Basic essential Libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pickle

import warnings
warnings.filterwarnings('ignore')

### 1. Data Preparation

In [None]:
# Load the dataset
Credit_Card_data = pd.read_csv('CC GENERAL.csv')


### 2. Exploratory Data Analysis

In [None]:
# Display the first few rows of the dataset
Credit_Card_data.head()

In [None]:
# info() prints its summary directly and returns None, so no print() is needed
Credit_Card_data.info()

In [None]:
# Checking the summary statistics of the numerical columns
Credit_Card_data.describe()

In [None]:
# Visualizing the distribution of each feature or column
plt.figure(figsize=(15, 12))
for i, col in enumerate(Credit_Card_data.columns[1:]):  # Excluding CUST_ID for visualization
    plt.subplot(5, 4, i+1)
    sns.histplot(Credit_Card_data[col], kde=True)
    plt.title(col)
plt.tight_layout()  # applied once, after all subplots are drawn

In [None]:
# Checking for missing values
Credit_Card_data.isnull().sum()

In [None]:
#USING BOXPLOTS TO CHECK FOR OUTLIERS
# Plotting box plots for each numerical feature
plt.figure(figsize=(15, 10))
for i, col in enumerate(Credit_Card_data.columns[1:]):  # Excluding CUST_ID for visualization
    plt.subplot(5, 4, i+1)
    sns.boxplot(y=Credit_Card_data[col])
    plt.title(col)
plt.tight_layout()  # applied once, after all subplots are drawn


### 3. Data Preprocessing (Handling Missing Values, Outliers, and Encoding) 

#### Handling Missing Values

In [None]:
# Handling missing values by replacing them with the mean of each column
# numeric_only=True skips the string column CUST_ID, which has no mean
Credit_Card_data.fillna(Credit_Card_data.mean(numeric_only=True), inplace=True)

# Recheck missing values to ensure they are filled
Credit_Card_data.isnull().sum()

#### Handling Outliers

Dealing with outliers is a crucial step in preparing the data for analysis and modeling. There are several strategies for handling outliers. Given the nature of this dataset (financial transactions), it might be more appropriate to use methods that retain the outliers but reduce their impact, rather than simply removing them. Two common approaches are:

* Log Transformation: This is effective for right-skewed distributions. It can't be applied directly to zero or negative values, so we need to adjust for that.
* Capping: Outliers are capped at a certain percentile. For instance, values above the 95th percentile can be set to the 95th percentile value.
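
The notebook proceeds with the log transformation, but the capping approach can be sketched as follows. This is a minimal illustration on a made-up column, not part of the original analysis; the helper name `cap_at_percentile` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def cap_at_percentile(series: pd.Series, upper: float = 0.95) -> pd.Series:
    """Set every value above the given quantile to that quantile's value."""
    threshold = series.quantile(upper)
    return series.clip(upper=threshold)


# Hypothetical toy column: the values 1..99 plus one extreme outlier
toy_values = pd.Series(np.append(np.arange(1.0, 100.0), 5000.0))

threshold = toy_values.quantile(0.95)
capped = cap_at_percentile(toy_values, upper=0.95)

print("95th percentile:", threshold)
print("max before:", toy_values.max(), "max after:", capped.max())
```

Unlike the log transformation, capping leaves values below the threshold untouched, so the bulk of the distribution keeps its original units.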

In [None]:
# Apply a log transformation to each numerical feature to reduce the impact of outliers.
# Strictly positive columns use log(x); columns containing zeros use log(x + 1),
# because log(0) is undefined. The distributions are re-plotted afterwards to see the effect.
for col in Credit_Card_data.columns[1:]:  # Excluding CUST_ID for transformation
    if Credit_Card_data[col].min() > 0:  # No zero or negative values in the column
        Credit_Card_data[col] = np.log(Credit_Card_data[col])
    else:
        Credit_Card_data[col] = np.log1p(Credit_Card_data[col])  # log(x + 1), safe at zero

# Visualizing the distributions post-transformation
plt.figure(figsize=(15, 12))
for i, col in enumerate(Credit_Card_data.columns[1:]):  # Excluding CUST_ID for visualization
    plt.subplot(5, 4, i+1)
    sns.histplot(Credit_Card_data[col], kde=True)
    plt.title(col)
plt.tight_layout()  # applied once, after all subplots are drawn


In [None]:
#USING BOXPLOTS TO CHECK FOR OUTLIERS
# Plotting box plots for each numerical feature
plt.figure(figsize=(15, 10))
for i, col in enumerate(Credit_Card_data.columns[1:]):  # Excluding CUST_ID for visualization
    plt.subplot(5, 4, i+1)
    sns.boxplot(y=Credit_Card_data[col])
    plt.title(col)
plt.tight_layout()  # applied once, after all subplots are drawn


**Post-Log Transformation Distributions**

After applying the log transformation to the dataset (excluding CUST_ID), we have the following observations:

* Improved Distributions: The log transformation has substantially reduced the skewness of most features. The distributions now appear closer to normal, which is beneficial for many statistical models and machine learning algorithms.

* Reduced Impact of Outliers: The transformation has compressed the extreme values, so they dominate the feature ranges far less than before.

**Implications for Modeling**

* For this dataset, log transformation was a suitable initial approach due to the nature of the data and the goal of retaining as much information as possible. Whether to also apply capping depends on further analysis and the specific requirements of the subsequent modeling steps.

* Enhanced Model Performance: Many algorithms perform better when the data does not have extreme values or heavy skewness. The log transformation thus potentially enhances model performance.

* Feature Interpretation: Post-transformation, the features now represent the logarithm of their original values. This should be kept in mind while interpreting the results.
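
To make the interpretation point concrete, here is a small sketch of how a log-transformed value maps back to its original scale. The array values are made up for illustration. Note that it only covers the `log(x + 1)` branch; columns transformed with plain `log(x)` would be inverted with `np.exp` instead.

```python
import numpy as np

# Hypothetical balances in original currency units
original = np.array([0.0, 50.0, 1200.0, 19000.0])

logged = np.log1p(original)    # equivalent to np.log(original + 1)
recovered = np.expm1(logged)   # equivalent to np.exp(logged) - 1

print("log scale:     ", np.round(logged, 3))
print("original scale:", np.round(recovered, 3))
```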

#### Label Encoding (If needed)

In [None]:
# Checking for categorical variables so that they are encoded
categorical_cols = Credit_Card_data.select_dtypes(include=['object']).columns
categorical_cols


The dataset contains only one categorical column, CUST_ID, which is an identifier and not a feature for modeling. Hence, no label encoding is required for this dataset.

#### Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler


# Scaling the data excluding CUST_ID
scaler = StandardScaler()
CC_data_scaled = Credit_Card_data.copy()

#SCALING COLUMNS EXCEPT FOR THE CUST_ID DATA
CC_data_scaled[Credit_Card_data.columns[1:]] = scaler.fit_transform(Credit_Card_data[Credit_Card_data.columns[1:]])

# Saving the scaler for later use
with open('standard_scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)


**Data Scaling**

- The numerical features (excluding CUST_ID) have been scaled using the Standard Scaler.
- This scaling is essential for algorithms that are sensitive to the scale of the data.

**Standard Scaler Serialization**

- The fitted Standard Scaler has been saved as a pickle file standard_scaler.pkl.

- This scaler can be retrieved during model deployment to ensure that new data is scaled consistently with the training data.
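
A minimal sketch of that retrieval step is shown below. It uses toy data and an in-memory pickle as a stand-in for `standard_scaler.pkl`, so the arrays and variable names are assumptions rather than the notebook's actual deployment code.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler().fit(train)

# Serialize and restore (stand-in for writing and reading the .pkl file)
payload = pickle.dumps(scaler)
restored = pickle.loads(payload)

# New rows are scaled with the mean and standard deviation learned on the
# training data; the scaler is NOT refitted on the new rows
new_rows = np.array([[2.0, 20.0], [4.0, 40.0]])
scaled_new = restored.transform(new_rows)
print(scaled_new)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.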

### 4. Unsupervised Model Creation & Evaluation 

In [None]:
from sklearn.cluster import KMeans

# Finding the optimum k using the elbow method
inertia = []
K = range(1, 11)  # Testing 1 to 10 clusters
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducible inertia values
    kmeanModel.fit(CC_data_scaled.drop('CUST_ID', axis=1))
    inertia.append(kmeanModel.inertia_)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal k')

# Annotating k = 2, the "elbow" point read off the curve by eye
plt.annotate('Optimal K', xy=(2, inertia[1]), xytext=(3, inertia[1] + 2000),
             arrowprops=dict(facecolor='black', arrowstyle='->'),)

plt.show()


In [None]:
from sklearn.metrics import silhouette_score

def calculate_silhouette_scores(data):
    """
    Compute the silhouette score for KMeans with k = 2 to 11 clusters.
    This metric is used to validate the choice of k made with the elbow method above.
    """
    features = data.drop('CUST_ID', axis=1)
    silhouette_scores = []

    for k in range(2, 12):  # Testing clusters from 2 to 11
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
        cluster_labels = kmeans.fit_predict(features)
        silhouette_avg = silhouette_score(features, cluster_labels)
        silhouette_scores.append(silhouette_avg)
        print(f"For k = {k}, Silhouette Score = {silhouette_avg}")

    return silhouette_scores

silhouette_scores = calculate_silhouette_scores(CC_data_scaled)


Of the k values tested, k = 2 gives the highest silhouette score, so we proceed with two clusters.
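
For each point, the silhouette coefficient compares the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest other cluster (b), as (b - a) / max(a, b); the score reported above is the average over all points. The sketch below shows the same selection logic on synthetic, well-separated blobs. The data and the expected winner (k = 3) belong to this toy example only and say nothing about the credit card data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight, well-separated blobs
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```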

In [None]:
# Proceeding with KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=42)
CC_data_scaled['Cluster'] = kmeans.fit_predict(CC_data_scaled.drop('CUST_ID', axis=1))

# Saving the labeled dataset
output_file = 'rmabano-cc-labelled.csv'
CC_data_scaled.to_csv(output_file, index=False)

print('You have successfully generated:', output_file)  # Confirms that the labelled dataset was written


# labelled_dataset.head(5)

In [None]:
"""
Now we visualize the clusters we have created with the help of PCA
"""

from sklearn.decomposition import PCA

# Loading the labeled dataset
labelled_dataset = pd.read_csv('rmabano-cc-labelled.csv')

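# Safeguard against missing values; numeric_only=True skips the string column CUST_ID
labelled_dataset.fillna(labelled_dataset.mean(numeric_only=True), inplace=True)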

# Applying PCA to reduce the data to two dimensions for visualization
pca = PCA(n_components=2)
labeled_data_pca = pca.fit_transform(labelled_dataset.drop(['CUST_ID', 'Cluster'], axis=1))

# Creating a DataFrame for the PCA results
pca_df = pd.DataFrame(data=labeled_data_pca, columns=['PCA1', 'PCA2'])
pca_df['Cluster'] = labelled_dataset['Cluster']




# Adjusting the scatter plot as per the new requirements
plt.figure(figsize=(10, 6))

# Cluster 0: Marked with 'x' and in red color
plt.scatter(pca_df[pca_df['Cluster'] == 0]['PCA1'], pca_df[pca_df['Cluster'] == 0]['PCA2'], 
            label='Cluster 0', alpha=0.5, marker='x', color='red')

# Cluster 1: Marked with circles and in blue color
plt.scatter(pca_df[pca_df['Cluster'] == 1]['PCA1'], pca_df[pca_df['Cluster'] == 1]['PCA2'], 
            label='Cluster 1', alpha=0.5, marker='o', color='blue')

plt.title('2D PCA of Credit Card Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()



The 2D PCA scatter plot above visualizes the two clusters (Cluster 0 and Cluster 1) in the credit card dataset. Each point represents a customer, and its marker and color indicate the cluster to which the customer belongs.
This visualization shows how well separated the clusters are and gives a visual representation of the customer segmentation:

- Cluster 0 is represented by red 'x' markers.
- Cluster 1 is represented by blue circle markers.

Next, we evaluate the clustering and then analyze the cluster centroids to understand the characteristics of each cluster. This involves examining the average value of each feature within a cluster (on the log-transformed, scaled data), which can provide insights into what defines each segment.

In [None]:
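"""
Refitting the KMeans model for evaluation
"""
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42, max_iter=500)

# Drop the 'Cluster' column added earlier so the labels are not used as a feature
clust_lab = kmeans.fit_predict(CC_data_scaled.drop(columns=["CUST_ID", "Cluster"]))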

In [None]:
"""
EVALUATION METRIC 1: Silhouette Score
"""

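from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Silhouette score on the 2D PCA projection; the 'Cluster' column is excluded
# so that the labels are not treated as a feature
k_sil_score = silhouette_score(pca_df[['PCA1', 'PCA2']], kmeans.labels_)
print(f"Silhouette score : {k_sil_score}")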


In [None]:
from sklearn.metrics import silhouette_score, calinski_harabasz_score

"""
EVALUATION METRIC 2: Calinski Harabasz Score
"""

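# Use only the PCA components; the 'Cluster' column must not be treated as a feature
Harabasz_score = calinski_harabasz_score(pca_df[['PCA1', 'PCA2']], kmeans.labels_)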
print(f"Calinski and Harabasz score : {Harabasz_score}")


In [None]:
# Calculating the centroids of each cluster
cluster_centroids = labelled_dataset.groupby('Cluster').mean(numeric_only=True)  # numeric_only skips CUST_ID

# Displaying the centroids
cluster_centroids

The table above shows the centroids of each cluster, that is, the average value of each (log-transformed, scaled) feature within the cluster. These centroids help us understand the defining characteristics of each customer segment:

**Cluster 0:**

- This cluster tends to have lower balances and less frequent balance updates.
- Customers in this cluster make more purchases, both one-off and installment, compared to Cluster 1.
- They use cash advances less frequently and have fewer cash advance transactions.
- They have a higher frequency of purchases and are more likely to make purchases in installments.
- These customers generally have a higher full payment rate and slightly longer tenure.


**Cluster 1:**

- Customers in this cluster have higher balances and more frequent balance updates.
- They make fewer purchases, both one-off and installment.
- This cluster uses cash advances more frequently and has more cash advance transactions.
- They have a lower frequency of purchases and are less likely to make purchases in installments.
- These customers have a lower full payment rate and slightly shorter tenure.

These insights can be used to tailor marketing strategies, product offerings, and services to each customer segment. For example, customers in Cluster 0 might be more responsive to promotions related to installment purchases, while those in Cluster 1 might be targeted with offers related to cash advances or products for customers with higher balances.
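
Because the centroid table is expressed in scaled, log-transformed units, it can help to map the centroids back to the original scale before presenting them to a business audience. The sketch below does this on a made-up two-column frame; the column values, the cluster labels, and the use of `log1p` for every column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data in original units
raw = pd.DataFrame({
    "BALANCE": [0.0, 100.0, 2500.0, 8000.0],
    "PURCHASES": [0.0, 40.0, 900.0, 3000.0],
})

logged = np.log1p(raw)                       # log(x + 1) transform
scaler = StandardScaler().fit(logged)
scaled = pd.DataFrame(scaler.transform(logged), columns=raw.columns)
scaled["Cluster"] = [0, 0, 1, 1]             # made-up cluster labels

# Centroids in scaled log units, as in the table above
centroids_scaled = scaled.groupby("Cluster").mean()

# Undo the scaling, then undo the log transform
centroids_log = scaler.inverse_transform(centroids_scaled)
centroids_original = pd.DataFrame(
    np.expm1(centroids_log),
    index=centroids_scaled.index,
    columns=raw.columns,
)
print(centroids_original)
```

The back-transformed values are geometric-style averages (the mean is taken on the log scale), so they will generally be smaller than the arithmetic mean of the raw values in each cluster.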

### 5. Supervised Model Creation & Evaluation 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

"""
Building model with all 17 features
"""

import pickle

# Separating the features and the target variable
X = labelled_dataset.drop(['CUST_ID', 'Cluster'], axis=1)
y = labelled_dataset['Cluster']

# Splitting the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Random Forest Classifier
First_rf_model = RandomForestClassifier(max_depth=5, random_state=42)

# Perform cross-validation
init_cv_scores = cross_val_score(First_rf_model, X_train, y_train, cv=10)

# Train the model on the entire training set
First_rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred = First_rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Calculate additional metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
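# AUC-ROC is computed from the predicted probability of class 1 rather than hard labels
auc_roc = roc_auc_score(y_test, First_rf_model.predict_proba(X_test)[:, 1])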

# Store all metrics in a dictionary
initial_model_metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1,
    'AUC-ROC': auc_roc
}

## Saving the model
model_save_path = 'random_forest_classifier.pkl'
with open(model_save_path, 'wb') as file:
    pickle.dump(First_rf_model, file)
    
print("Model Performance Metrics:")
for metric, value in initial_model_metrics.items():
    print(f"{metric}: {value}")

print(f"Cross-validation scores: {init_cv_scores}")
print(f"Accuracy on test set: {accuracy}")
print("Classification Report:")
print(class_report)


Learning Curve

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Function to plot the learning curves
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    # Mean and Standard Deviation of training scores
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    
    # Mean and Standard Deviation of cross-validation scores
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

# Plotting the learning curve for the Random Forest Classifier
plot_learning_curve(First_rf_model, "Learning Curves (Random Forest)", X_train, y_train, cv=5, n_jobs=4)
plt.show()


**Insight From Curve:**

1. **High Training Score:**
   - The training score starts and remains high as more data points are added. This indicates that the model is able to fit the training data very well.

2. **Converging Scores:**
   - The cross-validation score increases with the number of training examples and is converging towards the training score. This suggests that the model is generalizing well and is not overfitting. The gap between the training and cross-validation scores is small, which is a good sign.

3. **Plateauing of Scores:**
   - Both scores plateau towards the right end of the graph, which suggests that adding more training data might not lead to significant improvements in the model's performance. The model seems to have learned as much as it can about the data and reached its performance limit given the current feature set and model complexity.

4. **No High Bias or High Variance:**
   - There's no evidence of high bias (underfitting) as both the training and validation scores are high.
   - There's no evidence of high variance (overfitting) as the training and validation scores are close to each other.

In conclusion, the learning curves indicate that the Random Forest model is performing well on this dataset. There doesn't appear to be a problem with overfitting or underfitting. The model seems to have reached a good balance between bias and variance, providing good generalization performance.


#### Reasons for choosing Random Forest Classifier Model

When selecting a machine learning algorithm for a particular task, several factors are considered, including the nature of the data, the complexity of the problem, the interpretability of the model, and computational efficiency. In the case of our credit card dataset, **I opted for the Random Forest Classifier** for the following reasons:

1. **Handling Non-Linear Relationships:**
   - Random Forest is an ensemble learning method that is particularly effective in handling non-linear relationships in data. Credit card data often involve complex interactions between variables that linear models like logistic regression might not capture effectively.

2. **Robustness to Overfitting:**
   - While models like decision trees are prone to overfitting, Random Forest mitigates this by averaging multiple decision trees, each trained on different subsets of the data. This makes it more robust to overfitting, especially when dealing with high-dimensional data.

3. **Feature Importance:**
   - Random Forest provides useful insights into feature importance, helping us understand which features are most influential in predicting customer segments. This is valuable for interpretability and can guide business decisions.

4. **Versatility and Performance:**
   - Random Forest often performs well in a wide range of classification tasks and is less sensitive to hyperparameter tuning, making it a good initial choice for a baseline model.

5. **Comparison with SVM:**
   - Support Vector Machines (SVM) are powerful for classification problems, especially in high-dimensional spaces. However, SVM models can be computationally intensive, especially with large datasets, and require careful tuning of hyperparameters. In contrast, Random Forests are generally more scalable and easier to tune.

6. **Interpretability vs. Accuracy:**
   - Logistic regression is highly interpretable but might not provide the level of accuracy required for complex classification tasks, especially in the presence of non-linear relationships. Random Forest strikes a balance between interpretability and accuracy.


### 6.  Feature Selection and Engineering

In [None]:
"""
using builtin function to find optimum features
"""


# Setup the RandomForestClassifier
random_forest=RandomForestClassifier(n_estimators=500,random_state=1)
random_forest.fit(X_train,y_train)

# Use the column names of X as the feature labels
labels=X.columns

# Select the feature importances
feature_importances=random_forest.feature_importances_
feature_indices=np.argsort(feature_importances)[::-1]

# Select the features whose importance is greater than the mean importance
mean_importance = feature_importances.mean()

#Create a list to hold the labels of the features that qualify
optimal_features = []

for feature in range(X_train.shape[1]):
    if feature_importances[feature_indices[feature]] > mean_importance:
        optimal_features.append(labels[feature_indices[feature]])
        print("{:2d} {:25s} {:.3f}".format(feature+1, labels[feature_indices[feature]], 
                                       feature_importances[feature_indices[feature]]))

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

"""
calculating mean importance to remain with the most important features only
"""


def extract_feature_importances(classifier, feature_data):
    # Extract feature importances
    feature_importances = classifier.feature_importances_

    # Create a DataFrame of features and their importance scores
    features_df = pd.DataFrame({
        'Feature': feature_data.columns,
        'Importance': feature_importances
    })

    # Sort the DataFrame by importance
    features_df_sorted = features_df.sort_values(by='Importance', ascending=True)

    return features_df_sorted

def plot_optimal_features(features_df_sorted):
    # Selecting only the top features (adjust the number as needed)
    top_features = features_df_sorted.tail(5)  # Plots top 5 features

    # Plotting only the top features
    plt.figure(figsize=(10, 8))
    sns.barplot(x='Importance', y='Feature', data=top_features)
    plt.title('Top Feature Importances')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()

# Calculating feature importances for the fitted random_forest on the training features
feature_importance_result = extract_feature_importances(random_forest, X_train)

# Plotting only the top features based on importance scores
plot_optimal_features(feature_importance_result)


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
"""
Building model with selected features
"""


# Select the data for these features
X_new = X[optimal_features]

# Split the dataset into training and testing sets (80% train, 20% test)
X_new_train, X_new_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)

# Initialize the classifier
classifier = RandomForestClassifier(random_state=42)

# Train the model on the training set
classifier.fit(X_new_train, y_train)

# Predict on the test set
y_pred_new = classifier.predict(X_new_test)

# Evaluate the model using various metrics
new_accuracy = accuracy_score(y_test, y_pred_new)
new_precision = precision_score(y_test, y_pred_new)
new_recall = recall_score(y_test, y_pred_new)
new_f1 = f1_score(y_test, y_pred_new)
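# AUC-ROC is computed from the predicted probability of class 1 rather than hard labels
new_auc_roc = roc_auc_score(y_test, classifier.predict_proba(X_new_test)[:, 1])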
new_class_report = classification_report(y_test, y_pred_new)

# Store all metrics in a dictionary
new_model_metrics = {
    'Accuracy': new_accuracy,
    'Precision': new_precision,
    'Recall': new_recall,
    'F1 Score': new_f1,
    'AUC-ROC': new_auc_roc
}

# Print the metrics
print("New Model Metrics with Selected Features:")
for metric, value in new_model_metrics.items():
    print(f"{metric}: {value}")

# Print the classification report
print("\nClassification Report:")
print(new_class_report)

# Save the model
model_filename = 'finalized_model.sav'
with open(model_filename, 'wb') as file:
    pickle.dump(classifier, file)


cv_scores = cross_val_score(classifier, X_new_train, y_train, cv=10)
print("\nCross-validation scores: ", cv_scores)


Comparing Performance of Initial Model VS New Model with selected Features

In [None]:
# Create DataFrames for each model's metrics
df_initial = pd.DataFrame(list(initial_model_metrics.items()), columns=['Metric', 'Score'])
df_initial['Model'] = 'Initial Model'

df_new = pd.DataFrame(list(new_model_metrics.items()), columns=['Metric', 'Score'])
df_new['Model'] = 'New Model'

# Concatenate the DataFrames
metrics_df = pd.concat([df_initial, df_new])

# Plotting the model metrics
plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Score', hue='Model', data=metrics_df)
plt.title('Comparison of Model Metrics')
plt.xticks(rotation=45)

plt.show()



In [None]:
from sklearn.model_selection import validation_curve
import matplotlib.pyplot as plt
import numpy as np

# X and y come from the earlier cells; this hard-coded list replaces the optimal_features computed above
optimal_features = ['PURCHASES_FREQUENCY', 'PURCHASES_TRX', 'PURCHASES', 
                    'CASH_ADVANCE', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX']

X_new = X[optimal_features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_range = np.arange(1, 250, 25)  # Range of n_estimators to evaluate



# Validation curve for the initial model
train_scores, test_scores = validation_curve(
    RandomForestClassifier(max_depth=5, random_state=42),
    X_train, y_train, param_name="n_estimators", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=-1
)

# Calculate mean and standard deviation for train set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Calculate mean and standard deviation for test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Validation Curve with Random Forest (Initial Model)")
plt.plot(param_range, train_mean, label="Training score", color="blue")
plt.plot(param_range, test_mean, label="Cross-validation score", color="red")
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color="gray")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color="gainsboro")
plt.xlabel("Number of Trees")
plt.ylabel("Accuracy")
plt.legend(loc="best")

X_train_OPT, X_test_OPT, y_train_OPT, y_test_OPT = train_test_split(X_new, y, test_size=0.3, random_state=42)

# Validation curve for the model with optimal features
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X_train_OPT, y_train_OPT, param_name="n_estimators", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=-1
)

# Calculate mean and standard deviation for train set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Calculate mean and standard deviation for test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.subplot(1, 2, 2)
plt.title("Validation Curve with Random Forest (Optimal Features)")
plt.plot(param_range, train_mean, label="Training score", color="blue")
plt.plot(param_range, test_mean, label="Cross-validation score", color="red")
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color="gray")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color="gainsboro")
plt.xlabel("Number of Trees")
plt.ylabel("Accuracy")
plt.legend(loc="best")

plt.tight_layout()
plt.show()


The graphs generated show validation curves for two Random Forest models: one using the initial set of features and another using a subset of optimal features. The validation curves plot the training and cross-validation accuracy scores as a function of the number of trees in the Random Forest (`n_estimators`).


### Validation Curve with Random Forest (Initial Model)

- **Training Score (Blue Line)**: The accuracy of the Random Forest model on the training data. It starts high and continues to increase slightly as the number of trees grows, which is typical: averaging over more trees makes the ensemble's predictions more stable.
- **Cross-validation Score (Red Line)**: The accuracy of the model on the validation data. It initially increases with the number of trees, indicating that adding more trees is helping the model generalize better. However, it plateaus after a certain point, suggesting that adding more trees beyond this point does not significantly improve model performance on unseen data.
- **Shaded Area (Gray for Training, Light Gray for Cross-validation)**: Represents the variability (one standard deviation) of the accuracy scores across the CV folds. A large shaded area indicates more variability, meaning the model's performance is more sensitive to the particular folds of the data used in cross-validation.

### Validation Curve with Random Forest (Optimal Features)

- **Training Score (Blue Line)**: Similar to the initial model, the training score is high, indicating good performance on the training set. However, this score starts closer to the maximum accuracy of 1.0 and remains stable, which might suggest that the model with optimal features is able to capture the patterns in the data with fewer trees.
- **Cross-validation Score (Red Line)**: The cross-validation score is also high and stable, showing that the model with optimal features generalizes very well to unseen data. The fact that the score is consistently high across the range of `n_estimators` values suggests robustness and low overfitting.
- **Shaded Area (Gray for Training, Light Gray for Cross-validation)**: The variability is minimal for both the training and validation scores, which is a sign of a stable model. The model's predictions are consistent across different subsets of the data.

### Comparative Analysis

When comparing the two graphs, it's evident that the model trained with optimal features performs better in terms of both training and validation accuracy, and it does so with less variance. The consistent high accuracy across the CV folds for the optimal features model indicates that it is a more reliable model when generalized to unseen data.

In terms of the number of trees, both models show that after a certain point, increasing the number of trees does not lead to significant improvements in accuracy. This could imply that beyond this point, the benefits of adding more trees are marginal and may not justify the additional computational cost and complexity.

In summary, the optimal features model not only achieves higher accuracy but also exhibits greater stability and reliability, making it the preferred model based on these validation curves.

Using the single-split metrics alone, it is hard to tell convincingly which model does better, since the scores all look similar. We therefore rely on the cross-validated scores for each model to see which one performs better.
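
One way to make such a cross-validated comparison is to score both feature sets on the same folds and compare the mean and spread of the scores. The sketch below uses synthetic data from `make_classification`; the feature subset, the number of trees, and the fold setup are assumptions for illustration, not the notebook's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical binary classification data with 10 features
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=42
)

# Reusing one splitter means both candidates are scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "all features": X_toy,
    "feature subset": X_toy[:, :4],  # arbitrary subset, for illustration only
}

summary = {}
for name, features in candidates.items():
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    scores = cross_val_score(model, features, y_toy, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the difference in mean accuracy is smaller than the spread across folds, the two models are hard to separate on accuracy alone, and the simpler feature set may be preferable.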

### 7. Hyper Parameter Tuning

In [None]:
"""
The benchmark model is the one with selected features, 
and therefore we now find the right parameters to tune in this cell.
"""

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


# Select the data for these features
X_new = X[optimal_features]
y_new = y  

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [50,100,200],
    'max_depth': [10,15,30],
    'min_samples_split': range(2,10),
    'min_samples_leaf': range(2,20)
}

# Initialize the classifier
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Perform the search
grid_search.fit(X_train, y_train)

# Best estimator
best_rf = grid_search.best_estimator_

# Save the model
tuned_model_filename = 'tuned_random_forest_model.sav'
with open(tuned_model_filename, 'wb') as file:
    pickle.dump(best_rf, file)

# Print out the best hyperparameters
print("Best hyperparameters found: ")
print(grid_search.best_params_)


Justification for Selected Hyperparameters:

- **n_estimators:** More trees will generally lead to better performance but also to longer training times. A grid of increasing sizes allows us to find a sweet spot.

- **max_depth:** Controls the depth of each tree. Deeper trees can model more complex patterns but may lead to overfitting. Limiting tree depth can create a simpler model that may generalize better.

- **min_samples_split and min_samples_leaf:** These parameters control the minimum number of samples required to split a node and to be at a leaf node, respectively, and can help prevent a tree from growing too deep.
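
Because an exhaustive grid search fits every combination of these values, it can help to count the candidates before starting the search. The sketch below copies the grid values from the tuning cell and counts them with scikit-learn's `ParameterGrid`; the names `demo_grid`, `n_candidates`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid in the tuning cell, written out as lists
demo_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 15, 30],
    'min_samples_split': list(range(2, 10)),   # 8 values
    'min_samples_leaf': list(range(2, 20)),    # 18 values
}

# Number of distinct hyperparameter combinations: 3 * 3 * 8 * 18
n_candidates = len(ParameterGrid(demo_grid))

# With 5-fold cross-validation, each candidate is fitted 5 times
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

This grid has 1296 candidates, so 5-fold cross-validation needs 6480 fits. If that is too slow, a coarser grid or `RandomizedSearchCV` may be worth considering.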

In [None]:
#constructing random forest model with tuned hyper parameters


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, classification_report)


# Initialize the classifier with the tuned hyperparameters
Tuned_classifier = RandomForestClassifier(n_estimators=grid_search.best_params_['n_estimators'],
                                    max_depth=grid_search.best_params_['max_depth'],
                                    min_samples_split=grid_search.best_params_['min_samples_split'],
                                    min_samples_leaf=grid_search.best_params_['min_samples_leaf'],
                                    random_state=42)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)

# Evaluate the model using cross-validation
Tuned_cv_scores = cross_val_score(Tuned_classifier, X_train, y_train, cv=10)

# Train the model on the entire training set
Tuned_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = Tuned_classifier.predict(X_test)

# Calculate the performance metrics
tuned_model_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1 Score': f1_score(y_test, y_pred),
    'AUC-ROC': roc_auc_score(y_test, y_pred)
}

# Save the tuned model's cross-validation scores for later use
tuned_model_cv_scores = list(Tuned_cv_scores)

# Save the tuned model's metrics dictionary for later use
tuned_model_metrics_dict = tuned_model_metrics

# Generate the classification report
class_report = classification_report(y_test, y_pred)

print("Classification Report:")
print(class_report)

# Save the classification report to a text file (optional)
with open('classification_report.txt', 'w') as f:
    f.write("Classification Report:\n")
    f.write(class_report)

# Save the model to disk
model_filename = 'tuned_random_forest_model.sav'
with open(model_filename, 'wb') as file:
    pickle.dump(Tuned_classifier, file)

# Output the cross-validation scores and metrics
print(f"Cross-validation scores: {tuned_model_cv_scores}")
print(f"Tuned Model Metrics: {tuned_model_metrics_dict}")
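
Note that `roc_auc_score` can take either hard class labels or probability scores. With hard labels, the ROC curve has a single operating point; probability scores use the full ranking of the predictions. The sketch below shows both calls side by side on synthetic data. It is an illustration under assumed names (`X_demo`, `clf`, and so on), not a re-evaluation of the tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption): not the credit card dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# AUC computed from hard 0/1 predictions (a single threshold)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC computed from the predicted probability of the positive class
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```

When comparing two models on AUC-ROC, the same kind of input should be used for both so that the scores are comparable.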


In [None]:
grid_search.best_estimator_

In [None]:
# Convert metrics to a DataFrame for visualization
metrics_df = pd.DataFrame([new_model_metrics, tuned_model_metrics_dict], index=['Benchmark Model', 'Tuned Model']).T
metrics_df = metrics_df.reset_index().melt(id_vars='index').rename(columns={'index': 'Metric', 'value': 'Score'})

# Plotting the model metrics comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Score', hue='variable', data=metrics_df)
plt.title('Comparison of Model Metrics')
plt.xlabel('Metric')
plt.ylabel('Score')
plt.legend(title='Model')
plt.show()


In [None]:
cv_scores

In [2]:
#saving the final benchmark model

import pickle

# Reload the tuned model saved earlier ('tuned_random_forest_model.sav'), so this
# cell also works after a kernel restart, when 'grid_search' is no longer in memory
with open('tuned_random_forest_model.sav', 'rb') as file:
    Tuned_classifier = pickle.load(file)

# Save the classifier as the final benchmark model
final_benchmark_model_path = 'final_benchmark_model.pkl'
with open(final_benchmark_model_path, 'wb') as file:
    pickle.dump(Tuned_classifier, file)

### 8. Model Deployment

In [None]:
#%pip install flask

In [None]:
#%pip install streamlit

In [1]:
from flask import Flask, request, render_template
import pickle
import numpy as np

app = Flask(__name__)

# Load the saved benchmark model (no scaler is loaded here)
with open('final_benchmark_model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

# Define the home route
@app.route('/')
def home():
    # Render the home page with form
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Fetch the form inputs and convert them to numbers
        purchase_frequency = float(request.form['PURCHASES_FREQUENCY'])
        purchases_trx = float(request.form['PURCHASES_TRX'])
        purchases = float(request.form['PURCHASES'])
        cash_advance = float(request.form['CASH_ADVANCE'])
        cash_advance_frequency = float(request.form['CASH_ADVANCE_FREQUENCY'])
        cash_advance_trx = float(request.form['CASH_ADVANCE_TRX'])

        # The input order must match the feature order used to train the model
        inputs = [purchase_frequency, purchases_trx, purchases,
                  cash_advance, cash_advance_frequency, cash_advance_trx]

        # Reshape to a single row: one sample with six features
        inputs_values = np.array(inputs).reshape(1, -1)

        result = model.predict(inputs_values)

        # Generate the results that will be displayed to the user
        if int(result[0]) == 0:
            predicted_class = 'Premium Client'
            color = 'Aquamarine'
        else:
            predicted_class = 'Regular Client'
            color = 'HoneyDew'

    except ValueError as e:
        # Without this early return, predicted_class and color would be undefined below
        print(f"Caught a ValueError: {e}")
        return f"Invalid input: {e}", 400

    # Render the prediction result page
    return render_template('predict.html', prediction=predicted_class, color_signal=color)

if __name__=='__main__':
    app.run(host='localhost', port=1887, debug=True, use_reloader=False)

 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://localhost:1887
Press CTRL+C to quit
127.0.0.1 - - [17/Dec/2023 09:56:19] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [17/Dec/2023 09:57:16] "POST /predict HTTP/1.1" 200 -
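
The form handling in the `/predict` route can also be factored into a small helper that is testable without starting the server. The sketch below is one possible approach rather than part of the deployed app: `FEATURE_ORDER` and `parse_features` are hypothetical names, and the field order is assumed to match the order used in the route above.

```python
# Field names in the order assumed to match the model's training features
FEATURE_ORDER = [
    'PURCHASES_FREQUENCY', 'PURCHASES_TRX', 'PURCHASES',
    'CASH_ADVANCE', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX',
]


def parse_features(form, feature_order=FEATURE_ORDER):
    """Convert a dict-like form into a list of floats in a fixed order.

    Raises ValueError naming every field that is missing or not numeric.
    """
    values, problems = [], []
    for name in feature_order:
        raw = form.get(name)  # None when the field is missing
        try:
            values.append(float(raw))
        except (TypeError, ValueError):
            problems.append(name)
    if problems:
        raise ValueError("Missing or non-numeric fields: " + ", ".join(problems))
    return values


# Usage with a plain dict standing in for request.form
example_form = {name: '1.5' for name in FEATURE_ORDER}
print(parse_features(example_form))
```

Collecting all problem fields before raising lets the error message tell the user everything that needs correcting in one pass.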
