# Project - Predictive Modelling

## Import Libraries

1. General libraries to work with data and visualize data:

In [3]:
import numpy as np
import pandas as pd

# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

2. sklearn libraries to perform regression and classifications:

In [4]:
# For randomized data splitting
from sklearn.model_selection import train_test_split

## 1. Logistic Regression
from sklearn.linear_model import LogisticRegression

## 2. Linear Discriminant Analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

## 3. CART
from sklearn.tree import DecisionTreeClassifier

3. To check model performance:

In [5]:
from sklearn import metrics

# calculate accuracy measures and confusion matrix

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score, r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score

# 2 Logistic Regression, LDA, CART

Problem 2: Logistic Regression, LDA and CART

You are a statistician at the Republic of Indonesia Ministry of Health and have been given data on 1473 women collected from a Contraceptive Prevalence Survey. The respondents are married women who were either not pregnant or did not know whether they were pregnant at the time of the survey.

The task is to predict whether a woman uses a contraceptive method of her choice, based on her demographic and socio-economic characteristics.

## 2.1 Data Ingestion:

Read the dataset. Compute descriptive statistics, check for null values, duplicates, and outliers, and write an inference on the findings. Perform univariate, bivariate, and multivariate analysis.

### Import dataset and explore

In [6]:
cData= pd.read_excel('Contraceptive_method_dataset.xlsx')

In [5]:
cData.shape

(1473, 10)

There are 1473 rows and 10 columns in the dataset.

In [6]:
cData.head()

Unnamed: 0,Wife_age,Wife_ education,Husband_education,No_of_children_born,Wife_religion,Wife_Working,Husband_Occupation,Standard_of_living_index,Media_exposure,Contraceptive_method_used
0,24.0,Primary,Secondary,3.0,Scientology,No,2,High,Exposed,No
1,45.0,Uneducated,Secondary,10.0,Scientology,No,3,Very High,Exposed,No
2,43.0,Primary,Secondary,7.0,Scientology,No,3,Very High,Exposed,No
3,42.0,Secondary,Primary,9.0,Scientology,No,3,High,Exposed,No
4,36.0,Secondary,Secondary,8.0,Scientology,No,3,Low,Exposed,No


In [7]:
cData.tail()

Unnamed: 0,Wife_age,Wife_ education,Husband_education,No_of_children_born,Wife_religion,Wife_Working,Husband_Occupation,Standard_of_living_index,Media_exposure,Contraceptive_method_used
1468,33.0,Tertiary,Tertiary,,Scientology,Yes,2,Very High,Exposed,Yes
1469,33.0,Tertiary,Tertiary,,Scientology,No,1,Very High,Exposed,Yes
1470,39.0,Secondary,Secondary,,Scientology,Yes,1,Very High,Exposed,Yes
1471,33.0,Secondary,Secondary,,Scientology,Yes,2,Low,Exposed,Yes
1472,17.0,Secondary,Secondary,1.0,Scientology,No,2,Very High,Exposed,Yes


In [8]:
cData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1473 entries, 0 to 1472
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Wife_age                   1402 non-null   float64
 1   Wife_ education            1473 non-null   object 
 2   Husband_education          1473 non-null   object 
 3   No_of_children_born        1452 non-null   float64
 4   Wife_religion              1473 non-null   object 
 5   Wife_Working               1473 non-null   object 
 6   Husband_Occupation         1473 non-null   int64  
 7   Standard_of_living_index   1473 non-null   object 
 8   Media_exposure             1473 non-null   object 
 9   Contraceptive_method_used  1473 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 115.2+ KB


In [9]:
cData.dtypes

Wife_age                     float64
Wife_ education               object
Husband_education             object
No_of_children_born          float64
Wife_religion                 object
Wife_Working                  object
Husband_Occupation             int64
Standard_of_living_index      object
Media_exposure                object
Contraceptive_method_used     object
dtype: object

The numerical columns ('Wife_age', 'No_of_children_born', 'Husband_Occupation') are already of type float64 or int64.

'Husband_Occupation' is a categorical variable that is already encoded as integers (1 to 4).

The categorical columns 'Wife_ education', 'Husband_education', 'Wife_religion', 'Wife_Working', 'Standard_of_living_index', 'Media_exposure', and 'Contraceptive_method_used' are of object type. These need to be encoded before modelling.

In [10]:
cData.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Wife_age,1402.0,32.606277,8.274927,16.0,26.0,32.0,39.0,49.0
No_of_children_born,1452.0,3.254132,2.365212,0.0,1.0,3.0,4.0,16.0
Husband_Occupation,1473.0,2.137814,0.864857,1.0,1.0,2.0,3.0,4.0


In [11]:
cData.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Wife_ education,1473,4,Tertiary,577
Husband_education,1473,4,Tertiary,899
Wife_religion,1473,2,Scientology,1253
Wife_Working,1473,2,No,1104
Standard_of_living_index,1473,4,Very High,684
Media_exposure,1473,2,Exposed,1364
Contraceptive_method_used,1473,2,Yes,844


### Missing Values

In [12]:
cData.isnull().sum()

Wife_age                     71
Wife_ education               0
Husband_education             0
No_of_children_born          21
Wife_religion                 0
Wife_Working                  0
Husband_Occupation            0
Standard_of_living_index      0
Media_exposure                0
Contraceptive_method_used     0
dtype: int64

In [None]:
cData.apply(lambda x: x.isnull().value_counts())

There are 71 missing values in the Wife_age column and 21 missing values in the No_of_children_born column.

In [7]:
# Assign the result back instead of using inplace=True on a column selection
cData['Wife_age'] = cData['Wife_age'].fillna(cData['Wife_age'].median())

In [8]:
cData['No_of_children_born'].value_counts().to_frame()

Unnamed: 0,No_of_children_born
2.0,274
1.0,273
3.0,255
4.0,192
5.0,131
0.0,97
6.0,90
7.0,49
8.0,46
9.0,16


In [9]:
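# Fill missing 'No_of_children_born' with the most frequent value (2.0) rather than the mean or median.
# A numeric fill keeps the column float; filling with the string '2' would turn it into object dtype.
cData['No_of_children_born'] = cData['No_of_children_born'].fillna(2.0)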

Fix: drop null 'No_of_children_born'?
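
The two options raised here, imputing or dropping the rows, can be compared on a small example. This is a minimal sketch on a made-up frame (`toy`, `imputed`, and `dropped` are hypothetical names, not part of the notebook), assuming the median is used for age and the mode for the child count:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for cData; the values are made up for illustration.
toy = pd.DataFrame({
    "Wife_age": [24.0, np.nan, 43.0, 36.0, 30.0],
    "No_of_children_born": [2.0, 1.0, np.nan, 2.0, 3.0],
})

# Option A: impute. Median for the continuous column, mode for the count column.
imputed = toy.copy()
imputed["Wife_age"] = imputed["Wife_age"].fillna(imputed["Wife_age"].median())
children_mode = imputed["No_of_children_born"].mode().iloc[0]
imputed["No_of_children_born"] = imputed["No_of_children_born"].fillna(children_mode)

# Option B: drop the rows where the child count is missing.
dropped = toy.dropna(subset=["No_of_children_born"])

print(imputed)
print(len(toy), "->", len(dropped), "rows after dropping")
```

With only 21 of 1473 values missing in the real data, either option changes little; imputing keeps every row available for modelling.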

In [10]:
cData.isnull().sum()

Wife_age                     0
Wife_ education              0
Husband_education            0
No_of_children_born          0
Wife_religion                0
Wife_Working                 0
Husband_Occupation           0
Standard_of_living_index     0
Media_exposure               0
Contraceptive_method_used    0
dtype: int64

In [11]:
# Ensure 'No_of_children_born' is stored as float (it becomes object dtype if filled with a string)
i='No_of_children_born'
cData[i] = cData[i].astype(float)

In [None]:
cData.info()

### Outliers

In [None]:
cData_out = cData.loc[:, [
    'Wife_age', 'No_of_children_born','Husband_Occupation']]

In [None]:
#cData_out = cData.drop(['Contraceptive_method_used'], axis=1)
plt.figure(figsize=(3, 10))

feature_list = cData_out.columns
# One row of the grid per feature; call tight_layout once after all panels are drawn
for i, feature in enumerate(feature_list):
    plt.subplot(len(feature_list), 1, i + 1)
    sns.boxplot(y=cData_out[feature])
    plt.title(feature)
plt.tight_layout()

The outliers are in the No_of_children_born feature. Ideally they would be treated, but we proceed without treating them in this model.
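
The box-plot whiskers follow the 1.5 × IQR rule by default, so the same points can be flagged numerically. A minimal sketch, assuming that rule; `iqr_outliers` and the toy series are hypothetical and not part of the notebook:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up counts for illustration, not the survey data.
children = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 4, 16], name="No_of_children_born")
flagged = iqr_outliers(children)
print(flagged.tolist())
```

With the quartiles reported by `describe()` before imputation (25% = 1, 75% = 4), the upper fence would be 4 + 1.5 × 3 = 8.5, so counts of 9 or more would be flagged.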

Fix

https://wellsr.com/python/how-to-make-seaborn-boxplots-in-python/

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='No_of_children_born', y= 'Husband_education',hue='Contraceptive_method_used',  order = ['Uneducated',"Primary", "Secondary",'Tertiary'],data=cData)
plt.title('Box Plot of Husband Education vs No of children')
plt.show();

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='No_of_children_born', y= 'Wife_ education',hue='Contraceptive_method_used',  order = ['Uneducated',"Primary", "Secondary",'Tertiary'],data=cData)
plt.title('Box Plot of Wife Education vs No of children')
plt.show();

In [None]:
sns.boxplot(x='No_of_children_born', y='Wife_age', data=cData)
plt.show();

In [None]:
sns.boxplot(x='No_of_children_born', y='Wife_age', hue='Contraceptive_method_used', data=cData)
plt.show();

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='No_of_children_born', y= 'Wife_Working',hue='Contraceptive_method_used',data=cData)
plt.title('Box Plot of Working Wife vs No of children')
plt.show();

In [None]:
k=6
k_child_df = cData[cData['No_of_children_born']> k]
k_child_df.describe(include='all').T

### Univariate Analysis

### Bivariate Analysis

In [None]:
sns.pairplot(cData, hue='Contraceptive_method_used', diag_kind='kde');

### Multivariate Analysis

## 2.2 Data Preparation

Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis) and CART.

### 2.2.1 Encode Data

In [None]:
for i in cData.columns:
    if cData[i].dtypes== 'object':
        print(i, ': \n')
        print(cData[i].value_counts())
        print('       ****       ')

In [None]:
## Encode the target variable as a binary label
## 'Contraceptive_method_used': Yes -> 0, No -> 1 (stored as strings here and cast to float later)
cData['Contraceptive_method_used']=np.where(cData['Contraceptive_method_used'] =='Yes', '0', cData['Contraceptive_method_used'])
cData['Contraceptive_method_used']=np.where(cData['Contraceptive_method_used'] =='No', '1', cData['Contraceptive_method_used'])

In [None]:
# Note: the 'Media_exposure ' column name has a trailing space.
cData = pd.get_dummies(cData, columns=['Media_exposure ','Standard_of_living_index', 'Wife_Working', 'Wife_religion', 'Husband_education','Wife_ education'], drop_first=True)
cData.head()

Fix: do we need to encode all categorical variables with one-hot encoding, or as ordinal integers (1, 2, 3)? If the latter, in which order?
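
One way to look at this question: variables with a natural ranking (education, standard of living) can be ordinal-encoded, while unranked ones (religion, working status) are one-hot encoded. The sketch below illustrates both on a made-up frame. The ranking `education_order` is an assumption taken from the `order` list used in the box plots, and `toy`, `ordinal`, and `one_hot` are hypothetical names:

```python
import pandas as pd

toy = pd.DataFrame({
    "Wife_ education": ["Primary", "Uneducated", "Tertiary", "Secondary"],
    "Wife_Working": ["No", "Yes", "No", "No"],
})

# Assumed order, lowest to highest; it matches the `order` list used in the box plots.
education_order = ["Uneducated", "Primary", "Secondary", "Tertiary"]

# Ordinal encoding: a single integer column that preserves the ranking.
ordinal = toy.copy()
ordinal["Wife_ education"] = pd.Categorical(
    ordinal["Wife_ education"], categories=education_order, ordered=True
).codes

# One-hot encoding: one indicator column per level (first level dropped).
one_hot = pd.get_dummies(toy, columns=["Wife_Working"], drop_first=True)

print(ordinal["Wife_ education"].tolist())
print(one_hot.columns.tolist())
```

Ordinal codes assume equal spacing between levels, which matters for Logistic Regression and LDA; a decision tree only uses the ordering.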

In [None]:
#fix
# let's convert the object column from object type to float type
for i in cData.columns:
    if cData[i].dtypes== 'object':
        cData[i] = cData[i].astype(float)

In [None]:
cData.info()

### 2.2.2 Split Data

In [None]:
X= cData.drop('Contraceptive_method_used', axis=1)
y= cData['Contraceptive_method_used']

X_train,X_test, y_train, y_test = train_test_split(X,y, test_size= .3, random_state= 1)

In [None]:
X_train.head()

In [None]:
X_test.head()

### 2.2.3 Logistic Regression, LDA, CART

#### Logistic Regression

In [None]:
log_model = LogisticRegression(max_iter=1000)

In [None]:
log_model.fit(X_train,y_train)

In [None]:
ytrain_predict_log = log_model.predict(X_train)
ytest_predict_log = log_model.predict(X_test)

#### LDA

In [None]:
lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(X_train, y_train)

In [None]:
#dtree_predictions= dtree.predict(X_test)
ytrain_predict_lda = lda.predict(X_train)
ytest_predict_lda = lda.predict(X_test)

#### CART

In [None]:
dtree = DecisionTreeClassifier(criterion='gini')

In [None]:
dtree.fit(X_train, y_train)

In [None]:
#dtree_predictions= dtree.predict(X_test)
ytrain_predict_cart = dtree.predict(X_train)
ytest_predict_cart = dtree.predict(X_test)

## 2.3 Performance Metrics:
Check the performance of predictions on the train and test sets using accuracy and the confusion matrix, plot the ROC curve, and get the ROC_AUC score for each model. Final model: compare the models and write an inference on which model is best/optimized.
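
The three measures named above can be gathered in one helper. This is a minimal sketch on synthetic data from `make_classification`, standing in for the encoded survey features; `evaluate` and the `*_demo` names are hypothetical and not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded survey features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

def evaluate(model, X_fit, y_fit, X_eval, y_eval):
    """Fit `model`, then report accuracy, confusion matrix, and ROC AUC."""
    model.fit(X_fit, y_fit)
    labels = model.predict(X_eval)
    scores = model.predict_proba(X_eval)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_eval, labels),
        "confusion_matrix": confusion_matrix(y_eval, labels),
        "roc_auc": roc_auc_score(y_eval, scores),
    }

report = evaluate(LogisticRegression(max_iter=1000), Xtr, ytr, Xte, yte)
# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = report["confusion_matrix"].ravel()
print(report["accuracy"], report["roc_auc"], (tn, fp, fn, tp))
```

Note that ROC AUC is computed from predicted probabilities, while accuracy and the confusion matrix use the hard class labels.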

### Accuracy

In [None]:
model_score = log_model.score(X_test, y_test)
print('Accuracy Score is ',model_score)

### Confusion Matrix

In [12]:
def conf_mat(y_test, y_predict, model_name=''):
    # Compute confusion matrix
    cm = metrics.confusion_matrix(y_test, y_predict)
    plt.figure(figsize=(6, 4))
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    # Label 0 = 'Yes', label 1 = 'No' (see the target encoding)
    classNames = ['Contraception Yes', 'Contraception No']
    # model_name is optional so that two-argument calls keep working
    title = 'Confusion Matrix - Test Data'
    if model_name:
        title += ' for {} Model'.format(model_name)
    plt.title(title)
    plt.ylabel('Actual (True) label')
    plt.xlabel('Predicted label')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=45)
    plt.yticks(tick_marks, classNames)
    # Rows are actual labels, columns are predicted labels; label 1 is the positive class
    s = [['TN', 'FP'], ['FN', 'TP']]
    for i in range(2):
        for j in range(2):
            plt.text(j, i, str(s[i][j]) + " = " + str(cm[i][j]))
    plt.show()

In [None]:
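# y_test_pred is not defined at this point, so predict with the fitted logistic model
conf_mat(y_test, log_model.predict(X_test))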

In [None]:
# Alternate way to plot the confusion matrix

In [None]:
def plot_confusion_matrix(y_true, y_pred, classes, cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    # Create figure and axis
    fig, ax = plt.subplots(figsize=(4,3))
    
    # Plot the confusion matrix as heatmap
    sns.heatmap(cm, annot=True, cmap=cmap, fmt='g', xticklabels=classes, yticklabels=classes, ax=ax)

    # Set axis labels and title
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    ax.set_title('Confusion matrix')
    ax.xaxis.set_ticklabels(classes)
    ax.yaxis.set_ticklabels(classes)
    plt.xticks(rotation=45)

    # Show plot
    plt.show()


In [None]:
classNames = ['Contraception Yes', 'Contraception No']
plot_confusion_matrix(y_test, log_model.predict(X_test), classNames)

### ROC Curve

In [None]:
probs=log_model.predict_proba(X_train)
probs=probs[:,1]

In [None]:
y_true = np.array([0, 1])  # Fix: np.array must be called with parentheses

In [None]:
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)

In [None]:
plt.plot([0,1],[0,1], linestyle='--')
plt.plot(train_fpr, train_tpr);

### Best Model

## 2.4 Inference:
Based on these predictions, what are the insights and recommendations?
Please explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.

## Reflection Report

Please reflect on all that you learnt and fill this reflection report. You have to copy the link and paste it on the URL bar of your respective browser.
https://docs.google.com/forms/d/e/1FAIpQLScKuVyrmTTM7Pboh0IB4YIBUbJp2NrDZcsY4SCRn3ZUkwmLGg/viewform

In [13]:
# Define the models
models = [('Logistic Regression', LogisticRegression(max_iter=1000)),
          ('LDA', LinearDiscriminantAnalysis(n_components=1)),
          ('CART', DecisionTreeClassifier(criterion='gini'))]

In [14]:
for name, model in models:
    # fit the model on the training data
    model.fit(X_train, y_train)

    # predict the training and test data
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #Accuracy Scores
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)

    #Confusion Matrix & Classification Report
    print("Confusion Matrix for {}:".format(name))
    print(confusion_matrix(y_test, y_pred_test))
    conf_mat(y_test, y_pred_test)
    print('Classification Report {}:'.format(name))
    print(classification_report(y_test, y_pred_test))
    
    #Get the predicted probabilities for train and test data
    y_train_proba = model.predict_proba(X_train)[:, 1]
    y_test_proba = model.predict_proba(X_test)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
    plt.plot(fpr, tpr)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title('ROC Curve for {}:'.format(name))
    plt.show()
    
    # Calculate ROC_AUC score for the current model
    roc_auc = roc_auc_score(y_test, y_test_proba)
    print("ROC AUC Score:", roc_auc)


In [None]:
# Evaluate each model using k-fold cross-validation
results = []
names = []
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cv_results = cross_val_score(model,
                                 X_train,
                                 y_train,
                                 cv=10,
                                 scoring='recall')
    results.append(cv_results)
    names.append(name)
    #print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

    # Print the confusion matrix and classification report
    print('Model: ', name)
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('Classification Report:')
    print(classification_report(y_test, y_pred))
    print('-----------------------')

# Compare the models using boxplots

plt.boxplot(results, labels=names)
plt.title('Model Comparison')
plt.ylabel('Recall')
plt.show()
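
Recall and accuracy can also be collected in a single cross-validation pass. This is a minimal sketch using `cross_validate` on synthetic data; the `*_demo` names are hypothetical, and the synthetic features stand in for `X_train` and `y_train`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# Passing a list of scorers returns one array of fold scores per metric.
cv = cross_validate(
    DecisionTreeClassifier(criterion="gini", random_state=1),
    X_demo, y_demo, cv=10, scoring=["accuracy", "recall"],
)
for metric in ("accuracy", "recall"):
    fold_scores = cv[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} ({fold_scores.std():.3f})")
```

The `recall` scorer treats label 1 as the positive class, which in this notebook's encoding corresponds to 'No'.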

### Additional

In [None]:
# Evaluate each model using k-fold cross-validation
results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, X, y, cv=10, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

# Compare the models using boxplots
import matplotlib.pyplot as plt
plt.boxplot(results, labels=names)
plt.title('Model Comparison')
plt.ylabel('Accuracy')
plt.show()


In [None]:
# Train and evaluate each model on the test set
for name, model in models:
    # Fit the model to the training data
    model.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = model.predict(X_test)
    
    # Print the confusion matrix and classification report
    print('Model: ', name)
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('Classification Report:')
    print(classification_report(y_test, y_pred))
    print('-----------------------')


In [None]:
### Create Logistic Regression model and fit to the training data
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Evaluate Logistic Regression model on train and test sets
y_train_pred = logreg.predict(X_train)
y_test_pred = logreg.predict(X_test)
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
print("Logistic Regression Training Accuracy:", train_acc)
print("Logistic Regression Testing Accuracy:", test_acc)
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
conf_mat(y_test, y_test_pred)
print('Classification Report:')
print(classification_report(y_test, y_test_pred))
print('-----------------------')

# Plot ROC curve for Logistic Regression model
y_test_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Logistic Regression Model")
plt.show()

# Calculate ROC_AUC score for Logistic Regression model
roc_auc_log = roc_auc_score(y_test, y_test_proba)
print("Logistic Regression model ROC AUC Score:", roc_auc_log)

# Create LDA model and fit to the training data
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Evaluate LDA model on train and test sets
y_train_pred = lda.predict(X_train)
y_test_pred = lda.predict(X_test)
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
print("LDA Training Accuracy:", train_acc)
print("LDA Testing Accuracy:", test_acc)
print("LDA Confusion Matrix:")
conf_mat(y_test, y_test_pred)
print(confusion_matrix(y_test, y_test_pred))
print('Classification Report:')
print(classification_report(y_test, y_test_pred))
print('-----------------------')

# Plot ROC curve for LDA model
y_test_proba = lda.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for LDA Model")
plt.show()

# Calculate ROC_AUC score for LDA model
roc_auc_lda = roc_auc_score(y_test, y_test_proba)
print("LDA ROC AUC Score:", roc_auc_lda)

# Create CART model and fit to the training data
cart = DecisionTreeClassifier()
cart.fit(X_train, y_train)

# Evaluate CART model on train and test sets
y_train_pred = cart.predict(X_train)
y_test_pred = cart.predict(X_test)
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
# Print model stats
print("CART Training Accuracy:", train_acc)
print("CART Testing Accuracy:", test_acc)
print("CART Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
conf_mat(y_test, y_test_pred)
print('Classification Report:')
print(classification_report(y_test, y_test_pred))
print('-----------------------')

# Plot ROC curve for CART model
y_test_proba = cart.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for CART Model")
plt.show()

# Calculate ROC_AUC score for CART model
roc_auc_cart = roc_auc_score(y_test, y_test_proba)
print("CART ROC AUC Score:", roc_auc_cart)


In [None]:
# Define the models
models = [('Logistic Regression', LogisticRegression(max_iter=1000)),
          ('LDA', LinearDiscriminantAnalysis(n_components=1)),
          ('CART', DecisionTreeClassifier(criterion='gini'))]

In [None]:
# define empty lists to contain the performance metrics:

model_names = []
r2_train = []
r2_test = []
rmse_train = []
rmse_test = []
adj_r2_train = []
adj_r2_test = []
accuracy_train = []
accuracy_test = []
confusion_train = []
confusion_test = []
report_train = []
report_test = []
roc_auc_train = []
roc_auc_test =[]

In [None]:
# create empty dataframe to store performance metrics
performance_df = pd.DataFrame(columns=['Model', 'R-squared (train)', 'R-squared (test)', 'RMSE (train)', 'RMSE (test)',
                                       'Adj R-squared (train)', 'Adj R-squared (test)', 'Accuracy (train)',
                                       'Accuracy (test)'])

In [None]:
performance_df.head()

In [None]:
# loop through the models
for name, model in models:

    # fit the model on the training data
    model.fit(X_train, y_train)

    # predict the training and test data
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # calculate the performance metrics
    r2_train_score = r2_score(y_train, y_pred_train)
    r2_test_score = r2_score(y_test, y_pred_test)
    # np.sqrt of the MSE avoids the `squared=False` argument, which newer scikit-learn versions no longer accept
    rmse_train_score = np.sqrt(mean_squared_error(y_train, y_pred_train))
    rmse_test_score = np.sqrt(mean_squared_error(y_test, y_pred_test))
    accuracy_train_score = accuracy_score(y_train, y_pred_train)
    accuracy_test_score = accuracy_score(y_test, y_pred_test)
    confusion_train_score = confusion_matrix(y_train, y_pred_train)
    confusion_test_score = confusion_matrix(y_test, y_pred_test)
    report_train_score = classification_report(y_train, y_pred_train)
    report_test_score = classification_report(y_test, y_pred_test)

    # Get the predicted probabilities for train and test data
    y_train_proba = model.predict_proba(X_train)[:, 1]
    y_test_proba = model.predict_proba(X_test)[:, 1]

    # Calculate the roc_auc_score for train and test data
    train_roc_auc_score = roc_auc_score(y_train, y_train_proba)
    test_roc_auc_score = roc_auc_score(y_test, y_test_proba)

    # Build the ROC curves from the predicted probabilities, not the hard class labels
    train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_proba)
    test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_proba)
    
    # plot ROC curves
    plt.plot(train_fpr, train_tpr, label='Train ROC Curve (area = %0.2f)' % train_roc_auc_score)
    plt.plot(test_fpr, test_tpr, label='Test ROC Curve (area = %0.2f)' % test_roc_auc_score)

    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve for {} Model".format(name))
    plt.legend(loc="lower right")
    plt.show()
    
    # Plot Confusion Matrix
    conf_mat(y_test, y_pred_test)

    # calculate adjusted R-squared for train and test data
    n_train = len(y_train)
    p_train = X_train.shape[1]
    adj_r2_train_score = 1 - ((1 - r2_train_score) *
                              (n_train - 1)) / (n_train - p_train - 1)
    n_test = len(y_test)
    p_test = X_test.shape[1]
    adj_r2_test_score = 1 - ((1 - r2_test_score) *
                             (n_test - 1)) / (n_test - p_test - 1)

    # append the results to the respective lists
    model_names.append(name)
    r2_train.append(r2_train_score)
    r2_test.append(r2_test_score)
    rmse_train.append(rmse_train_score)
    rmse_test.append(rmse_test_score)
    adj_r2_train.append(adj_r2_train_score)
    adj_r2_test.append(adj_r2_test_score)
    accuracy_train.append(accuracy_train_score)
    accuracy_test.append(accuracy_test_score)
    confusion_train.append(confusion_train_score)
    confusion_test.append(confusion_test_score)
    report_train.append(report_train_score)
    report_test.append(report_test_score)
    roc_auc_train.append(train_roc_auc_score)
    roc_auc_test.append(test_roc_auc_score)

In [None]:
# create a dataframe to store the results
results_df = pd.DataFrame({
    'Model': model_names,
    'R-squared (Train)': r2_train,
    'R-squared (Test)': r2_test,
    'RMSE (train)': rmse_train,
    'RMSE (test)': rmse_test,
    'Adj R-squared (train)': adj_r2_train,
    'Adj R-squared (test)': adj_r2_test,
    'Accuracy (train)': accuracy_train,
    'Accuracy (test)': accuracy_test,
    'ROC AUC (Train)': roc_auc_train,
    'ROC AUC (Test)': roc_auc_test
})

In [None]:
results_df.head(5)

R-squared (R2) measures the share of variance in the outcome that the model explains. It equals 1 for a perfect fit and usually lies between 0 and 1, although it can be negative when the model predicts worse than always predicting the mean. Note that R2, RMSE, and adjusted R2 are regression metrics; for the binary classifiers in this project, accuracy, the confusion matrix, and ROC AUC are the primary measures.

The root mean square error (RMSE) is the square root of the mean squared difference between the actual and predicted values of the outcome variable. It is measured in the same units as the outcome variable. RMSE is never negative and has no upper bound; a lower RMSE indicates better predictive power.

The adjusted R-squared (R2) is a modified version of the R-squared value that adjusts for the number of predictor variables in the model. It typically ranges from negative infinity to 1, with a higher value indicating a better fit between the model and the data. The adjusted R2 penalizes the inclusion of irrelevant predictors in the model and rewards the inclusion of relevant predictors.
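
To see how these regression metrics behave on 0/1 class labels, the sketch below computes them on a made-up pair of label vectors (hypothetical values, not model output). For hard 0/1 predictions the mean squared error equals the misclassification rate, so RMSE is the square root of 1 minus accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Made-up labels for illustration: two mistakes out of eight.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(acc, rmse, r2)
```

Here accuracy is 0.75 while R2 is 0, which shows why R2 and RMSE add little beyond accuracy for a classifier and can be misleading if read on a regression scale.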

In [None]:
confusion_train

In [None]:
confusion_test

In [None]:
report_train

In [None]:
report_test