# BUILDING A CLASSIFICATION MODEL

# Instructor: Ekpe Okorafor

# The CODATA-RDA School for Research Data Science


## Introduction:

Building classification models is one of the most important data science use cases. Classification models are models that predict a categorical label. A few examples of this include predicting whether a customer will churn or whether a bank loan will default. In this guide, you will learn how to build and evaluate a classification model in Python. We will train the logistic regression algorithm, which is one of the oldest yet most powerful classification algorithms.

## 0.	Data

In this exercise, we will use a fictitious dataset of loan applicants containing about 614 observations and 12 variables, as described below:

1. **Gender:** Whether the applicant is a male ("Male") or a female ("Female")
2. **Marital_status:** Whether the applicant is married ("Yes") or not ("No")
3. **Dependent:** Total number of dependents
4. **Education:** Whether the applicant is a graduate ("Graduate") or not ("Not Graduate")
5. **Self_employed:** Whether the applicant is self-employed ("Yes") or not ("No")
6. **ApplicantIncome:** Monthly Income of the applicant (in USD)
7. **CoapplicantIncome:** Monthly Income of the coapplicant (in USD)
8. **Loan_amount:** Loan amount (in USD) for which the application was submitted
9. **Loan_amount_term:** Terms of the loan in months
10. **Credit_history:** Whether the applicant has a credit history ("1") or not ("0")
11. **Property_area:** Where property is located – rural (“Rural”), semiurban (“Semiurban”) or urban ("Urban")
12. **Loan_status:** Whether the loan application was approved ("Y") or not ("N")

Let's start by loading the required libraries and the data.



In [None]:
import pandas as pd

# Read the CSV file
dat = pd.read_csv('data.csv')

# Display a summary of the dataframe
dat.info()

# Display the first few rows of the dataframe
dat.head()

The output lists the data type of each column: numerical variables are labelled int64 or float64, while character variables are labelled object. We will convert the object columns into categorical (factor) variables, stored with the pandas 'category' data type, using the code below.

In [None]:
# Display the initial data types
print("Initial data types:")
print(dat.dtypes)

# Convert columns from 'object' to 'category'
for col in dat.select_dtypes(include='object').columns:
    dat[col] = dat[col].astype('category')


# Display the data types after conversion
print("\nData types after conversion:")
print(dat.dtypes)


# Display the first few rows of the dataframe
dat.head()



Great! We have 5 numerical variables, and the other variables are now labelled as category.
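
To make the conversion concrete, here is a minimal sketch on a toy DataFrame. The values are hypothetical and are not taken from data.csv. It shows that a categorical column stores its levels once and keeps one integer code per row.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from data.csv)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5000, 3000, 4000],
})

# Convert the text column to the 'category' dtype
toy["Gender"] = toy["Gender"].astype("category")

# The levels are stored once (sorted), with one integer code per row
levels = toy["Gender"].cat.categories.tolist()
codes = toy["Gender"].cat.codes.tolist()

print(levels)  # ['Female', 'Male']
print(codes)   # [1, 0, 1]
```

The integer codes are what the correlation heat map later in this guide relies on when it converts factor levels to numbers.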

Now, let's get on with building this classification model.

We will proceed as follows:

 - Step 1: Check continuous variables
 - Step 2: Check factor variables
 - Step 3: Summary statistics
 - Step 4: Train/test set
 - Step 5: Build the model
 - Step 6: Assess the performance of the model

## 1.	Step 1) Check continuous variables

In the first step, you can see the distribution of the continuous variables.


In [None]:
# Select numeric columns
continuous = dat.select_dtypes(include='number')


# Display a summary of the numeric columns
summary = continuous.describe()

summary


The code above selects the numeric columns and then displays a summary of those columns.

From the above table, you can see that the variables have very different scales. 'ApplicantIncome' and 'CoapplicantIncome' have large outliers (compare the third quartile with the maximum value).

**You can deal with this in two steps:**

 1: Plot the distribution of the variables with the outliers (ApplicantIncome & CoapplicantIncome)

 2: Standardize the continuous variables

Let's go ahead and plot the distribution.

### 1. Plot the distribution

Let's look closer at the distribution of 'ApplicantIncome'

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting histogram with kernel density curve
plt.figure(figsize=(10, 6))
sns.histplot(continuous['ApplicantIncome'], kde=True, color='#FF6666', alpha=0.2)
plt.title('Distribution of ApplicantIncome')
plt.xlabel('ApplicantIncome')
plt.ylabel('Count')  # histplot shows counts by default (stat='count')
plt.show()


This is a histogram with a kernel density curve for the 'ApplicantIncome' column.

We can see that the variable has some outliers. You can partially tackle this problem by deleting the top 2 percent of the ApplicantIncome values, that is, the observations above the 98th percentile.

To compute the 98th percentile of the 'ApplicantIncome' column and display it, you can use the 'numpy' library in Python.

In [None]:
import numpy as np

# Compute the 98th percentile of ApplicantIncome
applicant_income = np.percentile(dat['ApplicantIncome'], 98)

# Display the result
applicant_income


98 percent of the applicants earn less than $19,666.04 per month.

You can drop the observations above this threshold.


To filter the DataFrame in Python to drop observations where 'ApplicantIncome' is above the 98th percentile, you can use the 'pandas' library.

**Compute the 98th Percentile of ApplicantIncome and Drop Observations Above the Threshold**

In [None]:
# Compute the 98th percentile of ApplicantIncome
applicant_income = np.percentile(dat['ApplicantIncome'], 98)

# Drop observations at or above this threshold
# (.copy() gives an independent DataFrame and avoids SettingWithCopyWarning later)
dat_drop = dat[dat['ApplicantIncome'] < applicant_income].copy()

# Display the dimensions of the resulting DataFrame
dat_drop.shape


Compare this with the original “dat”. Observe that some rows have been dropped.
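
As a small worked example of the percentile filter, the sketch below uses a toy column holding the hypothetical incomes 1 to 100 rather than the loan data. NumPy interpolates linearly between neighbouring values by default, so the threshold need not be an observed value.

```python
import numpy as np
import pandas as pd

# Toy incomes (hypothetical): 1, 2, ..., 100
toy = pd.DataFrame({"ApplicantIncome": np.arange(1, 101)})

# 98th percentile, using NumPy's default linear interpolation
threshold = np.percentile(toy["ApplicantIncome"], 98)

# Keep only the rows strictly below the threshold
toy_drop = toy[toy["ApplicantIncome"] < threshold]

print(threshold)                  # about 98.02
print(toy.shape, toy_drop.shape)  # (100, 1) (98, 1)
```

Here the two largest values (99 and 100) are removed, which is 2 percent of the toy data.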

### 2. Standardize the continuous variables

You can standardize each numeric column (rescaling it to zero mean and unit variance) to improve model performance, especially when the variables are not on the same scale.

To standardize the numeric columns in a DataFrame in Python, you can use the StandardScaler from the sklearn.preprocessing module.

**Standardize Numeric Columns**

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize numeric columns
scaler = StandardScaler()

# Select only the numeric columns
numeric_columns = dat_drop.select_dtypes(include='number').columns

# Standardize the numeric columns on a copy, leaving dat_drop unchanged
dat_rescale = dat_drop.copy()
dat_rescale[numeric_columns] = scaler.fit_transform(dat_rescale[numeric_columns])

# Display the first few rows of the rescaled DataFrame
dat_rescale.head()
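
For intuition about what StandardScaler computes, the following sketch compares it with the formula z = (x - mean) / standard deviation on a toy column of hypothetical values. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardize with scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

# Standardize by hand: subtract the mean, divide by the standard deviation
# (np.std uses ddof=0 by default, matching StandardScaler)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(z_sklearn, z_manual))  # True
```

After standardization every numeric column has mean 0 and standard deviation 1, so no single variable dominates because of its units.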


## 2.	Step 2) Check factor variables

This step has two objectives:
 - Check the level in each categorical column
 - Define new levels

We will divide this step into three parts:
 - Select the categorical columns
 - Store the bar chart of each column in a list
 - Print the graphs

To select categorical (factor) columns from a DataFrame in Python using pandas, you can filter columns based on their data type after rescaling and standardizing the DataFrame.


We can select the factor columns with the code below:


In [None]:
# Select categorical columns
factor = dat_rescale.select_dtypes(include='category')

# Count the number of categorical columns
num_factor_columns = factor.shape[1]

num_factor_columns


The dataset contains 7 categorical variables.

The second step is more involved. You want to plot a bar chart for each column in the factor DataFrame. It is more convenient to automate the process, especially in situations where there are lots of columns.


To create bar charts for each categorical column in a pandas DataFrame in Python, you can use matplotlib or seaborn for plotting.

In [None]:
import matplotlib.pyplot as plt

# Assuming 'factor' DataFrame is already defined with categorical columns

# Calculate appropriate figure height
num_columns = len(factor.columns)
fig_height = num_columns * 4  # Adjust the multiplier as needed to ensure sufficient space

# Create graphs for each column
plt.figure(figsize=(4, fig_height))

for i, col in enumerate(factor.columns):
    plt.subplot(num_columns, 1, i + 1)
    factor[col].value_counts().plot(kind='bar')
    plt.title(f'Bar Chart of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.xticks(rotation=90)

# Adjust layout to avoid warning and overlapping
plt.tight_layout()

# Display the plots
plt.show()


This Python code automates the creation of bar charts for each categorical column in the factor DataFrame. Adjust the DataFrame name (factor) and plot customization according to your specific dataset and visualization preferences.

**Step 1:** Import the plotting library (matplotlib.pyplot); pandas was already imported when the data was loaded.

**Step 2:** Assuming factor is a DataFrame containing categorical columns.

**Step 3:** Use a for loop to iterate through each column (col) in factor.columns.

**Step 4:** Create a subplot for each column using plt.subplot. Adjust the subplot dimensions (len(factor.columns), 1) based on the number of columns in factor.

**Step 5:** Plot the bar chart for each column using value_counts().plot(kind='bar').

**Step 6:** Customize plot titles, labels, and rotation of x-axis labels (plt.title, plt.xlabel, plt.ylabel, plt.xticks(rotation=90)).

**Step 7:** Use plt.tight_layout() to improve subplot spacing, and plt.show() to display the plots.
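
The second objective of this step, defining new levels, is not demonstrated above. One possible approach is sketched below on toy data. The grouping of "Semiurban" and "Urban" into a single "Non-rural" level, and the column name Area_group, are purely illustrative assumptions and not a recommendation from this guide.

```python
import pandas as pd

# Toy categorical column (hypothetical values)
toy = pd.DataFrame({
    "Property_area": pd.Categorical(["Rural", "Semiurban", "Urban", "Urban", "Rural"])
})

# Check the existing levels and how often each one occurs
level_counts = toy["Property_area"].value_counts()
print(level_counts)

# Hypothetical recoding: merge two levels into one new level
mapping = {"Rural": "Rural", "Semiurban": "Non-rural", "Urban": "Non-rural"}
toy["Area_group"] = toy["Property_area"].astype(str).map(mapping).astype("category")

new_levels = toy["Area_group"].cat.categories.tolist()
print(new_levels)  # ['Non-rural', 'Rural']
```

Merging levels in this way is mainly useful when a level has very few observations, which the bar charts make easy to spot.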


## 3.	Step 3) Summary Statistic

It is time to check some statistics about our target variable, Loan_status. In the graph below, you compute the percentage of applicants whose loan was approved or rejected, given their gender.

1. Calculate the percentage of loan approval by gender.

2. Plot the results


In [None]:
#import pandas as pd
#import seaborn as sns
#import matplotlib.pyplot as plt

# Assuming 'dat_rescale' DataFrame is already defined and contains 'Gender' and 'Loan_status' columns

# Calculate the percentage of loan approval given gender
loan_status_percentage = dat_rescale.pivot_table(index='Gender', columns='Loan_status', aggfunc='size', fill_value=0)
loan_status_percentage = loan_status_percentage.div(loan_status_percentage.sum(axis=1), axis=0) * 100

# Reset the index to convert pivot table to DataFrame for plotting
loan_status_percentage = loan_status_percentage.reset_index()

# Melt the DataFrame for easier plotting with seaborn
loan_status_percentage_melted = loan_status_percentage.melt(id_vars='Gender', value_vars=loan_status_percentage.columns[1:],
                                                            var_name='Loan_status', value_name='Percentage')

# Plot the percentage of loan approval by gender
plt.figure(figsize=(10, 6))
sns.barplot(data=loan_status_percentage_melted, x='Gender', y='Percentage', hue='Loan_status')

# Customize the plot
plt.title('Percentage of Loan Approval by Gender')
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.xticks(rotation=0)
plt.legend(title='Loan Status')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()
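
The row percentages computed with pivot_table above can also be obtained with pd.crosstab. The sketch below is an alternative shown on toy data with hypothetical values, not the method used in this guide.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Loan_status": ["Y", "Y", "Y", "N", "Y", "N"],
})

# normalize='index' divides each row by its total, giving shares within each gender
pct = pd.crosstab(toy["Gender"], toy["Loan_status"], normalize="index") * 100

print(pct)
```

In this toy table 75 percent of the male applicants and 50 percent of the female applicants are approved, and each row sums to 100.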



Next, check if the level of education affects their loan approval.

In [None]:
# Assuming 'dat_rescale' DataFrame is already defined and contains 'Education' and 'Loan_status' columns

# Calculate the percentage of loan approval given education level
loan_status_education_percentage = dat_rescale.pivot_table(index='Education', columns='Loan_status', aggfunc='size', fill_value=0)
loan_status_education_percentage = loan_status_education_percentage.div(loan_status_education_percentage.sum(axis=1), axis=0) * 100

# Reset the index to convert pivot table to DataFrame for plotting
loan_status_education_percentage = loan_status_education_percentage.reset_index()

# Melt the DataFrame for easier plotting with seaborn
loan_status_education_percentage_melted = loan_status_education_percentage.melt(id_vars='Education', value_vars=loan_status_education_percentage.columns[1:],
                                                            var_name='Loan_status', value_name='Percentage')

# Plot the percentage of loan approval by education level
plt.figure(figsize=(12, 6))
sns.barplot(data=loan_status_education_percentage_melted, x='Education', y='Percentage', hue='Loan_status')

# Customize the plot
plt.title('Percentage of Loan Approval by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Percentage')
plt.xticks(rotation=45)
plt.legend(title='Loan Status')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()


Next, check if the property area affects their loan approval.


In [None]:
# Calculate the percentage of loan approval given property area
loan_status_area_percentage = dat_rescale.pivot_table(index='Property_area', columns='Loan_status', aggfunc='size', fill_value=0)
loan_status_area_percentage = loan_status_area_percentage.div(loan_status_area_percentage.sum(axis=1), axis=0) * 100

# Reset the index to convert pivot table to DataFrame for plotting
loan_status_area_percentage = loan_status_area_percentage.reset_index()

# Melt the DataFrame for easier plotting with seaborn
loan_status_area_percentage_melted = loan_status_area_percentage.melt(id_vars='Property_area', value_vars=loan_status_area_percentage.columns[1:],
                                                            var_name='Loan_status', value_name='Percentage')

# Plot the percentage of loan approval by property area
plt.figure(figsize=(12, 6))
sns.barplot(data=loan_status_area_percentage_melted, x='Property_area', y='Percentage', hue='Loan_status')

# Customize the plot
plt.title('Percentage of Loan Approval by Property Area')
plt.xlabel('Property Area')
plt.ylabel('Percentage')
plt.xticks(rotation=45)
plt.legend(title='Loan Status')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()


Check if the applicant's income is related to the loan amount.

Draw a scatter plot to visually inspect the relationship between the applicant's income and the loan amount, and compute the Pearson correlation coefficient to quantify this relationship. Adjust the DataFrame and column names as needed to fit your specific data context.


In [None]:
#import pandas as pd
#import seaborn as sns
#import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Assuming 'dat_rescale' DataFrame is already defined and contains 'ApplicantIncome' and 'Loan_amount' columns

# Scatter plot to visualize the relationship
plt.figure(figsize=(10, 6))
sns.scatterplot(data=dat_rescale, x='ApplicantIncome', y='Loan_amount')

# Customize the plot
plt.title('Relationship between Applicant Income and Loan Amount')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount')
plt.grid(True)

# Show the plot
plt.show()

# Calculate the Pearson correlation coefficient
correlation, p_value = pearsonr(dat_rescale['ApplicantIncome'], dat_rescale['Loan_amount'])
print(f"Pearson correlation coefficient: {correlation}")
print(f"P-value: {p_value}")
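
To show what pearsonr measures, the sketch below computes the coefficient by hand on two short toy arrays of hypothetical values and compares the result with SciPy.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy values (hypothetical)
income = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
loan = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r: sum of products of centered values,
# divided by the square root of the product of their sums of squares
dx = income - income.mean()
dy = loan - loan.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, p_value = pearsonr(income, loan)

print(r_manual, r_scipy)  # both about 0.7746
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear association.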


**Non-linearity**

Before you run the model, you can check whether the relationship between the applicant income and the loan amount is non-linear.


In [None]:
#import pandas as pd
#import seaborn as sns
#import matplotlib.pyplot as plt
#import numpy as np
# Assuming 'dat_rescale' DataFrame is already defined and contains 'ApplicantIncome', 'Loan_amount', and 'Loan_status' columns

# Scatter plot with a polynomial regression fit (degree=2) for each loan status.
# lmplot is a figure-level function: it creates its own figure, draws the points
# and the fitted curves, and adds the legend for the hue variable.
g = sns.lmplot(data=dat_rescale, x='ApplicantIncome', y='Loan_amount', hue='Loan_status',
               order=2, ci=95, aspect=2, scatter_kws={'s': 10})

# Customize the plot
g.set_axis_labels('Applicant Income', 'Loan Amount')
g.ax.set_title('Relationship between Applicant Income and Loan Amount with Polynomial Regression')
g.ax.grid(True)

# Show the plot
plt.show()



This code will provide you with a scatter plot to visually inspect the relationship between the applicant's income and the loan amount, with a polynomial regression line (degree 2) to account for non-linearity, colored by loan status. Adjust the DataFrame and column names as needed to fit your specific data context.


In a nutshell, you can test interaction terms in the model to pick up non-linear effects between the applicant income and other features. It is important to detect under which conditions the effect of the applicant income differs.
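
As a minimal sketch of what adding such terms could look like, the example below builds one interaction column and one squared column on toy data. The column names Income_x_Credit and Income_sq, the toy values, and the choice of Credit_history as the interacting feature are all assumptions made for illustration; the guide does not say which terms are worth testing.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "ApplicantIncome": [-1.0, 0.5, 2.0, 0.0],
    "Credit_history": [1, 0, 1, 1],
})

# Interaction term: lets the effect of income differ by credit history
toy["Income_x_Credit"] = toy["ApplicantIncome"] * toy["Credit_history"]

# Squared term: lets the effect of income bend (non-linearity)
toy["Income_sq"] = toy["ApplicantIncome"] ** 2

print(toy)
```

These extra columns would then be passed to the model alongside the original predictors, and their coefficients inspected.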

**Correlation**

The next check is to visualize the correlation between the variables. You convert the factor level type to numeric so that you can plot a heat map containing the coefficient of correlation computed with the Spearman method.


In [None]:
#import pandas as pd
#import seaborn as sns
#import matplotlib.pyplot as plt

# Assuming 'dat_rescale' DataFrame is already defined

# Convert factor level columns to numeric
dat_numeric = dat_rescale.copy()
for col in dat_numeric.select_dtypes(include=['category', 'object']).columns:
    dat_numeric[col] = dat_numeric[col].astype('category').cat.codes

# Compute the correlation matrix using the Spearman method
corr_matrix = dat_numeric.corr(method='spearman')

# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, vmin=-1, vmax=1,
            linewidths=.5, annot_kws={"size": 8}, cbar_kws={"shrink": .8})

# Customize the plot
plt.title('Spearman Correlation Heatmap')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

# Show the plot
plt.show()
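
The Spearman method used above is the Pearson correlation applied to the ranks of the values, which is why it captures monotonic relationships that are not linear. A small sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy data: y always increases with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 100]})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman equals Pearson computed on the ranks
pearson_on_ranks = toy.rank().corr(method="pearson").loc["x", "y"]

print(pearson, spearman, pearson_on_ranks)
```

Here the Spearman coefficient is 1 because the ordering of x and y agrees perfectly, while the Pearson coefficient is below 1 because the relationship is not a straight line.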


**Class Discussion:**

What observations can you make based on the results above?



## 4.	Step 4) Train/test set

Any supervised machine learning task requires splitting the data between a train set and a test set.

To split your data into a training set and a test set in Python, you can use the train_test_split function from sklearn.model_selection.



In [None]:
#import pandas as pd
from sklearn.model_selection import train_test_split


# Assuming 'dat_rescale' DataFrame is already defined


# Set the random seed for reproducibility
random_seed = 1234


# Split the data into training and testing sets
trainData, testData = train_test_split(dat_rescale, test_size=0.3, stratify=dat_rescale['Loan_status'], random_state=random_seed)


# Print the dimensions of the train and test sets
print(f'Train Data Dimensions: {trainData.shape}')
print(f'Test Data Dimensions: {testData.shape}')


The train dataset contains 70 percent of the data (420 observations of 12 variables) while the test data contains the remaining 30 percent (181 observations of 12 variables).
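
The stratify argument keeps the share of each Loan_status class the same in both sets. The sketch below checks this on a toy DataFrame with a hypothetical 70/30 class balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 70 approved and 30 rejected applications
toy = pd.DataFrame({
    "x": range(100),
    "Loan_status": ["Y"] * 70 + ["N"] * 30,
})

train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["Loan_status"], random_state=1234
)

# Share of approved loans in each set
train_share = (train["Loan_status"] == "Y").mean()
test_share = (test["Loan_status"] == "Y").mean()

print(len(train), len(test))    # 70 30
print(train_share, test_share)  # both 0.7
```

Without stratification the class shares in the two sets can drift apart by chance, which matters most for small or imbalanced datasets.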

## 5.	Step 5) Build the model

To fit a logistic regression model in Python using scikit-learn, and print the intercept and coefficients of the trained model, you can follow these steps:

In [None]:
#import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
#import numpy as np


# Assuming 'trainData' DataFrame is already defined and contains 'Loan_status' and other predictor columns


# Step 1: Prepare the data
# Convert categorical variables into dummy/indicator variables
X = pd.get_dummies(trainData.drop('Loan_status', axis=1), drop_first=True)
y = trainData['Loan_status']


# Standardize the numeric features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Step 2: Instantiate and fit the logistic regression model
model_glm = LogisticRegression(max_iter=1000, random_state=1234)
model_glm.fit(X_scaled, y)


# Step 3: Print the summary (similar to R's summary function)
# Model coefficients and intercept
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model_glm.coef_[0]})
intercept = model_glm.intercept_[0]


print("Intercept:", intercept)
print("\nCoefficients:")
print(coefficients)
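
To see how the intercept and coefficients turn into a prediction, the sketch below fits a logistic regression on a toy one-feature dataset (hypothetical values) and reproduces predict_proba by applying the logistic (sigmoid) function to the linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one standardized feature and a binary label
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0], [0.2], [-0.2]])
y = np.array(["N", "N", "N", "Y", "Y", "Y", "N", "Y"])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score: intercept + coefficient * feature
score = model.intercept_[0] + model.coef_[0, 0] * X[:, 0]

# The logistic function maps the score to P(label == model.classes_[1])
prob_manual = 1.0 / (1.0 + np.exp(-score))
prob_sklearn = model.predict_proba(X)[:, 1]

print(model.classes_)                          # ['N' 'Y']
print(np.allclose(prob_manual, prob_sklearn))  # True
```

A positive coefficient therefore raises the predicted probability of the second class ("Y" here) as the feature increases, and a negative coefficient lowers it.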


## 6.	Step 6) Assess the performance of the model

Let's now evaluate the model performance on the training data. We start by generating predictions on the training data. The algorithm will predict the Y response for the Loan_status variable.

The accuracy of the model on the training data comes out to be 81 percent.



In [None]:
#import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
#import numpy as np


# Assuming 'trainData' DataFrame is already defined and contains 'Loan_status' and other predictor columns

# Model evaluation
y_pred = model_glm.predict(X_scaled)


print("\nAccuracy:", accuracy_score(y, y_pred))
print("\nClassification Report:\n", classification_report(y, y_pred))


# Model summary similar to R's glm summary output
def logistic_regression_summary(model, X):
    summary_df = pd.DataFrame({
        'Feature': X.columns,
        'Coefficient': model.coef_[0],
        'Odds Ratio': np.exp(model.coef_[0])
    })
    return summary_df


summary = logistic_regression_summary(model_glm, X)
print("\nLogistic Regression Summary:\n")
print(summary)
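
The evaluation above uses the training data only. A held-out test set usually gives a more honest estimate of performance. The sketch below shows one way this could be done on toy data (hypothetical values): the test set is dummy-encoded and then aligned to the training columns, because a level missing from the test set would otherwise produce a different set of columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy train/test frames (hypothetical values)
train = pd.DataFrame({
    "Credit_history": [1, 1, 1, 0, 0, 1, 0, 1],
    "Property_area": ["Urban", "Rural", "Semiurban", "Urban",
                      "Rural", "Urban", "Semiurban", "Rural"],
    "Loan_status": ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"],
})
test = pd.DataFrame({
    "Credit_history": [1, 0, 1],
    "Property_area": ["Urban", "Urban", "Rural"],  # 'Semiurban' does not occur here
    "Loan_status": ["Y", "N", "Y"],
})

# Encode the training predictors
X_train = pd.get_dummies(train.drop("Loan_status", axis=1), drop_first=True, dtype=int)
y_train = train["Loan_status"]

# Encode the test predictors, then align them to the training columns
X_test = pd.get_dummies(test.drop("Loan_status", axis=1), dtype=int)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
y_test = test["Loan_status"]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=["N", "Y"])

print("Test accuracy:", test_accuracy)
print(cm)  # rows: actual N, Y; columns: predicted N, Y
```

If a scaler was fitted on the training predictors, as in Step 5, the same fitted scaler should be applied to the test predictors with transform rather than being refitted.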


## 7.	Conclusion

In this guide, you have learned techniques for building and evaluating a classification model in Python using the powerful logistic regression algorithm.