# Introduction

This project focuses on applying various data cleaning, exploratory data analysis (EDA), and data preparation techniques to a dataset containing detailed information on mortgage loan applications from the Federal Reserve Bank of Boston. The primary goal is to develop a model that can predict whether a given mortgage loan application is likely to be approved or denied, assisting regulators in identifying potential instances of discrimination in lending practices.

***

* **Variables:** The dataset comprises 2,381 observations and 13 attributes, plus an unnamed row-counter column (`Unnamed: 0`) carried over from the CSV.
    * Independent variables, named as they appear in the loaded CSV (refer to the dataset documentation for full details):
        * **dir:** Ratio of debt payments to total income.
        * **hir:** Ratio of housing expenses to income.
        * **lvr:** Loan-to-value ratio (loan size relative to the assessed property value).
        * **ccs:** Consumer credit score, from 1 to 6.
        * **mcs:** Mortgage credit score, from 1 to 4.
        * **pbcr:** Whether the applicant has a public bad credit record (yes/no).
        * **dmi:** Whether the applicant was denied mortgage insurance (yes/no).
        * **self:** Whether the applicant is self-employed (yes/no).
        * **single:** Whether the applicant is single (yes/no).
        * **uria:** Unemployment rate in the applicant's industry.
        * **comdominiom:** Whether the unit is a condominium (1 = yes, 0 = no); the misspelling is the column name in the source file.
        * **black:** Whether the applicant is Black (yes/no).

    * Dependent Variable (Response Variable):
        * **deny:** Indicates whether or not a mortgage application was denied (`yes` = Denied, `no` = Approved).

***

**Procedures:**
Here are the main procedures for this analysis:
<br>

* **Part 1**: Load Data
    * Get data from GitHub
    * Load the dataset into a Pandas DataFrame.
    <br>
    <br>
* **Part 2**: Perform Exploratory Data Analysis
    * Understanding the nature of each variable & initial inspections.
    * Perform a thorough EDA on all data attributes to understand their nature, distributions, and relationships.
    * Conduct initial inspections for missing values, invalid data values, and correct data types.
    * Create appropriate exploratory graphics (e.g., bar plots, box plots, histograms, line plots) to visualize data characteristics.
    * Identify and document all potential data integrity and usability issues, assessing which attributes may require transformation.
    <br>
    <br>
* **Part 3**: Data Preparation
    * Address the data integrity and usability issues identified during EDA.
    * Describe and justify all data transformation and preparation steps, such as:
        * Deletion of observations (if needed).
        * Imputation methods for missing data values.
        * Feature Engineering: Creation of new variables
        * Application of mathematical transforms (e.g., Box-Cox, logarithms) or binning.
    <br>
    <br>
* **Part 4**: Prepped Data Review
    * Re-run EDA analysis on variables that were adjusted during the Data Preparation phase.
    * Compare and contrast the results with the pre-preparation EDA to evaluate the impact of adjustments.
    * Clearly describe how each data preparation step has improved the dataset for machine learning algorithm suitability.
    <br>
    <br>
* **Part 5**: Regression Modeling
    * Explain and present your regression modeling work, including your feature selection work + interpretation of the coefficients your models are generating.
    * Do they make sense intuitively? If so, why? If not, why not?
    * Comment on the magnitude and direction of the coefficients + whether they are similar from model to model.
    <br>
    <br>
* **Part 6**: Model Selection
    * Explain your model selection criteria. Identify your preferred model. Compare / contrast its performance with that of your other models.
    * Discuss why you’ve selected that specific model as your preferred model. Apply your preferred model to the testing subset and discuss your results.
    * Did your preferred model perform as well as expected?
    <br>
    <br>
* **Part 7**: Conclusions
    * Summarize the key findings and insights from the entire data cleaning, preparation, and exploratory analysis process.
    * Discuss the overall readiness and improved quality of the dataset for building robust machine learning models.

## Part 1: Load Data
1. Get data from GitHub
2. Load the dataset into a Pandas DataFrame.

### 1. Get data from GitHub

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

In [11]:
# URL to HDMA_Boston_Housing_Data.csv file on GitHub
url = "https://raw.githubusercontent.com/CheliMex/CS381_DataAnalytics/refs/heads/main/Quiz4/HDMA_Boston_Housing_Data.csv"

### 2. Load the dataset into a Pandas DataFrame

In [12]:
# Loading the dataset into a Pandas DataFrame
df = pd.read_csv(url)

In [14]:
# Display the first five rows
print("--- Initial Data Load ---\n")
df.head()

--- Initial Data Load ---



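| | Unnamed: 0 | dir | hir | lvr | ccs | mcs | pbcr | dmi | self | single | uria | comdominiom | black | deny |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.221 | 0.221 | 0.8 | 5.0 | 2.0 | no | no | no | no | 3.9 | 0 | no | no |
| 1 | 2 | 0.265 | 0.265 | 0.921875 | 2.0 | 2.0 | no | no | no | yes | 3.2 | 0 | no | no |
| 2 | 3 | 0.372 | 0.248 | 0.920398 | 1.0 | 2.0 | no | no | no | no | 3.2 | 0 | no | no |
| 3 | 4 | 0.32 | 0.25 | 0.860465 | 1.0 | 2.0 | no | no | no | no | 4.3 | 0 | no | no |
| 4 | 5 | 0.36 | 0.35 | 0.6 | 1.0 | 1.0 | no | no | no | no | 3.2 | 0 | no | no |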


In [19]:
# Rename the first column ('Unnamed: 0', the row counter saved in the CSV) to 'Index'
ndf = df.rename(columns={'Unnamed: 0': 'Index'})
ndf

| | Index | dir | hir | lvr | ccs | mcs | pbcr | dmi | self | single | uria | comdominiom | black | deny |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.221000 | 0.221000 | 0.800000 | 5.000000 | 2.000000 | no | no | no | no | 3.900000 | 0 | no | no |
| 1 | 2 | 0.265000 | 0.265000 | 0.921875 | 2.000000 | 2.000000 | no | no | no | yes | 3.200000 | 0 | no | no |
| 2 | 3 | 0.372000 | 0.248000 | 0.920398 | 1.000000 | 2.000000 | no | no | no | no | 3.200000 | 0 | no | no |
| 3 | 4 | 0.320000 | 0.250000 | 0.860465 | 1.000000 | 2.000000 | no | no | no | no | 4.300000 | 0 | no | no |
| 4 | 5 | 0.360000 | 0.350000 | 0.600000 | 1.000000 | 1.000000 | no | no | no | no | 3.200000 | 0 | no | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2376 | 2377 | 0.300000 | 0.300000 | 0.777049 | 1.000000 | 2.000000 | no | no | no | yes | 3.200000 | 1 | no | no |
| 2377 | 2378 | 0.260000 | 0.200000 | 0.526761 | 2.000000 | 1.000000 | no | no | no | no | 3.100000 | 0 | no | no |
| 2378 | 2379 | 0.320000 | 0.260000 | 0.753846 | 6.000000 | 1.000000 | yes | no | no | yes | 3.100000 | 1 | yes | yes |
| 2379 | 2380 | 0.350000 | 0.260000 | 0.813559 | 2.000000 | 2.000000 | no | no | no | yes | 4.300000 | 1 | no | yes |


In [24]:
print("--- DataFrame Info ---\n")
df.info()

--- DataFrame Info ---

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2381 entries, 0 to 2380
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   2381 non-null   int64  
 1   dir          2381 non-null   float64
 2   hir          2381 non-null   float64
 3   lvr          2381 non-null   float64
 4   ccs          2381 non-null   float64
 5   mcs          2381 non-null   float64
 6   pbcr         2380 non-null   object 
 7   dmi          2381 non-null   object 
 8   self         2380 non-null   object 
 9   single       2381 non-null   object 
 10  uria         2381 non-null   float64
 11  comdominiom  2381 non-null   int64  
 12  black        2381 non-null   object 
 13  deny         2381 non-null   object 
dtypes: float64(6), int64(2), object(6)
memory usage: 260.6+ KB


In [23]:
print("--- Descriptive Statistics for Numerical Columns ---\n")
df.describe()

--- Descriptive Statistics for Numerical Columns ---



| | Unnamed: 0 | dir | hir | lvr | ccs | mcs | uria | comdominiom |
|---|---|---|---|---|---|---|---|---|
| count | 2381.0 | 2381.0 | 2381.0 | 2381.0 | 2381.0 | 2381.0 | 2381.0 | 2381.0 |
| mean | 1191.0 | 0.330814 | 0.255346 | 0.73776 | 2.116387 | 1.721008 | 3.774496 | 0.288114 |
| std | 687.479818 | 0.107235 | 0.096635 | 0.178715 | 1.66637 | 0.537169 | 2.026636 | 0.45298 |
| min | 1.0 | 0.0 | 0.0 | 0.02 | 1.0 | 1.0 | 1.8 | 0.0 |
| 25% | 596.0 | 0.28 | 0.214 | 0.65285 | 1.0 | 1.0 | 3.1 | 0.0 |
| 50% | 1191.0 | 0.33 | 0.26 | 0.779412 | 1.0 | 2.0 | 3.2 | 0.0 |
| 75% | 1786.0 | 0.37 | 0.2988 | 0.868421 | 2.0 | 2.0 | 3.9 | 1.0 |
| max | 2381.0 | 3.0 | 3.0 | 1.95 | 6.0 | 4.0 | 10.6 | 1.0 |


## Part 2: Perform Exploratory Data Analysis (EDA)
1. Understanding the nature of each variable & initial inspections.
2. Perform a thorough EDA on all data attributes to understand their nature, distributions, and relationships.
3. Create exploratory graphics (e.g., bar plots, box plots, histograms, line plots) to visualize data characteristics.
4. Conduct initial inspections for missing values, invalid data values, and correct data types.
5. Identify and document all potential data integrity and usability issues, assessing which attributes may require transformation.

### 1. Understanding each variable & initial inspections
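A first pass over the variables can be a per-column summary of data type, missing count, and number of distinct values. The sketch below is one possible way to do that; `summarize_columns` is a hypothetical helper name, and `sample` is a tiny stand-in frame built from a few rows of the `head()` output, with one value deliberately blanked to mimic the single missing `pbcr` entry reported by `df.info()`.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Stand-in data: values taken from the first rows shown earlier.
# The None in pbcr is hypothetical and only illustrates a missing entry.
sample = pd.DataFrame({
    "dir": [0.221, 0.265, 0.372],
    "ccs": [5.0, 2.0, 1.0],
    "pbcr": ["no", "no", None],
    "deny": ["no", "no", "no"],
})

summary = summarize_columns(sample)
print(summary)
```

In the notebook, the same call would be `summarize_columns(df)` on the full DataFrame.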

### 2. Perform EDA on all data attributes
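Because the response `deny` is a yes/no variable, one simple way to look at its relationship with a categorical attribute is the share of denied applications within each group. The following is a minimal sketch under that assumption; `denial_rate_by` is a hypothetical helper, and `sample` reuses five rows from the outputs shown earlier rather than the full dataset.

```python
import pandas as pd


def denial_rate_by(df: pd.DataFrame, group_col: str, target_col: str = "deny") -> pd.Series:
    """Share of rows with target_col == 'yes' within each level of group_col."""
    denied = df[target_col].eq("yes")
    return denied.groupby(df[group_col]).mean()


# Stand-in data: rows 0 and 2376-2379 from the displayed output
sample = pd.DataFrame({
    "single": ["no", "yes", "no", "yes", "yes"],
    "deny": ["no", "no", "no", "yes", "yes"],
})

rates = denial_rate_by(sample, "single")
print(rates)
```

With only five rows the rates are not meaningful; the point is the shape of the computation, which applies unchanged to `df` and to other yes/no columns such as `pbcr` or `black`.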

### 3. Exploratory graphics
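As a rough illustration of the kind of graphics meant here, the sketch below draws a histogram for one numeric attribute next to a bar plot of category counts for one yes/no attribute. `plot_overview` is a hypothetical helper, the choice of 20 bins is arbitrary, and `sample` holds the nine `lvr` and `deny` values visible in the outputs above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_overview(df: pd.DataFrame, numeric_col: str, category_col: str):
    """Histogram of a numeric column beside a bar plot of category counts."""
    fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: distribution of the numeric attribute
    ax_hist.hist(df[numeric_col].dropna(), bins=20)
    ax_hist.set_title(f"Distribution of {numeric_col}")
    ax_hist.set_xlabel(numeric_col)
    ax_hist.set_ylabel("count")

    # Right panel: frequency of each category
    counts = df[category_col].value_counts()
    ax_bar.bar(counts.index.astype(str), counts.values)
    ax_bar.set_title(f"Counts of {category_col}")
    ax_bar.set_xlabel(category_col)
    ax_bar.set_ylabel("count")

    fig.tight_layout()
    return fig


# Stand-in data: the lvr and deny values from the rows displayed earlier
sample = pd.DataFrame({
    "lvr": [0.8, 0.921875, 0.920398, 0.860465, 0.6,
            0.777049, 0.526761, 0.753846, 0.813559],
    "deny": ["no", "no", "no", "no", "no", "no", "no", "yes", "yes"],
})

fig = plot_overview(sample, "lvr", "deny")
```

In a notebook the figure renders when the cell finishes; a box plot of `lvr` split by `deny` would follow the same pattern.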

### 4. Conduct initial inspections
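The `df.info()` output shows 2380 non-null values for `pbcr` and `self`, and `df.describe()` shows a maximum of 3.0 for both `dir` and `hir`, so missing and implausible values both deserve a check. The sketch below is one way to tabulate them. The ranges in `ASSUMED_RANGES` are illustrative assumptions (ratios between 0 and 1, scores within the observed 1 to 6 and 1 to 4 scales), not limits taken from the dataset documentation, and `inspect_quality` is a hypothetical helper.

```python
import pandas as pd

# Assumed plausible ranges, chosen for illustration only
ASSUMED_RANGES = {"dir": (0.0, 1.0), "hir": (0.0, 1.0), "ccs": (1, 6), "mcs": (1, 4)}


def inspect_quality(df: pd.DataFrame, ranges=ASSUMED_RANGES):
    """Return (missing counts for columns with gaps, out-of-range counts per checked column)."""
    missing = df.isna().sum()
    missing = missing[missing > 0]

    rows = []
    for col, (low, high) in ranges.items():
        if col not in df.columns:
            continue
        # Count non-missing values that fall outside the assumed range
        out_of_range = df[col].notna() & ~df[col].between(low, high)
        rows.append({"column": col, "n_out_of_range": int(out_of_range.sum())})
    report = pd.DataFrame(rows, columns=["column", "n_out_of_range"]).set_index("column")
    return missing, report


# Stand-in data: 3.0 is the dir maximum reported by describe();
# the None in self is hypothetical and mimics the one missing entry.
sample = pd.DataFrame({
    "dir": [0.221, 3.0, 0.32],
    "ccs": [5.0, 2.0, 1.0],
    "self": ["no", None, "no"],
})

missing, report = inspect_quality(sample)
print(missing)
print(report)
```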

### 5. Identify all potential data integrity and usability issues

## Part 3: Data Preparation & Transformation

1. Describe and justify all data transformation and preparation steps.
2. Deletion of observations (if needed).
3. Imputation methods for missing data values.
4. Feature Engineering: Creation of new variables
5. Application of mathematical transforms (e.g., Box-Cox, logarithms) or binning.

### 1.
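The preparation steps listed above could be combined along the following lines. This is a sketch under stated assumptions, not the project's chosen method: it imputes missing yes/no values with the most frequent value, encodes `no`/`yes` as 0/1, and adds a `log1p`-transformed copy of `uria`. The helper name `prepare`, the new column `log_uria`, and the choice of mode imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Columns stored as 'yes'/'no' strings according to the df.info() output
YES_NO_COLS = ["pbcr", "dmi", "self", "single", "black", "deny"]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Impute and encode yes/no columns, and add a log-transformed uria column."""
    out = df.copy()
    for col in YES_NO_COLS:
        if col not in out.columns:
            continue
        # Mode imputation (assumed choice), then map to 0/1.
        # Values other than 'yes'/'no' would become NaN and make astype(int) fail.
        most_frequent = out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(most_frequent).map({"no": 0, "yes": 1}).astype(int)
    if "uria" in out.columns:
        # log1p compresses the long right tail (max 10.6 versus median 3.2)
        out["log_uria"] = np.log1p(out["uria"])
    return out


# Stand-in data; the None in pbcr is hypothetical
sample = pd.DataFrame({
    "pbcr": ["no", "no", None, "yes"],
    "deny": ["no", "no", "yes", "yes"],
    "uria": [3.9, 3.2, 3.2, 3.1],
})

prepared = prepare(sample)
print(prepared)
```

Working on a copy keeps the raw frame available for the before-and-after comparison in Part 4.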

## Part 4: Prepped Data Review
1. Re-run EDA analysis on variables that were adjusted during the Data Preparation phase.
2. Compare and contrast the results with the pre-preparation EDA to evaluate the impact of adjustments.
3. Clearly describe how each data preparation step has improved the dataset for machine learning algorithm suitability.

### 1.
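One way to compare an attribute before and after a transform is to put a shape statistic such as skewness side by side. The sketch below does this for a `log1p` transform, which is assumed here as the example adjustment. `compare_skew` is a hypothetical helper, and the seven `uria` values are the minimum, quartiles, maximum, and two row values visible in the earlier outputs, not the full column.

```python
import numpy as np
import pandas as pd


def compare_skew(before: pd.Series, after: pd.Series) -> pd.Series:
    """Sample skewness of a variable before and after a transformation."""
    return pd.Series({"before": before.skew(), "after": after.skew()})


# Stand-in values for uria taken from the describe() and head() outputs
uria = pd.Series([1.8, 3.1, 3.2, 3.2, 3.9, 4.3, 10.6])

skew_report = compare_skew(uria, np.log1p(uria))
print(skew_report)
```

A skewness closer to zero after the transform indicates a more symmetric distribution; re-plotting the histogram for the adjusted variable gives the visual counterpart.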

## Part 5: Regression Modeling
1. Explain and present your regression modeling work, including your feature selection work + interpretation of the coefficients your models are generating.
2. Do they make sense intuitively? If so, why? If not, why not? Comment on the magnitude and direction of the coefficients + whether they are similar from model to model.
    

### 1.
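Since `deny` is binary, a logistic regression is one natural candidate for this part; it is assumed here as an illustration and is not necessarily the model used in the project. The sketch fits scikit-learn's `LogisticRegression` on standardized features so that coefficient magnitudes are comparable, then converts coefficients to odds ratios. The data are synthetic: `dir` and `lvr` are drawn with roughly the means and standard deviations from the `describe()` output, and the denial rule is invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real applications)
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "dir": rng.normal(0.33, 0.10, n),
    "lvr": rng.normal(0.74, 0.18, n),
})
# Invented rule: higher dir and lvr raise the probability of denial
logit = -1.5 + 8.0 * (X["dir"] - 0.33) + 3.0 * (X["lvr"] - 0.74)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardize so each coefficient is the change in log-odds per one standard deviation
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

coefs = pd.Series(model.coef_[0], index=X.columns)
odds_ratios = np.exp(coefs)
print(pd.DataFrame({"coef": coefs, "odds_ratio": odds_ratios}))
```

A positive coefficient (odds ratio above 1) means the feature is associated with a higher chance of denial, which is the direction one would expect intuitively for debt-to-income and loan-to-value ratios.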

## Part 6: Model Selection
1. Explain your model selection criteria. Identify your preferred model. Compare / contrast its performance with that of your other models.
2. Discuss why you’ve selected that specific model as your preferred model. Apply your preferred model to the testing subset and discuss your results. Did your preferred model perform as well as expected?

### 1.
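The selection workflow described above can be sketched as: hold out a test subset, compare candidate models by cross-validation on the training subset, then score only the preferred model on the test subset. The example below assumes ROC AUC as the selection criterion and two hypothetical candidate feature sets. The data are synthetic, generated with an invented rule, so the numbers carry no meaning for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not the real applications)
rng = np.random.default_rng(1)
n = 1000
X = pd.DataFrame({
    "dir": rng.normal(0.33, 0.10, n),
    "lvr": rng.normal(0.74, 0.18, n),
})
logit = -1.5 + 8.0 * (X["dir"] - 0.33) + 3.0 * (X["lvr"] - 0.74)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Hold out 25% as the testing subset, preserving the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical candidate models, defined by their feature sets
candidates = {"dir_only": ["dir"], "dir_lvr": ["dir", "lvr"]}

# Mean 5-fold cross-validated AUC on the training subset only
cv_auc = {
    name: cross_val_score(
        LogisticRegression(), X_train[cols], y_train, cv=5, scoring="roc_auc"
    ).mean()
    for name, cols in candidates.items()
}
preferred = max(cv_auc, key=cv_auc.get)

# Refit the preferred model on all training data and score it once on the test subset
cols = candidates[preferred]
final_model = LogisticRegression().fit(X_train[cols], y_train)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test[cols])[:, 1])

print(cv_auc)
print(preferred, round(test_auc, 3))
```

Keeping the test subset out of the comparison step means the final score is an honest check on whether the preferred model performs as well as its cross-validation results suggested.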

## Part 7: Conclusions