Original dataset: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=ny&records=all-records&field_descriptions=labels
    
# Preprocessing already applied to the original dataset: 
1) Selected "no co-applicant" records from the column "co_applicant_ethnicity_name"
2) Filtered on the columns:
    - action_taken_name
    - denial_reason_name_1
3) Created the column "action_taken":
    - df['action_taken'] = df['action_taken_name'].replace({
          'Loan originated': 'approved',
          'Application approved but not accepted': 'approved',
          'Application denied by financial institution': 'denied'
      })
4) First feature selection:
    - df = df[[
          'loan_type_name',
          'property_type_name',
          'loan_purpose_name',
          'loan_amount_000s',
          'action_taken',
          'msamd_name',
          'applicant_ethnicity_name',
          'applicant_race_name_1',
          'applicant_sex_name',
          'applicant_income_000s',
          'denial_reason_name_1',
          'denial_reason_name_2',
          'denial_reason_name_3',
          'rate_spread',
          'lien_status_name',
          'minority_population',
          'hud_median_family_income',
          'tract_to_msamd_income'
      ]]
5) Excluded "Credit application incomplete" records from the column "denial_reason_name_1"
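
The preprocessing steps above could be reproduced roughly as follows. This is only a sketch on toy rows: the literal "No co-applicant" and the sample values are assumptions about the raw export, and `map()` is swapped in for the `replace()` call listed in step 3 so that any action not in the mapping becomes NaN and can be dropped explicitly.

```python
import pandas as pd

# Toy rows standing in for the raw HMDA export (illustrative values only).
raw = pd.DataFrame({
    "co_applicant_ethnicity_name": [
        "No co-applicant", "Not Hispanic or Latino",
        "No co-applicant", "No co-applicant",
    ],
    "action_taken_name": [
        "Loan originated",
        "Loan originated",
        "Application denied by financial institution",
        "Application denied by financial institution",
    ],
    "denial_reason_name_1": [None, None, "Credit history",
                             "Credit application incomplete"],
})

ACTION_MAP = {
    "Loan originated": "approved",
    "Application approved but not accepted": "approved",
    "Application denied by financial institution": "denied",
}

# Step 1: keep applications without a co-applicant (label is an assumption).
prepared = raw[raw["co_applicant_ethnicity_name"] == "No co-applicant"].copy()

# Step 3: map() yields NaN for any action outside ACTION_MAP; drop those rows.
prepared["action_taken"] = prepared["action_taken_name"].map(ACTION_MAP)
prepared = prepared.dropna(subset=["action_taken"])

# Step 5: drop incomplete applications; rows with no denial reason are kept.
prepared = prepared[
    prepared["denial_reason_name_1"] != "Credit application incomplete"
]

print(prepared[["action_taken", "denial_reason_name_1"]])
```

With `replace()`, unmapped labels such as withdrawn applications would stay in the column under their original name, so the choice between the two depends on whether those rows were already filtered out in step 2.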


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('C:/Users/nonox/hmdaNY27082024_2.csv')


print("Shape:",df.shape, "\n")

print(df.head())


Shape: (183483, 18) 

  loan_type_name                                 property_type_name  \
0   Conventional  One-to-four family dwelling (other than manufa...   
1   Conventional  One-to-four family dwelling (other than manufa...   
2   Conventional  One-to-four family dwelling (other than manufa...   
3   Conventional                               Manufactured housing   
4   Conventional  One-to-four family dwelling (other than manufa...   

  loan_purpose_name  loan_amount_000s action_taken  \
0     Home purchase               705     approved   
1       Refinancing               112     approved   
2     Home purchase               356       denied   
3     Home purchase                58       denied   
4       Refinancing               100     approved   

                                     msamd_name  \
0  New York, Jersey City, White Plains - NY, NJ   
1                                           NaN   
2  New York, Jersey City, White Plains - NY, NJ   
3                     

In [3]:
print("Column names:", '\n')
print(df.columns)

Column names: 

Index(['loan_type_name', 'property_type_name', 'loan_purpose_name',
       'loan_amount_000s', 'action_taken', 'msamd_name',
       'applicant_ethnicity_name', 'applicant_race_name_1',
       'applicant_sex_name', 'applicant_income_000s', 'denial_reason_name_1',
       'denial_reason_name_2', 'denial_reason_name_3', 'rate_spread',
       'lien_status_name', 'minority_population', 'hud_median_family_income',
       'tract_to_msamd_income'],
      dtype='object')


# A) Creation of New column "race_sex"

In [4]:
# Creating the race_sex column
df['race_sex'] = df['applicant_race_name_1'].str.lower() + "_" + df['applicant_sex_name'].str.lower()

# Checking column created
print(df[['race_sex']].value_counts())

race_sex                                                                                                                                                           
white_male                                                                                                                                                             73939
white_female                                                                                                                                                           46676
information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application    13389
asian_male                                                                                                                                                              9484
black or african american_female                                                                                                                

In [5]:
# New CSV file
df.to_csv('C:/Users/nonox/hmdaNYrace_sex.csv', index=False)

print(f"Dataset saved to '{'C:/Users/nonox/hmdaNYrace_sex.csv'}'")

Dataset saved to 'C:/Users/nonox/hmdaNYrace_sex.csv'


# B) No Info 
     We filter the cases where both race and sex are missing in the column "race_sex".
     The purpose is to inspect this subset visually. 
    


In [6]:
# Filter cases with no information for both race and sex

OnlyNoInf = df[df['race_sex'] == "information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application"]
print(OnlyNoInf)

       loan_type_name                                 property_type_name  \
0        Conventional  One-to-four family dwelling (other than manufa...   
12       Conventional  One-to-four family dwelling (other than manufa...   
20       Conventional  One-to-four family dwelling (other than manufa...   
22       Conventional  One-to-four family dwelling (other than manufa...   
33        FHA-insured  One-to-four family dwelling (other than manufa...   
...               ...                                                ...   
183453   Conventional  One-to-four family dwelling (other than manufa...   
183454   Conventional  One-to-four family dwelling (other than manufa...   
183470   Conventional  One-to-four family dwelling (other than manufa...   
183477   Conventional  One-to-four family dwelling (other than manufa...   
183481   Conventional  One-to-four family dwelling (other than manufa...   

       loan_purpose_name  loan_amount_000s action_taken  \
0          Home purchase    
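
The same filter can be written against the two source columns, which avoids repeating the long combined literal and makes the "both missing" versus "either missing" distinction explicit. A minimal sketch on toy data; the constant name `NO_INFO` and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Lower-cased label, matching the .str.lower() used when building race_sex.
NO_INFO = ("information not provided by applicant in mail, internet, "
           "or telephone application")

toy = pd.DataFrame({
    "applicant_race_name_1": ["White", NO_INFO.capitalize(),
                              NO_INFO.capitalize(), "Asian"],
    "applicant_sex_name": ["Male", NO_INFO.capitalize(),
                           "Female", NO_INFO.capitalize()],
})

race = toy["applicant_race_name_1"].str.lower()
sex = toy["applicant_sex_name"].str.lower()

# Rows where race AND sex were withheld (what the combined literal selects).
both_missing = toy[(race == NO_INFO) & (sex == NO_INFO)]

# Rows where race OR sex was withheld.
either_missing = toy[(race == NO_INFO) | (sex == NO_INFO)]

print(len(both_missing), len(either_missing))
```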

In [7]:
# New CSV file with only the records where race and sex are missing
OnlyNoInf.to_csv('C:/Users/nonox/OnlyNoInf.csv', index=False)

print(f"Filtered dataset saved to '{'C:/Users/nonox/OnlyNoInf.csv'}'")

Filtered dataset saved to 'C:/Users/nonox/OnlyNoInf.csv'


# C) New column "ethnicity_race_sex"

In [8]:
# Creating the ethnicity_race_sex
df['ethnicity_race_sex'] = df['applicant_ethnicity_name'].str.lower() + "_" + df['applicant_race_name_1'].str.lower() + "_" + df['applicant_sex_name'].str.lower()

# Checking column created
print(df[['ethnicity_race_sex']].value_counts())

ethnicity_race_sex                                                                                                                                                                                                                                   
not hispanic or latino_white_male                                                                                                                                                                                                                        66819
not hispanic or latino_white_female                                                                                                                                                                                                                      42594
information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone applicatio

In [9]:
# Check records for each column
for column in df:
    print(f"Distinct records and their counts for '{column}':")
    print(df[column].value_counts())
    print("\n" + "=+"*50 + "\n")


Distinct records and their counts for 'loan_type_name':
loan_type_name
Conventional          148599
FHA-insured            27042
VA-guaranteed           6474
FSA/RHS-guaranteed      1368
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Distinct records and their counts for 'property_type_name':
property_type_name
One-to-four family dwelling (other than manufactured housing)    175723
Manufactured housing                                               4648
Multifamily dwelling                                               3112
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Distinct records and their counts for 'loan_purpose_name':
loan_purpose_name
Home purchase       96867
Refinancing         58950
Home improvement    27666
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
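
Note that `value_counts()` skips NaN by default, so columns such as `msamd_name` (which shows NaN in the preview above) can sum to fewer rows than the DataFrame holds. A small sketch of the difference, using made-up values:

```python
import pandas as pd

# Toy column with two missing entries (values are illustrative only).
toy = pd.DataFrame({"msamd_name": ["Albany", None, "Albany", None, "Buffalo"]})

default_counts = toy["msamd_name"].value_counts()              # NaN excluded
with_missing = toy["msamd_name"].value_counts(dropna=False)    # NaN included
missing_rows = int(toy["msamd_name"].isna().sum())

print(default_counts.sum(), with_missing.sum(), missing_rows)
```

Passing `dropna=False` inside the loop would surface the missing values for every column in one pass.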

In [10]:
# Drop the intermediate "race_sex" column

df = df.drop(columns=['race_sex'])

In [11]:
# Save the dataset, which now keeps "ethnicity_race_sex" but no longer "race_sex"

df.to_csv('C:/Users/nonox/hmdaNY_eRs_29082024_1251.csv', index=False)

print(f"Filtered dataset saved to '{'C:/Users/nonox/hmdaNY_eRs_29082024_1251.csv'}'")

Filtered dataset saved to 'C:/Users/nonox/hmdaNY_eRs_29082024_1251.csv'


In [12]:
df = pd.read_csv('C:/Users/nonox/hmdaNY_eRs_29082024_1251.csv')

# Print the distinct values and their counts for each column
for column in df:
    print(f"Distinct records and their counts for '{column}':")
    print(df[column].value_counts())
    print("\n" + "=+"*50 + "\n")

Distinct records and their counts for 'loan_type_name':
loan_type_name
Conventional          148599
FHA-insured            27042
VA-guaranteed           6474
FSA/RHS-guaranteed      1368
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Distinct records and their counts for 'property_type_name':
property_type_name
One-to-four family dwelling (other than manufactured housing)    175723
Manufactured housing                                               4648
Multifamily dwelling                                               3112
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Distinct records and their counts for 'loan_purpose_name':
loan_purpose_name
Home purchase       96867
Refinancing         58950
Home improvement    27666
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

In [13]:
pd.set_option('display.max_rows', None)
print(df['ethnicity_race_sex'].value_counts())

ethnicity_race_sex
not hispanic or latino_white_male                                                                                                                                                                                                                        66819
not hispanic or latino_white_female                                                                                                                                                                                                                      42594
information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application    13223
not hispanic or latino_asian_male                                                                                                                                                                                       

# D) QUESTIONS 

### What is the ratio of "denials" to "approvals" for each record-type from the column "ethnicity_race_sex"?

In [14]:
# We calculate the ratio of denials to approvals for each record-type
scenarios = [
    "not hispanic or latino_white_male",
    "not hispanic or latino_white_female",
    "not hispanic or latino_white_information not provided by applicant in mail, internet, or telephone application",
    "not hispanic or latino_white_not applicable",
    "not hispanic or latino_black or african american_male",
    "not hispanic or latino_black or african american_female",
    "not hispanic or latino_black or african american_information not provided by applicant in mail, internet, or telephone application",
    "not hispanic or latino_asian_male",
    "not hispanic or latino_asian_female",
    "not hispanic or latino_asian_information not provided by applicant in mail, internet, or telephone application",
    "not hispanic or latino_asian_not applicable",
    "not hispanic or latino_american indian or alaska native_male",
    "not hispanic or latino_american indian or alaska native_female",
    "not hispanic or latino_american indian or alaska native_information not provided by applicant in mail, internet, or telephone application",
    "not hispanic or latino_native hawaiian or other pacific islander_male",
    "not hispanic or latino_native hawaiian or other pacific islander_female",
    "hispanic or latino_white_male",
    "hispanic or latino_white_female",
    "hispanic or latino_white_information not provided by applicant in mail, internet, or telephone application",
    "hispanic or latino_black or african american_male",
    "hispanic or latino_black or african american_female",
    "hispanic or latino_black or african american_information not provided by applicant in mail, internet, or telephone application",
    "hispanic or latino_asian_male",
    "hispanic or latino_asian_female",
    "hispanic or latino_native hawaiian or other pacific islander_male",
    "hispanic or latino_native hawaiian or other pacific islander_female",
    "hispanic or latino_american indian or alaska native_male",
    "hispanic or latino_american indian or alaska native_female",
    "information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application",
    "information not provided by applicant in mail, internet, or telephone application_male",
    "information not provided by applicant in mail, internet, or telephone application_female",
    "information not provided by applicant in mail, internet, or telephone application_white_male",
    "information not provided by applicant in mail, internet, or telephone application_white_female",
    "information not provided by applicant in mail, internet, or telephone application_white_information not provided by applicant in mail, internet, or telephone application",
    "information not provided by applicant in mail, internet, or telephone application_black or african american_male",
    "information not provided by applicant in mail, internet, or telephone application_black or african american_female",
    "information not provided by applicant in mail, internet, or telephone application_black or african american_information not provided by applicant in mail, internet, or telephone application",
    "information not provided by applicant in mail, internet, or telephone application_asian_male",
    "information not provided by applicant in mail, internet, or telephone application_asian_female",
    "information not provided by applicant in mail, internet, or telephone application_asian_information not provided by applicant in mail, internet, or telephone application",
    "information not provided by applicant in mail, internet, or telephone application_american indian or alaska native_male",
    "information not provided by applicant in mail, internet, or telephone application_american indian or alaska native_female",
    "information not provided by applicant in mail, internet, or telephone application_american indian or alaska native_information not provided by applicant in mail, internet, or telephone application",
    "information not provided by applicant in mail, internet, or telephone application_native hawaiian or other pacific islander_male",
    "information not provided by applicant in mail, internet, or telephone application_native hawaiian or other pacific islander_female",
    "not applicable_white_male",
    "not applicable_white_female",
    "not applicable_white_not applicable",
    "not applicable_black or african american_male",
    "not applicable_black or african american_female",
    "not applicable_asian_male",
    "not applicable_asian_female",
    "not applicable_not applicable_not applicable",
    "not applicable_not applicable_male",
    "not applicable_not applicable_female",
    "not applicable_information not provided by applicant in mail, internet, or telephone application_male",
    "not applicable_information not provided by applicant in mail, internet, or telephone application_female",
    "not applicable_information not provided by applicant in mail, internet, or telephone application_not applicable",
    "not applicable_not applicable_information not provided by applicant in mail, internet, or telephone application",
    "not applicable_information not provided by applicant in mail, internet, or telephone application_information not provided by applicant in mail, internet, or telephone application"
]


# Dictionary to store results
ratios = {}

# Calculate the ratio for each record-case and store the results in the previous dictionary
for scenario in scenarios:
    # Filtering data
    scenarios_df = df[df['ethnicity_race_sex'] == scenario]
    
    # Counting
    approved_count = scenarios_df[scenarios_df['action_taken'] == "approved"].shape[0]  # shape[0] is the number of matching rows
    denied_count = scenarios_df[scenarios_df['action_taken'] == "denied"].shape[0]
    
    # Calculate the ratio of "denied" to "approved"
    if approved_count == 0:  # Avoid division by zero
        ratio = None
    else:
        ratio = denied_count / approved_count
    
    # Store the result
    ratios[scenario] = ratio

# Print the results
for scenario, ratio in ratios.items():
    print(f"Ratio of 'Application denied' to 'approved' for '{scenario}': {ratio}")


Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_white_male': 0.2345311778290993
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_white_female': 0.24068626023127784
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_white_information not provided by applicant in mail, internet, or telephone application': 0.2605042016806723
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_white_not applicable': 0.25
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_black or african american_male': 0.5211267605633803
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_black or african american_female': 0.503062787136294
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_black or african american_information not provided by applicant in mail, internet, or telephone application': 1.2
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_a
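
The same ratios can be obtained without listing every category by hand. The sketch below uses `pd.crosstab` on toy data; the group labels and counts are made up, and groups with zero approvals come out as NaN instead of None.

```python
import pandas as pd

# Toy data: group labels and outcomes are illustrative only.
toy = pd.DataFrame({
    "ethnicity_race_sex": ["group_a"] * 3 + ["group_b"] * 2 + ["group_c"],
    "action_taken": ["approved", "approved", "denied",
                     "approved", "denied", "denied"],
})

# One row per group, one column per outcome.
counts = pd.crosstab(toy["ethnicity_race_sex"], toy["action_taken"])
counts = counts.reindex(columns=["approved", "denied"], fill_value=0)

# Mask zero approvals so the division yields NaN rather than inf.
approved = counts["approved"].where(counts["approved"] > 0)
ratio = counts["denied"] / approved

print(ratio)
```

Because the groups are taken from the data, a category that is missing from a hand-written list cannot be skipped silently.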

#### What is the ratio of "denials" to "approvals" for each record-type from the column "ethnicity_race_sex" with no missing info?

In [15]:
# We calculate the ratio of denials to approvals for each record-type with complete information
scenarios = ["not hispanic or latino_white_male",
            "not hispanic or latino_white_female",
            "not hispanic or latino_black or african american_male",
            "not hispanic or latino_black or african american_female",
            "not hispanic or latino_asian_male",
            "not hispanic or latino_asian_female",
            "not hispanic or latino_american indian or alaska native_male",
            "not hispanic or latino_american indian or alaska native_female",
            "not hispanic or latino_native hawaiian or other pacific islander_male",
            "not hispanic or latino_native hawaiian or other pacific islander_female",
            "hispanic or latino_white_male",
            "hispanic or latino_white_female",
            "hispanic or latino_black or african american_male",
            "hispanic or latino_black or african american_female",
            "hispanic or latino_asian_male",
            "hispanic or latino_asian_female",
            "hispanic or latino_native hawaiian or other pacific islander_male",
            "hispanic or latino_native hawaiian or other pacific islander_female",
            "hispanic or latino_american indian or alaska native_male",
            "hispanic or latino_american indian or alaska native_female"
            ]


# Dictionary to store results
ratios = {}

# Calculate the ratio for each record-case and store the results
for scenario in scenarios:
    # Filtering data
    scenarios_df = df[df['ethnicity_race_sex'] == scenario]
    
    # Counting
    approved_count = scenarios_df[scenarios_df['action_taken'] == "approved"].shape[0]  # shape[0] is the number of matching rows
    denied_count = scenarios_df[scenarios_df['action_taken'] == "denied"].shape[0]
    
    # Calculate the ratio of "denied" to "approved"
    if approved_count == 0:  # Avoid division by zero
        ratio = None
    else:
        ratio = denied_count / approved_count
    
    # Store the result
    ratios[scenario] = ratio

# Print the results
for scenario, ratio in ratios.items():
    print(f"Ratio of 'Application denied' to 'approved' for '{scenario}': {ratio}")

Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_white_male': 0.2345311778290993
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_white_female': 0.24068626023127784
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_black or african american_male': 0.5211267605633803
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_black or african american_female': 0.503062787136294
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_asian_male': 0.2380381180860989
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_asian_female': 0.22051479062620052
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_american indian or alaska native_male': 0.592057761732852
Ratio of 'Application denied' to 'approved' for 'not hispanic or latino_american indian or alaska native_female': 0.6175115207373272
Ratio of 'Application denied' to 'approved' for 'not hispanic or l

# E) Excluding missing information from ethnicity_race_sex

In [16]:

# Target ethnicity_race_sex not affected by missing information
ethnicity_race_sex = [
    "not hispanic or latino_white_male",
    "not hispanic or latino_white_female",
    "not hispanic or latino_black or african american_male",
    "not hispanic or latino_black or african american_female",
    "not hispanic or latino_asian_male",
    "not hispanic or latino_asian_female",
    "not hispanic or latino_american indian or alaska native_male",
    "not hispanic or latino_american indian or alaska native_female",
    "not hispanic or latino_native hawaiian or other pacific islander_male",
    "not hispanic or latino_native hawaiian or other pacific islander_female",
    "hispanic or latino_white_male",
    "hispanic or latino_white_female",
    "hispanic or latino_black or african american_male",
    "hispanic or latino_black or african american_female",
    "hispanic or latino_asian_male",
    "hispanic or latino_asian_female",
    "hispanic or latino_native hawaiian or other pacific islander_male",
    "hispanic or latino_native hawaiian or other pacific islander_female",
    "hispanic or latino_american indian or alaska native_male",
    "hispanic or latino_american indian or alaska native_female"
]


df = df[df['ethnicity_race_sex'].isin(ethnicity_race_sex)]
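
The 20 strings above are every combination of 2 ethnicities, 5 races and 2 sexes, so the same filter can be expressed on the three source columns. A hedged sketch on toy rows; the set names and sample data are assumptions for illustration.

```python
import pandas as pd

# Lower-cased categories that count as "complete" information.
ETHNICITIES = {"hispanic or latino", "not hispanic or latino"}
RACES = {
    "white",
    "black or african american",
    "asian",
    "american indian or alaska native",
    "native hawaiian or other pacific islander",
}
SEXES = {"male", "female"}

# Toy rows (illustrative only): just the first one is fully specified.
toy = pd.DataFrame({
    "applicant_ethnicity_name": ["Not Hispanic or Latino",
                                 "Hispanic or Latino", "Not applicable"],
    "applicant_race_name_1": ["White", "Asian", "White"],
    "applicant_sex_name": ["Male", "Not applicable", "Female"],
})

complete = (
    toy["applicant_ethnicity_name"].str.lower().isin(ETHNICITIES)
    & toy["applicant_race_name_1"].str.lower().isin(RACES)
    & toy["applicant_sex_name"].str.lower().isin(SEXES)
)
filtered = toy[complete]

print(filtered)
```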

In [17]:
print(df.columns)

Index(['loan_type_name', 'property_type_name', 'loan_purpose_name',
       'loan_amount_000s', 'action_taken', 'msamd_name',
       'applicant_ethnicity_name', 'applicant_race_name_1',
       'applicant_sex_name', 'applicant_income_000s', 'denial_reason_name_1',
       'denial_reason_name_2', 'denial_reason_name_3', 'rate_spread',
       'lien_status_name', 'minority_population', 'hud_median_family_income',
       'tract_to_msamd_income', 'ethnicity_race_sex'],
      dtype='object')


In [18]:
# Saving .csv file
df.to_csv('C:/Users/nonox/hmdaNY29082024_1407_2.csv', index=False)

print(f"Dataset saved to '{'C:/Users/nonox/hmdaNY29082024_1407_2.csv'}'")


Dataset saved to 'C:/Users/nonox/hmdaNY29082024_1407_2.csv'


In [19]:
df = pd.read_csv('C:/Users/nonox/hmdaNY29082024_1407_2.csv')

# Print the distinct values and their counts for each column
for column in df:
    print(f"Distinct records and their counts for '{column}':")
    print(df[column].value_counts())
    print("\n" + "=+"*50 + "\n")

Distinct records and their counts for 'loan_type_name':
loan_type_name
Conventional          121855
FHA-insured            23074
VA-guaranteed           5406
FSA/RHS-guaranteed      1280
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Distinct records and their counts for 'property_type_name':
property_type_name
One-to-four family dwelling (other than manufactured housing)    147493
Manufactured housing                                               3937
Multifamily dwelling                                                185
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Distinct records and their counts for 'loan_purpose_name':
loan_purpose_name
Home purchase       82659
Refinancing         45803
Home improvement    23153
Name: count, dtype: int64

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

In [20]:
print(df.shape)

(151615, 19)


In [21]:
print(df.columns)

Index(['loan_type_name', 'property_type_name', 'loan_purpose_name',
       'loan_amount_000s', 'action_taken', 'msamd_name',
       'applicant_ethnicity_name', 'applicant_race_name_1',
       'applicant_sex_name', 'applicant_income_000s', 'denial_reason_name_1',
       'denial_reason_name_2', 'denial_reason_name_3', 'rate_spread',
       'lien_status_name', 'minority_population', 'hud_median_family_income',
       'tract_to_msamd_income', 'ethnicity_race_sex'],
      dtype='object')


# F) More questions

#### Is there any relationship between missing values in minority_population and the approval/denial rate?

#### How does it break down by ethnicity, race, and sex?


In [22]:
import pandas as pd

# Step 1: Checking for missing values in the 'minority_population' column
df['minority_population_missing'] = df['minority_population'].isnull()

# Step 2: Grouping the data by 'minority_population_missing', 'action_taken', and 'ethnicity_race_sex'
grouped = df.groupby(['minority_population_missing', 'action_taken', 'ethnicity_race_sex']).size().unstack(fill_value=0)

# Step 3: Convert counts to row percentages, i.e. each group's share within a (missing flag, action_taken) row
approval_rate = grouped.div(grouped.sum(axis=1), axis=0) * 100

# Step 4: Display the results
print("Approval/Denial Rates by minority_population_missing and Race:")
print(approval_rate)

Approval/Denial Rates by minority_population_missing and Race:
ethnicity_race_sex                        hispanic or latino_american indian or alaska native_female  \
minority_population_missing action_taken                                                               
False                       approved                                               0.033520            
                            denied                                                 0.124992            
True                        approved                                               0.000000            
                            denied                                                 0.000000            

ethnicity_race_sex                        hispanic or latino_american indian or alaska native_male  \
minority_population_missing action_taken                                                             
False                       approved                                               0.060336          
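
The percentages printed here are each group's share within a (missing flag, action_taken) row, not approval or denial rates. If the goal is the rate itself, one option is to normalise over the outcomes instead. A minimal sketch on toy data (the values are made up):

```python
import pandas as pd

# Toy data: two rows have no minority_population value.
toy = pd.DataFrame({
    "minority_population": [12.5, None, 40.0, None, 8.1, 33.3],
    "action_taken": ["approved", "denied", "denied",
                     "approved", "approved", "approved"],
})
toy["minority_population_missing"] = toy["minority_population"].isna()

# normalize="index" makes each row (missing flag) sum to 1 across outcomes.
rates = pd.crosstab(
    toy["minority_population_missing"],
    toy["action_taken"],
    normalize="index",
) * 100

print(rates.round(1))
```

Adding `ethnicity_race_sex` as a second row key in the `crosstab` call would give the same rates per group.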
      

### What is the ratio of applications of men vs women?

In [23]:
# Step 1: Count the number of applications for men and women
male_count = df[df['applicant_sex_name'] == 'Male'].shape[0]
female_count = df[df['applicant_sex_name'] == 'Female'].shape[0]

# Step 2: Calculate the ratio of women to men
if male_count == 0:  # Avoid division by zero
    ratio = None
else:
    ratio = female_count / male_count

# Step 3: Display the results
print(f"Number of applications by men: {male_count}")
print(f"Number of applications by women: {female_count}")
print(f"Ratio of applications (women to men): {ratio}")


Number of applications by men: 90250
Number of applications by women: 61365
Ratio of applications (women to men): 0.6799445983379502


### What is the average loan amount requested by men vs women?

In [24]:
# Step 1: Filter the data for men and women
male_loans = df[df['applicant_sex_name'] == 'Male']['loan_amount_000s']
female_loans = df[df['applicant_sex_name'] == 'Female']['loan_amount_000s']

# Step 2: Calculate the average loan amount for men and women
average_male_loan = male_loans.mean()
average_female_loan = female_loans.mean()

# Step 3: Display the results
print(f"Average loan amount requested by men: ${average_male_loan:.2f} (in thousands)")
print(f"Average loan amount requested by women: ${average_female_loan:.2f} (in thousands)")


Average loan amount requested by men: $279.14 (in thousands)
Average loan amount requested by women: $212.62 (in thousands)


### What is the average income (in thousands) per ethnicity, race, and sex?

In [25]:
# Step 1: Group the data by 'ethnicity_race_sex' and calculate the average income
average_income_by_ethnicity_race_sex = df.groupby('ethnicity_race_sex')['applicant_income_000s'].mean()

# Step 2: Display the results
print("Average Applicant Income by Ethnicity, Race, and Sex:")
print(average_income_by_ethnicity_race_sex)


Average Applicant Income by Ethnicity, Race, and Sex:
ethnicity_race_sex
hispanic or latino_american indian or alaska native_female                  76.615385
hispanic or latino_american indian or alaska native_male                    94.696296
hispanic or latino_asian_female                                             82.377358
hispanic or latino_asian_male                                              100.626866
hispanic or latino_black or african american_female                         77.077151
hispanic or latino_black or african american_male                           85.901316
hispanic or latino_native hawaiian or other pacific islander_female         78.260000
hispanic or latino_native hawaiian or other pacific islander_male           77.255319
hispanic or latino_white_female                                             82.790662
hispanic or latino_white_male                                              103.220159
not hispanic or latino_american indian or alaska native_female     

### What is the avg of "rate_spread" for each group?

In [26]:

# Step 1: Group the data by 'ethnicity_race_sex' and calculate the average rate spread
average_rate_spread_by_ethnicity_race_sex = df.groupby('ethnicity_race_sex')['rate_spread'].mean()

# Step 2: Display the results
print("Average Rate Spread by Ethnicity, Race, and Sex:")
print(average_rate_spread_by_ethnicity_race_sex)


Average Rate Spread by Ethnicity, Race, and Sex:
ethnicity_race_sex
hispanic or latino_american indian or alaska native_female                 1.555000
hispanic or latino_american indian or alaska native_male                   1.600000
hispanic or latino_asian_female                                            1.775000
hispanic or latino_asian_male                                              1.830000
hispanic or latino_black or african american_female                        2.178667
hispanic or latino_black or african american_male                          1.990000
hispanic or latino_native hawaiian or other pacific islander_female        2.200000
hispanic or latino_native hawaiian or other pacific islander_male          1.626667
hispanic or latino_white_female                                            2.126797
hispanic or latino_white_male                                              2.123458
not hispanic or latino_american indian or alaska native_female             3.936364
not hisp
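
pandas' `mean()` skips empty (NaN) cells, so each average above is computed only from the rows of a group that actually report a rate spread. A minimal sketch on made-up toy data (the `toy` frame and the output column names `avg_rate_spread` and `n_reported` are illustrative assumptions, not part of this notebook) of printing the average next to the number of values behind it:

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are made up for illustration).
toy = pd.DataFrame({
    "ethnicity_race_sex": ["a_x_female", "a_x_female", "a_x_female", "b_y_male", "b_y_male"],
    "rate_spread": [1.5, None, None, 2.0, 3.0],
})

# mean() and count() both ignore NaN, so 'n_reported' is the number of rows
# that actually contribute to each group's average.
summary = (
    toy.groupby("ethnicity_race_sex")["rate_spread"]
       .agg(avg_rate_spread="mean", n_reported="count")
)
print(summary)
```

Reading the two columns together helps flag groups whose average rests on only a handful of loans.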

### Which group has the most empty cells in the column "rate_spread" (which means they are not being overcharged)?

In [27]:
# Step 1: Group the data by 'ethnicity_race_sex' and count the null values in 'rate_spread'
null_rate_spread_by_ethnicity_race_sex = df.groupby('ethnicity_race_sex')['rate_spread'].apply(lambda x: x.isnull().sum())

# Step 2: Display the results
print("Null values in 'rate_spread' by Ethnicity, Race, and Sex:")
print(null_rate_spread_by_ethnicity_race_sex)


Null values in 'rate_spread' by Ethnicity, Race, and Sex:
ethnicity_race_sex
hispanic or latino_american indian or alaska native_female                    78
hispanic or latino_american indian or alaska native_male                     157
hispanic or latino_asian_female                                               53
hispanic or latino_asian_male                                                 70
hispanic or latino_black or african american_female                          368
hispanic or latino_black or african american_male                            324
hispanic or latino_native hawaiian or other pacific islander_female           58
hispanic or latino_native hawaiian or other pacific islander_male             98
hispanic or latino_white_female                                             3348
hispanic or latino_white_male                                               5904
not hispanic or latino_american indian or alaska native_female               340
not hispanic or latino_american 

### What is the % of null cells compared to not null (not overcharged vs overcharged)?

In [28]:

# Step 1: Group the data by 'ethnicity_race_sex' and count the total number of rows and null values in 'rate_spread'
rate_spread_by_group = df.groupby('ethnicity_race_sex')['rate_spread']
total_count_by_group = rate_spread_by_group.size()  # size() counts every row, null or not
null_count_by_group = rate_spread_by_group.apply(lambda x: x.isnull().sum())

# Step 2: Calculate the percentage of null values for each group
percentage_null_by_group = (null_count_by_group / total_count_by_group) * 100

# Step 3: Display the results
print("Percentage of Null values in 'rate_spread' by Ethnicity, Race, and Sex:")
print(percentage_null_by_group)


#https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
#https://www.geeksforgeeks.org/applying-lambda-functions-to-pandas-dataframe/
#https://github.com/LDSSA/batch4-students
#https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/
#https://realpython.com/python-lambda/

Percentage of Null values in 'rate_spread' by Ethnicity, Race, and Sex:
ethnicity_race_sex
hispanic or latino_american indian or alaska native_female                 97.500000
hispanic or latino_american indian or alaska native_male                   98.742138
hispanic or latino_asian_female                                            96.363636
hispanic or latino_asian_male                                              98.591549
hispanic or latino_black or african american_female                        96.083551
hispanic or latino_black or african american_male                          95.575221
hispanic or latino_native hawaiian or other pacific islander_female        98.305085
hispanic or latino_native hawaiian or other pacific islander_male          97.029703
hispanic or latino_white_female                                            96.317606
hispanic or latino_white_male                                              95.241168
not hispanic or latino_american indian or alaska native_fem
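
The same percentage can also be obtained in one pass, because the mean of a True/False mask is the fraction of True values. A minimal sketch on made-up toy data (the `toy` frame and the group labels `g1` and `g2` are illustrative assumptions):

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are made up for illustration).
toy = pd.DataFrame({
    "ethnicity_race_sex": ["g1", "g1", "g1", "g1", "g2", "g2"],
    "rate_spread": [None, None, None, 1.5, None, 2.0],
})

# isnull() marks each row True/False; averaging the mask within a group
# gives the share of null cells, which is then scaled to a percentage.
pct_null = (
    toy["rate_spread"].isnull()
       .groupby(toy["ethnicity_race_sex"])
       .mean() * 100
)
print(pct_null)
```

Here `g1` has three empty cells out of four and `g2` one out of two.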

In [29]:

# Group by 'ethnicity_race_sex' and 'applicant_sex_name', then count
result = df.groupby(['ethnicity_race_sex', 'applicant_sex_name']).size().unstack(fill_value=0)

# Rename columns for clarity
result = result.rename(columns={'Male': 'Total_Males', 'Female': 'Total_Females'})

# Sort by Total_Males in descending order (you can change this if needed)
result_sorted = result.sort_values('Total_Males', ascending=False)

# Display the results
print(result_sorted)

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unstack.html

applicant_sex_name                                  Total_Females  Total_Males
ethnicity_race_sex                                                            
not hispanic or latino_white_male                               0        66819
not hispanic or latino_asian_male                               0         9289
not hispanic or latino_black or african america...              0         6588
hispanic or latino_white_male                                   0         6199
not hispanic or latino_american indian or alask...              0          441
hispanic or latino_black or african american_male               0          339
not hispanic or latino_native hawaiian or other...              0          244
hispanic or latino_american indian or alaska na...              0          159
hispanic or latino_native hawaiian or other pac...              0          101
hispanic or latino_asian_male                                   0           71
not hispanic or latino_white_female                 
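
Every 'ethnicity_race_sex' label already ends in the applicant's sex, so each row of the table above has a zero in one of its two columns. A minimal sketch, on made-up toy data, of cross-tabulating an ethnicity and race label against sex instead, so that both counts sit in the same row. The `toy` frame and the helper name `ethnicity_race` are assumptions for illustration and do not exist in this notebook:

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are made up for illustration).
toy = pd.DataFrame({
    "applicant_ethnicity_name": ["Not Hispanic or Latino"] * 3 + ["Hispanic or Latino"] * 2,
    "applicant_race_name_1": ["White", "White", "Asian", "White", "White"],
    "applicant_sex_name": ["Male", "Female", "Male", "Female", "Female"],
})

# Hypothetical helper label: ethnicity and race only, without the sex component.
ethnicity_race = (
    toy["applicant_ethnicity_name"].str.lower()
    + "_"
    + toy["applicant_race_name_1"].str.lower()
)

# Rows: ethnicity_race label; columns: applicant sex; cells: number of applicants.
counts = pd.crosstab(ethnicity_race, toy["applicant_sex_name"],
                     rownames=["ethnicity_race"])
print(counts)
```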

In [30]:

# Group by 'ethnicity_race_sex' and 'applicant_sex_name', then count
result = df.groupby(['ethnicity_race_sex', 'applicant_sex_name']).size().unstack(fill_value=0)

# Rename columns for clarity
result = result.rename(columns={'Male': 'Total_Males', 'Female': 'Total_Females'})

# Sort by Total_Males in descending order (you can change this if needed)
result_sorted = result.sort_values('Total_Males', ascending=False)

# Save the results to an Excel file
excel_filename = 'ethnicity_race_sex_counts.xlsx'
result_sorted.to_excel(excel_filename)

# Display the results
print(result_sorted)
print("\n", "\n")
print(f"Results have been saved to {excel_filename}")

applicant_sex_name                                  Total_Females  Total_Males
ethnicity_race_sex                                                            
not hispanic or latino_white_male                               0        66819
not hispanic or latino_asian_male                               0         9289
not hispanic or latino_black or african america...              0         6588
hispanic or latino_white_male                                   0         6199
not hispanic or latino_american indian or alask...              0          441
hispanic or latino_black or african american_male               0          339
not hispanic or latino_native hawaiian or other...              0          244
hispanic or latino_american indian or alaska na...              0          159
hispanic or latino_native hawaiian or other pac...              0          101
hispanic or latino_asian_male                                   0           71
not hispanic or latino_white_female                 

In [31]:

# Group by 'ethnicity_race_sex' and 'applicant_sex_name', then count
result = df.groupby(['ethnicity_race_sex', 'applicant_sex_name']).size().unstack(fill_value=0)

# Rename columns for clarity
result = result.rename(columns={'Male': 'Total_Males', 'Female': 'Total_Females'})

# Sort by the index (ethnicity_race_sex) alphabetically
result_sorted = result.sort_index()

# Save the results to an Excel file
excel_filename = 'ethnicity_race_sex_counts_ordered.xlsx'
result_sorted.to_excel(excel_filename)

print(f"Results have been saved to {excel_filename}")

Results have been saved to ethnicity_race_sex_counts_ordered.xlsx


In [32]:
print(df.columns)

Index(['loan_type_name', 'property_type_name', 'loan_purpose_name',
       'loan_amount_000s', 'action_taken', 'msamd_name',
       'applicant_ethnicity_name', 'applicant_race_name_1',
       'applicant_sex_name', 'applicant_income_000s', 'denial_reason_name_1',
       'denial_reason_name_2', 'denial_reason_name_3', 'rate_spread',
       'lien_status_name', 'minority_population', 'hud_median_family_income',
       'tract_to_msamd_income', 'ethnicity_race_sex',
       'minority_population_missing'],
      dtype='object')


Original dataset: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=ny&records=all-records&field_descriptions=labels
    
# So far from original dataset: 
1) Selected "no co-applicant" records from the column "co_applicant_ethnicity_name"
2) Filter columns:
    - action_taken_name
    - denial_reason_name_1
3) Created the column "action_taken":
    - df['action_taken'] = df['action_taken_name'].replace({
    'Loan originated': 'approved', 
    'Application approved but not accepted': 'approved',
    'Application denied by financial institution': 'denied'
})  
4) First feature selection:
    - df = df[

        ['loan_type_name',
        'property_type_name',
        'loan_purpose_name',
        'loan_amount_000s',
        'action_taken',
        'msamd_name',
        'applicant_ethnicity_name', 
        'applicant_race_name_1',
        'applicant_sex_name',
        'applicant_income_000s', 
        'denial_reason_name_1',
        'denial_reason_name_2',
        'denial_reason_name_3', 
        'rate_spread',
        'lien_status_name',
        'minority_population',
        'hud_median_family_income',
        'tract_to_msamd_income']

        ]
5) Excluded "Credit application incomplete" records from the column "denial_reason_name_1"
6) New column "ethnicity_race_sex"
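
Steps 5 and 6 can be sketched as follows. This is an illustration on made-up toy data, not the exact code used for this dataset: the `toy` frame is an assumption, and the lower-case, underscore-joined label format is inferred from the group names printed in the outputs above.

```python
import pandas as pd

# Toy data standing in for the filtered HMDA frame (values are made up).
toy = pd.DataFrame({
    "applicant_ethnicity_name": ["Hispanic or Latino", "Not Hispanic or Latino", "Not Hispanic or Latino"],
    "applicant_race_name_1": ["White", "Asian", "White"],
    "applicant_sex_name": ["Female", "Male", "Male"],
    "denial_reason_name_1": [None, "Credit application incomplete", "Debt-to-income ratio"],
})

# Step 5: drop incomplete applications; rows with no denial reason are kept.
toy = toy[toy["denial_reason_name_1"] != "Credit application incomplete"].copy()

# Step 6: join ethnicity, race and sex with underscores and lower-case the result.
toy["ethnicity_race_sex"] = (
    toy["applicant_ethnicity_name"]
       .str.cat([toy["applicant_race_name_1"], toy["applicant_sex_name"]], sep="_")
       .str.lower()
)
print(toy["ethnicity_race_sex"].tolist())
```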
