In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [18]:
import pandas as pd

file_path = '/content/drive/MyDrive/test_2023_public_lar.csv'

df = pd.read_csv(file_path)

df.head()

Unnamed: 0,LoanAmount,AppraisedValue,LoanToValue_LTVratio,InterestRate,LoanTerm,LoanStatus,DenialReason,PropertyType,ConstructionMethod,OccupancyType,...,MetroArea,Race,Ethnicity,Sex,ApplicantIncome,DebtToIncomeRatio_DTI,CreditScoreType,CoApplicantCreditScoreType,LenderID,ConformingLoanStatus
0,305000,295000.0,,5.5,360,6,10,Single Family (1-4 Units):Site-Built,1,1,...,16300,racE nOt avaiLaBLE,EtHNiCity NoT AVaILaBLE,Sex Not Available,,,9,9,549300JOT0D4J0SZIK67,C
1,185000,185000.0,,8.0,360,6,10,Single Family (1-4 Units):Site-Built,1,1,...,17140,raCE nOT AvailAble,ETHNIciTy NOT AvaILAbLe,Sex Not Available,,,9,9,549300JOT0D4J0SZIK67,C
2,205000,205000.0,,7.125,360,6,10,Single Family (1-4 Units):Site-Built,1,1,...,16984,racE nOt AVaIlaBlE,ethnicitY noT avaiLable,Sex Not Available,,,9,9,549300JOT0D4J0SZIK67,C
3,255000,,,6.625,360,6,10,Single Family (1-4 Units):Site-Built,1,1,...,28140,RacE nOt AVAILaBLe,eThNiCity NoT avAilaBle,Sex Not Available,,,9,9,549300JOT0D4J0SZIK67,C
4,175000,,,7.125,360,6,10,Single Family (1-4 Units):Site-Built,1,1,...,28140,rACe nOT AvAiLAbLE,EtHNiciTY Not avAIlABlE,Sex Not Available,,,9,9,549300JOT0D4J0SZIK67,C


Q1. What is this dataset about?

In [19]:
'''

In plain English, this is a home-lending (mortgage) dataset at the loan-application level. Each row describes a single loan/application with fields
that capture the loan’s size and terms (LoanAmount, InterestRate, LoanTerm, LoanToValue_LTVratio), the collateral (AppraisedValue, PropertyType,
ConstructionMethod, OccupancyType, TotalUnits), limited geography (CountyCode, StateCode, CensusTract, MetroArea), and some underwriting context
(ApplicantIncome, DebtToIncomeRatio_DTI, CreditScoreType, DenialReason) along with lender and conforming status.

The kinds of decisions that I believe could be made using this dataset are:

- Pricing & risk: calibrate rates, LTV/DTI screens, and conforming thresholds by county/segment.
- Market selection: identify counties/metros with strong demand, lower delinquency risk proxies, or better margins.
- Product design: tune terms (e.g., LoanTerm) or emphasize products for manufactured vs. site-built homes.
- Fair lending monitoring: look for systematic differences across geographies/segments (and then investigate root causes).

The variables that seem most important for housing dynamics are:

- LoanAmount & AppraisedValue → core to price levels and leverage; together define LTV, a key risk/affordability indicator.
- InterestRate & LoanTerm → affordability (monthly payment), demand elasticity, and prepayment risk.
- ApplicantIncome & DebtToIncomeRatio_DTI → borrower capacity; strongest signals of ability to repay.
- PropertyType, ConstructionMethod, OccupancyType → segment structure (SFH vs. MF, manufactured vs. site-built, owner-occupied vs. investment).
- CountyCode / StateCode / MetroArea / CensusTract → local market differences (prices, supply, regulation, economic conditions).
- CreditScoreType & DenialReason → credit box tightness and drivers of approval/denial.

'''


Q2. Three analysis-ready questions

In [20]:
'''

1. How does borrower affordability differ across counties?

Vars: ApplicantIncome, DebtToIncomeRatio_DTI, InterestRate, LoanAmount, CountyCode, OccupancyType.
Value: pinpoints where borrowers are most/least stretched → informs county-level pricing, marketing, and risk appetite.

2. Which factors most strongly explain high LTV loans?
Vars: LoanToValue_LTVratio (target), LoanAmount, AppraisedValue, ApplicantIncome, PropertyType, OccupancyType, CountyCode, CreditScoreType.
Value: clarifies drivers of leverage → guides appraisal diligence, down-payment assistance, and LTV caps.

3. Are manufactured homes priced differently than site-built after controlling for income and LTV?
Vars: InterestRate (target), ConstructionMethod / ManufacturedType, LoanToValue_LTVratio, ApplicantIncome, CountyCode, LoanTerm, ConformingLoanStatus.
Value: detects product-level pricing gaps → supports product design and fair-pricing reviews.

'''

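A minimal sketch of how question 1 could be turned into a table, assuming the column names shown in the `df.head()` output. The helper name `county_affordability`, the derived `loan_to_income` column, and the demo rows are illustrative and not part of the original notebook. DTI is left out because it may be stored as bands rather than numbers.

```python
import pandas as pd

def county_affordability(frame: pd.DataFrame) -> pd.DataFrame:
    """Median affordability indicators per county, most stretched first."""
    out = frame.copy()
    # Non-positive or missing income becomes NaN so the ratio is never inf.
    income = out["ApplicantIncome"].where(out["ApplicantIncome"] > 0)
    # Hypothetical indicator; its scale depends on the units of ApplicantIncome.
    out["loan_to_income"] = out["LoanAmount"] / income
    summary = (
        out.groupby("CountyCode")
        .agg(
            n_loans=("LoanAmount", "size"),
            median_income=("ApplicantIncome", "median"),
            median_rate=("InterestRate", "median"),
            median_loan_to_income=("loan_to_income", "median"),
        )
        .sort_values("median_loan_to_income", ascending=False)
    )
    return summary

# Made-up rows for illustration only (not taken from the real file).
demo_aff = pd.DataFrame({
    "CountyCode": [17031, 17031, 18097, 18097],
    "LoanAmount": [300000, 200000, 150000, 250000],
    "ApplicantIncome": [100.0, 50.0, 75.0, None],
    "InterestRate": [6.5, 7.0, 6.0, 6.25],
})
aff_summary = county_affordability(demo_aff)
print(aff_summary)
```

Medians are used instead of means so that a few very large loans or incomes do not dominate a county's summary.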

Q3. Find Missing Data

In [21]:
# Q3a — Missing data: counts and percentages
import pandas as pd
import numpy as np
n_rows = len(df)

missing_counts = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing_counts / n_rows * 100).round(2)

missing_table = (
    pd.DataFrame({"missing_count": missing_counts, "missing_pct": missing_pct})
    .sort_values("missing_count", ascending=False)
)

print(f"Dataset shape: {df.shape}")
print(f"Columns with any missing: {(missing_counts > 0).sum()} / {df.shape[1]}")
missing_table.head(25)

Dataset shape: (10200, 25)
Columns with any missing: 7 / 25


column,missing_count,missing_pct
LoanToValue_LTVratio,2633,25.81
DebtToIncomeRatio_DTI,1969,19.3
AppraisedValue,1059,10.38
InterestRate,995,9.75
ApplicantIncome,624,6.12
CensusTract,2,0.02
LoanTerm,1,0.01
LoanStatus,0,0.0
LoanAmount,0,0.0
ConstructionMethod,0,0.0


In [23]:
# Q3b — County list and likely state mapping (via mode of StateCode)

county_col = next((c for c in ["CountyCode","county","County","county_code"] if c in df.columns), None)
state_col  = next((c for c in ["StateCode","state","state_code","region","msa"] if c in df.columns), None)

if not county_col:
    print("No county-like column found.")
else:
    counties = sorted(df[county_col].dropna().unique().tolist())
    print(f"Unique counties in '{county_col}':", len(counties))
    print("County values:", counties)

    if state_col:
        # Map county -> most common state in the data
        m = (df[[county_col, state_col]]
             .dropna()
             .groupby(county_col)[state_col]
             .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else None)
             .reset_index()
             .rename(columns={county_col:"County", state_col:"LikelyState"}))
        print("\nCounty → LikelyState (from mode):")
        print(m.sort_values(["LikelyState","County"]).to_string(index=False))
    else:
        print("No state/region column found for mapping.")

Unique counties in 'CountyCode': 8
County values: [17031.0, 18097.0, 19113.0, 26163.0, 27053.0, 29095.0, 39061.0, 55025.0]

County → LikelyState (from mode):
 County LikelyState
19113.0          IA
17031.0          IL
18097.0          IN
26163.0          MI
27053.0          MN
29095.0          MO
39061.0          OH
55025.0          WI

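The mode-based mapping above can be cross-checked against the codes themselves: a county FIPS code is a two-digit state prefix followed by a three-digit county number. The sketch below assumes `CountyCode` holds such FIPS codes (possibly read in as floats). The lookup table covers only the eight states in the output, and `state_from_county_fips` is a hypothetical helper name.

```python
from typing import Optional

import pandas as pd

# State FIPS prefixes for the eight states that appear in the output above.
STATE_FIPS = {"17": "IL", "18": "IN", "19": "IA", "26": "MI",
              "27": "MN", "29": "MO", "39": "OH", "55": "WI"}

def state_from_county_fips(code) -> Optional[str]:
    """Return the state implied by a county FIPS code, or None if unknown."""
    if pd.isna(code):
        return None
    # 17031.0 -> "17031"; zfill restores a leading zero lost in numeric parsing.
    fips = str(int(float(code))).zfill(5)
    return STATE_FIPS.get(fips[:2])

codes = [17031.0, 18097.0, 19113.0, 26163.0, 27053.0, 29095.0, 39061.0, 55025.0]
implied = {int(c): state_from_county_fips(c) for c in codes}
print(implied)
```

If the implied state ever disagrees with the `LikelyState` column, that row is worth inspecting before any county-level analysis.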

In [24]:
# Q3c — IQR-based outlier flags and examples

cols = [c for c in ["LoanAmount","AppraisedValue","LoanToValue_LTVratio","InterestRate","ApplicantIncome"] if c in df.columns]
for c in cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

def iqr_bounds(s: pd.Series):
    s = s.dropna()
    if len(s) < 5:
        return None, None
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5*iqr, q3 + 1.5*iqr

for c in cols:
    lo, hi = iqr_bounds(df[c])
    if lo is None:
        print(f"{c}: not enough data.")
        continue
    mask = (df[c] < lo) | (df[c] > hi)
    print(f"\n{c}: outliers = {mask.sum()} | bounds = [{lo:.3g}, {hi:.3g}]")
    print(df.loc[mask, [c]].head(5))


LoanAmount: outliers = 669 | bounds = [-1.35e+05, 6.65e+05]
     LoanAmount
31       725000
98       725000
148      755000
179      675000
185     1485000

AppraisedValue: outliers = 747 | bounds = [-1.55e+05, 8.05e+05]
     AppraisedValue
94         855000.0
98         965000.0
158       1285000.0
163       1185000.0
165        855000.0

LoanToValue_LTVratio: outliers = 524 | bounds = [55.2, 121]
     LoanToValue_LTVratio
151                45.410
155                53.506
156                49.020
157                34.916
158                37.296

InterestRate: outliers = 266 | bounds = [4.5, 8.5]
     InterestRate
146         3.250
260         8.750
278        11.490
309         9.250
310         9.375

ApplicantIncome: outliers = 843 | bounds = [-72, 288]
     ApplicantIncome
148            508.0
163            352.0
170            366.0
177            354.0
185           1608.0


In [25]:
# Q3c - verbal questions

'''
# How would you confirm if it’s a data entry error or a legitimate value?

- I will confirm legitimacy by:
(1) Cross-checking internal consistency (e.g., AppraisedValue > 0, recompute LTV ≈ LoanAmount/AppraisedValue * 100),
(2) Comparing to realistic ranges from the data dictionary/course notes,
(3) Checking if extreme values cluster in specific counties/segments where high tail is plausible,
(4) Looking for obvious keying artifacts (e.g., InterestRate == 0 or 99).

# What to do with the outliers (and why)

1) LoanAmount & AppraisedValue (very big numbers)
Do: Keep them in the data. For charts/models, use a log transform (e.g., log1p(x)) or winsorize (cap at the 1st/99th percentiles).
Why: Expensive homes and big loans are real. Log/winsorizing stops a few huge numbers from skewing results.

2) LoanToValue_LTVratio (low values like 35–55)
Do: Keep them. Maybe cap only very high LTVs if you must.
Why: Low LTV just means bigger down payments—totally normal, not an error.

3) InterestRate (very low like 3.25% or high like 9–11.5%)
Do: Keep them. If you need to fill missing rates later, fill by similar loans (same LoanTerm and ConformingLoanStatus). For modeling, light
winsorizing is okay.
Why: Different products and times have different rates. These are likely real.

4) ApplicantIncome (very high, e.g., 352–1608; probably in thousands of dollars, as in HMDA public data, but check the data dictionary)
Do: Keep. For modeling, log transform and/or winsorize the high tail. Also check against DTI and loan size.
Why: High incomes exist, especially in some areas. Log/winsorizing reduces distortion.

'''

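The internal-consistency check in step (1) could look like the sketch below. The function name `flag_ltv_mismatch`, the tolerance of 1 percentage point, and the demo rows are assumptions for illustration. The loan amounts in the outputs above all end in 5,000, which suggests rounding, so small gaps between reported and recomputed LTV are expected and the tolerance may need to be wider in practice.

```python
import numpy as np
import pandas as pd

def flag_ltv_mismatch(frame: pd.DataFrame, tol: float = 1.0) -> pd.Series:
    """True where reported LTV differs from LoanAmount/AppraisedValue*100 by more than tol points."""
    # Non-positive appraisals become NaN so they cannot produce inf or negative ratios.
    appraised = frame["AppraisedValue"].where(frame["AppraisedValue"] > 0)
    recomputed = frame["LoanAmount"] / appraised * 100
    diff = (frame["LoanToValue_LTVratio"] - recomputed).abs()
    # A missing input gives a NaN difference, which compares as False (not flagged).
    return diff > tol

# Made-up rows for illustration only.
demo_ltv = pd.DataFrame({
    "LoanAmount": [200000, 200000, 200000],
    "AppraisedValue": [250000.0, 250000.0, np.nan],
    "LoanToValue_LTVratio": [80.0, 95.0, 80.0],
})
ltv_flags = flag_ltv_mismatch(demo_ltv)
print(demo_ltv[ltv_flags])
```

Flagged rows are candidates for review, not automatic errors: a reported LTV can legitimately differ if it is a combined ratio that includes other liens.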

Q4 — Identify missing variables

In [30]:
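# 1) Identify variables with missing data
# Recompute from the current df instead of reusing the Q3a counts: the
# pd.to_numeric(..., errors="coerce") step in Q3c turns non-numeric entries
# into NaN, so the counts here can be higher than in the Q3a table.
n_rows = len(df)
missing_counts = df.isna().sum()
missing_pct = (missing_counts / n_rows * 100).round(2)
missing = (
    pd.DataFrame({"missing_count": missing_counts, "missing_pct": missing_pct})
      .query("missing_count > 0")
      .sort_values("missing_count", ascending=False)
)
print("Columns with missing values:")
print(missing.to_string())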

Columns with missing values:
                       missing_count  missing_pct
LoanToValue_LTVratio            2714        26.61
DebtToIncomeRatio_DTI           1969        19.30
InterestRate                    1076        10.55
AppraisedValue                  1059        10.38
ApplicantIncome                  624         6.12
CensusTract                        2         0.02
LoanTerm                           1         0.01


In [31]:
# 2) Keep/Drop policy for missing values:
'''
- LoanToValue_LTVratio: KEEP; compute = LoanAmount/AppraisedValue*100 when both exist (preserves meaning).
- InterestRate: KEEP; impute median within product/term if available; fallback to global median.
- ApplicantIncome: KEEP; impute median by CountyCode (local incomes); fallback to global median.
- AppraisedValue: KEEP; if you must impute later, use median within (CountyCode, PropertyType); otherwise leave as NaN.
- DebtToIncomeRatio_DTI: KEEP; if numeric, median by (CountyCode, OccupancyType); if banded, mode.
- LoanTerm, CensusTract: very small missingness → reasonable to DROP those few rows instead of inventing values.
'''

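The DTI rule above (group median if numeric, mode if banded) is not implemented in the notebook. A minimal sketch of one way to do it follows, assuming the column counts as numeric only when every non-missing value parses as a number. The helper name `impute_dti`, the band labels, and the demo rows are made up for illustration.

```python
import pandas as pd

DTI = "DebtToIncomeRatio_DTI"

def impute_dti(frame, group_cols=("CountyCode", "OccupancyType"), col=DTI):
    """Fill missing DTI by group median (numeric) or group mode (banded)."""
    out = frame.copy()
    groups = list(group_cols)
    as_num = pd.to_numeric(out[col], errors="coerce")
    # Numeric only if coercion loses no non-missing values; mixed columns count as banded.
    is_numeric = as_num.notna().sum() == out[col].notna().sum()
    if is_numeric:
        out[col] = as_num
        out[col] = out[col].fillna(out.groupby(groups)[col].transform("median"))
        out[col] = out[col].fillna(out[col].median())  # global fallback
        return out

    def first_mode(s):
        m = s.mode()  # ignores missing values
        return m.iloc[0] if not m.empty else None

    out[col] = out[col].fillna(out.groupby(groups)[col].transform(first_mode))
    overall = out[col].mode()
    if not overall.empty:
        out[col] = out[col].fillna(overall.iloc[0])  # global fallback
    return out

# Made-up rows for illustration only.
numeric_demo = pd.DataFrame({"CountyCode": [1, 1, 1, 2], "OccupancyType": [1, 1, 1, 1],
                             DTI: [30.0, 40.0, None, None]})
banded_demo = pd.DataFrame({"CountyCode": [1, 1, 1, 1], "OccupancyType": [1, 1, 1, 1],
                            DTI: ["30%-<36%", "30%-<36%", "20%-<30%", None]})
numeric_filled = impute_dti(numeric_demo)
banded_filled = impute_dti(banded_demo)
print(numeric_filled[DTI].tolist(), banded_filled[DTI].tolist())
```

With nearly a fifth of DTI values missing, it may also be worth keeping a flag column that marks which rows were imputed so later analysis can exclude them.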

In [32]:
# 3) Clean THREE variables (LTV, InterestRate, ApplicantIncome)
df_clean = df.copy()

# Ensure numeric types where needed
for c in ["LoanAmount","AppraisedValue","LoanToValue_LTVratio","InterestRate","ApplicantIncome"]:
    if c in df_clean.columns:
        df_clean[c] = pd.to_numeric(df_clean[c], errors="coerce")

report = []

# A) LTV: compute when possible
if {"LoanToValue_LTVratio","LoanAmount","AppraisedValue"}.issubset(df_clean.columns):
    before = int(df_clean["LoanToValue_LTVratio"].isna().sum())
    m = (
        df_clean["LoanToValue_LTVratio"].isna()
        & df_clean["LoanAmount"].notna()
        & df_clean["AppraisedValue"].notna()
        & (df_clean["AppraisedValue"] > 0)
    )
    df_clean.loc[m, "LoanToValue_LTVratio"] = (
        df_clean.loc[m, "LoanAmount"] / df_clean.loc[m, "AppraisedValue"] * 100
    )
    after = int(df_clean["LoanToValue_LTVratio"].isna().sum())
    report.append(["LoanToValue_LTVratio", before, after, "computed = LoanAmount/AppraisedValue*100"])

# B) InterestRate: median by (LoanTerm, ConformingLoanStatus) if present; else global
if "InterestRate" in df_clean.columns:
    before = int(df_clean["InterestRate"].isna().sum())
    group_cols = [c for c in ["LoanTerm","ConformingLoanStatus"] if c in df_clean.columns]
    if group_cols:
        med = df_clean.groupby(group_cols)["InterestRate"].transform("median")
        df_clean["InterestRate"] = df_clean["InterestRate"].fillna(med)
    df_clean["InterestRate"] = df_clean["InterestRate"].fillna(df_clean["InterestRate"].median())
    after = int(df_clean["InterestRate"].isna().sum())
    report.append(["InterestRate", before, after, f"median by {group_cols or 'global'}, then global median"])

# C) ApplicantIncome: median by CountyCode; else global
if "ApplicantIncome" in df_clean.columns:
    before = int(df_clean["ApplicantIncome"].isna().sum())
    if "CountyCode" in df_clean.columns:
        med_cc = df_clean.groupby("CountyCode")["ApplicantIncome"].transform("median")
        df_clean["ApplicantIncome"] = df_clean["ApplicantIncome"].fillna(med_cc)
    df_clean["ApplicantIncome"] = df_clean["ApplicantIncome"].fillna(df_clean["ApplicantIncome"].median())
    after = int(df_clean["ApplicantIncome"].isna().sum())
    report.append(["ApplicantIncome", before, after, "median by CountyCode, then global median"])

# Show cleaning summary
clean_report = pd.DataFrame(report, columns=["variable","missing_before","missing_after","strategy"])
print("\nCleaning summary:")
print(clean_report.to_string(index=False))

# Before/after snapshot for top-missing columns
print("\nMissingness AFTER cleaning (top 15):")
print(df_clean.isna().sum().sort_values(ascending=False).head(15).to_string())


Cleaning summary:
            variable  missing_before  missing_after                                                           strategy
LoanToValue_LTVratio            2714            312                           computed = LoanAmount/AppraisedValue*100
        InterestRate            1076              0 median by ['LoanTerm', 'ConformingLoanStatus'], then global median
     ApplicantIncome             624              0                           median by CountyCode, then global median

Missingness AFTER cleaning (top 15):
DebtToIncomeRatio_DTI    1969
AppraisedValue           1059
LoanToValue_LTVratio      312
CensusTract                 2
LoanTerm                    1
LoanStatus                  0
DenialReason                0
LoanAmount                  0
InterestRate                0
ConstructionMethod          0
PropertyType                0
ManufacturedType            0
OccupancyType               0
CountyCode                  0
StateCode                   0


In [33]:
# 4) Process explanation:
'''

- Identify missingness: I listed every column with missing values and the % of rows affected.
- Policy:
  + Keep important economic fields and impute intelligently using related info (e.g., compute LTV from LoanAmount/AppraisedValue; impute
  InterestRate by loan term/product; impute ApplicantIncome by county to reflect local wages).
  + Drop rows only when missingness is tiny (e.g., LoanTerm, CensusTract), so no values have to be invented for a handful of rows.
- Cleaned three variables:
  1) LoanToValue_LTVratio = LoanAmount / AppraisedValue × 100 when both are present (preserves meaning).
  2) InterestRate filled by median within (LoanTerm, ConformingLoanStatus), then global median (respects product tiers).
  3) ApplicantIncome filled by county median, then global median (captures local income levels).
- Why this fits: These choices use domain structure (loan terms, product status, geography) to make fills more realistic than a single global
value, and they keep as many useful rows as possible for analysis.

'''


Q5. Reflection

In [34]:
'''
1) The key steps I took were:
- Got oriented: skimmed column names and the data dictionary to understand entities (loan, property, borrower, geography).
- Scanned data quality: computed missing counts/percentages to see where problems live (e.g., LTV, DTI, AppraisedValue, InterestRate, Income).
- Checked geography: identified the county and state fields and mapped counties → likely states to understand coverage.
- Probed distributions/outliers: used a simple IQR rule on monetary/rate fields (LoanAmount, AppraisedValue, LTV, InterestRate, Income) to spot
unusual points.
- Chose practical fixes: designed minimal, domain-sensible imputations (compute LTV from fundamentals; median InterestRate within product/term;
county-median ApplicantIncome).
- Validated logic: sanity-checked with internal consistency (e.g., LTV ≈ LoanAmount / AppraisedValue × 100), “impossible value” rules, and small
before/after missingness snapshots.

2) What surprised me:
- How much LTV and DTI were missing: >25% of rows lacked LTV and ~19.3% lacked DTI, higher than I expected for core risk fields. That pushed me to
compute LTV where possible rather than apply blanket imputation.
- A small set of counties (only 8 unique): I expected broader national coverage; the limited geography suggests results are quite regional and not
automatically generalizable.
- Outliers weren’t “wrong,” just real tails: Very large loans/appraisals and high incomes showed up, plus rate values outside the 4.5–8.5% IQR
fences (Q1 − 1.5×IQR to Q3 + 1.5×IQR). These look like genuine market/product effects (e.g., jumbos, non-conforming loans, timing), reminding me
that “outlier” ≠ “error.”
- Low LTV flagged as “outliers” by IQR: Statistically unusual but economically sensible—big down payments are legitimate and typically lower risk.
Good reminder to pair stats with domain logic.
- Tiny missingness in LoanTerm and CensusTract: a nice win, since dropping the 1–2 affected rows is cleaner than inventing values.

3) If I continued, I would:
- Segment results by product/term and county to see how pricing/affordability shifts locally.
- Add quick log transforms / winsorization for modeling stability, and run a simple model to quantify drivers of LTV or InterestRate.
- Revisit DTI handling (banded vs numeric) and, if needed, impute within county/occupancy groups to keep more rows for analysis.

'''

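The log transform and winsorization mentioned in the outlier notes and in the next steps could be sketched as follows. The `winsorize` helper is a hypothetical name, the 1st/99th percentile defaults follow the text above, and the demo values are made up. The demo caps at the 75th percentile only so the effect is visible on six values.

```python
import numpy as np
import pandas as pd

def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Cap values at the given lower/upper quantiles; missing values stay missing."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)

# Made-up incomes for illustration only.
income_demo = pd.Series([100.0, 120.0, 150.0, 180.0, 5000.0, np.nan])

capped = winsorize(income_demo, lower=0.0, upper=0.75)  # 5000 is pulled down to 180
logged = np.log1p(income_demo)                          # log(1 + x), defined at x = 0

print(capped.tolist())
print(logged.round(3).tolist())
```

Both transforms keep every row. Winsorizing changes only the extreme values, while `log1p` compresses the whole right tail, so the choice depends on whether the model should still see how large the biggest values are.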

Q6. AI Disclosure:

In [35]:
'''
I used AI tools, but only to understand the business problem more deeply: I uploaded the entire .xlsx data dictionary to ChatGPT and
asked it to explain what each variable means in context and why it is encoded the way it is (this was the ONLY thing I did). That helped
me reach a level of understanding that motivated me to think about how each piece of code I write affects the
overall problem.

NOTE: I wrote all of the code myself.

'''
