### Exercise 0: What Does Linear Regression Compute?

Consider a vector
$$
y \in \mathbb{R}^n
$$
where each entry represents a numeric outcome.

Let
$$
X \in \mathbb{R}^{n \times p}
$$
be a matrix whose columns correspond to $p$ explanatory variables.

---

### Questions

#### 1. Meaning of the Fitted Vector

In linear regression, the fitted values are given by
$$
\hat{y} = X\hat{\beta}.
$$

In your own words, explain what the vector $\hat{y}$ represents.

Which of the following best describes $\hat{y}$?

- **A.** The average value of $y$  
- **B.** The vector in $\mathrm{col}(X)$ that is closest to $y$  
- **C.** The most frequent value of $y$  
- **D.** The maximum value permitted by the model  

Select **one** option and justify your choice using geometric reasoning.

#### 2. Residual Interpretation

The residual vector is defined as
$$
r = y - \hat{y}.
$$

Answer the following:

- Where does the vector $r$ lie geometrically, relative to the column space of $X$?
- What does a large value of $|r_i|$ indicate for an individual observation $i$?
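
As a numerical check on the geometry in both questions, the following is a minimal sketch on synthetic data. The random `X` and `y`, the seed, and all variable names are assumptions for illustration and are not part of the exercise. It computes least-squares fitted values with NumPy and checks that the residual is orthogonal to every column of `X`, and that another vector in $\mathrm{col}(X)$ lies farther from `y` than $\hat{y}$ does.

```python
import numpy as np

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Least-squares coefficients, fitted values, and residual
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
r = y - y_hat

# Inner product of each column of X with the residual
orthogonality = X.T @ r

# Another vector in col(X), built from perturbed coefficients
other = X @ (beta_hat + rng.normal(size=p))
dist_fit = np.linalg.norm(y - y_hat)
dist_other = np.linalg.norm(y - other)

print("X^T r:", orthogonality)
print("distance to y_hat:", dist_fit, "distance to other vector:", dist_other)
```

The entries of `X.T @ r` should be zero up to floating-point error, which is the orthogonality condition behind option B.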


### Dataset Overview

This dataset contains information about **loan applicants and their loans**, commonly used in credit risk and fraud-related analysis.  
Each row corresponds to **one loan application**.

### Borrower Characteristics
- **`person_age`**: Age of the applicant (in years).
- **`person_income`**: Annual income of the applicant.
- **`person_home_ownership`**: Housing status (`RENT`, `OWN`, `MORTGAGE`).
- **`person_emp_length`**: Number of years the applicant has been employed.
- **`cb_person_cred_hist_length`**: Length of the applicant’s credit history.
- **`cb_person_default_on_file`**: Whether the applicant has defaulted before (`Y` or `N`).

These variables describe the applicant’s **financial background and stability**.

---

### Loan Characteristics
- **`loan_amnt`**: Amount of money borrowed.
- **`loan_int_rate`**: Interest rate charged on the loan.
- **`loan_intent`**: Purpose of the loan (e.g. `PERSONAL`, `MEDICAL`, `EDUCATION`).
- **`loan_grade`**: Credit grade assigned to the loan.
- **`loan_percent_income`**: Loan amount as a fraction of the applicant’s annual income (`loan_amnt / person_income`).


These variables describe the **risk profile of the loan itself**.

---

### Outcome
- **`loan_status`**: Loan outcome  
  - `1` = loan defaulted  
  - `0` = loan repaid

---

Link: https://www.kaggle.com/datasets/laotse/credit-risk-dataset


### Exercise 1: What Will Linear Regression Learn From This Data?

We will use linear regression to relate borrower and loan characteristics to
`loan_status`, where:

- `loan_status = 1` indicates a default
- `loan_status = 0` indicates a repaid loan

The linear model produces a score of the form:

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots
$$

### Questions

1. When linear regression is fit to `loan_status`, what does a **positive coefficient** mean for a variable?
   - What does a **negative coefficient** mean?

2. For each of the following variables, predict the **expected sign** of its coefficient and briefly justify your reasoning:
   - `person_income`
   - `loan_amnt`
   - `loan_int_rate`
   - `cb_person_default_on_file`

3. Which of these variables do you expect to have the **largest impact** on the fitted linear score?
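
One way to build intuition for question 1 is a toy case with a single binary explanatory variable. The sketch below uses made-up data: the arrays `prior_default` and `defaulted` are hypothetical and are not drawn from the dataset. In this setting the least-squares slope equals the difference in observed default rates between the two groups, and the intercept equals the default rate of the group coded 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 1 = prior default on file, 0 = none
prior_default = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical outcomes: 1 = loan defaulted, 0 = loan repaid
defaulted = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 1])

model = LinearRegression().fit(prior_default.reshape(-1, 1), defaulted)

rate_without = defaulted[prior_default == 0].mean()
rate_with = defaulted[prior_default == 1].mean()
slope = model.coef_[0]
intercept = model.intercept_

print("default rate without prior default:", rate_without)
print("default rate with prior default:", rate_with)
print("fitted intercept:", intercept, "fitted slope:", slope)
```

A positive slope here means the fitted score is higher for the group coded 1. With several explanatory variables, the same reading holds with the other variables held fixed.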

In [3]:
import pandas as pd

# Load the dataset
df = pd.read_csv("D:/CMI/Juspay/Linear_regression/credit_risk_dataset.csv")

# Basic inspection
df.head()


| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.1 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |


### Handling Missing Values

Before fitting linear regression, we need to address missing values in the data.

- **Numeric variables** are imputed using the **median**.
- **Categorical variables** are imputed using the **most frequent category**.
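
The two imputation rules can be sketched on a tiny hypothetical frame. The values below are invented for illustration; only the column names are borrowed from the dataset. The numeric column contains one extreme value to show why the median is a more robust fill value than the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "person_emp_length": [1.0, np.nan, 4.0, 123.0],
    "person_home_ownership": pd.Series(
        ["RENT", "RENT", np.nan, "OWN"], dtype=object
    ),
})

# Median imputation for the numeric column
num_imp = SimpleImputer(strategy="median")
filled_num = num_imp.fit_transform(toy[["person_emp_length"]])

# Most-frequent imputation for the categorical column
cat_imp = SimpleImputer(strategy="most_frequent")
filled_cat = cat_imp.fit_transform(toy[["person_home_ownership"]])

print("numeric fill value:", filled_num[1, 0])
print("mean of observed values:", toy["person_emp_length"].mean())
print("categorical fill value:", filled_cat[2, 0])
```

The median of the observed values (1, 4, 123) is 4, while their mean is pulled far upward by the single extreme entry.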

### Modeling Choices

- Borrower and loan attributes are used as explanatory variables.
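- Categorical variables are **one-hot encoded** with every category kept as its own
  indicator column (`drop_first=False`), so no baseline group is dropped. The indicators
  for one variable sum to 1 in every row, which makes them collinear with the intercept,
  so individual indicator coefficients are not uniquely determined and should be read with care.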
- The fitted model is interpreted as a **linear risk score**:
  
  - Positive coefficients indicate higher default risk.
  - Negative coefficients indicate lower default risk.
  - Larger magnitudes indicate stronger associations.
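
The effect of the one-hot encoding choice on the design matrix can be checked directly. The sketch below uses a small hypothetical column (the values are invented; the column name is borrowed from the dataset) and compares keeping every category with dropping the first one. Appending an intercept column to the full encoding yields a rank-deficient matrix, whereas the reduced encoding does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"cb_person_default_on_file": ["Y", "N", "N", "Y", "N"]})

# Keep every category vs. drop the first category as a baseline
full = pd.get_dummies(toy, drop_first=False).astype(float)
reduced = pd.get_dummies(toy, drop_first=True).astype(float)

# Append an intercept column and compute the matrix rank
ones = np.ones((len(toy), 1))
rank_full = np.linalg.matrix_rank(np.hstack([ones, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([ones, reduced.to_numpy()]))

print("full encoding columns:", list(full.columns))
print("row sums of full encoding:", full.sum(axis=1).tolist())
print("columns incl. intercept:", full.shape[1] + 1, "rank:", rank_full)
print("columns incl. intercept:", reduced.shape[1] + 1, "rank:", rank_reduced)
```

With the full encoding, three columns span only a two-dimensional space, so infinitely many coefficient vectors give the same fitted values.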

In [26]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Separate features and outcome
X = df.drop(columns=["loan_status"])
y = df["loan_status"]

# Identify numeric and categorical columns
numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns
categorical_cols = X.select_dtypes(include=["object"]).columns

# Impute numeric variables with median
num_imputer = SimpleImputer(strategy="median")
X[numeric_cols] = num_imputer.fit_transform(X[numeric_cols])

# Impute categorical variables with most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
X[categorical_cols] = cat_imputer.fit_transform(X[categorical_cols])

# One-hot encode categorical variables
X_encoded = pd.get_dummies(X, drop_first=False)

# Scale features for coefficient comparability
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)


### Fitting a Linear Regression Model

We now fit a linear regression model using `loan_status` as the outcome, where:

- `loan_status = 1` indicates a default
- `loan_status = 0` indicates a repaid loan

### Exercise 2: Interpreting Coefficients

The table below shows the coefficients from a linear regression model fit to
`loan_status`, after scaling all features and explicitly encoding categorical variables.
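
To see why scaling matters for comparing coefficients, the sketch below fits the same model on raw and on standardized features. The data are synthetic: the two feature distributions, the true coefficients, and the noise level are all assumptions chosen for illustration. A standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation, so it measures the change in the fitted score per one standard deviation of the feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumed for illustration)
rng = np.random.default_rng(1)
n = 500
income = rng.normal(60_000, 20_000, size=n)
int_rate = rng.normal(11, 3, size=n)
X_raw = np.column_stack([income, int_rate])
y = 0.2 - 2e-6 * income + 0.02 * int_rate + rng.normal(0, 0.1, size=n)

# Fit on raw features
raw = LinearRegression().fit(X_raw, y)

# Fit on standardized features
scaler = StandardScaler()
standardized = LinearRegression().fit(scaler.fit_transform(X_raw), y)

print("raw coefficients:", raw.coef_)
print("standardized coefficients:", standardized.coef_)
print("raw * std dev:", raw.coef_ * scaler.scale_)
```

The raw coefficient on income looks negligible only because income is measured in large units. After standardization the two coefficients are on a comparable footing.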

### Questions

1. **Dominant Risk Drivers**
   - Which variable has the largest positive coefficient?
   - Why does this result make economic sense in a credit risk setting?

2. **Credit Grade Effects**
   - Observe the coefficients for `loan_grade_A` through `loan_grade_G`.
   - How does default risk change as credit grade worsens?
   
3. **Borrower Stability**
   - Interpret the signs of:
     - `person_income`
     - `person_emp_length`
     - `person_age`
   - How do these align with intuition about borrower stability and risk?

4. **Loan Purpose and Ownership**
   - Some loan intents (e.g. `EDUCATION`, `VENTURE`) have negative coefficients.
   - What does a negative coefficient indicate in this encoding?
   - Why might these loans still default despite lower associated risk?

5. **Surprising or Subtle Results**
   - The coefficients for `cb_person_default_on_file_N` and
     `cb_person_default_on_file_Y` are close to zero and symmetric.
   - Give possible explanations for why a historically important risk factor
     might appear weak in this linear model.

In [28]:
from sklearn.linear_model import LinearRegression
import pandas as pd

# Fit linear regression
lr = LinearRegression()
lr.fit(X_scaled, y)

# Create coefficient table
coef_df = pd.DataFrame({
    "feature": X_encoded.columns,
    "coefficient": lr.coef_
}).sort_values(by="coefficient", ascending=False)

coef_df


| | feature | coefficient |
|---|---|---|
| 5 | loan_percent_income | 0.192254 |
| 20 | loan_grade_D | 0.096283 |
| 21 | loan_grade_E | 0.059533 |
| 22 | loan_grade_F | 0.035402 |
| 10 | person_home_ownership_RENT | 0.033271 |
| 23 | loan_grade_G | 0.030188 |
| 1 | person_income | 0.020626 |
| 11 | loan_intent_DEBTCONSOLIDATION | 0.019635 |
| 4 | loan_int_rate | 0.016932 |
| 13 | loan_intent_HOMEIMPROVEMENT | 0.016516 |


### Exercise 3: How Variable Importance Changes When Information Is Removed

In the previous model, `loan_percent_income` had the largest coefficient of any feature.
This variable is the ratio of loan amount to borrower income, so it acts as a single
measure of financial stress.

To understand how linear regression distributes importance across variables, we now
refit the model **after removing `loan_percent_income`**.

### Questions

1. After removing `loan_percent_income`, which variables change the most in magnitude?
2. How do the coefficients of `loan_amnt` and `person_income` change compared to the
   previous model?
3. Why does removing a variable that summarizes multiple risk factors cause some other
   coefficients to become larger in magnitude?
4. What does this tell us about how linear regression decides which variables appear
   important when several variables contain overlapping information?
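
The mechanism behind these questions can be reproduced on synthetic data. In the sketch below, everything is an assumption made for illustration: the uniform distributions, the noise level, the helper `standardized_coefs`, and an outcome that depends only on the ratio. When the ratio is among the features it absorbs the signal. Once it is removed, the coefficients of its two ingredients take over part of that signal.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): the outcome depends only on the ratio
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "person_income": rng.uniform(20_000, 100_000, size=n),
    "loan_amnt": rng.uniform(1_000, 30_000, size=n),
})
toy["loan_percent_income"] = toy["loan_amnt"] / toy["person_income"]
y = toy["loan_percent_income"] + rng.normal(0, 0.05, size=n)


def standardized_coefs(frame, target):
    """Fit OLS on standardized columns and return coefficients by name."""
    Z = StandardScaler().fit_transform(frame)
    model = LinearRegression().fit(Z, target)
    return pd.Series(model.coef_, index=frame.columns)


with_ratio = standardized_coefs(toy, y)
without_ratio = standardized_coefs(
    toy.drop(columns=["loan_percent_income"]), y
)

print("with ratio:\n", with_ratio)
print("without ratio:\n", without_ratio)
```

In this toy setup the `loan_amnt` coefficient is close to zero while the ratio is present and becomes clearly positive once it is dropped, and `person_income` becomes clearly negative.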


In [49]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Remove the dominant summary variable
X_no_percent_income = X_encoded.drop(columns=["loan_percent_income"])

# Scale features 
scaler_no_percent = StandardScaler()
X_no_percent_scaled = scaler_no_percent.fit_transform(X_no_percent_income)

# Refit linear regression with the same settings as the full model (intercept included)
lr_no_percent_income = LinearRegression()
lr_no_percent_income.fit(X_no_percent_scaled, y)

# Coefficient table
coef_no_percent_income_df = pd.DataFrame({
    "feature": X_no_percent_income.columns,
    "coefficient": lr_no_percent_income.coef_
}).sort_values(by="coefficient", ascending=False)

coef_no_percent_income_df


| | feature | coefficient |
|---|---|---|
| 19 | loan_grade_D | 0.100361 |
| 20 | loan_grade_E | 0.060976 |
| 3 | loan_amnt | 0.045405 |
| 9 | person_home_ownership_RENT | 0.043976 |
| 21 | loan_grade_F | 0.035368 |
| 22 | loan_grade_G | 0.030655 |
| 10 | loan_intent_DEBTCONSOLIDATION | 0.021168 |
| 13 | loan_intent_MEDICAL | 0.014872 |
| 12 | loan_intent_HOMEIMPROVEMENT | 0.012945 |
| 4 | loan_int_rate | 0.010031 |


### Exercise 4: Understanding Residuals

After fitting the linear regression model, we compute predictions
$\hat{y}$ and residuals:

$$
\text{residual}_i = y_i - \hat{y}_i
$$

Residuals measure how much each loan’s outcome deviates from what the
linear model explains using the available features.

### Questions

1. What does a **large positive residual** indicate for a loan?
2. What does a **large negative residual** indicate?
3. Why might some defaulted loans still have small residuals?
4. What kinds of information might be missing for loans with very large residuals?

In [60]:
import pandas as pd
import numpy as np

# Create a fully imputed version of the original dataframe
df_imputed = df.copy()
df_imputed[numeric_cols] = num_imputer.transform(df[numeric_cols])
df_imputed[categorical_cols] = cat_imputer.transform(df[categorical_cols])

# Add predictions from the original (full) linear model
df_imputed["y_hat"] = lr.predict(X_scaled)

# Compute residuals
df_imputed["residual"] = df_imputed["loan_status"] - df_imputed["y_hat"]
df_imputed["abs_residual"] = df_imputed["residual"].abs()

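# Inspect the loans with the smallest absolute residuals (sort_values sorts ascending
# by default); pass ascending=False to list the largest residuals instead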
df_imputed.sort_values("abs_residual").head(10)


| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | y_hat | residual | abs_residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3243 | 23.0 | 70000.0 | RENT | 1.0 | EDUCATION | A | 3500.0 | 5.42 | 0 | 0.05 | N | 3.0 | -1.3e-05 | 1.3e-05 | 1.3e-05 |
| 26126 | 31.0 | 96000.0 | MORTGAGE | 4.0 | VENTURE | C | 12000.0 | 12.73 | 0 | 0.13 | Y | 5.0 | 1.3e-05 | -1.3e-05 | 1.3e-05 |
| 19092 | 28.0 | 30000.0 | MORTGAGE | 4.0 | EDUCATION | A | 3000.0 | 6.76 | 0 | 0.1 | N | 9.0 | -2.5e-05 | 2.5e-05 | 2.5e-05 |
| 26799 | 27.0 | 113800.0 | MORTGAGE | 3.0 | MEDICAL | A | 7500.0 | 8.63 | 0 | 0.07 | N | 10.0 | 2.6e-05 | -2.6e-05 | 2.6e-05 |
| 11166 | 23.0 | 72000.0 | MORTGAGE | 0.0 | PERSONAL | B | 5000.0 | 11.86 | 0 | 0.07 | N | 2.0 | 4e-05 | -4e-05 | 4e-05 |
| 15890 | 26.0 | 250000.0 | MORTGAGE | 11.0 | HOMEIMPROVEMENT | B | 9000.0 | 9.99 | 0 | 0.04 | N | 2.0 | -4.4e-05 | 4.4e-05 | 4.4e-05 |
| 26322 | 28.0 | 100000.0 | OWN | 7.0 | HOMEIMPROVEMENT | C | 20000.0 | 12.68 | 0 | 0.2 | Y | 6.0 | -6.4e-05 | 6.4e-05 | 6.4e-05 |
| 22552 | 30.0 | 55684.0 | MORTGAGE | 13.0 | EDUCATION | B | 5200.0 | 12.69 | 0 | 0.09 | N | 9.0 | -7.4e-05 | 7.4e-05 | 7.4e-05 |
| 26537 | 34.0 | 105000.0 | MORTGAGE | 2.0 | MEDICAL | A | 10000.0 | 7.51 | 0 | 0.1 | N | 7.0 | -0.000102 | 0.000102 | 0.000102 |
| 26916 | 29.0 | 26000.0 | OWN | 5.0 | VENTURE | B | 4500.0 | 11.71 | 0 | 0.17 | N | 5.0 | 0.000102 | -0.000102 | 0.000102 |
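
The rows shown above have absolute residuals near zero because `sort_values` sorts in ascending order by default. To look at the loans the model explains worst, the sort direction has to be reversed. The sketch below illustrates the difference on a hypothetical four-row frame; the `y_hat` values are invented and are not model output.

```python
import pandas as pd

# Hypothetical outcomes and fitted scores
toy = pd.DataFrame({
    "loan_status": [0, 1, 0, 1],
    "y_hat": [0.02, 0.10, 0.85, 0.60],
})
toy["residual"] = toy["loan_status"] - toy["y_hat"]
toy["abs_residual"] = toy["residual"].abs()

# Ascending (default): best-explained loans first
smallest = toy.sort_values("abs_residual").head(2)

# Descending: worst-explained loans first
largest = toy.sort_values("abs_residual", ascending=False).head(2)

print(smallest)
print(largest)
```

In this toy frame, the row with the large positive residual is a default that received a low score, and the row with the large negative residual is a repaid loan that received a high score.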


### Exercise 5: Model Metrics

To summarize the performance of the linear regression model, we report two metrics:
**Mean Squared Error (MSE)** and **$R^2$**.

The Mean Squared Error is defined as:
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

MSE measures the average squared difference between the observed loan outcome and the
linear risk score produced by the model.

The $R^2$ statistic measures the fraction of variation in the outcome that is explained
by the model, relative to a baseline that predicts the same value (the mean of $y$) for
every loan. For a least-squares fit with an intercept, evaluated on the data used to fit
it, $R^2$ lies between 0 and 1; on new data it can be negative. Higher values indicate
that more systematic structure in the data is captured by the model.

Intuitively:
- **MSE** answers: *How large are the model’s squared prediction errors on average?*
- **$R^2$** answers: *How much of the variation in loan default does the model explain?*

### Questions

1. What does a larger MSE indicate about the average size of the model’s residuals?
2. Why is a non-zero MSE expected when modeling loan default with a linear model?
3. How does $R^2$ help contextualize the error measured by MSE?
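
Both metrics can be computed by hand, which makes their relationship explicit: $R^2 = 1 - \text{MSE} / \text{MSE}_{\text{baseline}}$, where the baseline predicts the mean of $y$ for every loan. The sketch below uses small hypothetical arrays (not model output) and compares the manual values with `scikit-learn`'s. It also shows that predictions worse than the baseline give a negative $R^2$.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical outcomes and fitted scores
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0.1, 0.2, 0.7, 0.0, 0.4, 0.3, -0.1, 0.9])

# MSE by hand
mse_manual = np.mean((y_true - y_pred) ** 2)

# Baseline: predict the mean outcome for every loan
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)

# R^2 as one minus the ratio of model error to baseline error
r2_manual = 1 - mse_manual / baseline_mse

# Predictions worse than the baseline (every outcome flipped)
r2_bad = r2_score(y_true, 1 - y_true)

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_pred))
print("R^2:", r2_manual, "sklearn:", r2_score(y_true, y_pred))
print("R^2 of flipped predictions:", r2_bad)
```

Because the baseline error of a 0/1 outcome is fixed by the default rate, $R^2$ indicates whether a given MSE is small or large relative to what a constant prediction would achieve.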


In [77]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_scaled)

mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

mse, r2


(0.1104519985245318, 0.35244765055653127)

### Conclusion: Why This Model Is Still Useful Despite a Low $R^2$

The $R^2$ value of about 0.35 means that roughly 65% of the variation in loan default
outcomes remains unexplained by this linear regression model. This is expected and
does not undermine the value of the analysis.

Loan default is influenced by many factors that are not fully captured in the data,
including time-varying borrower behavior, external economic conditions, and nonlinear
effects. A simple linear model is not designed to capture all such complexity.

Despite this, the model remains useful because it:
- Provides clear, interpretable relationships between borrower attributes and default risk.
- Highlights which variables contribute most strongly to risk on average.
- Reveals how importance shifts when information overlaps across features.
- Identifies loans that are difficult to explain through residual analysis.