# Import Data and Prepare Variables
Import the dataset into R. Convert the 'Education' variable into a factor using the as.factor() function. Display the structure of the data to confirm the conversion.

## Task 1

In [16]:
# Load necessary libraries
library(readr)

# Import the dataset
loan_data <- read_csv("Loan.csv")

# Convert 'Education' variable into a factor
loan_data$Education <- as.factor(loan_data$Education)

# Display the structure of the data to confirm the conversion
str(loan_data)

Rows: 5000 Columns: 5
-- Column specification --------------------------------------------------------
Delimiter: ","
dbl (5): Loan, Income, Family, CCAvg, Education

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


spc_tbl_ [5,000 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Loan     : num [1:5000] 0 0 0 0 0 0 0 0 0 1 ...
 $ Income   : num [1:5000] 49 34 11 100 45 29 72 22 81 180 ...
 $ Family   : num [1:5000] 4 3 1 1 4 4 2 1 3 1 ...
 $ CCAvg    : num [1:5000] 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
 $ Education: Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 3 2 3 ...
 - attr(*, "spec")=
  .. cols(
  ..   Loan = [32mcol_double()[39m,
  ..   Income = [32mcol_double()[39m,
  ..   Family = [32mcol_double()[39m,
  ..   CCAvg = [32mcol_double()[39m,
  ..   Education = [32mcol_double()[39m
  .. )
 - attr(*, "problems")=<externalptr> 
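
To make the effect of the conversion concrete, the following minimal sketch uses a made-up toy vector (not the loan data) to show what `as.factor()` does to numeric codes and which dummy columns a model formula would build from the result. The names `education_toy`, `education_factor`, and `dummies` are illustrative only.

```r
# Toy vector standing in for the 'Education' column (values are made up)
education_toy <- c(1, 1, 2, 3, 2)

# as.factor() turns the numeric codes into a categorical variable;
# the sorted unique values become the factor levels
education_factor <- as.factor(education_toy)
levels(education_factor)

# model.matrix() shows the design columns a formula would create:
# the first level ("1") is absorbed into the intercept as the reference category
dummies <- model.matrix(~ education_factor)
colnames(dummies)
```

This is why the regression output reports separate `Education2` and `Education3` terms rather than a single `Education` slope.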


# Run Linear Probability Model
Fit a linear probability model by regressing 'Loan' on 'Income', 'Family', 'CCAvg', and 'Education'. Use the lm() function and display the summary of the regression results.

In [13]:
# Ensure the dataset contains the required variables
if (all(c("Loan", "Income", "Family", "CCAvg", "Education") %in% names(loan_data))) {
  # Ensure 'Loan' is numeric
  loan_data$Loan <- as.numeric(loan_data$Loan)
  # Fit a linear probability model
  linear_model <- lm(Loan ~ Income + Family + CCAvg + Education, data = loan_data)
  # Display the summary of the regression results
  print(summary(linear_model))
} else {
  stop("The dataset 'loan_data' does not contain all the required variables.")
}


Call:
lm(formula = Loan ~ Income + Family + CCAvg + Education, data = loan_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56354 -0.14730 -0.03822  0.06978  1.05386 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.455e-01  1.127e-02 -30.653  < 2e-16 ***
Income       3.367e-03  9.799e-05  34.364  < 2e-16 ***
Family       3.160e-02  3.010e-03  10.499  < 2e-16 ***
CCAvg        1.373e-02  2.538e-03   5.412 6.52e-08 ***
Education2   1.517e-01  8.473e-03  17.908  < 2e-16 ***
Education3   1.605e-01  8.229e-03  19.511  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2393 on 4994 degrees of freedom
Multiple R-squared:  0.3409,	Adjusted R-squared:  0.3402 
F-statistic: 516.6 on 5 and 4994 DF,  p-value: < 2.2e-16
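
For reference, the regression above can be written as the specification below. This is a sketch: the symbols $\beta_0, \dots, \beta_5$ and the indicator notation are introduced here for illustration and do not appear in the notebook. `Education2` and `Education3` are the dummy variables R creates from the factor, with level 1 as the omitted reference category.

$$
P(\text{Loan}=1 \mid x) = \beta_0 + \beta_1\,\text{Income} + \beta_2\,\text{Family} + \beta_3\,\text{CCAvg} + \beta_4\,\mathbb{1}[\text{Education}=2] + \beta_5\,\mathbb{1}[\text{Education}=3]
$$

Because the right-hand side is linear, each slope is a constant change in the predicted probability. For example, the `Income` estimate of 3.367e-03 means a one-unit increase in `Income` raises the predicted probability by about 0.0034, whatever the values of the other variables.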



# Interpret Coefficients for Education Variables
Interpret the coefficients in front of the two 'Education' factor variables from the regression output. Discuss what these coefficients represent in the context of the model.

In [14]:
# Interpret the coefficients for 'Education' variables
# Ensure the linear model is defined
if (exists("linear_model")) {
  # Extract the coefficient table once and keep only the 'Education' rows
  coef_table <- summary(linear_model)$coefficients
  education_coefficients <- coef_table[grep("Education", rownames(coef_table)), ]
  # Display the coefficients for 'Education' variables
  education_coefficients
  
  # Interpretation:
  # 'Education' level 1 is the reference category. Holding Income, Family, and CCAvg fixed,
  # the predicted probability that Loan = 1 is about 0.152 (15.2 percentage points) higher
  # for Education level 2, and about 0.161 (16.1 percentage points) higher for level 3,
  # than for level 1. Both estimates are significant at any conventional level.
} else {
  stop("The variable 'linear_model' is not defined. Please ensure the linear model cell has been executed.")
}

            Estimate  Std. Error  t value     Pr(>|t|)
Education2 0.1517432 0.008473425 17.90813  1.46666e-69
Education3 0.1605467 0.008228543 19.51096 9.172939e-82


# Analyze Fitted Values and Identify Out-of-Bounds Predictions
Calculate the fitted values (y_hat = X * beta_hat) from the linear probability model. Check if any fitted values are greater than 1 or less than 0. Display a few rows of customers with such out-of-bounds predictions.

In [15]:
# Calculate fitted values (y_hat) from the linear probability model
# Ensure 'loan_data' and 'linear_model' are defined
if (exists("loan_data") && exists("linear_model")) {
  # Calculate fitted values (y_hat) from the linear probability model
  loan_data$fitted_values <- predict(linear_model, newdata = loan_data)
  
  # Identify customers with fitted values greater than 1 or less than 0
  out_of_bounds <- subset(loan_data, fitted_values > 1 | fitted_values < 0)
  
  # Display a few rows of customers with out-of-bounds predictions
  head(out_of_bounds)
} else {
  stop("The variable 'loan_data' or 'linear_model' is not defined. Please ensure the relevant cells have been executed.")
}

 Loan Income Family CCAvg Education fitted_values
<dbl>  <dbl>  <dbl> <dbl> <fct>             <dbl>
    0     49      4   1.6 1           -0.03216012
    0     34      3   1.5 1           -0.11564489
    0     11      1   1.0 1           -0.26316401
    0     22      1   0.3 3           -0.07519169
    0     22      1   1.5 3           -0.05871013
    0     21      1   0.5 2           -0.08461553
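
As a check on where a negative fitted value comes from, the first row shown (Income = 49, Family = 4, CCAvg = 1.6, Education = 1) can be worked by hand. The coefficients below are the estimates reported in this notebook, rounded to six decimals. Education level 1 is the reference category, so neither Education dummy contributes.

$$
\hat{y} = -0.345541 + 0.003367 \times 49 + 0.031602 \times 4 + 0.013735 \times 1.6 \approx -0.0322
$$

This agrees with the fitted value of -0.03216012 in the table. The large negative intercept outweighs the positive contributions for a low-income, level-1 customer, and nothing in a linear model prevents the result from falling below 0.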


## Task 2

In [18]:
# Fit the linear probability model (LPM)
lpm <- lm(Loan ~ Income + Family + CCAvg + Education, data = loan_data)

In [19]:
# 4. Fit Logit model
logit_mod <- glm(Loan ~ Income + Family + CCAvg + Education, 
                 data = loan_data, family = binomial)
summary(logit_mod)

# 5. Confusion matrix and PCP
threshold <- mean(loan_data$Loan)
logit_pred <- ifelse(predict(logit_mod, type = "response") > threshold, 1, 0)
actual <- loan_data$Loan
table(Predicted = logit_pred, Actual = actual)

overall_pcp <- mean(logit_pred == actual)
pcp_1 <- mean(logit_pred[actual == 1] == 1)
pcp_0 <- mean(logit_pred[actual == 0] == 0)

overall_pcp
pcp_1
pcp_0

# 6. Predicted probability at mean Xs, Education=2
mean_income <- mean(loan_data$Income)
mean_family <- mean(loan_data$Family)
mean_ccavg <- mean(loan_data$CCAvg)
edu2 <- 2

# Manual calculation
coefs <- coef(logit_mod)
xb <- coefs[1] + coefs["Income"] * mean_income +
      coefs["Family"] * mean_family +
      coefs["CCAvg"] * mean_ccavg +
      coefs["Education2"] * (edu2 == 2) +
      coefs["Education3"] * (edu2 == 3)
prob_manual <- exp(xb) / (1 + exp(xb))
prob_manual

# Using predict()
new_obs <- data.frame(Income = mean_income, Family = mean_family, 
                      CCAvg = mean_ccavg, Education = factor(2, levels = 1:3))
prob_predict <- predict(logit_mod, new_obs, type = "response")
prob_predict

# 7. Compare coefficients
lpm_coefs <- coef(lpm)
logit_coefs <- coef(logit_mod)
data.frame(LPM = lpm_coefs, Logit = logit_coefs)

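# 8. Partial effects at mean Xs, Education=2
# For logit: partial effect = beta * p * (1-p), the derivative of p with respect to x.
# For the Education dummies this derivative is only an approximation; the exact effect
# of a dummy is the difference in predicted probabilities between the two levels.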
p <- as.numeric(prob_manual)
logit_partial <- coefs[-1] * p * (1 - p)
lpm_partial <- lpm_coefs[-1]
data.frame(LPM = lpm_partial, Logit = logit_partial)


Call:
glm(formula = Loan ~ Income + Family + CCAvg + Education, family = binomial, 
    data = loan_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -13.177833   0.517777 -25.451  < 2e-16 ***
Income        0.059791   0.002687  22.255  < 2e-16 ***
Family        0.587079   0.071275   8.237  < 2e-16 ***
CCAvg         0.162679   0.040505   4.016 5.91e-05 ***
Education2    3.910609   0.251037  15.578  < 2e-16 ***
Education3    3.933173   0.244329  16.098  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3162.0  on 4999  degrees of freedom
Residual deviance: 1334.8  on 4994  degrees of freedom
AIC: 1346.8

Number of Fisher Scoring iterations: 8


         Actual
Predicted    0    1
        0 4001   61
        1  519  419

                     LPM        Logit
                   <dbl>        <dbl>
(Intercept) -0.345540917 -13.17783285
Income       0.003367256   0.05979075
Family       0.031602457   0.58707882
CCAvg        0.013734629   0.16267911
Education2   0.151743228   3.91060897
Education3   0.160546743   3.93317273
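
The two columns are on different scales, which is why the magnitudes differ so much. As a sketch, with $x'\beta$ denoting the linear index (the same quantity as `xb` in the code above), the logit model is:

$$
\log\frac{p}{1-p} = x'\beta, \qquad p = P(\text{Loan}=1 \mid x) = \frac{\exp(x'\beta)}{1+\exp(x'\beta)}
$$

A logit coefficient is therefore a change in the log-odds per unit of the regressor, while an LPM coefficient is a change in the probability itself. The raw estimates are not directly comparable, which is the reason step 8 converts the logit coefficients into partial effects.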


                   LPM       Logit
                 <dbl>       <dbl>
Income     0.003367256 0.002390594
Family     0.031602457 0.023472981
CCAvg      0.013734629 0.006504346
Education2 0.151743228 0.156356601
Education3 0.160546743  0.15725876
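
For a dummy regressor, the derivative-based formula `beta * p * (1 - p)` and the discrete change in predicted probability can give different answers. The sketch below illustrates the two calculations on simulated data. Everything in it (the data frame `sim`, the variables `y`, `x`, `d`, and the true coefficients) is hypothetical and unrelated to `Loan.csv`; it shows the mechanics only, not the notebook's results.

```r
# Simulate a binary outcome with one continuous regressor and a 3-level factor
set.seed(1)
n <- 2000
x <- rnorm(n)
d <- factor(sample(1:3, n, replace = TRUE))
eta <- -1 + 0.8 * x + 1.0 * (d == "2") + 1.5 * (d == "3")
y <- rbinom(n, 1, plogis(eta))
sim <- data.frame(y, x, d)

# Fit the logit model
fit <- glm(y ~ x + d, data = sim, family = binomial)
b <- coef(fit)

# Predicted probabilities at the mean of x, for factor levels 1 and 2
at_d1 <- data.frame(x = mean(sim$x), d = factor("1", levels = levels(sim$d)))
at_d2 <- data.frame(x = mean(sim$x), d = factor("2", levels = levels(sim$d)))
p1 <- unname(predict(fit, at_d1, type = "response"))
p2 <- unname(predict(fit, at_d2, type = "response"))

# Derivative-based approximation, evaluated at level 2
derivative_effect <- unname(b["d2"] * p2 * (1 - p2))

# Exact discrete change when moving from level 1 to level 2
discrete_effect <- p2 - p1

c(derivative = derivative_effect, discrete = discrete_effect)
```

The same comparison could be applied to `logit_mod` by predicting at Education = 1 and Education = 2 with the other regressors held at their means.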


## Quiz Questions

In [20]:
# Check the number of levels and their names for Education
levels(loan_data$Education)
length(levels(loan_data$Education))

In [21]:
# Get the coefficient estimate for Income from the LPM
coef(lpm)["Income"]