# QF 627 Extras - Financial Analytics
## Problem Set `3` | `RE`view

> "👋 Hi Team! 

> The current problem set contains four larger sets of questions designed to consolidate your understanding of propensity scoring and matching. These questions are not meant for testing; instead, they are crafted to further enhance your learning and hands-on problem-solving of causal analytic questions in the field. You may find that the problem sets serve as extended lecture notes. While answering the questions, you will be effectively internalizing the learning from the class.

> Enjoy 🤞"

### <font color = green> Activation of necessary libraries. </font>

> Before proceeding to answer the questions below, please activate the necessary packages and modules.

In [1]:
import numpy as np
import pandas as pd # polars

import matplotlib.pyplot as plt # lets-plot
import matplotlib as mpl

from lets_plot import *
LetsPlot.setup_html()

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

> Let's set some print option.

In [2]:
np.set_printoptions(precision = 3)

plt.style.use("ggplot")

mpl.rcParams["axes.grid"] = True
mpl.rcParams["grid.color"] = "grey"
mpl.rcParams["grid.alpha"] = 0.25

mpl.rcParams["axes.facecolor"] = "white"

mpl.rcParams["legend.fontsize"] = 14

## ❓ <a id = "top"> List of Analytic Questions </a> ❓

## [Q1. Impact of Financial Literacy Programs on Savings Rate](#p1)

> ### <font color = red> Q 6 </font>

## [Q2. Effect of Loan Counseling on Loan Default Rates](#p2)

> ### <font color = red> F 2 </font>

## [Q3. Impact of Urban Development Projects on Property Prices](#p3)

> ### <font color = red> Q 7 </font>

## [Q4. Effect of Historical Preservation Status on Property Values](#p4)

> ### <font color = red> F E </font>

### <mark>REMINDING NOTE to Problem-Solving Leaders:</mark>

#### During the problem-solving session, each team member must take at least one question and actively participate.

#### <font color = "red"> While presenting your code and solutions in class, you must clearly explain the following:
* `What you did` – Describe the steps and functions you implemented.
* `Why you did it` – Explain your reasoning or strategy behind the approach.
* `The implications of the code` – Discuss the results, insights, or impact of your code.

#### <font color = "red"> The core skill you should develop is `the ability to explain` your lines of code and solutions effectively.

#### <font color = "red"> Problem-solvers need to submit their individual notebook via our eLearn. Then, please submit the group's work via [HERE](https://www.dropbox.com/request/IbdlmMa4I4UAVAgN7q12).

    Prof. Roh's further NOTE: 
    
> For the questions below, you will experience a systematic, step-by-step workflow to answer causal analytic questions. There will also be questions to estimate ATE (Average Treatment Effects), along with `ATT` and `ATC`. As you must have found by now from reading the lecture notes, ATT refers to the Average Treatment Effect on the Treated, and ATC refers to the Average Treatment Effect on the Control.

#### Treatment Effect Metrics

| Metric | Focus Group       | Question Answered                                                     | Use Cases                                      |
|--------|-------------------|-----------------------------------------------------------------------|------------------------------------------------|
| ATT    | Treated           | What is the effect of the treatment on the treated?                   | Evaluating the impact on participants          |
| ATC    | Control           | What would have been the effect on the control group if they had been treated? | Evaluating potential impact on non-participants |
| ATE    | Entire Population | What is the average effect of the treatment across the whole population? | General policy evaluation                      |


## <a id = "p1"> </a> <font color = "red"> Q1. Impact of Financial Literacy Programs on Savings Rate </font> [back to table of contents](#top)

#### Q 1-1: Import the following data (`https://talktoroh.com/s/savings_rate.csv`) and run a data inspection

In [3]:
save =\
(
    pd
    .read_csv("https://talktoroh.com/s/savings_rate.csv")
)

In [4]:
save.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            1000 non-null   int64  
 1   income         1000 non-null   float64
 2   education      1000 non-null   int64  
 3   participation  1000 non-null   int64  
 4   savings_rate   1000 non-null   float64
dtypes: float64(2), int64(3)
memory usage: 39.2 KB


#### Q 1-2: Estimate the propensity score using logistic regression.

In [5]:
save.columns

Index(['age', 'income', 'education', 'participation', 'savings_rate'], dtype='object')

### Prepare your response (y) variable

In [6]:
save["participation"].unique()

array([1, 0])

In [7]:
response = save["participation"]

### Prepare your explanatory (x) variable and covariates

In [8]:
# dependency for logistic regression

# from sklearn.linear_model import LogisticRegression

covariates = save[['age', 'income', 'education']]

#### Model specification

In [9]:
logit = LogisticRegression()

In [10]:
(
    logit
    .fit(covariates, response) # calculate, using your algorithm
)

In [11]:
(
    logit # calculated
    .predict_proba(covariates) # returns prediction
) # predicted probability of being 0 and 1

array([[0.331, 0.669],
       [0.428, 0.572],
       [0.492, 0.508],
       ...,
       [0.333, 0.667],
       [0.595, 0.405],
       [0.239, 0.761]], shape=(1000, 2))

In [12]:
(
    logit
    .predict_proba(covariates)
).ndim

2

In [13]:
# savings["propensity_scores"]
save["probability_of_being_treated"] =\
(
    logit
    .predict_proba(covariates)
    [ : , 1] 
         # the column that displays propensity (here, probability) of being 1 (treated)
)

In [14]:
save

Unnamed: 0,age,income,education,participation,savings_rate,probability_of_being_treated
0,56,65094.392138,19,1,6.608668,0.669293
1,46,41346.621957,14,1,4.212576,0.571682
2,32,62535.381681,13,0,6.326765,0.508123
3,60,33054.397180,15,0,3.392591,0.603909
4,25,57947.062669,19,1,5.901228,0.724385
...,...,...,...,...,...,...
995,22,47785.961206,10,1,4.860295,0.453244
996,40,43009.452472,11,1,4.366051,0.471781
997,27,26079.453991,15,0,2.689879,0.666839
998,61,57704.001599,11,0,5.838729,0.405133


#### Q 1-3: Perform matching using nearest neighbor matching.

In [15]:
treated_unit =\
(
    save
    [save["participation"] == 1]
    # .copy()
)

controlled_unit =\
(
    save
    [save["participation"] == 0]
   # .copy()
)

In [16]:
len(treated_unit)

569

In [17]:
len(controlled_unit)

431

### Using Propensity Scores, we will match treated and matched controlled.

In [18]:
(
    np
    .random
    .seed(2025)
)

In [19]:
NN = NearestNeighbors(n_neighbors = 1)

In [20]:
save.head(7)

Unnamed: 0,age,income,education,participation,savings_rate,probability_of_being_treated
0,56,65094.392138,19,1,6.608668,0.669293
1,46,41346.621957,14,1,4.212576,0.571682
2,32,62535.381681,13,0,6.326765,0.508123
3,60,33054.39718,15,0,3.392591,0.603909
4,25,57947.062669,19,1,5.901228,0.724385
5,38,71623.52931,13,1,7.260867,0.4771
6,56,12925.332498,16,1,1.403627,0.685678


In [21]:
treated_unit.head(7)

Unnamed: 0,age,income,education,participation,savings_rate,probability_of_being_treated
0,56,65094.392138,19,1,6.608668,0.669293
1,46,41346.621957,14,1,4.212576,0.571682
4,25,57947.062669,19,1,5.901228,0.724385
5,38,71623.52931,13,1,7.260867,0.4771
6,56,12925.332498,16,1,1.403627,0.685678
7,36,38046.571168,16,1,3.904756,0.660304
9,28,46954.319209,12,1,4.788025,0.516278


In [22]:
controlled_unit.head(7)

Unnamed: 0,age,income,education,participation,savings_rate,probability_of_being_treated
2,32,62535.381681,13,0,6.326765,0.508123
3,60,33054.39718,15,0,3.392591,0.603909
8,40,58656.081908,11,0,5.935441,0.434769
10,28,55567.188101,10,0,5.631386,0.425741
12,53,51298.846812,18,0,5.217657,0.671362
13,57,47664.841469,13,0,4.833364,0.5047
15,20,53816.31265,18,0,5.46666,0.710094


In [23]:
(
    NN
    .fit(controlled_unit
         [["probability_of_being_treated"]] # propensity scores
        )
)

In [24]:
distance, index_of_controlled =\
(
    NN
    .kneighbors(treated_unit
                [["probability_of_being_treated"]]
               )
)

In [25]:
index_of_controlled.flatten()

array([ 10, 304, 121, 367, 305, 370,  11,  48, 254, 115, 369, 399, 130,
       363, 110,  33, 413, 309, 366,  10, 230, 182, 214, 324, 159, 296,
       360, 229, 121, 158,  80, 250,  83, 382,  83,  47,  31, 378, 111,
       378, 370,  11,  19,  48, 158, 213, 110, 146, 226, 428, 361, 210,
         4, 295, 381, 412, 287, 214,  46, 288, 360, 145, 116, 271, 347,
       222,  64,  88, 319, 135,  90, 254,  78, 291, 314,  11,  31, 158,
       378,  98, 309, 110, 182, 233,  14, 334, 206, 188, 272, 175, 121,
        99, 252, 316, 415, 176,  16, 234, 336, 173, 429,  23, 317, 283,
       228, 148, 305,  84, 292,  41,  31, 133, 206, 416, 386, 417, 376,
       320,  33,  79,  90, 305,  41, 400, 173, 337,   2, 193, 154,  90,
       223, 103, 127, 205,  84, 247, 353, 155, 261,  97, 418, 420, 252,
       292,  85, 386, 312, 135, 374, 336, 141, 347,  21, 147, 148, 386,
        77, 163, 213,   4, 331,  54, 295, 420, 168, 177,  21, 153, 183,
       135, 172, 309,  30, 150,  59,  21, 427,  10, 140, 157, 30

In [26]:
matched_controlled =\
(
    controlled_unit
    .iloc[index_of_controlled
          .flatten()
          ]
#    .copy()
)

matched_controlled

Unnamed: 0,age,income,education,participation,savings_rate,probability_of_being_treated
20,47,55916.782136,18,0,5.660235,0.669875
703,37,46946.294598,14,0,4.783129,0.572313
290,46,43697.197744,19,0,4.464246,0.725440
851,43,38698.731945,11,0,3.936380,0.477387
705,45,34778.649040,17,0,3.558524,0.685868
...,...,...,...,...,...,...
499,50,28034.376797,14,0,2.882126,0.596508
565,29,47906.155578,19,0,4.882003,0.738328
129,52,72559.894829,13,0,7.342867,0.453187
406,32,77699.554941,13,0,7.834778,0.471949


In [27]:
len(matched_controlled) == len(treated_unit)

True

#### Q 1-4: Estimate the ATE, ATT, and ATC.

### Average Treatment Effects (ATE)

In [28]:
ATE =\
(
    treated_unit["savings_rate"].mean()
    - # naive comparisons
    controlled_unit["savings_rate"].mean()
)

In [29]:
ATE

np.float64(-0.2064004787149063)

#### Average Treatment Effects on the Treated (`ATT`)

#### <mark>For our ATT, we paired each treated household with its single closest controlled household in propensity score.</mark>

In [30]:
ATT =\
(
    treated_unit["savings_rate"].mean() 
    - # fair comparisons
    matched_controlled["savings_rate"].mean()
)

In [31]:
ATT

np.float64(0.11070165114858277)

### Average Treatment Effects on the Controlled (ATC)

In [32]:
ATC =\
(
    matched_controlled["savings_rate"].mean()
    -
    controlled_unit["savings_rate"].mean()
)

In [33]:
ATC

np.float64(-0.31710212986348907)

#### Q 1-5: Interpret the results. When answering, please use the following boilerplate.

```python

print(f"After matching, the estimated ATT is {att:.3f}, indicating that {explanatory_variable} increases {response_variable} by {att*100:.2f} percentage points on average.")

print(f"The estimated ATE is {ate:.3f}, indicating the overall effect of {explanatory_variable} on the entire population is {ate*100:.2f} percentage points.")

print(f"The estimated ATC is {atc:.3f}, indicating that {explanatory_variable} would have increased {response_variable} for the control group by {atc*100:.2f} percentage points on average if they had {treatment}.")

In [34]:
print(f"After matching, the estimated ATT is {ATT:.3f}, indicating that taking part in the financial literacy program increases the savings rate by {ATT* 100:.2f} percentage points on average.")

After matching, the estimated ATT is 0.111, indicating that taking part in the financial literacy program increases the savings rate by 11.07 percentage points on average.


In [35]:
print(f"The estimated ATE is {ATE:.3f}, indicating the overall effect of taking part in the financial literacy program on the entire population is {ATE*100:.2f} percentage points.")

The estimated ATE is -0.206, indicating the overall effect of taking part in the financial literacy program on the entire population is -20.64 percentage points.


In [36]:
print(f"The estimated ATC is {ATC:.3f}, indicating that taking part in the financial literacy program would have increased the savings rate for the control group by {ATC*100:.2f} percentage points on average if they had participated.")

The estimated ATC is -0.317, indicating that taking part in the financial literacy program would have increased the savings rate for the control group by -31.71 percentage points on average if they had participated.


### <mark>How do we know that the matching was successful indeed?</mark>

In [37]:
matched_save =\
(
    pd
    .concat(
        [treated_unit,
        # controlled_unit # NOTE that this is not the unit that we are using for analysis
         matched_controlled
        ]
    )
)

In [38]:
len(matched_save) == len(treated_unit) * 2

True

In [39]:
len(matched_save)

1138

In [40]:
len(matched_controlled)

569

### <mark>Balance Check (Propensity Score Matching) = Randomization Check (RCTs)</mark>

In [41]:
def calculate_mean_difference(A, B):
    return (A.mean() - B.mean() 
            / 
            np.sqrt((A.var() + B.var()) 
                    / 2)
           )

In [42]:
save.columns

Index(['age', 'income', 'education', 'participation', 'savings_rate',
       'probability_of_being_treated'],
      dtype='object')

In [43]:
covariates = ['age', 'income', 'education']

In [44]:
type(covariates)

list

In [45]:
mean_difference_before_matching = []
mean_difference_after_matching = []

In [46]:
# statistical test for comparing means 
# t-statistic

t_statistic = []

In [47]:
from scipy.stats import ttest_ind

In [48]:
### Pythonian-way of writing code

for variable in covariates:
    ( # chaining for your understanding
     mean_difference_before_matching
     .append(calculate_mean_difference(treated_unit[variable], 
                                       controlled_unit[variable]
                                      )
            )
    )
    (
     mean_difference_after_matching
     .append(calculate_mean_difference(treated_unit[variable],
                                       matched_controlled[variable]
                                      )
            )
     )
    t_before = ttest_ind(treated_unit[variable],
                          controlled_unit[variable],
                          equal_var = False # team, here, we are NOT assumming there will be equal variable
                         )
    t_after = ttest_ind(treated_unit[variable],
                          matched_controlled[variable],
                          equal_var = False # team, here, we are NOT assumming there will be equal variable
                         )
    
     # simple interface for displaying the info above
    t_statistic.append(
         {"covariate": variable,
          "p_before": t_before.pvalue,
          "p_after": t_after.pvalue
         }
         )

In [49]:
covariates

['age', 'income', 'education']

In [50]:
mean_difference_before_matching

[np.float64(37.36635064087312),
 np.float64(50294.897198560735),
 np.float64(9.979577899873654)]

In [51]:
mean_difference_after_matching

[np.float64(37.401558828605545),
 np.float64(50295.105408821706),
 np.float64(9.57562478735348)]

In [52]:
DF_for_t_statistic =\
(
    pd
    .DataFrame(t_statistic)
)

DF_for_t_statistic

Unnamed: 0,covariate,p_before,p_after
0,age,0.1568968,0.385249
1,income,0.02188371,0.247561
2,education,4.299881e-10,0.80601


## <a id = "p2"> </a> <font color = "red"> Q2. Effect of Loan Counseling on Loan Default Rates </font> [back to table of contents](#top)

#### Q 2-1: Import the following data (`https://talktoroh.com/s/loan_default.csv`) and run a data inspection

In [53]:
loan =\
(
    pd
    .read_csv("https://talktoroh.com/s/loan_default.csv")
)

In [54]:
loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   credit_score  1000 non-null   float64
 1   income        1000 non-null   float64
 2   dti_ratio     1000 non-null   float64
 3   counseling    1000 non-null   int64  
 4   default       1000 non-null   int64  
dtypes: float64(3), int64(2)
memory usage: 39.2 KB


#### Q 2-2: Estimate the propensity score using logistic regression.

In [55]:
loan.columns

Index(['credit_score', 'income', 'dti_ratio', 'counseling', 'default'], dtype='object')

In [56]:
loan["counseling"].unique()

array([0, 1])

In [57]:
response = loan["counseling"]

In [58]:
covariates = loan[['credit_score', 'income', 'dti_ratio']]

In [59]:
logit = LogisticRegression()

In [60]:
(
    logit
    .fit(covariates, response) # calculate, using your algorithm
)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [61]:
(
    logit # calculated
    .predict_proba(covariates) # returns prediction
) # predicted probability of being 0 and 1

array([[1.000e+00, 1.403e-46],
       [1.000e+00, 1.553e-30],
       [8.475e-01, 1.525e-01],
       ...,
       [0.000e+00, 1.000e+00],
       [1.374e-07, 1.000e+00],
       [0.000e+00, 1.000e+00]], shape=(1000, 2))

In [62]:
(
    logit
    .predict_proba(covariates)
).ndim

2

In [63]:
# loan["propensity_scores"]
loan["probability_of_being_treated"] =\
(
    logit
    .predict_proba(covariates)
    [ : , 1] 
         # the column that displays propensity (here, probability) of being 1 (treated)
)

In [64]:
loan

Unnamed: 0,credit_score,income,dti_ratio,counseling,default,probability_of_being_treated
0,674.835708,87987.108732,0.232482,0,0,1.403187e-46
1,643.086785,78492.673658,0.285548,0,0,1.552863e-30
2,682.384427,61192.607398,0.220758,0,0,1.525168e-01
3,726.151493,47061.264446,0.269204,1,0,1.000000e+00
4,638.292331,73964.466272,0.110639,0,0,6.717427e-23
...,...,...,...,...,...,...
995,635.944985,81403.004766,0.307748,0,0,2.007183e-35
996,739.884326,59469.574815,0.325775,1,0,9.912366e-01
997,682.042143,42362.506975,0.175824,1,0,1.000000e+00
998,621.441051,56738.660721,0.333418,1,0,9.999999e-01


#### Q 2-3: Perform matching using nearest neighbor matching.

In [65]:
treated_unit =\
(
    loan
    [loan["counseling"] == 1]
    # .copy()
)

controlled_unit =\
(
    loan
    [loan["counseling"] == 0]
   # .copy()
)

In [66]:
len(treated_unit)

490

In [67]:
len(controlled_unit)

510

In [68]:
(
    np
    .random
    .seed(2025)
)

In [69]:
NN = NearestNeighbors(n_neighbors = 1)

In [70]:
loan.head(7)

Unnamed: 0,credit_score,income,dti_ratio,counseling,default,probability_of_being_treated
0,674.835708,87987.108732,0.232482,0,0,1.403187e-46
1,643.086785,78492.673658,0.285548,0,0,1.552863e-30
2,682.384427,61192.607398,0.220758,0,0,0.1525168
3,726.151493,47061.264446,0.269204,1,0,1.0
4,638.292331,73964.466272,0.110639,0,0,6.717427000000001e-23
5,638.293152,67869.707708,0.321329,0,0,1.224637e-12
6,728.960641,77903.864401,0.300121,0,0,1.072215e-29


In [71]:
treated_unit.head(7)

Unnamed: 0,credit_score,income,dti_ratio,counseling,default,probability_of_being_treated
3,726.151493,47061.264446,0.269204,1,0,1.0
9,677.128002,49295.295769,0.393757,1,0,1.0
13,554.335988,46216.243638,0.20601,1,0,1.0
16,599.358444,46971.639928,0.273549,1,0,1.0
17,665.712367,50322.283319,0.099614,1,0,1.0
18,604.598796,53593.053836,0.363542,1,0,1.0
21,638.711185,48525.999921,0.327738,1,0,1.0


In [72]:
controlled_unit.head(7)

Unnamed: 0,credit_score,income,dti_ratio,counseling,default,probability_of_being_treated
0,674.835708,87987.108732,0.232482,0,0,1.403187e-46
1,643.086785,78492.673658,0.285548,0,0,1.552863e-30
2,682.384427,61192.607398,0.220758,0,0,0.1525168
4,638.292331,73964.466272,0.110639,0,0,6.717427000000001e-23
5,638.293152,67869.707708,0.321329,0,0,1.224637e-12
6,728.960641,77903.864401,0.300121,0,0,1.072215e-29
7,688.371736,72703.436034,0.218291,0,0,7.255081e-21


In [73]:
(
    NN
    .fit(controlled_unit
         [["probability_of_being_treated"]] # propensity scores
        )
)

In [74]:
distance, index_of_controlled =\
(
    NN
    .kneighbors(treated_unit
                [["probability_of_being_treated"]]
               )
)

In [75]:
index_of_controlled.flatten()

array([168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 401, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 114, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 200, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       200, 168, 168, 168, 168, 168, 168, 168, 216, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
       168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 16

In [76]:
matched_controlled =\
(
    controlled_unit
    .iloc[index_of_controlled
          .flatten()
          ]
#    .copy()
)

matched_controlled

Unnamed: 0,credit_score,income,dti_ratio,counseling,default,probability_of_being_treated
314,715.27394,60207.052397,0.366592,0,0,0.877405
314,715.27394,60207.052397,0.366592,0,0,0.877405
314,715.27394,60207.052397,0.366592,0,0,0.877405
314,715.27394,60207.052397,0.366592,0,0,0.877405
314,715.27394,60207.052397,0.366592,0,0,0.877405
...,...,...,...,...,...,...
314,715.27394,60207.052397,0.366592,0,0,0.877405
314,715.27394,60207.052397,0.366592,0,0,0.877405
314,715.27394,60207.052397,0.366592,0,0,0.877405
314,715.27394,60207.052397,0.366592,0,0,0.877405


In [77]:
len(matched_controlled) == len(treated_unit)

True

#### Q 2-4: Estimate the ATE, ATT, and ATC.

In [78]:
ATE =\
(
    treated_unit["default"].mean()
    - # naive comparisons
    controlled_unit["default"].mean()
)

In [79]:
ATE

np.float64(0.0)

In [80]:
ATT =\
(
    treated_unit["default"].mean() 
    - # fair comparisons
    matched_controlled["default"].mean()
)

In [81]:
ATT

np.float64(0.0)

In [82]:
ATC =\
(
    matched_controlled["default"].mean()
    -
    controlled_unit["default"].mean()
)

In [83]:
ATC

np.float64(0.0)

#### Q 2-5: Interpret the results. When answering, please use the following boilerplate.

```python

print(f"After matching, the estimated ATT is {att:.3f}, indicating that {explanatory_variable} increases {response_variable} by {att*100:.2f} percentage points on average.")

print(f"The estimated ATE is {ate:.3f}, indicating the overall effect of {explanatory_variable} on the entire population is {ate*100:.2f} percentage points.")

print(f"The estimated ATC is {atc:.3f}, indicating that {explanatory_variable} would have increased {response_variable} for the control group by {atc*100:.2f} percentage points on average if they had {treatment}.")

In [84]:
print(f"After matching, the estimated ATT is {ATT:.3f}, indicating that taking part in loan counseling increases loan default rates by {ATT*100:.2f} percentage points on average.")

After matching, the estimated ATT is 0.000, indicating that taking part in loan counseling increases loan default rates by 0.00 percentage points on average.


In [85]:
print(f"The estimated ATE is {ATE:.3f}, indicating the overall effect of taking part in loan counseling on the entire population is {ATE*100:.2f} percentage points.")

The estimated ATE is 0.000, indicating the overall effect of taking part in loan counseling on the entire population is 0.00 percentage points.


In [86]:
print(f"The estimated ATC is {ATC:.3f}, indicating that taking part in loan counseling would have increased loan default rates for the control group by {ATC*100:.2f} percentage points on average if they had loan counseling.")

The estimated ATC is 0.000, indicating that taking part in loan counseling would have increased loan default rates for the control group by 0.00 percentage points on average if they had loan counseling.


## <a id = "p3"> </a> <font color = "red"> Q3. Impact of Urban Development Projects on Property Prices </font> [back to table of contents](#top)

#### Q 3-1: Import the following data (`https://talktoroh.com/s/property_prices.csv`) and run a data inspection

In [87]:
price =\
(
    pd
    .read_csv("https://talktoroh.com/s/property_prices.csv")
)

In [88]:
price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   distance_to_center  1000 non-null   float64
 1   crime_rate          1000 non-null   float64
 2   school_quality      1000 non-null   int64  
 3   development         1000 non-null   int64  
 4   property_price      1000 non-null   float64
dtypes: float64(3), int64(2)
memory usage: 39.2 KB


#### Q 3-2: Estimate the propensity score using logistic regression.

In [89]:
price.columns

Index(['distance_to_center', 'crime_rate', 'school_quality', 'development',
       'property_price'],
      dtype='object')

In [90]:
price["development"].unique()

array([0, 1])

In [91]:
response = price["development"]

In [92]:
confounders = price[['distance_to_center', 'crime_rate', 'school_quality']]

In [93]:
logit = LogisticRegression()

In [94]:
(
    logit
    .fit(covariates, response) # calculate, using your algorithm
)

In [95]:
(
    logit # calculated
    .predict_proba(covariates) # returns prediction
) # predicted probability of being 0 and 1

array([[0.973, 0.027],
       [0.938, 0.062],
       [0.9  , 0.1  ],
       ...,
       [0.762, 0.238],
       [0.783, 0.217],
       [0.78 , 0.22 ]], shape=(1000, 2))

In [96]:
(
    logit
    .predict_proba(covariates)
).ndim

2

In [97]:
# price["propensity_scores"]
price["probability_of_being_treated"] =\
(
    logit
    .predict_proba(covariates)
    [ : , 1] 
         # the column that displays propensity (here, probability) of being 1 (treated)
)

In [98]:
price

Unnamed: 0,distance_to_center,crime_rate,school_quality,development,property_price,probability_of_being_treated
0,12.483571,7.798711,6,0,335148.752525,0.027088
1,9.308678,6.849267,3,0,306382.747056,0.062133
2,13.238443,5.119261,8,0,361525.440524,0.100297
3,17.615149,3.706126,3,0,299636.594764,0.132037
4,8.829233,6.396447,2,0,310586.086607,0.080662
...,...,...,...,...,...,...
995,8.594499,7.140300,7,0,333599.813617,0.057597
996,18.988433,4.946957,5,0,305281.768505,0.062693
997,13.204214,3.236251,5,1,339453.273420,0.238233
998,7.144105,4.673866,3,0,311367.485827,0.216870


#### Q 3-3: Perform matching using nearest neighbor matching.

In [99]:
treated_unit =\
(
    price
    [price["development"] == 1]
    # .copy()
)

controlled_unit =\
(
    price
    [price["development"] == 0]
   # .copy()
)

In [100]:
len(treated_unit)

184

In [101]:
len(controlled_unit)

816

In [102]:
(
    np
    .random
    .seed(2025)
)

In [103]:
NN = NearestNeighbors(n_neighbors = 1)

In [104]:
price.head(7)

Unnamed: 0,distance_to_center,crime_rate,school_quality,development,property_price,probability_of_being_treated
0,12.483571,7.798711,6,0,335148.752525,0.027088
1,9.308678,6.849267,3,0,306382.747056,0.062133
2,13.238443,5.119261,8,0,361525.440524,0.100297
3,17.615149,3.706126,3,0,299636.594764,0.132037
4,8.829233,6.396447,2,0,310586.086607,0.080662
5,8.829315,5.786971,3,0,300829.005431,0.111404
6,17.896064,6.790386,6,0,345643.565883,0.026562


In [105]:
treated_unit.head(7)

Unnamed: 0,distance_to_center,crime_rate,school_quality,development,property_price,probability_of_being_treated
10,7.682912,7.634788,7,1,344091.941522,0.048039
13,0.433599,3.621624,2,1,295154.964102,0.499751
15,7.188562,5.395822,8,1,342087.086309,0.155725
21,8.871118,3.8526,8,1,389102.7449,0.265144
29,8.541531,2.954415,9,1,396494.062927,0.377602
30,6.991467,4.485247,5,1,345827.509088,0.238831
31,19.261391,1.662832,5,1,324288.921884,0.279658


In [106]:
controlled_unit.head(7)

Unnamed: 0,distance_to_center,crime_rate,school_quality,development,property_price,probability_of_being_treated
0,12.483571,7.798711,6,0,335148.752525,0.027088
1,9.308678,6.849267,3,0,306382.747056,0.062133
2,13.238443,5.119261,8,0,361525.440524,0.100297
3,17.615149,3.706126,3,0,299636.594764,0.132037
4,8.829233,6.396447,2,0,310586.086607,0.080662
5,8.829315,5.786971,3,0,300829.005431,0.111404
6,17.896064,6.790386,6,0,345643.565883,0.026562


In [107]:
(
    NN
    .fit(controlled_unit
         [["probability_of_being_treated"]] # propensity scores
        )
)

In [108]:
distance, index_of_controlled =\
(
    NN
    .kneighbors(treated_unit
                [["probability_of_being_treated"]]
               )
)

In [109]:
index_of_controlled.flatten()

array([772, 624, 683, 122, 759, 551, 255, 463, 348, 482, 739, 318, 462,
       276, 690, 568, 330, 482, 167, 683,  10, 573, 239, 213, 225, 648,
       635, 779, 482, 673, 215, 712, 648, 128, 521, 220, 527, 747, 589,
        13, 481, 591, 587, 343, 577, 115, 470, 334,   3, 544, 497, 697,
       416, 739, 463, 388, 164, 283, 594, 318,  55,  42, 429, 288, 282,
       602, 482, 521, 482, 195, 341, 412, 251,  13, 585, 146, 567, 118,
       366, 232, 381, 711, 622, 413, 545, 798, 771, 673, 577, 470, 176,
       761, 123, 176, 186, 521, 739, 527,  37, 570, 739, 343, 431, 482,
       779, 108, 470, 527, 676, 555,  80, 517, 810, 462,   9, 650, 804,
       570, 759, 366, 308, 244, 662, 759, 383, 761, 592,  13, 463, 373,
       389, 570, 614,  78, 375, 611, 154, 482, 142,  47, 234, 431, 533,
       678, 771, 521, 667, 521, 624, 204, 638, 710, 600, 527,  10, 613,
       317, 746, 802, 234, 521, 315, 470, 402, 738, 464, 613, 255, 767,
       470, 648, 429, 527, 626, 357, 729, 690, 457, 482, 119, 56

In [110]:
matched_controlled =\
(
    controlled_unit
    .iloc[index_of_controlled
          .flatten()
          ]
#    .copy()
)

matched_controlled

Unnamed: 0,distance_to_center,crime_rate,school_quality,development,property_price,probability_of_being_treated
948,9.772070,7.252101,6,0,348400.244191,0.048090
763,9.519701,1.796068,5,0,346182.262594,0.507282
833,6.161012,5.574248,9,0,369516.986242,0.157240
149,11.484923,3.306077,6,0,345544.324988,0.264905
932,2.759930,4.069379,6,0,345695.324260,0.378603
...,...,...,...,...,...,...
144,11.299414,4.653856,3,0,313473.401794,0.151276
690,8.195169,6.173388,9,0,380431.057499,0.096772
670,16.230426,4.522136,2,0,294632.416819,0.100990
310,8.911594,4.949946,8,0,357360.341955,0.163557


In [111]:
len(matched_controlled) == len(treated_unit)

True

#### Q 3-4: Estimate the ATE, ATT, and ATC.

In [112]:
ATE =\
(
    treated_unit["property_price"].mean()
    - # naive comparisons
    controlled_unit["property_price"].mean()
)

In [113]:
ATE

np.float64(23003.848922164703)

In [114]:
ATT =\
(
    treated_unit["property_price"].mean() 
    - # fair comparisons
    matched_controlled["property_price"].mean()
)

In [115]:
ATT

np.float64(14789.249480063503)

In [116]:
ATC =\
(
    matched_controlled["property_price"].mean()
    -
    controlled_unit["property_price"].mean()
)

In [117]:
ATC

np.float64(8214.5994421012)

#### Q 3-5: Interpret the results. When answering, please use the following boilerplate.

```python

print(f"After matching, the estimated ATT is {att:.3f}, indicating that {explanatory_variable} increases {response_variable} by {att*100:.2f} percentage points on average.")

print(f"The estimated ATE is {ate:.3f}, indicating the overall effect of {explanatory_variable} on the entire population is {ate*100:.2f} percentage points.")

print(f"The estimated ATC is {atc:.3f}, indicating that {explanatory_variable} would have increased {response_variable} for the control group by {atc*100:.2f} percentage points on average if they had {treatment}.")

In [118]:
print(f"After matching, the estimated ATT is {ATT:.3f}, indicating that urban development increases property prices by {ATT*100:.2f} percentage points on average.")

After matching, the estimated ATT is 14789.249, indicating that urban development increases property prices by 1478924.95 percentage points on average.


In [119]:
print(f"The estimated ATE is {ATE:.3f}, indicating the overall effect of urban development on the entire population is {ATE*100:.2f} percentage points.")

The estimated ATE is 23003.849, indicating the overall effect of urban development on the entire population is 2300384.89 percentage points.


In [120]:
print(f"The estimated ATC is {ATC:.3f}, indicating that urban development would have increased property prices for the control group by {ATC*100:.2f} percentage points on average if they had completed urban development projects.")

The estimated ATC is 8214.599, indicating that urban development would have increased property prices for the control group by 821459.94 percentage points on average if they had completed urban development projects.


## <a id = "p4"> </a> <font color = "red"> Q4. Effect of Historical Preservation Status on Property Values </font> [back to table of contents](#top)

#### Q 4-1: Import the following data (`https://talktoroh.com/s/property_values.csv`) and run a data inspection

In [121]:
value =\
(
    pd
    .read_csv("https://talktoroh.com/s/property_values.csv")
)

In [122]:
value.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   property_size        1000 non-null   float64
 1   property_age         1000 non-null   float64
 2   neighborhood_income  1000 non-null   float64
 3   preservation_status  1000 non-null   int64  
 4   property_value       1000 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 39.2 KB


#### Q 4-2: Estimate the propensity score using logistic regression.

In [123]:
value.columns

Index(['property_size', 'property_age', 'neighborhood_income',
       'preservation_status', 'property_value'],
      dtype='object')

In [124]:
value["preservation_status"].unique()

array([1, 0])

In [125]:
response = value["preservation_status"]

In [126]:
confounders = value[['property_size', 'property_age', 'neighborhood_income']]

In [127]:
logit = LogisticRegression()

In [128]:
(
    logit
    .fit(covariates, response) # calculate, using your algorithm
)

In [129]:
(
    logit # calculated
    .predict_proba(covariates) # returns prediction
) # predicted probability of being 0 and 1

array([[0.246, 0.754],
       [0.422, 0.578],
       [0.24 , 0.76 ],
       ...,
       [0.157, 0.843],
       [0.616, 0.384],
       [0.455, 0.545]], shape=(1000, 2))

In [130]:
# value["propensity_scores"]
value["probability_of_being_treated"] =\
(
    logit
    .predict_proba(covariates)
    [ : , 1] 
         # the column that displays propensity (here, probability) of being 1 (treated)
)

In [131]:
value

Unnamed: 0,property_size,property_age,neighborhood_income,preservation_status,property_value,probability_of_being_treated
0,2248.357077,77.987109,49872.325875,1,494979.578638,0.753688
1,1930.867849,68.492674,57832.219939,0,490994.275893,0.577799
2,2323.844269,51.192607,48113.701185,1,512684.970127,0.759567
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539
4,1882.923313,63.964466,31595.779996,1,447927.482191,0.938046
...,...,...,...,...,...,...
995,1859.449854,71.403005,61162.207778,0,469810.410392,0.503931
996,2898.843263,49.469575,63866.288102,0,575492.129405,0.430017
997,2320.421431,32.362507,41373.591368,1,499530.894638,0.842768
998,1714.410505,46.738661,65012.646259,0,499686.077176,0.383546


#### Q 4-3: Perform matching using nearest neighbor matching.

In [132]:
treated_unit =\
(
    value
    [value["preservation_status"] == 1]
    # .copy()
)

controlled_unit =\
(
    value
    [value["preservation_status"] == 0]
   # .copy()
)

In [133]:
len(treated_unit)

505

In [134]:
len(controlled_unit)

495

In [135]:
(
    np
    .random
    .seed(2025)
)

In [136]:
NN = NearestNeighbors(n_neighbors = 1)

In [137]:
price.head(7)

Unnamed: 0,distance_to_center,crime_rate,school_quality,development,property_price,probability_of_being_treated
0,12.483571,7.798711,6,0,335148.752525,0.027088
1,9.308678,6.849267,3,0,306382.747056,0.062133
2,13.238443,5.119261,8,0,361525.440524,0.100297
3,17.615149,3.706126,3,0,299636.594764,0.132037
4,8.829233,6.396447,2,0,310586.086607,0.080662
5,8.829315,5.786971,3,0,300829.005431,0.111404
6,17.896064,6.790386,6,0,345643.565883,0.026562


In [138]:
treated_unit.head(7)

Unnamed: 0,property_size,property_age,neighborhood_income,preservation_status,property_value,probability_of_being_treated
0,2248.357077,77.987109,49872.325875,1,494979.578638,0.753688
2,2323.844269,51.192607,48113.701185,1,512684.970127,0.759567
4,1882.923313,63.964466,31595.779996,1,447927.482191,0.938046
6,2789.606408,67.903864,60018.08213,1,545643.354295,0.540193
7,2383.717365,62.703436,47743.670534,1,500065.380758,0.776691
10,1768.291154,76.347881,35886.602034,1,432080.234096,0.914954
11,1767.135123,53.951992,48559.128322,1,468340.228487,0.748159


In [139]:
controlled_unit.head(7)

Unnamed: 0,property_size,property_age,neighborhood_income,preservation_status,property_value,probability_of_being_treated
1,1930.867849,68.492674,57832.219939,0,490994.275893,0.577799
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539
5,1882.931522,57.869708,63199.405611,0,489203.881698,0.440299
8,1765.262807,70.991054,69888.685026,0,490901.631193,0.309977
9,2271.280022,39.295296,74063.552064,0,544164.390865,0.21147
14,1137.541084,84.719276,72442.122629,0,447228.834491,0.267804
15,1718.856235,53.958216,57092.607918,0,474743.544839,0.573385


In [140]:
(
    NN
    .fit(controlled_unit
         [["probability_of_being_treated"]] # propensity scores
        )
)

In [141]:
distance, index_of_controlled =\
(
    NN
    .kneighbors(treated_unit
                [["probability_of_being_treated"]]
               )
)

In [142]:
index_of_controlled.flatten()

array([  1,   1,   1,  57,   1,   1,   1,   1,   1, 389,   1,   1,   1,
         1,   1,   1,   1,  63,   1,   1,   1,   1,   1,  73,   1,   1,
         1,   1,   1,   1,   1, 340, 274, 378,   1,   1, 116,   1,   1,
       116,   1, 242,   1,   1,   1,   1,   1, 103,   1,   1,   1,   1,
       141,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1, 415,
       213,  71,   1,   1,   1, 242,   1,   1,  82,   0,   1,   1, 231,
         1, 254,   1,   1,   1,   1,   1,   1,   1,   1, 415,   1,  11,
         1,   1, 242,   0,   1, 231,   1,   1, 116,   1,   1,   1,   1,
        53,   1, 103,   1,   1, 231,   0,   1,   1,   1,  84,   1, 410,
         1,   1,   1,   1,   1,   1,   1,   1,   1, 366,   1,   1,   1,
         1,   1,   1, 366,   1,   1,   1,   1,   1,   1, 103,   1,   1,
       378,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1, 103,   1,   1,   1,   1,   1,   1,   1, 213,   1,   1,
         1,   1,   1, 116,   1,   1,   1,   1,   1,   1,   1,   

In [143]:
matched_controlled =\
(
    controlled_unit
    .iloc[index_of_controlled
          .flatten()
          ]
#    .copy()
)

matched_controlled

Unnamed: 0,property_size,property_age,neighborhood_income,preservation_status,property_value,probability_of_being_treated
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539
122,2701.397155,62.774604,59711.094304,0,508952.072242,0.539462
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539
...,...,...,...,...,...,...
20,2732.824384,60.456710,60898.989454,0,534760.599093,0.509594
248,2882.727120,54.675718,57163.194138,0,536294.350323,0.590076
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539
3,2761.514928,37.061264,55380.577055,0,526182.130116,0.606539


In [144]:
len(matched_controlled) == len(treated_unit)

True

#### Q 4-4: Estimate the ATE, ATT, and ATC.

In [145]:
ATE =\
(
    treated_unit["property_value"].mean()
    - # naive comparisons
    controlled_unit["property_value"].mean()
)

In [146]:
ATE

np.float64(-28050.333718945098)

In [147]:
ATT =\
(
    treated_unit["property_value"].mean() 
    - # fair comparisons
    matched_controlled["property_value"].mean()
)

In [148]:
ATT

np.float64(-28179.83412619785)

In [149]:
ATC =\
(
    matched_controlled["property_value"].mean()
    -
    controlled_unit["property_value"].mean()
)

In [150]:
ATC

np.float64(129.50040725275176)

#### Q 4-5: Interpret the results. When answering, please use the following boilerplate.

```python

print(f"After matching, the estimated ATT is {att:.3f}, indicating that {explanatory_variable} increases {response_variable} by {att*100:.2f} percentage points on average.")

print(f"The estimated ATE is {ate:.3f}, indicating the overall effect of {explanatory_variable} on the entire population is {ate*100:.2f} percentage points.")

print(f"The estimated ATC is {atc:.3f}, indicating that {explanatory_variable} would have increased {response_variable} for the control group by {atc*100:.2f} percentage points on average if they had {treatment}.")

In [151]:
print(f"After matching, the estimated ATT is {ATT:.3f}, indicating that historically preserving a property increases the property value by {ATT*100:.2f} percentage points on average.")

After matching, the estimated ATT is -28179.834, indicating that historically preserving a property increases the property value by -2817983.41 percentage points on average.


In [152]:
print(f"The estimated ATE is {ATE:.3f}, indicating the overall effect of historically preserving a property on the entire population is {ATE*100:.2f} percentage points.")

The estimated ATE is -28050.334, indicating the overall effect of historically preserving a property on the entire population is -2805033.37 percentage points.


In [153]:
print(f"The estimated ATC is {ATC:.3f}, indicating that historically preserving a property would have increased the property value for the control group by {ATC*100:.2f} percentage points on average if they had historically preserved the property.")

The estimated ATC is 129.500, indicating that historically preserving a property would have increased the property value for the control group by 12950.04 percentage points on average if they had historically preserved the property.


### <font color="green">"Thank you for putting your efforts into the exercise problem sets 💯"</font>