# Matching

- Matching is the process of finding two or more pieces of data that are the same, or similar, on chosen characteristics.

- This can be done by comparing the values of the data or by looking at the relationships between data points.

- Imagine you have a box of toys. You want to find all of the pairs of matching toys.

- You can do this by matching the toys by their color, size, or shape. For example, you could find all of the red toys, or all of the big toys, or all of the round toys.

- Once you have found all of the matching toys, you can put them away in pairs.
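
The toy analogy above can be sketched as exact matching in pandas. This is only an illustration: the `toys` table, its column names, and its values are made up for the example and are not part of the case study data.

```python
import pandas as pd

# Hypothetical toy data: 'red' marks the group of interest (1) vs. the rest (0).
toys = pd.DataFrame({
    "toy":   ["car", "ball", "block", "truck", "marble", "cube"],
    "red":   [1, 1, 1, 0, 0, 0],
    "size":  ["big", "small", "small", "big", "small", "big"],
    "shape": ["boxy", "round", "boxy", "boxy", "round", "boxy"],
})

red_toys = toys[toys.red == 1]
other_toys = toys[toys.red == 0]

# Exact matching: pair each red toy with every non-red toy that has
# identical values on all matching variables (size and shape).
pairs = red_toys.merge(other_toys, on=["size", "shape"],
                       suffixes=("_red", "_other"))
print(pairs[["toy_red", "toy_other", "size", "shape"]])
```

In this made-up data the small boxy block has no non-red counterpart, so it does not appear in `pairs`. Units without a match are simply left out of the comparison.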

# Case Study: Are Catholic Schools Better?

In [4]:
!pip install CausalInference

Collecting CausalInference
  Downloading CausalInference-0.1.3-py3-none-any.whl (51 kB)
Installing collected packages: CausalInference
Successfully installed CausalInference-0.1.3


In [5]:
# Change to the working directory
%cd /content/drive/MyDrive/Business Analyst Workbooks/Matching

/content/drive/MyDrive/Business Analyst Workbooks/Matching


In [6]:
# Import the Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as ss
from causalinference import CausalModel

In [7]:
# Get Data
df = pd.read_csv("school.csv")

In [8]:
df.head()

Unnamed: 0,childid,catholic,race,number_places_lived,mom_age,dad_age,dad_education,mom_education,mom_score,dad_score,income,poverty,food_stamps,score_standardized
0,0001002C,0,"WHITE, NON-HISPANIC",1,47,45,DOCTORATE OR PROFESSIONAL DEGREE,SOME COLLEGE,53.5,77.5,62500.5,0,0,0.981753
1,0001004C,0,"WHITE, NON-HISPANIC",1,41,48,BACHELOR'S DEGREE,GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,34.95,53.5,45000.5,0,0,0.594378
2,0001010C,0,"WHITE, NON-HISPANIC",1,43,55,"MASTER'S DEGREE (MA, MS)",GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,63.43,53.5,62500.5,0,0,0.490611
3,0001011C,1,"WHITE, NON-HISPANIC",1,38,39,BACHELOR'S DEGREE,GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,53.5,53.5,87500.5,0,0,1.451278
4,0001012C,0,"WHITE, NON-HISPANIC",1,47,57,DOCTORATE OR PROFESSIONAL DEGREE,"MASTER'S DEGREE (MA, MS)",61.56,77.5,150000.5,0,0,2.595699


In [9]:
# Drop the unnecessary first column (childid)
df = df.iloc[:,1:]
df.head()

Unnamed: 0,catholic,race,number_places_lived,mom_age,dad_age,dad_education,mom_education,mom_score,dad_score,income,poverty,food_stamps,score_standardized
0,0,"WHITE, NON-HISPANIC",1,47,45,DOCTORATE OR PROFESSIONAL DEGREE,SOME COLLEGE,53.5,77.5,62500.5,0,0,0.981753
1,0,"WHITE, NON-HISPANIC",1,41,48,BACHELOR'S DEGREE,GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,34.95,53.5,45000.5,0,0,0.594378
2,0,"WHITE, NON-HISPANIC",1,43,55,"MASTER'S DEGREE (MA, MS)",GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,63.43,53.5,62500.5,0,0,0.490611
3,1,"WHITE, NON-HISPANIC",1,38,39,BACHELOR'S DEGREE,GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,53.5,53.5,87500.5,0,0,1.451278
4,0,"WHITE, NON-HISPANIC",1,47,57,DOCTORATE OR PROFESSIONAL DEGREE,"MASTER'S DEGREE (MA, MS)",61.56,77.5,150000.5,0,0,2.595699


# Unconfoundedness

- Imagine you have two groups of toys: one group of red toys, and one group of blue toys. You want to know if being red makes a toy more fun to play with.

- To find out, you could look at how much fun children have playing with red toys versus blue toys. But there is a problem: some toys might be more fun to play with for other reasons, such as because they are shaped like a dinosaur or because they make a noise.

- To solve this problem, you need to make sure that the two groups of toys are similar in all other ways, except for their color. This is called unconfoundedness.

- One way to make sure that the two groups of toys are unconfounded is to match them. This means finding two toys that are similar in all other ways, except for their color. You can match the toys by looking at their size, shape, and other factors.

- Once you have matched the toys, you can compare how much fun children have playing with them. If the red toys are more fun to play with than the blue toys, then you can conclude that being red makes a toy more fun to play with.

- Unconfoundedness is an important concept in data analytics. It allows us to draw causal conclusions from data. This means that we can learn about the causes and effects of things.
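
For readers who prefer notation, the idea is often written in potential-outcomes form. The symbols below are not used elsewhere in this notebook and are introduced here only as an assumption-laden sketch: $Y_i(1)$ and $Y_i(0)$ are the outcomes unit $i$ would have with and without the treatment, $T_i$ is the treatment indicator (here, `catholic`), and $X_i$ is the vector of confounders.

$$
\big(Y_i(0),\, Y_i(1)\big) \;\perp\!\!\!\perp\; T_i \;\mid\; X_i
$$

Read aloud: once we compare units with the same confounder values, which group a unit is in carries no further information about its potential outcomes. If, in addition, both groups are present at every value of $X_i$, the average effect can be written in terms of observable quantities:

$$
\mathbb{E}\big[Y(1) - Y(0)\big] \;=\; \mathbb{E}_X\Big[\, \mathbb{E}[Y \mid T = 1, X] \;-\; \mathbb{E}[Y \mid T = 0, X] \,\Big]
$$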


# Data Analysis

- Using the groupby function, we split the data by whether or not the child attends a Catholic school and compute the mean of every numeric feature in each group with the mean function.

In [10]:
# Compare group averages (numeric columns only)
df.groupby('catholic').mean(numeric_only=True)


catholic,number_places_lived,mom_age,dad_age,mom_score,dad_score,income,poverty,food_stamps,score_standardized
0,1.106246,37.794621,40.134919,43.909495,42.59052,65393.92854,0.101578,0.045566,0.163128
1,1.073118,39.775269,42.007527,47.620871,45.908269,86180.625269,0.016129,0.006452,0.219685


In [11]:
# T-test (used for continuous variables)
group1 = df.where(df.catholic == 0).dropna()["income"] # Non-Catholic
group2 = df.where(df.catholic == 1).dropna()["income"] # Catholic
group2.head()

3      87500.5
21    150000.5
22     87500.5
23    150000.5
24    150000.5
Name: income, dtype: float64

In [12]:
stat, p = ss.ttest_ind(group1,group2)
print(p)

5.943636213205364e-41


- We now loop over all continuous variables and compute the t-test p-value for each one.

In [13]:
# Variables to test
continuous = ["number_places_lived", "mom_age", "dad_age", "mom_score", "dad_score", "income"]
# Where to store the result
stat = {}
p = {}
# Loop
for x in continuous:
  group1 = df.where(df.catholic == 0).dropna()[x] # Non-Catholic
  group2 = df.where(df.catholic == 1).dropna()[x] # Catholic
  stat[x], p[x] = ss.ttest_ind(group1,group2)
ttests = pd.DataFrame.from_dict(p, orient="index")
ttests.columns = ['pvalue']
print(ttests)

                           pvalue
number_places_lived  7.072609e-03
mom_age              1.359492e-22
dad_age              3.344265e-16
mom_score            2.280116e-19
dad_score            5.489722e-18
income               5.943636e-41


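- Every continuous variable has a p-value below 0.01, so the Catholic and non-Catholic groups differ significantly on each of them. These variables are imbalanced between the groups and are therefore candidates to match on.

- We will use a chi-square test to determine the relationship between the binary variables and Catholic school attendance.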

In [14]:
# Chi-Square Test
tab = pd.crosstab(index = df.poverty,
                  columns = df.catholic)
tab

catholic,0,1
poverty,,
0,4042,915
1,457,15


In [15]:
statistic, p, dof, exp = ss.chi2_contingency(tab)
p

6.511354893726035e-17

- We repeat the chi-square test in a loop over all binary variables.

In [16]:
# Variables to loop over
categorical = ["poverty", "food_stamps"]
# Where to Store
statistic = {}
p = {}
dof = {}
exp = {}
# Loop
for x in categorical:
  tab = pd.crosstab(index = df[x],
                  columns = df.catholic)
  statistic[x], p[x], dof[x], exp[x] = ss.chi2_contingency(tab)
chisquare = pd.DataFrame.from_dict(p, orient="index")
chisquare.columns = ["pvalue"]
print(chisquare)

                   pvalue
poverty      6.511355e-17
food_stamps  3.294153e-08


- Both p-values are far below 0.05, so poverty and food stamp status are also strongly associated with Catholic school attendance.

# Matching Preparation

In [17]:
temp = pd.get_dummies(df)
temp.head(1)

Unnamed: 0,catholic,number_places_lived,mom_age,dad_age,mom_score,dad_score,income,poverty,food_stamps,score_standardized,...,dad_education_VOC/TECH PROGRAM,mom_education_8TH GRADE OR BELOW,mom_education_9TH - 12TH GRADE,mom_education_BACHELOR'S DEGREE,mom_education_DOCTORATE OR PROFESSIONAL DEGREE,mom_education_GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,mom_education_HIGH SCHOOL DIPLOMA/EQUIVALENT,"mom_education_MASTER'S DEGREE (MA, MS)",mom_education_SOME COLLEGE,mom_education_VOC/TECH PROGRAM
0,0,1,47,45,53.5,77.5,62500.5,0,0,0.981753,...,0,0,0,0,0,0,0,0,1,0


In [18]:
print(temp.columns)

Index(['catholic', 'number_places_lived', 'mom_age', 'dad_age', 'mom_score',
       'dad_score', 'income', 'poverty', 'food_stamps', 'score_standardized',
       'race_AMERICAN INDIAN OR ALASKA NATIVE', 'race_ASIAN',
       'race_BLACK OR AFRICAN AMERICAN, NON-HISPANIC',
       'race_HISPANIC, RACE NOT SPECIFIED', 'race_HISPANIC, RACE SPECIFIED',
       'race_MORE THAN ONE RACE, NON HISPANIC',
       'race_NATIVE HAWAIIAN, OTHER PACIFIC ISLANDER', 'race_NOT ASCERTAINED',
       'race_WHITE, NON-HISPANIC', 'dad_education_8TH GRADE OR BELOW',
       'dad_education_9TH - 12TH GRADE', 'dad_education_BACHELOR'S DEGREE',
       'dad_education_DOCTORATE OR PROFESSIONAL DEGREE',
       'dad_education_GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE',
       'dad_education_HIGH SCHOOL DIPLOMA/EQUIVALENT',
       'dad_education_MASTER'S DEGREE (MA, MS)', 'dad_education_SOME COLLEGE',
       'dad_education_VOC/TECH PROGRAM', 'mom_education_8TH GRADE OR BELOW',
       'mom_education_9TH - 12TH GRADE', 'mom_

# Curse of Dimensionality

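- We need to keep the model as simple as possible. Suppose one variable has three options and another variable has three more: together they define 3 × 3 = 9 possible combinations, and every additional variable multiplies that number again. The number of combinations grows exponentially while the number of observations stays fixed, so many combinations contain few or no observations and the results become artificial and hard to interpret.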
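
The growth described above can be made concrete with a few lines of arithmetic. This is a rough sketch: the category counts and the sample size are read off outputs shown in this notebook and treated as given, and the calculation ignores the continuous variables entirely.

```python
from math import prod

# The example from the text: two variables with three options each.
simple_cells = prod([3, 3])

# Category counts read off this notebook's outputs (treated here as given).
levels = {
    "race": 9,
    "dad_education": 9,
    "mom_education": 9,
    "poverty": 2,
    "food_stamps": 2,
}
n_cells = prod(levels.values())

# Sample size reported in the logistic regression summary of this notebook.
n_obs = 5429
avg_per_cell = n_obs / n_cells

print(simple_cells, n_cells, round(avg_per_cell, 2))
```

With only these five categorical variables there are already more than 2,900 possible combinations, which leaves fewer than two observations per combination on average. That is the motivation for collapsing categories before matching.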

In [19]:
df.race.value_counts()

WHITE, NON-HISPANIC                        3654
HISPANIC, RACE NOT SPECIFIED                408
HISPANIC, RACE SPECIFIED                    387
BLACK OR AFRICAN AMERICAN, NON-HISPANIC     357
ASIAN                                       342
MORE THAN ONE RACE, NON HISPANIC            123
NATIVE HAWAIIAN, OTHER PACIFIC ISLANDER      93
AMERICAN INDIAN OR ALASKA NATIVE             62
NOT ASCERTAINED                               3
Name: race, dtype: int64

- Nine race categories are too many to match on, so we simplify.

- Keep White, Black or African American, and Asian as separate indicators, and merge the two Hispanic categories into one. The remaining small categories get no indicator of their own.

In [20]:
# Preparing Race Variable with if else condition
df['race_asian'] = np.where(df.race == 'ASIAN', 1, 0)
df['race_white'] = np.where(df.race == 'WHITE, NON-HISPANIC', 1, 0)
df['race_black'] = np.where(df.race == 'BLACK OR AFRICAN AMERICAN, NON-HISPANIC', 1, 0)
df['race_hispanic'] = np.where((df.race == 'HISPANIC, RACE NOT SPECIFIED') |
                                 (df.race == 'HISPANIC, RACE SPECIFIED' ), 1, 0)
df.head()

Unnamed: 0,catholic,race,number_places_lived,mom_age,dad_age,dad_education,mom_education,mom_score,dad_score,income,poverty,food_stamps,score_standardized,race_asian,race_white,race_black,race_hispanic
0,0,"WHITE, NON-HISPANIC",1,47,45,DOCTORATE OR PROFESSIONAL DEGREE,SOME COLLEGE,53.5,77.5,62500.5,0,0,0.981753,0,1,0,0
1,0,"WHITE, NON-HISPANIC",1,41,48,BACHELOR'S DEGREE,GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,34.95,53.5,45000.5,0,0,0.594378,0,1,0,0
2,0,"WHITE, NON-HISPANIC",1,43,55,"MASTER'S DEGREE (MA, MS)",GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,63.43,53.5,62500.5,0,0,0.490611,0,1,0,0
3,1,"WHITE, NON-HISPANIC",1,38,39,BACHELOR'S DEGREE,GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE,53.5,53.5,87500.5,0,0,1.451278,0,1,0,0
4,0,"WHITE, NON-HISPANIC",1,47,57,DOCTORATE OR PROFESSIONAL DEGREE,"MASTER'S DEGREE (MA, MS)",61.56,77.5,150000.5,0,0,2.595699,0,1,0,0


In [21]:
# Preparing Education Variable
df.dad_education.value_counts()

HIGH SCHOOL DIPLOMA/EQUIVALENT            1524
SOME COLLEGE                              1344
BACHELOR'S DEGREE                         1026
9TH - 12TH GRADE                           355
MASTER'S DEGREE (MA, MS)                   354
VOC/TECH PROGRAM                           306
DOCTORATE OR PROFESSIONAL DEGREE           224
8TH GRADE OR BELOW                         167
GRADUATE/PROFESSIONAL SCHOOL-NO DEGREE     129
Name: dad_education, dtype: int64

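- Nine education levels are too fine-grained, so we reduce the dimensions to a single indicator of higher education (a bachelor's, master's, or doctorate/professional degree).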

In [24]:
# Transforming the Education Variable
df["dad_higher_education"] = np.where((df.dad_education == "BACHELOR'S DEGREE") |
                                        (df.dad_education == "MASTER'S DEGREE (MA, MS)") |
                                        (df.dad_education == "DOCTORATE OR PROFESSIONAL DEGREE"),1,0)

df["mom_higher_education"] = np.where((df.mom_education == "BACHELOR'S DEGREE") |
                                        (df.mom_education == "MASTER'S DEGREE (MA, MS)") |
                                        (df.mom_education == "DOCTORATE OR PROFESSIONAL DEGREE"),1,0)
df.head(1)

Unnamed: 0,catholic,race,number_places_lived,mom_age,dad_age,dad_education,mom_education,mom_score,dad_score,income,poverty,food_stamps,score_standardized,race_asian,race_white,race_black,race_hispanic,dad_higher_education,mom_higher_education
0,0,"WHITE, NON-HISPANIC",1,47,45,DOCTORATE OR PROFESSIONAL DEGREE,SOME COLLEGE,53.5,77.5,62500.5,0,0,0.981753,0,1,0,0,1,0


In [25]:
# Cleaning Dataset (Removing the text based columns)
df = df.drop(columns= ["race", "dad_education", "mom_education"])
df.head(1)

Unnamed: 0,catholic,number_places_lived,mom_age,dad_age,mom_score,dad_score,income,poverty,food_stamps,score_standardized,race_asian,race_white,race_black,race_hispanic,dad_higher_education,mom_higher_education
0,0,1,47,45,53.5,77.5,62500.5,0,0,0.981753,0,1,0,0,1,0


In [27]:
# Isolating y, treatment and confounders
treat = df.catholic.values
y = df.score_standardized.values
confounders = df.drop(columns = ["catholic", "score_standardized"]).values

# Common Support Region

- Imagine you have a box of red and blue toys. You want to know if red toys are more fun to play with than blue toys.

- However, you know that some toys might be more fun to play with for other reasons, such as because they are shaped like a dinosaur or because they make a noise.

- To ensure that you are comparing two groups of toys that are similar in all other ways, except for their color, you need to create a common support region. This means finding a range of values for the covariates (i.e., the factors that can affect how fun a toy is to play with) for which there are both red and blue toys.

- For example, you might decide to only compare toys that are the same size and shape. This would create a common support region because there are both red and blue toys that are the same size and shape.

- Once you have created a common support region, you can compare the fun factor of the red toys to the fun factor of the blue toys. If the red toys are more fun to play with than the blue toys, then you can conclude that being red makes a toy more fun to play with.

- The common support region is an important concept in data analytics because it allows us to draw causal conclusions from our data. However, it is not always easy to identify by inspecting the covariates directly. A common approach is to estimate each observation's probability of receiving the treatment given its covariates (for example with a logistic regression) and to compare only observations whose estimated probabilities fall in a range where both groups are present.
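
One simple way to make the idea concrete is a min-max overlap rule on estimated treatment probabilities. The sketch below is an assumption about how such a check could look, not the method this notebook goes on to use: `common_support` is a hypothetical helper, and the scores are made-up numbers.

```python
import numpy as np

def common_support(scores, treat):
    """Return (low, high, mask) for the overlap of the two groups' score ranges.

    The region runs from the larger of the two group minima to the
    smaller of the two group maxima (a simple min-max rule).
    """
    scores = np.asarray(scores, dtype=float)
    treat = np.asarray(treat).astype(bool)
    low = max(scores[treat].min(), scores[~treat].min())
    high = min(scores[treat].max(), scores[~treat].max())
    mask = (scores >= low) & (scores <= high)
    return low, high, mask

# Hypothetical treatment probabilities, for illustration only.
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.60, 0.80])
treat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

low, high, mask = common_support(scores, treat)
print(low, high, int(mask.sum()))
```

Observations outside `[low, high]` have no counterpart in the other group and would be excluded from the comparison. With real data, the estimated probabilities and the treatment indicator would take the place of the made-up arrays.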

In [34]:
# Logistic regression: model the probability of treatment (catholic) given the confounders
import statsmodels.api as sm
confounders_csr = sm.add_constant(confounders)
csr_model = sm.Logit(treat, confounders_csr)
csr_model = csr_model.fit()
print(csr_model.summary())

Optimization terminated successfully.
         Current function value: 0.430983
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                 5429
Model:                          Logit   Df Residuals:                     5414
Method:                           MLE   Df Model:                           14
Date:                Fri, 06 Oct 2023   Pseudo R-squ.:                 0.05888
Time:                        22:47:21   Log-Likelihood:                -2339.8
converged:                       True   LL-Null:                       -2486.2
Covariance Type:            nonrobust   LLR p-value:                 3.835e-54
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -4.3412      0.418    -10.380      0.000      -5.161      -3.521
x1            -0.1666      0.

In [35]:
# Checking correlations in the confounders
df.drop(columns = ["catholic", "score_standardized"]).corr()

Unnamed: 0,number_places_lived,mom_age,dad_age,mom_score,dad_score,income,poverty,food_stamps,race_asian,race_white,race_black,race_hispanic,dad_higher_education,mom_higher_education
number_places_lived,1.0,-0.105116,-0.096571,-0.042976,-0.033342,-0.039007,0.05845,0.063574,0.030205,-0.030456,-0.004143,0.016853,-0.022836,-0.04523
mom_age,-0.105116,1.0,0.741862,0.224808,0.237727,0.276373,-0.134362,-0.106739,0.05227,0.098262,-0.060004,-0.116422,0.265912,0.281417
dad_age,-0.096571,0.741862,1.0,0.179047,0.197057,0.222995,-0.1006,-0.08356,0.085816,0.063543,-0.039396,-0.108158,0.247213,0.213065
mom_score,-0.042976,0.224808,0.179047,1.0,0.30938,0.361271,-0.190873,-0.111281,-0.015435,0.155196,-0.033081,-0.158292,0.324952,0.458104
dad_score,-0.033342,0.237727,0.197057,0.30938,1.0,0.406067,-0.172984,-0.113052,0.029,0.144808,-0.075754,-0.147734,0.467236,0.321471
income,-0.039007,0.276373,0.222995,0.361271,0.406067,1.0,-0.369395,-0.214107,-0.005106,0.247893,-0.119058,-0.199047,0.440927,0.397905
poverty,0.05845,-0.134362,-0.1006,-0.190873,-0.172984,-0.369395,1.0,0.38783,0.046462,-0.267137,0.092218,0.217985,-0.1683,-0.157597
food_stamps,0.063574,-0.106739,-0.08356,-0.111281,-0.113052,-0.214107,0.38783,1.0,-0.005068,-0.150369,0.108143,0.078449,-0.105153,-0.105883
race_asian,0.030205,0.05227,0.085816,-0.015435,0.029,-0.005106,0.046462,-0.005068,1.0,-0.372021,-0.06879,-0.107396,0.08966,0.07565
race_white,-0.030456,0.098262,0.063543,0.155196,0.144808,0.247893,-0.267137,-0.150369,-0.372021,1.0,-0.380653,-0.594279,0.105363,0.130682


In [36]:
# Predictions
probabilities = csr_model.predict(confounders_csr)
probabilities

array([0.26041384, 0.17273291, 0.24596599, ..., 0.23182116, 0.12192314,
       0.1825144 ])