# SI100B Python Project (Fall, 2022): 
# Social Exploration Based on Data Processing - Lecture 2: Causal Analysis & Data Cleaning (11 points)
*******

Update: `2022-12-3 22:53` 

## Introduction

This week you will learn the basic analysis of data, which means that from this moment on you're going to start working with real social science data!

However, this week's content will probably focus more on providing ideas. This is because you may still have questions about "how to relate data to a problem" before you are introduced to specific methods. This session will show you some basic methods and their respective strengths and limitations. Not all of them may work well in your research, but you may see them in other people's literature or in processed data.

Also, we would like to stress again that the search for data may take much longer than you expect, so be sure to start collecting data as soon as possible.

# Causal Analysis

## Causation

Causation, in the statistical sense, means that when we intervene, the chances of different outcomes are systematically changed.

That is, when we state a proposition such as 'smoking causes lung cancer', we are not saying that smoking necessarily leads to lung cancer, nor are we saying that not smoking necessarily prevents lung cancer, but that if a person is forced to smoke (intervene), the probability of getting lung cancer increases (statistical).

It is important to note that there are two major barriers to establishing causality: *reverse causality* and *confounders*. They are not covered here, but you can search for more information. We have prepared an exercise to help you understand them.

### Correlation and Causality

As we all know, correlation does not equal causation: hot weather boosts ice cream sales, but we probably wouldn't say the opposite (or, perhaps, that the ice cream you buy is actually a weapon of the law of causation?).

Unlike in the natural sciences, establishing causality is a central challenge in the social sciences, because it usually cannot be obtained by experiment. Moreover, society is often more complex than the subjects of natural sciences that also rely on observation, such as meteorology.

## Counterfactual

Now let's think about a topic: what would have happened if you hadn't come to ShanghaiTech University?

What if you had chosen a university with similar scores to ShanghaiTech University, how would we know if you made the right choice? For the sake of convenience, let's say you are only concerned about where your postgraduate studies will take you.

You might quickly think of one criterion: where students from both schools go after graduation. But these two schools have completely different student populations, so this doesn't seem like a great reference for you.

So you suddenly remember that there is a student in your class with the same score as you but who went to Shanghai University of Railway Technology (we believe it exists; if not, we'll create one for you). However, the two of you still seem to be very different: your acceptance of the new type of university, the importance you attach to undergraduate training, the support of your parents, etc. are all different. These differences seem to have influenced your choices to a considerable extent.

So what to do? The best object of comparison is in fact your own self in a parallel universe, the one who made the other choice. This is the 'counterfactual' of our title, so called because this object does not exist. And the essence of **causality** is to find the **counterfactual**.

We can construct a variable such that $Y_i(1)$ indicates whether you go to MIT if you came to ShanghaiTech University, and $Y_i(0)$ indicates whether you go to MIT if you went to Shanghai University of Railway Technology (1 for yes, 0 for no), so the causal effect is $Y_i(1)-Y_i(0)$. That is:

- $Y_i(1)-Y_i(0)=0$ means ShanghaiTech University has no effect on you

- $Y_i(1)-Y_i(0)=-1$ means that ShanghaiTech University is preventing you from going to MIT

- $Y_i(1)-Y_i(0)=1$ means that ShanghaiTech University helped you get to MIT

But here's the problem: only one of $Y_i(1)$ and $Y_i(0)$ can be observed. So we need to approximate the **counterfactual** that is not observed.

### Finding Counterfactuals Can Be Difficult

For researchers, only one of $Y_i(1)$ and $Y_i(0)$ can be observed. However, for individuals, making a decision is more or less a matter of thinking about their $Y_i(1)$ and $Y_i(0)$. This process of decision making is complex and often unobservable to the researcher; also, individuals $i$ with greater $Y_i(1)-Y_i(0)$ are more likely to choose ShanghaiTech University, i.e. the causal effect itself influences the decision.

Of course, even though finding the perfect counterfactual is difficult, knowing that the true causal effect depends on the counterfactual can already help us avoid some misconceptions.

### Understanding the Correct Counterfactuals

Suppose the Night City in the year 2078 has developed an effective vaccine for cyberpsychosis, claiming that the vaccine will make you immune to cyberpsychosis. But soon, a radio host claims that "the vaccine is killing people!"

His argument went like this.

- According to reliable data, 99% of the 1 million people in the Night City have been vaccinated; 2% of those vaccinated have adverse reactions, and adverse reactions have a 1% mortality rate. However, there have been no reports of confirmed cyberpsychosis in vaccinated people, and one can reluctantly assume that the probability of getting cyberpsychosis with the vaccine is 0%.

- Without the vaccination, there is a 3% statistical probability of developing cyberpsychosis per person, with a 22% mortality rate.

- Thus, the number of deaths from the vaccine is $10^6\times 0.99\times 0.02\times 0.01 = 198$, while the number of deaths from cyberpsychosis is $10^6\times 0.01\times 0.03\times 0.22 = 66$, which is only $\frac13$ of the number of deaths from the vaccine.

At first glance, this argument seems to make some sense. Where does the problem lie?

It is a mistake in finding the counterfactual. The correct counterfactual should be "if no one had been vaccinated":

- The number of people who would die from cyberpsychosis is $10^6\times 0.03\times 0.22=6600$, far more than the $198+66=264$ deaths in the actual world.
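The comparison can be checked with a few lines of arithmetic. This is only a sketch that reuses the figures from the radio host's argument; the variable names are ours.

```python
# All figures come from the radio host's argument; the names are illustrative.
population = 1_000_000
vaccinated_share = 0.99
adverse_rate = 0.02        # adverse reactions among the vaccinated
adverse_mortality = 0.01   # deaths among those with adverse reactions
disease_rate = 0.03        # cyberpsychosis among the unvaccinated
disease_mortality = 0.22   # deaths among cyberpsychosis cases

# Factual world: 99% vaccinated, 1% unvaccinated.
vaccine_deaths = population * vaccinated_share * adverse_rate * adverse_mortality
disease_deaths = population * (1 - vaccinated_share) * disease_rate * disease_mortality
factual_deaths = vaccine_deaths + disease_deaths

# Counterfactual world: nobody is vaccinated.
counterfactual_deaths = population * disease_rate * disease_mortality

print(round(vaccine_deaths), round(disease_deaths), round(factual_deaths))
print(round(counterfactual_deaths))
```

The host compared two groups inside the factual world; the causal question compares the whole factual world with the whole counterfactual one.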


## Data and Causality

As the name implies, data is facts. Data by itself cannot tell us what happens in a counterfactual world, and thus **cannot tell us causality**. Therefore, we need to combine it with external knowledge or assumptions.

### External Knowledge

Now take the heat of the weather and ice cream sales as the two dimensions, draw a graph of their relationship and then erase the names of the two variables. Can you tell which is which?

The fact is that pure data can only show correlation, but we can tell by external knowledge (aka common sense) that ice cream sales do not affect temperature. Also, very often we confirm that correlation and causality differ only by relying on common sense.

This poses a problem: it is very easy to fall into the trap if we do not have a certain aspect of knowledge in our heads.

### Assumption

If you've read mystery novels, you should have noticed that detectives often need to make enough assumptions to be able to draw out the evidence that identifies the culprit.

The same is true in reality. Through data, it is often difficult to arrive at the truth straight away. But through "intervention" - in this case, hypothesis - we can better determine causality.

For example, we have said before that correlation is not the same as causation; but put another way, some correlations do imply causation. We need to make causal assumptions in order to draw causal conclusions, especially when the assumption is moderate, qualitative or obvious.

- You assume something, you get something; you assume nothing, you get nothing.

## Randomized Controlled Trial

### 君の名は。

Wait, let's not think about the body swap. Let's look at your twin brother - is your twin brother you in a parallel universe?

It must be admitted that twin studies are significant in 'decomposition'-type studies, such as nature vs. nurture.

However, applying the same formula to the 'causal' type of study, one finds that it does not hold water.

- Internal validity: If the twins are assumed to be the same in all dimensions, why do they end up differing in X?

- External validity: Families with twins may differ from other families in the number of births, birth spacing, etc.

Thus, the main character of this section makes his appearance.

### Core Hypothesis: the Control Group Reflects the Counterfactual of the Intervention Group Had It Not Received the Intervention

A Randomized Controlled Trial (RCT) means that each individual is randomly assigned to either the intervention or the control group. Randomizing the assignment ensures that the average difference in outcomes between the intervention and control groups is fully attributable to the intervention itself, as the characteristics of the two groups are essentially the same before the intervention.

Thus, the core assumption of the RCT is that the intervention and control groups are similar in both observable and unobservable dimensions.
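A small simulation may make this concrete. It is a sketch on made-up data, not a real study: `ability`, the effect size of 2 and the selection rule are all assumptions chosen for illustration. Because we simulate both potential outcomes, we can compare what self-selection and randomization each recover.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes: y0 without the intervention, y1 with it.
ability = rng.normal(size=n)                 # unobservable in real data
y0 = 1.0 + ability + rng.normal(size=n)
y1 = y0 + 2.0                                # assumed true effect: 2 for everyone

# Self-selection: people with higher ability are more likely to opt in.
chosen = ability + rng.normal(size=n) > 0
naive_gap = y1[chosen].mean() - y0[~chosen].mean()

# Randomization: a coin flip decides, independently of ability.
assigned = rng.random(n) < 0.5
rct_gap = y1[assigned].mean() - y0[~assigned].mean()

print(f"self-selected gap: {naive_gap:.2f}")
print(f"randomized gap:    {rct_gap:.2f}")
```

Under self-selection the gap mixes the effect with the ability difference between the groups; under randomization the two groups have similar ability on average, so the gap stays close to the assumed effect.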

Of course, the RCT is not really that complicated, and is essentially similar to the controlled experiments you did in high school biology (or junior high school biology if you didn't take biology in high school). As such, it requires the same fine-grained control of variables. Given that it is difficult for us to do RCTs in practice, we won't go into much detail here. However, as you study someone else's data, you may want to take note of whether it meets the criteria you have in mind.

### Limitations

Of course, RCTs have limitations as well. To save you time, we briefly summarise them in three points.

- Attrition: continuous tracking of participants is difficult, making randomised trials unsuitable for studying long-term effects. In particular, it can happen that the rate of loss varies from group to group, which can affect the accuracy of the experiment to a considerable extent.

- Sample size: it is generally difficult to obtain large samples for randomised trials.

- External validity: experiments (as opposed to observational data) are more 'local' in nature, for example many experiments have a sample of mostly university students.

If you read the extra-curricular material we sent you last week, you will probably have seen all of the above in it.

## Intervention $\to$ Observe

An RCT requires you to intervene in reality in some way. But right now we may not have the power (or rather, the funds...) to intervene. So, can we obtain causality by simply 'observing'?

This is where the importance of assumptions comes in. Without hypotheses, there is no causal inference; and too strong a hypothesis makes the conclusions less credible. Therefore, the core of observational research is to **select appropriate hypotheses and use them as a basis for analysis**.

### However...

Obtaining causality by observation is actually quite difficult.

Suppose now we know that $Wage_i=\alpha+\beta EDU_i+\varepsilon_i$, which means that wage is a linear function related to education.

(This equation may look a little strange, but never mind; for now you only need to know that it is a linear function.)

First of all, if $\beta>0$, can we conclude that education causes higher wages? Not necessarily - people who can go to university tend to be smarter to begin with, and these smart people, even if they don't go to university, are likely to earn more.

So now, by some kind of magic (or, believe it or not, quantitative IQ tests), suppose we can hold the IQ of all survey respondents fixed. But does this mean that there are no other variables? Obviously not; family, emotional intelligence, and even genetics can all be influential factors.

So, why is this? The real reason is that it is difficult to construct perfect controls simply by observation, i.e. two people who are the same in all other dimensions and differ only in their education. If it is this difficult to obtain a rigorous causal relationship for the intuitively "obvious" issue of education increasing income, it is even harder for something much more complex (e.g. whether "involution", i.e. excessive competition, pays off).
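The problem can be illustrated with simulated data. In this sketch every number is an assumption: `ability` stands for the unobserved factors, the true $\beta$ is set to 1, and ability raises both education and wages. Regressing wage on education alone then overstates $\beta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

ability = rng.normal(size=n)                    # unobserved confounder (assumed)
edu = 12 + 2 * ability + rng.normal(size=n)     # years of education
true_beta = 1.0
wage = 5 + true_beta * edu + 3 * ability + rng.normal(size=n)

# Regression of wage on education only: ability is omitted.
X_short = np.column_stack([np.ones(n), edu])
beta_short = np.linalg.lstsq(X_short, wage, rcond=None)[0][1]

# Regression that also includes ability (only possible in a simulation).
X_long = np.column_stack([np.ones(n), edu, ability])
beta_long = np.linalg.lstsq(X_long, wage, rcond=None)[0][1]

print(f"beta without ability: {beta_short:.2f}")
print(f"beta with ability:    {beta_long:.2f}")
```

With real data we cannot include what we cannot observe, which is why the estimate from the first regression alone does not establish causality.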

So we need to look for natural interventions (e.g. turning 18, the first-tier university admission cutoff (一本线), etc.). Of course there are two other approaches, matching and modelling, which are not described here; you can search for them if you are interested and apply them to your research.

## Methods of Causal Inference

There are many methods of causal inference, and here we recommend that you teach yourself both DID and piecewise regression.

As we have very little time, your goal is not so much the technical details as a deeper understanding of:

- How to find counterfactuals in observations when there is a lack of intervention
- Understanding the significance of hypotheses in observational studies

### Difference-in-Difference

Core assumptions: parallel trends

- In the absence of treatment, the difference between the ‘treatment’ and ‘control’ groups is constant over the treatment period.
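Under the parallel-trends assumption, the simplest DID estimate needs only four group means. The numbers below are hypothetical and serve only to show the calculation.

```python
import pandas as pd

# Hypothetical mean outcomes before and after a policy (illustrative numbers).
means = pd.DataFrame(
    {"before": [10.0, 8.0], "after": [15.0, 9.5]},
    index=["treatment", "control"],
)

# First difference: the change over time within each group.
change = means["after"] - means["before"]

# Second difference: the treatment group's change minus the control group's change.
did = change["treatment"] - change["control"]
print(did)
```

The control group's change stands in for what would have happened to the treatment group without treatment, i.e. it plays the role of the counterfactual.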

### Piecewise Regression

Core assumption: locally random

- No manipulation
- Within a very small segment near the breakpoint, which side of it an observation falls on is as good as random
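The idea (often called regression discontinuity) can be sketched as fitting one line on each side of the breakpoint and comparing the two fits at the breakpoint. The data below are simulated; the cutoff, the bandwidth of 0.2 and the jump of 2 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
cutoff = 0.0

# Running variable, e.g. an exam score measured relative to the admission line.
score = rng.uniform(-1, 1, size=n)
treated = score >= cutoff
true_jump = 2.0                                   # assumed effect at the cutoff
outcome = 1.0 + 0.5 * score + true_jump * treated + rng.normal(scale=0.5, size=n)

bandwidth = 0.2                                   # arbitrary choice, see the note below
left = (score < cutoff) & (score >= cutoff - bandwidth)
right = (score >= cutoff) & (score <= cutoff + bandwidth)

# np.polyfit returns coefficients from the highest degree down: slope, intercept.
slope_l, intercept_l = np.polyfit(score[left], outcome[left], deg=1)
slope_r, intercept_r = np.polyfit(score[right], outcome[right], deg=1)

# Estimated jump: difference between the two fitted lines at the cutoff.
jump = (slope_r * cutoff + intercept_r) - (slope_l * cutoff + intercept_l)
print(f"estimated jump at the cutoff: {jump:.2f}")
```

The bandwidth is itself a decision left to the researcher: a narrower window makes the "locally random" assumption more plausible but leaves fewer observations.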

## Researcher Degrees of Freedom

Researcher degrees of freedom are the set of minor decisions, conscious or unconscious, that a researcher might make depending on what the data seem to be showing. This is probably unavoidable with any data, so it is important that you understand this.

It could be changes in the design of the experiment, it could be when to stop collecting data, it could be what data to exclude, it could be what groups to emphasize... and so on. Always bear in mind that there is an implicit element of this in any analysis.

# Use of Data

## Problems with Data

### Differences in Sampling Units

The sampling unit determines whether a particular piece of information is available or not.

For example, if the data you find is sampled by household, then there is no data for individuals.

It is important to note that **partially having it is more harmful than not having it at all**, just as a program that runs but produces wrong results (especially if you don't know it's wrong) is worse than having no program at all.

### Sample Attrition

The same harm shows up here. For example, if an individual or business can be tracked for two consecutive periods, what does that tell you?

It tells you that tracking is difficult, and that whether or not one can be tracked **is not random**.

### Representation

When we conduct a per capita income survey, we may encounter situations where the cost of the survey in towns is different from the cost of the survey in villages. At such times the town sample may be larger than the village sample, resulting in an unreasonable calculation of the average.

At this point, we need to **weight** the sample to calculate it.
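As a sketch of what weighting does, suppose towns are over-sampled. The incomes and the population shares (40% town, 60% village) below are made up; each respondent is weighted by the population share of their stratum divided by that stratum's sample size.

```python
import numpy as np

# Hypothetical survey: 4 town respondents but only 2 village respondents.
town_incomes = np.array([50.0, 60.0, 70.0, 60.0])
village_incomes = np.array([20.0, 30.0])
town_pop_share, village_pop_share = 0.4, 0.6      # assumed true population shares

incomes = np.concatenate([town_incomes, village_incomes])
unweighted = incomes.mean()

# Weight per respondent = population share of the stratum / sample size of the stratum.
weights = np.concatenate([
    np.full(town_incomes.size, town_pop_share / town_incomes.size),
    np.full(village_incomes.size, village_pop_share / village_incomes.size),
])
weighted = np.average(incomes, weights=weights)

print(f"unweighted mean: {unweighted:.2f}")
print(f"weighted mean:   {weighted:.2f}")
```

The unweighted mean is pulled towards town incomes simply because more town residents were surveyed.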

### Misstatement (Intentional or unintentional)

Any data is inevitably laced with human factors.

A classic case in point is the sex ratio at birth in our country. It is well known that there are currently slightly more boys than girls among our newborns. There are two main explanations for this phenomenon.

- The one-child policy + preference for boys + sex selection techniques

- The one-child policy + misstatement

Assuming the second one, how should it be **verified**?

### Changes to the Questions

As times change (seems a very old fashioned way to start...), the design of questionnaires **can't and shouldn't be set in stone**. So, the same question can mean different things if different options are given.

### No Data Is Perfect

Keep in mind that your aim is not to create a perfect piece of data, but to better understand the data and interpret the results. That is, the core of your concern is **whether, or how, imperfections in the data affect the answers to the questions**.

For example, is the rounding of survey data or the error in the estimates important? This may then depend on the question you are studying and the assumptions made.

## Key Points of Programming

As with the previous programming, there is nothing special about programming data. Therefore, you need to pay attention to the same things you needed to pay attention to before: comments, backups, code readability, program efficiency... etc.

So there's not much to say. Just keep it the same as before (unless you haven't done a good job of writing code before).

## Initial Data Cleaning

### Full Awareness of Data

Before embarking on any 'complex' statistical analysis, you must be fully aware of what data you are using and the need for adequate 'description'.

From our earlier discussion of causal analysis, it is easy to see that the more complex the statistical analysis, the more assumptions it relies on (and many of them cannot be tested); in contrast, far fewer assumptions are required to 'describe' the data.

If you are cleaning a single variable, you can describe the original input and then describe the output.

### "Numbers" and "Characters"

You may think it is trivial to distinguish between them.

So now there is a question: which is the number and which is the character, the name or the ID number?

Also note that values like `"NA"` which means "Not Available" may turn your entire variable into characters.
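The effect is easy to reproduce with `pandas`. Note that `pandas.read_csv` already treats the literal string `NA` as missing by default, so this sketch uses a made-up code, `unknown`, and made-up file contents to show what happens with a marker the reader does not recognise.

```python
import io
import pandas as pd

csv_text = "id,income\n1,3000\n2,unknown\n3,4500\n"   # hypothetical file contents

# Read naively: the single string "unknown" makes the whole column non-numeric.
naive = pd.read_csv(io.StringIO(csv_text))
print(pd.api.types.is_numeric_dtype(naive["income"]))   # False

# Option 1: declare the missing-value code when reading.
declared = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])

# Option 2: convert afterwards; anything that cannot be parsed becomes NaN.
coerced = pd.to_numeric(naive["income"], errors="coerce")

print(declared["income"].isna().sum(), coerced.isna().sum())
```

Option 2 silently turns every unparseable entry into a missing value, so it is worth counting the missing values before and after the conversion.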

### Missing Values

`When you can handle missing values correctly, it means you're getting started with data cleaning.`

There is an important implicit assumption in data analysis in practice: analyses involving $(y,x_1,x_2)$ will only use observations **with none of these three variables missing**.

That is, $(y,x_1)\to(y,x_1,x_2)$ contains two effects.

- Adding the variable $x_2$

- Adding a new sample constraint: $x_2$ cannot be a missing value

Moreover, the missing information is not necessarily random.

Note also that missing values are represented differently in different data sets, and a single data set may use more than one missing-value code.

You can find out how the function you are using treats missing values by searching its documentation.
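The following sketch, on a tiny made-up table, shows both points: adding $x_2$ to the analysis shrinks the usable sample, and different functions treat missing values differently (`pandas` skips `NaN` in `mean()` by default, while `numpy.mean` does not).

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in x1 and x2.
df = pd.DataFrame({
    "y":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "x1": [0.5, np.nan, 1.5, 2.0, 2.5],
    "x2": [np.nan, 1.0, np.nan, 3.0, 4.0],
})

# Observations usable for an analysis of (y, x1) versus (y, x1, x2).
n_y_x1 = len(df.dropna(subset=["y", "x1"]))
n_y_x1_x2 = len(df.dropna(subset=["y", "x1", "x2"]))
print(n_y_x1, n_y_x1_x2)

# The same column, two different treatments of missing values.
pandas_mean = df["x1"].mean()                 # NaN is skipped
numpy_mean = np.mean(df["x1"].to_numpy())     # NaN propagates
print(pandas_mean, numpy_mean)
```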

#### How to Handle Missing Values

A simple and brutal way would be to record the missing value as 0. But obviously, a missing value $\neq$ 0.

Another idea is to replace the missing value with the average value. This idea also has some problems, because missingness is not necessarily random. For example, people who have no concept of income, who have grey income, or who have too much income may not report their income.

Also, there is the question: does nonmissing + missing = missing? The cost of this is that relatively important information may be lost because of relatively minor information.
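Each of these choices encodes an assumption, which a short sketch on made-up numbers can make visible. The series below are hypothetical; `fillna` and `Series.add(..., fill_value=...)` are standard `pandas` methods.

```python
import numpy as np
import pandas as pd

income = pd.Series([1000.0, 2000.0, np.nan, 3000.0])   # hypothetical incomes

filled_zero = income.fillna(0)                # assumes the non-reporter earns nothing
filled_mean = income.fillna(income.mean())    # assumes the non-reporter is average
print(filled_zero.mean(), filled_mean.mean())

# nonmissing + missing: two possible answers.
wage = pd.Series([100.0, 200.0])
bonus = pd.Series([10.0, np.nan])

total_strict = wage + bonus                      # the total becomes missing
total_lenient = wage.add(bonus, fill_value=0)    # the missing bonus is treated as 0
print(total_strict.tolist(), total_lenient.tolist())
```

The strict version loses the known wage because of an unknown bonus; the lenient version keeps it but quietly assumes the bonus was zero.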

So what exactly do we do? The first thing to understand is why the missing values occur. Some information is well suited to filling in the values:

- Gender and date of birth across consecutive surveys

- The presence of logical relationships, e.g. is there a job? $\to$ If the answer is yes, what were the hours of work?

Of course, it is more likely that the values still cannot be filled in well. It must be acknowledged that perfect imputation does not exist, as it is impossible to create something out of nothing. It is therefore more important to **be clear what assumptions are made when dealing with missing values**. For example, when calculating total consumption, if an important consumption subcategory is missing (e.g. food), the total is marked as missing; otherwise (e.g. toys), imputation is used to assign values.

Imputation is a separate discipline and will not be expanded upon here. You can search for yourself if you are interested.

### Extreme Values

Similarly, there is a dilemma in dealing with extreme values.

Often, extreme values (outliers) do affect your statistics to a large extent. For example, if a respondent overlooks that the unit is "million" rather than "dollar" when filling out an income questionnaire, or if there is simply a very high income group, this can have a significant impact on the mean income. It is obviously reasonable to remove them at such times.

In addition, we can also control for extremes, for example by winsorizing or trimming, but what exactly should be done? One-sided or two-sided, 1% or 5%, and there are also the researcher degrees of freedom to consider... There are many, many options.

So again, it is more important to **be clear what assumptions are made when dealing with extreme values**. Describing the data more thoroughly provides more benchmarks for comparison.
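The difference between winsorizing and trimming can be sketched with `numpy` alone. The incomes and the two-sided 5% cutoffs below are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np

# Hypothetical incomes with one extreme value.
income = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 5000.0])

# Two-sided cutoffs at the 5th and 95th percentiles (an arbitrary choice).
low, high = np.percentile(income, [5, 95])

winsorized = np.clip(income, low, high)                 # cap extreme values at the cutoffs
trimmed = income[(income >= low) & (income <= high)]    # drop extreme values entirely

print(f"raw mean:        {income.mean():.1f}")
print(f"winsorized mean: {winsorized.mean():.1f}")
print(f"trimmed mean:    {trimmed.mean():.1f}")
```

Winsorizing keeps the sample size and only pulls the extremes in, while trimming removes observations; reporting the raw statistic next to the adjusted ones lets readers see how much the choice matters.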

## Application: Discrimination

Here is an extra-curricular resource for you in order to give you some reference for researching a particular topic.

You don't need to read everything in this, but it should give you some insight.

## Phase Wrap-up: You're Always Going to Have to Do Something

Today's material contains a considerable amount of content about ways of thinking. It may take you a while to understand them and then apply them to your goals.

Of course, you may still be feeling a bit out of your depth at the moment. For the data manipulation part, the following exercises will help you. You will also learn how to approach a specific problem.

However, the topics you need to study are probably more complex than those in the exercises. So, you should try to break them down into smaller, more manageable elements that you can study. Then, go through them one by one and outline your project.

At the same time, don't worry that your answers to the questions won't be 'perfect'. It is more important to show your thinking than to answer the questions.

## Exercises (11 points)

### 0. Causality (1 point)

- All other things being equal, people who drink moderate amounts of alcohol per day have a lower mortality rate than those who do not drink. Why?

- The proportion of left-handed people who die each year is smaller than the proportion of left-handed people in the total population. Why?

### 1. Handling Realistic Data: Efficacy of Small Class Size in Early Education (10 points)

#### Background

The STAR (Student–Teacher Achievement Ratio) Project is a four-year longitudinal study examining the effect of class size in early grade levels on educational performance and personal development. A longitudinal study is one in which the same participants are followed over time. This particular study lasted from 1985 to 1989 and involved 11,601 students. During the four years of the study, students were randomly assigned to small classes, regular-sized classes, or regular-sized classes with an aid.

We will analyze just a portion of this data to investigate whether the small class sizes improved educational performance or not. The data file name is `STAR.csv`, which is in CSV format. The names and descriptions of variables in this data set are displayed in the table below. Note that there are a fair amount of missing values in this data set, which arise, for example, because some students left a STAR school before third grade, or did not enter a STAR school until first grade.

| Variable     | Description                                                  |
| ------------ | ------------------------------------------------------------ |
| `race`       | student’s race (white = 1, black = 2, Asian = 3, Hispanic = 4, Native American = 5, others = 6) |
| `classtype`  | type of kindergarten class (small = 1, regular = 2, regular with aid = 3) |
| `g4math`     | total scaled score for the math portion of the fourth-grade standardized test |
| `g4reading`  | total scaled score for the reading portion of the fourth-grade standardized test |
| `yearssmall` | number of years in small classes                             |
| `hsgrad`     | high-school graduation (did graduate = 1, did not graduate = 0) |


#### Action

Analyse the following topics by processing the data in `STAR.csv`:

1. How does performance on fourth-grade reading and math tests for those students assigned to a small class in kindergarten compare with those assigned to a regular-sized class? Do students in the smaller classes perform better?
2. Examine whether the STAR program reduced achievement gaps across different racial groups. Begin by comparing the reading and math test scores between white and minority students (i.e., blacks and Hispanics) among those students who were assigned to regular-sized classes with no aid. Conduct the same comparison among those students who were assigned to small classes.

Use any indicators or plots to make the comparisons while handling missing values. Display the results in a clear format.

#### Questions (The answers are not unique)

- How do you group the data?
- How do you handle the missing values?
- What aspects can your indicators or plots reflect accurately? What can't?
- Give a brief substantive interpretation of the results.

#### Code

# Your code here

## Reference

ECON1013, Data and Society