# Chapter 1: Introduction to Data

#### Walkthrough of the chapter's *Guided Practice* and exercises.

### Guided Practice 1.1
The proportion of patients in the treatment group who had a stroke by the end of their first year can be calculated as
\begin{equation}
\frac{45}{45 + 179} = \frac{45}{224} \approx 0.20 = 20\%
\end{equation}
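
As a quick check, the same arithmetic can be sketched in Python. The two counts are the ones in the equation above; the variable names are only illustrative.

```python
# Counts taken from the equation above: 45 treatment-group patients had a
# stroke within the first year and 179 did not.
stroke = 45
no_stroke = 179

proportion = stroke / (stroke + no_stroke)
print(f"{proportion:.4f} ({proportion:.0%})")  # 0.2009 (20%)
```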

### Exercise 1.1 - Migraine and acupuncture, Part I

* (a) Around 23.26% of the patients in the treatment group, who received acupuncture, were pain free after 24 hours.
* (b) In the control group, around 4.35% were pain free after 24 hours.
* (c) The treatment group has the higher percentage of pain-free patients.
* (d) The sample might not be representative of the whole population of migraine sufferers. Bad sampling is only one possible issue, though: the observed difference could also be due to chance.


### Exercise 1.2 - Sinusitis and antibiotics, Part I
* (a) Around 77.65% of patients in the treatment group reported improvements in symptoms.
* (b) Around 80.25% of patients in the control group reported improvements in symptoms.
* (c) We have a slightly greater percentage in the control group.
* (d) In this sample, the control group shows the higher percentage of improvement. However, the difference is so small that it could be due to the random fluctuations that are normal in this kind of study. From this sample alone, we can't conclude anything definite.
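
One way to get a feel for "random fluctuations" is a permutation sketch: shuffle the group labels many times and see how often a difference at least as large as the observed one appears by chance. The counts below are an assumption, chosen because they reproduce the percentages above (66 of 85 and 65 of 81); the function name and the number of shuffles are arbitrary.

```python
import random

# Assumed counts, consistent with the percentages above:
# 66 of 85 treated patients improved (77.65%), 65 of 81 controls improved (80.25%).
treatment = [1] * 66 + [0] * 19
control = [1] * 65 + [0] * 16

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Share of random relabelings with a difference at least as extreme as observed."""
    rng = random.Random(seed)
    pooled = group_a + group_b
    target = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        diff = sum(a) / len(a) - sum(b) / len(b)
        # Small tolerance so that ties with the observed value are counted.
        if abs(diff) >= target - 1e-12:
            extreme += 1
    return extreme / n_shuffles

p_value = permutation_p_value(treatment, control)
print(f"observed difference: {observed:.4f}, permutation p-value: {p_value:.3f}")
```

A large p-value here would be in line with the answer to (d): a gap of this size shows up often even when the labels carry no information.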


### Guided Practice 1.2
The grade of the first loan (as shown in the book) is __A__. The home ownership is __rent__.

### Guided Practice 1.3
A feasible organization of grades could be the following:

| Name            | Description                                |
|-----------------|--------------------------------------------|
| `student_name`  | The student name                           |
| `homework_type` | The type (can be assignment, quiz or exam) |
| `class`         | The class to which the grade refers        |
| `grade`         | The actual grade                           |

It is not exhaustive but it gets the job done.
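
A minimal sketch of this data matrix as a `pandas` DataFrame follows. Only the column names come from the table above; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the table above.
grades = pd.DataFrame(
    [
        {"student_name": "Alice", "homework_type": "quiz", "class": "Statistics", "grade": 8.5},
        {"student_name": "Alice", "homework_type": "exam", "class": "Statistics", "grade": 7.0},
        {"student_name": "Bob", "homework_type": "assignment", "class": "Algebra", "grade": 9.0},
    ]
)

print(grades)

# Each row is one graded piece of work, so per-student summaries are a group-by away.
mean_grade = grades.groupby("student_name")["grade"].mean()
print(mean_grade)
```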

### Guided Practice 1.4

We can set up a data matrix such as:

| Name                                               | Description                                |
|----------------------------------------------------|--------------------------------------------|
| `county`                                           | The county name                            |
| `state`                                            | The state in which it is located           |
| `population_in_2017`                               | The county population in 2017              |
| `population_change_2010_2017`                      | The population change from 2010 to 2017    |
| `poverty`                                          | Poverty index                              |
| `etc...`                                           | The additional six characteristics         |


### Guided Practice 1.6
The variable `group` is categorical, while the variable `num_migraines` is discrete numerical.

### Guided Practice 1.7
In order to create questions, we need to see the data matrix for the dataset `loan50`. To do so we import it with `pandas`. Let's start by importing all the relevant data analysis libraries.

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path

Now let's read the csv file which contains our dataset.

In [3]:
datasets_folder = Path("../datasets/")
loan50_file = datasets_folder / "loan50.csv"

loan50_df = pd.read_csv(loan50_file)

Let's get an idea about the data by showing the first 10 rows.

In [4]:
display(loan50_df.head(10))

Unnamed: 0,state,emp_length,term,homeownership,annual_income,verified_income,debt_to_income,total_credit_limit,total_credit_utilized,num_cc_carrying_balance,loan_purpose,loan_amount,grade,interest_rate,public_record_bankrupt,loan_status,has_second_income,total_income
0,NJ,3.0,60,rent,59000,Not Verified,0.557525,95131,32894,8,debt_consolidation,22000,B,10.9,0,Current,False,59000
1,CA,10.0,36,rent,60000,Not Verified,1.305683,51929,78341,2,credit_card,6000,B,9.92,1,Current,False,60000
2,SC,,36,mortgage,75000,Verified,1.05628,301373,79221,14,debt_consolidation,25000,E,26.3,0,Current,False,75000
3,CA,0.0,36,rent,75000,Not Verified,0.574347,59890,43076,10,credit_card,6000,B,9.92,0,Current,False,75000
4,OH,4.0,60,mortgage,254000,Not Verified,0.23815,422619,60490,2,home_improvement,25000,B,9.43,0,Current,False,254000
5,IN,6.0,36,mortgage,67000,Source Verified,1.077045,349825,72162,4,home_improvement,6400,B,9.92,0,Current,False,67000
6,NY,2.0,36,rent,28800,Source Verified,0.099722,15980,2872,1,debt_consolidation,3000,D,17.09,0,Current,False,28800
7,MO,10.0,36,mortgage,80000,Not Verified,0.350913,258439,28073,3,credit_card,14500,A,6.08,0,Current,False,80000
8,FL,6.0,60,rent,34000,Not Verified,0.6975,87705,23715,10,credit_card,10000,A,7.97,0,Current,False,34000
9,FL,3.0,60,mortgage,80000,Source Verified,0.166854,330394,32036,4,debt_consolidation,18500,C,12.62,1,Current,True,192000
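
The `Unnamed: 0` column in the output suggests that the CSV file stores the row index as an unnamed leading column. Assuming that is the case, passing `index_col=0` to `pd.read_csv` uses it as the index instead. The sketch below rebuilds a few of the rows above as in-memory CSV text (a subset of the columns, to stay short) rather than reading the real file.

```python
import io

import pandas as pd

# First rows of the output above, rebuilt as CSV text with an unnamed leading
# column. Only a subset of the columns is kept.
csv_text = """,state,emp_length,term,homeownership,annual_income
0,NJ,3.0,60,rent,59000
1,CA,10.0,36,rent,60000
2,SC,,36,mortgage,75000
"""

default_df = pd.read_csv(io.StringIO(csv_text))
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())  # includes 'Unnamed: 0'
print(indexed_df.columns.tolist())  # the leading column is now the index
```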


The questions that I would ask are:
* Is there an association between `annual_income` and/or `total_income` and `homeownership`? 
* How does `loan_amount` affect `interest_rate`?

The first question comes from personal experience: people on a low income (either annual or total) usually rent rather than own their homes (unless, of course, they inherited a house from a family member).

The second question comes from the intuition that the amount of the loan somehow influences the interest rate.
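
A first look at both questions could be sketched as follows. To stay self-contained, the snippet retypes the ten rows displayed above (only the four relevant columns) instead of reading the file. Ten rows are far too few to conclude anything, and a correlation does not show that one variable affects the other.

```python
import pandas as pd

# The ten rows displayed above, restricted to the columns the two questions use.
sample = pd.DataFrame(
    {
        "homeownership": ["rent", "rent", "mortgage", "rent", "mortgage",
                          "mortgage", "rent", "mortgage", "rent", "mortgage"],
        "annual_income": [59000, 60000, 75000, 75000, 254000,
                          67000, 28800, 80000, 34000, 80000],
        "loan_amount": [22000, 6000, 25000, 6000, 25000,
                        6400, 3000, 14500, 10000, 18500],
        "interest_rate": [10.9, 9.92, 26.3, 9.92, 9.43,
                          9.92, 17.09, 6.08, 7.97, 12.62],
    }
)

# Question 1: compare incomes across home ownership categories.
median_income = sample.groupby("homeownership")["annual_income"].median()
print(median_income)

# Question 2: linear association between loan amount and interest rate.
correlation = sample["loan_amount"].corr(sample["interest_rate"])
print(f"Pearson correlation: {correlation:.2f}")
```

On the full dataset the same two lines would apply to `loan50_df` directly.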

### Exercise 1.3 - Air pollution and birth outcomes, study components.
* (a) The research question of the study could be: do certain levels of air pollutants cause preterm births?
* (b) The subjects were __143,196 births__ between the years 1989 and 1993.
* (c) The continuous explanatory variables in the study are the levels of CO, nitrogen dioxide, ozone and PM10 to which the subjects were exposed, calculated during gestation. There is also a discrete explanatory variable: the year in which the observation was collected. The response variable is whether or not a preterm birth happened, which is a categorical variable. If we instead recorded how many weeks early the birth occurred, that variable would be numerical (discrete if counted in whole weeks).

### Exercise 1.4 - Buteyko method, study components.

* (a) The research question is whether the Buteyko method reduces asthma symptoms and improves quality of life.
* (b) The subjects were 600 asthma patients.
* (c) There are multiple response variables, since the effectiveness of the method is tested on multiple outcomes, each rated on a scale from 1 to 10: this makes the response variables ordinal categorical. The explanatory variable is the categorical variable that tells us whether or not the patient practiced the method.

### Exercise 1.5 - Cheaters, study components.

* (a) The research question, as always, can be deduced from the purpose of the study: what is the relationship between a child's age, whether or not they received explicit instructions, and their honesty?
* (b) Subjects of the experiment are 160 children aged from 5 to 15.
* (c) Recorded variables are age (discrete numerical), sex (nominal categorical) and whether or not the child is an only child (nominal categorical). An additional recorded variable is the outcome of the fair coin toss, which is also a categorical variable.

### Exercise 1.6 - Stealers, study components.

* (a) The main research question is: given the socio-economic class, how likely is the individual to behave unethically?
* (b) The subjects are 129 undergraduates from Berkeley.
* (c) The explanatory variables recorded are the three ordinal categorical variables that state their money, education and job tier/profile. The response variable is the number of candies they took after the survey (which is a discrete numerical variable).

### Exercise 1.7 - Migraine and acupuncture, Part II.

The explanatory variable recorded is the kind of acupuncture people received (migraine-specific acupuncture in the treatment group, placebo acupuncture in the control group). The response variable is the indicator variable which states whether or not they were pain free 24 hours after the treatment.

### Exercise 1.8 - Sinusitis and antibiotics, Part II.

The explanatory variable is the kind of treatment the patients received. The response variable is whether they had improvements in symptoms or not.

### Exercise 1.9 - Fisher’s irises.

* (a) There are $50 \times 3 = 150$ cases (observations).
* (b) The numerical variables included are sepal length, sepal width, petal length and petal width, which are continuous numerical.
* (c) The only categorical variable is the species of iris flower, which has three levels: _setosa_, _versicolor_ and _virginica_.

### Exercise 1.10 - Smoking habits of UK residents.

* (a) Each row represents an observation: the personal details and smoking habits of one UK resident taking part in this study.
* (b) The survey included 1691 participants.
* (c) The variables are: _sex_ (nominal categorical), _age_ (discrete numerical), _marital_ (nominal categorical), _grossIncome_ (ordinal categorical), _smoke_ (nominal categorical), _amtWeekends_ (discrete numerical) and _amtWeekdays_ (discrete numerical).

### Exercise 1.11 - US Airports.

* (a) Variables used are _latitude_ and _longitude_ (continuous numerical), an indicator variable for whether the airport is for private or public use, and an indicator variable for whether the ownership of the airport itself is private or public (both nominal categorical).
* (b) Answered above.

### Exercise 1.12 - UN Votes.

* (a) The variables included are: _issue_ (nominal categorical), _% of yes_ (continuous numerical), _country_ (nominal categorical) and _year_ (discrete numerical).
* (b) Answered above.

### Guided Practice 1.9

In (2) the target population is all Duke undergrads in the past 5 years. An individual case can contain all of a student's information and the number of years it took them to complete their degree. In (3) the target population is all people with severe heart disease, and an individual case might contain all of a person's information, including the severity of their heart disease.

### Guided Practice 1.11

Online reviews are an example of a convenience sample: first, they are written by people who care about evaluating their experience, and second, they usually come from people who have a complaint or who are astonished by the product. The average, indifferent user, who might be more common than we think, may not be included.

### Guided Practice 1.12

The answer is no: we are only observing an apparent causal relationship, without running an experiment in which we separate participants into groups and, most importantly, select them carefully.

### Guided Practice 1.13

One confounding variable can be _salary_: people with higher salaries tend to own their homes, and people who buy a home usually do not buy one in a multi-unit structure.

### Exercise 1.13 - Air pollution and birth outcomes, scope of inference.

* (a) Since the study explicitly analyzes the relationship between air pollutants and preterm births in Southern California, I would argue that the population of interest is all births in Southern California. The sample consists of 143,196 births between 1989 and 1993.
* (b) The results can probably be generalized to the population, but it would help to sample from other years and check whether we get the same results.

### Exercise 1.14 - Cheaters, scope of inference.

* (a) The population of interest is all children aged from 5 to 15.
* (b) Unlike the previous exercise, here we have a smaller sample (160 children). However, we saw certain differences among the groups, as well as among the children's characteristics. Therefore, we can cautiously generalize, but to be safer, more samples have to be taken.

### Exercise 1.15 - Buteyko method, scope of inference.

* (a) The population of interest is asthma patients (potentially all of them).
* (b) It is never safe to generalize from a single experiment. Usually we need to repeat the same experiment on more samples, and only once we see consistent results can we generalize. Therefore, I am not sure we can generalize this finding, especially in a medical context.

### Exercise 1.16 - Stealers, scope of inference.

* (a) The population of interest is basically the world population since we are studying the relationship between the socio-economic class the individual belongs to and unethical behavior.
* (b) It cannot be generalized, first of all because we are using a convenience sample recruited in one place (Berkeley): Berkeley students are not representative of the world population. Another weakness is the way the study was designed: a causal relationship cannot be established.

### Exercise 1.17 - Relaxing after work.

* (a) An observation.
* (b) A variable.
* (c) A sample statistic.
* (d) A population parameter.

### Exercise 1.18 - Cats on YouTube.

* (a) A population parameter.
* (b) A sample statistic.
* (c) An observation.
* (d) A variable.

### Exercise 1.19 - Course satisfaction across sections.

* (a) This seems to be an observational study. Depending on whether the surveys are taken repeatedly while the sections are running or once after the fact, the study can be prospective or retrospective.
* (b) To evaluate a particular section, the professor can sample students who previously attended that section but have since moved on to another one: this way students never evaluate their current section, which makes their judgement less biased. This gives a fully retrospective study, since the section the students are evaluating has already taken place. Such a sampling strategy can be either cluster sampling or, if we sample within the clusters, multistage sampling.

### Exercise 1.20 - Housing proposal across dorms.

* (a) This seems to be a retrospective study, since the data is collected at a point when students have probably lived in the dorms for a while.
* (b) They should definitely use stratified sampling to get a fair survey.
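
Stratified sampling here means treating each dorm as a stratum and drawing a simple random sample within every one of them. A minimal sketch using only the standard library follows; the dorm names, roster sizes and the `stratified_sample` helper are all hypothetical.

```python
import random

def stratified_sample(students_by_dorm, per_dorm, seed=0):
    """Draw a simple random sample of `per_dorm` students from every dorm (stratum)."""
    rng = random.Random(seed)
    return {
        dorm: rng.sample(students, min(per_dorm, len(students)))
        for dorm, students in students_by_dorm.items()
    }

# Hypothetical roster: dorm names and student IDs are made up for illustration.
roster = {
    "North": [f"N{i}" for i in range(120)],
    "South": [f"S{i}" for i in range(80)],
    "East": [f"E{i}" for i in range(40)],
}

sample = stratified_sample(roster, per_dorm=10)
for dorm, chosen in sample.items():
    print(dorm, len(chosen))
```

Drawing the same number from each dorm is one design choice; sampling proportionally to dorm size is another, depending on whether dorm-level or campus-level estimates matter more.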

### Exercise 1.21 - Internet use and life expectancy.

* (a) Apparently, the more the internet is used in the world, the longer people live.
* (b) This is an observational study.
* (c) The percentage of internet users has increased over time. And over time, we have developed better medicine. So a potential confounding variable can be related to medicine and quality of life.

### Exercise 1.22 - Stressed out, Part I. 

* (a) This is an observational study.
* (b) There definitely can be truth in the fact that increased stress favors muscle cramps, though such a causal relationship is hard to prove without biological evidence. Furthermore, we have seen that stress makes some behaviors arise.
* (c) Drinking coffee and sleeping less might be confounding since they may biologically promote cramps.

### Exercise 1.23 - Evaluate sampling methods.

Point (a) is the least reasonable. Point (b) can be good but the field of study might induce bias due to the fact that some university courses are attended by a certain socio-economic class. Point (c) is reasonable since we are sampling according to age.

### Exercise 1.24 - Random digit dialing.

A possible reason can be an increase in randomness. Using a phone book we might get people in the same area, and we might need more samples to truly achieve randomness.

### Exercise 1.25 - Haters are gonna hate, study confirms.

* (a) Cases are the 200 randomly sampled individuals.
* (b) The response variable is the reaction towards the oven.
* (c) The explanatory variable is the participants' reaction to the topics used in the dispositional attitude measurement.
* (d) This study makes use of random sampling.
* (e) This is an observational study: we are just observing how, given the first reactions, the individual will react to the oven. We are not making any assumption and we are not controlling variables/separating individuals.
* (f) We can't generalize, as one confounding variable can be the mood, or other things this study factors out.
* (g) A group of 200 is not indicative of the whole population.

### Exercise 1.26 - Family size.

Elementary school kids still live with their parents, and a family with more children is more likely to have one of them in the sample. This biases the result towards larger families, so the family size will be overestimated.
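
The direction of this bias can be illustrated with a small calculation. As a simplification, the sketch uses the number of school-age children per household as a stand-in for family size, and the distribution of households is entirely hypothetical.

```python
# Hypothetical distribution: number of households by number of school-age children.
households_by_children = {0: 30, 1: 30, 2: 25, 3: 10, 4: 5}

total_households = sum(households_by_children.values())
total_children = sum(k * n for k, n in households_by_children.items())

# Average over households: every household counts once.
mean_per_household = total_children / total_households

# Average over children: a household with k children is reported k times,
# and a household with no children is never reported at all.
mean_reported_by_children = (
    sum(k * k * n for k, n in households_by_children.items()) / total_children
)

print(f"per household:        {mean_per_household:.2f}")
print(f"reported by children: {mean_reported_by_children:.2f}")
```

Asking the children gives a clearly larger average than counting each household once, which is the overestimation described above.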

### Exercise 1.27 - Sampling strategies.

* (a) Random sampling. We can expect a mean which is not representative of the population mean.
* (b) Giving it only to his friends is not a proper sampling method, and it will induce a great bias because the friends might have similar habits / patterns of usage of social networks.
* (c) This is a convenience sample: only Facebook users will be able to participate. This might also bias the result, since Facebook is itself a social network and those who do not use it will not be included.
* (d) This is multistage sampling. This is the least biased method presented.

### Exercise 1.28 - Reading the paper.
* (a) We can't conclude that smoking causes dementia because 25% is not a significant percentage and therefore other factors should be taken into account.
* (b) The statement is not justified, since sleeping disorders can be caused by the same reasons that cause behavioral disorders, including stress and psychological issues. These are indeed confounding variables.

### Guided Practice 1.16

It is an experiment, but it is definitely not blinded, since patients know they got the stents.

### Guided Practice 1.17

Stents are invasive. You cannot just replicate them with a placebo: the risks will outweigh the benefits.

__NOTE__: see __Sham Surgery__.

### Exercise 1.29 - Light and exam performance.

* (a) The response variable is the exam(s) performance.
* (b) The explanatory variable is the type of lighting which can hold three levels (fluorescent overhead lighting, yellow overhead lighting and no overhead lighting which is desk lamp).
* (c) Gender is the blocking variable, since the lighting might have a different effect on males and females; we therefore make two blocks (males and females) from which we draw our random samples.
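
Blocking on gender can be sketched as randomizing the three lighting treatments separately inside each block. The helper below is a hypothetical illustration with made-up participant identifiers, not the procedure used in the study.

```python
import random

TREATMENTS = ["fluorescent overhead", "yellow overhead", "desk lamp only"]

def blocked_assignment(participants, treatments=TREATMENTS, seed=0):
    """Randomly assign treatments within each block so every block is spread evenly."""
    rng = random.Random(seed)
    blocks = {}
    for name, block in participants:
        blocks.setdefault(block, []).append(name)

    assignment = {}
    for block, members in blocks.items():
        rng.shuffle(members)
        for position, name in enumerate(members):
            # Cycling through the shuffled members balances group sizes per block.
            assignment[name] = (block, treatments[position % len(treatments)])
    return assignment

# Hypothetical participants: (identifier, gender block).
participants = [(f"F{i}", "female") for i in range(6)] + [(f"M{i}", "male") for i in range(6)]
for name, (block, treatment) in sorted(blocked_assignment(participants).items()):
    print(name, block, treatment)
```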

### Exercise 1.30 - Vitamin supplements.

* (a) It is an experiment since we are randomizing the participants into the four treatment groups, and we are testing an actual hypothesis (Vitamin C reduces the duration of cold symptoms).
* (b) The explanatory variable is the kind of pill the patients are prescribed (placebo pill, 1g pill, 3g pill, 3g pill with additives).
* (c) The patients were blinded.
* (d) Study is double-blind since researchers do not interact with patients. Nurses do, but nurses here do not play any role in the research.
* (e) It may definitely introduce a confounding variable, because if people are not taking the pill, any potential effect of that pill on them will not occur, so we may reach the wrong conclusions. One option is to exclude those who do not take the pill; another is to have the pill administered by a nurse (who does not need to be blinded, since nurses do not play a significant role in the research).

### Exercise 1.31 - Light, noise, and exam performance.

* (a) This is an experiment.
* (b) In this study, we have the light factor (as in Exercise 1.29), and the noise factor which has three levels: no noise, construction noise and human chatter noise.
* (c) The sex variable is a blocking variable, since males and females might respond differently to light and noise.

### Exercise 1.32 - Music and learning.

I would randomly select students from elementary school, high school and university/college. From the previous exercise, we have sex as a blocking variable, so we make two initial blocks. Another blocking variable might be performance level, categorized in discrete levels, since students with higher grades might pay better attention. If we include people with learning disorders, we have to make the appropriate blocks. We also want a stratified sample, since we want to include students from different university courses. The sample size can vary from 500 to 1000, but given the broad scope of the study we might need to recruit even more participants. Of course, I will not tell them the purpose of the study, so they will be blinded. Music will be played in a way they might not notice (I might have them study in a coffee shop with background music with or without lyrics). They have to be unaware that they are being studied to understand how music impacts their learning experience. Researchers who are unaware of the treatment groups will assess each student's knowledge.

### Exercise 1.33 - Soda preference.

In this case, sample size can be one third or one fourth of the class. Gender is a blocking variable so we have to equally include both males and females. In order for our experiment to have more benefits than risks, we might exclude students with diabetes. Whether or not coke has sugar is not shown in the can/glass. Instead, they will have two letters: __A__ and __B__. Letter __A__ can represent the diet coke while __B__ can represent standard coke and vice versa, i.e. the letter is not tied to one or the other type of coke. Through a randomized Python script I will bind each student to the __A/B__ combination, and for each one of them the combination is different and randomly chosen.  

In [11]:
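import random

def generate_combination():
    """Randomly map the glass letters A and B to the two soda types."""
    soda_type = ["diet", "standard"]
    glass_letter = ["A", "B"]

    # Shuffle so that neither letter is tied to a fixed type of coke.
    random.shuffle(soda_type)

    return dict(zip(glass_letter, soda_type))

def bind_participants_to_combinations(participants):
    """Give every participant an independently drawn letter-to-soda mapping."""
    return {participant: generate_combination() for participant in participants}

# Example draw; the result changes from run to run.
generate_combination()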

{'A': 'standard', 'B': 'diet'}


I can know the combinations without influencing anything. I will hand out the two glasses and record the students' preferences. I will then make a chart to visually describe my findings.

### Exercise 1.34 - Exercise and mental health.

* (a) This is an experiment since we are assigning the treatment and control groups which will receive or not the prescription to exercise.
* (b) The treatment group is the one that gets exercise, while the control group is the one that does not.
* (c) There is no blocking.
* (d) There's no blinding mechanism, since those who are told not to exercise do not receive any placebo that could make them unaware of belonging to the control group.
* (e) From one experiment, it is impossible to generalize fully and confidently to the whole population, but we can certainly keep running more well-designed experiments. We can infer that there might be a causal relationship, though more certainty can only come from further experiments that are blind or, even better, double-blind. In this way, we avoid emotional or personal bias.
* (f) If we can make this experiment blind or double-blind, then I would definitely fund it. Having a second researcher do the actual assessment, even better if this researcher is skeptical about the study, will make it a study worth funding.