# Chapter 1: Introduction to Data - Walkthrough

#### Walkthrough of the chapter's *Guided Practice* and exercises.

### Guided Practice 1.1
The proportion of patients in the treatment group who had a stroke by the end of their first year can be calculated as
\begin{equation}
\frac{45}{45 + 179} = \frac{45}{224} \approx 0.20 = 20\%
\end{equation}
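
The same arithmetic can be checked in a couple of lines of Python. This is only a sketch: the variable names are illustrative, and reading 179 as the number of treatment patients without a stroke is an assumption based on the formula above.

```python
# Counts taken from the formula above; the names are illustrative.
stroke = 45        # treatment patients who had a stroke within one year
no_stroke = 179    # assumed: treatment patients who did not

proportion = stroke / (stroke + no_stroke)
print(f"{proportion:.4f}")  # 0.2009, which rounds to 20%
```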

### Exercise 1.1 - Migraine and acupuncture, Part I

* (a) Around 23.26% of the patients in the treatment group, who received acupuncture, were pain free after 24 hours.
* (b) In the control group, around 4.35% were pain free after 24 hours.
* (c) The treatment group has the higher percentage of pain-free patients.
* (d) The sample might not be representative of the whole population of people who suffer from migraines. Poor sampling could be an issue, but it might not be the only one.
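
The percentages in (a) and (b) can be reproduced with a short Python sketch. The counts are not shown in this walkthrough; they are assumed here from the table given with the exercise in the book (10 of 43 pain free in the treatment group, 2 of 46 in the control group), and the variable names are illustrative.

```python
# Counts assumed from the exercise's table in the book (not shown in this walkthrough).
treatment_pain_free, treatment_total = 10, 43
control_pain_free, control_total = 2, 46

treatment_pct = 100 * treatment_pain_free / treatment_total
control_pct = 100 * control_pain_free / control_total

print(f"treatment: {treatment_pct:.2f}%")  # 23.26%
print(f"control:   {control_pct:.2f}%")    # 4.35%
```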


### Exercise 1.2 - Sinusitis and antibiotics, Part I
* (a) Around 77.65% of patients in the treatment group reported improvements in symptoms.
* (b) Around 80.25% of patients in the control group reported improvements in symptoms.
* (c) The control group has a slightly higher percentage.
* (d) In this sample, the control group shows a higher percentage of improvement. However, the difference is so small that it could come from the random fluctuations that are normal in these kinds of studies. From this sample alone, we can't draw any firm conclusion.
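
One way to get a feel for "random fluctuations" is to shuffle the outcomes between the two groups many times and see how often a difference at least as large as the observed one appears by chance. The sketch below is not part of the exercise and is only a rough illustration. The counts (66 of 85 improved in treatment, 65 of 81 in control) are assumed from the book's table and are consistent with the percentages above. The helper `diff_in_proportions` and the choice of 5000 shuffles are hypothetical.

```python
import numpy as np

# Counts assumed from the exercise's table in the book (not shown in this walkthrough).
treatment_improved, treatment_total = 66, 85
control_improved, control_total = 65, 81


def diff_in_proportions(outcomes, n_treatment):
    """Control proportion minus treatment proportion for a 0/1 outcome array."""
    treatment = outcomes[:n_treatment]
    control = outcomes[n_treatment:]
    return control.mean() - treatment.mean()


# 1 = improved, 0 = did not improve; treatment patients come first.
outcomes = np.concatenate([
    np.ones(treatment_improved),
    np.zeros(treatment_total - treatment_improved),
    np.ones(control_improved),
    np.zeros(control_total - control_improved),
])
observed = diff_in_proportions(outcomes, treatment_total)

# Reassign outcomes to groups at random and recompute the difference each time.
rng = np.random.default_rng(0)
n_shuffles = 5000
shuffled = np.array([
    diff_in_proportions(rng.permutation(outcomes), treatment_total)
    for _ in range(n_shuffles)
])

# Share of shuffles whose difference is at least as large (in absolute value).
share_as_large = np.mean(np.abs(shuffled) >= abs(observed) - 1e-12)
print(f"observed difference: {observed:.4f}")
print(f"share of shuffles at least as large: {share_as_large:.2f}")
```

If a large share of the shuffles produces a difference this big, the observed gap is the kind of thing chance alone produces easily, which supports the cautious reading in (d).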


### Guided Practice 1.2
The grade of the first loan (as shown in the book) is __A__. The home ownership is __rent__.

### Guided Practice 1.3
A feasible organization of grades could be the following:

| Variable        | Description                                |
|-----------------|--------------------------------------------|
| `student_name`  | The student name                           |
| `homework_type` | The type (can be assignment, quiz or exam) |
| `class`         | The class to which the grade refers        |
| `grade`         | The actual grade                           |

It is not exhaustive but it gets the job done.
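
As a sketch of how this organization could look as a data matrix in `pandas`, with one row per graded item, consider the snippet below. The student names, classes, and grades are made up purely for illustration.

```python
import pandas as pd

# Hypothetical rows: each row is one graded item for one student.
grades_df = pd.DataFrame(
    {
        "student_name": ["Alice", "Alice", "Bob"],
        "homework_type": ["assignment", "quiz", "exam"],
        "class": ["Statistics", "Statistics", "Algebra"],
        "grade": [28, 9, 25],
    }
)
print(grades_df)
```

Since `class` is a reserved word in Python, that column has to be accessed as `grades_df["class"]` rather than as an attribute.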

### Guided Practice 1.4

We can set up a data matrix such as:

| Variable                                           | Description                                |
|----------------------------------------------------|--------------------------------------------|
| `county`                                           | The county name                            |
| `state`                                            | The state in which it is located           |
| `population_in_2017`                               | The county population in 2017              |
| `population_change_2010_2017`                      | The population change from 2010 to 2017    |
| `poverty`                                          | Poverty index                              |
| `etc...`                                           | The additional six characteristics         |


### Guided Practice 1.6
The variable `group` is categorical, while the variable `num_migraines` is discrete numerical.
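
In `pandas`, this distinction could be mirrored by storing `group` as a categorical column and `num_migraines` as an integer column. The values below are invented for illustration and are not taken from the study.

```python
import pandas as pd

# Invented values, only to show the two variable types side by side.
migraine_df = pd.DataFrame(
    {
        "group": pd.Categorical(["treatment", "control", "treatment", "control"]),
        "num_migraines": [0, 3, 1, 2],
    }
)

print(migraine_df.dtypes)
print(migraine_df["group"].cat.categories.tolist())
```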

### Guided Practice 1.7
In order to create questions, we need to see the data matrix for the dataset `loan50`. To do so we import it with `pandas`. Let's start by importing all the relevant data analysis libraries.

In [8]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path

Now let's read the csv file which contains our dataset.

In [9]:
datasets_folder = Path("../datasets/")
loan50_file = datasets_folder / "loan50.csv"

loan50_df = pd.read_csv(loan50_file)

Let's get an idea about the data by showing the first 10 rows.

In [13]:
loan50_df.head(10)

| Unnamed: 0 | state | emp_length | term | homeownership | annual_income | verified_income | debt_to_income | total_credit_limit | total_credit_utilized | num_cc_carrying_balance | loan_purpose | loan_amount | grade | interest_rate | public_record_bankrupt | loan_status | has_second_income | total_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NJ | 3.0 | 60 | rent | 59000 | Not Verified | 0.557525 | 95131 | 32894 | 8 | debt_consolidation | 22000 | B | 10.9 | 0 | Current | False | 59000 |
| 1 | CA | 10.0 | 36 | rent | 60000 | Not Verified | 1.305683 | 51929 | 78341 | 2 | credit_card | 6000 | B | 9.92 | 1 | Current | False | 60000 |
| 2 | SC |  | 36 | mortgage | 75000 | Verified | 1.05628 | 301373 | 79221 | 14 | debt_consolidation | 25000 | E | 26.3 | 0 | Current | False | 75000 |
| 3 | CA | 0.0 | 36 | rent | 75000 | Not Verified | 0.574347 | 59890 | 43076 | 10 | credit_card | 6000 | B | 9.92 | 0 | Current | False | 75000 |
| 4 | OH | 4.0 | 60 | mortgage | 254000 | Not Verified | 0.23815 | 422619 | 60490 | 2 | home_improvement | 25000 | B | 9.43 | 0 | Current | False | 254000 |
| 5 | IN | 6.0 | 36 | mortgage | 67000 | Source Verified | 1.077045 | 349825 | 72162 | 4 | home_improvement | 6400 | B | 9.92 | 0 | Current | False | 67000 |
| 6 | NY | 2.0 | 36 | rent | 28800 | Source Verified | 0.099722 | 15980 | 2872 | 1 | debt_consolidation | 3000 | D | 17.09 | 0 | Current | False | 28800 |
| 7 | MO | 10.0 | 36 | mortgage | 80000 | Not Verified | 0.350913 | 258439 | 28073 | 3 | credit_card | 14500 | A | 6.08 | 0 | Current | False | 80000 |
| 8 | FL | 6.0 | 60 | rent | 34000 | Not Verified | 0.6975 | 87705 | 23715 | 10 | credit_card | 10000 | A | 7.97 | 0 | Current | False | 34000 |
| 9 | FL | 3.0 | 60 | mortgage | 80000 | Source Verified | 0.166854 | 330394 | 32036 | 4 | debt_consolidation | 18500 | C | 12.62 | 1 | Current | True | 192000 |


The questions that I would ask are:
* Is there an association between `annual_income` and/or `total_income` and `homeownership`? 
* How does `loan_amount` affect `interest_rate`?

The first question comes from personal experience: people on a low income (either annual or total) usually rent rather than own their homes (unless, of course, they inherited a house from a family member).

The second question comes from the intuition that the amount of the loan somehow influences the interest rate.
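
A first, very rough look at both questions could be sketched as follows. To keep the snippet self-contained, it rebuilds only the relevant columns from the ten rows displayed above. The names `sample_df`, `median_income`, and `corr` are illustrative, and the same calls should work on the full `loan50_df`. Ten rows are far too few to conclude anything, and a correlation would not by itself show that the loan amount *affects* the interest rate.

```python
import pandas as pd

# Relevant columns of the ten rows shown by loan50_df.head(10).
sample_df = pd.DataFrame(
    {
        "homeownership": ["rent", "rent", "mortgage", "rent", "mortgage",
                          "mortgage", "rent", "mortgage", "rent", "mortgage"],
        "annual_income": [59000, 60000, 75000, 75000, 254000,
                          67000, 28800, 80000, 34000, 80000],
        "loan_amount": [22000, 6000, 25000, 6000, 25000,
                        6400, 3000, 14500, 10000, 18500],
        "interest_rate": [10.9, 9.92, 26.3, 9.92, 9.43,
                          9.92, 17.09, 6.08, 7.97, 12.62],
    }
)

# Question 1: compare typical income across home ownership categories.
median_income = sample_df.groupby("homeownership")["annual_income"].median()
print(median_income)

# Question 2: linear association between loan amount and interest rate.
corr = sample_df["loan_amount"].corr(sample_df["interest_rate"])
print(f"correlation: {corr:.2f}")
```

The median is used rather than the mean because a single very high income (254000 in this sample) would otherwise dominate the comparison.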

### Exercise 1.3
* (a) The research question of the study could be: do certain levels of air pollutants cause preterm births?
* (b) The subjects were __143,196 births__ between the years 1989 and 1993.
* (c) The continuous explanatory variables are the levels of CO, nitrogen dioxide, ozone, and PM10 that subjects were exposed to, calculated over the gestation period. There is also a discrete explanatory variable: the year in which the observation was collected. The response variable is whether or not a preterm birth occurred, which is a categorical variable. If we instead wanted to predict, say, how many weeks early the birth occurred, the response would be a numerical variable (discrete, if counted in whole weeks).

### Exercise 1.4

* (a) The research question is whether the Buteyko method reduces asthma symptoms and improves quality of life.
* (b) The subjects were 600 asthma patients.
* (c) There are multiple response variables, because the method's effectiveness is tested on several outcomes, each rated on a scale from 1 to 10: this makes the response variables ordinal categorical. The explanatory variable is the categorical variable that tells us whether or not the patient practiced the method.