
# Rows and the Observational Unit

Recall that the rows of a tabular data set represent observations. Whenever you encounter a new (tabular) data set, the first question you should ask yourself is

> "What is the observational unit?"

In other words, what does each row of the `DataFrame` represent? In the case of the OKCupid data set in the previous section, the observational unit was clearly an OKCupid user. But it is not always so obvious what the observational unit is.

For example, consider the Framingham Heart Study data set, which is available at https://dlsun.github.io/pods/data/framingham_long.csv. This data comes from a study of men and women in the town of Framingham, Massachusetts, which has enrolled thousands of patients since it began in 1948 and is still ongoing. The goal of the study is to identify risk factors for cardiovascular disease (CVD) by following the subjects over time. The data set that we will analyze was collected on 4,434 subjects between 1956 and 1968. A description of the data set is available [here](https://biolincc.nhlbi.nih.gov/media/teachingstudies/FHS_Teaching_Longitudinal_Data_Documentation.pdf).

You might guess that the observational unit is a subject. Let's see if that guess is correct.

In [None]:
import pandas as pd

data_dir = "https://dlsun.github.io/pods/data/"
df_framingham = pd.read_csv(data_dir + "framingham_long.csv")
df_framingham.head()

,RANDID,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
0,2448,1,195.0,39,106.0,70.0,0,0.0,26.97,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
1,2448,1,209.0,52,121.0,66.0,0,0.0,,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2,6238,2,250.0,46,121.0,81.0,0,0.0,28.73,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
3,6238,2,260.0,52,105.0,69.5,0,0.0,29.43,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
4,6238,2,237.0,58,108.0,66.0,0,0.0,28.5,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766


Each `RANDID` corresponds to a unique subject in the study, but each subject appears multiple times in the data set. That is because this is a longitudinal study; each subject was measured at multiple points during their lifetime. So the observational unit in the Framingham Heart Study data set is a _measurement_ of a subject at a point in time.
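
One way to test a guess about the observational unit is to check whether the candidate identifier repeats across rows. The sketch below does this on a small hand-built frame called `toy`, a hypothetical stand-in that copies the `RANDID`, `TIME`, and `AGE` values of the first five rows of the Framingham data rather than loading the full file.

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of the Framingham data.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "AGE": [39, 52, 46, 52, 58],
})

# If each row were a subject, RANDID would never repeat.
randid_is_unique = toy["RANDID"].is_unique

# Count how many rows each subject contributes.
rows_per_subject = toy["RANDID"].value_counts()

# Check whether the (RANDID, TIME) pair identifies each row uniquely.
pair_is_unique = not toy.duplicated(subset=["RANDID", "TIME"]).any()

print(randid_is_unique)  # False
print(pair_is_unique)    # True
```

The same three checks should carry over to `df_framingham` before its index is set, since `RANDID` and `TIME` are still ordinary columns at that point.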

If there is a variable or a set of variables in the data set that uniquely identifies the observational unit, then it is customary to make those variables the index of the `DataFrame`. In the Framingham data set, `RANDID` and `TIME` together uniquely identify the observational unit, so we move these columns to the index. (Notice that we specify `inplace=True` so that `.set_index()` modifies the existing `DataFrame` rather than returning a new one.)

In [None]:
df_framingham.set_index(["RANDID", "TIME"], inplace=True)
df_framingham.head()

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
2448,0,1,195.0,39,106.0,70.0,0,0.0,26.97,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2448,4628,1,209.0,52,121.0,66.0,0,0.0,,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
6238,0,2,250.0,46,121.0,81.0,0,0.0,28.73,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6238,4344,2,237.0,58,108.0,66.0,0,0.0,28.5,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
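
As an alternative to `inplace=True`, the return value of `.set_index()` can be assigned to a new name, which leaves the original `DataFrame` untouched. The `verify_integrity=True` option additionally makes pandas raise an error if the proposed index contains duplicates. The following is a minimal sketch on a hypothetical five-row stand-in called `toy`, not on the full data set.

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of the Framingham data.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "AGE": [39, 52, 46, 52, 58],
})

# Without inplace=True, set_index returns a new DataFrame; toy is unchanged.
indexed = toy.set_index(["RANDID", "TIME"], verify_integrity=True)

# RANDID alone repeats, so verifying it as an index raises a ValueError.
try:
    toy.set_index("RANDID", verify_integrity=True)
    randid_ok = True
except ValueError:
    randid_ok = False

print(list(indexed.index.names))  # ['RANDID', 'TIME']
print(randid_ok)                  # False
```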


## Selecting Rows

We can select an individual row from a `DataFrame` using its label in the index. For example, the fourth row in the Framingham data set above has label `(6238, 2156)`. The `.loc` attribute of the `DataFrame` is used to select a row by its label. 

In [None]:
row = df_framingham.loc[(6238, 2156)]
row

SEX            2.00
TOTCHOL      260.00
AGE           52.00
SYSBP        105.00
DIABP         69.50
CURSMOKE       0.00
CIGPDAY        0.00
BMI           29.43
DIABETES       0.00
BPMEDS         0.00
HEARTRTE      80.00
GLUCOSE       86.00
educ           2.00
PREVCHD        0.00
PREVAP         0.00
PREVMI         0.00
PREVSTRK       0.00
PREVHYP        0.00
PERIOD         2.00
HDLC            NaN
LDLC            NaN
DEATH          0.00
ANGINA         0.00
HOSPMI         0.00
MI_FCHD        0.00
ANYCHD         0.00
STROKE         0.00
CVD            0.00
HYPERTEN       0.00
TIMEAP      8766.00
TIMEMI      8766.00
TIMEMIFC    8766.00
TIMECHD     8766.00
TIMESTRK    8766.00
TIMECVD     8766.00
TIMEDTH     8766.00
TIMEHYP     8766.00
Name: (6238, 2156), dtype: float64

We can also select a row by its position using the `.iloc` attribute. Keeping in mind that the first row is actually row 0, the fourth row could also be extracted as:

In [None]:
df_framingham.iloc[3]

SEX            2.00
TOTCHOL      260.00
AGE           52.00
SYSBP        105.00
DIABP         69.50
CURSMOKE       0.00
CIGPDAY        0.00
BMI           29.43
DIABETES       0.00
BPMEDS         0.00
HEARTRTE      80.00
GLUCOSE       86.00
educ           2.00
PREVCHD        0.00
PREVAP         0.00
PREVMI         0.00
PREVSTRK       0.00
PREVHYP        0.00
PERIOD         2.00
HDLC            NaN
LDLC            NaN
DEATH          0.00
ANGINA         0.00
HOSPMI         0.00
MI_FCHD        0.00
ANYCHD         0.00
STROKE         0.00
CVD            0.00
HYPERTEN       0.00
TIMEAP      8766.00
TIMEMI      8766.00
TIMEMIFC    8766.00
TIMECHD     8766.00
TIMESTRK    8766.00
TIMECVD     8766.00
TIMEDTH     8766.00
TIMEHYP     8766.00
Name: (6238, 2156), dtype: float64
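
To make the relationship between labels and positions concrete, the sketch below selects the same row both ways and confirms that the results agree. It uses a hypothetical five-row stand-in called `toy` with only two data columns; `get_loc` is the pandas index method that translates a label into its integer position.

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of the Framingham data.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "TOTCHOL": [195.0, 209.0, 250.0, 260.0, 237.0],
    "AGE": [39, 52, 46, 52, 58],
}).set_index(["RANDID", "TIME"])

by_label = toy.loc[(6238, 2156)]  # select by index label
by_position = toy.iloc[3]         # select by 0-based position

print(by_label.equals(by_position))  # True

# Translate the label into its integer position.
position = toy.index.get_loc((6238, 2156))
print(position)  # 3
```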

Notice that a single row from a `DataFrame` is no longer a `DataFrame` but a different data structure, called a `Series`.

In [None]:
type(row)

pandas.core.series.Series
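
A `Series` stores its values under a single data type, which is why integer-valued variables such as `SEX` appear as `2.00` in the row printed earlier: when a row spans integer and float columns, pandas stores the whole row as floating point. The sketch below illustrates this on a hypothetical two-row frame; the `NOTE` column is invented purely to show what happens when text is mixed in.

```python
import pandas as pd

# Hypothetical two-row frame with one integer and one float column.
toy = pd.DataFrame({
    "SEX": [1, 2],              # integer column
    "TOTCHOL": [195.0, 250.0],  # float column
})

row = toy.iloc[1]
print(toy.dtypes["SEX"])  # int64
print(row.dtype)          # float64

# Adding a text column forces a row to the general-purpose object dtype.
toy["NOTE"] = ["a", "b"]
mixed_row = toy.iloc[1]
print(mixed_row.dtype)    # object
```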

We can also select multiple rows by passing a _list_ of labels or positions to `.loc` and `.iloc`, respectively.

In [None]:
rows = df_framingham.loc[[(2448, 4628), (6238, 2156)]]
rows

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
2448,4628,1,209.0,52,121.0,66.0,0,0.0,,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766


In [None]:
df_framingham.iloc[[1, 3]]

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
2448,4628,1,209.0,52,121.0,66.0,0,0.0,,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766


Notice that when we select multiple rows, we get a `DataFrame` back.

In [None]:
type(rows)

pandas.core.frame.DataFrame

So a `Series` is used to store a single observation (across multiple variables), while a `DataFrame` is used to store multiple observations (across multiple variables).

If selecting consecutive rows, we can use Python's `slice` notation. For example, the code below selects all rows from the fourth row, up to (but not including) the tenth row.

In [None]:
df_framingham.iloc[3:9]

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6238,4344,2,237.0,58,108.0,66.0,0,0.0,28.5,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
9428,0,1,245.0,48,127.5,80.0,1,20.0,25.34,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
9428,2199,1,283.0,54,141.0,89.0,1,30.0,25.34,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
10552,0,2,225.0,61,150.0,95.0,1,30.0,28.58,0,0.0,...,1,1,2956,2956,2956,2956,2089,2089,2956,0
10552,1977,2,232.0,67,183.0,109.0,1,20.0,30.18,0,0.0,...,1,1,2956,2956,2956,2956,2089,2089,2956,0
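
Slices behave differently under `.iloc` and `.loc`: a positional slice excludes its end point, while a label slice includes both end points. The sketch below contrasts the two on a hypothetical seven-row stand-in called `toy`. The label slice assumes the index is sorted, as it is here, and the last line shows that a single first-level label selects every measurement of one subject.

```python
import pandas as pd

# Hypothetical stand-in for the first seven rows of the Framingham data.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238, 9428, 9428],
    "TIME": [0, 4628, 0, 2156, 4344, 0, 2199],
    "AGE": [39, 52, 46, 52, 58, 48, 54],
}).set_index(["RANDID", "TIME"])

# Positional slice: includes position 3, excludes position 6.
by_position = toy.iloc[3:6]
print(len(by_position))  # 3

# Label slice: includes BOTH end points (requires a sorted index).
by_label = toy.loc[(6238, 2156):(9428, 0)]
print(len(by_label))  # 3

# A single first-level label selects all measurements of one subject.
one_subject = toy.loc[6238]
print(one_subject.index.tolist())  # [0, 2156, 4344]
```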


## Exercises





1\. Suppose you want to extract just the fourth row from the Framingham data set, but you want the result to be a `DataFrame` instead of a `Series`. In other words, you want a `DataFrame` with a single row. Can you figure out how to accomplish this?

In [None]:
# YOUR CODE HERE
type(df_framingham.iloc[[3]])

pandas.core.frame.DataFrame
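
The one-element list used above is one approach; a length-one slice is another that should give the same shape of result. The sketch below compares the two on a hypothetical five-row frame called `toy`.

```python
import pandas as pd

# Hypothetical stand-in with two columns from the Framingham data.
toy = pd.DataFrame({
    "TOTCHOL": [195.0, 209.0, 250.0, 260.0, 237.0],
    "AGE": [39, 52, 46, 52, 58],
})

as_series = toy.iloc[3]     # a bare integer gives a Series
from_list = toy.iloc[[3]]   # a one-element list gives a DataFrame
from_slice = toy.iloc[3:4]  # a length-one slice gives a DataFrame too

print(type(as_series).__name__)   # Series
print(type(from_list).__name__)   # DataFrame
print(type(from_slice).__name__)  # DataFrame
print(from_list.shape)            # (1, 2)
```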

Questions 2-4 deal with the Titanic data set (https://dlsun.github.io/pods/data/titanic.csv), which contains information about 2206 passengers and crew members who were aboard the RMS Titanic when it sank on April 15, 1912.

2\. Read in the Titanic data set. What is the observational unit?

In [None]:
# YOUR CODE HERE
df_titanic = pd.read_csv('https://dlsun.github.io/pods/data/titanic.csv')
df_titanic.head()

,name,gender,age,class,embarked,country,ticketno,fare,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,S,United States,5547.0,7.11,0
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,S,United States,2673.0,20.05,0
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,S,United States,2673.0,20.05,0
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,S,England,2673.0,20.05,1
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,S,Norway,348125.0,7.13,1


3\. Which column seems to be an appropriate index for this data set? Do you see any problems with using this column as the index? (_Hint:_ Try looking up "Kelly, Mr. James" or "Green, Mr. George" in this `DataFrame`.)

In [None]:
# YOUR CODE HERE
# Option 1: make "name" the index, then look the name up by label.
# df_titanic.set_index("name", inplace=True)
# df_titanic.loc[["Kelly, Mr. James"]]
# Option 2: filter on the "name" column directly.
df_titanic[df_titanic["name"] == "Kelly, Mr. James"]

,name,gender,age,class,embarked,country,ticketno,fare,survived
651,"Kelly, Mr. James",male,19.0,3rd,S,Scotland,363592.0,8.01,0
652,"Kelly, Mr. James",male,44.0,3rd,Q,Ireland,330911.0,7.1607,0
653,"Kelly, Mr. James",male,44.0,engineering crew,S,England,,,0
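
Duplicate labels like these can also be detected programmatically. The sketch below uses a hypothetical four-row stand-in called `toy`, with the name, age, and class values copied from the Titanic rows shown in this section. It flags repeated names and shows what `.loc` returns when a label is not unique.

```python
import pandas as pd

# Hypothetical stand-in: four rows copied from the Titanic output.
toy = pd.DataFrame({
    "name": [
        "Abbing, Mr. Anthony",
        "Kelly, Mr. James",
        "Kelly, Mr. James",
        "Kelly, Mr. James",
    ],
    "age": [42.0, 19.0, 44.0, 44.0],
    "class": ["3rd", "3rd", "3rd", "engineering crew"],
})

# keep=False marks every occurrence of a repeated name, not just the later ones.
repeated = toy[toy["name"].duplicated(keep=False)]
print(len(repeated))  # 3

by_name = toy.set_index("name")
print(by_name.index.is_unique)  # False

# With a duplicated label, .loc returns every matching row as a DataFrame.
kellys = by_name.loc["Kelly, Mr. James"]
print(type(kellys).__name__)  # DataFrame
print(len(kellys))            # 3
```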


4\. Regardless of your reservations in the previous question, make "name" the index of the `DataFrame`. Use this to extract a `DataFrame` containing information about the three members of the Widener family:
  - Widener, Mr. George Dunton
  - Widener, Mr. Harry Elkins
  - Widener, Mrs. Eleanor

What became of them? Using a search engine, what else can you learn about them?


In [None]:
# YOUR CODE HERE
# Re-read the data so that "name" is still an ordinary column.
df_titanic = pd.read_csv("https://dlsun.github.io/pods/data/titanic.csv")
df_titanic.set_index("name", inplace=True)
df_titanic.loc[[
    "Widener, Mr. George Dunton",
    "Widener, Mr. Harry Elkins",
    "Widener, Mrs. Eleanor",
]]

name,gender,age,class,embarked,country,ticketno,fare,survived
"Widener, Mr. George Dunton",male,50.0,1st,S,United States,113503.0,211.1,0
"Widener, Mr. Harry Elkins",male,27.0,1st,S,United States,113503.0,211.1,0
"Widener, Mrs. Eleanor",female,50.0,1st,S,United States,113503.0,211.1,1
