# Data Analysis in Python Workshop
André Guerra, andre.guerra@mail.mcgill.ca \
April, 2021 \
<u>Description:</u> This workshop focuses on data analysis techniques using Python.

This notebook examines how to slice a pandas dataframe (DF) to extract subsets of its data.

___
## Jupyter notebooks
Some useful shortcuts to use in Jupyter notebooks.

When outside a cell:
- <b>A</b>: insert a cell above the current
- <b>B</b>: insert a cell below the current
- <b>D,D</b>: delete the current cell
- <b>M</b>: make the current cell markdown type
- <b>Y</b>: make the current cell code type

When inside a cell:
- <b>shift + enter</b>: execute/run cell
- <b>cmd + /</b> (macOS) or <b>ctrl + /</b> (Windows/Linux): comment/uncomment a line of code

___
## Data handling using pandas package
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. \
https://pandas.pydata.org/

___
## Import statement(s)
Import the package(s) and assign each one to a local name for use in our code.

In [1]:
import pandas as pd

Set the number of decimal places displayed for values in a dataframe.

In [2]:
pd.set_option("display.precision", 3)
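
The precision option only changes how values are displayed, not what is stored. The sketch below illustrates this with a hypothetical one-value dataframe (`values_DF` is not part of the workshop data) and uses the full option name `display.precision` inside a temporary `pd.option_context` block so the global setting is left alone.

```python
import pandas as pd

# Hypothetical dataframe holding a single value with many decimals.
values_DF = pd.DataFrame({"x": [1.23456789]})

# Apply the display precision only inside this block.
with pd.option_context("display.precision", 3):
    shown = repr(values_DF)   # the text Jupyter would print for the dataframe

# The stored value keeps its full precision.
stored = values_DF.loc[0, "x"]

print(shown)
print(stored)
```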

___
## Read data files
### .csv files
We use the function <b>read_csv()</b> from the pandas package. It is called through <b>pd</b>, the local name assigned in the import, with the target file path as the input parameter.

In [3]:
heart_file = "1_data/heart.csv"
heart_DF = pd.read_csv(heart_file)
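
To try `read_csv()` without the data file at hand, the following minimal sketch reads CSV text from an in-memory buffer instead of a path. The names `csv_text` and `sample_DF` are hypothetical, and the three rows are copied from the preview shown later in this notebook (only four of the columns are kept).

```python
import io

import pandas as pd

# A few rows of the heart data, limited to four columns for brevity.
csv_text = """age,sex,chol,target
63,1,233,1
37,1,250,1
41,0,204,1
"""

# read_csv accepts a file path or any file-like object, such as a StringIO buffer.
sample_DF = pd.read_csv(io.StringIO(csv_text))

print(sample_DF.shape)  # (number of rows, number of columns)
```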

___
## Preview the data read in
We use the method DF.<b>head()</b> of a pandas dataframe to output its first 5 rows.\
Here, heart_DF is the dataframe defined above, which contains the heart patient data.

In [4]:
heart_DF.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


___
## Information on the data read in
Learn about the data saved to the dataframe defined above by using the method DF.<b>info()</b>. It reports the number of entries, the column names, the data type of each column, and the memory usage.

In [5]:
heart_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
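
Individual pieces of this summary can also be retrieved as values for use in later code. The sketch below assumes a small hypothetical dataframe, `sample_DF`, built from two columns of the first three rows previewed above. It uses the `shape` and `dtypes` attributes and counts missing values with `isna()`.

```python
import pandas as pd

# Hypothetical stand-in for heart_DF: two columns, three rows.
sample_DF = pd.DataFrame({
    "age": [63, 37, 41],
    "oldpeak": [2.3, 3.5, 1.4],
})

n_rows, n_cols = sample_DF.shape               # number of rows and columns
column_types = sample_DF.dtypes                # data type of each column
missing_per_column = sample_DF.isna().sum()    # count of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```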


___
## Slicing information out of the dataframe
We can get subsets of data from the dataframe using Python's square-bracket indexing syntax.

### Limited properties (columns)
Limit which columns you would like returned from the dataframe. Here, we slice out `age`, `sex`, and `chol`.

In [6]:
heart_DF[['age','sex','chol']]

Unnamed: 0,age,sex,chol
0,63,1,233
1,37,1,250
2,41,0,204
3,56,1,236
4,57,0,354
...,...,...,...
298,57,0,241
299,45,1,264
300,68,1,193
301,57,1,131
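
The double square brackets matter: the inner pair is a Python list of column names. As a minimal sketch, assuming a hypothetical `sample_DF` built from the first three rows above, a single name in single brackets returns a pandas Series, while a list of names returns a dataframe.

```python
import pandas as pd

# Hypothetical stand-in for heart_DF with three of its columns.
sample_DF = pd.DataFrame({
    "age": [63, 37, 41],
    "sex": [1, 1, 0],
    "chol": [233, 250, 204],
})

age_series = sample_DF["age"]          # single name -> Series
age_frame = sample_DF[["age"]]         # list with one name -> one-column DataFrame
subset = sample_DF[["age", "chol"]]    # list of names -> DataFrame with those columns

print(type(age_series).__name__)
print(type(age_frame).__name__)
print(list(subset.columns))
```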


### Conditional
Select all rows where `age` is above 40. The result includes all columns in the dataframe.

In [7]:
heart_DF[heart_DF['age'] > 40]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
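
It can help to see the condition on its own. The comparison produces a Series of True/False values (a boolean mask), and indexing the dataframe with that mask keeps only the rows marked True. The sketch below assumes a hypothetical `sample_DF` holding the first five rows of the data; `mask` and `older` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for heart_DF: first five rows, three columns.
sample_DF = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "chol": [233, 250, 204, 236, 354],
})

mask = sample_DF["age"] > 40   # boolean Series, one True/False per row
older = sample_DF[mask]        # keeps only the rows where mask is True

print(mask.tolist())
print(older.index.tolist())    # original row labels are preserved
```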


Select all rows where `age` is above 40 years <b>AND</b> `sex` is 1. The result includes all columns in the dataframe.

In [8]:
heart_DF[(heart_DF['age'] > 40) & (heart_DF['sex'] == 1)]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,63,1,0,140,187,0,0,144,1,4.0,2,2,3,0
297,59,1,0,164,176,1,0,90,0,1.0,1,2,1,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
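
Conditions are combined element-wise with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. A minimal sketch, assuming a hypothetical `sample_DF` holding the first six rows of the data:

```python
import pandas as pd

# Hypothetical stand-in for heart_DF: first six rows, three columns.
sample_DF = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57],
    "sex": [1, 1, 0, 1, 0, 1],
    "chol": [233, 250, 204, 236, 354, 192],
})

# AND: both conditions must hold.
both = sample_DF[(sample_DF["age"] > 40) & (sample_DF["sex"] == 1)]

# OR: at least one condition must hold.
either = sample_DF[(sample_DF["age"] < 40) | (sample_DF["sex"] == 0)]

# NOT: invert a condition.
not_one = sample_DF[~(sample_DF["sex"] == 1)]

print(both.index.tolist())
print(either.index.tolist())
print(not_one.index.tolist())
```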


### Conditional and limited
Select all rows where `age` is above 40 AND `sex` is 1, <b>but</b> limit the return to columns `chol` and `target`.

In [9]:
heart_DF[(heart_DF['age'] > 40) & (heart_DF['sex'] == 1)][['chol','target']]

Unnamed: 0,chol,target
0,233,1
3,236,1
5,192,1
7,263,1
8,199,1
...,...,...
295,187,0
297,176,0
299,264,0
300,193,0
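
The same selection can be written as a single <b>.loc</b> call that takes the row condition and the column list together, which avoids chaining two indexing steps. This is offered as an alternative sketch, not as the workshop's method. It assumes a hypothetical `sample_DF` holding the first six rows of the data.

```python
import pandas as pd

# Hypothetical stand-in for heart_DF: first six rows, four columns.
sample_DF = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57],
    "sex": [1, 1, 0, 1, 0, 1],
    "chol": [233, 250, 204, 236, 354, 192],
    "target": [1, 1, 1, 1, 1, 1],
})

mask = (sample_DF["age"] > 40) & (sample_DF["sex"] == 1)

# Two steps: filter rows, then pick columns.
chained = sample_DF[mask][["chol", "target"]]

# One step: rows and columns in the same .loc call.
single_step = sample_DF.loc[mask, ["chol", "target"]]

print(single_step)
print(chained.equals(single_step))
```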


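### Slice using index labels AND column names
Use the <b>.loc</b> indexer to select rows by index label and columns by name. Here the row labels are the default integers 0 to 302, so they look like positions. Unlike ordinary Python slices, <b>.loc</b> slices include both endpoints, so `0:4` returns five rows and `'age':'chol'` includes the `chol` column.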

In [10]:
heart_DF.loc[0:4, 'age':'chol']

Unnamed: 0,age,sex,cp,trestbps,chol
0,63,1,3,145,233
1,37,1,2,130,250
2,41,0,1,130,204
3,56,1,1,120,236
4,57,0,0,120,354
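
For comparison, pandas also provides <b>.iloc</b>, which selects by integer position and follows the usual Python rule of excluding the end of a slice. The sketch below contrasts the two on a hypothetical `sample_DF` holding the first six rows and six of the columns.

```python
import pandas as pd

# Hypothetical stand-in for heart_DF: first six rows, six columns.
sample_DF = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57],
    "sex": [1, 1, 0, 1, 0, 1],
    "cp": [3, 2, 1, 1, 0, 0],
    "trestbps": [145, 130, 130, 120, 120, 140],
    "chol": [233, 250, 204, 236, 354, 192],
    "target": [1, 1, 1, 1, 1, 1],
})

# .loc: label-based, both endpoints included -> rows 0..4, columns age..chol.
by_label = sample_DF.loc[0:4, "age":"chol"]

# .iloc: position-based, end excluded -> rows 0..3, columns 0..4.
by_position = sample_DF.iloc[0:4, 0:5]

print(by_label.shape)
print(by_position.shape)
```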


Select specific rows and columns by passing lists of labels to <b>.loc</b>.

In [11]:
heart_DF.loc[[0,5,10,20],['age','chol']]

Unnamed: 0,age,chol
0,63,233
5,57,192
10,54,239
20,59,234
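
The difference between labels and positions shows up once rows have been filtered, because the surviving rows keep their original labels. A minimal sketch, assuming a hypothetical `sample_DF` holding the first six rows of the data; `ones_DF` and `renumbered` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for heart_DF: first six rows, three columns.
sample_DF = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57],
    "sex": [1, 1, 0, 1, 0, 1],
    "chol": [233, 250, 204, 236, 354, 192],
})

# Keep only rows with sex == 1; their labels stay 0, 1, 3, 5.
ones_DF = sample_DF[sample_DF["sex"] == 1]

# .loc uses the labels; .iloc uses positions within the filtered dataframe.
by_label = ones_DF.loc[[0, 5], ["age", "chol"]]
by_position = ones_DF.iloc[[0, 3], [0, 2]]

# reset_index(drop=True) renumbers the rows from 0 if the old labels are not needed.
renumbered = ones_DF.reset_index(drop=True)

print(ones_DF.index.tolist())
print(by_label.equals(by_position))
print(renumbered.index.tolist())
```

Asking `ones_DF.loc` for label 2 would raise a `KeyError`, since that row was filtered out.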
