# Data Analysis in Python Workshop
André Guerra, andre.guerra@mail.mcgill.ca \
April, 2021 \
<u>Description:</u> This workshop focuses on data analysis techniques using Python.

This second notebook in the series covers reading data from .csv and .xlsx files, examining their contents, and applying descriptive statistics for quick insight into the data.

___
## Jupyter notebooks
Some useful shortcuts to use in Jupyter notebooks.

When outside a cell:
- <b>A</b>: insert a cell above the current
- <b>B</b>: insert a cell below the current
- <b>D,D</b>: delete the current cell
- <b>M</b>: make the current cell markdown type
- <b>Y</b>: make the current cell code type

When inside a cell:
- <b>shift + enter</b>: execute/run cell
- <b>cmd + /</b> (<b>ctrl + /</b> on Windows and Linux): comment/uncomment a line of code

___
## Data handling using pandas package
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. \
https://pandas.pydata.org/

___
## Import statement(s)
Import the package(s) and assign each one a local alias for use in our code.

In [1]:
import pandas as pd

Set the number of decimal places shown when a dataframe is displayed.

In [2]:
pd.set_option("display.precision", 3)
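A minimal sketch of what this option does, assuming the full option name `display.precision`. The frame `demo_DF` and its values are hypothetical and only serve the illustration. The option changes how floats are printed, not the values stored in the dataframe.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the display option.
demo_DF = pd.DataFrame({"value": [1.23456789, 2.5]})

# Limit printed floats to 3 decimal places.
pd.set_option("display.precision", 3)

# The printed text is rounded for display...
printed = demo_DF.to_string()
print(printed)

# ...but the stored value keeps its full precision.
stored = demo_DF.loc[0, "value"]
print(stored)
```

If full values are needed again later, `pd.reset_option("display.precision")` restores the default.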

___
## Read data files
### .csv files
We use the function <b>read_csv()</b> from the pandas module. The alias assigned in the import, <b>pd</b>, is used to call it. The path to the target file is passed as the input parameter.

In [3]:
heart_file = "1_data/heart.csv"  # forward slashes work on every OS and avoid backslash escapes
heart_DF = pd.read_csv(heart_file)
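As a rough illustration that runs without the workshop's data folder, the sketch below reads CSV text held in memory. The three rows are copied from the heart data preview, and the names `csv_text`, `sample_DF`, and `subset_DF` are hypothetical. It also shows the `usecols` parameter, which limits the columns that are parsed.

```python
import io

import pandas as pd

# Three rows taken from the heart data, kept in memory for this sketch.
csv_text = """age,sex,cp,trestbps,chol
63,1,3,145,233
37,1,2,130,250
41,0,1,130,204
"""

# read_csv accepts a file path or any file-like object.
sample_DF = pd.read_csv(io.StringIO(csv_text))
print(sample_DF)

# usecols restricts parsing to the listed columns.
subset_DF = pd.read_csv(io.StringIO(csv_text), usecols=["age", "chol"])
print(subset_DF)
```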

### .xlsx files
We use the function <b>read_excel()</b> from the pandas module. The alias assigned in the import, <b>pd</b>, is used to call it. The path to the target file is passed as the input parameter.

In [4]:
covid19_file = "1_data/covid_19_data.xlsx"  # forward slashes work on every OS and avoid backslash escapes
covid_DF = pd.read_excel(covid19_file)
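File paths written as plain strings can behave differently across operating systems. One possible alternative, sketched here with the standard library's `pathlib`, builds the paths from their parts. The variable names `data_dir`, `heart_path`, and `covid_path` are hypothetical. Both `pd.read_csv()` and `pd.read_excel()` accept `Path` objects in place of strings.

```python
from pathlib import Path

# Folder that holds the workshop data files.
data_dir = Path("1_data")

# The / operator joins path parts using the separator of the current OS.
heart_path = data_dir / "heart.csv"
covid_path = data_dir / "covid_19_data.xlsx"

print(heart_path.name)    # file name without the folder
print(covid_path.suffix)  # file extension
```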

___
## Preview the data read in
We use the method DF.<b>head()</b> on a pandas dataframe to output its first 5 entries.\
Here heart_DF is the dataframe defined above, which holds the heart patient data.

In [5]:
heart_DF.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


We can use the method DF.<b>tail()</b> to return the last 5 entries of the dataframe.

In [6]:
heart_DF.tail()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0
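Both methods also take an optional number of rows. A small sketch, using a hypothetical one-column frame `demo_DF` with a few ages from the heart data:

```python
import pandas as pd

# Hypothetical frame with eight rows, for illustration only.
demo_DF = pd.DataFrame({"age": [63, 37, 41, 56, 57, 57, 45, 68]})

# Pass the number of rows wanted instead of relying on the default of 5.
first_three = demo_DF.head(3)
last_two = demo_DF.tail(2)

print(first_three)
print(last_two)
```

The original row labels are kept, so `last_two` is indexed by 6 and 7 rather than starting again from 0.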


Alternatively, we can evaluate the dataframe variable on its own in a cell to see the first 5 and last 5 entries in the dataframe. Notice this also returns the shape of the dataframe <b>(rows x columns)</b>.

In [7]:
covid_DF

,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1,0,0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14,0,0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6,0,0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1,0,0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0,0,0
...,...,...,...,...,...,...,...,...
68553,68554,07/20/2020,Zaporizhia Oblast,Ukraine,2020-07-21 04:38:46,678,20,551
68554,68555,07/20/2020,Zeeland,Netherlands,2020-07-21 04:38:46,791,69,0
68555,68556,07/20/2020,Zhejiang,Mainland China,2020-07-21 04:38:46,1270,1,1267
68556,68557,07/20/2020,Zhytomyr Oblast,Ukraine,2020-07-21 04:38:46,1602,34,1251


___
## Information on the data read in
Learn about the data saved to the dataframes defined above by using the method DF.<b>info()</b>. It reports the number of entries, the column names, the non-null count and data type of each column, and the memory usage.

In [8]:
heart_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
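The pieces of this summary can also be obtained separately. The sketch below uses a hypothetical frame `demo_DF` with one missing value to show the `dtypes` attribute and a per-column count of missing entries via `isna().sum()`.

```python
import pandas as pd

# Hypothetical frame; None marks a missing age.
demo_DF = pd.DataFrame({
    "age": [63, 37, None],
    "oldpeak": [2.3, 3.5, 1.4],
})

# Data type of each column.
dtypes = demo_DF.dtypes
print(dtypes)

# Number of missing entries in each column.
missing = demo_DF.isna().sum()
print(missing)
```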


### Dimensions
Return the dimensions of a dataframe using the DF.<b>shape</b> attribute.

In [9]:
covid_DF.shape

(68558, 8)
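Since the shape is a plain tuple, it can be unpacked into separate variables. A brief sketch with a hypothetical frame `demo_DF`:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
demo_DF = pd.DataFrame({"Confirmed": [1, 14, 6], "Deaths": [0, 0, 0]})

# shape is a (rows, columns) tuple.
n_rows, n_cols = demo_DF.shape
print(n_rows, n_cols)

# len() on a dataframe returns the number of rows.
print(len(demo_DF))
```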

### Column names
Return the column names using the DF.<b>columns</b> attribute.

In [10]:
heart_DF.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [11]:
covid_DF.columns

Index(['SNo', 'ObservationDate', 'Province/State', 'Country/Region',
       'Last Update', 'Confirmed', 'Deaths', 'Recovered'],
      dtype='object')
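The returned Index object can be converted to a regular list or used to check whether a column exists before accessing it. A small sketch, with a hypothetical empty frame `demo_DF` that reuses three of the column names above:

```python
import pandas as pd

# Hypothetical empty frame that only carries column names.
demo_DF = pd.DataFrame(columns=["SNo", "ObservationDate", "Confirmed"])

# Convert the Index to a plain Python list.
names = demo_DF.columns.tolist()
print(names)

# Membership test on the column names.
has_deaths = "Deaths" in demo_DF.columns
print(has_deaths)
```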

___
## Descriptive statistics
We can obtain basic descriptive statistics using the DF.<b>describe()</b> method. This will calculate the count, mean, standard deviation, quartiles, min, and max of the data in each column. \
The returned DF holds the descriptive stats for all columns that contain numerical data. By default, any columns with non-numerical data types are ignored by this method.

In [12]:
covid_DF.describe()

,SNo,Confirmed,Deaths,Recovered
count,68558.0,68558.0,68558.0,68560.0
mean,34279.5,10472.018,564.675,4831.0
std,19791.134,32092.93,2516.088,27120.0
min,1.0,0.0,0.0,0.0
25%,17140.25,107.0,1.0,0.0
50%,34279.5,998.0,17.0,137.0
75%,51418.75,5361.0,168.0,1488.0
max,68558.0,416434.0,41128.0,1160000.0
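Non-numerical columns can be summarized as well by passing `include="all"`. The sketch below uses a hypothetical four-row frame `demo_DF` built from values in the COVID-19 preview. For text columns the summary reports the number of unique values, the most frequent value, and its frequency.

```python
import pandas as pd

# Hypothetical frame with one text column and one numerical column.
demo_DF = pd.DataFrame({
    "Country/Region": ["Ukraine", "Netherlands", "Mainland China", "Ukraine"],
    "Confirmed": [678, 791, 1270, 1602],
})

# Default behaviour: only the numerical column is described.
numeric_stats = demo_DF.describe()
print(numeric_stats)

# include="all" adds unique, top, and freq rows for the text column.
all_stats = demo_DF.describe(include="all")
print(all_stats)
```

Statistics that do not apply to a column, such as the mean of a text column, appear as NaN in the combined table.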


We can select certain columns of the dataframe by passing a list of column names, and apply the descriptive statistics to that subset of the data.

In [13]:
heart_DF[['age','chol','thalach']].describe()

,age,chol,thalach
count,303.0,303.0,303.0
mean,54.366,246.264,149.647
std,9.082,51.831,22.905
min,29.0,126.0,71.0
25%,47.5,211.0,133.5
50%,55.0,240.0,153.0
75%,61.0,274.5,166.0
max,77.0,564.0,202.0
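Because the result of `describe()` is itself a dataframe, single statistics can be read from it with `.loc`. A minimal sketch, using a hypothetical frame `demo_DF` built from the five previewed rows of the heart data:

```python
import pandas as pd

# Hypothetical frame with the first five rows of three heart data columns.
demo_DF = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "chol": [233, 250, 204, 236, 354],
})

# Double brackets pass a list of names and return a dataframe.
stats = demo_DF[["age", "chol"]].describe()
print(stats)

# Rows are labelled by statistic, columns by variable.
mean_age = stats.loc["mean", "age"]
max_chol = stats.loc["max", "chol"]
print(mean_age, max_chol)
```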
