<img src="materials/images/introduction-to-pandas-cover.png"/>


# 👋 Welcome, before you start
<br>

### 📚 Module overview

This module introduces you to the pandas library for working with structured data. Pandas is powerful and expressive, and it's one of the primary reasons that Python has become a leading option for doing data science. Pandas provides data structures and intuitive capabilities for performing fast and easy **data cleaning, preparation, manipulation, aggregation, and sophisticated analysis**. 

We will go through seven lessons with you:

- [**Lesson 1: Pandas Data Structures**](Lesson_1_Pandas_Data_Structures.ipynb)

- [**Lesson 2: Dropping Rows and Columns**](Lesson_2_Dropping_Rows_and_Columns.ipynb)

- [**Lesson 3: Selecting and Filtering Rows and Columns**](Lesson_3_Selecting_and_Filtering_Rows_and_Columns.ipynb)

- [**Lesson 4: Importing Data**](Lesson_4_Importing_Data.ipynb)

- <font color=#E98300>**Lesson 5: Data Exploration**</font>    `📍You are here.`

- [**Lesson 6: Data Transformation**](Lesson_6_Data_Transformation.ipynb)

- [**Lesson 7: Data Analysis**](Lesson_7_Data_Analysis.ipynb)
    
</br>

### ✅ Exercises
We encourage you to try the exercise questions in this module, and use the [**solutions to the exercises**](Exercise_solutions.ipynb) to help you study.

</br>

<div class="alert alert-block alert-info">
<h3>⌨️ Keyboard shortcuts</h3>

These common shortcuts can save you time as you work through this notebook:
- Run the current cell: **`Shift + Enter`**.
- Add a cell above the current cell: Press **`A`**.
- Add a cell below the current cell: Press **`B`**.
- Change a code cell to a markdown cell: Select the cell, and then press **`M`**.
- Delete a cell: Press **`D`** twice.

Need more help with keyboard shortcuts? Press **`H`** to look them up.
</div>

---

# Lesson 5: Data Exploration

We are going to go through these concepts in this lesson:

- [Descriptive and summary statistics](#Descriptive-and-summary-statistics)
- [Correlation](#Correlation)
- [Sorting](#Sorting)

`🕒 This module should take about 20 minutes to complete.`

`✍️ This notebook is written using Python.`

## Descriptive and summary statistics

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/heart_disease.csv")

In [3]:
# The describe() method displays descriptive statistics for each numerical column of a DataFrame:
# count, mean, standard deviation, min, max and the quartiles (the 50% row is the median).

df.describe()

| | age | chest_pain | rest_bp | chol | max_hr | st_depr |
|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.966997 | 131.623762 | 246.264026 | 149.646865 | 1.039604 |
| std | 9.082101 | 1.032052 | 17.538143 | 51.830751 | 22.905161 | 1.161075 |
| min | 29.0 | 0.0 | 94.0 | 126.0 | 71.0 | 0.0 |
| 25% | 47.5 | 0.0 | 120.0 | 211.0 | 133.5 | 0.0 |
| 50% | 55.0 | 1.0 | 130.0 | 240.0 | 153.0 | 0.8 |
| 75% | 61.0 | 2.0 | 140.0 | 274.5 | 166.0 | 1.6 |
| max | 77.0 | 3.0 | 200.0 | 564.0 | 202.0 | 6.2 |


In [4]:
# You can also call individual methods on a column (Series object) to get a particular descriptive value
# for that column:

df["age"].min()
# df["chol"].mean()

29
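As a minimal sketch of how the individual methods relate to describe(), the example below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not rows from heart_disease.csv). It shows that median(), quantile() and std() return the same numbers that describe() reports in its 50%, 25% and std rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the heart disease dataset
toy = pd.DataFrame({
    "age": [29, 45, 54, 61, 77],
    "chol": [204, 230, 246, 260, 304],
})

age_median = toy["age"].median()      # matches the 50% row of describe()
age_q1 = toy["age"].quantile(0.25)    # matches the 25% row of describe()
chol_std = toy["chol"].std()          # sample standard deviation, as in the std row

summary = toy.describe()

print(age_median, age_q1, chol_std)
print(summary.loc[["25%", "50%", "std"]])
```

Calling a single method is handy when you only need one number, while describe() is the quicker way to get an overview of every numerical column at once.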

## ✅ Exercise 1
Display the average values for max_hr and rest_bp. 

In [5]:
df["max_hr"].mean()

149.64686468646866

In [6]:
df["rest_bp"].mean()

131.62376237623764

In [7]:
df[["max_hr", "rest_bp"]].mean()

max_hr     149.646865
rest_bp    131.623762
dtype: float64

---

<div class="alert alert-block alert-info">
<b>Tip:</b> To view a summary of non-numerical (categorical) columns, set the "include" parameter of the describe() method to "object". In this dataset, the text columns are stored with the "object" dtype, as the info() output later in this lesson shows. 
    
    For example:  df.describe(include="object")

The summary will include the top occurring category for each column along with its frequency. 
</div>

In [8]:
df.describe(include="object")

| | sex | target |
|---|---|---|
| count | 303 | 303 |
| unique | 2 | 2 |
| top | Female | Yes |
| freq | 207 | 165 |


In [9]:
df["sex"].describe()

count        303
unique         2
top       Female
freq         207
Name: sex, dtype: object

<div class="alert alert-block alert-success">
<b>Note:</b> When describe() is called on a single column (a Series), as above, pandas picks the summary that fits the column's data type, so include="object" does not need to be passed to the method. </div>
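
The "top" and "freq" rows only describe the most common category. One possible way to see the frequency of every category is the Series value_counts() method. The sketch below uses a made-up column (the name `toy` and its values are assumptions for illustration) and checks that the first entry of value_counts() agrees with what describe() reports.

```python
import pandas as pd

# Hypothetical toy column, not taken from heart_disease.csv
toy = pd.DataFrame({"sex": ["Female", "Male", "Female", "Female", "Male"]})

counts = toy["sex"].value_counts()   # frequency of each category, most common first
summary = toy["sex"].describe()      # count, unique, top, freq

print(counts)
print(summary["top"], summary["freq"])
```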

### info()
The info() method displays information about a DataFrame including column data type (dtype) and non-null values.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         303 non-null    int64  
 1   sex         303 non-null    object 
 2   chest_pain  303 non-null    int64  
 3   rest_bp     303 non-null    int64  
 4   chol        303 non-null    int64  
 5   max_hr      303 non-null    int64  
 6   st_depr     303 non-null    float64
 7   target      303 non-null    object 
dtypes: float64(1), int64(5), object(2)
memory usage: 19.1+ KB


<div class="alert alert-block alert-success">
    <b>Note:</b> Using the <b>info()</b> method can be a valuable way to determine if there are any missing values in our dataset. Above, note that info() indicates that the dataset has <b>"303 entries"</b>. Also, note that it indicates that each column has <b>"303 non-null"</b> values. A "null" value indicates a missing value, so the above suggests that there are no missing values in the dataset. Further, info() helpfully provides information about the datatype (Dtype) of each column.</div>
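
As a complementary check, missing values can also be counted directly. The sketch below is one possible approach, using a made-up DataFrame that deliberately contains gaps (the name `toy` and its values are assumptions for illustration). isna() marks each missing cell, and sum() adds the marks up per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "age": [29, 45, np.nan, 61],
    "sex": ["Female", None, "Male", "Female"],
})

missing_per_column = toy.isna().sum()   # number of missing values in each column
non_null_age = toy["age"].count()       # the "non-null" count that info() reports
has_missing = toy.isna().any().any()    # True if any cell at all is missing

print(missing_per_column)
print(non_null_age, has_missing)
```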

### Displaying the unique values

In [11]:
# The Series object's (a single column) unique() method will return the column's unique values.

df["target"].unique()

array(['Yes', 'No'], dtype=object)

In [12]:
# A Series object (single column) can be passed into Python's built-in set() function 
#   to see the column's unique values.

set(df["target"])

{'No', 'Yes'}

<div class="alert alert-block alert-success">
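<b>Note:</b> The unique() method tends to be faster on large columns and returns the values in order of first appearance. Python's set() function does not sort: a set is unordered, although the notebook may display its elements in sorted order. When sorted output would help in locating a specific category among many distinct values, wrap either result in sorted(), for example sorted(df["target"].unique()). </div>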

<div class="alert alert-block alert-info">
<b>Uses</b>: We often need to convert categorical variable names into numbers for data analysis and machine learning, so being able to determine the number of unique values within a column can be helpful.  For example, knowing that there are only two categories within a column, as shown above, we would know that we could perhaps convert <b>'Yes'</b> to 1 and <b>'No'</b> to 0, when numbers are required. If we hadn't determined the unique values within the column, we might have missed that there was a <b>'Maybe'</b> option that also needed to be converted. </div>
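
The conversion scenario described above can be sketched as follows. This is a minimal illustration rather than a recommended encoding: the `toy` data, the `mapping` dictionary and the stray 'Maybe' value are all assumptions made up for the example. It uses unique() and nunique() to inspect the categories, then map() to convert them, and finally shows how an unexpected category surfaces as a missing value.

```python
import pandas as pd

# Hypothetical toy column with an unexpected third category
toy = pd.DataFrame({"target": ["Yes", "No", "Yes", "Maybe", "No"]})

categories = toy["target"].unique()       # values in order of first appearance
n_categories = toy["target"].nunique()    # number of distinct values
sorted_categories = sorted(categories)    # explicit sort, if sorted output is wanted

mapping = {"Yes": 1, "No": 0}             # hypothetical encoding
encoded = toy["target"].map(mapping)      # values missing from the mapping become NaN

# Categories that the mapping did not cover
unmapped = toy.loc[encoded.isna(), "target"].unique()

print(categories, n_categories, sorted_categories)
print(unmapped)
```

Checking the unique values first makes it clear that the mapping needs a decision about 'Maybe' before the column can be used as numbers.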

## ✅ Exercise 2
Display the unique values for the variable "chest_pain".

In [13]:
df["chest_pain"].unique()

array([3, 2, 1, 0])

---

## Correlation
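The DataFrame's corr() method will display the pairwise correlations among the DataFrame's columns. By default it computes the Pearson correlation coefficient. Passing numeric_only=True restricts the calculation to the numerical columns, which is needed here because "sex" and "target" contain text.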

In [14]:
df.corr(numeric_only=True)

| | age | chest_pain | rest_bp | chol | max_hr | st_depr |
|---|---|---|---|---|---|---|
| age | 1.0 | -0.068653 | 0.279351 | 0.213678 | -0.398522 | 0.210013 |
| chest_pain | -0.068653 | 1.0 | 0.047608 | -0.076904 | 0.295762 | -0.14923 |
| rest_bp | 0.279351 | 0.047608 | 1.0 | 0.123174 | -0.046698 | 0.193216 |
| chol | 0.213678 | -0.076904 | 0.123174 | 1.0 | -0.00994 | 0.053952 |
| max_hr | -0.398522 | 0.295762 | -0.046698 | -0.00994 | 1.0 | -0.344187 |
| st_depr | 0.210013 | -0.14923 | 0.193216 | 0.053952 | -0.344187 | 1.0 |


<div class="alert alert-block alert-info">
<b>Uses</b>: It's important to be able to interpret correlation coefficients. As a rule of thumb, the strength of a correlation is described by the absolute value of the coefficient, and its direction by the sign: a positive value means that as one variable increases the other tends to increase, and a negative value means that the other tends to decrease.

<b>Perfect</b>: A value of exactly ± 1 means that the two variables have an exact linear relationship.
    
<b>High</b>: A value between ± 0.50 and ± 1 is said to be a strong correlation.
    
<b>Moderate</b>: A value between ± 0.30 and ± 0.49 is said to be a moderate correlation.
    
<b>Low</b>: A value between 0 and ± 0.29 is said to be a weak correlation.
    
<b>No correlation</b>: A value of zero means there is no linear correlation between the variables.
</div>

Correlation is a very valuable tool when performing data analysis. We will discuss it in more detail in subsequent modules that cover data analysis.
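
The rule-of-thumb bands above can be sketched as a small helper. The function name `correlation_strength` is hypothetical, and the exact cut-offs (0.30 and 0.50 as inclusive lower bounds) are an assumption about how to handle values that fall between the listed ranges.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the rule-of-thumb bands."""
    size = abs(r)  # strength depends on the magnitude, not the sign
    if size == 0:
        return "none"
    if size < 0.30:
        return "low"
    if size < 0.50:
        return "moderate"
    if size < 1:
        return "high"
    return "perfect"


# Coefficients taken from the correlation output shown earlier in this lesson
print(correlation_strength(-0.398522))  # age vs. max_hr
print(correlation_strength(0.295762))   # chest_pain vs. max_hr
```

The sign is dropped before classifying, so a coefficient of -0.40 and one of 0.40 receive the same label. The direction has to be read from the sign separately.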

In [15]:
# You can also select specific columns of interest and use the corr() method to see the pairwise correlations.

df[["age", "max_hr"]].corr()

| | age | max_hr |
|---|---|---|
| age | 1.0 | -0.398522 |
| max_hr | -0.398522 | 1.0 |
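
When only one pair of columns is of interest, a single number can be easier to work with than a 2 x 2 table. The sketch below shows one possible way to get it, using the Series corr() method on made-up data (the name `toy` and its values are assumptions for illustration), and confirms that it agrees with the corresponding cell of the DataFrame's correlation matrix.

```python
import pandas as pd

# Hypothetical toy data in which max_hr falls as age rises
toy = pd.DataFrame({
    "age": [30, 40, 50, 60, 70],
    "max_hr": [190, 182, 171, 165, 150],
})

# Series.corr() returns a single coefficient for one pair of columns
r = toy["age"].corr(toy["max_hr"])

# The same value appears off the diagonal of the pairwise matrix
matrix_value = toy[["age", "max_hr"]].corr().loc["age", "max_hr"]

print(r, matrix_value)
```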


<div class="alert alert-block alert-success">
<b>Note:</b> Identifying correlations among the columns (independent variables) in a dataset can be very important. Strong correlation between independent variables is known as <b>collinearity</b> and can be detrimental to training a machine learning model.</div>
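
One possible way to look for collinearity is to scan the correlation matrix for pairs of columns whose coefficient exceeds some cut-off. The sketch below is an illustration under assumptions: the `toy` data is made up, and the `threshold` of 0.8 is an arbitrary choice for the example, not a value from this lesson.

```python
import pandas as pd

# Hypothetical toy data: "b" is nearly a multiple of "a", "c" is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

corr = toy.corr()
threshold = 0.8  # arbitrary cut-off chosen for this example
cols = list(corr.columns)

# Visit each pair of columns once (upper triangle of the matrix)
pairs = [
    (left, right, corr.loc[left, right])
    for i, left in enumerate(cols)
    for right in cols[i + 1:]
    if abs(corr.loc[left, right]) >= threshold
]

print(pairs)
```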

## ✅ Exercise 3
Display the correlation between max_hr and chest_pain.

In [16]:
df[["max_hr", "chest_pain"]].corr()

| | max_hr | chest_pain |
|---|---|---|
| max_hr | 1.0 | 0.295762 |
| chest_pain | 0.295762 | 1.0 |


---

## Sorting
The DataFrame's sort_values() method takes a "by" parameter to indicate which column the DataFrame should be sorted by:




In [17]:
df.sort_values(by="age").head()

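| | age | sex | chest_pain | rest_bp | chol | max_hr | st_depr | target |
|---|---|---|---|---|---|---|---|---|
| 72 | 29 | Female | 1 | 130 | 204 | 202 | 0.0 | Yes |
| 58 | 34 | Female | 3 | 118 | 182 | 174 | 0.0 | Yes |
| 125 | 34 | Male | 1 | 118 | 210 | 192 | 0.7 | Yes |
| 65 | 35 | Male | 0 | 138 | 183 | 182 | 1.4 | Yes |
| 157 | 35 | Female | 1 | 122 | 192 | 174 | 0.0 | Yes |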


<div class="alert alert-block alert-warning">
<b>Alert:</b> By default, sort_values() will sort the values from <b>smallest to largest</b>.
</div>

<div class="alert alert-block alert-warning">
<b>Alert:</b> Remember, when modifying a DataFrame, pandas typically returns a copy so that the original DataFrame is unchanged. To keep the modification, you can either assign what pandas returns back to the original DataFrame (df):

    df = df.sort_values(by="age")

Or you can set the method's "inplace" parameter to True so that the DataFrame itself is changed:
    
    df.sort_values(by="age", inplace=True)
    
</div>
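
The copy behaviour described above can be checked with a short sketch. The DataFrame `toy` and its values are made up for illustration. It also shows that the original index labels travel with their rows, which is why the sorted outputs in this lesson show index values such as 72 and 58 rather than 0 and 1.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"age": [54, 29, 77]})

sorted_copy = toy.sort_values(by="age")   # returns a new, sorted DataFrame

print(list(toy["age"]))           # the original order is unchanged
print(list(sorted_copy["age"]))   # the copy is sorted
print(list(sorted_copy.index))    # index labels stay attached to their rows

toy = toy.sort_values(by="age")   # reassign to keep the sorted result
```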

In [18]:
# Setting the "ascending" parameter of the sort_values() method to False will sort the values in descending order:

df.sort_values(by="age", ascending=False).head()

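| | age | sex | chest_pain | rest_bp | chol | max_hr | st_depr | target |
|---|---|---|---|---|---|---|---|---|
| 238 | 77 | Female | 0 | 125 | 304 | 162 | 0.0 | No |
| 144 | 76 | Male | 2 | 140 | 197 | 116 | 1.1 | Yes |
| 129 | 74 | Male | 1 | 120 | 269 | 121 | 0.2 | Yes |
| 151 | 71 | Male | 0 | 112 | 149 | 125 | 1.6 | Yes |
| 60 | 71 | Male | 2 | 110 | 265 | 130 | 0.0 | Yes |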


<div class="alert alert-block alert-warning">
<b>Alert:</b> To sort the values from <b>largest to smallest</b>, you can set the "ascending" parameter of the sort_values() method to  <b>False</b>.
</div>
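
The "by" and "ascending" parameters also accept lists, which allows a different direction for each sort column. The sketch below is a minimal illustration on made-up data (the name `toy` and its values are assumptions): it sorts by age from largest to smallest and, within the same age, by max_hr from smallest to largest.

```python
import pandas as pd

# Hypothetical toy data with repeated ages
toy = pd.DataFrame({
    "age": [71, 34, 71, 34],
    "max_hr": [125, 174, 162, 192],
})

# age descending, then max_hr ascending within each age
result = toy.sort_values(by=["age", "max_hr"], ascending=[False, True])

print(result)
```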

## ✅ Exercise 4
Sort the DataFrame by max_hr in descending order.

In [20]:
df.sort_values(by="max_hr", ascending=False)

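| | age | sex | chest_pain | rest_bp | chol | max_hr | st_depr | target |
|---|---|---|---|---|---|---|---|---|
| 72 | 29 | Female | 1 | 130 | 204 | 202 | 0.0 | Yes |
| 248 | 54 | Female | 1 | 192 | 283 | 195 | 0.0 | No |
| 103 | 42 | Female | 2 | 120 | 240 | 194 | 0.8 | Yes |
| 125 | 34 | Male | 1 | 118 | 210 | 192 | 0.7 | Yes |
| 62 | 52 | Female | 3 | 118 | 186 | 190 | 0.0 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 136 | 60 | Male | 2 | 120 | 178 | 96 | 0.0 | Yes |
| 262 | 53 | Female | 0 | 123 | 282 | 95 | 2.0 | No |
| 297 | 59 | Female | 0 | 164 | 176 | 90 | 1.0 | No |
| 243 | 57 | Female | 0 | 152 | 274 | 88 | 1.2 | No |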


---

We can also sort by requesting the "nlargest" or "nsmallest" of a given column:

### Get the top n of a feature
Use the nlargest() method to return the n rows with the largest values in a given column, sorted from largest to smallest. Pass in the value of n (the number of rows to return) and use the "columns" parameter to indicate which column to sort by:




In [21]:
df[["age", "sex", "max_hr"]].nlargest(10, columns="age")

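| | age | sex | max_hr |
|---|---|---|---|
| 238 | 77 | Female | 162 |
| 144 | 76 | Male | 116 |
| 129 | 74 | Male | 121 |
| 25 | 71 | Male | 162 |
| 60 | 71 | Male | 130 |
| 151 | 71 | Male | 125 |
| 145 | 70 | Female | 143 |
| 225 | 70 | Female | 125 |
| 234 | 70 | Female | 109 |
| 240 | 70 | Female | 112 |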


### Get the bottom n of a feature
Use the nsmallest() method to return the n rows with the smallest values in a given column, sorted from smallest to largest. Pass in the value of n (the number of rows to return) and use the "columns" parameter to indicate which column to sort by:

In [22]:
df[["age", "sex", "max_hr"]].nsmallest(10, columns=["max_hr"])

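| | age | sex | max_hr |
|---|---|---|---|
| 272 | 67 | Female | 71 |
| 243 | 57 | Female | 88 |
| 297 | 59 | Female | 90 |
| 262 | 53 | Female | 95 |
| 136 | 60 | Male | 96 |
| 233 | 64 | Female | 96 |
| 216 | 62 | Male | 97 |
| 198 | 62 | Female | 99 |
| 226 | 62 | Female | 103 |
| 269 | 56 | Female | 103 |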


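You can sort by multiple columns by passing in a list of the desired columns to sort by, in order. Ties in the first column are broken by the next column, which is why the last two rows below swap places compared with the previous output: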

In [23]:
df[["age", "sex", "max_hr"]].nsmallest(10, columns=["max_hr", "age"])

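| | age | sex | max_hr |
|---|---|---|---|
| 272 | 67 | Female | 71 |
| 243 | 57 | Female | 88 |
| 297 | 59 | Female | 90 |
| 262 | 53 | Female | 95 |
| 136 | 60 | Male | 96 |
| 233 | 64 | Female | 96 |
| 216 | 62 | Male | 97 |
| 198 | 62 | Female | 99 |
| 269 | 56 | Female | 103 |
| 226 | 62 | Female | 103 |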
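
The tie-breaking behaviour can be reproduced on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration). The sketch compares nsmallest() with one and with two sort columns, and checks that the two-column call selects the same rows as sorting by both columns and taking the first n rows with head().

```python
import pandas as pd

# Hypothetical toy data with ties in max_hr
toy = pd.DataFrame({
    "age": [60, 64, 62, 56],
    "max_hr": [96, 96, 103, 103],
})

# One column: among tied rows, the one that appears first is kept
by_hr = toy.nsmallest(3, columns="max_hr")

# Two columns: ties in max_hr are broken by the smaller age
by_hr_then_age = toy.nsmallest(3, columns=["max_hr", "age"])

# The same rows, obtained by sorting and taking the first three
via_sort = toy.sort_values(by=["max_hr", "age"]).head(3)

print(list(by_hr.index))
print(list(by_hr_then_age.index))
```

For small n on a large DataFrame, nlargest() and nsmallest() express the intent more directly than sorting everything and slicing.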


---

# 🌟 Ready for the next one?
<br>



- [**Lesson 6: Data Transformation**](Lesson_6_Data_Transformation.ipynb)

- [**Lesson 7: Data Analysis**](Lesson_7_Data_Analysis.ipynb)

---

# Contributions & acknowledgment

Thanks to Antony Ross for contributing the content for this notebook.

---

Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.