<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Forecasting Deaths from Heart Failure Based on Medical Measurements

# Lab 1. Dataset investigation

Estimated time needed: **10** minutes

<div>
In this lab, you will learn how to use Python libraries to import data from different sources. You will then use Pandas to perform some basic tasks to begin exploring and analyzing an imported heart failure prediction dataset.
</div>

## Objectives

After completing this lab you will be able to:

*   Acquire data in various ways
*   Obtain insights from data with the Pandas library


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#data_acquisition">Data Acquisition</a></li>
    <li><a href="#basic_insight">Basic Insight of Dataset</a></li>
</ol>

</div>
<hr>


<h1 id="data_acquisition">Data Acquisition</h1>
<p>
Datasets come in various formats: .csv, .json, .xlsx, etc. A dataset can be stored in different places, either on your local machine or online.<br>

In this section, you will learn how to load a dataset into our Jupyter Notebook.<br>

In our case, the Heart Failure Dataset comes from an online source and is in CSV (comma-separated values) format. Let's use this dataset as an example to practice reading data.

    
The main goal of this investigation is to create a model for predicting mortality caused by heart failure. The dataset contains the complete history of heart patients. The data was collected at the Institute of Cardiology, a hospital in Faisalabad, Pakistan.
    
    
<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/asgharalikhan/mortality-rate-heart-patient-pakistan-hospital?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0RS9EN2296-2022-01-01" target="_blank">https://www.kaggle.com/datasets/asgharalikhan/mortality-rate-heart-patient-pakistan-hospital</a></li>
    <li>Data type: csv</li>
</ul>
The Pandas library is a useful tool that enables us to read various datasets into a dataframe. Our Jupyter notebook platforms have a built-in <b>Pandas Library</b>, so all we need to do is import Pandas without installing it.
</p>


In [None]:
# install the specific versions of the libraries used in this lab
#! mamba install pandas==1.3.3  -y
#! mamba install numpy=1.21.2 -y

In [None]:
# import the pandas and numpy libraries
import pandas as pd
import numpy as np

<h2>Read Data</h2>
<p>
We use the <code>pandas.read_csv()</code> function to read the CSV file. Inside the parentheses, we pass the file path in quotation marks so that Pandas reads the file at that address into a dataframe. The file path can be either a URL or a local file path.<br>

You can also assign the dataset to any variable you create.

</p>
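
As a minimal sketch of the local-file case, the snippet below writes a tiny CSV to a temporary directory and reads it back with `pd.read_csv()`. The file name `sample.csv` and its two columns are made up for illustration and are not part of the heart failure dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-column CSV, used only to illustrate reading a local file
csv_text = "Age,Gender\n45,Male\n51,Female\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = Path(tmp_dir) / "sample.csv"
    local_path.write_text(csv_text)

    # read_csv accepts a local path in the same way it accepts a URL
    sample_df = pd.read_csv(local_path)

print(sample_df)
```
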


In [None]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0RS9EN/heart_failure_data.csv"

In [None]:
# Read the online file from the URL provided above, and assign it to the variable "df"

df = pd.read_csv(path)

After reading the dataset, we can use the <code>dataframe.head(n)</code> method to check the top n rows of the dataframe, where n is an integer. Conversely, <code>dataframe.tail(n)</code> shows the bottom n rows of the dataframe.


In [None]:
# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe") 
df.head(5)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #1: </h1>
<b>Check the bottom 10 rows of data frame "df".</b>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
print("The last 10 rows of the dataframe\n")
df.tail(10)
```

</details>

<h3>Delete unnecessary columns</h3>
<p>
This dataset has a large number of columns, some of which are unnecessary or hard to interpret. Let's remove them to simplify our future work with this dataset.
</p>
<p>
It can be done in three ways:
</p>
<p>
DataFrame has a method called <code>drop()</code> that removes rows or columns according to the specified label names and the corresponding axis.
</p>
<p>
<code>del</code> is also an option: you can delete a column with <code>del df['column name']</code>. Python maps this operation to <code>df.__delitem__('column name')</code>, which is an internal method of DataFrame.
</p>
<p>
The <code>pop()</code> method also removes the column. Unlike the other two approaches, it returns the removed column.
</p>
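
The three approaches can be compared on a toy DataFrame. This is only a sketch: the column names `a` to `d` are invented for illustration and do not come from the heart failure dataset.

```python
import pandas as pd

# Toy DataFrame with made-up column names, for illustration only
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# 1) drop() returns a new DataFrame and leaves the original untouched
without_a = toy.drop(columns=["a"])

# 2) del removes the column from the DataFrame in place
del toy["b"]

# 3) pop() removes the column in place and returns it as a Series
popped_c = toy.pop("c")

print(list(without_a.columns))
print(list(toy.columns))
print(popped_c.tolist())
```

Note that `drop()` only takes effect on the original variable if its result is assigned back, whereas `del` and `pop()` modify the DataFrame directly.
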
<p>
Let's try the first one.
</p>


In [None]:
columns = ["F.History", "Family.History", "B.Urea", "S.Cr", "S.Sodium", "S.Potassium", "S.Chloride"]
df = df.drop(columns = columns)
df

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #2: </h1>
<b>Delete a column with the <code>del</code> statement.</b>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
del df['ca']
```

</details>


<h3>Change Headers</h3>
<p>
Take a look at our dataset. Pandas automatically used the first row of the .csv file as the header.
</p>
<p>
To better describe our data, we can introduce a header. This information is available at:  <a href="https://www.kaggle.com/datasets/asgharalikhan/mortality-rate-heart-patient-pakistan-hospital?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0RS9EN2296-2022-01-01" target="_blank">https://www.kaggle.com/datasets/asgharalikhan/mortality-rate-heart-patient-pakistan-hospital</a>.
</p>
<p>
Thus, we can change headers of this dataset manually to make them look proper.
</p>
<p>
First, we create a list "headers" that includes all column names in order.
Then, we use <code>dataframe.columns = headers</code> to replace the headers with the list we created.
</p>
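
One detail worth keeping in mind is that the list must contain exactly one name per column currently in the dataframe. The sketch below illustrates this on a toy DataFrame whose column names (`x1`, `x2`, `x3`) are invented for the example. It also shows `rename()`, which is an alternative when only some columns need new names.

```python
import pandas as pd

# Toy DataFrame with made-up short column names, for illustration only
toy = pd.DataFrame({"x1": [63, 41], "x2": ["M", "F"], "x3": [120.5, 98.0]})

# The list must contain exactly one name per column, in column order
toy_headers = ["Age", "Gender", "BP"]
toy.columns = toy_headers

# A list of the wrong length is rejected with a ValueError
try:
    toy.columns = ["Age", "Gender"]
except ValueError as err:
    print("Length mismatch:", err)

# rename() changes only the columns named in the mapping
renamed = toy.rename(columns={"BP": "Blood Pressure"})
print(list(renamed.columns))
```
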


In [None]:
headers = ["Age", "Age Group", "Gender", "Locality", "Marital Status", "Lifestyle", "Sleep", "Category", "Depression",
            "Hyperlipidemia", "Smoking", "Diabetes", "HTN", "Allergies", "BP", "Thrombolysis", "BGR", "CPK", "CK-MB",
            "ESR", "WBC", "RBC", "Hemoglobin", "PCV", "MCV", "MCH", "MCHC", "PlateletCount", "Neutrophil",
            "Lymphocyte", "Monocyte", "Eosinophil", "Others", "CO", "Diagnosis", "Hypersensitivity", "Chest pain type", 
            "Resting BP", "Serum cholesterol", "FBS", "Resting electrocardiographic", "Max heart rate", "Angina",
            "ST depression", "Slope", "Vessels num", "Thal", "Num", "Streptokinase", "SK React", "Reaction",
            "Mortality", "Follow up"]
print("headers\n", headers)

We replace headers and recheck our dataframe:


In [None]:
df.columns = headers
df.head(10)

Now, we have successfully read the raw dataset and added the correct headers into the dataframe.


 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #3: </h1>
<b>Find the names of the columns of the dataframe.</b>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
print(df.columns)
```

</details>


<h2>Save Dataset</h2>
<p>
Correspondingly, Pandas enables us to save the dataset to CSV. With the <code>dataframe.to_csv()</code> method, you pass the file path and name, in quotation marks, inside the parentheses.
</p>
<p>
For example, to save the dataframe <b>df</b> as <b>heart_failure.csv</b> on your local machine, you can call <code>df.to_csv("heart_failure.csv", index=False)</code>, where <code>index=False</code> means the row names will not be written.
</p>
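
A minimal, self-contained sketch of this call follows. It uses a small stand-in DataFrame with made-up values and writes to a temporary directory so that it can run on its own. In the notebook, you would call `to_csv()` on the `df` loaded earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the notebook's df (made-up values, for illustration only)
sample_df = pd.DataFrame({"Age": [45, 51], "Gender": ["Male", "Female"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "heart_failure.csv"

    # index=False leaves the row labels out of the file
    sample_df.to_csv(out_path, index=False)

    # Read the raw text back to see exactly what was written
    saved_text = out_path.read_text()

print(saved_text)
```
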


We can also read and save other file formats. We can use similar functions like **`pd.read_csv()`** and **`df.to_csv()`** for other data formats. The functions are listed in the following table:


<h2>Read/Save Other Data Formats</h2>

| Data Format  |        Read       |            Save |
| ------------ | :---------------: | --------------: |
| csv          |  `pd.read_csv()`  |   `df.to_csv()` |
| json         |  `pd.read_json()` |  `df.to_json()` |
| excel        | `pd.read_excel()` | `df.to_excel()` |
| hdf          |  `pd.read_hdf()`  |   `df.to_hdf()` |
| sql          |  `pd.read_sql()`  |   `df.to_sql()` |
| ...          |        ...        |             ... |
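
As one example from the table, the following sketch round-trips a small DataFrame through JSON. The sample values and the choice of `orient="records"` are assumptions made for illustration. The JSON text is kept in memory rather than written to a file.

```python
from io import StringIO

import pandas as pd

# Made-up sample rows, for illustration only
sample_df = pd.DataFrame({"Age": [45, 51], "Gender": ["Male", "Female"]})

# With no path argument, to_json() returns the JSON text as a string
json_text = sample_df.to_json(orient="records")

# read_json() accepts a path, a URL, or a file-like object
restored = pd.read_json(StringIO(json_text), orient="records")

print(json_text)
print(restored)
```
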


<h1 id="basic_insight">Basic Insight of Dataset</h1>
<p>
After reading data into Pandas dataframe, it is time for us to explore the dataset.<br>

There are several ways to obtain essential insights into the data and better understand our dataset.

</p>


<h2>Data Types</h2>
<p>
Data has a variety of types.<br>

The main types stored in Pandas dataframes are <b>object</b>, <b>float</b>, <b>int</b>, <b>bool</b> and <b>datetime64</b>. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:

</p>


In [None]:
df.dtypes


A series with the data type of each column is returned.


In [None]:
# check the data type of data frame "df" by .dtypes
print(df.dtypes)

<p>
As shown above, the data type of "Age" and "Diabetes" is <code>int64</code>, "Gender" and "Locality" are <code>object</code>, and "BP", "RBC" and "Hemoglobin" are <code>float64</code>, etc.
</p>
<p>
These data types can be changed. For example, we have some fields like "Lifestyle", "Sleep", "Depression", which are <code>object</code> instead of <code>bool</code>. We will learn how to fix it in a later module.
</p>
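
As a brief preview, such a conversion could look like the sketch below. It assumes, hypothetically, that the column stores the strings "YES" and "NO". The actual values in the dataset should be checked first, for example with `df["Depression"].unique()`.

```python
import pandas as pd

# Made-up column; the strings "YES"/"NO" are an assumption for illustration
toy = pd.DataFrame({"Depression": ["YES", "NO", "YES"]})
print(toy.dtypes)

# Map each string to a boolean, then make the dtype explicit
toy["Depression"] = toy["Depression"].map({"YES": True, "NO": False}).astype(bool)
print(toy.dtypes)
```

Values that are missing from the mapping become `NaN`, which `astype(bool)` would turn into `True`, so the mapping should cover every value that occurs in the column.
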


<h2>Describe</h2>
If we would like to get a statistical summary of each column (e.g. count, mean, standard deviation), we use the <code>dataframe.describe()</code> method.


This method will provide various summary statistics, excluding <code>NaN</code> (Not a Number) values.
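
The effect of excluding `NaN` can be seen in a small sketch with made-up values, where one of three entries is missing:

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry, for illustration only
toy = pd.DataFrame({"BP": [100.0, np.nan, 140.0]})

summary = toy.describe()
print(summary)

# count reflects the two non-missing values; the mean is computed from them
print(summary.loc["count", "BP"])
print(summary.loc["mean", "BP"])
```
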


In [None]:
df.describe()

<p>
This shows the statistical summary of all numeric-typed (int, float) columns.<br>

For example, the attribute "BP" (blood pressure) has a count of 368, a mean of 121.21, a standard deviation of 24.54, a minimum of 80.5, a 25th percentile of 100.7, a 50th percentile of 120.8, a 75th percentile of 140.7, and a maximum of 109.11. <br>

However, what if we would also like to check all the columns including those that are of type object? <br><br>

You can add the argument <code>include = "all"</code> inside the parentheses. Let's try it again.

</p>


In [None]:
# describe all the columns in "df" 
df.describe(include = "all")

<p>
Now it provides the statistical summary of all the columns, including object-typed attributes.<br>

We can now see how many unique values there are, which value is the most frequent (top), and how often it occurs (freq) in the object-typed columns.<br>

Some values in the table above show as "NaN". This is because that statistic is not defined for the column's type. For example, a mean cannot be computed for an object-typed column.<br>

Let's see what values are available for object-typed columns:

</p>


In [None]:
# describe only the object-typed columns in "df"
df.describe(include = "object")

<p>For object-typed columns, the summary reports the count, unique, top, and freq values.</p>
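
To make these four values concrete, here is a small sketch on a made-up column. The values are invented, and the `object` dtype is set explicitly so that the column is selected by `include="object"`.

```python
import pandas as pd

# Made-up categorical column stored with the object dtype, for illustration only
toy = pd.DataFrame({"Gender": pd.Series(["Male", "Female", "Male"], dtype=object)})

summary = toy.describe(include="object")
print(summary)
```

Here `count` is 3 (the non-missing entries), `unique` is 2 (the distinct values), `top` is "Male" (the most frequent value), and `freq` is 2 (how often the top value occurs).
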


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #4: </h1>

<p>
You can select the columns of a dataframe by indicating the name of each column. For example, you can select the three columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']]</code>
</p>
<p>
Here, each "column" is the name of a column. You can then apply the ".describe()" method to get the statistics of those columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']].describe()</code>
</p>

Apply the ".describe()" method to the columns 'Age' and 'BP'.

</div>


In [None]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['Age', 'BP']].describe()
```

</details>


<h2>Info</h2>
Another method you can use to check your dataset is <code>dataframe.info()</code>.


It provides a concise summary of your DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.


In [None]:
# look at the info of "df"
df.info()

<h1>Excellent! You have just completed the  Introduction Notebook!</h1>


### Thank you for completing this lab!

## Author

<a href="https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01" target="_blank">Joseph Santarcangelo</a>

### Other Contributors

<a href="https://www.linkedin.com/in/mahdi-noorian-58219234/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01" target="_blank">Mahdi Noorian PhD</a>

Bahare Talayian

Eric Xiao

Steven Dong

Parizad

Hima Vasudevan

<a href="https://www.linkedin.com/in/fiorellawever/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01" target="_blank">Fiorella Wever</a>

<a href="https://www.linkedin.com/in/yi-leng-yao-84451275/" target="_blank">Yi Yao</a>



<hr>

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>
