In [1]:
import pandas as pd

### Extract and Read Data With Pandas
Before data can be analyzed, it must be imported/extracted.

In the example below, we show you how to import data using Pandas in Python.

We use the read_csv() function to import a CSV file with the health data:

- Import the Pandas library

- Name the DataFrame health_data

- header=0 means that the column names are taken from the first row (Python counts from 0, so 0 refers to the first row)

- sep="," means that "," is used as the separator between the values. This is because we are using the file type .csv (comma separated values)

In [2]:
health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data)

    Duration Average_Pulse Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
0       30.0            80       120            240.0        10.0          7.0
1       45.0            85       120            250.0        10.0          7.0
2       45.0            90       130            260.0         8.0          7.0
3       60.0            95       130            270.0         8.0          7.0
4       60.0           100       140            280.0         0.0          7.0
5        NaN           NaN       NaN              NaN         NaN          NaN
6       60.0           105       140            290.0         7.0          8.0
7       60.0           110       145            300.0         7.0          8.0
8       45.0           NaN        AF              NaN         8.0          8.0
9       45.0           115       145            310.0         8.0          8.0
10      60.0           120       150            320.0         0.0          8.0
11      60.0         9 000       130              Na

In [3]:
health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data.head())

   Duration Average_Pulse Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
0      30.0            80       120            240.0        10.0          7.0
1      45.0            85       120            250.0        10.0          7.0
2      45.0            90       130            260.0         8.0          7.0
3      60.0            95       130            270.0         8.0          7.0
4      60.0           100       140            280.0         0.0          7.0
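
The file data.csv is not reproduced here, so the following is a minimal sketch that reads the same kind of data from an in-memory string instead. The sample rows and the names csv_text, sample and first_rows are assumptions made for illustration; only read_csv(), header, sep and head() come from the examples above.

```python
import io

import pandas as pd

# Hypothetical sample with the same column names as the health data.
# The empty field in the last row becomes NaN when the data is loaded.
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
30,80,120,240,10,7
45,85,120,250,10,7
45,90,130,260,8,7
60,95,130,,8,7
"""

# header=0: column names are taken from the first line.
# sep=",": fields are separated by commas.
sample = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# head(n) returns the first n rows (5 when n is omitted).
first_rows = sample.head(2)
print(first_rows)

# Count the missing values per column.
print(sample.isna().sum())
```

Reading from io.StringIO behaves like reading from a file path, so the same call should work with "data.csv" once the file is available.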


### Remove Blank Rows

We see that the non-numeric values (9 000 and AF) are in the same rows as the missing values.

Solution: We can remove the rows with missing observations to fix this problem.

When we load a data set using Pandas, all blank cells are automatically converted into "NaN" values.

So, removing the rows that contain NaN values gives us a clean data set that can be analyzed.

We can use the dropna() method to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value. inplace=True changes the existing DataFrame instead of returning a new one.

In [4]:
health_data = pd.read_csv("data.csv", header=0, sep=",")

health_data.dropna(axis=0, inplace=True)

print(health_data)

    Duration Average_Pulse Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
0       30.0            80       120            240.0        10.0          7.0
1       45.0            85       120            250.0        10.0          7.0
2       45.0            90       130            260.0         8.0          7.0
3       60.0            95       130            270.0         8.0          7.0
4       60.0           100       140            280.0         0.0          7.0
6       60.0           105       140            290.0         7.0          8.0
7       60.0           110       145            300.0         7.0          8.0
9       45.0           115       145            310.0         8.0          8.0
10      60.0           120       150            320.0         0.0          8.0
12      45.0           125       150            330.0         8.0          8.0
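
As a hedged illustration of how dropna() behaves, the sketch below builds a small hypothetical DataFrame (the values and the name df are assumptions, not the contents of data.csv). It also shows the how and subset parameters, and the form without inplace=True, which returns a new DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely blank, row 2 is missing one value.
df = pd.DataFrame({
    "Duration": [30.0, np.nan, 45.0, 60.0],
    "Calorie_Burnage": [240.0, np.nan, np.nan, 270.0],
})

# Drop every row that contains at least one NaN (the default, how="any").
any_dropped = df.dropna(axis=0)

# Drop only the rows in which all values are NaN.
all_dropped = df.dropna(axis=0, how="all")

# Drop rows only when a specific column is missing.
subset_dropped = df.dropna(axis=0, subset=["Duration"])

print(any_dropped.index.tolist())     # rows 0 and 3 remain
print(all_dropped.index.tolist())     # rows 0, 2 and 3 remain
print(subset_dropped.index.tolist())  # rows 0, 2 and 3 remain

# The original row labels are kept; reset_index renumbers them from 0.
renumbered = any_dropped.reset_index(drop=True)
print(renumbered.index.tolist())
```

The kept row labels explain why the index in the output above jumps from 4 to 6 after the blank row is removed.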


### Data Categories

To analyze data, we also need to know the types of data we are dealing with.

Data can be split into two main categories:

- **1- Quantitative Data** - Can be expressed as a number or can be quantified. Can be divided into two sub-categories:
    - Discrete data: Numbers are counted as "whole", e.g. number of students in a class, number of goals in a soccer game

    - Continuous data: Numbers can take any value within a range and can be measured with arbitrary precision, e.g. weight of a person, temperature


- **2- Qualitative Data** - Cannot be expressed as a number and cannot be quantified. Can be divided into two sub-categories:
    - Nominal data: Categories with no natural order, e.g. gender, hair color, ethnicity
    - Ordinal data: Categories with a natural order, e.g. school grades (A, B, C), economic status (low, middle, high)
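
One way to carry these categories into Pandas, shown here as a sketch with made-up values, is to store qualitative data as a categorical type. The DataFrame people and its column names are hypothetical. pd.Categorical with ordered=True is used for the ordinal column so that comparisons respect the stated order.

```python
import pandas as pd

# Hypothetical example data covering the four sub-categories.
people = pd.DataFrame({
    "goals": [0, 2, 1],                       # discrete (counted)
    "weight_kg": [70.5, 82.3, 64.0],          # continuous (measured)
    "hair_color": ["brown", "black", "red"],  # nominal (no order)
    "status": ["low", "high", "middle"],      # ordinal (ordered)
})

# Nominal data: categories without any ordering.
people["hair_color"] = people["hair_color"].astype("category")

# Ordinal data: categories with an explicit order.
people["status"] = pd.Categorical(
    people["status"], categories=["low", "middle", "high"], ordered=True
)

print(people.dtypes)

# Ordered categories support comparisons and min/max.
print(people["status"].max())
print((people["status"] > "low").tolist())
```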

### Data Types

The info() method lists the data type of each column in the data set, together with the number of non-null values:

In [5]:
health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Duration         12 non-null     float64
 1   Average_Pulse    11 non-null     object 
 2   Max_Pulse        12 non-null     object 
 3   Calorie_Burnage  10 non-null     float64
 4   Hours_Work       11 non-null     float64
 5   Hours_Sleep      12 non-null     float64
dtypes: float64(4), object(2)
memory usage: 756.0+ bytes
None

Average_Pulse and Max_Pulse have the type object because they contain non-numeric values. Once the rows with missing values are removed, astype() can convert both columns to float64:


In [7]:
health_data = pd.read_csv("data.csv", header=0, sep=",")
health_data.dropna(axis=0, inplace=True)

health_data["Average_Pulse"] = health_data['Average_Pulse'].astype(float)
health_data["Max_Pulse"] = health_data['Max_Pulse'].astype(float)

print(health_data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 0 to 12
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Duration         10 non-null     float64
 1   Average_Pulse    10 non-null     float64
 2   Max_Pulse        10 non-null     float64
 3   Calorie_Burnage  10 non-null     float64
 4   Hours_Work       10 non-null     float64
 5   Hours_Sleep      10 non-null     float64
dtypes: float64(6)
memory usage: 560.0 bytes
None
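
The conversion with astype(float) would raise an error if values such as "AF" or "9 000" were still present in the column. A possible alternative, sketched here with hypothetical values, is pd.to_numeric() with errors="coerce", which turns anything it cannot parse into NaN so that it can be dropped afterwards. This is a swapped-in technique, not the method used in the cell above.

```python
import pandas as pd

# Hypothetical column mixing valid numbers with the kinds of bad entries
# mentioned in the text ("AF" and "9 000").
pulse = pd.Series(["120", "130", "AF", "9 000", "145"], dtype="object")

# errors="coerce" replaces values that cannot be parsed with NaN
# instead of raising an error.
pulse_numeric = pd.to_numeric(pulse, errors="coerce")

print(pulse_numeric.dtype)
print(pulse_numeric.isna().tolist())

# The remaining NaN values can then be dropped as before.
clean_pulse = pulse_numeric.dropna()
print(clean_pulse.tolist())
```

Coercing hides the bad entries rather than repairing them, so it is worth checking how many values became NaN before dropping them.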


### Analyze the Data

When we have cleaned the data set, we can start analyzing the data.

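We can use the describe() method in Pandas to summarize the data. By default it only includes numeric columns. The example below reloads the raw file, so Average_Pulse and Max_Pulse are still of type object and do not appear in the summary: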

In [9]:
health_data = pd.read_csv("data.csv", header=0, sep=",")

pd.set_option('display.max_columns', None)

print(health_data.describe())

        Duration  Calorie_Burnage  Hours_Work  Hours_Sleep
count  12.000000        10.000000   11.000000    12.000000
mean   51.250000       285.000000    6.727273     7.583333
std    10.028369        30.276504    3.466725     0.514929
min    30.000000       240.000000    0.000000     7.000000
25%    45.000000       262.500000    7.000000     7.000000
50%    52.500000       285.000000    8.000000     8.000000
75%    60.000000       307.500000    8.000000     8.000000
max    60.000000       330.000000   10.000000     8.000000


- Count - The number of non-missing observations
- Mean - The average value
- Std - Standard deviation (explained in the statistics chapter)
- Min - The lowest value
- 25%, 50% and 75% are percentiles (explained in the statistics chapter)
- Max - The highest value
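
The sketch below uses a small hypothetical DataFrame (not data.csv) to show how describe() treats a column before and after it is converted to a numeric type, and how single statistics can be read from the result. The names df, before and after are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: Max_Pulse is stored as text, Duration as numbers.
df = pd.DataFrame({
    "Duration": [30.0, 45.0, 60.0, 45.0],
    "Max_Pulse": ["120", "130", "140", "150"],
})

# Only Duration appears: non-numeric columns are left out by default.
before = df.describe()

# After the conversion, both columns are summarized.
df["Max_Pulse"] = df["Max_Pulse"].astype(float)
after = df.describe()

print(before.columns.tolist())
print(after.columns.tolist())

# The result is itself a DataFrame, so single values can be looked up.
print(after.loc["mean", "Duration"])  # (30 + 45 + 60 + 45) / 4 = 45.0
print(after.loc["50%", "Max_Pulse"])  # median of 120, 130, 140, 150 = 135.0
```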