# Cleaning from start
In this notebook:
* a file already ordered in the previous notebook is inspected and cleaned

In [1]:
# import libraries
import pandas as pd
from datetime import datetime
import numpy as np


In [2]:
# open the file ordered in the previous notebook
# and load it into a dataframe to take a look
df = pd.read_csv('ordered.csv', header='infer')
df


Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
...,...,...,...,...,...
164,60,,105,140,290.8
165,60,,110,145,300.4
166,60,,115,145,310.2
167,75,,120,150,320.4


In [3]:
# at first glance the Date column has problems...
# but for now continue analyzing the dataframe information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Date      31 non-null     object 
 2   Pulse     169 non-null    int64  
 3   Maxpulse  169 non-null    int64  
 4   Calories  164 non-null    float64
dtypes: float64(1), int64(3), object(1)
memory usage: 6.7+ KB


### 1) Check if there are empty rows or columns

In [4]:
# df.info() already suggests that there are no empty rows or columns:
# it reports 169 entries, three columns are complete (169 non-null values)
# and no column has zero non-null values
# isnull() lets us confirm this explicitly

# flag the rows in which every value is null
empty_rows = df.isnull().all(axis=1)
print(empty_rows)
# .any() is True if at least one row is flagged
if empty_rows.any():
    print("Yes, there are empty rows")
else:
    print("There are no empty rows")

# flag the columns in which every value is null
empty_cols = df.isnull().all(axis=0)
print(empty_cols)
if empty_cols.any():
    print("Yes, there are empty columns")
else:
    print("There are no empty columns")

0      False
1      False
2      False
3      False
4      False
       ...  
164    False
165    False
166    False
167    False
168    False
Length: 169, dtype: bool
There are no empty rows
Duration    False
Date        False
Pulse       False
Maxpulse    False
Calories    False
dtype: bool
There are no empty columns
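
A small aside on the truth test used for these flags: on a pandas Series the `in` operator looks at the index labels, while `.any()` looks at the values. The sketch below uses a made-up Series (`flags`, with hypothetical string labels) to show the difference; it is an illustration, not part of the cleaning itself.

```python
import pandas as pd

# hypothetical boolean flags, labelled with strings instead of row numbers
flags = pd.Series([False, True, False], index=["a", "b", "c"])

label_hit = True in flags        # searches the index labels "a", "b", "c"
value_hit = bool(flags.any())    # searches the values

print(label_hit, value_hit)
```

For a check on the values, `.any()` (or `.sum() > 0`) is the safer choice.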


In [5]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# look at the number of missing points (the slice keeps at most the first ten columns; here there are only five)
missing_values_count[0:10]

Duration      0
Date        138
Pulse         0
Maxpulse      0
Calories      5
dtype: int64
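
The counts can also be read as a share of the column: here 138 of 169 Date values are missing, roughly 82%. A minimal sketch of that calculation, assuming a toy frame with made-up values (`toy` and `missing_share` are hypothetical names):

```python
import numpy as np
import pandas as pd

# toy frame (made-up values) with gaps in two columns
toy = pd.DataFrame({
    "Date": ["2020/12/01", None, None, None],
    "Calories": [409.1, np.nan, 340.0, 282.4],
})

# isnull() gives booleans; their mean is the fraction of missing values
missing_share = toy.isnull().mean() * 100
print(missing_share.round(1))
```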

### 2) Check Duplicates

In [6]:
# flag rows that are exact copies of an earlier row
dupl = df.duplicated()
# .any() checks the values of the boolean Series
if dupl.any():
    print("Yes, Duplicated")
else:
    print("There are no duplicated")

There are no duplicated
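
If duplicates had been found, they could be inspected and dropped along these lines. This is only a sketch on a toy frame with made-up values; `toy`, `all_copies` and `deduped` are hypothetical names.

```python
import pandas as pd

# toy frame in which row 2 repeats row 0
toy = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 109, 110],
})

# keep=False marks every member of a duplicated group, not only the repeats
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default
deduped = toy.drop_duplicates()

print(len(all_copies), len(deduped))
```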


Observations:
* The file has been opened inferring column names
    - column names and types are inferred well
    - the Date column should be corrected...

In [7]:
# size: 169 rows and 5 columns
# columns 1 and 4 have null values
# check data types
df.dtypes

Duration      int64
Date         object
Pulse         int64
Maxpulse      int64
Calories    float64
dtype: object

### 3) Completing and fixing the Date column

In [8]:
# convert the Date strings into datetime values
dateserie = pd.to_datetime(df['Date'], format='mixed')
# dateserie
# print(type(dateserie), dateserie.index, dateserie.values)

In [9]:

# here we see that the Date column has a lot of null dates
pd.isna(dateserie).sum()

138

One way to deal with empty values in a date column:
* observe that there is one row per day, one date per row (this assumes the rows are consecutive days with no gaps)
    - build a new, complete column of dates and use it to replace the old, incomplete one
    - to build the new column we need a start date and a number of periods
        - start date = first date
        - periods = length of the column

In [10]:
dstart = dateserie[0]
periods = len(dateserie)

# build span of dates: (pd.date_range(dstart, freq='D', periods=periods))
newdatecol = pd.Series(pd.date_range(dstart, freq='D', periods=periods))
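
Before overwriting the column, it may be worth checking that the generated range agrees with the dates that were already present; a mismatch would suggest the "one row per day" assumption does not hold. A minimal sketch on made-up dates (`known`, `rebuilt` and `mismatches` are hypothetical names):

```python
import pandas as pd

# toy Date column: some values known, some missing (NaT)
known = pd.Series(pd.to_datetime(
    ["2020-12-01", "2020-12-02", None, None, "2020-12-05"]
))

# rebuild the column as one consecutive day per row
rebuilt = pd.Series(pd.date_range(known.iloc[0], freq="D", periods=len(known)))

# compare only the positions where the original date was present
present = known.notna()
mismatches = int((known[present] != rebuilt[present]).sum())
print(mismatches)
```

A count of zero means every known date sits where the rebuilt range puts it.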

In [11]:
df['Date'] = newdatecol
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
...,...,...,...,...,...
164,60,2021-05-14,105,140,290.8
165,60,2021-05-15,110,145,300.4
166,60,2021-05-16,115,145,310.2
167,75,2021-05-17,120,150,320.4


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Duration  169 non-null    int64         
 1   Date      169 non-null    datetime64[ns]
 2   Pulse     169 non-null    int64         
 3   Maxpulse  169 non-null    int64         
 4   Calories  164 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 6.7 KB


### 4) Process NaN or empty values

In [13]:
# df.info() shows that the only remaining null values are in the Calories column
# let's see which rows have null Calories
df[df.Calories.isna()]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
18,45,2020-12-19,90,112,
28,60,2020-12-29,103,132,
91,45,2021-03-02,107,137,
118,60,2021-03-29,105,125,
141,60,2021-04-21,97,127,


In [14]:
# let's choose the mean of calories to replace NaN values
df.Calories.mean().round(1)

376.2

In [15]:
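# Replace the NaN values in Calories with the mean of Calories
# (assign the result back: fillna(inplace=True) on a selected column
# is not guaranteed to update df in recent pandas versions)
df['Calories'] = df['Calories'].fillna(df.Calories.mean().round(1))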

In [16]:
# Check: there should be no more NaN in Calories
df[df.Calories.isna()]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
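
The mean is one possible fill value. When a column has a few very large values, the median is a common alternative because it is less affected by them. The sketch below compares both on made-up numbers; it is an illustration of the trade-off, not a recommendation for this dataset (`toy`, `mean_fill` and `median_fill` are hypothetical names).

```python
import numpy as np
import pandas as pd

# toy Calories column with one gap and one very large value
toy = pd.DataFrame({"Calories": [300.0, np.nan, 320.0, 1800.0]})

# fill with the rounded mean (pulled upwards by the large value)
mean_fill = toy["Calories"].fillna(round(toy["Calories"].mean(), 1))

# fill with the median (the middle of the known values)
median_fill = toy["Calories"].fillna(toy["Calories"].median())

print(mean_fill.iloc[1], median_fill.iloc[1])
```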


In [17]:
# Check df info again
# The dataframe should now be complete and ordered
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Duration  169 non-null    int64         
 1   Date      169 non-null    datetime64[ns]
 2   Pulse     169 non-null    int64         
 3   Maxpulse  169 non-null    int64         
 4   Calories  169 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 6.7 KB


### 5) Replace wrong data

In [18]:
# Let's look at the statistics to see whether anything stands out
df.describe()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
count,169.0,169,169.0,169.0,169.0
mean,66.331361,2021-02-23 00:00:00,107.52071,134.094675,376.226627
min,15.0,2020-12-01 00:00:00,80.0,100.0,50.3
25%,45.0,2021-01-12 00:00:00,100.0,124.0,253.3
50%,60.0,2021-02-23 00:00:00,105.0,131.0,321.0
75%,60.0,2021-04-06 00:00:00,111.0,141.0,384.0
max,450.0,2021-05-18 00:00:00,159.0,184.0,1860.4
std,51.638363,,14.458927,16.398041,262.12674


In [19]:
# max Duration appears to be too high
df[df.Duration == 450]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
7,450,2020-12-08,104,134,253.3


In [20]:
# look at row 7: the mean Duration is about 66 for about 376 calories,
# but this row shows a Duration of 450 for only 253.3 calories,
# which is the 25th percentile of Calories
# the 25th percentile of Duration is 45, so 450 looks like a typo
# use 45 instead of 450
df.loc[7, 'Duration'] = 45
df.head(8)


Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.3
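
Suspicious values can also be flagged with a rule instead of by eye. One common rule of thumb (not the method used in this notebook) marks values more than 1.5 times the interquartile range outside the quartiles. A minimal sketch on made-up durations:

```python
import pandas as pd

# toy durations with one implausible value
durations = pd.Series([60, 60, 45, 45, 60, 450, 30, 60])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# bounds of the 1.5 * IQR rule of thumb
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = durations[(durations < lower) | (durations > upper)]
print(outliers)
```

A flagged value still needs a human decision: it may be a typo, as with Duration 450, or a real but unusual session.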


In [21]:
# max Calories appears to be too high
# but in the end it looks possible for a Duration of 210
df[df.Calories == 1860.4]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
109,210,2021-03-20,137,184,1860.4


### 6) Check correlations

In [22]:
df.corr()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
Duration,1.0,-0.062917,-0.158982,0.006328,0.92123
Date,-0.062917,1.0,0.076277,0.14128,-0.045341
Pulse,-0.158982,0.076277,1.0,0.784804,0.021148
Maxpulse,0.006328,0.14128,0.784804,1.0,0.19966
Calories,0.92123,-0.045341,0.021148,0.19966,1.0
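
By default `corr()` computes the Pearson coefficient for each pair of columns. To make that number less of a black box, the sketch below computes it by hand for one pair and compares it with `Series.corr`, using made-up values (`toy`, `manual` and `builtin` are hypothetical names).

```python
import pandas as pd

# toy workout data (made-up values)
toy = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [210.0, 300.0, 420.0, 600.0],
})

x = toy["Duration"]
y = toy["Calories"]

# Pearson r: co-variation of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
manual = (dx * dy).sum() / ((dx ** 2).sum() ** 0.5 * (dy ** 2).sum() ** 0.5)

# the same coefficient from pandas
builtin = x.corr(y)

print(round(manual, 4), round(builtin, 4))
```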


Perfect Correlation:
"Duration" and "Duration" have a correlation of 1.0, which makes sense: each column always has a perfect relationship with itself.

Good Correlation:
"Duration" and "Calories" have a correlation of 0.92123, which is a strong correlation: the longer the workout, the more calories are likely to be burned, and conversely, a workout that burned a lot of calories was probably a long one.

Bad Correlation:
"Duration" and "Maxpulse" have a correlation of 0.006328, which is a very weak correlation: we cannot predict the max pulse just by looking at the duration of the workout, and vice versa.



Finally, here is the cleaned dataframe!

In [23]:
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
...,...,...,...,...,...
164,60,2021-05-14,105,140,290.8
165,60,2021-05-15,110,145,300.4
166,60,2021-05-16,115,145,310.2
167,75,2021-05-17,120,150,320.4


### 7) Store

In [24]:
# save the work done so far, even if there is more analysis to be done
# the file is now reasonably clean!
df.to_csv("cleaned.csv",index=False)
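
One thing to keep in mind when the file is loaded again: CSV stores dates as text, so the Date column comes back as strings unless it is parsed on read. A minimal sketch of the round trip, assuming a toy frame and an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

# toy frame with a datetime column
toy = pd.DataFrame({
    "Duration": [60, 45],
    "Date": pd.to_datetime(["2020-12-01", "2020-12-02"]),
})

# write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# read back without and with date parsing
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, parsed["Date"].dtype)
```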

Daniel Christello