## Preprocessing notebook

Preprocessing steps:
- Log-transform the target, Calories (using log1p)
- One-hot encode Sex (1 for male, 0 for female) and cast it to the category dtype
- Delete id column
- Change Height, Weight, Duration, Heart_Rate from float to int

To change Height and Heart_Rate from float to int, we first have to round one value in each feature of the training set to the nearest integer.  Together, these changes reduce the file size of the training dataset by 10.5% and of the testing dataset by 42.8%.

Import the packages, then load the training and testing sets.

In [1]:
import numpy as np
import pandas as pd
import os

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.head()

Unnamed: 0,id,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
0,0,male,36,189.0,82.0,26.0,101.0,41.0,150.0
1,1,female,64,163.0,60.0,8.0,85.0,39.7,34.0
2,2,female,51,161.0,64.0,7.0,84.0,39.8,29.0
3,3,male,20,192.0,90.0,25.0,105.0,40.7,140.0
4,4,female,38,166.0,61.0,25.0,102.0,40.6,146.0


One-hot encode Sex (1-male, 0-female), log-transform Calories, and delete id column in training and testing datasets.

In [20]:
# one-hot encode Sex
train_pp = pd.get_dummies(train, drop_first=True, dtype=int)

# drop id column
train_pp.drop(columns='id', inplace=True)

# keep column name as Sex and as category type, then return to original order
train_pp.rename(columns={'Sex_male': 'Sex'}, inplace=True)
train_pp['Sex'] = train_pp['Sex'].astype('category')
train_pp = train_pp[['Sex', 'Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp', 'Calories']]

# log-transform Calories with log1p, i.e. log(1 + x); np.expm1 inverts it
train_pp['Calories'] = np.log1p(train_pp['Calories'])

# repeat for testing set
test_pp = pd.get_dummies(test, drop_first=True, dtype=int)
test_pp.rename(columns={'Sex_male': 'Sex'}, inplace=True)
test_pp['Sex'] = test_pp['Sex'].astype('category')
test_pp.drop(columns='id', inplace=True)
test_pp = test_pp[['Sex', 'Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']]

train_pp.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
0,1,36,189.0,82.0,26.0,101.0,41.0,5.01728
1,0,64,163.0,60.0,8.0,85.0,39.7,3.555348
2,0,51,161.0,64.0,7.0,84.0,39.8,3.401197
3,1,20,192.0,90.0,25.0,105.0,40.7,4.94876
4,0,38,166.0,61.0,25.0,102.0,40.6,4.990433


Which columns are purely integers?

In [21]:
num_feats = ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']

nonint_feat = []
nonint_feat_test = []


# check training set
for feature in num_feats:
    if not np.array_equal(train_pp[feature], train_pp[feature].astype(int)):
        nonint_feat.append(feature)

print("The non-integer columns in the training set are:", nonint_feat)



# check testing set
for feature in num_feats:
    if not np.array_equal(test_pp[feature], test_pp[feature].astype(int)):
        nonint_feat_test.append(feature)

print("The non-integer columns in the testing set are:", nonint_feat_test)

The non-integer columns in the training set are: ['Height', 'Heart_Rate', 'Body_Temp']
The non-integer columns in the testing set are: ['Body_Temp']


By inspection, Height and Heart_Rate appeared to contain only integers in the training set, and both are purely integer-valued in the testing set.  How many non-integer values do these columns actually contain in the training dataset?

In [22]:
for feature in nonint_feat:
    count = 0
    for x in train_pp[feature].tolist():
        if not x.is_integer():
            count += 1

    print(f'{feature} has {count} non-integer values.')


Height has 1 non-integer values.
Heart_Rate has 1 non-integer values.
Body_Temp has 692127 non-integer values.


Only one each in Height and Heart_Rate!  Find these entries; they can be rounded to the nearest integer.

In [23]:
# Find the non-integers in Height, Heart_Rate.
for x in train_pp['Height'].tolist():
    if not x.is_integer():
        print("Height:", x)


for x in train_pp['Heart_Rate'].tolist():
    if not x.is_integer():
        print("Heart_Rate:", x)

Height: 154.1
Heart_Rate: 109.99323
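
The counting and searching loops above can also be written as vectorized pandas operations, which tends to be faster on a large frame and returns the row index of each offending value. This is a sketch on a hypothetical toy frame (`toy` is not the real training data); only the two fractional values are taken from the output above.

```python
import pandas as pd

# Hypothetical toy frame; 154.1 and 109.99323 mirror the values found above.
toy = pd.DataFrame({
    "Height": [189.0, 154.1, 161.0],
    "Heart_Rate": [101.0, 85.0, 109.99323],
    "Body_Temp": [41.0, 39.7, 39.8],
})

# A float is integer-valued exactly when its remainder modulo 1 is zero.
is_nonint = toy % 1 != 0

# Column-wise count of non-integer values.
nonint_counts = is_nonint.sum()
print(nonint_counts)

# Boolean indexing keeps the row labels, so the entries are easy to locate.
odd_heights = toy.loc[is_nonint["Height"], "Height"]
print(odd_heights)
```

Note that a missing value (NaN) would also be flagged by this mask, so it assumes the columns contain no NaNs.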


In [24]:
# Round the non-integer Height and Heart_Rate values to the nearest integer, and verify that Body_Temp is the only non-integer feature.

train_pp['Height'] = train_pp['Height'].round()
train_pp['Heart_Rate'] = train_pp['Heart_Rate'].round()

for feature in ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']:
    print(f'{feature}:', np.array_equal(train_pp[feature], train_pp[feature].astype(int)))

Age: True
Height: True
Weight: True
Duration: True
Heart_Rate: True
Body_Temp: False
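
Rounding before the integer cast matters because `astype(int)` on a float column truncates toward zero instead of rounding. The sketch below, which is illustrative and not part of the original notebook, compares the two on the values found above; the extra `.5` values are added only to show that `Series.round()` rounds exact halves to the nearest even integer.

```python
import pandas as pd

# The first two values come from the training set; the rest are illustrative.
values = pd.Series([154.1, 109.99323, 0.5, 1.5, 2.5])

# round() goes to the nearest integer, with exact halves going to the even neighbor.
rounded = values.round()

# astype(int) simply drops the fractional part.
truncated = values.astype(int)

print(rounded.tolist())    # [154.0, 110.0, 0.0, 2.0, 2.0]
print(truncated.tolist())  # [154, 109, 0, 1, 2]
```

Without the rounding step, the Heart_Rate value 109.99323 would have been stored as 109 rather than 110.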


In [25]:
# turn integer columns of floating type into integer type to save space

for feature in ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate']:
    train_pp[feature] = train_pp[feature].astype(int)
    test_pp[feature] = test_pp[feature].astype(int)

train_pp.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
0,1,36,189,82,26,101,41.0,5.01728
1,0,64,163,60,8,85,39.7,3.555348
2,0,51,161,64,7,84,39.8,3.401197
3,1,20,192,90,25,105,40.7,4.94876
4,0,38,166,61,25,102,40.6,4.990433


In [26]:
test_pp.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp
0,1,45,177,81,7,87,39.8
1,1,26,200,97,20,101,40.5
2,0,29,188,85,16,102,40.4
3,0,39,172,73,20,107,40.6
4,0,30,173,67,16,94,40.5


In [27]:
# save new file

train_pp.to_csv('train_pp.csv', index=False)
test_pp.to_csv('test_pp.csv', index=False)

In [28]:
# compare train file sizes

print("The file size of train.csv is", np.round(os.path.getsize('train.csv')/1024), "kb.")
print("The file size of train_pp.csv is", np.round(os.path.getsize('train_pp.csv')/1024), "kb.")

print("Rounding two floating point numbers and removing id reduced the file size by", np.round((1-os.path.getsize('train_pp.csv')/os.path.getsize('train.csv'))*100, decimals=2), "percent!")

The file size of train.csv is 34632.0 kb.
The file size of train_pp.csv is 30983.0 kb.
Rounding two floating point numbers and removing id reduced the file size by 10.54 percent!


In [29]:
# compare test file sizes

print("The file size of test.csv is", np.round(os.path.getsize('test.csv')/1024), "kb.")
print("The file size of test_pp.csv is", np.round(os.path.getsize('test_pp.csv')/1024), "kb.")

print("Converting to int and removing id reduced the file size by", np.round((1-os.path.getsize('test_pp.csv')/os.path.getsize('test.csv'))*100, decimals=2), "percent!")

The file size of test.csv is 10278.0 kb.
The file size of test_pp.csv is 5883.0 kb.
Converting to int and removing id reduced the file size by 42.76 percent!
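
The savings come from the text representation: an integer-valued float is written to CSV with a trailing `.0`, and the id column adds several characters per row. The following sketch shows the effect on a hypothetical three-row frame (`toy`, not the real data) by comparing the length of the CSV text before and after. One plausible reason the training file shrinks less than the testing file, which is an inference and not something measured here, is that the log-transformed Calories column is written with many more digits than the original values.

```python
import pandas as pd

# Hypothetical toy frame with an id column and integer-valued floats.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "Height": [189.0, 163.0, 161.0],
    "Weight": [82.0, 60.0, 64.0],
})

# to_csv returns the CSV text as a string when no path is given.
before = toy.to_csv(index=False)
after = toy.drop(columns="id").astype(int).to_csv(index=False)

print(before)
print(after)
print("characters saved:", len(before) - len(after))
```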


In [30]:
# make sure file imports as desired!

train_pp_import_test = pd.read_csv('train_pp.csv')

train_pp_import_test.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
0,1,36,189,82,26,101,41.0,5.01728
1,0,64,163,60,8,85,39.7,3.555348
2,0,51,161,64,7,84,39.8,3.401197
3,1,20,192,90,25,105,40.7,4.94876
4,0,38,166,61,25,102,40.6,4.990433


In [31]:
# make sure file imports as desired!

test_pp_import_test = pd.read_csv('test_pp.csv')

test_pp_import_test.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp
0,1,45,177,81,7,87,39.8
1,1,26,200,97,20,101,40.5
2,0,29,188,85,16,102,40.4
3,0,39,172,73,20,107,40.6
4,0,30,173,67,16,94,40.5
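
One caveat when re-importing: CSV files store no dtype information, so the category dtype given to Sex during preprocessing does not survive the round trip and the column comes back as a plain integer. A minimal sketch of the issue and one way to restore the dtype, using a hypothetical toy frame and an in-memory buffer in place of the real files:

```python
import io
import pandas as pd

# Hypothetical toy frame with Sex stored as a category, as in train_pp.
toy = pd.DataFrame({"Sex": pd.Categorical([1, 0, 0]), "Age": [36, 64, 51]})
csv_text = toy.to_csv(index=False)

# Reading the CSV back infers Sex as an integer column.
reloaded = pd.read_csv(io.StringIO(csv_text))
dtype_after_read = str(reloaded["Sex"].dtype)
print(dtype_after_read)

# Cast again after loading to get the category dtype back.
reloaded["Sex"] = reloaded["Sex"].astype("category")
print(reloaded["Sex"].dtype)
```

Whether the cast is needed depends on the downstream model; some libraries treat category columns differently from integer ones.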
