### This notebook reduces the file sizes of the training and testing datasets by about 33%.

Besides the categorical Sex column, the dataset contains 8 numerical columns:
- id
- Age
- Height
- Weight
- Duration
- Heart_Rate
- Body_Temp
- Calories (the target; absent from the testing set, test.csv)

The id column is not necessary and can be deleted.  Age is of integer type and the other features are of type float.  In the test set, Height, Weight, Duration, and Heart_Rate take only integer values.  In the training set, Weight, Duration, and Calories take only integer values, and all but one of the 750,000 entries in each of Height and Heart_Rate are integers as well.  Rounding those two entries to the nearest integer, converting the integer-valued columns from float to int, and deleting the id column reduces the CSV file size as claimed above.
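
To see why the float-to-int change shrinks a CSV file, here is a minimal sketch on a toy frame. The values and the helper name `csv_size` are made up for illustration and are not part of the notebook. A whole-number float such as `189.0` is written with a trailing `.0`, so casting to int saves two characters per field, and dropping `id` removes a whole column.

```python
import io
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
df = pd.DataFrame({
    "id": [0, 1, 2],
    "Height": [189.0, 163.0, 161.0],
    "Body_Temp": [41.0, 39.7, 39.8],
})

def csv_size(frame: pd.DataFrame) -> int:
    """Return the number of characters to_csv writes for this frame."""
    buf = io.StringIO()
    frame.to_csv(buf, index=False)
    return len(buf.getvalue())

# Drop id and store the whole-number float column as an integer column.
smaller = df.drop(columns=["id"]).astype({"Height": int})

size_before = csv_size(df)
size_after = csv_size(smaller)
print(size_before, size_after)
```

The saving comes from the text representation on disk, so it applies to CSV output specifically.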

Import the packages, then load the training and testing sets.

In [1]:
import numpy as np
import pandas as pd
import os

In [5]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# sample = pd.read_csv("sample_submission.csv")

train.head()

Unnamed: 0,id,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
0,0,male,36,189.0,82.0,26.0,101.0,41.0,150.0
1,1,female,64,163.0,60.0,8.0,85.0,39.7,34.0
2,2,female,51,161.0,64.0,7.0,84.0,39.8,29.0
3,3,male,20,192.0,90.0,25.0,105.0,40.7,140.0
4,4,female,38,166.0,61.0,25.0,102.0,40.6,146.0


Which columns are purely integers?

In [35]:
num_feats = ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']

nonint_feat = []
nonint_feat_test = []


# check training set
for feature in num_feats + ['Calories']:
    # print(f'{feature}:', np.array_equal(train[feature], train[feature].astype(int)))
    # a column is integer-valued if casting to int leaves every entry unchanged
    if not np.array_equal(train[feature], train[feature].astype(int)):
        nonint_feat.append(feature)

# print()
print("The non-integer columns in the training set are:", nonint_feat)



# check testing set
for feature in num_feats:
    #print(f'{feature}:', np.array_equal(test[feature], test[feature].astype(int)))
    # same integer-valued check as for the training set
    if not np.array_equal(test[feature], test[feature].astype(int)):
        nonint_feat_test.append(feature)

# print()
print("The non-integer columns in the testing set are:", nonint_feat_test)

The non-integer columns in the training set are: ['Height', 'Heart_Rate', 'Body_Temp']
The non-integer columns in the testing set are: ['Body_Temp']


By inspection, Height and Heart_Rate appeared to contain only integers in the training set, and both are purely integer-valued in the testing set.  How many non-integer values do these two columns contain in the training dataset?

In [36]:
for feature in nonint_feat:
    count = 0
    for x in train[feature].tolist():
        if not x.is_integer():
            count += 1

    print(f'{feature} has', count, 'non-integer values.')


Height has 1 non-integer values.
Heart_Rate has 1 non-integer values.
Body_Temp has 692127 non-integer values.
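
The same count can be obtained without a Python-level loop. The sketch below is one possible vectorized variant, not the notebook's own method; the series `heights` is an invented stand-in for a training column. A boolean mask marks the fractional entries, and the same mask can be reused to look them up.

```python
import pandas as pd

# Hypothetical stand-in for a training column; the values are invented.
heights = pd.Series([189.0, 163.0, 154.1, 161.0], name="Height")

# A float is whole exactly when its remainder modulo 1 is zero.
is_fractional = heights % 1 != 0

n_fractional = int(is_fractional.sum())   # how many entries are not whole
fractional_rows = heights[is_fractional]  # the entries themselves, with index

print(n_fractional)
print(fractional_rows)
```

Note that a missing value (NaN) would also be flagged by this mask, which matches the behavior of `float.is_integer()` in the loop above.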


Only one each in Height and Heart_Rate!  Find these entries; they can be rounded to the nearest integer.

In [None]:
# Find the non-integers in Height, Heart_Rate.
for x in train['Height'].tolist():
    if not x.is_integer():
        print("Height:", x)


for x in train['Heart_Rate'].tolist():
    if not x.is_integer():
        print("Heart_Rate:", x)

Height: 154.1
Heart_Rate: 109.99323


In [21]:
# create copies of the train and test sets that will be shrunk by using integer types

train_small = train.copy(deep=True)
test_small = test.copy(deep=True)

In [22]:
# Round the non-integer Height and Heart_Rate values to the nearest integer, and verify that Body_Temp is the only remaining non-integer feature.

train_small['Height'] = train_small['Height'].round()
train_small['Heart_Rate'] = train_small['Heart_Rate'].round()

for feature in ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp', 'Calories']:
    print(f'{feature}:', np.array_equal(train_small[feature], train_small[feature].astype(int)))

Age: True
Height: True
Weight: True
Duration: True
Heart_Rate: True
Body_Temp: False
Calories: True
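
As a small worked example of the rounding step, the two outliers found earlier (154.1 and 109.99323) can be rounded in isolation. The series name `outliers` is only for illustration.

```python
import pandas as pd

# The two non-integer values reported above for Height and Heart_Rate.
outliers = pd.Series([154.1, 109.99323], index=["Height", "Heart_Rate"])

# round() returns whole-number floats; astype(int) then stores them as integers.
rounded = outliers.round().astype(int)
print(rounded.tolist())  # [154, 110]
```

Each value moves by at most 0.1, so the rounding has a negligible effect on the data.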


In [23]:
# cast integer-valued float columns to int to save space

for feature in ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Calories']:
    train_small[feature] = train_small[feature].astype(int)
    if feature != 'Calories': # test set doesn't have Calories
        test_small[feature] = test_small[feature].astype(int)


# id column is not needed
train_small = train_small.drop(columns=['id'])
test_small = test_small.drop(columns=['id'])

train_small.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
0,male,36,189,82,26,101,41.0,150
1,female,64,163,60,8,85,39.7,34
2,female,51,161,64,7,84,39.8,29
3,male,20,192,90,25,105,40.7,140
4,female,38,166,61,25,102,40.6,146


In [24]:
test_small.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp
0,male,45,177,81,7,87,39.8
1,male,26,200,97,20,101,40.5
2,female,29,188,85,16,102,40.4
3,female,39,172,73,20,107,40.6
4,female,30,173,67,16,94,40.5
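
Because `astype(int)` silently truncates any fractional part, the cast above is only safe after the integer check has passed. One way to make that explicit is a guarded cast like the sketch below; `to_int_checked` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

def to_int_checked(series: pd.Series) -> pd.Series:
    """Cast to int only if every value is already whole; otherwise raise."""
    if not (series % 1 == 0).all():
        raise ValueError(f"{series.name} contains non-integer values")
    return series.astype(int)

# Whole-number floats are converted without loss.
whole = to_int_checked(pd.Series([177.0, 200.0], name="Height"))

# A column with real fractional values is rejected instead of truncated.
rejected = False
try:
    to_int_checked(pd.Series([39.8, 40.5], name="Body_Temp"))
except ValueError as err:
    rejected = True
    print(err)
```

With this guard, a column such as Body_Temp cannot be converted by accident.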


In [25]:
# save the new files

train_small.to_csv('train_comp.csv', index=False)
test_small.to_csv('test_comp.csv', index=False)

In [26]:
# compare train file sizes

print("The file size of train.csv is", np.round(os.path.getsize('train.csv')/1024), "kb.")
print("The file size of train_comp.csv is", np.round(os.path.getsize('train_comp.csv')/1024), "kb.")

print("Rounding two floating point numbers and removing id reduced the file size by", np.round((1-os.path.getsize('train_comp.csv')/os.path.getsize('train.csv'))*100, decimals=2), "percent!")

The file size of train.csv is 34632.0 kb.
The file size of train_comp.csv is 23022.0 kb.
Rounding two floating point numbers and removing id reduced the file size by 33.52 percent!


In [27]:
# compare test file sizes

print("The file size of test.csv is", np.round(os.path.getsize('test.csv')/1024), "kb.")
print("The file size of test_comp.csv is", np.round(os.path.getsize('test_comp.csv')/1024), "kb.")

print("Converting to int and removing id reduced the file size by", np.round((1-os.path.getsize('test_comp.csv')/os.path.getsize('test.csv'))*100, decimals=2), "percent!")

The file size of test.csv is 10278.0 kb.
The file size of test_comp.csv is 6860.0 kb.
Converting to int and removing id reduced the file size by 33.26 percent!
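
The percentages can be reproduced from the sizes printed above. The sketch below uses the rounded kilobyte figures rather than exact byte counts, and the helper name `percent_reduction` is an assumption made for illustration.

```python
def percent_reduction(original_size: float, new_size: float) -> float:
    """Percentage by which new_size is smaller than original_size."""
    return round((1 - new_size / original_size) * 100, 2)

# Sizes in kb as printed by the two comparison cells.
train_reduction = percent_reduction(34632, 23022)
test_reduction = percent_reduction(10278, 6860)

print(train_reduction, test_reduction)  # 33.52 33.26
```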


In [28]:
# make sure file imports as desired!

train_comp_import_test = pd.read_csv('train_comp.csv')

train_comp_import_test.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp,Calories
0,male,36,189,82,26,101,41.0,150
1,female,64,163,60,8,85,39.7,34
2,female,51,161,64,7,84,39.8,29
3,male,20,192,90,25,105,40.7,140
4,female,38,166,61,25,102,40.6,146


In [29]:
# make sure file imports as desired!

test_comp_import_test = pd.read_csv('test_comp.csv')

test_comp_import_test.head()

Unnamed: 0,Sex,Age,Height,Weight,Duration,Heart_Rate,Body_Temp
0,male,45,177,81,7,87,39.8
1,male,26,200,97,20,101,40.5
2,female,29,188,85,16,102,40.4
3,female,39,172,73,20,107,40.6
4,female,30,173,67,16,94,40.5
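
Beyond eyeballing `head()`, the re-import can be checked programmatically. The following is a minimal sketch that round-trips a toy frame through an in-memory buffer instead of the real files; the frame `small` only mimics the first rows shown above.

```python
import io
import pandas as pd

# Toy frame mimicking the compressed test set (first two rows shown above).
small = pd.DataFrame({
    "Sex": ["male", "male"],
    "Age": [45, 26],
    "Height": [177, 200],
    "Body_Temp": [39.8, 40.5],
})

# Write to an in-memory CSV and read it back, as a stand-in for the files on disk.
buf = io.StringIO()
small.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

# Columns written without a decimal point come back as integer dtypes.
int_columns = [c for c in reloaded.columns
               if pd.api.types.is_integer_dtype(reloaded[c])]
print(int_columns)
```

If a column expected to be integer is missing from `int_columns`, some value in the file still carries a fractional part.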
