# Import libraries

In [2]:
import pandas as pd
import numpy as np

# Load Data

In [3]:
df = pd.read_csv('clean_08.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987 entries, 0 to 986
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   model                 987 non-null    object 
 1   displ                 987 non-null    float64
 2   cyl                   987 non-null    int64  
 3   trans                 987 non-null    object 
 4   drive                 987 non-null    object 
 5   fuel                  987 non-null    object 
 6   veh_class             987 non-null    object 
 7   air_pollution_score   987 non-null    float64
 8   city_mpg              987 non-null    int64  
 9   hwy_mpg               987 non-null    int64  
 10  cmb_mpg               987 non-null    int64  
 11  greenhouse_gas_score  987 non-null    int64  
 12  smartway              987 non-null    object 
dtypes: float64(2), int64(5), object(6)
memory usage: 100.4+ KB


# View Values in `mpg` columns

In [7]:
import re

mpg_columns = [x for x in df.columns if re.search(r"(mpg)$", x)]

mpg_columns

['city_mpg', 'hwy_mpg', 'cmb_mpg']

In [10]:
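for col in mpg_columns:
    # value_counts is referenced without parentheses, so this prints the bound
    # method (and the raw Series); df[col].value_counts() gives the counts.
    print(df[col].value_counts)
    print("-------------------")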

<bound method IndexOpsMixin.value_counts of 0      15
1      17
2      16
3      18
4      17
       ..
982    14
983    14
984    13
985    13
986    18
Name: city_mpg, Length: 987, dtype: int64>
-------------------
<bound method IndexOpsMixin.value_counts of 0      20
1      22
2      24
3      26
4      26
       ..
982    20
983    20
984    19
985    19
986    25
Name: hwy_mpg, Length: 987, dtype: int64>
-------------------
<bound method IndexOpsMixin.value_counts of 0      17
1      19
2      19
3      21
4      20
       ..
982    16
983    16
984    15
985    15
986    21
Name: cmb_mpg, Length: 987, dtype: int64>
-------------------
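
The output above is the repr of the bound method, which shows the raw Series rather than a frequency table. As a minimal sketch on a toy Series (the values are made up and only stand in for one of the mpg columns), calling the method with parentheses returns the counts:

```python
import pandas as pd

# Toy Series with made-up values, standing in for one mpg column.
city_mpg = pd.Series([15, 17, 16, 17, 17, 15], name="city_mpg")

# Note the parentheses: this calls the method instead of referencing it.
counts = city_mpg.value_counts()
print(counts)
```

The result is indexed by the distinct values and sorted by frequency, most common first.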


# Changing the column types to int

*Note: I already did this while cleaning the data.*

In [11]:
for col in mpg_columns:
    df[col] = df[col].astype(int)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987 entries, 0 to 986
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   model                 987 non-null    object 
 1   displ                 987 non-null    float64
 2   cyl                   987 non-null    int64  
 3   trans                 987 non-null    object 
 4   drive                 987 non-null    object 
 5   fuel                  987 non-null    object 
 6   veh_class             987 non-null    object 
 7   air_pollution_score   987 non-null    float64
 8   city_mpg              987 non-null    int32  
 9   hwy_mpg               987 non-null    int32  
 10  cmb_mpg               987 non-null    int32  
 11  greenhouse_gas_score  987 non-null    int64  
 12  smartway              987 non-null    object 
dtypes: float64(2), int32(3), int64(2), object(6)
memory usage: 88.8+ KB


`int32` is far more than is needed to represent values that range from a minimum of 8 to a maximum of 48. A smaller integer type such as `int8` should be used instead. The next cell shows the largest and smallest values in the `mpg_columns`.

In [14]:
df[mpg_columns].describe()

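|       | city_mpg  | hwy_mpg   | cmb_mpg   |
|-------|-----------|-----------|-----------|
| count | 987.0     | 987.0     | 987.0     |
| mean  | 17.386018 | 24.038501 | 19.788247 |
| std   | 4.088018  | 4.753406  | 4.251565  |
| min   | 8.0       | 13.0      | 10.0      |
| 25%   | 15.0      | 20.0      | 17.0      |
| 50%   | 17.0      | 24.0      | 20.0      |
| 75%   | 20.0      | 27.0      | 22.0      |
| max   | 48.0      | 45.0      | 46.0      |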
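
To confirm that these values fit in `int8`, the range of the type can be compared against the column minimums and maximums. The sketch below assumes a toy frame (`toy` is a hypothetical name) that holds only the min and max values reported above, not the real data:

```python
import numpy as np
import pandas as pd

# Toy frame holding only the min/max values reported by describe() above.
toy = pd.DataFrame({
    "city_mpg": [8, 48],
    "hwy_mpg": [13, 45],
    "cmb_mpg": [10, 46],
})

# Representable range of a signed 8-bit integer.
limits = np.iinfo(np.int8)

# True for each column whose values all fall inside the int8 range.
fits_int8 = (toy.min() >= limits.min) & (toy.max() <= limits.max)
print(limits.min, limits.max)
print(fits_int8)
```

Since `int8` covers -128 to 127, all three columns fit with room to spare.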


## Changing the column values to `int8`

In [15]:
for col in mpg_columns:
    df[col] = df[col].astype("int8")

Memory usage has now gone from 100.4 KB (with the original `int64` columns) to 80.1 KB, as the next cell confirms.

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987 entries, 0 to 986
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   model                 987 non-null    object 
 1   displ                 987 non-null    float64
 2   cyl                   987 non-null    int64  
 3   trans                 987 non-null    object 
 4   drive                 987 non-null    object 
 5   fuel                  987 non-null    object 
 6   veh_class             987 non-null    object 
 7   air_pollution_score   987 non-null    float64
 8   city_mpg              987 non-null    int8   
 9   hwy_mpg               987 non-null    int8   
 10  cmb_mpg               987 non-null    int8   
 11  greenhouse_gas_score  987 non-null    int64  
 12  smartway              987 non-null    object 
dtypes: float64(2), int64(2), int8(3), object(6)
memory usage: 80.1+ KB
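
The size of the saving can be checked with some quick arithmetic: an `int64` value takes 8 bytes and an `int8` value takes 1 byte. The sketch below uses a synthetic column of 987 made-up values (not the real data) to estimate the saving for the three mpg columns:

```python
import numpy as np
import pandas as pd

n_rows = 987  # row count reported by df.info()

# Synthetic stand-in for one mpg column; the values are made up.
col64 = pd.Series(np.full(n_rows, 20), dtype="int64")
col8 = col64.astype("int8")

bytes64 = col64.memory_usage(index=False)  # 8 bytes per row
bytes8 = col8.memory_usage(index=False)    # 1 byte per row

# Saving across the three mpg columns, in KB.
saved_kb = 3 * (bytes64 - bytes8) / 1024
print(bytes64, bytes8, round(saved_kb, 1))
```

This gives a saving of about 20 KB, which is consistent with the drop from 100.4 KB to 80.1 KB reported by `df.info()`.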


# Optimizing String Values

## Viewing the value counts in the string columns

In [35]:
stringColumns = [x for x in df.columns if df[x].dtype == "object"]

stringColumns

['model', 'trans', 'drive', 'fuel', 'veh_class', 'smartway']

In [25]:
for col in stringColumns:
    print(df[col].value_counts())
    print("-----------------")

model
NISSAN Altima             12
HONDA Accord              11
FORD Ranger               10
DODGE RAM 1500             9
DODGE Dakota               8
                          ..
MERCEDES-BENZ SL55 AMG     1
MERCEDES-BENZ SL550        1
MERCEDES-BENZ SL600        1
MERCEDES-BENZ SL65 AMG     1
ACURA MDX                  1
Name: count, Length: 377, dtype: int64
-----------------
trans
Auto-L4    176
Auto-S6    162
Auto-L5    157
Man-6      142
Man-5      123
Auto-S5     71
Auto-L6     56
Auto-AV     45
Auto-S4     21
Auto-L7     15
Auto-S7     11
Auto-6       4
S8           4
Name: count, dtype: int64
-----------------
drive
2WD    662
4WD    325
Name: count, dtype: int64
-----------------
fuel
Gasoline    984
CNG           1
ethanol       1
gas           1
Name: count, dtype: int64
-----------------
veh_class
small car        333
SUV              280
midsize car      138
pickup            83
station wagon     60
large car         55
van               21
minivan           17
Name: coun

## Changing most of the string columns to `category` data type

As the output above shows, every string column except `model` has only between 2 and 13 unique values. In an `object` column every row holds its own reference to a Python string, which is wasteful when the same few labels repeat hundreds of times. The `category` data type avoids this: pandas stores each unique value once and keeps only a small integer code per row. Category types are especially useful for columns with a very low number of unique values, because pandas can then store the data much more efficiently.
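
As a minimal sketch of how a categorical column is stored, the toy Series below (made-up rows, using the two labels from the `drive` column) exposes its unique labels and per-row integer codes through the `.cat` accessor:

```python
import pandas as pd

# Toy column with made-up rows; the labels come from the `drive` column.
drive = pd.Series(["2WD", "4WD", "2WD", "2WD"], dtype="category")

categories = list(drive.cat.categories)  # each unique label is stored once
codes = drive.cat.codes.tolist()         # one small integer code per row
print(categories)
print(codes)
print(drive.cat.codes.dtype)             # int8 when there are few categories
```

Each row costs one byte for its code, and the label text is kept only once in the categories.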

The `model` column does not need to be turned into a category, since there are far too many distinct models (377 across 987 rows). It is therefore removed from the columns list.

In [36]:
stringColumns.remove("model")
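
Instead of removing `model` by hand, the same decision could be made from the ratio of unique values to rows. This is only a sketch on a toy frame with made-up rows; the `max_unique_ratio` cut-off is a hypothetical threshold, not something used in this notebook:

```python
import pandas as pd

# Toy frame with made-up rows; the column names mirror the dataset.
toy = pd.DataFrame({
    "model": ["ACURA MDX", "FORD Ranger", "HONDA Accord", "NISSAN Altima"],
    "drive": ["4WD", "2WD", "2WD", "2WD"],
    "fuel": ["Gasoline", "Gasoline", "Gasoline", "Gasoline"],
})

max_unique_ratio = 0.5  # hypothetical cut-off; tune it per dataset

# Keep only the columns whose share of unique values is at or below the cut-off.
low_cardinality = [
    col for col in toy.columns
    if toy[col].nunique() / len(toy) <= max_unique_ratio
]
print(low_cardinality)
```

Here `model` is excluded because every row has a different value, while `drive` and `fuel` are kept as category candidates.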

### Converting columns to category

In [38]:
for col in stringColumns:
    df[col] = df[col].astype("category")

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987 entries, 0 to 986
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   model                 987 non-null    object  
 1   displ                 987 non-null    float64 
 2   cyl                   987 non-null    int64   
 3   trans                 987 non-null    category
 4   drive                 987 non-null    category
 5   fuel                  987 non-null    category
 6   veh_class             987 non-null    category
 7   air_pollution_score   987 non-null    float64 
 8   city_mpg              987 non-null    int8    
 9   hwy_mpg               987 non-null    int8    
 10  cmb_mpg               987 non-null    int8    
 11  greenhouse_gas_score  987 non-null    int64   
 12  smartway              987 non-null    category
dtypes: category(5), float64(2), int64(2), int8(3), object(1)
memory usage: 47.8+ KB


The memory usage of the dataset went down from 80.1 KB to 47.8 KB.
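
The `+` in the `memory usage` line means the figure is a lower bound: for `object` columns, `df.info()` counts only the references, not the strings themselves. Passing `deep=True` to `memory_usage` includes the strings. The sketch below compares the two representations on a synthetic column (987 rows cycling through the `veh_class` labels, not the real distribution):

```python
import pandas as pd

# Synthetic column: 987 rows cycling through the `veh_class` labels.
labels = ["small car", "SUV", "midsize car", "pickup",
          "station wagon", "large car", "van", "minivan"]
as_object = pd.Series(
    [labels[i % len(labels)] for i in range(987)], dtype="object"
)
as_category = as_object.astype("category")

# deep=True also counts the Python string objects, not only the references.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```

The exact byte counts depend on the Python and pandas versions, but the categorical column should come out much smaller.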