# <center> Optimizing CarPrice Dataset in Chunks 

# 1. Introduction

## 1.1 Problem Statement 

Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market.

The company wants to know:

- Which variables are significant in predicting the price of a car.
- How well those variables describe the price of a car.

Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.

## 1.2 What are Chunks?

Even after optimizing the data types of a dataframe and selecting only the necessary columns, the dataset may still be too large to fit in memory. In that case, it is more efficient to process the dataframe in chunks than to load it all at once: only a subset of the rows is held in memory at any given time. In other words, we process a fraction of the data at a time, compute an intermediate result for each chunk, and combine those results at the end.
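A minimal sketch of this pattern, using a small in-memory CSV (hypothetical data, not the CarPrice file) so that it runs on its own. Each chunk contributes a partial sum and a row count, and the partial results are combined into a mean at the end.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "price\n" + "\n".join(str(p) for p in range(10, 110, 10))

total = 0
count = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# holding at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["price"].sum()  # partial result for this chunk
    count += len(chunk)

mean_price = total / count  # combine the partial results
print(count, mean_price)
```

Only one chunk is alive at a time, so peak memory depends on the chunk size rather than on the size of the file.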

## 1.3 Data Descriptions

|No.|Feature|Description|
|:---:|:---:|:---:|
|1|car_ID| Unique ID of each observation (Integer) |
|2|symboling| Assigned insurance risk rating: a value of +3 indicates that the car is risky, -3 that it is probably pretty safe (Categorical) |
|3|CarName| Name of the car company and model (Categorical) |
|4|fueltype| Car fuel type, i.e. gas or diesel (Categorical) |
|5|aspiration| Aspiration used in the car (Categorical) |
|6|doornumber| Number of doors in the car (Categorical) |
|7|carbody| Body of the car (Categorical) |
|8|drivewheel| Type of drive wheel (Categorical) |
|9|enginelocation| Location of the car engine (Categorical) |
|10|wheelbase| Wheelbase of the car (Numeric) |
|11|carlength| Length of the car (Numeric) |
|12|carwidth| Width of the car (Numeric) |
|13|carheight| Height of the car (Numeric) |
|14|curbweight| Weight of the car without occupants or baggage (Numeric) |
|15|enginetype| Type of engine (Categorical) |
|16|cylindernumber| Number of cylinders in the car (Categorical) |
|17|enginesize| Size of the engine (Numeric) |
|18|fuelsystem| Fuel system of the car (Categorical) |
|19|boreratio| Bore ratio of the car (Numeric) |
|20|stroke| Stroke or volume inside the engine (Numeric) |
|21|compressionratio| Compression ratio of the car (Numeric) |
|22|horsepower| Horsepower (Numeric) |
|23|peakrpm| Peak RPM of the car (Numeric) |
|24|citympg| Mileage in the city (Numeric) |
|25|highwaympg| Mileage on the highway (Numeric) |
|26|price| Price of the car; the dependent variable (Numeric) |


## 1.4 Source of Data 

- **Competition** : https://www.kaggle.com/datasets/hellbuoy/car-price-prediction
- **Source**: https://archive.ics.uci.edu/ml/datasets/Automobile

# 2. Purpose of this Project 

We will optimize the dataframe and process it in a situation where only 0.05 megabytes of memory are available.

# 3. Estimating the amount of memory 

## 3.1 Import dataset

In [1]:
import pandas as pd

cars = pd.read_csv('Datasets/CarPrice_Assignment.csv', nrows=5)
cars

| |car_ID|symboling|CarName|fueltype|aspiration|doornumber|carbody|drivewheel|enginelocation|wheelbase|carlength|carwidth|carheight|curbweight|enginetype|cylindernumber|enginesize|fuelsystem|boreratio|stroke|compressionratio|horsepower|peakrpm|citympg|highwaympg|price|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|1|3|alfa-romero giulia|gas|std|two|convertible|rwd|front|88.6|168.8|64.1|48.8|2548|dohc|four|130|mpfi|3.47|2.68|9|111|5000|21|27|13495|
|1|2|3|alfa-romero stelvio|gas|std|two|convertible|rwd|front|88.6|168.8|64.1|48.8|2548|dohc|four|130|mpfi|3.47|2.68|9|111|5000|21|27|16500|
|2|3|1|alfa-romero Quadrifoglio|gas|std|two|hatchback|rwd|front|94.5|171.2|65.5|52.4|2823|ohcv|six|152|mpfi|2.68|3.47|9|154|5000|19|26|16500|
|3|4|2|audi 100 ls|gas|std|four|sedan|fwd|front|99.8|176.6|66.2|54.3|2337|ohc|four|109|mpfi|3.19|3.4|10|102|5500|24|30|13950|
|4|5|2|audi 100ls|gas|std|four|sedan|4wd|front|99.4|176.6|66.4|54.3|2824|ohc|five|136|mpfi|3.19|3.4|8|115|5500|18|22|17450|


In [2]:
# Memory usage (in MB) of the first 100 rows of the dataframe
memory_hundreds = pd.read_csv('Datasets/CarPrice_Assignment.csv', nrows=100)
memory_hundreds.memory_usage(deep=True).sum() / (2**20)

0.07144641876220703

## 3.2 Calculate total memory usage 

Given that 100 rows of the dataset consume about 0.07MB of memory, we can expect chunks of 50 rows to consume about 0.035MB each. With 0.05MB of memory available, each chunk uses roughly 70% of the available memory, which stays within the limit.

In [3]:
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize=50)

for chunk in chunk_iter: 
    print(chunk.memory_usage(deep=True).sum() / 2**20)

0.035719871520996094
0.03585243225097656
0.03571033477783203
0.03588104248046875
0.0036773681640625
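The estimate can also be written out as arithmetic. The sketch below takes the 100-row measurement from above, derives a per-row cost, and computes both the share of the 0.05MB budget used by a 50-row chunk and the largest chunk that would still fit. The helper name `max_rows_for_budget` is hypothetical, and the linear extrapolation is only approximate because string columns vary in size from row to row.

```python
def max_rows_for_budget(sample_mb: float, sample_rows: int, budget_mb: float) -> int:
    """Largest row count whose estimated memory stays within budget_mb."""
    mb_per_row = sample_mb / sample_rows
    return int(budget_mb // mb_per_row)


sample_mb = 0.07144641876220703  # measured for 100 rows above
budget_mb = 0.05                 # memory assumed to be available

mb_per_row = sample_mb / 100
chunk_share = 50 * mb_per_row / budget_mb  # budget share of a 50-row chunk
limit = max_rows_for_budget(sample_mb, 100, budget_mb)

print(f"{chunk_share:.0%} of budget, at most {limit} rows per chunk")
```

Under these assumptions a 50-row chunk takes about 71% of the budget, and the estimated ceiling is 69 rows per chunk.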


## 3.3 Explore dataset in Chunks

**Total number of rows**

In [4]:
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize=50)

nrow = 0
for chunk in chunk_iter: 
    nrow += len(chunk) 
    ncol = len(chunk.columns)

print(nrow, ncol)

205 26


The CarPrice dataframe has 205 rows and 26 columns.

**The number of columns by datatype**

In [5]:
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize=50)

num_numerical_columns = []
num_object_columns = []
for chunk in chunk_iter:
    numerical_columns = chunk.select_dtypes(exclude = ['object']).columns
    num_numerical_columns.append(len(numerical_columns))
    object_columns = chunk.select_dtypes(include = ['object']).columns
    num_object_columns.append(len(object_columns))
    
print(f"Number of numerical columns : {num_numerical_columns}")
print(f"Number of object columns : {num_object_columns}")

Number of numerical columns : [16, 16, 16, 16, 16]
Number of object columns : [10, 10, 10, 10, 10]


In [6]:
print(f"Columns in numerical_columns : {numerical_columns}")
print(f"Columns in object_columns : {object_columns}")

Columns in numerical_columns : Index(['car_ID', 'symboling', 'wheelbase', 'carlength', 'carwidth',
       'carheight', 'curbweight', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price'],
      dtype='object')
Columns in object_columns : Index(['CarName', 'fueltype', 'aspiration', 'doornumber', 'carbody',
       'drivewheel', 'enginelocation', 'enginetype', 'cylindernumber',
       'fuelsystem'],
      dtype='object')


# 4. Check columns for optimizing

The next step is to find the columns that need optimizing. The workflow is as follows:

- Count the unique values in each object column. Columns in which fewer than 50% of the values are unique are candidates for the category type; the check below uses a cutoff of 50 distinct values.
- Check which numeric columns have no missing values; these are candidates for conversion to an integer type.

## 4.1 Check object columns 

In [7]:
# Count number of unique values in object columns 
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize=50)

unique_series = {}
for chunk in chunk_iter : 
    object_df = chunk.select_dtypes(include = ['object'])
    object_columns = object_df.columns
    for col in object_columns : 
        col_unique = chunk[col].value_counts() 
        if col in unique_series : 
            unique_series[col].append(col_unique)
        else : 
            unique_series[col] = [col_unique]

nunique_series = {} 
for col in unique_series : 
    col_concat = pd.concat(unique_series[col])
    col_group = col_concat.groupby(col_concat.index).sum()
    nunique_series[col] = len(col_group)

nunique_series = pd.Series(nunique_series).sort_values(ascending = False)
nunique_series[nunique_series <= 50]

fuelsystem        8
enginetype        7
cylindernumber    7
carbody           5
drivewheel        3
fueltype          2
aspiration        2
doornumber        2
enginelocation    2
dtype: int64

To optimize the car price dataframe, the columns above need to be converted to the category data type.
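To see why a low-cardinality column benefits from this conversion, the following sketch compares the memory of a hypothetical two-valued column (similar in shape to `fueltype`) stored as `object` and as `category`. The values are illustrative and not read from the dataset.

```python
import pandas as pd

# Hypothetical column with only two distinct values, stored as Python strings.
fuel = pd.Series(["gas", "diesel"] * 100, dtype="object")

object_bytes = fuel.memory_usage(deep=True)
category_bytes = fuel.astype("category").memory_usage(deep=True)

print(object_bytes, category_bytes)
```

A categorical column stores each distinct string once plus a small integer code per row, so the saving grows with the number of repeated values.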

## 4.2 Check numerical columns

In [8]:
# Count missing values in the numeric columns
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize = 50)

missing = []
for chunk in chunk_iter : 
    numeric_df = chunk.select_dtypes(exclude = ['object'])
    missing.append(numeric_df.isnull().sum())

missing_series = pd.concat(missing)
missing_series = missing_series.groupby(missing_series.index).sum().sort_values(ascending = False)
missing_series

boreratio           0
car_ID              0
carheight           0
carlength           0
carwidth            0
citympg             0
compressionratio    0
curbweight          0
enginesize          0
highwaympg          0
horsepower          0
peakrpm             0
price               0
stroke              0
symboling           0
wheelbase           0
dtype: int64
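Because none of the numeric columns has missing values, the integer columns could be stored in a smaller integer subtype. The notebook does not go on to do this, so the following is only a sketch on hypothetical values in a similar range, using `pd.to_numeric` with `downcast="integer"`.

```python
import pandas as pd

# Hypothetical chunk; the values are illustrative, not read from the CSV.
chunk = pd.DataFrame({
    "symboling": [3, 1, -2],
    "peakrpm": [5000, 5500, 4800],
})

before = chunk.memory_usage(deep=True).sum()
for col in ["symboling", "peakrpm"]:
    # Picks the smallest integer subtype that can hold every value.
    chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
after = chunk.memory_usage(deep=True).sum()

print(chunk.dtypes)
print(before, after)
```

When this is applied chunk by chunk, different chunks may end up with different subtypes for the same column, so a fixed dtype chosen from the overall value range is the safer choice when the chunks are combined later.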

## 4.3 Calculate memory usage across all chunks

In [9]:
# Check memory usage of object columns, numeric columns, and all columns
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize = 50)

object_memory = 0
numeric_memory = 0
total_memory = 0
for chunk in chunk_iter :
    object_df = chunk.select_dtypes(include = ['object'])
    numeric_df = chunk.select_dtypes(exclude = ['object'])
    object_memory += object_df.memory_usage(deep = True).sum()/(2**20)
    numeric_memory += numeric_df.memory_usage(deep = True).sum()/(2**20)
    total_memory += chunk.memory_usage(deep = True).sum()/(2**20)
    
print(f"Memory usage of object columns : {round(object_memory, 3)}MB")
print(f"Memory usage of numeric columns : {round(numeric_memory, 3)}MB")
print(f"Memory usage of total columns : {round(total_memory, 3)}MB")

Memory usage of object columns : 0.122MB
Memory usage of numeric columns : 0.026MB
Memory usage of total columns : 0.147MB


# 5. Optimizing Datasets

## 5.1 Object columns 

In [10]:
# Check unique value in object columns 
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize = 50)

unique_series = {}
for chunk in chunk_iter : 
    object_df = chunk.select_dtypes(include = ['object'])
    object_columns = object_df.columns
    for col in object_columns : 
        col_unique = chunk[col].value_counts() 
        if col in unique_series : 
            unique_series[col].append(col_unique)
        else : 
            unique_series[col] = [col_unique]

for col in unique_series : 
    col_concat = pd.concat(unique_series[col])
    col_group = col_concat.groupby(col_concat.index).sum()
    print(col_group)

Nissan versa                       1
alfa-romero Quadrifoglio           1
alfa-romero giulia                 1
alfa-romero stelvio                1
audi 100 ls                        1
audi 100ls                         2
audi 4000                          1
audi 5000                          1
audi 5000s (diesel)                1
audi fox                           1
bmw 320i                           2
bmw x1                             1
bmw x3                             2
bmw x4                             1
bmw x5                             1
bmw z4                             1
buick century                      1
buick century luxus (sw)           1
buick century special              1
buick electra 225 custom           1
buick opel isuzu deluxe            1
buick regal sport coupe (turbo)    1
buick skyhawk                      1
buick skylark                      1
chevrolet impala                   1
chevrolet monte carlo              1
chevrolet vega 2300                1
...

We will convert the following columns to the category data type: fuelsystem, enginetype, cylindernumber, carbody, drivewheel, fueltype, aspiration, doornumber, enginelocation.

In [11]:
# Convert object type to category
print(f"Previous total memory : {total_memory}")

cat_cols = ['fuelsystem', 'enginetype', 'cylindernumber', 'carbody', 'drivewheel', 'fueltype', 'aspiration', 'doornumber', 'enginelocation']
chunk_iter = pd.read_csv('Datasets/CarPrice_Assignment.csv', chunksize = 50)
total_memory = 0
for chunk in chunk_iter: 
    for col in cat_cols: 
        chunk[col] = chunk[col].astype('category') 
    total_memory += chunk.memory_usage(deep=True).sum() / 2**20
    
print(f"Current total memory : {round(total_memory,3)}MB")

Previous total memory : 0.14684104919433594
Current total memory : 0.054MB
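One caveat of converting chunk by chunk: `astype('category')` infers the categories from the values present in that chunk only, so two chunks can end up with different category sets, and concatenating them loses the category dtype. A sketch of one way around this, assuming the full set of values is known beforehand (for example from the unique-value counts in section 4.1); the data and the name `fuel_dtype` are hypothetical.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data: the first chunk contains only "gas".
csv_text = "fueltype,price\ngas,100\ngas,120\ndiesel,150\ngas,110\n"

# Categories inferred separately in each chunk.
inferred = [
    c.assign(fueltype=c["fueltype"].astype("category"))
    for c in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
inferred_dtype = pd.concat(inferred)["fueltype"].dtype

# One shared dtype, applied while reading every chunk.
fuel_dtype = CategoricalDtype(categories=["diesel", "gas"])
fixed = list(
    pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"fueltype": fuel_dtype})
)
fixed_dtype = pd.concat(fixed)["fueltype"].dtype

print(inferred_dtype, fixed_dtype)
```

Passing the dtype to `read_csv` also avoids creating the intermediate object column in the first place.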


# 6. Check optimized data types

In [12]:
chunk.dtypes

car_ID                 int64
symboling              int64
CarName               object
fueltype            category
aspiration          category
doornumber          category
carbody             category
drivewheel          category
enginelocation      category
wheelbase            float64
carlength            float64
carwidth             float64
carheight            float64
curbweight             int64
enginetype          category
cylindernumber      category
enginesize             int64
fuelsystem          category
boreratio            float64
stroke               float64
compressionratio     float64
horsepower             int64
peakrpm                int64
citympg                int64
highwaympg             int64
price                  int64
dtype: object