# Exercise 05 : Pandas optimizations

## Imports

In [1]:
import pandas as pd
import gc

## Read the fines.csv that you saved in the previous exercise

In [3]:
df = pd.read_csv('../data/fines.csv')
df.head()

,CarNumber,Refund,Fines,Make,Model,Year
0,Y163O8161RUS,2,3200.0,Ford,Focus,1989
1,E432XX77RUS,1,6500.0,Toyota,Camry,1995
2,7184TT36RUS,1,2100.0,Ford,Focus,1984
3,X582HE161RUS,2,2000.0,Ford,Focus,2015
4,92918M178RUS,1,5700.0,Ford,Focus,2014


## Iterations: in each of the following subtasks, calculate fines / refund * year for every row, store the result in a new column, and measure the time with the %%timeit cell magic

## Loop:
- Write a function that iterates through the dataframe with for i in range(0, len(df)) and iloc, appends each result to a list, and assigns the list to a new column in the dataframe

In [4]:
def iterations_loop_test(df):
    result = []
    for i in range(0, len(df)):
        result.append(df.iloc[i]['Fines'] / df.iloc[i]['Refund'] * df.iloc[i]['Year'])
    df['Calculated'] = result

In [10]:
%%timeit

iterations_loop_test(df)
df

151 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


- Do it using iterrows()

In [6]:
def iterations_iterrows_test(df):
    result = []
    for _, row in df.iterrows():
        result.append(row['Fines'] / row['Refund'] * row['Year'])
    df['Calculated'] = result

In [7]:
%%timeit

iterations_iterrows_test(df)
df

30.1 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


- Do it using apply() and lambda function

In [15]:
def iterations_apply_test(df):
    df['Calculated'] = df.apply(lambda row: row['Fines'] /
                                row['Refund'] * row['Year'],
                                axis='columns')

In [16]:
%%timeit

iterations_apply_test(df)
df

6.89 ms ± 9.73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


- Do it using Series objects from the dataframe

In [17]:
def iterations_series_test(df):
    df['Calculated'] = df['Fines'] / df['Refund'] * df['Year']

In [18]:
%%timeit

iterations_series_test(df)
df

188 µs ± 4.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


- Do it as in the previous subtask but with the .values attribute

In [19]:
def iterations_series_values_test(df):
    df['Calculated'] = df['Fines'].values / df['Refund'].values * df['Year'].values

In [20]:
%%timeit

iterations_series_values_test(df)
df

82.7 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
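
The variants above should all produce the same column; only the speed differs. The following is a minimal sketch that checks this, assuming a small synthetic frame (the name sample and its repeated rows are made up for illustration) and the standard timeit module in place of the %%timeit magic:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for fines.csv: same column names, a few rows repeated.
sample = pd.DataFrame({
    'Refund': [2, 1, 1, 2] * 50,
    'Fines': [3200.0, 6500.0, 2100.0, 2000.0] * 50,
    'Year': [1989, 1995, 1984, 2015] * 50,
})


def with_apply(frame):
    # Row by row: pandas builds a Series object for every row.
    return frame.apply(lambda row: row['Fines'] / row['Refund'] * row['Year'],
                       axis='columns')


def with_series(frame):
    # Vectorized arithmetic on whole columns.
    return frame['Fines'] / frame['Refund'] * frame['Year']


def with_values(frame):
    # Vectorized arithmetic on the underlying NumPy arrays.
    return frame['Fines'].values / frame['Refund'].values * frame['Year'].values


# All three approaches should agree on the result.
same_result = (np.allclose(with_apply(sample), with_series(sample))
               and np.allclose(with_series(sample), with_values(sample)))

# Rough timings in seconds for 20 calls each; the numbers vary by machine.
timings = {func.__name__: timeit.timeit(lambda func=func: func(sample), number=20)
           for func in (with_apply, with_series, with_values)}
print(same_result, timings)
```

The functions here return the result instead of assigning it to a column, which keeps the comparison free of side effects on the frame.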


## Indexing:
- Measure the time using the magic command %%timeit in the cell
- Get a row for a specific CarNumber, for example, 'O136HO197RUS'

In [21]:
%%timeit

df[df['CarNumber'] == 'O136HO197RUS']

198 µs ± 5.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


- Set CarNumber as the index of your dataframe

In [25]:
df.set_index('CarNumber', inplace=True)
df

CarNumber,Refund,Fines,Make,Model,Year,Calculated
Y163O8161RUS,2,3200.0,Ford,Focus,1989,3.182400e+06
E432XX77RUS,1,6500.0,Toyota,Camry,1995,1.296750e+07
7184TT36RUS,1,2100.0,Ford,Focus,1984,4.166400e+06
X582HE161RUS,2,2000.0,Ford,Focus,2015,2.015000e+06
92918M178RUS,1,5700.0,Ford,Focus,2014,1.147980e+07
...,...,...,...,...,...,...
K089PY178RUS,1,1234.0,Ford,Mustang,1969,2.429746e+06
C718MC178RUS,2,4321.0,Ford,Mustang,1969,4.254024e+06
K361KA178RUS,3,2345.0,Ford,Mustang,1969,1.539102e+06
O432AB178RUS,4,5432.0,Ford,Mustang,1969,2.673902e+06


- Again, get a row for the same CarNumber

In [27]:
%%timeit

df.loc['O136HO197RUS']

94.3 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
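
The two lookups return the same data in different shapes: the boolean mask gives a dataframe, while .loc gives a Series when the label occurs only once in the index. A small sketch, using the first three rows from the head() output above in a frame with the hypothetical name cars:

```python
import pandas as pd

# First three rows of the data shown earlier; the frame name is made up.
cars = pd.DataFrame({
    'CarNumber': ['Y163O8161RUS', 'E432XX77RUS', '7184TT36RUS'],
    'Refund': [2, 1, 1],
    'Fines': [3200.0, 6500.0, 2100.0],
})

# Boolean mask: compares every CarNumber and returns a DataFrame.
by_mask = cars[cars['CarNumber'] == 'E432XX77RUS']

# Index lookup: set_index returns a new frame here, then .loc finds the label.
indexed = cars.set_index('CarNumber')
by_label = indexed.loc['E432XX77RUS']

print(type(by_mask).__name__, type(by_label).__name__)
print(by_label['Fines'])
```

If a CarNumber appears more than once in the index, .loc returns a dataframe with all matching rows instead of a single Series.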


## Downcasting:
- Run df.info(memory_usage='deep'), pay attention to the Dtype and the memory usage

In [28]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to X023HA178RUS
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Refund      930 non-null    int64  
 1   Fines       930 non-null    float64
 2   Make        930 non-null    object 
 3   Model       919 non-null    object 
 4   Year        930 non-null    int64  
 5   Calculated  930 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 236.0 KB


- Make a copy() of your initial dataframe into another dataframe optimized
- Downcast from float64 to float32 for all the columns
- Downcast from int64 to the smallest numerical dtype possible

In [33]:
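# Work on a copy so the initial dataframe keeps its original dtypes
copy = df.copy()
fcols = copy.select_dtypes('float').columns
icols = copy.select_dtypes('integer').columns

# pd.to_numeric picks the smallest dtype of the requested kind that can hold the values
copy[fcols] = copy[fcols].apply(pd.to_numeric, downcast='float')
copy[icols] = copy[icols].apply(pd.to_numeric, downcast='integer')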

- Run info(memory_usage='deep') for your new dataframe, pay attention to the Dtype and the memory usage

In [34]:
copy.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to X023HA178RUS
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Refund      930 non-null    int8   
 1   Fines       930 non-null    float32
 2   Make        930 non-null    object 
 3   Model       919 non-null    object 
 4   Year        930 non-null    int16  
 5   Calculated  930 non-null    float64
dtypes: float32(1), float64(1), int16(1), int8(1), object(2)
memory usage: 220.6 KB
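
In the output above, Calculated is still float64. A likely explanation, stated here as an assumption about pd.to_numeric rather than something the notebook demonstrates, is that downcast='float' only converts a column when its values survive the conversion to float32 closely enough. If the rounding error is acceptable, astype forces the conversion. A minimal sketch with two values recomputed from rows shown earlier (the frame name demo is hypothetical):

```python
import pandas as pd

# Fines / Refund * Year for two rows from the table above.
demo = pd.DataFrame({'Calculated': [3200.0 / 2 * 1989, 2345.0 / 3 * 1969]})

# to_numeric may leave the column as float64 if float32 would lose too much precision.
soft = pd.to_numeric(demo['Calculated'], downcast='float')

# astype converts unconditionally and accepts the rounding error.
forced = demo['Calculated'].astype('float32')

print(soft.dtype, forced.dtype)
# Bytes used by the values alone, before and after the forced conversion.
print(demo['Calculated'].memory_usage(index=False), forced.memory_usage(index=False))
```

Whether the forced conversion is appropriate depends on how much precision the Calculated column needs downstream.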


## Categories:
- Change the object type columns to the type category

In [35]:
obj_cols = df.select_dtypes(['object']).columns
df[obj_cols] = df[obj_cols].astype('category')

- Check the memory usage again; it will probably be about 2-3 times lower than for the initial dataframe

In [36]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to X023HA178RUS
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Refund      930 non-null    int64   
 1   Fines       930 non-null    float64 
 2   Make        930 non-null    category
 3   Model       919 non-null    category
 4   Year        930 non-null    int64   
 5   Calculated  930 non-null    float64 
dtypes: category(2), float64(2), int64(2)
memory usage: 127.3 KB
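
The saving comes from storing each distinct string once and replacing the column values with small integer codes. A sketch on a synthetic column (values borrowed from the Make column above and repeated to make the effect visible; the variable names are made up):

```python
import pandas as pd

# 1000 strings, but only two distinct values.
makes = pd.Series(['Ford', 'Toyota', 'Ford', 'Ford'] * 250,
                  name='Make', dtype='object')
as_category = makes.astype('category')

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = makes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(list(as_category.cat.categories))   # the distinct values, stored once
print(as_category.cat.codes.dtype)        # integer codes that replace the strings
print(object_bytes, category_bytes)
```

The fewer distinct values a column has relative to its length, the larger the saving tends to be.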


## Memory clean

- Using %reset_selective and the gc library, free the memory of your initial dataframe only

In [37]:
%reset_selective df

gc.collect()

Once deleted, variables cannot be recovered. Proceed (y/[n])?  y


2641
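
Outside IPython, where %reset_selective is not available, a comparable effect can be sketched with del and gc.collect(). The frame below is a hypothetical stand-in, and the weak reference is only there to check that the object is really gone:

```python
import gc
import weakref

import pandas as pd

frame = pd.DataFrame({'Fines': [3200.0, 6500.0]})
probe = weakref.ref(frame)   # a weak reference does not keep the frame alive

del frame                    # drop the only name bound to the object
collected = gc.collect()     # number of unreachable objects found; varies per run

print(probe() is None)
```

The number returned by gc.collect(), like the 2641 above, depends on the state of the session and carries no meaning beyond that run.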