In [1]:
import pandas as pd
import gc

# 1. Read the fines.csv file that you saved in the previous exercise.

In [2]:
df = pd.read_csv('../data/fines.csv')
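def calculate_new_value(row):
    """Return fines / refund * year for one row, or None if it cannot be computed."""
    # NumPy floats return inf or nan on division by zero instead of raising,
    # so the except clause mainly guards against non-numeric values.
    try:
        return row['Fines'] / row['Refund'] * row['Year']
    except (ZeroDivisionError, TypeError):
        return None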

# 2. Iterations: in all the following subtasks, you need to calculate `fines/refund*year` for each row. Create a new column with the calculated data. Measure the time using the magic command `%%timeit` in the cell.
- Write a function that loops through the dataframe using `for i in range(0, len(df))`, `iloc`, and `append()` to a list. Assign the result of the function to a new column in the dataframe.

In [3]:
%%timeit
result_list = []
for i in range(len(df)):
    row = df.iloc[i]
    result_list.append(calculate_new_value(row))
df['strange'] = result_list

29.8 ms ± 481 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


- Do it using `iterrows()`.

In [4]:
%%timeit
result_list = []
for index, row in df.iterrows():
    result_list.append(calculate_new_value(row))
df['strange'] = result_list

28.5 ms ± 458 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


- Do it using `apply()` and a lambda function.

In [5]:
%%timeit
df['strange'] = df.apply(lambda row: calculate_new_value(row), axis=1)

6.02 ms ± 179 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


- Do it using `Series` objects from the dataframe.

In [6]:
%%timeit
df['strange'] = df['Fines'] / df['Refund'] * df['Year']

145 μs ± 2.61 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


- Do it as in the previous subtask, but use the `.values` attribute to work on the underlying NumPy arrays.

In [7]:
%%timeit
df['strange'] = df['Fines'].values / df['Refund'].values * df['Year'].values

67.4 μs ± 840 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
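
All the approaches above should produce the same column; only the cost differs. The following is a minimal sketch, assuming a small synthetic frame with the same `Fines`, `Refund`, and `Year` columns (the values are made up, not taken from `fines.csv`), that checks the row-wise and vectorized versions agree. It uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for fines.csv; the numbers are illustrative only.
demo = pd.DataFrame({
    'Fines': [3000.0, 6000.0, 500.0, 2000.0],
    'Refund': [1.0, 2.0, 2.0, 1.0],
    'Year': [2011.0, 1987.0, 2009.0, 2015.0],
})

# Row-wise version: one Python-level call per row.
by_apply = demo.apply(lambda r: r['Fines'] / r['Refund'] * r['Year'], axis=1)

# Vectorized on Series: pandas aligns the indexes, then delegates to NumPy.
by_series = demo['Fines'] / demo['Refund'] * demo['Year']

# Vectorized on raw arrays: no index alignment is involved.
by_array = (demo['Fines'].to_numpy() / demo['Refund'].to_numpy()
            * demo['Year'].to_numpy())

same_result = (np.allclose(by_apply, by_series)
               and np.allclose(by_series, by_array))
print(same_result)
```

A check like this is worth running once before comparing timings, since a faster variant is only useful if it returns the same values.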


# 3. Indexing: measure the time using the magic command `%%timeit` in the cell.
- Get a row for a specific `CarNumber`, for example, `'O136HO197RUS'`.

In [8]:
%%timeit
row = df[df['CarNumber'] == 'O136HO197RUS']

180 μs ± 2.21 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


- Set the index in your dataframe with `CarNumber`.

In [9]:
df_indexed = df.set_index('CarNumber')

- Again, get a row for the same `CarNumber`.

In [10]:
%%timeit
df_indexed.loc['O136HO197RUS']

78.4 μs ± 416 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
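
The two lookups are not always interchangeable: a boolean mask always returns a DataFrame, while `.loc` on the index returns a Series for a label that occurs once and a DataFrame for a label that occurs several times. A minimal sketch, assuming hypothetical car numbers that are not from `fines.csv`:

```python
import pandas as pd

# Hypothetical plates; 'A001AA77RUS' appears twice on purpose.
demo = pd.DataFrame({
    'CarNumber': ['A001AA77RUS', 'B002BB77RUS', 'A001AA77RUS'],
    'Fines': [100.0, 200.0, 300.0],
})

# Boolean mask: always a DataFrame, and CarNumber stays a column.
by_mask = demo[demo['CarNumber'] == 'A001AA77RUS']

# Index lookup: CarNumber becomes the index.
indexed = demo.set_index('CarNumber')
repeated = indexed.loc['A001AA77RUS']   # label occurs twice -> DataFrame
single = indexed.loc['B002BB77RUS']     # label occurs once  -> Series

print(type(repeated).__name__, type(single).__name__)
print(by_mask['Fines'].tolist() == repeated['Fines'].tolist())
```

Code that consumes the result should therefore not assume a fixed return type when the index may contain duplicates.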


# 4. Downcasting:
- Run `df.info(memory_usage='deep')`, and pay attention to the Dtype and memory usage.

In [11]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CarNumber  930 non-null    object 
 1   Refund     930 non-null    float64
 2   Fines      930 non-null    float64
 3   Make       930 non-null    object 
 4   Model      919 non-null    object 
 5   Year       930 non-null    float64
 6   strange    930 non-null    float64
dtypes: float64(4), object(3)
memory usage: 182.1 KB


- Make a `copy()` of your initial dataframe into another dataframe, `optimized_df`.

In [12]:
optimized_df = df.copy()

- Downcast from `float64` to `float32` for all columns.

In [13]:
float_columns = optimized_df.select_dtypes(include=['float64']).columns
for col in float_columns:
    optimized_df[col] = pd.to_numeric(optimized_df[col], downcast='float')

- Downcast from `int64` to the smallest numerical Dtype possible.

In [14]:
int_columns = optimized_df.select_dtypes(include=['int64']).columns
for col in int_columns:
    optimized_df[col] = pd.to_numeric(optimized_df[col], downcast='integer')

- Run `info(memory_usage='deep')` for your new dataframe. Pay attention to the Dtype and memory usage.

In [15]:
optimized_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CarNumber  930 non-null    object 
 1   Refund     930 non-null    float32
 2   Fines      930 non-null    float32
 3   Make       930 non-null    object 
 4   Model      919 non-null    object 
 5   Year       930 non-null    float32
 6   strange    930 non-null    float64
dtypes: float32(3), float64(1), object(3)
memory usage: 171.2 KB
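
The saving here is small because only three columns changed: 3 columns × 930 rows × 4 bytes saved is about 10.9 KB, which matches the drop from 182.1 KB to 171.2 KB. `info()` reports only the total, so a per-column breakdown helps show where the rest of the memory sits. A minimal sketch, assuming a tiny synthetic frame and using an unconditional `astype('float32')` cast in place of `pd.to_numeric(..., downcast='float')`:

```python
import pandas as pd

# Synthetic data for illustration only.
demo = pd.DataFrame({
    'Make': ['Ford', 'Toyota', 'Ford', 'Skoda'],
    'Refund': [1.0, 2.0, 2.0, 1.0],
})

# Bytes per column; deep=True also counts the Python string objects.
before = demo.memory_usage(deep=True)

smaller = demo.copy()
smaller['Refund'] = smaller['Refund'].astype('float32')
after = smaller.memory_usage(deep=True)

# Four float64 values take 32 bytes, four float32 values take 16.
print(before['Refund'], after['Refund'])
print(before['Make'], after['Make'])
```

On the real dataframe, `df.memory_usage(deep=True)` should show the `object` columns taking most of the space, which is what the next step addresses.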


# 5. Categories:
- Change the `object` type columns to `category`.

In [16]:
object_columns = optimized_df.select_dtypes(include=['object']).columns
for col in object_columns:
    optimized_df[col] = optimized_df[col].astype('category')

- This time, check the memory usage. It will probably decrease by 2–3 times compared to the initial dataframe.

In [17]:
optimized_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   CarNumber  930 non-null    category
 1   Refund     930 non-null    float32 
 2   Fines      930 non-null    float32 
 3   Make       930 non-null    category
 4   Model      919 non-null    category
 5   Year       930 non-null    float32 
 6   strange    930 non-null    float64 
dtypes: category(3), float32(3), float64(1)
memory usage: 71.4 KB
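
How much `category` saves depends on how often values repeat: the column is stored as small integer codes plus one copy of each distinct value. A minimal sketch, assuming synthetic columns (one with four distinct values, one with all values distinct), compares the two storage forms. The `sizes` dictionary is a helper introduced only for this illustration.

```python
import pandas as pd

n = 1000
demo = pd.DataFrame({
    # Few distinct values, many repeats: a good fit for category.
    'Make': ['Ford', 'Toyota', 'Skoda', 'Volkswagen'] * (n // 4),
    # Every value distinct: the categories table is as long as the column.
    'CarNumber': [f'X{i:06d}RUS' for i in range(n)],
})

sizes = {}
for col in demo.columns:
    sizes[col] = {
        'unique_ratio': demo[col].nunique() / len(demo),
        'object': demo[col].astype(object).memory_usage(deep=True, index=False),
        'category': demo[col].astype('category').memory_usage(deep=True, index=False),
    }

for col, s in sizes.items():
    print(f"{col}: unique ratio {s['unique_ratio']:.2f}, "
          f"object {s['object']} B, category {s['category']} B")
```

A low unique ratio suggests a large saving; for a column where nearly every value is distinct, the conversion is unlikely to help much.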


# 6. Memory clean:
- Using the library `gc` and the command `%reset_selective`, clean the memory of your initial dataframe only.

In [18]:
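# The pattern is a regular expression matched with re.search, so a bare `df`
# would also remove df_indexed and optimized_df. Anchor it to match df only.
# The -f flag skips the confirmation prompt.
%reset_selective -f ^df$
gc.collect()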