## Pandas optimizations

This notebook:

* Loads data from fines.csv.
* Measures the execution time of several operations.
* Optimizes the code, including downcasting (reducing memory usage).

#### 1. Loading data from ex04

In [2]:
import pandas as pd
import gc 

df = pd.read_csv("../ex04/fines.csv")

df.head()

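|   | Refund | Fines | Make | Model | Year |
|---|--------|-------|------|-------|------|
| 0 | 1 | 6500.0 | Toyota | Camry | 1989 |
| 1 | 1 | 2100.0 | Ford | Focus | 1995 |
| 2 | 2 | 2000.0 | Ford | Focus | 1984 |
| 3 | 1 | 7458.528951 | Ford | Focus | 2015 |
| 4 | 2 | 6000.0 | Ford | Focus | 2014 |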


#### 2. Creating a new column and measuring time

✅ Method 1: for loop + iloc + append()

In [3]:
%%timeit
def calc_with_loop(df):
    result = []
    for i in range(len(df)):
        value = df.iloc[i]["Fines"] / df.iloc[i]["Refund"] * df.iloc[i]["Year"]
        result.append(value)
    return result

df["Calc_Loop"] = calc_with_loop(df)

25.1 ms ± 108 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
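
The `%%timeit` magic only works inside IPython or Jupyter. As a rough sketch of how a similar measurement could be taken in a plain script, the standard-library `timeit` module can repeat a call several times. The `demo` frame and the `vectorized` function below are hypothetical stand-ins, not part of the original notebook.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for fines.csv with the same column names.
demo = pd.DataFrame({
    "Refund": [1, 1, 2, 1, 2] * 100,
    "Fines": [6500.0, 2100.0, 2000.0, 7458.5, 6000.0] * 100,
    "Year": [1989, 1995, 1984, 2015, 2014] * 100,
})


def vectorized():
    # Same formula as in the notebook, applied to whole columns at once.
    return demo["Fines"] / demo["Refund"] * demo["Year"]


# 7 runs of 100 calls each, similar in spirit to the "7 runs" that %%timeit reports.
calls_per_run = 100
runs = timeit.repeat(vectorized, repeat=7, number=calls_per_run)
per_call = [total / calls_per_run for total in runs]

print(f"best: {min(per_call) * 1e6:.1f} µs per call")
```

Absolute numbers depend on the machine and the pandas version, so only the relative ordering of the methods is meaningful.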


✅ Method 2: iterrows()

In [4]:
%%timeit
def calc_with_iterrows(df):
    result = []
    for _, row in df.iterrows():
        result.append(row["Fines"] / row["Refund"] * row["Year"])
    return result

df["Calc_Iterrows"] = calc_with_iterrows(df)

8.74 ms ± 33.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


✅ Method 3: apply() with a lambda

In [5]:
%%timeit
df["Calc_Apply"] = df.apply(lambda row: row["Fines"] / row["Refund"] * row["Year"], axis=1)

2.47 ms ± 3.37 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


✅ Method 4: Using Pandas Series

In [6]:
%%timeit
df["Calc_Series"] = df["Fines"] / df["Refund"] * df["Year"]

61.1 μs ± 2.78 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


✅ Method 5: .values (NumPy)

In [7]:
%%timeit
df["Calc_Values"] = (df["Fines"].values / df["Refund"].values) * df["Year"].values

32.6 μs ± 43.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
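
The timings above are only comparable if every method produces the same column. The following is a minimal check on synthetic data, assuming the same column names as fines.csv; the `demo` frame and its random values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for fines.csv (hypothetical values, same column names).
rng = np.random.default_rng(0)
n = 1_000
demo = pd.DataFrame({
    "Refund": rng.integers(1, 3, size=n),            # 1 or 2, never zero
    "Fines": rng.uniform(100.0, 10_000.0, size=n),
    "Year": rng.integers(1980, 2020, size=n),
})

# Row-wise: the lambda is called once per row.
via_apply = demo.apply(lambda row: row["Fines"] / row["Refund"] * row["Year"], axis=1)

# Column-wise on Series: pandas aligns indexes, then computes in bulk.
via_series = demo["Fines"] / demo["Refund"] * demo["Year"]

# Column-wise on raw NumPy arrays: no index alignment at all.
via_numpy = demo["Fines"].to_numpy() / demo["Refund"].to_numpy() * demo["Year"].to_numpy()

all_match = np.allclose(via_apply, via_series) and np.allclose(via_series, via_numpy)
print(all_match)
```

The sketch uses `.to_numpy()` rather than `.values`; both return the underlying array for these numeric columns.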


#### 3. Indexing

In [8]:
del df
gc.collect()

0

In [9]:
df = pd.read_csv("../ex04/owners.csv")

✅ 1. Accessing a row by CarNumber without an index

In [10]:
%%timeit
df[df["CarNumber"] == "O136HO197RUS"]

65.3 μs ± 259 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


✅ 2. Setting CarNumber as the index

In [11]:
df.set_index("CarNumber", inplace=True)

✅ 3. Accessing a row after setting the index

In [12]:
%%timeit
df.loc["O136HO197RUS"]

7.16 μs ± 5.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
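
The two lookups are not fully interchangeable: the boolean mask always returns a DataFrame, while `.loc` on a unique index returns a single row as a Series. Here is a small sketch of the difference, using a hypothetical three-row stand-in for owners.csv. Only the `CarNumber` column name comes from the notebook; the `Owner` column and all values are invented.

```python
import pandas as pd

# Hypothetical stand-in for owners.csv.
owners = pd.DataFrame({
    "CarNumber": ["A001AA77RUS", "B002BB77RUS", "C003CC77RUS"],
    "Owner": ["Ivanov", "Petrov", "Sidorov"],
})

# Boolean mask: compares every row and returns a (possibly empty) DataFrame.
by_mask = owners[owners["CarNumber"] == "B002BB77RUS"]

# set_index without inplace returns a new frame and leaves `owners` untouched.
indexed = owners.set_index("CarNumber")

# Label lookup: with a unique index, .loc returns one row as a Series.
by_label = indexed.loc["B002BB77RUS"]

print(indexed.index.is_unique)
print(type(by_mask).__name__, type(by_label).__name__)
```

If the index contains duplicate labels, `.loc` returns a DataFrame instead, so checking `index.is_unique` first is a cheap safeguard.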


#### 4. Memory optimization (downcasting)

In [13]:
del df
gc.collect()
df = pd.read_csv("../ex04/fines.csv")

✅ 1. Memory usage before optimization

In [14]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687 entries, 0 to 686
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Refund  687 non-null    int64  
 1   Fines   687 non-null    float64
 2   Make    687 non-null    object 
 3   Model   676 non-null    object 
 4   Year    687 non-null    int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 88.4 KB


✅ 2. Converting numeric types

In [15]:
df_optimized = df.copy()

df_optimized["Fines"] = pd.to_numeric(df_optimized["Fines"], downcast="float")
df_optimized["Refund"] = pd.to_numeric(df_optimized["Refund"], downcast="integer")
df_optimized["Year"] = pd.to_numeric(df_optimized["Year"], downcast="integer")

print("Before optimization:")
df.info(memory_usage="deep")

print("\nAfter optimization:")
df_optimized.info(memory_usage="deep")

Before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687 entries, 0 to 686
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Refund  687 non-null    int64  
 1   Fines   687 non-null    float64
 2   Make    687 non-null    object 
 3   Model   676 non-null    object 
 4   Year    687 non-null    int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 88.4 KB

After optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687 entries, 0 to 686
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Refund  687 non-null    int8   
 1   Fines   687 non-null    float32
 2   Make    687 non-null    object 
 3   Model   676 non-null    object 
 4   Year    687 non-null    int16  
dtypes: float32(1), int16(1), int8(1), object(2)
memory usage: 77.0 KB
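
After the numeric downcast, the two `object` columns still account for most of the remaining memory. One possible next step, not performed in this notebook, is to convert repetitive string columns to the `category` dtype. The sketch below assumes that `Make` and `Model` hold only a few distinct values; the `demo` frame is a made-up stand-in for fines.csv.

```python
import pandas as pd

# Hypothetical stand-in for fines.csv with heavily repeated strings.
demo = pd.DataFrame({
    "Refund": [1, 2, 1, 2] * 250,
    "Fines": [6500.0, 2100.0, 2000.0, 6000.0] * 250,
    "Make": ["Toyota", "Ford", "Ford", "Ford"] * 250,
    "Model": ["Camry", "Focus", "Focus", None] * 250,
    "Year": [1989, 1995, 1984, 2014] * 250,
})

before = demo.memory_usage(deep=True).sum()

slim = demo.copy()
# Same numeric downcasting as in the cell above.
slim["Refund"] = pd.to_numeric(slim["Refund"], downcast="integer")
slim["Year"] = pd.to_numeric(slim["Year"], downcast="integer")
slim["Fines"] = pd.to_numeric(slim["Fines"], downcast="float")

# Assumption: few distinct values, so each string is stored once
# and the rows keep only small integer codes.
for column in ["Make", "Model"]:
    slim[column] = slim[column].astype("category")

after = slim.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Two caveats apply. `float32` keeps roughly seven significant decimal digits, so a value such as 7458.528951 is rounded when downcast. The `category` dtype helps only when values repeat; for a column of mostly unique strings it can increase memory usage.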


#### 5. Memory cleanup

In [16]:
df_optimized.to_csv("optimized_fines.csv", index=False)
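
Note that CSV is a plain-text format and does not store dtypes, so a file written this way is read back with the default 64-bit types unless they are requested explicitly. Below is a minimal sketch of the round trip, using an in-memory buffer in place of a real file; the `slim` frame is a hypothetical three-row example.

```python
import io

import pandas as pd

# Hypothetical, already downcast frame.
slim = pd.DataFrame({
    "Refund": pd.Series([1, 2, 1], dtype="int8"),
    "Fines": pd.Series([6500.0, 2100.0, 2000.0], dtype="float32"),
    "Year": pd.Series([1989, 1995, 1984], dtype="int16"),
})

# Write to an in-memory buffer so the sketch does not touch the disk.
buffer = io.StringIO()
slim.to_csv(buffer, index=False)

# Plain read: dtypes are inferred again from the text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Typed read: the compact dtypes are requested explicitly.
buffer.seek(0)
typed = pd.read_csv(
    buffer,
    dtype={"Refund": "int8", "Fines": "float32", "Year": "int16"},
)

print(plain.dtypes.to_dict())
print(typed.dtypes.to_dict())
```

Passing `dtype=` to `pd.read_csv` also avoids building the wider 64-bit columns first, which matters more as files grow.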

In [17]:
del df, df_optimized
gc.collect()

0