# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Let's optimize</p>

> The objective of this NB is to showcase some **optimizations (processing & memory)** when working with Pandas

> The NB mainly covers optimization techniques that use **Pandas** and **NumPy**

> I am also planning to add **Python optimization** techniques & maybe something on **ML & NLP models optimization** depending on the feedback this NB receives

-----------------

<div class="alert alert-success" role="info">
<p>
<li> I am no expert in optimization; the whole point of this NB is to optimize code, especially when building data pipelines<br>
    
<li>Please feel free to share any feedback or corrections </li> <br>
    
<li>For a few use cases I have used <b>TPS March 2022</b>, <b>Ubiquant market prediction data</b> and others, just to make sure these optimizations work on real data as well </li>
</p>
</div>


<div class="alert alert-info" role="info">
Tricks with <b>emojis</b> must not be missed
</div>

In [None]:
import pandas as pd

import numpy as np
from matplotlib import pyplot as plt
import os

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> 📌Datatypes</p>


We generally don't pay much attention to datatypes in datasets, but using the correct datatypes can save a lot of memory and time while loading and working with them


The following tricks can be used to optimize via correct datatypes

- Float & Int datatype
- datetime vs string
- object vs category
--------------

This section uses the `ubiquant-market-prediction` dataset:
https://www.kaggle.com/c/ubiquant-market-prediction

### Float & Int datatype


- Downcasting `float` and `int` datatypes can give a large memory saving
- Based on the range of the data and the use case, we can choose how far to downcast

-------------------
For **Int**, the following are the ranges each integer precision can store

```
int8 can store integers from -128 to 127
int16 can store integers from -32768 to 32767
int32 can store integers from -2147483648 to 2147483647
int64 can store integers from -9223372036854775808 to 9223372036854775807
```

-------------------
**Float** comes in `float16`, `float32` and `float64` precisions (NumPy has no `float8`)

ref : https://numpy.org/doc/stable/user/basics.types.html
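
The ranges above do not have to be memorized, since NumPy can report them. The following is a minimal sketch, assuming only NumPy and pandas are available; the small series `s` is made-up illustration data, not part of the Ubiquant dataset.

```python
import numpy as np
import pandas as pd

# Ask NumPy for the representable range of each integer width
for dt in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dt)
    print(f"{np.dtype(dt).name}: {info.min} to {info.max}")

# Floats trade range and precision for size
for dt in (np.float16, np.float32, np.float64):
    info = np.finfo(dt)
    print(f"{np.dtype(dt).name}: max={info.max}, ~{info.precision} decimal digits")

# Illustration data: let pandas pick the smallest integer dtype that fits
s = pd.Series([1, 50, 120], dtype="int64")
s_small = pd.to_numeric(s, downcast="integer")
print(s.dtype, "->", s_small.dtype)
```

`pd.to_numeric(..., downcast=...)` is handy when the value range is not known in advance; when it is known, passing an explicit `dtype` mapping to `pd.read_csv` avoids ever creating the 64-bit columns.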



<div class="alert alert-success" role="info">
<p>
Loading data with downcasted dtypes can save a lot of space and time    
</p>
</div>


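- As the cells below show, using **lower precision** floats and ints can easily cut memory by **2X - 4X**
- Lower precision is not free: `float16` keeps only about 3 significant decimal digits, so check that the loss is acceptable for your use case
> The **64-bit (df64)** data is about **4x** the size of the **16-bit (df16)** data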

In [None]:
data_sample = pd.read_csv("../input/ubiquant-market-prediction/train.csv", nrows = 10)
float_col_list = data_sample.filter(like = "f_").columns.tolist()

# 32 bit conversion mapping
fp32_dtypes = {x:y for x,y in zip(float_col_list, ["float32"]*len(float_col_list))}
fp32_dtypes.update({"time_id" : "int32", "investment_id" : "int32", "target" : "float32"})

# 16 bit conversion mapping
fp16_dtypes = {x:y for x,y in zip(float_col_list, ["float16"]*len(float_col_list))}
fp16_dtypes.update({"time_id" : "int16", "investment_id" : "int16", "target" : "float16"})

In [None]:
df64 = pd.read_csv("../input/ubiquant-market-prediction/train.csv", nrows = 10000)
print("df size with int64/float64 in MB:",df64.memory_usage(deep = True).sum()/(1024**2))

In [None]:
df32 = pd.read_csv("../input/ubiquant-market-prediction/train.csv", dtype = fp32_dtypes ,nrows = 10000)
print("df size with int32/float32 in MB:",df32.memory_usage(deep = True).sum()/(1024**2))

In [None]:
df16 = pd.read_csv("../input/ubiquant-market-prediction/train.csv", dtype = fp16_dtypes ,nrows = 10000)
print("df size with int16/float16 in MB:",df16.memory_usage(deep = True).sum()/(1024**2))

### Datetime datatype

- Generally, when we load csv data, **datetime** data gets loaded as **object** dtype
- Converting it to a datetime dtype with `pd.to_datetime` can save a sizeable amount of memory
- As with int/float downcasting, the conversion can also be requested at load time, via the `parse_dates` argument of `pd.read_csv`
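
Parsing dates while loading can be sketched as follows. This is an illustration under assumptions: the two-row CSV text is made up and only stands in for a real file with a timestamp column named `time`.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for a real file with a timestamp column
csv_text = "time,value\n1991-04-01 00:00:00,1\n1991-04-01 00:20:00,2\n"

# Default load: the timestamps stay as strings
as_text = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the listed columns to datetime while reading
as_datetime = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

print(as_text["time"].dtype, "->", as_datetime["time"].dtype)
print(as_text["time"].memory_usage(deep=True), "bytes ->",
      as_datetime["time"].memory_usage(deep=True), "bytes")
```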

----------------------

Using the `TPS March 2022` data: https://www.kaggle.com/c/tabular-playground-series-mar-2022/data


**Memory reduction : 61 MB to 6.5 MB (~90% reduction)**

In [None]:
# tabular playground - mar 2022 data
tps_march = pd.read_csv("../input/tabular-playground-series-mar-2022/train.csv")
tps_march.head(2)

In [None]:
print("time column dtype :", tps_march.time.dtype)

In [None]:
print("size of time column in MB (when loaded as object dtype) : ",tps_march.time.memory_usage(deep = True)/(1024**2))

In [None]:
# converting time column to datetime from object
tps_march["time"] = pd.to_datetime(tps_march.time)
print("size of time column in MB (when loaded as datetime dtype) : ",tps_march.time.memory_usage(deep = True)/(1024**2))

### Category datatype

- The category dtype is a must whenever string data has a **small number of unique values**
- It does not provide much gain when the **number of unique values is high**


<div class="alert alert-success" role="info">
<p>
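There is no magic in how the size gets reduced; it comes down to how the data is stored in memory: each distinct value is stored once, and the column itself only holds small integer codes that point to it    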
</p>
</div>

ref : https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
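
To see what the category dtype actually stores, the sketch below inspects the categories and the integer codes of a made-up column (the values `EB`, `NB`, `SB` are illustrative placeholders, not read from the TPS data).

```python
import numpy as np
import pandas as pd

# Made-up column with only three distinct strings, repeated many times
s = pd.Series(np.tile(["EB", "NB", "SB", "EB"], 25_000))
cat = s.astype("category")

# Each distinct string is stored once ...
print(list(cat.cat.categories))
# ... and every row holds a small integer code pointing at it
print(cat.cat.codes.dtype, cat.cat.codes.head(4).tolist())

print("string bytes  :", s.memory_usage(deep=True))
print("category bytes:", cat.memory_usage(deep=True))
```

With few distinct values the codes fit in `int8`, i.e. one byte per row. With many distinct values the codes need wider integers and the category table itself grows, which is why the gain shrinks.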


----------------------

**Memory reduction : 47.7MB to 0.81MB (~98% reduction)**

In [None]:
# tabular playground - mar 2022 data
tps_march = pd.read_csv("../input/tabular-playground-series-mar-2022/train.csv")
tps_march.head(2)

In [None]:
print("direction column dtype :", tps_march.direction.dtype)

In [None]:
print("size of direction column in MB (when loaded as object dtype) :",tps_march.direction.memory_usage(deep = True)/(1024**2))

In [None]:
# converting direction column to category from object
tps_march["direction"] = tps_march.direction.astype("category")

print("size of direction column in MB (when loaded as category dtype) :",tps_march.direction.memory_usage(deep = True)/(1024**2))

> Dummy use case to test the above rule

```

v1 = np.random.choice(["1","2","3","4","5"], size = 100000)
v2  = np.random.choice(["XXL","XL","L","M","S"], size = 100000)

df_obj = pd.DataFrame(zip(v1, v2), columns = ["col1",'col2'])

print("df size in MB (with object dtypes) :", df_obj.memory_usage(deep = True).sum()/(1024**2))

df_cat = df_obj.apply(lambda x:x.astype("category"))

print("df size in MB (with category dtypes) :",df_cat.memory_usage(deep = True).sum()/(1024**2))
```

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Merge operation</p>

- A **merge** on the **dataframe's index** is usually faster than a merge on a non-index column
- A smart way of merging dataframes is to use the **joining key as index**
--------------

- Here, merging on the **index** is **1X - 5X** faster than merging on the column

In [None]:
np.random.seed(123)

nrows = 100000
ncols = 70

key = np.array(["id_" + str(x) for x in range(nrows)]).reshape(nrows,1)
values1 = np.random.rand(nrows,ncols)
values2 = np.random.rand(nrows,ncols)

In [None]:
df1 = pd.DataFrame(np.concatenate([key, values1],axis =1), columns = ["id"] + ["col_1" + str(x) for x in range(ncols)])
df2 = pd.DataFrame(np.concatenate([key, values2],axis =1), columns = ["id"] + ["col_2" + str(x) for x in range(ncols)])

In [None]:
print("df1")
display(df1.head(2))

print()

print("df2")
display(df2.head(2))

- Merging on column = "id"

In [None]:
%%time
out = df1.merge(df2, on = "id")

- Merging on index = "id"

In [None]:
# setting column = id as index
df1_ = df1.set_index("id")
df2_ = df2.set_index("id")

In [None]:
print("df1_")
display(df1_.head(2))

print()

print("df2_")
display(df2_.head(2))

In [None]:
%%time
out = df1_.merge(df2_, left_index = True, right_index = True)

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Chaining operations</p>


- When chaining multiple operations, the order matters. For example, it's faster to **filter first and then merge**
- This technique can give a large performance boost when the dataframe being filtered has duplicate keys

-------------

```
* df1 : 50k records
* df2 : 100k records

> df1 & df2 have 50k common records
```

**We can save ~30% time by filtering before the merge**

In [None]:
np.random.seed(123)

nrows = 100000
ncols = 50

key1 = np.array(["id_" + str(x) for x in range(nrows//2)]).reshape(nrows//2,1)
values1 = np.random.rand(nrows//2,ncols)

key2 = np.array(["id_" + str(x) for x in range(nrows)]).reshape(nrows,1)
values2 = np.random.rand(nrows,ncols)

In [None]:
df1 = pd.DataFrame(np.concatenate([key1, values1],axis =1), columns = ["id"] + ["col_1" + str(x) for x in range(ncols)])
df2 = pd.DataFrame(np.concatenate([key2, values2],axis =1), columns = ["id"] + ["col_2" + str(x) for x in range(ncols)])


print(f'df1 shape : {df1.shape} | df2 shape : {df2.shape}')

- Merging df1 & df2 directly

In [None]:
%%timeit
out = df1.merge(df2, on = "id")

- Merging df1 & df2 by filtering df2 first

In [None]:
# common ids - df1 & df2
req_id = set(df2.id) & set(df1.id)

In [None]:
%%timeit
out = df1.merge(df2[df2.id.isin(req_id)], on = "id")

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Apply</p> 

- Below are some good practices for applying functions with the **apply** method

- `Some of these might not always give you better performance but will help you in many cases`


-------------
> 1. Calling functions directly without lambda

> 2. Using `raw=True` for row/column-wise operations

In [None]:
# generating data
df = pd.DataFrame({"A" : np.random.choice(["a","abc","prdde","eass","rffd",
                                           "ashcs","rprf","a","","sca","cas"],1000000)})

df.head(4)

### Calling functions directly without lambda
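- Calling functions directly without using lambda gives you some additional performance gains, since the lambda adds an extra Python function call per element
- Note: `DataFrame.applymap` was renamed to `DataFrame.map` in pandas 2.1; the comparison is the same with either name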

- **case1**

>  using builtin `len` function

In [None]:
%%timeit -r 10 -n 10
df.applymap(lambda x:len(x))

In [None]:
%%timeit -r 10 -n 10
df.applymap(len)

- **case2**

>  using builtin `str.upper` function

In [None]:
%%timeit -r 10 -n 10
df.applymap(lambda x:str.upper(x))

In [None]:
%%timeit -r 10 -n 10
df.applymap(str.upper)

- **case3**

>  using custom  function - `sigmoid`

>  Performance gains might look small now but think about 1000 activations with sigmoid!!

In [None]:
def sigmoid(x):
    sig = 1/(1 + np.exp(-x))
    return sig

In [None]:
# generating data
np.random.seed(123)
df1 = pd.DataFrame({"B" : np.random.rand(1000000)})

df1.head(4)

In [None]:
%%timeit -r 10 -n 10
df1.applymap(lambda x:sigmoid(x))

In [None]:
%%timeit -r 10 -n 10
df1.applymap(sigmoid)

### Using Raw = True for transformations across row/columns

https://towardsdatascience.com/how-to-make-your-pandas-operation-100x-faster-81ebcd09265c

-------

**`raw=True` in apply** 

- Determines whether each row or column is passed as a `Series` or as an `ndarray` object
- If True, the function receives plain NumPy arrays, which `bypasses` the `overhead` of building a `Pandas Series object` for every row/column
- Using `raw=True` can give a good boost in performance, provided the function does not rely on Series features such as labels or `.index`
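
A quick way to confirm what `raw=True` changes is to record the type of the object handed to the function. This is a small sketch on a made-up 2x3 frame (`df_demo` and the two helper functions are hypothetical names introduced only for this illustration).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(6).reshape(2, 3))

seen_default, seen_raw = [], []

def row_sum_default(row):
    seen_default.append(type(row).__name__)  # record what pandas passed in
    return row.sum()

def row_sum_raw(row):
    seen_raw.append(type(row).__name__)
    return row.sum()

out_default = df_demo.apply(row_sum_default, axis=1)
out_raw = df_demo.apply(row_sum_raw, axis=1, raw=True)

print(set(seen_default), set(seen_raw))
print(out_default.tolist(), out_raw.tolist())
```

The results are identical; only the container passed to the function differs, which is where the time is saved.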

In [None]:
# generating data
np.random.seed(123)
df = pd.DataFrame(np.random.rand(10000,100))

df.head(4) 

- **case 1**

> calc. max across rows

In [None]:
%%timeit
df.apply(lambda x: max(x) , axis = 1)

In [None]:
%%timeit
df.apply(lambda x: max(x) , axis = 1, raw = True)

- **case 2**

> performing calc across rows

In [None]:
%%timeit
df.apply(lambda x: (x[0] + x[2] + x[4])**2/x[6] , axis = 1)

In [None]:
%%timeit
df.apply(lambda x: (x[0] + x[2] + x[4])**2/x[6] , axis = 1, raw = True)

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> 🚀Looping</p>

- We love looping, but there are better alternatives that can give up to **100,000X** performance gains 
- Did you say 100k X gains!😥

In [None]:
def celsius_to_faren(temp_c):
    return (temp_c * 9/5) + 32

### Iterrows 

- It generally works slightly better than using **df.iloc[i][j] within a for loop** 
- Takes **~30** sec to apply `celsius_to_faren` to **100k rows**

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])
df.head()

In [None]:
%%time
out = []
for index, row in df.iterrows():
    temp_f = celsius_to_faren(row)
    out.append(temp_f.values[0])

### Itertuples

- `itertuples` is **300-400X faster** than `iterrows`

----------------

<u> Why itertuples is faster </u>
> `itertuples()` makes far fewer function calls than `iterrows()` and does not build a Series for every row, so it carries much less overhead

ref1 : https://towardsdatascience.com/heres-the-most-efficient-way-to-iterate-through-your-pandas-dataframe-4dad88ac92ee

ref2 : https://medium.com/swlh/why-pandas-itertuples-is-faster-than-iterrows-and-how-to-make-it-even-faster-bc50c0edd30d
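
`itertuples` also has two documented parameters that trim a little more overhead: `index=False` drops the index from each tuple and `name=None` returns plain tuples instead of namedtuples. The sketch below uses a smaller made-up frame (`df_demo`) and does not measure the speedup, which will depend on your data.

```python
import numpy as np
import pandas as pd

def celsius_to_faren(temp_c):
    return (temp_c * 9/5) + 32

np.random.seed(123)
df_demo = pd.DataFrame(np.random.randint(1, 100, 1000), columns=["temp_celsius"])

# name=None yields plain tuples instead of namedtuples; index=False drops the index
out = [celsius_to_faren(temp_c)
       for (temp_c,) in df_demo.itertuples(index=False, name=None)]

print(out[:3])
```

The trade-off is readability: with plain tuples, columns are accessed by position rather than by name.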

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])

In [None]:
%%timeit
out = []
for row in df.itertuples():
    row_ = row.temp_celsius 
    temp_f = celsius_to_faren(row_)
    out.append(temp_f)

### apply

> `apply` is faster than the above methods because the loop over the values runs in **Cython**. The function itself is still called once per element in **Python**, so the gain depends on what the **lambda function** does
> - Apply is **600-800X faster** than iterrows



---------------
- ref1 : https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
- ref2 : https://realpython.com/fast-flexible-pandas/

Cython : https://cython.org/

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])

In [None]:
%%timeit
out = df.temp_celsius.apply(lambda x:celsius_to_faren(x))

### Eval

> - Eval is **15K times faster** than iterrows

- Eval is much more than an iterator; it can be used for any transformation on a dataframe
- Eval relies on the **Numexpr** package for faster processing

> The point of using eval() for expression evaluation rather than plain Python is two-fold: 
> 1. large DataFrame objects are evaluated more efficiently  
> 2. large arithmetic and boolean expressions are evaluated all at once by the underlying engine (by default numexpr is used for evaluation)


`You should not use eval() for simple expressions or for expressions involving small DataFrames`

---------------

Official pandas guide on **eval**:
https://pandas.pydata.org/docs/user_guide/enhancingperf.html#expression-evaluation-via-eval
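
Beyond returning a result, `DataFrame.eval` can assign a new column and reference local Python variables with `@`. The sketch below only illustrates the syntax on a tiny made-up frame (`df_demo`, `offset` and `temp_faren` are hypothetical names); as noted above, a frame this small will not benefit from `eval` performance-wise.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"temp_celsius": [0, 37, 100]})
offset = 32  # local Python variable, referenced inside the expression as @offset

# "name = expression" inside eval returns a copy of the frame with the new column
out = df_demo.eval("temp_faren = temp_celsius * 9 / 5 + @offset")
print(out)
```

Because `inplace` defaults to `False`, `df_demo` itself is left unchanged.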

In [None]:
!pip install numexpr -q

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])

- Eval on dataframe 

In [None]:
%%timeit -r 10 -n 200
out = df.eval("(temp_celsius*9/5) + 32")

- `pd.eval` on a standalone Series 

In [None]:
vals = df.temp_celsius

In [None]:
%%timeit -r 10 -n 200
out = pd.eval("(vals*9/5) + 32")

### Numexpr

- NumExpr is **50 - 80K times faster** than iterrows

- NumExpr is a fast numerical expression evaluator for NumPy

- NumExpr achieves better performance than NumPy because it avoids allocating memory for intermediate results. 

- This results in better `cache utilization` and `reduces memory access` in general. Due to this, NumExpr works best with large arrays
--------------

- `eval` uses the numexpr engine when it is installed, which is what makes it fast. If you don't want to call numexpr directly, just use `eval`

ref: https://github.com/pydata/numexpr#:~:text=NumExpr%20is%20a%20fast%20numerical,the%20same%20calculation%20in%20Python.

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])

In [None]:
import numexpr as ne

In [None]:
vals = df.temp_celsius

In [None]:
%%timeit
out = ne.evaluate("(vals*9/5) + 32")

### Vectorization on pandas series

- If possible, always try to vectorize your code. 
- Vectorization will almost always give you better results than element-wise pandas functions (like apply, etc.)
------------------

Numexpr beats vectorization here, but vectorization is still way faster than eval and the other pandas approaches above

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])

In [None]:
%%timeit
out = celsius_to_faren(df.temp_celsius)

### Vectorization on numpy arrays

- Finally we can see that vectorization performed on np arrays is **as good as numexpr**
- Always try vectorizing (with numpy arrays) to get maximum performance gains
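
Outside a notebook, where the `%%timeit` magic is not available, the same comparison can be sketched with the standard-library `timeit` module. This is an illustrative harness on a smaller made-up frame (`df_demo`); it first checks that the approaches agree, and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np
import pandas as pd

def celsius_to_faren(temp_c):
    return (temp_c * 9/5) + 32

np.random.seed(123)
df_demo = pd.DataFrame(np.random.randint(1, 100, 10_000), columns=["temp_celsius"])

out_apply = df_demo["temp_celsius"].apply(celsius_to_faren)
out_series = celsius_to_faren(df_demo["temp_celsius"])
out_numpy = celsius_to_faren(df_demo["temp_celsius"].to_numpy())

# All three approaches must agree before their timings are worth comparing
assert np.allclose(out_apply, out_series) and np.allclose(out_series, out_numpy)

candidates = [
    ("apply", lambda: df_demo["temp_celsius"].apply(celsius_to_faren)),
    ("vectorized Series", lambda: celsius_to_faren(df_demo["temp_celsius"])),
    ("vectorized ndarray", lambda: celsius_to_faren(df_demo["temp_celsius"].to_numpy())),
]
for label, fn in candidates:
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{label:>18}: {seconds * 1e6:.1f} us per call")
```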

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])

In [None]:
%%timeit
out = celsius_to_faren(df.temp_celsius.values)

### numba

- Numba is a **just-in-time (JIT) compiler** for Python specifically focused on code that runs in loops over NumPy arrays
- With numba, vectorized operations tend to become even faster 
- numba is **100k - 200k** times faster than iterrows

> Numba allows you to write a pure Python function which can be **JIT compiled** to native machine instructions, similar in performance to C, C++ and Fortran, by decorating your function with @jit.


---------------
> The @jit compilation will add overhead to the runtime of the function, so performance benefits may not be realized especially when using small data sets

- https://pandas.pydata.org/docs/user_guide/enhancingperf.html#numba-jit-compilation

In [None]:
from numba import njit

@njit
def celsius_to_faren_numba(temp_c):
    return (temp_c * 9/5) + 32

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,100,100000), columns = ["temp_celsius"])

In [None]:
%%timeit
out = celsius_to_faren_numba(df.temp_celsius.values)

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> 💥Appending</p> 

- Appending to a list is a faster operation than appending directly to a dataframe because

> 1. Initializing a dataframe is a slow process

> 2. Appending to a df involves a lot of `overhead`: every append `copies` the existing data into a new dataframe


-------------

- We are easily getting close to **1000X** performance gain

ref: 
https://stackoverflow.com/questions/27929472/improve-row-append-performance-on-pandas-dataframes#:~:text=append%20will%20be%20faster%20if,but%20it%20doesn't%20scale.&text=When%20we%20run%20this%20with,see%20much%20more%20dramatic%20results.&text=So%20we%20can%20see%20an,insert%20with%20a%20numpy%20array.
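
The same idea covers two common situations, sketched below with made-up data: rows that arrive one at a time, and data that arrives as small dataframes. In both cases the pieces are collected in a plain Python list and the dataframe is built once at the end (`df_rows`, `pieces` and `df_pieces` are names chosen for this illustration).

```python
import pandas as pd

# Case 1: rows arrive one at a time -> collect plain dicts, build the frame once
rows = [{"A": x, "B": y} for x, y in zip(range(1000), range(1000, 2000))]
df_rows = pd.DataFrame(rows)

# Case 2: data arrives as small frames -> collect them, concatenate once
pieces = [pd.DataFrame({"A": [x], "B": [x + 1000]}) for x in range(1000)]
df_pieces = pd.concat(pieces, ignore_index=True)

print(df_rows.shape, df_pieces.shape)
```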

- Appending directly to a dataframe

In [None]:
%%timeit
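df = pd.DataFrame()

for x,y in zip(range(1000),range(1000,2000)):
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    # row-by-row pattern and is just as slow, since it copies df on every call
    df = pd.concat([df, pd.DataFrame([{"A" : x , "B" : y}])]
                   , ignore_index = True)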

- Appending to a list & then creating a df
- **~1000X gains**

In [None]:
%%timeit
li = []
for x,y in zip(range(1000),range(1000,2000)):
    li.append([x, y])

df = pd.DataFrame(li, columns = ["A","B"])

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> Initialization</p>

We can also save some time in how we initialize variables (especially lists and dictionaries)

- li = list() vs li = [] 
- d = dict() vs d = {} 


-----------------
When building pipelines that initialize many data structures, these differences can add up 

**Initializing with a literal is faster than calling the constructor, e.g. li = [] is faster than li = list()**
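
The reason can be seen in the bytecode. In CPython, the literal compiles to a single `BUILD_LIST` instruction, while `list()` needs a name lookup followed by a function call. The sketch below prints the opcode names with the standard-library `dis` module; the exact names vary between Python versions, so treat the output as illustrative.

```python
import dis

# Opcode names generated for each way of creating an empty list
ops_literal = [ins.opname for ins in dis.get_instructions("li = []")]
ops_call = [ins.opname for ins in dis.get_instructions("li = list()")]

print("li = []     ->", ops_literal)
print("li = list() ->", ops_call)
```

The same reasoning applies to `{}` versus `dict()`.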

- **li = list() vs li = []**

In [None]:
%%timeit
li = list()

In [None]:
%%timeit
li = []

- **d = dict() vs d = {}**

In [None]:
%%timeit
d = dict()

In [None]:
%%timeit
d = {}

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Misc Tricks</p>

- replace vs df.loc[] == val
- df.at vs df.loc (for single-value access)

### Replace vs loc

> Replace is generally a better choice over df.loc for imputing/replacing constants in dataframes

- **using df.loc for imputation**

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.choice(list("abcdeffghijk"), (1000000,2)) , columns  = ["A","B"])

In [None]:
%%timeit
df.loc[df.A == "c", "A"] = "replace_val"

- **using replace for imputation**

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.choice(list("abcdeffghijk"), (1000000,2)) , columns  = ["A","B"])

In [None]:
%%timeit
df.A.replace("c" ,"replace_val", inplace = True)

### df.at vs df.loc

- **df.at** is generally a faster way to access a single value in a pandas dataframe 
- `df.at` only works with one row label and one column label at a time; it can't be used to select a range

https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2#:~:text=Pandas%20has%20optimized%20operations%20based,merging%20on%20indices%20is%20faster
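
The difference in scope between the two accessors can be sketched on a tiny made-up frame (`df_demo` is a hypothetical name used only here):

```python
import pandas as pd

df_demo = pd.DataFrame({"A": ["a", "b", "c"], "B": [10, 20, 30]})

# Scalar read and write: .at takes exactly one row label and one column label
value = df_demo.at[1, "B"]
df_demo.at[1, "B"] = 25

# Anything wider than one cell (slices, lists, boolean masks) needs .loc
subset = df_demo.loc[0:1, ["A", "B"]]

print(value, df_demo.at[1, "B"], subset.shape)
```

In a tight loop that reads or writes individual cells, `.at` is the better fit; for everything else `.loc` remains the general tool.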

In [None]:
%%timeit
df.loc[1001,"A"]

In [None]:
%%timeit
df.at[1001,"A"]

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Pandas-Block</p>

-------------------
**How does Pandas store a DataFrame under the hood?**

- Pandas groups columns of the `same type` into what is called a **block** 
- A DataFrame is actually stored as `one or more blocks` 
- Using metadata, these blocks are composed into a DataFrame by a **BlockManager**. Thus, only a `block is contiguous in memory`
----------------

<u> `df32` (loaded in the datatypes section) has 3 Datablocks </u>

> NumericBlock : 2 columns (dtype : int32)

> NumericBlock : 301 columns (dtype : float32)

> ObjectBlock : 1 columns (dtype : object)


<div class="alert alert-success" role="info">

<p>
<li> <b>Blocks</b> are important when dealing with <b>mixed dtypes</b> <br></li> 
<li> Eg : slicing <b>df.loc[: 1000,["string_col","int_col"]]</b> might be <b>slower</b> than <b>df.loc[: 1000,["int_col1","int_col2"]]</b> because operations on contiguous data are generally faster </li> 
</p>
</div>

In [None]:
# _mgr is the private BlockManager attribute (older pandas versions exposed it as _data)
block = df32._mgr

print("3 datablocks - grouped basis the datatype")
block.blocks

In [None]:
print("columns - block mapping")
block.blknos

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> 🔥Saving & Loading data</p> 

- **csv**
> Comma-separated values, stored as plain text

- **Parquet**
> Apache Parquet is a data storage format designed for efficiency. The reason behind this is the column storage architecture, as it allows you to skip data that isn’t relevant quickly. This way, both queries and aggregations are faster, resulting in hardware savings.

- **feather**
> Feather is a data format for storing data frames. It’s designed around a simple premise — to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.

ref : https://towardsdatascience.com/stop-using-csvs-for-storage-here-are-the-top-5-alternatives-e3a7c9018de0

### csv

- **50-70 sec** to save 100K rows (from ubiquant data)
- **6 - 8 sec** to load 100K rows (from ubiquant data)
- **500+ MB** size on disk

In [None]:
df = pd.read_csv("../input/ubiquant-market-prediction/train.csv",nrows = 100000)
df.head(2)

- saving

In [None]:
%%time
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("df.csv", index=False)

- loading

In [None]:
%%time
df_load = pd.read_csv("df.csv")

- File size on Disk

In [None]:
print("file size in MB:" ,os.path.getsize("df.csv")/(1024**2))

### Parquet

**~ 2.5 - 4 sec** to **save** 100K rows from ubiquant data **(20X time improvement over csv)**

**~ 0.5 sec** to **load** 100K rows from ubiquant data **(15X time improvement over csv)**

**~ 216 MB** file size **(2.5X improvement over csv)**

- Saving
> 2.5 sec saving time

In [None]:
%%timeit

# saving as parquet without compression (for an apples-to-apples comparison)
df.to_parquet("df.parquet", compression=None)

- Loading
> 0.5 sec loading time

In [None]:
pip install fastparquet -q

In [None]:
%%timeit
df = pd.read_parquet("df.parquet", engine="fastparquet")

In [None]:
%%timeit
df = pd.read_parquet("df.parquet", engine="pyarrow")

- File size on Disk

In [None]:
print("file size in MB:" ,os.path.getsize("df.parquet")/(1024**2))


### Feather

**Approx. 1.3 sec** to **save** 100K rows from ubiquant data **(2X time improvement over Parquet)**

**Approx. 0.3 sec** to **load** 100K rows from ubiquant data **(33% time improvement over Parquet)**

**Approx. 142 MB** file size **(3.5X improvement over csv)**

---------------------
- https://github.com/wesm/feather
- https://pythontic.com/pandas/serialization/feather

In [None]:
pip install feather-format -q

- Saving
> 1.28 sec saving time

In [None]:
%%timeit
df.to_feather("df.ftr")

- Loading
> 0.3 sec loading time

In [None]:
%%timeit
df_load = pd.read_feather('df.ftr')

- File size on Disk

In [None]:
print("file size in MB:" ,os.path.getsize("df.ftr")/(1024**2))

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> Chunking</p>


> Sometimes your data file is so large you can’t load it into memory at all, even with compression. So how do you process it quickly?

> By loading and then processing the data in chunks, you can load only part of the file into memory at any given time. And that means you can process files that don’t fit in memory

> Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks. For more complicated workflows, you’re better off using another library.
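
As an illustration of "minimal coordination between chunks", a mean can be computed by carrying only a running sum and a row count from one chunk to the next. This is a sketch on a made-up in-memory CSV that stands in for a file too large to load at once.

```python
import io
import pandas as pd

# Made-up single-column CSV with the values 1..100, standing in for a large file
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 101)) + "\n"

total, count = 0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    # Each chunk only contributes a partial sum and a row count
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
print(mean)
```

Operations such as a median or a sort do not decompose this neatly, which is the point at which the quote above suggests reaching for another library.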

https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/

----------------------


<u> Value counts implementation using chunking </u>
> **Peak memory usage: ~90% reduction**

In [None]:
import tracemalloc

In [None]:
# starting the monitoring
tracemalloc.start()

tps_march = pd.read_csv("../input/tabular-playground-series-mar-2022/train.csv")
result = tps_march.direction.value_counts()

# displaying the memory
print("peak memory usage: ",tracemalloc.get_traced_memory()[1])
 
# stopping the library
tracemalloc.stop()

In [None]:
# starting the monitoring
tracemalloc.start()
 
# value counts, accumulated chunk by chunk
result = None
for chunk in pd.read_csv("../input/tabular-playground-series-mar-2022/train.csv" ,chunksize=1000 ):
    df_chunk = chunk["direction"]
    
    value_counter = df_chunk.value_counts()
    if result is None:
        result = value_counter
    else:
        result = result.add(value_counter, fill_value=0)

result.sort_values(ascending=False, inplace=True)
 
# displaying the memory
print("peak memory usage:",tracemalloc.get_traced_memory()[1])
 
# stopping the library
tracemalloc.stop()

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> List comprehension vs loops</p>

- Prefer **list comprehensions** over explicit **loops** with `append` when building a list

- Normal loop

In [None]:
%%timeit
li = []
for x in range(100000):
    li.append(x)

In [None]:
%%timeit
li = [x for x in range(100000)]

- Nested loop

In [None]:
%%timeit
li = []
for x in range(100):
    for y in range(100):
        li.append((x,y))

In [None]:
%%timeit
li = [(x,y) for x in range(100) for y in range(100)]

# <p style="background-color:#535353;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;"> Thanks</p>


<div class="alert alert-success" role="info">

<h4> <li>  Please <b>upvote</b> 👍, it will motivate me to create more such content  </li> </h4> <br>
<h4> <li> Also check out one of my most upvoted NBs, <a href="https://www.kaggle.com/rohan1506/pandas-tips-tricks-tutorial">Pandas tips & tricks</a>, for more Pandas tricks </li> </h4> 

</div>
