# <font size=12 color=red>Scaling Data with Limited Memory</font>

# `Mr Fugu Data Science`

# (◕‿◕✿)

In [None]:
# imports:

import pandas as pd

import os, gc # gc is used for garbage collection

# `Considerations:`

+ Loading large datasets into memory can be tricky. Keep memory in mind when you pick your techniques, and evaluate them so that processing stays efficient.

**Ex.)** `Pandas` **will load all your data into memory**, which causes issues when the dataset is larger than the RAM available on your machine.

* Pay attention to your data: you may be making (intermediate) copies of it, which uses extra memory and slows down processing!
    + `Pandas isn't the best choice for all situations`

+ We can:
    + Buy a bigger machine (more RAM) or rent one through cloud computing
    + Try to offload data from memory onto disk, but disk is slower than RAM

`Hmm. What do we do and how to approach this issue?`


`--------------------------------------`

**Have you ever used a dataset and wondered, "where did my memory go?"**
+ I mean, you have a small file, maybe 3GB, and after a few manipulations your memory usage has somehow grown by a few GB.

https://pythonspeed.com/articles/function-calls-prevent-garbage-collection/

# A good recommendation is to measure how much memory you're using before and during your analysis, as a reference

* **`Profiling:`** can help you investigate your memory usage


`------------------------------------`

* **Alternate option to consider:**
    * **`Mac/Linux`**: consider the standard library, `from resource import getrusage, RUSAGE_SELF` (Unix only)
    * **`Windows:`** try using `import psutil` (third-party; it also works on Mac/Linux)
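
A minimal sketch of the `resource` option, assuming a Mac or Linux machine (the module does not exist on Windows). The helper name `peak_memory_mb` and the throwaway list are made up for illustration.

```python
import sys
from resource import getrusage, RUSAGE_SELF


def peak_memory_mb():
    """Peak resident memory of this process so far, in megabytes."""
    peak = getrusage(RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_memory_mb()
payload = [0] * 5_000_000  # hypothetical allocation, just to move the needle
after = peak_memory_mb()
print(f"peak before: {before:.1f} MB, peak after: {after:.1f} MB")
```

The value is a high-water mark, so it never goes down, even after you delete `payload`.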
    

`------------------------------------`
    
* **`Terminal/Command Prompt:`** consider `htop` command
    
`------------------- Possible Reading & Insight ------------------------------`


https://pythonspeed.com/articles/estimating-memory-usage/

https://www.kaggle.com/frankherfert/tips-tricks-for-working-with-large-datasets#

# Profiling Ex.) 

https://www.geeksforgeeks.org/memory-profiling-in-python-using-memory_profiler/

https://medium.com/zendesk-engineering/hunting-for-memory-leaks-in-python-applications-6824d0518774

https://medium.com/codex/profiling-your-code-4a1538afd1e1


In [7]:
# %%time

# data_01 = pd.read

In [8]:
# data_.info(memory_usage="deep") # "deep" also counts the contents of object (string) columns

# to convert bytes into megabytes we need to multiply by 1e-6
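
Here is a small sketch of checking memory per column, using a made-up `DataFrame` rather than a real file. `DataFrame.memory_usage(deep=True)` returns bytes per column (plus the index), and `DataFrame.info(memory_usage="deep")` prints a summary.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "id": range(1_000),
    "city": ["Sacramento", "Oakland"] * 500,
})

# Bytes used by each column; deep=True also measures the string contents
per_column_bytes = df.memory_usage(deep=True)
total_mb = per_column_bytes.sum() * 1e-6  # bytes -> megabytes

print(per_column_bytes)
print(f"total: {total_mb:.3f} MB")
df.info(memory_usage="deep")
```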

In [10]:
# Library for alternative:



In [None]:
# Memory Profiling:

# from memory_profiler import profile
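
If `memory_profiler` is not installed, the standard library's `tracemalloc` is a different tool that gives a rough view of Python allocations. This is a sketch with a hypothetical workload (`build_rows`), not a replacement for a full profiler.

```python
import tracemalloc


def build_rows(n):
    """Hypothetical workload: build a list of small dicts."""
    return [{"id": i, "value": i * 2.0} for i in range(n)]


tracemalloc.start()
rows = build_rows(100_000)
# Both numbers are bytes allocated by Python since start()
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current_bytes * 1e-6:.1f} MB, peak: {peak_bytes * 1e-6:.1f} MB")
```

Keep in mind `tracemalloc` only sees memory allocated through Python's allocator, so it can miss some memory used by C extensions.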

    
# `Next, Consider Modeling/Estimating` your memory usage

This will help you figure out where you may be eating up the memory and slowing everything down.

+ But what if you are stuck here and can't run a task/model to completion?
    + `Well, we have to consider using smaller models, possibly sampling the data, or even changing the number of inputs. From those smaller runs, try to extrapolate and get a general idea of how much memory the full job may need.`
    
    (**Sampling can bring you to a point where you may lose information, take this into account**)
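
One rough way to extrapolate, as a sketch: measure a sample, work out bytes per row, and scale up. This assumes memory grows linearly with row count, which will not hold for every workload. The function name and the toy sample are made up for illustration.

```python
import pandas as pd


def estimate_full_memory_mb(sample, total_rows):
    """Scale the sample's in-memory size up to total_rows (linear assumption)."""
    sample_bytes = sample.memory_usage(deep=True).sum()
    bytes_per_row = sample_bytes / len(sample)
    return bytes_per_row * total_rows * 1e-6


# Hypothetical sample; in practice this might come from pd.read_csv(..., nrows=...)
sample = pd.DataFrame({"id": range(10_000), "score": [0.5] * 10_000})
estimate_mb = estimate_full_memory_mb(sample, total_rows=5_000_000)
print(f"estimated size for 5M rows: {estimate_mb:.0f} MB")
```

This only estimates the size of the loaded data. Intermediate copies made during processing can push the real peak higher.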
    
`------------------------------------`
    
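**Caution:** there is a difference between `Peak Resident Memory & Peak Memory Usage`. Resident memory counts only what is held in RAM, so it can understate what your program allocated if the OS has swapped some of it to disk.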

In [None]:
# Ex.) 

# `Compressing Data (Managing Resources Locally)`

+ Changing Datatypes
+ Chunking
+ Indexing

**`One thing to keep in mind: we may lose information for the sake of compression`** 

In [1]:
# Chunking/Loading: good if you only want to load data once
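
A minimal sketch of chunking with `pd.read_csv(..., chunksize=...)`, which yields one `DataFrame` per chunk instead of loading everything at once. The CSV text here is made up and lives in memory so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical 10-row CSV; a real file path would go in its place
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 1.5}" for i in range(10))

total = 0.0
n_chunks = 0
# Each chunk is a DataFrame with at most 4 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["amount"].sum()
    n_chunks += 1

print(n_chunks, total)
```

Only one chunk is held in memory at a time, so this fits tasks where a running result (sum, count, filter) is enough.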



In [None]:
# Changing Data-Types:
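
A sketch of shrinking a `DataFrame` by changing datatypes, using made-up columns: small integers are downcast with `pd.to_numeric(..., downcast="integer")` and a repetitive text column becomes a `category`.

```python
import pandas as pd

# Hypothetical data: small integers and a few repeated labels
df = pd.DataFrame({
    "visits": [1, 2, 3, 4] * 2_500,            # stored as int64 by default
    "state": ["CA", "NY", "TX", "WA"] * 2_500,
})
before = df.memory_usage(deep=True).sum()

small = df.assign(
    visits=pd.to_numeric(df["visits"], downcast="integer"),  # smallest int that fits
    state=df["state"].astype("category"),                    # codes + lookup table
)
after = small.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
print(small.dtypes)
```

These two conversions keep every value intact. Downcasting floats (for example `float64` to `float32`) is where you can lose precision.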


In [9]:
# Indexing: good if you want to access many times, `side note`: good with databases for fast access
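
To illustrate the database side note, here is a sketch with the standard library's `sqlite3`. The table, column, and index names are made up. The data stays in the database and an index on `city` lets repeated lookups avoid scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would keep the data on disk
conn.execute("CREATE TABLE trips (id INTEGER, city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(i, "Oakland" if i % 2 else "Fresno", i * 0.5) for i in range(1_000)],
)

# Index the column we plan to filter on many times
conn.execute("CREATE INDEX idx_trips_city ON trips (city)")

fresno_count = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE city = ?", ("Fresno",)
).fetchone()[0]

# Ask SQLite how it plans to run the query
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT fare FROM trips WHERE city = ?", ("Fresno",)
).fetchall()
conn.close()

print(fresno_count)
print(plan)
```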



# `Using Different Libraries`

+ Parallel Computing options such as `Dask` & `Vaex`
    + Learn their limitations, because there are some!

In [2]:
# Think Parallel

# Converting file types:

+ consider converting your file to a format that takes less space and loads more efficiently, such as:

**`File Storage:`**

* `Pickle Files`
* `Parquet`

# `Pickle Files:`

+ easily read by Python, and stores all of your data along with its datatypes (unlike a CSV), so types are preserved on reload instead of being inferred again.

(**do NOT un-pickle a file unless you trust its source: loading a pickle can execute arbitrary code!**)

https://towardsdatascience.com/the-power-of-pickletools-handling-large-model-pickle-files-7f9037b9086b

In [None]:
# ex.) compare Pickle file vs same CSV file size
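
A sketch of the comparison with a made-up `DataFrame`, written to a temporary folder. Which file is smaller depends on the data, so check the sizes for your own dataset. What this reliably shows is that the pickle keeps the datetime type while the CSV does not.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data with a datetime column
df = pd.DataFrame({
    "when": pd.date_range("2021-01-01", periods=1_000, freq="D"),
    "value": [x / 7 for x in range(1_000)],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    pkl_path = os.path.join(tmp, "data.pkl")
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    csv_bytes = os.path.getsize(csv_path)
    pkl_bytes = os.path.getsize(pkl_path)

    from_csv = pd.read_csv(csv_path)      # dates come back as plain text
    from_pkl = pd.read_pickle(pkl_path)   # dtypes come back as saved

print(f"csv: {csv_bytes} bytes, pickle: {pkl_bytes} bytes")
print(from_csv["when"].dtype, from_pkl["when"].dtype)
```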



# Parquet file:

+ This can be a good option when your data explodes in size; `pd.concat`, for example, is a memory hog.
    + Storing on disk can be a better option in this case

In [None]:
# ``

# `Changing Data Types & Watch Out For Functions in Pandas:`


* Changing Data Types to save memory:
    * 

* `pandas.concat():` will explode the memory usage, so be aware of that. 
    * Solution: if you can write to disk, do so with something such as a `parquet file`. It is readable by pandas, and you can store more on disk than in RAM. **There is a tradeoff though:** `time`. You will save memory but sacrifice time. Depending on the task, you should be aware of this!
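
When the data does fit in RAM, a related habit (not the disk-based solution described above) is to collect the pieces in a list and call `pd.concat` once, instead of concatenating inside a loop and copying the growing result every time. A sketch with a hypothetical `make_part` loader:

```python
import pandas as pd


def make_part(i):
    """Hypothetical stand-in for loading one piece of a larger dataset."""
    return pd.DataFrame({"part": [i] * 3, "value": [0, 1, 2]})


parts = [make_part(i) for i in range(4)]         # collect the pieces first
combined = pd.concat(parts, ignore_index=True)   # one concat, one copy
del parts                                        # drop the pieces once combined

print(combined.shape)
```

The pieces and the combined frame still exist side by side for a moment, so peak memory is roughly double the data size.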
    
    
By default `Pandas` will infer your datatypes and tends to pick wide ones (such as `int64`, `float64`, or `object`), which can use far more memory than the data needs.
    
`Consider reading these:`
    
https://arrow.apache.org/docs/python/parquet.html

https://www.confessionsofadataguy.com/solving-the-memory-hungry-pandas-concat-problem/

https://www.kaggle.com/frankherfert/tips-tricks-for-working-with-large-datasets

# `Do you need all of the data?`

`We want to minimize resources used and maximize our time!` ¯\_(ツ)_/¯

+ Can you shrink the dataset?
    + Such as `dropping columns`
    + Think about `garbage collection`: `import gc` and call `gc.collect()`
    + Consider `generator` functions
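
A minimal sketch of the generator idea: yield one value at a time so the full list never sits in memory. The function name and CSV text are made up for illustration.

```python
import csv
import io


def read_amounts(handle):
    """Yield one parsed value at a time instead of building a full list."""
    for row in csv.DictReader(handle):
        yield float(row["amount"])


# Hypothetical CSV held in memory; an open file handle works the same way
csv_text = "id,amount\n1,2.5\n2,4.0\n3,3.5\n"
total = sum(read_amounts(io.StringIO(csv_text)))
print(total)
```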

`Garbage Collection:` can help save memory if you take control of it. Python keeps an object in RAM for as long as something still references it, so large variables you no longer need stay around until you delete them (`del`) or they go out of scope.
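
A small sketch of what `gc.collect()` adds on top of `del`, assuming CPython. Objects that reference each other in a cycle are not freed by `del` alone, and the collector is what cleans them up. The `Node` class is made up for illustration.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.partner = None


a, b = Node(), Node()
a.partner, b.partner = b, a   # reference cycle: a -> b -> a
probe = weakref.ref(a)        # lets us check whether `a` still exists

del a, b                      # names are gone, but the cycle keeps both alive
gc.collect()                  # the cycle collector frees them
print(probe() is None)
```

For a large `DataFrame` with no cycles, `del df` is usually what releases the memory. Calling `gc.collect()` afterwards is a cheap extra step.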
    
https://towardsdatascience.com/python-garbage-collection-article-4a530b0992e3

# `Pandas Concat & Memory Issues:`


https://www.confessionsofadataguy.com/solving-the-memory-hungry-pandas-concat-problem/

`-----------------------------------------`

# Lastly, Consider tools such as:
+ Cloud Computing
+ Google Colab
+ Using a Cluster for parallel computing if your datasets are very large
    + `Dask` & `Vaex` or even `Spark` may be an option

# `Citations & Help`

# ◔̯◔

[pandas thoughts](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#:~:text=pandas%20provides%20data%20structures%20for,need%20to%20make%20intermediate%20copies) | [Memory and Speed Python](https://pythonspeed.com/memory/) | [Further Help Kaggle](https://www.kaggle.com/frankherfert/tips-tricks-for-working-with-large-datasets)

https://towardsdatascience.com/what-to-do-when-your-data-is-too-big-for-your-memory-65c84c600585

https://www.ionos.com/digitalguide/hosting/technical-matters/in-memory-databases/

https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask

https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets

https://www.kdnuggets.com/2021/03/pandas-big-data-better-options.html

https://annefou.github.io/metos_python/07-LargeFiles/

https://www.youtube.com/watch?v=HNE0qHJ9A9o

https://pythonspeed.com/articles/estimating-memory-usage/

https://datascience.stackexchange.com/questions/27767/opening-a-20gb-file-for-analysis-with-pandas

https://www.confessionsofadataguy.com/solving-the-memory-hungry-pandas-concat-problem/

https://docs.dask.org/en/latest/best-practices.html#store-data-efficiently

https://docs.dask.org/en/latest/best-practices.html

https://towardsdatascience.com/python-garbage-collection-article-4a530b0992e3

`---------------------`

https://realpython.com/python-mmap/ (memory mapping)

https://towardsdatascience.com/optimized-i-o-operations-in-python-194f856210e0 (optimizing I/O)

https://www.codementor.io/@satwikkansal/python-practices-for-efficient-code-performance-memory-and-usability-aze6oiq65

https://towardsdatascience.com/reduce-memory-usage-and-make-your-python-code-faster-using-generators-bd79dbfeb4c

https://pythonspeed.com/articles/function-calls-prevent-garbage-collection/