# <font size=12 color=red>Scaling Data with Limited Memory</font>

# `Mr Fugu Data Science`

# (◕‿◕✿)

# `Considerations:`

+ Loading data into memory becomes a problem once your datasets get large. Plan for this up front and evaluate the techniques you use so that processing stays efficient.

**Ex. )** `Pandas` **will load all your data into memory** causing issues if you have a dataset larger than your memory.

* Pay attention to your data: you may be making (intermediate) copies of it and slowing down processing!
    + `Pandas isn't the best choice for all situations`

+ We can:
    + Buy a faster machine or rent access through cloud computing
    + Offload from memory onto disk, although this is slower than RAM

`Hmm. What do we do, and how do we approach this issue?`


`--------------------------------------`

**Have you ever used a dataset and wondered, "where did my memory go?"**
+ I mean, you have a small file, maybe < 1GB, and while doing manipulations your data has grown by a few GB.

https://pythonspeed.com/articles/function-calls-prevent-garbage-collection/

# A good recommendation is to figure out how much memory you're using before/during your analysis as a reference

* **`Profiling:`** a way to investigate your memory usage

https://www.geeksforgeeks.org/memory-profiling-in-python-using-memory_profiler/

https://medium.com/zendesk-engineering/hunting-for-memory-leaks-in-python-applications-6824d0518774

https://medium.com/codex/profiling-your-code-4a1538afd1e1
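
A minimal sketch of memory profiling, assuming you are fine with the standard-library `tracemalloc` module (swapped in here for the third-party `memory_profiler` from the link above). The `labels` list is just a made-up workload so there is something to measure.

```python
import tracemalloc

tracemalloc.start()  # begin tracking Python-level allocations

# Hypothetical workload: build a list of 100,000 short strings.
labels = [f"row-{i}" for i in range(100_000)]

current_bytes, peak_bytes = tracemalloc.get_traced_memory()
snapshot = tracemalloc.take_snapshot()  # must be taken before stop()
tracemalloc.stop()

print(f"current: {current_bytes / 1024**2:.1f} MB, peak: {peak_bytes / 1024**2:.1f} MB")

# Show the three source lines responsible for the most allocated memory.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

Keep in mind that `tracemalloc` only sees allocations made through Python's allocator, so memory grabbed directly by some C extensions may not show up.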

`------------------------------------`

* **Alternate option to consider:**
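    * **`Mac/Linux`**: consider the standard-library `resource` module: `from resource import getrusage, RUSAGE_SELF`
    * **`Windows:`** try using `import psutil` (the `resource` module is Unix-only; `psutil` is a third-party package that also works on Mac/Linux)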
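
Here is a small sketch of the Mac/Linux option, assuming a Unix system where the `resource` module is available. The helper name `peak_memory_mb` and the throwaway `numbers` list are made up for illustration.

```python
import sys
from resource import getrusage, RUSAGE_SELF


def peak_memory_mb():
    """Return this process's peak resident memory in megabytes."""
    peak = getrusage(RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_memory_mb()
numbers = list(range(2_000_000))  # allocate something noticeable
after = peak_memory_mb()

print(f"peak before: {before:.1f} MB, peak after: {after:.1f} MB")
```

Because this is a *peak* value, it never goes down, even after you delete `numbers`. That makes it handy as a "how bad did it get?" reference point.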
    

`------------------------------------`
    
* **`Terminal/Command Prompt:`** consider the `htop` command
    
`------------------- Possible Reading & Insight ------------------------------`


https://pythonspeed.com/articles/estimating-memory-usage/

https://www.kaggle.com/frankherfert/tips-tricks-for-working-with-large-datasets#

# **`Profiling:`** 

+ Use profiling when you need to find out how long certain operations take inside a script. This is NOT benchmarking, because a profiler gathers statistics on the entire program.
    + If you want to benchmark a small piece of code instead, use something such as "timeit"
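
For the benchmarking side, a quick sketch with the standard-library `timeit` module. The two functions are made-up examples that build the same list in two different ways.

```python
import timeit


def build_with_loop(n=1_000):
    """Build a list of squares with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def build_with_comprehension(n=1_000):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


# Run each function 1,000 times and report the total elapsed seconds.
loop_seconds = timeit.timeit(build_with_loop, number=1_000)
comp_seconds = timeit.timeit(build_with_comprehension, number=1_000)

print(f"loop:          {loop_seconds:.4f} s")
print(f"comprehension: {comp_seconds:.4f} s")
```

The exact numbers depend on your machine, so treat them as a comparison between the two snippets rather than as absolute timings.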

`Two types of Profiling:`

+ **Deterministic:**
    * Monitors every event. It is accurate, but it adds performance overhead, so it is better run on small functions or operations.

+ **Statistical:**
    * Less accurate, but it has lower overhead because it only takes samples.
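
A short sketch of the deterministic kind, assuming the standard-library `cProfile` and `pstats` modules. The functions `slow_square_sum` and `workload` are hypothetical stand-ins for your own code.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Sum of squares, written so it shows up clearly in the profile."""
    return sum(i * i for i in range(n))


def workload():
    """Call the slow function several times."""
    return [slow_square_sum(20_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()   # start recording every function call
workload()
profiler.disable()  # stop recording

# Collect the report into a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

Every call is recorded, which is where the overhead mentioned above comes from.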
    
Something really useful and pretty cool is the (Call Graph): look into **gprof2dot**, for example. It converts a profiler's output into a graph-like structure showing which functions are calling each other.


# `Other Tools:`

+ vprof
+ pyflame
+ StackImpact
    
https://medium.com/@antoniomdk1/hpc-with-python-part-1-profiling-1dda4d172cdf

    
# `Next, Consider Modeling/Estimating` your memory usage

This will help you figure out where you may be eating up the memory and slowing everything down.

+ But, what if you are stuck here and can't run a task/model to completion?
    + `Well, we have to consider using smaller models (possibly sampling the data or even changing the number of inputs).` 
    + From there, try to extrapolate and get a general idea of how much memory the full run may use.
    
    (**Sampling can bring you to a point where you lose information, so take this into account**)
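
One rough way to do that extrapolation, sketched with pandas. The helper `estimate_full_memory_mb` and the random `full` table are made up for this example, and the estimate assumes memory grows roughly linearly with the number of rows.

```python
import numpy as np
import pandas as pd


def estimate_full_memory_mb(sample: pd.DataFrame, total_rows: int) -> float:
    """Extrapolate total memory from the average bytes per row in a sample."""
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    return bytes_per_row * total_rows / 1024**2


# Hypothetical "big" table so we can compare the estimate with the truth.
rng = np.random.default_rng(0)
full = pd.DataFrame({"id": np.arange(100_000), "value": rng.random(100_000)})

sample = full.head(1_000)  # pretend this is all we can afford to load
estimate = estimate_full_memory_mb(sample, total_rows=len(full))
actual = full.memory_usage(deep=True).sum() / 1024**2

print(f"estimated: {estimate:.2f} MB, actual: {actual:.2f} MB")
```

With real files you would load the sample with something like `nrows=` in `pd.read_csv`. Columns with text of very different lengths will make the estimate less reliable.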
    
`------------------------------------`
    
**Caution:** there is a difference between `Peak Resident Memory & Peak Memory Usage`

# `(Managing Resources Locally)`

+ `Think about garbage collection:` this can be handy if you are doing a lot of transformations or creation/deletion of objects. 
    + Mac: `htop` command
    + PC: `psutil`

+ **Changing Datatypes:** this can drastically change memory requirements with minimal work

+ **Chunking:** ingesting your data, as the name implies, "by chunks"

+ **Indexing:** useful if you need subsets of the data

+ **Compression:** there are two forms, "lossless" and "lossy", so be careful

**`We do need to think about one thing: losing information for the sake of compression`** 

https://www.kaggle.com/frankherfert/tips-tricks-for-working-with-large-datasets (good read)
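
To make the chunking idea concrete, here is a minimal sketch using the `chunksize` option of `pd.read_csv`. The CSV text is generated in memory purely for illustration; in practice you would pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV with 10,000 rows, built in memory for the demo.
csv_text = "card,value\n" + "\n".join(f"card_{i},{i % 50}" for i in range(10_000))

total = 0
rows = 0
# Only one 2,000-row chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2_000):
    total += chunk["value"].sum()
    rows += len(chunk)

print(f"rows seen: {rows}, sum of value: {total}")
```

This works nicely for things you can compute piece by piece (sums, counts, filters). Operations that need all rows at once, like a full sort, need a different plan.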

# `Using Different Libraries`

+ Parallel Computing options such as `Dask` & `Vaex`

`Dask:` scaling for clusters

`Vaex:` work with big data on a single machine

+ Learn their limitations, because there are some!

# Converting file types:

+ Consider converting your file to a lower memory usage type such as:

**`File Storage:`**

* `Pickle Files`
* `Parquet`

# `Pickle Files:` used to store data in byte format (serialize/deserialize)

`You can take that format and convert it back for Python to read. But you have to be careful of memory usage exploding when going in reverse. There are tools to help you manage memory, such as (pickletools).` 
    + In the real world, pickling is used to transfer data in a manageable way.
    
    --------------------------

+ `easily read by Python and stores all of your data as well as datatypes, unlike a csv while saving memory.`
    * CSV doesn't store information such as datatypes.

(**do NOT un-pickle a file unless you trust its source: unpickling can run arbitrary code, and the file may also be corrupted!**)
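
A tiny sketch of the "pickle keeps datatypes, CSV doesn't" point, using only the standard library. The `records` list is made-up data (the dates in particular are invented for the example).

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical records mixing strings, integers and dates.
records = [
    {"card": "Charizard", "value": 200, "acquired": date(2021, 3, 1)},
    {"card": "Pikachu", "value": 5, "acquired": date(2021, 4, 15)},
]

# Pickle round trip: serialize to bytes, then load back.
payload = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
from_pickle = pickle.loads(payload)  # only unpickle data you created or trust

# CSV round trip: everything comes back as plain text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["card", "value", "acquired"])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

print(type(from_pickle[0]["value"]), type(from_pickle[0]["acquired"]))
print(type(from_csv[0]["value"]), type(from_csv[0]["acquired"]))
```

After the CSV round trip you would have to convert `value` and `acquired` back yourself, while the pickled version returns the original Python objects.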

https://towardsdatascience.com/the-power-of-pickletools-handling-large-model-pickle-files-7f9037b9086b

https://analyticsindiamag.com/complete-guide-to-different-persisting-methods-in-pandas/

# **`Background Motivation:`**

`File formats such as JSON and CSV, and databases such as PostgreSQL, store data row-wise. This isn't necessarily what you want when you have files in the GB range for read/write operations on a single machine.`

Working with columns gives you the flexibility to load only the data you want, which saves memory. 
    + If you can flatten columns with redundant information, you can manage memory efficiently with a column-oriented scheme.
    
    
`-----------------------------------------`


# `Parquet file:` *(helps with storage size)*


+ This can be a good option because it:
    * Supports nested data
    * Saves datatype information
    * Is faster to read than CSV and is supported by Hadoop
    * Is great for running queries on big data (GB for each file) and reduces overhead costs vs CSV files
    * Supports various encodings
    
`3 ways Parquet can help you:`

1. ) **Columnar Storage**
    * Built for efficiency vs the CSV file format

2. ) **Columnar Compression**
    * Able to use different codecs for different columns, which aids compression and helps with query speed

3. ) **Data Partitioning**
    * This is done by splitting on unique values and creating a tree structure, like key-value pairs.

**Formats such as CSV or JSON are read row-wise, but this has an inherent problem with large files, which are usually read iteratively!** `Therefore, we have the columnar approach, which reduces our overhead by storing data column-wise, allowing you to use only the columns you need.`
    
    
https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d (GOOD Reading)

https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce (inspiration for this cell)

https://docs.microsoft.com/en-us/answers/questions/381106/parquet-in-data-lake-vs-csv-file.html

# `Comparing:` look at the (`Game_Number` column)


| Some_ID 	| Pokemon_Card 	| Game_Number 	| Dollar_Value 	|
|---------	|--------------	|-------------	|--------------	|
| 3       	| Charizard    	| S4a         	| 200          	|
| 2       	| Pikachu      	| S4a         	| 5            	|
| 4       	| Eevee        	| S4a         	| 30           	|
| 1       	| Charmander   	| S4a         	| 10           	|
| 6       	| Skylar       	| S4a         	| 15           	|
 
 
+ We would have to go line by line (row-wise) to find redundant values

+ Whereas, with a column-based approach, we can flatten these data.
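
To see what "flattening" a redundant column can look like, here is a toy sketch in plain Python. It turns the table above into columns and run-length encodes `Game_Number`. This is only a stand-in to show the idea, not how any particular file format actually encodes its data.

```python
from itertools import groupby

header = ("Some_ID", "Pokemon_Card", "Game_Number", "Dollar_Value")
rows = [
    (3, "Charizard", "S4a", 200),
    (2, "Pikachu", "S4a", 5),
    (4, "Eevee", "S4a", 30),
    (1, "Charmander", "S4a", 10),
    (6, "Skylar", "S4a", 15),
]

# Row-wise -> column-wise: one list of values per column name.
columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_game = run_length_encode(columns["Game_Number"])
print(encoded_game)  # [('S4a', 5)]
```

Five repeated values collapse into a single pair because they sit next to each other in the column. In the row-wise layout they are spread across five separate records, so there is nothing adjacent to collapse.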

`----------------------------------------------------------------------------------`

# `Other Ways To Save Memory`

`------------------------------------------------`


# 1.) **Changing Data Types to save memory**
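
A quick sketch of what this can look like in pandas. The table is made up: a text column with only three distinct values and a number column whose values all fit in 8 bits.

```python
import numpy as np
import pandas as pd

n = 90_000
df = pd.DataFrame({
    "game": ["S4a", "S4b", "S5"] * (n // 3),  # only three distinct values
    "value": np.arange(n) % 200,              # always between 0 and 199
})

before = df.memory_usage(deep=True).sum()

# Repeated text -> category, small integers -> uint8 (holds 0..255).
slim = df.astype({"game": "category", "value": "uint8"})
after = slim.memory_usage(deep=True).sum()

print(f"before: {before / 1024**2:.2f} MB, after: {after / 1024**2:.2f} MB")
```

Check the range of your values before downcasting. A number that doesn't fit the smaller type will not be stored correctly.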
   

`-------------------------------------------------`

# 2.) `Watch out for pandas.concat(): it can explode your memory usage, so be aware of that.`

   **Solution**: if you can write to disk, then do so with something such as a `parquet file`. This is readable by pandas, and you are able to store more on disk since you are RAM limited. **There is a tradeoff though:** `time`. You will save memory but sacrifice time. Depending on the task, you should be aware of this!
    
    
By default, `Pandas` will infer your datatypes and, in doing so, often picks wider types than your data needs, which uses a lot of memory.
    
`Consider reading these:`
    

https://www.confessionsofadataguy.com/solving-the-memory-hungry-pandas-concat-problem/

https://www.kaggle.com/frankherfert/tips-tricks-for-working-with-large-datasets

# `Do you need all of the data?`

`We want to minimize resources used and maximize our time!` ¯\_(ツ)_/¯

+ Can you shrink the dataset?
    + Such as `dropping columns`
    + Think about `garbage collection` such as `import gc` and use `gc.collect()`
    + Consider `generator` functions
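
A small sketch of the generator idea. The function `squares` is a made-up example; the point is that values are produced one at a time instead of being stored all at once.

```python
import sys


def squares(n):
    """Yield squares one at a time instead of building a full list."""
    for i in range(n):
        yield i * i


n = 1_000_000
as_list = [i * i for i in range(n)]  # every value lives in memory at once
as_generator = squares(n)            # nothing is computed yet

# getsizeof reports the container itself (not the integers inside the list).
list_bytes = sys.getsizeof(as_list)
generator_bytes = sys.getsizeof(as_generator)
print(f"list: {list_bytes:,} bytes, generator: {generator_bytes:,} bytes")

total = sum(as_generator)  # values are produced and discarded one by one
print(total)
```

The catch is that a generator can only be consumed once. If you need to loop over the values again, you have to create a new one.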

`Garbage Collection:` can effectively aid in saving memory if you take control of it. In general, Python keeps an object in RAM for as long as something still refers to it, so large intermediate variables stay around until you delete them or they go out of scope. 
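
Here is a minimal sketch of taking control of the collector with the standard-library `gc` module. The `Node` class is hypothetical; it exists only to build two objects that refer to each other, which is the case `gc.collect()` is designed to clean up.

```python
import gc
import weakref


class Node:
    """Tiny object that can point at a partner, forming a reference cycle."""

    def __init__(self):
        self.partner = None


gc.collect()   # clear any garbage that already exists
gc.disable()   # pause automatic collection so the effect is visible

a, b = Node(), Node()
a.partner, b.partner = b, a  # a and b now keep each other alive
probe = weakref.ref(a)       # lets us check on `a` without keeping it alive

del a, b                     # no names left, but the cycle still holds both
alive_before = probe() is not None

found = gc.collect()         # the cycle detector finds and frees them
alive_after = probe() is not None
gc.enable()

print(f"alive before collect: {alive_before}, after: {alive_after}, found: {found}")
```

Objects without cycles are freed as soon as their last reference disappears, so in everyday code `del big_variable` followed by `gc.collect()` is mostly a way to make sure nothing is lingering before the next heavy step.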
    
https://towardsdatascience.com/python-garbage-collection-article-4a530b0992e3

https://pythonspeed.com/articles/function-calls-prevent-garbage-collection/

# `Pandas Concat & Memory Issues:`


https://www.confessionsofadataguy.com/solving-the-memory-hungry-pandas-concat-problem/

`-----------------------------------------`

# Lastly, Consider tools such as:
+ Cloud Computing
+ Google Colab
+ Using a Cluster for parallel computing if your datasets are very large
    + `Dask` & `Vaex` or even `Spark` may be an option


# `Citations & Help`

# ◔̯◔

[pandas thoughts](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#:~:text=pandas%20provides%20data%20structures%20for,need%20to%20make%20intermediate%20copies) | [Memory and Speed Python](https://pythonspeed.com/memory/) | [Further Help Kaggle](https://www.kaggle.com/frankherfert/tips-tricks-for-working-with-large-datasets)

https://towardsdatascience.com/what-to-do-when-your-data-is-too-big-for-your-memory-65c84c600585

https://www.ionos.com/digitalguide/hosting/technical-matters/in-memory-databases/

https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask

https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets

https://www.kdnuggets.com/2021/03/pandas-big-data-better-options.html

https://annefou.github.io/metos_python/07-LargeFiles/

https://www.youtube.com/watch?v=HNE0qHJ9A9o

https://pythonspeed.com/articles/estimating-memory-usage/

https://datascience.stackexchange.com/questions/27767/opening-a-20gb-file-for-analysis-with-pandas

https://www.confessionsofadataguy.com/solving-the-memory-hungry-pandas-concat-problem/

https://docs.dask.org/en/latest/best-practices.html#store-data-efficiently

https://docs.dask.org/en/latest/best-practices.html

https://towardsdatascience.com/python-garbage-collection-article-4a530b0992e3

`---------------------`

https://realpython.com/python-mmap/ (memory mapping)

https://towardsdatascience.com/optimized-i-o-operations-in-python-194f856210e0 (optimizing I/O)

https://www.codementor.io/@satwikkansal/python-practices-for-efficient-code-performance-memory-and-usability-aze6oiq65

https://towardsdatascience.com/reduce-memory-usage-and-make-your-python-code-faster-using-generators-bd79dbfeb4c

https://pythonspeed.com/articles/function-calls-prevent-garbage-collection/

https://www.toucantoco.com/en/tech-blog/python-performance-optimization

https://arrow.apache.org/docs/python/parquet.html