# **Reducing RAM usage of Programs**

Every program consumes some amount of memory. If that consumption exceeds the memory available on the machine (i.e. RAM), the program will not be able to execute. In general, higher memory usage also means more overhead. We should therefore try to keep the memory consumption of our programs as low as is practical. There are several ways to do that, such as choosing better data structures, using data compression techniques, and avoiding unnecessary data movement.

As an example, let's compare the memory cost of a list that holds the same object at every index with a list that holds a unique item at every index.


In [1]:
%load_ext memory_profiler

In [2]:
%memit [0] * int(1e8)

peak memory: 849.66 MiB, increment: 763.10 MiB


In [2]:
%memit [n for n in range(int(1e8))]

peak memory: 4026.67 MiB, increment: 3940.16 MiB


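We can see that the list with unique items consumes close to 4 GB of memory, because every element is a separate integer object in addition to the pointer stored in the list. But if we use a different data structure to store the same amount of data, the memory consumption is different.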

In [3]:
import array
%memit array.array('l', range(int(1e8)))

peak memory: 712.28 MiB, increment: 624.44 MiB


As we can see, it only consumes ~600 MB of memory. This result emphasizes the importance of using the proper data structure for the right purpose. You may think that arrays would then be great for everything, but they are not. The `array` module supports only a limited set of data types. Also, the moment we dereference an item from the array, Python builds a new object for it, which costs memory. So if your program frequently dereferences and processes the data, little or no memory will be saved.
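
The dereferencing cost can be observed with the standard library alone. The following is a minimal sketch, not a benchmark: it assumes the `'q'` type code (always 8 bytes) instead of the `'l'` used above so that the item size is the same on every platform, and it uses a smaller array so it runs quickly. The list size is only an approximation, since CPython caches some small integers.

```python
import array
import sys

# A compact array of one million signed 64-bit integers ('q' is always 8 bytes).
values = array.array('q', range(1_000_000))

# Each element lives as a raw machine value inside one contiguous buffer.
buffer_bytes = values.itemsize * len(values)
print(f"item size: {values.itemsize} bytes, buffer: {buffer_bytes / 2**20:.1f} MiB")

# Indexing "boxes" the raw value into a full Python int object.
item = values[123_456]
print(type(item), sys.getsizeof(item), "bytes for one boxed item")

# Materialising every element as a Python object brings back the list-style cost:
# the list's own pointer storage plus one int object per element (approximate).
as_list = list(values)
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
print(f"list of the same values: roughly {list_bytes / 2**20:.1f} MiB")
```

As long as the data stays inside the array, only the buffer is paid for. Once many items are pulled out and kept around as Python objects, the saving shrinks accordingly.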

If we need an array with broader data type support, we can use NumPy arrays. They act very similarly to the standard Python arrays and provide better data type support.



In [7]:
import numpy as np
%memit arr=np.ones(int(1e8), np.int8)

peak memory: 289.00 MiB, increment: 95.37 MiB


### NumExpr to optimize memory

During some NumPy operations, the program can create intermediate objects, which may cause unexpected or unwanted memory consumption. `NumExpr` is a tool that can both speed up such computations and reduce the impact of these intermediate objects.

> NumExpr breaks the long vectors into shorter, cache-friendly chunks and processes each in series, so local chunks of results are calculated in a cache-friendly way.

It should be noted that NumExpr works with both NumPy and pandas. In particular, the pandas `DataFrame.eval` function uses NumExpr for processing if the package is installed.
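
As a rough illustration of the intermediate-object problem, the sketch below evaluates the same expression with plain NumPy and with `numexpr.evaluate`. It assumes both `numpy` and `numexpr` are installed. The array names and the expression are made up for the example, and the arrays are kept small so it runs quickly.

```python
import numpy as np
import numexpr as ne

# Two illustrative input vectors (names and sizes are arbitrary).
a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Plain NumPy: `a * 2` and `b ** 2` are each computed as a full-size
# temporary array before the final addition produces the result.
plain = a * 2 + b ** 2

# NumExpr: the expression is passed as a string and evaluated chunk by chunk,
# so the sub-expressions do not need full-size temporaries.
chunked = ne.evaluate("a * 2 + b ** 2", local_dict={"a": a, "b": b})

print(np.allclose(plain, chunked))
```

To see the difference in memory rather than only in the result, each of the two statements could be measured with `%memit` on much larger arrays, in the same way as the earlier cells.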

### Python's way of storing data

In Python, most objects are dynamic, which means they take more space than their counterparts in statically typed languages. For example, an integer that fits in a few bytes in C is a full object in Python and takes 24 bytes or more. The size also grows when the integer is large. Check below.

In [8]:
import sys

In [9]:
sys.getsizeof(0)

24

In [10]:
sys.getsizeof(1)

28

In [11]:
sys.getsizeof(2**30-1)

28

In [12]:
sys.getsizeof(2**30)

32

We can see that the memory an integer consumes changes with its magnitude. The same goes for lists and strings. However, the `sys.getsizeof` function used above may not reflect the actual memory usage of the full object. For example, it does not account for object hierarchies or underlying data structures. Relying solely on its output can therefore be misleading.

In [13]:
sys.getsizeof([b"asdfghjklqwertyuiop"])

64

In [14]:
sys.getsizeof([b"a", b"b"])

72
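
The two results above count only the list's own pointer storage, not the bytes objects it refers to. One way to get closer to the real footprint is to walk the container and add up the referenced objects. The helper below, `total_size`, is a hypothetical name for a minimal sketch that handles only the built-in containers. It is not a complete memory profiler.

```python
import sys

def total_size(obj, seen=None):
    """Approximate the size of obj plus everything it contains (sketch only)."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # count an object shared by several slots only once
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)    # the object's own, shallow size
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

one_long = [b"asdfghjklqwertyuiop"]
two_short = [b"a", b"b"]

# Shallow size versus size including the contained bytes objects.
print(sys.getsizeof(one_long), total_size(one_long))
print(sys.getsizeof(two_short), total_size(two_short))
```

The exact numbers depend on the Python version and platform, but the deep size is always larger than the shallow one for a non-empty list.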

Among the most common data types, strings are one of the hardest to store and process efficiently. But we can use some special data structures to compress the representation while still allowing fast operations.

## **Storing Strings in RAM**

Let's assume we have a huge number of tokens to store, and we have to search for a given token in our token pile.

The easiest method would be to store the token pile in a list and, when a query comes, compare it with each token in turn. Obviously this is naive. Putting memory aside, this clearly takes a huge amount of time, since every query scans the whole list. As a modest improvement, we can keep the tokens sorted. This helps future search operations, because a binary search can be used, but it may backfire if we need to add more tokens later, as each insertion has to keep the list sorted.
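
The two list-based approaches can be sketched with the standard `bisect` module. The token values and the helper names `linear_contains` and `sorted_contains` are made up for illustration.

```python
import bisect

tokens = ["pear", "apple", "fig", "kiwi", "banana"]

def linear_contains(items, query):
    """Unsorted list: compare the query with every token in turn."""
    return any(item == query for item in items)

def sorted_contains(items, query):
    """Sorted list: binary search for the position where query would sit."""
    i = bisect.bisect_left(items, query)
    return i < len(items) and items[i] == query

sorted_tokens = sorted(tokens)
print(linear_contains(tokens, "kiwi"), sorted_contains(sorted_tokens, "kiwi"))

# Adding a token later must preserve the order. insort finds the slot quickly,
# but the underlying list still has to shift the elements after it.
bisect.insort(sorted_tokens, "cherry")
print(sorted_tokens)
```

The search itself needs only a logarithmic number of comparisons, while each insertion may move a large part of the list, which is the cost the paragraph above warns about.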

Another solution is to use a set. This is indeed easy (at least in Python), but the memory usage is still considerable. Using a set may also become problematic if you need operations such as counting the occurrences of each token. In such situations we can always use a dictionary (map).
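
A small sketch of both options, using an invented token stream. `collections.Counter` is used here as one convenient dictionary subclass for counting. A plain `dict` would work just as well.

```python
from collections import Counter

stream = ["to", "be", "or", "not", "to", "be"]

# A set answers "have we seen this token?" but forgets how often it occurred.
unique_tokens = set(stream)
print("be" in unique_tokens, len(unique_tokens))

# A dictionary keyed by token can carry extra information, such as a count.
token_counts = Counter(stream)
print(token_counts["to"], token_counts["maybe"])
```

Both structures give fast membership tests through hashing, but each still stores every distinct token as a full string object, which is why the memory usage stays considerable.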

But instead of such typical data structures, we can put our knowledge of advanced data structures to use here and choose a tree-like data structure. This option gives fast access and search times as well as good compression.


**Directed Acyclic Word Graph (DAWG)**

This is a very interesting graph-based data structure for storing and compressing strings. The storage method can differ between implementations, but the idea is that each character has its own node and each token is a path of connections between such nodes, so that characters shared between tokens are stored only once. This dramatically reduces memory consumption and reduces search time as well.

You can read more details about it [here](https://dawg.readthedocs.io/en/latest/).


**Tries**

Another data structure we can use to store string data is the trie. Since it behaves similarly to a tree, we can store characters in nodes and link them through their connections. Since a considerable number of strings in a language like English share prefixes, tries can greatly reduce memory consumption while keeping processing fast. It is worth noting, though, that DAWGs can compress better than tries. In Python, `marisa-trie` is a package we can use for this purpose.
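
To make the prefix sharing concrete, here is a toy trie built from nested dictionaries. It is only a sketch of the idea and does not show how `marisa-trie` works internally. The `END` marker and the helper names are assumptions made for this example, and the sketch assumes tokens never contain the marker character. A pure-Python trie like this will not actually save memory, because each dict node is itself a heavy object. Dedicated packages use compact native representations instead.

```python
END = "$"  # hypothetical marker key meaning "a token ends at this node"

def build_trie(tokens):
    """Insert every token character by character, reusing existing prefixes."""
    root = {}
    for token in tokens:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, token):
    """Follow the token's characters and check that a token ends there."""
    node = trie
    for ch in token:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

def count_nodes(trie):
    """Count the nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for key, child in trie.items() if key != END)

tokens = ["car", "card", "care", "cart", "cat"]
trie = build_trie(tokens)

print(contains(trie, "care"), contains(trie, "ca"))
print(sum(len(t) for t in tokens), "characters in total,",
      count_nodes(trie) - 1, "character nodes in the trie")
```

The five tokens contain 18 characters in total, yet the trie needs only 7 character nodes because the prefix `ca` (and `car`) is stored once.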


> Other than those, we can also use probabilistic data structures to store values. This makes our values less accurate, but with incredible memory savings. Examples of such data structures are `Morris Counters`, `K-min Values`, `Bloom Filters`, and `LogLog Counters`. Each has its own specific qualities, so the choice depends heavily on the use case.
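
As one illustration of the accuracy-for-memory trade, below is a minimal Bloom filter sketch using only the standard library. The class name, the default sizes, and the way hash positions are derived from SHA-256 are all choices made for this example rather than a reference implementation. A production filter would size `num_bits` and `num_hashes` from the expected number of items and the acceptable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # fixed size, whatever we add

    def _positions(self, item):
        # Derive several bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
for token in ["apple", "banana", "cherry"]:
    seen.add(token)

print("apple" in seen)   # an added token is always reported as present
print(len(seen.bits), "bytes used, regardless of how many tokens were added")
```

The filter never stores the tokens themselves, only a fixed-size bit array, so a query for a token that was never added can occasionally return `True`. That is the loss of accuracy the note above refers to.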