# Fasting Pandas - A guide into optimizing your analytical processing 

### Part 2

---

Now that we know how to handle the panda, it's time to share it with the world by saving it into a file. Maybe we want to use that file later on to do some other work. Just another day at the office, or your home. Everyone works remote now, right?

Nooo, not yet. We need to take a step back and look into parsing first. 

### A new problem

Our skills have grown and now we are blazing through our analysis. But the reality is we still don't know what we are doing, and we got greedy by trying to work with a bigger dataset. We thought we could handle it, but our computer's RAM is starting to melt.

Let's create a dataset of 10 million rows, way beyond the Excel realm and analyze it a bit.

In [1]:
import fasting_pandas as fp
import pandas as pd
import os

In [4]:
# %timeit /
df = fp.generate_results(10_000_000)
df.info()
# df.to_csv(os.path.join(fp.DATA_DIR, 'slow.csv'), index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 6 columns):
 #   Column  Dtype         
---  ------  -----         
 0   size    object        
 1   age     int32         
 2   team    object        
 3   result  object        
 4   date    datetime64[ns]
 5   prob    float64       
dtypes: datetime64[ns](1), float64(1), int32(1), object(3)
memory usage: 419.6+ MB


Ok, so our 10-million-row, 6-column dataset is consuming 420 MB of RAM just to be available as a dataframe. You might think this is not a lot, but when merging or applying any of our cool transformations, memory spikes start to happen and trouble knocks down the front door. What if we had to work with 20, 50 or 100 million rows? Yeah, not good.

So what do we do?

Let's start with the obvious. Our first question should be what we expect from our outcome. Do we really need all these fields for our analysis? If we don't, we should drop the unnecessary columns when loading the data. This will help with our memory problem.

As a side note, loading every column is the pandas equivalent of using `SELECT *` in SQL. Are you doing that? No! Bad analyst; bad analyst!

For our example we will save the dataset as a csv and read it.

In [6]:
df.to_csv(os.path.join(fp.DATA_DIR,'slow.csv'), index=False)
df = pd.read_csv(os.path.join(fp.DATA_DIR,'slow.csv'), usecols = ['size', 'age', 'date'])
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   size    object
 1   age     int64 
 2   date    object
dtypes: int64(1), object(2)
memory usage: 228.9+ MB


Now, something quite interesting has happened. Can you spot it?

Yes yes, our memory usage was cut down by almost half which makes sense since we removed half the columns. But something else changed, and if your answer was the datatype of age changed from int32 to int64 then good for you Sherlock, you make me proud.

When we first generated our dataframe we did it by calling a function. That function built the DataFrame in memory, where every column carries a data type. For age, the generated values ended up as int32.

A CSV file, however, is plain text. It stores the values and nothing else, so the data types are lost the moment we save. When we read the file back, pandas has to infer a type for each column from the text, and its default for whole numbers is int64. The same thing happened to date, which left as datetime64[ns] and came back as a plain object column. So even though we did well by loading only the columns we needed, part of the benefit was lost to an unnecessarily large data type for our age field.
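To see the round trip in isolation, here is a minimal sketch. It uses a tiny made-up dataframe and an in-memory buffer instead of `slow.csv`, so the values and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Tiny stand-in for the tutorial's dataset (values are made up).
original = pd.DataFrame(
    {
        "age": pd.Series([25, 40, 63], dtype="int32"),
        "date": pd.to_datetime(["2021-01-01", "2021-06-15", "2021-12-31"]),
    }
)

# Write to an in-memory CSV; the "file" only contains text.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default read: pandas has to guess the types from the text.
default_read = pd.read_csv(io.StringIO(csv_text))

# Explicit read: we tell pandas which types we want.
explicit_read = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"age": "int32"},
    parse_dates=["date"],
)

print(csv_text)
print(default_read.dtypes)
print(explicit_read.dtypes)
```

Printing `csv_text` shows there is no type information in the file at all, which is why the default read cannot know that age used to be an int32.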

### Downcasting numbers
The takeaway is that we should be aware of the data we are dealing with, and based on its extremes we can pick an appropriately sized data type for each field in order to save memory.

#### Integers
- Int8 stores integers from -128 to 127
- Int16 (short) stores integers from -32,768 to 32,767
- Int32 stores integers from -2,147,483,648 to 2,147,483,647 
- Int64 (long) stores integers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
#### Floats
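- Float16 (half) stores about 3 significant decimal digits
- Float32 (single) stores about 7 significant decimal digits
- Float64 (double) stores about 15 significant decimal digits
- Float128 (long double) stores roughly 18 to 19 significant decimal digits, depending on the platform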
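You don't have to memorize the integer limits. As a small sketch, NumPy's `np.iinfo` reports them, and a helper can pick the smallest type for a known range. The helper name `smallest_int_dtype` is hypothetical and not part of pandas or NumPy.

```python
import numpy as np

# Print the exact limits of each signed integer type.
for name in ("int8", "int16", "int32", "int64"):
    info = np.iinfo(name)
    print(f"{name:<6} {info.min:>26,} to {info.max:>26,}")


def smallest_int_dtype(low, high):
    """Hypothetical helper: smallest signed integer dtype that holds [low, high]."""
    for name in ("int8", "int16", "int32", "int64"):
        info = np.iinfo(name)
        if info.min <= low and high <= info.max:
            return name
    raise ValueError("range does not fit in a 64-bit signed integer")


# An age between 0 and 120 fits comfortably in the smallest type.
print(smallest_int_dtype(0, 120))
```

Checking the limits before casting matters because a value outside the chosen range cannot be represented in that type.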

So, let's assign the data type explicitly when we read the file.

In [10]:
df = pd.read_csv(os.path.join(fp.DATA_DIR,'slow.csv'), usecols=['size', 'age', 'date'], dtype={'age': 'int32'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   size    object
 1   age     int32 
 2   date    object
dtypes: int32(1), object(2)
memory usage: 190.7+ MB


Cool. So we could have saved almost 40 MB just by being aware of the change in the data type. But let's use our brains for a second.

It seems age is an interesting contender to parse as an int8, since no one should have a negative age or one greater than 127 years. Maybe some time in the future when we turn into robots. For now, it seems reasonable to work within these constraints. Let's do a quick comparison to see how much difference there is between the different integer sizes.

In [11]:
df['age'].astype('int8').info(),df['age'].astype('int16').info(),df['age'].astype('int32').info(),df['age'].astype('int64').info()

<class 'pandas.core.series.Series'>
RangeIndex: 10000000 entries, 0 to 9999999
Series name: age
Non-Null Count     Dtype
--------------     -----
10000000 non-null  int8 
dtypes: int8(1)
memory usage: 9.5 MB
<class 'pandas.core.series.Series'>
RangeIndex: 10000000 entries, 0 to 9999999
Series name: age
Non-Null Count     Dtype
--------------     -----
10000000 non-null  int16
dtypes: int16(1)
memory usage: 19.1 MB
<class 'pandas.core.series.Series'>
RangeIndex: 10000000 entries, 0 to 9999999
Series name: age
Non-Null Count     Dtype
--------------     -----
10000000 non-null  int32
dtypes: int32(1)
memory usage: 38.1 MB
<class 'pandas.core.series.Series'>
RangeIndex: 10000000 entries, 0 to 9999999
Series name: age
Non-Null Count     Dtype
--------------     -----
10000000 non-null  int64
dtypes: int64(1)
memory usage: 76.3 MB


(None, None, None, None)

From 9.5 to 76.3 MB. That is an 8x difference of unnecessary memory waste. Feels bad, doesn't it?
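If you would rather not pick the size by hand, `pd.to_numeric` has a `downcast` option that chooses the smallest type that fits the values it sees. A minimal sketch, using a handful of made-up ages rather than the 10 million rows above:

```python
import pandas as pd

# Made-up ages; read_csv would hand these back as int64 by default.
ages = pd.Series([18, 35, 62, 97], dtype="int64", name="age")

# downcast="integer" picks the smallest signed integer type that fits the data.
small_ages = pd.to_numeric(ages, downcast="integer")

# downcast="unsigned" does the same with unsigned types.
unsigned_ages = pd.to_numeric(ages, downcast="unsigned")

print(small_ages.dtype, unsigned_ages.dtype)
print(ages.memory_usage(index=False), small_ages.memory_usage(index=False))
```

Keep in mind that the choice is based only on the values present at that moment, so a larger value arriving later will not fit the downcast type.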

Let's look at our other columns, size and date. Currently they are parsed as objects, which is the default behaviour for text. So the question is, can we do better?

Yes, yes we can.

Enter the categorical type. In essence, a categorical column can only take a limited, fixed set of possible values. Behind the scenes pandas stores each distinct value once and keeps a small integer code per row. I don't want to go too deep into it, but the main benefit is a drastic improvement in memory usage when reading the file.

But..

Categorical columns are fragile things. It's very easy to fall back to object columns when you apply transformations, so you could well experience the same memory spike when working with the column, regardless of its type. This power comes with responsibility, and your first one is to go and look into it yourself.
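As a small illustration of that fragility, here is a sketch with made-up size labels. Two common operations, a string method and a concatenation of columns with different categories, both hand back a column that is no longer categorical.

```python
import pandas as pd

sizes = pd.Series(["S", "M", "L", "M", "S"], dtype="category")
more_sizes = pd.Series(["XL", "S"], dtype="category")

# String methods return a plain (non-categorical) column.
lowered = sizes.str.lower()

# Concatenating categoricals with different categories also drops the category dtype.
combined = pd.concat([sizes, more_sizes], ignore_index=True)

# Converting back is an explicit step you have to remember.
restored = combined.astype("category")

for label, series in [
    ("original", sizes),
    ("lowered", lowered),
    ("combined", combined),
    ("restored", restored),
]:
    print(f"{label:<9} {series.dtype}")
```

Checking `dtype` after each transformation is a cheap habit that catches these silent conversions early.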

Having said that, I will show you now the difference when parsing a column as an object or category.

In [12]:
df.date.astype('category').info(),df.date.astype('object').info(), df.date.astype('datetime64[ns]').info()

<class 'pandas.core.series.Series'>
RangeIndex: 10000000 entries, 0 to 9999999
Series name: date
Non-Null Count     Dtype   
--------------     -----   
10000000 non-null  category
dtypes: category(1)
memory usage: 19.2 MB
<class 'pandas.core.series.Series'>
RangeIndex: 10000000 entries, 0 to 9999999
Series name: date
Non-Null Count     Dtype 
--------------     ----- 
10000000 non-null  object
dtypes: object(1)
memory usage: 76.3+ MB
<class 'pandas.core.series.Series'>
RangeIndex: 10000000 entries, 0 to 9999999
Series name: date
Non-Null Count     Dtype         
--------------     -----         
10000000 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 76.3 MB


(None, None, None)

Significant, yes. 
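The `+` next to the object figure means pandas only counted the pointers, not the strings they point to, so the real gap is even bigger. A rough sketch with a made-up column of repeated date strings shows how `memory_usage(deep=True)` reveals the full cost:

```python
import pandas as pd

# Made-up column: 300,000 rows but only 3 distinct date strings.
dates = pd.Series(["2021-01-01", "2021-01-02", "2021-01-03"] * 100_000)
as_object = dates.astype("object")

# Shallow count: pointers only, which is what info() reports with a "+".
shallow = as_object.memory_usage(index=False)

# Deep count: pointers plus the strings themselves.
deep = as_object.memory_usage(index=False, deep=True)

# Category: one small integer code per row plus the 3 distinct strings.
as_category = as_object.astype("category").memory_usage(index=False, deep=True)

print(f"object (shallow): {shallow:>12,} bytes")
print(f"object (deep):    {deep:>12,} bytes")
print(f"category (deep):  {as_category:>12,} bytes")
```

The saving depends on how few distinct values the column has. When almost every row is unique, a categorical column can use as much memory as the object column, or more.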

Putting it all together.

In [13]:
df = pd.read_csv(os.path.join(fp.DATA_DIR,'slow.csv'), usecols=['size', 'age', 'date'], dtype={'age': 'int16', 'size': 'category', 'date': 'category'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
 #   Column  Dtype   
---  ------  -----   
 0   size    category
 1   age     int16   
 2   date    category
dtypes: category(2), int16(1)
memory usage: 47.8 MB


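We reduced our dataframe memory consumption from 420 MB to 48 MB. So much room for possibilities! And notice that age was read as int16 here; going all the way down to int8, as we discussed earlier, would trim roughly another 9.5 MB.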

All right, so data type parsing is important. But parsing data like a pro while working with CSVs is pretty comical, to say the least.

I will show you the way. Nobody will be able to say we trained you, as a joke.

Go to lesson 3.