## Introduction

The goal of this notebook is to show how you can find suitable data types for the columns of a very large file and so limit memory usage.

In the TalkingData competition, loading the train.csv file with a standard pd.read_csv command produces a DataFrame that uses about 11 GB of memory. This is too large for most personal computers.

By the end of this session you will know how to:
 
 - read the first N rows of a csv file
 - read the last N rows of a csv file
 - find the memory usage of a pandas DataFrame
 - find the best data types for each feature in the file to limit memory usage
 - use python garbage collection to delete objects that you do not need anymore and free up memory
 


In [1]:
import pandas as pd
import numpy as np

Please change the file_path so that it points to where the train file is on your system  

In [2]:
file_path = "../input/train.csv.zip"

The first thing you should do is open the file with a limited number of lines.

To do this you can tell pandas to read only the first N rows of the file with the **nrows** parameter.

The following line of code reads only the first 20000 rows of the file:

In [3]:
train = pd.read_csv(file_path, nrows=20000)
train.head(20)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,83230,3,1,13,379,2017-11-06 14:32:21,,0
1,17357,3,1,19,379,2017-11-06 14:33:34,,0
2,35810,3,1,13,379,2017-11-06 14:34:12,,0
3,45745,14,1,13,478,2017-11-06 14:34:52,,0
4,161007,3,1,13,379,2017-11-06 14:35:08,,0
5,18787,3,1,16,379,2017-11-06 14:36:26,,0
6,103022,3,1,23,379,2017-11-06 14:37:44,,0
7,114221,3,1,19,379,2017-11-06 14:37:59,,0
8,165970,3,1,13,379,2017-11-06 14:38:10,,0
9,74544,64,1,22,459,2017-11-06 14:38:23,,0


We now know which columns (or features) are available in train.csv.

To find the data types pandas has assigned to each column we can use the **dtypes** attribute of a pandas DataFrame.

In [4]:
train.dtypes

ip                  int64
app                 int64
device              int64
os                  int64
channel             int64
click_time         object
attributed_time    object
is_attributed       int64
dtype: object

The preceding result shows the data types pandas has inferred from the 20000 rows we just read.

To get an idea of the memory usage of the DataFrame we can use the memory_usage() method, which gives the memory used by each column of the DataFrame in bytes.

In [5]:
train.memory_usage()

Index                  80
ip                 160000
app                160000
device             160000
os                 160000
channel            160000
click_time         160000
attributed_time    160000
is_attributed      160000
dtype: int64
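
Note that for **object** columns such as click_time, memory_usage() by default counts only the 8-byte references, not the strings they point to. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative and the column is forced to object dtype for the demonstration) showing how the deep=True option gives a fuller picture:

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch
demo = pd.DataFrame({
    "ip": [83230, 17357, 35810],
    # Force object dtype so the column behaves like click_time above
    "click_time": pd.Series(
        ["2017-11-06 14:32:21", "2017-11-06 14:33:34", "2017-11-06 14:34:12"],
        dtype=object,
    ),
})

shallow = demo.memory_usage()          # object column: size of the references only
deep = demo.memory_usage(deep=True)    # object column: also counts the strings

print(shallow)
print(deep)
```

The numeric column reports the same size in both cases; only the object column grows, so the figures in this notebook are lower bounds for the time columns.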

Using the sum() method we can find the total memory usage in bytes.

In [6]:
train.memory_usage().sum()

1280080

This means 20000 samples already take 1.22 MB in memory, while the full train.csv file contains 184903890 rows.

Therefore the full data file would use around 11 GB in memory!
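
The 11 GB figure can be checked with a quick extrapolation from the sample. This is only a rough sketch: it assumes every row costs the same as the rows in the sample and uses the numbers printed above.

```python
# Extrapolate the sample's memory footprint to the full file
sample_rows = 20000
sample_bytes = 1280080       # train.memory_usage().sum() from the cell above
total_rows = 184903890       # number of rows in train.csv

bytes_per_row = sample_bytes / sample_rows
estimated_gb = bytes_per_row * total_rows / 1024 ** 3
print("Estimated memory usage = %.2f GB" % estimated_gb)
```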

By default pandas reads integer columns with the widest integer type, **int64**. Even a 0/1 flag like **is_attributed** gets an **int64** column when a simple **uint8** would be enough.
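
To judge which type is "enough", it helps to know the range each integer type can hold. The short sketch below prints these ranges with NumPy's iinfo function; the list of types shown is just a selection.

```python
import numpy as np

# Print the size and value range of a selection of integer types
for dtype in (np.uint8, np.uint16, np.uint32, np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    size = np.dtype(dtype).itemsize
    print("%-7s %d byte(s)  min=%d  max=%d" % (np.dtype(dtype).name, size, info.min, info.max))
```

A column stored as uint8 uses 1 byte per value instead of the 8 bytes of an int64.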

So let's see how we can limit memory usage by telling pandas the right data type for each column.

The first thing you can do is limit the columns pandas reads by passing a list of column names to the **usecols** parameter.

In [7]:
pd.read_csv(file_path, nrows=20, usecols=['ip', 'is_attributed'])

Unnamed: 0,ip,is_attributed
0,83230,0
1,17357,0
2,35810,0
3,45745,0
4,161007,0
5,18787,0
6,103022,0
7,114221,0
8,165970,0
9,74544,0


## Exercise

As seen above, the train file contains the following features or columns (time-related columns have been excluded for now):
  - ip
  - app
  - device
  - os
  - channel
  - is_attributed
 
 
**Your exercise is to determine the minimum data type we can use for each of these columns.**
 
The following example reads the **ip** column from the full train.csv file. The returned DataFrame will use approximately 1.4 GB of memory. If this is too big you can limit pd.read_csv to the first 20 million rows with the nrows parameter.
 
Here is an example for the column **ip**.

In [8]:
# read the column
column = 'ip'
column_df = pd.read_csv(file_path, usecols=[column])
# If memory is a big problem for you, please use the following command
# column_df = pd.read_csv(file_path, nrows=20000000, usecols=[column])
# Display memory usage in GB
print("Memory usage = %.3f GB" % (column_df.memory_usage().sum() / 1024 ** 3))
# Find the min 
the_min = column_df[column].min()
# Find the max
the_max = column_df[column].max()
# display min and max and determine the minimum data type
print("min=", the_min, ", max=", the_max)

Memory usage = 1.378 GB
min= 1 , max= 364778


So here we see that the min is *1* and the max is *364778*. The max is larger than 65535, the largest value a 16-bit unsigned integer can hold, so **ip** needs a 32-bit type and can be coded with an **int32**.
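
The same reasoning can be automated for the other columns of the exercise. The function below is a hypothetical helper written for this illustration (it is not part of pandas or NumPy): given a min and a max it returns the narrowest integer type that can hold both, trying unsigned types first when the min is not negative.

```python
import numpy as np

def smallest_int_dtype(min_value, max_value):
    """Return the narrowest NumPy integer dtype holding [min_value, max_value].

    Hypothetical helper for illustration only.
    """
    if min_value >= 0:
        candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        candidates = (np.int8, np.int16, np.int32, np.int64)
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= min_value and max_value <= info.max:
            return np.dtype(candidate)
    raise ValueError("range does not fit in a 64-bit integer")

print(smallest_int_dtype(1, 364778))   # range found above for ip
print(smallest_int_dtype(0, 1))        # a 0/1 flag
```

For **ip** the helper picks uint32, which has the same 4-byte width as the int32 used in this notebook, so the memory saving is identical.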

We can now read this column again and tell pandas which type to use by setting the dtype argument to a dictionary that maps each column name to a type:

In [9]:
column_df = pd.read_csv(file_path, usecols=[column], dtype={column: np.int32})
print("Memory usage after setting data type = %.3f GB" % (column_df.memory_usage().sum() / 1024 ** 3))

Memory usage after setting data type = 0.689 GB


Memory usage has been reduced by 50%.
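
As an alternative to choosing the type by hand, pandas can shrink a column that is already loaded with pd.to_numeric and its downcast option. The sketch below uses a tiny made-up Series whose maximum matches the **ip** maximum found above. Keep in mind that this approach needs the int64 column in memory first, so passing dtype to pd.read_csv remains the better choice for very big files.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the maximum matches the ip maximum found above
ips = pd.Series([1, 83230, 364778], dtype=np.int64)

# Ask pandas for the smallest unsigned integer type that fits the values
downcast = pd.to_numeric(ips, downcast="unsigned")

print(ips.dtype, "->", downcast.dtype)
print(ips.memory_usage(index=False), "->", downcast.memory_usage(index=False), "bytes")
```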

## Reading the last N rows of the file

Reading the last N rows of a csv file is a bit more challenging and uses the **skiprows** argument.

The next cell shows an example that reads the last 20000 rows of the file.

In [10]:
# First we will use the garbage collector to free some memory
import gc
# Enable garbage collection (it is already enabled by default in Python)
gc.enable()
# Delete objects that we won't use anymore
del column_df, train
# Trigger garbage collection
gc.collect()

101
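
The number returned by gc.collect() (101 above) is the number of unreachable objects the collector found. The sketch below repeats the same pattern on a throwaway NumPy array; the array and its size are made up for the illustration.

```python
import gc
import numpy as np

big = np.zeros(1_000_000, dtype=np.int64)      # about 8 MB of data
print("%.1f MB" % (big.nbytes / 1024 ** 2))

# Drop the only reference; CPython can release the array immediately
del big

# Also clean up any objects that are only kept alive by reference cycles
found = gc.collect()
print("unreachable objects found:", found)
```

The del statement does most of the work here; gc.collect() matters when objects reference each other in a cycle and cannot be released by reference counting alone.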

In [11]:
column = 'ip'
total_rows = 184903890
df = pd.read_csv(
    file_path, 
    skiprows=range(1, int(total_rows - 20000)),
    dtype={column: np.int32},
    usecols=['ip']
)
df.tail()

Unnamed: 0,ip
19996,121312
19997,46894
19998,320126
19999,189286
20000,106485


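The skiprows argument gives a list of row numbers that pandas has to skip. Because range() excludes its end value, range(1, 184883890) tells pandas to skip rows 1 to 184883889. This leaves 20001 rows, which is why the last index shown above is 20000; use total_rows - 20000 + 1 as the end of the range to keep exactly 20000 rows.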

We start skipping rows at number 1 since row number 0 is the header row and gives the column names. 
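
The row arithmetic is easier to check on a tiny file. The sketch below builds a made-up 10-row CSV in memory (standing in for train.csv) and reads its last 3 rows; the end of the range is total_rows - n_last + 1 because range() excludes its end value.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for train.csv: 1 header line + 10 data rows
csv_text = "ip\n" + "\n".join(str(i) for i in range(100, 110)) + "\n"
total_rows = 10
n_last = 3

# Skip data rows 1 .. total_rows - n_last and keep the header (row 0)
last_rows = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=range(1, total_rows - n_last + 1),
)
print(last_rows)
```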