# Learning Notebook - Part 3 of 3 - Dealing with larger datasets

Up to this point, you've already handled several datasets. Most were small. When a dataset is bigger, pandas can struggle to digest all of it, because by default it loads the whole file into memory.

<img src="./media/panda_eating.jpg" width="600">

Here we're going to handle larger datasets using pandas. Let's start by doing a couple of read tricks.

In [1]:
import os
import pandas as pd
import random

In [2]:
# Again, a helper function to get filepaths
def pokemons_filepath(filename):
    return os.path.join('data', 'pokemons', filename)

## 1. Reading large data files

### 1.1 Only reading n lines

Sometimes we're just interested in previewing our data.
With the **nrows** argument, we can specify how many rows to read: only the first _n_ data rows of the file (after the header) are loaded into the DataFrame.

We first use a method from the first part of this unit to count the lines in the file, then read the first three rows into a DataFrame.

In [3]:
# Number of lines in the pokemons file - don't count the header row
lines_in_file = ! wc -l < data/pokemons/pokemons.csv
lines_in_file = int(lines_in_file[0])-1
lines_in_file

800
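The `! wc -l` shell call above only works where the `wc` command is available. As a hedged alternative, here is a minimal pure-Python sketch of the same count. The helper name `count_data_rows` and the tiny temporary file are made up for illustration, and the sketch assumes one record per line (no quoted fields containing line breaks).

```python
import os
import tempfile

def count_data_rows(path):
    """Count the lines in a text file, excluding one header line."""
    with open(path, "r", encoding="utf-8") as handle:
        total_lines = sum(1 for _ in handle)  # streams the file line by line
    return max(total_lines - 1, 0)

# Tiny made-up file so the sketch runs on its own
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Name,HP\nBulbasaur,45\nIvysaur,60\nVenusaur,80\n")
    sample_path = tmp.name

n_rows = count_data_rows(sample_path)
os.remove(sample_path)
print(n_rows)
```

Unlike `wc -l`, which counts newline characters, this loop also counts a last line that has no trailing newline.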

In [4]:
pd.read_csv(pokemons_filepath('pokemons.csv'), nrows=3)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False


### 1.2 Only reading some columns

If we only want to read some columns, we can use the argument **usecols** and give it a list of column names or numbers.

In [5]:
pd.read_csv(pokemons_filepath('pokemons_short.csv'), usecols=[1, 3, 7])

Unnamed: 0,Name,Type 2,Sp. Atk
0,Bulbasaur,Poison,65
1,Ivysaur,Poison,80
2,Venusaur,Poison,100
3,Mega Venusaur,Poison,122
4,Charmander,,60
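The cell above selects columns by position. As a small sketch of the other option, the same selection can be written with column names. The in-memory CSV below is a short stand-in for the real file, used only so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in data: a few rows shaped like the pokemons file
csv_text = """#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk
1,Bulbasaur,Grass,Poison,45,49,49,65
2,Ivysaur,Grass,Poison,60,62,63,80
4,Charmander,Fire,,39,52,43,60
"""

# Select by column name
by_name = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Type 2", "Sp. Atk"],
)

# Select the same columns by zero-based position
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1, 3, 7])

same_result = by_name.equals(by_position)
print(same_result)
```

Names are usually the safer choice, since positions silently point at different columns if the file layout changes.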


### 1.3 Read n random lines

To read a set of n random lines from a file, we can use a little trick with the **skiprows** argument of `read_csv`. Let's see how.

We already know that the file has 800 data rows (not counting the header).

Now, let's imagine we want a sample of only 10 random rows from the file. We're going to use `skiprows` to... well... skip rows. Here we pass `skiprows` a list of row numbers to skip rather than a single number. The trick is to randomly pick the rows to be skipped (790 of them here), so that exactly 10 rows are left in place.

In [6]:
sample_number = 10
n_rows_to_skip = lines_in_file - sample_number

random.seed(42) # this is to always get the same sample; it can be removed if we want the sample to change
rows_to_skip = random.sample(
    range(1, lines_in_file+1), # this is a range from the first row after the header, to the last row on the file
    n_rows_to_skip # this is the number of rows we want to random sample here,
                   # and that will be skipped in pd.read_csv with argument skiprows
)

pd.read_csv( 
    pokemons_filepath('pokemons.csv'),
    skiprows=rows_to_skip
)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,359,Spinda,Normal,,60,60,60,60,60,60,3,False
1,375,Crawdaunt,Water,Dark,63,120,85,90,55,55,3,False
2,451,Luxray,Electric,,80,120,79,95,79,70,4,False
3,491,Spiritomb,Ghost,Dark,50,92,108,92,108,35,4,False
4,532,Rotom,Electric,Ghost,50,50,77,95,77,91,4,False
5,574,Simisear,Fire,,75,98,63,98,63,101,5,False
6,648,Sawsbuck,Normal,Grass,80,100,70,60,70,95,5,False
7,650,Karrablast,Bug,,50,75,45,40,45,60,5,False
8,733,Scatterbug,Bug,,38,35,40,27,25,35,6,False
9,774,Carbink,Rock,Fairy,50,50,150,50,150,50,6,False


The downside of this approach is that you have to open the file twice (once to get the number of rows and once to take the sample).

If you don't care a lot about the sample size, and you're happy taking a sample that is a certain percentage of the rows in the file, you can actually avoid opening the file twice.

The `p` variable sets the fraction of rows to keep (0.01 means roughly 1%). We pass `skiprows` a function that is called with each row number and returns `True` when that row should be skipped; here, each data row is skipped with a probability of `1 - p`. The `i > 0` check makes sure the header (row 0) is never skipped.

In [7]:
# Statistically sample approximately 1% of the rows using a probability function
p = 0.01
random.seed(42)

pd.read_csv( 
    pokemons_filepath('pokemons.csv'), 
    skiprows=lambda i: i > 0 and random.random() > p
)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,20,Mega Beedrill,Bug,Poison,65,150,40,15,80,145,1,False
1,125,Mega Kangaskhan,Normal,,105,125,100,60,100,100,1,False
2,270,Lugia,Psychic,Flying,106,90,130,90,154,110,2,True
3,291,Beautifly,Bug,Flying,60,70,50,100,50,65,3,False
4,298,Nuzleaf,Grass,Dark,70,70,40,60,40,60,3,False
5,369,Seviper,Poison,,73,100,60,100,60,65,3,False
6,395,Wynaut,Psychic,,95,23,48,23,48,23,3,False
7,428,Jirachi,Steel,Psychic,100,100,100,100,100,100,3,True
8,482,Chingling,Psychic,,45,30,50,65,50,45,4,False
9,685,Golurk,Ground,Ghost,89,124,80,55,80,55,5,False
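Because each row is kept or skipped independently, the size of the sample is itself random. The sketch below illustrates this on made-up data. The helper `sample_fraction` is hypothetical, and it uses a local `random.Random` instance (an assumption of this sketch, not something the cell above does) so that the global random state is left untouched.

```python
import io
import random
import pandas as pd

# Made-up CSV: one header line plus 1000 data rows
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

def sample_fraction(text, p, seed):
    """Read roughly a fraction p of the data rows, never skipping the header."""
    rng = random.Random(seed)  # local generator, independent of random.seed()
    return pd.read_csv(
        io.StringIO(text),
        skiprows=lambda i: i > 0 and rng.random() > p,
    )

# The same fraction with different seeds gives different sample sizes
sizes = [len(sample_fraction(csv_text, p=0.1, seed=s)) for s in range(5)]
print(sizes)  # the values scatter around 100 rather than being exactly 100
```

If you need an exact sample size, the list-based approach with `random.sample` shown earlier is the one to use.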


## 2. Load the file by chunks

Sometimes files are very large, but you still want to go through all the precious data stored in them.
The problem is that you can easily fill up your memory, and processing the data then becomes really slow!

In these cases you can use chunks! Using chunks means loading the data one small portion at a time and applying any processing you like to each portion. This keeps memory usage low, so you can process files that would otherwise not fit in memory or would slow your machine to a crawl.

This is a simple toy example:

In [8]:
# the iterator for the chunks
chunks_iter = pd.read_csv(
    pokemons_filepath('pokemons.csv'),
    chunksize=99
)

# here you'd do your processing
# in this toy example we just append the chunks to a list, which kind of goes against the spirit of this example :)
chunk_arr = []
for data_chunk in chunks_iter:
    chunk_arr.append(data_chunk)
    
print(
    "We analyzed a total of",
    sum([len(c) for c in chunk_arr]), 
    "rows divided in", 
    len(chunk_arr), 
    "chunks with the following configuration:\n",
    [len(c) for c in chunk_arr]
)

We analyzed a total of 800 rows divided in 9 chunks with the following configuration:
 [99, 99, 99, 99, 99, 99, 99, 99, 8]


Since `pd.read_csv` with chunksize returns an [iterator](https://wiki.python.org/moin/Iterator), you can handle each chunk with a small amount of memory. 
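To stay closer to the spirit of chunking, you can update a few running totals per chunk and then let the chunk go, instead of collecting every chunk in a list. The following is a minimal sketch on a tiny in-memory CSV (stand-in data); the names `type_counts`, `total_hp` and `n_rows` are made up for illustration.

```python
import io
from collections import Counter
import pandas as pd

# Stand-in data so the sketch runs without the real file
csv_text = """Name,Type 1,HP
Bulbasaur,Grass,45
Charmander,Fire,39
Squirtle,Water,44
Ivysaur,Grass,60
Jynx,Ice,65
Articuno,Ice,90
Vulpix,Fire,38
"""

type_counts = Counter()  # running count of rows per primary type
total_hp = 0             # running sum of HP
n_rows = 0               # running number of rows seen

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Only small summaries outlive each iteration, not the chunk itself
    type_counts.update(chunk["Type 1"].value_counts().to_dict())
    total_hp += int(chunk["HP"].sum())
    n_rows += len(chunk)

mean_hp = total_hp / n_rows
print(dict(type_counts), round(mean_hp, 2))
```

This pattern works for any statistic that can be combined across chunks, such as counts, sums, minimums and maximums.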

Let's now see a more realistic example.
We'll load our pokemons file in chunks and, in each chunk, we'll only keep the rows where the pokemon has `Type 1 = Ice`.
Then, we'll create a DataFrame from the filtered chunks.

In [9]:
def filter_type1_ice(data):
    return data.loc[data['Type 1'] == 'Ice']

In [10]:
chunks_iter = pd.read_csv(
    pokemons_filepath('pokemons.csv'),
    chunksize=99
)

chunk_arr = []
for data_chunk in chunks_iter:
    data_chunk_filtered = filter_type1_ice(data_chunk)
    chunk_arr.append(data_chunk_filtered)

final_pd = pd.concat(chunk_arr, axis=0)
final_pd.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
133,134,Jynx,Ice,Psychic,65,50,35,115,95,95,1,False
156,157,Articuno,Ice,Flying,90,85,100,95,125,85,1,True
238,239,Swinub,Ice,Ground,50,50,40,30,30,50,2,False
239,240,Piloswine,Ice,Ground,100,100,80,60,60,50,2,False
243,244,Delibird,Ice,Flying,45,55,45,65,45,75,2,False


Of course, for illustration purposes, we are applying big-file techniques to a small file! Given this file's size, we could simply run:
```
data = pd.read_csv(pokemons_filepath('pokemons.csv'))
data.loc[data['Type 1'] == 'Ice', :]
```

If for some reason you still need to reduce the size of the resulting DataFrame, there are two other things you can do:
- reduce the number of columns
- change field types to the most appropriate ones, e.g., you can "shrink" ints and floats to 32 bits or even 16 bits

Here is an example of this and how we can measure the memory "savings".

In [11]:
# size in bytes
final_pd.memory_usage().sum()

np.int64(2328)

Here we remove some columns, and the DataFrame size is already reduced.

In [12]:
cols = ['Name', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation']
final_pd_shrinked = final_pd.loc[:, cols].copy()  # explicit copy, so later column assignments don't touch final_pd
final_pd_shrinked.memory_usage().sum()

np.int64(1536)

Here we change some column types, from `int64` to `int16`:

In [13]:
cols = ['Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation']
final_pd_shrinked[cols] = final_pd_shrinked[cols].astype('int16')

# size in bytes
final_pd_shrinked.memory_usage().sum()

np.int64(672)
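Both reductions can also be requested while reading, by combining `usecols` with the `dtype` argument of `read_csv`, so the larger version of the DataFrame is never built. Below is a hedged sketch on a few stand-in rows; the choice of columns and of `int16` is an assumption that you should check against the real value ranges, since `int16` only holds values from -32768 to 32767.

```python
import io
import pandas as pd

# Stand-in data: a few Ice pokemons with a subset of the columns
csv_text = """Name,Type 1,Attack,Defense,Generation
Jynx,Ice,50,35,1
Articuno,Ice,85,100,1
Swinub,Ice,50,40,2
"""

wanted = ["Name", "Attack", "Defense", "Generation"]

# Default read: integer columns come back as int64
default_df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Compact read: same columns, but integers stored as int16 from the start
compact_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    dtype={"Attack": "int16", "Defense": "int16", "Generation": "int16"},
)

default_bytes = default_df.memory_usage().sum()
compact_bytes = compact_df.memory_usage().sum()
print(default_bytes, compact_bytes)
```

The same `dtype` dictionary can be passed together with `chunksize`, so each chunk already arrives in its compact form.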