# Applications of Artificial Intelligence
## Large Data Sets
### Chunking
In this notebook we'll work with a large CSV file. 

If you downloaded the large version of this demo, then the file has 5 million rows and is 624 MB. This doesn't seem that large compared to the RAM of modern machines, but even if we manage to load it, we shouldn't expect any analysis to run smoothly. 

If you downloaded the smaller version for bandwidth reasons, then the dataset is just 100,000 rows, and you can probably read it all at once. The same principles still apply, so you can follow along as a demonstration.

This notebook is going to rely on two of the features in the CSV reader from Pandas. It's worth considering how you would have incorporated these into the CSV reader you wrote in an earlier week.

First, let's suppose the file is so big you have no way to open it in any application. We can't write our analysis code if we don't even know what columns the dataset contains. Thankfully, we can get a glimpse of what's contained in the data set using the 'number of rows' parameter `nrows` of the Pandas CSV reader. Here we have set that to five so we can see the first five rows.

In [1]:
import pandas as pd

df = pd.read_csv('data.csv', nrows=5)
df

| Unnamed: 0 | Region | Country | Item Type | Sales Channel | Order Priority | Order Date | Order ID | Ship Date | Units Sold | Unit Price | Unit Cost | Total Revenue | Total Cost | Total Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australia and Oceania | Palau | Office Supplies | Online | H | 3/6/2016 | 517073523 | 3/26/2016 | 2401 | 651.21 | 524.96 | 1563555.21 | 1260428.96 | 303126.25 |
| 1 | Europe | Poland | Beverages | Online | L | 4/18/2010 | 380507028 | 5/26/2010 | 9340 | 47.45 | 31.79 | 443183.0 | 296918.6 | 146264.4 |
| 2 | North America | Canada | Cereal | Online | M | 1/8/2015 | 504055583 | 1/31/2015 | 103 | 205.7 | 117.11 | 21187.1 | 12062.33 | 9124.77 |
| 3 | Europe | Belarus | Snacks | Online | C | 1/19/2014 | 954955518 | 2/27/2014 | 1414 | 152.58 | 97.44 | 215748.12 | 137780.16 | 77967.96 |
| 4 | Middle East and North Africa | Oman | Cereal | Offline | H | 4/26/2019 | 970755660 | 6/2/2019 | 7027 | 205.7 | 117.11 | 1445453.9 | 822931.97 | 622521.93 |


In this case, we can see that our file contains some sales data with information about the region and country in which a sale took place, the type of items involved, order ID, shipment dates, unit prices, and so on. 

Now that we know what the data looks like, we can build something that works over the entire dataset. To get through the data with reasonable memory usage, we'll use a strategy called chunking, where we read the CSV file in blocks or chunks and process the data for that chunk independently. This allows us to work with data files that may contain millions or even billions of data entries. 

First, let's work out what we want to do to each chunk and check that the code works. Suppose we want a list of every country in the dataset, along with the number of rows in which that country appears. We can try this out on a small sample of the data first.

We can use the `.value_counts()` method on the country column to count the unique values:

In [2]:
df = pd.read_csv('data.csv', nrows=100)
country_counts = df["Country"].value_counts()
country_counts

Portugal      3
Oman          3
Montenegro    3
Poland        3
Qatar         3
             ..
Austria       1
Namibia       1
Estonia       1
Grenada       1
Nauru         1
Name: Country, Length: 77, dtype: int64

This returns a Pandas `Series` object, which is like a 1D DataFrame. Each row is labelled with the name of the country. 

If we add two `Series` objects together, Pandas sums the values in the rows that share a label. However, any country that occurs in only one of the two chunks does not get a count: its value is set to NaN.
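To see this label alignment in isolation, here is a minimal sketch on two tiny hand-made `Series`. The country names and counts are hypothetical and are not taken from the CSV file.

```python
import pandas as pd

# Hypothetical counts from two chunks: "Peru" is missing from the first
# Series and "Oman" is missing from the second.
first = pd.Series({"Oman": 3, "Chad": 1})
second = pd.Series({"Chad": 2, "Peru": 5})

# Labels are aligned before adding; a label present on only one side
# has nothing to be added to, so its result is NaN.
total = first + second
print(total)
```

Only "Chad" appears in both, so it is the only label with a numeric total. Because NaN is a floating-point value, the result is stored as `float64` even though both inputs held integers.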

In the following cell we demonstrate this on two chunks of 100 rows from the CSV file. We use `skiprows` to select the *second* chunk of 100 rows. We must keep row zero, since it contains the header, so we skip rows 1 to 100 with `range(1, 101)`.

In [3]:
df = pd.read_csv('data.csv', nrows=100, skiprows=range(1, 101))
country_counts = country_counts + df["Country"].value_counts()
country_counts

Algeria                 NaN
Andorra                 NaN
Antigua and Barbuda     2.0
Austria                 NaN
Azerbaijan              NaN
                       ... 
Uzbekistan              NaN
Vatican City            NaN
Yemen                   4.0
Zambia                  2.0
Zimbabwe                2.0
Name: Country, Length: 118, dtype: float64

If we use the `.add(...)` method rather than `+`, we can specify the `fill_value` argument, which supplies a default value (here zero) for any country that is missing from one of the two `Series`.

In [4]:
# get chunk 1 and count countries
df = pd.read_csv('data.csv', nrows=100)
country_counts = df["Country"].value_counts()

# get chunk 2 and count countries
df = pd.read_csv('data.csv', nrows=100, skiprows=range(1, 101))
chunk_counts = df["Country"].value_counts()

# add chunk 2 to chunk 1 with missing values set to zero
country_counts = country_counts.add(chunk_counts, fill_value=0)

country_counts.astype(int)

Algeria                 1
Andorra                 1
Antigua and Barbuda     2
Austria                 1
Azerbaijan              1
                       ..
Uzbekistan              2
Vatican City            2
Yemen                   4
Zambia                  2
Zimbabwe                2
Name: Country, Length: 118, dtype: int64

Now that we have the theory in place for how to do what we want in chunks, we can apply this to the entire CSV file, then tune the chunk size to fit into memory.
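Before moving to the file, the two-chunk pattern can be generalized to any number of chunks. The following is a minimal sketch, assuming the chunks arrive as an iterable of DataFrames. The helper name `count_values_in_chunks` and the toy chunks are hypothetical and are not part of Pandas or the dataset.

```python
import pandas as pd


def count_values_in_chunks(chunks, column):
    """Sum the value_counts() of `column` over an iterable of DataFrames."""
    totals = pd.Series(dtype="int64")
    for chunk in chunks:
        chunk_counts = chunk[column].value_counts()
        # fill_value=0 treats a label missing from one side as zero
        totals = totals.add(chunk_counts, fill_value=0)
    # add() with fill_value can produce floats, so convert back to integers
    return totals.astype("int64")


# Hypothetical chunks standing in for pieces of the CSV file
chunk_a = pd.DataFrame({"Country": ["Oman", "Chad", "Oman"]})
chunk_b = pd.DataFrame({"Country": ["Peru", "Chad"]})

counts = count_values_in_chunks([chunk_a, chunk_b], "Country")
print(counts.sort_index())
```

Because the helper only needs something it can loop over, the same function should work unchanged with a list of DataFrames or with the chunk iterator returned by the CSV reader.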

The number of options available in the Pandas CSV reader is overwhelming, but you can see all of the details in [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). To go through the CSV file in chunks, we'll enable the `iterator` option and provide a value for the `chunksize` option as well. This means that `read_csv` will not return a DataFrame but an iterator, which yields one DataFrame chunk each time it is iterated. As usual, the easiest way to go through the contents of an iterator is in a for loop. 

In [5]:
import time

start_time = time.process_time()

country_counts = pd.Series(dtype=int)
iterator = pd.read_csv('data.csv', iterator=True, chunksize=100000)

for df in iterator:
    chunk_counts = df["Country"].value_counts()
    country_counts = country_counts.add(chunk_counts, fill_value=0)
    
print(country_counts.astype(int).sort_values(ascending=False))

print()
end_time = time.process_time()
print(f"Total time: {end_time-start_time:.1f} seconds")

Liberia                           27226
Panama                            27150
Belize                            27144
Federated States of Micronesia    27139
Cote d'Ivoire                     27126
                                  ...  
Lithuania                         26920
Vatican City                      26903
Malaysia                          26875
Netherlands                       26874
Belarus                           26867
Length: 185, dtype: int64

Total time: 8.1 seconds


The code above still takes a few seconds to run (it is processing 5 million rows, after all), but at least it completes without running out of memory. A larger chunk size might improve performance, at the cost of holding more data in memory at once.
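One way to guide the choice of chunk size is to measure how much memory a single chunk occupies. This is a rough sketch on a small synthetic CSV held in memory; the contents are made up, so the byte counts will not match `data.csv`.

```python
import io
import pandas as pd

# Small synthetic CSV standing in for data.csv (hypothetical contents)
rows = ["Country,Units Sold"] + [f"Country{i % 7},{i}" for i in range(1000)]
csv_text = "\n".join(rows)

sizes = {}
for chunksize in (100, 500):
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=chunksize)
    first_chunk = next(iter(reader))
    # deep=True also counts the memory held by the string values
    sizes[chunksize] = int(first_chunk.memory_usage(deep=True).sum())
    print(f"chunksize={chunksize}: about {sizes[chunksize]} bytes per chunk")
```

Multiplying the per-chunk figure up gives an estimate of what a larger `chunksize` would cost, although any intermediate results your analysis creates will add to it.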

The important thing is that, provided you have written the analysis correctly, the results will be the same whatever chunk size you choose. So, this is an effective way of working with large data files that you may have stored locally on your machine: it enables some analysis that would otherwise be infeasible. 
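This property can be checked on data small enough to load in one go. The sketch below uses a synthetic in-memory CSV (hypothetical contents) and compares the chunked counts with the counts from reading everything at once; `chunked_counts` is an illustrative name.

```python
import io
import pandas as pd

# Synthetic single-column CSV, small enough to read whole
rows = ["Country"] + [f"Country{i % 7}" for i in range(1000)]
csv_text = "\n".join(rows)

# Reference answer: read the whole file and count in one step
whole = pd.read_csv(io.StringIO(csv_text))
expected = whole["Country"].value_counts().to_dict()


def chunked_counts(chunksize):
    """Count countries by reading the CSV text in chunks of `chunksize` rows."""
    totals = pd.Series(dtype="int64")
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=chunksize):
        totals = totals.add(chunk["Country"].value_counts(), fill_value=0)
    return totals.astype("int64").to_dict()


# The counts should match regardless of how the rows are split up
for chunksize in (1, 33, 250, 5000):
    assert chunked_counts(chunksize) == expected
```

A check like this on a small sample is a cheap way to gain confidence in the chunked logic before running it on the full file.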

One key point to watch out for is that you need to design your processing carefully so that any operations are valid when the data is chunked in this way. For example, if a calculation needed information from multiple rows across multiple chunks, then we would need to consider some alternative way of handling the data. 
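A mean is a simple case where this matters. Averaging the per-chunk means gives the wrong answer when chunks differ in size, so one option is to carry a running sum and row count across chunks instead. The sketch below uses five made-up values for the `Units Sold` column.

```python
import io
import pandas as pd

# Five hypothetical values; with chunksize=2 the last chunk has one row
csv_text = "Units Sold\n" + "\n".join(str(n) for n in [10, 20, 30, 40, 100])

total = 0
count = 0
chunk_means = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["Units Sold"].sum()  # running sum across chunks
    count += len(chunk)                 # running row count across chunks
    chunk_means.append(chunk["Units Sold"].mean())

overall_mean = total / count                     # 200 / 5 = 40.0
naive_mean = sum(chunk_means) / len(chunk_means)  # (15 + 35 + 100) / 3 = 50.0

print(f"Correct mean: {overall_mean}")
print(f"Mean of chunk means: {naive_mean}")
```

Sums and counts combine cleanly across chunks, which is why the mean can be rebuilt from them. Other statistics, such as a median, cannot be assembled from per-chunk results in this way and would need a different approach.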