# Handling 'Medium' Data with Dask - Explore SA Gawler Challenge

### In previous notebooks we've looked at data cleaning, feature engineering and machine learning on geological datasets. All of these datasets are small by 'big data' standards: the processes and variables have fit within my laptop's RAM or, more simply, the files have been small enough for my laptop to handle. In reality, the more data we have the more information we can extract, but more data also puts a bigger strain on computing power. This notebook looks at using the Python package Dask to perform processes on large data with a little laptop, with a couple of helpful Python hints along the way.

<img src="Images/Intro.png" />

Firstly, let's define what big, medium and small data are in terms of size. A lot of what I've read online characterizes data as 'Big Data' when it has a high volume, a high variety of data types and a high velocity, meaning it arrives faster than it can comfortably be processed (known as the 3 V's). A general rule of thumb I like is that small data is <10GB, medium data is <1TB and anything over 1TB is considered big data. As mentioned above, all the datasets in previous notebooks fall within the 'small data' category and could be created/processed using Python packages such as pandas, numpy and sklearn, but how do we do this for data that is too large for my RAM?

The objective of this notebook is to use Python to manufacture a 'medium data' sized DataFrame that is ready to undergo data analysis and be used to create Machine Learning models on my laptop.

## The Data

With the [Unearthed ExploreSA Gawler](https://unearthed.solutions/u/exploresa-gawler) challenge coming up, I thought it'd be a great opportunity to utilize spatial data from [SARIG](https://map.sarig.sa.gov.au/) for the example dataset. For this, I selected a large northern chunk of the Gawler Craton to focus on, which incorporates 36 different map zones with a total area of 582,918.7 km^2. From SARIG, I downloaded a whole bunch of different geological data, spanning from the Archaean basement lithology to surface regolith material. Using QGIS, I chose 22 features (listed below) that I wanted to put into a workable dataframe. Keep in mind that these features were selected purely for this example and may or may not win you the Gawler Challenge.

<img src="Images/Table.png" />

The way I chose to create this dataset was by converting QGIS vector files into images and then loading the image pixels as rows and columns into a dataframe. First I clipped the data to my selected study area and then converted the files to raster, with each pixel representing a 50m x 50m area. Because the study area covers 582,918.7 km^2, each feature is composed of 242,832,800 pixels, so we're definitely leaving the realm of small data now. I saved each image as a ~1GB .tif file, ready to be loaded into Python.
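As a rough check on these sizes, the arithmetic below works out how much memory one feature, and then all 22 features, would need once loaded. It assumes each pixel is held as a 4-byte (32-bit) float, which is an assumption about how the rasters are read rather than something fixed by the .tif files themselves.

```python
# Back-of-the-envelope memory estimate for the rasters described above.
pixels_per_feature = 242_832_800   # pixels in one rasterised feature (from the text)
bytes_per_pixel = 4                # assumption: each pixel stored as a 32-bit float
n_features = 22                    # number of features selected in QGIS

bytes_per_feature = pixels_per_feature * bytes_per_pixel
total_bytes = bytes_per_feature * n_features

print(f"{bytes_per_feature / 1e6:.2f} MB per feature")  # 971.33 MB per feature
print(f"{total_bytes / 1e9:.2f} GB in total")           # 21.37 GB in total
```

At a little over 21 GB for the full set, this is far more than a typical laptop can hold in RAM at once.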

The GIF below shows the study area and a snippet of the features used within this notebook.

![SegmentLocal](Images/gawlerexample.gif)

## Loading in the Data - Dask Array

With smaller datasets pandas can be used to create dataframes, do feature engineering etc., but when we are dealing with 22 separate image files at 1GB each, pandas just doesn't quite cut it. This is because pandas tries to hold the whole combined dataframe in my computer's memory (RAM) at once. My computer has 8GB of RAM and the file I'm trying to create is 22GB. This is where the Python package Dask comes into play. The way Dask operates is better explained [here](https://www.analyticsvidhya.com/blog/2018/08/dask-big-datasets-machine_learning-python/), but it effectively splits large data into multiple partitions and works on one partition at a time. Another great thing about Dask is that it is built on top of common Python packages such as pandas, numpy and scikit-learn, which means that if you're already familiar with these packages it will be easy to understand Dask. Let's get started in setting up our DataFrame.

In [1]:
#Importing all required packages
import numpy as np
import pandas as pd

import skimage.io  #importing the io submodule explicitly so skimage.io.imread is available

import glob
import os

import dask
import dask.array as da
import dask.dataframe as dd

After importing the python libraries our next step is to load up our image files. To do this we will use dask.delayed to 'lazily' load in the images as dask arrays. Lazily loading something as an array just means we get the data type of the file and the shape of it but don't actually do anything with the file yet. 

As we want to turn these images into columns we must 'flatten' them. This means lining up all the pixels in a row as a 1 dimensional array. After we have lazily loaded the images we can convert them into dask arrays and then stack them all together into the shape of the dataframe we want (as columns).
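The pattern described above can be tried out without any image files. The sketch below swaps skimage's reader for a made-up function, `fake_imread` (hypothetical, it just generates a small random array), so that the lazy loading, flattening and stacking steps can be run anywhere.

```python
import dask
import dask.array as da
import numpy as np


def fake_imread(seed):
    """Hypothetical stand-in for skimage.io.imread: returns a tiny 4x5 float32 'image'."""
    rng = np.random.default_rng(seed)
    return rng.random((4, 5)).astype(np.float32)


# Wrapping the reader so that calling it only records a task instead of running it
lazy_read = dask.delayed(fake_imread, pure=True)
lazy_images = [lazy_read(seed) for seed in range(3)]

# Flattening is also lazy: nothing has been "read" yet
lazy_flat = [image.flatten() for image in lazy_images]

# Computing a single image to learn the dtype and shape shared by all of them
sample = lazy_flat[0].compute()

# Turning each delayed result into a dask array, then stacking them as columns
arrays = [da.from_delayed(flat, dtype=sample.dtype, shape=sample.shape) for flat in lazy_flat]
stack = da.stack(arrays, axis=1)

print(stack.shape)   # (20, 3): 20 pixels per image, 3 images
print(stack.chunks)  # one chunk per image column
```

Only the single `sample` image is actually evaluated here; the rest of the stack stays as a task graph until `.compute()` is called on it.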

In [2]:
#Preparing to use skimage to lazily load in files
imread = dask.delayed(skimage.io.imread, pure=True)

#Reading the files and flattening them
filenames = sorted(glob.glob(" >>FILEPATH FOR TIFF IMAGES<<< "))
lazy_images = [imread(path) for path in filenames]
lazy_flat = [i.flatten() for i in lazy_images]

#Gaining the shape of the arrays
sample = lazy_flat[0].compute()

#Creating dask arrays from all the read images
arrays = [da.from_delayed(lazy_image, dtype=sample.dtype, shape=sample.shape) for lazy_image in lazy_flat]

#Stacking all the arrays together
stack = da.stack(arrays, axis=1)

After the above cell is executed we will have a stack of all our 1D image arrays. Before we go anywhere though, let's load in the image names. NOTE: it is important to make sure the right names match up with the right arrays; the best bet is to sort everything alphabetically before loading it in.

In [3]:
#Gaining the names of the files and removing the file extension
#Built from the same sorted list used to load the images, so names and arrays stay aligned
names = [os.path.splitext(os.path.basename(path))[0] for path in filenames]

Now let's have a look at the shape of our stacked array.

In [4]:
stack

| | Array | Chunk |
|---|---|---|
| Bytes | 21.37 GB | 971.33 MB |
| Shape | (242832800, 22) | (242832800, 1) |
| Count | 88 Tasks | 22 Chunks |
| Type | float32 | numpy.ndarray |


In [5]:
stack.shape

(242832800, 22)

Here we can see that our stack is made up of 22 arrays (columns) of 242,832,800 values each. Our next step is to convert this into a Dask DataFrame so we can do some work on it. To do this we can convert it straight from the stack of arrays, using the file names as column headers.

## Creating the DataFrame - Dask DataFrame

In [None]:
#Creating a dask dataframe from the array
df = dd.from_dask_array(stack, columns=names)

In [7]:
#Displaying the dataframe structure and data types
df

| | Arch 200m Fault Buffer | Arch 200m Geology Buffer | Arch Geology | Areas of Interest Points | Calcareous Induration | Dep-Ero-Res | Distance to Craton Scaled | Ferruginous Induration | Gypsiferous Induration | Map Tiles | Mid Meso 200m Fault Buffer | Mid Meso 200m Geology Buffer | Mid Meso Geology | Mixed Calcareous-Gypsiferous Induration | Neo-Ord 200m Fault Buffer | Neo-Ord 200m Geology Buffer | Neo-Ord Geology | Regolith Landform | Regolith Material | Siliceous Induration | Surface Geology 2M | Trans-Insitu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=1 | | | | | | | | | | | | | | | | | | | | | | |
| 0 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 |
| 242832799 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


We've now got a fully functioning Dask DataFrame. Notice that npartitions=1. This means that if we were to compute anything, Dask would run it on the whole DataFrame at once instead of splitting it up into smaller DataFrames. We should change this by repartitioning the DataFrame.

<img src="Images/partition.png" />

In [9]:
#Repartitioning the df into 1000 parts
df = df.repartition(npartitions=1000)

In [10]:
df

| | Arch 200m Fault Buffer | Arch 200m Geology Buffer | Arch Geology | Areas of Interest Points | Calcareous Induration | Dep-Ero-Res | Distance to Craton Scaled | Ferruginous Induration | Gypsiferous Induration | Map Tiles | Mid Meso 200m Fault Buffer | Mid Meso 200m Geology Buffer | Mid Meso Geology | Mixed Calcareous-Gypsiferous Induration | Neo-Ord 200m Fault Buffer | Neo-Ord 200m Geology Buffer | Neo-Ord Geology | Regolith Landform | Regolith Material | Siliceous Induration | Surface Geology 2M | Trans-Insitu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=1000 | | | | | | | | | | | | | | | | | | | | | | |
| 0 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 | float32 |
| 242832 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 242589966 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 242832799 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


We can see that the dask dataframe has now been split up into 1000 smaller DataFrames with approximately 242,832 rows each. This makes it a lot easier for the computer to process the DataFrame as a whole. 

This next step may seem a little strange, but I have noticed that saving these newly partitioned DataFrames to .csv files and reloading them back into the notebook as a new Dask DataFrame significantly decreases processing time. So in a similar fashion to pandas, I will save the 1000 partitioned .csv files to a hard drive and read them back into the workspace.

In [11]:
#Saving all the partitions to csv to not be running from memory
df.to_csv('  >>FILEPATH FOR CSV FILES<<<  ')

['D:/EP/Dask/df-000.csv',
 'D:/EP/Dask/df-001.csv',
 'D:/EP/Dask/df-002.csv',
 'D:/EP/Dask/df-003.csv',
 'D:/EP/Dask/df-004.csv',
 'D:/EP/Dask/df-005.csv',
 'D:/EP/Dask/df-006.csv',
 'D:/EP/Dask/df-007.csv',
 'D:/EP/Dask/df-008.csv',
 'D:/EP/Dask/df-009.csv',
 'D:/EP/Dask/df-010.csv',
 'D:/EP/Dask/df-011.csv',
 'D:/EP/Dask/df-012.csv',
 'D:/EP/Dask/df-013.csv',
 'D:/EP/Dask/df-014.csv',
 'D:/EP/Dask/df-015.csv',
 'D:/EP/Dask/df-016.csv',
 'D:/EP/Dask/df-017.csv',
 'D:/EP/Dask/df-018.csv',
 'D:/EP/Dask/df-019.csv',
 'D:/EP/Dask/df-020.csv',
 'D:/EP/Dask/df-021.csv',
 'D:/EP/Dask/df-022.csv',
 'D:/EP/Dask/df-023.csv',
 'D:/EP/Dask/df-024.csv',
 'D:/EP/Dask/df-025.csv',
 'D:/EP/Dask/df-026.csv',
 'D:/EP/Dask/df-027.csv',
 'D:/EP/Dask/df-028.csv',
 'D:/EP/Dask/df-029.csv',
 'D:/EP/Dask/df-030.csv',
 'D:/EP/Dask/df-031.csv',
 'D:/EP/Dask/df-032.csv',
 'D:/EP/Dask/df-033.csv',
 'D:/EP/Dask/df-034.csv',
 'D:/EP/Dask/df-035.csv',
 'D:/EP/Dask/df-036.csv',
 'D:/EP/Dask/df-037.csv',
 'D:/EP/Dask

In [12]:
#Reading all csv files into a dask dataframe
dfn = dd.read_csv('  >>FILEPATH FOR CSV FILES<<<  ')

<img src="Images/csv_files.png" />

With the DataFrame loaded back into the notebook let's take a look at the head of it. This only computes on the first partition, unless stated otherwise.

In [13]:
#Looking at the head of the first partition
dfn.head()

| | Unnamed: 0 | Arch 200m Fault Buffer | Arch 200m Geology Buffer | Arch Geology | Areas of Interest Points | Calcareous Induration | Dep-Ero-Res | Distance to Craton Scaled | Ferruginous Induration | Gypsiferous Induration | ... | Mid Meso Geology | Mixed Calcareous-Gypsiferous Induration | Neo-Ord 200m Fault Buffer | Neo-Ord 200m Geology Buffer | Neo-Ord Geology | Regolith Landform | Regolith Material | Siliceous Induration | Surface Geology 2M | Trans-Insitu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 4 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [14]:
#Setting up summary statistics for each column (lazy: only the structure is shown until .compute() is called)
dfn.describe()

| | Unnamed: 0 | Arch 200m Fault Buffer | Arch 200m Geology Buffer | Arch Geology | Areas of Interest Points | Calcareous Induration | Dep-Ero-Res | Distance to Craton Scaled | Ferruginous Induration | Gypsiferous Induration | Map Tiles | Mid Meso 200m Fault Buffer | Mid Meso 200m Geology Buffer | Mid Meso Geology | Mixed Calcareous-Gypsiferous Induration | Neo-Ord 200m Fault Buffer | Neo-Ord 200m Geology Buffer | Neo-Ord Geology | Regolith Landform | Regolith Material | Siliceous Induration | Surface Geology 2M | Trans-Insitu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=1 | | | | | | | | | | | | | | | | | | | | | | | |
| | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


Our DataFrame is now available to be used for any kind of data analysis and feature engineering we see fit to do. As an example, let's try to dummy encode our categorical features.

## Dummy Encode Variables - Dask ML

This is the first time we've looked at encoding in my notebooks, and all it effectively does is convert categorical features to numerical features. The main reason for doing this is that some machine learning models only take numerical features as inputs. Encoding is quite simple, and the image below is a visual representation of how dummy encoding works.

<img src="Images/dummies.png" />
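The same idea in code, using pandas' `get_dummies` on a made-up column (the sample IDs and rock names are hypothetical and only there for illustration):

```python
import pandas as pd

# A tiny made-up table with one categorical feature
rocks = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "rock": ["granite", "basalt", "shale", "granite"],
})

# Each distinct value of 'rock' becomes its own 0/1 column
encoded = pd.get_dummies(rocks, columns=["rock"])

print(list(encoded.columns))
# ['sample_id', 'rock_basalt', 'rock_granite', 'rock_shale']

# Every row has exactly one of the dummy columns switched on
dummy_columns = [c for c in encoded.columns if c.startswith("rock_")]
print(encoded[dummy_columns].astype(int).sum(axis=1).tolist())
# [1, 1, 1, 1]
```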

We will do the exact same thing to our categorical features in this example. Firstly, let's get a list of all the categorical features we want to get dummies for and convert them to dtype 'object'.

In [15]:
#Creating list of categorical features
cats = ['Arch Geology', 'Calcareous Induration', 'Dep-Ero-Res', 'Ferruginous Induration', 'Gypsiferous Induration', 
        'Mid Meso Geology', 'Mixed Calcareous-Gypsiferous Induration', 'Neo-Ord Geology', 'Regolith Landform', 
        'Regolith Material', 'Siliceous Induration', 'Surface Geology 2M', 'Trans-Insitu']

In [16]:
#Converting the dtypes
for i in cats:
    dfn[i] = dfn[i].astype('str')

In [17]:
dfn.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 23 entries, Unnamed: 0 to Trans-Insitu
dtypes: object(13), float64(9), int64(1)

Here we can see our 13 selected features are now of data type object. Next we will import a categorizer and dummy encoder, then apply them to our data using sklearn's pipeline. This is also the first time we've looked at pipelines in these notebooks. A Pipeline is a tool that creates a repeatable set of processing steps that can be applied to our data. In this case we will create a two-step pipeline: the first step categorizes the object type features (a requirement for DummyEncoder) and the second step applies the encoder. We use a pipeline instead of doing both of these steps individually because it chains them together, so both are applied in order with a single call to fit and transform.
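One way to see why the categorizing step has to come first is to dummy encode two partitions separately. The sketch below uses plain pandas rather than dask_ml, with two made-up partitions of a single feature, so it illustrates the idea rather than what Categorizer and DummyEncoder do internally.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Two made-up partitions of the same feature, each seeing different values
part_a = pd.DataFrame({"Regolith Material": ["1.0", "2.0", "1.0"]})
part_b = pd.DataFrame({"Regolith Material": ["2.0", "3.0"]})

# Encoding each partition on its own gives mismatched columns
naive_a = pd.get_dummies(part_a)
naive_b = pd.get_dummies(part_b)
print(list(naive_a.columns))  # only the 1.0 and 2.0 columns
print(list(naive_b.columns))  # only the 2.0 and 3.0 columns

# Step 1 (categorize): fix the full set of categories up front
all_values = sorted(set(part_a["Regolith Material"]) | set(part_b["Regolith Material"]))
material_dtype = CategoricalDtype(categories=all_values)
cat_a = part_a.astype({"Regolith Material": material_dtype})
cat_b = part_b.astype({"Regolith Material": material_dtype})

# Step 2 (encode): every partition now produces the same columns
encoded_a = pd.get_dummies(cat_a)
encoded_b = pd.get_dummies(cat_b)
print(list(encoded_a.columns) == list(encoded_b.columns))  # True
```

With 1000 partitions, knowing the full list of categories before encoding is what keeps every partition's output lined up under the same column headers.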

In [19]:
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import DummyEncoder, Categorizer

In [20]:
#Making the pipeline
pipe = make_pipeline(Categorizer(), DummyEncoder())

In [21]:
#Fitting the pipeline to the dataframe
pipe.fit(dfn)

Pipeline(memory=None,
         steps=[('categorizer', Categorizer(categories=None, columns=None)),
                ('dummyencoder', DummyEncoder(columns=None, drop_first=False))],
         verbose=False)

In [22]:
#Transforming the dataframe
dft = pipe.transform(dfn)

In [23]:
dft

Dask DataFrame structure: npartitions=1000, 262 columns. The columns are summarised by group below; within each group every code in the range shown is present, and the dummy columns appear in order of first occurrence rather than numeric order.

| Columns | Count | dtype |
|---|---|---|
| Unnamed: 0 | 1 | int64 |
| Arch 200m Fault Buffer, Arch 200m Geology Buffer, Areas of Interest Points, Distance to Craton Scaled, Map Tiles, Mid Meso 200m Fault Buffer, Mid Meso 200m Geology Buffer, Neo-Ord 200m Fault Buffer, Neo-Ord 200m Geology Buffer | 9 | float64 |
| Arch Geology_0.0 to Arch Geology_56.0 | 57 | uint8 |
| Calcareous Induration_0.0, Calcareous Induration_1.0 | 2 | uint8 |
| Dep-Ero-Res_0.0 to Dep-Ero-Res_4.0 | 5 | uint8 |
| Ferruginous Induration_0.0, Ferruginous Induration_1.0 | 2 | uint8 |
| Gypsiferous Induration_0.0, Gypsiferous Induration_1.0 | 2 | uint8 |
| Mid Meso Geology_0.0, Mid Meso Geology_1.0 | 2 | uint8 |
| Mixed Calcareous-Gypsiferous Induration_0.0, Mixed Calcareous-Gypsiferous Induration_1.0 | 2 | uint8 |
| Neo-Ord Geology_0.0 to Neo-Ord Geology_26.0 | 27 | uint8 |
| Regolith Landform_0.0 to Regolith Landform_9.0 | 10 | uint8 |
| Regolith Material_0.0 to Regolith Material_13.0 | 14 | uint8 |
| Siliceous Induration_0.0, Siliceous Induration_1.0 | 2 | uint8 |
| Surface Geology 2M_0.0 to Surface Geology 2M_122.0 | 123 | uint8 |
| Trans-Insitu_0.0 to Trans-Insitu_3.0 | 4 | uint8 |


We now have a DataFrame with all our new dummy encoded features. We have successfully added roughly 240 columns to our DataFrame (252 dummy columns in place of the 13 categorical features), making it even bigger than before, without having to buy a brand new super computer. From here we can go on to further feature engineering, data cleaning, other preprocessing and eventually machine learning if we wanted to.
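To put a number on "even bigger than before", the sketch below tallies the columns and estimates the size of the encoded DataFrame if it were ever held in memory all at once. The per-feature category counts are read off the column listing above, and the byte sizes follow the dtypes shown there (1 byte for uint8, 8 bytes for float64 and int64).

```python
# Number of dummy columns produced per categorical feature (read off the output above)
dummy_counts = {
    "Arch Geology": 57,
    "Calcareous Induration": 2,
    "Dep-Ero-Res": 5,
    "Ferruginous Induration": 2,
    "Gypsiferous Induration": 2,
    "Mid Meso Geology": 2,
    "Mixed Calcareous-Gypsiferous Induration": 2,
    "Neo-Ord Geology": 27,
    "Regolith Landform": 10,
    "Regolith Material": 14,
    "Siliceous Induration": 2,
    "Surface Geology 2M": 123,
    "Trans-Insitu": 4,
}

n_rows = 242_832_800
n_dummy = sum(dummy_counts.values())   # uint8 columns, 1 byte per value
n_numeric = 10                         # 9 float64 features plus the int64 'Unnamed: 0' column
total_columns = n_dummy + n_numeric

dummy_bytes = n_rows * n_dummy * 1
numeric_bytes = n_rows * n_numeric * 8
total_bytes = dummy_bytes + numeric_bytes

print(n_dummy, total_columns)          # 252 262
print(f"{total_bytes / 1e9:.2f} GB")   # 80.62 GB
```

Because the pipeline's output is lazy, this figure is only what the full result would occupy if computed in one go; Dask works through it one partition at a time.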

With Dask we've just collated and run some processes on a large amount of data using nothing but a laptop and a hard drive. I hope this notebook can shed some light on and evoke some ideas of how Dask can be used for handling large datasets. Cheers!