
In this notebook I will go over a few examples to show that if you know pandas, you already know some Dask. As mentioned in previous notebooks, the advantage of Dask is that it lets us parallelize operations, so multiple parts of a computation can run at the same time. Much of Dask's syntax mirrors pandas, as we will soon see. We will be working with the [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud) dataset from Kaggle, which contains financial transactions, each labeled as fraud or not. The dataset contains principal components of the original data rather than the raw features.

In [None]:
# Installations

!pip install dask[complete] --quiet
!pip install dask distributed --upgrade --quiet
!pip install aiohttp --quiet
!pip install dask-ml --quiet

In [2]:
# Import Dask Data Frames
import dask.dataframe as dd

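Now we want to load the data into a Dask DataFrame. There is one thing worth looking out for. Because Dask evaluates lazily, `read_csv()` does not scan the whole file up front; instead it infers each column's data type from a sample taken from the start of the file. This can be problematic: if values further down do not fit the inferred type (for example, a column that looks like integers in the sample but contains decimals or missing values later), the computation fails with a dtype mismatch error.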

If you encounter this situation, specify the column types manually with the `dtype` parameter. In some situations you may also wish to set the `assume_missing` parameter of `read_csv()` to `True`; this tells Dask to assume that all integer columns not specified in `dtype` contain missing values, so they are read as floats.

In [4]:
# Load in the data
df = dd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/creditcard.csv', dtype={'Time': 'float64'})

We will be answering 3 simple questions: 
1. How many transactions do we have in total?
2. How many transactions are fraud and how many are not?
3. What is the maximum amount in fraud transactions?

These questions are simple, and if you are familiar with pandas you will be able to answer them with Dask.

In [8]:
# Question 1
print("We have", len(df), "transactions")

We have 284807 transactions


In [9]:
# Question 2
df.groupby("Class")["Time"].count().compute()

Class
0    284315
1       492
Name: Time, dtype: int64

Of our 284807 transactions, 492 are fraudulent.

In [11]:
# Question 3
df[df.Class == 1]['Amount'].max().compute()

2125.87