# Mining Association Rules in Distributed Databases

Matheus Schmitz  
<a href="https://www.linkedin.com/in/matheusschmitz/">LinkedIn</a>  
<a href="https://matheus-schmitz.github.io/">GitHub Portfolio</a>  

## Problem Statement

My intent for this project is to use Spark (this time in combination with R) to mine association rules. This of course requires tools suited to Big Data, so I chose Spark, which allows large datasets to be manipulated while distributed across multiple computing nodes.

Big data presents certain hindrances to neural networks and other gradient descent learning approaches, while being friendlier to techniques that are more easily parallelized, such as those employed when mining association rules. Such techniques are therefore commonly used to analyse very large datasets, and practising them is my goal here. Among the multiple algorithms available, I'll focus on one that is widely regarded as being among the best: the Frequent Pattern Growth (FP-Growth) algorithm.
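To make the idea of an association rule concrete, here is a minimal pure-Python sketch of the two metrics usually used to rank rules, support and confidence. The transactions and item names are made up for illustration and are not taken from the crime data. This is also not FP-Growth itself, which finds frequent itemsets far more efficiently than this brute-force counting.

```python
# Toy transactions (hypothetical values, NOT from the UK crime dataset)
transactions = [
    {"burglary", "night", "urban"},
    {"burglary", "night", "rural"},
    {"shoplifting", "day", "urban"},
    {"burglary", "day", "urban"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule under test: {burglary} -> {night}
rule_support = support({"burglary", "night"}, transactions)
rule_confidence = confidence({"burglary"}, {"night"}, transactions)
print(rule_support, rule_confidence)
```

A rule is typically kept only if both values clear user-chosen minimum thresholds, which is why picking those thresholds matters so much on large datasets.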

For this project I'll be using crime data available from the UK Police's Open Data Portal, which contains a variety of records on all registered crimes. The data is available from 2014 onwards, but I've chosen to work with two years of data, from January 2019 to December 2020, which allows for a control group and a test group for exploring the impact of COVID-19 on crime patterns.

**Data Source:** https://data.police.uk/data/

## Dataset Creation

In [1]:
import glob
import pandas as pd
from tqdm import tqdm

# Before starting the work in R, I need to run a Python script to concatenate all CSVs
dfs = glob.glob('**/*.csv', recursive=True)
print(f'Number of files found: {len(dfs)}')

Number of files found: 1061


There are 44 or 45 CSV files per month (one per region). Over 24 months, the expected number of files is therefore between 1056 and 1080, so it seems we got them all!
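The arithmetic behind that range can be written as a quick sanity check. The variable names here are illustrative, and the file count is simply copied from the output above.

```python
months = 24                 # January 2019 to December 2020
files_per_month = (44, 45)  # one CSV per region, as stated above
found = 1061                # count printed by the glob cell

expected_min = months * min(files_per_month)
expected_max = months * max(files_per_month)

# True when the number of files found is within the expected range
count_is_plausible = expected_min <= found <= expected_max
print(expected_min, expected_max, count_is_plausible)
```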

In [2]:
# Now concatenate all CSVs into a single dataframe
result = pd.concat([pd.read_csv(df) for df in tqdm(dfs)], ignore_index=True)
result.shape

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1061/1061 [02:18<00:00,  7.66it/s]


(12880086, 12)

In [3]:
# Then save the dataframe to disk
result.to_csv('data/UK_Crime.csv', index=False)

In [4]:
# Since the base dataset turned out to be so massive, I'll also create a downsampled version to ensure I can run it
result_downsampled = result.sample(frac=0.001)
result_downsampled.to_csv('data/UK_Crime_downsampled.csv', index=False)
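One caveat with the cell above is that `sample` draws a different subset on every run. As a hedged sketch (not what was run here), passing `random_state` to `DataFrame.sample` makes the downsample reproducible. The small frame and the seed value 42 below are assumptions for illustration only.

```python
import pandas as pd

# Stand-in for the full crime dataframe (hypothetical data, 1000 rows)
demo = pd.DataFrame({"crime_id": range(1000)})

# The same seed yields the same rows on every run; 42 is an arbitrary choice
sample_a = demo.sample(frac=0.01, random_state=42)
sample_b = demo.sample(frac=0.01, random_state=42)

print(len(sample_a), sample_a.equals(sample_b))
```

With a fixed seed, results computed on the downsampled file can be compared across reruns of the notebook.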

https://stackoverflow.com/questions/59552212/choosing-support-and-confidence-values-with-ml-fpgrowth-in-sparklyr