# 1. Parallel Processing

Let's explore how we can use *parallelism* on a single machine to scale up the processing of big data.

We'll start by looking at how to make use of multiple cores.

## 1.1 Parallel dataframe processing with Dask

The Dask library implements a subset of the pandas API (and some others, such as NumPy) in a way that can run across multiple CPU threads, and thus on multiple cores. It also supports cluster-based computation, although that won't be our focus.
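
To make the idea of partitioned, multi-threaded work concrete, here is a minimal sketch that uses only the Python standard library. It is not how Dask is implemented internally. The helper names `split_into_partitions` and `count_long_names` and the toy list of names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, npartitions):
    """Split a list into contiguous chunks of near-equal size."""
    # Ceiling division, with a floor of 1 so an empty list does not break range()
    chunk = max(1, -(-len(rows) // npartitions))
    return [rows[i:i + chunk] for i in range(0, len(rows), chunk)]


def count_long_names(partition):
    """Per-partition work: count names longer than three characters."""
    return sum(1 for name in partition if len(name) > 3)


# Hypothetical toy data, not taken from the LinkedIn file
names = ['Ann', 'Robert', 'Cy', 'Dana', 'Eve', 'Frank']
partitions = split_into_partitions(names, npartitions=3)

# Each partition is handed to a worker thread, then the partial results are combined
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_long_names, partitions))

total = sum(partial_counts)
print(partitions)
print(partial_counts, total)
```

In CPython, pure-Python functions like this one are largely serialized by the global interpreter lock, so the sketch shows the structure rather than a real speedup. The approach pays off when the per-partition work releases the lock, as many NumPy and pandas operations do.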

Let's start by installing Dask...

In [None]:
# Quote the requirement so the shell does not treat the brackets as a glob pattern
!pip install "dask[complete]"

In [None]:
import pandas as pd
import numpy as np

# JSON parsing
import json

import dask
import dask.dataframe as dd

# HTML parsing
from lxml import etree
import urllib

# SQLite RDBMS
import sqlite3

# Time conversions
import time

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

In [None]:
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1CtSFvqTM-JTxWu7-lfGYba1tLYkcqIZC',
                                    dest_path='/content/linkedin_small.json.txt')


In [None]:
%%time
# 100K records from LinkedIn, one JSON object per line
people = []

# The context manager closes the file once parsing is done
with open('/content/linkedin_small.json.txt') as linked_in:
    for line in linked_in:
        person = json.loads(line)
        people.append(person)

people_df = pd.DataFrame(people)
print("%d records" % len(people_df))

people_df

In [None]:
people_df

In [None]:
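# Split the nested columns out into their own tables, keyed by _id.
# explode() gives each element of a list-valued cell its own row.
skills_df = people_df[['_id', 'skills']].explode('skills')
education_df = people_df[['_id', 'education']].explode('education')
# experience and honors are kept as they are (not exploded)
experience_df = people_df[['_id', 'experience']]
honors_df = people_df[['_id', 'honors']]

# drop() returns a new DataFrame, so people_df itself is left unchanged
linkedin_df = people_df.drop(columns=['skills', 'education', 'experience', 'honors'])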
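
The effect of `explode` in the cell above is easiest to see on a tiny table. The sketch below uses a hypothetical two-row DataFrame (`toy_df`, not taken from the LinkedIn file) and assumes, as the cell above does, that the `skills` column holds lists.

```python
import pandas as pd

# Hypothetical toy data standing in for people_df
toy_df = pd.DataFrame({
    '_id': ['a', 'b'],
    'skills': [['python', 'sql'], ['java']],
})

# Each list element becomes its own row; _id and the original index are repeated
toy_skills_df = toy_df[['_id', 'skills']].explode('skills')
print(toy_skills_df)
```

Person `a` has two skills and so appears in two rows, which is why joining the exploded tables back onto the main table multiplies the row count.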

In [None]:
linkedin_df

In [None]:
%%time

(linkedin_df
    .merge(experience_df, on='_id')
    .merge(skills_df, on='_id')
    .merge(honors_df, on='_id')
    .merge(education_df, on='_id'))

In [None]:
# Wrap each pandas DataFrame in a Dask DataFrame split into row partitions
linkedin_ddf = dd.from_pandas(linkedin_df, npartitions=100)
skills_ddf = dd.from_pandas(skills_df, npartitions=100)
experience_ddf = dd.from_pandas(experience_df, npartitions=100)
education_ddf = dd.from_pandas(education_df, npartitions=100)
honors_ddf = dd.from_pandas(honors_df, npartitions=10)

In [None]:
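%%time
# Dask is lazy: this builds the task graph for the merges but does not run them.
# Call .compute() on the result to execute the work and get a pandas DataFrame.
(linkedin_ddf
    .merge(experience_ddf, on='_id')
    .merge(skills_ddf, on='_id')
    .merge(honors_ddf, on='_id')
    .merge(education_ddf, on='_id'))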