
# Part 2 - Parallel Processing

Let's explore how we can use *parallelism*, both locally and on a cluster, to scale computations to large datasets.

We'll start by looking at how to use multiple cores on a single machine...

## 1.1 Parallel dataframe processing with Dask

The Dask library implements a subset of the pandas API (and of some others, such as NumPy) in a way that can run on multiple CPU threads, and thus on multiple cores. It also supports certain cluster-based computations, although that won't be our focus.

Let's start by installing Dask...

In [1]:
!pip install "dask[complete]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import pandas as pd
import numpy as np

# JSON parsing
import json

import dask
import dask.dataframe as dd

# HTML parsing
from lxml import etree
import urllib

# SQLite RDBMS
import sqlite3

# Time conversions
import time

# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

In [3]:
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1CtSFvqTM-JTxWu7-lfGYba1tLYkcqIZC',
                                    dest_path='/content/linkedin_small.json.txt')


Downloading 1CtSFvqTM-JTxWu7-lfGYba1tLYkcqIZC into /content/linkedin_small.json.txt... Done.


In [4]:
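%%time
# Load 100K LinkedIn records, stored as one JSON object per line
people = []

# The context manager closes the file even if parsing fails
with open('/content/linkedin_small.json.txt') as linked_in:
    for line in linked_in:
        # Each line is parsed as a standalone JSON document
        person = json.loads(line)
        people.append(person)

people_df = pd.DataFrame(people)
print(f"{len(people_df)} records")

people_df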

JSONDecodeError: ignored
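
The notebook does not say why this `JSONDecodeError` occurred. One plausible cause, stated here as an assumption, is that the downloaded file contains lines that are not valid JSON (for example an HTML page returned in place of the data). The sketch below uses only the standard library to report which lines fail to parse; `find_bad_json_lines` and the sample text are hypothetical helpers for illustration, not part of the original notebook.

```python
import io
import json

def find_bad_json_lines(lines, limit=5):
    """Return up to `limit` (line_number, error_message) pairs for unparseable lines."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines carry no record, so skip them
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            bad.append((line_no, err.msg))
            if len(bad) >= limit:
                break
    return bad

# Stand-in for the real file: two valid records, a blank line, and one bad line
sample = io.StringIO(
    '{"_id": "in-00001"}\n'
    '\n'
    '<html>not json</html>\n'
    '{"_id": "in-00006"}\n'
)
problems = find_bad_json_lines(sample)
print(problems)
```

On the real data, the same function could be called on the open file handle. If the very first line is reported, the download itself is the likely culprit rather than an individual record.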

In [None]:
people_df

Unnamed: 0,_id,name,locality,skills,industry,summary,url,also_view,education,group,overview_html,interval,experience,specilities,events,interests,homepage,honors
0,in-00000001,"{'family_name': 'Mazalu MBA', 'given_name': 'D...",United States,"[Key Account Development, Strategic Planning, ...",Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,[{'url': 'http://www.linkedin.com/pub/krisa-dr...,,,,,,,,,,
1,in-00001,"{'family_name': 'Forslund', 'given_name': 'Ann'}","Antwerp Area, Belgium","[Molecular Biology, Biomarkers]",Pharmaceuticals,Ph.D. scientist with background in cancer rese...,http://be.linkedin.com/in/00001,[{'url': 'http://www.linkedin.com/pub/peter-ki...,"[{'start': '2008', 'major': 'Economics', 'end'...","{'affilition': ['ASMALLWORLD.net', 'Biomarker ...","<dl id=""overview""><dt id=""overview-summary-cur...",20.0,"[{'org': 'Johnson and Johnson', 'title': 'Seni...","Biomarkers in Oncology, Cancer Genomics, Molec...","[{'from': 'Sahlgrenska University Hospital', '...",,,
2,in-00006,"{'family_name': 'Douglas', 'given_name': 'Shawn'}","San Francisco, California","[DNA, Nanotechnology, Molecular Biology, Softw...",Research,I am interested in inventing new methods to co...,http://www.linkedin.com/in/00006,[{'url': 'http://www.linkedin.com/pub/george-c...,"[{'major': 'Biophysics', 'end': '2009', 'name'...",,"<dl id=""overview""><dt id=""overview-summary-cur...",0.0,"[{'org': 'UCSF', 'title': 'Assistant Professor...",,[{'from': 'Wyss Institute for Biologically Ins...,"personal genomics, nanotechnology","{'BIOMOD': ['http://biomod.net/'], 'Company We...",
3,in-000montgomery,"{'family_name': 'Kilimann', 'given_name': 'Edr...",San Francisco Bay Area,,Information Technology and Services,OBJECTIVE<Primary> Work on an interesting and ...,http://www.linkedin.com/in/000montgomery,[{'url': 'http://www.linkedin.com/pub/david-br...,,"{'affilition': ['Big Data, Low Latency', 'Expe...",,5.0,"[{'org': '<Online Recruiting Company>', 'desc'...",,"[{'from': '<Employee Benefits, Administration ...",,,
4,in-000vijaychauhan,"{'family_name': 'Chauhan, PMP', 'given_name': ...","Chennai Area, India","[Program Management, French, Avionics, Embedde...",Aviation & Aerospace,"Experience in Avionics Systems, Embedded Syste...",http://in.linkedin.com/in/000vijaychauhan,[{'url': 'http://in.linkedin.com/in/sandeeprag...,"[{'start': '1988', 'end': '1989', 'name': 'Eco...",{'member': 'Member of Project Management Insti...,,,,,,"Literature, Philosophy, Music",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,in-dorothyballarini,"{'family_name': 'Ballarini', 'given_name': 'Do...","London, United Kingdom","[Zbrush, 3D Studio Max, Concept Design, Charac...",Motion Pictures and Film,I have 10 years of experience working with the...,http://uk.linkedin.com/in/dorothyballarini,[{'url': 'http://es.linkedin.com/pub/rebeca-pu...,"[{'start': '2007', 'major': 'Design', 'end': '...","{'affilition': ['3D Animation, VFX & Games', '...",,6.0,"[{'org': 'ZOO', 'title': 'Freelancer 3D artist...",,[{'from': 'Universidade Federal do Rio Grande ...,"Arts, Computer Graphics, Cinema, Video Games",,
99996,in-dorothybarnettgrimes,"{'family_name': 'Grimes', 'given_name': 'Dorot...","Houston, Texas Area","[Internal Controls, Revenue Recognition, Sarba...",Oil & Energy,• Results driven finance leader and business p...,http://www.linkedin.com/in/dorothybarnettgrimes,[{'url': 'http://www.linkedin.com/pub/ken-greg...,"[{'major': 'Business', 'name': 'Santa Clara Un...",,,25.0,"[{'org': 'Spectrum ASA', 'title': 'Multi-Clien...",,"[{'from': 'TanThap Inc and Digitan Systems', '...",,,[Certified Public Accountant - State of Texas]
99997,in-dorothycarroll,"{'family_name': 'Huffman', 'given_name': 'Doro...","Richmond, Virginia Area",,Information Technology and Services,•Over 6 years experience in all phases of soft...,http://www.linkedin.com/in/dorothycarroll,[{'url': 'http://www.linkedin.com/pub/kim-cava...,"[{'major': 'Religion', 'end': '2009', 'name': ...",{'member': 'CERT (Community Emergency Response...,,31.0,"[{'org': 'Estes Express Lines', 'title': 'QA L...",,"[{'from': 'Circuit City', 'to': 'Circuit City'...",,,
99998,in-dorothyczudziak,"{'family_name': 'Czudziak', 'given_name': 'Dor...",Greater New York City Area,,Entertainment,,http://www.linkedin.com/in/dorothyczudziak,[{'url': 'http://www.linkedin.com/pub/lindsay-...,"[{'start': '1998', 'end': '2004', 'name': 'Que...",,,0.0,"[{'start': 'January 2000', 'desc': 'I'm a full...",,[],,,


In [None]:
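# One row per (person, skill) and per (person, education entry)
skills_df = people_df[['_id', 'skills']].explode('skills')
education_df = people_df[['_id', 'education']].explode('education')
# Experience and honors keep one list-valued row per person
experience_df = people_df[['_id', 'experience']]
honors_df = people_df[['_id', 'honors']]

# Remaining per-person attributes; drop() already returns a new DataFrame
linkedin_df = people_df.drop(columns=['skills', 'education', 'experience', 'honors'])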

In [None]:
linkedin_df

Unnamed: 0,_id,name,locality,industry,summary,url,also_view,group,overview_html,interval,specilities,events,interests,homepage
0,in-00000001,"{'family_name': 'Mazalu MBA', 'given_name': 'D...",United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,[{'url': 'http://www.linkedin.com/pub/krisa-dr...,,,,,,,
1,in-00001,"{'family_name': 'Forslund', 'given_name': 'Ann'}","Antwerp Area, Belgium",Pharmaceuticals,Ph.D. scientist with background in cancer rese...,http://be.linkedin.com/in/00001,[{'url': 'http://www.linkedin.com/pub/peter-ki...,"{'affilition': ['ASMALLWORLD.net', 'Biomarker ...","<dl id=""overview""><dt id=""overview-summary-cur...",20.0,"Biomarkers in Oncology, Cancer Genomics, Molec...","[{'from': 'Sahlgrenska University Hospital', '...",,
2,in-00006,"{'family_name': 'Douglas', 'given_name': 'Shawn'}","San Francisco, California",Research,I am interested in inventing new methods to co...,http://www.linkedin.com/in/00006,[{'url': 'http://www.linkedin.com/pub/george-c...,,"<dl id=""overview""><dt id=""overview-summary-cur...",0.0,,[{'from': 'Wyss Institute for Biologically Ins...,"personal genomics, nanotechnology","{'BIOMOD': ['http://biomod.net/'], 'Company We..."
3,in-000montgomery,"{'family_name': 'Kilimann', 'given_name': 'Edr...",San Francisco Bay Area,Information Technology and Services,OBJECTIVE<Primary> Work on an interesting and ...,http://www.linkedin.com/in/000montgomery,[{'url': 'http://www.linkedin.com/pub/david-br...,"{'affilition': ['Big Data, Low Latency', 'Expe...",,5.0,,"[{'from': '<Employee Benefits, Administration ...",,
4,in-000vijaychauhan,"{'family_name': 'Chauhan, PMP', 'given_name': ...","Chennai Area, India",Aviation & Aerospace,"Experience in Avionics Systems, Embedded Syste...",http://in.linkedin.com/in/000vijaychauhan,[{'url': 'http://in.linkedin.com/in/sandeeprag...,{'member': 'Member of Project Management Insti...,,,,,"Literature, Philosophy, Music",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,in-dorothyballarini,"{'family_name': 'Ballarini', 'given_name': 'Do...","London, United Kingdom",Motion Pictures and Film,I have 10 years of experience working with the...,http://uk.linkedin.com/in/dorothyballarini,[{'url': 'http://es.linkedin.com/pub/rebeca-pu...,"{'affilition': ['3D Animation, VFX & Games', '...",,6.0,,[{'from': 'Universidade Federal do Rio Grande ...,"Arts, Computer Graphics, Cinema, Video Games",
99996,in-dorothybarnettgrimes,"{'family_name': 'Grimes', 'given_name': 'Dorot...","Houston, Texas Area",Oil & Energy,• Results driven finance leader and business p...,http://www.linkedin.com/in/dorothybarnettgrimes,[{'url': 'http://www.linkedin.com/pub/ken-greg...,,,25.0,,"[{'from': 'TanThap Inc and Digitan Systems', '...",,
99997,in-dorothycarroll,"{'family_name': 'Huffman', 'given_name': 'Doro...","Richmond, Virginia Area",Information Technology and Services,•Over 6 years experience in all phases of soft...,http://www.linkedin.com/in/dorothycarroll,[{'url': 'http://www.linkedin.com/pub/kim-cava...,{'member': 'CERT (Community Emergency Response...,,31.0,,"[{'from': 'Circuit City', 'to': 'Circuit City'...",,
99998,in-dorothyczudziak,"{'family_name': 'Czudziak', 'given_name': 'Dor...",Greater New York City Area,Entertainment,,http://www.linkedin.com/in/dorothyczudziak,[{'url': 'http://www.linkedin.com/pub/lindsay-...,,,0.0,,[],,


In [None]:
%%time

(linkedin_df
    .merge(experience_df, on='_id')
    .merge(skills_df, on='_id')
    .merge(honors_df, on='_id')
    .merge(education_df, on='_id'))

CPU times: user 2.99 s, sys: 281 ms, total: 3.28 s
Wall time: 3.3 s


Unnamed: 0,_id,name,locality,industry,summary,url,also_view,group,overview_html,interval,specilities,events,interests,homepage,experience,skills,honors,education
0,in-00000001,"{'family_name': 'Mazalu MBA', 'given_name': 'D...",United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,[{'url': 'http://www.linkedin.com/pub/krisa-dr...,,,,,,,,,Key Account Development,,
1,in-00000001,"{'family_name': 'Mazalu MBA', 'given_name': 'D...",United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,[{'url': 'http://www.linkedin.com/pub/krisa-dr...,,,,,,,,,Strategic Planning,,
2,in-00000001,"{'family_name': 'Mazalu MBA', 'given_name': 'D...",United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,[{'url': 'http://www.linkedin.com/pub/krisa-dr...,,,,,,,,,Market Planning,,
3,in-00000001,"{'family_name': 'Mazalu MBA', 'given_name': 'D...",United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,[{'url': 'http://www.linkedin.com/pub/krisa-dr...,,,,,,,,,Team Leadership,,
4,in-00000001,"{'family_name': 'Mazalu MBA', 'given_name': 'D...",United States,Medical Devices,SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJ...,http://www.linkedin.com/in/00000001,[{'url': 'http://www.linkedin.com/pub/krisa-dr...,,,,,,,,,Negotiation,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2340598,in-dorothydalton,"{'family_name': 'Dalton', 'given_name': 'Dorot...","Brussels Area, Belgium",Human Resources,,http://be.linkedin.com/in/dorothydalton,[],,,,,,,,,Staff Development,,
2340599,in-dorothydalton,"{'family_name': 'Dalton', 'given_name': 'Dorot...","Brussels Area, Belgium",Human Resources,,http://be.linkedin.com/in/dorothydalton,[],,,,,,,,,Employee Wellness,,
2340600,in-dorothydalton,"{'family_name': 'Dalton', 'given_name': 'Dorot...","Brussels Area, Belgium",Human Resources,,http://be.linkedin.com/in/dorothydalton,[],,,,,,,,,Personnel Management,,
2340601,in-dorothydalton,"{'family_name': 'Dalton', 'given_name': 'Dorot...","Brussels Area, Belgium",Human Resources,,http://be.linkedin.com/in/dorothydalton,[],,,,,,,,,Sourcing,,


In [None]:
# Wrap each pandas DataFrame as a Dask DataFrame split into row partitions
linkedin_ddf = dd.from_pandas(linkedin_df, npartitions=100)
skills_ddf = dd.from_pandas(skills_df, npartitions=100)
experience_ddf = dd.from_pandas(experience_df, npartitions=100)
education_ddf = dd.from_pandas(education_df, npartitions=100)
honors_ddf = dd.from_pandas(honors_df, npartitions=10)

In [None]:
%%time
# Dask is lazy: this only builds a task graph; .compute() would run the merge
(linkedin_ddf
    .merge(experience_ddf, on='_id')
    .merge(skills_ddf, on='_id')
    .merge(honors_ddf, on='_id')
    .merge(education_ddf, on='_id'))

CPU times: user 162 ms, sys: 3.8 ms, total: 166 ms
Wall time: 187 ms


Unnamed: 0_level_0,_id,name,locality,industry,summary,url,also_view,group,overview_html,interval,specilities,events,interests,homepage,experience,skills,honors,education
npartitions=100,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
,object,object,object,object,object,object,object,object,object,float64,object,object,object,object,object,object,object,object
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
