<a href="https://colab.research.google.com/github/SurajKande/Pipelining/blob/master/parallel_computing_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [48]:
# to connect gdrive to colab for importing dataset
from google.colab import drive
drive.mount('/content/gdrive', force_remount = True)

Mounted at /content/gdrive


In [0]:
import pandas as pd
athlete_events = pd.read_csv('/content/gdrive/My Drive/datasets/athlete_events.csv')

In [3]:
athlete_events.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [0]:
from multiprocessing import Pool
def take_mean_age(year_and_group):
  year, group = year_and_group
  return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

In [0]:
#decorator function 
import time
from functools import wraps
def print_timing(func):
    @wraps(func) 
    def wrapper(*arg):
        start = time.perf_counter()
        result = func(*arg)
        end = time.perf_counter()
        fs = '{} processing time {:.3f}'
        print(fs.format(func.__name__, (end - start)*1000))
        #return result
    return wrapper

In [0]:
# Function to apply a function over multiple cores
@print_timing
def parallel_apply(apply_func, groups, nb_cores):
    with Pool(nb_cores) as p:
        results = p.map(apply_func, groups)
    return pd.concat(results)

In [44]:
# Parallel apply using 1 core
parallel_apply(take_mean_age, athlete_events.groupby('Year'), 1)

parallel_apply processing time 596.610


In [39]:
# Parallel apply using 2 cores
parallel_apply(take_mean_age, athlete_events.groupby('Year'), 2)

parallel_apply processing time 595.000


In [42]:
# Parallel apply using 4 cores
parallel_apply(take_mean_age, athlete_events.groupby('Year'), 4)

parallel_apply processing time 594.408


It's essential to understand this on a lower level, but in reality, you'll never use this kind of APIs. A more convenient way to parallelize an apply over several groups is using the dask framework and its abstraction of the pandas DataFrame

In [47]:
import dask.dataframe as dd

# Set the number of pratitions
athlete_events_dask = dd.from_pandas(athlete_events, npartitions = 4)

# Calculate the mean Age per Year
print(athlete_events_dask.groupby('Year').Age.mean().compute())

Year
1896    23.580645
1900    29.034031
1904    26.698150
1906    27.125253
1908    26.970228
1912    27.538620
1920    29.290978
1924    28.373325
1928    29.112557
1932    32.582080
1936    27.530328
1948    28.783947
1952    26.161546
1956    25.926674
1960    25.168848
1964    24.944397
1968    24.248046
1972    24.308607
1976    23.841818
1980    23.694743
1984    23.898347
1988    24.079432
1992    24.318895
1994    24.422103
1996    24.915045
1998    25.163197
2000    25.422504
2002    25.916281
2004    25.639515
2006    25.959151
2008    25.734118
2010    26.124262
2012    25.961378
2014    25.987324
2016    26.207919
Name: Age, dtype: float64
