# Data Science at Scale

## What do we mean by "scale"?

* Scale is determined by
    * Size of data
    * Capacity of hardware

## Big Data is

* data you can't open in Excel
* data you can't fit in RAM
* data you can't fit on a single machine

## A data scientist operates on many scales

* Can't open in Excel $\rightarrow$ use `Pandas` and chunking
* Can't fit in RAM $\rightarrow$ use a database or stream the file
* Can't fit on a single machine $\rightarrow$ use Hadoop and `PySpark`
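
As a rough sketch of the first strategy, `pd.read_csv` accepts a `chunksize` argument and then yields the file one piece at a time. The toy rows and the helper name `chunked_mean_height` below are made up for illustration; only running sums and counts are kept in memory.

```python
import io
import pandas as pd

# Toy stand-in for a CSV that is too large to load at once (made-up rows).
csv_text = """name,Publisher,Height
A,Marvel Comics,180
B,DC Comics,170
C,Marvel Comics,200
D,DC Comics,190
E,Image Comics,175
"""

def chunked_mean_height(source, chunksize=2):
    """Mean Height per Publisher, reading `chunksize` rows at a time."""
    totals = {}  # publisher -> [sum of heights, number of non-missing heights]
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Summarise this chunk only, then fold it into the running totals.
        partial = chunk.groupby("Publisher")["Height"].agg(["sum", "count"])
        for publisher, row in partial.iterrows():
            acc = totals.setdefault(publisher, [0.0, 0])
            acc[0] += row["sum"]
            acc[1] += row["count"]
    return {pub: s / n for pub, (s, n) in totals.items()}

means = chunked_mean_height(io.StringIO(csv_text))
print(means)
```

A mean cannot be averaged chunk by chunk directly, which is why the sketch carries sums and counts and divides only at the end.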

## Example - Average Super Hero Height - Pandas

In [6]:
# !pip install dfply

In [10]:
import pandas as pd
from dfply import *

heroes = pd.read_csv('./data/heroes_information.csv')
major_publisher = ['Marvel Comics', 'DC Comics']

# Keep the two major publishers, then count heroes and average height/weight per publisher
(heroes >>
   filter_by(X.Publisher.isin(major_publisher)) >>
   group_by(X.Publisher) >>
   summarise(N_heroes=n(X.Height),
             mean_height=mean(X.Height),
             mean_weight=mean(X.Weight)))

|   | Publisher     | N_heroes | mean_height | mean_weight |
|---|---------------|----------|-------------|-------------|
| 0 | DC Comics     | 215      | 91.072093   | 36.148837   |
| 1 | Marvel Comics | 388      | 142.756443  | 78.850515   |


## Example - Average Super Hero Height - `sqlalchemy`

In [8]:
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine, func
from heroes import Base, Hero

engine = create_engine('sqlite:///heroes.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()


# `major_publisher` is the list defined in the Pandas example
(session.query(Hero.publisher, func.avg(Hero.height).label('avg_ht'))
   .filter(Hero.publisher.in_(major_publisher))
   .group_by(Hero.publisher)
   .all())

[('DC Comics', 91.07209302325582), ('Marvel Comics', 142.75644329896906)]
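
The ORM query corresponds roughly to a `SELECT ... WHERE ... IN ... GROUP BY` statement. The sketch below runs that kind of statement with the standard-library `sqlite3` module against an in-memory database. The table name `heroes`, its columns, and the rows are assumptions made for illustration; the real schema lives in the `heroes` module, which is not shown here.

```python
import sqlite3

# In-memory database with a hypothetical `heroes` table and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heroes (name TEXT, publisher TEXT, height REAL)")
conn.executemany(
    "INSERT INTO heroes VALUES (?, ?, ?)",
    [
        ("A", "Marvel Comics", 180.0),
        ("B", "DC Comics", 170.0),
        ("C", "Marvel Comics", 200.0),
        ("D", "DC Comics", 190.0),
        ("E", "Image Comics", 175.0),
    ],
)

major_publisher = ["Marvel Comics", "DC Comics"]
# One "?" placeholder per publisher keeps the query parameterised.
placeholders = ", ".join("?" for _ in major_publisher)
query = f"""
    SELECT publisher, AVG(height) AS avg_ht
    FROM heroes
    WHERE publisher IN ({placeholders})
    GROUP BY publisher
    ORDER BY publisher
"""
rows = conn.execute(query, major_publisher).fetchall()
print(rows)
conn.close()
```

The database does the filtering and averaging, so only one row per publisher comes back to Python.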

## Example - Average Super Hero Height - `pyspark`

In [12]:
# !pip install pyspark

In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark1 = SparkSession.builder.appName('Ops').getOrCreate()
df = spark1.read.csv('data/heroes_information.csv', inferSchema=True, header=True)

# Same filter, group, aggregate steps; `major_publisher` comes from the Pandas example
(df.where(col('Publisher').isin(major_publisher))
   .groupBy("Publisher")
   .agg(mean('Height'))
   .show())

+-------------+------------------+
|    Publisher|       avg(Height)|
+-------------+------------------+
|Marvel Comics|142.75644329896906|
|    DC Comics| 91.07209302325582|
+-------------+------------------+



## <font color="red"> Exercise 1: Compare and Contrast </font>

<img src="img/all_three_1.png" width=600>

Your thoughts here

## Filter using in/isin

<img src="img/all_three_2.png" width=600>

## Group by publisher

<img src="img/all_three_3.png" width=600>

## Aggregate the mean height

<img src="img/all_three_4.png" width=500>
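
All three libraries express the same pattern: filter, group, aggregate. As a library-free sketch of that pattern, here it is in plain Python on made-up tuples (`heroes_toy` is hypothetical data, not the course file).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (name, publisher, height) records.
heroes_toy = [
    ("A", "Marvel Comics", 180.0),
    ("B", "DC Comics", 170.0),
    ("C", "Marvel Comics", 200.0),
    ("D", "DC Comics", 190.0),
    ("E", "Image Comics", 175.0),
]
major_publisher = {"Marvel Comics", "DC Comics"}

# 1. Filter: keep heroes from the major publishers.
kept = [hero for hero in heroes_toy if hero[1] in major_publisher]

# 2. Group: collect heights under each publisher.
groups = defaultdict(list)
for _, publisher, height in kept:
    groups[publisher].append(height)

# 3. Aggregate: reduce each group to its mean height.
mean_height = {publisher: mean(heights) for publisher, heights in groups.items()}
print(mean_height)
```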

## Course outline

* Part 1 - Working with Tabular Data

* Part 2 - Working with Unstructured Data


## Part 1 - Working with Tabular Data

* Cleaning and prepping data in `Pandas` (2-3 weeks)
* `SQLAlchemy` (2 weeks)
* Spark SQL (3 weeks)

## Part 2 - Working with Unstructured Data

* Introduction to functional list processing (3 weeks)
* Processing Unstructured Data with Spark
* Project