# Big Data & Data Engineering

# Big data 

 * Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations

!['bigdata'](big_data.png)

# Data Engineering 

* Data engineers build pipelines that prepare and transform data.
* Data engineers set up and maintain the data infrastructure.

 

# Data workflow 

!['workflow'](dataengworkflow.png)

# Introduction to the data pipeline and an overview of big data architecture 

# Perspective

### Three stakeholders are involved in building data analytics or machine learning applications:

* Data scientists

  The aim is to find the most robust and computationally least expensive model for a given problem using the available data.

* Engineers

  The aim is to build things that others can depend on: to innovate either by building new things or by finding better ways to build existing things, so that they function 24x7 without much human intervention.

* Business managers

  The aim is to deliver value to customers; science and engineering are means to that end.

### Desired engineering characteristics of a data pipeline are:

* Accessibility: 
    
  Data being easily accessible to data scientists for hypothesis evaluation and model experimentation, preferably through a query language.

* Scalability: 
    
  The ability to scale as the amount of ingested data increases, while keeping the cost low.

* Efficiency: 

  Data and machine learning results being ready within the specified latency to meet the business objectives.

* Monitoring:
    
  Automatic alerts about the health of the data and the pipeline, needed for a proactive response to potential business risks.
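The monitoring characteristic can be made concrete with a small data-health check. The sketch below is a minimal illustration in plain Python, not a production monitor; the function name `check_batch_health`, the thresholds, and the record fields are all assumptions introduced for this example.

```python
from datetime import datetime, timedelta, timezone


def check_batch_health(records, required_fields, now,
                       max_null_rate=0.1, max_age=timedelta(hours=1)):
    """Return a list of alert strings for a batch of dict records.

    All names and thresholds here are hypothetical, chosen for illustration.
    """
    alerts = []
    if not records:
        return ["no records received"]

    # Completeness: share of records missing each required field
    for field in required_fields:
        missing = sum(1 for r in records if r.get(field) is None)
        null_rate = missing / len(records)
        if null_rate > max_null_rate:
            alerts.append(
                f"{field}: null rate {null_rate:.0%} exceeds {max_null_rate:.0%}"
            )

    # Freshness: the newest record must be recent enough
    newest = max(r["event_time"] for r in records)
    if now - newest > max_age:
        alerts.append("data is stale: newest record is older than the allowed age")
    return alerts


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
batch = [
    {"student_id": 1, "marks": 80, "event_time": now - timedelta(minutes=5)},
    {"student_id": 2, "marks": None, "event_time": now - timedelta(minutes=10)},
]
print(check_batch_health(batch, ["student_id", "marks"], now))
```

In a real pipeline the returned alerts would be routed to an alerting channel rather than printed; which channel to use is outside the scope of this sketch.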

# Pipeline

### A data pipeline
A data pipeline stitches together the end-to-end operation: collecting the data, transforming it into insights, training a model, delivering insights, and applying the model whenever and wherever action needs to be taken to achieve the business goal.

### A data pipeline has five stages

#### Collection: 

Data sources (mobile apps, websites, web apps, microservices, IoT devices etc.) are instrumented to collect relevant data.

#### Ingestion: 

The instrumented sources pump the data into various inlet points (HTTP, MQTT, message queue etc.). There can also be jobs to import data from services like Google Analytics. The data can be in two forms: blobs and streams. All this data gets collected into a Data Lake.

#### Preparation: 

This is the extract, transform, load (ETL) operation that cleanses, conforms, shapes, transforms, and catalogs the data blobs and streams in the data lake, making the data ready to consume for ML and storing it in a Data Warehouse.

#### Computation:

This is where analytics, data science and machine learning happen. Computation can be a combination of batch and stream processing. Models and insights (both structured data and streams) are stored back in the Data Warehouse.

#### Presentation: 

The insights are delivered through dashboards, emails, SMSs, push notifications, and microservices. The ML model inferences are exposed as microservices.
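The five stages above can be sketched as a chain of small functions. This is a toy illustration in plain Python under stated assumptions: the event format, the field names, and every function name are hypothetical, and in-memory lists stand in for the data lake and the data warehouse.

```python
import json
from collections import defaultdict

# Collection: hypothetical raw events, as an instrumented app might emit them
raw_events = [
    '{"student_id": 1, "marks": "80"}',
    '{"student_id": 1, "marks": "90"}',
    '{"student_id": 2, "marks": "70"}',
    'not valid json',
]


def ingest(events):
    """Ingestion: land every raw event, unmodified, in a 'data lake' list."""
    return list(events)


def prepare(data_lake):
    """Preparation: parse, cleanse, and type the records (the ETL step)."""
    warehouse = []
    for line in data_lake:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed events
        warehouse.append({"student_id": record["student_id"],
                          "marks": float(record["marks"])})
    return warehouse


def compute(warehouse):
    """Computation: a batch aggregate, here the mean mark per student."""
    marks_by_student = defaultdict(list)
    for row in warehouse:
        marks_by_student[row["student_id"]].append(row["marks"])
    return {sid: sum(m) / len(m) for sid, m in marks_by_student.items()}


def present(insights):
    """Presentation: format the insights for a dashboard or message."""
    return [f"student {sid}: mean mark {mean:.1f}"
            for sid, mean in sorted(insights.items())]


report = present(compute(prepare(ingest(raw_events))))
print(report)
```

Keeping each stage as a separate function mirrors the separation in the pipeline itself: any stage can be replaced (for example, a list by a real data lake) without touching the others.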



# Big Data Architecture: Your choice of the stack on the cloud

### An architecture of the data pipeline using open source technologies

!['OPENSOURCE'](opensource.png)

#### HTTP / MQTT endpoints
Endpoints for ingesting data and for serving results. Several frameworks and technologies are available for this.
#### Pub/Sub message queue
For ingesting high-volume streaming data. Apache Kafka is currently the de facto choice and is proven to scale to high event ingestion rates.
#### Low-cost, high-volume data store
For the data lake (and data warehouse): Hadoop HDFS or cloud blob storage such as AWS S3.
#### Query and catalog infrastructure
For converting a data lake into a data warehouse: Apache Hive is a popular choice for SQL-style queries, and PySpark is an alternative.
#### Map-Reduce batch compute engine
For high-throughput processing, e.g. Hadoop MapReduce or Apache Spark.
#### Stream compute
For latency-sensitive processing, e.g. Apache Storm or Apache Flink. Apache Beam is an emerging choice for writing compute data flows, which can be deployed on a Spark batch runner or a Flink stream runner.
#### Machine learning frameworks
For data science and ML: scikit-learn, TensorFlow, Spark MLlib, and PyTorch are popular choices.
#### Low-latency data stores
For storing the results. There are too many well-established data stores, differing in data type, performance, scale, and cost, to cover here.
#### Deployment orchestration
Options include Hadoop YARN and Kubernetes / Kubeflow.
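The difference between the batch and stream compute engines listed above can be illustrated without any framework. The following is a minimal sketch in plain Python, not Spark or Flink code: the function names and the click-stream events are hypothetical, and it assumes events arrive ordered by timestamp.

```python
from collections import defaultdict


def batch_counts(events):
    """Batch: process the complete, bounded dataset in one pass at the end."""
    counts = defaultdict(int)
    for _, key in events:
        counts[key] += 1
    return dict(counts)


def stream_window_counts(events, window_seconds):
    """Stream: emit counts per tumbling window as soon as each window closes.

    Assumes events arrive ordered by timestamp (no late data).
    """
    current_window = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window = timestamp // window_seconds
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts = defaultdict(int)
        current_window = window
        counts[key] += 1
    if current_window is not None:
        yield current_window * window_seconds, dict(counts)  # flush last window


# (timestamp in seconds, event type): hypothetical click-stream events
events = [(1, "view"), (4, "click"), (12, "view"), (15, "view"), (21, "click")]

print(batch_counts(events))
print(list(stream_window_counts(events, window_seconds=10)))
```

The batch function gives one answer after seeing everything, which favours throughput; the streaming generator yields a partial answer every window, which favours latency.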

### Big Data Architecture: Serverless

Typical serverless architectures of big data pipelines on:
* Amazon Web Services (AWS)
* Microsoft Azure
* Google Cloud Platform (GCP)

### Serverless big data pipeline architecture on Amazon Web Services (AWS)

!['aws'](aws.png)

### Serverless big data pipeline architecture on Microsoft Azure

!['azure'](azure.png)

### Serverless big data pipeline architecture on Google Cloud Platform (GCP)


!['gcp'](gcp.png)

# Large-scale data processing libraries

Example 100 GB dataset (NYC TLC trip record data):
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

* Dask DataFrame: a flexible parallel computing library for analytics.
  https://docs.dask.org/en/latest/dataframe.html

* PySpark: the Python API for Apache Spark, a unified analytics engine for large-scale data processing.
  https://spark.apache.org/docs/latest/api/python/index.html

* Koalas: the pandas API on Apache Spark (included in PySpark as `pyspark.pandas` since Spark 3.2).
  https://koalas.readthedocs.io/en/latest/index.html

* Vaex: a Python library for lazy, out-of-core DataFrames. It also has GPU and Numba support for heavy calculations, which the comparison cited below did not benchmark.
  https://vaex.readthedocs.io/en/latest/

* Turi Create: a relatively little-known machine learning package with its own dataframe structure, SFrame, which qualifies for this list.
  https://github.com/apple/turicreate

* Datatable: the backbone of H2O's Driverless AI. A dataframe package with specific emphasis on speed and big data support on a single node.

* H2O: the standard in-memory dataframe is well-rounded. Still, with the recommendation of a cluster four times the size of the dataset, you need deep pockets to use it for exploration and development.

* cuDF (RAPIDS): a GPU dataframe package, which is an exciting concept. For big data, you must use distributed GPUs with Dask to match your data size, so it suits only very large budgets.

* Modin: a tool to scale pandas without changes to the API, using Dask or Ray in the backend. At the time of the comparison cited below, it could only read a single Parquet file, while the benchmark dataset was already stored as chunked Parquet files. Because the results were expected to be similar to Dask DataFrame, merging all the Parquet files into one was not considered worth pursuing.

Source of this comparison:
https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13
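Several of the libraries above (for example Dask and Vaex) avoid loading a whole dataset into memory by working lazily or in chunks. The sketch below shows that idea with only the Python standard library; it is not how any of those libraries is implemented, and the helper names, the `fare` column, and the tiny in-memory "file" are assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows without materialising all rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(csv_file, column, chunk_size=2):
    """Compute a column mean while holding only one chunk in memory."""
    total, count = 0.0, 0
    reader = csv.DictReader(csv_file)  # reads lazily, line by line
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# A small in-memory stand-in for a file that would not fit in RAM
sample = io.StringIO("trip_id,fare\n1,10.0\n2,20.0\n3,30.0\n4,5.0\n5,15.0\n")
print(chunked_mean(sample, "fare"))
```

Only running totals are kept between chunks, so memory use depends on `chunk_size` rather than on the size of the file.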

# Example big data architectures


### Uber’s Machine Learning Platform

!['uber'](uber.png)

https://eng.uber.com/michelangelo-machine-learning-platform/

### How Spotify generates the ‘Discover Weekly’ personalized playlist

!['spotify'](spotify1.jpg)

### Netflix

!['netflix'](netflix.png)

Other architectures:

https://keen.io/blog/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest/

# ETL 

ETL is a process that extracts the data from different source systems, then transforms the data and finally loads the data into the Data Warehouse system.

!['etl'](etlpic.png)
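As a minimal sketch of the extract, transform, load sequence described above, the example below uses two in-memory SQLite databases from the standard library as stand-ins for a source system and a data warehouse. The table names, columns, and cleaning rules are hypothetical and chosen only for illustration.

```python
import sqlite3

# Source system: an operational database (in-memory SQLite as a stand-in)
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE marks (student_id INTEGER, subject TEXT, marks TEXT)")
source.executemany(
    "INSERT INTO marks VALUES (?, ?, ?)",
    [(1, "math", "80"), (1, "physics", " 90 "), (2, "math", "n/a")],
)

# Target system: the data warehouse (again a stand-in)
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE clean_marks (student_id INTEGER, subject TEXT, marks REAL)"
)


def extract(conn):
    """Extract: pull the raw rows out of the source system."""
    return conn.execute("SELECT student_id, subject, marks FROM marks").fetchall()


def transform(rows):
    """Transform: trim whitespace, cast marks to float, drop unparseable rows."""
    clean = []
    for student_id, subject, marks in rows:
        try:
            clean.append((student_id, subject.upper(), float(marks.strip())))
        except ValueError:
            continue  # skip rows whose marks are not numeric
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO clean_marks VALUES (?, ?, ?)", rows)


load(warehouse, transform(extract(source)))
print(warehouse.execute("SELECT * FROM clean_marks ORDER BY subject").fetchall())
```

The row with `"n/a"` never reaches the warehouse, which is the point of the transform step: the warehouse only holds data that is already typed and clean.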

# Extract

Extract data from a SQL database


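In [None]:
import pandas as pd
import sqlalchemy

# Connection URI format: postgresql://user:password@host:port/database
uri = "postgresql://repl:password@africadataschool:6000/john"
db_engine = sqlalchemy.create_engine(uri)

# Read the whole student table into a pandas DataFrame
student_df = pd.read_sql("SELECT * FROM student", db_engine)
student_df.head()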

Extract data into PySpark

In [None]:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session that uses all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Read a CSV file with a header row and let Spark infer the column types
df = spark.read.csv(
    "dbfs:/FileStore/shared_uploads/africadataschool@outlook.com/student _data.csv",
    header=True,
    inferSchema=True,
)
df.show()

# Transform

Common transform operations:

* Filtering
* Selection and renaming
* Grouping and aggregation
* Joining
* Ordering
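The operations listed above can each be sketched in a line or two of pandas. The sample data, the column names, and the pass mark of 50 are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "name": ["Amina", "Brian", "Chen"]})
marks = pd.DataFrame({
    "student_id": [1, 1, 2, 3, 3],
    "subject": ["math", "physics", "math", "math", "physics"],
    "marks": [80, 90, 45, 70, 60],
})

# Filter: keep only passing marks (the pass mark of 50 is an assumption)
passing = marks[marks["marks"] >= 50]

# Selection and renaming: keep two columns and rename one of them
passing = passing[["student_id", "marks"]].rename(columns={"marks": "score"})

# Grouping and aggregation: mean score per student
mean_scores = passing.groupby("student_id", as_index=False)["score"].mean()

# Joining: attach student names (inner join drops students with no passing marks)
report = students.merge(mean_scores, on="student_id", how="inner")

# Ordering: best students first
report = report.sort_values("score", ascending=False).reset_index(drop=True)
print(report)
```

The same five operations exist in PySpark under similar names (`filter`, `select`, `groupBy`, `join`, `orderBy`), so the structure of a transform step carries over when the data outgrows a single machine.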

In [None]:
# student_df: PySpark DataFrame with one row per student
# marks_df: PySpark DataFrame with one row per mark (student_id, marks)

# Group by student and compute the mean mark
marks_per_student = marks_df.groupBy("student_id").mean("marks")

# Join on student ID; joining by column name keeps a single student_id column
student_with_marks = student_df.join(marks_per_student, on="student_id")

# Load

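In [None]:
# pandas_df is a pandas DataFrame; spark_df is a PySpark DataFrame

# pandas: DataFrame.to_parquet() (writing to s3:// paths requires the s3fs package)
pandas_df.to_parquet("s3://path/to/bucket/student.parquet")

# PySpark: DataFrame.write.parquet()
spark_df.write.parquet("s3://path/to/bucket/customer.parquet")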

In [None]:
# Load into PostgreSQL database
# to_sql is a pandas method, so a PySpark DataFrame must be converted first
marks_df.toPandas().to_sql("marks",
                           db_engine,
                           schema="store",
                           if_exists="replace")

# Building a data pipeline

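In [None]:
# Extract
def extract_table_to_df(tablename, db_engine):
    """Read a whole table into a pandas DataFrame."""
    return pd.read_sql("SELECT * FROM {}".format(tablename), db_engine)

# Transform (one possible join; assumes student_id and marks columns)
def join_table_transform(student_df, marks_df):
    """Attach each student's mean mark to the student table."""
    marks_per_student = marks_df.groupby("student_id", as_index=False)["marks"].mean()
    return student_df.merge(marks_per_student, on="student_id")

# Load
def load_df_into_dwh(df, tablename, schema, db_engine):
    """Write the DataFrame to the data warehouse, replacing any existing table."""
    return df.to_sql(tablename, db_engine, schema=schema, if_exists="replace")

db_engines = { ... } # Needs to be configured

def etl_marks():
    # Extract
    student_df = extract_table_to_df("student", db_engines["store"])
    marks_df = extract_table_to_df("marks", db_engines["store"])
    # Transform
    student_marks_df = join_table_transform(student_df, marks_df)
    # Load ("student_marks" is a hypothetical target table name)
    load_df_into_dwh(student_marks_df, "student_marks", "store", db_engines["dwh"])
    return student_marks_df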

# Airflow

Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.


!['airflow dag'](airdag.png)

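In [None]:
from datetime import datetime

from airflow.models import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

# Run the pipeline every day at midnight (cron syntax)
dag = DAG(dag_id="etl_pipeline",
          schedule_interval="0 0 * * *",
          start_date=datetime(2024, 1, 1),  # placeholder start date
          catchup=False)

etl_task = PythonOperator(task_id="etl_task",
                          python_callable=etl_marks,
                          dag=dag)

def notify_headmaster():
    # Hypothetical placeholder; replace with real email-sending logic
    print("ETL finished")

# Each task in a DAG needs its own unique task_id
send_email_headmaster = PythonOperator(task_id="send_email_headmaster",
                                       python_callable=notify_headmaster,
                                       dag=dag)

# Send the email only after the ETL task has finished
etl_task.set_downstream(send_email_headmaster)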
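A scheduler such as Airflow runs tasks in an order that respects dependencies like the `set_downstream` call above. The sketch below shows only that ordering idea, using `graphlib` from the Python standard library (Python 3.9+); it is not Airflow code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (its upstream tasks).
# Task names are hypothetical.
dependencies = {
    "extract_students": set(),
    "extract_marks": set(),
    "join_tables": {"extract_students", "extract_marks"},
    "load_to_dwh": {"join_tables"},
    "send_email": {"load_to_dwh"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises CycleError if the graph is not acyclic

execution_waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are done
    execution_waves.append(ready)       # tasks in one wave could run in parallel
    sorter.done(*ready)                 # mark them finished

print(execution_waves)
```

The requirement that the graph has no cycles is what the "A" in DAG (directed acyclic graph) refers to: with a cycle, no valid execution order exists.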