# Processing Big Data - Data Ingestion
© Explore Data Science Academy

## Honour Code
I **Daluxolo**, **Mbatha**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).
    Non-compliance with the honour code constitutes a material breach of contract.



## Context 

To work constructively with any dataset, one needs to create an ingestion profile to make sure that the data at the source can be readily consumed. For this section of the predict, as the Data Engineer in the team, you will be required to design and implement the ingestion process. For the purposes of this project, the AWS cloud storage service, S3, will act as your data source. All the data required can be found [here](https://processing-big-data-predict-stocks-data.s3.eu-west-1.amazonaws.com/stocks.zip).

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/DataIngestion.jpg"
     alt="Data Ingestion"
     style="float: center; padding-bottom=0.5em"
     width=40%/>
     <p><em>Figure 1. Data Ingestion</em></p>
</div>

Your manager, Gnissecorp Atadgib, knowing very well that you've recently completed your Data Engineering qualification, asks you to make use of Apache Spark for the ingestion as well as the rest of the project. His rationale is that stock market data is generated every day, is quite time-sensitive, and would require scalability when deployed to a production environment.

## Dataset - US Nasdaq




<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/Nasdaq.png"
     alt="Nasdaq"
     style="float: center; padding-bottom=0.5em"
     width=50%/>
     <p><em>Figure 2. Nasdaq</em></p>
</div>

The data that you will be working with is a historical snapshot of market data taken from the Nasdaq electronic market. This dataset contains historical daily prices for all tickers currently trading on Nasdaq. The up-to-date list can be found on their [website](https://www.nasdaq.com/).


The provided data contains price data dating from 02 January 1962 to 01 April 2020. The data found in the S3 bucket has been stored in the following structure:

```
     stocks/<Year>/<Month>/<Day>/stocks.csv
```
Each CSV file for every trading day contains the following details:
- **Date** - specifies trading date
- **Open** - opening price
- **High** - maximum price during the day
- **Low** - minimum price during the day
- **Close** - close price adjusted for splits
- **Adj Close** - close price adjusted for both dividends and splits
- **Volume** - the number of shares that changed hands during a given day
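The directory layout above maps directly onto a path pattern. As a minimal sketch, assuming zero-padded month and day folders and a hypothetical base directory `root` that contains the unzipped `stocks` folder, the path for one trading day or the glob for a whole year could be built as follows. The helper names `daily_path` and `yearly_glob` are illustrative and not part of the predict.

```python
from datetime import date


def daily_path(root: str, day: date) -> str:
    """Return the CSV path for a single trading day (assumes zero-padded folders)."""
    return f"{root}/stocks/{day:%Y/%m/%d}/stocks.csv"


def yearly_glob(root: str, year: int) -> str:
    """Return a glob that matches every daily CSV file of one year."""
    # The two wildcards stand in for the <Month> and <Day> folders.
    return f"{root}/stocks/{year}/*/*/stocks.csv"


# "/data" is a placeholder base directory, not a path from the predict.
print(daily_path("/data", date(1962, 1, 2)))
print(yearly_glob("/data", 1962))
```

A glob of this form can be handed to a CSV reader that accepts wildcards, so a whole year is read in one call.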

## Basic initialisation
To get you started, let's import some basic Python libraries as well as Spark modules and functions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pyspark import SparkContext
from pyspark.sql import SparkSession
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import *

Remember that we need a `SparkContext` and `SparkSession` to interface with Spark.
We will mostly be using the `SparkContext` to interact with RDDs and the `SparkSession` to interface with Python objects.

> ℹ️ **Instructions** ℹ️
>
>Initialise a new **Spark Context** and **Session** that you will use to interface with Spark.

In [2]:
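#TODO: Write your code here
# Build (or reuse) a local Spark session for this notebook.
spark = SparkSession.builder\
  .config("spark.sql.shuffle.partitions", 5)\
  .config("spark.executor.memory", "8g")\
  .master("local[*]")\
  .appName("Predict")\
  .getOrCreate()

# The session already wraps a SparkContext; expose it for RDD work.
sc = spark.sparkContext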

In [3]:
spark.version

'3.0.0'

## Investigate dataset schema
At this point, it is enough to read in a single file to ascertain the data structure. You will be required to use the information obtained from the small subset to create a data schema. This data schema will be used when reading the entire dataset using Spark.

> ℹ️ **Instructions** ℹ️
>
>Make use of Pandas to read in a single file and investigate the plausible data types to be used when creating a Spark data schema. 
>
>*You may use as many coding cells as necessary.*

In [4]:
#TODO: Write your code here
stocks_raw = pd.read_csv('/Users/daluxolombatha/Desktop/last_dance/stocks/1962/01/02/stocks.csv')
stocks_raw.head()

| | Date | Open | High | Low | Close | Adj Close | Volume | stock |
|---|---|---|---|---|---|---|---|---|
| 0 | 1962-01-02 | 6.532155 | 6.556185 | 6.532155 | 6.532155 | 1.536658 | 55900.0 | AA |
| 1 | 1962-01-02 | 6.125844 | 6.160982 | 6.125844 | 6.125844 | 1.414651 | 59700.0 | ARNC |
| 2 | 1962-01-02 | 0.837449 | 0.837449 | 0.823045 | 0.823045 | 0.145748 | 352200.0 | BA |
| 3 | 1962-01-02 | 1.604167 | 1.619792 | 1.588542 | 1.604167 | 0.136957 | 163200.0 | CAT |
| 4 | 1962-01-02 | 0.0 | 3.296131 | 3.244048 | 3.296131 | 0.051993 | 105600.0 | CVX |
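One way to reason about plausible types is to check how each value in a sample row parses. The sketch below is plain Python, not the Pandas inspection the instructions ask for. `guess_type` is a hypothetical helper, and the sample row is copied from the first row of the output above.

```python
from datetime import datetime


def guess_type(value: str) -> str:
    """Return a coarse type label ('date', 'float' or 'string') for one raw CSV value."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


header = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume", "stock"]
row = ["1962-01-02", "6.532155", "6.556185", "6.532155", "6.532155", "1.536658", "55900.0", "AA"]

# Map each column name to the type its sample value suggests.
guessed = {name: guess_type(value) for name, value in zip(header, row)}
print(guessed)
```

A single row is only a hint. A string such as `NAN` would parse as a float in Python, so it is worth checking several rows before settling on a schema.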


## Read CSV files

When working with big data, it is often not tenable to keep processing an entire data batch when you are in the process of development - this can be quite time-consuming. If the data is uniform, it is sufficient to work with a smaller subset to create basic functionality. Your manager has identified the year **1962** to perform the initial testing for data ingestion. 

> ℹ️ **Instructions** ℹ️
>
>Read in the data for **1962** using a data schema that purely uses string data types. You will be required to convert to the appropriate data types at a later stage.
>
>*You may use as many coding cells as necessary.*

In [6]:
# With header=True and no explicit schema, Spark reads every column as a string.
stocks_comb = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/1962/*/*/stocks.csv", header=True)
stocks_comb.show(2)

+----------+-----------------+------------------+-----------------+-----------------+------------------+-------+-----+
|      Date|             Open|              High|              Low|            Close|         Adj Close| Volume|stock|
+----------+-----------------+------------------+-----------------+-----------------+------------------+-------+-----+
|1962-02-19|5.839290142059326| 5.907374858856201|5.839290142059326|5.863319873809815|1.3863292932510376|29900.0|   AA|
|1962-02-19|5.481634140014648|5.5284857749938965|5.481634140014648|5.516772747039795|1.2804527282714844|32000.0| ARNC|
+----------+-----------------+------------------+-----------------+-----------------+------------------+-------+-----+
only showing top 2 rows



In [7]:
stocks_comb.describe().toPandas()

| | summary | Date | Open | High | Low | Close | Adj Close | Volume | stock |
|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 5106 | 5106.0 | 5106.0 | 5084 | 5106.0 | 5106 | 5085.0 | 5106 |
| 1 | mean | | 1.0904873526012002 | 16.757624946793637 | 15.728619917198033 | 16.64199179044607 | 5.9866425135353065 | 540930.2458210423 | |
| 2 | stddev | | 2.364453525304909 | 53.91407348193545 | 51.322922898144 | 53.546771399008016 | 24.64637054715319 | 864596.244052551 | |
| 3 | min | 1962-01-02 | 0.0 | 0.0 | -0.040435523172560696F44 | 0.0536249764263629 | -0.04100876298375362F22 | 0.0 | AA |
| 4 | max | 1962-12-31 | 7.7133331298828125 | 9.984375 | 9.96875 | 9.984375 | 7.029978519312864e-07 | 998400.0 | XOM |


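The summary statistics already point to data-quality problems. The `Low` and `Volume` counts (5084 and 5085) fall short of the 5106 rows, so those columns contain missing values, and the `min` row shows malformed entries in `Low` and `Adj Close` (values ending in `F44` and `F22`). Because every column is still a string, `min` and `max` are compared as text rather than as numbers. The affected rows might simply need to be dropped.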

## Update column names
To make the data easier to work with, you will need to make a few changes:
1. Column headers should all be in lowercase; and
2. Whitespaces should be replaced with underscores.


> ℹ️ **Instructions** ℹ️
>
>Make sure that the column headers are all in lowercase and that any whitespaces are replaced with underscores.
>
>*You may use as many coding cells as necessary.*

In [8]:
stocks_comb.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'stock']

In [17]:
#TODO: Write your code here
# Loop over each column name within our DataFrame.
for column in stocks_comb.columns:
    stocks_comb = stocks_comb.withColumnRenamed(column, '_'.join(column.split()).lower())
    
stocks_comb.show(3)

+----------+------------------+------------------+------------------+-----------------+------------------+--------+-----+
|      date|              open|              high|               low|            close|         adj_close|  volume|stock|
+----------+------------------+------------------+------------------+-----------------+------------------+--------+-----+
|1962-02-19| 5.839290142059326| 5.907374858856201| 5.839290142059326|5.863319873809815|1.3863292932510376| 29900.0|   AA|
|1962-02-19| 5.481634140014648|5.5284857749938965| 5.481634140014648|5.516772747039795|1.2804527282714844| 32000.0| ARNC|
|1962-02-19|0.9074074029922484|0.9156378507614136|0.8991769552230835|0.903292179107666|0.1614154428243637|619400.0|   BA|
+----------+------------------+------------------+------------------+-----------------+------------------+--------+-----+
only showing top 3 rows
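The same renaming rule can also be written as a small pure function and applied to all headers at once. This is a sketch rather than the notebook's method. `normalise` is a hypothetical name, and the column list is the one printed earlier in this section.

```python
def normalise(name: str) -> str:
    """Lowercase a header and replace each run of whitespace with one underscore."""
    return "_".join(name.split()).lower()


columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'stock']

# Apply the rule to every header in a single pass.
new_columns = [normalise(c) for c in columns]
print(new_columns)
```

With a Spark dataframe, the resulting list could be passed to `toDF`, for example `stocks_comb.toDF(*new_columns)`, which renames every column in one call instead of one `withColumnRenamed` call per column.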



## Null Values
Null values often represent missing pieces of data. It is always good to know where your null values lie - so you can quickly identify and remedy any issues stemming from these.

> ℹ️ **Instructions** ℹ️
>
>Write code to count the number of null values found in each column.
>
>*You may use as many coding cells as necessary.*

In [18]:
#TODO: Write your code here
from pyspark.sql.functions import isnan, when, count, col

# Restrict the check to the numeric columns (still stored as strings here).
nulls = stocks_comb['open', 'high', 'low', 'close', 'adj_close', 'volume']

# Count NaN entries only; missing CSV fields are read as null, not NaN.
nulls.select([count(when(isnan(c), c)).alias(c) for c in nulls.columns]).show()

+----+----+---+-----+---------+------+
|open|high|low|close|adj_close|volume|
+----+----+---+-----+---------+------+
|   0|   0|  0|    0|        0|     0|
+----+----+---+-----+---------+------+



In [19]:
nulls.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in nulls.columns]).show()


+----+----+---+-----+---------+------+
|open|high|low|close|adj_close|volume|
+----+----+---+-----+---------+------+
|   0|   0| 22|    0|        0|    21|
+----+----+---+-----+---------+------+



Looking at the dataframe with its provisional all-string schema, there appear to be only 43 null values in total: 22 in the **low** column and 21 in the **volume** column.
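The two cells above give different counts because NaN and null are different things: NaN is a floating-point value, while null means that no value is present. A plain-Python analogy, using made-up illustrative values rather than the stock data, shows why a NaN-only check can report zero while missing entries still exist.

```python
import math

# Illustrative values only: two missing entries (None) and one NaN.
values = [5.84, None, float("nan"), 5.48, None]

# NaN check: only applies to values that are actually present.
nan_count = sum(1 for v in values if v is not None and math.isnan(v))

# Null check: counts entries with no value at all.
null_count = sum(1 for v in values if v is None)

print(nan_count, null_count)
```

Counting both conditions together, as the second cell does, is the safer default when the origin of the gaps is unknown.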

## Data type conversion - The final data schema

Now that we have identified the number of missing values in the data set, we'll move on to convert our data schema to the required data types. 

> ℹ️ **Instructions** ℹ️
>
>Use typecasting to convert the string data types in your current data schema to more appropriate data types.
>
>*You may use as many coding cells as necessary.*

In [20]:
#TODO: Write your code here
stocks_schema = StructType([
    StructField('date', DateType()),
    StructField('open', FloatType()),
    StructField('high', FloatType()),
    StructField('low', FloatType()),
    StructField('close', FloatType()),
    StructField('adj_close', FloatType()),
    StructField('volume', FloatType()),
    StructField('stock', StringType()),
])

stocks = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/1962/*/*/stocks.csv",schema=stocks_schema, header=True)
stocks.printSchema()

root
 |-- date: date (nullable = true)
 |-- open: float (nullable = true)
 |-- high: float (nullable = true)
 |-- low: float (nullable = true)
 |-- close: float (nullable = true)
 |-- adj_close: float (nullable = true)
 |-- volume: float (nullable = true)
 |-- stock: string (nullable = true)



In [28]:
stocks_1963 = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/1963/*/*/stocks.csv",schema=stocks_schema, header=True)
stocks_1974 = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/1974/*/*/stocks.csv",schema=stocks_schema, header=True)
stocks_1985 = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/1985/*/*/stocks.csv",schema=stocks_schema, header=True)
stocks_1996 = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/1996/*/*/stocks.csv",schema=stocks_schema, header=True)
stocks_2007 = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/2007/*/*/stocks.csv",schema=stocks_schema, header=True)
stocks_2018 = spark.read.csv("/Users/daluxolombatha/Desktop/last_dance/stocks/2018/*/*/stocks.csv",schema=stocks_schema, header=True)

## Consolidate missing values
We have to check if the data type conversion above was done correctly.
If the casting was not successful, a null value gets inserted into the dataframe. You can thus check for successful conversion by determining if any null values are included in the resulting dataframe.
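To see why a failed cast surfaces as a null, consider a plain-Python analogy. This is not Spark's own parser, and `to_float_or_none` is a hypothetical helper. The malformed sample is the `min` value reported for the `Low` column in the summary statistics earlier in this notebook.

```python
from typing import Optional


def to_float_or_none(raw: Optional[str]) -> Optional[float]:
    """Convert a raw string to float, returning None when it is missing or malformed."""
    if raw is None or raw == "":
        return None
    try:
        return float(raw)
    except ValueError:
        # A value that cannot be parsed is treated like a missing value.
        return None


samples = ["5.839290142059326", "-0.040435523172560696F44", "", None]
converted = [to_float_or_none(s) for s in samples]
print(converted)
```

Only the first sample survives the conversion. The malformed string becomes a missing value, which is the effect to look for when comparing null counts before and after the type conversion.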


> ℹ️ **Instructions** ℹ️
>
>Write code to compare the number of invalid entries (nulls) pre-conversion and post-conversion.
>
>*You may use as many coding cells as necessary.*

In [21]:
#TODO: Write your code here
nulls_1 = stocks['open', 'high', 'low', 'close', 'adj_close', 'volume']
nulls_1.select([count(when(isnan(c), c)).alias(c) for c in nulls_1.columns]).show()
nulls_1.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in nulls_1.columns]).show()

+----+----+---+-----+---------+------+
|open|high|low|close|adj_close|volume|
+----+----+---+-----+---------+------+
|   0|   0|  0|    0|        0|     0|
+----+----+---+-----+---------+------+

+----+----+---+-----+---------+------+
|open|high|low|close|adj_close|volume|
+----+----+---+-----+---------+------+
|   0|   0| 42|    0|       21|    21|
+----+----+---+-----+---------+------+



It looks like the type casting has definitely added more null or NaN values than we had before. Initially we had only 22 missing values in the low column; now we have 42 (20 more), and we also have 21 missing values in the adjusted close column, which is new. The casting had no effect on the volume column, which still has 21 missing values.

Here you should be able to see if any of your casts went wrong. 
Do not attempt to correct any missing values at this point. This will be dealt with in later sections of the predict.
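The pre-conversion and post-conversion counts can be compared column by column. The sketch below hard-codes the counts shown in the null-count outputs above instead of recomputing them with Spark. The names `pre`, `post` and `introduced` are illustrative.

```python
columns = ["open", "high", "low", "close", "adj_close", "volume"]

# Null counts copied from the outputs before and after the type conversion.
pre = dict(zip(columns, [0, 0, 22, 0, 0, 21]))
post = dict(zip(columns, [0, 0, 42, 0, 21, 21]))

# Nulls introduced by the conversion, per column.
introduced = {c: post[c] - pre[c] for c in columns}

print(introduced)
print(sum(introduced.values()))
```

In Spark, the two dictionaries could instead be taken from the count dataframes, for example with `.first().asDict()`, assuming each count query returns a single row as it does above.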

## Generate parquet files
When writing in Spark, we typically use the parquet format. Parquet is a columnar format that Spark can write in parallel, and it stores metadata such as the schema alongside the data.

When writing, it is good to make sure that the data is sufficiently partitioned. 

As a rule of thumb, aim for roughly one partition for every 200 MB of data, although the right number also depends on the size of your cluster and executors.
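The rule of thumb reduces to a small calculation. In the sketch below, `partition_count` is a hypothetical helper. The floor of two partitions mirrors the sizing cell later in this notebook and is an assumption, not a Spark requirement.

```python
def partition_count(size_mb: float, target_mb: float = 200.0, minimum: int = 2) -> int:
    """Return the number of partitions for a dataset of `size_mb` megabytes."""
    # One partition per `target_mb`, but never fewer than `minimum`.
    return max(int(size_mb / target_mb), minimum)


small = partition_count(9.973432)  # the estimated size of the 1962 dataframe
large = partition_count(1000.0)    # an illustrative 1 GB dataset

print(small, large)
```

For small dataframes the floor dominates, so the size estimate only starts to matter once the data exceeds a few hundred megabytes.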


### Check the size of the dataframe before partitioning

In [14]:
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

In [15]:
# Rough size estimate via Spark internals (private APIs that may change between versions).
rdd = stocks.rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
obj = rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
size = spark._jvm.org.apache.spark.util.SizeEstimator.estimate(obj)
size_MB = size/1000000
# One partition per 200 MB, with a floor of two partitions.
partitions = max(int(size_MB/200), 2)
print(partitions)
print(f'The dataframe is {size_MB} MB')

2
The dataframe is 9.973432 MB


### Write parquet files to the local directory
> ℹ️ **Instructions** ℹ️
>
> Use the **coalesce** function and the number of **partitions** derived above to write parquet files to your local directory 
>
>*You may use as many coding cells as necessary.*

In [31]:
#TODO: Write your code here
# Note: `partitions` was estimated from the 1962 dataframe (`stocks`), not from `stocks_2018`.
stocks_2018.coalesce(partitions).write.format("parquet").save("/Users/daluxolombatha/Desktop/daluxolos_predict/three/2018")


In [32]:
spark.stop()