# Processing Big Data - Data Ingestion
© Explore Data Science Academy

## Honour Code
I {**YOUR NAME**, **YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).
    Non-compliance with the honour code constitutes a material breach of contract.



## Context 

To work constructively with any dataset, one needs to create an ingestion profile to make sure that the data at the source can be readily consumed. For this section of the predict, as the Data Engineer in the team, you will be required to design and implement the ingestion process. For the purposes of the project, an AWS S3 bucket will act as your data source. All the data required can be found [here](https://processing-big-data-predict-stocks-data.s3.eu-west-1.amazonaws.com/stocks.zip).

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/DataIngestion.jpg"
     alt="Data Ingestion"
     style="float: center; padding-bottom=0.5em"
     width=40%/>
     <p><em>Figure 1. Data Ingestion</em></p>
</div>

Your manager, Gnissecorp Atadgib, knowing very well that you've recently completed your Data Engineering qualification, asks you to make use of Apache Spark for the ingestion as well as the rest of the project. His rationale is that stock market data is generated every day, is quite time-sensitive, and would require scalability when deployed to a production environment.

## Dataset - US Nasdaq




<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/Nasdaq.png"
     alt="Nasdaq"
     style="float: center; padding-bottom=0.5em"
     width=50%/>
     <p><em>Figure 2. Nasdaq</em></p>
</div>

The data that you will be working with is a historical snapshot of market data taken from the Nasdaq electronic market. This dataset contains historical daily prices for all tickers currently trading on Nasdaq. The up-to-date list can be found on their [website](https://www.nasdaq.com/).


The provided data contains prices from 02 January 1962 to 01 April 2020. The data found in the S3 bucket has been stored in the following structure:

```
     stocks/<Year>/<Month>/<Day>/stocks.csv
```
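
Because the layout is regular, the path to any trading day's file can be derived from its date. The following is a minimal sketch: the helper name `stocks_path` is hypothetical, and the zero-padded month and day are an assumption based on the paths in the unzip output of this notebook (for example `stocks/1981/07/22/stocks.csv`).

```python
from datetime import date


def stocks_path(day: date, root: str = "stocks") -> str:
    """Build the path of the CSV file for one trading day (hypothetical helper)."""
    # The year has four digits; month and day are assumed to be zero-padded.
    return f"{root}/{day:%Y}/{day:%m}/{day:%d}/stocks.csv"


print(stocks_path(date(1962, 1, 3)))
```
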
Each CSV file for every trading day contains the following details:
- **Date** - specifies trading date
- **Open** - opening price
- **High** - maximum price during the day
- **Low** - minimum price during the day
- **Close** - close price adjusted for splits
- **Adj Close** - close price adjusted for both dividends and splits
- **Volume** - the number of shares that changed hands during a given day
- **stock** - ticker symbol of the stock

## Basic initialisation
To get you started, let's import some basic Python libraries as well as Spark modules and functions.

In [1]:
""" 
Colab essentially runs on a Linux machine on Google Cloud Platform.
This means that, should you want to install something in your notebook, you
have to run headless installs as well as wget. Copying the installs
below ensures that Spark and Java are installed in the environment
and available to the notebook.
"""
!apt-get install openjdk-17-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark

In [2]:
"""
This cell sets the environment variables that Spark needs so that the
notebook can locate Java and Spark. Re-running it in any notebook resets
these variables if the Spark environment has not been resolved.
"""
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"
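
Before initialising Spark it can help to confirm that both variables point at directories that actually exist, since a wrong path tends to surface later as a less obvious error. The following is a minimal sketch; `missing_dirs` is a hypothetical helper and not part of findspark or PySpark.

```python
import os


def missing_dirs(var_names):
    """Return the environment variables that are unset or do not point to a directory."""
    missing = []
    for name in var_names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# An empty list means both locations were found.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```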

In [3]:
"""
We need to locate Spark in the system. For that, we import findspark and 
use the findspark.init() method. If you want to know the location where 
Spark is installed, use findspark.find()
"""
import findspark
findspark.init()
# findspark.find()
import pyspark

In [57]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!unzip "/content/drive/MyDrive/data/stocks.zip"

Streaming output truncated to the last 5000 lines.
   creating: stocks/1981/07/22/
  inflating: stocks/1981/07/22/stocks.csv  
   creating: stocks/1981/07/23/
  inflating: stocks/1981/07/23/stocks.csv  
   creating: stocks/1981/07/24/
  inflating: stocks/1981/07/24/stocks.csv  
   creating: stocks/1981/07/27/
  inflating: stocks/1981/07/27/stocks.csv  
   creating: stocks/1981/07/28/
  inflating: stocks/1981/07/28/stocks.csv  
   creating: stocks/1981/07/29/
  inflating: stocks/1981/07/29/stocks.csv  
   creating: stocks/1981/07/30/
  inflating: stocks/1981/07/30/stocks.csv  
   creating: stocks/1981/07/31/
  inflating: stocks/1981/07/31/stocks.csv  
   creating: stocks/1981/08/
   creating: stocks/1981/08/03/
  inflating: stocks/1981/08/03/stocks.csv  
   creating: stocks/1981/08/04/
  inflating: stocks/1981/08/04/stocks.csv  
   creating: stocks/1981/08/05/
  inflating: stocks/1981/08/05/stocks.csv  
   creating: stocks/1981/08/06/
  inflating: stocks/1981/08/06/stocks.

In [6]:
drive.flush_and_unmount()

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

Remember that we need a `SparkContext` and a `SparkSession` to interface with Spark.
We will mostly use the `SparkContext` to interact with RDDs and the `SparkSession` to work with DataFrames.

> ℹ️ **Instructions** ℹ️
>
>Initialise a new **Spark Context** and **Session** that you will use to interface with Spark.

In [8]:
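#TODO: Write your code here
# getOrCreate() reuses a running context, so re-running this cell does not raise an error
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)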

## Investigate dataset schema
At this point, it is enough to read in a single file to ascertain the data structure. You will be required to use the information obtained from the small subset to create a data schema. This data schema will be used when reading the entire dataset using Spark.

> ℹ️ **Instructions** ℹ️
>
>Make use of Pandas to read in a single file and investigate the plausible data types to be used when creating a Spark data schema. 
>
>*You may use as many coding cells as necessary.*

In [11]:
test_df = spark.read.csv('/content/stocks/1962/01/03/stocks.csv', header=True)

In [18]:
#TODO: Write your code here
test_df.show(5, truncate=False)
test_df.dtypes

+----------+------------------+------------------+------------------+------------------+-------------------+--------+-----+
|Date      |Open              |High              |Low               |Close             |Adj Close          |Volume  |stock|
+----------+------------------+------------------+------------------+------------------+-------------------+--------+-----+
|1962-01-03|6.5321550369262695|6.632279872894287 |6.5241451263427725|6.632279872894287 |1.5602117776870728 |74500.0 |AA   |
|1962-01-03|6.125843524932861 |6.219546318054199 |6.11413049697876  |6.219546318054199 |1.436288595199585  |79600.0 |ARNC |
|1962-01-03|0.8353909254074097|0.8518518805503845|0.8353909254074097|0.8395061492919922|0.1486625373363495 |710400.0|BA   |
|1962-01-03|1.6041666269302368|1.6197916269302368|1.5885416269302368|1.6197916269302368|0.13829129934310913|156000.0|CAT  |
|1962-01-03|0.0               |3.3035714626312256|3.2738094329833984|3.2886905670166016|0.05187510699033737|126400.0|CVX  |
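+----------+------------------+------------------+------------------+------------------+-------------------+--------+-----+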

[('Date', 'string'),
 ('Open', 'string'),
 ('High', 'string'),
 ('Low', 'string'),
 ('Close', 'string'),
 ('Adj Close', 'string'),
 ('Volume', 'string'),
 ('stock', 'string')]
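
The instructions above ask for Pandas, whereas the cells use Spark, which reports every column as a string when no schema is given. A Pandas read infers a type per column and so gives a quick view of plausible data types. The sketch below reads two rows copied from the sample output above out of an in-memory string so that it is self-contained; in the notebook you would pass the file path instead.

```python
import io

import pandas as pd

# Two rows copied from the sample output above, used in place of a file on disk.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume,stock\n"
    "1962-01-03,6.5321550369262695,6.632279872894287,6.5241451263427725,"
    "6.632279872894287,1.5602117776870728,74500.0,AA\n"
    "1962-01-03,6.125843524932861,6.219546318054199,6.11413049697876,"
    "6.219546318054199,1.436288595199585,79600.0,ARNC\n"
)

# In the notebook: pd.read_csv('/content/stocks/1962/01/03/stocks.csv')
sample_df = pd.read_csv(sample_csv)

# Pandas infers a type per column; Date stays text unless parse_dates is used.
inferred = sample_df.dtypes
print(inferred)
```

On this sample the price columns and `Volume` are inferred as floating point, which suggests a float or double type for prices, an integer type for volume once the trailing `.0` is handled, a date type for `Date`, and a string type for `stock`.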

## Read CSV files

When working with big data, it is often impractical to process the entire dataset during development, as this can be quite time-consuming. If the data is uniform, it is sufficient to work with a smaller subset to create basic functionality. Your manager has identified the year **1962** to perform the initial testing for data ingestion.

> ℹ️ **Instructions** ℹ️
>
>Read in the data for **1962** using a data schema that purely uses string data types. You will be required to convert to the appropriate data types at a later stage.
>
>*You may use as many coding cells as necessary.*

In [20]:
#TODO: Write your code here
df = spark.read.csv('/content/stocks/1962', header=True, recursiveFileLookup=True)
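
The cell above relies on Spark's default of reading every CSV column as a string. If you prefer to state the all-string schema explicitly, `spark.read.csv` also accepts a DDL-formatted string for its `schema` argument. The following is a minimal sketch of building one; the helper name `string_ddl` is hypothetical, and the column names are taken from the sample output earlier in this notebook.

```python
def string_ddl(columns):
    """Build a DDL schema string that declares every column as STRING."""
    # Backticks allow names that contain spaces (such as "Adj Close").
    return ", ".join(f"`{name}` STRING" for name in columns)


raw_columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume", "stock"]
ddl = string_ddl(raw_columns)
print(ddl)
```

The result could then be passed to the reader as `spark.read.csv('/content/stocks/1962', header=True, schema=ddl, recursiveFileLookup=True)`.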

## Update column names
To make the data easier to work with, you will need to make a few changes:
1. Column headers should all be in lowercase; and
2. Whitespaces should be replaced with underscores.


> ℹ️ **Instructions** ℹ️
>
>Make sure that the column headers are all in lowercase and that any whitespaces are replaced with underscores.
>
>*You may use as many coding cells as necessary.*

In [21]:
#TODO: Write your code here
for column in df.columns:
    df = df.withColumnRenamed(column, '_'.join(column.split()).lower())
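
The renaming rule is plain string manipulation, so it can be checked without Spark. The following is a small sketch, with `normalise` as a hypothetical helper name and the column names taken from the sample output earlier in this notebook.

```python
def normalise(name: str) -> str:
    """Lowercase a column name and replace each run of whitespace with one underscore."""
    return "_".join(name.split()).lower()


raw_columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume", "stock"]
clean_columns = [normalise(name) for name in raw_columns]
print(clean_columns)
```

With a Spark DataFrame, `df.toDF(*clean_columns)` would apply all the new names in one call instead of one `withColumnRenamed` call per column.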

## Null Values
Null values often represent missing pieces of data. It is always good to know where your null values lie - so you can quickly identify and remedy any issues stemming from these.

> ℹ️ **Instructions** ℹ️
>
>Write code to count the number of null values found in each column.
>
>*You may use as many coding cells as necessary.*

In [27]:
#TODO: Write your code here
# Count the nulls in every column with a single Spark job
null_row = df.select(
    [F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]
).first()
null_values = null_row.asDict()

null_values

{'adj_close': 0,
 'close': 0,
 'date': 0,
 'high': 0,
 'low': 22,
 'open': 0,
 'stock': 0,
 'volume': 21}

In [28]:
null_sum = sum(null_values.values())
null_sum

43

## Data type conversion - The final data schema

Now that we have identified the number of missing values in the data set, we'll move on to convert our data schema to the required data types. 

> ℹ️ **Instructions** ℹ️
>
>Use typecasting to convert the string data types in your current data schema to more appropriate data types.
>
>*You may use as many coding cells as necessary.*

In [42]:
schema = StructType([StructField('date', StringType(), True),
                     StructField('open', FloatType(), True),
                     StructField('high', FloatType(), True),
                     StructField('low', FloatType(), True),
                     StructField('close', FloatType(), True),
                     StructField('adj_close', FloatType(), True),
                     StructField('volume', FloatType(), True),
                     StructField('stock', StringType(), True)])

In [43]:
#TODO: Write your code here
working_df = spark.read.csv('/content/stocks/1962', header=False, schema=schema, recursiveFileLookup=True)

In [44]:
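# Parse the date strings into a proper date type
working_df = working_df.withColumn('date', F.to_date(F.col('date')))
# Volume is a share count, so store it as an integer
working_df = working_df.withColumn('volume', F.col('volume').cast(IntegerType()))
# The files were read with header=False, so drop the header rows that were read as data
working_df = working_df.where(F.col('stock')!='stock')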

In [45]:
working_df.show(5)

+----------+---------+----------+----------+---------+-----------+------+-----+
|      date|     open|      high|       low|    close|  adj_close|volume|stock|
+----------+---------+----------+----------+---------+-----------+------+-----+
|1962-02-19|  5.83929|  5.907375|   5.83929|  5.86332|  1.3863293| 29900|   AA|
|1962-02-19| 5.481634|  5.528486|  5.481634|5.5167727|  1.2804527| 32000| ARNC|
|1962-02-19|0.9074074|0.91563785|0.89917696|0.9032922| 0.16141544|619400|   BA|
|1962-02-19|1.6770834| 1.6927084| 1.6614584|1.6770834|  0.1440587|170400|  CAT|
|1962-02-19|      0.0|  3.578869|      20.0| 3.549107|0.056501225|273600|  CVX|
+----------+---------+----------+----------+---------+-----------+------+-----+
only showing top 5 rows



## Consolidate missing values
We have to check if the data type conversion above was done correctly.
If the casting was not successful, a null value gets inserted into the dataframe. You can thus check for successful conversion by determining if any null values are included in the resulting dataframe.


> ℹ️ **Instructions** ℹ️
>
>Write code to compare the number of invalid entries (nulls) pre-conversion and post-conversion.
>
>*You may use as many coding cells as necessary.*

In [46]:
#TODO: Write your code here
# Count the nulls in every column after conversion with a single Spark job
missing_row = working_df.select(
    [F.count(F.when(F.isnull(c), c)).alias(c) for c in working_df.columns]
).first()
missing_values = missing_row.asDict()

missing_values

{'adj_close': 21,
 'close': 0,
 'date': 0,
 'high': 0,
 'low': 42,
 'open': 0,
 'stock': 0,
 'volume': 21}
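
One way to make the comparison explicit is to subtract the two sets of counts column by column. The sketch below hard-codes the counts printed in this notebook so that it runs on its own; in the notebook you would use the `null_values` and `missing_values` dictionaries directly. The name `new_nulls` is introduced here for illustration.

```python
# Counts copied from the outputs in this notebook (pre- and post-conversion).
null_values = {'adj_close': 0, 'close': 0, 'date': 0, 'high': 0,
               'low': 22, 'open': 0, 'stock': 0, 'volume': 21}
missing_values = {'adj_close': 21, 'close': 0, 'date': 0, 'high': 0,
                  'low': 42, 'open': 0, 'stock': 0, 'volume': 21}

# Keep only the columns whose null count changed after the conversion.
new_nulls = {
    column: missing_values[column] - null_values[column]
    for column in null_values
    if missing_values[column] != null_values[column]
}
print(new_nulls)
```

A positive difference points at a column in which some values became null during the conversion and is worth inspecting row by row.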

In [49]:
# Count the rows per ticker for the year 1962
working_df.groupby('stock').count().show()

+-----+-----+
|stock|count|
+-----+-----+
|   AA|  252|
|  XOM|  252|
|  DIS|  252|
|   PG|  252|
|   GT|  252|
|   MO|  252|
|  IBM|  252|
|  JNJ|  252|
|  CVX|  252|
|  DTE|  252|
|   BA|  242|
|   GE|  252|
|  HPQ|  309|
| ARNC|  231|
|  CAT|  252|
|   IP|  252|
|   FL|  252|
|   ED|  271|
|  NAV|  252|
|   KO|  252|
+-----+-----+
only showing top 20 rows



In [55]:
# Check which ticker has null adj_close values
working_df.where(F.isnull('adj_close')).show()

+----------+----+---------+---------+---------+---------+------+-----+
|      date|open|     high|      low|    close|adj_close|volume|stock|
+----------+----+---------+---------+---------+---------+------+-----+
|1962-06-12| 0.0|5.6666665|5.6041665|5.6041665|     null| 50400|   FL|
|1962-06-06| 0.0|    5.875|  5.65625|5.8333335|     null|109200|   FL|
|1962-06-07| 0.0|  5.84375|5.7291665|5.7291665|     null| 36000|   FL|
|1962-06-18| 0.0|5.6666665|      5.5|   5.5625|     null| 42000|   FL|
|1962-06-15| 0.0|5.6666665|5.4583335|5.6666665|     null| 64800|   FL|
|1962-06-22| 0.0|5.3645835|5.2916665|5.2916665|     null| 67200|   FL|
|1962-06-21| 0.0|  5.46875|5.4166665|5.4166665|     null| 49200|   FL|
|1962-06-19| 0.0|      5.5|5.4791665|5.4791665|     null| 19200|   FL|
|1962-06-20| 0.0|  5.53125|5.4583335|5.4583335|     null| 34800|   FL|
|1962-06-01| 0.0|5.8333335|     5.75|5.7916665|     null| 49200|   FL|
|1962-06-11| 0.0|5.7083335|5.6458335|5.6458335|     null|  8400|   FL|
|1962-

Here you should be able to see if any of your casts went wrong. 
Do not attempt to correct any missing values at this point. This will be dealt with in later sections of the predict.

## Generate parquet files
When writing in Spark, we typically use the Parquet format. Parquet is a columnar format that Spark can write in parallel, and it stores metadata such as the schema alongside the data.

When writing, it is good to make sure that the data is sufficiently partitioned.

As a general guideline, aim for one partition for every 200 MB of data, although this also depends on the size of your cluster and executors.
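
As a rough illustration of that rule of thumb, the sketch below turns a size in megabytes into a partition count. The minimum of two partitions mirrors the calculation used later in this notebook; the function name and the default values are otherwise assumptions.

```python
def partitions_for(size_mb: float, target_mb: float = 200, minimum: int = 2) -> int:
    """Number of partitions for a dataset, aiming for roughly target_mb per partition."""
    return max(int(size_mb / target_mb), minimum)


print(partitions_for(19.034008))  # a small dataset falls back to the minimum
print(partitions_for(1000))       # a 1000 MB dataset at 200 MB per partition
```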


### Check the size of the dataframe before partitioning

In [47]:
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

In [48]:
# Rough size estimate via Spark's JVM SizeEstimator; this relies on private PySpark internals
rdd = df.rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
obj = rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
size = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(obj)
size_MB = size/1000000
# One partition per 200 MB, with a minimum of two
partitions = max(int(size_MB/200), 2)
print(f'The dataframe is {size_MB} MB')

The dataframe is 19.034008 MB


In [56]:
partitions

2

### Write parquet files to the local directory
> ℹ️ **Instructions** ℹ️
>
> Use the **coalesce** function and the number of **partitions** derived above to write parquet files to your local directory 
>
>*You may use as many coding cells as necessary.*

In [58]:
#TODO: Write your code here
working_df.coalesce(partitions).write.format("parquet").save("/content/drive/MyDrive/data/Part_I")