# Project Title
### Data Engineering Capstone Project

#### Project Summary
In this project we collect data sets and create tables that make it possible to assess the influence of solar activity, measured by the sunspot number, on weather patterns on Earth. An investigation showed that small fluctuations in spot coverage can influence the climate.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import os, glob
import shutil
import tempfile
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

### Step 1: Scope the Project and Gather Data

#### Scope 
To evaluate the influence of sunspot coverage on Earth's weather patterns, we combine data from two sources: a time series of the sunspot coverage and a log of weather observations.


#### Describe and Gather Data 
Two data sets are used:

* Sunspots ([link](https://www.kaggle.com/robervalt/sunspots)): This data set is a csv file with the date and the monthly mean total sunspot number, obtained from the database of SIDC (Solar Influences Data Analysis Center), the solar physics research department of the Royal Observatory of Belgium.

* NOAA GSOD ([link](https://data.noaa.gov/dataset/dataset/global-surface-summary-of-the-day-gsod)): This is a public data set from the National Oceanic and Atmospheric Administration (NOAA). It is very large, so only a limited subset is available on the workspace.

### Data exploration:

We start by understanding the content of the data.

In [2]:
#create spark session
spark = SparkSession.builder\
        .appName("Comparing solar flares with global weather pattern")\
        .enableHiveSupport().getOrCreate()



Read sunspot csv file.

In [3]:
#csvFile = 'hessi.solar.flare.2002to2016.csv'
csvFile = 'Sunspots.csv'
dfSpots = spark.read.csv(csvFile, header=True, inferSchema=True)


Print the schema, show the size of the data and the date range.

In [4]:
dfSpots.printSchema()
print(dfSpots.count())
print(dfSpots.head())
dfSpots.select([min("Date")]).show()
dfSpots.select([max("Date")]).show()

root
 |-- _c0: integer (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- Monthly Mean Total Sunspot Number: double (nullable = true)

3252
Row(_c0=0, Date=datetime.datetime(1749, 1, 31, 0, 0), Monthly Mean Total Sunspot Number=96.7)
+-------------------+
|          min(Date)|
+-------------------+
|1749-01-31 00:00:00|
+-------------------+

+-------------------+
|          max(Date)|
+-------------------+
|2019-12-31 00:00:00|
+-------------------+



NOAA data is available as `tar.gz` files for each year. We start by exploring only one year. Each `tar.gz` file has several csv files. As an example we start by showing just one file.

In [5]:
#create a temporary directory where the data will be extracted and explored
with tempfile.TemporaryDirectory() as temp_directory:
    #unpack the year 2016 data to the temporary directory
    shutil.unpack_archive('/home/workspace/NOAA_data/2016.tar.gz', temp_directory)
    #get a list of all csv files
    csvFiles = glob.glob( os.path.join(temp_directory,'*.csv') )
    #read the first csv file and print some information
    dfNOAA = spark.read.csv(csvFiles[0], header=True, inferSchema=True)
    dfNOAA.printSchema()
    print(dfNOAA.count())
    print(dfNOAA.head())

root
 |-- STATION: long (nullable = true)
 |-- DATE: timestamp (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)
 |-- ELEVATION: double (nullable = true)
 |-- NAME: string (nullable = true)
 |-- TEMP: double (nullable = true)
 |-- TEMP_ATTRIBUTES: double (nullable = true)
 |-- DEWP: double (nullable = true)
 |-- DEWP_ATTRIBUTES: double (nullable = true)
 |-- SLP: double (nullable = true)
 |-- SLP_ATTRIBUTES: double (nullable = true)
 |-- STP: double (nullable = true)
 |-- STP_ATTRIBUTES: double (nullable = true)
 |-- VISIB: double (nullable = true)
 |-- VISIB_ATTRIBUTES: double (nullable = true)
 |-- WDSP: double (nullable = true)
 |-- WDSP_ATTRIBUTES: double (nullable = true)
 |-- MXSPD: double (nullable = true)
 |-- GUST: double (nullable = true)
 |-- MAX: double (nullable = true)
 |-- MAX_ATTRIBUTES: string (nullable = true)
 |-- MIN: double (nullable = true)
 |-- MIN_ATTRIBUTES: string (nullable = true)
 |-- PRCP: double (nullable = 

### Step 2: Explore and Assess the Data
#### Explore the Data 

***Columns of interest for NOAA data:***

* STATION: Station number (WMO/DATSAV3 number) for the location

* DATE: Date of reading

* LATITUDE: Station latitude

* LONGITUDE: Station longitude

* ELEVATION: Station elevation

* NAME: Station name

* TEMP: Temperature reading

* DEWP: Dew Point reading

* SLP: Mean sea level pressure for the day in millibars to tenths. Missing = 9999.9

* STP: Mean station pressure for the day in millibars to tenths. Missing = 9999.9

* VISIB: Mean visibility for the day in miles to tenths. Missing = 999.9

* WDSP: Mean wind speed for the day in knots to tenths. Missing = 999.9

* MXSPD: Maximum sustained wind speed reported for the day in knots to tenths. Missing = 999.9

* GUST: Maximum wind gust reported for the day in knots to tenths. Missing = 999.9

* MAX: Maximum temperature reported during the day in Fahrenheit to tenths--time of max temp report varies by country and region.

* MIN: Minimum temperature reported during the day in Fahrenheit to tenths--time of min temp report varies by country and region.

* PRCP: Total precipitation (rain and/or melted snow) reported during the day in inches and hundredths.
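
The sentinel values listed above can be kept in one place and turned into a single filter condition. The following is a minimal sketch in plain Python; the names `MISSING_SENTINELS` and `missing_value_filter` are hypothetical helpers and not part of the original notebook.

```python
# Sentinel values that mark a missing reading, taken from the column list above.
MISSING_SENTINELS = {
    "SLP": 9999.9,
    "STP": 9999.9,
    "VISIB": 999.9,
    "WDSP": 999.9,
    "MXSPD": 999.9,
    "GUST": 999.9,
}


def missing_value_filter(sentinels):
    """Build one SQL condition that keeps only rows without sentinel values."""
    return " AND ".join(
        f"{column} != {value}" for column, value in sentinels.items()
    )


condition = missing_value_filter(MISSING_SENTINELS)
print(condition)
```

Assuming a Spark dataframe such as `dfNOAA`, the resulting string could be passed to `dfNOAA.filter(condition)`, since `filter` accepts a SQL expression string. Note that this drops every row in which at least one of the listed columns is missing.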


***Columns of interest for Sun Spot data:***

* _c0: Identifier
* Date: Date
* Monthly Mean Total Sunspot Number

#### Cleaning Steps

For the sunspot data we only drop duplicate entries and split the timestamp into separate date and time columns.

For the NOAA data we select only the columns of interest, drop duplicates **and** filter out rows with missing values. We iterate over the `tar.gz` files, read all csv files of each archive into a Spark dataframe, and append the staged result to a parquet file.

In [6]:
# Read the sunspot csv file, drop duplicates and split the timestamp into date and time columns.
csvFile = 'Sunspots.csv'
dfSpots = spark.read.csv(csvFile, header=True, inferSchema=True)

cleanedSpots = dfSpots.select(col('_c0').alias('spotId'), 
                            col('Date').alias('ts'),
                            date_format(dfSpots["Date"], 'yyyy-MM-dd').alias('SpotDate'),
                            date_format(dfSpots["Date"], 'h:m:s a').alias('Time'),
                            col('Monthly Mean Total Sunspot Number').alias('SunspotNumber'),)\
                            .dropDuplicates()
# Show schema for verification purposes.
cleanedSpots.printSchema()
# Show number of counts. This table actually has no duplicates.
cleanedSpots.count()

root
 |-- spotId: integer (nullable = true)
 |-- ts: timestamp (nullable = true)
 |-- SpotDate: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- SunspotNumber: double (nullable = true)



3252

In [7]:
cleanedSpots.head()

Row(spotId=653, ts=datetime.datetime(1803, 6, 30, 0, 0), SpotDate='1803-06-30', Time='12:0:0 AM', SunspotNumber=60.0)

**Staging the NOAA data**

In [8]:
# READ NOAA data.
noaaFiles = glob.glob('NOAA_data/*.tar.gz') # List of all tar.gz files
for tarfile in noaaFiles[0:3]:
    with tempfile.TemporaryDirectory() as temp_directory:
        #unpack
        shutil.unpack_archive(tarfile, temp_directory)
        csvFiles = glob.glob( os.path.join(temp_directory,'*.csv') )
        #read all csv files into spark dataframe
        dfNOAA = spark.read.csv(csvFiles, header=True, inferSchema=True)
        #clean up the data
        noaaTable = dfNOAA.select(col('STATION').alias('StationId'), col('DATE').alias('ts'),
                                date_format(dfNOAA["DATE"], 'yyyy-MM-dd').alias('Date'),
                                date_format(dfNOAA["DATE"], 'h:m:s a').alias('Time'),
                                'LATITUDE', 'LONGITUDE', 'ELEVATION',
                                'NAME', 'TEMP', 'DEWP', 'SLP', 'STP', 'VISIB', 'WDSP', 'MXSPD', 
                                'GUST', 'MAX', 'MIN', 'PRCP').dropDuplicates()\
                                .filter("SLP != 9999.9")\
                                .filter("STP != 9999.9")\
                                .filter("VISIB != 999.9")\
                                .filter("WDSP != 999.9")\
                                .filter("MXSPD != 999.9")\
                                .filter("GUST != 999.9")
        #Append the data to a parquet file
        noaaTable.write.mode('append').parquet("NOAA_data.parquet")



Check that parquet file has been created and the schema is correct.

In [9]:
noaaParquet = spark.read \
                .format('parquet') \
                .load('NOAA_data.parquet')


In [10]:
noaaParquet.printSchema()
print(noaaParquet.head())
noaaParquet.count()

root
 |-- StationId: long (nullable = true)
 |-- ts: timestamp (nullable = true)
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)
 |-- ELEVATION: double (nullable = true)
 |-- NAME: string (nullable = true)
 |-- TEMP: double (nullable = true)
 |-- DEWP: double (nullable = true)
 |-- SLP: double (nullable = true)
 |-- STP: double (nullable = true)
 |-- VISIB: double (nullable = true)
 |-- WDSP: double (nullable = true)
 |-- MXSPD: double (nullable = true)
 |-- GUST: double (nullable = true)
 |-- MAX: double (nullable = true)
 |-- MIN: double (nullable = true)
 |-- PRCP: double (nullable = true)

Row(StationId=72250512904, ts=datetime.datetime(2016, 5, 31, 0, 0), Date='2016-05-31', Time='12:0:0 AM', LATITUDE=26.22806, LONGITUDE=-97.65417, ELEVATION=10.4, NAME='HARLINGEN RIO GRANDE VALLEY INTERNATIONAL AIRPORT, TX US', TEMP=83.9, DEWP=74.4, SLP=1010.3, STP=9.0, VISIB=9.5, WDSP=10.0, MXS

1097077

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
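We use a NoSQL-style data model, stored as parquet files and processed with Spark, because the NOAA data set is very large. The tables are arranged as a star schema with one fact table and four dimension tables: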


<img src="starSchema_capstone.png">


#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

After cleaning the data, we create a final table that can be queried for each month and returns the sunspot coverage together with the average weather, both per site and globally.
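
The monthly averaging mentioned above can be sketched in plain Python. The station ids and temperatures below are made-up illustration values, and `monthly_mean_temp` is a hypothetical helper. In Spark the same idea would be expressed with `groupBy` and `avg` on the staged NOAA data.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical daily readings: (station id, date, temperature in Fahrenheit).
readings = [
    (1001, date(2016, 1, 1), 30.0),
    (1001, date(2016, 1, 2), 34.0),
    (1001, date(2016, 2, 1), 40.0),
    (1002, date(2016, 1, 1), 50.0),
]


def monthly_mean_temp(rows):
    """Average the temperature per (station, year, month)."""
    groups = defaultdict(list)
    for station_id, day, temp in rows:
        groups[(station_id, day.year, day.month)].append(temp)
    return {key: mean(values) for key, values in groups.items()}


# Monthly mean per site.
per_station = monthly_mean_temp(readings)

# Replacing every station id with the same placeholder (0) yields the
# global monthly mean over all sites.
global_mean = monthly_mean_temp([(0, day, temp) for _, day, temp in readings])

print(per_station)
print(global_mean)
```

Both results are keyed by year and month, which matches the monthly resolution of the sunspot data.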

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model. We start by deriving the dimension tables from the staged NOAA parquet data.

In [11]:
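# One row per station: the staged data repeats the station attributes for every daily reading.
stationTable = noaaParquet.select('stationId', 'latitude', 'longitude', 'elevation', 'Name').dropDuplicates()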

In [12]:
WeatherTable = noaaParquet.select(date_format(noaaParquet["ts"], 'yyyy-MM-dd h:m:s a').alias('ConditionId'), 
                                'TEMP', 'DEWP', 'SLP', 'STP', 'VISIB', 'WDSP', 'MXSPD', 'GUST', 
                                'MAX', 'MIN', 'PRCP')

In [13]:
dateTable = noaaParquet.select(col('ts'), 
                                     hour(noaaParquet.ts).alias('hour'), 
                                     dayofmonth(noaaParquet.ts).alias('day'), 
                                     weekofyear(noaaParquet.ts).alias('week'),
                                     month(noaaParquet.ts).alias('month'),
                                     year(noaaParquet.ts).alias('year'),
                                     dayofweek(noaaParquet.ts).alias('weekday')) \
                                     .dropDuplicates()

In [14]:
spotsTable = cleanedSpots.select('spotId','SunspotNumber')

We now join the NOAA and sunspot data.

In [15]:
dfSpot_join_Weather = noaaParquet.join(cleanedSpots, 
                (cleanedSpots.SpotDate == noaaParquet.Date) )\
                .drop(cleanedSpots.ts)\
                .sort('Date', ascending=False)

In [16]:
dfSpot_join_Weather.head()

Row(StationId=72522404852, ts=datetime.datetime(2016, 12, 31, 0, 0), Date='2016-12-31', Time='12:0:0 AM', LATITUDE=40.47194, LONGITUDE=-81.42361, ELEVATION=272.8, NAME='NEW PHILADELPHIA CLEVER FIELD, OH US', TEMP=33.3, DEWP=21.8, SLP=1013.3, STP=979.9, VISIB=9.8, WDSP=7.1, MXSPD=15.0, GUST=29.9, MAX=45.0, MIN=23.0, PRCP=0.01, spotId=3215, SpotDate='2016-12-31', Time='12:0:0 AM', SunspotNumber=18.5)
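
The join above matches on the exact date. The sunspot dates shown in this notebook all fall on the last day of a month, so only NOAA readings from those days find a partner, which is likely why the joined table is much smaller than the staged NOAA data. A minimal sketch of the alternative, matching on year and month, is given below in plain Python. The function `attach_sunspot_number` and the value for November 2016 are hypothetical; the value 18.5 for December 2016 is taken from the row above.

```python
from datetime import date

# Monthly sunspot numbers keyed by (year, month).
# 18.5 comes from the joined row above; 21.4 is a made-up illustration value.
sunspots_by_month = {(2016, 11): 21.4, (2016, 12): 18.5}


def attach_sunspot_number(daily_rows, monthly_values):
    """Pair each daily row with the sunspot number of its month.

    Rows from months without a sunspot value are skipped (inner join).
    """
    joined = []
    for station_id, day in daily_rows:
        key = (day.year, day.month)
        if key in monthly_values:
            joined.append((station_id, day, monthly_values[key]))
    return joined


daily = [
    (72522404852, date(2016, 12, 15)),
    (72522404852, date(2016, 12, 31)),
    (72522404852, date(2017, 1, 1)),
]
joined = attach_sunspot_number(daily, sunspots_by_month)
print(joined)
```

In Spark this would correspond to joining on `year(...)` and `month(...)` of both timestamps instead of on the formatted date string. Whether that is the intended behavior depends on the analysis, so treat it as an assumption rather than the notebook's method.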

In [17]:
dfSpotWeather = dfSpot_join_Weather.select(col('ts'), 
                                     date_format(noaaParquet["ts"], 'yyyy-MM-dd h:m:s a').alias('ConditionId'), 
                                     col('StationId'), 
                                     col('spotId'), ) \
                                     .withColumn('ts', monotonically_increasing_id())


In [18]:
dfSpotWeather.printSchema()
dfSpotWeather.count()

root
 |-- ts: long (nullable = false)
 |-- ConditionId: string (nullable = true)
 |-- StationId: long (nullable = true)
 |-- spotId: integer (nullable = true)



36199

In [19]:
dfSpotWeather.toPandas()

| | ts | ConditionId | StationId | spotId |
|---|---|---|---|---|
| 0 | 0 | 2016-12-31 12:0:0 AM | 72522404852 | 3215 |
| 1 | 1 | 2016-12-31 12:0:0 AM | 3088099999 | 3215 |
| 2 | 2 | 2016-12-31 12:0:0 AM | 71862099999 | 3215 |
| 3 | 3 | 2016-12-31 12:0:0 AM | 71120099999 | 3215 |
| 4 | 4 | 2016-12-31 12:0:0 AM | 15333099999 | 3215 |
| 5 | 5 | 2016-12-31 12:0:0 AM | 72254203999 | 3215 |
| 6 | 6 | 2016-12-31 12:0:0 AM | 72511493778 | 3215 |
| 7 | 7 | 2016-12-31 12:0:0 AM | 71808099999 | 3215 |
| 8 | 8 | 2016-12-31 12:0:0 AM | 72203812897 | 3215 |
| 9 | 9 | 2016-12-31 12:0:0 AM | 83899099999 | 3215 |


#### 4.2 Data Quality Checks
 
Run Quality Checks

We check that we have no NULL elements in the table:

In [20]:
# Check that key fields have valid values (no nulls or empty)
dfSpotWeather.createOrReplaceTempView("SpotWeatherTable")
SpotWeather_check = spark.sql("""
    SELECT  COUNT(*)
    FROM SpotWeatherTable
    WHERE ts IS NULL OR ConditionId IS NULL OR StationId IS NULL OR spotId is NULL
""")
SpotWeather_check.show(1)
SpotWeather_check.collect()[0][0]

+--------+
|count(1)|
+--------+
|       0|
+--------+



0
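
The same check can be reused for the other tables by generating the query from a table name and a column list. This is a minimal sketch; `null_check_sql` is a hypothetical helper and not part of the original notebook.

```python
def null_check_sql(table, columns):
    """Build a query that counts rows where any of the given columns is NULL."""
    condition = " OR ".join(f"{column} IS NULL" for column in columns)
    return f"SELECT COUNT(*) AS null_rows FROM {table} WHERE {condition}"


query = null_check_sql(
    "SpotWeatherTable", ["ts", "ConditionId", "StationId", "spotId"]
)
print(query)
```

Assuming the temporary view has been registered as above, the query could be run with `spark.sql(query).collect()[0][0]`, and a result other than 0 would indicate a failed check.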

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

**Fact table**

*dfSpotWeather*
* ts: row identifier generated with `monotonically_increasing_id()` (it overwrites the NOAA timestamp)
* ConditionId: Weather condition identifier from noaaParquet
* StationId: Station id from noaaParquet
* spotId: Spot id from cleanedSpots

**Dim Table**

*stationTable*
* stationId: Station id from noaaParquet
* longitude: Station longitude from noaaParquet
* latitude: Station latitude from noaaParquet
* elevation: Station elevation from noaaParquet
* name: Station name from noaaParquet
    
*dateTable*
* ts: timestamp from noaaParquet
* hour: hour from noaaParquet
* day: day of the month from noaaParquet
* week: week of the year from noaaParquet
* month: month from noaaParquet
* year: year from noaaParquet
* weekday: day of the week from noaaParquet
    
*WeatherTable*
* ConditionId: Weather condition identifier from noaaParquet
* TEMP: Temperature reading from noaaParquet
* DEWP: Dew Point reading from noaaParquet
* SLP: Mean sea level pressure from noaaParquet 
* STP: Mean station pressure from noaaParquet
* VISIB: Mean visibility from noaaParquet
* WDSP: Mean wind speed from noaaParquet
* MXSPD: Maximum sustained wind speed reported from noaaParquet 
* GUST: Maximum wind gust reported from noaaParquet 
* MAX: Maximum temperature reported from noaaParquet 
* MIN: Minimum temperature reported from noaaParquet 
* PRCP: Total precipitation from noaaParquet
    
*spotsTable*
* spotId: Spot identifier from cleanedSpots
* SunspotNumber: Monthly mean total sunspot number from cleanedSpots


### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
  * The database needed to be accessed by 100+ people.

Given the already large data set, Spark is very useful for processing it.

The data should be updated on a yearly basis, because the NOAA data is published as one `tar.gz` archive per year.

I chose NoSQL as the data model because it scales linearly, so a larger data set should pose no problem. The Spark framework allows access for hundreds of people.