# Simulation of data relating to weather at Dublin Airport
* [Introduction](#Introduction)
* [What is the dataset?](#What-is-the-dataset?)
* [Setup](#Setup)
* [Examination of the dataset](#Examination-of-the-dataset)
    * [Description of dataset](#Description-of-dataset)
    * [Skewness and kurtosis of dataset](#Skewness-and-kurtosis-of-dataset)
    * [Correlation](#Correlation)
    * [Plotting statistics](#Plotting-statistics)
    * [Discussion of the dataset](#Discussion-of-the-dataset)
* [Simulation of new data](#Simulation-of-new-data)
* [Further Analysis](#Further-Analysis)
* [Bibliography](#Bibliography)

## Introduction
This notebook has two aims: to review a dataset, and to simulate data that resembles it. The project (and notebook) is therefore divided into two sections. The first section reviews the chosen dataset, in this case the weather at Dublin Airport. This review includes a statistical analysis of the data and a discussion of what the statistics mean. The second section attempts to simulate similar data, based on the information gleaned in the first section.

Code is used throughout the notebook to cleanse the data, provide the statistical analysis, and ultimately attempt to simulate the data. Some of the generated data is random, so the generated values will change from run to run in a [pseudorandom](https://www.random.org/randomness/) manner.
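
As a minimal sketch of what pseudorandom means in practice, the snippet below shows that NumPy generators created with the same seed produce identical draws, while an unseeded generator does not repeat between runs. The seed value and the distribution parameters are illustrative assumptions; this section does not say whether the notebook seeds its own generator.

```python
import numpy as np

# Hypothetical seed, chosen only for illustration.
SEED = 2018

# Two generators created with the same seed...
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)

# ...produce exactly the same sequence of draws.
draws_a = rng_a.normal(loc=10.0, scale=5.0, size=5)
draws_b = rng_b.normal(loc=10.0, scale=5.0, size=5)

# An unseeded generator takes fresh entropy from the operating system,
# so these values differ each time the cell is run.
unseeded = np.random.default_rng().normal(loc=10.0, scale=5.0, size=5)

print(np.array_equal(draws_a, draws_b))  # True
```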
## What is the dataset?
The dataset that was chosen is the Dublin Airport weather records from the 1st January, 2016 to the 31st December, 2018. This data was sourced from the [Government of Ireland data website](https://data.gov.ie/dataset/dublin-airport-hourly-weather-station-data/resource/bbb2cb83-5982-48ca-9da1-95280f5a4c0d?inner_span=True). The source dataset consists of readings of various weather attributes recorded every hour from the 1st January, 1989 to the 31st December, 2018. Each row in the dataset is made up of the following columns:

* Rain: the amount of precipitation to have fallen within the last hour. Measured in millimetres (mm).
* Temp: the air temperature at the point of record. Measured in degrees Celsius (°C).
* Wetb: the wet bulb temperature at the point of record. Measured in degrees Celsius (°C).
* Dewpt: the dew point air temperature at the point of record. Measured in degrees Celsius (°C).
* Vappr: the vapour pressure of the air at the point of record. Measured in hectopascals (hPa).
* Rhum: the relative humidity for the given air temperature. Measured in percent (%).
* Msl: the mean sea level pressure. Measured in hectopascals (hPa).
* Wdsp: the mean hourly wind speed. Measured in knots (kt).
* Wddir: the predominant wind direction. Measured in degrees (°).
* Ww: the SYNOP code for present weather.
* W: the SYNOP code for past weather.
* Sun: the duration of sunshine in the last hour. Measured in hours (h).
* Vis: the visibility, or air clarity. Measured in metres (m).
* Clht: the cloud ceiling height. Measured in hundreds of feet (100 ft).
* Clamt: the amount of cloud. Measured in oktas.

There are also a number of indicators for some of the data recorded. Given the timespan of the data (30 years), the number of values in each row (up to 21), and the hourly recording, the dataset is very large, comprising nearly 11,000 days, more than 262,000 rows, and 6,300,000 data points.
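
The day and row counts quoted above can be checked with a little date arithmetic. The sketch below assumes exactly one row per hour with no gaps in the record, which may not hold for the real file.

```python
from datetime import date

# First and last days covered by the source dataset.
first_day = date(1989, 1, 1)
last_day = date(2018, 12, 31)

# Number of calendar days, counting both endpoints.
n_days = (last_day - first_day).days + 1

# One reading per hour (assumption: no missing hours).
n_rows = n_days * 24

print(n_days)  # 10957
print(n_rows)  # 262968
```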

The retrieved dataset is too large for the proposed simulation, so it has been reduced in size. First, the data has been limited to the years 2016 to 2018 inclusive. Second, the recorded variables have been reduced to rain, temperature, relative humidity, sun, and visibility. Finally, the number of rows has been reduced by amalgamating the hourly records into days. The rainfall levels and hours of sunshine have been added together to give a total for each day. The temperature, relative humidity, and visibility have been averaged over each day. This has reduced the dataset to 1,097 rows and 6 columns.
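
The hourly-to-daily amalgamation described above can be sketched with pandas as follows. This is an illustration on three days of synthetic data, not the script used to build the reduced file: the lowercase column names and all of the values are assumptions.

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly readings (illustrative values, not real data).
hours = pd.date_range("2016-01-01", periods=72, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame(
    {
        "rain": 0.5,              # mm fallen in each hour
        "temp": np.arange(72.0),  # °C, a simple ramp so the daily means differ
        "rhum": 80.0,             # %
        "sun": 0.25,              # hours of sunshine in each hour
        "vis": 20000.0,           # m
    },
    index=hours,
)

# Collapse each calendar day into one row:
# totals for rain and sun, averages for the other variables.
daily = hourly.resample("D").agg(
    {
        "rain": "sum",
        "sun": "sum",
        "temp": "mean",
        "rhum": "mean",
        "vis": "mean",
    }
)

print(daily)
```

Summing suits rain and sunshine because they accumulate over the day, whereas temperature, humidity, and visibility are point-in-time readings for which a daily mean is the more natural summary.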

Both the original and new datasets are available in this repository.

### Why was this dataset chosen?

The dataset was chosen for a number of reasons. Primarily, it was chosen as it provides a large volume of data with interrelated variables. Some of these variables may be positively or negatively correlated with each other. This stands to reason: the number of hours of sunshine and the millimetres of rain that have fallen, for example, would normally be negatively correlated. Secondly, the dataset relates to the weather in Ireland, or at least Dublin. As the weather is a favourite topic of conversation, the dataset seemed appropriate.
## Setup
Before the analysis of the dataset can begin, it is necessary to import the data into a dataframe. This allows various statistics about the dataset to be determined, and provides a basis for running the simulation.

The script below will import the data, and set it up in a dataframe.

In [1]:
import numpy as np
import pandas as pd
from datetime import date
import seaborn as sns
import matplotlib.pyplot as plt

url = 