# MSA 2022 Phase 2 Data Science

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Part 1: Exploratory Data Analysis

### Raw Dataframe

```
Weather Data
        station             valid  tmpc  dwpc   relh  sknt  gust  peak_wind_drct
0          NZAA  2015-01-01 00:00  21.0  11.0  52.77  15.0   NaN             NaN
1          NZAA  2015-01-01 00:30  21.0  10.0  49.37  16.0   NaN             NaN
2          NZAA  2015-01-01 01:00  21.0  12.0  56.38  16.0   NaN             NaN
3          NZAA  2015-01-01 01:30  21.0  13.0  60.21  16.0   NaN             NaN
4          NZAA  2015-01-01 02:00  21.0  12.0  56.38  16.0   NaN             NaN
...         ...               ...   ...   ...    ...   ...   ...             ...
103708     NZAA  2020-12-30 21:30  19.0  14.0  72.74   5.0   NaN             NaN
103709     NZAA  2020-12-30 22:00  19.0  14.0  72.74   6.0   NaN             NaN
103710     NZAA  2020-12-30 22:30  20.0  14.0  68.35   6.0   NaN             NaN
103711     NZAA  2020-12-30 23:00  20.0  14.0  68.35   7.0   NaN             NaN
103712     NZAA  2020-12-30 23:30  22.0  14.0  60.44   6.0   NaN             NaN
```

### Average

```
tmpc              15.811503
dwpc              12.115772
relh              79.782307
sknt               8.919029
gust              30.962594
peak_wind_drct          NaN
```

### Standard Deviation

```
tmpc               4.235197
dwpc               3.738005
relh              12.562199
sknt               5.348379
gust               6.319510
peak_wind_drct          NaN
```

### Percentile (50%, Median)

```
tmpc              16.00
dwpc              12.00
relh              81.99
sknt               8.00
gust              31.00
peak_wind_drct      NaN
```
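Called with no arguments, `quantile()` returns the 50th percentile (the median), which gives one value per column as shown above. To get the 25% and 75% bounds, the quantile levels have to be passed explicitly. The following is a minimal sketch on a made-up `tmpc` column; the values and the name `sample` are illustrative assumptions, not data from the real file.

```python
import pandas as pd

# Hypothetical values, chosen only to make the arithmetic easy to follow.
sample = pd.DataFrame({"tmpc": [10.0, 12.0, 14.0, 16.0, 18.0]})

# Default call: q=0.5, i.e. the median of each column.
median = sample.quantile()

# Explicit quantile levels give the 25th and 75th percentiles.
quartiles = sample.quantile([0.25, 0.75])

# Interquartile range: the spread between the two bounds.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(median)
print(quartiles)
print(iqr)
```

The same call should work unchanged on the numeric columns of the real dataframe.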

### Correlation Plot

![](Images/part1-correlation.png)

### Line Plot

![](Images/part1-line-tmpc.png)

### Comments

What I noticed in the dataset from the ```weather-data.csv``` file is that the ```gust``` column contains very few values and the ```peak_wind_drct``` column contains none at all. When I process the data, I will therefore remove the ```gust``` and ```peak_wind_drct``` columns.
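One way to check how sparse each column is, and then drop the sparse ones, is sketched below. It runs on a tiny made-up frame that only mimics the column names of the dataset; the values and the names `sample`, `missing_fraction` and `cleaned` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same column names as the dataset;
# the values are illustrative, not taken from weather-data.csv.
sample = pd.DataFrame({
    "tmpc": [21.0, 21.0, 20.0, 19.0],
    "sknt": [15.0, 16.0, 6.0, 7.0],
    "gust": [np.nan, 30.0, np.nan, np.nan],
    "peak_wind_drct": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (1.0 means the column is empty).
missing_fraction = sample.isna().mean()

# Drop the two sparse columns named in the comments above.
cleaned = sample.drop(columns=["gust", "peak_wind_drct"])

print(missing_fraction)
print(cleaned.columns.tolist())
```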

Looking at the Correlation Plot for the dataset, there is a positive correlation between the ```tmpc``` and ```dwpc``` columns, but a roughly negative correlation when those two columns are compared with the other columns. The ```peak_wind_drct``` column noticeably lacks any data, as stated above.
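The numbers behind a heatmap like this can also be read directly from the correlation matrix. The sketch below uses a small made-up frame (the values and the name `sample` are assumptions, not real measurements) to show a positive pair, a negative pair, and why a column with no data produces an empty row and column.

```python
import numpy as np
import pandas as pd

# Made-up values: dew point rises with temperature, humidity falls.
sample = pd.DataFrame({
    "station": ["NZAA"] * 4,  # text column, excluded from the correlation
    "tmpc": [10.0, 15.0, 20.0, 25.0],
    "dwpc": [8.0, 11.0, 13.0, 16.0],
    "relh": [88.0, 77.0, 64.0, 57.0],
    "peak_wind_drct": [np.nan] * 4,  # no data at all
})

# Pearson correlation over the numeric columns only.
corr = sample.select_dtypes(include="number").corr()

print(corr)
```

In this toy example `tmpc` and `dwpc` correlate positively, `tmpc` and `relh` negatively, and every entry involving `peak_wind_drct` is `NaN` because there are no values to compare.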

Looking at the Line Plot of the ```valid``` times against the ```tmpc``` column, the environmental temperature rises and falls as time goes on. The temperature peaks at about the beginning of each year and reaches its trough at about the middle of the year. This coincides with the seasons in the southern hemisphere, which is consistent with the station ```NZAA``` (the ICAO code for Auckland Airport in New Zealand), even though the data comes from a source in the northern hemisphere (Iowa State University, which is in the USA).
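The seasonal pattern can be made more explicit by averaging the temperature per calendar month. The sketch below does this on a synthetic one-year series whose temperature is constructed to peak in mid-January; the series, the cosine shape, and the names `frame` and `monthly_mean` are assumptions for illustration and are not derived from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic daily timestamps for one year.
dates = pd.date_range("2015-01-01", "2015-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()

# Synthetic temperature: a cosine that peaks on day 15 (mid-January),
# imitating a southern-hemisphere summer.
tmpc = 16.0 + 4.0 * np.cos(2 * np.pi * (day_of_year - 15) / 365)

frame = pd.DataFrame({"valid": dates, "tmpc": tmpc})

# Mean temperature for each calendar month (1 = January, 12 = December).
monthly_mean = frame.groupby(frame["valid"].dt.month)["tmpc"].mean()

print(monthly_mean.idxmax(), monthly_mean.idxmin())
```

On the real dataframe the same `groupby` would need the `valid` column converted with `pd.to_datetime` first.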

In [None]:
data = pd.read_csv("MSA2022-Phase2-ProjectV2/Data-Science/weather-data.csv") # Raw Dataframe (forward slashes avoid invalid escape sequences in the path)
numeric_columns = ["tmpc", "dwpc", "relh", "sknt", "gust", "peak_wind_drct"]
data_average = data[numeric_columns].mean() # Average
data_std = data[numeric_columns].std() # Standard Deviation
data_quantile = data[numeric_columns].quantile() # Median: quantile() defaults to q=0.5 (the 50th percentile)

corr = data.select_dtypes(include="number").corr() # Correlation Plot (numeric columns only; "station" and "valid" are text)
sns.heatmap(corr, 
    xticklabels=corr.columns,
    yticklabels=corr.columns)
plt.show()

data["valid"] = pd.to_datetime(data["valid"]) # Parse the timestamps so the x-axis is a real time axis
data.plot.line(x="valid", y="tmpc") # Line Graph between "valid" (time) column and "tmpc" (environmental temperature) column
plt.show()

## Part 2: Data Processing