# WiDS 2023 - Codeup Submission
## Predict the arithmetic mean of the max and min observed temperature over the next 14 days for specific locations and start dates

## Project Description
Extreme weather events are sweeping the globe, ranging from heat waves, wildfires, and drought to hurricanes, extreme rainfall, and flooding. These events affect agriculture, energy, and transportation, as well as low-resource communities and disaster planning in countries across the globe.

Accurate long-term forecasts of temperature and precipitation are crucial to help people prepare for and adapt to these extreme weather events. Currently, purely physics-based models dominate short-term weather forecasting, but these models have a limited forecast horizon. The availability of meteorological data offers an opportunity for data scientists to improve sub-seasonal forecasts by blending physics-based forecasts with machine learning. Sub-seasonal forecasts for weather and climate conditions (lead times ranging from 15 to more than 45 days) would help communities and industries adapt to the challenges brought on by climate change.

Participants will submit forecasts of temperature and precipitation for one year, competing against the other teams as well as official forecasts from NOAA.

## Project Goals
* Determine which columns to use for our data exploration.
* Explore to find features that indicate the ```mean_temp```.
* Based on the findings predict the ```mean_temp``` for the test_data.
* Submit our findings to the WiDS 2023 competition.

# Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import wrangle as w
import explore as e
import model as m

from importlib import reload

# Acquire

* Data acquired from [Kaggle](https://www.kaggle.com/competitions/widsdatathon2023/data)
* It contained 375734 rows and 245 columns before cleaning
* Each row represents a specific location on a specific start date
* Each column represents a weather/climate measurement

# Prepare

**Prepare Actions:**
* Binned regions (Dry, Temperate, Continental) 
* Binned elevation ('bottom_low', 'top_low', 'mid', 'high')
* Split data into train, validate and test (approx. 60/25/15)
* Scaled continuous variables (min/max scaler)
* Outliers have not been removed for this iteration of the project
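
The split and scaling steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the code in the project's `wrangle` module: the helper names `split_60_25_15` and `min_max_scale`, the random seed, and the toy columns are all assumptions. The scaling parameters are learned from train only, so validate and test do not leak into them.

```python
import numpy as np
import pandas as pd


def split_60_25_15(df, seed=123):
    """Shuffle the rows, then slice them into roughly 60/25/15 train/validate/test."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * 0.60)
    n_validate = round(len(shuffled) * 0.25)
    train = shuffled.iloc[:n_train]
    validate = shuffled.iloc[n_train:n_train + n_validate]
    test = shuffled.iloc[n_train + n_validate:]
    return train, validate, test


def min_max_scale(train, validate, test, cols):
    """Scale cols using the min and max learned from train only."""
    lo = train[cols].min()
    # Guard against dividing by zero when a column is constant in train
    span = (train[cols].max() - lo).replace(0, 1)
    scaled = []
    for part in (train, validate, test):
        part = part.copy()
        part[cols] = (part[cols] - lo) / span
        scaled.append(part)
    return scaled


# Synthetic stand-in for the contest data (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "elevation": rng.uniform(0, 3000, size=200),
    "precip": rng.uniform(0, 50, size=200),
})

train, validate, test = split_60_25_15(toy)
train_s, validate_s, test_s = min_max_scale(train, validate, test, ["elevation", "precip"])
print(len(train), len(validate), len(test))
```

Values in validate and test can fall slightly outside the 0 to 1 range, because their extremes may differ from those seen in train.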

## Data Dictionary:
### Target
| Target | Definition | Data Type | Unit |
| :---- | :---- | :---- | :---- |
| **mean_temp** | the arithmetic mean of the max and min observed temperature over the next 14 days | *float64* | Celsius |

### Features

| Feature Name | Definition | Data Type | Unit |
| :---- | :---- | :---- | :---- |
| region | Köppen-Geiger climate classifications | object | specified regions |
| elevation | elevation | int64 | meters |
| lat| latitude of location (anonymized) | float64 | latitude |
| lon | longitude of location (anonymized) | float64 | longitude |
| startdate | startdate of the 14 day period | object | dates |
| potential_evap| potential evaporation | float64 | mL |
| precip| measured precipitation | float64 | mm |
| barometric_pressure | pressure | float64 | inHg (inches of mercury) |
| all_atmos_precip | precipitable water for entire atmosphere | float64 | mm |
| relative humidity | relative humidity | float64 | percent of atmospheric capacity |
| sea level pressure | sea level pressure at surface | float64 | hectoPascals (hPa), also called millibars |
| geopotential height at 10 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| geopotential height at 100 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| geopotential height at 500 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| geopotential height at 850 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| zonal wind at 250 millibars | east-west wind velocity| float64 | meters per second |
| zonal wind at 925 millibars | east-west wind velocity | float64 | meters per second|
| longitudinal wind at 250 millibars | north-south velocity | float64 | meters per second|
| longitudinal wind at 925 millibars | north-south velocity | float64 |meters per second |

In [2]:
# acquiring data
df = w.get_explore_data()

# prepping data
df = w.get_contest_data(df)

# splitting data into train, validate, and test
train, validate, test = w.split_data(df)

FileNotFoundError: [Errno 2] No such file or directory: 'train_data.csv'

## A brief look at the data

In [None]:
train.head()

# Explore

In [None]:
e.data_distribution(train)

## Is there a difference in the temperatures in different climate regions?

In [None]:
e.region_viz(train)

## Does elevation impact temperature?

In [None]:
e.elevation_bin_viz(train)

In [None]:
e.elevation_bin_dist_viz(train)

In [None]:
e.elevation_bin_kruskal_test(train)
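
The body of `e.elevation_bin_kruskal_test` is not shown here. As a rough sketch of what a Kruskal-Wallis test across elevation bins could look like, the example below uses `scipy.stats.kruskal` on synthetic data. The column name `elevation_bin`, the significance level, and the generated temperatures are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: mean_temp drops as the elevation bin rises (made up for illustration)
rng = np.random.default_rng(1)
bins = ["bottom_low", "top_low", "mid", "high"]
toy = pd.DataFrame({
    "elevation_bin": np.repeat(bins, 100),
    "mean_temp": np.concatenate(
        [rng.normal(loc, 3, size=100) for loc in (20, 15, 10, 5)]
    ),
})

# One array of mean_temp values per elevation bin
groups = [grp["mean_temp"].to_numpy() for _, grp in toy.groupby("elevation_bin")]

# H0: mean_temp has the same distribution in every elevation bin
stat, p_value = stats.kruskal(*groups)
alpha = 0.05  # assumed significance level

print(f"H = {stat:.2f}, p = {p_value:.3g}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Kruskal-Wallis is rank-based, so it does not require the temperatures within each bin to be normally distributed.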

## Is there a correlation between precipitation and mean_temp?

In [None]:
e.precipitation_viz(train)

In [None]:
e.precip_spearmanr_test(train)
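
The helper `e.precip_spearmanr_test` is likewise not shown. A minimal sketch of a Spearman rank correlation test, assuming `scipy.stats.spearmanr` and synthetic precipitation and temperature values invented for this example, might look like this:

```python
import numpy as np
from scipy import stats

# Synthetic data with a monotonic relationship plus noise (made up for illustration)
rng = np.random.default_rng(2)
precip = rng.uniform(0, 50, size=500)
mean_temp = 0.3 * precip + rng.normal(0, 2, size=500)

# H0: there is no monotonic relationship between precip and mean_temp
rho, p_value = stats.spearmanr(precip, mean_temp)
alpha = 0.05  # assumed significance level

print(f"rho = {rho:.2f}, p = {p_value:.3g}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The same pattern would apply to any other continuous feature, such as potential evaporation, by swapping the first argument.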

## Is there a correlation between potential evap and mean_temp?

In [None]:
e.potential_evap_viz(train)

In [None]:
e.potential_evap_spearmanr_test(train)

## Is there a correlation between mean_temp and geopotential height at different pressure levels?

In [None]:
e.geopotential_viz(train)

# Exploration Summary

* We decided to explore the features closest to our target: the 14d features and the categorical variables.
* We saw that all of the continuous variables are correlated with our target variable.
* We looked at our regions and saw that the average mean_temp differs across regions.
* We binned our elevations into 4 bins based on quantiles and saw that the average mean_temp differs across bins.
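
Quantile-based binning of elevation, as described in the summary, can be sketched with `pandas.qcut`. The bin labels come from the Prepare section; the column name `elevation_bin` and the synthetic elevations are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic elevations in meters (made up for illustration)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"elevation": rng.uniform(0, 3000, size=1000)})

# Four quantile-based bins, so each bin holds about a quarter of the rows
labels = ["bottom_low", "top_low", "mid", "high"]
toy["elevation_bin"] = pd.qcut(toy["elevation"], q=4, labels=labels)

print(toy["elevation_bin"].value_counts().sort_index())
```

Real elevation data may contain many repeated values, in which case quantile edges can coincide and `pd.qcut` needs `duplicates="drop"` together with a matching number of labels.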

# Features we are moving to modeling with

| Feature Name | Definition | Data Type | Unit |
| :---- | :---- | :---- | :---- |
| region | Köppen-Geiger climate classifications | object | specified regions |
| elevation | elevation | int64 | meters |
| lat| latitude of location (anonymized) | float64 | latitude |
| lon | longitude of location (anonymized) | float64 | longitude |
| startdate | startdate of the 14 day period | object | dates |
| potential_evap| potential evaporation | float64 | mL |
| precip| measured precipitation | float64 | mm |
| barometric_pressure | pressure | float64 | inHg (inches of mercury) |
| all_atmos_precip | precipitable water for entire atmosphere | float64 | mm |
| relative humidity | relative humidity | float64 | percent of atmospheric capacity |
| sea level pressure | sea level pressure at surface | float64 | hectoPascals (hPa), also called millibars |
| geopotential height at 10 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| geopotential height at 100 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| geopotential height at 500 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| geopotential height at 850 millibars | actual height of a pressure surface above mean sea-level | float64 | meters |
| zonal wind at 250 millibars | east-west wind velocity| float64 | meters per second |
| zonal wind at 925 millibars | east-west wind velocity | float64 | meters per second|
| longitudinal wind at 250 millibars | north-south velocity | float64 | meters per second|
| longitudinal wind at 925 millibars | north-south velocity | float64 |meters per second |

In [None]:
# start with all of the columns as candidate features
drivers = list(train.columns)
# drop startdate and the target variable
drivers.remove('startdate')
drivers.remove('mean_temp')

# Modeling

In [None]:
# prep data for modeling
X_train, y_train, X_validate, y_validate, X_test, y_test = m.prep_for_model(train, validate, test, 'mean_temp', drivers)

## Mean Baseline RMSE

In [None]:
# show baseline model
m.baseline_models(y_train, y_validate)
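
For reference, a mean baseline predicts the training mean for every row and scores that constant prediction with RMSE. The sketch below is an assumption about what `m.baseline_models` computes, shown on a few made-up temperatures rather than the contest data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up target values for illustration
y_train = np.array([10.0, 12.0, 14.0, 16.0])
y_validate = np.array([11.0, 15.0])

# The baseline is learned from train only, then reused on validate
baseline = y_train.mean()
train_rmse = rmse(y_train, np.full_like(y_train, baseline))
validate_rmse = rmse(y_validate, np.full_like(y_validate, baseline))

print(f"baseline prediction: {baseline:.2f}")
print(f"train RMSE: {train_rmse:.3f}, validate RMSE: {validate_rmse:.3f}")
```

Any model worth keeping should score a lower RMSE than this constant prediction on validate.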

# Comparing Models

In [None]:
m.regression_models(X_train, y_train, X_validate, y_validate)

## Best Model on Test

In [None]:
m.best_model(X_train, y_train, X_validate, y_validate, X_test, y_test)

### Modeling Summary

* We looked at three different kinds of models.
* The quadratic model performed best on both train and validate.
* We used that model to predict on test, and our RMSE remained at **1.27**.
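
The notebook does not show how the quadratic model is built inside `m.regression_models`. One common construction, assumed here, is a linear regression fit on degree-2 polynomial features using scikit-learn. The synthetic data and variable names below are illustrative only and do not reproduce the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship (made up for illustration)
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 2))
y = 5 + 2 * X[:, 0] - 3 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=300)
X_train, y_train = X[:200], y[:200]
X_validate, y_validate = X[200:], y[200:]

# Expand the features to degree 2; fit the expansion on train only
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_sq = poly.fit_transform(X_train)
X_validate_sq = poly.transform(X_validate)

# An ordinary linear regression on the expanded features is the quadratic model
model = LinearRegression().fit(X_train_sq, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


train_rmse = rmse(y_train, model.predict(X_train_sq))
validate_rmse = rmse(y_validate, model.predict(X_validate_sq))
print(f"train RMSE: {train_rmse:.3f}, validate RMSE: {validate_rmse:.3f}")
```

Comparing train and validate RMSE side by side, as done here, is a quick check for overfitting before the chosen model is scored once on test.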

# Conclusions

### Exploration

* We saw that all of the continuous variables are correlated with our target variable.
* We looked at our regions and saw that the average mean_temp differs across regions.
* We binned our elevations into 4 bins based on quantiles and saw that the average mean_temp differs across bins.

### Modeling

* We looked at three different kinds of models.
* We saw that our quadratic model performed best on both train and validate.
* We used that model to predict on test, and our RMSE remained at **1.27**.

### Recommendations

* We can use this model to predict mean_temp over the next 14 days.

### Next Steps
* We want to look into creating a separate model for each region.
* We want to continue looking at other features in the data set to see if they are correlated with the target variable.
* We want to explore other model types that might improve on our current RMSE.