**Project Start Date** <br>
17th December 2020 <br>

**Data Sources** <br>
https://www.kaggle.com/usaf/world-war-ii/notebooks <br> 
https://www.kaggle.com/smid80/weatherww2/data <br>

**Background** <br>
Aerial Bombing Operations in WW2 - Bombing operations data <br>

This dataset consists of digitized paper mission reports from WWII. Each record includes the date, conflict, geographic location, and other data elements to form a live-action sequence of air warfare from 1939 to 1945. The records include U.S. and Royal Air Force data, in addition to some Australian, New Zealand and South African air force missions.

Weather Conditions in WW2 (Weather Stations / Weather Conditions) <br>
The dataset contains information on weather conditions recorded on each day at various weather stations around the world. Information includes precipitation, snowfall, temperatures, wind speed and whether the day included thunderstorms or other poor weather conditions.

**Aim of this project** <br>
Implement a GLM that predicts the daily maximum temperature from the daily minimum temperature

**Analysis regarding Data Quality** <br>
Understanding of the sampling procedure 
- Since our project team did not participate in planning the study or data collection, it is possible that we are missing crucial context which could render our conclusions invalid. <br>

Potential biases <br> 
Real-world actions that generated the data you inherited <br>

**Objectives & Hypotheses to Test (max. 10)** <br>
<u>Exploratory Analysis</u>
- High level descriptive statistics 
- Do any values look to be recorded to accommodate missing values? e.g. 999, 9999 etc.
- Assessment of feature distributions
    https://towardsdatascience.com/ridgeline-plots-the-perfect-way-to-visualize-data-distributions-with-python-de99a5493052
- Assessment of feature relationships:
    - Is there a relationship between the daily minimum and maximum temperature (TimeSeries Analysis)?
    - It is expected that average temperatures are colder in winter months than summer months
    - It is expected that more snowfall occurs in the winter months (for northern hemisphere regions)
    - It is expected that more precipitation occurs in the winter months (for northern hemisphere regions)
    - It is expected that lower temperatures correlate with higher snowfall and precipitation 
    - It is expected that higher elevations above sea level have greater precipitation
    - It is expected that the accuracy of recordings based on stations may not be uniform (outlier detection)
<br>
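
The check for placeholder values in the list above could be sketched as follows. This is a minimal sketch on a toy frame: the helper `count_sentinels` and the candidate values are hypothetical, chosen for illustration rather than taken from the dataset documentation.

```python
import pandas as pd

def count_sentinels(df, sentinels=(999, 9999, -999)):
    """Count, per numeric column, how many entries equal a suspected placeholder value."""
    numeric = df.select_dtypes(include="number")
    counts = {value: numeric.eq(value).sum() for value in sentinels}
    # Rows are column names, columns are the candidate sentinel values
    return pd.DataFrame(counts)

# Toy data (assumed values): ELEV with 9999 standing in for "unknown"
toy = pd.DataFrame({"ELEV": [12, 9999, 340, 9999],
                    "MAX": [71.0, 65.0, 999.0, 80.0]})
sentinel_counts = count_sentinels(toy)
print(sentinel_counts)
```

Columns with a non-zero count are candidates for conversion to missing values before any statistics are computed.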

**Statistical Model/Machine Learning Applications**
- Create a dummy model (predict the average temperature for that month/quarter)
- Explain the train/test split
- Predict the maximum temperature given the minimum temperature (GLMs and Bayesian versions)
- Explain appropriate error metric
- Explain class balance and any required action
- Explain what features are developed and transformations applied
- Explain if the model is exhibiting high bias or high variance and how this can be improved
    - Plot learning curves to deduce high bias/high variance and conclude what means could be applied to solve these issues
- Explain where the model seems to perform poorly - In what situations does the model make mistakes?
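
To make the comparison against a dummy model concrete, here is a minimal sketch on synthetic data. An ordinary least-squares line fitted with `np.polyfit` (equivalent to a Gaussian GLM with an identity link) stands in for the GLM, and the dummy model is simplified to the overall training mean rather than a per-month or per-quarter mean. The synthetic temperatures, the random 80/20 split and the choice of RMSE are all assumptions for illustration, not results from the weather data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the MIN and MAX columns (deg F)
min_temp = rng.uniform(20, 80, size=500)
max_temp = 1.0 * min_temp + 15 + rng.normal(0, 4, size=500)

# Random 80/20 train/test split
idx = rng.permutation(500)
train, test = idx[:400], idx[400:]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Dummy model: always predict the training mean of MAX
baseline_pred = np.full(test.shape, max_temp[train].mean())

# Linear fit MAX ~ MIN on the training rows only
slope, intercept = np.polyfit(min_temp[train], max_temp[train], deg=1)
linear_pred = slope * min_temp[test] + intercept

baseline_rmse = rmse(max_temp[test], baseline_pred)
linear_rmse = rmse(max_temp[test], linear_pred)
print(f"baseline RMSE: {baseline_rmse:.2f}, linear RMSE: {linear_rmse:.2f}")
```

Any candidate model should beat the dummy model on the held-out rows; if it does not, the added complexity is not justified.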

**Additional Learning notes from Reviewing 3 other Kaggle Notebooks** <br> 

**Next steps** <br>

**References**


In [None]:
# Package Requirements
import os

import warnings
warnings.filterwarnings('ignore')

# Data Wrangling
import pandas as pd
import numpy as np
import datetime 

# Data Exploration and Visualisation
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
os.getcwd()

In [None]:
aerial_bombing_data = pd.read_csv('/Users/Rej1992/Documents/GitHub/RegressionModels/data/01_raw/ww2_boming_operations.csv')
weather_summary = pd.read_csv('/Users/Rej1992/Documents/GitHub/RegressionModels/data/01_raw/WeatherTempPrediction.csv')
weather_station_location = pd.read_csv('/Users/Rej1992/Documents/GitHub/RegressionModels/data/01_raw/WeatherStationLocations.csv')

data_list = [aerial_bombing_data, weather_summary, weather_station_location]

In [None]:
# State the assumptions you’re being forced to make.
# Write up caveat notes to be included in the appendix of your final report
# Write cautionary notes that warn the decision-maker (and your other readers) that conclusions from the study will 
# need to be downgraded due to potential data issues

In [None]:
for i in data_list:
    print("Dataframe Dimensions")
    print(i.shape)
    print("")

    print("Dataframe Columns and respective types")
    print(i.dtypes)
    print("")

    print("Percentage of Missing Data")
    print(i.isnull().sum() * 100 / len(i))
    
    print("")

## Data Analysis
- Data visualization of features
- Handling categorical data
- Normalization and standardization of features
- Dimensionality Reduction

In [None]:
# Investigate options to link the dataframes with a unique key: weather_summary and weather_station_location look to be 
# connected via STA and WBAN respectively 
def uncommon_elements(list1, list2):
    ## Add something clever so the look up is always against the set with the largest number of unique records
    
    return [element for element in list2 if element not in list1]

STA = set(weather_summary.STA)
print(len(STA))

WBAN = set(weather_station_location.WBAN)
print(len(WBAN))

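# Compare the column dtypes (comparing type() of two Series is always True)
print('Keys are of the same data type: ', weather_summary.STA.dtype == weather_station_location.WBAN.dtype)

# uncommon_elements returns the WBAN values that do not appear in STA
print('Stations in WBAN that are missing from STA: ', uncommon_elements(STA, WBAN))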

In [None]:
combined_data = pd.merge(weather_summary, 
                         weather_station_location, 
                         how = 'inner', # takes care of only keeping records in both sets
                         left_on='STA',
                         right_on='WBAN')

print(len(combined_data))

## Columns for Combined Data
**STA**
- STA: represents the Weather Station
- Not all stations report over the same period or at the same frequency 

**Date** 
- Date has been split into DA, MO and YR respectively; note that the century has been dropped when recording YR

**Precip** 
- Precipitation in mm. This consists of numerical values, plus 'T' for 16,754 entries. This may be a data-collection artefact, although 'T' conventionally denotes a trace amount in weather records; either way, impute Precip = 0 for these cases

**MaxTemp and MinTemp** 
- These are the Fahrenheit MAX/MIN readings converted to Celsius and recorded to 6 decimal places; MeanTemp is the corresponding Celsius average. The Celsius values span a smaller numeric range than the Fahrenheit records, so patterns may be easier to see in the Fahrenheit columns 

**MEA** 
- This is the mean of the Fahrenheit MAX / MIN columns, rounded to 1 d.p. Drop this column and calculate the exact mean value instead

**Snowfall**
- This looks to measure the amount of snow that fell (possibly in mm). The units are not obvious, so there are two options
- Either assume the units are centimetres, after researching the data further, OR normalise all the columns so they are on the same scale

**SNF**
 - After research it is unclear what SNF relates to; it seems to have a range of 0 - 3.4 (Agree to remove)

**PRCP**
- This column looks to be Precip rescaled by a factor of 1/25.4, i.e. PRCP = Precip / 25.4, which converts mm to inches (Agree to remove)

**TSHDSBRSGF**
- This is a duplicate of PoorWeather so can be removed

**WBAN**
- Same as STA, representing the Weather Station
- Not all Weather Stations are located in the USA (there are 63 unique STATE/COUNTRY ID values)
- This will be duplicated due to the merge so can be removed 

**NAME**
- This is the name of the weather station. It has a many:1 relationship with State/Country ID i.e. more than one station can be present per country 

**STATE/COUNTRY ID**
- This is the location of the weather station at state/country level

**LAT**
- This is the decimal latitude in string format 

**LON**
- This is the decimal longitude in string format 

**ELEV**
- Explanation not given - Expected to be the elevation above sea level 
- Note that an elevation of 9999 means unknown

**Latitude**
- This is the decimal latitude calculated from the LAT/LON provided (use this rather than the string version, as it is in a numeric format suitable for ML)

**Longitude**
- This is the decimal longitude calculated from the LAT/LON provided (use this rather than the string version, as it is in a numeric format suitable for ML)
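
The PRCP note above suggests a general check for columns that are rescaled copies of one another. A minimal sketch, using toy values and a hypothetical helper `is_scaled_copy` (neither comes from the dataset itself):

```python
import numpy as np
import pandas as pd

def is_scaled_copy(a, b, factor, tol=1e-2):
    """Return True if b is approximately a * factor wherever both values are present."""
    both = pd.concat([a, b], axis=1).dropna()
    return bool(np.allclose(both.iloc[:, 1], both.iloc[:, 0] * factor, atol=tol))

# Toy values: precipitation in mm and the same readings in inches (1 inch = 25.4 mm)
toy = pd.DataFrame({"Precip": [0.0, 2.54, 25.4, 12.7],
                    "PRCP":   [0.0, 0.10, 1.00, 0.50]})

prcp_is_inches = is_scaled_copy(toy["Precip"], toy["PRCP"], factor=1 / 25.4)
print(prcp_is_inches)
```

On the real data, Precip would first need its 'T' and blank entries converted to numbers before a comparison like this can run.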

### Data Cleaning - Remove duplicate Rows & Columns
- Remove all columns that exhibit over 90% missing values
- Remove Celsius columns 'MaxTemp', 'MinTemp' and 'MeanTemp', and the rounded mean 'MEA'
- Remove duplicated/scaled columns: 'PRCP', 'TSHDSBRSGF', 'WBAN'
- Remove 'SNF' as its meaning is unclear
- Remove PoorWeather for the initial analysis as it is unclear how the data has been recorded 
- Remove LAT as string format
- Remove LON as string format
- Remove those columns with zero variance
- Remove duplicated rows

In [None]:
# Handling missing data - Remove any columns with over 90% missing data 
def remove_missing_values(data, threshold_limit = 0.9):
    
    return data.loc[:, data.isnull().sum() < threshold_limit*data.shape[0]]

combined_data = remove_missing_values(combined_data)

# Remove additional columns based on explanation above
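# errors='ignore' skips any column already removed by the missing-value filter;
# reassigning (rather than inplace=True) avoids modifying a slice of the merged frame
combined_data = combined_data.drop(columns=['MaxTemp', 
                                            'MinTemp', 
                                            'MeanTemp', 
                                            'MEA', 
                                            'PoorWeather', 
                                            'TSHDSBRSGF', 
                                            'PRCP', 
                                            'SNF', 
                                            'WBAN',
                                            'LAT', 
                                            'LON'], errors='ignore')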

# Data Quality Expectations: Test for zero variance 
combined_data = combined_data.loc[:, combined_data.nunique() != 1]

# Data Quality Expectations: Duplicated Records
print('Duplicated rows for index: ', combined_data[combined_data.duplicated()].index)
#print(len(combined_data))
combined_data = combined_data.drop_duplicates()
#print(len(combined_data))

### Data Cleaning - Correct Data Types and Imputation of Missing Values

In [None]:
# Correct Date
weather_summary.Date = pd.to_datetime(weather_summary['Date'], format = '%Y-%m-%d')

# Correct Object Types
weather_summary.STA = weather_summary['STA'].astype('object')
weather_summary.YR = weather_summary['YR'].astype('object')
weather_summary.MO = weather_summary['MO'].astype('object')
weather_summary.DA = weather_summary['DA'].astype('object')

# Deal with Missing/Inaccurate values and correct data types 
weather_summary.Precip = np.where((weather_summary.Precip == 'T') | (weather_summary.Precip == ' '), 0, weather_summary.Precip)
weather_summary.Precip = weather_summary['Precip'].astype('float')

weather_summary.SNF = np.where((weather_summary.SNF == 'T') | (weather_summary.SNF == ' '), 0, weather_summary.SNF)
weather_summary.SNF = weather_summary['SNF'].astype('float')

# Check if any features are transformations of each other 

### Exploratory Data Analysis 
**Hypothesis & Expectations to Test**
- Is this a Global study? What are the locations associated with the experiment?
- High level descriptive statistics 
- Do any values look to be recorded to accommodate missing values? e.g. 999, 9999 etc.
- Assessment of feature distributions
- Assessment of feature relationships:
    - It is expected that average temperatures are colder in winter months than summer months
    - It is expected that more snowfall occurs in the winter months (for northern hemisphere regions)
    - It is expected that more precipitation occurs in the winter months (for northern hemisphere regions)
    - It is expected that lower temperatures correlate with higher snowfall and precipitation 
    - It is expected that higher elevations above sea level have greater precipitation
    - It is expected that the accuracy of recordings based on stations may not be uniform (outlier detection)
- Time Series Analysis (max. 3 graphs/analyses)
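
The expectation that recording accuracy varies by station can be probed with a per-station outlier flag. The sketch below uses the common 1.5 x IQR rule computed within each station; the rule, the helper name `flag_outliers_by_station` and the toy readings are assumptions for illustration, not the method chosen for this project.

```python
import pandas as pd

def flag_outliers_by_station(df, value_col, station_col="STA", k=1.5):
    """Flag rows outside [Q1 - k*IQR, Q3 + k*IQR], with quartiles computed per station."""
    grouped = df.groupby(station_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)

# Toy readings for two stations; 140 is an implausible MAX for station 10001
toy = pd.DataFrame({
    "STA": [10001] * 6 + [10002] * 6,
    "MAX": [70, 72, 71, 69, 73, 140, 30, 31, 29, 32, 30, 31],
})
toy["MAX_outlier"] = flag_outliers_by_station(toy, "MAX")
print(toy[toy["MAX_outlier"]])
```

Computing the bounds per station matters here because a reading that is normal for a tropical station could be extreme for an arctic one.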

**References** <br>
https://towardsdatascience.com/powerful-eda-exploratory-data-analysis-in-just-two-lines-of-code-using-sweetviz-6c943d32f34 <br>
https://towardsdatascience.com/ridgeline-plots-the-perfect-way-to-visualize-data-distributions-with-python-de99a5493052 <br>
https://towardsdatascience.com/all-you-want-to-know-about-preprocessing-data-preparation-b6c2866071d4 <br>
https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

In [None]:
combined_data

In [None]:
pd.crosstab(combined_data['NAME'], combined_data['STATE/COUNTRY ID'])

In [None]:
# Write clean data to 02_intermediate data folder
#weather_summary.to_csv('/Users/Rej1992/Documents/GitHub/RegressionModels/data/02_intermediate/data_cleaning.csv')

### Data Cleaning - Feature Engineering
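
One planned feature in this section is a sine/cosine encoding of the seasonal cycle. A minimal sketch, assuming the month is held in the `MO` column as values 1 to 12; the helper `add_cyclical_month` and the output column names `MO_sin` and `MO_cos` are hypothetical:

```python
import numpy as np
import pandas as pd

def add_cyclical_month(df, month_col="MO"):
    """Return a copy of df with sine/cosine encodings of the month (1-12)."""
    out = df.copy()
    # Cast to float because MO may be stored with object dtype
    angle = 2 * np.pi * (out[month_col].astype(float) - 1) / 12
    out["MO_sin"] = np.sin(angle)
    out["MO_cos"] = np.cos(angle)
    return out

toy = pd.DataFrame({"MO": [1, 4, 7, 10, 12]})
encoded = add_cyclical_month(toy)
print(encoded.round(3))
```

With this encoding December sits next to January on the unit circle, which a raw month number from 1 to 12 does not capture.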

In [None]:
#weather_summary = pd.read_csv('/Users/Rej1992/Documents/GitHub/RegressionModels/data/02_intermediate/data_cleaning.csv')

In [None]:
# Calculate the average based on the fahrenheit columns
weather_summary['MeanTemp_F'] = (weather_summary['MAX'] + weather_summary['MIN'])/2

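# Create a simpler binary feature for snowfall occurrence
# Missing or non-numeric entries are treated as no snowfall; a recorded 0 also counts as no snowfall
snowfall_numeric = pd.to_numeric(weather_summary['Snowfall'], errors='coerce')
weather_summary['Snowfall_bin'] = np.where(snowfall_numeric.fillna(0) > 0, 1, 0)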

# Add full state names to the analysis

STATES_TUPLE = [("AL","Alabama"),
                ("AK","Alaska"),
                ("AZ","Arizona"),
                ("AR","Arkansas"),
                ("CA", "California"),
                ("CO", "Colorado"),
                ("CT","Connecticut"),
                ("DC","Washington DC"),
                ("DE","Delaware"),
                ("FL","Florida"),
                ("GA","Georgia"),
                ("HI","Hawaii"),
                ("ID","Idaho"),
                ("IL","Illinois"),
                ("IN","Indiana"),
                ("IA","Iowa"),
                ("KS","Kansas"),
                ("KY","Kentucky"),
                ("LA","Louisiana"),
                ("ME","Maine"),
                ("MD","Maryland"),
                ("MA","Massachusetts"),
                ("MI","Michigan"),
                ("MN","Minnesota"),
                ("MS","Mississippi"),
                ("MO","Missouri"),
                ("MT","Montana"),
                ("NE","Nebraska"),
                ("NV","Nevada"),
                ("NH","New Hampshire"),
                ("NJ","New Jersey"),
                ("NM","New Mexico"),
                ("NY","New York"),
                ("NC","North Carolina"),
                ("ND","North Dakota"),
                ("OH","Ohio"),
                ("OK","Oklahoma"),
                ("OR","Oregon"),
                ("PA","Pennsylvania"),
                ("RI","Rhode Island"),
                ("SC","South Carolina"),
                ("SD","South Dakota"),
                ("TN","Tennessee"),
                ("TX","Texas"),
                ("UT","Utah"),
                ("VT","Vermont"),
                ("VA","Virginia"),
                ("WA","Washington"),
                ("WV","West Virginia"),
                ("WI","Wisconsin"),
                ("WY","Wyoming")]


# Add sine and cos features for seasonal elements 

In [None]:
# Write feature-engineered data to the 03_processed data folder
weather_summary.to_csv('/Users/Rej1992/Documents/GitHub/RegressionModels/data/03_processed/data_std_feature_eng.csv')

# Save the features to a pickle file


#weather_summary_tm.to_csv('/Users/Rej1992/Documents/GitHub/RegressionModels/data/03_processed/data_tm_feature_eng.csv')

In [None]:
weather_summary['Snowfall_bin'].value_counts().sort_values().plot(kind = 'barh')

In [None]:
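# Plot a histogram of MIN and MAX side by side, one per axes
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)

# histplot is axes-level and accepts ax; displot is figure-level and does not
sns.histplot(data=weather_summary, x="MIN", ax=ax1)
sns.histplot(data=weather_summary, x="MAX", ax=ax2)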

In [None]:
sns.displot(weather_summary, x="MAX", hue="YR", kind="kde")

In [None]:
weather_summary[weather_summary['STA'] == 10001]

**Data Analysis & Visualization of features**
- Timeseries Dataframe: weather_summary_tm

In [None]:
## Sort the data into date order and use Date as the index
## (this gives the timeseries view of the dataframe)
weather_summary = weather_summary.sort_values('Date').set_index('Date')
weather_summary

In [None]:
# Annual Analysis


In [None]:
# Monthly Analysis


In [None]:
# Daily Analysis
plt.scatter(weather_summary.DA, weather_summary.MAX)

In [None]:
# ## TimeSeries Analysis 
# sns.lineplot(x='Date', 
#              y='MIN', 
#              data=weather_summarytrans, 
#              hue='STA'); # ';' is to avoid extra message before plot

### Weather Location Analysis

**Hypothesis & Expectations to Test**
- What are the locations associated with the study?


In [None]:
# Summary statistics for the daily temperature columns (deg F).
# Assumption: the template column names are mapped to MIN and MAX in weather_summary.
initial_min_temperature = weather_summary['MIN'].iloc[0]
initial_max_temperature = weather_summary['MAX'].iloc[0]

final_min_temperature = weather_summary['MIN'].iloc[-1]
final_max_temperature = weather_summary['MAX'].iloc[-1]

min_temperature = weather_summary['MIN'].min()
max_temperature = weather_summary['MAX'].max()

min_temperature_sd = weather_summary['MIN'].std()  # pandas uses .std(), not .sd()
max_temperature_sd = weather_summary['MAX'].std()

min_temperature_avg = weather_summary['MIN'].mean()
max_temperature_avg = weather_summary['MAX'].mean()

min_temperature_median = weather_summary['MIN'].median()
max_temperature_median = weather_summary['MAX'].median()

# mode() can return several values; take the first
min_temperature_mode = weather_summary['MIN'].mode().iloc[0]
max_temperature_mode = weather_summary['MAX'].mode().iloc[0]

## Model Building 
- Data partitioning into training, validation and testing sets
- Select the model that you would like to use
- Hyperparameter tuning is used to fine-tune the model in order to prevent overfitting 
- Cross-validation is performed to ensure the model performs well on the validation set 
- Model is applied to the test data set
- Save the trained model to a pickle file
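
The steps above can be sketched end to end on synthetic data. This is a minimal sketch under stated assumptions: the 60/20/20 split, the polynomial degree as the only hyperparameter, mean absolute error as the metric, and an in-memory pickle round trip (rather than a file on disk) are all illustrative choices, and the data is synthetic.

```python
import pickle
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(20, 80, size=1000)           # synthetic MIN (deg F)
y = x + 15 + rng.normal(0, 4, size=1000)     # synthetic MAX (deg F)

# 60/20/20 partition into training, validation and test indices
idx = rng.permutation(len(x))
train, val, test = np.split(idx, [600, 800])

def mae(coef, subset):
    """Mean absolute error of a polynomial model on the given rows."""
    return float(np.mean(np.abs(y[subset] - np.polyval(coef, x[subset]))))

# Tune the polynomial degree on the validation set
candidates = {}
for degree in (1, 2, 3):
    coef = np.polyfit(x[train], y[train], deg=degree)
    candidates[degree] = (coef, mae(coef, val))

best_degree = min(candidates, key=lambda d: candidates[d][1])
best_coef = candidates[best_degree][0]

# Touch the test set once, with the selected model only
test_mae = mae(best_coef, test)

# Serialise the fitted model and restore it
restored = pickle.loads(pickle.dumps({"degree": best_degree, "coef": best_coef}))
print(best_degree, round(test_mae, 2))
```

Keeping the test rows out of both fitting and degree selection is what makes the final error an honest estimate.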

## Error Handling Scripts

In [None]:
SuspiciousTests_Test = pd.DataFrame(columns = ['Filename', 'Test Parameters', 'Code', 'Value'])

## Application