#### CRISP

## Business Understanding

- There are 2 datasets: train_data.csv and test_data.csv
- The target, contest-tmp2m-14d__tmp2m, is the mean temperature, (tmax + tmin) / 2, and is to be predicted for the test data
- Latitude and longitude are anonymized so latitude information cannot be used for temperature prediction
- startdate indicates the start of a 14 day period
- The data provided covers 2014 to 2016, so the effect of El Niño must be considered
- nmme forecast values and other forecast values will not be part of the feature set used for this model
- The 2010 data for geopotential, wind, etc. will also be discarded for this model
- The 2010 data for sea surface temperature will, however, be used
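
The mean temperature in the list above depends on the parentheses: the sum is halved, not just tmin. A minimal sketch with hypothetical values for `tmax` and `tmin` (they are not taken from the dataset):

```python
# Hypothetical daily extremes in degrees Celsius
tmax = 30.0
tmin = 10.0

# Intended definition: halve the sum of both extremes
mean_temp = (tmax + tmin) / 2

# Without parentheses only tmin is halved, which is a different quantity
unparenthesized = tmax + tmin / 2

print(mean_temp, unparenthesized)
```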

## Data Analysis

#### Import libraries and set Universal params

In [None]:
# Import main libraries for data analysis and modelling
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt


# Import additional helper libraries
import os
import datetime as dt
from IPython.display import display
from math import radians, cos, sin, asin, sqrt

In [None]:
# Show all columns when displaying a dataframe
pd.set_option("display.max_columns", None)

#### Set paths and create dataframes

In [None]:
# Define the filepath

data_dir = os.path.join(os.path.abspath(os.path.join(os.getcwd(), os.pardir)), 'data')

train_csv = os.path.join(data_dir, 'train_data.csv')
test_csv = os.path.join(data_dir, 'test_data.csv')

print(train_csv)
print(test_csv)

In [None]:
# Load the training data set
train_df_raw = pd.read_csv(train_csv)

# Load the test data set
test_df_raw = pd.read_csv(test_csv)

#### Initial Analysis

In [None]:
# info() prints its summary directly and returns None, so it is not wrapped in display()
train_df_raw.info()
display(train_df_raw.head())
display(train_df_raw.tail())
display(train_df_raw.describe())

In [None]:
with open('train_columns.txt', 'w', encoding='utf-8') as f:
    for col in train_df_raw.columns:
        f.write(f'{col},{train_df_raw.dtypes[col]},{len(train_df_raw[col].unique())}\n')

with open('train_df_info.txt', 'w', encoding='utf-8') as f:
    train_df_raw.info(verbose=True, buf=f)

In [None]:
# Find any column with empty/null values
print(f'Columns with null values in Training data are {train_df_raw.columns[train_df_raw.isnull().any()]}')

# Find the target column
target_column = train_df_raw.columns.difference(test_df_raw.columns)[0]
print(f'The target column for prediction is {target_column}')
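
`Index.difference` returns every label that is in the training columns but not in the test columns, so taking element `[0]` is only safe when exactly one column differs. A minimal sketch with hypothetical column names (`feature_a`, `feature_b`, `target`) that makes this assumption explicit:

```python
import pandas as pd

# Hypothetical column sets standing in for the train and test dataframes
train_cols = pd.Index(['feature_a', 'feature_b', 'target'])
test_cols = pd.Index(['feature_a', 'feature_b'])

# Labels that appear in train but not in test
extra_cols = train_cols.difference(test_cols)

# Guard: fail loudly if the sets differ by anything other than one column
if len(extra_cols) != 1:
    raise ValueError(f'Expected one target column, found {list(extra_cols)}')

target = extra_cols[0]
print(target)
```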

In [None]:
# Get current precision of latitude and longitude
loc_data = train_df_raw[['lat','lon']]
precision = loc_data.applymap(lambda x: len(str(x).split('.')[1]))

print(f'Current precision of latitude in training data is {precision.lat.max()}')
print(f'Current precision of longitude in training data is {precision.lon.max()}')

loc_data = test_df_raw[['lat','lon']]
precision = loc_data.applymap(lambda x: len(str(x).split('.')[1]))

print(f'Current precision of latitude in test data is {precision.lat.max()}')
print(f'Current precision of longitude in test data is {precision.lon.max()}')
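
Splitting `str(x)` on `'.'` raises an `IndexError` when the string has no decimal point, which happens for `nan` and for values rendered in scientific notation such as `1e-05`. A hedged sketch of a more defensive helper; `decimal_places` is a hypothetical name and not part of this notebook:

```python
def decimal_places(value):
    """Count digits after the decimal point in the float's repr.

    Returns None when the repr has no plain decimal part
    (scientific notation, nan, inf).
    """
    text = repr(float(value))
    if 'e' in text or '.' not in text:
        return None
    return len(text.split('.')[1])

# Hypothetical sample values, not taken from the dataset
for sample in (0.25, 3.0, 1e-05, float('nan')):
    print(sample, decimal_places(sample))
```

With a helper like this, `None` results can be filtered out before taking the maximum.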

##### Initial Data Transformations
Needed for the data to be properly visualized, for example:

- Convert startdate from mm/dd/yy to ISO format
- Sort by location and date
- Combine the latitude and longitude to create a location
- Reduce the precision of latitude and longitude to 14 decimal places to omit superfluous locations

In [None]:
# Create new copies of the dataframes for further operations
# (.copy() is required: plain assignment would only alias the raw dataframes)
train_df = train_df_raw.copy()
test_df = test_df_raw.copy()
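
Plain assignment binds a second name to the same DataFrame, while `.copy()` creates an independent object, so only a copy protects the raw data from later in-place changes. A minimal sketch on a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical stand-in for a raw dataframe
raw = pd.DataFrame({'lat': [0.123456]})

alias = raw               # same object under a second name
independent = raw.copy()  # new object with its own data

# Editing the copy leaves the raw frame untouched
independent['lat'] = independent['lat'].round(2)
raw_after_copy_edit = raw['lat'].iloc[0]

# Editing the alias changes the raw frame as well
alias['lat'] = alias['lat'].round(2)
raw_after_alias_edit = raw['lat'].iloc[0]

print(raw_after_copy_edit, raw_after_alias_edit)
```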

In [None]:
train_df['startdate'] = pd.to_datetime(train_df['startdate'], format='%m/%d/%y')
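
The `format` string uses the standard strftime directives, where `%y` reads a two-digit year. A minimal standard-library sketch of the same conversion on a hypothetical sample value (the string below is illustrative, not read from the data):

```python
from datetime import datetime

# Hypothetical startdate value in mm/dd/yy form
sample = '09/01/14'

# Parse with the same directives as the pandas call, then render as ISO 8601
parsed = datetime.strptime(sample, '%m/%d/%y')
iso_date = parsed.date().isoformat()

print(iso_date)
```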

In [None]:
# train_df.sort_values(by='startdate',inplace=True)
train_df['startdate'].is_monotonic_increasing

In [None]:
# Round latitude and longitude to 14 decimal places
# There is some trial and error involved here to get combined common locations in the following 4 code snippets
train_df['lat'] = train_df['lat'].round(14)
train_df['lon'] = train_df['lon'].round(14)
test_df['lat'] = test_df['lat'].round(14)
test_df['lon'] = test_df['lon'].round(14)
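
Rounding to 14 decimal places collapses values that differ only by floating-point noise in the last digits, which is presumably why it merges otherwise duplicate locations. A small sketch with hypothetical values, using Python's built-in `round` rather than the pandas method:

```python
# Two hypothetical coordinates that should be equal but differ in the last bit
a = 0.1 + 0.2
b = 0.3

# Exact comparison treats them as two different values
equal_before = (a == b)

# After rounding to 14 decimal places they compare equal
equal_after = (round(a, 14) == round(b, 14))

print(equal_before, equal_after)
```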

In [None]:
# Need to combine the latitude and longitude for easier data handling
# 'Single-point' Haversine: Calculates the great circle distance between a point on Earth and the (0, 0) lat-long coordinate

def single_pt_haversine(lat, lon, degrees=True):
    
    r = 6371 # Earth's radius (km)

    # Convert decimal degrees to radians
    if degrees:
        lat, lon = map(radians, [lat, lon])

    # 'Single-point' Haversine formula
    a = sin(lat/2)**2 + cos(lat) * sin(lon/2)**2
    d = 2 * r * asin(sqrt(a)) 

    return d
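
One caveat when this distance is used as a location key: the formula is symmetric in the sign of both coordinates, so mirrored points such as (10, 20) and (-10, 20) map to the same value. Whether that matters depends on the anonymized coordinate range, which is not checked here. The sketch below restates the function in a simplified form (degrees only) and adds a hypothetical `location_key` helper that keeps the pair itself:

```python
from math import radians, cos, sin, asin, sqrt, isclose

def single_pt_haversine(lat, lon, r=6371):
    """Great-circle distance (km) from (lat, lon) in degrees to (0, 0)."""
    lat, lon = map(radians, [lat, lon])
    a = sin(lat / 2) ** 2 + cos(lat) * sin(lon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def location_key(lat, lon):
    """Hypothetical alternative: the pair itself identifies the location."""
    return (lat, lon)

# Two distinct points mirrored across the equator
d_north = single_pt_haversine(10, 20)
d_south = single_pt_haversine(-10, 20)

# The distances coincide, while the tuple keys stay distinct
same_distance = isclose(d_north, d_south)
same_key = location_key(10, 20) == location_key(-10, 20)

print(same_distance, same_key)
```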

In [None]:
# Combine latitude and longitude to generate unique geolocations

train_df['haversine_distance'] = [single_pt_haversine(x, y) for x, y in zip(train_df.lat, train_df.lon)]
print(f'There are {train_df.haversine_distance.nunique()} unique locations in training data')

In [None]:
test_df['haversine_distance'] = [single_pt_haversine(x, y) for x, y in zip(test_df.lat, test_df.lon)]
print(f'There are {test_df.haversine_distance.nunique()} unique locations in test data')

In [None]:
# Get unique locations by combining test and training data and grouping by latitude and longitude
combined_data = pd.concat([train_df,test_df], axis=0)
combined_data['haversine_distance'] = [single_pt_haversine(x, y) for x, y in zip(combined_data.lat, combined_data.lon)]
print(f'There are {combined_data.haversine_distance.nunique()} unique locations in combined data')
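
Counting distinct `(lat, lon)` pairs directly offers a cross-check on the haversine-based count. A minimal sketch on hypothetical toy data; `loc_id` is an illustrative column name, not one used elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test dataframes
train = pd.DataFrame({'lat': [0.0, 0.0, 0.5], 'lon': [0.1, 0.1, 0.2]})
test = pd.DataFrame({'lat': [0.5, 1.0], 'lon': [0.2, 0.3]})

combined = pd.concat([train, test], axis=0, ignore_index=True)

# Number of distinct coordinate pairs
n_locations = len(combined[['lat', 'lon']].drop_duplicates())

# Integer label per distinct pair, numbered in sorted order of (lat, lon)
combined['loc_id'] = combined.groupby(['lat', 'lon'], sort=True).ngroup()

print(f'There are {n_locations} unique locations in combined data')
```

If this count differs from the haversine-based one, some distinct coordinate pairs share the same distance to (0, 0).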

#### Visualize Data

## Data Preparation

## Modelling

## Evaluation