#### CRISP

## Business Understanding

- There are 2 datasets `train_data.csv` and `test_data.csv`
- The `contest-tmp2m-14d__tmp2m` column is the mean temperature, `(tmax + tmin) / 2`, and is the target to be predicted for the test data
- Latitude and longitude are anonymized so latitude information cannot be used for temperature prediction
- `startdate` indicates the start of a 14 day period
- The data provided covers **2014** to **2016**, so the effect of **El Niño** has to be considered
- `nmme` forecast values and other forecast values will not be part of the feature set used for this model
- The 2010 data for geopotential, wind, etc. will also be discarded for this model
- The 2010 data for sea surface temperature will, however, be used
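
The feature exclusions listed above can be expressed as a simple column filter. The following is only a sketch on a toy frame: the column names are hypothetical stand-ins for the naming pattern described above (`nmme`, `2010`, sea surface temperature), and the `keep_column` helper is not part of the notebook. It handles only the `nmme` forecasts; any other forecast columns would need their own rule once their names are known.

```python
import pandas as pd

# Toy frame; all feature column names here are hypothetical examples,
# not the confirmed schema of train_data.csv
toy = pd.DataFrame(columns=[
    'lat', 'lon', 'startdate',
    'nmme0-tmp2m-34w__cancm30',   # forecast feature -> drop
    'wind-hgt-850-2010-1',        # 2010 geopotential/wind feature -> drop
    'sst-2010-1',                 # 2010 sea surface temperature -> keep
    'elevation__elevation',
])

def keep_column(col: str) -> bool:
    """Return True if the column should stay in the feature set."""
    if 'nmme' in col:
        return False
    if '2010' in col and not col.startswith('sst'):
        return False
    return True

selected = [c for c in toy.columns if keep_column(c)]
print(selected)
```

The substring rules should be checked against the real column listing before they are applied to the full dataframes.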

NOTE: *Some of the data analyses and visualizations below are followed by inferences that dictate the next set of data transformations*

## Data Analysis

#### Import libraries and set Universal params

In [None]:
# Import main libraries for data analysis and modelling
import pandas as pd
import numpy as np

from shapely.geometry import Point
from shapely import wkt
import geopandas as gpd
from geopandas import GeoDataFrame

import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px

# Import additional helper libraries
import os
import datetime as dt
from IPython.display import display
# import math
# from math import radians, cos, sin, asin, sqrt
# import itertools

In [None]:
pd.set_option("display.max_columns", None)

#### Set paths and create dataframes

In [None]:
# Define the filepath

data_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir)) + '/data/'

train_csv = data_dir + 'train_data.csv'
test_csv = data_dir + 'test_data.csv'

print(train_csv)
print(test_csv)

In [None]:
# Load the training data set
train_df_raw = pd.read_csv(train_csv)

# Load the test data set
test_df_raw = pd.read_csv(test_csv)

#### Initial Analysis

In [None]:
# Display primary observations
train_df_raw.info()  # info() prints directly and returns None, so display() is not needed
display(train_df_raw.head())
display(train_df_raw.tail())
display(train_df_raw.describe())

In [None]:
with open('train_columns.txt', 'w', encoding='utf-8') as f:
    for col in train_df_raw.columns:
        f.write(f'{col},{train_df_raw.dtypes[col]},{len(train_df_raw[col].unique())}\n')

with open('train_df_info.txt', 'w', encoding='utf-8') as f:
    train_df_raw.info(verbose=True, buf=f)

- `startdate` is an object and needs to be converted to `datetime` and later to `ordinal` Int type for usability
- `climateregions__climateregion` is an object and needs to be converted to string type for usability

In [None]:
# Find any column with empty/null values
print(f'Columns with null values in Training data are {train_df_raw.columns[train_df_raw.isnull().any()]}')

# Find the target column
target_column = train_df_raw.columns.difference(test_df_raw.columns)[0]
print(f'The target column for prediction is {target_column}')

The features with null values are all prediction data and can therefore be ignored

In [None]:
# Check unique locations
print('Unique locations in train data ',train_df_raw.groupby(['lat','lon']).ngroup().nunique())
print('Unique locations in test data ',test_df_raw.groupby(['lat','lon']).ngroup().nunique()) 
print('Unique locations in combined data ',pd.concat([train_df_raw,test_df_raw], axis=0).groupby(['lat','lon']).ngroup().nunique())

The combined dataframe gives more unique locations than either the train or the test data. Check the precision of the location data to determine whether the extra locations are of practical relevance
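
One possible cause of such a mismatch is floating-point noise: the same point stored with slightly different trailing digits in the two files. The toy example below illustrates that mechanism only; the values are made up, and `n_locations` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

# Toy data: the first point is the same place, stored with different float noise
train_toy = pd.DataFrame({'lat': [0.1 + 0.2, 0.5], 'lon': [0.25, 0.75]})
test_toy = pd.DataFrame({'lat': [0.3, 0.5], 'lon': [0.25, 0.75]})

def n_locations(df):
    """Count distinct (lat, lon) pairs."""
    return df.groupby(['lat', 'lon']).ngroups

combined = pd.concat([train_toy, test_toy], axis=0)
print(n_locations(train_toy), n_locations(test_toy), n_locations(combined))

# Rounding removes the noise, so the combined count matches the individual counts
print(n_locations(combined.round(6)))
```

If rounding the real coordinates brings the combined count back in line with the train and test counts, the extra locations were most likely numerical artifacts.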

In [None]:
# Get current precision of latitude and longitude
precision = train_df_raw[['lat','lon']].applymap(lambda x: len(str(x).split('.')[1]))

print(f'Current precision of latitude in training data is {precision.lat.max()}')
print(f'Current precision of longitude in training data is {precision.lon.max()}')

precision = test_df_raw[['lat','lon']].applymap(lambda x: len(str(x).split('.')[1]))

print(f'Current precision of latitude in test data is {precision.lat.max()}')
print(f'Current precision of longitude in test data is {precision.lon.max()}')

- A precision of 16 decimal places is too high for practical purposes. It indicates that the values came straight from a computer or calculator and that no attention was paid to the fact that the extra decimals are useless.
- The ninth decimal place is worth up to 110 microns, which is already in the range of microscopy.
- For almost any conceivable application with earth positions, this is overkill and more precise than the accuracy of any surveying device.
- The decision is to reduce the precision to 6 decimal places.
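
The arithmetic behind these figures can be sketched as follows, assuming roughly 111.32 km per degree of latitude (a common approximation, not a value taken from the dataset). Because the coordinates here are anonymized, the distances are indicative only.

```python
# Approximate ground distance of one unit in the n-th decimal place of a degree.
# Assumption: about 111.32 km per degree of latitude.
KM_PER_DEGREE = 111.32

def metres_per_decimal_place(n: int) -> float:
    """Distance in metres represented by 10**-n degrees of latitude."""
    return KM_PER_DEGREE * 1000 * 10 ** (-n)

for n in (6, 9, 16):
    print(n, metres_per_decimal_place(n))
```

Under this assumption the sixth decimal place corresponds to about 11 cm and the ninth to about 0.11 mm, consistent with the 110 micron figure above.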

In [None]:
# Check whether a column is sorted in either direction.
# is_monotonic is deprecated, so the two explicit checks are combined
def check_sort(df, col):
    return bool(df[col].is_monotonic_increasing or df[col].is_monotonic_decreasing)

In [None]:
# Check the current sorting order on important columns
print('Sorted by index = ', check_sort(train_df_raw,'index'))
print('Sorted by latitude = ', check_sort(train_df_raw,'lat'))
print('Sorted by longitude = ', check_sort(train_df_raw,'lon'))
print('Sorted by date = ', check_sort(train_df_raw,'startdate'))

A useful sorting would be to sort by location (combined latitude and longitude) and then by date
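
A minimal sketch of that sort order on a toy frame is shown below. The column names match the notebook, but the values are made up. The dates are parsed first because mm/dd/yy strings would otherwise sort lexicographically rather than chronologically.

```python
import pandas as pd

# Toy frame with the notebook's column names; values are invented
toy = pd.DataFrame({
    'lat': [0.5, 0.1, 0.5, 0.1],
    'lon': [0.2, 0.3, 0.2, 0.3],
    'startdate': ['11/2/14', '9/1/14', '9/1/14', '11/2/14'],
})

# Parse dates before sorting so the order is chronological
toy['startdate'] = pd.to_datetime(toy['startdate'], format='%m/%d/%y')

# Sort by location (lat, lon) first, then by date within each location
toy_sorted = toy.sort_values(['lat', 'lon', 'startdate']).reset_index(drop=True)
print(toy_sorted)
```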

##### Raw Data Visualizations

In [None]:
# Check initial locations on a map to understand the geography of train data
location_df = train_df_raw[['lat','lon']].drop_duplicates().copy()
location_df.head()

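# shapely's Point takes (x, y), i.e. (longitude, latitude)
geometry = [Point(xy) for xy in zip(location_df['lon'], location_df['lat'])]
gdf = GeoDataFrame(location_df, geometry=geometry)

# This is a simple world map bundled with geopandas
# (gpd.datasets is deprecated since geopandas 0.14 and removed in 1.0)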
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
gdf.plot(ax=world.plot(figsize=(10, 6)), marker='o', color='red', markersize=1);


In [None]:
# Check initial locations on a map to understand the geography of test data
location_df = test_df_raw[['lat','lon']].drop_duplicates().copy()
location_df.head()

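# shapely's Point takes (x, y), i.e. (longitude, latitude)
geometry = [Point(xy) for xy in zip(location_df['lon'], location_df['lat'])]
gdf = GeoDataFrame(location_df, geometry=geometry)

# This is a simple world map bundled with geopandas
# (gpd.datasets is deprecated since geopandas 0.14 and removed in 1.0)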
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
gdf.plot(ax=world.plot(figsize=(10, 6)), marker='o', color='red', markersize=1);


- The train and test data appear to cover the same locations.
- Location is anonymized, so it could either be omitted or converted to a categorical feature
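
One way to turn location into a categorical feature is to assign an integer id to each distinct (lat, lon) pair. The sketch below uses toy values, and `location_id` is a hypothetical column name. In practice the ids would have to be computed on the combined train and test data so that the same place gets the same id in both.

```python
import pandas as pd

# Toy coordinates; values are invented
toy = pd.DataFrame({
    'lat': [0.1, 0.1, 0.5, 0.5, 0.1],
    'lon': [0.3, 0.3, 0.2, 0.2, 0.3],
})

# One integer id per distinct (lat, lon) pair
toy['location_id'] = toy.groupby(['lat', 'lon']).ngroup()

# Mark the id as categorical rather than numeric
toy['location_id'] = toy['location_id'].astype('category')
print(toy)
```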

In [None]:
# Visualize the spread of time in train data
time_df = pd.DataFrame()
time_df['startdate'] = pd.to_datetime(train_df_raw['startdate'], format='%m/%d/%y')
time_df.groupby([time_df['startdate'].dt.year, time_df['startdate'].dt.month]).count().plot(kind='bar')

In [None]:
# Visualize the spread of time in test data
time_df = pd.DataFrame()
time_df['startdate'] = pd.to_datetime(test_df_raw['startdate'], format='%m/%d/%y')
time_df.groupby([time_df['startdate'].dt.year, time_df['startdate'].dt.month]).count().plot(kind='bar')

In [None]:
# Visualize temperature 
temp_df = train_df_raw[['lat','lon','startdate','contest-tmp2m-14d__tmp2m']].copy()
temp_df['startdate'] = pd.to_datetime(temp_df['startdate'], format='%m/%d/%y')
temp_df['location'] = [Point(xy) for xy in zip(temp_df['lat'], temp_df['lon'])] 
temp_df['location_str'] = temp_df['location'].apply(lambda x: wkt.dumps(x)) # Testing Point geometry to String for use in ML training

temp_df = temp_df.pivot(index='startdate', columns='location_str', values='contest-tmp2m-14d__tmp2m')
temp_df.head()
temp_df.plot(legend=False)

- The temperature spread is even and reasonable across the year, matching the seasonal temperature variation of the northern hemisphere
- The temperature spread by location indicates the locations to be spread across a large geographical area and multiple climatic regions

##### Initial Data Transformations
These are needed for the data to be properly visualized:
- Convert `startdate` from mm/dd/yy to ISO format
- Sort by location and date
- Combine the latitude and longitude to create a location
- Reduce the precision of latitude and longitude to 6 decimal places to omit superfluous locations

In [None]:
# Temporary functions to test

In [None]:
# create new copies of the dataframes for further operations
train_df = train_df_raw.copy()
test_df = test_df_raw.copy()

In [None]:
train_df['startdate'] = pd.to_datetime(train_df['startdate'], format='%m/%d/%y')
train_df['startdate_ordinal'] = train_df['startdate'].apply(lambda x:x.toordinal())

test_df['startdate'] = pd.to_datetime(test_df['startdate'], format='%m/%d/%y')
test_df['startdate_ordinal'] = test_df['startdate'].apply(lambda x:x.toordinal())

In [None]:
# Round latitude and longitude to 6 decimal places, which is sufficient for all practical purposes
scale = 6
train_df['lat'] = train_df['lat'].round(scale)
train_df['lon'] = train_df['lon'].round(scale)
test_df['lat'] = test_df['lat'].round(scale)
test_df['lon'] = test_df['lon'].round(scale)

In [None]:
print('Unique locations in train data ',train_df.groupby(['lat','lon']).ngroup().nunique())
print('Unique locations in test data ',test_df.groupby(['lat','lon']).ngroup().nunique()) 
print('Unique locations in combined data ',pd.concat([train_df,test_df], axis=0).groupby(['lat','lon']).ngroup().nunique())

In [None]:
# For now Haversine distance will not be used, instead Point geometry data for location will be converted to string for use as classification feature

# Need to combine the latitude and longitude for easier data handling
# 'Single-point' Haversine: Calculates the great circle distance between a point on Earth and the (0, 0) lat-long coordinate

# def single_pt_haversine(lat, lon, degrees=True):
    
#     r = 6371 # Earth's radius (km)

#     # Convert decimal degrees to radians
#     if degrees:
#         lat, lon = map(radians, [lat, lon])

#     # 'Single-point' Haversine formula
#     a = sin(lat/2)**2 + cos(lat) * sin(lon/2)**2
#     d = 2 * r * asin(sqrt(a)) 

#     return d

In [None]:
# Combine latitude and longitude to generate unique geolocations. Convert to String for later use

# train_df['haversine_distance'] = [single_pt_haversine(x, y) for x, y in zip(train_df.lat, train_df.lon)]
train_df['location'] = [Point(xy) for xy in zip(train_df['lat'], train_df['lon'])] 
train_df['location'] = train_df['location'].apply(lambda x: wkt.dumps(x))

test_df['location'] = [Point(xy) for xy in zip(test_df['lat'], test_df['lon'])] 
test_df['location'] = test_df['location'].apply(lambda x: wkt.dumps(x))

In [None]:
# Check if data is sorted by new location information
print('Sorted by Location = ', check_sort(train_df,'location'))

In [None]:
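# Convert climate regions to string in both dataframes
train_df['climateregions__climateregion'] = train_df['climateregions__climateregion'].astype(str)
test_df['climateregions__climateregion'] = test_df['climateregions__climateregion'].astype(str)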

#### Further analyse and Visualize Data 

##### Elevation

In [None]:
# Confirm that elevation is consistent for each location
# elevation_df = train_df[['location','elevation__elevation']].drop_duplicates().copy()
print('Unique combinations of location and elevation: ',train_df[['location','elevation__elevation']].drop_duplicates().shape[0])

In [None]:
# Visualize temperature against elevation
elevation_df = train_df[['elevation__elevation','startdate','contest-tmp2m-14d__tmp2m']].drop_duplicates().copy()

elevations = elevation_df.elevation__elevation.unique()
elevations.sort()

# Plot temperature against elevation for a group of elevations 
# for e in itertools.islice(elevations, 40, 43):
#     elevation_df_1 = elevation_df[elevation_df['elevation__elevation']==e]
#     print('Elevation ',e)
#     elevation_df_1.plot.line(x='startdate',y='contest-tmp2m-14d__tmp2m')

# Plot temperature against elevation for a range of elevations
for e in list(filter(lambda e: (e>=100 and e<=300), elevations)):
    elevation_df_1 = elevation_df[elevation_df['elevation__elevation'] == e]
    title_str = 'Elevation ' + str(e)
    elevation_df_1.plot.line(x='startdate',y='contest-tmp2m-14d__tmp2m',title=title_str)

The effect of elevation on temperature is clear. Therefore this is an important feature.
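
The visual impression can be backed by a number, for example the correlation between elevation and the mean temperature per location. The following is a sketch on made-up values that reuses the notebook's column names; it is not a result from the actual data.

```python
import pandas as pd

# Toy values, invented for illustration; column names match the notebook
toy = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B', 'C', 'C'],
    'elevation__elevation': [100, 100, 800, 800, 2000, 2000],
    'contest-tmp2m-14d__tmp2m': [20.0, 24.0, 15.0, 19.0, 5.0, 9.0],
})

# Average over time per location so each location contributes one point
per_location = toy.groupby('location').agg(
    elevation=('elevation__elevation', 'first'),
    mean_temp=('contest-tmp2m-14d__tmp2m', 'mean'),
)

# Pearson correlation between elevation and mean temperature
corr = per_location['elevation'].corr(per_location['mean_temp'])
print(per_location)
print(f'correlation = {corr:.3f}')
```

Averaging per location first avoids letting the seasonal cycle dominate the correlation.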

##### Climate Region

In [None]:
# Plot effect of climate region on temperature
climate_df = train_df[['climateregions__climateregion','startdate','contest-tmp2m-14d__tmp2m']].drop_duplicates().copy()

climates = climate_df.climateregions__climateregion.unique()
print('Unique climate regions ', len(climates))

for e in climates:
    climate_df_1 = climate_df[climate_df['climateregions__climateregion'] == e]
    title_str = 'Climate region ' + e
    climate_df_1.plot.line(x='startdate',y='contest-tmp2m-14d__tmp2m',title=title_str)

In [None]:
# Combined plot using plotly express

fig = px.line(climate_df, x='startdate', 
              y='contest-tmp2m-14d__tmp2m', 
              color = 'climateregions__climateregion', 
              facet_row='climateregions__climateregion',facet_row_spacing=0.04,
              labels={"contest-tmp2m-14d__tmp2m":"Temp", "climateregions__climateregion":"Climate Region"},
              template = 'plotly_white', height=2000)

fig.update_layout(title='Mean temperature variations by climate regions', xaxis_title='Date')
fig.update_yaxes(visible=True, matches=None)
fig.update_layout(annotations=[], overwrite=True)

fig.show()

The effect of climate region on temperature is evident from the plots. Therefore this is a very important feature.
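
A per-region summary table can complement the plots. The sketch below uses invented temperatures, and the region codes are placeholders rather than values confirmed from the dataset.

```python
import pandas as pd

# Toy values; region codes and temperatures are invented
toy = pd.DataFrame({
    'climateregions__climateregion': ['BSk', 'BSk', 'Dfb', 'Dfb', 'Csa', 'Csa'],
    'contest-tmp2m-14d__tmp2m': [10.0, 26.0, -5.0, 21.0, 12.0, 24.0],
})

# Mean, minimum and maximum temperature per climate region
summary = (
    toy.groupby('climateregions__climateregion')['contest-tmp2m-14d__tmp2m']
       .agg(['mean', 'min', 'max'])
)

# The range shows how strongly the temperature swings within a region
summary['range'] = summary['max'] - summary['min']
print(summary.sort_values('mean'))
```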

##### Multivariate ENSO index

In [None]:
# Check how many unique combinations of location and NIP / MEI exist
print('Unique combinations of location and NIP: ',train_df[['location','mei__nip']].drop_duplicates().shape[0])
print('Unique combinations of location and MEI: ',train_df[['location','mei__mei']].drop_duplicates().shape[0])

In [None]:
# Visualize temperature against MEI
mei_df = train_df[['mei__mei','startdate','contest-tmp2m-14d__tmp2m']].drop_duplicates().copy()

meis = mei_df.mei__mei.unique()
meis.sort()

# Plot temperature over time for a range of MEI values
for e in list(filter(lambda e: (e>=0 and e<=0.5), meis)):
    mei_df_1 = mei_df[mei_df['mei__mei'] == e]
    title_str = 'MEI ' + str(e)
    mei_df_1.plot.line(x='startdate',y='contest-tmp2m-14d__tmp2m',title=title_str)


In [None]:
# Visualize temperature against Nino Index Phase
nip_df = train_df[['mei__nip','startdate','contest-tmp2m-14d__tmp2m']].copy()

fig = px.line(nip_df, x='startdate', 
              y='contest-tmp2m-14d__tmp2m', 
              color = 'mei__nip', 
              facet_row='mei__nip',facet_row_spacing=0.04,
              labels={"contest-tmp2m-14d__tmp2m":"Temp", "mei__nip":"NIP"},
              template = 'plotly_white', height=300)

fig.update_layout(title='Mean temperature variations by NIP', xaxis_title='Date')
fig.update_yaxes(visible=True, matches=None)
fig.update_layout(annotations=[], overwrite=True)

fig.show()

The effect of MEI and NIP is not clearly visible, although a slight increase in temperature is evident with higher NIP. This is a secondary feature.
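
Because the seasonal cycle dominates the raw temperature, a weak NIP signal may be easier to see in temperature anomalies. The sketch below subtracts the mean of each calendar month and then averages the remainder per NIP. All values, including the NIP codes, are invented for illustration, and the `anomaly` column is hypothetical.

```python
import pandas as pd

# Toy values; dates, NIP codes and temperatures are invented
toy = pd.DataFrame({
    'startdate': pd.to_datetime(['2014-09-01', '2015-09-01', '2014-12-01', '2015-12-01']),
    'mei__nip': [1, 3, 1, 3],
    'contest-tmp2m-14d__tmp2m': [20.0, 21.0, 2.0, 4.0],
})

# Remove the seasonal cycle: subtract the mean temperature of each calendar month
month = toy['startdate'].dt.month
monthly_mean = toy.groupby(month)['contest-tmp2m-14d__tmp2m'].transform('mean')
toy['anomaly'] = toy['contest-tmp2m-14d__tmp2m'] - monthly_mean

# Average anomaly per Nino Index Phase
anomaly_by_nip = toy.groupby('mei__nip')['anomaly'].mean()
print(anomaly_by_nip)
```

With only three years of data, a monthly mean is a rough baseline, so any difference between phases should be read with caution.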

## Data Preparation

## Modelling

## Evaluation