# Pickle engineered features

Before starting on the actual assignment 4B, this script covers the gap between last week's 4A and this week's 4B. Last week we did not pickle the data after creating our new features, so we will do that here, for both the `train_set` and the `val_set`. The code below is also constructed so that we can create a final `train_set` (which includes the data from the `val_set`) and a final `test_set`. This last bit will not be relevant until later in the course, but you should know that the functionality is already here.

Now, to understand what is going on, note that we would often implement most of our feature engineering on the full data set before splitting it into train, val and test. Or indeed it might be an iterative process where we see what works before going back and tweaking it. In this course we go a bit back and forth, but not so much for tweaking. Instead it is because I wanted you to start predicting and evaluating models as soon as possible, which required us to create a train and a validation set.

However, we naturally need the same features in our `val_set`/`test_set` as we have in our `train_set`. As such it is easier to create the features before splitting into `train/val/test`. Instead of importing the `full_df`, the function below concatenates the train and validation (and potentially also the test) sets back together and runs all the feature engineering we did last time on the full data set. Afterwards it splits the data again into a train and a validation set (or train and test if `test_time = True`).
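
The concatenate, engineer, split pattern can be sketched on a toy example. Everything here is an assumption made for illustration: the miniature `toy_train`/`toy_val` frames, their values, and the use of `ignore_index=True` (the function further down concatenates without it).

```python
import pandas as pd

# Hypothetical miniature train/val sets: two grid cells (gid), three years.
toy_train = pd.DataFrame({
    'gid':  [1, 2, 1, 2],
    'year': [2000, 2000, 2001, 2001],
    'best': [5, 0, 2, 1],
})
toy_val = pd.DataFrame({
    'gid':  [1, 2],
    'year': [2002, 2002],
    'best': [0, 4],
})

# 1) Stitch the sets back together and sort chronologically.
toy_df = pd.concat([toy_train, toy_val], ignore_index=True).sort_values('year')

# 2) Engineer the feature once, on the combined frame.
toy_df['past_fatalities'] = toy_df.groupby('gid')['best'].cumsum()

# 3) Split again: the last year becomes the validation set.
last_year = toy_df['year'].max()
new_train = toy_df[toy_df['year'] < last_year]
new_val = toy_df[toy_df['year'] == last_year]

print(new_val[['gid', 'year', 'past_fatalities']])
```

Because the cumulative sum runs over the combined frame, the validation rows carry the full history of their grid cell, which is what we want for a feature that only looks backwards in time.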

A couple of things to be aware of when doing such operations: watch out for any features that might send information "back through time", such as "total fatalities count across all years". This is analogous to the situation surrounding `best_ratio`. That is, when creating a static measure across all years, we must only use the years in our `train_set` to create said static measure for the `train_set`. It is important that such a static measure does not leak information from the validation/test set back into the training set. For the validation set, however, we are free to use information from both the training set and the validation set, and for the test set we can freely use information from the train, validation and test sets. This might sound a bit convoluted because, well, it is. My best advice to you is simply to think very carefully about what information you would expect to have at your disposal if you were to forecast into the actual future. That is: what information would you have if you did not have a test set?
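
To make the rule concrete, here is a minimal sketch of a leakage-free static measure. The toy data and the column name `total_best` are hypothetical; the only point is which years are allowed to feed into which set.

```python
import pandas as pd

# Hypothetical data: two grid cells, three years; 2002 plays the validation year.
toy = pd.DataFrame({
    'gid':  [1, 1, 1, 2, 2, 2],
    'year': [2000, 2001, 2002, 2000, 2001, 2002],
    'best': [5, 2, 100, 0, 1, 4],
})
val_year = 2002

train_part = toy[toy['year'] < val_year].copy()
val_part = toy[toy['year'] == val_year].copy()

# Static total for the training rows: built from training years only.
train_totals = train_part.groupby('gid')['best'].sum()
train_part['total_best'] = train_part['gid'].map(train_totals)

# Static total for the validation rows: training and validation years are both allowed.
val_totals = toy[toy['year'] <= val_year].groupby('gid')['best'].sum()
val_part['total_best'] = val_part['gid'].map(val_totals)

print(train_part)
print(val_part)
```

Note how the spike of 100 fatalities in 2002 shows up in the validation rows but never reaches the training rows. Had we computed the total on the full frame first, every training row for cell 1 would have "known" about it.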

Also, note that when creating our final `test_set` we want to make our validation set part of the training set first (can you see why?). As mentioned, this can be handled by the function below, specifically by setting `test_time = True`. Don't worry too much about this for now, though. We will return to it later; for now it is just important that you understand that this is what is handled at the start of the `featureEng()` function below. Indeed, we will not worry about the test set until we know which features we would like to keep (that is, after 4B).

As an optional exercise, I recommend that you go through the code below and insert some comments regarding what each part does and why that feature might be relevant. Note that I included two new features, `log_best_decay5` and `log_best_decay10`, and also made a slight correction to `past_fatalities_country_Npop`, `past_magnitude_country_Npop` and `past_events_country_Npop`.

As a last note: **make sure you back up your `train_set.pkl`, `val_set.pkl` and `test_set.pkl` in a safe folder somewhere, just so you can go get them if you accidentally overwrite the original ones.**
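
If you prefer to do the backup from Python rather than by hand, something along these lines would work. The helper name `backup_pickles` and the folder names are hypothetical, and the demo runs in a throwaway temporary directory so it cannot touch your real files.

```python
import shutil
import tempfile
from pathlib import Path

def backup_pickles(src_dir, backup_dir,
                   names=('train_set.pkl', 'val_set.pkl', 'test_set.pkl')):
    """Copy the named files from src_dir to backup_dir; return the names copied."""
    src_dir, backup_dir = Path(src_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = src_dir / name
        if src.exists():  # silently skip files that are not there yet
            shutil.copy2(src, backup_dir / name)
            copied.append(name)
    return copied

# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / 'work'
    work.mkdir()
    (work / 'train_set.pkl').write_bytes(b'placeholder')
    (work / 'val_set.pkl').write_bytes(b'placeholder')

    copied = backup_pickles(work, Path(tmp) / 'backup')
    backup_exists = (Path(tmp) / 'backup' / 'train_set.pkl').exists()

print(copied)
```

In your own project folder you would call something like `backup_pickles('.', 'backup')` once, before running the cells below.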


In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import pickle
import time

import seaborn as sns

In [2]:
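def decay5(data, window=5):
    # Exponentially weighted moving average of a pandas Series (span = window),
    # computed in vectorized form and seeded with the first observation.

    alpha = 2 /(window + 1.0)            # smoothing factor
    alpha_rev = 1-alpha                  # weight carried over from the previous step
    n = data.shape[0]

    pows = alpha_rev**(np.arange(n+1))   # (1-alpha)^0, ..., (1-alpha)^n

    scale_arr = 1/pows[:-1]
    offset = data.iloc[0]*pows[1:]       # decaying contribution of the first observation
    pw0 = alpha*alpha_rev**(n-1)

    mult = data*pw0*scale_arr
    cumsums = mult.cumsum()
    out = offset + cumsums*scale_arr[::-1]
    return out

def decay10(data, window=10):
    # Same computation with span 10, but without the offset term used in decay5,
    # so the average starts from zero instead of from the first observation.

    alpha = 2 /(window + 1.0)
    alpha_rev = 1-alpha
    n = data.shape[0]

    pows = alpha_rev**(np.arange(n+1))

    scale_arr = 1/pows[:-1]
    pw0 = alpha * alpha_rev**(n-1)

    mult = data*pw0*scale_arr
    cumsums = mult.cumsum()
    out = cumsums * scale_arr[::-1]
    return out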
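
The arithmetic in `decay5` and `decay10` is dense, so a quick sanity check may help. The sketch below restates it in a single helper (the name `ewma` and the `seed_with_first` flag are hypothetical, introduced only for this illustration) and compares the result with pandas' built-in `ewm(span=..., adjust=False).mean()` on made-up numbers.

```python
import numpy as np
import pandas as pd

def ewma(data, window, seed_with_first=True):
    # Same arithmetic as the decay functions above, restated so this runs on its own.
    alpha = 2 / (window + 1.0)
    alpha_rev = 1 - alpha
    n = data.shape[0]
    pows = alpha_rev ** np.arange(n + 1)
    scale_arr = 1 / pows[:-1]
    pw0 = alpha * alpha_rev ** (n - 1)
    cumsums = (data * pw0 * scale_arr).cumsum()
    # decay5 adds this offset term; decay10 leaves it out.
    offset = data.iloc[0] * pows[1:] if seed_with_first else 0.0
    return offset + cumsums * scale_arr[::-1]

# Hypothetical fatality counts for one grid cell over eight years.
toy = pd.Series([4.0, 3.0, 0.0, 0.0, 10.0, 2.0, 0.0, 1.0])

# With the offset: the recursive EWMA seeded with the first value.
reference = toy.ewm(span=5, adjust=False).mean()
matches_pandas = np.allclose(ewma(toy, 5).to_numpy(), reference.to_numpy())

# Without the offset: the same recursion, but as if a zero came before the series.
padded = pd.concat([pd.Series([0.0]), toy], ignore_index=True)
reference_zero = padded.ewm(span=10, adjust=False).mean().iloc[1:]
matches_zero_start = np.allclose(
    ewma(toy, 10, seed_with_first=False).to_numpy(), reference_zero.to_numpy()
)

print(matches_pandas, matches_zero_start)
```

So the two functions differ in more than their window: one starts the average at the first observed value, the other starts it at zero. That is worth keeping in mind when you interpret the `*_decay5` and `*_decay10` features side by side.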


In [3]:
def featureEng(test_time=False):

    # Fail early on bad input; otherwise `df` would be undefined further down.
    if not isinstance(test_time, bool):
        raise ValueError("unrecognized input for test_time...")

    with open('train_set.pkl', 'rb') as pkl_file:
        train_set = pickle.load(pkl_file)

    with open('val_set.pkl', 'rb') as pkl_file:
        val_set = pickle.load(pkl_file)

    if test_time:
        # Final run: the validation year is folded into the training data.
        with open('test_set.pkl', 'rb') as pkl_file:
            test_set = pickle.load(pkl_file)

        df = pd.concat([train_set, val_set, test_set])

    else:
        df = pd.concat([train_set, val_set])

    # Sort chronologically before computing the cumulative features.
    df.sort_values('year', inplace = True)



    # ----------------------------------------------------------------------------------------------
    # Your comment here:
    
    df['past_fatalities'] = df.sort_values('year').groupby(['gid'])['best'].cumsum()
    df['past_magnitude'] = df.sort_values('year').groupby(['gid'])['log_best'].cumsum()
    df['past_events'] = df.sort_values('year').groupby(['gid'])['binary_best'].cumsum()

    df['past_fatalities_country'] = df.sort_values('year').groupby(['gwno'])['best'].cumsum()
    df['past_magnitude_country'] = df.sort_values('year').groupby(['gwno'])['log_best'].cumsum()
    df['past_events_country'] = df.sort_values('year').groupby(['gwno'])['binary_best'].cumsum()
        
    
    
    # ----------------------------------------------------------------------------------------------
    # Your comment here:
    
    features_to_normalize = ['gwarea', 'interp_pop_gpw_sum', 'past_fatalities_country', 'past_magnitude_country', 'past_events_country']

    for feature in features_to_normalize:

        new_name = 'norm_' + feature
        df[new_name] = (df[feature]- df[feature].min())/(df[feature].max()-df[feature].min())        
        

    # ----------------------------------------------------------------------------------------------
    # Your comment here:

    df['past_fatalities_country_Narea'] = df['norm_past_fatalities_country'] / (df['norm_gwarea']+1)
    df['past_magnitude_country_Narea'] = df['norm_past_magnitude_country'] / (df['norm_gwarea']+1)
    df['past_events_country_Narea'] = df['norm_past_events_country'] / (df['norm_gwarea']+1)
    
    
    # ----------------------------------------------------------------------------------------------
    # Your comment here:
    
    df['norm_interp_pop_gpw_sum_country'] = df.sort_values('year').groupby(['gwno'])['norm_interp_pop_gpw_sum'].transform('sum')

    
#     df['past_fatalities_country_Npop'] = df['norm_past_fatalities_country'] / (df['norm_interp_pop_gpw_sum']+1) # so maybe this should be country pop not cell pop..
#     df['past_magnitude_country_Npop'] = df['norm_past_magnitude_country'] / (df['norm_interp_pop_gpw_sum']+1)
#     df['past_events_country_Npop'] = df['norm_past_events_country'] / (df['norm_interp_pop_gpw_sum']+1)

    df['past_fatalities_country_Npop'] = df['norm_past_fatalities_country'] / (df['norm_interp_pop_gpw_sum_country'])
    df['past_magnitude_country_Npop'] = df['norm_past_magnitude_country'] / (df['norm_interp_pop_gpw_sum_country'])
    df['past_events_country_Npop'] = df['norm_past_events_country'] / (df['norm_interp_pop_gpw_sum_country'])
    
    
    # ----------------------------------------------------------------------------------------------
    # Your comment here:
    
    # transform (rather than apply) keeps the original row index, so the result aligns with df
    df['best_decay5'] = df.sort_values('year').groupby(['gid'])['best'].transform(decay5)
    df['best_decay10'] = df.sort_values('year').groupby(['gid'])['best'].transform(decay10)
    
    df['log_best_decay5'] = df.sort_values('year').groupby(['gid'])['log_best'].transform(decay5)
    df['log_best_decay10'] = df.sort_values('year').groupby(['gid'])['log_best'].transform(decay10)

    # ----------------------------------------------------------------------------------------------
    # Your comment here:
    df['cell_light_Pcap'] = (df['interp_nlights_mean'])/(df['norm_interp_pop_gpw_sum']+1)
    
    
    # ----------------------------------------------------------------------------------------------
    # Your comment here:
    
    df['interp_nlights_mean_country'] = df.sort_values('year').groupby(['gwno'])['interp_nlights_mean'].transform('sum')
#     df['norm_interp_pop_gpw_sum_country'] = df.sort_values('year').groupby(['gwno'])['norm_interp_pop_gpw_sum'].transform(np.sum)#

    df['country_light_Pcap'] = (df['interp_nlights_mean_country'])/(df['norm_interp_pop_gpw_sum_country']+1)
    df['country_light_Area'] = (df['interp_nlights_mean_country'])/(df['norm_gwarea']+1)
    
    
    
    # ----------------------------------------------------------------------------------------------
    # Your comment here:
    
    df['interp_nlights_mean_norm'] = (df['interp_nlights_mean'] - df['interp_nlights_mean'].min())/(df['interp_nlights_mean'].max()-df['interp_nlights_mean'].min())

    df['interp_nlights_mean_country_norm'] = (df['interp_nlights_mean_country'] - df['interp_nlights_mean_country'].min())/(df['interp_nlights_mean_country'].max()-df['interp_nlights_mean_country'].min())
    df['country_light_Pcap_norm'] = (df['country_light_Pcap'] - df['country_light_Pcap'].min())/(df['country_light_Pcap'].max()-df['country_light_Pcap'].min())
    df['country_light_Area_norm'] = (df['country_light_Area'] - df['country_light_Area'].min())/(df['country_light_Area'].max()-df['country_light_Area'].min())

    df['low_ratio_light'] = np.minimum(df['interp_nlights_mean_country_norm'] / df['interp_nlights_mean_norm'],1)  
    df['low_ratio_light_Pcap'] = np.minimum(df['country_light_Pcap_norm'] / df['interp_nlights_mean_norm'],1)  
    df['low_ratio_light_Area'] = np.minimum(df['country_light_Area_norm'] / df['interp_nlights_mean_norm'],1)
    
    last_year = df['year'].max()
    
    train_set = df[df['year'] < last_year]
    other_set = df[df['year'] == last_year] # val or test set

    return train_set, other_set

In [4]:
train_set, val_set = featureEng()

In [5]:
# pickle train with all the features

file_name = "train_set_featureEng.pkl"
with open(file_name, 'wb') as output:
    pickle.dump(train_set, output)

# pickle val with all the features

file_name = "val_set_featureEng.pkl"
with open(file_name, 'wb') as output:
    pickle.dump(val_set, output)
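
Before moving on, it can be worth confirming that a pickled frame reads back unchanged. The sketch below does this with a hypothetical toy frame in a temporary directory; with your real data you would open `val_set_featureEng.pkl` from your working folder instead.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the engineered validation set.
toy_val = pd.DataFrame({
    'gid': [1, 2],
    'year': [2002, 2002],
    'past_fatalities': [7, 5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'val_set_featureEng.pkl'

    with open(path, 'wb') as output:
        pickle.dump(toy_val, output)

    with open(path, 'rb') as pkl_file:
        reloaded = pickle.load(pkl_file)

# The frame should survive the round trip, and a validation set should hold one year only.
round_trip_ok = reloaded.equals(toy_val)
single_year = reloaded['year'].nunique() == 1

print(round_trip_ok, single_year)
```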

### On to 4B! 