# **Dimensionality reduction FCDO data**

This notebook reduces the dimensionality of the satellite-extracted data. It first preprocesses the data to handle missing values and some string inconsistencies. It then groups columns by the presence of a keyword in the column name and reduces each group to a single feature with the t-SNE algorithm. The dimensionality-reduced data can serve as input to the causal models. This is a work in progress. The following points should be improved:
1.   Non-linear dimensionality-reduced features. The necessary improvement is two-fold. First, variables should not be multiplied beyond necessity; a review of the groupings is needed to ensure this. Second, estimates for the dimensionality-reduced features should be interpreted in line with the dimensionality reduction technique used.
2.   Missing data. Missing data should not be arbitrarily replaced by the mean. Research is needed to determine whether data are missing at random or not at random, so that an appropriate method for handling missing data can be chosen; see *Mack, C., Su, Z., & Westreich, D. (2018). Managing missing data in patient registries: addendum to registries for evaluating patient outcomes: a user’s guide.*
3.   Use of data. The current models contain only climatological and conflict variables. The hypothesis is that food-related and population-related variables also play a role in the causal mechanism. These should therefore be included.
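
As a first step towards point 2, one can tabulate how much is missing per column and check whether missingness in one column is associated with the values of another, which is a rough hint that data are not missing completely at random. The sketch below uses a small made-up frame; the column names `precip` and `soil_moisture` are hypothetical and not taken from the FCDO data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    "precip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "soil_moisture": [0.1, np.nan, 0.3, np.nan, 0.5, np.nan],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Compare an observed column between rows where another column is missing
# and rows where it is present. A clear difference hints that the
# missingness depends on observed values.
is_missing = toy["soil_moisture"].isna()
mean_when_missing = toy.loc[is_missing, "precip"].mean()
mean_when_present = toy.loc[~is_missing, "precip"].mean()

print(missing_share)
print(mean_when_missing, mean_when_present)
```

A comparison like this cannot prove a missingness mechanism, but it helps decide whether mean imputation is defensible for a given column.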

## **Import**

In [None]:
!pip install pandas==1.1.5 &> /dev/null
!pip install pickle-mixin &> /dev/null

In [None]:
import numpy as np
import pandas as pd
import pickle as pckl 
from sklearn.preprocessing import RobustScaler
from sklearn.manifold import TSNE
from google.colab import auth
import logging.config
import sys
import difflib as dl

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Import data from public repository on GitHub
url = 'https://raw.githubusercontent.com/HCSS-Data-Lab/FCDO/main/Data/FCDO_data.csv'
data = pd.read_csv(url) 

## **Prepare non-reduced data**

In [None]:
class Prepare_Data:
    """
    Preprocess the raw data so that it can be used for causal modeling. Takes the raw data and returns data that
    immediately fits the causal model arguments.
    Methods:
        change_object_type: replace placeholder values and convert columns to numeric
        deal_with_string_column_names: strip problematic characters from column names
        include_range_data: add new features based on the min-max range
        deal_with_missing_data: replace infinities and NaNs
    """

    def __init__(self, data):
        """
        Initiate data and logger
        :data (pd.DataFrame) : data to be converted
        """
        # Create logger
        log_format = '%(asctime)s - %(name)s - %(levelname)s - %(funcName)s - %(message)s'
        logging.basicConfig(format=log_format, level=logging.INFO, stream=sys.stdout)
        self.data = data
        self.logger = logging.getLogger(__name__)

    def prepare_data(self):
        """
        Run all preprocessing steps for causal modeling in sequence.
        :return (pd.DataFrame): preprocessed data
        """
        self.logger.info("start preprocessing data")

        # Change the object types
        self.change_object_type()
        # Deal with awkward column string names
        self.deal_with_string_column_names()
        # Add new features based on range
        self.include_range_data()
        # Deal with missing data
        self.deal_with_missing_data()

        return self.data

    def change_object_type(self):
        """
        All columns except ADM3_EN hold numeric data, so convert them to a numeric dtype.
        The result is stored in self.data.
        """

        self.logger.info("start initiating data")
        data = self.data

        # Mark placeholder entries as missing: replace '--' by np.nan
        data.replace(['--'], [np.nan], inplace=True)
        # Convert every column except the district name to numeric; raise on values that cannot be parsed
        data.loc[:, data.columns != 'ADM3_EN'] = data.loc[:, data.columns != 'ADM3_EN'].apply(pd.to_numeric, errors='raise')
        self.data = data

    def deal_with_string_column_names(self):
        """
        Adjust column names by removing problematic characters.
        :return (pd.Dataframe): preprocessed data
        """

        self.logger.info("deal with string column names")
        data = self.data
        
        # Strip quotes and parentheses from column names for easier column selection.
        # regex=False treats "(" and ")" as literal characters rather than regex syntax.
        for symbol in ["'", "(", ")"]:
            data.columns = data.columns.str.replace(symbol, "", regex=False)

        self.data = data

    def include_range_data(self):
        """
        Create new features based on range between the minimum and the maximum value of that feature.
        :return (pd.Dataframe): preprocessed data
        """

        self.logger.info("Start adding range features.")
        data = self.data
        
        # Find the minimum and maximum columns. (Note: no column names contain lowercase 'min' or 'max'.)
        min_cols = [col for col in data.columns if 'Min' in col]
        max_cols = [col for col in data.columns if 'Max' in col]

        # Match each Min column to its closest-named Max column, see
        # https://docs.python.org/3/library/difflib.html#difflib.get_close_matches
        for min_col in min_cols:
            max_col = dl.get_close_matches(min_col, max_cols, n=1)[0]
            range_name = min_col.replace('Min', 'Range')
            data[range_name] = data[max_col] - data[min_col]

        self.data=data
        
    def deal_with_missing_data(self):
        """
        Replace infinite values by NaN and then fill all NaNs with the column mean.
        :return (pd.Dataframe): preprocessed data
        """

        self.logger.info("deal with missing data")
        data = self.data
        
        # Replace inf by NaN values
        data.replace([np.inf, -np.inf], np.nan, inplace=True)

        # Fill NaN values with the mean of each numeric column (the text column ADM3_EN is skipped)
        data.fillna(data.mean(numeric_only=True), inplace=True)
        self.data = data

In [None]:
# Preprocess the data.
pre_processing = Prepare_Data(data)
prepared_data = pre_processing.prepare_data()

2022-03-28 14:59:26,723 - __main__ - INFO - prepare_data - start preprocessing data
2022-03-28 14:59:26,729 - __main__ - INFO - change_object_type - start initiating data
2022-03-28 14:59:26,957 - __main__ - INFO - deal_with_string_column_names - deal with missing data
2022-03-28 14:59:26,961 - __main__ - INFO - include_range_data - Start adding range features.
2022-03-28 14:59:26,985 - numexpr.utils - INFO - _init_num_threads - NumExpr defaulting to 2 threads.
2022-03-28 14:59:28,128 - __main__ - INFO - deal_with_missing_data - deal with missing data
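
`include_range_data` relies on `difflib.get_close_matches` to find the `Max` column that belongs to each `Min` column. Fuzzy matching returns an empty list when no candidate scores above its cutoff (0.6 by default), in which case the `[0]` lookup would fail. A minimal sketch of a stricter alternative follows, assuming the paired columns differ only in the `Min`/`Max` token. The column names and the helper `pair_exact` are hypothetical.

```python
import difflib

# Hypothetical column names; the real FCDO columns may differ.
columns = ["Precipitation_Min", "Precipitation_Max", "Wind_Speed_Min", "Humidity_Max"]

min_cols = [c for c in columns if "Min" in c]
max_cols = [c for c in columns if "Max" in c]

def pair_exact(min_col, candidates):
    """Return the Max column obtained by swapping the token, or None if absent."""
    expected = min_col.replace("Min", "Max")
    return expected if expected in candidates else None

# Exact pairing: a missing partner shows up as None instead of a wrong match.
exact_pairs = {m: pair_exact(m, max_cols) for m in min_cols}

# Fuzzy pairing for comparison: an empty list means no sufficiently close match.
fuzzy_pairs = {m: difflib.get_close_matches(m, max_cols, n=1) for m in min_cols}

print(exact_pairs)
print(fuzzy_pairs)
```

With exact pairing, a `Min` column without a partner can be skipped or reported explicitly, rather than being paired with an unrelated `Max` column.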


## **Dimensionality Reduction**

In short: we separate the conflict data from the predictive features, as the conflict data should not be reduced. We then create groups for the remaining variables based on the presence of a keyword in the column name. The food features are the exception: they do not necessarily share a keyword, so that group is created at the end from the remaining column names. Finally, we reduce the dimensionality of the features in each group separately and merge the reduced features with the conflict data again. The three improvement points listed in the introduction (non-linear dimensionality-reduced features, missing data, and use of data) apply to this step as well.


### **Group Variables**

In [None]:
# Separate the ACLED conflict data from the features that are to be reduced (which are only the predictive features)
non_predictive_columns = ['fatalities, Riots', 'fatalities, Battles', 'fatalities, Protests',
                          'Battles','Explosions/Remote violence','Protests','Riots','Strategic developments',
                          'Violence against civilians','total_event_types','fatalities, Explosions/Remote violence',
                          'fatalities, Strategic developments','fatalities, Violence against civilians','total_fatalities, ']
predictive_variables = prepared_data.loc[:, ~prepared_data.columns.isin(non_predictive_columns)]

# Set the index to the Admin3 district, because this is text and should not be reduced either.
# Assigning the result (instead of inplace=True on a slice) avoids a SettingWithCopyWarning.
predictive_variables = predictive_variables.set_index('ADM3_EN')

# Define the groups in advance (these are the words that will be looked for in the column names)
names = ['Water_Runoff','Coastal','Landslide','Storm_Surface','Riverine','Precipitation'
             ,'Evapotranspiration','Skin_Reservoir','Evaporation','Soil_Temperature'
             ,'Radiative_Temperature','Soil_Water','LeafArea','Soil_Level','Latent_Heat'
             ,'Soil_Heat','Wind_Speed','Soil_Moisture','Surface_Pressure'
             ,'Vapor_Pressure', 'Groundwater_Runoff','Sensible_Heat','Humidity','Dew'
             ,'Density','Heatwave','Water_Deficit','Surface_Air','Temperature']

In [None]:
# Divide the predictive columns into the pre-defined groups
grouped_columns_category = []  # One list of column names per group
leftover_features = list(predictive_variables.columns)  # Columns not yet assigned to any group
for name in names:
    grouped_features = [col for col in predictive_variables.columns if name in col]  # Columns of this group
    grouped_columns_category.append(grouped_features)
    # Remove the assigned columns from the leftover features
    leftover_features = [feature for feature in leftover_features if feature not in grouped_features]

print('The leftover features are the following food-related features: ', leftover_features)

# The only leftover features are food-related, so add them as a final group named 'Food'
grouped_columns_category.append(leftover_features)
names.append('Food')

The leftover features are the following food-related features:  ['Cropland_Sum', 'Cropland_SD', 'Pasture_Sum', 'Pasture_SD', 'Cattle_Sum', 'Cattle_SD', 'Chicken_Sum', 'Chicken_SD', 'Ducks_Sum', 'Ducks_SD', 'Goats_Sum', 'Goats_SD', 'Pigs_Sum', 'Pigs_SD', 'Sheep_Sum', 'Sheep_SD']
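
Because group membership is decided by substring matching against all predictive columns, a column can land in more than one group: any name containing `Soil_Temperature` also contains `Temperature`. This bears on the first improvement point (not multiplying variables beyond necessity). The sketch below lists such overlaps; the column names are hypothetical and only a subset of the group keywords is used.

```python
from collections import defaultdict

# Hypothetical column names and a subset of the group keywords used above.
columns = ["Soil_Temperature_Mean", "Surface_Air_Temperature_Max", "Wind_Speed_Mean"]
keywords = ["Soil_Temperature", "Surface_Air", "Wind_Speed", "Temperature"]

# Map each column to every keyword it contains.
memberships = defaultdict(list)
for keyword in keywords:
    for col in columns:
        if keyword in col:
            memberships[col].append(keyword)

# Columns that would be fed into more than one t-SNE group.
overlaps = {col: groups for col, groups in memberships.items() if len(groups) > 1}
print(overlaps)
```

Running a check like this on the real column names would show which groups share inputs and could guide the review of the groupings.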


### **Reduce Dimensionality**

In [None]:
class nonlinear_dimensionality_reduction:
  """
  Reduce the dimensionality of a group of variables using t-SNE.
  Applying dimensionality reduction can make causal models more information dense.
  Methods:
      transform_data: scale the selected columns with RobustScaler
      embed_data: embed the scaled columns into one latent dimension with t-SNE
  """

  def __init__(self, data_to_reduce, variables):
    """
    Select the columns that are to be reduced together
    :data_to_reduce (pd.DataFrame) : data to be converted
    :variables (list) : subselection of columns to be reduced together
    """    
    self.data = data_to_reduce[variables]

  def transform_data(self):
    """
    Apply RobustScaler to the data: value = (value - median) / (p75 - p25)
    :return (np.ndarray): scaled data
    """
    transformer = RobustScaler().fit(self.data)
    data_transformed = transformer.transform(self.data)
    return data_transformed

  def embed_data(self):
    """
    Apply t-SNE to reduce the selected columns to 1 latent dimension by minimizing the divergence between two distributions, see
      Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).
    :return (np.ndarray): the reduced data, one column
    :return (float): the Kullback-Leibler divergence after optimization
    """
    embeddingTSNE = TSNE(n_components=1, init='pca', random_state=0, perplexity=50.0, early_exaggeration=12.0, learning_rate=200.0, 
                     n_iter=10000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', verbose=0, method='barnes_hut', 
                     angle=0.5, n_jobs=-1)
    # Scale the data first, then embed the scaled values
    embedded_data = embeddingTSNE.fit_transform(self.transform_data())
    kl_divergence = embeddingTSNE.kl_divergence_
    return embedded_data, kl_divergence
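
The scaling formula in the `transform_data` docstring can be checked by hand. The sketch below applies it to a made-up vector with NumPy only; it illustrates the formula and is not a replacement for `RobustScaler`.

```python
import numpy as np

# Made-up values with one outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(values)
p25, p75 = np.percentile(values, [25, 75])

# value = (value - median) / (p75 - p25)
scaled = (values - median) / (p75 - p25)
print(scaled)
```

The outlier remains large after scaling, but it does not influence the centre and spread used to scale the other values, as it would with mean and standard deviation scaling.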

In [None]:
# Alternative non-linear dimensionality reduction methods were also tried. One stochastic alternative to
# t-distributed stochastic neighbor embedding (t-SNE) is the variational autoencoder (VAE).
# VAEs have more potential, but they are also more complex, and their parameterization can be a source of error.
# Since t-SNE gave better results, we settled on t-SNE.
data_list = []      # Reduced feature (one column) per group
listed_errors = []  # KL divergence of each t-SNE fit
for name, variables in zip(names, grouped_columns_category):
  print(variables, name)
  data_reduction = nonlinear_dimensionality_reduction(prepared_data, variables)
  reduced_data, kl_divergence = data_reduction.embed_data()
  data_list.append(reduced_data)
  listed_errors.append(kl_divergence)
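
The value stored in `listed_errors` for each group is t-SNE's `kl_divergence_`, the Kullback-Leibler divergence between the neighbour-similarity distributions in the original and the embedded space. As a reminder of what this number measures, here is a worked sketch on two made-up discrete distributions. It does not reproduce t-SNE's internal computation.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), skipping terms with p_i == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q_close = [0.4, 0.3, 0.3]   # similar to p
q_far = [0.1, 0.1, 0.8]     # very different from p

print(kl_divergence(p, p))        # identical distributions
print(kl_divergence(p, q_close))
print(kl_divergence(p, q_far))
```

The divergence is zero for identical distributions and grows as they differ, so a lower `kl_divergence_` indicates an embedding that preserves the neighbourhood structure better. Such values are best compared between runs on the same group of columns.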

In [None]:
reduced_features = pd.DataFrame(np.concatenate(data_list, axis=1), columns = names)
reduced_features

Unnamed: 0,Water_Runoff,Coastal,Landslide,Storm_Surface,Riverine,Precipitation,Evapotranspiration,Skin_Reservoir,Evaporation,Soil_Temperature,...,Groundwater_Runoff,Sensible_Heat,Humidity,Dew,Density,Heatwave,Water_Deficit,Surface_Air,Temperature,Food
0,-9.785743,13.071511,-29.065561,-4.913549,6.574868,3.072558,-5.588446,-2.096827,3.877077,-5.948453,...,8.519344,-4.578318,11.598210,4.583045,-152.199020,18.577457,-6.741825,-4.564607,-4.090545,-7.880719e+36
1,-4.451058,4.962664,-23.626535,9.772707,0.807747,8.971795,-1.891566,-8.279265,5.505229,-0.574631,...,-4.825117,-1.902246,2.015493,2.307190,-161.501495,-4.089179,-1.686300,1.700426,2.290412,-7.880719e+36
2,-11.909466,11.712390,-29.582335,0.864251,6.113179,4.052352,-4.988898,-0.585704,3.359438,-5.382941,...,9.714569,-3.752325,-1.404066,4.024570,-150.578903,19.041609,-3.968739,-1.682609,-3.251885,-7.880719e+36
3,7.758137,15.811904,-16.693804,7.044371,10.137346,-7.126973,3.069797,4.485955,-6.747284,6.822594,...,4.045427,6.127324,-6.695104,-3.842552,-147.193863,-7.609262,7.824786,9.714209,8.119581,-7.880719e+36
4,-6.743314,-8.275314,-28.289974,-6.074180,-9.680875,6.680084,-7.311238,-8.077782,4.487056,-10.362889,...,-7.066473,-0.136558,4.283571,7.555860,-150.895477,1.023686,-7.457350,-9.680089,-6.696109,-7.880719e+36
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,-3.214485,2.588127,105.697723,3.687900,-3.637111,0.374221,-1.623459,-8.998659,-9.054413,-7.769559,...,1.862587,-2.797554,4.991419,-0.201390,-150.613419,-0.547425,-11.837461,-8.512124,-1.034662,-7.880719e+36
264,-2.088866,-7.943555,-24.605865,3.153948,-10.300833,-2.507312,0.813194,-7.614503,-5.791650,0.804240,...,-12.758312,5.546475,8.837021,0.647893,-161.825607,5.863933,3.777467,3.053351,3.148316,-7.880719e+36
265,-6.108713,5.823532,106.002380,-9.769992,1.107011,2.518385,-5.584548,0.394428,2.808425,-2.500868,...,-6.738771,-7.869524,1.087065,2.833279,-163.234970,9.920849,-3.587361,-2.612592,-2.351717,-7.880719e+36
266,8.186574,-4.825428,-19.113449,11.183458,1.367947,-9.118014,4.459762,8.672816,-6.498763,9.030070,...,5.193818,6.654664,-6.426256,-7.458625,-154.061096,-9.073239,9.911867,11.274250,10.432647,-7.880719e+36
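
In the output above, every displayed row of the `Food` column has the same value (-7.880719e+36), which suggests that the embedding for this group did not produce a usable feature. A minimal sketch of a sanity check that flags constant or implausibly large reduced columns before they are passed on is given below. The toy frame, the helper `flag_degenerate`, and the threshold of 1e6 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for `reduced_features`; values are made up.
toy_reduced = pd.DataFrame({
    "Precipitation": [3.07, 8.97, 4.05, -7.13],
    "Food": [-7.880719e36] * 4,
})

def flag_degenerate(frame, max_abs=1e6):
    """Return the columns that are constant or contain values beyond max_abs."""
    constant = frame.nunique() <= 1
    too_large = frame.abs().max() > max_abs
    flags = constant | too_large
    return flags[flags].index.tolist()

print(flag_degenerate(toy_reduced))
```

Columns flagged this way could be re-embedded with different settings or excluded before the causal models are fitted.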


In [None]:
# Define non-predictive data
non_predictive_data = prepared_data.loc[:, prepared_data.columns.isin(non_predictive_columns)]

In [None]:
# Concatenate dataset with reduced features to non predictive data set (conflict features)
frames = [reduced_features, non_predictive_data]
df_cross_section = pd.concat(frames, axis = 1)
df_cross_section 

Unnamed: 0,Water_Runoff,Coastal,Landslide,Storm_Surface,Riverine,Precipitation,Evapotranspiration,Skin_Reservoir,Evaporation,Soil_Temperature,...,Strategic developments,Violence against civilians,total_event_types,"fatalities, Battles","fatalities, Explosions/Remote violence","fatalities, Protests","fatalities, Riots","fatalities, Strategic developments","fatalities, Violence against civilians","total_fatalities,"
0,-9.785743,13.071511,-29.065561,-4.913549,6.574868,3.072558,-5.588446,-2.096827,3.877077,-5.948453,...,0,2,6,0,0,0,0,0,2,2
1,-4.451058,4.962664,-23.626535,9.772707,0.807747,8.971795,-1.891566,-8.279265,5.505229,-0.574631,...,1,2,13,3,5,0,0,0,1,9
2,-11.909466,11.712390,-29.582335,0.864251,6.113179,4.052352,-4.988898,-0.585704,3.359438,-5.382941,...,7,7,61,31,30,0,0,0,9,70
3,7.758137,15.811904,-16.693804,7.044371,10.137346,-7.126973,3.069797,4.485955,-6.747284,6.822594,...,1,0,4,0,2,0,0,0,0,2
4,-6.743314,-8.275314,-28.289974,-6.074180,-9.680875,6.680084,-7.311238,-8.077782,4.487056,-10.362889,...,2,5,32,4,0,0,1,0,3,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,-3.214485,2.588127,105.697723,3.687900,-3.637111,0.374221,-1.623459,-8.998659,-9.054413,-7.769559,...,0,0,1,0,0,0,0,0,0,0
264,-2.088866,-7.943555,-24.605865,3.153948,-10.300833,-2.507312,0.813194,-7.614503,-5.791650,0.804240,...,14,2,52,97,61,0,0,3,1,162
265,-6.108713,5.823532,106.002380,-9.769992,1.107011,2.518385,-5.584548,0.394428,2.808425,-2.500868,...,4,5,73,35,4,0,0,0,5,44
266,8.186574,-4.825428,-19.113449,11.183458,1.367947,-9.118014,4.459762,8.672816,-6.498763,9.030070,...,0,0,5,0,9,0,0,0,0,9


In [None]:
# Save the reduced data
df_cross_section.to_csv("FCDO_data_dim_reduced.csv") 

from google.colab import files
files.download('FCDO_data_dim_reduced.csv')
