## Flood Prediction in Malawi:
On 14 March 2019, tropical Cyclone Idai made landfall at the port of Beira, Mozambique, before moving across the region. Millions of people in Malawi, Mozambique and Zimbabwe have been affected by what is the worst natural disaster to hit southern Africa in at least two decades.

In recent decades, countries across Africa have experienced an increase in the frequency and severity of floods. Malawi has been hit with major floods in 2015 and again in 2019. In fact, between 1946 and 2013, floods accounted for 48% of major disasters in Malawi. The Lower Shire Valley in southern Malawi, bordering Mozambique, composed of Chikwawa and Nsanje Districts is the area most prone to flooding.

The objective of this challenge is to build a machine learning model that helps predict the location and extent of floods in southern Malawi.

This competition is sponsored by [Arm](https://www.arm.com/) and [UNICEF](https://www.unicef.org/) as part of the [2030Vision](https://www.2030vision.com/) initiative.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Paths to data
data_path = './raw_data/Train.csv'
submission_path = './raw_data/SampleSubmission.csv'

## Data Cleaning and Analysis

In [3]:
# Custom functions
def read_data(path: str, transpose=False) -> pd.DataFrame:
    """Read a csv file into a pandas DataFrame.
    Args:
    `path`: the path where the csv is located.
    Returns:
    > The number of rows and unique columns;
    > and the top 3 records of the dataset.
    """
    dataset = pd.read_csv(path)
    if(transpose==True):
        return dataset.head(3).T
    return dataset

def analyze_data(data: pd.DataFrame):
    """Computes and returns insights into the data.
    Args:
    data: The dataset to be investigated/analyzed.
    """
    print(f'There are {(dataset.columns.nunique(1))} unique columns and'
          f' {len(dataset)} rows in the dataset\n')
    print('='*100)
    print(f'Null value check:\n {data.isnull().sum()}\n')
    print('='*100)
    print(f'Dtypes info: {data.info()}')
    print('='*100)
    return data.describe().T


In [4]:
# The datasets
dataset = read_data(data_path)
submission = read_data(submission_path)

In [5]:
read_data(data_path, transpose=True)

Unnamed: 0,0,1,2
X,34.26,34.26,34.26
Y,-15.91,-15.9,-15.89
target_2015,0,0,0
elevation,887.764,743.404,565.728
precip 2014-11-16 - 2014-11-23,0,0,0
precip 2014-11-23 - 2014-11-30,0,0,0
precip 2014-11-30 - 2014-12-07,0,0,0
precip 2014-12-07 - 2014-12-14,14.844,14.844,14.844
precip 2014-12-14 - 2014-12-21,14.5528,14.5528,14.5528
precip 2014-12-21 - 2014-12-28,12.2378,12.2378,12.2378


In [6]:
# Train info
analyze_data(dataset)

There are 40 unique columns and 16466 rows in the dataset

Null value check:
 X                                 0
Y                                 0
target_2015                       0
elevation                         0
precip 2014-11-16 - 2014-11-23    0
precip 2014-11-23 - 2014-11-30    0
precip 2014-11-30 - 2014-12-07    0
precip 2014-12-07 - 2014-12-14    0
precip 2014-12-14 - 2014-12-21    0
precip 2014-12-21 - 2014-12-28    0
precip 2014-12-28 - 2015-01-04    0
precip 2015-01-04 - 2015-01-11    0
precip 2015-01-11 - 2015-01-18    0
precip 2015-01-18 - 2015-01-25    0
precip 2015-01-25 - 2015-02-01    0
precip 2015-02-01 - 2015-02-08    0
precip 2015-02-08 - 2015-02-15    0
precip 2015-02-15 - 2015-02-22    0
precip 2015-02-22 - 2015-03-01    0
precip 2015-03-01 - 2015-03-08    0
precip 2015-03-08 - 2015-03-15    0
precip 2019-01-20 - 2019-01-27    0
precip 2019-01-27 - 2019-02-03    0
precip 2019-02-03 - 2019-02-10    0
precip 2019-02-10 - 2019-02-17    0
precip 2019-02-17 - 20

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
X,16466.0,35.077656,0.392395,34.26,34.76,35.05,35.39,35.86
Y,16466.0,-15.813802,0.359789,-16.64,-16.07,-15.8,-15.52,-15.21
target_2015,16466.0,0.076609,0.228734,0.0,0.0,0.0,0.0,1.0
elevation,16466.0,592.848206,354.790357,45.541444,329.063852,623.0,751.434813,2803.303645
precip 2014-11-16 - 2014-11-23,16466.0,1.61076,4.225461,0.0,0.0,0.0,1.261848,19.354969
precip 2014-11-23 - 2014-11-30,16466.0,2.502058,8.631846,0.0,0.0,0.0,0.0,41.023858
precip 2014-11-30 - 2014-12-07,16466.0,1.162076,4.396676,0.0,0.0,0.0,0.0,22.020803
precip 2014-12-07 - 2014-12-14,16466.0,8.27061,4.263375,1.411452,5.54844,7.941822,10.887235,18.870675
precip 2014-12-14 - 2014-12-21,16466.0,8.892459,3.760052,3.580342,5.90544,8.61839,10.960668,23.04434
precip 2014-12-21 - 2014-12-28,16466.0,9.572821,4.523767,1.254098,6.179885,8.78678,12.670775,21.757828


In [7]:
features_new = []
features_old = []
for column in dataset.columns:
    if '2019' not in column:
        features_old.append(column)
    else:
        features_new.append(column)
features_new.extend(['X',	'Y',	'elevation', 'LC_Type1_mode',	'Square_ID'])
train_set = dataset[features_old]
test_set = dataset[features_new]

In [8]:
test_set.head(2)

Unnamed: 0,precip 2019-01-20 - 2019-01-27,precip 2019-01-27 - 2019-02-03,precip 2019-02-03 - 2019-02-10,precip 2019-02-10 - 2019-02-17,precip 2019-02-17 - 2019-02-24,precip 2019-02-24 - 2019-03-03,precip 2019-03-03 - 2019-03-10,precip 2019-03-10 - 2019-03-17,precip 2019-03-17 - 2019-03-24,precip 2019-03-24 - 2019-03-31,...,precip 2019-04-14 - 2019-04-21,precip 2019-04-21 - 2019-04-28,precip 2019-04-28 - 2019-05-05,precip 2019-05-05 - 2019-05-12,precip 2019-05-12 - 2019-05-19,X,Y,elevation,LC_Type1_mode,Square_ID
0,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,...,0.0,0.0,0.0,0.0,0.0,34.26,-15.91,887.764222,9,4e3c3896-14ce-11ea-bce5-f49634744a41
1,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,...,0.0,0.0,0.0,0.0,0.0,34.26,-15.9,743.403912,9,4e3c3897-14ce-11ea-bce5-f49634744a41


## Modelling and Evaluation

In [9]:
# splitting dataset into train&test
target = train_set.pop('target_2015')
#Realigning the train&test sets
train_set, test_set = train_set.align(test_set, join='inner', axis=1)
Id = test_set['Square_ID']

train_set.drop(['Square_ID'], axis=1, inplace=True)
test_set.drop(['Square_ID'], axis=1, inplace=True)

In [10]:
scorer = make_scorer(mean_squared_error)
pipe = make_pipeline(StandardScaler(),RandomForestRegressor(random_state=42))

param_grid = {'randomforestregressor__n_estimators': [50, 100],
              'randomforestregressor__min_samples_leaf': [1, 5],
              'randomforestregressor__n_jobs': [-1]}
gs = GridSearchCV(pipe, param_grid, cv=8, scoring=scorer)

gs.fit(train_set, target)

GridSearchCV(cv=8, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('standardscaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('randomforestregressor',
                                        RandomForestRegressor(bootstrap=True,
                                                              criterion='mse',
                                                              max_depth=None,
                                                              max_features='auto',
                                                              max_leaf_nodes=None,
                                                              min_impurity_decrease=0.0,
                                                              min_impurity_split=

In [11]:
#evaluation
y_hat = gs.predict(train_set)
error = np.sqrt(mean_squared_error(target, y_hat))
error

0.03395380373621893

In [12]:
train_set.shape

(16466, 4)

### Predictions

In [13]:
predictions = gs.predict(test_set)
predictions.shape

(16466,)

In [14]:
submission['Square_ID'] = Id
submission['target_2019'] = predictions
submission.to_csv('submission.csv', index=False) #0.163846

In [15]:
submission.min(), submission.max()

(Square_ID      4e3c3896-14ce-11ea-bce5-f49634744a41
 target_2019                                       0
 dtype: object, Square_ID      4e6f5e01-14ce-11ea-bce5-f49634744a41
 target_2019                                       1
 dtype: object)