<a href="https://colab.research.google.com/github/DariusTheGeek/Flood-Prediction-in-Malawi--Zindi-Competition/blob/master/Malawi_Flood_Prediction__starter_code__by_DariusMoruri.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Starter Code for Flood Prediction in Malawi
### Author: [Darius Moruri](https://www.linkedin.com/in/dariusmoruri/)


---

 - This is a simple starter code to get you going for the [Zindi flood prediction competition](https://zindi.africa/competitions/2030-vision-flood-prediction-in-malawi)
 - As it is just a basic machine learning pipeline, the following aspects haven't been covered:
    - Exploratory Data Analysis
    - Feature Engineering
    - Feature Selection
    - Hyperparameter Tuning
    - Model Evaluation
    - Model interpretation
    - Sourcing more data
    - Documentation and Presentation

*Despite its basic approach, this starter code yielded a satisfactory RMSE of **0.11866** and a **top 15 ranking** (at the time of writing) on the [public leaderboard](https://zindi.africa/competitions/2030-vision-flood-prediction-in-malawi/leaderboard).*

## Context
On 14 March 2019, Tropical Cyclone Idai made landfall at the port of Beira, Mozambique, before moving across the region. Millions of people in Malawi, Mozambique and Zimbabwe have been affected by what is the worst natural disaster to hit southern Africa in at least two decades.

In recent decades, countries across Africa have experienced an increase in the frequency and severity of floods. Malawi was hit by major floods in 2015 and again in 2019. In fact, between 1946 and 2013, floods accounted for 48% of major disasters in Malawi. The Lower Shire Valley in southern Malawi, which borders Mozambique and comprises Chikwawa and Nsanje Districts, is the area most prone to flooding.

The objective of this challenge is to build a machine learning model that helps predict the location and extent of floods in southern Malawi.


## Data
The training data for this competition can be found [here](https://drive.google.com/file/d/13PmGuIpBbgc-BaDeXxR8-i-9E3oGZYY0/view?usp=sharing)
and a sample of the submission file can be found [here](https://drive.google.com/file/d/1HBdLXuiXkhRHDoPSUUpbvw6Eh5OredLy/view?usp=sharing).

## Evaluation
The error metric for this competition is the Root Mean Squared Error (RMSE).
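
As a minimal sketch of the metric: RMSE is the square root of the mean squared difference between the actual and predicted values. The `rmse` helper and the sample values below are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical flood fractions for four grid squares
actual = [0.0, 0.5, 1.0, 0.0]
predicted = [0.1, 0.4, 0.8, 0.0]

score = rmse(actual, predicted)
print(score)
```

A lower value is better, and a perfect prediction gives an RMSE of zero.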



## Importing the Necessary Libraries

In [0]:
# Importing libraries
#
import pandas as pd
import numpy as np
import requests
from io import StringIO 
import warnings
warnings.filterwarnings('ignore')

## Reading the Data

In [0]:
# Google drive links to shared submission and training datasets
#
submission = 'https://drive.google.com/file/d/1HBdLXuiXkhRHDoPSUUpbvw6Eh5OredLy/view?usp=sharing'
train = 'https://drive.google.com/file/d/13PmGuIpBbgc-BaDeXxR8-i-9E3oGZYY0/view?usp=sharing'


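# Creating a function to read a csv file shared via Google Drive
#
def read_csv(url):
  # Building a direct-download link from the file id in the share link
  url = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
  response = requests.get(url)
  response.raise_for_status()  # Failing early if the download did not succeed
  return pd.read_csv(StringIO(response.text))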

# Creating submission and training dataframes
#
sub = read_csv(submission)
df = read_csv(train)
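
The link handling above can be checked without any network access. The sketch below isolates the step that turns a Google Drive share link into a direct-download link; `drive_download_url` is a hypothetical helper name used only for illustration, and it assumes the share link has the `.../file/d/<id>/view` shape used in this notebook.

```python
def drive_download_url(share_url):
    """Convert a '.../file/d/<id>/view?...' share link into a direct-download link."""
    # The file id is the second-to-last segment when splitting on '/'
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?export=download&id=' + file_id

share = 'https://drive.google.com/file/d/13PmGuIpBbgc-BaDeXxR8-i-9E3oGZYY0/view?usp=sharing'
direct = drive_download_url(share)
print(direct)
```

Links in a different shape (for example, ones that carry the id as a query parameter) would need different parsing.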

## Basic Data Analysis

In [3]:
# Previewing the first five rows of the dataframe
#
df.head()

Unnamed: 0,X,Y,target_2015,elevation,precip 2014-11-16 - 2014-11-23,precip 2014-11-23 - 2014-11-30,precip 2014-11-30 - 2014-12-07,precip 2014-12-07 - 2014-12-14,precip 2014-12-14 - 2014-12-21,precip 2014-12-21 - 2014-12-28,precip 2014-12-28 - 2015-01-04,precip 2015-01-04 - 2015-01-11,precip 2015-01-11 - 2015-01-18,precip 2015-01-18 - 2015-01-25,precip 2015-01-25 - 2015-02-01,precip 2015-02-01 - 2015-02-08,precip 2015-02-08 - 2015-02-15,precip 2015-02-15 - 2015-02-22,precip 2015-02-22 - 2015-03-01,precip 2015-03-01 - 2015-03-08,precip 2015-03-08 - 2015-03-15,precip 2019-01-20 - 2019-01-27,precip 2019-01-27 - 2019-02-03,precip 2019-02-03 - 2019-02-10,precip 2019-02-10 - 2019-02-17,precip 2019-02-17 - 2019-02-24,precip 2019-02-24 - 2019-03-03,precip 2019-03-03 - 2019-03-10,precip 2019-03-10 - 2019-03-17,precip 2019-03-17 - 2019-03-24,precip 2019-03-24 - 2019-03-31,precip 2019-03-31 - 2019-04-07,precip 2019-04-07 - 2019-04-14,precip 2019-04-14 - 2019-04-21,precip 2019-04-21 - 2019-04-28,precip 2019-04-28 - 2019-05-05,precip 2019-05-05 - 2019-05-12,precip 2019-05-12 - 2019-05-19,LC_Type1_mode,Square_ID
0,34.26,-15.91,0.0,887.764222,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0,9,4e3c3896-14ce-11ea-bce5-f49634744a41
1,34.26,-15.9,0.0,743.403912,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0,9,4e3c3897-14ce-11ea-bce5-f49634744a41
2,34.26,-15.89,0.0,565.728343,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0,9,4e3c3898-14ce-11ea-bce5-f49634744a41
3,34.26,-15.88,0.0,443.392774,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0,10,4e3c3899-14ce-11ea-bce5-f49634744a41
4,34.26,-15.87,0.0,437.443428,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0,10,4e3c389a-14ce-11ea-bce5-f49634744a41


In [4]:
# Previewing the last five rows of the dataframe
#
df.tail()

Unnamed: 0,X,Y,target_2015,elevation,precip 2014-11-16 - 2014-11-23,precip 2014-11-23 - 2014-11-30,precip 2014-11-30 - 2014-12-07,precip 2014-12-07 - 2014-12-14,precip 2014-12-14 - 2014-12-21,precip 2014-12-21 - 2014-12-28,precip 2014-12-28 - 2015-01-04,precip 2015-01-04 - 2015-01-11,precip 2015-01-11 - 2015-01-18,precip 2015-01-18 - 2015-01-25,precip 2015-01-25 - 2015-02-01,precip 2015-02-01 - 2015-02-08,precip 2015-02-08 - 2015-02-15,precip 2015-02-15 - 2015-02-22,precip 2015-02-22 - 2015-03-01,precip 2015-03-01 - 2015-03-08,precip 2015-03-08 - 2015-03-15,precip 2019-01-20 - 2019-01-27,precip 2019-01-27 - 2019-02-03,precip 2019-02-03 - 2019-02-10,precip 2019-02-10 - 2019-02-17,precip 2019-02-17 - 2019-02-24,precip 2019-02-24 - 2019-03-03,precip 2019-03-03 - 2019-03-10,precip 2019-03-10 - 2019-03-17,precip 2019-03-17 - 2019-03-24,precip 2019-03-24 - 2019-03-31,precip 2019-03-31 - 2019-04-07,precip 2019-04-07 - 2019-04-14,precip 2019-04-14 - 2019-04-21,precip 2019-04-21 - 2019-04-28,precip 2019-04-28 - 2019-05-05,precip 2019-05-05 - 2019-05-12,precip 2019-05-12 - 2019-05-19,LC_Type1_mode,Square_ID
16461,35.86,-15.44,0.0,635.675022,16.956563,31.155531,12.882013,8.810145,6.179829,9.863685,15.765685,21.457507,105.275891,3.645338,18.531483,13.816063,23.728058,8.794998,9.369763,21.428131,2.493683,8.760326,5.177616,12.450319,17.289942,19.612179,10.909635,64.494171,15.940852,24.828982,11.335339,30.984762,0.518269,5.770066,14.839779,4.928294,10.526186,18.746072,10,4e6f5dfd-14ce-11ea-bce5-f49634744a41
16462,35.86,-15.43,0.0,632.598892,16.956563,31.155531,12.882013,8.810145,6.179829,9.863685,15.765685,21.457507,105.275891,3.645338,18.531483,13.816063,23.728058,8.794998,9.369763,21.428131,2.493683,8.760326,5.177616,12.450319,17.289942,19.612179,10.909635,64.494171,15.940852,24.828982,11.335339,30.984762,0.518269,5.770066,14.839779,4.928294,10.526186,18.746072,10,4e6f5dfe-14ce-11ea-bce5-f49634744a41
16463,35.86,-15.42,0.0,632.450136,16.956563,31.155531,12.882013,8.810145,6.179829,9.863685,15.765685,21.457507,105.275891,3.645338,18.531483,13.816063,23.728058,8.794998,9.369763,21.428131,2.493683,8.760326,5.177616,12.450319,17.289942,19.612179,10.909635,64.494171,15.940852,24.828982,11.335339,30.984762,0.518269,5.770066,14.839779,4.928294,10.526186,18.746072,10,4e6f5dff-14ce-11ea-bce5-f49634744a41
16464,35.86,-15.41,0.0,629.272733,16.956563,31.155531,12.882013,8.810145,6.179829,9.863685,15.765685,21.457507,105.275891,3.645338,18.531483,13.816063,23.728058,8.794998,9.369763,21.428131,2.493683,8.760326,5.177616,12.450319,17.289942,19.612179,10.909635,64.494171,15.940852,24.828982,11.335339,30.984762,0.518269,5.770066,14.839779,4.928294,10.526186,18.746072,10,4e6f5e00-14ce-11ea-bce5-f49634744a41
16465,35.86,-15.4,0.0,626.164641,16.956563,31.155531,12.882013,8.810145,6.179829,9.863685,15.765685,21.457507,105.275891,3.645338,18.531483,13.816063,23.728058,8.794998,9.369763,21.428131,2.493683,8.760326,5.177616,12.450319,17.289942,19.612179,10.909635,64.494171,15.940852,24.828982,11.335339,30.984762,0.518269,5.770066,14.839779,4.928294,10.526186,18.746072,10,4e6f5e01-14ce-11ea-bce5-f49634744a41


In [5]:
# Previewing some statistical summaries of the dataframe
# Transposing for a better view
#
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
X,16466.0,35.077656,0.392395,34.26,34.76,35.05,35.39,35.86
Y,16466.0,-15.813802,0.359789,-16.64,-16.07,-15.8,-15.52,-15.21
target_2015,16466.0,0.076609,0.228734,0.0,0.0,0.0,0.0,1.0
elevation,16466.0,592.848206,354.790357,45.541444,329.063852,623.0,751.434813,2803.303645
precip 2014-11-16 - 2014-11-23,16466.0,1.61076,4.225461,0.0,0.0,0.0,1.261848,19.354969
precip 2014-11-23 - 2014-11-30,16466.0,2.502058,8.631846,0.0,0.0,0.0,0.0,41.023858
precip 2014-11-30 - 2014-12-07,16466.0,1.162076,4.396676,0.0,0.0,0.0,0.0,22.020803
precip 2014-12-07 - 2014-12-14,16466.0,8.27061,4.263375,1.411452,5.54844,7.941822,10.887235,18.870675
precip 2014-12-14 - 2014-12-21,16466.0,8.892459,3.760052,3.580342,5.90544,8.61839,10.960668,23.04434
precip 2014-12-21 - 2014-12-28,16466.0,9.572821,4.523767,1.254098,6.179885,8.78678,12.670775,21.757828


#### Shape and Size

In [6]:
# Checking for the shape and size of the dataframe
#
df.shape, df.size

((16466, 40), 658640)

#### Missing Values

In [7]:
# Checking for missing values
#
df.isnull().sum().any()

False

#### Duplicated Values

In [8]:
# Checking for duplicates
#
df.duplicated().any()

False

#### Data Types

In [9]:
# Checking if the columns are represented with the appropriate datatypes
#
df.dtypes

X                                 float64
Y                                 float64
target_2015                       float64
elevation                         float64
precip 2014-11-16 - 2014-11-23    float64
precip 2014-11-23 - 2014-11-30    float64
precip 2014-11-30 - 2014-12-07    float64
precip 2014-12-07 - 2014-12-14    float64
precip 2014-12-14 - 2014-12-21    float64
precip 2014-12-21 - 2014-12-28    float64
precip 2014-12-28 - 2015-01-04    float64
precip 2015-01-04 - 2015-01-11    float64
precip 2015-01-11 - 2015-01-18    float64
precip 2015-01-18 - 2015-01-25    float64
precip 2015-01-25 - 2015-02-01    float64
precip 2015-02-01 - 2015-02-08    float64
precip 2015-02-08 - 2015-02-15    float64
precip 2015-02-15 - 2015-02-22    float64
precip 2015-02-22 - 2015-03-01    float64
precip 2015-03-01 - 2015-03-08    float64
precip 2015-03-08 - 2015-03-15    float64
precip 2019-01-20 - 2019-01-27    float64
precip 2019-01-27 - 2019-02-03    float64
precip 2019-02-03 - 2019-02-10    

## Data Cleaning

#### Separating the train and test sets 

In [0]:
# Creating lists of columns to be used in separating the dataframe into training and testing datasets
# Using a for loop for efficiency
#
precip_features_2019 = []
precip_features_2015 = []
for col in df.columns:
  if '2019' in col:
    precip_features_2019.append(col)
  elif 'precip 2014' in col:
    precip_features_2015.append(col)
  elif 'precip 2015' in col:
    precip_features_2015.append(col)

In [11]:
# Separating the train dataset from the main dataframe
# Taking a copy so that later in-place changes do not act on a view of df
#
train = df[df.columns.difference(precip_features_2019)].copy()

# Previewing the first two rows of the train dataset
#
train.head(2)

Unnamed: 0,LC_Type1_mode,Square_ID,X,Y,elevation,precip 2014-11-16 - 2014-11-23,precip 2014-11-23 - 2014-11-30,precip 2014-11-30 - 2014-12-07,precip 2014-12-07 - 2014-12-14,precip 2014-12-14 - 2014-12-21,precip 2014-12-21 - 2014-12-28,precip 2014-12-28 - 2015-01-04,precip 2015-01-04 - 2015-01-11,precip 2015-01-11 - 2015-01-18,precip 2015-01-18 - 2015-01-25,precip 2015-01-25 - 2015-02-01,precip 2015-02-01 - 2015-02-08,precip 2015-02-08 - 2015-02-15,precip 2015-02-15 - 2015-02-22,precip 2015-02-22 - 2015-03-01,precip 2015-03-01 - 2015-03-08,precip 2015-03-08 - 2015-03-15,target_2015
0,9,4e3c3896-14ce-11ea-bce5-f49634744a41,34.26,-15.91,887.764222,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,0.0
1,9,4e3c3897-14ce-11ea-bce5-f49634744a41,34.26,-15.9,743.403912,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,0.0


In [12]:
# Separating the test dataset from the main dataframe
# Taking a copy so that later in-place changes do not act on a view of df
#
precip_features_2019.extend(['X', 'Y', 'elevation', 'LC_Type1_mode', 'Square_ID'])
test = df[precip_features_2019].copy()

# Previewing the first two rows of the test dataset
#
test.head(2)

Unnamed: 0,precip 2019-01-20 - 2019-01-27,precip 2019-01-27 - 2019-02-03,precip 2019-02-03 - 2019-02-10,precip 2019-02-10 - 2019-02-17,precip 2019-02-17 - 2019-02-24,precip 2019-02-24 - 2019-03-03,precip 2019-03-03 - 2019-03-10,precip 2019-03-10 - 2019-03-17,precip 2019-03-17 - 2019-03-24,precip 2019-03-24 - 2019-03-31,precip 2019-03-31 - 2019-04-07,precip 2019-04-07 - 2019-04-14,precip 2019-04-14 - 2019-04-21,precip 2019-04-21 - 2019-04-28,precip 2019-04-28 - 2019-05-05,precip 2019-05-05 - 2019-05-12,precip 2019-05-12 - 2019-05-19,X,Y,elevation,LC_Type1_mode,Square_ID
0,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0,34.26,-15.91,887.764222,9,4e3c3896-14ce-11ea-bce5-f49634744a41
1,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0,34.26,-15.9,743.403912,9,4e3c3897-14ce-11ea-bce5-f49634744a41


#### Renaming columns

In [0]:
# Creating a dictionary of column names to be renamed for the training dataset
# The column names are renamed for convenience
#
new_2015_cols = {}
for col, number in zip(precip_features_2015, range(1, len(precip_features_2015) + 1)):
  if 'precip' in col:
    new_2015_cols[col] = 'week_' + str(number) + '_precip'

    
# Creating a dictionary of column names to be renamed for the testing dataset
#
new_2019_cols = {}
for col, number in zip(precip_features_2019, range(1, len(precip_features_2019) + 1)):
  if 'precip' in col:
    new_2019_cols[col] = 'week_' + str(number) + '_precip'
    
# Renaming the columns
#
train.rename(columns = new_2015_cols, inplace = True)
test.rename(columns = new_2019_cols, inplace = True)
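
As a compact illustration of the renaming step, the mapping can be built with `enumerate` and a dict comprehension. The toy frame and the names `precip_cols`, `mapping` and `renamed` below are assumptions made for this sketch only.

```python
import pandas as pd

toy = pd.DataFrame({'precip 2019-01-20 - 2019-01-27': [12.9],
                    'precip 2019-01-27 - 2019-02-03': [4.6],
                    'elevation': [887.8]})

# Only the precipitation columns are renamed; their order defines the week number
precip_cols = [c for c in toy.columns if c.startswith('precip')]
mapping = {col: f'week_{i}_precip' for i, col in enumerate(precip_cols, start=1)}

# rename returns a new frame, leaving the original untouched
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))
```

Giving the 2015 and 2019 precipitation columns the same `week_N_precip` names is what lets a model trained on one season be applied to the other.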

In [14]:
# Previewing the first three rows of the cleaned train set
#
train.head(3)

Unnamed: 0,LC_Type1_mode,Square_ID,X,Y,elevation,week_1_precip,week_2_precip,week_3_precip,week_4_precip,week_5_precip,week_6_precip,week_7_precip,week_8_precip,week_9_precip,week_10_precip,week_11_precip,week_12_precip,week_13_precip,week_14_precip,week_15_precip,week_16_precip,week_17_precip,target_2015
0,9,4e3c3896-14ce-11ea-bce5-f49634744a41,34.26,-15.91,887.764222,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,0.0
1,9,4e3c3897-14ce-11ea-bce5-f49634744a41,34.26,-15.9,743.403912,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,0.0
2,9,4e3c3898-14ce-11ea-bce5-f49634744a41,34.26,-15.89,565.728343,0.0,0.0,0.0,14.844025,14.552823,12.237766,57.451361,30.127047,30.449468,1.521829,29.389995,32.878318,8.179804,0.963981,16.659097,3.304466,0.0,0.0


#### Re-aligning the Train and Test Datasets

In [15]:
# Separating the target variable
#
target = train.target_2015


# Aligning the training and testing datasets
#
train, test = train.align(test, join = 'inner', axis = 1)


# Previewing the first three rows of the cleaned and realigned test set
#
test.head(3)

Unnamed: 0,LC_Type1_mode,Square_ID,X,Y,elevation,week_1_precip,week_2_precip,week_3_precip,week_4_precip,week_5_precip,week_6_precip,week_7_precip,week_8_precip,week_9_precip,week_10_precip,week_11_precip,week_12_precip,week_13_precip,week_14_precip,week_15_precip,week_16_precip,week_17_precip
0,9,4e3c3896-14ce-11ea-bce5-f49634744a41,34.26,-15.91,887.764222,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0
1,9,4e3c3897-14ce-11ea-bce5-f49634744a41,34.26,-15.9,743.403912,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0
2,9,4e3c3898-14ce-11ea-bce5-f49634744a41,34.26,-15.89,565.728343,12.99262,4.582856,35.037532,4.796012,28.083314,0.0,58.362456,18.264692,17.537486,0.896323,1.68,0.0,0.0,0.0,0.0,0.0,0.0
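
To see what `align` with `join = 'inner'` and `axis = 1` does, here is a sketch on two toy frames (the names and values are illustrative). Only the columns present in both frames are kept, and both results end up with the same column order.

```python
import pandas as pd

toy_train = pd.DataFrame({'elevation': [887.8], 'week_1_precip': [0.0], 'target_2015': [0.0]})
toy_test = pd.DataFrame({'week_1_precip': [12.9], 'elevation': [887.8], 'Square_ID': ['a']})

# Keep only the shared columns, in the same order for both frames
aligned_train, aligned_test = toy_train.align(toy_test, join='inner', axis=1)

print(list(aligned_train.columns))
print(list(aligned_test.columns))
```

This is why `target_2015` has to be saved into `target` before aligning: it exists only in the training frame and is dropped by the inner join.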


## Model Selection

In [16]:
# Installing catboost
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/3d/f6/733fe7cca5d0d882e1a708ad59da2510416cc2e4fa54e17c7a5082f67811/catboost-0.20.1-cp36-none-manylinux1_x86_64.whl (63.6MB)
Installing collected packages: catboost
Successfully installed catboost-0.20.1


#### Comparing different models to find the most accurate

In [17]:
# Using different models to find the optimal model
#
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
import warnings
warnings.filterwarnings('ignore')


# Creating a list of regressor algorithms to compare
#
models = [RandomForestRegressor(), GradientBoostingRegressor(), AdaBoostRegressor(), DecisionTreeRegressor(), XGBRegressor(objective ='reg:squarederror'),
          SVR(), KNeighborsRegressor(), LinearRegression(), CatBoostRegressor(logging_level='Silent')]


# Creating one list per algorithm to store the neg_mean_squared_error score of each fold
# The SVR list is named SVR_scores so that it does not shadow the imported SVR class
#
RandomForest, GradientBoosting, AdaBoost, DecisionTree, XGB, SVR_scores, KNeighbors, Linear, Cat = ([] for x in range(9))


# Creating a list containing the list of each algorithm. Created for easy iteration
#
model_list = [RandomForest, GradientBoosting, AdaBoost, DecisionTree, XGB, SVR_scores, KNeighbors, Linear, Cat]


# Splitting the data into features and the target variable
#
X = train.drop('Square_ID', axis = 1)
y = target


# Creating a cross validation of 10 folds
# random_state is left out: it has no effect without shuffle=True, and recent scikit-learn versions raise an error for that combination
#
kfold  = KFold(n_splits=10)


# Iterating through each model and appending the scores of each fold to the appropriate list
#
for i, j in zip(models, model_list):
  j.extend(list(cross_val_score(i, X, y, scoring = 'neg_mean_squared_error', cv = kfold)))


# Creating a function to convert neg_mean_squared_error scores to root mean squared errors
#
def sq(lis):
  return list(np.sqrt(-np.array(lis)))


# Creating a dataframe of all the rmses from the iterations for each model
#
rmses = pd.DataFrame({'Fold': np.arange(1, 11), 'RandomForest': sq(RandomForest), 'GradientBoosting': sq(GradientBoosting), 'Adaboost': sq(AdaBoost), 'DecisionTree': sq(DecisionTree),
                       'XGB': sq(XGB), 'SVR': sq(SVR_scores), 'Kneighbors': sq(KNeighbors), 'Linear': sq(Linear), 'Cat': sq(Cat)})

# Setting the index
#
rmses.set_index('Fold', inplace = True)


# Calculating the mean and standard deviation rmse of each algorithm
# Both are computed from the ten fold rows before being added, so the std does not include the mean row
#
fold_mean, fold_std = rmses.mean(), rmses.std()
rmses.loc['mean'] = fold_mean
rmses.loc['std'] = fold_std


# Previewing the rmses dataframe
#
rmses

Fold,RandomForest,GradientBoosting,Adaboost,DecisionTree,XGB,SVR,Kneighbors,Linear,Cat
1,0.085937,0.084569,0.091366,0.085926,0.085266,0.130031,0.086023,0.135525,0.084636
2,0.073427,0.059387,0.0883,0.089046,0.058436,0.109598,0.058385,0.089754,0.062418
3,0.11261,0.089913,0.091947,0.141038,0.088311,0.127553,0.096783,0.121761,0.094261
4,0.159949,0.166018,0.213543,0.198635,0.170162,0.198225,0.19184,0.26274,0.160362
5,0.160206,0.179172,0.218934,0.224328,0.176187,0.206909,0.177272,0.336841,0.162315
6,0.109505,0.118684,0.149015,0.133493,0.118987,0.148906,0.109368,0.231556,0.109079
7,0.058981,0.056948,0.081455,0.064764,0.05873,0.112576,0.059242,0.153379,0.052203
8,0.157463,0.102168,0.124836,0.15754,0.10114,0.195707,0.184638,0.115595,0.145777
9,0.246438,0.264649,0.269114,0.272945,0.260912,0.276731,0.324014,0.341167,0.26018
10,0.224625,0.225823,0.366404,0.315316,0.227395,0.236619,0.230469,0.37066,0.216168
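
The conversion from `neg_mean_squared_error` scores to per-fold RMSE can be sketched on synthetic data. Everything below (the random data, `X_toy`, `y_toy`, the choice of `LinearRegression`) is an assumption for illustration and is unrelated to the competition data. The `neg_root_mean_squared_error` scorer, available in scikit-learn 0.22 and later, gives the same numbers with the sign flipped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: linear signal plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

kfold = KFold(n_splits=10)

# scikit-learn scorers follow "higher is better", so MSE is reported negated
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          scoring='neg_mean_squared_error', cv=kfold)
rmse_per_fold = np.sqrt(-neg_mse)

# Equivalent shortcut: the scorer returns the negated RMSE directly
neg_rmse = cross_val_score(LinearRegression(), X_toy, y_toy,
                           scoring='neg_root_mean_squared_error', cv=kfold)

print(rmse_per_fold.mean(), rmse_per_fold.std(ddof=1))
```

In the table above, the RMSE varies considerably from fold to fold for every model, so comparing the mean together with the spread is more informative than looking at any single fold.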


#### Selecting the top three models with the lowest RMSE

In [18]:
# Checking for the regressor with minimum root mean squared error
#
rmses.loc['mean'].idxmin(), rmses.loc['mean'].min()

('XGB', 0.13455272135801205)

In [19]:
# Arranging the models in ascending order
#
rmses.loc['mean'].sort_values()

XGB                 0.134553
GradientBoosting    0.134733
Cat                 0.134740
RandomForest        0.138914
Kneighbors          0.151803
DecisionTree        0.168303
Adaboost            0.169491
SVR                 0.174286
Linear              0.215898
Name: mean, dtype: float64
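
The three best models can also be picked programmatically rather than read off the sorted output. A minimal sketch, using the mean RMSE values printed above re-entered by hand into an illustrative `mean_rmse` Series:

```python
import pandas as pd

# Mean RMSE per model, copied from the output above
mean_rmse = pd.Series({'XGB': 0.134553, 'GradientBoosting': 0.134733, 'Cat': 0.134740,
                       'RandomForest': 0.138914, 'Kneighbors': 0.151803,
                       'DecisionTree': 0.168303, 'Adaboost': 0.169491,
                       'SVR': 0.174286, 'Linear': 0.215898}, name='mean')

# The three models with the smallest mean RMSE
top_three = mean_rmse.nsmallest(3).index.tolist()
print(top_three)
```

In the notebook itself, the equivalent call would be `rmses.loc['mean'].nsmallest(3)`. Note that the top three differ by less than 0.0002 in mean RMSE, which is small compared with the fold-to-fold variation.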

## Training the top three models and making predictions

In [0]:
# Using the top three models; XGBoost, Catboost and Gradientboost to train and make predictions
# Creating a list of models to use
models = [XGBRegressor(objective ='reg:squarederror'), CatBoostRegressor(logging_level='Silent'), GradientBoostingRegressor()]
model_names = ['xgboost', 'catboost', 'gradientboost']


# Selecting the training features and the target feature
#
X = train.drop('Square_ID', axis = 1)
y = target


# Submission dataset
#
sub = test.drop('Square_ID', axis = 1)


# Using a for loop to create a submission file for each model
#
for model, model_name in zip(models, model_names):
  model.fit(X, y)                     # Training the model on the 2015 data
  predictions = model.predict(sub)    # Making predictions from the 2019 features
  submission_df = pd.DataFrame({'Square_ID': test.Square_ID, 'target_2019': predictions}) # Creating a submission file
  submission_df.to_csv(model_name + '_baseline.csv', index = False)

*When submitted, the models yielded the following Root Mean Squared Errors:*
 - XGBRegressor: 0.250710809791906
 - **CatBoostRegressor: 0.118661182373564**
 - GradientBoostingRegressor: 0.608857842367698

The CatBoostRegressor was the most accurate, with an RMSE of 0.118661182373564. Although the three models were almost tied in cross-validation, their submission scores differ widely.

## Next Steps
To further improve the accuracy of the model, the following should be considered:
  - A thorough Exploratory Data Analysis
  - Feature Engineering
  - Feature Selection
  - Hyperparameter Tuning
  - Model Evaluation
  - Model interpretation
  - Sourcing more data
  
For any suggestions or clarifications, feel free to reach out via [Darius Moruri - LinkedIn](https://www.linkedin.com/in/dariusmoruri/).
