<h2><center>Welcome to the Landslide Prediction Challenge</center></h2>

A landslide is the movement of a mass of rock, debris, or earth (soil) down a slope. As a common natural hazard, it can cause significant loss of life and property.


Hong Kong, one of the world's hilly and densely populated cities, is frequently affected by extreme rainstorms, making it highly susceptible to rain-induced natural terrain landslides.

<img src = "https://drive.google.com/uc?export=view&id=1-8sSI75AG3HM89nDJEwo6_KJbAEUXS-r">

Landslides are commonly identified by visual interpretation, which is labor-intensive and time-consuming.

***Thus, this hack will focus on automating the landslide identification process using artificial intelligence techniques.***

This will be achieved by using high-resolution terrain information to perform terrain-based landslide identification. Other auxiliary data, such as the lithology of the surface materials and the rainfall intensification factor, are also provided.


Table of contents:

1. [Import relevant libraries](#Libraries)
2. [Load files](#Load)
3. [Preview files](#Preview)
4. [Data dictionary](#Dictionary)
5. [Data exploration](#Exploration)
6. [Target variable distribution](#Target)
7. [Outliers](#Outliers)
8. [Correlations](#Correlations)
9. [Model training](#Model)
10. [Test set predictions](#Predictions)
11. [Creating a submission file](#Submission)
12. [Tips to improve model performance](#Tips)

<a name = "Libraries"></a>
## 1. Import relevant libraries

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import f1_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

<a name = "Load"></a>
## 2. Load files

In [None]:
# Read files to pandas dataframes
train = pd.read_csv('Dataset/Train.csv')
test = pd.read_csv('Dataset/Test.csv')
sample_submission = pd.read_csv('Dataset/Sample submission.csv')

<a name = "Preview"></a>
## 3. Preview files

In [None]:
# Check the first five rows of the train set
train.head()

In [None]:
# Check the first five rows of the test set
test.head()

In [None]:
# Check what the submission file should look like
sample_submission.head()

<a name = "Dictionary"></a>
## 4. Data dictionary
<figure>
<img src = "https://drive.google.com/uc?export=view&id=1T_XBSH6ozmhGiDz_nL4bQvvonHUpbCfW" height = "200">
<img src = "https://drive.google.com/uc?export=view&id=13nSrrIowiFPjAgiR--Nd4cHLVwvXFaFj" height = "400">
</figure>

In [None]:
# Check shape and size of train and test set
train.shape, test.shape, sample_submission.shape

<a name = "Exploration"></a>
## 5. Data exploration

In [None]:
# Check statistical summaries of the train set
train.describe()

 - There is a very high correlation between the same feature measured in different cells of one location (for example, `1_elevation` through `5_elevation`, plotted below)

In [None]:
# Elevation correlations
# sns.pairplot creates its own figure, so a separate plt.figure call would only add an empty one
sample_elevations = ['1_elevation', '2_elevation', '3_elevation', '4_elevation', '5_elevation']
sns.pairplot(train[sample_elevations], kind="scatter", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

In [None]:
# Check statistical summaries of the test set
test.describe()

In [None]:
# Check for any missing values
train.isnull().sum().any(), test.isnull().sum().any()

In [None]:
# Check for duplicates
train.duplicated().any(), test.duplicated().any()

<a name = "Target"></a>
## 6. Target variable distribution

In [None]:
# Check distribution of the target variable
train.Label.value_counts(normalize = True)

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(7, 6))
sns.countplot(x= train.Label)
plt.title('Target Variable Distribution')
plt.show()

The dataset is imbalanced: the majority class makes up about 75% of the samples and the minority class about 25%.

Some techniques for handling class imbalance include:
 1. Using SMOTE to create synthetic minority-class samples
 2. Undersampling the majority class
 3. Oversampling the minority class
 4. Giving more weight to the minority class during modelling
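
As a minimal sketch of options 3 and 4, the snippet below works on a small synthetic frame rather than the competition data. The names `toy`, `balanced` and `weighted_model` are hypothetical, and the 150/50 split is an assumption chosen to mimic the 75/25 ratio. It uses `sklearn.utils.resample` for random oversampling and the `class_weight` parameter of `RandomForestClassifier` for class weighting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

# Toy data standing in for the training set (assumption: 75% class 0, 25% class 1)
rng = np.random.default_rng(2022)
toy = pd.DataFrame({
    'feature_a': rng.normal(size=200),
    'feature_b': rng.normal(size=200),
    'Label': [0] * 150 + [1] * 50,
})

# Option 3: randomly oversample the minority class until both classes have equal size
majority = toy[toy.Label == 0]
minority = toy[toy.Label == 1]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=2022)
balanced = pd.concat([majority, minority_upsampled])
print(balanced.Label.value_counts())

# Option 4: keep the data unchanged and weight classes inversely to their frequency
weighted_model = RandomForestClassifier(class_weight='balanced', random_state=2022)
weighted_model.fit(toy[['feature_a', 'feature_b']], toy.Label)
```

If you resample, do it only on the training split, after separating out the validation data. Otherwise copies of the same row can land on both sides of the split and inflate the validation score.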

<a name = "Outliers"></a>
## 7. Outliers

In [None]:
# Exploring some features for cell 1
explore_cols =  ['1_elevation', '1_aspect', '1_slope', '1_placurv', '1_procurv', '1_lsfactor', '1_twi', '1_geology']
explore_cols

In [None]:
# Plotting boxplots for the first six columns of explore_cols
# (the 2 x 3 grid holds six plots, so the slice makes the selection explicit)
plot_cols = explore_cols[:6]
fig, axes = plt.subplots(nrows = 2, ncols = 3, figsize = (20, 10))
fig.suptitle('Box plots showing outliers', y= 0.93, fontsize = 15)

for ax, name in zip(axes.flatten(), plot_cols):
  sns.boxplot(x = train[name], ax = ax)

 Elevation, lsfactor, placurv, procurv and slope have some outliers.
 The aspect feature has no outliers.

 Some of the techniques you can use to handle outliers include:
  1. Log transformations, scaling, Box-Cox transformations...
  2. Dropping the outliers
  3. Replacing the outliers with the mean, median, mode or any other aggregate
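
The three techniques above can be sketched on a toy series (not competition data; `values` and the 1.5 * IQR rule are illustrative assumptions). Capping at the whisker bounds is shown as one variant of technique 3, where the replacement value is the nearest bound rather than the mean or median.

```python
import numpy as np
import pandas as pd

# Toy right-skewed feature with two extreme values (not competition data)
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 40.0, 90.0], name='toy_feature')

# Technique 1: log transform; log1p handles zeros but requires non-negative input
log_values = np.log1p(values)

# Whisker bounds from the interquartile range (the usual boxplot rule)
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Technique 2: drop the values that fall outside the whiskers
kept = values[values.between(lower, upper)]

# Technique 3 (variant): replace outliers with the nearest whisker bound
clipped = values.clip(lower=lower, upper=upper)

print(len(kept), clipped.max())
```

Log and Box-Cox transforms need non-negative (Box-Cox: strictly positive) inputs, so they cannot be applied directly to a feature that takes negative values.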

<a name = "Correlations"></a>
## 8. Correlations

In [None]:
explore_cols

In [None]:
# Pairwise relationships between features, coloured by label
# (sns.pairplot creates its own figure, so no plt.figure call is needed)
# Number of samples in class 0, in class 1, and in total
print(len(train[train['Label']==0][explore_cols]), len(train[train['Label']==1][explore_cols]), len(train))
sns.pairplot(train[explore_cols+['Label']], kind="scatter", plot_kws=dict(s=1, edgecolor="green"), hue = 'Label')
plt.show()

In [None]:
# Pairwise relationships between features, without the label
# (sns.pairplot creates its own figure, so no plt.figure call is needed)
sns.pairplot(train[explore_cols], kind="scatter", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

- Most of the features show no clear pairwise correlation. How can you capture this information for modelling?
- What information can you derive from these correlations?

In [None]:
# Quantify correlations
corr = train[explore_cols].corr()
plt.figure(figsize = (13, 8))
sns.heatmap(corr, cmap='RdYlGn', annot = True, center = 0)
plt.title('Correlogram', fontsize = 15, color = 'darkgreen')
plt.show()

 - There is a strong positive correlation of approximately 0.8 between slope and lsfactor
 - There is some negative correlation between lsfactor and placurv
 - The lsfactor variable is correlated with most of the other features. Why is this?
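
To read such pairs off a correlation matrix programmatically rather than from the heatmap, one possible sketch is shown below. It uses a toy frame (not competition data) in which column `b` is deliberately built to track column `a`. The names `toy_corr` and `high_pairs` and the 0.8 cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (not competition data): 'b' closely tracks 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy_features = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=500),
    'c': rng.normal(size=500),
})

toy_corr = toy_features.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper_triangle = toy_corr.where(np.triu(np.ones(toy_corr.shape, dtype=bool), k=1))

threshold = 0.8  # hypothetical cutoff for "strongly correlated"
pairs = upper_triangle.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Applied to `train[explore_cols].corr()`, the same steps would list the feature pairs whose absolute correlation exceeds the chosen cutoff.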

<a name = "Model"></a>
## 9. Model training

In [None]:
# Select main columns to be used in training
main_cols = train.columns.difference(['Sample_ID', 'Label'])
X = train[main_cols]
y = train.Label

# Split the training data into train and hold-out (validation) sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=2022)

# Train model
model = RandomForestClassifier(random_state = 2022)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Check the F1 score of the model
print(f'RandomForest F1 score on the X_test is: {f1_score(y_test, y_pred)}\n')

# print classification report
print(classification_report(y_test, y_pred))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
fig, ax = plt.subplots(figsize=(15,7))
disp.plot(ax=ax)
plt.show()

 - True positives - 442
 - True negatives - 2287
 - False positives - 128
 - False negatives - 403

 Precision  = TP / (TP + FP) = 442 / (442 + 128) = 0.775438596491228

 Recall = TP / (TP + FN) = 442 / (442 + 403) = 0.5230769230769231

 F1 score = harmonic mean of Precision and Recall

 F1 score = (2 * Precision * Recall) / (Precision + Recall)

 F1 score = (2 * 0.775438596491228 * 0.5230769230769231) / (0.775438596491228 + 0.5230769230769231) = 0.6247349823321554
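
The same arithmetic as a short check in plain Python, using the four counts listed above. The closed form `2 * TP / (2 * TP + FP + FN)` is algebraically equivalent to the harmonic-mean formula and is included as a cross-check.

```python
# Counts taken from the confusion matrix of the baseline model
tp, tn, fp, fn = 442, 2287, 128, 403

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent closed form expressed directly in the raw counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f'Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}')
```

True negatives appear in neither formula, so F1 is unaffected by how many majority-class samples are classified correctly. Accuracy, by contrast, would be.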

In [None]:
# Feature importance
impo_df = pd.DataFrame({'feature': X.columns, 'importance': model.feature_importances_}).set_index('feature').sort_values(by = 'importance', ascending = False)
impo_df = impo_df[:10].sort_values(by = 'importance', ascending = True)
impo_df.plot(kind = 'barh', figsize = (10, 10))
plt.legend(loc = 'center right')
plt.title('Bar chart showing top ten features', fontsize = 14)
plt.xlabel('Importance', fontsize = 12, color = 'indigo')
plt.show()

<a name = "Predictions"></a>
## 10. Test set predictions

In [None]:
# Make prediction on the test set
test_df = test[main_cols]
predictions = model.predict(test_df)

# Create a submission file
sub_file = pd.DataFrame({'Sample_ID': test.Sample_ID, 'Label': predictions})

# Check the distribution of your predictions
sns.countplot(x = sub_file.Label)
plt.title('Predicted Variable Distribution');

<a name = "Submission"></a>
## 11. Creating a submission file

In [None]:
# Create a csv file and upload to Zindi
sub_file.to_csv('Baseline.csv', index = False)
sub_file.head()

<a name = "Tips"></a>
## 12. Tips to improve model performance
 - Use cross-validation techniques
 - Feature engineering
 - Handle the class imbalance of the target variable
 - Try different modelling techniques - Stacking classifier, Voting classifiers, ensembling...
 - Data transformations
 - Feature selection techniques such as RFE, tree-based feature importance...
 - Domain knowledge: research how the provided features affect landslides, soil topology...
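
As a rough illustration of the first tip, the sketch below runs stratified 5-fold cross-validation scored with F1. It uses synthetic data from `make_classification` (an assumption standing in for the competition features, with a roughly 75/25 class split). The names `X_toy`, `y_toy` and `cv_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data (assumption: roughly 75/25 class split)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=2022
)

# Stratified folds keep the class ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
cv_model = RandomForestClassifier(n_estimators=50, random_state=2022)

# One F1 score per fold; the spread shows how stable the estimate is
scores = cross_val_score(cv_model, X_toy, y_toy, cv=cv, scoring='f1')

print(f'F1 per fold: {scores}')
print(f'Mean F1: {scores.mean():.3f} (std {scores.std():.3f})')
```

Compared with a single 70/30 split, averaging over folds gives a less noisy estimate of how a change to the features or model affects the F1 score.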

# GOOD LUCK AND HAPPY HACKING 😊


