# QCTO - Workplace Module

### Investigating and Predicting Landslide Risk in Iran
#### Done By: Mieke Spaans

© ExploreAI 2024

---
 <a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

Add introduction here

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** Listed below are all of the Python packages used throughout the project, such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, and scikit-learn for modeling. I have separated the imports into code blocks by project stage for ease of reading and debugging.
---

In [None]:
# MLflow is used to track experiments during model hyperparameter tuning
!pip install mlflow

In [None]:
# loading .csv files via google drive:
from google.colab import drive

In [1]:
# Libraries used for data loading and pre-processing:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

In [None]:
# Libraries used for data visualisation during exploratory data analysis:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import re
from scipy import stats

In [None]:
# Model training libraries:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Model evaluation libraries:
from sklearn.metrics import mean_squared_error

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

---

### Data Collection:
The dataset I am using for this project is publicly available on Kaggle. 

Link to the Kaggle page that contains the .csv file: https://www.kaggle.com/datasets/mohammadrahdanmofrad/landslide-risk-assessment-factors

The Kaggle page states that the data was sourced from 'Topography Maps and Google Erath'[sic]. No further details about the data sources are given. The dataset author, Mohamad Rahdan, is a Geospatial Data Scientist.

The data is licensed under CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International). This means I may:
- Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.

Please note, however, that under this license I must comply with the following rules:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

The permissions and conditions listed above were copied directly from the Creative Commons website, linked below.
https://creativecommons.org/licenses/by-sa/4.0/

### Data Description:
The dataset contains over 4000 landslide hazard points in Iran. Each point includes longitude and latitude information to help locate the point on a map. 

##### Definition: "Landslide point":

A "Landslide Point" refers to a specific geographical location that has been identified as having a heightened risk of experiencing landslides. This designation is typically based on various factors such as the area's slope, soil composition, rainfall patterns, and other environmental conditions that contribute to instability and susceptibility to landslides. Landslide points are often determined through analysis of historical landslide data, geological surveys, and climate assessments, and are used to assess and mitigate risks in vulnerable regions.

##### The dataset contains 15 columns: [descriptions taken directly from the Kaggle page]
- `ID`
- `LONG`: The longitude of the landslide point
- `LAT`: The latitude of the landslide point
- `SUB_Basin`: The name of the watershed of the landslide point
- `Elevation`: The elevation of the landslide point from sea level (m)
- `AAP(mm)`: The average annual precipitation at the landslide point
- `RiverDIST(m)`: The distance of the landslide point from the river
- `FaultDIST(m)`: The distance of the landslide point from the fault
- `Landuse_Type`: The landuse type at the landslide point
- `Slop(Percent)`: The slope at the landslide point (Values range from 0 to 100%)
- `Slop(Degrees)`: The slope at the landslide point (Values range from 0 to 90)
- `GEO_UNIT`: The geology unit of the landslide point
- `DES_GEOUNI`: The description of the geology unit of the landslide point
- `Climate_Type`: The climate type of the landslide point
- `DES_ClimateType`: The description of the climate type of the landslide point

All but six columns are numerical. The exceptions are:
- `SUB_Basin`
- `Landuse_Type`
- `GEO_UNIT`
- `DES_GEOUNI`
- `Climate_Type`
- `DES_ClimateType`

All six of these columns contain text data.
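The two slope columns describe the same quantity in different units, so one should be derivable from the other. A minimal sketch of that check, assuming the usual definition percent = 100 × tan(angle); the values are the first two rows shown in the Loading Data section, and the variable names are my own:

```python
import numpy as np

# Slope values copied from the first two rows of the dataset
slope_percent = np.array([42.240669, 68.219116])
slope_degrees = np.array([22.899523, 34.301464])

# percent = 100 * tan(angle), so angle = arctan(percent / 100)
degrees_from_percent = np.degrees(np.arctan(slope_percent / 100))

# True when the converted values match the degrees column within tolerance
columns_agree = np.allclose(degrees_from_percent, slope_degrees, atol=1e-3)
print(degrees_from_percent.round(3), columns_agree)
```

If the check holds for the full dataset, the two columns carry the same information, and it may be enough to keep only one of them for modeling.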

##### First impressions:
This dataset is very thorough: each landslide point records a wide range of factors, such as distance to the nearest river and fault, slope, and geology unit.

One major limitation is that the dataset covers landslide points in Iran only. For this project to provide broader real-world benefit, the dataset would ideally include landslide risk points from a variety of countries. A possible workaround is to apply what is learned here to countries and regions with a similar climate and topography.

With regards to model building:
- We can look into classification, where we would input a new location and determine whether it is a possible landslide point. This would be valuable for predicting where landslides may occur. Note that the dataset contains landslide points only, so examples of non-landslide locations would also be needed.
- We could also look into unsupervised learning, by applying k-means clustering (or a similar algorithm) to the landslide points. This would let us identify different categories of landslide points, which could help researchers.
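A minimal sketch of the clustering idea, using k-means from scikit-learn as the clustering algorithm. The data here is synthetic and only stands in for numeric columns such as elevation, precipitation and slope; the number of clusters and all variable names are assumptions for illustration, not choices made for the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three numeric features (not the real data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[500, 150, 20], scale=[50, 10, 3], size=(50, 3))
group_b = rng.normal(loc=[2000, 600, 45], scale=[50, 10, 3], size=(50, 3))
features = np.vstack([group_a, group_b])

# Standardise so that no single feature dominates the distance calculation
scaled = StandardScaler().fit_transform(features)

# n_clusters=2 is an assumption that suits this toy data only
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)

# Number of points assigned to each cluster
print(np.bincount(labels))
```

On the real data, the number of clusters would have to be chosen during analysis, for example by comparing inertia or silhouette scores across several values.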

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [5]:
# Loading via pandas when using Jupyter Notebooks
landslide = pd.read_csv('Landslide_Factors_IRAN.csv')

In [7]:
# Check that the data was imported correctly
landslide.head(3)

|  | ID | LONG | LAT | SUB_Basin | Elevation | AAP(mm) | RiverDIST(m) | FaultDIST(m) | Landuse_Type | Slop(Percent) | Slop(Degrees) | GEO_UNIT | DES_GEOUNI | Climate_Type | DES_ClimateType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 52.326 | 27.763 | Mehran | 617.0 | 137 | 1448.705292 | 40639.5789 | poorrange | 42.240669 | 22.899523 | EOas-ja | Undivided Asmari and Jahrum Formation , regard... | A-M-VW | Warm and humid, with a humid period longer tha... |
| 1 | 2 | 52.333 | 27.772 | Mehran | 944.0 | 137 | 344.299484 | 40135.02913 | mix(woodland_x) | 68.219116 | 34.301464 | KEpd-gu | Keewatin Epedotic quartz diorite | A-M-VW | Warm and humid, with a humid period longer tha... |
| 2 | 3 | 52.326 | 27.763 | Mehran | 617.0 | 137 | 1448.705292 | 40639.5789 | poorrange | 42.240669 | 22.899523 | EOas-ja | Undivided Asmari and Jahrum Formation , regard... | A-M-VW | Warm and humid, with a humid period longer tha... |


From the above code, we see that the dataframe imports correctly. 

Some issues noted:
- `GEO_UNIT` holds abbreviated codes that are difficult to interpret on their own.
- `Climate_Type` also holds abbreviated codes that are difficult to interpret.
- `DES_ClimateType` appears to repeat the information in `Climate_Type` in descriptive form. Consider removing it and relying on `Climate_Type` only.
- `ID` was loaded as a regular column, but functions as an index. It should either be removed or set as the dataframe's index.
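The last two issues could be handled along the following lines. This is a sketch on a toy frame, not the real data: the second climate code (`SA-K-W`) and the shortened descriptions are made up for illustration, and `sample` is a hypothetical name.

```python
import pandas as pd

# Toy frame that mimics the relevant columns; values are illustrative only
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Climate_Type": ["A-M-VW", "A-M-VW", "SA-K-W", "SA-K-W"],
    "DES_ClimateType": ["Warm and humid", "Warm and humid",
                        "Semi-arid", "Semi-arid"],
})

# Use ID as the index instead of keeping it as an ordinary column
sample = sample.set_index("ID")

# Count how many distinct descriptions each climate code maps to
descriptions_per_code = sample.groupby("Climate_Type")["DES_ClimateType"].nunique()

# If every code has exactly one description, the description adds nothing new
is_redundant = bool((descriptions_per_code == 1).all())
if is_redundant:
    sample = sample.drop(columns="DES_ClimateType")

print(sample.columns.tolist())
```

Before dropping the column from the real dataframe, it may be worth saving the code-to-description mapping separately so the codes can still be interpreted later.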

In [8]:
# Check size of dataframe
landslide.shape

(4295, 15)

The dataframe contains 4295 entries, and has 15 columns.

In [9]:
# Get each column's data type, and count non-null values
landslide.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4295 entries, 0 to 4294
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               4295 non-null   int64  
 1   LONG             4295 non-null   float64
 2   LAT              4295 non-null   float64
 3   SUB_Basin        4295 non-null   object 
 4   Elevation        4295 non-null   float64
 5   AAP(mm)          4295 non-null   int64  
 6   RiverDIST(m)     4295 non-null   float64
 7   FaultDIST(m)     4295 non-null   float64
 8   Landuse_Type     4295 non-null   object 
 9   Slop(Percent)    4295 non-null   float64
 10  Slop(Degrees)    4295 non-null   float64
 11  GEO_UNIT         4295 non-null   object 
 12  DES_GEOUNI       4295 non-null   object 
 13  Climate_Type     4295 non-null   object 
 14  DES_ClimateType  4295 non-null   object 
dtypes: float64(7), int64(2), object(6)
memory usage: 503.4+ KB


From this we find out what each column's data type is, as well as the total number of non-null values in each.

Analysis:
- The three data types present are:
    - Integer (2 columns)
    - Float (7 columns)
    - Object (typically string values) (6 columns)
- No column contains null values, which makes our job a little easier in the next phase.
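The same summary can also be produced programmatically, which is handy when the column lists are needed later for scaling or encoding. A small sketch on a toy frame with one column of each kind; `sample`, `numeric_cols` and `text_cols` are hypothetical names:

```python
import pandas as pd

# Toy frame with an integer, a float and a text column
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Elevation": [617.0, 944.0, 617.0],
    "SUB_Basin": ["Mehran", "Mehran", "Mehran"],
})

# Split the column names by kind: numeric versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column
null_counts = sample.isna().sum()

print(numeric_cols, text_cols, int(null_counts.sum()))
```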

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---
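
One cleaning check suggested by the `head()` output in the Loading Data section is a search for repeated points: the rows with `ID` 1 and 3 show identical values in every other displayed column. A minimal sketch of how such repeats could be flagged, using a toy frame built from a few of those displayed values; whether repeats should actually be dropped is a decision still to be made.

```python
import pandas as pd

# Toy frame using a few values from the first three rows of the dataset
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "LONG": [52.326, 52.333, 52.326],
    "LAT": [27.763, 27.772, 27.763],
    "Elevation": [617.0, 944.0, 617.0],
})

# Compare every column except ID, since ID differs on every row
compare_cols = [col for col in sample.columns if col != "ID"]

# Mark rows that repeat an earlier row on all compared columns
is_repeat = sample.duplicated(subset=compare_cols, keep="first")

# Keep only the first occurrence of each point
deduplicated = sample.loc[~is_repeat]
print(int(is_repeat.sum()), len(deduplicated))
```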

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

Licensing information: https://creativecommons.org/licenses/by-sa/4.0/

Dataset: https://www.kaggle.com/datasets/mohammadrahdanmofrad/landslide-risk-assessment-factors

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
