# **Step 1: Answering the Question**
## **Did you specify the type of data analytic question (e.g., exploration, association, causality) before touching the data?**
This is a prediction question: we want to predict the ordinal variable damage_grade, which represents the level of damage to a building hit by the earthquake. There are three damage grades:

- 1 represents low damage
- 2 represents a medium amount of damage
- 3 represents almost complete destruction

## **Did you define the metric for success before beginning?**
To measure the performance of our algorithms, we'll use the F1 score, which balances the precision and recall of a classifier. The F1 score is traditionally used to evaluate binary classifiers, but since we have three possible labels, we will use a variant called the micro-averaged F1 score.
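
As a minimal sketch of this metric, assuming scikit-learn is available, the snippet below computes the micro-averaged F1 score on a handful of made-up labels (`y_true_demo` and `y_pred_demo` are illustrative names and values, not drawn from the dataset). For single-label multiclass problems like this one, the micro-averaged F1 score coincides with plain accuracy, which the example also computes for comparison.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical damage grades (1-3) for six buildings; not real data.
y_true_demo = [1, 2, 2, 3, 3, 2]
y_pred_demo = [1, 2, 3, 3, 2, 2]

# Micro averaging pools true positives, false positives and false
# negatives over all classes before computing precision and recall.
micro_f1 = f1_score(y_true_demo, y_pred_demo, average="micro")
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(f"micro F1: {micro_f1:.4f}")
print(f"accuracy: {accuracy:.4f}")
```

Because every misclassified building counts once as a false positive (for the predicted grade) and once as a false negative (for the actual grade), pooled precision and pooled recall are both equal to the share of correct predictions.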
## **Did you understand the context for the question and the scientific or business application?**
The goal is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake, so that the resulting model can help anticipate the damage from future earthquakes.
## **Did you record the experimental design?**
After the earthquake, the National Planning Commission, along with Kathmandu Living Labs and the Central Bureau of Statistics, generated one of the largest post-disaster datasets ever collected. It contains valuable information on earthquake impacts, household conditions, and socio-economic-demographic statistics.
## **Did you consider whether the question could be answered with the available data?**
The dataset pairs building-level features with a damage_grade label for each building, so the question should be answerable with the available data.


# **Steps 2 & 3: Checking and Tidying the Data**

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
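# Load the features and the labels, then join them on the shared building_id key
dataset_vals = pd.read_csv("train_values.csv")
dataset_labs = pd.read_csv("train_labels.csv")
dataset = dataset_vals.merge(dataset_labs, on="building_id")
# Drop any rows that contain missing values
dataset = dataset.dropna(axis=0)
dataset.info()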

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46616 entries, 0 to 46615
Data columns (total 40 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   building_id                             46616 non-null  int64  
 1   geo_level_1_id                          46616 non-null  int64  
 2   geo_level_2_id                          46616 non-null  int64  
 3   geo_level_3_id                          46616 non-null  int64  
 4   count_floors_pre_eq                     46616 non-null  int64  
 5   age                                     46616 non-null  int64  
 6   area_percentage                         46616 non-null  int64  
 7   height_percentage                       46616 non-null  int64  
 8   land_surface_condition                  46616 non-null  object 
 9   foundation_type                         46616 non-null  object 
 10  roof_type                               46616 non-null  ob
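
Before relying on the merged table, it can help to confirm that the join neither dropped nor duplicated rows. The following is a small sketch on toy data, assuming `building_id` is the shared key; the frames `values_demo` and `labels_demo` are hypothetical stand-ins for the two CSV files.

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv (hypothetical rows).
values_demo = pd.DataFrame({"building_id": [101, 102, 103], "age": [10, 25, 5]})
labels_demo = pd.DataFrame({"building_id": [101, 102, 103], "damage_grade": [2, 3, 1]})

# Name the key explicitly and ask pandas to verify it is unique on both sides;
# merge raises MergeError if the one-to-one assumption does not hold.
merged_demo = values_demo.merge(labels_demo, on="building_id", validate="one_to_one")

# An inner join on a complete key should keep every row exactly once.
assert len(merged_demo) == len(values_demo)

# Count missing values per column before deciding whether dropna is needed.
print(merged_demo.isna().sum())
```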

In [5]:
# Encoding the independent variables
from sklearn.preprocessing import LabelEncoder

objList = dataset.select_dtypes(include='object').columns
le = LabelEncoder()

# Replace each categorical (object) column with integer codes
for col in objList:
    dataset[col] = le.fit_transform(dataset[col].astype(str))

# Every column except the last is a feature; the last column is damage_grade
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Encoding the dependent variable (maps grades 1, 2, 3 to 0, 1, 2)
le = LabelEncoder()
y = le.fit_transform(y)
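
As an alternative sketch rather than the approach used in this notebook, scikit-learn's `OrdinalEncoder` produces the same kind of integer codes for feature columns in a single call and keeps the category-to-code mapping for every column. The column names below come from the dataset, but the category values in `demo` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample; the category letters are illustrative only.
demo = pd.DataFrame({
    "roof_type": ["n", "q", "n", "x"],
    "foundation_type": ["r", "r", "w", "h"],
    "age": [10, 25, 5, 40],
})

cat_cols = ["roof_type", "foundation_type"]
encoder = OrdinalEncoder()

# Encode all categorical columns at once; codes follow sorted category order.
demo[cat_cols] = encoder.fit_transform(demo[cat_cols])

# encoder.categories_ keeps the mapping so codes can be traced back to labels.
for col, cats in zip(cat_cols, encoder.categories_):
    print(col, list(cats))
print(demo)
```

Integer codes imply an ordering between categories, which tree-based models tolerate well but which a linear model such as logistic regression will treat as a numeric scale.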

In [6]:
# Splitting the dataset into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Feature scaling: fit the scaler on the training set only, then reuse it on the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
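
The scaler should learn its mean and standard deviation from the training rows only and then apply those same statistics to the test rows. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and the fixed `random_state` are assumptions made for the example, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix and labels, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(3))  # approximately zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # near zero, but generally not exactly
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so the model would see test features on a slightly different scale from the one it was trained on.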

# **Step 4: Exploratory Analysis**

In [None]:
sns.set_palette('colorblind')
sns.pairplot(data=dataset, height=3)

In [None]:
#training the decision tree classification model
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

# Making a confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows are actuals, columns are predictions
# For single-label multiclass data, accuracy equals the micro-averaged F1 score
accuracy_score(y_test, y_pred)

[[ 3518  3398   569]
 [ 3514 31121 10248]
 [  483  9701 15629]]


0.6429695194484594
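
The overall score hides how the model does on each grade. As a rough sketch, the per-grade precision and recall can be read off the decision tree's confusion matrix printed above; the variable names here are illustrative, and the matrix values are copied from that output.

```python
import numpy as np

# Decision tree confusion matrix copied from the printed output
# (rows are actual grades, columns are predicted grades).
cm_tree = np.array([
    [3518, 3398, 569],
    [3514, 31121, 10248],
    [483, 9701, 15629],
])

correct = np.diag(cm_tree)
recall = correct / cm_tree.sum(axis=1)     # share of each actual grade that was found
precision = correct / cm_tree.sum(axis=0)  # share of each predicted grade that was right
accuracy = correct.sum() / cm_tree.sum()   # matches the accuracy_score output

for grade, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"grade {grade}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```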

In [None]:
#training the logistic regression model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

# Making a confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows are actuals, columns are predictions
# Accuracy (equal to the micro-averaged F1 score for single-label multiclass data)
accuracy_score(y_test, y_pred)

[[ 2075  5266   144]
 [ 1543 40452  2888]
 [  136 22345  3332]]


0.586574743224057
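
One way to put this score in context is to compare it with a trivial baseline that always predicts the most frequent grade in the test split. The sketch below derives that baseline from the row sums of the logistic regression confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Logistic regression confusion matrix copied from the printed output
# (rows are actual grades, columns are predicted grades).
cm_logreg = np.array([
    [2075, 5266, 144],
    [1543, 40452, 2888],
    [136, 22345, 3332],
])

actual_counts = cm_logreg.sum(axis=1)     # test rows per actual grade
predicted_counts = cm_logreg.sum(axis=0)  # test rows per predicted grade

# Accuracy of a trivial model that always predicts the most common grade.
baseline = actual_counts.max() / cm_logreg.sum()
accuracy = np.trace(cm_logreg) / cm_logreg.sum()

print("actual counts:   ", actual_counts)
print("predicted counts:", predicted_counts)
print(f"majority baseline={baseline:.4f} model accuracy={accuracy:.4f}")
```

Comparing the two count vectors also shows how strongly the model favors the middle grade relative to how often that grade actually occurs.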