# Welcome to my Richter's Predictor: Nepal Earthquake Damage Prediction Model!

Default code from Kaggle Notebook:

In [6]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



## Displaying the versions of some important libraries:

In [7]:
# import tensorflow as tf
# print("Tensorflow:", tf.__version__)

# import kerastuner as kt
# print("kerastuner:", kt.__version__)

# import keras_tuner as kt2
# print("keras_tuner:", kt2.__version__)

# import platform
# print("Python:", platform.python_version())

# import numpy as np
# print("numpy:", np.__version__)

# import pandas as pd
# print("pandas:", pd.__version__)

# import sklearn
# print("sklearn version:", sklearn.__version__)

# import sklearn
# print("sklearn path:", sklearn.__path__)

# import matplotlib
# print("matplotlib:", matplotlib.__version__)

# import seaborn as sns
# print("seaborn:", sns.__version__)

# # WARNING:tensorflow:From c:\Users\Micha\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

# # Tensorflow: 2.15.0
# # C:\Users\Micha\AppData\Local\Temp\ipykernel_6936\1753711907.py:4: DeprecationWarning: `import kerastuner` is deprecated, please use `import keras_tuner`.
# #   import kerastuner as kt
# # kerastuner: 1.0.5
# # keras_tuner: 1.3.5
# # Python: 3.10.11
# # numpy: 1.24.3
# # pandas: 2.1.4
# # sklearn version: 1.2.2
# # sklearn path: ['c:\\Users\\Micha\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\sklearn']
# # matplotlib: 3.8.2
# # seaborn: 0.13.0


## Set a global random seed so that any model we create can be reproduced (the randomness becomes repeatable)

In [8]:
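import os
import random

import numpy as np
import tensorflow as tf

# Seed every random number generator the notebook relies on
random.seed(42)         # Python's built-in generator
np.random.seed(42)      # NumPy's global generator
tf.random.set_seed(42)  # TensorFlow's global generator

# Ask TensorFlow to use deterministic ops where it can
os.environ['TF_DETERMINISTIC_OPS'] = '1'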
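
As a quick sanity check of what seeding buys us, the sketch below re-seeds Python's and NumPy's generators and confirms that the same seed reproduces the same draws. The helper name `draw_sample` is made up for this illustration, and TensorFlow is left out to keep the check light; `tf.random.set_seed` is expected to follow the same idea.

```python
import random

import numpy as np


def draw_sample(seed):
    """Re-seed both generators, then draw three numbers from each."""
    random.seed(seed)
    np.random.seed(seed)
    python_draws = [random.random() for _ in range(3)]
    numpy_draws = np.random.rand(3)
    return python_draws, numpy_draws


py_a, np_a = draw_sample(42)
py_b, np_b = draw_sample(42)  # same seed as the first call
py_c, np_c = draw_sample(7)   # different seed

same_seed_matches = (py_a == py_b) and np.array_equal(np_a, np_b)
different_seed_differs = (py_a != py_c) and not np.array_equal(np_a, np_c)

print("same seed reproduces the draws:", same_seed_matches)
print("different seed changes the draws:", different_seed_differs)
```

If the first check ever fails in a real notebook, something is probably consuming random numbers between the seed call and the model code.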

### Loading the training and testing data

`train_values` holds the features (X), and `train_labels` holds the target/label (y).

In [9]:
train_values = pd.read_csv("train_values.csv")
train_labels = pd.read_csv("train_labels.csv")

test_values = pd.read_csv("test_values.csv")

# print("train labels:\n", train_labels.head())

# print("train_values:\n", train_values.head())
      
# print("test_values:\n", test_values.head())

## I want to find out which features to use, since there are so many. Here are some common feature selection techniques:

1. **Correlation Matrix with Heatmap**: Correlation measures how strongly each feature is linearly related to the other features or to the target variable. A heatmap makes the correlation matrix easy to read at a glance.

2. **Univariate Selection**: Statistical tests can select the features that have the strongest relationship with the output variable. The scikit-learn library provides the `SelectKBest` class, which can be combined with a range of statistical tests to keep a specific number of features.

3. **Recursive Feature Elimination (RFE)**: RFE is a popular feature selection method that repeatedly fits a model and removes the weakest feature (or features) until the specified number of features is reached.

4. **Feature Importance**: Many models expose an importance score for each feature after fitting. For example, decision tree models in the scikit-learn library offer a `feature_importances_` attribute that can be accessed directly.
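
To make techniques 2 and 4 concrete, here is a minimal sketch on synthetic data. The dataset comes from `make_classification`, the column names `feature_0` to `feature_7` are placeholders, and the choices of `f_classif` as the statistical test and `k=3` are assumptions for illustration, not settings tuned for the earthquake data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 8 features, 3 of them informative
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Univariate selection: keep the 3 features with the highest ANOVA F-score
kbest = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)
kbest_features = list(X_demo.columns[kbest.get_support()])
print("SelectKBest kept:", kbest_features)

# Feature importance: read the scores off a fitted decision tree
tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The two methods will not always agree, since one scores each feature on its own and the other scores features by how the fitted tree used them together.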

For categorical features, you can convert them into numerical values using techniques like One-Hot Encoding or Label Encoding before applying these feature selection techniques. 
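
The difference between the two encodings can be sketched with pandas on a toy frame. The four rows and the category letters below are made up for illustration; only the column names are borrowed from this dataset.

```python
import pandas as pd

# Toy data: one categorical column and one numeric column
demo = pd.DataFrame(
    {
        "roof_type": ["n", "q", "x", "n"],
        "count_floors_pre_eq": [2, 3, 1, 2],
    }
)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(demo, columns=["roof_type"])
print(one_hot.columns.tolist())

# Label encoding: one integer code per category, assigned in sorted order
label_encoded = demo.copy()
label_encoded["roof_type"] = demo["roof_type"].astype("category").cat.codes
print(label_encoded["roof_type"].tolist())
```

Label encoding keeps the column count small but imposes an arbitrary order on the categories, which matters for correlation-based selection. One-hot encoding avoids that at the cost of extra columns.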

In [10]:
from sklearn.feature_selection import RFE
from lightgbm import LGBMRegressor  # LightGBM is a fast gradient boosting library that copes well with large datasets

# X is the feature set and y is the target value
X = train_values.drop('building_id', axis=1)
X = pd.get_dummies(X)  # one-hot encode the categorical columns

y = train_labels.drop('building_id', axis=1)
y = np.ravel(y)  # flatten the single-column DataFrame into a one-dimensional array

# Any estimator that exposes coef_ or feature_importances_ works here; try a few to see which suits the data
estimator = LGBMRegressor(verbose=0, random_state=42)
# n_features_to_select is left at its default, so RFE keeps half of the features
selector = RFE(estimator, step=1)
selector = selector.fit(X, y)

# The mask of selected features.
print(selector.support_)

# The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature.
# Selected features are assigned rank 1.
print(selector.ranking_)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True False False False False
 False False False False False False  True  True False  True False  True
  True False  True  True  True  True False  True  True False False  True
 False False False False  True False False False False False False False
 False False False  True False False False False]
[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2 18 10 31
 27 33 28 25 30  6  1  1  3  1  8  1  1  5  1  1  1  1 34  1  1 16  9  1
 23  7  4 14  1 29 15 20 17 35 32 24 21 13 26  1 11 22 12 19]
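
The raw mask and ranking arrays are hard to read without column names. A minimal sketch of how they could be mapped back to names is shown below. It runs on synthetic data with a `DecisionTreeClassifier` so it stays small; the names `X_demo`, `selector_demo`, and `feature_0` to `feature_5` are placeholders. On the real data the same indexing should work with `X.columns[selector.support_]`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the one-hot encoded feature matrix
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

# Keep 3 of the 6 features, dropping one feature per iteration
selector_demo = RFE(
    DecisionTreeClassifier(random_state=42), n_features_to_select=3, step=1
).fit(X_demo, y_demo)

# Boolean mask -> names of the selected columns
selected = list(X_demo.columns[selector_demo.support_])
print("selected:", selected)

# Ranking array -> Series indexed by column name (rank 1 = selected)
ranking = pd.Series(selector_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```

Sorting the ranking also shows which eliminated features were the last to go, which helps when deciding whether to keep a few more than RFE's default half.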


## Data Analysis from ChatGPT-4:

The categorical variables in the features dataset have been successfully encoded. Now, let's look at the correlation of these features with the `damage_grade`:

### Correlation with `damage_grade`
The correlation values range between -1 and 1. A value closer to 1 indicates a strong positive correlation, meaning that as the feature increases, the `damage_grade` tends to increase. Conversely, a value closer to -1 indicates a strong negative correlation, where an increase in the feature tends to go with a decrease in `damage_grade`. Values around 0 imply weak or no linear correlation. Correlation describes association only; it does not show that a feature causes the damage.
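
A tiny numeric sketch of the three cases, using NumPy's `corrcoef`. The five data points are invented purely to show the sign of the coefficient and are not taken from the dataset.

```python
import numpy as np

floors = np.array([1, 2, 3, 4, 5])

damage_rising = np.array([1, 1, 2, 3, 3])     # grows with floors
damage_falling = damage_rising[::-1]          # shrinks as floors grow
damage_unrelated = np.array([2, 1, 3, 1, 2])  # no linear trend

# corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_pos = np.corrcoef(floors, damage_rising)[0, 1]
r_neg = np.corrcoef(floors, damage_falling)[0, 1]
r_zero = np.corrcoef(floors, damage_unrelated)[0, 1]

print(f"rising:    r = {r_pos:+.3f}")
print(f"falling:   r = {r_neg:+.3f}")
print(f"unrelated: r = {r_zero:+.3f}")
```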

#### Top Positively Correlated Features:
- `has_superstructure_mud_mortar_stone`
- `count_floors_pre_eq`
- Other features like `legal_ownership_status`, `has_superstructure_stone_flag`, etc., also show positive correlation but to a lesser extent.

#### Top Negatively Correlated Features:
- `has_superstructure_cement_mortar_brick`
- `ground_floor_type`
- `has_superstructure_rc_engineered`
- Other features like `roof_type`, `has_superstructure_rc_non_engineered`, etc., also show negative correlation.


Here are the features selected using an absolute correlation threshold of 0.05:

1. `has_superstructure_mud_mortar_stone`
2. `count_floors_pre_eq`
3. `legal_ownership_status`
4. `has_superstructure_stone_flag`
5. `count_families`
6. `has_superstructure_adobe_mud`
7. `position`
8. `has_superstructure_cement_mortar_stone`
9. `has_superstructure_bamboo`
10. `has_superstructure_timber`
11. `geo_level_1_id`
12. `has_secondary_use`
13. `has_secondary_use_rental`
14. `has_secondary_use_hotel`
15. `foundation_type`
16. `area_percentage`
17. `has_superstructure_rc_non_engineered`
18. `roof_type`
19. `has_superstructure_rc_engineered`
20. `ground_floor_type`
21. `has_superstructure_cement_mortar_brick`

These features were chosen because the absolute value of their correlation with the target variable `damage_grade` exceeds the threshold of 0.05. They are a reasonable starting set for building the predictive model.
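
The thresholding step itself can be sketched with pandas. The six rows below are toy values invented for illustration, and `unrelated_feature` is a hypothetical column added to show a feature being filtered out. This is one plausible way to reproduce the selection, not necessarily the exact code that produced the list above.

```python
import pandas as pd

# Toy, already-numeric features (made-up values)
features = pd.DataFrame(
    {
        "count_floors_pre_eq": [1, 2, 2, 3, 3, 4],
        "has_superstructure_rc_engineered": [1, 1, 0, 0, 0, 0],
        "unrelated_feature": [1, 2, 5, 5, 2, 1],
    }
)
damage_grade = pd.Series([1, 1, 2, 2, 3, 3], name="damage_grade")

# Pearson correlation of every feature column with the target
correlations = features.corrwith(damage_grade)

# Keep features whose absolute correlation exceeds the threshold
threshold = 0.05
selected = correlations[correlations.abs() > threshold].sort_values(ascending=False)
print(selected)
```

Filtering on the absolute value keeps strongly negative features such as `has_superstructure_rc_engineered`, which a plain `> threshold` comparison would silently drop.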


### 1. Try using LightGBM! It's so fast. And maybe try it on the Titanic dataset too.