## Feature Engineering for Assignments 3 and 4

**Note:** This code covers only feature engineering and data splitting, not model training or feature selection. I have selected a few sample features, which may not be the ideal set of features.

You may wish to use this code in Assignment 4, but I recommend using your own code from Assignment 3 first, especially if you came up with your own feature combinations and your feature engineering code ran successfully.

In [50]:
# Display the result of every expression in a cell (not only the last one), so that each head() call prints
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# Loading the data into a DataFrame
mt_rainier_df = pd.read_csv('MtRainier_data.csv')

# Dropping duplicated data, if any
mt_rainier_df = mt_rainier_df.drop_duplicates()

# Dropping any rows with null/NaN values
mt_rainier_df = mt_rainier_df.dropna()


mt_rainier_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Route,Succeeded,Battery Voltage AVG,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG
0,0,11/27/2015,Disappointment Cleaver,0,13.64375,26.321667,19.715,27.839583,68.004167,88.49625
1,1,11/21/2015,Disappointment Cleaver,0,13.749583,31.3,21.690708,2.245833,117.549667,93.660417
2,2,10/15/2015,Disappointment Cleaver,0,13.46125,46.447917,27.21125,17.163625,259.121375,138.387
3,3,10/13/2015,Little Tahoma,0,13.532083,40.979583,28.335708,19.591167,279.779167,176.382667
4,4,10/9/2015,Disappointment Cleaver,0,13.21625,38.260417,74.329167,65.138333,264.6875,27.791292


From this, we select the following columns as features:

- **'Route'**: Given the weather conditions on a given day, one route may be preferable to others. A user may want to maximize the chances of summiting on any given route.
- **'Temperature AVG'**: Temperatures that are too high or too low may adversely affect the body and hence reduce the chances of summiting.
- **'Relative Humidity AVG'**: Selected for similar reasons as Temperature AVG.
- **'Wind Speed Daily AVG'**: Wind speed affects the ease of climbing, and hence the chances of summiting.
- **'Wind Direction AVG'**: Changes in wind direction may adversely affect climbing, and hence could be related to summiting.
- **'Solare Radiation AVG'**: Selected for the same reason as Wind Direction AVG.

We select the following column as the label:
- **Succeeded**, which carries a binary value for the summit outcome (1 for success and 0 for no success).

In [51]:
features_df = mt_rainier_df[['Route', 'Temperature AVG', 'Relative Humidity AVG', 'Wind Speed Daily AVG', 'Wind Direction AVG', 'Solare Radiation AVG']]
features_df.head()

Unnamed: 0,Route,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG
0,Disappointment Cleaver,26.321667,19.715,27.839583,68.004167,88.49625
1,Disappointment Cleaver,31.3,21.690708,2.245833,117.549667,93.660417
2,Disappointment Cleaver,46.447917,27.21125,17.163625,259.121375,138.387
3,Little Tahoma,40.979583,28.335708,19.591167,279.779167,176.382667
4,Disappointment Cleaver,38.260417,74.329167,65.138333,264.6875,27.791292


In [52]:
labels_df = mt_rainier_df[['Succeeded']]
labels_df.head()

Unnamed: 0,Succeeded
0,0
1,0
2,0
3,0
4,0


From our features, we observe that "Route" is a categorical nominal feature, so it needs a 1-hot transformation.

In [53]:
route_encoder = OneHotEncoder()

route_list = features_df['Route'].to_list()
route_list_of_list = [[el] for el in route_list]

route_transformed = route_encoder.fit_transform(route_list_of_list)
route_transformed = route_transformed.toarray()

route_transformed_df = pd.DataFrame(route_transformed)
print(f'Route 1-hot shape is {route_transformed_df.shape}')

Route 1-hot shape is (1895, 22)
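
The 1-hot columns above are labelled only by position (0 to 21). As a minimal sketch of how the encoded columns could be given readable names, the example below runs the same encoder on a tiny toy frame. The names `toy_df`, `encoder`, `column_names`, and `encoded_df` are hypothetical and the three rows are made up for illustration; they are not taken from `MtRainier_data.csv`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the 'Route' column (illustrative values, not the real data)
toy_df = pd.DataFrame(
    {"Route": ["Disappointment Cleaver", "Little Tahoma", "Disappointment Cleaver"]}
)

encoder = OneHotEncoder()
# A one-column DataFrame already has the 2-D shape the encoder expects
encoded = encoder.fit_transform(toy_df[["Route"]]).toarray()

# categories_[0] holds the sorted category labels found in the first (only) column
column_names = [f"Route_{name}" for name in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=column_names)
print(encoded_df)
```

Named columns make it easier to tell which route a given 1-hot column refers to when inspecting the data later.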


In [54]:
features_df.reset_index(drop=True, inplace=True)
route_transformed_df.reset_index(drop=True, inplace=True)

# Concatenating the Route dataframe + the features dataframe
features_transformed_df = pd.concat([features_df, route_transformed_df], axis=1)

# Dropping the original Route column
features_transformed_df = features_transformed_df.drop(columns=["Route"])

# Printing out the new shape after the 1-hot encoding has been finished and original column dropped
print(f"Shape of data after dropping original categorical column = {features_transformed_df.shape}")

Shape of data after dropping original categorical column = (1895, 27)


We now scale the numerical features (all columns except the 1-hot Route columns) using MinMaxScaler.

In [55]:
# Creating the scaler 
scaler = MinMaxScaler()

# Scaling the numerical features with the min-max scaler
numerical_feature_names = ['Temperature AVG', 
                   'Relative Humidity AVG', 
                   'Wind Speed Daily AVG', 
                   'Wind Direction AVG', 
                   'Solare Radiation AVG']
features_transformed_df[numerical_feature_names] = scaler.fit_transform(features_transformed_df[numerical_feature_names])

# Printing out the head of the feature-engineered dataframe
features_transformed_df.head()

Unnamed: 0,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG,0,1,2,3,4,...,12,13,14,15,16,17,18,19,20,21
0,0.395119,0.083886,0.427392,0.204255,0.240442,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.496061,0.106431,0.034478,0.389892,0.254473,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.803203,0.169424,0.263495,0.920335,0.375994,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.692326,0.182255,0.300762,0.997736,0.479228,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.637191,0.707076,1.0,0.941191,0.075508,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
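
To make the scaling step concrete, here is a small sketch comparing `MinMaxScaler` with the per-column formula (x - min) / (max - min), which is what the scaler computes with its default range of 0 to 1. The array `values` is made-up toy data and the names `scaled` and `manual` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column feature (illustrative values only)
values = np.array([[10.0], [20.0], [40.0]])

# Scaling with sklearn, using the default feature_range of (0, 1)
scaled = MinMaxScaler().fit_transform(values)

# The same transformation written out by hand, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The smallest value in each column maps to 0 and the largest to 1, which matches the 1.0 seen for the maximum wind speed in the output above.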


We form the train-test split. The train split can be used for cross-validation.

In [57]:
# Putting the features and labels into a numpy array for sklearn's data splitter
features = features_transformed_df.to_numpy()
labels = labels_df.to_numpy()

# Splitting the data into test and training data
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=512)

print(f"Shape of features for train+validation data = {_x.shape}")
print(f"Shape of labels for train+validation data = {_y.shape}")
print(f"Shape of features for test data = {x_test.shape}")
print(f"Shape of labels for test data = {y_test.shape}")

Shape of features for train+validation data = (1705, 27)
Shape of labels for train+validation data = (1705, 1)
Shape of features for test data = (190, 27)
Shape of labels for test data = (190, 1)
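
One possible way to use the train+validation split for cross-validation is sklearn's `KFold`, sketched below. This is only an illustration under stated assumptions: the arrays `x_train_val` and `y_train_val` are synthetic stand-ins for `_x` and `_y`, and the choice of 5 folds is arbitrary rather than something the assignment prescribes.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic stand-ins for _x and _y (20 rows, 3 features, binary labels)
rng = np.random.default_rng(0)
x_train_val = rng.random((20, 3))
y_train_val = rng.integers(0, 2, size=(20, 1))

# 5 shuffled folds; the number of folds here is an assumption
kfold = KFold(n_splits=5, shuffle=True, random_state=512)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(x_train_val)):
    # Each fold yields row indices for a training part and a validation part
    x_train, x_val = x_train_val[train_idx], x_train_val[val_idx]
    y_train, y_val = y_train_val[train_idx], y_train_val[val_idx]
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train = {x_train.shape}, validation = {x_val.shape}")
```

A model would be trained on `x_train`/`y_train` and evaluated on `x_val`/`y_val` inside the loop, while the test split stays untouched until the final evaluation.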
