# 3. Feature Engineering

**Kristian Newell**

**Course: BrainStation Data Science**

**Previous Notebook: 2. EDA (Exploratory Data Analysis)**

**Next Notebook: 4. Modeling and Findings**

In this notebook I will load the cleaned dataset that I exported in notebook **1. Data Loading and Cleaning**. My goal is to create a few new features that should help with the types of modeling I plan to use in my next notebook, **4. Modeling and Findings**. I will begin by loading the standard toolkit and my DataFrame.

In [14]:
# Loading in the usual toolkit required
import os  # needed for os.chdir below
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

In [15]:
# Changing working directory
os.chdir('C:/Users/Owner/Brainstation/Capstone')

# Reading in cleaned DataFrame
earthquake_df=pd.read_csv('cleaned_df.csv')
earthquake_df

Unnamed: 0,Latitude,Longitude,Magnitude,Depth,DateAndTime
0,19.2460,145.6160,6.0,131.60,1965-01-02 13:44:18
1,1.8630,127.3520,5.8,80.00,1965-01-04 11:29:49
2,-20.5790,-173.9720,6.2,20.00,1965-01-05 18:05:58
3,-59.0760,-23.5570,5.8,15.00,1965-01-08 18:49:43
4,11.9380,126.4270,5.8,15.00,1965-01-09 13:32:50
...,...,...,...,...,...
23224,38.3917,-118.8941,5.6,12.30,2016-12-28 08:22:12
23225,38.3777,-118.8957,5.5,8.80,2016-12-28 09:13:47
23226,36.9179,140.4262,5.9,10.00,2016-12-28 12:38:51
23227,-9.0283,118.6639,6.3,79.00,2016-12-29 22:30:19


In [7]:
earthquake_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23229 entries, 0 to 23228
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Latitude     23229 non-null  float64
 1   Longitude    23229 non-null  float64
 2   Magnitude    23229 non-null  float64
 3   Depth        23229 non-null  float64
 4   DateAndTime  23229 non-null  object 
dtypes: float64(4), object(1)
memory usage: 907.5+ KB


It looks like `DateAndTime` was read in as an object instead of a datetime, so I will be casting it to type datetime.

In [8]:
earthquake_df['DateAndTime']=pd.to_datetime(earthquake_df['DateAndTime'])
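As a side note, the same conversion can be done while the file is read by passing `parse_dates` to `pd.read_csv`. The sketch below is only an illustration: `sample_csv` and `sample_df` are hypothetical names, and the two rows are copied from the preview above to stand in for `cleaned_df.csv`.

```python
import io
import pandas as pd

# Two rows copied from the preview above, used as a stand-in for cleaned_df.csv
sample_csv = io.StringIO(
    "Latitude,Longitude,Magnitude,Depth,DateAndTime\n"
    "19.2460,145.6160,6.0,131.60,1965-01-02 13:44:18\n"
    "1.8630,127.3520,5.8,80.00,1965-01-04 11:29:49\n"
)

# parse_dates converts the named column to datetime while the file is being read
sample_df = pd.read_csv(sample_csv, parse_dates=['DateAndTime'])

# DateAndTime should now be a datetime64 column instead of object
print(sample_df.dtypes)
```

Either approach gives a datetime column; casting after the read, as I did above, makes the step explicit in the notebook.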

The first feature I will create is an ordinal time feature based on my existing `DateAndTime` column. It represents each timestamp as an integer: the proleptic Gregorian ordinal of its date, that is, the number of days counted from January 1 of year 1 (which is day 1). The time of day is dropped in this conversion, so events on the same calendar day share the same value. The purpose of this feature is to let me incorporate time into any model that cannot accept a datetime data type.

In [9]:
# Converting each datetime to an integer day count (date ordinal) so I can use it in the regression
earthquake_df['OrdinalTime']=earthquake_df['DateAndTime'].apply(lambda x:x.toordinal())
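To make the resolution of this feature concrete, here is a minimal sketch comparing `toordinal()` with a seconds-based encoding. The seconds-since-1970 column is a hypothetical alternative shown only for contrast; it is not what this notebook uses, and the names `sample_times`, `ordinal_days`, and `epoch_seconds` are made up for the example.

```python
import pandas as pd

# Three illustrative timestamps; the first two fall on the same calendar day
sample_times = pd.Series(pd.to_datetime([
    '1965-01-02 13:44:18',
    '1965-01-02 23:59:59',
    '1965-01-04 11:29:49',
]))

# toordinal() counts whole days from 0001-01-01 (day 1), so the time of day is dropped
ordinal_days = sample_times.apply(lambda ts: ts.toordinal())

# Hypothetical alternative: whole seconds since 1970-01-01 (negative before 1970)
epoch_seconds = (sample_times - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(pd.DataFrame({
    'time': sample_times,
    'ordinal_days': ordinal_days,
    'epoch_seconds': epoch_seconds,
}))
```

The first two rows get the same `ordinal_days` value but different `epoch_seconds` values. Day-level resolution is what `OrdinalTime` carries into the models.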

In [17]:
# Checking to see that the ordinal time has been created correctly
earthquake_df.head()

Unnamed: 0,Latitude,Longitude,Magnitude,Depth,DateAndTime
0,19.246,145.616,6.0,131.6,1965-01-02 13:44:18
1,1.863,127.352,5.8,80.0,1965-01-04 11:29:49
2,-20.579,-173.972,6.2,20.0,1965-01-05 18:05:58
3,-59.076,-23.557,5.8,15.0,1965-01-08 18:49:43
4,11.938,126.427,5.8,15.0,1965-01-09 13:32:50


With the `OrdinalTime` column created, I am now going to subset my DataFrame to a specific region of the world.

The region I will subset is South America, and there are several reasons for choosing it over other regions of the world. South America's western coast borders the Nazca tectonic plate, which is subducting beneath the South American plate along Peru and Chile. This subduction is responsible for a significant number of earthquakes each year, and those occurring on or near the coast also have the potential to create a tsunami. In addition to having a significant number of earthquakes and tsunamis, South America has poorer infrastructure than several of the more developed regions of the world, resulting in tremendous loss of life and costly damage. My model has the highest potential for good in an area like this, one with a relatively high probability of earthquakes and relatively low protection against them.

Additional information available at:
* https://www.statista.com/statistics/1204065/earthquake-risk-index-latin-america-country/
* https://www.usgs.gov/news/usgs-authors-new-report-seismic-hazard-risk-and-design-south-america

Since there is no locational data other than `Latitude` and `Longitude`, I am going to create a boundary around South America and only keep the values that fall within this boundary.

In [16]:
# Creating a new DataFrame from only the values within a specific latitude and longitude
SouthAmerica_df = earthquake_df[(earthquake_df['Latitude'] < 10) & (earthquake_df['Longitude'] > -87) & (earthquake_df['Longitude'] < -55) & ((earthquake_df['Longitude'] > -77) | (earthquake_df['Latitude'] > -30))]
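The boundary is easier to read when each condition gets its own name. The sketch below should be logically equivalent to the filter above; the function name `south_america_mask`, the mask names, and the five test points are all hypothetical and exist only to show which side of each edge a point falls on.

```python
import pandas as pd

def south_america_mask(df):
    """Boolean mask using the same four conditions as the filter above."""
    below_north_edge = df['Latitude'] < 10                            # south of 10°N
    in_lon_band = (df['Longitude'] > -87) & (df['Longitude'] < -55)   # between 87°W and 55°W
    # West of 77°W, only points north of 30°S are kept (cuts the south-west corner)
    outside_sw_corner = (df['Longitude'] > -77) | (df['Latitude'] > -30)
    return below_north_edge & in_lon_band & outside_sw_corner

# Made-up test points, one per interesting case
points = pd.DataFrame({
    'Latitude':  [-33.0, -40.0,  0.0,  19.2, -5.0],
    'Longitude': [-71.0, -80.0, -80.0, 145.6, -50.0],
})

points['InBoundary'] = south_america_mask(points)
print(points)
```

Note that the 55°W edge also excludes everything east of that meridian, so this box is a rough cut around the western and central part of the continent rather than a true outline of South America.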

In [13]:
# Using a plotly scatter_geo to inspect the data in my new SouthAmerica DataFrame
fig = px.scatter_geo(SouthAmerica_df,lat='Latitude',lon='Longitude',color='Magnitude')
fig.update_layout(title='Earthquakes in the South America subset', title_x=0.5)
fig.show()

The new DataFrame looks good. I will now export my original `earthquake_df` DataFrame with its new feature, as well as my new `SouthAmerica_df` DataFrame, for use in my next notebooks.

In [None]:
# Exporting the earthquake DataFrame to CSV for use in further notebooks
earthquake_df.to_csv('C:/Users/Owner/Brainstation/Capstone/cleaned_df.csv',index=False)

In [18]:
# Exporting the SouthAmerica DataFrame to CSV for use in further notebooks
SouthAmerica_df.to_csv('C:/Users/Owner/Brainstation/Capstone/SouthAmerica_df.csv',index=False)
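Both exports pass `index=False`. The sketch below shows why that matters, using an in-memory buffer instead of a file. `demo_df` and the helper `round_trip` are hypothetical names for this example only.

```python
import io
import pandas as pd

# Tiny stand-in DataFrame with made-up magnitudes and two ordinal day values
demo_df = pd.DataFrame({'Magnitude': [6.0, 5.8], 'OrdinalTime': [717338, 717340]})

def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it straight back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(demo_df)                 # default: the index is written as a column
without_index = round_trip(demo_df, index=False)

print(list(with_index.columns))     # includes an extra 'Unnamed: 0' column
print(list(without_index.columns))  # only the original columns
```

Without `index=False`, the old row index comes back as an extra unnamed column the next time the file is read. A CSV also stores `DateAndTime` as text, so the next notebook will need to convert it to datetime again after loading.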

My new `OrdinalTime` feature and my new `SouthAmerica_df` DataFrame have been created successfully. This marks the end of my feature engineering. I will now move on to my next and most extensive notebook, **4. Modeling and Findings**. In that notebook I hope to use my current data and new features to create a model that can accurately predict earthquakes.