1. Data Collection
    - Identify the data sources
        - Climate - NASA, NOAA, World Bank climate data
        - GIS - OpenStreetMap, Sentinel-2, Landsat, Google Earth Engine
    - Understand the formats of data
        - Raster data - GeoTIFF
        - Vector data - Shapefile, GeoJSON
        - Tabular data - Excel, CSV

2. Load the data
3. Clean the data
    - Missing Values
    - Convert data types
    - Handle Duplicates
4. Spatial and Temporal analysis
    - Time Series Analysis
5. Data Visualization
6. Feature Engineering 
7. Prepare the data for modelling
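
Steps 2 and 3 of the outline can be sketched with pandas on a small in-memory CSV. The column names (`date`, `station`, `temp_c`) and all values below are made up for illustration; they do not come from any of the sources listed above.

```python
import io
import pandas as pd

# Hypothetical climate readings: one duplicated row and one unparseable value
raw_csv = io.StringIO(
    "date,station,temp_c\n"
    "2024-01-01,A,21.5\n"
    "2024-01-02,A,n/a\n"
    "2024-01-02,A,n/a\n"
    "2024-01-03,B,19.0\n"
)

# Load the data; keep everything as text so each cleaning step is explicit
climate = pd.read_csv(raw_csv, dtype=str, keep_default_na=False)

# Handle duplicates: drop rows that are exact copies of an earlier row
climate = climate.drop_duplicates().reset_index(drop=True)

# Convert data types; values that cannot be parsed as numbers become NaN
climate["date"] = pd.to_datetime(climate["date"])
climate["temp_c"] = pd.to_numeric(climate["temp_c"], errors="coerce")

# Missing values: count them per column before deciding how to fill them
missing_per_column = climate.isnull().sum()
print(missing_per_column)
```

The order of the steps is a choice: deduplicating on the raw text first keeps the comparison independent of how the values are later parsed.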

# Feature Engineering: 
1. Introduction to Feature Engineering
2. Setting up env for feature engineering
3. Feature Engineering for Numerical and Categorical data
4. Feature Engineering for text data
5. Feature Engineering for Image data
6. Handling imbalanced data - SMOTE
7. GIS Feature Engineering

# Introduction 
- Feature engineering is the process of transforming raw data into meaningful features to improve the performance of the ML model we are going to create.
- It includes:
    - Handling categorical and numerical columns
    - Working with text data
    - Balancing imbalanced data
    - Extracting features from the data

## Set up the environment
- Links:
    - https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

In [1]:
# Importing all the libraries that I will need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapely
import rasterio
from sklearn.preprocessing import OneHotEncoder, StandardScaler


# Feature Engineering for Numerical and Categorical data

In [2]:
# Use titanic data
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


### Handling missing data
- Numerical data - fill missing data with median
- Categorical data - fill missing data with mode
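
The two rules above can be wrapped in a small helper. This is only a sketch: `fill_missing` is a hypothetical name and the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps set to the median and other gaps to the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if not filled[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            modes = filled[col].mode()
            if not modes.empty:  # an all-missing column has no mode
                filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Toy data (not the Titanic frame): one numeric gap and one categorical gap
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "town": ["Southampton", "Cherbourg", None, "Southampton"],
})
toy_filled = fill_missing(toy)
print(toy_filled)
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for skewed numeric columns.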

In [3]:
# Check missing data
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [4]:
# Look at the information of my data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [5]:
# Fill missing values
# Assign the result back instead of calling fillna(..., inplace=True) on a
# selected column: that chained form raises a FutureWarning and, per the
# warning text, will stop working in pandas 3.0.
# Numeric
df['age'] = df['age'].fillna(df['age'].median())
# Categorical
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

In [6]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
dtype: int64

In [7]:
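# Preview the frame without 'deck' (688 of 891 values are missing).
# drop() returns a new DataFrame; df itself keeps the column unless the
# result is assigned back.
df.drop('deck', axis = 1)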

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,Southampton,yes,True
888,0,3,female,28.0,1,2,23.4500,S,Third,woman,False,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,Cherbourg,yes,True
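
`deck` is missing in 688 of 891 rows, which is what makes it a candidate for removal. One way to make that decision systematically is a missing-share threshold. The sketch below assumes a 50% cutoff and a hypothetical helper name, `drop_sparse_columns`; neither comes from the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Return a new frame without columns whose share of missing values exceeds max_missing."""
    missing_share = frame.isnull().mean()  # fraction of NaN per column
    to_drop = missing_share[missing_share > max_missing].index
    return frame.drop(columns=to_drop)


# Toy data: 'deck' is missing in 3 of 4 rows, 'fare' is complete
toy = pd.DataFrame({
    "fare": [7.25, 71.2833, 7.925, 53.1],
    "deck": [np.nan, "C", np.nan, np.nan],
})
trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```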


In [8]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.361582,0.523008,0.381594,32.204208
std,0.486592,0.836071,13.019697,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,22.0,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,35.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


## Feature Transformations

In [9]:
df['embark_town'].unique()

array(['Southampton', 'Cherbourg', 'Queenstown'], dtype=object)

In [10]:
# Log transformation - this helps to reduce right skew in numerical data;
# log1p computes log(1 + x), so fares of 0 are handled without error
df['fare_log'] = np.log1p(df['fare'])
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,fare_log
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,2.110213
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,4.280593
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,2.188856
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,3.990834
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,2.202765
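
To see what the transformation does, the sketch below compares skewness before and after `np.log1p` on a handful of fare values taken from the outputs above (including the minimum of 0 and the maximum of 512.3292), and checks that `np.expm1` undoes it. The sample is far too small to stand in for the full column; it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

# A few fares from the outputs above, including a zero and the large maximum
fare = pd.Series([0.0, 7.25, 7.925, 8.05, 53.1, 71.2833, 512.3292])

fare_log = np.log1p(fare)       # log(1 + x), defined at x = 0
fare_back = np.expm1(fare_log)  # inverse transform: exp(y) - 1

print("skew before:", fare.skew())
print("skew after: ", fare_log.skew())
```

Because `expm1` inverts `log1p`, predictions made on the log scale can be mapped back to the original units.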


In [12]:
# One-hot encoding - converting categorical variables to numeric form;
# drop_first=True drops one level per column to avoid a redundant dummy
df = pd.get_dummies(df, columns = ['sex', 'embark_town'], drop_first = True)

In [13]:
df.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,alive,alone,fare_log,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.25,S,Third,man,True,,no,False,2.110213,True,False,True
1,1,1,38.0,1,0,71.2833,C,First,woman,False,C,yes,False,4.280593,False,False,False
2,1,3,26.0,0,0,7.925,S,Third,woman,False,,yes,True,2.188856,False,False,True
3,1,1,35.0,1,0,53.1,S,First,woman,False,C,yes,False,3.990834,False,False,True
4,0,3,35.0,0,0,8.05,S,Third,man,True,,no,True,2.202765,True,False,True


## Feature Engineering for Text Data

In [14]:
df = sns.load_dataset('car_crashes')
df.head()

Unnamed: 0,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev
0,18.8,7.332,5.64,18.048,15.04,784.55,145.08,AL
1,18.1,7.421,4.525,16.29,17.014,1053.48,133.93,AK
2,18.6,6.51,5.208,15.624,17.856,899.47,110.35,AZ
3,22.4,4.032,5.824,21.056,21.28,827.34,142.39,AR
4,12.0,4.2,3.36,10.92,10.68,878.41,165.63,CA


In [17]:
df

Unnamed: 0,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev
0,18.8,7.332,5.64,18.048,15.04,784.55,145.08,AL
1,18.1,7.421,4.525,16.29,17.014,1053.48,133.93,AK
2,18.6,6.51,5.208,15.624,17.856,899.47,110.35,AZ
3,22.4,4.032,5.824,21.056,21.28,827.34,142.39,AR
4,12.0,4.2,3.36,10.92,10.68,878.41,165.63,CA
5,13.6,5.032,3.808,10.744,12.92,835.5,139.91,CO
6,10.8,4.968,3.888,9.396,8.856,1068.73,167.02,CT
7,16.2,6.156,4.86,14.094,16.038,1137.87,151.48,DE
8,5.9,2.006,1.593,5.9,5.9,1273.89,136.05,DC
9,17.9,3.759,5.191,16.468,16.826,1160.13,144.18,FL


In [24]:
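descriptions = [
    "High Speed Crash on the highway",
    "Low speed collision at an intersection",
    "Drunk Driving accident",
    "Rear-end crash in city traffic",
]
# The row count need not be a multiple of 4, so repeat the list one extra
# time and trim it to exactly len(df). List repetition needs an int, so use
# // (integer division); / returns a float and raises a TypeError.
df['accident_desc'] = (descriptions * (len(df) // len(descriptions) + 1))[:len(df)]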
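
Once each row has a description, simple features can be derived from the text with pandas string methods. The sketch below uses the four example descriptions on their own; the feature names and the keyword choices (`speed`, `drunk`, `alcohol`) are assumptions made for illustration.

```python
import pandas as pd

text_df = pd.DataFrame({
    "accident_desc": [
        "High Speed Crash on the highway",
        "Low speed collision at an intersection",
        "Drunk Driving accident",
        "Rear-end crash in city traffic",
    ]
})

# Lowercase first so keyword matching is case-insensitive
lowered = text_df["accident_desc"].str.lower()

# Number of whitespace-separated tokens in each description
text_df["word_count"] = lowered.str.split().str.len()

# Keyword flags: plain substring match, then a regex alternation
text_df["mentions_speed"] = lowered.str.contains("speed", regex=False)
text_df["mentions_alcohol"] = lowered.str.contains("drunk|alcohol", regex=True)

print(text_df)
```

Hand-written flags like these are easy to interpret; for larger vocabularies a bag-of-words or TF-IDF representation is the usual next step.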