## Feature Engineering for Machine Learning

**Course structure:**
* Chapter 1: Feature creation and extraction
* Chapter 2: Engineering messy data
* Chapter 3: Feature normalization
* Chapter 4: Working with text features

### CHAPTER 1. Creating Features

#### 1.1 Why generate features?

**Feature engineering**:
* Act of taking raw data and extracting features from it
* Information is stored in columns of data

**Different types of data**:
* Continuous: either integers (or whole numbers) or floats (decimals)
* Categorical: one of a limited set of values, e.g., gender, country of birth
* Ordinal: ranked values, often with no detail of distance between them
* Boolean: True/False values
* Datetime: dates and times
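
As a rough illustration of how these data types can appear in pandas, the sketch below builds a small made-up DataFrame. The column names and values are hypothetical and are not taken from the survey dataset used later.

```python
import pandas as pd

# hypothetical toy data, one column per data type listed above
toy_df = pd.DataFrame({
    'salary': [52000.0, 61500.5, 48000.0],            # continuous
    'country': ['UK', 'USA', 'UK'],                   # categorical
    'education': ['Bachelor', 'Master', 'Bachelor'],  # ordinal
    'is_hobbyist': [True, False, True],               # boolean
    'survey_date': pd.to_datetime(
        ['2018-02-28', '2018-06-28', '2018-06-06']),  # datetime
})

# an unordered categorical for values with no natural ranking
toy_df['country'] = toy_df['country'].astype('category')

# an ordered categorical stores the ranking without implying distances
toy_df['education'] = pd.Categorical(
    toy_df['education'], categories=['Bachelor', 'Master'], ordered=True
)

print(toy_df.dtypes)
```

Storing ordinal values as an ordered categorical is one possible choice; it keeps the ranking available for comparisons such as `.min()` and `.max()`.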

**Dataset**:
* The most common object in machine learning tasks
* These notes work mostly with the 'pandas' library
* Get to know your data through its column names and column types
* Basic information is available from *'.head()'* and *'.dtypes'*
* You can also select columns of specific data types with the *'.select_dtypes()'* method

In [1]:
# getting to know your data
import pandas as pd

so_survey_df = pd.read_csv('10_datasets/Combined_DS_v10.csv')
print(so_survey_df.head())
print(so_survey_df.dtypes)

      SurveyDate                                    FormalEducation  \
0  2/28/18 20:20           Bachelor's degree (BA. BS. B.Eng.. etc.)   
1  6/28/18 13:26           Bachelor's degree (BA. BS. B.Eng.. etc.)   
2    6/6/18 3:37           Bachelor's degree (BA. BS. B.Eng.. etc.)   
3    5/9/18 1:06  Some college/university study without earning ...   
4  4/12/18 22:41           Bachelor's degree (BA. BS. B.Eng.. etc.)   

   ConvertedSalary Hobby       Country  StackOverflowJobsRecommend  \
0              NaN   Yes  South Africa                         NaN   
1          70841.0   Yes       Sweeden                         7.0   
2              NaN    No       Sweeden                         8.0   
3          21426.0   Yes       Sweeden                         NaN   
4          41671.0   Yes            UK                         8.0   

      VersionControl  Age  Years Experience Gender   RawSalary  
0                Git   21                13   Male         NaN  
1     Git;Subversion  

In [2]:
# selecting specific data types

# create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])
print(so_numeric_df.columns)

Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age',
       'Years Experience'],
      dtype='object')
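
The selection above can also be tried without the CSV file. The following is a minimal sketch on a made-up frame: only the column names are borrowed from the survey data, the rows are hypothetical. It uses the `'number'` selector, which pandas accepts for all numeric dtypes, and `exclude` to get the complementary columns.

```python
import numpy as np
import pandas as pd

# hypothetical rows; only the column names mirror the survey data
toy_df = pd.DataFrame({
    'ConvertedSalary': [np.nan, 70841.0],
    'Age': [21, 38],
    'Country': ['South Africa', 'Sweeden'],
    'Hobby': ['Yes', 'No'],
})

# keep every numeric column (integers and floats)
numeric_df = toy_df.select_dtypes(include='number')
print(numeric_df.columns.tolist())

# keep everything that is NOT numeric
other_df = toy_df.select_dtypes(exclude='number')
print(other_df.columns.tolist())
```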


#### 1.2 Dealing with categorical variables

**Categorical variables**:
* They represent groups that are qualitative in nature (e.g., colors, country of birth)
* They need to be encoded to be used in your machine learning models

**Encoding categorical features**:
* Values can be encoded by creating additional binary features (value 0 and 1)
* Simply assigning numbers to values can be misleading because the numbers imply an ordering that does not exist
* Two main approaches (both can be done with the *'pd.get_dummies()'* function):
    1. One-hot encoding ('pd.get_dummies(df, columns, prefix)'): n categories -> n features
    2. Dummy encoding ('pd.get_dummies(df, columns, **drop_first=True**, prefix)'): n categories -> n-1 features
* For columns with many categories, you can limit the number of new columns by encoding only the most common values

**One-hot vs. dummies:**
* One-hot encoding: explainable features
* Dummy encoding: Necessary information without duplication
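
To make the difference concrete, here is a small sketch on a hypothetical three-country column (the data is made up for illustration). It shows that one-hot rows always sum to 1, which is the duplication that dummy encoding removes, and that the dropped category is still recoverable as the row where every dummy column is 0.

```python
import pandas as pd

# hypothetical toy column with three categories
toy_df = pd.DataFrame({'Country': ['France', 'India', 'UK', 'India']})

# one-hot: one column per category (3 categories -> 3 columns)
one_hot = pd.get_dummies(toy_df, columns=['Country'], prefix='OH')
print(one_hot.columns.tolist())

# dummy: the first category is dropped (3 categories -> 2 columns)
dummy = pd.get_dummies(toy_df, columns=['Country'], prefix='DM',
                       drop_first=True)
print(dummy.columns.tolist())

# each one-hot row contains exactly one 1, so any column is implied
# by the others
row_sums = one_hot.sum(axis=1)

# the dropped category ('France') is the row where all dummies are 0
is_base_category = dummy.sum(axis=1) == 0
print(is_base_category.tolist())
```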


In [3]:
# one-hot encoding and dummy variables
import pandas as pd

# get original data
so_survey_df = pd.read_csv('10_datasets/Combined_DS_v10.csv')

# convert the Country column to one-hot encoded columns
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')
print(one_hot_encoded.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
       'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
       'OH_UK', 'OH_USA', 'OH_Ukraine'],
      dtype='object')


In [4]:
# convert Country column to dummy variables
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')
print(dummy.columns)

# note that the France column is dropped in the dummy encoding (it acts as the base category)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
       'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
       'DM_USA', 'DM_Ukraine'],
      dtype='object')


In [7]:
# dealing with uncommon categories

# get value_counts of column 'Country'
country_counts = so_survey_df['Country'].value_counts()
print(country_counts)

# create a mask for less common values
mask = so_survey_df['Country'].isin(country_counts[country_counts < 10].index)

# label them 'Other'; .loc avoids chained assignment and its SettingWithCopyWarning
so_survey_df.loc[mask, 'Country'] = 'Other'
print(so_survey_df['Country'].value_counts())

# note that 'Ukraine' and 'Ireland' become 'Other'

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Ukraine           9
Ireland           5
Name: Country, dtype: int64
South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Other            14
Name: Country, dtype: int64


#### 1.3 Numeric variables

**Types of numeric features**:
* Age
* Price
* Counts
* Geospatial data (e.g: coordinates)

**Key considerations:**
1. Does size matter? Is the *magnitude* of a value more important, or just its *direction*?
    * If the magnitude does NOT matter, turn it into a binary variable with values 0 and 1
2. An extension of this is binning numeric variables into groups
    * Bins can be created with the *'pd.cut()'* function in pandas

In [8]:
# binarizing columns
import pandas as pd

# get original data
so_survey_df = pd.read_csv('10_datasets/Combined_DS_v10.csv')

# create new 'Paid_Job' column with zeros
so_survey_df['Paid_Job'] = 0

# replace values where ConvertedSalary > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())

   Paid_Job  ConvertedSalary
0         0              NaN
1         1          70841.0
2         0              NaN
3         1          21426.0
4         1          41671.0


In [10]:
# binning values
import numpy as np

# treat missing salaries as 0, then bin 'ConvertedSalary' into 5 equal-width groups
so_survey_df['ConvertedSalary'] = so_survey_df['ConvertedSalary'].fillna(0)
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=5)
print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())

          equal_binned  ConvertedSalary
0  (-2000.0, 400000.0]              0.0
1  (-2000.0, 400000.0]          70841.0
2  (-2000.0, 400000.0]              0.0
3  (-2000.0, 400000.0]          21426.0
4  (-2000.0, 400000.0]          41671.0
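
All five rows above land in the same bin because `bins=5` splits the range between the column's minimum and maximum into equal-width intervals, so a single very large value stretches every bin. The sketch below illustrates this on hypothetical salaries (the values, including the outlier, are made up) and uses `retbins=True` to inspect the computed edges.

```python
import pandas as pd

# hypothetical salaries with one large outlier
salaries = pd.Series([0, 20000, 40000, 70000, 2000000])

# retbins=True also returns the bin edges that pd.cut computed
binned, edges = pd.cut(salaries, bins=5, retbins=True)

# 5 bins need 6 edges; the lowest edge is nudged just below the minimum
# so that the minimum value itself falls inside the first bin
print(edges)

# position of the bin each value fell into (0 = first bin)
bin_positions = binned.cat.codes
print(bin_positions.tolist())
```

When most values crowd into one bin like this, explicit boundaries are one way to get more informative groups.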


In [11]:
# binning values with boundaries
import numpy as np

# specify the boundaries and labels of bin
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very High']

# bin the column
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=bins, labels=labels)
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())

  boundary_binned  ConvertedSalary
0        Very low              0.0
1          Medium          70841.0
2        Very low              0.0
3             Low          21426.0
4             Low          41671.0
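
Two details of boundary binning are easy to miss: by default the bins are closed on the right, and the labelled result is an ordered categorical. A minimal sketch, reusing the same boundaries and labels on a few hypothetical salary values:

```python
import numpy as np
import pandas as pd

# hypothetical salaries; 10000 sits exactly on a boundary
salaries = pd.Series([0.0, 10000.0, 21426.0, 70841.0])

bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very High']

# intervals are right-closed by default, so 10000 falls in 'Very low'
binned = pd.cut(salaries, bins=bins, labels=labels)
print(binned.tolist())

# the labels keep their order, so comparisons between groups work
at_least_medium = binned >= 'Medium'
print(at_least_medium.tolist())
```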


### CHAPTER 2. Dealing with messy data

#### 2.1 Why do missing values exist?


### CHAPTER 3. Conforming to Statistical Assumptions

#### 3.1

### CHAPTER 4. Dealing with Text Data

#### 4.1