## Imputation

In [None]:
threshold = 0.7

# Drop columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold]]

# Drop rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis = 1) < threshold]

# Option 1: fill all missing values with 0
data = data.fillna(0)

# Option 2: fill missing numeric values with the medians of the columns
data = data.fillna(data.median(numeric_only = True))

# Option 3 (categorical columns): fill with the most frequent value
data['column_name'] = data['column_name'].fillna(data['column_name'].value_counts().idxmax())

# Option 4 (categorical columns): fill with a placeholder category
data['column_name'] = data['column_name'].fillna('Other')

## Handling Outliers

In [None]:
# Drop the outlier rows with standard deviation

factor = 3

upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]


# Drop the outlier rows with percentiles

upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]


# Cap the outlier rows with percentiles

data.loc[data['column'] > upper_lim, 'column'] = upper_lim
data.loc[data['column'] < lower_lim, 'column'] = lower_lim
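
The capping step above can also be written with `Series.clip`. The following is a minimal sketch on a hypothetical toy series (the values are illustrative, not from the original data); unlike dropping rows, it keeps every observation and only limits the extreme values.

```python
import pandas as pd

# Hypothetical toy series with one extreme value
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Percentile limits, as in the cell above
lower_lim = s.quantile(.05)
upper_lim = s.quantile(.95)

# Values outside the limits are replaced by the limits; no rows are dropped
capped = s.clip(lower=lower_lim, upper=upper_lim)
```
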

## Log Transform

1) It helps to handle skewed data: after the transformation, the distribution is closer to normal.  
2) It also reduces the effect of outliers by compressing differences in magnitude, which makes the model more robust.

In [None]:
import numpy as np

# Shift the values so the minimum becomes 1, then take the natural log
data['log'] = (data['value'] - data['value'].min() + 1).transform(np.log)
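
To illustrate the effect on skewness, here is a small sketch on a hypothetical series of powers of two (chosen only because its logarithm is evenly spaced). It applies the same shift-then-log transform and compares the sample skewness before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data: powers of two
s = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)

# Same transform as above: shift so the minimum is 1, then take the log
logged = np.log(s - s.min() + 1)

skew_before = s.skew()      # strongly positive
skew_after = logged.skew()  # close to zero for this particular series
```
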

# Feature Engineering

## Indicator variables

* Indicator variable from thresholds
* Indicator variable from multiple features
* Indicator variable for special events
* Indicator variable for groups of classes
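
The four kinds of indicator variables can be sketched as follows. All column names and values here are hypothetical and serve only as illustration.

```python
import pandas as pd

# Hypothetical toy data; the columns are illustrative only
df = pd.DataFrame({
    'age': [17, 25, 40, 16],
    'bedrooms': [1, 2, 2, 3],
    'bathrooms': [1, 2, 1, 2],
    'date': pd.to_datetime(['2017-12-25', '2017-03-01', '2017-12-25', '2017-07-04']),
    'source': ['facebook', 'google', 'email', 'instagram'],
})

# Indicator variable from a threshold
df['is_adult'] = (df['age'] >= 18).astype(int)

# Indicator variable from multiple features
df['two_bed_two_bath'] = ((df['bedrooms'] == 2) & (df['bathrooms'] == 2)).astype(int)

# Indicator variable for a special event
df['is_christmas'] = ((df['date'].dt.month == 12) & (df['date'].dt.day == 25)).astype(int)

# Indicator variable for a group of classes
df['is_social'] = df['source'].isin(['facebook', 'instagram']).astype(int)
```
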

## Interaction features

* Sum of two features
* Difference between two features
* Product of two features
* Quotient of two features

We do not recommend using an automated loop to create interactions for all your features. This leads to feature explosion.
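
A minimal sketch of the four interaction types, built by hand for one pair of hypothetical features `a` and `b`. The quotient replaces zeros in the denominator with `NaN` to avoid infinite values; that guard is an assumption of this example, not a general rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features
df = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [2.0, 0.0, 3.0]})

df['a_plus_b'] = df['a'] + df['b']    # Sum of two features
df['a_minus_b'] = df['a'] - df['b']   # Difference between two features
df['a_times_b'] = df['a'] * df['b']   # Product of two features

# Quotient of two features; a zero denominator yields NaN instead of inf
df['a_over_b'] = df['a'] / df['b'].replace(0, np.nan)
```
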

## Feature representation

* Date and time features

In [None]:
from datetime import date

import pandas as pd

data = pd.DataFrame({'date':
                    ['01-01-2017',
                    '04-12-2008',
                    '23-06-1988',
                    '25-08-1999',
                    '20-02-1993']})

# Transform string to date

data['date'] = pd.to_datetime(data.date, format = "%d-%m-%Y")

# Extract year

data['year'] = data['date'].dt.year

# Extract month

data['month'] = data['date'].dt.month

# Extract years passed since the date

data['passed_years'] = date.today().year - data['date'].dt.year

# Extract months passed since the date

data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

# Extract the weekday name of the date

data['day_name'] = data['date'].dt.day_name()

* Binning

The main motivation for binning is to make the model more robust and prevent overfitting; however, this comes at a cost to performance. Every time you bin a feature, you sacrifice information and make your data more regularized.

In [None]:
# Numerical binning

data['bin'] = pd.cut(data['value'], bins = [0, 30, 70, 100], labels = ['low', 'mid', 'high'])

# Categorical binning

conditions = [
    data['country'].str.contains('Spain'),
    data['country'].str.contains('Italy'),
    data['country'].str.contains('Chile'),
    data['country'].str.contains('Brazil')
]

choices = ['Europe', 'Europe', 'South America', 'South America']

data['continent'] = np.select(conditions, choices, default = 'Other')  # Group sparse classes

* One-hot encoding

In [None]:
encoded_columns = pd.get_dummies(data['column'])

data = data.join(encoded_columns).drop('column', axis = 1)

* Label encoding
* Frequency encoding
* Target mean encoding
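
The last three encodings in the list can be sketched with plain pandas. The `city` and `target` columns are hypothetical. In practice, the target means should be computed on the training data only; otherwise the encoding leaks the target into the features.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'city': ['Rome', 'Madrid', 'Rome', 'Lima', 'Rome', 'Madrid'],
    'target': [1, 0, 1, 0, 0, 1],
})

# Integer code per category (codes follow the sorted category order)
df['city_label'] = df['city'].astype('category').cat.codes

# Frequency encoding: share of rows that belong to each category
df['city_freq'] = df['city'].map(df['city'].value_counts(normalize=True))

# Target mean encoding: mean of the target within each category
df['city_target_mean'] = df.groupby('city')['target'].transform('mean')
```
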

## Textual Data

* Bag-of-Words: Extract tokens from text and use their occurrences as features
* NLP techniques:
  * Remove stop words
  * Convert all words to lower case
  * Stemming for English words
  * Pinyin for Chinese words
* Deep Learning for textual data
  * Turn each token into a vector of predefined size
  * Help compute "semantic distance" between tokens/words
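
A minimal bag-of-words sketch that combines lowercasing, stop-word removal, and token counting. The two documents and the tiny stop-word set are hypothetical; a real project would normally use a proper tokenizer and a fuller stop-word list.

```python
from collections import Counter

import pandas as pd

# Hypothetical documents and a deliberately tiny stop-word set
docs = ['The cat sat on the mat', 'The dog chased the cat']
stop_words = {'the', 'on'}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]

# One row per document, one column per token, values are occurrence counts
counts = [dict(Counter(tokenize(doc))) for doc in docs]
bow = pd.DataFrame(counts).fillna(0).astype(int)
```
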

## External Data

* External APIs
* Geocoding
* Other sources of the same data

## Error Analysis

* Start with larger errors
* Segment by classes
* Unsupervised clustering
* Ask colleagues or domain experts
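
The first two steps can be sketched on a hypothetical table of predictions: rank the rows by absolute error to inspect the largest errors first, then average the error per segment. The `segment`, `y_true`, and `y_pred` columns are assumptions of this example.

```python
import pandas as pd

# Hypothetical predictions from some regression model
results = pd.DataFrame({
    'segment': ['A', 'A', 'B', 'B', 'C'],
    'y_true': [10.0, 12.0, 30.0, 28.0, 50.0],
    'y_pred': [11.0, 12.5, 20.0, 35.0, 49.0],
})
results['abs_error'] = (results['y_true'] - results['y_pred']).abs()

# Start with larger errors: the two worst predictions
worst = results.sort_values('abs_error', ascending=False).head(2)

# Segment by classes: mean absolute error per segment, worst first
by_segment = results.groupby('segment')['abs_error'].mean().sort_values(ascending=False)
```
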