# Feature engineering

An important part of the preprocessing workflow.
Feature engineering is the creation of new features based on the existing ones.

"Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data."

Machine learning algorithms learn a solution to a problem from sample data.

In this context, feature engineering asks: what is the best representation of the sample data to learn a solution to your problem?

A feature is an attribute that is useful or meaningful to your problem. It is an important part of an observation for learning about the structure of the problem that is being modeled.

Feature importance scores can also provide you with information that you can use to extract or construct new features that are similar to, but different from, those that have been estimated to be useful.

A feature may be important if it is highly correlated with the dependent variable (the thing being predicted). Correlation coefficients and other univariate methods (where each attribute is considered independently) are commonly used to estimate this.
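
As a minimal sketch of the univariate approach described above, the following ranks candidate features by their Pearson correlation with the target. The column names and values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data: two candidate features and a target column.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [40, 38, 43, 39, 41, 42],
    "exam_score": [52, 55, 61, 68, 74, 79],
})

# Pearson correlation of each feature with the target, one feature at a time.
correlations = df.drop(columns="exam_score").corrwith(df["exam_score"])

# Rank the features by the absolute strength of the correlation.
ranking = correlations.abs().sort_values(ascending=False)
print(ranking)
```

A high correlation only suggests a linear relationship with the target; features that matter through interactions can still score low here.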

## List of Techniques

* imputation

* handling outliers

* binning

* log transformation

* one-hot encoding

* grouping operations

* feature split

* scaling

* extracting date

To inspect the features used in the analysis:

In [1]:
import pandas as pd

In [None]:
df = pd.read_csv(path_to_csv_file)
print(df.head())

In [None]:
df.columns

### data types

In [None]:
print(df.dtypes)

### Selecting specific data types

In [None]:
df.select_dtypes(include = ['int', 'float'])

### Dealing with Categorical Variables

To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are one-hot encoding and dummy encoding.

#### Encoding categorical features

* One-hot encoding
* Dummy encoding

The pandas function get_dummies() performs one-hot encoding by default.

One-hot encoding converts n categories into n features.
The function takes a DataFrame and a list of categorical columns and returns an updated DataFrame with those columns encoded. Specifying a prefix with the prefix argument can improve readability.

**Dummy encoding** creates n - 1 features for n categories, omitting the first category.

#### One-hot vs. dummies

* One-hot encoding: explainable features, but it can create collinear features because the same information appears multiple times.

* Dummy encoding: the necessary information without duplication.
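
The difference between the two encodings can be sketched on a small, hypothetical column. The `color` data below is an assumption made for illustration; `dtype=int` is passed only so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical toy column with three categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# One-hot encoding: one column per category (n = 3 columns).
one_hot = pd.get_dummies(df, columns=["color"], prefix="color", dtype=int)

# Dummy encoding: the first category is dropped (n - 1 = 2 columns).
dummy = pd.get_dummies(df, columns=["color"], prefix="color",
                       drop_first=True, dtype=int)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())

# Every one-hot row sums to 1, so any single column can be derived
# from the others; this is the source of the collinearity.
print(one_hot.sum(axis=1).tolist())
```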



In [None]:
# To see the categories of a specific column and how often each occurs:
counts = df['categorical feature'].value_counts()
print(counts)

# To limit the number of categories we can create a mask.
# A mask is a boolean Series that describes which
# values in a column should be affected.

mask = df['categorical feature'].isin(counts[counts < 5].index)
# Use .loc to avoid chained assignment, which may not modify df
df.loc[mask, 'categorical feature'] = 'Other'
print(df['categorical feature'].value_counts())


In [None]:
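# One-hot encoding of a single categorical column
pd.get_dummies(df, columns = ['categorical feature'], prefix = 'C')

# Dummy encoding: drop_first = True omits the first category
pd.get_dummies(df, columns = ['list of categorical features'],
                              drop_first = True, prefix = 'M')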

In [20]:
df = pd.DataFrame({'key' : ['b', 'b', 'a', 'c', 'a', 'b'],
                  'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [22]:
dummies = pd.get_dummies(df['key'], prefix = 'key')
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [23]:
df[['data1']].join(dummies)

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


### Encoding binary variables - pandas

In [None]:
# Map 'y' to 1 and any other value to 0
users['sub_enc'] = users['subscribed'].apply(lambda val: 1 if val == 'y' else 0)

### Encoding binary variables - scikit-learn
**LabelEncoder**

In [None]:
from sklearn.preprocessing import LabelEncoder

# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

# Compare the two columns
print(hiking[["Accessible_enc", "Accessible"]].head())
#.fit_transform() is a good way to both fit an encoding and transform the data in a single step.

## Numerical variables

While numeric values can often be used without any feature engineering, there will be cases when some form of manipulation is useful. For example, on some occasions you might not care about the magnitude of a value but only about its direction, or whether it exists at all. In these situations, you will want to binarize the column.
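
A minimal sketch of binarizing a column, assuming a hypothetical `violations` count where only the presence of a violation matters:

```python
import pandas as pd

# Hypothetical column: number of violations recorded per inspection.
df = pd.DataFrame({"violations": [0, 3, 0, 1, 7]})

# Binarize: 1 if any violation exists, 0 otherwise.
df["has_violation"] = (df["violations"] > 0).astype(int)
print(df)
```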


### Binning values

For many continuous values you will care less about the exact value of a numeric column, but instead care about the bucket it falls into. This can be useful when plotting values, or simplifying your machine learning models. It is mostly used on continuous variables where accuracy is not the biggest concern e.g. age, height, wages.

Bins are created using pd.cut(df['column_name'], bins) where bins can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries.

Continuous data is often discretized, or otherwise separated into bins, for analysis.

In [2]:
ages = [20, 22 , 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

bins = [18, 25, 35, 60, 100]

categorias = pd.cut(ages, bins)
categorias

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [3]:
categorias.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [4]:
categorias.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [5]:
pd.value_counts(categorias)  # bin counts for the result of pd.cut

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [6]:
group_names = ['juventude', 'jovem adulto', 'idade média', 'senior']
pd.cut(ages, bins, labels = group_names)

[juventude, juventude, juventude, jovem adulto, juventude, ..., jovem adulto, senior, idade média, idade média, jovem adulto]
Length: 12
Categories (4, object): [juventude < jovem adulto < idade média < senior]

In [7]:
import numpy as np

In [9]:
data  = np.random.rand(20)
pd.cut(data, 4, precision = 2)

[(0.026, 0.26], (0.026, 0.26], (0.72, 0.95], (0.49, 0.72], (0.72, 0.95], ..., (0.49, 0.72], (0.49, 0.72], (0.49, 0.72], (0.72, 0.95], (0.49, 0.72]]
Length: 20
Categories (4, interval[float64]): [(0.026, 0.26] < (0.26, 0.49] < (0.49, 0.72] < (0.72, 0.95]]

In [16]:
data = list(range(1, 100))  # 99 integers, 1 through 99


In [17]:
cats = pd.qcut(data, 4) # split into quartiles
cats

[(0.999, 25.5], (0.999, 25.5], (0.999, 25.5], (0.999, 25.5], (0.999, 25.5], ..., (74.5, 99.0], (74.5, 99.0], (74.5, 99.0], (74.5, 99.0], (74.5, 99.0]]
Length: 99
Categories (4, interval[float64]): [(0.999, 25.5] < (25.5, 50.0] < (50.0, 74.5] < (74.5, 99.0]]

In [18]:
pd.value_counts(cats)

(74.5, 99.0]     25
(25.5, 50.0]     25
(0.999, 25.5]    25
(50.0, 74.5]     24
dtype: int64

#### Binning  

Binning can be applied on both categorical and numerical data:

#Numerical Binning Example
Value      Bin       
0-30   ->  Low       
31-70  ->  Mid       
71-100 ->  High
#Categorical Binning Example
Value      Bin       
Spain  ->  Europe      
Italy  ->  Europe       
Chile  ->  South America
Brazil ->  South America

The main motivation for binning is to make the model more robust and prevent overfitting; however, it comes at a cost to performance. Every time you bin something, you sacrifice information and make your data more regularized (see regularization in machine learning).

The trade-off between performance and overfitting is the key point of the binning process. In my opinion, for numerical columns, except for some obvious overfitting cases, binning might be redundant for some kinds of algorithms because of its effect on model performance.

However, for categorical columns, labels with low frequencies probably affect the robustness of statistical models negatively. Assigning a general category to these less frequent values therefore helps to keep the model robust. For example, if your dataset has 100,000 rows, it might be a good option to merge the labels with a count of less than 100 into a new category like “Other”.

In [None]:
# Numerical Binning Example
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Low", "Mid", "High"])
# Result:
#    value   bin
# 0      2   Low
# 1     45   Mid
# 2      7   Low
# 3     85  High
# 4     28   Low

# Categorical Binning Example
# Input:
#      Country
# 0      Spain
# 1      Chile
# 2  Australia
# 3      Italy
# 4     Brazil
conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]

choices = ['Europe', 'Europe', 'South America', 'South America']

# Countries that match no condition fall into 'Other'
data['Continent'] = np.select(conditions, choices, default='Other')
# Result:
#      Country      Continent
# 0      Spain         Europe
# 1      Chile  South America
# 2  Australia          Other
# 3      Italy         Europe
# 4     Brazil  South America

A common feature engineering method is to use an aggregate (such as the mean) of a set of features in place of those features. This can be useful for reducing the dimensionality of the feature space.

Using an aggregate statistic such as the mean:

In [None]:
columns = ['day1', 'day2', 'day3']
df['media'] = df.apply(lambda row: row[columns].mean(), axis = 1)
print(df)

## Dealing with missing values 1
### Listwise deletion in Python

Suitable for categorical variables where the missing data occur at random.

In [None]:
# Drop all rows with at least one missing value
df.dropna(how = 'any')

In [None]:
# Drop rows with missing values in a specific column
df.dropna(subset = ['VersionControl'])

For categorical columns, it is common to replace missing values with strings such as 'other', 'not given', etc.

In [None]:
# Replace missing values in a specific column with a given string
# (assign the result back instead of using inplace on a column selection)
df['VersionControl'] = df['VersionControl'].fillna('None Given')

In [None]:
# Record where the values are not missing
df['SalaryGiven'] = df['ConvertedSalary'].notnull()

# Drop a specific column (drop returns a new DataFrame, so reassign it)
df = df.drop(columns = ['ConvertedSalary'])

In [None]:
# Using a boolean mask to filter specific rows (for example, to drop rows where data is missing)
print(df[df['B'] == 7])
# Counting the number of missing elements in a specific column
print(df['col_name'].isnull().sum())

# Showing which elements in a specific column are not null
print(df['col_name'].notnull())

# Drop columns that have fewer than 3 non-missing values
df.dropna(axis = 1, thresh = 3)

# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

# Imputation
threshold = 0.7
#Dropping columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold]]

#Dropping rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]

#Filling all missing values with 0
data = data.fillna(0)
#Filling missing values with medians of the columns
data = data.fillna(data.median())

## Dealing with missing values 2
### Filling continuous missing values

* categorical columns: replace missing values with the most common occurring value or with a string that flags missing values such as 'None'

* numerical columns: replace missing values with a suitable value, such as the mean or the median

In [None]:
df['Column X'] = df['Column X'].fillna(df['Column X'].mean())

# Fill a categorical column with its most frequent value
most_common = data['column_name'].value_counts().idxmax()
data['column_name'] = data['column_name'].fillna(most_common)

### Dealing with bad characters

In [None]:
df['RawSalary'] = df['RawSalary'].str.replace(',', '')

Sometimes a quick df.head() call is enough to tell which characters are causing an issue. In many cases this will not be so apparent. There will often be values deep within a column that prevent you from casting it to a numeric type so that it can be used in a model or in further feature engineering.

One approach to finding these values is to force the column to the desired data type using pd.to_numeric(), coercing any values that cause issues to NaN, and then filtering the DataFrame down to just the rows containing NaN values.

In [None]:
# Attempt to convert the column to numeric values
numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')

# Find the indexes of missing values
idx = numeric_vals.isna()

# Print the relevant rows
print(so_survey_df['RawSalary'][idx])

In [None]:
# type conversion
df['RawSalary'] = df['RawSalary'].astype('float') 

In [None]:
# Chaining methods

df['column_name'] = df['column_name'].method1().method2().method3()

In [None]:
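# Use method chaining
# regex=False treats ',' '$' and '£' as literal characters
# ('$' would otherwise be read as a regex end-of-string anchor in older pandas)
so_survey_df['RawSalary'] = so_survey_df['RawSalary']\
                              .str.replace(',', '', regex=False)\
                              .str.replace('$', '', regex=False)\
                              .str.replace('£', '', regex=False)\
                              .astype('float')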
 
# Print the RawSalary column
print(so_survey_df['RawSalary'])

An important consideration before building a machine learning model is to understand what the distribution of your underlying data looks like. Many algorithms make assumptions about how your data is distributed or how different features interact with each other. For example, many models other than tree-based models work best when the features are on the same scale.
Feature engineering can be used to manipulate the data so that it fits these distributional assumptions, or at least comes as close to them as possible. Many models other than tree-based models also assume, or work best when, the data is approximately normally distributed.

To understand the shape of your own data, you can create histograms of each of the continuous features.

In [None]:
# To plot histograms of only the numeric columns,
# select those columns into a new DataFrame and then use

import matplotlib.pyplot as plt

df.hist()
plt.show()

# To create a boxplot with pandas
# Create a boxplot of two columns
so_numeric_df[['Age', 'Years Experience']].boxplot()
plt.show()

#### Min-Max scaling

Min-Max scaling is when your data is scaled linearly between a minimum and a maximum value. Because the scaling is linear, the values change but the shape of the distribution does not.

When could you use normalization (MinMaxScaler) when working with a dataset?
Normalization scales all points linearly between a lower and an upper bound, so it is useful when features need to share a common, bounded range.
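
The linear rescaling described above is x' = (x - min) / (max - min), which is what MinMaxScaler computes with its default feature_range of (0, 1). A minimal sketch with plain pandas, using hypothetical values:

```python
import pandas as pd

# Hypothetical values to rescale.
values = pd.Series([10.0, 20.0, 25.0, 50.0])

# Min-max scaling: the minimum maps to 0 and the maximum maps to 1.
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled.tolist())
```

The relative spacing between the points is preserved, which is why the shape of the distribution does not change.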

In [None]:
# Min-Max scaling in Python

from sklearn.preprocessing import MinMaxScaler
# preprocessing module of scikit-learn, one of the most widely used ML libraries

scaler = MinMaxScaler()

scaler.fit(df[['column']])

df['normalized_column'] = scaler.transform(df[['column']])

#### Standardization

Another commonly used scaler is standardization. Standardization finds the mean of your data and centers the distribution around it, calculating the number of standard deviations each point lies from the mean.
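
The calculation behind standardization is z = (x - mean) / std. A minimal sketch with plain pandas and hypothetical values follows; `ddof=0` is used because StandardScaler divides by the population standard deviation, whereas pandas defaults to the sample standard deviation.

```python
import pandas as pd

# Hypothetical values with mean 5 and population standard deviation 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z-score: number of standard deviations each point lies from the mean.
z = (values - values.mean()) / values.std(ddof=0)
print(z.tolist())
```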

In [None]:
# Standardization in Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(df[['col']])

df['standardized_col'] = scaler\
                         .transform(df[['col']])

#### Log Transformation

A log transformation, on the other hand, can be used to make highly skewed distributions less skewed.

Logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering. The benefits of the log transform are:

* It helps to handle skewed data; after the transformation, the distribution becomes closer to normal.

* In most cases, the order of magnitude of the data changes within the range of the data. For instance, the difference between ages 15 and 20 is not equivalent to the difference between ages 65 and 70. In terms of years they are identical, but in all other respects a 5-year difference at young ages means a larger difference in magnitude. This type of data comes from a multiplicative process, and the log transform normalizes magnitude differences like that.

* It also decreases the effect of outliers, because magnitude differences are normalized, and the model becomes more robust.

**A critical note**: The data you apply the log transform to must contain only positive values; for zero or negative values NumPy returns -inf or NaN (with a warning) rather than a usable number. You can also add 1 to your data before transforming it, which ensures the output is non-negative for non-negative inputs.
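
For the "add 1 before transforming" variant mentioned in the note, NumPy provides `np.log1p`, which computes log(1 + x) directly. A minimal sketch on hypothetical non-negative, right-skewed values:

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative, right-skewed values (note the zero).
values = pd.Series([0, 1, 9, 99, 999])

# log1p computes log(1 + x), so zero maps to zero instead of -inf.
logged = np.log1p(values)
print(logged.round(3).tolist())

# The transformed values are far less skewed than the originals.
print(values.skew(), logged.skew())
```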

In [31]:
#Log Transform Example
data = pd.DataFrame({'value':[2,45, -23, 85, 28, 2, 35, -12]})
data['log+1'] = (data['value']+1).transform(np.log)
#Negative Values Handling
#Note that the values are different
data['log'] = (data['value']-data['value'].min()+1) .transform(np.log)
data

Unnamed: 0,value,log+1,log
0,2,1.098612,3.258097
1,45,3.828641,4.234107
2,-23,,0.0
3,85,4.454347,4.691348
4,28,3.367296,3.951244
5,2,1.098612,3.258097
6,35,3.583519,4.077537
7,-12,,2.484907


In [None]:
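# Power transformation in Python
# Note: PowerTransformer applies a Yeo-Johnson transform by default
# (and standardizes the result); it reduces skew like a log transform
# but is not a plain logarithm.

from sklearn.preprocessing import PowerTransformer

log = PowerTransformer()

# fit expects a 2D input, hence the double brackets
log.fit(df[['col']])

df['log_col'] = log.transform(df[['col']])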


#### Removing outliers

In [None]:
# Quantiles in Python

q_cutoff = df['col_name'].quantile(0.95)

mask = df['col_name'] < q_cutoff

trimmed_df = df[mask]


##### Standard deviation based detection
Based on the distance from the mean.


In [None]:
# Standard deviation detection in Python

mean = df['col_name'].mean()
std = df['col_name'].std()

cut_off = std * 3

lower, upper = mean - cut_off, mean + cut_off

new_df = df[(df['col_name'] < upper) & (df['col_name'] > lower)]

# Dropping the outlier rows with standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]

#### An Outlier Dilemma: Drop or Cap

Another option for handling outliers is to cap them instead of dropping them. This way you keep your data size, and it might be better for the final model performance.
On the other hand, capping can affect the distribution of the data, so it is better not to overdo it.

In [None]:
# Capping the outliers with percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data.loc[data['column'] > upper_lim, 'column'] = upper_lim
data.loc[data['column'] < lower_lim, 'column'] = lower_lim
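
As an alternative to the `.loc` assignments, the same percentile capping can be sketched with `Series.clip`. The data and the `capped` column name are hypothetical.

```python
import pandas as pd

# Hypothetical column with one extreme value.
data = pd.DataFrame(
    {"column": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]}
)

lower_lim = data["column"].quantile(0.05)
upper_lim = data["column"].quantile(0.95)

# clip replaces values outside [lower_lim, upper_lim] with the nearest limit.
data["capped"] = data["column"].clip(lower=lower_lim, upper=upper_lim)
print(data)
```

No rows are removed, so the dataset keeps its original size.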

#### Extracting Date

Though date columns usually provide valuable information about the model target, they are often neglected as an input or used nonsensically by machine learning algorithms. One reason may be that dates can appear in numerous formats, which makes them hard for algorithms to understand, even when they are simplified to a format like "01-01-2017".

Building an ordinal relationship between the values is very challenging for a machine learning algorithm if you leave the date columns unprocessed. Here, I suggest three types of preprocessing for dates:

* Extracting the parts of the date into different columns: Year, month, day, etc.
* Extracting the time period between the current date and columns in terms of years, months, days, etc.
* Extracting some specific features from the date: Name of the weekday, Weekend or not, holiday or not, etc.

If you transform the date column into extracted columns like the ones above, the information in them is exposed and machine learning algorithms can use it easily.

In [34]:
from datetime import date

data = pd.DataFrame({'date':
['01-01-2017',
'04-12-2008',
'23-06-1988',
'25-08-1999',
'20-02-1993',
]})

#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")

#Extracting Year
data['year'] = data['date'].dt.year

#Extracting Month
data['month'] = data['date'].dt.month

#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year

#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()
data

Unnamed: 0,date,year,month,passed_years,passed_months,day_name
0,2017-01-01,2017,1,3,40,Sunday
1,2008-12-04,2008,12,12,137,Thursday
2,1988-06-23,1988,6,32,383,Thursday
3,1999-08-25,1999,8,21,249,Wednesday
4,1993-02-20,1993,2,27,327,Saturday
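
The third kind of feature listed above (weekend or not), together with the elapsed time in days, can be sketched as follows. The fixed reference date is an assumption made so that the result is reproducible; in practice `pd.Timestamp.today()` could be used instead.

```python
import pandas as pd

data = pd.DataFrame({"date": ["01-01-2017", "04-12-2008", "23-06-1988"]})
data["date"] = pd.to_datetime(data["date"], format="%d-%m-%Y")

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
data["is_weekend"] = data["date"].dt.dayofweek >= 5

# Days elapsed up to a fixed, hypothetical reference date.
reference = pd.Timestamp("2020-05-01")
data["passed_days"] = (reference - data["date"]).dt.days
print(data)
```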


In [None]:
# Dates

df['date_converted'] = pd.to_datetime(df['date'])

df['month'] = df['date_converted'].dt.month

#### Stratified Sampling

In [None]:
from sklearn.model_selection import train_test_split

# Create a DataFrame with all columns except category_desc
volunteer_X = volunteer.drop('category_desc', axis= 1)

# Create a category_desc labels dataset
volunteer_y = volunteer[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify = volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train["category_desc"].value_counts())

### Grouping Operations

In most machine learning algorithms, every instance is represented by a row in the training dataset, where every column shows a different feature of the instance. This kind of data is called “tidy”.

"Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table."

Datasets such as transactions rarely fit the definition of tidy data above, because an instance spans multiple rows. In such a case, we group the data by instance so that every instance is represented by only one row.

The key point of group-by operations is to decide the aggregation functions of the features. For numerical features, average and sum functions are usually convenient options, whereas for categorical features it is more complicated.

##### Categorical Column Grouping

* The first option is to select the label with the highest frequency. In other words, this is the max operation for categorical columns; because ordinary max functions generally do not return this value, you need a lambda function for this purpose.


In [None]:
data.groupby('id').agg(lambda x: x.value_counts().index[0])

* The second option is to make a pivot table. This approach resembles the encoding method in the preceding step, with one difference: instead of binary notation, the values can be defined as aggregated functions between the grouped and encoded columns. This is a good option if you aim to go beyond binary flag columns and merge multiple features into aggregated features, which are more informative.

In [None]:
#Pivot table Pandas Example
data.pivot_table(index='column_to_group', columns='column_to_encode', values='aggregation_column',
                 aggfunc='sum', fill_value = 0)

* The last categorical grouping option is to apply a group-by function after applying one-hot encoding. This method preserves all the data (in the first option you lose some), and in addition it transforms the encoded column from categorical to numerical. See the next section for an explanation of numerical column grouping.
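
The last option above (one-hot encoding followed by a group-by) can be sketched as follows. The transaction data and the `id` and `city` column names are hypothetical.

```python
import pandas as pd

# Hypothetical transaction data: several rows per customer id.
data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "city": ["Rome", "Rome", "Madrid", "Lima", "Lima"],
})

# One-hot encode the categorical column, keeping the grouping key.
encoded = pd.get_dummies(data, columns=["city"], prefix="city", dtype=int)

# Summing the binary columns gives counts per id;
# averaging them gives the share of rows per id.
counts = encoded.groupby("id").sum()
ratios = encoded.groupby("id").mean()
print(counts)
print(ratios)
```

Each id now occupies a single row, and no category is discarded the way it is when only the most frequent label is kept.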

#### Numerical Column Grouping

Numerical columns are grouped using sum and mean functions in most cases. Either can be preferable, depending on the meaning of the feature. For example, if you want to obtain ratio columns, you can use the average of binary columns. In the same example, the sum function can be used to obtain the total count as well.

In [None]:
#sum_cols: List of columns to sum
#mean_cols: List of columns to average
grouped = data.groupby('column_to_group')

sums = grouped[sum_cols].sum().add_suffix('_sum')
avgs = grouped[mean_cols].mean().add_suffix('_avg')

new_df = pd.concat([sums, avgs], axis=1)

#### Feature Split

Splitting features is a good way to make them useful for machine learning. Most of the time, a dataset contains string columns that violate tidy data principles. By extracting the usable parts of a column into new features, we:

* Enable machine learning algorithms to comprehend them.
* Make it possible to bin and group them.
* Improve model performance by uncovering potential information.

The split function is a good option; however, there is no single way of splitting features. How to split a column depends on its characteristics. Let’s introduce it with two examples. First, a simple split function for an ordinary name column:


In [None]:
data.name
0  Luther N. Gonzalez
1    Charles M. Young
2        Terry Lawson
3       Kristen White
4      Thomas Logsdon
#Extracting first names
data.name.str.split(" ").map(lambda x: x[0])
0     Luther
1    Charles
2      Terry
3    Kristen
4     Thomas
#Extracting last names
data.name.str.split(" ").map(lambda x: x[-1])
0    Gonzalez
1       Young
2      Lawson
3       White
4     Logsdon

The example above handles names longer than two words by taking only the first and last elements, which makes the function robust for corner cases; these should be kept in mind when manipulating strings like this.

Another use of the split function is to extract the part of a string between two characters. The following example shows an implementation of this case using two split functions in a row.

In [None]:
#String extraction example
data.title.head()
0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
data.title.str.split("(", n=1, expand=True)[1].str.split(")", n=1, expand=True)[0]
0    1995
1    1995
2    1995
3    1995
4    1995
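
As an alternative to chaining two splits, the same extraction can be sketched with a regular expression via `Series.str.extract`. This is a different technique from the split-based one above, and the pattern assumes that the year is a four-digit number in parentheses at the end of the title.

```python
import pandas as pd

data = pd.DataFrame({"title": ["Toy Story (1995)",
                               "Jumanji (1995)",
                               "Father of the Bride Part II (1995)"]})

# Capture four digits enclosed in parentheses at the end of the string.
# expand=False returns a Series because there is a single capture group.
data["year"] = data["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
print(data)
```

Titles that do not match the pattern get NaN instead of raising an error, which makes unexpected formats easy to find afterwards.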