# Data Preprocessing

## Dealing with Missing Data

- Identifying missing values
- Techniques for addressing missing values

### Identifying missing values

In [None]:
# access the underlying NumPy array of a DataFrame/column
df.values
# count the number of missing values per column
df.isnull().sum()
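
As a minimal sketch of the calls above, assuming a small hypothetical DataFrame (the column names and values are made up for illustration), the per-column count and the affected rows can be inspected like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

Looking at the affected rows, not just the counts, helps decide whether dropping or imputing is the better option.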

### Removing missing values
Note: dropping rows or columns is simple, but it can discard a lot of useful information.

In [None]:
# drop only the rows in which all values are missing
df.dropna(how = 'all')

# drop rows/instances with missing values
df.dropna(axis = 0)

# drop columns/features with missing values
df.dropna(axis = 1)

# keep only rows with at least k non-missing values (k is a placeholder integer)
df.dropna(thresh = k)

# drop rows where NaN appear in specific columns
df.dropna(subset = ['column'])
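
To make the differences between these options concrete, here is a small sketch on a hypothetical DataFrame (the data is invented for illustration); comparing the resulting shapes shows how much information each option discards:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row is entirely missing
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0, np.nan],
    "B": [2.0, 6.0, 11.0, np.nan],
    "C": [3.0, np.nan, 12.0, np.nan],
    "D": [4.0, 8.0, np.nan, np.nan],
})

rows_all = df.dropna(how="all")       # drops only the all-NaN row
rows_any = df.dropna(axis=0)          # drops every row containing a NaN
cols_any = df.dropna(axis=1)          # drops every column containing a NaN
thresh_4 = df.dropna(thresh=4)        # keeps rows with at least 4 non-NaN values
subset_c = df.dropna(subset=["C"])    # drops rows where column C is NaN

for name, result in [("how='all'", rows_all), ("axis=0", rows_any),
                     ("axis=1", cols_any), ("thresh=4", thresh_4),
                     ("subset=['C']", subset_c)]:
    print(name, result.shape)
```

Note that `dropna` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.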


### Imputing missing values
- replace each missing value with the mean, median, or most frequent value (for categorical variables) of the entire feature column

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
imr = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# strategy: mean or median for continuous values, most_frequent for categorical variables
# SimpleImputer always imputes column-wise; it has no axis argument
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
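
As an illustrative sketch of mean imputation, assuming a recent scikit-learn in which imputation is provided by `sklearn.impute.SimpleImputer` and a hypothetical three-column DataFrame, the result can be cross-checked against the pandas equivalent `fillna(df.mean())`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in B and one in C
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, np.nan, 12.0],
    "C": [3.0, 7.0, np.nan],
})

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(df.values)

# The same operation expressed directly in pandas
pandas_imputed = df.fillna(df.mean())

print(imputed)
```

In a train/test setting the imputer would be fitted on the training data only and then applied to the test data, in the same way as the scalers later in this section.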

## Handling categorical variables

- Ordinal features
- Nominal features
- Encoding class labels

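### Ordinal features: categorical variables that can be sorted or ordered

Mapping ordinal features: define the mapping from each category to its integer rank by hand, since the correct order cannot be inferred automatically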

In [None]:
map_dic = {'category': 'value'}
df['ordinal'] = df['ordinal'].map(map_dic)
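
A concrete version of this mapping might look as follows. The `size` column and its ordering M < L < XL are assumptions chosen purely for illustration:

```python
import pandas as pd

# Hypothetical ordinal feature: T-shirt sizes with the order M < L < XL
df = pd.DataFrame({"size": ["M", "L", "XL", "M"]})

# Map each category to an integer that reflects its rank
size_mapping = {"M": 1, "L": 2, "XL": 3}
df["size_encoded"] = df["size"].map(size_mapping)

# Invert the dictionary to recover the original labels later
inverse_mapping = {value: key for key, value in size_mapping.items()}
df["size_decoded"] = df["size_encoded"].map(inverse_mapping)

print(df)
```

Any category missing from the dictionary is mapped to `NaN`, so it is worth checking for missing values after the mapping.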

### Nominal features: categorical variables that do not imply any order

One-hot encoding on nominal features: create a new dummy feature for each unique value in the nominal feature column


In [None]:
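# Method A
from sklearn.preprocessing import OneHotEncoder

# X: 2-D array or DataFrame holding only the nominal columns
# categories = 'auto' infers the categories from the data
ohe = OneHotEncoder(categories = 'auto')
# the default output is a sparse matrix; .toarray() converts it to a dense array
# (in scikit-learn >= 1.2, sparse_output = False returns a dense array directly)
ohe.fit_transform(X).toarray()

# Method B:
pd.get_dummies(df[['nominal_features']])
# the dummy columns are perfectly collinear; drop one column to avoid multicollinearity
pd.get_dummies(df[['nominal_features']], drop_first = True)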
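
The two methods can be compared on a small example. This is a sketch only, with a hypothetical `color` column standing in for a nominal feature:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature with three unique values
df = pd.DataFrame({"color": ["green", "red", "blue", "red"]})

# Method A: scikit-learn returns a sparse matrix by default
ohe = OneHotEncoder()
onehot_array = ohe.fit_transform(df[["color"]]).toarray()

# Method B: pandas creates one dummy column per unique value
dummies = pd.get_dummies(df[["color"]])

# Dropping the first dummy column removes the redundant feature
dummies_reduced = pd.get_dummies(df[["color"]], drop_first=True)

print(onehot_array)
print(dummies.columns.tolist())
print(dummies_reduced.columns.tolist())
```

With `drop_first=True` no information is lost: a row with zeros in all remaining columns must belong to the dropped category.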

### Encoding class labels: representing class labels as integers

In [None]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
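
As a short sketch with hypothetical label strings, the encoder assigns integers in sorted order of the labels and can map them back again:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels
labels = np.array(["class2", "class1", "class2", "class1"])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)

# classes_ stores the original labels; the index of each label is its integer code
print(class_le.classes_)
print(y)

# Recover the original string labels from the integer codes
original = class_le.inverse_transform(y)
```

Class labels are not ordinal, so it does not matter which integer is assigned to which label.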

## Feature Scaling

- Feature scaling is crucial for distance- or similarity-based algorithms such as SVM, KNN, and K-means. Decision trees and random forests are two of the very few machine learning algorithms that are scale invariant.

- Two common approaches for feature scaling:
    1. Normalization: rescale the features to a range of [0,1], a special case of min-max scaling
        
        $x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$
        
    2. Standardization: center the feature columns at mean 0 with standard deviation 1, which makes it easier to learn the weights for gradient descent algorithms. Unlike min-max scaling, standardization keeps useful information about outliers while making the algorithm less sensitive to them.
    
        $x_{std}^{(i)} = \frac{x^{(i)}-\mu_x}{\sigma_x}$

In [None]:
# MinMax Scaler
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

# Standardization
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

# Note: fit each scaler only once, on the training data, and reuse those parameters to transform the test set
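
To connect the two formulas to the scalers, the following sketch computes both transformations by hand on a hypothetical one-feature dataset and compares them with scikit-learn. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature training and test data
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[2.5], [10.0]])

# Normalization by hand: (x - x_min) / (x_max - x_min)
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
manual_norm = (X_train - x_min) / (x_max - x_min)

# Standardization by hand: (x - mean) / std, using the population std
manual_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# The scalers learn their parameters from the training data only
mms = MinMaxScaler().fit(X_train)
stdsc = StandardScaler().fit(X_train)

# Test values are transformed with the training parameters,
# so they can fall outside [0, 1] after min-max scaling
X_test_norm = mms.transform(X_test)
print(X_test_norm)
```

The second test value lies beyond the training maximum, which is why its normalized value exceeds 1.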

## Common solutions to reduce the generalization error

### Regularization

- L1 and L2 regularization
- both add bias by penalizing large weights, which favors a simpler model and reduces variance when there is too little training data to fit the model reliably
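
As an illustration of the two penalties, the sketch below fits scikit-learn's `Lasso` (L1-penalized linear regression) and `Ridge` (L2-penalized linear regression) on synthetic data. The dataset, the `alpha` values, and the choice of regression models are assumptions made for this example, not part of the notes above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which influence the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# L1 penalty: shrinks weights and can set some exactly to zero
lasso = Lasso(alpha=5.0).fit(X, y)

# L2 penalty: shrinks weights toward zero without eliminating them
ridge = Ridge(alpha=5.0).fit(X, y)

n_zero_l1 = int(np.sum(lasso.coef_ == 0))
n_zero_l2 = int(np.sum(ridge.coef_ == 0))
print("zero weights with L1:", n_zero_l1)
print("zero weights with L2:", n_zero_l2)
```

Because L1 regularization tends to produce sparse weight vectors, it can also act as a form of feature selection.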

### Dimensionality Reduction
- Feature selection:
    1. Sequential feature selection algorithms: a family of greedy search algorithms that reduce an initial d-dimensional feature space to a k-dimensional feature subspace, where k < d. The goal is to automatically select the subset of features most relevant to the problem, which improves computational efficiency or reduces the generalization error of the model by removing irrelevant features or noise.
    2. Tree-based models to select features by importance
    3. Univariate statistical tests
    
- Feature extraction
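
The three feature selection approaches listed above can be sketched with scikit-learn as follows. The Iris dataset, the KNN estimator, and the target of two features are assumptions chosen for illustration, and `SequentialFeatureSelector` requires scikit-learn 0.24 or later:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       f_classif)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1. Sequential (greedy forward) selection down to k = 2 features
sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=5),
                                n_features_to_select=2, direction="forward")
X_sfs = sfs.fit_transform(X, y)

# 2. Tree-based importances: one score per feature, summing to 1
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# 3. Univariate test: keep the 2 features with the highest ANOVA F-score
kbest = SelectKBest(score_func=f_classif, k=2).fit(X, y)
kbest_idx = kbest.get_support(indices=True)

print("sequential selection mask:", sfs.get_support())
print("forest importances:", importances)
print("univariate selection indices:", kbest_idx)
```

The three methods need not agree: sequential selection evaluates feature subsets with a specific model, while the univariate test scores each feature in isolation.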