# Preprocessing data methods

## Missing Data

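##### Types of missing data:
- MCAR - Missing Completely at Random (missingness is unrelated to any variable, observed or unobserved)
- MAR - Missing at Random (missingness is related to other observed variables, but not to the missing values themselves)
- MNAR - Missing Not at Random (missingness is related to the missing values themselves or to unobserved variables)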
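
The three mechanisms above can be simulated to see how they differ. This is a minimal sketch on made-up data; the column names (`age`, `income`) and the missingness probabilities are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
# hypothetical complete dataset
df_full = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10

# MAR: missingness in income depends on the observed age column
mar_mask = rng.random(n) < np.where(df_full["age"] > 60, 0.30, 0.05)

# MNAR: missingness in income depends on the income value itself
mnar_mask = rng.random(n) < np.where(df_full["income"] > 65_000, 0.40, 0.05)

income_mcar = df_full["income"].mask(mcar_mask)
income_mar = df_full["income"].mask(mar_mask)
income_mnar = df_full["income"].mask(mnar_mask)

# Under MCAR the observed mean stays close to the true mean; under MNAR it is
# pulled downward because high incomes go missing more often
print(df_full["income"].mean(), income_mcar.mean(), income_mnar.mean())
```

The MNAR case is the hardest to handle in practice, because the information needed to correct the bias is exactly what is missing.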

##### Visualizing Correlation of Missing Data

    import missingno as msno
    msno.heatmap(df)     # visualizes the correlation of missingness between columns
    msno.dendrogram(df)  # shows a tree diagram of that correlation

    # Fill missing values with dummy values so that rows with missing data
    # stay visible when two columns are compared on a scatter plot
    from numpy.random import rand

    def fill_dummy_values(df, scaling_factor=1):
        df_dummy = df.copy(deep=True)
        for col_name in df_dummy:
            col = df_dummy[col_name]
            col_null = col.isnull()
            num_nulls = col_null.sum()
            col_range = col.max() - col.min()
            # (rand - 2) is negative, so the dummy values fall below the column minimum
            dummy_values = (rand(num_nulls) - 2) * scaling_factor * col_range + col.min()
            df_dummy.loc[col_null, col_name] = dummy_values
        return df_dummy

    # Fill dummy values in a copy of df (results can be visualized with a scatter plot)
    df_dummy = fill_dummy_values(df)

    # Sum the nullity of the two columns being compared ('col_x' and 'col_y' are placeholders)
    nullity = df['col_x'].isna() + df['col_y'].isna()

    # Create a scatter plot of the two columns, colored by nullity
    df_dummy.plot(x='col_x', y='col_y', kind='scatter', alpha=0.5, c=nullity, cmap='rainbow')

### Numerical

    # visualizing missing data
    import missingno as msno
    msno.bar(df)  # bar chart of non-missing counts per column (remember to call plt.show())

### Time Series

    # visualizing missing data
    import missingno as msno
    msno.matrix(df)  # shows where values are missing, row by row; can also be applied to a slice of dates

### Deleting Missing Values

1. pairwise - skips only the missing values in each calculation (pandas does this by default, e.g. in `df.mean()`)
2. listwise - removes every row (or column) that contains a missing value, using `df.dropna()`
    **only use when missing data is MCAR**
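
The difference between the two deletion strategies can be seen on a tiny example. This is a sketch with made-up values; `df_small` is a hypothetical DataFrame used only for illustration.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Pairwise: each column's mean skips only that column's missing values
pairwise_means = df_small.mean()

# Listwise: drop every row with any missing value, then compute the means
listwise_means = df_small.dropna().mean()

# df_small.dropna(axis=1) would instead drop columns that contain missing values
print(pairwise_means)
print(listwise_means)
```

Listwise deletion keeps only two of the four rows here, which shows how quickly it can shrink a dataset.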

### Imputing Missing Values
Imputation replaces missing values with substitute values.

#### fillna()

##### Types:
- ffill - forward fill - replace NaN with the last observed value
- bfill - backfill - replace NaN with the next observed value

Example:

    df.fillna(method='ffill', inplace=True)  # newer pandas versions deprecate method=; use df.ffill(inplace=True)

#### interpolate()
**preferred method for time-series data**
##### Types:
- linear - draws a straight line between the last and next observations and imputes equally spaced values along it
- quadratic - fits a parabola through the surrounding observations, so imputed values follow a curve (requires SciPy)
- nearest - fills each NaN with the closest observed value, behaving like a mix of ffill and bfill (requires SciPy)

Example:

    df.interpolate(method='linear', inplace=True)
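
To compare forward fill, backfill and linear interpolation side by side, here is a minimal sketch on a made-up series (the values are illustrative and not from any real dataset).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

filled = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),                        # carry the last observation forward
    "bfill": s.bfill(),                        # carry the next observation backward
    "linear": s.interpolate(method="linear"),  # equally spaced values between neighbors
})
print(filled)
```

For a series with a trend, the linear column follows the trend while the two fill methods produce flat steps, which is one reason interpolation is often preferred for time series.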

#### SimpleImputer

    # Simple Imputer
    from sklearn.impute import SimpleImputer
    df_copy = df.copy(deep=True)  # makes a copy for comparison to the original
    # strategy can be 'mean', 'median', 'most_frequent' (mode) or 'constant';
    # with strategy='constant', also pass fill_value=<your constant>
    si = SimpleImputer(strategy='mean')  # 'mean' is used here only as an example
    df_copy.iloc[:, :] = si.fit_transform(df_copy)

#### FancyImpute

    from fancyimpute import KNN, IterativeImputer
    # KNN uses the K nearest neighbors to replace missing values
    # IterativeImputer uses multiple regressions to replace missing values (most robust)

    # example:
    ki = KNN()
    df_copy = df.copy(deep=True)  # makes a copy for comparison to the original
    df_copy.iloc[:, :] = ki.fit_transform(df_copy)

### Categorical Imputation
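Convert, then impute: if the data are strings, encode them as numbers first, then fill the NaNs (e.g. with KNN, which in effect picks the most frequent value among the nearest neighbors) and convert back to the original categories.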

    # Loop through each column and encode strings as integers, impute the missing
    # values with KNN, then convert the columns back to their original categories
    import numpy as np
    from fancyimpute import KNN
    from sklearn.preprocessing import OrdinalEncoder

    # Dictionary that stores one fitted encoder per column
    ordinal_enc_dict = {}

    def cat_data_imputer(df):
        for col_name in df:
            # Create an ordinal encoder for the column
            ordinal_enc_dict[col_name] = OrdinalEncoder()
            col = df[col_name]

            # Select and encode the non-null values of the column
            col_not_null = col[col.notnull()]
            reshaped_vals = col_not_null.values.reshape(-1, 1)
            encoded_vals = ordinal_enc_dict[col_name].fit_transform(reshaped_vals)

            # Store the encoded values back in the column
            df.loc[col.notnull(), col_name] = np.squeeze(encoded_vals)

        # Impute with KNN and round to the nearest encoded category
        KNN_imputer = KNN()
        df.iloc[:, :] = np.round(KNN_imputer.fit_transform(df))

        # Convert each ordinally encoded column back to its original categories
        for col_name in df:
            reshaped = df[col_name].values.reshape(-1, 1)
            df[col_name] = ordinal_enc_dict[col_name].inverse_transform(reshaped).ravel()
        return df

### Evaluating Imputations
1. Fit a linear regression on each imputed dataset and compare the results with those from the original dataset
2. Draw KDE plots for each imputed dataset and compare their shape with that of the original dataset

## Dealing with Categorical Variables

#### Identifying categorical variables
Because categorical variables need to be treated in a particular manner, you first need to identify which variables are categorical. In some cases this is easy (e.g. if they are stored as strings); in other cases they are stored as numbers and the fact that they are categorical is not immediately apparent. A good first step is to call `.describe()` and `.info()`. `.info()` shows the data type of each column (strings, integers, etc.) and `.describe()` gives summary statistics, but even then continuous variables might have been imported as strings, so it's very important to really have a look at your data.
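
One way to spot numeric columns that are really categorical is to count distinct values per column. The sketch below uses a made-up DataFrame (`df_cars`) and a hypothetical threshold (`max_levels`); both are assumptions for illustration, and the right threshold depends on your data.

```python
import pandas as pd

# hypothetical data: 'cylinders' and 'origin' are numeric but behave like categories
df_cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.1, 27.2, 31.5, 22.4],
    "cylinders": [8, 8, 4, 4, 4, 6],
    "origin": [1, 1, 3, 2, 3, 1],
    "name": ["a", "b", "c", "d", "e", "f"],
})

df_cars.info()  # column dtypes and non-null counts

summary = pd.DataFrame({
    "dtype": df_cars.dtypes.astype(str),
    "n_unique": df_cars.nunique(),
})
print(summary)

# numeric columns with few distinct values are candidates for categorical treatment
max_levels = 3  # hypothetical threshold
numeric_cols = df_cars.select_dtypes(include="number").columns
candidates = [c for c in numeric_cols if df_cars[c].nunique() <= max_levels]
print(candidates)
```

The list is only a set of candidates; a column such as a count with few observed values can pass the check and still be genuinely numeric.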

#### Transforming categorical variables
When you want to use categorical variables in regression models, they need to be transformed. There are two approaches to this:
- 1) Perform label encoding
- 2) Create dummy variables / one-hot-encoding

##### Label Encoding

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder

    from sklearn.preprocessing import LabelEncoder
    lb_make = LabelEncoder()

    origin_encoded = lb_make.fit_transform(cat_origin)
    
##### One hot encoding

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

    #pandas
    pd.get_dummies(cat_origin)
    
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
   
    #sklearn
    from sklearn.preprocessing import LabelBinarizer
    lb = LabelBinarizer()
    origin_dummies = lb.fit_transform(cat_origin)
    # you need to convert this back to a dataframe
    origin_dum_df = pd.DataFrame(origin_dummies,columns=lb.classes_)

## Multicollinearity

Because the idea behind regression is that you can change one variable and keep the others constant, correlation is a problem, because it indicates that changes in one predictor are associated with changes in another one as well. Because of this, the estimates of the coefficients can have big fluctuations as a result of small changes in the model. As a result, you may not be able to trust the p-values associated with correlated predictors.

#### Checking for multicollinearity

##### scatter matrix
    pd.plotting.scatter_matrix(data,figsize  = [11, 11]);
    
##### correlation matrix
    data.corr()
    
##### heatmap
    import seaborn as sns
    sns.heatmap(data_pred.corr(), center=0);
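
Besides inspecting the plots, the correlation matrix can be filtered for highly correlated pairs. This is a sketch on simulated data; the frame `predictors_demo` and the 0.75 cutoff are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
predictors_demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                      # independent of the others
})

corr = predictors_demo.corr().abs()

# keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

high_pairs = pairs[pairs > 0.75]  # hypothetical cutoff
print(high_pairs)
```

For each flagged pair, a common remedy is to drop one of the two predictors or to combine them into a single feature.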

## Feature Scaling and Normalization

The idea behind this is that, around every point of the regression line, we assume the data is spread around the regression line in a homogeneous way, with more points close to the line and fewer points further away.

Often, your dataset will contain features that vary widely in magnitude. If we leave these magnitudes unchanged, coefficient sizes will vary widely in magnitude as well. This can give the false impression that some variables are less important than others.

Even though this is not always a formal issue when estimating linear regression models, it can be an issue in more advanced machine learning models. This is because many machine learning algorithms use the Euclidean distance between data points in their computations, so features with larger scales dominate the distance. For those algorithms, features need to have similar scales. Some algorithms even require features to be zero-centered.

A good rule of thumb is, however, to check your features for normality, and while you're at it, scale your features so they have similar magnitudes, even for a "simple" model like linear regression.

#### Popular transformations

##### Log transformation

Log transformation is a very useful tool when you have data that clearly does not follow a normal distribution. It can help reduce skewness when you have skewed data, and can help reduce the variability of the data. Note that the logarithm is only defined for positive values.

    import numpy as np
    data_log= pd.DataFrame([])
    data_log["column"] = np.log(data["column"])

##### Min-max scaling

When performing min-max scaling, you can transform x to get the transformed $x'$ by using the formula:
$$x' = \dfrac{x - \min(x)}{\max(x)-\min(x)}$$
This way of scaling brings values between 0 and 1

    features_final["CRIM"] = (logcrim-min(logcrim))/(max(logcrim)-min(logcrim))

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler

    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    scaler.fit(data[['column']])  # double brackets: the scaler expects 2D input
    

##### Standardization

When performing standardization, you use the following formula:
$$x' = \dfrac{x - \bar x}{\sigma}$$
x' will have mean $\mu = 0$ and standard deviation $\sigma = 1$.
Note that standardization does not make data *more* normal, it will just change the mean and the standard deviation!

    features_final["DIS"]   = (logdis-np.mean(logdis))/np.sqrt(np.var(logdis))

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(data[['column']])  # double brackets: the scaler expects 2D input

##### Mean normalization
When performing mean normalization, you use the following formula:
$$x' = \dfrac{x - \text{mean}(x)}{\max(x)-\min(x)}$$
The distribution will have values between -1 and 1, and a mean of 0.

    features_final["LSTAT"] = (loglstat-np.mean(loglstat))/(max(loglstat)-min(loglstat))

##### Unit vector transformation
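When performing unit vector transformations, you create a new variable x' by dividing x by its norm, so that x' has length 1. If all values of x are non-negative, the values of x' fall in the range [0,1]:
$$x'= \dfrac{x}{{||x||}}$$
Recall that the norm of x is $||x||= \sqrt{(x_1^2+x_2^2+...+x_n^2)}$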
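
The four scaling formulas in this section can be checked numerically on a small array. This is a minimal NumPy sketch; the array `x` is made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

min_max = (x - x.min()) / (x.max() - x.min())      # values between 0 and 1
standardized = (x - x.mean()) / x.std()            # mean 0, standard deviation 1
mean_norm = (x - x.mean()) / (x.max() - x.min())   # mean 0, values between -1 and 1
unit_vec = x / np.linalg.norm(x)                   # vector of length 1

print(min_max)
print(standardized)
print(mean_norm)
print(unit_vec)
```

Note that none of these transformations changes the shape of the distribution; they only shift and rescale it.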

## Feature Selection

#### Stepwise Selection
In stepwise selection, you start with an empty model (which only includes the intercept), and each time, the variable whose parameter estimate has the lowest p-value is added to the model (forward step). After adding each new variable, the algorithm looks at the p-values of all the parameter estimates that were added to the model previously, and removes a variable if its p-value exceeds a certain threshold (backward step). The algorithm stops when no variables can be added or removed given the threshold values.

    import pandas as pd
    import statsmodels.api as sm

    def stepwise_selection(X, y, 
                           initial_list=[], 
                           threshold_in=0.01, 
                           threshold_out = 0.05, 
                           verbose=True):
        """ Perform a forward-backward feature selection 
        based on p-value from statsmodels.api.OLS
        Arguments:
            X - pandas.DataFrame with candidate features
            y - list-like with the target
            initial_list - list of features to start with (column names of X)
            threshold_in - include a feature if its p-value < threshold_in
            threshold_out - exclude a feature if its p-value > threshold_out
            verbose - whether to print the sequence of inclusions and exclusions
        Returns: list of selected features 
        Always set threshold_in < threshold_out to avoid infinite looping.
        See https://en.wikipedia.org/wiki/Stepwise_regression for the details
        """
        included = list(initial_list)
        while True:
            changed=False
            # forward step
            excluded = list(set(X.columns)-set(included))
            new_pval = pd.Series(index=excluded, dtype=float)
            for new_column in excluded:
                model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
                new_pval[new_column] = model.pvalues[new_column]
            best_pval = new_pval.min()
            if best_pval < threshold_in:
                best_feature = new_pval.idxmin()
                included.append(best_feature)
                changed=True
                if verbose:
                    print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

            # backward step
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
            # use all coefs except intercept
            pvalues = model.pvalues.iloc[1:]
            worst_pval = pvalues.max() # null if pvalues is empty
            if worst_pval > threshold_out:
                changed=True
                worst_feature = pvalues.idxmax()  # label of the worst feature (argmax returns a position)
                included.remove(worst_feature)
                if verbose:
                    print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
            if not changed:
                break
        return included
        
#### Recursive Feature Elimination

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    linreg = LinearRegression()
    selector = RFE(linreg, n_features_to_select = 2)
    selector = selector.fit(predictors, data_fin["mpg"])
    
    
#### Forward Selection using Adjusted R-squared    
    
    import statsmodels.formula.api as smf

    def forward_selected(data, response):
        """Linear model designed by forward selection.

        Parameters:
        -----------
        data : pandas DataFrame with all possible predictors and response

        response: string, name of response column in data

        Returns:
        --------
        model: an "optimal" fitted statsmodels linear model
               with an intercept
               selected by forward selection
               evaluated by adjusted R-squared
        """
        remaining = set(data.columns)
        remaining.remove(response)
        selected = []
        current_score, best_new_score = 0.0, 0.0
        while remaining and current_score == best_new_score:
            scores_with_candidates = []
            for candidate in remaining:
                formula = "{} ~ {} + 1".format(response,
                                               ' + '.join(selected + [candidate]))
                score = smf.ols(formula, data).fit().rsquared_adj
                scores_with_candidates.append((score, candidate))
            scores_with_candidates.sort()
            best_new_score, best_candidate = scores_with_candidates.pop()
            if current_score < best_new_score:
                remaining.remove(best_candidate)
                selected.append(best_candidate)
                current_score = best_new_score
        formula = "{} ~ {} + 1".format(response,
                                       ' + '.join(selected))
        model = smf.ols(formula, data).fit()
        return model
        
#### Permutation Importance for Classification

    #oob classifier accuracy for classification scoring
    def oob_classifier_accuracy(rf, X_train, y_train):
        """
        Compute out-of-bag (OOB) accuracy for a scikit-learn random forest
        classifier. We learned the guts of scikit's RF from the BSD licensed
        code:
        https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/forest.py#L425
        """
        X = X_train
        y = y_train

        n_samples = len(X)
        n_classes = len(np.unique(y))
        predictions = np.zeros((n_samples, n_classes))
        for tree in rf.estimators_:
            unsampled_indices = _generate_unsampled_indices(tree.random_state, n_samples)
            tree_preds = tree.predict_proba(X[unsampled_indices, :])
            predictions[unsampled_indices] += tree_preds

        predicted_class_indexes = np.argmax(predictions, axis=1)
        predicted_classes = [rf.classes_[i] for i in predicted_class_indexes]

        oob_score = np.mean(y == predicted_classes)
        return oob_score

    #package for PI
    import eli5
    from eli5.sklearn import PermutationImportance
    from sklearn.model_selection import train_test_split
    # private helper: newer scikit-learn versions moved it to sklearn.ensemble._forest
    # and added a third argument (n_samples_bootstrap)
    from sklearn.ensemble.forest import _generate_unsampled_indices
    X_train, X_test, y_train, y_test = train_test_split(f_scale,target_resample, random_state=0)
    perm = PermutationImportance(rf, cv=5, scoring = oob_classifier_accuracy) #can change scoring for other forms of models
    perm.fit(X_train, y_train)
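
As an alternative that avoids private scikit-learn helpers, scikit-learn itself provides `sklearn.inspection.permutation_importance`. The sketch below swaps in that function on simulated data; it scores on a held-out set rather than with the OOB scorer used above, and the names `X_sim`, `y_sim_cls` and `rf_demo` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_sim = rng.normal(size=(400, 3))
y_sim_cls = (X_sim[:, 0] > 0).astype(int)  # only feature 0 carries signal

X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim_cls, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# shuffle each feature in turn and measure the drop in held-out accuracy
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```

A feature whose shuffling barely changes the score contributes little to the model's predictions on that data.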

## Interactions in Regression Models

In statistics, an interaction is a particular property of three or more variables, where two or more variables interact in a non-additive manner when affecting a third variable. In other words, the two variables interact to have an effect that is more (or less) than the sum of their parts. Not accounting for them might lead to results that are wrong. You'll also notice that including them when they're needed will increase your $R^2$ value!
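
The effect on $R^2$ can be illustrated with simulated data. In this sketch the data-generating process (including the coefficient 3.0 on the interaction) and the helper `r_squared` are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# hypothetical outcome with a true interaction effect between x1 and x2
y_sim = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_additive = r_squared(np.column_stack([x1, x2]), y_sim)
r2_interaction = r_squared(np.column_stack([x1, x2, x1 * x2]), y_sim)
print(r2_additive, r2_interaction)
```

The additive model cannot represent the product term, so most of the variation it causes ends up in the residuals.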

#### Iterate through combinations of features to get top three interactions

    from itertools import combinations
    import numpy as np
    from sklearn.model_selection import cross_val_score

    # regression, crossvalidation, baseline and y are assumed to be defined earlier;
    # df is assumed to contain only the feature columns
    feature_pairs = list(combinations(df.columns, 2))
    interactions = []
    data = df.copy()
    for comb in feature_pairs:
        data["interaction"] = data[comb[0]] * data[comb[1]]
        score = np.mean(cross_val_score(regression, data, y, scoring="r2", cv=crossvalidation))
        if score > baseline: interactions.append((comb[0], comb[1], round(score,3)))

    print("Top 3 interactions: %s" %sorted(interactions, key=lambda inter: inter[2], reverse=True)[:3])
    
#### Feature engineer interactions into dataframe

    df_inter = df.copy()#make a copy of dataframe so original is not affected
    df_inter["RM_LSTAT"] = df["RM"] * df["LSTAT"] #combines the two features
    df_inter["RM_TAX"] = df["RM"] * df["TAX"]
    df_inter["RM_RAD"] = df["RM"] * df["RAD"]


## Polynomials in Regression (curved relationship)

When relationships between predictors and outcome are not linear and show some sort of a curvature, polynomials can be used to generate better approximations. The idea is that you can transform your input variable by e.g, squaring it.

$\hat y = \hat \beta_0 + \hat \beta_1x + \hat \beta_2 x^2$ 

The use of polynomials is not restricted to quadratic relationships; you can explore cubic and higher-order relationships as well! Imagine you want to go up to the power of 10: it would be quite annoying to transform your variable 9 times. Fortunately, scikit-learn has a built-in `PolynomialFeatures` transformer in its preprocessing module!

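#### scikit-learn polynomial selection with visual feedback and R-squared scores

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.preprocessing import PolynomialFeatures

    # X, y, X_plot and colors are assumed to be defined earlier
    for index, degree in enumerate([2,3,4]):
        poly = PolynomialFeatures(degree)
        # keep X and X_plot unchanged so each degree starts from the raw feature
        X_poly = poly.fit_transform(X)
        X_plot_poly = poly.transform(X_plot)
        reg_poly = LinearRegression().fit(X_poly, y)
        y_plot = reg_poly.predict(X_plot_poly)
        plt.plot(X_plot, y_plot, color=colors[index], linewidth = 2 ,
                 label="degree %d" % degree)
        print("degree %d" % degree, r2_score(y, reg_poly.predict(X_poly)))

    plt.legend(loc='lower left')
    plt.show();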



## Dimensionality Reduction with PCA

    from sklearn.decomposition import PCA

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

#### Sparseness in n-Dimensional Space

Points in n-dimensional space become increasingly sparse as the number of dimensions increases.
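
This sparseness can be measured directly. The sketch below draws the same number of random points in unit hypercubes of increasing dimension and reports the average distance to the nearest neighbor; the point count and the list of dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 200

def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore the distance of a point to itself
    return dists.min(axis=1).mean()

nn_dist = {}
for dim in [1, 2, 10, 50]:
    points = rng.random((n_points, dim))  # uniform in the unit hypercube
    nn_dist[dim] = mean_nearest_neighbor_distance(points)
    print(dim, round(nn_dist[dim], 3))
```

With a fixed number of observations, even the closest neighbor moves further away as dimensions are added, which weakens distance-based methods.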

#### Convergence Time
Another issue with increasing feature space is the training time required to fit a machine learning model. 

## PCA example on Iris Dataset

#### loading dataset into Pandas DataFrame
#### before PCA is performed, ensure that dataset is explored and standardized.
#### Initialize an instance of PCA from scikit-learn with n components

    pca=PCA(n_components=n)
    transformed = pca.fit_transform(X)

To visualize the components, it is useful to also look at the target associated with each observation. To do so, append the target (flower name) to the principal components in a pandas DataFrame.

    # Create a new dataset from principal components 

    df = pd.DataFrame(data = transformed, columns = ['PC1', 'PC2'])
    result_df = pd.concat([df, iris[['target']]], axis = 1)
    result_df.head()

#### Visualize Principal Components Using the target data

    # PCA scatter plot

    plt.style.use('seaborn-dark')  # named 'seaborn-v0_8-dark' in matplotlib 3.6 and later
    fig = plt.figure(figsize = (10,8))
    ax = fig.add_subplot(1,1,1) 
    ax.set_xlabel('First Principal Component ', fontsize = 15)
    ax.set_ylabel('Second Principal Component ', fontsize = 15)
    ax.set_title('Principal Component Analysis (2PCs) for Iris Dataset', fontsize = 20)

    targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    colors = ['r', 'g', 'b']
    for target, color in zip(targets,colors):
        indicesToKeep = iris['target'] == target
        ax.scatter(result_df.loc[indicesToKeep, 'PC1']
                   , result_df.loc[indicesToKeep, 'PC2']
                   , c = color
                   , s = 50)
    ax.legend(targets)
    ax.grid()

#### Calculate the variance explained by principal components

    print('Variance of each component:', pca.explained_variance_ratio_)
    print('\n Total Variance Explained:', round(sum(list(pca.explained_variance_ratio_))*100, 2))

#### Run a KNeighborsClassifier to classify the dataset after PCA

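    import time
    from sklearn import metrics, preprocessing
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X = result_df[['PC1', 'PC2']]
    y = iris.target
    y = preprocessing.LabelEncoder().fit_transform(y)
    start = time.perf_counter()  # timeit.timeit() does not return a timestamp

    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=9)
    model = KNeighborsClassifier()
    model.fit(X_train, Y_train)

    Yhat = model.predict(X_test)
    acc = metrics.accuracy_score(Y_test, Yhat)
    end = time.perf_counter()
    print("Accuracy:",acc)
    print ("Time Taken:", end - start)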

Some accuracy is lost after performing PCA, but computing time is reduced, and in some complex cases accuracy can even improve.

#### Plot decision boundary using principal components 

    def decision_boundary(pred_func):
        # Set the boundary
        x_min, x_max = X.iloc[:, 0].min() - 0.5, X.iloc[:, 0].max() + 0.5
        y_min, y_max = X.iloc[:, 1].min() - 0.5, X.iloc[:, 1].max() + 0.5
        h = 0.01

        # Build the meshgrid
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Plot the contour
        plt.figure(figsize=(15,10))
        plt.contourf(xx, yy, Z, cmap=plt.cm.afmhot)
        plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=plt.cm.Spectral, marker='x')

    decision_boundary(lambda x: model.predict(x))

    plt.title("decision boundary")

## Image Recognition with PCA

#### Obtain Data
#### Scrub and Explore
#### Baseline Model w/ SVC
    from sklearn import svm
    from sklearn.model_selection import train_test_split
    X = data.data
    y = data.target
    X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=22)
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


#### Compressing with PCA

    from sklearn.decomposition import PCA
    import seaborn as sns
    sns.set_style('darkgrid')
    pca = PCA()
    X_pca = pca.fit_transform(X_train)

#### Plot the Explained Variance versus Number of Features

    plt.plot(range(1,65), 
    pca.explained_variance_ratio_.cumsum())

#### Determine the Number of Features to Capture 95% of the Dataset's Variance

    total_explained_variance = pca.explained_variance_ratio_.cumsum()
    n_over_95 = len(total_explained_variance[total_explained_variance >= .95])
    n_to_reach_95 = X.shape[1] - n_over_95 + 1
    print("Number features: {}\tTotal Variance Explained: {}".format(n_to_reach_95, total_explained_variance[n_to_reach_95-1]))
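
The same count can be obtained more directly by locating the first position where the cumulative sum reaches the threshold. This is a sketch on a made-up array of explained-variance ratios (`explained_variance_ratio` is hypothetical, standing in for `pca.explained_variance_ratio_`).

```python
import numpy as np

# hypothetical ratios, sorted in descending order as PCA returns them
explained_variance_ratio = np.array([0.50, 0.25, 0.12, 0.06, 0.04, 0.02, 0.01])
cumulative = explained_variance_ratio.cumsum()

# argmax on a boolean array returns the index of the first True value
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components_95, cumulative[n_components_95 - 1])
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (for example `PCA(n_components=0.95)`), in which case it should select the number of components needed to reach that share of variance on its own.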


#### Subset the Dataset

    pca = PCA(n_components=n_to_reach_95)
    X_pca_train = pca.fit_transform(X_train)
    pca.explained_variance_ratio_.cumsum()[-1]

#### Refit a Model on the Compressed Dataset

    X_pca_test = pca.transform(X_test)
    clf = svm.SVC()
    %timeit clf.fit(X_pca_train, y_train)
    train_pca_acc = clf.score(X_pca_train, y_train)
    test_pca_acc = clf.score(X_pca_test, y_test)
    print('Training Accuracy: {}\tTesting Accuracy: {}'.format(train_pca_acc, test_pca_acc))

#### Evaluate model and optimize

## Raw PCA using Numpy

    import numpy as np

    # Center the data (subtract each column's mean)
    data = data - data.mean()
    data.head()
    # Calculate the covariance matrix
    cov_mat = data.cov()
    cov_mat
    # Calculate the eigenvalues and eigenvectors
    eig_values, eig_vectors = np.linalg.eig(cov_mat)
    # Get the index values of the eigenvalues sorted in descending order
    e_indices = np.argsort(eig_values)[::-1]
    # Sort the eigenvectors (stored as columns) to determine the principal components
    eigenvectors_sorted = eig_vectors[:, e_indices]
    eigenvectors_sorted
    # Reproject the data onto the first n principal components
    data.dot(eigenvectors_sorted[:, :n])
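
As a sanity check on the procedure, here is a self-contained sketch on simulated data. It uses `np.linalg.eigh` (suited to symmetric matrices such as a covariance matrix) instead of `np.linalg.eig`, and the array `X_demo` is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: two strongly related features plus one independent feature
base = rng.normal(size=(300, 1))
X_demo = np.hstack([
    base + rng.normal(scale=0.1, size=(300, 1)),
    2 * base + rng.normal(scale=0.1, size=(300, 1)),
    rng.normal(size=(300, 1)),
])

X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]            # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

n_keep = 2
projected = X_centered @ eig_vecs[:, :n_keep]  # shape (300, 2)

# the variance along each principal component equals its eigenvalue
print(projected.var(axis=0, ddof=1), eig_vals[:n_keep])
```

The eigenvectors are the columns of the matrix, which is why the projection selects `[:, :n_keep]` rather than the first rows.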