# Machine Learning in Python

# Supervised ML Methods

## Regression
- Linear Regression (feature engineering, e.g. polynomial terms)
- Subset Selection (best subset, forward/backward stepwise)
- Shrinkage Methods
 - Ridge Regression
 - Lasso
 - SCAD

## Classification
- Logistic Regression
- LDA / QDA

## Both
- K-Nearest Neighbours
- Decision Trees
 - Regression Trees
 - Classification Trees
 - Bagging & Bootstrap
 - Random Forest
 - Boosting
- Neural Networks

# Unsupervised ML Methods
- K-Means Clustering
- Hierarchical Clustering
- PCA

# Dimensionality Reduction 
- PCR
- PLS
- SIR
- SIS

In [1]:
#loading general libraries
import numpy as np
import pandas as pd

In [16]:
# toy data: 10 samples with a single feature (1-D array); the target equals the feature
X, y = np.arange(10), np.arange(10)

In [17]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True, stratify=None)

# Supervised Learning
# 1. Regression

## 1.1 Linear Regression

In [4]:
from sklearn.linear_model import LinearRegression
#sklearn LinearRegression expects a 2d array as input

In [22]:
# Using X_train[:, None] to convert a 1d array to a 2d array
lin_reg = LinearRegression().fit(X_train[:,None], y_train)
y_test_pred = lin_reg.predict(X_test[:,None])

### 1.1.1 Feature Engineering
#### Polynomial Features: $x^n$

In [None]:
def create_polynomial_features(x_array, p):
    """Return a 2-D array whose columns are x, x^2, ..., x^p."""
    covariates = []
    for k in np.arange(1, p + 1):
        covariates.append(x_array**k)
    X_poly = np.column_stack(covariates)
    return X_poly
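
As a cross-check on the helper above, scikit-learn's `PolynomialFeatures` builds the same columns when `include_bias=False`. The following is a minimal sketch; the toy array `x_demo` and the quadratic target `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: y is an exact quadratic in x
x_demo = np.arange(10, dtype=float)
y_demo = 1.0 + 2.0 * x_demo + 3.0 * x_demo**2

# include_bias=False drops the constant column, leaving the columns x and x^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly_demo = poly.fit_transform(x_demo[:, None])

# A linear model on the expanded features is a polynomial regression in x
poly_reg = LinearRegression().fit(X_poly_demo, y_demo)
print(poly_reg.intercept_, poly_reg.coef_)
```

Because the target is noise-free, the fitted intercept and coefficients should recover the generating values up to floating-point error.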

## 1.2 Subset Selection (Use R)

## 1.3 Shrinkage Methods
**Reminder**: Standardize predictors to have mean 0 and standard deviation 1 before using any shrinkage method
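
One way to follow the reminder above is to put `StandardScaler` in a pipeline in front of the shrinkage model, so the scaling parameters are learned from the training data only. This is a sketch under assumed data; `X_raw`, `y_raw` and the chosen scales are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical predictors on very different scales
X_raw = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])
y_raw = X_raw @ np.array([1.0, 0.02, 50.0]) + rng.normal(size=100)

# The scaler is fit inside the pipeline, on whatever data is passed to fit()
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_raw, y_raw)

# Check that the scaled predictors have mean 0 and standard deviation 1
X_scaled = model.named_steps["standardscaler"].transform(X_raw)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```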

### 1.3.1 Ridge Regression
Adds an $\ell_2$ penalty on the coefficients to the residual sum of squares, and minimizes:
$$
\frac{RSS}{2} + \alpha\sum^{p}_{j=1}\beta^2_j
$$
- As $\alpha$ increases, variance decreases but bias increases
- Performs better when the response is a function of **many predictors**, all with coefficients of roughly equal size

In [None]:
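from sklearn.linear_model import Ridge

# X_train is 1-D here, so add a feature axis with [:, None]
ridge = Ridge(alpha=1.0)
ridge.fit(X_train[:, None], y_train)
ridge.score(X_test[:, None], y_test)  # R^2 on the test set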

### 1.3.2 Lasso
Adds an $\ell_1$ penalty on the coefficients to the residual sum of squares, and minimizes:
$$
\frac{RSS}{2} + \alpha\sum^{p}_{j=1}|\beta_j|
$$
- Performs **variable selection**
- Produces simpler and more interpretable models that involve only a subset of the predictors
- Performs better when a **small number** of predictors have substantial coefficients, and the rest have coefficients that are very small or equal zero

In [None]:
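from sklearn.linear_model import Lasso

# X_train is 1-D here, so add a feature axis with [:, None]
lasso = Lasso(alpha=0.1)
lasso.fit(X_train[:, None], y_train)
lasso.score(X_test[:, None], y_test)  # R^2 on the test set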
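
The contrast between the two penalties can be seen on simulated data where only a few predictors matter. This is a sketch, not a benchmark: the data, the true coefficients `true_beta`, and the `alpha` values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10
X_sim = StandardScaler().fit_transform(rng.normal(size=(n, p)))

# Hypothetical sparse truth: only the first two predictors have an effect
true_beta = np.zeros(p)
true_beta[:2] = [3.0, -2.0]
y_sim = X_sim @ true_beta + rng.normal(scale=0.5, size=n)

ridge_sim = Ridge(alpha=1.0).fit(X_sim, y_sim)
lasso_sim = Lasso(alpha=0.1).fit(X_sim, y_sim)

# Ridge shrinks coefficients but keeps them non-zero; lasso sets some exactly to zero
n_zero_ridge = int(np.sum(ridge_sim.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_sim.coef_ == 0))
print("exact zeros (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```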

### 1.3.3 SCAD
- Small coefficients are set to zero, moderate coefficients are shrunk towards zero, and large coefficients are left unchanged
- Produces a sparse solution and approximately unbiased estimates for large coefficients
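
The three regimes described above can be sketched with the SCAD thresholding rule, which is the closed-form SCAD solution for a single coefficient under an orthonormal design. As far as I know scikit-learn does not ship a SCAD estimator, so this uses NumPy only; the function name `scad_threshold` is hypothetical and `a=3.7` is the conventional default.

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Apply the SCAD thresholding rule to unpenalized estimates z.

    |z| <= 2*lam        : soft-thresholding (small values become exactly 0)
    2*lam < |z| <= a*lam: shrinkage that fades out as |z| grows
    |z| > a*lam         : no shrinkage at all
    """
    z = np.asarray(z, dtype=float)
    abs_z = np.abs(z)
    soft = np.sign(z) * np.maximum(abs_z - lam, 0.0)
    middle = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    return np.where(abs_z <= 2 * lam, soft,
                    np.where(abs_z <= a * lam, middle, z))

# One small, two moderate and one large estimate
z_grid = np.array([0.5, 1.5, 3.0, 10.0])
shrunk = scad_threshold(z_grid, lam=1.0)
print(shrunk)
```

With `lam=1.0`, the first value is set to zero, the two middle values are pulled towards zero by different amounts, and the last value is returned unchanged.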

## 1.4 K-Nearest Neighbours (Regression)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor(n_neighbors=2)
# X is 1-D here, so add a feature axis with [:, None]
neigh.fit(X[:, None], y)

# 2. Classification

## 2.1 Logistic Regression
Can be used for either binary or multinomial (multi-class) classification

**(In R)** Select the best logistic model (with a reduced number of variables) using stepwise selection

In [None]:
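from sklearn.feature_selection import RFE  # recursive feature elimination; not used in this cell
from sklearn.linear_model import LogisticRegression

# LogisticRegression must be instantiated before calling fit;
# X_train is 1-D here, so add a feature axis with [:, None]
clf = LogisticRegression().fit(X_train[:, None], y_train)
clf.predict(X_test[:, None])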
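
On the Python side, a rough analogue of reducing the number of variables is recursive feature elimination with `RFE`. It is not the same procedure as stepwise selection in R (it ranks predictors by coefficient size rather than by an information criterion), so treat this as a sketch; the simulated data and the choice of three features are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 8 predictors, of which 3 are informative
X_cls, y_cls = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Repeatedly fit the model and drop the weakest predictor until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_cls, y_cls)

print(selector.support_)   # boolean mask of the kept predictors
print(selector.ranking_)   # 1 = kept; larger values were eliminated earlier
```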

## 2.2 LDA/QDA
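- LDA assumes the predictors are normally distributed within each class with a common covariance matrix; when this roughly holds, or when the classes are well separated, it is more stable than logistic regression
- QDA is a compromise between the non-parametric KNN and the linear LDA and logistic regression approaches: it can accurately model a wider range of problems than the linear methods, but is less flexible than KNN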

In [None]:
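# Assumes X_train is a 2-D feature matrix and y_train holds class labels
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis().fit(X_train, y_train)

# The old sklearn.qda module was removed; QDA lives in sklearn.discriminant_analysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
clf = QuadraticDiscriminantAnalysis().fit(X_train, y_train)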

## 2.3 K-Nearest Neighbours (Classifier)
- Makes no assumptions about the shape of the decision boundary
- Performs poorly in high dimensions (curse of dimensionality) and does not indicate which predictors are important

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
# X is 1-D here, so add a feature axis with [:, None]
neigh.fit(X[:, None], y)

# 3. Decision Trees
See also LightGBM: https://lightgbm.readthedocs.io/en/latest/index.html

## 3.1 Classifier

In [None]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=4)
# X is 1-D here, so add a feature axis; the target variable is y (lowercase)
clf = clf.fit(X[:, None], y)

# Plot the tree
tree.plot_tree(clf)

## 3.2 Regression

In [None]:
from sklearn import tree

reg = tree.DecisionTreeRegressor()
# X is 1-D here, so add a feature axis with [:, None]
reg = reg.fit(X[:, None], y)

## 3.3 Hyperparameter Tuning

In [None]:
# Plot test accuracy across tree depths for the gini and entropy criteria.
# Assumes X_train / X_test are 2-D feature matrices and y_train / y_test hold class labels.
# In practice, tune on a validation set or with cross-validation rather than the test set.
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

max_depth = []
acc_gini = []
acc_entropy = []
for i in range(2, 50):
    # Same seed and the same max_depth for both criteria, so the curves are comparable
    dtree = DecisionTreeClassifier(random_state=8, criterion='gini', max_depth=i)
    dtree.fit(X_train, y_train)
    pred = dtree.predict(X_test)
    acc_gini.append(accuracy_score(y_test, pred))

    dtree = DecisionTreeClassifier(random_state=8, criterion='entropy', max_depth=i)
    dtree.fit(X_train, y_train)
    pred = dtree.predict(X_test)
    acc_entropy.append(accuracy_score(y_test, pred))

    max_depth.append(i)
d = pd.DataFrame({'acc_gini': acc_gini,
                  'acc_entropy': acc_entropy,
                  'max_depth': max_depth})
# Visualize how accuracy changes with depth
plt.plot('max_depth', 'acc_gini', data=d, label='gini')
plt.plot('max_depth', 'acc_entropy', data=d, label='entropy')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
# Report the depth with the highest accuracy for each criterion
print("Best max_depth for gini:", d.loc[d['acc_gini'].idxmax(), 'max_depth'])
print("Best max_depth for entropy:", d.loc[d['acc_entropy'].idxmax(), 'max_depth'])

## 3.4 Bagging & Bootstrap
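- Bagging fits many prediction models on bootstrapped training sets and averages their predictions (or takes a majority vote for classification)
- **Reduces variance**, which matters because individual decision trees have high variance
- The bootstrap creates many training sets from a single dataset by sampling observations with replacement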


- The resulting model can be difficult to interpret, so summarize the importance of each predictor by the total decrease in RSS (regression) or Gini index (classification) from splits on that predictor $\Rightarrow$ a large value indicates an important predictor

### 3.4.1 Out-of-Bag Error Estimation
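- The OOB error is a valid estimate of the test error of a bagged model, because each observation is predicted using only the trees whose bootstrap sample did not contain it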
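
As an illustration of the point above, scikit-learn's `BaggingClassifier` reports the out-of-bag accuracy when `oob_score=True`. This is a sketch on simulated data; the dataset and `n_estimators=200` are assumptions, and how close the two error figures are will vary from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification data
X_bag, y_bag = make_classification(n_samples=500, n_features=10, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_bag, y_bag, random_state=0)

# Each tree is fit on a bootstrap sample; oob_score=True scores each observation
# using only the trees that did not see it during training
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        oob_score=True, random_state=0)
bag.fit(Xb_train, yb_train)

oob_error = 1 - bag.oob_score_
test_error = 1 - bag.score(Xb_test, yb_test)
print("OOB error:", oob_error, "test error:", test_error)
```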

## 3.5 Random Forests
- **Decorrelates** the trees by only considering a random sample of *m* predictors at each split instead of all the predictors
- *m* is typically set to about $\frac{p}{3}$ (regression) or $\sqrt{p}$ (classification)


- Logic of Random Forests: if there is one very strong predictor, most bagged trees will use it for the top split, so the trees look similar and their predictions are highly correlated; averaging highly correlated predictions reduces variance much less than averaging uncorrelated ones
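
A short sketch of these ideas with scikit-learn's `RandomForestClassifier`, using `max_features="sqrt"` for the per-split sample of predictors and `feature_importances_` for the Gini-based importance summary. The simulated data is an assumption; `shuffle=False` is used only so that the informative predictors are known to be the first three columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 9 predictors, the first 3 informative (shuffle=False keeps them first)
X_rf, y_rf = make_classification(
    n_samples=500, n_features=9, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0
)

# max_features="sqrt" considers sqrt(9) = 3 randomly chosen predictors at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_rf, y_rf)

# Mean decrease in Gini impurity per predictor, normalized to sum to 1
importances = rf.feature_importances_
print(importances.round(3))
```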

## 3.6 Boosting
- Trees are grown **sequentially** using information from previously grown trees
- Computationally expensive: because each tree depends on the previous ones, the trees cannot be grown in parallel
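
The sequential nature of boosting can be sketched with scikit-learn's `GradientBoostingRegressor`, whose `staged_predict` method yields the ensemble's predictions after each added tree. The sine-shaped data and the hyperparameters below are assumptions for illustration; LightGBM, mentioned earlier, offers a similar interface but is not used here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical non-linear regression problem
X_gb = rng.uniform(-3, 3, size=(300, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(scale=0.1, size=300)

# Shallow trees are added one at a time, each fit to the current residuals
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=2, random_state=0)
gbr.fit(X_gb, y_gb)

# Training error after 1, 2, ..., 100 trees
train_mse = [mean_squared_error(y_gb, pred) for pred in gbr.staged_predict(X_gb)]
print(train_mse[0], train_mse[-1])
```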

# 4. Neural Networks
- Can model non-linearity very well
- Difficult to interpret the resulting model

**Libraries**
- TensorFlow
 - Keras
- PyTorch

## 4.1 Regression and Classification NN
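
A minimal classification sketch, using scikit-learn's `MLPClassifier` as a lightweight stand-in for the deep learning libraries listed above. The two-moons data, the layer sizes and `max_iter` are assumptions; a regression network would use `MLPRegressor` in the same way.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical non-linearly separable data
X_nn, y_nn = make_moons(n_samples=500, noise=0.2, random_state=0)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_nn, y_nn, random_state=0)

# Two hidden layers of 16 units each; inputs are standardized first
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0),
)
mlp.fit(Xn_train, yn_train)

nn_accuracy = mlp.score(Xn_test, yn_test)
print("test accuracy:", nn_accuracy)
```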

## 4.2 Convolutional NN
- Works well for images

## 4.3 LSTM and GRU
- Designed for sequential data such as time series, e.g. financial and stock data

# Unsupervised Learning

## 1. K-Means Clustering
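
A minimal sketch with scikit-learn's `KMeans`, assuming simulated blob data with three clusters; in practice the number of clusters is not known in advance and has to be chosen.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated groups in two dimensions
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# n_init=10 reruns the algorithm from 10 random starts and keeps the best result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)

km_labels = kmeans.labels_          # cluster assignment of each observation
print(kmeans.cluster_centers_)      # one centroid per cluster
print(kmeans.inertia_)              # within-cluster sum of squares
```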

## 2. Hierarchical Clustering
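
A minimal sketch with scikit-learn's `AgglomerativeClustering`, assuming simulated blob data and Ward linkage; other linkages (`"complete"`, `"average"`, `"single"`) are also available.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical data: three groups in two dimensions
X_hc, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Bottom-up merging of observations until 3 clusters remain
hc = AgglomerativeClustering(n_clusters=3, linkage="ward")
hc_labels = hc.fit_predict(X_hc)

print(hc_labels[:10])
```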

## 3. Principal Component Analysis (PCA)
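
A minimal sketch with scikit-learn's `PCA`, assuming simulated data in which five observed variables are noisy mixtures of two latent factors. The variables are standardized first so that no variable dominates because of its scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 5 observed variables driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
X_pca = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(X_pca))

# Share of total variance captured by each component
print(pca.explained_variance_ratio_)
```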

# Dimensionality Reduction Methods

## 1. Principal Component Regression (PCR)
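
PCR can be sketched as a pipeline: standardize, project onto the first few principal components, then run a linear regression on the component scores. The data and `n_components=3` are assumptions; because PCA ignores the response, the retained components are not guaranteed to be the ones most related to `y`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 6 predictors, the response depends on the first two
X_pcr = rng.normal(size=(150, 6))
y_pcr = X_pcr[:, 0] - 2.0 * X_pcr[:, 1] + rng.normal(scale=0.3, size=150)

# Standardize -> keep 3 principal components -> least squares on the scores
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X_pcr, y_pcr)

r2_pcr = pcr.score(X_pcr, y_pcr)
print("training R^2:", r2_pcr)
```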

## 2. Partial Least Squares (PLS)

## 3. Sure Independence Screening (SIS)

## 4. Sliced Inverse Regression (SIR)