# Notes on Topics 25 - 28

## Topic 25: Intro to Log. Regression
(Binary Classification)

### Intro to Supervised Learning
- ML
    - Supervised L. : Classification (Cls) and Regression (Reg)
        - labeled training data
            - labeling data is time-consuming and expensive, so crowdsourcing services such as Amazon Mechanical Turk (MTurk) are often used.
            - we need enough negative examples (quantity and variety) to allow the model to learn
        - objective/loss functions (performance)
        - Cls (categorical): SVM, Discriminant Analysis, Naive Bayes, KNN, Logistic Regression, Trees
            - Binary Cls
                - Logistic Regression
            - Multiclass Cls
        - Reg (continuous): SVM, SVR, SPR, Ensemble Methods, nn's, Linear/Lasso/Ridge/Polynomial
            - How much/many?
            - Label is a real-valued number
        - Decision Trees (continuous)
        - Random Forests (continuous)
    - Unsupervised L. : Clustering (Clu) and Association Analysis (AsAn) and Hidden Markov Model (HMM)
        - unlabeled training data
        - Clu (continuous): K-Means, K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, nn's, HMM, SVD, PCA
        - AsAn (categorical): Apriori, FP-Growth
        - HMM (categorical)

### Linear to Logistic regression

- Logistic Regression Assumptions:
    - Binary Log Reg requires the y to be binary with 1 representing the desired outcome.
    - Only meaningful variables should be included (Regularization)
    - little/no multicollinearity (_independent_ variables)
    - independent variables are linearly related to the log odds
    - requires quite large sample sizes

- `LogisticRegression()` parameters (passed to the constructor, not to `.fit()`):
    - C : float, default=1.0
        - Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
    - solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
        - Algorithm to use in the optimization problem
        - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.
        - For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.
        - 'newton-cg', 'lbfgs', 'sag' and 'saga' handle L2 or no penalty
        - 'liblinear' and 'saga' also handle L1 penalty
        - 'saga' also supports 'elasticnet' penalty
        - 'liblinear' does not support setting ``penalty='none'``
    - penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
        - Used to specify the norm used in the penalization. The 'newton-cg', 'sag' and 'lbfgs' solvers support only l2 penalties. 'elasticnet' is only supported by the 'saga' solver. If 'none' (not supported by the liblinear solver), no regularization is applied.
        
- the model output should be interpreted as the probability that the class label equals 1
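
To make the probability interpretation concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions, not part of the lab. It checks that the log odds are a linear function of the features and that the sigmoid of the log odds matches the class-1 column of `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_model = LogisticRegression(solver='liblinear')
demo_model.fit(X_demo, y_demo)

# Log odds are a linear function of the features: intercept + X @ coefficients
log_odds = demo_model.intercept_ + X_demo @ demo_model.coef_.ravel()

# The sigmoid maps log odds to P(y = 1)
prob_class_1 = 1 / (1 + np.exp(-log_odds))

print(np.allclose(log_odds, demo_model.decision_function(X_demo)))
print(np.allclose(prob_class_1, demo_model.predict_proba(X_demo)[:, 1]))
```

Both checks should print `True` for a binary problem; `predict` then simply thresholds this probability at 0.5.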

In [None]:
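import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Large C means weak regularization (C is the inverse of regularization strength)
logreg = LogisticRegression(C=1e5, solver='liblinear')
logreg.fit(X_train, y_train)
logreg.coef_  # fitted coefficients, one per feature

y_hat_test = logreg.predict(X_test)
y_hat_train = logreg.predict(X_train)

# For 0/1 labels, a residual of 0 is a correct prediction and 1 is an error
residuals = np.abs(y_train - y_hat_train)  # use y_test - y_hat_test for the test set
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))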

In [None]:

import numpy as np
import pandas as pd
import statsmodels.api as sm

X = pd.get_dummies(salaries[['Race', 'Sex', 'Age']], drop_first=True, dtype=float)
y = pd.get_dummies(salaries['Target'], drop_first=True, dtype=float)['>50K']
# Create intercept term required for sm.Logit
X = sm.add_constant(X)
logit_model = sm.Logit(y, X)
result = logit_model.fit()
result.summary()
np.exp(result.params)  # e ^ coefficients = odds ratios


### Confusion Matrices + Lab
- used to evaluate classification models
- predicted label on bottom (columns/x-axis)
- actual label on left (rows/y-axis)

In [None]:
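import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
# Rows are actual labels, columns are predicted labels;
# pass normalize='true', 'pred', or 'all' to get proportions instead of counts
confusion_matrix(y_true, y_pred)

# Visualize your confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
ConfusionMatrixDisplay.from_estimator(logreg, X_test, y_test,
                                      cmap=plt.cm.Blues)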
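
For a binary problem, the four cells can be unpacked by name. A small sketch using the same toy labels as the cell above; `tn`, `fp`, `fn`, and `tp` are just local variable names chosen here.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# With labels 0 and 1, the matrix is [[TN, FP], [FN, TP]]
# (rows = actual, columns = predicted), so ravel() yields that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 3 2 4
```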

### Evaluation Metrics + Evaluating Logistic Regression Models Lab
- quantify performance of classifiers
- precision and recall typically trade off: raising one (e.g. by moving the decision threshold) tends to lower the other
   
Precision: $ = \frac{\text{TP}}{\text{TP + FP}} $
- measures how precise the predictions are
- how many of the selected things (Total predicted P) are relevant (TP)
- "Out of all the times the model said someone had a disease, how many times did the patient in question actually have the disease?"
- more conservative models can have a high precision score, but this doesn't necessarily mean that they are the best performing model

Recall/Sensitivity: $ = \frac{\text{TP}}{\text{TP + FN}} $
- indicates what percentage of the classes we're interested in were actually captured by the model 
- how many of the relevant things (Actual Positives) are selected (TP)
- Ex. "Out of the patients that actually had the disease, what percentage did our model correctly identify as positive?"
- How good the model is at identifying positives

Accuracy: $ = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}} $
- measures percentage of predictions a model gets right
- "Out of all the predictions our model made, what percentage were correct?"

F1 score: $ = 2\ \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}} $
- the Harmonic Mean of Precision and Recall, which means that the F1 score cannot be high without both precision and recall also being high. "all around measure"
- penalizes a model heavily if it skews too hard towards either precision or recall
- often used as a single summary metric for describing the performance of a model

Specificity: $ = \frac{\text{TN}}{\text{TN + FP}} $
- what percent of total negatives were correctly identified
- How good the model is at identifying negatives
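
The formulas above can be checked by hand against scikit-learn. This sketch reuses the toy labels from the confusion matrix example earlier in these notes, which give 4 TP, 2 TN, 3 FP, and 2 FN. Computing specificity as the recall of the negative class via `pos_label=0` is one convenient option, not the only one.

```python
from math import isclose
from sklearn.metrics import (precision_score, recall_score,
                             accuracy_score, f1_score)

y_true = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
tp, tn, fp, fn = 4, 2, 3, 2  # counts read off the confusion matrix

precision = tp / (tp + fp)                          # 4 / 7
recall = tp / (tp + fn)                             # 4 / 6
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 6 / 11
f1 = 2 * precision * recall / (precision + recall)  # 8 / 13
specificity = tn / (tn + fp)                        # 2 / 5

# Each hand-computed value should match the library result
print(isclose(precision, precision_score(y_true, y_pred)))
print(isclose(recall, recall_score(y_true, y_pred)))
print(isclose(accuracy, accuracy_score(y_true, y_pred)))
print(isclose(f1, f1_score(y_true, y_pred)))
print(isclose(specificity, recall_score(y_true, y_pred, pos_label=0)))
```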


In [None]:
from sklearn.metrics import (precision_score, recall_score,
                             accuracy_score, f1_score)
print('Training Precision: ', precision_score(y_train, y_hat_train))
print('Testing Precision: ', precision_score(y_test, y_hat_test))
print('Training Recall: ', recall_score(y_train, y_hat_train))
print('Testing Recall: ', recall_score(y_test, y_hat_test))
print('Training Accuracy: ', accuracy_score(y_train, y_hat_train))
print('Testing Accuracy: ', accuracy_score(y_test, y_hat_test))
print('Training F1-Score: ', f1_score(y_train, y_hat_train))
print('Testing F1-Score: ', f1_score(y_test, y_hat_test))

### ROC Curves and AUC + Lab

In [None]:
from sklearn.metrics import roc_curve, auc
# First calculate the confidence score of each datapoint (decision_function returns log odds, not probabilities):
y_score = logreg.fit(X_train, y_train).decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print('AUC: {}'.format(auc(fpr, tpr)))
# can plot ROC curve with code at the end of the lesson
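
A minimal plotting sketch for the ROC curve, assuming matplotlib is available. Synthetic data from `make_classification` stands in for the lab's train/test split, and the names `X_tr`, `X_te`, `demo_model`, and `roc_auc` are illustrative, so the AUC here says nothing about the lab's model.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the lab's X_train / X_test (illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_model = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
y_score = demo_model.decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, y_score)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots(figsize=(6, 5))
ax.plot(fpr, tpr, color='darkorange', lw=2,
        label='ROC curve (AUC = {:.3f})'.format(roc_auc))
# The diagonal is the expected curve of a classifier that guesses at random
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Chance')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic (ROC) Curve')
ax.legend(loc='lower right')
```

The further the curve bows toward the top-left corner, the closer the AUC is to 1.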

### Class Imbalance Problems

## Topic 26: MLE and Log. Regression

### MLE Review


### MLE and Logistic Regression


### Gradient Descent Review


## Topic 27: K Nearest Neighbors
- [ex. KNN code from Medium](https://link.medium.com/Mvaj4jTpodb)
- often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM)
- 1 ≤ k ≤ n (the number of training samples), but often k < 30
- A very low k will fail to generalize (overfitting). A very high k oversmooths the decision boundary (underfitting) and is costly
- lazy learner because it doesn’t learn a discriminative function from the training data but memorizes the training dataset instead
    - An eager learner has a model fitting or training step. A lazy learner does not have a training phase
- pros:
    - Quick to implement : Which is why it is popular as a benchmarking algorithm.
    - Less training time: Faster turn around time
    - Comparable accuracies: its prediction accuracy is reported to be fairly high for many applications.
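
The "lazy learner" idea can be sketched in a few lines of NumPy. This is an illustrative from-scratch version using Euclidean distance and a majority vote, not how scikit-learn implements it; `knn_predict`, `X_toy`, and `y_toy` are hypothetical names made up for this example.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # "Training" is just storing the data; all the work happens at prediction time
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.5]]
y_toy = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_toy, y_toy, [2.0, 2.0], k=3))  # 0
print(knn_predict(X_toy, y_toy, [8.0, 9.0], k=3))  # 1
```

Every prediction scans the whole training set, which is why prediction cost grows with the size of the data even though there is no training step.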

### Distance Metrics
- Euclidean (L2) Distance: $ d_{L2}(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $, Most common distance metric
- Chebyshev (L∞) Distance: $ d_{L\infty}(x,y) = \max_{i}|x_i - y_i| $, Largest (absolute) difference along any single coordinate.
- Manhattan (L1) Distance: $ d_{L1}(x,y) = \sum_{i=1}^{n}|x_i - y_i | $, Sum of the (absolute) differences of their coordinates.
- Minkowski (Lc) Distance: $ d_{Lc}(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^c\right)^\frac{1}{c}$
    - np.power( np.sum( np.abs( np.array(a) - np.array(b) ) ** c ), 1/c )
---
- "distance quantifies similarity"
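
The distance formulas above translate directly to NumPy. A small sketch, where `minkowski_distance` and `chebyshev_distance` are helper names made up for this example: Manhattan and Euclidean fall out of Minkowski with c=1 and c=2.

```python
import numpy as np

def minkowski_distance(a, b, c=2):
    """Minkowski distance of order c between two points."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return np.power(np.sum(diff ** c), 1 / c)

def chebyshev_distance(a, b):
    """Largest absolute coordinate difference between two points."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return np.max(diff)

a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, c=1))  # Manhattan: 7.0
print(minkowski_distance(a, b, c=2))  # Euclidean: 5.0
print(chebyshev_distance(a, b))       # 4.0
```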

### K-Nearest Neighbors + Lab

- normalize X data before fitting 

In [None]:
# transform/normalize data first (see the StandardScaler code under Extra Notes)
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()  # default n_neighbors=5
clf.fit(scaled_data_train, y_train)
test_preds = clf.predict(scaled_data_test)
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))
print_metrics(y_test, test_preds)
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
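    # with the default min_k=1, a step of 2 keeps k odd, which avoids tied votes in binary classification
    for k in range(min_k, max_k+1, 2):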
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))

## Topic 28: Bayes Classification

## Extra Notes

- Ex. of scaling with StandardScaler
    - fit the scaler on the training data only (fit_transform); the test data is only transformed, using the training mean and std

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data_train = scaler.fit_transform(X_train) # np array
scaled_data_test = scaler.transform(X_test) # np array
scaled_df_train = pd.DataFrame(scaled_data_train, columns=one_hot_df.columns)
scaled_df_train.head()
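
An alternative to scaling by hand is to wrap the scaler and the classifier in a scikit-learn pipeline, which applies the same "fit on train only" rule automatically. This is a sketch of that different approach, not the method used in the lab; the synthetic data and the names `X_tr`, `X_te`, and `knn_pipe` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# fit() fits the scaler on the training data only, then fits KNN on the scaled result
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)

# score() transforms the test data with the training mean and std before predicting
test_accuracy = knn_pipe.score(X_te, y_te)
print(test_accuracy)
```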


### Binary Encoding

In [None]:
# Convert Sex to binary encoding
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df.head()