# Notes on Topics 25 - 28

## Topic 25: Intro to Log. Regression
(Binary Classification)

The sigmoid function
  
$S(x) = \dfrac{1}{1+e^{-x}}$   
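
As a quick check of the formula, here is a minimal sketch in plain Python. The function name `sigmoid` is only an illustrative choice, and the version below is meant for moderate inputs (a very negative `x` would overflow `math.exp`).

```python
import math

def sigmoid(x):
    """Logistic sigmoid S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))                 # 0.5
print(sigmoid(2) + sigmoid(-2))   # approximately 1.0, since S(-x) = 1 - S(x)
```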

###  Intro to Supervised Learning
- ML
    - Supervised L. : Classification (Cls) and Regression (Reg)
        - labeled training data
            - labeling data is time-consuming and expensive, so crowdsourcing services such as Amazon Mechanical Turk (MTurk) are often used.
            - we need enough negative examples (quantity and variety) to allow the model to learn
        - objective/loss functions (performance)
        - Cls (categorical): SVM, Discriminant Analysis, Naive Bayes, KNN, Logistic Regression, Trees
            - Binary Cls
                - Logistic Regression
            - Multiclass Cls
        - Reg (continuous): SVM, SVR, SPR, Ensemble Methods, nn's, Linear/Lasso/Ridge/Polynomial
            - How much/many?
            - Label is a real-valued number
        - Decision Trees (continuous)
        - Random Forests (continuous)
    - Unsupervised L. : Clustering (Clu) and Association Analysis (AsAn) and Hidden Markov Model (HMM)
        - unlabeled training data
        - Clu (continuous): K-Means, K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, nn's, HMM, SVD, PCA
        - AsAn (categorical): Apriori, FP-Growth
        - HMM (categorical)

### Linear to Logistic regression

- Logistic Regression Assumptions:
    - Binary Log Reg requires the y to be binary with 1 representing the desired outcome.
    - Only meaningful variables should be included (Regularization)
    - little/no multicollinearity (_independent_ variables)
    - independent variables are linearly related to the log odds
    - requires quite large sample sizes

- LogisticRegression() parameters (set in the constructor, not in .fit()):
    - C : float, default=1.0
        - Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
    - solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
        -  Algorithm to use in the optimization problem
        - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.
        - For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.
        - 'newton-cg', 'lbfgs', 'sag' and 'saga' handle L2 or no penalty
        - 'liblinear' and 'saga' also handle L1 penalty
        - 'saga' also supports 'elasticnet' penalty
        - 'liblinear' does not support setting ``penalty='none'``
    - penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
        - Used to specify the norm used in the penalization. The 'newton-cg', 'sag' and 'lbfgs' solvers support only l2 penalties. 'elasticnet' is only supported by the 'saga' solver. If 'none' (not supported by the liblinear solver), no regularization is applied.
        
- the model's output should be interpreted as the probability that the class label equals 1
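
To make the log-odds assumption and the probability interpretation concrete, here is a small sketch with made-up coefficients (`b0` and `b1` are hypothetical values, not fitted to any data). It illustrates that raising x by one unit multiplies the odds by e^b1, which is why exponentiated coefficients are usually read as odds ratios.

```python
import math

b0, b1 = -1.0, 0.5   # hypothetical intercept and slope, for illustration only

def prob(x):
    """P(y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    """Odds p / (1 - p), which equals exp(b0 + b1 * x)."""
    p = prob(x)
    return p / (1.0 - p)

odds_ratio = odds(3.0) / odds(2.0)
print(prob(2.0))                  # 0.5, because b0 + b1 * 2 = 0
print(odds_ratio, math.exp(b1))   # the two values agree up to rounding
```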

In [None]:
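import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5, solver='liblinear')
logreg.fit(X_train, y_train)

logreg.coef_  # fitted coefficients, one per feature
y_hat_test = logreg.predict(X_test)
y_hat_train = logreg.predict(X_train)
# 0 = correct prediction, 1 = misclassified (use y_test and y_hat_test for the test set)
residuals = np.abs(y_train - y_hat_train)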
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
X = pd.get_dummies(salaries[['Race', 'Sex', 'Age']], drop_first=True, dtype=float)
y = pd.get_dummies(salaries['Target'], drop_first=True, dtype=float)['>50K']
# Create intercept term required for sm.Logit
X = sm.add_constant(X)
logit_model = sm.Logit(y, X)
result = logit_model.fit()
result.summary()
np.exp(result.params) # e ^ coefficients 


### Confusion Matrices + Lab
- used to evaluate classification models
- predicted label on bottom (columns/x-axis)
- actual label on left (rows/y-axis)
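
The layout described above can be sketched by hand for binary 0/1 labels. The helper name `confusion_counts` is hypothetical. The rows-are-actual, columns-are-predicted layout is the same one scikit-learn's `confusion_matrix` uses.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
cm = confusion_counts(y_true, y_pred)
print(cm)   # [[2, 3], [2, 4]]
```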

In [None]:
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
y_pred  = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
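confusion_matrix(y_true, y_pred)  # pass normalize='true', 'pred' or 'all' for proportions
# Visualize your confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(logreg, X_test, y_test,
                                      cmap=plt.cm.Blues)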

### Evaluation Metrics + Evaluating Logistic Regression Models Lab
- quantify performance of classifiers
- precision and recall usually trade off: pushing one up tends to pull the other down
   
Precision: $ = \frac{\text{TP}}{\text{TP + FP}} $
- measures how precise the predictions are
- how many of the selected things (Total predicted P) are relevant (TP)
- "Out of all the times the model said someone had a disease, how many times did the patient in question actually have the disease?"
- more conservative models can have a high precision score, but this doesn't necessarily mean that they are the best performing model

Recall/Sensitivity: $ = \frac{\text{TP}}{\text{TP + FN}} $
- indicates what percentage of the classes we're interested in were actually captured by the model 
- how many of the relevant things (Actual Positives) are selected (TP)
- Ex. "Out of the patients that actually had the disease, what percentage did our model correctly identify as positive?"
- How good the model is at identifying positives

Accuracy: $ = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}} $
- measures percentage of predictions a model gets right
- "Out of all the predictions our model made, what percentage were correct?"

F1 score: $ = 2\ \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}} $
- the Harmonic Mean of Precision and Recall, which means that the F1 score cannot be high without both precision and recall also being high. "all around measure"
- penalizes a model heavily if it skews too hard towards either precision or recall
- often the most used metric for describing the performance of a model

Specificity: $ = \frac{\text{TN}}{\text{TN + FP}} $
- what percent of total negatives were correctly identified
- How good the model is at predicting negatives
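
The five formulas above can be checked on a small set of counts. This is a sketch only: the counts are made up for illustration, the function name `classification_metrics` is hypothetical, and zero denominators are not handled.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'precision': precision,
        'recall': recall,
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'f1': 2 * precision * recall / (precision + recall),
        'specificity': tn / (tn + fp),
    }

m = classification_metrics(tp=4, fp=3, fn=2, tn=2)
for name, value in m.items():
    print(f'{name}: {value:.3f}')
# precision: 0.571, recall: 0.667, accuracy: 0.545, f1: 0.615, specificity: 0.400
```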


In [None]:
from sklearn.metrics import (precision_score, recall_score,
                             accuracy_score, f1_score)
print('Training Precision: ', precision_score(y_train, y_hat_train))
print('Testing Precision: ', precision_score(y_test, y_hat_test))
print('Training Recall: ', recall_score(y_train, y_hat_train))
print('Testing Recall: ', recall_score(y_test, y_hat_test))
print('Training Accuracy: ', accuracy_score(y_train, y_hat_train))
print('Testing Accuracy: ', accuracy_score(y_test, y_hat_test))
print('Training F1-Score: ', f1_score(y_train, y_hat_train))
print('Testing F1-Score: ', f1_score(y_test, y_hat_test))

### ROC Curves and AUC + Lab

[Precision-Recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve)

[Visualizing/Plotting curves](https://scikit-learn.org/stable/visualizations.html#visualizations)

In [None]:
from sklearn.metrics import roc_curve, auc
# First calculate the decision scores (signed distances from the decision boundary, not probabilities) of each of the datapoints:
y_score = logreg.fit(X_train, y_train).decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print('ROC AUC: {}'.format(auc(fpr, tpr)))
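
For intuition about what the AUC number means, here is a from-scratch sketch that does not use scikit-learn. It relies on the fact that ROC AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count as half). The function name `auc_by_pairs` and the toy scores are assumptions for illustration.

```python
def auc_by_pairs(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

toy_auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)   # 0.75
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for small illustrations.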



### Class Imbalance Problems

- class_weight parameter in LogisticRegression(): None, 'balanced' or dict ({class_label: weight})
-  typically the heavier we weight the positive case, the better our classifier appears to be performing
- Another technique is resampling: oversampling the minority class or undersampling the majority class. SMOTE (Synthetic Minority Oversampling Technique) oversamples by generating synthetic minority-class points instead of duplicating existing ones
    - Undersampling is only practical when you have a truly massive dataset and can afford to lose data points; even then, you are discarding potentially useful data. Naive oversampling can overfit to the characteristics of particular data points, because it adds exact replicas of them
    - still keep a test set drawn from the original (not resampled) dataset in order to judge the algorithm's performance accurately
    - curse of dimensionality
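
As a sketch of what `class_weight='balanced'` amounts to, the snippet below computes weights as `n_samples / (n_classes * count)`, which is the formula scikit-learn documents for that option. The helper name `balanced_class_weights` and the 90/10 toy labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y_toy = [0] * 90 + [1] * 10    # 90% majority class, 10% minority class
weights_toy = balanced_class_weights(y_toy)
print(weights_toy)             # class 0 gets about 0.556, class 1 gets 5.0
```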

Let's look at the level of class imbalance in the dataset: 

In [None]:
print('Raw counts: \n')
print(df['target var'].value_counts())
print('-----------------------------------')
print('Normalized counts: \n')
print(df['target var'].value_counts(normalize=True))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
# Now let's compare a few different class weights on the dataset:
weights = [None, 'balanced', {1:2, 0:1}, {1:10, 0:1}, {1:100, 0:1}, {1:1000, 0:1}]
names = ['None', 'Balanced', '2 to 1', '10 to 1', '100 to 1', '1000 to 1']
colors = sns.color_palette('Set2')

plt.figure(figsize=(10,8))

for n, weight in enumerate(weights):
    # Fit a model with class_weights
    logreg = LogisticRegression(fit_intercept=False, C=1e20, class_weight=weight, solver='lbfgs')
    model_log = logreg.fit(X_train, y_train)
    print(model_log)  # Preview model params
          
#     # Fit a model with SMOTE
#     smote = SMOTE(sampling_strategy=ratio)
#     X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
#     logreg = LogisticRegression(fit_intercept=False, C=1e20, solver ='lbfgs')
#     model_log = logreg.fit(X_train_resampled, y_train_resampled)
#     print(model_log)
          
#     # Fit a model with the C
#     logreg = LogisticRegression(fit_intercept=False, C=c, solver='liblinear')
#     model_log = logreg.fit(X_train, y_train)
#     print(model_log) # Preview model params

    # Predict
    y_hat_test = logreg.predict(X_test)
    y_score = logreg.fit(X_train, y_train).decision_function(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, y_score) 
    print('AUC for {}: {}'.format(names[n], auc(fpr, tpr)))
    print('-------------------------------------------------------------------------------------')
    lw = 2
    # see how to plot ROC curve with sklearn
    plt.plot(fpr, tpr, color=colors[n],
             lw=lw, label='ROC curve {}'.format(names[n]))

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

In [None]:
from imblearn.over_sampling import SMOTE
# Previous original class distribution
print('Original class distribution: \n')
print(y.value_counts())
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Preview synthetic sample class distribution
print('-----------------------------------------')
print('Synthetic sample class distribution: \n')
print(pd.Series(y_train_resampled).value_counts())

## Topic 26: MLE and Log. Regression

### MLE Review
- maximum likelihood estimation finds the underlying parameters of an assumed distribution to maximize the likelihood of the observations
- probability vs likelihood
    - probability = p(data | distribution parameters), area under curve
    - likelihood = L(distribution parameters | data), y-value on the curve
        - maximum likelihood finds the maximum y-value
- When calculating maximum likelihood, it is common to use the log-likelihood, as taking the logarithm can simplify calculations.
- MLE for binomial distribution:
    - $L(p) = L(y_1, y_2, ..., y_n | p) = p^y (1-p)^{n-y}$ where $ y = \sum_{i=1}^{n}y_i$
    - $\ln[L(p)] = \ln[p^y (1-p)^{n-y}] = y \ln(p)+(n-y)\ln(1-p)$
    - setting the derivative to zero, $\frac{y}{p} - \frac{n-y}{1-p} = 0$, gives $\hat{p} = \frac{y}{n}$
- [MLE for normal distribution](https://www.youtube.com/watch?v=Dn6b9fCIUpM)
- [MLE for exponential distribution](https://www.youtube.com/watch?v=p3T-_LMrvBc)
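
The binomial result can be checked numerically. The sketch below evaluates the log-likelihood on a grid of candidate values for p and picks the largest. The data (7 successes in 10 trials) and the grid resolution are made up for illustration.

```python
import math

def log_likelihood(p, y, n):
    """ln L(p) = y ln(p) + (n - y) ln(1 - p)."""
    return y * math.log(p) + (n - y) * math.log(1 - p)

y_successes, n_trials = 7, 10                    # hypothetical data
grid = [i / 100 for i in range(1, 100)]          # candidate values 0.01 ... 0.99
p_hat = max(grid, key=lambda p: log_likelihood(p, y_successes, n_trials))
print(p_hat)   # 0.7, matching y / n
```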

### MLE and Logistic Regression

- Mathematically, you can write each of these probabilities for each factor $X$ as:
    - $\pi_i = P(Y_i = 1|X_i = x_i)=\dfrac{\text{exp}(\beta_0 + \beta_1 x_i)}{1 + \text{exp}(\beta_0 + \beta_1 x_i)}$

- and maximize the likelihood function:
    - $ L(\beta_0,\beta_1)=\prod\limits_{i=1}^N \pi_i^{y_i}(1-\pi_i)^{1-y_i}=\prod\limits_{i=1}^N \dfrac{\exp\{y_i(\beta_0+\beta_1 x_i)\}}{1+\exp(\beta_0+\beta_1 x_i)}$
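
Taking the log of the right-hand product gives a sum of terms y_i z_i - ln(1 + e^{z_i}), with z_i = beta_0 + beta_1 x_i. The sketch below evaluates that sum for binary labels. The data and the two coefficient pairs are made up for illustration.

```python
import math

def logistic_log_likelihood(b0, b1, xs, ys):
    """ln L = sum_i [ y_i * z_i - ln(1 + exp(z_i)) ], with z_i = b0 + b1 * x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        total += y * z - math.log(1.0 + math.exp(z))
    return total

xs_toy = [0, 1, 2, 3, 4, 5]    # hypothetical feature values
ys_toy = [0, 0, 1, 0, 1, 1]    # hypothetical binary labels
ll_null = logistic_log_likelihood(0.0, 0.0, xs_toy, ys_toy)
ll_better = logistic_log_likelihood(-2.0, 1.0, xs_toy, ys_toy)
print(ll_null)     # 6 * ln(1/2), about -4.159
print(ll_better)   # less negative, so these coefficients fit the toy data better
```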


### Gradient Descent Review
- Recall that the general outline for gradient descent is:

    - Define initial parameters:
        - Pick a starting point
        - Pick a step size $\alpha$ (alpha)
        - Choose a maximum number of iterations; the algorithm will terminate after this many iterations if a minimum has yet to be found
        - (optionally) define a precision parameter; like the maximum number of iterations, this can terminate the algorithm early. For example, with a precision parameter of 0.00001, the algorithm terminates once the change in the loss function between iterations is less than 0.00001. The idea is that we are very close to the bottom and further iterations would make a negligible difference
    - Calculate the gradient at the current point (initially, the starting point)
    - Take a step (of size alpha) in the direction opposite to the gradient, i.e. downhill
    - Repeat the previous two steps until the maximum number of iterations is met, or the difference between two successive points is less than your precision parameter
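
The outline above can be sketched for a one-feature logistic regression, minimizing the mean negative log-likelihood (which is the same as maximizing the likelihood). Everything here is an assumption for illustration: the toy data, the defaults for `alpha`, `max_iter` and `precision`, and the function names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_nll(b0, b1, xs, ys):
    """Average negative log-likelihood: the loss being minimized."""
    return -sum(y * math.log(sigmoid(b0 + b1 * x)) +
                (1 - y) * math.log(1.0 - sigmoid(b0 + b1 * x))
                for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, alpha=0.1, max_iter=5000, precision=1e-8):
    b0, b1 = 0.0, 0.0                              # starting point
    loss = mean_nll(b0, b1, xs, ys)
    for _ in range(max_iter):
        # Gradient of the loss at the current point
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / len(xs)
        g1 = sum(e * x for e, x in zip(errors, xs)) / len(xs)
        # Step of size alpha against the gradient
        b0, b1 = b0 - alpha * g0, b1 - alpha * g1
        new_loss = mean_nll(b0, b1, xs, ys)
        converged = abs(loss - new_loss) < precision
        loss = new_loss
        if converged:                              # early stop on the precision parameter
            break
    return b0, b1, loss

xs_gd = [0, 1, 2, 3, 4, 5]     # hypothetical feature values
ys_gd = [0, 0, 1, 0, 1, 1]     # hypothetical binary labels (not perfectly separable)
b0_hat, b1_hat, final_loss = gradient_descent(xs_gd, ys_gd)
print(b0_hat, b1_hat, final_loss)
```

The toy labels are deliberately not perfectly separable. With separable data the coefficients would keep growing and only the iteration cap or the precision parameter would stop the loop.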

## Topic 27: K Nearest Neighbors
- [ex. KNN code from Medium](https://link.medium.com/Mvaj4jTpodb)
- often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM)
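- 1 ≤ k ≤ n (the number of training points), but often k < 30
- A very low k is sensitive to noise and will fail to generalize (overfitting). A very high k is costly and tends to underfit, since distant points also get a vote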
- lazy learner because it doesn’t learn a discriminative function from the training data but memorizes the training dataset instead
    - An eager learner has a model fitting or training step. A lazy learner does not have a training phase
- pros:
    - Quick to implement : Which is why it is popular as a benchmarking algorithm.
    - Less training time: Faster turn around time
    - Comparable accuracies: Its prediction accuracy as indicated in a lot of research papers is fairly high for a lot of applications.
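
The lazy-learner idea can be seen in a from-scratch sketch: nothing is fitted, and all the work happens at prediction time. The function name `knn_predict` and the toy points are assumptions for illustration, and `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the stored training data by Euclidean distance to the query
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]   # toy data
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(points, labels, (2, 2)))       # a
print(knn_predict(points, labels, (6.5, 6.5)))   # b
```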

### Distance Metrics
- Euclidean (L2) Distance: $ d_{L2}(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $, Most common distance metric
- Chebyshev (L∞) Distance: $ d_{L\infty}(x,y) = \max_{i}|x_i - y_i| $, Largest (absolute) difference along any single coordinate.
- Manhattan (L1) Distance: $ d_{L1}(x,y) = \sum_{i=1}^{n}|x_i - y_i | $, Sum of the (absolute) differences of their coordinates.
- Minkowski (Lc) Distance: $ d_{Lc}(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^c\right)^\frac{1}{c}$
    - np.power(np.sum(np.abs(np.array(a) - np.array(b)) ** c), 1/c)
---
- "distance quantifies similarity"
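
A minimal sketch of the four metrics in plain Python, using a 3-4-5 triangle so the results are easy to check by hand. The function names are illustrative choices.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, c):
    return sum(abs(x - y) ** c for x, y in zip(a, b)) ** (1 / c)

a, b = (0, 0), (3, 4)
print(manhattan(a, b))      # 7
print(euclidean(a, b))      # 5.0
print(chebyshev(a, b))      # 4
print(minkowski(a, b, 1))   # 7.0, same as Manhattan
print(minkowski(a, b, 2))   # 5.0, same as Euclidean
```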

### K-Nearest Neighbors + Lab

- normalize X data before fitting 

In [None]:
# transform/normalize data (code below)
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
test_preds = clf.predict(X_test)
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))
print_metrics(y_test, test_preds)
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    for k in range(min_k, max_k+1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)

## Topic 28: Bayes Classification
[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVe)

### Classifiers with Bayes

[Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)

- Bayes' Theorem: $P(Y|X_1, X_2, ..., X_n) = \dfrac{P(Y) \cdot P(X_1, X_2, ..., X_n|Y)}{P(X_1, X_2, ..., X_n)}$
- Expanding to multiple features, the multinomial Bayes' formula is:
$ P(y|x_1, x_2, ..., x_n) = \dfrac{P(y) \cdot P(x_1|y) \cdot P(x_2|y) \cdot ... \cdot P(x_n|y)}{P(x_1, x_2, ..., x_n)} = \dfrac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, ..., x_n)}$
- assume that the features are independent of one another (naive assumption), estimate an overall probability by multiplying the conditional probabilities for each of the independent features, and make a classification prediction by seeing which probability is higher
- since the denominator, $P(X_1, X_2, ..., X_n)$, is the same for both classes ($Y_0$ and $Y_1$), you can compare the numerators, as these will be proportional to the overall probability.
- Decision Rule for Bernoulli naive Bayes: $ P(x_i|y) = P(i | y)x_i + (1 - P(i | y))(1-x_i)$
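- In categorical naive Bayes, the probability of category *t* in feature *i* given class *c* is estimated as: $ P(x_i = t|y = c) = \dfrac{N_{tic} + \alpha}{N_c + \alpha n_i}$, where $N_{tic}$ is the number of class-*c* samples whose feature *i* has category *t*, $N_c$ is the number of samples in class *c*, $\alpha$ is a smoothing parameter and $n_i$ is the number of categories of feature *i*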
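
The smoothed estimate in the last bullet can be sketched directly. Here `n_tic` is read as the number of class-c samples whose feature i takes category t, `n_c` as the number of samples in class c, and `n_categories` as the number of categories of feature i. The counts below are made up for illustration.

```python
def smoothed_category_prob(n_tic, n_c, n_categories, alpha=1.0):
    """(N_tic + alpha) / (N_c + alpha * n_i)."""
    return (n_tic + alpha) / (n_c + alpha * n_categories)

# Hypothetical counts: class c has 10 samples, feature i has 4 possible categories
category_counts = [3, 0, 5, 2]
probs = [smoothed_category_prob(count, 10, 4) for count in category_counts]
print(probs[0])     # 4 / 14, about 0.286
print(probs[1])     # 1 / 14: an unseen category still gets a nonzero probability
print(sum(probs))   # approximately 1.0
```

With `alpha=0` an unseen category would get probability zero and wipe out the whole product, which is the situation the smoothing term avoids.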

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.naive_bayes import CategoricalNB

clf = BernoulliNB()
y_pred = clf.fit(X_train, y_train).predict(X_test)
clf.score(X_test, y_test) # Mean accuracy of self.predict(X) wrt. y

### Gaussian Naive Bayes + Lab

- $ P(x_i|y) = \dfrac{1}{\sqrt{2\pi \sigma_i^2}}e^{\frac{-(x_i-\mu_i)^2}{2\sigma_i^2}}$

- $P(y|x_1,x_2,...x_n) = \dfrac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1,x_2,...x_n)}$
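
The two formulas combine into a small classifier: multiply the class prior by one Gaussian density per feature, then pick the class with the largest numerator. This is a sketch only; the priors, means and standard deviations in `classes` are hypothetical values rather than estimates from any dataset.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical per-class summary: prior, and (mean, std) for each of two features
classes = {
    0: {'prior': 0.5, 'stats': [(0.0, 1.0), (5.0, 2.0)]},
    1: {'prior': 0.5, 'stats': [(4.0, 1.0), (1.0, 2.0)]},
}

def predict(x):
    """Return the class with the largest P(y) * prod_i P(x_i | y)."""
    scores = {}
    for label, info in classes.items():
        score = info['prior']
        for value, (mu, sigma) in zip(x, info['stats']):
            score *= gaussian_pdf(value, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict([0.5, 4.0]))   # 0
print(predict([3.5, 1.5]))   # 1
```

With many features the product of densities can underflow, so summing log-densities instead is a common variation.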

In [None]:
df['Target'].value_counts() # examine the target variable
aggs = df.groupby('Target').agg(['mean', 'std'])
aggs
# Calculate conditional probability point estimates
from scipy import stats
def p_x_given_class(obs_row, feature, class_):
    mu = aggs[feature]['mean'][class_] # mean of feature
    std = aggs[feature]['std'][class_] # std of feature
    
    # A single observation: obs_row is one row of the data (e.g. X_train.iloc[idx])
    obs = obs_row[feature]
    
    p_x_given_y = stats.norm.pdf(obs, loc=mu, scale=std)
    return p_x_given_y
# Gaussian naive Bayes with more than two classes
# Calculating class probabilities for observations
def predict_class(row):
    c_probs = []
    for c in range(3):  # assumes the classes are labeled 0, 1, 2
        # Initialize probability to relative probability of class
        p = len(df[df['Target'] == c])/len(df) # p(y)
        # Update the probability using the point estimate for each feature
        for feature in X.columns:   # product of all p(xi|y)
            p *= p_x_given_class(row, feature, c)
        c_probs.append(p)
    return np.argmax(c_probs) # class with the largest p(y) * product of p(xi|y)
# Calculating accuracy
y_hat_train = [predict_class(X_train.iloc[idx]) for idx in range(len(X_train))]
y_hat_test = [predict_class(X_test.iloc[idx]) for idx in range(len(X_test))]
residuals_train = y_hat_train == y_train
acc_train = residuals_train.sum()/len(residuals_train)
residuals_test = y_hat_test == y_test
acc_test = residuals_test.sum()/len(residuals_test)
print('Training Accuracy: {}\tTesting Accuracy: {}'.format(acc_train, acc_test))

See the lab for classification over a range of values of a feature

### Document Classification with Naive Bayes + Lab

## Extra Notes

- Ex. of scaling with StandardScaler
    - fit the scaler on the training data only (fit_transform); the test data is only transformed, using the training statistics
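
The fit-on-train-only rule can be sketched without any library. The helper names and toy values below are assumptions for illustration. The population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from the training data only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0]    # toy training feature
test = [2.0, 4.0]          # toy test feature

mean, std = fit_scaler(train)                # statistics come from the training set only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)     # reuse the training mean and std
print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the test data as well would leak information about the test set into the preprocessing step.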

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data_train = scaler.fit_transform(X_train) # np array
scaled_data_test = scaler.transform(X_test) # np array
scaled_df_train = pd.DataFrame(scaled_data_train, columns=one_hot_df.columns)
scaled_df_train.head()


### Binary Encoding

In [None]:
# Convert Sex to binary encoding
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df.head()