#### From Lecture - Naive Bayesian Classification and Support Vector Machines (SVM)

Bayes Theorem
* conditional probability review
* P(A|B) = P(A) * P(B|A) / P(B)
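
A quick numeric check of the formula above. The numbers are made up for illustration (20% of messages are spam; the word "free" shows up in 50% of spam and 5% of non-spam), and `bayes_posterior` is a hypothetical helper name, not something from the lecture.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * likelihood_b_given_a / evidence_b

# Assumed (made-up) probabilities
p_spam = 0.2
p_free_given_spam = 0.5
p_free_given_ham = 0.05

# P(B) via the law of total probability: sum over spam and non-spam
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = bayes_posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 3))
```

Even though only 20% of messages are spam, seeing "free" pushes the probability of spam well above one half under these assumed numbers.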

Naive Bayes
* one of the best spam filtering methods
* think of learning as a problem of 'statistical inference'
* based on bayes theorem
* assumes that predictors are independent
    * presence of one feature unrelated to any other feature
    * this is only an assumption - some features might actually be related
* useful for large data sets
* outperforms even some highly sophisticated classification methods (e.g. text data)
* used first and foremost as a classifier, but can also be used for regression

Naive Bayes Classifier
* CountVectorizer - convert text data into feature vectors, counts the occurrence of each word
* gives a "bag of words"
* side note (pipeline) - used for organizing the data processing steps - chains transformers and models together
* call 'MultinomialNB()' and fit it on your data
* can find accuracy with .score
* can then use to predict new messages
* Naive Bayes is a probabilistic classifier
* in NLP - given a message and its words - calculate the probability of spam given the message, and the same for non-spam
* whichever class (spam vs non-spam) has the higher probability becomes the label
* by hand:
    * denominator (P(B)) often ignored - it is the same for every class, so it doesn't change which class wins, and it is difficult to calculate
    * essentially calculating marginal probability 
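    * multiply the conditional probabilities of each feature given the target, then multiply by the marginal probability of the target
    * in some cases one of the probabilities = 0, so the whole product equals 0 - smoothing is needed to avoid this; the scores also need to be normalized to convert them to probabilities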
* when done through sklearn:
    * can calculate soft predictions
    * Laplace smoothing - adds a constant (alpha) to every count, so a count of 0 becomes 1, and 1 becomes 2
        * alpha=1 by default (a hyperparameter)
        * high alpha - can lead to underfitting: the added counts dilute the information in the training data
        * low alpha - can lead to overfitting: the probabilities follow the training counts almost exactly (with alpha=1 it does literally add 1 to each feature count per class)
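
A minimal sketch of the sklearn workflow described above (CountVectorizer, pipeline, MultinomialNB, `.score`, hard and soft predictions). The messages and labels are made up for illustration; they are not the lecture's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up messages (an assumption, not the lecture data)
messages = [
    "win free money now",
    "free prize claim now",
    "urgent free cash offer",
    "lunch meeting at noon",
    "see you at the meeting",
    "project notes for lunch",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Pipeline: bag-of-words counts, then Naive Bayes with Laplace smoothing (alpha=1)
pipe = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
pipe.fit(messages, labels)

# Accuracy on the training messages (a real check would use held-out data)
train_accuracy = pipe.score(messages, labels)

new_messages = ["free money offer", "notes from the meeting"]
hard_predictions = pipe.predict(new_messages)        # class labels
soft_predictions = pipe.predict_proba(new_messages)  # one probability per class

print(train_accuracy)
print(hard_predictions)
print(pipe.classes_, soft_predictions.round(3))
```

The columns of `predict_proba` follow the order in `pipe.classes_`. Changing `alpha` in `MultinomialNB(alpha=...)` is how the smoothing strength would be tuned.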

Gaussian Naive Bayes 
* calculates probability based on continuous variables, just different formula 
* usually very accurate, fast for learning corresponding parameters
* scales great - a matter of counting how many times each attribute co-occurs with each class
* can be used for multiclass classification
* Drawbacks
    * oversimplifies - though could be useful
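
A rough sketch of the "different formula" for continuous features: each feature gets a normal density per class, with the mean and variance estimated from the training data. This is a by-hand approximation for illustration (variable names such as `log_joint` are hypothetical); sklearn's `GaussianNB` additionally adds a tiny smoothing term to the variances, so the two are compared only on their predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# "Learning" is just estimating a prior, mean and variance per class and feature
priors = np.array([(y == c).mean() for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_joint(x):
    """log P(class) + sum over features of log N(x_j | mean, variance), per class."""
    log_pdf = -0.5 * np.log(2 * np.pi * variances) - (x - means) ** 2 / (2 * variances)
    return np.log(priors) + log_pdf.sum(axis=1)

# Pick the class with the largest (log) score for every observation
by_hand = np.array([classes[np.argmax(log_joint(x))] for x in X])
sklearn_pred = GaussianNB().fit(X, y).predict(X)

agreement = (by_hand == sklearn_pred).mean()
print(agreement)
```

Working with logs and sums instead of multiplying raw probabilities is a common design choice to avoid numerical underflow; it does not change which class wins.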

#### Support Vector Machines
* can be applied instead of logistic regression, doesn't need boundary between classes to be linear
* similarity based algo, more like weighted K-NNs (supervised cousin of K-means)
* boundaries are non-linear - more curves
* each training example either is or is not a support vector, decided during 'fit'
* decision boundary only depends on the support vectors
* model learns weights and biases - similar to logistic reg
* separates the classes with the hyperplane that has the largest margin
* selects the important support vectors - the ones that are closest together but in separate classes
* maximizes the distance from the support vectors to the hyperplane
* hyperparameters - what we choose (tunes/configures models) - can be optimized with gridsearch
* parameters - what the model learns from the features (weights, intercepts)
* hyperparameters of SVC affect the fundamental trade-off:
    * gamma
        * controls complexity
        * larger = more complex, leading to overfitting
        * smaller = less complexity = can lead to underfitting 
    * C
        * larger C = more complex (overfitting)
        * smaller = less complex (underfitting)
        * changes which SVs are chosen to develop the hyperplane
* kernel - can be chosen to not be linear 
* can do kernel trick - transforms the dimensions, does a linear boundary, and retransforms back to old dimensions 
* better in terms of prediction but more complex - logistic may be better if done with 
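
A small sketch of how `gamma` and `C` might be explored with an RBF-kernel SVC and grid search. The data are synthetic (`make_moons`), and the grid values are arbitrary assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data with a curved boundary (not the lecture data)
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effect of gamma alone: compare train vs test accuracy
gamma_scores = {}
for gamma in [0.01, 1, 1000]:
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1.0))
    model.fit(X_train, y_train)
    gamma_scores[gamma] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(gamma, gamma_scores[gamma])

# Search gamma and C together with 5-fold cross-validation
param_grid = {"svc__gamma": [0.01, 0.1, 1, 10, 100], "svc__C": [0.1, 1, 10, 100]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")), param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

A large gap between train and test accuracy at high `gamma` is the overfitting pattern described above. The `svc__` prefix comes from the step name that `make_pipeline` assigns to `SVC`.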

#### From Compass

Bayes Theorem of Probability
* basis for naive bayes
* P(A|B) = P(A and B) / P(B)
* think about numbers in A and B and create a venn diagram

Naive Bayes
* probabilistic machine learning algorithm used for classification tasks
* based on bayes theorem
* naively assumes that the variables are not correlated with each other
* each feature is independent and equal (same weight)
* these assumptions are rarely correct, but the method still works well in practice
* computes conditional probability of *each* variable- multiplied with one another to calculate P(0|A) and P(1|A) note that
* compare probabilities P(0|A) vs P(1|A) - whichever is greater is the prediction
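
The by-hand calculation can be sketched in plain Python. This assumes word-count (multinomial-style) features and made-up training messages; `score` is a hypothetical helper, and `alpha` is the Laplace smoothing constant.

```python
from collections import Counter

# Made-up training messages: label 1 = spam, 0 = not spam
train = [
    ("free money now", 1),
    ("free prize now", 1),
    ("meeting at noon", 0),
    ("lunch at noon", 0),
]
vocab = {word for text, _ in train for word in text.split()}

# Count words per class and messages per class
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def score(message, label, alpha=1.0):
    """Unnormalized P(label) * product of P(word | label) over the message."""
    total_words = sum(word_counts[label].values())
    result = class_counts[label] / sum(class_counts.values())  # marginal P(label)
    for word in message.split():
        if word in vocab:  # words never seen in training are skipped
            result *= (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
    return result

message = "free lunch now"
unsmoothed = {label: score(message, label, alpha=0) for label in (0, 1)}
smoothed = {label: score(message, label, alpha=1) for label in (0, 1)}

prediction = max(smoothed, key=smoothed.get)        # compare P(0|A) vs P(1|A)
p_spam = smoothed[1] / (smoothed[0] + smoothed[1])  # normalize to a probability
print(unsmoothed, smoothed, prediction, round(p_spam, 3))
```

Without smoothing, both classes score 0 here because each class has at least one word it never saw, so no comparison is possible. With `alpha=1` the scores become comparable, and dividing by their sum recovers a probability without ever computing the denominator P(A) directly.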

Gaussian Naive Bayes Classifier
* when dealing with normally distributed continuous data
* make assumptions regarding the distribution of values for each feature (normal)
* select different 'kernels'/'distributions' based on your data
* *multinomial* - for count data
* *bernoulli/binomial* naive bayes - multivariate bernoulli event model, features are independent binary variables, good for determining whether a word is in a document or not (see class example)
* very popular for spam detection

#### Implementation

```python
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
```

Output:

```
Gaussian Naive Bayes model accuracy(in %): 95.0
```


#### SVM - Support Vector Machine 
* used for both classification and regression purposes
* more commonly used in classification problems
* each data point is plotted in n-dimensional space (n = number of features)
* performs classification by finding the optimal hyperplane that best separates the two classes

From Statquest
* finds the hyperplane that separates the two closest values from different classes where distance ('margin') of each value is farthest from the hyperplane
* when threshold is halfway - margin is largest it can be (threshold is not really a point if more than one dimension)
* in 2-dimensional data - the threshold or support vector classifier is a line
* in 3-dimensional data - it's a plane
* in 4+ dimensions - the classifier is a hyperplane (all are technically hyperplanes)
* 'maximal margin classifier' - uses the threshold that gives the largest margin between the classes
* the maximal margin classifier falls short when there are outlier observations that are close to one class but belong to another
* to address this - *allow misclassifications*
* allows for bias/variance tradeoff
* distance between the observations and the threshold when misclassifications are allowed = 'soft margin'
* cross-validation to get the best soft margin - how many misclassifications/observations to allow inside the margin to get the best classification
* i.e. we are using a 'soft margin classifier' AKA support vector classifier
* support vectors: the observations on the edge of and within the soft margin (picture a number line)
* observations inside the soft margin can be misclassified - use cross-validation to make sure that allowing misclassification is better in the long run
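
One way to see the soft margin in sklearn is through `C` on a linear `SVC`: smaller `C` tolerates more observations inside the margin. The sketch below uses synthetic overlapping blobs and arbitrary `C` values (both assumptions), and reports the number of support vectors next to the cross-validated accuracy.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two overlapping synthetic classes, so some misclassification is unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

results = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Support vectors: observations on the edge of, or inside, the soft margin
    n_support_vectors = len(clf.support_vectors_)
    # 5-fold cross-validation to judge how well this soft margin generalizes
    cv_accuracy = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5).mean()
    results[C] = (n_support_vectors, cv_accuracy)
    print(f"C={C}: {n_support_vectors} support vectors, CV accuracy {cv_accuracy:.3f}")
```

Picking the `C` with the best cross-validated accuracy corresponds to choosing the soft margin by cross-validation, as described above.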

Sandwiched data i.e. lots of overlap
* can't be handled by maximal margin classifiers or support vector classifiers
* BUT can be handled by support vector MACHINES 
* start with data in a low dimension (e.g. 1 dimension) - then move the data to a higher dimension (e.g. by squaring the values, giving them a y-axis) - then find the support vector classifier that separates the higher-dimensional data
* how to decide how to transform the data - SVMs use "kernel functions" to systematically find support vector classifiers in higher dimensions (i.e. should the data be squared, cubed, etc.)
* using kernel functions:
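* polynomial kernel: with degree 1 the data stay in 1 dimension (classifier is a point), with degree 2 they move to 2 dimensions (classifier is a line) - computes the relationship between each pair of observations to find the support vector classifier
* systematically increases the degree d to find a support vector classifier - a good value for d can be found with cross-validation
* common kernel: radial kernel, which works in infinite dimensions (behaves like weighted nearest neighbours)
* 'kernel trick' - the kernel doesn't actually transform the data - it only computes the relationships between pairs of observations as if they were in the higher dimensions, which is all that's needed to find the SVC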
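
A sketch of the sandwiched 1-D case. The "dosage"-style values are made up: class 1 sits in the middle and class 0 on both sides. It compares a linear classifier on the raw values, a linear classifier after manually adding the squared value, and a degree-2 polynomial kernel that never builds the squared column explicitly. `C=100` and `coef0=1` are arbitrary assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data: class 1 is sandwiched between two groups of class 0
x = np.array([1, 2, 3, 4, 10, 11, 12, 13, 14, 15, 21, 22, 23, 24], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

X_1d = x.reshape(-1, 1)            # original single feature
X_2d = np.column_stack([x, x**2])  # manual move to 2 dimensions: (x, x squared)

# A single threshold on the number line cannot separate the classes
acc_1d = SVC(kernel="linear", C=100).fit(X_1d, y).score(X_1d, y)

# After squaring, a straight line in (x, x^2) space separates them
acc_2d = SVC(kernel="linear", C=100).fit(X_2d, y).score(X_2d, y)

# Polynomial kernel of degree 2 on the ORIGINAL 1-D data: the higher-dimensional
# relationships are computed by the kernel, the data are never transformed
acc_poly = SVC(kernel="poly", degree=2, coef0=1, C=100).fit(X_1d, y).score(X_1d, y)

print(acc_1d, acc_2d, acc_poly)
```

These are training accuracies on a toy set, so they only show separability, not generalization. In practice the degree (or the choice of a radial kernel) would be picked with cross-validation.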
