#### Important Topics and Concepts
------------------------------
**General Concepts**
* ANOVA/t-test
    * hypothesis tests that compare means between groups in a data set (t-test for two groups, ANOVA for three or more)
    * ANOVA: variance between groups much greater than variance within groups = significant difference
* p-value
    * used in hypothesis testing
    * probability of getting a test statistic at least as extreme as the one observed, assuming the null is true
    * null is true means there's no difference between groups
* Conditional Probability
    * probability of A given B

* Central Limit Theorem
    * distribution of sample means approaches normality as the sample size grows
    * many statistical tests rely on an assumption of normality
    * allows us to make inferences from sample means, even when the underlying data is not normally distributed
* SQL
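    * WHERE vs HAVING
        * WHERE filters rows before grouping, HAVING filters groups after aggregation
        * written order: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
        * logical evaluation order: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY
    * JOINS (Left (left circle including overlap), Inner (overlap), Full Outer (everything), can filter on NULLs to exclude the overlap)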
    * e.g. list of employees .. 
* Normality - Normal Distribution
    * data is symmetrical around a mean 
    * has a bell-curve shape, with values on the x and frequency or counts on the y
    * data around the mean are more common than data further from the mean
    * in a perfectly normal distribution - mean, median and mode are the same
* Law of Large Numbers
    * the greater the number of trials/observations - the closer the sample average gets to the population mean (the expected value)

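A small simulation can make the Central Limit Theorem and the Law of Large Numbers concrete. This is only a sketch: the exponential population, the sample sizes and the number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# A clearly non-normal population: exponential with mean 1.0
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Draw many sample means at two different sample sizes
means_small = [sample_mean(5) for _ in range(2000)]
means_large = [sample_mean(100) for _ in range(2000)]

center = statistics.mean(means_large)
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)

print(f"mean of sample means (n=100): {center:.3f}")
print(f"spread at n=5: {spread_small:.3f}, at n=100: {spread_large:.3f}")
```

The sample means cluster around the population mean, and the cluster tightens as the sample size grows, even though the population itself is skewed.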

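The difference between WHERE and HAVING is easiest to see in a query that uses both. The sketch below uses Python's built-in `sqlite3` module; the `employees` table, its columns and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "eng", 120),
        ("Ben", "eng", 90),
        ("Cal", "ops", 70),
        ("Dee", "ops", 60),
        ("Eli", "sales", 95),
    ],
)

# WHERE drops individual rows first (salary >= 65 removes Dee),
# GROUP BY aggregates what is left,
# HAVING then drops whole groups (only departments averaging over 80 remain).
rows = conn.execute(
    """
    SELECT dept, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary >= 65
    GROUP BY dept
    HAVING AVG(salary) > 80
    ORDER BY dept
    """
).fetchall()
print(rows)
conn.close()
```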
#### Feature Engineering 
Outliers
* remove outliers - based on domain expertise or intuition, since outliers skew the data
* visualize single/multiple variables - boxplots/scatterplots
* compute z-scores and remove points more than 3 SD from the mean
    
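A minimal sketch of the z-score rule, using made-up measurements with one obvious outlier. Note that the rule needs a reasonable amount of data: in very small samples (roughly ten points or fewer) no single point can reach a z-score of 3.

```python
import statistics

# Hypothetical measurements with one obvious outlier (100)
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10] * 2 + [100]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation

# Keep only points within 3 standard deviations of the mean
kept = [v for v in values if abs((v - mean) / sd) <= 3]
removed = [v for v in values if abs((v - mean) / sd) > 3]
print(removed)
```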
Missing
* ensure values are actually missing and not just encoded incorrectly
* drop the rows/columns if the results are not significantly affected

Data Prep
* Log transforms (for skewed data, compresses extreme values)
* Binning (reduces the number of distinct values)
* Scaling (0-1) 
* Dummy variables
* Calculating new variables from existing ones

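As a rough illustration of two of the steps above, the sketch below applies a log transform and 0-1 (min-max) scaling to a made-up skewed feature. The helper name `min_max` is hypothetical.

```python
import math

# Hypothetical right-skewed feature (e.g. incomes in thousands)
raw = [20, 25, 30, 40, 1000]

# Log transform: log1p(x) = log(1 + x), which is also safe when x is 0
logged = [math.log1p(x) for x in raw]

# Min-max scaling to the 0-1 range
def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

scaled_raw = min_max(raw)
scaled_log = min_max(logged)
print([round(v, 3) for v in scaled_raw])
print([round(v, 3) for v in scaled_log])
```

Without the log transform, the single extreme value squeezes every other point close to 0 after scaling; with it, the smaller values stay distinguishable.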
#### Linear Algebra
* multiplying matrices
    * the number of columns of A must equal the number of rows of B
    * each entry of the result is a dot product (multiply a row of A elementwise with a column of B and sum)
* determinants only for square matrices
* rank is number of linearly independent columns/rows 
* eigenvalues and vectors
    * come from decomposing a square matrix
    * multiplying the matrix by one of its eigenvectors only changes the vector's scale
    * eigenvector - stays on the line that it spans, doesn't get knocked off that line
    * eigenvalue - factor by which that eigenvector is scaled, 1 meaning it remains the same
* SVD - Singular Value Decomposition
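    * works for any matrix (not only square) - factors the matrix into its parts (U, S, V^T)
    * a form of 'matrix factorization'
    * A^T A is square, so you can compute its eigenvalues and vectors - the singular values are the square roots of those eigenvalues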

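The shape rule for matrix multiplication and the "only changes in scale" property of eigenvectors can be checked by hand. This is a plain-Python sketch with a hypothetical `matmul` helper; real work would normally use a numerical library.

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
        for row in A
    ]

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]             # 3 x 2
print(matmul(A, B))      # 2 x 2 result

# Eigenvector check: M times v should equal eigenvalue times v
M = [[2, 0],
     [0, 3]]
v = [[0], [1]]           # column vector along the y axis
Mv = matmul(M, v)
print(Mv)                # v scaled by 3, same direction
```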
### Machine Learning (General)
* AI > ML > Deep Learning
* a computer program that accepts a set of features (inputs) and predicts or classifies a target variable 
* supervised (labels are given)
    * regression (linear), classification (logistic)
* unsupervised (no labels)
    * finds structure in the data
    * clustering, generative networks, dimension reduction
* semi supervised
* reinforcement learning

**bias vs variance - approximation error vs estimation error**
* bias - difference between the model's average prediction and the correct value
    * high bias = underfitted, oversimplified model
    * high error between prediction and correct
* variance - variability of model prediction for a given data point, across different training sets
    * high variance = overfit, pays too much attention to the training data and doesn't generalize
    * performs well on training but high error on test 
    * model captures too much noise
* low variance/high bias - model is very precise but not accurate, underfitted
* high variance/low bias - model is accurate - but not precise - overfitted 
* models should have low variance, low bias 
* want to fit enough of variance (but not too much that it overfits) - with low bias
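* Approximation error 
    * how far the best possible model of the chosen type is from the true underlying relationship
    * can this type of model represent the underlying relationship at all
    * related to bias 
* Estimation error
    * how far the fitted model is from the best possible model of its type, because it was trained on a limited, noisy sample
    * related to variance 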

Validation methods 
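* split data into train/test sets
* protects against overfitting - where the model memorizes the data rather than learning the underlying relationships
* cross validation - split the data into k folds (slices of the cake), each fold takes a turn as the test set while the model trains on the rest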

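One way to picture k-fold cross validation is as index bookkeeping. The generator below is a hypothetical helper, not a library API; it does not shuffle, which you would usually want to do first on real data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```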
#### Dimensionality Reduction
* useful for visualization, lower complexity, improve model performance and reduce over-fitting 
* Dimensionality Reduction
    * PCA
    * LDA
* Variable Selection
    * Filter
        * low/high variance
        * feature similarity - correlation b/w features
        * correlations with target variable
    * Wrapper methods
        * train model with different combos of features
        * find the best subset - trying different combos
        * forward selection - starting with no features
        * backward elimination - start with all and remove one
        * stepwise selection - combines both, until optimal set of features

#### Principal Component Analysis 
* uses linear algebra to find new components (linear combinations of the original features) that represent the data
* used to reduce the dimensions or number of features in your data set 
* done after some transformation (e.g. scaling) and, optionally, exclusion of features that are highly correlated with each other
* Scree Plot shows how much of the variance in your dataset each component explains - used to figure out the optimal number of components
* can be used to visualize clusters, or as a preprocessing step before clustering
* PC1 goes through the most variation, which is the eigenvector of the covariance matrix with the highest eigenvalue (most variance)
* PC2 goes through the second most variation - passes through the mean (origin) and is perpendicular to PC1

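For two features, the eigen-decomposition behind PCA can be written out by hand. The sketch below uses made-up points and the closed-form eigenvalues of a symmetric 2x2 matrix; it assumes the two features have non-zero covariance, and a real analysis would use a linear algebra library instead.

```python
import math

# Hypothetical 2-feature data lying roughly along the line y = x
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
n = len(points)

# Center the data so the components pass through the mean
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Sample covariance matrix [[a, b], [b, c]]
a = sum(x * x for x, _ in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix
half_trace = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# Eigenvector for lam1 (this formula assumes b != 0), normalized to unit length
vx, vy = b, lam1 - a
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

explained = lam1 / (lam1 + lam2)
print("PC1 direction:", pc1)
print("share of variance on PC1:", round(explained, 4))
```

The share of variance per component is what a scree plot displays; here almost all of it falls on PC1, so one component would be enough.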
#### LDA (less covered)
* supervised - used when the target variable is categorical
* minimizes intra-class variance, and maximizes inter-class variance 
------------------------------------------------------------------------------------
##### Clustering - Unsupervised
* K-Means (not the same as k-Nearest Neighbours, which is a supervised classifier)
    * specify number of clusters (k)
    * starts with k centroids, which can be placed at random spots
    * assigns each point to its nearest centroid
    * finds the mean of each cluster - taken as new centroid
    * reassigns points to new clusters depending on how far those points are from the new centroids, and repeats until assignments stop changing
    * tries to minimize within-cluster SS distance from centroid (inertia) - the silhouette score is a separate metric for how well separated the clusters are
    * distortion (elbow) plot - plots SSerror (inertia) on y axis against number of clusters - inertia always gets lower the more clusters, so look for the elbow
    * SSerror - how big is the circle, are the datapoints far from the middle of the cluster 
    * based on 'euclidean' distance (hypotenuse) 
* Hierarchical Clustering
    * each point begins as a cluster > two closest are clustered > repeat until have one big cluster
    * dendrogram created to visualize the number of clusters possible
    * agglomerative (bottom up) 
    * divisive (top-down)
* DBSCAN
    * Density-Based Spatial Clustering of Applications with Noise
    * good for non-spherical/arbitrary shape clusters
    * no initial k-input - but based on radius or epsilon, and minimum number of neighbours for a point to count as a core point
    * a cluster stops growing when it reaches border points that don't meet the criteria for minimum number of neighbours
    * labels outliers as noise instead of forcing them into a cluster
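
The K-Means loop (assign points to the nearest centroid, move each centroid to its cluster mean, repeat) can be sketched in plain Python. The function name `kmeans`, the toy points and the fixed starting centroids are assumptions for illustration; real implementations usually pick starting centroids at random and run several restarts.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2D points, starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance)
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    # Inertia: sum of squared distances from each point to its centroid
    inertia = sum(math.dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)
    return centroids, clusters, inertia

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters, inertia = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids, round(inertia, 3))
```
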
------------------------------------------------------------------------------------
### **Supervised**
#### Linear Regression
* predicting a continuous (number) value
* split data into training/test
* bias error - high bias = underfitting - misses relevant relationships
* variance - error from sensitivity to small fluctuations 
    * high variance causes algorithm to model random noise = overfitting
* assumes linear relationship between features and target 
* assigns a weight/coefficient to each feature - how much they contribute to predicting the target
* uses Least Squares, meaning it minimizes the sum of squared distances (residuals) between each data point and the model/line 
* results in an R2 or Adjusted R2 - proportion of the variance in the dependent variable explained by independent variables in the linear regression model
* Polynomial Regression
    * when data is non-linear (e.g. quadratic)
    * essentially uses polynomial features - so the model is still linear in its coefficients

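For a single feature, the least squares line and R2 have a closed form. The sketch below uses made-up data and hypothetical helper names (`fit_line`, `r_squared`).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # roughly y = 2x + 1
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(round(slope, 3), round(intercept, 3), round(r2, 4))
```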
#### Logistic Regression
* similar to linear regression but for classification problems
* linear model (weights*X + intercept)
* learns weights associated with each feature and constant/bias 
* weights tune the hyperplane's tilt/orientation
* creates a 'decision boundary' splitting the feature space into two regions
* hard predictions are class labels (e.g. 0/1 or -1/+1), soft predictions are probabilities
* works via a sigmoid function - squashes the output between 0 and 1 
* really large negative values get mapped close to 0, really large positive values get mapped close to 1
* after squashing - can read off the predicted probability along the sigmoid
* weights are accessible, which helps with interpretation
* can be extended to multiclass - with a one versus rest strategy - fits one classifier per class against the rest, and combines the predictions at the end
* falls short with many features or non-linear relationships, and gets more complex with multi-class 

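The step from the linear score to a probability, and then to a hard label, can be sketched as follows. The weights are hand-picked for illustration (a real model would learn them), and `predict_proba` here is a hypothetical helper, not a library call.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Linear score (weights * X + intercept) passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + intercept
    return sigmoid(z)

# Hypothetical, hand-picked weights (a real model would learn these)
weights, intercept = [1.5, -2.0], 0.5

p = predict_proba([2.0, 0.5], weights, intercept)  # soft prediction
label = 1 if p >= 0.5 else 0                        # hard prediction
print(round(p, 4), label)
```

A score of 0 maps to a probability of exactly 0.5, which is why the decision boundary sits where the linear score is 0.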
#### Decision Trees
* essentially a flow chart based on features
* decides a way to split the data, based on a series of if/else conditions
* can overfit more than logistic regression
* max_depth - hyperparameter that's tuned - too high and the tree can keep splitting until each leaf holds a single data point
* advantage: easy to interpret and visualize, little data prep, some implementations can work with missing values
* splitting criterion = Gini impurity (for classification)
    * 0 to 0.5 for two classes
    * quantifies quality of split 
    * calculate impurity of each split, no mislabeled = 0
    * when splitting - if the split leads to a leaf with only one class it is pure, if it holds a mix of more than one class it is impure 
    * Gini impurity = 1 - (probability of one class)^2 - (probability of the other class)^2
    * if all in one class = 1 - 1^2 = 0, a Gini impurity of 0
    * maximize the Gini Gain - subtract the weighted impurities of the child nodes from the impurity of the parent 
    * for numerical features - sort the values, use the averages of adjacent pairs as candidate thresholds, calculate the Gini impurity for each, lowest impurity chosen as the split
* In regression trees - splitting criterion is MSE or MAE
    * minimized across each split
    * good for stepwise data - that can't be fit easily with linear regression
* top of the tree/root is the split with the lowest weighted impurity 

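The Gini calculations above are short enough to write out directly. The labels and the split below are made up, and `gini` and `gini_gain` are hypothetical helper names.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes", "no"], ["no", "no"]

print(gini(parent))                              # 50/50 mix of two classes
print(gini(right))                               # pure leaf
print(round(gini_gain(parent, left, right), 4))
```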
#### Random Forest
* addresses the overfitting in decision trees
* creates a bunch of decision trees - and injects some randomness 
* takes the most common prediction from all trees (or the average, for regression) - and uses that as a final prediction
* injects randomness through bootstrap samples - resampling the data with replacement - so that each point can be repeated multiple times and you get a different but similar dataset
* at each split - considers only a random subset of features - not all
* averaging over many trees - reduces the variance
* slower to train multiple trees - but can be done in parallel
* number of trees (one bootstrap sample each) can be tuned 

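The two ingredients of a random forest that are easiest to show in isolation are the bootstrap sample and the majority vote. This sketch does not train any trees; the per-tree votes are made up.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

data = list(range(10))  # stand-in for 10 training rows

# Bootstrap sample: same size as the data, drawn with replacement
bootstrap = random.choices(data, k=len(data))
out_of_bag = sorted(set(data) - set(bootstrap))  # rows this tree never saw
print(bootstrap, out_of_bag)

# Aggregation for classification: the most common vote across trees wins
tree_votes = ["spam", "ham", "spam", "spam", "ham"]  # hypothetical per-tree predictions
final = Counter(tree_votes).most_common(1)[0][0]
print(final)
```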
#### Ensembles
* Bagging
    * bootstrapping and aggregating
    * similar to random forest
    * addresses overfitting 
* Boosting
    * adds one model at a time, aggregation is done during training and not after
    * addresses underfitting 
    * each model's vote is 'weighted', e.g. by how well it performed
    * learns sequentially - the next model learns from the errors of the previous
    * done with XGBoost/AdaBoost
* Stacking 
    * uses the predictions of a variety of base models as inputs to a final (meta) model
    * similar to bagging but with different types of models 
* more of a black box
* harder to tell what each model is doing 
* when combining models - use one that overfits - and one that underfits

#### Boosting/XGBoost
* focused on reducing bias - better prediction
* Adaboost
    * weighted sum of weak learners
    * adds learners one by one 
    * better weak learners add more to the final model
* Gradient Boosting
    * calculates pseudoresiduals - difference between the actual values and the current model's predictions
    * each new tree is fit to the pseudoresiduals to reduce them
* XGBoost
    * an optimized gradient boosting implementation, designed to minimize training time
    * finds ways to split data and iterates to improve fit
    * calculates a similarity score for each node
        * for regression: (sum of residuals)^2 / (number of residuals + lambda) 
    * gets the gain from each split - sum of the similarity scores of the two leaves minus the similarity score of the node being split 
    * the split with the largest gain is used - and the process is repeated at each new leaf
    * default max depth = 6 levels
    * gamma is the pruning threshold (default 0) - if gain minus gamma is negative, the split is pruned 
    * lambda - regularization parameter
        * greater - lowers similarity scores and gain, makes it easier to prune trees
        * prevents overfitting
        * especially if only one observation at a leaf

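A worked example of the similarity score and gain, in the simplified form usually shown for regression with squared-error loss: similarity = (sum of residuals)^2 / (number of residuals + lambda). This is a sketch of the idea only; the library's actual objective includes further constants and terms, and the residuals and gamma value here are made up.

```python
def similarity(residuals, lam):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

def split_gain(left, right, lam):
    """Similarity of both children minus similarity of the parent node."""
    return similarity(left, lam) + similarity(right, lam) - similarity(left + right, lam)

# Hypothetical residuals (actual minus current prediction) in one node
left, right = [-10.5], [6.5, 7.5, -7.5]

gain_no_reg = split_gain(left, right, lam=0)
gain_reg = split_gain(left, right, lam=1)
print(round(gain_no_reg, 4), round(gain_reg, 4))

gamma = 100  # hypothetical pruning threshold
print("keep split" if gain_reg - gamma > 0 else "prune split")
```

Raising lambda shrinks every similarity score, and it shrinks the single-observation leaf the most, which is why it makes splits easier to prune.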
#### Naive Bayes
* based on Bayes' theorem and conditional probability
* used as a classifier
* based on statistical inference, good for spam detection
* assumes features are independent of each other given the class (the 'naive' part) - and that each contributes equally

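Bayes' theorem for a one-word spam filter, with made-up probabilities. A full Naive Bayes classifier multiplies many such per-word likelihoods together under the independence assumption; this sketch only shows the single-word case.

```python
# Hypothetical numbers for a one-word spam filter
p_spam = 0.2                 # prior: P(spam)
p_word_given_spam = 0.6      # P("free" | spam)
p_word_given_ham = 0.05      # P("free" | not spam)

# Total probability of seeing the word at all
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```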
#### Support Vector Machines (SVM)
* boundaries are linear by default, and can be non-linear with kernels
* decision boundary is based on support vectors
* separates classes with the hyperplane that has the largest margin
* support vectors are the points closest to the boundary from each class - the boundary is placed so that its distance to those points, called the margin, is as large as possible
* parameters - are weights and intercepts - learned from training
* hyperparameters
    * gamma (RBF kernel) - controls complexity - larger = more complex (overfitting)
    * C - larger = more complex (overfitting) 
        * changes which SV are chosen to develop hyperplane - smaller C allows for more misclassification 
* uses the kernel trick for non-linear data - the kernel computes similarities as if the data had been mapped into a higher-dimensional space where a linear boundary works, without explicitly transforming the data 

-------------------------------------------------------------------

#### Gradient Descent
* algorithm that optimizes the parameters of a model
* works by minimizing loss or error, in simple terms the difference between prediction and actual value
* works iteratively - repeatedly adjusts the parameters in the direction that reduces the loss 
* the size of each adjustment is controlled by a learning rate 
* e.g. finds the optimal weights for your linear regression
* on its own it doesn't prevent overfitting (see Regularization)
* initializes a value (for the parameter) -> evaluates fit based on some loss function (SSresiduals) -> if plotted against different parameter values, the loss makes a curve that goes down and up -> using calculus, the derivative of this curve points towards the minimum, i.e. the value that gives the smallest loss (SSresiduals e.g.)
* the derivative - gives the slope at the current point of the curve, and the parameter is adjusted by the step size (slope * learning rate) 
* steps are large when far from the optimal value (steep slope), and smaller when closer (flatter slope) - the learning rate itself is preset
* stops when step size is very close to 0, i.e. when slope is close to 0, or after a maximum number of steps
* as long as the derivative of the loss can be taken - it can be done

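The loop described above, sketched for a one-parameter model y = w * x with a sum-of-squared-residuals loss. The data, learning rate and stopping threshold are made-up values for illustration.

```python
# Hypothetical data generated from y = 2x, so the best slope is 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def loss(w):
    """Sum of squared residuals for the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def gradient(w):
    """Derivative of the loss with respect to w."""
    return sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))

w = 0.0               # initial guess
learning_rate = 0.01  # preset
steps = 0
while steps < 1000:
    step_size = gradient(w) * learning_rate
    if abs(step_size) < 1e-9:   # slope is essentially flat: stop
        break
    w -= step_size
    steps += 1

print(round(w, 6), steps)
```

A learning rate that is too large for the data would make the steps overshoot and diverge, which is why it is treated as a hyperparameter to tune.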
**Stochastic Gradient Descent**
* used when there's a large amount of data
* selects one data point at every step instead of using the full data set to get the gradient
* reduces calculation time of derivatives (fewer derivatives to calculate)
* randomly picks one sample and uses that to calculate the derivative
* can better escape local minima - because the randomly selected point makes each step noisy 

**Mini-batch Gradient Descent**
* instead of one point, calculates from a batch of data points

#### Regularization
* done to minimize overfitting 
* used in many algo
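* penalizes large coefficients, which discourages overly complex models 
* Ridge Regression (L2)
    * adds a penalty to the loss: 'lambda' multiplied by the sum of squared coefficients
    * more or larger coefficients, higher penalty - shrinks coefficients towards 0
* Lasso Regression (L1)
    * adds a penalty to the loss: 'lambda' multiplied by the sum of absolute values of the coefficients
    * can shrink some coefficients exactly to 0, which acts as feature selection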

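A sketch of the two penalized losses, assuming the usual formulation where the penalty is applied to the coefficients (not the intercept) and added to the sum of squared errors. The data and weights are made up so that the fit is exact and only the penalty term remains.

```python
def sse(xs, ys, weights, intercept):
    """Sum of squared errors of a linear model."""
    return sum(
        (y - (sum(w * x for w, x in zip(weights, row)) + intercept)) ** 2
        for row, y in zip(xs, ys)
    )

def ridge_loss(xs, ys, weights, intercept, lam):
    """SSE + lambda * sum of squared weights (L2 penalty)."""
    return sse(xs, ys, weights, intercept) + lam * sum(w ** 2 for w in weights)

def lasso_loss(xs, ys, weights, intercept, lam):
    """SSE + lambda * sum of absolute weights (L1 penalty)."""
    return sse(xs, ys, weights, intercept) + lam * sum(abs(w) for w in weights)

# Hypothetical data and weights, chosen so the fit is exact (SSE = 0)
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [3.0, -2.0, 1.0]
weights, intercept = [3.0, -2.0], 0.0

print(ridge_loss(xs, ys, weights, intercept, lam=0.5))  # 0 + 0.5 * (9 + 4)
print(lasso_loss(xs, ys, weights, intercept, lam=0.5))  # 0 + 0.5 * (3 + 2)
```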


#### Metrics
* MSE, MAE, R2
* Accuracy
    * (True Positive + True Negative) / (All)
* Precision **
    * true positives from all *labeled/predicted* positive
    * TP / (TP+FP)
    * high precision - few false positives
* Recall/Sensitivity **
    * true positives from all *actual/true* positives
    * TP / (TP+FN)
    * high recall - low false negatives
* Precision/Recall - trade off **
    * churn example
    * increasing precision - fewer false positives, usually at the cost of more false negatives 
        * shoplifter at a store - increase precision, ensure that is actually a shoplifter
    * increasing recall - fewer false negatives, usually at the cost of more false positives 
    * case of cancer - might be better to optimize for recall - in order to miss fewer true cases - because it would be more costly to the person - to have cancer and not have it get detected
* F1 score - harmonic mean of precision and recall: 2 * (Precision * Recall) / (Precision + Recall)
* Lift
* ROC/AUC
    * ROC plots true positive rate against false positive rate of the model, across classification thresholds
    * compute area under the curve - want AUC to be high - meaning that TPR increases faster than the FPR
    * another way of representing how well the model separates the classes 
    * diagonal straight line = AUC of 0.5, no predictive value 

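Accuracy, precision, recall and F1 computed from the four confusion-matrix counts. The labels are made up, and `classification_metrics` is a hypothetical helper rather than a library function.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```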
#### Deep Learning - Artificial Neural Networks

#### Time Series 
* FBprophet, time series analysis
* turning time series into a machine learning problem - using previous values as features (X) to predict the current value (y), by shifting the data by a period (days, weeks or months)
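
Shifting a series to build a supervised data set can be sketched as below: each row pairs the previous values (features) with the value that follows them (target). The helper `make_lagged` and the sales numbers are made up for illustration.

```python
def make_lagged(series, n_lags):
    """Turn a series into (features, target) rows using the previous n_lags values."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]   # values before time t, oldest first
        target = series[t]                # value to predict at time t
        rows.append((features, target))
    return rows

sales = [10, 12, 13, 15, 18, 21]  # hypothetical daily values
rows = make_lagged(sales, n_lags=2)
for features, target in rows:
    print(features, "->", target)
```

When splitting such data for validation, the test rows should come after the training rows in time, so the model is never trained on the future.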