### Questions that Data Science Methods Can Answer
- **Is this new observation A or B (or C, D, or E)? (Classification)**
    - outcomes can be binary or have multiple categories
        - example of having multiple categories is handwriting recognition 
            - because you have a category for each letter
- How Many or How Much of something (Regression)
    - counting
- What groupings exist in the data already (Clustering)
    - looking for similarities between things across multiple dimensions
    - group based on similar features/measurements
- What should we expect to happen next? (Time Series Analysis)
- Is this weird? (Anomaly Detection)

### What are Classification algorithms?
- Classification tools are all **supervised learning** tasks. That means we train on data with answers/labels
    - supervised learning is not all of machine learning but is quite a bit
- BIG GOAL: 
    - We train with answers/labels to produce a `decision rule` we'll use to classify future data.
        - if this then that
        - if else if else if else
            - elif elif elif elif
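To make the "if / elif / else" idea concrete, here is a minimal sketch of what a decision rule looks like once it exists. The function name, features, thresholds, and class labels are all hypothetical and hand-written for illustration; a classification algorithm would learn a rule like this from labeled training data instead.

```python
def classify_fruit(weight_grams: float, is_smooth: bool) -> str:
    """Hypothetical hand-written decision rule (not learned from data)."""
    if weight_grams < 100:
        return "plum"
    elif is_smooth:
        return "apple"
    else:
        return "orange"


# Classify a few new observations with the rule
print(classify_fruit(80, True))
print(classify_fruit(150, True))
print(classify_fruit(150, False))
```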

![image.png](attachment:image.png)

### Main Ideas
- With classification, we use labeled data to train algorithms to classify future data points.
- The training data allows us to train an algorithm to produce a decision rule
- Using a boundary between points or a distance between points, we classify new datapoints into A or B (or C or D or E)

![image.png](attachment:image.png)

Classification is a technique for labeling the class of an observation. This is done through the modeling of the patterns in the related data which drive the outcome.

The primary goal of developing a classification model is to generalize patterns. This is done so that the category/class of new data can be identified with a high degree of certainty. Classification can be performed on structured or unstructured data. It can be used to predict binary classes (2 classes) or multi-classes (>2 classes).

### Vocab
- **Classifier:** 
    - An algorithm that maps the input data to a specific category.

- **Classification Model:**
    - A series of steps that takes the patterns of input variables, generalizes those patterns (makes a decision rule/boundary), and applies them to new data in order to predict the class.

- **Feature:** 
    - A feature, aka input/independent variable, is an individual measurable property of a phenomenon being observed.
    - a column in a data frame
    - an independent variable
    - a measurable property

- **Binary Classification:**
    - Classification with two possible outcomes, e.g. pass/fail.

- **Multiclass Classification:**
    - Classification with more than two classes, where each sample is assigned to one and only one target label, e.g. Grade levels of students in school (1st-12th).

### Common Classification Algorithms/Tools

- Logistic Regression (sklearn.linear_model.LogisticRegression)
- Decision Tree (sklearn.tree.DecisionTreeClassifier)
- Naive Bayes (sklearn.naive_bayes.BernoulliNB)
- K-Nearest Neighbors (sklearn.neighbors.KNeighborsClassifier)
- Random Forest (sklearn.ensemble.RandomForestClassifier)
- Support Vector Machine (sklearn.svm.SVC)
- Stochastic Gradient Descent (sklearn.linear_model.SGDClassifier)
- AdaBoost (sklearn.ensemble.AdaBoostClassifier)
- Bagging (sklearn.ensemble.BaggingClassifier)
- Gradient Boosting (sklearn.ensemble.GradientBoostingClassifier)

See https://scikit-learn.org/stable/supervised_learning.html#supervised-learning for more information.
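The scikit-learn classifiers listed above share the same `fit` / `predict` / `score` interface, so trying several of them is mostly a loop. The sketch below assumes the built-in iris dataset and some arbitrary hyperparameters (`max_iter`, `max_depth`, `n_neighbors`); none of these choices come from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed example data: the iris dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Hyperparameters here are illustrative, not tuned
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # train on labeled data
    scores[name] = model.score(X_test, y_test)   # accuracy on held-out data
    print(f"{name}: {scores[name]:.2f}")
```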

### Data Preparation Needs
- Features need to be turned into numbers
- Categorical features or discrete features will be numbers that represent those categories
- Continuous features may need to be scaled so we're not comparing different units like dollars to cm to age.
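A minimal sketch of these preparation steps, assuming a small made-up data frame (the column names and values are hypothetical). It uses `pandas.get_dummies` to turn a categorical feature into numbers and `MinMaxScaler` to put continuous features on a common 0 to 1 scale; other encoders and scalers would work as well.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one categorical feature, two continuous features in different units
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "income_dollars": [30000.0, 85000.0, 52000.0, 41000.0],
    "height_cm": [160.0, 182.0, 175.0, 168.0],
})

# Turn the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# Rescale the continuous columns to the 0-1 range so units no longer dominate
numeric_cols = ["income_dollars", "height_cm"]
scaler = MinMaxScaler()
encoded[numeric_cols] = scaler.fit_transform(encoded[numeric_cols])

print(encoded)
```

In a real project the scaler would be fit on the training split only and then applied to the test split.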

![image.png](attachment:image.png)

### Logistic Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

- Technically a regression algorithm (Goal is to find the values for the coefficients that weight each input variable).
- Used to predict binary outcomes.
- The output is a value between 0 and 1 that represents the probability of one class over the other.

- Pros:
    - Interpretable: Good for understanding the influence of several independent variables on a single outcome variable.
    - Flexible: We can choose to ‘snap’ predictions to 0 and 1 via a rule (such as if < .5, 0 else 1) OR we can choose to use the output as is, which is a probability of being class 1.
    - Easy to implement, meaning it is good to use for creating a benchmark.
    - Very efficient and does not require many computational resources.

- Cons:
    - Need to remove attributes which are either unrelated to the output variable or correlated to other attributes.
    - Not one of the top performing classification algorithms.
        - but good for getting things out the door
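The sketch below illustrates the two ways of using logistic regression output described above: the raw probability of class 1, and a prediction "snapped" to 0 or 1 with a threshold. The hours-studied data is made up for illustration, and the 0.5 threshold is simply the rule mentioned in the notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: hours studied -> fail (0) or pass (1)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Column 1 of predict_proba is the estimated probability of class 1
probabilities = model.predict_proba(hours)[:, 1]

# "Snap" the probabilities to 0/1: below the threshold -> 0, otherwise 1
threshold = 0.5
snapped = (probabilities >= threshold).astype(int)

print("coefficient for hours:", model.coef_[0][0])
print("probabilities:", probabilities.round(2))
print("snapped predictions:", snapped)
```

Raising or lowering `threshold` trades one kind of error for the other, which is the flexibility the Pros list refers to.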

![Screen%20Shot%202021-02-22%20at%2010.48.19%20AM.png](attachment:Screen%20Shot%202021-02-22%20at%2010.48.19%20AM.png)

### Decision Tree (CART: Classification And Regression Trees)
- A sequence of rules used to classify 2 or more classes.
- Each node represents a single input variable (x) and a split point or class of that variable.
- The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.
- Predictions are made by walking the splits of the tree until arriving at a leaf node and output the class value at that leaf node.

- Pros:
    - Simple to understand, visualize & explain.
    - Requires little data preparation.
    - Can handle both numerical and categorical data.
    - Performs well for a broad range of problems.	

- Cons:
    - Risk of Overfitting: Can create complex trees that do not generalise well.
        - the algorithm performs better on training data than it does on unseen (out-of-sample) data
    - Can be unstable because small variations in the data might result in a completely different tree being generated.

Example below: If an observation has a length of 45, blue eyes, and 2 legs, it's going to be classified as red.

![image.png](attachment:image.png)
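As a rough sketch of how a tree's rules can be inspected in code, the example below fits a shallow `DecisionTreeClassifier` and prints its splits with `export_text`. The iris dataset and the `max_depth=2` setting are assumptions made for illustration; limiting depth is one common way to reduce the overfitting risk noted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed example data: the iris dataset that ships with scikit-learn
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# max_depth limits how complex the tree can get
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_train, y_train)

# Print the learned splits as nested if/else style rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# A large gap between these two numbers is a sign of overfitting
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```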

### Random Forest

Random forest is an implementation of bootstrap aggregation, aka bagging, which is an ensemble algorithm.

- ensemble algorithm
    - multiple algorithms working together, not just one

Bootstrapping is a statistical method for estimating a quantity from a data sample, e.g. mean. You take lots of samples of your data, calculate the mean, then average all of your mean values to give you a better estimation of the true mean value. In bootstrap aggregation, or bagging, the same approach is used for estimating entire statistical models, such as decision trees. Multiple samples of your training data are taken and models are constructed for each sample set.
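The bootstrap procedure for a mean can be sketched in a few lines of NumPy. The sample here is randomly generated (an assumption for illustration), and the number of resamples (1000) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data sample
data = rng.normal(loc=50, scale=10, size=200)

# Resample with replacement many times and record the mean of each resample
boot_means = []
for _ in range(1000):
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means.append(resample.mean())
boot_means = np.array(boot_means)

print("mean of the original sample:", data.mean())
print("average of the bootstrap means:", boot_means.mean())
print("spread of the bootstrap means:", boot_means.std())
```

The spread of the bootstrap means gives a sense of how uncertain the estimate is, which is information a single mean calculation does not provide.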

When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.

Random forest is a tweak on this approach of bagging: decision trees are created so that, rather than always selecting optimal split points, suboptimal splits are made by introducing randomness. The models created for each sample of the data are therefore more different from one another than they would be with plain bagging, but still accurate in their own ways. Combining their predictions results in a better estimate of the true underlying output value.

If you get good results with an algorithm with high variance (like decision trees), you can often get better results by bagging that algorithm, e.g. using a random forest.
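A minimal sketch comparing a single decision tree with a random forest on the same data. The synthetic dataset from `make_classification` and the settings `n_estimators=100` and `max_features="sqrt"` are assumptions for illustration; `max_features` is the parameter that adds randomness to the splits by limiting which features each split may consider.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated only for this illustration
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# One unrestricted tree (high variance)
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 100 trees, each trained on a bootstrap sample and restricted to a random
# subset of features at every split
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(X_train, y_train)

tree_acc = single_tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print("single tree test accuracy:", tree_acc)
print("random forest test accuracy:", forest_acc)
```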

- Pros:
    - Less risk of overfitting than with a decision tree.
    - More accurate than decision trees in most cases.	

- Cons:
    - High demand on computational resources.
        - shouldn't be too big of a problem for our laptops
        - but running it on something like 2 TB of data would definitely cause a problem
    - Difficult to implement.
    - Somewhat of a blackbox model, difficult to explain.

### K-Nearest Neighbor

K-Nearest Neighbor (KNN) makes predictions based on how close a new data point is to known data points.

It is considered "lazy" as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the K nearest neighbours of each point.

Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression problems, this might be the mean output variable. For classification problems, this might be the mode (or most common) class value.

It is important to define a metric to measure the similarity between data instances. Euclidean distance can be used if attributes are all on the same scale (or you convert them to the same scale).
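The prediction step described above (measure distances, take the K closest, vote) can be sketched without any library. The function name `knn_predict` and the toy points are hypothetical; in practice `sklearn.neighbors.KNeighborsClassifier` does the same job with faster neighbor searches.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every stored training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The mode (most common label) is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups on the same scale
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))  # A
print(knn_predict(train_points, train_labels, (6.0, 7.0)))  # B
```

Note that nothing is "trained" here: the function needs the full training set at prediction time, which is why KNN is called lazy.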

- Pros:
    - Simple to implement.
    - Robust to noise.
    - Performs calculations "just in time"
        - i.e. when a prediction is needed (as opposed to ahead of time)
    - Training instances can be updated and curated over time to keep predictions accurate.

- Cons:
    - Need to determine the value of K.
        - you have to come up with it
    - High Computational Cost: It has to compute the distance from each new instance to all the training samples, so you have to hang on to your entire training dataset.
    - "Curse of dimensionality": 
        - Distance can break down in very high dimensions, negatively affecting the performance.
            - if we have 2 columns that is low dimensionality
            - 1,000 columns of features is high dimensionality

![image.png](attachment:image.png)

### Support Vector Machine

A technique that uses higher dimensions to best separate data points into two classes.

Support Vector Machines select a hyperplane (a line that splits the input variable space) to best separate the points in the input variable space by their class, either class 0 or class 1. In two dimensions, you can visualize this as a line.

An optimization algorithm is used to find the values for the coefficients that maximize the margin. The distance between the hyperplane and the closest data points is referred to as the **margin**. The best or optimal hyperplane that can separate the two classes is the one that has the largest margin. Only the closest points, called the **support vectors**, are relevant in defining the hyperplane and in the construction of the classifier.
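A small sketch of these ideas with `sklearn.svm.SVC`, assuming a made-up, linearly separable 2D dataset. A linear kernel is used so that the hyperplane coefficients are available through `coef_`, and a large `C` (an assumption chosen here) approximates a hard margin. For a linear SVM, the distance from the hyperplane to the closest points works out to `1 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel so the separating hyperplane is a straight line in 2D;
# a large C penalizes margin violations heavily
svm = SVC(kernel="linear", C=1000.0)
svm.fit(X, y)

# Coefficients of the hyperplane w . x + b = 0
w = svm.coef_[0]
b = svm.intercept_[0]

# Distance from the hyperplane to the closest training points
margin = 1.0 / np.linalg.norm(w)

print("support vectors:\n", svm.support_vectors_)
print("margin:", margin)
print("predictions:", svm.predict([[1.5, 1.5], [4.5, 4.5]]))
```

Only the rows printed under `support vectors` influence the fitted hyperplane; removing any of the other training points would leave it unchanged.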

- Pros:
    - Effective in high dimensional spaces
    - Memory efficient: Uses a subset of training points in the decision function
    - Highly successful classifier

- Cons:
    - Does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

### Naïve Bayes

Naive Bayes is based on Bayes' theorem, with the added assumption of independence between every pair of features.

It is comprised of two types of probabilities that can be calculated directly from your training data:

- The probability of each class
- The conditional probability of each x value given each class

Once calculated, the probability model can be used to make predictions for new data using Bayes' theorem. When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities. (so normalize your data!)

It assumes that each input variable is independent (which is often not the case), thus it is called "naive". This is a strong assumption and unrealistic for real data. Nevertheless, the technique is very effective on a large range of complex problems, including document classification and spam filtering.
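A minimal sketch of the two kinds of probabilities in code, using `sklearn.naive_bayes.GaussianNB` because the toy features here are real-valued (the tools list above names `BernoulliNB`, which suits binary features instead). The data is made up for illustration. After fitting, `class_prior_` holds the probability of each class, and `theta_` holds the per-class feature means used for the Gaussian likelihoods.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 3 observations of class 0 and 5 of class 1
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
              [3.0, 4.1], [3.2, 3.9], [2.9, 4.0], [3.1, 4.2], [3.0, 3.8]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

nb = GaussianNB()
nb.fit(X, y)

# Probability of each class, estimated from class frequencies (3/8 and 5/8)
print("class priors:", nb.class_prior_)

# Mean of each feature within each class (one row per class)
print("per-class feature means:\n", nb.theta_)

# Bayes' theorem combines priors and likelihoods into class probabilities
new_points = np.array([[1.0, 2.0], [3.0, 4.0]])
print("predicted classes:", nb.predict(new_points))
print("class probabilities:\n", nb.predict_proba(new_points).round(3))
```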

- Pros:
    - Works with a smaller sample size of training data than other classifiers
    - Extremely fast compared to more sophisticated methods.
    - Simple & Powerful

- Cons:
    - Can be a bad estimator if used in less than ideal problems.

Use cases:

- Based on their purchase and browsing history, what promos should I offer to my customers?
- Learn from IB to develop methods for prospecting new customers

### Predicting Categorical Outcomes
In this module, we will explore **supervised** machine learning related to **classification** using **structured data**. We will work through each stage of the data science pipeline, expanding our knowledge of classification techniques as well as improving our knowledge of the Python programming language.

![image.png](attachment:image.png)