# Classification

[What is Machine Learning](https://www.youtube.com/watch?v=iLu9XyZ55oI)

## Questions that data science methods can answer

- **Is this new observation A or B (or C, D, or E)? (Classification)**
- How many or how much of something? (Regression)
- What groupings exist in the data already? (Clustering)
- What should we expect to happen next? (Time Series Analysis)
- Is this weird? (Anomaly Detection)


## Predicting categorical outcomes

Classification is a technique for predicting the class (category) of an observation. It works by modeling the patterns in the related data that drive the outcome.

The primary goal of developing a classification model is to generalize patterns so that the category/class of new data can be identified with a high degree of certainty. Classification can be performed on structured or unstructured data. It can be used to predict binary outcomes (2 classes) or multiclass outcomes (more than 2 classes).

In this module, we will explore **supervised** (the data is labeled) machine learning related to **classification** (the target variable is categorical) using **structured** (the data can be naturally stored in rows and columns) data. We will work through the data science pipeline while improving our knowledge of the Python programming language.

## Vocabulary

- Classifier:  An algorithm that maps the input data to a specific category.

- Classification Model: A series of steps that takes the patterns of input variables, generalizes those patterns, and applies them to new data in order to predict the class. 

- Feature:  A feature, aka input/independent variable, is an individual measurable property of a phenomenon being observed.

- Binary Classification: Classification with two possible outcomes, e.g. pass/fail.

- Multiclass Classification: Classification with more than two classes, where each sample is assigned to one and only one target label, e.g. grade levels of students in school (1st-12th).
    

## Main ideas


- Train on data where the answers (labels) are already known (supervised).
- This training will produce a `decision rule` that will be used to classify future data.
- Using a boundary between points or a distance between points, we classify new data points into A or B (or C or D or E).
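
To make the idea of a distance-based decision rule concrete, here is a minimal sketch in plain Python. It is a nearest-centroid rule, chosen only for illustration and not necessarily the approach used later in this module. The fruit measurements, the feature names (weight and a color score), and the helper names `centroid` and `classify` are all hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical labeled training data: (weight in grams, color score from 0 to 10).
train = {
    "apple": [(150.0, 2.0), (160.0, 3.0), (170.0, 2.5)],
    "orange": [(130.0, 8.0), (140.0, 9.0), (135.0, 8.5)],
}


def centroid(points):
    """Average each coordinate across a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# "Training": summarize each class by the centroid of its labeled points.
centroids = {label: centroid(points) for label, points in train.items()}


def classify(point):
    """Decision rule: return the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))


new_fruit = (155.0, 2.8)
print(classify(new_fruit))
```

The labeled points are only used to compute the centroids; after that, any new point can be classified from its distance to each centroid.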




![](classify_apples_oranges.png)

## Process

### Planning

- Clearly define your goals, your timeline, and how you plan to get there. 

- Understand what your MVP will be and set milestones. 

- Know who your stakeholders are and understand the final delivery product. 

- Set some initial questions that you plan to investigate. 


### Acquire

- Acquire structured data from a clipboard, Excel, Google Sheets, or SQL and read it into pandas. 

- Understand and summarize the data through aggregates, descriptive stats and distribution plots.
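
The acquire steps above might look something like the following sketch. To keep it self-contained, it reads a small, made-up CSV string with `pd.read_csv`; in practice you would likely use a reader such as `pd.read_excel`, `pd.read_sql`, or `pd.read_clipboard` instead. The column names (`age`, `plan`, `churned`) and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file, spreadsheet, or SQL result.
csv_text = """age,plan,churned
34,basic,no
51,premium,no
29,basic,yes
42,premium,no
38,basic,yes
,basic,no
"""

df = pd.read_csv(io.StringIO(csv_text))

# Size of the data and the datatype of each column.
print(df.shape)
print(df.dtypes)

# Descriptive statistics for the numeric columns.
print(df.describe())

# How often each class appears in the (categorical) target.
class_counts = df["churned"].value_counts()
print(class_counts)

# Number of missing values in each column.
missing = df.isna().sum()
print(missing)
```

A quick look at class counts and missing values at this stage helps decide what cleaning the prepare step will need.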


### Prepare

- Clean the data by converting datatypes and handling missing values.

- Split our observations into three samples: train, validate, and test.
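
As a rough sketch of the prepare steps above, the example below converts a datatype, drops rows with missing values, and splits the data with scikit-learn's `train_test_split`. The data, the column names, the choice to drop (rather than impute) missing values, and the roughly 60/20/20 split proportions are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw data: 100 customers, two of them with a missing age.
df = pd.DataFrame({
    "age": [None if i in (3, 17) else 20.0 + (i % 40) for i in range(100)],
    "tenure": [str(i % 12) for i in range(100)],  # numbers stored as strings
    "churned": ["yes" if i % 4 == 0 else "no" for i in range(100)],
})

# Clean: convert datatypes and handle missing values.
df["tenure"] = df["tenure"].astype(int)
df = df.dropna()  # simplest option; imputing is another common choice

# Split: first hold out test, then carve validate out of what remains.
# Stratifying on the target keeps class proportions similar across samples.
train_validate, test = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df["churned"]
)
train, validate = train_test_split(
    train_validate,
    test_size=0.25,
    random_state=123,
    stratify=train_validate["churned"],
)

print(train.shape, validate.shape, test.shape)
```

Splitting before exploration and modeling means that validate and test stay unseen until they are needed for evaluation.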


### Explore

- Univariate analysis: Look at the distribution of your variables individually. 

- Bivariate/multivariate analysis: Look at the relationship of two or more variables. 
    - We will discuss the meaning of "drivers", variables vs. features, and the target variable. We will discuss the importance of documenting questions and hypotheses.
    - Visualize the interaction of variables, especially independent variables with the dependent variable, using charts such as scatterplots, jointplots, pairgrids, and heatmaps to identify drivers.
    - Run statistical tests to verify relationships. Common hypothesis tests that involve a categorical variable are t-tests (comparing the mean of a continuous variable between two groups) and chi-squared tests (testing whether two categorical variables are independent).
    - Conclude by documenting answers to those questions, along with takeaways and findings from each step of exploration.
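
The statistical tests mentioned above can be sketched with `scipy.stats`. This example assumes a hypothetical train sample with a categorical target (`churned`), a categorical feature (`plan`), and a continuous feature (`monthly_charges`); the data is made up so that both relationships are strong, and the significance level of 0.05 is a common convention rather than a rule.

```python
import pandas as pd
from scipy import stats

# Hypothetical train sample.
churned = ["yes"] * 30 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 30
train = pd.DataFrame({
    "plan": ["basic"] * 40 + ["premium"] * 40,
    "churned": churned,
    "monthly_charges": [
        (70 if c == "yes" else 50) + (i % 10) for i, c in enumerate(churned)
    ],
})

alpha = 0.05  # assumed significance level

# Chi-squared test: are plan and churned (both categorical) independent?
observed = pd.crosstab(train["plan"], train["churned"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_chi2:.5f}, reject H0: {p_chi2 < alpha}")

# Two-sample t-test: do churned and non-churned customers have the same
# mean monthly charges? (equal_var=False requests Welch's t-test.)
charges_yes = train.loc[train["churned"] == "yes", "monthly_charges"]
charges_no = train.loc[train["churned"] == "no", "monthly_charges"]
t_stat, p_t = stats.ttest_ind(charges_yes, charges_no, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_t:.5f}, reject H0: {p_t < alpha}")
```

Running tests on the train sample only keeps validate and test unseen for later evaluation.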

### Model

- Preprocessing: We will prepare our data specifically for modeling. In this module, we will focus on encoding our values. Most machine learning algorithms cannot accept string values, so we will turn them all into numbers. Scaling our data is also important for distance-based algorithms. Scaling will be discussed in the regression module. 

- Establish Baseline: We will learn about the importance of establishing a "baseline model" or baseline score and ways to complete this task. The baseline for classification is typically to predict the mode (the most frequent class) of the dependent variable for every observation.

- Build Models: We will build classification models. What does that mean? We will use well-established algorithms to extract the patterns in the data. This will create a model that will allow us to compute predictions for each observation. 

- Model Evaluation: We will compare classification models by computing evaluation metrics. These metrics will measure how well a model did at predicting the target variable, based on different priorities. 

- Model Selection and Testing: We will evaluate models on unseen data (the out-of-sample validate and test datasets). We will use validate to tune hyperparameters and select our best model, and then evaluate that one model on test. 
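
The modeling steps above could be strung together roughly as follows. This is a sketch under several assumptions: the data and the rule that generates `churned` are made up, one-hot encoding via `pd.get_dummies` stands in for whichever encoding you choose, a decision tree is used only as an example algorithm, and accuracy is used as the single evaluation metric.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: basic-plan customers with short tenure churn.
n = 200
df = pd.DataFrame({
    "plan": ["basic" if i % 2 == 0 else "premium" for i in range(n)],
    "tenure": [i % 24 for i in range(n)],
})
df["churned"] = [
    "yes" if plan == "basic" and tenure < 12 else "no"
    for plan, tenure in zip(df["plan"], df["tenure"])
]

# Preprocessing: encode the string column as numbers.
X = pd.get_dummies(df[["plan", "tenure"]], columns=["plan"], drop_first=True)
y = df["churned"]

# Split into train, validate, and test (roughly 60/20/20).
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=123, stratify=y_tv
)

# Baseline: always predict the most frequent class in train.
baseline_class = y_train.mode()[0]
baseline_accuracy = (y_train == baseline_class).mean()
print(f"baseline ({baseline_class}): {baseline_accuracy:.2f}")

# Build and evaluate: try a few values of one hyperparameter on validate.
validate_scores = {}
for depth in (1, 2, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    validate_scores[depth] = accuracy_score(y_validate, tree.predict(X_validate))

# Select the best depth on validate, then evaluate that one model on test.
best_depth = max(validate_scores, key=validate_scores.get)
best_model = DecisionTreeClassifier(max_depth=best_depth, random_state=123)
best_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"best max_depth: {best_depth}, test accuracy: {test_accuracy:.2f}")
```

The test sample is touched exactly once, at the end, so its score is an honest estimate of how the chosen model performs on data it has never seen.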


**Common Classification Algorithms**

- Logistic Regression (sklearn.linear_model.LogisticRegression)
- Decision Tree (sklearn.tree.DecisionTreeClassifier)
- Naive Bayes (sklearn.naive_bayes.BernoulliNB)
- K-Nearest Neighbors (sklearn.neighbors.KNeighborsClassifier)
- Random Forest (sklearn.ensemble.RandomForestClassifier)
- Support Vector Machine (sklearn.svm.SVC)
- Stochastic Gradient Descent (sklearn.linear_model.SGDClassifier)
- AdaBoost (sklearn.ensemble.AdaBoostClassifier)
- Bagging (sklearn.ensemble.BaggingClassifier)
- Gradient Boosting (sklearn.ensemble.GradientBoostingClassifier)
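
The scikit-learn classifiers listed above share the same basic interface (`fit`, `predict`, `score`), which makes it straightforward to try several of them on the same data. The sketch below compares four of them on a synthetic dataset generated with `make_classification`; the dataset, the hyperparameter values, and the selection of algorithms are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not real observations).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=123
)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Hyperparameter values here are arbitrary starting points.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=123),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=123
    ),
}

# Every classifier is fit and scored the same way.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_validate, y_validate)  # mean accuracy
    print(f"{name}: {scores[name]:.2f}")
```

Because the interface is uniform, swapping in another algorithm from the list usually only requires changing the entry in the `models` dictionary.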

See [the sklearn docs on supervised methods](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for more.