# Classification

In this chapter, you will be introduced to classification problems and learn how to solve them using supervised learning techniques. You will then apply what you learn to a political dataset, classifying the party affiliation of United States congressmen based on their voting records.

# (1) Supervised learning 

## What is machine learning?
- The art and science of:
    - Giving computers the ability to learn to make decisions from data
    - without being explicitly programmed
- Examples:
    - Learning to predict whether an email is spam or not
    - Clustering Wikipedia entries into different categories
- Supervised learning: Uses labeled data
- Unsupervised learning: Uses unlabeled data

## Unsupervised learning
- Uncovering hidden patterns from unlabeled data
- Example:
    - Grouping customers into distinct categories (Clustering)

## Reinforcement learning
- Software agents interact with an environment
    - Learn how to optimize their behavior
    - Given a system of rewards and punishments
    - Draws inspiration from behavioral psychology
- Applications
    - Economics
    - Genetics
    - Game playing
- AlphaGo: First computer to defeat the world champion in Go

## Supervised Learning
- Predictor variables/features and a target variable
- Aim: Predict the target variable, given the predictor variables
    - Classification: Target variable consists of categories
    - Regression: Target variable is continuous

<img src="image/Screenshot 2021-02-01 130537.png">

## Naming conventions
- Features = predictor variables = independent variables
- Target variable = dependent variable = response variable
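
As a minimal sketch of these naming conventions, the snippet below splits a small table into features and a target. The table, its column names (`sqft`, `bedrooms`, `price`, `sold_fast`), and its values are all hypothetical and invented for illustration only. It shows one continuous target (a regression problem) and one categorical target (a classification problem).

```python
import pandas as pd

# Hypothetical table: every column name and value here is made up for illustration
data = pd.DataFrame({
    "sqft": [850, 1200, 1500],
    "bedrooms": [2, 3, 4],
    "price": [200000.0, 310000.0, 405000.0],  # continuous values -> regression target
    "sold_fast": ["yes", "no", "yes"],        # categories -> classification target
})

# Features = predictor variables = independent variables
X = data[["sqft", "bedrooms"]]

# Target variable = dependent variable = response variable
y_reg = data["price"]       # target for a regression problem
y_clf = data["sold_fast"]   # target for a classification problem

print(X.shape)
print(y_reg.dtype)
print(y_clf.unique())
```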

## Supervised learning
- Automate time-consuming or expensive manual tasks
    - Example: Doctor's diagnosis
- Make predictions about the future
    - Example: Will a customer click on an ad or not
- Need labeled data
    - Historical data with labels
    - Experiments to get labeled data
    - Crowd-sourcing labeled data

## Supervised learning in Python
- We will use scikit-learn/sklearn
    - Integrates well with the SciPy stack
- Other libraries
    - TensorFlow
    - Keras
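
To give a rough idea of what supervised learning with scikit-learn looks like, here is a minimal sketch of the library's `fit`/`predict` pattern. The choice of `KNeighborsClassifier` and of `n_neighbors=5` is an assumption made purely for illustration, and the model is fit and queried on the same built-in Iris data, so this is not an evaluation of the model.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Built-in example dataset: X holds the features, y the integer-coded target
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Illustrative classifier choice; any scikit-learn estimator follows the same pattern
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)              # learn from labeled data

pred = clf.predict(X[:3])  # predict labels for the first three samples
print(iris.target_names[pred])
```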

# Exercise I: Which of these is a classification problem?

Once you decide to leverage supervised machine learning to solve a new problem, you need to identify whether your problem is better suited to classification or regression. This exercise will help you develop your intuition for distinguishing between the two.

Provided below are 4 example applications of machine learning. Which of them is a supervised classification problem?

### Instructions

- Using labeled financial data to predict whether the value of a stock will go up or go down next week. (T)

- Using labeled housing price data to predict the price of a new house based on various features.

- Using unlabeled data to cluster the students of an online education company into different categories based on their learning styles.

- Using labeled financial data to predict what the value of a stock will be next week.

# (2) Exploratory data analysis

## The Iris dataset
<img src="image/Screenshot 2021-02-01 135820.png">

Features:

- Petal length
- Petal width
- Sepal length
- Sepal width

Target variable: Species

- Versicolor
- Virginica
- Setosa

## The Iris dataset in scikit-learn

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()
type(iris)

In [None]:
print(iris.keys())

In [None]:
type(iris.data), type(iris.target)

In [None]:
iris.data.shape

In [None]:
iris.target_names
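
The target is stored as integer codes, while `target_names` holds the species labels. The following sketch shows one way to map the codes back to names and count the samples per species. The variable names `species` and `counts` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

print(iris.feature_names)      # names of the four feature columns
print(np.unique(iris.target))  # integer codes used for the target

# Index the array of names with the integer codes to get one label per sample
species = iris.target_names[iris.target]
print(species[:3])

# Count how many samples belong to each species
names, n = np.unique(species, return_counts=True)
counts = {str(k): int(v) for k, v in zip(names, n)}
print(counts)
```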

## Exploratory data analysis (EDA)

In [None]:
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())
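
Beyond `.head()`, a few more pandas calls give a quick numerical overview. The sketch below builds a separate DataFrame, named `iris_df` here so it does not overwrite `df`, attaches the species label as a categorical column, and summarizes the features. The `species` column name is an assumption made for this example.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Separate DataFrame so the df used elsewhere in these notes stays untouched
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

iris_df.info()              # column names, dtypes, and non-null counts
print(iris_df.describe())   # summary statistics for the numeric columns

# Mean of each feature per species
means = iris_df.groupby("species", observed=True).mean()
print(means)
```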

## Visual EDA

In [None]:
_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')

<img src="image/Screenshot 2021-02-01 141551.png">

# Exercise II: Numerical EDA

In this chapter, you'll be working with a dataset obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records) consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!

Before thinking about which supervised learning models you can apply to this dataset, however, you need to perform exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of [Statistical Thinking in Python (Part 1)](https://learn.datacamp.com/courses/statistical-thinking-in-python-part-1).

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called `df`. Use pandas' `.head()`, `.info()`, and `.describe()` methods in the IPython Shell to explore the DataFrame, and select the statement below that is not true.

### Instructions
    
- The DataFrame has a total of `435` rows and `17` columns.
- Except for `'party'`, all of the columns are of type `int64`.
- The first two rows of the DataFrame consist of votes made by Republicans and the next three rows consist of votes made by Democrats.
- There are 17 predictor variables, or features, in this DataFrame. (T)
- The target variable in this DataFrame is `'party'`.
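
When counting features, keep in mind that the target column is not a predictor. The sketch below illustrates this on a toy stand-in for the voting DataFrame. The four rows and their values are invented, and only three of the bill columns (`education`, `satellite`, `missile`) are included.

```python
import pandas as pd

# Toy stand-in for the pre-loaded voting DataFrame; all values are invented
df = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat"],
    "education": [1, 1, 0, 0],
    "satellite": [0, 0, 1, 1],
    "missile": [0, 1, 1, 1],
})

print(df.shape)   # (rows, columns), including the target column
print(df.dtypes)  # data type of each column

# Separate the target from the features
X = df.drop("party", axis=1)
y = df["party"]

# The number of features is the number of columns minus the target
print(X.shape[1], "features")
```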

# Exercise III: Visual EDA

The numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the `scatter_matrix()` function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as [Seaborn's](http://seaborn.pydata.org/generated/seaborn.countplot.html) `countplot`.

Given on the right is a `countplot` of the `'education'` bill, generated from the following code:

```
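import matplotlib.pyplot as plt
import seaborn as sns

# Count votes on the 'education' bill, split by party (df is pre-loaded)
plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0, 1], ['No', 'Yes'])
plt.show()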
```

In `sns.countplot()`, we specify the x-axis data to be `'education'`, and hue to be `'party'`. Recall that `'party'` is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the `'education'` bill, with each party colored differently. We manually specified the color to be `'RdBu'`, as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S. politics may be able to predict this without machine learning, but probably not instantaneously - and certainly not if we are dealing with hundreds of samples!
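
The bar heights in a countplot are simply counts per combination of vote and party, so the same information can be tabulated numerically. Here is a minimal sketch using `pd.crosstab` on a toy DataFrame. The six rows are invented for illustration and do not reflect the real voting records.

```python
import pandas as pd

# Toy data; the votes below are invented and do not come from the real dataset
df = pd.DataFrame({
    "party": ["republican"] * 3 + ["democrat"] * 3,
    "education": [1, 1, 0, 0, 0, 1],
})

# Rows: vote on the bill (0 = No, 1 = Yes); columns: party
counts = pd.crosstab(df["education"], df["party"])
print(counts)
```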

In the IPython Shell, explore the voting behavior further by generating countplots for the `'satellite'` and `'missile'` bills, and answer the following question: Of these two bills, which do Democrats vote resoundingly in favor of, compared to Republicans? Be sure to begin your plotting statements for each figure with `plt.figure()` so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.

<img src="image/2021-02-01-143554.svg" width=50%>

### Instructions

- `'satellite'`.
- `'missile'`.
- Both `'satellite'` and `'missile'`. (T)
- Neither `'satellite'` nor `'missile'`.