## Draw Classification

This interactive notebook demonstrates classification methods with user-generated data. The `drawdata` module lets you create a data set for classification by sketching the location and density of points. See [Draw Classification](https://apmonitor.com/pds/index.php/Main/DrawClassification) for additional instructions and course content. Google Colab does not allow access to the clipboard, so this notebook replaces the Draw Data package with sample data.

<img src='https://apmonitor.com/pds/uploads/Main/drawdata_colab.png' width=50%>

### Read Data

[Read the generated data](https://apmonitor.com/pds/index.php/Main/GatherData) with `pandas` and display a random sample of 5 rows. Each row has an `x` and `y` location and a label `z` that is `a`, `b`, `c`, or `d`.

In [None]:
import pandas as pd
url = 'http://apmonitor.com/pds/uploads/Main/drawdata.txt'
data = pd.read_csv(url)
data.sample(5)

### Describe Data

A [statistical overview](https://apmonitor.com/pds/index.php/Main/StatisticsMath) of the data reveals the number and distribution of points.

In [None]:
data.describe()

### Visualize Data

Create plots to view data distribution. A classifier creates boundaries to define regions for 2 or more labels. [Data visualization](https://apmonitor.com/pds/index.php/Main/VisualizeData) can give insights on how to build an effective classifier and [identify any data quality issues](https://apmonitor.com/pds/index.php/Main/CleanseData) such as outliers.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(data,hue='z')
plt.show()

### Encode Data

[Ordinal number encoding](https://apmonitor.com/pds/index.php/Main/FeatureEngineering) translates text labels (`a`, `b`, `c`, `d`) into numeric labels (`0`, `1`, `2`, `3`). One-hot encoding and feature hashing are two alternative encoding methods.

In [None]:
data['z'] = pd.factorize(data['z'])[0]
data.sample(10)
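
For comparison with the ordinal encoding, the sketch below applies one-hot encoding with `pd.get_dummies`. The small `demo` DataFrame is a hypothetical stand-in for the drawn data, not the notebook's actual data set.

```python
import pandas as pd

# hypothetical stand-in for the drawn data (labels a-d)
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                     'y': [4.0, 3.0, 2.0, 1.0],
                     'z': ['a', 'b', 'c', 'd']})

# ordinal encoding: one integer per label, in order of first appearance
ordinal = pd.factorize(demo['z'])[0]

# one-hot encoding: one indicator column per label
onehot = pd.get_dummies(demo['z'], prefix='z')

print(ordinal)
print(onehot)
```

Ordinal encoding keeps a single label column, which is what the classifiers in this notebook expect as a target. One-hot encoding is more commonly applied to categorical input features, where an artificial ordering of the integers could mislead a model.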

### Scale Data

Many classification methods work best with [scaled data](https://apmonitor.com/pds/index.php/Main/ScaleData). Only the features `x` and `y` are scaled while the label `z` is not scaled to preserve the integer values.

In [None]:
from sklearn.preprocessing import StandardScaler
s = StandardScaler() # mean=0, standard deviation=1
dxy = s.fit_transform(data[['x','y']])
# add scaled values to dataframe
data['xs'] = dxy[:,0]; data['ys'] = dxy[:,1]
data.sample(5)
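
To make the scaling step concrete, the sketch below checks that `StandardScaler` matches a manual calculation: subtract the column mean and divide by the column standard deviation. The array `X` holds hypothetical values standing in for the `x` and `y` columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical feature values standing in for the x and y columns
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

Xs = StandardScaler().fit_transform(X)

# same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0, the NumPy default)
Xm = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Xs, Xm))
print(Xs.mean(axis=0), Xs.std(axis=0))
```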

### Split Data

[Split data](https://apmonitor.com/pds/index.php/Main/SplitData) into train (80%) and test (20%) sets. The test set is held out to evaluate the fit created from the training data.

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, shuffle=True)
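
The split above is random, so the rows in each set change on every run. As a minimal sketch, the optional `random_state` and `stratify` arguments (not used in the original cell) make the split repeatable and keep the label proportions equal in both sets. The `demo` DataFrame is a hypothetical stand-in for the drawn data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in: 100 rows, 4 labels with 25 rows each
demo = pd.DataFrame({'x': range(100),
                     'y': range(100, 200),
                     'z': [0, 1, 2, 3] * 25})

# random_state fixes the shuffle; stratify keeps label proportions
train_d, test_d = train_test_split(demo, test_size=0.2, shuffle=True,
                                   random_state=0, stratify=demo['z'])

print(len(train_d), len(test_d))
print(test_d['z'].value_counts().sort_index())
```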

### Import Supervised Learning Classifier Packages

Use the following supervised learning classification methods:

- Logistic Regression
- Naïve Bayes
- Stochastic Gradient Descent
- K-Nearest Neighbors
- Decision Tree
- Random Forest
- Support Vector Classifier
- Deep Learning Neural Network

The [Scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) has additional information on classifiers.

In [None]:
# Import Supervised Learning Classifiers
from sklearn.linear_model import LogisticRegression # Logistic Regression
from sklearn.naive_bayes import GaussianNB # Naïve Bayes
from sklearn.linear_model import SGDClassifier # Stochastic Gradient Descent
from sklearn.neighbors import KNeighborsClassifier # K-Nearest Neighbors
from sklearn.tree import DecisionTreeClassifier # Decision Tree
from sklearn.ensemble import RandomForestClassifier # Random Forest
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.neural_network import MLPClassifier # Neural Network

In [None]:
# Initialize Classifiers
nb=GaussianNB()
lr=LogisticRegression(max_iter=1000)
sgd=SGDClassifier()
knn=KNeighborsClassifier()
dt=DecisionTreeClassifier()
rfm=RandomForestClassifier()
svm=SVC()
nn=MLPClassifier(max_iter=5000)

clsfrs = [[nb,'Naive Bayes'],
          [dt,'Decision Tree'],
          [knn,'K Nearest Neighbors'],
          [svm,'Support Vector Machine'],
          [lr,'Logistic Regression'],
          [sgd,'Stochastic Gradient Descent'],
          [rfm,'Random Forest Classifier'],
          [nn,'Neural Network']
         ]

### Train Classifiers

In [None]:
for clf, name in clsfrs:
    clf.fit(train[['x','y']],train['z'])

### Show Confusion Matrix Result

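A confusion matrix shows true positive, false positive, true negative, and false negative groups from the test set. With four labels, each row is a true label and each column is a predicted label, so the diagonal counts correct predictions and the off-diagonal cells count misclassifications. Generate a confusion matrix for each classifier.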

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
plt.figure(figsize=(14,7))
i = 0
for clf, name in clsfrs:
    i+=1
    ax = plt.subplot(2,4,i)
    ConfusionMatrixDisplay.from_estimator(clf,test[['x','y']],test['z'],ax=ax)
    acc = accuracy_score(test['z'],clf.predict(test[['x','y']]))
    print('{0}: {1:.1f}%'.format(name,acc*100))
    plt.title(name)
plt.savefig('confusion_matrix.png')
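
The link between a confusion matrix and the accuracy can be checked numerically. The sketch below uses hypothetical true and predicted labels, not results from the classifiers above, and computes the accuracy as the sum of the diagonal divided by the total count.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# hypothetical true and predicted labels for a 4-label problem
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]

# rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)

# correct predictions (diagonal) divided by all predictions
acc = np.trace(cm) / cm.sum()

print(cm)
print('{0:.1f}%'.format(acc * 100))
```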

### Activity

![activity](https://apmonitor.com/che263/uploads/Begin_Python/expert.png)

[AdaBoost (Adaptive Boosting)](https://apmonitor.com/pds/index.php/Main/AdaBoost) is a machine learning algorithm for classification. It is used as a supervisory layer to other classification algorithms such as neural networks, decision trees, and support vector machines. It combines weak classifiers as a weighted sum and adaptively refines the output to focus on the harder-to-classify cases.

```python
from sklearn.ensemble import AdaBoostClassifier
ab = AdaBoostClassifier()
ab.fit(train[['x','y']],train['z'])
# predict takes only the features; compare yP to test['z'] afterward
yP = ab.predict(test[['x','y']])
```
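
The weighted-sum idea can be sketched on synthetic data before working with the drawn data. The example below compares one weak classifier (a depth-1 decision tree) with an AdaBoost ensemble. The `make_blobs` data, the `n_estimators=50` setting, and the `random_state` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the drawn data: 4 clusters of 2D points
X, y = make_blobs(n_samples=400, centers=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# a single depth-1 tree (stump) is a weak classifier
stump = DecisionTreeClassifier(max_depth=1).fit(Xtr, ytr)

# AdaBoost combines many weighted weak classifiers
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

print('stump:    {0:.1f}%'.format(stump.score(Xte, yte) * 100))
print('adaboost: {0:.1f}%'.format(boost.score(Xte, yte) * 100))
print('weak learners used:', len(boost.estimators_))
```

A single stump makes only one split, so it can separate at most two of the four labels. The ensemble is not limited in that way because each added learner focuses on the points the earlier ones misclassified.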

Train an AdaBoost classifier with the drawn data. Show a confusion matrix and report the accuracy (%).