# HW1: Classification on simple datasets
## Goal:
1. Get familiar with the machine learning process.
2. Understand the applications of KNN, SVM, and Logistic Regression.
3. Try different parameters for each model, and determine the best one.
4. Finish the code.

## Datasets:
Two famous ML datasets:
- Iris dataset
 - The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant: Setosa (0), Versicolor (1), Virginica (2).
 - Each data instance contains four features:
    1. sepal length in cm
    2. sepal width in cm
    3. petal length in cm
    4. petal width in cm
- The Street View House Numbers (SVHN) Dataset
 - A database of color digit images cropped from house numbers in Google Street View images, containing 73257 digits for training, 26032 digits for testing, and 531131 additional samples.
 - Each sample is a 32x32 color image corresponding to a digit from 0-9 (in the .mat files, the digit 0 is stored with label 10).

## Method:
KNN, SVM, and Logistic Regression

## Submission:
 1. The final version of this file (rename it to HW1_yourName.ipynb)
 2. A simple report (.doc/.docx) that contains the information below
  - Results part
     - All the results with different parameters (in table format)
     - Screenshots of the learning curves
  - Discussion part
     - Can you find the best parameter?
     - Why is this parameter better than the others?


Upload these two files to Canvas separately. Do not compress them into a zip file.

## Grading:
- Total: 100 points
- For each dataset, each method is worth 10 points. (60 points total for both datasets)
- For each dataset, the discussion is worth 20 points. (40 points total for both datasets)

# 1. Iris dataset

In [None]:
# step one: import the needed packages
from sklearn import datasets
import pandas as pd
import numpy as np

In [None]:
iris_raw = datasets.load_iris() # load the iris dataset from the sklearn library
iris_raw.keys() 

## Checking the contents of the data 

Wondering what the raw data looks like? Try printing the variable 'iris_raw' in an empty code block.
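
Printing the whole object can be hard to read, so it may help to look at one field at a time. The sketch below is only a suggestion; it uses the dictionary-style keys that `load_iris()` returns and reloads the data so that it runs on its own.

```python
from sklearn import datasets

iris_raw = datasets.load_iris()  # Bunch object: behaves like a dictionary

print(iris_raw['data'].shape)     # feature matrix: (n_samples, n_features)
print(iris_raw['target'].shape)   # one integer label per sample
print(iris_raw['feature_names'])  # names of the four measurements
print(iris_raw['target_names'])   # species name for labels 0, 1, 2
print(iris_raw['data'][:5])       # first five rows of raw measurements
```

The `DESCR` key holds a longer text description of the dataset if you want more background.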

In [None]:
iris = pd.DataFrame(data = np.c_[iris_raw['data'], iris_raw['target']],
                    columns= iris_raw['feature_names']+['target']) # Convert raw data into an easy-to-read format
iris.head(10) # check the first 10 rows

In [None]:
# Add the name of the species corresponding to the target
species = []
for i in range(len(iris['target'])):
  if iris['target'][i] == 0:
    species.append('setosa')
  elif iris['target'][i] == 1:
    species.append('versicolor')
  else:
    species.append('virginica')
iris['species'] = species
iris.head(10)

### train/test data split

In [None]:
from sklearn.model_selection import train_test_split # import the package for train/test splitting
X = iris.drop(['target', 'species'], axis=1) # or X = iris_raw['data']
y = iris['target'] 
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.5, random_state=42)
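
It can be worth checking what the split actually produced. The sketch below repeats the same call on a fresh copy of the data and counts the samples per class in the training half. The `stratify` variant is an optional extra, not part of the assignment; the names `X_demo`, `y_demo`, and the `_s` suffixes are illustrative only.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

X_demo, y_demo = datasets.load_iris(return_X_y=True)

# Same settings as the notebook: half of the data is held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42)
print(X_tr.shape, X_te.shape)
print('plain split     :', sorted(Counter(y_tr.tolist()).items()))

# Optional variant: stratify keeps the class proportions equal in both halves.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42, stratify=y_demo)
print('stratified split:', sorted(Counter(y_tr_s.tolist()).items()))
```

With only 150 samples, an unbalanced split can shift the reported scores slightly, which is worth keeping in mind when comparing parameters.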

# Model training and testing

## 1) KNN

- Try different parameters of "n_neighbors"

### training

In [None]:
from sklearn.neighbors import KNeighborsClassifier # import the package for KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train) # training

### testing

In [None]:
# testing
knn_predictions = knn.predict(X_test)
knn_predictions

### printing the results

In [None]:
# printing the results
from sklearn import metrics
print('Precision, recall, and F1-score on the test set\n')

print(metrics.classification_report(y_test, knn_predictions, digits=3))
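
The classification report lists precision, recall, and F1-score per class. A confusion matrix is a complementary view that shows which classes get mixed up. The following is a minimal sketch that rebuilds the same split and KNN model so it can run on its own; the names `X_demo`, `y_demo`, `knn_demo`, and `cm` are illustrative.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42)

knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(X_tr, y_tr)
preds = knn_demo.predict(X_te)

# Rows are the true classes, columns are the predicted classes.
cm = metrics.confusion_matrix(y_te, preds)
print(cm)
print('accuracy:', metrics.accuracy_score(y_te, preds))
```

Off-diagonal entries count the misclassified samples, so a good model has most of its mass on the diagonal.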


### plotting the learning curve

In [None]:
# Use this block as a black box for plotting the learning curve
# Feel free to dive into the code and figure out how it works
# There's no need to copy and paste this block again.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    scoring=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    """
    Plot the learning curve: the training score and the cross-validation
    score as a function of the number of training examples.

    Parameters
    ----------
    estimator : estimator instance
        An estimator instance implementing `fit` and `predict` methods which
        will be cloned for each validation.

    title : str
        Title for the chart.

    X : array-like of shape (n_samples, n_features)
        Training vector, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    y : array-like of shape (n_samples) or (n_samples, n_features)
        Target relative to ``X`` for classification or regression;
        None for unsupervised learning.

    axes : matplotlib.axes.Axes, default=None
        A single Axes object to draw the curves on, e.g. from ``plt.subplots()``.

    ylim : tuple of shape (2,), default=None
        Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).

    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    scoring : str or callable, default=None
        A str (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)``.

    train_sizes : array-like of shape (n_ticks,)
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the ``dtype`` is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be within
        (0, 1]. Otherwise it is interpreted as absolute sizes of the training
        sets. Note that for classification the number of samples usually have
        to be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """

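    if axes is None:
        # Create a figure when the caller does not pass an Axes object
        _, axes = plt.subplots()
    axes.set_title(title)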
    if ylim is not None:
        axes.set_ylim(*ylim)
    axes.set_xlabel("Training examples")
    axes.set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        scoring=scoring,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes.grid()
    axes.fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes.fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes.plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes.plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes.legend(loc="best")
    return plt
  


In [None]:
fig, axes = plt.subplots()
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
plot_learning_curve(knn, 'Learning curve for KNN', X, y, axes=axes, ylim=(0.7, 1.01), cv=cv, n_jobs=4)
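
To compare several values of "n_neighbors" as asked at the start of this section, one option is to loop over candidate values and collect the test accuracy for each. This is only a sketch under a few assumptions: the candidate values are arbitrary, accuracy is just one possible metric, and the names `X_demo`, `y_demo`, and `knn_results` are illustrative.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42)

knn_results = {}
for k in (1, 3, 5, 7, 9):  # arbitrary candidate values
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    knn_results[k] = metrics.accuracy_score(y_te, model.predict(X_te))

for k, acc in knn_results.items():
    print(f'n_neighbors={k}: test accuracy={acc:.3f}')
```

A dictionary like this converts easily into the results table requested for the report.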

## 2) SVM

- Finish the code
- Try different values of the "kernel" parameter, and record the results
- For more information, refer to https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

### training

In [None]:
from sklearn.svm import SVC
clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
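
One possible way to compare kernels is to fit one `SVC` per kernel name and record the test accuracy. The sketch below is not the required solution; it assumes the four built-in string kernels and default values for every other parameter, and the names `X_demo`, `y_demo`, and `svm_results` are illustrative.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42)

svm_results = {}
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = SVC(kernel=kernel)  # all other parameters left at their defaults
    model.fit(X_tr, y_tr)
    svm_results[kernel] = metrics.accuracy_score(y_te, model.predict(X_te))

for kernel, acc in svm_results.items():
    print(f'kernel={kernel}: test accuracy={acc:.3f}')
```

Some kernels are sensitive to feature scaling and to parameters such as `C` and `gamma`, so a weak score with the defaults does not necessarily mean the kernel is unsuitable.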

## 3) Logistic regression
- Finish the code
- Try different values of the "penalty" parameter (the regularization method), and record the results
- For more information, refer to https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(penalty='l1', solver='liblinear')
log_reg.fit(X_train, y_train)
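
A comparison of penalties could look like the sketch below. Note the assumptions: it swaps in the `saga` solver instead of `liblinear`, because `saga` accepts both `'l1'` and `'l2'` and fits a single multinomial model; it also standardizes the features and raises `max_iter` to help `saga` converge. The names `X_demo`, `y_demo`, and `lr_results` are illustrative.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42)

lr_results = {}
for penalty in ('l1', 'l2'):
    # Scaling is fitted on the training half only, then applied to the test half.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, solver='saga', max_iter=5000),
    )
    model.fit(X_tr, y_tr)
    lr_results[penalty] = metrics.accuracy_score(y_te, model.predict(X_te))

for penalty, acc in lr_results.items():
    print(f'penalty={penalty}: test accuracy={acc:.3f}')
```

Not every solver supports every penalty, so check the documentation page linked above before changing either one.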

# 2. SVHN Dataset
- Download the data (train_32x32.mat and test_32x32.mat) from http://ufldl.stanford.edu/housenumbers/
- Finish the code, and record the results

In [None]:
import scipy.io as sio
# loading data
train_data = sio.loadmat('train_32x32.mat')
test_data = sio.loadmat('test_32x32.mat')

# training set
length = 5000 # We use the first 5000 instances here for simplicity. Feel free to use more instances if you don't mind longer training sessions.
X_train = np.zeros([length, 1024])
y_train = np.zeros(length) # 1-D label vector, the shape scikit-learn expects
for i in range(length):
    data = np.mean(train_data['X'][:,:,:,i], axis=2) # convert the RGB image to grayscale by averaging the 3 color channels
    X_train[i] = data.flatten() # flatten the 32x32 image into a 1024-dimensional vector
    y_train[i] = train_data['y'][i, 0]
  
# testing set


## Checking the contents of the data

In [None]:
# show sample
# try to run this block multiple times
import random
image_ind = random.randrange(length) # pick an index in 0..length-1, i.e. among the instances loaded above
plt.imshow(train_data['X'][:,:,:,image_ind])
plt.show()


y_train[image_ind]
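
The per-image loop used above to build `X_train` can also be written without a loop. The sketch below checks this on synthetic data, so it runs without the downloaded files; `fake_X` and `fake_y` are made-up stand-ins that only mimic the array layout of `train_data['X']` (height, width, channel, sample) and `train_data['y']` (one column of labels).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # small synthetic batch
fake_X = rng.integers(0, 256, size=(32, 32, 3, n), dtype=np.uint8)
fake_y = rng.integers(1, 11, size=(n, 1))  # labels stored as a column

# Loop version, mirroring the notebook cell
X_loop = np.zeros([n, 1024])
for i in range(n):
    X_loop[i] = np.mean(fake_X[:, :, :, i], axis=2).flatten()

# Vectorized version
gray = fake_X.mean(axis=2)                       # average channels: (32, 32, n)
X_vec = np.moveaxis(gray, -1, 0).reshape(n, -1)  # sample axis first: (n, 1024)
y_vec = fake_y.ravel()                           # 1-D label vector: (n,)

print(X_vec.shape, y_vec.shape)
print(np.allclose(X_loop, X_vec))
```

The same two lines should apply to the real arrays after slicing the first `length` samples, which also makes it cheap to try a larger subset.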

## 1) KNN

## 2) SVM

## 3) Logistic Regression