# Scikit-Learn Basics

https://scikit-learn.org/stable/


### Tutorial and nice resources

https://scikit-learn.org/stable/tutorial/index.html

https://stackoverflow.com/


### Nice sources for Datasets

https://archive.ics.uci.edu/ml/index.php (datasets, datasets, all kinds of datasets)

https://www.kaggle.com/ (earn some money while challenging your skill?)

https://www.google.com/

## Quick Review

### Types of machine learning problems?
* 
* 

### Types of data?
* 
* 
* 
* 


In [None]:
# step -1: apply your domain knowledge, understand the dataset and problem
# learn about different models, choose the ones that make sense and try

# step 0: load useful python libraries/packages
import numpy as np
import pandas as pd
import pickle # used to save/load model
from sklearn import some_model # load the model definition from scikit-learn

# Prepare DATA (for both training and test)
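X_train = ...  # fill in: training features
y_train = ...  # fill in: training targets
X_test = ...   # fill in: test features
y_test = ...   # fill in: test targets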

# The shape of each X should be (num_samples, num_features)
# The shape of each y should be (num_samples, 1) or (num_samples,)

###############################################
################ The ML part ##################
###############################################
# Initialize model
model = some_model()

# fit/train the model
model.fit(X_train,y_train)

# test/predict with the trained model
y_pred = model.predict(X_test)

# evaluate model
some_error_metric(y_test, y_pred) # you need to define this error metric
################################################



# save model
with open('my_model.pkl', 'wb') as f:
    pickle.dump(model, f)    

# load it again
with open('my_model.pkl', 'rb') as f:
    model_loaded = pickle.load(f)
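
The template above can be made concrete as follows. This is only a minimal sketch under stated assumptions: the synthetic data, the choice of `LinearRegression`, the use of `mean_squared_error` as the metric, and the temporary directory are all illustrative choices, not part of the original template.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration): y = 4*x0 - 2*x1 + 1 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))  # shape (num_samples, num_features)
y = 4 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(0, 0.1, size=200)

# Hold out the last 50 samples as the test set
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Initialize, fit, predict, evaluate
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print("test MSE:", test_mse)

# Save the model into a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "my_model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        model_loaded = pickle.load(f)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model_loaded.predict(X_test), y_pred)
print("reloaded model reproduces predictions:", same_predictions)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.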

### Error metric: Mean Squared Error
# $\frac{1}{n} \sum (y_{actual} - y_{pred})^2$

### Error metric: Mean Absolute Percentage Error
# $\frac{1}{n} \sum \frac{|y_{actual} - y_{pred}|}{y_{actual}} \times 100 \%$
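
As a small worked example of the two formulas, the sketch below computes both metrics by hand on three made-up values and compares them with scikit-learn's built-in versions. The numbers are hypothetical. Note that `mean_absolute_percentage_error` (available in scikit-learn 0.24 and later) returns a fraction, so it is multiplied by 100 here to get a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Three made-up actual/predicted pairs (hypothetical values)
y_actual = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# MSE: (10^2 + 20^2 + 0^2) / 3 = 500 / 3
mse_manual = np.mean((y_actual - y_pred) ** 2)

# MAPE: (10/100 + 20/200 + 0/400) / 3 = 0.2 / 3, times 100 for percent
mape_manual = np.mean(np.abs(y_actual - y_pred) / np.abs(y_actual)) * 100

# scikit-learn equivalents (MAPE is returned as a fraction, not a percentage)
mse_sklearn = mean_squared_error(y_actual, y_pred)
mape_sklearn = mean_absolute_percentage_error(y_actual, y_pred) * 100

print("MSE :", mse_manual, mse_sklearn)
print("MAPE:", mape_manual, mape_sklearn)
```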

### Visualization: Parity Plot

## Let's try it with a simple example
Fit $y = 3x^2 + 2x + 3$ on the interval $x \in [0, 100]$ with a linear model, using as input features

1. $x$ only
2. $x$ and $x^2$

Use 101 evenly spaced data points as both the training and the test set.

### Helper functions

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

# Two Types of Error Metrics

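def mape(y, y_pred):
    # Flatten both arrays so (n, 1) and (n,) inputs do not broadcast to (n, n)
    y, y_pred = np.ravel(y), np.ravel(y_pred)
    result = np.mean(np.abs(y - y_pred) / np.abs(y))  # undefined where y == 0
    print("The mape is {}%".format(result * 100))
    return result

def mse(y, y_pred):
    # Flatten both arrays for the same reason as in mape
    y, y_pred = np.ravel(y), np.ravel(y_pred)
    result = np.mean((y - y_pred) ** 2)
    print("The mse is {}".format(result))
    return result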


# Parity Plot

def parity_plot(y,y_pred):
    plt.figure(figsize=(10,10))
    plt.scatter(y,y_pred )
    plt.plot([min(y), max(y)],[min(y), max(y)], color='red')
    plt.xlabel("y")
    plt.ylabel("predicted y")
    plt.show()  
    return

def plot(x,y,y_pred):
    plt.figure(figsize=(10,10))
    plt.scatter(x,y, color = "red", label = "actual y" )
    plt.scatter(x,y_pred, color = "blue", label = "predicted y" )
    
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()  
    return


In [None]:
# Preparing data

X_train = np.linspace(0, 100, 101).reshape(101,1)
y_train = 3 * X_train ** 2 + 2 * X_train + 3
X_test = X_train
y_test = y_train
print(X_train.shape)
print(y_train.shape)
X_train

In [None]:
# linear_model for 1)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

mse(y_test, y_predict)
plot(X_test, y_test, y_predict)

In [None]:
X_train_2 = np.transpose([np.linspace(0, 100, 101), np.linspace(0, 100, 101) **2])
X_test_2 = X_train_2

print(X_train_2.shape)
X_train_2

In [None]:
# linear_model for 2)
model = LinearRegression()
model.fit(X_train_2, y_train)
y_predict = model.predict(X_test_2)

mse(y_test, y_predict)
plot(X_test, y_test, y_predict)
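
The $x^2$ column does not have to be built by hand. As an alternative sketch (not the approach used in the cells above), `PolynomialFeatures` can generate it inside a pipeline. The variable names `x`, `y`, `poly_model`, and `linreg` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same toy problem: y = 3x^2 + 2x + 3 on 101 evenly spaced points in [0, 100]
x = np.linspace(0, 100, 101).reshape(-1, 1)
y = (3 * x ** 2 + 2 * x + 3).ravel()

# degree=2 with include_bias=False expands x into the two columns [x, x^2]
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x, y)

# make_pipeline names each step after its lowercased class name
linreg = poly_model.named_steps["linearregression"]
print("coefficients for [x, x^2]:", linreg.coef_)
print("intercept:", linreg.intercept_)
```

Because the data are noise-free and the model contains the true terms, the fitted coefficients should come out very close to 2 and 3, with an intercept close to 3.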

### save and reload the model

In [None]:
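# save the model
with open('my_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# load it again
with open('my_model.pkl', 'rb') as f:
    model_loaded = pickle.load(f)
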

# test loaded model
y_predict = model_loaded.predict(X_test_2)
mse(y_test, y_predict)
plot(X_test, y_test, y_predict)

## Regression Challenge

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

* id: a notation for a house
* date: Date house was sold
* price: sale price of the house (this is the prediction TARGET)
* bedrooms: Number of Bedrooms/House
* bathrooms: Number of bathrooms/House
* sqft_living: square footage of the home
* sqft_lot: square footage of the lot
* floors: Total floors (levels) in house
* waterfront: whether the house has a view of a waterfront
* view: rating of how good the view from the property is (0 to 4); the original dataset description reads "Has been viewed"
* condition: overall condition of the house
* grade: overall grade given to the housing unit, based on King County grading system
* sqft_above: square footage of house apart from basement
* sqft_basement: square footage of the basement
* yr_built: Built Year
* yr_renovated: Year when house was renovated
* zipcode: ZIP code
* lat: Latitude coordinate
* long: Longitude coordinate
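* sqft_living15: average interior living area (sq ft) of the 15 nearest neighboring houses (the original dataset description calls this "living room area in 2015")
* sqft_lot15: average lot size (sq ft) of the 15 nearest neighboring houses (the original dataset description calls this "lot size area in 2015")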

### Error metric: Mean Absolute Percentage Error
# $\frac{1}{n} \sum \frac{|y_{actual} - y_{pred}|}{y_{actual}} \times 100 \%$

### my best: mape = 25.9%  (using first 10000 data points as training set, the rest as test set)

### Loading original dataset

In [None]:
data = pd.read_csv('kc_house_data.csv')
data

### dataset preprocess and train_test_split

In [None]:
# get the target
price = np.array(data["price"])

# choose the columns to be used in prediction
column_selection = []

selected_feature = np.array(data[column_selection])



In [None]:
# test/train split

selected_feature_train = selected_feature[:10000]
price_train = price[:10000]
selected_feature_test = selected_feature[10000:]
price_test = price[10000:]
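
Taking the first 10000 rows as the training set implicitly assumes that the row order carries no information about the price. A common alternative is a shuffled split with `train_test_split`. The sketch below uses small synthetic arrays (`features` and `target` are hypothetical stand-ins for `selected_feature` and `price`) so that it runs without the CSV file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for selected_feature and price (hypothetical data)
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))
target = rng.uniform(100_000, 900_000, size=1000)

# Shuffle the rows, then put half in the training set and half in the test set;
# random_state makes the split reproducible
feat_train, feat_test, target_train, target_test = train_test_split(
    features, target, train_size=0.5, shuffle=True, random_state=0
)

print(feat_train.shape, feat_test.shape)
print(target_train.shape, target_test.shape)
```

Results from a shuffled split are not directly comparable with a MAPE obtained on the fixed first-10000 split, so the same split should be used when comparing models.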

### Let's do some regression
Choose any models from here: https://scikit-learn.org/stable/supervised_learning.html

In [None]:
# Linear Regression


In [None]:
# Random Forest Regression



In [None]:
# Random Forest Regression with different parameter

In [None]:
# performance on the training set?

price_predict = model.predict(selected_feature_train)
mape(price_train, price_predict)
parity_plot(price_train, price_predict)
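
Comparing the error on the training set with the error on the test set is a quick check for overfitting. The sketch below illustrates this with a random forest on synthetic data. The data-generating formula, the noise level, and the names `forest`, `mape_train`, and `mape_test` are assumptions made for illustration; they are not taken from the housing dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error

# Synthetic positive, nonlinear target with multiplicative noise (hypothetical)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = (50 + 10 * X[:, 0] ** 2 + 20 * X[:, 1]) * rng.lognormal(0, 0.2, size=600)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# mean_absolute_percentage_error returns a fraction; multiply by 100 for percent
mape_train = mean_absolute_percentage_error(y_train, forest.predict(X_train)) * 100
mape_test = mean_absolute_percentage_error(y_test, forest.predict(X_test)) * 100

print("training MAPE: {:.1f}%".format(mape_train))
print("test MAPE:     {:.1f}%".format(mape_test))
```

A training error that is much lower than the test error suggests the model has partly memorized the training data. Parameters such as `max_depth` or `min_samples_leaf` are typical knobs to try when experimenting with different settings.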

In [None]:
# Multi-layer perceptron (a.k.a. neural network!)
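
Multi-layer perceptrons are sensitive to the scale of the input features, so a scaler is often placed in front of the network. The following is a minimal sketch on synthetic data, not a tuned solution for the housing challenge. The layer sizes, `max_iter`, and the toy target are all illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical): y = sin(x0) + 0.5 * x1
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Standardize the features, then fit a small two-hidden-layer network
mlp_model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
)
mlp_model.fit(X[:400], y[:400])

# score() returns the R^2 coefficient on the held-out samples
r2_test = mlp_model.score(X[400:], y[400:])
print("test R^2:", r2_test)
```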



## Classification 

Linear Discriminant Analysis

https://scikit-learn.org/stable/auto_examples/classification/plot_lda_qda.html#sphx-glr-auto-examples-classification-plot-lda-qda-py

### Helper Functions

In [None]:
# #############################################################################
# Generate datasets
from matplotlib import colors
from matplotlib.colors import ListedColormap

# Colormap
cmap = colors.LinearSegmentedColormap(
    'red_blue_classes',
    {'red': [(0, 1, 1), (1, 0.7, 0.7)],
     'green': [(0, 0.7, 0.7), (1, 0.7, 0.7)],
     'blue': [(0, 0.7, 0.7), (1, 1, 1)]})
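# plt.cm.register_cmap was deprecated in Matplotlib 3.7 and later removed;
# the colormap registry used below requires a recent Matplotlib (3.6 or later)
import matplotlib
matplotlib.colormaps.register(cmap, force=True)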


def dataset_fixed_cov():
    '''Generate 2 Gaussians samples with the same covariance matrix'''
    n, dim = 300, 2
    np.random.seed(0)
    C = np.array([[0., -0.23], [0.83, .23]])
    X = np.r_[np.dot(np.random.randn(n, dim), C),
              np.dot(np.random.randn(n, dim), C) + np.array([1, 1])]
    y = np.hstack((np.zeros(n), np.ones(n)))
    return X, y


def dataset_cov():
    '''Generate 2 Gaussians samples with different covariance matrices'''
    n, dim = 300, 2
    np.random.seed(0)
    C = np.array([[0., -1.], [2.5, .7]]) * 2.
    X = np.r_[np.dot(np.random.randn(n, dim), C),
              np.dot(np.random.randn(n, dim), C.T) + np.array([1, 4])]
    y = np.hstack((np.zeros(n), np.ones(n)))
    return X, y

In [None]:
def classification_accuracy(y,y_pred):
    result = (y == y_pred).sum() / len(y)
    print("The classification accuracy is {}".format(result))
    return result


def plot_data(X,y):
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
    plt.figure(figsize=(10,10))
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.show()
    
    return

def plot_result(model,X,y):
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(10,10))
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.show()
    return

In [None]:
X,y = dataset_fixed_cov()
plot_data(X,y)

In [None]:
# LDA classification

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(X, y)
y_predict = model.predict(X)

classification_accuracy(y,y_predict)
plot_result(model, X, y)

In [None]:
# different dataset
X,y = dataset_cov()
plot_data(X,y)
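
When the two classes have different covariance matrices, a linear boundary is no longer the natural fit, and the scikit-learn example linked above contrasts LDA with Quadratic Discriminant Analysis (QDA) for this case. The sketch below repeats the construction of `dataset_cov()` so that it is self-contained, and compares the training accuracy of both models. The names `lda_acc` and `qda_acc` are introduced here for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Same construction as dataset_cov() above, repeated to keep this block self-contained
n, dim = 300, 2
np.random.seed(0)
C = np.array([[0., -1.], [2.5, .7]]) * 2.
X = np.r_[np.dot(np.random.randn(n, dim), C),
          np.dot(np.random.randn(n, dim), C.T) + np.array([1, 4])]
y = np.hstack((np.zeros(n), np.ones(n)))

# score() returns the mean accuracy; here it is measured on the training data
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)

print("LDA training accuracy:", lda_acc)
print("QDA training accuracy:", qda_acc)
```

Either fitted model can be passed to `plot_result` from the helper functions to visualize its decision boundary.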

## Classification with the Iris Dataset
This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. 

Predicted attribute: class of iris plant. 

This is an exceedingly simple domain. 

This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick '@' espeedaz.net ). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.


Attribute Information:

1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 
5. class:
   * Iris Setosa
   * Iris Versicolour
   * Iris Virginica

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
plot_data(X, y)
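
The cell above keeps only the first two features so that the data can be plotted in 2D. As a possible next step, the sketch below trains a classifier on all four features and evaluates it on a held-out test set. The choice of `LinearDiscriminantAnalysis`, the 70/30 stratified split, and the names `clf` and `iris_acc` are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Stratified 70/30 split keeps the three classes balanced in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)

# score() returns the fraction of correctly classified test samples
iris_acc = clf.score(X_test, y_test)
print("test accuracy:", iris_acc)
```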