## MACHINE LEARNING
* Supervised Learning
* Semi-supervised Learning
* Unsupervised Learning
* Reinforcement Learning

## Supervised Learning
* Classification
* Regression

## Regression

### Linear Regression

In [1]:
from sklearn.datasets import load_diabetes

In [3]:
diabetes = load_diabetes()

In [6]:
X = diabetes['data']
y = diabetes['target']

In [7]:
print(diabetes['DESCR'])

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra
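
The scaling note in the description can be checked directly. The sketch below reloads the dataset and tests that each feature column has mean close to 0 and a sum of squares close to 1. The names `X_check`, `means_ok`, and `squares_ok` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Reload the features under a separate name so the notebook's X is untouched.
X_check = load_diabetes()["data"]

col_means = X_check.mean(axis=0)          # expected to be ~0 for every column
col_sq_sums = (X_check ** 2).sum(axis=0)  # expected to be ~1 for every column

means_ok = np.allclose(col_means, 0.0, atol=1e-6)
squares_ok = np.allclose(col_sq_sums, 1.0)
print(means_ok, squares_ok)
```

Because every column is squeezed into such a small range, the raw feature values in the table below are all well under 1 in magnitude.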

In [8]:
import pandas as pd

In [9]:
pd.DataFrame(X, columns=diabetes['feature_names'])

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
| 5 | -0.092695 | -0.044642 | -0.040696 | -0.019442 | -0.068991 | -0.079288 | 0.041277 | -0.076395 | -0.041180 | -0.096346 |
| 6 | -0.045472 | 0.050680 | -0.047163 | -0.015999 | -0.040096 | -0.024800 | 0.000779 | -0.039493 | -0.062913 | -0.038357 |
| 7 | 0.063504 | 0.050680 | -0.001895 | 0.066630 | 0.090620 | 0.108914 | 0.022869 | 0.017703 | -0.035817 | 0.003064 |
| 8 | 0.041708 | 0.050680 | 0.061696 | -0.040099 | -0.013953 | 0.006202 | -0.028674 | -0.002592 | -0.014956 | 0.011349 |
| 9 | -0.070900 | -0.044642 | 0.039062 | -0.033214 | -0.012577 | -0.034508 | -0.024993 | -0.002592 | 0.067736 | -0.013504 |


In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)
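
The split above is unseeded, so the rows that land in the test set (and every RMSE reported later) will differ from run to run. A minimal sketch of a reproducible split follows, using stand-in arrays with the same number of rows as the diabetes data. The seed `42` and the `*_demo` names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 442 rows, matching the diabetes dataset's row count.
X_demo = np.arange(442 * 2).reshape(442, 2)
y_demo = np.arange(442)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(X_tr.shape, X_te.shape, same_split)
```

With `test_size=0.2` and 442 rows, 89 rows go to the test set and 353 to the training set.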

#### Linear Regression Model

In [12]:
from sklearn.linear_model import LinearRegression

In [13]:
lin_reg = LinearRegression()

In [14]:
lin_reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
y_pred = lin_reg.predict(X_test)

#### Compare Predictions with Actual Targets

In [18]:
from sklearn.metrics import mean_squared_error
import numpy as np

In [19]:
# mean_squared_error expects (y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [20]:
rmse

52.24750574813371
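
RMSE is the square root of the mean squared difference between targets and predictions. As a rough illustration, it can be computed by hand with NumPy. The helper `rmse_manual` and the toy arrays are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

def rmse_manual(y_true, y_hat):
    """Root mean squared error: sqrt(mean((y_true - y_hat) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0, -2, so the squared errors are 1, 0, 4 and their mean is 5/3.
value = rmse_manual(toy_true, toy_pred)
# Squaring removes the sign, so swapping the arguments gives the same number.
swapped = rmse_manual(toy_pred, toy_true)
print(value, swapped)
```

RMSE is expressed in the same units as the target, which makes it easier to read than the raw mean squared error.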

### Polynomial Regression

In [21]:
from sklearn.preprocessing import PolynomialFeatures

In [43]:
poly = PolynomialFeatures(degree=10)

In [44]:
X_train_poly = poly.fit_transform(X_train)
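
It helps to know how many columns the expansion produces. Assuming the `PolynomialFeatures` defaults (`include_bias=True`, `interaction_only=False`), the number of output features for `n` inputs and degree `d` is the binomial coefficient C(n + d, d). The helper name below is hypothetical.

```python
from math import comb

def n_polynomial_features(n_features, degree):
    """Count of monomials of total degree <= degree in n_features variables,
    including the constant (bias) term."""
    return comb(n_features + degree, degree)

# The diabetes data has 10 input features.
counts = {d: n_polynomial_features(10, d) for d in (1, 2, 3, 10)}
for d, count in counts.items():
    print(f"degree={d:>2}: {count} features")
```

For 10 inputs, degree 2 gives 66 columns, while degree 10 gives 184,756 columns, far more than the number of training rows. A model with that many free parameters can fit the training set exactly and generalize poorly.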

In [45]:
poly_reg = LinearRegression()

In [46]:
poly_reg.fit(X_train_poly, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [47]:
# Reuse the transformer fitted on the training data; do not refit on the test set
X_test_poly = poly.transform(X_test)

In [48]:
y_pred_poly = poly_reg.predict(X_test_poly)

In [49]:
rmse_poly = np.sqrt(mean_squared_error(y_test, y_pred_poly))

In [50]:
rmse_poly

410.2026760481282
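
An RMSE this much worse than plain linear regression is what heavy overfitting tends to look like. One way to experiment with lower degrees is to wrap the expansion and the model in a pipeline, so the test data only ever passes through `transform`. The sketch below uses synthetic data with a known quadratic relationship, not the diabetes data. The data, the degree of 2, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data whose target is an exact degree-2 polynomial.
rng = np.random.default_rng(0)
X_syn = rng.uniform(-1.0, 1.0, size=(200, 2))
y_syn = 1.0 + 2.0 * X_syn[:, 0] + 3.0 * X_syn[:, 0] * X_syn[:, 1] - X_syn[:, 1] ** 2

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# The pipeline fits the expansion on the training data only and
# applies the same fitted expansion when predicting.
poly_pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_pipe.fit(Xs_train, ys_train)

rmse_syn = np.sqrt(mean_squared_error(ys_test, poly_pipe.predict(Xs_test)))
print(rmse_syn)
```

Because the degree matches the true relationship here, the test error is essentially zero. On real data the degree would normally be chosen by validation rather than fixed in advance.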

### Stochastic Gradient Descent Regressor

In [51]:
from sklearn.linear_model import SGDRegressor

In [55]:
# np.infty was removed in NumPy 2.0; np.inf is the supported spelling
sgd_reg = SGDRegressor(tol=np.inf)

In [56]:
sgd_reg.fit(X_train, y_train)

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=inf, validation_fraction=0.1, verbose=0,
             warm_start=False)

In [57]:
y_pred = sgd_reg.predict(X_test)

In [59]:
np.sqrt(mean_squared_error(y_test, y_pred))

78.3039697199476
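
`SGDRegressor` fits a linear model by updating the weights one sample at a time instead of solving for them in closed form. The sketch below is a deliberately simplified version of that idea and is not scikit-learn's implementation: it assumes a constant learning rate, squared-error loss, and no regularization, whereas the estimator above uses an `invscaling` schedule and an L2 penalty. The function name and toy data are hypothetical.

```python
import numpy as np

def sgd_linear_fit(X, y, lr=0.05, n_epochs=50, seed=0):
    """Fit y ~ X @ w + b with plain stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        # Visit the samples in a fresh random order each epoch.
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]   # prediction minus target
            w -= lr * error * X[i]        # gradient of 0.5 * error**2 w.r.t. w
            b -= lr * error               # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Noise-free toy data generated from known coefficients.
rng = np.random.default_rng(1)
X_toy = rng.uniform(-1.0, 1.0, size=(200, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

w_hat, b_hat = sgd_linear_fit(X_toy, y_toy)
print(w_hat, b_hat)
```

On this toy problem the recovered weights approach the true values (2, -1) and intercept 3. How quickly SGD converges depends on the learning rate, the number of passes, and the scale of the features.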

In [60]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVR
from sklearn.svm import SVR

In [70]:
dt_reg = DecisionTreeRegressor()
rf_reg = RandomForestRegressor(n_estimators=100)
lin_svr_reg = LinearSVR()
svr_reg = SVR(gamma='auto')

In [71]:
rmse_scores = []
# Fit each regressor on the same split and record its test RMSE
for reg in (dt_reg, rf_reg, lin_svr_reg, svr_reg):
    reg.fit(X_train, y_train)
    pred = reg.predict(X_test)
    rmse_scores.append(np.sqrt(
        mean_squared_error(y_test, pred)
    ))

In [72]:
rmse_scores

[75.11414534866701, 56.77228282030954, 84.2271753780354, 78.11781686312735]
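
A bare list of scores is easy to misread, so it can help to pair each score with its model and sort. The sketch below simply collects the RMSE values recorded in the outputs above. It assumes all of them were computed on the same train/test split, and the dictionary keys are labels chosen here for readability.

```python
# RMSE values copied from the recorded cell outputs in this notebook.
recorded_rmse = {
    "LinearRegression": 52.24750574813371,
    "PolynomialFeatures(degree=10) + LinearRegression": 410.2026760481282,
    "SGDRegressor": 78.3039697199476,
    "DecisionTreeRegressor": 75.11414534866701,
    "RandomForestRegressor": 56.77228282030954,
    "LinearSVR": 84.2271753780354,
    "SVR": 78.11781686312735,
}

# Lower RMSE is better, so sort ascending by score.
ranking = sorted(recorded_rmse.items(), key=lambda item: item[1])
best_name, best_rmse = ranking[0]

for name, score in ranking:
    print(f"{score:8.2f}  {name}")
```

On these recorded numbers, plain linear regression has the lowest test RMSE, with the random forest close behind. Since the split was not seeded, the ordering may shift on a rerun.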

### Problem Statement 1
* Split the data
* Use the following regressors:
    * LinearRegression, SGDRegressor, RandomForestRegressor, DecisionTreeRegressor, LinearSVR, SVR
* Determine which regressor gives the lowest RMSE on the test data
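
One possible way to approach these steps is sketched below. It is a starting point under stated assumptions, not a reference solution: the `make_regression` arguments, the seeds, `max_iter=10000` for `LinearSVR`, and all variable names are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical dataset settings; the exercise does not prescribe them.
X_ps, y_ps = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Step 1: split the data.
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_ps, y_ps, test_size=0.2, random_state=0
)

# Step 2: the six regressors named in the problem statement.
candidates = {
    "LinearRegression": LinearRegression(),
    "SGDRegressor": SGDRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "LinearSVR": LinearSVR(random_state=0, max_iter=10000),
    "SVR": SVR(gamma="auto"),
}

# Step 3: fit each model and record its RMSE on the held-out test data.
test_rmse = {}
for name, model in candidates.items():
    model.fit(Xp_train, yp_train)
    pred = model.predict(Xp_test)
    test_rmse[name] = float(np.sqrt(mean_squared_error(yp_test, pred)))

for name, score in sorted(test_rmse.items(), key=lambda item: item[1]):
    print(f"{score:8.2f}  {name}")
```

A single split gives a noisy estimate, so close scores should not be over-interpreted. Repeating the comparison with several seeds or with cross-validation would give a steadier ranking.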

In [73]:
from sklearn.datasets import make_regression

In [74]:
data = make_regression()

In [75]:
# make_regression returns a (features, target) tuple
X, y = data
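
Called with no arguments, `make_regression` generates 100 samples with 100 features and no noise, using a different random draw each time. The sketch below shows the default shapes next to a call with explicit settings. The explicit values and variable names are arbitrary examples, not requirements of the exercise.

```python
from sklearn.datasets import make_regression

# Defaults: n_samples=100, n_features=100, n_informative=10, noise=0.0
X_default, y_default = make_regression()

# Explicit settings with a fixed seed, so the data is the same on every run.
X_small, y_small = make_regression(
    n_samples=500, n_features=10, n_informative=5, noise=5.0, random_state=42
)

print(X_default.shape, y_default.shape)
print(X_small.shape, y_small.shape)
```

With the defaults, an 80/20 split leaves 80 training rows for 100 features, so there are more features than training samples. That is worth keeping in mind when comparing regressors on this data.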