# Chapter 6: Machine Learning -------- Part B2

    6 | Overview
        6.1 What is Machine Learning?
        6.2 Scikit-Learn
        6.3 Supervised Learning: Classification
        6.4 Supervised Learning: Regression
        6.5 Unsupervised Learning: Dimension Reduction
        6.6 Unsupervised Learning: Clustering

### In this part we will only focus on 6.4

## 6.4: Supervised Learning - Regression

### Quick notes: 

    Predicting numeric data with Linear Regression:
    
    ▪ linear regression models are a good starting point for regression tasks 
        → popular because they can be fit very quickly, and are very interpretable
    ▪ simple linear regression: straight-line fit to data
    ▪ regularized regression (Ridge, LASSO): prevents overfitting by restricting the influence of
      coefficients (shrinkage)
        → parameter lambda (‘alpha’ in scikit-learn): higher value → more restriction on the coefficients 
        → LASSO: can be used for feature selection (coefficients can be set to zero)
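
The shrinkage effect described in the notes can be made visible with a small sketch. It assumes the scikit-learn diabetes dataset and a handful of hand-picked `alpha` values (illustrative, not tuned), fits Ridge and LASSO for each, and records the size of the Ridge coefficient vector and the number of LASSO coefficients that are exactly zero.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

# illustrative alpha values (assumption: chosen only to show the trend)
alphas = [0.01, 0.1, 1.0, 10.0]

ridge_norms = []  # L2 norm of the Ridge coefficient vector per alpha
lasso_zeros = []  # number of LASSO coefficients that are exactly zero per alpha

for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    lasso_zeros.append(int(np.sum(lasso.coef_ == 0)))
    print(f"alpha={alpha}: ridge norm={ridge_norms[-1]:.1f}, "
          f"lasso zero coefficients={lasso_zeros[-1]}")
```

Ridge shrinks all coefficients towards zero without removing any, while LASSO drops more and more features as `alpha` grows.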

### Example : Predicting diabetes

In [65]:
# dataset: diabetes (target: disease progression one year after baseline)

from sklearn import datasets
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# train/test split
from sklearn.model_selection import train_test_split

diab_X_train, diab_X_test, diab_y_train, diab_y_test = train_test_split(diabetes_X, diabetes_y, test_size=0.3,
                                                                       random_state=42)



### Simple Linear Regression

In [15]:
# create linear regression model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()

# train the model using the training sets
regr.fit(diab_X_train, diab_y_train)
## LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# make predictions using the testing set
diab_y_pred = regr.predict(diab_X_test)

# MSE: average squared prediction error; r2_score: share of the target's variance the model explains
from sklearn.metrics import mean_squared_error, r2_score
mean_squared_error(diab_y_test, diab_y_pred)

# the error is fairly large: an MSE of about 2822 means a typical error of roughly 53 units (its square root)

2821.7385595843766

In [16]:
r2_score(diab_y_test, diab_y_pred) 
# we can explain less than 50% of the variance in the outcome variable with this linear regression

0.47729201741573324
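
To make the two metrics less of a black box, here is a minimal sketch that recomputes them by hand. It repeats the same split and model as above under shorter, hypothetical variable names, and uses the standard definitions: MSE is the mean of the squared residuals, and R² is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

residuals = y_test - y_pred

# mean squared error: average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# root mean squared error: same units as the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_test, y_pred))
print(r2_manual, r2_score(y_test, y_pred))
print(rmse)
```

The RMSE is often easier to interpret than the MSE because it is on the same scale as the target variable.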

### Ridge Regression

In [23]:
from sklearn.linear_model import Ridge, RidgeCV
import numpy as np

# find optimal alpha using 10-fold cross validation (cv)
np.random.seed(42)
ridge_cv = RidgeCV(cv=10, alphas=np.logspace(-6, 6, 13))
ridge_cv.fit(diab_X_train, diab_y_train);
# RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
#                       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]),
#         cv=10, fit_intercept=True, gcv_mode=None, normalize=False, scoring=None,
#         store_cv_values=False)

ridge_cv.alpha_

0.1

In [28]:
# use best alpha (i.e., lambda) to train final model

ridge = Ridge(alpha=ridge_cv.alpha_)
ridge.fit(diab_X_train, diab_y_train)

#Ridge(alpha=0.1)

diab_y_pred_ridge = ridge.predict(diab_X_test)

# evaluate the performance
mean_squared_error(diab_y_test, diab_y_pred_ridge)

2805.393845841173

In [30]:
r2_score(diab_y_test, diab_y_pred_ridge) # only about 0.3 percentage points higher than plain linear regression (0.480 vs. 0.477)

0.480319765084846
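
What the cross-validated search does can be sketched by hand: score a `Ridge` model for every candidate `alpha` with `cross_val_score` and keep the one with the best mean score. This is an illustration rather than a description of `RidgeCV`'s internals. It assumes the same alpha grid and split as above and scikit-learn's default R² scoring, and the selected value need not match `ridge_cv.alpha_` exactly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

alphas = np.logspace(-6, 6, 13)  # same grid as in the RidgeCV call

mean_scores = []
for alpha in alphas:
    # 10-fold CV on the training data only; default scoring for regressors is R^2
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=10)
    mean_scores.append(scores.mean())

best_alpha = alphas[int(np.argmax(mean_scores))]
print(best_alpha)
```

Only the training set is used for the search, so the test set stays untouched for the final evaluation.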

### LASSO Regression

In [32]:
from sklearn.linear_model import Lasso, LassoCV

# find optimal alpha using 10-fold cross-validation

lasso_cv = LassoCV(cv=10, random_state=42)
lasso_cv.fit(diab_X_train, diab_y_train);
lasso_cv.alpha_   # it is very close to zero

0.005255949654898021

In [36]:
# use best alpha (i.e., lambda) to train final model

lasso = Lasso(alpha=lasso_cv.alpha_)
lasso.fit(diab_X_train, diab_y_train)

diab_y_pred_lasso = lasso.predict(diab_X_test)

# evaluate the performance

mean_squared_error(diab_y_test, diab_y_pred_lasso)

2816.600161923496

In [37]:
r2_score(diab_y_test, diab_y_pred_lasso) # practically the same as linear regression (0.478 vs. 0.477)

0.4782438708274931
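
With an alpha this close to zero, LASSO removes few features, if any. The following sketch shows how to read off the feature selection. It assumes a deliberately larger `alpha=1.0` (an illustrative value, not the cross-validated one) and uses the `feature_names` that ship with the diabetes dataset.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

data = load_diabetes()
X, y, names = data.data, data.target, data.feature_names

# alpha=1.0 is an assumption for illustration; it is much larger than the CV optimum
lasso_demo = Lasso(alpha=1.0).fit(X, y)

# features with a non-zero coefficient are "selected", the rest are dropped
kept = [name for name, coef in zip(names, lasso_demo.coef_) if coef != 0]
dropped = [name for name, coef in zip(names, lasso_demo.coef_) if coef == 0]

print("kept:", kept)
print("dropped:", dropped)
```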

### Using random forests for regression tasks
    ▪ random forests can also be used for regression tasks
    ▪ the estimator to use is RandomForestRegressor

In [42]:
# import estimator 

from sklearn.ensemble import RandomForestRegressor
# apply the usual steps

forest = RandomForestRegressor(n_estimators=200, random_state=42) # a forest of 200 trees
forest.fit(diab_X_train, diab_y_train)

diab_y_pred_forest = forest.predict(diab_X_test)


# evaluate the performance

mean_squared_error(diab_y_test, diab_y_pred_forest)

2804.493662030075

In [44]:
r2_score(diab_y_test, diab_y_pred_forest) # the best R² of the four models, but only marginally (0.4805 vs. 0.4803 for Ridge); still less than 50% of the variance explained

0.4804865180472194
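
The four models can also be compared in one loop, which avoids keeping a separate prediction variable per model. This is a sketch under assumptions: the alpha values are rounded from the cross-validated ones above, and the dictionary keys are arbitrary labels.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# alphas are rounded from the cross-validated values (assumption for brevity)
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.005),
    "forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    # fit on the training data, predict on the test data, score those predictions
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = (mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))

for name, (mse, r2) in results.items():
    print(f"{name:>6}: MSE={mse:.1f}, R2={r2:.3f}")
```

Because each model's predictions are scored inside the loop, there is no risk of evaluating one model with another model's predictions.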

## Exercise: 

    → Use the dataset used_cars.csv. The dataset includes over 100,000 web-scraped data points from an online advertisement platform.
        → Variables include the asking price of used cars, car brand and type, mileage, age, emissions, maintenance certificate, seller type, guarantee, etc.
    → Take a random subset of 10,000 cases (to speed up the computation). Pandas’ .sample() method might be helpful here.
    → Train a simple regression model that predicts the asking prices of used cars. Evaluate the performance of your model.
    → Train a second model using random forests. Can you improve your model performance?

    Note: If you want to transform some of the input variables, you might have a look at the
    pd.get_dummies() function!
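
As a hint for the two pandas helpers mentioned above, here is a small sketch on a made-up toy table (the values and the `toy` name are hypothetical and not taken from used_cars.csv). `.sample()` draws random rows, and `pd.get_dummies()` turns a categorical column into 0/1 indicator columns. With `drop_first=True` the first category is dropped as the reference level.

```python
import pandas as pd

# made-up toy data, only to illustrate the two calls
toy = pd.DataFrame({
    "car": ["vw_golf", "opel_astra", "vw_golf", "bmw_3"],
    "mileage": [35.0, 176.0, 95.7, 61.9],
})

# random subset of 2 rows; random_state makes the draw reproducible
subset = toy.sample(2, random_state=42)

# one indicator column per category, minus the first one (alphabetically: bmw_3)
encoded = pd.get_dummies(toy, columns=["car"], drop_first=True)

print(encoded.columns.tolist())
# → ['mileage', 'car_opel_astra', 'car_vw_golf']
```

Dropping one category avoids perfectly collinear indicator columns, which matters for linear regression.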
 
  

In [85]:
# dataset: used_cars (target: asking price)
# import libraries and data

import pandas as pd
import numpy as np

cars = pd.read_csv('/Users/abdulhabirkarahanli/Desktop/Data/used_cars.csv')

cars = cars.sample(10000, random_state=42)
cars = pd.get_dummies(cars, columns=['car'], drop_first=True)
X = cars.drop(['X', 'first_price'], axis=1)
y = cars['first_price']
X

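|  | diesel | mileage | age_car_years | other_car_owner | pm_green | private_seller | guarantee | inspection | maintenance_cert | co2_em | euro_norm | new_inspection | car_mercedes_c | car_opel_astra | car_vw_golf | car_vw_passat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42263 | 1 | 35.000 | 1.2 | 2 | 1 | 0 | 0 | 0 | 1 | 103 | 6 | 0 | 0 | 0 | 1 | 0 |
| 43083 | 1 | 176.000 | 6.6 | 2 | 1 | 0 | 0 | 0 | 1 | 150 | 5 | 0 | 0 | 0 | 0 | 0 |
| 72064 | 1 | 193.228 | 4.2 | 1 | 1 | 0 | 0 | 2 | 1 | 135 | 5 | 1 | 0 | 0 | 0 | 1 |
| 51462 | 0 | 55.975 | 4.5 | 1 | 1 | 0 | 0 | 0 | 1 | 165 | 5 | 0 | 0 | 0 | 0 | 1 |
| 34715 | 1 | 78.658 | 5.4 | 1 | 1 | 0 | 0 | 2 | 0 | 120 | 5 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21814 | 0 | 95.700 | 3.7 | 1 | 1 | 0 | 0 | 1 | 1 | 137 | 5 | 0 | 0 | 0 | 1 | 0 |
| 1175 | 1 | 161.831 | 3.6 | 1 | 1 | 0 | 0 | 0 | 1 | 123 | 5 | 0 | 0 | 0 | 0 | 1 |
| 101181 | 1 | 53.174 | 3.5 | 1 | 1 | 0 | 0 | 1 | 1 | 124 | 5 | 0 | 1 | 0 | 0 | 0 |
| 161 | 1 | 61.940 | 2.7 | 1 | 0 | 0 | 0 | 2 | 0 | 127 | 5 | 1 | 1 | 0 | 0 | 0 |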


In [89]:
# train/test split
from sklearn.model_selection import train_test_split
car_X_train, car_X_test, car_y_train, car_y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# create linear regression model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()

# train the model using the training sets
regr.fit(car_X_train, car_y_train)


# make predictions using testing set

car_y_pred = regr.predict(car_X_test)


# to evaluate our regression model import mean_squared_error and r2_score

from sklearn.metrics import mean_squared_error, r2_score

mean_squared_error(car_y_test, car_y_pred)

12.494152013199416

In [90]:
r2_score(car_y_test, car_y_pred) # not bad: the model explains about 77% of the variance

0.7748418338472217

#### Same exercise but with Random Forests

In [97]:
# import the estimator
from sklearn.ensemble import RandomForestRegressor

# apply the usual steps

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(car_X_train, car_y_train)

car_y_pred_forest = forest.predict(car_X_test)

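# evaluate the model: score the forest's predictions (car_y_pred_forest);
# passing car_y_pred instead just reproduces the linear model's MSE of 12.494152013199416

mean_squared_error(car_y_test, car_y_pred_forest)

In [99]:
r2_score(car_y_test, car_y_pred_forest)
# passing car_y_pred instead just reproduces the linear model's R² of 0.7748418338472217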

### This part ends here!!!