<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
## Open Machine Learning Course
<center>Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist at Mail.ru Group <br>
    All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
You may use this material for any purpose except commercial use (you can edit it, correct it, and use it as an example), with mandatory citation of the author.

# <center> Assignment #6 (demo).
## <center>  Exploring OLS, Lasso and Random Forest in a regression task
    
<img src=https://habrastorage.org/webt/-h/ns/aa/-hnsaaifymavmmudwip9imcmk58.jpeg width=30%>

**Fill in the missing code and choose answers in [this](https://docs.google.com/forms/d/1aHyK58W6oQmNaqEfvpLTpo6Cb0-ntnvJ18rZcvclkvw/edit) web form.**

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, LassoCV, Lasso
from sklearn.ensemble import RandomForestRegressor

**We are working with the UCI Wine Quality dataset (no need to download it; it is already in the course repo and in the Kaggle Dataset).**

In [2]:
data = pd.read_csv('../input/winequality-white.csv')

In [3]:
data.head()

In [4]:
data.info()

**Separate the target feature, split the data in a 7:3 proportion (30% forms a holdout set; use random_state=17), and preprocess the data with `StandardScaler`.**

In [5]:
y = data['quality']  # target: wine quality score
X = data.drop(columns=['quality'])  # all remaining columns are features
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, train_size=0.7, random_state=17)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_holdout_scaled = scaler.transform(X_holdout)
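
The scaler is fitted on the training part only, so the holdout set is transformed with the training means and standard deviations and no holdout statistics leak into the model. A minimal sketch of this behaviour, assuming a synthetic feature matrix (`X_demo`, `y_demo` and the other demo names are illustrative and not part of the assignment):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (assumption: not the real data).
rng = np.random.RandomState(17)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.normal(size=200)

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics learned here
X_ho_scaled = scaler_demo.transform(X_ho)      # training statistics reused

train_means = X_tr_scaled.mean(axis=0)
holdout_means = X_ho_scaled.mean(axis=0)
print("train means:  ", train_means.round(3))
print("holdout means:", holdout_means.round(3))
```

The training columns come out with zero mean and unit variance, while the holdout columns are only approximately standardized, which is expected.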

## Linear regression

**Train a simple linear regression model (Ordinary Least Squares).**

In [6]:
linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train)

**<font color='red'>Question 1:</font> What are the mean squared errors of the model's predictions on the train and holdout sets?**

In [48]:
linreg_train_pred = linreg.predict(X_train_scaled)
linreg_holdout_pred = linreg.predict(X_holdout_scaled)

# Built-in round() works whether the metric returns a NumPy or a plain Python float.
mse_train = round(mean_squared_error(y_train, linreg_train_pred), 3)
mse_holdout = round(mean_squared_error(y_holdout, linreg_holdout_pred), 3)

In [51]:
print(f"Mean squared error (train) for linreg: {mse_train}")
print(f"Mean squared error (test) for linreg: {mse_holdout}")

**Sort features by their influence on the target feature (wine quality). Beware that both large positive and large negative coefficients mean large influence on target. It's handy to use `pandas.DataFrame` here.**

**<font color='red'>Question 2:</font> Which feature does this linear regression model treat as the most influential on wine quality?**

In [9]:
d = dict(zip(X.columns, linreg.coef_))
linreg_coef = pd.DataFrame.from_records(sorted(d.items(), key=lambda item: np.abs(item[1]), reverse=True), columns=['features', 'coefficients']).set_index('features')
linreg_coef
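
The same ranking by absolute value can be written more compactly with a `pandas.Series` and the `key` argument of `sort_values` (available in pandas 1.1 and later). A small sketch with made-up coefficients (the feature names and values below are hypothetical, not the fitted ones):

```python
import pandas as pd

# Hypothetical coefficients for illustration only, not the fitted values.
demo_coefs = pd.Series(
    {"feature_a": 0.10, "feature_b": -0.45, "feature_c": 0.30},
    name="coefficients",
)

# Sort by magnitude so that large negative coefficients rank as high as large positive ones.
ranked = demo_coefs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

With the real model, the Series would be built as `pd.Series(linreg.coef_, index=X.columns)`.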

## Lasso regression

**Train a LASSO model with $\alpha = 0.01$ (weak regularization) and scaled data. Again, set random_state=17.**

In [10]:
lasso1 = Lasso(alpha=0.01, random_state=17)
lasso1.fit(X_train_scaled, y_train)

**Which feature is the least informative in predicting wine quality, according to this LASSO model?**

In [11]:
d = dict(zip(X.columns, lasso1.coef_))
lasso1_coef = pd.DataFrame.from_records(sorted(d.items(), key=lambda item: np.abs(item[1])), columns=['features', 'coefficients']).set_index('features')
lasso1_coef
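
The L1 penalty can shrink coefficients to exactly zero, so features with zero (or near-zero) Lasso coefficients are the ones the model treats as least informative. A small sketch on synthetic data (an assumption, not the wine dataset) comparing a weak and a deliberately huge `alpha`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 8 features, only 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=17
)
X_demo = StandardScaler().fit_transform(X_demo)

weak = Lasso(alpha=0.01, random_state=17).fit(X_demo, y_demo)
strong = Lasso(alpha=1e6, random_state=17).fit(X_demo, y_demo)  # extreme penalty

n_nonzero_weak = int(np.sum(weak.coef_ != 0))
n_nonzero_strong = int(np.sum(strong.coef_ != 0))
print(f"non-zero coefficients: weak={n_nonzero_weak}, strong={n_nonzero_strong}")
```

With the extreme penalty every coefficient is driven to zero, while the weak penalty keeps the informative features.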

**Train LassoCV with random_state=17 to choose the best value of $\alpha$ in 5-fold cross-validation.**

In [12]:
alphas = np.logspace(-6, 2, 200)
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=17)
lasso_cv.fit(X_train_scaled, y_train)

In [13]:
lasso_cv.alpha_
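
To see how `LassoCV` arrives at `alpha_`, one can inspect its `mse_path_` attribute, which stores the validation MSE for every candidate alpha and every fold. A minimal sketch on synthetic data (the data and the `*_demo` names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=4, noise=10.0, random_state=17
)

alphas_demo = np.logspace(-3, 1, 30)
cv_model = LassoCV(alphas=alphas_demo, cv=5, random_state=17).fit(X_demo, y_demo)

# mse_path_ has shape (n_alphas, n_folds); average over folds for each alpha.
mean_mse = cv_model.mse_path_.mean(axis=1)
best_idx = mean_mse.argmin()

# alphas_ is ordered consistently with the rows of mse_path_.
print("alpha with lowest mean CV MSE:", cv_model.alphas_[best_idx])
print("alpha_ reported by LassoCV:   ", cv_model.alpha_)
```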

**<font color='red'>Question 3:</font> Which feature is the least informative in predicting wine quality, according to the tuned LASSO model?**

In [14]:
d = dict(zip(X.columns, lasso_cv.coef_))
lasso_cv_coef = pd.DataFrame.from_records(sorted(d.items(), key=lambda item: np.abs(item[1])), columns=['features', 'coefficients']).set_index('features')
lasso_cv_coef

**<font color='red'>Question 4:</font> What are the mean squared errors of the tuned LASSO model's predictions on the train and holdout sets?**

In [43]:
lasso_cv_train_pred = lasso_cv.predict(X_train_scaled)
lasso_cv_holdout_pred = lasso_cv.predict(X_holdout_scaled)

# Built-in round() works whether the metric returns a NumPy or a plain Python float.
mse_train = round(mean_squared_error(y_train, lasso_cv_train_pred), 3)
mse_holdout = round(mean_squared_error(y_holdout, lasso_cv_holdout_pred), 3)

In [44]:
print(f"Mean squared error (train) for lasso_cv: {mse_train}")
print(f"Mean squared error (test) for lasso_cv: {mse_holdout}")

## Random Forest

**Train a Random Forest with out-of-the-box parameters, setting only random_state to be 17.**

In [17]:
forest = RandomForestRegressor(random_state=17)
forest.fit(X_train_scaled, y_train)

**<font color='red'>Question 5:</font> What are the mean squared errors of the RF model on the training set, in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values), and on the holdout set?**

In [45]:
cross_val_score_result = abs(cross_val_score(forest, X_train_scaled, y_train, scoring='neg_mean_squared_error').mean())
forest_train_pred = forest.predict(X_train_scaled)
forest_holdout_pred = forest.predict(X_holdout_scaled)

# Built-in round() works whether the metric returns a NumPy or a plain Python float.
mse_train = round(mean_squared_error(y_train, forest_train_pred), 3)
mse_holdout = round(mean_squared_error(y_holdout, forest_holdout_pred), 3)

In [46]:
print("Mean squared error (train) for forest: %.3f" % mse_train)
print("Mean squared error (cv for forest): %.3f" % cross_val_score_result)
print("Mean squared error (test) for forest: %.3f" % mse_holdout)
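
The `neg_mean_squared_error` scorer returns the MSE with a flipped sign, because scikit-learn scorers follow a "higher is better" convention; that is why the cell above takes the absolute value. A short sketch on synthetic data (an assumption, with a smaller forest to keep it fast) that also shows the gap between training and cross-validation error:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=20.0, random_state=17
)
rf_demo = RandomForestRegressor(n_estimators=30, random_state=17)

# Every fold score is minus the MSE, so negate the mean to report the MSE itself.
neg_scores = cross_val_score(rf_demo, X_demo, y_demo, scoring="neg_mean_squared_error")
cv_mse = -neg_scores.mean()

rf_demo.fit(X_demo, y_demo)
train_mse = mean_squared_error(y_demo, rf_demo.predict(X_demo))
print(f"train MSE: {train_mse:.1f}, CV MSE: {cv_mse:.1f}")
```

A training MSE far below the cross-validation MSE is typical for a forest of fully grown trees, so the training error alone says little about generalization.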

**Tune the `max_features` and `max_depth` hyperparameters with GridSearchCV (the grid below additionally searches over `min_samples_leaf`) and again check the mean cross-validation MSE and the MSE on the holdout set.**

In [52]:
forest_params = {'max_depth': list(range(10, 25)), 
                 'min_samples_leaf': list(range(1, 8)),
                 'max_features': list(range(6,12))}

locally_best_forest = GridSearchCV(forest, param_grid=forest_params, n_jobs=-1, scoring='neg_mean_squared_error', verbose=True)
locally_best_forest.fit(X_train_scaled, y_train)

In [53]:
locally_best_forest.best_params_, locally_best_forest.best_score_

**<font color='red'>Question 6:</font> What are the mean squared errors of the tuned RF model in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values) and on the holdout set?**

In [54]:
cross_val_score_result = abs(cross_val_score(locally_best_forest.best_estimator_, X_train_scaled, y_train, scoring='neg_mean_squared_error').mean()).round(3)
locally_best_forest_train_pred = locally_best_forest.predict(X_train_scaled)
locally_best_forest_holdout_pred = locally_best_forest.predict(X_holdout_scaled)

# Built-in round() works whether the metric returns a NumPy or a plain Python float.
mse_train = round(mean_squared_error(y_train, locally_best_forest_train_pred), 3)
mse_holdout = round(mean_squared_error(y_holdout, locally_best_forest_holdout_pred), 3)

In [55]:
print(f"Mean squared error (cv) for locally_best_forest: {cross_val_score_result}")
print(f"Mean squared error (test) for locally_best_forest: {mse_holdout}")

**Output RF's feature importance. Again, it's nice to present it as a DataFrame.**<br>
**<font color='red'>Question 7:</font> What is the most important feature, according to the Random Forest model?**

In [37]:
d = dict(zip(X.columns, locally_best_forest.best_estimator_.feature_importances_))
# Importances are non-negative, so no absolute value is needed; the column is named accordingly.
locally_best_forest_feature_importances = pd.DataFrame.from_records(sorted(d.items(), key=lambda item: item[1], reverse=True), columns=['features', 'importance']).set_index('features')
locally_best_forest_feature_importances

**Make conclusions about the performance of the explored 3 models in this particular prediction task.**

Mean squared error (train) for linreg: 0.558 <br />
Mean squared error (test) for linreg: 0.584

Mean squared error (train) for lasso_cv: 0.558 <br />
Mean squared error (test) for lasso_cv: 0.583 <br />

Mean squared error (train) for forest: 0.075 <br />
Mean squared error (cv for forest): 0.460 <br />
Mean squared error (test) for forest: 0.422

Mean squared error (cv) for locally_best_forest: 0.453 <br />
Mean squared error (test) for locally_best_forest: 0.417 

Lasso (L1-regularized regression) performs only marginally better than plain linear regression (holdout MSE 0.583 vs. 0.584). Random forest regression improves substantially on both (holdout MSE 0.422), and tuning its hyperparameters improves the cross-validation and holdout results slightly further (0.453 and 0.417).

**This suggests that the dependency between wine quality and the features is probably non-linear.**
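
The reasoning behind this can be illustrated on a toy problem where the target is known to depend non-linearly on a feature. The sketch below uses synthetic data (an assumption; the quadratic target is chosen purely for illustration) and compares holdout MSE for a linear model and a random forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data: the target is a quadratic function of the first feature plus noise.
rng = np.random.RandomState(17)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.3, size=500)

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)

lin_demo = LinearRegression().fit(X_tr, y_tr)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=17).fit(X_tr, y_tr)

lin_mse = mean_squared_error(y_ho, lin_demo.predict(X_ho))
rf_mse = mean_squared_error(y_ho, rf_demo.predict(X_ho))
print(f"linear holdout MSE: {lin_mse:.2f}, forest holdout MSE: {rf_mse:.2f}")
```

A linear model cannot represent the quadratic shape, so the forest wins by a wide margin here. The gap on the wine data is smaller, which is consistent with a milder non-linearity, though it does not prove it.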


What can be done next:
* Checking Ridge regression
* Analysing feature importances and linear regression coefficients
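
As a starting point for the first item, Ridge regression can be tuned in the same way as Lasso, using `RidgeCV`. A minimal sketch on synthetic data (an assumption; with the wine data one would pass `X_train_scaled`, `y_train` and `X_holdout_scaled` instead, and the alpha grid here is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=17
)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_ho_scaled = scaler_demo.transform(X_ho)

# Candidate regularization strengths (illustrative grid), chosen by 5-fold CV.
ridge_alphas = np.logspace(-2, 3, 30)
ridge_cv = RidgeCV(alphas=ridge_alphas, cv=5).fit(X_tr_scaled, y_tr)

ridge_holdout_mse = mean_squared_error(y_ho, ridge_cv.predict(X_ho_scaled))
print(f"best alpha: {ridge_cv.alpha_:.4f}, holdout MSE: {ridge_holdout_mse:.2f}")
```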

The main conclusion, which applies to any supervised learning task: when establishing a baseline, first try all the easily available models, because it is hard to guess in advance which algorithm will prove best for the task.