# Touchsoft ML Course Preassignment

Hi,

This task will help you get familiar with the Python ecosystem for machine learning.

You'll write some code to create and train **your own** model.

**Let's start and import some core libraries**

In [None]:
import numpy as np # number array processing support
import pandas as pd # data table processing support
import matplotlib.pyplot as plt # plotting support

**And do some magic to display graphs inline**

In [None]:
%matplotlib inline

### Exploratory data analysis

In this assignment you will work with the **Wine Quality** data.

Let's load and explore it

![White wines](http://www.larevista.ro/wp-content/uploads/2016/04/vin.jpg)

In [None]:
data = pd.read_csv(filepath_or_buffer='../data/winequality-white.csv', sep=';')

In [None]:
data.describe()

In [None]:
data.head()

OK, we see that our dataset contains 4898 rows and 12 numeric columns: 11 features and the target. The features are described below.

**fixed acidity**: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

**volatile acidity**: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

**citric acid**: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

**residual sugar**: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

**chlorides**: the amount of salt in the wine

**free sulfur dioxide**: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

**total sulfur dioxide**: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

**density**: the density of wine is close to that of water, depending on the percent alcohol and sugar content

**pH**: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

**sulphates**: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

**alcohol**: the percent alcohol content of the wine

Our target variable:

**quality**: the expert's score for the given wine



Let's look at our data more closely and plot histograms of some features

In [None]:
data['alcohol'].hist(bins=20)

In [None]:
data['quality'].hist(bins=20)

In [None]:
data['residual sugar'].hist(bins=100)

In [None]:
data['density'].hist(bins=100)

In [None]:
data['total sulfur dioxide'].hist()

Let's see how quality depends on **fixed acidity**

In [None]:
plt.scatter(data['fixed acidity'], data['quality'])

**Try to build your own graphs below**

Use the [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) and [Matplotlib](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html) documentation as a reference

In [None]:
# YOUR CODE HERE

### Feature engineering

**Note**: we will train a linear regression model here (Ilya talked about it).

Please try to obtain a symmetric distribution for each feature. Most of them are already symmetric; applying a logarithm should help with the others.

**Hint**: As you have probably seen, for some features values near the mean are better for quality than extreme values. Please try to explain this to the linear model.
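
Here is a minimal sketch of both ideas on a made-up table, not the real dataset. The new column names (`log residual sugar`, `pH dev sq`) and the choice of `np.log1p` and a squared deviation from the mean are assumptions for illustration; other transforms may fit your features better.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine table; the numbers are made up for illustration.
toy = pd.DataFrame({
    "residual sugar": [1.2, 1.6, 2.0, 8.5, 20.7],
    "pH": [2.9, 3.1, 3.2, 3.3, 3.6],
})

# A log transform compresses the long right tail of a skewed feature.
# log1p(x) = log(1 + x) stays finite even when a value is 0.
toy["log residual sugar"] = np.log1p(toy["residual sugar"])

# Squared distance from the mean: small for typical values, large for
# extremes, so a linear model can penalize both tails with one weight.
toy["pH dev sq"] = (toy["pH"] - toy["pH"].mean()) ** 2

print(toy.round(3))
```

Plot a histogram before and after the transform to check whether the distribution actually became more symmetric.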

In [None]:
# Extend data with new features
# YOUR CODE HERE

Let's see what we got

In [None]:
data.head()

### Prepare train/test split and train model

Once the features are prepared, let's feed the data to the model.

**Note**: a linear model requires data standardization. We will use the [sklearn preprocessor](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for it.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
data_features = data.drop('quality', axis=1)
data_score = data['quality']
scaler = StandardScaler()
scaler.fit(data_features)
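
For intuition, here is a small NumPy sketch of what standardization does: subtract each column's mean and divide by its standard deviation. The matrix `X` is made up; with default settings `StandardScaler` applies the same formula, using the population standard deviation (`ddof=0`).

```python
import numpy as np

# Made-up feature matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

mean = X.mean(axis=0)          # per-column mean
std = X.std(axis=0)            # per-column population std (ddof=0)
X_scaled = (X - mean) / std    # standardized features

print(X_scaled.mean(axis=0))   # approximately 0 for every column
print(X_scaled.std(axis=0))    # approximately 1 for every column
```

After scaling, both columns live on the same scale, so a single learning rate works for all weights.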

Now let's split the data into train and test parts. For a small dataset, a 70/30 split is considered good practice, so let's follow it.

In [None]:
train_sample = np.random.choice(len(data), size=3400, replace=False)
test_sample = np.invert(data.index.isin(train_sample))
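
The same index-plus-mask idea can be checked on a tiny example. This is only a sketch: the sizes are toy values and the seeded generator `rng` is an assumption added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for reproducibility)
n_rows, n_train = 10, 7          # toy sizes, roughly a 70/30 split

# Unique row positions for the train part.
train_idx = rng.choice(n_rows, size=n_train, replace=False)
# Boolean mask that is True for every row NOT picked for training.
test_mask = ~np.isin(np.arange(n_rows), train_idx)

print(np.sort(train_idx), np.flatnonzero(test_mask))
```

Note that the train part is a list of positions while the test part is a boolean mask, which is why the two are indexed differently later on.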

In [None]:
train_y = data_score.iloc[train_sample]  # target values, not features
train_X = scaler.transform(data_features.iloc[train_sample])

In [None]:
test_y = data_score[test_sample]
test_X = scaler.transform(data_features[test_sample])

**Important**: We will minimize the squared error.

$E = (y_{true} - y_{predicted})^2 = (y_{true} - w X^T - b)^2$

There is a Linear Regression class template below. As you can see, its `fit` method is lost somewhere, so you need to implement it yourself. Please use the Gradient Descent method as follows:
$$
    w_{n+1} = w_n - \eta \frac{\partial E}{\partial w} \\
    b_{n+1} = b_n - \eta \frac{\partial E}{\partial b}
$$

You have to compute the partial derivatives yourself. Use the norm of the update step as the stopping criterion. $\eta$ is the learning rate here.
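
To see the update rule and the stopping criterion in action without spoiling the exercise, here is a sketch on a toy objective $f(w) = \|w - t\|^2$, not on the regression error. The function `minimize_quadratic`, its parameters and `max_iter` are hypothetical names chosen for this illustration.

```python
import numpy as np

def minimize_quadratic(target, learning_rate=0.1, stop_threshold=1e-6, max_iter=10_000):
    """Gradient descent on the toy objective f(w) = ||w - target||^2."""
    w = np.zeros_like(target, dtype=float)
    for _ in range(max_iter):
        grad = 2.0 * (w - target)      # df/dw for the toy objective
        step = learning_rate * grad    # eta * gradient
        w = w - step                   # w_{n+1} = w_n - eta * df/dw
        if np.linalg.norm(step) < stop_threshold:
            break                      # the update became tiny, so stop
    return w

w_hat = minimize_quadratic(np.array([3.0, -1.0]))
print(w_hat)
```

Your `fit` method can follow the same loop shape, with the gradients of $E$ with respect to $w$ and $b$ in place of the toy gradient.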


In [None]:
class LinRegression:
    coef_ = None
    bias_ = None
    learning_rate_ = None
    stop_threshold_ = None

    def __init__(self, learning_rate=1e-3, stop_threshold=0.01) -> None:
        self.learning_rate_ = learning_rate
        self.stop_threshold_ = stop_threshold

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.coef_ = np.random.randn(1, X.shape[1])
        self.bias_ = np.random.randn()
        # YOUR CODE HERE
        # write train loop for linear regression model
        pass

    def predict(self, X: np.ndarray):
        return self.coef_ @ X.T + self.bias_

In [None]:
lr = LinRegression()
lr.fit(train_X, train_y)

In [None]:
y_pred = lr.predict(test_X)

In [None]:
from sklearn.metrics import mean_squared_error

If you completed the model correctly, a score of about 0.6 is expected below.

In [None]:
mean_squared_error(test_y, y_pred.T)
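
If you want to double-check the metric by hand, the mean squared error is just the average of the squared differences. A small sketch with made-up numbers; the prediction is given the shape `(1, n)` on purpose, because that is what `predict` above returns.

```python
import numpy as np

y_true = np.array([5.0, 6.0, 7.0])      # made-up quality scores, shape (3,)
y_pred = np.array([[5.5, 6.0, 6.0]])    # made-up predictions, shape (1, 3)

# ravel() turns the (1, 3) row into a flat (3,) vector so the
# subtraction is element-wise.
errors = y_true - y_pred.ravel()
mse = np.mean(errors ** 2)
print(mse)
```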

Finally, let's compare the [Sklearn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) with our model

In [None]:
from sklearn.linear_model import LinearRegression
# YOUR CODE HERE
# Train the sklearn linear model and print the mean squared error for our train/test split