This lab exercise is meant to get your hands dirty with some code. <br> Make sure you understand the concepts, as they will be important in Deep Reinforcement Learning.

## Linear regression
In linear regression, the relationship between the features and the target is modeled with a linear function whose unknown parameters are estimated from the data. <br> We'll play around with a popular wine dataset. <br> Using multiple features from this dataset, you'll predict the quality of a wine.
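To make "parameters estimated from the data" concrete, here is a minimal sketch using ordinary least squares in NumPy. The toy data and the names `X_toy`, `y_toy`, and `weights` are made up for illustration and are not part of the lab; `LinearRegression` in scikit-learn solves the same kind of problem for you.

```python
import numpy as np

# Toy data generated from y = 2*x1 - 3*x2 + 1 (no noise); purely illustrative.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + 1.0

# Append a column of ones so the intercept is estimated as one more weight.
A = np.hstack([X_toy, np.ones((50, 1))])

# Least squares finds the weights that minimize the squared prediction error.
weights, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
print(weights)  # approximately [ 2. -3.  1.]
```

Because the toy data has no noise, the recovered weights match the ones used to generate it. With real data such as the wine dataset, the fit is only approximate.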

In [116]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

Load dataset

In [117]:
dataset = pd.read_csv('winequality-red.csv',sep=';')

Check what it actually looks like. Another popular way to do this is to call the head() function. Please refer to the pandas documentation for more details.

In [119]:
dataset
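As a small sketch of the head() alternative mentioned above, the following uses a made-up DataFrame (`toy` and its values are illustrative, not rows from the wine file):

```python
import pandas as pd

# A made-up frame with seven rows; the values are illustrative only.
toy = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0],
    'quality': [5, 5, 6, 5, 6, 7, 6],
})

first_rows = toy.head()    # the first 5 rows by default
first_three = toy.head(3)  # or ask for a specific number of rows
print(first_three)
```

Calling `dataset.head()` on the wine data works the same way and avoids printing all the rows.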

Sometimes a dataset is too large to display, so shape comes in handy for checking its dimensions. <br> In the current example there are 1599 rows and 12 columns. <br> When working with an image dataset, the shape may also include the number of images and the number of channels: 3 for RGB, or 1 for a grayscale image.

In [120]:
dataset.shape
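To illustrate the remark about image data, here is a rough sketch with empty NumPy arrays. The sizes and the channels-last layout are assumptions chosen for the example; some libraries (PyTorch convolution layers, for instance) expect the channel axis before height and width.

```python
import numpy as np

# 10 hypothetical images of 32x32 pixels, stored as (images, height, width, channels).
rgb_batch = np.zeros((10, 32, 32, 3))   # 3 channels: red, green, blue
gray_batch = np.zeros((10, 32, 32, 1))  # 1 channel: grayscale

print(rgb_batch.shape)   # (10, 32, 32, 3)
print(gray_batch.shape)  # (10, 32, 32, 1)
```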

When working with statistical problems, describe() can give you quick insights. (Here it's just to show you some of the functionality of pandas.)

In [121]:
dataset.describe()

Real-world data is not as clean as this dataset, so **ALWAYS KNOW YOUR DATA**. Here we check whether there are any null values that we may have to remove or fill.

In [122]:
dataset.isnull().any()
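The wine data has no missing values, so here is a sketch of the "remove or fill" step on a made-up frame. The frame `toy` and the choice of filling with the column mean are assumptions for illustration; the right strategy depends on your data.

```python
import numpy as np
import pandas as pd

# A made-up frame with two missing entries.
toy = pd.DataFrame({'pH': [3.5, np.nan, 3.3],
                    'alcohol': [9.4, 9.8, np.nan]})

print(toy.isnull().any())        # True for both columns

dropped = toy.dropna()           # option 1: remove rows containing nulls
filled = toy.fillna(toy.mean())  # option 2: fill nulls with the column mean
print(filled)
```

Dropping rows is simple but throws data away, while filling keeps every row at the cost of inventing values.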

Define features

In [123]:
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates','alcohol']

Defining your data and labels

In [124]:
X = dataset[features].values
y = dataset['quality'].values

We want to see the distribution of the labels so we know how imbalanced the data is. You can read a lot more about imbalance in detail. **It's good and bad ;)** Here we'll continue playing with this imbalanced data.

In [125]:
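plt.figure(figsize=(15,10))
# countplot draws one bar per quality score; distplot is deprecated in recent seaborn releases
sns.countplot(x='quality', data=dataset)
plt.tight_layout()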
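Besides plotting, you can put numbers on the imbalance. The sketch below uses a made-up label series (`toy_quality` is illustrative, not the wine labels) and pandas' `value_counts`:

```python
import pandas as pd

# Made-up quality labels; most samples fall in class 5.
toy_quality = pd.Series([5, 5, 5, 6, 6, 7, 5, 6, 5, 4])

counts = toy_quality.value_counts().sort_index()
proportions = toy_quality.value_counts(normalize=True).sort_index()

print(counts)       # number of samples per class
print(proportions)  # fraction of samples per class
```

Running `dataset['quality'].value_counts()` on the wine data gives the same kind of summary for the real labels.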

Here we plot how these three features vary across the quality classes. If you are a wine lover, you probably know this already.

In [126]:
relationship = sns.pairplot(dataset, vars=['fixed acidity','volatile acidity','citric acid'], hue='quality')
plt.show()

Stop and think about why we have a train set and a test set. If you don't know, ask one of the teaching assistants to explain it to you.

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
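If you want a hint, the sketch below shows one reason a held-out test set matters. Everything here is made up for illustration: the labels are random and unrelated to the features, so there is nothing real to learn, yet an unrestricted decision tree (used only as an example of a flexible model) can still memorize the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5))
y_noise = rng.integers(0, 2, size=200)  # labels unrelated to the features

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=0)

# With no depth limit the tree keeps splitting until it fits every training point.
memorizer = DecisionTreeClassifier(random_state=0).fit(Xn_train, yn_train)

train_acc = memorizer.score(Xn_train, yn_train)
test_acc = memorizer.score(Xn_test, yn_test)
print('train accuracy', train_acc)
print('test accuracy', test_acc)
```

The training accuracy looks perfect, but only the test accuracy tells you how the model behaves on data it has not seen.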

**FINALLY** <br> In this example we just use the default parameters. You can have fun with them.

In [127]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

Apart from estimating a function, linear regression has many other uses. It's one of the favorite tools in reporting. Here we are just trying to explain how the quality of a wine relates to its different features.

In [128]:
coeff_df = pd.DataFrame(regressor.coef_, features, columns=['Coefficient'])  
coeff_df

In [36]:
y_pred = regressor.predict(X_test)

In [37]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(25)

Let's check how you performed, shall we?

In [129]:
df1

Will you be able to explain the graph?

In [130]:
df1.plot(kind='bar',figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

Here we report common metrics, i.e. how close our predictions are to the actual data.

In [136]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
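To see what these three numbers mean, here is a sketch that computes them by hand on four made-up values (`y_true` and `y_hat` are illustrative, not lab results):

```python
import numpy as np

y_true = np.array([5.0, 6.0, 5.0, 7.0])  # made-up actual qualities
y_hat = np.array([5.5, 5.0, 5.0, 6.0])   # made-up predictions

errors = y_true - y_hat

mae = np.mean(np.abs(errors))  # average size of the error
mse = np.mean(errors ** 2)     # squaring penalizes large errors more
rmse = np.sqrt(mse)            # back in the same units as quality

print(mae, mse, rmse)  # 0.625 0.5625 0.75
```

The functions in `sklearn.metrics` used above compute the same quantities.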

## Decision Tree
You can read more about them here (https://en.wikipedia.org/wiki/Decision_tree), but in short they are like a flowchart in which you arrive at the final class by following decisions based on the features. Each decision, and its possible consequences, is determined by the feature values.

In [131]:
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [132]:
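# Features and labels as pandas objects. The cells below reuse the X_train/X_test split from the linear regression section.
x = dataset.drop('quality',axis=1)
y = dataset['quality']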

In [134]:
clfTre = tree.DecisionTreeClassifier(max_depth=5)
clfTre.fit(X_train, y_train)
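To see the flowchart idea for yourself, scikit-learn can print the rules of a fitted tree with `export_text`. The sketch below uses a tiny made-up dataset (`X_toy`, `y_toy`, and the feature name `feature` are illustrative), so the tree needs only one decision:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# One feature, two classes: small values are class 0, large values are class 1.
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Each line of the output is one decision in the flowchart.
rules = export_text(toy_tree, feature_names=['feature'])
print(rules)
```

You could try the same call on `clfTre` with `feature_names=features` to read the decisions learned from the wine data.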

In [135]:
utfall = np.count_nonzero(clfTre.predict(X_test) == y_test)
print("The decision tree predicts the test data correctly in", utfall/(len(X_test))*100 , "% of the cases.")

Are we performing better or worse compared to linear regression? If worse, what can we do? Think about it...

## Random Forest
Random forests, or random decision forests, are an ensemble method. They build on decision trees: a multitude of decision trees is constructed at training time, and the output is the class that is the mode of the individual trees' predictions (classification) or their mean prediction (regression). Random decision forests correct for decision trees' habit of overfitting to their training set.
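The "mode" and "mean" combination step can be sketched in a few lines of plain Python. The per-tree outputs below are made up for illustration; note that scikit-learn's `RandomForestClassifier` actually averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter

# Made-up class predictions from five trees for one wine (classification).
tree_votes = [5, 6, 5, 5, 7]
majority_class = Counter(tree_votes).most_common(1)[0][0]
print(majority_class)  # 5

# Made-up numeric predictions from five trees for one wine (regression).
tree_outputs = [5.2, 5.8, 5.0, 5.4, 6.1]
mean_prediction = sum(tree_outputs) / len(tree_outputs)
print(mean_prediction)  # approximately 5.5
```

Because each tree sees a different random sample of the data, their individual mistakes tend to differ, and combining them smooths those mistakes out.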

In [137]:
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
rf.fit(X_train, y_train)

In [138]:
utfall = np.count_nonzero(rf.predict(X_test) == y_test)
print("The random forest predicts the test data correctly in", utfall/(len(X_test))*100 , "% of the cases.")

## MLP
Multilayer perceptrons (MLPs) are a type of neural network. A single layer without an activation function behaves much like linear regression.
I'm sure you had fun up to this point. <br>
For the upcoming DRL classes it's very important that you develop a good intuition for MLPs and CNNs. <br>
We want you to run this code and compare it with the classical algorithms we used previously. 
Notice what it is learning and how, and, since you know your dataset, think about what more (or less) could be done. <br>
Don't forget: **Always know your data**
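The remark about layers without activation functions can be checked numerically. The sketch below uses NumPy with random, made-up weights (`W1`, `b1`, `W2`, `b2` are illustrative) and shows that two stacked linear layers with nothing in between compute exactly the same thing as one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # 5 samples, 4 features

# Two linear layers with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)
two_layers = (x @ W1 + b1) @ W2 + b2

# The same computation folded into a single linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True
```

This is why non-linear activations such as ReLU matter: without them, adding layers does not let the network represent anything a single linear model could not.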

In [100]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


In [108]:
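class Net(nn.Module):
    # Three fully connected layers: 11 input features -> 100 -> 100 -> 6 quality classes
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(11, 100)
        self.fc2 = nn.Linear(100, 100)
        self.fc3 = nn.Linear(100, 6)

    def forward(self, X):
        X = F.relu(self.fc1(X))
        # No activation after fc2, so fc2 and fc3 together act as a single linear map
        X = self.fc2(X)
        # Return raw scores (logits): nn.CrossEntropyLoss applies log-softmax itself,
        # so a softmax here would be applied twice
        X = self.fc3(X)

        return X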

In [109]:
dataset = pd.read_csv('winequality-red.csv', sep=';')

In [110]:
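# Convert each label to a string; str(dataset['quality']) would set every row to the text of the whole column
dataset['quality'] = dataset['quality'].astype(str)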

In [111]:
unique_labels = np.unique(dataset['quality'].values)
le = preprocessing.LabelEncoder()
le.fit(unique_labels)
labels = le.transform(dataset['quality'])

In [112]:
# Hold out 20% of the data for testing, matching the split used earlier
train_X, test_X, train_y, test_y = train_test_split(dataset[features].values,
                                                    labels, test_size=0.2)

In [113]:
# Plain tensors are enough; wrapping in Variable is no longer needed in current PyTorch
train_X = torch.tensor(train_X, dtype=torch.float32)
test_X = torch.tensor(test_X, dtype=torch.float32)
train_y = torch.tensor(train_y, dtype=torch.long)
test_y = torch.tensor(test_y, dtype=torch.long)

In [114]:
net = Net()

criterion = nn.CrossEntropyLoss()  # cross-entropy loss

optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
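`nn.CrossEntropyLoss` combines a log-softmax with the negative log-likelihood, so it expects raw scores rather than probabilities. As a rough sketch of what it computes, here is the same calculation in NumPy for one made-up sample (`logits` and `label` are illustrative values):

```python
import numpy as np

logits = np.array([[2.0, 1.0, 0.1]])  # made-up scores: one sample, three classes
label = np.array([0])                 # index of the true class

# Log-softmax, computed stably by subtracting the row maximum first.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Cross-entropy: mean negative log-probability assigned to the true class.
loss = -log_probs[np.arange(len(label)), label].mean()
print(loss)
```

The loss gets smaller as the score of the true class grows relative to the others, which is what the optimizer pushes towards during training.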

In [140]:
for epoch in range(1000):
    optimizer.zero_grad()
    out = net(train_X)
    loss = criterion(out, train_y)
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print('number of epoch', epoch, 'loss', loss.item())

# Evaluate on the held-out data; no gradients are needed for prediction
with torch.no_grad():
    predict_out = net(test_X)
_, predict_y = torch.max(predict_out, 1)

print('prediction accuracy', accuracy_score(test_y.data, predict_y.data))

print('macro precision', precision_score(test_y.data, predict_y.data, average='macro'))
print('micro precision', precision_score(test_y.data, predict_y.data, average='micro'))
print('macro recall', recall_score(test_y.data, predict_y.data, average='macro'))
print('micro recall', recall_score(test_y.data, predict_y.data, average='micro'))
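The difference between macro and micro averaging matters on imbalanced data like ours. Here is a small sketch with made-up labels (`y_true_toy` and `y_pred_toy` are illustrative): macro averages the per-class scores so every class counts equally, while micro pools all predictions so frequent classes dominate.

```python
from sklearn.metrics import precision_score

# Made-up labels: class 0 is more common than class 1.
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]

# Per class: 3 of the 4 predictions of class 0 are right (0.75),
# and 1 of the 2 predictions of class 1 is right (0.5).
macro = precision_score(y_true_toy, y_pred_toy, average='macro')  # (0.75 + 0.5) / 2
micro = precision_score(y_true_toy, y_pred_toy, average='micro')  # 4 correct out of 6

print('macro', macro)  # 0.625
print('micro', micro)  # approximately 0.667
```

A large gap between the two on the wine data would suggest the model does well on the common quality scores and poorly on the rare ones.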