<a href="https://colab.research.google.com/github/Saltire78/ITT_Labs/blob/main/Practical_1_Victor_Szultka.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming assignment 1: introductory tour and decision trees
In this assignment, you will take a quick guided tour of the scikit-learn library, one of the most widely used machine learning libraries in Python. We will particularly focus on decision tree learning for classification and regression.


The assignment consists of three tasks, whose main focus is on using scikit-learn to train and evaluate machine learning models.

Save this notebook in your own Google Drive or GitHub account, complete your code, plots, and comments, and save it as a .ipynb file. Name the file "Assignment1_your_name.ipynb" and submit it to the assignment for this practical on Blackboard.

Deadline: 06-11-2020

Didactic purpose of this assignment:

*   getting a feel for the workflow of machine learning in Python;
*   understanding machine learning algorithms for classification and regression.




# Task 1: A classification example: fetal heart condition diagnosis
The UCI Machine Learning Repository contains several datasets that can be used to investigate different machine learning algorithms. In this exercise, we'll use a dataset of fetal heart diagnosis. The dataset contains measurements from about 2,100 fetuses. This is a classification task: we want to predict a diagnosis type following the FIGO Intrapartum Fetal Monitoring Guidelines, which is one of normal, suspicious, or pathological.

## Step 1. Reading the data

The CSV file read in the snippet below contains the data that we will use. It holds the same data as the public distribution, except that it was converted from Excel to CSV. You can download the file and save it in a working directory, or read it directly from its URL as the snippet does.

Open your favorite editor or a Jupyter notebook. To read the CSV file, it is probably easiest to use the Pandas library. Here is a code snippet that carries out the relevant steps:


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

  
# Read the CSV file.
data = pd.read_csv("https://raw.githubusercontent.com/niallomahony93/MLDIoT_Practical1/master/Assignment1/CTG.csv", skiprows=1)

# Select the relevant numerical columns.
selected_cols = ['LB', 'AC', 'FM', 'UC', 'DL', 'DS', 'DP', 'ASTV', 'MSTV', 'ALTV',
                 'MLTV', 'Width', 'Min', 'Max', 'Nmax', 'Nzeros', 'Mode', 'Mean',
                 'Median', 'Variance', 'Tendency', 'NSP']
data = data[selected_cols].dropna()

# Shuffle the dataset.
data_shuffled = data.sample(frac=1.0, random_state=0)

# Split into input part X and output part Y.
X = data_shuffled.drop('NSP', axis=1)

# Map the diagnosis code to a human-readable label.
def to_label(y):
    return [None, 'normal', 'suspect', 'pathologic'][int(y)]

Y = data_shuffled['NSP'].apply(to_label)


# Partition the data into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)

# Preview the first rows. Only the last expression in a cell is displayed,
# so a single call is enough here.
data.head()


Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,MLTV,Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,Tendency,NSP
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,2.4,64.0,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,4.0,0.0,4.0,2.0,0.0,0.0,17.0,2.1,0.0,10.4,130.0,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,2.0,0.0,5.0,2.0,0.0,0.0,16.0,2.1,0.0,13.4,130.0,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,2.0,0.0,6.0,2.0,0.0,0.0,16.0,2.4,0.0,23.0,117.0,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,4.0,0.0,5.0,0.0,0.0,0.0,16.0,2.4,0.0,19.9,117.0,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


## Step 2. Training the baseline classifier

We can now start to investigate different classifiers.
The DummyClassifier is a simple classifier that does not make use of the features: it just returns the most common label in the training set, in this case normal. The purpose of using such a naive classifier is to provide a baseline: a simple reference point that we can try before we move on to more complex classifiers.

In [2]:

from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='most_frequent')
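To make the idea of the baseline concrete, here is a minimal sketch in plain Python of what a most-frequent-label strategy amounts to. The helper names `fit_most_frequent` and `accuracy` and the toy labels are illustrative assumptions, not part of scikit-learn or of the CTG data.

```python
from collections import Counter

def fit_most_frequent(labels):
    """Return the label that occurs most often in the training labels."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(true_labels, predicted_labels):
    """Proportion of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical toy labels, not taken from the CTG data.
train_labels = ['normal', 'normal', 'suspect', 'normal', 'pathologic']
test_labels = ['normal', 'suspect', 'normal', 'normal']

majority = fit_most_frequent(train_labels)
# The same label is predicted for every instance; the features are never used.
predictions = [majority] * len(test_labels)
baseline_accuracy = accuracy(test_labels, predictions)
print(majority, baseline_accuracy)
```

The accuracy of such a baseline is simply the share of the majority label in the evaluated data, which is why any useful classifier should beat it.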


To get an idea of how well our simple classifier works, we carry out a cross-validation over the training set and compute the classification accuracy on each fold.

In [3]:
from sklearn.model_selection import cross_val_score

cross_val_score(clf, Xtrain, Ytrain)

array([0.78235294, 0.78235294, 0.77941176, 0.77941176, 0.77941176])

The result is a NumPy array that contains the accuracies on the different folds in the cross-validation. Get the mean accuracy with the .mean() method.

In [4]:
cross_val_score(clf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()

0.7805882352941176
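As an illustration of where the per-fold accuracies come from, the sketch below repeats the cross-validation by hand and compares it with `cross_val_score`. It assumes a synthetic dataset from `make_classification` as a stand-in (you can swap in `Xtrain` and `Ytrain`), and it relies on scikit-learn using stratified folds without shuffling when `cv=5` is given for a classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the CTG dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

model = DummyClassifier(strategy='most_frequent')

# Manual loop: train on four folds, score on the remaining fold.
fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fold_model.predict(X_demo[val_idx])
    fold_scores.append(accuracy_score(y_demo[val_idx], predictions))
fold_scores = np.array(fold_scores)

# The library call should give the same five numbers.
library_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(fold_scores)
print(library_scores)
```

Because every fold trains a cloned, unfitted copy of the model, fitting the classifier yourself before calling `cross_val_score` does not change the scores.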

## Step 3. Trying out some different classifiers
Replace the DummyClassifier with some more meaningful classifier and run the cross-validation again. Try out a few classifiers and see how much you can improve the cross-validation accuracy. Remember, the accuracy is defined as the proportion of correctly classified instances, and we want this value to be high.


Here are some possible options:

Tree-based classifiers:

sklearn.tree.DecisionTreeClassifier

sklearn.ensemble.RandomForestClassifier

sklearn.ensemble.GradientBoostingClassifier


Linear classifiers:

sklearn.linear_model.Perceptron

sklearn.linear_model.LogisticRegression

sklearn.svm.LinearSVC


Neural network classifier (takes longer to train):

sklearn.neural_network.MLPClassifier

You may also try to tune the hyperparameters of the various classifiers to improve the performance. For instance, the decision tree classifier has a parameter that sets the maximum depth, and in the neural network classifier you can control the number of layers and the number of neurons in each layer.
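One possible way to tune such a hyperparameter is a simple loop over candidate values, keeping the one with the best mean cross-validation accuracy. The sketch below does this for `max_depth`; the synthetic data from `make_classification` and the list of candidate depths are assumptions chosen for illustration, so replace them with `Xtrain`, `Ytrain`, and values that suit your experiment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the CTG dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, n_redundant=0,
                                     random_state=0)

# Candidate depths to try; None means the tree grows without a depth limit.
candidate_depths = [1, 2, 3, 5, 8, None]

mean_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    mean_scores[depth] = scores.mean()

# Pick the depth with the highest mean cross-validation accuracy.
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('best max_depth:', best_depth)
```

The same pattern applies to other classifiers, for example looping over different `hidden_layer_sizes` tuples for `MLPClassifier`.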

# The Tree-based classifier using DecisionTreeClassifier 

In [5]:
from sklearn import tree
tclf = tree.DecisionTreeClassifier()

In [6]:
tclf = tclf.fit(Xtrain,Ytrain)

In [7]:
cross_val_score(tclf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()

0.9176470588235294

# The Tree-based classifier using RandomForestClassifier 

In [8]:
from sklearn import ensemble
rfclf = ensemble.RandomForestClassifier()

In [9]:
rfclf = rfclf.fit(Xtrain,Ytrain)

In [10]:
cross_val_score(rfclf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()

0.9394117647058824

# The Tree-based classifier using GradientBoostingClassifier

In [11]:
from sklearn.ensemble import GradientBoostingClassifier
# Use the class imported above instead of relying on the earlier `ensemble` import.
gbclf = GradientBoostingClassifier()

In [12]:
gbclf = gbclf.fit(Xtrain,Ytrain)

In [13]:
cross_val_score(gbclf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()

0.9494117647058824

# Linear classifier using Perceptron

In [14]:
from sklearn import linear_model
lmclf = linear_model.Perceptron()

lmclf = lmclf.fit(Xtrain,Ytrain)

cross_val_score(lmclf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()

0.825294117647059

# Linear classifier using LogisticRegression

In [15]:
from sklearn.linear_model import LogisticRegression
lrclf = LogisticRegression(solver='newton-cg',max_iter=5000)

lrclf = lrclf.fit(Xtrain,Ytrain)

cross_val_score(lrclf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()

0.8905882352941177

# Linear classifier using LinearSVC

In [16]:
from sklearn.svm import LinearSVC
lsclf = LinearSVC(max_iter=10000)

lsclf = lsclf.fit(Xtrain,Ytrain)

cross_val_score(lsclf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()



0.8464705882352941

# Neural network classifier using MLPClassifier

In [17]:
from sklearn import neural_network
mlpclf = neural_network.MLPClassifier()

mlpclf = mlpclf.fit(Xtrain,Ytrain)

cross_val_score(mlpclf, Xtrain, Ytrain, cv=5, scoring='accuracy').mean()

0.876470588235294

## Step 4. Final evaluation
When you have found a classifier that gives a high accuracy in the cross-validation evaluation, train it on the whole training set and evaluate it on the held-out test set. Please include a description of the classifier you selected and report its accuracy below.

According to Step 3, the best cross-validation accuracy, about **95%**, was obtained with the **Gradient Boosting Classifier**. Evaluated on the held-out test set below, it reaches an accuracy of about **93%**.

In [19]:
from sklearn.metrics import accuracy_score
  
gbclf.fit(Xtrain, Ytrain)
Yguess = gbclf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

0.9295774647887324


# Task 2: Decision trees for classification
Import the code from Lecture1.ipynb and use the defined class TreeClassifier as your classifier in an experiment similar to those in Task 1. Tune the hyperparameter max_depth to get the best cross-validation performance, and then evaluate the classifier on the test set.

Please report below what value of max_depth you selected and what accuracy you got.

For illustration, let's also draw a tree. Set max_depth to a reasonably small value, and then call draw_tree to visualize the learned decision tree. Include this tree in your report.
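The `TreeClassifier` class and `draw_tree` from Lecture1.ipynb are not reproduced here. As a stand-in only, the sketch below shows how a shallow tree can be inspected with scikit-learn's own `DecisionTreeClassifier` and `export_text`. The synthetic data and the feature names `f0` to `f3` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption, not the CTG dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical column names

# A small max_depth keeps the tree readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_demo, y_demo)

# Render the learned splits as indented text, one line per node.
tree_text = export_text(small_tree, feature_names=feature_names)
print(tree_text)
```

A depth of two or three usually gives a tree that fits on a page while still showing which features are tested first.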

The accuracy of the classifier on the test set is ____. The best cross-validation accuracy was reached at max_depth = _______. The learned decision tree is visualized below:

# Task 3: A regression example: predicting apartment prices
Here is another dataset. This dataset was created by Sberbank and contains some statistics from the Russian real estate market. Here is the Kaggle page where you can find the original data.

Since we will only be able to handle numerical features and not symbolic ones, we'll work with a simplified version of the dataset, selecting just seven input columns plus the price. The goal is to predict the price of an apartment, given numerical information such as the number of rooms, the size of the apartment in square meters, the floor, etc. Our approach will be similar to what we did in the classification example: load the data, find a suitable model using cross-validation over the training set, and finally evaluate on the held-out test data.

The following code snippet will carry out the basic reading and preprocessing of the data.

In [None]:
import numpy as np  # needed for np.log below
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the CSV file using Pandas.
alldata = pd.read_csv('sberbank.csv')

# Convert the timestamp string to an integer representing the year.
def get_year(timestamp):
    return int(timestamp[:4])
alldata['year'] = alldata.timestamp.apply(get_year)

# Select the 7 input columns and the output column (price_doc).
selected_columns = ['price_doc', 'year', 'full_sq', 'life_sq', 'floor', 'num_room', 'kitch_sq', 'full_all']
alldata = alldata[selected_columns]
alldata = alldata.dropna()

# Shuffle.
alldata_shuffled = alldata.sample(frac=1.0, random_state=0)

# Separate the input and output columns.
X = alldata_shuffled.drop('price_doc', axis=1)
# For the output, we'll use the log of the sales price.
Y = alldata_shuffled['price_doc'].apply(np.log)
# Split into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)
Y.head(100)

We train a baseline dummy regressor (which always predicts the same value) and evaluate it in a cross-validation setting.

This example looks quite similar to the classification example above. The main differences are (a) that we are predicting numerical values, not symbolic values; (b) that we are evaluating using the mean squared error metric, not the accuracy metric that we used to evaluate the classifiers.

In [None]:
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate
m1 = DummyRegressor()
# Keep the targets as floats: casting the log prices to int would truncate them.
cross_validate(m1, Xtrain, Ytrain, scoring='neg_mean_squared_error')
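To illustrate the metric, here is a minimal plain-Python sketch of a constant baseline and the mean squared error. With its default settings, `DummyRegressor` predicts the mean of the training targets, and the `neg_mean_squared_error` scorer reports the error with its sign flipped so that higher is better. The helper name `mean_squared_error_by_hand` and the toy numbers are assumptions for illustration.

```python
def mean_squared_error_by_hand(true_values, predicted_values):
    """Average of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical toy targets, not taken from the Sberbank data.
train_targets = [1.0, 2.0, 3.0, 6.0]
val_targets = [2.0, 4.0]

# The baseline predicts the training mean for every instance.
constant = sum(train_targets) / len(train_targets)
predictions = [constant] * len(val_targets)

mse = mean_squared_error_by_hand(val_targets, predictions)
neg_mse = -mse  # the sign convention used by the 'neg_mean_squared_error' scorer
print(constant, mse, neg_mse)
```

A lower mean squared error is better, so with the negated score a value closer to zero indicates a better model.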

Replace the dummy regressor with something more meaningful and iterate until you cannot improve the performance. Please note that the cross_validate function returns the negative mean squared error.

Some possible regression models that you can try:

sklearn.linear_model.LinearRegression

sklearn.linear_model.Ridge

sklearn.linear_model.Lasso

sklearn.tree.DecisionTreeRegressor

sklearn.ensemble.RandomForestRegressor

sklearn.ensemble.GradientBoostingRegressor

sklearn.neural_network.MLPRegressor
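A comparison of several of these models could be sketched as follows. The synthetic data from `make_regression`, the choice of candidates, and their hyperparameter values are assumptions for illustration. Replace them with `Xtrain`, `Ytrain`, and the models you want to try.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the Sberbank dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0,
                                 random_state=0)

candidates = {
    'dummy': DummyRegressor(),
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'tree': DecisionTreeRegressor(max_depth=5, random_state=0),
}

mean_mse = {}
for name, model in candidates.items():
    result = cross_validate(model, X_demo, y_demo, cv=5,
                            scoring='neg_mean_squared_error')
    # 'test_score' holds the negated MSE per fold; flip the sign back.
    mean_mse[name] = -result['test_score'].mean()

# The best model is the one with the lowest mean squared error.
best_name = min(mean_mse, key=mean_mse.get)
for name, value in mean_mse.items():
    print(name, round(value, 2))
print('best:', best_name)
```

Keeping the dummy regressor in the comparison makes it easy to see how much each model improves over the constant baseline.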

According to the negative mean squared error, the best results were obtained with ______ and the error is about ____.

Finally, train on the full training set and evaluate on the held-out test set:

In [None]:
from sklearn.metrics import mean_squared_error
  
# `regr` is a placeholder for the regressor you selected above.
regr.fit(Xtrain, Ytrain)
mean_squared_error(Ytest, regr.predict(Xtest))

The mean squared error is _____.