<h3>Task 1: A classification example: fetal heart condition diagnosis</h3>

Step 1. Reading the data

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
  
# Read the CSV file.
data = pd.read_csv('ctg.csv', skiprows=1)

# Select the relevant numerical columns.
selected_cols = ['LB', 'AC', 'FM', 'UC', 'DL', 'DS', 'DP', 'ASTV', 'MSTV', 'ALTV',
                 'MLTV', 'Width', 'Min', 'Max', 'Nmax', 'Nzeros', 'Mode', 'Mean',
                 'Median', 'Variance', 'Tendency', 'NSP']
data = data[selected_cols].dropna()

# Shuffle the dataset.
data_shuffled = data.sample(frac=1.0, random_state=0)

# Split into input part X and output part Y.
X = data_shuffled.drop('NSP', axis=1)

# Map the diagnosis code (1, 2 or 3) to a human-readable label.
# Index 0 is unused, so it is filled with None.
def to_label(y):
    return [None, 'normal', 'suspect', 'pathologic'][int(y)]

Y = data_shuffled['NSP'].apply(to_label)

# Partition the data into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)

Step 2. Training the baseline classifier

In [3]:
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='most_frequent')

In [10]:
from sklearn.model_selection import cross_val_score

np.average(cross_val_score(clf, Xtrain, Ytrain))

0.9241176470588235

Step 3. Trying out some different classifiers

In [46]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=0, max_depth=6)

In [39]:
np.average(cross_val_score(dtc, Xtrain, Ytrain, cv=10))

0.9382352941176471

In [40]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=6, random_state=0)
np.average(cross_val_score(rfc, Xtrain, Ytrain, cv=10))

0.9323529411764706

In [59]:
from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression(random_state=0,max_iter=1000, solver='newton-cg')
np.average(cross_val_score(lgr, Xtrain, Ytrain, cv=10))

0.8864705882352941

Step 4. Final evaluation

In [45]:
# Decision tree: accuracy on the test set.
# Hyperparameter: max_depth=6 seemed to give the best result.
from sklearn.metrics import accuracy_score
  
dtc.fit(Xtrain, Ytrain)
Yguess = dtc.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

0.892018779342723
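
The comment above says that max_depth=6 "seems to be the best", but the notebook does not show how the values were compared. One way to make such a search reproducible is a cross-validated grid search. The following is a minimal sketch, not the procedure used here: the synthetic data from `make_classification` stands in for the CTG features, and the candidate depths are an arbitrary, hypothetical grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the CTG data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=21, n_informative=8,
                                     n_classes=3, random_state=0)

# Hypothetical grid of candidate depths; each one is scored with 10-fold cross-validation.
candidate_depths = [2, 4, 6, 8, 10]
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': candidate_depths},
                      cv=10)
search.fit(X_demo, y_demo)

# best_params_ holds the depth with the highest mean cross-validation accuracy.
print(search.best_params_, search.best_score_)
```

Because the search only uses cross-validation on the data passed to `fit`, the test set stays untouched until the final evaluation.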


In [43]:
# Random forest: accuracy on the test set.
# Hyperparameter: max_depth=6 again seemed to give the best result.
rfc.fit(Xtrain, Ytrain)
Yguess = rfc.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

0.9131455399061033


In [60]:
# Logistic regression: accuracy on the test set.
# Hyperparameter: solver
#   'lbfgs'     -> did not work
#   'liblinear' -> 0.87 accuracy
#   'newton-cg' -> 0.89 accuracy (best)
#   'sag'       -> did not work
#   'saga'      -> did not work
lgr.fit(Xtrain, Ytrain)
Yguess = lgr.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

0.892018779342723
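
The solver comment above does not say why some solvers "did not work". One possible explanation, which the notebook does not confirm, is that gradient-based solvers converge slowly when features are on very different scales. The sketch below shows how standardization could be added with a pipeline. The data is synthetic and deliberately rescaled, so it only illustrates the technique and says nothing about the actual CTG results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the CTG data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=21, n_informative=8,
                                     n_classes=3, random_state=0)
# Spread the feature scales over three orders of magnitude, as raw measurements often are.
X_demo = X_demo * np.logspace(0, 3, X_demo.shape[1])

# Scaling inside the pipeline means the scaler is fitted on the training folds only.
scaled_lgr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_scores = cross_val_score(scaled_lgr, X_demo, y_demo, cv=10)
print(scaled_scores.mean())
```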


In cross-validation, the decision tree (0.938) and the random forest (0.932) had the highest average accuracy, whereas logistic regression scored lower (0.886). <br>
On the test set, however, the random forest performed best (0.913). Interestingly, the decision tree and logistic regression reached exactly the same test accuracy (0.892).
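
Two classifiers with the same accuracy do not necessarily make the same mistakes, which matters in a diagnostic setting. A confusion matrix makes the difference visible. The sketch below uses small hand-made label arrays (hypothetical, not taken from the CTG test set) in which two sets of predictions have equal accuracy but different kinds of error.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ['normal', 'suspect', 'pathologic']

# Hypothetical ground truth: 6 normal, 2 suspect, 2 pathologic cases.
y_true = np.array(['normal'] * 6 + ['suspect'] * 2 + ['pathologic'] * 2)

# Predictions A: one pathologic case is mistaken for normal.
pred_a = y_true.copy()
pred_a[8] = 'normal'

# Predictions B: one normal case is mistaken for suspect.
pred_b = y_true.copy()
pred_b[0] = 'suspect'

acc_a = accuracy_score(y_true, pred_a)
acc_b = accuracy_score(y_true, pred_b)

# Rows are true labels and columns are predicted labels, both in the order of `labels`.
cm_a = confusion_matrix(y_true, pred_a, labels=labels)
cm_b = confusion_matrix(y_true, pred_b, labels=labels)

print(acc_a, acc_b)
print(cm_a)
print(cm_b)
```

Both accuracies are equal, yet predictions A miss a pathologic case while predictions B only raise a false alarm. Applying `confusion_matrix` to the `Yguess` of each fitted model would show whether the decision tree and logistic regression differ in this way.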

<h3>Task 2: Decision trees for classification</h3>

<h3>Task 3: A regression example: predicting apartment prices</h3>

In [62]:
# Read the CSV file using Pandas.
alldata = pd.read_csv('sberbank.csv')

# Convert the timestamp string to an integer representing the year.
def get_year(timestamp):
    return int(timestamp[:4])
alldata['year'] = alldata.timestamp.apply(get_year)

# Select the 7 input columns and the output column.
selected_columns = ['price_doc', 'year', 'full_sq', 'life_sq', 'floor', 'num_room', 'kitch_sq', 'full_all']
alldata = alldata[selected_columns]
alldata = alldata.dropna()

# Shuffle.
alldata_shuffled = alldata.sample(frac=1.0, random_state=0)

# Separate the input and output columns.
X = alldata_shuffled.drop('price_doc', axis=1)
# For the output, we'll use the log of the sales price.
Y = alldata_shuffled['price_doc'].apply(np.log)

# Split into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)

In [63]:
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate
m1 = DummyRegressor()
cross_validate(m1, Xtrain, Ytrain, scoring='neg_mean_squared_error')

{'fit_time': array([0.        , 0.        , 0.00771284, 0.        , 0.        ]),
 'score_time': array([0., 0., 0., 0., 0.]),
 'test_score': array([-0.39897319, -0.37113485, -0.38083108, -0.39057156, -0.40475168])}

In [64]:
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
lso = linear_model.Lasso(alpha=0.1)

In [66]:
lso.fit(Xtrain, Ytrain)
mean_squared_error(Ytest, lso.predict(Xtest))

0.32604901387710894

In [67]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(max_depth=2, random_state=0)
rfr.fit(Xtrain, Ytrain)

mean_squared_error(Ytest, rfr.predict(Xtest))

0.30696994385672405

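For this task we tried Lasso and a random forest regressor and evaluated both with the mean squared error (MSE) on the test set, where lower is better. The random forest regressor (MSE 0.307) performed better than Lasso (MSE 0.326). For reference, the dummy regressor's cross-validated MSE on the training set ranged from about 0.37 to 0.40.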
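
The baseline above was scored with cross-validation while the other two models were scored on the test set, so the numbers are not directly comparable. A like-for-like comparison could run every model through the same cross-validation. The following is a minimal sketch on synthetic data from `make_regression` (an assumption, not the Sberbank data). Note that scikit-learn reports `neg_mean_squared_error`, so the sign is flipped to obtain the MSE.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate

# Synthetic stand-in with 7 input columns (assumption: not the real dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)

models = {
    'dummy': DummyRegressor(),
    'lasso': Lasso(alpha=0.1),
    'forest': RandomForestRegressor(max_depth=2, n_estimators=50, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    result = cross_validate(model, X_demo, y_demo, cv=5,
                            scoring='neg_mean_squared_error')
    # Scores are negated MSE values; flip the sign so that lower means better.
    cv_mse[name] = -result['test_score'].mean()
    print(name, cv_mse[name])
```

The ranking on this synthetic, purely linear data says nothing about the ranking on the apartment data; the point is only that all models are measured the same way before the test set is used.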