<a href="https://colab.research.google.com/github/OSGeoLabBp/tutorials/blob/master/english/data_processing/lessons/ml_tutor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction to machine learning

Working examples introduce common machine learning tasks.


##Predicting wine quality from parameters

We use a public wine quality dataset. Let's download it!

In [2]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

--2023-12-17 20:28:46--  http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘winequality-white.csv’

winequality-white.c     [ <=>                ] 258.23K  1.27MB/s    in 0.2s    

2023-12-17 20:28:46 (1.27 MB/s) - ‘winequality-white.csv’ saved [264426]



There are eleven parameters and a quality column in the downloaded CSV file. The first few lines are shown below (the column headers are in the first line):

In [3]:
!head winequality-white.csv

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
6.2;0.32;0.16;7;0.045;30;136;0.9949;3.18;0.47;9.6;6
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6


We will use different methods to predict quality from the parameters.

###Multiple regression

Suppose there is a linear relationship between the parameters and the quality.

In [49]:
import pandas
df = pandas.read_csv("winequality-white.csv", sep=';')
print(len(df.index))
df.head()

4898


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


Let's separate the training and test data sets. 25% of the records will be used for **testing** the model.

In [5]:
features = list(df.columns)[:-1]
X = df[features]
y = df[df.columns[-1]]

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 22)
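
As a quick sanity check of the split sizes, the sketch below reproduces the arithmetic for 4898 records and `test_size = 0.25`. It assumes that the test count is rounded up and the remaining records go to training; the helper name `split_sizes` is made up for illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remaining
    records form the training set.
    """
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(4898, 0.25)
print(n_train, n_test)  # 3673 1225
```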

In [8]:
# creating the model
from sklearn import linear_model
regr = linear_model.LinearRegression()
_ = regr.fit(X_train, y_train)

In [16]:
# testing the model: adding 0.5 and truncating rounds each prediction to the nearest integer score
predicted = (regr.predict(X_test)+0.5).astype(int)
print(type(predicted))
diff = y_test - predicted
print(f"mean difference: {diff.mean():.1f} mean error: {diff.std():.1f} min diff {diff.min():.1f} max diff {diff.max():.1f}")

<class 'numpy.ndarray'>
mean difference: 0.0 mean error: 0.8 min diff -3.0 max diff 5.0
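
The expression `(regr.predict(X_test)+0.5).astype(int)` turns the continuous regression output into integer quality scores by adding 0.5 and truncating. Note also that `diff.std()` is the standard deviation of the differences, so the value printed as "mean error" measures spread rather than an average error. The sketch below illustrates both points in plain Python; the sample numbers are made up for illustration only.

```python
import statistics

# Hypothetical continuous predictions and true integer scores.
raw_pred = [5.4, 5.5, 6.49, 4.2, 7.8]
y_true = [5, 6, 7, 5, 7]

# Adding 0.5 and truncating rounds positive values to the nearest integer.
rounded = [int(p + 0.5) for p in raw_pred]
print(rounded)  # [5, 6, 6, 4, 8]

diff = [t - p for t, p in zip(y_true, rounded)]
print(diff)  # [0, 0, 1, 1, -1]

mean_diff = statistics.mean(diff)            # signed errors can cancel out
spread = statistics.stdev(diff)              # sample standard deviation, like pandas' std()
mae = statistics.mean(abs(d) for d in diff)  # mean absolute error
print(f"mean difference: {mean_diff:.1f} std: {spread:.1f} MAE: {mae:.1f}")
```

A mean difference close to zero only says that over- and underestimates balance out; the mean absolute error is the more direct measure of how far off a typical prediction is.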


In [18]:
from sklearn.metrics import accuracy_score
print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred = (regr.predict(X_train)+0.5).astype(int)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred = predicted))

Train data accuracy: 0.5181050912060986
Test data accuracy: 0.5232653061224489
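
`accuracy_score` reports the fraction of predictions that match the true label exactly, so a prediction that is off by one quality point counts as wrong just like one that is off by three. A minimal plain-Python sketch of that definition, with a hypothetical helper `exact_match_accuracy` and made-up labels:

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_true = [6, 6, 5, 7, 6, 5, 8, 6]
y_pred = [6, 5, 5, 6, 6, 6, 6, 6]
print(exact_match_accuracy(y_true, y_pred))  # 0.5
```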


###Logistic regression

In [31]:
from sklearn.linear_model import LogisticRegression

In [32]:
logit = LogisticRegression(max_iter = 100000, C=1.0)
logit.fit(X_train, y_train)
print(logit.score(X_train, y_train))

0.5366185679281241


In [33]:
log_pred = logit.predict(X_test)
diff = y_test - log_pred
print(f"mean difference: {diff.mean():.1f} mean error: {diff.std():.1f} min diff {diff.min():.1f} max diff {diff.max():.1f}")

mean difference: 0.1 mean error: 0.8 min diff -3.0 max diff 3.0


In [34]:
print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred = logit.predict(X_train)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred = log_pred))

Train data accuracy: 0.5366185679281241
Test data accuracy: 0.5355102040816326


###Decision trees

In [19]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

In [20]:
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X_train, y_train)

In [21]:
tree_pred = dtree.predict(X_test)
diff = y_test - tree_pred
print(f"mean difference: {diff.mean():.1f} mean error: {diff.std():.1f} min diff {diff.min():.1f} max diff {diff.max():.1f}")

mean difference: -0.0 mean error: 0.8 min diff -3.0 max diff 4.0


In [24]:
print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred = dtree.predict(X_train)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred = tree_pred))

Train data accuracy: 1.0
Test data accuracy: 0.6114285714285714
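
A training accuracy of 1.0 against a test accuracy of about 0.61 suggests that the fully grown tree memorizes the training set. One common way to limit this is to cap the tree depth. The sketch below is not tuned for the wine data: it uses a synthetic dataset from `make_classification`, and `max_depth=3` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is not the wine dataset).
X, y = make_classification(n_samples=1000, n_features=11, n_informative=5,
                           n_classes=3, random_state=22)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=22)

# Fully grown tree: fits the training data perfectly.
full = DecisionTreeClassifier(random_state=22).fit(X_train, y_train)
# Depth-limited tree: cannot memorize every training sample.
shallow = DecisionTreeClassifier(max_depth=3, random_state=22).fit(X_train, y_train)

for name, model in [("full", full), ("max_depth=3", shallow)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```

Comparing the gap between training and test accuracy for several depths is a simple way to pick a value; whether a shallower tree actually helps on the wine data would have to be checked by running it.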


###Neural network

A rule of thumb for the number of neurons in the hidden layers:

$N_h = \frac {N_s} {\alpha \cdot (N_i + N_o)}$

$N_s$ - number of samples in the training data set

$\alpha$ - scaling factor between 2 and 10

$N_i$ - number of input neurons

$N_o$ - number of output neurons
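
As a worked example of this formula, the sketch below evaluates $N_h$ at the two ends of the $\alpha$ range. The inputs are assumptions: 3673 training samples (75% of the 4898 records), 11 input neurons (one per parameter) and 7 output neurons (one per quality class, assuming the scores 3 to 9 all occur). The function name `hidden_neurons` is hypothetical.

```python
def hidden_neurons(n_samples, n_inputs, n_outputs, alpha):
    """Rule-of-thumb hidden neuron count: N_s / (alpha * (N_i + N_o))."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return n_samples / (alpha * (n_inputs + n_outputs))

# Assumed values, see the text above.
N_S, N_I, N_O = 3673, 11, 7

for alpha in (2, 10):
    print(f"alpha={alpha}: N_h = {hidden_neurons(N_S, N_I, N_O, alpha):.0f}")
# alpha=2: N_h = 102
# alpha=10: N_h = 20
```

Under these assumptions the rule suggests roughly 20 to 100 hidden neurons in total.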

In [43]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, max_iter=10000,
                    hidden_layer_sizes=(50, 30), random_state=1)
_ = clf.fit(X_train, y_train)

In [44]:
print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred = clf.predict(X_train)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred = clf.predict(X_test)))

Train data accuracy: 0.5641165260005445
Test data accuracy: 0.5461224489795918


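Scaling the data: neural networks are sensitive to the scale of the input features, so we standardize the features before training.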
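
`StandardScaler` standardizes each feature to zero mean and unit variance, $z = (x - \mu) / \sigma$, using the population standard deviation. A plain-Python sketch of that transformation on a single column follows; the helper `standardize` is hypothetical, and the values are the first five `alcohol` entries shown earlier.

```python
import statistics

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Simplification of this sketch: constant columns are rejected.
        raise ValueError("constant column cannot be standardized")
    return [(v - mu) / sigma for v in values]

alcohol = [8.8, 9.5, 10.1, 9.9, 9.9]
scaled = standardize(alcohol)
print([round(z, 2) for z in scaled])
```

Wrapping the scaler and the classifier in one pipeline means the mean and standard deviation are learned from the training data only and then reused unchanged on the test data.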

In [52]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(solver='lbfgs', alpha=1e-5, max_iter=10000, hidden_layer_sizes=(10, 10), random_state=1))
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

KeyboardInterrupt: ignored