# Machine learning: Recap

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry (also known as multivariate data), it is said to have several attributes or features.

# Training set and testing set

Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
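
As a minimal sketch of this idea, the split can be written by hand with NumPy. The function name `split_train_test`, the 20 % test fraction and the toy arrays are assumptions chosen only for illustration.

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle the samples, then hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # random order of sample indices
    n_test = int(round(test_fraction * len(X)))  # number of held-out samples
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): 10 samples with 2 features each, one label per sample
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = split_train_test(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Every sample ends up in exactly one of the two sets, so the test set contains only data the model has not seen during training.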

# The problem setting


Which approach should I take for my problem?

<figure>
<img src="figs/Scikit.png" width='1000'>
<figcaption></figcaption>
</figure>



 The choice of estimator depends on how the data are distributed, the precision we need, the size of our data set, and how much time and computational effort we can afford.
 

 

 
 
 ### Many classification methods exist:
 Which method would you choose?
 
 
 #### Note: Classification is different from regression:

- Classification: predicting a categorical output (a class label)
- Regression: predicting continuous-valued attribute(s)
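
As a small illustration of the difference, the sketch below fits a classifier and a regressor on the same toy inputs. The data and the choice of scikit-learn's k-nearest-neighbours estimators are assumptions made only for this example.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy inputs (hypothetical): one feature per sample
X_toy_1d = [[0.0], [1.0], [2.0], [3.0]]
y_category = ["a", "a", "b", "b"]   # categorical target -> classification
y_value = [0.0, 1.0, 2.0, 3.0]      # continuous target  -> regression

clf = KNeighborsClassifier(n_neighbors=2).fit(X_toy_1d, y_category)
reg = KNeighborsRegressor(n_neighbors=2).fit(X_toy_1d, y_value)

# The two nearest training points to 2.6 are 3.0 and 2.0
predicted_class = clf.predict([[2.6]])[0]   # majority label of the neighbours
predicted_value = reg.predict([[2.6]])[0]   # mean target of the neighbours
print(predicted_class, predicted_value)     # b 2.5
```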


 
 For example, the figure below illustrates how different classification methods behave on three different data distributions.
 
 
 
 
 
<figure>
<img src="figs/InputData2.png" width='1100'>
<figcaption></figcaption>
</figure>

Let us start with the two main learning approaches.

 # Supervised learning:
 Labeled information is available and can be used for learning.

## Example: K-Nearest Neighbours classification

KNN is a supervised machine learning algorithm and one of the simplest and most useful techniques for classification. The prediction for each new data point is based on the k most similar points in our stored dataset. Similarity can be measured with any distance metric, such as the Euclidean distance; the k nearest neighbours of the new data point are then the k stored points with the smallest distance to it.
The k-nearest-neighbour classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let $$x_i$$ be an input sample with p features $$(x_{i1},x_{i2},...,x_{ip})$$, and let n be the total number of input samples (i=1,2,...,n). The Euclidean distance between samples $$x_i$$ and $$x_l$$ (l=1,2,...,n) is defined as:


$$d(x_i,x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \dots + (x_{ip} - x_{lp})^2}$$
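
The formula translates almost directly into NumPy. This is a minimal sketch; the function name `euclidean_distance` and the two sample vectors are illustrative assumptions.

```python
import numpy as np

def euclidean_distance(x_i, x_l):
    """Square root of the summed squared feature differences."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_l, dtype=float)
    return np.sqrt(np.sum(diff ** 2))

# Two hypothetical samples with p = 4 features
sample_a = [0.0, 0.0, 0.0, 0.0]
sample_b = [1.0, 2.0, 2.0, 4.0]

distance_ab = euclidean_distance(sample_a, sample_b)
print(distance_ab)  # sqrt(1 + 4 + 4 + 16) = 5.0
```

`np.linalg.norm(np.array(sample_a) - np.array(sample_b))` gives the same value.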

### Using an example dataset
Let us start with a famous dataset of flower species, the Iris dataset. It is a multiclass dataset with 4 attributes that are used to classify the samples:

- Sepal length 
- Sepal width 
- Petal length 
- Petal width 

and the class names are:

- Iris-setosa
- Iris-versicolor
- Iris-virginica



### Algorithm steps

STEP 1: Choose the number K of neighbors

STEP 2: Take the K nearest neighbors of the new data point, according to your distance metric

STEP 3: Among these K neighbors, count the number of data points in each category

STEP 4: Assign the new data point to the category where you counted the most neighbors
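
The four steps can be sketched from scratch as follows. This is an illustration with Euclidean distance and a made-up toy dataset, not the scikit-learn implementation; the function name `knn_predict` is hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_known, y_known, x_new, k=3):
    """Classify x_new by majority vote among its k nearest known points."""
    # STEP 1 is the choice of k (passed in as an argument)
    # STEP 2: distances to all known points, then the indices of the k smallest
    distances = np.sqrt(np.sum((X_known - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    # STEP 3: count how many of these neighbours fall in each category
    votes = Counter(y_known[nearest].tolist())
    # STEP 4: return the category with the most votes
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]))
pred_high = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]))
print(pred_low, pred_high)  # 0 1
```

An odd k avoids tied votes when there are two classes.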


### Import libraries and load the dataset.


The Iris dataset includes three iris species with 50 samples each, as well as some properties of each flower.



<figure>
<img src="figs/Iris.png" width='600'>
<figcaption></figcaption>
</figure>

### Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.


In [None]:
import numpy as np
import pandas as pd


dataset = pd.read_csv("Data/iris.csv")
# print(dataset)
print(dataset.shape)  # (number of rows, number of columns)
dataset.head(6)  # head(n) returns the first n rows
# dataset


Number of instances (rows) of each class:

In [None]:
dataset.groupby('Species').size()

###  Dividing data into features and labels
The dataset contains six columns: Id, SepalLength [cm], SepalWidth [cm], PetalLength [cm], PetalWidth [cm] and Species. The features are in columns 1-4 (counting from zero, so the Id column at index 0 is skipped), and the last column contains the labels of the samples. First, we need to split the data into two arrays: X (features) and y (labels).



In [None]:
X = dataset.iloc[:, 1:5].values
y = dataset.iloc[:, 5].values


### Label encoding
The labels are categorical strings. scikit-learn classifiers such as KNeighborsClassifier can be trained on string labels directly, but encoding them as integers with LabelEncoder is convenient for later processing. Iris-setosa corresponds to 0, Iris-versicolor corresponds to 1 and Iris-virginica corresponds to 2.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)  # Fit label encoder and return encoded labels.
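
To see which integer belongs to which class, the fitted encoder can be inspected. The short list of species below is a stand-in for the real label column, used only to keep this sketch self-contained.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in labels (hypothetical order), not the full Iris label column
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(codes)                  # [2 0 1 0]
print(encoder.classes_)       # classes in sorted order; the index is the integer code
decoded = encoder.inverse_transform([0, 1, 2])  # map integer codes back to names
print(decoded)
```

The classes are sorted alphabetically before numbering, which is why Iris-setosa receives 0.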


### Splitting dataset into training set and test set

In [None]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# test_size=0.2 holds out 20 % of the samples for testing.
# random_state fixes the shuffling applied before the split, so the result is reproducible.



###  Data Visualization
 - Parallel Coordinates
 
 Parallel coordinates is a technique for plotting multivariate data. It allows one to see clusters in data and to estimate other statistics visually. Each vertical axis represents one attribute, and each data point is drawn as a set of connected line segments crossing these axes. Points that tend to cluster will appear closer together.
 
 #### Import data visualization libraries

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from pandas.plotting import parallel_coordinates
plt.figure(figsize=(15,10))
parallel_coordinates(dataset.drop("Id", axis=1), "Species")
plt.title('Parallel Coordinates Plot', fontsize=20, fontweight='bold')
plt.xlabel('Features', fontsize=15)
plt.ylabel('Feature values', fontsize=15)
plt.legend(loc=1, prop={'size': 15}, frameon=True,shadow=True, facecolor="white", edgecolor="black")
plt.show()

## Using KNN for classification
### Making predictions

In [None]:
# Fitting classifier to the Training set
# Loading libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score

# Instantiate learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)

# Fitting the model
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
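
The number of neighbours was fixed by hand here. A minimal sketch of how it could instead be chosen with cross-validation, using `cross_val_score`, is given next. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) instead of `Data/iris.csv` so that it runs on its own; the candidate values of k and `cv=5` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data instead of the CSV file used in this notebook
X_iris, y_iris = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 16, 2):  # odd values of k only
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this value of k
    fold_scores = cross_val_score(model, X_iris, y_iris, cv=5, scoring="accuracy")
    cv_scores[k] = fold_scores.mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_scores[best_k], 3))
```

In practice the cross-validation would be run on the training set only, so that the test set stays untouched for the final evaluation.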

### Evaluating predictions
#### Model accuracy


In [None]:
accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy of our model is : ' + str(round(accuracy, 2)) + ' %.')


#### Confusion matrix generation.

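A confusion matrix summarizes the performance of the model on the test data: entry (i, j) counts the samples of true class i that were predicted as class j. Correct predictions lie on the diagonal, so the larger the diagonal elements are relative to the off-diagonal ones, the better the model performs.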


In [None]:
cm = confusion_matrix(y_test, y_pred)

print (cm)
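
To make the reading of a confusion matrix concrete, here is a small sketch with made-up labels (hypothetical values, not the Iris predictions). It also derives the overall accuracy and the per-class recall from the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true_demo = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 0, 1, 2, 1, 2, 2, 1, 2]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
# [[2 0 0]
#  [0 2 1]
#  [0 1 3]]

# Accuracy: correctly classified samples (diagonal) over all samples
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()
# Recall per class: diagonal entry divided by the number of true samples in that row
recall_demo = np.diag(cm_demo) / cm_demo.sum(axis=1)
print(round(accuracy_demo, 3), recall_demo)
```

Rows correspond to the true classes and columns to the predicted classes, so the single 1 in row 1, column 2 is a class-1 sample that was predicted as class 2.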




 
 # Unsupervised learning:
 No labels are available (at least initially), so the learning algorithm needs to find structure in the data on its own.

## Clustering 

There are several clustering methods:


<figure>
<img src="figs/clustering.png" width='900'>
<figcaption></figcaption>
</figure>



<figure>
<img src="figs/Cluster_table.png" width='1100'>
<figcaption></figcaption>
</figure>




## Example: DBSCAN
Density-Based Spatial Clustering of Applications with Noise.

DBSCAN is controlled by two parameters:

- 1) Epsilon: The maximum distance (Euclidean distance) between two points for them to be considered neighbors. Two points are neighbors if and only if they are separated by a distance less than or equal to epsilon.
- 2) MinPoints: The minimum number of points required to form a dense cluster.
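
How the two parameters work together can be sketched with NumPy: a point counts as a core point of a dense region if at least MinPoints points (including itself, as in scikit-learn's `min_samples`) lie within distance epsilon. The function name `find_core_points` and the five points are assumptions for illustration; this covers only the core-point test, not the full clustering.

```python
import numpy as np

def find_core_points(points, eps, min_points):
    """Boolean mask of points with at least min_points neighbours within eps."""
    # Pairwise Euclidean distances between all points
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Number of points within eps of each point (the point itself is included)
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_points

# Hypothetical data: four points close together and one isolated point
pts = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [5.0, 5.0]])

core_mask = find_core_points(pts, eps=1.0, min_points=4)
print(core_mask)  # [ True  True  True  True False]
```

The isolated point has no neighbours within epsilon, so DBSCAN would label it as noise.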

<figure>
<img src="figs/DBSCAN.png" width='650'>
<figcaption></figcaption>
</figure>

## DBSCAN example


In [None]:
from sklearn.cluster import DBSCAN
from sklearn import datasets
from sklearn.datasets import make_blobs

In [None]:
# Configuration options
num_samples_total = 1000
cluster_centers = [(3,3), (7,7)]
num_classes = len(cluster_centers)
epsilon = 1.0
min_samples = 13

# Generate data: two Gaussian blobs with 2 features each around the fixed centers
X, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers, n_features=2, cluster_std=0.5)


The shape of the generated random points:

In [None]:
np.shape(X)

In [None]:
# Plot all generated points in a single call
plt.plot(X[:, 0], X[:, 1], 'o',
         markeredgecolor='k', markersize=5)
plt.title('The distribution of generated random data')

plt.show()

In [None]:
# Compute DBSCAN
db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X)
labels = db.labels_

In [None]:
# The label -1 marks noise points and must not be counted as a cluster
no_clusters = len(set(labels)) - (1 if -1 in labels else 0)
no_noise = np.sum(labels == -1)

print('Estimated no. of clusters: %d' % no_clusters)
print('Estimated no. of noise points: %d' % no_noise)
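
The result depends strongly on the choice of epsilon. The following sketch repeats the clustering for a few values on the same kind of data; the list of epsilon values and the fixed `random_state` are assumptions added here for illustration and reproducibility.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Same kind of data as in this example; random_state is an added assumption
X_blobs, _ = make_blobs(n_samples=1000, centers=[(3, 3), (7, 7)],
                        cluster_std=0.5, random_state=0)

results = {}
for eps_value in (0.1, 0.3, 1.0):
    cluster_labels = DBSCAN(eps=eps_value, min_samples=13).fit(X_blobs).labels_
    # Exclude the noise label -1 when counting clusters
    n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
    n_noise = int(np.sum(cluster_labels == -1))
    results[eps_value] = (n_clusters, n_noise)
    print(f"eps={eps_value}: {n_clusters} clusters, {n_noise} noise points")
```

A very small epsilon leaves most points without enough neighbours, so they are reported as noise; a larger epsilon recovers the two blobs.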

In [None]:
# Scatter plot of the clustering result; noise points (label -1) are drawn in gray
palette = {0: '#b40426', 1: '#3b4cc0', -1: '#7f7f7f'}
colors = [palette.get(label, '#000000') for label in labels]
plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True)
plt.title('Two clusters with data')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()

An example of the application of the DBSCAN clustering algorithm to the detection of damage sites in dual-phase steel can be found here.
As you can see, the damage sites are darker than the rest of the microstructure; however, there are also some thin dark shadows which may be misinterpreted as damage.

<figure>
<img src="figs/20.png" width='540'>
<figcaption></figcaption>
</figure>



### <span style="color: red"> Exercise:</span> Load the Mag-Al-Ca alloy from the figs folder and apply DBSCAN after thresholding.
