# Introduction


## Data Challenge for the course Machine Learning and Data Mining

### Authors: 
#### Pavlo Mozharovskyi (pavlo.mozharovskyi@telecom-paris.fr), Awais Hussain Sani, Stephan Clémençon


# Supervised anomaly detection

**Anomaly detection** (or **outlier detection**) comprises the **machine learning** methods that aim to identify observations which exhibit suspicious behaviour and are very likely to cause a problem.

This data set is provided by Valeo, a French company that is one of the largest original equipment manufacturers. It concerns the task of supervised anomaly detection on a production line. For each produced item, a set of properties is measured, and a final testing procedure verifies that the item is intact. Thus, all observations are labeled as normal or defective (anomalies), with anomalies being rare.

Statistically, this is formalized as supervised anomaly detection, because the correct labels are given during training. It can also be seen as a supervised classification task with two heavily imbalanced classes.

You are asked to construct an anomaly detection rule which, for each new observation, provides an anomaly score: the more abnormal the observation, the higher the score. This would make it possible to detect anomalies based only on the measured parameters of an item, without running a mechanical testing procedure.

# The properties of the dataset:


The data set is provided by Valeo and consists of measurements of 27 properties of produced items, together with labels identifying whether each item is intact or defective (= anomaly).

### Training data: 

The training set consists of two files, **valeo_xtrain.csv** and **valeo_ytrain.csv**.

File **valeo_xtrain.csv** contains one observation per row, each observation having 27 entries.

File **valeo_ytrain.csv** contains one observation per row, each observation having 1 entry identifying whether it is an anomaly (**1**) or not (**0**).

There are in total **27586** training observations.

### Test data:

The testing set consists of one file, **valeo_xtest.csv**, which has the same structure as file **valeo_xtrain.csv**.

There are in total **27587** test observations.

### Remark:

The task of **supervised anomaly detection** can be difficult because the classes are heavily imbalanced.

## The performance criterion:

You should submit a file that contains, in each row, the anomaly score (a real value) for the observation in the corresponding row of the file **valeo_xtest.csv**. For a sample submission please see the code below. Please note that your score should provide an ordering which allows anomalies to be identified, i.e., the higher the value of the score, the **more abnormal** the observation should be considered.

The performance criterion is the **Area Under the Receiver Operating Characteristic** (AUC), see also:
https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
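
To make the criterion concrete, the sketch below computes the AUC as the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score, with ties counted as one half. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the challenge material. The second call illustrates that only the ordering of the scores matters: a strictly increasing transform leaves the AUC unchanged.

```python
from itertools import product
import math


def pairwise_auc(labels, scores):
    """AUC as the fraction of (anomaly, normal) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one observation of each class")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values, not taken from the Valeo data)
labels = [0, 0, 1, 0, 1, 0]
scores = [-2.0, -0.5, 0.3, 0.8, 1.5, -1.0]

auc_raw = pairwise_auc(labels, scores)
# A strictly increasing transform (here the logistic function) keeps the ordering
auc_squashed = pairwise_auc(labels, [1.0 / (1.0 + math.exp(-s)) for s in scores])
print(auc_raw, auc_squashed)
```

This pairwise loop is quadratic in the number of observations, so it is meant for understanding only; on the full data set a library routine such as `sklearn.metrics.roc_auc_score` is the practical choice.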

# Training Data

Training data, input (file **valeo_xtrain.csv**): https://partage.imt.fr/index.php/s/W3WDoTmB6jJrPZp

Training data, output (file **valeo_ytrain.csv**): https://partage.imt.fr/index.php/s/YAXDEXx6XJtf3X8

# Test Data 

Test data, input (file **valeo_xtest.csv**): https://partage.imt.fr/index.php/s/TCoKd6DMegpmmqL

# Example submission

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

### Load and investigate the data

In [2]:
xtrain = pd.read_csv("valeo_xtrain.csv")
print(xtrain.shape)
xtrain.head()

(27586, 27)


Unnamed: 0,START2_OP020_V_1angle,START2_OP020_V_1torque,START2_OP020_V_2angle,START2_OP020_V_2torque,START2_OP040_Vision_cosseprog,START2_OP050_Vision_paliermodel,START2_OP050_Vision_palierpresencedouille,START2_OP060_Vision_tirantcouleur,START2_OP070_V_1angle,START2_OP070_V_1prog,...,START2_OP090_SnapRingFinalStroke,START2_OP090_SnapRingMidPointForce,START2_OP090_SnapRingPeakForce,START2_OP090_StartLinePeakForce,START2_OP100_Capuchon_insertionmesure,START2_OP110_Vissage_M8angle,START2_OP110_Vissage_M8prog,START2_OP110_Vissage_M8torque,START2_OP120_RodageI_mesure,START2_OP120_RodageU_mesure
0,35.7,3.76,49.1,3.78,300.0,1.0,1.0,2.0,111.7,8.0,...,11.6,71.52,122.23,20.57,0.55,34.7,2.0,9.54,126.96,11.97
1,47.2,3.77,50.3,3.76,30.0,1.0,1.0,2.0,106.0,8.0,...,11.82,67.38,163.78,18.73,0.55,38.7,2.0,9.54,133.88,11.97
2,52.7,3.78,40.4,3.78,300.0,1.0,1.0,2.0,103.4,8.0,...,11.86,89.09,207.73,26.39,0.55,30.2,2.0,9.66,135.28,11.97
3,34.9,3.77,34.9,3.78,1000.0,2.0,1.0,1.0,146.0,7.0,...,11.47,93.45,177.31,25.73,0.59,17.6,1.0,12.06,116.51,11.97
4,50.0,3.77,41.9,3.75,400.0,1.0,1.0,2.0,115.8,8.0,...,11.88,85.17,174.73,21.5,0.42,52.0,1.0,12.12,140.92,11.98


In [3]:
ytrain = pd.read_csv("valeo_ytrain.csv")
print(ytrain.shape)
ytrain.head()

(27586, 1)


Unnamed: 0,Anomaly
0,0
1,0
2,0
3,0
4,0


In [4]:
ytrain["Anomaly"].value_counts() / len(ytrain)

0    0.974879
1    0.025121
Name: Anomaly, dtype: float64
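
With only about 2.5% anomalies, one common way to account for the imbalance is to reweight the classes inversely to their frequencies. The sketch below applies the usual "balanced" heuristic to the proportions printed above; it is one possible option, not a prescribed part of the challenge, and the variable names are illustrative.

```python
# Class proportions copied from the value_counts() output above
proportions = {0: 0.974879, 1: 0.025121}
n_classes = len(proportions)

# "Balanced" heuristic:
#   weight_c = n_samples / (n_classes * count_c) = 1 / (n_classes * proportion_c)
class_weight = {c: 1.0 / (n_classes * p) for c, p in proportions.items()}
print(class_weight)
```

Under this heuristic each anomaly counts roughly 20 times as much as a normal item. The resulting dictionary could be passed as the `class_weight` argument of `SGDClassifier` or `LogisticRegression`; passing `class_weight='balanced'` applies the same formula automatically.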

### Read in the test data

In [5]:
xtest = pd.read_csv("valeo_xtest.csv")
print(xtest.shape)
xtest.head()

(27587, 27)


Unnamed: 0,START2_OP020_V_1angle,START2_OP020_V_1torque,START2_OP020_V_2angle,START2_OP020_V_2torque,START2_OP040_Vision_cosseprog,START2_OP050_Vision_paliermodel,START2_OP050_Vision_palierpresencedouille,START2_OP060_Vision_tirantcouleur,START2_OP070_V_1angle,START2_OP070_V_1prog,...,START2_OP090_SnapRingFinalStroke,START2_OP090_SnapRingMidPointForce,START2_OP090_SnapRingPeakForce,START2_OP090_StartLinePeakForce,START2_OP100_Capuchon_insertionmesure,START2_OP110_Vissage_M8angle,START2_OP110_Vissage_M8prog,START2_OP110_Vissage_M8torque,START2_OP120_RodageI_mesure,START2_OP120_RodageU_mesure
0,35.1,3.78,37.4,3.77,700.0,2.0,1.0,1.0,144.1,7.0,...,11.58,78.73,137.7,27.91,0.41,15.0,1.0,12.14,114.68,11.97
1,35.2,3.75,33.2,3.75,1300.0,2.0,1.0,1.0,153.5,7.0,...,12.12,71.78,135.23,16.29,0.23,23.8,1.0,12.25,120.63,11.97
2,46.1,3.79,36.1,3.77,1100.0,2.0,1.0,1.0,131.1,7.0,...,11.92,67.03,161.63,20.99,0.12,16.8,1.0,12.17,142.24,11.97
3,35.0,3.76,43.7,3.79,1300.0,2.0,1.0,1.0,147.1,7.0,...,11.93,76.06,123.96,23.47,0.23,18.5,1.0,12.19,116.94,11.97
4,44.9,3.76,31.9,3.79,700.0,2.0,1.0,1.0,163.1,7.0,...,11.59,62.3,126.82,35.31,0.41,13.5,1.0,12.18,120.41,11.97


### Fit the prediction model

In [6]:
from sklearn.linear_model import SGDClassifier

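# Logistic regression fitted by stochastic gradient descent.
# Note: scikit-learn >= 1.1 names this loss 'log_loss'; the alias 'log' was removed in version 1.3.
sgd_clf = SGDClassifier(loss='log', max_iter=1000, tol=1e-3, random_state=42)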
sgd_clf.fit(xtrain, ytrain.values.ravel())

SGDClassifier(loss='log', random_state=42)

In [7]:
from sklearn.linear_model import LogisticRegression
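# Alternative detector: logistic regression fitted with the liblinear solver
# (fitted for comparison only; the scores computed below come from sgd_clf)
clf = LogisticRegression(solver='liblinear', max_iter=1000)
clf.fit(xtrain, ytrain.values.ravel())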


LogisticRegression(max_iter=1000, solver='liblinear')

In [8]:
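# Anomaly scores for the test set: signed confidence scores of the linear model
# (higher = more likely to belong to class 1, i.e. more abnormal)
sscore = sgd_clf.decision_function(xtest)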

In [9]:
sscore[:25]

array([ -912.4139287 ,  -331.34268546,  -475.3622375 ,  -609.56123036,
       -1406.98332814,  -390.84703435,  -238.53503941,  -502.40931293,
         191.15829818,   419.89855687,  -421.19820258,  -309.76577504,
         145.72661621,   -84.35325758,  -199.92239749,  -436.35773698,
        -517.06599948,  -504.04019375,  -948.85166833,   512.34586851,
        -785.56448134,  -303.59371121,  -297.4278979 ,   475.20559141,
        -600.16062477])
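
Since the test labels are not available, the AUC of a candidate model can only be estimated locally, for instance on a held-out part of the training data. The sketch below shows one way to do this. It runs on synthetic data from `make_classification` as a stand-in for `xtrain` and `ytrain`, and the standardization step, the `class_weight="balanced"` option, and all variable names are assumptions made for illustration rather than the reference solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (xtrain, ytrain): 27 features, roughly 2.5% positives
X, y = make_classification(
    n_samples=4000, n_features=27, n_informative=8,
    weights=[0.975, 0.025], random_state=0,
)

# Stratified split keeps the anomaly rate similar in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the features, then fit a class-weighted logistic regression
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_fit, y_fit)

# Higher decision_function values mean "more likely an anomaly"
val_scores = model.decision_function(X_val)
val_auc = roc_auc_score(y_val, val_scores)
print(f"validation AUC: {val_auc:.3f}")
```

With the real data, `X` and `y` would be replaced by `xtrain` and `ytrain.values.ravel()`. A single split gives a noisy estimate when anomalies are this rare, so repeating it over several stratified folds may be preferable.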

## Prepare a file for submission

In [10]:
# Save the anomaly scores to file
print(sscore.shape)
np.savetxt('ytest_challenge_student.csv', sscore, fmt = '%1.6f', delimiter=',')

(27587,)
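
Before uploading, it can be worth reading the file back and checking that it has exactly one finite real value per test observation. The helper below is a hypothetical sketch using only the standard library (`check_submission` is not part of the challenge tooling), demonstrated on a tiny made-up file.

```python
import math
import os
import tempfile


def check_submission(path, expected_rows):
    """Return the parsed scores if the file has one finite real value per non-empty row."""
    with open(path) as f:
        rows = [line.strip() for line in f if line.strip()]
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    # float() raises ValueError on non-numeric rows (e.g. an accidental header)
    scores = [float(r) for r in rows]
    if not all(math.isfinite(s) for s in scores):
        raise ValueError("scores must be finite (no NaN or inf)")
    return scores


# Demonstration on a tiny made-up file; for the challenge, expected_rows would be 27587
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_submission.csv")
    with open(demo_path, "w") as f:
        f.write("-912.413929\n-331.342685\n191.158298\n")
    demo_scores = check_submission(demo_path, expected_rows=3)
print(demo_scores)
```

For the actual submission, the call would be `check_submission('ytest_challenge_student.csv', expected_rows=27587)`.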


#### Now it's your turn. Good luck !  :) 