# **Final Project ML for Time Series**

### **Subject**: *A Symbolic Representation of Time Series, with Implications for Streaming Algorithms*, Jessica Lin, Eamonn Keogh, Stefano Lonardi, Bill Chiu

#### **Authors**: Tom Rossa and Naïl Khelifa

## **Table of Contents**
1. [Introduction](#introduction)
2. [Importing Libraries and Data](#importation-des-bibliothèques-et-des-données)
3. [Data Exploration](#exploration-des-données)
   - [Data Overview](#aperçu-des-données)
   - [Descriptive Statistics](#statistiques-descriptives)
   - [Data Visualization](#visualisation-des-données)
4. [Data Preprocessing](#prétraitement-des-données)
   - [Handling Missing Values](#gestion-des-valeurs-manquantes)
   - [Normalization and Transformation](#normalisation-et-transformation)
   - [Encoding Categorical Variables](#encodage-des-variables-catégoriques)
5. [Sensitivity of the Data to Hyperparameters](#optimisation-et-tuning-des-hyperparamètres)
6. [Clustering](#clustering)
   - [Hierarchical Clustering](#hierarchical)
   - [Partitional Clustering](#partitional)
   - [Parameter Sensitivity](#other)
   - [Clustering Conclusion](#conclusion-clustering)
7. [Classification](#classification)
   - [Nearest Neighbor Classification](#neighbor)
   - [Parameter Sensitivity](#other)
   - [Classification Conclusion](#conclusion-classification)
8. [Query by Content (Indexing)](#indexing)
9. [Other](#other-data-mining)
   - [Anomaly Detection](#anomaly)
   - [Motif Discovery](#motif)
   - [Other](#other)
   - [Other Data Mining Conclusion](#conclusion-other-data-mining)
10. [Results and Interpretation](#résultats-et-interprétation)
11. [Conclusion and Outlook](#conclusion-et-perspectives)


## **Introduction**

This work aims to reproduce and extend the experiments carried out in the paper *A Symbolic Representation of Time Series, with Implications for Streaming Algorithms* (Lin et al.).

## **Importing Libraries and Data**

### Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import scipy.stats as stats  # for the breakpoints in SAX
from scipy.stats import norm
from dtw import dtw
import os

# scikit-learn
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    auc,
    average_precision_score,
    confusion_matrix,
    f1_score,
    make_scorer,
    precision_recall_curve,
    roc_curve,
)

# tslearn
from tslearn.neighbors import KNeighborsTimeSeriesClassifier
from tslearn.utils import to_time_series_dataset

# Custom: code we implemented ourselves
from Symbol import SYMBOLS
from SFA import *
from ASTRIDE import *
from SAX_transf import *
from distances import MINDIST, TRENDIST
import utils
import warnings

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.
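
The import comment above mentions the SAX breakpoints. As a minimal sketch of what that refers to, assuming the standard definition from Lin et al. (breakpoints that cut the standard normal distribution into equiprobable regions), they can be obtained from the Gaussian quantile function. The names `sax_breakpoints` and `symbolize` are hypothetical and are not taken from the authors' `SAX_transf` module.

```python
import numpy as np
from scipy.stats import norm


def sax_breakpoints(alphabet_size):
    """Return the alphabet_size - 1 cut points that split N(0, 1)
    into alphabet_size regions of equal probability."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    # Quantile levels 1/a, 2/a, ..., (a-1)/a
    levels = np.arange(1, alphabet_size) / alphabet_size
    return norm.ppf(levels)


def symbolize(values, alphabet_size):
    """Map z-normalized values to integer symbols 0 .. alphabet_size - 1.

    A value equal to a breakpoint is assigned to the upper region.
    """
    breakpoints = sax_breakpoints(alphabet_size)
    return np.searchsorted(breakpoints, values, side="right")


print(np.round(sax_breakpoints(4), 2))
print(symbolize([-1.0, -0.1, 0.1, 1.0], 4))
```

For an alphabet of size 4 the breakpoints are approximately -0.67, 0 and 0.67, so the four example values fall into four different regions.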



In [2]:
# Load the Control Charts dataset
CC_path = "/Users/badis/MVA_Times_Series_ML_Homeworks-1/Final_project/datasets/control_charts.data"  # NOTE: machine-specific local path
cc_df = utils.load_control_chart_dataset(CC_path)

In [4]:
cc_df.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7812 | 34.4632 | 31.3381 | 31.2834 | 28.9207 | 33.7596 | 25.3969 | 27.7849 | 35.2479 | 27.1159 | ... | 24.5556 | 33.7431 | 25.0466 | 34.9318 | 34.9879 | 32.4721 | 33.3759 | 25.4652 | 25.8717 | Normal |
| 1 | 24.8923 | 25.741 | 27.5532 | 32.8217 | 27.8789 | 31.5926 | 31.4861 | 35.5469 | 27.9516 | 31.6595 | ... | 31.0205 | 26.6418 | 28.4331 | 33.6564 | 26.4244 | 28.4661 | 34.2484 | 32.1005 | 26.691 | Normal |
| 2 | 31.3987 | 30.6316 | 26.3983 | 24.2905 | 27.8613 | 28.5491 | 24.9717 | 32.4358 | 25.2239 | 27.3068 | ... | 26.5966 | 25.5387 | 32.5434 | 25.5772 | 29.9897 | 31.351 | 33.9002 | 29.5446 | 29.343 | Normal |
| 3 | 25.774 | 30.5262 | 35.4209 | 25.6033 | 27.97 | 25.2702 | 28.132 | 29.4268 | 31.4549 | 27.32 | ... | 28.7261 | 28.2979 | 31.5787 | 34.6156 | 32.5492 | 30.9827 | 24.8938 | 27.3659 | 25.3069 | Normal |
| 4 | 27.1798 | 29.2498 | 33.6928 | 25.6264 | 24.6555 | 28.9446 | 35.798 | 34.9446 | 24.5596 | 34.2366 | ... | 27.9601 | 35.7198 | 27.576 | 35.3375 | 29.9993 | 34.2149 | 33.1276 | 31.1057 | 31.0179 | Normal |
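
The helper `utils.load_control_chart_dataset` is not shown in this notebook. The sketch below is one plausible way to build a frame of the same shape (integer-named value columns plus a `Label` column). It assumes the file follows the layout of the UCI Synthetic Control Chart dataset: whitespace-separated values, one series per row, with classes stored in consecutive blocks of 100 rows. The function name, the `rows_per_class` parameter, and every label string other than `Normal` are assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd

# Class order assumed from the UCI dataset description (not confirmed by the notebook).
CLASS_NAMES = [
    "Normal",
    "Cyclic",
    "Increasing trend",
    "Decreasing trend",
    "Upward shift",
    "Downward shift",
]


def load_control_chart_sketch(source, rows_per_class=100):
    """Read whitespace-separated series and label them by row block."""
    df = pd.read_csv(source, sep=r"\s+", header=None)
    # Row i belongs to class i // rows_per_class
    class_index = np.arange(len(df)) // rows_per_class
    if class_index.max() >= len(CLASS_NAMES):
        raise ValueError("more rows than the assumed class layout allows")
    df["Label"] = [CLASS_NAMES[i] for i in class_index]
    return df


# Tiny synthetic stand-in for the real file: 12 series of 3 points, 2 per class.
demo_text = "\n".join(f"{i} {i + 0.5} {i + 1.0}" for i in range(12))
demo_df = load_control_chart_sketch(io.StringIO(demo_text), rows_per_class=2)
print(demo_df.head())
```

With the real file, `source` would be the path to the dataset and `rows_per_class` would keep its default of 100.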


In [5]:
cbf_train_path = "/Users/badis/MVA_Times_Series_ML_Homeworks-1/Final_project/datasets/CBF/CBF_TRAIN.txt"
cbf_test_path = "/Users/badis/MVA_Times_Series_ML_Homeworks-1/Final_project/datasets/CBF/CBF_TEST.txt"

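# NOTE: the paths are passed as (test, train) while the results are unpacked as (train, test);
# check this order against the signature of utils.load_CBF_dataset.
cbf_df_train, cbf_df_test = utils.load_CBF_dataset(cbf_test_path, cbf_train_path)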

In [6]:
cbf_df_train.head()

|  | labels | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.464276 | -0.555048 | -0.842843 | -0.865895 | -0.936396 | -0.81727 | -0.263612 | -1.258048 | -1.250393 | ... | -1.331086 | -1.0837 | -1.017961 | -0.916029 | -1.134289 | -0.920224 | -0.789362 | -0.638706 | -0.963664 | -1.245169 |
| 1 | 1 | -0.896972 | -0.685686 | -1.351382 | -1.458667 | -1.165346 | -1.403929 | -1.8218 | -0.831601 | -1.016312 | ... | -1.062553 | -1.568676 | -1.36404 | -1.155177 | -0.995861 | -1.163753 | -0.916523 | -0.582098 | -1.259009 | -1.392524 |
| 2 | 1 | -0.464696 | -0.567739 | -0.032023 | -0.635046 | -0.602826 | -0.266856 | -0.267061 | -0.931042 | -0.449382 | ... | -0.286721 | -0.637158 | -0.15526 | -0.688129 | -0.885609 | -0.766239 | -0.865315 | -0.284486 | -0.687889 | -0.887608 |
| 3 | 3 | -0.18719 | -0.620808 | -0.815661 | -0.521398 | -0.790423 | -0.967517 | -1.487006 | -0.277887 | -0.835075 | ... | -0.908615 | -0.726286 | 0.183778 | -0.737444 | -1.113997 | -0.393987 | -0.587889 | -0.608232 | -0.636465 | -0.349029 |
| 4 | 2 | -1.136017 | -1.319195 | -1.844624 | -0.788692 | -0.251715 | -1.487603 | -0.668764 | -0.34036 | -1.046382 | ... | -1.182911 | -1.073514 | -1.611362 | -1.06434 | -0.970736 | -0.827281 | -0.953538 | -1.270185 | -1.742758 | -0.925944 |
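
The frame above stores the class in a `labels` column and the 128 time points in the remaining columns. Before applying a symbolic representation such as SAX, which Lin et al. define on series with zero mean and unit standard deviation, it is convenient to separate the two and z-normalize each row. The sketch below illustrates this on a toy frame with the same layout. The helper names are hypothetical, and mapping constant series to all zeros is a choice made here, not something taken from the notebook.

```python
import numpy as np
import pandas as pd


def split_labels_and_series(df, label_column="labels"):
    """Return (X, y): X holds one series per row, y the class labels."""
    y = df[label_column].to_numpy()
    X = df.drop(columns=[label_column]).to_numpy(dtype=float)
    return X, y


def znormalize_rows(X, eps=1e-8):
    """Rescale each row to zero mean and unit standard deviation."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    # Avoid dividing by zero: (nearly) constant rows become all zeros
    safe_std = np.where(std < eps, 1.0, std)
    return (X - mean) / safe_std


# Toy frame mimicking the CBF layout: a label column followed by time points.
demo = pd.DataFrame(
    {"labels": [1, 2], 1: [0.0, 2.0], 2: [1.0, 2.0], 3: [2.0, 2.0], 4: [3.0, 2.0]}
)
X, y = split_labels_and_series(demo)
Z = znormalize_rows(X)
print(X.shape, y)
print(Z)
```

On the real data the same calls would be applied to `cbf_df_train` and `cbf_df_test`, assuming their label column is named `labels` as in the output above.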
