# Auxiliary Tutorial 5: Principal Component Analysis (PCA)
*This tutorial was prepared by Souleymane N'Doye and was generated from a Jupyter notebook.*

<center><strong>Build a new system of representation</strong></center> <br>
<center>(principal components, factorial axes, factors: linear combinations of the original variables)
that summarizes the information.</center>

In [126]:
import pandas as pd
import numpy as np

mark = ['ALFASUD-TI-1350','AUDI-100-L','SIMCA-1307-GLS','CITROEN-CG-CLUB','FIAT-132-1600GLS','LANCIA-BETA-1300',
        'PEUGEOT-504','RENAULT-16-TL','RENAULT-30-TS','TOYOTA-COROLLA','ALFETTA-1.66','PRINCESS-1800-HL',
        'DATSUN-200L','TAUNUS-2000-GL','RANCHO','MAZDA-9295','OPEL-REKORD-L','LADA-1300'	]

cyl =    [1350,1588,1294,1222,1585,1297,1796,1565,2664,1166,1570,1798,1998,1993,1442,1769,1979,1294]

puiss =  [79,85,68,59,98,82,79,55,128,55,109,82,115,98,80,83,100,68]

lon =    [393,468,424,412,439,429,449,424,452,399,428,445,469,438,431,440,459,404]

lar =     [161,177,168,161,164,169,169,163,173,157,162,172,169,170,166,165,173,161]

poi =    [870,1110,1050,930,1105,1080,1160,1010,1320,815,1060,1160,1370,1080,1129,1095,1120,955]

vit =    [165,160,152,151,165,160,154,140,180,140,175,158,160,167,144,165,173,140]

dic_ = {'marque': mark, 
      'cylindre' : cyl, 
      'puissance' :puiss, 
      'longueur' : lon , 
      'largeur' :lar , 
      'poids' : poi, 
      'vitesse' : vit
      }

auto = pd.DataFrame(dic_)

In [127]:
auto = auto.set_index('marque')
auto

marque,cylindre,puissance,longueur,largeur,poids,vitesse
ALFASUD-TI-1350,1350,79,393,161,870,165
AUDI-100-L,1588,85,468,177,1110,160
SIMCA-1307-GLS,1294,68,424,168,1050,152
CITROEN-CG-CLUB,1222,59,412,161,930,151
FIAT-132-1600GLS,1585,98,439,164,1105,165
LANCIA-BETA-1300,1297,82,429,169,1080,160
PEUGEOT-504,1796,79,449,169,1160,154
RENAULT-16-TL,1565,55,424,163,1010,140
RENAULT-30-TS,2664,128,452,173,1320,180
TOYOTA-COROLLA,1166,55,399,157,815,140


In [128]:
auto.shape

(18, 6)

<p>$n$ = 18 active individuals <br>
$p$ = 6 active variables (used to build the factors)</p>

Questions:<br>
(1) Which vehicles resemble each other? (proximity between individuals) <br>
(2) Which variables drive the similarities and dissimilarities? <br>
(3) What are the relationships between the variables? <br>

<center> <strong>Problem statement</strong> </center>
<center> Analysis of the proximities between individuals</center>

<p> What does the plot of CYLINDRE against PUISSANCE show? <br>
1. The variables CYL and PUISS are related.<br>
2. 'Opel Rekord' and 'Taunus 2000 (Ford)' have the
same profile (characteristics).<br>
3. 'Renault 30' and 'Toyota Corolla' have
opposite profiles…
</p>

In [129]:
from bokeh.plotting import figure, show, output_file
from bokeh.models import ColumnDataSource, Range1d, LabelSet, Label

from bokeh.io import output_notebook
output_notebook()

output_file("label.html", title="label.py example")

source = ColumnDataSource(data=dict(dic_))

p = figure(title='CYLINDRE x PUISSANCE',
           x_range=Range1d(1000, 3000))

p.scatter(x='cylindre', y='puissance', size=8, source=source)

p.xaxis[0].axis_label = 'Cylindre'
p.yaxis[0].axis_label = 'Puissance'

labels = LabelSet(x='cylindre', y='puissance', text='marque', level='glyph',
              x_offset=5, y_offset=5, source=source)

p.add_layout(labels)

# Coordinates of the barycenter in this plane
yg = np.mean(dic_['puissance'])
xg = np.mean(dic_['cylindre'])

p.scatter(x=xg, y=yg, size=10, alpha=0.9, color='red')

show(p)

#### The notion of inertia

It is impossible to draw a point cloud in « p » dimensions.<br>
One could cross the variables two by two, but:
1. It is very hard to keep track of several panels at the
same time.
2. Labeling the points would make the whole display unreadable.

This kind of representation is only useful for
a quick diagnosis and for spotting atypical
points.
E.g. Renault 30: the largest engine, the most
powerful, one of the heaviest, the fastest.

<strong>Principle</strong>: build a representation system of
reduced dimension (<strong>q</strong> $\ll$ p) that preserves the distances between
individuals. It can be seen as lossy compression of the information, with
a controlled loss.

- Euclidean distance between 2 individuals ($i$, $i'$):

 $$d^2(i,i') =   \sum_{j=1}^p(x_{ij}-x_{i'j})^2$$

- A global criterion: the distance between all individuals taken two by two, i.e. the
inertia of the point cloud in the original space. It reflects the amount of
information available.

$$ I_{p} =  \frac{1}{2n^2} \sum_{i=1}^n \sum_{i'=1}^n d^2(i,i')$$

- Another expression of the inertia: the deviation from the barycenter G (the vector
made of the means of the p variables).
$$ I_{p} =  \frac{1}{n} \sum_{i=1}^n d^2(i,G) $$

<strong>The inertia measures the dispersion around the barycenter;
it is a multidimensional variance (computed over
p dimensions)</strong>
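
As a quick numerical check, the sketch below computes the inertia both ways (from the pairwise distances and from the distances to the barycenter G) on a small subset of the data: cylindre and puissance of the first five cars. The function names `inertia_pairwise` and `inertia_barycenter` are illustrative and do not come from the original notebook.

```python
import numpy as np

# cylindre and puissance of the first five cars of the table
X = np.array([[1350, 79],
              [1588, 85],
              [1294, 68],
              [1222, 59],
              [1585, 98]], dtype=float)

def inertia_pairwise(X):
    """I_p = 1/(2 n^2) * sum over all pairs (i, i') of d^2(i, i')."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, p)
    d2 = (diff ** 2).sum(axis=2)              # squared Euclidean distances
    return d2.sum() / (2 * n ** 2)

def inertia_barycenter(X):
    """I_p = 1/n * sum over i of d^2(i, G), G being the vector of column means."""
    G = X.mean(axis=0)
    return ((X - G) ** 2).sum(axis=1).mean()

print(inertia_pairwise(X), inertia_barycenter(X))
# Both should also match the sum of the population variances of the columns
print(X.var(axis=0).sum())
```

The two expressions agree up to floating-point rounding, which is why the inertia can be read as a multidimensional variance.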

#### Orthogonal regression

Usually the variables are (a) centered and (b) scaled to unit variance.<br> This is called
normalized PCA.<br>
(a) So that G lies at the origin [mandatory]<br>
(b) To make variables expressed on different
scales (units) comparable [optional]
$$ z_{ij} = \frac{x_{ij}-\overline{x}_j }{s_j} $$
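
The transformation can be sketched by hand with NumPy. The example below uses three columns (cylindre, puissance, poids) of the first five cars and assumes that $s_j$ is the population standard deviation (`ddof=0`), which is also the convention of `sklearn.preprocessing.scale`. After the transformation the barycenter sits at the origin and the inertia equals the number of variables.

```python
import numpy as np

# cylindre, puissance, poids of the first five cars
X = np.array([[1350, 79, 870],
              [1588, 85, 1110],
              [1294, 68, 1050],
              [1222, 59, 930],
              [1585, 98, 1105]], dtype=float)

mean = X.mean(axis=0)
std = X.std(axis=0)            # population standard deviation (ddof=0)
Z = (X - mean) / std           # z_ij = (x_ij - mean_j) / s_j

n, p = Z.shape
inertia = (Z ** 2).sum() / n   # the barycenter is now the origin

print(Z.mean(axis=0))          # close to 0 for every column
print(Z.std(axis=0))           # close to 1 for every column
print(inertia)                 # close to p
```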

#### Special case of 2 centered and scaled variables ($I_p = p = 2$)

In [130]:
auto

marque,cylindre,puissance,longueur,largeur,poids,vitesse
ALFASUD-TI-1350,1350,79,393,161,870,165
AUDI-100-L,1588,85,468,177,1110,160
SIMCA-1307-GLS,1294,68,424,168,1050,152
CITROEN-CG-CLUB,1222,59,412,161,930,151
FIAT-132-1600GLS,1585,98,439,164,1105,165
LANCIA-BETA-1300,1297,82,429,169,1080,160
PEUGEOT-504,1796,79,449,169,1160,154
RENAULT-16-TL,1565,55,424,163,1010,140
RENAULT-30-TS,2664,128,452,173,1320,180
TOYOTA-COROLLA,1166,55,399,157,815,140


In [131]:
x = auto.values #returns a numpy array

In [134]:
import pandas as pd
from sklearn import preprocessing

x = auto.values #returns a numpy array
x_scaled = preprocessing.scale(x)
auto_scaled = pd.DataFrame(x_scaled)
auto_scaled.columns = auto.columns

In [138]:
auto_scaled = auto_scaled.set_index(auto.index)

In [139]:
auto_scaled

marque,cylindre,puissance,longueur,largeur,poids,vitesse
ALFASUD-TI-1350,-0.775099,-0.283358,-1.885081,-1.097345,-1.569007,0.56976
AUDI-100-L,-0.120163,0.019639,1.60581,2.001041,0.234161,0.145972
SIMCA-1307-GLS,-0.929201,-0.838852,-0.442179,0.258199,-0.216631,-0.53209
CITROEN-CG-CLUB,-1.127333,-1.293348,-1.000722,-1.097345,-1.118215,-0.616848
FIAT-132-1600GLS,-0.128419,0.676132,0.255999,-0.516398,0.196595,0.56976
LANCIA-BETA-1300,-0.920946,-0.13186,-0.209453,0.451848,0.008765,0.145972
PEUGEOT-504,0.452217,-0.283358,0.721451,0.451848,0.609821,-0.362575
RENAULT-16-TL,-0.183455,-1.495346,-0.442179,-0.710047,-0.517159,-1.549183
RENAULT-30-TS,2.840806,2.191116,0.861086,1.226445,1.811934,1.841127
TOYOTA-COROLLA,-1.281436,-1.495346,-1.60581,-1.871942,-1.982233,-1.549183


In [140]:
auto_scaled.columns = auto.columns
dic_cr = auto_scaled.to_dict('list')
dic_cr 

marque = list(auto.index)

# Adding a new key value pair
dic_cr.update( {'marque' : marque} )

In [141]:
output_notebook()

output_file("label.html", title="label.py example")

source = ColumnDataSource(data=dict(dic_cr))

# x range wide enough to show all the standardized values
p = figure(title='CYLINDRE x PUISSANCE',
           x_range=Range1d(-3.5, 3.5))

p.scatter(x='cylindre', y='puissance', size=8, source=source)

p.xaxis[0].axis_label = 'Cylindre'
p.yaxis[0].axis_label = 'Puissance'

labels = LabelSet(x='cylindre', y='puissance', text='marque', level='glyph',
              x_offset=5, y_offset=5, source=source)

p.add_layout(labels)

# Coordinates of the barycenter in this plane (the origin after centering)
yg = np.mean(dic_cr['puissance'])
xg = np.mean(dic_cr['cylindre'])

p.scatter(x=xg, y=yg, size=10, alpha=0.9, color='red')

show(p)

#### Correlation matrix


The correlation coefficient measures the (linear) relationship between two variables $x$ and $y$.


${\displaystyle r_{xy}={\dfrac {\displaystyle \sum _{i=1}^{N}(x_{i}-{\bar {x}})\cdot (y_{i}-{\bar {y}})}{{\sqrt {\displaystyle \sum _{i=1}^{N}(x_{i}-{\bar {x}})^{2}}}\cdot {\sqrt {\displaystyle \sum _{i=1}^{N}(y_{i}-{\bar {y}})^{2}}}}}}$
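
The formula can be applied term by term. The sketch below does so for cylindre and puissance, using the values from the lists defined at the top of this tutorial, and compares the result with `np.corrcoef`. The helper name `pearson_r` is illustrative.

```python
import numpy as np

cyl = np.array([1350, 1588, 1294, 1222, 1585, 1297, 1796, 1565, 2664,
                1166, 1570, 1798, 1998, 1993, 1442, 1769, 1979, 1294], dtype=float)
puiss = np.array([79, 85, 68, 59, 98, 82, 79, 55, 128,
                  55, 109, 82, 115, 98, 80, 83, 100, 68], dtype=float)

def pearson_r(x, y):
    """Correlation coefficient written exactly as in the formula."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

r = pearson_r(cyl, puiss)
print(round(r, 6))
print(np.isclose(r, np.corrcoef(cyl, puiss)[0, 1]))
```

The printed value can be compared with the cylindre/puissance entry of the matrix returned by `auto.corr()`.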

In [19]:
cormat = auto.corr()
cormat 

,cylindre,puissance,longueur,largeur,poids,vitesse
cylindre,1.0,0.796628,0.701462,0.629757,0.788952,0.664934
puissance,0.796628,1.0,0.641362,0.520832,0.765293,0.844379
longueur,0.701462,0.641362,1.0,0.849266,0.86809,0.475928
largeur,0.629757,0.520832,0.849266,1.0,0.716874,0.472945
poids,0.788952,0.765293,0.86809,0.716874,1.0,0.477596
vitesse,0.664934,0.844379,0.475928,0.472945,0.477596,1.0


#### Eigenvalues and eigenvectors of the correlation matrix

In [20]:
from numpy import linalg as LA

w, v = LA.eig(cormat)
print(w)
print(v)


[4.42085806 0.85606229 0.37306608 0.21392209 0.09280121 0.04329027]
[[-0.42493602 -0.12419108 -0.35361252  0.80778648  0.15158003 -0.05889517]
 [-0.42179441 -0.41577389 -0.18492049 -0.35779199 -0.29373465 -0.63303302]
 [-0.42145993  0.41181773  0.06763394 -0.27975231  0.73056903 -0.19029153]
 [-0.38692224  0.446087    0.60486812  0.21156941 -0.47819008 -0.10956624]
 [-0.43051198  0.24267581 -0.48439601 -0.30171136 -0.30455842  0.5808122 ]
 [-0.35894427 -0.6198626   0.48547226 -0.0735743   0.18865511  0.45852167]]


In [110]:
# The sum of the eigenvalues equals the number of components (p = 6)
eigenvalues, eigenvectors = LA.eig(cormat)

print('Eigenvalues: \n', eigenvalues)
print('\n')
print('Eigenvectors: \n',eigenvectors)

Eigenvalues: 
 [4.42085806 0.85606229 0.37306608 0.21392209 0.09280121 0.04329027]


Eigenvectors: 
 [[-0.42493602 -0.12419108 -0.35361252  0.80778648  0.15158003 -0.05889517]
 [-0.42179441 -0.41577389 -0.18492049 -0.35779199 -0.29373465 -0.63303302]
 [-0.42145993  0.41181773  0.06763394 -0.27975231  0.73056903 -0.19029153]
 [-0.38692224  0.446087    0.60486812  0.21156941 -0.47819008 -0.10956624]
 [-0.43051198  0.24267581 -0.48439601 -0.30171136 -0.30455842  0.5808122 ]
 [-0.35894427 -0.6198626   0.48547226 -0.0735743   0.18865511  0.45852167]]


##### Building the components

<p>Build the first component $F_1$ so that it maximizes the sum of its squared correlations with the variables of the data set:</p>

$$ \lambda_{1} =  \sum_{j=1}^p r^2_j(F_1) = r^2_1(F_1) + r^2_2(F_1)+ \cdots + r^2_p(F_1) $$



<p>The total inertia is the sum of the variances of the variables. When the data are scaled to unit variance (normalized PCA), total inertia = Trace(R) = p.</p>

$$ \text{Inertia} = \mathrm{Trace}(R) = p $$

In [111]:
cormat.values.trace()

6.0

(1) Find the first component $F_1$ that maximizes the overall spread of the points from the origin.


The share of inertia (variance) explained by $F_1$ is $\frac{\lambda_1}{p}$.

<p>Once again, the information is decomposed into uncorrelated (orthogonal) components:</p>

$$p = \sum_{k=1}^p\lambda_{k}$$
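
This identity can be checked on the correlation matrix printed earlier. The sketch below copies that matrix (rounded to 6 decimals, so the results are approximate) and uses `np.linalg.eigvalsh`, which is designed for symmetric matrices; the notebook itself uses `LA.eig`.

```python
import numpy as np

# Correlation matrix R as printed earlier in this tutorial (rounded values)
R = np.array([
    [1.0,      0.796628, 0.701462, 0.629757, 0.788952, 0.664934],
    [0.796628, 1.0,      0.641362, 0.520832, 0.765293, 0.844379],
    [0.701462, 0.641362, 1.0,      0.849266, 0.86809,  0.475928],
    [0.629757, 0.520832, 0.849266, 1.0,      0.716874, 0.472945],
    [0.788952, 0.765293, 0.86809,  0.716874, 1.0,      0.477596],
    [0.664934, 0.844379, 0.475928, 0.472945, 0.477596, 1.0],
])

p = R.shape[0]
# eigvalsh returns the eigenvalues in ascending order; reverse them
lam = np.linalg.eigvalsh(R)[::-1]

print(lam.sum())        # trace(R) = p, up to rounding
print(lam / p * 100)    # share of inertia carried by each component, in percent
```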

(2) Find the second component $F_2$, uncorrelated with $F_1$, that maximizes the overall spread of the points from the origin.

In [112]:
from numpy import linalg as LA
axe = np.arange(6)+1
eigen_values = LA.eig(cormat)[0]

In [113]:
import numpy as np
import pandas as pd

# Creating a 2 dimensional numpy array
data = np.array([axe, eigen_values])
print(data)



[[1.         2.         3.         4.         5.         6.        ]
 [4.42085806 0.85606229 0.37306608 0.21392209 0.09280121 0.04329027]]


##### Computing the shares of variance (pdv)

In [114]:
dic__ = {'axe' : np.arange(6)+1,
        'eigen_values' : LA.eig(cormat)[0]
        }
pdv = pd.DataFrame(dic__)
pdv['proportion'] = pdv['eigen_values']/pdv['eigen_values'].sum()*100
pdv['pct_cumule'] = pdv['proportion'].cumsum()
pdv

,axe,eigen_values,proportion,pct_cumule
0,1,4.420858,73.680968,73.680968
1,2,0.856062,14.267705,87.948672
2,3,0.373066,6.217768,94.16644
3,4,0.213922,3.565368,97.731809
4,5,0.092801,1.546687,99.278495
5,6,0.04329,0.721505,100.0


In [229]:
from bokeh.io import output_file, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

output_file("Eigenvalues.html")

# create a column data source for the plots to share
source = ColumnDataSource(data=pdv)

TOOLS = "box_select,lasso_select,help"

# create a new plot and add a renderer
left = figure(tools=TOOLS, width=300, height=300, title=None)
left.line('axe', 'eigen_values', source=source)

# create another new plot and add a renderer
middle = figure(tools=TOOLS, width=300, height=300, title=None)
middle.line('axe', 'proportion', source=source)

# create another new plot and add a renderer
right = figure(tools=TOOLS, width=300, height=300, title=None)
right.line('axe', 'pct_cumule', source=source)

p = gridplot([[left, middle,right]])

show(p)

In [115]:
# 87.95 % of the variance is explained by axis 1 + axis 2 (cumulative percentage)
print(round(pdv.loc[1, 'pct_cumule'], 2), '%')

87.95 %


#### Goal of the PCA computations

Build a set of components $(F_1, F_2, \ldots, F_k, \ldots)$, linear combinations of the original (centered and scaled) variables, whose quality in restoring the information can be assessed through
the inertia they reproduce ($\lambda_k$).


$$  \begin{cases}  F_1 = a_{11}z_1 + a_{21}z_2 + \ldots + a_{p1}z_p, (\lambda_1) 
                     \\ \vdots  
                     \\ F_k = a_{1k}z_1 + a_{2k}z_2 + \ldots + a_{pk}z_p, (\lambda_k) 
                     \\ \vdots 
                     \\ F_p = a_{1p}z_1 + a_{2p}z_2 + \ldots + a_{pp}z_p, (\lambda_p) 
    \end{cases} 
$$

How are the coefficients « ${a}_{jk}$ » obtained from the data? They are needed to:
<p> - Compute the coordinates of the individuals in the factorial frame, and judge their proximity in the various factorial planes.</p>
<p> - Interpret the factors by computing their correlations (and other derived indicators: CTR (contribution) and COS²) with the
original variables $(X_1, X_2, \ldots, X_p)$.<br><br>
The higher the correlation in absolute value, the stronger
the influence of the variable on the factor: $r_{x_j}(F_k)$.

</p>

###### Correlation of the factors with the variables:

$r_{x_j}(F_k) = \sqrt{\lambda_k} \times a_{jk} $
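
To see why this relation holds, the sketch below compares both sides on hypothetical random data (not the car data set): the correlations computed directly between each standardized variable and each component, and the product $\sqrt{\lambda_k} \times a_{jk}$. It assumes the variables are standardized with the population standard deviation and that the eigenvectors are those of the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 individuals, 4 variables sharing a common factor
base = rng.normal(size=(50, 1))
X = base + 0.5 * rng.normal(size=(50, 4))

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # centered and scaled (ddof=0)
n, p = Z.shape
R = Z.T @ Z / n                            # correlation matrix

lam, A = np.linalg.eigh(R)                 # eigenvalues in ascending order
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]           # sort by decreasing eigenvalue

F = Z @ A                                  # F_k = a_1k z_1 + ... + a_pk z_p

# Correlations computed directly between variable j and component k
direct = np.array([[np.corrcoef(Z[:, j], F[:, k])[0, 1] for k in range(p)]
                   for j in range(p)])
# sqrt(lambda_k) * a_jk, broadcast over the columns of A
formula = np.sqrt(lam) * A

print(np.allclose(direct, formula))
```

The key step is that the covariance between $z_j$ and $F_k$ is $\lambda_k a_{jk}$, while $z_j$ has variance 1 and $F_k$ has variance $\lambda_k$.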

In [155]:
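# Column k of the eigenvectors is scaled by sqrt(lambda_k) (NumPy broadcasting).
# Eigenvector signs are arbitrary; multiplying by -1 makes the F1 correlations positive.
corrVarFac_ = np.sqrt(eigenvalues)*eigenvectors*-1
corrVarFac_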

array([[ 0.89346354,  0.1149061 ,  0.21598347, -0.37361508, -0.04617627,
         0.01225391],
       [ 0.88685803,  0.38468911,  0.11294784,  0.16548492,  0.08948124,
         0.13171084],
       [ 0.88615477, -0.38102873, -0.04131023,  0.12939024, -0.22255537,
         0.03959265],
       [ 0.81353638, -0.4127359 , -0.36944822, -0.09785447,  0.14567244,
         0.0227967 ],
       [ 0.90518746, -0.22453248,  0.29586489,  0.13954667,  0.09277852,
        -0.12084561],
       [ 0.75471037,  0.57351941, -0.29652226,  0.03402937, -0.05747056,
        -0.09540146]])

In [158]:
corrVarFac = pd.DataFrame(corrVarFac_)
corrVarFac.columns = ['F1','F2','F3','F4','F5','F6']
corrVarFac = corrVarFac.set_index( auto.columns)

In [159]:
corrVarFac

,F1,F2,F3,F4,F5,F6
cylindre,0.893464,0.114906,0.215983,-0.373615,-0.046176,0.012254
puissance,0.886858,0.384689,0.112948,0.165485,0.089481,0.131711
longueur,0.886155,-0.381029,-0.04131,0.12939,-0.222555,0.039593
largeur,0.813536,-0.412736,-0.369448,-0.097854,0.145672,0.022797
poids,0.905187,-0.224532,0.295865,0.139547,0.092779,-0.120846
vitesse,0.75471,0.573519,-0.296522,0.034029,-0.057471,-0.095401


###### Let us look at the first factorial plane

In [160]:
corrVarFac.loc['puissance', ['F2']]

F2    0.384689
Name: puissance, dtype: float64

In [161]:
list(corrVarFac.index)

['cylindre', 'puissance', 'longueur', 'largeur', 'poids', 'vitesse']

In [181]:
from bokeh.plotting import figure, show, output_file
from bokeh.models import ColumnDataSource, Range1d, LabelSet, Label, Arrow, OpenHead

from bokeh.io import output_notebook
output_notebook()

output_file("label.html", title="label.py example")

source = ColumnDataSource(data=corrVarFac[['F1', 'F2']])

p = figure(title='F1 x F2',
           x_range=Range1d(-1.1, 1.1),
           y_range=Range1d(-1.1, 1.1))

p.scatter(x='F1', y='F2', size=8, source=source)

p.xaxis[0].axis_label = 'F1'
p.yaxis[0].axis_label = 'F2'

labels = LabelSet(x='F1', y='F2', text='index', level='glyph',
              x_offset=5, y_offset=5, source=source)

p.add_layout(labels)

# Correlation circle (radius 1, in data units)
p.circle(0, 0, radius=1, color="navy", alpha=0.1)

# Plot segment for Fact1 & Fact2
p.segment(x0=[0,-1], y0=[-1,0], 
          x1=[0,1] , y1=[1,0], 
          color="grey", line_width=1.5)

# Plot one arrow per variable, from the origin to its (F1, F2) correlations
for var in corrVarFac.index:
    p.add_layout(Arrow(end=OpenHead(line_color="blue", line_width=2),
                       x_start=0, y_start=0,
                       x_end=corrVarFac.loc[var, 'F1'],
                       y_end=corrVarFac.loc[var, 'F2']))

show(p)

### Projection of the individuals onto the first factorial plane (F1,F2)

#### Computation via the singular value decomposition of the centered and scaled data matrix.


<p> Principle of the SVD:</p> 

$$Z = U  \Delta V ^{t}, \begin{cases} Z \vec{v_k} =  \delta_k \vec{u_k} \\ 
                           Z^t \vec{u_k}  =  \delta_k \vec{v_k}
                               \end{cases} $$ 

<p> The factorial coordinates of the individuals are obtained with: </p>
 $$F_{ik} =  \delta_k  \times u_{ik}$$
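
The sketch below illustrates these relations on hypothetical random data (not the car data set). It checks that $U\Delta$ gives the same coordinates as projecting $Z$ onto the axes $\vec{v_k}$, and that the squared singular values divided by $n$ are the eigenvalues of the correlation matrix. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 30 individuals, 5 variables sharing a common factor
base = rng.normal(size=(30, 1))
X = base + 0.5 * rng.normal(size=(30, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # centered and scaled (ddof=0)
n, p = Z.shape

U, delta, Vt = np.linalg.svd(Z, full_matrices=False)

F_svd = U * delta        # F_ik = delta_k * u_ik (broadcast over columns)
F_proj = Z @ Vt.T        # same coordinates, as projections onto the axes v_k

lam = delta ** 2 / n     # eigenvalues of R = Z^t Z / n
lam_eig = np.linalg.eigvalsh(Z.T @ Z / n)[::-1]

print(np.allclose(F_svd, F_proj))
print(np.allclose(lam, lam_eig))
print(lam.sum())         # total inertia, close to p
```

The same link shows up in the notebook's own numbers: the first singular value printed for the car data, 8.920507, squared and divided by $n = 18$, gives approximately 4.42, the first eigenvalue of the correlation matrix.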

In [196]:
u, s, vh = np.linalg.svd(auto_scaled.values,full_matrices=False)

In [199]:
print(u)

[[-0.23977601 -0.45489896 -0.22067967 -0.10290372  0.23316757 -0.06108363]
 [ 0.17504146  0.38901071 -0.50755864  0.10770636 -0.11491073  0.3707088 ]
 [-0.12548449  0.17182895 -0.17619576  0.08542341  0.29042869 -0.30790958]
 [-0.28851967 -0.02875703 -0.05733257  0.00883824 -0.17552212 -0.29847642]
 [ 0.0479631  -0.1771945   0.07458827  0.31990848 -0.20388078  0.04208801]
 [-0.03410542  0.04996854 -0.26079009  0.28331197  0.34437435 -0.22666202]
 [ 0.07666924  0.2376945   0.09910714 -0.10351879 -0.16143911 -0.17433474]
 [-0.21842846  0.24976734  0.23909179 -0.32121546 -0.22682934 -0.12307377]
 [ 0.49433677 -0.2709586   0.22904055 -0.43176395  0.29006171 -0.04984108]
 [-0.4468112  -0.06018181  0.11697811 -0.13510809 -0.2154272   0.37258201]
 [ 0.04906203 -0.48719266 -0.00962512  0.38674655 -0.13006532  0.06144609]
 [ 0.11413873  0.21442461 -0.08358913 -0.15462842  0.14304311 -0.20945483]
 [ 0.32969878  0.1424487   0.48005071  0.39350428 -0.04210005  0.06486943]
 [ 0.14739974 -0.1239407 

In [201]:
print(s)

[8.920507   3.92544535 2.59136825 1.96229396 1.29244799 0.88273717]


In [212]:
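# coordInd_ is not defined in the cells shown; it is assumed here to follow
# the formula F_ik = delta_k * u_ik (each column of u scaled by its singular value)
coordInd_ = u * s
coordInd = pd.DataFrame(coordInd_)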
coordInd.columns = ['F1','F2','F3','F4','F5','F6']
coordInd = coordInd.set_index(auto_scaled.index)

In [213]:
coordInd

marque,F1,F2,F3,F4,F5,F6
ALFASUD-TI-1350,-2.138924,-1.785681,-0.571862,-0.201927,0.301357,-0.053921
AUDI-100-L,1.561459,1.52704,-1.315271,0.211352,-0.148516,0.327238
SIMCA-1307-GLS,-1.119385,0.674505,-0.456588,0.167626,0.375364,-0.271803
CITROEN-CG-CLUB,-2.573742,-0.112884,-0.14857,0.017343,-0.226853,-0.263476
FIAT-132-1600GLS,0.427855,-0.695567,0.193286,0.627754,-0.263505,0.037153
LANCIA-BETA-1300,-0.304238,0.196149,-0.675803,0.555941,0.445086,-0.200083
PEUGEOT-504,0.683928,0.933057,0.256823,-0.203134,-0.208652,-0.153892
RENAULT-16-TL,-1.948493,0.980448,0.619575,-0.630319,-0.293165,-0.108642
RENAULT-30-TS,4.409735,-1.063633,0.593528,-0.847248,0.37489,-0.043997
TOYOTA-COROLLA,-3.985782,-0.23624,0.303133,-0.265122,-0.278428,0.328892


In [215]:
coordInd['F1'].min()

-3.985782417966164

In [222]:
output_notebook()

output_file("label.html", title="label.py example")

source = ColumnDataSource(data=coordInd)

p = figure(title='F1 x F2',
           x_range=Range1d(coordInd['F1'].min()-1, coordInd['F1'].max()+1),
           y_range=Range1d(coordInd['F2'].min()-1, coordInd['F2'].max()+1)
           )

p.scatter(x='F1', y='F2', size=8, source=source)

p.xaxis[0].axis_label = 'F1'
p.yaxis[0].axis_label = 'F2'

labels = LabelSet(x='F1', y='F2', text='marque', level='glyph',
              x_offset=5, y_offset=5, source=source)

p.add_layout(labels)

# Plot segment for Fact1 & Fact2
p.segment(x0=[0,-10], y0=[-10,0], 
          x1=[0,10] , y1=[10,0], 
          color="grey", line_width=1.5)

show(p)

In [223]:
auto

marque,cylindre,puissance,longueur,largeur,poids,vitesse
ALFASUD-TI-1350,1350,79,393,161,870,165
AUDI-100-L,1588,85,468,177,1110,160
SIMCA-1307-GLS,1294,68,424,168,1050,152
CITROEN-CG-CLUB,1222,59,412,161,930,151
FIAT-132-1600GLS,1585,98,439,164,1105,165
LANCIA-BETA-1300,1297,82,429,169,1080,160
PEUGEOT-504,1796,79,449,169,1160,154
RENAULT-16-TL,1565,55,424,163,1010,140
RENAULT-30-TS,2664,128,452,173,1320,180
TOYOTA-COROLLA,1166,55,399,157,815,140


In [225]:
auto.describe()

,cylindre,puissance,longueur,largeur,poids,vitesse
count,18.0,18.0,18.0,18.0,18.0,18.0
mean,1631.666667,84.611111,433.5,166.666667,1078.833333,158.277778
std,373.929846,20.376281,22.107358,5.313689,136.957808,12.140383
min,1166.0,55.0,393.0,157.0,815.0,140.0
25%,1310.25,70.75,424.0,162.25,1020.0,151.25
50%,1577.5,82.0,434.5,167.0,1087.5,160.0
75%,1797.5,98.0,448.0,169.75,1126.75,165.0
max,2664.0,128.0,469.0,177.0,1370.0,180.0
