**DATA8 - Statistics for Machine Learning - part 1**
<br>
This module introduces statistics: descriptive statistics, trends, data normalization/standardization, correlation, and dimensionality reduction (PCA).<br>
The data used includes that of RTE: https://rte-france.com/fr/eco2mix/eco2mix-telechargement <br>

In [1]:
import pandas as pd
%matplotlib notebook
from matplotlib import pyplot as plt
import numpy as np

**Overview of the stats/probas class**
<font color='purple'>

1st project : descriptive statistics + time series <br/> 
Topic = Electricity production and consumption in France <br/> 
    
    1. Univariate analysis: measures of central tendency, dispersion, shape
    2. Multivariate analysis: correlations, dimension reduction (PCA, ...)
    3. Time series: patterns, resampling, cross-correlation, spectral analysis, interpolation (no forecasting, that will be seen in ML)

2nd project : inferential stats (based on probability theory)  <br/> 
Topic: Sound detection  <br/> 
    
    1. Continuous random variable analysis : probability density function, cumulative distribution function.
    2. Discrete random variable analysis : probability mass function
    3. Hypothesis testing : Null hypothesis, alternative hypothesis, significance level, p-value

3rd project : linear regression  <br/> 
Topics: Hospital emergency department overcrowding, wine characteristics, housing price prediction, Moneyball! <br/> 
    
    1. Classic linear regression: interpreting output, model checking, variable selection
    2. Linear regression in a prediction context: loss functions, training/test data, over/under-fitting, cross-validation, regularisation
    
</font>

# Understanding the dataset => Descriptive Stats

**Why descriptive analytics**
<font color='green'>

Objective = transforming raw observations into information <br/> 
=> Statistics is a collection of tools for summarizing and organizing a dataset so that it can be easily understood and can support decisions
</font>

**Descriptive statistics vs inferential statistics**
<font color='green'>

- Descriptive statistics = data description and interpretation (without attempting to make inferences from the sample to the whole population) 
- Inferential statistics = probability-based decisions and predictions
</font>

**References**
- https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9
- https://medium.com/@himanshuxd/the-guide-to-rigorous-descriptive-statistics-for-machine-learning-and-data-science-9209f88e4363

# Intro

## Vocabulary

###### Definitions of common machine learning terms:
https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html

- Observation = data point or row in a dataset
- Attribute = quality describing an observation (e.g. color, size, weight); attributes are the column headers
- Feature = an attribute + its value (color is an attribute, “blue color” is a feature) 
- Dimension = number of features in the dataset


## Data types

Most common data types:
- quantitative variables: discrete or continuous
- qualitative variables
- ordered or unordered variables

To learn more: <br/> 
https://towardsdatascience.com/data-types-in-statistics-347e152e8bee

# Univariate statistics

**TODO**
<font color=#cc0066>
        
1. Load the "DATA_part1" dataset into a dataframe
2. Display the first rows and try the 'describe' method
3. Plot a histogram of the first 8 columns
    
</font>

In [3]:
# Recode the binary 'lait' (milk) column as labels: 0 -> 'non' (no), 1 -> 'oui' (yes)
df.loc[df.lait==0,['lait']]='non'
df.loc[df.lait==1,['lait']]='oui'

In [4]:
df.describe()

Unnamed: 0,A,B,C,D,E,F,G,H
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,0.036334,-0.001222,0.027083,0.002676,4.687889,1.02313,1.027027,0.065347
std,2.8669,1.007486,3.008689,2.241933,2.966784,1.029788,1.036689,3.32862
min,-4.999033,-4.266466,-11.079926,-5.315736,0.000386,2.6e-05,-0.297518,-11.974507
25%,-2.412769,-0.690013,-1.990707,-1.984108,2.884097,0.28505,0.298594,-2.104974
50%,0.019228,0.003851,-0.001765,0.029963,4.168992,0.712555,0.72278,0.004217
75%,2.51638,0.68657,2.063967,2.018438,5.703988,1.411114,1.42356,2.193579
max,4.999584,3.834648,10.85612,5.915217,20.776221,11.23916,11.309762,29.876586


In [5]:
df.hist(column=["A","B","C","D","E","F","G","H"])

In [6]:
df.hist(column='A',bins=100)

In [7]:
# Spread and shape of column A (the skewness was printed twice and the label was misspelled)
print("A: std:", df["A"].std(), "skewness:", df["A"].skew(),
      "kurtosis:", df["A"].kurtosis(), "median:", df["A"].median())

A: std: 2.8669000482404985 skewness: -0.0011151302687359174 kurtosis: -1.1736561154473668 median: 0.01922830701956535

In [8]:
df.hist(column='B',bins=100)

In [9]:
# Spread and shape of column B
print("B: std:", df["B"].std(), "skewness:", df["B"].skew(), "kurtosis:", df["B"].kurtosis())

B: std: 1.0074862907246727 skewness: -0.03136228208462036 kurtosis: 0.023006110352262965

In [10]:
df.hist(column='C')

In [11]:
# Spread and shape of column C
print("C: std:", df["C"].std(), "skewness:", df["C"].skew(), "kurtosis:", df["C"].kurtosis())

C: std: 3.0086889981445513 skewness: 0.013886297798656891 kurtosis: -0.02960731275515327

In [12]:
df.hist(column='D')

In [13]:
# Spread and shape of column D
print("D: std:", df["D"].std(), "skewness:", df["D"].skew(), "kurtosis:", df["D"].kurtosis())

D: std: 2.241933228140919 skewness: -0.0009767501185446142 kurtosis: -1.2665927274054651

In [14]:
df.hist(column='E')

In [15]:
# Spread and shape of column E
print("E: std:", df["E"].std(), "skewness:", df["E"].skew(), "kurtosis:", df["E"].kurtosis())

E: std: 2.966783788546626 skewness: 1.3059701773817354 kurtosis: 2.2936323897302757

In [16]:
_ = df.hist(column='F')

In [17]:
# Spread and shape of column F
print("F: std:", df["F"].std(), "skewness:", df["F"].skew(), "kurtosis:", df["F"].kurtosis())

F: std: 1.0297880005328133 skewness: 2.016465275513045 kurtosis: 6.039592614174925

In [18]:
_ = df.hist(column='G')

In [19]:
# Spread and shape of column G
print("G: std:", df["G"].std(), "skewness:", df["G"].skew(), "kurtosis:", df["G"].kurtosis())

G: std: 1.0366893666382093 skewness: 1.9847758327262306 kurtosis: 5.914500390790007

In [20]:
_ = df.hist(column='H')

In [21]:
# Spread and shape of column H
print("H: std:", df["H"].std(), "skewness:", df["H"].skew(), "kurtosis:", df["H"].kurtosis())

H: std: 3.3286203844143176 skewness: 0.6637577524514291 kurtosis: 4.721507318325134

## Measures of central tendency

**TODO**
<font color=#cc0066>
        
1. Display the mean and the median of each column
2. Why use one rather than the other?
3. What is the most representative value of a dataset containing qualitative data?
    
</font>

In [22]:
s="ABCDEFGH"

# Print the mean and the median of each of the 8 numeric columns
for colonne in s:
    print("For",colonne,"the mean is:", round(df[colonne].mean(),3),"and the median:",round(df[colonne].median(),3))

For A the mean is: 0.036 and the median: 0.019
For B the mean is: -0.001 and the median: 0.004
For C the mean is: 0.027 and the median: -0.002
For D the mean is: 0.003 and the median: 0.03
For E the mean is: 4.688 and the median: 4.169
For F the mean is: 1.023 and the median: 0.713
For G the mean is: 1.027 and the median: 0.723
For H the mean is: 0.065 and the median: 0.004


In [23]:
_ = df.hist(column=["A","B","C","D","E","F","G","H"])

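As a rule of thumb, when a distribution is skewed or contains outliers, it is best to go with the median: the median is not affected much by a few extreme values (in the Pittsboro example, the mayor and his wife). When there are no such extreme values, as with the average number of kids per household in the city of Pittsboro, use the mean. In the output above, columns E, F and G have a mean clearly above their median, which matches their positive skewness.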

In [24]:
df[["réveil","matin","lait","petit dej"]].mode()

Unnamed: 0,réveil,matin,lait,petit dej
0,difficile,café,non,pain au choc


Mode is a good statistic for a qualitative dataset.
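To make the mean/median/mode comparison concrete, here is a minimal sketch on toy data. The `incomes` and `drinks` series are hypothetical and are not taken from the course dataset; the point is only to show how one extreme value moves the mean but not the median, and that the mode is the natural summary for labels.

```python
import pandas as pd

# Hypothetical toy data: nine modest values plus one extreme value.
incomes = pd.Series([30, 32, 35, 36, 38, 40, 41, 43, 45, 1000])

mean_with_outlier = incomes.mean()      # pulled up by the extreme value
median_with_outlier = incomes.median()  # barely moves

# Same data without the extreme value.
trimmed = incomes.iloc[:-1]
mean_trimmed = trimmed.mean()
median_trimmed = trimmed.median()

# For qualitative data, mean and median are undefined; the mode still works.
drinks = pd.Series(["coffee", "tea", "coffee", "coffee", "tea"])
most_common = drinks.mode()[0]

print("with outlier   : mean =", mean_with_outlier, "median =", median_with_outlier)
print("without outlier: mean =", round(mean_trimmed, 2), "median =", median_trimmed)
print("mode of the qualitative series:", most_common)
```

Removing a single point changes the mean from 134 to about 37.8, while the median only moves from 39 to 38.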

## Measures of Dispersion

The most popular variability measures are the range, interquartile range (IQR), variance, and standard deviation. These are used to measure the amount of spread or variability within your data.

1. Range and interquartile range

    - the range is the difference between the largest and the smallest value of the data
    - the interquartile range (IQR = Q3 - Q1) is the width of the interval containing the middle 50% of the values
    

2. Variance & Standard Deviation

    - The variance is computed by taking the difference between every data point and the mean, squaring these differences, and averaging them.
    - Squaring weights outliers more heavily and prevents differences above the mean from cancelling those below the mean.
    - Because of the squaring, the variance is not in the same unit of measurement as the original data.
    - Standard deviation = square root of the variance, which brings the measure back to the unit of the data.
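The definitions above can be sketched by hand with NumPy. The array `values` is a hypothetical toy sample. Note one detail the list does not mention: pandas' `.var()` and `.std()` divide by n - 1 by default (sample estimate), whereas the "average of squared differences" divides by n.

```python
import numpy as np

# Hypothetical toy sample.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Range: distance between the extremes.
value_range = values.max() - values.min()

# Interquartile range: width of the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Variance as the average squared distance to the mean (divide by n).
mean = values.mean()
population_variance = np.mean((values - mean) ** 2)

# Sample variance (divide by n - 1), the default convention in pandas.
sample_variance = values.var(ddof=1)

# Standard deviation: back in the unit of the data.
population_std = np.sqrt(population_variance)

print("range:", value_range, "IQR:", iqr)
print("variance (n):", population_variance, "variance (n-1):", round(sample_variance, 3))
print("std (n):", population_std)
```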

Example of two samples with the same mean but different standard deviations, illustrating the standard deviation as a measure of dispersion around the mean.

source https://fr.wikipedia.org/wiki/%C3%89cart_type

Normal distribution curve showing the standard deviation. Each vertical band is one standard deviation wide, and the percentages give the approximate share of the total population in each band. Note: because of rounding, the total is 99.8% instead of 100%.

source: https://fr.wikipedia.org/wiki/%C3%89cart_type

In [25]:
# url = 'https://miro.medium.com/max/1000/1*gV5r1dUfmaPxoSMsL7h5rA.png'
# Image(url = url, width=400)

**TODO**
<font color=#cc0066>
        
For the first 8 columns:
1. Display the range (max-min) and the interquartile range
2. Display the standard deviation of each column
3. Draw a box plot for each column. Which distributions look similar?
4. Compare these box plots with the histograms obtained above
</font>

In [26]:
s="ABCDEFGH"

# Print the range, standard deviation and interquartile range of each column
for colonne in s:
    col = df[colonne]
    etendue = round(col.max()-col.min(),3)
    iqr = round(col.quantile(.75)-col.quantile(.25),3)
    print(f"For {colonne}: range = {etendue}, std = {round(col.std(),3)}, IQR = {iqr}")

For A: range = 9.999, std = 2.867, IQR = 4.929
For B: range = 8.101, std = 1.007, IQR = 1.377
For C: range = 21.936, std = 3.009, IQR = 4.055
For D: range = 11.231, std = 2.242, IQR = 4.003
For E: range = 20.776, std = 2.967, IQR = 2.82
For F: range = 11.239, std = 1.03, IQR = 1.126
For G: range = 11.607, std = 1.037, IQR = 1.125
For H: range = 41.851, std = 3.329, IQR = 4.299


In [27]:
fig = plt.figure()
ax = plt.axes()
df.boxplot(column=["A","B","C","D","E","F","G","H"])
plt.show()

In [28]:
_ = df.hist(column=["A","B","C","D","E","F","G","H"])

## Normalisation / standardisation

- Normalization = rescales the values into a range of [0,1]
- Standardization = rescales data to have a mean of 0 and a standard deviation of 1 (unit variance)
- Standard score (z-score) = the number of standard deviations above or below the mean at which a specific observation falls

Why is it important?  <br/> 
- It helps compare features that have different units or scales
- It makes regression coefficients easier to interpret
- It helps ML algorithms converge (scaling issues)
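The two rescalings can be written by hand in a few lines. This is a minimal sketch on a hypothetical toy array `x`; the standardization below uses the population standard deviation (NumPy's default, `ddof=0`), which is also the convention used by scikit-learn's `StandardScaler`.

```python
import numpy as np

# Hypothetical toy feature.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max): values end up in [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: mean 0 and standard deviation 1.
# Each entry of x_std is the z-score of the corresponding observation.
x_std = (x - x.mean()) / x.std()

print("min-max     :", x_minmax)
print("standardized:", np.round(x_std, 3))
print("mean, std after standardization:", round(x_std.mean(), 3), round(x_std.std(), 3))
```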


Normalisation vs standardisation: <br/> 
https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc

Effect of various scalers:  <br/> 
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py


**TODO**
<font color=#cc0066>
   
        
For the first 8 columns:
1. Standardize the dataframe column by column
2. Identify the outliers (typically > 3 std)
    
</font>

In [29]:
names = df[["A","B","C","D","E","F","G","H"]].columns
names

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], dtype='object')

In [30]:
from sklearn import preprocessing
# Get column names first
names = df[["A","B","C","D","E","F","G","H"]].columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit your data on the scaler object
scaled_df = scaler.fit_transform(df[names])
scaled_df = pd.DataFrame(scaled_df, columns=names)
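
Step 2 of the TODO (outliers beyond 3 standard deviations) is not worked out in the notebook. One possible way to do it, sketched here on synthetic data, is to flag every standardized value whose absolute z-score exceeds 3. The dataframe `demo` and its planted outlier are hypothetical; on the course data the same mask could be applied to `scaled_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data with one planted outlier.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=1000), "y": rng.normal(size=1000)})
demo.loc[0, "x"] = 15.0

# Standardize column by column (population std, as StandardScaler does).
z = (demo - demo.mean()) / demo.std(ddof=0)

# An observation is flagged when its z-score is more than 3 std from the mean.
outlier_mask = z.abs() > 3

# Rows with at least one flagged value, and number of flagged values per column.
outlier_rows = demo[outlier_mask.any(axis=1)]
outliers_per_column = outlier_mask.sum()

print(outliers_per_column)
print(outlier_rows.head())
```

The threshold of 3 is a convention, not a rule: for heavy-tailed columns it will flag many legitimate points.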

In [31]:
_=scaled_df.hist(column=["A","B","C","D","E","F","G","H"])

In [32]:
fig = plt.figure()
ax = plt.axes()
scaled_df.boxplot(column=["A","B","C","D","E","F","G","H"])
plt.show()

## Measures of shape

Skewness and Kurtosis, see here: 
https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9
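
As a reading aid for the skewness and kurtosis values printed earlier, the sketch below computes both on three synthetic samples whose theoretical values are known: a uniform law (skewness 0, excess kurtosis -1.2), a normal law (0 and 0) and an exponential law (2 and 6). The samples are hypothetical and unrelated to the course dataset. pandas' `kurtosis()` returns the excess kurtosis, so a normal distribution gives about 0, not 3.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic samples from three well-known distributions.
rng = np.random.default_rng(0)
n = 100_000
samples = pd.DataFrame({
    "uniform": rng.uniform(-5, 5, size=n),
    "normal": rng.normal(0, 1, size=n),
    "exponential": rng.exponential(1.0, size=n),
})

# Skewness measures asymmetry; excess kurtosis measures tail weight
# relative to a normal distribution.
shape = pd.DataFrame({
    "skewness": samples.skew(),
    "excess_kurtosis": samples.kurtosis(),
})
print(shape.round(2))
```

Compared with these references, column A (kurtosis around -1.17) looks uniform-like, column B looks normal-like, and column F (skewness around 2, kurtosis around 6) looks exponential-like. This is only suggestive: matching two moments does not identify a distribution.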

# Multivariate statistics

## Correlations

**TODO**
<font color=#cc0066>
        
1. Compute the pairwise correlations between the columns
2. Display the result as a 'heatmap'
3. In your opinion, how will standardization affect the correlation? Check on the dataset.
4. If two variables are strongly correlated, will any new feature correlated with one necessarily be correlated with the other?
    
</font>

**Hint:**    
<font color=#0033cc>
    
- the 'seaborn' library makes it very easy to create nice heatmaps
</font>

- Covariance = quantifies how one variable varies with respect to another
- Correlation = normalized covariance

The Pearson correlation coefficient measures the linear relationship between two datasets. 
Linear regression will be covered in detail later in the class.

**Attention:**
- Correlation does not imply causation
- (Pearson's) correlation only measures linear tendency
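
"Correlation = normalized covariance" means r = cov(X, Y) / (std(X) * std(Y)). The sketch below checks this on hypothetical synthetic data and also illustrates question 3 of the TODO: since standardization is just a shift and a positive rescaling of each variable, it leaves the Pearson correlation unchanged, which is what the identical `df.corr()` and `scaled_df.corr()` tables in this notebook show.

```python
import numpy as np

# Hypothetical synthetic data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 3.0 * x + rng.normal(size=500)

# Pearson correlation as covariance divided by the product of the stds
# (same ddof=1 convention on both sides).
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

# Standardize both variables and recompute the correlation.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r_standardized = np.corrcoef(zx, zy)[0, 1]

print("manual:", round(r_manual, 6))
print("numpy :", round(r_numpy, 6))
print("after standardization:", round(r_standardized, 6))
```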

**References:**
https://en.wikipedia.org/wiki/Correlation_and_dependence

In [33]:
# url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1000px-Correlation_examples2.svg.png'
# Image(url = url, width=500)

Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.

https://en.wikipedia.org/wiki/Correlation_and_dependence

In [34]:
# url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/1000px-Anscombe%27s_quartet_3.svg.png'
# Image(url = url, width=500)

Four sets of data with the same correlation of 0.816 

https://en.wikipedia.org/wiki/Correlation_and_dependence

In [35]:
df.corr(numeric_only=True)  # numeric_only (pandas >= 1.5) skips the qualitative columns

Unnamed: 0,A,B,C,D,E,F,G,H
A,1.0,0.000808,-0.000264,0.01407,-0.000892,0.011039,0.012232,0.006009
B,0.000808,1.0,0.003385,0.001377,0.006572,-0.019312,-0.019895,0.003725
C,-0.000264,0.003385,1.0,-0.003513,0.002217,0.006599,0.004055,0.899166
D,0.01407,0.001377,-0.003513,1.0,-0.12106,0.008913,0.010932,-0.000281
E,-0.000892,0.006572,0.002217,-0.12106,1.0,-0.00644,-0.006148,0.002333
F,0.011039,-0.019312,0.006599,0.008913,-0.00644,1.0,0.993152,0.00257
G,0.012232,-0.019895,0.004055,0.010932,-0.006148,0.993152,1.0,0.014469
H,0.006009,0.003725,0.899166,-0.000281,0.002333,0.00257,0.014469,1.0


In [36]:
import seaborn as sns
fig = plt.figure()
ax = plt.axes()
sns.heatmap(df[['A','B','C','D','E','F','G','H']].corr())
plt.show()

In [37]:
scaled_df.corr()

Unnamed: 0,A,B,C,D,E,F,G,H
A,1.0,0.000808,-0.000264,0.01407,-0.000892,0.011039,0.012232,0.006009
B,0.000808,1.0,0.003385,0.001377,0.006572,-0.019312,-0.019895,0.003725
C,-0.000264,0.003385,1.0,-0.003513,0.002217,0.006599,0.004055,0.899166
D,0.01407,0.001377,-0.003513,1.0,-0.12106,0.008913,0.010932,-0.000281
E,-0.000892,0.006572,0.002217,-0.12106,1.0,-0.00644,-0.006148,0.002333
F,0.011039,-0.019312,0.006599,0.008913,-0.00644,1.0,0.993152,0.00257
G,0.012232,-0.019895,0.004055,0.010932,-0.006148,0.993152,1.0,0.014469
H,0.006009,0.003725,0.899166,-0.000281,0.002333,0.00257,0.014469,1.0


In [38]:
import seaborn as sns
fig = plt.figure()
ax = plt.axes()
sns.heatmap(scaled_df.corr())
plt.show()

http://mathforum.org/library/drmath/view/62860.html

## With qualitative data ...

**TODO**   
<font color=#cc0066>

1. Try the describe method on the qualitative data
2. Look for dependencies between the qualitative columns and the other columns
    
</font>

**Hint:**
<font color=#0033cc>
    
- to set the type of a variable, you can use the .astype() method
- to group by the qualitative variable under study, you can use 'groupby'
</font>

In [39]:
df[["réveil","matin","petit dej","lait"]].describe(include='all')

Unnamed: 0,réveil,matin,petit dej,lait
count,10000,10000,10000,10000
unique,3,2,3,2
top,difficile,café,pain au choc,non
freq,3956,8594,4678,5011


In [40]:
df.groupby('réveil').mean(numeric_only=True)  # average only the numeric columns

Unnamed: 0_level_0,A,B,C,D,E,F,G,H
réveil,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
difficile,3.032055,-0.004479,0.047757,0.052305,4.683998,1.033025,1.037904,0.11214
facile,-0.475232,-0.003645,0.001013,-0.005181,4.685972,1.024868,1.029581,0.031701
toujours pas réveillé,-3.483187,0.00581,0.027038,-0.056294,4.695236,1.007817,1.009504,0.037968


In [41]:
df.groupby('matin').mean(numeric_only=True)  # average only the numeric columns

Unnamed: 0_level_0,A,B,C,D,E,F,G,H
matin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
café,0.026228,0.004541,0.012369,-0.006448,4.692194,0.691666,0.695734,0.056584
thé,0.098106,-0.036448,0.117022,0.058445,4.661576,3.049158,3.052016,0.118913


## Dimensionality reduction

- Principle = if the features are not independent, it is because they are linked by common underlying factors (the information is redundant)
- Goal = build a new, smaller set of features that explains the data (almost) as well.

**Why reduce the dimension?**
- Reduce training time
- Make interpretation easier
- Visualize the data (everything is clearer in 2 or 3 dimensions)
- Curse of dimensionality: https://fr.wikipedia.org/wiki/Fl%C3%A9au_de_la_dimension
- Noise filtering: the last components carry very little information and may correspond to noise

**Approaches:**
- Feature selection: https://en.wikipedia.org/wiki/Feature_selection
- Feature extraction: creating new features (studied here)

**PCA**

A method that transforms mutually correlated variables into new variables that are uncorrelated with one another. These new variables are called "principal components", or principal axes. It lets the practitioner reduce the number of variables and make the information less redundant. 

https://fr.wikipedia.org/wiki/Analyse_en_composantes_principales  <br/> 
https://www.youtube.com/watch?v=_UVHneBUBW0

**Other dimensionality reduction methods:**  <br/> 
https://en.wikipedia.org/wiki/Dimensionality_reduction

**TODO**
<font color=#cc0066>
        
1. Run a principal component analysis on the 8 dimensions (A,B,C,D,E,F,G,H) of the dataframe
2. How many components are needed to explain more than 90% of the variance? Is this consistent with the correlations observed?
3. Visualizing the dimension reduction:
    - Run a PCA on columns 'C' and 'H': reduce the dimension to 1
    - These two columns are highly correlated: check that the explained variance is indeed > 90%
    - Compute the projection of the first component in the original space (that of 'C' and 'H')
    - Draw a scatter plot in the original space showing both the initial coordinates and the new projection
    
</font>

**Hint:**    
<font color=#0033cc>
- Load the PCA class (from the decomposition module of the sklearn library)
- This class lets you create a pca object (it accepts arguments such as n_components)
- This pca object has, among others:
    - attributes (explained_variance_ratio_, ...)
    - methods (fit_transform, inverse_transform, ...)
</font>

In [73]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

x = df.loc[:, 'A':'H'].values
x = StandardScaler().fit_transform(x)
pca = PCA(n_components=8)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

[0.24955659 0.23716552 0.14021025 0.12517676 0.12469097 0.10974557
 0.01266905 0.00078529]
[1.99665242 1.89751388 1.12179415 1.00151426 0.99762754 0.87805234
 0.10136251 0.00628298]
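
To answer question 2 of the TODO, one option is to accumulate `explained_variance_ratio_`. With the ratios printed above for the standardized data, the cumulative sum is about 0.877 after five components and 0.987 after six, so six components are needed to pass 90%. That is consistent with the correlation matrix, where two pairs of columns (C/H and F/G) are strongly correlated and therefore largely redundant. The sketch below shows the computation on hypothetical synthetic data (`a`, `b`, `c` are made-up columns, with `b` nearly a copy of `a`).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: b is almost a copy of a, c is independent.
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)
X = StandardScaler().fit_transform(np.column_stack([a, b, c]))

# Fit a full PCA and accumulate the explained variance ratios.
pca = PCA(n_components=3).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%.
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

# Reduce to that many components, then map back to the original space.
pca_reduced = PCA(n_components=n_needed)
scores = pca_reduced.fit_transform(X)
X_back = pca_reduced.inverse_transform(scores)

print("cumulative explained variance:", np.round(cumulative, 3))
print("components needed for 90%:", n_needed)
print("reduced shape:", scores.shape, "reconstructed shape:", X_back.shape)
```

Here three features collapse to two components because one pair is redundant, the same mechanism that brings the 8 columns of the dataset down to six.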


In [85]:
x = df.loc[:, 'A':'H'].values
#x = StandardScaler().fit_transform(x)
pca = PCA(n_components=8)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

[4.21983420e-01 1.97826957e-01 1.81360358e-01 1.07209755e-01
 4.69408006e-02 2.23759357e-02 2.21540905e-02 1.48683065e-04]
[1.91282242e+01 8.96736273e+00 8.22094286e+00 4.85974599e+00
 2.12779487e+00 1.01428609e+00 1.00423000e+00 6.73970322e-03]


In [79]:
#Reducing C and H to one dimension
x = df.loc[:, ['C','H']].values
x = StandardScaler().fit_transform(x)
pca = PCA(n_components=1)
principalComponentsCH = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponentsCH)
print(pca.explained_variance_ratio_,"is greater than 90%, therefore PCA makes sense.")

[0.94958279] is greater than 90%, therefore PCA makes sense.


In [84]:
fig = plt.figure()
ax = plt.axes()
# Map the 1-D scores back into the original (C, H) space
projected = pca.inverse_transform(principalComponentsCH)
# Standardized original points
plt.scatter(scaled_df.C,scaled_df.H,alpha=.5)
# Their projection onto the first principal component
plt.scatter(projected[:,0],projected[:,1],alpha=.2)
plt.show()