This notebook is to be completed for TP1 (lab session 1) of the MLAI course.
In this lab we will use the [Altair](https://altair-viz.github.io) library.

Load the cars dataset from a CSV file using pandas

In [5]:
import pandas as pd
#cars = pd.read_csv('cars.csv')
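
The commented-out `read_csv` call above can be tried without a file on disk. The following is a minimal sketch, assuming that `cars.csv` has the same nine columns as the vega_datasets version; the two rows and the name `cars_from_csv` are only illustrative stand-ins for the real file.

```python
import io

import pandas as pd

# Two sample rows standing in for a real cars.csv file (assumed layout).
csv_text = """Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
"""

# parse_dates asks pandas to convert the Year column to a datetime type,
# which is what the vega_datasets loader also provides.
cars_from_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["Year"])
print(cars_from_csv.dtypes)
```

With a real file, `io.StringIO(csv_text)` would simply be replaced by the file path.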

In [6]:
# Alternative:
from vega_datasets import data
cars = data.cars()

In [7]:
# Dataset dimensions (rows, columns)
cars.shape

(406, 9)

In [8]:
# Data type of each column
cars.dtypes

Name                        object
Miles_per_Gallon           float64
Cylinders                    int64
Displacement               float64
Horsepower                 float64
Weight_in_lbs                int64
Acceleration               float64
Year                datetime64[ns]
Origin                      object
dtype: object

In [9]:
# Dataset first rows
cars.head()

| | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |


Now import Altair and, if your environment requires it, enable its notebook renderer

In [11]:
import altair as alt
#alt.renderers.enable("notebook")

The fundamental object in [Altair](https://altair-viz.github.io/user_guide/marks.html) is the `Chart`, which takes a dataframe as its single argument.

The `Chart` is used in conjunction with *marks* and *encodings*:
- **Marks** let us represent each row in the data with a point, circle, square, bar, tick, ...
- **Encodings** specify how a given data column should be mapped onto the visual properties of the visualization:
  - _x_: x-axis value
  - _y_: y-axis value
  - _color_: color of the mark
  - _opacity_: transparency/opacity of the mark
  - _shape_: shape of the mark
  - _size_: size of the mark
  - _row_: row within a grid of facet plots
  - _column_: column within a grid of facet plots

Let's try our first visualization with Altair

# Part One - TP1

In [12]:
alt.Chart(cars).mark_point()

In [13]:
# Add an x-axis encoding for one of the dimensions
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon'
)

In [14]:
# Use tick marks to better visualize each item
alt.Chart(cars).mark_tick().encode(
    x='Miles_per_Gallon'
)

In [15]:
# Add a y-axis encoding another quantitative dimension
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Displacement'
)

In [16]:
# Add color to encode a Nominal dimension
alt.Chart(cars).mark_point().encode(
    x='Horsepower', 
    y='Miles_per_Gallon',
    color='Origin:N'
).interactive()

In [22]:
# Facet the chart into one panel per number of cylinders
alt.Chart(cars).mark_point().encode(
    x='Horsepower', 
    y='Miles_per_Gallon',
    color='Origin:N',
    column='Cylinders:N'
).properties(width=135, height=135).interactive()

In [23]:
# Add a tooltip showing the origin of each point
alt.Chart(cars).mark_point().encode(
    x='Horsepower', 
    y='Miles_per_Gallon',
    color='Origin:N',
    column='Cylinders:N',
    tooltip='Origin:N',
).properties(width=135, height=135).interactive()

In [34]:
# Switch from point marks to bar marks, repeated over a grid of variable pairs
alt.Chart(cars).mark_bar().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='Origin:N',
    tooltip='Origin:N'
).properties(
    width=135, 
    height=135
).repeat(
    row=['Horsepower', 'Acceleration', 'Miles_per_Gallon', 'Displacement'],
    column=['Horsepower', 'Acceleration', 'Miles_per_Gallon', 'Displacement']
).interactive()

In [36]:
# Fit a regression of Displacement on Horsepower (linear by default) and draw it as a line
regression_chart = alt.Chart(cars).transform_regression(
    'Horsepower', 
    'Displacement'
).mark_line().encode(
    x='Horsepower:Q',
    y='Displacement:Q'
)

regression_chart

# Part Two - TP1

In [62]:
import numpy as np

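# Fix the seed so that the random samples are reproducible
np.random.seed(0)
# Parameters of the two Gaussians and number of samples drawn from each
mean1, std1 = 0, 1
mean2, std2 = 5, 1
n_samples = 1000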

data1 = np.random.normal(mean1, std1, n_samples)
data2 = np.random.normal(mean2, std2, n_samples)

df = pd.DataFrame({'Data': np.concatenate([data1, data2]),
                   'Distribution': ['Gaussian 1'] * n_samples + ['Gaussian 2'] * n_samples})

chart = alt.Chart(df)

density_chart = chart.transform_density(
    'Data',
    as_=['Data', 'density'],
    groupby=['Distribution']
).mark_point().encode(
    x=alt.X('Data:Q', title='Value'),
    y=alt.Y('density:Q', title='Density'),
    color=alt.Color('Distribution:N', title='Distribution')
).interactive()

density_chart

In [63]:
# Add two smaller samples (30% of n_samples each) drawn from the same two Gaussians
weight1 = 0.3
weight2 = 0.3

n_samples_gaussian3 = int(n_samples * weight1)
data3 = np.random.normal(mean1, std1, n_samples_gaussian3)

n_samples_gaussian4 = int(n_samples * weight2)
data4 = np.random.normal(mean2, std2, n_samples_gaussian4)

data12 = np.concatenate([data1, data2])
data34 = np.concatenate([data3, data4])
df = pd.DataFrame({'Data': np.concatenate([data12, data34]),
                   'Distribution': ['Gaussian 1'] * n_samples + ['Gaussian 2'] * n_samples + ['Gaussian 3'] * n_samples_gaussian3 + ['Gaussian 4'] * n_samples_gaussian4})

chart = alt.Chart(df)

density_chart = chart.transform_density(
    'Data',
    as_=['Data', 'density'],
    groupby=['Distribution']
).mark_point().encode(
    x=alt.X('Data:Q', title='Value'),
    y=alt.Y('density:Q', title='Density'),
    color=alt.Color('Distribution:N', title='Distribution')
).interactive()

density_chart

In [64]:
# Add two more samples (70% of n_samples each) drawn from the same two Gaussians
weight1 = 0.7
weight2 = 0.7

n_samples_gaussian5 = int(n_samples * weight1)
data5 = np.random.normal(mean1, std1, n_samples_gaussian5)

n_samples_gaussian6 = int(n_samples * weight2)
data6 = np.random.normal(mean2, std2, n_samples_gaussian6)

data1234 = np.concatenate([data12, data34])
data56 = np.concatenate([data5, data6])
df = pd.DataFrame({'Data': np.concatenate([data1234, data56]),
                   'Distribution': ['Gaussian 1'] * n_samples 
                   + ['Gaussian 2'] * n_samples 
                   + ['Gaussian 3'] * n_samples_gaussian3 
                   + ['Gaussian 4'] * n_samples_gaussian4
                   + ['Gaussian 5'] * n_samples_gaussian5
                   + ['Gaussian 6'] * n_samples_gaussian6})

chart = alt.Chart(df)

density_chart = chart.transform_density(
    'Data',
    as_=['Data', 'density'],
    groupby=['Distribution']
).mark_point().encode(
    x=alt.X('Data:Q', title='Value'),
    y=alt.Y('density:Q', title='Density'),
    color=alt.Color('Distribution:N', title='Distribution')
).interactive()

density_chart

As the number of samples increases, the display seems to degrade: the points along each density curve look irregularly, almost randomly, distributed.

In [65]:
from sklearn.neighbors import KernelDensity

data123456 = np.concatenate([data1234, data56]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(data123456)
# score_samples returns the log-density at each sample, not the density itself
data_pred = kde.score_samples(data123456)

df = pd.DataFrame({'Data': data_pred})

chart = alt.Chart(df)

# df has no 'Distribution' column here, so the density is not grouped
density_chart = chart.transform_density(
    'Data',
    as_=['Data', 'density']
).mark_point().encode(
    x=alt.X('Data:Q', title='Value'),
    y=alt.Y('density:Q', title='Density'),
).interactive()

density_chart

This is one possible solution, using kernel density estimation from scikit-learn.

See more examples of visualizations at: https://altair-viz.github.io/gallery/

You can learn more by trying the Altair tutorial: https://github.com/altair-viz/altair-tutorial/blob/master/notebooks/Index.ipynb