<a href="https://colab.research.google.com/github/Sillians/Atlair-Visualization/blob/master/Atlair_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install altair vega_datasets



In [3]:
import altair as alt
from vega_datasets import data

cars = data.cars()
cars.head()

| | Acceleration | Cylinders | Displacement | Horsepower | Miles_per_Gallon | Name | Origin | Weight_in_lbs | Year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | 8 | 307.0 | 130.0 | 18.0 | chevrolet chevelle malibu | USA | 3504 | 1970-01-01 |
| 1 | 11.5 | 8 | 350.0 | 165.0 | 15.0 | buick skylark 320 | USA | 3693 | 1970-01-01 |
| 2 | 11.0 | 8 | 318.0 | 150.0 | 18.0 | plymouth satellite | USA | 3436 | 1970-01-01 |
| 3 | 12.0 | 8 | 304.0 | 150.0 | 16.0 | amc rebel sst | USA | 3433 | 1970-01-01 |
| 4 | 10.5 | 8 | 302.0 | 140.0 | 17.0 | ford torino | USA | 3449 | 1970-01-01 |


In [4]:
cars.shape

(406, 9)

In [5]:
cars.describe()

| | Acceleration | Cylinders | Displacement | Horsepower | Miles_per_Gallon | Weight_in_lbs |
|---|---|---|---|---|---|---|
| count | 406.0 | 406.0 | 406.0 | 400.0 | 398.0 | 406.0 |
| mean | 15.519704 | 5.475369 | 194.779557 | 105.0825 | 23.514573 | 2979.413793 |
| std | 2.803359 | 1.71216 | 104.922458 | 38.768779 | 7.815984 | 847.004328 |
| min | 8.0 | 3.0 | 68.0 | 46.0 | 9.0 | 1613.0 |
| 25% | 13.7 | 4.0 | 105.0 | 75.75 | 17.5 | 2226.5 |
| 50% | 15.5 | 4.0 | 151.0 | 95.0 | 23.0 | 2822.5 |
| 75% | 17.175 | 8.0 | 302.0 | 130.0 | 29.0 | 3618.25 |
| max | 24.8 | 8.0 | 455.0 | 230.0 | 46.6 | 5140.0 |


**Data in Altair is built around the Pandas dataframe**

We will be using the ‘cars’ dataset, which is loaded above from the vega_datasets package. It is a fairly well-known dataset in the ML community and comprises fuel consumption and other aspects of automobile design and performance for various automobile models, in 406 rows and 9 columns.

In [0]:
import altair as alt
charts = alt.Chart(cars)

# alt.Chart(data).mark_point().encode(
#     encoding_1='column_1',
#     encoding_2='column_2',
#     # etc.
# )

**CHARTS**

The fundamental object in Altair is the Chart, which takes a dataframe as a single argument.
By itself, a chart has no meaning; it is used in conjunction with data, marks, and encodings, which are the core pieces of an Altair chart.

In [0]:
# alt.Chart.mark_point().

**Marks**

Marks let us specify how each row in the data is represented. A number of marks are available, such as point, circle, and square.

**Encodings**

An encoding channel specifies how a given data column should be mapped onto the visual properties of the visualization. Some of the more frequently used visual encodings are:

*   x: x-axis value
*   y: y-axis value
*   color: color of the mark
*   opacity: transparency/opacity of the mark
*   shape: shape of the mark
*   size: size of the mark
*   row: row within a grid of facet plots
*   column: column within a grid of facet plots
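
To make the data / mark / encoding split concrete, here is a minimal sketch of the kind of Vega-Lite style specification that an Altair chart describes. It uses plain dictionaries rather than Altair itself. The helpers `split_shorthand` and `make_spec` are hypothetical names invented for this illustration, and the default type is an assumption (Altair infers the type from the dataframe column when it is omitted).

```python
import json

# Altair-style shorthand type codes and the type names they stand for.
TYPE_CODES = {"Q": "quantitative", "N": "nominal", "O": "ordinal", "T": "temporal"}


def split_shorthand(shorthand, default_type="quantitative"):
    """Split 'Field:Q' into a field name and a type (hypothetical helper).

    Aggregates such as 'mean(Horsepower)' are not handled in this sketch.
    """
    field, sep, code = shorthand.partition(":")
    return {"field": field, "type": TYPE_CODES[code] if sep else default_type}


def make_spec(mark, **channels):
    """Build a dict with one mark and one encoding entry per channel."""
    return {
        "mark": mark,
        "encoding": {name: split_shorthand(value) for name, value in channels.items()},
    }


spec = make_spec("point", x="Miles_per_Gallon:Q", y="Horsepower:Q", color="Origin:N")
print(json.dumps(spec, indent=2))
```

Each keyword passed to `make_spec` plays the role of one encoding channel from the list above.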



In [8]:
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon'
)

The cell above encodes miles per gallon on the x-axis using the encode() method.

In [9]:
alt.Chart(cars).mark_tick().encode(
    x='Miles_per_Gallon'
)

However, the point mark is probably not the best choice for a one-dimensional chart. The cell above replaces it with a tick mark.

In [10]:
alt.Chart(cars).mark_tick().encode(
    y='Origin'
)

We can also map the y-axis of the chart to the Origin column, as the cell above does.

In [11]:
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Displacement'
)

A 1D chart doesn’t convey much information. The cell above converts it into a 2D chart by encoding Displacement on the y-axis.

In [12]:
alt.Chart(cars).mark_line().encode(
    x='Miles_per_Gallon',
    y='Horsepower'
)

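The same kind of 2D encoding also works with mark_line(), as in the cell above, which plots Horsepower against Miles_per_Gallon.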

In [13]:
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Origin'
)

**Color**

While a 2D plot allows us to encode two dimensions of the data, color enables us to encode a third.

In [14]:
# Acceleration is a continuous quantity
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Acceleration'
)

An important thing to notice is that when we use a categorical value for color, it chooses an appropriate color map for categorical data. However, when we choose a continuous color value, we get a color scale.
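
The difference between the two kinds of color mapping can be sketched in plain Python. This is only a toy illustration, not Altair's actual color schemes: the palette and both function names are made up, and the continuous case returns a position between 0 and 1 along a gradient rather than a real color.

```python
PALETTE = ["blue", "orange", "red"]  # arbitrary illustrative palette


def categorical_colors(values):
    """Assign one palette entry per distinct category, in order of first appearance."""
    lookup = {}
    for value in values:
        if value not in lookup:
            lookup[value] = PALETTE[len(lookup) % len(PALETTE)]
    return [lookup[value] for value in values]


def continuous_positions(values):
    """Map each number to a position in [0, 1] between the smallest and largest value."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [(value - lo) / span for value in values]


origins = ["USA", "Europe", "Japan", "USA"]       # illustrative category labels
accelerations = [12.0, 11.5, 11.0, 12.0, 10.5]    # Acceleration for the first five rows

print(categorical_colors(origins))
print(continuous_positions(accelerations))
```

A categorical column gets a small set of distinct colors, while a continuous column is placed along a scale, which is why the two charts get different legends.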

In [15]:
alt.Chart(cars).mark_line().encode(
    x='Year',
    y='mean(Horsepower)',
    color='Origin'
)

Altair also enables us to aggregate within an encoding. In the cell above, the x-axis is Year and the y-axis is the mean of Horsepower for each year, drawn as one line per Origin.

**See how easy it is to plot the above chart without any extra effort or labelling required. This is the power of Altair.**

## **Binning and aggregation**

Groupby is a pretty useful tool in pandas. It splits the data according to some condition, applies some aggregation within those groups, and then combines the data back together. Similar operations can also be achieved in Altair with the help of binning and aggregations.
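
The split, apply, combine pattern itself is small enough to sketch in plain Python. The rows below are made up for illustration and are not taken from the cars dataset.

```python
from collections import defaultdict

# Toy (origin, horsepower) rows, made up for illustration.
rows = [
    ("USA", 130.0),
    ("USA", 165.0),
    ("Europe", 90.0),
    ("Japan", 95.0),
    ("Europe", 110.0),
]

# Split: collect the horsepower values for each origin.
groups = defaultdict(list)
for origin, horsepower in rows:
    groups[origin].append(horsepower)

# Apply and combine: reduce each group to its mean and gather the results.
mean_hp = {origin: sum(values) / len(values) for origin, values in groups.items()}
print(mean_hp)
```

An encoding such as `'mean(Horsepower)'` combined with a grouping channel like color asks the charting layer to do this same kind of reduction before drawing.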

**Histograms**

We can create histograms in Altair without having to explicitly call a hist() function (as in other plotting libraries). In Altair, such binning and aggregation is part of the declarative API. To move beyond a simple field name, we use alt.X() for the x encoding, and we use 'count()' for the y encoding:

In [16]:
alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),
    y='count()'
)

Here, alt.Bin is used to set the bin parameters.
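
What binning plus `'count()'` computes can be sketched in plain Python. This is a toy with a fixed, assumed bin width; with `maxbins`, the bin boundaries are chosen for you, so the exact edges in the chart will differ. The function name `bin_counts` is hypothetical.

```python
import math
from collections import Counter


def bin_counts(values, step):
    """Count how many values fall into each half-open bin [start, start + step)."""
    counts = Counter()
    for value in values:
        start = math.floor(value / step) * step  # left edge of the bin
        counts[start] += 1
    return dict(sorted(counts.items()))


mpg = [18.0, 15.0, 18.0, 16.0, 17.0]  # Miles_per_Gallon for the first five rows
print(bin_counts(mpg, step=2))
```

Each key is the left edge of a bar and each value is the bar's height.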

If we apply another encoding (such as color), the data will be automatically grouped within each bin:

In [17]:
alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),
    y='count()',
    color='Origin'
)

We can also create a separate plot for each category by using the column encoding.

In [18]:
alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),
    y='count()',
    color='Origin',
    column='Origin'
)

**Interactions**

**A very important feature of Altair is the ease of interaction it provides, in the simplest case just by calling the interactive() method on a chart.**

**There are three basic types of selections available:**


1.   Interval Selection: alt.selection_interval()
2.   Single Selection: alt.selection_single()
3.   Multi Selection: alt.selection_multi()






**Basic Interactions: Panning, Zooming, Tooltips**

These are the simplest type of interactions and can be accomplished in a few lines of code in Altair. Hovering over a point will bring up a tooltip with the name of the car model, and clicking/dragging/scrolling will pan and zoom on the plot.

In [19]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Origin',
    tooltip='Name'
).interactive()

**Advanced Interactions: Selections**

Altair provides a general selection API for creating interactive plots; for example, here we create an interval selection:

In [20]:
interval = alt.selection_interval()
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Origin'
).properties(
    selection=interval
)

At this point, this selection doesn’t offer much. However, let’s condition the color on this selection. This helps to highlight the points in the selection.

In [21]:
interval = alt.selection_interval()
alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).properties(
    selection=interval
)
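
The logic behind conditioning color on an interval selection can be sketched without Altair. This is an illustrative stand-in, not how Altair implements it: the brush ranges are made-up numbers and the function names are hypothetical.

```python
def in_interval(point, x_range, y_range):
    """True when the point lies inside the rectangular brush."""
    x, y = point
    return x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]


def conditional_color(points, origins, x_range, y_range, fallback="lightgray"):
    """Keep the origin as the color key inside the brush, use the fallback outside."""
    return [
        origin if in_interval(point, x_range, y_range) else fallback
        for point, origin in zip(points, origins)
    ]


# (Miles_per_Gallon, Horsepower) for the first five rows of the dataset.
points = [(18.0, 130.0), (15.0, 165.0), (18.0, 150.0), (16.0, 150.0), (17.0, 140.0)]
origins = ["USA"] * 5

# Hypothetical brush covering 16-18 mpg and 125-145 horsepower.
print(conditional_color(points, origins, x_range=(16.0, 18.0), y_range=(125.0, 145.0)))
```

Points inside the brush keep their Origin color, and everything else falls back to light gray.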

Also, this selection automatically applies across any compound chart.

In [22]:
interval = alt.selection_interval()
base = alt.Chart(cars).mark_point().encode(
    y='Horsepower',
    color=alt.condition(interval, 'Origin', alt.value('lightgray')),
    tooltip='Name'
).properties(
    selection=interval
)
base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')

You can also do some cool stuff with selections, like creating a histogram that is filtered by the selection and stacking it below the compound scatterplots above.

In [23]:
interval = alt.selection_interval()
base = alt.Chart(cars).mark_point().encode(
    y='Horsepower',
    color=alt.condition(interval, 'Origin', alt.value('lightgray')),
    tooltip='Name'
).properties(
    selection=interval
)
hist = alt.Chart(cars).mark_bar().encode(
    x='count()',
    y='Origin',
    color='Origin'
).properties(
    width=800,
    height=100
).transform_filter(
    interval
)
scatter = base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')
scatter & hist
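
The filtered bar chart amounts to two steps: keep only the rows inside the brushed interval, then count them per origin. A minimal plain-Python sketch of that idea follows; the rows and the brush ranges are made up for illustration, and `selected` is a hypothetical helper.

```python
from collections import Counter

# Toy (Miles_per_Gallon, Horsepower, Origin) rows, made up for illustration.
rows = [
    (18.0, 130.0, "USA"),
    (26.0, 46.0, "Europe"),
    (24.0, 95.0, "Japan"),
    (15.0, 165.0, "USA"),
    (25.0, 87.0, "Europe"),
]

# Hypothetical brushed interval on both axes.
brush = {"Miles_per_Gallon": (20.0, 30.0), "Horsepower": (40.0, 100.0)}


def selected(row):
    """True when the row falls inside the brush on both axes."""
    mpg, horsepower, _origin = row
    mpg_lo, mpg_hi = brush["Miles_per_Gallon"]
    hp_lo, hp_hi = brush["Horsepower"]
    return mpg_lo <= mpg <= mpg_hi and hp_lo <= horsepower <= hp_hi


# Filter first, then aggregate: this mirrors a filter followed by 'count()'.
counts = Counter(row[2] for row in rows if selected(row))
print(dict(counts))
```

Moving the brush changes which rows pass the filter, so the bar lengths update with the selection.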

**Time-Series & Layering**

Let us now discuss the Year column in our dataset. Visualising time series data is an important aspect of EDA, and it is always interesting to see trends over time graphically. Let us look at the trend for Miles_per_Gallon.

Each year has a number of cars and a lot of overlap in the data. We can clean this up a bit by plotting the mean at each x value:

In [35]:
alt.Chart(cars).mark_line().encode(
    x='Year',
    y='mean(Miles_per_Gallon)',
)

Alternatively, we can change the mark to area and use the ci0 and ci1 aggregates to plot the confidence interval of the estimate of the mean:

In [37]:
alt.Chart(cars).mark_area().encode(
    x='Year',
    y='ci0(Miles_per_Gallon)',
    y2='ci1(Miles_per_Gallon)'
)
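
The ci0 and ci1 aggregates are documented in Vega-Lite as the lower and upper boundaries of a bootstrapped 95% confidence interval of the mean. The idea can be sketched in plain Python as below. This is a simple percentile bootstrap under assumed settings (1000 resamples, a fixed seed), not the exact procedure the library runs, and `bootstrap_ci` is a hypothetical name.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower_index = round(tail * n_resamples)
    upper_index = round((1.0 - tail) * n_resamples) - 1
    return means[lower_index], means[upper_index]


mpg = [18.0, 15.0, 18.0, 16.0, 17.0]  # Miles_per_Gallon for the first five rows
lower, upper = bootstrap_ci(mpg)
print(lower, statistics.fmean(mpg), upper)
```

In the chart, the lower bound plays the role of `y` and the upper bound the role of `y2`, computed separately for each year.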

Let’s make the chart more appealing. We will color by country of origin, add some opacity, and make the chart a little wider.

In [44]:
alt.Chart(cars).mark_area(opacity=0.6).encode(
    x=alt.X('Year', timeUnit='year'),
    y=alt.Y('ci0(Miles_per_Gallon)', axis=alt.Axis(title='Miles per Gallon')),
    y2='ci1(Miles_per_Gallon)',
    color='Origin'
).properties(
    width=780
)

Finally, we can make use of Altair’s layering API to add a line chart representing the mean on top of the area chart.

In [48]:
spread = alt.Chart(cars).mark_area(opacity=0.5).encode(
    x=alt.X('Year', timeUnit='year'),
    y=alt.Y('ci0(Miles_per_Gallon)', axis=alt.Axis(title='Miles per Gallon')),
    y2='ci1(Miles_per_Gallon)',
    color='Origin'
).properties(
    width=800
)
lines = alt.Chart(cars).mark_line().encode(
    x=alt.X('Year', timeUnit='year'),
    y='mean(Miles_per_Gallon)',
    color='Origin'
).properties(
    width=800
)
spread + lines
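
The `+`, `|`, and `&` operators used in this notebook correspond to layering, horizontal concatenation, and vertical concatenation of chart specifications. A minimal sketch of how operator overloading can produce such composed specifications is shown below. The `Spec` class is a hypothetical stand-in written for this illustration; real Altair charts carry much more state and simplify nested compositions.

```python
class Spec:
    """Tiny stand-in for a chart that only remembers its specification dict."""

    def __init__(self, spec):
        self.spec = spec

    def __add__(self, other):
        # chart + chart: draw both on the same axes
        return Spec({"layer": [self.spec, other.spec]})

    def __or__(self, other):
        # chart | chart: place side by side
        return Spec({"hconcat": [self.spec, other.spec]})

    def __and__(self, other):
        # chart & chart: place one above the other
        return Spec({"vconcat": [self.spec, other.spec]})


spread = Spec({"mark": "area"})
lines = Spec({"mark": "line"})

layered = spread + lines
print(layered.spec)
```

Because each operator returns another `Spec`, compositions can be nested, for example `(spread | lines) & lines`.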

Altair makes it very simple to create even complex plots. It enables us to put our thoughts directly into visualizations without having to worry about the mechanics behind them. A surprising range of plots, from simple to sophisticated, can be created with minimal effort.

In [24]:
# load an example dataset
from vega_datasets import data
cars = data.cars()

import altair as alt

interval = alt.selection_interval()

alt.Chart(cars).mark_point().encode(
  x='Horsepower',
  y='Miles_per_Gallon',
  color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).properties(
  selection=interval
)

In [25]:
# load an example dataset
from vega_datasets import data
cars = data.cars()

import altair as alt

points = alt.Chart(cars).mark_point().encode(
  x='Year:T',
  y='Miles_per_Gallon',
  color='Origin'
).properties(
  width=800
)

lines = alt.Chart(cars).mark_line().encode(
  x='Year:T',
  y='mean(Miles_per_Gallon)',
  color='Origin'
).properties(
  width=800
).interactive(bind_y=False)
              
points + lines

In [26]:
from vega_datasets import data
stocks = data.stocks()

import altair as alt
alt.Chart(stocks).mark_line().encode(
  x='date:T',
  y='price',
  color='symbol'
).interactive(bind_y=False)

In [27]:
# load an example dataset
from vega_datasets import data
cars = data.cars()

# plot the dataset, referencing dataframe column names
import altair as alt
alt.Chart(cars).mark_point().encode(
  x='Horsepower',
  y='Miles_per_Gallon',
  color='Origin'
).interactive()

In [28]:
# load an example dataset
from vega_datasets import data
cars = data.cars()

# plot the dataset, referencing dataframe column names
import altair as alt
alt.Chart(cars).mark_bar().encode(
  x=alt.X('Miles_per_Gallon', bin=True),
  y='count()',
)

In [34]:
# load an example dataset
from vega_datasets import data
cars = data.cars()

import altair as alt

interval = alt.selection_interval()

base = alt.Chart(cars).mark_point().encode(
  y='Miles_per_Gallon',
  color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).properties(
  selection=interval
)

base.encode(x='Acceleration') | base.encode(x='Horsepower')

In [30]:
# Dictionary that maps from values to frequencies
t = [23, 45, 67, 23, 56, 77, 11, 22, 34, 45, 67, 88, 55, 89, 8, 25, 34, 45, 56, 77, 788]

hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
hist

{8: 1,
 11: 1,
 22: 1,
 23: 2,
 25: 1,
 34: 2,
 45: 3,
 55: 1,
 56: 2,
 67: 2,
 77: 2,
 88: 1,
 89: 1,
 788: 1}

In [31]:
# To map from frequencies to probabilities, divide each frequency by n; this is called normalization.
# PMF = probability mass function (a function that maps from values to probabilities)
n = float(len(t))
pmf = {}
for x, freq in hist.items():
    pmf[x] = freq / n
pmf

{8: 0.047619047619047616,
 11: 0.047619047619047616,
 22: 0.047619047619047616,
 23: 0.09523809523809523,
 25: 0.047619047619047616,
 34: 0.09523809523809523,
 45: 0.14285714285714285,
 55: 0.047619047619047616,
 56: 0.09523809523809523,
 67: 0.09523809523809523,
 77: 0.09523809523809523,
 88: 0.047619047619047616,
 89: 0.047619047619047616,
 788: 0.047619047619047616}
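
The two cells above can be folded into one small function, which also makes it easy to check that the probabilities sum to 1. This is a minimal sketch using `collections.Counter`; the name `make_pmf` is hypothetical.

```python
import math
from collections import Counter


def make_pmf(values):
    """Map each distinct value to its relative frequency in `values`."""
    counts = Counter(values)  # value -> frequency, i.e. the histogram
    n = len(values)
    return {x: freq / n for x, freq in sorted(counts.items())}


t = [23, 45, 67, 23, 56, 77, 11, 22, 34, 45, 67, 88, 55, 89, 8, 25, 34, 45, 56, 77, 788]
pmf = make_pmf(t)

# A valid PMF sums to 1, up to floating-point rounding.
print(math.isclose(sum(pmf.values()), 1.0))
```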

In [0]:
import logging
import math
import random

In [33]:
class _DictWrapper(object):
    """An object that contains a dictionary."""
    
    def __init__(self, d=None, name=''):
         # if d is provided, use it; otherwise make a new dict
        if d is None:
            d = {}
        self.d = d
        self.name = name
        
    def GetDict(self):
        """Gets the dictionary."""
        return self.d
    
    def Values(self):
        """Gets an unsorted sequence of values.

        Note: one source of confusion is that the keys in this
        dictionaries are the values of the Hist/Pmf, and the
        values are frequencies/probabilities.
        """
        return self.d.keys()
    
    def Items(self):
        """Gets an unsorted sequence of (value, freq/prob) pairs."""
        return self.d.items()
    
    def Render(self):
        """Generates a sequence of points suitable for plotting.

        Returns:
            tuple of (sorted value sequence, freq/prob sequence)
        """
        return zip(*sorted(self.Items()))
    
    def Print(self):
        """Prints the values and freqs/probs in ascending order."""
        for val, prob in sorted(self.d.items()):
            print(val, prob)

    def Set(self, x, y=0):
        """Sets the freq/prob associated with the value x.

        Args:
            x: number value
            y: number freq or prob
        """
        self.d[x] = y

    def Incr(self, x, term=1):
        """Increments the freq/prob associated with the value x.

        Args:
            x: number value
            term: how much to increment by
        """
        self.d[x] = self.d.get(x, 0) + term

In [0]:
# Pmf is the helper module from Think Stats; Pmf.py must be importable for this cell to run
import Pmf
hist = Pmf.MakeHistFromList([1, 2, 2, 3, 5])
print(hist.GetDict())

In [0]:
import matplotlib.pyplot as pyplot
pyplot.pie([1,2,3])
pyplot.show()

In [0]:
def Hists(hists):
    """Plot two histograms on the same axes.

    hists: list of Hist
    """
    width = 0.4
    shifts = [-width, 0.0]

    option_list = [
        dict(color='0.9'),
        dict(color='blue')
        ]

    pyplot.clf()
    for i, hist in enumerate(hists):
        xs, fs = hist.Render()
        xs = Shift(xs, shifts[i])
        pyplot.bar(xs, fs, label=hist.name, width=width, **option_list[i])
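
`Hists` calls a helper named `Shift` that is not defined anywhere in this notebook. A minimal sketch of what it needs to do, assuming it simply offsets every x position by a constant so the two sets of bars sit side by side:

```python
def Shift(xs, shift):
    """Return a new list with `shift` added to every element of `xs`."""
    return [x + shift for x in xs]


# Example: move the bar positions half a unit to the right.
print(Shift([1, 2, 3], 0.5))
```

With this in place, `Hists` can be called with a list of two objects that provide `Render()` and a `name` attribute.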
