In [28]:
import altair as alt
import pandas as pd
import numpy as np

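# read_csv treats the first row as a header; if leaf.csv has no header row,
# pass header=None so the first specimen is not consumed as column names
leaf = pd.read_csv('leaf.csv')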

leaf.columns = ['Class', 'Specimen Number', 'Eccentricity', 'Aspect Ratio', 
                'Elongation', 'Solidity', 'Stochastic Convexity', 
                'Isoperimetric Factor', 'Maximal Indentation Depth', 
                'Lobedness', 'Average Intensity', 'Average Contrast', 
                'Smoothness', 'Third Moment', 'Uniformity', 'Entropy']

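# One interval selection (brush) shared by the charts below
brush = alt.selection_interval()

# Average contrast of every sample, one column of squares per class
avcon = alt.Chart(leaf).mark_square().encode(
    x='Class',
    y='Average Contrast',
    color='Class:N'
).add_selection(
    brush
)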

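# Entropy against average intensity, coloured by average contrast;
# only the samples inside the current brush are shown
bar = alt.Chart(leaf).mark_bar().encode(
    x='Entropy',
    y='Average Intensity',
    color='Average Contrast:Q'
).transform_filter(
    brush
)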

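# Binned scatterplot: mark size is the number of samples in each bin,
# colour is the mean class number of those samples
circ = alt.Chart(leaf).mark_circle().encode(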
    alt.X('Eccentricity', bin=True),
    alt.Y('Elongation', bin=True),
    size='count()',
    color='average(Class):Q'
).add_selection(
    brush
)

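# & binds tighter than |, so this is (avcon & circ) | bar:
# avcon stacked above circ, with bar to the right
avcon & circ | bar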

Since there were multiple samples for each class of leaf, I chose to visualise the range of average contrast within each class, because average contrast is the feature I compare against the others. Plotting each sample as a square and colouring nominally by class produced a distinct vertical line of marks for each class. Because the marks are semi-transparent, overlapping samples appear darker, so each line shows both the spread of values and where they cluster. I chose this method over a box plot because, with such a large number of classes, I found it more aesthetically pleasing and easier to comprehend.
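The spread that each column of squares shows can also be checked numerically. The following is a minimal sketch, assuming the column names assigned in the cell above; the helper name `contrast_range_by_class` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def contrast_range_by_class(df, class_col="Class", value_col="Average Contrast"):
    """Return the min, max, sample count and range of value_col for each class."""
    summary = df.groupby(class_col)[value_col].agg(["min", "max", "count"])
    # Range = distance between the lowest and highest sample in the class
    summary["range"] = summary["max"] - summary["min"]
    return summary


# Hypothetical usage with the DataFrame loaded in the cell above:
# contrast_range_by_class(leaf)
```

Classes with a large `range` should correspond to the longest lines in the chart, and `count` indicates how many squares overlap along each one.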

I used a bar chart to compare entropy and average intensity, as there is a positive correlation between these features. I also encoded average contrast as a quantitative colour scale, which allowed me to display the positive correlation between all three features in one chart. The overlapping bars give a sense of the volume of data represented in the chart while still showing the correlation clearly. Because this chart is filtered by the brush, it shows only the samples selected in the other two charts.
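The positive correlation described here can be quantified rather than read off the chart. This is a small sketch using pandas' built-in Pearson correlation; the helper name `feature_correlations` is an assumption made for illustration, and the column names are those assigned in the cell above.

```python
import pandas as pd


def feature_correlations(df, columns):
    """Pairwise Pearson correlation matrix for the given numeric columns."""
    return df[list(columns)].corr(method="pearson")


# Hypothetical usage with the DataFrame loaded in the cell above:
# feature_correlations(leaf, ["Entropy", "Average Intensity", "Average Contrast"])
```

Off-diagonal values close to 1 would support the claim that the three features rise together; values near 0 would suggest the pattern in the chart is weaker than it looks.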

To display the relationship between elongation, eccentricity and class, I chose a binned scatterplot with quantitative colouring to represent class. There were so many data points that, when each was plotted individually, the colours were difficult to distinguish in the densely populated areas of the chart. By binning the data, I was able to show those dense areas through the size of each mark while also displaying the average class through its colour. Here we can see two tails within the chart, both of which show that leaves in the highest classes tend to have higher values of these features. Most of the data fits within the main curve of the graph; however, the smaller tail shows that the outliers to this trend are those within the highest and lowest classes.
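To make explicit what each circle in the binned scatterplot represents, the same aggregation can be approximated in pandas. This is only a sketch: the helper name `binned_summary` is hypothetical, and `pd.cut` uses equal-width bins, which will not necessarily match the bin boundaries Altair picks for `bin=True`, so the counts per cell may differ from the chart.

```python
import pandas as pd


def binned_summary(df, x_col="Eccentricity", y_col="Elongation",
                   class_col="Class", bins=10):
    """Count samples and average the class number in each 2-D bin."""
    binned = df.assign(
        x_bin=pd.cut(df[x_col], bins=bins),  # equal-width bins along x
        y_bin=pd.cut(df[y_col], bins=bins),  # equal-width bins along y
    )
    return (
        binned.groupby(["x_bin", "y_bin"], observed=True)[class_col]
        .agg(count="count", mean_class="mean")  # mark size and mark colour
        .reset_index()
    )


# Hypothetical usage with the DataFrame loaded in the cell above:
# binned_summary(leaf)
```

Here `count` plays the role of the mark size and `mean_class` the role of the colour. As in the chart, the mean treats the class numbers as quantities, so a mid-range colour can arise either from mid-numbered classes or from a mix of low and high ones.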