## Assignment 3 - Building a Custom Visualization
In this assignment you must choose one of the options presented below and submit a visual as well as your source code for peer grading. The details of how you solve the assignment are up to you, although your assignment must use matplotlib so that your peers can evaluate your work. The options differ in challenge level, but no grades are associated with the challenge level you choose. However, your peers will be asked to ensure you at least meet a minimum quality for a given technique in order to pass. Implement the technique fully (or exceed it!) and you should be able to earn full grades for the assignment.

Ferreira, N., Fisher, D., & Konig, A. C. (2014, April). Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 571-580). ACM.

In this paper the authors describe the challenges users face when trying to make judgements about probabilistic data generated through samples. As an example, they look at a bar chart of four years of data (replicated below in Figure 1). Each year has a y-axis value, which is derived from a sample of a larger dataset. For instance, the first value might be the number of votes in a given district or riding for 1992, with the average being around 33,000. On top of this is plotted the 95% confidence interval for the mean (see the boxplot lectures for more information, and the yerr parameter of bar charts).
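
As a minimal sketch of the idea described above, the snippet below computes a normal-approximation 95% confidence interval for a sample mean and passes its half-width to the `yerr` parameter of a matplotlib bar chart. The helper name `mean_confidence_interval`, the seed, and the sample values are assumptions made for illustration; they are not part of the assignment data.

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation CI for the mean."""
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean()
    # Standard error of the mean: sample SD divided by sqrt(sample size)
    sem = sample.std(ddof=1) / np.sqrt(sample.size)
    # Two-sided critical value, about 1.96 for 95% confidence
    z = stats.norm.ppf(0.5 + confidence / 2)
    return mean, z * sem

rng = np.random.default_rng(0)  # hypothetical seed
# Hypothetical samples keyed by year (illustrative values only)
samples = {1992: rng.normal(33000, 150000, 3650),
           1993: rng.normal(41000, 90000, 3650)}

results = [mean_confidence_interval(s) for s in samples.values()]
bar_means = [m for m, _ in results]
half_widths = [h for _, h in results]

fig, ax = plt.subplots()
# yerr draws a symmetric error bar of +/- half_width on each bar
ax.bar(range(len(bar_means)), bar_means, yerr=half_widths,
       capsize=5, tick_label=list(samples))
```

Passing a single sequence to `yerr` draws symmetric bars; a two-row array of lower and upper distances can be used when the interval is asymmetric.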



In [1]:
# Use the following data for this assignment:

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from sklearn.preprocessing import MinMaxScaler
import ipywidgets as widgets
from IPython.display import display

def generate_data():
    # seed for reproducibility
    np.random.seed(12345)
    
    # generating data for dataframe
    df = pd.DataFrame([np.random.normal(32000,200000,3650), 
                    np.random.normal(43000,100000,3650), 
                    np.random.normal(43500,140000,3650), 
                    np.random.normal(48000,70000,3650)], 
                    index=[1992,1993,1994,1995])

    # return the transpose so that each year becomes a column of the dataframe
    return df.T

In [2]:
# Compute the overall mean and the lower and upper confidence bounds for each year.
def compute_statistics(df,num_samples=30,n=100):
    # initializing variables
    z = stats.norm.ppf(0.975)  # 1.96 for 95% confidence
    years = []
    means = []
    overall_means = []
    lower_bounds = []
    upper_bounds = []
    

    # for all the years 
    for year in list(df.columns):
        years.append(year)

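        # Take num_samples (default 30) samples of n (default 100) observations each,
        # with replacement, from this year's data and record the mean of every sample.
        # Reset the list for each year so means from earlier years are not mixed in.
        means = []
        for _ in range(num_samples):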
            sample = df[year].sample(n, random_state = None,replace = True)
            sample_mean = sample.mean()
            means.append(sample_mean)
        # Compute the mean of the means for each of the four groups and save each result as overall_mean.  
        overall_mean = np.mean(means)
        # Store the overall_mean in the overall_means list
        overall_means.append(overall_mean)

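        # The standard deviation of the sample means estimates the standard error of a
        # single sample mean; overall_mean averages num_samples of those means, so its
        # standard error is that value divided by sqrt(num_samples)
        std_of_mean = np.std(means, ddof=1)
        standard_error = std_of_mean/np.sqrt(num_samples)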

        # use the standard error to find the confidence interval, and use that to find bounds for the error bar plots
        lower_bounds.append(overall_mean - (z * standard_error))
        upper_bounds.append(overall_mean + (z * standard_error))    

    # returning years, overall_means, lower_bounds, upper_bounds
    return years, overall_means,lower_bounds,upper_bounds
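
The relationship between the spread of the sample means and the width of the interval around their average can be checked with a small standalone sketch. It assumes a hypothetical normal population and an arbitrary seed, and uses only NumPy; it is an illustration of the statistics, not the notebook's own function.

```python
import numpy as np

rng = np.random.default_rng(12345)  # hypothetical seed, for repeatability only
sigma, n, num_samples = 200000, 100, 30  # illustrative values

# Hypothetical population resembling one year's column
population = rng.normal(32000, sigma, 3650)

# Mean of each of num_samples samples of size n, drawn with replacement
sample_means = np.array([
    rng.choice(population, size=n, replace=True).mean()
    for _ in range(num_samples)
])

# Spread of the sample means: roughly sigma / sqrt(n)
sd_of_means = sample_means.std(ddof=1)

# Standard error of the average of those num_samples means
se_overall = sd_of_means / np.sqrt(num_samples)

z = 1.96  # approximate 97.5th percentile of the standard normal
overall_mean = sample_means.mean()
lower = overall_mean - z * se_overall
upper = overall_mean + z * se_overall

print(f"SD of sample means: {sd_of_means:.0f} (theory: {sigma / np.sqrt(n):.0f})")
print(f"95% CI for the overall mean: [{lower:.0f}, {upper:.0f}]")
```

The printed standard deviation should land near `sigma / sqrt(n)`, which is why dividing it by `sqrt(n)` a second time would make the interval too narrow.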
    

In [16]:
# Plot the bar chart with error bars and selected value of interest
# and handle colors of bars.
def plot_data(value_of_interest,years, overall_means,lower_bounds,upper_bounds):    
    # Normalize scale and choose a colormap
    norm = Normalize(vmin=(-100),vmax=100)
    cmap = plt.colormaps['coolwarm']
    # initialize colors group
    colors = []

    # Calculate deviation from value_of_interest and determine colors
    deviation_from_y = [value_of_interest - mean for mean in overall_means]

    # Scale deviation_from_y to scale of 100 so as to suit the normalize color scale
    values = np.array([deviation_from_y]).reshape(-1, 1)  
    scaler = MinMaxScaler(feature_range=(0, 100))
    scaled_values = scaler.fit_transform(values)
    scaled_values = scaled_values.flatten()

    # Assign colors based on deviation_from_y: negative deviations (bar mean above
    # the value of interest) map toward red, positive deviations (bar mean below it)
    # map toward blue, and a deviation of exactly 0 maps to the colormap midpoint
    for deviation, scaled in zip(deviation_from_y, scaled_values):
        if deviation < 0:
            colour = cmap(norm(100 - scaled))
        elif deviation > 0:
            colour = cmap(norm(-scaled))
        else:
            colour = cmap(norm(0))
        colors.append(colour)

    # Create the plot            
    plt.figure(figsize=(6,9))
    plt.title('Bar chart of Value of Interest (V of I) vis-à-vis\naverage number of votes from 1992 to 1995', fontweight='bold', fontsize=16)
    yerror = [np.array(overall_means) - np.array(lower_bounds), np.array(upper_bounds) - np.array(overall_means)]
    positions = np.arange(len(years))
    bars = plt.bar(positions,overall_means,width=1.0,edgecolor=None,tick_label=years,color = colors, alpha=0.9)
    plt.errorbar(positions, overall_means, yerr=yerror,fmt=' ', capsize=5,ecolor='black')
    plt.axhline(y = value_of_interest,color='black',linestyle='--', label='V of I')
    plt.yticks([value_of_interest], fontweight='bold')
    cbar = plt.colorbar(plt.cm.ScalarMappable(norm=norm, cmap=cmap),ax=plt.gca(),location = 'bottom', label='Value', alpha=0.9, pad=0.05)
    
    # Dejunking the chart
    # Removing chart borders or spines
    for spine in plt.gca().spines.values():
        spine.set_visible(False)
    # Adding text to bars    
    for bar in bars:
        height = bar.get_height()
        plt.gca().text(bar.get_x() + bar.get_width() / 2, height - 10000, str(int(height)) + '\n votes',
                    ha='center', color='white', fontsize=18, fontweight = 'bold')
    plt.show()
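
For comparison, a simpler way to color bars by their distance from the value of interest is to normalize the signed deviations symmetrically around zero and pass them straight to a diverging colormap. This is a hedged alternative sketch, not the method used in `plot_data`; the function name `diverging_bar_colors` and the example numbers are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

def diverging_bar_colors(bar_means, value_of_interest, cmap_name="coolwarm"):
    """Map each bar's signed distance from value_of_interest to an RGBA color."""
    # Positive when the bar's mean lies above the value of interest
    deviations = np.asarray(bar_means, dtype=float) - value_of_interest
    # Symmetric limits keep a zero deviation at the colormap midpoint;
    # fall back to 1.0 when every bar sits exactly on the value
    limit = np.abs(deviations).max() or 1.0
    norm = Normalize(vmin=-limit, vmax=limit)
    cmap = plt.get_cmap(cmap_name)
    # Calling the colormap on an array returns one RGBA row per bar
    return cmap(norm(deviations))

# Illustrative means: one below, one equal to, one above the value of interest
colors = diverging_bar_colors([30000, 40000, 50000], value_of_interest=40000)
```

With `coolwarm`, bars above the value come out red, bars below come out blue, and a bar sitting on the value takes the near-white midpoint. The resulting array can be passed directly as the `color` argument of `plt.bar`.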


In [17]:
# Setup the slider widget
def setup_slider(initial_value=35000):
    y_slider = widgets.FloatSlider(value=initial_value, min=0, max=50000, step=1000,
                                   description='V of I', orientation='vertical',
                                   style={'description_width': 'initial'},
                                   layout=widgets.Layout(width='80px', height='650px', padding='50px 0 0 0'))
    return y_slider

# Update the plot with slider
def update_plot(change):
    with out:
        out.clear_output()
        plot_data(y_slider.value, years, overall_means, lower_bounds, upper_bounds)

# Main function to initialize and display all components
def main():
    global out, y_slider, years, overall_means, lower_bounds, upper_bounds
    
    # Generate data and compute statistics
    df = generate_data()
    years, overall_means, lower_bounds, upper_bounds= compute_statistics(df)
    
    # Set up slider and output display
    y_slider = setup_slider()
    y_slider.observe(update_plot, names='value')
    out = widgets.Output()
    hbox = widgets.HBox([y_slider, out])
    # Display the HBox containing the slider and plot area
    display(hbox)
    
    # Initial plot
    update_plot(None)

# Run main function to start the widget and plot display
main()    
