# SIADS 521 - Data Visualization Dashboard
## Assignment 03

## 1. Visualization Technique

### 1.1 Description of Visualization Types

In this dashboard, I've implemented multiple visualization techniques that work together to provide a comprehensive analysis of wine quality data:

#### Correlation Heatmap
The correlation heatmap shows the pairwise Pearson correlation (the default method of `df.corr()`) between every pair of variables in the dataset. It is a useful first step in data exploration because it shows which variables move together and how strongly. In the context of wine quality analysis, it helps identify which chemical properties might influence quality ratings.

In [18]:
def create_heatmap(df):
    """
    Create a correlation heatmap for the dataframe.
    """

    # Calculate the correlation matrix
    corr = df.corr()
    
    # Flatten the correlation matrix into parallel lists of x labels,
    # y labels and values, the tuple format that hv.HeatMap accepts
    xs, ys, values = [], [], []

    for idx in corr.index:
        for col in corr.columns:
            xs.append(col)
            ys.append(idx)
            values.append(corr.loc[idx, col])

    # Build the heatmap; clim fixes the colour range to [-1, 1] so the
    # diverging colormap is centred on zero correlation
    heatmap = hv.HeatMap((xs, ys, values)).opts(
        opts.HeatMap(
            tools=['hover'],
            width=700,
            height=600,
            cmap='RdBu_r',
            clim=(-1, 1),
            colorbar=True,
            xrotation=45,
            ylabel='',
            xlabel='',
            fontsize={'title': 16, 'labels': 12, 'xticks': 10, 'yticks': 10}
        )
    )
    
    return heatmap
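
The heatmap shows every pairwise correlation at once, but the question of which properties relate most strongly to quality only needs one column of that matrix. The following is a minimal sketch of that lookup, not part of the dashboard itself: `rank_by_correlation` is a hypothetical helper, and the `toy` DataFrame contains made-up values that merely stand in for the wine data.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine DataFrame (values are made up)
toy = pd.DataFrame({
    "Alcohol":          [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "Volatile Acidity": [0.70, 0.66, 0.58, 0.45, 0.40, 0.30],
    "Ph":               [3.51, 3.20, 3.26, 3.16, 3.40, 3.30],
    "Quality":          [5, 5, 5, 6, 6, 7],
})


def rank_by_correlation(df, target="Quality"):
    """Return Pearson correlations with `target`, strongest first."""
    # Take the target's column of the correlation matrix and drop the
    # trivial self-correlation of 1.0
    corr_with_target = df.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]


ranking = rank_by_correlation(toy)
print(ranking)
```

Sorting by absolute value is a design choice: a strongly negative correlation is as informative as a strongly positive one, and the sign is still preserved in the returned values.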


#### Histograms
Histograms display the distribution of individual variables, showing how values are spread across different ranges. These are critical for understanding the underlying data distribution, identifying outliers, and assessing normality. For wine quality data, histograms help us see the distribution of chemical components like acidity, alcohol content, and pH.

In [19]:
def create_histogram(df, column):
    """
    Create a histogram for a specific column.
    """
    hist = df.hvplot.hist(
        column, 
        bins=20,
        height=250, 
        width=300,
        alpha=0.7, 
        title=f'Distribution of {column}'
    )
    return hist

def create_histograms_grid(df):
    """
    Create a grid of histograms for all columns.
    """
    # Build one histogram per column
    hists = [create_histogram(df, col) for col in df.columns]
    
    # Arrange histograms in a grid
    grid = pn.GridBox(*hists, ncols=3)
    
    return grid
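
To make the `bins=20` setting concrete, the sketch below computes the same kind of binning by hand with `numpy.histogram`. It assumes equal-width bins spanning the observed range, which is how `numpy.histogram` interprets an integer `bins` argument; whether hvPlot bins in exactly this way is an assumption here, and the sample values are made up.

```python
import numpy as np

# Hypothetical sample values standing in for one wine column (made up)
values = np.array([9.4, 9.8, 9.8, 9.8, 9.4, 10.5, 11.2, 12.0, 12.8, 10.1])

# An integer `bins` yields that many equal-width bins between min and max
counts, edges = np.histogram(values, bins=20)

# 20 bins are delimited by 21 edges; each observation falls in exactly one bin
bin_width = edges[1] - edges[0]
print(f"bin width: {bin_width:.3f}")
print(f"non-empty bins: {(counts > 0).sum()} of {len(counts)}")
```

With only a handful of observations most bins are empty, which is a reminder that a fixed bin count suits the 1,143-row dataset better than it would a small sample.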

#### Box Plots
Box plots summarize each variable with a five-number summary (minimum, first quartile, median, third quartile, maximum). When outliers are drawn as separate points, the whiskers stop at the most extreme non-outlying values rather than at the true minimum and maximum. Box plots are well suited to comparing distributions across variables and spotting skewness. In wine analysis, they help compare the range and central tendency of different chemical properties.

In [20]:
def create_boxplots_grid(df):
    """
    Create a grid of box plots for each column in the dataframe.
    """
    plots = []
    for col in df.columns:
        box_plot = df.hvplot.box(
            y=col,
            height=300,
            width=300,
            title=f'Box Plot of {col}',
            box_fill_color='lightblue',
            whisker_color='black',
            tools=['hover']
        )
        plots.append(box_plot)
    
    # Arrange plots in a grid
    grid = hv.Layout(plots).cols(3)
    return grid
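
The numbers behind a box plot can be computed directly with pandas. The sketch below is illustrative only: `five_number_summary` is a hypothetical helper, the readings are made up, and the 1.5 × IQR rule for flagging outliers is the common Tukey convention, assumed here rather than confirmed for the plotting backend.

```python
import pandas as pd


def five_number_summary(series):
    """Return min, Q1, median, Q3, max plus the Tukey outlier fences."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the quartiles are treated as outliers
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = series[(series < lower_fence) | (series > upper_fence)]
    return {
        "min": series.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": series.max(),
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers.tolist(),
    }


# Hypothetical residual-sugar readings (made up), with one extreme value
sugar = pd.Series([1.9, 2.6, 2.3, 1.9, 1.9, 2.0, 2.2, 2.4, 2.1, 15.5])
summary = five_number_summary(sugar)
print(summary)
```

In this made-up sample the single extreme reading lies far above the upper fence, so it would be drawn as an individual point while the upper whisker stops at the largest remaining value.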

#### Scatter Plots
Scatter plots visualize the relationship between two variables, displaying each data point individually. This helps identify patterns, clusters, and potential outliers in bivariate relationships. For wine data, scatter plots can reveal how specific chemical properties relate to each other or to quality ratings.

In [21]:
def create_scatter_plot(df, x_col, y_col):
    """
    Create a scatter plot with given x and y columns.
    """
    scatter = df.hvplot.scatter(
        x=x_col, 
        y=y_col, 
        height=400, 
        width=500,
        alpha=0.7,
        color='blue',
        tools=['hover'],
        title=f'{y_col} vs {x_col} Scatterplot'
    )
    return scatter

#### Line of Best Fit (Regression Line)
The regression line accompanies scatter plots and provides a visual representation of the linear relationship between two variables. By displaying the gradient and R² value in the hover information, users can quickly assess the strength and direction of relationships. This is particularly useful for identifying which wine properties most strongly predict quality.

In [22]:
def create_line_plot(df, x_col, y_col):
    """
    Create only a line of best fit (regression line) with gradient information in hover.
    """
    import numpy as np
    import holoviews as hv
    from holoviews import opts
    from scipy import stats
    import pandas as pd
    
    # Calculate the line of best fit
    mask = ~(np.isnan(df[x_col]) | np.isnan(df[y_col]))
    x = df[x_col][mask]
    y = df[y_col][mask]
    
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    
    # Create x values spanning the range of the data
    x_line = np.linspace(min(x), max(x), 100)
    y_line = slope * x_line + intercept
    
    # Create a DataFrame with the line points and include gradient information
    line_df = pd.DataFrame({
        x_col: x_line,
        y_col: y_line,
        'Gradient': [slope] * len(x_line),
        'R²': [r_value**2] * len(x_line)
    })
    
    # Create the line of best fit using hvplot for better hover functionality
    hover_info = [x_col, y_col, 'Gradient', 'R²']
    
    fit_line = line_df.hvplot.line(
        x=x_col,
        y=y_col,
        color='green',
        line_width=2,
        height=400,
        width=500,
        title=f'{y_col} vs {x_col} Line of Best Fit (Gradient={slope:.3f})',
        hover_cols=hover_info,
        tools=['hover']
    )
    
    return fit_line
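
For a simple linear fit with an intercept, the R² reported in the hover information (`r_value**2`) equals one minus the ratio of residual to total sum of squares. The sketch below works that out with NumPy alone so the quantity is easier to interpret; `fit_line_stats` is a hypothetical helper and the data points are made up.

```python
import numpy as np


def fit_line_stats(x, y):
    """Least-squares slope, intercept and R² for a simple linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # A degree-1 polynomial fit is ordinary least-squares linear regression
    slope, intercept = np.polyfit(x, y, deg=1)

    # R² = 1 - SS_res / SS_tot: the share of variance in y explained by the line
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Made-up points lying close to the line y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_line_stats(x, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R²={r_squared:.3f}")
```

A steep gradient with a low R² means the trend is real but noisy, which is why the dashboard shows both values rather than the gradient alone.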

### 1.2 How Visualizations Complement Each Other

These visualization techniques complement one another in four ways:

1. **Overview to Detail Approach**: The correlation heatmap provides a broad overview of all relationships, while scatter plots and regression lines allow for detailed exploration of specific relationships of interest.

2. **Distribution Context for Relationships**: Histograms and box plots provide the distributional context needed to properly interpret the relationships shown in scatter plots and regression lines.

3. **Multi-perspective Analysis**: Each visualization offers a different perspective on the data. For example, a correlation coefficient in the heatmap might show a strong relationship, which can then be visualized spatially in a scatter plot and quantified with a regression line.

4. **Progressive Investigation**: Users can start with the overview (heatmap), identify interesting relationships, examine the distributions of those variables (histograms/box plots), and then explore specific relationships in detail (scatter plots/regression lines).

### 1.3 Dashboard-specific Considerations

The dashboard incorporates several interactive features to enhance the user experience:

1. **Tab-based Navigation**: Organizes visualizations logically (Overview, Univariate Analysis, Multivariate Analysis) to prevent information overload and guide analysis.

2. **Dynamic Plot Generation**: Users can select variables for scatter plots and regression lines, enabling exploration of any relationship in the dataset without predetermined limitations.

3. **Plot Type Selection**: Users can switch between histogram and box plot views in the univariate analysis tab, offering flexibility in how distributions are visualized.

4. **Hover Information**: All plots include hover functionality that displays precise values, enhancing the ability to extract specific insights.

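5. **Responsive Layout**: Panel is initialized with `sizing_mode='stretch_width'`, so layout containers and the summary table stretch to the available width; the individual plots keep the fixed widths and heights set in their plotting functions.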

6. **Color Coding**: Consistent color schemes help users interpret visualizations more intuitively, with stronger correlations and relationships visually emphasized.

## 2. Visualization Library

### 2.1 Dashboard Framework and Libraries

For this interactive dashboard, I've used the following libraries:

1. **Panel**: Panel is the primary dashboard framework used in this project. Created by Anaconda and part of the HoloViz ecosystem, Panel is an open-source Python library that provides a high-level interface for creating interactive web applications and dashboards. Panel seamlessly integrates with Jupyter notebooks, making it ideal for data exploration and presentation in a research or educational context.

2. **HoloViews**: HoloViews works in conjunction with Panel and provides high-level data structures for building complex visualizations. It simplifies the creation of interactive plots by focusing on the data rather than the plotting details.

3. **hvPlot**: hvPlot extends Pandas with interactive plotting capabilities. It provides a familiar interface for Pandas users while leveraging HoloViews and Bokeh for interactive visualizations.

4. **Bokeh**: Bokeh is the underlying visualization library that powers the interactive elements in Panel, HoloViews, and hvPlot. It specializes in creating interactive, web-ready visualizations that can be embedded in web applications or Jupyter notebooks.

### 2.2 Installation and Setup

To install the necessary libraries for this dashboard:

```bash
pip install panel holoviews hvplot bokeh pandas numpy scipy
```

Or using conda:

```bash
conda install -c pyviz panel holoviews hvplot
conda install -c anaconda bokeh pandas numpy scipy
```

In [23]:
# Necessary python library imports
import pandas as pd
import numpy as np
import panel as pn
import holoviews as hv
from holoviews import opts
import hvplot.pandas

### 2.3 Framework Approach and Limitations

#### Approach
The HoloViz ecosystem (Panel, HoloViews, hvPlot) follows a **declarative** approach to visualization, where users specify what they want to see rather than how to render it. This high-level abstraction allows for rapid dashboard development without getting bogged down in implementation details.

Panel integrates seamlessly with Jupyter notebooks, enabling interactive dashboard development in a familiar environment. The framework supports both standalone deployment and in-notebook interaction, making it flexible for various use cases.

I chose this framework for several reasons:
1. **Python-centric**: It allows me to create interactive dashboards without leaving the Python ecosystem.
2. **Jupyter Integration**: Direct integration with Jupyter notebooks facilitates an iterative development process.
3. **Flexibility**: Panel supports a wide range of visualization libraries (HoloViews, Matplotlib, Bokeh, Plotly, etc.).
4. **Deployment Options**: Dashboards can be deployed as standalone applications or embedded in existing web applications.
5. **Active Development**: The HoloViz ecosystem is actively maintained with regular updates and improvements.

#### Limitations
While powerful, the framework does have some limitations:
1. **Learning Curve**: The HoloViz ecosystem has multiple interrelated libraries with overlapping functionality, which can be confusing for beginners.
2. **Documentation Gaps**: Some advanced features lack comprehensive documentation or examples.
3. **Performance with Large Datasets**: With very large datasets, performance can degrade, especially with multiple interactive elements.
4. **Styling Flexibility**: Fine-grained control over visual styling can sometimes be challenging compared to more specialized visualization libraries.
5. **Deployment Complexity**: While Panel provides deployment options, setting up production-grade deployments requires additional configuration.
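
One common way to soften the large-dataset limitation is to plot a random sample instead of every row. The following is a minimal sketch under that assumption, not something the dashboard currently does: `downsample_for_plot`, its `max_points` threshold and the synthetic `big` DataFrame are all hypothetical.

```python
import numpy as np
import pandas as pd


def downsample_for_plot(df, max_points=5000, seed=0):
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_points:
        return df
    # A fixed random_state keeps the plotted subset stable between redraws
    return df.sample(n=max_points, random_state=seed)


# Synthetic data (made up) that is much larger than the wine dataset
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "Alcohol": rng.normal(10.4, 1.1, 50_000),
    "Quality": rng.integers(3, 9, 50_000),
})

small = downsample_for_plot(big, max_points=2000)
print(len(big), "->", len(small))
```

Random sampling preserves the overall shape of a scatter plot but can drop rare outliers, so it is better suited to an initial view than to outlier analysis.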

## 3. Demonstration

### 3.1 Dataset Selection and Cleaning

For this demonstration, I've chosen the Wine Quality dataset from Kaggle, which contains various chemical properties of wines along with quality ratings. This dataset is particularly suitable for demonstrating multiple visualization types and interactive features because:

1. It contains both continuous variables (chemical properties) and a discrete outcome (quality rating).
2. It has multiple features with potential relationships to explore.
3. The data is clean, complete and well-structured, allowing us to focus on visualization rather than extensive preprocessing.

#### Data Cleaning Steps:
1. **Loading the Data**: The dataset was loaded from a CSV file.
2. **Removing Unnecessary Columns**: The 'Id' column was removed as it doesn't provide relevant information for analysis.
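3. **Standardizing Column Names**: Each word in every column name was capitalized for consistent formatting. Because `str.capitalize()` lowercases the remaining letters, `pH` appears as `Ph`.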
4. **Data Validation**: A quick inspection was performed to ensure there were no missing values or obvious anomalies.

In [24]:
# Data loading and cleaning
df = pd.read_csv("WineQT.csv", delimiter=",")   # Read the CSV file into a pandas DataFrame
df = df.drop('Id', axis=1)                      # Drop the 'Id' column

df.columns = [                                  # Capitalize each word in every column name
    ' '.join(word.capitalize() for word in col.split(' '))
    for col in df.columns
]
df.head()   # Sanity check: first 5 rows


| | Fixed Acidity | Volatile Acidity | Citric Acid | Residual Sugar | Chlorides | Free Sulfur Dioxide | Total Sulfur Dioxide | Density | Ph | Sulphates | Alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [25]:
# Inspect the DataFrame: size, missing values and column data types
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Fixed Acidity         1143 non-null   float64
 1   Volatile Acidity      1143 non-null   float64
 2   Citric Acid           1143 non-null   float64
 3   Residual Sugar        1143 non-null   float64
 4   Chlorides             1143 non-null   float64
 5   Free Sulfur Dioxide   1143 non-null   float64
 6   Total Sulfur Dioxide  1143 non-null   float64
 7   Density               1143 non-null   float64
 8   Ph                    1143 non-null   float64
 9   Sulphates             1143 non-null   float64
 10  Alcohol               1143 non-null   float64
 11  Quality               1143 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 107.3 KB
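
`df.info()` confirms that no values are missing, but it does not reveal other anomalies. Rows 0 and 4 of the `df.head()` output above are identical, for instance, so a duplicate check may be worth adding. The sketch below shows a few such checks; `validate_frame` is a hypothetical helper, and the `sample` frame combines three rows taken from that output with one deliberately flawed, made-up row.

```python
import pandas as pd


def validate_frame(df):
    """Collect basic data-quality counts for a DataFrame."""
    numeric = df.select_dtypes("number")
    return {
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Chemical measurements and ratings should never be negative
        "negative_values": int((numeric < 0).sum().sum()),
    }


# Rows 0, 1 and 4 of the head() output (three columns only), followed by
# one made-up row with a missing value and an impossible negative reading
sample = pd.DataFrame({
    "Fixed Acidity": [7.4, 7.8, 7.4, None],
    "Volatile Acidity": [0.70, 0.88, 0.70, -0.5],
    "Quality": [5, 5, 5, 6],
})

report = validate_frame(sample)
print(report)
```

Whether duplicate rows should be removed is a judgment call: two wines can legitimately share identical measurements, so the count is a prompt for inspection rather than an automatic filter.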


In [None]:
# Additional function for the overview tab (not a visualization)
def data_summary_table(df):
    """
    Create a summary statistics table.
    """
    # describe() puts the statistic names in the index; move them into a 'Stat' column
    summary_df = df.describe().reset_index().rename(columns={'index': 'Stat'})
    
    table = pn.widgets.Tabulator(
        summary_df,
        pagination='remote',
        page_size=10,
        sizing_mode='stretch_width',
        height=300,
        show_index=False
    )
    return table

In [27]:
# Initialize Panel extension
pn.extension('tabulator', sizing_mode='stretch_width')
hv.extension('bokeh')

def create_dashboard(df):
    """
    Create a dashboard with multiple tabs.
    """
    # Define column selectors for multivariate analysis
    columns = list(df.columns)
    x_select = pn.widgets.Select(
        name='X-axis',
        options=columns,
        value=columns[0]
    )
    y_select = pn.widgets.Select(
        name='Y-axis',
        options=columns,
        value=columns[1]
    )
    
    # Add plot type selector for the univariate tab using a dropdown menu
    plot_type_select = pn.widgets.Select(
        name='Plot Type',
        options=['Histogram', 'Box Plot'],
        value='Histogram'
    )
    
    # Define callbacks for dynamic plots
    @pn.depends(x_select.param.value, y_select.param.value)
    def create_dynamic_scatter(x_col, y_col):
        return create_scatter_plot(df, x_col, y_col)
        
    @pn.depends(x_select.param.value, y_select.param.value)
    def create_dynamic_line(x_col, y_col):
        return create_line_plot(df, x_col, y_col)
    
    # Create a new callback for dynamic univariate plots
    @pn.depends(plot_type_select.param.value)
    def create_dynamic_univariate(plot_type):
        if plot_type == 'Histogram':
            return create_histograms_grid(df)
        else:  # Box Plot
            return create_boxplots_grid(df)
    
    # Create tabs
    # Tab 1: Overview with heatmap
    overview_tab = pn.Column(
        pn.pane.Markdown("## Overview"),
        pn.pane.Markdown("### Data Statistical Summary"),
        data_summary_table(df),
        pn.pane.Markdown("### Correlation Heatmap"),
        create_heatmap(df)
    )
    
    # Tab 2: Univariate Analysis with plot type selection dropdown
    univariate_tab = pn.Column(
        pn.pane.Markdown("## Univariate Analysis"),
        pn.Row(
            pn.Column(
                pn.pane.Markdown("### Select Plot Type:"),
                plot_type_select,
                width=200
            )
        ),
        pn.pane.Markdown("### Distribution of Individual Variables"),
        create_dynamic_univariate
    )
    
    # Tab 3: Multivariate Analysis
    multivariate_tab = pn.Column(
        pn.pane.Markdown("## Multivariate Analysis"),
        pn.pane.Markdown("### Dynamic Plots"),
        pn.Row(
            pn.Column(
                pn.pane.Markdown("#### Select Variables for Plotting:"),
                x_select,
                y_select,
                width=200
            ),
            pn.Column(
                create_dynamic_scatter,
                create_dynamic_line
            )
        )
    )
    
    # Combine all tabs
    tabs = pn.Tabs(
        ('Overview', overview_tab),
        ('Univariate Analysis', univariate_tab),
        ('Multivariate Analysis', multivariate_tab)
    )
    
    # Create the main dashboard layout
    dashboard = pn.Column(
        pn.pane.Markdown("# Wine Quality Data Analysis Dashboard"),
        tabs
    )
    
    return dashboard

dashboard = create_dashboard(df)

In [28]:
dashboard.show()

Launching server at http://localhost:57342


<panel.io.server.Server at 0x7fa440baf7f0>

### 3.4 Insights and Data Story

The dashboard enables users to explore several key aspects of wine quality:

1. **Chemical Composition Overview**: The correlation heatmap immediately reveals which chemical properties are related to each other and to quality.

2. **Distribution Patterns**: Histograms and box plots show the distribution of each chemical property, revealing that most wines in the dataset have moderate to high fixed acidity and alcohol content.

3. **Quality Predictors**: The multivariate analysis tab allows users to explore relationships between variables, showing that alcohol content has a positive correlation with quality, while volatile acidity has a negative correlation.

4. **Outlier Identification**: Box plots highlight outliers in several properties, particularly in residual sugar and chlorides, which might represent specialty wines.

5. **Comparative Analysis**: Users can compare the distributions of different chemical properties and their relationships to understand what characterizes higher-quality wines.

## 4. Conclusion and Future Improvements

### 4.1 Conclusion

This interactive dashboard provides a comprehensive tool for exploring wine quality data through multiple visualization techniques. By combining correlation heatmaps, distribution plots, and relationship analyses in an interactive interface, users can gain insights into the factors that influence wine quality and the relationships between various chemical properties.

The Panel framework, combined with HoloViews and hvPlot, proved to be an effective choice for creating this dashboard, offering a balance of ease of development and interactive capabilities while maintaining integration with the Python data science ecosystem.

### 4.2 Future Improvements

Several enhancements could further improve this dashboard:

1. **Filtering Capabilities**: Add filters to allow users to focus on specific subsets of wines (e.g., by quality rating range).

2. **Statistical Tests**: Integrate statistical significance tests for relationships between variables.

3. **Predictive Modeling**: Add a tab with predictive models for wine quality based on chemical properties.

4. **Cluster Analysis**: Implement a clustering analysis to identify natural groupings of wines.

5. **Performance Optimization**: Optimize for larger datasets by implementing data downsampling for initial views.

6. **Export Functionality**: Add options to export visualizations or analysis results.

7. **Comparative Views**: Enable side-by-side comparison of multiple variables or relationships.
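
As an illustration of the first improvement, filtering by quality range comes down to a single pandas expression. The sketch below is one possible shape for it, not an implemented feature: `filter_by_quality` is a hypothetical helper, and the `wines` frame reuses the Alcohol and Quality values of the five rows shown in the `df.head()` output.

```python
import pandas as pd


def filter_by_quality(df, low, high, column="Quality"):
    """Keep rows whose quality rating lies in the closed interval [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    # Series.between is inclusive on both ends by default
    return df[df[column].between(low, high)]


# Alcohol and Quality values from the first five rows of the dataset
wines = pd.DataFrame({
    "Alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "Quality": [5, 5, 5, 6, 5],
})

better_wines = filter_by_quality(wines, 6, 8)
print(better_wines)
```

In the dashboard, the two bounds could plausibly come from a range-slider widget such as Panel's `IntRangeSlider`, with the filtered frame passed to the existing plotting functions; that wiring is an assumption and is not shown here.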

## 5. References

1. Panel Documentation: https://panel.holoviz.org/
2. HoloViews Documentation: https://holoviews.org/
3. hvPlot Documentation: https://hvplot.holoviz.org/
4. Wine Quality Dataset: [Source and attribution information]
5. Bokeh Documentation: https://bokeh.org/