# Interactive Dashboard

The goal of this project is to build an interactive web application that lets users explore and model data without writing any code. In the last lesson, I built a model from the highest-variance features in the dataset and created visualizations to communicate the results. Now I want to combine those pieces into a single dynamic tool.

The result should be a user-friendly application in which users can select features, build a model, and evaluate its performance through a graphical interface. In short, I’m creating a platform that lets anyone build and analyze models, making data science accessible to a broader audience.

In [None]:
pip install dash

In [1]:
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

The code below does the following:

1. Load Data: I start by defining a function called wrangle, which takes a file path as an input. The goal of this function is to read a CSV file containing data from the Survey of Consumer Finances (SCF) into a pandas DataFrame.

2. Filter Data: I want to filter this data to focus only on a specific subset of households. Specifically, I’m interested in households that are "credit fearful" (where the TURNFEAR column equals 1) and have a net worth of less than $2 million (NETWORTH column is less than $2,000,000).

3. Apply Filter: Inside the function, I create a boolean mask that identifies the rows meeting these criteria. I then use this mask to filter the DataFrame so that it only includes the rows matching my conditions.

4. Return Filtered Data: Finally, the filtered DataFrame, which contains only the households that are credit fearful and have a net worth under $2 million, is returned by the function.
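
The filtering steps above can be tried on a few made-up rows before pointing `wrangle` at the real file. This is only an illustrative sketch: the CSV text and its values are invented, and only the `TURNFEAR` and `NETWORTH` column names come from the survey data.

```python
import io

import pandas as pd

# Hypothetical miniature of the SCF file: the values are made up.
toy_csv = io.StringIO(
    "TURNFEAR,NETWORTH\n"
    "1,150000\n"
    "0,90000\n"
    "1,3500000\n"
    "1,-20000\n"
)
toy = pd.read_csv(toy_csv)

# The same two conditions described above, combined element-wise with `&`.
mask = (toy["TURNFEAR"] == 1) & (toy["NETWORTH"] < 2e6)
filtered = toy[mask]

print(mask.tolist())
print(filtered)
```

The filtered rows keep their original index labels, which is why a preview of filtered data does not necessarily start at index 0.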

In [2]:
def wrangle(filepath):
    """Read SCF data file into ``DataFrame``.

    Returns only credit fearful households whose net worth is less than $2 million.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    # load data
    df = pd.read_csv(filepath)
    # Create mask
    mask = (df['NETWORTH'] < 2e6) & (df['TURNFEAR'] == 1)
    # Subset DataFrame
    df = df[mask]
    return df

In [3]:
# Load the filtered DataFrame using the wrangle function
df = wrangle('Survey of Consumer Finances.csv')

# Print the type of the DataFrame to verify it’s a pandas DataFrame
print("df type:", type(df))

# Print the shape of the DataFrame to see the number of rows and columns
print("df shape:", df.shape)

# Display the first few rows of the DataFrame to preview the data
df.head()


df type: <class 'pandas.core.frame.DataFrame'>
df shape: (4418, 351)


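|   | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 21 | 3790.476607 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 6 | 2 | 22 | 3798.868505 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 3 | 2 | 2 |
| 7 | 2 | 23 | 3799.468393 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 8 | 2 | 24 | 3788.076005 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 9 | 2 | 25 | 3793.066589 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |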


## Application Layout
First, instantiate the application.

In [4]:
# Instantiate the Dash application. The `__name__` argument lets Dash
# locate resources and manage the app's context.
app = Dash(__name__)


In [5]:
app.layout = html.Div([
        # Application title
        html.H1("Survey of Consumer Finances"),
        
        # Bar chart section
        html.H2("High Variance Features"),
        dcc.Graph(id="bar-chart"),  # Placeholder for the bar chart displaying high variance features
        
        # Radio button for trimmed vs. not trimmed data
        dcc.RadioItems(
            options=[
                {"label": "trimmed", "value": True},
                {"label": "not trimmed", "value": False}
            ],
            value=True,  # Default selection is 'trimmed'
            id="trim-button"
        ),
        
        # K-means Clustering section
        html.H2("K-means Clustering"),
        html.H3("Number of Clusters (k)"),  # Label for the slider
        dcc.Slider(min=2, max=12, step=1, value=2, id="k-slider"),  # Slider to select the number of clusters
        
        # Metrics display section
        html.Div(id="metrics"),  # Placeholder for displaying metrics related to the clustering
        
        # PCA scatter plot section
        dcc.Graph(id="pca-scatter")  # Placeholder for the PCA scatter plot
    ])


In [7]:
def get_high_var_features(trimmed=True, return_feat_names=True):
    
    """Returns the five highest-variance features of ``df``.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    return_feat_names : bool, default=True
        If ``True``, returns feature names as a ``list``. If ``False``
        returns ``Series``, where index is feature names and values are
        variances.
    """
    if trimmed:
        top_five_features = (
            df.apply(trimmed_var).sort_values().tail(5)
        )
    else:
        top_five_features = df.var().sort_values().tail(5)
    
    # Extract names
    if return_feat_names:
        top_five_features = top_five_features.index.tolist()
    return top_five_features

## Test Function

In [8]:
get_high_var_features(trimmed=True, return_feat_names=False)

DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64
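
To see why the trimmed and untrimmed rankings can differ, it helps to look at what trimming does to a single column. The sketch below is a simplified, plain-Python stand-in for the idea, not SciPy's implementation: the helper name `trimmed_variance` and the sample numbers are made up, and it assumes the trimmed variance is the population variance of whatever remains after dropping 10% of the sorted observations from each end.

```python
from statistics import pvariance


def trimmed_variance(values, proportion=0.1):
    """Population variance after dropping `proportion` of the sorted values from each end."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)  # observations removed per tail
    kept = ordered[cut:len(ordered) - cut]
    return pvariance(kept)


# Nine ordinary values plus one extreme outlier (made-up numbers).
net_worth = [10, 12, 11, 13, 12, 11, 10, 13, 12, 1000]

print(pvariance(net_worth))         # dominated by the single outlier
print(trimmed_variance(net_worth))  # outlier dropped before the variance is taken
```

A feature whose variance comes mostly from a handful of extreme households can rank highly without trimming and much lower with it, which is the difference the `trimmed` argument exposes.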

In [9]:
@app.callback(
    Output("bar-chart", "figure"), 
    Input("trim-button", "value")
)
def serve_bar_chart(trimmed=True):
    """Returns a horizontal bar chart of five highest-variance features.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    """
    
    # Get features
    top_five_features = get_high_var_features(trimmed=trimmed, return_feat_names=False)
    
    # Build bar chart
    fig = px.bar(x=top_five_features, y=top_five_features.index, orientation="h")
    fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")
    
    # Save the figure as a static image (``write_image`` needs a
    # static-image engine such as the ``kaleido`` package)
    fig.write_image("bar_chart.png")
    
    return fig


## Test Function

In [14]:
serve_bar_chart(trimmed=False)

![](images/img4.jpg)

In [15]:
def get_model_metrics(trimmed=True, k=2, return_metrics=False):

    """Build ``KMeans`` model based on five highest-variance features in ``df``.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    k : int, default=2
        Number of clusters.

    return_metrics : bool, default=False
        If ``False`` returns ``KMeans`` model. If ``True`` returns ``dict``
        with inertia and silhouette score.

    """
    # Get high var features
    features = get_high_var_features(trimmed=trimmed, return_feat_names=True)
    # Create feature matrix
    X = df[features]
    # Build model
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    model.fit(X)
    
    if return_metrics:
        # Calculate inertia
        i = model.named_steps["kmeans"].inertia_
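        # Calculate silhouette score (note: evaluated on the unscaled
        # features in X, although KMeans was fit on the standardized ones)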
        ss = silhouette_score(X, model.named_steps["kmeans"].labels_)
        # Load scores to dictionary
        metrics = {
            "inertia": round(i),
            "silhouette": round(ss, 3)
        }
        # Return dictionary to user
        return metrics
    return model

## Test Function

In [16]:
get_model_metrics(trimmed=True, k=2, return_metrics=False)





In [17]:
@app.callback(
    Output("metrics", "children"),
    Input("trim-button", "value"),
    Input("k-slider", "value")
)
def serve_metrics(trimmed=True, k=2):
    """Returns a list of HTML ``H3`` elements displaying the inertia and silhouette score
    for the KMeans model.

    Parameters
    ----------
    trimmed : bool, default=True
        Determines whether to use trimmed variance for feature selection, which involves
        removing the bottom and top 10% of observations.

    k : int, default=2
        Specifies the number of clusters for the KMeans model.

    Returns
    -------
    list of html.H3
        A list containing HTML ``H3`` elements that display the calculated inertia and silhouette score
        based on the current inputs for trimmed variance and the number of clusters.
    """
    # Retrieve model metrics based on the current input values
    metric = get_model_metrics(trimmed=trimmed, k=k, return_metrics=True)
    
    # Create HTML elements to display the inertia and silhouette score
    text = [
        html.H3(f"Inertia: {metric['inertia']}"),
        html.H3(f"Silhouette Score: {metric['silhouette']}")
    ]
    
    return text


## Test Function

In [18]:
serve_metrics(trimmed=False, k=4)





[H3('Inertia: 5369'), H3('Silhouette Score: 0.706')]
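
For readers unfamiliar with the two numbers shown here, the following plain-Python sketch computes both on four made-up points. It is meant only to illustrate the definitions (inertia as the sum of squared distances to the assigned centroid, silhouette as the mean of `(b - a) / max(a, b)` per point); the function names, points, and centroids are hypothetical, and scikit-learn's own implementations are what the dashboard actually uses.

```python
from math import dist


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight, well-separated pairs of points (made-up data).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
centroids = {0: (0, 0.5), 1: (10, 0.5)}
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # splits each pair across clusters

print(inertia(points, good_labels, centroids))
print(round(silhouette(points, good_labels), 3))
print(round(silhouette(points, bad_labels), 3))
```

Lower inertia means tighter clusters, while a silhouette score close to 1 means points sit much nearer to their own cluster than to any other; a poor assignment such as `bad_labels` pushes the score below zero.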

In [19]:
def get_pca_labels(trimmed=True, k=2):
    """
    Applies PCA to the top five highest-variance features and assigns KMeans cluster labels.

    Parameters
    ----------
    trimmed : bool, default=True
        If True, calculates trimmed variance, removing the bottom and top 10% of observations.
    
    k : int, default=2
        Number of clusters for the KMeans model.

    Returns
    -------
    pd.DataFrame
        DataFrame with three columns: 'PC1', 'PC2', and 'labels'. 'PC1' and 'PC2' are the first and second principal components, respectively.
        'labels' are the KMeans cluster labels assigned to each data point.
    """
    
    # Retrieve the top five highest-variance features
    features = get_high_var_features(trimmed=trimmed, return_feat_names=True)
    
    # Extract the subset of the DataFrame with these features
    X = df[features]
    
    # Initialize PCA transformer to reduce dimensions to 2
    transformer = PCA(n_components=2, random_state=42)
    
    # Apply PCA transformation
    X_t = transformer.fit_transform(X)
    X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
    
    # Fit the KMeans pipeline and read off its cluster labels
    model = get_model_metrics(trimmed=trimmed, k=k, return_metrics=False)
    X_pca["labels"] = model.named_steps["kmeans"].labels_.astype(str)
    
    # Sort the DataFrame by labels for better visualization
    X_pca.sort_values("labels", inplace=True)
    
    return X_pca


## Test Function

In [20]:
get_pca_labels().head()





|   | PC1 | PC2 | labels |
|---|---|---|---|
| 2208 | 889749.557584 | 467355.407904 | 0 |
| 1056 | 649765.113978 | 174994.130637 | 0 |
| 1057 | 649536.017166 | 176269.044416 | 0 |
| 1058 | 649536.017166 | 176269.044416 | 0 |
| 1059 | 649765.113978 | 174994.130637 | 0 |
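
As a rough illustration of what the PCA step does, the sketch below reduces five made-up features to two components using only NumPy. It is an assumption-laden stand-in rather than scikit-learn's code path: the data is randomly generated, and the projection is done by centering the columns and taking a singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 200 rows, 5 features that share one dominant direction.
base = rng.normal(size=(200, 1))
X = np.hstack([base * scale for scale in (5.0, 4.0, 3.0, 0.5, 0.1)])
X = X + rng.normal(scale=0.5, size=X.shape)

# PCA by hand: center the columns, then project onto the top two
# right-singular vectors of the centered matrix.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ components[:2].T  # columns play the role of PC1 and PC2

# Share of the total variance captured by each component.
explained = singular_values**2 / np.sum(singular_values**2)

print(X_2d.shape)
print(explained[:2].round(3))
```

Like this sketch, scikit-learn's `PCA` centers the data but does not rescale it, which is consistent with the large, dollar-scale values in the `PC1` and `PC2` columns above.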


In [21]:
@app.callback(
    Output("pca-scatter", "figure"),
    Input("trim-button", "value"),
    Input("k-slider", "value")
)
def serve_scatter_plot(trimmed=True, k=2):

    """Build 2D scatter plot of ``df`` with ``KMeans`` labels.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    k : int, default=2
        Number of clusters.
    """
    fig = px.scatter(
        data_frame=get_pca_labels(trimmed=trimmed, k=k),
        x="PC1",
        y="PC2",
        color="labels",
        title="PCA Representation of Clusters"
    )
    fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
    
    # Save the figure as a static image (``write_image`` needs a
    # static-image engine such as the ``kaleido`` package)
    fig.write_image("scatter_plot.png")
    
    return fig


## Test Function

In [23]:
serve_scatter_plot(trimmed=True, k=2)










![scatter_plot](images/scatter_plot.png)


## Application Deployment

In [24]:
# Run the Dash application server
app.run(host="127.0.0.1", port=8001, jupyter_mode="external")
# - `host="127.0.0.1"` binds the server to the local machine (localhost).
# - `port=8001` sets the port the server listens on.
# - `jupyter_mode="external"` prints a link for opening the app in a separate
#   browser tab instead of rendering it inline in the notebook. This keyword
#   assumes Dash 2.11 or later; the older `jupyter_dash.JupyterDash` class
#   took `mode="external"` in `run_server` instead.








![](img.jpg)

![](images/img1.jpg)

![](images/img2.jpg)

![](images/img3.jpg)