 # Notebook 04 – Building the Interactive Streamlit Dashboard

# Objectives
 - Load the dashboard-ready dataset (cleaned in Notebook 01, prepared in Notebook 03).
 - Prepare data with feature parity with Notebook 03.
 - Generate `dashboard/polling_data_dashboard.py` with interactive visualizations: time series, histograms, and average margins by pollster and methodology.
 - Use the trained Random Forest (`models/rf_model.pkl`), if available, to show predicted margins.

# Inputs
 - `data/clean/generic_ballot_polls_clean.csv` (clean dataset from Notebook 01)
 - `models/rf_model.pkl` (RandomForest trained in Notebook 03)
 - `data/clean/generic_ballot_polls_dashboard.csv` (dashboard-ready dataset from Notebook 03)
 
# Outputs
 - `dashboard/polling_data_dashboard.py` written under the project root.
 - Guidance for running the dashboard.


---

# Section 1 — Load Cleaned Dataset

- We load the dashboard-ready dataset produced earlier and inspect its structure.


In [74]:
import pandas as pd
from pathlib import Path
import joblib
from sklearn.preprocessing import MinMaxScaler

BASE_DIR = Path().resolve().parent
df = pd.read_csv(BASE_DIR / "data" / "clean" / "generic_ballot_polls_dashboard.csv")


# Recompute margin and rolling averages for visualization
# (the window is 7 polls in row order, not 7 calendar days)
df['margin'] = df['dem'] - df['rep']
df['dem_roll7'] = df['dem'].rolling(7, min_periods=1).mean()
df['rep_roll7'] = df['rep'].rolling(7, min_periods=1).mean()
df['margin_roll7'] = df['margin'].rolling(7, min_periods=1).mean()

df.head()


| | start_date | end_date | pollster | sample_size | dem | rep | methodology | pollster_rating_id | numeric_grade | margin | dem_roll7 | rep_roll7 | margin_roll7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-11-09 | 2022-11-11 | Morning Consult | 7439.0 | 47.0 | 44.0 | Online Panel | 218 | 1.8 | 3.0 | 47.0 | 44.0 | 3.0 |
| 1 | 2022-11-10 | 2022-11-12 | Morning Consult | 7439.0 | 48.0 | 44.0 | Online Panel | 218 | 1.8 | 4.0 | 47.5 | 44.0 | 3.5 |
| 2 | 2022-11-11 | 2022-11-13 | Morning Consult | 7439.0 | 48.0 | 44.0 | Online Panel | 218 | 1.8 | 4.0 | 47.666667 | 44.0 | 3.666667 |
| 3 | 2022-11-12 | 2022-11-14 | Morning Consult | 7439.0 | 48.0 | 44.0 | Online Panel | 218 | 1.8 | 4.0 | 47.75 | 44.0 | 3.75 |
| 4 | 2022-11-13 | 2022-11-15 | Morning Consult | 7439.0 | 47.0 | 44.0 | Online Panel | 218 | 1.8 | 3.0 | 47.6 | 44.0 | 3.6 |


In [75]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_date          608 non-null    object 
 1   end_date            608 non-null    object 
 2   pollster            608 non-null    object 
 3   sample_size         608 non-null    float64
 4   dem                 608 non-null    float64
 5   rep                 608 non-null    float64
 6   methodology         608 non-null    object 
 7   pollster_rating_id  608 non-null    int64  
 8   numeric_grade       608 non-null    float64
 9   margin              608 non-null    float64
 10  dem_roll7           608 non-null    float64
 11  rep_roll7           608 non-null    float64
 12  margin_roll7        608 non-null    float64
dtypes: float64(8), int64(1), object(4)
memory usage: 61.9+ KB



The dataset above is the dashboard-ready polling data. Next, we apply the basic preprocessing needed for the dashboard and for predictions.

---

# Section 2 — Preprocessing & Feature Parity
 
 - Drop fully empty columns such as `ind` (if present).
 - Impute missing `numeric_grade` values with the column median.
 - Fill missing `methodology` with 'Unknown'.
 - Compute `margin` and 7-poll rolling features (a row-based window over date-sorted polls) for smoother visualisation.
 - Load the trained Random Forest model (`models/rf_model.pkl`) for predictions.


In [76]:
# Preprocessing
if 'ind' in df.columns:
    df.drop(columns=['ind'], inplace=True)

df['numeric_grade'] = df['numeric_grade'].fillna(df['numeric_grade'].median())
df['methodology'] = df['methodology'].fillna('Unknown')

df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['end_date'] = pd.to_datetime(df['end_date'], errors='coerce')

df['margin'] = df['dem'] - df['rep']
df = df.sort_values('start_date').reset_index(drop=True)

df['dem_roll7'] = df['dem'].rolling(7, min_periods=1).mean()
df['rep_roll7'] = df['rep'].rolling(7, min_periods=1).mean()
df['margin_roll7'] = df['margin'].rolling(7, min_periods=1).mean()

df_dashboard = df.copy()

In [77]:
import joblib

model_path = BASE_DIR / "models" / "rf_model.pkl"

if model_path.exists():
    try:
        rf_model = joblib.load(model_path)
        print("Model loaded successfully.")
    except Exception as e:
        print("Model could not be loaded:", e)
else:
    print("No model found.")


Model loaded successfully.


These steps ensure the dataset is ready for visualization and matches the features used in Notebook 03 for model predictions.
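The rolling features above use `rolling(7, min_periods=1)`, which averages the last 7 rows (polls) rather than the last 7 calendar days. If a true calendar window is wanted, pandas accepts a time offset such as `"7D"`. The sketch below contrasts the two on made-up toy data; the column names `margin_roll7_rows` and `margin_roll7_days` are illustrative and do not exist in the notebook.

```python
import pandas as pd

# Toy data, invented for illustration: two polls in one week, one much later
polls = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-20"]),
    "margin": [2.0, 4.0, -1.0],
})

# A time-based window requires the datetime column to be sorted
polls = polls.sort_values("start_date").reset_index(drop=True)

# Row-based window: mean of up to the last 7 polls, however far apart they are
polls["margin_roll7_rows"] = polls["margin"].rolling(7, min_periods=1).mean()

# Calendar window: mean of the polls whose start_date falls in the last 7 days
polls["margin_roll7_days"] = polls.rolling("7D", on="start_date")["margin"].mean()

print(polls)
```

In the toy data the third poll is 17 days after the second, so the calendar window contains only that poll while the row window still averages all three.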

---

# Section 3 — Prepare Features for Model Predictions
 - One-hot encode `pollster` and `methodology` to match model training.
 - Scale `numeric_grade` with `MinMaxScaler`, as in Notebook 03.
 - Ensure columns match `rf_model.feature_names_in_`.
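The column-matching step can be sketched on toy data. The list `train_features` below is a hypothetical stand-in for `rf_model.feature_names_in_`, and the pollster and methodology values are made up. The sketch encodes every category and lets `reindex` decide which columns survive, because `drop_first=True` applied to new data may drop a different category than it did at training time.

```python
import pandas as pd

# Hypothetical training-time feature list (stand-in for rf_model.feature_names_in_)
train_features = ["sample_size", "pollster_YouGov", "methodology_Online Panel"]

# Toy data: contains a pollster and a methodology the "model" never saw
new_polls = pd.DataFrame({
    "sample_size": [1000, 1500],
    "pollster": ["Ipsos", "YouGov"],
    "methodology": ["Live Phone", "Live Phone"],
})

# Encode all categories present in the new data
encoded = pd.get_dummies(new_polls, columns=["pollster", "methodology"])

# Keep exactly the training columns, in training order:
# unseen columns are dropped, missing ones are added and filled with 0
aligned = encoded.reindex(columns=train_features, fill_value=0)

print(aligned)
```

A single `reindex` call covers adding missing columns, removing extra ones, and fixing the order, which is what a model fitted on a DataFrame expects at prediction time.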

In [78]:
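if 'rf_model' in locals():
    # One-hot encode the categorical columns, as done for training
    df_model = pd.get_dummies(df_dashboard, columns=['pollster', 'methodology'], drop_first=True)

    # NOTE: the scaler is refit here, so the scaled values match Notebook 03
    # only if numeric_grade spans the same range as it did at training time
    scaler = MinMaxScaler()
    df_model['numeric_grade_scaled'] = scaler.fit_transform(df_model[['numeric_grade']])

    # Align features: add missing training columns as 0, drop extras, fix the order
    model_features = list(rf_model.feature_names_in_)
    df_model = df_model.reindex(columns=model_features, fill_value=0)

    # Predictions
    df_dashboard['predicted_margin'] = rf_model.predict(df_model)
else:
    print("No model loaded; skipping predictions.")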

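Predicted margins are now available in `df_dashboard`. The dashboard reads its data from `data/clean/generic_ballot_polls_dashboard.csv`, so its predicted-margin chart is shown only if that file contains a `predicted_margin` column.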
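A minimal sketch of how a frame with predictions could be persisted so that a later `pd.read_csv` sees the `predicted_margin` column. It writes to a temporary directory and uses made-up values; in the notebook, the target would presumably be the CSV the app loads, which is an assumption about the intended workflow rather than something the notebook does.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for df_dashboard (values are invented)
frame = pd.DataFrame({
    "start_date": ["2024-01-01"],
    "margin": [2.0],
    "predicted_margin": [1.5],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "generic_ballot_polls_dashboard.csv"
    # index=False keeps the row index out of the file, so reloading
    # does not produce an extra "Unnamed: 0" column
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

has_predictions = "predicted_margin" in reloaded.columns
print(has_predictions, list(reloaded.columns))
```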

---

# Section 4 — Streamlit App Structure
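 - This will be written to `dashboard/polling_data_dashboard.py`.
 - Includes:
     - Time series of party support with the rolling margin, plus `predicted_margin` when available.
     - Average margin by pollster and by methodology.
     - Histograms of party support and a sample-size scatter plot.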

In [79]:
# This cell only writes the app to disk, so it needs pathlib alone;
# streamlit, pandas and plotly are imported inside the generated script
from pathlib import Path

# --- Set paths ---
BASE_DIR = Path.cwd().parent  # project root
DASHBOARD_DIR = BASE_DIR / "dashboard"
DASHBOARD_DIR.mkdir(exist_ok=True)  # ensure folder exists
DASHBOARD_FILE = DASHBOARD_DIR / "polling_data_dashboard.py"

# --- Streamlit code as a string ---
streamlit_code = """
import streamlit as st
import pandas as pd
import plotly.express as px

# Load dataset
df = pd.read_csv('data/clean/generic_ballot_polls_dashboard.csv')
df['start_date'] = pd.to_datetime(df['start_date'])

st.title('2024 Polling Dashboard')

# --- Sidebar Filters ---
st.sidebar.header('Filters')

# Pollster multi-select
pollster_options = df['pollster'].unique()
selected_pollsters = st.sidebar.multiselect('Select Pollster(s)', options=pollster_options, default=pollster_options)

# Methodology multi-select
methodology_options = df['methodology'].unique()
selected_methods = st.sidebar.multiselect('Select Methodology(s)', options=methodology_options, default=methodology_options)

# Date range
date_range = st.sidebar.date_input('Date Range', [df['start_date'].min(), df['start_date'].max()])

# Party selection for charts
party_options = ['dem', 'rep']
selected_parties_margin = st.sidebar.multiselect('Select Party(s) for Margin', party_options, default=party_options)
selected_parties_hist = st.sidebar.multiselect('Select Party(s) for Distribution', party_options, default=party_options)

# Toggle for showing mean and std deviation in histogram
show_stats = st.sidebar.checkbox('Show mean and std deviation on distribution', value=True)

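# --- Filter DataFrame ---
# date_input returns fewer than two dates while the user is still picking a range
if len(date_range) != 2:
    st.info('Select both a start and an end date.')
    st.stop()
start_date, end_date = (pd.to_datetime(d) for d in date_range)

df_filtered = df[
    (df['pollster'].isin(selected_pollsters)) &
    (df['methodology'].isin(selected_methods)) &
    (df['start_date'] >= start_date) &
    (df['start_date'] <= end_date)
]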

# --- Party Support and Margin Over Time ---
# The lines show raw support for the selected parties; the overlay is the rolling margin
st.subheader('Party Support and Margin Over Time')
fig_margin = px.line(df_filtered, x='start_date', y=selected_parties_margin, title='Party Support and Margin Over Time')
fig_margin.add_scatter(x=df_filtered['start_date'], y=df_filtered['margin_roll7'], mode='lines', name='Margin (7-poll rolling avg)')
st.plotly_chart(fig_margin)

# --- Predicted Margin ---
if 'predicted_margin' in df_filtered.columns:
    st.subheader('Predicted Margin Over Time')
    fig_pred = px.line(df_filtered, x='start_date', y='predicted_margin', title='Predicted Margin Over Time')
    st.plotly_chart(fig_pred)

# --- Distribution of Support ---
st.subheader('Distribution of Support')
hist_data = df_filtered[selected_parties_hist].melt(var_name='party', value_name='support')
fig_hist = px.histogram(hist_data, x='support', color='party', barmode='overlay', nbins=20)

if show_stats:
    for party in selected_parties_hist:
        mean_val = df_filtered[party].mean()
        std_val = df_filtered[party].std()
        fig_hist.add_vline(x=mean_val, line_dash='dash', line_color='blue', annotation_text=f"{party} mean")
        fig_hist.add_vline(x=mean_val + std_val, line_dash='dot', line_color='green', annotation_text=f"{party} +1 std")
        fig_hist.add_vline(x=mean_val - std_val, line_dash='dot', line_color='green', annotation_text=f"{party} -1 std")

st.plotly_chart(fig_hist)

# --- Pollster Average Margin ---
st.subheader('Pollster Average Margin')
# df_filtered already holds only the selected pollsters; indexing the result with
# .loc[selected_pollsters] would raise KeyError for any pollster left without rows
pollster_avg = df_filtered.groupby('pollster')['margin'].mean().sort_values()
st.bar_chart(pollster_avg)

# --- Methodology Average Margin ---
st.subheader('Methodology Average Margin')
# Same reasoning as above: the groupby result is already restricted to the selection
method_avg = df_filtered.groupby('methodology')['margin'].mean().sort_values()
st.bar_chart(method_avg)

# --- Sample Size vs Margin ---
st.subheader('Sample Size vs Margin')
fig_scatter = px.scatter(
    df_filtered, x='sample_size', y='margin', color='pollster', size='sample_size',
    hover_data=['methodology'], title='Sample Size vs Margin'
)
st.plotly_chart(fig_scatter)
"""

# --- Save to .py file ---
with open(DASHBOARD_FILE, "w", encoding="utf-8") as f:
    f.write(streamlit_code)

print(f"Dashboard script saved to: {DASHBOARD_FILE}")


Dashboard script saved to: e:\VS Code Projects\us_2024_polling_data_analysis_tool\dashboard\polling_data_dashboard.py
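A sidebar selection can include pollsters that have no rows left once the date and methodology filters are applied. The sketch below, on made-up data, shows why looking up such a selection with `.loc` fails and how `reindex` tolerates it. The names `polls`, `selected`, and `tolerant` are illustrative only.

```python
import pandas as pd

# Toy data, invented for illustration
polls = pd.DataFrame({
    "pollster": ["A", "A", "B"],
    "margin": [2.0, 4.0, -1.0],
})
selected = ["A", "B", "C"]  # "C" is selected but has no rows

avg = (
    polls[polls["pollster"].isin(selected)]
    .groupby("pollster")["margin"]
    .mean()
)

# Strict label lookup raises KeyError when any label is missing
try:
    avg.loc[selected]
    strict_lookup_failed = False
except KeyError:
    strict_lookup_failed = True

# reindex inserts NaN for missing labels, which can then be dropped
tolerant = avg.reindex(selected).dropna()

print(strict_lookup_failed)
print(tolerant)
```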


---

 # Section 5 — Instructions to Run Dashboard

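 In a terminal, from the project root (the app reads `data/clean/generic_ballot_polls_dashboard.csv` relative to the working directory):
 ```
 pip install streamlit plotly pandas
 streamlit run dashboard/polling_data_dashboard.py
 ```
 Open the browser window to interact with the dashboard.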
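The generated app locates its CSV through a path relative to the working directory. One possible alternative, sketched under the assumption that the script stays in a `dashboard/` folder directly below the project root, is to derive the path from the script's own location. `dashboard_csv_path` is a hypothetical helper and is not part of the generated app.

```python
from pathlib import Path


def dashboard_csv_path(script_file: str) -> Path:
    """Return the dashboard CSV path, given the path of the app script.

    Assumes the layout <project root>/dashboard/<script>.py and
    <project root>/data/clean/generic_ballot_polls_dashboard.csv.
    """
    project_root = Path(script_file).resolve().parent.parent
    return project_root / "data" / "clean" / "generic_ballot_polls_dashboard.csv"


# Example with a made-up location; inside the app the argument would be __file__
example = dashboard_csv_path("/tmp/proj/dashboard/polling_data_dashboard.py")
print(example)
```

With such a helper the app would load its data the same way regardless of the directory `streamlit run` is started from.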

---

 **Conclusions**
 - The polling data has been loaded and preprocessed for visualization.
 - Feature columns are aligned with the model trained in Notebook 03 for predictions.
 - `margin` and its rolling averages allow clear visualization of trends.
 - Random Forest predictions are included if the model exists.

 **Next Steps**
 - Explore poll trends over time using the dashboard.
 - Compare predicted vs actual margins.
 - Investigate pollster and methodology biases interactively.
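As a starting point for comparing predicted and actual margins, the sketch below computes residuals, a mean absolute error, and a mean signed error per pollster. The data is invented for illustration, and the column name `residual` is an assumption; only `margin`, `predicted_margin`, and `pollster` come from the notebook.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "pollster": ["A", "A", "B"],
    "margin": [3.0, 1.0, -2.0],
    "predicted_margin": [2.0, 2.0, -1.0],
})

# Residual: actual minus predicted (positive means the model under-predicted)
toy["residual"] = toy["margin"] - toy["predicted_margin"]

# Overall error size, ignoring direction
mae = toy["residual"].abs().mean()

# Mean signed error per pollster: a value far from 0 hints at a systematic lean
bias_by_pollster = toy.groupby("pollster")["residual"].mean()

print(mae)
print(bias_by_pollster)
```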