 # Feature Engineering and Modeling – 2024 Polling Data

 ## Objectives

 - Engineer features to capture time-related trends, pollster, and methodology effects.
 - Prepare dataset for modeling.
 - Build baseline and advanced predictive models for the Democratic-Republican margin.
 - Evaluate models and interpret feature importance.

## Inputs

 - generic_ballot_polls_clean.csv (from Notebook 2)
 - Python libraries: pandas, numpy, matplotlib, seaborn, scikit-learn

## Outputs

 - Engineered dataset ready for modeling
 - Predictive models with performance evaluation
 - Insights on pollster, methodology, and temporal effects


---



# Section 1 – Load Cleaned Dataset

 In this section, we load the cleaned dataset created in Notebook 2. We preview the dataset and inspect the structure.


In [17]:
import pandas as pd
from pathlib import Path

BASE_DIR = Path().resolve().parent
df = pd.read_csv(BASE_DIR / "data" / "clean" / "generic_ballot_polls_clean.csv")
df.head()


Unnamed: 0,start_date,end_date,pollster,sample_size,dem,rep,ind,methodology,pollster_rating_id,numeric_grade
0,2024-08-12,2024-08-14,Emerson,1000.0,47.5,45.5,,IVR/Online Panel/Text-to-Web,88,2.9
1,2024-08-11,2024-08-13,YouGov,1407.0,45.0,44.0,,Online Panel,391,2.9
2,2024-08-08,2024-08-12,Monmouth,801.0,48.0,46.0,,Live Phone/Text-to-Web,215,2.9
3,2024-08-06,2024-08-08,Cygnal,1500.0,46.4,47.1,,,67,2.1
4,2024-08-04,2024-08-06,YouGov,1413.0,45.0,44.0,,Online Panel,391,2.9


In [18]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_date          608 non-null    object 
 1   end_date            608 non-null    object 
 2   pollster            608 non-null    object 
 3   sample_size         608 non-null    float64
 4   dem                 608 non-null    float64
 5   rep                 608 non-null    float64
 6   ind                 0 non-null      float64
 7   methodology         584 non-null    object 
 8   pollster_rating_id  608 non-null    int64  
 9   numeric_grade       568 non-null    float64
dtypes: float64(5), int64(1), object(4)
memory usage: 47.6+ KB


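Dataset overview: the frame holds 608 polls in 10 columns. `dem`, `rep`, `sample_size`, and `pollster` are complete. `methodology` is missing in 24 rows, `numeric_grade` is missing in 40 rows, and `ind` is entirely empty. These gaps need handling before any model that rejects NaN can be fit.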
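
The `df.info()` output above shows that some columns are incomplete (`ind` has no non-null values, and `methodology` and `numeric_grade` have fewer than 608). A minimal sketch of how such gaps could be tabulated follows. It uses a small made-up frame rather than the real polling data, and `missing_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({
    "dem": [47.5, 45.0, 48.0, 46.4],
    "ind": [np.nan, np.nan, np.nan, np.nan],
    "methodology": ["Online Panel", None, "Live Phone", "Online Panel"],
    "numeric_grade": [2.9, 2.9, np.nan, 2.1],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column."""
    counts = frame.isna().sum()
    return pd.DataFrame({"n_missing": counts, "share": counts / len(frame)})


summary = missing_summary(toy)
print(summary)

# Columns that are entirely empty carry no information for a model
empty_cols = summary.index[summary["share"] == 1.0].tolist()
print("Entirely empty:", empty_cols)
```

Running the same helper on `df` would give the per-column counts needed to decide between dropping and imputing.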

---

 # Section 2 – Feature Engineering

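 We create the following features to improve model performance:
 1. `margin` = `dem` - `rep`
 2. Rolling averages of `dem`, `rep`, and `margin` over the 7 most recent polls (a row-based window, not 7 calendar days)
 3. One-hot encoding of the categorical variables `pollster` and `methodology`
 4. Min-max scaling of `numeric_grade` to the range [0, 1]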
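
In pandas, an integer window such as `rolling(7)` spans 7 rows, while an offset string such as `rolling("7D")` spans 7 calendar days on a datetime index. The sketch below contrasts the two on made-up numbers. The column names `margin_roll7_rows` and `margin_prev7d` are illustrative, and using `closed="left"` to exclude the current poll is an assumption about what a forecasting feature should contain: a rolling mean of `margin` that includes the current row partly encodes the target it is meant to predict.

```python
import pandas as pd

# Toy polls (made-up numbers, for illustration only)
polls = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-05", "2024-01-09", "2024-01-20"]
    ),
    "margin": [2.0, 4.0, 0.0, 6.0, 1.0],
}).sort_values("start_date").reset_index(drop=True)

# Series indexed by date, required for offset-based windows
by_date = polls.set_index("start_date")["margin"]

# Row-based window: mean of up to 7 most recent polls, current poll included
polls["margin_roll7_rows"] = by_date.rolling(7, min_periods=1).mean().to_numpy()

# Time-based window: polls from the previous 7 days, current poll excluded
polls["margin_prev7d"] = by_date.rolling("7D", closed="left").mean().to_numpy()

print(polls)
```

Polls with no other poll in the preceding 7 days get NaN in `margin_prev7d`, so this variant needs the same missing-value handling as the other features.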


In [19]:
# Calculate margin
df['margin'] = df['dem'] - df['rep']

# Convert start_date to datetime
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')

# Sort by date
df.sort_values('start_date', inplace=True)

# Rolling averages over a 7-poll (row-based) window; each value includes the current poll
df['dem_roll7'] = df['dem'].rolling(7, min_periods=1).mean()
df['rep_roll7'] = df['rep'].rolling(7, min_periods=1).mean()
df['margin_roll7'] = df['margin'].rolling(7, min_periods=1).mean()

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=['pollster', 'methodology'], drop_first=True)

# Scale numeric_grade
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_encoded['numeric_grade_scaled'] = scaler.fit_transform(df_encoded[['numeric_grade']])



---


# Section 3 – Dataset Preparation

- We define the features (`X`) and the target (`y`) for modeling. The target variable is the Democratic-Republican margin.
- The train-test split preserves temporal order: the earliest 80% of polls train the model and the most recent 20% test it, which simulates forecasting.


In [21]:
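# Features and target
target = 'margin'
# Drop the target, its components, the date columns, and the all-empty `ind` column.
# Note: `numeric_grade` and `numeric_grade_scaled` remain in X and still contain NaN.
X = df_encoded.drop(columns=['dem', 'rep', 'margin', 'start_date', 'end_date', 'ind'])
y = df_encoded[target]

# Chronological train-test split: earliest 80% for training, latest 20% for testing
train_size = int(len(df_encoded) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]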
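
A single chronological split gives one estimate of out-of-sample error. If several estimates are wanted, scikit-learn's `TimeSeriesSplit` produces expanding-window folds in which every training row precedes every test row. The sketch below runs it on an arbitrary stand-in array (`X_demo` is a hypothetical name), assuming the rows are already sorted by date.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for 10 rows already sorted by date (values are arbitrary)
X_demo = np.arange(20).reshape(10, 2)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Training indices always end before the test indices begin
    print(
        f"fold {fold}: train rows 0-{train_idx[-1]}, "
        f"test rows {test_idx[0]}-{test_idx[-1]}"
    )
```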


---


# Section 4 – Baseline Model: Linear Regression

We start with a simple linear regression to establish a baseline for predicting the margin.


In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)

# Evaluation
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

mse_lr, mae_lr, r2_lr


ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
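
The error arises because `X` still contains missing values (for example in `numeric_grade`). As the message suggests, one way forward is to put an imputer in front of the regressor. The following is a minimal sketch on synthetic data, not a rerun of the cell above: `X_demo`, `y_demo`, and the median strategy are assumptions made for illustration, and the resulting score says nothing about the real polls.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (random values, not real polls)
rng = np.random.default_rng(0)
n = 100
X_demo = pd.DataFrame({
    "sample_size": rng.integers(500, 2000, size=n).astype(float),
    "numeric_grade": rng.uniform(1.0, 3.0, size=n),
})
y_demo = pd.Series(rng.normal(0.0, 2.0, size=n), name="margin")

# Knock out 10 grades to mimic the missing values in the real column
X_demo.loc[rng.choice(n, size=10, replace=False), "numeric_grade"] = np.nan

# Chronological-style split: first 80% train, last 20% test
split = int(n * 0.8)
X_tr, X_te = X_demo.iloc[:split], X_demo.iloc[split:]
y_tr, y_te = y_demo.iloc[:split], y_demo.iloc[split:]

# The imputer learns medians from the training rows only
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
mae = mean_absolute_error(y_te, pred)
print(f"MAE on synthetic data: {mae:.3f}")
```

Applied to this notebook, the same pipeline could replace `LinearRegression()` in the cell above, with `X_train` and `X_test` passed in unchanged.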

---

# Section 5 – Advanced Model: Random Forest Regressor
 We use a Random Forest to capture non-linear relationships between the features and the Democratic-Republican margin. Its feature importances also help with interpretation.

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluation
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

mse_rf, mae_rf, r2_rf

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
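
The Random Forest fails for the same reason as the linear model: `X` still contains NaN. A sketch of one possible fix, again on synthetic data, wraps the forest in a pipeline with a median imputer and then reads the feature importances from the fitted forest. The data, the column `noise`, and the names `X_demo` and `y_demo` are all hypothetical. The target is constructed to depend on `numeric_grade` so that the importance ranking has a known answer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (random values, not real polls)
rng = np.random.default_rng(42)
n = 120
X_demo = pd.DataFrame({
    "sample_size": rng.integers(500, 2000, size=n).astype(float),
    "numeric_grade": rng.uniform(1.0, 3.0, size=n),
    "noise": rng.normal(size=n),
})

# Target built from numeric_grade plus a little noise
y_demo = 2.0 * X_demo["numeric_grade"] + rng.normal(0.0, 0.1, size=n)

# Blank out every 15th grade to mimic missing values
X_demo.loc[::15, "numeric_grade"] = np.nan

rf = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
rf.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
forest = rf.named_steps["randomforestregressor"]
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances like these sum to 1 and tend to favor high-cardinality features, so with many one-hot pollster columns they are best read as a rough ranking.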