# Behavioral Response Prediction to Flu Vaccine

This notebook builds an end-to-end analysis and prediction pipeline on Behavioral Risk Factor Surveillance System (BRFSS) survey data to model behavioral response to flu vaccination, using the `Flu_vaccine` column as the target.

### Objectives:
- Clean and explore the dataset
- Engineer relevant features
- Train predictive models
- Evaluate performance and extract actionable insights

### Step 1: Importing Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


### Step 2: Loading the Dataset

In [None]:
df = pd.read_csv("Behavioral_Risk_Factor_Surveillance_System.csv")
df.head()

### Step 3: Exploratory Data Analysis (EDA)
Let's explore the shape, types, and missing values in the dataset.

In [None]:
print("Dataset Shape:", df.shape)
df.info()
df.describe()
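
`df.info()` only reports non-null counts, so a per-column missing-value summary can make the gaps easier to read. The sketch below is one way to do that; it runs on a small made-up frame (the `age` and `region` columns are illustrative, not taken from the survey file), and `missing_summary` is a hypothetical helper that could be called on `df` in the same way.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data standing in for the survey; column names other than
# Flu_vaccine are hypothetical.
toy = pd.DataFrame({
    "age": [34, np.nan, 51, 47],
    "region": ["north", None, None, "south"],
    "Flu_vaccine": [1, 0, 1, 0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df).head(10)` would list the ten columns with the most gaps.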

### Step 4: Handling Missing Values

In [None]:
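df = df.dropna(subset=['Flu_vaccine'])  # drop unlabeled rows so the target is never imputed
df = df.fillna(df.mean(numeric_only=True))  # fill numeric gaps with column means
df = df.dropna()  # Drop remaining rows with missing values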
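
Mean imputation only applies to numeric columns, and applying it to the target would produce labels that are not valid classes. As an alternative to dropping every row with a missing text value, the sketch below fills numeric features with the mean and other features with the most frequent value, leaving the target alone. The helper `impute_by_dtype` and the toy columns `age` and `region` are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Fill numeric features with the mean and other features with the mode.

    Assumes every feature column has at least one observed value.
    """
    out = frame.dropna(subset=[target]).copy()  # rows without a label are dropped
    for col in out.columns.drop(target):
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy data; only the Flu_vaccine name comes from the notebook.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 40.0],
    "region": ["north", None, "south", "north"],
    "Flu_vaccine": [1.0, 0.0, np.nan, 1.0],
})
clean = impute_by_dtype(toy, target="Flu_vaccine")
print(clean)
```

Whether mode imputation is appropriate depends on how much data is missing; for heavily missing columns, dropping the column may be the safer choice.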


### Step 5: Selecting Features and Target Variable

In [None]:
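X = df.drop(columns=['Flu_vaccine'])
X = pd.get_dummies(X, drop_first=True)  # one-hot encode any text columns; numeric columns are unchanged
y = df['Flu_vaccine']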
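
Scikit-learn's random forest expects numeric inputs, so any text-valued survey columns need to be encoded before training. A minimal sketch of one-hot encoding with `pd.get_dummies` on a toy frame follows; the `age` and `region` columns are hypothetical stand-ins for survey fields.

```python
import pandas as pd

# Toy data; only the Flu_vaccine name comes from the notebook.
toy = pd.DataFrame({
    "age": [34, 51, 47],
    "region": ["north", "south", "north"],
    "Flu_vaccine": [1, 0, 1],
})

X_toy = toy.drop(columns=["Flu_vaccine"])
# Numeric columns pass through unchanged; each text column becomes
# indicator columns, with the first category dropped as the reference level.
X_encoded = pd.get_dummies(X_toy, drop_first=True)
y_toy = toy["Flu_vaccine"]

print(X_encoded.columns.tolist())
```

Here `region` collapses to a single `region_south` indicator, because `north` is the dropped reference category.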


### Step 6: Splitting Data into Train and Test Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
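
If the `Flu_vaccine` classes are imbalanced (an assumption; the class distribution is not shown in this notebook), a plain random split can leave the train and test sets with different class shares. The sketch below shows how the `stratify` argument of `train_test_split` keeps the shares aligned, using synthetic labels rather than the survey data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))        # synthetic features
y_demo = np.array([1] * 40 + [0] * 160)   # synthetic labels, 20% positive

# stratify=y_demo makes each split preserve the overall class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("positive share, train:", y_tr.mean())
print("positive share, test: ", y_te.mean())
```

In the notebook itself this would amount to adding `stratify=y` to the existing call.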

### Step 7: Training the Random Forest Classifier

In [None]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
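
A single train/test split gives one performance estimate, which can vary with the random seed. As a complementary check, the sketch below runs 5-fold cross-validation with `cross_val_score`. It uses synthetic data from `make_classification` and a smaller forest (`n_estimators=50`) purely to keep the example quick; neither choice comes from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the survey features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
# Each of the 5 folds is held out once while the model trains on the rest.
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```

On the real data the same call would take `X_train` and `y_train`, leaving the test set untouched for the final evaluation.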

### Step 8: Making Predictions and Evaluating the Model

In [None]:
y_pred = model.predict(X_test)

print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
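
Accuracy is easier to interpret next to a trivial baseline: on imbalanced labels, always predicting the majority class can already score well. The sketch below compares a `DummyClassifier` against a random forest on synthetic, imbalanced data; the data and the 80/20 class weights are assumptions for illustration, not properties of the survey.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the survey data (about 80% class 0).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print("baseline accuracy:", round(baseline_acc, 3))
print("forest accuracy:  ", round(forest_acc, 3))
```

The gap between the two numbers, together with the per-class recall in the classification report, says more about the model than accuracy alone.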

### Step 9: Feature Importance

In [None]:
importances = model.feature_importances_
features = pd.Series(importances, index=X.columns).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
top_features = features.head(10)  # ten highest-scoring features
sns.barplot(x=top_features.values, y=top_features.index)
plt.title('Top 10 Important Features')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()
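
The `feature_importances_` attribute is impurity-based and is computed on the training data, which can favor features with many distinct values. Permutation importance on the held-out set is a common cross-check. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the `feature_i` names are placeholders, not survey columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features and target.
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
perm = pd.Series(result.importances_mean, index=X_demo.columns)
perm = perm.sort_values(ascending=False)
print(perm)
```

Features that rank highly under both measures are the more trustworthy candidates for interpretation.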

### Conclusion
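- The Random Forest classifier achieved strong predictive performance on the held-out test set (30% of the data), as measured by the accuracy score and classification report in Step 8.
- Features such as age, vaccine attitudes, and health behavior were among the most influential according to the feature importances in Step 9.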
- This project demonstrates how data-driven insights can help inform public health strategies.

---

*Prepared by: Your Name | Month Year*