**Purpose**: To visually explore participant demographics, score distributions, and relationships using interactive Plotly charts.

## Section 1 : Importing libraries and Configuration Setup

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import logging

# Configuring logging
logging.basicConfig(filename='../logs/eda.log', level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')
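
`logging.basicConfig(filename=...)` opens the log file right away and does not create missing parent directories, so the call above raises `FileNotFoundError` on a fresh checkout where `../logs/` does not exist yet. A minimal sketch of a guard, assuming the same relative layout; the helper name `ensure_parent_dir` is hypothetical and not part of the original notebook:

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if missing (hypothetical helper)."""
    path = Path(file_path)
    # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

It could be called before configuring logging, for example `logging.basicConfig(filename=str(ensure_parent_dir('../logs/eda.log')), level=logging.INFO)`.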

In [2]:
# Loading cleaned data
df = pd.read_csv('../data/processed_data/cleaned_hpv_data.csv')
print('Cleaned data loaded successfully.')
logging.info('Cleaned data loaded successfully.')

Cleaned data loaded successfully.


## Section 2 : Visualizing Demographic Distributions

### 1. Education Level

In [3]:
fig_edu = px.histogram(df, y='Education_Label', title='Education Level Distribution',
                       category_orders={'Education_Label': ['High school', 'Under graduation', 'Post-graduation']},
                       color_discrete_sequence=['#636EFA'])
fig_edu.update_layout(yaxis_title='Education Level', xaxis_title='Count')
fig_edu.show()

#### **Conclusion :**
  * The majority of participants are **undergraduates**.
  * Very few participants are at the high school or post-graduate level.
  * Hence : The intervention mainly reached college-age students.
---
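
Statements such as "majority" are easier to check with exact counts and shares next to the chart. A small sketch using `value_counts`; the helper name `category_shares` is hypothetical, and the rows below are toy values for illustration, not the study data. The same call would apply to `Place_of_Residency_Label`, `Age_Label`, and `Gender_Label`.

```python
import pandas as pd


def category_shares(frame, column):
    """Return counts and percentage shares for one categorical column."""
    counts = frame[column].value_counts()
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": shares})


# Toy rows for illustration only; these are NOT the study data.
toy = pd.DataFrame({"Education_Label": ["Under graduation"] * 3 + ["High school"]})
summary = category_shares(toy, "Education_Label")
print(summary)
```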

### 2. Place of Residency

In [4]:
fig_res = px.histogram(df, y='Place_of_Residency_Label', title='Place of Residency Distribution',
                       category_orders={'Place_of_Residency_Label': ['Rural', 'Semi-Urban', 'Urban']},
                       color_discrete_sequence=['#EF553B'])
fig_res.update_layout(yaxis_title='Place of Residency', xaxis_title='Count')
fig_res.show()

#### **Conclusion :**
  * The largest group is from **urban** areas.
  * Fewer participants come from **semi-urban** and **rural** areas.
  * Hence : Urban populations are overrepresented, which may bias results toward better baseline awareness.
---

### 3. Age Group

In [5]:
fig_age = px.histogram(df, y='Age_Label', title='Age Group Distribution',
                       category_orders={'Age_Label': ['15-19 Years', '19-24 Years', '24 Years and above']},
                       color_discrete_sequence=['#00CC96'])
fig_age.update_layout(yaxis_title='Age Group', xaxis_title='Count')
fig_age.show()

#### **Conclusion :**
  * Most participants fall in the **15–19 years** range.
  * The **19–24 years** group has only a small representation.
  * Hence : The intervention appears to have reached mostly adolescents and late teenagers, possibly school or college cohorts.
---

### 4. Gender Distribution

In [6]:
gender_counts = df['Gender_Label'].value_counts()
fig_gender = px.pie(values=gender_counts.values, names=gender_counts.index, title='Gender Distribution',
                    color_discrete_sequence=['#636EFA', '#EF553B'])
fig_gender.show()

logging.info('Demographic visualizations generated with Plotly.')

#### **Conclusion :**
  * **Male participants (63.8%)** outnumber female participants (36.2%) by a factor of about 1.8.
  * Hence : Gender imbalance may influence generalizability of results.
---

## Section 3 : Analyzing Pre-test vs. Post-test Scores

In [7]:
print('\n--- Visualizing Knowledge Score Improvement ---')

fig_scores = go.Figure()
fig_scores.add_trace(go.Histogram(x=df['pre_test_score'], name='Pre-Test Score', opacity=0.5, histnorm='density',
                                  marker_color='#636EFA'))
fig_scores.add_trace(go.Histogram(x=df['post_test_score'], name='Post-Test Score', opacity=0.5, histnorm='density',
                                  marker_color='#EF553B'))
fig_scores.add_vline(x=df['pre_test_score'].mean(), line_dash='dash', line_color='#636EFA',
                     annotation_text=f'Pre-Test Mean: {df["pre_test_score"].mean():.1f}')
fig_scores.add_vline(x=df['post_test_score'].mean(), line_dash='dash', line_color='#EF553B',
                     annotation_text=f'Post-Test Mean: {df["post_test_score"].mean():.1f}')
fig_scores.update_layout(title='Distribution of Knowledge Scores Before and After Intervention',
                         xaxis_title='Knowledge Score', yaxis_title='Density', barmode='overlay')
fig_scores.show()

logging.info('Score distribution visualization generated with Plotly.')


--- Visualizing Knowledge Score Improvement ---


#### **Conclusion :**

* **Pre-Test Distribution**: Centered around \~15.
* **Post-Test Distribution**: Shifted clearly to the right, mean \~22.
* **Overlap exists** but the density clearly shows higher scores after intervention.
* **Hence**:
  * The rightward shift indicates that the intervention improved knowledge levels.
  * Average gain ~**+7 points** (from \~15 to \~22), which is meaningful.
  * The distribution broadens post-test, suggesting variable improvements across participants.
---
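
Because each participant has both a pre-test and a post-test score, the gain can also be summarised per participant rather than read off two overlaid histograms. The sketch below is one way to do that under assumptions: `paired_gain_summary` is a hypothetical helper, the effect size shown is the mean gain divided by the standard deviation of the gains, and the toy scores are illustrative values, not the study data.

```python
import pandas as pd


def paired_gain_summary(pre, post):
    """Summarise per-participant gains (post - pre) for paired scores."""
    gain = post - pre
    sd = gain.std(ddof=1)  # sample standard deviation of the gains
    return {
        "mean_gain": gain.mean(),
        "median_gain": gain.median(),
        "sd_gain": sd,
        # Standardised mean difference for paired data: mean gain / SD of gains.
        "effect_size": gain.mean() / sd,
        # Fraction of participants whose score went up.
        "share_improved": (gain > 0).mean(),
    }


# Toy scores for illustration only; these are NOT the study data.
toy = pd.DataFrame({"pre_test_score": [10, 12, 15, 18],
                    "post_test_score": [18, 17, 22, 20]})
result = paired_gain_summary(toy["pre_test_score"], toy["post_test_score"])
print(result)
```

With the notebook's data this would be called as `paired_gain_summary(df['pre_test_score'], df['post_test_score'])`.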

## Section 4 : Exploring Score Improvement by Demographics

### 1. Score Improvement by Education Level

In [8]:
fig_edu_box = px.box(df, x='score_improvement', y='Education_Label', title='Score Improvement by Education Level',
                     labels={'score_improvement': 'Score Improvement (Post - Pre)', 'Education_Label': 'Education Level'},
                     category_orders={'Education_Label': ['High school', 'Under graduation', 'Post-graduation']},
                     color_discrete_sequence=['#636EFA'])
fig_edu_box.show()

In [9]:
"""
We found outliers in the pre-test scores in 01_Data_Loading_and_Cleaning.ipynb (step 5 of section 3):
Pre-test score outliers: [0, 1, 2, 1, 0, 31, 31, 29]
Post-test score outliers: []

The median is less affected by outliers and skewed data than the mean, making it a better measure of central tendency
for non-normally distributed data, so we use the median to summarize score improvement by education level.
"""
edu_improvement = df.groupby("Education_Label")["score_improvement"].median().sort_values(ascending=False)
print("Median Score Improvement by Education Level:")
print(edu_improvement, "\n")

Median Score Improvement by Education Level:
Education_Label
High school         14.0
Under graduation     7.0
Post-graduation      1.0
Name: score_improvement, dtype: float64 



#### **Conclusion :**
* **Undergraduates**:
  * Show a **wide range of improvements** (some large gains, some moderate).
  * Median improvement is positive (7 points).

* **High School Students**:
  * Highest median improvement (14 points), with a larger spread and more variability.
  * Some scored much higher post-test, but a few showed **negative improvement** (possibly due to misunderstanding or disengagement).
  * Very few participants are in this group, so its median is less reliable.

* **Post-Graduates**:
  * Minimal improvement (median of 1 point, scores clustered close to 0).
  * Likely already had higher baseline knowledge, so a ceiling effect applies.

**Hence**:
* Intervention is **most impactful for high school students and undergraduates**.

**Action**:
  * For **school/college groups**, keep content simple and visual — it works well.
  * For **post-graduates**, redesign content to be more advanced or case-based (they likely already know the basics, hence little improvement).
---
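
A group median is easier to interpret when the group size and spread are shown next to it, especially for small groups. The following sketch adds the count and interquartile range to the per-group median; `improvement_by_group` is a hypothetical helper and the rows are toy values, not the study data.

```python
import pandas as pd


def improvement_by_group(frame, group_col, value_col="score_improvement"):
    """Per-group size, median, and interquartile range of value_col."""
    grouped = frame.groupby(group_col)[value_col]
    summary = grouped.agg(n="count", median="median")
    summary["q1"] = grouped.quantile(0.25)
    summary["q3"] = grouped.quantile(0.75)
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary.sort_values("median", ascending=False)


# Toy rows for illustration only; these are NOT the study data.
toy = pd.DataFrame({
    "Education_Label": ["High school", "High school", "Under graduation",
                        "Under graduation", "Under graduation", "Post-graduation"],
    "score_improvement": [4, 8, 2, 3, 7, 1],
})
table = improvement_by_group(toy, "Education_Label")
print(table)
```

The same helper could be reused for `Gender_Label` and `Place_of_Residency_Label` by changing the `group_col` argument.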

### 2. Score Improvement by Gender

In [10]:
fig_gender_box = px.box(df, x='score_improvement', y='Gender_Label', title='Score Improvement by Gender',
                        labels={'score_improvement': 'Score Improvement (Post - Pre)', 'Gender_Label': 'Gender'},
                        color_discrete_sequence=['#EF553B'])
fig_gender_box.show()

#### **Conclusion :**

* **Females**:
  * Show a **lower median improvement** compared to males, indicating less overall progress.
  * The range of scores suggests some individuals made gains, but variability is limited.

* **Males**:
  * Exhibit a **wider range of improvements**, with some achieving significant gains.
  * Higher median improvement indicates a more effective learning experience for this group.

**Hence**:
* The intervention is **more impactful for males** than females.

**Action**:
* For **female participants**, consider tailored support and resources to enhance engagement and performance.
---

### 3. Score Improvement by Place of Residency

In [11]:
fig_res_box = px.box(df, x='score_improvement', y='Place_of_Residency_Label', title='Score Improvement by Place of Residency',
                     labels={'score_improvement': 'Score Improvement (Post - Pre)', 'Place_of_Residency_Label': 'Place of Residency'},
                     category_orders={'Place_of_Residency_Label': ['Rural', 'Semi-Urban', 'Urban']},
                     color_discrete_sequence=['#00CC96'])
fig_res_box.show()

logging.info('Score improvement visualizations by education, gender, and residency generated with Plotly.')

#### **Conclusion :**

* **Rural Residents**:
  * Demonstrate the **lowest median improvement**, with some showing negative changes, indicating potential challenges in learning environments.
  * The variability suggests that some individuals may struggle significantly.

* **Semi-Urban Residents**:
  * Show moderate improvements, reflecting better outcomes than rural but not as high as urban.

* **Urban Residents**:
  * Achieve the **highest median improvement**, indicating a more effective learning experience and consistent performance.

**Hence**:
* The intervention is **most effective for urban residents**, while rural residents may require additional support.

**Action**:
* For **rural groups**, enhance resources and support systems to improve learning outcomes.
* For **semi-urban participants**, consider strategies that bridge the gap between rural and urban performance.
---
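
Section 5 reports that improvement is negatively correlated with the pre-test score, so differences between groups may partly reflect different starting points rather than the grouping itself. One way to check this, sketched under assumptions, is to put the median pre-test score next to the median improvement for each group. `baseline_vs_gain` is a hypothetical helper and the rows are toy values, not the study data.

```python
import pandas as pd


def baseline_vs_gain(frame, group_col):
    """Median pre-test score and median improvement for each group."""
    medians = frame.groupby(group_col)[["pre_test_score", "score_improvement"]].median()
    return medians.rename(columns={
        "pre_test_score": "median_pre_test",
        "score_improvement": "median_improvement",
    })


# Toy rows for illustration only; these are NOT the study data.
toy = pd.DataFrame({
    "Place_of_Residency_Label": ["Rural", "Rural", "Urban", "Urban"],
    "pre_test_score": [18, 20, 10, 12],
    "score_improvement": [2, 4, 9, 11],
})
comparison = baseline_vs_gain(toy, "Place_of_Residency_Label")
print(comparison)
```

If a group with a low median improvement also shows a high median pre-test score, the baseline is a plausible explanation worth ruling out before drawing conclusions about the group.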

## Section 5 : Correlation Matrix Heatmap

In [12]:
# Computing correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Creating Plotly heatmap
fig_corr = px.imshow(
    corr_matrix,
    text_auto=True,              # show values in cells
    color_continuous_scale="RdBu_r",
    aspect="auto",
    title="Correlation Heatmap"
)

fig_corr.update_layout(
    width=800,
    height=600,
    xaxis_title="Variables",
    yaxis_title="Variables"
)

fig_corr.show()

#### **Conclusion :**
* **Demographics (Age, Gender, Education, Residency, Vaccination, Income, etc.)**:
  * Minimal correlation with either pre or post-test scores.
  * Suggests that background factors did not strongly influence knowledge gain.

* **Pre-test vs. Post-test**:
  * Strong positive correlation, showing that both tests measure the same construct consistently.

* **Score Improvement**:
  * Negatively correlated with pre-test scores: participants with lower baseline knowledge had greater gains.

**Hence**:
* The intervention benefits **low-baseline knowledge participants** the most, regardless of demographics.

**Action**:
* Tailor educational reinforcement for higher scorers to prevent plateauing.
* Maintain focus on bringing low-knowledge participants up to speed, since they gain the most.


---
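
One caveat when reading the negative correlation between improvement and pre-test score: since improvement is computed as post minus pre, some negative correlation is expected by construction. For independent pre and post scores with equal variance, the expected correlation is -1/√2, roughly -0.71. The simulation below illustrates this with synthetic, unrelated scores; the means are taken from the approximate values reported above, while the standard deviation of 4 and the sample size are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic scores drawn independently: pre and post are unrelated by design.
pre = rng.normal(loc=15, scale=4, size=n)
post = rng.normal(loc=22, scale=4, size=n)
improvement = post - pre

# Correlation between baseline and gain, despite no real relationship.
r = np.corrcoef(pre, improvement)[0, 1]
print(f"corr(pre, improvement) = {r:.2f}")
```

Comparing the observed coefficient in the heatmap against this reference value gives a sense of how much of it goes beyond the built-in effect.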

### **Key Insights**
- **Education**: The majority are undergraduates, indicating a focus on college-age individuals.
- **Residency**: Urban participants are overrepresented, which may bias awareness levels and outcomes.
- **Age**: Most participants are between 15-19 years old, targeting adolescents.
- **Gender**: 63.8% of participants are male, highlighting a gender imbalance.
- **Scores**: Post-test scores (~22) are clearly higher than pre-test scores (~15), with notable variability in improvements.
- **Demographic Impact**: High school students (median +14) and undergraduates (median +7) exhibit the largest score improvements.
- **Correlation matrix**: Improvement is mainly explained by baseline (pre-test) scores, not demographics; participants with lower starting knowledge improved the most.


---

## Section 6 : Saving Data

### Charts and Figures

In [13]:
# Saving the education distribution chart
fig_edu.write_image("../reports/figures/education_distribution.png")  # Saving as PNG

# Saving the place of residency distribution chart
fig_res.write_image("../reports/figures/residency_distribution.png")

# Saving the age group distribution chart
fig_age.write_image("../reports/figures/age_group_distribution.png") 

# Saving the gender distribution pie chart
fig_gender.write_image("../reports/figures/gender_distribution.png")  

# Logging
logging.info('Demographic visualizations saved to ../reports/figures.')

In [14]:
# Saving the knowledge score distribution chart
fig_scores.write_image("../reports/figures/knowledge_score_distribution.png", width=800, height=500)

# Logging
logging.info('Score distribution visualization saved to ../reports/figures.')

In [15]:
# Saving box plots for score improvement
fig_edu_box.write_image("../reports/figures/score_improvement_education.png")

fig_gender_box.write_image("../reports/figures/score_improvement_gender.png") 

fig_res_box.write_image("../reports/figures/score_improvement_residency.png")

# Logging
logging.info('Score improvement visualizations by education, gender, and residency saved to ../reports/figures.')


In [16]:
# Saving the correlation heatmap (scale=2 doubles the output resolution)
fig_corr.write_image("../reports/figures/correlation_heatmap.png", scale=2)

# Logging
logging.info('Correlation heatmap saved to ../reports/figures.')
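
`write_image` depends on a static-export engine being installed (the `kaleido` package in current Plotly releases) and on the target folder already existing. The sketch below wraps the call so that the folder is created and, if static export fails, an interactive HTML file is written instead. `save_figure` is a hypothetical helper; the exact exception raised on a missing engine varies between Plotly versions, so the sketch deliberately catches `Exception` broadly.

```python
from pathlib import Path


def save_figure(fig, image_path, **image_kwargs):
    """Save fig as a static image, falling back to HTML if export fails."""
    image_path = Path(image_path)
    image_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        fig.write_image(str(image_path), **image_kwargs)
        return image_path
    except Exception:
        # Assumption: any failure here is treated as "static export unavailable".
        html_path = image_path.with_suffix(".html")
        fig.write_html(str(html_path))
        return html_path
```

For example, `save_figure(fig_corr, '../reports/figures/correlation_heatmap.png', scale=2)` returns the path that was actually written.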

### Pickle export

In [17]:
import pickle

# Bundle all figures into a dict
figures = {
    "education_distribution": fig_edu,
    "residency_distribution": fig_res,
    "age_group_distribution": fig_age,
    "gender_distribution": fig_gender,
    "knowledge_score_distribution": fig_scores,
    "score_improvement_education": fig_edu_box,
    "score_improvement_gender": fig_gender_box,
    "score_improvement_residency": fig_res_box,
    "correlation_heatmap": fig_corr,
}

# Saving all figures into a single pickle file
with open("../models/eda_figures.pkl", "wb") as f:
    pickle.dump(figures, f)

logging.info("All visualizations saved into eda_figures.pkl")
print("Saved: eda_figures.pkl")

Saved: eda_figures.pkl
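
To reuse the saved figures elsewhere, the pickle can be read back into a dictionary keyed by the names used above. A minimal sketch follows; `load_figures` is a hypothetical helper. Two points to keep in mind: unpickling can execute arbitrary code, so only load files from a trusted source, and restoring Plotly figures requires `plotly` to be installed in the loading environment.

```python
import pickle
from pathlib import Path


def load_figures(path):
    """Load a pickled dict of figures and check its basic shape."""
    with Path(path).open("rb") as handle:
        figures = pickle.load(handle)
    if not isinstance(figures, dict):
        raise TypeError(f"Expected a dict of figures, got {type(figures).__name__}")
    return figures
```

A typical call would be `figures = load_figures('../models/eda_figures.pkl')`, followed by `figures['correlation_heatmap'].show()`.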
