# CVD Death Rate Dataset — PySpark Exploratory Data Analysis (EDA)

This notebook performs EDA using **PySpark**, consistent with the main project pipeline.

⚠️ This notebook is *exploration only* — forecasting is done in:
`src/advanced_pipeline.py`

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, isnan, when

spark = SparkSession.builder.appName('CVD_EDA').getOrCreate()

## 1. Load Dataset (PySpark)

In [None]:
df = spark.read.csv('../dataset/CVD.csv', header=True, inferSchema=True)
df.show(5)

## 2. Schema & Summary

In [None]:
df.printSchema()

In [None]:
df.describe().show()

## 3. Basic Cleaning (PySpark)

In [None]:
data = df.withColumn('Year', col('Year').cast('int')) \
           .withColumn('Data_Value', col('Data_Value').cast('double'))

data = data.filter((col('Year') >= 2010) & (col('Year') <= 2020))
data = data.filter(col('Data_Value').isNotNull())

data.show(5)

## 4. Missing Values Count (PySpark)

In [None]:
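# isnan() is only meaningful for float/double columns and can fail on some other types, so those are checked for null only.
float_cols = {f.name for f in data.schema.fields if f.dataType.typeName() in ('float', 'double')}
missing = data.select([count(when((col(c).isNull() | isnan(col(c))) if c in float_cols else col(c).isNull(), c)).alias(c) for c in data.columns])
missing.show()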
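
The missing-value check combines `isNull()` and `isnan()` because Spark treats a missing value (null) and a floating-point NaN as two different things. As a plain-Python illustration of that distinction (not Spark code; the `values` sample and the `count_missing` helper are hypothetical), a minimal sketch might look like this:

```python
import math

# Hypothetical sample for one numeric column; None plays the role of a Spark null.
values = [12.5, None, float("nan"), 30.0, None]

def count_missing(column_values):
    """Count nulls and NaNs separately, mirroring the isNull() | isnan() check."""
    nulls = sum(1 for v in column_values if v is None)
    nans = sum(1 for v in column_values if isinstance(v, float) and math.isnan(v))
    return {"null": nulls, "nan": nans, "total": nulls + nans}

result = count_missing(values)
print(result)
```

Because step 3 already drops rows with a null `Data_Value`, that column should report no nulls here. NaN values, if any exist, are not removed by `isNotNull()` and would still be counted.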

## 5. National Trend (2010–2020) — PySpark Aggregation

In [None]:
yearly = data.groupBy('Year').agg(avg('Data_Value').alias('mean_rate')).orderBy('Year')
yearly.show()
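
`avg('Data_Value')` gives every row equal weight, so each yearly figure is an unweighted mean over whatever locations and stratification groups remain after cleaning. The plain-Python sketch below mimics that aggregation on a few made-up records to show what is being computed. The values are illustrative only and are not taken from the dataset, and `mean_by_year` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up (Year, Data_Value) records for illustration; not real CVD figures.
records = [(2010, 200.0), (2010, 300.0), (2011, 250.0), (2011, 150.0), (2011, 200.0)]

def mean_by_year(rows):
    """Group values by year and return the unweighted mean per year, sorted by year."""
    groups = defaultdict(list)
    for year, value in rows:
        groups[year].append(value)
    return {year: sum(vals) / len(vals) for year, vals in sorted(groups.items())}

yearly_means = mean_by_year(records)
print(yearly_means)
```

If rows represent populations of very different sizes, a population-weighted mean may be more appropriate. Whether that applies here depends on how the dataset defines each row.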

### Convert to pandas for plotting only (safe here because the aggregate is small)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

yearly_pd = yearly.toPandas()

plt.figure(figsize=(10,5))
plt.plot(yearly_pd['Year'], yearly_pd['mean_rate'], marker='o')
plt.title('National CVD Death Rate Trend (2010–2020)')
plt.xlabel('Year')
plt.ylabel('Mean Death Rate')
plt.grid(True)
plt.show()

## 6. State-Level Average (Top 10)

In [None]:
state_avg = data.groupBy('LocationAbbr').agg(avg('Data_Value').alias('avg_rate')).orderBy(col('avg_rate').desc())
state_avg.show(10)

state_pd = state_avg.limit(10).toPandas()

plt.figure(figsize=(10,5))
sns.barplot(x=state_pd['LocationAbbr'], y=state_pd['avg_rate'])
plt.title('Top 10 States by CVD Death Rate')
plt.show()

## 7. Stratification Analysis (Age Groups, Gender, etc.)

In [None]:
strat = data.groupBy('Stratification1').agg(avg('Data_Value').alias('avg_rate')).orderBy(col('avg_rate').desc())
strat.show()

strat_pd = strat.toPandas()

plt.figure(figsize=(12,5))
sns.barplot(x=strat_pd['Stratification1'], y=strat_pd['avg_rate'])
plt.xticks(rotation=45)
plt.title('CVD Rate by Stratification Group')
plt.show()

## Summary of Insights

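- The national mean CVD death rate (unweighted average of `Data_Value` per year) is mostly stable across **2010–2020**.
- Some states have significantly higher average death rates than others (see the top-10 ranking in section 6).
- Stratification groups show clear differences in average rate (age/gender categories vary).
- After casting types, restricting to 2010–2020, and dropping rows with a null `Data_Value`, the dataset is clean enough for forecasting.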
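
The "mostly stable" and "significantly higher" statements above are qualitative. One way to put numbers on the first, sketched here with placeholder values rather than results from this notebook, is to compute the percent change between the first and last year and the coefficient of variation of the yearly means. The function name `trend_stability` and the sample list are hypothetical. In practice the input could come from `yearly_pd['mean_rate'].tolist()`.

```python
import statistics

def trend_stability(mean_rates):
    """Return (percent change from first to last value, coefficient of variation in %)."""
    first, last = mean_rates[0], mean_rates[-1]
    pct_change = (last - first) / first * 100
    cv = statistics.pstdev(mean_rates) / statistics.mean(mean_rates) * 100
    return pct_change, cv

# Placeholder yearly means (NOT real results), ordered by year.
sample_rates = [100.0, 110.0, 90.0, 100.0]
pct_change, cv = trend_stability(sample_rates)
print(f"change: {pct_change:.1f}%, CV: {cv:.1f}%")
```

What counts as "stable" is a judgment call; any threshold on these two numbers would be an assumption to state alongside the result.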

⚠️ Forecasting is handled in:
`src/advanced_pipeline.py`