# HW1 – Clustering & Classification

**Name:** Smridhi Patwari  
**AndrewID:** smrihdip  
**Date:** 09/12/2025

**Objective:** Explore 2025 County Health Rankings data, find cluster patterns among counties, and build two models to predict **Premature Death** using **Community Conditions** features only. Document EDA, methods, results, and recommendations for **Allegheny County, PA**.


# 1. Imports

In [None]:
import numpy as np, pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import f_classif


# 2. Loading data 

In [None]:
data_path = "analytic_data2025_v2.csv"
df = pd.read_csv(data_path, low_memory=False)
print(df.shape)
df.head(3)


# 2. Identifying raw data 

In [None]:
# Keep the first five identifier columns plus every column whose variable code ends with "_rawvalue"
statecode_row = df.iloc[0]  # row 0 holds the variable codes (e.g. "xxx_rawvalue") for each column
cols_to_drop = [col for i, col in enumerate(df.columns)
                if i >= 5 and not str(statecode_row[col]).endswith("_rawvalue")]
df_raw = df.drop(columns=cols_to_drop)
print(df_raw.shape)
#to export data into CSV 
#df_raw.to_csv("raw_data_cleaned.csv")
df_raw.head()



# 3. Identifying relevant factors

In [None]:
com_cond_factors = [
    "Teen Births",
    "Preventable Hospital Stays",
    "Mammography Screening",
    "Flu Vaccinations",
    "Children in Poverty ",
    "Injury Deaths",
    "Driving Alone to Work",
    "Median Household Income",
    "Suicides",
    "Homicides",
    "Firearm Fatalities",
    "Drug Overdose Deaths",
    "Motor Vehicle Crash Deaths",
    "Reading Scores",
    "Math Scores"
]


factors_cols_to_drop = [col for i, col in enumerate(df_raw.columns)
                            if i >= 5 and not any(factor in col for factor in com_cond_factors)] #adapted from docs.python.org, python list comprehensions

df_relevant_factors = df_raw.drop(columns=factors_cols_to_drop, index=0) #drops the non-relevant columns and the variable-code row (index 0)
df_relevant_factors.to_csv("relevant_factors_data_cleaned.csv")
print(df_relevant_factors.shape)

df_relevant_factors.head()


# 4. Exploratory Data Analysis


Some columns are stored as object dtype and contain missing values, so clustering cannot be performed yet.
Every feature column must first be converted to a numeric type. Values that cannot be parsed as numbers are replaced with NaN so that they can be handled later by an appropriate imputation strategy.
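
As a small illustration of the conversion described above, the sketch below uses a made-up column (the values and the name `raw` are hypothetical, not taken from the dataset) to show how `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN.

```python
import pandas as pd

# Hypothetical object-dtype column mixing numeric strings and non-numeric entries
raw = pd.Series(["12.5", "7", "n/a", None, "3.0"], dtype="object")

# errors="coerce" replaces anything that cannot be parsed as a number with NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # number of values left for imputation
```

The NaN entries stay in place, so the row count is unchanged and the gaps can be filled in a later step.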


In [None]:
X_base = df_relevant_factors.copy()
#print(X_base.dtypes)

#Assigning missing data to NaN for it to be addressed later
for col in X_base.columns[5:]:
    X_base[col] = pd.to_numeric(X_base[col], errors="coerce") # Google Search AI Overview - check for unique non-numeric values in a df column
#print(X_base.dtypes) # Gives all float values 


# The following code has been adapted from a chat with Google Gemini titled "Gemini-Determining approach for visualizing data", which will be attached with the submission
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(15, 12))
# Flatten axes array for easy iteration
axes = axes.flatten()

# Loop through columns and plot
for i, col in enumerate(X_base.columns[5:]):
    # Plot histogram on the i-th axis
    sns.histplot(X_base[col], kde=True, ax=axes[i], color='skyblue')
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel('')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

data_columns = X_base.iloc[:, 5:]  

#Creating box plots, adapted from a session with ChatGPT 5 - "make me a python function that does boxplots for a bunch of columns in my dataset with seaborn"
def compute_and_plot_boxplot(data_columns):
    fig,axes = plt.subplots(5, 3, figsize=(15, 12))
    axes = axes.flatten()
    for i, column in enumerate(data_columns.columns):
        if i < len(axes):
            sns.boxplot(y=data_columns[column], ax=axes[i])
            axes[i].set_title(f'{column}')
            axes[i].set_ylabel('Values')

            s = pd.to_numeric(data_columns[column], errors="coerce")

            # IQR values computation
            q1 = s.quantile(0.25)
            q3 = s.quantile(0.75)
            iqr = q3 - q1

            # Annotate IQR inside the subplot
            axes[i].text(
                0.1, 0.95, f"IQR={iqr:.2f}", 
                ha='left', va='top', transform=axes[i].transAxes,
                fontsize=9, color='red', weight='bold'
            )

    plt.tight_layout()
    plt.show()
compute_and_plot_boxplot(data_columns)

X_base.head()



**Data Imputation Strategy:** 
The histograms and boxplots show that many variables have right-skewed distributions (Suicides, Drug Overdose Deaths, Homicides) and substantial outliers (Preventable Hospital Stays, Median Household Income). Median imputation was selected for this dataset because the median is far less sensitive to skew and outliers than the mean, so the filled-in values stay close to a typical county instead of being pulled toward the extremes.
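
To make this reasoning concrete, here is a minimal sketch with invented values (not county data) showing how a single outlier moves the mean but leaves the median with the bulk of the data.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: four typical values and one extreme outlier
observed = [1, 2, 3, 4, 100]

mean_fill = mean(observed)      # pulled toward the outlier
median_fill = median(observed)  # stays with the bulk of the data

print(mean_fill, median_fill)
```

A missing value filled with `mean_fill` would sit far above four of the five observed points, while `median_fill` lands in the middle of them.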

In [None]:
#Imputation Strategy - Median 
for col in X_base.columns[5:]:
    X_base[col] = X_base[col].fillna(X_base[col].median()) # Google Search AI Overview - How to fill NaN values in a df with median


#Boxplot computation after Imputation
data_columns_imp = X_base.iloc[:, 5:]  
compute_and_plot_boxplot(data_columns_imp)
X_base.head()

X_base.to_csv("X_based_cleaned.csv")


# 5. Clustering 

In [None]:
features_df = X_base.copy()
features = features_df.drop(columns = features_df.columns[:5])

#From Recitations - Split between training and test data
features_train, features_test = train_test_split(
    features, test_size=0.15, random_state=42
)

print(features_train.shape, features_test.shape)


**Standardisation:** 
K-Means is sensitive to the scale of the data, so the features need to be standardised before clustering. RobustScaler was chosen as the scaling technique because K-Means minimises the sum of squared distances between points and their cluster centres, which lets outliers pull a centre away from the bulk of its cluster. The resulting clusters are less compact and harder to interpret, and their centres give a misleading picture of the typical county in each group.

RobustScaler centres each feature on its median and scales it by its IQR. Because neither statistic is affected by extreme values, outliers remain in the dataset but do not determine how the bulk of the data is scaled.

The boxplots and distribution curves above show that "Homicides raw value" has an IQR of 0, meaning the middle 50% of counties share a single value and the column carries almost no usable variability for clustering. The column is therefore dropped so that clustering concentrates on more informative features. Keeping it would also be a problem for IQR-based scaling: dividing by an IQR of 0 is undefined and would produce NaN or infinite values that break K-Means. scikit-learn's RobustScaler guards against the division by leaving a zero-IQR column unscaled, but that column's outliers would then enter the distance computation at their raw magnitude.
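
The sketch below works through the median/IQR scaling formula by hand on two made-up columns (`spread` and `flat` are hypothetical names) to show why a zero IQR is a problem for a literal division by the IQR. It illustrates the formula only and is not a reimplementation of scikit-learn's scaler.

```python
import numpy as np

def iqr_scale(x):
    """Centre on the median and divide by the IQR (the formula described above)."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    # Silence the divide-by-zero warnings so the result can be inspected directly
    with np.errstate(divide="ignore", invalid="ignore"):
        return (x - med) / (q3 - q1)

spread = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])  # one outlier
flat = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 50.0])     # IQR is 0

scaled_spread = iqr_scale(spread)
scaled_flat = iqr_scale(flat)

print(np.isfinite(scaled_spread).all())  # every value is finite
print(np.isfinite(scaled_flat).any())    # no usable values when IQR is 0
```

In the `spread` column the outlier stays in the data but the scale is set by the middle half of the values. In the `flat` column every entry becomes NaN or infinite.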

In [None]:
#Drop Homicides data from dataset
cols_to_drop = ["Homicides raw value"]

#Adapted from Recitations
features_train = features_train.drop(columns=cols_to_drop, errors="ignore")
features_test  = features_test.drop(columns=cols_to_drop, errors="ignore")


scaler = RobustScaler(
    with_centering=True,
    with_scaling=True,
    quantile_range=(25, 75) 
)
features_train_std = scaler.fit_transform(features_train)
features_test_std = scaler.transform(features_test)


# Elbow Method

In [None]:
#Adapted from Recitations
wcss = {}  # Within-cluster sum of squares
warnings.filterwarnings('ignore', category=RuntimeWarning, 
                       message='.*encountered in matmul.*')

for i in range(1, 11):
    kmeans_temp = KMeans(n_clusters=i, random_state=42).fit(features_train_std)  # fit on the scaled training features
    wcss[i] = kmeans_temp.inertia_


# Plot the WCSS values
plt.figure(dpi=100)
plt.plot(wcss.keys(), wcss.values(), 'gs-')
plt.xlabel("Values of 'k'")
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal k')
plt.show()

In [None]:
#Adapted from https://medium.com/analytics-vidhya/implementation-of-principal-component-analysis-pca-in-k-means-clustering-b4bc0aa79cb6

pca = PCA(2)
data = pca.fit_transform(features_train_std)

kmeans = KMeans(n_clusters = 3, random_state=42)
label = kmeans.fit_predict(data)
centers = kmeans.cluster_centers_

plt.figure(figsize=(10,10))
uniq = np.unique(label)
for i in uniq:
   plt.scatter(data[label == i , 0] , data[label == i , 1] , label = i)
plt.scatter(centers[:,0], centers[:,1], marker="x", color='k')

plt.xlabel("PC1"); plt.ylabel("PC2")
plt.legend()
plt.show()


#Adapted from ChatGPT 5 - applying ANOVA test to interpret the clustering results, the code generated by the LLM has been attached with the submission
# ANOVA F-test: which features differ most across clusters? 
df_clusters = pd.DataFrame(features_train_std, columns=features_train.columns)
df_clusters["cluster"] = label

X_anova = df_clusters.drop(columns="cluster")
y_anova = df_clusters["cluster"]

f_vals, p_vals = f_classif(X_anova, y_anova) #Adapted from: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html
anova_importance = pd.Series(f_vals, index=X_anova.columns).sort_values(ascending=False)

print("ANOVA top features:\n", anova_importance.head(8))



The ANOVA F-test checks whether the mean of a feature differs significantly across clusters. The F-value for each feature measures how much the feature varies between clusters relative to how much it varies within them. 
From the results, the highest F-values belong to Drug Overdose Deaths, followed by Injury Deaths and Children in Poverty. 

Some implications from analysing the data are: 
1. Some counties have much higher overdose death rates than others, making this a significant factor in differentiating the clusters. 
2. Some counties differ strongly in how often they experience injury-related deaths.
3. Socioeconomic disadvantages such as poverty and low income play a key role in differentiating the clusters of counties.
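
The F-value described above can be sketched by hand for a single feature. The function name `one_way_f` and the toy values are hypothetical; the notebook itself relies on `f_classif`, and this is only meant to show the between-cluster versus within-cluster ratio.

```python
import numpy as np

def one_way_f(values, labels):
    """One-way ANOVA F-statistic for a single feature grouped by cluster label."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    # Between-cluster sum of squares: how far each cluster mean sits from the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-cluster sum of squares: spread of points around their own cluster mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical feature values for three clusters of three counties each
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
f_value = one_way_f(toy_values, toy_labels)
print(f_value)
```

A feature whose cluster means are far apart relative to the spread inside each cluster gets a large F-value, which is why it ranks high in the list of differentiating features.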

# 6. Supervised Learning Models

- Model 1: Multiple Linear Regression

In [None]:
# Adapted from: https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day3_Multiple_Linear_Regression.md

#Feature dataset
MLR_features_df = X_base.drop(X_base.columns[:5], axis=1)
col_names = MLR_features_df.columns.tolist()


#Target dataset
target_df = df_raw["Premature Death raw value"].drop(index=0)
#Convert values to numeric
target_df = pd.to_numeric(target_df, errors="coerce")
#Fill NaN values with median
target_df = target_df.fillna(target_df.median())
target_df = target_df.to_frame(name="Premature Death raw value")


X = MLR_features_df.reset_index(drop=True)
y = target_df.reset_index(drop=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Predicting the test set results
y_pred = regressor.predict(X_test)

r2_MLR = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"R² Score: {r2_MLR}")
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")

plot_df = X_train.copy()

plot_df["Premature Death raw value"] = y_train.squeeze() #Adapted from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.squeeze.html

n = len(col_names)
ncols = 3
nrows = (n + ncols - 1) // ncols
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 4*nrows))

axes = axes.flatten()

for i,col in enumerate(col_names): #Adapted from : http://medium.com/@kathy.lu.rentals/visualizing-with-seaborn-regplot-2235ccbaedd4
    sns.regplot(
        data=plot_df,
        x=col,
        y="Premature Death raw value",
        ax=axes[i],
        scatter_kws={"alpha":0.5, "s":18},  
        line_kws={"color":"red"}  
    )

    #To get the slope and intercept of the graph
    slope, intercept = np.polyfit(plot_df[col], plot_df["Premature Death raw value"], 1) #Adapted from https://numpy.org/doc/2.0/reference/generated/numpy.polyfit.html

    axes[i].text(
        0.05, 0.95, f"Slope = {slope:.2f}", 
        transform=axes[i].transAxes, 
        ha="left", va="top", fontsize=9, color="blue", weight="bold"
    )

    axes[i].set_title(col)

plt.tight_layout()
plt.show()



From the MLR plots, the 5 most important factors influencing premature death are *Children in Poverty, Drug Overdose Deaths, Injury Deaths, Median Household Income, and Teen Births*. *Children in Poverty* shows the steepest positive slope, indicating that socioeconomic disadvantage is the strongest driver of premature death across counties. 

On the other hand, the plots for *Reading Scores, Math Scores, and Median Household Income* show strong negative relationships with premature death. This highlights the strong link between education, income, and health outcomes: counties with better reading and math scores and higher household income consistently see lower rates of premature death. 
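
The slopes annotated on the plots come from `np.polyfit` on the raw columns, so each one is expressed in the units of its own feature. The sketch below, using made-up numbers and hypothetical variable names, shows how the same relationship gives different raw slopes under different units, and how z-scoring the feature first gives a slope that can be compared across features.

```python
import numpy as np

# Hypothetical data: the same income feature expressed in two different units
income_thousands = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
income_dollars = income_thousands * 1000.0
premature_death = 10000.0 - 50.0 * income_thousands  # made-up linear relationship

# Raw slopes depend on the unit of the feature
slope_thousands = np.polyfit(income_thousands, premature_death, 1)[0]
slope_dollars = np.polyfit(income_dollars, premature_death, 1)[0]

def standardised_slope(x, y):
    """Slope of y against the z-scored feature: change in y per one standard deviation of x."""
    z = (x - x.mean()) / x.std()
    return np.polyfit(z, y, 1)[0]

# After z-scoring, both versions of the feature give the same slope
std_slope_thousands = standardised_slope(income_thousands, premature_death)
std_slope_dollars = standardised_slope(income_dollars, premature_death)

print(slope_thousands, slope_dollars)
print(std_slope_thousands, std_slope_dollars)
```

Comparing standardised slopes, or the sign and fit of each plot, is one way to rank features without the ranking depending on whether a feature is recorded as a proportion, a rate, or a dollar amount.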


- Model 2: Decision Tree Regressor


In [None]:

# Referenced from Google Search AI - give me a desition tree regressor algorithm in python for multiple factors

regressor = DecisionTreeRegressor(criterion="squared_error", random_state=0, max_depth=10)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae_tree = mean_absolute_error(y_test, y_pred)
r2_tree   = r2_score(y_test, y_pred)   

print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae_tree}")
print(f"R² Score: {r2_tree}")

#Adapted from: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
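importances = regressor.feature_importances_
predictions_series = pd.Series(importances, index=MLR_features_df.columns)
# Sort by importance first; head(5) on the unsorted series would only return the first five columns
top5 = predictions_series.sort_values(ascending=False).head(5)
print("\nTop 5 factors (Decision Tree):")
print(top5)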





For the decision tree regressor (DTR), the top 5 most influential factors are *Children in Poverty, Driving Alone to Work, Flu Vaccinations, Preventable Hospital Stays, and Mammography Screening*. *Children in Poverty* has a feature importance of 44%, making it the strongest factor contributing to premature death in counties. 

The other factors, though assigned significantly lower importance, still give useful information on how they could affect premature death. The lowest-ranked of the top 5 factors are Mammography Screening and Flu Vaccinations. 
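
Because `feature_importances_` is returned in the same order as the training columns, a ranking needs an explicit sort. A minimal sketch with made-up importance values (the numbers and feature names below are hypothetical placeholders, not the model's output):

```python
import pandas as pd

# Hypothetical importances in column order (they sum to 1, as tree importances do)
importances = pd.Series(
    [0.05, 0.10, 0.50, 0.15, 0.20],
    index=["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"],
)

first_three = importances.head(3)                             # first three columns, not a ranking
top_three = importances.sort_values(ascending=False).head(3)  # three largest importances

print(list(first_three.index))
print(list(top_three.index))
```

Only the sorted version is guaranteed to return the features the tree relied on most.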



# Accuracy of Models

For the MLR model, the R² score of 0.78 is higher than the DTR's R² score of 0.6, so the MLR model explains a greater share of the variance in premature death than the DTR model. 

The Mean Squared Error (MSE) of the MLR model is 3.19 million, lower than the 5.89 million of the DTR model. This shows that the DTR model has larger prediction errors than the MLR model. 

The Mean Absolute Error (MAE) of the MLR model is 1165, lower than the 1469 of the DTR model, meaning the DTR's predictions are, on average, further from the observed values. 

On all three metrics, the MLR model is more accurate than the DTR model on the test set. 
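
For reference, the three metrics compared above can be written out from their definitions. The helper `regression_metrics` and the toy values are hypothetical; the notebook itself uses the scikit-learn functions. The last lines take the square root of the MSE figures reported above (3.19 million and 5.89 million) to express them in the units of the target.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R², MSE, MAE) computed from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r ** 2 for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_total  # 1 - residual sum of squares / total sum of squares
    return r2, mse, mae

# Hypothetical observed and predicted values
r2, mse, mae = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(r2, mse, mae)

# RMSE puts the reported MSE values back in the units of the target
rmse_mlr = math.sqrt(3.19e6)
rmse_dtr = math.sqrt(5.89e6)
print(round(rmse_mlr), round(rmse_dtr))
```

Because MSE squares each error, it penalises large misses more heavily than MAE, which is why the gap between the two models looks wider in MSE than in MAE.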

# Recommendations

The strongest predictor in both models is Children in Poverty. The analysis also shows a strong correlation between child poverty and the education, income, and health outcomes of county populations. One interpretation is that children born into poverty tend to have less access to quality education, limited job opportunities later in life, and poorer health outcomes overall.

Allegheny County should therefore prioritize interventions that reduce child poverty. Programmes that could benefit children include more educational assistance through social welfare programmes, increased access to affordable healthcare, and community support services. Such interventions can reduce poverty and improve the opportunities available to children, allowing the county to create long-term improvements in health outcomes and reduce premature deaths. 

In [None]:
#Export to HTML
!jupyter nbconvert --to html IAI_HW1_smridhip.ipynb

1. Correlation matrix, feature selection routines