# Assignment 5: Exploratory Data Analysis and Hypothesis Testing

**Student:** Dhruv Arora  
**Dataset:** Heart Disease  
**Date:** October 7, 2025

## Objective
Perform exploratory data analysis (EDA) and hypothesis testing on your chosen dataset. Identify patterns and test meaningful hypotheses about relationships between variables.

---


## 1. Import Data and Libraries

Import the necessary libraries and load your dataset.


In [1]:
# ---- Load Cleveland Heart Disease data as a clean dataframe ----
import pandas as pd
import numpy as np

col_names = [
    'age','sex','cp','trestbps','chol','fbs','restecg','thalach',
    'exang','oldpeak','slope','ca','thal','target'
]

df = pd.read_csv(
    "student-work/assignment-5/data/heart.csv",  # relative path, resolved from the notebook's working directory
    names=col_names,
    na_values='?',
    header=None
)

# Ensure numeric types first; values that cannot be parsed become NaN
for c in col_names:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Basic cleaning: drop rows with any missing values (done after coercion so new NaNs are caught)
df = df.dropna()

# Convert target: >0 => disease (1), 0 => no disease (0)
df['target'] = (df['target'] > 0).astype(int)

print("After cleaning:", df.shape)
df.head()


FileNotFoundError: [Errno 2] No such file or directory: 'student-work/assignment-5/data/heart.csv'
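
The `FileNotFoundError` above indicates that the relative path did not resolve from the notebook's working directory. The following is a minimal sketch of one way to guard against that, not part of the assignment template: the helper names `find_first_existing` and `load_cleveland` are hypothetical, and the two demo rows are illustrative values in the file's format rather than results from the actual file.

```python
from pathlib import Path
import io

import pandas as pd

# Column order of the headerless Cleveland file, as used in the loading cell
COL_NAMES = [
    'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
    'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'
]


def find_first_existing(candidates):
    """Return the first candidate path that is an existing file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


def load_cleveland(source):
    """Read a headerless Cleveland-format CSV (path or file-like object).

    '?' is treated as a missing value, as in the loading cell.
    """
    return pd.read_csv(source, names=COL_NAMES, na_values='?', header=None)


# Demo on two in-memory rows so the sketch runs without the real file
demo = load_cleveland(io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,3,?,2\n"
))
print(demo.shape)
print(demo['thal'].isna().sum())
```

In the notebook, one could pass a list of plausible locations to `find_first_existing` (for example the path used above plus a shorter one such as `data/heart.csv`, which is only a guess) and raise a clear message if it returns `None` before calling `load_cleveland`.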



## 2. Exploratory Data Analysis (EDA)

Explore your dataset to understand its structure and identify patterns.


In [None]:
print("=== DATASET OVERVIEW ===")
print("Shape:", df.shape)
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics:")
df.describe()


In [None]:
# === TARGET VARIABLE DISTRIBUTION ===
import matplotlib.pyplot as plt  # plotting library used by this and the following cells

target_counts = df['target'].value_counts().sort_index()

fig, axes = plt.subplots(1, 2, figsize=(10,4))

# Bar plot
target_counts.plot(kind='bar', ax=axes[0], color=['skyblue','salmon'])
axes[0].set_title('Target Distribution')
axes[0].set_xlabel('Target (0 = No Disease, 1 = Disease)')
axes[0].set_ylabel('Count')

# Pie chart
axes[1].pie(target_counts.values, labels=['No Disease', 'Disease'], 
            autopct='%1.1f%%', colors=['skyblue','salmon'])
axes[1].set_title('Target (%)')

plt.tight_layout()
plt.show()


In [None]:
# Numerical variables analysis
print("=== NUMERICAL VARIABLES ===")
numerical_cols = df.select_dtypes(include=[np.number]).columns
print("Numerical columns:", list(numerical_cols))

# Create histograms for numerical variables
fig, axes = plt.subplots(1, len(numerical_cols), figsize=(5*len(numerical_cols), 4))
if len(numerical_cols) == 1:
    axes = [axes]

for i, col in enumerate(numerical_cols):
    axes[i].hist(df[col], bins=20, alpha=0.7)
    axes[i].set_title(f'{col} Distribution')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()


In [None]:
# Categorical variables analysis
print("=== CATEGORICAL VARIABLES ===")
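# Every column is numeric after loading, so select_dtypes(include=['object']) would be empty.
# Assumption: these integer-coded columns are the categorical/ordinal ones in the Cleveland data.
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']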
print(f"Categorical columns: {list(categorical_cols)}")

for col in categorical_cols:
    print(f"\n{col} value counts:")
    print(df[col].value_counts())
    
    # Create bar plot
    plt.figure(figsize=(8, 4))
    df[col].value_counts().plot(kind='bar')
    plt.title(f'{col} Distribution')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()


In [None]:
# === CORRELATION ANALYSIS ===
import seaborn as sns
import matplotlib.pyplot as plt

# compute correlation only for numeric columns
num_df = df.select_dtypes(include=[np.number])
corr = num_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Matrix (Numeric Features)")
plt.show()

# show correlations with target variable
if 'target' in corr.columns:
    print("\nCorrelations with target (descending):")
    print(corr['target'].sort_values(ascending=False))


In [None]:
# Relationship between variables and target
print("=== RELATIONSHIPS WITH TARGET ===")

# Box plots for numerical variables vs target
numerical_cols_no_target = [col for col in df.select_dtypes(include=[np.number]).columns if col != 'target']

if len(numerical_cols_no_target) > 0:
    fig, axes = plt.subplots(1, len(numerical_cols_no_target), figsize=(5*len(numerical_cols_no_target), 4))
    if len(numerical_cols_no_target) == 1:
        axes = [axes]
    
    for i, col in enumerate(numerical_cols_no_target):
        df.boxplot(column=col, by='target', ax=axes[i])
        axes[i].set_title(f'{col} by Target')
        axes[i].set_xlabel('Target')
        axes[i].set_ylabel(col)
    
    plt.tight_layout()
    plt.show()

# Crosstabs for categorical variables vs target
for col in categorical_cols:
    print(f"\n{col} vs Target:")
    crosstab = pd.crosstab(df[col], df['target'], normalize='index')
    print(crosstab.round(3))


## 3. Hypothesis Formulation

Based on your EDA, formulate **2-3 testable hypotheses**. Each hypothesis should be:
- Clear and specific
- Testable with statistical methods
- Grounded in your observations

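### Hypothesis 1: Cholesterol is higher in patients with heart disease
**H1:** Patients with heart disease have higher cholesterol (`chol`) than patients without.
- **Null Hypothesis (H0):** The distribution of `chol` is the same in the disease and no-disease groups.
- **Alternative Hypothesis (H1):** `chol` tends to be higher in the disease group (one-sided).
- **Rationale:** [Explain why you think this relationship exists]

### Hypothesis 2: Sex is associated with heart disease
**H2:** Disease status differs between female (`sex` = 0) and male (`sex` = 1) patients.
- **Null Hypothesis (H0):** `sex` and `target` are independent.
- **Alternative Hypothesis (H1):** `sex` and `target` are associated.
- **Rationale:** [Explain why you think this relationship exists]

### Hypothesis 3: Maximum heart rate is lower in patients with heart disease
**H3:** Patients with heart disease reach a lower maximum heart rate (`thalach`) than patients without.
- **Null Hypothesis (H0):** The distribution of `thalach` is the same in the disease and no-disease groups.
- **Alternative Hypothesis (H1):** `thalach` tends to be lower in the disease group (one-sided).
- **Rationale:** [Explain why you think this relationship exists]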

---


## 4. Hypothesis Testing

Test each hypothesis using appropriate statistical methods. For each test, explain:
- Why you chose this test
- The results and interpretation
- Whether you reject or fail to reject the null hypothesis


### Testing Hypothesis 1: Cholesterol is higher in the disease group

**Test Choice:** Mann-Whitney U test (one-sided)
- **Why appropriate:** It compares a continuous variable (`chol`) between two independent groups without assuming that the values are normally distributed.


In [None]:
# === TESTING H1: chol higher in disease group? (Mann-Whitney U, one-sided) ===
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# groups
g0 = df.loc[df['target'] == 0, 'chol'].dropna()
g1 = df.loc[df['target'] == 1, 'chol'].dropna()

print(f"n0={len(g0)}, n1={len(g1)}")
print(f"median chol (no disease)  : {np.median(g0):.1f}")
print(f"median chol (disease)     : {np.median(g1):.1f}")

# One-sided alternative: disease > no-disease
U, p = mannwhitneyu(g1, g0, alternative='greater')
print(f"\nMann-Whitney U (one-sided, disease>no-disease): U={U:.1f}, p={p:.4g}")

alpha = 0.05
if p < alpha:
    print("Result: Reject H0 → Cholesterol is higher in the disease group.")
else:
    print("Result: Fail to reject H0 → No significant evidence that cholesterol is higher.")
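
A p-value alone does not say how large the difference between the groups is. As a hedged illustration, the sketch below computes the rank-biserial correlation, a common effect size to report alongside a Mann-Whitney U test. The function name `rank_biserial` is hypothetical, and the demo values are made up for the example rather than taken from the heart data.

```python
import numpy as np
from scipy.stats import rankdata


def rank_biserial(x, y):
    """Rank-biserial correlation for two independent samples.

    Ranges from -1 (every x below every y) to +1 (every x above every y);
    0 means neither group tends to rank higher.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    # Rank the pooled sample; ties receive average ranks
    ranks = rankdata(np.concatenate([x, y]))
    # U statistic for the first sample
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    return 2 * u1 / (n1 * n2) - 1


# Toy example: the first group is entirely above the second
demo_high = [5.0, 6.0, 7.0]
demo_low = [1.0, 2.0, 3.0]
r_demo = rank_biserial(demo_high, demo_low)
print(f"rank-biserial r = {r_demo:.2f}")
```

In this notebook the call would be `rank_biserial(g1, g0)`, where a positive value means the disease group tends to rank higher on `chol`.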


### Testing Hypothesis 2: Sex is associated with disease status

**Test Choice:** Chi-square test of independence
- **Why appropriate:** Both `sex` and `target` are categorical, and the test checks whether the counts in their contingency table depart from what independence would predict.


In [None]:
# === TESTING H2: sex and disease associated? (Chi-square of independence) ===
import pandas as pd
from scipy.stats import chi2_contingency

ct = pd.crosstab(df['sex'], df['target'])  # sex: 0=female, 1=male
print("Contingency table (counts):")
display(ct)

chi2, p, dof, expected = chi2_contingency(ct)
print(f"\nChi-square={chi2:.3f}, dof={dof}, p={p:.4g}")
print("Expected counts:")
display(pd.DataFrame(expected, index=ct.index, columns=ct.columns))

alpha = 0.05
if p < alpha:
    print("Result: Reject H0 → Sex and disease status are associated.")
else:
    print("Result: Fail to reject H0 → No significant association detected.")
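
The chi-square statistic grows with sample size, so it is often reported together with a measure of association strength. The sketch below shows one such measure, Cramér's V, under the assumption that the uncorrected chi-square statistic is used (hence `correction=False`). The function name `cramers_v` and the demo tables are hypothetical and not derived from the heart data.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V for a contingency table of counts (0 = no association, 1 = perfect)."""
    table = np.asarray(table, dtype=float)
    # Uncorrected chi-square; Yates' correction would shrink it for 2x2 tables
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return np.sqrt(chi2 / (n * k))


# Toy tables: perfectly associated vs. perfectly independent
v_perfect = cramers_v([[10, 0], [0, 10]])
v_none = cramers_v([[10, 10], [10, 10]])
print(f"V (perfect association) = {v_perfect:.2f}")
print(f"V (independence)        = {v_none:.2f}")
```

Here the call would be `cramers_v(ct.values)` on the `sex` by `target` table built above.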


### Testing Hypothesis 3: Maximum heart rate is lower in the disease group

**Test Choice:** Mann-Whitney U test (one-sided)
- **Why appropriate:** It compares a continuous variable (`thalach`) between two independent groups without assuming that the values are normally distributed.


In [None]:
# === TESTING H3: thalach lower in disease group? (Mann-Whitney U, one-sided) ===
from scipy.stats import mannwhitneyu
import numpy as np

g0 = df.loc[df['target'] == 0, 'thalach'].dropna()
g1 = df.loc[df['target'] == 1, 'thalach'].dropna()

print(f"n0={len(g0)}, n1={len(g1)}")
print(f"median thalach (no disease)  : {np.median(g0):.1f}")
print(f"median thalach (disease)     : {np.median(g1):.1f}")

# One-sided alternative: disease < no-disease
U, p = mannwhitneyu(g1, g0, alternative='less')
print(f"\nMann-Whitney U (one-sided, disease<no-disease): U={U:.1f}, p={p:.4g}")

alpha = 0.05
if p < alpha:
    print("Result: Reject H0 → Max heart rate is lower in the disease group.")
else:
    print("Result: Fail to reject H0 → No significant evidence of lower max heart rate.")
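
Because three hypotheses are each tested at alpha = 0.05, the chance of at least one false positive across the set is higher than 5%. One common remedy is the Holm step-down adjustment, sketched below. The function name `holm_adjust` is hypothetical, and the p-values are placeholders for illustration only, not results from this notebook.

```python
import numpy as np


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                    # indices from smallest to largest p
    scaled = p[order] * (m - np.arange(m))   # multiply by m, m-1, ..., 1
    # Enforce monotonicity and cap at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted        # undo the sort
    return adjusted


# Placeholder p-values standing in for H1, H2, H3
placeholder_p = [0.01, 0.04, 0.03]
adjusted_p = holm_adjust(placeholder_p)
print(adjusted_p)
```

Each adjusted value can then be compared with `alpha` in place of the raw p-value.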


## 5. Summary and Conclusions


**Dataset:** UCI Heart Disease (Cleveland subset)  
**Sample size (after cleaning):** `{{auto: df.shape[0]}}` rows, `{{auto: df.shape[1]}}` columns

### Key Findings from EDA
- The target distribution shows roughly `[fill in % for class 0]` no-disease and `[fill in % for class 1]` disease.
- Notable patterns:
  - [e.g., boxplots suggested higher `chol` for disease]
  - [e.g., some categorical features (cp/thal) show different proportions by target]

### Hypothesis Tests
- **H1 (chol ↑ in disease):** *[Reject/Fail to reject]* H0.  
  - Interpretation: *[briefly interpret your test’s medians and p-value]*.
- **H2 (sex associated with disease):** *[Reject/Fail to reject]* H0.  
  - Interpretation: *[what the chi-square implies about association]*.
- **H3 (thalach ↓ in disease):** *[Reject/Fail to reject]* H0.  
  - Interpretation: *[short note]*.

### Limitations
- Observational dataset; no causal claims.
- Some variables are categorical encodings; associations may reflect confounders.

### Future Work
- Multivariate modeling (e.g., logistic regression).
- Better handling of missing values and outliers.
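
To make the logistic regression suggestion concrete, here is a minimal sketch that fits a logistic model by Newton-Raphson (iteratively reweighted least squares) using only NumPy. It runs on synthetic data generated inside the block, not on the heart dataset, and the function name `fit_logistic_irls` is hypothetical. In practice a library such as statsmodels or scikit-learn would likely be the more convenient choice.

```python
import numpy as np


def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by Newton-Raphson; returns [intercept, coefficients...]."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # predicted probabilities
        w = p * (1.0 - p)                      # per-observation weights
        grad = X.T @ (y - p)                   # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])          # negative Hessian
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Synthetic data: one predictor with true intercept -0.5 and slope 1.5
rng = np.random.default_rng(0)
x = rng.normal(size=500)
true_prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))
y = rng.binomial(1, true_prob)

beta_hat = fit_logistic_irls(x, y)
print("estimated [intercept, slope]:", np.round(beta_hat, 2))
```

With the real data, `X` would be a matrix of several predictors (for example `df[['age', 'chol', 'thalach']].values`) and `y` would be `df['target'].values`, so each coefficient reflects a predictor's association with disease while holding the others fixed.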
