# Assignment 3: DBSCAN

## 1. Data Analysis and Preparation

## 1.1 Load and Summarize Data
- Load the required libraries.
- Load `patient_priority.csv` into a pandas DataFrame.
- Drop the `triage` column and print the first five rows using `DataFrame.head()`.
- Print a summary using `DataFrame.describe()`.

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
url = "https://raw.githubusercontent.com/Hunteracademic/Unsupervised_assignment_1/master/patient_priority.csv"
patient_priority_raw = pd.read_csv(url)
patient_priority = patient_priority_raw.drop("triage", axis=1)
patient_priority.head()

| | Unnamed: 0 | age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease | Residence_type | smoking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 40.0 | 1.0 | 2.0 | 140.0 | 294.0 | 172.0 | 0.0 | 108.0 | 43.0 | 92.0 | 19.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked |
| 1 | 1 | 49.0 | 0.0 | 3.0 | 160.0 | 180.0 | 156.0 | 0.0 | 75.0 | 47.0 | 90.0 | 18.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked |
| 2 | 2 | 37.0 | 1.0 | 2.0 | 130.0 | 294.0 | 156.0 | 0.0 | 98.0 | 53.0 | 102.0 | 23.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked |
| 3 | 3 | 48.0 | 0.0 | 4.0 | 138.0 | 214.0 | 156.0 | 1.0 | 72.0 | 51.0 | 118.0 | 18.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked |
| 4 | 4 | 54.0 | 1.0 | 3.0 | 150.0 | 195.0 | 156.0 | 0.0 | 108.0 | 90.0 | 83.0 | 21.0 | 0.467386 | 0.0 | 0.0 | Urban | never smoked |


In [6]:
patient_priority.describe()

| | Unnamed: 0 | age | gender | chest pain type | blood pressure | cholesterol | max heart rate | exercise angina | plasma glucose | skin_thickness | insulin | bmi | diabetes_pedigree | hypertension | heart_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6962.0 | 6962.0 | 6961.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 | 6962.0 |
| mean | 2011.95418 | 57.450014 | 0.531964 | 0.529015 | 109.629991 | 184.71129 | 163.502442 | 0.061764 | 98.394283 | 56.813416 | 111.09164 | 27.190908 | 0.467386 | 0.071531 | 0.0395 |
| std | 1560.966466 | 11.904948 | 0.499013 | 1.253791 | 21.534852 | 32.010359 | 15.458693 | 0.240743 | 28.598084 | 22.889316 | 17.470033 | 7.362886 | 0.102663 | 0.257729 | 0.194796 |
| min | 0.0 | 28.0 | 0.0 | 0.0 | 60.0 | 150.0 | 138.0 | 0.0 | 55.12 | 21.0 | 81.0 | 10.3 | 0.078 | 0.0 | 0.0 |
| 25% | 604.0 | 48.0 | 0.0 | 0.0 | 92.0 | 164.0 | 150.0 | 0.0 | 78.7075 | 36.0 | 97.0 | 21.8 | 0.467386 | 0.0 | 0.0 |
| 50% | 1628.5 | 56.0 | 1.0 | 0.0 | 111.0 | 179.0 | 163.0 | 0.0 | 93.0 | 55.0 | 111.0 | 26.2 | 0.467386 | 0.0 | 0.0 |
| 75% | 3368.75 | 66.0 | 1.0 | 0.0 | 127.0 | 192.0 | 177.0 | 0.0 | 111.6325 | 77.0 | 125.0 | 31.0 | 0.467386 | 0.0 | 0.0 |
| max | 5109.0 | 82.0 | 1.0 | 4.0 | 165.0 | 294.0 | 202.0 | 1.0 | 199.0 | 99.0 | 171.0 | 66.8 | 2.42 | 1.0 | 1.0 |


**Use `DataFrame.describe()` to summarize the dataset.**

- `describe()` summarizes the 15 numeric columns. The two text columns, `Residence_type` and `smoking_status`, are excluded by default.
- Notably, the dataset has 6962 rows, one per record. Every numeric column has 6962 non-null values except `gender`, which has 6961.
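
The counts above can be confirmed programmatically rather than read off the table. The following is a minimal sketch on a small toy DataFrame (the values and the name `toy` are made up for illustration and are not the real `patient_priority` data); the same calls should apply to the full DataFrame.

```python
import pandas as pd

# Toy stand-in for patient_priority (hypothetical values, not the real data).
toy = pd.DataFrame({
    "age": [40.0, 49.0, 37.0],
    "gender": [1.0, 0.0, None],          # one missing value, stored as NaN
    "Residence_type": ["Urban", "Urban", "Rural"],
})

# describe() only covers numeric columns by default, so list them explicitly.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# shape gives (rows, columns) for the whole frame, text columns included.
n_rows, n_cols = toy.shape
print(f"rows={n_rows}, columns={n_cols}, numeric columns={len(numeric_cols)}")

# count() matches the 'count' row of describe(): it skips missing values.
non_null_counts = toy[numeric_cols].count()
print(non_null_counts)
```

A `count` lower than the number of rows, as with `gender` here, is an early sign that the column has missing values.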


**Explain the meaning of each column.**


- **Unnamed: 0**: Appears to be a row index carried over from the source file. Its integer values run from 0 to 5109 across 6962 rows, so it cannot be unique for every record.
- **age**: Patient age in years
- **gender**: Recorded sex at birth
- **chest pain type**: Encoded category of chest pain (e.g., typical/atypical/none), with values from 0 to 4 in this dataset
- **blood pressure**: Resting blood pressure
- **cholesterol**: Serum cholesterol level
- **max heart rate**: Maximum heart rate achieved during testing
- **exercise angina**: Presence of exercise‑induced angina (0 = no, 1 = yes)
- **plasma glucose**: Plasma glucose concentration
- **skin_thickness**: Triceps skinfold thickness in mm
- **insulin**: Serum insulin level
- **bmi**:  Body mass index
- **diabetes_pedigree**: Diabetes pedigree function (family diabetes history metric)
- **hypertension**: Indicator for high blood pressure (0 = no, 1 = yes)
- **heart_disease**: Indicator for diagnosed heart disease (0 = no, 1 = yes)
- **Residence_type**: Patient residence type
- **smoking_status**: Smoking status history
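
Two of these descriptions can be checked against the data: whether `Unnamed: 0` really identifies rows uniquely, and which categories the text columns contain. Below is a minimal sketch on a toy DataFrame; the rows and the category values other than "Urban" and "never smoked" are hypothetical.

```python
import pandas as pd

# Toy stand-in for patient_priority (hypothetical rows and categories).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 0, 1],
    "Residence_type": ["Urban", "Urban", "Rural", "Urban", "Rural"],
    "smoking_status": ["never smoked", "smokes", "never smoked",
                       "Unknown", "smokes"],
})

# A true identifier column has no repeated values.
id_is_unique = toy["Unnamed: 0"].is_unique
print(f"'Unnamed: 0' is unique: {id_is_unique}")

# Non-numeric columns are skipped by describe(); inspect their categories.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
for col in text_cols:
    print(toy[col].value_counts(dropna=False))
```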

**Make observations based on the summary statistics.**

- Ages range from 28 to 82, with a median of 56, and the middle half of patients are between 48 and 66. This makes sense, since you would usually see older people in a heart or diabetes clinic.

- Although blood pressure reaches 165 and cholesterol reaches 294, the `hypertension` and `heart_disease` indicators are 0 for most patients: their means are about 0.07 and 0.04, so roughly 7% and 4% of records are flagged. This could mean many patients have not been officially diagnosed yet or are only in for a screening.

- The data includes patients living in both Urban and Rural areas, so we can compare whether residence type is associated with differences in health.
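
Observations like these can be backed by a few direct computations. The sketch below uses a toy DataFrame with made-up values, so the numbers it prints are illustrative only; the column names follow the dataset.

```python
import pandas as pd

# Toy stand-in for patient_priority (hypothetical values, not the real data).
toy = pd.DataFrame({
    "age": [30.0, 48.0, 56.0, 66.0, 80.0, 60.0],
    "hypertension": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Residence_type": ["Urban", "Rural", "Urban", "Rural", "Urban", "Rural"],
})

# Quartiles describe where the bulk of the ages fall.
age_quartiles = toy["age"].quantile([0.25, 0.5, 0.75])
print(age_quartiles)

# For a 0/1 indicator, the mean is the share of rows flagged as 1.
hypertension_rate = toy["hypertension"].mean()
print(f"hypertension rate: {hypertension_rate:.1%}")

# Group by residence type to compare the rate between areas.
rate_by_residence = toy.groupby("Residence_type")["hypertension"].mean()
print(rate_by_residence)
```

A difference between groups in such a comparison shows association only; it does not by itself show that residence changes health.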


## Cleaning

In [8]:
patient_priority.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6962 entries, 0 to 6961
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         6962 non-null   int64  
 1   age                6962 non-null   float64
 2   gender             6961 non-null   float64
 3   chest pain type    6962 non-null   float64
 4   blood pressure     6962 non-null   float64
 5   cholesterol        6962 non-null   float64
 6   max heart rate     6962 non-null   float64
 7   exercise angina    6962 non-null   float64
 8   plasma glucose     6962 non-null   float64
 9   skin_thickness     6962 non-null   float64
 10  insulin            6962 non-null   float64
 11  bmi                6962 non-null   float64
 12  diabetes_pedigree  6962 non-null   float64
 13  hypertension       6962 non-null   float64
 14  heart_disease      6962 non-null   float64
 15  Residence_type     6962 non-null   object 
 16  smoking_status     6962 

In [12]:
print("--- Duplicate Row Report ---")
print(f"Duplicates found: {patient_priority.duplicated().sum()}")

print("-" * 30) # Prints a line of 30 dashes

print("--- Missing Value Report ---")
missing_counts = patient_priority.isna().sum()  # compute once, reuse below
print(missing_counts[missing_counts > 0])

--- Duplicate Row Report ---
Duplicates found: 0
------------------------------
--- Missing Value Report ---
gender    1
dtype: int64
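
With a single missing `gender` value, two common options are to drop that row or to fill it with the most frequent value. The following is a minimal sketch of both on a toy DataFrame (hypothetical values); which option suits this assignment is an assumption left to the analyst, and the names `dropped` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for patient_priority (hypothetical values, not the real data).
toy = pd.DataFrame({
    "age": [40.0, 49.0, 37.0, 48.0],
    "gender": [1.0, 0.0, np.nan, 1.0],
})

# Option A: drop the rows where gender is missing.
dropped = toy.dropna(subset=["gender"]).reset_index(drop=True)

# Option B: fill missing gender with the most frequent value (the mode).
most_common = toy["gender"].mode().iloc[0]
filled = toy.assign(gender=toy["gender"].fillna(most_common))

print(f"rows after dropping: {len(dropped)}")
print(f"missing after filling: {filled['gender'].isna().sum()}")
```

Dropping loses one record out of thousands, which is usually negligible; filling keeps the row but introduces a guessed value.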
