
# Data Cleaning and Descriptive Analysis

This notebook demonstrates basic data cleaning and descriptive analysis of clinical data, using a public diabetes dataset. Dataset links are provided for reproducibility.
        


## Dataset Information and Download Links

The examples in this notebook use a **Diabetes dataset**, which can be downloaded from the following source:

1. **Kaggle:**
   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)
   - This dataset includes detailed clinical data for diabetes prediction and analysis.

### Dataset Attributes

- **Pregnancies**: Number of pregnancies.
- **Glucose**: Plasma glucose concentration.
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-hour serum insulin (μU/mL).
- **BMI**: Body mass index (weight in kg divided by the square of height in m).
- **DiabetesPedigreeFunction**: Diabetes pedigree function, a score based on family history of diabetes.
- **Age**: Age of the patient (years).
- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).
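
Before cleaning, it can help to confirm that a loaded file actually contains the attributes listed above. The following is a minimal sketch, assuming the CSV header uses exactly these column names; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the small frame is made up for illustration.

```python
import pandas as pd

# Column names as listed above; assumed to match the CSV header exactly.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def missing_columns(df):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Made-up values for illustration only.
sample = pd.DataFrame({"Glucose": [120, 95], "BMI": [30.0, 24.5], "Outcome": [1, 0]})
print("Columns missing from sample:", missing_columns(sample))
```

An empty list means every expected attribute is present; anything else points to a renamed or dropped column that later cells would fail on.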

### Usage Notes

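- Preprocess the dataset before analysis: handle missing values, and normalize features if a downstream method requires it.
- Check whether missing measurements are encoded as 0 (for example, a Glucose or BMI of 0 is not physiologically plausible), since such values are not detected as missing by default.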
- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more information.
        


## Data Cleaning: Handling Missing Values

Missing values can bias summary statistics and break downstream analyses. This section shows how to count missing values per column and fill them with each column's median.
        

In [None]:

import pandas as pd

# Load the dataset (replace with your file path)
data = pd.read_csv(r'C:\Path\to\diabetes.csv')

# Count missing (NaN) values in each column
print("Missing values before cleaning:")
print(data.isna().sum())

# Fill missing values in numeric columns with that column's median
data = data.fillna(data.median(numeric_only=True))

print("Missing values after cleaning:")
print(data.isna().sum())
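
The cell above only fills values that pandas recognizes as NaN. If missing measurements are stored as 0 instead, `isna()` reports nothing to clean. The following is a minimal sketch of one way to handle that case, assuming that a zero in the listed columns is a placeholder for "not measured"; this is an assumption to verify against your copy of the data. `ZERO_AS_MISSING` and `zeros_to_median` are hypothetical names, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Assumption: a zero in these columns means "not measured", not a real reading.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zeros_to_median(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with zeros in `columns` replaced by each column's median."""
    cleaned = df.copy()
    cols = [c for c in columns if c in cleaned.columns]
    # Mark zeros as NaN first so the median is computed from observed values only.
    cleaned[cols] = cleaned[cols].replace(0, np.nan)
    cleaned[cols] = cleaned[cols].fillna(cleaned[cols].median())
    return cleaned


# Made-up values for illustration only.
toy = pd.DataFrame({
    "Glucose": [100, 0, 120],
    "Insulin": [0, 80, 100],
    "Outcome": [0, 1, 0],
})
print(zeros_to_median(toy))
```

Columns outside the list, such as `Outcome` and `Pregnancies`, are left alone because 0 is a valid value there.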
        


## Descriptive Statistics: Summarizing the Data

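Descriptive statistics summarize each variable's central tendency and spread. `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column.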
        

In [None]:

# Summary statistics for numerical columns
summary_stats = data.describe()
print(summary_stats)
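
Clinical measurements are often skewed, so a median-based spread can be a useful companion to the mean and standard deviation. The following is a minimal sketch that adds the interquartile range and a missing-value count to the `describe()` output; `extended_summary` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def extended_summary(df):
    """Transposed describe() plus interquartile range and missing counts."""
    numeric = df.select_dtypes(include="number")
    summary = numeric.describe().T
    # Interquartile range: spread of the middle 50% of the values.
    summary["iqr"] = summary["75%"] - summary["25%"]
    # Number of NaN entries per column (aligned on column name).
    summary["missing"] = numeric.isna().sum()
    return summary


# Made-up values for illustration only.
toy_summary_input = pd.DataFrame({"Glucose": [90.0, 100.0, 110.0, 120.0, None]})
print(extended_summary(toy_summary_input))
```

Transposing puts one variable per row, which tends to be easier to read when there are more columns than statistics.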
        


## Univariate Analysis: Distribution of Glucose Levels

A histogram with a kernel density estimate (KDE) overlay shows the spread, central tendency, and skew of glucose levels.
        

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of glucose levels with a KDE overlay
sns.histplot(data=data, x='Glucose', kde=True)
plt.title("Distribution of Glucose Levels")
plt.xlabel("Glucose")
plt.ylabel("Count")
plt.show()
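
The shape seen in a histogram can also be checked numerically. The following is a minimal sketch that reports the mean, median, and sample skewness of a column; `center_and_skew` is a hypothetical helper and the values are made up. A mean above the median together with positive skewness suggests a right tail.

```python
import pandas as pd


def center_and_skew(values):
    """Return mean, median, and sample skewness, ignoring missing values."""
    series = pd.Series(values, dtype="float64").dropna()
    return {
        "mean": series.mean(),
        "median": series.median(),
        "skewness": series.skew(),
    }


# Made-up glucose-like values with one high reading, for illustration only.
glucose_stats = center_and_skew([85, 90, 95, 100, 105, 180])
print(glucose_stats)
```

On the real data this could be called as `center_and_skew(data['Glucose'])`.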
        


## Bivariate Analysis: Glucose vs BMI

Scatter plots visualize the relationship between two numerical variables. Here BMI is on the x-axis and glucose on the y-axis.
        

In [None]:

# Scatter plot of glucose (y) against BMI (x)
sns.scatterplot(data=data, x='BMI', y='Glucose')
plt.title("Glucose vs BMI")
plt.xlabel("BMI")
plt.ylabel("Glucose")
plt.show()
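
A correlation coefficient gives a single-number summary of what the scatter plot shows. The following is a minimal sketch using made-up values: Pearson correlation measures linear association, and a rank-based (Spearman) coefficient is computed here as the Pearson correlation of the ranks, which avoids any extra dependency.

```python
import pandas as pd

# Made-up values for illustration only.
toy_pairs = pd.DataFrame({
    "BMI": [22.0, 25.0, 28.0, 31.0, 34.0],
    "Glucose": [90.0, 100.0, 95.0, 120.0, 130.0],
})

# Pearson: strength of the linear relationship.
pearson = toy_pairs["BMI"].corr(toy_pairs["Glucose"])

# Spearman: Pearson correlation of the ranks, which captures monotonic trends.
spearman = toy_pairs["BMI"].rank().corr(toy_pairs["Glucose"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Neither coefficient implies causation, and both can be distorted by placeholder values such as zeros, so they are best computed after cleaning.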
        


## Grouping and Aggregation: Mean Glucose by Outcome

Grouping the data by `Outcome` makes it possible to compare mean glucose between the non-diabetic (0) and diabetic (1) groups.
        

In [None]:

# Group by outcome and calculate mean glucose
mean_glucose_by_outcome = data.groupby('Outcome')['Glucose'].mean()
print(mean_glucose_by_outcome)
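
A mean on its own hides group size and spread. The following is a minimal sketch, using made-up values, of how `agg` can report several statistics per group in one call.

```python
import pandas as pd

# Made-up values for illustration only.
toy_groups = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1],
    "Glucose": [90.0, 100.0, 110.0, 140.0, 160.0],
})

# One row per outcome group, one column per statistic.
glucose_by_outcome = toy_groups.groupby("Outcome")["Glucose"].agg(
    ["count", "mean", "median", "std"]
)
print(glucose_by_outcome)
```

Reporting the count alongside the mean makes it easier to judge how much weight each group average deserves.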
        