
# Exercise 1: Data Cleaning and Exploratory Data Analysis

This notebook walks through cleaning a dataset (handling missing values and removing duplicates) and basic exploratory data analysis (EDA): summary statistics, distribution plots, and correlation analysis. A dataset link is provided for practice.
        


## Dataset Information and Download Links

This exercise uses a modified **Diabetes dataset**, which can be downloaded from the following source:

1. **Kaggle:**
   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)
   - It contains diagnostic measurements and a diabetes outcome label for each patient.

### Dataset Attributes

- **Pregnancies**: Number of pregnancies.
- **Glucose**: Plasma glucose concentration 2 hours into an oral glucose tolerance test.
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-Hour serum insulin (mu U/ml).
- **BMI**: Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction**: Score summarizing the patient's family history of diabetes.
- **Age**: Age of the patient (years).
- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).

### Usage Notes

- Preprocess the dataset before analysis: handle missing values and, if a later modeling step requires it, normalize the features.
- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more information.
        


## Handling Missing Values

Count the missing values in each column, then fill them with the column median so that gaps do not distort later statistics.
        

In [None]:

import pandas as pd

# Load the dataset (replace with your file path)
data = pd.read_csv(r'C:\Path\to\diabetes.csv')

# Check for missing values
print("Missing values before cleaning:")
print(data.isna().sum())

# Fill missing values with each column's median (numeric columns only)
data = data.fillna(data.median(numeric_only=True))

# Verify missing values are handled
print("Missing values after cleaning:")
print(data.isna().sum())
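
In the original Pima Indians Diabetes data, missing measurements are commonly recorded as `0` rather than `NaN`, so `isna()` may report no missing values at all. Whether the modified file used in this exercise follows that convention is an assumption worth checking. The sketch below uses a small made-up sample (`sample` and `zero_as_missing` are illustrative names, not part of the dataset) to show one way to convert such zeros to `NaN` before imputing:

```python
import numpy as np
import pandas as pd

# Small synthetic sample standing in for the real data (values are made up).
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 0, 8, 1],  # zero is a valid value in this column
})

# Assumption: in these columns a zero means "not recorded".
zero_as_missing = ["Glucose", "BloodPressure", "Insulin"]

cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isna().sum())

# Impute with each column's median, computed from the non-missing values.
medians = cleaned[zero_as_missing].median()
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(medians)
print(cleaned)
```

Columns such as `Pregnancies` are left untouched because zero is a plausible value there.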
        


## Detecting and Removing Duplicates

Duplicate rows give some observations extra weight and can bias summary statistics. This step counts fully identical rows and removes them, keeping the first occurrence of each.
        

In [None]:

# Check for duplicates
duplicate_count = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Remove duplicates
data = data.drop_duplicates()

print(f"Number of rows after removing duplicates: {len(data)}")
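
Before dropping duplicates it can help to look at them. The following is a minimal sketch on a made-up sample (the `sample` frame and its values are illustrative only): `duplicated(keep=False)` flags every member of a duplicate group, and `reset_index(drop=True)` renumbers the rows after removal.

```python
import pandas as pd

# Synthetic sample: rows 0 and 2 are identical.
sample = pd.DataFrame({
    "Glucose": [148, 85, 148, 89],
    "BMI": [33.6, 26.6, 33.6, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# keep=False marks every row in a duplicate group, not just the repeats.
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default.
deduped = sample.drop_duplicates().reset_index(drop=True)
print(f"{len(sample)} rows before, {len(deduped)} rows after")
```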
        


## Descriptive Statistics

Calculate summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numerical variable.
        

In [None]:

# Display summary statistics
print(data.describe())
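
Overall statistics can hide differences between the two outcome classes. As a sketch on a made-up sample (values are illustrative, not taken from the dataset), transposing `describe()` puts one variable per row, and grouping by `Outcome` compares medians per class:

```python
import pandas as pd

# Synthetic sample standing in for the real data.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# One row per variable is often easier to read than one column per variable.
summary = sample.describe().T
print(summary)

# Median of each feature within each outcome class.
by_outcome = sample.groupby("Outcome")[["Glucose", "BMI"]].median()
print(by_outcome)
```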
        


## Visualizing Distributions

Use a histogram with a kernel density estimate (KDE) overlay to visualize the distribution of glucose levels. The same call works for any other numeric column.
        

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Plot distribution of glucose levels
sns.histplot(data['Glucose'], kde=True)
plt.title("Distribution of Glucose Levels")
plt.xlabel("Glucose")
plt.ylabel("Count")
plt.show()
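
To look at several variables at once, one option is pandas' built-in `DataFrame.hist`, which draws one histogram per column in a grid. This is a sketch on randomly generated data (the distributions and the `sample` frame are assumptions for illustration, not properties of the real dataset):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data so the example runs without the CSV file.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BloodPressure": rng.normal(70, 12, 200),
    "BMI": rng.normal(32, 7, 200),
    "Age": rng.integers(21, 80, 200),
})

# One histogram per column, arranged in a 2x2 grid; each subplot is titled
# with its column name.
axes = sample.hist(bins=20, figsize=(8, 6), layout=(2, 2))
plt.tight_layout()
plt.show()
```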
        


## Correlation Analysis

Compute pairwise Pearson correlations between the numerical variables and display them as an annotated heatmap.
        

In [None]:

# Compute the Pearson correlation matrix (numeric columns only)
corr_matrix = data.corr(numeric_only=True)

# Plot correlation heatmap
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
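
A full heatmap can be hard to scan when the main question is which features relate to the outcome. As a sketch on a made-up sample (values are illustrative; `numeric_only` assumes pandas 1.5 or later), the `Outcome` column of the correlation matrix can be extracted and sorted:

```python
import pandas as pd

# Synthetic sample standing in for the real data.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

corr_matrix = sample.corr(numeric_only=True)

# Correlation of each feature with Outcome, strongest positive first.
target_corr = (
    corr_matrix["Outcome"]
    .drop("Outcome")  # the self-correlation of 1.0 is not informative
    .sort_values(ascending=False)
)
print(target_corr)
```

Correlation measures linear association only, so a low value here does not rule out a nonlinear relationship with the outcome.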
        