# Ethical Data Analytics with Diabetes Prediction Dataset

This notebook demonstrates ethical principles and best practices in data analytics using a medical dataset (Diabetes Prediction). It covers:

- **Task 1**: Data requirements, bias identification.
- **Task 2**: Data collection, processing, cleansing.
- **Task 3**: Bias detection, mitigation strategies.
- **Task 4**: Best practices for data management and collaboration.



## Task 1: Define Data Requirements and Identify Bias

**Data Requirements:**
- Features: Age, Gender, BMI, Blood Pressure, Glucose Level, Insulin, Diabetes Pedigree Function.
- Target: Diabetes outcome (binary classification).

**Potential Biases:**
- Gender imbalance.
- Age distribution skew.
- Missing values in critical features.

**Impact:** A model trained on skewed or incomplete data tends to be less accurate for under-represented groups, so unfair or incorrect predictions fall disproportionately on those groups.
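
The checks listed above can be sketched in a few lines. This is a minimal sketch, assuming scikit-learn's `load_diabetes` as a stand-in dataset. Its `sex` and `age` columns are mean-centered and scaled, so they are treated here as opaque group codes and relative values. The variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): sklearn's diabetes data, loaded as a DataFrame.
X = load_diabetes(as_frame=True).data

# Gender imbalance: share of samples in each of the two encoded `sex` groups.
sex_share = X["sex"].value_counts(normalize=True)
print("Share per sex group:\n", sex_share)

# Age distribution skew: 0 means symmetric, larger magnitude means more skew.
age_skew = X["age"].skew()
print("Age skewness:", round(age_skew, 3))

# Missing values in each feature.
missing_per_column = X.isnull().sum()
print("Missing values per column:\n", missing_per_column)
```

A markedly uneven group share, or a strongly skewed age distribution, is a signal to examine model performance per group later rather than proof of unfairness on its own.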


## Task 2: Data Collection, Processing, and Cleansing

In [None]:

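# Imports used in this cell
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load diabetes dataset (from sklearn for demonstration).
# Note: its features are age, sex, bmi, bp and six serum measurements (s1-s6),
# and its target is a continuous disease-progression score, not a binary outcome.
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)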

print("Dataset shape:", X.shape)
print("First 5 rows:\n", X.head())

# Check for missing values
print("Missing values per column:\n", X.isnull().sum())

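# Train-test split (done first so the test set does not influence the scaler)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalize data: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)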
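
The sklearn dataset has no missing values, so the cleansing step is not exercised by the cell above. The following is a minimal sketch of how imputation and scaling could be combined, assuming median imputation is acceptable for the feature in question. The gaps in `bmi` are introduced artificially for illustration, and the names `X_gappy` and `cleaner` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Artificially blank out about 5% of BMI values (illustration only;
# the real data has no gaps).
rng = np.random.default_rng(42)
X_gappy = X.copy()
X_gappy.loc[rng.random(len(X_gappy)) < 0.05, "bmi"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_gappy, y, test_size=0.3, random_state=42
)

# Impute with the training median, then scale.
# Both steps are fitted on the training data only.
cleaner = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_clean = cleaner.fit_transform(X_train)
X_test_clean = cleaner.transform(X_test)

print("Missing before:", int(X_train.isnull().sum().sum()))
print("Missing after:", int(np.isnan(X_train_clean).sum()))
```

Wrapping both steps in a `Pipeline` keeps the statistics learned from the training set separate from the test set, and it records the preprocessing order in one object.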


## Task 3: Bias Detection and Mitigation

In [None]:

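# Imports for plotting, resampling, modelling, and evaluation
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Bias detection: Check distribution of target variable
sns.histplot(y, bins=20, kde=False)
plt.title("Distribution of Diabetes Progression")
plt.show()

# The sklearn target is a continuous score, but SMOTE and LogisticRegression need
# class labels. For demonstration only, scores above the training 75th percentile
# are labelled positive (an arbitrary threshold that creates an imbalanced outcome).
threshold = y_train.quantile(0.75)
y_train_cls = (y_train > threshold).astype(int)
y_test_cls = (y_test > threshold).astype(int)
print("Class counts before resampling:\n", y_train_cls.value_counts())

# Apply SMOTE for balancing (training data only, so the test set stays untouched)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train_cls)
print("Resampled dataset shape:", X_resampled.shape)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_resampled, y_resampled)

# Predictions
y_pred = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test_cls, y_pred))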
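
An overall classification report can hide differences between groups. The following is a minimal sketch of a per-group check, under two stated assumptions: the continuous target is binarized at the training 75th percentile (an arbitrary choice), and `class_weight="balanced"` is swapped in for SMOTE so the example depends on scikit-learn only. The name `recall_by_group` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumption: scores above the training 75th percentile are the positive class.
threshold = y_train.quantile(0.75)
y_train_cls = (y_train > threshold).astype(int)
y_test_cls = (y_test > threshold).astype(int)

# class_weight="balanced" reweights the classes instead of resampling them.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train_cls)
y_pred = pd.Series(model.predict(X_test), index=X_test.index)

# Recall per encoded `sex` group: a large gap suggests the model misses
# positive cases more often in one group than in the other.
recall_by_group = {}
for group_value, rows in X_test.groupby("sex").groups.items():
    recall_by_group[group_value] = recall_score(
        y_test_cls.loc[rows], y_pred.loc[rows], zero_division=0
    )
print("Recall by sex group:", recall_by_group)
```

The same loop works for other metrics (precision, false positive rate) and for other grouping columns such as binned age.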


## Task 4: Best Practices for Data Management and Collaboration

- Ensure GDPR compliance, and anonymize or pseudonymize personal data before analysis.
- Maintain version control for datasets and models.
- Document all preprocessing steps for transparency.
- Address barriers to collaboration:
  - Lack of data literacy → Provide training.
  - Unclear goals → Define objectives collaboratively.
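
Two of the practices above, protecting identifiers and tracking dataset versions, can be sketched with the standard library. This is a minimal sketch rather than a compliance recipe: the functions `pseudonymize` and `dataset_fingerprint`, the key, and the records are all hypothetical. Keyed hashing is pseudonymization rather than full anonymization, so the output should still be handled as personal data.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest that replaces the raw identifier."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def dataset_fingerprint(raw_bytes: bytes) -> str:
    """Return a SHA-256 checksum that identifies one exact version of a dataset."""
    return hashlib.sha256(raw_bytes).hexdigest()


# Hypothetical key and records, for illustration only.
# A real key should be stored outside the notebook.
secret_key = b"replace-with-a-secret-key"
records = [("patient-001", 54), ("patient-002", 61)]
pseudonymized = [(pseudonymize(pid, secret_key), age) for pid, age in records]

# Serialize the pseudonymized rows and record a checksum for this version.
csv_bytes = "\n".join(f"{pid},{age}" for pid, age in pseudonymized).encode("utf-8")
print("Dataset version:", dataset_fingerprint(csv_bytes)[:12])
```

Recording the checksum alongside each trained model makes it possible to tell later which exact version of the data a result came from.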
