
**PROJECT 1 : WATER QUALITY ANALYSIS SOLUTION**

20th July 2024




# Data Description and Task Overview

## Data Description

This dataset contains measurements of various water quality parameters collected from different water sources over time. Below is a detailed description of each column in the dataset:

1. **Index**: Unique identifier for each data entry.
2. **pH**: The pH level of the water, indicating its acidity or alkalinity.
3. **Iron**: Concentration of iron in the water (in mg/L).
4. **Nitrate**: Concentration of nitrate in the water (in mg/L).
5. **Chloride**: Concentration of chloride in the water (in mg/L).
6. **Lead**: Concentration of lead in the water (in mg/L).
7. **Zinc**: Concentration of zinc in the water (in mg/L).
8. **Color**: The color of the water sample (e.g., Colorless, Faint Yellow).
9. **Turbidity**: The cloudiness or haziness of the water, measured in Nephelometric Turbidity Units (NTU).
10. **Fluoride**: Concentration of fluoride in the water (in mg/L).
11. **Copper**: Concentration of copper in the water (in mg/L).
12. **Odor**: Descriptive term for the odor of the water (e.g., odorless, faint odor).
13. **Sulfate**: Concentration of sulfate in the water (in mg/L).
14. **Conductivity**: The ability of the water to conduct electricity, measured in microsiemens per centimeter (µS/cm).
15. **Chlorine**: Concentration of chlorine in the water (in mg/L).
16. **Manganese**: Concentration of manganese in the water (in mg/L).
17. **Total Dissolved Solids (TDS)**: Total concentration of dissolved substances in the water (in mg/L).
18. **Source**: Source of the water sample (e.g., Lake, River, Ground).
19. **Water Temperature**: Temperature of the water at the time of measurement (in °C).
20. **Air Temperature**: Temperature of the air at the time of measurement (in °C).
21. **Month**: Month when the sample was taken.
22. **Day**: Day of the month when the sample was taken.
23. **Time of Day**: Time of day when the sample was taken (e.g., 0 for midnight, 12 for noon).
24. **Target**: Binary target variable indicating whether the water quality meets a certain standard (0 for meets standard, 1 for does not meet standard).


## Task for Participants

Participants are tasked with analyzing the water quality data to determine patterns, correlations, and insights that can help in understanding and managing water quality. Specific tasks include:

1. **Data Cleaning and Preprocessing**:
    - Handle missing or inconsistent data.
    - Convert categorical data into numerical format if necessary.
    - Normalize or standardize numerical features.

2. **Exploratory Data Analysis (EDA)**:
    - Generate summary statistics for each feature.
    - Visualize data distributions and relationships between variables using plots (e.g., histograms, scatter plots, box plots).
    - Identify any trends or anomalies in the data.

3. **Feature Engineering**:
    - Create new features that might be useful for prediction, such as interaction terms or aggregates.
    - Evaluate the importance of different features for predicting water quality.

4. **Model Building**:
    - Split the data into training and testing sets.
    - Train machine learning models to predict the target variable (water quality standard).
    - Evaluate model performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score).

5. **Model Interpretation and Insights**:
    - Interpret the model to understand which factors most influence water quality.
    - Provide actionable insights based on model findings.
    - Discuss potential interventions or policy recommendations to improve water quality.

# **Data Access**

Download the subset dataset `water_quality_dataset_100k_new.csv` from Google Drive:

https://drive.google.com/file/d/1NPnyvMMa4EgRMpCWd1LOOVGeOOmtcC6f/view?usp=drive_link


Original dataset: https://www.kaggle.com/datasets/mitanshuchakrawarty/water-quality-prediction/data?select=dataset.csv

## **Import Necessary Libraries**

In [None]:
# ! pip install seaborn

In [None]:
# ! pip install tqdm

In [None]:
pip install xgboost

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report,  ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from tqdm import tqdm
from xgboost import XGBClassifier


## **Read Data**
![](image/data_snapshot.png)

In [None]:
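# write your code
data = pd.read_csv('water_quality_dataset_100k_new.csv')
data = data.drop(columns="Index")  # 'Index' is only a row identifier, not a feature
df = data.copy()  # work on a copy so the raw data stays untouched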

df.head()

In [None]:
df.info()

In [None]:
print("Summary Statistics:")
df.describe()

In [None]:
df['Target'].value_counts()

# Visualisation

## Pie chart

In [None]:
target_count=df['Target'].value_counts()
plt.pie(target_count, labels=target_count.index, autopct='%1.1f%%')
plt.show()

In [None]:
# Separate numerical and categorical features
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
numerical_features.remove("Target")
numerical_features


In [None]:
categorical_features = df.select_dtypes(include=[object]).columns.tolist()

categorical_features

## Histogram

In [None]:
# Write your code
# Plot histograms for numerical features
df[numerical_features].hist(bins=50, figsize=(20, 15))
plt.show()

In [None]:

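# Initial visualization of distributions and relationships
# The pairplot is left commented out; uncomment the lines below to run it
# (it can be slow with this many rows and features).
# print("\nPairplot of Numerical Features and Target:")
# sns.pairplot(df[numerical_features + ['Target']], diag_kind='kde', hue='Target')
# plt.show()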


## Box Plot

In [None]:
print("\nBoxplots of Numerical Features:")
plt.figure(figsize=(15, 10))
sns.boxplot(data=df[numerical_features], orient='h')
plt.show()

In [None]:
# Plot box plots for numerical features
plt.figure(figsize=(20, 15))
for i, column in enumerate(df[numerical_features], 1):
    plt.subplot(6, 6, i)
    sns.boxplot(data=df[numerical_features], x=column)
    plt.title(column)

plt.suptitle('Box Plots of Numerical Features', fontsize=20)
plt.tight_layout()
plt.show()

## Categorical Plots

In [None]:
print("\nCount plots of Categorical Features:")
for feature in categorical_features:
    plt.figure(figsize=(10, 4))
    sns.countplot(data=df, x=feature, hue='Target')
    plt.title(f'Count Plot of {feature} with Target')
    plt.show()

## Scatter

In [None]:
# Plot scatter plots for pairs of numerical features
plt.figure(figsize=(20, 15))

# Example pairs of numerical features
pairs = [('pH', 'Iron'), ('Nitrate', 'Chloride'), ('Lead', 'Zinc'), ('Fluoride', 'Copper'), 
         ('Total Dissolved Solids', 'Water Temperature'), ('Water Temperature', 'Air Temperature'),
         ]

for i, (x_feature, y_feature) in enumerate(pairs, 1):
    plt.subplot(3, 3, i)
    sns.scatterplot(x=x_feature, y=y_feature, data=df, alpha=0.5, hue='Target')

    plt.xlabel(x_feature)
    plt.ylabel(y_feature)
    plt.title(f'{x_feature} vs {y_feature}')
    

plt.suptitle('Scatter Plots of Selected Feature Pairs', fontsize=16)
plt.tight_layout()

plt.show()


In [None]:
print("\nCorrelation Matrix of Numerical Features:")
plt.figure(figsize=(20, 8))
corr_matrix = df[numerical_features + ["Target"]].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

# **Task 1: Data Cleaning and Preprocessing**


## Step 1: Handle Missing or Inconsistent Data

1. **Identify Missing Values:**
   - Detect missing values in the dataset.


In [None]:
df.isna().sum()

In [None]:
df.shape

Percentage of missing values per row:

In [None]:
missing=df.isna().sum(axis=1)/df.shape[1] *100

In [None]:
missing.value_counts()

In [None]:
# Preview the rows that have no missing values (df itself is not modified)
df.dropna()

In [None]:
# Detect missing values
missing_values = df.isnull().sum()

# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Display missing values count
missing_values = missing_values[missing_values > 0]
print("Missing Values Count:\n", missing_values)


2. **Impute or Remove Missing Values:**
   - For numerical features, you can impute missing values using mean, median, or mode.
   - For categorical features, you can impute missing values using the most frequent category or a placeholder such as 'Unknown'.
   - Alternatively, remove rows with missing values if they are few and do not significantly impact the dataset.


Based on the percentage of missing values, we can decide on the following strategies:

- If a column has a very high percentage of missing values, it might be best to drop it.
- For numerical columns with moderate missing values, we can fill them with the mean or median.
- For categorical columns with missing values, we can fill them with the mode.
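
The strategy above can be sketched as a small helper. This is only an illustration on a toy frame: the function name `impute_by_missing_rate`, the 50% drop threshold, and the `Mostly_Missing` column are assumptions made for the example, not part of the project brief.

```python
import numpy as np
import pandas as pd


def impute_by_missing_rate(frame, drop_threshold=0.5):
    """Drop very sparse columns, then fill the rest (median or mode)."""
    out = frame.copy()
    missing_rate = out.isna().mean()  # fraction of missing values per column

    # 1. Drop columns whose missing fraction exceeds the (assumed) threshold
    out = out.drop(columns=missing_rate[missing_rate > drop_threshold].index)

    # 2. Fill the remaining columns according to their dtype
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Toy data: one numerical, one categorical, and one mostly empty column
toy = pd.DataFrame({
    "pH": [7.0, np.nan, 8.0, 7.5],
    "Color": ["Colorless", None, "Colorless", "Faint Yellow"],
    "Mostly_Missing": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = impute_by_missing_rate(toy)
print(cleaned)
```

The median is used here instead of the mean because it is less sensitive to outliers; either choice fits the strategy described above.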

In [None]:
df.isna().sum().sort_values()

In [None]:
df.dropna()

In [None]:
# Write your code for step 1.2

In [None]:
# Impute missing values for numerical features with the mean
# 'Target' is excluded: a mean-filled label would no longer be binary
numerical_features = df.select_dtypes(include=['float64', 'int64']).columns.drop('Target')
df[numerical_features] = df[numerical_features].fillna(df[numerical_features].mean())

# Verify if missing values are imputed
print(df[numerical_features].isnull().sum())

In [None]:
categorical_features

In [None]:
df[categorical_features] = df[categorical_features].fillna('Unknown')

In [None]:
# Impute missing values for categorical features with the most frequent category
# categorical_features = df.select_dtypes(include=['object']).columns
# df[categorical_features] = df[categorical_features].apply(lambda x: x.fillna(x.mode()[0]))

# Verify if missing values are imputed
print(df[categorical_features].isnull().sum())

In [None]:
df.isnull().sum()

In [None]:
# Detect missing values
missing_values = df.isnull().sum()

# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Display missing values count
missing_values = missing_values[missing_values > 0]
print("Missing Values Count:\n", missing_values)

## Step 2: Convert Categorical Data into Numerical Format

1. **Identify Categorical Features:**
   - Features like `Color`, `Odor`, and `Source` are categorical.



2. **Encode Categorical Features:**
   - Use one-hot encoding for nominal categorical features (e.g., `Color`, `Odor`, `Source`).
   - Use label encoding for ordinal categorical features, if any.
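
As a rough illustration of the two encodings, the sketch below uses a toy frame. The `Turbidity_Level` column and its low/medium/high ordering are hypothetical and do not exist in the dataset; they only stand in for an ordinal feature.

```python
import pandas as pd

toy = pd.DataFrame({
    "Source": ["Lake", "River", "Ground", "Lake"],
    "Turbidity_Level": ["low", "high", "medium", "low"],  # hypothetical ordinal column
})

# Nominal feature: one-hot encode, one indicator column per category
source_onehot = pd.get_dummies(toy["Source"], prefix="Source", dtype=int)

# Ordinal feature: map categories to integer codes that respect an order.
# The ordering below is an assumption made for this example.
level_order = ["low", "medium", "high"]
level_codes = pd.Categorical(
    toy["Turbidity_Level"], categories=level_order, ordered=True
).codes

encoded = source_onehot.assign(Turbidity_Level=level_codes)
print(encoded)
```

One-hot encoding avoids implying an order between categories such as Lake and River, while integer codes are only appropriate when the order is meaningful.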

In [None]:
# Identify numerical and categorical features dynamically
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Exclude the target and non-feature columns if present
non_feature_cols = ['Target', 'Month', 'Day', 'Time of Day']
numerical_features = [col for col in numerical_features if col not in non_feature_cols]
categorical_features = [col for col in categorical_features if col not in non_feature_cols]

# Visualize initial dataframe
print("Initial DataFrame:")
df

In [None]:
df[numerical_features].hist(figsize=(15, 12), bins=15)
plt.suptitle('Histograms of Numerical Features Before Transformation', fontsize=16)
plt.show()

## Step 3: Normalize or Standardize Numerical Features

1. **Identify Numerical Features:**
   - Features like `pH`, `Iron`, `Nitrate`, `Chloride`, etc., are numerical.

2. **Normalize or Standardize:**
   - Normalize the features using Min-Max scaling or Standardize the features to have a mean of 0 and a standard deviation of 1.
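
To make the difference between the two options concrete, here is a minimal sketch that applies both formulas by hand to a toy column. The values are made up for illustration.

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0], name="Conductivity")

# Min-Max scaling: (x - min) / (max - min), so the result lies in [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: (x - mean) / std, so the result has mean 0 and std 1.
# ddof=0 is the population standard deviation, which StandardScaler also uses.
standardized = (values - values.mean()) / values.std(ddof=0)

print(min_max.tolist())
print(standardized.round(3).tolist())
```

Min-Max scaling keeps the shape of the distribution but is sensitive to extreme values, since a single outlier sets the range.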

In [None]:
# Step 3: Normalize or Standardize Numerical Features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', MinMaxScaler())  # Normalize using Min-Max scaling
])

In [None]:
# Step 2: Convert Categorical Data into Numerical Format
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with most frequent
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # One-hot encode categorical features
])


In [None]:

# Combine transformers into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [None]:
# Apply the transformations to the dataframe
df_preprocessed = preprocessor.fit_transform(df)

# Debug: Check the shape of the transformed data
print(f"Shape of transformed data: {df_preprocessed.shape}")


In [None]:
type(df_preprocessed)

In [None]:
print("Transformed data:")
print(df_preprocessed)

In [None]:
# Get feature names after preprocessing
num_features = numerical_features
cat_features = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
all_features = num_features + list(cat_features)

# Debug: Check the feature names
print(f"Number of feature names: {len(all_features)}")
print(f"Feature names: {all_features}")

In [None]:
# Convert to DataFrame
df_preprocessed = pd.DataFrame(df_preprocessed, columns=all_features)

# Add back non-transformed columns (if any)
df_preprocessed[non_feature_cols] = df[non_feature_cols].reset_index(drop=True)

df_preprocessed

In [None]:
# Show the preprocessed dataframe
print("\nPreprocessed DataFrame:")
df_preprocessed.head()

## Explanation:
1. SimpleImputer is used to handle missing values.
2. OneHotEncoder is used for encoding categorical variables.
3. MinMaxScaler is used to normalize numerical features.
4. ColumnTransformer is used to apply different preprocessing steps to different columns.

Make sure the path passed to `pd.read_csv` points to your copy of `water_quality_dataset_100k_new.csv`. The preprocessed data is saved to `preprocessed.csv` at the end of Task 2 and is then ready for further analysis or modeling.


# **Task 2. Exploratory Data Analysis (EDA)**


## Step 1: Generate Summary Statistics for Each Feature

1. **Summary Statistics:**
   - Use descriptive statistics to summarize the central tendency, dispersion, and shape of the dataset’s distribution for each feature.

In [None]:
# Write your code
df.describe()

## Step 2: Visualize Data Distributions and Relationships Between Variables

1. **Histograms:**
   - Visualize the distribution of numerical features.

2. **Box Plots:**
   - Identify outliers and understand the distribution of numerical features.

3. **Scatter Plots:**
   - Examine relationships between pairs of numerical features.

4. **Line Plots:**
   - Examine trends in numerical features.

5. **Correlation Matrix and Heatmap:**
   - Show the correlation between numerical features.

6. **Count Plots:**
   - Visualize the frequency of categorical features.

In [None]:
# Select only numerical columns for correlation matrix
numerical_columns_for_corr = df_preprocessed.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Visualize distributions of numerical features after transformation
df_preprocessed[numerical_columns_for_corr].hist(figsize=(20, 15), bins=15,)
plt.suptitle('Histograms of Numerical Features After Transformation', fontsize=16)
plt.show()

In [None]:
print("\nBoxplots of Numerical Features:")
plt.figure(figsize=(15, 10))
sns.boxplot(data=df_preprocessed.drop(['Time of Day', 'Day'], axis=1), orient='h')
plt.show()


In [None]:
plt.figure(figsize=(14, 12))
df_preprocessed.drop(['Time of Day', 'Day'], axis=1).boxplot()
plt.title('Box Plots of Numerical Features', fontsize=16)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Visualize correlations in the preprocessed data
plt.figure(figsize=(14, 12))
correlation_matrix = df_preprocessed[numerical_columns_for_corr].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Preprocessed Data', fontsize=16)
plt.show()


In [None]:
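threshold = 0.1
# Compare absolute values so that strong negative correlations are kept too,
# and mask the self-correlations (exactly 1.0) on the diagonal
strong = (correlation_matrix.abs() > threshold) & (correlation_matrix != 1.0)
significant_correlations = correlation_matrix[strong].dropna(how='all').dropna(axis=1, how='all')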

# Show significant correlations
print("\nSignificant Correlations:")
significant_correlations

In [None]:

# Visualize significant correlations in the numerical data including the target column
plt.figure(figsize=(14, 12))
sns.heatmap(significant_correlations, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Significant Correlations in Numerical Data Including Target', fontsize=16)
plt.show()

## Step 3: Identify Trends or Anomalies in the Data

1. **Trend Analysis:**
   - Look for patterns or trends over time or other categorical dimensions.

2. **Anomaly Detection:**
   - Identify any anomalies or unusual observations in the dataset.
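
One possible way to approach both steps is sketched below on a toy frame: a per-month average for the trend, and the 1.5 × IQR rule for anomalies. The toy values and the use of month names are assumptions for illustration; the rule itself is the same one box plots use to draw their whiskers.

```python
import pandas as pd

toy = pd.DataFrame({
    "Month": ["January", "January", "July", "July", "July"],
    "Water Temperature": [5.0, 7.0, 21.0, 23.0, 22.0],
    "Iron": [0.1, 0.2, 0.1, 0.15, 9.0],
})

# Trend: average water temperature per month
monthly_mean = toy.groupby("Month")["Water Temperature"].mean()

# Anomalies: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = toy["Iron"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["Iron"] < q1 - 1.5 * iqr) | (toy["Iron"] > q3 + 1.5 * iqr)

print(monthly_mean)
print(toy[is_outlier])
```

On the real data, the same pattern can be applied to `df` with any numerical column in place of `Iron`.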

In [None]:
# write your code

## Explanation:
1. Descriptive Statistics: Use describe() method to generate summary statistics for each feature.
2. Histograms: Use hist() method to visualize the distribution of numerical features.
3. Box Plots: Use boxplot() function to visualize outliers and understand the distribution of numerical features.
4. Scatter Plots: Use scatterplot() (or pairplot() for all pairs at once) from Seaborn to examine relationships between pairs of numerical features.
5. Correlation Matrix and Heatmap: Use corr() method and heatmap() function from Seaborn to visualize the correlation between numerical features.
6. Count Plots: Use countplot() function from Seaborn to visualize the frequency of categorical features.
7. Trend Analysis: Group data by month and plot average water temperature over time to identify trends.
8. Anomaly Detection: Use box plots to identify outliers in numerical features.
9. Together, these visualizations and analyses help you understand the distributions, relationships, and potential anomalies in your data.

In [None]:
df_preprocessed.to_csv("preprocessed.csv", index = False)


# **Task 3. Feature Engineering** (OPTIONAL - Increase the quality of the dataset)
    


## Step 1: Create New Features

1. **Interaction Terms:**
   - Create new features by multiplying or combining existing features to capture interactions between them.

2. **Aggregate Features:**
   - Create aggregate features such as mean, median, sum, or count of other features.
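
A minimal sketch of both kinds of feature, on a toy frame with made-up values. The new column names (`Nitrate_x_Chloride`, `Metals_Mean`, `Metals_Max`) are illustrative choices, not names required by the task.

```python
import pandas as pd

toy = pd.DataFrame({
    "Nitrate": [2.0, 4.0],
    "Chloride": [10.0, 20.0],
    "Iron": [0.1, 0.3],
    "Lead": [0.0, 0.1],
    "Zinc": [0.5, 0.6],
})

# Interaction term: product of two existing features
toy["Nitrate_x_Chloride"] = toy["Nitrate"] * toy["Chloride"]

# Aggregate features: row-wise mean and maximum across the metal columns
metals = ["Iron", "Lead", "Zinc"]
toy["Metals_Mean"] = toy[metals].mean(axis=1)
toy["Metals_Max"] = toy[metals].max(axis=1)

print(toy)
```

Whether such features help is an empirical question; comparing model scores with and without them is the usual check.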


In [None]:
df = pd.read_csv("preprocessed.csv")  # file written at the end of Task 2

In [None]:
df

In [None]:
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
numerical_features.remove("Target")
numerical_features

In [None]:
# write your code

# Step 3: Feature Engineering
# ==============================

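# Create new features
# Ratio of water to air temperature; the +1 keeps the denominator away from zero
df['Water_Temp_to_Air_Temp_Ratio'] = df['Water Temperature'] / (df['Air Temperature'] + 1)
# Sum of the (scaled) metal concentrations as a single aggregate feature
df['Total_Metals'] = df[['Iron', 'Lead', 'Zinc', 'Copper', 'Manganese']].sum(axis=1)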

numerical_features.extend(['Water_Temp_to_Air_Temp_Ratio', 'Total_Metals'])


In [None]:
numerical_features

In [None]:
categorical_features

In [None]:
# numerical_features.remove('Day')
# numerical_features.remove('Time of Day')
# print(numerical_features)
# categorical_features.remove('Month')
# categorical_features

In [None]:

# Apply transformations
X = df.drop('Target', axis=1)
y = df['Target']

# X_transformed = preprocessor.fit_transform(X)

# Create a DataFrame from the transformed data
# X_transformed_df = pd.DataFrame(X, columns=numerical_features + list(preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)))


X_transformed_df=X[numerical_features].copy()
print("\nTransformed Features Head:")
X_transformed_df.head()

In [None]:
X_transformed_df.columns

In [None]:
X_transformed_df.shape


# **Task 4. Model Building**
## Step 1: Split the Data into Training and Testing Sets
   - Use a function to split the dataset into training and testing sets.
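
When the two classes are not equally frequent, a stratified split keeps the class proportions the same in both sets. The sketch below shows this with scikit-learn's `train_test_split` on synthetic data; the array names and the 70/30 class ratio are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 70 + [1] * 30)  # imbalanced labels: 70% / 30%

# stratify=y_demo preserves the 70/30 ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Fixing `random_state` makes the split reproducible, which matters when comparing several models on the same test set.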


In [None]:
# write your code
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed_df, y, test_size=0.2, random_state=42)


## Step 2: Train Machine Learning Models to Predict the Target Variable

1. **Train Models:**
   - Train different machine learning models such as Logistic Regression, Random Forest, and Support Vector Machine.

In [None]:
# write your code
model_rf=RandomForestClassifier(n_estimators=10, random_state=42)
model_rf.fit(X_train, y_train)


## Step 3: Evaluate Model Performance Using Appropriate Metrics

1. **Evaluate Models:**
   - Evaluate the performance of the models using metrics like accuracy, precision, recall, and F1-score.
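
To show what these metrics measure, the sketch below computes them by hand from the four confusion-matrix counts on a small made-up example. The label arrays are illustrative, and class 1 is treated as the positive class.

```python
import numpy as np

y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Confusion-matrix counts
tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # true positives
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # false positives
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # false negatives
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # true negatives

accuracy_demo = (tp + tn) / len(y_true_demo)   # share of correct predictions
precision_demo = tp / (tp + fp)                # predicted positives that are correct
recall_demo = tp / (tp + fn)                   # actual positives that are found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(accuracy_demo, precision_demo, recall_demo, f1_demo)
```

In this dataset class 1 means the water does not meet the standard, so recall on class 1 tells you how many unsafe samples the model catches.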

In [None]:
# write your code

y_pred = model_rf.predict(X_test)
    
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

results= {
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1': f1}

results

In [None]:
# create function to get results to avoid repetition
def get_results(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    results= {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1}
    return results

In [None]:
get_results(y_test, y_pred)

### Compare different models

In [None]:
# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=10, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42)
}

# Train and evaluate each model with progress tracking using tqdm
results = {}
for name, model in tqdm(models.items(), desc="Training and Evaluating Models"):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = get_results(y_test, y_pred)

In [None]:
# Display results
results_df = pd.DataFrame(results).T
print("\nModel Performance Comparison:")
results_df


In [None]:

# Plot results for visual comparison
results_df.plot(kind='bar', figsize=(12, 8))
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.xlabel('Model')
plt.grid()
plt.show()


In [None]:

# Choose the best model based on F1-score
best_model_name = results_df['f1'].idxmax()
best_model = models[best_model_name]
print(f"\nBest Model: {best_model_name}")

In [None]:

# Generate and plot the confusion matrix
# Note: the true labels are read from the global `y_test` defined in the split cell
def plot_confusion_matrix(model, X_test, model_name):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap=plt.cm.Blues)
    plt.title(f"Confusion Matrix for {model_name} Model")
    plt.show()

In [None]:
plot_confusion_matrix (model_rf, X_test, 'Random Forest')

# **Task 5. Model Interpretation and Insights**


## Step 1: Interpret the Model to Understand Which Factors Most Influence Water Quality
1. **Feature Importance:**
   - Use the feature importance scores from tree-based models like Random Forest to identify which factors most influence water quality.
   - Use coefficients from Logistic Regression to understand the direction and magnitude of influence.
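
The two approaches can be compared on synthetic data where the answer is known in advance. In the sketch below, the column names `signal` and `noise` are made up: the label depends only on `signal`, so both the forest importances and the logistic coefficients should single it out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=500),  # drives the label
    "noise": rng.normal(size=500),   # unrelated to the label
})
label = (demo["signal"] > 0).astype(int)

# Tree-based importance: non-negative scores that sum to 1 (no direction)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(demo, label)
importances = pd.Series(forest.feature_importances_, index=demo.columns)

# Logistic regression: signed coefficients (direction and magnitude)
logit = LogisticRegression(max_iter=1000).fit(demo, label)
coefficients = pd.Series(logit.coef_[0], index=demo.columns)

print(importances.sort_values(ascending=False))
print(coefficients.sort_values(ascending=False))
```

Coefficient magnitudes are only comparable across features when the features share a scale, which is one reason to scale the data before fitting a linear model.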


In [None]:

# Step 5: Model Interpretation and Insights
# ==============================

# Interpret the best model to understand which factors most influence water quality
if hasattr(best_model, 'feature_importances_'):
    # Use the training columns as index so names and scores stay aligned
    feature_importance = pd.Series(best_model.feature_importances_, index=X_train.columns)
    feature_importance = feature_importance.sort_values(ascending=False)

    plt.figure(figsize=(10, 6))
    feature_importance.plot(kind='bar')
    plt.title(f'Feature Importance in Best Model - {best_model_name}')
    plt.show()

    important_features = feature_importance.head(10).index
else:
    # Models without feature_importances_ (e.g. Logistic Regression, KNN):
    # fall back to the full list of features used for training
    important_features = pd.Index(X_train.columns)

print(f'\nTop 10 important features: {list(important_features[:10])}')

In [None]:
# Plot scatter plots for pairs of numerical features
plt.figure(figsize=(20, 15))

# Example pairs of numerical features
pairs = [('pH', 'Turbidity'), ('Nitrate', 'Chloride'), ('Manganese', 'Copper'), ('Nitrate', 'Copper'), 

         ('Manganese', 'pH'), ('Turbidity', 'Chloride')
         ]

for i, (x_feature, y_feature) in enumerate(pairs, 1):
    plt.subplot(3, 3, i)
    sns.scatterplot(x=x_feature, y=y_feature, data=df, alpha=0.5, hue='Target')
    plt.xlabel(x_feature)
    plt.ylabel(y_feature)
    plt.title(f'{x_feature} vs {y_feature}')
    

plt.suptitle('Scatter Plots of Selected Feature Pairs', fontsize=16)
plt.tight_layout()
plt.show()


## Step 2: Provide Actionable Insights Based on Model Findings

1. **Identify Key Factors:**
   - Determine the most important factors influencing water quality based on the model interpretation.

2. **Actionable Insights:**
   - Provide specific recommendations for addressing key factors to improve water quality.

## Step 3: Discuss Potential Interventions or Policy Recommendations to Improve Water Quality

1. **Interventions:**
   - Suggest practical measures that can be taken to address the key factors influencing water quality.

2. **Policy Recommendations:**
   - Propose policy changes or initiatives to support the interventions and improve water quality on a broader scale.


In [None]:
# Provide actionable insights based on model findings
insights = f"""
Based on the best model, the most influential factors for water quality are:
{', '.join(important_features.tolist())}

Policy recommendations:
- Regular monitoring and regulation of `{', '.join(important_features.tolist())}` levels in water sources.
- Implementing measures to reduce industrial and agricultural runoff to control Nitrate and Sulfate levels.
- Public awareness campaigns on the importance of water quality and ways to reduce contamination.
"""
print(insights)

# Hyperparameter Tuning

In [None]:
# Define models and their hyperparameters for grid search
model_params = {
    'Logistic Regression': {
        'model': LogisticRegression(max_iter=1000),
        'params': {
            'C': [0.1, 1, 10],
            'solver': ['liblinear', 'lbfgs']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [10, 50, 100],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10]
        }
    },
    'Decision Tree': {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10]
        }
    }
}

# Train and evaluate each model with hyperparameter tuning
results = {}
for name, mp in tqdm(model_params.items(), desc="Training and Evaluating Models"):
    grid_search = GridSearchCV(mp['model'], mp['params'], cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)

    # Evaluate the refitted best estimator on the held-out test set
    tuned_model = grid_search.best_estimator_
    y_pred = tuned_model.predict(X_test)

    results[name] = get_results(y_test, y_pred)
    # Keep the winning parameters so the reporting cell can print them
    results[name]['best_params'] = grid_search.best_params_


In [None]:
for model_name, metrics in results.items():
    print(f"{model_name}:")
    best_params = metrics.get('best_params', 'N/A')
    print(f"  Best Parameters: {best_params}")
    for metric_name, metric_value in metrics.items():
        if metric_name != 'best_params':
            print(f"  {metric_name}: {metric_value:.4f}")


## XGBoost

In [None]:

xgb = XGBClassifier(random_state=42, eval_metric='logloss') 
xgb.fit(X_train.values, y_train)


In [None]:
# write your code

y_pred = xgb.predict(X_test.values)
    
get_results (y_test, y_pred)

In [None]:

# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(random_state=42, eval_metric='logloss')


# Define the parameter grid for XGBoost
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train.values, y_train)

# Get the best estimator
best_xgb = grid_search.best_estimator_

# Predict on the test set
y_pred = best_xgb.predict(X_test.values)
              

# Print the best parameters and the evaluation metrics
print("Best Parameters:", grid_search.best_params_)
# Store the parameters together with the test-set metrics
results = {"Best Parameters": grid_search.best_params_, **get_results(y_test, y_pred)}
results

In [None]:
# Refit XGBoost with one fixed configuration (all values come from the grid above)
xgb = XGBClassifier(random_state=42,
                    colsample_bytree=1.0,
                    learning_rate=0.1,
                    max_depth=7,
                    n_estimators=50,
                    subsample=0.8,
                    eval_metric='logloss')
xgb.fit(X_train.values, y_train)
y_pred = xgb.predict(X_test.values)
get_results(y_test, y_pred)

In [None]:
plot_confusion_matrix (model_rf, X_test, 'Random forest')

In [None]:

# Generate and plot the confusion matrix
plot_confusion_matrix (xgb, X_test.values, 'XGBOOST')

