<a href="https://www.kaggle.com/code/amirbaniasadi/water-quality-analysis?scriptVersionId=143000174" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In this project, I run some EDA and predictive modelling to determine whether a water body is potable.

In [None]:
# !pip install optuna
# !pip install scikit-optimize

In [None]:
import pandas as pd

#libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns

#libraries for imputation, modelling, hyperparameter tuning and statistics
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV, RandomizedSearchCV
import optuna
from skopt import BayesSearchCV
from hyperopt import fmin, tpe, hp
from xgboost import XGBClassifier
from scipy import stats

In [None]:
data = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')

## **Initial Data Exploration**

In [None]:
sum(data.duplicated())

There are no duplicate rows in this dataset.

In [None]:
data.head()

We have seen several NaN values and different scales of the features.

In [None]:
print(data.info())

All the features are floats except the target, Potability, which is categorical with values 1 and 0. Some features contain null values: ph, Sulfate, and Trihalomethanes.

In [None]:
print(data.describe())

**Here are some insights from the provided data exploration:**

pH: The pH values range from 0 to 14, with an average of approximately 7.08. Most values fall within the pH range recommended by WHO (6.5 to 8.5).

Hardness: The water's hardness varies with an average of around 196.37. Hardness is mainly due to calcium and magnesium salts, and this dataset contains values within a typical range.

Solids (TDS): Total dissolved solids (TDS) values range from around 320 to 61,227, with an average of roughly 22,014. Higher TDS values suggest more mineralized water.

Chloramines: Chloramine levels vary, averaging about 7.12. Concentrations of up to 4 mg/L are considered safe in drinking water, so if the values are in mg/L the dataset average is above that level.

Sulfate: Sulfate concentrations range from 129 to 481.03 mg/L. Sulfate levels are generally lower than seawater, which has a concentration of about 2,700 mg/L.

Conductivity: Electrical conductivity (EC) values range from 181.48 to 753.34 μS/cm. WHO recommends an EC value below 400 μS/cm.

Organic Carbon: Organic carbon levels range from 2.2 to 28.3 mg/L. Lower levels are desirable in drinking water.

Trihalomethanes (THMs): THM concentrations vary with an average of approximately 66.4. THMs are byproducts of chlorine treatment and are within acceptable limits.

Turbidity: Turbidity values range from 1.45 to 6.739 NTU. The mean turbidity is below the WHO recommended value of 5.00 NTU.

Potability: The dataset indicates whether water is safe for consumption (1 for potable, 0 for not potable). About 39.01% of the data points are labeled as potable.

These insights provide a preliminary understanding of the water quality parameters and their ranges within the dataset. Further analysis and modeling can help extract more meaningful patterns and relationships.

In [None]:
print(data["Potability"].value_counts())

**In this dataset, 1998 water samples are not potable and 1278 are potable.**


---

This distribution suggests that the dataset is slightly imbalanced, with more instances of not potable water compared to potable water. This is an important aspect to consider when building predictive models or conducting further analysis on the data.

In [None]:
data['Potability'].value_counts(normalize=True)

About 61% of the instances of our target variable are 'Not Potable'.
About 39% of the instances of our target variable are 'Potable'.

**Skewness**

In [None]:
data.drop('Potability', axis=1).skew()

**Univariate Analysis**

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))

numerical_columns = data.drop("Potability", axis=1).columns

for column in numerical_columns:
    plt.subplot(3, 3, numerical_columns.get_loc(column) + 1)
    sns.histplot(data[column], kde=True)
    plt.title(column)

plt.tight_layout()
plt.show()

Upon analyzing the histograms of various water quality metrics, it is evident that the dataset exhibits normal distribution patterns for multiple parameters. The pH values, hardness levels, total dissolved solids (TDS), chloramines concentration, sulfate levels, organic carbon content, trihalomethanes concentration, and turbidity values all follow approximately normal distribution curves. This observation provides valuable insights into the natural variations of these water quality parameters across the dataset. Such distributions indicate that the dataset is diverse and encompasses a broad range of water quality conditions, mirroring real-world variations in water sources. These findings enhance the dataset's credibility and its potential to support robust analyses and modeling efforts.

**Solids** is slightly right-skewed.
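The visual impression of normality and skewness can also be checked numerically. The following is a minimal sketch on synthetic data, not on the water dataset: the helper name `normality_summary` and the two demo columns are assumptions made for illustration. It reports the sample skewness alongside the p-value of SciPy's D'Agostino-Pearson `normaltest`, where a small p-value suggests a departure from normality.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-ins (assumption): one roughly normal column, one right-skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=7.0, scale=1.5, size=2000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.8, size=2000),
})

def normality_summary(frame):
    """Return skewness and normaltest p-value for each column of `frame`."""
    rows = []
    for name in frame.columns:
        values = frame[name].dropna()  # normaltest does not accept NaN by default
        _, p_value = stats.normaltest(values)
        rows.append({"feature": name, "skew": values.skew(), "normaltest_p": p_value})
    return pd.DataFrame(rows).set_index("feature")

summary = normality_summary(demo)
print(summary)
```

The same helper could be called on `data.drop('Potability', axis=1)` to compare each feature against the histograms.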

**Bivariate Analysis**

In [None]:
import plotly.graph_objs as go
index_vals = data['Potability'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='ph',
                                 values=data['ph']),
                            dict(label='Hardness',
                                 values=data['Hardness']),
                            dict(label='Solids',
                                 values=data['Solids']),
                            dict(label='Chloramines',
                                 values=data['Chloramines']),
                           dict(label='Sulfate',
                                 values=data['Sulfate']),
                            dict(label='Conductivity',
                                 values=data['Conductivity']),
                            dict(label='Organic_carbon',
                                 values=data['Organic_carbon']),
                            dict(label='Trihalomethanes',
                                 values=data['Trihalomethanes']),
                           dict(label='Turbidity',
                                 values=data['Turbidity'])],
                showupperhalf=False,
                text=data['Potability'],
                marker=dict(color=index_vals,
                            showscale=False,
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Water Quality',
    width=1000,
    height=1000,
)

fig.show()

In [None]:
corr = data.corr()  # compute the correlation matrix once and reuse it
fig = go.Figure(go.Heatmap(z=corr, x=corr.columns.tolist(), y=corr.columns.tolist(), colorscale='agsunset'))
fig.show()

In [None]:
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()

**Sulfate and Hardness:**

The correlation coefficient between Sulfate and Hardness is approximately -0.1069. This suggests a weak negative correlation between these two variables. While the correlation implies that higher Sulfate levels may be associated with lower Hardness levels, the correlation strength is relatively modest.

**Solids and Sulfate:**

There is a negative correlation of approximately -0.171 between Solids and Sulfate levels. This suggests that higher levels of Solids in the water are associated with lower levels of Sulfate. The coefficient still indicates only a weak negative correlation, and it's important to remember that correlation doesn't imply causation.

# **Handling Missing Values**

Visualizations and numeric summaries are the first step in understanding the missing information in a dataset.
Identifying how many values are missing, and in which features, comes first when dealing with missing values.

In [None]:
data.isnull().sum()

We have 3 features with missing values.

*   ph : 491
*   Sulfate : 781
*   Trihalomethanes : 162

Now let's see the percentage of missing values in each feature.

In [None]:
data.isnull().mean()*100

In [None]:
# Create a heatmap of missing values
plt.figure(figsize=(10, 6))
sns.heatmap(data.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

In [None]:
data[data['ph'].isnull() & data['Sulfate'].isnull()].shape[0]

In [None]:
data[data['ph'].isnull() & data['Trihalomethanes'].isnull()].shape[0]

In [None]:
data[data['Sulfate'].isnull() & data['Trihalomethanes'].isnull()].shape[0]

In [None]:
data[data['Sulfate'].isnull() & data['Trihalomethanes'].isnull() & data['ph'].isnull()].shape[0]

# **Imputing missing values:**

https://www.mdpi.com/2071-1050/13/11/6318

### **Imputing ph values**

In [None]:
rfr_imputer = RandomForestRegressor(n_estimators=100, random_state=0)

In [None]:
df = data.copy().drop(['Potability',], axis = 1)

To do this, the other two features with NaN values need to be dropped, because the regressor cannot be trained on columns that contain missing values.

In [None]:
df = df.drop(['Sulfate', 'Trihalomethanes',], axis = 1)

In [None]:
df.shape

In [None]:
df_x= df.dropna().drop(['ph'],axis = 1)

In [None]:
df_x.head()

In [None]:
df_y = df['ph'].dropna()

In [None]:
df_x.shape, df_y.shape

In [None]:
rfr_imputer.fit(df_x, df_y)

In [None]:
null_mask = df.isnull().any(axis=1)
null_rows = df[null_mask]

In [None]:
null_rows = pd.DataFrame(null_rows)

In [None]:
null_rows.head()

In [None]:
X_unknown = null_rows.drop('ph', axis = 1).copy()

In [None]:
imputed_values = rfr_imputer.predict(X_unknown)

In [None]:
imputed_values

In [None]:
null_rows.ph = imputed_values

In [None]:
null_rows

In [None]:
imputed_data = data.copy()

In [None]:
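# assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about
imputed_data['ph'] = imputed_data['ph'].fillna(null_rows['ph'])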

In [None]:
imputed_data['ph'].isnull().sum()

### **Imputing Sulfate values**

In [None]:
rfr_imputer = RandomForestRegressor(n_estimators=100, random_state=0)

In [None]:
df = imputed_data.copy().drop(['Potability',], axis = 1)

Now that the null values in the ph column are imputed, we can use it for imputing other columns, so only the "Trihalomethanes" column needs to be dropped due to its null values.

In [None]:
df = df.drop(['Trihalomethanes',], axis = 1)

In [None]:
df.shape

In [None]:
df_x= df.dropna().drop(['Sulfate'],axis = 1)

In [None]:
df_x.head()

In [None]:
df_y = df['Sulfate'].dropna()

In [None]:
df_x.shape, df_y.shape

In [None]:
rfr_imputer.fit(df_x, df_y)

In [None]:
null_mask = df.isnull().any(axis=1)
null_rows = df[null_mask]

In [None]:
null_rows = pd.DataFrame(null_rows)

In [None]:
null_rows.head()

In [None]:
X_unknown = null_rows.drop('Sulfate', axis = 1).copy()

In [None]:
imputed_values = rfr_imputer.predict(X_unknown)

In [None]:
imputed_values

In [None]:
null_rows.Sulfate = imputed_values

In [None]:
null_rows

In [None]:
#filling na values with imputed values (assigned back rather than using inplace=True)
imputed_data['Sulfate'] = imputed_data['Sulfate'].fillna(null_rows['Sulfate'])

In [None]:
imputed_data['Sulfate'].isnull().sum()

### **Imputing Trihalomethanes null values:**

In [None]:
rfr_imputer = RandomForestRegressor(n_estimators=100, random_state=0)

In [None]:
df = imputed_data.copy().drop(['Potability',], axis = 1)

In [None]:
df.shape

Now that all columns except 'Trihalomethanes' are without any null values we can use all dataset to impute null values in this column.

In [None]:
df_x= df.dropna().drop(['Trihalomethanes'],axis = 1)

In [None]:
df_x.head()

In [None]:
df_y = df['Trihalomethanes'].dropna()

In [None]:
df_x.shape, df_y.shape

In [None]:
rfr_imputer.fit(df_x, df_y)

In [None]:
null_mask = df.isnull().any(axis=1)
null_rows = df[null_mask]

In [None]:
null_rows = pd.DataFrame(null_rows)

In [None]:
null_rows.head()

In [None]:
X_unknown = null_rows.drop('Trihalomethanes', axis = 1).copy()

In [None]:
imputed_values = rfr_imputer.predict(X_unknown)

In [None]:
null_rows.Trihalomethanes = imputed_values

In [None]:
null_rows

In [None]:
# assign the result back instead of using inplace=True on a column selection
imputed_data['Trihalomethanes'] = imputed_data['Trihalomethanes'].fillna(null_rows['Trihalomethanes'])

In [None]:
imputed_data['Trihalomethanes'].isnull().sum()

In [None]:
# check the imputed copy; the original `data` still contains the NaN values
imputed_data.isnull().sum()

Now we can continue with a dataset that has no missing values.
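The three imputation passes above repeat the same steps: fit a random forest on the rows where the target column is known, then predict it for the rows where it is missing. As a sketch only, those steps could be wrapped in one helper. The function name `impute_with_random_forest`, its parameters, and the synthetic `toy` frame are assumptions for illustration, and the helper assumes the predictor columns themselves contain no NaN values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_random_forest(frame, target, predictors, random_state=0):
    """Return a copy of `frame` with NaNs in `target` replaced by
    RandomForestRegressor predictions made from `predictors`."""
    result = frame.copy()
    missing = result[target].isnull()
    if not missing.any():
        return result
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    # Train only on rows where the target is observed
    model.fit(result.loc[~missing, predictors], result.loc[~missing, target])
    # Fill only the rows where the target is missing
    result.loc[missing, target] = model.predict(result.loc[missing, predictors])
    return result

# Synthetic example (assumption): column "c" depends on "a" and has 20 gaps
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy["c"] = 2 * toy["a"] + rng.normal(scale=0.1, size=200)
toy.loc[toy.sample(20, random_state=0).index, "c"] = np.nan

filled = impute_with_random_forest(toy, target="c", predictors=["a", "b"])
print(filled["c"].isnull().sum())
```

On the water dataset the helper would be called once per column, in the same order as above (ph, then Sulfate, then Trihalomethanes), widening `predictors` as each column becomes complete.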

**Outliers exploration:**

In [None]:
# List of numerical features

numerical_features = [feature for feature in imputed_data.drop('Potability', axis = 1).columns if imputed_data[feature].dtype != 'object']

# Determine the number of subplots and rows
num_plots = len(numerical_features)
num_rows = 3
num_cols = 3

# Create subplots
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 10))
axes = axes.ravel()

# Loop through each numerical feature and create a box plot
for i, feature in enumerate(numerical_features):
    sns.boxplot(data= imputed_data, y=feature, ax=axes[i])
    axes[i].set_title(f'Box Plot of {feature}')

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
# Define a function to count outliers using IQR method
def count_outliers(column):
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = column[(column < lower_bound) | (column > upper_bound)]
    return len(outliers)

In [None]:
# Create a DataFrame to store the number of outliers for each feature
outliers_df = pd.DataFrame(columns=['Number of Outliers'])

# Count outliers for each feature and store the results in the DataFrame
for feature in data.drop('Potability', axis = 1).columns:
    if data[feature].dtype != 'object':
        num_outliers = count_outliers(data[feature])
        outliers_df.loc[feature]= num_outliers

# Sort the DataFrame by number of outliers in descending order
outliers_df = outliers_df.sort_values(by='Number of Outliers', ascending=False)

outliers_df

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))

numerical_columns = data[['Sulfate', 'ph', 'Trihalomethanes']].columns

for column in numerical_columns:
    plt.subplot(3, 3, numerical_columns.get_loc(column) + 1)
    sns.histplot(data[column], kde=True)
    plt.title(column)

plt.tight_layout()
plt.show()

# **Feature Engineering:**

Let's create domain-knowledge features from the water quality dataset: a categorical pH feature, and a combination of "Hardness" and "Solids" that represents total minerals.

pH Categories: pH is an important parameter in water quality assessment. We create a new categorical feature that classifies pH levels as "Acidic", "Neutral", or "Alkaline", with ranges based on standard pH values:

*   Acidic: pH < 6.5
*   Neutral: 6.5 ≤ pH < 8.5
*   Alkaline: pH ≥ 8.5

This feature provides insight into the acidity or alkalinity of the water samples.

Total Minerals: We add "Hardness" and "Solids" to create a new feature that represents the total mineral content in water. This combined feature can provide a broader view of the overall mineral concentration in the water, which might have an impact on its quality.
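The pH boundaries listed above can be checked on a few hand-picked values. This is a small sketch, not the notebook's own cell: it assumes `pd.cut` with `right=False`, so each bin includes its lower edge, and infinite outer edges so that values of exactly 0 or 14 are still assigned a category.

```python
import pandas as pd

# Hand-picked pH values sitting on and around the category boundaries
ph_values = pd.Series([0.0, 6.4, 6.5, 7.0, 8.5, 14.0])

categories = pd.cut(
    ph_values,
    bins=[float("-inf"), 6.5, 8.5, float("inf")],
    labels=["Acidic", "Neutral", "Alkaline"],
    right=False,  # bins are [low, high): 6.5 is Neutral, 8.5 is Alkaline
)
print(categories.tolist())
# ['Acidic', 'Acidic', 'Neutral', 'Neutral', 'Alkaline', 'Alkaline']
```

With the default `right=True` and finite edges `[0, 6.5, 8.5, 14]`, a pH of exactly 0 would fall outside the first bin and become NaN, and 6.5 would be labelled Acidic.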

In [None]:
# Bin pH into the categories defined above; right=False makes each bin
# include its lower edge, and the infinite outer edges keep pH values of
# exactly 0 or 14 from falling outside the bins
imputed_data['pH_category'] = pd.cut(imputed_data['ph'], bins=[float('-inf'), 6.5, 8.5, float('inf')], labels=['Acidic', 'Neutral', 'Alkaline'], right=False)

imputed_data['Total_minerals'] = imputed_data['Hardness'] + imputed_data['Solids']

# Drop the original 'ph', 'Hardness', and 'Solids' columns if needed
# imputed_data.drop(['ph', 'Hardness', 'Solids'], axis=1, inplace=True)

# Display the data to check the newly created features
imputed_data

In [None]:
data_new = imputed_data.copy()

In [None]:
col = data_new.pop('pH_category')

In [None]:
data_new.insert(1, col.name, col)

In [None]:
data_new

In [None]:
col = data_new.pop('Total_minerals')

In [None]:
data_new.insert(4, col.name, col)

In [None]:
data_new

In [None]:
imputed_data = data_new.copy()

In [None]:
imputed_data.head()

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))

numerical_columns = imputed_data[['Total_minerals']].columns

for column in numerical_columns:
    plt.subplot(3, 3, numerical_columns.get_loc(column) + 1)
    sns.histplot(imputed_data[column], kde=True)
    plt.title(column)

plt.tight_layout()
plt.show()

In [None]:
pd.DataFrame(imputed_data.pH_category.value_counts())

In [None]:
# Stacked histograms of the engineered features, split by Potability
sns.histplot(data=imputed_data, x='pH_category', hue='Potability', multiple='stack')
plt.title('pH Category Distribution by Potability')
plt.show()

sns.histplot(data=imputed_data, x='Total_minerals', hue='Potability', multiple='stack')
plt.title('Total Minerals Distribution by Potability')
plt.show()

The distributions of these two features look similar for potable and non-potable water samples, which indicates that they might not be strong differentiators between the two classes. In other words, these features might not have a significant influence on predicting water potability.

This could suggest that the feature might not carry enough discriminatory information to distinguish between potable and non-potable water samples. It's important to note that while a similar distribution doesn't necessarily mean the feature is irrelevant, it does raise the question of whether the feature contributes much to predicting the target variable.

In such cases, it is required to further analyze the statistical summary and consider performing hypothesis tests or calculating feature importance scores to quantitatively assess the impact of these features on potability prediction. If the additional analyses also indicate that the features are not strongly related to potability, you might consider excluding them from your modeling process.

However, the absence of a clear distinction in the distribution doesn't necessarily mean the features have no value. It's a part of the exploratory process to assess their potential contribution to your predictive models.

In [None]:
# Calculate correlations between the engineered features and Potability
correlation_matrix = imputed_data[['Total_minerals']].corrwith(imputed_data['Potability'])

# Display the correlation coefficients
print(correlation_matrix)

In [None]:
potable_data = imputed_data[imputed_data['Potability'] == 1]['Total_minerals']
non_potable_data = imputed_data[imputed_data['Potability'] == 0]['Total_minerals']

t_statistic, p_value = stats.ttest_ind(potable_data, non_potable_data)
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

In the context of hypothesis testing, the null hypothesis (often denoted as "H0") is a statement that suggests there is no effect or relationship between variables. It's a default assumption to test against by collecting and analyzing data.

In this case, when performing a t-test or Mann-Whitney U test to compare the "Total_minerals" feature between potable and non-potable water samples, the null hypothesis is something like:

"Total_minerals does not differ between potable and non-potable water samples."

For the t-test specifically, this means the two groups have the same mean "Total_minerals". The p-value is the probability of observing a difference at least as large as the one in the data if the null hypothesis were true. If the p-value is low (typically below a significance level like 0.05), we reject the null hypothesis in favor of the alternative hypothesis that there is a significant difference between the groups. If the p-value is high, we do not have enough evidence to reject the null hypothesis, and the observed difference could be due to chance.

So in this case, a p-value of 0.054, which is greater than 0.05, means that there isn't strong enough evidence to reject the null hypothesis. In other words, there isn't enough statistical evidence to conclude that "Total_minerals" differs between potable and non-potable water.

If the data is not normally distributed, the Mann-Whitney U test can be used. This is a non-parametric test that assesses whether the distributions of two groups are significantly different. However, since the features in this dataset are approximately normally distributed, the t-test is the more appropriate choice here.
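For reference, this is how the two tests compare in code. It is a sketch on synthetic groups, not on the water data: the group sizes, means, and spreads are assumptions chosen so that the groups clearly differ. It uses Welch's variant of the t-test (`equal_var=False`), which does not assume equal variances, next to `scipy.stats.mannwhitneyu`.

```python
import numpy as np
from scipy import stats

# Synthetic groups (assumption): different means and different spreads
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=300)
group_b = rng.normal(loc=1.0, scale=2.0, size=300)

# Welch's t-test: compares means without assuming equal variances
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based, makes no normality assumption
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print("Welch t-test p-value:", welch_p)
print("Mann-Whitney U p-value:", u_p)
```

On the notebook's data, `group_a` and `group_b` would be the potable and non-potable "Total_minerals" series.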

# **Predictive Modeling**

First, let's run the XGBoost classifier after removing rows with missing values.

## **Removing Observations with Missing Data**

In [None]:
print(f'Before dropping missing values, we have {data.shape[0]} instances')

data_dropped = data.dropna()
print(f'After dropping missing values, we have {data_dropped.shape[0]} instances')

# to see how many instances we have lost
lost = data.shape[0] - data_dropped.shape[0]
print(f'We have lost {lost} instances, which is {lost / data.shape[0]:.0%} of the data')

In [None]:
data_dropped.isnull().sum()

**We no longer have any missing values, but we have lost 39% of our data.**


**Let's move on to modeling and prediction.**

In [None]:
X = data_dropped.drop('Potability', axis=1)
y = data_dropped['Potability']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)
print(f'Size of the X_train: {X_train.shape[0]}')
model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {round(accuracy * 100.0, 3)}')

**Now let's work with imputed data**

Split the dataset into training and testing sets. The training set will be used to train the models, and the testing set will be used to evaluate their performance.

Before that, note that the classes are imbalanced:

In [None]:
imputed_data["Potability"].value_counts()

Because of this imbalance, it is better to use stratified sampling.

When splitting data into training and testing sets, using stratified sampling ensures that the class distribution in the training and testing sets is similar to the original class distribution. Scikit-learn's train_test_split has a stratify parameter that can be set to your target variable.
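The effect of the `stratify` parameter can be seen on a toy target. This is a minimal sketch with made-up arrays (`X_demo`, `y_demo` are assumptions, with a 61/39 class split chosen to resemble this dataset): after a stratified split, the share of the positive class is the same in the training and test parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target (assumption): 610 negatives and 390 positives
y_demo = np.array([0] * 610 + [1] * 390)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)

# Share of the positive class in each part
print("train:", y_tr.mean())
print("test:", y_te.mean())
```

Without `stratify`, the two shares would generally drift apart, more so for small test sets.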

In [None]:
#creating feature and target matrices

X = imputed_data.drop('Potability', axis=1)  # Features
y = imputed_data['Potability']  # Target

In [None]:
X.head()

Because the 'pH_category' and 'Total_minerals' features are derived from other features, they are highly correlated with them. We drop them to avoid multicollinearity and potential issues during model training.

In [None]:
X = X.drop(['pH_category','Total_minerals'], axis = 1)

In [None]:
# Stratified 90/10 split so both sets keep the class ratio of the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

As you can see, the ratio of not potable to potable water is the same in both sets and matches the whole dataset (around 1.56).

In [None]:
# Choose a machine learning algorithm (Random Forest)
model = RandomForestClassifier()

# Perform k-fold cross-validation (let's use k=5)
k = 5
scores = cross_val_score(model, X_train, y_train, cv=k, scoring='accuracy')

# Print the accuracy scores for each fold
print("Cross-Validation Scores:", scores)

# Calculate the mean and standard deviation of accuracy
mean_accuracy = scores.mean()
std_accuracy = scores.std()

print("Mean Accuracy:", mean_accuracy)
print("Standard Deviation of Accuracy:", std_accuracy)


The mean accuracy of approximately 0.664 indicates that, on average, the Random Forest model's predictions were correct around 66.4% of the time. The standard deviation of accuracy (0.0117) gives an idea of the variability in accuracy across different folds of the cross-validation process.
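To put this accuracy in context, it helps to compare it with a majority-class baseline. The sketch below is an illustration, not part of the original workflow: it rebuilds a target with the class counts reported earlier (1998 not potable, 1278 potable), uses a placeholder feature matrix `X_demo` (an assumption, since the baseline ignores features), and scores scikit-learn's `DummyClassifier`, which always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Target rebuilt from the class counts of the full dataset
y_demo = np.array([0] * 1998 + [1] * 1278)
# Placeholder features (assumption): the baseline never looks at them
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")

print("Baseline mean accuracy:", baseline_scores.mean())
```

Always predicting "not potable" is correct for roughly 61% of samples, so a mean accuracy of about 66% is only a modest gain over that baseline.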

## **Fine-tuning model**

### **Hyperparameter tuning**

In [None]:
class HyperparameterTuner:
    def __init__(self, model, param_space):
        self.model = model
        self.param_space = param_space

    def grid_search(self, X, y):
        grid_search = GridSearchCV(self.model, self.param_space, cv=5)
        grid_search.fit(X, y)
        return grid_search.best_estimator_

    def random_search(self, X, y):
        random_search = RandomizedSearchCV(self.model, self.param_space, n_iter=50, cv=5)
        random_search.fit(X, y)
        return random_search.best_estimator_

    def bayesian_optimization(self, X, y):
        params = {
            'n_estimators': (10, 200),
            'max_depth': (1, 32),
            'min_samples_split': (2, 20),
            'min_samples_leaf': (1, 20),
            'max_features': (0.1, 1.0)
        }

        bayes_cv_tuner = BayesSearchCV(
            self.model,
            params,
            n_iter=5,
            cv=5,
            n_jobs=-1,
            verbose=1,
            refit=True
        )

        X_train_cv, X_val, y_train_cv, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
        bayes_cv_tuner.fit(X_train_cv, y_train_cv)
        best_model = bayes_cv_tuner.best_estimator_

        y_pred = best_model.predict(X_val)
        accuracy = accuracy_score(y_val, y_pred)

        return best_model, accuracy

In [None]:
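# Search space used by grid_search and random_search;
# bayesian_optimization defines its own ranges internally and ignores this dict
param_space = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    # Add other hyperparameters here
}
tuner = HyperparameterTuner(RandomForestClassifier(), param_space)
# bayesian_optimization returns a (best_model, validation_accuracy) tuple
best_model, val_accuracy = tuner.bayesian_optimization(X_train, y_train)
best_model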

In [None]:
model = RandomForestClassifier(max_depth=28, max_features=0.9473013842824254,
                        min_samples_leaf=4, min_samples_split=13,
                        n_estimators=140)  # Instantiate with the tuned hyperparameters
model.fit(X_train, y_train)       # Fit the model on the training set


In [None]:
y_pred = model.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In [None]:
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)


In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

In [None]:
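from sklearn.metrics import roc_auc_score

# ROC AUC should be computed from the predicted probability of the positive
# class rather than from the hard 0/1 predictions
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("ROC AUC:", roc_auc)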


In [None]:
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Get the predicted probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
