<a href="https://colab.research.google.com/github/Avilez-dev-11/Projects-in-ML-AI/blob/main/homework2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Task 1 (30 points): Implement a Decision Tree Classifier for your classification problem. You may use a built-in package to implement your classifier. Try modifying one or more of the input parameters and describe what changes you notice in your results. Clearly describe how these factors are affecting your output.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

In [22]:
# Assuming you've downloaded the dataset to the specified path
df = pd.read_csv('drive/MyDrive/ColabNotebooks/water_potability.csv', sep=',', header=0)
df = df.head(3000)  # Using only the first 3000 samples
df.isnull().sum()

ph                 458
Hardness             0
Solids               0
Chloramines          0
Sulfate            704
Conductivity         0
Organic_carbon       0
Trihalomethanes    150
Turbidity            0
Potability           0
dtype: int64

In [23]:
orig_df = df
# Create imputer with mean strategy
imputer = SimpleImputer(strategy='mean')

# Fit imputer to data
imputer.fit(df)

# Transform data with imputation
df = pd.DataFrame(imputer.transform(df), columns=df.columns)
df.describe()

statistic,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,7.063676,196.52929,21951.716391,7.121532,333.637074,426.423569,14.291086,66.468212,3.967592,0.373
std,1.449499,32.646981,8632.624987,1.619427,36.7181,80.805002,3.320222,15.707666,0.782466,0.483683
min,0.227499,47.432,320.942611,0.352,129.0,181.483754,2.2,8.175876,1.492207,0.0
25%,6.281672,176.961992,15668.824549,6.099903,316.761777,365.811312,12.054236,56.800812,3.437822,0.0
50%,7.063676,197.103467,20863.398168,7.131295,333.637074,422.022214,14.213797,66.468212,3.955122,0.0
75%,7.831012,216.674288,27182.623755,8.153135,350.538503,481.915416,16.568839,76.765238,4.50202,1.0
max,14.0,323.124,61227.196008,13.127,481.030642,753.34262,28.3,124.0,6.739,1.0
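
The imputer above is fitted on the whole frame, so the column means it learns include rows that later land in the test split, and the `Potability` column is passed through it as well. A minimal sketch of one alternative, assuming a scikit-learn `Pipeline` is acceptable for this assignment: the imputer learns its means from the training rows only. The data and every `demo`/`Xd_*` name below are hypothetical stand-ins, not the water dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the water data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["ph", "Sulfate", "Turbidity"])
demo["Potability"] = rng.integers(0, 2, size=200)
demo.loc[rng.choice(200, size=30, replace=False), "ph"] = np.nan  # inject missing values

X_demo = demo.drop("Potability", axis=1)  # the target is never passed to the imputer
y_demo = demo["Potability"]
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline learns column means from the training rows only,
# then reuses those means when it transforms the test rows.
leak_free = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=42),
)
leak_free.fit(Xd_train, yd_train)

train_ph_mean = Xd_train["ph"].mean()  # pandas skips NaN by default
learned_ph_mean = leak_free.named_steps["simpleimputer"].statistics_[0]
print("Training-only ph mean:", train_ph_mean)
print("Mean stored by the imputer:", learned_ph_mean)
print("Test accuracy:", leak_free.score(Xd_test, yd_test))
```

With only a few percent of values missing, the effect on the reported scores is probably small, but the pipeline keeps the test split untouched until evaluation.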


In [24]:
X = df.drop('Potability', axis=1)  # Features (water quality parameters)
y = df['Potability']  # Target variable (potability)

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [26]:
# Start with a maximum depth of 3 and the entropy criterion
dtcEntropy = DecisionTreeClassifier(max_depth=3, criterion='entropy', random_state=42)
dtcEntropy.fit(X_train, y_train)

In [27]:
y_pred = dtcEntropy.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6733333333333333


In [28]:
# Next, use a maximum depth of 7 and the Gini impurity criterion
dtcGini = DecisionTreeClassifier(max_depth=7, criterion='gini', random_state=42)
dtcGini.fit(X_train, y_train)

In [29]:
y_pred = dtcGini.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6566666666666666


I created two decision tree models with different values for the parameters criterion and max_depth.

1. **Criterion (entropy vs. gini):**

* While both entropy and gini impurity are valid measures for splitting nodes, they can sometimes lead to slightly different tree structures, potentially affecting accuracy.
* In this case, the entropy-based tree (0.673) slightly outperformed the Gini-based tree (0.657). The gap is small, and because max_depth also differs between the two models (3 vs. 7), it cannot be attributed to the criterion alone.

2. **Max Depth (3 vs. 7):**

* **Depth 3:** Restricting the tree to a maximum depth of 3 likely resulted in a simpler model with fewer decision rules. This can help prevent overfitting, where a model learns too much from the training data and struggles to generalize to unseen data.
* **Depth 7:** Allowing a deeper tree (depth 7) enables more complex decision boundaries, potentially capturing more intricate patterns in the data. However, it also increases the risk of overfitting, which might explain the slightly lower accuracy on the test set.

**Rationale for Adjusting Max Depth:**

* **Overfitting Prevention:** The primary motivation for adjusting max_depth is to control model complexity and potentially mitigate overfitting. Simpler trees (lower depth) are less prone to overfitting but might miss important patterns in the data.
* **Gini Index and Complexity:** Gini impurity and entropy usually rank candidate splits very similarly, so the choice of criterion alone rarely changes the tree much. This notebook does not test whether the Gini tree needs a greater max_depth to reach the same purity as the entropy tree, since each criterion was tried at only one depth.
* **Entropy and Finer-Grained Splits:** The two criteria are different functions of the class proportions, so they can occasionally pick different splits. Whether that is why the depth-3 entropy tree scored higher here is unclear; comparing both criteria at the same depth would be needed to tell.
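
To separate the effect of criterion from the effect of max_depth, each criterion can be tried at each depth. The sketch below does this on synthetic data from `make_classification`, which is an assumption made so the example runs on its own; `X_syn`, `grid_results`, and the other names are hypothetical, and the notebook's own `X_train`/`X_test` could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption; substitute the notebook's X and y)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=9, n_informative=5, random_state=42
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Fit one tree per (criterion, depth) pair so each factor varies independently
grid_results = {}
for criterion in ("gini", "entropy"):
    for depth in (3, 7):
        tree = DecisionTreeClassifier(
            max_depth=depth, criterion=criterion, random_state=42
        )
        tree.fit(Xs_train, ys_train)
        grid_results[(criterion, depth)] = (
            tree.score(Xs_train, ys_train),  # training accuracy
            tree.score(Xs_test, ys_test),    # held-out accuracy
        )

for (criterion, depth), (train_acc, test_acc) in grid_results.items():
    print(f"{criterion:8s} depth={depth}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A training accuracy that rises with depth while the test accuracy stalls or falls would point to overfitting; a gap between the two criteria at the same depth would point to the criterion.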

Task 2 (30 points): From the Bagging and Boosting ensemble methods pick any one algorithm
from each category. Implement both the algorithms using the same data. Use k-fold cross
validation to find the effectiveness of both the models. Comment on the difference/similarity of
the results.

In [30]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from numpy import mean, std

# Define K-fold parameters
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Implement Random Forest with K-fold CV
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train,y_train)
rf_scores = cross_val_score(rf, X_train, y_train, cv=kfold, scoring='accuracy', n_jobs=-1, error_score='raise')
print("Random Forest Accuracy (K-fold): %.4f (%.4f)" % (mean(rf_scores), std(rf_scores)))

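# Implement gradient boosting with K-fold CV
# (scikit-learn's GradientBoostingClassifier, not the XGBoost library; the name xgb is kept because later cells use it)
xgb = GradientBoostingClassifier(random_state=42)
xgb.fit(X_train,y_train)
# Note: a fresh estimator without random_state is cross-validated here, not the seeded xgb above
xgb_scores = cross_val_score(GradientBoostingClassifier(), X_train, y_train, cv=kfold, scoring='accuracy', n_jobs=-1, error_score='raise')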
print("XGBoost Accuracy (K-fold): %.4f (%.4f)" % (mean(xgb_scores), std(xgb_scores)))


Random Forest Accuracy (K-fold): 0.6643 (0.0276)
XGBoost Accuracy (K-fold): 0.6524 (0.0278)


The boosting method I chose was gradient boosting, implemented with scikit-learn's GradientBoostingClassifier (labeled "XGBoost" in the code and output, although the XGBoost library itself is not used). The bagging method was random forest. Performance was nearly identical for both implementations, with the Random Forest model having a higher mean accuracy by 0.0119 (0.6643 vs. 0.6524). The standard deviations are also very close, with gradient boosting higher by 0.0002 (0.0278 vs. 0.0276), so the difference in means is well within the fold-to-fold spread. Both models use default hyperparameters, so the performance of both can most likely be improved by tuning.
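
A minimal sketch of the same comparison scored with more than one metric per fold, assuming `cross_validate` and stratified folds are acceptable here. The data is synthetic with roughly the class balance seen above, and `candidates`, `cv_summary`, and the reduced `n_estimators=50` are illustrative choices rather than the notebook's settings.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced data (an assumption; not the water dataset)
X_cv, y_cv = make_classification(
    n_samples=300, n_features=9, n_informative=5,
    weights=[0.63, 0.37], random_state=42,
)

# The same seeded splitter is passed to both models, so they see identical folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_validate(estimator, X_cv, y_cv, cv=folds, scoring=("accuracy", "f1"))
    cv_summary[name] = {
        metric: (mean(scores[f"test_{metric}"]), std(scores[f"test_{metric}"]))
        for metric in ("accuracy", "f1")
    }

for name, metrics in cv_summary.items():
    for metric, (avg, spread) in metrics.items():
        print("%s %s: %.4f (%.4f)" % (name, metric, avg, spread))
```

Reporting F1 next to accuracy shows whether a model that looks similar on accuracy differs in how well it finds the minority class.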

Task 3 (40 points): Compare the effectiveness of the three models implemented above. Clearly describe the metric you are using for comparison. Describe (with examples) Why is this metric(metrics) suited/appropriate for the problem at hand? How would a choice of a different metric impact your results? Can you demonstrate that?

In [31]:
from sklearn.metrics import confusion_matrix, classification_report, precision_score
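def analysis(model, X_test=X_test, y_test=y_test):
  """Print test-set diagnostics for a fitted model and return its feature importances."""
  y_pred = model.predict(X_test)
  # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
  print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))
  print("Classification Report: \n{}\n".format(classification_report(y_test, y_pred)))
  # One importance value per feature column, in the order of X.columns
  features = pd.DataFrame(model.feature_importances_, index = X.columns)
  print("Feature importance: \n")
  return features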

In [32]:
# Decision tree with the entropy criterion and max_depth=3 (not scikit-learn's default settings, despite the printed label)
print("Default Decision Tree Classifier (Entropy)\n")
analysis(dtcEntropy)

Default Decision Tree Classifier (Entropy)

Confusion Matrix: 
 [[552  16]
 [278  54]]

Classification Report: 
              precision    recall  f1-score   support

         0.0       0.67      0.97      0.79       568
         1.0       0.77      0.16      0.27       332

    accuracy                           0.67       900
   macro avg       0.72      0.57      0.53       900
weighted avg       0.70      0.67      0.60       900


Feature importance: 



feature,importance
ph,0.252318
Hardness,0.073216
Solids,0.158775
Chloramines,0.068967
Sulfate,0.446724
Conductivity,0.0
Organic_carbon,0.0
Trihalomethanes,0.0
Turbidity,0.0


In [33]:
# Decision tree with the Gini criterion and max_depth=7 (not scikit-learn's default settings, despite the printed label)
print("Default Decision Tree Classifier (Gini)\n")
analysis(dtcGini)

Default Decision Tree Classifier (Gini)

Confusion Matrix: 
 [[527  41]
 [268  64]]

Classification Report: 
              precision    recall  f1-score   support

         0.0       0.66      0.93      0.77       568
         1.0       0.61      0.19      0.29       332

    accuracy                           0.66       900
   macro avg       0.64      0.56      0.53       900
weighted avg       0.64      0.66      0.60       900


Feature importance: 



feature,importance
ph,0.170064
Hardness,0.118391
Solids,0.144043
Chloramines,0.084983
Sulfate,0.20172
Conductivity,0.071664
Organic_carbon,0.050671
Trihalomethanes,0.10667
Turbidity,0.051794


In [34]:
# Random forest fitted on the training split and evaluated on the held-out test set (the K-fold scores are reported in Task 2)
print("Random Forest with K-fold CV\n")
analysis(rf)

Random Forest with K-fold CV

Confusion Matrix: 
 [[514  54]
 [232 100]]

Classification Report: 
              precision    recall  f1-score   support

         0.0       0.69      0.90      0.78       568
         1.0       0.65      0.30      0.41       332

    accuracy                           0.68       900
   macro avg       0.67      0.60      0.60       900
weighted avg       0.67      0.68      0.65       900


Feature importance: 



feature,importance
ph,0.116277
Hardness,0.116044
Solids,0.116369
Chloramines,0.115058
Sulfate,0.130328
Conductivity,0.103726
Organic_carbon,0.101391
Trihalomethanes,0.100869
Turbidity,0.099938


In [35]:
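# Gradient boosting (GradientBoostingClassifier, labeled XGBoost here) fitted on the training split and evaluated on the held-out test set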
print("XGBoost with K-fold CV: ")
analysis(xgb)

XGBoost with K-fold CV: 
Confusion Matrix: 
 [[519  49]
 [255  77]]

Classification Report: 
              precision    recall  f1-score   support

         0.0       0.67      0.91      0.77       568
         1.0       0.61      0.23      0.34       332

    accuracy                           0.66       900
   macro avg       0.64      0.57      0.55       900
weighted avg       0.65      0.66      0.61       900


Feature importance: 



feature,importance
ph,0.212111
Hardness,0.080244
Solids,0.119121
Chloramines,0.116693
Sulfate,0.205854
Conductivity,0.084144
Organic_carbon,0.066186
Trihalomethanes,0.039462
Turbidity,0.076185


**Model Evaluation:**

To assess model performance, I used several metrics computed on the held-out test set: accuracy, precision, recall, F1-score, and the confusion matrix. Accuracy alone is a weak guide here because the classes are imbalanced: 568 of the 900 test samples are non-potable, so always predicting "not potable" would already score 0.63. F1-score, which balances precision and recall, is more informative about how well each model identifies potable water. The confusion matrices show which kind of error each model makes, which matters for public health because a false positive (unsafe water labeled potable) is more harmful than a false negative.
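
As a worked check, the class-1 (potable) metrics can be recomputed by hand from the four confusion matrices printed above, using scikit-learn's `[[TN, FP], [FN, TP]]` layout. The dictionary keys and the helper `potable_class_metrics` are hypothetical names introduced for this sketch.

```python
# Confusion matrices copied from the outputs above, laid out as [[TN, FP], [FN, TP]]
matrices = {
    "tree_entropy_depth3": [[552, 16], [278, 54]],
    "tree_gini_depth7": [[527, 41], [268, 64]],
    "random_forest": [[514, 54], [232, 100]],
    "gradient_boosting": [[519, 49], [255, 77]],
}

def potable_class_metrics(matrix):
    """Return accuracy plus precision, recall, and F1 for class 1 (hypothetical helper)."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics_by_model = {name: potable_class_metrics(m) for name, m in matrices.items()}

# The model that ranks first depends on which metric is used
best_by = {
    metric: max(metrics_by_model, key=lambda name: metrics_by_model[name][metric])
    for metric in ("accuracy", "precision", "recall", "f1")
}

for name, metrics in metrics_by_model.items():
    print(name, {metric: round(value, 2) for metric, value in metrics.items()})
print(best_by)
```

Ranking by accuracy, recall, or F1 selects the random forest, while ranking by class-1 precision selects the depth-3 entropy tree, so the choice of metric does change which model looks best.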

**Performance Comparison:**

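Of the three model types, gradient boosting (labeled XGBoost above) performed about the same as the single decision trees (accuracy 0.66, class-1 F1-score 0.34), while Random Forest came out ahead on most metrics. It has the highest accuracy (0.68), the highest class-1 recall (0.30), and the highest class-1 F1-score (0.41), with the most true positives (100) and the fewest false negatives (232). It does not win on every metric, though: it also produces the most false positives (54), and its class-1 precision (0.65) is below that of the depth-3 entropy tree (0.77, with only 16 false positives).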

**Recommendation:**

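Based on both the test-set evaluation and cross-validation, Random Forest is the most effective of the models tried when judged by accuracy or F1-score. The margin is small, however, and its recall for potable water is only 0.30, so it should be tuned further before being relied on. If the priority is to avoid labeling unsafe water as potable, precision on the potable class is the better metric, and by that measure the depth-3 entropy tree would be preferred.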
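
One way to explore the precision/recall trade-off without changing the model is to move the probability threshold used to call a sample potable. The sketch below assumes synthetic data and a random forest with default settings; `potable_proba`, `threshold_report`, and the three thresholds are illustrative choices, not values tuned for the water dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (an assumption; not the water dataset)
X_th, y_th = make_classification(
    n_samples=1000, n_features=9, n_informative=5,
    weights=[0.63, 0.37], random_state=42,
)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_th, y_th, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(Xt_train, yt_train)
potable_proba = forest.predict_proba(Xt_test)[:, 1]  # estimated P(class 1)

# A higher threshold predicts "potable" less often: recall cannot rise,
# and precision usually (not always) improves.
threshold_report = []
for threshold in (0.3, 0.5, 0.7):
    predicted = (potable_proba >= threshold).astype(int)
    threshold_report.append((
        threshold,
        precision_score(yt_test, predicted, zero_division=0),
        recall_score(yt_test, predicted, zero_division=0),
        int(predicted.sum()),
    ))

for threshold, precision, recall, n_positive in threshold_report:
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  "
          f"recall={recall:.2f}  predicted_potable={n_positive}")
```

If false positives are the costlier error, a threshold above 0.5 trades recall for fewer unsafe samples labeled potable; the right value would need to be chosen on validation data, not on the test split.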