# Section 3: Machine Learning

## Objectives:

The Machine Learning notebook builds on the ETL and Visualisations notebook. It uses the cleaned data in "corals_worldwide_dataset_cleaned.csv"
to train several candidate machine learning models on the Coral Reef dataset and select the best-performing one.

## Input Data:

The input data can be found under "Dataset/cleaned/".
The file is named:

"corals_worldwide_dataset_cleaned.csv"

___

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.

* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/danielledelouw/Documents/code_institute/Coral_Reef_AI/Coral_Reef_AI/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/danielledelouw/Documents/code_institute/Coral_Reef_AI/Coral_Reef_AI'
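The directory change above runs unconditionally, so re-running the cell moves one level further up each time. Below is a minimal sketch of a guarded variant. It assumes the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above); the helper name `project_root_from` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def project_root_from(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent directory only when current_dir is the notebooks folder."""
    current = Path(current_dir)
    # Move up one level only if we are inside the notebooks subfolder;
    # otherwise assume we are already at the project root.
    if current.name == notebooks_folder:
        return current.parent
    return current


# Calling the helper twice gives the same result, unlike a plain "go to parent"
once = project_root_from("/tmp/Coral_Reef_AI/jupyter_notebooks")
twice = project_root_from(once)
print(once == twice)
```

In the notebook this could be used as `os.chdir(project_root_from(os.getcwd()))`.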

## Load Python Libraries

In [5]:
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
import numpy as np

## Load Data

In [6]:
# Load the cleaned coral reef dataset from the specified CSV file
df = pd.read_csv('Dataset/cleaned/corals_worldwide_dataset_cleaned.csv')
df

Unnamed: 0,name,salinity,January_temp,June_temp,area,latitude,longitude,type_of_sea,corals,silt_or_sulfide
0,Adriatic Sea,38.298527,15.658799,20.855299,138000,43,15,2,1,0
1,Adriatic Sea,38.304909,16.297098,19.501200,138000,43,15,2,1,0
2,Adriatic Sea,38.462040,16.251598,19.028500,138000,43,15,2,1,0
3,Adriatic Sea,38.121601,15.709500,22.882999,138000,43,15,2,1,0
4,Adriatic Sea,38.519196,15.733400,21.824799,138000,43,15,2,1,0
...,...,...,...,...,...,...,...,...,...,...
2446,Yellow Sea,31.611076,8.349999,19.500000,380000,38,123,3,0,1
2447,Yellow Sea,31.468084,8.441801,19.800000,380000,38,123,3,0,1
2448,Yellow Sea,31.600788,8.432699,20.700000,380000,38,123,3,0,1
2449,Yellow Sea,31.533226,8.102799,19.000000,380000,38,123,3,0,1
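Before modelling, it can help to check for missing values and for the balance of the target classes. The sketch below shows both checks on a toy frame that reuses a few of the dataset's column names; the values are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names; the values are invented
toy = pd.DataFrame({
    "salinity": [38.3, 38.5, None, 31.6, 31.5],
    "January_temp": [15.7, 16.3, 16.2, 8.3, 8.4],
    "corals": [1, 1, 1, 0, 0],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Share of each target class (0 = absence of corals, 1 = presence of corals)
class_share = toy["corals"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.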


In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
# Features: all columns except 'name' and 'corals'
# Target: 'corals'
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['name', 'corals'], axis=1),  # Features
    df['corals'],                         # Target variable
    test_size=0.2,                        # 20% for testing, 80% for training
    random_state=101                      # For reproducibility
)

# Print the shapes of the resulting train and test sets
print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:", X_test.shape, y_test.shape)


* Train set: (1960, 8) (1960,) 
* Test set: (491, 8) (491,)
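The split above is purely random. When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both sets. This is a sketch on a synthetic target, not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 100 zeros and 600 ones (illustrative counts)
y = np.array([0] * 100 + [1] * 600)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the share of each class in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

# The mean of a 0/1 target is the share of class 1 in each set
print(y_tr.mean(), y_te.mean())
```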


In [None]:
from sklearn.pipeline import Pipeline

# Feature engineering and preprocessing imports
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import CategoricalImputer
from feature_engine.encoding import OrdinalEncoder

# Feature scaling
from sklearn.preprocessing import StandardScaler

# Feature selection
from sklearn.feature_selection import SelectFromModel

# Machine learning algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier

# Function to build a pipeline for model optimization
def PipelineOptimization(model):
  # Create a pipeline with the following steps:
  pipeline_base = Pipeline([
    # 1. Impute missing values in numerical features using the median
    ('median', MeanMedianImputer(
      imputation_method='median',
      variables=[
        'salinity', 'January_temp', 'June_temp', 
        'area', 'latitude', 'longitude', 'type_of_sea', 'silt_or_sulfide'
      ]
    )),
    # 2. (Optional) Impute missing values in categorical features (currently commented out)
    # ('categorical_imputer', CategoricalImputer(imputation_method='frequent', variables=[ ])),
    # 3. (Optional) Ordinal encoding for categorical variables (currently commented out)
    # ("ordinal", OrdinalEncoder(encoding_method='arbitrary', variables=['name'])),
    # 4. Standardize features by removing the mean and scaling to unit variance
    ("feat_scaling", StandardScaler()),
    # 5. Feature selection using the provided model
    ("feat_selection", SelectFromModel(model)),
    # 6. Final estimator (model)
    ("model", model),
  ])
  return pipeline_base
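To see how the steps of such a pipeline work together, here is a minimal sketch on synthetic data. It uses scikit-learn's `SimpleImputer` as a stand-in for feature_engine's `MeanMedianImputer`, so it is not the pipeline defined above, only the same sequence of steps under that assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 8 features, of which only a few are informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X[::25, 0] = np.nan  # inject some missing values into the first feature

model = RandomForestClassifier(n_estimators=20, random_state=0)
pipe = Pipeline([
    ("median", SimpleImputer(strategy="median")),  # stand-in for MeanMedianImputer
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(model)),
    ("model", model),
])
pipe.fit(X, y)

# The imputer fills the gaps, so prediction works on input that contains NaN
predictions = pipe.predict(X)
print(np.isnan(X).any(), predictions.shape)
```

Because imputation and scaling live inside the pipeline, they are re-fitted on the training folds only during cross-validation.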

In [None]:
from sklearn.model_selection import GridSearchCV

# Define a class for hyperparameter optimization using GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        # Store the models and their corresponding parameter grids
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        # Fit GridSearchCV for each model in the dictionary
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])  # Create pipeline with the model

            params = self.params[key]  # Get parameter grid for the model
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X, y)  # Fit GridSearchCV
            self.grid_searches[key] = gs  # Store the fitted GridSearchCV object

    def score_summary(self, sort_by='mean_score'):
        # Summarize the grid search results for all models
        def row(key, scores, params):
            # Helper function to create a summary row for each parameter set
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']  # Parameter sets tested
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]  # Scores for each split
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)  # Combine scores from all splits
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))  # Add summary row

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)  # Create summary DataFrame

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches  # Return summary DataFrame and fitted GridSearchCV objects
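The summary relies on the per-fold entries `split0_test_score`, `split1_test_score`, ... that `GridSearchCV` stores in `cv_results_`. The sketch below shows that structure on synthetic data with a plain decision tree; the data and the small grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

n_folds = 3
gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {"max_depth": [2, 4]}, cv=n_folds, scoring="accuracy")
gs.fit(X, y)

# One row per parameter combination, one column per fold
per_split = np.column_stack(
    [gs.cv_results_[f"split{i}_test_score"] for i in range(n_folds)]
)

# Row-wise statistics, analogous to min_score / mean_score / max_score
summary = {
    "min_score": per_split.min(axis=1),
    "mean_score": per_split.mean(axis=1),
    "max_score": per_split.max(axis=1),
}
print(per_split.shape)  # (2, 3)
```

Note that iterating with `range(cv)` only works when `cv` is an integer, as it is in this notebook.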

In [None]:
# Define a dictionary of machine learning models to be used for hyperparameter search
models_search = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),         # Decision Tree model
    "RandomForestClassifier": RandomForestClassifier(random_state=0),         # Random Forest model
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0), # Gradient Boosting model
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),             # Extra Trees model
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),                 # AdaBoost model
}

In [None]:
# Define the hyperparameter grids for each classifier to be used in GridSearchCV
params_search = {
    "DecisionTreeClassifier":{},  # No hyperparameters specified for DecisionTreeClassifier
    "RandomForestClassifier":{    # Specify hyperparameters to search for RandomForestClassifier
        "model__n_estimators":[50, 20],      # Number of trees in the forest
        "model__max_depth":[None, 3, 10]     # Maximum depth of the tree
    },
    "GradientBoostingClassifier":{}, # No hyperparameters specified for GradientBoostingClassifier
    "ExtraTreesClassifier":{},       # No hyperparameters specified for ExtraTreesClassifier
    "AdaBoostClassifier":{},         # No hyperparameters specified for AdaBoostClassifier
}

In [None]:
# Redefine the hyperparameter grids with default settings only.
# This overrides the grid from the previous cell, so every model is fitted with its defaults
params_search = {
    "DecisionTreeClassifier":{},        # No hyperparameters specified for DecisionTreeClassifier
    "RandomForestClassifier":{},        # No hyperparameters specified for RandomForestClassifier
    "GradientBoostingClassifier":{},    # No hyperparameters specified for GradientBoostingClassifier
    "ExtraTreesClassifier":{},          # No hyperparameters specified for ExtraTreesClassifier
    "AdaBoostClassifier":{},            # No hyperparameters specified for AdaBoostClassifier
}
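The number of candidates that `GridSearchCV` fits per model follows directly from the grid. As a small sketch, scikit-learn's `ParameterGrid` can count the combinations: an empty grid yields one candidate (the defaults), while the Random Forest grid with two and three values yields six. The `model__` prefix targets the pipeline step named `model`.

```python
from sklearn.model_selection import ParameterGrid

# An empty grid means "fit the estimator once with its default hyperparameters"
empty_grid = {}

# Grid with 2 values for n_estimators and 3 values for max_depth
rf_grid = {
    "model__n_estimators": [50, 20],
    "model__max_depth": [None, 3, 10],
}

n_empty = len(ParameterGrid(empty_grid))
n_rf = len(ParameterGrid(rf_grid))
print(n_empty)  # 1
print(n_rf)     # 6
```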

In [None]:
# Initialize the hyperparameter optimization search with the specified models and parameters
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)

# Fit the search object on the training data using accuracy as the scoring metric,
# utilizing all available processors (n_jobs=-1) and 2-fold cross-validation (cv=2)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-1, # use all available processors
           cv=2)


Running GridSearchCV for DecisionTreeClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits



Running GridSearchCV for RandomForestClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for GradientBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for ExtraTreesClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for AdaBoostClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits


In [None]:
# Get the grid search summary DataFrame and the dictionary of fitted GridSearchCV pipelines
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')

# Display the summary of grid search results, sorted by mean score
grid_search_summary

Unnamed: 0,estimator,min_score,mean_score,max_score,std_score
1,RandomForestClassifier,1.0,1.0,1.0,0.0
3,ExtraTreesClassifier,1.0,1.0,1.0,0.0
4,AdaBoostClassifier,0.997959,0.99898,1.0,0.00102
0,DecisionTreeClassifier,0.977551,0.988776,1.0,0.011224
2,GradientBoostingClassifier,0.977551,0.988776,1.0,0.011224
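Accuracy scores are easier to judge against a baseline. According to the training-set classification report further down in this notebook, the two classes have 277 and 1683 training rows, so a model that always predicts the majority class already reaches about 0.86 accuracy. A minimal sketch of that baseline with scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the training-set classification report in this notebook
y = np.array([0] * 277 + [1] * 1683)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(round(baseline_accuracy, 4))  # 0.8587
```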


In [None]:
# Select the best model name from the grid search summary (the first row's estimator)
best_model = grid_search_summary.iloc[0,0]
best_model

'RandomForestClassifier'

In [None]:
# Display the best hyperparameters found for the best model from the grid search
grid_search_pipelines[best_model].best_params_
# The hyperparameters were left at their defaults because the default model already reached 1.0 accuracy.

{}

In [None]:
# Retrieve the best pipeline (estimator) from the grid search results using the best model's name
best_pipeline = grid_search_pipelines[best_model].best_estimator_

# Display the best pipeline structure
best_pipeline

In [None]:
# Import necessary metrics from scikit-learn
from sklearn.metrics import classification_report, confusion_matrix

# Define a function to print confusion matrix and classification report
def confusion_matrix_and_report(X, y, pipeline, label_map):
  # Make predictions using the provided pipeline
  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  # Display the confusion matrix as a DataFrame for better readability.
  # Passing the predictions as y_true puts predictions on the rows and actual values on the columns.
  # labels=label_map keeps the matrix in the same class order as the row and column names;
  # without it the classes are sorted and the names built from label_map can be swapped
  print(pd.DataFrame(
    confusion_matrix(y_true=prediction, y_pred=y, labels=label_map),
    columns=[["Actual " + str(sub) for sub in label_map]],
    index=[["Prediction " + str(sub) for sub in label_map]]
  ))
  print("\n")

  print('---  Classification Report  ---')
  # Print the classification report
  print(classification_report(y, prediction), "\n")

# Define a function to evaluate classifier performance on train and test sets
def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test, y_test, pipeline, label_map)

In [23]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=best_pipeline,
                label_map = df['corals'].unique()
                # In this case, the target variable is encoded as categories and we
                # get the values with .unique() 
                )
# corals: 0 - absence of corals, 1 - presence of corals

#### Train Set #### 

---  Confusion Matrix  ---
             Actual 1 Actual 0
Prediction 1     1683        0
Prediction 0        0      277


---  Classification Report  ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       277
           1       1.00      1.00      1.00      1683

    accuracy                           1.00      1960
   macro avg       1.00      1.00      1.00      1960
weighted avg       1.00      1.00      1.00      1960
 

#### Test Set ####

---  Confusion Matrix  ---
             Actual 1 Actual 0
Prediction 1      425        0
Prediction 0        0       66


---  Classification Report  ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        66
           1       1.00      1.00      1.00       425

    accuracy                           1.00       491
   macro avg       1.00      1.00      1.00       491
weighted avg       1.00      1.00      1.00      
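When reading a confusion matrix it helps to keep scikit-learn's orientation in mind: rows follow the first argument (`y_true`), columns follow the second (`y_pred`), and classes appear in sorted order unless `labels` is given. A small sketch with hand-made labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows follow y_true and columns follow y_pred, with classes in sorted order (0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)

# An explicit labels argument fixes the class order instead (here 1, then 0)
cm_reversed = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_reversed)
```

Any row and column names attached to the matrix need to follow the same class order as the matrix itself.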

In [27]:
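# Feature names entering the pipeline (before feature selection)
feature_names = np.array(['salinity', 'January_temp', 'June_temp', 'area', 'latitude', 'longitude', 'type_of_sea', 'silt_or_sulfide'])

# SelectFromModel passes only a subset of the features to the model, so keep
# only the names of the features that were actually selected
selected_mask = best_pipeline.named_steps['feat_selection'].get_support()
selected_features = feature_names[selected_mask]

# Access the fitted model inside the pipeline
model = best_pipeline.named_steps['model']

# Get feature importances (works for tree-based models)
importances = model.feature_importances_

# Pair the selected feature names with their importances
feature_importance = list(zip(selected_features, importances))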

# Sort by importance, descending
feature_importance_sorted = sorted(feature_importance, key=lambda x: x[1], reverse=True)

# Print top features
print("Top features in the best model:")
for feature, importance in feature_importance_sorted[:3]:
    print(f"{feature}: {importance:.4f}")

Top features in the best model:
salinity: 0.6786
January_temp: 0.3214
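Because the pipeline contains a `SelectFromModel` step, the final model only sees the selected features, and its `feature_importances_` has one entry per selected feature. The sketch below shows how the selection mask from `get_support()` lines the names up with the importances. The data and the names `f0` to `f5` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])  # hypothetical feature names
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)
pipe = Pipeline([("feat_selection", SelectFromModel(model)), ("model", model)])
pipe.fit(X, y)

# Boolean mask of the features kept by the selection step
mask = pipe.named_steps["feat_selection"].get_support()
selected = names[mask]

# One importance per selected feature, in the same order as `selected`
importances = pipe.named_steps["model"].feature_importances_

for name, importance in sorted(zip(selected, importances),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.4f}")
```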
