First, we import the required libraries and define helper functions for connecting to the database.

In [1]:
import psycopg2
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [2]:
# Database connection parameters
db_params = {
    'host': '',
    'port': '',
    'database': 'postgres',
    'user': '',
    'password': ''
}

# Function to establish a database connection
def get_db_connection(db_params):
    try:
        conn = psycopg2.connect(
            host=db_params['host'],
            port=db_params['port'],
            database=db_params['database'],
            user=db_params['user'],
            password=db_params['password']
        )
        return conn
    except psycopg2.DatabaseError as e:
        print(f"Error: {e}")
        return None

# Function to retrieve data from a specific table
def get_table_data(db_params, table_name):
    conn = get_db_connection(db_params)
    if conn is None:
        return None
    
    try:
        query = f"SELECT * FROM {table_name};"
        df = pd.read_sql_query(query, conn)
        return df
    except Exception as e:
        print(f"Error: {e}")
        return None
    finally:
        conn.close()
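
Because `get_table_data` interpolates `table_name` directly into the SQL string, it may be worth checking the name before it reaches the query. The following is a minimal sketch under that assumption; the helper `quote_table_name` is hypothetical and not part of the original notebook, and it only accepts plain `table` or `schema.table` names.

```python
import re

# Hypothetical helper (not in the original notebook): accept only plain
# identifiers made of letters, digits, and underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_table_name(table_name):
    """Return a double-quoted 'table' or 'schema.table' name, or raise ValueError."""
    parts = table_name.split(".")
    if not 1 <= len(parts) <= 2:
        raise ValueError(f"Expected 'table' or 'schema.table', got: {table_name!r}")
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Invalid identifier: {part!r}")
    # Double quotes make PostgreSQL treat each part as a single identifier.
    return ".".join(f'"{part}"' for part in parts)


safe_name = quote_table_name("group12_warehouse.train_table")
query = f"SELECT * FROM {safe_name};"
print(query)
```

Quoted identifiers are case-sensitive in PostgreSQL, so this sketch assumes the stored names are lowercase, as the table names used in this notebook are. psycopg2 also ships a `psycopg2.sql` module for composing identifiers, which may be a better fit if the query is executed through a cursor.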

Next, we load the datasets that were cleaned in the pre-processing section (ILO5).

In [4]:
# Load the training, testing, and validation datasets from the database
train_table_name = 'group12_warehouse.train_table'
test_table_name = 'group12_warehouse.test_table'
validation_table_name = 'group12_warehouse.validation_table'

train_data = get_table_data(db_params, train_table_name)
test_data = get_table_data(db_params, test_table_name)
validation_data = get_table_data(db_params, validation_table_name)

if train_data is not None and test_data is not None and validation_data is not None:
    print(train_data.head())
    print(test_data.head())
    print(validation_data.head())
else:
    print("Failed to retrieve data from one or more tables.")

  Accident severity First Mode of Transport           Area Type  \
0               0.0      -0.350029334776985  0.3316993365651348   
1               0.0      -1.810821489630903  0.3316993365651348   
2               0.0      -0.350029334776985  0.3316993365651348   
3               0.0      -0.350029334776985  0.3316993365651348   
4               0.0      -0.350029334776985   -3.01477841455867   

      Light condition        Road Location       Road condition  \
0  0.6027962867863099  -0.8738398047923384    1.513226708878669   
1  0.6027962867863099   1.1443745117992685  -0.6608395121052414   
2  0.6027962867863099  -0.8738398047923384  -0.6608395121052414   
3  0.6027962867863099   1.1443745117992685  -0.6608395121052414   
4  -1.658935235535878   1.1443745117992685  -0.6608395121052414   

          Road surface        Road situation          Speed limit  \
0    1.784021822931772  -0.42384880344171616   1.9706356443111297   
1  -0.7354366365307541    1.0394771034977446  -0.1099889


Next, each dataset is split into features (independent variables) and the target variable (dependent variable). The train_data, test_data, and validation_data datasets each contain the feature columns along with the 'Accident severity' column, which records the severity of an accident. The code drops the 'Accident severity' column from each dataset to create the feature sets (X_train, X_test, and X_val), then stores that column separately as the target variables (y_train, y_test, and y_val).

This separation is what allows a model to learn the mapping from the features (X_train) to the known outcomes (y_train), make predictions on features it has not seen, and have those predictions compared against the actual outcomes to evaluate performance. The test set is used to assess how well the model generalizes, and the validation set supports fine-tuning.

In [5]:
# Split features and target variable.
# The target is selected with single brackets so that y is a 1-D Series,
# the conventional shape for a single-output target in scikit-learn.
X_train = train_data.drop(columns=['Accident severity'])
y_train = train_data['Accident severity']

X_test = test_data.drop(columns=['Accident severity'])
y_test = test_data['Accident severity']

X_val = validation_data.drop(columns=['Accident severity'])
y_val = validation_data['Accident severity']

## Decision Tree Model

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Predict on the testing data
y_pred_test = clf.predict(X_test)

# Generate the classification report for the test set
report_test = classification_report(y_test, y_pred_test)

# Predict on the validation data
y_pred_val = clf.predict(X_val)

# Generate the classification report for the validation set
report_val = classification_report(y_val, y_pred_val)

print("Classification Report for Test Set:")
print(report_test)
print("\n")
print("Classification Report for Validation Set:")
print(report_val)


Classification Report for Test Set:
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98       251
         1.0       0.78      0.98      0.87       251
         2.0       0.97      0.69      0.80       251

    accuracy                           0.89       753
   macro avg       0.90      0.89      0.88       753
weighted avg       0.90      0.89      0.88       753



Classification Report for Validation Set:
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99       508
         1.0       0.76      0.96      0.85       508
         2.0       0.95      0.68      0.79       508

    accuracy                           0.88      1524
   macro avg       0.90      0.88      0.88      1524
weighted avg       0.90      0.88      0.88      1524



The classification report for our decision tree model provides a detailed evaluation of its performance on both the test and validation datasets. The model achieved an overall accuracy of 89% on the test set and 88% on the validation set, indicating a strong predictive ability. 

For the test set, class 0 had an impressive precision and recall of 0.97 and 1.00, respectively, resulting in an F1-score of 0.98. Class 1 showed lower precision at 0.78 but high recall at 0.98, giving an F1-score of 0.87. Class 2 had a high precision of 0.97 but lower recall at 0.69, leading to an F1-score of 0.80. 

Similarly, on the validation set, class 0 maintained excellent precision and recall of 0.98 and 1.00, respectively, with an F1-score of 0.99. Class 1 had a precision of 0.76 and recall of 0.96, resulting in an F1-score of 0.85. Class 2 had a precision of 0.95 and recall of 0.68, leading to an F1-score of 0.79.

Overall, the model is highly effective at identifying class 0 on both datasets. For classes 1 and 2, precision and recall diverge: class 1 has high recall but lower precision, while class 2 has high precision but lower recall, which suggests that a number of class 2 samples are being predicted as class 1. This is the main area for potential improvement. Because every class has the same support, the macro and weighted averages coincide.
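
As a quick check on how the reported figures relate to each other, the sketch below recomputes an F1-score as the harmonic mean of precision and recall, and shows that with equal support per class the overall accuracy equals the mean of the per-class recalls. The inputs are the rounded test-set values from the report above, and the helper name `f1_from` is purely illustrative, so recomputed values can differ from the report in the last digit.

```python
def f1_from(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded per-class recall on the test set, copied from the report.
test_recall = {0: 1.00, 1: 0.98, 2: 0.69}

# Class 1 on the test set: precision 0.78, recall 0.98.
f1_class_1 = f1_from(0.78, 0.98)

# Each class has support 251, so accuracy is the plain mean of the recalls.
mean_recall = sum(test_recall.values()) / len(test_recall)

print(f"F1 for class 1: {f1_class_1:.2f}")
print(f"Mean recall (accuracy under equal support): {mean_recall:.2f}")
```

Both values agree with the report: 0.87 for the class 1 F1-score and 0.89 for the overall test accuracy.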

## Hyperparameter Tuning

In [7]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Define the parameter grid
param_grid = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier()

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_search.best_estimator_

# Predict on the testing data
y_pred_test = best_clf.predict(X_test)

# Generate the classification report for the test set
report_test = classification_report(y_test, y_pred_test)

# Predict on the validation data
y_pred_val = best_clf.predict(X_val)

# Generate the classification report for the validation set
report_val = classification_report(y_val, y_pred_val)

print("Best Parameters found by GridSearchCV:")
print(grid_search.best_params_)
print("\n")

print("Classification Report for Test Set:")
print(report_test)
print("\n")

print("Classification Report for Validation Set:")
print(report_val)


Best Parameters found by GridSearchCV:
{'criterion': 'entropy', 'max_depth': 40, 'min_samples_leaf': 1, 'min_samples_split': 2}


Classification Report for Test Set:
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98       251
         1.0       0.78      0.98      0.87       251
         2.0       0.97      0.69      0.80       251

    accuracy                           0.89       753
   macro avg       0.90      0.89      0.88       753
weighted avg       0.90      0.89      0.88       753



Classification Report for Validation Set:
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99       508
         1.0       0.78      0.96      0.86       508
         2.0       0.95      0.72      0.82       508

    accuracy                           0.89      1524
   macro avg       0.90      0.89      0.89      1524
weighted avg       0.90      0.89      0.89      1524



The best parameters found for the decision tree model are: {'criterion': 'entropy', 'max_depth': 40, 'min_samples_leaf': 1, 'min_samples_split': 2}
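
To give a sense of how much work the search involved, the grid can be enumerated with the standard library alone. This is only an illustrative sketch: it copies `param_grid` and the fold count from the cell above and counts candidates, without training anything.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV cell.
param_grid = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy'],
}
cv_folds = 5

# Each combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
# One fit per candidate per fold; the final refit of the best model is extra.
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

The grid contains 6 x 4 x 4 x 2 = 192 candidates, so 5-fold cross-validation performs 960 fits before the best estimator is refit on the full training set.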

The grid search selected these values as the best-performing combination. The 'criterion' parameter is 'entropy', which means the tree measures the quality of each split by the reduction in entropy, favouring splits that reduce uncertainty and produce more homogeneous branches. The 'max_depth' parameter is 40, so the tree can grow up to 40 levels deep. This allows the model to capture complex patterns but increases the risk of overfitting. The 'min_samples_leaf' parameter is 1, meaning that a leaf node may contain a single sample, and the 'min_samples_split' parameter is 2, meaning that any node with at least two samples can be considered for a further split. These last two values are scikit-learn's defaults and the least restrictive options in the grid, so the selected tree can keep growing until its leaves are pure or the depth limit is reached. Together, these parameters determine how the tree is structured and how closely it fits the training data; the purpose of tuning them is to balance model complexity against performance on unseen data.
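
To make the 'criterion' choice more concrete, the sketch below computes entropy and Gini impurity from the class counts in a node, along with the entropy reduction produced by a split. It is a simplified illustration rather than scikit-learn's implementation, and the function names and example counts are assumptions chosen for clarity.

```python
from math import log2


def class_probabilities(counts):
    """Turn per-class sample counts into probabilities, skipping empty classes."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def entropy(counts):
    """Shannon entropy in bits; 0.0 for a pure node."""
    return sum(-p * log2(p) for p in class_probabilities(counts))


def gini(counts):
    """Gini impurity; 0.0 for a pure node."""
    return 1.0 - sum(p * p for p in class_probabilities(counts))


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Illustrative node with three equally frequent classes.
balanced = [251, 251, 251]
print(f"entropy={entropy(balanced):.3f}, gini={gini(balanced):.3f}")

# A split that separates two classes perfectly removes all uncertainty.
print(information_gain([4, 4], [[4, 0], [0, 4]]))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why they often lead to similar trees.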

Hyperparameter tuning raised validation accuracy by only one percentage point (from 0.88 to 0.89), while test accuracy stayed at 0.89. Considering the model's initially high accuracy, this was not unexpected, as there was little room for improvement.