### Short Coding Project: Decision Tree Classification

#### Project Overview

In this project, you will apply decision tree classification to predict the condition of culverts based on various environmental and physical attributes using the Augmented Culvert Dataset. You will preprocess the data, perform feature engineering, build and evaluate decision tree models, and explore advanced topics such as feature importance and hyperparameter tuning.

- Replace each `# YOUR CODE HERE` comment with your own code.
- **Do not change** the variable names.

### Load the Dataset

Start by loading the Augmented Culvert Dataset and examining its structure.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Load the dataset
url = 'https://raw.githubusercontent.com/CyConProject/Lab/main/Datasets/Augmented%20Culvert%20Dataset.csv'
df = pd.read_csv(url)

# Display the first few rows of the dataset
df.head()

### Question 1: Data Exploration

Explore the dataset to understand its structure and content.

1. Display the summary statistics of the numerical columns in the dataset.
2. Identify and print the names of categorical columns in the dataset.

**Hint for Part 2**: You can use the `dtypes` attribute of the DataFrame to check the data type of each column. Categorical columns often have the data type `'object'`, which means they contain text data. You can loop through all the columns and collect the names of columns where the data type is `'object'`.
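As a minimal sketch of the hint above, the snippet below summarizes a small made-up DataFrame and collects its text columns. The frame `toy` and its column names are illustrative assumptions, not part of the culvert dataset. Some pandas versions or configurations may infer a dedicated string dtype instead of `'object'`, so the sketch sets `dtype=object` explicitly to keep the check predictable.

```python
import pandas as pd

# Hypothetical toy data (not the culvert dataset)
toy = pd.DataFrame({
    "material": pd.Series(["steel", "concrete", "steel"], dtype=object),
    "span_m": [1.2, 3.4, 2.0],
    "region": pd.Series(["north", "south", "north"], dtype=object),
})

# describe() summarizes only the numeric columns by default
summary = toy.describe()
print(summary)

# Collect the names of columns whose dtype is 'object' (typically text)
text_columns = []
for col in toy.columns:
    if toy[col].dtype == "object":
        text_columns.append(col)

print("Text columns:", text_columns)
```

The same loop pattern applies to any DataFrame; only the column names change.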

In [None]:
# Display summary statistics of numerical columns
des = # YOUR CODE HERE
print(des)

# Identify categorical columns
categorical_columns = []
for col in df.columns:
    # YOUR CODE HERE

# Display the categorical columns
print("Categorical Columns:", categorical_columns)

### Question 2: Handle Missing Values

Handle missing values in the dataset.

1. **Identify columns with missing values** and the number of missing entries in each.
2. **Fill the missing values** in the `'Flooding_Frequency'` column with the mode (most frequent value).
3. **Verify that there are no more missing values** in the dataset.
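The three steps above can be sketched on a small made-up frame. The column names `flood_level` and `span_m` are hypothetical and chosen only for illustration. Note that `mode()` returns a Series because ties are possible, so the sketch takes its first entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries in one column
toy = pd.DataFrame({
    "flood_level": ["low", np.nan, "high", "low", np.nan],
    "span_m": [1.2, 3.4, 2.0, 2.5, 1.8],
})

# Step 1: count missing entries per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Step 2: fill with the most frequent value (mode ignores NaN by default)
most_frequent = toy["flood_level"].mode()[0]
toy["flood_level"] = toy["flood_level"].fillna(most_frequent)

# Step 3: confirm nothing is missing any more
remaining = toy.isnull().sum()
print(remaining)
```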

In [None]:
# Identify columns with missing values
missing_values = # YOUR CODE HERE
print("Missing Values:\n", missing_values)

# Fill missing values of 'Flooding_Frequency' column with mode
# YOUR CODE HERE

# Verify that there are no more missing values
missing_values_after = # YOUR CODE HERE
print("Missing Values After Filling:\n", missing_values_after)

### Question 3: Feature Engineering

Feature engineering is a crucial step in improving a model's performance because it allows us to create new variables that may reveal patterns or relationships not captured by the original features. By creating features like `'Age_Category'` and `'Length_to_Age_Ratio'`, we can simplify complex relationships, such as the effect of age and length, and make them more understandable to the model, potentially leading to better predictions.

1. Create a new feature `'Age_Category'` by binning the `'Age'` column into three categories: `'New'` (<=10 years), `'Moderate'` (>10 and <=30 years), and `'Old'` (>30 years).
2. Create a new feature `'Length_to_Age_Ratio'` by dividing `'length'` by `'Age'`.

**Hint:** When creating the `'Length_to_Age_Ratio'` feature, make sure to handle cases where `'Age'` is zero. Dividing a pandas column by zero does not raise an error; it silently produces `inf` (or `NaN` for 0/0), which can break model training later. Replace such values with a default (e.g., 0) so the model trains without problems.
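For comparison, here is a sketch of the same two ideas on made-up data, using `pd.cut` as an alternative to the `apply`-based function this notebook asks for. The column names `age_years` and `length_m` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the first row has age 0 to trigger a zero division
toy = pd.DataFrame({
    "age_years": [0, 5, 10, 11, 30, 45],
    "length_m": [12.0, 10.0, 20.0, 22.0, 15.0, 9.0],
})

# pd.cut uses right-closed intervals by default: (-inf, 10], (10, 30], (30, inf)
toy["age_band"] = pd.cut(
    toy["age_years"],
    bins=[-np.inf, 10, 30, np.inf],
    labels=["New", "Moderate", "Old"],
)

# Dividing by zero yields inf (or NaN for 0/0) rather than raising an error
toy["ratio"] = toy["length_m"] / toy["age_years"]

# Replace the non-finite results with a default of 0
toy["ratio"] = toy["ratio"].replace([np.inf, -np.inf], np.nan).fillna(0)

print(toy)
```

Whether 0 is a sensible default depends on the data; it is used here only because the hint suggests it.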

In [None]:
# Create 'Age_Category' feature
def categorize_age(age):
    # YOUR CODE HERE

df['Age_Category'] = df['Age'].apply(categorize_age)

# Create 'Length_to_Age_Ratio' feature
df['Length_to_Age_Ratio'] = # YOUR CODE HERE

# Handle division by zero if any
# YOUR CODE HERE

# Display the updated DataFrame
df.head()

### Question 4: Encode Categorical Variables

Most machine learning algorithms, including scikit-learn's decision trees, require numerical input data. Therefore, we need to encode categorical variables into numerical form. One common method is **one-hot encoding**, which replaces each categorical column with a set of binary columns, one per unique category, where 1 marks the category a row belongs to and 0 marks all others.

**Tasks:**

1. Update the list of categorical columns by including the new `'Age_Category'` feature created in Question 3.
2. Use `pd.get_dummies()` to perform one-hot encoding on these categorical columns, ensuring the encoded columns are in integer format (0s and 1s).

**Hint:** Use the `columns` parameter in `pd.get_dummies()` to specify the columns you want to encode and set `dtype` to ensure the encoded columns are integers. You can learn more in the [Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).
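A minimal sketch of one-hot encoding with the `columns` and `dtype` parameters, on a made-up frame whose column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
toy = pd.DataFrame({
    "material": ["steel", "concrete", "steel"],
    "span_m": [1.2, 3.4, 2.0],
})

# Encode only the listed columns; dtype=int gives 0/1 instead of False/True
encoded = pd.get_dummies(toy, columns=["material"], dtype=int)
print(encoded)
```

The numeric column is left untouched, and each category becomes its own column named `<original column>_<category>`.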

In [None]:
# Update the list of categorical columns to include 'Age_Category'
updated_categorical_columns = # YOUR CODE HERE

# One-hot encode the categorical columns
df_encoded = # YOUR CODE HERE

# Display the first few rows of the encoded dataset
df_encoded.head(10)

### Question 5: Split the Data into Training and Testing Sets

Split the dataset into training and testing sets.

- Use 75% of the data for training and 25% for testing.
- Set `random_state=42` for reproducibility.
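The following sketch shows how `test_size` and `random_state` behave on a small synthetic array. The names `features` and `labels` are placeholders and unrelated to the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each
features = np.arange(40).reshape(20, 2)
labels = np.array([0, 1] * 10)

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)

print("train:", f_train.shape, "test:", f_test.shape)
```

Running the split again with the same `random_state` reproduces the same partition.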

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = df_encoded.drop('Cul_rating', axis=1)
y = df_encoded['Cul_rating']

# Split the dataset
X_train, X_test, y_train, y_test = # YOUR CODE HERE

# Print the shapes of X_train and y_train
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

### Question 6: Train and Evaluate the Decision Tree Classifier

Initialize and train a decision tree classifier with customizable parameters to control overfitting.

**Tasks:**

1. Initialize the classifier with `random_state=42`, and set the following parameters to limit overfitting:
   - `max_depth=10`: This limits the depth of the tree, ensuring the model doesn't become too complex.
   - `min_samples_split=10`: This ensures that a node must have at least 10 samples before it can be split, reducing overfitting by preventing small, overly specific splits.
   - `min_samples_leaf=5`: This ensures that each leaf node has at least 5 samples, which helps smooth the model's predictions and avoid capturing noise in the training data.
2. Train the classifier on the training data.
3. Evaluate the model by calculating both the training and test accuracy scores.
4. Print the classification report for the test data.
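To illustrate why these parameters matter, the sketch below compares an unconstrained tree with a constrained one on synthetic data from `make_classification`. The dataset and all variable names are assumptions for illustration; results on the culvert data will differ.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the culvert dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# An unconstrained tree grows until it memorizes the training set
full_tree = DecisionTreeClassifier(random_state=42).fit(Xd_train, yd_train)

# A constrained tree stops earlier and keeps every leaf reasonably populated
small_tree = DecisionTreeClassifier(
    max_depth=10, min_samples_split=10, min_samples_leaf=5, random_state=42
).fit(Xd_train, yd_train)

for name, model in [("unconstrained", full_tree), ("constrained", small_tree)]:
    train_acc = accuracy_score(yd_train, model.predict(Xd_train))
    test_acc = accuracy_score(yd_test, model.predict(Xd_test))
    print(f"{name}: depth={model.get_depth()}, train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy is the usual sign of overfitting; comparing the two printed lines shows how the constraints affect that gap on this particular dataset.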

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Decision Tree Classifier with additional parameters
decision_tree = # YOUR CODE HERE

# Train the classifier
# YOUR CODE HERE

# Make predictions on the test data
y_pred = # YOUR CODE HERE

# Make predictions on the training data
y_train_pred = # YOUR CODE HERE

# Calculate the accuracy scores
train_accuracy = # YOUR CODE HERE
test_accuracy = # YOUR CODE HERE

# Generate the classification report for the test data
report = # YOUR CODE HERE

# Display the results
print(f"Training Accuracy of the Decision Tree Classifier: {train_accuracy:.2f}")
print(f"Test Accuracy of the Decision Tree Classifier: {test_accuracy:.2f}")
print("Classification Report for Test Data:\n", report)


### Question 7: Analyze Feature Importance (Advanced)

Feature importance helps identify which features have the most influence on the prediction of the target variable. Understanding feature importance can provide insights into the data and help in feature selection and model improvement.

**Hint**: Use the `feature_importances_` attribute of the trained decision tree model to get the importance of each feature. You can learn more about feature importance in the [scikit-learn documentation](https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.feature_importances_).

Extract the feature importances from the trained model. The rest of the cell is provided and draws a horizontal bar plot of the 10 most important features.
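As a small sketch of what `feature_importances_` contains, the snippet below fits a shallow tree on synthetic data and lists its top three features. The data and variable names are illustrative assumptions. The importances are normalized, so they sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the culvert dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
demo_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_demo, y_demo)

# One non-negative score per feature
importances_demo = demo_tree.feature_importances_

# argsort is ascending, so take the last three and reverse for largest-first
top3 = np.argsort(importances_demo)[-3:][::-1]
for i in top3:
    print(f"feature {i}: {importances_demo[i]:.3f}")

print("sum of importances:", importances_demo.sum())
```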

In [None]:
import matplotlib.pyplot as plt

# Extract feature importances
importances = # YOUR CODE HERE

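# Get indices of the top 10 features (argsort is ascending, so the most
# important feature comes last and is drawn at the top of the bar chart)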
indices = np.argsort(importances)[-10:]

# Plot the feature importances
plt.figure(figsize=(10,6))
plt.title('Top 10 Feature Importances')
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.ylabel('Feature')
plt.show()