In [None]:
Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.

In [None]:
Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.

In [None]:
Preprocessing a dataset typically involves handling missing values, encoding categorical variables, and scaling numerical features if needed. Here's a step-by-step guide on how to perform these preprocessing steps:

Handling Missing Values:

Identify Missing Values: Start by identifying which columns in your dataset have missing values.

Impute Missing Values: Decide on a strategy to fill in missing values based on the nature of the data:

For numerical features, you can impute missing values with the mean, median, or mode of the respective feature.
For categorical features, you can impute missing values with the most frequent category or use a special category like "Unknown."
Pandas Example (Assuming 'df' is your DataFrame):

In [None]:
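import pandas as pd

# Impute missing values in numerical columns with the column mean
# (restricting to numeric columns avoids errors when df also holds text columns)
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Impute the remaining (categorical) gaps with the most frequent category of each column
df = df.fillna(df.mode().iloc[0])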
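
The "Identify Missing Values" step above has no code of its own, so here is a minimal sketch of it. The DataFrame `df_demo` and its column names (`age`, `chol`, `thal`) are made-up stand-ins for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the patient table
df_demo = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, np.nan, 204, 236],
    "thal": ["fixed", "normal", None, "normal"],
})

# Count missing values per column before deciding how to impute
missing_counts = df_demo.isna().sum()
print(missing_counts)

# Impute numeric columns with the median and categorical columns with the mode
num_cols = df_demo.select_dtypes(include="number").columns
cat_cols = df_demo.select_dtypes(exclude="number").columns
df_demo[num_cols] = df_demo[num_cols].fillna(df_demo[num_cols].median())
df_demo[cat_cols] = df_demo[cat_cols].fillna(df_demo[cat_cols].mode().iloc[0])

# Confirm that nothing is left missing
print(df_demo.isna().sum().sum())
```

The median is used here instead of the mean only because it is less sensitive to outliers; either choice fits the strategy described above.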


In [None]:
Encoding Categorical Variables:

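Ordinal Encoding: For ordinal categorical variables (categories with an inherent order), convert the categories into integers that preserve that order. Use the OrdinalEncoder from scikit-learn with an explicit category order. LabelEncoder is intended for target labels and assigns integers in sorted (alphabetical) order, which may not match the real ordering.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# List the categories from lowest to highest (the values below are placeholders)
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df[['ordinal_categorical_column']] = ordinal_encoder.fit_transform(df[['ordinal_categorical_column']])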


In [None]:
One-Hot Encoding: For nominal categorical variables (categories without an inherent order), use one-hot encoding to create binary columns for each category. Use the pd.get_dummies function in pandas or the OneHotEncoder from scikit-learn.

In [None]:
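import pandas as pd

# Using pandas
df = pd.get_dummies(df, columns=['nominal_categorical_column'])

# Using scikit-learn instead (an alternative to get_dummies; apply it to the original, un-encoded column)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn versions before 1.2
encoded_cols = encoder.fit_transform(df[['nominal_categorical_column']])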


In [None]:
Scaling Numerical Features (if necessary):

Feature Scaling: Scaling numerical features matters for algorithms that are sensitive to the scale of the input features, such as gradient descent-based and distance-based methods. Tree-based models such as random forests split on one feature threshold at a time, so they do not require scaling. Common scaling techniques include Min-Max scaling and Standardization (z-score scaling).

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numerical_cols = ['numerical_feature1', 'numerical_feature2']

# Option 1: Min-Max scaling (rescales each feature to the range [0, 1])
scaler = MinMaxScaler()

# Option 2: Standardization (z-score scaling); uncomment to use it instead
# scaler = StandardScaler()

# Apply only one scaler; running both in sequence discards the effect of the first
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])


In [None]:
Final Data Inspection: After preprocessing, it's essential to inspect your dataset to ensure that missing values are handled, categorical variables are encoded properly, and numerical features are scaled if necessary. Additionally, check for any outliers or anomalies in the data.

Remember that the specific preprocessing steps may vary depending on the nature of your dataset and the machine learning algorithm you plan to use. It's crucial to understand your data and choose the appropriate preprocessing techniques accordingly.
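
The three steps above (imputing, encoding, scaling) can also be bundled into a single scikit-learn transformer, which makes it easier to fit the preprocessing on the training data only. The following is a minimal sketch under stated assumptions: the toy DataFrame `train` and the column names `age`, `chol`, and `cp` are placeholders, not the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are placeholders
train = pd.DataFrame({
    "age": [63, 37, np.nan, 56, 57],
    "chol": [233, 250, 204, np.nan, 354],
    "cp": ["typical", "atypical", "non-anginal", np.nan, "typical"],
})
numeric_cols = ["age", "chol"]
categorical_cols = ["cp"]

# Numeric columns: fill gaps with the median, then standardize
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Learn the statistics from the training data and transform it in one call
X_prepared = preprocess.fit_transform(train)
print(X_prepared.shape)
```

On held-out data you would call `preprocess.transform(...)` rather than `fit_transform(...)`, so the medians, category lists, and scaling statistics come from the training set alone.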

In [None]:
Q2. Split the dataset into a training set (70%) and a test set (30%).

In [None]:
Splitting a dataset into a training set and a test set is a fundamental step in machine learning to evaluate the model's performance on unseen data. You can use various libraries in Python, such as scikit-learn, to perform this data split. Below is a step-by-step guide using scikit-learn:

Assuming you have a DataFrame df with your dataset and a target variable target_column:

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into features (X) and the target variable (y)
X = df.drop(columns=['target_column'])
y = df['target_column']

# Split the data into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# "test_size" parameter specifies the proportion of the dataset to include in the test split.
# "random_state" ensures reproducibility; you can use any integer value or leave it out for randomness.


In [None]:
After running this code, you will have:

X_train: The feature data (70%) for training your machine learning model.
X_test: The feature data (30%) for evaluating your model's performance.
y_train: The corresponding target values (70%) for training.
y_test: The corresponding target values (30%) for testing.
You can now use X_train and y_train to train your machine learning model, and X_test and y_test to evaluate its performance. Do not use the test set for training or tuning, so that the performance assessment stays unbiased.
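
With only 303 rows, a purely random split can leave the two sets with noticeably different class proportions. One optional refinement is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the real dataset) to show the resulting split sizes and class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 303 rows, 13 features, roughly 45% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = (rng.random(303) < 0.45).astype(int)

# stratify=y_demo keeps the share of each class nearly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print("train/test rows:", X_tr.shape[0], X_te.shape[0])
print("positive rate overall/train/test:", y_demo.mean(), y_tr.mean(), y_te.mean())
```

Whether stratification is needed depends on how balanced the target is; it is shown here as an option, not as part of the original solution.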

In [None]:
To train a Random Forest Classifier using scikit-learn with 100 trees and a maximum depth of 10 for each tree, you can follow these steps:

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest Classifier with 100 trees and a maximum depth of 10 for each tree
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Train the classifier on the training set
rf_classifier.fit(X_train, y_train)


In [None]:
Here's a breakdown of what this code does:

Import the RandomForestClassifier class from scikit-learn's ensemble module.

Create an instance of the Random Forest Classifier with the specified hyperparameters:

n_estimators: This parameter specifies the number of trees in the forest, which is set to 100.
max_depth: It sets the maximum depth of each decision tree to 10.
Initialize the random state with random_state=42 for reproducibility.

Fit (train) the Random Forest Classifier on the training set (X_train and y_train) using the .fit() method.

Now, your rf_classifier is trained and ready to make predictions on new data or be evaluated on the test set.
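
Because the task is framed as predicting risk, it can help to look at predicted probabilities as well as hard labels. This is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the heart disease dataset); the names `forest`, `labels`, and `risk` are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with the same number of rows and 13 features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Hard class labels for the first five rows
labels = forest.predict(X_demo[:5])

# Estimated probability of the positive class (column 1) for the same rows
risk = forest.predict_proba(X_demo[:5])[:, 1]

# Check that the hyperparameters were respected
deepest = max(tree.get_depth() for tree in forest.estimators_)
print("labels:", labels)
print("risk:", risk.round(2))
print("trees:", len(forest.estimators_), "deepest tree:", deepest)
```

The probabilities are averages over the trees, so treat them as rough scores; this sketch does not check whether they are well calibrated.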

In [None]:
Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

In [None]:
To evaluate the performance of your Random Forest Classifier on the test set using accuracy, precision, recall, and F1 score, you can follow these steps:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision
precision = precision_score(y_test, y_pred)

# Calculate recall
recall = recall_score(y_test, y_pred)

# Calculate F1 score
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")


In [None]:
Here's what this code does:

Import the necessary evaluation metrics from scikit-learn (accuracy_score, precision_score, recall_score, f1_score).

Use the trained Random Forest Classifier (rf_classifier) to make predictions on the test set (X_test) by calling the .predict() method.

Calculate the accuracy, precision, recall, and F1 score by comparing the predicted labels (y_pred) with the true labels (y_test) using the respective metric functions.

Print the evaluation metrics to the console.

Running this code will provide you with the accuracy, precision, recall, and F1 score, which are common metrics used to assess the performance of a classification model on a test dataset. These metrics provide insights into different aspects of model performance:

Accuracy: The fraction of all predictions that are correct.
Precision: The ratio of correctly predicted positive instances (true positives) to all predicted positive instances.
Recall: The ratio of correctly predicted positive instances to all actual positive instances.
F1 Score: The harmonic mean of precision and recall, useful when the classes are imbalanced.
These metrics collectively give you a good understanding of how well your Random Forest Classifier is performing on the test data.
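
To make the definitions above concrete, the same four metrics can be computed by hand from a confusion matrix. The labels below are made up for illustration (1 = disease, 0 = no disease).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(accuracy_manual, precision_manual, recall_manual, f1_manual)
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.8 and precision, recall, and F1 are all 0.75.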


In [None]:
Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.

In [None]:
To identify the top 5 most important features in predicting heart disease risk using the feature importance scores from your Random Forest Classifier and visualize them with a bar chart, you can follow these steps:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd  # needed for pd.DataFrame below

# Get feature importances from the trained Random Forest Classifier
feature_importances = rf_classifier.feature_importances_

# Get the names of the features (column names)
feature_names = X_train.columns

# Create a DataFrame to organize feature names and their importances
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Select the top 5 most important features
top_features = importance_df.head(5)

# Create a bar chart to visualize the feature importances
plt.figure(figsize=(10, 6))
plt.barh(top_features['Feature'], top_features['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.title('Top 5 Most Important Features for Heart Disease Prediction')
plt.gca().invert_yaxis()  # Invert the y-axis to display the most important feature at the top
plt.show()


In [None]:
Here's a breakdown of what this code does:

Import the libraries needed for plotting and for organizing the importances in a table.

Retrieve the feature importances from the trained Random Forest Classifier using the .feature_importances_ attribute.

Obtain the names of the features (column names) from the training data.

Create a DataFrame (importance_df) to organize the feature names and their corresponding importances.

Sort the DataFrame by importance in descending order to identify the most important features.

Select the top 5 most important features from the sorted DataFrame.

Create a horizontal bar chart using matplotlib to visualize the feature importances. The plt.barh() function is used to create the horizontal bar chart, and plt.gca().invert_yaxis() is used to display the most important feature at the top of the chart.

Running this code will generate a bar chart that visually represents the top 5 most important features for predicting heart disease risk based on the Random Forest Classifier's feature importances.
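
The `feature_importances_` attribute is impurity-based and is computed from the training data, so it can be worth cross-checking it against permutation importance measured on held-out data. The sketch below is one way to do that, assuming synthetic data and placeholder feature names (`f0` to `f12`).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=4, random_state=0
)
names = [f"f{i}" for i in range(13)]
X_demo = pd.DataFrame(X_demo, columns=names)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

# Put both importance measures side by side, sorted by permutation importance
comparison = pd.DataFrame(
    {"impurity": forest.feature_importances_, "permutation": result.importances_mean},
    index=names,
).sort_values("permutation", ascending=False)
print(comparison.head(5))
```

If the two rankings disagree strongly for a feature, that is a hint to treat its impurity-based score with caution.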

In [None]:
Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

In [None]:
Tuning the hyperparameters of a Random Forest Classifier using either grid search or random search with cross-validation is a common approach to finding the best combination of hyperparameters for your model. Here, I'll demonstrate how to perform hyperparameter tuning using grid search with 5-fold cross-validation as an example. You can modify it for random search if desired.

The code below imports the required classes, defines the hyperparameter grid, and runs the search:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Define the hyperparameter grid to search
param_grid = {
    'n_estimators': [50, 100, 150],  # Different values for the number of trees
    'max_depth': [10, 20, 30],       # Different values for maximum depth
    'min_samples_split': [2, 5, 10],  # Different values for minimum samples to split
    'min_samples_leaf': [1, 2, 4]     # Different values for minimum samples per leaf
}

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found by grid search
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the performance of the best model using cross-validation
best_rf_classifier = grid_search.best_estimator_
cv_scores = cross_val_score(best_rf_classifier, X_train, y_train, cv=5, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())


In [None]:
In this code:

Define a grid of hyperparameter values to search through using param_grid. You specify different values for the number of trees (n_estimators), maximum depth (max_depth), minimum samples required to split a node (min_samples_split), and minimum samples required per leaf node (min_samples_leaf).

Create a Random Forest Classifier (rf_classifier) with a fixed random state for reproducibility.

Use GridSearchCV to perform a grid search with 5-fold cross-validation. The cv parameter specifies the number of folds for cross-validation, and scoring is set to 'accuracy' to evaluate the models based on accuracy.

Fit the grid search to the training data (X_train and y_train) to find the best combination of hyperparameters.

Print the best hyperparameters found by grid search.

Retrieve the best estimator (Random Forest Classifier with the best hyperparameters) and evaluate its performance using cross-validation. Cross-validation scores are printed to assess model accuracy.

You can adjust the param_grid and scoring metric as needed to explore different hyperparameter combinations and evaluation metrics during the tuning process.
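
The grid above has 3 x 3 x 3 x 3 = 81 combinations, which means 405 model fits with 5-fold cross-validation. Random search caps that cost by sampling a fixed number of combinations. This is a minimal sketch of the random search variant on synthetic stand-in data; the ranges in `param_distributions` are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Distributions (or lists) to sample from; randint's upper bound is exclusive
param_distributions = {
    "n_estimators": randint(50, 151),
    "max_depth": [5, 10, 20, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

# Sample 10 combinations and score each with 5-fold cross-validation
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("Best Hyperparameters:", random_search.best_params_)
print("Mean CV Accuracy:", round(random_search.best_score_, 3))
```

As with `GridSearchCV`, `best_score_` is the mean cross-validated score of the best combination, and `best_estimator_` is already refitted on all the data passed to `fit`.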

In [None]:
Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.

In [None]:
To report the best set of hyperparameters found by the grid search and the corresponding performance metrics, and to compare the performance of the tuned model with the default model, you can use the following code:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Best hyperparameters found by grid search
best_params = grid_search.best_params_

# Create a Random Forest Classifier with the best hyperparameters
best_rf_classifier = RandomForestClassifier(random_state=42, **best_params)

# Fit the best model to the training data
best_rf_classifier.fit(X_train, y_train)

# Make predictions on the test set using the best model
y_pred_best = best_rf_classifier.predict(X_test)

# Calculate performance metrics for the best model
accuracy_best = accuracy_score(y_test, y_pred_best)
precision_best = precision_score(y_test, y_pred_best)
recall_best = recall_score(y_test, y_pred_best)
f1_score_best = f1_score(y_test, y_pred_best)

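# Performance metrics for the default model
# rf_classifier was re-created unfitted for the grid search, so fit a fresh default model here
default_rf_classifier = RandomForestClassifier(random_state=42)
default_rf_classifier.fit(X_train, y_train)
y_pred_default = default_rf_classifier.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)
precision_default = precision_score(y_test, y_pred_default)
recall_default = recall_score(y_test, y_pred_default)
f1_score_default = f1_score(y_test, y_pred_default)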

# Print the best hyperparameters and performance metrics
print("Best Hyperparameters:", best_params)
print("Performance Metrics for the Best Model:")
print(f"Accuracy: {accuracy_best:.2f}")
print(f"Precision: {precision_best:.2f}")
print(f"Recall: {recall_best:.2f}")
print(f"F1 Score: {f1_score_best:.2f}")

# Compare performance with the default model
print("\nPerformance Metrics for the Default Model:")
print(f"Accuracy: {accuracy_default:.2f}")
print(f"Precision: {precision_default:.2f}")
print(f"Recall: {recall_default:.2f}")
print(f"F1 Score: {f1_score_default:.2f}")


In [None]:
In this code:

We first retrieve the best hyperparameters found by the grid search using grid_search.best_params_.

Then, we create a Random Forest Classifier (best_rf_classifier) with the best hyperparameters.

We fit the best model to the training data and make predictions on the test set.

We calculate performance metrics (accuracy, precision, recall, and F1 score) for both the tuned model (best_rf_classifier) and a default model trained with scikit-learn's default hyperparameters.

Finally, we print the best hyperparameters and the performance metrics for both models to compare their performance.

This allows you to assess how the tuned model with the best hyperparameters performs compared to the default model. You can evaluate whether the hyperparameter tuning process has resulted in improved model performance.
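
The comparison is easier to read as a table. Here is a minimal sketch with a small helper function; `summarize` is a name introduced for illustration, the data is synthetic, and the "tuned" settings are hypothetical values rather than actual search results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def summarize(model, X, y):
    """Return the four test metrics for a fitted binary classifier."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred),
        "recall": recall_score(y, pred),
        "f1": f1_score(y, pred),
    }


# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

default_model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Hypothetical "tuned" settings, chosen only for illustration
tuned_model = RandomForestClassifier(
    n_estimators=150, max_depth=10, min_samples_leaf=2, random_state=42
).fit(X_tr, y_tr)

# One column per model, one row per metric
report = pd.DataFrame({
    "default": summarize(default_model, X_te, y_te),
    "tuned": summarize(tuned_model, X_te, y_te),
})
print(report.round(3))
```

On a test set of about 90 rows, a difference of one or two points between the columns can come from chance alone, so small gaps should not be over-interpreted.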

In [None]:
Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.

In [None]:
Interpreting the decision boundaries of a Random Forest Classifier can provide insights into how the model makes predictions. However, plotting the decision boundaries for a Random Forest, which is an ensemble of decision trees, can be challenging due to the complexity of the model. In practice, it's often more common to visualize decision boundaries for simple models like logistic regression or decision trees. Nevertheless, I'll provide some general guidance on how you can analyze and interpret a Random Forest Classifier.

To visualize decision boundaries for a Random Forest Classifier, we'll simplify the task by selecting two of the most important features and using them to create a scatter plot. Keep in mind that this is a simplified representation and may not fully capture the complexity of the Random Forest's decision boundaries.

Here's a step-by-step guide:

Identify the Two Most Important Features: You can refer to the feature importances obtained from the Random Forest Classifier to determine the two most important features.

Create a Scatter Plot: Select two features and create a scatter plot using these features as the x-axis and y-axis. You can use matplotlib for this.

Generate Decision Boundaries: Since Random Forest is an ensemble of decision trees, it's challenging to directly visualize its decision boundaries. One approach is to use a mesh grid of points that cover the feature space and classify each point using the Random Forest. Then, plot the decision regions as contours or color regions.

Here's some example code to get you started:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Select the two most important features (replace with actual feature names)
feature1 = 'Feature1'
feature2 = 'Feature2'

# Extract the corresponding columns from the dataset
X_selected = X_train[[feature1, feature2]]

# Fit a separate two-feature model so the full model (rf_classifier) is not overwritten
rf_2d = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_2d.fit(X_selected.values, y_train)

# Create a mesh grid with a fixed number of points per axis
# (a fixed step such as 0.01 can produce hundreds of millions of points for wide-ranging features)
x_min, x_max = X_selected[feature1].min() - 1, X_selected[feature1].max() + 1
y_min, y_max = X_selected[feature2].min() - 1, X_selected[feature2].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))

# Predict the class label for each point in the mesh grid
Z = rf_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision regions as filled contours
plt.contourf(xx, yy, Z, alpha=0.8)

# Scatter plot of the data points
plt.scatter(X_selected[feature1], X_selected[feature2], c=y_train, cmap=plt.cm.RdYlBu)
plt.xlabel(feature1)
plt.ylabel(feature2)
plt.title("Random Forest Classifier Decision Boundaries")
plt.show()


In [None]:
Interpreting the insights and limitations:

Insights: The decision boundaries in the scatter plot illustrate how the Random Forest Classifier separates different classes based on the selected features. You can observe regions where the model assigns different class labels. It provides a visual representation of the model's predictions in this simplified feature space.

Limitations: Keep in mind that this visualization simplifies the model's decision boundaries, which are inherently complex and may involve interactions between numerous features. Random Forests are powerful ensemble models, but they can be challenging to interpret directly. This visualization only captures a two-dimensional projection of the model's behavior. Understanding the full scope of the model's decision-making may require more advanced techniques such as partial dependence plots or SHAP (SHapley Additive exPlanations) values for feature importance and interpretation.

Interpreting Random Forest models often involves more than just visualizing decision boundaries. Additional techniques, such as feature importance analysis and model-agnostic interpretation methods, can provide deeper insights into how the model makes predictions.
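
One way to put a number on the limitation described above is to compare cross-validated accuracy for a model that sees only the two top-ranked features against one that sees all of them. The sketch below does this on synthetic stand-in data, so the printed values say nothing about the real heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with several informative features
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Column indices of the two features with the highest impurity-based importance
top_two = forest.feature_importances_.argsort()[::-1][:2]

# 5-fold cross-validated accuracy with all features versus only the top two
full_scores = cross_val_score(forest, X_demo, y_demo, cv=5)
two_scores = cross_val_score(forest, X_demo[:, top_two], y_demo, cv=5)

print("all features:", full_scores.mean().round(3))
print("top two only:", two_scores.mean().round(3))
```

Note that the top two features are picked using all rows before cross-validation, which slightly favors the two-feature model; a stricter version would select them inside each fold. A large gap between the two scores suggests that the two-dimensional plot leaves out much of what the full model relies on.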