In [None]:
##Q1.
To preprocess the dataset for building a random forest classifier, we need to handle missing values and encode categorical variables. Scaling numerical features is usually unnecessary for tree-based models such as random forests, but it is covered here for completeness. Let's go through each step:

Handling Missing Values:

Load the dataset and check for missing values.
If missing values are present, decide on an appropriate strategy to handle them. Common strategies include:
Removing instances with missing values.
Removing features with a high number of missing values.
Imputing missing values with the mean, median, mode, or a more advanced method like K-nearest neighbors (KNN) imputation.
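As a minimal sketch of the strategies listed above, the following applies row removal, median imputation, and KNN imputation to a small toy table. The column names `age` and `chol` and the values are made up for illustration; they are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with gaps; names and values are illustrative only
toy = pd.DataFrame({
    "age": [63.0, np.nan, 41.0, 56.0],
    "chol": [233.0, 250.0, np.nan, 236.0],
})

# Count missing values per column
print(toy.isnull().sum())

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: replace each gap with the column median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy), columns=toy.columns
)

# Option 3: replace each gap with the mean of the 2 nearest rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)

print(median_filled)
```

Dropping rows is the simplest option but discards data; imputation keeps every row at the cost of introducing estimated values.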
Encoding Categorical Variables:

Identify the categorical variables in the dataset.
Encode them into numerical values suitable for the random forest classifier.
Common encoding methods for categorical variables include one-hot encoding and label encoding.
Scaling Numerical Features (if necessary):

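Check if any numerical features in the dataset need scaling.
Random forests split on feature thresholds, so they are largely insensitive to feature scale. Scaling matters mainly if the same features are also fed to distance-based or gradient-based models, where different scales or units distort the result.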
Common scaling methods include standardization (subtracting the mean and dividing by the standard deviation) or normalization (scaling the values to a specified range, such as [0, 1]).
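To make the two scaling methods concrete, here is a small sketch on a single made-up column of five values. It is only meant to show what each transformation does to the numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative single-feature data (one column, five rows)
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(values)

# Normalization: rescale linearly so the values span [0, 1]
normalized = MinMaxScaler().fit_transform(values)

print(standardized.ravel())
print(normalized.ravel())
```

After standardization the column has mean 0 and standard deviation 1; after normalization the smallest value maps to 0 and the largest to 1.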
Here's a code example in Python that demonstrates how to preprocess the dataset for a random forest classifier:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the dataset
data_url = 'https://drive.google.com/uc?export=download&id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ'
df = pd.read_csv(data_url)

# Separate features (X) and target variable (y)
X = df.drop('target', axis=1)
y = df['target']

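# Identify column types before any transformation
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['number']).columns

# Handle missing values: median for numerical columns, most frequent value for categorical ones
# Note: for a strict evaluation, fit imputers and scalers on the training split only
print(X.isnull().sum())
if len(numerical_cols) > 0:
    X[numerical_cols] = SimpleImputer(strategy='median').fit_transform(X[numerical_cols])
if len(categorical_cols) > 0:
    X[categorical_cols] = SimpleImputer(strategy='most_frequent').fit_transform(X[categorical_cols])

# Encode categorical variables (skipped when the dataset has none)
if len(categorical_cols) > 0:
    encoder = OneHotEncoder(drop='first')
    X_encoded = encoder.fit_transform(X[categorical_cols])
    encoded_df = pd.DataFrame(X_encoded.toarray(),
                              columns=encoder.get_feature_names_out(categorical_cols),
                              index=X.index)
    # Replace categorical columns with encoded values
    X = pd.concat([X.drop(categorical_cols, axis=1), encoded_df], axis=1)

# Scale the original numerical features (optional: random forests are insensitive to feature scale)
if len(numerical_cols) > 0:
    scaler = StandardScaler()
    X[numerical_cols] = scaler.fit_transform(X[numerical_cols])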

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the random forest classifier
rf_classifier = RandomForestClassifier()

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Evaluate the classifier
accuracy = rf_classifier.score(X_test, y_test)
print("Accuracy:", accuracy)

Make sure to install the required dependencies like pandas, scikit-learn, and numpy before running the code.

Note: The right imputation, encoding, and scaling choices depend on the dataset. You may need to adjust the code based on the specific requirements of your dataset.
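One way to keep preprocessing from seeing the test set is to wrap it in a scikit-learn `Pipeline` with a `ColumnTransformer`, so imputers and encoders are fitted on the training split only. The sketch below uses a synthetic table; the column names `age`, `chol`, and `sex` and the rule that generates `target` are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical columns, made-up labelling rule)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(30, 80, size=n).astype(float),
    "chol": rng.normal(240, 40, size=n),
    "sex": rng.choice(["F", "M"], size=n),
})
demo.loc[::25, "chol"] = np.nan          # inject some missing values
demo["target"] = (demo["age"] > 55).astype(int)

X = demo.drop("target", axis=1)
y = demo["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Numerical columns: median imputation; categorical: impute, then one-hot encode
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "chol"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["sex"]),
])

# Preprocessing and model are fitted together on the training data only
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
```

Because the whole pipeline is a single estimator, it can also be passed directly to cross-validation or a hyperparameter search.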


In [None]:
##Q2.

To split the dataset into a training set and a test set, we can use the train_test_split function from the scikit-learn library. This function allows us to randomly divide the dataset into two portions based on the specified test size. In this case, we'll use a test size of 30%, meaning the training set will contain 70% of the data, and the test set will contain 30% of the data.

Here's the code to split the dataset into a training set and a test set:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data_url = 'https://drive.google.com/uc?export=download&id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ'
df = pd.read_csv(data_url)

# Separate features (X) and target variable (y)
X = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the training and testing sets
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)


Make sure to install the required dependencies like pandas and scikit-learn before running the code.

In the code above, we first load the dataset and separate the features (X) from the target variable (y). Then, we use the train_test_split function to split the data into training and testing sets, with a test size of 0.3 (or 30%). The random_state parameter is set to 42 for reproducibility, but you can change it to any desired value.

After running the code, you'll see the shapes of the training and testing sets printed, indicating the number of instances and features in each set.
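For classification tasks it can help to preserve the class balance in both splits. The sketch below shows the optional `stratify` argument of `train_test_split` on made-up arrays with a 70/30 class ratio; whether stratification is needed for the heart disease data is an assumption to verify against its class distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 rows, 2 features, 70 negatives and 30 positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo keeps the 70/30 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Training set shape:", X_tr.shape)
print("Testing set shape:", X_te.shape)
print("Positive rate (train, test):", y_tr.mean(), y_te.mean())
```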


In [None]:
##Q3.

To train a random forest classifier on the given dataset to predict the risk of heart disease, we'll follow these steps:

Step 1: Download and Load the Dataset

Download the dataset from the provided link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link
Once downloaded, load the dataset into your Python environment.
Step 2: Preprocess the Dataset

Perform any necessary preprocessing steps on the dataset, such as handling missing values, encoding categorical variables, or scaling numerical features. Ensure that the dataset is in the appropriate format for training the random forest classifier.
Step 3: Split the Dataset into Training and Testing Sets

Split the preprocessed dataset into a training set and a testing set. The training set will be used to train the random forest classifier, while the testing set will be used to evaluate its performance.
Step 4: Train the Random Forest Classifier

Import the necessary libraries for building the random forest classifier (e.g., scikit-learn).
Create an instance of the random forest classifier with the desired hyperparameters.
Fit the classifier to the training data using the fit method.
Step 5: Evaluate the Classifier

Use the trained classifier to make predictions on the testing set.
Evaluate the performance of the classifier using appropriate metrics (e.g., accuracy, precision, recall, F1-score).
Optionally, you can also perform cross-validation or tune hyperparameters to further optimize the model's performance.
Here's some sample code that demonstrates the implementation of the above steps:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Download and Load the Dataset
dataset_url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = dataset_url.split("/")[-2]
download_url = f"https://drive.google.com/uc?id={file_id}"
df = pd.read_csv(download_url)

# Step 2: Preprocess the Dataset (if needed)

# Step 3: Split the Dataset into Training and Testing Sets
X = df.drop('target', axis=1)  # Assuming 'target' column contains the labels
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10)
rf_classifier.fit(X_train, y_train)

# Step 5: Evaluate the Classifier
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Make sure to preprocess the dataset as needed and adjust the code according to your specific requirements.
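The optional cross-validation step mentioned above could look roughly like this. The sketch runs on synthetic data from `make_classification` rather than the heart disease file, and the choice of F1 as the scoring metric is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)

# 5-fold cross-validation: the model is refitted on 4 folds and scored on the 5th
scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="f1")

print("F1 per fold:", scores)
print("Mean F1:", scores.mean())
```

Averaging over several folds gives a more stable estimate than a single train/test split, especially on small datasets.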


In [None]:
##Q4.

To evaluate the performance of the random forest classifier on the test set using accuracy, precision, recall, and F1-score, we need to compute these metrics based on the predicted labels and the true labels. Here's an updated version of the code that includes the evaluation metrics:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Download and Load the Dataset
dataset_url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = dataset_url.split("/")[-2]
download_url = f"https://drive.google.com/uc?id={file_id}"
df = pd.read_csv(download_url)

# Step 2: Preprocess the Dataset (if needed)

# Step 3: Split the Dataset into Training and Testing Sets
X = df.drop('target', axis=1)  # Assuming 'target' column contains the labels
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10)
rf_classifier.fit(X_train, y_train)

# Step 5: Evaluate the Classifier
y_pred = rf_classifier.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

In this code, we calculate the accuracy, precision, recall, and F1-score using scikit-learn's accuracy_score, precision_score, recall_score, and f1_score functions, respectively. By default, precision_score, recall_score, and f1_score assume a binary target with 1 as the positive class. Adjust the code as needed and preprocess the dataset according to your specific requirements.
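To see where these numbers come from, the metrics can be derived by hand from the confusion matrix. The labels below are made up for illustration and are not model output.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative true labels and predictions (1 = disease, 0 = no disease)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # predicted positives that are real
recall = tp / (tp + fn)                      # real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

The hand-computed values agree with `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` on the same inputs.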

In [None]:
##Q5.

To identify the top 5 most important features in predicting heart disease risk using a random forest classifier and visualize the feature importances, you can use the following code:
    
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Step 1: Download and Load the Dataset
dataset_url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = dataset_url.split("/")[-2]
download_url = f"https://drive.google.com/uc?id={file_id}"
df = pd.read_csv(download_url)

# Step 2: Preprocess the Dataset (if needed)

# Step 3: Split the Dataset into Features and Target
X = df.drop('target', axis=1)  # Assuming 'target' column contains the labels
y = df['target']

# Step 4: Train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10)
rf_classifier.fit(X, y)

# Get feature importances
importances = rf_classifier.feature_importances_
feature_names = X.columns

# Sort feature importances in descending order
indices = importances.argsort()[::-1]
top_features = feature_names[indices][:5]
top_importances = importances[indices][:5]

# Step 5: Visualize the Feature Importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(top_importances)), top_importances, tick_label=top_features)
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.title('Top 5 Feature Importances')
plt.xticks(rotation=45)
plt.show()

In this code, we train a random forest classifier using the entire dataset, extract the feature importances using the feature_importances_ attribute of the trained classifier, and then sort the importances in descending order. We select the top 5 features and their corresponding importances, and visualize them using a bar chart.

Make sure to adjust the code as needed and preprocess the dataset according to your specific requirements. Additionally, you may want to customize the plot further to suit your preferences.
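The built-in `feature_importances_` values are computed from the training data and can favor features with many distinct values. As a cross-check, permutation importance on held-out data is a common alternative. The sketch below uses synthetic data from `make_classification`, so the feature indices it prints have no relation to the heart disease columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in test accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

# Indices of the 5 features whose shuffling hurts accuracy the most
top5 = result.importances_mean.argsort()[::-1][:5]
for i in top5:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```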



In [None]:
##Q6.


To tune the hyperparameters of the random forest classifier using grid search or random search and evaluate the performance using 5-fold cross-validation, you can use the following code as a starting point:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score

# Step 1: Download and Load the Dataset
dataset_url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = dataset_url.split("/")[-2]
download_url = f"https://drive.google.com/uc?id={file_id}"
df = pd.read_csv(download_url)

# Step 2: Preprocess the Dataset (if needed)

# Step 3: Split the Dataset into Features and Target
X = df.drop('target', axis=1)  # Assuming 'target' column contains the labels
y = df['target']

# Step 4: Define the Hyperparameter Grid for Search
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Step 5: Perform Grid Search or Random Search
# Grid Search (uncomment to use instead of random search)
# classifier = RandomForestClassifier()
# grid_search = GridSearchCV(classifier, param_grid, cv=5)
# grid_search.fit(X, y)
# best_params = grid_search.best_params_

# Random Search
classifier = RandomForestClassifier()
random_search = RandomizedSearchCV(classifier, param_grid, cv=5)
random_search.fit(X, y)
best_params = random_search.best_params_

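# Step 6: Evaluate the Best Model using Cross-Validation
# Note: random_search.best_score_ already holds the mean 5-fold score for best_params;
# re-scoring on the same data, as below, tends to give a slightly optimistic estimate.
best_classifier = RandomForestClassifier(**best_params)
cv_scores = cross_val_score(best_classifier, X, y, cv=5)
mean_cv_score = cv_scores.mean()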

print("Best Hyperparameters:", best_params)
print("Mean Cross-Validation Score:", mean_cv_score)


In this code, we first load and preprocess the dataset. Then, we define a parameter grid with different values for the hyperparameters we want to tune. Next, we perform either grid search or random search using GridSearchCV or RandomizedSearchCV, respectively. Finally, we create a random forest classifier with the best parameters found, and evaluate its performance using 5-fold cross-validation.

You can uncomment the desired search method (GridSearchCV or RandomizedSearchCV) based on your preference. Adjust the code as needed and preprocess the dataset according to your specific requirements. Additionally, you may want to consider expanding the search space by adding more values or hyperparameters to the parameter grid for a more comprehensive search.
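As a sketch of the wider search space suggested above, `RandomizedSearchCV` can also sample from distributions rather than fixed lists. The ranges, `n_iter=5`, and the synthetic data below are assumptions chosen to keep the example fast; they are not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Integer distributions to sample from (upper bounds are exclusive)
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(3, 10),
    "min_samples_leaf": randint(1, 4),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled configurations
    cv=5,              # 5-fold cross-validation per configuration
    random_state=0,    # makes the sampled configurations reproducible
)
search.fit(X_demo, y_demo)

print("Best Hyperparameters:", search.best_params_)
print("Mean Cross-Validation Score:", search.best_score_)
```

With the default `refit=True`, `search.best_estimator_` is already retrained on all the data passed to `fit`, so it can be used for prediction without constructing a new classifier.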


In [None]:
##Q7.

To report the best set of hyperparameters found by the search and the corresponding performance metrics, as well as compare the performance of the tuned model with the default model, you can use the following code as a continuation of the previous code:


import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Download and Load the Dataset
dataset_url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = dataset_url.split("/")[-2]
download_url = f"https://drive.google.com/uc?id={file_id}"
df = pd.read_csv(download_url)

# Step 2: Preprocess the Dataset (if needed)

# Step 3: Split the Dataset into Features and Target
X = df.drop('target', axis=1)  # Assuming 'target' column contains the labels
y = df['target']

# Step 4: Split the Dataset into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Define the Hyperparameter Grid for Search
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Step 6: Perform Grid Search or Random Search
# Grid Search
# classifier = RandomForestClassifier()
# grid_search = GridSearchCV(classifier, param_grid, cv=5)
# grid_search.fit(X_train, y_train)
# best_params = grid_search.best_params_

# Random Search
classifier = RandomForestClassifier()
random_search = RandomizedSearchCV(classifier, param_grid, cv=5)
random_search.fit(X_train, y_train)
best_params = random_search.best_params_

# Step 7: Train and Evaluate the Best Model
best_classifier = RandomForestClassifier(**best_params)
best_classifier.fit(X_train, y_train)
y_pred_best = best_classifier.predict(X_test)

accuracy_best = accuracy_score(y_test, y_pred_best)
precision_best = precision_score(y_test, y_pred_best)
recall_best = recall_score(y_test, y_pred_best)
f1_best = f1_score(y_test, y_pred_best)

# Step 8: Train and Evaluate the Default Model
default_classifier = RandomForestClassifier()
default_classifier.fit(X_train, y_train)
y_pred_default = default_classifier.predict(X_test)

accuracy_default = accuracy_score(y_test, y_pred_default)
precision_default = precision_score(y_test, y_pred_default)
recall_default = recall_score(y_test, y_pred_default)
f1_default = f1_score(y_test, y_pred_default)

# Step 9: Print the Results
print("Best Hyperparameters:", best_params)
print("\nPerformance Metrics - Best Model:")
print("Accuracy:", accuracy_best)
print("Precision:", precision_best)
print("Recall:", recall_best)
print("F1-score:", f1_best)

print("\nPerformance Metrics - Default Model:")
print("Accuracy:", accuracy_default)
print("Precision:", precision_default)
print("Recall:", recall_default)
print("F1-score:", f1_default)

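In this code, after splitting the dataset into training and testing sets, we define the hyperparameter grid and perform either grid search or random search on the training set. We then train one model with the best hyperparameters and one with the default settings, and print the accuracy, precision, recall, and F1-score of each on the test set for comparison.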



In [None]:
##Q8.

To plot the decision boundaries of the random forest classifier on a scatter plot of two of the most important features, we can use the following code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Step 1: Download and Load the Dataset
dataset_url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = dataset_url.split("/")[-2]
download_url = f"https://drive.google.com/uc?id={file_id}"
df = pd.read_csv(download_url)

# Step 2: Preprocess the Dataset (if needed)

# Step 3: Split the Dataset into Features and Target
X = df.drop('target', axis=1)  # Assuming 'target' column contains the labels
y = df['target']

# Step 4: Train the Random Forest Classifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X, y)

# Step 5: Get the Most Important Features
importances = rf_classifier.feature_importances_
feature_names = X.columns
indices = np.argsort(importances)[::-1]
top_features = feature_names[indices][:2]  # Select the top 2 features

# Step 6: Prepare Data for Scatter Plot and Refit on the Two Selected Features
# (the model trained on all features cannot predict on a two-column grid)
X_top = X[top_features]
rf_top = RandomForestClassifier(random_state=42)
rf_top.fit(X_top.values, y)

# Step 7: Create Meshgrid for Decision Boundaries
# A fixed number of grid points keeps the grid small regardless of feature scale
x_min, x_max = X_top.iloc[:, 0].min() - 1, X_top.iloc[:, 0].max() + 1
y_min, y_max = X_top.iloc[:, 1].min() - 1, X_top.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Step 8: Predict and Plot Decision Boundaries
Z = rf_top.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.8)

# Step 9: Plot the Scatter Plot
plt.scatter(X_top.iloc[:, 0], X_top.iloc[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='k')
plt.xlabel(top_features[0])
plt.ylabel(top_features[1])
plt.title('Decision Boundaries of Random Forest Classifier')
plt.colorbar()

plt.show()

