Random Forest Classification (Iris Dataset)           
📌 Concepts Covered: Basic Random Forest classifier, feature importance.
📌 Dataset: Iris Dataset (from sklearn.datasets)
📌 Goal: Classify different species of Iris flowers using Random Forest.

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Feature importance
importances = rf.feature_importances_
for feature, importance in zip(iris.feature_names, importances):
    print(f'{feature}: {importance:.3f}')


Accuracy: 1.00
sepal length (cm): 0.104
sepal width (cm): 0.045
petal length (cm): 0.417
petal width (cm): 0.434
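
A perfect score on a single 45-sample test split can be optimistic. As a rough cross-check, the sketch below runs 5-fold cross-validation on the full Iris data with the same forest settings. The fold count and the `cv_scores` name are choices made here for illustration, not part of the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the same dataset as the cell above
iris = load_iris()
X, y = iris.data, iris.target

# Same forest settings; cv=5 is an assumption made for this sketch
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(rf, X, y, cv=5)

# One accuracy value per fold, plus the mean and spread
print('Fold accuracies:', cv_scores.round(3))
print(f'Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})')
```

If the mean across folds is noticeably lower than the single-split figure, the split was probably an easy one.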


Predicting House Prices with Random Forest            
📌 Concepts Covered: Regression, bagging, handling real-world data.
📌 Dataset: California Housing Dataset (sklearn.datasets.fetch_california_housing)
📌 Goal: Predict house prices based on different features.

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

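# Predict and evaluate
# The target is the median house value in units of $100,000,
# so the MAE is expressed in the same units.
y_pred = rf.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae:.2f}')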


Mean Absolute Error: 0.33
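
Bagging trains each tree on a bootstrap sample, so every tree leaves some rows unseen ("out-of-bag"). Those rows can serve as a built-in validation set. The sketch below illustrates this with `oob_score=True`. It uses a small synthetic regression problem from `make_regression` as a stand-in so that it runs without downloading the housing data; the sample size, noise level, and the `oob_mae` name are assumptions made here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (assumption: not the California Housing dataset)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# oob_score=True asks the forest to score each row using only
# the trees that did not see that row during training
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# oob_score_ is an R^2 value; oob_prediction_ holds the per-row OOB predictions
oob_mae = mean_absolute_error(y, rf.oob_prediction_)
print(f'OOB R^2: {rf.oob_score_:.2f}')
print(f'OOB MAE: {oob_mae:.2f}')
```

The same two arguments could be passed to the regressor in the cell above to get an out-of-bag estimate alongside the test-set MAE.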


Customer Churn Prediction (Random Forest for Classification)            
📌 Concepts Covered: Overfitting, feature engineering, categorical data handling.
📌 Dataset: Synthetic data from sklearn.datasets.make_classification (stand-in for a Kaggle customer churn dataset)
📌 Goal: Predict whether a customer will churn based on their usage behavior.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Convert to DataFrame
df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(10)])
df['Churn'] = y

# Split data
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Churn']), df['Churn'], test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 0.88
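
One simple way to look for overfitting is to compare training accuracy with test accuracy. The sketch below does this on the same synthetic data for two forests: one with default settings and one regularized with `min_samples_leaf=10`. The value 10 and the `settings` and `results` names are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic data and split as the cell above
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Two configurations to compare (min_samples_leaf=10 is an arbitrary example value)
settings = {
    'unrestricted': {},
    'min_samples_leaf=10': {'min_samples_leaf': 10},
}

results = {}
for name, params in settings.items():
    rf = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    rf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rf.predict(X_train))
    test_acc = accuracy_score(y_test, rf.predict(X_test))
    results[name] = (train_acc, test_acc)
    print(f'{name}: train={train_acc:.2f}, test={test_acc:.2f}, gap={train_acc - test_acc:.2f}')
```

A large gap between the two numbers suggests the forest is memorizing the training rows; a smaller gap with similar test accuracy suggests the regularized model generalizes about as well.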


Fraud Detection using Random Forest           
📌 Concepts Covered: Handling imbalanced data, ensemble learning.
📌 Dataset: Synthetic imbalanced data (about 90/10) from sklearn.datasets.make_classification (stand-in for the Kaggle Credit Card Fraud Detection dataset)
📌 Goal: Detect fraudulent transactions using Random Forest.

In [8]:
from sklearn.datasets import make_classification
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from imblearn.under_sampling import RandomUnderSampler

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)

# Convert to DataFrame
df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(10)])
df['Class'] = y

# Handle class imbalance using under-sampling.
# Note: resampling happens before the split, so the test set below is balanced as well.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(df.drop(columns=['Class']), df['Class'])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 0.90
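
Accuracy on a rebalanced test set says little about how a model behaves on the original class mix. As an alternative sketch, the code below splits first with stratification, leaves the test set imbalanced, and handles the imbalance through `class_weight='balanced'` rather than under-sampling. This swaps in class weighting for the under-sampling used above, and it reports precision and recall for the minority class; these choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Same synthetic imbalanced data as the cell above
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)

# Split first; stratify keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight='balanced' reweights classes inversely to their frequency
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

# Evaluate the minority class (label 1) on the untouched, imbalanced test set
y_pred = rf.predict(X_test)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f'Minority share in test set: {y_test.mean():.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
```

Recall shows how many of the minority-class rows were caught, and precision shows how many flagged rows were truly in that class, which is usually more informative than accuracy for this kind of problem.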


Predicting Loan Default (Random Forest with Hyperparameter Tuning)          
📌 Concepts Covered: Hyperparameter tuning, cross-validation, overfitting reduction.
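📌 Dataset: UCI Banknote Authentication dataset (stand-in for a Kaggle loan default dataset)
📌 Goal: Tune a binary classifier of the kind used for loan default prediction; with the stand-in data, the model classifies banknotes as genuine or forged.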

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
df = pd.read_csv(url, header=None)

# Assign column names
df.columns = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']

# Split into features and target
X = df.drop(columns=['Class'])
y = df['Class']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
}

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predict and evaluate
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Best Model Accuracy: {accuracy:.2f}')


Best Model Accuracy: 0.99
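
After a grid search it is often useful to see which combination won and how close the others came. The sketch below inspects `best_params_`, `best_score_`, and `cv_results_` on a fitted `GridSearchCV`. To stay runnable offline it uses a small synthetic dataset and a reduced grid; both are assumptions made here, and the same attributes are available on the `grid_search` object from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with four features (assumption: not the banknote data)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reduced grid so the example finishes quickly
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [3, None],
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Winning combination and its mean cross-validated accuracy
print('Best params:', grid_search.best_params_)
print(f'Best CV accuracy: {grid_search.best_score_:.2f}')

# Mean and spread of the CV score for every combination tried
cv = grid_search.cv_results_
for params, mean, std in zip(cv['params'], cv['mean_test_score'], cv['std_test_score']):
    print(f'{params}: {mean:.3f} (std {std:.3f})')

# The refit best estimator can be scored directly on held-out data
print(f'Held-out accuracy: {grid_search.score(X_test, y_test):.2f}')
```

When several combinations score within one standard deviation of the best, the simpler one (fewer or shallower trees) is often a reasonable pick.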
