# **Comparative Analysis of Decision Tree and Random Forest Classifiers on Iris and Wearable Sensor Datasets**

# **Project Description**
This project explores the application of machine learning models on two distinct datasets for classification tasks:

**1. Iris Dataset:** Predicting Flower Species
The first dataset is the classic Iris dataset, which consists of 150 samples of iris flowers, each described by four features: sepal length, sepal width, petal length, and petal width. The goal is to classify these samples into one of three species: setosa, versicolor, or virginica.

**Using a Random Forest Classifier**:

1. The dataset was split into training and testing sets.
2. The model was trained to predict the species based on the input features.
3. Out-of-bag (OOB) accuracy was calculated to evaluate the performance of the model without using additional test data.
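
The OOB estimate in step 3 is possible because each tree in a random forest is trained on a bootstrap sample (rows drawn with replacement), so every tree leaves some training rows unseen and those rows can act as validation data for it. The sketch below is illustrative and not part of the original notebook: it estimates the fraction of rows left out of one bootstrap sample and compares it with the closed form (1 - 1/n)^n. The helper name `oob_fraction` and the trial count are assumptions made for this example.

```python
import numpy as np


def oob_fraction(n_samples: int, n_trials: int = 200, seed: int = 0) -> float:
    """Average fraction of rows that a bootstrap sample of size n_samples leaves out."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trials):
        # Draw n_samples row indices with replacement, as bagging does for one tree
        drawn = rng.integers(0, n_samples, size=n_samples)
        n_unique = np.unique(drawn).size
        fractions.append(1.0 - n_unique / n_samples)
    return float(np.mean(fractions))


n = 120  # size of the Iris training split (80% of 150 samples)
empirical = oob_fraction(n)
theoretical = (1 - 1 / n) ** n  # probability that a given row is never drawn

print(f"Empirical OOB fraction:   {empirical:.3f}")
print(f"Theoretical (1 - 1/n)^n:  {theoretical:.3f}")
```

Roughly a third of the training rows are out-of-bag for any single tree, which is why a forest with many trees can score almost every training sample without a separate validation set.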


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


iris = load_iris()
X = iris.data
y = iris.target
print(X)
print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# oob_score=True scores each training sample using only the trees whose
# bootstrap sample did not include it
rf_classifier = RandomForestClassifier(n_estimators=50, random_state=42, oob_score=True)

rf_classifier.fit(X_train, y_train)

oob_accuracy = rf_classifier.oob_score_
print(f"Out-of-Bag Accuracy: {oob_accuracy:.2f}")
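
The cell above creates `X_test` and `y_test` but reports only the OOB score. As a cross-check, the OOB estimate can be compared with accuracy on the held-out split. The following is a minimal, self-contained sketch that repeats the same settings (`n_estimators=50`, `random_state=42`, `test_size=0.2`); the variable names `rf` and `test_accuracy` are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=50, random_state=42, oob_score=True)
rf.fit(X_train, y_train)

# OOB accuracy is computed from the training rows only
print(f"Out-of-Bag Accuracy: {rf.oob_score_:.2f}")

# Held-out accuracy uses the 20% of rows the forest never saw
test_accuracy = accuracy_score(y_test, rf.predict(X_test))
print(f"Test Accuracy:       {test_accuracy:.2f}")
```

If the two numbers are close, the OOB score is behaving as a reasonable stand-in for a separate test set on this data.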



**2. Wearable Devices Dataset**
**Objective:** Predict the activity type (e.g., lying, walking, running) based on sensor data from wearable devices.

**Approach:**

1. Sensor features, including heart rate, steps, and calories burned, were used as inputs.
2. Decision Tree and Random Forest classifiers were applied to determine the activity type.
3. Model parameters for Random Forest were optimized to improve prediction accuracy.

In [None]:

df = pd.read_csv('https://raw.githubusercontent.com/neo-stark-team/Datasets/main/smartwatch.csv')
df['activity'].unique()


array(['Lying', 'Sitting', 'Self Pace walk', 'Running 3 METs',
       'Running 5 METs', 'Running 7 METs'], dtype=object)

In [None]:
df

| | Unnamed: 0 | X1 | age | gender | height | weight | steps | hear_rate | calories | distance | entropy_heart | entropy_setps | resting_heart | corr_heart_steps | norm_heart | intensity_karvonen | sd_norm_heart | steps_times_distance | device | activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 20 | 1 | 168.0 | 65.4 | 10.771429 | 78.531302 | 0.344533 | 0.008327 | 6.221612 | 6.116349 | 59.0 | 1.000000 | 19.531302 | 0.138520 | 1.000000 | 0.089692 | apple watch | Lying |
| 1 | 2 | 2 | 20 | 1 | 168.0 | 65.4 | 11.475325 | 78.453390 | 3.287625 | 0.008896 | 6.221612 | 6.116349 | 59.0 | 1.000000 | 19.453390 | 0.137967 | 1.000000 | 0.102088 | apple watch | Lying |
| 2 | 3 | 3 | 20 | 1 | 168.0 | 65.4 | 12.179221 | 78.540825 | 9.484000 | 0.009466 | 6.221612 | 6.116349 | 59.0 | 1.000000 | 19.540825 | 0.138587 | 1.000000 | 0.115287 | apple watch | Lying |
| 3 | 4 | 4 | 20 | 1 | 168.0 | 65.4 | 12.883117 | 78.628260 | 10.154556 | 0.010035 | 6.221612 | 6.116349 | 59.0 | 1.000000 | 19.628260 | 0.139208 | 1.000000 | 0.129286 | apple watch | Lying |
| 4 | 5 | 5 | 20 | 1 | 168.0 | 65.4 | 13.587013 | 78.715695 | 10.825111 | 0.010605 | 6.221612 | 6.116349 | 59.0 | 0.982816 | 19.715695 | 0.139828 | 0.241567 | 0.144088 | apple watch | Lying |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6259 | 6260 | 3666 | 46 | 0 | 157.5 | 71.4 | 1.000000 | 35.000000 | 20.500000 | 1.000000 | 0.000000 | 0.000000 | 35.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | fitbit | Running 7 METs |
| 6260 | 6261 | 3667 | 46 | 0 | 157.5 | 71.4 | 1.000000 | 35.000000 | 20.500000 | 1.000000 | 0.000000 | 0.000000 | 35.0 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | fitbit | Running 7 METs |
| 6261 | 6262 | 3668 | 46 | 0 | 157.5 | 71.4 | 1.000000 | 35.000000 | 20.500000 | 1.000000 | 0.000000 | 0.000000 | 35.0 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | fitbit | Running 7 METs |
| 6262 | 6263 | 3669 | 46 | 0 | 157.5 | 71.4 | 1.000000 | 35.000000 | 20.500000 | 1.000000 | 0.000000 | 0.000000 | 35.0 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | fitbit | Running 7 METs |


In [None]:
df.isnull().sum()

Unnamed: 0              0
X1                      0
age                     0
gender                  0
height                  0
weight                  0
steps                   0
hear_rate               0
calories                0
distance                0
entropy_heart           0
entropy_setps           0
resting_heart           0
corr_heart_steps        0
norm_heart              0
intensity_karvonen      0
sd_norm_heart           0
steps_times_distance    0
device                  0
activity                0
dtype: int64

In [None]:
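# 'device' holds text labels ('apple watch', 'fitbit'), which scikit-learn
# estimators cannot fit on directly. The original encoding step is not shown,
# so one-hot encoding is assumed here.
X = pd.get_dummies(df.drop(['activity'], axis=1), columns=['device'])

y = df['activity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)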
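
With six activity classes, a purely random split can leave the class proportions in the training and test sets slightly different. One common option, not used in the original notebook, is to pass `stratify=y` to `train_test_split`. The sketch below demonstrates the effect on a small made-up frame; the `toy` DataFrame and its values are hypothetical and only borrow column names from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature frame: 15 'Lying' rows and 5 'Running 7 METs' rows
toy = pd.DataFrame({
    "steps": [float(i) for i in range(20)],
    "hear_rate": [60.0 + i for i in range(20)],
    "activity": ["Lying"] * 15 + ["Running 7 METs"] * 5,
})

X_toy = toy.drop(columns=["activity"])
y_toy = toy["activity"]

# stratify=y_toy keeps the 75% / 25% class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2, stratify=y_toy
)

train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print("Train class shares:\n", train_share)
print("Test class shares:\n", test_share)
```

On the real data the same argument would be `stratify=y`; whether it changes the reported accuracies has not been checked here.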


In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)

y_pred_dt = decision_tree.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print("Decision Tree Accuracy:", accuracy_dt)


Decision Tree Accuracy: 0.8020750199521149
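
A single accuracy figure does not show which classes are confused with one another. A confusion matrix and a per-class report give that breakdown. The sketch below uses Iris as a stand-in so that it runs without downloading the smartwatch file; in the notebook the same two calls could be applied to `y_test` and `y_pred_dt`. The names `tree`, `cm`, and `labels` are chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=2, stratify=iris.target
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = tree.predict(X_te)

labels = [0, 1, 2]  # fix the class order so rows and columns are predictable

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred, labels=labels)
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(
    y_te, y_pred, labels=labels, target_names=iris.target_names
))
```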


In [None]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=150)
random_forest.fit(X_train, y_train)

y_pred_rf = random_forest.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)


Random Forest Accuracy: 0.8627294493216281
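
A fitted forest also exposes `feature_importances_`, which can indicate which inputs drive its predictions. The sketch below is a stand-in example on Iris, not a result from the wearable data; in the notebook the equivalent would pair `random_forest.feature_importances_` with `X.columns`. The names `forest` and `importances` are assumptions for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(n_estimators=150, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importances; they are normalized to sum to 1 across features
importances = pd.Series(
    forest.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can be biased toward features with many distinct values, so they are best read as a rough ranking rather than an exact measure.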


In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV


rf_classifier = RandomForestClassifier()

param_grid = {
    'n_estimators': [10, 20, 30],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# 3 x 3 x 3 x 3 = 81 parameter combinations, each scored with 5-fold cross-validation
grid_search = GridSearchCV(rf_classifier, param_grid, cv=5)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_rf_classifier = grid_search.best_estimator_


print("Best Parameters:", best_params)

test_accuracy = best_rf_classifier.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)


Best Parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 30}
Test Accuracy: 0.88268156424581
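
`best_params_` reports only the winning combination. The `cv_results_` attribute of a fitted `GridSearchCV` holds the cross-validated score of every combination, which helps judge how much the choice of parameters actually matters. The sketch below uses Iris and a deliberately reduced grid so that it runs quickly; `small_grid`, `search`, and `results` are names chosen for this example, and the grid values are not the ones used above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Reduced grid (2 x 2 = 4 combinations) to keep the example fast
small_grid = {"n_estimators": [10, 30], "max_depth": [2, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
search.fit(X, y)

# One row per parameter combination, best-ranked first
results = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(results.to_string(index=False))
```

If the top-ranked rows differ by less than their `std_test_score`, the gain from tuning is within the noise of the cross-validation folds.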


**Results**

**Model Performance (wearable devices dataset, 20% held-out test set):**

- Decision Tree Accuracy: 80.21%
- Random Forest Accuracy (150 trees, default settings): 86.27%
- Test Accuracy after Hyperparameter Tuning: 88.27% (Random Forest with optimized parameters)

**Best Random Forest Parameters:**

- max_depth: 20
- min_samples_leaf: 1
- min_samples_split: 2
- n_estimators: 30

**Observations:**

- Random Forest outperformed the single Decision Tree by about 6 percentage points on the test set (86.27% vs. 80.21%). This is consistent with its ensemble approach: averaging many trees trained on bootstrap samples reduces the overfitting of a single deep tree and improves generalization.
- Hyperparameter tuning raised the Random Forest test accuracy by a further 2 percentage points (86.27% to 88.27%).
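
The figures above come from a single train/test split. One way to check whether the gap between the two models holds up is to compare them with k-fold cross-validation. The sketch below does this on synthetic data from `make_classification`, used purely as a stand-in so the example is self-contained; the dataset shape, seeds, and the names `X_demo`, `y_demo`, `models`, and `scores` are all assumptions, and the printed numbers say nothing about the wearable dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem with some uninformative features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=6, n_classes=3, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=150, random_state=0),
}

# Five accuracy values per model, one for each cross-validation fold
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5)
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Comparing the mean and spread across folds gives a steadier picture than one split; in the notebook the same call could be run with `X` and `y` in place of the synthetic arrays.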


**Conclusion**

On the wearable devices dataset, the **Random Forest Classifier** was more effective than the Decision Tree Classifier, reaching higher test accuracy thanks to its ensemble of trees.
The project also shows that hyperparameter tuning can improve performance further, here from 86.27% to 88.27% test accuracy.

The results highlight the potential of these models for classification tasks on both a small structured dataset (Iris) and more complex sensor data (wearable devices).

Comparing models and tuning hyperparameters are both important steps toward better accuracy and more reliable predictions in machine learning applications.