# Random Forest Algorithm in Machine Learning

## What is Random Forest?
Random Forest is a supervised machine learning algorithm that combines multiple decision trees to produce a more accurate and robust prediction. It can be used for both classification and regression tasks. 

The algorithm works by constructing multiple decision trees during training and aggregating their outputs for final predictions.

### Key Features:
- **Ensemble Learning**: Combines multiple models to improve performance.
- **Bagging (Bootstrap Aggregating)**: Each tree is trained on a bootstrap sample, i.e. rows drawn at random with replacement from the training data.
- **Feature Randomness**: Splits are made based on a random subset of features at each node.
- **Reduces Overfitting**: Averaging the predictions of many decorrelated trees lowers variance, which reduces the risk of overfitting compared with a single deep tree.
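
To make the bagging step concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy. The sample size and seed are arbitrary assumptions chosen for illustration. Because rows are drawn with replacement, only about 63% of the distinct rows (1 - 1/e) are expected to land in any one sample; the rest are "out-of-bag" for that tree.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n_samples = 10_000              # hypothetical training-set size

# A bootstrap sample: n_samples row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out-of-bag" for this tree.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
oob_fraction = 1.0 - unique_fraction

print(f"distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"out-of-bag rows: {oob_fraction:.3f}")
```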

## How Random Forest Works
1. Randomly select subsets of data (with replacement) to train individual decision trees.
2. For each tree:
   - Randomly choose a subset of features to consider for splitting at each node.
   - Grow the tree to its maximum depth (or based on a stopping criterion).
3. Combine predictions from all trees using:
   - **Majority voting** (for classification).
   - **Averaging** (for regression).
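
The three steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is an illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: the data is synthetic (`make_classification`), the variable names are made up for this example, and the final step counts hard votes, whereas scikit-learn's forest averages the per-tree class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the dataset used in this notebook).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a two-class vote cannot tie
trees = []

for _ in range(n_trees):
    # Step 1: bootstrap sample of the training rows (with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: max_features="sqrt" considers a random feature subset at each
    # split; max_depth=None grows the tree until its leaves are pure.
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        max_depth=None,
        random_state=int(rng.integers(0, 2**31 - 1)),
    )
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Step 3: majority vote across trees (labels here are 0 and 1).
all_votes = np.stack([t.predict(X_te) for t in trees])  # (n_trees, n_test)
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

print("manual forest accuracy:", (forest_pred == y_te).mean())
```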

### Parameters to Tune:
- **n_estimators**: Number of trees in the forest.
- **max_depth**: Maximum depth of each tree.
- **max_features**: Number of features considered when searching for the best split at each node.
- **min_samples_split**: Minimum number of samples required to split a node.
- **min_samples_leaf**: Minimum number of samples required at a leaf node.
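
One common way to tune these parameters is a cross-validated grid search. The sketch below uses `GridSearchCV` on synthetic data; the grid values, the 3-fold setting, and the names `X_demo`, `y_demo` are illustrative assumptions rather than recommended defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# A deliberately small grid so the search finishes quickly.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 4],
    "max_features": ["sqrt", None],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`min_samples_split` can be added to the grid in the same way; each extra parameter multiplies the number of fits, so larger grids are often explored with a randomized search instead.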

## Use Cases of Random Forest
### 1. Classification
- **Spam detection**: Identifying whether an email is spam or not.
- **Fraud detection**: Detecting fraudulent transactions.
- **Image classification**: Identifying objects in images.

### 2. Regression
- **House price prediction**: Estimating the price of a property.
- **Weather forecasting**: Predicting temperature or rainfall.
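
For regression, the forest's prediction is the average of the individual trees' predictions. The following sketch checks this on synthetic data with `RandomForestRegressor`; the dataset and the names `X_reg`, `y_reg` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration).
X_reg, y_reg = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

reg = RandomForestRegressor(n_estimators=30, random_state=0)
reg.fit(X_reg, y_reg)

# Predictions of every fitted tree for the first five rows.
per_tree = np.stack([t.predict(X_reg[:5]) for t in reg.estimators_])  # (30, 5)
manual_mean = per_tree.mean(axis=0)

# The forest's own prediction for the same rows.
forest_out = reg.predict(X_reg[:5])

print(np.allclose(manual_mean, forest_out))
```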

### 3. Feature Selection
- Identifying the most important features contributing to predictions.
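
A fitted forest exposes impurity-based importances through the `feature_importances_` attribute. The sketch below ranks features on synthetic data; the feature names `f0` to `f5` are hypothetical. Impurity-based importances can favor features with many distinct values, so they are best treated as a first indication rather than a final ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the two informative features come first.
X_fs, y_fs = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"f{i}" for i in range(X_fs.shape[1])]  # hypothetical names

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_fs, y_fs)

# Importances are normalized to sum to 1; sort from most to least important.
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```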


## Conclusion
Random Forest is a versatile and powerful machine learning algorithm suitable for a wide range of applications. It excels in both classification and regression tasks while providing robustness against overfitting and noisy data.

![Screenshot (8122).png](attachment:5814b9a6-54ea-4ef5-8636-44af952067e0.png)

![Screenshot (8123).png](attachment:87756d39-c474-4f39-8e02-c9ef7a2d14bc.png)

In [1]:
import pandas as pd

In [3]:
data = pd.read_csv('Kyphosis.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Kyphosis  81 non-null     object
 1   Age       81 non-null     int64 
 2   Number    81 non-null     int64 
 3   Start     81 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 2.7+ KB


In [7]:
data.shape

(81, 4)

In [9]:
data.head()

  Kyphosis  Age  Number  Start
0   absent   71       3      5
1   absent  158       3     14
2  present  128       4      5
3   absent    2       5      1
4   absent    1       4     15


In [11]:
# features: every column except the target
x = data.drop('Kyphosis', axis=1)

In [13]:
x.head()

   Age  Number  Start
0   71       3      5
1  158       3     14
2  128       4      5
3    2       5      1
4    1       4     15


In [15]:
y = data['Kyphosis']
y.head()

0     absent
1     absent
2    present
3     absent
4     absent
Name: Kyphosis, dtype: object

In [17]:
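# divide dataset into training (70%) and testing (30%) sets
# note: no random_state is set, so the split (and the outputs below) change on each run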
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

In [27]:
# build a random forest of 50 trees and train it on the training set
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=50)
model.fit(x_train, y_train)
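
Since neither the split nor the forest above is seeded, rerunning the notebook gives different predictions. The sketch below shows one way to make a run repeatable by passing `random_state`, and to keep the class ratio equal in both sets with `stratify`. It uses synthetic stand-in data of the same size (81 rows, 3 features); the seed 42 and the helper `fit_and_predict` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption; not Kyphosis.csv).
X_s, y_s = make_classification(
    n_samples=81,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    weights=[0.8, 0.2],
    random_state=0,
)

def fit_and_predict(seed):
    """Split, fit, and predict with every source of randomness seeded."""
    X_a, X_b, y_a, y_b = train_test_split(
        X_s, y_s, test_size=0.3, stratify=y_s, random_state=seed
    )
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    forest.fit(X_a, y_a)
    return forest.predict(X_b)

run_1 = fit_and_predict(42)
run_2 = fit_and_predict(42)

# Identical seeds give identical predictions.
print((run_1 == run_2).all())
```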

In [29]:
# predict class labels for the held-out test set
pred = model.predict(x_test)
pred

array(['absent', 'absent', 'absent', 'absent', 'absent', 'absent',
       'absent', 'present', 'present', 'absent', 'present', 'absent',
       'absent', 'absent', 'absent', 'absent', 'absent', 'absent',
       'absent', 'absent', 'absent', 'absent', 'absent', 'absent',
       'present'], dtype=object)

In [31]:
y_test

31     absent
67     absent
79    present
41     absent
69     absent
1      absent
73     absent
21    present
57    present
6      absent
50     absent
15     absent
54     absent
16     absent
26     absent
35     absent
52    present
75     absent
55     absent
59     absent
64     absent
70     absent
18     absent
4      absent
2     present
Name: Kyphosis, dtype: object

In [33]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

0.88

In [35]:
from sklearn.metrics import confusion_matrix
# rows are true labels, columns are predicted labels, in sorted order ('absent', 'present')
confusion_matrix(y_test, pred)

array([[19,  1],
       [ 2,  3]], dtype=int64)
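
Accuracy alone can hide how the model does on the rarer 'present' class. As a small worked example, the sketch below derives per-class metrics from the confusion matrix printed above, treating 'present' as the positive class, and compares accuracy with a baseline that always predicts the majority class.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels: ['absent', 'present'].
cm = np.array([[19, 1],
               [2, 3]])

tn, fp, fn, tp = cm.ravel()  # 'present' is treated as the positive class

accuracy = (tp + tn) / cm.sum()             # 22 / 25
precision_present = tp / (tp + fp)          # 3 / 4
recall_present = tp / (tp + fn)             # 3 / 5
baseline = cm.sum(axis=1).max() / cm.sum()  # always predict 'absent': 20 / 25

print(f"accuracy:            {accuracy:.2f}")
print(f"precision (present): {precision_present:.2f}")
print(f"recall (present):    {recall_present:.2f}")
print(f"majority baseline:   {baseline:.2f}")
```

On this particular split the forest finds 3 of the 5 'present' cases, and its accuracy of 0.88 sits 0.08 above the 0.80 majority-class baseline. With only 25 test rows, these figures will shift noticeably from one random split to another.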