# Random Forest with Breast Cancer Dataset
![](https://img.freepik.com/free-photo/dna_1048-4880.jpg?t=st=1655813117~exp=1655813717~hmac=4a3594af68a2d7d9ef718c3411624f28e6dcc6489d31cf706481c8a3dc9e0fe7&w=1060)

Hi Guys 😀

In this notebook, I'm going to talk about random forest.

- First, I'm going to cover what random forest is.
- Next, I'm going to mention some advantages and disadvantages of random forest.
- Finally, I'm going to show you how to implement random forest with a real-world dataset.

Please don't forget to follow us on the [Tirendaz Academy](https://youtube.com/c/tirendazacademy) YouTube channel.

Happy learning 🐱‍🏍

# What is Random Forest?
![](https://img.freepik.com/free-photo/dense-forest-with-tall-pine-trees-fog-it_181624-11826.jpg?t=st=1655813320~exp=1655813920~hmac=e7d8537257c1c4844b31aa6d831f28225019833106ba96630aae60b5741bb248&w=1060)

You can think of random forest as an ensemble of decision trees. Each tree in a random forest is slightly different from the others, because each tree is trained on a random sample of the rows and considers a random subset of the features when choosing a split. Since the model combines the predictions of many such trees, the overfitting of the individual trees tends to average out.
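
To make the idea concrete, here is a minimal hand-rolled sketch of an ensemble of trees. It uses synthetic data from `make_classification` (an assumption for illustration, not the breast cancer data), bootstrap sampling, and a simple majority vote. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities instead of counting hard votes, so this is only an approximation of what the library does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean over trees is the share voting for class 1.
votes = np.array([tree.predict(X) for tree in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Hand-rolled forest training accuracy:", (forest_pred == y).mean())
```

An odd number of trees is used so that the vote can never end in a tie.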

# Some Advantages of Random Forest
![](https://img.freepik.com/free-photo/glad-dark-haired-young-woman-says-sounds-good-confirms-something-everything-control-going-great-approves-promo-has-glad-expression-agrees-with-person-wears-yellow-sweatshirt_273609-42865.jpg?t=st=1655813420~exp=1655814020~hmac=1e0fce9eff4ffbc5385b758adf88e5d67ea1949ef6e5c9441ec2725bb374893a&w=1060)

- You can use random forest for both classification and regression tasks.
- Random forest often works well without heavy tuning of the hyperparameters.
- You don't need to scale the data.
- Random forest may provide better accuracy than a single decision tree since it reduces overfitting.
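
As a quick check of the scaling point, the sketch below fits the same forest on raw and on standardized features and compares the predictions. The data is synthetic and the variable names are illustrative assumptions. Tree splits depend only on the ordering of feature values, so the two models should agree on essentially every sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed; only the feature scale differs.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

# Share of samples on which both models predict the same class.
agreement = (rf_raw.predict(X) == rf_scaled.predict(X_scaled)).mean()
print(f"Share of identical predictions: {agreement:.3f}")
```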

# Some Disadvantages of Random Forest
![](https://img.freepik.com/free-photo/photo-thoughtful-handsome-adult-european-man-holds-chin-looks-pensively-away-tries-solve-problem_273609-45891.jpg?t=st=1655813585~exp=1655814185~hmac=2a5cb9a9972936443d261541efc6d48d8cb60a5d75259e6223729df99b3155c9&w=1060)

- Random forest often does not perform well on very high-dimensional, sparse data such as text data.
- Random forest is not simple to interpret since it combines many trees, and each tree is usually grown deeper than a single decision tree that you would read by hand.
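
To get a feel for why a forest is hard to read, you can count how many trees and nodes it contains. This is a small sketch on synthetic data (an assumption for illustration); the exact numbers depend on the data and the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each fitted tree is available in rf.estimators_.
depths = [tree.get_depth() for tree in rf.estimators_]
nodes = [tree.tree_.node_count for tree in rf.estimators_]

print("Number of trees:", len(rf.estimators_))
print("Depth range:", min(depths), "to", max(depths))
print("Total nodes across the forest:", sum(nodes))
```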

# Loading Dataset

In [1]:
import pandas as pd
df = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [2]:
df.shape

(569, 33)
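
The 33 columns are the `id`, the `diagnosis`, 30 numeric measurements, and the empty `Unnamed: 32` column. If you don't have the Kaggle CSV at hand, scikit-learn ships a copy of the same Wisconsin diagnostic data. The sketch below loads that copy; keep in mind that its column names differ (for example `mean radius` instead of `radius_mean`) and that it codes malignant as 0 and benign as 1.

```python
from sklearn.datasets import load_breast_cancer

# Bundled copy of the dataset; as_frame=True returns pandas objects.
data = load_breast_cancer(as_frame=True)
features = data.data    # DataFrame with the 30 numeric measurements
target = data.target    # Series of 0/1 labels (0 = malignant, 1 = benign)

# Count the samples per class, keyed by the class name.
counts = {str(name): int((target == i).sum()) for i, name in enumerate(data.target_names)}
print(features.shape)
print(counts)
```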

# Data Preprocessing

Let's create the target and feature variables.

In [3]:
y = df.loc[:,"diagnosis"].values
X = df.drop(["diagnosis","id","Unnamed: 32"], axis=1).values

Let's encode the target variable with a label encoder.

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
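
To see what the encoder does, here is a tiny sketch with toy labels (not rows from the dataset; the name `toy_le` is just for illustration). `LabelEncoder` sorts the classes, so `B` (benign) becomes 0 and `M` (malignant) becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M"]   # toy labels, not rows from the dataset
toy_le = LabelEncoder()
encoded = toy_le.fit_transform(labels)

print(toy_le.classes_)                      # classes in sorted order
print(encoded)                              # integer code for each label
print(toy_le.inverse_transform(encoded))    # back to the original strings
```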

Let's split the dataset into training and test sets.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
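
Two details of this call are worth a quick look: by default `train_test_split` holds out 25% of the rows, and `stratify=y` keeps the class proportions roughly equal in both parts. The sketch below illustrates this with toy labels that have the same class sizes as this dataset (357 benign, 212 malignant); the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the dataset's class sizes: 357 benign (0) and 212 malignant (1).
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

print("Train/test sizes:", len(y_tr), len(y_te))
print(f"Malignant share overall/train/test: "
      f"{y_demo.mean():.3f}/{y_tr.mean():.3f}/{y_te.mean():.3f}")
```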

# Building the Model

First, I'm going to build the model with the default hyperparameters.

In [6]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

RandomForestClassifier(random_state=0)

Let's predict the values of the training and test set.

In [7]:
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

Let's take a look at the performance of the model on the training and test sets with the accuracy score function.

In [8]:
from sklearn.metrics import accuracy_score
rf_train = accuracy_score(y_train, y_train_pred)
rf_test = accuracy_score(y_test, y_test_pred)
print(f"Random forest train/test accuracies:{rf_train:.3f}/{rf_test:.3f}")

Random forest train/test accuracies:1.000/0.958


As you can see, the score on the training set is 100%, and the score on the test set is 95.8%. A perfect training score together with a lower test score suggests that the model is overfitting to some degree: it fits the training set almost by memorizing it, and it generalizes a little less well to unseen data. To reduce overfitting, we can control the complexity of the model.
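
One way to control complexity is to limit how deep each tree may grow. The sketch below is a rough illustration, assuming scikit-learn's bundled copy of the dataset in place of the Kaggle CSV. It compares train and test accuracy for a few `max_depth` values; the exact numbers will vary with the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled copy of the dataset (assumption: used here in place of the Kaggle CSV).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for depth in [1, 3, 5, None]:   # None lets every tree grow until its leaves are pure
    model = RandomForestClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

Shallower trees usually lower the training score; whether the test score improves depends on the data.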

# Building the Model with Grid Search Technique

Grid search helps us improve the performance of a model by trying every combination of the hyperparameter values we specify and keeping the combination with the best cross-validated score.
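
Because every combination is tried, the number of model fits grows quickly. As a small sketch with a deliberately tiny, hypothetical grid (not the one used in this notebook), `ParameterGrid` lists the combinations that a grid search would evaluate:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical, deliberately small grid for illustration.
small_grid = {"max_depth": [5, 10], "min_samples_leaf": [1, 3, 5]}
combinations = list(ParameterGrid(small_grid))
cv_folds = 5   # GridSearchCV uses 5-fold cross-validation by default

for params in combinations:
    print(params)

total_fits = len(combinations) * cv_folds
print(len(combinations), "combinations x", cv_folds, "folds =", total_fits, "fits")
```

The count is the product of the list lengths, so adding one more hyperparameter with nine values multiplies the work by nine.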

In [9]:
from sklearn.model_selection import GridSearchCV
# Creating an object from the RandomForestClassifier class.
rf = RandomForestClassifier(random_state=0)
# Specifying the values of the parameters.
# "auto" is left out of max_features: for classifiers it was an alias of "sqrt"
# and it was removed in scikit-learn 1.3.
parameters = {"max_depth": [5, 10, 20],
              "n_estimators": list(range(10, 100, 10)),
              "min_samples_leaf": list(range(1, 10)),
              "criterion": ["gini", "entropy"],
              "max_features": ["sqrt", "log2"]}
clf = GridSearchCV(rf, parameters, n_jobs=-1)
# Building the model.
clf.fit(X_train, y_train)
# Seeing the best parameters.
print(clf.best_params_)

{'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'min_samples_leaf': 3, 'n_estimators': 10}
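
Besides `best_params_`, the fitted search object also stores the cross-validated score of every combination. The sketch below shows how to read them. It assumes scikit-learn's bundled copy of the dataset and a much smaller, hypothetical grid so that it runs quickly; its results are not the ones shown above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Bundled copy of the dataset (assumption: used here in place of the Kaggle CSV).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Hypothetical small grid, chosen only to keep the example fast.
small_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      small_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy of the best combination: {search.best_score_:.3f}")

# Mean validation score of every combination that was tried.
scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], scores):
    print(params, f"{score:.3f}")
```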


# Evaluating the Model

In [10]:
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
rf_train = accuracy_score(y_train, y_train_pred)
rf_test = accuracy_score(y_test, y_test_pred)
print(f"Random forest train/test accuracies:{rf_train:.3f}/{rf_test:.3f}")

Random forest train/test accuracies:0.993/0.965


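Compared with the default model, the test accuracy improved slightly (from 0.958 to 0.965), while the training accuracy dropped from 1.000 to 0.993. Notice that the score of our model on the training set is now closer to the score on the test set, which suggests less overfitting. In addition, both accuracy scores are close to 1.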
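
Accuracy is a single number, so it can help to look at which kinds of mistakes the model makes. The sketch below is an optional extra, assuming scikit-learn's bundled copy of the dataset and illustrative hyperparameters. It prints a confusion matrix in which rows are the true classes and columns are the predicted classes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Bundled copy of the dataset (assumption: used here in place of the Kaggle CSV).
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y   # the bundled copy codes malignant as 0; flip so that malignant = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative hyperparameters, not a tuned model.
model = RandomForestClassifier(max_depth=5, min_samples_leaf=3, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("Malignant cases predicted as benign (false negatives):", fn)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
```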

Please don't forget to follow us on [YouTube](http://youtube.com/tirendazacademy) | [Medium](http://tirendazacademy.medium.com) | [Twitter](http://twitter.com/tirendazacademy) | [GitHub](http://github.com/tirendazacademy) | [Linkedin](https://www.linkedin.com/in/tirendaz-academy) | [Kaggle](https://www.kaggle.com/tirendazacademy) 😎

![](https://img.freepik.com/free-vector/thank-you-lettering_1262-7412.jpg?t=st=1655813654~exp=1655814254~hmac=86c396cf2062d906540b475205797d297f2e540109d094d3087394e37d847e94&w=996)