# Introducing Random Forest
A Random Forest 🌲🌲🌲 is actually just a bunch of Decision Trees 🌲 bundled together (ohhhhh, that's why it's called a forest). In this notebook we will learn how to build a Random Forest model.

![Random Forest](https://www.frontiersin.org/files/MyHome%20Article%20Library/284242/284242_Thumb_400.jpg)

## Agenda
*  About Dataset
*  Loading Libraries
*  Loading Data
*  Understanding Data
*  Separating Input Features and Output/Target Features
*  Splitting Data into Train and Test Sets
*  Build Model
*  Prediction
*  Check Model Performance

## About Dataset
I hope you all remember the wine dataset, on which we did exploratory data analysis and also built a logistic regression model. Here we will use the red wine data. Given the results of different physicochemical tests, we want to predict the quality of the wine on a scale of 0 to 10.

We use the same dataset so that we can easily compare the Logistic Regression and Random Forest models.

## Loading Libraries
Not all Python capabilities are loaded into our working environment by default (even if they are already installed on your system). So, we import each library that we want to use.

In data science, numpy and pandas are the most commonly used libraries. Numpy is required for calculations like means, medians, square roots, etc. Pandas is used for data processing and data frames. We choose alias names for our libraries for convenience (numpy --> np and pandas --> pd).

In [None]:
import numpy as np        # Fundamental package for linear algebra and multidimensional arrays
import pandas as pd       # Data analysis and manipulation tool

## Loading Data
The pandas module is used for reading files. Our data is in '.csv' format, so we will use the 'read_csv()' function to load it.

In [None]:
# read_csv() is given the URL of the file on the UCI website. The values in the file are separated by ';',
# so we pass ';' as the separator (sep = ";")
red_wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
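
To see what the `sep` argument does without downloading anything, here is a minimal sketch that parses a small in-memory sample. The three columns and two rows below are a made-up stand-in for the real file, and `sample_csv` and `sample_df` are hypothetical names used only for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the same ';'-separated layout as the wine file
# (the values are illustrative only, not taken from the dataset)
sample_csv = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.4;3.51;5\n"
    "7.8;3.20;5\n"
)

# With sep=";" each ';'-separated value becomes its own column
sample_df = pd.read_csv(sample_csv, sep=";")

print(sample_df.shape)            # (number of rows, number of columns)
print(list(sample_df.columns))    # column names taken from the header row
```

If `sep=";"` were left out, pandas would look for commas instead and read each line as a single column.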

## Understanding Data
Let's see how our data looks.

You can explore another notebook of mine that focuses on understanding the data and getting insights from it (in short, Exploratory Data Analysis) using the same dataset. It will help you connect the work done in both notebooks.

Link to the notebook: https://www.kaggle.com/manishkc06/eda-an-introduction

In [None]:
# Red Wine
red_wine_data.head() 

In [None]:
red_wine_data.columns

### Different attributes
**Input variables (based on physicochemical tests):**

*  fixed acidity
*  volatile acidity
*  citric acid
*  residual sugar
*  chlorides
*  free sulfur dioxide
*  total sulfur dioxide
*  density
*  pH
*  sulphates
*  alcohol

**Output variable (based on sensory data):**

*  quality (score between 0 and 10)

In [None]:
# Basic statistical details about data
red_wine_data.describe()

Let's look at the target variable 'quality'.

In [None]:
red_wine_data.quality.value_counts().plot(kind = 'bar')

We can see that there are more wines of average quality than of poor or good quality. This matches what we observed in our EDA notebook on the wine data.

We have already done the EDA for this dataset in our earlier notebook, so we will not dive deeper into it here. Let's separate the independent and dependent variables.

## Separating Input Features and Output Features
Before building any machine learning model, we always separate the input variables from the output variables. Input variables are the quantities that are measured or varied in an experiment, whereas the output variable is the one whose value depends on the input variables. Input variables are therefore also known as independent variables, since their values do not depend on any other quantity, and output variables are also known as dependent variables, since their values depend on the input variables. In this data, for example, the quality of a wine depends on its physicochemical measurements, such as acidity, pH and alcohol.

By convention input variables are represented with 'X' and output variables are represented with 'y'.

In [None]:
# Input/independent variables
X = red_wine_data.drop('quality', axis = 1)   # here we drop the 'quality' column because it is the target and 'X' holds only the input features;
                                              # red_wine_data itself is not modified because we have not used 'inplace = True'

y = red_wine_data.quality             # Output/Dependent variable
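
As a small illustration of the point about `inplace`, the sketch below repeats the same separation on a tiny hand-made frame. The frame `toy` and its values are hypothetical and exist only to show that `drop()` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical three-row frame with two inputs and a target column
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})

X_toy = toy.drop("quality", axis=1)   # new DataFrame without the target column
y_toy = toy["quality"]                # Series holding only the target

print(X_toy.columns.tolist())         # input feature names
print("quality" in toy.columns)       # the original frame still has the target
```

Because rows are not reordered, row `i` of `X_toy` still lines up with element `i` of `y_toy`, which is what the model relies on during training.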

## Splitting the data into Train and Test Set
We want to check the performance of the model that we build. For this purpose, we always split the given data (both input and output) into a training set, which is used to train the model, and a test set, which is used to check how accurately the model predicts outcomes.

For this purpose we have a function called 'train_test_split' in the 'sklearn.model_selection' module.

In [None]:
# import train_test_split
from sklearn.model_selection import train_test_split

In [None]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state = 42)

# X_train: independent/input feature data for training the model
# y_train: dependent/output feature data for training the model
# X_test: independent/input feature data for testing the model; will be used to predict the output values
# y_test: original dependent/output values of X_test; we will compare these values with our predicted values to check the performance of our built model.
 
# test_size = 0.30: 30% of the data will go for test set and 70% of the data will go for train set
# random_state = 42: this will fix the split i.e. there will be same split for each time you run the code
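
The effect of `test_size` and `random_state` can be checked on a small synthetic array. This is only a sketch: `X_demo` and `y_demo` are made-up stand-ins with 10 samples, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)   # 7 rows for training, 3 rows for testing

# Calling again with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

Every sample ends up in exactly one of the two sets, so the test set contains rows the model has never seen during training.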

## Building Model
Now we are finally ready, and we can train the model.

First, we need to import our model - Random Forest Classifier (again, using the sklearn library).

Then we feed the model both the data (X_train) and the answers for that data (y_train).

In [None]:
# Importing RandomForestClassifier from sklearn.ensemble
# Why Random Forest lives in the ensemble module of the sklearn library will be discussed further
from sklearn.ensemble import RandomForestClassifier 

In [None]:
rfc = RandomForestClassifier()

In [None]:
rfc.fit(X_train, y_train)
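
To connect the fitted model back to the idea of "a bunch of Decision Trees", the sketch below fits a small forest on synthetic data and inspects its individual trees. The data from `make_classification` is a stand-in, not the wine data, and the choices `n_estimators=25` and `random_state=0` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 samples, 8 features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# n_estimators sets the number of trees; random_state makes the fit reproducible
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# After fitting, the individual trees are stored in the estimators_ attribute
print(len(forest.estimators_))
print(type(forest.estimators_[0]).__name__)
```

Each tree is trained on a different random sample of the rows, and the forest combines the trees' outputs into one prediction. Setting `random_state` on the classifier is optional, but without it the trees (and therefore the accuracy) can differ from run to run.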

## Prediction
Now the Random Forest model (i.e. rfc) is trained using the X_train and y_train data. Let's predict the target value (i.e. quality of wine) for the X_test data. We use the "predict()" method for prediction.

In [None]:
predictions = rfc.predict(X_test)

We already have the actual target values (i.e. y_test) for X_test. Let's compare y_test with the values predicted for X_test by our Random Forest model (rfc).

In [None]:
y_test.values

In [None]:
predictions

## Model Performance
We can check how well our model is performing using the 'confusion_matrix' and 'accuracy_score' functions from 'sklearn.metrics'.

In [None]:
# The confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

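In this matrix, rows are the actual quality classes and columns are the predicted ones. If you observe here, the class-wise errors above the main diagonal (wines predicted as a higher quality than their actual one) and those below it (wines predicted as a lower quality) are almost symmetrical. So, the accuracy score is a reasonable metric here.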
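
How to read the rows, columns and diagonal of a multi-class confusion matrix can be seen on a tiny hand-made example. The labels in `y_true_demo` and `y_pred_demo` are hypothetical and are not output from the wine model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted quality scores for five wines
y_true_demo = [5, 5, 6, 6, 7]
y_pred_demo = [5, 6, 6, 6, 6]

# Rows are actual classes, columns are predicted classes, in the order of 'labels'
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[5, 6, 7])
print(cm)

correct = np.trace(cm)               # main diagonal: predicted == actual
over = np.triu(cm, k=1).sum()        # above the diagonal: predicted > actual
under = np.tril(cm, k=-1).sum()      # below the diagonal: predicted < actual
print(correct, over, under)
```

Here three wines are classified correctly, one is rated too high and one too low, so the errors on the two sides of the diagonal balance out.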

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, predictions)
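
Accuracy is simply the fraction of predictions that match the actual values. The sketch below checks this by hand on hypothetical labels (`y_true_demo` and `y_pred_demo` are made up, not results from the wine model).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted quality scores for five wines
y_true_demo = np.array([5, 5, 6, 6, 7])
y_pred_demo = np.array([5, 6, 6, 6, 6])

# Manual calculation: number of matching predictions divided by the total
manual_accuracy = np.mean(y_true_demo == y_pred_demo)

# The same quantity computed by sklearn
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sklearn_accuracy)   # 3 correct out of 5 in both cases
```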

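You can see that accuracy improves with the Random Forest model: the Logistic Regression model gave 54% accuracy, whereas Random Forest gives about 66.8% on the same dataset. The exact Random Forest figure can vary slightly between runs because no 'random_state' was set for the classifier.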

**Thanks for reading the Notebook!!!**