![](logo.png)

![](pic.jpg)

# Day Objectives

## Random Forest
- It can be used for both classification and regression problems in ML.
- It is based on ensemble learning, the process of combining multiple models to solve a complex problem and improve overall performance.
- A random forest contains a number of decision trees, each trained on a different subset of the given dataset.
- For classification, the forest takes the prediction from each tree and outputs the class with the majority of votes.
- For regression, it averages the predictions of the trees to improve predictive accuracy.
- **A greater number of trees generally makes predictions more stable and helps reduce overfitting, although the gain levels off beyond a certain number of trees.**
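
The two ways of combining tree outputs described above can be sketched with plain NumPy. This is only an illustration of the idea, not how any particular library implements it; the function names `majority_vote` and `average_prediction` and the small arrays are hypothetical.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Combine class predictions of shape (n_trees, n_samples) by majority vote."""
    tree_predictions = np.asarray(tree_predictions)
    voted = []
    for sample_votes in tree_predictions.T:  # one column of votes per sample
        labels, counts = np.unique(sample_votes, return_counts=True)
        voted.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(voted)

def average_prediction(tree_predictions):
    """Combine regression predictions of shape (n_trees, n_samples) by averaging."""
    return np.asarray(tree_predictions, dtype=float).mean(axis=0)

# Hypothetical outputs of three trees for three samples (classification)
class_votes = [[1, 2, 3],
               [1, 2, 2],
               [2, 2, 3]]
voted = majority_vote(class_votes)

# Hypothetical outputs of three trees for two samples (regression)
tree_values = [[10.0, 20.0],
               [14.0, 22.0],
               [12.0, 24.0]]
averaged = average_prediction(tree_values)

print(voted, averaged)
```

Each column of `class_votes` holds the votes for one sample, so the forest's answer is simply the most common label in that column.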

### Why use Random Forest?
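- It often trains quickly compared with many other algorithms, because the trees are independent of each other and can be built in parallel.
- It predicts output with high accuracy and runs efficiently even on large datasets.
- It can often maintain accuracy when a portion of the data is missing, provided the missing values are handled (for example, imputed) first.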


![](tree.png)


### Advantages of Random Forest
- Random Forest is capable of performing both Classification and Regression tasks.
- It is capable of handling large datasets with high dimensionality.
- It improves the accuracy of the model and reduces the risk of overfitting.



- Underfitting, overfitting, and best fit on training and testing data:

![](fit.png)
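
Comparing training and testing accuracy is a common way to tell these three cases apart. The sketch below assumes scikit-learn's bundled copy of the wine data (`load_wine`) and uses single decision trees of different depths; the depths chosen are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for name, depth in [("stump (depth 1)", 1), ("depth 3", 3), ("unlimited depth", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) for each tree
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for name, (train_acc, test_acc) in scores.items():
    print(f"{name:>16}: train={train_acc:.2f}  test={test_acc:.2f}")
```

Low accuracy on both sets suggests underfitting, while a training score well above the testing score suggests overfitting.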

## Regularization

**Tuning Technique**

- Regularization is a technique that:
    - attempts to solve the overfitting problem in statistical models.
    - reduces errors by fitting the function appropriately on the given training set, which avoids overfitting.
    - makes slight modifications to the learning algorithm so that the model generalizes better.
    - in turn improves the model’s performance on unseen data as well.



### 3 Techniques:
- Ridge
- Lasso
- Elastic Net


### What is Ridge Regression?

Ridge regression is a model tuning method used to analyse data that suffers from multicollinearity. It performs L2 regularization. When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, so predicted values can end up far away from the actual values.


***Multicollinearity:*** Predictors are correlated with other predictors
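
One simple way to spot correlated predictors is a correlation matrix. The sketch below uses synthetic data; the column names `x1`, `x2`, `x3` are hypothetical, and `x2` is deliberately built as a near multiple of `x1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
frame = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + 0.1 * rng.normal(size=100),  # nearly a multiple of x1
    "x3": rng.normal(size=100),                   # generated independently of x1
})

# Pairwise Pearson correlations between the predictors
corr = frame.corr()
print(corr.round(2))
```

A correlation close to 1 or -1 between two predictors, as for `x1` and `x2` here, is a sign of multicollinearity.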

To fix the problem of overfitting, we need to balance two things:
1. How well the function/model fits the data.
2. The magnitude of the coefficients.


*Total Cost Function = Measure of fit of model + Measure of magnitude of coefficients*

Cost functions are used to estimate how badly models are performing.


* The cost function for ridge regression:
$$ \min_{\theta} \left( \lVert Y - X\theta \rVert^2 + \lambda \lVert \theta \rVert^2 \right) $$
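
The cost function above can be evaluated and minimized directly with NumPy. This is a minimal sketch under some assumptions: there is no intercept term, the helper names `ridge_cost` and `ridge_fit` are hypothetical, and the minimizer uses the standard closed-form solution `(X^T X + lambda * I)^-1 X^T Y`.

```python
import numpy as np

def ridge_cost(X, y, theta, lam):
    """Squared error of the fit plus lam times the squared size of theta."""
    residual = y - X @ theta
    return residual @ residual + lam * (theta @ theta)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ridge_cost (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Tiny hypothetical dataset generated exactly by theta = [2, 3]
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = X @ np.array([2.0, 3.0])

theta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
theta_ridge = ridge_fit(X, y, lam=1.0)  # lam > 0 shrinks the coefficients

print(theta_ols, theta_ridge)
print(ridge_cost(X, y, theta_ols, 1.0), ridge_cost(X, y, theta_ridge, 1.0))
```

With `lam=1.0` the ridge solution accepts a slightly worse fit in exchange for smaller coefficients, which gives a lower total cost than the least-squares solution.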


#### Ridge Regression Models 

$$ Y = XB + e $$
Where Y is the dependent variable, X represents the independent variables, B is the vector of regression coefficients to be estimated (written as theta in the cost function above), and e represents the errors (residuals).

**Limitation of Ridge Regression:** Ridge regression decreases the complexity of a model but does not reduce the number of variables, since it never shrinks a coefficient to exactly zero and only reduces its magnitude. Hence, this model is not good for feature reduction.

[boston Dataset](https://www.kaggle.com/c/boston-housing)



### Lasso Regression

[Salary_Dataset](https://raw.githubusercontent.com/AP-State-Skill-Development-Corporation/Datasets/master/Regression/Salary_Data.csv)
- Lasso stands for Least Absolute Shrinkage and Selection Operator.
- Lasso regression performs L1 regularization.

### Elastic-Net Regression
- Combining lasso and ridge regression gives Elastic-Net regression.
- Lasso regression can introduce a small bias in the model, where the prediction depends too heavily on a particular variable. In these cases, Elastic Net has proved to work better, because it combines the regularization of both lasso and ridge.
- **Advantage:** it does not easily eliminate the coefficients of highly collinear predictors.
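
The difference between the three techniques shows up in the fitted coefficients. The sketch below uses synthetic data in which only the first two of five features matter; the `alpha` and `l1_ratio` values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features influence the target; the other three are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.5),
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),
}
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}

for name, coef in coefs.items():
    print(f"{name:>12}: {np.round(coef, 3)}")
```

Ridge keeps every coefficient non-zero, while lasso sets the coefficients of the irrelevant features to exactly zero, which is why lasso can be used for feature selection.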





# Random Forest as classifier

In [1]:
import pandas as pd
import numpy as np

In [2]:
wine = pd.read_csv("https://raw.githubusercontent.com/LavanyaPolamarasetty/Datasets/master/Classification/wine.data.csv")
wine.head()

Unnamed: 0,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [4]:
wine["Class"].value_counts()

2    71
1    59
3    48
Name: Class, dtype: int64

In [5]:
wine.shape

(178, 14)

In [6]:
wine.isnull().sum()

Class                           0
Alcohol                         0
Malic acid                      0
Ash                             0
Alcalinity of ash               0
Magnesium                       0
Total phenols                   0
Flavanoids                      0
Nonflavanoid phenols            0
Proanthocyanins                 0
Color intensity                 0
Hue                             0
OD280/OD315 of diluted wines    0
Proline                         0
dtype: int64

In [7]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Class                         178 non-null    int64  
 1   Alcohol                       178 non-null    float64
 2   Malic acid                    178 non-null    float64
 3   Ash                           178 non-null    float64
 4   Alcalinity of ash             178 non-null    float64
 5   Magnesium                     178 non-null    int64  
 6   Total phenols                 178 non-null    float64
 7   Flavanoids                    178 non-null    float64
 8   Nonflavanoid phenols          178 non-null    float64
 9   Proanthocyanins               178 non-null    float64
 10  Color intensity               178 non-null    float64
 11  Hue                           178 non-null    float64
 12  OD280/OD315 of diluted wines  178 non-null    float64
 13  Proli

In [8]:
wine.describe()

Unnamed: 0,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


In [11]:
X = wine[wine.columns[1:]]
X.head()

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [12]:
y = wine["Class"]

In [13]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25, random_state = 42)

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [15]:
rcls = RandomForestClassifier()

In [16]:
rcls.fit(X_train,y_train)

RandomForestClassifier()

In [17]:
y_pred = rcls.predict(X_test)

In [18]:
from sklearn.metrics import accuracy_score,confusion_matrix

In [20]:
accuracy_score(y_test,y_pred) * 100

100.0

In [21]:
confusion_matrix(y_test,y_pred)

array([[15,  0,  0],
       [ 0, 18,  0],
       [ 0,  0, 12]], dtype=int64)
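
The accuracy reported above can be recovered from the confusion matrix itself: correct predictions sit on the diagonal. A small sketch using the matrix printed above (in scikit-learn's convention, rows are true classes and columns are predicted classes):

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[15, 0, 0],
               [0, 18, 0],
               [0, 0, 12]])

# Accuracy = correctly classified samples / all samples
accuracy = np.trace(cm) / cm.sum() * 100

# Recall per class = correct predictions for a class / true samples of that class
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy, recall_per_class)
```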

# Random Forest Regressor

In [24]:
data = pd.read_csv("https://raw.githubusercontent.com/LavanyaPolamarasetty/Datasets/master/Regression/age_salary_hours.csv")
data.head()

Unnamed: 0,Age,Annual Salary,Weekly hours,Education
0,72,160000.0,40.0,Bachelor's degree or higher
1,72,100000.0,50.0,Bachelor's degree or higher
2,31,120000.0,40.0,Bachelor's degree or higher
3,28,45000.0,40.0,Bachelor's degree or higher
4,54,85000.0,40.0,Bachelor's degree or higher


In [26]:
data.shape

(500, 4)

In [27]:
data.isnull().sum()

Age              0
Annual Salary    0
Weekly hours     0
Education        0
dtype: int64

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Age            500 non-null    int64  
 1   Annual Salary  500 non-null    float64
 2   Weekly hours   500 non-null    float64
 3   Education      500 non-null    object 
dtypes: float64(2), int64(1), object(1)
memory usage: 15.8+ KB


In [31]:
data.columns


Index(['Age', 'Annual Salary', 'Weekly hours', 'Education'], dtype='object')

In [40]:
X = data[["Age","Weekly hours"]]
X.head()

Unnamed: 0,Age,Weekly hours
0,72,40.0
1,72,50.0
2,31,40.0
3,28,40.0
4,54,40.0


In [41]:
y = data["Annual Salary"]

In [42]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25)

In [43]:
from sklearn.ensemble import RandomForestRegressor

In [44]:
rreg = RandomForestRegressor()

In [45]:
rreg.fit(X_train,y_train)

RandomForestRegressor()

In [46]:
y_pred = rreg.predict(X_test)

In [48]:
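from sklearn.metrics import r2_score
# R² scaled to a percentage: 100 is a perfect fit, values near 0 mean the model explains almost none of the variance
r2_score(y_test,y_pred)* 100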

0.7627708718566661
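
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few hypothetical values and checks it against `r2_score`; the helper name `r2_manual` is an assumption.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical true values and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

score = r2_manual(y_true, y_pred)
swapped = r2_manual(y_pred, y_true)  # arguments in the wrong order

print(score, r2_score(y_true, y_pred), swapped)
```

Because the total sum of squares is taken around the mean of the true values, swapping the two arguments changes the result, so the true values should always come first.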

In [49]:
X_train.iloc[100]

Age             64.0
Weekly hours     0.0
Name: 123, dtype: float64

In [50]:
y_train.iloc[100]

0.0

In [51]:
X_test.iloc[7]

Age             52.0
Weekly hours    50.0
Name: 194, dtype: float64

In [52]:
y_test.iloc[7]

165000.0

In [53]:
rreg.predict([[52,50]])

array([75384.23333333])

In [54]:
165000-75384

89616
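
The subtraction above is the absolute error for a single row. Averaging that quantity over several rows gives the mean absolute error (MAE). In the sketch below only the first pair of values comes from the cells above; the other two rows are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First row: actual salary and prediction from the cells above.
# Remaining rows: hypothetical values added for illustration.
y_true = np.array([165000.0, 85000.0, 45000.0])
y_pred = np.array([75384.23333333, 90000.0, 50000.0])

abs_errors = np.abs(y_true - y_pred)  # per-row absolute error
mae = abs_errors.mean()               # mean absolute error

print(abs_errors, mae)
print(mean_absolute_error(y_true, y_pred))  # same value from scikit-learn
```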

# Task

https://raw.githubusercontent.com/LavanyaPolamarasetty/Datasets/master/Regression/Admission_Predict.csv

In [55]:
df = pd.read_csv("https://raw.githubusercontent.com/LavanyaPolamarasetty/Datasets/master/Regression/Admission_Predict.csv")
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [59]:
df.drop("Serial No.", axis = 1, inplace = True)

In [60]:
df.head()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337,118,4,4.5,4.5,9.65,1,0.92
1,324,107,4,4.0,4.5,8.87,1,0.76
2,316,104,3,3.0,3.5,8.0,1,0.72
3,322,110,3,3.5,2.5,8.67,1,0.8
4,314,103,2,2.0,3.0,8.21,0,0.65


In [62]:
X = df[df.columns[:-1]]
X

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research
0,337,118,4,4.5,4.5,9.65,1
1,324,107,4,4.0,4.5,8.87,1
2,316,104,3,3.0,3.5,8.00,1
3,322,110,3,3.5,2.5,8.67,1
4,314,103,2,2.0,3.0,8.21,0
...,...,...,...,...,...,...,...
395,324,110,3,3.5,3.5,9.04,1
396,325,107,3,3.0,3.5,9.11,1
397,330,116,4,5.0,4.5,9.45,1
398,312,103,3,3.5,4.0,8.78,0


In [63]:
y = df["Chance of Admit "]  # the column name in this CSV ends with a space

In [64]:
df.shape

(400, 8)

In [65]:
X_train,X_test,y_train,y_test = train_test_split(X,y)

In [74]:
model = RandomForestRegressor(n_estimators=100, max_depth=4)

In [75]:
model.fit(X_train,y_train)

RandomForestRegressor(max_depth=4)

In [76]:
y_pred = model.predict(X_test)

In [77]:
r2_score(y_test,y_pred) * 100

85.30462974476707

In [73]:
help(RandomForestRegressor)

Help on class RandomForestRegressor in module sklearn.ensemble._forest:

class RandomForestRegressor(ForestRegressor)
 |  RandomForestRegressor(n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)
 |  
 |  A random forest regressor.
 |  
 |  A random forest is a meta estimator that fits a number of classifying
 |  decision trees on various sub-samples of the dataset and uses averaging
 |  to improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is controlled with the `max_samples` parameter if
 |  `bootstrap=True` (default), otherwise the whole dataset is used to build
 |  each tree.
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------

In [80]:
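from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# read_csv already returns a DataFrame, so no extra conversion is needed
df2 = pd.read_csv("https://raw.githubusercontent.com/LavanyaPolamarasetty/Datasets/master/Regression/Admission_Predict.csv")

# Columns 1-7 are the features (column 0 is "Serial No.")
x = df2[df2.columns[1:8]]
# Select the target as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning in fit
y = df2[df2.columns[8]]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
rr = RandomForestRegressor(n_estimators=100, max_depth=5)
rr.fit(x_train, y_train)
y_pred1 = rr.predict(x_test)
# r2_score expects (y_true, y_pred); the order matters because R² is not symmetric
r2_score(y_test, y_pred1) * 100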


78.72162368017351

# Ridge Regressor

In [82]:
df = pd.read_csv("https://raw.githubusercontent.com/LavanyaPolamarasetty/Datasets/master/Regression/1000_Companies.csv")
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [83]:
df.shape

(1000, 5)

In [84]:
df.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64

In [85]:
df.columns

Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')

In [86]:
X = df[['R&D Spend', 'Administration', 'Marketing Spend']]

In [87]:
y = df["Profit"]

In [88]:
X_train,X_test,y_train,y_test = train_test_split(X,y)

In [89]:
from sklearn.linear_model import Ridge 

In [90]:
rmodel = Ridge()

In [91]:
rmodel.fit(X_train,y_train)

Ridge()

In [93]:
y_pred = rmodel.predict(X_test)

In [95]:
r2_score(y_test,y_pred) * 100

91.25484124648212

In [96]:
from sklearn.neighbors import KNeighborsRegressor

In [97]:
kmodel = KNeighborsRegressor(n_neighbors=5)

In [99]:
kmodel.fit(X_train,y_train)

KNeighborsRegressor()

In [100]:
y_pred = kmodel.predict(X_test)

In [102]:
r2_score(y_test,y_pred) * 100

90.72027446561194

# Task

In [103]:
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier
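
One possible starting point for this task, assuming it asks for a voting ensemble and a bagging ensemble. The sketch uses scikit-learn's bundled wine data (`load_wine`) instead of the CSV used earlier, and the choice of base models is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Hard voting: each model predicts a class and the majority wins
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)
voting_acc = voter.score(X_test, y_test)

# Bagging: the same base model trained on different bootstrap samples
bagger = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=10, random_state=0
)
bagger.fit(X_train, y_train)
bagging_acc = bagger.score(X_test, y_test)

print(voting_acc, bagging_acc)
```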

