# <font color = 'orange'> Random Forest Classifier and Regressor
## With Pipelines and Hyperparameter Tuning

#### Pipeline - to automate the process to retrain the model for new data.

### <font color = 'green'> Aim : To automate the entire process like feature enginnering, model training, model evaluation and many more 

---

## Classifier

### <font color = 'Blue'> Load dataset

In [1]:
import seaborn as sns

df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### <font color = '#AA00FF'> Observation:
* Let's us consider **time** as output parameter.

---

### <font color = 'Blue'> EDA - all about analyizing data and understanding data
* Handling missing values.
* Handling duplicate values.
* Handling Categorical features.
* Handling Outliers.
* Feature Scaling.
* Checking data types.
* Checking number of unique values.
* Checking statistics of dataset.

In [2]:
df['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

### <font color = '#AA00FF'> Observation:
* As there is 2 output categories it is a **binary class** classificatin model.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


### <font color = '#AA00FF'> Observation:
* There are 4 Categorical feature.

In [4]:
# let's do label encoding for time categorical feature as it is a binary classification
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(df['time'])

print(encoded)

df['time'] = encoded

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [5]:
df['time'].value_counts()

0    176
1     68
Name: time, dtype: int64

In [6]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,0,2
1,10.34,1.66,Male,No,Sun,0,3
2,21.01,3.5,Male,No,Sun,0,3
3,23.68,3.31,Male,No,Sun,0,2
4,24.59,3.61,Female,No,Sun,0,4


---

### <font color = 'Blue'> 1. Independent and Dependent data

In [7]:
# independent features
x = df.drop('time',axis = 1)

# dependent features
y = df['time']

In [8]:
x.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,size
0,16.99,1.01,Female,No,Sun,2
1,10.34,1.66,Male,No,Sun,3
2,21.01,3.5,Male,No,Sun,3
3,23.68,3.31,Male,No,Sun,2
4,24.59,3.61,Female,No,Sun,4


In [9]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: time, dtype: int32

---

### <font color = 'Blue'> 2. Train test split

In [10]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 42)

---

### <font color = 'Blue'> 3. Feature Engineering using Pipeline 

In [11]:
# Handling Missing Values - mean or median imputation
from sklearn.impute import SimpleImputer 
# Handling Categorical Featres - Encoding
from sklearn.preprocessing import OneHotEncoder  
# Handling numerical Features - Feature Scaling
from sklearn.preprocessing import StandardScaler

# we have to automate the following
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [12]:
# identify the categorical and numerical features
categorical_cols = ['sex','smoker','day']
numerical_cols = ['total_bill','tip','size']

#### Automating feature engineering

#### 1. Creating pipeline

In [13]:
# numerical pipline responsible for any feature engineeing activities for all the numerical feature
num_pipeline = Pipeline(
    steps = [
        # name and object
        ('imputer',SimpleImputer(strategy = 'median')), # handling the missing values
        ('scaler',StandardScaler()) # for feature scaling
    ]
)

# categorical pipline responsible for any feature engineeing activities for all the categorical feature
cat_pipeline = Pipeline(
    steps = [
        ('imputer',SimpleImputer(strategy = 'most_frequent')), # handling missing values
        ('onehotencoder',OneHotEncoder()) # categorical into numerical 
    ]
)

#### 2. Combining different pipelines using ColumnTransformer

In [14]:
# So, when new data comes we have to use the above both so we can create a wrapper of them  
# wrapping is done using ColumnTransformer by telling the pipelines
# basically combining both pipelines

preprocessor = ColumnTransformer(
    [
        # name , pipeline variable and features name
        ('numerical_pipeline',num_pipeline,numerical_cols),
        ('categorical_pipeline',cat_pipeline,categorical_cols)
    ]
)

---

#### Using automated feature engineering process

In [15]:
# Preprocessor which can handle the feature engineering

x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.fit_transform(x_test)

---

### <font color = 'Blue'> 4. Model Training and Evaluation

#### Automating model training process

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [17]:
# automatically model training and evaluation should happen within one function

# models contain all the algorithm we will use and we will select the best model among them 
models = {
    # key will be model name and value will be model  
    'Random Forest' : RandomForestClassifier(),
    'Decision Tree' : DecisionTreeClassifier(),
    'Support Vector Classifier' : SVC()
}

In [18]:
from sklearn.metrics import accuracy_score

#### Function that perform model training and evalution 

In [19]:
def evaluate_models(x_train, y_train, x_test, y_test, models):
    report = {}
    
    for i in range(len(models)):
        # Taking one model 
        model = list(models.values())[i]
        
        # Training model
        model.fit(x_train,y_train)
        
        # Predicting results 
        y_pred = model.predict(x_test)
        
        # Calulating accuracy 
        model_score = accuracy_score(y_test, y_pred)
        
        # storing the accuracy of the model
        report[list(models.keys())[i]] = model_score
    
    return report

In [20]:
report = evaluate_models(x_train, y_train, x_test, y_test, models)
# report contains the accuracy score of the models 

report

{'Random Forest': 0.9795918367346939,
 'Decision Tree': 0.9795918367346939,
 'Support Vector Classifier': 0.9795918367346939}

---

### <font color = 'Blue'> 5. Hyperparameter tuning

In [21]:
classifier = RandomForestClassifier()

parameters = {
    'max_depth':[3,5,10,None],
    'n_estimators':[100,200,300],
    'criterion':['gini','entropy']
}

from sklearn.model_selection import RandomizedSearchCV
rand_cv = RandomizedSearchCV(classifier, param_distributions = parameters, scoring = 'accuracy', cv = 5, verbose = 3)

rand_cv.fit(x_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END criterion=entropy, max_depth=None, n_estimators=200;, score=0.974 total time=   0.0s
[CV 2/5] END criterion=entropy, max_depth=None, n_estimators=200;, score=0.923 total time=   0.0s
[CV 3/5] END criterion=entropy, max_depth=None, n_estimators=200;, score=1.000 total time=   0.0s
[CV 4/5] END criterion=entropy, max_depth=None, n_estimators=200;, score=0.923 total time=   0.0s
[CV 5/5] END criterion=entropy, max_depth=None, n_estimators=200;, score=0.923 total time=   0.0s
[CV 1/5] END criterion=gini, max_depth=5, n_estimators=200;, score=0.974 total time=   0.0s
[CV 2/5] END criterion=gini, max_depth=5, n_estimators=200;, score=0.923 total time=   0.0s
[CV 3/5] END criterion=gini, max_depth=5, n_estimators=200;, score=0.974 total time=   0.1s
[CV 4/5] END criterion=gini, max_depth=5, n_estimators=200;, score=0.923 total time=   0.1s
[CV 5/5] END criterion=gini, max_depth=5, n_estimators=200;, score=0.923 total ti

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(),
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [3, 5, 10, None],
                                        'n_estimators': [100, 200, 300]},
                   scoring='accuracy', verbose=3)

In [22]:
rand_cv.best_params_

{'n_estimators': 300, 'max_depth': 3, 'criterion': 'gini'}

---

## Internal assignment - Regression 

In [23]:
from sklearn.ensemble import RandomForestRegressor

In [24]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,0,2
1,10.34,1.66,Male,No,Sun,0,3
2,21.01,3.5,Male,No,Sun,0,3
3,23.68,3.31,Male,No,Sun,0,2
4,24.59,3.61,Female,No,Sun,0,4


In [25]:
# Segregate independent and dependent features
x = df.drop('total_bill',axis = 1)

y = df['total_bill']

In [26]:
# train test split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)

In [27]:
x.head()

Unnamed: 0,tip,sex,smoker,day,time,size
0,1.01,Female,No,Sun,0,2
1,1.66,Male,No,Sun,0,3
2,3.5,Male,No,Sun,0,3
3,3.31,Male,No,Sun,0,2
4,3.61,Female,No,Sun,0,4


In [28]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [29]:
cat_feature = ['sex','smoker','day']
num_feature = ['tip','time','size']

In [30]:
# creating pipelines

cat_pipe = Pipeline(
    steps = [
        ('imputer',SimpleImputer(strategy = 'most_frequent')),
        ('encoding',OneHotEncoder())
    ]
)

num_pipe = Pipeline(
    steps = [
        ('imputer',SimpleImputer(strategy = 'median')),
        ('scaler',StandardScaler())
    ]
)

In [31]:
# composing the pipelines

preprocessor = ColumnTransformer(
    [
        ('categorical_pipeline', cat_pipe, cat_feature),
        ('numerical_pipeline', num_pipe, num_feature)
    ]
)

In [32]:
# automated feature engineering

x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.fit_transform(x_test)

In [33]:
from sklearn.ensemble import RandomForestRegressor

models = {
    'Random Forest Regressor' : RandomForestRegressor()
}

In [36]:
# function that performs model training and evalution
from sklearn.metrics import r2_score

def evaluate_model(x_train, y_train, x_test, y_test, models):
    report = {}
    
    for model in models:
        # select model 
        m = models[model]
        
        # model training 
        m.fit(x_train, y_train)
        
        # model predication
        y_pred = m.predict(x_test)
        
        # accuracy 
        r2 = r2_score(y_test, y_pred)
        
        # creating report 
        report[model] = r2
    
    return report 

In [37]:
# model accuracy 
report = evaluate_model(x_train, y_train, x_test, y_test, models)

report 

{'Random Forest Regressor': 0.5220182562669657}

In [41]:
# hyperparamter tuning

import warnings 
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV

# estimator
regressor = RandomForestRegressor()
# parameters
parameters = {
    'n_estimators' : [1,100],
    'criterion' : ('squared_error','absolute_error','friedman_mse','poisson'),
    'max_depth' : [1,10],
    'max_features' : ('sqrt','log2',None),
    'oob_score' : (True, False),
}

grid = GridSearchCV(regressor, param_grid = parameters, scoring = 'accuracy', cv = 5, verbose = True)

grid.fit(x_train, y_train)

best_params = grid.best_params_

best_params

Fitting 5 folds for each of 96 candidates, totalling 480 fits


{'criterion': 'squared_error',
 'max_depth': 1,
 'max_features': 'sqrt',
 'n_estimators': 1,
 'oob_score': True}

In [46]:
regressor = RandomForestRegressor(n_estimators = best_params['n_estimators'], criterion = best_params['criterion'], max_depth = best_params['max_depth'], max_features = best_params['max_features'], oob_score = best_params['oob_score'])

models = {
    'Random Forest Regressor With Hyperparameter Tuning' : regressor
}

report = evaluate_model(x_train, y_train, x_test, y_test, models)

report 

{'Random Forest Regressor With Hyperparameter Tuning': -0.028420809904450506}

---