## Random Forest Classifier With Pipeline And Hyperparameter Tuning

In [2]:
# Load the tips example dataset that ships with seaborn
import seaborn as sns
df = sns.load_dataset('tips')
df.head()

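|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |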


### total_bill = amount the family paid
### tip = tip the family gave
### size = number of people who went to the restaurant
### Here, we need to predict time, i.e. Dinner or Lunch
### We use time as the output because it has two classes (binary classification), whereas day has four (multi-class classification). The binary target keeps the computation and the results simpler.
### We also need to handle both the numerical and the categorical features

In [3]:
df['day'].unique()

['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']

In [4]:
df['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [5]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

No missing values!

In [6]:
df.describe()

|   | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [7]:
df.time

0      Dinner
1      Dinner
2      Dinner
3      Dinner
4      Dinner
        ...  
239    Dinner
240    Dinner
241    Dinner
242    Dinner
243    Dinner
Name: time, Length: 244, dtype: category
Categories (2, object): ['Lunch', 'Dinner']

Let's convert the time feature into a numerical feature, because it is the output feature we will predict. Models operate on numbers, so we encode its two labels as integers.

In [8]:
# For converting to numerical features
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['time'] = encoder.fit_transform(df['time'])

In [9]:
df.time.unique()

array([0, 1])
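
To see which label became which integer, the fitted encoder can be inspected. The sketch below uses a small made-up series (`time_demo` is illustrative, not the real column) and relies on `LabelEncoder` assigning codes in sorted label order, which is why Dinner maps to 0 and Lunch to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for the `time` column
time_demo = pd.Series(['Dinner', 'Lunch', 'Dinner', 'Lunch'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(time_demo)

# classes_ holds the original labels; the position of a label is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels
decoded = encoder.inverse_transform([0, 1])
print(list(decoded))
```

Keeping the fitted `encoder` around makes it possible to report predictions as Dinner or Lunch instead of 0 or 1.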

In [10]:
df.time

0      0
1      0
2      0
3      0
4      0
      ..
239    0
240    0
241    0
242    0
243    0
Name: time, Length: 244, dtype: int64

In [11]:
## Split into independent and dependent features
X = df.drop(labels=['time'], axis=1) # Independent
y = df.time # Dependent

In [12]:
X.head()

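|   | total_bill | tip | sex | smoker | day | size |
|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | 4 |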


In [13]:
y

0      0
1      0
2      0
3      0
4      0
      ..
239    0
240    0
241    0
242    0
243    0
Name: time, Length: 244, dtype: int64

In [14]:
X['day'].value_counts()

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

In [15]:
# Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
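
The split above is purely random. When the two classes are not equally frequent, passing `stratify` keeps the class ratio the same in the train and test parts. This is an optional variation, not what the notebook does; the arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced target: 70 samples of class 0, 30 of class 1
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Both parts keep the 70/30 class ratio
print(len(y_te), int(y_te.sum()))
print(len(y_tr), int(y_tr.sum()))
```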

We need to automate the conversion of the categorical features (sex, smoker, day) into numerical values so that it also works on a large scale. On a small scale, we could do the same by hand during EDA and feature engineering, but we can't repeat that manually every time. That's why we create pipelines that handle missing values, one-hot encode the categorical features, and scale the numerical features automatically.

### https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

After this, we will run this pipeline in an automated way.

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # For Handling Missing Values
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
from sklearn.preprocessing import StandardScaler # For Feature Scaling
from sklearn.preprocessing import OneHotEncoder # Categorical to Numerical
from sklearn.compose import ColumnTransformer # For combining multiple pipelines

In [17]:
categorical_cols = ['sex', 'smoker','day']
numerical_cols = ['total_bill', 'tip','size']

### Here, we segregate the columns into categorical and numerical columns because they need different preprocessing. The model has to be retrained regularly, and whenever new data arrives it can contain missing values, so the cleaning steps must be repeatable.

### We will create separate pipelines for the numerical columns and the categorical columns.

### Each pipeline handles the missing values in its own columns. Duplicate rows are not handled by these pipelines; they would have to be dropped from the DataFrame beforehand.

### This is followed by feature scaling for the numerical columns and one-hot encoding for the categorical columns.

In [18]:
## For Feature Engineering Automation
## Numerical Pipeline
num_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),  # fill missing values with the column median
        ('scaler', StandardScaler())  # scale to zero mean and unit variance
    ]
)

## Categorical Pipeline
cat_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),  # fill missing values with the most frequent category
        ('onehotencoder', OneHotEncoder())  # categorical features to numerical 0/1 columns
    ]
)
# SimpleImputer(strategy='median') = a missing numerical value is replaced by the median learned from the training data
# SimpleImputer(strategy='most_frequent') = a missing categorical value is replaced by the most frequent category learned from the training data
# OneHotEncoder is used because sex, smoker and day are nominal features (they have no inherent order)
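
To make the two pipelines concrete, here is a minimal sketch on a tiny made-up frame with one missing value per column. The frame `toy` and the names `num_demo` and `cat_demo` are illustrative; the steps mirror the pipelines defined above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: one missing bill and one missing day
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, np.nan, 30.0],
    'day': pd.Series(['Sat', 'Sun', 'Sat', np.nan], dtype=object),
})

num_demo = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_demo = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehotencoder', OneHotEncoder()),
])

# The missing bill becomes the median (20.0), which scales to 0.0
num_out = num_demo.fit_transform(toy[['total_bill']])

# The missing day becomes 'Sat'; OneHotEncoder returns a sparse matrix by default
cat_out = cat_demo.fit_transform(toy[['day']]).toarray()

print(num_out.ravel())
print(cat_out)
```

The one-hot output has one column per category seen during fitting, here Sat and Sun.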

In [19]:
# To combine both pipelines, we will use ColumnTransformer
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)

])
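
One thing to watch with this setup: by default `OneHotEncoder` raises an error if it meets a category at transform time that it did not see during fitting. A possible safeguard, shown here as a sketch on made-up data and not as part of the notebook's preprocessor, is `handle_unknown='ignore'`, which encodes an unseen category as all zeros. The frames `train_demo` and `new_demo` are illustrative, and `sparse_threshold=0` is set only to get a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train_demo = pd.DataFrame({'tip': [1.0, 2.0, 3.0], 'day': ['Sat', 'Sun', 'Sat']})
new_demo = pd.DataFrame({'tip': [2.0], 'day': ['Fri']})  # 'Fri' was never seen in training

pre_demo = ColumnTransformer(
    [
        ('num', StandardScaler(), ['tip']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['day']),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_out = pre_demo.fit_transform(train_demo)
new_out = pre_demo.transform(new_demo)

# One scaled column plus one indicator column per category seen in training
print(train_out.shape)
# The tip equals the training mean and 'Fri' is unknown, so every entry is zero
print(new_out)
```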

Now the preprocessor is ready, so we can apply it to the dataset.

In [20]:
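X_train = preprocessor.fit_transform(X_train)  # learn medians, means and categories from the training data
X_test = preprocessor.transform(X_test)  # reuse the training statistics on the test data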
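
The training set gets `fit_transform` while the test set only gets `transform`, so that the imputation and scaling statistics come from the training data alone. A minimal sketch with made-up numbers (`train_bills` and `test_bills` are illustrative, not taken from the tips data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_bills = np.array([[10.0], [20.0], [30.0]])
test_bills = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_bills)  # learns mean and std from the training rows
test_scaled = scaler.transform(test_bills)        # applies those same values to the test row

# The mean stored in the scaler is the training mean; the test row did not influence it
print(scaler.mean_[0])
print(test_scaled[0, 0])
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics and make the test score less trustworthy.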

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [23]:
## Model Training Automation
models = {
    'Random Forest': RandomForestClassifier(),
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier()
}

In [24]:
from sklearn.metrics import accuracy_score

In [26]:
def evaluate_model(X_train, y_train, X_test, y_test, models):
    """Fit every model on the training data and return its test accuracy by name."""
    report = {}
    for name, model in models.items():
        # Train model
        model.fit(X_train, y_train)

        # Predict testing data
        y_test_pred = model.predict(X_test)

        # Get accuracy for test data prediction
        report[name] = accuracy_score(y_test, y_test_pred)

    return report


In [27]:
evaluate_model(X_train, y_train, X_test, y_test, models)

{'Random Forest': 0.9591836734693877,
 'Logistic Regression': 1.0,
 'Decision Tree': 0.9387755102040817}

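Here, the test accuracies of the three algorithms are 95.9% (Random Forest), 100% (Logistic Regression) and 93.9% (Decision Tree). Logistic Regression scores highest on this split, but the test set has only 49 rows, so the gaps correspond to just two or three samples and say little about which model generalizes best. We continue with Random Forest, the subject of this notebook, and tune its hyperparameters.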
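
Because a single small test split gives a noisy ranking, cross-validated accuracy is a steadier way to compare the same three models. The sketch below is an optional addition that uses synthetic data from `make_classification` instead of the tips data; `X_demo`, `y_demo`, `demo_models` and the reduced `n_estimators` are assumptions made to keep it self-contained and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the preprocessed features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_models = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 folds for each model
cv_report = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    for name, model in demo_models.items()
}
print(cv_report)
```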

In [28]:
classfier = RandomForestClassifier()

In [29]:
## Hyperparameter Tuning
params = {
    'max_depth': [3, 5, 10, None],
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy']
}

In [30]:
from sklearn.model_selection import RandomizedSearchCV

In [31]:
# n_iter defaults to 10, so 10 of the 24 parameter combinations are sampled at random
cv = RandomizedSearchCV(classfier, param_distributions=params, scoring='accuracy', cv=5, verbose=3)
cv.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END criterion=entropy, max_depth=None, n_estimators=300;, score=0.974 total time=   0.6s
[CV 2/5] END criterion=entropy, max_depth=None, n_estimators=300;, score=0.923 total time=   0.5s
[CV 3/5] END criterion=entropy, max_depth=None, n_estimators=300;, score=1.000 total time=   0.5s
[CV 4/5] END criterion=entropy, max_depth=None, n_estimators=300;, score=0.949 total time=   0.5s
[CV 5/5] END criterion=entropy, max_depth=None, n_estimators=300;, score=0.923 total time=   0.4s
[CV 1/5] END criterion=gini, max_depth=None, n_estimators=300;, score=0.974 total time=   0.4s
[CV 2/5] END criterion=gini, max_depth=None, n_estimators=300;, score=0.923 total time=   0.5s
[CV 3/5] END criterion=gini, max_depth=None, n_estimators=300;, score=1.000 total time=   0.5s
[CV 4/5] END criterion=gini, max_depth=None, n_estimators=300;, score=0.949 total time=   0.5s
[CV 5/5] END criterion=gini, max_depth=None, n_estimators=300;, score

In [32]:
cv.best_params_

{'n_estimators': 300, 'max_depth': None, 'criterion': 'entropy'}
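
The search above tunes the classifier on data that was preprocessed once, up front. One possible refinement, which the notebook does not do, is to put the preprocessor and the classifier into a single `Pipeline` and tune that, so each cross-validation fold refits the imputers, scaler and encoder on its own training part. The sketch below assumes a synthetic frame `demo` with a made-up target, and the step names `'preprocessor'` and `'classifier'` are illustrative. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; the target rule (1 on Thursdays) is made up
rng = np.random.default_rng(42)
n = 120
demo = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, n),
    'size': rng.integers(1, 7, n),
    'day': rng.choice(['Thur', 'Fri', 'Sat', 'Sun'], n),
})
y_demo = (demo['day'] == 'Thur').astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

num_steps = Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())])
cat_steps = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_steps, ['total_bill', 'size']),
                         ('cat', cat_steps, ['day'])])

# Preprocessing and model in one estimator
model = Pipeline([('preprocessor', pre),
                  ('classifier', RandomForestClassifier(random_state=42))])

param_distributions = {
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__n_estimators': [50, 100],
    'classifier__criterion': ['gini', 'entropy'],
}

search = RandomizedSearchCV(model, param_distributions=param_distributions,
                            n_iter=5, scoring='accuracy', cv=5, random_state=42)
search.fit(X_tr, y_tr)

# The best pipeline is refit on all training rows and can score raw test rows directly
test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(test_accuracy)
```

With this layout the raw DataFrame goes straight into `fit` and `score`, and no statistics from a validation fold leak into the preprocessing.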