# CLASSIFICATION WITH RANDOM FOREST

The random forest algorithm is a machine learning method that can be used for supervised learning tasks such as classification and regression. The algorithm works by constructing a set of decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split. In the case of classification, the output of a random forest model is the mode of the predicted classes across the decision trees.
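
The voting step described above can be sketched in a few lines. This is only an illustration: the votes are made up, and `majority_vote` is a hypothetical helper, not part of scikit-learn. Note also that scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees rather than counting hard votes, so the sketch shows the textbook rule, not the library internals.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    # most_common(1) returns a list with the single (label, count) pair
    # that has the highest count.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one animal.
votes = ["Adoption", "Transfer", "Adoption", "Return_to_owner", "Adoption"]
print(majority_vote(votes))
```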

### Problem Description

* Dataset: Intake information such as breed, color, sex and age from the Austin Animal Center

* Target: Predict the outcome (e.g. adoption or return to owner) of the animals as they leave the Animal Center.

* Source: [Reference](https://www.kaggle.com/c/shelter-animal-outcomes)

In [1]:
import pandas as pd
import numpy as np


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split 
#from sklearn.preprocessing import StandardScaler # used for feature scaling

from sklearn import model_selection
from sklearn.metrics import classification_report, confusion_matrix

from sklearn import preprocessing

#from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#from sklearn.model_selection import StratifiedKFold
#from sklearn import svm


import warnings

warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('C:/Users/Toshiba/Documents/OVGU/SS2018/Advanced Business Analytics/Analytics Challenge/Analytics Challenge-20180530/Datasets/Shelter Animal Outcomes/train.csv')


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26729 entries, 0 to 26728
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   AnimalID        26729 non-null  object 
 1   Name            19038 non-null  object 
 2   DateTime        26729 non-null  object 
 3   OutcomeType     26729 non-null  object 
 4   OutcomeSubtype  13117 non-null  object 
 5   AnimalType      26729 non-null  object 
 6   SexuponOutcome  26728 non-null  object 
 7   AgeuponOutcome  26711 non-null  object 
 8   Breed           26729 non-null  object 
 9   Color           26729 non-null  object 
 10  Unnamed: 10     0 non-null      float64
 11  Unnamed: 11     0 non-null      float64
 12  Unnamed: 12     14 non-null     object 
 13  Unnamed: 13     24 non-null     object 
dtypes: float64(2), object(12)
memory usage: 2.9+ MB


## Data Preprocessing & Feature extraction

In [4]:
# drop the unwanted, (almost) empty 'Unnamed' columns
df = df.drop(['Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13'], axis=1)

df.describe().transpose()


| | count | unique | top | freq |
|---|---|---|---|---|
| AnimalID | 26729 | 26729 | A671945 | 1 |
| Name | 19038 | 6374 | Max | 136 |
| DateTime | 26729 | 22918 | 11/08/2015 00:00 | 19 |
| OutcomeType | 26729 | 5 | Adoption | 10769 |
| OutcomeSubtype | 13117 | 16 | Partner | 7816 |
| AnimalType | 26729 | 2 | Dog | 15595 |
| SexuponOutcome | 26728 | 5 | Neutered Male | 9779 |
| AgeuponOutcome | 26711 | 44 | 1 year | 3969 |
| Breed | 26729 | 1380 | Domestic Shorthair Mix | 8810 |
| Color | 26729 | 366 | Black/White | 2824 |


####  Handling of missing values

In [5]:
#df[df.isnull().any(axis=1)] 
df.isnull().sum()

AnimalID              0
Name               7691
DateTime              0
OutcomeType           0
OutcomeSubtype    13612
AnimalType            0
SexuponOutcome        1
AgeuponOutcome       18
Breed                 0
Color                 0
dtype: int64

#### Filling missing values

In [6]:
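# animals without a name get the placeholder 'Unkown' (spelling as in the outputs below)
df['Name'] = df['Name'].fillna('Unkown')

# fill the remaining missing values in every column with that column's most frequent value
df_clean = df.apply(lambda x: x.fillna(x.value_counts().index[0]))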
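
To make the mode-filling step above concrete, here is a dependency-free sketch of the same idea for a single column. The helper `fill_with_mode` and the sample values are hypothetical, and `None` stands in for a missing value.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical column with one missing entry.
sex = ["Neutered Male", None, "Spayed Female", "Neutered Male"]
filled = fill_with_mode(sex)
print(filled)
```

Filling with the mode keeps every row, at the cost of slightly inflating the most common category.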



#### 'AgeuponOutcome' preprocessing

In [7]:
# split each age string such as '2 years' into its number, captured by (\d+),
# and its unit, captured by ([A-Za-z]+); the raw string avoids invalid-escape warnings
result = df_clean['AgeuponOutcome'].str.split(r'(\d+) ([A-Za-z]+)', n=1, expand = True)
#result = result.loc[:, ['1', '2']]
#result.rename(columns={1:'x', 2:'y'}, inplace=True)

# replace the 'AgeuponOutcome' column with the number (first captured group)
df_clean['AgeuponOutcome'] = result[1]

# store the unit (second captured group) in a separate '_AgeuponOutcome' column
df_clean['_AgeuponOutcome'] = result[2]



In [8]:
# new data frame with the unit split at 's', so that 'years' becomes 'year'
result2 = df_clean['_AgeuponOutcome'].str.split('s', n=1, expand = True)

# replace the '_AgeuponOutcome' column with the unit without the plural 's'
df_clean['_AgeuponOutcome'] = result2[0]





In [9]:
print(df_clean['_AgeuponOutcome'].unique())

['year' 'week' 'month' 'day']


In [10]:
# convert each unit in '_AgeuponOutcome' to an approximate number of days (a month is counted as 28 days)
mapping = {'year': 365, 'week':7, 'month':28, 'day' :1}
df_clean = df_clean.replace({'_AgeuponOutcome':mapping})




#### Convert Data Types

In [11]:
cols_to_convert = ['AgeuponOutcome','_AgeuponOutcome']

for col in cols_to_convert:
    df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')
    



In [12]:
# compute age in days
values = df_clean.AgeuponOutcome*df_clean._AgeuponOutcome

# replace the original values in the column 'AgeuponOutcome'
df_clean['AgeuponOutcome'] = values


In [13]:
df_clean = df_clean.drop(['_AgeuponOutcome'], axis=1)
df_clean.shape

(26729, 10)

In [14]:
age = [0, 365, 7300]
labels = ['Infant', 'Adult']
df_clean['AgeuponOutcome'] = pd.cut(df_clean['AgeuponOutcome'], bins=age, labels=labels, include_lowest=True)
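
The whole age pipeline above (split into number and unit, convert to days, bin into two groups) can be summarised in a small dependency-free sketch. The helper names `age_in_days` and `age_group` are hypothetical; the day counts and bin edges are the ones used in the cells above, and the bins are assumed to be right-inclusive with the lowest edge included, as in the `pd.cut` call.

```python
import re

# Same approximate conversion factors as the notebook's mapping.
DAYS_PER_UNIT = {"year": 365, "month": 28, "week": 7, "day": 1}


def age_in_days(text):
    """Convert strings such as '2 years' or '3 weeks' to a number of days."""
    match = re.fullmatch(r"(\d+) ([A-Za-z]+?)s?", text)
    if match is None:
        return None
    count, unit = match.groups()
    factor = DAYS_PER_UNIT.get(unit.lower())
    return None if factor is None else int(count) * factor


def age_group(days):
    """Mimic the bins [0, 365] -> 'Infant' and (365, 7300] -> 'Adult'."""
    if days is None or not 0 <= days <= 7300:
        return None
    return "Infant" if days <= 365 else "Adult"


for text in ["3 weeks", "1 year", "2 years"]:
    print(text, age_in_days(text), age_group(age_in_days(text)))
```

Because the upper edge of the first bin is included, an animal recorded as exactly `1 year` (365 days) still falls into the `Infant` group.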


#### 'DateTime' Preprocessing

In [15]:
# new data frame with split value columns
result3 = df_clean['DateTime'].str.split(' ', n=1,  expand=True)
#result3.rename(columns={0:'Date', 1:'Time'}, inplace=True)

# Create new column 'Date' with first split column from new data frame
df_clean['Date'] = result3[0]

# Create new column 'Time' with  second column from new data frame
df_clean['Time'] = result3[1]



In [16]:
df_clean = df_clean.drop(['Date'], axis=1)



#### Create seasons from 'DateTime'

In [17]:
df_clean['DateTime'] = pd.to_datetime(df_clean.DateTime)

# NOTE: as written, May to September is labelled 'Winter months' and every other
# month 'Summer months'. The two labels look swapped; they are kept as they are
# so that the outputs further down still match.
is_may_to_sep = df_clean.DateTime.dt.month.isin(np.arange(5,10))
df_clean['Seasons'] = np.where(is_may_to_sep, 'Winter months', 'Summer months')
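
For comparison, here is a minimal sketch of the season rule under the assumption that May to September are meant to be the summer months. This is the opposite of the labels the cell above produces, so treat it as a possible correction rather than as the notebook's behaviour. The helper `season` is hypothetical.

```python
from datetime import datetime


def season(timestamp):
    """Label May-September as summer and the rest of the year as winter."""
    return "Summer months" if 5 <= timestamp.month <= 9 else "Winter months"


print(season(datetime(2015, 7, 4, 14, 30)))
print(season(datetime(2015, 11, 8, 0, 0)))
```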



#### Create day of the week from 'DateTime'

In [18]:

df_clean['Day'] = df_clean['DateTime'].dt.day_name()




#### 'Hour' preprocessing

In [19]:
# new data frame with split value columns
result4 = df_clean['Time'].str.split(':', n=1,  expand=True)

# Create new column 'Hour' with the first split column from the new data frame
df_clean['Hour'] = result4[0]

# Create new column 'Minutes' with the second split column from the new data frame
df_clean['Minutes'] = result4[1]



In [20]:
df_clean['Hour'] = pd.to_numeric(df_clean['Hour'], errors='coerce')
df_clean.dtypes

AnimalID                  object
Name                      object
DateTime          datetime64[ns]
OutcomeType               object
OutcomeSubtype            object
AnimalType                object
SexuponOutcome            object
AgeuponOutcome          category
Breed                     object
Color                     object
Time                      object
Seasons                   object
Day                       object
Hour                       int64
Minutes                   object
dtype: object

In [21]:
b = [0, 4, 8, 12, 16, 20, 24]
shifts = ['Late Night', 'Early Morning','Morning','Noon','Evening','Night']
df_clean['Time of the day'] = pd.cut(df_clean['Hour'], bins=b, labels=shifts, include_lowest=True)
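
The binning above can be checked by hand with a small sketch. It assumes the intervals are right-inclusive with the lowest edge included, which is what `pd.cut(..., include_lowest=True)` does by default; `time_of_day` is a hypothetical helper and not part of pandas.

```python
from bisect import bisect_left

EDGES = [0, 4, 8, 12, 16, 20, 24]
SHIFTS = ["Late Night", "Early Morning", "Morning", "Noon", "Evening", "Night"]


def time_of_day(hour):
    """Return the shift whose interval (left, right] contains the hour."""
    if not 0 <= hour <= 24:
        return None
    # bisect_left finds the first edge >= hour; the interval index is one less.
    # Hour 0 is clamped into the first interval (the 'include_lowest' case).
    return SHIFTS[max(bisect_left(EDGES, hour) - 1, 0)]


for hour in (0, 4, 5, 12, 13, 23):
    print(hour, time_of_day(hour))
```

One consequence of right-inclusive bins is that hour 12 (12:00 to 12:59) is labelled `Morning`, and `Noon` only starts at hour 13.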


In [22]:
df_clean = df_clean.drop(['DateTime','Time', 'Hour', 'Minutes' ], axis=1)


#### 'Breed' Preprocessing

In [23]:

df_clean['Breed (Mix/Not Mix)'] = df_clean['Breed'].str.extract('(Mix|/)')[0]


df_clean = df_clean.replace({'Breed (Mix/Not Mix)': '/'},
                            {'Breed (Mix/Not Mix)': 'Mix'}, regex=True)

Check for missing values again.

In [24]:
#df_clean[df_clean.isnull().any(axis=1)] 
df_clean.isnull().sum()

AnimalID                  0
Name                      0
OutcomeType               0
OutcomeSubtype            0
AnimalType                0
SexuponOutcome            0
AgeuponOutcome            0
Breed                     0
Color                     0
Seasons                   0
Day                       0
Time of the day           0
Breed (Mix/Not Mix)    1391
dtype: int64

In [25]:
# breeds without 'Mix' or '/' are labelled 'Not Mix'
df_clean['Breed (Mix/Not Mix)'] = df_clean['Breed (Mix/Not Mix)'].fillna('Not Mix')
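
The extract, replace and fill steps used for the breed flag boil down to one rule, sketched here without pandas. The helper `mix_flag` and the example strings are illustrative assumptions.

```python
def mix_flag(text, markers=("Mix", "/")):
    """Return 'Mix' if any marker occurs in the text, otherwise 'Not Mix'."""
    return "Mix" if any(marker in text for marker in markers) else "Not Mix"


print(mix_flag("Domestic Shorthair Mix"))
print(mix_flag("Beagle"))
# For colours, only the '/' delimiter indicates a mix.
print(mix_flag("Black/White", markers=("/",)))
```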

In [26]:
# drop the original 'Breed' column, which is now summarised by 'Breed (Mix/Not Mix)'
df_clean = df_clean.drop(['Breed'], axis=1)

df_clean.columns

Index(['AnimalID', 'Name', 'OutcomeType', 'OutcomeSubtype', 'AnimalType',
       'SexuponOutcome', 'AgeuponOutcome', 'Color', 'Seasons', 'Day',
       'Time of the day', 'Breed (Mix/Not Mix)'],
      dtype='object')

#### 'Name' Preprocessing

In [27]:
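# extract digits that appear inside names into a temporary 'Name_dirt' column
df_clean['Name_dirt'] = df_clean['Name'].str.extract(r'(\d+)')[0]

# the temporary column is not used any further, so it is dropped again
df_clean = df_clean.drop(['Name_dirt'],axis=1)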

#### Sex/Gender categories Preprocessing

In [28]:
df_clean['SexuponOutcome'].unique()

array(['Neutered Male', 'Spayed Female', 'Intact Male', 'Intact Female',
       'Unknown'], dtype=object)

In [29]:
df_clean['Gender'] = df_clean['SexuponOutcome'].str.extract('(Male|Female|Unknown)')[0]

df_clean['Health'] = df_clean['SexuponOutcome'].str.extract('(Neutered|Spayed|Intact|Unknown)')[0]
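
The split of `SexuponOutcome` into two features can be sketched as follows. `split_sex` is a hypothetical helper; it assumes every value is either `Unknown` or has the form `<status> <gender>`, which matches the five unique values listed above.

```python
def split_sex(value):
    """Split e.g. 'Neutered Male' into (gender, health)."""
    if value == "Unknown":
        return ("Unknown", "Unknown")
    health, gender = value.split(" ", 1)
    return (gender, health)


for value in ["Neutered Male", "Spayed Female", "Intact Male", "Unknown"]:
    print(value, "->", split_sex(value))
```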

In [30]:
df_clean = df_clean.drop(['SexuponOutcome'], axis=1)

df_clean.columns

Index(['AnimalID', 'Name', 'OutcomeType', 'OutcomeSubtype', 'AnimalType',
       'AgeuponOutcome', 'Color', 'Seasons', 'Day', 'Time of the day',
       'Breed (Mix/Not Mix)', 'Gender', 'Health'],
      dtype='object')

#### Color Preprocessing

In [31]:
# we use the delimiter '/' to identify mixed colors
df_clean['Color(Mix/Not Mix)'] = df_clean['Color'].str.extract('(/)')[0]

df_clean = df_clean.replace({'Color(Mix/Not Mix)': '/'},
                            {'Color(Mix/Not Mix)': 'Mix'}, regex=True)

In [32]:
# replace missing values with 'Not Mix'
df_clean = df_clean.replace(np.nan, 'Not Mix', regex=True)



In [33]:
# drop unused columns

df_clean = df_clean.drop(['Color'], axis=1)
df_clean.shape

(26729, 13)

In [34]:
# Split Animal Id from the rest of the columns
df_ID = df_clean[['AnimalID', 'Name']].copy()

In [35]:
df_clean = df_clean.drop(['AnimalID', 'Name'], axis=1)

In [36]:
df_clean.columns

Index(['OutcomeType', 'OutcomeSubtype', 'AnimalType', 'AgeuponOutcome',
       'Seasons', 'Day', 'Time of the day', 'Breed (Mix/Not Mix)', 'Gender',
       'Health', 'Color(Mix/Not Mix)'],
      dtype='object')

In [37]:
df_clean.shape

(26729, 11)

### Label Encoding
Conversion of each column from string categorical to numerical categorical

In [38]:
df_clean.describe().transpose()

| | count | unique | top | freq |
|---|---|---|---|---|
| OutcomeType | 26729 | 5 | Adoption | 10769 |
| OutcomeSubtype | 26729 | 16 | Partner | 21428 |
| AnimalType | 26729 | 2 | Dog | 15595 |
| AgeuponOutcome | 26729 | 2 | Infant | 15877 |
| Seasons | 26729 | 2 | Summer months | 15533 |
| Day | 26729 | 7 | Saturday | 4183 |
| Time of the day | 26729 | 6 | Noon | 10640 |
| Breed (Mix/Not Mix) | 26729 | 2 | Mix | 25338 |
| Gender | 26729 | 3 | Male | 13305 |
| Health | 26729 | 4 | Neutered | 9780 |


In [39]:
# store the original label data before encoding in a new data frame 'y_labels'
non_num_cols = df_clean[df_clean.columns[0]]
y_labels = pd.DataFrame(non_num_cols)
y_labels = y_labels.rename(columns={"OutcomeType":"OutcomeType_labels"})

# Conversion of each column from string categorical to numerical categorical
for col in df_clean.columns:
    # convert the column to the categorical dtype, then keep the integer code of each value
    df_clean[col] = df_clean[col].astype('category').cat.codes
    
df_clean.head()


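| | OutcomeType | OutcomeSubtype | AnimalType | AgeuponOutcome | Seasons | Day | Time of the day | Breed (Mix/Not Mix) | Gender | Health | Color(Mix/Not Mix) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 12 | 1 | 0 | 0 | 5 | 4 | 0 | 1 | 1 | 0 |
| 1 | 2 | 15 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 2 | 1 |
| 2 | 0 | 6 | 1 | 1 | 0 | 2 | 2 | 0 | 1 | 1 | 0 |
| 3 | 4 | 12 | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 1 |
| 4 | 4 | 12 | 1 | 1 | 0 | 0 | 2 | 0 | 1 | 1 | 1 |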


In [40]:
# store label data after encoding in a new data-frame 'num_col'
num_col = df_clean[df_clean.columns[0]]
y_codes = pd.DataFrame(num_col)
y_codes = y_codes.rename(columns={"OutcomeType":"OutcomeType_codes"})


In [41]:
# confirmation that transformation went smoothly
print(y_labels.shape)
print(y_codes.shape)

(26729, 1)
(26729, 1)


#### Label Mapping

Decipher the meaning of the transformed categorical codes.

In [42]:
# inner join on indices used to merge the data frame of original y labels with the y encoded to create a new 'labels' data-frame
labels = pd.merge(y_labels, y_codes, left_index=True, right_index=True )

labels = labels.drop_duplicates(subset=['OutcomeType_labels'],ignore_index=True)

labels

| | OutcomeType_labels | OutcomeType_codes |
|---|---|---|
| 0 | Return_to_owner | 3 |
| 1 | Euthanasia | 2 |
| 2 | Adoption | 0 |
| 3 | Transfer | 4 |
| 4 | Died | 1 |
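
The codes in the table above are consistent with the categories of a string column being sorted alphabetically before they are numbered. The following sketch reproduces the mapping under that assumption, without pandas.

```python
# Outcome labels in the order in which they first appear in the data.
labels = ["Return_to_owner", "Euthanasia", "Adoption", "Transfer", "Died"]

# Sort the distinct labels and number them from 0, one code per category.
codes = {label: code for code, label in enumerate(sorted(set(labels)))}

for label in labels:
    print(label, codes[label])
```

Keeping such a dictionary around makes it easy to translate predictions back to readable labels later on.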


Splitting the attributes into independent and dependent attributes

In [43]:
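# Remove the labels from the features (X)
# Note: 'OutcomeSubtype' is recorded together with the outcome itself, so keeping
# it as a feature may leak information about the target.
X = df_clean.drop('OutcomeType', axis=1)

# labels we want to predict (Y/Outcome)
Y = df_clean['OutcomeType']

# Saving feature names for later use (features only, without the target column)
X_actual = list(X.columns)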

#### Splitting the dataset into **training** and **testing datasets**

Any machine learning model needs to be evaluated on data it has not seen during training. In order to do that, we divide our data set into two parts: a training set and a testing set.

The training set is 75% of the original data set and the testing set is 25% of the original data set (`test_size=0.25`).
Generally, when training a model, we randomly split the data into training and testing sets to get a representation of all data points (if we trained on the first nine months of the year and then used the final three months for prediction, our algorithm would not perform well because it has not seen any data from those last three months). I am seeding the random state with 0, which means the split will be the same each time I run the notebook from the top, for reproducible results.

In [44]:
random_state = np.random.RandomState(0)
# splitting the dataset into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=random_state) 


We can look at the shape of all the data to make sure we did everything correctly. We expect the number of columns of the training features to match that of the testing features, and the number of rows of each feature set to match the number of its labels:

In [45]:
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', Y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', Y_test.shape)


Training Features Shape: (20046, 10)
Training Labels Shape: (20046,)
Testing Features Shape: (6683, 10)
Testing Labels Shape: (6683,)
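
The shapes above are consistent with the test set size being rounded up to a whole number of rows and the remaining rows going to the training set. A minimal sketch of a seeded random split, using only the standard library, is shown below. `split_indices` is a hypothetical helper; it does not reproduce the exact rows that `train_test_split` selects, only the sizes and the idea of a reproducible shuffle.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices reproducibly and cut off the test share."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(26729, 0.25, seed=0)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same split, which is what makes the experiment repeatable.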


####  Training using random forest model



In [46]:
# random forest model object creation.
model = RandomForestClassifier(random_state=random_state)

# Train the model on training data
model.fit(X_train, Y_train)

# predictions
y_predicted = model.predict(X_test)

#### Evaluating Model Performance



In [47]:
print("---- Confusion Matrix ----")
print(confusion_matrix(Y_test, y_predicted))
print('\n')

print("--- Classification Report ----")
print(classification_report(Y_test, y_predicted))
print('\n')

---- Confusion Matrix ----
[[2149    2    1  315  243]
 [   5   28    1    1    4]
 [  12    1  372    2    6]
 [ 470    0    0  474  248]
 [ 416    0    0  276 1657]]


--- Classification Report ----
              precision    recall  f1-score   support

           0       0.70      0.79      0.75      2710
           1       0.90      0.72      0.80        39
           2       0.99      0.95      0.97       393
           3       0.44      0.40      0.42      1192
           4       0.77      0.71      0.74      2349

    accuracy                           0.70      6683
   macro avg       0.76      0.71      0.73      6683
weighted avg       0.70      0.70      0.70      6683





### Confusion Matrix

A more informative way to evaluate the performance of a classifier than a single accuracy figure is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused an outcome of Died (code 1) with Adoption (code 0), you would look in the 2nd row and 1st column of the confusion matrix, which here holds the value 5.

Each row in a confusion matrix represents an actual class, while each column represents a predicted class.
([Reference](https://stackoverflow.com/questions/25692293/inserting-a-link-to-a-webpage-in-an-ipython-notebook))

The confusion matrix is useful for reading off the false positives and false negatives of each class. The **classification report** summarises precision, recall and F1-score per class and gives the overall accuracy of the model (0.70).
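
The numbers in the classification report can be recomputed directly from the confusion matrix printed above. The sketch below does this in plain Python; `precision_recall` is a hypothetical helper, and the matrix is copied from the output of cell 47.

```python
# Rows are actual classes, columns are predicted classes (codes 0 to 4).
cm = [
    [2149, 2, 1, 315, 243],
    [5, 28, 1, 1, 4],
    [12, 1, 372, 2, 6],
    [470, 0, 0, 474, 248],
    [416, 0, 0, 276, 1657],
]


def precision_recall(cm, k):
    """Precision and recall of class k from a confusion matrix."""
    true_positives = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum
    return true_positives / predicted_as_k, true_positives / actually_k


accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
precision_3, recall_3 = precision_recall(cm, 3)
print(round(accuracy, 2), round(precision_3, 2), round(recall_3, 2))
```

For class 3 (Return_to_owner) this gives a precision of about 0.44 and a recall of about 0.40, which matches the report and shows that this is the class the model struggles with most.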

#### K-fold Cross validation

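The results were checked with 5-fold cross-validation. Note that the cells below run it on the test split only (`X_test`, `Y_test`), so each fold is trained on less data than the original model. The mean cross-validated accuracy is about 0.68, slightly below the 0.70 accuracy obtained on the single train/test split.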

In [48]:
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from numpy import interp  # scipy.interp is deprecated in favour of numpy.interp
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

* In a ROC curve, the x axis is the false positive rate and the y axis is the true positive rate.
* The closer the curve is to the top-left corner of the plot, the more accurate the test.
* The ROC score is the AUC, the area under the curve computed from the prediction scores.
* We want the AUC to be close to 1.

In [49]:
# K-fold f1_weighted

kf = KFold(shuffle=True, n_splits=5) # To make a 5-fold CV

cv_results_kfold = cross_val_score(model, X_test, Y_test, cv=kf, scoring= 'f1_weighted')

print("K-fold Cross Validation f1_weighted Results: ",cv_results_kfold)
print("K-fold Cross Validation f1_weighted Results Mean: ",cv_results_kfold.mean())


K-fold Cross Validation f1_weighted Results:  [0.69119315 0.66032447 0.67454461 0.69686801 0.66549904]
K-fold Cross Validation f1_weighted Results Mean:  0.6776858548522116


In [50]:
# K-fold accuracy
kf = KFold(shuffle=True, n_splits=5) # To make a 5-fold CV

cv_results_kfold = cross_val_score(model, X_test, Y_test, cv=kf, scoring= 'accuracy')

print("K-fold Cross Validation accuracy Results: ",cv_results_kfold)
print("K-fold Cross Validation accuracy Results Mean: ",cv_results_kfold.mean())


K-fold Cross Validation accuracy Results:  [0.6551982  0.6671653  0.69708302 0.69311377 0.66991018]
K-fold Cross Validation accuracy Results Mean:  0.6764940948320264
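
To show what a shuffled 5-fold split does with the 6683 rows used above, here is a dependency-free sketch. `kfold_indices` is a hypothetical helper that imitates the idea behind `KFold(shuffle=True, n_splits=5)`; it assumes that the first `n % k` folds get one extra row and does not reproduce scikit-learn's exact row assignment.

```python
import random


def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + 1 if fold < extra else base
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(6683, 5, seed=0))
print([len(test) for _, test in folds])
```

Every row is used for testing exactly once and for training in the other four folds, which is why the mean over folds is a more stable estimate than a single split.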


### Results

In [53]:
# 'x' holds the actual outcome codes, 'y' the predicted ones
df_new = pd.DataFrame({'x':Y_test, 'y':y_predicted}) 

# using dictionary to convert specific columns (both hold integer class codes)
convert_dict = {'x': int,
                'y': int
                }
df_new = df_new.astype(convert_dict)

transform_nums = {"x":     {0: "Adoption", 1: "Died", 2:"Euthanasia", 3:"Return_to_owner", 4:"Transfer"},
                  "y":     {0: "Adoption", 1: "Died", 2:"Euthanasia", 3:"Return_to_owner", 4:"Transfer"}}
df_new = df_new.replace(transform_nums)
df_new.head()

| | x | y |
|---|---|---|
| 18437 | Return_to_owner | Adoption |
| 3076 | Adoption | Adoption |
| 20034 | Return_to_owner | Transfer |
| 18699 | Transfer | Transfer |
| 16317 | Adoption | Adoption |


In [57]:
print(df_ID.shape)
print(df_new.shape)
df_ID.head()

(26729, 2)
(6683, 2)


| | AnimalID | Name |
|---|---|---|
| 0 | A671945 | Hambone |
| 1 | A656520 | Emily |
| 2 | A686464 | Pearce |
| 3 | A683430 | Unkown |
| 4 | A667013 | Unkown |


In [58]:
results = pd.merge(df_ID, df_new, left_index=True, right_index=True )
results.head()

| | AnimalID | Name | x | y |
|---|---|---|---|---|
| 4 | A667013 | Unkown | Transfer | Return_to_owner |
| 7 | A701489 | Unkown | Transfer | Transfer |
| 8 | A671784 | Lucy | Adoption | Adoption |
| 12 | A684601 | Rocket | Adoption | Adoption |
| 18 | A679010 | Chrissy | Transfer | Transfer |


### Further Analysis

AUC stands for Area Under the Curve and ROC for Receiver Operating Characteristic, so to talk about the ROC AUC score we need to define the ROC curve first. It is a chart that visualizes the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) at each classification threshold. The higher the TPR and the lower the FPR at each threshold, the better, so classifiers whose curves lie closer to the top-left corner of the plot ([Reference](https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc)) are better.

In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve, or ROC AUC score. The closer the curve is to the top-left corner, the larger the area and hence the higher the ROC AUC score.
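
For a binary problem, the ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores. `roc_auc` is a hypothetical helper; for the five-class problem in this notebook one would additionally need a one-vs-rest scheme, which is not shown here.

```python
def roc_auc(labels, scores):
    """ROC AUC for binary labels (0/1) via pairwise score comparison."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores for four animals.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A score of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking.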