# Welcome!
In this example, we will use a random forest model to classify an animal as a cat or a dog given certain features. The dataset we used will be linked, so you can upload it to a similar Colab notebook if you want to follow along. This notebook is split into **4 parts**:


1.   Preparing the data
2.   Making the model
3.   Training our model with the processed data
4.  Testing our model by having it make predictions on new data 





### 1. Preparing the data
Using pandas and numpy, we will process the CatsVsDogs.csv file so that its data can be used in our random forest model.

In [4]:
#importing the libraries we will be using to process the data - pandas and numpy
import pandas as pd
import numpy as np

In [5]:
#taking a look at the structure of the csv file we have
df = pd.read_csv("sample_data/CatsVsDogs.csv")
df.head()

Unnamed: 0,Whiskers,Large Teeth,Likes Milk,Loud,Cat
0,1,0,1,0,1
1,0,1,0,1,0
2,1,0,1,0,1
3,0,1,0,1,0
4,1,0,1,0,1
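
A column named `Unnamed: 0` usually appears when a CSV was saved together with its row index. Whether that is the case for CatsVsDogs.csv is an assumption here; the sketch below rebuilds the five rows shown above as in-memory CSV text (not the real file) and shows how `index_col=0` keeps that column out of the features.

```python
import io
import pandas as pd

# The five rows shown above, written out as CSV text with an unnamed index column
# (an assumption about how the file is laid out on disk)
csv_text = """,Whiskers,Large Teeth,Likes Milk,Loud,Cat
0,1,0,1,0,1
1,0,1,0,1,0
2,1,0,1,0,1
3,0,1,0,1,0
4,1,0,1,0,1
"""

# Default read: the unnamed first column is loaded as a regular column
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 uses that column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

If the column is present, either this option or selecting the feature columns by name keeps the row number from being treated as a feature.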


In [6]:
#separating the data into features (X) and labels (y), selecting columns by name
X = df[["Whiskers", "Large Teeth", "Likes Milk", "Loud"]].values
y = df["Cat"].values

In [7]:
#making an 80/20 train-test split for the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
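
To see what an 80/20 split produces, here is a minimal sketch on stand-in data. The rows are made up to mirror the column layout above and are not taken from CatsVsDogs.csv. The `stratify` argument is an optional extra that the cell above does not use; it keeps the cat/dog ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: made-up rows mirroring the column layout of CatsVsDogs.csv
cat_row = {"Whiskers": 1, "Large Teeth": 0, "Likes Milk": 1, "Loud": 0, "Cat": 1}
dog_row = {"Whiskers": 0, "Large Teeth": 1, "Likes Milk": 0, "Loud": 1, "Cat": 0}
demo_df = pd.DataFrame([cat_row, dog_row] * 10)  # 20 rows, alternating cat/dog

X_demo = demo_df[["Whiskers", "Large Teeth", "Likes Milk", "Loud"]].values
y_demo = demo_df["Cat"].values

# test_size=0.2 holds out 20% of the rows for testing;
# stratify=y_demo keeps the class ratio equal in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```

With 20 rows, this leaves 16 rows for training and 4 for testing.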

### 2. Making the model
We will use sklearn's built-in random forest model.

In [8]:
#importing the Random Forest model
from sklearn.ensemble import RandomForestClassifier

#creating the random forest object with 30 trees (as indicated by the n_estimators param)
classifier = RandomForestClassifier(n_estimators=30)
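
A random forest draws random samples of rows and features while it grows its trees, so two runs of the cell above can give slightly different models. The sketch below, on made-up stand-in data rather than the real dataset, shows that passing a fixed `random_state` (the value 42 is an arbitrary choice) makes the result repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: random binary features, with a label that copies the first column
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 4))
y_demo = X_demo[:, 0]

# Two forests built with the same seed on the same data grow identical trees
forest_a = RandomForestClassifier(n_estimators=30, random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=30, random_state=42).fit(X_demo, y_demo)

same = np.array_equal(forest_a.predict_proba(X_demo), forest_b.predict_proba(X_demo))
print(same)
```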

### 3. Training the model

In [9]:
#fitting the model to our training data
classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
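
Once a forest is fitted, it can also report how much each feature contributed to its splits. This is a small sketch on made-up stand-in rows (not the real dataset), with a fixed `random_state` added only so the run is repeatable.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Whiskers", "Large Teeth", "Likes Milk", "Loud"]

# Stand-in data: made-up rows that mirror the column layout shown earlier
cat_row = [1, 0, 1, 0]
dog_row = [0, 1, 0, 1]
X_demo = pd.DataFrame([cat_row, dog_row] * 10, columns=feature_names)
y_demo = [1, 0] * 10  # 1 = cat, 0 = dog, matching the row order above

forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_demo, y_demo)

# One importance score per feature; the scores sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

In data like this, where every feature separates cats from dogs equally well, which feature gets credit depends on the random choices made in each tree.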

### 4. Testing and evaluating the model

In [10]:
import numpy as np

#making predictions on our test data
y_pred = classifier.predict(X_test)

#pairing the first four test rows with their predictions so we can add them to a dataframe
results = np.column_stack((X_test[:4], y_pred[:4]))

#visualizing results in a new dataframe (the "Cat" column holds the model's prediction)
r_df = pd.DataFrame(data=results, columns=["Whiskers", "Large Teeth", "Likes Milk", "Loud", "Cat"])
print(r_df.head())

#looking at our results
from sklearn.metrics import accuracy_score
print(f"{accuracy_score(y_test, y_pred) * 100}% accuracy rate")

   Whiskers  Large Teeth  Likes Milk  Loud  Cat
0         1            0           1     0    1
1         0            1           0     1    0
2         1            0           1     0    1
3         0            1           0     1    0
100.0% accuracy rate
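
Accuracy is a single number, so it does not show which class the mistakes fall on. A confusion matrix does. The labels below are hypothetical and chosen only to illustrate the layout; they are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (1 = cat, 0 = dog)
y_true_demo = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 0, 0, 1, 1]

# Rows are the true classes, columns are the predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(cm)

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(demo_accuracy)
```

Here one dog is predicted as a cat and one cat as a dog, so 6 of the 8 predictions are correct.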


### Conclusions


*   Our model performed extraordinarily well on the data (real-world data will **almost never** be this perfect, though)
*   How much accuracy is sufficient depends on the task; since the model already scores 100% on the test set, we will not have to fine-tune its parameters this time
*   You could try this approach on other datasets as well, and tweak parameters such as the number of estimators (number of trees in the model)
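
As a rough sketch of that last point, the loop below compares a few values of `n_estimators` using 5-fold cross-validation. The data is made up (random binary features with a slightly noisy label), and the tree counts 5, 30 and 100 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: random binary features; the label copies the first
# feature, with roughly 10% of the labels flipped to add noise
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 4))
flip = rng.random(200) < 0.1
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

mean_scores = {}
for n_trees in (5, 30, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # cross_val_score fits the model on 4 folds and scores it on the 5th, 5 times
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[n_trees] = scores.mean()
    print(n_trees, round(mean_scores[n_trees], 3))
```

Cross-validation averages the score over several splits, which gives a steadier comparison than a single train-test split when the dataset is small.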




