# Random Forest Implementation
This is a hybrid sklearn and custom random forest implementation.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


Load the wine dataset from the UCI Machine Learning Repository

In [3]:
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
!cat wine.names

1. Title of Database: Wine recognition data
	Updated Sept 21, 1998 by C.Blake : Added attribute information

2. Sources:
   (a) Forina, M. et al, PARVUS - An Extendible Package for Data
       Exploration, Classification and Correlation. Institute of Pharmaceutical
       and Food Analysis and Technologies, Via Brigata Salerno, 
       16147 Genoa, Italy.

   (b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au
   (c) July 1991
3. Past Usage:

   (1)
   S. Aeberhard, D. Coomans and O. de Vel,
   Comparison of Classifiers in High Dimensional Settings,
   Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
   Mathematics and Statistics, James Cook University of North Queensland.
   (Also submitted to Technometrics).

   The data was used with many others for comparing various 
   classifiers. The classes are separable, though only RDA 
   has achieved 100% correct classification.
   (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
   (All results usi

In [4]:
import pandas as pd
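# Note: wine.data has no header row, so without header=None pandas consumes the first
# sample as column names. The outputs below reflect that (177 rows rather than 178).
# Pass header=None to keep all 178 samples.
yX = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data")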
yX.columns = ["class","alcohol","malic_acid","ash","alcalinity_ash","magnesium","total_phenols",
              "flavanoids","nonflavanoid_phenols","proanthocyanins","color_intensity","hue","OD280_over_OD315_diluted_wines","proline"]
X = yX.drop("class",axis=1)
y = yX['class']
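
scikit-learn also bundles this dataset as `sklearn.datasets.load_wine`, which avoids the download. The following is a minimal sketch, assuming scikit-learn 0.23 or later (needed for `as_frame`); the names `X_builtin` and `y_builtin` are illustrative and not used elsewhere in this notebook. The bundled column names differ slightly from the ones assigned above, and the bundled labels run 0 to 2, so the sketch shifts them to match the 1 to 3 labels in `wine.data`.

```python
from sklearn.datasets import load_wine

# Load the bundled copy of the wine data as pandas objects.
wine = load_wine(as_frame=True)

X_builtin = wine.data        # DataFrame: 178 rows x 13 feature columns
y_builtin = wine.target + 1  # Series: shift labels 0..2 to 1..3

print(X_builtin.shape)
print(y_builtin.value_counts())
```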

In [5]:
X

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,OD280_over_OD315_diluted_wines,proline
0,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
2,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
3,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
4,14.20,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450
...,...,...,...,...,...,...,...,...,...,...,...,...,...
172,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
173,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
174,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
175,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840


In [6]:
y.value_counts()

2    71
1    58
3    48
Name: class, dtype: int64

Let's see how well sklearn's own random forest classifier performs on a held-out test set.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [8]:
y_test_pred = rfc.predict(X_test)

In [9]:
acc = sum(y_test_pred == y_test)/len(y_test)
acc

0.9777777777777777

In [10]:
y_test_pred

array([1, 1, 3, 1, 2, 1, 2, 3, 2, 3, 1, 3, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2,
       2, 3, 3, 3, 2, 2, 2, 1, 1, 2, 3, 1, 1, 2, 3, 3, 2, 3, 1, 2, 2, 2,
       3])

Let's compare that to a single decision tree in sklearn.

In [11]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train,y_train)
dt_acc = sum(dt.predict(X_test) == y_test)/len(y_test)
dt_acc

0.8888888888888888

So I don't know about you, but I'm sold that we should implement random forest. Here is your in-class prompt: can you and your group implement random forest using sklearn's built-in decision tree classifier? You may not look at or use the RandomForestClassifier shown above. Up to the challenge? Beyond the decision tree, you can only use basic numpy and pandas functionality.
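
One possible shape for an answer is sketched below. It is not the reference solution for this exercise, just a minimal illustration of the two usual ingredients: each tree is fit on a bootstrap sample of the rows, each split considers a random subset of features (`max_features="sqrt"`), and predictions are combined by majority vote. The class name `SimpleForest` and its parameters are hypothetical. To keep the block runnable without a download it uses sklearn's bundled `load_wine` (labels 0 to 2) and `train_test_split` for the data only; the forest itself relies on nothing but numpy and `DecisionTreeClassifier`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class SimpleForest:
    """Hypothetical bagged ensemble of sklearn decision trees (illustration only)."""

    def __init__(self, n_estimators=10, max_features="sqrt", random_state=None):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.RandomState(self.random_state)
        self.classes_ = np.unique(y)
        self.trees_ = []
        n = len(X)
        for _ in range(self.n_estimators):
            # Bootstrap: draw n row indices with replacement.
            idx = rng.randint(0, n, size=n)
            tree = DecisionTreeClassifier(
                max_features=self.max_features,  # random feature subset per split
                random_state=rng.randint(0, 2**31 - 1),
            )
            self.trees_.append(tree.fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        X = np.asarray(X)
        # votes has shape (n_trees, n_samples).
        votes = np.stack([tree.predict(X) for tree in self.trees_])
        # counts has shape (n_classes, n_samples); ties go to the lowest label.
        counts = np.stack([(votes == c).sum(axis=0) for c in self.classes_])
        return self.classes_[counts.argmax(axis=0)]


data = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=42)

forest = SimpleForest(n_estimators=25, random_state=0).fit(X_tr, y_tr)
forest_pred = forest.predict(X_te)
forest_acc = (forest_pred == y_te).mean()
print(forest_acc)
```

Drawing each tree's seed from the shared `RandomState` keeps the whole ensemble reproducible from a single `random_state`.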

<img width=500 src="http://www.globalsoftwaresupport.com/wp-content/uploads/2018/02/ggff5544hh.png">

0.9066666666666666

Challenge Questions:

How would you modify this algorithm to return the probabilities?

From those probabilities, how would you create a ROC curve?
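
One way to approach both questions is sketched here under stated assumptions. For probabilities, average each tree's `predict_proba` output instead of counting votes; a bootstrap sample can miss a class entirely, so each tree's columns are aligned by label first. For the ROC curve, treat one class as positive (one-vs-rest), sort the test samples by their predicted probability for that class, and sweep the threshold, recording the true positive and false positive rates at each distinct score. The helper names `fit_bagged_trees`, `forest_predict_proba` and `roc_points` are hypothetical, and the data comes from sklearn's bundled `load_wine` so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=25, seed=0):
    """Fit decision trees on bootstrap samples (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_estimators):
        idx = rng.randint(0, len(X), size=len(X))  # rows drawn with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=rng.randint(0, 2**31 - 1)
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def forest_predict_proba(trees, X, classes):
    """Average per-tree class probabilities, aligned to the sorted `classes`."""
    proba = np.zeros((len(X), len(classes)))
    for tree in trees:
        # Map this tree's columns onto the full label set.
        cols = np.searchsorted(classes, tree.classes_)
        proba[:, cols] += tree.predict_proba(X)
    return proba / len(trees)


def roc_points(y_true_binary, scores):
    """Return (fpr, tpr); assumes both positives and negatives are present."""
    order = np.argsort(-scores, kind="mergesort")
    y_sorted, s_sorted = y_true_binary[order], scores[order]
    # Keep one point per distinct score: the last index of each run of ties.
    last = np.r_[np.where(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tps = np.cumsum(y_sorted)[last]
    fps = np.cumsum(1 - y_sorted)[last]
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr


data = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=42)
classes = np.unique(y_tr)

trees = fit_bagged_trees(X_tr, y_tr)
proba = forest_predict_proba(trees, X_te, classes)  # each row sums to 1

# One-vs-rest ROC for the first class, with AUC by the trapezoidal rule.
fpr, tpr = roc_points((y_te == classes[0]).astype(int), proba[:, 0])
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auc)
```

Plotting `tpr` against `fpr` (for example with matplotlib) gives the curve for that class; repeating the last three lines for each column of `proba` gives one curve per class.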

0.9543859649122807

In [19]:
accs

[0.8888888888888888,
 0.8888888888888888,
 0.8888888888888888,
 0.8888888888888888,
 0.8888888888888888,
 0.8888888888888888,
 0.0,
 0.8888888888888888,
 0.9333333333333333,
 0.9333333333333333,
 0.0,
 0.9333333333333333,
 0.8888888888888888,
 0.8888888888888888,
 0.9333333333333333,
 0.9333333333333333,
 0.8888888888888888,
 0.0,
 0.9333333333333333,
 0.0,
 0.9333333333333333,
 0.8888888888888888,
 0.9333333333333333,
 0.8888888888888888,
 0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.8888888888888888,
 0.8888888888888888,
 0.8888888888888888,
 0.0,
 0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.8888888888888888,
 0.9333333333333333,
 0.8888888888888888,
 0.0,
 0.9333333333333333,
 0.8888888888888888,
 0.8888888888888888,
 0.9333333333333333,
 0.8888888888888888,
 0.8888888888888888,
 0.8888888888888888,
 0.8888888888888888,
 0.9333333333333333