In [1]:
# Sanity check: confirm which Python executable (conda env) is in use
import sys
sys.executable

'/home/hackerman/anaconda2/envs/anaconda2_py27/bin/python'

### Naive Bayes (NB) classifier

#### 1. Using a database of breast cancer tumor information, we use a Naive Bayes (NB) classifier to predict whether a tumor is malignant or benign.
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn

#### 2. Load dataset

In [3]:
import sklearn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

Attributes are a critical part of any classifier. Attributes capture important characteristics about the nature of the data. Given the label we are trying to predict (malignant vs benign tumor), possible useful attributes include the size, radius, and texture of the tumor.

In [5]:
# Variables for each set of data
label_names = data['target_names'] #the ones to be classified
labels = data['target'] # labels mapped to binary values 0:malignant, 1:benign
feature_names = data['feature_names']
features = data['data']

In [24]:
# Get a feel of the data
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

['malignant' 'benign']
0
mean radius
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
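
Before splitting the data it can help to check its overall size and class balance. The following is a minimal sketch that reloads the same dataset; the variable names `n_samples`, `n_features` and `class_counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# One row per tumor, one column per feature
n_samples, n_features = data.data.shape
print("{} samples, {} features".format(n_samples, n_features))

# Count how many tumors fall into each class (0 = malignant, 1 = benign)
class_counts = np.bincount(data.target)
for name, count in zip(data.target_names, class_counts):
    print("{}: {}".format(name, count))
```

The dataset holds 569 tumors described by 30 features, with 212 malignant and 357 benign cases, so the classes are moderately imbalanced.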


#### 3. Organize data into sets
To evaluate how well a classifier is performing, you should always test the model on unseen data. Therefore, before building a model, split your data into two parts: a training set and a test set.

In [26]:
from sklearn.model_selection import train_test_split

train, test, train_labels, test_labels = train_test_split(features,labels,test_size=0.33,random_state=42) 
#The function randomly splits the data using the test_size parameter. 
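
As a quick check on the split, the sketch below repeats the same call on a freshly loaded copy of the data and reports how many rows land in each part. The share of malignant cases is printed as an extra sanity check that is not in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, labels = data.data, data.target

# Same call as above: 33% of the rows are held out for testing
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)

print("training rows: {}".format(len(train)))
print("test rows: {}".format(len(test)))

# Label 0 means malignant, so 1 - mean(labels) is the malignant share
print("malignant share (train): {:.3f}".format(1 - train_labels.mean()))
print("malignant share (test): {:.3f}".format(1 - test_labels.mean()))
```

With 569 rows and `test_size=0.33`, the split gives 381 training rows and 188 test rows.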

#### 4. Build and evaluate classifier model
We focus on a simple algorithm that usually performs well in binary classification tasks, namely Naive Bayes (NB). 

In [28]:
from sklearn.naive_bayes import GaussianNB # Check how this one works

#Initialize classifier
gnb_clf = GaussianNB()

#Train classifier
model = gnb_clf.fit(train, train_labels) #Training data 
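
To get a feel for what a Gaussian Naive Bayes classifier does, here is a minimal NumPy sketch of the idea. It is not scikit-learn's implementation. The helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the toy data and the small variance constant are all assumptions made for illustration. Each class gets a prior plus a mean and variance per feature, and a new point is assigned to the class with the highest log posterior, treating the features as independent Gaussians.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant (an assumption of this sketch) avoids division by zero
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log posterior for each row of X."""
    # diff has shape (n_rows, n_classes, n_features)
    diff = X[:, np.newaxis, :] - means[np.newaxis, :, :]
    # log p(x | c), summed over features (independence assumption)
    log_likelihood = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + diff ** 2 / variances, axis=2)
    log_posterior = np.log(priors) + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

# Toy data: two well-separated clusters in two dimensions
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
toy_pred = predict_gaussian_nb(np.array([[1.1, 2.1], [5.1, 5.9]]), *params)
print(toy_pred)
```

The first query point lies next to the class 0 cluster and the second next to the class 1 cluster, so the sketch assigns them to classes 0 and 1.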

The predict() function returns an array with one predicted label for each data instance in the test set. We can then print the predictions to get a sense of what the model determined.

In [29]:
#Make predictions
prediction = gnb_clf.predict(test)
print(prediction) # Print the predicted class (0 or 1) for each test case

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]


#### 5. Evaluate model's accuracy
Now compare the predicted labels (prediction) against the true labels (test_labels).

In [30]:
from sklearn.metrics import accuracy_score

#Evaluate accuracy
print(accuracy_score(test_labels,prediction)) #real vs predicted values

0.9414893617021277


The NB classifier reaches an accuracy of 94.15% on the test set: it correctly predicts whether a tumor is malignant or benign for 94.15 percent of the held-out cases.
These results suggest that our feature set of 30 attributes is a good indicator of tumor class.
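
Accuracy is simply the fraction of predictions that match the true labels. The sketch below spells that out with a hypothetical helper `accuracy` and made-up toy labels. It then checks that the score reported above corresponds to 177 correct predictions out of the 188 test cases.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and true label agree."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical toy labels: 4 of the 5 predictions match
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)

# 177 correct out of 188 test cases reproduces the reported score
reported = 177 / 188.0
print(reported)
```

In other words, the classifier mislabeled 11 of the 188 held-out tumors.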

#### Future work
Continue to work with your code to see if you can make your classifier perform even better. You could experiment with different subsets of features or even try completely different algorithms. Check out Scikit-learn's website for more machine learning ideas.

### Random Forest Classifier
The data for this tutorial is the iris dataset. It contains four variables measuring various parts of iris flowers from three related species, plus a fifth variable with the species name.

https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/

#### 1. Libraries

In [33]:
from sklearn.datasets import load_iris #dataset
from sklearn.ensemble import RandomForestClassifier #RF classifier
import pandas as pd 
import numpy as np
np.random.seed(0)

#### 2. Load data

In [34]:
iris = load_iris()

#Create dataframe with the four feature variables 
df = pd.DataFrame(iris.data, columns=iris.feature_names)

#Display top 5 rows 
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [40]:
# Add a new column with the species names; this is what we are going to try to predict
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [72]:
print(iris.target_names)
print(iris.target)

['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


#### 3. Create Training And Test Data

In [43]:
# Create a new column that for each row, generates a random number between 0 and 1, and
# if that value is less than or equal to .75, then sets the value of that cell as True
# and false otherwise. This is a quick and dirty way of randomly assigning some rows to
# be used as the training data and some as the test data.

df['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75

df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species,is_train
0,5.1,3.5,1.4,0.2,setosa,True
1,4.9,3.0,1.4,0.2,setosa,False
2,4.7,3.2,1.3,0.2,setosa,True
3,4.6,3.1,1.5,0.2,setosa,True
4,5.0,3.6,1.4,0.2,setosa,True


In [44]:
# Create two new dataframes, one with the training rows, one with the test rows
train, test = df[df['is_train'] == True], df[df['is_train'] == False]

In [45]:
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))

('Number of observations in the training data:', 112)
('Number of observations in the test data:', 38)
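
The random mask above gives split sizes that depend on the random draw, and it does not guarantee that each species is equally represented in the test set. One possible alternative, sketched here and not taken from the original tutorial, is `train_test_split` with the `stratify` argument. The names `train_df`, `test_df` and `test_counts` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Hold out 25% of the rows while keeping the species proportions balanced
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=iris.target, random_state=0)

print("training rows: {}".format(len(train_df)))
print("test rows: {}".format(len(test_df)))

# Each species should contribute about a third of the test rows
test_counts = test_df['species'].value_counts()
print(test_counts)
```

With 150 rows this gives 112 training rows and 38 test rows, which happens to match the sizes produced by the random mask above, and each species contributes 12 or 13 rows to the test set.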


#### Preprocess data 

In [46]:
# Create a list of the feature column's names
features = df.columns[:4]

# View features
features

Index([u'sepal length (cm)', u'sepal width (cm)', u'petal length (cm)',
       u'petal width (cm)'],
      dtype='object')

In [48]:
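# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
# Note: pd.factorize numbers values in order of first appearance, which here
# matches the order of iris.target_names because the setosa rows come first.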
y = pd.factorize(train['species'])[0]

# View target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])
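
The codes returned by `pd.factorize` depend on the order in which values first appear, which matters later when predictions are mapped back to names through `iris.target_names`. The following sketch uses a small hypothetical series (not the notebook's data) to show the difference between `pd.factorize` and a categorical with an explicit category order.

```python
import pandas as pd

# Hypothetical example where the first row is NOT the first category
species = pd.Series(['virginica', 'setosa', 'versicolor', 'setosa'])

# factorize numbers the values in order of first appearance
codes, uniques = pd.factorize(species)
print(codes)          # codes 0, 1, 2, 1
print(list(uniques))  # ['virginica', 'setosa', 'versicolor']

# A categorical with explicit categories yields codes in category order
ordered = pd.Categorical(
    species, categories=['setosa', 'versicolor', 'virginica'])
print(ordered.codes)  # codes 2, 0, 1, 0
```

In the notebook the two orderings coincide, but with shuffled rows the explicit category order would be the safer choice.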

#### Train RF Classifier

In [58]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs=2, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the species)
clf.fit(train[features], y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [56]:
train[features].head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
7,5.0,3.4,1.5,0.2


In [57]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

#### Apply classifier to Test Data

In [59]:
# Apply the Classifier we trained to the test data (which, remember, it has never seen before)
clf.predict(test[features])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [61]:
# View the predicted probabilities of the first 20 observations
clf.predict_proba(test[features])[0:20]

array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [0.9, 0.1, 0. ],
       [0.9, 0.1, 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 0.9, 0.1],
       [0. , 1. , 0. ],
       [0. , 0.2, 0.8]])
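
Each row of `predict_proba` holds one probability per class, averaged over the trees in the forest, and `predict` returns the class with the highest value. The sketch below checks both properties. For brevity it fits a forest on the full iris data, which is an assumption of this example and differs from the train/test split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# n_estimators=10 is set explicitly to mirror the fitted model shown above
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(iris.data, iris.target)

sample = iris.data[:5]
proba = clf.predict_proba(sample)

# Each row holds one probability per class and sums to 1
row_sums = proba.sum(axis=1)
print(row_sums)

# predict() corresponds to the class with the highest probability
from_proba = clf.classes_[np.argmax(proba, axis=1)]
same = np.array_equal(from_proba, clf.predict(sample))
print(same)
```

With 10 fully grown trees, each tree typically votes entirely for one class, which is consistent with the probabilities above being multiples of 0.1.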

#### Evaluate classifier

In [62]:
# Create actual English names for the plants for each predicted plant class
preds = iris.target_names[clf.predict(test[features])] # Going back from numerical to written species

In [65]:
# View the PREDICTED species for the first 20 observations
preds[0:20]

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'virginica'],
      dtype='|S10')

In [67]:
# View the ACTUAL (REAL) species for the first 20 observations
test['species'].head(20)

1          setosa
5          setosa
6          setosa
13         setosa
14         setosa
15         setosa
24         setosa
27         setosa
34         setosa
43         setosa
59     versicolor
60     versicolor
65     versicolor
69     versicolor
71     versicolor
73     versicolor
75     versicolor
78     versicolor
90     versicolor
101     virginica
Name: species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]

##### That looks pretty good, at least for the first 20 observations. Now let's look at all the data.

#### Creating a confusion matrix
In the confusion matrix, the columns are the species we predicted for the test data and the rows are the actual species for the test data.
The short explanation of how to interpret a confusion matrix is: anything on the diagonal was classified correctly and anything off the diagonal was classified incorrectly.

In [68]:
# Create confusion matrix
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])

Actual Species \ Predicted Species,setosa,versicolor,virginica
setosa,10,0,0
versicolor,0,9,0
virginica,0,1,18
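
The confusion matrix also gives the overall accuracy: the sum of the diagonal divided by the total number of test observations. Here is a small worked sketch using the counts from the table above. The per-class recall is an extra illustration that is not in the original notebook.

```python
import numpy as np

# Counts copied from the confusion matrix above
# (rows: actual species, columns: predicted species)
cm = np.array([[10, 0, 0],
               [0, 9, 0],
               [0, 1, 18]])

correct = np.trace(cm)   # sum of the diagonal entries
total = cm.sum()
accuracy = correct / float(total)
print("{} of {} correct, accuracy = {:.4f}".format(correct, total, accuracy))

# Per-class recall: diagonal divided by the row totals
recall = np.diag(cm) / cm.sum(axis=1).astype(float)
print(recall)
```

Here 37 of the 38 test flowers are classified correctly (about 97.4%), and the single error is a virginica predicted as versicolor.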


#### View Feature Importance
Determine how important each feature was for the classifier. This is one of the most useful properties of random forests: the importance scores show directly that, for example, petal width contributed far more to the classification than sepal width.

In [69]:
# View a list of the features and their importance scores
list(zip(train[features], clf.feature_importances_))

[('sepal length (cm)', 0.13313710938050363),
 ('sepal width (cm)', 0.03292918868346942),
 ('petal length (cm)', 0.299290321517918),
 ('petal width (cm)', 0.5346433804181089)]
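
The scores are easier to read when sorted. This sketch copies the values from the output above into a list and ranks them. The names `importances`, `ranked` and `total` are illustrative.

```python
# Scores copied from the output above
importances = [('sepal length (cm)', 0.13313710938050363),
               ('sepal width (cm)', 0.03292918868346942),
               ('petal length (cm)', 0.299290321517918),
               ('petal width (cm)', 0.5346433804181089)]

# Rank features from most to least important
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{:<18} {:.3f}".format(name, score))

# The importance scores are normalized, so they sum to 1
total = sum(score for _, score in importances)
print(round(total, 6))
```

The two petal measurements together account for roughly 83% of the total importance, while sepal width contributes only about 3%.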