# Machine Learning with Pandas and Scikit-learn

First, import the libraries we will need. Note that we have a new friend: sklearn, a.k.a. [scikit-learn](http://scikit-learn.org/stable/).

In [1]:
import pandas as pd
from sklearn import linear_model
from sklearn import svm
from sklearn.metrics import accuracy_score

Now we import the data with pandas, following the usual steps.

In [2]:
diabetes=pd.read_csv('indians-diabetes.csv')

print (diabetes.shape)
print (diabetes.columns)
diabetes.head()

print ('There are',diabetes.shape[1], 'columns')
print ('Since one is a class there are',diabetes.shape[1]-1,'features')
print ('There are',diabetes.shape[0],'rows.  Each row is a feature vector')

(393, 9)
Index(['Pregnancy', 'Glucose', 'bp', 'skinfold', 'insulin', 'bmi', 'dpf',
       'age', 'class'],
      dtype='object')
There are 9 columns
Since one is a class there are 8 features
There are 393 rows.  Each row is a feature vector


As with KNIME, we have to define the features and the class. In KNIME we do this through the GUI; in Python we do it through code. So we need to assign the feature set and the class labels to variables. In this example we call the features 'x' and the class 'y'.

In [3]:
# get the x (feature table) by removing the class from the original data
# (the rest of the data will be features)
x=diabetes.drop('class',axis=1)

print ("type of x")
# note x is still a pandas DataFrame here (it is converted to a numpy array later)
print (type(x))

print ('x')
print (x.head())

type of x
<class 'pandas.core.frame.DataFrame'>
x
   Pregnancy  Glucose  bp  skinfold  insulin   bmi    dpf  age
0          1       89  66        23       94  28.1  0.167   21
1          0      137  40        35      168  43.1  2.288   33
2          3       78  50        32       88  31.0  0.248   26
3          2      197  70        45      543  30.5  0.158   53
4          1      189  60        23      846  30.1  0.398   59


Now get the class values and assign them to a variable y.

In [4]:
# get the y (the class) by just retrieving the class from the original data
y=diabetes['class']

print ('y')
print (y.head())

print (type(y))

y
0    0
1    1
2    1
3    1
4    1
Name: class, dtype: int64
<class 'pandas.core.series.Series'>


Now convert x (the feature vectors) and y (the class labels) to numpy arrays. The values are already numeric; the conversion just moves them out of the pandas containers and into plain numeric arrays that we can use for machine learning calculations and training.

In [5]:
print ("get x and y as numeric (numpy) arrays")
x=x.values
y=y.values

print(type(x))
print(type(y))

get x and y as numeric (numpy) arrays
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
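
As a small illustration of the conversion, the sketch below builds a two-row frame by hand (the values are copied from the `head()` output above, the frame itself is made up for the example) and converts it with `to_numpy()`, which pandas offers as an alternative to the `.values` attribute. The names `frame`, `features` and `labels` are only used here.

```python
import pandas as pd

# A hand-made two-row frame standing in for the diabetes data
frame = pd.DataFrame({"Glucose": [89, 137], "bmi": [28.1, 43.1], "class": [0, 1]})

# Same idea as above: drop the class column to get features, keep it for labels
features = frame.drop("class", axis=1).to_numpy()
labels = frame["class"].to_numpy()

# Mixed integer and float columns end up in one floating point array
print(type(features), features.shape, features.dtype)
print(type(labels), labels.shape)
```

Because a numpy array holds a single data type, the integer and floating point columns are stored together as floats after the conversion.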


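Divide the data into training and testing sets. This is equivalent to the KNIME partitioning step. Here we simply take the first 200 rows for training and the remaining 193 rows for testing.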

In [6]:
train_size=200

x_train=x[:train_size,:]
y_train=y[:train_size]
x_test=x[train_size:,:]
y_test=y[train_size:]
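
The slicing above takes rows in file order, which is fine as long as the file is not sorted in some meaningful way. As an alternative sketch (not what this notebook uses), scikit-learn's `train_test_split` shuffles the rows before partitioning. The data below is synthetic and only mimics the shape of the diabetes table; `x_demo`, `y_demo` and the other names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data: 393 rows, 8 features, binary class
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(393, 8))
y_demo = rng.integers(0, 2, size=393)

# Hold out 193 rows for testing so the sizes match the slicing above.
# Rows are shuffled by default; stratify keeps the class balance similar
# in both partitions, and random_state makes the split repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=193, random_state=0, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
```

A shuffled split will generally give different accuracy numbers from the ones reported in this notebook, since the classifiers see different training rows.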

Initialize a nearest neighbors learner.  This step is equivalent to setting up the KNIME K nearest neighbor node. 

In [7]:
from sklearn import neighbors
nn=neighbors.KNeighborsClassifier(3)

Fit the training data. For k-nearest neighbors this basically means 'setting up' the machine learning system, essentially by storing the training examples. Other methods (such as decision trees and neural networks) have a more involved training process.

In [8]:
nn.fit(x_train, y_train)
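
To make the idea concrete, here is a minimal hand-written sketch of what a k-nearest neighbors prediction does: measure the distance from a query row to every stored training row, pick the k closest, and take a majority vote. This is an illustration only, not scikit-learn's implementation; the function `knn_predict` and the toy data are made up for the example.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict labels by majority vote among the k closest training rows."""
    predictions = []
    for row in x_query:
        # Euclidean distance from this query row to every stored training row
        distances = np.sqrt(((x_train - row) ** 2).sum(axis=1))
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Majority vote over the neighbours' (non-negative integer) labels
        votes = np.bincount(y_train[nearest])
        predictions.append(votes.argmax())
    return np.array(predictions)

# Tiny made-up data set: two clusters, labelled 0 and 1
x_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_predictions = knn_predict(x_toy, y_toy, np.array([[0.05, 0.1], [5.0, 5.1]]))
print(toy_predictions)
```

All of the work happens at prediction time, which is why "fitting" involves so little for this method.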

Run the classifier on the testing data. The result is a vector of 'predictions': 0 indicates the first class (no diabetes) and 1 indicates the second class (diabetes).

In [9]:
predictions_nn=nn.predict(x_test)
type(predictions_nn)
print(predictions_nn.shape)
print(predictions_nn[0:10])

(193,)
[0 1 0 0 0 0 1 1 1 0]


Now we score the algorithm by comparing its predictions with the true class labels of the test set.

In [10]:
nn_score=accuracy_score(y_test, predictions_nn)

print ("Nearest Neighbors")
print ("train size: ",train_size,"has accuracy:", nn_score)

Nearest Neighbors
train size:  200 has accuracy: 0.7253886010362695
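
Accuracy is simply the fraction of predictions that match the true labels. The short sketch below checks this on five made-up labels, comparing a manual calculation with `accuracy_score`; the arrays `y_true_demo` and `y_pred_demo` are invented for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Five made-up true labels and predictions; one prediction (index 2) is wrong
y_true_demo = np.array([0, 1, 1, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1])

# Fraction of positions where prediction and truth agree
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```

Four of the five predictions are correct, so both calculations give 0.8.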


Now try Naive Bayes. Naive Bayes is a machine learning method based on Bayes' rule and conditional probability. This means it can take prior knowledge into account, specifically the probability of each class occurring. Note that the code is almost exactly the same as for k-nearest neighbors.

In [11]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

First we fit the training data.  

In [12]:
gnb.fit(x_train, y_train)

Run the classifier on the testing data and compute the score.

In [13]:
predictions_gnb=gnb.predict(x_test)
gnb_score=accuracy_score(y_test, predictions_gnb)

print ("Gaussian Naive Bayes")
print ("train size: ",train_size,"has accuracy:", gnb_score)

Gaussian Naive Bayes
train size:  200 has accuracy: 0.7772020725388601
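
To see the "prior knowledge" part at work, the sketch below fits `GaussianNB` on a tiny made-up data set with six examples of class 0 and two of class 1, then inspects the class priors the model estimated from those counts. The arrays `x_nb` and `y_nb` are invented for the example and have nothing to do with the diabetes data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# One made-up feature; six rows of class 0 and two rows of class 1
x_nb = np.array([[1.0], [1.2], [0.8], [1.1], [0.9], [1.3], [3.0], [3.2]])
y_nb = np.array([0, 0, 0, 0, 0, 0, 1, 1])

gnb_toy = GaussianNB().fit(x_nb, y_nb)

# Priors estimated from class frequencies: 6/8 and 2/8
print(gnb_toy.class_prior_)

# Posterior probabilities for a new point; each row sums to 1
posterior = gnb_toy.predict_proba(np.array([[2.9]]))
print(posterior)
```

Even though class 1 has the smaller prior, a point at 2.9 lies so close to the class 1 examples that the likelihood outweighs the prior and the posterior favors class 1.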


Now try logistic regression. Logistic regression is a machine learning technique that uses optimization to find the best regression coefficients.

In [14]:
logistic = linear_model.LogisticRegression()

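Fit the training data (this is equivalent to the learning stage in Naive Bayes). Note that the fit below finishes with a convergence warning: the optimizer hit its iteration limit before it converged.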

In [15]:
logistic.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
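
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both, on synthetic data with deliberately mismatched feature scales, might look like the following. The pipeline name `scaled_logistic`, the value `max_iter=1000` and the data are assumptions made for the example, and the notebook's own results further down were produced without these changes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features on very different scales
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 500.0])
# The label depends on the first two features only
y_demo = (x_demo[:, 0] + x_demo[:, 1] / 50.0 > 0).astype(int)

# Standardize each feature, then fit logistic regression with a higher
# iteration limit than the default
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(x_demo, y_demo)

train_accuracy = scaled_logistic.score(x_demo, y_demo)
print(train_accuracy)
```

Putting the scaler inside a pipeline means the same scaling learned from the training rows is applied automatically whenever the model predicts on new rows.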


Now make the predictions and calculate the score.

In [16]:
predictions_log=logistic.predict(x_test)
logistic_score=accuracy_score(y_test, predictions_log)
    
print ("logistic !")
print ("train size: ",train_size,"has accuracy:", logistic_score)

logistic !
train size:  200 has accuracy: 0.7305699481865285
