### Using Scikit-Learn


*  Scikit-Learn implements most of the standard machine learning algorithms. 
*  Let’s take a look at some of them.
*  First, the data must be loaded into memory so that we can work with it. 
*  Scikit-Learn works with NumPy arrays and also accepts pandas DataFrames, so we will use pandas to load the `*.csv` file.
 

In [6]:
import numpy as np
import pandas as pd


In [7]:
url = "https://raw.githubusercontent.com/RWorkshop/workshopdatasets/master/pimadiabetes.csv"

myData = pd.read_csv(url)

#### About the PIMA data set
The Pima diabetes database, donated by Vincent Sigillito (Johns Hopkins University), is a collection of medical diagnostic reports of 768 examples from a population living near Phoenix, Arizona, USA. 

Each example has 8 attribute values and one of two possible outcomes: whether the patient tested positive for diabetes (coded as 1) or not (coded as 0). 

The database now available in the repository has 512 examples in the training set and 256 examples in the test set.

#### Attribute Information
There are nine variables in this data set:


1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)$^2$) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1)


In [8]:
myData.head(5)

Unnamed: 0,npreg,glu,bp,skin,serum,bmi,ped,age,type
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [9]:
X = myData.drop(columns="type")
y = myData["type"]


We will use this dataset in all examples: X is the feature matrix (one row per patient) and y holds the values of the target variable.

#### Data Normalization

*  Gradient-based methods, which underlie many machine learning algorithms, are highly sensitive to *data scaling*.
*  Therefore, before running an algorithm, we should usually perform either normalization or standardization. 
*  Normalization rescales numeric features so that each of them lies in the range from 0 to 1.
*  Standardization rescales each feature so that it has mean 0 and variance 1. 
*  The Scikit-Learn library provides ready-made functions for rescaling. Note that `preprocessing.normalize` rescales each sample (row) to unit norm rather than mapping each feature to [0, 1], while `preprocessing.scale` standardizes each feature (column):


In [10]:

from sklearn import preprocessing

# scale each sample (row) to unit L2 norm
normalized_X = preprocessing.normalize(X)

# standardize each feature (column) to zero mean and unit variance
standardized_X = preprocessing.scale(X)
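The difference between row-wise and column-wise rescaling is easy to check on a tiny array. The following is a minimal sketch on made-up numbers (the array `demo` is hypothetical, not the Pima data); it also uses `MinMaxScaler`, which is one way to map each feature onto [0, 1].

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples (rows), 2 features (columns)
demo = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# normalize: each ROW is divided by its own Euclidean length
row_normalized = preprocessing.normalize(demo)

# MinMaxScaler: each COLUMN is mapped linearly onto [0, 1]
minmax_scaled = preprocessing.MinMaxScaler().fit_transform(demo)

# scale: each COLUMN gets zero mean and unit variance
standardized = preprocessing.scale(demo)

print(np.linalg.norm(row_normalized, axis=1))                # length of every row
print(minmax_scaled.min(axis=0), minmax_scaled.max(axis=0))  # per-column range
print(standardized.mean(axis=0), standardized.std(axis=0))   # per-column mean/std
```

Which of the three is appropriate depends on the algorithm that follows; the sketch only shows what each call does to the numbers.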


#### More on Data Normalization and Data Scaling
Data scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers measure the distance between two points using the Euclidean distance. 

If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.

#### Rescaling
The simplest method is to rescale each feature to the range [0, 1] or [−1, 1]. The choice of target range depends on the nature of the data. The general formula for rescaling to [0, 1] is:

$$ x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)}$$
where $x$ is the original value and $x'$ is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and then divide the result by 40 (the difference between the maximum and minimum weights).
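The weight example can be worked through in a few lines of plain Python. This is only a sketch: the helper name `rescale` and the individual weights are made up for illustration; only the span [160, 200] comes from the text.

```python
def rescale(values):
    """Min-max rescaling of a list of numbers onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical weights in pounds, spanning [160, 200] as in the text
weights = [160, 170, 180, 200]
rescaled = rescale(weights)
print(rescaled)  # [0.0, 0.25, 0.5, 1.0]
```

The lightest student maps to 0, the heaviest to 1, and everyone else falls proportionally in between.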

#### Standardization
In machine learning, we handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and neural networks). 

This is typically done by calculating standard scores. The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.

The standard score of a raw score x is

$$z = {x- \mu \over \sigma}$$
where:


- $\mu$ is the mean of the population;
- $\sigma$ is the standard deviation of the population.

The absolute value of z represents the distance between the raw score and the population mean in units of the standard deviation. z is negative when the raw score is below the mean, positive when above.
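As a small worked example, the standard score can be computed with the standard library alone. The scores below are hypothetical, and the helper name `standard_scores` is an assumption for illustration; the population standard deviation (`pstdev`) is used to match the definition of $\sigma$ above.

```python
from statistics import mean, pstdev

def standard_scores(values):
    """Return the z-score of every value, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical raw scores with mean 5 and population std 2
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = standard_scores(scores)
print(z)
```

Values below the mean of 5 get negative scores, values above it positive ones, and the raw score 9 lies two standard deviations above the mean.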

#### Scaling to unit length
Another option that is widely used in machine learning is to scale the components of a feature vector so that the complete vector has length one. This usually means dividing each component by the Euclidean length of the vector:

$$ x' = \frac{x}{||x||} $$
In some applications (e.g. histogram features) it can be more practical to use the L1 norm (i.e. Manhattan distance, city-block length, or taxicab geometry) of the feature vector instead. This scaling is especially important if the scalar metric is used as a distance measure in the following learning steps.
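A minimal sketch of unit-length scaling in plain Python, covering both the Euclidean (L2) and the L1 norm. The function name `scale_to_unit_length` and the example vector are assumptions made for illustration.

```python
import math

def scale_to_unit_length(vector, norm="l2"):
    """Divide every component by the chosen norm of the vector."""
    if norm == "l2":
        length = math.sqrt(sum(c * c for c in vector))
    elif norm == "l1":
        length = sum(abs(c) for c in vector)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        return list(vector)  # a zero vector is left unchanged
    return [c / length for c in vector]

v = [3.0, 4.0]
unit_l2 = scale_to_unit_length(v)        # Euclidean length of v is 5
unit_l1 = scale_to_unit_length(v, "l1")  # L1 length of v is 7
print(unit_l2, unit_l1)
```

After L2 scaling the squared components sum to one; after L1 scaling the absolute values of the components sum to one.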








#### Feature Selection

One of the most important skills in solving a task is the ability to properly choose or even create features. These activities are called Feature Selection and Feature Engineering. While Feature Engineering is quite a creative process that relies more on intuition and expert knowledge, there are plenty of ready-made algorithms for Feature Selection. Tree-based algorithms, for example, can compute how informative each feature is.


In [11]:

from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)

# display the relative importance of each attribute
print(model.feature_importances_)



[ 0.1036469   0.22530292  0.10056835  0.07999164  0.07552998  0.15237788
  0.11254108  0.15004124]
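The raw array is easier to read when paired with the column names. The sketch below copies the importances printed above and assumes they are listed in the same order as the feature columns shown by `myData.head(5)`; because `ExtraTreesClassifier` is randomized, a fresh run will give somewhat different numbers.

```python
# Column names taken from the head() output; assuming the importances
# are listed in the same column order
feature_names = ["npreg", "glu", "bp", "skin", "serum", "bmi", "ped", "age"]

# Values copied from the printed output above
importances = [0.1036469, 0.22530292, 0.10056835, 0.07999164,
               0.07552998, 0.15237788, 0.11254108, 0.15004124]

# Sort features from most to least important
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:>6}  {score:.3f}")
```

In this particular run, plasma glucose (`glu`) is the most informative feature, followed by `bmi` and `age`.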


Other methods search through subsets of features in order to find the subset on which the model performs best. One such search algorithm is Recursive Feature Elimination (RFE), which is also available in the Scikit-Learn library.


In [12]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)

# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)



[ True False False False False  True  True False]
[1 2 3 5 6 1 1 4]
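The two printed arrays can be decoded as follows. This sketch copies the output above and assumes the entries follow the column order shown by `myData.head(5)`. In `ranking_`, selected features have rank 1, and the larger the rank, the earlier the feature was eliminated.

```python
# Column names taken from the head() output; assumed to match the array order
feature_names = ["npreg", "glu", "bp", "skin", "serum", "bmi", "ped", "age"]

# Values copied from the printed output above
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 3, 5, 6, 1, 1, 4]

# Features kept by RFE
selected = [name for name, keep in zip(feature_names, support) if keep]

# Eliminated features, earliest elimination (largest rank) first
eliminated = sorted(
    (pair for pair in zip(ranking, feature_names) if pair[0] > 1),
    reverse=True,
)
eliminated_order = [name for _, name in eliminated]

print("selected:  ", selected)
print("eliminated:", eliminated_order)
```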


#### Logistic Regression
Logistic regression is most often used for binary classification, but multiclass classification (via the so-called one-vs-all method) is also supported. An advantage of this algorithm is that it outputs, for each object, the probability of belonging to each class.
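Where those probabilities come from can be sketched in plain Python: a linear score is passed through the logistic (sigmoid) function. The coefficients, intercept, and feature values below are hypothetical and chosen only for illustration; with a fitted Scikit-Learn model the probabilities are available through `model.predict_proba(X)`.

```python
import math

def sigmoid(score):
    """Map any real-valued score onto the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients and intercept, for illustration only
weights = [0.5, -0.25]
intercept = 0.1
features = [2.0, 1.0]

# Linear score: intercept + sum of weight * feature
score = intercept + sum(w * x for w, x in zip(weights, features))

# Probability of the positive class, then a 0.5 threshold for the label
p_positive = sigmoid(score)
predicted_class = 1 if p_positive >= 0.5 else 0

print(score, p_positive, predicted_class)
```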


In [13]:


from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

          0       0.79      0.90      0.84       500
          1       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[448  52]
 [121 147]]
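The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading such reports. The sketch below copies the matrix printed above; in Scikit-Learn's layout, rows are true classes and columns are predicted classes. The helper name `precision_recall` is an assumption for illustration.

```python
# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
cm = [[448, 52],
      [121, 147]]

def precision_recall(cm, cls):
    """Precision and recall for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted_as_cls = sum(row[cls] for row in cm)  # column sum
    actually_cls = sum(cm[cls])                     # row sum
    return tp / predicted_as_cls, tp / actually_cls

# Accuracy: correct predictions (diagonal) over all predictions
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for cls in (0, 1):
    p, r = precision_recall(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.3f}")
```

The rounded results agree with the report: the model finds only about 55% of the diabetic patients (recall for class 1), even though its overall accuracy is around 77%.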



#### Naive Bayes
Naive Bayes is also one of the best-known machine learning algorithms. Its main task is to 
estimate the density of the data distribution from the training sample. 
This method often performs well in multiclass classification problems.


In [14]:

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))


GaussianNB(priors=None)
             precision    recall  f1-score   support

          0       0.80      0.84      0.82       500
          1       0.68      0.62      0.64       268

avg / total       0.76      0.76      0.76       768



In [15]:
print(metrics.confusion_matrix(expected, predicted))

[[421  79]
 [103 165]]



#### k-Nearest Neighbours
The kNN (k-Nearest Neighbors) method is often used as part of a more complex classification algorithm. 

For instance, we can use its estimate as a feature of an object. Sometimes a simple kNN performs very well on well-chosen features. 

When its parameters (mostly the distance metric) are set well, the algorithm often performs well in regression problems too.
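The idea behind kNN classification fits in a few lines of plain Python: find the k training points closest to the query and take a majority vote over their labels. This is a minimal sketch with made-up points and a hypothetical helper `knn_predict`, using the Euclidean distance; it is not how Scikit-Learn implements `KNeighborsClassifier` internally.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Pair every training point's distance to the query with its label
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training points forming two well-separated groups
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
           (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_y = [0, 0, 0, 1, 1, 1]

near_first_group = knn_predict(train_X, train_y, (1.2, 1.4))
near_second_group = knn_predict(train_X, train_y, (6.4, 6.2))
print(near_first_group, near_second_group)  # 0 1
```

Because the prediction depends entirely on distances, the choice of metric and the scaling of the features directly affect the result.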


In [56]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
             precision    recall  f1-score   support

          0       0.83      0.88      0.85       500
          1       0.75      0.65      0.70       268

avg / total       0.80      0.80      0.80       768

[[442  58]
 [ 93 175]]



#### Decision Trees
Classification and Regression Trees (CART) are often used in problems where objects have categorical features, and they handle both regression and classification. Trees are very well suited for multiclass classification.



In [57]:

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       500
          1       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
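The perfect scores above are measured on the same rows the tree was fitted on, so they say little about how the model behaves on unseen data. A minimal sketch of a hold-out evaluation follows; it uses synthetic stand-in data from `make_classification` (not the Pima data) so that it runs on its own, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Pima data): 768 samples, 8 features
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# An unrestricted tree can memorize its training rows
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy: ", test_accuracy)
```

The gap between the two numbers is a rough indication of overfitting; the same split could be applied to `X` and `y` from the earlier cells.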


#### Support Vector Machines
SVM (Support Vector Machines) is one of the most popular machine learning algorithms, used mainly for classification. Like logistic regression, SVM can be extended to multiclass classification with the help of the one-vs-all method (Scikit-Learn's `SVC` uses a one-vs-one scheme internally).


In [58]:

from sklearn import metrics
from sklearn.svm import SVC

# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))




SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       500
          1       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]



In addition to classification and regression algorithms, Scikit-Learn offers many other algorithms, including clustering, as well as techniques for building ensembles of algorithms, such as bagging and boosting.





#### How to Optimize Algorithm Parameters
One of the most difficult stages in building really effective models is choosing the right parameters. It usually gets easier with experience, but one way or another we have to search. Fortunately, Scikit-Learn provides ready-made tools for this purpose.

As an example, let’s take a look at the selection of the regularization parameter, in which several values are searched in turn:


In [59]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])

# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)

# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)


GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,   1.00000e-03,
         1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.279617559313
1.0


Sometimes it is more efficient to sample parameter values at random from a given range, estimate the algorithm's quality for each sampled value, and choose the best one.


In [60]:
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}

# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)

# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
          fit_params={}, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f0b36423780>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)
0.279617540491
0.9990375546835027
