# Python Scikit-Learn Cheat Sheet for Machine Learning

Scikit-learn is an open source Python library for machine learning. It provides a range of supervised and unsupervised learning algorithms, along with tools for preprocessing, cross-validation and visualization.

### A Basic Example
1. Load the data.
2. Divide the data into training and test sets.
3. Train a model on the training data using the KNN algorithm.
4. Predict the result.

In [2]:
import numpy as np
import sklearn as sk

from sklearn import neighbors, datasets, preprocessing

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score


In [3]:
# Loading the data

X = np.random.random((10, 5))
y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])  # one label per row of X
X[X < 0.7] = 0  # set every value below 0.7 to zero

In [4]:
iris = datasets.load_iris()

X,y = iris.data[:,:2], iris.target

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 33)

scaler = preprocessing.StandardScaler().fit(X_train)  # learn mean and std from the training set only

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


In [6]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

0.631578947368421

### Data Preprocessing

Standardization :

Standardization is a data preprocessing step that rescales one or more attributes so that each has a mean of 0 and a standard deviation of 1. It works best when the data is roughly Gaussian (bell curve) distributed.

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)  # fit on the training set only
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
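
To make the definition above concrete, here is a minimal sketch that checks the result of `StandardScaler` against the formula `(x - mean) / std` computed by hand. The array `toy_X` and its values are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features on very different scales
toy_X = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

toy_scaled = StandardScaler().fit_transform(toy_X)

# The same result by hand: subtract the column mean, divide by the column std
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(toy_scaled.mean(axis=0))          # approximately 0 for each column
print(toy_scaled.std(axis=0))           # approximately 1 for each column
print(np.allclose(toy_scaled, manual))  # both computations agree
```

After scaling, both columns are on the same footing even though the raw values differ by two orders of magnitude.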

Normalization :

Normalization is a technique commonly used in data preparation for machine learning. Its goal is to bring numeric values onto a common scale without distorting the differences in the ranges of values. Scikit-learn's Normalizer works per sample rather than per column: it rescales each row to unit norm (L2 by default).

In [8]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer().fit(X_train)
normalized_X = normalizer.transform(X_train)
normalized_X_test = normalizer.transform(X_test)
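
The row-wise behaviour of `Normalizer` is easy to miss, so the following sketch shows it on a tiny made-up array (`toy_X` is an assumption for illustration). Each row is divided by its own L2 norm, so rows pointing in the same direction end up identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: rows 0 and 2 point in the same direction
toy_X = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [6.0, 8.0]])

toy_normalized = Normalizer(norm="l2").fit_transform(toy_X)

# Length of every row after normalization
row_norms = np.linalg.norm(toy_normalized, axis=1)

print(toy_normalized)  # row 0 becomes [0.6, 0.8], the same as row 2
print(row_norms)       # every row now has length 1
```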

Binarization :
    
Binarization is a common operation on text count data. With it, the analyst can consider only the presence or absence of a feature rather than the number of times it occurs. Binarizer sets values greater than the threshold to 1 and all other values to 0.

In [9]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)
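
As a small illustration of the presence/absence idea, the sketch below binarizes a made-up matrix of word counts. The name `toy_counts` and its values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical word counts: 2 documents, 3 vocabulary terms
toy_counts = np.array([[0.0, 3.0, 1.0],
                       [2.0, 0.0, 0.0]])

# Values strictly greater than the threshold become 1, everything else 0
toy_binary = Binarizer(threshold=0.0).fit_transform(toy_counts)

print(toy_binary)
```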

Encoding Categorical Features :
    
LabelEncoder is another preprocessing class, used for encoding class labels. It transforms non-numerical labels into integers between 0 and n_classes - 1.

In [10]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
y = enc.fit_transform(y)
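
A minimal sketch of the round trip, assuming a short made-up list of string labels (`toy_labels` is hypothetical). The encoder sorts the distinct labels and numbers them in that order, and `inverse_transform` maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels
toy_labels = ["M", "F", "F", "M"]

toy_enc = LabelEncoder()
encoded = toy_enc.fit_transform(toy_labels)

# classes_ holds the sorted distinct labels; their positions are the codes
print(toy_enc.classes_)  # 'F' is class 0, 'M' is class 1
print(encoded)

# Map the integer codes back to the original labels
decoded = toy_enc.inverse_transform(encoded)
print(decoded)
```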

Imputing missing values :
    
The SimpleImputer class provides basic strategies for filling in missing values. It replaces them with the mean, the median or the most frequent value of the column in which they are located, or with a constant. The missing_values parameter sets which value is treated as missing. In older scikit-learn releases this class was called Imputer and lived in sklearn.preprocessing; it has since been removed in favor of sklearn.impute.SimpleImputer.

In [11]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=0, strategy='mean')  # treat zeros as missing; the old axis argument no longer exists
imp.fit_transform(X_train)



array([[-0.91090798, -1.59775374],
       [-1.0271058 ,  0.08448757],
       [ 0.59966379, -1.59775374],
       [ 0.01867465, -0.96691325],
       [ 0.48346596, -0.33607276],
       [-1.25950146,  0.29476773],
       [-1.37569929,  0.71532806],
       [-0.79471015, -1.17719341],
       [-1.14330363,  0.71532806],
       [ 2.45882905,  1.55644871],
       [-0.79471015,  0.71532806],
       [-0.79471015,  1.34616854],
       [-0.21372101, -0.33607276],
       [ 0.83205945, -0.1257926 ],
       [-0.44611666,  1.76672887],
       [ 1.41304859,  0.29476773],
       [ 0.01867465, -0.54635292],
       [ 2.22643339, -0.96691325],
       [-0.32991883, -1.17719341],
       [ 0.13487248,  0.29476773],
       [-1.0271058 ,  1.13588838],
       [-1.49189712, -1.59775374],
       [ 0.59966379, -0.54635292],
       [-1.60809495, -0.33607276],
       [-0.91090798,  1.13588838],
       [ 1.64544425, -0.1257926 ],
       [ 0.25107031,  0.71532806],
       [ 0.48346596, -1.8080339 ],
       [ 1.8778399 , ...
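
Because the standardized iris data above contains no missing entries, the effect of imputation is hard to see there. The sketch below uses a made-up array with `NaN` gaps instead, and assumes a recent scikit-learn release where `SimpleImputer` is available from `sklearn.impute`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing entry in each column
toy_X = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

toy_imp = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_filled = toy_imp.fit_transform(toy_X)

# Each gap is replaced by the mean of the observed values in its column:
# column 0 -> (1 + 3) / 2 = 2, column 1 -> (10 + 20) / 2 = 15
print(toy_filled)
```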

Generating Polynomial Features :
    
PolynomialFeatures generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form [a, b], then the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

In [12]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(5)
poly.fit_transform(X)

array([[1.00000000e+00, 5.10000000e+00, 3.50000000e+00, ...,
        1.11517875e+03, 7.65318750e+02, 5.25218750e+02],
       [1.00000000e+00, 4.90000000e+00, 3.00000000e+00, ...,
        6.48270000e+02, 3.96900000e+02, 2.43000000e+02],
       [1.00000000e+00, 4.70000000e+00, 3.20000000e+00, ...,
        7.23845120e+02, 4.92830720e+02, 3.35544320e+02],
       ...,
       [1.00000000e+00, 6.50000000e+00, 3.00000000e+00, ...,
        1.14075000e+03, 5.26500000e+02, 2.43000000e+02],
       [1.00000000e+00, 6.20000000e+00, 3.40000000e+00, ...,
        1.51084576e+03, 8.28528320e+02, 4.54354240e+02],
       [1.00000000e+00, 5.90000000e+00, 3.00000000e+00, ...,
        9.39870000e+02, 4.77900000e+02, 2.43000000e+02]])
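
The degree-5 output above is hard to read, so here is the degree-2 case from the description worked through on a single made-up sample with a = 2 and b = 3 (the sample is an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical sample [a, b] = [2, 3]
toy_X = np.array([[2.0, 3.0]])

toy_poly = PolynomialFeatures(degree=2)
toy_expanded = toy_poly.fit_transform(toy_X)

# Columns are [1, a, b, a^2, ab, b^2] = [1, 2, 3, 4, 6, 9]
print(toy_expanded)
```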

### Create Your Model

Supervised Learning Estimator :
    
Supervised learning is a type of machine learning in which a model is trained on labelled data so that it can predict outcomes for new, unseen data.

In [13]:
# Linear Regression Algorithm
from sklearn.linear_model import LinearRegression
lr = LinearRegression()  # the normalize argument was removed in recent scikit-learn releases; scale features beforehand if needed


# Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()


# KNN Algorithm
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)


# Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC(kernel='linear')

Unsupervised Learning Estimator :
    
Unsupervised learning is a type of machine learning in which a model finds structure in the data, such as clusters or lower-dimensional representations, without being trained on labelled data.

In [14]:
# Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)


# K Means Clustering Algorithm
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
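
A float value for `n_components`, as in `PCA(n_components=0.95)` above, asks PCA to keep as many components as are needed to explain that fraction of the variance. The sketch below checks this on the full four-feature iris data; the variable names are chosen for illustration only.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Keep enough components to explain at least 95% of the variance
pca_demo = PCA(n_components=0.95).fit(iris.data)

kept = pca_demo.n_components_                         # number of components retained
explained = pca_demo.explained_variance_ratio_.sum()  # variance explained by them
reduced = pca_demo.transform(iris.data)

print(kept, explained)
print(reduced.shape)  # one column per retained component
```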

Model Fitting :

Fitting trains a model by adjusting its parameters to the training data. A well-fitted model generalizes to data similar to that on which it was trained.

In [15]:
# For Supervised learning

lr.fit(X, y) #Fits data into the model
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)



# For Unsupervised Learning

k_means.fit(X_train)#Fits data into the model
pca_model = pca.fit_transform(X_train) #Fit to data, then transform it

Prediction :
    
Prediction applies a fitted model to new data to estimate labels, target values, class probabilities or cluster assignments.

In [16]:
# For Supervised learning
y_pred = svc.predict(np.random.random((2, 2)))  # predict labels; the input needs the same 2 features used in training

y_pred = lr.predict(X_test)  # predict target values
y_proba = knn.predict_proba(X_test)  # estimate the probability of each label


# For Unsupervised Learning
y_pred = k_means.predict(X_test)  # predict cluster labels
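
The difference between `predict` and `predict_proba` is easiest to see from the shapes they return. The following sketch rebuilds a small KNN classifier on the first two iris features so that it runs on its own; the names `clf`, `labels` and `proba` are chosen for illustration.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data[:, :2], iris.target, random_state=33)

clf = neighbors.KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

labels = clf.predict(X_te)       # one class label per test sample
proba = clf.predict_proba(X_te)  # one row per sample, one column per class

print(labels.shape, proba.shape)
print(proba[:3])                 # each row sums to 1
```

New data passed to either method must have the same number of features as the training data, here two.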

### Evaluating the model's performance

Regression Metrics :
    
The sklearn.metrics module implements several loss, score, and utility functions to measure regression performance.

In [None]:
# Mean Absolute Error
from sklearn.metrics import mean_absolute_error

y_true = [3, -0.5, 2]
y_pred = [2.5, 0.0, 2]  # illustrative predictions, one per value in y_true
mean_absolute_error(y_true, y_pred)


In [None]:
# Mean Squared Error
from sklearn.metrics import mean_squared_error

mean_squared_error(y_true, y_pred)


In [None]:
# R² Score
from sklearn.metrics import r2_score

r2_score(y_true, y_pred)
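
To show what these three scores measure, the sketch below computes each one with scikit-learn and again by hand with NumPy. The arrays `y_true_toy` and `y_pred_toy` are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true_toy, y_pred_toy)
mse = mean_squared_error(y_true_toy, y_pred_toy)
r2 = r2_score(y_true_toy, y_pred_toy)

# The same quantities by hand
residuals = y_true_toy - y_pred_toy
mae_manual = np.mean(np.abs(residuals))                   # average absolute error
mse_manual = np.mean(residuals ** 2)                      # average squared error
ss_res = np.sum(residuals ** 2)                           # residual sum of squares
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(mae, mse, r2)
```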

Classification Metrics :
    
The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance.

In [None]:
# Accuracy Score
y_pred = knn.predict(X_test)  # class labels predicted by the fitted KNN model
knn.score(X_test, y_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)


# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))


# Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
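
A confusion matrix is easier to read on a tiny example. In the sketch below, rows correspond to true classes and columns to predicted classes, and the accuracy equals the diagonal sum divided by the total count. The label lists are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for 3 classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

# Correct predictions sit on the diagonal of the matrix
acc_from_cm = cm.trace() / cm.sum()

print(cm)
print(acc, acc_from_cm)
```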

### Clustering Metrics

In [None]:
# Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
y_true = y_test  # ground-truth class labels
y_pred = k_means.predict(X_test)  # cluster labels from the fitted KMeans model
adjusted_rand_score(y_true, y_pred)

# Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

# V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)
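
Cluster labels are arbitrary integers, so these metrics compare groupings rather than label values. The sketch below, on made-up labels, shows that a clustering that matches the true grouping under different label names still gets a perfect score.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

# Hypothetical ground truth and a clustering with the same groups but renamed labels
labels_true = [0, 0, 1, 1, 2, 2]
labels_renamed = [1, 1, 2, 2, 0, 0]

ari = adjusted_rand_score(labels_true, labels_renamed)
hom = homogeneity_score(labels_true, labels_renamed)
vms = v_measure_score(labels_true, labels_renamed)

print(ari, hom, vms)  # all three are 1.0 for a perfect grouping
```

This is why plain accuracy is not suitable for scoring clusters against class labels.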

### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer releases

print(cross_val_score(knn, X_train, y_train, cv=4)) 
print(cross_val_score(lr, X, y, cv=2))
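
What `cross_val_score` does internally can be sketched with an explicit loop: split the data into folds, train on all but one fold, and score on the held-out fold. The example below assumes a shuffled 4-fold split with a fixed seed so that both approaches see the same folds; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
X_all, y_all = iris.data, iris.target

clf = neighbors.KNeighborsClassifier(n_neighbors=5)
folds = KFold(n_splits=4, shuffle=True, random_state=0)

# One call: returns one score per fold
scores = cross_val_score(clf, X_all, y_all, cv=folds)

# The same procedure written out by hand
manual_scores = []
for train_idx, test_idx in folds.split(X_all):
    clf.fit(X_all[train_idx], y_all[train_idx])
    manual_scores.append(clf.score(X_all[test_idx], y_all[test_idx]))

print(scores)
print(scores.mean())
```

When `cv` is given as an integer and the estimator is a classifier, scikit-learn uses stratified folds instead of the plain `KFold` shown here.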

### Tune your model

Grid Search :
    
GridSearchCV performs an exhaustive search over the specified parameter values for an estimator. It implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

In [None]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer releases

params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)

print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
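
The following self-contained sketch shows two points from the description: every combination in the grid is tried, and the fitted search object delegates `predict` to the best estimator it found. The grid values and the names `param_grid` and `grid_demo` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data[:, :2], iris.target, random_state=33)

# Hypothetical grid: 4 values x 2 values = 8 combinations
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
grid_demo = GridSearchCV(neighbors.KNeighborsClassifier(),
                         param_grid=param_grid, cv=4)
grid_demo.fit(X_tr, y_tr)

n_tried = len(grid_demo.cv_results_["params"])  # every combination is evaluated
print(n_tried, grid_demo.best_params_)

# With the default refit=True, predict() uses the best estimator
same = np.array_equal(grid_demo.predict(X_te),
                      grid_demo.best_estimator_.predict(X_te))
print(same)
```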

Randomized Parameter Optimization :
    
RandomizedSearchCV performs a random search over hyperparameters. In contrast to GridSearchCV, not all parameter values are tried out; instead, a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

In [None]:
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search was removed in newer releases

params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rsearch.fit(X_train, y_train)

print(rsearch.best_score_)
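
Sampling from a distribution, rather than from a fixed list, is where randomized search differs most from grid search. The sketch below assumes SciPy is installed and draws `n_neighbors` from `scipy.stats.randint`; the range, `n_iter` value and variable names are illustrative choices.

```python
from scipy.stats import randint
from sklearn import datasets, neighbors
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

# n_neighbors is drawn from the integers 1..9; weights from a fixed list
param_dist = {"n_neighbors": randint(1, 10), "weights": ["uniform", "distance"]}

rs_demo = RandomizedSearchCV(neighbors.KNeighborsClassifier(),
                             param_distributions=param_dist,
                             n_iter=5, cv=4, random_state=5)
rs_demo.fit(iris.data, iris.target)

tried = rs_demo.cv_results_["params"]  # exactly n_iter sampled settings
print(len(tried))
print(rs_demo.best_params_, rs_demo.best_score_)
```

Only five settings are evaluated here, although the search space has 18 possible combinations.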