# Scikit-Learn
Scikit-learn is an open-source **Python** library that implements a range of *machine learning*, *preprocessing*, *cross-validation* and *visualization algorithms* using a unified interface.

## Loading Data
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that can be converted to numeric arrays, such as pandas DataFrames, are also acceptable.

In [1]:
import numpy as np
X, y = np.arange(10).reshape((5, 2)), range(5)

In [2]:
print("X shape: ", X.shape)
print(X)

X shape:  (5, 2)
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]


In [3]:
print("y length: ", len(y))
print(y)

y length:  5
range(0, 5)
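
As a minimal sketch of the accepted input types, the following builds a dense NumPy array from a DataFrame and a SciPy sparse matrix from mostly-zero data. The variable names and toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A DataFrame with only numeric columns converts cleanly to a NumPy array.
df = pd.DataFrame({"feat0": [1.0, 2.0, 3.0], "feat1": [4.0, 5.0, 6.0]})
X_dense = df.to_numpy()

# Mostly-zero data can be stored compactly as a SciPy sparse matrix.
X_sparse = sparse.csr_matrix(np.array([[0, 0, 1], [2, 0, 0], [0, 0, 0]]))

print("dense:", X_dense.shape, X_dense.dtype)
print("sparse:", X_sparse.shape, "stored non-zeros:", X_sparse.nnz)
```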


## Training And Test Data
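train_test_split splits arrays or matrices into random train and test subsets. The test_size argument sets the fraction of samples held out for testing, and random_state fixes the shuffle so that the split is reproducible.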

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [5]:
X_train

array([[4, 5],
       [0, 1],
       [6, 7]])

In [6]:
y_train

[2, 0, 3]

In [7]:
X_test

array([[2, 3],
       [8, 9]])

In [8]:
y_test

[1, 4]
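
The following sketch checks two properties of the split above: calling train_test_split twice with the same random_state returns the same subsets, and each row of X stays paired with its label. It rebuilds the same toy X and y; the names split_a, split_b, and reproducible are introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = list(range(5))

# Two calls with the same random_state shuffle the data identically.
split_a = train_test_split(X, y, test_size=0.33, random_state=42)
split_b = train_test_split(X, y, test_size=0.33, random_state=42)
reproducible = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a
print("train size:", len(X_train), "test size:", len(X_test))
print("reproducible:", reproducible)

# In this toy data, row i of X is [2*i, 2*i + 1], so the first column
# of each training row must equal twice its label if pairs stayed aligned.
aligned = all(row[0] == 2 * label for row, label in zip(X_train, y_train))
print("rows and labels aligned:", aligned)
```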

In [9]:
import pandas as pd

In [10]:
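# Build a random example dataset: 10 samples, 4 features, 1 label column.
# No seed is set, so the values shown below will differ from run to run.
df_data = pd.DataFrame(data=np.random.randn(10,4),columns=['feat{}'.format(i) for i in range(4)])
df_label = pd.DataFrame(data=np.random.rand(10,1),columns=['label{}'.format(i) for i in range(1)])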

In [11]:
df_data

|   | feat0 | feat1 | feat2 | feat3 |
|---|---|---|---|---|
| 0 | 0.584704 | -0.624848 | -1.385444 | 0.986745 |
| 1 | -0.473472 | 0.106248 | -0.409133 | 0.34687 |
| 2 | -0.360423 | 1.491793 | 0.853144 | -0.183072 |
| 3 | -0.081842 | 0.189746 | -0.936664 | 0.516492 |
| 4 | -0.061321 | 1.316358 | -0.899885 | -0.349968 |
| 5 | -0.541016 | -1.221575 | -0.583535 | -1.075328 |
| 6 | 1.109273 | 2.010399 | -0.868175 | 1.461278 |
| 7 | -2.07223 | -1.063477 | 0.561686 | 0.394464 |
| 8 | 0.305869 | 0.598809 | -0.420476 | -0.56589 |
| 9 | 0.327999 | 0.863518 | -0.407569 | 0.009105 |


In [12]:
df_label

|   | label0 |
|---|---|
| 0 | 0.722533 |
| 1 | 0.060583 |
| 2 | 0.470904 |
| 3 | 0.092204 |
| 4 | 0.683375 |
| 5 | 0.018809 |
| 6 | 0.229178 |
| 7 | 0.709488 |
| 8 | 0.943038 |
| 9 | 0.218867 |


In [13]:
X_train, X_test, y_train, y_test = train_test_split(df_data, df_label, test_size=0.33, random_state=42)

In [14]:
X_train

|   | feat0 | feat1 | feat2 | feat3 |
|---|---|---|---|---|
| 7 | -2.07223 | -1.063477 | 0.561686 | 0.394464 |
| 2 | -0.360423 | 1.491793 | 0.853144 | -0.183072 |
| 9 | 0.327999 | 0.863518 | -0.407569 | 0.009105 |
| 4 | -0.061321 | 1.316358 | -0.899885 | -0.349968 |
| 3 | -0.081842 | 0.189746 | -0.936664 | 0.516492 |
| 6 | 1.109273 | 2.010399 | -0.868175 | 1.461278 |


In [15]:
y_train

|   | label0 |
|---|---|
| 7 | 0.709488 |
| 2 | 0.470904 |
| 9 | 0.218867 |
| 4 | 0.683375 |
| 3 | 0.092204 |
| 6 | 0.229178 |


In [16]:
X_test

|   | feat0 | feat1 | feat2 | feat3 |
|---|---|---|---|---|
| 8 | 0.305869 | 0.598809 | -0.420476 | -0.56589 |
| 1 | -0.473472 | 0.106248 | -0.409133 | 0.34687 |
| 5 | -0.541016 | -1.221575 | -0.583535 | -1.075328 |
| 0 | 0.584704 | -0.624848 | -1.385444 | 0.986745 |


In [17]:
y_test

|   | label0 |
|---|---|
| 8 | 0.943038 |
| 1 | 0.060583 |
| 5 | 0.018809 |
| 0 | 0.722533 |
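
The outputs above suggest that the split keeps the original row labels, so features and labels can still be matched by index. A minimal sketch that checks this, assuming a seeded generator (rng) in place of the unseeded calls in the notebook so that the run is repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seeded stand-in for the random DataFrames used in the notebook.
rng = np.random.default_rng(0)
df_data = pd.DataFrame(rng.standard_normal((10, 4)),
                       columns=["feat{}".format(i) for i in range(4)])
df_label = pd.DataFrame(rng.random((10, 1)), columns=["label0"])

X_train, X_test, y_train, y_test = train_test_split(
    df_data, df_label, test_size=0.33, random_state=42)

# Features and labels share the same (shuffled) index in each subset.
print(X_train.index.equals(y_train.index))
print(X_test.index.equals(y_test.index))

# Every original row lands in exactly one of the two subsets.
all_rows = sorted(X_train.index.tolist() + X_test.index.tolist())
print(all_rows == list(range(10)))
```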


## Preprocessing The Data
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on a dataset containing marginal outliers are highlighted in [Compare the effect of different scalers on data with outliers](http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py).
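
To illustrate the point about outliers, here is a small sketch comparing StandardScaler with RobustScaler (which centers on the median and scales by the interquartile range) on a single feature containing one extreme value. The toy data is an assumption made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature, four ordinary values and a single large outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

# The outlier inflates the mean and standard deviation, so the four
# ordinary values end up squeezed into a narrow range.
print("standard:", standard.ravel())

# Median and interquartile range are barely affected by the outlier,
# so the ordinary values keep a usable spread.
print("robust:  ", robust.ravel())

inlier_spread_standard = standard[:4].max() - standard[:4].min()
inlier_spread_robust = robust[:4].max() - robust[:4].min()
```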

### Standardization
Standardize features by removing the mean and scaling to unit variance. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The mean and standard deviation are then stored so that the transform method can apply them to later data, such as the test set.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with zero mean and unit variance).

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

In [19]:
X_train

|   | feat0 | feat1 | feat2 | feat3 |
|---|---|---|---|---|
| 7 | -2.07223 | -1.063477 | 0.561686 | 0.394464 |
| 2 | -0.360423 | 1.491793 | 0.853144 | -0.183072 |
| 9 | 0.327999 | 0.863518 | -0.407569 | 0.009105 |
| 4 | -0.061321 | 1.316358 | -0.899885 | -0.349968 |
| 3 | -0.081842 | 0.189746 | -0.936664 | 0.516492 |
| 6 | 1.109273 | 2.010399 | -0.868175 | 1.461278 |


In [20]:
standardized_X

array([[-1.95615249, -1.85572669,  1.16195424,  0.14454674],
       [-0.1773455 ,  0.68701986,  1.56292883, -0.82150698],
       [ 0.53802094,  0.06182412, -0.17149905, -0.50005002],
       [ 0.13346328,  0.51244481, -0.84880362, -1.10067622],
       [ 0.11213901, -0.60864584, -0.89940245,  0.34866399],
       [ 1.34987476,  1.20308374, -0.80517795,  1.92902249]])
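
The transformation can also be reproduced by hand, which makes it clear that the test set is scaled with the training statistics. This is a minimal sketch on seeded random data (an assumption, since the notebook data is unseeded); it relies on StandardScaler using the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(6, 4))
X_test = rng.normal(loc=5.0, scale=2.0, size=(4, 4))

scaler = StandardScaler().fit(X_train)

# Per-feature statistics computed on the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0, matching StandardScaler

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std  # test data reuses training statistics

print(np.allclose(manual_train, scaler.transform(X_train)))
print(np.allclose(manual_test, scaler.transform(X_test)))
print(np.allclose(mean, scaler.mean_), np.allclose(std, scaler.scale_))
```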

### Normalization
Normalize samples individually to unit norm. Each sample (i.e. each row of the data matrix) with at least one nonzero component is rescaled independently of other samples so that its norm (l1 or l2) equals one. Unlike standardization, which works column by column, normalization works row by row.

In [21]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)

In [22]:
X_train

|   | feat0 | feat1 | feat2 | feat3 |
|---|---|---|---|---|
| 7 | -2.07223 | -1.063477 | 0.561686 | 0.394464 |
| 2 | -0.360423 | 1.491793 | 0.853144 | -0.183072 |
| 9 | 0.327999 | 0.863518 | -0.407569 | 0.009105 |
| 4 | -0.061321 | 1.316358 | -0.899885 | -0.349968 |
| 3 | -0.081842 | 0.189746 | -0.936664 | 0.516492 |
| 6 | 1.109273 | 2.010399 | -0.868175 | 1.461278 |


In [23]:
normalized_X

array([[-0.85339745, -0.43796715,  0.2313166 ,  0.16245058],
       [-0.20415656,  0.84500558,  0.48325178, -0.10369855],
       [ 0.32485589,  0.8552443 , -0.40366386,  0.0090176 ],
       [-0.03753606,  0.80577472, -0.55084148, -0.21422373],
       [-0.0751254 ,  0.17417386, -0.85979354,  0.47410411],
       [ 0.38829354,  0.70372635, -0.30389865,  0.51151047]])
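
As a rough sketch of what row-wise normalization does, the following compares Normalizer with a manual division by each row's norm. The small matrix, including an all-zero row, is an assumption chosen to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.0, 0.0]])  # the all-zero row is left unchanged

l2 = Normalizer(norm="l2").fit_transform(X)
l1 = Normalizer(norm="l1").fit_transform(X)

# Manual l2 normalization: divide every row by its Euclidean length,
# substituting 1 for zero norms to avoid dividing by zero.
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0] = 1.0
manual_l2 = X / norms

print(l2)
print(l1)
print(np.allclose(l2, manual_l2))
```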

### Binarization
Binarization applies a boolean threshold to an array-like or scipy.sparse matrix: values greater than the threshold map to 1, and all other values map to 0.

In [24]:
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(df_data)
binary_X = binarizer.transform(df_data)

In [25]:
df_data

|   | feat0 | feat1 | feat2 | feat3 |
|---|---|---|---|---|
| 0 | 0.584704 | -0.624848 | -1.385444 | 0.986745 |
| 1 | -0.473472 | 0.106248 | -0.409133 | 0.34687 |
| 2 | -0.360423 | 1.491793 | 0.853144 | -0.183072 |
| 3 | -0.081842 | 0.189746 | -0.936664 | 0.516492 |
| 4 | -0.061321 | 1.316358 | -0.899885 | -0.349968 |
| 5 | -0.541016 | -1.221575 | -0.583535 | -1.075328 |
| 6 | 1.109273 | 2.010399 | -0.868175 | 1.461278 |
| 7 | -2.07223 | -1.063477 | 0.561686 | 0.394464 |
| 8 | 0.305869 | 0.598809 | -0.420476 | -0.56589 |
| 9 | 0.327999 | 0.863518 | -0.407569 | 0.009105 |


In [26]:
binary_X

array([[ 1.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.],
       [ 0.,  1.,  1.,  0.],
       [ 0.,  1.,  0.,  1.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  1.],
       [ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  1.]])
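
The thresholding rule can be sketched with a plain NumPy comparison. The toy matrix below is an assumption; it includes a value exactly equal to the threshold to show that the comparison is strict.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.5, -0.6, 0.0],
              [-0.4, 0.1, 2.0]])

binary = Binarizer(threshold=0.0).fit_transform(X)

# Equivalent NumPy expression: strictly greater than the threshold maps to 1.
manual = (X > 0.0).astype(X.dtype)

print(binary)
print(np.array_equal(binary, manual))
```

Raising the threshold, for example Binarizer(threshold=1.0), would leave only the value 2.0 mapped to 1.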