Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.
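
As a rough illustration of what "unified interface" means, the sketch below trains three different classifiers with the same `fit` and `score` calls. The variable names (`X_iris`, `estimators_demo`, `scores_demo`) are illustrative assumptions and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Load the full iris data and hold out a test split
X_iris, y_iris = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, random_state=33)

# Three unrelated algorithms, all exposing the same estimator API
estimators_demo = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naive_bayes": GaussianNB(),
    "svc": SVC(kernel="linear"),
}

scores_demo = {}
for name, est in estimators_demo.items():
    est.fit(Xtr, ytr)                        # same training call for every model
    scores_demo[name] = est.score(Xte, yte)  # mean accuracy for classifiers
    print(name, round(scores_demo[name], 3))
```

Because every estimator follows the same contract, swapping one model for another usually means changing a single line.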

In [112]:
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [190]:
import warnings
warnings.filterwarnings('ignore')  # suppress all warnings, including scikit-learn deprecation notices


In [191]:
iris=datasets.load_iris()

In [192]:

X, y = iris.data[:, :2], iris.target  # keep only the first two features (sepal length and width)

In [193]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

In [194]:
# Print the shapes of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (112, 2) (112,)
Testing set shape: (38, 2) (38,)


In [195]:
scaler = preprocessing.StandardScaler().fit(X_train)


In [196]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [197]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

In [198]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [199]:
y_pred = knn.predict(X_test)

In [200]:
accuracy_score(y_test, y_pred)


0.631578947368421

# Load the data

In [240]:
import numpy as np
X = np.random.random((10,5))
y = np.array(['M','M','F','F','M','F','M','M','F','F'])  # one label per row of X
X[X < 0.7] = 0

In [241]:
X

array([[0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.73712142, 0.        , 0.        , 0.        , 0.83619419],
       [0.        , 0.        , 0.86571997, 0.        , 0.7491379 ],
       [0.        , 0.        , 0.        , 0.        , 0.88651229],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.87451975, 0.        , 0.        ],
       [0.91054666, 0.        , 0.        , 0.        , 0.97018693]])

In [242]:
y

array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'], dtype='<U1')

## Preprocessing the Data 

### Standardization 

In [204]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

- Standardization - Many scikit-learn estimators expect standardized input and may behave badly if the individual features do not look roughly like standard normally distributed data: Gaussian with zero mean and unit variance.
- transform - Applies the standardization by centering and scaling each feature with the mean and standard deviation learned in `fit`. The scaler is fitted on the training set only and then applied to both the training and test sets.
- Scaling - A preprocessing step that puts the input features on a comparable scale. This matters because distance-based and gradient-based algorithms (k-nearest neighbors, SVMs, regularized linear models) are sensitive to feature magnitude, so a feature with a large numeric range can otherwise dominate the result.
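
A minimal sketch of what standardization does to the numbers, using a tiny hand-made array (`X_small` and `X_unseen` are illustrative names, not part of the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X_small = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0],
                    [4.0, 400.0]])
X_unseen = np.array([[2.5, 250.0]])  # plays the role of test data

scaler_small = StandardScaler().fit(X_small)  # learns per-column mean and std
X_small_std = scaler_small.transform(X_small)

print(X_small_std.mean(axis=0))  # approximately [0. 0.]
print(X_small_std.std(axis=0))   # approximately [1. 1.]

# Unseen data is scaled with the statistics learned from X_small
X_unseen_std = scaler_small.transform(X_unseen)
print(X_unseen_std)
```

After scaling, both columns contribute on the same footing even though the raw values differed by a factor of 100.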

### Normalization
- Normalization is the process of scaling individual samples (rows) to have unit norm. This is useful if you plan to use a quadratic form such as the dot product, or any other kernel, to quantify the similarity of a pair of samples.

This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.
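
To make the row-wise behaviour concrete, here is a small sketch with made-up vectors (`rows_demo` is an illustrative name). After L2 normalization, the dot product between two rows equals their cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rows_demo = np.array([[3.0, 4.0],
                      [1.0, 0.0],
                      [6.0, 8.0]])

# Each row is divided by its own Euclidean (L2) norm
unit_rows = Normalizer(norm="l2").fit_transform(rows_demo)
print(unit_rows)                          # first and last rows both become [0.6, 0.8]
print(np.linalg.norm(unit_rows, axis=1))  # every row now has norm 1

# Dot products of unit vectors are cosine similarities
cosine = unit_rows @ unit_rows.T
print(cosine[0, 2])                       # rows 0 and 2 point in the same direction
```

Unlike `StandardScaler`, `Normalizer` works per sample rather than per feature, so `fit` learns nothing from the data.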

In [205]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)

### Binarization
- Feature binarization is the process of thresholding numerical features to get boolean values.

In [206]:
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)
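
A short sketch of the thresholding rule on hand-picked values (`values_demo` is an illustrative name). Values strictly greater than the threshold map to 1, everything else to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

values_demo = np.array([[0.2, 0.9, 0.5],
                        [0.7, 0.1, 0.5]])

# Entries > 0.5 become 1.0; entries <= 0.5 (including 0.5 itself) become 0.0
flags_demo = Binarizer(threshold=0.5).fit_transform(values_demo)
print(flags_demo)
# [[0. 1. 0.]
#  [1. 0. 0.]]
```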

### Encoding Categorical Features
- Often features are not given as continuous values but categorical. For example a person could have features ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently coded as integers, for instance ["male", "from US", "uses Internet Explorer"] could be expressed as [0, 1, 3] while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1].

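- To convert categorical features to such integer codes, we can use the OrdinalEncoder. This estimator transforms each categorical feature to one new feature of integers (0 to n_categories - 1). The cell below uses LabelEncoder instead, which applies the same kind of coding (0 to n_classes - 1) to a one-dimensional array of target labels such as `y`.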

In [207]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)
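
The difference between the two encoders can be sketched as follows. The sample rows in `people_demo` are made up for illustration. Note that scikit-learn assigns codes in sorted order of the category values, so the exact integers can differ from the hand-written mapping in the text above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# OrdinalEncoder: 2-D feature matrix, one integer code per column
people_demo = np.array([
    ["male",   "from US",     "uses Safari"],
    ["female", "from Europe", "uses Firefox"],
    ["female", "from Asia",   "uses Chrome"],
])
ord_enc = OrdinalEncoder()
codes_demo = ord_enc.fit_transform(people_demo)
print(ord_enc.categories_)  # sorted categories found in each column
print(codes_demo)
# [[1. 2. 2.]
#  [0. 1. 1.]
#  [0. 0. 0.]]

# LabelEncoder: 1-D target array
label_enc = LabelEncoder()
labels_demo = label_enc.fit_transform(["M", "M", "F", "F", "M"])
print(labels_demo)                               # [1 1 0 0 1]
print(label_enc.inverse_transform(labels_demo))  # back to the original strings
```

Integer codes imply an ordering; for nominal features fed to linear or distance-based models, one-hot encoding is often the safer choice.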

### Imputing Missing Values
- The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located.
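
A small sketch of mean imputation with `NaN` as the missing-value marker (`X_missing` is a made-up array for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 10.0],
                      [np.nan, 20.0],
                      [3.0, np.nan]])

# Each NaN is replaced by the mean of the observed values in its column
imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer_demo.fit_transform(X_missing)
print(X_filled)
# [[ 1. 10.]
#  [ 2. 20.]
#  [ 3. 15.]]
print(imputer_demo.statistics_)  # the per-column fill values: [ 2. 15.]
```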

In [208]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=0, strategy='mean')  # treats entries equal to 0 (not NaN) as missing
imp.fit_transform(X_train)

array([[-0.91090798, -1.59775374],
       [-1.0271058 ,  0.08448757],
       [ 0.59966379, -1.59775374],
       [ 0.01867465, -0.96691325],
       [ 0.48346596, -0.33607276],
       [-1.25950146,  0.29476773],
       [-1.37569929,  0.71532806],
       [-0.79471015, -1.17719341],
       [-1.14330363,  0.71532806],
       [ 2.45882905,  1.55644871],
       [-0.79471015,  0.71532806],
       [-0.79471015,  1.34616854],
       [-0.21372101, -0.33607276],
       [ 0.83205945, -0.1257926 ],
       [-0.44611666,  1.76672887],
       [ 1.41304859,  0.29476773],
       [ 0.01867465, -0.54635292],
       [ 2.22643339, -0.96691325],
       [-0.32991883, -1.17719341],
       [ 0.13487248,  0.29476773],
       [-1.0271058 ,  1.13588838],
       [-1.49189712, -1.59775374],
       [ 0.59966379, -0.54635292],
       [-1.60809495, -0.33607276],
       [-0.91090798,  1.13588838],
       [ 1.64544425, -0.1257926 ],
       [ 0.25107031,  0.71532806],
       [ 0.48346596, -1.8080339 ],
       [ 1.8778399 ,

### Generating Polynomial Features
- Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

In [209]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)

array([[1.00000000e+00, 5.10000000e+00, 3.50000000e+00, ...,
        1.11517875e+03, 7.65318750e+02, 5.25218750e+02],
       [1.00000000e+00, 4.90000000e+00, 3.00000000e+00, ...,
        6.48270000e+02, 3.96900000e+02, 2.43000000e+02],
       [1.00000000e+00, 4.70000000e+00, 3.20000000e+00, ...,
        7.23845120e+02, 4.92830720e+02, 3.35544320e+02],
       ...,
       [1.00000000e+00, 6.50000000e+00, 3.00000000e+00, ...,
        1.14075000e+03, 5.26500000e+02, 2.43000000e+02],
       [1.00000000e+00, 6.20000000e+00, 3.40000000e+00, ...,
        1.51084576e+03, 8.28528320e+02, 4.54354240e+02],
       [1.00000000e+00, 5.90000000e+00, 3.00000000e+00, ...,
        9.39870000e+02, 4.77900000e+02, 2.43000000e+02]])

### Training and Test Data

In [210]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Create Your Model

In [211]:
# Supervised
# Linear regression

from sklearn.linear_model import LinearRegression
# The `normalize` argument was removed in scikit-learn 1.2; scale the features beforehand if needed
lr = LinearRegression()
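
The `normalize` argument of `LinearRegression` was deprecated in scikit-learn 1.0 and removed in 1.2. One possible replacement, sketched here under the assumption that standardizing the features is acceptable, is to chain a `StandardScaler` in front of the regressor. This is not numerically identical to the old `normalize=True` (which divided by the L2 norm), although for plain least squares the predictions do not depend on feature scaling. The synthetic data (`X_reg`, `y_reg`) is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noiseless linear data with features on very different scales
rng = np.random.default_rng(0)
X_reg = rng.random((50, 3)) * np.array([1.0, 100.0, 1000.0])
y_reg = X_reg @ np.array([2.0, 0.05, 0.001]) + 1.0

# Scaling and regression are fitted together as one estimator
lr_scaled = make_pipeline(StandardScaler(), LinearRegression())
lr_scaled.fit(X_reg, y_reg)

r2_train = lr_scaled.score(X_reg, y_reg)  # R^2 on the training data
print(r2_train)
```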


In [212]:
# Support Vector Machines (SVM)

from sklearn.svm import SVC
svc=SVC(kernel='linear')

In [213]:
# Naive Bayes

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [214]:
# KNN 

from sklearn import neighbors
knn=neighbors.KNeighborsClassifier(n_neighbors=5)

In [215]:
# Unsupervised
# Principal Component Analysis (PCA)

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)

In [216]:
# K Means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

## Model Fitting 

In [217]:
# Supervised

lr.fit(X,y)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

SVC(kernel='linear')

In [218]:
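# Unsupervised

k_means.fit(X_train)
# fit_transform returns the projected data, not the fitted model; the fitted estimator stays in `pca`
pca_model = pca.fit_transform(X_train)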

Principal Component Analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction. It applies an orthogonal transformation that converts observations of correlated features into a set of linearly uncorrelated features called principal components, ordered by how much of the data's variance each one captures. It is a popular tool for exploratory data analysis and predictive modeling. PCA exposes the strongest patterns in a dataset by keeping the directions of greatest variance and discarding the rest. With `n_components=0.95`, as set earlier, it keeps the smallest number of components whose cumulative explained variance exceeds 95%.
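
The two properties described above (variance retained, components uncorrelated) can be checked with a short sketch on the full four-feature iris data. The names `X_iris4`, `pca_demo` and `X_proj` are illustrative and separate from the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris4 = load_iris().data            # 150 samples, 4 correlated features

pca_demo = PCA(n_components=0.95)     # keep enough components for >95% variance
X_proj = pca_demo.fit_transform(X_iris4)

print(X_proj.shape)                        # fewer columns than the input
print(pca_demo.explained_variance_ratio_)  # variance share per kept component
kept_variance = pca_demo.explained_variance_ratio_.sum()

# Covariance between different components should be numerically zero
cov_proj = np.atleast_2d(np.cov(X_proj, rowvar=False))
off_diag = cov_proj - np.diag(np.diag(cov_proj))
print(np.abs(off_diag).max())
```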

## Prediction

In [219]:
# Supervised Estimators

In [220]:
# y_pred = svc.predict(np.random.random((2,5)))

In [221]:
y_pred = lr.predict(X_test)

In [222]:
y_pred = knn.predict_proba(X_test)  # class probabilities (one column per class), not labels

In [223]:
#unsupervised estimators

In [224]:
y_pred = k_means.predict(X_test)  # cluster indices, which need not match the class labels

## Evaluate Your Model's Performance

In [225]:
#Classification metrics

In [226]:
#accuracy score

In [227]:
knn.score(X_test, y_test)


0.631578947368421

In [228]:
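from sklearn.metrics import accuracy_score
# y_pred was last assigned from k_means.predict, so this compares cluster indices
# with class labels; use knn.predict(X_test) to score the classifier instead
accuracy_score(y_test, y_pred)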

0.5

In [229]:
# classification report 
from sklearn.metrics import classification_report 
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       0.25      0.36      0.30        11
           2       0.50      0.37      0.42        19

    accuracy                           0.50        38
   macro avg       0.58      0.58      0.57        38
weighted avg       0.53      0.50      0.51        38



In [230]:
# confusion matrix 
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

[[ 8  0  0]
 [ 0  4  7]
 [ 0 12  7]]
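
The numbers in the classification report can be recomputed from the confusion matrix alone. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and derives accuracy, precision, recall and F1 with plain NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[8, 0, 0],
               [0, 4, 7],
               [0, 12, 7]])

accuracy_cm = np.trace(cm) / cm.sum()        # correct predictions / all samples
precision_cm = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall_cm = np.diag(cm) / cm.sum(axis=1)     # per class: correct / truly that class
f1_cm = 2 * precision_cm * recall_cm / (precision_cm + recall_cm)

print(accuracy_cm)            # 0.5
print(precision_cm.round(2))  # [1.   0.25 0.5 ]
print(recall_cm.round(2))     # [1.   0.36 0.37]
print(f1_cm.round(2))         # [1.   0.3  0.42]
```

The row sums (8, 11, 19) are the `support` column of the report.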


## Regression Metrics

In [231]:
#mean absolute error 

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]
y_pred = [2.5, 0.0, 2]
mean_absolute_error(y_true, y_pred)

0.3333333333333333

In [232]:
# mean squared error 
 
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)

0.16666666666666666

In [233]:
# R2 score
# R2 is a measure of the goodness of fit of a model. 
# In regression, the R2 coefficient of determination is a statistical measure 
# of how well the regression predictions approximate the real data points. 
# An R2 of 1 indicates that the regression predictions perfectly fit the data.

from sklearn.metrics import r2_score 
r2_score(y_true, y_pred)

0.9230769230769231
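
As a cross-check, the three regression metrics above can be reproduced by hand from their definitions. The sketch reuses the same three values under new names (`y_true_reg`, `y_pred_reg`) so it does not overwrite the notebook's variables.

```python
import numpy as np

y_true_reg = np.array([3, -0.5, 2])
y_pred_reg = np.array([2.5, 0.0, 2])

residuals = y_true_reg - y_pred_reg           # [0.5, -0.5, 0.0]

mae_manual = np.mean(np.abs(residuals))       # (0.5 + 0.5 + 0) / 3
mse_manual = np.mean(residuals ** 2)          # (0.25 + 0.25 + 0) / 3

ss_res = np.sum(residuals ** 2)               # residual sum of squares = 0.5
ss_tot = np.sum((y_true_reg - y_true_reg.mean()) ** 2)  # total sum of squares = 6.5
r2_manual = 1 - ss_res / ss_tot

print(mae_manual)  # 0.3333...
print(mse_manual)  # 0.1666...
print(r2_manual)   # 0.9230...
```

R2 compares the model against a baseline that always predicts the mean of `y_true`, which is why it can become negative when a model does worse than that baseline.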

## Clustering Metrics

In [234]:
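# Adjusted Rand Index
# Note: y_true and y_pred still hold the regression values from the cells above.
# Every value is distinct, so each sample forms its own cluster and the
# clustering scores in this section are trivially 1.0.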

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)

1.0

In [235]:
# Homogeneity

from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

1.0

In [236]:
# V-measure

from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)

1.0
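
Clustering metrics are meant for discrete label assignments. The sketch below uses small made-up label lists to show two properties: the scores ignore how clusters are numbered, and homogeneity alone can be perfect even when a clustering splits every class apart.

```python
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             v_measure_score)

labels_true = [0, 0, 1, 1]
labels_swapped = [1, 1, 0, 0]  # same grouping, cluster ids renamed
labels_split = [0, 1, 2, 3]    # every sample in its own cluster

# Renaming clusters does not change the score
ari_swapped = adjusted_rand_score(labels_true, labels_swapped)
print(ari_swapped)  # 1.0

# Each cluster contains a single class, so homogeneity is perfect ...
hom_split = homogeneity_score(labels_true, labels_split)
print(hom_split)    # 1.0

# ... but classes are scattered across clusters, so V-measure is penalized
v_split = v_measure_score(labels_true, labels_split)
print(v_split)
```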

## Cross-validation 


In [237]:
from sklearn.model_selection import cross_val_score

print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))  # default scorer for a regressor is R2, which can be negative

[0.85714286 0.75       0.85714286 0.89285714]
[-4.31567384 -1.89773191]
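
When preprocessing is part of the workflow, one common pattern is to wrap the scaler and the model in a pipeline so that the scaler is refitted on the training portion of each fold. The sketch below is one way to do this on the full iris data; the names `knn_pipe` and `cv_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

# The scaler is fitted inside each fold, so no test-fold statistics leak in
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

cv_scores = cross_val_score(knn_pipe, X_cv, y_cv, cv=4)  # accuracy per fold
print(cv_scores)
print(cv_scores.mean())
```

For classifiers with an integer `cv`, scikit-learn uses stratified folds, so each fold keeps roughly the same class proportions.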


## Tune Your Model
#### Grid Search

In [238]:
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn,param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)

0.8126482213438736
2
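
Grid search also works on a pipeline. In that case parameter names are prefixed with the step name and a double underscore. The sketch below assumes the step name `kneighborsclassifier`, which is what `make_pipeline` derives from the class name; the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = load_iris(return_X_y=True)
pipe_gs = make_pipeline(StandardScaler(), KNeighborsClassifier())

# 4 x 2 = 8 candidate settings, each evaluated with 4-fold cross-validation
param_grid_demo = {
    "kneighborsclassifier__n_neighbors": [1, 3, 5, 7],
    "kneighborsclassifier__metric": ["euclidean", "manhattan"],
}
grid_demo = GridSearchCV(pipe_gs, param_grid_demo, cv=4)
grid_demo.fit(X_gs, y_gs)

print(grid_demo.best_params_)  # the winning combination
print(grid_demo.best_score_)   # its mean cross-validated accuracy
```

`best_params_` reports every tuned parameter at once, which is usually more informative than reading a single attribute from `best_estimator_`.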


#### Randomized Parameter Optimization

In [239]:
from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn,
   param_distributions=params,
   cv=4,
   n_iter=8,
   random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)

0.8392857142857142
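
Randomized search can also sample from a distribution instead of a fixed list, which is where it differs most from grid search. The sketch below assumes SciPy is installed and draws `n_neighbors` from `scipy.stats.randint(1, 15)`, an arbitrary range picked for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_rs, y_rs = load_iris(return_X_y=True)

# randint(1, 15) samples integers from 1 to 14 inclusive
param_dist_demo = {
    "n_neighbors": randint(1, 15),
    "weights": ["uniform", "distance"],
}
rsearch_demo = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=param_dist_demo,
    n_iter=8,        # number of sampled settings, independent of the search space size
    cv=4,
    random_state=5,  # makes the sampled settings reproducible
)
rsearch_demo.fit(X_rs, y_rs)

print(rsearch_demo.best_params_)
print(rsearch_demo.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which makes this approach practical when there are many hyperparameters.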
