# ScikitLearn - Introduction

- Scikit-learn (sklearn) is one of the most widely used and robust libraries for machine learning in Python.
- It contains powerful tools for machine learning and statistical modelling.
- These tools cover classification, regression, clustering and dimensionality reduction, all exposed through a consistent Python interface.
- The library is written largely in Python and is built on top of NumPy, SciPy and Matplotlib.
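
To illustrate what a consistent interface means in practice, here is a minimal sketch that drives two unrelated model families through the same `fit`, `predict` and `score` calls. The choice of `LogisticRegression` and `DecisionTreeClassifier`, and the `max_iter` and `random_state` values, are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset: X is the feature matrix, y holds the labels.
X, y = load_iris(return_X_y=True)

# Two different model families (illustrative choices, not a recommendation).
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

train_accuracy = {}
for model in models:
    model.fit(X, y)                  # every estimator learns via fit()
    predictions = model.predict(X)   # every classifier predicts via predict()
    # score() returns the mean accuracy on the data it is given
    train_accuracy[type(model).__name__] = model.score(X, y)

print(train_accuracy)
```

Both models are scored on the data they were trained on, so these numbers only show that the calls are interchangeable; they say nothing about performance on unseen data.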

# ScikitLearn - Features

- Scikit-learn focuses on modelling data rather than on loading, manipulating and summarising it.
- The popular groups of models provided by sklearn are:
    1. Supervised Learning Algorithms - Algorithms such as linear regression, Support Vector Machine (SVM) and Decision Tree are part of scikit-learn.
    2. Unsupervised Learning Algorithms - It includes unsupervised learning algorithms such as clustering, factor analysis and PCA (Principal Component Analysis).

- Clustering - Used to group unlabelled data.
- Cross Validation - Used to estimate the accuracy of supervised models on unseen data.
- Dimensionality Reduction - Used to reduce the number of attributes in the data, which helps with summarisation, visualisation and feature selection.
- Ensemble Methods - Used to combine the predictions of multiple supervised models.
- Feature Extraction - Used to extract features from data, for example to define the attributes of an image.
- Feature Selection - Used to identify the useful attributes for building supervised models.
- Open Source - It is an open source library available under the BSD license.
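
Several of the feature groups listed above can be tried in a few lines each. The following is a minimal sketch on the iris dataset; the specific estimators (`KMeans`, `PCA`, `SelectKBest`, `RandomForestClassifier`) and all parameter values are assumptions chosen for illustration, and feature extraction is left out because iris is already numeric.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Clustering: group the samples into 3 clusters without using the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 attributes onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 attributes most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Ensemble method + cross validation: score a random forest on 5 folds,
# each fold being held out once while the model trains on the rest.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
cv_scores = cross_val_score(forest, X, y, cv=5)

print(cluster_labels.shape, X_2d.shape, X_selected.shape)
print("Mean CV accuracy:", cv_scores.mean())
```

Note that clustering and PCA are fitted without `y`, while feature selection and the random forest need the labels; that split mirrors the unsupervised and supervised groups in the list.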

# ScikitLearn - Modelling Process

This section covers the modelling process in sklearn, starting with loading a dataset and naming its parts.
1. Dataset Loading - A dataset is a collection of data.
2. Features - The variables of the data are called its features. They are also known as predictors, inputs or attributes.
3. Feature matrix - The feature matrix is the collection of features, stored as a 2D array with one row per sample and one column per feature.
4. Feature names - The list of the names of all the features.

#### Example

```python
from sklearn.datasets import load_iris

# Load the iris dataset bundled with scikit-learn
iris = load_iris()
x = iris.data    # feature matrix, shape (150, 4)
y = iris.target  # class label (0, 1 or 2) for each sample
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of x:\n", x[:10])
```

```
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 10 rows of x:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
```


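The data can then be split into a training set and a test set. Here `test_size = 0.3` holds out 30% of the 150 samples (45 rows) for testing, and `random_state = 1` makes the split reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Reload the data so this cell runs on its own
iris = load_iris()
x = iris.data
y = iris.target

# Hold out 30% of the rows for testing; fix the seed for a repeatable split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
```

```
(105, 4)
(45, 4)
(105,)
(45,)
```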
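
A natural next step after the split is to fit a model on the training rows and check it on the held-out rows. The sketch below is one possible continuation, not part of the original walkthrough: the choice of `KNeighborsClassifier` and `n_neighbors=3` is an assumption made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Recreate the same 70/30 split used in the example
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

# Illustrative model choice (assumption): k-nearest neighbours with k=3
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train, y_train)      # learn from the 105 training rows only

y_pred = classifier.predict(x_test)   # predict labels for the 45 unseen rows
test_accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", test_accuracy)
```

Because the test rows were never seen during `fit`, the reported accuracy is an estimate of how the model behaves on new data, which is the reason for splitting in the first place.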
