# Scikit-Learn

Scikit-learn is a widely used open-source machine learning library for Python.

Use it for fully and semi-automated data analysis and information extraction.

Scikit-learn provides:

    - Tools to identify, organize and solve real-life problems
    - Free downloadable datasets
    - Algorithms that learn from data and make predictions
    - Model support for a wide range of problem types
    - Model persistence
    - Open-source community and vendor support
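
Model persistence means saving a fitted estimator so it can be reused without retraining. The following is a minimal sketch, assuming Python's standard `pickle` module and the iris dataset bundled with scikit-learn; both choices are illustrative and not part of the original walkthrough.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load a small dataset that ships with scikit-learn (illustrative choice).
X, y = load_iris(return_X_y=True)

# Fit a model, serialize it to bytes, then restore it.
model = DecisionTreeClassifier(random_state=0).fit(X, y)
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.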

Use it to build machine learning models for tasks such as:
    
    - Classification
    - Regression
    - Clustering

The scikit-learn library includes various algorithms, such as:

    - K-means
    - K-nearest neighbors
    - Support vector machine (SVM)
    - Decision trees
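
Each of the algorithms listed above is exposed as an estimator class. The sketch below shows one plausible way to instantiate them; the hyperparameter values are assumptions chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hyperparameter values below are illustrative assumptions.
estimators = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support vector machine": SVC(kernel="rbf"),
    "Decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

for name, est in estimators.items():
    # Every estimator exposes a fit() method.
    print(name, "->", type(est).__name__, hasattr(est, "fit"))
```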

## Problem-solution approach:

    - Selecting a model
    - Finding an estimator object
    - Training the model
    - Making predictions
    - Tuning the model
    - Finding accuracy
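
The steps above can be sketched end to end as follows. This is only an illustration: the iris dataset, the k-nearest neighbors classifier, and the parameter grid are assumptions, not choices made in the original text.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Steps 1-2: select a model and create its estimator object.
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: train the model.
knn.fit(X_train, y_train)

# Step 4: make predictions on unseen data.
y_pred = knn.predict(X_test)

# Step 5: tune the model with a small, illustrative grid and 5-fold CV.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

# Step 6: measure accuracy on the held-out test set.
baseline_acc = accuracy_score(y_test, y_pred)
tuned_acc = accuracy_score(y_test, search.predict(X_test))
print(baseline_acc, tuned_acc, search.best_params_)
```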

### Points to be considered while working with a scikit-learn dataset or loading data to scikit-learn:

    - Create separate objects for features and responses
    - Ensure features and responses only have numeric values
    - Verify that the features and responses are NumPy ndarrays (or array-like objects such as pandas DataFrames)
    - Ensure the features and responses have matching shapes, with the same number of observations (rows) in each
    - By convention, name the features X and the responses y
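
A minimal sketch of these checks, using a hypothetical helper named `check_features_and_response` and made-up toy data. Scikit-learn performs its own validation when `fit` is called, so this is only meant to make the requirements concrete.

```python
import numpy as np

# Hypothetical toy data: 4 observations with 2 features each.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])


def check_features_and_response(X, y):
    """Return True if X and y meet the basic requirements listed above."""
    if not (isinstance(X, np.ndarray) and isinstance(y, np.ndarray)):
        return False
    both_numeric = np.issubdtype(X.dtype, np.number) and np.issubdtype(y.dtype, np.number)
    same_n_samples = X.shape[0] == y.shape[0]
    # Features are expected as a 2-D array: (n_samples, n_features).
    return bool(both_numeric and X.ndim == 2 and same_n_samples)


ok = check_features_and_response(X, y)
mismatched = check_features_and_response(X, y[:3])  # one response missing
print(ok, mismatched)
```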

Some popular groups of functionality provided by scikit-learn:

    - Clustering
    - Cross-validation
    - Ensemble methods
    - Feature extraction
    - Feature selection 
    - Parameter tuning
    - Supervised learning algorithms
    - Unsupervised learning algorithms
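
As one example of how these pieces combine, the sketch below evaluates an ensemble method with cross-validation. The random forest, the iris dataset, and the fold count are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# An ensemble method (random forest) evaluated with 5-fold cross-validation.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)

# One accuracy score per fold, plus their mean.
print(scores.shape, scores.mean())
```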

The sklearn.preprocessing package includes common utility functions and transformer classes that change raw feature vectors into a representation better suited to downstream estimators. These include:
    
    - Standardization (mean removal and variance scaling)
    - Normalization
    - Imputation of missing values
    - Encoding categorical features
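
The four operations above can be sketched on made-up data as follows. Note that in recent scikit-learn releases the imputer lives in `sklearn.impute` rather than `sklearn.preprocessing`; the arrays and strategy choices here are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, normalize

# Hypothetical numeric features with one missing value.
X_num = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0]])

# Imputation: replace the missing value with its column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_num)

# Standardization: each column gets zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

# Normalization: each row is rescaled to unit L2 norm.
X_normed = normalize(X_imputed, norm="l2")

# Encoding: one binary column per distinct category.
colors = np.array([["red"], ["green"], ["red"]])
X_encoded = OneHotEncoder().fit_transform(colors).toarray()

print(X_imputed[1, 1], X_scaled.mean(axis=0), X_encoded.shape)
```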

In [1]:
import pandas as pd

In [3]:
data = pd.read_csv('Advertising.csv', index_col=0)
data.head()

      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9


In [4]:
# dataset size: total number of elements (rows x columns)
data.size

800

In [5]:
#dataset shape
data.shape

(200, 4)

In [7]:
#dataset columns
data.columns

Index(['TV', 'Radio', 'Newspaper', 'Sales'], dtype='object')

In [11]:
#create a feature object from the columns
X_feature = data[['Newspaper', 'Radio', 'TV']]

#view feature object
X_feature.head()

   Newspaper  Radio     TV
1       69.2   37.8  230.1
2       45.1   39.3   44.5
3       69.3   45.9   17.2
4       58.5   41.3  151.5
5       58.4   10.8  180.8


In [22]:
X_feature.shape  # (200, 3): 200 observations (rows) and 3 feature columns

(200, 3)

In [23]:
# create the target object from the Sales column, which is the response in the dataset
Y_target = data[['Sales']]

#view the target object
Y_target.head()

   Sales
1   22.1
2   10.4
3    9.3
4   18.5
5   12.9


In [24]:
Y_target.shape  # (200, 1): 200 observations (rows) and 1 column

(200, 1)

In [25]:
#split test and training data
#by default 75% training data and 25% testing data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_feature, Y_target, random_state=1)

In [28]:
#view shape of train and test data sets for both feature and response
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(150, 3)
(150, 1)
(50, 3)
(50, 1)
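
The 150/50 split follows from the default `test_size` of 0.25. The sketch below reproduces those shapes on a synthetic stand-in for the advertising data (the values are placeholders, not the real dataset) and shows an explicit 80/20 split as an illustrative alternative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the advertising data:
# 200 rows, 3 feature columns, 1 response column. Values are placeholders.
X = np.arange(600).reshape(200, 3)
y = np.arange(200).reshape(200, 1)

# Default split: 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Explicit 80/20 split (an illustrative alternative).
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)
print(X_tr80.shape, X_te20.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.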


In [31]:
#linear regression model
from sklearn.linear_model import LinearRegression

In [32]:
# create and fit a linear regression model that predicts sales for new data
linreg = LinearRegression()
linreg.fit(x_train, y_train)

In [34]:
# print the intercept and coefficients (coefficients follow the X_feature column order: Newspaper, Radio, TV)
print(linreg.intercept_)
print(linreg.coef_)

[2.87696662]
[[0.00345046 0.17915812 0.04656457]]
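
The fitted intercept and coefficients define a linear equation for Sales. As a worked example, the sketch below applies the printed values by hand to the first row of the dataset shown earlier; it uses plain Python rather than `linreg.predict`, and the dictionary names are illustrative.

```python
# Intercept and coefficients as printed above, in the feature order
# used for X_feature: Newspaper, Radio, TV.
intercept = 2.87696662
coef = {"Newspaper": 0.00345046, "Radio": 0.17915812, "TV": 0.04656457}

# First row of the dataset shown earlier.
row = {"Newspaper": 69.2, "Radio": 37.8, "TV": 230.1}

# Linear model: Sales = intercept + sum(coefficient * feature value)
predicted_sales = intercept + sum(coef[name] * row[name] for name in coef)
print(round(predicted_sales, 2))
```

The result is roughly 20.6, compared with the recorded Sales value of 22.1 for that row.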


In [41]:
#prediction
y_pred = linreg.predict(x_test)
y_pred

array([[21.70910292],
       [16.41055243],
       [ 7.60955058],
       [17.80769552],
       [18.6146359 ],
       [23.83573998],
       [16.32488681],
       [13.43225536],
       [ 9.17173403],
       [17.333853  ],
       [14.44479482],
       [ 9.83511973],
       [17.18797614],
       [16.73086831],
       [15.05529391],
       [15.61434433],
       [12.42541574],
       [17.17716376],
       [11.08827566],
       [18.00537501],
       [ 9.28438889],
       [12.98458458],
       [ 8.79950614],
       [10.42382499],
       [11.3846456 ],
       [14.98082512],
       [ 9.78853268],
       [19.39643187],
       [18.18099936],
       [17.12807566],
       [21.54670213],
       [14.69809481],
       [16.24641438],
       [12.32114579],
       [19.92422501],
       [15.32498602],
       [13.88726522],
       [10.03162255],
       [20.93105915],
       [ 7.44936831],
       [ 3.64695761],
       [ 7.22020178],
       [ 5.9962782 ],
       [18.43381853],
       [ 8.39408045],
       [14

In [42]:
# evaluate the model with the root mean squared error (RMSE), the square root of the MSE
from sklearn import metrics
import numpy as np

In [46]:
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.4046514230328944
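
The value above is the root mean squared error: the square root of the mean of the squared differences between actual and predicted values. A minimal sketch on made-up numbers (not the advertising data) shows how the manual calculation relates to `metrics.mean_squared_error`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values.
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

# MSE: mean of the squared errors, here (1 + 0 + 4) / 3.
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = metrics.mean_squared_error(y_true, y_hat)

# RMSE: square root of the MSE, in the same units as the target.
rmse = np.sqrt(mse_sklearn)
print(mse_manual, mse_sklearn, rmse)
```

Because RMSE is in the same units as the response, the 1.40 reported above can be read directly against the Sales values in the dataset.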
