# Linear Regression using the 'sklearn' library

**scikit-learn** (imported as **sklearn**) is an open-source Python library for machine learning. It is easy to use and well suited to simple modelling tasks such as the one shown here.

Documentation: https://scikit-learn.org/stable/

Everything we need is included in the **sklearn** library; we will import each function or class at the point where it is used.

First, the dataset must be loaded into our working environment. Here, we use a simple dataset for demonstration.

In [1]:
import pandas as pd
import numpy as np

# Load the salary dataset; a forward slash avoids the invalid '\S' escape sequence
salary_data = pd.read_csv('datasets/Salary_Data.csv')
salary_data.head()

| | YearsExperience | Salary |
|---|---|---|
| 0 | 1.1 | 39343 |
| 1 | 1.3 | 46205 |
| 2 | 1.5 | 37731 |
| 3 | 2.0 | 43525 |
| 4 | 2.2 | 39891 |


Then the data should be divided into *independent* and *dependent* variables. Here, **YearsExperience** is the independent variable and **Salary** is the dependent variable.

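In [3]:
# X: every column except the last (YearsExperience), kept 2-D for sklearn
X = salary_data.iloc[:, :-1].values
# y: the second column (Salary), as a 1-D array
y = salary_data.iloc[:, 1].values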
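As a quick check on the slicing above, the sketch below rebuilds the first five rows shown earlier as a small DataFrame (typed in by hand, purely for illustration; `sample`, `X_sample` and `y_sample` are made-up names) and prints the shapes of the two arrays. scikit-learn estimators expect `X` to be two-dimensional, with one row per sample and one column per feature, while `y` can stay one-dimensional.

```python
import pandas as pd

# The first five rows from the output above, entered by hand for illustration.
sample = pd.DataFrame({
    'YearsExperience': [1.1, 1.3, 1.5, 2.0, 2.2],
    'Salary': [39343, 46205, 37731, 43525, 39891],
})

X_sample = sample.iloc[:, :-1].values  # every column except the last
y_sample = sample.iloc[:, 1].values    # the second column (Salary)

print(X_sample.shape)  # (5, 1): 2-D, one column per feature
print(y_sample.shape)  # (5,): 1-D vector of targets
```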

The next step is to divide the data into **training** and **testing** sets. For this, we can use the **train_test_split** function from **sklearn.model_selection**.

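The **test_size** parameter sets the fraction of the original data set that is held out as *test data*. For example, **test_size = 0.2** would hold out 20% of the data; the cell below uses **test_size = 0.35**, so 35% of the data set is used for testing. Setting **random_state** fixes the shuffle, so the same split is produced on every run.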

In [8]:
from sklearn.model_selection import train_test_split

# Hold out 35% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=0)
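To see how **test_size** and **random_state** behave, here is a minimal sketch on stand-in data. The 30-row array is an assumption made only for this illustration (it is not the salary data), and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 30 rows, one feature (not the salary data).
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.arange(30)

# Print the train/test sizes produced by two different test_size values.
for size in (0.2, 0.35):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=size, random_state=0)
    print(size, len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=0)
print(np.array_equal(X_te, X_te_again))
```

Changing or omitting `random_state` would shuffle the rows differently, so the metrics computed later could vary from run to run.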

For this data set, there is no need to standardize the data, so we can move on to creating the model. The **Linear Regression model** is available in the *sklearn* library as **LinearRegression**, in **sklearn.linear_model**.

Here, we create an object of the model and fit it to the training data; the same object is then used for prediction.

In [9]:
from sklearn.linear_model import LinearRegression

# Create the model object and fit it to the training data
lr_object = LinearRegression()
lr_object.fit(X_train, y_train)

LinearRegression()
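Once a model is fitted, the learned line can be read from its `coef_` (slope) and `intercept_` attributes. The sketch below shows this on hypothetical, noise-free data generated from y = 2x + 1, so the recovered values are easy to check; `demo_model` and the `*_demo` arrays are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data generated from y = 2x + 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# coef_ holds one slope per feature; intercept_ is the constant term.
print('slope:    ', demo_model.coef_[0])    # close to 2
print('intercept:', demo_model.intercept_)  # close to 1
```

On the salary model, the same attributes are available as `lr_object.coef_` and `lr_object.intercept_`.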

In [10]:
# Predict salaries for the held-out test set
y_pred = lr_object.predict(X_test)
y_pred

array([ 40551.75603614, 123166.57867471,  64960.68090662,  63083.0713012 ,
       115656.14025302, 108145.70183133, 116594.94505573,  64021.87610391,
        76226.33853916, 100635.26340965,  53695.02327409])
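A raw array of predictions is hard to judge on its own, so one option is to place it next to the true values. The sketch below does this with made-up numbers standing in for `y_test` and `y_pred`; the values and the `comparison` DataFrame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for y_test and y_pred.
y_test_demo = np.array([40000.0, 60000.0, 110000.0])
y_pred_demo = np.array([41500.0, 58000.0, 112500.0])

comparison = pd.DataFrame({
    'Actual': y_test_demo,
    'Predicted': y_pred_demo,
    'Residual': y_test_demo - y_pred_demo,  # actual minus predicted
})
print(comparison)
```

A negative residual means the model predicted too high for that row; a positive one means it predicted too low.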

In [11]:
from sklearn import metrics

# MAE and MSE on the test set; RMSE is the square root of the MSE
print('MAE: ', metrics.mean_absolute_error(y_test, y_pred))
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
print('RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

MAE:  3352.0271062423035
MSE:  19421097.588238075
RMSE:  4406.9374...
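For reference, the usual error measures can be written out by hand: the mean absolute error (MAE) averages the absolute errors, the mean squared error (MSE) averages the squared errors, and the root mean squared error (RMSE) is the square root of the MSE. A minimal sketch on three made-up values (the `*_demo` arrays are hypothetical), cross-checked against `sklearn.metrics`:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values, chosen to be easy to check by hand.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

errors = y_true_demo - y_pred_demo   # [1, 0, -2]
mae = np.mean(np.abs(errors))        # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)           # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                  # square root of the MSE, not of the MAE

print('MAE: ', mae)
print('MSE: ', mse)
print('RMSE:', rmse)

# The hand-written versions agree with sklearn's implementations.
print(np.isclose(mae, metrics.mean_absolute_error(y_true_demo, y_pred_demo)))
print(np.isclose(mse, metrics.mean_squared_error(y_true_demo, y_pred_demo)))
```

Because the RMSE is in the same units as the target, it is often easier to interpret than the MSE.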
