## Scikit-learn
* A Python library for machine learning
* It is characterized by a clean, uniform, and streamlined API
* Steps to follow when building an ML model:
1. Loading the data
2. Preprocessing the data
3. Training the model
4. Testing the model
5. Evaluating model performance
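
As a rough orientation, the five steps can be sketched end to end on synthetic data. This is a minimal sketch, not the notebook's own workflow: `make_regression` stands in for the CSV file, and the variable names (`X_train_s`, `model`, `score`) are chosen here for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load the data (synthetic stand-in, an assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# 2. Preprocess: split, then fit the scaler on the training part only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 3. Train the model
model = LinearRegression().fit(X_train_s, y_train)

# 4. Test the model on data it has not seen
y_pred = model.predict(X_test_s)

# 5. Evaluate model performance
score = r2_score(y_test, y_pred)
print(round(score, 3))
```

Every estimator and transformer above follows the same `fit` / `transform` / `predict` pattern, which is what the "uniform API" point refers to.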

### Loading the data

In [2]:
import pandas as pd
df = pd.read_csv('autompg1627540133031.csv')
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


### Data preprocessing

#### 1. Data properties

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


#### 2. Dropping null values

In [4]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   model_year    392 non-null    int64  
 7   origin        392 non-null    object 
 8   name          392 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 30.6+ KB
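
To see which columns contain nulls before dropping them, `isna().sum()` gives a per-column count. The sketch below uses a small made-up frame (`toy` is hypothetical, not the auto-mpg data) and also shows why the index above still runs from 0 to 397 after dropping rows: `dropna` keeps the original row labels.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing horsepower value
toy = pd.DataFrame(
    {"mpg": [18.0, 15.0, 25.0], "horsepower": [130.0, np.nan, 95.0]}
)

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = toy.dropna()

# The surviving rows keep their original labels (0 and 2)
print(list(cleaned.index))
```

If a contiguous index is preferred, `cleaned.reset_index(drop=True)` renumbers the rows from 0.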


#### 3. Predictors and targets
* The target is the variable to be predicted (here, `mpg`)
* Predictors are the variables used to predict the target

In [5]:
#creating matrix of predictors
x = df.iloc[:,1:8]

#creating target
y = df.iloc[:,0]

The `origin` feature is a categorical variable; the pandas `get_dummies` function can be used to one-hot encode it.

In [7]:
x = pd.get_dummies(x)
x.head()

,cylinders,displacement,horsepower,weight,acceleration,model_year,origin_europe,origin_japan,origin_usa
0,8,307.0,130.0,3504,12.0,70,0,0,1
1,8,350.0,165.0,3693,11.5,70,0,0,1
2,8,318.0,150.0,3436,11.0,70,0,0,1
3,8,304.0,150.0,3433,12.0,70,0,0,1
4,8,302.0,140.0,3449,10.5,70,0,0,1
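
The behaviour of `get_dummies` can be checked on a tiny made-up frame (`toy` below is hypothetical). By default it encodes only the text columns and passes numeric columns through unchanged, naming each new column `<column>_<category>`.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame(
    {"cylinders": [8, 4, 4], "origin": ["usa", "japan", "europe"]}
)

encoded = pd.get_dummies(toy)

# One indicator column per category, in sorted category order
print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns may be displayed as 0/1 or as True/False; both represent the same encoding.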


### Train and test split
* The data must be divided into two parts
* The first is a training set on which the model is trained; the second is a testing set on which the model is validated

In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=0)
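
To illustrate what `test_size=0.2` and `random_state=0` do, the sketch below splits a placeholder array with 392 rows (the row count after dropping nulls; the array contents are made up). The test set size is rounded up, and fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned frame
X_demo = np.arange(392 * 2).reshape(392, 2)
y_demo = np.arange(392)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (313, 2) (79, 2)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))  # True
```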

### Applying standard scaler on the data
* Standardization is applied because the variables have different units of measurement and different scales
* The standard scaler transforms each column so that its mean is 0 and its standard deviation is 1

In [10]:
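#Applying standard scaler on the data
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
#fit the scaler on the training set only and keep the scaled arrays;
#fit_transform/transform return new arrays and leave x_train and x_test unchanged
x_train_scaled = scale.fit_transform(x_train)
x_test_scaled = scale.transform(x_test)
#note: the regression cells that follow are fitted on the unscaled x_train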
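
The effect of the scaler can be verified on a small made-up array (the values below are illustrative, not taken from the dataset). The sketch shows that each training column ends up with mean 0 and standard deviation 1, that the test data is scaled with the training statistics, and that the input array itself is not modified.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
test = np.array([[2.5, 250.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the training mean/std

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(train_scaled.std(axis=0))   # approximately [1, 1]

# The test row equals the training column means, so it maps to zeros
print(test_scaled)

# The original array is untouched; the scaled values are in the return value
print(train[0])
```

Because the transformed values are returned rather than written in place, the result has to be assigned to a variable for later steps to use it.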

### Importing linear regressor

In [11]:
#importing the linear regression model
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

#fitting the model on training data
reg.fit(x_train,y_train)

#checking the coefficients (one slope per predictor) and the intercept
m = reg.coef_
c = reg.intercept_
m,c

(array([-0.38946904,  0.02158376, -0.01237154, -0.00700083,  0.12954429,
         0.76774449,  0.83388251,  0.71982383, -1.55370633]),
 -16.203123000581034)
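
The coefficients and intercept fully describe the fitted model: a prediction is the dot product of the predictors with `coef_`, plus `intercept_`. A minimal sketch on made-up data (the arrays and the generating formula are assumptions for illustration) confirms this against `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up predictors and a noise-free target y = 3*x1 - 2*x2 + 7
X_toy = np.array(
    [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
)
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 7.0

model = LinearRegression().fit(X_toy, y_toy)

# Reproduce predict() by hand: X @ coefficients + intercept
manual = X_toy @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(X_toy)))  # True
```

Each coefficient is the change in the predicted target for a one-unit change in that predictor, holding the others fixed, so coefficient magnitudes depend on the units of each column.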

In [12]:
#predicting the target from the predictors in the training set
#predictions stored in y_pred_train
y_pred_train = reg.predict(x_train)

#predicting the target from the predictors in the testing set
#predictions stored in y_pred_test
y_pred_test = reg.predict(x_test)

### Model Evaluation

In [13]:
#R-squared score on the training set: how closely the predicted target values match the actual values
from sklearn.metrics import r2_score
r2_S = r2_score(y_train,y_pred_train)
r2_S

0.8194239716903474

In [14]:
#R-squared score on the testing set: how closely the predicted target values match the actual values
from sklearn.metrics import r2_score
r2_T = r2_score(y_test,y_pred_test)
r2_T

0.8387519287083124
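
For reference, `r2_score` computes 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces it by hand on a few made-up values (`y_true` and `y_hat` are illustrative only).

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: 0.25 + 0.25 + 0 + 0.25 = 0.75
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares around the mean (6.0): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot  # 0.9625
print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. Similar scores on the training and testing sets, as above, suggest the model is not badly overfitting.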