# Machine Learning with Scikit-learn
More advanced material can be found in the DS_Data_Analysis > IBM4 Model Development file.

### Basics
This section uses the well-known Auto MPG car data from https://archive.ics.uci.edu/ml/datasets/auto+mpg.

In [3]:
import pandas as pd

In [4]:
# Check version (sklearn must be imported before its version can be read)
import sklearn
sklearn.__version__

'0.24.2'

In [6]:
# Check directory
!dir
!dir .\files\auto-mpg.csv

 Volume in drive C is Windows
 Volume Serial Number is 3838-B871

 Directory of C:\Users\sori-\Desktop\DS_Machine_Learning

06/29/2021  04:15 PM    <DIR>          .
06/29/2021  04:15 PM    <DIR>          ..
06/29/2021  03:39 PM             1,928 .gitignore
06/29/2021  03:43 PM    <DIR>          .idea
06/29/2021  03:52 PM    <DIR>          .ipynb_checkpoints
06/29/2021  04:15 PM             4,636 Day29_scikit_learn.ipynb
06/29/2021  04:05 PM    <DIR>          files
06/29/2021  03:39 PM            11,558 LICENSE
06/29/2021  03:39 PM                21 README.md
               4 File(s)         18,143 bytes
               5 Dir(s)  150,653,026,304 bytes free
 Volume in drive C is Windows
 Volume Serial Number is 3838-B871

 Directory of C:\Users\sori-\Desktop\DS_Machine_Learning\files

06/29/2021  04:05 PM            21,913 auto-mpg.csv
               1 File(s)         21,913 bytes
               0 Dir(s)  150,653,026,304 bytes free


In [9]:
# Import data into dataframe
car = pd.read_csv('./files/auto-mpg.csv', header=None)
car.head()

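|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |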


In [17]:
# Set column names
car.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'name']
car.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [18]:
# Check null values and dtypes
car.info()                     # Note the 'object' dtype on horsepower and name

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
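
The `object` dtype on `horsepower` suggests the column holds some non-numeric entries; in the UCI Auto MPG data, missing horsepower values are recorded as `?`. Below is a minimal sketch of one way to handle this. The frame `sample` and its values are a hypothetical stand-in, not the real file, and dropping the rows is only one possible choice.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the car data; '?' marks a missing value
sample = pd.DataFrame({
    'mpg': [18.0, 25.0, 15.0],
    'horsepower': ['130.0', '?', '165.0'],
})

# Coerce non-numeric strings such as '?' to NaN, which yields a float column
sample['horsepower'] = pd.to_numeric(sample['horsepower'], errors='coerce')

# Count the missing values, then drop the affected rows
n_missing = int(sample['horsepower'].isna().sum())
cleaned = sample.dropna(subset=['horsepower'])

print(sample['horsepower'].dtype, n_missing, len(cleaned))
```

This step is only needed if `horsepower` is used as a feature; the regression below uses `weight` alone.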


In [27]:
# Set x and y data
x = car[['weight']]           # Should be 2D array, not Series
y = car[['mpg']]

In [28]:
# x and y should have the same number of rows
x.shape, y.shape

((398, 1), (398, 1))
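
The difference between a 2D selection and a Series can be checked directly from the shapes. This is a small sketch on a hypothetical three-row frame `sample` rather than the full car data:

```python
import pandas as pd

# Hypothetical stand-in for the car data
sample = pd.DataFrame({
    'weight': [3504.0, 3693.0, 3436.0],
    'mpg': [18.0, 15.0, 18.0],
})

as_frame = sample[['weight']]    # double brackets -> DataFrame (2D)
as_series = sample['weight']     # single brackets -> Series (1D)

print(as_frame.shape)    # (3, 1)
print(as_series.shape)   # (3,)

# A Series can be reshaped into a single-feature 2D array if needed
as_2d = as_series.to_numpy().reshape(-1, 1)
print(as_2d.shape)       # (3, 1)
```

scikit-learn estimators expect the feature input to have shape `(n_samples, n_features)`, which is why the double brackets are used for `x`.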

In [29]:
# Import module for Linear Regression
from sklearn.linear_model import LinearRegression

In [30]:
# Create a LinearRegression object
lm = LinearRegression()

In [31]:
# Fit the object with train data
lm.fit(x, y)

LinearRegression()

In [32]:
# Check coefficient & intercept
lm.coef_, lm.intercept_

(array([[-0.00767661]]), array([46.31736442]))

From the output above, we can construct the equation as follows:
* **y = -0.00767661 * x + 46.31736442**, or
* **y = lm.coef_ * x + lm.intercept_**
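
To see that the fitted equation is what `predict` evaluates, the line can be applied by hand and compared. The sketch below fits a separate model on only the five rows shown by `car.head()`, so its coefficients will differ from the ones above; the names `sample`, `x_s`, `y_s`, and `model` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The five rows shown by car.head(), used as a small stand-in for the full data
sample = pd.DataFrame({
    'weight': [3504.0, 3693.0, 3436.0, 3433.0, 3449.0],
    'mpg': [18.0, 15.0, 18.0, 16.0, 17.0],
})
x_s = sample[['weight']]
y_s = sample[['mpg']]

model = LinearRegression().fit(x_s, y_s)

# Pull the slope and intercept out as plain floats
slope = np.ravel(model.coef_)[0]
intercept = np.ravel(model.intercept_)[0]

# Apply y = slope * x + intercept by hand
manual = x_s.to_numpy() * slope + intercept

# predict() should return the same values
predicted = model.predict(x_s)
print(np.allclose(manual, predicted))
```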

In [33]:
# Get the R² score (coefficient of determination)
lm.score(x, y)

0.6917929800341573

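The outcome above can be interpreted as **"69.2% of the variance of the target variable (mpg) is explained by this linear regression"**. Thus, on the same data, **the higher the R² score is, the better the model fits**, with 1.0 as the maximum. A high score on the data used for fitting does not by itself guarantee good predictions on unseen data.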
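
The value returned by `score` for a regressor is the coefficient of determination, R² = 1 - SS_res / SS_tot. As a rough check, it can be computed by hand and compared. The arrays below are the five `car.head()` rows used as stand-in data, so the resulting number is not the 0.69 shown above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: weight and mpg from the five rows shown by car.head()
x_s = np.array([[3504.0], [3693.0], [3436.0], [3433.0], [3449.0]])
y_s = np.array([18.0, 15.0, 18.0, 16.0, 17.0])

model = LinearRegression().fit(x_s, y_s)
pred = model.predict(x_s)

# Residual sum of squares: variation the model leaves unexplained
ss_res = np.sum((y_s - pred) ** 2)
# Total sum of squares: variation around the mean of y
ss_tot = np.sum((y_s - y_s.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(np.isclose(r2_manual, model.score(x_s, y_s)))
```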

### Model Evaluation with Train/Test Data

In [44]:
# Import module
from sklearn.model_selection import train_test_split

In [45]:
# Split into train data & test data
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.85, random_state=0)

In [49]:
# Confirm the length of train and test data
len(x_train), len(x_test), len(y_train), len(y_test)

(338, 60, 338, 60)
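
The 338/60 split follows from `train_size=0.85` applied to 398 rows, and `random_state` makes the shuffle repeatable. A small sketch with a hypothetical array `rows` standing in for the 398 data rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 398 row indices in place of the real data
rows = np.arange(398).reshape(-1, 1)

train_rows, test_rows = train_test_split(rows, train_size=0.85, random_state=0)
print(len(train_rows), len(test_rows))   # 338 60

# The same random_state reproduces the same split
train_again, test_again = train_test_split(rows, train_size=0.85, random_state=0)
print(np.array_equal(test_rows, test_again))
```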

#### Fit Model with Train Data

In [55]:
lm.fit(x_train, y_train)
lm.coef_, lm.intercept_

(array([[-0.00752676]]), array([45.81368469]))

* **y = -0.00752676 * x + 45.81368469**

In [57]:
lm.score(x_train, y_train)

0.6697656455991319

#### Fit Model with Test Data

In [58]:
lm.fit(x_test, y_test)
lm.coef_, lm.intercept_

(array([[-0.00854704]]), array([49.20543496]))

* **y = -0.00854704 * x + 49.20543496**

In [59]:
lm.score(x_test, y_test)

0.7372440949756491

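The R² scores of the two fits differ noticeably (0.670 on the train data versus 0.737 on the test data). Note that the cell above refits the model on the test data, so 0.737 measures how well a new line fits those 60 rows, not how well the train-fitted model generalizes. A more stable estimate can be obtained with **"cross validation"**, which will be dealt with later.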
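
A more conventional evaluation fits on the train data only and then scores that same fitted model on the test data, without calling `fit` again. The sketch below shows this pattern together with a 5-fold cross-validation. It runs on synthetic data: the arrays `weight` and `mpg`, their noise level, and the choice of `cv=5` are assumptions for illustration, so the printed scores are not results for the car data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical): mpg falls roughly linearly with weight
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5200, size=(398, 1))
mpg = 46.0 - 0.0077 * weight[:, 0] + rng.normal(0, 4, size=398)

w_train, w_test, m_train, m_test = train_test_split(
    weight, mpg, train_size=0.85, random_state=0
)

# Fit once, on the train data only
model = LinearRegression().fit(w_train, m_train)

train_r2 = model.score(w_train, m_train)
test_r2 = model.score(w_test, m_test)    # same fitted model, no refit

# 5-fold cross-validation: each fold serves once as held-out data
cv_scores = cross_val_score(LinearRegression(), weight, mpg, cv=5, scoring='r2')

print(train_r2, test_r2, cv_scores.mean())
```

Averaging over folds makes the estimate less dependent on which rows happen to land in a single test split.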