# General steps to develop any ML model with Scikit-Learn

https://scikit-learn.org/stable/supervised_learning.html

## Data

Given a target variable and one or more explanatory variables, the goal is to find the coefficients $b_0, b_1, b_2, \dots$ that minimize the error of the following model:

$$
Y = b_0 + b_1 \cdot X_1 + b_2 \cdot X_2 + \dots
$$

For example, in a house-price model the variables would be:

- $Y$ `Price` is the target variable
- $X_1$ `Bedrooms` is one explanatory variable
- $X_2$ `Bathrooms` is another explanatory variable
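
As a minimal sketch of how this formula turns inputs into a prediction, the plain-Python function below evaluates $Y$ for one house. The coefficient values and the `predict_price` name are hypothetical, chosen only for illustration; a fitted model would estimate the coefficients from data.

```python
def predict_price(b0, coefs, features):
    """Evaluate Y = b0 + b1*X1 + b2*X2 + ... for one observation."""
    return b0 + sum(b * x for b, x in zip(coefs, features))


# Hypothetical coefficients (not estimated from any real data).
b0 = 50_000               # intercept
coefs = [30_000, 15_000]  # b1 for Bedrooms, b2 for Bathrooms

# A house with 3 bedrooms and 2 bathrooms.
price = predict_price(b0, coefs, [3, 2])
print(price)  # 170000
```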



## The Machine Learning System

To create any ML model in Python, you always [follow the same system](https://datons.craft.me/h3f5pSQSE7l6RW/b/53521460-00B0-4667-9FEB-4D31916272C3/Machine-Learning-%E2%86%92-Sklearn-).

## Select the variables for the model

1. **`y` target**: output variable (to predict)
2. **`X` explanatory**: input variables to calculate/explain the prediction

In [97]:
import pandas

In [98]:
path = '/workspaces/ML/data/premier/output/simple_reg.xlsx'

In [99]:
df = pandas.read_excel(path)

In [100]:
from sklearn.linear_model import LinearRegression

In [101]:
model = LinearRegression()

In [102]:
df

| | team | goals | points |
|---|---|---|---|
| 0 | Manchester City | 99 | 93 |
| 1 | Liverpool | 94 | 92 |
| 2 | Chelsea | 76 | 74 |
| 3 | Tottenham Hotspur | 69 | 71 |
| 4 | Arsenal | 61 | 69 |
| 5 | Manchester United | 57 | 58 |
| 6 | West Ham United | 60 | 56 |
| 7 | Leicester City | 62 | 52 |
| 8 | Brighton and Hove Albion | 42 | 51 |
| 9 | Wolverhampton Wanderers | 38 | 51 |


In [103]:
y = df['points']
X = df[['goals']]

In [104]:
y


0     93
1     92
2     74
3     71
4     69
5     58
6     56
7     52
8     51
9     51
10    49
11    48
12    46
13    45
14    40
15    39
16    38
17    35
18    23
19    22
Name: points, dtype: int64

In [105]:
X


| | goals |
|---|---|
| 0 | 99 |
| 1 | 94 |
| 2 | 76 |
| 3 | 69 |
| 4 | 61 |
| 5 | 57 |
| 6 | 60 |
| 7 | 62 |
| 8 | 42 |
| 9 | 38 |


## Linear Regression

### Fit model

In [106]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X, y)

### Model's score

In [107]:
model.score(X, y)

0.9036097775581716
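
For regressors, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below recomputes it in plain Python from the `points` values shown earlier and the linear regression predictions printed in the Predictions section of this chapter. The predictions are copied here rounded to four decimals, so the result matches only approximately. The `r2_score_manual` helper is illustrative and not part of scikit-learn.

```python
# Actual points for the 20 teams (the y series shown earlier).
y = [93, 92, 74, 71, 69, 58, 56, 52, 51, 51,
     49, 48, 46, 45, 40, 39, 38, 35, 23, 22]

# Linear regression predictions from this notebook, rounded to 4 decimals.
y_pred_lr = [95.5626, 90.8362, 73.8213, 67.2044, 59.6423,
             55.8612, 58.6970, 60.5875, 41.6821, 37.9010,
             43.5727, 49.2443, 47.3537, 51.1348, 42.6274,
             42.6274, 41.6821, 34.1199, 34.1199, 23.7220]


def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


r2 = r2_score_manual(y, y_pred_lr)
print(round(r2, 3))  # 0.904
```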

### Predictions

In [108]:
y_pred_lr = model.predict(X)

y_pred_lr

array([95.56258909, 90.83623165, 73.82134488, 67.20444448, 59.64227258,
       55.86118663, 58.69700109, 60.58754407, 41.68211432, 37.90102838,
       43.5726573 , 49.24428622, 47.35374325, 51.1348292 , 42.62738581,
       42.62738581, 41.68211432, 34.11994243, 34.11994243, 23.72195607])
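
Because the model has a single explanatory variable, every prediction lies on the line $Y = b_0 + b_1 \cdot X_1$. As a rough check, the sketch below recovers $b_0$ and $b_1$ from two (goals, prediction) pairs taken from the outputs above and verifies that a third pair falls on the same line. In scikit-learn the fitted values are available directly as `model.intercept_` and `model.coef_`; the variable names used here are illustrative.

```python
# (goals, predicted points) pairs copied from the outputs above.
x_a, pred_a = 99, 95.56258909   # Manchester City
x_b, pred_b = 94, 90.83623165   # Liverpool
x_c, pred_c = 76, 73.82134488   # Chelsea, held back for the check

# Slope and intercept of the line through the first two points.
b1 = (pred_a - pred_b) / (x_a - x_b)
b0 = pred_a - b1 * x_a

print(round(b1, 4), round(b0, 4))  # 0.9453 1.9807

# The third prediction should fall on the same line (up to rounding).
on_line = abs((b0 + b1 * x_c) - pred_c) < 1e-6
print(on_line)  # True
```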

## K Nearest Neighbors

### Fit model

In [109]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()
model.fit(X, y)

### Model's score

In [110]:
model.score(X, y)

0.7740619902120718

### Predictions

In [111]:
y_pred_kn = model.predict(X)

y_pred_kn

array([76.4, 76.4, 64.4, 64.4, 61.2, 56. , 56. , 61.2, 43.4, 39.6, 43.4,
       49.2, 45.6, 49.2, 43.4, 43.4, 43.4, 39.6, 39.6, 36.4])
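
The repeated values in this output follow from how the model works: with scikit-learn's defaults (`n_neighbors=5`, uniform weights), each prediction is the mean `points` of the five teams whose `goals` are closest to the query. The sketch below reproduces the first prediction in plain Python using only the ten rows displayed earlier, so it is a partial illustration rather than a re-fit on the full data. `knn_predict` is a hypothetical helper, not a scikit-learn function.

```python
# (goals, points) for the ten teams shown in the table earlier.
data = [(99, 93), (94, 92), (76, 74), (69, 71), (61, 69),
        (57, 58), (60, 56), (62, 52), (42, 51), (38, 51)]


def knn_predict(query_goals, rows, k=5):
    """Average the targets of the k rows closest to the query."""
    nearest = sorted(rows, key=lambda row: abs(row[0] - query_goals))[:k]
    return sum(points for _, points in nearest) / k


# Manchester City scored 99 goals; its neighbours (itself included) are
# the teams with 99, 94, 76, 69 and 62 goals.
prediction = knn_predict(99, data)
print(prediction)  # 76.4
```

Liverpool (94 goals) has the same five nearest neighbours, which is why the first two predictions are identical.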

## Random Forest

`RandomForestRegressor` is imported from scikit-learn's `ensemble` sub-module.

### Fit model

In [113]:
from sklearn.ensemble import RandomForestRegressor

In [114]:
model = RandomForestRegressor()
model.fit(X, y)

### Model's score

In [115]:
model.score(X, y)

0.9478488728173949

### Predictions

In [117]:
y_pred_rf = model.predict(X)
y_pred_rf

array([90.81      , 90.42      , 69.21      , 65.41      , 63.68      ,
       58.34      , 59.26      , 56.29      , 43.73802381, 43.46866667,
       46.44133333, 47.17      , 46.61      , 45.89      , 40.14683333,
       40.14683333, 43.73802381, 31.73009524, 31.73009524, 25.58833333])
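
The three sets of predictions can also be compared numerically. As a sketch, the snippet below computes the mean absolute error from the outputs printed above, rounded to two decimals. This metric is chosen here for illustration; the notebook itself only reports `score`. All three models were evaluated on the same data they were fitted on, so these numbers say nothing about performance on unseen teams.

```python
# Actual points for the 20 teams.
y = [93, 92, 74, 71, 69, 58, 56, 52, 51, 51,
     49, 48, 46, 45, 40, 39, 38, 35, 23, 22]

# Predictions copied from the notebook outputs, rounded to 2 decimals.
predictions = {
    "linear_regression": [95.56, 90.84, 73.82, 67.20, 59.64, 55.86, 58.70,
                          60.59, 41.68, 37.90, 43.57, 49.24, 47.35, 51.13,
                          42.63, 42.63, 41.68, 34.12, 34.12, 23.72],
    "k_neighbors": [76.4, 76.4, 64.4, 64.4, 61.2, 56.0, 56.0,
                    61.2, 43.4, 39.6, 43.4, 49.2, 45.6, 49.2,
                    43.4, 43.4, 43.4, 39.6, 39.6, 36.4],
    "random_forest": [90.81, 90.42, 69.21, 65.41, 63.68, 58.34, 59.26,
                      56.29, 43.74, 43.47, 46.44, 47.17, 46.61, 45.89,
                      40.15, 40.15, 43.74, 31.73, 31.73, 25.59],
}


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae = {name: mean_absolute_error(y, pred) for name, pred in predictions.items()}

# Print models from lowest to highest error.
for name, value in sorted(mae.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

On these outputs the random forest has the lowest error, followed by linear regression and then k-nearest neighbours, which is the same ordering as the `score` values above.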

## BONUS: Comparing models visually