# Machine Learning Basics with Scikit-learn: Day 1

## Introduction

### Objectives
The workshop will focus on the basics of *scikit-learn*, which is one of the most popular machine learning libraries in Python. After the workshop you will:

* Understand how to use scikit-learn functions and documentation.
* Learn the usual procedure to transform your data for a machine learning model.
* Create basic machine learning models.

### Why scikit-learn?
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities. It has a standardized and simple interface for preprocessing data and model training, optimization and evaluation. 

### Why are we teaching machine learning with scikit-learn?
Machine learning is becoming an essential skill for many professionals. As work and organizations demand more data analysis, machine learning allows us to analyze large datasets and extract meaningful information. Although machine learning techniques have been around for decades, modern programming languages are making them available to thousands of users. Rather than developing each method from scratch, *scikit-learn* offers a simple way to apply them to our datasets. This package provides an easy-to-use toolkit to master and leverage machine learning skills. We hope that learning scikit-learn helps you understand the overall process of data transformation: from curating and importing datasets to building models that extract knowledge from data.

### Structure of the Workshop
The workshop is divided into 5 days:

1. Introduction to scikit-learn
2. Supervised learning: Classification models
3. Unsupervised learning: Clustering models
4. Data cleaning / transformation
5. Model selection

Today, we will start with an overview of *scikit-learn*. We will load a dataset, split it into training and testing sets, create a model, and evaluate it. We will delve into the functions, models, and details over the following days. However, keep in mind that this workshop's content will stay at an introductory level. If you are interested in learning more, here are some good resources that you can explore:
* [Official documentation](https://scikit-learn.org/stable/index.html)
* [Google Cloud AI Adventures](https://www.youtube.com/hashtag/aiadventures)
* [Google's Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)
* [Stanford University: CS229: Machine Learning](http://cs229.stanford.edu/)

## Installing scikit-learn

Conda and Google Colab already include scikit-learn, so you can skip the installation if you are using one of those. If you are running this notebook in your own environment, please run the following command. For more instructions, please check out the [documentation](https://scikit-learn.org/stable/install.html).

In [None]:
# If you do not have scikit-learn installed, uncomment ONE of the two install lines below and run this cell.
import sys
# Option 1: install a specific version.
# !{sys.executable} -m pip install scikit-learn==0.24.2
# Option 2: install or upgrade to the latest release.
# !{sys.executable} -m pip install scikit-learn --upgrade

Let's check that the package is in your environment. Run the following command.

In [None]:
import sklearn
sklearn.show_versions()

## The big picture
One of the biggest advantages of machine learning is that a computational model learns how to differentiate observations, rather than relying on human coding and manual rules. When we have thousands of observations, machine learning models help us automate and scale these coding processes and make them reproducible. The basic steps are:
1. Gathering data
2. Preparing that data
3. Choosing a model
4. Training
5. Evaluation
6. Prediction

## 1. Gathering data

The first step is collecting data and understanding the dataset. This step is very important because the quality and quantity of the data that you gather will directly determine how good your predictive model can be. Also, many machine learning models are *biased* because of the data they were trained on. A well-known example is [this face-depixelizing model](https://www.theverge.com/21298762/face-depixelizer-ai-machine-learning-tool-pulse-stylegan-obama-bias) that turned a pixelated photo of President Obama into the face of a white man.

You need to collect data and take into account these recommendations:
* **Number of observations (*N*):** In statistics, there is a theorem called the "law of large numbers." According to this law, the average of the results obtained from a large number of trials should be close to the expected value. In other words, with more observations, the estimates computed from your sample tend to get closer to the true values in the population. Machine learning models typically rely on *big datasets.* Imagine companies that collect big data from their clients and are able to get a clear picture of them. Prediction generally becomes more accurate as you gather more observations and knowledge of your population.
* **Missing data**: Many observations may lack some values. There can be multiple reasons for this: data-collection issues, restricted data, information not available, etc. It is important to have a strategy for handling missing data.
* **Representativeness**: A main problem with most datasets is checking how representative the dataset is of the population. How can we be sure that the dataset is not biased? Computers do not understand the dataset's context or how it was collected. This responsibility lies with the people who collected the data. Moreover, people who use the data must be aware of any potential flaws or inequalities in the dataset. Descriptive analysis should guide any checks and validation processes.
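
The law of large numbers mentioned in the first recommendation can be illustrated with a small simulation. The sketch below is only an illustration and is not part of the workshop data: it assumes a fair six-sided die (expected value 3.5) and uses Python's standard `random` module with a fixed seed.

```python
import random

# Simulate rolls of a fair six-sided die; the expected value is 3.5.
random.seed(0)  # fixed seed so the run is repeatable
expected_value = 3.5

sample_means = {}
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    sample_means[n] = sum(rolls) / n
    print(f"N = {n:>7}: sample mean = {sample_means[n]:.3f}")
```

With more rolls, the sample mean should settle closer to 3.5, which is the same reason why larger datasets tend to give more stable estimates.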

In this workshop, we will use the datasets provided by *scikit-learn*. You can check other [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) if you are interested in exploring more. These datasets are well-known, public, and used frequently for learning and testing purposes.

### Loading datasets

Before we get started, some terms that we must get familiar with:
* **Samples**: A sample is an observation available in the dataset. Samples are also known as observations or records, and they are usually the rows of a dataset table.
* **Features**: A feature is an individual measurable property. Features are also known as variables or attributes, and they are usually the columns of a dataset table.

We will start by importing scikit-learn's `datasets` module:

In [None]:
from sklearn import datasets
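
To make the terms *samples* and *features* concrete, here is a minimal sketch that loads one of the small datasets bundled with scikit-learn. The diabetes dataset is chosen here only for illustration (it is not the dataset used in the rest of this notebook), and `as_frame=True` assumes that pandas is installed.

```python
from sklearn import datasets

# load_diabetes is one of the small datasets bundled with scikit-learn;
# as_frame=True returns pandas objects instead of NumPy arrays.
diabetes = datasets.load_diabetes(as_frame=True)
features = diabetes.data    # DataFrame: one row per sample, one column per feature
target = diabetes.target    # Series: one value per sample

print(features.shape)           # (number of samples, number of features)
print(list(features.columns))   # feature names
print(target.shape)
```

The first number of `features.shape` is the number of samples (rows), and the second is the number of features (columns).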

The `datasets` module is now available in the environment, and we can load any of its datasets and assign it to a variable. In this workshop we will use the *Boston house prices dataset*, which we read with pandas from a local CSV file (`BostonHousing.csv`). Each record in the dataset describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):

* CRIM: per capita crime rate by town
* ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS: proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX: nitric oxides concentration (parts per 10 million)
* RM: average number of rooms per dwelling
* AGE: proportion of owner-occupied units built prior to 1940
* DIS: weighted distances to five Boston employment centres
* RAD: index of accessibility to radial highways
* TAX: full-value property-tax rate per \$10,000
* PTRATIO: pupil-teacher ratio by town
* B: 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
* LSTAT: proportion of lower status of the population
* MEDV: Median value of owner-occupied homes in $1000’s

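This dataset has been frequently used to demonstrate regression models. Note that the CSV file used below does not contain the `B` column and includes an extra `CAT. MEDV` column.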

In [1]:
import pandas as pd
boston_df = pd.read_csv("BostonHousing.csv")

In [2]:
boston_df.head()

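| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV | CAT. MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 | 24.0 | 0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 | 0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 | 1 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 | 1 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 | 1 |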


In [3]:
# Rename the columns: MEDV becomes PRICE and "CAT. MEDV" becomes CATMEDV
boston_df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'LSTAT', 'PRICE', 'CATMEDV']

## 2. Preparing the data

In a machine learning model, we will have *features* and a *target*. The target is the variable that the model predicts or estimates. The value of the target depends on the features' values. Usually, targets are the outcomes of a process (e.g., giving a loan, earnings). In statistics, the target is also known as the *dependent variable*, and the features are the *independent variables*.

![In this example, demographic information is used to predict users' behavior when they are navigating a website](https://d2m6ke2px6quvq.cloudfront.net/uploads/2020/09/11/0e1df989-5fc9-474b-ba49-5eaebfc2d795.png)

In this example, users' demographic and activity information is used to predict users' behavior when they are navigating a website.

We will print the first 5 rows of this pandas dataframe. Each column is shown with its feature name.

In [4]:
print(boston_df.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   

   LSTAT  PRICE  CATMEDV  
0   4.98   24.0        0  
1   9.14   21.6        0  
2   4.03   34.7        1  
3   2.94   33.4        1  
4   5.33   36.2        1  


### Exercise 1
Before we create a model, let's get familiar with the target column, "PRICE," which is the median value of the houses in each town (in \$1000s). Run the following command and check the mean, the minimum value, the maximum value, and the median (the 50% row). How is the data distributed?

In [5]:
print(boston_df['PRICE'].describe())

count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: PRICE, dtype: float64
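
One possible way to read this summary is sketched below. The numbers are copied from the `describe()` output above; the three derived quantities (`mean_minus_median`, `iqr`, and `max_z`) are illustrative names introduced here, not part of the workshop code.

```python
# Summary statistics copied from the describe() output above (in $1000s).
mean, median, std = 22.532806, 21.2, 9.197104
q1, q3, maximum = 17.025, 25.0, 50.0

# A mean above the median hints at a longer right tail (some expensive homes).
mean_minus_median = mean - median

# Interquartile range: the spread of the middle 50% of the prices.
iqr = q3 - q1

# How many standard deviations above the mean is the largest value?
max_z = (maximum - mean) / std

print(f"mean - median = {mean_minus_median:.2f}")
print(f"IQR = {iqr:.3f}")
print(f"max is {max_z:.2f} standard deviations above the mean")
```

A positive gap between mean and median, together with a maximum far above the upper quartile, suggests that the prices are skewed toward higher values.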


## 3. Training and Testing sets

To see if the model is capable of predicting "future" values, we split the original dataset into two sets: one portion will be used for **training**, and the second portion will be used for **testing**. The testing dataset will *test* the model against data that has never been used for *training*. Having a testing dataset allows us to see how the model might perform against data that it has not yet seen. This is meant to be representative of how the model might perform in the real world.

A common rule of thumb is to use a training/testing split on the order of 80/20 or 90/10. Much of this depends on the size of the original dataset. We will train the model with 80% of the samples and test it with the remaining 20%, so that we can assess the model's performance on unseen data.

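We must separate the features that will act as independent variables (`X`) from the target (`y`). The independent variables include all columns except `'PRICE'`. Note that this keeps `CATMEDV` among the features; because that column appears to be a flag derived from the price itself, a more careful analysis would likely drop it as well.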

In [6]:
X = boston_df.drop('PRICE', axis = 1)
y = boston_df['PRICE']

To split the data, we use the `train_test_split` function provided by the *scikit-learn* library. Finally, we print the shapes of our training and testing sets to verify that the split worked as expected.

In [8]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
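
Because `train_test_split` shuffles the rows before splitting, each run of the cell above produces a different split, and the scores later in this notebook will change slightly from run to run. A minimal sketch of how to make the split repeatable with the `random_state` parameter follows; the toy arrays `X_toy` and `y_toy` and the seed value 42 are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state gives the same split on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
print("Test-set targets:", split_a[3])
```

Fixing the seed is useful when you want other people to reproduce exactly the same numbers.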

Now, we check that the dimensions of both sets are correct. Since we have 506 observations, the training set has 404 observations (about 80%) and the testing set has 102 observations. Since `X` keeps 13 columns after dropping `PRICE`, each `X` dataframe has 13 columns:

In [10]:
print(X_train.shape)
print(X_test.shape)

(404, 13)
(102, 13)


Finally, the target is a single column with 404 observations for the training set, and 102 observations for the testing set.

In [11]:
print(y_train.shape)
print(y_test.shape)

(404,)
(102,)


### Exercise 2
Change the proportion for the training set. Instead of 80%, set it to **90%** (i.e., 90% for training and 10% for testing). How many observations would you have in the training and testing datasets?

In [12]:
## Run the code here
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size = 0.1)

In [13]:
# Print X datasets' shape
print(X_train_2.shape)
print(X_test_2.shape)

(455, 13)
(51, 13)


In [14]:
# Print y datasets' shape
print(y_train_2.shape)
print(y_test_2.shape)

(455,)
(51,)
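
The set sizes printed in this section can be checked by hand. The sketch below assumes that the test share is rounded up to a whole number of rows and that the remaining rows go to training; this assumption reproduces the shapes shown above, but check the `train_test_split` documentation for the exact rule.

```python
import math

n_samples = 506  # number of rows in the Boston dataset

sizes = {}
for test_size in (0.2, 0.1):
    # Assumption: the test share is rounded up and the rest goes to training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    sizes[test_size] = (n_train, n_test)
    print(f"test_size={test_size}: train={n_train}, test={n_test}")
```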


## 4. Choosing a model

The next step in our workflow is choosing a model. There are many models that researchers and data scientists have created over the years. Some are very well suited for image data, others for sequences (like text, or music), some for numerical data, others for text-based data. 

Since we have 13 numerical features and one numerical target, we can use a **linear regression model**. Regression models are well suited to predicting a continuous numeric value such as a price.

In [15]:
# Import Linear Regression model
from sklearn.linear_model import LinearRegression

In [16]:
reg = LinearRegression()
reg.fit(X_train, y_train)
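
To see what `fit` actually learns, it can help to train a model on data whose rule is known in advance. The sketch below uses hypothetical, noise-free data generated from `y = 3*x1 - 2*x2 + 5` (the names `X_demo`, `y_demo`, and `demo_reg` are introduced only for this illustration) and then inspects the fitted `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from a known rule: y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_reg = LinearRegression()
demo_reg.fit(X_demo, y_demo)

# One coefficient per feature, plus a single intercept.
print("coefficients:", demo_reg.coef_.round(3))
print("intercept:", round(float(demo_reg.intercept_), 3))
```

Because the data has no noise, the fitted coefficients should match the rule used to generate it. On real data such as the Boston dataset, the same attributes on `reg` hold one estimated weight per feature.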

### Exercise 3
Run the linear regression model again using your training set with 90% of the observations (`X_train_2`, `y_train_2`). Call your model `reg2`.

In [17]:
# Run regression model
reg2 = LinearRegression()
reg2.fit(X_train_2, y_train_2)

## 5. Evaluating your model

Once training is complete, it's time to see if the model provides accurate results. A good first check is to compare the original values (`y_train`) with the values predicted by the model. We want to check how *well* the model predicts the values that were used for training.

We use the `predict` method to get the values according to the trained model.

In [18]:
y_train_predict = reg.predict(X_train)

We compare the predicted training values with the real training values by checking their difference. If the difference is small, the model is predicting values close to the real ones. We will create a table now to compare them visually.

In [19]:
y_train_df = pd.DataFrame({'y_real': y_train, 'y_predicted': y_train_predict})
y_train_df['difference'] = y_train_df.y_real - y_train_df.y_predicted 
y_train_df.head(10)

| | y_real | y_predicted | difference |
|---|---|---|---|
| 336 | 19.5 | 20.503928 | -1.003928 |
| 293 | 23.9 | 25.596591 | -1.696591 |
| 440 | 10.5 | 11.855436 | -1.355436 |
| 198 | 34.6 | 37.569967 | -2.969967 |
| 414 | 7.0 | 1.827492 | 5.172508 |
| 160 | 27.0 | 29.952321 | -2.952321 |
| 28 | 18.4 | 18.414223 | -0.014223 |
| 478 | 14.6 | 17.19353 | -2.59353 |
| 235 | 24.0 | 22.710305 | 1.289695 |
| 47 | 16.6 | 17.847663 | -1.247663 |


As expected, some predicted values are close to the real values (i.e., their differences are close to zero), and some others are far from expected (i.e., big values in their differences). 
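
Reading differences row by row does not scale, so a natural next step is to summarize them in a single number. The sketch below copies the first five rows of the table above into a small dataframe (`sample_df` is a name introduced here) and compares the plain mean of the differences with the mean of their absolute values.

```python
import pandas as pd

# First five rows of the comparison table above (values copied from the output).
sample_df = pd.DataFrame({
    "y_real": [19.5, 23.9, 10.5, 34.6, 7.0],
    "y_predicted": [20.503928, 25.596591, 11.855436, 37.569967, 1.827492],
})
sample_df["difference"] = sample_df.y_real - sample_df.y_predicted

# Positive and negative errors cancel in a plain mean, so take absolute values.
mean_error = sample_df["difference"].mean()
mean_abs_error = sample_df["difference"].abs().mean()

print(f"mean difference: {mean_error:.3f}")
print(f"mean absolute difference: {mean_abs_error:.3f}")
```

The plain mean looks small because over- and under-predictions offset each other, while the mean absolute difference shows the typical size of an error in \$1000s.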

We need **metrics** to evaluate quantitatively how good this model is. For linear regression models, one metric is the *coefficient of determination* (R^2) of the prediction. This score compares the squared differences between the predicted and the original values with the overall variation of the original values. The best possible score is 1.0. For more details, click [here](https://en.wikipedia.org/wiki/Coefficient_of_determination).

In [20]:
r2 = round(reg.score(X_train, y_train),2)
print('R2 score is {}'.format(r2))

R2 score is 0.85
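
The R^2 score can also be computed by hand as 1 minus the ratio between the residual sum of squares and the total sum of squares. The sketch below uses small hypothetical arrays (`y_true` and `y_pred` are made up for illustration) and checks the manual result against `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical real and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("manual R^2: ", round(float(r2_manual), 4))
print("sklearn R^2:", round(float(r2_score(y_true, y_pred)), 4))
```

A model that always predicts the mean of `y_true` would get a score of 0, and a model whose predictions are worse than that gets a negative score.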


### Exercise 4
Calculate the predicted values using the model `reg2` with the 90% training dataset. Use `y_train_predict_2` for the new predicted values. Then, copy the code that creates the pandas dataframe and replace the variables with your 90% datasets (`y_train_2` and `y_train_predict_2`). The new dataframe should be called `y_train_df_2`. Print the head of the `y_train_df_2` dataframe.

In [21]:
# Calculate the predicted prices based on the 90% training set
y_train_predict_2 = reg2.predict(X_train_2)

# Create the dataframe
y_train_df_2 = pd.DataFrame({'y_real': y_train_2, 'y_predicted': y_train_predict_2})
y_train_df_2['difference'] = y_train_df_2.y_real - y_train_df_2.y_predicted 
y_train_df_2.head(10)

| | y_real | y_predicted | difference |
|---|---|---|---|
| 202 | 42.3 | 38.339876 | 3.960124 |
| 411 | 17.2 | 16.331458 | 0.868542 |
| 430 | 14.5 | 18.590209 | -4.090209 |
| 113 | 18.7 | 18.784155 | -0.084155 |
| 50 | 19.7 | 20.426836 | -0.726836 |
| 346 | 17.2 | 17.596551 | -0.396551 |
| 295 | 28.6 | 26.304314 | 2.295686 |
| 407 | 27.9 | 19.5243 | 8.3757 |
| 100 | 27.5 | 21.718153 | 5.781847 |
| 273 | 35.2 | 40.153392 | -4.953392 |


Compute the R^2 of your model and compare it with the previous model's R^2. Is it better or worse?

In [22]:
r2 = round(reg2.score(X_train_2, y_train_2),2)
print('R2 score is {}'.format(r2))

R2 score is 0.83


## 6. Prediction
In this final step, we use the model that we trained to *predict* new values (that is, values that the model did not see during training). This is the final test to check how good the model is. If the underlying phenomena or causal effects are reflected correctly in our model, then the model will be capable of predicting new observations. If the model's performance gets worse, we must reconsider the training dataset, the features, and the machine learning model used.

As in the prior step, we start by predicting the target values (`y_test_predict`) from the testing dataset (`X_test`):

In [23]:
y_test_predict = reg.predict(X_test)

We compare the predicted testing values with the real testing values by checking their difference. If the difference is small, the model is predicting values close to the real ones. We will create a table now to compare the numbers.

In [24]:
y_test_df = pd.DataFrame({'y_original': y_test, 'y_predicted': y_test_predict})
y_test_df['difference'] = y_test_df.y_original - y_test_df.y_predicted 
y_test_df.head(10)

| | y_original | y_predicted | difference |
|---|---|---|---|
| 248 | 24.5 | 20.783458 | 3.716542 |
| 201 | 24.1 | 23.605704 | 0.494296 |
| 177 | 24.6 | 24.944676 | -0.344676 |
| 288 | 22.3 | 23.673788 | -1.373788 |
| 489 | 7.0 | 13.20718 | -6.20718 |
| 267 | 50.0 | 39.627067 | 10.372933 |
| 22 | 15.2 | 15.569097 | -0.369097 |
| 180 | 39.8 | 37.65351 | 2.14649 |
| 334 | 20.7 | 21.590442 | -0.890442 |
| 277 | 33.1 | 41.398983 | -8.298983 |


Finally, we compute the R^2 of this model on the testing set and check its performance. We can expect a somewhat lower score than on the training set, because the model has not seen these observations. We will discuss in another session how to improve these scores.

In [25]:
r2 = round(reg.score(X_test, y_test),2)
print("R^2: {}".format(r2))

R^2: 0.81


### Exercise 5

Calculate the predicted values using the model `reg2` with the 10% testing dataset (`X_test_2`). Use `y_test_predict_2` for the new predicted values. Then, copy the code that creates the pandas dataframe and replace the variables with those of your new testing dataset (`y_test_2` and `y_test_predict_2`). The new dataframe should be called `y_test_df_2`. Print the head of the `y_test_df_2` dataframe.

In [26]:
# Calculate the predicted value
y_test_predict_2 = reg2.predict(X_test_2)

In [27]:
# Create the dataframe
y_test_df_2 = pd.DataFrame({'y_original': y_test_2, 'y_predicted': y_test_predict_2})
y_test_df_2['difference'] = y_test_df_2.y_original - y_test_df_2.y_predicted 
y_test_df_2.head(10)

| | y_original | y_predicted | difference |
|---|---|---|---|
| 129 | 14.3 | 15.965541 | -1.665541 |
| 204 | 50.0 | 40.958182 | 9.041818 |
| 386 | 10.5 | 8.474505 | 2.025495 |
| 436 | 9.6 | 15.254705 | -5.654705 |
| 173 | 23.6 | 24.571495 | -0.971495 |
| 347 | 23.1 | 20.586548 | 2.513452 |
| 380 | 10.4 | 6.161557 | 4.238443 |
| 383 | 12.3 | 13.320262 | -1.020262 |
| 339 | 19.0 | 21.446008 | -2.446008 |
| 29 | 21.0 | 19.304946 | 1.695054 |


Compute the R^2 of your model and compare it with the previous model's R^2. Is it better or worse?

In [28]:
# Compute the score
r2 = round(reg2.score(X_test_2, y_test_2),2)
print("R^2: {}".format(r2))

R^2: 0.88
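
When comparing these scores, keep in mind that a test set of only 51 observations is small, so the test R^2 can move noticeably depending on which rows happen to land in it. The sketch below illustrates this with hypothetical synthetic data from `make_regression` (not the Boston dataset); the sample size, noise level, and seeds are arbitrary choices made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data, roughly the size of the Boston dataset.
X_syn, y_syn = make_regression(n_samples=506, n_features=13, noise=25.0, random_state=0)

test_scores = []
for seed in range(5):
    # A different random_state gives a different 90/10 split each time.
    X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.1, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))

print([round(s, 3) for s in test_scores])
print("spread:", round(max(test_scores) - min(test_scores), 3))
```

If the scores differ across seeds, a single comparison between two splits should be read with caution rather than as proof that one model is better.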
