## Scikit-learn

This notebook covers the most basic operations of the ```scikit-learn``` library.

### Explore the data

First, let's take a look at the data. We will use the [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), a classic dataset in machine learning. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant, and 4 features: sepal length, sepal width, petal length, and petal width. The dataset ships with the ```scikit-learn``` library.

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

#### Task 1 (0.5 point)

a) What are the keys of the dataset? What is the type of the data in each key?

b) Print the description of the dataset.

c) Print the feature and target names.
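
One possible way to approach the three sub-tasks is sketched below. It assumes only that `load_iris()` returns a dictionary-like `Bunch` object, so the keys can be iterated like those of a `dict`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The Bunch behaves like a dictionary: list each key with the type of its value.
for key in iris.keys():
    print(f"{key}: {type(iris[key]).__name__}")

# Human-readable description of the dataset.
print(iris.DESCR)

# Names of the four features and the three target classes.
print(iris.feature_names)
print(iris.target_names)
```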

#### Task 2 (1 point)
Visualize the data set using ```seaborn```. What type of plot would you use? Why?

### Construct the training and test sets

#### Task 3 (0.5 point)
Load the Iris data set and split it into training and test sets. Use 30% of the data for testing and ```random_state=100``` for reproducibility. Finally, print the shapes of the resulting data sets.
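
A minimal sketch of such a split, assuming the features and labels are taken directly from `load_iris(return_X_y=True)`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 150 samples, this leaves 105 rows for training and 45 rows for testing.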

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

### Train a simple classifier

Now, let's build a simple ML model to see how it performs without any data preprocessing.
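
The training step could look roughly like the sketch below. The notebook does not show the model's parameters, so `max_iter=1000` is an assumption made here to help the solver converge on unscaled features, and `baseline_accuracy` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# max_iter=1000 is an assumed setting, not taken from the notebook.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
baseline_accuracy = model.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```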

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [14]:
prediction = model.predict(X_test)
accuracy = metrics.accuracy_score(y_true=y_test, y_pred=prediction)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9555555555555556


OK, it seems that the model works, but we can do better. Let's try to preprocess the data.

### Data preprocessing

The ```preprocessing``` module of ```scikit-learn``` provides many useful functions for preprocessing data. A brief description of the most important ones can be found in the [official documentation](https://scikit-learn.org/stable/modules/preprocessing.html).

In [15]:
from sklearn import preprocessing

#### Task 4 (1.5 points)
a) Define the transformers for the following tasks:
* Normalization - scales each sample (row) to have unit norm
* Standardization - scales each feature to have zero mean and unit variance
* Non-linear transformation - applies a non-linear transformation to each feature in order to achieve a Gaussian-like distribution
* Higher order feature generation - generates higher order features from the original ones. For example, given two features $x_1$ and $x_2$, the second order features are $x_1^2$, $x_2^2$, and $x_1x_2$.


Normalization (```Normalizer```)

Standardization (```StandardScaler```, ```MinMaxScaler```, ```MaxAbsScaler```, ```RobustScaler```)

Non-linear transformations (```QuantileTransformer``` - with uniform and normal distribution, ```PowerTransformer``` - with Yeo-Johnson and Box-Cox transformations)

Higher order features (```PolynomialFeatures```, ```SplineTransformer```)

b) Define a custom transformer that computes the logarithm of the features. Use ```FunctionTransformer``` from ```sklearn.preprocessing``` together with the ```np.log``` function.

In [30]:
import numpy as np
custom_transformer = preprocessing.FunctionTransformer(np.log, validate=True)
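
To see what such a transformer does, the sketch below applies it to a small hand-made array. `X_demo` and `log_transformer` are illustrative names, and passing `inverse_func=np.exp` is an optional addition that makes the transformation reversible.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.log is only defined for strictly positive inputs.
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp, validate=True)

X_demo = np.array([[1.0, np.e], [np.e**2, 1.0]])

# Element-wise natural logarithm of every feature value.
X_log = log_transformer.fit_transform(X_demo)
print(X_log)

# Applying the inverse recovers the original values.
X_back = log_transformer.inverse_transform(X_log)
print(X_back)
```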

#### Task 5 (3 points)
Apply the previously defined transformers to the data set. Which one gives the best results? Try different parameters and different combinations of transformers.

Hint: Use the previously defined model to compare the results.
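
One way to organize such a comparison is to loop over combinations and chain each one with the classifier using `make_pipeline`. The sketch below covers only a small, assumed subset of scalers and higher order feature transformers; the dictionary names, the `results` list, and `max_iter=1000` are illustrative choices rather than the notebook's own setup.

```python
from itertools import product

from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Classes (not instances) so that every pipeline gets fresh, unfitted transformers.
# FunctionTransformer without a function acts as an identity step.
scalers = {
    "std_scaler": preprocessing.StandardScaler,
    "min_max_scaler": preprocessing.MinMaxScaler,
    "identity": preprocessing.FunctionTransformer,
}
hof_transformers = {
    "poly": preprocessing.PolynomialFeatures,
    "identity": preprocessing.FunctionTransformer,
}

results = []
for (scaler_name, scaler_cls), (hof_name, hof_cls) in product(
    scalers.items(), hof_transformers.items()
):
    pipeline = make_pipeline(scaler_cls(), hof_cls(), LogisticRegression(max_iter=1000))
    pipeline.fit(X_train, y_train)
    results.append((scaler_name, hof_name, pipeline.score(X_test, y_test)))

# Best configurations first.
results.sort(key=lambda item: item[2], reverse=True)
for scaler_name, hof_name, accuracy in results:
    print(f"Scaler: {scaler_name}, HOF Transformer: {hof_name}, Accuracy: {accuracy}")
```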

In [31]:
identity_transformer = preprocessing.FunctionTransformer(validate=True)

In [34]:
for normalizer_name, scaler_name, hof_transformer_name, accuracy in best_configs:
    print(f'Normalizer: {normalizer_name}, Scaler: {scaler_name}, HOF Transformer: {hof_transformer_name}, Accuracy: {accuracy}')


Normalizer: normalizer, Scaler: std_scaler, HOF Transformer: poly, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: std_scaler, HOF Transformer: spline, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: min_max_scaler, HOF Transformer: spline, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: max_abs_scaler, HOF Transformer: spline, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: robust_scaler, HOF Transformer: spline, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: quantile_transformer, HOF Transformer: poly, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: quantile_transformer, HOF Transformer: spline, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: quantile_transformer, HOF Transformer: identity, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: quantile_norm_transformer, HOF Transformer: identity, Accuracy: 0.9777777777777777
Normalizer: normalizer, Scaler: power_bc_transformer, HOF Tran

### Impact of a different model

In [35]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3, random_state=100)
print(f'Non normalized data: {accuracy}')

model = DecisionTreeClassifier(max_depth=3, random_state=100)
print(f'Normalized data: {accuracy}')

Non normalized data: 0.9555555555555556
Normalized data: 1.0
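
The cell above does not show the fitting and evaluation steps. A possible reconstruction is sketched below; it assumes that "normalized" refers to `Normalizer`, which may differ from the preprocessing actually used to obtain the numbers above, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Tree trained on the raw features.
raw_model = DecisionTreeClassifier(max_depth=3, random_state=100)
raw_model.fit(X_train, y_train)
raw_accuracy = raw_model.score(X_test, y_test)

# Same tree, preceded by row-wise normalization (an assumed choice of preprocessing).
normalized_model = make_pipeline(
    Normalizer(), DecisionTreeClassifier(max_depth=3, random_state=100)
)
normalized_model.fit(X_train, y_train)
normalized_accuracy = normalized_model.score(X_test, y_test)

print(f"Non normalized data: {raw_accuracy}")
print(f"Normalized data: {normalized_accuracy}")
```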


#### Task 6 (1 point)
Fill the missing values for the following numpy array using ```SimpleImputer```.

In [43]:
X = np.random.uniform(0, 10, size = (10, 2))
X[np.random.randint(0, 10, size = 5), np.random.randint(0, 2, size = 5)] = np.nan

In [44]:
X

array([[2.66434377, 7.83561972],
       [2.20217791, 8.3599885 ],
       [       nan, 8.40649594],
       [1.85263737, 4.68650596],
       [       nan, 3.09384407],
       [8.49917944,        nan],
       [5.32789179, 9.25247214],
       [2.05985554, 5.13474496],
       [       nan, 8.93111768],
       [       nan, 8.09575778]])

In [45]:
from sklearn.impute import SimpleImputer

[[2.66434377 7.83561972]
 [2.20217791 8.3599885 ]
 [3.76768097 8.40649594]
 [1.85263737 4.68650596]
 [3.76768097 3.09384407]
 [8.49917944 7.0885052 ]
 [5.32789179 9.25247214]
 [2.05985554 5.13474496]
 [3.76768097 8.93111768]
 [3.76768097 8.09575778]]
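
The output above is consistent with mean imputation: every missing value in the first column is replaced by 3.76768097, the mean of the observed values in that column. The sketch below shows the same idea on a small hand-made array; `X_missing` and `X_filled` are illustrative names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_missing)
print(X_filled)
```

Here the first column's gap is filled with 2.0 (the mean of 1.0 and 3.0) and the second column's gap with 3.0 (the mean of 2.0 and 4.0).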


## Pipelines

In [48]:
import pandas as pd
import numpy as np
data = pd.read_csv('https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/daily-bike-share.csv')
data.dtypes

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
rentals         int64
dtype: object

In [49]:
data.head()

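|   | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | rentals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1/1/2011 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 |
| 1 | 2 | 1/2/2011 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 |
| 2 | 3 | 1/3/2011 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 |
| 3 | 4 | 1/4/2011 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 |
| 4 | 5 | 1/5/2011 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 |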


In [50]:
data = data[['season'
             , 'mnth'
             , 'holiday'
             , 'weekday'
             , 'workingday'
             , 'weathersit'
             , 'temp'
             , 'atemp'
             , 'hum'
             , 'windspeed'
             , 'rentals']]

### Task 7 (0.5 point)
Construct a training set and a test set, using 'rentals' as labels. Use 30% of the data for testing and ```random_state=100``` for reproducibility. Finally, print the shapes of the resulting data sets.
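
The split itself could be sketched as follows. To keep the example self-contained, it uses a synthetic stand-in frame with random values instead of downloading the CSV; only the column names and the row count (731, which matches the shapes printed below) are taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

columns = ["season", "mnth", "holiday", "weekday", "workingday",
           "weathersit", "temp", "atemp", "hum", "windspeed", "rentals"]

# Synthetic stand-in for the bike-share data: random values, real column names.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((731, len(columns))), columns=columns)

# 'rentals' is the label; every other column is a feature.
X = data.drop(columns="rentals")
y = data["rentals"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```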

In [52]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(511, 10)
(220, 10)
(511,)
(220,)


### Task 8 (2 points)
Construct a pipeline (```Pipeline``` from ```sklearn.pipeline```) which will perform the following steps:
* Impute missing values
* Scale the data
* Convert categorical features to one-hot encoding

Hint:
1) ['temp', 'atemp', 'hum', 'windspeed'] are numerical features, ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit'] are categorical features.

2) Use ```ColumnTransformer``` from ```sklearn.compose``` to apply different transformers to different columns.

3) Use ```OneHotEncoder``` from ```sklearn.preprocessing``` to convert categorical features to one-hot encoding.

4) As a model use ```LinearRegression``` from ```sklearn.linear_model```
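
The overall structure suggested by these hints might look like the sketch below. It runs on synthetic stand-in data so that it is self-contained; the imputation strategies, the step names, `handle_unknown="ignore"`, and the way the synthetic `rentals` column is generated are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["temp", "atemp", "hum", "windspeed"]
categorical_features = ["season", "mnth", "holiday", "weekday", "workingday", "weathersit"]

# Synthetic stand-in data with the same column names as the bike-share frame.
rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame({
    "season": rng.integers(1, 5, n),
    "mnth": rng.integers(1, 13, n),
    "holiday": rng.integers(0, 2, n),
    "weekday": rng.integers(0, 7, n),
    "workingday": rng.integers(0, 2, n),
    "weathersit": rng.integers(1, 4, n),
    "temp": rng.random(n),
    "atemp": rng.random(n),
    "hum": rng.random(n),
    "windspeed": rng.random(n),
})
# Made-up target so the regression has something to learn.
data["rentals"] = 1000 * data["temp"] + 50 * data["season"] + rng.normal(0, 20, n)

# Numerical columns: fill gaps with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own transformer.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression()),
])

X = data.drop(columns="rentals")
y = data["rentals"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
r2 = r2_score(y_test, predictions)
print(r2)
```

Wrapping the preprocessing and the model in a single `Pipeline` means the imputer, scaler, and encoder are fitted on the training data only and then reused unchanged on the test data.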

In [53]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [54]:
numeric_transformer = 
categorical_transformer = 

In [55]:
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']

categorical_features = ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

preprocessor =

In [56]:
from sklearn.linear_model import LinearRegression
pipeline = 

In [1]:
rf_model = pipeline.fit(X_train, y_train)
print(rf_model)


In [58]:
from sklearn.metrics import r2_score
predictions = rf_model.predict(X_test)
print(r2_score(y_test, predictions))


0.6623170377045995


### Task 9 (5 points) - Contest
Try different combinations of data processing methods. The highest $R^2$ score on the test set wins. The winner gets 5 points, second place gets 3 points, and third place gets 1 point. Deadline: the end of the lab session.