## Scikit-learn

This notebook covers the most basic operations from the ```scikit-learn``` library.

### Explore the data

First, let's take a look at the data. We will use the [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), a classic dataset in machine learning. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The dataset has 4 features: sepal length, sepal width, petal length and petal width. It ships with the ```scikit-learn``` library.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

#### Task 1 (0.5 point)

a) What are the keys of the dataset? What is the type of the data in each key?

b) Print the description of the dataset.

c) Print the feature and target names.

#### Task 2 (1 point)
Visualize the data set using ```seaborn```. What type of plot would you use? Why?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('iris')

### Construct the training and test sets

#### Task 3 (0.5 point)
Load the Iris data set and split it into training and test sets. Use 30% of the data for testing and ```random_state=100``` for reproducibility. Finally, print the shapes of the resulting data sets. Name the variables as follows: ```X_train, X_test, y_train, y_test```.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
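
As a rough illustration of how ```train_test_split``` behaves, here is a minimal sketch on a made-up array rather than the Iris data. The names ```X_toy``` and ```y_toy``` (and the variables derived from them) are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, 2 balanced classes
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)

# Inspect how many rows ended up in each part
print(X_toy_train.shape, X_toy_test.shape)
print(y_toy_train.shape, y_toy_test.shape)
```

For classification data with unbalanced classes, passing ```stratify=y_toy``` is one way to keep the class proportions similar in both parts.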

### Train a simple classifier

Now, let's build a simple ML model to see how it performs without any data preprocessing.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
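import numpy as np

model = LogisticRegression(random_state=100, max_iter=1000)
# Flatten the labels to a 1-D array; works for a pandas Series/DataFrame or a NumPy array
y_train = np.ravel(y_train)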
model.fit(X_train, y_train)

In [None]:
prediction = model.predict(X_test)
accuracy = metrics.accuracy_score(y_true=y_test, y_pred=prediction)
print(f'Accuracy: {accuracy}')

OK, it seems that the model works, but we can do better. Let's try to preprocess the data.

### Data preprocessing

The ```preprocessing``` module from ```scikit-learn``` provides many useful functions for preprocessing data. A brief description of the most important ones can be found in the [official documentation](https://scikit-learn.org/stable/modules/preprocessing.html).

In [None]:
from sklearn import preprocessing

#### Task 4 (1.5 points)
a) Define the transformers for the following tasks:
* Normalization - scales each sample (row) to have unit norm
* Standardization - scales each feature to have zero mean and unit variance
* Non-linear transformation - applies a non-linear transformation to each feature in order to achieve a Gaussian-like distribution
* Higher order features generation - generates higher order features from the original ones. For example, if we have two features $x_1$ and $x_2$, then the second order features will be $x_1^2$, $x_2^2$, $x_1x_2$.


Normalization (```Normalizer```)

Standardization (```StandardScaler```, ```MinMaxScaler```, ```MaxAbsScaler```, ```RobustScaler```)

In [None]:
scalers = {} # Add the newly created scaler to this dictionary

Non-linear transformations (```QuantileTransformer``` - with uniform and normal distribution, ```PowerTransformer``` - with Yeo-Johnson and Box-Cox transformations)

In [None]:
gaussian_transformers = {} # Add the newly created transformer to this dictionary
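
The sketch below shows one possible way to instantiate the non-linear transformers, applied to synthetic skewed data instead of the Iris features. The data and variable names are assumptions made for illustration; the parameter values are just reasonable starting points.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Hypothetical right-skewed, strictly positive data (log-normal draws)
rng = np.random.default_rng(0)
X_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 2))

# Yeo-Johnson accepts any real values
yeo_johnson = PowerTransformer(method="yeo-johnson")
X_yj = yeo_johnson.fit_transform(X_skewed)

# Box-Cox requires strictly positive inputs
box_cox = PowerTransformer(method="box-cox")
X_bc = box_cox.fit_transform(X_skewed)

# n_quantiles should not exceed the number of samples
quantile_normal = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)
X_qn = quantile_normal.fit_transform(X_skewed)

# PowerTransformer standardizes its output by default
print(X_yj.mean(axis=0), X_yj.std(axis=0))
```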

Higher order features (```PolynomialFeatures```, ```SplineTransformer```)

In [None]:
hof_transformers = {} # Add the newly created transformer to this dictionary

b) Define a custom transformer that computes the logarithm of the features. Use ```FunctionTransformer``` from ```sklearn.preprocessing```. You can use the ```np.log``` function.

In [None]:
import numpy as np
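
A minimal sketch of how ```FunctionTransformer``` wraps a plain function, shown on a made-up positive array (```X_positive``` and ```log_transformer``` are hypothetical names). Passing ```inverse_func``` is optional; it is included here only to show the round trip.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap np.log so it can be used like any other scikit-learn transformer
log_transformer = FunctionTransformer(func=np.log, inverse_func=np.exp)

# Hypothetical strictly positive data; np.log is undefined for values <= 0
X_positive = np.array([[1.0, 10.0], [np.e, 100.0]])

X_logged = log_transformer.fit_transform(X_positive)
X_restored = log_transformer.inverse_transform(X_logged)

print(X_logged)
print(X_restored)
```

If the data may contain zeros, ```np.log1p``` is a common alternative to ```np.log```.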

#### Task 5 (3 points)
Apply different previously defined transformers to the data set. Which one gives the best results? Try to use different parameters and different combinations of transformers.

Hint: Use the previously defined model to compare the results.

In [None]:
identity_transformer = preprocessing.FunctionTransformer(validate=True)

In [None]:
# Define X_train_preprocessed and X_test_preprocessed by applying different combinations of transformers to X_train and X_test

...

# Then run the following:
model = LogisticRegression(random_state=100, max_iter=1000)
model.fit(X_train_preprocessed, y_train)
prediction = model.predict(X_test_preprocessed)
accuracy = metrics.accuracy_score(y_true=y_test, y_pred=prediction)
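
One possible way to combine transformers is ```make_pipeline``` from ```sklearn.pipeline```. The sketch below is only an illustration: it loads its own copy of the Iris data under hypothetical names (```X_all```, ```X_tr```, ```X_te```, ...) so that it does not overwrite the notebook's variables, and the chosen combination is arbitrary. The key point is that the transformers are fitted on the training part only and then applied to the test part.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Separate copy of the data so the notebook's own variables stay untouched
X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=100
)

# Chain two transformers: standardize first, then add degree-2 features
combo = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
)

# Fit on the training data only, then reuse the fitted steps on the test data
X_tr_pre = combo.fit_transform(X_tr)
X_te_pre = combo.transform(X_te)

print(X_tr_pre.shape, X_te_pre.shape)
```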

### Impact of a different model

For comparison, you can also try a different model. How does it perform on the same data?

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3, random_state=100)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
accuracy = metrics.accuracy_score(y_true=y_test, y_pred=prediction)
print(f'Non-normalized data: {accuracy}')

model = DecisionTreeClassifier(max_depth=3, random_state=100)
# `normalizer` is assumed to be the Normalizer instance you defined in Task 4
X_train_transformed = normalizer.fit_transform(X_train)
X_test_transformed = normalizer.transform(X_test)
model.fit(X_train_transformed, y_train)
prediction = model.predict(X_test_transformed)
accuracy = metrics.accuracy_score(y_true=y_test, y_pred=prediction)
print(f'Normalized data: {accuracy}')

#### Task 6 (1 point)
Fill in the missing values in the following NumPy array using ```SimpleImputer```.

In [None]:
X = np.random.uniform(0, 10, size = (10, 2))
X[np.random.randint(0, 10, size = 5), np.random.randint(0, 2, size = 5)] = np.nan

In [None]:
X

In [None]:
from sklearn.impute import SimpleImputer
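
As a small illustration of what ```SimpleImputer``` does, here is a sketch on a fixed, made-up array (```X_missing``` is a hypothetical name) so that the result is easy to check by hand. The ```mean``` strategy is just one option; ```median```, ```most_frequent``` and ```constant``` are also available.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical array with one missing value in each column
X_missing = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 6.0],
])

# Replace each NaN with the mean of the observed values in its column
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = mean_imputer.fit_transform(X_missing)

# The learned per-column fill values are stored in statistics_
print(mean_imputer.statistics_)
print(X_filled)
```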

## Pipelines

Let's read some data first.

In [None]:
import pandas as pd
import numpy as np
data = pd.read_csv('https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/daily-bike-share.csv')
data.dtypes

In [None]:
data.head()

In [None]:
data = data[['season',
             'mnth',
             'holiday',
             'weekday',
             'workingday',
             'weathersit',
             'temp',
             'atemp',
             'hum',
             'windspeed',
             'rentals']]

### Task 7 (0.5 point)
Construct a training set and a test set, using 'rentals' as the labels. Use 30% of the data for testing and ```random_state=100``` for reproducibility. Finally, print the shapes of the resulting data sets.

### Task 8 (2 points)
Construct a pipeline (```Pipeline``` from ```sklearn.pipeline```) which will perform the following steps:
* Impute missing values
* Scale the data
* Convert categorical features to one-hot encoding

Hint:
1) ['temp', 'atemp', 'hum', 'windspeed'] are numerical features; ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit'] are categorical features.

2) Use ```ColumnTransformer``` from ```sklearn.compose``` to apply different transformers to different columns.

3) Use ```OneHotEncoder``` from ```sklearn.preprocessing``` to convert categorical features to one-hot encoding.

4) As a model, use ```LinearRegression``` from ```sklearn.linear_model```.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
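
The hints above can be sketched on a tiny made-up table. Everything in this example is an assumption for illustration: the ```toy``` DataFrame, its values, the step names and the chosen imputation strategies. It is not the bike-share data and not the only valid design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table: two numerical columns and one categorical column
toy = pd.DataFrame({
    "temp": [0.2, 0.5, np.nan, 0.8, 0.4, 0.6],
    "hum": [0.7, 0.6, 0.5, np.nan, 0.8, 0.4],
    "season": [1, 2, 3, 4, 1, 2],
})
toy_target = np.array([10.0, 25.0, 30.0, 40.0, 15.0, 28.0])

# Numerical columns: impute missing values, then scale
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute missing values, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own sub-pipeline
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["temp", "hum"]),
    ("cat", categorical_steps, ["season"]),
])

# Preprocessing and model in a single estimator
toy_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LinearRegression()),
])

toy_pipeline.fit(toy, toy_target)
toy_predictions = toy_pipeline.predict(toy)
print(toy_predictions.shape)
```

Keeping the preprocessing inside the pipeline means that calling ```fit``` learns the imputation values, scaling parameters and categories from the training data only, and ```predict``` reuses them on new data.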

### Task 9 - Contest (5 points)*
*Rules will be announced during the lab