The first part of the laboratories was conducted by <a href="https://github.com/makskliczkowski">Maksymilian Kliczkowski</a>. He created the notebooks we used in the laboratories and helped me and my friends understand both basic and advanced topics in machine learning. I am really glad that I had him as a tutor.

In the first ML/AI course we built ML models from the ground up using math.

In this notebook I present only some of the homework tasks from the notebook about scikit-learn.

### Explore the data

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

#### Task 1
a) What are the keys of the dataset? What is the type of the data in each key?

In [None]:
for key, value in iris.items():
  print(key, type(value))

b) Print the description of the dataset. Use DESCR property.

In [None]:
print(iris.DESCR)

c) Print the feature and target names.

In [None]:
print(iris.feature_names, iris.target_names)

#### Task 2
Visualize the data set using ```seaborn```. What type of plot would you use? Why?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('iris')
print(df.head())

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(df, hue ="species",
              height = 6).map(plt.scatter,
                              'sepal_length',
                              'petal_length').add_legend()
plt.show()

### Construct the training and test sets

#### Task 3

Load the iris data set. Split it into training and test sets. Use 30% of the data for testing. Use ```your index number``` for reproducibility. Finally print the shape of resulting data sets. Use train_test_split function to do this.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
iris    = load_iris()
X       = iris.data
y       = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=253880)

In [None]:
# Exploring the training and testing datasets
print("X_train's shape is   :", X_train.shape)
print("X_test's shape is    :", X_test.shape)
print("y_train's shape is   :", y_train.shape)
print("y_test's shape is    :", y_test.shape)
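The split above is purely random. As a hedged variation (not part of the original task), the sketch below adds the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both subsets. The names `X_tr`, `X_te`, `y_tr` and `y_te` are chosen here only to avoid overwriting the notebook's variables.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# stratify=y keeps the class proportions of y in both the training and the test subset
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=253880, stratify=y
)

print("train shape:", X_tr.shape, "test shape:", X_te.shape)
print("test class counts:", Counter(y_te.tolist()))
```

With three equally sized classes, each class should contribute the same number of samples to the test set.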

### Train a simple classifier

Now, let's build a simple ML model to verify how it works without any data processing

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Let's use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">logistic regression</a> model first.

In [None]:
model   = LogisticRegression()
y_train = y_train.ravel()
model.fit(X_train, y_train)

Create an accuracy variable using accuracy_score from <a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics">sklearn metrics</a>.

In [None]:
prediction  = model.predict(X_test)
accuracy    =  metrics.accuracy_score(y_test, prediction)
print(f'Accuracy: {accuracy}')

OK, it seems that the model works, but we can do better. Let's try to preprocess the data.

### Data preprocessing

The preprocessing module from ```scikit-learn``` provides a lot of useful functions to preprocess data. A brief description of the most important functions can be found in the [official documentation](https://scikit-learn.org/stable/modules/preprocessing.html).

In [None]:
from sklearn import preprocessing

#### Task 4
a) Define the transformers for the following tasks:
* Normalization                     - scales each sample (row) to have unit norm
* Standardization                   - scales each feature to have zero mean and unit variance
* Non-linear transformation         - applies a non-linear transformation to each feature in order to achieve a Gaussian-like distribution
* Higher order features generation  - generates higher order features from the original ones. For example, if we have two features $x_1$ and $x_2$, then the second order features will be $x_1^2$, $x_2^2$, $x_1x_2$.

Use dictionaries to store the transformers.
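To make the difference between normalization and standardization concrete, here is a small sketch on a made-up array (`X_demo` is an assumption for illustration, not data from the laboratory). It checks that ```Normalizer``` acts on rows while ```StandardScaler``` acts on columns.

```python
import numpy as np
from sklearn import preprocessing

# Made-up data: 3 samples (rows) with 3 features (columns)
X_demo = np.array([[1.0, 2.0, 2.0],
                   [4.0, 0.0, 3.0],
                   [0.0, 6.0, 8.0]])

# Normalizer works row by row: every sample is rescaled to unit L2 norm
X_norm = preprocessing.Normalizer(norm='l2').fit_transform(X_demo)
print("row norms:", np.linalg.norm(X_norm, axis=1))

# StandardScaler works column by column: zero mean and unit variance per feature
X_std = preprocessing.StandardScaler().fit_transform(X_demo)
print("column means:", X_std.mean(axis=0))
print("column stds :", X_std.std(axis=0))
```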

Normalization (```Normalizer```)

In [None]:
normalizer = {}
normalizer['l1']  = preprocessing.Normalizer(norm='l1')
normalizer['l2']  = preprocessing.Normalizer(norm='l2')
normalizer['max']  = preprocessing.Normalizer(norm='max')

Standardization (```StandardScaler```, ```MinMaxScaler```, ```MaxAbsScaler```, ```RobustScaler```)

In [None]:
scalers     = {}
scalers['std_scaler']       = preprocessing.StandardScaler()
scalers['min_max_scaler']   = preprocessing.MinMaxScaler()
scalers['max_abs_scaler']   = preprocessing.MaxAbsScaler()
scalers['robust_scaler']    = preprocessing.RobustScaler()

Non-linear transformations (```QuantileTransformer``` - with uniform and normal distribution, ```PowerTransformer``` - with Yeo-Johnson and Box-Cox transformations)

In [None]:
gaussian_transformers = {}
gaussian_transformers['quantile_transformer']       = preprocessing.QuantileTransformer(n_quantiles=105)
gaussian_transformers['quantile_norm_transformer']  = preprocessing.QuantileTransformer(output_distribution='normal', n_quantiles=105)
gaussian_transformers['power_bc_transformer']       = preprocessing.PowerTransformer(method="box-cox")
gaussian_transformers['power_yj_transformer']       = preprocessing.PowerTransformer(method="yeo-johnson")

Higher order features (```PolynomialFeatures```, ```SplineTransformer```)

In [None]:
hof_transformers = {}
hof_transformers['poly']    = preprocessing.PolynomialFeatures()
hof_transformers['spline']  = preprocessing.SplineTransformer()

b) Define a custom transformer which will calculate the logarithm of the features. Use ```FunctionTransformer``` from ```sklearn.preprocessing```. You can use the ```np.log``` function.

In [None]:
import numpy as np
custom_transformer          = preprocessing.FunctionTransformer(np.log)

#### Task 5
Apply different previously defined transformers to the data set. Which one gives the best results? Try to use different parameters and different combinations of transformers.

Hint: Use the previously defined model to compare the results. Create a loop over different methods

In [None]:
iris    = load_iris()
X       = iris.data
y       = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=253880)
y_train = y_train.ravel()
best_accuracy   = 0
best_configs    = []
for g in gaussian_transformers:
  for n in normalizer:
    for s in scalers:
      for h in hof_transformers:
        # Fit each transformer on the training data only, then reuse the fitted transformers on the test data
        data = hof_transformers[h].fit_transform(scalers[s].fit_transform(normalizer[n].fit_transform(gaussian_transformers[g].fit_transform(X_train))))
        test = hof_transformers[h].transform(scalers[s].transform(normalizer[n].transform(gaussian_transformers[g].transform(X_test))))
        model.fit(data, y_train)
        prediction  = model.predict(test)
        accuracy    =  metrics.accuracy_score(y_test, prediction)
        best_configs.append([n,s,h,g, accuracy])


In [None]:
for normalizer_name, scaler_name, hof_transformer_name, gauss, accuracy in best_configs:
    print(f'Normalizer: {normalizer_name}, Scaler: {scaler_name}, HOF Transformer: {hof_transformer_name}, Gaussian: {gauss}, Accuracy: {accuracy}')
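Chaining transformers by hand makes it easy to fit something on the test data by accident. One way to avoid this, sketched here with an arbitrarily chosen combination of steps, is to put the transformers and the model into a single pipeline built with `make_pipeline` and to compare configurations with `cross_val_score` on the training data. The variable names and `max_iter=1000` are assumptions of this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=253880)

# Every step is fitted on training data only; predict/score only call transform on new data
clf = make_pipeline(StandardScaler(),
                    PolynomialFeatures(degree=2),
                    LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training set, so the test set stays untouched during selection
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5)
print("mean CV accuracy:", cv_scores.mean())

clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```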

### Different model impact

In [None]:
from sklearn.tree import DecisionTreeClassifier
model               = DecisionTreeClassifier(max_depth=3, random_state=100)
model.fit(X_train, y_train)
prediction          = model.predict(X_test)
accuracy            = metrics.accuracy_score(y_true=y_test, y_pred=prediction)
print(f'Non normalized data: {accuracy}')
row_normalizer      = preprocessing.Normalizer()  # separate name, so the `normalizer` dictionary from Task 4 is not overwritten
model               = DecisionTreeClassifier(max_depth=3, random_state=100)
X_train_transformed = row_normalizer.fit_transform(X_train)
X_test_transformed  = row_normalizer.transform(X_test)
model.fit(X_train_transformed, y_train)
prediction          = model.predict(X_test_transformed)
accuracy            = metrics.accuracy_score(y_true=y_test, y_pred=prediction)
print(f'Normalized data: {accuracy}')

#### Task 6
Fill the missing values for the following numpy array using ```SimpleImputer```.

In [None]:
X = np.random.uniform(0, 10, size = (10, 2))
X[np.random.randint(0, 10, size = 5), np.random.randint(0, 2, size = 5)] = np.nan

In [None]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(X)

print(imp.transform(X))
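Because the array above is random, the effect of the imputation strategy is hard to read off. The sketch below uses a small fixed array (`X_demo`, made up for illustration) to compare the `mean` and `median` strategies of ```SimpleImputer```.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value per column
X_demo = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [2.0, np.nan],
                   [9.0, 9.0]])

# Each missing entry is replaced by a statistic of the observed values in its column
X_mean = SimpleImputer(strategy='mean').fit_transform(X_demo)
X_median = SimpleImputer(strategy='median').fit_transform(X_demo)

print("mean strategy:\n", X_mean)
print("median strategy:\n", X_median)
```

In the first column the observed values are 1, 2 and 9, so the mean strategy fills in 4 while the median strategy fills in 2, which is less affected by the large value.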

## <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">Pipelines</a> <- go on, read more if you want

In [None]:
import pandas as pd
import numpy as np
data = pd.read_csv('https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/daily-bike-share.csv')
data.dtypes

In [None]:
data.head()

In [None]:
data = data[['season'
            , 'yr'
             , 'mnth'
             , 'holiday'
             , 'weekday'
             , 'workingday'
             , 'weathersit'
             , 'temp'
             , 'atemp'
             , 'hum'
             , 'windspeed'
             , 'rentals']]
data

### Task 7
Construct a training and test set, using 'rentals' as labels. Use 30% of the data for testing. Use ```your index``` for reproducibility. Finally print the shape of resulting data sets.

In [None]:
X = data[['season'
            , 'yr'
             , 'mnth'
             , 'holiday'
             , 'weekday'
             , 'workingday'
             , 'weathersit'
             , 'temp'
             , 'atemp'
             , 'hum'
             , 'windspeed']]
y = data['rentals']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=253880)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Task 8
Construct a pipeline (```Pipeline``` from ```sklearn.pipeline```) which will perform the following steps:
* Impute missing values
* Scale the data
* Convert categorical features to one-hot encoding

Hint:
1) ['temp', 'atemp', 'hum', 'windspeed'] are numerical features, ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit'] are categorical features.

2) Use ```ColumnTransformer``` from ```sklearn.compose``` to apply different transformers to different columns.

3) Use ```OneHotEncoder``` from ```sklearn.preprocessing``` to convert categorical features to one-hot encoding. Why do we do this?

4) As a model use ```LinearRegression``` from ```sklearn.linear_model```
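Regarding the question in hint 3: integer codes such as season = 1, 2, 3, 4 would be read by a linear model as ordered quantities, and one-hot encoding replaces them with independent indicator columns. A minimal sketch on made-up season codes (the `season` array is an assumption, not taken from the bike-share file):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with the codes 1, 2 and 4
season = np.array([[1], [2], [4], [2]])

# handle_unknown='ignore' encodes categories unseen during fit as a row of zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(season).toarray()

print(encoder.categories_)
print(encoded)

# The code 3 was not present during fit, so it maps to all zeros
unseen = encoder.transform(np.array([[3]])).toarray()
print(unseen)
```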

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

In [None]:
numeric_features     =  ['temp', 'atemp', 'hum', 'windspeed']

categorical_features =  ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

preprocessor         = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [None]:
from sklearn.linear_model import LinearRegression
# 'rentals' is a numeric target, so the final step is a regressor, as Task 8 asks
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', LinearRegression())])

In [None]:
fitted_pipeline = pipeline.fit(X_train, y_train)
print(fitted_pipeline)

In [None]:
from sklearn.metrics import r2_score
predictions = fitted_pipeline.predict(X_test)
print(r2_score(y_test, predictions))

## WARNING! TRAGIC IMPLEMENTATION 
### Task 9


Try different combinations of data processing methods. The highest accuracy wins. The winner gets additional 3 points. Second person gets 2 points. Third result is awarded with 1 point.

In [None]:
iris    = load_iris()
X       = iris.data
y       = iris.target
testSize = np.linspace(0.05, 0.5,46)
best_configs    = []
for i in testSize:
  print(i)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=i, random_state=253880)
  y_train = y_train.ravel()


  for g in gaussian_transformers:
    for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = hof_transformers[h].fit_transform(scalers[s].fit_transform(normalizer[n].fit_transform(gaussian_transformers[g].fit_transform(X_train))))
          test = hof_transformers[h].transform(scalers[s].transform(normalizer[n].transform(gaussian_transformers[g].transform(X_test))))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([n,s,h,g, accuracy, i])
  for g in gaussian_transformers:
          data = gaussian_transformers[g].fit_transform(X_train)
          test = gaussian_transformers[g].transform(X_test)
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([None, None, None, g, accuracy, i])
  for n in normalizer:
          data = normalizer[n].fit_transform(X_train)
          test = normalizer[n].transform(X_test)
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([n, None, None, None, accuracy, i])
  for s in scalers:
          data = scalers[s].fit_transform(X_train)
          test = scalers[s].transform(X_test)
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([None, s, None, None, accuracy, i])
  for h in hof_transformers:
          data = hof_transformers[h].fit_transform(X_train)
          test = hof_transformers[h].transform(X_test)
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([None, None, h, None, accuracy, i])
  for g in gaussian_transformers:
    for n in normalizer:
      data = normalizer[n].fit_transform(gaussian_transformers[g].fit_transform(X_train))
      test = normalizer[n].transform(gaussian_transformers[g].transform(X_test))
      model.fit(data, y_train)
      prediction  = model.predict(test)
      accuracy    =  metrics.accuracy_score(y_test, prediction)
      best_configs.append([n,None,None,g, accuracy, i])
  for g in gaussian_transformers:
    for s in scalers:
      data = scalers[s].fit_transform(gaussian_transformers[g].fit_transform(X_train))
      test = scalers[s].transform(gaussian_transformers[g].transform(X_test))
      model.fit(data, y_train)
      prediction  = model.predict(test)
      accuracy    =  metrics.accuracy_score(y_test, prediction)
      best_configs.append([None,s,None,g, accuracy, i])
  for g in gaussian_transformers:
    for h in hof_transformers:
      data = hof_transformers[h].fit_transform(gaussian_transformers[g].fit_transform(X_train))
      test = hof_transformers[h].transform(gaussian_transformers[g].transform(X_test))
      model.fit(data, y_train)
      prediction  = model.predict(test)
      accuracy    =  metrics.accuracy_score(y_test, prediction)
      best_configs.append([None,None,h,g, accuracy, i])
  for n in normalizer:
    for s in scalers:
          data = scalers[s].fit_transform(normalizer[n].fit_transform(X_train))
          test = scalers[s].transform(normalizer[n].transform(X_test))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([n,"first "+ s,None, None, accuracy, i])
  for n in normalizer:
    for s in scalers:
          data = normalizer[n].fit_transform(scalers[s].fit_transform(X_train))
          test = normalizer[n].transform(scalers[s].transform(X_test))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["first "+n,s,None, None, accuracy, i])
  for n in normalizer:
    for h in hof_transformers:
      data = hof_transformers[h].fit_transform(normalizer[n].fit_transform(X_train))
      test = hof_transformers[h].transform(normalizer[n].transform(X_test))
      model.fit(data, y_train)
      prediction  = model.predict(test)
      accuracy    =  metrics.accuracy_score(y_test, prediction)
      best_configs.append([n, None,"first " + h, None, accuracy, i])
  for n in normalizer:
    for h in hof_transformers:
      data = normalizer[n].fit_transform(hof_transformers[h].fit_transform(X_train))
      test = normalizer[n].transform(hof_transformers[h].transform(X_test))
      model.fit(data, y_train)
      prediction  = model.predict(test)
      accuracy    =  metrics.accuracy_score(y_test, prediction)
      best_configs.append(["first " + n, None, h, None, accuracy, i])
  for s in scalers:
    for h in hof_transformers:
      data = hof_transformers[h].fit_transform(scalers[s].fit_transform(X_train))
      test = hof_transformers[h].transform(scalers[s].transform(X_test))
      model.fit(data, y_train)
      prediction  = model.predict(test)
      accuracy    =  metrics.accuracy_score(y_test, prediction)
      best_configs.append([None,s,"first " +h,None, accuracy, i])
  for s in scalers:
    for h in hof_transformers:
      data = scalers[s].fit_transform(hof_transformers[h].fit_transform(X_train))
      test = scalers[s].transform(hof_transformers[h].transform(X_test))
      model.fit(data, y_train)
      prediction  = model.predict(test)
      accuracy    =  metrics.accuracy_score(y_test, prediction)
      best_configs.append([None,"first " +s,h,None, accuracy, i])
  for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = hof_transformers[h].fit_transform(scalers[s].fit_transform(normalizer[n].fit_transform(X_train)))
          test = hof_transformers[h].transform(scalers[s].transform(normalizer[n].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([n,"second "+s,"first "+h, None, accuracy, i])
  for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = hof_transformers[h].fit_transform(normalizer[n].fit_transform(scalers[s].fit_transform(X_train)))
          test = hof_transformers[h].transform(normalizer[n].transform(scalers[s].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["second "+n,s,"first "+h, None, accuracy, i])
  for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = scalers[s].fit_transform(normalizer[n].fit_transform(hof_transformers[h].fit_transform(X_train)))
          test = scalers[s].transform(normalizer[n].transform(hof_transformers[h].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["second "+n,"first "+s,h, None, accuracy, i])
  for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = scalers[s].fit_transform(hof_transformers[h].fit_transform(normalizer[n].fit_transform(X_train)))
          test = scalers[s].transform(hof_transformers[h].transform(normalizer[n].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([n,"first "+s,"second "+h, None, accuracy, i])
  for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = normalizer[n].fit_transform(hof_transformers[h].fit_transform(scalers[s].fit_transform(X_train)))
          test = normalizer[n].transform(hof_transformers[h].transform(scalers[s].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["first "+n,s,"second "+h, None, accuracy, i])
  for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = normalizer[n].fit_transform(scalers[s].fit_transform(hof_transformers[h].fit_transform(X_train)))
          test = normalizer[n].transform(scalers[s].transform(hof_transformers[h].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["first "+n,"second "+s,h, None, accuracy, i])
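  # NOTE: g is stored in the results of the blocks below, but gaussian_transformers[g] is never applied in these chains
  for g in gaussian_transformers: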
    for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = hof_transformers[h].fit_transform(scalers[s].fit_transform(normalizer[n].fit_transform(X_train)))
          test = hof_transformers[h].transform(scalers[s].transform(normalizer[n].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([n,"second "+s,"first "+h, g, accuracy, i])
    for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = hof_transformers[h].fit_transform(normalizer[n].fit_transform(scalers[s].fit_transform(X_train)))
          test = hof_transformers[h].transform(normalizer[n].transform(scalers[s].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["second "+n,s,"first "+h, g, accuracy, i])
    for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = scalers[s].fit_transform(normalizer[n].fit_transform(hof_transformers[h].fit_transform(X_train)))
          test = scalers[s].transform(normalizer[n].transform(hof_transformers[h].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["second "+n,"first "+s,h, g, accuracy, i])
    for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = scalers[s].fit_transform(hof_transformers[h].fit_transform(normalizer[n].fit_transform(X_train)))
          test = scalers[s].transform(hof_transformers[h].transform(normalizer[n].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append([n,"first "+s,"second "+h, g, accuracy, i])
    for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = normalizer[n].fit_transform(hof_transformers[h].fit_transform(scalers[s].fit_transform(X_train)))
          test = normalizer[n].transform(hof_transformers[h].transform(scalers[s].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["first "+n,s,"second "+h, g, accuracy, i])
    for n in normalizer:
      for s in scalers:
        for h in hof_transformers:
          data = normalizer[n].fit_transform(scalers[s].fit_transform(hof_transformers[h].fit_transform(X_train)))
          test = normalizer[n].transform(scalers[s].transform(hof_transformers[h].transform(X_test)))
          model.fit(data, y_train)
          prediction  = model.predict(test)
          accuracy    =  metrics.accuracy_score(y_test, prediction)
          best_configs.append(["first "+n,"second "+s,h, g, accuracy, i])

In [None]:
pandas_dataframe = pd.DataFrame(best_configs, columns=["Normalizer", "Scaler", "Hof transformer", "Gaussian", "Accuracy", "testSize"])

pandas_dataframe.loc[pandas_dataframe['Accuracy']==pandas_dataframe['Accuracy'].max()]
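The search above spells out every ordering by hand. A much shorter way to enumerate orderings, shown only as a sketch and not as the solution that was submitted, combines `itertools.permutations` with `make_pipeline` and scores each candidate by 5-fold cross-validation instead of a single split. The reduced `candidates` dictionary, the names used as keys and `max_iter=2000` are assumptions made to keep the example small.

```python
from itertools import permutations

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, PolynomialFeatures, StandardScaler

X, y = load_iris(return_X_y=True)

# A deliberately small set of candidate steps (hypothetical selection for this sketch)
candidates = {
    'l2_normalizer': Normalizer(norm='l2'),
    'std_scaler': StandardScaler(),
    'poly': PolynomialFeatures(degree=2),
}

rows = []
for n_steps in range(0, 3):                      # use 0, 1 or 2 preprocessing steps
    for names in permutations(candidates, n_steps):
        steps = [candidates[name] for name in names]
        clf = make_pipeline(*steps, LogisticRegression(max_iter=2000))
        # cross_val_score clones the pipeline, so each fold is fitted from scratch
        score = cross_val_score(clf, X, y, cv=5).mean()
        rows.append({'steps': ' -> '.join(names) or 'none', 'cv_accuracy': score})

results = pd.DataFrame(rows).sort_values('cv_accuracy', ascending=False)
print(results)
```

Because the pipeline refits its steps inside every fold, no transformer ever sees the held-out samples, and the order of the steps is recorded explicitly in the `steps` column.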