# Day 09. Exercise 04
# Pipelines and OOP

## 0. Imports

In [66]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn import svm, tree, ensemble  
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm
import joblib


## 1. Preprocessing pipeline

Create three custom transformers, the first two of which will be used within a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

1. `FeatureExtractor()` class:
 - Takes a dataframe with `uid`, `labname`, `numTrials`, `timestamp` from the file [`checker_submits.csv`](https://drive.google.com/file/d/14voc4fNJZiLEFaZyd8nEG-lQt5JjatYw/view?usp=sharing).
 - Extracts `hour` from `timestamp`.
 - Extracts `weekday` from `timestamp` (numbers).
 - Drops the `timestamp` column.
 - Returns the new dataframe.


2. `MyOneHotEncoder()` class:
 - Takes the dataframe from the result of the previous transformation and the name of the target column.
 - Identifies all the categorical features and transforms them with `OneHotEncoder()`. If the target column is categorical too, then the transformation should not apply to it.
 - Drops the initial categorical features.
 - Returns the dataframe with the features and the series with the target column.


3. `TrainValidationTest()` class:
 - Takes `X` and `y`.
 - Returns `X_train`, `X_valid`, `X_test`, `y_train`, `y_valid`, `y_test` (`test_size=0.2`, `random_state=21`, `stratified`).


**Pipeline** is a scikit-learn mechanism for applying several processing steps in sequence, such as feature transformations followed by model training. It simplifies data processing and keeps the code compact and readable. Each intermediate step in a Pipeline implements `fit` and `transform`; the final step only needs `fit`.
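
As a minimal sketch of that contract, the toy pipeline below chains two transformers and passes the output of the first `transform` into the second. The `AddOne` and `Double` classes are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddOne(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds 1 to every value."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X + 1


class Double(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: multiplies every value by 2."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X * 2


toy_pipeline = Pipeline([("add_one", AddOne()), ("double", Double())])
result = toy_pipeline.fit_transform(np.array([[1, 2], [3, 4]]))
print(result)  # every element becomes (x + 1) * 2
```

Inheriting from `BaseEstimator` and `TransformerMixin` is optional for a toy case like this, but it supplies `get_params()` and `fit_transform()` so the classes behave like regular scikit-learn components.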

In [67]:
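class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self
    
    def transform(self, X, y=None):
        X = X.copy()  # do not modify the caller's dataframe
        timestamps = pd.to_datetime(X['timestamp'])
        X['hour'] = timestamps.dt.hour
        X['dayofweek'] = timestamps.dt.weekday  # Monday = 0, ..., Sunday = 6
        return X.drop(columns=['timestamp'])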
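
To illustrate what the two datetime accessors return, here is a small check on the first timestamp of `checker_submits.csv`. pandas numbers weekdays from Monday = 0, so a Friday comes out as 4. The one-row `sample` dataframe is made up for this illustration.

```python
import pandas as pd

# One-row stand-in for the real dataset.
sample = pd.DataFrame({"timestamp": ["2020-04-17 05:19:02.744528"]})
timestamps = pd.to_datetime(sample["timestamp"])

hour = timestamps.dt.hour.iloc[0]
weekday = timestamps.dt.weekday.iloc[0]  # Monday = 0, ..., Sunday = 6
print(hour, weekday)
```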


In [68]:
class MyOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, target_column):
        self.target_column = target_column
    
    def fit(self, X, y=None):
        # Categorical features are the object-dtype columns, excluding the target column.
        categorical_cols = X.select_dtypes(include='object').columns
        self.categorical_cols_ = [col for col in categorical_cols if col != self.target_column]
        self.encoder_ = OneHotEncoder(sparse_output=False, drop='first').fit(X[self.categorical_cols_])
        return self
    
    def transform(self, X, y=None):
        X = X.copy()
        # Reuse the columns found in fit() so that fit and transform always agree.
        encoded_features = self.encoder_.transform(X[self.categorical_cols_])
        encoded_df = pd.DataFrame(
            encoded_features,
            columns=self.encoder_.get_feature_names_out(self.categorical_cols_),
            index=X.index,  # keep rows aligned when concatenating
        )
        
        # Replace the original categorical columns with their one-hot encoded versions.
        return pd.concat([X.drop(columns=self.categorical_cols_), encoded_df], axis=1)


In [69]:
class TrainValidationTest:
    def __init__(self, test_size=0.2, random_state=21):
        self.test_size = test_size
        self.random_state = random_state
    
    def split(self, X, y):
        # Step 1: hold out `test_size` of the data, stratified by the target.
        X_train, X_temp, y_train, y_temp = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state, stratify=y
        )
        # Step 2: split the hold-out evenly into validation and test sets.
        X_valid, X_test, y_valid, y_test = train_test_split(
            X_temp, y_temp, test_size=0.5, random_state=self.random_state, stratify=y_temp
        )
        return X_train, X_valid, X_test, y_train, y_valid, y_test
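
As a quick sanity check of the proportions, the sketch below repeats the same two-stage split on synthetic data rather than on the exercise dataset. Holding out 20% and then halving the hold-out gives an 80/10/10 split. The arrays `X_demo` and `y_demo` are invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, two balanced classes.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Same two-stage split as in TrainValidationTest.split().
X_train, X_temp, y_train, y_temp = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=21, stratify=y_temp
)

print(len(X_train), len(X_valid), len(X_test))
```

Because both calls are stratified, each subset keeps the 50/50 class balance of `y_demo`.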


## 2. Model selection pipeline

`ModelSelection()` class

 - Takes a list of `GridSearchCV` instances and a dict whose keys are the indexes in that list and whose values are the names of the models. The example below goes in reverse order, from the high-level call down to the low-level details:

```
ModelSelection(grids, grid_dict)

grids = [gs_svm, gs_tree, gs_rf]

gs_svm = GridSearchCV(estimator=svm, param_grid=svm_params, scoring='accuracy', cv=2, n_jobs=jobs), where jobs you can specify by yourself

svm_params = [{'kernel':('linear', 'rbf', 'sigmoid'), 'C':[0.01, 0.1, 1, 1.5, 5, 10], 'gamma': ['scale', 'auto'], 'class_weight':('balanced', None), 'random_state':[21], 'probability':[True]}]
```

 - Method `choose()` takes `X_train`, `y_train`, `X_valid`, `y_valid` and returns the name of the best classifier among all the models on the validation set
 - Method `best_results()` returns a dataframe with the columns `model`, `params`, `valid_score` where the rows are the best models within each class of models.

```
model	params	valid_score
0	SVM	{'C': 10, 'class_weight': None, 'gamma': 'auto...	0.772727
1	Decision Tree	{'class_weight': 'balanced', 'criterion': 'gin...	0.801484
2	Random Forest	{'class_weight': None, 'criterion': 'entropy',...	0.855288
```

 - When you iterate through the parameters of a model class, print the name of that class and show the progress using `tqdm.notebook`. At the end of the loop, print the best model of that class.

```
Estimator: SVM
100%
125/125 [01:32<00:00, 1.36it/s]
Best params: {'C': 10, 'class_weight': None, 'gamma': 'auto', 'kernel': 'rbf', 'probability': True, 'random_state': 21}
Best training accuracy: 0.773
Validation set accuracy score for best params: 0.878 

Estimator: Decision Tree
100%
57/57 [01:07<00:00, 1.22it/s]
Best params: {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 21, 'random_state': 21}
Best training accuracy: 0.801
Validation set accuracy score for best params: 0.867 

Estimator: Random Forest
100%
284/284 [06:47<00:00, 1.13s/it]
Best params: {'class_weight': None, 'criterion': 'entropy', 'max_depth': 22, 'n_estimators': 50, 'random_state': 21}
Best training accuracy: 0.855
Validation set accuracy score for best params: 0.907 

Classifier with best validation set accuracy: Random Forest
```
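
One possible way to get a per-combination loop that `tqdm` can wrap is to walk over `ParameterGrid` and score each combination with cross-validation. This is only a sketch under assumptions: the iris dataset and the small `demo_params` grid are stand-ins, not the exercise's data or grids. The loop is shown without `tqdm` so that it also runs outside a notebook; wrapping the iterable as `tqdm(ParameterGrid(demo_params))` would add the progress bar.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Stand-in grid; the exercise grids are larger.
demo_params = {"max_depth": [2, 3], "criterion": ["gini", "entropy"], "random_state": [21]}

best_score, best_params = float("-inf"), None
for params in ParameterGrid(demo_params):
    model = DecisionTreeClassifier(**params)
    # Mean accuracy over 2 cross-validation folds for this combination.
    score = cross_val_score(model, X_demo, y_demo, cv=2, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, params

print(f"Best params: {best_params}")
print(f"Best training accuracy: {best_score:.3f}")
```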

The Python function `enumerate()` is a convenient tool for loops: it yields pairs made of a counter and an element of the iterable, packed into tuples.

**grids:** A list of `GridSearchCV` instances, one per model. Each instance searches for the best hyperparameters of its model.

**grid_dict:** A dictionary that maps a model's index in `grids` to its name, for example `{0: 'SVM', 1: 'Decision Tree', 2: 'Random Forest'}`.

**best_model_name:** Stores the name of the model with the best results.

**best_model_score:** The initial value is set to the lowest possible one.

**enumerate(self.grids):** Iterates over all `GridSearchCV` objects together with their indexes.

**tqdm:** Adds a progress bar so that execution is easy to follow.

**self.grid_dict[idx]:** Looks up the model name by the current index.
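
Put together, the bookkeeping described in these notes might look like the following sketch. Plain numbers stand in for the validation scores that fitted `GridSearchCV` objects would produce; the `scores` list is illustrative only.

```python
# Illustrative validation scores, one per model, in the same order as `grids`.
scores = [0.870, 0.828, 0.917]
grid_dict = {0: 'SVM', 1: 'Decision Tree', 2: 'Random Forest'}

best_model_name = None
best_model_score = float('-inf')  # lowest possible starting value

# enumerate() yields (index, item) tuples.
for idx, score in enumerate(scores):
    name = grid_dict[idx]  # look up the model name by its index
    if score > best_model_score:
        best_model_name, best_model_score = name, score

print(f"Classifier with best validation set accuracy: {best_model_name}")
```

Starting from negative infinity means the first model always replaces the initial value, so no special case is needed for the first iteration.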

In [70]:
class ModelSelection:
    def __init__(self, grids, grid_dict):
        self.grids = grids
        self.grid_dict = grid_dict
        self.best_results_df = pd.DataFrame(columns=['model', 'params', 'valid_score'])

    
    def choose(self, X_train, y_train, X_valid, y_valid):
        # Collect one row per model and build the dataframe once at the end;
        # this avoids the pandas FutureWarning about concatenating an empty dataframe.
        rows = []
        for idx, grid in enumerate(self.grids):
            print(f"Estimator: {self.grid_dict[idx]}")
            grid.fit(X_train, y_train)
            best_params = grid.best_params_
            best_score = grid.best_score_  # mean cross-validated accuracy on the training folds
            valid_score = grid.score(X_valid, y_valid)

            rows.append({
                'model': self.grid_dict[idx],
                'params': best_params,
                'valid_score': valid_score
            })
            
            print(f"Best params: {best_params}")
            print(f"Validation set accuracy score for best params: {valid_score:.3f}")
        self.best_results_df = pd.DataFrame(rows, columns=['model', 'params', 'valid_score'])
        return self.best_results_df

    def best_results(self):
        # Best model of each class, as collected by the last call to choose().
        return self.best_results_df


## 3. Finalization

`Finalize()` class
 - Takes an estimator.
 - Method `final_score()` takes `X_train`, `y_train`, `X_test`, `y_test` and returns the accuracy of the model as in the example below:
```
final.final_score(X_train, y_train, X_test, y_test)
Accuracy of the final model is 0.908284023668639
```
 - Method `save_model()` takes a path, saves the model to this path and prints that the model was successfully saved.
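
For reference, a file written with `joblib.dump` can be read back with `joblib.load`. The sketch below round-trips a small fitted tree through a temporary directory; the iris data and the file name `demo_model.sav` are placeholders for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=21).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")  # placeholder file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model makes the same predictions as the original one.
same_predictions = (restored.predict(X_demo) == model.predict(X_demo)).all()
print(same_predictions)
```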

In [71]:
class Finalize:
    def __init__(self, estimator):
        self.estimator = estimator
    
    def final_score(self, X_train, y_train, X_test, y_test):
        self.estimator.fit(X_train, y_train)
        test_score = self.estimator.score(X_test, y_test)
        print(f"Accuracy of the final model is {test_score:.5f}")
        return test_score
    
    def save_model(self, path):
        joblib.dump(self.estimator, path)
        print(f"Model successfully saved to {path}")


## 4. Main program

1. Load the data from the file `checker_submits.csv`.
2. Create the preprocessing pipeline that consists of two custom transformers: `FeatureExtractor()` and `MyOneHotEncoder()`:
```
preprocessing = Pipeline([('feature_extractor', FeatureExtractor()), ('onehot_encoder', MyOneHotEncoder('dayofweek'))])
```
3. Use that pipeline and its method `fit_transform()` on the initial dataset.
```
data = preprocessing.fit_transform(df)
```
4. Get `X_train`, `X_valid`, `X_test`, `y_train`, `y_valid`, `y_test` using `TrainValidationTest()` and the result of the pipeline.
5. Create an instance of `ModelSelection()`, use the method `choose()` applying it to the models that you want and parameters that you want, get the dataframe of the best results.
6. Create an instance of `Finalize()` with your best model, use the method `final_score()` and save the model in the format: `name_of_the_model_{accuracy on test dataset}.sav`.

That is it, congrats!

In [72]:
if __name__ == "__main__":
    df = pd.read_csv('../data/checker_submits.csv')
    print("Data loaded successfully. First rows:")
    print(df.head())

    # Build and apply the preprocessing pipeline.
    # The pipeline consists of two custom transformers: FeatureExtractor and MyOneHotEncoder.
    preprocessing = Pipeline([
        ('feature_extractor', FeatureExtractor()),  # Extracts hour and day of week, drops timestamp
        ('onehot_encoder', MyOneHotEncoder('dayofweek'))  # Encodes the categorical features
    ])
    
    # Apply the pipeline to the data
    data = preprocessing.fit_transform(df)
    print("Preprocessing pipeline finished successfully.")
    print("Processed data:")
    print(data.head())

    target_column = 'dayofweek' 
    X = data.drop(columns=[target_column]) 
    y = data[target_column] 

    # Split the data into training, validation and test sets
    splitter = TrainValidationTest(test_size=0.2, random_state=21)
    X_train, X_valid, X_test, y_train, y_valid, y_test = splitter.split(X, y)
    print("Data split into training, validation and test sets.")
    print(f"Shapes: X_train: {X_train.shape}, X_valid: {X_valid.shape}, X_test: {X_test.shape}")

    # Model selection
    # GridSearchCV settings for the different models
    svm_params = [{
        'kernel': ['linear', 'rbf', 'sigmoid'],
        'C': [0.01, 0.1, 1, 1.5, 5, 10],
        'gamma': ['scale', 'auto'],
        'class_weight': [None, 'balanced'],
        'probability': [True],
        'random_state': [21]
    }]
    tree_params = [{
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 5, 10, 15, 20],
        'class_weight': [None, 'balanced'],
        'random_state': [21]
    }]
    rf_params = [{
        'n_estimators': [10, 50, 100],
        'max_depth': [None, 10, 15, 20],
        'criterion': ['gini', 'entropy'],
        'class_weight': [None, 'balanced'],
        'random_state': [21]
    }]

    # Create the GridSearchCV instances
    gs_svm = GridSearchCV(estimator=svm.SVC(), param_grid=svm_params, scoring='accuracy', cv=2, n_jobs=-1)
    gs_tree = GridSearchCV(estimator=tree.DecisionTreeClassifier(), param_grid=tree_params, scoring='accuracy', cv=2, n_jobs=-1)
    gs_rf = GridSearchCV(estimator=ensemble.RandomForestClassifier(), param_grid=rf_params, scoring='accuracy', cv=2, n_jobs=-1)

    # Create the list of grids and the dictionary of model names
    grids = [gs_svm, gs_tree, gs_rf]
    grid_dict = {0: 'SVM', 1: 'Decision Tree', 2: 'Random Forest'}

    # Select models with ModelSelection
    model_selection = ModelSelection(grids, grid_dict)
    best_results_df = model_selection.choose(X_train, y_train, X_valid, y_valid)
    print("Model selection results:")
    print(best_results_df)

    # Pick the model with the highest validation score
    best_model_index = best_results_df['valid_score'].idxmax()
    best_model_name = best_results_df.loc[best_model_index, 'model']
    print(f"Best model: {best_model_name}")

    # Finalize and save the model
    best_estimator = grids[best_model_index].best_estimator_
    finalizer = Finalize(best_estimator)

    # Evaluate the model's accuracy on the test set
    accuracy = finalizer.final_score(X_train, y_train, X_test, y_test)

    # Save the model
    model_path = f"{best_model_name}_{accuracy:.5f}.sav"
    finalizer.save_model(model_path)


Data loaded successfully. First rows:
      uid   labname  numTrials                   timestamp
0  user_4  project1          1  2020-04-17 05:19:02.744528
1  user_4  project1          2  2020-04-17 05:22:45.549397
2  user_4  project1          3  2020-04-17 05:34:24.422370
3  user_4  project1          4  2020-04-17 05:43:27.773992
4  user_4  project1          5  2020-04-17 05:46:32.275104
Preprocessing pipeline finished successfully.
Processed data:
   numTrials  hour  dayofweek  uid_user_1  uid_user_10  uid_user_11  \
0          1     5          4         0.0          0.0          0.0   
1          2     5          4         0.0          0.0          0.0   
2          3     5          4         0.0          0.0          0.0   
3          4     5          4         0.0          0.0          0.0   
4          5     5          4         0.0          0.0          0.0   

   uid_user_12  uid_user_13  uid_user_14  uid_user_15  ...  labname_lab02  \
[output truncated]

Best params: {'C': 10, 'class_weight': None, 'gamma': 'auto', 'kernel': 'rbf', 'probability': True, 'random_state': 21}
Validation set accuracy score for best params: 0.870
Estimator: Decision Tree
Best params: {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'random_state': 21}
Validation set accuracy score for best params: 0.828
Estimator: Random Forest
Best params: {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'n_estimators': 100, 'random_state': 21}
Validation set accuracy score for best params: 0.917
Model selection results:
           model                                             params  \
0            SVM  {'C': 10, 'class_weight': None, 'gamma': 'auto...   
1  Decision Tree  {'class_weight': None, 'criterion': 'gini', 'm...   
2  Random Forest  {'class_weight': None, 'criterion': 'gini', 'm...   

   valid_score  
0     0.869822  
1     0.828402  
2     0.917160  
Best model: Random Forest
Accuracy of the final model is 0.95266
Model suc