# Building a model for later use

This model is the same as the one trained in the notebook 4_Pl_Classifying.ipynb. Here it is wrapped in a class so that it can be saved and reused later. (With help from [Alexey](https://github.com/AlexSkrn).)

**Important! For convenience, this notebook is named and numbered to match the other parts of the project, but when the class/model is saved as a .py file, the file must carry the name under which it will later be imported (for example, prep.py).**

In [5]:
import pandas as pd
import re
from string import digits
import pymorphy2
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin # base sklearn classes to inherit methods from, see below
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In [6]:
df = pd.read_csv("/Users/liza/PycharmProjects/Planeta_project/plset_fin_upd_clustered.tsv", sep ="\t")
df = df.drop(df.columns[0:2], axis=1)
df = df.rename_axis(None, axis=1).rename_axis('Id', axis=1)

In Python, practically any object can be saved (serialized). To save the whole model together with the data-processing steps that were used for training (it will not work without them), the tokenizer and the predictor have to be packed into a pipeline. A pipeline is a chain of data-processing steps. A pipeline has a fit method. Once we have fitted it on our data, we save the pipeline.

[The following is taken from here.](https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65)

All transformers and estimators in scikit-learn are implemented as Python classes, each with their own attributes and methods. So every time you write Python statements like these:

```
from sklearn.preprocessing import OneHotEncoder

# Initializing an object of class OneHotEncoder
# (in scikit-learn 1.2 and later this parameter is named sparse_output)
one_hot_enc = OneHotEncoder( sparse = True )

# Calling methods on our OneHotEncoder object
one_hot_enc.fit( some_data ) # learns the categories; returns the fitted encoder itself
transformed_data = one_hot_enc.transform( some_data ) # returns the encoded data
```

you are essentially creating an instance called ‘one_hot_enc’ of the class ‘OneHotEncoder’ using its class constructor and passing it the argument ‘True’ for its parameter ‘sparse’. The OneHotEncoder class has methods such as ‘fit’, ‘transform’, ‘fit_transform’ and others, which can now be called on our instance with the appropriate arguments, as seen here.

**In order for our custom transformer to be compatible with a scikit-learn pipeline it must be implemented as a class with methods such as fit, transform, fit_transform, get_params, set_params**, so we’re going to write all of those... or we can simply code the kind of transformation we want our transformer to apply and inherit everything else from some other class!

Inheriting from TransformerMixin ensures that all we need to do is write our fit and transform methods and we get fit_transform for free. Inheriting from BaseEstimator ensures we get get_params and set_params for free. 
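
A minimal sketch of what "for free" means here. The `Lowercase` class, its `strip` parameter and the sample strings are made up for illustration and are not part of the project. Only `fit` and `transform` are written by hand; `fit_transform`, `get_params` and `set_params` come from the two base classes. The mixin is listed first, which is the order scikit-learn's own estimators use.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class Lowercase(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer: lowercases (and optionally strips) strings."""

    def __init__(self, strip=True):
        # Constructor arguments stored under the same name are what
        # get_params / set_params read and write.
        self.strip = strip

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        out = [s.lower() for s in X]
        if self.strip:
            out = [s.strip() for s in out]
        return out


t = Lowercase()
result = t.fit_transform(["  Hello ", "WORLD"])  # inherited from TransformerMixin
params = t.get_params()                           # inherited from BaseEstimator
t.set_params(strip=False)                         # inherited from BaseEstimator
```

Because `get_params` and `set_params` are available, such a class can also be tuned inside a pipeline, for example with a grid search over its constructor arguments.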



In [7]:
# Create a custom tokenizer that can be plugged into a Pipeline

class Prep(BaseEstimator, TransformerMixin): # we put BaseEstimator and TransformerMixin in parentheses while declaring the class 
                                            # to let Python know our class is going to inherit from them.

    def __init__(self):
        self.morph_analyzer = pymorphy2.MorphAnalyzer()
        self.stop_words = stopwords.words('russian')
        self.stop_words.extend(['это', '–', '-', 'фонд', 'наш', 'помощь', 'помогать',
                   'помочь', 'поддержать', 'поддержка', 'средство', 'который', 'весь',
                   'благотворительный', 'деньги', 'рубль', 'год', 'день', 'тысяча',
                   'ваш', 'сегодня', 'завтра', 'этот', 'дать', 'проект', 'свой' ])

    def prep(self, text):
        clean_text = text.translate(str.maketrans('', '', '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~«»№!—'))
        clean_text = clean_text.translate(str.maketrans('', '', digits))
        clean_text = re.sub("-", " ", clean_text)
        # clean_text = re.sub("[a-zA-Z]", "", clean_text)  # drop words in Latin script
        clean_text = clean_text.lower()
        clean_text = clean_text.split()
        

        # words = [word for word in clean_text if word not in self.stop_words]
        # return words

        lemmas = [self.morph_analyzer.parse(word)[0].normal_form for word in clean_text]
        lemmas = [word for word in lemmas if word not in self.stop_words]
        return ' '.join(lemmas)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is expected to be a pandas Series of raw texts
        X = X.copy()
        X = X.map(self.prep)

        return X

# Put all three text-processing steps into one pipe
pipe = Pipeline([
    ('tokenizer', Prep()),
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression(max_iter=5000))
    ]
    )

# Split the source data
X_train, X_test, y_train, y_test = train_test_split(df.Description,
                                                    df.Category,
                                                    stratify=df.Category)
# Fit the pipe like an ordinary classifier
pipe.fit(X_train, y_train)

# Next: predict, evaluate and save


Pipeline(memory=None,
         steps=[('tokenizer', Prep()),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=5000,
                                    multi_
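
The prediction and evaluation step mentioned in the last comment of the cell could look roughly like the sketch below. To stay self-contained it uses a tiny made-up English dataset and leaves out the `Prep` step; the texts, labels and the name `toy_pipe` are assumptions for illustration only. With the real data one would call `pipe.predict(X_test)` and pass `y_test` to `classification_report` in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six short texts per category
texts = [
    "help a sick child", "treatment for children", "surgery for a child",
    "medicine for a patient", "hospital needs equipment", "rehabilitation after illness",
    "record a new album", "rock band on tour", "release a music video",
    "concert of a jazz band", "studio for young musicians", "festival of folk music",
]
labels = ["health"] * 6 + ["music"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, random_state=0
)

toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=5000)),
])
toy_pipe.fit(X_train, y_train)

# Predict on the held-out texts and summarise precision, recall and F1 per class
y_pred = toy_pipe.predict(X_test)
report = classification_report(y_test, y_pred, zero_division=0)
print(report)
```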

⬇︎ Save the trained model, which we will later load and use for prediction.
[See the sklearn documentation for details.](https://scikit-learn.org/stable/modules/model_persistence.html)

In [8]:
from joblib import dump, load
dump(pipe, 'pipe.joblib') 

['pipe.joblib']

**Important! For convenience, this notebook is named and numbered to match the other parts of the project, but when the class/model is saved as a .py file, the file must carry the name under which it will later be imported (for example, prep.py).**

**Also: the dataset the model was trained on must keep the same file name and stay in the same folder.**
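
A hedged sketch of the save-and-load round trip, using a small stand-in pipeline so that it runs on its own. The toy texts, the name `toy_pipe` and the temporary file path are assumptions for illustration. joblib relies on pickle, so a pipeline that contains a custom class such as `Prep` can only be loaded when the module defining that class is importable under the same name, which is why the file name matters.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the project dataset
texts = ["help sick children", "treatment for a child",
         "new music album", "record a rock album"]
labels = ["health", "health", "music", "music"]

toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=5000)),
])
toy_pipe.fit(texts, labels)

# Save the fitted pipeline to a temporary location
path = os.path.join(tempfile.mkdtemp(), "toy_pipe.joblib")
dump(toy_pipe, path)

# Later (possibly in another script): load it and predict on new texts
restored = load(path)
new_texts = ["album of rock music"]
original_pred = toy_pipe.predict(new_texts)
restored_pred = restored.predict(new_texts)
```

The restored pipeline carries the fitted vectorizer vocabulary and classifier weights, so it should give the same predictions as the original without being fitted again.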

# Key concepts

What is the difference between estimators vs transformers vs predictors in sklearn?

When working on Machine Learning projects with the scikit-learn library, there are important and fundamental concepts that every ML ninja needs to be aware of. In this post I am highlighting a few of them to differentiate estimators, transformers and predictors when building machine learning solutions using sklearn.


1) **Estimators**: Any object that can estimate some parameters based on a dataset is called an estimator. The estimation itself is performed by calling the fit() method.
This method takes one dataset as a parameter (or two in the case of supervised learning algorithms: the data and the labels). Any other parameter needed to guide the estimation process is called a hyperparameter and must be set as an instance variable.

For example: I would like to estimate the mean, median or most frequent value of a column in my dataset.




2) **Transformers**: A transformer transforms a dataset: calling its transform() method returns the transformed dataset. Some estimators can also transform a dataset.

For example: the SimpleImputer class in sklearn is both an estimator and a transformer (it replaced the older Imputer class from sklearn.preprocessing, which was removed in scikit-learn 0.22). You can call its fit_transform() method, which estimates and then transforms a dataset.

Python code:

    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy="mean")  # estimate the mean value of each dataset column
    imputer.fit(mydataset)                    # SimpleImputer as an estimator
    imputer.fit_transform(mydataset)          # SimpleImputer as an estimator and a transformer (two steps combined)

3) **Predictors**: A predictor makes predictions for a given dataset. A predictor class has a predict() method that takes new instances of a dataset and returns the corresponding predictions. It also has a score() method that measures the quality of the predictions for a given test dataset.

For example: LinearRegression, SVM, decision trees, etc. are predictors.


**You can combine building blocks of estimators, transformers and predictors as a pipeline in sklearn.** This allows developers to chain a sequence of transformers followed by a final estimator or predictor. This concept is called **composition** in Machine Learning.
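
The three roles and their composition can be sketched in a few lines. This is only an illustration under assumed toy numbers; the arrays `X` and `y` and the step names `impute` and `regress` are made up. `SimpleImputer` plays the estimator and transformer, `LinearRegression` the predictor, and `Pipeline` composes them.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one feature with a missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Estimator: fit() learns a parameter (the column mean) from the data;
# the strategy is a hyperparameter set when the object is created.
imputer = SimpleImputer(strategy="mean")
imputer.fit(X)
learned_mean = imputer.statistics_[0]  # mean of the observed values 1, 2 and 4

# Transformer: transform() returns the dataset with the gap filled in
X_filled = imputer.transform(X)

# Composition: a transformer followed by a final predictor
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("regress", LinearRegression()),
])
model.fit(X, y)

# Predictor: predict() returns predictions, score() measures their quality (R^2 here)
pred = model.predict(np.array([[3.0]]))
r2 = model.score(X, y)
```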



⬆︎ [Source](http://www.mostafa.rocks/2017/04/what-is-difference-between-estimators.html)


**estimator**: This isn't a word with a rigorous definition but it is usually associated with finding a current value in data. If we didn't explicitly count the change in our pocket we might use an estimate. That said, in machine learning it is most frequently used in conjunction with parameter estimation or density estimation. In both cases there is an assumption that the data we currently have comes in a form that can be described with a function. With parameter estimation, we believe that the function is a known function that has additional parameters such as rate or mean, and we may estimate the value of those parameters. In density estimation we may not even have an assumption about the function but we will attempt to estimate the function regardless. Once we have an estimation we may have at our disposal a model. The estimator then would be the method of generating estimations, for example the method of maximum likelihood.

**classifier**: This specifically refers to a type of function (and use of that function) where the response (or range in functional language) is discrete. Compared to this, a regressor will have a continuous response. There are additional response types but these are the two most well known. Once we have built a classifier, it is expected to predict for us from within a finite range of classes which class a vector of data is likely to indicate. As an example, voice recognition software may record a meeting and attempt to record at any given time which of the finite number of meeting attendees is speaking. Building this software we would give each attendee a number that is nominal only and attempt to classify to that number for each segment of speech.

**model**: The model is the function (or pooled set of functions) that you may accept or reject as being representative of your phenomenon. The word stems from the idea that you may apply domain knowledge to explaining/predicting the phenomenon though this isn't required. A non-parametric model might be derived entirely from the data at hand but the result is often still called a model. This terminology highlights the fact that what has been constructed when a model has been constructed is not reality but only a 'model' of reality. As George Box has said "All models are wrong but some are useful". Having a model allows you to predict but that may not be its purpose; it could also be used to simulate or to explain.

⬆︎ [Source](https://stats.stackexchange.com/questions/103475/classifier-vs-model-vs-estimator)