# Fit preprocessors

## Import libraries and load data

In [1]:
from dmml_project.dataset import Dataset
from dmml_project.preprocessor import Preprocessor
from dmml_project import PROJECT_ROOT

dataset = Dataset.load(f"{PROJECT_ROOT}/data/train.tsv")
tfidf_preprocessor = Preprocessor(kind="tfidf")
count_preprocessor = Preprocessor(kind="count")
binary_preprocessor = Preprocessor(kind="binary")
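
The three `kind` values above presumably select different term-weighting schemes. The sketch below illustrates how count, binary, and TF-IDF vectors usually differ on a toy corpus. It is not the `Preprocessor` implementation: the corpus and helper functions are hypothetical, and the TF-IDF formula (smoothed IDF followed by L2 normalization) is assumed to follow scikit-learn's defaults.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized on whitespace.
docs = ["feel good feel", "feel bad", "good day"]
tokenized = [d.split() for d in docs]

# Vocabulary: each distinct term gets a column index (alphabetical order).
vocab = {t: i for i, t in enumerate(sorted({t for d in tokenized for t in d}))}

def count_vector(tokens):
    """Number of occurrences of each vocabulary term."""
    counts = Counter(tokens)
    return [counts.get(t, 0) for t in vocab]

def binary_vector(tokens):
    """1 if the term occurs at least once, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(tokens)]

def tfidf_vector(tokens):
    """Counts scaled by smoothed IDF, then L2-normalized (assumed formula)."""
    n = len(tokenized)
    df = Counter(t for d in tokenized for t in set(d))  # document frequency
    raw = [c * (math.log((1 + n) / (1 + df[t])) + 1.0)
           for t, c in zip(vocab, count_vector(tokens))]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

sample = tokenized[0]
print(count_vector(sample))   # [0, 0, 2, 1]
print(binary_vector(sample))  # [0, 0, 1, 1]
print([round(w, 3) for w in tfidf_vector(sample)])
```

All three schemes share the same vocabulary and differ only in the values stored per column.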

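The cell below shows the text preprocessing step on its own, before any vectorization. Each original sentence is printed first, followed by its preprocessed (stemmed) form.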

In [2]:
examples = dataset.data["text"][:5]
for example in examples:
    print(example)
    print(count_preprocessor._preprocess_text(example))
    print()

i have a feeling that city rules of delicate speech wont be as practical out here said rasa
i have a feel that citi rule of delic speech wont be as practic out here said rasa

i love gift baskets in general and for weddings they are a fun way to give things that they need and want without feeling like your gift is boring
i love gift basket in general and for wed they are a fun way to give thing that they need and want without feel like your gift is bore

i know no one is reading this and because i feel the need to be sentimental dear brother you are one of the most important people in my life
i know no one is read this and becaus i feel the need to be sentiment dear brother you are one of the most import peopl in my life

i had a feeling that daisy s could look cute and age ap
i had a feel that daisi s could look cute and age ap

i almost feel that louise is so eager to overcome her gilligans island typecasting and demonstrate what she is capable of that shes like a horse thats out of 

## Fit preprocessors on training data

In [3]:
# Fit each preprocessor independently on the raw training text.
text = dataset.get_x()
tfidf_preprocessor.fit(text)
count_preprocessor.fit(text)
binary_preprocessor.fit(text)

Preprocessing data: 100%|██████████| 365447/365447 [01:20<00:00, 4550.16it/s]
Preprocessing data: 100%|██████████| 365447/365447 [01:16<00:00, 4791.11it/s]
Preprocessing data: 100%|██████████| 365447/365447 [01:15<00:00, 4815.87it/s]


## Save preprocessors

In [4]:
tfidf_preprocessor.save(f"{PROJECT_ROOT}/data/preprocessor/tfidf.pkl")
count_preprocessor.save(f"{PROJECT_ROOT}/data/preprocessor/count.pkl")
binary_preprocessor.save(f"{PROJECT_ROOT}/data/preprocessor/binary.pkl")
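
The `.pkl` extension suggests that `save` serializes the fitted object with `pickle`, although the notebook does not show how `Preprocessor.save` is implemented. The sketch below shows a save/load round trip under that assumption. The helpers `save_state` and `load_state` and the `state` dictionary are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def save_state(state, path):
    """Pickle `state` to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(state, f)

def load_state(path):
    """Load a pickled object; only use this on files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical fitted state: the kind of vectorizer and its vocabulary.
state = {"kind": "count", "vocabulary": {"feel": 0, "good": 1}}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "preprocessor" / "count.pkl"
    save_state(state, target)
    restored = load_state(target)

print(restored == state)  # True
```

Unpickling can execute arbitrary code, so saved preprocessors should only be loaded from files you created or trust.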

## Vocabulary size and a vectorization example

In [5]:
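# Vocabulary size of each fitted vectorizer. All three were fitted on
# the same text, so the sizes are expected to match.
print(len(tfidf_preprocessor.vectorizer.vocabulary_))
print(len(count_preprocessor.vectorizer.vocabulary_))
print(len(binary_preprocessor.vectorizer.vocabulary_))

# Walk one sample through the pipeline: raw text, preprocessed text,
# and the resulting sparse count vector.
print(text[0])
print(count_preprocessor._preprocess_text(text[0]))
print(count_preprocessor([text[0]]))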

53976
53976
53976
i have a feeling that city rules of delicate speech wont be as practical out here said rasa
i have a feel that citi rule of delic speech wont be as practic out here said rasa
  (0, 8504)	1
  (0, 11450)	1
  (0, 15802)	1
  (0, 36508)	1
  (0, 38053)	1
  (0, 39976)	1
  (0, 40248)	1
  (0, 43880)	1
  (0, 52591)	1
