# Fit preprocessors

## Import libraries and load data

In [1]:
from dmml_project.dataset import Dataset
from dmml_project.preprocessor import Preprocessor
from dmml_project import PROJECT_ROOT

# Load the training data
dataset = Dataset.load(f"{PROJECT_ROOT}/data/train.tsv")

# One preprocessor per vectorization scheme
tfidf_preprocessor = Preprocessor(kind="tfidf")
count_preprocessor = Preprocessor(kind="count")
binary_preprocessor = Preprocessor(kind="binary")
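The three `kind` values presumably select different term weightings over the same vocabulary. The `Preprocessor` internals are not shown in this notebook, so the following is only a plain-Python sketch of that idea on a toy corpus. The helper names (`build_vocabulary`, `count_vector`, `binary_vector`, `tfidf_vector`) are hypothetical, and the TF-IDF variant used here (smoothed IDF followed by L2 normalization) is one common choice, not necessarily the one the project uses.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct whitespace token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    """Number of occurrences of each vocabulary token in the document."""
    counts = Counter(doc.split())
    ordered = sorted(vocab, key=vocab.get)  # tokens in column order
    return [counts.get(tok, 0) for tok in ordered]


def binary_vector(doc, vocab):
    """1 if the token occurs at least once, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(doc, vocab)]


def tfidf_vector(doc, vocab, docs):
    """Counts scaled by a smoothed IDF, then L2-normalized (assumed variant)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for d in docs for tok in set(d.split()))
    ordered = sorted(vocab, key=vocab.get)
    raw = [
        c * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
        for tok, c in zip(ordered, count_vector(doc, vocab))
    ]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw


# Toy corpus, not taken from the project's data
toy_docs = ["i feel good good", "i feel bad"]
vocab = build_vocabulary(toy_docs)

counts = count_vector(toy_docs[0], vocab)
flags = binary_vector(toy_docs[0], vocab)
weights = tfidf_vector(toy_docs[0], vocab, toy_docs)

print(vocab)
print(counts)   # [0, 1, 2, 1]
print(flags)    # [0, 1, 1, 1]
print(weights)
```

In this sketch the repeated token `good` gets a count of 2 but a binary flag of 1, and it receives the largest TF-IDF weight because it is both frequent in the document and absent from the other one.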

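The example below shows the text preprocessing step on its own, before any vectorization. Each of the first five training texts is printed, followed by its preprocessed form, in which words are reduced to stems (for example, `really` becomes `realli`).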

In [2]:
examples = dataset.data["text"][:5]
for example in examples:
    print(example)
    print(count_preprocessor._preprocess_text(example))
    print()

i have a love hate relationship with it really nice product but made my lips feel miserable for a while
i have a love hate relationship with it realli nice product but made my lip feel miser for a while

i lay down in my bed feeling completely groggy with all the medicine
i lay down in my bed feel complet groggi with all the medicin

i remember feeling very uncertain at that time about what would happen next and i knew it was going to bring something unexpected my way but i didnt know what
i rememb feel veri uncertain at that time about what would happen next and i knew it was go to bring someth unexpect my way but i didnt know what

i even feel like my life is honestly worthless and plus i feel as being skinny as iam doesnt help
i even feel like my life is honest worthless and plus i feel as be skinni as iam doesnt help

i had basically chopped them down with a machete which could both leave him feeling rejected and leave me with the opposite of what i want
i had basic chop them down 

## Fit preprocessors on training data

In [3]:
text = dataset.get_x()
tfidf_preprocessor.fit(text)
count_preprocessor.fit(text)
binary_preprocessor.fit(text)

Preprocessing data: 100%|██████████| 365447/365447 [01:12<00:00, 5070.97it/s]
Preprocessing data: 100%|██████████| 365447/365447 [01:13<00:00, 4999.92it/s]
Preprocessing data: 100%|██████████| 365447/365447 [01:12<00:00, 5053.21it/s]


## Save preprocessors

In [4]:
tfidf_preprocessor.save(f"{PROJECT_ROOT}/data/preprocessor/tfidf.pkl")
count_preprocessor.save(f"{PROJECT_ROOT}/data/preprocessor/count.pkl")
binary_preprocessor.save(f"{PROJECT_ROOT}/data/preprocessor/binary.pkl")
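The `.pkl` extension suggests that `save` serializes the fitted object with `pickle`, although the method body is not shown here. As a hedged illustration of that pattern, the sketch below writes a small fitted state to disk and reads it back. The functions `save_state` and `load_state` and the toy state dictionary are hypothetical stand-ins, not part of `dmml_project`.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(state, path):
    """Pickle `state` to `path`, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(state, handle)


def load_state(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Toy stand-in for what a fitted preprocessor might hold
original_state = {"kind": "count", "vocabulary": {"bad": 0, "feel": 1, "good": 2}}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "preprocessor" / "count.pkl"
    save_state(original_state, target)
    restored_state = load_state(target)

print(restored_state == original_state)  # True
```

Two details are worth noting in this sketch: the target directory is created before writing, since opening a file inside a missing directory raises `FileNotFoundError`, and the round trip is checked by comparing the restored state with the original.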

## Vocabulary size and a short demo

In [5]:
# Vocabulary size of each fitted vectorizer
print(len(tfidf_preprocessor.vectorizer.vocabulary_))
print(len(count_preprocessor.vectorizer.vocabulary_))
print(len(binary_preprocessor.vectorizer.vocabulary_))

# First training text: raw, after text preprocessing, and as a sparse count vector
print(text[0])
print(count_preprocessor._preprocess_text(text[0]))
print(count_preprocessor([text[0]]))

53951
53951
53951
i have a love hate relationship with it really nice product but made my lips feel miserable for a while
i have a love hate relationship with it realli nice product but made my lip feel miser for a while
  (0, 15793)	1
  (0, 20103)	1
  (0, 26948)	1
  (0, 27505)	1
  (0, 27967)	1
  (0, 29977)	1
  (0, 32105)	1
  (0, 36934)	1
  (0, 38232)	1
  (0, 38693)	1
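Each line of the sparse output above reads as `(row, column) value`: row 0 is the single input text, the column is a vocabulary index, and the value is the count for that term. To make such output readable, the column indices can be mapped back to tokens by inverting the token-to-index mapping. The sketch below assumes `vocabulary_` is such a mapping and uses a made-up four-term vocabulary with made-up entries; `decode_entries` is a hypothetical helper, and none of the indices correspond to the real ones printed above.

```python
def decode_entries(entries, vocabulary):
    """Turn (row, column, value) triples into (row, token, value) triples."""
    index_to_token = {index: token for token, index in vocabulary.items()}
    return [(row, index_to_token[col], value) for row, col, value in entries]


# Toy vocabulary and entries, for illustration only
toy_vocabulary = {"feel": 0, "hate": 1, "love": 2, "miser": 3}
toy_entries = [(0, 1, 1), (0, 2, 1), (0, 3, 1)]

decoded = decode_entries(toy_entries, toy_vocabulary)
print(decoded)  # [(0, 'hate', 1), (0, 'love', 1), (0, 'miser', 1)]
```

With a real SciPy sparse matrix, the triples could be obtained from its COO form (`matrix.tocoo()` exposes `row`, `col`, and `data` arrays), assuming the preprocessor returns such a matrix.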
