## Preprocessing Functions
To bring your text into a format ideal for analysis, you can write preprocessing functions to encapsulate your cleaning process. For example, in this section, you’ll create a preprocessor that applies the following operations:

1. Lowercases the text
2. Lemmatizes each token
3. Removes punctuation symbols
4. Removes stop words


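Most NLP tasks start with a step like this: lowercasing and lemmatizing collapse the variants of a word into a single form, while removing punctuation and stop words discards tokens that carry little meaning for analysis. In the code that follows, `is_token_allowed()` decides which tokens to keep and `preprocess_token()` normalizes each one.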

In [28]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [32]:
complete_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech company. He is"
    " interested in learning Natural Language Processing."
    " There is a developer conference happening on 21 July"
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number'
    " available at +44-1234567891. Gus is helping organize it."
    " He keeps organizing local Python meetups and several"
    " internal talks at his workplace. Gus is also presenting"
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    " Apart from his work, he is very passionate about music."
    " Gus is learning to play the Piano. He has enrolled"
    " himself in the weekend batch of Great Piano Academy."
    " Great Piano Academy is situated in Mayfair or the City"
    " of London and has world-class piano instructors."
)
complete_doc = nlp(complete_text)


def is_token_allowed(token):
    return bool(
        token and str(token).strip() and not token.is_stop and not token.is_punct
    )


def preprocess_token(token):
    return token.lemma_.strip().lower()


complete_filtered_tokens = [
    preprocess_token(token) for token in complete_doc if is_token_allowed(token)
]

complete_filtered_tokens

['gus',
 'proto',
 'python',
 'developer',
 'currently',
 'work',
 'london',
 'base',
 'fintech',
 'company',
 'interested',
 'learn',
 'natural',
 'language',
 'processing',
 'developer',
 'conference',
 'happen',
 '21',
 'july',
 '2019',
 'london',
 'title',
 'application',
 'natural',
 'language',
 'processing',
 'helpline',
 'number',
 'available',
 '+44',
 '1234567891',
 'gus',
 'helping',
 'organize',
 'keep',
 'organize',
 'local',
 'python',
 'meetup',
 'internal',
 'talk',
 'workplace',
 'gus',
 'present',
 'talk',
 'talk',
 'introduce',
 'reader',
 'use',
 'case',
 'natural',
 'language',
 'processing',
 'fintech',
 'apart',
 'work',
 'passionate',
 'music',
 'gus',
 'learn',
 'play',
 'piano',
 'enrol',
 'weekend',
 'batch',
 'great',
 'piano',
 'academy',
 'great',
 'piano',
 'academy',
 'situate',
 'mayfair',
 'city',
 'london',
 'world',
 'class',
 'piano',
 'instructor']
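
To see what each of the four steps contributes, it can help to sketch them in plain Python. The snippet below is only an illustration and does not use spaCy: `toy_preprocess`, `TOY_STOP_WORDS`, and `TOY_LEMMAS` are hypothetical names, and the stop word set and lemma table are tiny hand-written stand-ins for what the model provides.

```python
import string

# Hypothetical stand-ins for the model's stop word list and lemmatizer.
TOY_STOP_WORDS = {"is", "a", "the", "to", "in", "he", "his", "at"}
TOY_LEMMAS = {"working": "work", "learning": "learn", "meetups": "meetup"}


def toy_preprocess(text):
    """Apply the four cleaning steps to whitespace-separated words."""
    tokens = []
    for raw in text.split():
        # Lowercase the word.
        word = raw.lower()
        # Remove punctuation attached to the start or end of the word.
        word = word.strip(string.punctuation)
        # Skip empty strings and stop words.
        if not word or word in TOY_STOP_WORDS:
            continue
        # "Lemmatize" by table lookup, falling back to the word itself.
        tokens.append(TOY_LEMMAS.get(word, word))
    return tokens


toy_tokens = toy_preprocess("Gus is learning to play the Piano.")
print(toy_tokens)  # ['gus', 'learn', 'play', 'piano']
```

A real lemmatizer and tokenizer handle far more cases than this lookup table and `str.split()` can, which is why the cells above rely on the spaCy model instead.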

In [20]:
# An alternative that takes raw text and returns one cleaned string.
# Unlike the cell above, it lowercases before parsing and also drops
# whitespace tokens and number-like tokens (such as "21" and "2019").
def preprocess_text(text):
    doc = nlp(text.lower())
    tokens = []
    for token in doc:
        if (
            not token.is_stop
            and not token.is_punct
            and not token.is_space
            and not token.like_num
        ):
            tokens.append(token.lemma_)
    return " ".join(tokens)


preprocessed_text = preprocess_text(complete_text)
preprocessed_text