# NLP preprocessing techniques

- In natural language processing (NLP), tokenization is the process of breaking a text into smaller units called tokens. Depending on the granularity the analysis needs, tokens can be sentences, words, subwords, or individual characters.

- Tokenization is usually the first step in an NLP pipeline. Downstream tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, and machine translation operate on tokens rather than on the raw string, so the way the text is split shapes everything that follows.

- Tokenization methods vary with the task and the language being processed. Two common techniques are whitespace tokenization, which splits only on spaces (so punctuation stays attached to the neighboring word), and word-level tokenization, which also separates punctuation marks from words. Other schemes include subword tokenization, which breaks words into smaller reusable pieces, and character-level tokenization, where every character is a token.

- Overall, tokenization is a fundamental process in NLP that enables the effective analysis and processing of textual data.
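The granularities described above can be compared on a single sentence using only the standard library. This is a minimal sketch, not how any particular library works: the helper names (`whitespace_tokens`, `char_tokens`, `toy_subword_tokens`) and the tiny `VOCAB` set are made up for illustration, and greedy longest-match is just one simple way to cut a word into subword pieces.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace; punctuation stays glued to words."""
    return text.split()


def char_tokens(text):
    """Treat every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical subword vocabulary, chosen only for this illustration.
VOCAB = {"token", "ization", "process", "ing", "un", "able"}


def toy_subword_tokens(word, vocab=VOCAB):
    """Greedy longest-match split of one word into known subword pieces.

    Characters that cannot be matched fall back to single-character tokens.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matched: emit one character and move on.
            pieces.append(word[start])
            start += 1
    return pieces


text = "I love natural language processing!"
print(whitespace_tokens(text))             # ['I', 'love', 'natural', 'language', 'processing!']
print(char_tokens("NLP!"))                 # ['N', 'L', 'P', '!']
print(toy_subword_tokens("tokenization"))  # ['token', 'ization']
print(toy_subword_tokens("processing"))    # ['process', 'ing']
```

Note that the whitespace split keeps `"processing!"` as one token, which is exactly the case word-level tokenizers are designed to handle.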






### Example

- Input text: "I love natural language processing!"

- Tokenization output: ["I", "love", "natural", "language", "processing", "!"]

- In this example, the input text is tokenized into individual words. Each word becomes a separate token, and the punctuation mark ("!") is also treated as a separate token.

- Tokenization breaks down the input text into meaningful units, allowing further analysis and processing of the text at the word level.
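The output in this example can be reproduced with a one-line regular expression. This is only a rough sketch of word-level tokenization, assuming that "word" means a run of letters, digits, or underscores and that every other non-space character is a token of its own; the names `TOKEN_PATTERN` and `simple_word_tokens` are hypothetical.

```python
import re

# \w+ matches a run of letters/digits/underscores; [^\w\s] matches a single
# punctuation character. Whitespace matches neither branch and is skipped.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_word_tokens(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_word_tokens("I love natural language processing!"))
# ['I', 'love', 'natural', 'language', 'processing', '!']
```

A pattern this simple has limits: a contraction such as `don't` comes out as `['don', "'", 't']`, which is one reason to prefer a dedicated tokenizer for real text.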






## Tokenization

In [3]:
# !pip install nltk  # uncomment to install NLTK if it is missing

In [2]:
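import nltk
# nltk.download("punkt")  # one-time download of the Punkt models used below; some newer NLTK releases ask for "punkt_tab" instead (the LookupError names the resource)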

In [4]:
text = "I love natural language processing!"

In [6]:
sent_token = nltk.sent_tokenize(text)
sent_token  # list of sentences (sentence tokenization)

['I love natural language processing!']

In [7]:
word_token = nltk.word_tokenize(text)
word_token  # list of words and punctuation marks (word tokenization)

['I', 'love', 'natural', 'language', 'processing', '!']

### Let's take a longer text

In [8]:
bg_text = "Natural language processing (NLP) is a field of study that focuses on the interaction between computers and human language. It involves tasks such as text classification, sentiment analysis, machine translation, and named entity recognition. Tokenization is a fundamental step in NLP, where the input text is divided into smaller units called tokens. These tokens can be individual words, phrases, or even characters. The tokenization process serves as the foundation for various NLP tasks and allows for effective analysis and processing of textual data."

In [9]:
sent_token = nltk.sent_tokenize(bg_text)
sent_token  # list of sentences (sentence tokenization)

['Natural language processing (NLP) is a field of study that focuses on the interaction between computers and human language.',
 'It involves tasks such as text classification, sentiment analysis, machine translation, and named entity recognition.',
 'Tokenization is a fundamental step in NLP, where the input text is divided into smaller units called tokens.',
 'These tokens can be individual words, phrases, or even characters.',
 'The tokenization process serves as the foundation for various NLP tasks and allows for effective analysis and processing of textual data.']

In [10]:
sent_token[0]

'Natural language processing (NLP) is a field of study that focuses on the interaction between computers and human language.'
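To get a feel for what sentence tokenization involves, here is a deliberately naive splitter written with the standard library. It is a sketch for comparison only, not what `nltk.sent_tokenize` does internally: NLTK relies on a pre-trained Punkt model, while the hypothetical `naive_sentences` helper below simply cuts after `.`, `!`, or `?` when whitespace follows.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "Tokenization is a fundamental step in NLP. It comes first! Does it matter?"
print(naive_sentences(sample))
# ['Tokenization is a fundamental step in NLP.', 'It comes first!', 'Does it matter?']

# The rule breaks on abbreviations, because "Dr." looks like a sentence end.
print(naive_sentences("Dr. Smith works on NLP."))
# ['Dr.', 'Smith works on NLP.']
```

The second call shows why a plain punctuation rule is not enough for real text and why a trained sentence tokenizer is the safer choice.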

In [11]:
word_token = nltk.word_tokenize(bg_text)
word_token  # list of words and punctuation marks (word tokenization)

['Natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'field',
 'of',
 'study',
 'that',
 'focuses',
 'on',
 'the',
 'interaction',
 'between',
 'computers',
 'and',
 'human',
 'language',
 '.',
 'It',
 'involves',
 'tasks',
 'such',
 'as',
 'text',
 'classification',
 ',',
 'sentiment',
 'analysis',
 ',',
 'machine',
 'translation',
 ',',
 'and',
 'named',
 'entity',
 'recognition',
 '.',
 'Tokenization',
 'is',
 'a',
 'fundamental',
 'step',
 'in',
 'NLP',
 ',',
 'where',
 'the',
 'input',
 'text',
 'is',
 'divided',
 'into',
 'smaller',
 'units',
 'called',
 'tokens',
 '.',
 'These',
 'tokens',
 'can',
 'be',
 'individual',
 'words',
 ',',
 'phrases',
 ',',
 'or',
 'even',
 'characters',
 '.',
 'The',
 'tokenization',
 'process',
 'serves',
 'as',
 'the',
 'foundation',
 'for',
 'various',
 'NLP',
 'tasks',
 'and',
 'allows',
 'for',
 'effective',
 'analysis',
 'and',
 'processing',
 'of',
 'textual',
 'data',
 '.']
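Once the text is a flat list of tokens, word-level analysis becomes ordinary list processing. As one possible illustration, the sketch below counts word frequencies. It makes a few assumptions: `tokens` is a hand-copied slice of the `word_token` output above (the first sentence only), the helper name `word_counts` is hypothetical, and lowercasing plus dropping punctuation is just one reasonable normalization choice.

```python
from collections import Counter
import string

# A short slice of the word_token list printed above (first sentence only).
tokens = ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a',
          'field', 'of', 'study', 'that', 'focuses', 'on', 'the',
          'interaction', 'between', 'computers', 'and', 'human',
          'language', '.']


def word_counts(tokens):
    """Count lowercased tokens, skipping tokens made only of punctuation."""
    words = [t.lower() for t in tokens
             if not all(ch in string.punctuation for ch in t)]
    return Counter(words)


counts = word_counts(tokens)
print(counts.most_common(1))  # [('language', 2)]
```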