## Creating a basic tokenizer

Description: Write a Python function to implement a basic tokenization algorithm for a given language.\
Guidelines: You may choose any language.\
Note: GUI is not required.

Brief Description: Tokenization is a fundamental step in text processing and natural language processing (NLP): it transforms raw text into smaller units (tokens) that can be analyzed.

Note: The task allows any language, so I have chosen English for this basic tokenization algorithm.

I have completed the task using four methods:

1. By using NLTK’s word_tokenize() method
2. By using str.split() method
3. By using Regex with re.findall() method
4. By using Gensim’s tokenize() method

Now, let's discuss when to use each method and implement each tokenization algorithm in Python.

## 1. By using NLTK’s word_tokenize() method

We use the NLTK library to tokenize a string into words and punctuation marks. It treats punctuation marks as separate tokens, which matters when punctuation can change the meaning of the text. Of the four methods shown here, it is the most sophisticated, and it suits projects that require detailed text analysis.

Use case:

a. Advanced natural language processing tasks.\
b. When precise tokenization is required.\
c. Handling punctuation marks in the text as separate tokens.

In [9]:
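import nltk
from nltk.tokenize import word_tokenize

# word_tokenize() needs NLTK's Punkt tokenizer data; download it once.
# Recent NLTK releases look for "punkt_tab", older ones for "punkt".
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)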

def tokenization_algorithm1(text):
    """
    Tokenizes the input text by using NLTK’s word_tokenize() method.

    Args:
      text: The input string to tokenize.

    Returns:
      A list of tokens.
    """
    return word_tokenize(text)

# Usage
text = "John and Caroline are my good friends"
tokens = tokenization_algorithm1(text)
print(tokens)

['John', 'and', 'Caroline', 'are', 'my', 'good', 'friends']
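
The sample sentence above contains no punctuation, so it does not show why separate punctuation tokens matter. As a rough illustration using only the standard library (this is not NLTK's algorithm, and both function names are hypothetical), the sketch below contrasts plain whitespace splitting with a simple regex that emits each punctuation mark as its own token:

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()

def punctuation_aware_tokens(text):
    """Return word runs and individual punctuation marks as separate tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one character
    # that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Stop, John! Are you ready?"
print(whitespace_tokens(sample))
# ['Stop,', 'John!', 'Are', 'you', 'ready?']
print(punctuation_aware_tokens(sample))
# ['Stop', ',', 'John', '!', 'Are', 'you', 'ready', '?']
```

NLTK's word_tokenize() applies more elaborate rules than this regex (for contractions, for example), so treat the sketch only as a way to see the difference in output shape.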


## 2. By using str.split() method

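This method tokenizes text stored in a pandas DataFrame by calling str.split() on a text column. With no arguments it splits on whitespace, so punctuation stays attached to the neighboring word. It is suitable for processing large amounts of text at once, because a single call tokenizes every row in the column.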

Use case:

a. Processing large amounts of text across entire columns.\
b. Dealing with large datasets in DataFrames.

In [12]:
import pandas as pd

def tokenization_algorithm2(text):
    """
    Tokenizes the input text by using str.split() method.

    Args:
      text: The input string to tokenize.

    Returns:
      A list of tokens.
    """
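    # Build the one-row DataFrame from the argument, not a hard-coded sentence.
    df = pd.DataFrame({"text": [text]})
    # With no arguments, str.split() splits each row on whitespace.
    df['tokens'] = df['text'].str.split()
    return df['tokens'][0]

# Usage
text = "John and Caroline are my good friends"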
tokens = tokenization_algorithm2(text)
print(tokens)

['John', 'and', 'Caroline', 'are', 'my', 'good', 'friends']
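
The whitespace rule that str.split() applies to every row can be seen on a plain Python string. The minimal sketch below uses only built-in string methods (no pandas), and the sample text is made up for illustration:

```python
messy = "  John   and\tCaroline\nare friends.  "

# No argument: split on any run of whitespace and ignore
# leading and trailing whitespace.
print(messy.split())
# ['John', 'and', 'Caroline', 'are', 'friends.']

# Explicit single-space separator: runs of spaces yield empty strings,
# and tabs or newlines are not treated as separators.
by_space = messy.split(" ")
print("" in by_space)
# True
```

Note that the final token keeps its period: whitespace splitting never separates punctuation from words.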


## 3. By using Regex with re.findall() method

The re.findall() function in Python's re module extracts every token that matches a pattern we define, which gives us complete control over how the text is tokenized. The pattern used here, r'\w+', matches runs of letters, digits, and underscores, so punctuation is dropped from the output.

Use case:

a. Extracting patterns like email addresses, hashtags, or other custom tokens.\
b. When complete control over token patterns is needed.

In [15]:
import re

def tokenization_algorithm3(text):
    """
    Tokenizes the input text by using Regex with re.findall() method.

    Args:
      text: The input string to tokenize.

    Returns:
      A list of tokens.
    """
    return re.findall(r'\w+', text)

# Usage
text = "John and Caroline are my good friends"
tokens = tokenization_algorithm3(text)
print(tokens)

['John', 'and', 'Caroline', 'are', 'my', 'good', 'friends']
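
To illustrate the custom-pattern use case, here is a minimal sketch that keeps hashtags and email addresses intact as single tokens. The pattern, the name custom_tokens, and the sample text are assumptions made for illustration; in particular, the email sub-pattern is deliberately simple and does not cover every valid address:

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"#\w+"                           # hashtags such as #NLP
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # simple email addresses (not RFC-complete)
    r"|\w+"                           # ordinary words
)

def custom_tokens(text):
    """Tokenize text while keeping hashtags and email addresses intact."""
    return TOKEN_PATTERN.findall(text)

sample = "Email john@example.com about #NLP today!"
print(custom_tokens(sample))
# ['Email', 'john@example.com', 'about', '#NLP', 'today']
print(re.findall(r"\w+", sample))
# ['Email', 'john', 'example', 'com', 'about', 'NLP', 'today']
```

The second call shows what the plain r'\w+' pattern does with the same text: the address is broken into three tokens and the hashtag loses its # sign.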


## 4. By using Gensim’s tokenize() method

Gensim is a Python library that is especially useful for building word vectors and topic modeling. Its tokenize() function, found in gensim.utils, splits text into tokens. It fits naturally into a Gensim pipeline, which simplifies tokenization when it is one step of a more involved text analysis.

Use case:

a. Integration with Gensim’s other functionality.\
b. Topic modeling or other text processing with Gensim.


In [18]:
from gensim.utils import tokenize

def tokenization_algorithm4(text):
    """
    Tokenizes the input text by using Gensim’s tokenize() method.

    Args:
      text: The input string to tokenize.

    Returns:
      A list of tokens.
    """
    return list(tokenize(text))

# Usage
text = "John and Caroline are my good friends"
tokens = tokenization_algorithm4(text)
print(tokens)

['John', 'and', 'Caroline', 'are', 'my', 'good', 'friends']
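
Gensim's tokenize() keeps runs of alphabetic characters, skips digits and punctuation, and accepts a lowercase flag. The sketch below is a rough standard-library approximation of that behavior, written so that it runs without Gensim installed. The name approx_gensim_tokenize is hypothetical, and the regex is an assumption that may differ from Gensim's actual implementation in edge cases:

```python
import re

# Runs of word characters that are not digits.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def approx_gensim_tokenize(text, lowercase=False):
    """Yield alphabetic tokens lazily, optionally lowercased."""
    if lowercase:
        text = text.lower()
    for match in ALPHABETIC.finditer(text):
        yield match.group()

sample = "Gensim trained 3 topic models in 2024!"
print(list(approx_gensim_tokenize(sample, lowercase=True)))
# ['gensim', 'trained', 'topic', 'models', 'in']
```

Like Gensim's function, this sketch returns a generator, which is why tokenization_algorithm4 wraps the call in list().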


### **Conclusion**

Choosing the appropriate tokenization method depends on the specific requirements, such as processing large datasets, integrating with a more advanced text analysis pipeline, or handling punctuation in a sentence. By understanding the strengths of each method, we can prepare our text data effectively for further analysis and modeling, and keep our workflows coherent, accurate, and efficient.