
# Task
Provide Python code to perform tokenization, stemming, and lemmatization on an input sentence without using external natural language processing libraries.

## Tokenization

### Subtask:
Implement a function to split the input sentence into words (tokens) based on whitespace and punctuation.


**Reasoning**:
Define a function to tokenize the input sentence based on whitespace and punctuation.



In [9]:
import string

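def simple_tokenize(sentence):
    """Split a sentence into word tokens and single-character punctuation tokens."""
    tokens = []
    current_token = ""  # characters of the word currently being built
    punctuation = string.punctuation  # ASCII punctuation only

    for char in sentence:
        if char.isspace():
            # Whitespace ends the current word and is itself discarded.
            if current_token:
                tokens.append(current_token)
            current_token = ""
        elif char in punctuation:
            # Punctuation ends the current word and becomes its own token.
            if current_token:
                tokens.append(current_token)
            tokens.append(char)
            current_token = ""
        else:
            current_token += char

    # Flush the last word if the sentence does not end in whitespace or punctuation.
    if current_token:
        tokens.append(current_token)

    return tokens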

# Example usage:
sentence = "Hello, world! This is a test sentence."
tokenized_sentence = simple_tokenize(sentence)
print(tokenized_sentence)

['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', 'sentence', '.']
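
For comparison, the same splitting rule can be sketched with the standard-library `re` module, which is not an NLP library. The name `regex_tokenize` and the pattern are illustrative assumptions, not part of the notebook. On ordinary text it should give the same tokens as `simple_tokenize`, and the second call shows that this rule splits contractions and hyphenated words into separate tokens.

```python
import re
import string

# Escape the ASCII punctuation so it can be placed inside a character class.
_PUNCT = re.escape(string.punctuation)

# Either a run of characters that are neither whitespace nor punctuation,
# or exactly one punctuation character. Whitespace is never matched, so it is dropped.
_TOKEN_RE = re.compile(rf"[^\s{_PUNCT}]+|[{_PUNCT}]")


def regex_tokenize(sentence):
    """Hypothetical regex counterpart of simple_tokenize."""
    return _TOKEN_RE.findall(sentence)


print(regex_tokenize("Hello, world! This is a test sentence."))
print(regex_tokenize("Don't stop-now"))
```

If contractions should stay in one piece, the apostrophe would need its own rule; neither version attempts that.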


## Stemming

### Subtask:
Implement a basic stemming algorithm to reduce words to their root form. This version strips a small set of common suffixes (`es`, `s`, `ed`, `ing`, `ly`); prefixes are not handled.


**Reasoning**:
Implement the `simple_stemmer` function as an ordered chain of suffix checks, where only the first matching suffix is removed, and test it with example words.



In [10]:
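def simple_stemmer(word):
    """Strip at most one common suffix from a word and return the lowercase stem."""
    word = word.lower()  # Convert to lowercase for consistent stemming

    # Only the first matching rule applies, so 'es' is tested before 's'.
    # There is no minimum stem length, so short words are cut too ('is' -> 'i').
    if word.endswith('es'):
        word = word[:-2]
    elif word.endswith('s'):
        word = word[:-1]
    elif word.endswith('ed'):
        word = word[:-2]
    elif word.endswith('ing'):
        word = word[:-3]
    elif word.endswith('ly'):
        word = word[:-2]

    # Add more suffix removal rules as needed for a more comprehensive stemmer

    return word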

# Test cases
words_to_stem = ["running", "jumps", "jumped", "easily", "flies", "happinesses"]
stemmed_words = [simple_stemmer(word) for word in words_to_stem]
print(stemmed_words)

['runn', 'jump', 'jump', 'easi', 'fli', 'happiness']
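
The output above shows typical over-stemming: `runn`, `easi`, and `fli` are not words. A minimal sketch of how a few extra rules could reduce this is given below. The rule table, the minimum result length, and the name `rule_stemmer` are illustrative assumptions; this is neither the notebook's stemmer nor a standard algorithm such as Porter's.

```python
# (suffix, replacement) pairs, ordered longest-first so that for example
# "sses" is tried before "es" and "s". Purely illustrative.
SUFFIX_RULES = [
    ("sses", "ss"),  # happinesses -> happiness
    ("ies", "y"),    # flies -> fly
    ("ily", "y"),    # easily -> easy
    ("ing", ""),     # running -> runn, undoubled to run below
    ("ed", ""),      # jumped -> jump
    ("ly", ""),      # quickly -> quick
    ("es", ""),      # boxes -> box
    ("ss", "ss"),    # leave words such as "class" unchanged
    ("s", ""),       # jumps -> jump
]
MIN_RESULT_LEN = 3  # assumed threshold: never return a stem shorter than this


def rule_stemmer(word):
    """Hypothetical suffix stemmer with replacement rules and a length guard."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Undo consonant doubling left behind by "-ing" / "-ed".
            if (suffix in ("ing", "ed") and len(stem) >= 2
                    and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            result = stem + replacement
            # Only the first matching rule is considered; if its result is
            # too short, the word is returned unchanged (e.g. "is").
            return result if len(result) >= MIN_RESULT_LEN else word
    return word


words = ["running", "jumps", "jumped", "easily", "flies", "happinesses", "is"]
print([rule_stemmer(w) for w in words])
# ['run', 'jump', 'jump', 'easy', 'fly', 'happiness', 'is']
```

These rules are still naive and will mis-stem other words (for example, `this` becomes `thi`), so they only illustrate the direction a fuller stemmer would take.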


## Lemmatization

### Subtask:
Implement a basic lemmatization algorithm to reduce words to their base or dictionary form. This will require some form of lookup or rule-based system to handle different word forms.


**Reasoning**:
Implement the `simple_lemmatizer` function with a lookup table that maps irregular and inflected forms to their lemmas. Words that are not in the table are returned lowercased and otherwise unchanged.



In [11]:
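def simple_lemmatizer(word):
    """Return the dictionary form of a word, or the lowercased word if it is not in the table."""
    # Maps inflected or irregular forms to their lemma.
    lemma_lookup = {
        "is": "be",
        "are": "be",
        "was": "be",
        "were": "be",
        "am": "be",
        "running": "run",
        "ran": "run",
        "better": "good",
        "best": "good",
        "go": "go",
        "goes": "go",
        "went": "go",
        "gone": "go",
        "going": "go"
    }

    word_lower = word.lower()

    # Fall back to the lowercased input for unknown words.
    return lemma_lookup.get(word_lower, word_lower)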

# Example usage:
words_to_lemmatize = ["is", "are", "running", "better", "goes", "quickly", "cat"]
lemmatized_words = [simple_lemmatizer(word) for word in words_to_lemmatize]
print(lemmatized_words)

['be', 'be', 'run', 'good', 'go', 'quickly', 'cat']
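
As the table grows, it may be easier to maintain when grouped by lemma and inverted once at import time. The sketch below uses the same entries as the notebook's table; the names `LEMMA_FORMS`, `FORM_TO_LEMMA`, and `table_lemmatizer` are hypothetical.

```python
# Forms grouped by lemma: adding a new verb means adding one line.
LEMMA_FORMS = {
    "be": ["is", "are", "was", "were", "am"],
    "run": ["running", "ran"],
    "good": ["better", "best"],
    "go": ["goes", "went", "gone", "going"],
}

# Invert once into the form -> lemma mapping used for lookups.
FORM_TO_LEMMA = {
    form: lemma
    for lemma, forms in LEMMA_FORMS.items()
    for form in forms
}


def table_lemmatizer(word):
    """Hypothetical variant of simple_lemmatizer backed by the inverted table."""
    word_lower = word.lower()
    # Unknown words fall through unchanged, exactly as in simple_lemmatizer.
    return FORM_TO_LEMMA.get(word_lower, word_lower)


words = ["is", "are", "running", "better", "goes", "quickly", "cat"]
print([table_lemmatizer(w) for w in words])
# ['be', 'be', 'run', 'good', 'go', 'quickly', 'cat']
```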


## Combine steps

### Subtask:
Create a script or function that takes an input sentence and applies the tokenization, stemming, and lemmatization functions in sequence.


**Reasoning**:
Define a function to combine tokenization, stemming, and lemmatization, then apply it to an example sentence and print the result.



In [12]:
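def process_sentence(sentence):
    """Tokenize a sentence, then stem and lemmatize each token, in that order."""
    tokens = simple_tokenize(sentence)
    processed_tokens = []
    for token in tokens:
        # The lemmatizer receives the stem, not the original token, so lookup
        # entries such as 'is' or 'running' are never matched
        # ('is' -> 'i', 'running' -> 'runn').
        stemmed_token = simple_stemmer(token)
        lemmatized_token = simple_lemmatizer(stemmed_token)
        processed_tokens.append(lemmatized_token)
    return processed_tokens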

# Example usage:
sample_sentence = "Running is better than jumping and goes quickly."
processed_result = process_sentence(sample_sentence)
print(processed_result)

['runn', 'i', 'good', 'than', 'jump', 'and', 'go', 'quick', '.']


**Reasoning**:
The `process_sentence` cell above only runs if `simple_tokenize`, `simple_stemmer`, and `simple_lemmatizer` are already defined in the current execution environment; in a fresh runtime it fails with a `NameError`. The cell below repeats those definitions so that the whole pipeline is self-contained.



In [13]:
import string

def simple_tokenize(sentence):
    """Split a sentence into word tokens and single-character punctuation tokens."""
    tokens = []
    current_token = ""
    punctuation = string.punctuation

    for char in sentence:
        if char.isspace():
            if current_token:
                tokens.append(current_token)
            current_token = ""
        elif char in punctuation:
            if current_token:
                tokens.append(current_token)
            tokens.append(char)
            current_token = ""
        else:
            current_token += char

    if current_token:
        tokens.append(current_token)

    return tokens

def simple_stemmer(word):
    """Strip at most one common suffix from a word and return the lowercase stem."""
    word = word.lower()  # Convert to lowercase for consistent stemming

    if word.endswith('es'):
        word = word[:-2]
    elif word.endswith('s'):
        word = word[:-1]
    elif word.endswith('ed'):
        word = word[:-2]
    elif word.endswith('ing'):
        word = word[:-3]
    elif word.endswith('ly'):
        word = word[:-2]

    # Add more suffix removal rules as needed for a more comprehensive stemmer

    return word

def simple_lemmatizer(word):
    """Return the dictionary form of a word, or the lowercased word if it is not in the table."""
    lemma_lookup = {
        "is": "be",
        "are": "be",
        "was": "be",
        "were": "be",
        "am": "be",
        "running": "run",
        "ran": "run",
        "better": "good",
        "best": "good",
        "go": "go",
        "goes": "go",
        "went": "go",
        "gone": "go",
        "going": "go"
    }

    word_lower = word.lower()

    return lemma_lookup.get(word_lower, word_lower)


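def process_sentence(sentence):
    """Tokenize a sentence, then stem and lemmatize each token, in that order."""
    tokens = simple_tokenize(sentence)
    processed_tokens = []
    for token in tokens:
        # The lemmatizer receives the stem, not the original token, so lookup
        # entries such as 'is' or 'running' are never matched
        # ('is' -> 'i', 'running' -> 'runn').
        stemmed_token = simple_stemmer(token)
        lemmatized_token = simple_lemmatizer(stemmed_token)
        processed_tokens.append(lemmatized_token)
    return processed_tokens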

# Example usage:
sample_sentence = "Running is better than jumping and goes quickly."
processed_result = process_sentence(sample_sentence)
print(processed_result)

['runn', 'i', 'good', 'than', 'jump', 'and', 'go', 'quick', '.']
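
The `'runn'` and `'i'` in this output come from the ordering: the stemmer runs first, so `running` and `is` no longer match the lemma table by the time the lookup happens. One possible alternative, sketched below, is to consult the table first and stem only the words it does not contain. The names `tokenize`, `stem`, `LEMMA_LOOKUP`, and `process_sentence_lookup_first` are hypothetical; the helpers are compact restatements of the notebook's rules, included only so that the block runs on its own.

```python
import string

# Same entries as the lookup table in simple_lemmatizer.
LEMMA_LOOKUP = {
    "is": "be", "are": "be", "was": "be", "were": "be", "am": "be",
    "running": "run", "ran": "run", "better": "good", "best": "good",
    "go": "go", "goes": "go", "went": "go", "gone": "go", "going": "go",
}


def tokenize(sentence):
    """Compact restatement of simple_tokenize."""
    tokens, current = [], ""
    for char in sentence:
        if char.isspace() or char in string.punctuation:
            if current:
                tokens.append(current)
                current = ""
            if not char.isspace():
                tokens.append(char)  # punctuation is kept as its own token
        else:
            current += char
    if current:
        tokens.append(current)
    return tokens


def stem(word):
    """Same suffixes, tried in the same order, as simple_stemmer."""
    word = word.lower()
    for suffix in ("es", "s", "ed", "ing", "ly"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def process_sentence_lookup_first(sentence):
    """Use the lemma table when the token is known, otherwise fall back to stemming."""
    processed = []
    for token in tokenize(sentence):
        key = token.lower()
        if key in LEMMA_LOOKUP:
            processed.append(LEMMA_LOOKUP[key])
        else:
            processed.append(stem(key))
    return processed


print(process_sentence_lookup_first("Running is better than jumping and goes quickly."))
# ['run', 'be', 'good', 'than', 'jump', 'and', 'go', 'quick', '.']
```

This is a different design from applying both steps in sequence: each token is either lemmatized or stemmed, never both, so it only helps for words that appear in the table.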
