<a href="https://colab.research.google.com/github/MK316/Spring2024/blob/main/Corpus/TTR-and-lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌿 Topics:

## 1. **Type vs. Token**
## 2. Lexical Diversity measures (10 types)

# Part 1. Type vs. Token

Example: A cat is chasing a mouse.

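+ Tokens: A token is a single occurrence of a unit in the text. Tokens are usually words, but they can also include punctuation, numbers, and other characters, depending on the analysis. Simply put, the token count is the total number of running words in a given text.

  + 6 tokens in the given example (the final period is not counted).

+ Types: A type is a distinct word form; each form is counted once, no matter how often it occurs.

  + 5 types in the given example ("A" and "a" count as the same type when case is ignored).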
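
The counts for the example sentence can be checked with a few lines of Python. This is a minimal sketch that assumes punctuation is stripped and case is ignored; the names `example_tokens` and `example_types` are illustrative only.

```python
import string

sentence = "A cat is chasing a mouse."

# Strip punctuation and lowercase so that "A" and "a" count as one type
cleaned = sentence.translate(str.maketrans("", "", string.punctuation)).lower()
example_tokens = cleaned.split()
example_types = set(example_tokens)

print(len(example_tokens))  # 6
print(len(example_types))   # 5
```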

[text samples from Aesop fables](https://aesopsfables.org/)

In [None]:
text = "An ant went to a Mistic-fountain."
len(text)

33

> **text.split()** # split the string on whitespace (spaces, tabs, newlines)

In [None]:
step1 = text.split()
print(step1)

['An', 'ant', 'went', 'to', 'a', 'Mistic-fountain.']


In [None]:
step2 = text.split(".") # delimiter here is '.'
print(step2)

['An ant went to a Mistic-fountain', '']
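
Splitting on '.' leaves an empty string after the final period. The sketch below shows one possible way to drop such empty pieces; the variable name `sentences` is hypothetical and not part of the original notebook.

```python
text = "An ant went to a Mistic-fountain."

# Split on the period, then keep only the non-empty pieces
sentences = [piece.strip() for piece in text.split(".") if piece.strip()]
print(sentences)  # ['An ant went to a Mistic-fountain']
```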


Using a longer text

In [None]:
# This includes all characters: letters, numbers, spaces, punctuation marks, and special characters.

text = """
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.
"""

print("Number of characters: ", len(text))

Number of characters:  554


In [None]:
tokens = text.split()
len(tokens)

107

In [None]:
types = set(tokens)
len(types)


76

Define a function

In [None]:
def count_types_and_tokens(text):
    tokens = text.split()
    types = set(tokens)
    return len(types), len(tokens)

In [None]:
# Example text

num_types, num_tokens = count_types_and_tokens(text)
print("Number of types:", num_types)
print("Number of tokens:", num_tokens)

Number of types: 76
Number of tokens: 107
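
Because text.split() only splits on whitespace, forms such as 'ant,' and 'ant', or 'The' and 'the', are counted as different types. The sketch below is one possible variant that lowercases tokens and strips surrounding punctuation before counting. `count_types_and_tokens_normalized` is a hypothetical helper, and the normalization choices are assumptions rather than the notebook's method.

```python
import string

def count_types_and_tokens_normalized(text):
    """Count types and tokens after lowercasing and stripping edge punctuation."""
    tokens = []
    for raw in text.split():
        cleaned = raw.strip(string.punctuation).lower()
        if cleaned:  # skip pieces that consisted of punctuation only
            tokens.append(cleaned)
    return len(set(tokens)), len(tokens)

sample = "The ant saw the dove, and the dove saw the ant."
print(count_types_and_tokens_normalized(sample))  # (5, 11)
print(len(set(sample.split())))                   # 8 types without normalization
```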


# Lemmatization

+ lemma: a dictionary form or base form of a set of words.
+ example: 'run, runs, running, ran' => 'run'
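
To see why lemmatization matters when counting types, here is a toy sketch. The lookup table `toy_lemmas` is hand-made for this example only and is not a real lemmatizer.

```python
# Hand-made lookup table for illustration only; not a real lemmatizer
toy_lemmas = {"run": "run", "runs": "run", "running": "run", "ran": "run"}

words = ["run", "runs", "running", "ran"]
lemmas = [toy_lemmas.get(w, w) for w in words]

print(len(set(words)))   # 4 distinct word forms
print(len(set(lemmas)))  # 1 lemma
```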

We will use modules from the nltk package.

In [None]:
!pip install nltk



In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download the required NLTK resources (if not already downloaded):
# WordNet, the POS tagger model, and the Punkt tokenizer
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True


The function below, get_wordnet_pos, maps the part-of-speech (POS) tags returned by NLTK's pos_tag function to the POS constants that the WordNet lemmatizer accepts. This mapping matters for accurate lemmatization because the lemmatizer treats every word as a noun unless it is told otherwise. Note that the function tags each word on its own, without sentence context, so ambiguous forms can be mis-tagged and left unchanged.

In [None]:
# Define get_wordnet_pos(word)

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)
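
The mapping step can also be seen in isolation, without calling the tagger. In the sketch below, `penn_to_wordnet` is a hypothetical helper that takes a Penn Treebank tag string directly. It relies on the fact that wordnet.ADJ, wordnet.NOUN, wordnet.VERB, and wordnet.ADV are the one-letter strings 'a', 'n', 'v', and 'r'.

```python
# WordNet POS constants written out as plain strings (wordnet.ADJ == "a", etc.)
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    # Only the first letter of the Penn tag matters; unknown tags fall back to noun
    return WORDNET_POS.get(penn_tag[:1].upper(), "n")

for tag in ["NNS", "VBG", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four groups, such as the determiner tag DT, fall back to noun, which mirrors the default in get_wordnet_pos.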

In [None]:
sentence = "The cats are running faster than the dogs"

In [None]:
# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Lemmatization using POS tags
lemmatized_output = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

print('Original Sentence:', sentence)
print('Lemmatized Sentence:', ' '.join(lemmatized_output))

Original Sentence: The cats are running faster than the dogs
Lemmatized Sentence: The cat be run faster than the dog


### Lemmatization practice with our text

In [None]:
text = """
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.
"""

In [None]:
# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize the sentence
tokens = nltk.word_tokenize(text)

# Lemmatization using POS tags
lemmatized_output = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

print('Original Sentence:', text)
print('Lemmatized Sentence:', ' '.join(lemmatized_output))

Original Sentence: 
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.

Lemmatized Sentence: An ant go to a fountain to quench his thirst and , tumble in , be almost drown . But a dove that happen to be sit on a neighbor tree saw the ant 's danger and , pluck off a leaf , let it drop into the water before him . The ant mount upon it , be presently waft safely ashore . Just at that time , a fowler be spread his net and be in the act of ensnare the dove , when the ant , perceive his object , bit

In [None]:
len(lemmatized_output)

124

Let's compare the tokens, types, and lemmatized output of the text.

In [None]:
text = """
An ant went to a fountain to quench his thirst and, tumbling in, was almost drowned. But a dove that happened to be sitting on a neighboring tree saw the ant's danger and, plucking off a leaf, let it drop into the water before him. The ant mounting upon it, was presently wafted safely ashore.
Just at that time, a fowler was spreading his net and was in the act of ensnaring the dove, when the ant, perceiving his object, bit his heel. The start this gave the man made him drop his net and the dove, aroused to a sense of her danger, flew safely away.
"""

In [None]:
tokens = text.split()
print(len(tokens))
print(len(lemmatized_output))

107
124
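
The two counts differ because text.split() keeps punctuation attached to words, while nltk.word_tokenize separates punctuation marks (and clitics such as 's) into their own tokens. The sketch below imitates that behaviour with a simple regular expression. It is only an approximation for illustration, not NLTK's actual tokenizer.

```python
import re

first_sentence = ("An ant went to a fountain to quench his thirst and, "
                  "tumbling in, was almost drowned.")

whitespace_tokens = first_sentence.split()
# Match whole words, or any single character that is neither a word character nor whitespace
regex_tokens = re.findall(r"\w+|[^\w\s]", first_sentence)

print(len(whitespace_tokens))  # 16
print(len(regex_tokens))       # 19 (16 words + 2 commas + 1 period)
```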


Types in order of first appearance in the text

In [None]:
# Assuming 'tokens' is already defined

types_in_order = []
seen = set()

for token in tokens:
    if token not in seen:
        seen.add(token)
        types_in_order.append(token)

# Now 'types_in_order' contains unique elements from 'tokens' in the order they appear in the text
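
The same result can be obtained more compactly. This sketch relies on the fact that dictionaries preserve insertion order in Python 3.7 and later; `sample_tokens` is a made-up list for illustration.

```python
sample_tokens = ["the", "ant", "and", "the", "dove", "and", "the", "net"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
sample_types_in_order = list(dict.fromkeys(sample_tokens))
print(sample_types_in_order)  # ['the', 'ant', 'and', 'dove', 'net']
```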


In [None]:
# Check the lengths of tokens, types_in_order, and lemmatized_output before building a DataFrame

print(len(tokens))
print(len(types_in_order))
print(len(lemmatized_output))

107
76
124


In [None]:
!pip install pandas



In [None]:
import pandas as pd

# Assuming tokens, types_in_order, and lemmatized_output are already defined
# and their lengths are 107, 76, 124 respectively

# Pad types_in_order and tokens with None so that all three columns have the same length.
# Note: extend() modifies both lists in place, so len(tokens) is 124 after this cell.
types_in_order.extend([None] * (len(lemmatized_output) - len(types_in_order)))
tokens.extend([None] * (len(lemmatized_output) - len(tokens)))

# Create the DataFrame
df = pd.DataFrame({
    'Tokens': tokens,
    'Types': types_in_order,
    'Lemmatized Output': lemmatized_output
})

df[1:20]
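
The padding step can also be done without changing the original lists. This sketch uses itertools.zip_longest from the standard library. The short lists are made-up stand-ins for tokens, types_in_order, and lemmatized_output.

```python
from itertools import zip_longest

# Made-up stand-ins for the three lists of unequal length
demo_tokens = ["The", "cats", "are", "running"]
demo_types = ["The", "cats"]
demo_lemmas = ["The", "cat", "be", "run", "."]

# zip_longest fills the shorter lists with None and leaves the originals unchanged
rows = list(zip_longest(demo_tokens, demo_types, demo_lemmas))
for row in rows:
    print(row)

print(len(demo_tokens))  # still 4
```

The resulting list of tuples could then be passed to pd.DataFrame together with a `columns` argument, so the original lists keep their true lengths for later calculations.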


## TTR (Type-to-Token Ratio)

In [None]:
# Recompute tokens and types from the text. The 'tokens' list was padded with None
# in the DataFrame cell above, so its length there is 124, not the true token count.
tokens = text.split()
types = set(tokens)
number_of_types = len(types)  # Number of unique words (76)
number_of_tokens = len(tokens)  # Total number of words (107)

# Calculate TTR
TTR = number_of_types / number_of_tokens

print("Type-Token Ratio (TTR):", round(TTR, 4))


Type-Token Ratio (TTR): 0.7103
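
As a reusable sketch, the ratio can be wrapped in a small function. `type_token_ratio` is a hypothetical helper, and lowercasing is an assumption about how types should be counted. The example reuses the sentence from Part 1 with its punctuation already removed.

```python
def type_token_ratio(tokens, lowercase=True):
    """Return types / tokens for a list of word tokens (0.0 for an empty list)."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

example = "A cat is chasing a mouse".split()
print(round(type_token_ratio(example), 3))  # 0.833 (5 types / 6 tokens)
```

TTR tends to fall as a text gets longer, since common words repeat, so it is most informative when comparing texts of similar length.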
