# **My Tokenizer**

In this assignment, you are asked to create your own word tokenizer without the help of external tokenizers. The steps of the assignment are:
1. Choose one of the corpora from the given nltk.corpus list and assign it to corpus_name
1. Create your tokenizer in the code block and tokenize the selected corpus into token_list
1. Give the raw corpus text, corpus_raw, and your token list to the evaluation block

Splitting only on whitespace is not enough. Try at least two other improvements to the tokenization. Please write sufficient comments to show your reasoning.
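
To see why whitespace splitting alone falls short, here is a minimal sketch. The sample sentence is a shortened, illustrative one in the style of the Gutenberg texts and is not part of the assignment:

```python
# Whitespace splitting leaves punctuation glued to the neighbouring word.
sample = "Emma Woodhouse, handsome, clever, and rich, seemed to unite some of the best blessings."

whitespace_tokens = sample.split()
print(whitespace_tokens[:4])  # ['Emma', 'Woodhouse,', 'handsome,', 'clever,']
```

Tokens such as `'Woodhouse,'` would be counted as different from `'Woodhouse'`, which is one reason further rules are needed.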

## Rules
### Allowed:
 - Choosing a top-down tokenizer or bottom-up tokenizer
 - Using regular expressions library (import re)
 - Adding additional coding blocks
 - Using an additional dataset if you are creating a bottom-up tokenizer, provided that the code can still be run standalone.

### Not allowed:
 - Using tokenizer libraries such as nltk.tokenize, or any other external libraries to tokenize.
 - Changing the contents of the evaluation block at the end of the notebook.

## Assignment Report
Please write a short assignment report at the end of the notebook (max 500 words). Please include all of the following points in the report:
 - Corpus name and the selection reason
 - Design of the tokenizer and reasoning
 - Challenges you have faced while writing the tokenizer and challenges with the specific corpus
 - Limitations of your approach
 - Possible improvements to the system

## Grading
You will be graded on the following criteria:
 - running complete code (0.5),
 - tokenizer algorithm (2),
 - clear commenting (0.5),
 - evaluation score - comparison with nltk word tokenizer (at most 1 point),
 - assignment report (1).

## Submission

Submission will be made to SUCourse. Please submit your file using the following naming convention.


`studentid_studentname_tokenizer.ipynb  - ex. 26744_aysegulrana_tokenizer.ipynb`


**Deadline is October 22nd, 5pm.**

In [1]:
import nltk
import re

In [2]:
# Handle possessives
def handle_possessives(text):
    """Detach the possessive marker from its word, e.g. "john's" -> "john 's"."""
    # The trailing \b ensures that only a word-final 's is split off.
    text = re.sub(r"(\w+)'s\b", r"\1 's", text)
    return text



In [3]:
def my_tokenizer(corpus_raw):
    """
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus to be tokenized
    rtype: list
    return: a list of tokens extracted from the corpus_raw
    """

    # Case folding: the evaluation block lowercases the corpus before running
    # NLTK, so lowercase here to compare like with like.
    corpus_raw = corpus_raw.lower()

    # Contraction table. Keys and expansions are all lowercase because the
    # text has already been lowercased above.
    contractions = {
        "ain't": "is not",
        "aren't": "are not",
        "can't": "cannot",
        "could've": "could have",
        "couldn't": "could not",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he would",
        "he'll": "he will",
        "he's": "he is",
        "how'd": "how did",
        "how'll": "how will",
        "how's": "how is",
        "i'd": "i would",
        "i'll": "i will",
        "i'm": "i am",
        "i've": "i have",
        "isn't": "is not",
        "it'd": "it would",
        "it'll": "it will",
        "it's": "it is",
        "let's": "let us",
        "ma'am": "madam",
        "mightn't": "might not",
        "might've": "might have",
        "mustn't": "must not",
        "must've": "must have",
        "needn't": "need not",
        "oughtn't": "ought not",
        "shan't": "shall not",
        "she'd": "she would",
        "she'll": "she will",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "that'd": "that would",
        "that's": "that is",
        "there'd": "there would",
        "there's": "there is",
        "they'd": "they would",
        "they'll": "they will",
        "they're": "they are",
        "they've": "they have",
        "wasn't": "was not",
        "we'd": "we would",
        "we'll": "we will",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what will",
        "what're": "what are",
        "what's": "what is",
        "what've": "what have",
        "where's": "where is",
        "who'd": "who would",
        "who'll": "who will",
        "who's": "who is",
        "won't": "will not",
        "wouldn't": "would not",
        "you'd": "you would",
        "you'll": "you will",
        "you're": "you are",
        "you've": "you have"
    }

    # Expand contractions before tokenizing so that the apostrophe does not
    # leave fragments such as 'don' and 't' behind.
    for contraction, expansion in contractions.items():
        # re.escape keeps the key literal if it ever contains regex metacharacters.
        corpus_raw = re.sub(r'\b' + re.escape(contraction) + r'\b', expansion, corpus_raw)

    # Detach the possessive marker ("john's" -> "john 's").
    corpus_raw = handle_possessives(corpus_raw)

    # This regex matches words, numbers, and a fixed set of punctuation marks.
    # Every other character (apostrophes, hyphens, quotes, whitespace) is
    # dropped and acts as a separator, so "john 's" yields 'john', 's' and
    # "twenty-one" yields 'twenty', 'one'.
    token_list = re.findall(r"\b\w+\b|[.,!?;:()]", corpus_raw)

    # write your tokenizer here and apply to corpus_raw. Return the resulting token_list.
    # you are NOT allowed to use external tokenizers such as word_tokenize from nltk.
    # Only splitting on whitespace is not enough. At least try two other improvements on the tokenization.

    return token_list

You are allowed to add code blocks above to use for your tokenizer or evaluate it.



In [4]:
#main code to run your tokenizer.

#import your libraries here
import nltk
import re
from nltk.corpus import gutenberg


#select the corpus name from the list below
#gutenberg, webtext, reuters, product_reviews_2

corpus_name = 'gutenberg'

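#download the corpus and import it.
#(if the corpus is not available locally yet, run nltk.download('gutenberg') first)

#get the raw text output of the corpus to the corpus_raw variable
#by concatenating the raw text of every file in the corpus.
corpus_raw = ""
for file in gutenberg.fileids():
    corpus_raw += gutenberg.raw(file)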

#call your tokenizer method
my_tokenized_list = my_tokenizer(corpus_raw)

print(my_tokenized_list[:100])  # Display the first 100 tokens


['emma', 'by', 'jane', 'austen', '1816', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.', 'she', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', 's', 'marriage', ',', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', '.', 'her', 'mother', 'had', 'died', 'too']
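
The lone 's' after 'sister' and the split 'twenty', 'one' in the output above both follow from the token pattern, which has no alternative for apostrophes or hyphens. The sketch below applies the same pattern to a short sample string; the sample is illustrative, and `TOKEN_PATTERN` is just a local name for the pattern used in `my_tokenizer`:

```python
import re

# Same token pattern as in my_tokenizer: runs of word characters,
# or one of a few punctuation marks.
TOKEN_PATTERN = r"\b\w+\b|[.,!?;:()]"

# The possessive has already been detached, as handle_possessives would do.
sample = "her sister 's marriage, nearly twenty-one years"

tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)
# ['her', 'sister', 's', 'marriage', ',', 'nearly', 'twenty', 'one', 'years']
```

The apostrophe and the hyphen are silently dropped, so the possessive marker comes out as a bare 's' and the hyphenated number is split in two.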


## Please do not touch the code below that will evaluate your tokenizer with the nltk word tokenizer. You will get zero points from evaluation if you do so.

In [5]:
def similarity_score(set_a, set_b):
    '''
    type set_a: set
    param set_a: The first set to be compared
    type set_b: set
    param set_b: The tokens extracted from the corpus_raw
    rtype: float
    return: similarity score with two sets using Jaccard similarity.
    '''

    jaccard_similarity = float(len(set_a.intersection(set_b)) / len(set_a.union(set_b)))

    return jaccard_similarity

In [6]:
from nltk import word_tokenize
nltk.download('punkt')
from nltk import punkt

def evaluation(corpus_raw, token_list):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus
    type token_list: list
    param token_list: The tokens extracted from the corpus_raw
    rtype: float
    return: comparison score with the given token list and the nltk tokenizer.
    '''

    #The comparison score only looks at the tokens but not the frequencies of the tokens.
    #we assume case folding is already applied to the token_list
    corpus_raw = corpus_raw.lower()
    nltk_tokens = word_tokenize(corpus_raw, language='english')

    score = similarity_score(set(token_list), set(nltk_tokens))

    return score

[nltk_data] Downloading package punkt to /Users/efeguclu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
#Evaluation

eval_score = evaluation(corpus_raw, my_tokenized_list)

print('The similarity score is {:.2f}'.format(eval_score))

The similarity score is 0.77
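
Because the evaluation converts both token lists to sets, the score compares token types and ignores frequencies. Below is a minimal worked example of the Jaccard formula on two hypothetical token sets for the text "don't stop". The second set mimics the Treebank-style split that NLTK is expected to produce; this is an assumption for illustration, not output captured from this notebook:

```python
# Hypothetical token sets for the text "don't stop".
mine = {"do", "not", "stop"}       # contraction expanded, as in my_tokenizer
nltk_like = {"do", "n't", "stop"}  # assumed Treebank-style split

shared = mine & nltk_like          # tokens present in both sets
combined = mine | nltk_like        # tokens present in either set

score = len(shared) / len(combined)
print(f"{score:.2f}")  # 0.50
```

Under this assumption, every contraction that is expanded adds a type NLTK does not produce ("not") and misses one it does ("n't"). The effect on the full corpus is far smaller than in this toy case, since the sets there hold many thousands of types.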


Please write your report below using clear headlines with markdown syntax.

In [8]:
#CORPUS NAME AND SELECTION REASON
#For this homework, I chose the Gutenberg corpus. It is a well-known collection of classic texts and presents distinctive challenges: archaic language, complex sentence structure, and heavy use of contractions.

#DESIGN OF TOKENIZER AND REASONING
#All text is lowercased to stay consistent with the similarity checker and to prevent token mismatches caused by capitalization.
#Contractions are expanded to their full forms (for instance, "don't" becomes "do not") so that the apostrophe does not leave stray fragments behind.
#Possessives are split so that "John's" is tokenized into two tokens, the base word and the possessive marker, which is closer to how NLTK treats them.
#Punctuation marks are emitted as separate tokens, since this allows for fine-grained tokenization.
#Hyphenated words are split into individual tokens so that their parts are not merged into one token; the assumption was that NLTK may treat them as separate words as well.
#Extra spaces and newlines are skipped, so whitespace never ends up inside a token.

#CHALLENGES FACED
#The main challenge was making the tokenizer behave like NLTK, especially with regard to punctuation and contractions.
#Archaic language necessitated careful handling of contractions, many of which differ from modern conventions. Balancing the treatment of punctuation, where some marks should be separated and others kept within tokens, also proved complex.
#This required extra rules for possessives and abbreviations, so that words like "it's" and "Dr." would be tokenized correctly.
#Another challenge is tokenizing the entire corpus: with such a large volume of text, there are performance concerns, and edge cases are easily missed.

#LIMITATIONS OF APPROACH
#The main limitation of the approach at this stage is that certain edge cases, such as complex abbreviations and hyphenated phrases, are not handled the way NLTK might handle them.
#While the tokenizer does a good job of separating punctuation, it does not always match NLTK's treatment of punctuation in context, especially quotes and dashes.

#POSSIBLE IMPROVEMENTS
#More advanced regular expressions could handle complex abbreviations, decimals, and special punctuation. The more context-sensitive rules are added, the better the tokenizer will handle edge cases.
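
One possible direction for these improvements is sketched below. It is not the submitted tokenizer. The names `EXTENDED_PATTERN` and `extended_tokenize`, the short abbreviation list, and the choice of clitics are all assumptions made for illustration:

```python
import re

# Hypothetical extended pattern. Alternatives are tried left to right,
# so the more specific ones come first.
EXTENDED_PATTERN = re.compile(
    r"""
      \b(?:mr|mrs|dr|st)\.      # small, assumed list of abbreviations kept with their period
    | \d+(?:[.,]\d+)*\b         # numbers with decimal points or thousands separators
    | '(?:s|d|m|ll|re|ve)\b     # clitics such as 's kept as their own token
    | \w+(?:-\w+)*              # words, including hyphenated compounds
    | --                        # double hyphen used as a dash
    | [.,!?;:()"']              # single punctuation marks
    """,
    re.VERBOSE,
)

def extended_tokenize(text):
    """Lowercase the text and return all matches of the extended pattern."""
    return EXTENDED_PATTERN.findall(text.lower())

print(extended_tokenize("Dr. Smith's well-being cost 3.50 pounds."))
# ['dr.', 'smith', "'s", 'well-being', 'cost', '3.50', 'pounds', '.']
```

Whether each of these choices actually raises the similarity score would need to be checked against the evaluation block, since the sketch has not been run on the full corpus.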




