<div align="center">

# Python & Control Flow
## Related to NLP
## Computational Statistics and NLP Lab

**Supervised By:** Abu Omayed  
**Email:** [abuomayed@gmail.com](mailto:abuomayed@gmail.com)

---

</div>

**String Manipulation, Tokenization & Cleaning**

`The "Smart" Tokenizer`

**Context:** Splitting by space `split(" ")` is easy, but it leaves punctuation attached to words `(e.g., "hello!" becomes ['hello!'])`.

**Task:** Write a function that splits a sentence into words but keeps punctuation as separate tokens.

**Input:** `"Hi, logic!"`

**Output:** `['Hi', ',', 'logic', '!']`
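One possible approach, sketched under the assumption that "punctuation" means the characters in Python's `string.punctuation` and that whitespace never becomes part of a token. The function name `smart_tokenize` is hypothetical.

```python
import string

def smart_tokenize(sentence):
    """Split on whitespace and emit each punctuation mark as its own token."""
    tokens = []
    current = ""  # characters of the word currently being built
    for ch in sentence:
        if ch.isspace():
            # Whitespace ends the current word and is discarded.
            if current:
                tokens.append(current)
                current = ""
        elif ch in string.punctuation:
            # Punctuation ends the current word and becomes its own token.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(smart_tokenize("Hi, logic!"))  # ['Hi', ',', 'logic', '!']
```

Note that this sketch also splits apostrophes and hyphens, so `"don't"` becomes three tokens; whether that is desirable depends on the task.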

`Stopword Removal`

**Context:** In NLP, we often remove common words like `"the", "is", "at"`.

**Task:** Given a sentence and a list of "stopwords" `['is', 'the', 'at', 'on']`, return a filtered sentence string.

**Input:** `"The cat is on the mat"`

**Output:** `"cat mat"` (Case insensitive matching required).

`Stemming Simulator (Suffix Stripper)`

**Context:** Reducing words to their root form (e.g., "running" $\rightarrow$ "run").

**Task:** Create a loop that iterates through a list of words. If a word ends with `"ing", "ly", or "ed"`, remove that suffix. Ensure the remaining root is `at least 3 characters long`.

**Input:** `["playing", "lovely", "red", "bed"]`

**Output:** `["play", "love", "red", "bed"]`
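A minimal sketch of the suffix-stripping loop. It assumes that at most one suffix is removed per word and that suffixes are tried in the order given; `strip_suffixes` and its parameters are hypothetical names.

```python
def strip_suffixes(words, suffixes=("ing", "ly", "ed"), min_root=3):
    """Remove one known suffix per word if the remaining root is long enough."""
    stemmed = []
    for word in words:
        for suffix in suffixes:
            root = word[: -len(suffix)]
            # Strip only when the suffix matches and the root stays >= min_root.
            if word.endswith(suffix) and len(root) >= min_root:
                word = root
                break
        stemmed.append(word)
    return stemmed

print(strip_suffixes(["playing", "lovely", "red", "bed"]))
# ['play', 'love', 'red', 'bed']
```

The length check is what keeps `"red"` and `"bed"` intact: stripping `"ed"` would leave a one-letter root.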

`Sentence Segmenter`

**Context:** Splitting a paragraph into sentences.

**Task:** Split a paragraph string into a list of sentences based on the delimiters `.`, `?`, and `!`. Be careful not to lose the punctuation marks.

**Input:** `"Wait! Is that Python? Yes."`

**Output:** `["Wait!", "Is that Python?", "Yes."]`
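A character-by-character sketch that keeps each delimiter attached to its sentence. It assumes every `.`, `?`, or `!` ends a sentence, so abbreviations such as "Dr." would be split incorrectly; `split_sentences` is a hypothetical name.

```python
def split_sentences(paragraph):
    """Split on '.', '?' and '!' while keeping the delimiter with its sentence."""
    sentences = []
    current = ""
    for ch in paragraph:
        current += ch
        if ch in ".?!":
            sentences.append(current.strip())
            current = ""
    # Keep any trailing text that has no closing delimiter.
    if current.strip():
        sentences.append(current.strip())
    return sentences

print(split_sentences("Wait! Is that Python? Yes."))
# ['Wait!', 'Is that Python?', 'Yes.']
```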

`"Hashtag" Extractor`

**Task:** Write a function that extracts all unique words starting with `#` from a tweet. Return them in a list.

**Difficulty:** Must handle spaces and punctuation correctly (e.g., `#AI!` should be `#AI`).

**Input:** `"I love #Python and #AI! #Python is great."`

**Output:** `['#Python', '#AI']`

`Case-Insensitive Vowel Start`

**Task:** Count how many words in a sentence start with a vowel (a, e, i, o, u), regardless of capitalization.

**Input:** `"Apple is eating an Orange."`

**Output:** `5` (Apple, is, eating, an, Orange)

**Hint:** Iterate over the words, take `word.lower()[0]`, and check whether it is in `"aeiou"`.

`Whitespace Normalizer`

**Task:** A user entered text with messy spacing. Convert multiple spaces into a single space and trim the edges. Do this without using `split()` and `join()`. Use a loop.

**Input:** `"  Natural   Language   Processing  "`

**Output:** `"Natural Language Processing"`
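A loop-based sketch that avoids `split()` and `join()`. It assumes only the plain space character needs normalizing (tabs and newlines are left untouched); `normalize_spaces` is a hypothetical name.

```python
def normalize_spaces(text):
    """Collapse runs of spaces into one and trim both ends, using only a loop."""
    result = ""
    for ch in text:
        if ch == " ":
            # Add a space only after a word: never at the start, never twice.
            if result and result[-1] != " ":
                result += " "
        else:
            result += ch
    # At most one trailing space can remain at this point.
    if result.endswith(" "):
        result = result[:-1]
    return result

print(normalize_spaces("  Natural   Language   Processing  "))
# Natural Language Processing
```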

`"Email" Masker`

**Task:** Find any email address in a string (assume any word containing @) and replace it with `[EMAIL]`.

**Input:** `"Contact me at test@email.com for info."`

**Output:** `"Contact me at [EMAIL] for info."`

`Sentence Reverser (Word by Word)`

**Task:** Reverse the order of words in a sentence, but keep the words themselves spelled correctly.

**Input:** "Deep Learning is hard"

**Output:** "hard is Learning Deep"

**List (Sequence Processing & Logic)**

`N-Gram Generator`

**Context:** An N-gram is a contiguous sequence of `n` items from a given text.

**Task:** Write a function that takes a string and a number `n` (e.g., 2 for bigrams) and returns a list of n-gram tuples.

**Input:** `"I love NLP"`, `n = 2`

**Output:** `[("I", "love"), ("love", "NLP")]`
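A sliding-window sketch of n-gram generation, assuming tokens are obtained by whitespace splitting. The name `ngrams` and the `ValueError` for `n < 1` are choices made here, not part of the task statement.

```python
def ngrams(text, n):
    """Return all contiguous n-word tuples from a whitespace-separated string."""
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.split()
    # A window starting at i covers words[i : i + n]; the last valid start
    # is len(words) - n, so texts shorter than n yield an empty list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I love NLP", 2))  # [('I', 'love'), ('love', 'NLP')]
```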

`Longest Word with Ties`

**Context:** Analyzing vocabulary complexity.

**Task:** Find the longest word in a sentence. If there is a tie (multiple words with the same max length), return `all` of them in a list.

**Input:** `"Code and Data are cool"`

**Output:** `["Code", "Data", "cool"]`

`Bag of Words (Binary Vector)`

**Context:** Converting text into numbers for Machine Learning.

**Task:** Given a fixed `vocabulary = ["apple", "banana", "cat"]` and an input sentence `"I have a cat and an apple"`, return a list of `1`s and `0`s indicating whether each vocabulary word appears in the sentence.

**Output:** `[1, 0, 1]` (Present, Not Present, Present)

`Key Word in Context (KWIC)`

**Context:** Seeing how a word is used.

**Task:** Given a list of words and a target word, print the `target` word along with its immediate neighbors (one word before, one word after). Handle cases where the word is at the start or end.

**Input:** `["I", "am", "learning", "Python", "now"]`, target="learning"

**Output:** `"am learning Python"`

`Filter Short Words`

**Task:** Given a list of tokens, return a new list containing only words that are longer than 3 characters.

**Input:** `["AI", "is", "the", "future", "of", "tech"]`

**Output:** `["future", "tech"]`

`Find the "Middle" Word`

**Task:** Find the middle word of a sentence. If the sentence has an even number of words, return the two middle words.

**Input:** `"Python is very fun language"` (5 words)

**Output:** `"very"`

**Input 2:** `"Python is very fun"` (4 words)

**Output 2:** `["is", "very"]`

`Cumulative Word Lengths`

**Task:** Create a list where each element is the sum of the lengths of all words up to that point.

**Input:** `["I", "am", "happy"]`

**Logic:** `len("I")=1, len("am")=2 (1+2=3), len("happy")=5 (3+5=8).`

**Output:** `[1, 3, 8]`
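The running-total logic above can be sketched as follows; `cumulative_lengths` is a hypothetical name.

```python
def cumulative_lengths(words):
    """Return the running sum of word lengths."""
    totals = []
    running = 0
    for word in words:
        running += len(word)  # add this word's length to the total so far
        totals.append(running)
    return totals

print(cumulative_lengths(["I", "am", "happy"]))  # [1, 3, 8]
```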

`List Difference (Vocabulary Check)`

**Task:** You have two lists: `text_words` and `known_vocab`. Return a list of words that appear in `text_words` but NOT in `known_vocab`.

**Input:** `text=["hi", "human", "bot"]`, `vocab=["hi", "human"]`

**Output:** `["bot"]`

`Move "Stopwords" to End`

**Task:** Given a list of words and a set of stopwords `{'is', 'the'}`, move all stopwords to the end of the list while keeping the order of other words intact.

**Input:** `["The", "sky", "is", "blue"]` (Assume case-insensitive check for "The")

**Output:** `["sky", "blue", "The", "is"]`

**Dictionaries (Frequency & Mapping)**

`Term Frequency (TF) Calculator`

**Context:** Counting how "important" a word is.

**Task:** Create a dictionary where keys are unique words and values are their relative frequency in the text (Count / Total Words), rounded to two decimal places.

**Input:** `"apple banana apple"`

**Output:** `{"apple": 0.67, "banana": 0.33}`

`Inverted Index Construction`

**Context:** How search engines store data.

**Task:** Given a dictionary of documents `doc_id: text`, create an inverted index `word: [list_of_doc_ids]`.

**Input:** `{1: "apple pie", 2: "apple juice"}`

**Output:** `{"apple": [1, 2], "pie": [1], "juice": [2]}`
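A minimal sketch of building the index, assuming whitespace tokenization and that each document ID should appear at most once per word; `build_inverted_index` is a hypothetical name.

```python
def build_inverted_index(documents):
    """Map each word to the list of document IDs whose text contains it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            doc_ids = index.setdefault(word, [])  # create the list on first sight
            if doc_id not in doc_ids:  # avoid duplicates for repeated words
                doc_ids.append(doc_id)
    return index

print(build_inverted_index({1: "apple pie", 2: "apple juice"}))
# {'apple': [1, 2], 'pie': [1], 'juice': [2]}
```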

`Word Co-occurrence Matrix`

**Context:** Counting which words appear together.

**Task:** For a specific target word (e.g., `"data"`), create a dictionary counting the words that appear immediately after it in a list of tokens.

**Input:** `["data", "science", "data", "processing", "data", "science"]`

**Target:** `"data"`

**Output:** `{"science": 2, "processing": 1}`

`Simple Sentiment Dictionary`

**Context:** Rule-based sentiment analysis.

**Task:** Given a dictionary `scores = {"good": 1, "bad": -1, "great": 2}` and a sentence, calculate the total sentiment score. If a word isn't in the dictionary, it scores `0`.

**Input:** `"The movie was good but not great"`

**Output:** `0 + 0 + 0 + 1 + 0 + 0 + 2 = 3` (Note: handling "not" is advanced; just sum the word scores for now).
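A short sketch of the scoring loop, assuming words are lowercased before lookup and split on whitespace; `sentiment_score` is a hypothetical name. For the example sentence only "good" (1) and "great" (2) carry a score, and the other five words contribute 0.

```python
def sentiment_score(sentence, scores):
    """Sum the per-word scores; words missing from the dictionary count as 0."""
    total = 0
    for word in sentence.lower().split():
        total += scores.get(word, 0)
    return total

scores = {"good": 1, "bad": -1, "great": 2}
print(sentiment_score("The movie was good but not great", scores))  # 3
```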

`Character Frequency (Excluding Spaces)`

**Task:** Count the frequency of every character in a string, but ignore spaces.

**Input:** `"aa b"`

**Output:** `{'a': 2, 'b': 1}`

`First Unique Word`

**Task:** Find the first word in a list that appears exactly once. If no such word exists, return `None`.

**Input:** `["apple", "banana", "apple", "cherry", "banana"]`

**Output:** `"cherry"`

`Group Words by Length`

**Task:** Create a dictionary where the keys are word lengths (integers) and the values are lists of words of that length.

**Input:** `["go", "hi", "run", "code"]`

**Output:** `{2: ["go", "hi"], 3: ["run"], 4: ["code"]}`

`Merge Two Frequency Dictionaries`

**Task:** You have two dictionaries representing word counts from two different texts. Merge them. If a word exists in both, sum their counts.

**Input:** `d1 = {'a': 1, 'b': 2}, d2 = {'b': 3, 'c': 1}`

**Output:** `{'a': 1, 'b': 5, 'c': 1}`

`Dictionary Filtering (Minimum Frequency)`

**Task:** Given a frequency dictionary, return a new dictionary containing only words that appear `3 or more times`.

**Input:** `{'data': 10, 'science': 2, 'AI': 5}`

**Output:** `{'data': 10, 'AI': 5}`

**Mixed Logic & Algorithms**

`Longest Sequence of Same Character`

**Task:** Find the length of the longest consecutive run of identical characters in a string.

**Input:** `"reaaaally"`

**Output:** `4 (because of "aaaa")`

`Simple Spell Corrector (Dictionary Lookup)`

**Context:** Dictionary lookup is a fundamental NLP technique: a predefined dictionary (a linguistic dataset or data structure) is used to quickly retrieve information about the words in a text. It supports tasks ranging from basic word representation to advanced semantic analysis.

**Task:** You have a dictionary of common misspellings: `corrections = {"teh": "the", "wnt": "want"}`. Write a script to replace words in a sentence if they appear in the keys of the dictionary.

**Input:** `"I wnt teh apple"`

**Output:** `"I want the apple"`

`Palindrome Sentence Check`

**Task:** Check whether a sentence is a palindrome when spaces, punctuation, and letter case are ignored.

**Input:** `"Madam, I'm Adam"`

**Output:** `True`

`Text to "One-Hot" Indices`

**Task:** Create a sorted list of unique words (vocabulary). Then, replace the original words in the sentence with their index from that vocabulary list.

**Input:** `"i love code and i love data"`

**Vocab:** `['and', 'code', 'data', 'i', 'love']`

**Output:** `[3, 4, 1, 0, 3, 4, 2]`

`Nested Bracket Extractor`

**Task:** Extract the text inside the first pair of square brackets `[]` found in the string.

**Input:** `"The dataset [embedded vector] is large."`

**Output:** `"embedded vector"`

**Tuples & Sets (Uniqueness & Immutable Data)**

`Unique Vocabulary Builder`

**Context:** Pre-processing for training.

**Task:** Take a messy paragraph (with capitals and duplicates), normalize it (lowercase), and return a `sorted tuple` of unique words.

**Input:** `"Apple banana APPLE."`

**Output:** `("apple", "banana")`

`Jaccard Similarity`

**Context:** Measuring how similar two sentences are.

**Task:** Given two sentences, convert them to sets of words. Then calculate: $$\frac{\text{Size of Intersection}}{\text{Size of Union}}$$

**Input:** `"AI is cool", "AI is great"`

**Math:** `Intersection={"AI", "is"} (2), Union={"AI", "is", "cool", "great"} (4).`

**Output:** `0.5`
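The ratio can be sketched with Python's set operators. This assumes case-sensitive whitespace tokens, and returning `0.0` for two empty sentences is a choice made here; `jaccard_similarity` is a hypothetical name.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Return |A & B| / |A | B| for the word sets of two sentences."""
    words_a = set(sentence_a.split())
    words_b = set(sentence_b.split())
    union = words_a | words_b
    if not union:
        return 0.0  # assumption: two empty sentences score 0
    return len(words_a & words_b) / len(union)

print(jaccard_similarity("AI is cool", "AI is great"))  # 0.5
```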

`Tuple-Based Part-of-Speech (POS) Tagger`

**Context:** Assigning grammar labels.

**Task:** Given a list of rules as tuples `[("ends_with_ly", "Adverb"), ("ends_with_ed", "Verb")]`, iterate through a sentence. If a word matches a rule, tag it. Default tag is `"Noun"`.

**Input:** `"He walked quickly"`

**Output:** `[("He", "Noun"), ("walked", "Verb"), ("quickly", "Adverb")]`
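One way to interpret the rule tuples, assuming every rule name has the form `ends_with_<suffix>` and that the first matching rule wins. The names `tag_words` and `PREFIX` are hypothetical.

```python
PREFIX = "ends_with_"  # assumed naming convention for rule names

def tag_words(sentence, rules):
    """Tag each word with the first matching suffix rule, defaulting to 'Noun'."""
    tagged = []
    for word in sentence.split():
        tag = "Noun"
        for rule_name, rule_tag in rules:
            # Recover the suffix from the rule name, e.g. "ends_with_ly" -> "ly".
            if rule_name.startswith(PREFIX) and word.endswith(rule_name[len(PREFIX):]):
                tag = rule_tag
                break
        tagged.append((word, tag))
    return tagged

rules = [("ends_with_ly", "Adverb"), ("ends_with_ed", "Verb")]
print(tag_words("He walked quickly", rules))
# [('He', 'Noun'), ('walked', 'Verb'), ('quickly', 'Adverb')]
```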

**Mixed Logic**

`The "Unknown" Token Replacer`

**Context:** Handling words the model hasn't seen before.

**Task:** Given a list of words, find any word that appears fewer than 2 times and replace it with the special token `"<UNK>"`.

`Dictionary of Lists to List of Dictionaries`

**Context:** Data formatting.

**Task:** Convert `data = {"words": ["hi", "bye"], "pos": ["noun", "noun"]}` into `[{"word": "hi", "pos": "noun"}, {"word": "bye", "pos": "noun"}]`.

`Text Summarizer (Frequency Based)`

**Task:**

- Count word frequencies in a paragraph.
- Score each sentence by summing the frequencies of its words.
- Return the single sentence with the highest score.
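The three steps above can be sketched as follows. The sketch assumes sentences end at `.`, `?`, or `!`, that words are lowercased with surrounding punctuation stripped, and that ties go to the earliest sentence; `summarize` and `words_of` are hypothetical names.

```python
def words_of(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    cleaned = [w.strip(".,?!;:").lower() for w in text.split()]
    return [w for w in cleaned if w]

def summarize(paragraph):
    """Return the sentence whose words have the highest total frequency."""
    # Step 0: split into sentences, keeping the closing delimiter.
    sentences, current = [], ""
    for ch in paragraph:
        current += ch
        if ch in ".?!":
            if current.strip():
                sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())

    # Step 1: count word frequencies across the whole paragraph.
    freq = {}
    for word in words_of(paragraph):
        freq[word] = freq.get(word, 0) + 1

    # Steps 2 and 3: score each sentence and keep the best one.
    best, best_score = None, -1
    for sentence in sentences:
        score = sum(freq.get(word, 0) for word in words_of(sentence))
        if score > best_score:
            best, best_score = sentence, score
    return best

print(summarize("Cats purr. Cats and dogs play. Birds sing."))
# Cats and dogs play.
```

Because scores are plain sums, longer sentences tend to win; dividing by sentence length is one possible refinement.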

`Corpus Statistics Report`

**Task:** Write a script that takes a long string and prints a report:

- Total Token Count
- Vocabulary Size (Unique words)
- Lexical Diversity (Unique / Total)
- Average Word Length
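A possible shape for the report, assuming tokens are lowercased whitespace-separated words with surrounding punctuation stripped. The function name `corpus_report` and the dictionary keys are hypothetical.

```python
def corpus_report(text):
    """Print and return basic corpus statistics for a string."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,?!;:\"'()").lower()
        if token:
            tokens.append(token)

    total = len(tokens)
    vocab_size = len(set(tokens))
    report = {
        "total_tokens": total,
        "vocabulary_size": vocab_size,
        # Guard against division by zero on empty input.
        "lexical_diversity": vocab_size / total if total else 0.0,
        "average_word_length": sum(len(t) for t in tokens) / total if total else 0.0,
    }
    for name, value in report.items():
        print(f"{name}: {value}")
    return report

corpus_report("The cat saw the dog.")
```

For the sample call, "the" appears twice, so there are 5 tokens and 4 unique words.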

`Simple Spell Checker (One Edit Distance)`

**Context:** Finding typos.

**Task:** Given a correct vocabulary `["apple", "pear"]` and a user word `"aple"`, check whether inserting one letter into the user word produces any word in the vocabulary.

**Hint:** Loop through the vocab and check length differences and character overlap.
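A sketch of one way to follow the hint. It relies on the observation that inserting one letter into the user word yields a vocabulary word exactly when deleting one letter from that vocabulary word yields the user word. The names `is_one_insertion_away` and `suggest` are hypothetical.

```python
def is_one_insertion_away(user_word, vocab_word):
    """True if adding exactly one letter to user_word can produce vocab_word."""
    if len(vocab_word) != len(user_word) + 1:
        return False
    # Try deleting each position of vocab_word and compare with user_word.
    for i in range(len(vocab_word)):
        if vocab_word[:i] + vocab_word[i + 1:] == user_word:
            return True
    return False

def suggest(user_word, vocabulary):
    """Return every vocabulary word reachable by one inserted letter."""
    return [word for word in vocabulary if is_one_insertion_away(user_word, word)]

print(suggest("aple", ["apple", "pear"]))  # ['apple']
```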