## Python Strings: The Basics

A string in Python is a sequence of characters. They are **immutable**, meaning once a string is created, it cannot be changed. Any operation that appears to modify a string actually creates a new string.

Strings can be created using single quotes (`'...'`), double quotes (`"..."`), or triple quotes (`'''...'''` or `"""..."""`). Triple quotes are useful for multi-line strings and docstrings.

In [1]:
# --- Introduction ---
print("--- String Introduction ---")
s1 = 'Hello, World!'
s2 = "Python is fun."
s3 = """This is a
multi-line
string."""
s4 = '''Another
multi-line
string.'''

print(f"s1: {s1}")
print(f"s2: {s2}")
print(f"s3:\n{s3}") # \n is needed here because print itself adds a newline for f-string content
print(f"s4:\n{s4}")

# Immutability example
my_string = "apple"
print(f"Original string ID: {id(my_string)}")
# my_string[0] = "A" # This would cause a TypeError: 'str' object does not support item assignment
my_string = "Apple" # This creates a NEW string "Apple" and assigns it to my_string
print(f"New string ID: {id(my_string)}")
print("-" * 30)

--- String Introduction ---
s1: Hello, World!
s2: Python is fun.
s3:
This is a
multi-line
string.
s4:
Another
multi-line
string.
Original string ID: 139988986137600
New string ID: 139988986138080
------------------------------


## String Operations

### 1. Concatenation (+)
Joining two or more strings together.

In [2]:
print("--- Concatenation ---")
first_name = "Ada"
last_name = "Lovelace"
full_name = first_name + " " + last_name
print(f"Full name: {full_name}")
print("-" * 30)

--- Concatenation ---
Full name: Ada Lovelace
------------------------------


### 2. Repetition (*)
Repeating a string multiple times.

In [3]:
print("--- Repetition ---")
separator = "-_-"
repeated_separator = separator * 5
print(f"Repeated: {repeated_separator}")
print("-" * 30)

--- Repetition ---
Repeated: -_--_--_--_--_-
------------------------------


### 3. Membership (`in`, `not in`)
Checking if a substring exists within a string.

In [4]:
print("--- Membership ---")
sentence = "The quick brown fox jumps over the lazy dog."
print(f"'fox' in sentence: {'fox' in sentence}")
print(f"'cat' in sentence: {'cat' in sentence}")
print(f"'cat' not in sentence: {'cat' not in sentence}")
print("-" * 30)

--- Membership ---
'fox' in sentence: True
'cat' in sentence: False
'cat' not in sentence: True
------------------------------


### 4. Slicing (`[start:end:step]`)
Extracting a part of the string.
*   `start`: The starting index (inclusive). Defaults to 0.
*   `end`: The ending index (exclusive). Defaults to the end of the string.
*   `step`: The increment. Defaults to 1.

In [5]:
print("--- Slicing ---")
text = "PythonProgramming"
# P y t h o n P r o g r a m m i n g
# 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 (indices)
#-7-6-5-4-3-2-1 (negative indices from end)

print(f"Original: {text}")
print(f"text[0]: {text[0]}")
print(f"text[6]: {text[6]}")
print(f"text[-1]: {text[-1]}")
print(f"text[0:6]: {text[0:6]}")
print(f"text[:6]: {text[:6]}")
print(f"text[6:]: {text[6:]}")
print(f"text[-11:]: {text[-11:]}")
print(f"text[::2]: {text[::2]}")
print(f"text[::-1]: {text[::-1]}")
print(f"text[1:10:3]: {text[1:10:3]}")
print("-" * 30)

--- Slicing ---
Original: PythonProgramming
text[0]: P
text[6]: P
text[-1]: g
text[0:6]: Python
text[:6]: Python
text[6:]: Programming
text[-11:]: Programming
text[::2]: PtoPormig
text[::-1]: gnimmargorPnohtyP
text[1:10:3]: yor
------------------------------


## Traversing a String using Loops

### 1. `for` loop (character by character)

In [7]:
print("--- Traversing with for loop (char by char) ---")
word = "Looping"
for char in word:
    print(char, end="##") # print each char followed by a space
print("\n" + "-" * 30)

--- Traversing with for loop (char by char) ---
L##o##o##p##i##n##g##
------------------------------


# Python String Manipulation and NLTK Introduction

This notebook covers fundamental Python string operations and provides an introduction to the Natural Language Toolkit (NLTK).

### 2. `for` loop with `range` (index-based)

In [8]:
print("--- Traversing with for loop (index-based) ---")
word = "Indexed"
for i in range(len(word)):
    print(f"Index {i}: {word[i]}")
print("-" * 30)

--- Traversing with for loop (index-based) ---
Index 0: I
Index 1: n
Index 2: d
Index 3: e
Index 4: x
Index 5: e
Index 6: d
------------------------------


### 3. `while` loop (index-based)

In [9]:
print("--- Traversing with while loop ---")
word = "WhileIt"
index = 0
while index < len(word):
    print(word[index], end="-")
    index += 1
print("\n" + "-" * 30)

--- Traversing with while loop ---
W-h-i-l-e-I-t-
------------------------------


## Built-in Functions/Methods

**Note:**
*   **Functions** like `len()` are called directly on the object (e.g., `len(my_string)`).
*   **Methods** are functions associated with an object and are called using dot notation (e.g., `my_string.upper()`).

### 1. `len()`
Function: Returns the length of the string.

In [10]:
print("--- len() ---")
s = "Hello"
print(f"String: '{s}', Length: {len(s)}")

--- len() ---
String: 'Hello', Length: 5


### 2. `capitalize()`
Method: Converts the first character to uppercase and the rest to lowercase.

In [11]:
print("\n--- capitalize() ---")
s = "python is fun."
print(f"Original: '{s}', Capitalized: '{s.capitalize()}'")
s2 = "PYTHON IS FUN."
print(f"Original: '{s2}', Capitalized: '{s2.capitalize()}'")


--- capitalize() ---
Original: 'python is fun.', Capitalized: 'Python is fun.'
Original: 'PYTHON IS FUN.', Capitalized: 'Python is fun.'


### 3. `title()`
Method: Converts the first character of each word to uppercase and others to lowercase.

In [12]:
print("\n--- title() ---")
s = "welcome to the jungle"
print(f"Original: '{s}', Titled: '{s.title()}'")
s2 = "they're bill's friends"
print(f"Original: '{s2}', Titled: '{s2.title()}'")


--- title() ---
Original: 'welcome to the jungle', Titled: 'Welcome To The Jungle'
Original: 'they're bill's friends', Titled: 'They'Re Bill'S Friends'


### 4. `lower()`
Method: Converts all characters to lowercase.

In [13]:
print("\n--- lower() ---")
s = "HELLO WORLD"
print(f"Original: '{s}', Lowered: '{s.lower()}'")


--- lower() ---
Original: 'HELLO WORLD', Lowered: 'hello world'


### 5. `upper()`
Method: Converts all characters to uppercase.

In [14]:
print("\n--- upper() ---")
s = "hello world"
print(f"Original: '{s}', Uppered: '{s.upper()}'")


--- upper() ---
Original: 'hello world', Uppered: 'HELLO WORLD'


### 6. `count()`
Method: Returns the number of non-overlapping occurrences of a substring.
`count(substring, start=0, end=len(string))`

In [15]:
print("\n--- count() ---")
s = "abracadabra"
print(f"String: '{s}'")
print(f"Count of 'a': {s.count('a')}")
print(f"Count of 'abra': {s.count('abra')}")
print(f"Count of 'a' from index 3: {s.count('a', 3)}")
print(f"Count of 'a' from index 3 to 7: {s.count('a', 3, 7)}")


--- count() ---
String: 'abracadabra'
Count of 'a': 5
Count of 'abra': 2
Count of 'a' from index 3: 4
Count of 'a' from index 3 to 7: 2


### 7. `find()`
Method: Returns the lowest index of a substring. Returns -1 if not found.
`find(substring, start=0, end=len(string))`

In [16]:
print("\n--- find() ---")
s = "hello world, world of wonders"
print(f"String: '{s}'")
print(f"Find 'world': {s.find('world')}")
print(f"Find 'world' after index 7: {s.find('world', 7)}")
print(f"Find 'python': {s.find('python')}")


--- find() ---
String: 'hello world, world of wonders'
Find 'world': 6
Find 'world' after index 7: 13
Find 'python': -1


### 8. `index()`
Method: Like `find()`, but raises a `ValueError` if the substring is not found.
`index(substring, start=0, end=len(string))`

In [17]:
print("\n--- index() ---")
s = "hello world"
print(f"String: '{s}'")
print(f"Index of 'world': {s.index('world')}")
try:
    print(f"Index of 'python': {s.index('python')}")
except ValueError as e:
    print(f"Error finding 'python': {e}")


--- index() ---
String: 'hello world'
Index of 'world': 6
Error finding 'python': substring not found


### 9. `endswith()`
Method: Returns `True` if the string ends with the specified suffix.
`endswith(suffix, start=0, end=len(string))`

In [18]:
print("\n--- endswith() ---")
filename = "document.txt"
print(f"String: '{filename}'")
print(f"Ends with '.txt': {filename.endswith('.txt')}")
print(f"Ends with 'doc': {filename.endswith('doc')}")
print(f"Substring 'document.t' ends with '.t': {filename.endswith('.t', 0, 10)}")


--- endswith() ---
String: 'document.txt'
Ends with '.txt': True
Ends with 'doc': False
Substring 'document.t' ends with '.t': True


### 10. `startswith()`
Method: Returns `True` if the string starts with the specified prefix.
`startswith(prefix, start=0, end=len(string))`

In [19]:
print("\n--- startswith() ---")
url = "https://www.example.com"
print(f"String: '{url}'")
print(f"Starts with 'https://': {url.startswith('https://')}")
print(f"Starts with 'www' from index 8: {url.startswith('www', 8)}")


--- startswith() ---
String: 'https://www.example.com'
Starts with 'https://': True
Starts with 'www' from index 8: True


### 11. `isalnum()`
Method: Returns `True` if all characters are alphanumeric (letters or numbers) and there is at least one character.

In [20]:
print("\n--- isalnum() ---")
s1 = "Python3"
s2 = "Python 3" # contains space
s3 = "Python@"  # contains symbol
s4 = ""         # empty
print(f"'{s1}'.isalnum(): {s1.isalnum()}")
print(f"'{s2}'.isalnum(): {s2.isalnum()}")
print(f"'{s3}'.isalnum(): {s3.isalnum()}")
print(f"'{s4}'.isalnum(): {s4.isalnum()}")


--- isalnum() ---
'Python3'.isalnum(): True
'Python 3'.isalnum(): False
'Python@'.isalnum(): False
''.isalnum(): False


### 12. `isalpha()`
Method: Returns `True` if all characters are alphabetic and there is at least one character.

In [21]:
print("\n--- isalpha() ---")
s1 = "Python"
s2 = "Python3" # contains digit
s3 = "Python " # contains space
s4 = ""        # empty
print(f"'{s1}'.isalpha(): {s1.isalpha()}")
print(f"'{s2}'.isalpha(): {s2.isalpha()}")
print(f"'{s3}'.isalpha(): {s3.isalpha()}")
print(f"'{s4}'.isalpha(): {s4.isalpha()}")


--- isalpha() ---
'Python'.isalpha(): True
'Python3'.isalpha(): False
'Python '.isalpha(): False
''.isalpha(): False


### 13. `isdigit()`
Method: Returns `True` if all characters are digits and there is at least one character.

In [22]:
print("\n--- isdigit() ---")
s1 = "12345"
s2 = "123.45" # contains period
s3 = "123a"   # contains letter
s4 = ""       # empty
print(f"'{s1}'.isdigit(): {s1.isdigit()}")
print(f"'{s2}'.isdigit(): {s2.isdigit()}")
print(f"'{s3}'.isdigit(): {s3.isdigit()}")
print(f"'{s4}'.isdigit(): {s4.isdigit()}")


--- isdigit() ---
'12345'.isdigit(): True
'123.45'.isdigit(): False
'123a'.isdigit(): False
''.isdigit(): False


### 14. `islower()`
Method: Returns `True` if all cased characters are lowercase and there is at least one cased character.

In [23]:
print("\n--- islower() ---")
s1 = "python"
s2 = "Python"
s3 = "python3"
s4 = "PYTHON"
s5 = "123"    # no cased characters
s6 = ""       # empty
print(f"'{s1}'.islower(): {s1.islower()}")
print(f"'{s2}'.islower(): {s2.islower()}")
print(f"'{s3}'.islower(): {s3.islower()}")
print(f"'{s4}'.islower(): {s4.islower()}")
print(f"'{s5}'.islower(): {s5.islower()}")
print(f"'{s6}'.islower(): {s6.islower()}")


--- islower() ---
'python'.islower(): True
'Python'.islower(): False
'python3'.islower(): True
'PYTHON'.islower(): False
'123'.islower(): False
''.islower(): False


### 15. `isupper()`
Method: Returns `True` if all cased characters are uppercase and there is at least one cased character.

In [24]:
print("\n--- isupper() ---")
s1 = "PYTHON"
s2 = "Python"
s3 = "PYTHON3"
s4 = "python"
s5 = "123"
s6 = ""
print(f"'{s1}'.isupper(): {s1.isupper()}")
print(f"'{s2}'.isupper(): {s2.isupper()}")
print(f"'{s3}'.isupper(): {s3.isupper()}")
print(f"'{s4}'.isupper(): {s4.isupper()}")
print(f"'{s5}'.isupper(): {s5.isupper()}")
print(f"'{s6}'.isupper(): {s6.isupper()}")


--- isupper() ---
'PYTHON'.isupper(): True
'Python'.isupper(): False
'PYTHON3'.isupper(): True
'python'.isupper(): False
'123'.isupper(): False
''.isupper(): False


### 16. `isspace()`
Method: Returns `True` if all characters are whitespace and there is at least one character.

In [25]:
print("\n--- isspace() ---")
s1 = " \t\n" # space, tab, newline
s2 = "  Hello  "
s3 = ""
print(f"'{s1}'.isspace(): {s1.isspace()}")
print(f"'{s2}'.isspace(): {s2.isspace()}")
print(f"'{s3}'.isspace(): {s3.isspace()}")


--- isspace() ---
' 	
'.isspace(): True
'  Hello  '.isspace(): False
''.isspace(): False


### 17. `lstrip()`
Method: Returns a copy of the string with leading whitespace removed.
`lstrip(chars)` - Optional argument `chars` specifies a string containing characters to remove from the left.

In [26]:
print("\n--- lstrip() ---")
s = "   Hello World   "
print(f"Original: 'END{s}END'")
print(f"lstrip(): 'END{s.lstrip()}END'")

s_chars = "xyxxyyHello World"
print(f"Original with chars: '{s_chars}'")
print(f"lstrip('xy'): '{s_chars.lstrip('xy')}'")


--- lstrip() ---
Original: 'END   Hello World   END'
lstrip(): 'ENDHello World   END'
Original with chars: 'xyxxyyHello World'
lstrip('xy'): 'Hello World'


### 18. `rstrip()`
Method: Returns a copy of the string with trailing whitespace removed.
`rstrip(chars)` - Optional argument `chars` specifies a string containing characters to remove from the right.

In [27]:
print("\n--- rstrip() ---")
s = "   Hello World   "
print(f"Original: 'END{s}END'")
print(f"rstrip(): 'END{s.rstrip()}END'")

s_chars = "Hello Worldxyxxyy"
print(f"Original with chars: '{s_chars}'")
print(f"rstrip('xy'): '{s_chars.rstrip('xy')}'")


--- rstrip() ---
Original: 'END   Hello World   END'
rstrip(): 'END   Hello WorldEND'
Original with chars: 'Hello Worldxyxxyy'
rstrip('xy'): 'Hello World'


### 19. `strip()`
Method: Returns a copy of the string with leading and trailing whitespace removed.
`strip(chars)` - Optional argument `chars` specifies a string containing characters to remove from both ends.

In [None]:
print("\n--- strip() ---")
s = "   Hello World   "
print(f"Original: 'END{s}END'")
print(f"strip(): 'END{s.strip()}END'")

s_chars = "xyxxyyHello Worldxyxxyy"
print(f"Original with chars: '{s_chars}'")
print(f"strip('xy'): '{s_chars.strip('xy')}'")

### 20. `replace()`
Method: Returns a copy with all occurrences of a substring replaced by another.
`replace(old, new, count=-1)` - `count` is optional, max number of replacements.

In [None]:
print("\n--- replace() ---")
s = "one two one three one four"
print(f"Original: '{s}'")
print(f"Replace 'one' with '1': '{s.replace('one', '1')}'")
print(f"Replace 'one' with '1' (max 2 times): '{s.replace('one', '1', 2)}'")

### 21. `join()`
Method: Joins elements of an iterable (like a list of strings) into a single string, with the string itself as the separator.

In [None]:
print("\n--- join() ---")
words = ["Hello", "World", "Python"]
separator = " "
print(f"Words: {words}, Separator: '{separator}'")
print(f"Joined: '{separator.join(words)}'")

separator_dash = "-"
print(f"Words: {words}, Separator: '{separator_dash}'")
print(f"Joined: '{separator_dash.join(words)}'")

### 22. `partition()`
Method: Splits the string at the first occurrence of a separator.
Returns a 3-tuple: `(part_before_separator, separator, part_after_separator)`.
If separator not found, returns `(original_string, '', '')`.

In [None]:
print("\n--- partition() ---")
s = "name=value"
print(f"String: '{s}'")
print(f"Partition by '=': {s.partition('=')}")

s2 = "no_separator_here"
print(f"String: '{s2}'")
print(f"Partition by '=': {s2.partition('=')}")

### 23. `split()`
Method: Splits the string into a list of substrings.
`split(sep=None, maxsplit=-1)`
*   If `sep` is not specified or `None`, splits by whitespace and discards empty strings.
*   If `sep` is specified, splits by that separator.
*   `maxsplit` is the maximum number of splits to do.

In [None]:
print("\n--- split() ---")
s = "apple banana cherry"
print(f"String: '{s}'")
print(f"Split by space (default): {s.split()}")

s_csv = "apple,banana,cherry,date"
print(f"String CSV: '{s_csv}'")
print(f"Split by ',': {s_csv.split(',')}")
print(f"Split by ',' (maxsplit=1): {s_csv.split(',', 1)}")
print(f"Split by ',' (maxsplit=2): {s_csv.split(',', 2)}")

s_multi_space = "word1  word2   word3"
print(f"String multi-space: '{s_multi_space}'")
print(f"Split (default): {s_multi_space.split()}")
print("-" * 30)

## Introduction to NLTK (Natural Language Toolkit)

NLTK is a powerful Python library for working with human language data (text). It provides tools for many Natural Language Processing (NLP) tasks.

### Installation

First, you need to install NLTK if you haven't already. You can do this by running the following command in your terminal:
```bash
pip install nltk
```
Or, you can run the following cell (uncomment the line):

In [1]:
!pip install nltk


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### NLTK Resource Downloads

After installation, NLTK requires you to download some data resources (corpora, models, etc.) for its modules to work. The cell below attempts to download necessary resources like 'punkt' (for tokenization), 'averaged_perceptron_tagger' (for POS tagging), 'stopwords', 'wordnet', and 'omw-1.4' (for lemmatization). 

**Note:** This might take a few moments the first time you run it. You generally only need to do this once per environment.

In [2]:
import nltk

print("--- NLTK Resource Check/Download ---")

resources_to_download = {
    'punkt': 'tokenizers/punkt',
    'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
    'stopwords': 'corpora/stopwords',
    'wordnet': 'corpora/wordnet',
    'omw-1.4': 'corpora/omw-1.4' # For WordNetLemmatizer, Open Multilingual Wordnet
}

for resource_name, resource_path in resources_to_download.items():
    try:
        nltk.data.find(resource_path)
        print(f"Resource '{resource_name}' already downloaded.")
    except nltk.downloader.DownloadError:
        print(f"Downloading '{resource_name}' resource for NLTK...")
        nltk.download(resource_name, quiet=True)
        print(f"'{resource_name}' downloaded.")
    except Exception as e: # Catch any other lookup errors
        print(f"Could not find or download '{resource_name}'. Error: {e}")
        print(f"Attempting download directly for '{resource_name}'...")
        try:
            nltk.download(resource_name, quiet=True)
            print(f"'{resource_name}' downloaded successfully after retry.")
        except Exception as download_e:
            print(f"Failed to download '{resource_name}' even after retry. Error: {download_e}")

print("NLTK resources check/download process completed.")
print("-" * 30)

--- NLTK Resource Check/Download ---
Resource 'punkt' already downloaded.
Resource 'averaged_perceptron_tagger' already downloaded.
Resource 'stopwords' already downloaded.


AttributeError: module 'nltk.downloader' has no attribute 'DownloadError'

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/shuvam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Basic NLTK Operations

Let's look at a few common NLP tasks using NLTK:

#### 1. Tokenization
Breaking text into smaller units (words or sentences).

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize

print("\n--- NLTK: Tokenization ---")
text_sample = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue."

# Sentence Tokenization
sentences = sent_tokenize(text_sample)
print("Sentences:")
for i, sentence in enumerate(sentences):
    print(f"{i+1}. {sentence}")

# Word Tokenization
words = word_tokenize(text_sample.lower()) # Often good to lowercase first
print("\nWords (lowercase):")
print(words)
print("-" * 30)


--- NLTK: Tokenization ---
Sentences:
1. Hello Mr. Smith, how are you doing today?
2. The weather is great, and Python is awesome.
3. The sky is pinkish-blue.

Words (lowercase):
['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.', 'the', 'sky', 'is', 'pinkish-blue', '.']
------------------------------


#### 2. Stop Word Removal
Removing common words (like "the", "is", "in") that often don't carry significant meaning for analysis.

In [9]:
from nltk.corpus import stopwords

print("\n--- NLTK: Stop Word Removal ---")
stop_words_english = set(stopwords.words('english')) # Use a set for faster lookups
# You can add custom stop words:
# stop_words_english.update(['mr.', ',', '?']) # Example

# Using the text_sample from the previous cell
text_sample = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. Duytika has made a well rounded python code to compute the determinants for 2x2 and 3x3 matrices along with the implementation of the Cramer’s rule."
words_from_sample = word_tokenize(text_sample.lower())

filtered_words = [word for word in words_from_sample if word.isalnum() and word not in stop_words_english]
# isalnum() helps remove punctuation tokens as well

print("Original words:", words_from_sample)
print("Stop words (English sample - first 10):", list(stop_words_english)[:10])
print("Filtered words:", filtered_words)
print("-" * 30)


--- NLTK: Stop Word Removal ---
Original words: ['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.', 'the', 'sky', 'is', 'pinkish-blue', '.', 'duytika', 'has', 'made', 'a', 'well', 'rounded', 'python', 'code', 'to', 'compute', 'the', 'determinants', 'for', '2x2', 'and', '3x3', 'matrices', 'along', 'with', 'the', 'implementation', 'of', 'the', 'cramer', '’', 's', 'rule', '.']
Stop words (English sample - first 10): ['had', "wasn't", 'own', 'd', "i'll", 'weren', "shan't", 'be', 's', "mustn't"]
Filtered words: ['hello', 'smith', 'today', 'weather', 'great', 'python', 'awesome', 'sky', 'duytika', 'made', 'well', 'rounded', 'python', 'code', 'compute', 'determinants', '2x2', '3x3', 'matrices', 'along', 'implementation', 'cramer', 'rule']
------------------------------


#### 3. Stemming
Reducing words to their root or stem form. It's a crude heuristic process that chops off word endings.

In [10]:
from nltk.stem import PorterStemmer

print("\n--- NLTK: Stemming ---")
ps = PorterStemmer()
words_to_stem = ["program", "programs", "programmer", "programming", "programmers", "running", "flies", "democracy"]
stemmed_words = [ps.stem(w) for w in words_to_stem]

print(f"Original: {words_to_stem}")
print(f"Stemmed (Porter): {stemmed_words}")

# Example with our filtered words from the previous cell
# (Assuming 'filtered_words' is available from the stop word removal cell)
if 'filtered_words' in locals(): # Check if filtered_words exists
    stemmed_filtered = [ps.stem(w) for w in filtered_words]
    print("\nStemmed filtered words from sample:")
    print(stemmed_filtered)
else:
    print("\nRun the 'Stop Word Removal' cell first to generate 'filtered_words' for this example.")
print("-" * 30)


--- NLTK: Stemming ---
Original: ['program', 'programs', 'programmer', 'programming', 'programmers', 'running', 'flies', 'democracy']
Stemmed (Porter): ['program', 'program', 'programm', 'program', 'programm', 'run', 'fli', 'democraci']

Stemmed filtered words from sample:
['hello', 'smith', 'today', 'weather', 'great', 'python', 'awesom', 'sky', 'duytika', 'made', 'well', 'round', 'python', 'code', 'comput', 'determin', '2x2', '3x3', 'matric', 'along', 'implement', 'cramer', 'rule']
------------------------------


#### 4. Lemmatization
Similar to stemming, but it reduces words to their actual dictionary form (lemma). It's more sophisticated as it considers the part of speech (POS). By default, WordNetLemmatizer assumes words are nouns if no POS tag is provided.

In [11]:
from nltk.stem import WordNetLemmatizer
# NLTK's WordNetLemmatizer needs a POS tag that is compatible with WordNet's POS tags.
# We can create a helper function to map NLTK's POS tags (from pos_tag) to WordNet tags.
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Converts treebank POS tags to WordNet POS tags."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun if no mapping found

print("\n--- NLTK: Lemmatization ---")
lemmatizer = WordNetLemmatizer()

words_to_lemmatize = ["cats", "cacti", "geese", "rocks", "running", "better", "ate"]

print(f"Original: {words_to_lemmatize}")

# Lemmatizing without POS tag (defaults to noun)
lemmatized_default_pos = [lemmatizer.lemmatize(w) for w in words_to_lemmatize]
print(f"Lemmatized (default POS - noun): {lemmatized_default_pos}")

# Lemmatizing with specific POS tags
print(f"'running' (as verb): {lemmatizer.lemmatize('running', pos='v')}")
print(f"'ate' (as verb): {lemmatizer.lemmatize('ate', pos='v')}")
print(f"'better' (as adjective): {lemmatizer.lemmatize('better', pos='a')}")
print(f"'better' (as adverb): {lemmatizer.lemmatize('better', pos='r')}")

# Example with our filtered words (defaulting to noun)
# (Assuming 'filtered_words' is available from the stop word removal cell)
if 'filtered_words' in locals():
    lemmatized_filtered_default = [lemmatizer.lemmatize(w) for w in filtered_words]
    print("\nLemmatized filtered words from sample (default POS):")
    print(lemmatized_filtered_default)
    
    # For more accurate lemmatization, use POS tags
    # First, get POS tags for the filtered words (using nltk.pos_tag)
    # Note: pos_tag works best on original sentence structure, not just filtered words.
    # This is just a demonstration on the filtered list.
    # For a real task, you'd POS tag before stopword removal or on tokenized sentences.
    tagged_filtered_words = nltk.pos_tag(filtered_words) 
    lemmatized_filtered_with_pos = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_filtered_words]
    print("\nLemmatized filtered words from sample (with POS tags):")
    print(lemmatized_filtered_with_pos)
else:
    print("\nRun the 'Stop Word Removal' cell first to generate 'filtered_words' for this example.")

print("-" * 30)


--- NLTK: Lemmatization ---
Original: ['cats', 'cacti', 'geese', 'rocks', 'running', 'better', 'ate']
Lemmatized (default POS - noun): ['cat', 'cactus', 'goose', 'rock', 'running', 'better', 'ate']
'running' (as verb): run
'ate' (as verb): eat
'better' (as adjective): good
'better' (as adverb): well

Lemmatized filtered words from sample (default POS):
['hello', 'smith', 'today', 'weather', 'great', 'python', 'awesome', 'sky', 'duytika', 'made', 'well', 'rounded', 'python', 'code', 'compute', 'determinant', '2x2', '3x3', 'matrix', 'along', 'implementation', 'cramer', 'rule']

Lemmatized filtered words from sample (with POS tags):
['hello', 'smith', 'today', 'weather', 'great', 'python', 'awesome', 'sky', 'duytika', 'make', 'well', 'rounded', 'python', 'code', 'compute', 'determinants', '2x2', '3x3', 'matrix', 'along', 'implementation', 'cramer', 'rule']
------------------------------


In [13]:
nltk.download('tagsets_json')

[nltk_data] Downloading package tagsets_json to
[nltk_data]     /home/shuvam/nltk_data...
[nltk_data]   Unzipping help/tagsets_json.zip.


True

#### 5. Part-of-Speech (POS) Tagging
Assigning grammatical tags (noun, verb, adjective, etc.) to each word. NLTK's `pos_tag` uses the Penn Treebank tagset.

In [18]:
from nltk import pos_tag # Already imported nltk, but good for clarity
# from nltk.tokenize import word_tokenize # Already imported

print("\n--- NLTK: Part-of-Speech Tagging ---")
sentence_for_pos = "The quick brown fox jumps over the lazy dog. Duytika is a good programmer. Duytika has made a well rounded python code to compute the determinants for 2x2 and 3x3 matrices along with the implementation of the Cramer’s rule."
tokens_for_pos = word_tokenize(sentence_for_pos)
pos_tags = pos_tag(tokens_for_pos)

print("Tokens:", tokens_for_pos)
print("POS Tags:")
for word, tag in pos_tags:
    # nltk.help.upenn_tagset(tag) provides a description of the tag
    # It can be verbose, so let's just print the tag for now, or a snippet.
    tag_description_lines = nltk.help.upenn_tagset(tag)
    if tag_description_lines:
        tag_description = tag_description_lines.splitlines()[0] # Get the first line of description
    else:
        tag_description = "No description available"
    print(f"('{word}', '{tag}') - {tag_description}")

# Example: Find all nouns from the sentence
nouns = [word for word, tag in pos_tags if tag.startswith('NN')] # NN, NNS, NNP, NNPS are noun tags
print("\nNouns from the sentence:", nouns)
print("-" * 30)


--- NLTK: Part-of-Speech Tagging ---
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'Duytika', 'is', 'a', 'good', 'programmer', '.', 'Duytika', 'has', 'made', 'a', 'well', 'rounded', 'python', 'code', 'to', 'compute', 'the', 'determinants', 'for', '2x2', 'and', '3x3', 'matrices', 'along', 'with', 'the', 'implementation', 'of', 'the', 'Cramer', '’', 's', 'rule', '.']
POS Tags:
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
('The', 'DT') - No description available
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
('quick', 'JJ') - No description available
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humou