
In [11]:
!pip install contractions


In [12]:
import spacy
import contractions

# Load spaCy's small English pipeline (tokenizer plus trained components such as the tagger and parser)
nlp = spacy.load("en_core_web_sm")

## **Tokenization**


---

### **Basic Tokenization**

In [3]:
text = "Hello, world! How are you today?"  # Tokenizing a simple text with punctuation.

# Process the text
doc = nlp(text)
# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['Hello', ',', 'world', '!', 'How', 'are', 'you', 'today', '?']
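To build some intuition for the split shown above, the same result can be roughly approximated with a regular expression. This is only a sketch: spaCy's tokenizer is rule-based, with prefix, suffix, infix, and exception handling, and the helper name `simple_tokenize` is hypothetical, not part of spaCy.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello, world! How are you today?")
print(tokens)
```

For this sentence the regex produces the same nine tokens. It diverges on harder input: `simple_tokenize("can't")` yields `['can', "'", 't']`, which is one reason to prefer a real tokenizer.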


In [4]:
text = "This is the first sentence. This is the second sentence."  # Tokenizing a text with multiple sentences.

# Process the text
doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print(tokens)

['This', 'is', 'the', 'first', 'sentence', '.', 'This', 'is', 'the', 'second', 'sentence', '.']


### **Accessing Tokens and Their Attributes in spaCy**


In [5]:
# Example text
text = "spaCy makes NLP easy."
# Process the text
doc = nlp(text)
# Access tokens
for token in doc:
    print(f"Token: {token.text}, Index: {token.i}")

Token: spaCy, Index: 0
Token: makes, Index: 1
Token: NLP, Index: 2
Token: easy, Index: 3
Token: ., Index: 4


In [18]:
# New text for testing
text = "New York is vibrant, but the weather is unpredictable. Tourists love the Statue."

# Process the text
doc = nlp(text)

# Iterate over tokens and print their attributes with numbering
for i, token in enumerate(doc, start=1):
    print("-" * 75)
    print(f"Token {i}: {token.text}")
    print(f"Lemma {i}: {token.lemma_}")
    print(f"POS {i}: {token.pos_}")
    print(f"Tag {i}: {token.tag_}")
    print(f"Dependency {i}: {token.dep_}")
    print(f"Shape {i}: {token.shape_}")
    print(f"Is Alpha {i}: {token.is_alpha}")
    print(f"Is Stop Word {i}: {token.is_stop}")

---------------------------------------------------------------------------
Token 1: New
Lemma 1: New
POS 1: PROPN
Tag 1: NNP
Dependency 1: compound
Shape 1: Xxx
Is Alpha 1: True
Is Stop Word 1: False
---------------------------------------------------------------------------
Token 2: York
Lemma 2: York
POS 2: PROPN
Tag 2: NNP
Dependency 2: nsubj
Shape 2: Xxxx
Is Alpha 2: True
Is Stop Word 2: False
---------------------------------------------------------------------------
Token 3: is
Lemma 3: be
POS 3: AUX
Tag 3: VBZ
Dependency 3: ROOT
Shape 3: xx
Is Alpha 3: True
Is Stop Word 3: True
---------------------------------------------------------------------------
Token 4: vibrant
Lemma 4: vibrant
POS 4: ADJ
Tag 4: JJ
Dependency 4: acomp
Shape 4: xxxx
Is Alpha 4: True
Is Stop Word 4: False
---------------------------------------------------------------------------
Token 5: ,
Lemma 5: ,
POS 5: PUNCT
Tag 5: ,
Dependency 5: punct
Shape 5: ,
Is Alpha 5: False
Is Stop Word 5: False
------------

In [7]:
# Accessing sentences

# Sample text with two sentences
text = "The quick brown fox jumps over the lazy dog. The sun is shining brightly today."
# Process the text
doc = nlp(text)
# Iterate over sentences and print the text of each one
for sent in doc.sents:
    print(f"Sentence: {sent.text}")

Sentence: The quick brown fox jumps over the lazy dog.
Sentence: The sun is shining brightly today.
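In `en_core_web_sm`, the boundaries behind `doc.sents` come from the trained pipeline (by default, the dependency parser). For comparison, here is a naive, dependency-free sketch that splits on sentence-final punctuation. The helper name `split_sentences` is hypothetical, and the rule is an assumption that only holds for simple text.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "The quick brown fox jumps over the lazy dog. The sun is shining brightly today."
for sent in split_sentences(text):
    print(f"Sentence: {sent}")
```

The rule breaks on abbreviations: `split_sentences("Dr. Smith arrived.")` returns `['Dr.', 'Smith arrived.']`, which is why a trained or rule-rich segmenter is usually the better choice.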


## **Text Normalization**

### **Lowercasing**

In [8]:
text = "This is a SIMPLE Example."
# Process the text
doc = nlp(text)
# Convert tokens to lowercase
tokens = [token.text.lower() for token in doc]
print("Lowercased Tokens:", tokens)

Lowercased Tokens: ['this', 'is', 'a', 'simple', 'example', '.']


### **Removing Stop Words**

In [9]:
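# Stop words are common words (such as "this", "is", "a") that carry little meaning on their own.

text = "This is a SIMPLE Example."
# Process the text so the tokens come from this cell's text, not from an earlier doc
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop]
print("Tokens without Stop Words:", tokens)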

Tokens without Stop Words: ['SIMPLE', 'Example', '.']
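The output above shows that the capitalized `This` was removed as well, so the stop-word check is effectively case-insensitive. A minimal sketch of the same idea without spaCy follows. The `STOP_WORDS` set and the helper `remove_stop_words` are hypothetical and deliberately tiny; spaCy's English stop list is much larger.

```python
# Hypothetical, deliberately tiny stop list for illustration only
STOP_WORDS = {"this", "is", "a", "the", "and"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["This", "is", "a", "SIMPLE", "Example", "."]
print(remove_stop_words(tokens))
```

Comparing on the lowercase form while returning the original token keeps the casing of the surviving words intact, as in the spaCy output.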


### **Expanding Contractions**

In [14]:
# Sample text with contractions
text = "I can't go there because it's raining."

# Expand contractions
expanded_text = contractions.fix(text)
print(f"Input Text : {text}")
print("Expanded Text : ", expanded_text)

Input Text : I can't go there because it's raining.
Expanded Text :  I cannot go there because it is raining.


### **Removing Special Characters**

In [16]:
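text = "I can't go there because it's raining."
doc = nlp(text)
# is_alpha keeps purely alphabetic tokens, so punctuation and the contraction
# pieces "n't" and "'s" are dropped, leaving fragments such as "ca"
cleaned_tokens = [token.text for token in doc if token.is_alpha]
print("Tokens without Special Characters:", cleaned_tokens)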

Tokens without Special Characters: ['I', 'ca', 'go', 'there', 'because', 'it', 'raining']
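The stray `ca` in the output above is what remains of `can't` once the tokenizer splits it and the non-alphabetic piece is filtered out. Expanding contractions before filtering avoids that. The sketch below illustrates the ordering with the standard library only; the `CONTRACTIONS` map and both helper names are hypothetical stand-ins for `contractions.fix` and `token.is_alpha`, not their actual implementations.

```python
import re

# Hypothetical, minimal contraction map; the contractions package covers many more
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "won't": "will not"}

# One case-insensitive pattern that matches any known contraction as a whole word
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its (lowercase) expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def alpha_tokens(text):
    """Tokenize naively, then keep only purely alphabetic tokens."""
    return [t for t in re.findall(r"\w+|[^\w\s]", text) if t.isalpha()]

text = "I can't go there because it's raining."
expanded = expand_contractions(text)
print(expanded)
print(alpha_tokens(expanded))
```

Note that this map only handles the straight apostrophe (`'`). Text with typographic apostrophes (`’`) would need normalizing first.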



## **Example Tasks**

In [25]:
data = """
John recently moved to New York and started a new job as a software engineer. He’s been exploring the city and enjoying the vibrant culture. If you want to reach out to him for professional inquiries, you can email him at john.doe@example.com. He’s always open to connecting with like-minded professionals.
Sophia is an experienced graphic designer who freelances for various international clients. She specializes in branding and web design, often sharing her work on social media. For collaborations, you can contact her at sophia.artwork@example.com. She’s excited to work on new and creative projects that challenge her skills.
Michael is a digital marketer who recently launched his own agency. He helps small businesses grow their online presence through strategic marketing. You can get in touch with him at michael.marketing@example.com for consultations. He believes in personalized strategies that drive results and loves working with passionate entrepreneurs.
"""

doc = nlp(data)

for token in doc:
    if token.like_email:
        print(f"Email found: {token.text}")


Email found: john.doe@example.com
Email found: sophia.artwork@example.com
Email found: michael.marketing@example.com
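`token.like_email` works per token, so it depends on the tokenizer keeping each address in one piece. As a cross-check, the same addresses can be pulled straight from raw text with a regular expression. This is a sketch under a simplifying assumption: the pattern `EMAIL_RE` and the helper `find_emails` are hypothetical, and the pattern is not a full RFC 5322 validator.

```python
import re

# Simplified pattern (an assumption): local part, "@", domain, dot, TLD of 2+ letters
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return all substrings that look like email addresses, in order of appearance."""
    return EMAIL_RE.findall(text)

sample = "Email john.doe@example.com or sophia.artwork@example.com."
print(find_emails(sample))
```

The sentence-final period is not captured, because the pattern requires at least two letters after the last dot.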


## **Malayalam Tokenization**

In [27]:
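# Create a blank Malayalam pipeline (tokenizer only, no trained components).
# A separate name keeps the English pipeline in `nlp` available.
nlp_ml = spacy.blank("ml")

# Sample Malayalam text
text = "ഞാൻ എങ്ങനെ സഹായിക്കാമെന്ന് നോക്കാം."

# Process the text
doc = nlp_ml(text)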

# Print tokens
for token in doc:
    print(f"Token: {token.text}")


Token: ഞാൻ
Token: എങ്ങനെ
Token: സഹായിക്കാമെന്ന്
Token: നോക്കാം.
