<a href="https://colab.research.google.com/github/HamzaBahsir/NLP/blob/main/LanguageRepresentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LAB 3: Language Representation**

**Language Representation** a.k.a. Text Representation is the process of converting unstructured text data into a structured format (machine-readable form). It involves converting words, phrases, or entire documents into numerical or symbolic representations while preserving meaning and context.

It comprises preprocessing the text and then selecting a suitable representation scheme, such as Bag-of-Words or TF-IDF, to capture the key features of the text in a numerical form that machine learning algorithms can process.



# **Objectives:**
This lab covers:

1. Text Preprocessing
    * Remove Punctuation
    * Remove URLs
    * Lowercasing
    * Tokenization
    * Remove Stop Words
    * Stemming
    * Lemmatization
2. Character Encoding
    * ASCII
    * UTF-8
3. Text Representation
    * Bag-of-Words (BoW)
    * Term Frequency - Inverse Document Frequency (TF-IDF)

# **Extra Resources**
[Natural Language Processing with Python](https://www.nltk.org/book/)

# **Libraries Required**
1.   nltk
2.   string
3.   re
4.   sklearn



In [15]:
#Import Libraries
#### For Removing Punctuation
import string
#### For Removing URLs
import re
#### For Tokenization
import nltk
nltk.download("stopwords")
nltk.download("punkt_tab")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
#### For Stemming
from nltk.stem import PorterStemmer
#### For Lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
#### For BoW
from sklearn.feature_extraction.text import CountVectorizer
#### For TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# 1.  **Text Preprocessing**




### **Remove Punctuation**

In [1]:
text = "Hello, world! NLP is amazing. Let's learn it at https://example.com."
text_no_punct = text.translate(str.maketrans("", "", string.punctuation))
print(text_no_punct)  #output:(Hello world NLP is amazing Lets learn it at httpsexamplecom)


Hello world NLP is amazing Lets learn it at httpsexamplecom


### **Remove URLs**

In [3]:
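# Input is text_no_punct from above, where the URL has already become "httpsexamplecom";
# the pattern still matches it because the token starts with "http"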
text_no_urls = re.sub(r'http\S+|www\S+', '', text_no_punct)
# print the output
print(text_no_urls)   #output:(Hello world NLP is amazing Lets learn it at)


Hello world NLP is amazing Lets learn it at 
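
The steps above strip punctuation before URLs, so the URL is no longer intact when the regex runs. A stricter pattern that looks for `://` would then miss it. The sketch below shows one possible alternative ordering: remove URLs first, then punctuation. The name `URL_PATTERN` and its regex are assumptions made for illustration, not part of the lab code.

```python
import re
import string

text = "Hello, world! NLP is amazing. Let's learn it at https://example.com."

# Stricter pattern (an illustrative assumption): requires "://" or "www."
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# 1) Remove URLs while they are still intact
no_urls = URL_PATTERN.sub("", text)
# 2) Then strip punctuation
no_punct = no_urls.translate(str.maketrans("", "", string.punctuation))
# 3) Collapse any leftover whitespace
cleaned = " ".join(no_punct.split())
print(cleaned)  # Hello world NLP is amazing Lets learn it at

# For contrast: once punctuation is stripped first, the stricter pattern finds nothing
stripped_first = text.translate(str.maketrans("", "", string.punctuation))
print(URL_PATTERN.search(stripped_first))  # None
```

Note that `\S+` also consumes the sentence-final period attached to the URL, which is harmless here because punctuation is removed afterwards anyway.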


### **Lowercasing**

In [4]:
# text is "text_no_urls" computed above
text_lower = text_no_urls.lower()
# print the output
print(text_lower)     # Output:(hello world nlp is amazing lets learn it at)


hello world nlp is amazing lets learn it at 


### **Tokenization**

In [7]:
# text is "text_lower" computed above
words = word_tokenize(text_lower)
sentences = sent_tokenize(text_lower)

# print both outputs
print(sentences)      # Output:['hello world nlp is amazing lets learn it at']
print(words)          # Output:['hello', 'world', 'nlp', 'is', 'amazing', 'lets',
                      # 'learn', 'it', 'at']


['hello world nlp is amazing lets learn it at']
['hello', 'world', 'nlp', 'is', 'amazing', 'lets', 'learn', 'it', 'at']


### **Remove Stop Words**

In [8]:
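# Using the `words` from the previous block
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))
filtered_text = [word for word in words if word not in stop_words]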
# print the output
print(filtered_text)  # Output:['hello', 'world', 'nlp', 'amazing', 'lets', 'learn']

['hello', 'world', 'nlp', 'amazing', 'lets', 'learn']


### **Stemming**

In [9]:
# Using the `filtered_text` from previous block
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_text]
# print the output
print(stemmed_words)  # Output:['hello', 'world', 'nlp', 'amaz', 'let', 'learn']


['hello', 'world', 'nlp', 'amaz', 'let', 'learn']


### **Lemmatization**

In [12]:
# Using the `filtered_text` from previous block
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word, pos="n") for word in filtered_text]  # 'n' for noun
# print the output
print(lemmatized_words)# Output:['hello', 'world', 'nlp', 'amazing', 'let', 'learn']

['hello', 'world', 'nlp', 'amazing', 'let', 'learn']


NOTE: Valid options for `pos` in `.lemmatize()` are “n” for nouns, “v” for verbs, “a” for adjectives, “r” for adverbs and “s” for satellite adjectives.



# 2.  **Character Encoding**

### **American Standard Code for Information Interchange (ASCII)**


In [13]:
text = "Hello, دنیا"

# Encode using ASCII (ignoring non-ASCII characters)
encoded_text = text.encode("ascii", errors="ignore")
print("Encoded Text: ", encoded_text)  # Output: b'Hello, '

# Decode the ASCII bytes back to a string
decoded_text = encoded_text.decode("ascii")
print("Decoded Text: ", decoded_text)  # Output: Hello, (followed by a space)

Encoded Text:  b'Hello, '
Decoded Text:  Hello, 
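
`errors="ignore"` silently drops every character ASCII cannot represent. As a small sketch of the alternatives, the block below tries the other standard error handlers of `str.encode` on the same string. The variable names are illustrative only.

```python
text = "Hello, دنیا"

# errors="replace" substitutes "?" for each character ASCII cannot represent
replaced = text.encode("ascii", errors="replace")
print(replaced)  # b'Hello, ????'

# errors="backslashreplace" keeps the information as \uXXXX escape sequences
escaped = text.encode("ascii", errors="backslashreplace")
print(escaped)

# The default, errors="strict", raises instead of silently losing data
strict_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict mode failed:", exc.reason)
```

Which handler fits depends on the task: dropping or replacing characters loses the Urdu word entirely, so strict mode is often the safer default when the text is not expected to be pure ASCII.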


### **Unicode Transformation Format 8 (UTF-8)**


In [14]:
text = "ہیلو، دنیا"  # Unicode string in Urdu

############ Encode to bytes using UTF-8
# Encode using UTF-8
encoded_text = text.encode("utf-8", errors="ignore")
# print the output
print("Encoded Text: ", encoded_text)  #Output:(b'\xdb\x81\xdb\x8c\xd9\x84\xd9\x88\xd8\x8c \xd8\xaf\xd9\x86\xdb\x8c\xd8\xa7')

############ Decode back to string using .decode()
decoded_text = encoded_text.decode("utf-8")
# print the output
print("Decoded Text: ", decoded_text)  #Output:(ہیلو، دنیا)

Encoded Text:  b'\xdb\x81\xdb\x8c\xd9\x84\xd9\x88\xd8\x8c \xd8\xaf\xd9\x86\xdb\x8c\xd8\xa7'
Decoded Text:  ہیلو، دنیا
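
UTF-8 is a variable-width encoding, which is why the byte string above is longer than the text itself. The following sketch, using the same Urdu string, compares the number of characters with the number of bytes and lists the byte width of each character.

```python
text = "ہیلو، دنیا"
encoded = text.encode("utf-8")

print(len(text))     # number of Unicode code points
print(len(encoded))  # number of bytes after encoding

# Byte width per character: the ASCII space takes 1 byte,
# the Urdu letters and the Arabic comma take 2 bytes each
widths = [len(ch.encode("utf-8")) for ch in text]
for ch, width in zip(text, widths):
    print(repr(ch), width)

# Decoding restores the original string exactly
assert encoded.decode("utf-8") == text
```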


# 3. **Text Representation**

a) **Bag-of-Words (BoW) Representation:**

It represents text as a vector of word frequencies, ignoring grammar and word order, based on a corpus-wide vocabulary.


b) **Term Frequency - Inverse Document Frequency (TF-IDF) Representation:**

It is a statistical measure that evaluates a word's importance in a document relative to a collection of documents by combining its frequency in the document (TF) and its rarity across the corpus (IDF).

Words that appear frequently across many documents (common words) have lower importance.

### **BoW**

In [16]:
# Input texts
text1 = "I love NLP."
text2 = "NLP is an interesting subject."

# Bag of Words (BoW)
# Initialize the CountVectorizer, which converts text into a matrix of token counts
bow_vectorizer = CountVectorizer()
# Fit and transform the input texts into a BoW matrix
bow_matrix = bow_vectorizer.fit_transform([text1, text2])

# Feature names and BoW representation
print("Bag of Words (BoW):")
print("Feature Names:", bow_vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())

Bag of Words (BoW):
Feature Names: ['an' 'interesting' 'is' 'love' 'nlp' 'subject']
BoW Matrix:
 [[0 0 0 1 1 0]
 [1 1 1 0 1 1]]


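NOTE: With default settings, `fit_transform()` builds a unique vocabulary by
  * Lowercasing
  * Applying Tokenization (only tokens of two or more alphanumeric characters are kept, which is why "I" is missing from the feature names)
  * Removing Duplicates

Stop words are not removed by default, so 'an' and 'is' remain in the vocabulary. Pass `stop_words="english"` to `CountVectorizer` to remove them.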
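
To make the vocabulary-building step concrete, here is a minimal standard-library sketch that mimics what `CountVectorizer` does with its default settings on the two example texts. It is an approximation for illustration, not scikit-learn's implementation. The helper `tokenize` and the variable names are assumptions, while the regex is the documented default `token_pattern`.

```python
import re
from collections import Counter

docs = ["I love NLP.", "NLP is an interesting subject."]

# Default token pattern of CountVectorizer: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase, then keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(doc.lower())

# Vocabulary: sorted unique tokens across all documents
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row of counts per document, one column per vocabulary term
bow_rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow_rows.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['an', 'interesting', 'is', 'love', 'nlp', 'subject']
print(bow_rows)    # [[0, 0, 0, 1, 1, 0], [1, 1, 1, 0, 1, 1]]
```

The single-character token "I" is dropped by the pattern, while 'an' and 'is' stay, because no stop-word list is applied unless one is requested.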

### **TF-IDF**

In [17]:
# TF-IDF
# Same text to be used as BoW
# Input texts
text1 = "I love NLP."
text2 = "NLP is an interesting subject."

# Initialize the TfidfVectorizer, which transforms text into TF-IDF
TfIdf_vectorizer = TfidfVectorizer()

# Fit and transform the input texts into a TF-IDF matrix
TfIdf_matrix = TfIdf_vectorizer.fit_transform([text1, text2])

# Print the Feature names and TF-IDF representation
print("Term Frequency -- Inverse Document Frequency (TF-IDF):")
print("Feature Names:", TfIdf_vectorizer.get_feature_names_out())
print("TfIdf Matrix:\n", TfIdf_matrix.toarray())



Term Frequency -- Inverse Document Frequency (TF-IDF):
Feature Names: ['an' 'interesting' 'is' 'love' 'nlp' 'subject']
TfIdf Matrix:
 [[0.         0.         0.         0.81480247 0.57973867 0.        ]
 [0.47107781 0.47107781 0.47107781 0.         0.33517574 0.47107781]]
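
The numbers above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and starts from the BoW counts of the two example texts. Variable names are illustrative.

```python
import math

vocabulary = ["an", "interesting", "is", "love", "nlp", "subject"]
counts = [
    [0, 0, 0, 1, 1, 0],  # "I love NLP."
    [1, 1, 1, 0, 1, 1],  # "NLP is an interesting subject."
]
n_docs = len(counts)

# Document frequency: number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 8) for v in row])
```

"nlp" occurs in both documents, so its IDF is 1.0, the lowest possible value. That is why it scores below "love" in the first row even though both words appear once there.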
