<a href="https://colab.research.google.com/github/SfurtiR/Natural-Language-Processing/blob/main/Text_Representation_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Text Representation in NLP**

**Traditional (Statistical) Methods**

1.1 Bag of Words (BoW)


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())


['amazing' 'is' 'love' 'nlp']
[[0 0 1 1]
 [1 1 0 1]]
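
To make the counts above less opaque, here is a minimal pure-Python sketch of the same idea. It assumes the default `CountVectorizer` behaviour of lowercasing and keeping only tokens of two or more word characters, which is why "I" does not appear in the vocabulary. The names `tokenize`, `vocabulary`, and `bow_matrix` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ["I love NLP", "NLP is amazing"]

# Same shape as CountVectorizer's default token pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and extract tokens of length >= 2."""
    return TOKEN_PATTERN.findall(doc.lower())


# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Each row counts how often every vocabulary term occurs in one document.
bow_matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
print(bow_matrix)
```

Word order is discarded entirely: only the per-document counts survive.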


1.2 TF-IDF (Term Frequency - Inverse Document Frequency)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())


['amazing' 'is' 'love' 'nlp']
[[0.         0.         0.81480247 0.57973867]
 [0.6316672  0.6316672  0.         0.44943642]]
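
The weights above can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults: raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The count matrix is copied from the Bag of Words output; the variable names are illustrative.

```python
import math

vocabulary = ["amazing", "is", "love", "nlp"]
counts = [[0, 0, 1, 1], [1, 1, 0, 1]]  # rows = documents, columns = terms
n_docs = len(counts)

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 4) for v in row])
```

"nlp" occurs in both documents, so its IDF is the minimum value of 1 and it receives a lower weight than the terms unique to one document.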


1.3 N-grams

In [None]:
vectorizer = CountVectorizer(ngram_range=(2,2))  # Bi-grams
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())


['is amazing' 'love nlp' 'nlp is']
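
As a rough illustration of what `ngram_range=(2, 2)` does, the sketch below slides a window of size 2 over each tokenized document. It assumes the same tokenization as the default `CountVectorizer` (lowercase, tokens of two or more characters); the helper `ngrams` is a hypothetical name.

```python
import re

corpus = ["I love NLP", "NLP is amazing"]
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigram_vocabulary = sorted(
    {
        gram
        for doc in corpus
        for gram in ngrams(TOKEN_PATTERN.findall(doc.lower()), 2)
    }
)
print(bigram_vocabulary)
```

Unlike single-word counts, bi-grams keep a little local word order, at the cost of a larger and sparser vocabulary.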


**Modern (Neural) Methods: Word Embeddings**

2.1 Word2Vec

In [None]:
from gensim.models import Word2Vec
sentences = [["I", "love", "NLP"], ["NLP", "is", "amazing"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv.most_similar("NLP"))


[('is', 0.1267007291316986), ('amazing', 0.042373016476631165), ('love', -0.01447527389973402), ('I', -0.11821287125349045)]
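
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that computation on hand-made 2-dimensional vectors. These toy vectors and the words attached to them are hypothetical and are not taken from the trained model above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, chosen only to illustrate the ranking.
vectors = {
    "nlp": [1.0, 0.0],
    "language": [0.9, 0.1],
    "banana": [0.0, 1.0],
}

query = "nlp"
ranked = sorted(
    ((word, cosine_similarity(vectors[query], vec))
     for word, vec in vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

With a two-sentence training corpus the learned vectors stay close to their random initialization, so the low similarity scores printed above are not meaningful.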


2.2 GloVe (Global Vectors for Word Representation)

In [None]:
import gensim.downloader as api
glove_model = api.load("glove-wiki-gigaword-50")
print(glove_model.most_similar("and"))


[('well', 0.9412044286727905), ('with', 0.934298038482666), ('both', 0.9299852252006531), ('while', 0.9278404712677002), (',', 0.9206988215446472), ('.', 0.9186708331108093), ('as', 0.9164124727249146), ('also', 0.9155831933021545), ('other', 0.8991360068321228), ('all', 0.8804185390472412)]


2.3 Contextual Embeddings (BERT, GPT)

In [None]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

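text = "NLP is amazing!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
# Shape is (batch_size, num_tokens, hidden_size): one 768-dim vector per token
print(outputs.last_hidden_state.shape)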


torch.Size([1, 7, 768])


In [None]:
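# An empty slice (start == stop) of the model output yields an empty tuple;
# outputs[0] is last_hidden_state and outputs[1] is pooler_output.
outputs[1:1:1]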

()

**Using BART for Text Summarization**

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load Pretrained BART Model
model_name = "facebook/bart-large-cnn"
#model_name = "bert-base-uncased"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Example Input Text
text = "The COVID-19 pandemic has had a significant impact on global economies, causing disruptions in supply chains and labor markets."

# Tokenize & Generate Summary
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(inputs.input_ids, max_length=50, num_beams=4, early_stopping=True)

# Decode Output
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)


Summary: The COVID-19 pandemic has had a significant impact on global economies. It has caused disruptions in supply chains and labor markets. The global economy has been affected by the pandemic for more than a year. The U.


**Applications of BART & GPT**

| Application | BART | GPT |
|---|---|---|
| Text Summarization | ✅ Yes | ❌ Limited |
| Chatbots | ❌ Limited | ✅ Yes |
| Question Answering | ✅ Yes | ✅ Yes |
| Code Generation | ❌ No | ✅ Yes (Codex) |
| Machine Translation | ✅ Yes | ❌ No |

**Using GPT for Text Generation**


In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load Pretrained GPT-2 Model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Example Prompt
prompt = "The future of artificial intelligence is"

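# Tokenize Input (returns input_ids and attention_mask)
inputs = tokenizer(prompt, return_tensors="pt")

# Generate Text (greedy decoding; `temperature` only takes effect with do_sample=True)
outputs = model.generate(
    **inputs,  # input_ids plus attention_mask
    max_length=50,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)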

# Decode Output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Text:", generated_text)



Generated Text: The future of artificial intelligence is uncertain.

"We're not sure what the future will look like," said Dr. Michael S. Schoenfeld, a professor of computer science at the University of California, Berkeley. "But we're not

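
The `temperature` setting in text generation divides the model's logits before the softmax, and it only matters when tokens are sampled rather than chosen greedily. The sketch below illustrates the effect on a made-up set of logits. The logit values and the function name are hypothetical and do not come from GPT-2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores

probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.7)

print("T=1.0:", [round(p, 3) for p in probs_default])
print("T=0.7:", [round(p, 3) for p in probs_sharp])
```

A temperature below 1 concentrates probability on the highest-scoring token, which makes sampled text more conservative. Values above 1 flatten the distribution and make the output more varied.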