<a href="https://colab.research.google.com/github/Rimcode-ai/AI-Driven-Educational-Enhancement/blob/main/nlp_ner_sampleproject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**NER (Named Entity Recognition)**: Use spaCy to extract entities.

**Sentiment Analysis**: Use *TextBlob* for polarity and subjectivity, or *VADER* from NLTK.

**Text Summarization**: Use *Pegasus or T5* from Hugging Face.

**Seq2Seq**: Use *T5 or BART* for tasks like summarization and translation.


**Machine Translation**: Use *MarianMT* for multilingual translation.


In [None]:
import json
import pandas as pd
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # VADER sentiment analysis
from textblob import TextBlob  # sentiment analysis: polarity and subjectivity
from transformers import pipeline  # text summarization
from transformers import T5ForConditionalGeneration, T5Tokenizer  # Seq2Seq model and tokenizer
from transformers import MarianMTModel, MarianTokenizer  # MarianMT: machine translation across many language pairs



In [None]:
text = "I don't know"


In [None]:
num_char = len(text)
print(num_char)

12


In [None]:
with open('/content/sample_data/LexNxdata.json', 'r') as source:
  data = json.load(source)

In [None]:
df = pd.DataFrame(data)
print(df)

                    source                                               news  \
0                  Reuters  Tech Industry Stocks Surge Amid Economic Optimism   
1                 BBC News  Global Leaders Meet to Discuss Climate Change ...   
2  The Wall Street Journal  Federal Reserve Signals Interest Rate Hike Ami...   

                                             content        date       author  \
0  The technology sector experienced a significan...  2025-01-15     John Doe   
1  Leaders from around the world gathered in Pari...  2025-01-14   Jane Smith   
2  The U.S. Federal Reserve has signaled that it ...  2025-01-13  Michael Lee   

                                                tags  \
0     [Technology, Economy, Stock Market, Investing]   
1   [Politics, Climate Change, Global Summit, Paris]   
2  [Economy, Inflation, Federal Reserve, Interest...   

                                            entities       location  
0  [{'type': 'Organization', 'name': 'Apple'}, {'...   
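
The `entities` column holds a list of dictionaries per row, which is awkward to filter or count directly. A minimal sketch of flattening that shape into one tuple per entity is shown below. The records are illustrative: only the first entity (`Apple`) is visible in the printed output above, the other values are made up for the example, and `flatten_entities` is a hypothetical helper.

```python
# Illustrative records shaped like the rows above. Only the first entity is
# taken from the printed output; the remaining values are assumptions.
records = [
    {"source": "Reuters",
     "entities": [{"type": "Organization", "name": "Apple"},
                  {"type": "Organization", "name": "Microsoft"}]},
    {"source": "BBC News",
     "entities": [{"type": "Location", "name": "Paris"}]},
]

def flatten_entities(rows):
    """Return one (source, type, name) tuple per entity in every row."""
    return [(row["source"], ent["type"], ent["name"])
            for row in rows
            for ent in row["entities"]]

flat = flatten_entities(records)
for item in flat:
    print(item)
```

With the real data, `df.to_dict("records")` should give a list of dictionaries in this shape, assuming every row has an `entities` list.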

In [None]:
# Download the Punkt tokenizer data that word_tokenize needs
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
tokens = word_tokenize(df['content'][0])
print(tokens)

['The', 'technology', 'sector', 'experienced', 'a', 'significant', 'increase', 'in', 'stock', 'prices', 'today', 'as', 'investors', 'showed', 'renewed', 'optimism', 'about', 'the', 'future', 'of', 'tech', 'companies', '.', 'Key', 'players', 'like', 'Apple', ',', 'Microsoft', ',', 'and', 'Google', 'reported', 'better-than-expected', 'earnings', ',', 'which', 'fueled', 'positive', 'sentiment', 'in', 'the', 'market', '.', 'Analysts', 'predict', 'that', 'this', 'trend', 'will', 'continue', 'through', 'the', 'next', 'quarter', ',', 'particularly', 'if', 'global', 'economic', 'conditions', 'remain', 'stable', '.']
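
The token list above keeps punctuation such as `.` and `,` as separate tokens. If only word tokens are wanted downstream, one possible filter is sketched here. It assumes a token counts as a word when it contains at least one letter or digit, which is a choice made for this example and not part of the notebook; `drop_punctuation` is a hypothetical helper.

```python
# Hypothetical helper, not part of the original notebook.
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# A short slice of the word_tokenize output shown above.
sample_tokens = ['Key', 'players', 'like', 'Apple', ',', 'Microsoft', ',', 'and',
                 'Google', 'reported', 'better-than-expected', 'earnings', ',']

words = drop_punctuation(sample_tokens)
print(words)
```

Hyphenated tokens such as `better-than-expected` survive the filter because they contain letters.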


In [None]:
nlp = spacy.load("en_core_web_sm")
print(nlp)

<spacy.lang.en.English object at 0x79ce98787850>


In [None]:
doc = nlp(df['content'][0])
print(doc)

The technology sector experienced a significant increase in stock prices today as investors showed renewed optimism about the future of tech companies. Key players like Apple, Microsoft, and Google reported better-than-expected earnings, which fueled positive sentiment in the market. Analysts predict that this trend will continue through the next quarter, particularly if global economic conditions remain stable.


In [None]:
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

[('today', 'DATE'), ('Apple', 'ORG'), ('Microsoft', 'ORG'), ('Google', 'ORG'), ('the next quarter', 'DATE')]
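
The `(text, label)` pairs above are easier to read when grouped by label. A small sketch using only the standard library follows; `group_by_label` is a hypothetical helper and the input is copied from the printed output.

```python
from collections import defaultdict

# Hypothetical helper, not part of the original notebook.
def group_by_label(pairs):
    """Group (text, label) pairs into {label: [text, ...]}, keeping order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Entity pairs copied from the output above.
sample_entities = [('today', 'DATE'), ('Apple', 'ORG'), ('Microsoft', 'ORG'),
                   ('Google', 'ORG'), ('the next quarter', 'DATE')]

by_label = group_by_label(sample_entities)
print(by_label)
```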


In [None]:
blob = TextBlob(df['content'][0])
print(blob)

# Sentiment polarity ranges from -1 (very negative) to 1 (very positive)
print(f"Sentiment Polarity: {blob.sentiment.polarity}")

# Subjectivity ranges from 0 (very objective) to 1 (very subjective)
print(f"Sentiment Subjectivity: {blob.sentiment.subjectivity}")

The technology sector experienced a significant increase in stock prices today as investors showed renewed optimism about the future of tech companies. Key players like Apple, Microsoft, and Google reported better-than-expected earnings, which fueled positive sentiment in the market. Analysts predict that this trend will continue through the next quarter, particularly if global economic conditions remain stable.
Sentiment Polarity: 0.2002840909090909
Sentiment Subjectivity: 0.4556818181818182


In [None]:
summarizer_abstractive = pipeline("summarization", model="google/pegasus-xsum")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


In [None]:
print(summarizer_abstractive)

<transformers.pipelines.text2text_generation.SummarizationPipeline object at 0x79cf6c7f0a10>


In [None]:
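# Note: google/pegasus-xsum is an abstractive model, so despite its name this
# pipeline is identical to summarizer_abstractive and does not summarize extractively.
summarizer_extractive = pipeline("summarization", model="google/pegasus-xsum")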

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


In [None]:
print(summarizer_extractive)

<transformers.pipelines.text2text_generation.SummarizationPipeline object at 0x79ce94652490>


In [None]:
text = df['content'][0]
print(text)

The technology sector experienced a significant increase in stock prices today as investors showed renewed optimism about the future of tech companies. Key players like Apple, Microsoft, and Google reported better-than-expected earnings, which fueled positive sentiment in the market. Analysts predict that this trend will continue through the next quarter, particularly if global economic conditions remain stable.


In [None]:
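# Abstractive summary decoded with sampling (do_sample=True).
# min_length=59 and max_length=60 pin the summary to about 60 tokens, so it can stop mid-sentence.
summary1 = summarizer_extractive(text, max_length=60, min_length=59, do_sample=True)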

In [None]:
print(summary1[0]['summary_text'])

The Dow Jones Industrial Average and the S&P 500 both closed at record highs on Tuesday, with the S&P 500 closing at a record high for the fourth consecutive trading day and the Dow Jones closing at a record high for the fourth consecutive trading day, the first time that has


In [None]:
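# Still an abstractive summary from the same Pegasus model;
# do_sample=False switches from sampling to deterministic decoding.
summary2 = summarizer_extractive(text, max_length=60, min_length=59, do_sample=False)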

In [None]:
print(summary2[0]['summary_text'])

The Dow Jones and the S&P 500 both closed at record highs on Wednesday, marking the first time they have closed above their all-time highs since the financial crisis began in 2008... and the first time they have closed above their all-time lows since the financial
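
Both calls above pin the summary to roughly 59 to 60 tokens (`min_length=59`, `max_length=60`), which is why both printed summaries stop mid-sentence. One possible alternative is to derive the bounds from the input length. The sketch below is an illustration only: `summary_length_bounds` and its percentage defaults are hypothetical choices, not something the notebook or the transformers library prescribes.

```python
# Hypothetical helper; the percentages are illustrative assumptions, not tuned values.
def summary_length_bounds(n_input_tokens, min_pct=20, max_pct=60, floor=5):
    """Return (min_length, max_length) as a share of the input token count.

    Integer arithmetic keeps the result exact, and the floor guarantees
    min_length < max_length even for very short inputs.
    """
    max_len = max(floor + 1, n_input_tokens * max_pct // 100)
    min_len = max(floor, min(n_input_tokens * min_pct // 100, max_len - 1))
    return min_len, max_len

min_len, max_len = summary_length_bounds(80)
print(min_len, max_len)
```

The two values could then be passed as `min_length=min_len, max_length=max_len` in the summarizer call, with the input token count taken from the model's own tokenizer.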


In [None]:
# Seq2Seq models: load the model and tokenizer for T5 (used for text-to-text tasks)

# Define the model name ("t5-small" is the smallest checkpoint; larger ones such as "t5-base" or "t5-large" usually perform better)
model_name = "t5-small"

# Load the pre-trained T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

In [None]:
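# Note: T5 checkpoints were trained with the task prefix "summarize: ".
# The prefix used here ("summarizer: ") differs from it; the outputs in the
# following cells were produced with the prefix exactly as written.
input_text = "summarizer: " + df['content'][0]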
print(input_text)

summarizer: The technology sector experienced a significant increase in stock prices today as investors showed renewed optimism about the future of tech companies. Key players like Apple, Microsoft, and Google reported better-than-expected earnings, which fueled positive sentiment in the market. Analysts predict that this trend will continue through the next quarter, particularly if global economic conditions remain stable.


In [None]:
# Tokenize and generate the summary
inputs = tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
# return_tensors: 'pt' = PyTorch tensors, 'tf' = TensorFlow tensors, 'np' = NumPy arrays
print(inputs)

{'input_ids': tensor([[21603,    52,    10,    37,   748,  2393,  1906,     3,     9,  1516,
           993,    16,  1519,  1596,   469,    38,  4367,  3217, 18184, 24543,
            81,     8,   647,    13,  5256,   688,     5,  4420,  1508,   114,
          2184,     6,  2803,     6,    11,  1163,  2196,   394,    18,  6736,
            18, 31643,  8783,     6,    84,     3, 28536,  1465,  6493,    16,
             8,   512,     5, 25224,     7,  9689,    24,    48,  4166,    56,
           916,   190,     8,   416,  2893,     6,  1989,     3,    99,  1252,
          1456,  1124,  2367,  5711,     5,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]])}


In [None]:
summary_ids = model.generate(inputs['input_ids'], max_length=150, min_length=50, do_sample=False)
print(summary_ids)

tensor([[    0,     8,  5256,  2393,  1906,     3,     9,  1516,   993,    16,
          1519,  1596,   469,     3,     5,   843,  1508,   114,  8947,     6,
          2803,     6,    11, 10283,  2196,   394,    18,  6736,    18, 31643,
          8783,     3,     5, 15639,  9689,    48,  4166,    56,   916,   190,
             8,   416,  2893,     3,     5,     8,   512,    19,  1644,    12,
            36,  5711,    16,     8,   416,  2893,     3,     5,     1]])


In [None]:
#Decode Summary

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

the tech sector experienced a significant increase in stock prices today. key players like apple, Microsoft, and google reported better-than-expected earnings. analysts predict this trend will continue through the next quarter. the market is expected to be stable in the next quarter.


In [None]:
# Machine translation with MarianMT (this checkpoint translates English to French)

model_name_1 = "Helsinki-NLP/opus-mt-en-fr"
tokenizer_1 = MarianTokenizer.from_pretrained(model_name_1)
model_1 = MarianMTModel.from_pretrained(model_name_1)


In [None]:
text = df['content'][0]


In [None]:
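# Tokenize with the Marian tokenizer (tokenizer_1) that matches model_1.
# Passing T5 token IDs to MarianMT mixes two vocabularies and yields garbled text;
# the outputs recorded below came from the T5 `tokenizer` and show exactly that.
tokenized_text = tokenizer_1(text, return_tensors="pt", padding=True)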
print(tokenized_text)

{'input_ids': tensor([[   37,   748,  2393,  1906,     3,     9,  1516,   993,    16,  1519,
          1596,   469,    38,  4367,  3217, 18184, 24543,    81,     8,   647,
            13,  5256,   688,     5,  4420,  1508,   114,  2184,     6,  2803,
             6,    11,  1163,  2196,   394,    18,  6736,    18, 31643,  8783,
             6,    84,     3, 28536,  1465,  6493,    16,     8,   512,     5,
         25224,     7,  9689,    24,    48,  4166,    56,   916,   190,     8,
           416,  2893,     6,  1989,     3,    99,  1252,  1456,  1124,  2367,
          5711,     5,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1]])}


In [None]:
#Translate

translated = model_1.generate(**tokenized_text)
print(translated)

tensor([[59513,  1516,     5,   730,   174,    16,   396,    91,    38,    38,
          3677,  3217, 18184,  3066,    81,     8,   647,    13,  2753,  1424,
             5,  4420,   148,  1607,  2184,     6,    14,  3022,     6,    14,
          3022,     6,    11,  1163,  2196,  1692,    23,  6736,    18,    49,
          1704,    14,     6,   150,     3,    49, 16410,  1465,  1906,    16,
             8,   512,     5,  5749,  1856,     5,     8,  2153,    24,   162,
           712,    29,    14,     6,   150,     3,    49, 16410,  1465,  1906,
            16,     8,   512,     5,  5749,  1856,     5,     8,  2153,    24,
           162,   712,    29,    14,     6,   150,     3,    49, 16410,  1465,
          1906,    16,     8,   512,     5,  5749,  1856,     5,     8,  2153,
            24,   162,   712,    29,    14,     6,   150,     3,    49, 16410,
          1465,  1906,     8,   512,     5,     8,  2153,    24,   162,   712,
            29,    14,     6,   150,     3,    49, 1

In [None]:
translated_text = tokenizer_1.decode(translated[0],skip_special_tokens=True)
print(f"Translated Text: {translated_text}")

Translated Text: Plus de savoir si les femmes ont une une AmériqueTM apparemment filles n la page des confiance prises de district fait référence multi' low' low' etée 2005, puis en acheter in  pourquoi l'A. POL Z président les la total defin financière de la description (en tant que l'A. POL Z président les la total defin financière de la description (en tant que l'A. POL Z président les la total defin financière de la description (en tant que l'A. POL Z président la total de la description (en tant que l'A. POL Z président les la total defin financière de la description (en tant que l'A. Aux Aux Droitsin la » Quand' Can.) ? fournit soumis going going submited going submit going submit going submit going submit going subject going subs subject going subs subject going subsing subsum subsum subsum subsum submum subsum subs subs subsum va subs que que de: de: de: de


In [None]:
#Sentiment analysis with VADER

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
text = df['content'][0]
sia = SentimentIntensityAnalyzer()

In [None]:
sentiment_score_1 = sia.polarity_scores(text)


In [None]:
print("Sentiment Scores (VADER):")
print(sentiment_score_1)

Sentiment Scores (VADER):
{'neg': 0.0, 'neu': 0.759, 'pos': 0.241, 'compound': 0.9313}
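
The `compound` value is the single normalized score, ranging from -1 to 1. A common convention, described in the VADER project's documentation, treats scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. A minimal sketch of that mapping follows; `label_from_compound` is a hypothetical helper and the threshold is a convention, not a fixed rule.

```python
# Hypothetical helper; 0.05 is the conventional threshold, adjust as needed.
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above.
scores = {'neg': 0.0, 'neu': 0.759, 'pos': 0.241, 'compound': 0.9313}

label = label_from_compound(scores['compound'])
print(label)  # positive
```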


In [None]:
# Example: sentiment scores with VADER and subjectivity with TextBlob

text = "The technology sector is experiencing significant growth."

# Get sentiment scores using VADER
sentiment_score = sia.polarity_scores(text)
print("Sentiment Scores (VADER):")
print(sentiment_score)

# Get subjectivity score using TextBlob
blob = TextBlob(text)
subjectivity_score = blob.sentiment.subjectivity
print(f"Subjectivity Score (TextBlob): {subjectivity_score}")

Sentiment Scores (VADER):
{'neg': 0.0, 'neu': 0.532, 'pos': 0.468, 'compound': 0.5267}
Subjectivity Score (TextBlob): 0.875
