<a href="https://colab.research.google.com/github/Osakhra/ITAI2373-NewsBot-Final/blob/main/notebooks/07_Conversational_Interface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 07_Conversational_Interface.ipynb

In this notebook, I will demonstrate NewsBot 2.0’s ability to answer natural language questions about news articles using a conversational interface.

**Specifically, I will:**
- Load my cleaned news data and models
- Use my QueryProcessor, IntentClassifier, and ResponseGenerator modules
- Ask NewsBot questions like “What category is this article?” or “Summarize this news story”
---


In [1]:
!pip install langdetect spacy nltk scikit-learn pyldavis textblob transformers torch sumy sentence-transformers numpy matplotlib seaborn googletrans==4.0.0-rc1
import nltk
nltk.download('stopwords')
%cd /content
!rm -rf ITAI2373-NewsBot-Final
!git clone https://github.com/Osakhra/ITAI2373-NewsBot-Final.git
import sys
sys.path.append('/content/ITAI2373-NewsBot-Final/src')



Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     981.5/981.5 kB 10.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting pyldavis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading hstspreload-2025.1.1-py3-none-any.whl.metadata (2.1 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting idna==2.* (from httpx==0.13.3->googletrans==4.0.

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Cloning into 'ITAI2373-NewsBot-Final'...
remote: Enumerating objects: 258, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 258 (delta 52), reused 6 (delta 6), pack-reused 170 (from 1)
Receiving objects: 100% (258/258), 277.34 KiB | 1.62 MiB/s, done.
Resolving deltas: 100% (115/115), done.


In [2]:
from google.colab import files
uploaded = files.upload()



Saving news_cleaned.csv to news_cleaned.csv


In [3]:
import pandas as pd
df = pd.read_csv('news_cleaned.csv')
df.head()

| Unnamed: 0 | ArticleId | content | category | clean_content |
|---|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business | worldcom ex boss launch defence lawyer defend ... |
| 1 | 154 | german business confidence slides german busin... | business | german business confidence slide german busine... |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business | bbc poll indicate economic gloom citizen major... |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech | lifestyle govern mobile choice fast well funky... |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business | enron boss payout eighteen former enron direct... |


In [4]:
from analysis.classifier import NewsClassifier
from analysis.sentiment_analyzer import SentimentAnalyzer
from analysis.ner_extractor import NERExtractor
from analysis.topic_modeler import TopicModeler
from language_models.summarizer import Summarizer
from conversation.query_processor import QueryProcessor
from data_processing.feature_extractor import FeatureExtractor

In [5]:
from data_processing.feature_extractor import FeatureExtractor
from sklearn.model_selection import train_test_split

# Vectorize the cleaned text (at most 2,000 features, n-gram range 1 to 2).
extractor = FeatureExtractor(max_features=2000, ngram_range=(1, 2))
X = extractor.fit_transform(df['clean_content'])
y = df['category']

# Hold out 20% of the articles; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the 'nb' classifier on the training split and save it for later cells.
clf = NewsClassifier(model_type='nb')
clf.train(X_train, y_train)
print("Classifier retrained.")
clf.save('news_classifier_nb.pkl')


Classifier retrained.
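
The cell above holds out 20% of the articles but never scores the classifier on them. A minimal sketch of a held-out check follows. It assumes `FeatureExtractor` behaves like scikit-learn's `TfidfVectorizer` and that `model_type='nb'` corresponds to `MultinomialNB`; both are assumptions about the project code, and the eight toy documents are invented for illustration. Unlike the cell above, the sketch splits the raw text before vectorizing, so the vocabulary is learned from the training documents only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for df['clean_content'] and df['category'] (illustrative data only).
texts = [
    "company profit rise market share",
    "bank share fall market firm",
    "firm sale growth economy profit",
    "economy market company investor bank",
    "mobile phone software user device",
    "software internet user broadband computer",
    "computer device technology mobile network",
    "internet technology phone network software",
]
labels = ["business"] * 4 + ["tech"] * 4

# Split the raw text first so the vectorizer never sees the held-out documents.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Assumed equivalent of FeatureExtractor(max_features=2000, ngram_range=(1, 2)).
vectorizer = TfidfVectorizer(max_features=2000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)  # fit on training text only
X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary

# Assumed equivalent of NewsClassifier(model_type='nb').
model = MultinomialNB()
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

With the real data, the same pattern would report how well the saved classifier generalizes before it is wired into the conversational interface.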


In [6]:
# Load best classifier
clf = NewsClassifier(model_type='nb')
clf.load('news_classifier_nb.pkl')

sentiment_analyzer = SentimentAnalyzer()
ner_extractor = NERExtractor()
topic_modeler = TopicModeler(n_topics=5, method='lda', max_features=1500)
topic_modeler.fit_transform(df['clean_content'])
summarizer = Summarizer()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


In [7]:
import importlib
import conversation.query_processor
importlib.reload(conversation.query_processor)
from conversation.query_processor import QueryProcessor

qp = QueryProcessor(
    classifier=clf,
    sentiment_analyzer=sentiment_analyzer,
    ner_extractor=ner_extractor,
    topic_modeler=topic_modeler,
    summarizer=summarizer,
    feature_extractor=extractor  # the FeatureExtractor fitted in In [5]
)
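
`QueryProcessor` receives every analysis component, so it presumably decides which one to call for each question. The sketch below shows one way such routing could work: a keyword lookup picks an intent, and a handler table maps the intent to a function. It is not the project's implementation; `INTENT_KEYWORDS`, `detect_intent`, `answer`, and the stub handlers are all hypothetical names chosen for illustration.

```python
import re
from typing import Callable, Dict, Set

# Hypothetical keyword table; the project's IntentClassifier may use other rules.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "category": {"category", "classify"},
    "sentiment": {"sentiment", "tone"},
    "entities": {"entities", "entity", "mentioned"},
    "topic": {"topic", "theme"},
    "summary": {"summarize", "summarise", "summary"},
}


def detect_intent(question: str) -> str:
    """Return the first intent sharing a keyword with the question, else 'unknown'."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"


def answer(question: str, article: str,
           handlers: Dict[str, Callable[[str], str]]) -> str:
    """Route the question to the handler registered for its intent."""
    handler = handlers.get(detect_intent(question))
    if handler is None:
        return "Sorry, I can't answer that yet."
    return handler(article)


# Stub handlers stand in for the classifier, summarizer, and other components.
handlers = {
    "category": lambda article: "Predicted Category: (stub)",
    "summary": lambda article: article[:40] + "...",
}

print(detect_intent("What category is this article?"))  # category
print(detect_intent("Summarize this article."))         # summary
print(answer("What is the weather?", "some text", handlers))
```

Keeping intent detection separate from the handlers means a new question type needs only one keyword entry and one handler function.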

In [8]:
import sys, importlib
sys.path.append('/content/ITAI2373-NewsBot-Final/src')

import conversation.intent_classifier as ic
import conversation.query_processor as qpmod
importlib.reload(ic)
importlib.reload(qpmod)
from conversation.query_processor import QueryProcessor


In [9]:
# (re)build the feature extractor exactly as used for training
from data_processing.feature_extractor import FeatureExtractor
extractor = FeatureExtractor(max_features=2000, ngram_range=(1,2))
X = extractor.fit_transform(df['clean_content'])  # fit on the same clean text used for clf

# load/recreate your trained classifier, analyzers, etc.
from analysis.classifier import NewsClassifier
from analysis.sentiment_analyzer import SentimentAnalyzer
from analysis.ner_extractor import NERExtractor
from analysis.topic_modeler import TopicModeler
from language_models.summarizer import Summarizer

clf = NewsClassifier(model_type='nb')
clf.load('news_classifier_nb.pkl')  # if not saved/available, retrain then save+load

sentiment_analyzer = SentimentAnalyzer()
ner_extractor = NERExtractor()
topic_modeler = TopicModeler(n_topics=5, method='lda', max_features=1500)
topic_modeler.fit_transform(df['clean_content'])
summarizer = Summarizer()

qp = QueryProcessor(
    classifier=clf,
    sentiment_analyzer=sentiment_analyzer,
    ner_extractor=ner_extractor,
    topic_modeler=topic_modeler,
    summarizer=summarizer,
    feature_extractor=extractor
)


Device set to use cpu
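
Cell 9 refits the feature extractor because only the classifier was saved to disk. One alternative is to persist the fitted components together so a later session can reload them as a matched pair. The sketch below uses Python's `pickle` module with two hypothetical helpers, `save_bundle` and `load_bundle`. It assumes the project's extractor and classifier objects can be pickled, which the notebook does not confirm, and it uses plain dictionaries as stand-ins for them.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def save_bundle(path: str, **components: Any) -> None:
    """Pickle all named components into a single file."""
    with open(path, "wb") as handle:
        pickle.dump(components, handle)


def load_bundle(path: str) -> Dict[str, Any]:
    """Load the components back as a name -> object dictionary."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-ins for the fitted extractor and classifier (illustrative only).
fake_extractor = {"max_features": 2000, "ngram_range": (1, 2)}
fake_classifier = {"model_type": "nb"}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "newsbot_bundle.pkl")
    save_bundle(bundle_path, extractor=fake_extractor, classifier=fake_classifier)
    restored = load_bundle(bundle_path)

print(sorted(restored))  # ['classifier', 'extractor']
```

Only unpickle files you created yourself, because loading a pickle can execute arbitrary code.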


In [10]:
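# Pick an article
# Note: each printed "User:" line is only a display label; the string passed
# to qp.process() is the query NewsBot actually receives.
sample_article = df['content'].iloc[0]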

# Ask for category
print("User: What category is this article about?")
print("NewsBot:", qp.process("What category is this article?", sample_article), "\n")

# Ask for sentiment
print("User: What is the sentiment of this news story?")
print("NewsBot:", qp.process("What is the sentiment?", sample_article), "\n")

# Ask for entities
print("User: Who or what is mentioned in this article?")
print("NewsBot:", qp.process("List the entities in this article.", sample_article), "\n")

# Ask for main topic
print("User: What is the main topic here?")
print("NewsBot:", qp.process("What topic is this about?", sample_article), "\n")

# Ask for summary
print("User: Summarize this article.")
print("NewsBot:", qp.process("Summarize this article.", sample_article), "\n")


User: What category is this article about?
Detected Intent: category
NewsBot: Predicted Category: business 

User: What is the sentiment of this news story?
Detected Intent: sentiment
NewsBot: Sentiment: neutral (polarity: 0.02) 

User: Who or what is mentioned in this article?
Detected Intent: entities
NewsBot: Entities found: first [ORDINAL], cynthia cooper [PERSON], us [GPE], 2002 [DATE], 5.7bn [MONEY], new york [GPE], wednesday [DATE], arthur andersen [PERSON], early 2001 and [DATE], 2002 [DATE], scott sullivan [PERSON], sullivan [PERSON], worldcom s accounting [ORG], 2001 [DATE], 85 years [DATE], 2004 [DATE], mci [ORG], last week [DATE], mci [ORG], 6.75bn [MONEY] 

User: What is the main topic here?
Detected Intent: topic
NewsBot: Main topic #2: say, year, mr, company, market, firm, rise, sale 

User: Summarize this article.
Detected Intent: summary
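
The entity list above repeats some items: `2002 [DATE]` and `mci [ORG]` each appear twice. A small post-processing step could drop the repeats while keeping first-seen order. The sketch below is not the project's `NERExtractor` code; `unique_entities` and `format_entities` are hypothetical helpers, and the input is assumed to be a list of `(text, label)` pairs.

```python
from typing import Iterable, List, Tuple

Entity = Tuple[str, str]  # (text, label), for example ("mci", "ORG")


def unique_entities(entities: Iterable[Entity]) -> List[Entity]:
    """Drop repeated (text, label) pairs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for text, label in entities:
        key = (text.strip().lower(), label)
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def format_entities(entities: Iterable[Entity]) -> str:
    """Render entities in the 'text [LABEL]' style used in the output above."""
    return ", ".join(f"{text} [{label}]" for text, label in entities)


# A few pairs taken from the output above, including the repeated ones.
sample = [
    ("cynthia cooper", "PERSON"),
    ("2002", "DATE"),
    ("mci", "ORG"),
    ("2002", "DATE"),
    ("mci", "ORG"),
]

print("Entities found:", format_entities(unique_entities(sample)))
# Entities found: cynthia cooper [PERSON], 2002 [DATE], mci [ORG]
```

Because the key includes the label, the same text tagged with two different labels would still be listed twice.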

