# AutoTagger – Smart Tags & Topic Extractor

This project demonstrates how to extract meaningful **tags** and discover **dominant topics** from a long-form article using Natural Language Processing techniques.

### Objectives:
- Extract **domain-relevant tags** using TF-IDF and linguistic filtering
- Detect **latent topics** using unsupervised topic modeling (LDA)


## Install and Import Required Libraries

- `nltk` for stopword handling
- `spaCy` for part-of-speech tagging
- `scikit-learn` for TF-IDF vectorization
- `gensim` for LDA topic modeling

In [10]:
# Install once (if needed)
!pip install nltk scikit-learn gensim spacy pyLDAvis  --quiet
import nltk
nltk.download('stopwords')
!python -m spacy download en_core_web_sm

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JeevanaSree\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ------------ --------------------------- 3.9/12.8 MB 23.5 MB/s eta 0:00:01
     -------------------- ------------------- 6.6/12.8 MB 17.5 MB/s eta 0:00:01
     --------------------- ------------------ 6.8/12.8 MB 16.1 MB/s eta 0:00:01
     ---------------------- ----------------- 7.1/12.8 MB 9.9 MB/s eta 0:00:01
     ---------------------- ----------------- 7.1/12.8 MB 9.9 MB/s eta 0:00:01
     ---------------------- ----------------- 7.1/12.8 MB 9.9 MB/s eta 0:00:01
     ---------------------- ----------------- 7.1/12.8 MB 9.9 MB/s eta 0:00:01
     ---------------------- ----------------- 7.3/12.8 MB 4.4 MB/s eta 0:00:02
     ---------------------- ----------------- 7.3/12.8 MB 4.4 MB/s eta 0:00:02
     ----------------------- ---------

In [13]:
import re
import numpy as np
import spacy
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import corpora
from gensim.models import LdaModel

## Input Sample Article

This text simulates a real-world blog/article about **Artificial Intelligence in Healthcare**. It will serve as our base document for tag and topic extraction.


In [3]:
sample_text = """
Artificial Intelligence (AI) is playing an increasingly important role in healthcare, particularly in the domains of diagnostics and treatment personalization.
With the explosion of digital medical records and health data, AI is being used to detect patterns and make predictions that assist doctors and researchers alike.
Machine learning algorithms can now analyze X-rays, MRIs, and other medical imaging data faster and in some cases more accurately than human radiologists.
AI tools are also being integrated into electronic health records (EHRs) to automate administrative tasks, allowing clinicians to focus more on patient care.
Natural Language Processing (NLP) helps in structuring unstructured data from clinical notes, medical literature, and patient interaction transcripts.

In surgery, robotic systems powered by AI are increasing precision, reducing complications, and improving recovery times.
Remote surgeries using robotic arms, guided by AI-driven feedback mechanisms, are becoming a reality in rural and underserved areas.
Predictive analytics using AI models can forecast outbreaks, anticipate patient deterioration, and optimize resource allocation in hospitals.

However, the integration of AI in healthcare is not without challenges.
Issues such as data privacy, algorithmic bias, interpretability of models, and the need for regulatory compliance remain key concerns.
There is also a need for continuous training of AI models with diverse datasets to ensure fairness and accuracy in predictions.

Despite these challenges, the future of AI in healthcare looks promising.
With advancements in deep learning, reinforcement learning, and data integration, AI is set to revolutionize not just diagnostics and treatment but the entire healthcare ecosystem.
Researchers and practitioners must work collaboratively to harness AI’s full potential while maintaining ethical and human-centered practices.
"""


## Preprocess and Extract Tags using TF-IDF + POS Filtering

Why this approach?
- TF-IDF gives high importance to unique and rare words in context
- However, raw TF-IDF often includes irrelevant or noisy terms (verbs, adverbs)
- So we filter the top TF-IDF words to **nouns and proper nouns** using spaCy
- This ensures more **relevant and human-readable tags**


In [17]:
# Load stopwords and spacy model
stop_words = set(stopwords.words('english'))
nlp = spacy.load("en_core_web_sm")

# Basic cleaning
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)
    words = text.split()
    words = [word for word in words if word not in stop_words and len(word) > 2]
    return " ".join(words)

# Enhancing TF-IDF with POS filtering
def extract_clean_tags(text, top_n=7):
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform([text])
    words = tfidf.get_feature_names_out()
    scores = tfidf.idf_
    word_score_pairs = dict(zip(words, scores))

    # Using spaCy to keep only nouns & proper nouns
    doc = nlp(text)
    noun_tokens = {token.lemma_.lower() for token in doc if token.pos_ in ["NOUN", "PROPN"] and not token.is_stop}

    filtered = {word: score for word, score in word_score_pairs.items() if word in noun_tokens}
    sorted_filtered = sorted(filtered.items(), key=lambda x: x[1])
    return [word for word, score in sorted_filtered[:top_n]]

# Apply
cleaned_text = clean_text(sample_text)
tags = extract_clean_tags(sample_text)
print("Top Tags (TF-IDF + POS):", tags)


Top Tags (TF-IDF + POS): ['accuracy', 'ai', 'allocation', 'artificial', 'bias', 'care', 'compliance']


## Discover Topics using LDA

I have used Latent Dirichlet Allocation (LDA) from Gensim to detect **latent topics** from the article. 

Each **paragraph is treated as a separate document**, and words that frequently co-occur are grouped into topics.

**Why LDA?**
- Unsupervised learning algorithm
- Helps summarize large corpora into 2–5 key themes


In [18]:
# Split into paragraphs for LDA
paragraphs = [p.strip() for p in sample_text.split("\n") if len(p.strip()) > 50]
tokenized_paragraphs = [clean_text(p).split() for p in paragraphs]

# Gensim LDA setup
dictionary = corpora.Dictionary(tokenized_paragraphs)
bow_corpus = [dictionary.doc2bow(p) for p in tokenized_paragraphs]

# Train LDA
lda_model = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3, passes=20, random_state=42)
topics = lda_model.print_topics(num_words=6)

print("\nDominant Topics:")
for i, topic in topics:
    print(f"Topic {i+1}: {topic}")



Dominant Topics:
Topic 1: 0.034*"healthcare" + 0.024*"data" + 0.024*"learning" + 0.024*"integration" + 0.024*"challenges" + 0.014*"diagnostics"
Topic 2: 0.020*"models" + 0.020*"patient" + 0.020*"using" + 0.012*"also" + 0.012*"integrated" + 0.012*"care"
Topic 3: 0.023*"data" + 0.023*"medical" + 0.023*"human" + 0.013*"need" + 0.013*"researchers" + 0.013*"processing"


## Final Output Summary

Here, printing:
- The top tags detected via TF-IDF + POS
- The 3 main topics uncovered using LDA

These outputs provide semantic insight into the key terms and themes from the input article.


In [19]:
print("\nFinal Summary:")
print("Top Tags (TF-IDF + POS):", tags)

print("\nLDA Topics:")
for i, topic in topics:
    print(f"Topic {i+1}: {topic}")



Final Summary:
Top Tags (TF-IDF + POS): ['accuracy', 'ai', 'allocation', 'artificial', 'bias', 'care', 'compliance']

LDA Topics:
Topic 1: 0.034*"healthcare" + 0.024*"data" + 0.024*"learning" + 0.024*"integration" + 0.024*"challenges" + 0.014*"diagnostics"
Topic 2: 0.020*"models" + 0.020*"patient" + 0.020*"using" + 0.012*"also" + 0.012*"integrated" + 0.012*"care"
Topic 3: 0.023*"data" + 0.023*"medical" + 0.023*"human" + 0.013*"need" + 0.013*"researchers" + 0.013*"processing"
