# Keyword Extraction for Content Tagging

Keyword extraction demo. We take a few sample texts and pull 5 important keywords from each, the kind of tags a person would otherwise assign by hand.

## Sample Content Pieces

Here are 3 sample texts for tagging.

In [2]:
# Sample texts
texts = [
    "Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed. It uses algorithms to identify patterns in data.",
    "Climate change is causing rising sea levels and extreme weather events. Scientists recommend reducing carbon emissions to mitigate its effects.",
    "Python is a popular programming language known for its simplicity and versatility. It's widely used in web development, data analysis, and automation."
]

In [6]:
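import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK data needed for tokenization and stopword filtering
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Build the lookup sets once instead of reloading the stopword list for every token
STOP_WORDS = set(stopwords.words('english'))
PUNCTUATION = set(string.punctuation)

def extract_keywords(text, num_keywords=5):
    """Return the num_keywords most frequent non-stopword tokens in text."""
    # Tokenize in lowercase, then drop stopwords and punctuation-only tokens
    # (checking every character also catches multi-character tokens such as "''")
    words = [
        word for word in word_tokenize(text.lower())
        if word not in STOP_WORDS and not all(ch in PUNCTUATION for ch in word)
    ]
    # Count frequencies; words with equal counts keep their order of first appearance
    word_freq = Counter(words)
    # Keep only the words, discarding the counts
    return [word for word, _ in word_freq.most_common(num_keywords)]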

[nltk_data] Downloading package punkt to /home/epein5/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/epein5/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /home/epein5/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Automatic Keyword Extraction

NLTK tokenizes each text and filters out stopwords and punctuation; the 5 most frequent remaining words become that text's keywords.

In [7]:
# Extract keywords automatically
tagged_data = []
for i, text in enumerate(texts, 1):
    keywords = extract_keywords(text)
    tagged_data.append({
        "id": i,
        "text": text,
        "keywords": keywords
    })

# Display
for item in tagged_data:
    print(f"ID: {item['id']}")
    print(f"Text: {item['text'][:100]}...")
    print(f"Keywords: {', '.join(item['keywords'])}")
    print("---")

ID: 1
Text: Machine learning is a subset of artificial intelligence that enables computers to learn without bein...
Keywords: machine, learning, subset, artificial, intelligence
---
ID: 2
Text: Climate change is causing rising sea levels and extreme weather events. Scientists recommend reducin...
Keywords: climate, change, causing, rising, sea
---
ID: 3
Text: Python is a popular programming language known for its simplicity and versatility. It's widely used ...
Keywords: python, popular, programming, language, known
---
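
Every content word in the three samples occurs exactly once, so the ranking above is decided entirely by tie-breaking: `Counter.most_common` returns equal-count items in order of first appearance, which is why each keyword list is simply the first five non-stopwords. The sketch below illustrates this with the standard library only. The `top_terms` helper, the tiny `STOPWORDS` set, and both example sentences are assumptions made for illustration; they stand in for the NLTK tokenizer and stopword list used in the notebook.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
STOPWORDS = {"is", "are", "a", "of", "that", "to", "it", "in", "and", "its", "the"}

def top_terms(text, k=5):
    """Return the k most frequent non-stopword terms together with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(k)

# Every word occurs once: the "top" terms are just the first ones seen
single = "Climate change is causing rising sea levels."
print(top_terms(single, 3))

# Repeated words: frequency now actually drives the ranking
repeated = "Sea levels are rising. Rising sea levels threaten coasts."
print(top_terms(repeated, 3))
```

On longer documents, where topical words repeat, the counts start to carry real signal; on one- or two-sentence snippets the result mostly reflects word order.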


## Evaluation

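Keywords are judged on relevance (capturing main ideas) and specificity (avoiding broad terms). Coverage is 100% for these samples, in the sense that every text received a full set of five keywords. Specificity is weaker: generic words such as "causing", "popular", and "known" are selected alongside topical ones, because word frequency alone cannot separate terms in texts this short.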
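
One way to make these criteria measurable is sketched below. It assumes coverage means the share of texts that received at least `k` keywords, and it approximates relevance as the fraction of extracted keywords that also appear in a hand-picked reference set. The `coverage` and `precision` helpers and the `reference_2` set are hypothetical and not part of the notebook; the extracted keywords are copied from the output above.

```python
def coverage(tagged, k=5):
    """Share of items that received at least k keywords."""
    if not tagged:
        return 0.0
    return sum(len(item["keywords"]) >= k for item in tagged) / len(tagged)

def precision(extracted, reference):
    """Fraction of extracted keywords that also appear in the reference set."""
    if not extracted:
        return 0.0
    return sum(kw in reference for kw in extracted) / len(extracted)

# Keywords copied from the notebook output
tagged = [
    {"id": 1, "keywords": ["machine", "learning", "subset", "artificial", "intelligence"]},
    {"id": 2, "keywords": ["climate", "change", "causing", "rising", "sea"]},
    {"id": 3, "keywords": ["python", "popular", "programming", "language", "known"]},
]

# Hypothetical manual tags for text 2, chosen only to illustrate the calculation
reference_2 = {"climate", "change", "sea", "levels", "emissions"}

print(f"Coverage: {coverage(tagged):.0%}")
print(f"Precision for text 2: {precision(tagged[1]['keywords'], reference_2):.0%}")
```

The precision figure depends entirely on the reference tags, so it says something about the extractor only when those tags come from a real annotation pass.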