# Lab 1: Foundations of NLP

In this lab you will **choose one** API to fetch approximately 200 words of live text, then use that text for all tasks below.

**TODO:** Pick your API from the list (see README/API docs), and implement the fetch in the first code cell.

In [50]:
import os
import requests
import nltk
import string
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter

nltk.download('stopwords')
nltk.download('punkt_tab')

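# Read the key from the environment instead of hardcoding it in the notebook.
# NEWSAPI_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ.get("NEWSAPI_KEY", "")
url = 'https://newsapi.org/v2/everything'
params = {'q': 'Miami', 'apiKey': API_KEY}

# Fallback text, used when the request raises or the API returns an error
raw_text = """Miami is a major city in Florida known for its beaches..."""

try:
    response = requests.get(url, params=params, timeout=10)
    data = response.json()

    print("Status Code:", response.status_code)
    print("Response Keys", data.keys())

    if response.status_code != 200:
        print("API Error:", data.get('message', 'Unknown error'))
    elif 'articles' not in data:
        print("Invalid response format. Available keys:", data.keys())
    else:
        articles = data['articles']
        # Join title and description of the first 22 articles,
        # then keep the first 777 characters
        raw_text = ' '.join([
            (a.get('title') or '') + ' ' + (a.get('description') or '')
            for a in articles[:22]
        ])[:777]

        print("Success! Text sample:", raw_text[:222] + "...")
        print(f"Word count: {len(raw_text.split())}")

except Exception as e:
    print(f"Failed with error: {str(e)}")
    # Fallback Point: raw_text keeps the fallback value assigned above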


Status Code: 200
Response Keys dict_keys(['status', 'totalResults', 'articles'])
Success! Text sample: Olympic 100m medallist Kerley arrested in Miami Two-time Olympic 100m medallist Fred Kerley is arrested in Miami for allegedly punching former partner Alaysha Johnson, according to police. Formula 1 Drivers Just Hit the Tr...
Word count: 129


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## 1. Text Preprocessing (30 pts)

- Use the `raw_text` variable you fetched.
- Tokenize, lowercase, remove punctuation.
- Remove stopwords.
- Plot the top-10 most frequent tokens.

**TODO:** Write your code below and commit after each sub-step.
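
The steps listed above can be sketched without NLTK as follows. This is a minimal illustration, not the lab's required solution: `SMALL_STOPWORDS`, `preprocess`, and `top_k` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and tokenization is a plain whitespace split rather than `word_tokenize`.

```python
import string
from collections import Counter

# A tiny illustrative stopword set; the lab itself uses NLTK's English list.
SMALL_STOPWORDS = {"a", "an", "the", "is", "in", "for", "of", "and", "to"}

def preprocess(text, stop_words=SMALL_STOPWORDS):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = []
    for raw in text.lower().split():
        # Strip punctuation from both ends so "miami." becomes "miami",
        # while inner hyphens ("two-time") are kept.
        token = raw.strip(string.punctuation)
        if token and token not in stop_words:
            tokens.append(token)
    return tokens

def top_k(tokens, k=10):
    """Return the k most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(k)

sample = "Fred Kerley is arrested in Miami. Kerley is a two-time medallist!"
clean = preprocess(sample)
print(clean)
print(top_k(clean, 3))
```

Stripping punctuation from token edges, rather than dropping only tokens that are a single punctuation character, also handles multi-character marks such as `...` that a per-character membership test would let through.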

In [31]:
# TODO: Tokenize and clean raw_text
tokens = word_tokenize(raw_text.lower())
tokens = [word for word in tokens if word not in string.punctuation]

print(tokens[:20])
print(f"\nTotal tokens: {len(tokens)}")

['olympic', '100m', 'medallist', 'kerley', 'arrested', 'in', 'miami', 'two-time', 'olympic', '100m', 'medallist', 'fred', 'kerley', 'is', 'arrested', 'in', 'miami', 'for', 'allegedly', 'punching']

Total tokens: 134


In [34]:
# TODO: Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

print(filtered_tokens[:20])
print(f"\n Tokens left: {len(filtered_tokens)}")


['olympic', '100m', 'medallist', 'kerley', 'arrested', 'miami', 'two-time', 'olympic', '100m', 'medallist', 'fred', 'kerley', 'arrested', 'miami', 'allegedly', 'punching', 'former', 'partner', 'alaysha', 'johnson']

 Tokens left: 94


In [None]:
# TODO: Plot frequent tokens
word_counts = Counter(filtered_tokens)
top_10 = word_counts.most_common(10)

words = [item[0] for item in top_10]
counts = [item[1] for item in top_10]

plt.figure(figsize=(10, 6))
plt.bar(words, counts, color='orange')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Most Frequent Words')
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.tight_layout()

# Save & Show
plt.savefig('word_frequencies.png')
plt.show()

# Print repeated words
print("\nTop 10 words:")
for word, count in top_10:
    print(f"{word}: {count}")

## 2. Synonym Generation (30 pts)

- Pick 5 tokens from your preprocessed results.
- Manually list 2-3 synonyms each.
- Use Google AI Studio Text API to generate synonyms for each.

**TODO:** Complete the code and reflections.
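
Model replies to a synonym prompt may arrive as numbered lists, Markdown bullets, or comma-separated values, so it can help to normalize them before comparing against the manual lists. The following is a rough sketch under that assumption; `parse_synonyms` is a hypothetical helper, not part of any Google SDK.

```python
import re

def parse_synonyms(reply, limit=3):
    """Extract up to `limit` synonyms from a free-form model reply."""
    items = []
    for line in reply.splitlines():
        # Remove Markdown bold markers first, then leading list markers
        # such as "1.", "2)", "*" or "-".
        line = line.replace("**", "")
        line = re.sub(r"^\s*(?:\d+[.)]|[*-])\s*", "", line)
        # Drop a trailing parenthetical note, e.g. "(Note the British spelling)".
        line = re.sub(r"\s*\(.*?\)\s*$", "", line)
        # A single line may itself hold comma-separated values.
        items.extend(part.strip() for part in line.split(",") if part.strip())
    return items[:limit]

numbered = "1. Apprehended\n2. Detained\n3. Taken into custody"
bulleted = "* **One hundred meters**\n* **A hundred meters**\n* **100 metres** (Note the British spelling)"
inline = "reportedly, purportedly, supposedly"

print(parse_synonyms(numbered))
print(parse_synonyms(bulleted))
print(parse_synonyms(inline))
```

A reply that is a full explanatory sentence (for example, when the model says a surname has no synonyms) will still be split on commas, so a production version would likely need an extra check for that case.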

In [62]:
# Pick 5 tokens from the preprocessed results, manually list 2-3 synonyms each,
# then use the Google AI Studio Text API to generate synonyms for each.

!pip install -q google-generativeai

import google.generativeai as genai
from google.colab import userdata

# Load the key from Colab secrets instead of hardcoding it in the notebook.
# 'GOOGLE_API_KEY' is an assumed secret name; use the name you stored the key under.
API_KEY = userdata.get('GOOGLE_API_KEY')

genai.configure(api_key=API_KEY)

# Pick 5 tokens from the preprocessed results (filtered_tokens)
selected_tokens = filtered_tokens[:5] # Taking the first 5 for demonstration

print("Selected Tokens for Synonym Generation:", selected_tokens)

# Manually list synonyms. NOTE: these keys come from the fallback text and do
# not match selected_tokens above; update them to the five selected tokens.
manual_synonyms = {
    'miami': ['south beach', 'magic city'],
    'florida': ['sunshine state'],
    'city': ['metropolis', 'town', 'urban center'],
    'major': ['important', 'significant', 'principal'],
    'known': ['famous', 'recognized', 'celebrated']
}

print("\nManually Listed Synonyms:")
for token, synonyms in manual_synonyms.items():
    print(f"{token}: {', '.join(synonyms)}")


# Use Google AI Studio Text API to generate synonyms
model = genai.GenerativeModel('gemini-1.5-flash-latest')

print("\nAI Generated Synonyms:")
for token in selected_tokens:
    try:
        prompt = f"List 3 synonyms for the word '{token}'."
        response = model.generate_content(prompt)
        print(f"{token}: {response.text.strip()}")
    except Exception as e:
        print(f"Could not generate synonyms for '{token}': {e}")

Selected Tokens for Synonym Generation: ['olympic', '100m', 'medallist', 'kerley', 'arrested']

Manually Listed Synonyms:
miami: south beach, magic city
florida: sunshine state
city: metropolis, town, urban center
major: important, significant, principal
known: famous, recognized, celebrated

AI Generated Synonyms:
olympic: 1. Games
2. Olympian (referring to the athletes or the event itself)
3. International (emphasizing the global nature of the event)
100m: * **One hundred meters**
* **A hundred meters**
* **100 metres** (Note the British spelling)
medallist: 1. Award winner
2. Prize winner
3. Champion
kerley: There are no common synonyms for "Kerley" as it's primarily a surname and a less common given name.  There's no inherent meaning that readily lends itself to interchangeable words.
arrested: 1. Apprehended
2. Detained
3. Taken into custody


In [37]:
import random

selected_words = random.sample([w for w in set(filtered_tokens) if len(w) > 4], 5)
print("Selected words for synonym generation:", selected_words)

# Manual synonyms (5 pts)
manual_synonyms = {
    'miami': ['city', 'metropolis', 'urban area'],
    'allegedly': ['reportedly', 'likely', 'reputedly'],
    'punching': ['slapping', 'knocking', 'hitting'],
    'former': ['old', 'once', 'sometime'],
    'partner': ['wife', 'husband', 'spouse']
}

Selected words for synonym generation: ['allegedly', 'faire', 'build', 'silver', 'arrested']


In [61]:
# Call Google AI Studio Text API for synonyms
import google.generativeai as genai
from google.colab import userdata

# 'GOOGLE_API_KEY' is an assumed Colab secret name; use the name you stored the key under.
genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))

# genai.generate_text is not available in this version of the SDK,
# so use GenerativeModel.generate_content instead.
model = genai.GenerativeModel('models/gemini-2.0-flash')

def generate_synonyms(word):
    prompt = f"Generate 3 professional synonyms for '{word}' as comma-separated values. Only return the words."
    try:
        response = model.generate_content(
            prompt,
            generation_config={"temperature": 0.7}
        )
        return [s.strip() for s in response.text.split(',')[:3]]
    except Exception as e:
        print(f"AI Error for {word}: {str(e)}")
        return []

# Generate and compare synonyms
print("\n{:15} | {:40} | {}".format("Word", "Manual Synonyms", "AI Synonyms"))
print("-"*80)
for word in selected_words:
    ai_syns = generate_synonyms(word)
    print("{:15} | {:40} | {}".format(
        word,
        ', '.join(manual_synonyms.get(word, ['N/A'])),
        ', '.join(ai_syns) if ai_syns else 'API Error'
    ))


## 3. Part-of-Speech Annotation (20 pts)

- Select one sentence from `raw_text`.
- Manually tag each word with its POS.
- Call the AI Studio syntax endpoint and compare.

**TODO:** Implement tagging and comparison.
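
One way to frame the comparison step is to align the manual tags with the tagger's output and report where they disagree. The sketch below assumes Penn Treebank tags and uses a sentence fragment from the fetched text; `compare_tags` is a hypothetical helper, and `predicted_tags` is invented for illustration, not real API output.

```python
def compare_tags(manual, predicted):
    """Return (agreement_rate, mismatches) for two aligned (word, tag) lists."""
    if len(manual) != len(predicted):
        raise ValueError("Tag sequences must have the same length")
    mismatches = [
        (word, m_tag, p_tag)
        for (word, m_tag), (_, p_tag) in zip(manual, predicted)
        if m_tag != p_tag
    ]
    agreement = 1 - len(mismatches) / len(manual) if manual else 0.0
    return agreement, mismatches

# Manual tags for "Fred Kerley is arrested in Miami" (Penn Treebank tagset assumed)
manual_tags = [
    ("Fred", "NNP"), ("Kerley", "NNP"), ("is", "VBZ"),
    ("arrested", "VBN"), ("in", "IN"), ("Miami", "NNP"),
]

# Hypothetical tagger output, invented to show one disagreement
predicted_tags = [
    ("Fred", "NNP"), ("Kerley", "NNP"), ("is", "VBZ"),
    ("arrested", "VBD"), ("in", "IN"), ("Miami", "NNP"),
]

agreement, mismatches = compare_tags(manual_tags, predicted_tags)
print(f"Agreement: {agreement:.2f}")
for word, m_tag, p_tag in mismatches:
    print(f"{word}: manual={m_tag}, predicted={p_tag}")
```

Listing the mismatches alongside the agreement rate makes it easier to discuss which tags were ambiguous, which feeds directly into the reflection section.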

In [None]:
# TODO: Manual POS tagging


In [None]:
# TODO: Call AI Studio syntax endpoint


## 4. Thinking & Reflection (20 pts)

Answer in Markdown:
1. Which preprocessing step had the biggest impact?
2. What surprised you about the AI outputs?
3. How would you integrate manual rules and AI calls in a production pipeline?
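
As a starting point for question 3, one possible arrangement is to consult hand-written rules first and call the model only for words the rules do not cover, caching the results. This is a sketch of that idea only: `make_synonym_lookup` and `fake_model` are hypothetical names, and `fake_model` stands in for a real API call.

```python
from typing import Callable, Dict, List, Tuple

def make_synonym_lookup(
    rules: Dict[str, List[str]],
    ai_fallback: Callable[[str], List[str]],
) -> Callable[[str], Tuple[List[str], str]]:
    """Build a lookup that prefers manual rules and falls back to an AI call."""
    cache: Dict[str, List[str]] = {}

    def lookup(word: str) -> Tuple[List[str], str]:
        key = word.lower()
        if key in rules:
            return rules[key], "manual"
        if key not in cache:
            try:
                cache[key] = ai_fallback(key)
            except Exception:
                # Treat a failed call as "no synonyms" so the pipeline keeps running
                cache[key] = []
        return cache[key], "ai"

    return lookup

def fake_model(word: str) -> List[str]:
    # Stand-in for a real model call; returns a made-up value
    return [f"{word}-alt"]

lookup = make_synonym_lookup({"arrested": ["apprehended", "detained"]}, fake_model)
print(lookup("Arrested"))
print(lookup("silver"))
```

Returning the source label (`"manual"` or `"ai"`) alongside the synonyms keeps it visible which answers were human-curated and which came from the model.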