<a href="https://colab.research.google.com/github/Sakinat-Folorunso/OOU_CSC309_Artificial_Intelligence/blob/main/notebooks/CSC309_Week07_NLP1_QA_Spam_CA2_Student_Centred.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSC309 – Artificial Intelligence  
**Week 7 Lab:** NLP I: Retrieval QA & Spam Detection (CA2)

**Instructor:** Dr Sakinat Folorunso  
**Mode:** Student-centred, hands-on in Google Colab

> Every code cell is commented line-by-line so you can follow the logic precisely.

## How to use this notebook
1. Start with the **Group Log** and **Do Now**.  
2. Run the **Setup** cell once.  
3. Work through **Tasks**. Edit only cells marked **`# TODO(Student)`**.  
4. Use **Quick Checks** to test your understanding.  
5. Finish with the **Reflection**. If you finish early, try the **Extensions**.

In [None]:
#@title 🧑🏽‍🤝‍🧑🏾 Group Log (fill before you start)
# The '#@param' annotations create form fields in Colab for easy input.

group_members = "Sakinat"  #@param {type:"string"}  # Names of teammates
roles_notes = "how are you?"  #@param {type:"string"}  # Short working notes

print("👥 Group:", group_members)        # Echo the group list for confirmation
print("📝 Notes:", roles_notes)          # Echo the notes so they're preserved in output

👥 Group: Sakinat
📝 Notes: how are you?


### Learning Objectives
- Build a TF-IDF **retrieval QA** system over short notes.  
- Train and evaluate a **spam classifier** on a held-out train/test split.

In [None]:
#@title 🔧 Setup
# Install common NLP/ML packages used in this lab.

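import sys, subprocess, importlib                                   # For optional installs
def pip_install(pkgs):
    """Install each pip package whose import name cannot be imported."""
    for pip_name, import_name in pkgs.items():                      # pip name and import name can differ
        try: importlib.import_module(import_name)                   # Try importing
        except ImportError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name])  # Install quietly
pip_install({"pandas": "pandas", "scikit-learn": "sklearn"})         # Dataframes and ML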

import pandas as pd                                                 # Tabular data handling
from sklearn.feature_extraction.text import TfidfVectorizer         # Text to TF-IDF features
from sklearn.metrics.pairwise import cosine_similarity              # Cosine similarity for QA
from sklearn.model_selection import train_test_split                # Proper train/test split
from sklearn.linear_model import LogisticRegression                 # Linear classifier for spam
from sklearn.metrics import classification_report, accuracy_score   # Evaluation metrics

print("✅ Setup complete for Week 7.")

✅ Setup complete for Week 7.


In [None]:
#@title 📚 Retrieval QA over short notes (line-by-line)

notes = """
Artificial Intelligence (AI) studies intelligent agents that perceive their environment and act.
Heuristic search trades optimality for speed. A* uses f(n)=g(n)+h(n).
Knowledge representation captures facts and rules for reasoning.
""".strip().split("\n")                                    # Split the multi-line string into separate 'documents'

vectorizer = TfidfVectorizer()                                # Create a TF-IDF vectorizer (default tokenization)
X = vectorizer.fit_transform(notes)                           # Learn vocabulary + transform notes to TF-IDF matrix

def answer(question, topk=1):
    """Return top-k note sentences most similar to the question."""
    q_vec = vectorizer.transform([question])                  # Convert the question to TF-IDF using the same vectorizer
    sims = cosine_similarity(q_vec, X).ravel()                # Compute cosine similarity to each sentence
    idx = sims.argsort()[::-1][:topk]                         # Indices of the top-k most similar sentences
    return [notes[i] for i in idx]                            # Return the corresponding sentences

print("Q:", "What does A* use?")                              # Example question
print("A:", answer("What does A* use?")[0])                   # Show the best matching sentence

Q: What does A* use?
A: Knowledge representation captures facts and rules for reasoning.
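
The answer above is wrong, and the likely cause is tokenization rather than the ranking itself. The default `TfidfVectorizer` keeps only tokens of two or more word characters, so `A*` and the terms of `f(n)=g(n)+h(n)` are dropped, and `use` does not match `uses`. None of the question's tokens are in the vocabulary, every similarity is 0, and the reversed `argsort` simply returns the last note. A minimal sketch of one possible fix follows. Two assumptions are ours, not the notebook's: the `token_pattern` that keeps one-character tokens, and the hypothetical helper `answer_checked`, which refuses to answer when nothing matches.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same three notes as in the cell above, written out as a list.
notes = [
    "Artificial Intelligence (AI) studies intelligent agents that perceive their environment and act.",
    "Heuristic search trades optimality for speed. A* uses f(n)=g(n)+h(n).",
    "Knowledge representation captures facts and rules for reasoning.",
]

# Assumption: keep one-character tokens so the 'A' in 'A*' survives tokenization.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(notes)

def answer_checked(question, topk=1):
    """Return up to top-k notes, skipping any note with zero similarity."""
    q_vec = vectorizer.transform([question])          # Question in the same TF-IDF space
    sims = cosine_similarity(q_vec, X).ravel()        # One similarity score per note
    idx = sims.argsort()[::-1][:topk]                 # Best-scoring notes first
    return [notes[i] for i in idx if sims[i] > 0]     # Drop notes that share no term

print(answer_checked("What does A* use?"))            # Should now pick the A* note
print(answer_checked("Where is the library?"))        # No shared terms: empty list
```

Keeping one-character tokens is a trade-off: it rescues `A*` here, but on a larger corpus the article "a" would match many notes. Stemming or character n-grams are alternatives worth comparing when you write up your vectorization choices.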


In [None]:
#@title ✉️ SMS Spam detection (UCI dataset), fully commented
# Note: This cell downloads a small dataset from the UCI repository when run in Colab.

import zipfile, os, urllib.request                             # Tools for downloading and reading zip files

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"  # Dataset URL

if not os.path.exists("smsspamcollection.zip"):                # Download only if not already present
    urllib.request.urlretrieve(URL, "smsspamcollection.zip")   # Save the zip file locally

with zipfile.ZipFile("smsspamcollection.zip") as z:            # Open the zip archive
    with z.open("SMSSpamCollection") as f:                      # Access the data file inside the zip
        data = [line.decode("utf-8").strip().split("\t", 1)   # Each line: label<TAB>text
                for line in f.readlines()]

df = pd.DataFrame(data, columns=["label", "text"])             # Build a DataFrame with two columns

X_train, X_test, y_train, y_test = train_test_split(           # Split into train/test (80/20)
    df["text"], df["label"], test_size=0.2, random_state=42)

vec = TfidfVectorizer(min_df=2, ngram_range=(1,2))             # Use word unigrams+bigrams; ignore very rare terms
Xtr = vec.fit_transform(X_train)                                # Fit on training text and transform it
Xte = vec.transform(X_test)                                     # Transform test text using the same vectorizer

clf = LogisticRegression(max_iter=200)                          # Linear model with a reasonable iteration cap
clf.fit(Xtr, y_train)                                           # Train the classifier
pred = clf.predict(Xte)                                         # Predict on the test set

print("Accuracy:", round(accuracy_score(y_test, pred), 3))      # Quick scalar metric
print(classification_report(y_test, pred, zero_division=0))     # Full precision/recall/F1 breakdown

Accuracy: 0.971
              precision    recall  f1-score   support

         ham       0.97      1.00      0.98       954
        spam       1.00      0.80      0.89       161

    accuracy                           0.97      1115
   macro avg       0.98      0.90      0.94      1115
weighted avg       0.97      0.97      0.97      1115
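
The report shows spam precision of 1.00 but spam recall of only 0.80, so the errors appear to be almost entirely spam messages classified as ham. A confusion matrix makes that visible directly. The sketch below uses a small hypothetical set of labels (`y_true` and `y_pred` are invented for illustration, not taken from the run above); in the notebook you would pass `y_test` and `pred` instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only; replace with y_test and pred.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "ham",  "ham", "spam", "ham"]

labels = ["ham", "spam"]                              # Fix the row/column order explicitly
cm = confusion_matrix(y_true, y_pred, labels=labels)  # Rows = true class, columns = predicted class

# With 'spam' as the positive class, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("Ham flagged as spam (false positives):", fp)
print("Spam missed as ham (false negatives):", fn)
```

For the confusion analysis, it helps to also print a few of the misclassified messages themselves and look for what they have in common.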



### **CA2 Deliverables**
- QA demo with three well-answered questions + a sentence on your vectorization choices.  
- Spam classifier results with a short **confusion analysis**, and one documented improvement attempt.
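
One possible improvement attempt, sketched under stated assumptions: the dataset has far more ham than spam, so weighting the classes inversely to their frequency (`class_weight="balanced"`) may raise spam recall, possibly at some cost in precision. The messages below are hypothetical toy data so that the block runs on its own; in the notebook you would fit on `X_train` and `y_train` and compare the new classification report with the original one.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy messages, for illustration only.
train_texts = [
    "Are we still meeting for lunch today",
    "Call me when you get home",
    "See you at the lecture tomorrow",
    "Thanks for the notes, very helpful",
    "WIN a free prize now, text WIN to claim",
    "Free entry: claim your cash prize today",
]
train_labels = ["ham", "ham", "ham", "ham", "spam", "spam"]

# Vectorizer and classifier in one pipeline, so the vocabulary is learned only from training text.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),                        # Word unigrams + bigrams
    LogisticRegression(class_weight="balanced", max_iter=200),  # Up-weight the rarer class
)
model.fit(train_texts, train_labels)

new_texts = ["Claim your free prize", "Lunch tomorrow after the lecture?"]
pred = model.predict(new_texts)
for text, label in zip(new_texts, pred):
    print(label, "<-", text)
```

Whatever you try, document it the same way: state the change, report metrics before and after on the same test split, and say whether the trade-off is worth it. Passing `stratify=df["label"]` to `train_test_split` is another change worth considering, since it keeps the ham/spam ratio the same in both splits.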