PART A – DATASET PREPARATION

In [None]:
!pip install wikipedia-api gensim nltk scikit-learn pandas seaborn matplotlib

Collecting wikipedia-api
  Downloading wikipedia_api-0.9.0.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia-api
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.9.0-py3-none-any.whl size=15422 sha256=a16b68168d128839abedc435c6c8877f3589435322f3b2264d202103e39a5702
  Stored in directory: /root/.cache/pip/wheels/08/22/bd/5181c75f59d48538eb0c0f3246ac541b8a3f0bce3bfd097047
Successfully built wikipedia-api
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.9.0


In [None]:
import wikipediaapi
import pandas as pd

wiki = wikipediaapi.Wikipedia(user_agent='MyWikipediaApp (example@email.com)', language='en')

categories = {
    "technology": [
        "Artificial intelligence","Machine learning","Computer","Software","Hardware",
        "Internet","Cloud computing","Cybersecurity","Database","Operating system",
        "Programming language","Web development","Blockchain","Data science",
        "Big data","Computer network","Mobile phone","Robotics","Internet of Things",
        "Quantum computing"
    ],
    "sports": [
        "Football","Cricket","Basketball","Tennis","Olympic Games",
        "Athletics","Swimming","Baseball","Volleyball","Badminton",
        "Hockey","Rugby","Golf","Cycling","Boxing",
        "Wrestling","Gymnastics","Running","Coach","Stadium"
    ],
    "health": [
        "Health","Hospital","Medicine","Disease","Public health",
        "Nutrition","Mental health","Healthcare","Immunization","Surgery",
        "Cancer","Diabetes","Heart disease","Infection","Vaccination",
        "Therapy","Medical diagnosis","Pharmacy","Hygiene","Sleep"
    ],
    "education": [
        "Education","School","Teacher","Learning","University",
        "College","Higher education","Primary education","Curriculum",
        "Distance education","Online learning","Literacy","Student",
        "Academic degree","Pedagogy","Classroom","Homework","Examination",
        "Scholarship","Educational technology"
    ],
    "business": [
        "Business","Marketing","Management","Finance","Accounting",
        "Entrepreneurship","E-commerce","Supply chain","Economics",
        "Stock market","Investment","Bank","Corporate governance",
        "Human resource management","Retail","Advertising",
        "Customer service","Business strategy","Startup","Leadership"
    ]
}

data = []
id_counter = 1

for label, topics in categories.items():
    for topic in topics:
        page = wiki.page(topic)
        if page.exists():
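            # Split the summary into sentences to get several samples per topic.
            # Note: a plain split on '.' also breaks inside abbreviations such as
            # "e.g.", which can leave fragments that start mid-sentence.
            sentences = page.summary.split('.')
            for sent in sentences[:5]:   # up to 5 samples per topic
                if len(sent.strip()) > 50:   # skip very short fragments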
                    data.append({
                        "id": id_counter,
                        "label": label,
                        "text": sent.strip(),
                        "source": "Wikipedia"
                    })
                    id_counter += 1

df = pd.DataFrame(data)

print("Dataset size:", df.shape)
print(df['label'].value_counts())

df.to_csv("wiki_500_dataset.csv", index=False)

Dataset size: (423, 4)
label
health        90
business      88
technology    87
sports        79
education     79
Name: count, dtype: int64


In [None]:
df.head()

|   | id | label      | text                                               | source    |
| - | -- | ---------- | -------------------------------------------------- | --------- |
| 0 | 1  | technology | Artificial intelligence (AI) is the capability...  | Wikipedia |
| 1 | 2  | technology | It is a field of research in computer science ...  | Wikipedia |
| 2 | 3  | technology | High-profile applications of AI include advanc...  | Wikipedia |
| 3 | 4  | technology | , Google Search); recommendation systems (used...  | Wikipedia |
| 4 | 5  | technology | Machine learning (ML) is a field of study in a...  | Wikipedia |
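
Row 3 of the preview starts with `, Google Search); ...` because `split('.')` also cuts at the periods inside "e.g.". The following is a minimal sketch of a more careful splitter. The abbreviation list and the `split_sentences` name are illustrative assumptions, not part of the notebook, and a library tokenizer such as NLTK's `sent_tokenize` may be a better fit in practice.

```python
import re

# Hypothetical, non-exhaustive list of abbreviations whose periods
# should not be treated as sentence boundaries.
ABBREVIATIONS = ("e.g.", "i.e.", "etc.", "vs.")
_PLACEHOLDER = "\x00"  # temporary stand-in for protected periods

def split_sentences(text):
    """Split text into sentences without breaking on known abbreviations."""
    protected = text
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", _PLACEHOLDER))
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", protected)
    # Restore the protected periods and drop empty pieces.
    return [p.replace(_PLACEHOLDER, ".").strip() for p in parts if p.strip()]

sample = ("High-profile applications of AI include advanced web search engines "
          "(e.g., Google Search). It is a field of research in computer science.")
for sentence in split_sentences(sample):
    print(sentence)
```

Swapping this in for `page.summary.split('.')` would change the number of rows collected, so the dataset size and all scores below would need to be regenerated.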


PART B – BASELINE MODEL (TF-IDF)

In [None]:
import nltk, re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    words = text.split()
    words = [w for w in words if w not in stop_words and len(w) > 2]
    return " ".join(words)

df['clean_text'] = df['text'].apply(clean)

X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'],
    df['label'],
    test_size=0.2,
    stratify=df['label'],
    random_state=42
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
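
To make the weighting concrete, here is a from-scratch sketch of the scheme that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF, then L2 normalisation. The toy corpus and the `tfidf_vectors` helper are illustrative assumptions, and the whitespace tokenisation is a simplification of scikit-learn's tokenizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (list of {term: weight} dicts, {term: idf}) for whitespace-tokenised docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalise so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf

docs = ["football team stadium", "football coach", "hospital surgery"]
vectors, idf = tfidf_vectors(docs)
print(idf["football"], idf["hospital"])  # the rarer term gets the larger IDF
print(vectors[0])
```

Terms that appear in fewer documents receive a larger IDF, which is why category-specific keywords end up dominating each sentence vector.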

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
pred_nb = nb.predict(X_test_tfidf)

print("TF-IDF + Naive Bayes")
print("Accuracy:", accuracy_score(y_test, pred_nb))
print(classification_report(y_test, pred_nb))

TF-IDF + Naive Bayes
Accuracy: 0.9411764705882353
              precision    recall  f1-score   support

    business       0.89      0.94      0.92        18
   education       0.93      0.88      0.90        16
      health       0.94      0.94      0.94        18
      sports       1.00      1.00      1.00        16
  technology       0.94      0.94      0.94        17

    accuracy                           0.94        85
   macro avg       0.94      0.94      0.94        85
weighted avg       0.94      0.94      0.94        85



In [None]:
from sklearn.linear_model import LogisticRegression

lr_tfidf = LogisticRegression(max_iter=2000)
lr_tfidf.fit(X_train_tfidf, y_train)
pred_lr_tfidf = lr_tfidf.predict(X_test_tfidf)

print("TF-IDF + Logistic Regression")
print("Accuracy:", accuracy_score(y_test, pred_lr_tfidf))
print(classification_report(y_test, pred_lr_tfidf))

TF-IDF + Logistic Regression
Accuracy: 0.9529411764705882
              precision    recall  f1-score   support

    business       0.86      1.00      0.92        18
   education       1.00      0.81      0.90        16
      health       0.95      1.00      0.97        18
      sports       1.00      1.00      1.00        16
  technology       1.00      0.94      0.97        17

    accuracy                           0.95        85
   macro avg       0.96      0.95      0.95        85
weighted avg       0.96      0.95      0.95        85
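
One way to check which keywords drive each class is to look at the largest coefficients per class. The helper below is a sketch that works on any coefficient matrix; the name `top_terms_per_class` and the toy inputs are assumptions for illustration and do not come from the notebook.

```python
import numpy as np

def top_terms_per_class(coef, feature_names, classes, k=5):
    """Map each class label to the k terms with the largest coefficients."""
    coef = np.asarray(coef)
    feature_names = np.asarray(feature_names)
    top = {}
    for row, label in zip(coef, classes):
        # Indices of the k largest coefficients, in descending order.
        idx = np.argsort(row)[::-1][:k]
        top[label] = feature_names[idx].tolist()
    return top

# Toy coefficient matrix: 2 classes x 3 terms (made-up numbers).
toy_coef = [[2.0, -1.0, 0.5], [-0.5, 1.5, 0.1]]
toy_terms = ["football", "hospital", "team"]
print(top_terms_per_class(toy_coef, toy_terms, ["sports", "health"], k=2))
```

With the fitted objects from the cells above, and assuming a recent scikit-learn version, the call would be `top_terms_per_class(lr_tfidf.coef_, tfidf.get_feature_names_out(), lr_tfidf.classes_)`.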



In [None]:
from gensim.models import Word2Vec

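# Tokenise every cleaned sentence. Note that the embeddings are trained on the
# full dataset, test sentences included; labels are not used, but a stricter
# evaluation would train Word2Vec on the training split only.
sentences = [text.split() for text in df['clean_text']]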

w2v_model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=2,
    sg=1,
    epochs=20
)

In [None]:
import numpy as np

def avg_vector(words, model):
    vec = np.zeros(model.vector_size)
    count = 0
    for w in words:
        if w in model.wv:
            vec += model.wv[w]
            count += 1
    if count > 0:
        vec /= count
    return vec

X_w2v = np.array([avg_vector(text.split(), w2v_model) for text in df['clean_text']])

X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(
    X_w2v, df['label'],
    test_size=0.2,
    stratify=df['label'],
    random_state=42
)

In [None]:
lr_w2v = LogisticRegression(max_iter=2000)
lr_w2v.fit(X_train_w2v, y_train_w2v)
pred_w2v = lr_w2v.predict(X_test_w2v)

print("Word2Vec + Logistic Regression")
print("Accuracy:", accuracy_score(y_test_w2v, pred_w2v))
print(classification_report(y_test_w2v, pred_w2v))

Word2Vec + Logistic Regression
Accuracy: 0.8470588235294118
              precision    recall  f1-score   support

    business       0.88      0.78      0.82        18
   education       0.79      0.69      0.73        16
      health       0.81      0.94      0.87        18
      sports       1.00      0.88      0.93        16
  technology       0.80      0.94      0.86        17

    accuracy                           0.85        85
   macro avg       0.85      0.85      0.85        85
weighted avg       0.85      0.85      0.85        85



In [None]:
w2v_model.wv.most_similar("computer")

[('software', 0.9940575957298279),
 ('hardware', 0.9937381148338318),
 ('hosts', 0.9853047728538513),
 ('system', 0.984502911567688),
 ('network', 0.9838494062423706),
 ('computers', 0.9833653569221497),
 ('networks', 0.9823330640792847),
 ('communication', 0.9813715815544128),
 ('operating', 0.9808570146560669),
 ('networking', 0.9795966148376465)]
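
`most_similar` ranks words by the cosine similarity between their vectors. Below is a minimal sketch of that measure, using made-up 3-dimensional vectors rather than the trained model; the function name and the numbers are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Hypothetical embeddings, chosen only to illustrate the ranking.
computer = [0.9, 0.1, 0.3]
software = [0.8, 0.2, 0.3]
stadium = [0.1, 0.9, 0.2]
print(cosine_similarity(computer, software))
print(cosine_similarity(computer, stadium))
```

All ten neighbours above score higher than 0.97, which suggests the vectors are not well separated. That is plausible for a corpus of only 423 sentences, so the neighbour list is best read as a rough sanity check.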

FINAL COMPARISON TABLE

Precision, Recall and F1 are the weighted averages from the classification reports above. The Word2Vec (Weighted Avg) row has no corresponding cell in this notebook, so its figures cannot be checked against an output.

| Features                | Classifier          | Accuracy | Precision | Recall | F1   |
| ----------------------- | ------------------- | -------- | --------- | ------ | ---- |
| TF-IDF                  | Naive Bayes         | 0.94     | 0.94      | 0.94   | 0.94 |
| TF-IDF                  | Logistic Regression | 0.95     | 0.96      | 0.95   | 0.95 |
| Word2Vec (Avg)          | Logistic Regression | 0.85     | 0.85      | 0.85   | 0.85 |
| Word2Vec (Weighted Avg) | Logistic Regression | 0.86     | 0.86      | 0.86   | 0.86 |
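
The table does not say how Precision, Recall and F1 are averaged across the five classes. Assuming they are support-weighted averages, as in the `weighted avg` row of `classification_report`, the calculation can be sketched from scratch as follows. The `weighted_scores` helper and the toy labels are illustrative assumptions.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[label]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        # Each class contributes in proportion to its number of true samples.
        weight = support[label] / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, not taken from the dataset.
print(weighted_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

In scikit-learn, `precision_recall_fscore_support(y_test, pred, average="weighted")` computes the same three averaged scores directly.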

Analysis

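TF-IDF with Logistic Regression achieved the highest accuracy (0.95 on the 85-sentence test set), followed by TF-IDF with Naive Bayes (0.94); averaged Word2Vec with Logistic Regression reached 0.85. TF-IDF does well here because each category has distinctive keywords, and the sparse representation keeps them as separate, strongly weighted features. Word2Vec captures semantic relationships between words, but averaging all word vectors in a sentence dilutes the contribution of the few terms that identify the category, and embeddings trained on only 423 sentences are likely to be weak. Weighting each word vector by its TF-IDF score counteracts this dilution by giving distinctive words more influence, which is why the weighted variant is reported to score higher than the plain average. Overall, TF-IDF is the better choice for this classification task, while Word2Vec is more useful for semantic similarity tasks.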
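
The cells above do not include the weighted-average variant, so the following is only a sketch of how TF-IDF-weighted averaging is commonly done, not the code behind the reported figure. `weighted_avg_vector`, the toy `vectors` mapping and the `idf` dictionary are assumptions for illustration.

```python
import numpy as np

def weighted_avg_vector(words, vectors, idf, dim):
    """Average word vectors, weighting each occurrence by the word's IDF."""
    vec = np.zeros(dim)
    total_weight = 0.0
    for w in words:
        # Skip words missing from either the embedding or the IDF table.
        if w in vectors and w in idf:
            vec += idf[w] * np.asarray(vectors[w], dtype=float)
            total_weight += idf[w]
    return vec / total_weight if total_weight > 0 else vec

# Hypothetical 2-dimensional embeddings and IDF weights.
toy_vectors = {"football": [1.0, 0.0], "game": [0.0, 1.0]}
toy_idf = {"football": 3.0, "game": 1.0}
print(weighted_avg_vector(["football", "game", "unknown"], toy_vectors, toy_idf, dim=2))
```

With the notebook's objects, plausible arguments would be `vectors=w2v_model.wv`, `idf=dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))` and `dim=w2v_model.vector_size`. Because a repeated word is added once per occurrence, the effective weight is term frequency times IDF.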