<a href="https://colab.research.google.com/github/Akende1/code-unza25-csc4792-project_team_27-repository/blob/main/code_unza25_csc4792_project_team_27_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding

## Problem Statement

Wikipedia serves as a widely accessible knowledge repository, but the quality and completeness of its articles vary. Many articles on Zambian topics remain underdeveloped: they lack depth, proper structure, sufficient references, and multimedia elements. This limits the availability of reliable, comprehensive information about Zambia for both local and global audiences.

Currently, the process of identifying incomplete Zambian articles is largely manual, relying on editor intuition, quality assessments, and community discussions. This approach is time-consuming, inconsistent, and insufficient for addressing the most critical content gaps. Without a scalable method to assess and prioritize articles, valuable editing effort may be misdirected, leaving important topics underrepresented.


The core problem is the absence of an automated, consistent, and accurate system to determine and classify the completeness of Zambian Wikipedia pages. By analyzing text structure, metadata, and content coverage, such a system could classify articles according to Wikipedia’s quality scale and highlight key areas for improvement. This would enable editors to focus on the most impactful updates, improve the overall quality of Zambian content, and ensure that readers have access to well-developed, trustworthy information.

Solving this problem benefits multiple stakeholders: Wikipedia editors seeking guidance on where to contribute, WikiProject Zambia aiming to raise overall article standards, researchers and educators relying on accurate information, and the general public seeking a richer understanding of Zambia.

## Objectives and success criteria

### 1.1 Primary Objectives

The project has three primary objectives:

**Objective 1: Article Classification System**
Develop a system to automatically classify Zambian Wikipedia articles into established quality levels based on Wikipedia’s quality scale:  
- **Stub (Level 0):** <500 words, minimal structure, few/no references.  
- **Start (Level 1):** 500–1500 words, basic structure, some references.  
- **C-Class (Level 2):** >1500 words, good structure, adequate references.  
- **B-Class (Level 3):** Comprehensive coverage, good references, proper structure.  
- **Good Article (Level 4):** Meets Wikipedia’s “good article” criteria—well-written, neutral, fully referenced.  
- **Featured Article (Level 5):** Exemplary standard, exceptional writing, comprehensive coverage.
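
The integer levels above can be tied to Wikipedia's assessment class names with a small lookup. The sketch below is illustrative only: `QUALITY_LEVELS` and `encode_quality` are hypothetical names, not part of the project code, and the abbreviations `GA` and `FA` follow Wikipedia's usual shorthand for Good Article and Featured Article.

```python
# Hypothetical lookup between Wikipedia assessment classes and the
# integer levels (0-5) used in this notebook.
QUALITY_LEVELS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}
LEVEL_NAMES = {level: name for name, level in QUALITY_LEVELS.items()}


def encode_quality(label: str) -> int:
    """Map an assessment class name (case-insensitive) to its integer level."""
    normalized = {name.lower(): level for name, level in QUALITY_LEVELS.items()}
    key = label.strip().lower()
    if key not in normalized:
        raise ValueError(f"Unknown quality class: {label!r}")
    return normalized[key]


print(encode_quality("Stub"), encode_quality("ga"), LEVEL_NAMES[5])
```

A mapping like this makes it straightforward to compare human assessments, which are recorded as class names, with the model's integer predictions.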

**Objective 2: Content Gap Identification**  
Identify missing or underdeveloped elements within articles, grouped into four categories:  
- Structural gaps (missing sections)  
- Content gaps (missing topics)  
- Reference gaps (unreliable sources)  
- Multimedia gaps (missing images, maps, diagrams)

**Objective 3: Actionable Insights Generation**  
Provide specific improvement recommendations, including:  
- Priority ranking of articles needing attention  
- Article-specific suggestions for enhancement  
- Content templates for common article types  
- Reference improvement strategies

### 1.2 Secondary Objectives
- Create a reusable assessment framework adaptable to other countries/topics.  
- Establish baseline metrics to track future Wikipedia content improvements.  
- Produce educational resources to guide editors on quality standards.

## 2. Success Criteria & Metrics

### 2.1 Primary Success Metrics
**Metric 1: Classification Accuracy**  
- **Target:** ≥85% accuracy in classifying articles.  
- **Measurement:** Compare automated classifications against human-rated articles using cross-validation and confusion matrix analysis.

**Metric 2: Gap Identification Accuracy**  
- **Target:** Correctly flag the top 20 most incomplete Zambian articles.  
- **Measurement:** Validate with expert review and compare with community consensus.

### 2.2 Secondary Success Metrics
**Model Performance Metrics**  
- **Precision:** Minimize false positives (overrating article quality).  
- **Recall:** Minimize false negatives (underrating article quality).  
- **F1-Score:** Balanced metric for each quality level.  
- **Cohen’s Kappa:** Measure agreement between model and human raters.
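
As a rough illustration of how these four metrics could be computed with scikit-learn, the sketch below uses small made-up label lists (`y_true_demo` and `y_pred_demo` are invented for the example and are not project results).

```python
from sklearn.metrics import (
    cohen_kappa_score,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration only (quality levels 0-3).
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 0, 1, 2]
y_pred_demo = [0, 1, 1, 1, 2, 3, 3, 0, 0, 2]
levels = [0, 1, 2, 3]

# average=None returns one score per quality level.
precision = precision_score(y_true_demo, y_pred_demo, labels=levels,
                            average=None, zero_division=0)
recall = recall_score(y_true_demo, y_pred_demo, labels=levels,
                      average=None, zero_division=0)
f1 = f1_score(y_true_demo, y_pred_demo, labels=levels,
              average=None, zero_division=0)
kappa = cohen_kappa_score(y_true_demo, y_pred_demo)

for level, p, r, f in zip(levels, precision, recall, f1):
    print(f"level {level}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Reporting the scores per level rather than as a single average helps show whether rare classes such as Good Article are being handled well.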

**Business Impact Metrics**  
- **Usability:** Wikipedia editors can easily interpret and act on recommendations.  
- **Efficiency:** Reduced time needed to identify priority improvement opportunities.  
- **Coverage:** Percentage of Zambian articles assessed by the system.

---

## 3. Validation Methods
**Expert Review**  
- Recruit 3–5 experienced Wikipedia editors.  
- Have them manually assess 50 randomly selected articles.  
- Compare their assessments with model output to measure agreement.
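
One possible way to measure that agreement, sketched under assumptions: the expert ratings are reduced to a per-article median, and the consensus is compared with the model using quadratic-weighted Cohen's kappa, which penalizes large disagreements on an ordinal scale more than adjacent ones. The ratings below are invented, and the choice of median and quadratic weights is an assumption rather than the project's stated procedure.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Invented ratings: one row per article, one column per expert editor.
expert_ratings_demo = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 1, 2],
    [3, 3, 2],
    [0, 0, 0],
    [4, 3, 4],
])
model_labels_demo = np.array([0, 1, 2, 3, 1, 4])

# With an odd number of raters the median is always one of the given levels.
consensus_demo = np.median(expert_ratings_demo, axis=1).astype(int)

# Quadratic weights treat Stub-vs-Start as a smaller error than Stub-vs-B.
weighted_kappa = cohen_kappa_score(consensus_demo, model_labels_demo,
                                   weights="quadratic")
print("Expert consensus:", consensus_demo.tolist())
print(f"Weighted kappa (model vs. experts): {weighted_kappa:.2f}")
```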

**Community Feedback**  
- Share results with WikiProject Zambia members.  
- Gather qualitative feedback on usefulness, accuracy, and priorities.  
- Apply feedback to refine the model and recommendations.

---

## 4. Data Mining Goals

**Automated Quality Classification**

Build a supervised classification model to automatically assign Zambian Wikipedia articles to one of six quality levels (Stub, Start, C-Class, B-Class, Good Article, Featured Article) based on Wikipedia’s quality scale.

**Content Gap Detection**

Use text mining and metadata analysis to identify missing sections, low reference counts, insufficient word count, and lack of multimedia.

**Improvement Recommendation Generation**

Develop a rule-based recommendation system that generates actionable suggestions for editors based on detected gaps (e.g., “Add references to support claims,” “Include an infobox,” “Expand the history section”).
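
The gap detection and rule-based recommendation goals could be prototyped together as a list of rules, each pairing a check with a suggestion. This is a minimal sketch, not the project's implementation: the thresholds, the `GAP_RULES` table, and the `detect_gaps` helper are all hypothetical, and the demo rows are invented. Only the column names are taken from the dataset features used later in the notebook.

```python
import pandas as pd

# Hypothetical rules: (gap name, check on a row, suggestion for editors).
# Thresholds are illustrative assumptions, not validated values.
GAP_RULES = [
    ("few_references", lambda r: r["references_signal"] < 2,
     "Add references to support claims."),
    ("no_infobox", lambda r: r["has_infobox"] == 0,
     "Include an infobox."),
    ("few_sections", lambda r: r["num_sections"] < 3,
     "Add standard sections, for example a history section."),
    ("short_text", lambda r: r["word_count"] < 500,
     "Expand the article text."),
    ("no_images", lambda r: r["num_images"] == 0,
     "Add an image, map, or diagram."),
]


def detect_gaps(row):
    """Return (gap name, suggestion) pairs for every rule the row triggers."""
    return [(name, advice) for name, check, advice in GAP_RULES if check(row)]


# Invented example rows.
articles_demo = pd.DataFrame({
    "title": ["Article A", "Article B", "Article C"],
    "word_count": [320, 2400, 900],
    "num_sections": [2, 7, 4],
    "num_images": [0, 3, 0],
    "has_infobox": [0, 1, 1],
    "references_signal": [1, 12, 3],
})

gap_lists = [detect_gaps(row) for _, row in articles_demo.iterrows()]
articles_demo["gaps"] = pd.Series(
    [[name for name, _ in gaps] for gaps in gap_lists],
    index=articles_demo.index,
)
articles_demo["gap_count"] = articles_demo["gaps"].apply(len)

# Articles with the most gaps come first in the priority list.
ranked_demo = articles_demo.sort_values("gap_count", ascending=False)
print(ranked_demo[["title", "gap_count", "gaps"]])
for _, advice in gap_lists[0]:
    print("Article A:", advice)
```

Keeping the rules in a table rather than in nested conditionals makes it easy for editors to review or adjust individual thresholds.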





# 2. Data Understanding

In [None]:
#Load CSV
import pandas as pd
df_raw = pd.read_csv('zambian_wikipedia_pages_dataset_FIXED.csv')
df_raw.head()



#Word count histogram
import matplotlib.pyplot as plt
plt.hist(df_raw['word_count'], bins=20)
plt.title('Word Count Distribution')
plt.show()




#Image count histogram
plt.hist(df_raw['num_images'], bins=10)
plt.title('Image Count Distribution')
plt.show()
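
Alongside the histograms, a compact numeric summary can help spot missing values and heavy skew before modeling. The helper below is a sketch: `summarize_features` is a hypothetical function, and it is demonstrated on an invented frame rather than on the project CSV.

```python
import pandas as pd


def summarize_features(df, columns):
    """Return min, median, max, missing count, and skew for numeric columns."""
    summary = df[columns].describe().T[["min", "50%", "max"]]
    summary = summary.rename(columns={"50%": "median"})
    summary["missing"] = df[columns].isna().sum()
    summary["skew"] = df[columns].skew()
    return summary


# Invented values, including one missing word count.
features_demo = pd.DataFrame({
    "word_count": [120, 450, 3200, None, 800],
    "num_images": [0, 1, 5, 0, 2],
})
summary_demo = summarize_features(features_demo, ["word_count", "num_images"])
print(summary_demo)
```

On the real data, the same call could be made with `df_raw` and the list of feature columns.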

# 3. Data Preparation

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

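#Heuristic label function
#The strictest criteria are checked first; otherwise every article that meets
#the level 3-5 criteria would already be returned as level 2.
def heuristic_quality_label(row):
    wc = row['word_count']
    refs = row['references_signal']
    secs = row['num_sections']
    if wc >= 3000 and refs >= 18 and secs >= 10: return 5
    if wc >= 2500 and refs >= 12 and secs >= 8: return 4
    if wc >= 2000 and refs >= 8 and secs >= 6: return 3
    if wc >= 1500 and refs >= 5 and secs >= 5: return 2
    if 500 <= wc < 1500 and refs >= 2: return 1
    return 0  # Stub, or any article that matches no higher level

#Apply the heuristic only if the dataset does not already provide labels
if 'quality_label' not in df_raw.columns:
    df_raw['quality_label'] = df_raw.apply(heuristic_quality_label, axis=1)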




#Features & scaling
feature_cols = ['word_count','num_sections','num_internal_links','num_images','num_external_links','category_count','has_infobox','references_signal']
X = df_raw[feature_cols]
y = df_raw['quality_label']
scaler = StandardScaler()
X_scaled = X.copy()
X_scaled[feature_cols] = scaler.fit_transform(X[feature_cols])




#Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42, stratify=y)
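
The cell above fits the scaler on the full dataset before splitting, so statistics from the test rows influence the scaled training features. One common alternative, sketched here on synthetic data, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` so that scaling is learned from the training split only. The `_demo` variables are invented for the example and do not touch the notebook's own data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight article features and three quality levels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The scaler is fitted inside the pipeline, using training rows only.
pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe_demo.fit(X_train_demo, y_train_demo)

print(f"Held-out accuracy: {pipe_demo.score(X_test_demo, y_test_demo):.3f}")
```

A fitted pipeline also simplifies deployment, because the scaler and the model are saved and loaded as one object.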

# 4. Modeling

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score, classification_report, confusion_matrix

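#Initialize models
models = {
    'LogReg': LogisticRegression(max_iter=200),
    'RandomForest': RandomForestClassifier(n_estimators=250, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42)
}

#Fit each model on the training split and report held-out metrics
for name, model in models.items():
    model.fit(X_train, y_train)
    test_preds = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, test_preds):.3f}, "
          f"macro-F1={f1_score(y_test, test_preds, average='macro'):.3f}, "
          f"kappa={cohen_kappa_score(y_test, test_preds):.3f}")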




#Feature importance for RandomForest
rf = models['RandomForest']
import pandas as pd
import matplotlib.pyplot as plt
feat_imp = pd.Series(rf.feature_importances_, index=feature_cols).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='RandomForest Feature Importance')
plt.show()



#display confusion matrix
preds = rf.predict(X_test)
print(confusion_matrix(y_test, preds))

# 5. Evaluation

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

#Select best model
best_model = rf


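#Identify misclassified articles on the held-out test set
#(predicting on the full dataset would include training rows and overstate performance)
y_pred = best_model.predict(X_test)
test_rows = df_raw.loc[X_test.index]
misclassified = test_rows[y_pred != y_test.to_numpy()]
misclassified[['title','quality_label']].head(10)


# confusion matrix heatmap
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix (test set)')
plt.xlabel('Predicted level')
plt.ylabel('True level')
plt.show()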
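
This section imports `StratifiedKFold` and `cross_val_score`, and the success criteria call for cross-validation. The sketch below shows how the two could be combined, using synthetic data so that it runs on its own. The fold count, the macro-F1 scoring choice, and the `_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the article features and quality labels.
X_cv_demo, y_cv_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Stratified folds keep the class proportions similar in every fold.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_demo = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_cv_demo,
    y_cv_demo,
    cv=cv_demo,
    scoring="f1_macro",
)

print("Macro-F1 per fold:", cv_scores_demo.round(3))
print(f"Mean: {cv_scores_demo.mean():.3f}, std: {cv_scores_demo.std():.3f}")
```

Comparing the fold-to-fold spread for each candidate model gives a firmer basis for choosing `best_model` than a single train/test split.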

# 6. Deployment

In [None]:
import joblib

#Save trained RandomForest model
joblib.dump(best_model, 'rf_quality_classifier.joblib')


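#Attach model predictions (column name chosen here) and export
df_raw['predicted_quality'] = best_model.predict(X_scaled)
df_raw.to_csv('zambian_wikipedia_pages_predictions.csv', index=False)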
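
To reuse the saved classifier later, it can be loaded back with `joblib.load`. The round trip below is a sketch on synthetic data with a temporary file, so the file name and `_demo` variables are illustrative. Because the notebook's model was trained on scaled features, the fitted scaler would also need to be saved and applied to new articles before prediction.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the trained quality classifier.
X_io_demo, y_io_demo = make_classification(
    n_samples=100, n_features=8, random_state=42
)
model_demo = RandomForestClassifier(n_estimators=20, random_state=42)
model_demo.fit(X_io_demo, y_io_demo)

# Save to a temporary location, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_classifier.joblib")
    joblib.dump(model_demo, path)
    reloaded_demo = joblib.load(path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(
    model_demo.predict(X_io_demo), reloaded_demo.predict(X_io_demo)
)
print("Predictions match after reload:", same_predictions)
```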