<a href="https://colab.research.google.com/github/Chalwemwansa/data_mining_assign/blob/master/Group_22_data_mining_assign.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding

## Problem Statement

The reliability and usefulness of Zambian Wikipedia pages as a source of current information suffer when content becomes outdated.  
In many cases, information that was once accurate is left unchanged for long periods, which makes it hard for readers to trust that what they are reading reflects the present situation.  

Without a clear and organized way to identify which pages are stale or contain old data, it is difficult for contributors, especially Zambian Wikipedians, to know where their attention is most needed.  
This lack of direction means that updates can be random or uneven: some topics remain untouched while others are repeatedly edited.  

Over time, this reduces the quality of content on and about Zambia, making the platform less valuable as a resource for people who rely on it for research, learning, or keeping up with developments in the country.


## Business Objectives

From a practical, real-world view, success for this project means being able to identify what is old and what is still current.  
It is about detecting and flagging sections or whole Zambian Wikipedia pages, such as the Zambia page, that hold stale facts.  
Clear signals help both readers and editors see where drift has happened and how serious it is.  

The next step is to make updates easier, not harder.  
Contributors, especially Zambian Wikipedians, need a simple way to spot which pages or sections most need attention for factual recency.  
This supports the DataLab Research group as it explores strategies for improving contributions on and about Zambia.  
Having a clear queue of priority pages is more useful than relying on guesswork or scattered edits.  

The long-term aim is to raise the overall currency of information on Wikipedia pages about Zambia.  
Fresher content increases trust and value, whether the pages are used in schools, in media, or in everyday research.  
When updates become routine and guided, the quality of the whole collection improves steadily.  

These objectives align with the CRISP-DM phase of Business Understanding.  
Here the aim is to define broad and specific goals while staying close to the needs of the community that relies on and contributes to this knowledge.  


This foundation ensures that later technical steps—like data preparation, modeling, and evaluation—stay tied to real impact and community relevance.

## Data Mining

To meet the business objectives, our task is to develop a classification model that labels the recency of content on Zambian Wikipedia pages. Each section would be tagged as 'Current', 'Moderately Outdated', or 'Severely Outdated', giving contributors and readers a quick sense of how current it is. This turns the challenge of stale information into a clear, structured task.

Beyond classification, we also aim to analyse edit histories to reveal how quickly different topics become outdated. Some pages, like those on politics, may need frequent updates, while others such as geography remain stable for longer. Spotting these trends helps contributors anticipate where attention will be needed next.

Finally, the project should provide ranking metrics that highlight high-priority pages. Instead of making random edits, contributors can follow a clear queue that shows where updates will have the most impact. In this way, the data mining goals directly support fresher, more reliable information on Wikipedia pages about Zambia.
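
One way to picture such a priority queue is a simple per-page score. The sketch below is only an illustration: the `priority_score` formula (days since last edit weighted by the log of the word count), the column names, and the sample rows are all assumptions, not metrics the project has defined.

```python
import math

import pandas as pd

# Made-up sample rows; a real run would use the project's own page data.
pages = pd.DataFrame({
    "title": ["Page A", "Page B", "Page C"],
    "wordcount": [12000, 300, 4500],
    "days_since_last_edit": [400, 400, 20],
})

# Hypothetical score: older pages rank higher, and among equally old pages
# the longer page (more content at risk of being stale) ranks first.
pages["priority_score"] = (
    pages["days_since_last_edit"] * pages["wordcount"].apply(math.log1p)
)

# Highest score first gives the editing queue.
queue = pages.sort_values("priority_score", ascending=False).reset_index(drop=True)
print(queue[["title", "priority_score"]])
```

Any monotonic combination of staleness and page importance would serve the same purpose; page views or incoming links could replace word count if that data were collected.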

## Initial Project Success Criteria

Our initial criteria for project success will be measured by:

- **Classification Accuracy**  
  The classification model should achieve an accuracy of at least 85% in correctly identifying the recency level of data on Zambian Wikipedia pages. This will serve as a key benchmark for determining whether the project is meeting its early objectives.  

- **Interpretability and Actionability**  
  The predictions of the model and the factors influencing them must be understandable to human editors. Contributors should be able to see why a page is flagged as outdated and get clear guidance on what needs updating. The results should be actionable so that Wikipedians can focus their efforts where they matter most.  

- **Scalability**  
  The model should be applicable to a wide range of Zambian Wikipedia pages, not just the single Zambia page. Broader application will make the tool useful for long-term, large-scale improvements.  

- **Feedback Integration**  
  Another measure of success will be how well the model's insights can be integrated into a tool or dashboard that Wikipedians can use. This would allow contributors to receive notifications or recommendations for content updates in a more structured and user-friendly way.  

These criteria provide a practical way of measuring the relative success of the project and align with the "Key Success Criteria" aspect of the CRISP-DM Business Understanding phase.  


This section completes the first phase of the project under the CRISP-DM methodology, which focuses on understanding the overall business context and clearly identifying the challenge that needs to be solved.  

The reliability and usefulness of Zambian Wikipedia pages as a source of current information are at risk when articles are left without updates for long periods.  
Information that was accurate at one point becomes outdated over time, and this makes it harder for readers to trust that the content reflects the present situation in Zambia.  

There is currently no clear or systematic way to highlight which pages have stale or outdated data.  
Because of this, contributors, especially Zambian Wikipedians, often do not know where to direct their efforts. The result is random or uneven updates, where some articles receive frequent edits while others are overlooked.  

This lack of focus reduces the overall quality of information about Zambia on Wikipedia and makes the platform less useful for researchers, students, and the general public who rely on it for accurate and up-to-date content.  

The next phases of this project, following the CRISP-DM process, will move into Data Understanding and Data Preparation.  
For this checkpoint, the deliverables include adding this content to the Google Colab notebook and creating a corresponding section in the README.md file, with at least one commit per team member, properly tagged to document individual contributions.  

# 2. Data Understanding

In [None]:
# installing the necessary packages (scikit-learn is used in the later phases)

!pip install pandas matplotlib scikit-learn



In [None]:
# mount drive to colab

from google.colab import drive
drive.mount('/content/drive')

ValueError: mount failed

In [None]:
# read the csv file containing the data set for our use case
import pandas as pd

recency_wiki_csv = pd.read_csv("/content/drive/MyDrive/CollabData/zambia_wikipedia_search.csv")

In [None]:
recency_wiki_csv.columns

In [None]:
recency_wiki_csv.info()

In [None]:
recency_wiki_csv.describe()

In [None]:
recency_wiki_csv.shape

In [None]:
recency_wiki_csv.head(1).T

In [None]:
recency_wiki_csv.tail().T

In [None]:
# import matplotlib.pyplot as plt for data visualization
import matplotlib.pyplot as plt

In [None]:
recency_wiki_csv[['size', 'wordcount']].hist(bins=30, figsize=(10,5))
plt.suptitle("Distribution of Article Size and Word Count")
plt.show()

In [None]:
from datetime import datetime, timezone

recency_wiki_csv['timestamp'] = pd.to_datetime(recency_wiki_csv['timestamp'], utc=True)
recency_wiki_csv['days_since_last_edit'] = (datetime.now(timezone.utc) - recency_wiki_csv['timestamp']).dt.days
recency_wiki_csv['days_since_last_edit'].hist(bins=30, figsize=(10,5))
plt.title("Distribution of Days Since Last Edit")
plt.xlabel("Days")
plt.ylabel("Number of Pages")
plt.show()

In [None]:
# Horizontal bar chart of the top 10 Zambian pages by word count

top_pages = recency_wiki_csv.nlargest(10, 'wordcount')
top_pages.plot(x='title', y='wordcount', kind='barh', figsize=(10,6))
plt.title("Top 10 Zambia Pages by Word Count")
plt.xlabel("Word Count")
plt.ylabel("Page Title")
plt.show()

# Summary of Observations

###  1. Dataset Overview
The dataset contains **500 Wikipedia pages** related to *“Zambia”*.

**Columns include:**
- `ns` (namespace)  
- `title` (page title)  
- `pageid` (unique page identifier)  
- `size` (page size in bytes)  
- `wordcount` (number of words on the page)  
- `snippet` (short text excerpt from the page)  
- `timestamp` (last edit timestamp)  


###  2. Data Quality
- **No null values** were detected in any of the columns.  
- Column data types:  
  - `ns`, `pageid`, `size`, `wordcount` → integers  
  - `title`, `snippet`, `timestamp` → objects (strings)  
- All **500 rows are complete**.  


###  3. Numerical Attributes
- **Page size (`size`)**: 244 bytes → 235,161 bytes  
- **Word count (`wordcount`)**: 14 → 16,058 words  
- Distributions are **skewed** with a few very large pages (outliers).  
- Most pages are **moderate in size** (~4,000–18,000 bytes).  
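
The skew and outliers noted above can be quantified rather than read off the histogram. The following is a minimal sketch on made-up page sizes (the values are illustrative, not taken from the dataset), using the sample skewness and Tukey's 1.5 × IQR rule, which is one common convention for flagging outliers.

```python
import pandas as pd

# Made-up sizes in bytes: mostly moderate pages plus one very large page.
sizes = pd.Series([4000, 5200, 6100, 7500, 9000, 12000, 18000, 235161], name="size")

# Positive skewness indicates a long right tail (a few very large pages).
print(f"skewness: {sizes.skew():.2f}")

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as high outliers.
q1, q3 = sizes.quantile(0.25), sizes.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = sizes[sizes > upper_fence]
print(f"upper fence: {upper_fence:.1f}")
print(outliers.tolist())
```

In the notebook, the same calls could be applied to `recency_wiki_csv['size']` and `recency_wiki_csv['wordcount']`.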


###  4. Timestamps
- The last-edit timestamps in the dataset range from **2025-08-07 to 2025-08-13**.  
- Converting the `timestamp` column to `datetime` allows calculation of **`days_since_last_edit`**.  


###  5. Categorical Attributes
- `title` is **unique** for each row.  
- `ns` is **constant (0)** → all rows are main article namespace.  


###  6. Observations
- Dataset is **clean and complete**, with no missing values.  
- `snippet` contains **HTML tags** (e.g., `<span class="searchmatch">`) → can be cleaned for plain text analysis.  
- Some pages are **much larger** (size & wordcount), showing a mix of **content-rich vs. stub articles**.  
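
The HTML tags in `snippet` can be stripped with the standard library alone. The sketch below is one possible approach; `clean_snippet` is a hypothetical helper name and the example string is made up. A regular expression is adequate here because the snippets contain only simple inline tags; a full HTML parser would be safer for arbitrary markup.

```python
import html
import re

# Matches any HTML tag such as <span class="searchmatch"> or </span>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_snippet(snippet: str) -> str:
    """Remove HTML tags, decode entities, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub("", snippet)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


example = 'Lusaka is the capital of <span class="searchmatch">Zambia</span> &amp; its largest city'
print(clean_snippet(example))
```

In the notebook this could be applied with `recency_wiki_csv["snippet"].apply(clean_snippet)`.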


# 3. Data Preparation

#### Data preparation gets the data ready for the modeling phase. It includes data cleaning (addressing any data quality issues), feature engineering (creating new variables from existing ones that may be more useful for the model), and data transformation (preparing the data for the chosen algorithms, for example by scaling numerical features or encoding categorical ones).

#### We begin by checking whether our dataset contains any null values.

In [None]:
recency_wiki_csv.isnull().sum()

As seen from the output above, there are no null or missing values.

#### We then select the columns that are useful for our case. For example, the timestamp column shows the actual recency of a page, while size and word count can be used to explore relationships, such as whether larger pages tend to be more recently updated.

In [None]:
interested_in_columns = ["title", "pageid", "size", "wordcount", "timestamp"]

# creating a dataframe containing the columns we need
zambia_wiki_posts = recency_wiki_csv[interested_in_columns].copy()

# to investigate and confirm the columns in the new dataframe by observing the first row
zambia_wiki_posts.iloc[0].T

#### We then rename some columns to make them more descriptive.

In [None]:
zambia_wiki_posts = zambia_wiki_posts.rename(columns={"size": "sizeInBytes", "timestamp": "lastEditTimeStamp"})

# confirming changes in names
zambia_wiki_posts.columns

#### Now, we derive necessary attributes to be used in the modelling phase.

These attributes will serve as key factors to help us in the modelling phase.

In [None]:
# Converts the last edit timestamp to a datetime object and calculates how many days have passed since each post was last edited

from datetime import datetime, timezone

zambia_wiki_posts["lastEditTimestampAsDatetime"] = pd.to_datetime(zambia_wiki_posts["lastEditTimeStamp"], utc=True)
zambia_wiki_posts["daysSinceLastEdited"] = (datetime.now(timezone.utc) - zambia_wiki_posts["lastEditTimestampAsDatetime"]).dt.days

# To observe and display the first row to see the changes made

zambia_wiki_posts.iloc[0].T

In [None]:
# To observe and display the 6th row to see the changes made

zambia_wiki_posts.iloc[5].T
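
Because the cell above measures elapsed days from `datetime.now`, the derived values change every time the notebook is rerun. One option, sketched below, is to measure from a fixed snapshot date instead. `SNAPSHOT_DATE` and the two sample timestamps are assumptions for illustration; the actual collection date of the CSV is not stated in this notebook.

```python
import pandas as pd

# Hypothetical snapshot date, standing in for the day the CSV was collected.
SNAPSHOT_DATE = pd.Timestamp("2025-08-14", tz="UTC")

# Made-up last-edit timestamps in the same ISO 8601 format as the dataset.
timestamps = pd.to_datetime(
    pd.Series(["2025-08-07T10:15:00Z", "2025-08-13T23:59:00Z"]), utc=True
)

# Whole days elapsed between each last edit and the snapshot date.
days_since = (SNAPSHOT_DATE - timestamps).dt.days
print(days_since.tolist())
```

With a fixed reference point, the derived column and anything computed from it stay the same across reruns.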

#### Because size, word count, and days since last edit are on very different numeric ranges, we scale them to the range 0 to 1 using the MinMaxScaler. Scaling helps ensure that columns with large values, such as sizeInBytes, do not dominate columns with smaller values, such as daysSinceLastEdited.

In [None]:
from sklearn.preprocessing import MinMaxScaler

columns_to_scale = ["sizeInBytes", "wordcount", "daysSinceLastEdited"]

scaler = MinMaxScaler()

# keep the original columns and add scaled copies, named by appending "Scaled" to each column name
zambia_wiki_posts[[column + "Scaled" for column in columns_to_scale]] = scaler.fit_transform(
    zambia_wiki_posts[columns_to_scale]
)

# observing the output of the first 2 rows to see if changes have taken effect
zambia_wiki_posts.head(2)
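
The cell above fits the scaler on the full dataset, so the minimum and maximum it learns also reflect rows that later end up in the test set. A common alternative is to fit the scaler on the training rows only by placing it in a pipeline. The sketch below shows the idea on synthetic data; the array `X`, the label rule, and the choice of a decision tree are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for three numeric features and a binary label.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1000, size=(200, 3))
y = (X[:, 0] > 500).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits MinMaxScaler on the training rows only, then reuses
# the learned min/max when transforming the test rows.
model = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=42))
model.fit(X_train, y_train)

scaler = model.named_steps["minmaxscaler"]
print("rows seen by scaler:", scaler.n_samples_seen_)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Tree-based models split on thresholds and are therefore insensitive to this kind of rescaling, so the difference matters more for distance-based or gradient-based algorithms.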

# 4. Modeling

In this phase, we aim to build and train one or more data mining models to classify Zambian Wikipedia pages based on how recently they were edited.  
The goal is to assign pages to one of three categories:  
- Current  
- Moderately Outdated  
- Severely Outdated

This follows the CRISP-DM modeling step, where we select algorithms, split the data, train models, and observe preliminary patterns before formal evaluation.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# We use this function to add the recency label
def fxn_recency_category(days):
    if days <= 30:
        return "Current"
    elif days <= 180:
        return "Moderately Outdated"
    else:
        return "Severely Outdated"

zambia_wiki_posts["recency_label"] = zambia_wiki_posts["daysSinceLastEdited"].apply(fxn_recency_category)

# Saving new CSV file for modeling
zambia_wiki_posts.to_csv("./zambia_wiki_prepared_for_modeling.csv", index=False)
print("Prepared CSV saved: 'zambia_wiki_prepared_for_modeling.csv'")
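
The same binning can also be written in vectorized form with `pd.cut`, which makes it easy to inspect the class balance before splitting. This is a sketch on made-up day counts; it copies `fxn_recency_category` from the cell above only to confirm that both approaches produce the same labels.

```python
import pandas as pd


def fxn_recency_category(days):
    # Same thresholds as the notebook cell above.
    if days <= 30:
        return "Current"
    elif days <= 180:
        return "Moderately Outdated"
    else:
        return "Severely Outdated"


# Made-up day counts chosen to sit on both sides of each threshold.
days = pd.Series([0, 30, 31, 180, 181, 900])

# right=True (the default) gives intervals (-inf, 30], (30, 180], (180, inf).
labels_cut = pd.cut(
    days,
    bins=[-float("inf"), 30, 180, float("inf")],
    labels=["Current", "Moderately Outdated", "Severely Outdated"],
)

labels_apply = days.apply(fxn_recency_category)
print(labels_cut.tolist())
print(labels_cut.value_counts())
```

Checking `value_counts()` first is useful because a stratified split needs at least two pages in every class.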

**Features (X) and Target (y):**

- **Features (X):**  
    - `sizeInBytesScaled`, `wordcountScaled`, `daysSinceLastEditedScaled`  
    These scaled numeric columns represent page size, content depth, and time since last edit. They are directly relevant to predicting recency.

- **Target (y):**  
    - `recency_label`  
    This is the categorical label we aim to predict: Current, Moderately Outdated, or Severely Outdated.

**Why these columns:**  
- Days since last edit is the most informative feature for recency.  
- Size and word count help capture differences between pages that are large but may not have been updated recently.

In [None]:
# Defining features and target
feature_columns = ["sizeInBytesScaled", "wordcountScaled", "daysSinceLastEditedScaled"]
target_column = "recency_label"

X = zambia_wiki_posts[feature_columns]
y = zambia_wiki_posts[target_column]

# Splitting the dataset (80% train / 20% test) with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training data shape: {X_train.shape}, {y_train.shape}")
print(f"Testing data shape: {X_test.shape}, {y_test.shape}")

In [None]:
# Train Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
print("Decision Tree training complete.")

# Train Random Forest

rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)
print("Random Forest training complete.")

**Preliminary Observations:**

- Both models were able to learn the patterns in the dataset quickly.  
- The `daysSinceLastEditedScaled` feature appears to dominate the classification, which aligns with our intuition: pages that were edited recently are more likely to be labeled "Current."  
- The Decision Tree is fully interpretable and shows how thresholds on days since last edit and page size help classify recency.  
- Random Forest builds on this by combining multiple trees, providing slightly more robustness even if the dataset were noisier.  

These observations are part of the modeling phase; they give a first glance at how well our features can inform recency classification before formal evaluation.
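
Which feature dominates, and which thresholds a tree uses, can be checked directly with `feature_importances_` and `export_text`. The sketch below uses synthetic data with column names borrowed from this notebook; the random values and the `demo` DataFrame are assumptions, and the labels follow the same 30-day and 180-day rule used above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the prepared dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sizeInBytes": rng.integers(200, 200_000, size=300),
    "wordcount": rng.integers(10, 16_000, size=300),
    "daysSinceLastEdited": rng.integers(0, 720, size=300),
})

# Labels follow the notebook's rule: <= 30, <= 180, otherwise severely outdated.
demo["recency_label"] = np.select(
    [demo["daysSinceLastEdited"] <= 30, demo["daysSinceLastEdited"] <= 180],
    ["Current", "Moderately Outdated"],
    default="Severely Outdated",
)

features = ["sizeInBytes", "wordcount", "daysSinceLastEdited"]
tree = DecisionTreeClassifier(random_state=42).fit(demo[features], demo["recency_label"])

# Share of impurity reduction attributed to each feature.
importances = pd.Series(tree.feature_importances_, index=features)
print(importances)

# Human-readable view of the learned split thresholds.
print(export_text(tree, feature_names=features))
```

On the real data, the same two calls on `dt_model` (and `rf_model.feature_importances_`) would show whether size and word count contribute anything beyond the days feature.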

# 5. Evaluation

After training both the Decision Tree and the Random Forest models, we tested them on the 20% hold-out test set (100 pages).  
The idea was to see how well they can correctly classify pages into **Current**, **Moderately Outdated**, and **Severely Outdated**.



### Decision Tree Results

- **Confusion Matrix**

In [None]:
y_pred_dt = dt_model.predict(X_test)
print(confusion_matrix(y_test, y_pred_dt))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt))
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.2f}")

### Random Forest Results

- **Confusion Matrix**

In [None]:
y_pred_rf = rf_model.predict(X_test)
print(confusion_matrix(y_test, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.2f}")

- **Accuracy:** 0.99 (99%)
- Only one page was misclassified in the **Severely Outdated** class, but overall performance was still very high.



### Observations

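- Both models performed extremely well. This is largely expected: `recency_label` is computed directly from `daysSinceLastEdited`, and the scaled version of that column is one of the input features, so the models mainly have to recover the 30-day and 180-day thresholds. The results therefore say little about how much `sizeInBytesScaled` and `wordcountScaled` contribute on their own.
- The **Decision Tree** hit a perfect score, which likely reflects this direct link between feature and label; it could also be overfitting slightly.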
- The **Random Forest** gave almost the same results, but is more robust since it averages across many trees.
- Either model would work, but for deployment we chose the **Random Forest** because it’s usually more stable in practice.
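
A single 80/20 split gives only one estimate of accuracy, and it does not show how much each feature matters. One way to probe both points is stratified cross-validation, run once with all features and once without the days feature. The sketch below uses synthetic data and is not a result from this project; `demo` and its values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared dataset.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "sizeInBytes": rng.integers(200, 200_000, size=400),
    "wordcount": rng.integers(10, 16_000, size=400),
    "daysSinceLastEdited": rng.integers(0, 720, size=400),
})
demo["recency_label"] = np.select(
    [demo["daysSinceLastEdited"] <= 30, demo["daysSinceLastEdited"] <= 180],
    ["Current", "Moderately Outdated"],
    default="Severely Outdated",
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Mean accuracy across folds with and without the feature the label is built from.
all_features = ["sizeInBytes", "wordcount", "daysSinceLastEdited"]
acc_with = cross_val_score(model, demo[all_features], demo["recency_label"], cv=cv).mean()
acc_without = cross_val_score(
    model, demo[["sizeInBytes", "wordcount"]], demo["recency_label"], cv=cv
).mean()

print(f"with days feature:    {acc_with:.2f}")
print(f"without days feature: {acc_without:.2f}")
```

If accuracy on the real data drops sharply without the days feature, the reported scores mostly reflect the labeling rule rather than what page size and word count reveal about recency.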