<a href="https://colab.research.google.com/github/Coop149/code-unza25-csc4792-project_team_24-repository/blob/main/code_unza25_csc4792_project_team_24_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding

## Problem Statement
The University of Zambia publishes numerous scholarly articles across various disciplines, but there is no automated system to categorize these publications according to Zambia’s Vision 2030 sector classifications (e.g., Education, Health, Agriculture). This lack of classification makes it difficult for policymakers, researchers, and students to quickly identify research relevant to national development priorities. Manually sorting articles is time-consuming and prone to human error, limiting the accessibility and impact of academic research.


## **Business Objectives**

1. **Develop an automated classification system**  
   Accurately categorize scholarly articles from the University of Zambia into the official *Zambia’s Vision 2030* sector classifications  
   *(e.g., Education, Health, Agriculture)*.

2. **Improve accessibility of academic research**  
   Enable policymakers, researchers, and students to quickly locate publications relevant to national development priorities.

3. **Reduce manual workload and classification errors**  
   Provide a consistent, data-driven categorization process that replaces most manual classification.

4. **Enhance the impact of academic research**  
   Align publications with *Zambia’s Vision 2030* goals, making it easier for decision-makers to leverage research for evidence-based policy and planning.

### **Practical Success**

- A policymaker searches for **"Education sector"** in the system and instantly retrieves all University of Zambia publications related to that sector — without having to manually read and sort each article.

- Any user can search for and retrieve articles by *Vision 2030* sector in seconds.  
- The classification is consistent, accurate, and requires minimal manual intervention.


## **Data Mining Goals**

1. **Build a text classification model**  
   Automatically categorize scholarly articles from the University of Zambia into *Zambia’s Vision 2030* sector categories  
   *(e.g., Education, Health, Agriculture)* based on article titles, abstracts, and keywords.

2. **Preprocess and clean the text data**  
   Remove stop words, apply stemming or lemmatization, and handle special characters to prepare it for machine learning algorithms.

3. **Extract relevant features from the article text**  
   Use Natural Language Processing (NLP) techniques such as **TF-IDF** (Term Frequency–Inverse Document Frequency) or word embeddings.

4. **Train and evaluate classification algorithms**  
   Experiment with models such as **Naïve Bayes**, **Support Vector Machines (SVM)**, or **Logistic Regression** to determine which provides the highest accuracy for this dataset.

5. **Deploy the trained model**  
   Integrate it into a searchable interface that lets users retrieve articles by *Vision 2030* sector within seconds.
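
Goals 2 and 3 above can be illustrated with a minimal, standard-library sketch. It is not the project's implementation: the stop-word list, the example texts, and the function names `preprocess` and `tfidf` are assumptions made for illustration, stemming and lemmatization are left out, and the weighting is the plain `tf * log(N / df)` form, whereas library implementations typically add smoothing.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real project would use a fuller one).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up short texts, loosely modeled on article titles; not real dataset rows.
docs = [
    "Performance of cowpea hybrids in hydroponic systems",
    "Dietary patterns and nutrition status of diabetic patients",
    "Cowpea production and phosphorus in Zambia",
]
vectors = tfidf(docs)
```

A term that appears in every document gets a weight of zero under this formula, which is the intended effect: words that do not distinguish one sector from another contribute nothing to the feature vector.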



## **Initial Project Success Criteria**

1. **Accuracy Threshold**

   The classification model should achieve **at least 80% accuracy** on the test dataset when categorizing articles into *Zambia’s Vision 2030* sector categories.

2. **Balanced Performance**

   The model must maintain a **minimum precision and recall of 75%** for every sector, so that no single category is disproportionately misclassified.

3. **Coverage of All Sectors**

   The system should be able to classify articles into **all major Vision 2030 sectors** (e.g., Education, Health, Agriculture) without excluding any represented category.

4. **Speed of Classification**

   The model should return results **within seconds** for a single article query to support real-time search functionality.

5. **Reduction of Manual Effort**

   Compared to manual sorting, the automated system should **reduce classification time by at least 70%**, making research retrieval faster and more consistent.

6. **Usable Output**

   The final system should produce a **searchable, categorized dataset or dashboard** for policymakers, researchers, and students, enabling instant filtering by Vision 2030 sector.
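
Criteria 1 and 2 can be checked mechanically once test-set predictions exist. The sketch below assumes the 75% threshold applies to each sector separately; the helper names `per_class_metrics` and `meets_criteria` and the label lists are hypothetical, made up only to show the calculation.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and a {label: (precision, recall)} dict."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    accuracy = sum(tp.values()) / len(y_true)
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return accuracy, metrics


def meets_criteria(y_true, y_pred, min_accuracy=0.80, min_pr=0.75):
    """True only if accuracy and every per-sector precision/recall pass."""
    accuracy, metrics = per_class_metrics(y_true, y_pred)
    return accuracy >= min_accuracy and all(
        p >= min_pr and r >= min_pr for p, r in metrics.values()
    )


# Made-up labels for illustration; not results from the project.
y_true = ["Health", "Health", "Education", "Education", "Agriculture"]
y_pred = ["Health", "Health", "Education", "Health", "Agriculture"]
accuracy, metrics = per_class_metrics(y_true, y_pred)
passed = meets_criteria(y_true, y_pred)
```

In this toy example the accuracy threshold is met, yet the check fails because one sector's recall falls below 75%. That is the situation the balanced-performance criterion is meant to catch.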


# 2. Data Understanding


In [1]:
# Upload the dataset CSV from the local machine into the Colab runtime
from google.colab import files
uploaded = files.upload()

Saving vision_2030_dataset - unza_journal_articles.csv to vision_2030_dataset - unza_journal_articles.csv


In [2]:
import pandas as pd

df = pd.read_csv("vision_2030_dataset - unza_journal_articles.csv")

# Preview the first 5 rows
df.head()

|  | id | title | authors | year | abstract | journal_full_name | sector_label_primary | sector_label_secondary | vision_2030_sector | keywords | url | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Performance of cowpea progenitor and hybrids i... | Emmanuel Chikalipa | 2025 | Among abiotic factors limiting cowpea producti... | Journal of Agriculture and Biomedical Sciences... | Agriculture | Science and Technology | Agriculture/Science and Technology | Vigna unguiculata, Hydroponic, Phosphorus, Pro... | View of Performance of Cowpea Progenitor and H... | The University of Zambia journals |
| 1 | 2 | Contagious Bovine Pleuropneumonia-Associated F... | Isaac Dayo Olorunshola, Shukrah Omotayo Ghali,... | 2025 | Contagious bovine pleuropneumonia (CBPP) is st... | Journal of Agriculture and Biomedical Sciences... | Agriculture | Health | Agriculture/Health | Abattoir, Cattle, CBPP, Fetal wastes, Meat qua... | https://journals.unza.zm/index.php/JABS/articl... | The University of Zambia journals |
| 2 | 3 | Potential of a powdered Mopane worms-breakfas... | Yongo Salasini, Nixon Miyoba, Bubala Hamaimbo,... | 2025 | In Zambia, the intake of animal-sourced protei... | Journal of Agriculture and Biomedical Sciences... | Food and Nutrition | Agriculture | Food and Nutrition/Agriculture | Breakfast meal, Composite blends, Mopani Worms... | https://journals.unza.zm/index.php/JABS/articl... | The University of Zambia journals |
| 3 | 4 | Programmed Death Ligand-1 Expression in Gastri... | Husna Munshi, Chibamba Mumba, Mupeta Songwe, V... | 2025 | abstract: Gastric cancer is a highly fatal dis... | Journal of Agriculture and Biomedical Sciences... | Health | Science and Technology | Health/Science and Technology | Gastric Cancer, Adenocarcinoma, Immunohistoche... | View of Programmed Death Ligand-1 Expression i... | The University of Zambia journals |
| 4 | 5 | Exploring Dietary Patterns and Nutrition Statu... | Nixon Miyoba, Abidan Chansa, Namakando Liusha,... | 2025 | This study explored food consumption patterns ... | Journal of Agriculture and Biomedical Sciences... | Health | Food and Nutrition | Health/Food and Nutrition | Body mass index, Blood sugar levels, Diabetic ... | https://journals.unza.zm/index.php/JABS/articl... | The University of Zambia journals |


In [6]:
df.shape

(124, 12)

We successfully loaded the dataset `vision_2030_dataset - unza_journal_articles.csv` into a pandas DataFrame.  
- The first 5 rows (`df.head()`) confirm that column names and sample data appear as expected.  
- The dataset shape (`df.shape`) shows 124 rows and 12 columns, which matches our expectations.  

Thus, the dataset is correctly loaded and ready for further exploration.
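
A load check like the one described above can also be written down explicitly. The sketch below is a standard-library illustration that runs without the uploaded file: it uses three columns of the five previewed rows, and the helpers `missing_columns` and `label_counts` are hypothetical names. In the notebook itself, `df.columns` and `df["sector_label_primary"].value_counts()` would give the same information directly.

```python
import csv
import io
from collections import Counter


def missing_columns(fieldnames, required):
    """Return the required column names that are absent from the header."""
    return sorted(set(required) - set(fieldnames or []))


def label_counts(rows, column):
    """Count non-empty values of one column across all rows."""
    values = ((row.get(column) or "").strip() for row in rows)
    return Counter(v for v in values if v)


# Three columns of the five rows shown in the preview above.
sample_csv = "\n".join([
    "id,sector_label_primary,sector_label_secondary",
    "1,Agriculture,Science and Technology",
    "2,Agriculture,Health",
    "3,Food and Nutrition,Agriculture",
    "4,Health,Science and Technology",
    "5,Health,Food and Nutrition",
])

reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)

# "abstract" is deliberately left out of the sample to show a failed check.
missing = missing_columns(reader.fieldnames, ["id", "sector_label_primary", "abstract"])
counts = label_counts(rows, "sector_label_primary")
```

Counting the primary labels early is useful because the per-sector precision and recall targets are harder to meet for sectors with very few articles.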