<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Classification Project: Nthabiseng Moyeni & Alex Masina
© ExploreAI Academ

___
## Table of Contents

<a href=#BC> [Background Context](#Background-Context)</a>

1. <a href=#one>[Importing Packages](#Importing-Packages)</a>
2. <a href=#two>[Loading Data](#Loading-Data)</a>
3. <a href=#three>[Data Preprocessing](#Data-Preprocessing) </a>
4. <a href=#four>[Model Training](#Model-Training) </a>
5. <a href=#five>[Streamlit App Deployment](#Streamlit-App-Deployment) </a>
6. <a href=#six>[Conclussion](#Conclussion) </a>

# About Project

We have been tasked with creating a classification model using Python and deploying it as a web application with Streamlit by a news outlet. The aim is to apply machine learning techniques to natural language processing tasks. This project aims to classify news articles into categories such as Business, Technology, Sports, Education, and Entertainment.

* We will go through the full workflow: loading data, preprocessing, training models, evaluating them, and preparing the final model for deployment.

# About the Data

The dataset is comprised of news articles that need to be classified into categories based on their content, including Business, Technology, Sports, Education, and Entertainment. 

Dataset Features:


* Headlines:	The headline or title of the news article.
* Description:	A brief summary or description of the news article.
* Content:	The full text content of the news article.
* URL:	The URL link to the original source of the news article.
* Category:	The category or topic of the news article (e.g., business, education, entertainment, sports, technology).

---
<a href=#one></a>
## **Importing Packages**
<a href=#cont>[Back to Table of Contents](#Table-of-Contents)</a>


NB: See all the libraries listed below:
---

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import streamlit as st
import joblib, os

---
<a href=#two></a>
## **Loading Data**
<a href=#cont>[Back to Table of Contents](#Table-of-Contents)</a>



---

In [3]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [6]:
# Inspect the datasets
print(train_data.head())
print(test_data.head())

                                           headlines  \
0  RBI revises definition of politically-exposed ...   
1  NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...   
2  Akasa Air ‘well capitalised’, can grow much fa...   
3  India’s current account deficit declines sharp...   
4  States borrowing cost soars to 7.68%, highest ...   

                                         description  \
0  The central bank has also asked chairpersons a...   
1  NDTV's consolidated revenue from operations wa...   
2  The initial share sale will be open for public...   
3  The current account deficit (CAD) was 3.8 per ...   
4  The prices shot up reflecting the overall high...   

                                             content  \
0  The Reserve Bank of India (RBI) has changed th...   
1  Broadcaster New Delhi Television Ltd on Monday...   
2  Homegrown server maker Netweb Technologies Ind...   
3  India’s current account deficit declined sharp...   
4  States have been forced to pay through thei

---
<a href=#three></a>
## **Data Preprocessing**
<a href=#cont>[Back to Table of Contents](#Table-of-Contents)</a>


---

In [8]:
# Remove missing values
train_data.dropna(subset=['content', 'category'], inplace=True)
test_data.dropna(subset=['content'], inplace=True)

# Encode categories
y_train = train_data['category'].astype('category').cat.codes

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['content'])
X_test = vectorizer.transform(test_data['content'])

---
<a href=#four></a>
## **Model Training**
<a href=#cont>[Back to Table of Contents](#Table-of-Contents)</a>

---

In [10]:
# Define models
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC()
}

# Train and evaluate models
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_train)
    accuracy = accuracy_score(y_train, predictions)
    results[model_name] = {
        "Accuracy": accuracy,
        "Classification Report": classification_report(y_train, predictions, target_names=train_data['category'].unique())
    }
    print(f"Model: {model_name}\nAccuracy: {accuracy}\n")
    print(results[model_name]['Classification Report'])



Model: Logistic Regression
Accuracy: 0.9902173913043478

               precision    recall  f1-score   support

     business       0.99      0.98      0.99      1120
       sports       1.00      0.99      1.00      1520
entertainment       1.00      1.00      1.00       960
    education       0.99      0.99      0.99       640
   technology       0.98      0.99      0.98      1280

     accuracy                           0.99      5520
    macro avg       0.99      0.99      0.99      5520
 weighted avg       0.99      0.99      0.99      5520

Model: Random Forest
Accuracy: 1.0

               precision    recall  f1-score   support

     business       1.00      1.00      1.00      1120
       sports       1.00      1.00      1.00      1520
entertainment       1.00      1.00      1.00       960
    education       1.00      1.00      1.00       640
   technology       1.00      1.00      1.00      1280

     accuracy                           1.00      5520
    macro avg       1.

---
<a href=#five></a>
## **Streamlit App Deployment**
<a href=#cont>[Back to Table of Contents](#Table-of-Contents)</a>

---

In [11]:
# Create a Streamlit app for deploying the best-performing model

def main():
    """News Classifier App with Streamlit """

    # Creates a main title and subheader on your page -
    # these are static across all pages
    st.title("News Classifier")
    st.subheader("Classifying news articles into categories")

    # Creating sidebar with selection box -
    # you can create multiple pages this way
    options = ["Prediction", "Information"]
    selection = st.sidebar.selectbox("Choose Option", options)

    # Building out the "Information" page
    if selection == "Information":
        st.info("General Information")
        st.markdown("This app classifies news articles into predefined categories like Business, Technology, Sports, Education, and Entertainment.")

    # Building out the prediction page
    if selection == "Prediction":
        st.info("Prediction with ML Models")
        # Creating a text box for user input
        news_text = st.text_area("Enter News Content", "Type here...")

        if st.button("Classify"):
            # Transforming user input with vectorizer
            vect_text = vectorizer.transform([news_text]).toarray()
            # Load the best-performing model
            best_model_name = max(results, key=lambda x: results[x]['Accuracy'])
            predictor = models[best_model_name]
            prediction = predictor.predict(vect_text)[0]
            predicted_category = train_data['category'].unique()[prediction]
            st.success(f"Text Categorized as: {predicted_category}")

if __name__ == "__main__":
    main()


2025-01-04 17:59:30.752 
  command:

    streamlit run c:\Users\nthab\anaconda4\envs\myenv\lib\site-packages\ipykernel_launcher.py [ARGUMENTS]
2025-01-04 17:59:30.923 Session state does not function when running a script without `streamlit run`


---
<a href=#six></a>
## **Conclussion**
<a href=#cont>[Back to Table of Contents](#Table-of-Contents)</a>

---

This notebook provides a complete pipeline for the news classification task. 
Further steps could include hyperparameter tuning, exploring additional models, and enhancing the Streamlit app.

#  

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>