# BBC News Classification Kaggle Mini-Project

# Brief description of the problem and data

This dataset contains about 1.7k news articles, each belonging to one of five categories: business, tech, sports, entertainment, or politics. Our goal is to use matrix factorization to predict the category of a given article.

# Exploratory Data Analysis (EDA) — Inspect, Visualize and Clean the Data

![image.png](attachment:image.png)\
The dataset is imbalanced: entertainment articles are fewer than those in the other categories.
Some tokens in the text, such as punctuation, might interfere with model training and should be removed beforehand. Some of the longer articles may also add unnecessary complexity, so I dropped articles with more than 750 tokens.
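
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `clean_text` and `filter_by_length` are hypothetical, and lowercasing and whitespace tokenization are assumptions (the text only mentions removing punctuation and the 750-token cutoff).

```python
import re
import string

MAX_TOKENS = 750  # length cutoff stated in the text

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Lowercase (an assumption), replace punctuation with spaces, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def filter_by_length(texts, labels, max_tokens=MAX_TOKENS):
    """Keep only (text, label) pairs with at most max_tokens whitespace-separated tokens."""
    kept = [(t, y) for t, y in zip(texts, labels) if len(t.split()) <= max_tokens]
    if not kept:
        return [], []
    kept_texts, kept_labels = zip(*kept)
    return list(kept_texts), list(kept_labels)


# Toy usage: the second "article" is artificially long and gets dropped.
raw = ["Shares rose 3% today, analysts said.", "word " * 800]
cats = ["business", "tech"]
cleaned = [clean_text(t) for t in raw]
texts, labels = filter_by_length(cleaned, cats)
print(texts, labels)
```

Filtering after cleaning means the token count reflects the text the model actually sees, rather than the raw article.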


# Model Architecture

Preprocessing is described in the previous section. The rest of the data processing pipeline is as follows.
1. Feature Extraction
    - TF-IDF Vectorization: The cleaned text is converted into a numerical matrix in which each entry represents the importance of a word in a document relative to the entire dataset.
2. Topic Modeling with NMF
    - NMF Overview: Non-negative Matrix Factorization decomposes the TF-IDF matrix into two non-negative components:
        - A topic distribution matrix (W), which holds each article's weight on each topic.
        - A word-to-topic association matrix (H), which holds each word's weight within each topic.
    - Clustering Articles: Each article is assigned to the topic on which it has the largest weight in W.
3. Label Matching
    - Because NMF is unsupervised, the model assigns arbitrary numeric labels (0-4) to topics. The `label_permute` function permutes these labels to align them with the ground-truth labels so that accuracy can be assessed.
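
To make the label-matching step concrete, here is a minimal sketch of how such a permutation search could work. The notebook's own `label_permute` is not shown in this write-up, so `best_label_permutation` below is a hypothetical stand-in with an assumed signature: it tries every assignment of category names to cluster ids and keeps the one with the highest accuracy.

```python
from itertools import permutations


def best_label_permutation(true_labels, cluster_ids):
    """Return (mapping, accuracy) for the best cluster-id -> category assignment.

    Hypothetical stand-in for the label_permute function mentioned in the text.
    """
    categories = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(categories):
        raise ValueError("more clusters than categories")

    best_mapping, best_acc = {}, -1.0
    # With 5 categories this loop visits at most 5! = 120 assignments.
    for perm in permutations(categories, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
        acc = correct / len(true_labels)
        if acc > best_acc:
            best_mapping, best_acc = mapping, acc
    return best_mapping, best_acc


# Toy usage with three categories and one misclustered article.
truth = ["business", "business", "tech", "tech", "sports", "sports"]
clusters = [2, 2, 0, 0, 1, 0]
mapping, acc = best_label_permutation(truth, clusters)
print(mapping, acc)
```

Exhaustive search is cheap here because there are only five categories. For many more clusters, an assignment solver would be the usual alternative.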
    
Why is this method suitable?

1. Dimensionality Reduction: NMF compresses the high-dimensional TF-IDF matrix into a small number of topics, which makes topic detection computationally efficient.
2. Effective for Text Data: TF-IDF captures word importance, and NMF identifies patterns of co-occurring words, which is crucial for topic discovery.



# Results and Analysis

We achieved 96% accuracy on the private dataset. Since the matrix factorization steps are pretty much fixed (vectorization followed by NMF), we experimented with different NMF parameters instead. Most parameter changes made the result slightly worse, but changing the loss from L2 to L1 caused accuracy to drop to 17%, which is no better than random guessing among five categories.

# Conclusion

I learned that unsupervised methods can be applied to labeled data, too. Moreover, the performance is actually pretty good! While playing with the NMF parameters did not improve the results, I learned that they can utterly ruin the predictions if not chosen properly.