# BBC News Classification
In this assignment, we will use data from https://www.kaggle.com/c/learn-ai-bbc/overview, a Kaggle competition about categorizing news articles. You will use matrix factorization to predict the category and submit your results to Kaggle for test evaluation.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.decomposition import NMF
import os

In [None]:
# Loading train data
articles = pd.read_csv("data/bbc/train.csv")

In [None]:
# Displaying what the data looks like
# It has article id, article text, and category. Here, category is the label.
articles

In [None]:
# Let's print out a sample article text. You'll also see special characters.
articles['Text'].iloc[0]

In [None]:
# There are 5 unique categories (labels)
articles['Category'].unique()

In [None]:
# Let's see how many articles are in each category. It shows that the categories are reasonably balanced.
plt.hist(articles['Category'])

## Q1. Text preprocessing
Now that we have done a simple EDA, let's extract some features from the text.
Read the sklearn documentation for [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) and [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer). Search online for text preprocessing methods based on word count and TF-IDF.
### [10 pts] Summarize/explain what those methods are. Why is TF-IDF better than simple word count?
(Optional: search for and explain other text preprocessing methods such as GloVe or Word2Vec. You can include Python snippets showing how to use them.)

YOUR ANSWER HERE

In the example below, we show how to use a feature extraction vectorizer. We use CountVectorizer, but TfidfVectorizer is used in much the same way. The vectorizer has many options, but `max_features` is the one most often used. The collection of all words across all articles is called the vocabulary. Since the vocabulary can be very large, we often want to limit the number of vocabulary words in our feature vector. `max_features` is the parameter that sets this limit; it keeps only the most frequent terms across the corpus.

In [None]:
#There are 1490 articles in the train data
data_samples = articles['Text']
print(len(data_samples))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [None]:
# Let's take an example of CountVectorizer
n_features = 200 # For example, we include only 200 words in our vocabulary (a larger number is often needed)
count_vectorizer = CountVectorizer(max_features=n_features, stop_words="english")

In [None]:
# .fit_transform() transforms the text data to feature vectors. 
# Here, the feature matrix from CountVectorizer is a word count vector
wordcount = count_vectorizer.fit_transform(data_samples)

In [None]:
# This feature matrix has a shape of (# of articles, # max features)
# The matrix is a sparse matrix format for computation efficiency.
wordcount

In [None]:
# Although you can convert a sparse matrix to a dense matrix to see its contents,
# we usually don't want to build a dense copy of a very big matrix (such as one with thousands or millions of rows and columns).
# Here, just to show what's inside:
wordcount.todense()
# It shows the word count feature matrix.

In [None]:
# For example, the feature vector for the third-to-last article is:
wordcount.todense()[-3]
# It has a count of 0 for the first vocabulary word, 1 for the second, 1 for the third, etc.

In [None]:
# .vocabulary_ maps each word in the selected vocabulary to its column index in the feature matrix (not to its count)
count_vectorizer.vocabulary_

## Q2. Topic Modeling using NMF
Below is the starter code. We will build a TopicModeling class that predicts article labels using the NMF algorithm.
You'll need to complete the `factorize` and `predict` methods.

### Q2.a Complete factorize function [15 pts]
5 pts for each STEP. You can follow the example above.    
**Note:** `self.X` is the train and test data combined. In STEP2 of the `factorize` function, we fit on all the data so that the feature matrix covers vocabulary from both train and test data. You could fit on texts from the train data only, but then some vocabulary words in the test data might not exist in the train data. Since we're not showing the labels to the model, running the vectorizer's feature extraction on the combined data is OK.
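
The vocabulary gap described in the note can be seen on a toy example. The lists `toy_train` and `toy_test` below are made-up stand-ins (assumptions), and the vectorizers use default settings rather than the `max_df`/`min_df` values from the starter code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the train and test texts (assumptions for illustration)
toy_train = ["stocks rallied today", "the team won today"]
toy_test = ["the striker scored today"]

# Fit once on train only, once on train and test combined
train_only = TfidfVectorizer().fit(toy_train)
combined = TfidfVectorizer().fit(toy_train + toy_test)

# Words that appear only in the test text are missing from the train-only vocabulary
missing = set(combined.vocabulary_) - set(train_only.vocabulary_)
print(sorted(missing))  # ['scored', 'striker']

# The train-only vectorizer silently ignores those words when transforming test text
print(train_only.transform(toy_test).nnz)  # 2 non-zero features ("the", "today")
print(combined.transform(toy_test).nnz)    # 4 non-zero features
```

No labels are involved at any point, which is why fitting the vectorizer on the combined texts does not reveal the test answers.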

### Q2.b Complete STEP4 in predict function [10 pts]
`self.features` is the feature matrix from the tf-idf vectorizer.
You can retrieve the train portion of the fitted feature matrix with `self.features[:self.n_tr]`.
You can calculate `yp_tr` using this train part of the feature matrix. Use the matrix factorization formula (theory) for prediction. It involves some dot products and matrix transposes. These numpy operations expect dense numpy arrays, which is why we converted `tfidf` to `self.features` using `numpy.array` at the end of the `factorize` function.
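
As a rough illustration of the matrix algebra involved, the sketch below factorizes a small toy matrix with sklearn's `NMF`. The symbols `W` and `H`, the toy matrix `X_toy`, and the choice of taking the arg-max over topic scores are assumptions for illustration; this is one possible reading of the "matrix factorization formula", not necessarily the intended solution.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "document x term" matrix with two obvious groups of rows
# (an assumption for illustration; the assignment uses the tf-idf features)
X_toy = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 1.0],
    [0.0, 0.0, 1.0, 5.0],
])

# NMF approximates X_toy by W @ H with non-negative W and H
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X_toy)  # shape (n_docs, n_topics): topic weights per document
H = model.components_           # shape (n_topics, n_terms): term weights per topic

# Option A: take the strongest topic in each row of W
topic_from_W = W.argmax(axis=1)

# Option B: score each document against each topic with a dot product and a transpose
topic_from_dot = (X_toy @ H.T).argmax(axis=1)

print(topic_from_W, topic_from_dot)
```

The integer ids are arbitrary: the first two rows share one id and the last two share the other, but which group is called 0 depends on the factorization. That is why a separate step has to map the integers to category names.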

### Q2.c Complete STEP 5 in predict function [15 pts]
Map the numeric label values 0,1,2,3,4 from the prediction to category strings. Save the test prediction labels to `self.test['Category']`. For example, `self.test['Category']` may look like:    

|     |              |
|:----|:------------|     
|0    |         sport|    
|1    |           tech|     
|2    |          sport|    
|3    |       business|    
|4    |          sport|
|...|          |
|730  |       business|
|731  |  entertainment|
|732  |           tech|
|733  |       business|
|734  |       politics|
Name: Category, Length: 735, dtype: object

How can you map the integers to the string labels? You can use the train data: for each sample, compare the train label string `yt[i]` with the predicted integer `yp_tr[i]`. Since the model won't predict the train data perfectly, some individual matches will be wrong. But if you keep track of the predicted integers for each string label, you can take the majority integer as the index corresponding to that string label.
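
The majority-vote idea can be sketched on toy lists. The names `yt_toy`, `yp_toy`, `label_to_topic`, and `topic_to_label` are hypothetical stand-ins, not part of the starter code. The sketch also assumes that no two string labels end up with the same majority integer, which may not hold for a poorly fitted model.

```python
from collections import Counter, defaultdict

# Toy stand-ins (assumptions): true string labels and predicted integer ids
yt_toy = ["sport", "sport", "tech", "tech", "sport", "business", "business"]
yp_toy = [2, 2, 0, 0, 0, 1, 1]

# For each string label, count how often each integer was predicted
votes = defaultdict(Counter)
for true_label, topic_id in zip(yt_toy, yp_toy):
    votes[true_label][topic_id] += 1

# The majority integer for each string label
label_to_topic = {label: c.most_common(1)[0][0] for label, c in votes.items()}

# Invert the mapping so predicted integers can be translated to strings
topic_to_label = {topic: label for label, topic in label_to_topic.items()}

mapped = [topic_to_label[t] for t in yp_toy]
accuracy = sum(m == y for m, y in zip(mapped, yt_toy)) / len(yt_toy)
print(label_to_topic)
print(mapped, accuracy)
```

Here one "sport" article was predicted as integer 0, but the majority vote still maps "sport" to 2, so only that one sample is mislabeled after the mapping.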

In [None]:
class TopicModeling():
    def __init__(self):
        self.getdata()
        
    def getdata(self):
        self.train = pd.read_csv("data/bbc/train.csv")
        self.test = pd.read_csv("data/bbc/test.csv")
        self.X = list(self.train.Text)+list(self.test.Text) # To build a common vocabulary, we combine all texts from train and test data. Since the model never sees test labels, this does not reveal the answers.
        self.n_tr = len(self.train)
        self.n_te = len(self.test)
        
    def factorize(self,n_features):
        self.n_features = n_features
        # STEP1. Construct a tf-idf vectorizer that keeps at most n_features features. Use parameters max_df=0.95, min_df=2
        # tfidf_vectorizer = 
        # YOUR CODE HERE
        
        
        # STEP2. Fit the tfidf_vectorizer on all the data in self.X. Assign the transformed result matrix to tfidf. [5 pts]
        # tfidf = 
        # YOUR CODE HERE
        
        
        # STEP3. Fit the model using sklearn NMF and assign to self.model
        # self.model = 
        # YOUR CODE HERE
        
        self.tfidf = tfidf
        self.features = np.array(tfidf.toarray()) # Saves the feature matrix as a numpy array. You'll need it in predict.
        
    def predict(self):
        # STEP4. Predict labels for train and test data using matrix algebra.
        # Refer to the usage in https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
        # The predicted labels are numeric; 0-4
        # yp_tr = # prediction from train data
        # yp_te = # prediction from test data
        # YOUR CODE HERE
        
        
        # STEP5. Map the numeric values 0-4 from the prediction to string values of label category.
        # You can compare the true labels from train data with yp_tr prediction labels from train data.
        # Then you know which number label in yp_tr corresponds to which string value.
        # update self.test['Category'] to the string labels accordingly.
        yt = list(self.train['Category'])
        # self.test['Category'] =
        
        # YOUR CODE HERE
        
        
    def save(self): # This function helps to create submission file
        if not os.path.isdir('submission'):
            os.mkdir('submission')
        self.test[["ArticleId","Category"]].to_csv("submission/submission_f"+str(self.n_features)+".csv",index=False)

## Q3. Tune hyper parameters [10 pts]
Vary n_features (for example, between 1000 and 10000).
Run the prediction for each value.
Print the train accuracy for each n_features value. (You can add a print statement for train accuracy in the predict function above.)
Save the test prediction using the save function above.

In [None]:
# YOUR CODE HERE


## Q4. Best results [10 pts]
Submit a few test predictions (choose them based on train accuracy, although the best train accuracy doesn't guarantee the best test accuracy) and determine which n_features led to the best test accuracy. Record the result and discuss your observations.

In [None]:
# YOUR CODE HERE
