# **BBC News Classification**

### **1\. Introduction**

#### **1.1 Objective statement**
Classify BBC news articles into one of five categories (business, entertainment, politics, sport, tech).  
#### **1.2 Plan**:
  * Apply unsupervised learning to the articles.
  * Then compare the results with supervised learning.


#### **1.3 Data description**

**File descriptions**

|File name|Description|
|--|--|
|Train.csv|The BBC News Train dataset, which has 1490 records|
|Test.csv|The BBC News Test dataset, which has 736 records|
|Solution.csv|The BBC News Sample Solution, a sample submission file in the correct format|

**Data fields**

|Field name|Description|
|--|--|
|ArticleId|Unique ID number assigned to the record|
|Text|Text of the header and article|
|Category|Category of the article (tech, business, sport, entertainment, politics)|




### **2\. Exploratory Data Analysis (EDA)**

* **2.1. Load the Data:**  
  * Load the training dataset using pandas.  
  * Display the first few rows (.head()) of the dataframe.  
  * Use .info() and .describe() to get a summary of the data.  


```python
import pandas as pd

# Raw GitHub URLs for the three competition files
raw_url_train = 'https://raw.githubusercontent.com/RockDeng110/BBC-News-Classification/main/datasets/BBC%20News%20Train.csv'
raw_url_test = 'https://raw.githubusercontent.com/RockDeng110/BBC-News-Classification/main/datasets/BBC%20News%20Test.csv'
raw_url_sample = 'https://raw.githubusercontent.com/RockDeng110/BBC-News-Classification/main/datasets/BBC%20News%20Sample%20Solution.csv'

df_train = pd.read_csv(raw_url_train)
df_test = pd.read_csv(raw_url_test)
df_sample = pd.read_csv(raw_url_sample)


def get_df_summary(df, df_name):
    """Print the first rows, the column info, and summary statistics of a DataFrame."""
    print(f'===== Summary of  {df_name}:')
    print('[DataFrame head]:')
    print(df.head())
    print('[DataFrame info]:')
    df.info()  # info() prints directly, so it is not wrapped in print()
    print('[DataFrame describe]:')
    print(df.describe())


get_df_summary(df_train, "df_train")
get_df_summary(df_test, "df_test")
get_df_summary(df_sample, "df_sample")
```

```text
===== Summary of  df_train:
[DataFrame head]:
   ArticleId                                               Text  Category
0       1833  worldcom ex-boss launches defence lawyers defe...  business
1        154  german business confidence slides german busin...  business
2       1101  bbc poll indicates economic gloom citizens in ...  business
3       1976  lifestyle  governs mobile choice  faster  bett...      tech
4        917  enron bosses in $168m payout eighteen former e...  business
[DataFrame info]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  1490 non-null   int64 
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.1+ KB
[DataFrame describe]:
         ArticleId
count  1490.000000
mean   1119.696644
std     641.826283
min       2.000000
25%     565.250000
50% 
```

* **2.2. Data Cleaning:**  
  * Check for missing values (.isnull().sum()) and decide on a strategy to handle them if any exist.  
  * Check for duplicate articles and remove them.  
* **2.3. Data Visualization:**  
  * **Category Distribution:** Create a bar chart to visualize the number of articles in each category. Check for class imbalance.  
  * **Text Length Analysis:**  
    * Calculate the length of each article (word count and character count).  
    * Plot histograms or boxplots of article lengths for each category to see if there are any noticeable differences.  
  * **Word Frequency Analysis:**  
    * Identify the most common words in the entire corpus.  
    * Create word clouds for each category to visualize the most frequent and important words.  
    * Use bar charts to show the frequency of top N words per category after removing stopwords.
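
The cleaning and EDA steps in 2.2 and 2.3 could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual analysis: the toy DataFrame stands in for `df_train` (same column names, invented rows), and the `STOPWORDS` set and `top_words` helper are hypothetical names introduced only for this example. Plotting is left out so the block depends on pandas alone; it computes the numbers that the charts would display.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for df_train: same columns as the real data, rows are made up.
df = pd.DataFrame({
    "ArticleId": [1, 2, 3, 4, 5],
    "Text": [
        "the bank reported strong profits this quarter",
        "the striker scored twice in the final",
        "the bank reported strong profits this quarter",  # duplicate text
        "new phone chips make mobile games faster",
        "the team won the match after extra time",
    ],
    "Category": ["business", "sport", "business", "tech", "sport"],
})

# 2.2 Data cleaning: count missing values, then drop duplicate article texts.
missing = df.isnull().sum()
n_dupes = int(df.duplicated(subset="Text").sum())
df = df.drop_duplicates(subset="Text").reset_index(drop=True)

# 2.3 Category distribution (the numbers behind the bar chart).
category_counts = df["Category"].value_counts()

# 2.3 Text length: word and character counts, summarised per category.
df["word_count"] = df["Text"].str.split().str.len()
df["char_count"] = df["Text"].str.len()
length_by_category = df.groupby("Category")["word_count"].describe()

# 2.3 Word frequency per category, using a tiny illustrative stopword list.
STOPWORDS = {"the", "in", "this", "after", "a", "of", "and", "to"}


def top_words(texts, n=3):
    """Return the n most common non-stopword tokens in an iterable of texts."""
    counts = Counter(
        word for text in texts for word in text.split() if word not in STOPWORDS
    )
    return counts.most_common(n)


top_by_category = {
    cat: top_words(group["Text"]) for cat, group in df.groupby("Category")
}

print(missing, n_dupes, category_counts, length_by_category, top_by_category, sep="\n\n")
```

With matplotlib installed, `category_counts.plot.bar()` would draw the category bar chart, and `df.boxplot(column="word_count", by="Category")` would give per-category boxplots of article length.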



### **3\. Data Preprocessing**

* **3.1. Text Cleaning:**  
  * Convert all text to lowercase.  
  * Remove punctuation and special characters.  
  * Remove numbers (if they are not considered useful features).  
  * Remove common English stopwords.  
* **3.2. Text Normalization:**  
  * **Lemmatization or Stemming:** Apply one of these techniques to reduce inflected words to a common base form, and explain the choice. Lemmatization returns dictionary words and is usually the more accurate option; stemming is faster but can produce non-word stems.  
* **3.3. Feature Engineering (Text Representation):**  
  * **TF-IDF (Term Frequency-Inverse Document Frequency):**  
    * Explain the concept of TF-IDF.  
    * Use TfidfVectorizer from scikit-learn to convert the preprocessed text into numerical vectors.  
    * Discuss key parameters like max\_features, ngram\_range, and min\_df/max\_df.  
* **3.4. Data Splitting:**  
  * Split the data into training and validation sets using train\_test\_split. Ensure a stratified split if there is a class imbalance.
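
As a rough sketch of steps 3.1, 3.3, and 3.4, the block below cleans a few invented articles, makes a stratified split, and builds TF-IDF vectors. Several things here are assumptions rather than the notebook's method: the `clean_text` helper is hypothetical, scikit-learn's built-in `ENGLISH_STOP_WORDS` stands in for whichever stopword list is eventually chosen, the parameter values are placeholders, and lemmatization is omitted because it needs an extra library such as NLTK or spaCy. The split is done before fitting the vectorizer so that validation text does not influence the vocabulary or IDF weights.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import train_test_split


def clean_text(text):
    """Lowercase, keep letters only, and drop stopwords and stray single letters."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation, digits, symbols
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS and len(t) > 1]
    return " ".join(tokens)


# Toy data with the same column names as the training file (rows are made up).
df = pd.DataFrame({
    "Text": [
        "The bank reported higher profits in 2004.",
        "Oil prices pushed the stock market lower!",
        "The company announced a merger with a rival firm.",
        "Shares fell after the bank cut its forecast.",
        "The striker scored twice in the cup final.",
        "The team won the match after a late goal.",
        "The coach praised the striker after the win.",
        "A late goal decided the league match.",
    ],
    "Category": ["business"] * 4 + ["sport"] * 4,
})
df["clean"] = df["Text"].apply(clean_text)

# 3.4 Stratified split keeps the class proportions equal in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    df["clean"], df["Category"], test_size=0.25, stratify=df["Category"], random_state=42
)

# 3.3 TF-IDF: fit on the training texts only, then reuse the fitted vocabulary.
vectorizer = TfidfVectorizer(
    max_features=5000,   # cap on vocabulary size
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=1,            # keep terms seen in at least one document
    max_df=0.9,          # drop terms present in more than 90% of documents
)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

print(X_train_vec.shape, X_val_vec.shape)
```

On the real data, `min_df` would typically be raised above 1 to discard very rare terms; the value here is low only because the toy corpus is tiny.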

### **4\. Model Building**

* **4.1. Baseline Model:**  
  * Start with a simple, interpretable model like **Naive Bayes** (specifically MultinomialNB) or **Logistic Regression**.  
  * Train the model on the TF-IDF vectors.  
* **4.2. Advanced Models:**  
  * Train a few more powerful models. Good candidates include:  
    * **Support Vector Machines (SVM)**  
    * **Random Forest**  
    * **Gradient Boosting Machines (e.g., XGBoost, LightGBM)**  
* **4.3. (Optional) Deep Learning Models:**  
  * For a more advanced approach, consider a simple neural network:  
    * **Word Embeddings (e.g., GloVe, Word2Vec) or an Embedding Layer.**  
    * **Recurrent Neural Network (RNN) like LSTM or a Convolutional Neural Network (CNN) for text classification.**  
    * This section would require libraries like TensorFlow/Keras or PyTorch.
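
One possible way to set up the baseline and advanced models from 4.1 and 4.2 is to wrap each classifier in a pipeline with its own TF-IDF step, as sketched below. The training texts are invented, the hyperparameters are scikit-learn defaults or arbitrary placeholders, and gradient boosting libraries and the optional deep learning models are left out to keep the example dependent on scikit-learn only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (made up); the real input would be the cleaned article text.
texts = [
    "the bank reported higher profits and rising shares",
    "oil prices pushed the stock market lower",
    "the company announced a merger with a rival firm",
    "shares fell after the bank cut its profit forecast",
    "the striker scored twice in the cup final match",
    "the team won the match after a late goal",
    "the coach praised the striker after the win",
    "a late goal decided the league match",
]
labels = ["business"] * 4 + ["sport"] * 4

# 4.1 baselines first, then 4.2 candidates.
models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Each pipeline vectorizes the text and then fits its classifier.
fitted = {}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    fitted[name] = pipe

new_article = ["the striker scored a late goal in the match"]
for name, pipe in fitted.items():
    print(name, pipe.predict(new_article)[0])
```

Keeping the vectorizer inside the pipeline means the same fitted object can later be passed to cross-validation or used on the test set without repeating the preprocessing by hand.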

### **5\. Model Evaluation**

* **5.1. Performance Metrics:**  
  * Define the evaluation metrics to be used. For a classification task, these include:  
    * **Accuracy:** Overall correct predictions.  
    * **Precision, Recall, F1-Score:** Per-class performance.  
    * **Confusion Matrix:** To visualize where the models are making mistakes.  
    * **Classification Report:** A summary of precision, recall, and F1-score for each class.  
* **5.2. Model Comparison:**  
  * Make predictions on the validation set for each trained model.  
  * Generate a classification report and a confusion matrix for each model.  
  * Create a summary table or bar chart to compare the key metrics (e.g., accuracy, F1-score) across all models.  
  * Select the best-performing model based on the evaluation results.
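
The comparison in 5.2 might look like the sketch below. To keep it deterministic, the validation labels and the two sets of predictions are written out by hand; `model_a` and `model_b` are hypothetical names, and on the real data each list would come from calling `predict` on a fitted model. Ranking by macro F1 is one reasonable choice, not the only one.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

labels = ["business", "sport", "tech"]

# Hand-written validation labels and predictions (illustrative only).
y_val = ["business", "business", "sport", "sport", "tech", "tech"]
predictions = {
    "model_a": ["business", "tech", "sport", "sport", "tech", "tech"],
    "model_b": ["business", "business", "sport", "tech", "business", "tech"],
}

rows = []
for name, y_pred in predictions.items():
    print(f"== {name}")
    # Per-class precision, recall, and F1.
    print(classification_report(y_val, y_pred, labels=labels, zero_division=0))
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(y_val, y_pred, labels=labels))
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_val, y_pred),
        "macro_f1": f1_score(y_val, y_pred, average="macro"),
    })

# Summary table, best model first.
summary = (
    pd.DataFrame(rows)
    .sort_values("macro_f1", ascending=False)
    .reset_index(drop=True)
)
best_model = summary.loc[0, "model"]
print(summary)
print("best:", best_model)
```

Macro F1 weights every category equally, which makes it a useful companion to accuracy if the category distribution turns out to be uneven.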

### **6\. Hyperparameter Tuning (for the Best Model)**

* **6.1. Tuning Strategy:**  
  * Choose a hyperparameter tuning technique like **GridSearchCV** or **RandomizedSearchCV** for the best-performing model from the previous step.  
  * Define the parameter grid to search over.  
* **6.2. Final Model Training:**  
  * Train the selected model with the best hyperparameters found during tuning on the **entire training dataset**.
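
A minimal sketch of the tuning step, assuming for illustration that logistic regression was the model selected in section 5. The step names `tfidf` and `clf`, the grid values, and the toy texts are all placeholders. With its default `refit=True`, `GridSearchCV` retrains the best configuration on all the data passed to `fit`, which covers the final training step in 6.2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy training data (made up).
texts = [
    "the bank reported higher profits and rising shares",
    "oil prices pushed the stock market lower",
    "the company announced a merger with a rival firm",
    "shares fell after the bank cut its profit forecast",
    "the striker scored twice in the cup final match",
    "the team won the match after a late goal",
    "the coach praised the striker after the win",
    "a late goal decided the league match",
]
labels = ["business"] * 4 + ["sport"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter names follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

# cv=2 only because the toy data is tiny; 5 folds is a more common choice.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))

# best_estimator_ is already refit on the full training data.
final_model = search.best_estimator_
```

`RandomizedSearchCV` takes the same pipeline and a `param_distributions` argument instead, which is usually cheaper when the grid grows large.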


### **8\. Comparison with supervised learning**



## Reference
* https://www.kaggle.com/competitions/learn-ai-bbc/overview

---

### **7\. Conclusion & Submission**

* **7.1. Summary of Results:**  
  * Summarize the project findings. State which model performed best and its final score on the validation set.  
  * Discuss any interesting insights from the EDA or model performance.  
* **7.2. (If applicable) Submission:**  
  * Describe the process for generating the submission file if the competition requires predictions on a separate test set.  
  * Load the test data, apply the same preprocessing steps, and use the final trained model to make predictions.  
  * Format the predictions into the required submission file format.  
* **7.3. Future Work:**  
  * Suggest potential improvements, such as:  
    * Trying more advanced deep learning architectures (e.g., Transformers like BERT).  
    * Experimenting with different feature engineering techniques.  
    * Using different word embeddings.  
    * Ensemble methods.
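
The submission step in 7.2 could be sketched as below. It assumes the required format is two columns, `ArticleId` and `Category`, which should be checked against the sample solution file. The toy frames stand in for `df_train` and `df_test`, and the Naive Bayes pipeline is a placeholder for whichever final model is chosen.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for df_train and df_test (rows are made up).
df_train = pd.DataFrame({
    "ArticleId": [1, 2, 3, 4],
    "Text": [
        "the bank reported higher profits and rising shares",
        "shares fell after the bank cut its profit forecast",
        "the striker scored twice in the cup final match",
        "the team won the match after a late goal",
    ],
    "Category": ["business", "business", "sport", "sport"],
})
df_test = pd.DataFrame({
    "ArticleId": [101, 102],
    "Text": [
        "the bank announced record profits",
        "a late goal won the match",
    ],
})

# Placeholder final model; the pipeline applies the same vectorization
# to the test text that it learned from the training text.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(df_train["Text"], df_train["Category"])

# Assumed submission format: one row per test article.
submission = pd.DataFrame({
    "ArticleId": df_test["ArticleId"],
    "Category": model.predict(df_test["Text"]),
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv`, for example a hypothetical `"submission.csv"`, writes the same content to disk for upload.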