### Problem 2 : Data Engineering Challenge

##### Description: 
You are provided with a zip file named data.zip, which contains a folder named BBC_articles. Inside this folder are text files named "articleID_category", where "articleID" is the article's unique identifier and "category" is its category. Your task is to structure the data by creating a CSV file with appropriate columns. You then need to write code that reads the CSV data, tokenizes the text, and converts it into numerical features through vectorization. The goal is to clean and prepare the dataset so that it can be used to train NLP classification models.

#### Task 1 (Data Structuring):

- Unzip the data.zip file to access the BBC_articles folder.
- Inside the BBC_articles folder, each text file is named as "articleID_category", where "articleID" is a unique identifier and "category" is the category of the article.
- Create a CSV file named bbc_articles.csv with the following columns:
    - article_id: Unique identifier for each article.
    - text: Text content of the article.
    - category: Category of the article.
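
The naming convention above can be turned into the `article_id` and `category` columns with a small helper. This is a minimal sketch, assuming every filename has the form `articleID_category.txt`; the function name `parse_article_filename` is hypothetical and is not part of the original notebook.

```python
import os

def parse_article_filename(filename):
    """Split 'articleID_category.txt' into (article_id, category).

    Assumes a single underscore separates the ID from the category
    and that the ID itself contains no underscore.
    """
    stem, _ext = os.path.splitext(filename)      # drop the '.txt' extension
    article_id, category = stem.split("_", 1)    # split only on the first underscore
    return article_id, category

print(parse_article_filename("1003_entertainment.txt"))
# → ('1003', 'entertainment')
```

Splitting on the first underscore only keeps a category that happens to contain an underscore in one piece.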

#### Task 2 (Data Preprocessing for Model Training):

- Read the bbc_articles.csv file into a DataFrame using Python.
- Tokenize the text data using a suitable tokenizer (e.g., NLTK, spaCy). Custom tokenization is even better.
- Perform any necessary preprocessing steps such as lowercasing, removing stopwords, punctuation, etc.
- Create a new CSV file containing the numerical features and the given labels.
- Note: Features are typically derived through text vectorization techniques such as:
    - Bag-of-Words (BoW): Each feature represents the count or frequency of each word in the document.
    - TF-IDF (Term Frequency-Inverse Document Frequency): Each feature represents the TF-IDF score of each word in the document.
    - Word Embeddings: Each feature represents the vector representation of each word in the document (e.g., Word2Vec, GloVe).
- You are free to choose the vectorization method used to produce the features.
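
The preprocessing and Bag-of-Words steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the method the notebook ends up using: the `STOPWORDS` set is a deliberately tiny, hypothetical list, and the helper names `tokenize` and `bag_of_words` are made up for this example. A real run would more likely rely on NLTK, spaCy, or scikit-learn.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "on"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split, and drop stopwords."""
    cleaned = text.lower().translate(_PUNCT_TO_SPACE)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["The market is up.", "The match and the market"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # → ['market', 'match', 'up']
print(matrix)  # → [[1, 0, 1], [1, 1, 0]]
```

Each row of `matrix` is the numerical feature vector for one document, which is the shape that would be written to vectorized_dataset.csv alongside the labels.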

#### Submission Requirements:

- Submit the structured CSV file bbc_articles.csv.
- Submit the final dataset as a CSV file containing numerical features.
> (Name it vectorized_dataset.csv.)

- Provide the Python code used to preprocess and vectorize the data for text classification.
- Include a brief report or documentation summarizing the preprocessing steps and the featurization methods you adopted.
- Additional Notes:
    - Ensure that your code is well-commented and follows best practices for readability and maintainability.
    - Provide any necessary instructions or requirements for running your code, such as package dependencies or environment setup.

# IMPORT

In [2]:
import os, csv, json
import pandas as pd
import numpy as np
from tqdm import tqdm

# USEFUL FUNCTIONS

In [3]:
def loadJSON(filepathh):
    """Return the parsed JSON content of a file, or {} if the file is missing."""
    _dataa = {}
    if os.path.exists(filepathh):
        with open(filepathh, "r", encoding="utf-8") as _f:
            _dataa = json.load(_f)
    else:
        print(f"{filepathh} does not exist...\n")
    return _dataa

def loadTXT(filepathh):
    """Return the text content of a file, or "" if the file is missing."""
    _dataa = ""
    if os.path.exists(filepathh):
        with open(filepathh, "r", encoding="utf-8") as _f:
            _dataa = _f.read()
    else:
        print(f"{filepathh} does not exist...\n")
    return _dataa

def loadFILE(filepathh = ""):
    """Load a .txt or .json file based on its extension; return None otherwise."""
    if not os.path.exists(filepathh):
        print(f"{filepathh} does not exist...\n")
        return None
    if filepathh.endswith(".txt"):
        return loadTXT(filepathh)
    if filepathh.endswith(".json"):
        return loadJSON(filepathh)
    print("\n- Invalid File format 😐 !!!\n")
    return None

In [4]:
def convert_dict_to_csv(my_ds = None, csv_file_path = "./bbc_articles.csv", verbose = False):
    """Write {filename: [article_id, category, text]} to a CSV file.

    Returns 1 if the file was written and 0 if the user declined to overwrite it.
    """
    my_ds = my_ds or {}  # avoid a mutable default argument
    csv_name = os.path.basename(csv_file_path)

    _choice = 1
    if os.path.exists(csv_file_path):
        print(f"CSV file with filename - '{csv_name}' exists !!!")
        # Anything other than an explicit "1" keeps the existing file
        _choice = int(input("Overwrite it ? (1/0)").strip() == "1")

    if _choice:
        req_fieldnames = ['article_id', 'text', 'category']

        with open(csv_file_path, 'w', newline='', encoding='utf-8') as csvfile:
            if verbose:
                print(f"CSV file with filename - '{csv_name}' created.")

            writer = csv.DictWriter(csvfile, fieldnames=req_fieldnames)
            writer.writeheader()

            # Each value is [article_id, category, text]; the CSV column order differs
            for _ , values in tqdm(my_ds.items()):
                _article_id, _category, _text = values
                writer.writerow({'article_id': _article_id, 'text': _text, 'category': _category})

        if verbose:
            print(f"CSV file with filename - '{csv_name}' written.")
            print("\n")
        return 1

    if verbose:
        print(f"CSV file with filename - '{csv_name}' was not written.")

    return 0
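
Article text contains commas, quotes, and possibly line breaks, so it is worth checking that rows written with `csv.DictWriter` can be read back unchanged. The round trip below is a minimal sketch on made-up sample data written to a temporary directory; the sample row and the file name `sample.csv` are assumptions for illustration only.

```python
import csv
import os
import tempfile

# Made-up sample row containing a comma, double quotes, and a newline.
rows = [
    {"article_id": "1", "text": 'He said "hello", then left.\nNew line.', "category": "tech"},
]
fieldnames = ["article_id", "text", "category"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")

    # newline='' lets the csv module control line endings inside quoted fields
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    with open(path, "r", newline="", encoding="utf-8") as f:
        read_back = [dict(row) for row in csv.DictReader(f)]

# True when every field survived the round trip unchanged
round_trip_ok = read_back == rows
print(round_trip_ok)
```

Opening the file with `newline=''` for both writing and reading is what the `csv` module documentation recommends, and it is the same setting used in `convert_dict_to_csv`.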


# --- TASK 1 ---

In [5]:
# DATA PATH
BBC_data_folder = "./selection-problems/data/data/BBC_articles/"

In [6]:
# LISTING FILES TO PROCESS
id_category_files_list = [i for i in os.listdir(BBC_data_folder) if i.endswith('.txt')]
len(id_category_files_list), id_category_files_list[:5]

(1490,
 ['1003_entertainment.txt',
  '1004_tech.txt',
  '1005_entertainment.txt',
  '1007_business.txt',
  '1008_politics.txt'])

In [7]:
# MAKING DATA STRUCTURE
#     {
#         filename1_id_category.txt : [article_id, category, text],
#         filename2_id_category.txt : [article_id, category, text],
#         .
#         .
#         .
#     }

# Map each filename to [article_id, category] by stripping ".txt" and splitting on "_"
data_struc_dict = {_fname : _fname[:-4].split('_') for _fname in id_category_files_list}

# Append each article's text to its [article_id, category] entry
for _id_category_file in tqdm(id_category_files_list, desc="id_category_files"):
    data_struc_dict[_id_category_file].append(loadFILE(os.path.join(BBC_data_folder, _id_category_file)))
len(data_struc_dict.items()), list(data_struc_dict.items())[:5]


id_category_files: 100%|██████████| 1490/1490 [00:00<00:00, 6817.72it/s]


(1490,
 [('1003_entertainment.txt',
   ['1003',
    'entertainment',
    'jamelia s return to the top r&b star jamelia had three brit nominations to go with her triple triumph at last year s mobo awards.  the birmingham-born singer  full name jamelia davis  was signed to a record label at the age of 15 and released her first single so high at 18. she released four number ones from her 2000 album drama  including the top five hit money featuring the vocals of reggae artist beenie man. she racked up five mobo nominations in 2000  winning one for best video. but in the same year she also fell pregnant and decided to take a break from music to bring up her daughter teja  who was born in march 2001. while she originally planned to get back to work pretty swiftly after giving birth it was actually two years before she released another single. during her absence r&b music exploded and a whole host of female artists were on the scene  meaning jamelia had to once again prove herself. her comeba

In [8]:
# PATH OF CSV FILE TO MAKE
csv_file_to_make_path = "./bbc_articles.csv"


In [9]:
# CALLING FUNCTION TO CONVERT OUR DATA STRUCTURE TO CSV FILE
convert_dict_to_csv(data_struc_dict, 
                    csv_file_to_make_path, 
                    verbose=1)

CSV file with filename - 'bbc_articles.csv' exists !!!
