## Extracting data for pre-training

**Author:** Benjamin Aw  
**Date:** 9 Dec 2021  
**Context:** Pre-training to contextualise the DistilBERT model requires the data in a particular format  
**Objective:** Extracting the descriptions from the CSV files and converting them into text files, while also removing all HTML tags  

#### A) Setting up

Importing the libraries and obtaining the file path for the required datasets

In [1]:
import pandas as pd
import os
import re

# changing directory first in order to import the package
os.chdir('..')
from ssoc_autocoder.processing import final_cleaning

#### B) Writing out the necessary functions

Because a cleaning function, `final_cleaning`, has already been written, we can wrap it instead of writing our own.

In [2]:
def cleaning_text_and_check(text):
    
    cleaned_text = final_cleaning(text)
    
    # add in additional check for proper sentences
    
    return cleaned_text
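
For readers without access to the `ssoc_autocoder` package, the kind of HTML stripping this step relies on can be sketched with the standard library alone. This is not the implementation of `final_cleaning`, whose internals are not shown here; `strip_html_tags` and both patterns are hypothetical names used only for illustration, and a regex-based approach is assumed to be good enough for simple markup.

```python
import html
import re

# Hypothetical stand-in for illustration only; NOT the real final_cleaning
TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def strip_html_tags(text):
    # Missing descriptions (e.g. NaN read by pandas) become an empty string
    if not isinstance(text, str):
        return ""
    # Replace each tag with a space so adjacent words do not run together
    no_tags = TAG_PATTERN.sub(" ", text)
    # Convert entities such as &amp; back into plain characters
    unescaped = html.unescape(no_tags)
    # Collapse runs of whitespace and trim both ends
    return WHITESPACE_PATTERN.sub(" ", unescaped).strip()


example = strip_html_tags("<p>Hello <b>world</b></p>")
```

A regex like this can mis-handle malformed markup or `>` characters inside attributes, so a proper HTML parser may be the safer choice for messy input.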

We need to loop through each individual file by year-month, and append each cleaned entry to the appropriate text file

In [None]:
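def output_individual_files(path):

    # Directory that receives one text file per year-month CSV
    output_dir = "../Data/Train/pre-training data/"

    # `path` is the directory that holds the CSV files
    for filename in os.listdir(path):

        df = pd.read_csv(os.path.join(path, filename))
        df_desc = df['description'].apply(cleaning_text_and_check)

        # splitext removes only the extension, whereas str.strip(".csv")
        # strips any leading or trailing '.', 'c', 's' or 'v' characters
        filename_without_ext = os.path.splitext(filename)[0]

        # Writing it out into a readable text file, one description per line
        with open(output_dir + filename_without_ext + ".txt", "a") as fi:
            for jd in df_desc:
                fi.write(f'{jd}\n')

        print(f"{filename_without_ext}.txt saved!")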
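
When deriving the output name from a CSV filename, it helps to remember that `str.strip` treats its argument as a set of characters to remove from both ends, not as a suffix. The short comparison below illustrates the difference; `scores.csv` is a made-up filename chosen to trigger the problem, while `raw_2021-05.csv` follows the naming seen in this notebook and happens to be unaffected.

```python
import os

# str.strip(".csv") removes any of '.', 'c', 's', 'v' from BOTH ends
stripped_bad = "scores.csv".strip(".csv")

# os.path.splitext splits off only the final extension
stem_good = os.path.splitext("scores.csv")[0]

# Filenames that neither start nor end (before the extension) with
# one of those characters come out the same either way
stripped_ok = "raw_2021-05.csv".strip(".csv")
stem_ok = os.path.splitext("raw_2021-05.csv")[0]

print(stripped_bad, stem_good, stripped_ok, stem_ok)
```

On Python 3.9 or later, `str.removesuffix(".csv")` is another option that removes an exact suffix.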

Writing a separate function that combines the cleaned descriptions from all the CSV files into a single pandas Series

In [17]:
def output_combined_file(path):

    df_list = []
    for filename in os.listdir(path):

        print(f'Processing {filename}...\r', end = '')
        df = pd.read_csv(path + "/" + filename)
        df_desc = df['description'].apply(cleaning_text_and_check)
        df_list.append(df_desc)
    
    print('')
    combined_df = pd.concat(df_list, ignore_index = True)
    print(f'Shape of combined dataframe: {combined_df.shape}')

    return combined_df

#### C) Putting it all together

We run the functions described above to create the text files that are required for pre-training the DistilBERT model.

In [None]:
output_individual_files('Data/Raw/mcf_api_responses_csv')

In [18]:
combined_df = output_combined_file('Data/Raw/mcf_api_responses_csv')

Processing raw_2021-05.csv...
Shape of combined dataframe: (882214,)


In [26]:
# Writing it out into a readable text file

filepath = 'Data/Train/pre-training-full.txt'

with open(filepath, 'w+') as outfile:
    for i, jd in enumerate(combined_df):
#         if i >= 1000:
#             break
        outfile.write(f'{jd}\n')

print(f"{filepath} saved!")

Data/Train/pre-training-full.txt saved!
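
Because the text file stores one description per line, any line break left inside a description would split it across several lines. Whether `final_cleaning` already removes internal line breaks is not shown here, so the following is only a defensive sketch under that uncertainty; `write_one_per_line` and the sample strings are hypothetical and not part of the original pipeline.

```python
import os
import tempfile


def write_one_per_line(texts, filepath):
    """Write each non-empty text on exactly one line; return the count."""
    written = 0
    with open(filepath, "w", encoding="utf-8") as outfile:
        for text in texts:
            # Skip missing values such as NaN
            if not isinstance(text, str):
                continue
            # Collapse internal line breaks and runs of whitespace
            single_line = " ".join(text.split())
            if not single_line:
                continue
            outfile.write(single_line + "\n")
            written += 1
    return written


# Made-up sample: one multi-line entry, one blank entry, one normal entry
sample = ["First job\ndescription", "  ", "Second job description"]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    count = write_one_per_line(sample, sample_path)
    with open(sample_path, encoding="utf-8") as infile:
        lines = infile.read().splitlines()

print(count, lines)
```

Comparing the returned count with the number of lines in the file is a cheap sanity check before starting a long pre-training run.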
