# Create Random Sample of Data

In [None]:
import os
import pandas as pd

## 1) Overview

This step prepares training data for BERT classification of the newspaper clippings. It's a simple step: I randomly select 2,000 rows that contain clippings from our data. This random selection is then hand-labelled (see 06_label_training_data.ipynb) and used to fine-tune BERT for our specific classification task.

## 2) Compile Clippings from All CSV Files

So that every clipping has the same chance of being picked regardless of which file it sits in, I'm compiling all the .csv files together and selecting 2,000 rows from the aggregated data.

In [None]:
directory = 'name_clusters'
clippings_urls = []

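for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    try:
        # Load only the three columns needed for labelling
        df = pd.read_csv(file_path, usecols=['clippings', 'url', 'victim'])
        # Keep only rows that actually contain a clipping
        df = df.dropna(subset=['clippings'])
        clippings_urls.append(df)
    except Exception as e:
        # Report files that cannot be read and carry on with the rest
        print(f'{file_path} caused error {e}')

# Stack the per-file frames into one DataFrame with a fresh 0..n-1 index
all_clippings = pd.concat(clippings_urls, ignore_index=True)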
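
The loop above tries to read every entry in `name_clusters` and relies on the `except` branch to report anything that cannot be parsed. A minimal sketch of a stricter variant, assuming the cluster files all end in `.csv`; the helper name `load_clippings` and its parameters are hypothetical and not part of this notebook:

```python
import glob
import os

import pandas as pd


def load_clippings(directory, columns=('clippings', 'url', 'victim')):
    """Concatenate the rows that have a clipping from every .csv file in `directory`.

    Hypothetical helper for illustration; assumes each file has the given columns.
    """
    frames = []
    # sorted() makes the row order independent of the filesystem's listing order
    for file_path in sorted(glob.glob(os.path.join(directory, '*.csv'))):
        df = pd.read_csv(file_path, usecols=list(columns))
        frames.append(df.dropna(subset=['clippings']))
    if not frames:
        raise ValueError(f'no .csv files found in {directory!r}')
    return pd.concat(frames, ignore_index=True)
```

With this, `all_clippings = load_clippings('name_clusters')` would stand in for the cell above. Unlike the original loop, a malformed CSV stops the run instead of being skipped, which may or may not be what you want here.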

## 3) Create Sample of 2,000 Clippings

Then I make the random selection:

In [None]:
random_clippings = all_clippings.sample(n=2000)
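
As written, `sample` draws a different set of rows each time the cell runs, so the chunks that were hand-labelled cannot be regenerated by re-executing the notebook. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for this project; the seed value `42` and the toy DataFrame are placeholders, not values from the real data:

```python
import pandas as pd

# Toy stand-in for all_clippings (placeholder data, not the real clippings)
toy_clippings = pd.DataFrame({'clippings': [f'clipping {i}' for i in range(50)]})

# random_state fixes the seed, so re-running the cell returns the same rows
first_draw = toy_clippings.sample(n=10, random_state=42)
second_draw = toy_clippings.sample(n=10, random_state=42)

# sample() draws without replacement by default, so no row appears twice
same_rows = first_draw.index.equals(second_draw.index)
no_duplicates = first_draw.index.is_unique
```

Applied to this notebook, that would mean passing a `random_state` of your choosing to `all_clippings.sample(n=2000, ...)`.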

## 4) Break into 20 Chunks

To make the hand-labelling step easier, I've split the random selection into 20 chunks, 100 clippings apiece.

In [None]:
for i in range(20):
    # Rows i*100 up to (but not including) (i+1)*100, selected by position
    start_index = i * 100
    end_index = (i + 1) * 100
    chunk = random_clippings.iloc[start_index:end_index]
    # Files are numbered from 1: clippings_part_1.csv ... clippings_part_20.csv
    output_name = f'training_data_2/clippings_part_{i+1}.csv'
    chunk.to_csv(output_name, columns=['clippings', 'url', 'victim'], index=False)
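
`to_csv` does not create missing directories, so the loop above assumes the `training_data_2` folder already exists. A sketch that creates the output folder first and derives the number of chunks from the size of the sample instead of hard-coding 20; the function name `write_chunks` and its parameters are hypothetical:

```python
import os

import pandas as pd


def write_chunks(df, out_dir, chunk_size=100, prefix='clippings_part'):
    """Write `df` to numbered CSV files of at most `chunk_size` rows each.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    os.makedirs(out_dir, exist_ok=True)  # to_csv will not create the folder itself
    paths = []
    for part, start in enumerate(range(0, len(df), chunk_size), start=1):
        path = os.path.join(out_dir, f'{prefix}_{part}.csv')
        # iloc slices by position, so the shuffled index labels do not matter
        df.iloc[start:start + chunk_size].to_csv(path, index=False)
        paths.append(path)
    return paths
```

Calling `write_chunks(random_clippings, 'training_data_2')` on a 2,000-row sample would produce the same 20 files of 100 rows; if the sample size is not a multiple of `chunk_size`, the last file is simply shorter.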