# Summary
This notebook generates three random samples of 100 comments each from every dataset.

The comments are drawn at random, with a fixed seed for reproducibility, and without replacement, so no comment appears in more than one sample of a given dataset.
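
As a minimal sketch of why a single seeded draw yields disjoint samples: the toy `comments` frame below is a hypothetical stand-in for a real dataset, and the seed is the one used later in this notebook. Drawing 300 rows without replacement and slicing the result into three blocks of 100 means no row can be shared between blocks.

```python
import pandas

# Hypothetical stand-in for one of the datasets: 1,000 numbered comments.
comments = pandas.DataFrame({"comment": [f"comment {i}" for i in range(1000)]})

# One draw of 300 rows without replacement (the default), with a fixed seed.
draw = comments.sample(n=300, random_state=24996236)

# Slice the draw into three consecutive blocks of 100 rows.
samples = [draw.iloc[start:start + 100] for start in range(0, 300, 100)]

# No index label appears in more than one block.
all_labels = [label for sample in samples for label in sample.index]
assert len(all_labels) == len(set(all_labels)) == 300

# Re-running with the same seed reproduces the same rows in the same order.
repeat = comments.sample(n=300, random_state=24996236)
assert draw.index.equals(repeat.index)
```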

The samples are saved to a subfolder at './samples/{dataset_name}'. For example, a comment from the `no_attachment` dataset might be found in the `sample 1` sheet of './samples/no_attachment/no_attachment_samples.xlsx'.
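
The folder layout can be sketched as follows. `sample_path` is a hypothetical helper, not part of this notebook, and the file name follows the pattern used by the saving code further down. It creates the dataset's subfolder if it is missing, since pandas will not create a missing directory when writing the spreadsheet.

```python
from pathlib import Path

def sample_path(dataset_name, root="./samples"):
    # Hypothetical helper: build <root>/<dataset_name>/<dataset_name>_samples.xlsx
    # and make sure the dataset's subfolder exists.
    folder = Path(root) / dataset_name
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{dataset_name}_samples.xlsx"
```

Calling `sample_path("no_attachment")` would create `./samples/no_attachment/` if needed and return the path of the spreadsheet inside it.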

In [8]:
import numpy as np
import pandas
import random
import shutil

### Generating dataset samples
The code block below loads the datasets created in the previous notebooks.

The code blocks after that define functions that draw 3 samples of 100 comments from each dataset passed in and save them to a spreadsheet, with each sample on its own sheet.

In [9]:
# Import the datasets from the Meta Analysis notebook.
# dtype=False disables automatic dtype inference so values are kept as stored.
data_cleaned = pandas.read_json('./data/data_cleaned.json', orient='records', dtype=False)
has_attachment = pandas.read_json('./data/has_attachment.json', orient='records', dtype=False)
no_attachment = pandas.read_json('./data/no_attachment.json', orient='records', dtype=False)

In [10]:
def save_to_spreadsheet(samples_array, dataset_name):
    # Save the samples to a spreadsheet named {dataset_name}_samples.xlsx in a directory named after the dataset.
    # E.g., has_attachment samples are saved to `./samples/has_attachment/has_attachment_samples.xlsx`.
    # The target directory must already exist.
    path = "./samples/" + dataset_name + "/" + dataset_name + "_samples.xlsx"

    # Write each sample of 100 comments to a separate sheet in the spreadsheet.
    # The context manager saves and closes the file on exit.
    with pandas.ExcelWriter(path) as writer:
        for i, sample in enumerate(samples_array, start=1):
            sample.to_excel(writer, sheet_name="sample " + str(i))

In [11]:
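def generate_samples(dataset, dataset_name):
    # Draw 300 comments without replacement; the fixed seed makes the draw reproducible.
    samples = dataset.sample(n=300, random_state=24996236)
    # Slice the draw into 3 disjoint samples of 100 comments each.
    samples_array = [samples.iloc[start:start + 100] for start in range(0, 300, 100)]

    # Save the samples to a spreadsheet after generating them
    save_to_spreadsheet(samples_array, dataset_name)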
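
`DataFrame.sample` raises a `ValueError` when asked for more rows than the dataset holds without replacement, so a dataset with fewer than 300 comments would stop the function above. A minimal sketch of a guard follows; `can_draw` is a hypothetical helper, not part of this notebook, and the toy frames are stand-ins for real datasets.

```python
import pandas

def can_draw(dataset, n_samples=3, sample_size=100):
    # Hypothetical helper: True when the dataset has enough rows for
    # n_samples disjoint samples of sample_size rows each.
    return len(dataset) >= n_samples * sample_size

# Toy stand-ins for a dataset that is too small and one that is large enough.
small = pandas.DataFrame({"comment": range(250)})
large = pandas.DataFrame({"comment": range(1000)})

print(can_draw(small))  # False
print(can_draw(large))  # True
```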