# Summary
This notebook generates random samples of 100 comments from each dataset.

The comments are drawn at random (with a fixed seed for reproducibility) and without replacement, so no comment appears in more than one sample of the same dataset.
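As a minimal sketch of the idea, the snippet below draws three non-overlapping samples from a toy DataFrame. The `comments` frame and the `draw_disjoint_samples` helper are hypothetical stand-ins for illustration, not part of the notebook's datasets or functions.

```python
import pandas as pd

# Toy stand-in for one of the comment datasets (hypothetical data).
comments = pd.DataFrame({"comment_body": [f"comment {i}" for i in range(1000)]})

def draw_disjoint_samples(dataset, n_samples=3, sample_size=100, seed=24996236):
    """Draw n_samples non-overlapping samples of sample_size rows each."""
    # A single draw without replacement guarantees no row is selected twice.
    drawn = dataset.sample(n=n_samples * sample_size, random_state=seed)
    # Slice the shuffled draw into consecutive, equally sized chunks.
    return [drawn.iloc[i * sample_size:(i + 1) * sample_size] for i in range(n_samples)]

samples = draw_disjoint_samples(comments)
```

Because all rows come from one draw, the chunks cannot share a comment, and rerunning with the same seed should return the same rows.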

The samples are placed into a subfolder in '/samples/{dataset_name}'. For example, a comment in the `no_attachment` dataset might be found in '/samples/no_attachment/sample1'.

In [5]:
import numpy as np
import pandas
import random
import shutil

### Generating dataset samples
The codeblock below imports the datasets created in the previous notebooks.

The codeblocks after that define two functions. One generates 3 samples of 100 comments from each dataset passed in, and the other saves those 3 samples to a spreadsheet with one sample per sheet.

In [6]:
# Import the datasets from the Meta Analysis notebook.
# dtype=False (the boolean, not the string 'false') disables dtype inference.
data_cleaned = pandas.read_json('./data/data_cleaned.json', orient='records', dtype=False)
has_attachment = pandas.read_json('./data/has_attachment.json', orient='records', dtype=False)
no_attachment = pandas.read_json('./data/no_attachment.json', orient='records', dtype=False)

In [58]:
def save_to_spreadsheet(samples_array, dataset_name):
    # Save the samples to a spreadsheet named {dataset_name}_samples.xlsx.
    # E.g., has_attachment samples are saved to has_attachment_samples.xlsx.
    # The context manager saves and closes the file on exit.
    with pandas.ExcelWriter("./samples/" + dataset_name + "_samples.xlsx") as writer:
        # Write each sample of 100 comments to a separate sheet in the spreadsheet
        for number, sample in enumerate(samples_array, start=1):
            sample.to_excel(writer, sheet_name="sample " + str(number))

In [52]:
def generate_samples(dataset, dataset_name):
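    # Draw 300 comments without replacement; the fixed seed makes the draw reproducible
    samples = dataset.sample(n=300, random_state=24996236)
    # Split the draw into 3 non-overlapping samples of 100 comments each
    samples_array = [samples.iloc[start:start + 100] for start in range(0, 300, 100)]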
    
    # Save the samples to a spreadsheet after generating them 
    save_to_spreadsheet(samples_array, dataset_name)

The codeblock below takes the three datasets and applies the two functions above to create a spreadsheet of samples for each.

In [59]:
generate_samples(data_cleaned, "data_cleaned")
generate_samples(has_attachment, "has_attachment")
generate_samples(no_attachment, "no_attachment")
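One caveat worth sketching: drawing 300 rows without replacement only works when a dataset has at least 300 rows. The helper below is a hypothetical guard, not part of the notebook, and the two toy frames are made-up data used only to show both branches.

```python
import pandas as pd

def sample_if_large_enough(dataset, n=300, seed=24996236):
    """Return n rows drawn without replacement, or None when the dataset is too small."""
    if len(dataset) < n:
        # DataFrame.sample would raise a ValueError here, since replace=False by default.
        return None
    return dataset.sample(n=n, random_state=seed)

# Hypothetical datasets: one too small to sample from, one large enough.
small = pd.DataFrame({"comment_body": ["a", "b", "c"]})
large = pd.DataFrame({"comment_body": [str(i) for i in range(500)]})

too_small = sample_if_large_enough(small)   # None
drawn = sample_if_large_enough(large)       # DataFrame with 300 rows
```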



### Creating samples from the downloaded attachments (not complete)

In [74]:
def move_to_folder(files, folder_name):
    for file in files:
        # Each attachment is stored as either a .pdf or a .docx file
        try:
            shutil.copy2('./data/attachments/' + str(file) + '.pdf', './data/samples/' + folder_name)
        except FileNotFoundError:
            shutil.copy2('./data/attachments/' + str(file) + '.docx', './data/samples/' + folder_name)
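A possible alternative to the try/except fallback, sketched under the assumption that each attachment exists under exactly one of a few known extensions. The `copy_attachment` helper and its parameters are hypothetical names for illustration.

```python
import shutil
from pathlib import Path

def copy_attachment(stem, source_dir, target_dir, extensions=(".pdf", ".docx")):
    """Copy the first existing file named stem + extension into target_dir."""
    target = Path(target_dir)
    # Create the folder first; otherwise copy2 either fails or writes a
    # plain file that carries the folder's name.
    target.mkdir(parents=True, exist_ok=True)
    for extension in extensions:
        candidate = Path(source_dir) / f"{stem}{extension}"
        if candidate.is_file():
            return Path(shutil.copy2(candidate, target))
    return None  # no attachment found under any of the extensions
```

Returning `None` instead of raising lets the caller decide whether a missing attachment should stop the run or simply be skipped.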

In [75]:
# From the attachments dataset, make 3 samples, each containing 15% of the dataset.
# The comments are drawn at random from the entire dataset each time,
# so a sample may contain the same comment as another sample.
# The seed is set before each draw for reproducibility of the samples.

sample = list(range(len(data_attachments_only)))
sample_size = int(len(sample) * .15)

random.seed(a=24996236, version=2)
data_attachments_only_sample_1 = random.sample(sample, sample_size)
move_to_folder(data_attachments_only_sample_1, 'attachment_sample_1')

random.seed(a=98952473, version=2)
data_attachments_only_sample_2 = random.sample(sample, sample_size)
move_to_folder(data_attachments_only_sample_2, 'attachment_sample_2')

random.seed(a=37241857, version=2)
data_attachments_only_sample_3 = random.sample(sample, sample_size)
move_to_folder(data_attachments_only_sample_3, 'attachment_sample_3')
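To illustrate how separately seeded draws behave, here is a small sketch using only the standard library. The `seeded_index_sample` helper and the population size of 200 are assumptions made for the example; it uses a local `random.Random` instance rather than the global seed, so it is not guaranteed to select the same positions as the cell above.

```python
import random

def seeded_index_sample(population_size, fraction, seed):
    """Return a reproducible sample of row positions (no repeats within the sample)."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    sample_size = int(population_size * fraction)
    return rng.sample(range(population_size), sample_size)

# Hypothetical dataset of 200 attachments, sampled with two of the seeds used above.
sample_1 = seeded_index_sample(200, 0.15, 24996236)
sample_2 = seeded_index_sample(200, 0.15, 98952473)

# Positions that appear in both samples; this set may or may not be empty.
shared = set(sample_1) & set(sample_2)
```

Each sample is free of repeats internally, but since the draws are independent, nothing prevents the same position from turning up in two samples.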

In [94]:
# Write the sentiment scores to a text file; the context manager closes the file on exit.
with open('./data/samples/attachment_sentiment_scores.txt', 'w+') as f:
    f.write(data_attachments_only.to_string(columns=['sentiment']).replace('\n', '\r\n'))

In [83]:
# Save each dataset for checking if the sentiment score and comment_body follow a pattern
# data_duplicates_removed_sample_1.to_json('./data/samples/data_duplicates_removed_sample_1.json', orient='records')
# data_duplicates_removed_sample_2.to_json('./data/samples/data_duplicates_removed_sample_2.json', orient='records')
# data_duplicates_removed_sample_3.to_json('./data/samples/data_duplicates_removed_sample_3.json', orient='records')


# Write each sample to its own sheet; the context manager saves and closes the file on exit.
with pandas.ExcelWriter("./data/samples/data_duplicates_removed_samples.xlsx") as writer:
    data_duplicates_removed_sample_1.to_excel(writer, sheet_name="sample1")
    data_duplicates_removed_sample_2.to_excel(writer, sheet_name="sample2")
    data_duplicates_removed_sample_3.to_excel(writer, sheet_name="sample3")