#### Batch Organizer Tool for Sample Collections

##### Welcome to the Batch Organizer Tool, a utility designed to streamline the organization of your sample collection into randomized, stratified batches. This tool is ideal for researchers and scientists who need to manage large datasets with complex hierarchical structures, such as herbarium collections.

* Key Features:

    CSV File Input: Simply upload your .csv file containing the details of your sample collection. Ensure that your file includes columns for family, species, and genus.
    
    Stratified Sampling: The tool stratifies by taxonomic family so that each batch reflects the family composition of the whole collection. This reduces the risk that differences between batches are confused with differences between families.
    
    Randomization: Within each family, samples are randomized to further reduce bias and ensure that each batch is varied.
    
    QC Sample Integration: Each batch includes four sets of 3 QC samples (Blank, CQ-1, CQ-2), 12 QC samples in total, distributed as follows: 3 QCs + 23 samples + 3 QCs + 23 samples + 3 QCs + 22 samples + 3 QCs.
    
    Batch Creation: Organize your samples into batches of 68 samples (80 rows once the 12 QC samples are added), or a size of your choosing. The tool automatically calculates the number of batches required and distributes samples accordingly.
    
    Output: Each batch is saved as a separate CSV file, ready for further analysis or processing.
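
The stratification and randomization features above can be illustrated with a minimal, standard-library sketch. It is not the notebook's exact method: instead of drawing a proportional quota per family and topping up at random, it shuffles each family and deals its members across the batches in turn. The function name `stratified_batches` and the `(sample_id, family)` input format are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def stratified_batches(samples, batch_size, seed=None):
    """Split (sample_id, family) pairs into batches without reusing any sample.

    Each family is shuffled, then its members are dealt out across the
    batches in turn, so every batch receives a similar share of each family.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    num_batches = -(-len(samples) // batch_size)  # ceiling division
    batches = [[] for _ in range(num_batches)]

    slot = 0
    for family in sorted(by_family):
        members = by_family[family]
        rng.shuffle(members)  # randomize within the family
        for sample_id in members:
            batches[slot % num_batches].append((sample_id, family))
            slot += 1

    for batch in batches:
        rng.shuffle(batch)  # randomize run order within the batch
    return batches

# Hypothetical data: 6 samples of family A and 4 of family B, batches of 5
samples = [(i, "A") for i in range(6)] + [(i, "B") for i in range(6, 10)]
batches = stratified_batches(samples, batch_size=5, seed=1)
```

With this dealing scheme, batch sizes differ by at most one and no batch exceeds `batch_size`; passing a `seed` makes the assignment reproducible.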

* How It Works:

    Upload Your Data: Start by uploading your CSV file. The file should contain columns named "family", "species", and "genus".
    
    Proportion Calculation: The tool calculates the proportion of each family within your dataset to ensure balanced representation in each batch.
    
    Randomized Stratification: Samples are stratified by family and then randomized within each family to create balanced and varied batches.
    
    QC Sample Integration: Each batch is structured to include QC samples in the following order:

        1. 3 QC samples (Blank, CQ-1, CQ-2)
        2. 23 randomized samples
        3. 3 QC samples (Blank, CQ-1, CQ-2)
        4. 23 randomized samples
        5. 3 QC samples (Blank, CQ-1, CQ-2)
        6. 22 randomized samples
        7. 3 QC samples (Blank, CQ-1, CQ-2)
    
    Batch Creation: The tool divides your samples into the specified structure, ensuring each batch reflects the proportional diversity of your collection.
    
    Output Files: Each batch is saved as a CSV file, ready for download and further use.
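
The QC ordering described above can be sketched as a small function that interleaves QC sets with blocks of samples. This is an illustration only: `batch_layout`, `QC_SET`, and `SAMPLE_BLOCKS` are hypothetical names, and the QC labels are the ones used in this description rather than the ones in the script.

```python
QC_SET = ["Blank", "CQ-1", "CQ-2"]   # labels as written in this description
SAMPLE_BLOCKS = [23, 23, 22]         # samples between consecutive QC sets

def batch_layout(sample_ids, qc_set=QC_SET, blocks=SAMPLE_BLOCKS):
    """Return the run order for one batch: a QC set before each block and one at the end."""
    layout = []
    start = 0
    for size in blocks:
        layout.extend(qc_set)
        layout.extend(sample_ids[start:start + size])
        start += size
    layout.extend(qc_set)  # closing QC set
    return layout

layout = batch_layout([f"S{i}" for i in range(1, 69)])
print(len(layout))  # 80 = 68 samples + 4 QC sets of 3
```

A batch with fewer than 68 samples still receives all four QC sets under this sketch; only the sample blocks shrink.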

Example Use Case:

Imagine you have a dataset of 1000 herbarium samples, representing various species across different genera and families. You need to organize these samples into batches for analysis, ensuring that each batch is representative of the overall diversity and includes QC samples in a specific order. This tool automates the process, saving you time and ensuring statistical robustness.
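
For the 1000-sample example, the batch arithmetic works out as follows. The sketch assumes the default structure of 68 samples and 12 QC samples per batch, and that the last, partially filled batch still receives all four QC sets.

```python
import math

total_samples = 1000
samples_per_batch = 68   # 23 + 23 + 22
qc_per_batch = 12        # 4 QC sets x 3 QC samples

num_batches = math.ceil(total_samples / samples_per_batch)
last_batch_samples = total_samples - (num_batches - 1) * samples_per_batch
total_runs = total_samples + num_batches * qc_per_batch

print(num_batches)         # 15
print(last_batch_samples)  # 48
print(total_runs)          # 1180
```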

##### Explanation of the Script:

    1. Load the Data:
        Reads the CSV file into a pandas DataFrame.

    2. Check for Family Column:
        Ensures the family column exists in the file.

    3. Define QC Samples:
        Lists the QC samples (Blank, QC_Inter_Batch, QC_Intra_Batch).

    4. Calculate Proportions:
        Calculates the proportion of each family in the dataset.

    5. Define Batch Structure:
        Specifies the number of samples and QC samples in each batch.

    6. Calculate Number of Batches:
        Determines the number of batches needed based on the total samples and batch structure.

    7. Create Batches:
        Randomly selects samples for each family proportionally.
        Ensures each batch contains the required number of samples and QC samples in the specified structure.

    8. Save Batches:
        Saves each batch as a separate CSV file.
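
Step 2 checks only the family column, while the later plotting cells also group by genus and species. A minimal sketch of a check covering all three columns is shown below; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced for this illustration and are not part of the script.

```python
REQUIRED_COLUMNS = ("family", "genus", "species")

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in the order listed."""
    present = set(columns)
    return [name for name in required if name not in present]

# Hypothetical header row that lacks the genus column
missing = missing_columns(["sample_id", "family", "species"])
if missing:
    print(f"Missing required column(s): {', '.join(missing)}")
```

With a pandas DataFrame, the same function could be called as `missing_columns(data.columns)`.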

In [49]:
import pandas as pd
import numpy as np
import os

# Define the directory and file name separately
directory = r"C:\Users\borge\Documents\Batch_organizer\Example_data"
file_name = "example.csv"

# Combine the directory and file name to form the full path
path = os.path.join(directory, file_name)

# Load the data
data = pd.read_csv(path)

# Add an original index column to keep track of the original order
data['original_index'] = data.index

# Ensure the family column is present
if 'family' not in data.columns:
    raise ValueError("The CSV file must contain a 'family' column.")

# Generate stratified batch dataframes
# Define QC samples
qc_samples = ['Blank', 'QC_Inter_Batch', 'QC_Intra_Batch']

# Calculate the proportion of each family
family_counts = data['family'].value_counts(normalize=True)

# Define the number of samples per batch and QC structure
samples_per_batch = 68
samples_structure = [23, 23, 22]
total_qc_per_batch = 12  # 3 QC sets of 3 QCs each and 1 extra set of 3 QCs

# Calculate the number of batches needed.
# Each batch holds sum(samples_structure) samples; the QC rows are added on top of these.
total_samples = len(data)
batch_capacity = sum(samples_structure)
num_batches = total_samples // batch_capacity + (total_samples % batch_capacity > 0)

# Initialize a list to hold batches
batches = []

# Function to get random samples for each family proportionally
def get_samples_for_family(pool, family, count):
    available_samples = pool[pool['family'] == family]
    count = min(count, len(available_samples))  # Ensure we don't sample more than available
    family_samples = available_samples.sample(count)
    return family_samples

# Create a new directory for the batches if it doesn't exist
output_dir = "new_batches"
os.makedirs(output_dir, exist_ok=True)

# Pool of samples not yet assigned to a batch, so no sample is drawn twice
remaining_samples = data.copy()

# Create batches
for batch_num in range(num_batches):
    batch = pd.DataFrame()
    for family, proportion in family_counts.items():
        count = int(proportion * batch_capacity)
        samples = get_samples_for_family(remaining_samples, family, count)
        batch = pd.concat([batch, samples])
        remaining_samples = remaining_samples.drop(samples.index)  # Remove selected samples from the pool
    
    # If the batch is smaller than the required size due to rounding, fill with random samples
    if len(batch) < batch_capacity:
        additional_sample_count = batch_capacity - len(batch)
        additional_samples = remaining_samples.sample(min(additional_sample_count, len(remaining_samples)))
        batch = pd.concat([batch, additional_samples])
        remaining_samples = remaining_samples.drop(additional_samples.index)
    
    # Shuffle the batch and add randomized index
    batch = batch.sample(frac=1).reset_index(drop=True)
    batch['randomized_index'] = batch.index
    
    # Create stratified batch with QC samples
    # One row per QC sample; the 'qc_sample' column records which QC each row is
    qc_set = pd.DataFrame({
        'original_index': ['QC'] * len(qc_samples),
        'randomized_index': ['QC'] * len(qc_samples),
        'qc_sample': qc_samples,
    })
    parts = []
    start = 0
    for size in samples_structure:
        parts.extend([qc_set, batch[start:start+size]])  # QC set, then the next block of samples
        start += size
    parts.append(qc_set)  # closing QC set
    stratified_batch = pd.concat(parts)
    
    # Save batch to a separate CSV file in the new_batches directory
    stratified_batch.to_csv(os.path.join(output_dir, f'batch_{batch_num+1}.csv'), index=False)
    
    # Create a separate DataFrame for each batch
    globals()[f'batch_{batch_num+1}'] = stratified_batch
    batches.append(stratified_batch)

print("Batches have been created and saved in the 'new_batches' directory.")


Batches have been created and saved in the 'new_batches' directory.


In [50]:
import pandas as pd
import plotly.express as px
import os

# Create a new column for counts
data['count'] = 1

# Group by family, genus, and species to get counts
grouped_data = data.groupby(['family', 'genus', 'species']).size().reset_index(name='count')

# Directory to save plots
plot_dir = os.path.join(directory, "plots")
os.makedirs(plot_dir, exist_ok=True)

# Distribution of family, genus, and species - Sunburst Plot with counts
sunburst_fig = px.sunburst(grouped_data, 
                           path=['family', 'genus', 'species'], 
                           values='count',
                           hover_data={'count': True},
                           title='Distribution of Family, Genus, and Species in Original Data')

# Display the plot
#sunburst_fig.show()

# Save the plot as an HTML file
sunburst_fig.write_html(os.path.join(plot_dir, "original_data_sunburst.html"))


In [51]:
plot_dir

'C:\\Users\\borge\\Documents\\Batch_organizer\\Example_data\\plots'

In [52]:
import pandas as pd
import plotly.express as px
import os

# Directory to save plots
plot_dir = os.path.join(directory, "plots")
os.makedirs(plot_dir, exist_ok=True)

# Function to create sunburst plot for a batch
def create_sunburst_plot(batch_data, batch_num):
    grouped_data = batch_data.groupby(['family', 'genus', 'species']).size().reset_index(name='count')
    sunburst_fig = px.sunburst(grouped_data, 
                               path=['family', 'genus', 'species'], 
                               values='count',
                               hover_data={'count': True},
                               title=f'Stratified Distribution in Batch {batch_num}')
    # Save the plot as an HTML file
    sunburst_fig.write_html(os.path.join(plot_dir, f'batch_{batch_num}_sunburst.html'))
    # Display the plot
    #sunburst_fig.show()

# Generate sunburst plots for each batch
for i, batch_data in enumerate(batches, start=1):
    create_sunburst_plot(batch_data, i)

print("Stratified batch plots have been created and saved.")

Stratified batch plots have been created and saved.


In [54]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os

# Create a new column for counts
data['count'] = 1

# Group by family, genus, and species to get counts
grouped_data = data.groupby(['family', 'genus', 'species']).size().reset_index(name='count')

# Number of batches (manually set for this example, should be adjusted according to actual data)
num_batches = 4  # Adjust this to the number of batches you have

# Function to create sunburst plot for a batch
def create_sunburst_plot(batch_data, batch_num):
    grouped_data = batch_data.groupby(['family', 'genus', 'species']).size().reset_index(name='count')
    sunburst_fig = px.sunburst(grouped_data, 
                               path=['family', 'genus', 'species'], 
                               values='count',
                               hover_data={'count': True},
                               title=f'Stratified Distribution in Batch {batch_num}')
    return sunburst_fig

# Generate stratified batch dataframes (mock example with grouped data)
batches = [data.sample(frac=1/num_batches) for _ in range(num_batches)]

# Create subplots layout
rows = num_batches // 2 + num_batches % 2
fig = make_subplots(rows=rows, cols=2, subplot_titles=[f'Batch {i+1}' for i in range(num_batches)], specs=[[{'type': 'sunburst'}, {'type': 'sunburst'}] for _ in range(rows)])

# Function to add a sunburst plot to a subplot
def add_sunburst_to_subplot(fig, batch_data, row, col):
    # row and col are 1-based, so the batch number is (row - 1) * 2 + col
    sunburst_fig = create_sunburst_plot(batch_data, (row - 1) * 2 + col)
    for trace in sunburst_fig.data:
        fig.add_trace(trace, row=row, col=col)

# Generate sunburst plots for each batch and add to the subplot figure
for i, batch_data in enumerate(batches):
    row = i // 2 + 1
    col = i % 2 + 1
    add_sunburst_to_subplot(fig, batch_data, row, col)

# Update layout and show the figure
fig.update_layout(height=1000, width=1200, title_text="Stratified Distribution in Batches")
fig.write_html(os.path.join(plot_dir, "batches_sunburst.html"))
#fig.show()


In [56]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os

# Create a new column for counts
data['count'] = 1

# Group by family and genus to get counts for the original data
grouped_data = data.groupby(['family', 'genus']).size().reset_index(name='count')

# Number of batches (determined from previous cells or actual data)
batch_size = 68  # You should use the actual batch size you have
num_batches = (len(data) + batch_size - 1) // batch_size
batches = [data.iloc[i*batch_size:(i+1)*batch_size] for i in range(num_batches)]

# Function to create bar plots for a batch
def create_bar_plot(batch_data, batch_num, category):
    counts = batch_data[category].value_counts().reset_index()
    counts.columns = [category, 'count']
    bar_fig = px.bar(counts, x=category, y='count', title=f'{category.capitalize()} Distribution in Batch {batch_num}')
    return bar_fig

# Create a subplot layout for bar plots
fig = make_subplots(rows=num_batches, cols=2, subplot_titles=[f'Batch {i+1}' for i in range(num_batches) for _ in range(2)], shared_yaxes=True, horizontal_spacing=0.1)

# Generate bar plots for each batch and add to the subplot figure
for i, batch_data in enumerate(batches):
    for j, category in enumerate(['family', 'genus']):
        bar_fig = create_bar_plot(batch_data, i+1, category)
        for trace in bar_fig.data:
            fig.add_trace(trace, row=i+1, col=j+1)

# Update layout and show the figure
fig.update_layout(height=400 * num_batches, width=1200, title_text="Stratified Distribution in Batches")
fig.write_html(os.path.join(plot_dir, "batches_barplot.html"))
#fig.show()


In [57]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os

# Create a new column for counts
data['count'] = 1

# Number of batches (determined from previous cells or actual data)
batch_size = 68  # You should use the actual batch size you have
num_batches = (len(data) + batch_size - 1) // batch_size
batches = [data.iloc[i*batch_size:(i+1)*batch_size] for i in range(num_batches)]

# Function to create bar plots for a dataset
def create_bar_plot(dataset, title, category):
    counts = dataset[category].value_counts().reset_index()
    counts.columns = [category, 'count']
    bar_fig = px.bar(counts, x=category, y='count', title=title)
    return bar_fig

# Create bar plots for the overall distribution
overall_family_plot = create_bar_plot(data, 'Overall Family Distribution', 'family')
overall_genus_plot = create_bar_plot(data, 'Overall Genus Distribution', 'genus')

# Save overall distribution plots
overall_family_plot.write_html(os.path.join(plot_dir, "overall_family_distribution.html"))
overall_genus_plot.write_html(os.path.join(plot_dir, "overall_genus_distribution.html"))

# Create a subplot layout for bar plots
fig = make_subplots(rows=num_batches + 1, cols=2, subplot_titles=(['Overall'] + [f'Batch {i+1}' for i in range(num_batches)])*2, shared_yaxes=True, horizontal_spacing=0.1)

# Add overall distribution plots to the subplot figure
for trace in overall_family_plot.data:
    fig.add_trace(trace, row=1, col=1)

for trace in overall_genus_plot.data:
    fig.add_trace(trace, row=1, col=2)

# Generate bar plots for each batch and add to the subplot figure
for i, batch_data in enumerate(batches):
    family_plot = create_bar_plot(batch_data, f'Family Distribution in Batch {i+1}', 'family')
    genus_plot = create_bar_plot(batch_data, f'Genus Distribution in Batch {i+1}', 'genus')
    
    for trace in family_plot.data:
        fig.add_trace(trace, row=i+2, col=1)
    
    for trace in genus_plot.data:
        fig.add_trace(trace, row=i+2, col=2)

# Update layout and show the figure
fig.update_layout(height=400 * (num_batches + 1), width=1200, title_text="Stratified Distribution in Batches")
fig.write_html(os.path.join(plot_dir, "batches_barplot.html"))
#fig.show()


##### How to Judge?

To judge whether stratification was achieved, you should compare the distribution of families and genera in each batch to the overall distribution in the original dataset. The key points to look for include:

* Similarity in Distribution:
     The bar plots for each batch should closely resemble the bar plots for the overall distribution.
     The relative proportions of each family and genus in the batches should be similar to their proportions in the overall dataset.

* Consistency Across Batches:
     Each batch should have a similar composition, reflecting the diversity of the original dataset.
     There should not be any significant discrepancies or outliers in the distribution of families and genera across different batches.

##### Key Points to Look For:

* Proportional Representation: The heights of the bars in the batch plots should be proportionally similar to the heights of the bars in the overall plot. For example, if the family "Asteraceae" makes up 20% of the original dataset, it should also make up approximately 20% of each batch.

* Presence of All Groups: All families and genera present in the original dataset should also be present in the batches, assuming the batch size is large enough to capture the diversity.

* No Over- or Under-Representation: No single family or genus should be significantly over- or under-represented in any batch compared to the overall dataset.
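
The proportional-representation check above can be put into numbers by comparing each family's share in a batch with its share in the full dataset. The following is a minimal standard-library sketch with made-up labels; `proportions` and `max_deviation` are hypothetical helper names. In pandas, the shares themselves are available through `value_counts(normalize=True)`, as used elsewhere in this notebook.

```python
from collections import Counter

def proportions(labels):
    """Map each label to its share of the list (shares sum to 1)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def max_deviation(batch_labels, overall_labels):
    """Largest absolute gap between a batch's share and the overall share of any label."""
    overall = proportions(overall_labels)
    batch = proportions(batch_labels)
    return max(abs(batch.get(label, 0.0) - share) for label, share in overall.items())

# Illustrative data: Asteraceae is 20% of the dataset and 20% of the batch
overall = ["Asteraceae"] * 20 + ["Fabaceae"] * 50 + ["Poaceae"] * 30
batch = ["Asteraceae"] * 4 + ["Fabaceae"] * 11 + ["Poaceae"] * 5
print(round(max_deviation(batch, overall), 2))  # 0.05
```

A family that is missing from a batch contributes its full overall share to the deviation, so this single number also reflects the "presence of all groups" point.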

In [58]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os
from scipy.stats import chi2_contingency

# Create a new column for counts
data['count'] = 1

# Print the size of the original dataset
print(f"Original data size: {len(data)}")

# Number of batches (determined from previous cells or actual data)
batch_size = 68  # You should use the actual batch size you have
num_batches = (len(data) + batch_size - 1) // batch_size
print(f"Calculated number of batches: {num_batches}")

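# Split the data into sequential chunks of batch_size rows
# (note: these are slices of the original table, not the stratified batches saved in 'new_batches')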
batches = [data.iloc[i*batch_size:(i+1)*batch_size] for i in range(num_batches)]
print(f"Number of batches generated: {len(batches)}")

# Calculate overall frequencies
overall_family_freq = data['family'].value_counts(normalize=True)
overall_genus_freq = data['genus'].value_counts(normalize=True)

# Function to create bar plots for a dataset
def create_bar_plot(dataset, title, category):
    counts = dataset[category].value_counts().reset_index()
    counts.columns = [category, 'count']
    bar_fig = px.bar(counts, x=category, y='count', title=title)
    return bar_fig

# Create bar plots for the overall distribution
overall_family_plot = create_bar_plot(data, 'Overall Family Distribution', 'family')
overall_genus_plot = create_bar_plot(data, 'Overall Genus Distribution', 'genus')

# Save overall distribution plots
overall_family_plot.write_html(os.path.join(plot_dir, "overall_family_distribution.html"))
overall_genus_plot.write_html(os.path.join(plot_dir, "overall_genus_distribution.html"))

# Create a subplot layout for bar plots
fig = make_subplots(rows=num_batches + 1, cols=2, subplot_titles=(['Overall'] + [f'Batch {i+1}' for i in range(num_batches)])*2, shared_yaxes=True, horizontal_spacing=0.1)

# Add overall distribution plots to the subplot figure
for trace in overall_family_plot.data:
    fig.add_trace(trace, row=1, col=1)

for trace in overall_genus_plot.data:
    fig.add_trace(trace, row=1, col=2)

# Generate bar plots for each batch, add to subplot figure, and perform Chi-square tests
chi2_results = []

for i, batch_data in enumerate(batches):
    family_plot = create_bar_plot(batch_data, f'Family Distribution in Batch {i+1}', 'family')
    genus_plot = create_bar_plot(batch_data, f'Genus Distribution in Batch {i+1}', 'genus')
    
    for trace in family_plot.data:
        fig.add_trace(trace, row=i+2, col=1)
    
    for trace in genus_plot.data:
        fig.add_trace(trace, row=i+2, col=2)
    
    # Calculate observed frequencies
    observed_family_freq = batch_data['family'].value_counts()
    observed_genus_freq = batch_data['genus'].value_counts()
    
    # Align with overall frequencies
    observed_family_freq = observed_family_freq.reindex(overall_family_freq.index, fill_value=0)
    observed_genus_freq = observed_genus_freq.reindex(overall_genus_freq.index, fill_value=0)
    
    # Create contingency tables
    family_contingency = pd.DataFrame({'observed': observed_family_freq, 'expected': overall_family_freq * len(batch_data)})
    genus_contingency = pd.DataFrame({'observed': observed_genus_freq, 'expected': overall_genus_freq * len(batch_data)})
    
    # Perform Chi-square tests
    chi2_family, p_family, _, _ = chi2_contingency(family_contingency)
    chi2_genus, p_genus, _, _ = chi2_contingency(genus_contingency)
    
    chi2_results.append({'batch': i+1, 'chi2_family': chi2_family, 'p_family': p_family, 'chi2_genus': chi2_genus, 'p_genus': p_genus})

# Update layout and show the figure
fig.update_layout(height=400 * (num_batches + 1), width=1200, title_text="Stratified Distribution in Batches")
fig.write_html(os.path.join(plot_dir, "batches_barplot.html"))
#fig.show()

# Display Chi-square test results
chi2_results_df = pd.DataFrame(chi2_results)
chi2_results_df.to_csv(os.path.join(plot_dir, "chi2_results.csv"), index=False)
print(chi2_results_df)


Original data size: 200
Calculated number of batches: 3
Number of batches generated: 3
   batch  chi2_family  p_family  chi2_genus  p_genus
0      1     0.187611  0.999980    0.283293      1.0
1      2     0.315476  0.999881    0.582475      1.0
2      3     0.069689  0.999999    0.172635      1.0


In [72]:
from IPython.display import display, HTML

# Function to judge the results and format the output with bold characters
def judge_stratification_results(results_df, alpha=0.05):
    p_family_values = results_df['p_family'].apply(lambda x: f"{x:.2f}").tolist()
    p_genus_values = results_df['p_genus'].apply(lambda x: f"{x:.2f}").tolist()
    # A difference is flagged only when a p-value falls below alpha
    if (results_df['p_family'] >= alpha).all() and (results_df['p_genus'] >= alpha).all():
        message = (
            f"All p-values are at or above {alpha} (family: {p_family_values}; genus: {p_genus_values}),"
            " so the tests detect <b>NO SIGNIFICANT</b> difference between the observed and expected distributions for families and genera in each batch. "
            "This means that the observed frequencies in your batches are close to the expected frequencies derived from the overall dataset."
        )
    else:
        message = (
            f"At least one p-value is below {alpha} (family: {p_family_values}; genus: {p_genus_values}),"
            " which suggests a <b>SIGNIFICANT</b> difference between the observed and expected distributions for families and/or genera in some batches. "
            "Further investigation may be needed to ensure proper stratification."
        )
    display(HTML(message))

# Judge the results
judge_stratification_results(chi2_results_df)

##### Chi-square Test Interpretation:

* Chi-square Statistic (chi2_family, chi2_genus):
        The Chi-square statistic measures the discrepancy between the observed and expected frequencies.
        A higher Chi-square value indicates a greater difference between observed and expected distributions.
        In your results, the Chi-square values for both family and genus are very low, indicating that the observed frequencies in the batches are very close to the expected frequencies.

* p-value (p_family, p_genus):
        The p-value is the probability of obtaining a discrepancy at least as large as the one observed if the null hypothesis (no difference between observed and expected distributions) were true.
        A p-value less than 0.05 typically indicates a statistically significant difference, suggesting that the observed distribution is different from the expected distribution.
        In your results, the p-values are extremely high (close to 1), indicating no significant difference between the observed and expected distributions.
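
The Chi-square statistic described above can also be computed directly as a goodness-of-fit measure, comparing the counts in one batch with the counts expected from the overall proportions. The sketch below uses only the standard library and made-up counts; `chi_square_statistic` is a hypothetical helper, not part of the notebook. For a p-value, `scipy.stats.chisquare(f_obs, f_exp)` performs the same comparison.

```python
def chi_square_statistic(observed, expected_share):
    """Goodness-of-fit statistic: sum over categories of (O - E)^2 / E.

    observed:        {category: count in the batch}
    expected_share:  {category: proportion in the full dataset}, each > 0, summing to 1
    """
    n = sum(observed.values())
    statistic = 0.0
    for category, share in expected_share.items():
        expected = share * n  # expected count for this category in a batch of size n
        diff = observed.get(category, 0) - expected
        statistic += diff * diff / expected
    degrees_of_freedom = len(expected_share) - 1
    return statistic, degrees_of_freedom

# Illustrative numbers only: a batch of 68 samples across three families
observed = {"Asteraceae": 12, "Fabaceae": 36, "Poaceae": 20}
expected_share = {"Asteraceae": 0.2, "Fabaceae": 0.5, "Poaceae": 0.3}
stat, dof = chi_square_statistic(observed, expected_share)
print(round(stat, 3), dof)  # 0.314 2
```

A small statistic relative to the degrees of freedom, as here, indicates that the batch counts sit close to the expected counts.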

##### Conclusion:

* High p-values (close to 1) suggest that there is no significant difference between the observed and expected distributions for families and genera in each batch. This means that the observed frequencies in your batches are very close to the expected frequencies derived from the overall dataset.

* Low Chi-square values further support the conclusion that the distributions in the batches closely match the overall distribution.

##### Evidence of Stratification:

* Consistency with Overall Distribution: The high p-values and low Chi-square values are consistent with successful stratification: each batch's distribution of families and genera closely follows the overall distribution.

* No Significant Differences: Because none of the p-values is significant, the tests provide no statistical evidence that any batch departs from the overall distribution. A non-significant result does not by itself prove that the batches are identical in composition, so the bar plots remain a useful visual check.