DSC Ticket: https://precisionmedicineinitiative.atlassian.net/browse/DSC-2?focusedCommentId=64650&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-64650

The requirements in the [DRC COVID Serology Plating Strategy](https://docs.google.com/document/d/1SWJDZtbrJH5_48NyTx3dE2T8oLr4OUZA5UgoQDNI664/edit?pli=1#) document will be referenced throughout the notebook.

# SET UP

In [None]:
# %run C:\Users\kouamea\CONNECTIONS\setup.ipynb
%run ../CONNECTIONS/setup.ipynb

In [None]:
# Timestamp used to save files with today's date and time (YYYY-MM-DD_HH.MM.SS)
# Built from a single call so the date and time always belong to the same instant
today = datetime.today().strftime('%Y-%m-%d')
now = datetime.now().strftime('%Y-%m-%d_%H.%M.%S')
now

## Load needed datasets

In [None]:
import glob
import os

location = '../DATA/DSC2'
# List every file under the data folder, including subfolders
fileset = glob.glob(os.path.join(location, '**', '*.*'), recursive=True)
for n, file in enumerate(fileset):
    print(n, file)

In [None]:
serum_samples1 = pd.read_csv('../DATA/DSC2/serum_samples_list_2020-05-06 with locations.csv')

## Removing AIAN and withdrawn pids from original list
## keeping only columns we need
serum_samples_check = serum_samples1[(serum_samples1.is_AIAN_participant != 1) & (serum_samples1.withdrawal_status!=2)]
serum_samples_check = serum_samples_check[['biobank_id','Description','Sequence', 'Rack', 'Position in Rack']].drop_duplicates()

serum_samples_check.head(5)

### Load Biobank's File with Sequencing information

### Load Serology Master File, Negative Control Master File, Positive Control List and Previous Batch (if applicable)

In [None]:
# Forward slashes work on every OS. In a normal string '\1' is an octal escape,
# so a backslash path such as '..\DSC2\11.1.2019...' silently becomes a different file name
master_list_filename = '../DATA/DSC2/Antobodies_tests_Master_List_2020-05-21.csv'
Negcontrols_filename = '../DATA/DSC2/COVID_controls_2020-06-15_14.14.52.csv'
#PosControls_filename = '../DATA/DSC2/COVID_controls_2020-06-15_14.14.52.csv'
# prevBatch_negControls = fileset[1]
# previous_batch_filename = fileset[13]
available_serum_filename = '../DATA/DSC2/11.1.2019-3.18.2020 Available Serum with bids.csv'
avalaible_negative_serum_filename = '../DATA/DSC2/11.1.2018-3.18.2019 Negative Serum with bids.csv'

master_list_filename, Negcontrols_filename#, PosControls_filename, prevBatch_negControls, previous_batch_filename,

In [None]:
# load master sample list
master_list = pd.read_csv(master_list_filename).iloc[:,1:]
master_list.head(3)

In [None]:
# load negative controls list
COVID_controls = pd.read_csv(Negcontrols_filename)
COVID_controls.head(3)

In [None]:
# load positive controls list
# (PosControls_filename is commented out in the filenames cell and must be set first)
positive_controls = pd.read_csv(PosControls_filename)
positive_controls.head(5)

In [None]:
# load available serums and available neg serums - both received from biobank
available_serum = pd.read_csv(available_serum_filename)
available_negative_serum = pd.read_csv(avalaible_negative_serum_filename)

----------------------------------------

In [None]:
# load previous batch's negative control list
# (prevBatch_negControls is commented out in the filenames cell and must be set first)
prevBatch_negControlsbids = pd.read_csv(prevBatch_negControls)[['biobank_id']]

In [None]:
# Load previous batch, if it exists
# (previous_batch_filename is commented out in the filenames cell and must be set first)
previous_batch_bids = pd.read_csv(previous_batch_filename)[['biobank_id']]
previous_batch_bids.head()

## Functions to perform each step of the plating strategy

*'Serum specimens will be sorted in descending order of collection dates. Following this order, the first batch of specimens will consist of approximately 5,000 specimens optimized to 68 well plating.*

*The number of specimens sent for testing in subsequent batches will depend on when the last positive specimen is found. The batches will consist of all specimens collected in the week prior to the last positive specimen until there are no positives.*

*The samples will be randomized within the state in which they were collected to reduce geographic bias in each batch.  Plate location will also be optimized for biobank pull and not randomized by freezer location.'*

### `get_plating_numbers`: Function to calculate plating numbers we need

In [None]:
# import math
# #Fixed variables (per requirements/biobank)

# raw_batch_size = 5060
# NegControls_perBatch = 150 #unrepeated
# raw_nwells = 68 #including all positive and negative controls
# PosControls_perWell = 1

# def get_plating_numbers():

#     n_plates_needed = NegControls_perBatch/2
#     negControls_perWell = (NegControls_perBatch/n_plates_needed)*2
    
#     #get the next whole number of plate needed based on inputs above
#     #n_plates_needed = math.ceil((raw_batch_size + totalNegControls_perBatch)/nwells)

#     #therefore, actual batch size needed 
#     batch_size = (raw_nwells - negControls_perWell - PosControls_perWell)*n_plates_needed
     
#     print(colored('We need: batch_size = '+str(batch_size) + ', number of plates = ' +str(n_plates_needed) + ' and number of wells (excluding the neg and pos controls) = '+str((raw_nwells - negControls_perWell - PosControls_perWell)), 'magenta'))
#     print(colored('n negative controls per well = '+str(negControls_perWell) + ', n positive controls per well = ' +str(PosControls_perWell), 'magenta'))
    

The actual number per batch is 5,060 samples from 2020: (5,060 samples + 300 negative controls) / 67 wells per plate = 80 plates per batch, where 67 = 68 wells minus 1 positive control per plate. The first analysis will be done on the first two batches of about 5,000 samples each, for a total of about 10,000.

### 6/11/2020: New plating numbers
Andrea wants:
- 5,060 samples
- 150 x 2 negative controls per batch (she says an even distribution matters less than getting at least 5k samples)
- 40 positive controls (1 per plate)
- 68 wells total
- (5060 + 300) / 67 wells = 80 plates, then add 1 positive control to each plate
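
The arithmetic in the list above can be reproduced in a few lines. This is a minimal sketch using only the figures quoted in these notes; the variable names are illustrative and are not used elsewhere in the notebook.

```python
import math

# Figures taken from the notes above
samples_per_batch = 5060
negative_control_wells = 150 * 2      # 150 unique controls, each plated twice
wells_per_plate = 68
positive_controls_per_plate = 1

# Wells left on each plate once the positive control is placed
usable_wells = wells_per_plate - positive_controls_per_plate

# Plates needed to hold every sample and every negative-control well
plates_needed = math.ceil((samples_per_batch + negative_control_wells) / usable_wells)

print(usable_wells, plates_needed)
```

Because 5,360 divides evenly by 67, no plate is left partly empty with these numbers.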

### `pull_batch()`: function to pull batch of desired size from master sample list

*'Serum specimens will be sorted in descending order of collection dates. Following this order, the first batch of specimens will consist of approximately 5,000 specimens ...'*
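
The selection rule quoted above (newest collection dates first, skipping anything already sent in an earlier batch) can be illustrated on toy data. The sketch below uses plain Python and made-up records rather than the master list, and `pull_newest` is a hypothetical helper, not a function of this notebook.

```python
from datetime import date

# Hypothetical toy records standing in for rows of the master list
master = [
    {"biobank_id": 101, "collected": date(2020, 3, 10)},
    {"biobank_id": 102, "collected": date(2020, 3, 17)},
    {"biobank_id": 103, "collected": date(2020, 2, 28)},
    {"biobank_id": 104, "collected": date(2020, 3, 15)},
]

def pull_newest(records, batch_size, previous_ids=frozenset()):
    """Return the batch_size most recently collected records not already batched."""
    remaining = [r for r in records if r["biobank_id"] not in previous_ids]
    remaining.sort(key=lambda r: r["collected"], reverse=True)
    return remaining[:batch_size]

first = pull_newest(master, batch_size=2)
second = pull_newest(master, batch_size=2,
                     previous_ids={r["biobank_id"] for r in first})
print([r["biobank_id"] for r in first], [r["biobank_id"] for r in second])
```

The second call never returns an ID from the first batch, which is the property the batch 2 checks later in the notebook rely on.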

In [None]:
def pull_batch(batch_size, previous_batch = None):  
    '''Sorts sample master file from most recent to oldest and pulls batch of first n (n = desired batch size).
       When applicable, excludes list of biobank_ids in the previous batch from new batch.
       Relies on the notebook-level variables master_list and b_number.
       '''
    
    DF = master_list[['participant_id','biobank_id','DateBloodSampleCollected', 'DateBloodSampleReceived', 'state']].drop_duplicates()
    
    # If there is a (non-empty) previous batch,
    # keep only biobank_ids that are in the master list and not in the previous batch.
    # Checking for None first avoids an AttributeError when no previous batch is passed
    if previous_batch is not None and not previous_batch.empty:
        DF = DF[~DF.biobank_id.isin(previous_batch.biobank_id)]
    
    # Sort Date of collection from most recent to oldest in the master file
    # then select the first n(batch size) rows
    new_batch = DF.sort_values('DateBloodSampleCollected', ascending= False).iloc[:batch_size,:]
    
    #new_batch = DF1.sample(n = batch_size, random_state= rand_state)
    #new_batch = pd.merge(new_batch, master_list, on = ['participant_id','biobank_id','DateBloodSampleCollected', 'DateBloodSampleReceived'],
                         #how = 'left')
    print(colored('Batch #' +str(b_number)+' of ' +str(new_batch.biobank_id.nunique()) + ' participants pulled from Master List, sorted in descending order of collection dates, is ready.', 'blue'))
    return new_batch

In [None]:
# def pull_batch(batch_size, previous_batch = None):  
#     '''Sorts sample master file from most recent to oldest and pulls batch of first n (n = desired batch size).
#        When applicable, excludes list of biobank_ids in the previous batch from new batch
#        '''
    
#     DF = master_list[['participant_id','biobank_id','DateBloodSampleCollected', 'DateBloodSampleReceived', 'state']].drop_duplicates()
    
# #     # If there is a previous batch,
# #     # keep only pids that are in the master list and not in the previous batch
# #     # Else, proceed
# #     if previous_batch:
# #         if not previous_batch.empty:
# #             keep = pd.DataFrame(set(master_list.biobank_id) - set(previous_batch.biobank_id)).rename(columns = {0:'biobank_id'}) 
# #             DF = pd.merge(DF, keep, how = 'inner')
    
# #     else:
# #         DF = DF
    
#     # Sort Date of collection from most recent to oldest in the master file
#     # then select the first n(batch size) rows
#     new_batch = DF.sort_values('DateBloodSampleCollected', ascending= False).iloc[:batch_size,:]
    
#     #new_batch = DF1.sample(n = batch_size, random_state= rand_state)
#     #new_batch = pd.merge(new_batch, master_list, on = ['participant_id','biobank_id','DateBloodSampleCollected', 'DateBloodSampleReceived'],
#                          #how = 'left')
#     print(colored('Batch #' +str(b_number)+' of ' +str(new_batch.biobank_id.nunique()) + ' participants pulled from Master List, sorted in descending order of collection dates, is ready.', 'blue'))
#     return new_batch

In [None]:
def pull_batch2(batch_size, previous_batch):  
    '''Sorts sample master file from most recent to oldest and pulls batch of first n (n = desired batch size).
       When applicable, excludes list of biobank_ids in the previous batch from new batch
       '''
    
    DF = master_list[['participant_id','biobank_id','DateBloodSampleCollected', 'DateBloodSampleReceived', 'state']].drop_duplicates()
    
    # If there is a previous batch,
    # keep only pids that are in the master list and not in the previous batch
    keep = pd.DataFrame(set(master_list.biobank_id) - set(previous_batch.biobank_id)).rename(columns = {0:'biobank_id'}) 
    DF = pd.merge(DF, keep, how = 'inner')
    
    # Sort Date of collection from most recent to oldest in the master file
    # then select the first n(batch size) rows
    new_batch = DF.sort_values('DateBloodSampleCollected', ascending= False).iloc[:batch_size,:]
    
    #new_batch = DF1.sample(n = batch_size, random_state= rand_state)
    #new_batch = pd.merge(new_batch, master_list, on = ['participant_id','biobank_id','DateBloodSampleCollected', 'DateBloodSampleReceived'],
                         #how = 'left')
    print(colored('Batch #' +str(b_number)+' of ' +str(new_batch.biobank_id.nunique()) + ' participants pulled from Master List, sorted in descending order of collection dates, is ready.', 'blue'))
    return new_batch

### `randomize_and_locate()`: Function to randomize entire batch by State and add location columns to batch

*'The samples will be randomized within the state in which they were collected to reduce geographic bias in each batch.'*
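
The requirement does not say how the within-state randomization should be implemented. One possible reading, sketched below in plain Python on made-up records, is to group samples by state and shuffle each group with a seeded generator. `shuffle_within_state` and the sample records are assumptions for illustration only.

```python
import random
from collections import defaultdict

def shuffle_within_state(records, seed):
    """Shuffle records inside each state group; states are emitted in sorted order."""
    rng = random.Random(seed)          # seeded so the shuffle is reproducible
    by_state = defaultdict(list)
    for rec in records:
        by_state[rec["state"]].append(rec)
    shuffled = []
    for state in sorted(by_state):
        group = by_state[state]
        rng.shuffle(group)             # in-place shuffle of this state's samples
        shuffled.extend(group)
    return shuffled

# Hypothetical samples: three from TN, two from AZ, one from NY
samples = [{"biobank_id": i, "state": s}
           for i, s in enumerate(["TN", "TN", "TN", "AZ", "AZ", "NY"])]
result = shuffle_within_state(samples, seed=609)
print([(r["state"], r["biobank_id"]) for r in result])
```

Using a fixed seed plays the same role as `rand_state` in the notebook: rerunning the cell yields the same order.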

In [None]:
def randomize_and_locate(batch_df, batch_size, rand_state):
    '''Function to add bay, freezer and rack location columns to pulled batch and randomly shuffle the entire batch 
        to get it ready for optimization'''
    
    # randomly shuffle the rows of the batch (reproducible through rand_state)
    batch_n_randomState = batch_df.sample(n = batch_size, random_state= rand_state) 

    # Add Location/Sequencing Columns to the batch.
    # how = 'right' keeps every sample; samples without a known location get NaN in the location columns
    batch_n_location = pd.merge(serum_samples_check, 
                           batch_n_randomState[['biobank_id','participant_id']].drop_duplicates(), how = 'right')
    
    print('Shape check:' + str(batch_n_location.shape))
 
    return batch_n_location

### `optimize_sequence()`: Function to optimize the Sequence and Group participants by plates of n size

*'Plate location will also be optimized for biobank pull and not randomized by freezer location.'*
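
Once the samples are sorted in the biobank's pull order, grouping them into plates amounts to integer division of each row's position by the plate size. A minimal sketch on made-up IDs follows; `assign_plates` is a hypothetical helper used only for illustration.

```python
def assign_plates(sorted_ids, wells_per_plate):
    """Map each ID to a 1-based plate number, filling plates in the given order."""
    return {sample_id: position // wells_per_plate + 1
            for position, sample_id in enumerate(sorted_ids)}

# Ten hypothetical IDs, already in pull order, four wells per plate
plates = assign_plates(list(range(1000, 1010)), wells_per_plate=4)
print(plates)
```

The last plate is only partly filled whenever the number of samples is not a multiple of the plate size, which is why the final plates are adjusted by hand later in the notebook.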

In [None]:
def optimize_sequence(n_wells, df):  #df = samples_groups
    '''Function to Create Optimized Sequence 
       Groups participants in plates of n_wells participants
       n_wells is the number of wells/the number of people per plate.
       Adds the 'plate' column to df in place and returns df'''

    # SEQUENCE OPTIMIZATION:
     ## Order the batch by Description (Bay, Freezer), Sequence and Rack (optimum sequence order provided by biobank)
     ## Then assign plate numbers to each participant, starting with plate #1
     ## with n_wells participants per plate 
    ordered_index = df.sort_values(['Description', 'Sequence','Rack']).index

    # Sort once, then give the k-th row in that order plate number k // n_wells + 1
    df['plate'] = 0
    df.loc[ordered_index, 'plate'] = [k // n_wells + 1 for k in range(len(ordered_index))]
        
    display(df[['participant_id','plate']].rename(columns = {'participant_id':'pids_per_plate'}).groupby('plate').count())
    print(colored('Done! We have ' +str(n_wells)+' wells per plate, with ' +str(df.plate.nunique()) + ' plates in total. The last few plates may have different numbers', 'blue'))
    
    return  df

### `batch_negative_controls()`: Function to add negative controls, including repeated pids 
*'2019 matched to date range of the batch with geographic location matching as well'*

In [None]:
## 6/11/2020

## For now the number of negative controls per plate is hard-coded, because the new numbers do not divide evenly
## I will come up with an automated way to generate the numbers below later

def batch_negative_controls(ramdomized_sample_df, batch_df, negControls_master, unrepeated_negControls, 
                          stop1, number_plates, rand_state):
    '''Selects the batch negative controls, then appends a repeated list of the same pids so each control is plated twice.
       Plates 1..stop1 receive four control wells and plates stop1+1..number_plates receive two.
       Adds plate, negative control, positive control and location columns to the result.
       ramdomized_sample_df is currently unused'''
        
    # match negative controls on state
    neg_controls_df= pd.merge(negControls_master[['biobank_id', 'participant_id','state']].drop_duplicates(), 
                                batch_df[['state']].drop_duplicates()).drop('state', axis = 1).drop_duplicates()

    ## randomly select the batch negative controls
    batch_neg_controls_df = neg_controls_df.sample(n = int(unrepeated_negControls), 
                                                random_state = rand_state).reset_index(drop = True)
 
    ### Assign negative controls to plates: 1..stop1, 1..stop1 again, then stop1+1..number_plates
    plate_numbers = list(range(1,stop1+1))*2 + list(range(stop1+1,number_plates+1))
    
    batch_controls = batch_neg_controls_df.sort_values('biobank_id', ascending = True)
    batch_controls['plate'] = plate_numbers

    # second copy of the same controls, in reverse sampling order
    repeat = batch_neg_controls_df.iloc[::-1].dropna()
    repeat['plate'] = plate_numbers
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    batch_controls_repeated = pd.concat([batch_controls, repeat])
      
    batch_controls_repeated['negative control'] = 'Yes'
    batch_controls_repeated['positive control'] = 'No'
    batch_controls_repeated['location'] = 'No'
    
    return batch_controls_repeated

In [None]:
# def add_negative_controls(ramdomized_sample_df, batch_df, negControls_master, unrepeated_negControls, 
#                           number_plates, rand_state):
#     ''' This function selects batch negative controls, then appends a repeated list of pids 
#         Then merges with the samples from 2020. This will be the input for the optimization function'''
    
# #     df['plate']= int()
# #     batch_controls = df.sort_values('biobank_id', ascending = True)
# #     batch_controls['plate'][:number_plates] = range(1,number_plates+1)
# #     batch_controls['plate'][number_plates:]= range(1,number_plates+1)

# #     repeat = batch_controls.iloc[::-1].dropna()#.reset_index().drop('index', axis = 1)
# #     repeat['plate'][:number_plates]= range(1,number_plates+1)
# #     repeat['plate'][number_plates:]= range(1, number_plates+1)
 
# #     batch_controls_repeated = batch_controls.append(repeat)
      
# #     batch_controls_repeated['negative control'] = 'Yes'
# #     batch_controls_repeated['positive control'] = 'No'
# #     batch_controls_repeated['location'] = 'No'

#     #checks
#     print('unique pids:' + str(sample_and_negControls.biobank_id.nunique()))
#     print('shape:' + str(sample_and_negControls.shape))
    
#     return sample_and_negControls

In [None]:
300/80
#combination of negative controls per plate that give 300 total (with repeats)
3*60 + 30*4,

3*79 + 63*1, 64*79 + 4*1

3*78 + 33*2, 64*78 + 34*2
#(63+4)*70 + 

### `batch_positive_controls()`: Function to add positive controls


*'Inclusion of 1 positive control per plate taken from Vanderbilt contributed samples that have prior titer quantification in duplicate.
45 samples available from Vanderbilt, need 40 per batch, will choose randomly across low/med/high and use same 40 in both batches so each will be run 4 times using total volume 1600 uL.'*
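
The numbers in the quote can be checked with a small simulation: 80 plates need 80 positive-control wells, so 40 distinct samples are each used on two plates per batch. The sketch below is an illustration in plain Python with made-up control IDs; it draws uniformly at random and does not stratify by low/med/high, and `pick_positive_controls` is a hypothetical helper.

```python
import math
import random

def pick_positive_controls(control_ids, n_plates, seed):
    """Pick ceil(n_plates / 2) controls at random and use each on two plates."""
    rng = random.Random(seed)
    n_unique = math.ceil(n_plates / 2)
    chosen = rng.sample(control_ids, n_unique)
    per_plate = (chosen + chosen)[:n_plates]   # trim when n_plates is odd
    return dict(zip(range(1, n_plates + 1), per_plate))

# 45 hypothetical control IDs, 80 plates per batch
controls = [f"PC{i:02d}" for i in range(1, 46)]
plate_to_control = pick_positive_controls(controls, n_plates=80, seed=1)
print(len(plate_to_control), len(set(plate_to_control.values())))
```

Reusing the same 40 samples in the second batch gives the four runs per sample mentioned in the quote.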



In [None]:
38+37

In [None]:
import math

def batch_positive_controls(pos_controls_df, numb_plates, rand_state):
    '''Function to get positive controls: one per plate, each selected sample used on two plates'''

    #df = optimized_seq['positive control'] = 'No'
    
    # Determine how many unique samples we need (each one is plated twice)
    unique_samples = math.ceil(numb_plates/2)
    
    # choose n samples from positive controls, randomly from high/med/low
    pos_controls = pos_controls_df[['Sample ID', 'Positive Classificaiton', 'serology_id']].drop_duplicates()
     
    pos_controls = pos_controls.sample(n = int(unique_samples), random_state= rand_state)
    # repeat the selection and trim to one row per plate (the trim matters when numb_plates is odd)
    pos_controls = pd.concat([pos_controls, pos_controls]).iloc[:numb_plates].copy()
    display(pos_controls.shape)
    
    ## QC - check volume
    check = pd.merge(pos_controls, pos_controls_df[['Sample ID', 'Total vol (ml)']].drop_duplicates())
    print('checking batch positive control volume:')
    display(check['Total vol (ml)'].sum())

    pos_controls['positive control'] = 'Yes'
    pos_controls['negative control'] = 'No'
    pos_controls['location'] = 'No'
    # use the numb_plates argument rather than the notebook-level number_of_plates
    pos_controls['plate'] = range(1,numb_plates+1)
  

    return pos_controls

### `final_deliverable()`: Function to get final deliverable
Put the sample, positive controls and negative controls together

In [None]:
def final_deliverable(optimized_seq, neg_controls, pos_controls):
    '''Function to add all together and return final deliverable'''

    # work on a copy so the caller's dataframe is not modified
    serology_samples_by_plate = optimized_seq.copy()
    serology_samples_by_plate['negative control'] = 'No'
    serology_samples_by_plate['positive control'] = 'No'
    # 'Yes' when the biobank file gave a storage location (Description), 'No' otherwise
    has_location = serology_samples_by_plate.Description.notna()
    serology_samples_by_plate['location'] = has_location.map({True: 'Yes', False: 'No'})

    # samples with a location first, then those without
    serology_samples_by_plate = pd.concat([serology_samples_by_plate[has_location],
                                           serology_samples_by_plate[~has_location]])
    
    Serology_final_dataset = pd.concat([serology_samples_by_plate, neg_controls, pos_controls], sort=True)
    
    Serology_final_dataset = Serology_final_dataset[['biobank_id', 'serology_id', 'Sample ID', 'Description', 'Position in Rack', 'Rack', 'Sequence',
                        'location', 'negative control', 'positive control','plate']]
    
    return Serology_final_dataset

### `get_dist()`: Function to Check the distributions of datasets by a specified variable

In [None]:
def get_dist(DF, dist_var, group = None, groupnumber = None):
    '''group determines whether the distribution is to be done by plate or on the entire dataset.
       dist_var is the variable for which we check the distribution -- ie race, state 
       if group is 'plate', then, group number is the plate number'''
            
    # restrict to one group (e.g. one plate) when requested
    if group is not None:
        DF = DF[DF[group] == groupnumber]

    # count each participant once per value of dist_var
    counts = DF[['participant_id', dist_var]].drop_duplicates()[dist_var].value_counts()
    # name the count column explicitly; the default name differs between pandas versions
    df = counts.to_frame(name = dist_var)
    df['%ofTheGroup'] = (counts/DF.participant_id.nunique())*100
            
    return df
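
The calculation performed by `get_dist()` can be shown on a handful of made-up rows. The sketch below is a plain-Python illustration of the same idea (count each participant once, then express counts as a share of unique participants); `distribution` and the sample rows are hypothetical.

```python
from collections import Counter

def distribution(records, key):
    """Return {value: (count, percent)} for `key`, counting each participant once."""
    unique = {(r["participant_id"], r[key]) for r in records}
    counts = Counter(value for _, value in unique)
    n_participants = len({pid for pid, _ in unique})
    return {v: (c, 100 * c / n_participants) for v, c in counts.items()}

rows = [
    {"participant_id": 1, "state": "TN"},
    {"participant_id": 1, "state": "TN"},   # duplicate row, counted once
    {"participant_id": 2, "state": "AZ"},
    {"participant_id": 3, "state": "TN"},
    {"participant_id": 4, "state": "NY"},
]
dist = distribution(rows, "state")
print(dist)
```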

# EXECUTE THE PLATING STRATEGY

## Plating Numbers we need
Uses `get_plating_numbers()`

In [None]:
#get_plating_numbers()

In [None]:
# # To be used for rest of code
# b_size = 4725
# number_of_plates = 75
# n_wells = 63

In [None]:
# new 06/11/2020
# To be used for rest of code
import math
b_size = 5060
number_of_plates = 80
nwells = math.floor(b_size/number_of_plates) #samples per plate, excluding controls
NegControls_perBatch = 150 #unrepeated
nwells

In [None]:
# FOR NOW WILL USE HARD CODED NUMBERS since the new numbers are not even
# i will come up with a function to calculate these later

#combination sample+ negative controls per plate that works (with repeats)

#plates 1 to 70 = 63 samples and 4 neg controls
#plates 71 to 80 = 65 samples and 2 neg controls

63*70 + 65*10 + 4*70 + 2*10
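
The hard-coded combination above can be sanity-checked with plain arithmetic: every plate should fill all 68 wells once its positive control is counted, and the totals should match the 5,060 samples and 300 negative-control wells. The sketch below only restates the numbers from the comments; the `layout` structure is illustrative.

```python
# Layout from the comments above:
# (first plate, last plate, samples per plate, negative-control wells per plate)
layout = [
    (1, 70, 63, 4),
    (71, 80, 65, 2),
]
positive_controls_per_plate = 1

total_samples = sum((last - first + 1) * samples
                    for first, last, samples, _ in layout)
total_negative_wells = sum((last - first + 1) * neg
                           for first, last, _, neg in layout)
# Set of well counts per plate; a single value means every plate is equally full
wells_used = {samples + neg + positive_controls_per_plate
              for _, _, samples, neg in layout}

print(total_samples, total_negative_wells, wells_used)
```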


## BATCH 1

### Pull batch 1
Uses function `pull_batch()`

*'Serum specimens will be sorted in descending order of collection dates. Following this order, the first batch of specimens will consist of approximately 5,000 specimens optimized to 68 well plating'*

In [None]:
### Which batch number is this (that we deliver to the biobank?)
b_number = 1
#b_size = 4725

batch1 = pull_batch(batch_size = b_size, previous_batch = None)
batch1.head()

In [None]:
batch1.state.drop_duplicates().sort_values()
#batch_n2.state.nunique()

##### Quick checks + Save batch to drive for documentation

In [None]:
#Check dates of specimen collection and number of unique biobank_ids
batch1.participant_id.nunique(), batch1['DateBloodSampleCollected'].max(), batch1['DateBloodSampleCollected'].min()


In [None]:
batch_filename = '../DATA/DSC2/Serology_batch' +str(b_number)+'_'+now+'.csv'
batch_filename

In [None]:
batch1.to_csv(batch_filename, index = False)

### Randomize by State and add Location to batch
Uses `randomize_and_locate()`

In [None]:
## double check n participants in batch
batch1.participant_id.nunique()

In [None]:
samples_groups = randomize_and_locate(batch_df = batch1, batch_size = b_size, rand_state = 609)
samples_groups.head()

### Sequence Optimization
Uses `optimize_sequence()`

*'...the first batch of specimens will consist of approximately 5,000 specimens optimized to 68 well plating... '*

#### **NEW 07/01/2020** New plating numbers have been decided. But we want to use the same samples and pos/neg controls. 
So, I will simply load them from the previous batch

In [None]:
batch1 = pd.read_csv('../DATA/DSC2/Quest_Serology_batch1_2020-06-16_13.07.28.csv')
batch1.head()

In [None]:
batch1.columns

In [None]:
batch1 = batch1[['biobank_id', 'Description', 'Position in Rack', 'Rack',
       'Sequence', 'location']].drop_duplicates()

batch1

In [None]:
pd.merge(available_serum, batch1, on ='biobank_id', how= 'right')

In [None]:
COVID_controls.biobank_id.nunique()

In [None]:
#nwells defined in 2.1 #excludes both positive and negative control wells for now

samples_groups_optimized = optimize_sequence(df = samples_groups, n_wells = nwells)
samples_groups_optimized.sort_values('plate')

In [None]:
## View plates with no location
samples_groups_optimized[samples_groups_optimized.Description.isna()].plate.drop_duplicates()

In [None]:
# To achieve the combination stated in 2.1, I will manually modify the last 10 plates
# we need 65 samples in each

In [None]:
samples_groups_optimized2 = samples_groups_optimized[samples_groups_optimized.plate <=70]
samples_groups_optimized2.sort_values('plate').tail()

In [None]:
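# .copy() so the plate edits below act on an independent dataframe, not a view of the original
samples_groups_optimizedmod = samples_groups_optimized[samples_groups_optimized.plate>=71].copy()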
samples_groups_optimizedmod.loc[samples_groups_optimizedmod.plate == 81, 'plate'] = 80
samples_groups_optimizedmod.sort_values('plate').tail()

In [None]:
# shuffle the samples of the last plates, then deal them out 65 at a time to plates 71 to 80
# (10 plates x 65 samples = 650 rows expected)
samples_groups_optimizedmod = samples_groups_optimizedmod.sample(n = len(samples_groups_optimizedmod), random_state=512)

samples_groups_optimizedmod['plate'] = [min(71 + k // 65, 80) for k in range(len(samples_groups_optimizedmod))]
  
    # samples_groups_optimizedmod['plate'][n:n*2] = 78
# samples_groups_optimizedmod['plate'][n*2:n*3] = 79
# samples_groups_optimizedmod['plate'][n*3:] = 80

In [None]:
samples_groups_optimizedmod.sort_values('plate')

In [None]:
samples_groups_optimized2 = pd.concat([samples_groups_optimized2, samples_groups_optimizedmod])
samples_groups_optimized2[['biobank_id','plate']].groupby('plate').nunique()

In [None]:
samples_groups_optimized2.biobank_id.nunique()

### Add Negative Controls
Uses `batch_negative_controls()`

In [None]:
# combination stated in 2.1 (matches stop1 = 70 in the call below),
#plates 1 to 70 = 63 samples and 4 neg controls
#plates 71 to 80 = 65 samples and 2 neg controls


In [None]:
#  #match negative controls on state
# batch_controls_df= pd.merge(COVID_controls[['biobank_id', 'state']].drop_duplicates(), batch_n[['state']].drop_duplicates()).drop('state', axis = 1).drop_duplicates()#.reset_index()

# number_of_plates, NegControls_perBatch#, repeat_negControls

#  ## randomly select the batch negative controls
# batch_controls_df2 = batch_controls_df.sample(n = int(NegControls_perBatch), random_state = 610).reset_index().drop('index', axis = 1)
# batch_controls_df2

In [None]:
batch_negControls = batch_negative_controls(ramdomized_sample_df = samples_groups_optimized2,#samples_groups, 
                                            batch_df = batch1, 
                                               negControls_master = COVID_controls, 
                                               unrepeated_negControls = NegControls_perBatch,
                                               stop1 = 70, number_plates = number_of_plates, rand_state = 6115)

batch_negControls

In [None]:
# attach collection dates to the selected negative controls, then check their date range
neg_control_dates = pd.merge(batch_negControls, COVID_controls[['biobank_id','DateBloodSampleCollected']])
neg_control_dates.DateBloodSampleCollected.max(), neg_control_dates.DateBloodSampleCollected.min()

In [None]:
# negative_controls_df = batch_negative_controls(df = batch_controls_df2, unrepeated_negControls = NegControls_perBatch, 
#                                                number_plates = number_of_plates)
# negative_controls_df

##### QC AND FIX

In [None]:
# Check that we have 4 DISTINCT PIDS per plate, even with the repeats
check = batch_negControls[['biobank_id','plate']].groupby('plate').nunique()#.drop('plate', axis = 1)
check.sort_values('biobank_id', ascending = False).tail(15)
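
Because each control is plated twice, the same ID can land on one plate more than once. The check above counts distinct IDs per plate; the same idea is sketched below in plain Python on made-up assignments. `plates_with_repeated_ids` is a hypothetical helper, and the expected count would be 2 rather than 4 for the last ten plates.

```python
from collections import defaultdict

def plates_with_repeated_ids(assignments, expected_distinct):
    """Return plates whose number of distinct control IDs differs from expected_distinct."""
    ids_by_plate = defaultdict(set)
    for biobank_id, plate in assignments:
        ids_by_plate[plate].add(biobank_id)
    return sorted(p for p, ids in ids_by_plate.items()
                  if len(ids) != expected_distinct)

# Hypothetical (biobank_id, plate) pairs: plate 2 holds ID 21 twice
assignments = [(11, 1), (12, 1), (13, 1), (14, 1),
               (21, 2), (22, 2), (21, 2), (23, 2)]
flagged = plates_with_repeated_ids(assignments, expected_distinct=4)
print(flagged)
```

A flagged plate is the kind of case the commented-out manual fix further down was written for.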

In [None]:
#batch_negControls[batch_negControls.plate == 38]

In [None]:
# Controlsbids_notIN_batchcontrols = pd.DataFrame(set(batch_controls_df.biobank_id) - set(negative_controls_df.biobank_id))
# Controlsbids_notIN_batchcontrols.sample(2, random_state = 1)

In [None]:
# batch_negControls is the dataframe returned by batch_negative_controls() above
batch_negControls.to_csv('../DATA/DSC2/batch1_negative_controls_'+now+'.csv')

In [None]:
# #negative_controls_df.loc[negative_controls_df['plate'] == 38, 'biobank_id'] = 687141743
# #check again

# # Check that we have 4 DISTINCT PIDS per plate, even with the repeats
# check = negative_controls_df[['biobank_id','plate']].groupby('plate').nunique()#.drop('plate', axis = 1)
# check.sort_values('biobank_id', ascending = False)#.tail(15)


# #yay!!

### Add Positive Controls

In [None]:
now

In [None]:
positive_controls.head()

In [None]:
positive_controls_df = batch_positive_controls(positive_controls, rand_state = 511419, numb_plates = number_of_plates)
positive_controls_df = positive_controls_df[['Sample ID','serology_id','positive control', 'negative control', 'plate']]
positive_controls_df

In [None]:
positive_controls_df.to_csv('../DATA/DSC2/positive_controls_batch'+str(1)+'_'+now+'.csv')

In [None]:
positive_controls_df[['serology_id','plate']].groupby('plate').count().sort_values('serology_id')

### QC Only: Check Distributions of Batches and Plates

In [None]:
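# batch_controls_repeated only exists inside batch_negative_controls(); its returned value is batch_negControls
batch_controls_demog = pd.merge(COVID_controls, batch_negControls[['biobank_id']], how = 'inner')#.drop('Unnamed: 0', axis =1)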
batch_controls_demog.head()

for v in ['Race','state']:
    display(get_dist(DF = batch_controls_demog, dist_var = v, group = None, groupnumber = None))

#### Whole Batch

In [None]:
## Checking the distribution
batch_n_demog = pd.merge(samples_groups, master_list)
batch_n_demog.participant_id.nunique()

In [None]:
for v in ['race','state']:
    display(get_dist(DF = batch_n_demog, dist_var = v, group = None, groupnumber = None))

#### Distribution Per Plate

In [None]:
gr = 'plate'
DF = batch_n_demog

# The context manager saves and closes the workbook when the block ends
# (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('../DATA/DSC2/sample_distributions_plates'+now+'.xlsx') as writer:
    for g in DF[gr].drop_duplicates():
        #print(g)    
        for v in ['state']: #['race','state']:
            distribution = get_dist(DF, dist_var = v, group = gr, groupnumber = g)
            display(distribution)
            distribution.to_excel(writer, sheet_name = str(v)+'_'+str(gr)+str(g))

### FINAL DELIVERABLE
  

In [None]:
Serology_batch_final = final_deliverable(samples_groups_optimized2, batch_negControls, positive_controls_df)

In [None]:
Serology_batch_final.sort_values('plate')

In [None]:
Serology_batch_final[['biobank_id','serology_id','plate']].drop_duplicates().sort_values('plate').groupby('plate').count()

In [None]:
Serology_batch_final_DRC = Serology_batch_final
Serology_batch_final_DRC

In [None]:
Serology_batch_final_DRC.columns

In [None]:
Serology_batch_final_quest = Serology_batch_final[['biobank_id', 'serology_id', 'Description', 'Position in Rack', 'Rack',
       'Sequence', 'location', 'plate']]

##### last checks 

In [None]:
# Serology_batch1_0518 = pd.read_csv('Serology_batch1_0518.csv')
# batch1_SerumAvailability = pd.read_csv('Serology_batch1_SerumAvailability.csv')
# batch1_SerumAvailability['biobank_id'] = [batch1_SerumAvailability.loc[x,'Subject Description'][1:] for x in batch1_SerumAvailability.index]
# batch1_SerumAvailability['biobank_id'] = [int(x) for x in batch1_SerumAvailability['biobank_id']]

#len(set(Serology_final_dataset.biobank_id) - set(batch1_SerumAvailability.biobank_id)), len(set(SERUM_SAMPLES_BYPLATE_DEMOG.biobank_id) - set(batch1_SerumAvailability.biobank_id))

#### Save Final Deliverable

In [None]:
batch_number = 1

In [None]:
Serology_batch_final_DRC.to_csv('../DATA/DSC2/DRC_Serology_batch'+str(batch_number)+'_'+now+'.csv', index = False)

#for Quest, must mask positive and negative controls
Serology_batch_final_quest.to_csv('../DATA/DSC2/Quest_Serology_batch'+str(batch_number)+'_'+now+'.csv', index = False)


## BATCH 2

### Pull batch 2
Uses function `pull_batch2()`

*'Serum specimens will be sorted in descending order of collection dates. Following this order, the first batch of specimens will consist of approximately 5,000 specimens optimized to 68 well plating'*

In [None]:
### Which batch number is this (that we deliver to the biobank?)
b_number = 2
#b_size = 4725

# batch1 is the batch 1 dataframe defined above (batch_n is not defined in this notebook)
batch_n2 = pull_batch2(batch_size = b_size, previous_batch = batch1)
batch_n2.head()

In [None]:
# batch 1 IDs that are not in batch 2 (should be all of them); biobank_id is present in both dataframes
set(batch1.biobank_id) - set(batch_n2.biobank_id) 

##### Quick checks + Save batch to drive for documentation

In [None]:
#Check dates of specimen collection and number of unique biobank_ids
batch_n2.participant_id.nunique(), batch_n2['DateBloodSampleCollected'].max(), batch_n2['DateBloodSampleCollected'].min()


In [None]:
## check that the biobank ids are not the same as in the first batch
len(set(batch_n2.biobank_id) - set(batch1.biobank_id))

In [None]:
batch_filename = '../DATA/DSC2/Serology_batch' +str(b_number)+'_'+now+'.csv'
batch_filename

In [None]:
batch_n2.to_csv(batch_filename, index = False)

### Randomize by State and add Location to batch
Uses `randomize_and_locate()`

In [None]:
## double check n participants in batch
batch_n2.participant_id.nunique()

In [None]:
samples_groups2 = randomize_and_locate(batch_df = batch_n2, batch_size = b_size, rand_state = 611444)
samples_groups2.head()

### Sequence Optimization
Uses `optimize_sequence()`

*'...the first batch of specimens will consist of approximately 5,000 specimens optimized to 68 well plating... '*

#### **NEW 07/01/2020** New plating numbers have been decided. But we want to use the same samples and pos/neg controls. 
So, I will simply load them from the previous batch

In [None]:
batch2 = pd.read_csv('../DATA\\DSC2\\Quest_Serology_batch2_2020-06-16_13.07.28.csv')

In [None]:
samples_groups_optimizedb2 = optimize_sequence(df = samples_groups2, n_wells = nwells)
samples_groups_optimizedb2.sort_values('plate')

In [None]:
## View plates with no location
samples_groups_optimizedb2[samples_groups_optimizedb2.Description.isna()].plate.drop_duplicates()

In [None]:
# To achieve the combination stated in 2.1, I will manually modify the last 10 plates
# we need 65 samples in each

In [None]:
samples_groups_optimizedb22 = samples_groups_optimizedb2[samples_groups_optimizedb2.plate <=70]
samples_groups_optimizedb22.sort_values('plate').tail()

In [None]:
samples_groups_optimizedmodb2 = samples_groups_optimizedb2[samples_groups_optimizedb2.plate>=71]
samples_groups_optimizedmodb2.loc[samples_groups_optimizedmodb2.plate == 81, 'plate'] = 80
samples_groups_optimizedmodb2.sort_values('plate').tail()

In [None]:
samples_groups_optimizedmodb2 = samples_groups_optimizedmodb2.sample(n = len(samples_groups_optimizedmodb2), random_state=512)

n_per_plate = 65
plate_col = samples_groups_optimizedmodb2.columns.get_loc('plate')

# Fill plates 71-79 front to back: each pass labels every row from its start position onward,
# and the next pass overwrites all but the first 65 rows of that span
for i, p in enumerate(range(71, 80)):
    samples_groups_optimizedmodb2.iloc[i*n_per_plate:, plate_col] = p

# The final 65 rows become plate 80
samples_groups_optimizedmodb2.iloc[-n_per_plate:, plate_col] = 80

In [None]:
samples_groups_optimizedmodb2.sort_values('plate')
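
The re-plating idea (shuffle, then label consecutive groups of 65 rows) can also be written without a loop. This is a minimal sketch on toy data; `assign_plates` is a hypothetical helper, and it assumes the number of rows is an exact multiple of the plate size.

```python
import numpy as np
import pandas as pd

def assign_plates(df, first_plate, per_plate, random_state=None):
    """Shuffle rows, then label consecutive chunks of `per_plate` rows with plate numbers."""
    shuffled = df.sample(frac=1, random_state=random_state).reset_index(drop=True)
    # Row position // chunk size gives 0, 0, ..., 1, 1, ...; offset by the first plate number.
    shuffled["plate"] = first_plate + np.arange(len(shuffled)) // per_plate
    return shuffled

toy = pd.DataFrame({"biobank_id": range(1000, 1650)})  # 650 toy samples
replated = assign_plates(toy, first_plate=71, per_plate=65, random_state=512)
plate_sizes = replated.groupby("plate")["biobank_id"].nunique()
```

With 650 rows this yields plates 71 through 80 with 65 distinct IDs each, which is easy to confirm from `plate_sizes`.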

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
samples_groups_optimizedb22 = pd.concat([samples_groups_optimizedb22, samples_groups_optimizedmodb2])
samples_groups_optimizedb22[['biobank_id','plate']].groupby('plate').nunique()

### Add Negative Controls
Uses `batch_negative_controls()`

In [None]:
# #Make sure batch2 negative controls are not in batch 1 negative controls
COVID_controls_keep = pd.DataFrame(set(COVID_controls.biobank_id) - set(batch_negControls.biobank_id)).rename(columns = {0:'biobank_id'}) 
COVID_controls_keep = pd.merge(COVID_controls[['biobank_id', 'participant_id','state', 'DateBloodSampleCollected']].drop_duplicates(),
                               COVID_controls_keep, how = 'inner')

In [None]:
batch_negControls2 = batch_negative_controls(ramdomized_sample_df = samples_groups_optimizedb22, batch_df = batch_n2, 
                                               negControls_master = COVID_controls_keep, 
                                               unrepeated_negControls = NegControls_perBatch,
                                               stop1 = 70, number_plates = number_of_plates, rand_state = 6111140)

batch_negControls2

##### QC AND FIX

In [None]:
neg_control_dates2 = pd.merge(batch_negControls2, COVID_controls[['biobank_id','DateBloodSampleCollected']])
neg_control_dates2.DateBloodSampleCollected.max(), neg_control_dates2.DateBloodSampleCollected.min()

In [None]:
#Make sure batch2 negative controls are not in batch 1 negative controls
len(set(batch_negControls.biobank_id) - set(batch_negControls2.biobank_id))

#yay!

In [None]:
# Check that we have 4 DISTINCT PIDS per plate, even with the repeats
check2 = batch_negControls2[['biobank_id','plate']].groupby('plate').nunique()#.drop('plate', axis = 1)
check2.sort_values('biobank_id', ascending = False)#.tail(15)
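
Rather than scanning the sorted table, the same check can return only the plates that break the rule. A minimal sketch on toy data; `plates_missing_controls` is a hypothetical helper and the expected count of 4 is taken from the comment above.

```python
import pandas as pd

def plates_missing_controls(controls, expected, plate_col="plate", id_col="biobank_id"):
    """Return plates whose count of distinct control IDs differs from `expected`."""
    per_plate = controls.groupby(plate_col)[id_col].nunique()
    return per_plate[per_plate != expected]

toy_controls = pd.DataFrame({
    "plate":      [1, 1, 1, 1, 2, 2, 2, 2],
    "biobank_id": [11, 12, 13, 14, 21, 22, 23, 23],  # plate 2 repeats an ID
})
problems = plates_missing_controls(toy_controls, expected=4)
```

An empty `problems` Series would mean every plate has exactly the expected number of distinct controls.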

In [None]:
# Controlsbids_notIN_batchcontrols = pd.DataFrame(set(batch_controls_df.biobank_id) - set(negative_controls_df.biobank_id))
# Controlsbids_notIN_batchcontrols.sample(2, random_state = 1)

In [None]:
batch_negControls2.to_csv('../DATA/DSC2/batch2_negative_controls_'+now+'.csv')

### Add Positive Controls

'*Positive controls are one per plate run in duplicate per 5000 and the run again in next 5000, randomly selected across their titer levels (see below)*'

In [None]:
## USE SAME ONES AS BATCH 1

namewas = '../DATA/DSC2/positive_controls_batch'+str(1)+'_'+now+'.csv'
namewas

In [None]:
positive_controls_df =pd.read_csv(namewas).iloc[:, 1:]
positive_controls_df

In [None]:
positive_controls_df[['serology_id','plate']].groupby('plate').count().sort_values('serology_id')

### QC Only: Check Distributions of Batches and Plates

In [None]:
batch_controls_demog = pd.merge(COVID_controls, batch_controls_repeated[['biobank_id']], how = 'inner')#.drop('Unnamed: 0', axis =1)
batch_controls_demog.head()

for v in ['Race','state']:
    display(get_dist(DF = batch_controls_demog, dist_var = v, group = None, groupnumber = None))

#### Whole Batch

In [None]:
## Checking the distribution
batch_n_demog = pd.merge(samples_groups, master_list)
batch_n_demog.participant_id.nunique()

In [None]:
for v in ['race','state']:
    display(get_dist(DF = batch_n_demog, dist_var = v, group = None, groupnumber = None))

#### Distribution Per Plate

In [None]:
gr = 'plate'
DF = batch_n_demog

writer = pd.ExcelWriter('../DATA/DSC2/sample_distributions_plates'+now+'.xlsx')
for g in DF[gr].drop_duplicates():
    #print(g)    
    for v in ['state']: #['race','state']:
        distribution = get_dist(DF, dist_var = v, group = gr, groupnumber = g)
        display(distribution)
        distribution.to_excel(writer, sheet_name = str(v)+'_'+str(gr)+str(g))

# ExcelWriter.save() was removed in pandas 2.0; close() writes the file
writer.close()

### FINAL DELIVERABLE
  

In [None]:
Serology_batch2_final2 = final_deliverable(samples_groups_optimizedb22, batch_negControls2, positive_controls_df)

In [None]:
Serology_batch2_final2.sort_values('plate')

In [None]:
Serology_batch2_final2[['biobank_id','serology_id','plate']].drop_duplicates().sort_values('plate').groupby('plate').count()

In [None]:
Serology_batch2_final_DRC2 = Serology_batch2_final2
Serology_batch2_final_DRC2

In [None]:
Serology_batch2_final_DRC2.biobank_id.nunique(), Serology_batch2_final_DRC2.serology_id.count()

In [None]:
Serology_batch2_final_quest2 = Serology_batch2_final2[['biobank_id', 'serology_id', 'Description', 'Position in Rack', 'Rack',
       'Sequence', 'location', 'plate']]

##### last checks 

In [None]:
len(set(Serology_batch_final_quest.biobank_id) - set(Serology_batch2_final_quest2.biobank_id))

#### Save Final Deliverable

In [None]:
batch_number = 2

In [None]:
Serology_batch2_final_DRC2.to_csv('../DATA/DSC2/DRC_Serology_batch'+str(batch_number)+'_'+now+'.csv', index = False)

# For Quest, positive and negative controls must be masked
Serology_batch2_final_quest2.to_csv('../DATA/DSC2/Quest_Serology_batch'+str(batch_number)+'_'+now+'.csv', index = False)


In [None]:
5290+150

# Random

In [None]:
positive_controls = pd.read_csv('../DATA\\DSC2\\VandiCOVIDPositiveControls_2020-06-10_11.38.52.csv')

In [None]:
batch1_2_posControls = pd.read_csv('../DATA/DSC2/positive_controls_batch1_2020-06-10_23.51.57.csv')

In [None]:
remaining_posControls = pd.DataFrame(set(positive_controls['Sample ID']) - set(batch1_2_posControls['Sample ID']))
remaining_posControls.columns = ['Sample ID']
remaining_posControls = pd.merge(remaining_posControls, positive_controls)

In [None]:
remaining_posControls['Total vol (ml)'].sum()

In [None]:
16*22, 200/22, 400/22

In [None]:
10*5, 10*22

In [None]:
19*5, 

# **NEW** June 15, 2020 - Removing 538648580 from the negative control list and replacing it with a new one.
Also add the Sample ID column back to the batches and save the latest batch files.
The replacement must not be in the batch 1 or batch 2 negative controls.

In [None]:
#fileset[1], fileset[3]

In [None]:
batch1_neg_controls = pd.read_csv(fileset[1])
batch2_neg_controls = pd.read_csv(fileset[3])

In [None]:
COVID_controls.head()

In [None]:
batch1_neg_controls_BIDS = batch1_neg_controls[['biobank_id']]
batch2_neg_controls_BIDS = batch2_neg_controls[['biobank_id']]

In [None]:
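# set() must be taken on the biobank_id column; set(DataFrame) would only contain the column name
rep_BID = pd.DataFrame(sorted(set(COVID_controls.biobank_id) - set(batch1_neg_controls_BIDS.biobank_id) - set(batch2_neg_controls_BIDS.biobank_id)))
rep_BID.columns = ['biobank_id']
# show the candidate replacements together with their state
rep_BID.merge(COVID_controls[['biobank_id','state']].drop_duplicates())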
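
When excluding IDs with set arithmetic, note that `set()` applied to a DataFrame collects its column labels, not its values. A small self-contained illustration with toy data (all names here are illustrative):

```python
import pandas as pd

used = pd.DataFrame({"biobank_id": [1, 2]})
all_ids = {1, 2, 3}

# Iterating a DataFrame yields its column labels, so nothing is removed here.
wrong = all_ids - set(used)
# Passing the column (a Series) yields the values, which is what we want.
right = all_ids - set(used["biobank_id"])
```

Only the second form actually removes the IDs that have already been used.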

In [None]:
COVID_controls[COVID_controls.biobank_id == 347865106]

##### Replacing id in batch 1 and adding the Sample Ids back

In [None]:
batch1 = pd.read_csv('../DATA/DSC2/Quest_Serology_batch1_2020-06-10_23.51.57.csv')
batch2 = pd.read_csv('../DATA/DSC2/Quest_Serology_batch2_2020-06-10_23.51.57.csv')

In [None]:
batch1.shape, batch2.shape, 

In [None]:
batch1 = batch1.merge(positive_controls[['serology_id','Sample ID']].drop_duplicates(), how = 'outer')
batch2 = batch2.merge(positive_controls[['serology_id','Sample ID']].drop_duplicates(), how = 'outer')

In [None]:
## REPLACING 538648580 in batch 1 with 347865106
batch1.loc[batch1.biobank_id == 538648580, 'biobank_id'] = 347865106

In [None]:
batch1[batch1.biobank_id == 347865106 ]

In [None]:
batch2[batch2.biobank_id == 347865106]

In [None]:
# For Quest, positive and negative controls must be masked
batch2.to_csv('../DATA/DSC2/Quest_Serology_batch'+str(2)+'_'+now+'.csv', index = False)
batch1.to_csv('../DATA/DSC2/Quest_Serology_batch'+str(1)+'_'+now+'.csv', index = False)


### 06/16/2020 - Identifying negative controls in each batch

In [None]:
fileset[12], fileset[14]

In [None]:
batch1 = pd.read_csv(fileset[12])
batch2 = pd.read_csv(fileset[14])

In [None]:
batch1.shape, batch2.shape

In [None]:
batch1_neg_controls = pd.merge(batch1[['biobank_id', 'plate']], COVID_controls[['biobank_id']])
batch1_neg_controls['negative control'] = 'yes'
batch1 = batch1.merge(batch1_neg_controls, 'outer')
batch1['negative control'] = batch1['negative control'].fillna('No')


batch1.shape, batch1_neg_controls.shape
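
An alternative to flagging through an outer merge is to label rows in place with `isin()`. This is a sketch on toy data, not the notebook's method, and it assumes the control list is identified by `biobank_id` alone.

```python
import numpy as np
import pandas as pd

toy_batch = pd.DataFrame({"biobank_id": [1, 2, 3, 3], "plate": [1, 1, 2, 2]})
toy_controls = pd.DataFrame({"biobank_id": [3, 9]})

# isin() labels rows in place, so the row count cannot change the way it can
# after an outer merge against a table with repeated keys.
toy_batch["negative control"] = np.where(
    toy_batch["biobank_id"].isin(toy_controls["biobank_id"]), "yes", "No")
```

Comparing `shape` before and after is then unnecessary, because no rows are added or dropped.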

In [None]:
batch2_neg_controls = pd.merge(batch2[['biobank_id', 'plate']], COVID_controls[['biobank_id']])
batch2_neg_controls['negative control'] = 'yes'
batch2 = batch2.merge(batch2_neg_controls.drop_duplicates(), 'outer')
batch2['negative control'] = batch2['negative control'].fillna('No')

batch2.shape, batch2_neg_controls.shape

In [None]:
len(set(batch1.biobank_id.dropna()) - set(batch2.biobank_id.dropna()))

In [None]:
batch2

In [None]:
batch2.to_csv('../DATA/DSC2/Quest_Serology_batch'+str(2)+'_'+now+'.csv', index = False)
batch1.to_csv('../DATA/DSC2/Quest_Serology_batch'+str(1)+'_'+now+'.csv', index = False)

In [None]:
batch1[['biobank_id','negative control']].groupby('negative control').nunique()

## June 23, 2020 - Checking the Excel file that the biobank sent us

Checking that the negative and positive controls are roughly 'random' and not all in the same wells

In [None]:
coll_map = pd.read_csv('../DATA/DSC2/Collective Children Plate Maps.csv')
bids_coll_map = coll_map[coll_map['Biobank ID'].str.startswith('A')].copy()  # copy so the new column below is not set on a slice
                
                #= [int(i.split('A')[1]) for i in coll_map['Biobank ID']]

bids_coll_map['biobank_id'] = [int(i.split('A')[1]) for i in bids_coll_map['Biobank ID']]
bids_coll_map
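
The prefix handling above can also be done with vectorized string methods. A minimal sketch with made-up IDs, assuming participant IDs are an `A` followed by digits and that anything else (for example a positive control label) should be left out:

```python
import pandas as pd

# Toy values only; "PIO-001" stands in for a non-participant entry.
toy_map = pd.DataFrame({"Biobank ID": ["A123456789", "PIO-001", "A987654321", None]})

# na=False treats missing IDs as "not a participant" instead of propagating NaN.
is_participant = toy_map["Biobank ID"].str.startswith("A", na=False)
participants = toy_map[is_participant].copy()   # copy() avoids SettingWithCopyWarning
participants["biobank_id"] = participants["Biobank ID"].str[1:].astype(int)
```

The `na=False` argument matters if the plate map contains empty wells, since a NaN in the mask would otherwise break the boolean filter.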

In [None]:
# bids_coll_map = pd.merge(bids_coll_map, batch1_david)
# bids_coll_map_neg = bids_coll_map[bids_coll_map['negative control'] == 'yes']
# #bids_coll_map_neg[['plate','biobank_id']].groupby(['plate']).nunique()

# bids_coll_map_neg

In [None]:
pos_coll_map = coll_map[~coll_map['Biobank ID'].str.startswith('A')]
pos_coll_map = pos_coll_map.rename(columns = {'Biobank ID':'Sample ID'})


In [None]:
coll_map_DRC = pd.concat([bids_coll_map, pos_coll_map], sort=True)
coll_map_DRC

In [None]:
#pos_coll_map = pd.merge(pos_coll_map, batch1)
#pos_coll_map[['Plate ID','Biobank ID']].groupby(['Plate ID']).nunique()

In [None]:
#bids_coll_map

In [None]:
bids_coll_map['Biobank ID'].sort_values()

#### Check that the Sample IDs they sent actually are the corresponding Sample IDs

In [None]:
fileset[10]

In [None]:
batch1 = pd.read_csv(fileset[10])
batch1

In [None]:
set(batch1.biobank_id.dropna()) - set(bids_coll_map.biobank_id.dropna())

In [None]:
pd.merge(master_list[['biobank_id','participant_id']], batch1, on = 'biobank_id')[['biobank_id']].nunique()

## June 30, 2020 - Files for David's analysis

In [None]:
### Batch1
fileset[12], fileset[8], fileset[2]

In [None]:
batch1 = pd.read_csv(fileset[12])
batch1

In [None]:
pids_4negcontrols_batch1 = pd.read_sql('''SELECT distinct participant_id, biobank_id
FROM participant_summary''', cnx)#.merge(pd.read_csv(fileset[8])[['biobank_id']])

In [None]:
batch1_david = pd.merge(batch1, pids_4negcontrols_batch1, how = 'left')

In [None]:
batch1_david

#### merging with file from biobank

In [None]:
#batch1_david  = batch1_david.drop(['serology_id'], axis = 1)

In [None]:
#batch1_david.to_csv('../DATA\\DSC2\\Quest_Serology_batch1_2020-06-16_13.07.28_David.csv')

In [None]:
batch1_david = pd.read_csv(fileset[13])
batch1_david

In [None]:
coll_map_DRC

In [None]:
batch1_david2 = pd.merge(batch1_david, coll_map_DRC)
batch1_david2

In [None]:
Quest_AoU_Flat_map = pd.read_csv('../DATA/DSC2/Quest_AoU_Flat_20200630_map.csv').drop_duplicates()
Quest_AoU_Flat_map

In [None]:
import pandas as pd

coll_map = pd.read_csv('../DATA/DSC2/Collective Children Plate Maps.csv')

bids_coll_map = coll_map[coll_map['Biobank ID'].str.startswith('A')].copy()
bids_coll_map['biobank_id'] = [int(i.split('A')[1]) for i in bids_coll_map['Biobank ID']]
pos_coll_map = coll_map[~coll_map['Biobank ID'].str.startswith('A')]
pos_coll_map = pos_coll_map.rename(columns = {'Biobank ID':'Sample ID'})
coll_map_DRC = pd.concat([bids_coll_map, pos_coll_map], sort=True)

batch1_david = pd.read_csv('../DATA\\DSC2\\Quest_Serology_batch1_2020-06-16_13.07.28_David.csv')
batch1_david2 = pd.merge(batch1_david, coll_map_DRC)

Quest_AoU_Flat_map = pd.read_csv('../DATA/DSC2/Quest_AoU_Flat_20200630_map.csv')[['Specimen ID','Matrix Tube ID',
                                                                                  'DOS']].drop_duplicates()
## NB: `Quest_AoU_Flat_20200630_map.csv` is just the first 3 columns of the flat file they sent

Quest_AoU_Flat_map= Quest_AoU_Flat_map.rename(columns = {'Specimen ID':"Sample Id"})  #'Specimen ID' = "Sample Id"

batch1_david3 = pd.merge(batch1_david2[['biobank_id','participant_id',
                                       #'Sample ID',
                                        'negative control', 'positive control', 
                                        'Biobank ID', 'Collection Date','Matrix (Tube) ID', 
                                        'Plate ID', 'Quantity (uL)', 'Sample Id',
       'Sample Type', 'Storage Location']].drop_duplicates(), Quest_AoU_Flat_map, on = 'Sample Id', how = 'right').drop_duplicates()


batch1_david3.to_csv('../DATA/DSC2/Quest_batch1_returned_results_merged.csv')

# NB: I dropped and rearranged some columns manually in the saved CSV. You can do that in Python directly if you want.
# I also renamed "Sample Id" back to 'Specimen ID'
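
For reference, dropping, renaming and reordering columns directly in pandas could look like the sketch below. The toy frame and the particular columns dropped or kept are purely illustrative; the actual manual edits to the saved CSV are not recorded in this notebook.

```python
import pandas as pd

toy_merged = pd.DataFrame({
    "Sample Id": ["S1", "S2"],
    "biobank_id": [1, 2],
    "Storage Location": ["F1", "F2"],
    "Plate ID": ["P1", "P1"],
})

tidy = (toy_merged
        .drop(columns=["Storage Location"])             # drop what is not needed
        .rename(columns={"Sample Id": "Specimen ID"})   # restore the original header
        [["biobank_id", "Specimen ID", "Plate ID"]])    # explicit column order
```

Doing this in code keeps the delivered file reproducible from the notebook alone.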

In [None]:
Quest_AoU_Flat_map= Quest_AoU_Flat_map.rename(columns = {'Specimen ID':"Sample Id"})

In [None]:
batch1_david2.columns

In [None]:
batch1_david3 = pd.merge(batch1_david2[['biobank_id','participant_id',
                                       'Sample ID',
       'negative control', 'positive control', 'Biobank ID', 'Collection Date',
       'Matrix (Tube) ID', 'Plate ID', 'Quantity (uL)', 'Sample Id',
       'Sample Type', 'Storage Location']].drop_duplicates(), Quest_AoU_Flat_map, on = 'Sample Id', how = 'right').drop_duplicates()


In [None]:
batch1_david2.to_csv('../DATA/DSC2/Quest_Serology_batch1_2020-06-16_13.07.28_David..csv')
batch1_david3.to_csv('../DATA/DSC2/Quest_batch1_returned_results_merged.csv')

### Get demographics for batch1

In [None]:
batch1_david2 = pd.read_csv('../DATA/DSC2/Quest_Serology_batch1_2020-06-16_13.07.28_David.csv')

In [None]:
batch1_david2_demog = pd.merge(batch1_david2, master_list, 'left')
batch1_david2_demog

In [None]:
batch1_david2_demog = pd.read_sql('''Select distinct participant_id, biobank_id, 
                display_name as hpo, hpo_id
                from hpo
                inner join participant_summary USING(hpo_id)''', cnx).merge(batch1_david2_demog).drop('hpo_id', axis = 1)

In [None]:
batch1_david2_demog

In [None]:
batch1_david2[batch1_david2['negative control'] == 'yes']

In [None]:
bach1_negative_controls_demographics = pd.merge(batch1_david2[batch1_david2['negative control'] == 'yes'][['biobank_id','participant_id']].drop_duplicates(), COVID_controls)

In [None]:
bach1_negative_controls_demographics

In [None]:
bach1_negative_controls_demographics = pd.read_sql('''Select distinct participant_id, biobank_id, 
                display_name as hpo, hpo_id
                from hpo
                inner join participant_summary USING(hpo_id)''', cnx).merge(bach1_negative_controls_demographics.drop('hpo', axis = 1)).drop('hpo_id', axis = 1)

In [None]:
bach1_negative_controls_demographics = pd.read_sql('''
Select display as SexAtBirth, code_id as sex_id
FROM code

''', cnx).merge(bach1_negative_controls_demographics).drop('sex_id', axis = 1)

bach1_negative_controls_demographics.head()

In [None]:
bach1_negative_controls_demographics = pd.read_sql('''
Select display as GenderIdentity, code_id as gender_identity_id
FROM code

''', cnx).merge(bach1_negative_controls_demographics).drop('gender_identity_id', axis = 1)

bach1_negative_controls_demographics.head()

In [None]:
r = Race()
r.columns = ['race','Race']

In [None]:
bach1_negative_controls_demographics = pd.read_sql('''
Select display as SexualOrientation, code_id as sexual_orientation_id
FROM code

''', cnx).merge(bach1_negative_controls_demographics).drop(['sexual_orientation_id'], axis = 1).merge(r).drop('Race', axis = 1)

bach1_negative_controls_demographics.head()

In [None]:
batch1_demog = pd.merge(batch1_david2_demog, bach1_negative_controls_demographics, 'outer')
batch1_demog.columns

In [None]:
batch1_david2_demog.biobank_id.nunique()

In [None]:
set(batch1_david2.biobank_id.dropna()) - set(batch1_david2_demog.biobank_id.dropna()) 

In [None]:
batch1_demog.to_csv('../DATA/DSC2/bach1__demographics.csv')

### **07/02/2020** Checking replated file sent by Mine

In [None]:
import pandas as pd

In [None]:
batch2_replated = pd.read_csv('../DATA/DSC2/80-well plate maps with parent IDs_revised controls_07.03.2020.csv')
batch2_replated = batch2_replated.rename(columns ={'Biobank ID/Positive Control ID':'biobank_id'}).iloc[:, :4]
batch2_replated

In [None]:
samp_batch2_replated = batch2_replated[~batch2_replated['biobank_id'].str.startswith('PIO')]
neg_batch2_replated = samp_batch2_replated[samp_batch2_replated['Negative Control'] == 'yes']
samp_batch2_replated = samp_batch2_replated[samp_batch2_replated['Negative Control'] == 'No']
samp_batch2_replated['biobank_id'].nunique(), neg_batch2_replated['biobank_id'].nunique()

In [None]:
pos_batch2_replated = batch2_replated[batch2_replated['biobank_id'].str.startswith('PIO')]
pos_batch2_replated['biobank_id'].nunique()

In [None]:
batch2_replated[['biobank_id','Plate']].groupby('Plate').nunique()

In [None]:
pos_batch2_replated['biobank_id'].count()

In [None]:
# copy() so the flag column added below is not set on a slice of batch2_replated
neg_batch2_replated = batch2_replated[batch2_replated['Negative Control'] == 'yes'].copy()
neg_batch2_replated[['biobank_id', 'Plate']].groupby('Plate').nunique()

In [None]:
neg_batch2_replated


In [None]:
neg_batch2_replated.biobank_id.value_counts().sort_values()

In [None]:
neg_batch2_replated['Duplicated Negative Control'] = 'Yes'
neg_batch2_replated.loc[neg_batch2_replated['biobank_id'] == '954572832', 'Duplicated Negative Control'] = 'No'
neg_batch2_replated.loc[neg_batch2_replated['biobank_id'] == '995678363', 'Duplicated Negative Control'] = 'No'
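
The two IDs set to 'No' above are presumably the ones that occur only once in the value counts. Under that assumption, the flag can be derived from the counts themselves. A minimal sketch on toy data with made-up IDs:

```python
import pandas as pd

toy_neg = pd.DataFrame({"biobank_id": ["11", "11", "22", "33", "33"],
                        "Plate": [1, 2, 1, 2, 3]})

# Count how often each ID occurs, then flag IDs that occur more than once.
counts = toy_neg["biobank_id"].map(toy_neg["biobank_id"].value_counts())
toy_neg["Duplicated Negative Control"] = counts.gt(1).map({True: "Yes", False: "No"})
```

This avoids hard-coding IDs, which would need updating whenever the control list changes.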


In [None]:
noneg_neg_batch2_replated = batch2_replated[batch2_replated['Negative Control'] == 'No'].copy()
noneg_neg_batch2_replated['Duplicated Negative Control'] = 'No'
noneg_neg_batch2_replated

In [None]:
batch2_replated_tosend = pd.concat([neg_batch2_replated, noneg_neg_batch2_replated], sort = False)
batch2_replated_tosend = batch2_replated_tosend[['Plate','Sample ID','biobank_id','Negative Control','Duplicated Negative Control']]

batch2_replated_tosend.to_csv('../DATA/DSC2/80-well plate maps with parent IDs_revised controls_07.03.2020.csv')

In [None]:
batch2_replated_tosend

In [None]:
neg_batch2_replated[['biobank_id', 'Duplicated Negative Control']].groupby('Duplicated Negative Control').nunique()#.sort_values()

In [None]:
batch2_replated['biobank_id'].count()

In [None]:
neg_batch2_replated.biobank_id.nunique()

In [None]:
neg_batch2_replated.biobank_id.nunique(), pos_batch2_replated['biobank_id'].nunique()

In [None]:
neg_batch2_replated['biobank_id'] = [int(i) for i in neg_batch2_replated['biobank_id']]
pd.merge(neg_batch2_replated.biobank_id,batch2_neg.biobank_id)

In [None]:
127*2 + 2

In [None]:
batch2_replated[['biobank_id','Plate']].groupby('Plate').nunique()

## checking distributions

In [None]:
batch2 = pd.read_csv('../DATA/DSC2/Quest_Serology_batch2_2020-06-16_13.07.28.csv')
batch2


In [None]:
#batch1_neg = batch1[batch1['negative control'] == 'yes']
batch2_neg = batch2[batch2['negative control'] == 'yes']

In [None]:
master_list.head()

In [None]:
# 'Biobank ID/Positive Control ID' was already renamed to 'biobank_id' when the file was loaded
nopos_batch2_replated = batch2_replated[~batch2_replated['biobank_id'].str.startswith('PIO')].copy()
nopos_batch2_replated['biobank_id'] = [int(i) for i in nopos_batch2_replated['biobank_id']]

batch2_replated_demog= pd.merge(nopos_batch2_replated, 
                                master_list[['participant_id','biobank_id','state', 'DateBloodSampleCollected']].drop_duplicates(), on = 'biobank_id', how = 'left')
batch2_replated_demog                               

In [None]:
COVID_controls.head()

In [None]:
COVID_controls.shape

In [None]:
batch2_replated_demog_andneg = pd.merge(batch2_replated_demog, 
                                COVID_controls[['participant_id','biobank_id','state', 'DateBloodSampleCollected']].drop_duplicates(), how = 'left')
batch2_replated_demog_andneg                               

In [None]:
batch2_replated_demog_andneg.DateBloodSampleCollected.dropna().min(), batch2_replated_demog_andneg.DateBloodSampleCollected.dropna().max()

In [None]:
gr = 'Plate'
DF = batch2_replated_demog_andneg

#writer = pd.ExcelWriter('../DATA/DSC2/sample_distributions_plates'+now+'.xlsx')
for g in DF[gr].drop_duplicates():
    #print(g)    
    for v in ['state']: #['race','state']:
        distribution = get_dist(DF, dist_var = v, group = gr, groupnumber = g)
        display(distribution)
        #distribution.to_excel(writer, str(v)+'_'+str(gr)+str(g))
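
For a single overview table instead of one display per plate, a grouped `value_counts` gives each state's share within each plate. This is a sketch on toy data and is not the notebook's `get_dist()` helper, whose definition and output format live elsewhere.

```python
import pandas as pd

toy = pd.DataFrame({
    "Plate": [1, 1, 1, 1, 2, 2],
    "state": ["TN", "TN", "NY", "CA", "NY", "NY"],
})

# Share of each state within each plate; rows of the unstacked table sum to 1.
state_share = (toy.groupby("Plate")["state"]
                  .value_counts(normalize=True)
                  .unstack(fill_value=0))
```

Each row of `state_share` is one plate, so plates with an unusual mix of states stand out when the table is scanned column by column.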
