# Derivatives of Supernatants from our freezers, before corrections

## General information

In [2]:
import pandas as pd
import re
from IPython.display import display, Markdown, HTML
from file_toolkit import *

pd.set_option('display.max_rows', 96)
pd.set_option('display.max_columns', 96)

filepath = "/Volumes/LabExMI/Users/Nolwenn/FreezerPro/DataToImport/"
filename = "Supernatants_Derivatives_F1F2_20170125.csv"
df = pd.read_csv(filepath+filename)
display(Markdown("**%d** tubes in our derivatives file." % len(df)))
display(Markdown("* **%d** tubes for Fraction 1." % len(df.loc[df["Sample Type"] == "Fraction1"])))
display(Markdown("* **%d** tubes for Fraction 2." % len(df.loc[df["Sample Type"] == "Fraction2"])))
display(Markdown("* **%d** tubes are not assigned to a fraction." % len(df.loc[df["Sample Type"].isnull()])))
display(Markdown("List of the *%d* columns:" % len(df.columns)))
display(Markdown(";\n".join(["1. "+col for col in df.columns])+"."))

**20000** tubes in our derivatives file.

* **10000** tubes for Fraction 1.

* **10000** tubes for Fraction 2.

* **0** tubes are not assigned to a fraction.

List of the *33* columns:

1. ParentID;
1. Name;
1. BARCODE;
1. Position;
1. Volume;
1. Freezer;
1. Freezer_Descr;
1. Level1;
1. Level1_Descr;
1. Level2;
1. Level2_Descr;
1. Level3;
1. Level3_Descr;
1. BoxType;
1. Box;
1. Box_Descr;
1. ThermoBoxBarcode;
1. BOX_BARCODE;
1. CreationDate;
1. UpdateDate;
1. AliquotID;
1. DonorID;
1. StimulusID;
1. StimulusName;
1. VisitID;
1. ThawCycle;
1. Sample Source;
1. Description;
1. BatchID;
1. Sample Type;
1. ShelfBarcode;
1. RackBarcode;
1. DrawerBarcode.

We have 20,000 tubes as expected, with 10,000 tubes per fraction. Next, we want to know how many tubes we have per donor.

## Checking tubes

In [3]:
# Uniqueness is checked on the "Name" column, although the message below says BARCODE.
display(Markdown("**{}** unique BARCODE. Expected 20.000.".format(df["Name"].nunique())))
duplicatednames = df.loc[df["Name"].duplicated(), "Name"].tolist()
if len(duplicatednames) > 0:
    display(Markdown("Table for {} duplicated *Name*:".format(len(duplicatednames))))
    display(df.loc[df["Name"].isin(duplicatednames), ["Name", "ParentID", "DonorID", "VisitID", "BatchID", "AliquotID", "StimulusID"]])

**20000** unique BARCODE. Expected 20.000.
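
The cell above collects the repeated names with `duplicated()` and then re-selects them with `isin`. As a minimal sketch of the same idea, `duplicated(keep=False)` returns every copy of a repeated name in one step. The toy values below are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration: tube "T2" appears twice.
toy = pd.DataFrame({
    "Name": ["T1", "T2", "T2", "T3"],
    "DonorID": [1, 2, 2, 3],
})

# keep=False flags every row whose Name occurs more than once,
# not only the second and later occurrences.
all_copies = toy[toy["Name"].duplicated(keep=False)]
print(all_copies)
```

With the real file, an empty result would mean that every `Name` is unique.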

In [4]:
display(Markdown("Final output file has **%d unique StimulusID**" % len(df["StimulusID"].unique())))
countstimperdonor = pd.DataFrame(df.groupby("DonorID")["StimulusID"].count())
display(Markdown("List of **%d** count of tubes per donor:" % len(countstimperdonor["StimulusID"].unique())))
display(Markdown(";\n".join(["* "+str(int(stim))+" tubes per donor" for stim in sorted(countstimperdonor["StimulusID"].unique())])+"."))

Final output file has **11 unique StimulusID**

List of **1** count of tubes per donor:

* 20 tubes per donor.

We were expecting **10 unique** StimulusID values, because Céline and Bruno only process 10 stimuli. We have one extra stimulus.

We want to know whether any donor has fewer than 20 tubes and, if possible, which tube is missing.

In [5]:
donorindexes = countstimperdonor[countstimperdonor["StimulusID"] < 20].index.values
if len(donorindexes) > 0:
    display(Markdown("List of **%d** unique DonorID:" % len(donorindexes)))
    display(Markdown(";\n".join([" - "+str(int(donor)) for donor in sorted(donorindexes)])+"."))
else:
    display(Markdown("All donor are assigned to at least 20 StimulusID."))
    display(Markdown("%d unique DonorID has more than 20 StimulusID." % \
                     len(countstimperdonor[countstimperdonor["StimulusID"] > 20].index.values)))

All donor are assigned to at least 20 StimulusID.

0 unique DonorID has more than 20 StimulusID.
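
To find which tube is missing rather than only how many, one option is to compare the observed (DonorID, StimulusID, Sample Type) combinations with every combination that is expected. The sketch below does this on toy data. It assumes that every donor should have every stimulus in both fractions, and all values are invented for illustration.

```python
from itertools import product

import pandas as pd

# Toy data, invented for illustration: donor 2 lacks stimulus 23 in Fraction2.
toy = pd.DataFrame({
    "DonorID": [1, 1, 1, 1, 2, 2, 2],
    "StimulusID": [11, 11, 23, 23, 11, 11, 23],
    "Sample Type": ["Fraction1", "Fraction2"] * 3 + ["Fraction1"],
})

keys = ["DonorID", "StimulusID", "Sample Type"]

# Assumption: every donor should have every stimulus in every fraction.
expected = set(product(*(toy[k].drop_duplicates().tolist() for k in keys)))
observed = set(toy[keys].itertuples(index=False, name=None))

missing = sorted(expected - observed)
print(missing)
```

Each tuple in `missing` names one absent tube, which is more specific than a per-donor count.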

We also want to know how many DonorID entries each StimulusID has:

In [7]:
# countdonorperstim = pd.DataFrame(df.groupby("StimulusID")["DonorID"].count())
# countdonorperstim.loc[:, "StimulusID"] = countdonorperstim.index.get_values().astype(int)
# countdonorperstim.reset_index(drop=True, inplace=True)
# display(countdonorperstim[["StimulusID", "DonorID"]])
countdonorperstim = compare_two_columns(df, "StimulusID", "DonorID")

|   | StimulusID | Nb_DonorID |
|---|---|---|
| 0 | 11.0 | 2000 |
| 1 | 17.0 | 2000 |
| 2 | 18.0 | 2000 |
| 3 | 19.0 | 2 |
| 4 | 23.0 | 1998 |
| 5 | 24.0 | 2000 |
| 6 | 27.0 | 2000 |
| 7 | 32.0 | 2000 |
| 8 | 35.0 | 2000 |
| 9 | 37.0 | 2000 |


At least one donor is assigned to StimulusID 19. First, we want to know which donors are assigned to it:

In [8]:
getdonor = df.loc[df["StimulusID"] == 19, "DonorID"].unique()
display(Markdown("StimulusID 19 has **%s** DonorID assigned:" % len(getdonor)))
display(Markdown(";\n".join(["* "+str(int(donor)) for donor in getdonor])+"."))

StimulusID 19 has **1** DonorID assigned:

* 819.
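
The per-stimulus table earlier in this section comes from `compare_two_columns`, a helper from the local `file_toolkit` module whose source is not shown here. The sketch below is a hypothetical stand-in, not the real helper. It assumes the helper groups by the first column and applies `count` or `nunique` to the second. With `count` the result is a number of rows (tubes), which may explain values such as 2000 per stimulus when each donor appears once per fraction.

```python
import pandas as pd


def count_per_group(df, group_col, value_col, method="count"):
    """Hypothetical stand-in, NOT the real file_toolkit.compare_two_columns."""
    if method not in ("count", "nunique"):
        raise ValueError("method must be 'count' or 'nunique'")
    grouped = df.groupby(group_col)[value_col]
    result = grouped.count() if method == "count" else grouped.nunique()
    return result.rename("Nb_" + value_col).reset_index()


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "StimulusID": [11, 11, 11, 19, 23, 23],
    "DonorID": [1, 1, 2, 3, 1, 2],
})

rows_per_stim = count_per_group(toy, "StimulusID", "DonorID")
donors_per_stim = count_per_group(toy, "StimulusID", "DonorID", method="nunique")
print(rows_per_stim)
print(donors_per_stim)
```

For stimulus 11 in the toy data, `count` gives 3 rows while `nunique` gives 2 distinct donors.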

Donor 75 had no tube for stimulus 19. Céline sent an email indicating where to find a file with the corrected data for this tube, and the script that builds the final output was updated accordingly.

For donor *819*, the problem is known. This donor is supposed to be assigned to stimulus 23, but the stimuli seem to have been mixed up for this donor:
* in box 23, the tube found for donor 819 should have been assigned to box 24
* in box 24, the tube found for donor 819 should have been assigned to box 17
* in box 17, the tube found for donor 819 should have been assigned to box 18
* in box 18, the tube found for donor 819 should have been assigned to box 19

(Remember to ask Céline whether this is correct.)

For stimulus 23, we want to know which donors are missing:

In [9]:
donorlist = df["DonorID"].unique()
donorliststim23 = df.loc[df["StimulusID"] == 23.0, "DonorID"].unique()
donornotfound = list(set(donorlist) - set(donorliststim23))
display(Markdown("List of **%d** donors not found for StimulusID 23:" % len(donornotfound)))
display(Markdown(";\n".join(["* "+str(int(donor)) for donor in sorted(donornotfound)])))

List of **1** donors not found for StimulusID 23:

* 819

This confirms that the tube of DonorID 819 in StimulusID 19 is probably assigned to the wrong stimulus and should have been assigned to stimulus 23.
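
The set difference used for stimulus 23 can be applied to every stimulus at once. A minimal sketch on toy data, with all values invented for illustration:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "DonorID": [1, 2, 3, 1, 2, 3, 3],
    "StimulusID": [11, 11, 11, 23, 23, 19, 19],
})

all_donors = set(toy["DonorID"])

# For each stimulus, the donors that appear elsewhere in the file but not here.
missing_by_stim = {
    int(stim): sorted(all_donors - set(donors))
    for stim, donors in toy.groupby("StimulusID")["DonorID"]
}
print(missing_by_stim)
```

A stimulus where almost every donor is missing, like 19 in the toy data, points to an unexpected stimulus rather than to missing tubes.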

Now we want to be sure that we have one stimulus per box:

In [16]:
display(Markdown("There is **%d unique boxes** in the whole file." % len(df["Box"].unique())))

display(Markdown("**Look on Fraction 1:**"))
display(Markdown("There is **%d unique boxes** in the file for Fraction 1." %\
                 len(df.loc[df["Sample Type"] == "Fraction1", "Box"].unique())))
countboxstim = compare_two_columns(df, "Box", "StimulusID", method="nunique", show=False)
display(countboxstim.loc[(countboxstim["StimulusID"] > 1) & (countboxstim["Box"].str.contains("F1")),\
                         ["Box", "StimulusID"]])

display(Markdown("**Look on Fraction 2:**"))
display(Markdown("There is **%d unique boxes** in the file for Fraction 2." %\
                 len(df.loc[df["Sample Type"] == "Fraction2", "Box"].unique())))
display(countboxstim.loc[(countboxstim["StimulusID"] > 1) & (countboxstim["Box"].str.contains("F2")),\
                         ["Box", "StimulusID"]])

There is **220 unique boxes** in the whole file.

**Look on Fraction 1:**

There is **110 unique boxes** in the file for Fraction 1.

|   | Box | StimulusID |
|---|---|---|
| 30 | MIC_Plasma_S17_V1_A1_F1_D801-896 | 2 |
| 52 | MIC_Plasma_S18_V1_A1_F1_D801-896 | 2 |
| 74 | MIC_Plasma_S23_V1_A1_F1_D801-896 | 2 |
| 96 | MIC_Plasma_S24_V1_A1_F1_D801-896 | 2 |


**Look on Fraction 2:**

There is **110 unique boxes** in the file for Fraction 2.

|   | Box | StimulusID |
|---|---|---|
| 41 | MIC_Plasma_S17_V1_A1_F2_D801-896 | 2 |
| 63 | MIC_Plasma_S18_V1_A1_F2_D801-896 | 2 |
| 85 | MIC_Plasma_S23_V1_A1_F2_D801-896 | 2 |
| 107 | MIC_Plasma_S24_V1_A1_F2_D801-896 | 2 |


In each fraction, 4 boxes have 2 different StimulusID values assigned. All of them cover the donor range D801-896, which includes DonorID 819. Which StimulusID values does each box contain?

In [11]:
boxes = countboxstim.loc[countboxstim["StimulusID"] > 1, "Box"]
display(Markdown("There is a total of **%d** boxes to check." % len(boxes)))
display(Markdown("**Look on Fraction 1:**"))
display(Markdown("List of StimulusID for the **%d** boxes to check:" % len(boxes.loc[boxes.str.contains("F1")])))
for box in boxes.loc[boxes.str.contains("F1")]:
    display(Markdown("* "+box+" contains %d StimulusID" % len(df.loc[df["Box"] == box, "StimulusID"].unique())+":"))
    for stim in df.loc[df["Box"] == box, "StimulusID"].unique():
        display(Markdown("       * "+str(int(stim))))
display(Markdown("**Look on Fraction 2:**"))
display(Markdown("List of StimulusID for the **%d** boxes to check:" % len(boxes.loc[boxes.str.contains("F2")])))
for box in boxes.loc[boxes.str.contains("F2")]:
    display(Markdown("* "+box+" contains %d StimulusID" % len(df.loc[df["Box"] == box, "StimulusID"].unique())+":"))
    for stim in df.loc[df["Box"] == box, "StimulusID"].unique():
        display(Markdown("       * "+str(int(stim))))

There is a total of **8** boxes to check.

**Look on Fraction 1:**

List of StimulusID for the **4** boxes to check:

* MIC_Plasma_S17_V1_A1_F1_D801-896 contains 2 StimulusID:

       * 17

       * 18

* MIC_Plasma_S18_V1_A1_F1_D801-896 contains 2 StimulusID:

       * 18

       * 19

* MIC_Plasma_S23_V1_A1_F1_D801-896 contains 2 StimulusID:

       * 23

       * 24

* MIC_Plasma_S24_V1_A1_F1_D801-896 contains 2 StimulusID:

       * 24

       * 17

**Look on Fraction 2:**

List of StimulusID for the **4** boxes to check:

* MIC_Plasma_S17_V1_A1_F2_D801-896 contains 2 StimulusID:

       * 17

       * 18

* MIC_Plasma_S18_V1_A1_F2_D801-896 contains 2 StimulusID:

       * 18

       * 19

* MIC_Plasma_S23_V1_A1_F2_D801-896 contains 2 StimulusID:

       * 23

       * 24

* MIC_Plasma_S24_V1_A1_F2_D801-896 contains 2 StimulusID:

       * 24

       * 17
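
The nested loops above can also be expressed as one table of distinct (Box, StimulusID) pairs, keeping only the boxes that occur in more than one pair. A minimal sketch on toy data, where the shortened box names and all values are invented for illustration:

```python
import pandas as pd

# Toy data, invented for illustration: box "S17_F1" mixes stimuli 17 and 18.
toy = pd.DataFrame({
    "Box": ["S17_F1", "S17_F1", "S17_F1", "S18_F1", "S18_F1"],
    "StimulusID": [17.0, 17.0, 18.0, 18.0, 18.0],
})

# One row per distinct (Box, StimulusID) pair.
pairs = toy[["Box", "StimulusID"]].drop_duplicates()

# A box that appears in several pairs holds several stimuli.
mixed = pairs[pairs["Box"].duplicated(keep=False)].sort_values(["Box", "StimulusID"])
print(mixed)
```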

The impacted boxes seem to be the ones that Céline already found. Can we find DonorID 819 under those StimulusID values?

In [12]:
getstims = df.loc[df["Box"].isin(boxes), "StimulusID"].astype(int).unique()
display(Markdown("**Look on Fraction 1:**"))
for box in boxes.loc[boxes.str.contains("F1")]:
    for stim in getstims:
        if len(df.loc[(df["Box"] == box) & (df["StimulusID"] == stim) & (df["DonorID"] == 819), "DonorID"]) > 0:
            display(Markdown('* DonorID 819 found in box *%s*, StimulusID **%d**.' % (box,stim)))
display(Markdown("**Look on Fraction 2:**"))
for box in boxes.loc[boxes.str.contains("F2")]:
    for stim in getstims:
        if len(df.loc[(df["Box"] == box) & (df["StimulusID"] == stim) & (df["DonorID"] == 819), "DonorID"]) > 0:
            display(Markdown('* DonorID 819 found in box *%s*, StimulusID **%d**.' % (box,stim)))

**Look on Fraction 1:**

* DonorID 819 found in box *MIC_Plasma_S17_V1_A1_F1_D801-896*, StimulusID **18**.

* DonorID 819 found in box *MIC_Plasma_S18_V1_A1_F1_D801-896*, StimulusID **19**.

* DonorID 819 found in box *MIC_Plasma_S23_V1_A1_F1_D801-896*, StimulusID **24**.

* DonorID 819 found in box *MIC_Plasma_S24_V1_A1_F1_D801-896*, StimulusID **17**.

**Look on Fraction 2:**

* DonorID 819 found in box *MIC_Plasma_S17_V1_A1_F2_D801-896*, StimulusID **18**.

* DonorID 819 found in box *MIC_Plasma_S18_V1_A1_F2_D801-896*, StimulusID **19**.

* DonorID 819 found in box *MIC_Plasma_S23_V1_A1_F2_D801-896*, StimulusID **24**.

* DonorID 819 found in box *MIC_Plasma_S24_V1_A1_F2_D801-896*, StimulusID **17**.
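
Since the box names embed a stimulus number (`S17`, `S18`, and so on), mismatched tubes can also be flagged without naming the donor in advance. The sketch below assumes that the digits after `_S` in the box name are the stimulus the box is meant to hold. The rows are toy data, and donor 818 is invented for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration, using the box naming seen in this notebook.
toy = pd.DataFrame({
    "Box": [
        "MIC_Plasma_S17_V1_A1_F1_D801-896",
        "MIC_Plasma_S17_V1_A1_F1_D801-896",
        "MIC_Plasma_S18_V1_A1_F1_D801-896",
    ],
    "StimulusID": [17.0, 18.0, 19.0],
    "DonorID": [818, 819, 819],
})

# Assumption: the digits after "_S" give the stimulus the box is meant to hold.
box_stim = toy["Box"].str.extract(r"_S(\d+)_", expand=False).astype(int)

# Tubes whose recorded StimulusID disagrees with the box name.
mismatched = toy[box_stim != toy["StimulusID"]]
print(mismatched)
```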

The problem described by Céline apparently exists in all 8 boxes. Do we have to change them? They were supposed to have been changed already. Is there a problem in the script that generated the data?

We also want to check whether the same error shows up when we look at the Thermo Fisher box barcode instead of the box name:

In [14]:
display(Markdown("There is a total of **%d unique boxes** in the file." % len(df["ThermoBoxBarcode"].unique())))
display(Markdown("There is **%d unique boxes** in Fraction 1." % len(df.loc[df["Sample Type"] == "Fraction1", \
                                                                          "ThermoBoxBarcode"].unique())))
display(Markdown("There is **%d unique boxes** in Fraction 2." % len(df.loc[df["Sample Type"] == "Fraction2", \
                                                                          "ThermoBoxBarcode"].unique())))

countthermoboxstim = compare_two_columns(df, "ThermoBoxBarcode", "StimulusID", method="nunique", show=False)
display(countthermoboxstim.loc[(countthermoboxstim["StimulusID"] > 1), ["ThermoBoxBarcode", "StimulusID"]])

There is a total of **220 unique boxes** in the file.

There is **110 unique boxes** in Fraction 1.

There is **110 unique boxes** in Fraction 2.

|   | ThermoBoxBarcode | StimulusID |
|---|---|---|
| 71 | TF00080640 | 2 |
| 73 | TF00080642 | 2 |
| 144 | TS00010678 | 2 |
| 153 | TS00010783 | 2 |
| 182 | TS00047751 | 2 |
| 196 | TS00048028 | 2 |
| 201 | TS00048039 | 2 |
| 210 | TS00048095 | 2 |


We have the same number of unique ThermoBoxBarcode values as unique Box values: 220 in total and 110 per fraction.
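
Equal numbers of unique values do not by themselves show that each Box maps to exactly one ThermoBoxBarcode and vice versa. A minimal sketch of a two-way check on toy data, with all values invented for illustration:

```python
import pandas as pd

# Toy data, invented for illustration: box "B3" carries two barcodes.
toy = pd.DataFrame({
    "Box": ["B1", "B1", "B2", "B3", "B3"],
    "ThermoBoxBarcode": ["TF01", "TF01", "TF02", "TF03", "TF04"],
})

pairs = toy[["Box", "ThermoBoxBarcode"]].drop_duplicates()

# nunique() ignores missing values, so a box without any barcode would count 0.
thermo_per_box = pairs.groupby("Box")["ThermoBoxBarcode"].nunique()
box_per_thermo = pairs.groupby("ThermoBoxBarcode")["Box"].nunique()

one_to_one = bool((thermo_per_box == 1).all() and (box_per_thermo == 1).all())
ambiguous_boxes = thermo_per_box[thermo_per_box > 1].index.tolist()
print(one_to_one, ambiguous_boxes)
```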

We wonder whether the StimulusID problem seen with the Box column also occurs with the ThermoBoxBarcode column:

In [None]:
thermoboxes = countthermoboxstim.loc[countthermoboxstim["StimulusID"] > 1, "ThermoBoxBarcode"]
display(Markdown("List of StimulusID for the **%d** boxes to check:" % len(thermoboxes)))

for thermobox in thermoboxes:
    display(Markdown("* "+thermobox+" contains %d StimulusID" % \
                     len(df.loc[df["ThermoBoxBarcode"] == thermobox, "StimulusID"].unique())+":"))
    for stim in df.loc[df["ThermoBoxBarcode"] == thermobox, "StimulusID"].unique():
        display(Markdown("       * "+str(int(stim))))

The same lists of StimulusID appear for the ThermoBoxBarcode column. Do we get the same result on the StimulusID column when we look specifically at DonorID 819?

In [None]:
getstims = df.loc[df["ThermoBoxBarcode"].isin(thermoboxes), "StimulusID"].astype(int).unique()
for thermobox in thermoboxes:
    for stim in getstims:
        if len(df.loc[(df["ThermoBoxBarcode"] == thermobox) & (df["StimulusID"] == stim) & \
                      (df["DonorID"] == 819), "DonorID"]) > 0:
            display(Markdown('* DonorID 819 found in box *%s*, StimulusID **%d**.' % (thermobox,stim)))

The results agree with the analysis based on the Box column.

For the 8 Thermo Fisher boxes that are assigned to more than one StimulusID, we want the list of related Box values:

In [None]:
for thermobox in thermoboxes:
    display(Markdown(thermobox+" -> "+\
                     ", ".join([box for box in df.loc[df["ThermoBoxBarcode"] == thermobox, "Box"].unique()])))

For the boxes assigned to more than one StimulusID, this looks fine.

For each Box, how many ThermoBoxBarcode values are associated, and which boxes do not have exactly one?

In [None]:
# One row per Box with its number of distinct ThermoBoxBarcode values.
# reset_index() replaces index.get_values(), which was removed in pandas 1.0.
countboxperthermobox = df.groupby("Box")["ThermoBoxBarcode"].nunique().reset_index()
display(Markdown("**%d** boxes are not assigned to a ThermoBoxBarcode" % \
                 len(countboxperthermobox.loc[countboxperthermobox["ThermoBoxBarcode"] < 1,\
                                              ["Box", "ThermoBoxBarcode"]])))
display(Markdown("**%d** boxes are assigned to more than one ThermoBoxBarcode" % \
                 len(countboxperthermobox.loc[countboxperthermobox["ThermoBoxBarcode"] > 1,\
                                              ["Box", "ThermoBoxBarcode"]])))
display(Markdown("**%d** boxes are assigned to a ThermoBoxBarcode" % \
                 len(countboxperthermobox.loc[countboxperthermobox["ThermoBoxBarcode"] == 1,\
                                              ["Box", "ThermoBoxBarcode"]])))
if len(countboxperthermobox.loc[countboxperthermobox["ThermoBoxBarcode"] != 1, \
                                ["Box", "ThermoBoxBarcode"]]) > 0:
    display(countboxperthermobox.loc[countboxperthermobox["ThermoBoxBarcode"] != 1,["Box", "ThermoBoxBarcode"]])

Do these boxes contain DonorID information?

In [None]:
boxes = countboxperthermobox.loc[countboxperthermobox["ThermoBoxBarcode"] != 1]["Box"].values
display(Markdown("**%d** tubes in boxes not assigned to ThermoBoxbarcode." % \
      len(df.loc[df["Box"].isin(boxes), ["Box", "ThermoBoxBarcode", "DonorID"]])))
if len(df.loc[df["Box"].isin(boxes), ["Box", "ThermoBoxBarcode", "DonorID"]]) > 0:
    display(df.loc[df["Box"].isin(boxes), ["Box", "ThermoBoxBarcode", "DonorID"]])

The DonorID values found correspond to tubes that generated errors and were corrected by Céline. The ThermoBoxBarcode column was not taken into account in the previous version of the script; this is now fixed.

We also want to check that none of the excluded donors are in our data:

In [None]:
excludeddonors = [96, 104, 122, 167, 178, 219, 268, 279, 303, 308, 534, 701]
df["DonorID"] = df["DonorID"].astype(int)
display(Markdown("**%d** donor found." % len(df.loc[df["DonorID"].isin(excludeddonors), "DonorID"].unique())))
if len(df.loc[df["DonorID"].isin(excludeddonors), "DonorID"].unique()) > 0:
    display(Markdown("The excluded donor found are:"))
    display(Markdown(";\n".join(["* "+str(donor) for donor in df.loc[df["DonorID"].isin(excludeddonors), "DonorID"].unique()])))

We have none of the excluded donors in the final output of our data.

According to Céline, the missing donor for StimulusID should be in the run file, assigned to Thermo Fisher box barcode TF00080651 at position G3. The run file does contain a tube there. We have to check the donors in this ThermoBoxBarcode at well row G. In Atlas, in the run Excel file, the tube is marked as 'No read', but on the computer next to the TECAN the barcode exists. After adding the correct information, we have these data:

In [None]:
display(Markdown("**%d** tubes found for ThermoBoxBarcode TF00080651:" % len(df.loc[(df["ThermoBoxBarcode"] == "TF00080651") & (df["Position"].str.contains("G")),\
               ["Box", "ThermoBoxBarcode", "Position", "DonorID"]].sort_values(["DonorID"]))))
display(df.loc[(df["ThermoBoxBarcode"] == "TF00080651") & (df["Position"].str.contains("G")),\
               ["Box", "ThermoBoxBarcode", "Position", "DonorID"]].sort_values(["DonorID"]))
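
The filter above uses `str.contains("G")`, which matches a "G" anywhere in the position string. A stricter option is to split the position into a row letter and a column number. The sketch below assumes that positions are one capital letter followed by digits, as in "G3". The donor numbers are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Position": ["G3", "G12", "A1", "H7"],
    "DonorID": [901, 902, 903, 904],
})

# Assumption: a position is one capital letter (row) followed by digits (column).
parts = toy["Position"].str.extract(r"^([A-Z])(\d+)$")
toy["Row"] = parts[0]
toy["Column"] = parts[1].astype(int)

row_g = toy[toy["Row"] == "G"]
well_g3 = toy[(toy["Row"] == "G") & (toy["Column"] == 3)]
print(row_g)
print(well_g3)
```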

We are supposed to have 11 boxes per stimulus and fraction. How many boxes do we actually have per StimulusID?

In [None]:
display(Markdown("**Look on Fraction 1:**"))
df_frac1 = df.loc[df["Sample Type"] == "Fraction1"]
# reset_index() replaces index.get_values(), which was removed in pandas 1.0.
countboxperstim1 = df_frac1.groupby("StimulusID")["Box"].nunique().reset_index()
countboxperstim1["StimulusID"] = countboxperstim1["StimulusID"].astype(int)
display(table_wo_index(countboxperstim1[["Box", "StimulusID"]]))

display(Markdown("**Look on Fraction 2:**"))
df_frac2 = df.loc[df["Sample Type"] == "Fraction2"]
# reset_index() replaces index.get_values(), which was removed in pandas 1.0.
countboxperstim2 = df_frac2.groupby("StimulusID")["Box"].nunique().reset_index()
countboxperstim2["StimulusID"] = countboxperstim2["StimulusID"].astype(int)
display(countboxperstim2[["Box", "StimulusID"]])

The extra box for StimulusID 19 is expected, as we have not yet changed DonorID 819 for this stimulus. The extra boxes for StimulusID 17, 18, and 24 are also expected. What are these boxes?

In [None]:
display(Markdown("**Look on Fraction 1:**"))
stims1 = countboxperstim1.loc[countboxperstim1["Box"] > 11, "StimulusID"].values.tolist()

if len(df.loc[(df["StimulusID"].isin(stims1)) & (df["Box"].str.contains("F1")), "Box"].unique()) > 11:
    for stim1 in stims1:
        display(Markdown("List of boxes for **StimulusID %d**:" % stim1))
        display(Markdown(";\n".join(["* "+str(box)+", StimulusID "+str(int(stim1)) \
                                     for box in df.loc[(df["StimulusID"] == stim1) & \
                                                       (df["Box"].str.contains("F1")), "Box"].unique()])+"."))
        
display(Markdown("**Look on Fraction 2:**"))
stims2 = countboxperstim2.loc[countboxperstim2["Box"] > 11, "StimulusID"].values.tolist()

if len(df.loc[(df["StimulusID"].isin(stims2)) & (df["Box"].str.contains("F2")), "Box"].unique()) > 11:
    for stim2 in stims2:
        display(Markdown("List of boxes for **StimulusID %d**:" % stim2))
        display(Markdown(";\n".join(["* "+str(box)+", StimulusID "+str(int(stim2)) \
                                     for box in df.loc[(df["StimulusID"] == stim2) & \
                                                       (df["Box"].str.contains("F2")), "Box"].unique()])+"."))
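
The same question can be asked for both fractions at once by grouping on the fraction and the stimulus together. In this sketch the expected number of boxes is reduced to 2 to keep the toy data small (the notebook expects 11), and all values are invented for illustration.

```python
import pandas as pd

# Assumption for the toy data only: 2 boxes expected (11 in the notebook).
EXPECTED_BOXES = 2

# Toy data, invented for illustration: stimulus 18 spans 3 boxes in Fraction1.
toy = pd.DataFrame({
    "Sample Type": ["Fraction1"] * 5 + ["Fraction2"] * 4,
    "StimulusID": [17, 17, 18, 18, 18, 17, 17, 18, 18],
    "Box": ["a", "b", "b", "c", "d", "e", "f", "g", "h"],
})

# Number of distinct boxes per (fraction, stimulus).
box_counts = (
    toy.groupby(["Sample Type", "StimulusID"])["Box"]
    .nunique()
    .rename("Boxes")
    .reset_index()
)

excess = box_counts[box_counts["Boxes"] > EXPECTED_BOXES]
print(excess)
```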

List of donors in unexpected boxes:

In [None]:
# For each fraction, list the tubes whose StimulusID differs from the stimulus in the box name.
for fraction in ("F1", "F2"):
    display(Markdown("**Look on Fraction %s:**" % fraction[1]))
    for stim in (17, 18, 24, 23):
        box = "MIC_Plasma_S%d_V1_A1_%s_D801-896" % (stim, fraction)
        display(df.loc[(df["Box"] == box) & (df["StimulusID"] != stim),
                       ["Box", "StimulusID", "DonorID"]])

These results agree with Céline's observations. These tubes will have to be corrected manually, directly in the output file.

In [None]:
display(df.loc[(df["StimulusID"].isin([17.0, 18.0, 19.0, 24.0])) & (df["DonorID"] == 819),\
               ["Box", "StimulusID", "DonorID"]])

## Checking common fields

We have 2 fractions, and both fractions must contain the same common information:
* ParentID;
* Position;
* CreationDate;
* UpdateDate;
* AliquotID;
* BoxType;
* VisitID;
* ThawCycle;
* Volume;
* Sample Source;
* BatchID;
* Sample Type.

We expect the same number of rows for each field in each fraction:

In [None]:
display(Markdown("### Look on Fraction 1"))

display(Markdown("#### ParentID field"))
display(Markdown("* **%d** empty ParentID\n* **%d** unique ParentID" % \
                 (len(df_frac1[df_frac1["ParentID"].isnull()]), len(df_frac1["ParentID"].unique()))))

display(Markdown("#### Position field"))
tubesperposition1 = pd.DataFrame(df_frac1.groupby("Position")["Position"].count())
tubesperposition1.loc[:, "Tubes"] = tubesperposition1["Position"]
tubesperposition1.loc[:, "Position"] = tubesperposition1.index
tubesperposition1.reset_index(drop=True, inplace=True)
display(tubesperposition1)
display(Markdown("**%d** tubes for Position." % tubesperposition1["Tubes"].sum()))

display(Markdown("#### CreationDate field"))
tubespercreation1 = pd.DataFrame(df_frac1.groupby("CreationDate")["CreationDate"].count())
tubespercreation1.loc[:, "Tubes"] = tubespercreation1["CreationDate"]
tubespercreation1.loc[:, "CreationDate"] = tubespercreation1.index
tubespercreation1.reset_index(drop=True, inplace=True)
display(tubespercreation1)
display(Markdown("**%d** tubes for CreationDate." % tubespercreation1["Tubes"].sum()))

display(Markdown("#### UpdateDate field"))
tubesperupdate1 = pd.DataFrame(df_frac1.groupby("UpdateDate")["UpdateDate"].count())
tubesperupdate1.loc[:, "Tubes"] = tubesperupdate1["UpdateDate"]
tubesperupdate1.loc[:, "UpdateDate"] = tubesperupdate1.index
tubesperupdate1.reset_index(drop=True, inplace=True)
display(tubesperupdate1)
display(Markdown("**%d** tubes for UpdateDate." % tubesperupdate1["Tubes"].sum()))

display(Markdown("#### AliquotID field"))
tubesperaliquotid1 = pd.DataFrame(df_frac1.groupby("AliquotID")["AliquotID"].count())
tubesperaliquotid1.loc[:, "Tubes"] = tubesperaliquotid1["AliquotID"]
tubesperaliquotid1.loc[:, "AliquotID"] = tubesperaliquotid1.index
tubesperaliquotid1.reset_index(drop=True, inplace=True)
display(tubesperaliquotid1)
display(Markdown("**%d** tubes for AliquotID." % tubesperaliquotid1["Tubes"].sum()))

display(Markdown("#### BoxType field"))
tubesperboxtype1 = pd.DataFrame(df_frac1.groupby("BoxType")["BoxType"].count())
tubesperboxtype1.loc[:, "Tubes"] = tubesperboxtype1["BoxType"]
tubesperboxtype1.loc[:, "BoxType"] = tubesperboxtype1.index
tubesperboxtype1.reset_index(drop=True, inplace=True)
display(tubesperboxtype1)
display(Markdown("**%d** tubes for BoxType." % tubesperboxtype1["Tubes"].sum()))

display(Markdown("#### VisitID field"))
tubespervisitid1 = pd.DataFrame(df_frac1.groupby("VisitID")["VisitID"].count())
tubespervisitid1.loc[:, "Tubes"] = tubespervisitid1["VisitID"]
tubespervisitid1.loc[:, "VisitID"] = tubespervisitid1.index
tubespervisitid1.reset_index(drop=True, inplace=True)
display(tubespervisitid1)
display(Markdown("**%d** tubes for VisitID." % tubespervisitid1["Tubes"].sum()))

display(Markdown("#### ThawCycle field"))
tubesperthawcycle1 = pd.DataFrame(df_frac1.groupby("ThawCycle")["ThawCycle"].count())
tubesperthawcycle1.loc[:, "Tubes"] = tubesperthawcycle1["ThawCycle"]
tubesperthawcycle1.loc[:, "ThawCycle"] = tubesperthawcycle1.index
tubesperthawcycle1.reset_index(drop=True, inplace=True)
display(tubesperthawcycle1)
display(Markdown("**%d** tubes for ThawCycle." % tubesperthawcycle1["Tubes"].sum()))

display(Markdown("#### Sample Source field"))
display(Markdown("* **%d** empty Sample Source\n* **%d** unique Sample Source" % \
                 (len(df_frac1[df_frac1["Sample Source"].isnull()]), len(df_frac1["Sample Source"].unique()))))

display(Markdown("#### BatchID field"))
tubesperbatchid1 = pd.DataFrame(df_frac1.groupby("BatchID")["BatchID"].count())
tubesperbatchid1.loc[:, "Tubes"] = tubesperbatchid1["BatchID"]
tubesperbatchid1.loc[:, "BatchID"] = tubesperbatchid1.index
tubesperbatchid1.reset_index(drop=True, inplace=True)
display(tubesperbatchid1)
display(Markdown("**%d** tubes for BatchID." % tubesperbatchid1["Tubes"].sum()))

display(Markdown("#### Sample Type field"))
tubespersampletype1 = pd.DataFrame(df_frac1.groupby("Sample Type")["Sample Type"].count())
tubespersampletype1.loc[:, "Tubes"] = tubespersampletype1["Sample Type"]
tubespersampletype1.loc[:, "Sample Type"] = tubespersampletype1.index
tubespersampletype1.reset_index(drop=True, inplace=True)
display(tubespersampletype1)
display(Markdown("**%d** tubes for Sample Type." % tubespersampletype1["Tubes"].sum()))

display(Markdown("### Look on Fraction 2"))

display(Markdown("#### ParentID field"))
display(Markdown("* **%d** empty ParentID\n* **%d** unique ParentID" % \
                 (len(df_frac2[df_frac2["ParentID"].isnull()]), len(df_frac2["ParentID"].unique()))))

display(Markdown("#### Position field"))
tubesperposition2 = pd.DataFrame(df_frac2.groupby("Position")["Position"].count())
tubesperposition2.loc[:, "Tubes"] = tubesperposition2["Position"]
tubesperposition2.loc[:, "Position"] = tubesperposition2.index
tubesperposition2.reset_index(drop=True, inplace=True)
display(tubesperposition2)
display(Markdown("**%d** tubes for Position." % tubesperposition2["Tubes"].sum()))

display(Markdown("#### CreationDate field"))
tubespercreation2 = pd.DataFrame(df_frac2.groupby("CreationDate")["CreationDate"].count())
tubespercreation2.loc[:, "Tubes"] = tubespercreation2["CreationDate"]
tubespercreation2.loc[:, "CreationDate"] = tubespercreation2.index
tubespercreation2.reset_index(drop=True, inplace=True)
display(tubespercreation2)
display(Markdown("**%d** tubes for CreationDate." % tubespercreation2["Tubes"].sum()))

display(Markdown("#### UpdateDate field"))
tubesperupdate2 = pd.DataFrame(df_frac2.groupby("UpdateDate")["UpdateDate"].count())
tubesperupdate2.loc[:, "Tubes"] = tubesperupdate2["UpdateDate"]
tubesperupdate2.loc[:, "UpdateDate"] = tubesperupdate2.index
tubesperupdate2.reset_index(drop=True, inplace=True)
display(tubesperupdate2)
display(Markdown("**%d** tubes for UpdateDate." % tubesperupdate2["Tubes"].sum()))

display(Markdown("#### AliquotID field"))
tubesperaliquotid2 = pd.DataFrame(df_frac2.groupby("AliquotID")["AliquotID"].count())
tubesperaliquotid2.loc[:, "Tubes"] = tubesperaliquotid2["AliquotID"]
tubesperaliquotid2.loc[:, "AliquotID"] = tubesperaliquotid2.index
tubesperaliquotid2.reset_index(drop=True, inplace=True)
display(tubesperaliquotid2)
display(Markdown("**%d** tubes for AliquotID." % tubesperaliquotid2["Tubes"].sum()))

display(Markdown("#### BoxType field"))
tubesperboxtype2 = pd.DataFrame(df_frac2.groupby("BoxType")["BoxType"].count())
tubesperboxtype2.loc[:, "Tubes"] = tubesperboxtype2["BoxType"]
tubesperboxtype2.loc[:, "BoxType"] = tubesperboxtype2.index
tubesperboxtype2.reset_index(drop=True, inplace=True)
display(tubesperboxtype2)
display(Markdown("**%d** tubes for BoxType." % tubesperboxtype2["Tubes"].sum()))

display(Markdown("#### VisitID field"))
tubespervisitid2 = pd.DataFrame(df_frac2.groupby("VisitID")["VisitID"].count())
tubespervisitid2.loc[:, "Tubes"] = tubespervisitid2["VisitID"]
tubespervisitid2.loc[:, "VisitID"] = tubespervisitid2.index
tubespervisitid2.reset_index(drop=True, inplace=True)
display(tubespervisitid2)
display(Markdown("**%d** tubes for VisitID." % tubespervisitid2["Tubes"].sum()))

display(Markdown("#### ThawCycle field"))
tubesperthawcycle2 = pd.DataFrame(df_frac2.groupby("ThawCycle")["ThawCycle"].count())
tubesperthawcycle2.loc[:, "Tubes"] = tubesperthawcycle2["ThawCycle"]
tubesperthawcycle2.loc[:, "ThawCycle"] = tubesperthawcycle2.index
tubesperthawcycle2.reset_index(drop=True, inplace=True)
display(tubesperthawcycle2)
display(Markdown("**%d** tubes for ThawCycle." % tubesperthawcycle2["Tubes"].sum()))

display(Markdown("#### Sample Source field"))
display(Markdown("* **%d** empty Sample Source\n* **%d** unique Sample Source" % \
                 (len(df_frac2[df_frac2["Sample Source"].isnull()]), len(df_frac2["Sample Source"].unique()))))

display(Markdown("#### BatchID field"))
tubesperbatchid2 = pd.DataFrame(df_frac2.groupby("BatchID")["BatchID"].count())
tubesperbatchid2.loc[:, "Tubes"] = tubesperbatchid2["BatchID"]
tubesperbatchid2.loc[:, "BatchID"] = tubesperbatchid2.index
tubesperbatchid2.reset_index(drop=True, inplace=True)
display(tubesperbatchid2)
display(Markdown("**%d** tubes for BatchID." % tubesperbatchid2["Tubes"].sum()))

display(Markdown("#### Sample Type field"))
tubespersampletype2 = pd.DataFrame(df_frac2.groupby("Sample Type")["Sample Type"].count())
tubespersampletype2.loc[:, "Tubes"] = tubespersampletype2["Sample Type"]
tubespersampletype2.loc[:, "Sample Type"] = tubespersampletype2.index
tubespersampletype2.reset_index(drop=True, inplace=True)
display(tubespersampletype2)
display(Markdown("**%d** tubes for Sample Type." % tubespersampletype2["Tubes"].sum()))
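
The per-field blocks above all follow one pattern: group by a column, count the tubes, then show the table and its total. A compact sketch of that pattern follows. The helper `tubes_per_value` is hypothetical (it is not part of `file_toolkit`), and the toy data is invented for illustration.

```python
import pandas as pd


def tubes_per_value(frame, column):
    """Hypothetical helper: one row per distinct value of `column` with its tube count."""
    # size() counts the rows in each group; rows with a missing key are left out.
    return frame.groupby(column).size().rename("Tubes").reset_index()


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Sample Type": ["Fraction1", "Fraction1", "Fraction1", "Fraction2", "Fraction2"],
    "VisitID": [1, 1, 1, 1, 1],
    "ThawCycle": [0, 0, 1, 0, 0],
})

summary = {}
for fraction, frame in toy.groupby("Sample Type"):
    for column in ["VisitID", "ThawCycle"]:
        table = tubes_per_value(frame, column)
        summary[(fraction, column)] = table
        print("%s, %s: %d tubes" % (fraction, column, table["Tubes"].sum()))
        print(table)
```

Looping over the column names also avoids copy-paste slips in the per-field messages.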

In [None]:
boxescheck = ["MIC_Plasma_S17_V1_A1_F1_D801-896", "MIC_Plasma_S18_V1_A1_F1_D801-896", \
              "MIC_Plasma_S19_V1_A1_F1_D801-896", "MIC_Plasma_S24_V1_A1_F1_D801-896", \
              "MIC_Plasma_S23_V1_A1_F1_D801-896"]
display(df[df["Box"].str.contains(boxescheck[4])]["StimulusID"].count())