## 1) Begin Milestone 1 with a 250-500-word narrative describing an original idea for an analysis/model building business problem. 

I work in Research & Development for a company that makes animal vaccines. Part of my job is to scale up the process from benchtop (lab) scale to full production (manufacturing) scale. For the past few years, I have been working on a particular microorganism that has proved finicky and not always predictable. Each batch takes about 2 months from start to finish, although batches can be overlapped.

Through a series of setbacks, I have now produced 12 batches, starting in 2019, and I would like to see whether my data hold any insights that might help maximize the final yield, as quantified by ELISA testing for the target antigen.

The scale-up initially involves roller bottles and flasks in the clean lab, then goes through several vessels, ultimately finishing in a 1000-liter SUB (single-use bioreactor). The majority of my data comes from sampling the vessels at different timepoints, usually more spaced out early in each vessel, then daily as we hit log-phase growth and get ready to transfer to the next vessel. Because sampling isn’t always daily, many days have missing data.

In addition to optimizing yield, we are trying to solve a second problem. We currently use a piece of analytical equipment (MPbio_TotCt in the data) whose results correlate nicely with our ELISA results, which helps us decide when to harvest the final vessel. The vendor no longer supports this equipment, so we need to find a replacement. For several of the recent runs, we have been pulling additional samples to try out different analytical methods and equipment, looking for one that correlates with the old equipment and, preferably, with the ELISA results.

I should also note that we do pull in-process samples from our final vessel for ELISA testing, but the ELISA is a longer (overnight) test and we need a much faster method for in-process testing. For reference, the analytical equipment we are replacing could provide a result in just a couple of hours.

## 2) Then, do a graphical analysis creating a minimum of four graphs.
Label your graphs appropriately and explain/analyze the information provided by each graph. Keep in mind that your analysis may look very different from the Titanic tutorial graphical analysis.

### Load and clean up the data

In [1]:
# Load libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# Load data to a DataFrame
vessel_data = pd.read_excel('Compiled_Lawsonia.xlsx')

In [3]:
# Check the dimensions of the table
print(f"The table is {vessel_data.shape[0]} rows x {vessel_data.shape[1]} columns.")

The table is 465 rows x 39 columns.


In [4]:
# Get an initial look at the data
vessel_data.head()

Unnamed: 0,Batch,Vessel,DPI,Date,Time,Temp_C,Agitation,DO_pct,pH_online,pH_offline,...,Micro_Comments,Intra_Code,Extra_Code,ELISA_KZO,ELISA_LNK,Live_Titer,VCD_OD,RNA,gDNA/1mL,Antifoam
0,369697PV,Flasks,3,2019-07-12,,,,,,,...,,,,,,,,,,
1,369697PV,5L BLU,M,2019-07-12,09:25:00,36.0,40.0,7.37,7.035,6.783,...,,,,,,,,,,
2,369697PV,5L BLU,0,2019-07-12,10:53:00,35.8,40.0,15.39,6.983,7.118,...,,,,,,,,,,
3,369697PV,5L BLU,1,2019-07-13,,,,,,,...,,,,,,,,,,
4,369697PV,5L BLU,2,2019-07-14,,,,,,,...,,,,,,,,,,


There are a lot of NaNs that will need to be dealt with.  
DPI is days post-inoculation. That column contains several 'M' entries, which are samples of the media taken prior to inoculation; those rows can be removed.

In [5]:
# Remove rows from 'DPI' column which are 'M'
vessel_data = vessel_data[vessel_data.DPI!='M']

In [6]:
# Check each column for NaNs
pd.isnull(vessel_data).any()

Batch             False
Vessel            False
DPI                True
Date               True
Time               True
Temp_C             True
Agitation          True
DO_pct             True
pH_online          True
pH_offline         True
O₂_%sat            True
CO₂_%sat           True
L_Glut             True
Gluc               True
Lact               True
Ammon              True
Nuc_NonVi          True
Nuc_TotCt          True
Nuc_LiveCt         True
Nuc_Viab           True
MPbio_Blank        True
MPbio_Area         True
MPbio_TotCt        True
MPtvo_Blank        True
MPtvo_Area         True
MPtvo_TotCt        True
Intra_Ct           True
Micro_Intra        True
Micro_Extra        True
Micro_Comments     True
Intra_Code         True
Extra_Code         True
ELISA_KZO          True
ELISA_LNK          True
Live_Titer         True
VCD_OD             True
RNA                True
gDNA/1mL           True
Antifoam           True
dtype: bool
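
The `.any()` check only reports whether a column contains at least one NaN. A per-column count and fraction can show which measurements are sparsely sampled. The sketch below uses a small made-up frame (`toy` and its values are illustrative and are not taken from `Compiled_Lawsonia.xlsx`); the same summary should apply to `vessel_data` by swapping the name.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for vessel_data; the values are invented for the example
toy = pd.DataFrame({
    'Batch': ['A', 'A', 'B', 'B'],
    'DPI': [0, 1, 0, np.nan],
    'ELISA_KZO': [np.nan, np.nan, np.nan, 9000.0],
})

# Count NaNs per column and express them as a fraction of all rows
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'frac_missing': toy.isna().mean(),
}).sort_values('n_missing', ascending=False)

print(missing)
```

Sorting by `n_missing` puts the sparsest columns first, which may help when deciding which measurements have enough data to plot.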

DPI shouldn't have any null values.

In [7]:
# Remove rows from 'DPI' which are NaN; .copy() avoids SettingWithCopyWarning on later column assignments
vessel_data = vessel_data[vessel_data['DPI'].notnull()].copy()

In [8]:
# Check the types of data
vessel_data.dtypes

Batch                     object
Vessel                    object
DPI                       object
Date              datetime64[ns]
Time                      object
Temp_C                   float64
Agitation                float64
DO_pct                   float64
pH_online                float64
pH_offline               float64
O₂_%sat                  float64
CO₂_%sat                 float64
L_Glut                   float64
Gluc                     float64
Lact                     float64
Ammon                    float64
Nuc_NonVi                 object
Nuc_TotCt                float64
Nuc_LiveCt               float64
Nuc_Viab                 float64
MPbio_Blank              float64
MPbio_Area               float64
MPbio_TotCt              float64
MPtvo_Blank              float64
MPtvo_Area               float64
MPtvo_TotCt              float64
Intra_Ct                 float64
Micro_Intra               object
Micro_Extra               object
Micro_Comments            object
Intra_Code

Now that we have removed 'M' from the 'DPI' column, the variable should be numeric, so we'll convert it to integers.

In [9]:
# Convert 'DPI' column to numeric
vessel_data['DPI'] = vessel_data['DPI'].astype(int)

The 'Nuc_NonVi' column should be numeric, and I know what the issue is: that piece of equipment has a lower limit of detection of 5000, and if the value falls below that, we record it as '<5k'. I will replace all of these with 2500 (half the lower limit).

In [10]:
# Replace '<5k' in 'Nuc_NonVi' column with 1/2 lower limit, then make the column explicitly numeric
vessel_data['Nuc_NonVi'] = pd.to_numeric(vessel_data['Nuc_NonVi'].replace('<5k', 5000/2))
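
Substituting half the detection limit makes the column numeric, but it also hides which readings were originally censored. A possible variation, sketched here on made-up values, records a flag before substituting. The names `toy`, `Nuc_NonVi_raw`, and `Nuc_NonVi_belowLOD` are hypothetical and do not exist in the data set.

```python
import pandas as pd

LOD = 5000  # lower limit of detection stated in the text

# Illustrative values only; 'Nuc_NonVi_raw' stands in for the column as read from Excel
toy = pd.DataFrame({'Nuc_NonVi_raw': ['<5k', 12000, '<5k', 8000]})

# Hypothetical flag column: True where the reading was below the detection limit
toy['Nuc_NonVi_belowLOD'] = toy['Nuc_NonVi_raw'].eq('<5k')

# Substitute half the limit for censored readings, then coerce to a numeric dtype
toy['Nuc_NonVi'] = pd.to_numeric(
    toy['Nuc_NonVi_raw'].where(~toy['Nuc_NonVi_belowLOD'], LOD / 2)
)

print(toy)
```

Keeping the flag would make it possible to mark or exclude the substituted points in a later plot, since a run of identical 2500 values could otherwise look like a real plateau.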

The 'Micro_Intra' and 'Micro_Extra' columns are ordered categorical descriptions used when looking at the organism under the microscope. These are standardized, so I will fill the 'Intra_Code' and 'Extra_Code' columns with numeric codes for them.

In [11]:
# Check the standard descriptions for microscopy
print(vessel_data.Micro_Intra.unique())
print(vessel_data.Micro_Extra.unique())

[nan 'Light, 1+' 'Some, 2+' 'Heavy, 3+' 'Very Light, <1+' 'Very Heavy, 4+']
[nan 'Some, 2+' 'Heavy, 3+' 'Very Heavy, 4+' 'Light, 1+' 'Very Light, <1+'
 'Heavy, 3+ ']


'Micro_Extra' appears to have an extra space in at least one of the fields, which will need to be stripped first.

In [12]:
# Strip whitespace from 'Micro_Extra' column
vessel_data['Micro_Extra'] = vessel_data.Micro_Extra.str.strip()

In [13]:
# Map values to new columns based on descriptions
# Named micro_codes so the built-in dict type is not shadowed
micro_codes = {'Very Light, <1+': 1, 'Light, 1+': 2, 'Some, 2+': 3, 'Heavy, 3+': 4, 'Very Heavy, 4+': 5}
vessel_data['Intra_Code'] = vessel_data['Micro_Intra'].map(micro_codes)
vessel_data['Extra_Code'] = vessel_data['Micro_Extra'].map(micro_codes)
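
`Series.map` silently returns NaN for any label that is not a key in the mapping, so a stray space or typo would be indistinguishable from a day with no microscopy. A small check along these lines can catch that. The `labels` series is made up for the example, and one entry deliberately keeps its trailing space.

```python
import numpy as np
import pandas as pd

# Same label-to-code mapping as in the cell above
codes = {'Very Light, <1+': 1, 'Light, 1+': 2, 'Some, 2+': 3,
         'Heavy, 3+': 4, 'Very Heavy, 4+': 5}

# Illustrative labels, including one with trailing whitespace left in on purpose
labels = pd.Series(['Some, 2+', 'Heavy, 3+ ', np.nan, 'Light, 1+'])
coded = labels.map(codes)

# Labels that were present but did not receive a code
unmapped = labels[labels.notna() & coded.isna()].unique()
print(unmapped)
```

An empty result would suggest every recorded description was coded; anything listed points to a label that still needs cleaning.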

The 'VCD_OD' column has data for a probe that I used on two runs, but was not working correctly for one of the runs and provided inaccurate data, which will need to be removed.

In [14]:
# Remove inaccurate data
vessel_data.loc[(vessel_data.Batch == '471341PV'), 'VCD_OD'] = np.nan

Many of the NaN values in the data set are days with no sampling and those can be removed completely. 

In [15]:
print(f"{vessel_data.shape[0]} rows before removing non-sampling days.")

422 rows before removing non-sampling days.


In [16]:
# Remove rows that contain no data beyond the batch/vessel identifiers and day/date/time info
colList = vessel_data.columns[5:]
vessel_data.dropna(axis=0, subset=colList, how="all", inplace=True)

In [17]:
print(f"{vessel_data.shape[0]} rows after removing non-sampling days.")

352 rows after removing non-sampling days.


### Take another look at the data

In [18]:
vessel_data.head()

Unnamed: 0,Batch,Vessel,DPI,Date,Time,Temp_C,Agitation,DO_pct,pH_online,pH_offline,...,Micro_Comments,Intra_Code,Extra_Code,ELISA_KZO,ELISA_LNK,Live_Titer,VCD_OD,RNA,gDNA/1mL,Antifoam
0,369697PV,Flasks,3,2019-07-12,,,,,,,...,,,,,,,,,,
2,369697PV,5L BLU,0,2019-07-12,10:53:00,35.8,40.0,15.39,6.983,7.118,...,,,,,,,,,,
6,369697PV,5L BLU,4,2019-07-16,08:47:00,36.0,40.0,6.14,7.038,6.978,...,,,,,,,,,,
9,369697PV,5L BLU,7,2019-07-19,08:35:00,36.0,40.0,5.77,7.028,6.947,...,,,,,,,,,,
11,369697PV,50L BLU,0,2019-07-19,09:00:00,36.0,42.0,6.26,7.034,7.003,...,,2.0,3.0,,,,,,,


In [19]:
print("Describe Data:")

Describe Data:


In [20]:
print("\tColumns 1-11:")
vessel_data.describe().iloc[:,:11]

	Columns 1-11


Unnamed: 0,DPI,Temp_C,Agitation,DO_pct,pH_online,pH_offline,O₂_%sat,CO₂_%sat,L_Glut,Gluc,Lact
count,352.0,339.0,339.0,339.0,339.0,262.0,241.0,246.0,195.0,249.0,216.0
mean,5.315341,35.986667,49.731711,6.416254,7.046743,7.052683,1.32744,0.082272,0.877846,3.892048,0.256074
std,3.858611,0.110077,10.182909,3.231918,0.04215,0.094749,9.046905,0.127859,0.84672,0.696477,0.278071
min,0.0,34.78,40.0,0.16,6.61,6.658,0.034,0.008,0.0,1.08,0.0
25%,2.0,36.0,42.0,5.59,7.05,7.006,0.377,0.055,0.05,3.6,0.09
50%,5.0,36.0,48.9,5.98,7.05,7.0505,0.475,0.073,0.66,4.12,0.185979
75%,8.0,36.0,49.2,6.3,7.06,7.091,0.614,0.089,1.515,4.39,0.31
max,15.0,36.4,67.3,41.74,7.14,7.525,100.0,1.9,3.27,4.9,1.76


In [21]:
print("\tColumns 12-22:")
vessel_data.describe().iloc[:,11:22]

	Columns 12-22


Unnamed: 0,Ammon,Nuc_NonVi,Nuc_TotCt,Nuc_LiveCt,Nuc_Viab,MPbio_Blank,MPbio_Area,MPbio_TotCt,MPtvo_Blank,MPtvo_Area,MPtvo_TotCt
count,226.0,269.0,269.0,269.0,269.0,240.0,240.0,241.0,240.0,240.0,241.0
mean,1.924248,29711.895911,564230.5,533546.5,0.946986,1105.3875,474082.0,120770500.0,289.845833,153250.683333,77458170.0
std,1.160151,42916.459196,436142.7,407287.8,0.05034,1404.196599,375951.1,98806190.0,1363.298865,154588.907064,603218000.0
min,0.28,2500.0,15000.0,10000.0,0.561905,21.0,2790.0,669000.0,0.0,917.0,219000.0
25%,0.9625,2500.0,189000.0,184000.0,0.928244,348.5,137676.5,34300000.0,11.0,39264.0,9741250.0
50%,1.8,8000.0,423000.0,413000.0,0.962121,568.0,422881.0,105467000.0,32.0,113248.0,29400000.0
75%,2.85,42000.0,855000.0,823000.0,0.980769,1111.0,755371.5,201895200.0,53.0,200863.75,53700000.0
max,5.93,212000.0,1880000.0,1835000.0,0.994111,7275.0,1400818.0,394000000.0,16757.0,716160.0,9384000000.0


In [22]:
print("\tColumns 23-32:")
vessel_data.describe().iloc[:,22:32]

	Columns 23-32


Unnamed: 0,Intra_Ct,Intra_Code,Extra_Code,ELISA_KZO,ELISA_LNK,Live_Titer,VCD_OD,RNA,gDNA/1mL,Antifoam
count,76.0,247.0,247.0,82.0,21.0,5.0,11.0,47.0,56.0,7.0
mean,15.894737,2.65587,3.890688,9263.341463,6926.571429,4.53,1.521818,15827.859284,196057900.0,1.0
std,23.289242,1.054894,1.17917,3295.625088,2853.117638,0.710282,0.925406,22378.752738,150741200.0,0.0
min,0.0,1.0,1.0,1630.0,3156.0,3.5,0.05,170.0,2545546.0,1.0
25%,2.75,2.0,3.0,6617.25,4889.0,4.5,0.765,1478.957764,76613840.0,1.0
50%,7.5,3.0,4.0,9670.25,5549.0,4.5,1.56,7378.62793,135999400.0,1.0
75%,20.0,3.0,5.0,11124.5,9758.0,4.65,2.355,23748.478516,287905700.0,1.0
max,127.0,5.0,5.0,17020.0,11965.0,5.5,2.58,103762.65625,637431800.0,1.0


In [23]:
print("Summarized Data:")
vessel_data.describe(include=['O'])

Summarized Data:


Unnamed: 0,Batch,Vessel,Time,Micro_Intra,Micro_Extra,Micro_Comments
count,352,352,341,247,247,248.0
unique,12,5,162,5,5,23.0
top,381550PV,1000L SUB,09:00:00,"Some, 2+","Very Heavy, 4+",
freq,38,109,7,97,105,225.0


### Think about the data

### Visualize Data
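
Given the goal of replacing the MPbio instrument, one possible first plot is a scatter of an in-process measurement against the ELISA result, together with a correlation coefficient. The block below is only a sketch under assumptions: the values in `toy` are synthetic placeholders, not measurements from these batches, and Pearson correlation is one choice among several. With the real data, `toy` would be replaced by `vessel_data`, and the x column could be swapped for any candidate method.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic placeholder values, NOT measurements from the batches described here
toy = pd.DataFrame({
    'MPbio_TotCt': [3.0e7, 8.0e7, 1.2e8, 2.0e8, 3.1e8],
    'ELISA_KZO': [4000, 6500, 9000, 11000, 15000],
})

# Keep only samples where both measurements exist
paired = toy.dropna(subset=['MPbio_TotCt', 'ELISA_KZO'])

# Scatter plot with labeled axes and a title
fig, ax = plt.subplots()
ax.scatter(paired['MPbio_TotCt'], paired['ELISA_KZO'])
ax.set_xlabel('MPbio_TotCt')
ax.set_ylabel('ELISA_KZO')
ax.set_title('ELISA_KZO vs. MPbio_TotCt (synthetic example)')

# Pearson correlation between the two columns
r = paired['MPbio_TotCt'].corr(paired['ELISA_KZO'])
print(f"Pearson r = {r:.2f}")
```

Dropping unpaired rows first matters here because ELISA is measured on far fewer samples than the in-process counts, so most rows would otherwise contribute nothing to the plot.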

## 3) Write a short overview/conclusion of the insights gained from your graphical analysis.