### Pymaceuticals Skill Drill - Day 1

Congratulations, you have been hired by Pymaceuticals Inc., a leading (imaginary) pharmaceutical company that specializes in anti-cancer drugs. You will assist the senior scientist team as they begin screening potential treatments for squamous cell carcinoma (SCC), a commonly occurring form of skin cancer.

In this study, 249 mice identified with SCC tumor growth were treated with a variety of drug regimens. Over the course of 45 days, tumor development was observed and measured. The purpose of the study was to compare the performance of Pymaceuticals' drug of interest, Capomulin, against the other treatment regimens. The senior scientist team has asked you to produce an initial comparison of the drug regimens and summarize your findings.

For this skill drill, you will walk through the steps of a basic analysis and visualize the dataset with a new type of chart: a box and whisker plot. Although we have provided all of the steps required to produce each output, this skill drill may include concepts or terminology you have not seen before. If you are ever stuck or confused, try googling some of the terms or check out the resource links we provide throughout the activity. You got this!

### Data Cleaning 

In [1]:
%matplotlib notebook

In [2]:
# Import dependencies
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
# Read the mouse data and the study results
mouse_df = pd.read_csv('../Resources/Mouse_metadata.csv')
study_df = pd.read_csv('../Resources/Study_results.csv')

In [4]:
# Display the mouse data
mouse_df.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g)
0,k403,Ramicane,Male,21,16
1,s185,Capomulin,Female,3,17
2,x401,Capomulin,Female,16,15
3,m601,Capomulin,Male,22,17
4,g791,Ramicane,Male,11,16


In [5]:
# Display the study data
study_df.head()

Unnamed: 0,Mouse ID,Timepoint,Tumor Volume (mm3),Metastatic Sites
0,b128,0,45.0,0
1,f932,0,45.0,0
2,g107,0,45.0,0
3,a457,0,45.0,0
4,c819,0,45.0,0


In [6]:
# Combine the data into a single dataset and display it
combined_df = mouse_df.merge(study_df, how='inner', on='Mouse ID')
combined_df.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
0,k403,Ramicane,Male,21,16,0,45.0,0
1,k403,Ramicane,Male,21,16,5,38.825898,0
2,k403,Ramicane,Male,21,16,10,35.014271,1
3,k403,Ramicane,Male,21,16,15,34.223992,1
4,k403,Ramicane,Male,21,16,20,32.997729,1
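
If you want to convince yourself that the merge behaves as expected, the sketch below repeats it on a tiny made-up dataset. The `toy_` DataFrames and their values are hypothetical, and `validate="one_to_many"` is an optional safety check that is not part of the original cell; it asks pandas to raise an error if a mouse appears more than once in the metadata.

```python
import pandas as pd

# Hypothetical toy data: one metadata row per mouse, several study rows per mouse.
toy_mouse_df = pd.DataFrame({
    "Mouse ID": ["a1", "b2"],
    "Drug Regimen": ["Capomulin", "Placebo"],
})
toy_study_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2"],
    "Timepoint": [0, 5, 0],
    "Tumor Volume (mm3)": [45.0, 43.9, 45.0],
})

# Same merge as above, plus a check that "Mouse ID" is unique on the left side.
toy_combined_df = toy_mouse_df.merge(
    toy_study_df, how="inner", on="Mouse ID", validate="one_to_many"
)

# Every study row should survive the merge when all mice have metadata.
print(len(toy_combined_df), len(toy_study_df))
```

When every mouse in the study results has a metadata row, the combined DataFrame has the same number of rows as the study results.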


In [7]:
# Get the IDs of mice that have more than one row for the same Mouse ID and Timepoint.
duplicate_mice = combined_df.loc[combined_df.duplicated(subset=['Mouse ID','Timepoint']), 'Mouse ID']
duplicate_mice.head()

909    g989
911    g989
913    g989
915    g989
917    g989
Name: Mouse ID, dtype: object
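
Notice that the result above repeats the same ID once per duplicated row. The sketch below, on hypothetical toy data, shows how the `keep` argument of `duplicated` changes which rows get flagged and how `.unique()` collapses the result to one entry per mouse. The variable names are for illustration only.

```python
import pandas as pd

# Hypothetical toy data: mouse "b2" has two rows at timepoint 0.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "b2", "b2"],
    "Timepoint": [0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 44.1, 45.0, 45.0, 47.3],
})

# The default keep='first' flags only the later copies of a repeated pair.
later_copies = toy_df.duplicated(subset=["Mouse ID", "Timepoint"])

# keep=False flags every row that belongs to a repeated pair.
all_copies = toy_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)

# One entry per affected mouse, no matter how many rows were flagged.
duplicate_ids = toy_df.loc[all_copies, "Mouse ID"].unique()

print(later_copies.tolist())
print(all_copies.tolist())
print(duplicate_ids)
```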

In [8]:
# Optional: Show the full rows that were flagged as duplicates (repeated Mouse ID and Timepoint pairs).
duplicate_mice = combined_df.loc[combined_df.duplicated(subset=['Mouse ID','Timepoint']), :]
duplicate_mice.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
909,g989,Propriva,Female,21,26,0,45.0,0
911,g989,Propriva,Female,21,26,5,47.570392,0
913,g989,Propriva,Female,21,26,10,49.880528,0
915,g989,Propriva,Female,21,26,15,53.44202,0
917,g989,Propriva,Female,21,26,20,54.65765,1


In [9]:
# Create a clean DataFrame by dropping the repeated Mouse ID/Timepoint rows (the first occurrence is kept) and display it
clean_combined_df = combined_df.drop_duplicates(subset=['Mouse ID', 'Timepoint'])
clean_combined_df.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
0,k403,Ramicane,Male,21,16,0,45.0,0
1,k403,Ramicane,Male,21,16,5,38.825898,0
2,k403,Ramicane,Male,21,16,10,35.014271,1
3,k403,Ramicane,Male,21,16,15,34.223992,1
4,k403,Ramicane,Male,21,16,20,32.997729,1
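
Note that `drop_duplicates` removes only the repeated rows; the affected mouse stays in the dataset. If you would rather exclude that mouse entirely, one possible approach is to filter on its ID. The sketch below compares both options on hypothetical toy data and is not the method used in the rest of this drill.

```python
import pandas as pd

# Hypothetical toy data: mouse "b2" has conflicting rows at timepoint 0.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "b2", "b2"],
    "Timepoint": [0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 44.1, 45.0, 46.2, 47.3],
})

# Find the IDs of mice with a repeated (Mouse ID, Timepoint) pair.
dup_mask = toy_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dup_ids = toy_df.loc[dup_mask, "Mouse ID"].unique()

# Option A: keep the first row of each repeated pair (the mouse stays).
rows_deduped = toy_df.drop_duplicates(subset=["Mouse ID", "Timepoint"])

# Option B: remove every row that belongs to an affected mouse.
mice_removed = toy_df[~toy_df["Mouse ID"].isin(dup_ids)]

print(len(rows_deduped), len(mice_removed))
```

Which option is appropriate depends on whether you trust any of the duplicated mouse's measurements.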


### Generating the Boxplot

In [10]:
# Determine the final timepoint for each mouse.
# Start by getting the greatest timepoint for each mouse
max_timepoint_df = study_df.groupby(['Mouse ID'])['Timepoint'].max().reset_index(name='max_timepoint')

max_timepoint_df.head()


Unnamed: 0,Mouse ID,max_timepoint
0,a203,45
1,a251,45
2,a262,45
3,a275,45
4,a366,30


In [11]:
# Join the max-timepoint DataFrame to the cleaned DataFrame from the Data Cleaning section,
# so that each mouse keeps only the row for its final measurement
tumor_volume_df = pd.merge(max_timepoint_df, clean_combined_df, left_on=["Mouse ID", "max_timepoint"], right_on=["Mouse ID", "Timepoint"])
tumor_volume_df.head()

Unnamed: 0,Mouse ID,max_timepoint,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
0,a203,45,Infubinol,Female,20,23,45,67.973419,2
1,a251,45,Infubinol,Female,21,25,45,65.525743,1
2,a262,45,Placebo,Female,17,29,45,70.717621,4
3,a275,45,Ceftamin,Female,20,28,45,62.999356,3
4,a366,30,Stelasyn,Female,16,29,30,63.440686,1
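
To see why joining on both the ID and the maximum timepoint leaves exactly one row per mouse, here is a minimal sketch on hypothetical toy data. The `validate="one_to_one"` check and the `idxmax` shortcut are optional extras that do not appear in the cells above.

```python
import pandas as pd

# Hypothetical toy data: mouse "a1" was measured three times, "b2" twice.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 5, 10, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.2, 40.1, 45.0, 48.7],
})

# Step 1: the greatest timepoint for each mouse.
last_seen = (
    toy_df.groupby("Mouse ID")["Timepoint"].max().reset_index(name="max_timepoint")
)

# Step 2: join back on (Mouse ID, timepoint); validate raises if keys repeat.
final_df = last_seen.merge(
    toy_df,
    left_on=["Mouse ID", "max_timepoint"],
    right_on=["Mouse ID", "Timepoint"],
    validate="one_to_one",
)

# Alternative: select the row holding each mouse's maximum timepoint directly.
via_idxmax = toy_df.loc[toy_df.groupby("Mouse ID")["Timepoint"].idxmax()]

print(final_df["Tumor Volume (mm3)"].tolist())
print(via_idxmax["Tumor Volume (mm3)"].tolist())
```

Both approaches return the same final tumor volumes; the merge version keeps the extra `max_timepoint` column, as in the output above.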


In [12]:
# Create an array of all 10 drug regimens.
drug_list = mouse_df['Drug Regimen'].unique()

# Create an empty list to fill with the tumor volume data
tumor_vol_data = []

# Filter the mice on each drug to collect their final tumor volumes.
for drug in drug_list:

    # Locate the rows that match the drug and get the final tumor volumes of all mice
    drug_volumes = tumor_volume_df.loc[tumor_volume_df['Drug Regimen'] == drug, 'Tumor Volume (mm3)']

    # Append the result to the list created above.
    tumor_vol_data.append(drug_volumes)

    # Determine potential outliers using the 1.5 * IQR rule
    quartiles = drug_volumes.quantile([.25, .5, .75])
    q_one = quartiles[0.25]
    q_three = quartiles[0.75]
    iqr = q_three - q_one
    lower_bound = q_one - (1.5*iqr)
    upper_bound = q_three + (1.5*iqr)
    outliers = drug_volumes.loc[(drug_volumes < lower_bound) | (drug_volumes > upper_bound)]
    print(f"{drug}'s potential outliers: {outliers}")

Ramicane's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Capomulin's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Infubinol's potential outliers: 31    36.321346
Name: Tumor Volume (mm3), dtype: float64
Placebo's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Ceftamin's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Stelasyn's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Zoniferol's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Ketapril's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Propriva's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
Naftisol's potential outliers: Series([], Name: Tumor Volume (mm3), dtype: float64)
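
The outlier check inside the loop follows the common 1.5 × IQR rule: any value below Q1 − 1.5·IQR or above Q3 + 1.5·IQR is flagged. As a worked example, the sketch below wraps the same arithmetic in a hypothetical helper, `iqr_outliers`, and applies it to made-up volumes.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q_one = values.quantile(0.25)
    q_three = values.quantile(0.75)
    iqr = q_three - q_one
    lower_bound = q_one - k * iqr
    upper_bound = q_three + k * iqr
    return values[(values < lower_bound) | (values > upper_bound)]


# Hypothetical final tumor volumes for one regimen.
volumes = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0, 36.0])

# Q1 = 60.5, Q3 = 65.5, IQR = 5.0, so the bounds are 53.0 and 73.0.
print(iqr_outliers(volumes).tolist())  # [36.0]
```

The quartiles here use pandas' default linear interpolation, so other tools may place the bounds slightly differently on small samples.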


In [13]:
# ALTERNATIVE METHOD USING EMPTY DATAFRAME
# Create a list with all 10 drug regimens.
#drug_list = mouse_df['Drug Regimen'].unique()

#box_df = pd.DataFrame()
# Create an empty list to fill with the tumor volume data
#tumor_vol_data = []

# Filter the mice on each drug to collect their final tumor volumes.
#for i in drug_list:
    #print(i)
    # Locate the rows which match the drug and get the final tumor volumes of all mice
    #drug_volume_df = tumor_volume_df.loc[tumor_volume_df['Drug Regimen'] == i].reset_index()
    #print(drug_volume_df['Tumor Volume (mm3)'].reset_index().head())
    #drug_volume_list = drug_volume_df['Tumor Volume (mm3)'].to_list()
    
    # Append the outcome to the empty list previously created.
    #tumor_vol_data.append(drug_volume_list)
    #box_df["{}".format(i)] = drug_volume_df['Tumor Volume (mm3)']
#print(len(tumor_vol_data), '\n', tumor_vol_data)
#box_df.head()

In [14]:
# Create a boxplot that visualizes the final tumor volume of all mice in the study across all drug regimens.
# Define a custom marker style for all outliers in the visualization
flierprops = dict(marker='o', markerfacecolor='purple', markersize=12,
                  markeredgecolor='none')

# Create horizontal box and whisker plot
plt.boxplot(tumor_vol_data, labels=drug_list, vert=False, flierprops=flierprops)

#box_df.boxplot()


[Figure: horizontal box and whisker plot of final tumor volume (mm3) for each of the 10 drug regimens]
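
The dictionary that `plt.boxplot` returns groups the artists it drew: each box comes with two whiskers and two caps, which is why ten regimens produce twenty whisker lines. The sketch below reproduces this on hypothetical toy data. The `Agg` backend, the made-up drug names, and setting the tick labels separately instead of passing `labels=` are assumptions chosen so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no notebook is required
import matplotlib.pyplot as plt

# Hypothetical final tumor volumes for two made-up regimens.
toy_volumes = {
    "Drug A": [38.0, 40.0, 41.0, 42.0, 44.0],
    "Drug B": [60.0, 62.0, 63.0, 64.0, 66.0, 36.0],
}

fig, ax = plt.subplots()

# Horizontal box and whisker plot with purple outlier markers.
artists = ax.boxplot(
    list(toy_volumes.values()),
    vert=False,
    flierprops=dict(marker="o", markerfacecolor="purple", markeredgecolor="none"),
)

# Boxes are drawn at positions 1..N, so label those ticks with the drug names.
ax.set_yticks([1, 2])
ax.set_yticklabels(list(toy_volumes.keys()))
ax.set_xlabel("Final Tumor Volume (mm3)")

# Count the artists in each group: two whiskers and two caps per box.
print({name: len(items) for name, items in artists.items()})
plt.close(fig)
```

In a notebook, ending the plotting line with a semicolon or assigning the result to a variable, as done here, keeps the long dictionary repr out of the cell output.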