## Observations and Insights 

We analyzed data from a study testing the effectiveness of 10 drug regimens (including a placebo) in reducing SCC tumor size in mice. The study included 249 mice. The results for one mouse were discarded as unreliable because its records contained duplicated timepoints, leaving data from 248 mice, split evenly between male and female. Some of our observations are:

Capomulin and Ramicane show a lower mean, median, and variance of tumor volume over the study period, suggesting some effectiveness in reducing tumor size compared to the other regimens and the placebo.

The superior performance of Capomulin and Ramicane is also evident in the boxplot of final tumor volumes, which compares them with two other regimens, Ceftamin and Infubinol.

A scatter plot and linear regression of average tumor volume against mouse weight indicate a negative correlation between mouse weight and tumor volume reduction under Capomulin. That is, the lighter the mouse, the greater the reduction in tumor volume, with little or no reduction for mice weighing near 24 grams.

These findings suggest that Capomulin or Ramicane may have some efficacy in reducing SCC tumor size, and that lighter mice (below 24 grams) may be more receptive to these drug regimens.

In [1]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as st

# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single dataset
mouse_study_df = pd.merge(study_results, mouse_metadata, on="Mouse ID")

In [2]:
#### DONE - Checking the number of mice in the DataFrame.

num_mice_ID_df = pd.DataFrame({'Number of mice in study':[mouse_study_df['Mouse ID'].nunique()]})

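#Table of number of mice in the study
#Note: Styler.hide_index() was deprecated in pandas 1.4 and removed in 2.0. On newer versions use .style.hide(axis="index")
num_mice_ID_df.style.hide_index()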

Number of mice in study
249


In [3]:
#### DONE - Getting the duplicate mice by ID number that shows up for Mouse ID and Timepoint. 

#Add a column flagged TRUE wherever a Mouse ID/Timepoint combination occurs more than once. This should not be possible, so it indicates that the data for these mice is corrupted somehow.
mouse_study_df['Dup ID TP'] = mouse_study_df.duplicated(subset=['Mouse ID', 'Timepoint'], keep=False)

#Make a list of the Mouse IDs that have duplicate Mouse ID + Timepoint combinations.
dup_IDs = mouse_study_df.loc[mouse_study_df['Dup ID TP']]
dup_IDs = dup_IDs['Mouse ID'].unique()

#Add a flag in the df for all mice with a Mouse ID that is in the duplicates list. These will be removed where TRUE.
mouse_study_df['Dup Mouse'] = mouse_study_df['Mouse ID'].isin(dup_IDs)

#Table showing the duplicate Mouse IDs
pd.DataFrame({"Duplicate Mouse IDs" : dup_IDs}).style.hide_index()

Duplicate Mouse IDs
g989
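
The flagging logic above can be checked on a small made-up example. This is only a sketch: the toy records and the names `toy_df`, `is_dup_pair`, `toy_dup_ids`, and `toy_clean_df` are hypothetical and not part of the study data.

```python
import pandas as pd

# Hypothetical toy records: mouse "b" has two rows for Timepoint 0
toy_df = pd.DataFrame({
    "Mouse ID": ["a", "a", "b", "b", "b"],
    "Timepoint": [0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 46.1, 45.0, 45.0, 47.2],
})

# keep=False marks every row of a repeated Mouse ID/Timepoint pair,
# not just the later copies
is_dup_pair = toy_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)

# Any mouse with at least one repeated pair is treated as unreliable
toy_dup_ids = toy_df.loc[is_dup_pair, "Mouse ID"].unique()

# Drop every record of those mice, including rows that were not themselves repeated
toy_clean_df = toy_df.loc[~toy_df["Mouse ID"].isin(toy_dup_ids)]
```

Filtering by Mouse ID rather than by the repeated rows alone removes the whole mouse, which matches the decision in this notebook to discard all of its records.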


In [4]:
#### DONE - Optional: Get all the data for the duplicate mouse ID. 

#Where Dup Mouse flag is TRUE (the flag column is already boolean, so it can be used as the mask directly)
dup_mice_df = mouse_study_df.loc[mouse_study_df['Dup Mouse']]

#The study records associated with the duplicate mouse ID which should be removed from the analysis
dup_mice_df

Unnamed: 0,Mouse ID,Timepoint,Tumor Volume (mm3),Metastatic Sites,Drug Regimen,Sex,Age_months,Weight (g),Dup ID TP,Dup Mouse
860,g989,0,45.0,0,Propriva,Female,21,26,True,True
861,g989,0,45.0,0,Propriva,Female,21,26,True,True
862,g989,5,48.786801,0,Propriva,Female,21,26,True,True
863,g989,5,47.570392,0,Propriva,Female,21,26,True,True
864,g989,10,51.745156,0,Propriva,Female,21,26,True,True
865,g989,10,49.880528,0,Propriva,Female,21,26,True,True
866,g989,15,51.325852,1,Propriva,Female,21,26,True,True
867,g989,15,53.44202,0,Propriva,Female,21,26,True,True
868,g989,20,55.326122,1,Propriva,Female,21,26,True,True
869,g989,20,54.65765,1,Propriva,Female,21,26,True,True


In [5]:
#### DONE - Create a clean DataFrame by dropping the duplicate mouse by its ID.

#Create final df with only non-duplicate mice (flag = FALSE) and drop the duplicate mouse flag working columns.
#drop returns a new DataFrame, which avoids modifying a slice of mouse_study_df
no_dups_mouse_study_df = mouse_study_df.loc[~mouse_study_df['Dup Mouse']].drop(columns=['Dup ID TP', 'Dup Mouse'])

#The final working table from the study with all the mouse and study data and with duplicates eliminated.
no_dups_mouse_study_df

Unnamed: 0,Mouse ID,Timepoint,Tumor Volume (mm3),Metastatic Sites,Drug Regimen,Sex,Age_months,Weight (g)
0,b128,0,45.000000,0,Capomulin,Female,9,22
1,b128,5,45.651331,0,Capomulin,Female,9,22
2,b128,10,43.270852,0,Capomulin,Female,9,22
3,b128,15,43.784893,0,Capomulin,Female,9,22
4,b128,20,42.731552,0,Capomulin,Female,9,22
...,...,...,...,...,...,...,...,...
1888,m601,25,33.118756,1,Capomulin,Male,22,17
1889,m601,30,31.758275,1,Capomulin,Male,22,17
1890,m601,35,30.834357,1,Capomulin,Male,22,17
1891,m601,40,31.378045,1,Capomulin,Male,22,17


In [6]:
#### DONE - Checking the number of mice in the clean DataFrame.

no_dup_num_mice_ID = pd.DataFrame({'Number mice after removing duplicates':[no_dups_mouse_study_df['Mouse ID'].nunique()]})

#A table showing the number of mice in the study after eliminating the duplicate Mouse IDs
no_dup_num_mice_ID.style.hide_index()

Number mice after removing duplicates
248


## Summary Statistics

In [7]:
#### DONE - Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen

#Compute each requested statistic as a Series indexed by Drug Regimen
tumor_vol_by_regimen = no_dups_mouse_study_df.groupby('Drug Regimen')['Tumor Volume (mm3)']
drug_reg_means = tumor_vol_by_regimen.mean()
drug_reg_medians = tumor_vol_by_regimen.median()
drug_reg_vars = tumor_vol_by_regimen.var()
drug_reg_stds = tumor_vol_by_regimen.std()
drug_reg_sems = tumor_vol_by_regimen.sem()

#Create final summary stats table from the Series. They share the Drug Regimen index, so it becomes the index of the table
drug_reg_summary_stats = pd.DataFrame({
    'Tumor Volume mean': drug_reg_means,
    'Tumor Volume median': drug_reg_medians,
    'Tumor Volume variance': drug_reg_vars,
    'Tumor Volume std dev': drug_reg_stds,
    'Tumor Volume SEM': drug_reg_sems,
})

#Table of summary tumor volume statistics for each Drug Regimen
drug_reg_summary_stats
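
The five statistics can also be produced in a single pass with `groupby(...).agg(...)`. The sketch below uses made-up tumor volumes for two hypothetical regimens, "A" and "B", purely to illustrate the call. It is an alternative to the cell above, not the method the cell uses.

```python
import pandas as pd

# Hypothetical toy data: three tumor volume readings for each of two regimens
toy_df = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 50.0, 55.0, 60.0],
})

# One groupby, five aggregations. The result is indexed by Drug Regimen
# with one column per statistic
toy_summary = (
    toy_df.groupby("Drug Regimen")["Tumor Volume (mm3)"]
    .agg(["mean", "median", "var", "std", "sem"])
)
```

Because the result is already indexed by `Drug Regimen`, no separate list of regimen names is needed to label the rows.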

## Bar Plots

In [None]:
#### DONE - Generate a bar plot showing the number of mice per time point for each treatment throughout the course of the study using pandas. 

#Plot a bar chart using the Pandas plot method. Group by Timepoint, then count the occurrences of each Drug Regimen. Each occurrence indicates a live mouse under that Drug Regimen. Unstack creates a pivot-table-like structure.
no_dups_mouse_study_df.groupby('Timepoint')['Drug Regimen'].value_counts().unstack().plot(kind='bar')

#Set labels, titles, legend
plt.ylabel('Surviving mice')
plt.xlabel('Day of Study')
plt.title('Comparison of Drug Regimens - Mice Survival')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

#Bar plot of surviving mice at each timepoint for each drug regimen (Using Pandas .plot)
plt.show()

In [None]:
#### DONE - Generate a bar plot showing the number of mice per time point for each treatment throughout the course of the study using pyplot.

#Group by Timepoint, then count the occurrences of each Drug Regimen. Each occurrence indicates a live mouse under that Drug Regimen. Unstack creates a pivot-table-like structure.
mouse_plot = no_dups_mouse_study_df.groupby('Timepoint')['Drug Regimen'].value_counts().unstack()

#Capture the number of Drug Regimens (bars per group) for the plot looping, and the x position of each Timepoint group
num_DR = np.arange(len(mouse_plot.columns))
timepoint_pos = np.arange(len(mouse_plot.index))

#set x axis labels, location, and legend values
xtick_labels = mouse_plot.index
xtick_loc = timepoint_pos
legend_values = mouse_plot.columns

# set width of a bar in the bar chart
barWidth = 0.05

#Plot bar chart using pyplot.bar method. Shift each Drug Regimen's bars by barWidth so the regimens sit side by side within each Timepoint group
for i in num_DR:
    plt.bar((timepoint_pos-len(num_DR)*barWidth/2)+barWidth*i, mouse_plot.iloc[:,i], width=barWidth)

#Set labels, title, legend
plt.xticks(xtick_loc, xtick_labels, rotation=90)
plt.ylabel('Surviving mice')
plt.xlabel('Day of Study')
plt.title('Comparison of Drug Regimens - Mice Survival')
plt.legend(legend_values,loc='center left', bbox_to_anchor=(1, 0.5))

#Bar plot of surviving mice at each timepoint for each drug regimen (Using Matplotlib pyplot)
plt.show()
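
The x positions for a grouped bar chart like the one above can be worked out separately from the plotting. The following is a minimal sketch with hypothetical counts and names (`n_groups`, `n_bars`, `bar_width`, `bar_centers`). It centres each cluster exactly on its group position by using `(n_bars - 1) / 2`, which differs slightly from the half-cluster shift used in the cell above.

```python
import numpy as np

n_groups = 3     # hypothetical number of timepoints on the x axis
n_bars = 4       # hypothetical number of drug regimens drawn within each group
bar_width = 0.2  # hypothetical width of a single bar

# One integer position per group (timepoint)
group_pos = np.arange(n_groups)

# Offsets that centre the cluster of bars on each group position
offsets = (np.arange(n_bars) - (n_bars - 1) / 2) * bar_width

# Row i holds the x coordinates of bar series i across all groups
bar_centers = group_pos + offsets[:, np.newaxis]
```

Each row of `bar_centers` could then be passed as the x argument of one `plt.bar` call, with `width=bar_width`.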

## Pie Plots

In [None]:
#### DONE - Generate a pie plot showing the distribution of female versus male mice using pandas

#Get mice from original mouse database, discard the duplicate mice
mouse_gender_df = mouse_metadata.loc[~mouse_metadata['Mouse ID'].isin(dup_IDs)]

#Group by gender and get count
mouse_gender_df_grp = mouse_gender_df.groupby('Sex')
gender_count = mouse_gender_df_grp['Mouse ID'].count()

#Plot pie chart using Pandas .plot functionality. gender_count is a Series, so no y column is needed and no legend is drawn by default
gender_count.plot(kind='pie', autopct='%1.1f%%', title='Gender distribution for mice in study')
plt.ylabel('')

#Pie chart of mice gender distribution (Using Pandas .plot)
plt.show()

In [None]:
#### DONE - Generate a pie plot showing the distribution of female versus male mice using pyplot

#Pie plot of gender using pyplot
plt.pie(gender_count, labels=gender_count.index, autopct='%1.1f%%')
plt.title('Gender distribution for mice in study')

#Pie chart of mice gender distribution (Using Matplotlib pyplot)
plt.show()

## Quartiles, Outliers and Boxplots

In [None]:
#### DONE - Calculate the final tumor volume of each mouse across four of the most promising treatment regimens. Calculate the IQR and quantitatively determine if there are any potential outliers.  

#List of the top 4 treatments
Top4_DR =['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']

#Sort descending by Timepoint within each Mouse ID, then take the first row for each ID. This gives a df with one row per mouse that contains its final timepoint and tumor volume
Last_tp_by_Mouse_ID = no_dups_mouse_study_df.sort_values(['Mouse ID','Timepoint'],ascending=False).groupby('Mouse ID').first()

#Make a table of the Tumor Volume at the last timepoint for each mouse that's in the top 4 Drug Regimens
Final_TV_for_top4_DR =Last_tp_by_Mouse_ID.loc[Last_tp_by_Mouse_ID['Drug Regimen'].isin(Top4_DR)]

#Group by Drug Regimen. Use describe to get stats and save as df
Final_tp_by_DR = Final_TV_for_top4_DR.groupby('Drug Regimen')
Final_TV_size_stats = pd.DataFrame(Final_tp_by_DR['Tumor Volume (mm3)'].describe())

#Calculate IQR and the upper and lower bounds for outlier identification
Final_TV_size_stats['IQR']= Final_TV_size_stats['75%']-Final_TV_size_stats['25%']
Final_TV_size_stats['Outlier_Upper']= Final_TV_size_stats['75%']+Final_TV_size_stats['IQR']*1.5
Final_TV_size_stats['Outlier_Lower']= Final_TV_size_stats['25%']-Final_TV_size_stats['IQR']*1.5

#Flag Outlier as TRUE if the min or max falls outside the lower or upper outlier bound
Final_TV_size_stats['Outlier'] = (Final_TV_size_stats['min']<Final_TV_size_stats['Outlier_Lower']) | (Final_TV_size_stats['max']>Final_TV_size_stats['Outlier_Upper'])

#Cleanup stats not needed for this exercise
del Final_TV_size_stats['count']
del Final_TV_size_stats['mean']
del Final_TV_size_stats['std']

#Table of statistics for the final Tumor Volume under each of the top 4 regimens, including the outlier flag
Final_TV_size_stats
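
The table above only flags whether a regimen has any value outside the bounds. A minimal sketch of the same 1.5 × IQR rule, applied to made-up final tumor volumes so that the individual outlying values are listed, could look like this. The data and the names `final_volumes`, `lower_bound`, `upper_bound`, and `outliers` are hypothetical.

```python
import pandas as pd

# Hypothetical final tumor volumes (mm3) for one regimen
final_volumes = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 100.0])

# Quartiles and interquartile range (pandas uses linear interpolation by default)
q1 = final_volumes.quantile(0.25)
q3 = final_volumes.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as potential outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = final_volumes[(final_volumes < lower_bound) | (final_volumes > upper_bound)]
```

Here the quartiles are 12.5 and 15.5, so the bounds are 8.0 and 20.0 and only the value 100.0 is reported.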

In [None]:
#### DONE - Generate a box plot of the final tumor volume of each mouse across four regimens of interest
#Capomulin, Ramicane, Infubinol, and Ceftamin

#Change the marker and color of any outliers
red_diamond = dict(markerfacecolor='r', marker='D')

#Create a boxplot for the final Tumor Volumes under the top 4 DR's. Outlier is a red diamond
Final_TV_for_top4_DR.boxplot(by='Drug Regimen', column='Tumor Volume (mm3)', flierprops=red_diamond)
plt.suptitle('')

#A boxplot of final tumor volume distribution for each of the top 4 drug regimens. Outliers as red diamonds
plt.show()

## Line and Scatter Plots

In [None]:
#### DONE - Generate a line plot of time point versus tumor volume for a mouse treated with Capomulin
#Start with the clean no-duplicates mouse df.
mouse_metadata_no_dups = mouse_metadata.loc[mouse_metadata['Mouse ID'].isin(no_dups_mouse_study_df['Mouse ID'])]

#Input for which Drug Regimen to plot
which_regimen_plot = 'Capomulin'

#Create a list of all the mice IDs that were in the target Drug Regimen
regimen_mouse_IDs = pd.Series(mouse_metadata_no_dups['Mouse ID'].loc[mouse_metadata_no_dups['Drug Regimen']==which_regimen_plot])

#Generate a random list of ID's to be plotted
how_many_mice_plot = 1
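#replace=False ensures the same mouse is not picked twice if more than one mouse is requested
IDs_to_plot = np.random.choice(regimen_mouse_IDs, how_many_mice_plot, replace=False)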

#Select just the mice in the desired Drug Regimen
plot_mice = no_dups_mouse_study_df.loc[no_dups_mouse_study_df['Drug Regimen']==which_regimen_plot]
#Select just the ID, Timepoint, and Tumor Volume columns that will be needed to do the plot
plot_mice = plot_mice[['Mouse ID', 'Timepoint', 'Tumor Volume (mm3)']]
#Select just the mice with the random ID generated
plot_mice = plot_mice.loc[plot_mice['Mouse ID'].isin(IDs_to_plot)]

#For the selected mouse at each timepoint plot the tumor volume. 
plt.plot(plot_mice['Timepoint'], plot_mice['Tumor Volume (mm3)'])
plt.xticks(np.arange(0,50,5))
plt.xlabel('Days into the study')
plt.ylabel('Tumor Volume (mm3)')
#IDs_to_plot is an array, so join its IDs into one string for the title
plt.title('Results for Mouse: '+', '.join(IDs_to_plot)+' with Drug Regimen: '+which_regimen_plot)

#A line plot of the progression in tumor size for a single mouse over the course of the study (Using pyplot)
plt.show()

In [None]:
#### DONE - Generate a scatter plot of mouse weight versus average tumor volume for the Capomulin regimen
scatter_plot_mice = no_dups_mouse_study_df.loc[no_dups_mouse_study_df['Drug Regimen']==which_regimen_plot]
scatter_plot_mice = scatter_plot_mice.groupby('Mouse ID')

plt.scatter(scatter_plot_mice['Weight (g)'].mean(), scatter_plot_mice['Tumor Volume (mm3)'].mean())
plt.xlabel('Mouse weight (g)')
plt.ylabel('Tumor volume (mm3)')
plt.title('Tumor volume vs weight of mouse')

#A scatter plot of the average tumor volume over the course of the study vs weight
plt.show()
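
The scatter plot above uses the average tumor volume over the study, while the observations speak of tumor volume reduction. As a hedged sketch only, one way to measure reduction directly is to compare each mouse's first and last recorded volumes. The data below is made up and the names `toy_df`, `per_mouse`, `start_volume`, `final_volume`, and `reduction` are hypothetical.

```python
import pandas as pd

# Hypothetical toy records for two mice: "a" shrinks, "b" grows
toy_df = pd.DataFrame({
    "Mouse ID": ["a", "a", "a", "b", "b", "b"],
    "Timepoint": [0, 5, 10, 0, 5, 10],
    "Tumor Volume (mm3)": [45.0, 43.0, 40.0, 45.0, 46.0, 47.5],
    "Weight (g)": [17, 17, 17, 24, 24, 24],
})

# Sort by time so that "first" and "last" mean the earliest and latest readings
per_mouse = (
    toy_df.sort_values(["Mouse ID", "Timepoint"])
    .groupby("Mouse ID")
    .agg(
        weight=("Weight (g)", "first"),
        start_volume=("Tumor Volume (mm3)", "first"),
        final_volume=("Tumor Volume (mm3)", "last"),
    )
)

# Positive values mean the tumor shrank over the study
per_mouse["reduction"] = per_mouse["start_volume"] - per_mouse["final_volume"]
```

Plotting `reduction` against `weight` would show the relationship described in the observations more directly than the average volume does.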

## Correlation and Regression

In [None]:
#### DONE - Calculate the correlation coefficient and linear regression model 
#### for mouse weight and average tumor volume for the Capomulin regimen

#linregress returns the Pearson correlation coefficient as r_value. Squaring it gives R-squared
slope, intercept, r_value, p_value, std_err = st.linregress(scatter_plot_mice['Weight (g)'].mean(), scatter_plot_mice['Tumor Volume (mm3)'].mean())
r_squared = r_value**2

#Legend text showing the fitted line formula and R-squared
fit_label = 'TV= '+str(round(slope,2))+'*Wgt + '+str(round(intercept,2))+' R2='+str(round(r_squared,2))

plt.scatter(scatter_plot_mice['Weight (g)'].mean(), scatter_plot_mice['Tumor Volume (mm3)'].mean())
plt.plot(scatter_plot_mice['Weight (g)'].mean(), scatter_plot_mice['Weight (g)'].mean()*slope+intercept, color='red', label=fit_label)
plt.xlabel('Mouse weight (g)')
plt.ylabel('Tumor volume (mm3)')
plt.title('Tumor volume vs weight of mouse')
plt.legend()

#A scatter plot of the average tumor volume over the course of the study vs weight
#Overlaid with a linear regression fit. Showing the linear regression formula and R-squared in the legend
plt.show()
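
As a cross-check of the regression quantities, the slope, intercept, correlation coefficient, and R-squared can be recomputed with NumPy alone. This is a sketch on made-up weights and volumes. It swaps in `np.polyfit` and `np.corrcoef` for the `st.linregress` call used above, and all names are hypothetical.

```python
import numpy as np

# Hypothetical mouse weights (g) and average tumor volumes (mm3)
weights = np.array([15.0, 17.0, 19.0, 21.0, 23.0, 25.0])
volumes = np.array([36.1, 37.4, 39.0, 40.2, 41.9, 43.0])

# Least-squares straight line: volumes ~ slope * weights + intercept
slope, intercept = np.polyfit(weights, volumes, 1)

# Pearson correlation coefficient between the two variables
r_value = np.corrcoef(weights, volumes)[0, 1]
r_squared = r_value ** 2

# R-squared computed from the residuals of the fitted line
predicted = slope * weights + intercept
ss_res = np.sum((volumes - predicted) ** 2)
ss_tot = np.sum((volumes - volumes.mean()) ** 2)
r_squared_from_fit = 1 - ss_res / ss_tot
```

For a straight-line fit with an intercept, `r_squared` and `r_squared_from_fit` agree, which is why the legend above can report the square of `r_value` as R-squared.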