# Pymaceuticals Inc.
---
### Analysis

The data show a strong, positive linear correlation between mouse weight and average tumor volume for mice on the Capomulin regimen (the correlation coefficient is 0.84). The drug regimens with the lowest average tumor volumes are Ramicane and Capomulin. Because the goal of these regimens is to slow or stop the growth of tumors, smaller tumor volumes suggest that these two regimens may be effective, so I would want to look more closely at them.

The change in tumor volume over time for one mouse on the Capomulin regimen shows that the tumor shrank. This is interesting, and it is what we would hope to see, but one mouse is not enough to make a prediction about what the drug does overall. I would want to see tumor volume over time for all of the mice treated with Capomulin. That might show similar decreases in volume across the entire cohort, but it might also show a different picture.

Tumor volume goes up as mouse weight goes up, so I would be interested in looking at tumor volume as a percentage of a mouse's total weight. Does tumor volume tend to be a stable percentage of a mouse's overall weight?

Finally, I think this type of analysis would need to be done over a longer period of time. We would want to know whether tumor volume continues to shrink as time goes on. If the drug regimen is stopped, would the tumors continue to shrink, stay the same size, or begin to grow again?

In [None]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st

# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single DataFrame
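# An outer join keeps Mouse IDs that appear in only one of the two files;
# normally an inner join (matched Mouse IDs only) would be better
mouse_study_data = pd.merge(study_results, mouse_metadata, on='Mouse ID', how='outer')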

# Display the data table for preview
mouse_study_data.head()
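
The choice between `how='outer'` and `how='inner'` only matters if some `Mouse ID` appears in one file but not the other. Below is a minimal sketch of one way to check this with the `indicator=True` option of `pd.merge`. The frames `toy_results` and `toy_metadata` are made-up stand-ins for the two CSV files, not study data.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (illustrative values, not study data)
toy_results = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "c3"],
    "Timepoint": [0, 5, 0, 0],
})
toy_metadata = pd.DataFrame({
    "Mouse ID": ["a1", "b2", "d4"],
    "Drug Regimen": ["Capomulin", "Ramicane", "Infubinol"],
})

# indicator=True adds a "_merge" column recording where each row came from
checked = pd.merge(toy_results, toy_metadata, on="Mouse ID",
                   how="outer", indicator=True)

# Rows not marked "both" are exactly the rows an inner join would drop
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Mouse ID", "_merge"]])
```

If `unmatched` is empty on the real data, the outer and inner joins give the same rows.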

In [None]:
# Checking the number of mice.
mouse_count = mouse_study_data['Mouse ID'].nunique()
mouse_count

In [None]:
# Our data should be uniquely identified by Mouse ID and Timepoint
# Get the IDs of mice that have more than one row for the same Mouse ID and Timepoint.
duplicate_id = mouse_study_data.loc[mouse_study_data.duplicated(subset=['Mouse ID', 'Timepoint']), 'Mouse ID'].unique()
duplicate_id

In [None]:
# Get all of the data for the duplicate mouse ID.
duplicate_data = mouse_study_data[mouse_study_data['Mouse ID'].isin(duplicate_id)]
duplicate_data

In [None]:
# Create a clean DataFrame by dropping the duplicate mouse by its ID.
mouse_study_data_clean = mouse_study_data[~mouse_study_data['Mouse ID'].isin(duplicate_id)]
mouse_study_data_clean.head()

In [None]:
# Checking the number of mice in the clean DataFrame.
mouse_count_clean = mouse_study_data_clean['Mouse ID'].nunique()
mouse_count_clean
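
The cleaning steps above remove every row for a mouse that has any repeated (`Mouse ID`, `Timepoint`) pair, which is why the mouse count drops. Here is a small sketch of that logic on a made-up frame (`toy` and its values are illustrative only). It uses `keep=False` so that all rows of a repeated pair are visible, not only the later occurrences.

```python
import pandas as pd

# Toy data: mouse a1 has two rows for Timepoint 0 (illustrative values)
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 0, 5, 0, 5],
    "Tumor Volume (mm3)": [45.0, 45.0, 47.1, 45.0, 44.2],
})

# keep=False flags every row of a repeated (Mouse ID, Timepoint) pair
repeated_rows = toy[toy.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)]

# Drop all rows belonging to any mouse that has a repeated pair
bad_ids = repeated_rows["Mouse ID"].unique()
toy_clean = toy[~toy["Mouse ID"].isin(bad_ids)]

print(repeated_rows)
print(toy_clean["Mouse ID"].nunique())
```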

## Summary Statistics

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen

In [None]:
# Use groupby and summary statistical methods to calculate the following properties of each drug regimen: 
# mean, median, variance, standard deviation, and SEM of the tumor volume. 
drug_mean = mouse_study_data_clean.groupby('Drug Regimen')['Tumor Volume (mm3)'].mean()
drug_median = mouse_study_data_clean.groupby('Drug Regimen')['Tumor Volume (mm3)'].median()
drug_variance = mouse_study_data_clean.groupby('Drug Regimen')['Tumor Volume (mm3)'].var()
drug_std_dev = mouse_study_data_clean.groupby('Drug Regimen')['Tumor Volume (mm3)'].std()
drug_sem = mouse_study_data_clean.groupby('Drug Regimen')['Tumor Volume (mm3)'].sem()

In [None]:
# Assemble the resulting series into a single summary DataFrame.
summary_statistics_table = {"Mean Tumor Volume (mm3)": drug_mean,
                            "Median Tumor Volume (mm3)": drug_median,
                            "Tumor Volume Variance (mm3)": drug_variance,
                            "Tumor Volume Standard Deviation (mm3)": drug_std_dev,
                            "Tumor Volume SEM (mm3)": drug_sem}

summary_statistics_table = pd.DataFrame(summary_statistics_table)
summary_statistics_table

In [None]:
# Using the aggregation method, produce the same summary statistics in a single line
summary = mouse_study_data_clean.groupby('Drug Regimen').agg({"Tumor Volume (mm3)":['mean','median','var','std','sem']})
#summary = mouse_study_data_clean.groupby('Drug Regimen').agg({"Tumor Volume (mm3)":['mean','median','var','std','sem']}).rename(columns={'mean':"Mean Tumor Volume","median":"Median Tumor Volume",'var':"Variance",'std':"Standard Deviation",'sem':"Standard Error"})
summary
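
If flat, readable column names are wanted in the aggregated table, one option is pandas' named aggregation, where each keyword becomes an output column. This is a sketch on made-up data; the frame `toy` and the column names `mean_volume`, `median_volume`, `variance`, `std_dev`, and `sem` are hypothetical choices, not names from the study.

```python
import pandas as pd

# Toy data (illustrative values, not study data)
toy = pd.DataFrame({
    "Drug Regimen": ["Capomulin"] * 3 + ["Ramicane"] * 3,
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 38.0, 40.0, 45.0],
})

# Named aggregation: keyword = output column name, value = statistic
summary_named = toy.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    mean_volume="mean",
    median_volume="median",
    variance="var",
    std_dev="std",
    sem="sem",
)
print(summary_named)
```

The result has a single level of column labels, so no renaming step is needed afterwards.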


## Bar and Pie Charts

In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using Pandas.
timepoints_per_drug = mouse_study_data_clean.groupby("Drug Regimen")['Mouse ID'].count()
drug_bar_pandas = timepoints_per_drug.plot.bar(xlabel="Drug Regimen",ylabel='# of Observed Mouse Timepoints',rot=90,width=0.8, color='blue')


In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using pyplot.
drug = timepoints_per_drug.index
timepoints_count = timepoints_per_drug.values
plt.bar(drug, timepoints_count, color='blue',width=0.8,bottom=None, align='center',tick_label=drug)
plt.xlabel('Drug Regimen')
plt.ylabel('# of Observed Mouse Timepoints')
plt.xticks(rotation='vertical')

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using Pandas
mouse_sex_df = mouse_study_data_clean.groupby('Sex')['Mouse ID'].count()
plot_pandas = mouse_sex_df.plot.pie(y='Count of Mice', label='Sex', title=' Distribution of Female vs. Male Mice',figsize=(5,5),autopct='%1.0f%%')

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot
plt.pie(mouse_sex_df.values, labels=mouse_sex_df.index, autopct='%1.0f%%')
plt.ylabel('Sex')
plt.title('Distribution of Female vs. Male Mice')
plt.show()
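
Note that `groupby('Sex')['Mouse ID'].count()` counts rows, so a mouse observed at ten timepoints contributes ten times. If the intent is one count per mouse, `nunique()` is one way to get it. The sketch below contrasts the two on a made-up frame (`toy` is illustrative, not study data).

```python
import pandas as pd

# Toy data: a1 has three rows, b2 one row, c3 two rows (illustrative values)
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "Sex": ["Female", "Female", "Female", "Male", "Male", "Male"],
})

# count() tallies rows (Mouse ID/Timepoint observations) per sex
rows_per_sex = toy.groupby("Sex")["Mouse ID"].count()

# nunique() tallies distinct mice per sex
mice_per_sex = toy.groupby("Sex")["Mouse ID"].nunique()

print(rows_per_sex)
print(mice_per_sex)
```

Here the row counts are equal for both sexes, while the per-mouse counts are not, so the two approaches can give different pie charts.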

## Quartiles, Outliers and Boxplots

In [None]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:  
# Capomulin, Ramicane, Infubinol, and Ceftamin

In [None]:
# Start by getting the last (greatest) timepoint for each mouse
last_timepoint=mouse_study_data_clean.groupby('Mouse ID')['Timepoint'].max()
last_timepoint

In [None]:
# Merge this group df with the original DataFrame to get the tumor volume at the last timepoint
mouse_study_data_clean_merged = pd.merge(mouse_study_data_clean,last_timepoint, on=['Mouse ID','Timepoint'], how='inner')
mouse_study_data_clean_merged = mouse_study_data_clean_merged.reset_index(drop=True)
mouse_study_data_clean_merged
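
As an alternative to merging, the row at each mouse's greatest timepoint can also be selected directly with `idxmax`. This is only a sketch of that idea on made-up data (`toy` and its values are illustrative), and it assumes each mouse has a single row per timepoint.

```python
import pandas as pd

# Toy data (illustrative values, not study data)
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 5, 10, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.0, 41.0, 45.0, 48.0],
})

# idxmax returns, per mouse, the row label of its greatest timepoint
last_idx = toy.groupby("Mouse ID")["Timepoint"].idxmax()

# Selecting those labels yields one final-measurement row per mouse
final_rows = toy.loc[last_idx].reset_index(drop=True)
print(final_rows)
```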

In [None]:
# Create empty list to fill with tumor vol data (for plotting)
tumor_volume_data = []

# Put key treatments into a list for the for loop (and later for plot labels)
key_treatments = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']

# For each drug in the key treatment list
for treatment in key_treatments: 
    # Locate the rows which contain mice on this drug and get the tumor volumes
    drug_match_s = mouse_study_data_clean_merged['Drug Regimen'] == treatment
    drug_volume_s = mouse_study_data_clean_merged.loc[drug_match_s, 'Tumor Volume (mm3)']

    # add subset (series of tumor volumes) to the tumor volume list
    tumor_volume_data.append(drug_volume_s)
    # Calculate the IQR for the drug
    percentile_25 = drug_volume_s.quantile(0.25)
    percentile_75 = drug_volume_s.quantile(0.75)
    iqr = (percentile_75-percentile_25)
    
    # Determine outliers using upper and lower bounds, for the drug
    upper_bound = percentile_75 + (1.5 * iqr)
    lower_bound = percentile_25 - (1.5 * iqr)
    upper_outlier_s = drug_volume_s > upper_bound
    lower_outlier_s = drug_volume_s < lower_bound
    outliers_s = drug_volume_s[upper_outlier_s | lower_outlier_s]
    print(f"{treatment}'s potential outliers: {outliers_s}")
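
The outlier rule used in the loop is the common 1.5 × IQR fence: a value is flagged if it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The sketch below wraps that rule in a helper and applies it to made-up values. The function name `iqr_outliers` and the series `toy_volumes` are hypothetical, introduced only for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5 * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return values[(values < lower_bound) | (values > upper_bound)]

# Toy final tumor volumes with one obviously extreme value (not study data)
toy_volumes = pd.Series([30.0, 32.0, 33.0, 34.0, 35.0, 36.0, 38.0, 70.0])
flagged = iqr_outliers(toy_volumes)
print(flagged)
```

For these toy values, Q1 is 32.75 and Q3 is 36.5, so the fences are 27.125 and 42.125 and only 70.0 is flagged.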

In [None]:
# Generate a box plot that shows the distribution of the tumor volume for each treatment group.
flierprops = dict(marker='o', markerfacecolor='r', markersize=12,
                  linestyle='none', markeredgecolor='black')
plt.boxplot(tumor_volume_data, flierprops=flierprops, labels=key_treatments)
plt.ylabel("Final Tumor Volume (mm3)")
plt.show()

## Line and Scatter Plots

In [None]:
# Generate a line plot of tumor volume vs. time point for a single mouse treated with Capomulin
mouse_l509 = mouse_study_data_clean['Mouse ID'] == 'l509'
tumor_volume_l509 = mouse_study_data_clean.loc[mouse_l509, 'Tumor Volume (mm3)'].reset_index(drop=True)
timepoint_l509 = mouse_study_data_clean.loc[mouse_l509, 'Timepoint'].reset_index(drop=True)
plt.plot(timepoint_l509, tumor_volume_l509)
plt.title('Capomulin Treatment of Mouse l509')
plt.xlabel('Timepoint (days)')
plt.ylabel('Tumor Volume (mm3)')
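
A single mouse cannot show whether the whole cohort behaved the same way. One possible way to summarize every mouse on a regimen is sketched below: the percent change from each mouse's first to last measurement, plus the cohort mean at each timepoint. The frame `toy` holds made-up values, and the column name `pct_change` is a hypothetical choice.

```python
import pandas as pd

# Toy data for two mice on one regimen (illustrative values, not study data)
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2", "b2"],
    "Timepoint": [0, 5, 10, 0, 5, 10],
    "Tumor Volume (mm3)": [45.0, 43.0, 40.5, 45.0, 46.0, 49.5],
})

# Sort so that "first" and "last" follow time order within each mouse
toy = toy.sort_values(["Mouse ID", "Timepoint"])

# Percent change in tumor volume from first to last timepoint, per mouse
per_mouse = toy.groupby("Mouse ID")["Tumor Volume (mm3)"].agg(["first", "last"])
per_mouse["pct_change"] = (per_mouse["last"] - per_mouse["first"]) / per_mouse["first"] * 100

# Mean tumor volume across the cohort at each timepoint
mean_by_timepoint = toy.groupby("Timepoint")["Tumor Volume (mm3)"].mean()

print(per_mouse)
print(mean_by_timepoint)
```

In this toy cohort one tumor shrinks and the other grows, which is exactly the kind of mixed picture a single-mouse plot would hide.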

In [None]:
# Generate a scatter plot of mouse weight vs. the average observed tumor volume for the entire Capomulin regimen

In [None]:
drug_match_capomulin = mouse_study_data_clean['Drug Regimen'] == 'Capomulin'
capomulin_df = mouse_study_data_clean.loc[drug_match_capomulin, ['Weight (g)', 'Tumor Volume (mm3)', 'Mouse ID']].reset_index(drop=True)
capomulin_df = capomulin_df.groupby('Mouse ID').mean()

In [None]:
scatter_plot = capomulin_df.plot.scatter('Weight (g)', 'Tumor Volume (mm3)')
scatter_plot.set_ylabel('Average Tumor Volume (mm3)')

## Correlation and Regression

In [None]:
# Calculate the correlation coefficient and a linear regression model 
# for mouse weight and average observed tumor volume for the entire Capomulin regimen

In [None]:
weight = capomulin_df['Weight (g)']
tumor_volume = capomulin_df['Tumor Volume (mm3)']
capomulin_r = st.pearsonr(weight, tumor_volume).statistic

In [None]:
wtv_slope, wtv_int, wtv_r, wtv_p, wtv_std_err = st.linregress(weight, tumor_volume)
wtv_fit = wtv_slope*weight + wtv_int

In [None]:
scatter_plot = capomulin_df.plot.scatter('Weight (g)', 'Tumor Volume (mm3)')
scatter_plot.set_ylabel('Average Tumor Volume (mm3)')
plt.plot(weight,wtv_fit, '--', color='r')
print(f"The correlation between mouse weight and the average tumor volume is {capomulin_r:.2f}")
plt.show()
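
For a simple linear regression with one predictor, the `rvalue` returned by `st.linregress` is the same Pearson correlation that `st.pearsonr` reports, and its square is the share of variance explained by the fitted line. The sketch below checks this on made-up numbers. `toy_weight`, `toy_volume`, and the 20 g example weight are illustrative assumptions, not study data.

```python
import scipy.stats as st

# Toy per-mouse weights (g) and average tumor volumes (mm3), not study data
toy_weight = [15, 17, 19, 21, 23, 25]
toy_volume = [36.0, 38.5, 39.0, 41.5, 42.0, 45.0]

# Fit the regression line and compute Pearson's r separately
result = st.linregress(toy_weight, toy_volume)
r, p_value = st.pearsonr(toy_weight, toy_volume)

# r squared: proportion of variance in volume explained by weight
r_squared = result.rvalue ** 2

# Use the fitted line to estimate the volume for a hypothetical 20 g mouse
predicted_20g = result.slope * 20 + result.intercept

print(f"r from pearsonr: {r:.4f}, r from linregress: {result.rvalue:.4f}")
print(f"r squared: {r_squared:.4f}")
print(f"Predicted average tumor volume at 20 g: {predicted_20g:.2f} mm3")
```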