# Pymaceuticals Inc.
---

### Analysis

##### Here are three observations that can be made from the figures and data:
#### 1. Capomulin appears effective at reducing tumor volume over time. The line plot of tumor volume vs. time point for a single Capomulin-treated mouse (s185) shows a clear downward trend, and the summary statistics show that Capomulin (mean 40.68 mm3) and Ramicane (mean 40.22 mm3) have the lowest tumor volumes of the ten regimens, well below the Placebo mean of 54.03 mm3.

#### 2. Mouse weight and average tumor volume are positively correlated among mice treated with Capomulin. The scatter plot of mouse weight vs. average observed tumor volume for the Capomulin regimen shows a positive linear relationship, and the Pearson correlation coefficient of 0.84 indicates a strong positive correlation.

#### 3. The distribution of male and female mice in the study was roughly equal. The pie chart of male vs. female mice shows a split of about 49.4% to 50.6%, indicating that the study was well balanced by sex.
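
The first observation rests on a single mouse, so it is worth checking against the whole cohort. The following is a minimal sketch, assuming a DataFrame with the same column names as the study data. The values are invented placeholders, not study results, and `mean_volume_by_timepoint` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical toy data (NOT study results) using the study's column names.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2", "b2"],
    "Drug Regimen": ["Capomulin"] * 6,
    "Timepoint": [0, 5, 10, 0, 5, 10],
    "Tumor Volume (mm3)": [45.0, 43.0, 41.0, 45.0, 44.0, 40.0],
})


def mean_volume_by_timepoint(df, regimen):
    """Average tumor volume at each timepoint across all mice on one regimen."""
    subset = df.loc[df["Drug Regimen"] == regimen]
    return subset.groupby("Timepoint")["Tumor Volume (mm3)"].mean()


trend = mean_volume_by_timepoint(toy_df, "Capomulin")
print(trend)

# A regimen-wide decline shows up as a negative change from first to last timepoint.
change = trend.iloc[-1] - trend.iloc[0]
print(f"Change in mean volume: {change:.2f} mm3")
```

Applied to the real data, the same helper could be called on `merge_df` after cleaning, which would show whether the downward trend holds on average rather than for one animal.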
 

In [1]:
%matplotlib notebook

In [2]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
from scipy.stats import linregress

# Study data files
# Raw strings (r"...") already treat backslashes literally, so single backslashes are enough.
mouse_metadata_path = r"C:\Users\cryst\Documents\UMN_Bootcamp\Coursework\Section_2_Data_Analytics_Python\Module_5\Homework\Pymaceuticals\data\Mouse_metadata.csv"
study_results_path = r"C:\Users\cryst\Documents\UMN_Bootcamp\Coursework\Section_2_Data_Analytics_Python\Module_5\Homework\Pymaceuticals\data\Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single DataFrame
merge_df = pd.merge(mouse_metadata, study_results, on="Mouse ID", how="outer")

# Display the data table for preview
merge_df

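| | Mouse ID | Drug Regimen | Sex | Age_months | Weight (g) | Timepoint | Tumor Volume (mm3) | Metastatic Sites |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | k403 | Ramicane | Male | 21 | 16 | 0 | 45.000000 | 0 |
| 1 | k403 | Ramicane | Male | 21 | 16 | 5 | 38.825898 | 0 |
| 2 | k403 | Ramicane | Male | 21 | 16 | 10 | 35.014271 | 1 |
| 3 | k403 | Ramicane | Male | 21 | 16 | 15 | 34.223992 | 1 |
| 4 | k403 | Ramicane | Male | 21 | 16 | 20 | 32.997729 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1888 | z969 | Naftisol | Male | 9 | 30 | 25 | 63.145652 | 2 |
| 1889 | z969 | Naftisol | Male | 9 | 30 | 30 | 65.841013 | 3 |
| 1890 | z969 | Naftisol | Male | 9 | 30 | 35 | 69.176246 | 4 |
| 1891 | z969 | Naftisol | Male | 9 | 30 | 40 | 70.314904 | 4 |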
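
The merge above assumes that each `Mouse ID` appears once in the metadata and that every mouse has matching study results. One way to check those assumptions is sketched below on two small hypothetical tables (not the study files), using the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy tables (NOT study data) mirroring the two CSV layouts.
metadata = pd.DataFrame({
    "Mouse ID": ["a1", "b2"],
    "Drug Regimen": ["Capomulin", "Placebo"],
})
results = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2"],
    "Timepoint": [0, 5, 0],
})

# validate raises pandas.errors.MergeError if a Mouse ID repeats in the metadata;
# indicator adds a "_merge" column recording which table each row came from.
merged = pd.merge(
    metadata, results, on="Mouse ID", how="outer",
    validate="one_to_many", indicator=True,
)

# Rows present in only one table would be labeled "left_only" or "right_only".
unmatched = merged.loc[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty, the outer merge behaves like an inner merge and no rows with missing values were introduced.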


In [3]:
# Check the number of unique mice.
print(merge_df['Mouse ID'].nunique())
merge_df['Mouse ID'].value_counts()

249


g989    13
k403    10
j365    10
j984    10
k210    10
        ..
v199     1
t573     1
f932     1
b447     1
u153     1
Name: Mouse ID, Length: 249, dtype: int64

In [4]:
# Each row should be uniquely identified by its Mouse ID and Timepoint.
# Find the Mouse IDs that have more than one row for the same Timepoint.
merge_df.loc[merge_df.duplicated(subset=['Mouse ID', 'Timepoint'], keep=False), 'Mouse ID'].unique()

array(['g989'], dtype=object)

In [5]:
# Optional: Get all the data for the duplicate mouse ID. 
duplicate_mouse_df = merge_df[merge_df['Mouse ID'] == 'g989']
duplicate_mouse_df

| | Mouse ID | Drug Regimen | Sex | Age_months | Weight (g) | Timepoint | Tumor Volume (mm3) | Metastatic Sites |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 908 | g989 | Propriva | Female | 21 | 26 | 0 | 45.0 | 0 |
| 909 | g989 | Propriva | Female | 21 | 26 | 0 | 45.0 | 0 |
| 910 | g989 | Propriva | Female | 21 | 26 | 5 | 48.786801 | 0 |
| 911 | g989 | Propriva | Female | 21 | 26 | 5 | 47.570392 | 0 |
| 912 | g989 | Propriva | Female | 21 | 26 | 10 | 51.745156 | 0 |
| 913 | g989 | Propriva | Female | 21 | 26 | 10 | 49.880528 | 0 |
| 914 | g989 | Propriva | Female | 21 | 26 | 15 | 51.325852 | 1 |
| 915 | g989 | Propriva | Female | 21 | 26 | 15 | 53.44202 | 0 |
| 916 | g989 | Propriva | Female | 21 | 26 | 20 | 55.326122 | 1 |
| 917 | g989 | Propriva | Female | 21 | 26 | 20 | 54.65765 | 1 |


In [6]:
# Create a clean DataFrame by dropping every row for the duplicate mouse ID.
# Filtering on the ID alone removes all of its rows, so a separate drop_duplicates call is not needed.
merge_df = merge_df[merge_df['Mouse ID'] != 'g989']

In [7]:
# Check the number of unique mice in the clean DataFrame.
print(merge_df['Mouse ID'].nunique())
merge_df['Mouse ID'].value_counts()

248


k403    10
o287    10
j984    10
k210    10
k382    10
        ..
h428     1
o848     1
t573     1
d133     1
x226     1
Name: Mouse ID, Length: 248, dtype: int64
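
The cleaning steps above can be condensed into a reusable pattern: flag every `(Mouse ID, Timepoint)` pair that repeats, collect the affected IDs, and drop all rows for those mice. The sketch below uses a small hypothetical DataFrame (not the study data), and the ID `g0` is a made-up placeholder.

```python
import pandas as pd

# Hypothetical toy data (NOT study data); mouse "g0" has a repeated timepoint.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "g0", "g0", "g0"],
    "Timepoint": [0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 44.0, 45.0, 45.0, 47.0],
})

# keep=False flags every row that takes part in a duplicated (Mouse ID, Timepoint) pair.
dup_mask = toy_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dup_ids = toy_df.loc[dup_mask, "Mouse ID"].unique()

# Drop every row belonging to a flagged mouse, not just the duplicated rows.
clean_df = toy_df.loc[~toy_df["Mouse ID"].isin(dup_ids)].reset_index(drop=True)

# After cleaning, no (Mouse ID, Timepoint) pair should repeat.
assert not clean_df.duplicated(subset=["Mouse ID", "Timepoint"]).any()
print(list(dup_ids), clean_df["Mouse ID"].nunique())
```

Using `isin` avoids hard-coding the mouse ID, which may be useful if the input files change.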

## Summary Statistics

In [8]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen

# Use groupby and summary statistical methods to calculate the following properties of each drug regimen: 
# mean, median, variance, standard deviation, and SEM of the tumor volume. 
mean_series = merge_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].mean()
median_series = merge_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].median()
var_series = merge_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].var()
std_series = merge_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].std()
sem_series = merge_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].sem()

# Assemble the resulting series into a single summary DataFrame.
summary_df = pd.DataFrame({'mean': mean_series, 'median': median_series, 'var': var_series, 'std': std_series, 'sem': sem_series})
print(summary_df)

                   mean     median        var       std       sem
Drug Regimen                                                     
Capomulin     40.675741  41.557809  24.947764  4.994774  0.329346
Ceftamin      52.591172  51.776157  39.290177  6.268188  0.469821
Infubinol     52.884795  51.820584  43.128684  6.567243  0.492236
Ketapril      55.235638  53.698743  68.553577  8.279709  0.603860
Naftisol      54.331565  52.509285  66.173479  8.134708  0.596466
Placebo       54.033581  52.288934  61.168083  7.821003  0.581331
Propriva      52.320930  50.446266  43.852013  6.622085  0.544332
Ramicane      40.216745  40.673236  23.486704  4.846308  0.320955
Stelasyn      54.233149  52.431737  59.450562  7.710419  0.573111
Zoniferol     53.236507  51.818479  48.533355  6.966589  0.516398


In [9]:
# A more advanced method to generate a summary statistics table of mean, median, variance, standard deviation,
# and SEM of the tumor volume for each regimen (only one method is required in the solution)

# Using the aggregation method, produce the same summary statistics in a single line
summary_df = merge_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].agg(['mean', 'median', 'var', 'std', 'sem'])
print(summary_df)

                   mean     median        var       std       sem
Drug Regimen                                                     
Capomulin     40.675741  41.557809  24.947764  4.994774  0.329346
Ceftamin      52.591172  51.776157  39.290177  6.268188  0.469821
Infubinol     52.884795  51.820584  43.128684  6.567243  0.492236
Ketapril      55.235638  53.698743  68.553577  8.279709  0.603860
Naftisol      54.331565  52.509285  66.173479  8.134708  0.596466
Placebo       54.033581  52.288934  61.168083  7.821003  0.581331
Propriva      52.320930  50.446266  43.852013  6.622085  0.544332
Ramicane      40.216745  40.673236  23.486704  4.846308  0.320955
Stelasyn      54.233149  52.431737  59.450562  7.710419  0.573111
Zoniferol     53.236507  51.818479  48.533355  6.966589  0.516398
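
The `sem` column is related to `std` by the number of observations: SEM = std / sqrt(n). In the table above, Capomulin's ratio is 4.994774 / 0.329346 ≈ 15.17, which squares to about 230 observations. The sketch below confirms the relationship on hypothetical values (not study data); the `sem_by_hand` column name is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT study data) with two made-up regimens.
toy_df = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 46.0, 50.0, 55.0, 60.0],
})

grouped = toy_df.groupby("Drug Regimen")["Tumor Volume (mm3)"]
stats = grouped.agg(["std", "sem", "count"])

# SEM is the sample standard deviation (ddof=1) divided by sqrt(n).
stats["sem_by_hand"] = stats["std"] / np.sqrt(stats["count"])
print(stats)
```

Because SEM shrinks with the square root of the sample size, regimens with more recorded rows will show a smaller SEM even when their spread is similar.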


## Bar and Pie Charts

In [10]:
drug_count = merge_df['Drug Regimen'].value_counts()

In [11]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using Pandas.
drug_count.plot(kind='bar')
plt.title('Total Number of Rows per Drug Regimen')
plt.xlabel('Drug Regimen')
plt.ylabel('Number of Rows')
plt.show()


In [12]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using pyplot.
fig1, ax1 = plt.subplots()
ax1.bar(drug_count.index, drug_count.values)
ax1.set_title('Total Number of Rows per Drug Regimen')
ax1.set_xlabel('Drug Regimen')
ax1.set_ylabel('Number of Rows')
ax1.set_xticks(range(len(drug_count)))
ax1.set_xticklabels(drug_count.index, rotation=90)
plt.show()


In [13]:
gender_count = merge_df['Sex'].value_counts()

In [14]:
# Generate a pie plot showing the distribution of female versus male mice using Pandas
# Series.plot takes the slice labels from the index, so they always match the counts.
fig2, ax2 = plt.subplots()
gender_count.plot(kind='pie', ax=ax2, autopct='%1.1f%%', startangle=90)
ax2.set_title('Distribution of Female vs Male Mice')
ax2.set_ylabel('')
ax2.axis('equal')
plt.show()


In [15]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot
fig, ax = plt.subplots()
# Take the labels from the index so each label is guaranteed to match its count.
ax.pie(gender_count.values, labels=gender_count.index, autopct='%1.1f%%', startangle=90)
ax.set_title('Distribution of Female vs Male Mice')
ax.axis('equal')
plt.show()

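
Note that `merge_df['Sex'].value_counts()` counts rows (one per mouse per timepoint), not individual mice, so mice with more recorded timepoints carry more weight. The sketch below contrasts the two ways of counting on hypothetical data (not study data).

```python
import pandas as pd

# Hypothetical toy data (NOT study data): one male with three rows, two females with one row each.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "c3"],
    "Sex": ["Male", "Male", "Male", "Female", "Female"],
})

# Share of rows (observations): mice with more timepoints weigh more.
by_row = toy_df["Sex"].value_counts(normalize=True)

# Share of mice: keep one row per Mouse ID before counting.
by_mouse = toy_df.drop_duplicates(subset="Mouse ID")["Sex"].value_counts(normalize=True)

print(by_row.round(3).to_dict())
print(by_mouse.round(3).to_dict())
```

Which count is appropriate depends on whether the chart is meant to describe the observations or the animals.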

## Quartiles, Outliers and Boxplots

In [16]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:  
# Capomulin, Ramicane, Infubinol, and Ceftamin
merge_df = merge_df.reset_index(drop=True)

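# Start by getting the last (greatest) timepoint for each mouse
# reset_index turns 'Mouse ID' back into a column so both merge keys are explicit columns
last_timepoint = merge_df.groupby('Mouse ID')['Timepoint'].max().reset_index()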

# Merge this group df with the original DataFrame to get the tumor volume at the last timepoint
final_volume_df = pd.merge(last_timepoint, merge_df, on=['Mouse ID', 'Timepoint'])
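
An alternative to the group-then-merge approach is to select the last row per mouse directly with `idxmax`. This is a sketch on hypothetical data (not study data), and it assumes the DataFrame has unique index labels, as it does after `reset_index(drop=True)`.

```python
import pandas as pd

# Hypothetical toy data (NOT study data).
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 5, 10, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.0, 41.0, 45.0, 47.5],
})

# idxmax returns the row label of the greatest Timepoint within each mouse.
last_rows = toy_df.groupby("Mouse ID")["Timepoint"].idxmax()

# Selecting those labels yields one row per mouse at its final timepoint.
final_df = toy_df.loc[last_rows].reset_index(drop=True)
print(final_df)
```

Both approaches should yield one row per mouse; the merge version is closer to the assignment's instructions, while `idxmax` avoids the intermediate table.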

In [17]:
# Put the treatments into a list to loop over (and later to use as plot labels)
treatments = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']

# Create empty list to fill with tumor vol data (for plotting)
tumor_vol_data = []

# Calculate the IQR and quantitatively determine if there are any potential outliers. 
for treatment in treatments:
    
    # Locate the rows which contain mice on each drug and get the tumor volumes
    treatment_df = final_volume_df.loc[final_volume_df['Drug Regimen'] == treatment]
    tumor_vol = treatment_df['Tumor Volume (mm3)']
    
    # Add this regimen's tumor volumes to the list used for plotting
    tumor_vol_data.append(tumor_vol)
    
    # Determine outliers using upper and lower bounds
    quartiles = tumor_vol.quantile([.25,.5,.75])
    lower_q = quartiles[0.25]
    upper_q = quartiles[0.75]
    iqr = upper_q - lower_q
    lower_bound = lower_q - (1.5*iqr)
    upper_bound = upper_q + (1.5*iqr)
    
    # Check for outliers
    outliers = treatment_df.loc[(treatment_df['Tumor Volume (mm3)'] < lower_bound) | 
                                (treatment_df['Tumor Volume (mm3)'] > upper_bound)]
    if len(outliers) > 0:
        print(f"{treatment} has {len(outliers)} potential outliers:")
        print(outliers)
    else:
        print(f"{treatment} has no potential outliers.")

Capomulin has no potential outliers.
Ramicane has no potential outliers.
Infubinol has 1 potential outliers:
   Mouse ID  Timepoint Drug Regimen     Sex  Age_months  Weight (g)  \
31     c326          5    Infubinol  Female          18          25   

    Tumor Volume (mm3)  Metastatic Sites  
31           36.321346                 0  
Ceftamin has no potential outliers.
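
The outlier rule used in the loop is the standard 1.5 × IQR fence: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged. The sketch below isolates that rule in a hypothetical helper, `iqr_outliers`, and applies it to made-up numbers (not study data).

```python
import pandas as pd


def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    return lower, upper, outliers


# Hypothetical toy values (NOT study data): Q1 = 2, Q3 = 4, so IQR = 2.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper, outliers = iqr_outliers(toy)
print(lower, upper, outliers.tolist())
```

Here the fences are −1 and 7, so only the value 100 is flagged. A flagged value is a candidate for review, not necessarily an error.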


In [18]:
# Generate a box plot that shows the distribution of the tumor volume for each treatment group.
fig, ax = plt.subplots()
ax.boxplot(tumor_vol_data)
ax.set_xticklabels(treatments)
ax.set_xlabel('Treatment')
ax.set_ylabel('Final Tumor Volume (mm3)')
ax.set_title('Distribution of Final Tumor Volume for Each Treatment Group')
plt.show()


## Line and Scatter Plots

In [19]:
# Generate a line plot of tumor volume vs. time point for a single mouse treated with Capomulin
# Filter the data for mice treated with Capomulin
capomulin_df = merge_df.loc[merge_df["Drug Regimen"] == "Capomulin"]
mouse_id = "s185"
single_mouse_df = capomulin_df.loc[capomulin_df["Mouse ID"] == mouse_id]

# Generate the line plot
plt.figure()
plt.plot(single_mouse_df["Timepoint"], single_mouse_df["Tumor Volume (mm3)"])
plt.xlabel("Time (days)")
plt.ylabel("Tumor Volume (mm3)")
plt.title(f"Tumor Volume over Time for Mouse {mouse_id} Treated with Capomulin")
plt.show()

In [23]:
# Generate a scatter plot of mouse weight vs. the average observed tumor volume for the entire Capomulin regimen
# Calculate the average tumor volume for each mouse
avg_tumor_vol = capomulin_df.groupby("Mouse ID")["Tumor Volume (mm3)"].mean()

# Get the weight for each mouse (weight is constant per mouse, so the first value is enough)
mouse_weight = capomulin_df.groupby("Mouse ID")["Weight (g)"].first()

# Generate the scatter plot
plt.figure()
plt.scatter(mouse_weight, avg_tumor_vol)

# Set the title and axis labels
plt.title("Mouse Weight vs. Average Tumor Volume for Capomulin Regimen")
plt.xlabel("Weight (g)")
plt.ylabel("Average Tumor Volume (mm3)")

# Show the plot
plt.show()


## Correlation and Regression

In [24]:
# Calculate the correlation coefficient and a linear regression model 
# for mouse weight and average observed tumor volume for the entire Capomulin regimen
capomulin_weight_volume = merge_df.loc[merge_df["Drug Regimen"] == "Capomulin", ["Mouse ID", "Weight (g)", "Tumor Volume (mm3)"]]
capomulin_weight_volume = capomulin_weight_volume.groupby("Mouse ID").mean()

x_values = capomulin_weight_volume["Weight (g)"]
y_values = capomulin_weight_volume["Tumor Volume (mm3)"]

# Calculate the correlation coefficient
correlation = st.pearsonr(x_values,y_values)
print(f"The correlation between mouse weight and the average tumor volume is {round(correlation[0],2)}")

# Perform a linear regression on weight vs tumor volume
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)

# Get regression values
regress_values = x_values * slope + intercept
line_eq = f"y = {round(slope,2)}x + {round(intercept,2)}"

# Create scatter plot of mouse weight vs. average tumor volume and add regression line
plt.figure()
plt.scatter(x_values,y_values)
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(20,36),fontsize=15,color="red")
plt.xlabel('Weight (g)')
plt.ylabel('Average Tumor Volume (mm3)')
plt.title("Mouse Weight vs. Average Tumor Volume for Capomulin Regimen")
plt.show()


The correlation between mouse weight and the average tumor volume is 0.84
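
For a simple linear regression with one predictor, the `rvalue` returned by `linregress` is the same quantity as Pearson's r, and its square is the share of variance in tumor volume explained by weight (0.84² ≈ 0.71 for the result above). The sketch below illustrates this on hypothetical weights and volumes (not study data), using the named attributes of the `linregress` result.

```python
import scipy.stats as st

# Hypothetical toy values (NOT study data).
weights = [15.0, 17.0, 19.0, 21.0, 23.0, 25.0]
volumes = [36.0, 37.5, 38.0, 40.5, 42.0, 44.5]

r, p_value = st.pearsonr(weights, volumes)
fit = st.linregress(weights, volumes)

# For a simple linear regression the fit's rvalue equals Pearson's r,
# and r squared is the share of variance in y explained by the line.
print(f"r = {r:.3f}, r^2 = {fit.rvalue ** 2:.3f}")
print(f"y = {fit.slope:.2f}x + {fit.intercept:.2f}")
```

Accessing `fit.slope` and `fit.intercept` by name avoids depending on the order of the returned values.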

