# Pymaceuticals Inc.
---

### Analysis

In this data analysis, we investigated a mouse study dataset to gain insights into the effectiveness of various treatment regimens on tumor volume. We performed data cleaning, summary statistics calculation, data visualization, and statistical analysis to understand the relationships between variables and draw meaningful conclusions.

#### Data Cleaning and Exploration
The initial phase of the analysis involved data cleaning and exploration. The mouse metadata and the study results were merged into a single DataFrame on Mouse ID.
Duplicate records were identified by checking for repeated Mouse ID and Timepoint pairs, and all rows for the affected mouse (g989) were removed.
The cleaned dataset comprises mouse characteristics (such as sex and weight), treatment regimens, timepoints, and tumor volumes.
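
The duplicate check described above can be sketched on a small toy table. The mouse IDs and values below are hypothetical and are not taken from the study data; the point is only to show how `duplicated` with `keep=False` flags every row in a repeated (Mouse ID, Timepoint) pair.

```python
import pandas as pd

# Toy data (hypothetical values): mouse "m2" has two rows at timepoint 0.
toy = pd.DataFrame({
    "Mouse ID": ["m1", "m1", "m2", "m2", "m2"],
    "Timepoint": [0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 46.1, 45.0, 45.0, 47.3],
})

# keep=False marks every row in a duplicated (Mouse ID, Timepoint) group,
# not only the second and later occurrences.
dup_mask = toy.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dup_ids = toy.loc[dup_mask, "Mouse ID"].unique()

# Drop every row that belongs to a mouse with duplicated timepoints.
toy_clean = toy[~toy["Mouse ID"].isin(dup_ids)]

print(list(dup_ids))
print(toy_clean["Mouse ID"].nunique())
```

Dropping the whole mouse rather than only the repeated rows is the same choice the notebook makes, since it is unclear which of the conflicting measurements is correct.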

#### Summary Statistics
Summary statistics were calculated to gain insights into the tumor volume distribution across different drug regimens.
Mean, median, variance, standard deviation and standard error were computed for each regimen. The resulting statistics table provided a comprehensive overview of the tumor volume characteristics for each drug regimen.
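
As a quick illustration of how these statistics relate, the standard error of the mean is the sample standard deviation divided by the square root of the number of observations. The sketch below checks that relationship against pandas on a handful of hypothetical tumor volumes (not values from the study).

```python
import numpy as np
import pandas as pd

# Hypothetical tumor volumes for one regimen (illustrative values only).
volumes = pd.Series([45.0, 47.5, 50.2, 52.1, 55.8])

n = volumes.count()
std = volumes.std()            # sample standard deviation (ddof=1)
variance = volumes.var()       # sample variance (ddof=1)
sem_manual = std / np.sqrt(n)  # standard error of the mean

print(f"std={std:.3f}, var={variance:.3f}, sem={sem_manual:.3f}")
print(np.isclose(sem_manual, volumes.sem()))
```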

#### Data Visualization
Visualization played a key role in understanding the dataset and communicating the findings.
Bar charts were used to visualize the total number of data points for each drug regimen, offering a clear comparison among the treatments.
Pie charts were used to illustrate the sex distribution among the mice, providing an overview of the study's composition. Box plots helped in understanding the distribution of, and potential outliers in, tumor volume across different treatment groups.

#### Statistical Analysis
Correlation analysis was conducted to evaluate the relationship between mouse weight and average observed tumor volume for the Capomulin regimen.
Linear regression analysis was performed to model the relationship between mouse weight and tumor volume, and to assess the significance of the regression parameters.

Based on this analysis, the following key findings were observed:

- The dataset contained 248 unique mice.
- The most promising regimens were Capomulin, Ramicane, Infubinol and Ceftamin, which showed varying levels of effectiveness in reducing tumor volume.
- The Capomulin regimen showed a strong positive correlation between mouse weight and average tumor volume, indicating that heavier mice tended to have larger tumor volumes.


 

In [2]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
from scipy.stats import linregress
import numpy as np

# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single DataFrame

merged_data = study_results.join(mouse_metadata.set_index('Mouse ID'), on='Mouse ID')

# Display the data table for preview

merged_data.head()


In [None]:
# Checking the number of mice.
unique_mice_count = merged_data['Mouse ID'].nunique()
unique_mice_count

In [None]:
# Our data should be uniquely identified by Mouse ID and Timepoint
# Get the duplicate mice by ID number that shows up for Mouse ID and Timepoint. 
duplicate_mice = merged_data[merged_data.duplicated(['Mouse ID', 'Timepoint'])]['Mouse ID'].unique()

duplicate_mice

In [None]:
# Optional: Get all the data for the duplicate mouse ID. 
duplicate_data = merged_data.loc[merged_data['Mouse ID'].isin(duplicate_mice)]

duplicate_data


In [None]:
# Create a clean DataFrame by dropping every row for the duplicate mouse ID(s) found above (g989).
cleaned_data = merged_data[~merged_data['Mouse ID'].isin(duplicate_mice)]

cleaned_data.to_csv('cleaned_data.csv', index=False)
cleaned_data


In [None]:
# Checking the number of mice in the clean DataFrame.
num_mice_cleaned = cleaned_data['Mouse ID'].nunique()
num_mice_cleaned

## Summary Statistics

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen

mean = cleaned_data['Tumor Volume (mm3)'].groupby(cleaned_data['Drug Regimen']).mean()
median = cleaned_data['Tumor Volume (mm3)'].groupby(cleaned_data['Drug Regimen']).median()
var = cleaned_data['Tumor Volume (mm3)'].groupby(cleaned_data['Drug Regimen']).var()
std = cleaned_data['Tumor Volume (mm3)'].groupby(cleaned_data['Drug Regimen']).std()
sem = cleaned_data['Tumor Volume (mm3)'].groupby(cleaned_data['Drug Regimen']).sem()

# Assemble the resulting series into a single summary DataFrame.
summary_stat = pd.DataFrame({"Mean Tumor Volume": mean,
                             "Median Tumor Volume": median,
                             "Tumor Volume Variance": var,
                             "Tumor Volume Std. Dev.": std,
                             "Tumor Volume Std. Err.": sem})

summary_stat



In [None]:
# A more advanced method to generate a summary statistics table of mean, median, variance, standard deviation,
# and SEM of the tumor volume for each regimen (only one method is required in the solution)

# Using the aggregation method, produce the same summary statistics in a single line

summary_stats = cleaned_data.groupby(['Drug Regimen'])[['Tumor Volume (mm3)']].agg(['mean', 'median', 'var', 'std', 'sem'])


summary_stats

## Bar and Pie Charts

In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using Pandas.

regimen_count = cleaned_data['Drug Regimen'].value_counts()

regimen_count.plot(kind='bar', xlabel='Drug Regimen', ylabel='# of Observed Mouse Timepoints', title='Total Rows per Drug Regimen',figsize=(5, 5))
plt.show()

In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using pyplot.

plt.figure(figsize=(5, 5))
plt.bar(regimen_count.index, regimen_count.values)
plt.title('Total Number of Rows by Drug Regimen')
plt.xlabel('Drug Regimen')
plt.ylabel('# of Observed Mouse Timepoints')
plt.xticks(rotation='vertical')
plt.show()

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using Pandas
mice_count = cleaned_data["Sex"].value_counts()
plt.title('Distribution of Female vs Male Mice')
mice_count.plot.pie(autopct='%1.1f%%')
plt.show()

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot

plt.pie(mice_count, labels=mice_count.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Female vs Male Mice')
plt.axis('equal')
plt.show()

## Quartiles, Outliers and Boxplots

In [None]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:  
# Capomulin, Ramicane, Infubinol, and Ceftamin

promising_regimens = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']
promising_data = cleaned_data[cleaned_data['Drug Regimen'].isin(promising_regimens)]

# Start by getting the last (greatest) timepoint for each mouse

final_timepoints = cleaned_data.groupby("Mouse ID")["Timepoint"].max().reset_index()

# Merge this group df with the original DataFrame to get the tumor volume at the last timepoint

final_tumor_volume = pd.merge(cleaned_data, final_timepoints, on=["Mouse ID", "Timepoint"], how="right")

final_tumor_volume.head(250)

In [None]:
# Put treatments into a list for the for loop (and later for plot labels)
treatment_names = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']

# Create an empty dict to fill with tumor volume data (for plotting)
tumor_volume_data = {}

# Calculate the IQR and quantitatively determine if there are any potential outliers.
for drug in treatment_names:
    # Locate the rows which contain mice on each drug and get the tumor volumes
    drug_rows = final_tumor_volume.loc[final_tumor_volume['Drug Regimen'] == drug]
    volumes = drug_rows['Tumor Volume (mm3)']

    # Add the subset to the dict
    tumor_volume_data[drug] = volumes

    # Determine outliers using upper and lower bounds (1.5 * IQR rule)
    quartile_1 = volumes.quantile(0.25)
    quartile_3 = volumes.quantile(0.75)
    iqr = quartile_3 - quartile_1
    lower_bound = quartile_1 - 1.5 * iqr
    upper_bound = quartile_3 + 1.5 * iqr
    outliers = volumes[(volumes < lower_bound) | (volumes > upper_bound)]

    print(f"Drug: {drug}")
    if len(outliers) == 0:
        print("No outliers found.")
    else:
        # Only print the series when it actually contains outliers
        print("Potential outliers:")
        print(outliers)
    print()
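
To make the 1.5 × IQR rule used in the loop above concrete, here is a small worked sketch on hypothetical final tumor volumes (not values from the study). With pandas' default linear interpolation, the quartiles of these seven values are 32.5 and 37.0, so the bounds are 25.75 and 43.75 and only the value 60.0 falls outside them.

```python
import pandas as pd

# Hypothetical final tumor volumes for one regimen (illustrative values only).
volumes = pd.Series([30.0, 32.0, 33.0, 35.0, 36.0, 38.0, 60.0])

q1 = volumes.quantile(0.25)   # 32.5
q3 = volumes.quantile(0.75)   # 37.0
iqr = q3 - q1                 # 4.5

lower_bound = q1 - 1.5 * iqr  # 25.75
upper_bound = q3 + 1.5 * iqr  # 43.75

# Values outside [lower_bound, upper_bound] are flagged as potential outliers.
outliers = volumes[(volumes < lower_bound) | (volumes > upper_bound)]
print(outliers.tolist())
```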
    

In [None]:
# Generate a box plot that shows the distribution of the tumor volume for each treatment group.

treatment_regimens = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']
filtered_data = final_tumor_volume[final_tumor_volume['Drug Regimen'].isin(treatment_regimens)]

grouped_data = filtered_data.groupby('Drug Regimen')['Tumor Volume (mm3)'].apply(list)

treatment_names = grouped_data.index.tolist()

data = grouped_data.tolist()

plt.boxplot(data, labels=treatment_names)

plt.xlabel('Treatment Group')
plt.ylabel('Tumor Volume (mm3)')
plt.title('Distribution of Tumor Volume by Treatment Group')
plt.show()

## Line and Scatter Plots

In [None]:
# Generate a line plot of tumor volume vs. time point for a single mouse treated with Capomulin

mouse_data = cleaned_data[(cleaned_data['Drug Regimen'] == 'Capomulin') & (cleaned_data['Mouse ID'] == 'b128')]
plt.plot(mouse_data['Timepoint'], mouse_data['Tumor Volume (mm3)'], marker='o')
plt.title('Capomulin treatment of mouse B128')
plt.xlabel('Time Point')
plt.ylabel('Tumor Volume (mm3)')
plt.show()


In [None]:
# Generate a scatter plot of mouse weight vs. the average observed tumor volume for the entire Capomulin regimen

# Use every Capomulin observation (not only the final timepoints) so the per-mouse mean is a true average
capomulin_data = cleaned_data[cleaned_data['Drug Regimen'] == 'Capomulin']
average_tumor_volume = capomulin_data.groupby('Mouse ID')['Tumor Volume (mm3)'].mean()
mouse_weight = capomulin_data.groupby('Mouse ID')['Weight (g)'].first()

plt.scatter(mouse_weight, average_tumor_volume)
plt.xlabel('Mouse Weight (g)')
plt.ylabel('Average Tumor Volume (mm3)')
plt.title('Mouse Weight vs. Average Tumor Volume for Capomulin Regimen')

plt.show()

## Correlation and Regression

In [None]:
# Calculate the correlation coefficient and a linear regression model 
# for mouse weight and average observed tumor volume for the entire Capomulin regimen

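# Use every Capomulin observation (not only the final timepoints) so the per-mouse mean is a true average
capomulin_data = cleaned_data[cleaned_data['Drug Regimen'] == 'Capomulin']
average_tumor_volume = capomulin_data.groupby('Mouse ID')['Tumor Volume (mm3)'].mean()
mouse_weight = capomulin_data.groupby('Mouse ID')['Weight (g)'].first()

# Pearson correlation coefficient between weight and average tumor volume
correlation = np.corrcoef(mouse_weight, average_tumor_volume)[0, 1]
print(f"The correlation between mouse weight and the average tumor volume is {correlation:.2f}")

# Least-squares line: average tumor volume = slope * weight + intercept
coefficients = np.polyfit(mouse_weight, average_tumor_volume, 1)
slope = coefficients[0]
intercept = coefficients[1]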

plt.scatter(mouse_weight, average_tumor_volume)
plt.xlabel('Mouse Weight (g)')
plt.ylabel('Average Tumor Volume (mm3)')
plt.title('Mouse Weight vs. Average Tumor Volume for Capomulin Regimen')

plt.plot(mouse_weight, slope * mouse_weight + intercept, 'r')

plt.show()
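
The cell above fits the line with `np.polyfit`, which returns only the coefficients. One possible way to also assess the significance of the fit is `scipy.stats.linregress`, which the notebook already imports. The sketch below uses hypothetical weights and volumes (not values from the study) and is meant only as an illustration of the call, not as the notebook's actual result.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical per-mouse data (illustrative values only).
weights = [15, 17, 19, 21, 23, 25]              # mouse weight (g)
volumes = [36.0, 37.5, 40.1, 41.0, 43.8, 45.2]  # average tumor volume (mm3)

# linregress returns slope, intercept, rvalue, pvalue and stderr in one result.
result = linregress(weights, volumes)

# np.polyfit with degree 1 gives the same least-squares slope and intercept.
poly_slope, poly_intercept = np.polyfit(weights, volumes, 1)

print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}")
print(f"r={result.rvalue:.3f}, p-value={result.pvalue:.4g}")
```

Here `result.rvalue` is the Pearson correlation coefficient and `result.pvalue` tests the null hypothesis that the slope is zero, so both quantities discussed in the analysis come from a single call.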