# Pymaceuticals Inc.
---

### Analysis

Pymaceuticals, Inc. is a new pharmaceutical company that specializes in anti-cancer medications. In this study, which screens for potential treatments for squamous cell carcinoma (SCC), a commonly occurring form of skin cancer, 249 mice identified with SCC tumors received treatment with a range of drug regimens. Over the course of 45 days, tumor development was observed and measured. The purpose of the study was to compare the performance of Pymaceuticals’ drug of interest, Capomulin, against the other treatment regimens.


1. The mean and median tumor volumes are nearly the same for each drug regimen, which suggests the data are distributed fairly evenly (roughly symmetric). The variance, standard deviation, and standard error values also look reasonable.

2. Mice were evenly distributed by weight, age, and days tested across all medications, with a roughly equal count of male and female mice.

3. In the bar graphs comparing the drug regimens tested on the mice, Capomulin performs best and Ramicane is in second place.

4. In the same bar graphs, Propriva is last in the list (it is not working effectively).

5. Compared with the other regimens shown in the box plots, Infubinol has some potential outliers.
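
Observation 1 can be checked numerically. The sketch below uses a small made-up table (the `toy` frame and its values are illustrative assumptions, not study data) and reports how far each regimen's mean sits from its median, as a fraction of the median. A small gap is consistent with a roughly symmetric distribution, while a large gap hints at skew or outliers.

```python
import pandas as pd

# Toy data for illustration only; these are NOT values from the study.
toy = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 50.0, 51.0, 70.0],
})

# Mean and median tumor volume per regimen.
stats = toy.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(["mean", "median"])

# Relative gap between mean and median, as a fraction of the median.
stats["rel_gap"] = (stats["mean"] - stats["median"]).abs() / stats["median"]
print(stats)
```

In this toy table, regimen "A" has identical mean and median, while the single large value in regimen "B" pulls its mean well above its median.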

In [None]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
import numpy as np

# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single DataFrame
raw_df = pd.merge(study_results, mouse_metadata, on="Mouse ID", how="left")

# Display the data table for preview
raw_df.head()

In [None]:
# Checking the number of mice.
raw_df["Mouse ID"].nunique()

In [None]:
# Our data should be uniquely identified by Mouse ID and Timepoint
# Get the duplicate mice by ID number that shows up for Mouse ID and Timepoint. 

duplicate_mouseids = raw_df.loc[raw_df.duplicated(subset=['Mouse ID', 'Timepoint']), 'Mouse ID'].unique()
duplicate_mouseids


In [None]:
# Optional: Get all the data for the duplicate mouse ID. 

mask2=raw_df.loc[raw_df["Mouse ID"] == "g989"]
mask2


In [None]:
# Create a clean DataFrame by dropping every row for the duplicate mouse ID
mask_test = raw_df["Mouse ID"] != "g989"

clean_df = raw_df.loc[mask_test].reset_index(drop=True)

clean_df.head()

In [None]:
# Checking the number of mice in the clean DataFrame.
clean_df["Mouse ID"].nunique()


## Summary Statistics

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen

# Use groupby and summary statistical methods to calculate the following properties of each drug regimen: 
# mean, median, variance, standard deviation, and SEM of the tumor volume. 
# Assemble the resulting series into a single summary DataFrame.

mean_tumor=clean_df.groupby(["Drug Regimen"])["Tumor Volume (mm3)"].mean()
median_tumor=clean_df.groupby(["Drug Regimen"])["Tumor Volume (mm3)"].median()
var_tumor= clean_df.groupby(["Drug Regimen"])["Tumor Volume (mm3)"].var()
std_dev_tumor = clean_df.groupby(["Drug Regimen"])["Tumor Volume (mm3)"].std()
std_err_tumor= clean_df.groupby(["Drug Regimen"])["Tumor Volume (mm3)"].sem()


data={
    "Mean Tumor Volume": mean_tumor,
    "Median Tumor Volume" : median_tumor,
    "Tumor Volume Variance": var_tumor,
    "Tumor Volume Std.Dev.": std_dev_tumor,
    "Tumor Volume Std.Err.": std_err_tumor,  
     }
summary= pd.DataFrame(data)
summary




In [None]:
# A more advanced method to generate a summary statistics table of mean, median, variance, standard deviation,
# and SEM of the tumor volume for each regimen (only one method is required in the solution)

# Using the aggregation method, produce the same summary statistics in a single line


summary_stats = clean_df.groupby('Drug Regimen').agg({"Tumor Volume (mm3)": ['mean', 'median', 'var', 'std', 'sem']})
summary_stats
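
The dictionary form of `agg` used above returns a two-level column index. As a sketch of an alternative, selecting the column before aggregating gives flat column names. The `toy` frame below is an assumption for illustration and is not study data.

```python
import pandas as pd

# Toy data for illustration only; these are NOT values from the study.
toy = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 50.0, 52.0, 54.0],
})

# Select the column first, then pass a list of statistic names to agg.
flat_stats = (
    toy.groupby("Drug Regimen")["Tumor Volume (mm3)"]
    .agg(["mean", "median", "var", "std", "sem"])
)
print(flat_stats.columns.tolist())
print(flat_stats)
```

Flat names make it easier to reference a single statistic later, for example `flat_stats["sem"]`.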


## Bar and Pie Charts

In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using Pandas.

# Group the data by the drug regimen and count the number of rows for each group
bar1 = clean_df.groupby('Drug Regimen').size().sort_values(ascending = False)

# Create a bar plot using the Pandas DataFrame.plot() method
bar1.plot(kind='bar', color='navy', figsize=(8, 2.5))

# Add title and labels
plt.title('Total Number of Rows for Each Drug Regimen')
plt.xlabel('Drug Regimen')
plt.ylabel('# of Observed Mouse Timepoints')

plt.show()

In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using pyplot.
cols_agg={
    "Timepoint": "count",
    "Mouse ID" :"count"
}
barplot= clean_df.groupby(["Drug Regimen"]).agg(cols_agg).reset_index()
barplot=barplot.rename(columns={'Timepoint': 'Timepoint Count'}).sort_values(by = 'Timepoint Count' ,ascending=False)
barplot


x= barplot["Drug Regimen"]
y=barplot["Timepoint Count"]


plt.figure(figsize=(8, 2.5))
plt.bar(x, y, color='orange',alpha=0.90, align="center")

# Add labels and title
plt.xlabel('Drug Regimen')
plt.ylabel('# Of Observed Mouse Timepoints')
plt.title('Total Number of Mouse ID/Timepoints')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

#Display the plot
plt.show()

In [None]:
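# Generate a pie plot showing the distribution of female versus male mice using Pandas

# Count the rows recorded for each sex
cols_agg={
    "Mouse ID": "count"
}
male_female_count= clean_df.groupby(["Sex"]).agg(cols_agg).reset_index()
# Rename the count column to "Sex" so the pie chart's y-axis label reads "Sex",
# and keep the original category labels as the index
male_female_count=male_female_count.rename(columns={'Mouse ID': 'Sex', 'Sex': 'index_sex'})
male_female_count=male_female_count.set_index("index_sex").sort_values(by ='Sex',ascending=False)
male_female_count
plot = male_female_count.plot.pie(y='Sex', figsize=(5,4.2), autopct='%1.1f%%', colors=["skyblue","orange"],legend= False)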



In [None]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot

# Data for the pie chart
sizes = [958,922]  # Row counts for male and female mice, in the same order as labels
labels = ['Male', 'Female']
colors = ['skyblue','orange']
explode = (0.05, 0)  # Explode the first slice (Male)

# Create the pie chart
plt.pie(sizes, labels=labels, colors=colors, explode=explode, autopct='%1.1f%%', shadow=True, startangle=0)
plt.ylabel("Sex")
# Display the plot
plt.show()

## Quartiles, Outliers and Boxplots

In [None]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:  
# Capomulin, Ramicane, Infubinol, and Ceftamin

# Start by getting the last (greatest) timepoint for each mouse
max_mouseid=clean_df.groupby(["Mouse ID"])["Timepoint"].max()
max_mouseid=max_mouseid.reset_index()
max_mouseid

# Merge this group df with the original DataFrame to get the tumor volume at the last timepoint
merged_clean_df = pd.merge( max_mouseid,clean_df ,on= ["Mouse ID", "Timepoint"], how='left')
merged_clean_df.head()


In [None]:
# Put treatments into a list for for loop (and later for plot labels)

treatment_list=["Capomulin", "Ramicane", "Infubinol", "Ceftamin"]

# Create empty list to fill with tumor vol data (for plotting)
tumor_vol_list=[]

# Calculate the IQR and quantitatively determine if there are any potential outliers. 

for drug in treatment_list:
    # Locate the rows which contain mice on each drug and get the tumor volumes
    final_tumor_vol = merged_clean_df.loc[merged_clean_df["Drug Regimen"] == drug, 'Tumor Volume (mm3)']
    
    # add subset
    tumor_vol_list.append(final_tumor_vol)
    
    # Compute the quartiles and the interquartile range (IQR)
    quartiles = final_tumor_vol.quantile([.25,.5,.75])
    lowerq = quartiles[0.25]
    upperq = quartiles[0.75]
    iqr = upperq-lowerq
    # Determine outliers using upper and lower bounds
    lower_bound = lowerq - (1.5*iqr)
    upper_bound = upperq + (1.5*iqr)
    
    outliers = final_tumor_vol.loc[(final_tumor_vol < lower_bound) | (final_tumor_vol > upper_bound)]
    print(f"{drug}'s potential outliers: {outliers}")  


In [None]:
# Generate a box plot that shows the distribution of the tumor volume for each treatment group.
# Highlight outlier points (fliers) with large red markers
flier_props = dict(markerfacecolor='red',markersize=12)
plt.boxplot(tumor_vol_list, labels = treatment_list,flierprops=flier_props)
plt.ylabel('Final Tumor Volume (mm3)')
plt.show()

## Line and Scatter Plots

In [None]:
# Generate a line plot of tumor volume vs. time point for a single mouse treated with Capomulin

# Filter the data for the Capomulin regimen
capomulin_data = clean_df[clean_df["Drug Regimen"] == "Capomulin"]

# Select a single mouse
#mouse_id = capomulin_data["Mouse ID"].iloc[0]
mouse_data = capomulin_data[capomulin_data["Mouse ID"] == "l509"]

# Create a line plot for tumor volume versus time point for the selected mouse
plt.figure(figsize=(10,3))
plt.plot(mouse_data["Timepoint"], mouse_data["Tumor Volume (mm3)"], marker='o', color='green')
plt.xlabel("Timepoint (days)")
plt.ylabel("Tumor Volume (mm3)")
plt.title("Capomulin treatment for Mouse l509")
plt.grid(True)
plt.show()


In [None]:
# Generate a scatter plot of mouse weight vs. the average observed tumor volume for the entire Capomulin regimen

# Filter data for the Capomulin regimen
capomulin_data1 = clean_df[clean_df['Drug Regimen'] == 'Capomulin']

# Group by 'Mouse ID' and calculate the average tumor volume for each mouse
average_tumor_volume = capomulin_data1.groupby('Mouse ID')['Tumor Volume (mm3)'].mean()

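# Merge average tumor volume data with the original data to get the corresponding weight for each mouse,
# keeping one row per mouse so repeated timepoints are not plotted on top of each other
merged_data = pd.merge(capomulin_data1, average_tumor_volume, on='Mouse ID').drop_duplicates(subset='Mouse ID')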

# Create a scatter plot of mouse weight vs. average tumor volume
plt.figure(figsize=(10, 6))
plt.scatter(merged_data['Weight (g)'], merged_data['Tumor Volume (mm3)_y'], color='blue', alpha=0.7)
plt.title('Mouse Weight vs. Average Tumor Volume (Capomulin Regimen)')
plt.xlabel('Weight (g)')
plt.ylabel('Average Tumor Volume (mm3)')
plt.grid(True)
plt.show()

## Correlation and Regression

In [None]:
# Calculate the correlation coefficient and a linear regression model 
# for mouse weight and average observed tumor volume for the entire Capomulin regimen
# Filter data for the Capomulin regimen
capomulin_data2 = clean_df[clean_df['Drug Regimen'] == 'Capomulin']

# Group by 'Mouse ID' and calculate the average tumor volume for each mouse
average_tumor_volume= capomulin_data2.groupby('Mouse ID')['Tumor Volume (mm3)'].mean()

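# Merge average tumor volume data with the original data to get the corresponding weight for each mouse,
# keeping one row per mouse so each mouse is counted once in the correlation and regression
merged_data1 = pd.merge(capomulin_data2, average_tumor_volume, on='Mouse ID').drop_duplicates(subset='Mouse ID')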
# Calculate the correlation coefficient
correlation = merged_data1['Weight (g)'].corr(merged_data1['Tumor Volume (mm3)_y'])
print(f"Correlation Coefficient: {correlation}")

# Perform linear regression
slope, intercept, r_value, p_value, std_err = st.linregress(merged_data1['Weight (g)'], merged_data1['Tumor Volume (mm3)_y'])

# Create the equation of the line
line_eq = f"y = {slope:.2f}x + {intercept:.2f}"
print(f"Linear Regression Equation: {line_eq}")

# Calculate the predicted values
predicted_values = slope * merged_data1['Weight (g)'] + intercept

# Plot the data and the linear regression line
plt.scatter(merged_data1['Weight (g)'], merged_data1['Tumor Volume (mm3)_y'])
plt.plot(merged_data1['Weight (g)'], predicted_values, color='red')
plt.xlabel('Weight (g)')
plt.ylabel('Average Tumor Volume (mm3)')
plt.title('Mouse Weight vs. Average Tumor Volume (Capomulin Regimen)')
plt.show()
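
For a simple one-variable regression, the `rvalue` returned by `st.linregress` is the same Pearson correlation coefficient computed separately above. The sketch below illustrates this on made-up numbers; the `weights` and `volumes` lists are hypothetical and are not study data.

```python
import scipy.stats as st

# Hypothetical weights (g) and average tumor volumes (mm3); NOT study data.
weights = [15.0, 17.0, 19.0, 21.0, 23.0]
volumes = [36.0, 38.0, 41.0, 42.0, 45.0]

# Pearson correlation computed directly.
pearson_r = st.pearsonr(weights, volumes)[0]

# Linear regression; the result also exposes the correlation as `rvalue`.
fit = st.linregress(weights, volumes)

print(f"Pearson r:         {pearson_r:.4f}")
print(f"linregress rvalue: {fit.rvalue:.4f}")
print(f"r squared:         {fit.rvalue ** 2:.4f}")
print(f"y = {fit.slope:.2f}x + {fit.intercept:.2f}")
```

Squaring `rvalue` gives the share of variance in tumor volume explained by weight under the linear model, which can be a useful number to report alongside the fitted line.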
