## Overview

Cancer is the second most common cause of death in the US and has major impacts on societies across the world. There are over 100 different types of cancer. Each has unique characteristics that influence its behavior and response to treatment, which makes it crucial to understand the different cancer types. Although many studies have been conducted on cancer and cancer treatments, there are still many cancer types about which doctors and scientists know little.

## Data

### Data Overview

This dataset was published by ClinicalTrials.gov, adapted by Noah Rippner, and posted to data.world in 2016, where it is open to the public. The data range from 2010 to 2016. Data collection is performed by a government agency through trial approvals and reports.
   - Source URL: https://data.world/nrippner/cancer-trials/workspace/file?filename=study_fields.csv

This dataset contains 10686 rows of data. Each row represents a trial with a unique title and NCT number.
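
The one-row-per-trial claim can be checked directly. The sketch below uses a small hypothetical sample in place of `study_fields.csv` (the NCT numbers and titles are made up for illustration); on the real data, the same calls would be run on the loaded DataFrame.

```python
import pandas as pd

# Hypothetical three-row sample standing in for study_fields.csv
sample = pd.DataFrame({
    "NCT Number": ["NCT00000001", "NCT00000002", "NCT00000003"],
    "Title": ["Trial A", "Trial B", "Trial C"],
})

# True when no value is repeated, i.e. one row per trial
ids_unique = sample["NCT Number"].is_unique
titles_unique = sample["Title"].is_unique

# Number of rows that share an NCT number with an earlier row
n_duplicate_ids = int(sample["NCT Number"].duplicated().sum())

print(ids_unique, titles_unique, n_duplicate_ids)
```

If `n_duplicate_ids` were greater than zero on the real file, later counts per trial would need deduplication first.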

Selected variables and their descriptions are listed below (the file has 26 columns in total):
1. **Rank**: Numerical (ordinal). Represents the ranking of the clinical trial in the dataset.
2. **NCT Number**: Categorical (ID). A unique identifier for each clinical trial.
3. **Title**: Categorical (text). The title or name of the clinical trial.
4. **Recruitment**: Categorical (Recruiting, Completed, Not Yet Recruiting, etc.). Indicates the current recruitment status of the trial.
5. **Study Results**: Categorical (Available, Not Available). Reflects whether the trial results have been posted.
6. **Conditions**: Categorical (text). Lists the medical conditions being studied in the trial.
7. **Interventions**: Categorical (text). Describes the treatments or interventions being tested in the trial.
8. **Sponsor/Collaborators**: Categorical (text). Organizations or institutions that sponsor or collaborate on the trial.
9. **Gender**: Categorical (Male, Female, Both). Indicates the eligible gender(s) for participation in the trial.
10. **Age Groups**: Categorical (Child, Adult, Senior). The age groups eligible to participate in the trial.
11. **First Received, Start Date, Completion Date**: Date (timestamp). Key dates related to the trial, such as when it was first received, started, and completed.
12. **Outcome Measures**: Categorical (text). Metrics used to evaluate the effectiveness or safety of the interventions.
13. **URL**: Categorical (URL). A link to more detailed information about the clinical trial on a public database.
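
The date variables are only usable as timestamps once they are parsed. The sketch below assumes the date columns load as plain strings, which is pandas' default for text it is not told to parse. The sample values and their ISO format are hypothetical; the real file may use a different date format.

```python
import pandas as pd

# Hypothetical sample; the real date strings may be formatted differently
sample = pd.DataFrame({
    "Start Date": ["2012-01-15", "2013-06-01", None],
    "Completion Date": ["2014-01-15", "2013-12-01", "2015-03-01"],
})

date_cols = ["Start Date", "Completion Date"]
for col in date_cols:
    # errors="coerce" turns unparseable or missing values into NaT
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

print(sample.dtypes)
print(sample["Start Date"].isna().sum())  # number of missing start dates
```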


### Data Details

In [2]:
# load the data
import pandas as pd
import plotly.express as px
df = pd.read_csv('study_fields.csv')
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10686 entries, 0 to 10685
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Rank                     10686 non-null  int64  
 1   NCT Number               10686 non-null  object 
 2   Title                    10686 non-null  object 
 3   Recruitment              10686 non-null  object 
 4   Study Results            10686 non-null  object 
 5   Conditions               10686 non-null  object 
 6   Interventions            9887 non-null   object 
 7   Sponsor/Collaborators    10686 non-null  object 
 8   Gender                   10685 non-null  object 
 9   Age Groups               10686 non-null  object 
 10  Phases                   7314 non-null   object 
 11  Enrollment               10612 non-null  float64
 12  Funded Bys               10686 non-null  object 
 13  Study Types              10686 non-null  object 
 14  Study Designs         

              Rank   Enrollment
count      10686.0      10612.0
mean        5343.5     12539.72
std    3084.926822     991095.4
min            1.0          0.0
25%        2672.25         22.0
50%         5343.5         48.0
75%        8014.75        120.0
max        10686.0  100000000.0


### Most Frequently Studied Cancer Conditions
We want to know **which conditions are most frequently studied**.

#### Implementation

In [14]:
# Keep only the columns needed for the analysis
df_conditions = df[['NCT Number','Conditions','Gender','Age Groups']]
# Split the '|'-separated conditions and give each condition its own row
conditions_exploded = df_conditions.assign(Conditions=df_conditions.Conditions.str.split('|')).explode('Conditions')
print(f'Number of Rows in conditions_exploded: {len(conditions_exploded)}')

top_10_conditions = conditions_exploded['Conditions'].value_counts().head(10)
print(top_10_conditions)

Number of Rows in conditions_exploded: 28666
Conditions
Breast Cancer        872
Prostate Cancer      500
Cancer               349
Multiple Myeloma     332
Lung Cancer          221
Melanoma             220
Lymphoma             209
Pancreatic Cancer    207
Colorectal Cancer    205
Leukemia             184
Name: count, dtype: int64


In [16]:
import plotly.express as px
fig = px.bar(top_10_conditions,
             x=top_10_conditions.index,
             y=top_10_conditions.values,
             text_auto=True,
             labels={'x': 'Conditions', 'y': 'Number of Trials'},
             title='Top 10 Most Frequently Studied Cancer Conditions From 2010 To 2016')
fig.show()

Visualization Description: The plot above shows the 10 most frequently studied cancer conditions. To produce it, we filtered the dataset, exploded the `Conditions` column, and counted the 10 most common values.

The dataset is first reduced to the necessary columns, since carrying unneeded columns slows down later operations. Each trial's `Conditions` value can list several conditions separated by `|`, so the column is split and exploded to give each condition its own row.
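
The split-and-explode step can be seen on a tiny example. This is a minimal sketch on a hypothetical three-trial sample (the NCT numbers are made up), using the same `str.split('|')` and `explode` calls as the cell above.

```python
import pandas as pd

# Hypothetical sample: each trial lists its conditions in one '|'-separated string
sample = pd.DataFrame({
    "NCT Number": ["NCT00000001", "NCT00000002", "NCT00000003"],
    "Conditions": ["Breast Cancer|Lung Cancer", "Breast Cancer", "Melanoma|Breast Cancer"],
})

# Split each string into a list, then give every list element its own row
exploded = sample.assign(Conditions=sample["Conditions"].str.split("|")).explode("Conditions")

# Count how many rows mention each condition
counts = exploded["Conditions"].value_counts()

print(len(exploded))  # one row per (trial, condition) pair
print(counts)
```

Three trials become five rows here, which is why the real dataset grows from 10686 trials to 28666 rows after exploding.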

### Age Equality in Cancer Trials
We want to know **the age distribution across cancer clinical trials for the top 10 cancer types**.

We first exploded the dataset by age group and explored the age group distribution across all cancer clinical trials.

In [17]:
# Create a copy of the original dataframe
age_exploded = conditions_exploded.copy()

# explode
age_exploded['Age Groups'] = age_exploded['Age Groups'].str.split('|')
age_exploded = age_exploded.explode('Age Groups')

# Select the rows for each age group
adult_trials_df = age_exploded[age_exploded['Age Groups'] == 'Adult']
child_trials_df = age_exploded[age_exploded['Age Groups'] == 'Child']
senior_trials_df = age_exploded[age_exploded['Age Groups'] == 'Senior']

print(age_exploded.head())
print(age_exploded.info())

# calculate counts for each age group
adult_count = adult_trials_df.shape[0]
child_count = child_trials_df.shape[0]
senior_count = senior_trials_df.shape[0]

# Output
print(f"Number of child entries: {child_count}")
print(f"Number of adult entries: {adult_count}")
print(f"Number of senior entries: {senior_count}")

    NCT Number         Conditions Gender Age Groups
0  NCT02012699  Pancreatic Cancer   Both      Adult
0  NCT02012699  Pancreatic Cancer   Both     Senior
0  NCT02012699     Thyroid Cancer   Both      Adult
0  NCT02012699     Thyroid Cancer   Both     Senior
0  NCT02012699        Lung Cancer   Both      Adult
<class 'pandas.core.frame.DataFrame'>
Index: 57938 entries, 0 to 10685
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   NCT Number  57938 non-null  object
 1   Conditions  57938 non-null  object
 2   Gender      57936 non-null  object
 3   Age Groups  57938 non-null  object
dtypes: object(4)
memory usage: 2.2+ MB
None
Number of child entries: 4401
Number of adult entries: 28262
Number of senior entries: 25275


In [22]:
age_groups = ['Child', 'Adult', 'Senior']
trial_counts = [child_count, adult_count, senior_count]

# Create a dataframe using age_group and Trial_counts
data = pd.DataFrame({'Age Groups': age_groups, 'Number of Trials': trial_counts})

fig = px.bar(data, x='Age Groups', y='Number of Trials',
             title='Age Distribution in U.S. Cancer Trials from 2010 to 2016',
             color='Age Groups',
             color_discrete_sequence=['blue', 'green', 'orange'],
             text_auto=True)

# Add labels
fig.update_layout(xaxis_title='Age Groups', yaxis_title='Number of Trials')

fig.show()

While exploring the data, we made a bar graph of the number of entries for children, adults, and seniors. This graph helps show the overall age distribution across all clinical trials in the dataset.
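
Because `age_exploded` is derived from `conditions_exploded`, a trial that lists several conditions appears once per condition in the counts above. The sketch below, on a hypothetical two-trial sample, contrasts those row counts with a count of distinct trials per age group. The `nunique` approach is an alternative shown for illustration, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample: the first trial lists two conditions and two age groups
sample = pd.DataFrame({
    "NCT Number": ["NCT00000001", "NCT00000002"],
    "Conditions": ["Lung Cancer|Thyroid Cancer", "Leukemia"],
    "Age Groups": ["Adult|Senior", "Child|Adult"],
})

# Explode both multi-valued columns, as in the cells above
exploded = sample.copy()
for col in ["Conditions", "Age Groups"]:
    exploded[col] = exploded[col].str.split("|")
    exploded = exploded.explode(col)

# Row counts: a trial with several conditions is counted once per condition
entries_per_age = exploded["Age Groups"].value_counts()

# Distinct trials: each NCT number is counted at most once per age group
trials_per_age = exploded.groupby("Age Groups")["NCT Number"].nunique()

print(entries_per_age)
print(trials_per_age)
```

In this sample the Adult group has three entries but only two distinct trials, so the two measures can rank age groups the same way while differing in magnitude.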

In [25]:
# Combine the age trial data
combined_df = pd.concat([adult_trials_df, senior_trials_df, child_trials_df], ignore_index=True)
# Count the occurrences of each condition and get the top 10
top_conditions = combined_df['Conditions'].value_counts().nlargest(10).index

# Filter the DataFrame for the top 10 conditions
filtered_df = combined_df[combined_df['Conditions'].isin(top_conditions)]
print(filtered_df.head())
# Count the occurrences of each condition
condition_counts = filtered_df['Conditions'].value_counts()
# Create an ordered list of conditions from highest to lowest
ordered_conditions = condition_counts.index.tolist()

     NCT Number         Conditions Gender Age Groups
0   NCT02012699  Pancreatic Cancer   Both      Adult
2   NCT02012699        Lung Cancer   Both      Adult
21  NCT02012699    Prostate Cancer   Both      Adult
38  NCT01631552  Colorectal Cancer   Both      Adult
55  NCT01391143    Prostate Cancer   Both      Adult


In [24]:
fig = px.histogram(filtered_df, x='Conditions',
                   color='Age Groups',
                   title='Age Distribution in Top 10 Most Frequently Studied Cancer Conditions from 2010 to 2016',
                   color_discrete_sequence=['blue', 'green', 'orange'],
                   text_auto=True,
                   barmode='group',
                   # Fix the age-group order so the colors match the previous chart
                   category_orders={'Conditions': ordered_conditions,
                                    'Age Groups': ['Child', 'Adult', 'Senior']})

# Show the plot
fig.show()
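
The grouped bars compare raw counts, which are dominated by the most studied conditions. As a complementary view, the sketch below tabulates counts per condition and age group with `pd.crosstab` and also expresses them as within-condition shares. The sample rows are hypothetical and only mimic the shape of `filtered_df`.

```python
import pandas as pd

# Hypothetical rows shaped like filtered_df: one row per (trial, condition, age group)
sample = pd.DataFrame({
    "Conditions": ["Breast Cancer"] * 4 + ["Leukemia"] * 4,
    "Age Groups": ["Adult", "Adult", "Senior", "Senior",
                   "Child", "Adult", "Adult", "Senior"],
})

# Raw counts per condition and age group (what the grouped bars display)
counts = pd.crosstab(sample["Conditions"], sample["Age Groups"])

# Row-normalized shares: each condition's row sums to 1
shares = pd.crosstab(sample["Conditions"], sample["Age Groups"], normalize="index")

print(counts)
print(shares.round(2))
```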

### Gender Equality in Cancer Trials
We also want to see **the gender distribution across clinical trials for the top 10 cancer conditions**.


#### Implementation

In [28]:
# Initialize a dictionary to store counts
gender_counts = {'Female': 0, 'Male': 0, 'Both': 0}

# Loop through the Gender column and count occurrences
for gender in df['Gender']:
    if gender in gender_counts:
        gender_counts[gender] += 1

# Display counts
print(gender_counts)

# Convert the dictionary to a DataFrame for plotting
gender_data = pd.DataFrame({
    'Gender': list(gender_counts.keys()),
    'Count': list(gender_counts.values())
})

# Create the pie chart
fig = px.pie(gender_data, names='Gender', values='Count',
             title="Gender Distribution in U.S. Clinical Trials from 2010 to 2016")

# Show the plot
fig.show()

{'Female': 1529, 'Male': 730, 'Both': 8426}


While exploring the data, we made a pie chart from the raw values in the 'Gender' column. It shows more female-only trials (1529) than male-only trials (730). However, this chart can be misleading, because the "Both" category (8426 trials) admits male and female participants alike and so hides each gender's share. It also summarizes all conditions in a single percentage, hiding the differences between conditions.
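
One way to resolve the "Both" category is to count such a trial once for each gender it admits. The sketch below does this with a mapping followed by `explode`, on a hypothetical three-trial sample. It is an alternative formulation of the same idea, not the copy-and-concatenate approach used in the next cell, and the `eligible` dictionary is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample of the 'Gender' column
sample = pd.DataFrame({
    "NCT Number": ["NCT00000001", "NCT00000002", "NCT00000003"],
    "Gender": ["Both", "Female", "Male"],
})

# Map each eligibility label to the list of genders it admits
eligible = {"Both": ["Female", "Male"], "Female": ["Female"], "Male": ["Male"]}

# Replace each label with its list, then explode so 'Both' yields two rows
expanded = sample.assign(Gender=sample["Gender"].map(eligible)).explode("Gender")

counts = expanded["Gender"].value_counts()
print(counts)
```

After expansion, each gender's count includes the trials open to both, so the two bars or slices become directly comparable.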

In [29]:
# Create a copy
separated_gender_df = conditions_exploded.copy()

# Select the trials where 'Gender' is 'Both'
both_gender_trials_df = separated_gender_df[separated_gender_df['Gender'] == 'Both'].copy()

# Create two copies of the 'Both' trials: one for Male and one for Female
female_trials_df = both_gender_trials_df.copy()
male_trials_df = both_gender_trials_df.copy()

# Modify the 'Gender' column for each respective dataframe
female_trials_df['Gender'] = 'Female'
male_trials_df['Gender'] = 'Male'

# Combine the male, female, and the rest of the dataset that is not 'Both'
separated_gender_df = pd.concat([separated_gender_df[separated_gender_df['Gender'] != 'Both'], female_trials_df, male_trials_df], ignore_index=True)

# show the new dataset
print(separated_gender_df.head())
print(separated_gender_df.info())

# count the amount of female and male entries in the updated dataframe
female_count = separated_gender_df[separated_gender_df['Gender'] == 'Female'].shape[0]
male_count = separated_gender_df[separated_gender_df['Gender'] == 'Male'].shape[0]

# Output the results
print(f"Number of Female entries: {female_count}")
print(f"Number of Male entries: {male_count}")

    NCT Number                 Conditions  Gender    Age Groups
0  NCT01891344             Ovarian Cancer  Female  Adult|Senior
1  NCT01891344  Epithelial Ovarian Cancer  Female  Adult|Senior
2  NCT01891344      Fallopian Tube Cancer  Female  Adult|Senior
3  NCT01891344          Peritoneal Cancer  Female  Adult|Senior
4  NCT01968213             Ovarian Cancer  Female  Adult|Senior
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52492 entries, 0 to 52491
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   NCT Number  52492 non-null  object
 1   Conditions  52492 non-null  object
 2   Gender      52491 non-null  object
 3   Age Groups  52492 non-null  object
dtypes: object(4)
memory usage: 1.6+ MB
None
Number of Female entries: 27500
Number of Male entries: 24991


In [30]:
# 'Conditions' already holds one condition per row (from conditions_exploded),
# so this split/explode is a safeguard that leaves those rows unchanged
separated_gender_df['Conditions'] = separated_gender_df['Conditions'].str.split('|')
exploded_df = separated_gender_df.explode('Conditions')

# Filter top 10 conditions
top_10_conditions = exploded_df['Conditions'].value_counts().nlargest(10).index

# Filter dataset to include only top 10 conditions
filtered_df = exploded_df[exploded_df['Conditions'].isin(top_10_conditions)]

# Group by 'Conditions' and 'Gender' and count occurrences
condition_gender_counts = filtered_df.groupby(['Conditions', 'Gender']).size().reset_index(name='Number of Trials')

fig = px.bar(
    condition_gender_counts,
    x='Conditions',
    y='Number of Trials',
    color='Gender',
    title='Gender Distribution in Top 10 Most Frequently Studied Cancer Conditions from 2010 to 2016',
    labels={'Conditions': 'Cancer Conditions', 'Number of Trials': 'Number of Trials'},
    text='Number of Trials'
)

fig.update_layout(barmode='group', xaxis={'categoryorder': 'total descending'})
fig.show()


## Conclusion

Breast Cancer is the most studied condition (872 trials), followed by Prostate Cancer (500 trials).
Cancer trials are mostly open to adults and seniors: adult entries are the most common (28262), followed by senior entries (25275), with child entries far behind (4401).
Cancer trials are relatively evenly distributed across genders, except for conditions such as breast cancer and prostate cancer, which one sex is far more likely to develop.

