# Power Up Research Software Development with GitHub Copilot


### 1.0 Data loading

#### 1.1 Load the cleaned CSV file

In [fm-ad-notebook-processing.ipynb](fm-ad-notebook-processing.ipynb), we took what we learned from [fm-ad-notebook-exploration.ipynb](fm-ad-notebook-exploration.ipynb) and applied several data processing steps to clean our dataset. Now we are finally ready to analyze the dataset and create some visualizations.

First, let's import our cleaned dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('combined_data_cleaned.csv')

#### 1.2 Set output display

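To view and analyze the dataset effectively, we need to configure pandas to display all columns and all rows of the dataframe. Passing None removes the limit entirely; a large number such as 1000 is an alternative if you want to cap the output.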

In [None]:
pd.set_option("display.max_columns", None)  # or 1000
pd.set_option("display.max_rows", None)  # or 1000

Before we proceed to the next section, let's run a few sanity checks on our cleaned dataset. Let's look at the first few records, the dimensions, and the column information of the dataframe. Create prompts below to do so.

In [None]:
# show first few records

When you display the shape of the dataframe, you should see (18003, 24).

In [None]:
# show df shape

In [None]:
# show df columns

In [None]:
# show df column and data types
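Copilot's suggestions will vary from run to run, but the four prompts above typically resolve to standard pandas calls. Here is a minimal sketch of what to expect, using a tiny made-up dataframe in place of combined_data_cleaned.csv (the rows below are illustrative assumptions, not real records):

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (made-up values); in the notebook,
# df comes from pd.read_csv('combined_data_cleaned.csv') instead.
df = pd.DataFrame({
    "cases.disease_type": ["Type A", "Type B", "Type B"],
    "demographic.gender": ["female", "male", "female"],
    "diagnoses.age_at_diagnosis_years": [61.0, 45.0, 70.0],
})

print(df.head())     # first few records (5 rows by default)
print(df.shape)      # tuple of (number of rows, number of columns)
print(df.columns)    # column labels
print(df.dtypes)     # data type of each column
df.info()            # columns, non-null counts, and dtypes in one summary
```

On the real dataset, df.shape is the call that should report the dimensions mentioned above.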

### 4.0 Data analysis and visualization

#### 4.1 Data visualization

##### Distribution of disease types

Understanding the distribution of disease types helps identify the most common and rare cancers in the dataset, which is crucial for allocating resources and prioritizing research.

In [None]:
# create a pie chart of top 10 cases.disease_type from df
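For comparison with whatever Copilot generates, one possible shape of the result is sketched below. The helper name plot_top_disease_types and the toy data are assumptions made for illustration; only the column name cases.disease_type comes from our dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_top_disease_types(df, n=10):
    """Draw a pie chart of the n most frequent values in cases.disease_type."""
    # value_counts() sorts in descending order, so head(n) keeps the top n
    counts = df["cases.disease_type"].value_counts().head(n)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.pie(counts.values, labels=list(counts.index),
           autopct="%1.1f%%", startangle=90)
    ax.set_title(f"Top {len(counts)} disease types")
    return counts, ax

# Toy data (made up) standing in for the cleaned dataframe
toy = pd.DataFrame({"cases.disease_type": ["Type A"] * 5 + ["Type B"] * 3 + ["Type C"] * 2})
counts, ax = plot_top_disease_types(toy)
```

Keep in mind that the percentages in such a chart are relative to the top 10 categories only, not to the whole dataset.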

##### Gender demographic

Let's take a look at how the data is distributed with respect to gender.

In [None]:
# show the distribution of the column demographic.gender in bar chart
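A generated cell for this prompt might look roughly like the sketch below. It assumes that missing gender values are stored as NaN, which may not match how the cleaned file encodes them, and it uses made-up rows rather than the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (made up); one record has no gender recorded
toy = pd.DataFrame({
    "demographic.gender": ["female", "female", "male", None, "female", "male"],
})

# Label missing values explicitly so they get their own bar
gender = toy["demographic.gender"].fillna("missing")
counts = gender.value_counts()

fig, ax = plt.subplots()
counts.plot(kind="bar", rot=0, ax=ax)
ax.set_xlabel("Gender")
ax.set_ylabel("Number of samples")
ax.set_title("Distribution of demographic.gender")
```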

As the chart above shows, gender information was available for all but 9 samples, and the dataset contains slightly more female than male patients.

According to the [study](https://aacrjournals.org/cancerres/article/77/9/2464/625134/High-Throughput-Genomic-Profiling-of-Adult-Solid), this bias can be explained in part by the large number of breast and gynecological (GYN) cancer samples within the dataset, since breast cancers occur predominantly, and gynecological cancers exclusively, in females. Let's check visually whether that is the case.

In [None]:
# show the relationship between cases.primary_site and demographic.gender
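One way to show this relationship is a cross-tabulation drawn as a stacked bar chart. The sketch below is only one of several reasonable answers Copilot could give, and the rows are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (made up) standing in for the cleaned dataframe
toy = pd.DataFrame({
    "cases.primary_site": ["Breast", "Breast", "Breast", "Lung",
                           "Lung", "Ovary", "Colon", "Colon"],
    "demographic.gender": ["female", "female", "female", "male",
                           "female", "female", "male", "female"],
})

# Rows are primary sites, columns are genders, cells are sample counts
table = pd.crosstab(toy["cases.primary_site"], toy["demographic.gender"])

# Order the sites by total number of samples, largest first
order = table.sum(axis=1).sort_values(ascending=False).index
table = table.loc[order]

fig, ax = plt.subplots(figsize=(8, 5))
table.plot(kind="bar", stacked=True, ax=ax)
ax.set_xlabel("Primary site")
ax.set_ylabel("Number of samples")
ax.set_title("Primary site by gender")
```

With the real data, sites dominated by a single gender stand out as bars with one color.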

A similar analysis is to look at the relationship between the disease type and the gender of the patient.

Identifying gender differences in disease prevalence can highlight gender-specific vulnerabilities or protective factors, influencing personalized treatment approaches.

In [None]:
# visualize the relationship between cases.disease_type and demographic.gender
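Because disease types differ widely in sample count, it can help to plot the share of each gender within a disease type rather than raw counts. The following sketch takes that approach as a design choice of this example, not necessarily what Copilot will propose; the rows and the type names are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (made up) standing in for the cleaned dataframe
toy = pd.DataFrame({
    "cases.disease_type": ["Type A", "Type A", "Type A", "Type A",
                           "Type B", "Type B"],
    "demographic.gender": ["female", "female", "female", "male",
                           "female", "male"],
})

# normalize="index" turns each row into proportions that sum to 1
share = pd.crosstab(toy["cases.disease_type"], toy["demographic.gender"],
                    normalize="index")

fig, ax = plt.subplots(figsize=(8, 4))
share.plot(kind="barh", stacked=True, ax=ax)
ax.set_xlabel("Share of samples")
ax.set_ylabel("Disease type")
ax.set_title("Gender share within each disease type")
```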

##### Age distribution

The study "High-Throughput Genomic Profiling of Adult Solid Tumors" used patient samples that were collected as part of routine clinical care and submitted to Foundation Medicine for genomic profiling. The data therefore do not come from a random sample of patients.

That being said, let's see how close to a normal distribution the dataset is with respect to age.

In [None]:
# show distribution of diagnoses.age_at_diagnosis_years
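A histogram is the usual answer to this prompt. The sketch below also marks the mean and median and computes the sample skewness, which gives a rough sense of how symmetric the distribution is. The ages are randomly generated for illustration only and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic ages (assumption for illustration), clipped to an adult range
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(60, 12, 500).clip(18, 90),
                 name="diagnoses.age_at_diagnosis_years")

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(ages, bins=30, edgecolor="black")
ax.axvline(ages.mean(), color="red", linestyle="--", label="mean")
ax.axvline(ages.median(), color="green", linestyle=":", label="median")
ax.set_xlabel("Age at diagnosis (years)")
ax.set_ylabel("Number of samples")
ax.legend()

# Skewness near 0 suggests a roughly symmetric distribution
print("skewness:", ages.skew())
```

If the mean and median lines sit close together and the skewness is near zero, the distribution is at least roughly symmetric, although that alone does not establish normality.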

What is the relationship between age at diagnosis and disease type?

This question helps determine if certain cancers are more likely to occur at specific ages, which can inform targeted awareness and early detection efforts in particular demographics.
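There is no prompt cell for this question yet, so here is a possible starting point: a table of median age per disease type plus a box plot. The type names and ages below are made up, and on the real data you may want to restrict the plot to the most common disease types to keep it readable.

```python
import pandas as pd
import matplotlib.pyplot as plt

col_type = "cases.disease_type"
col_age = "diagnoses.age_at_diagnosis_years"

# Toy data (made up) standing in for the cleaned dataframe
toy = pd.DataFrame({
    col_type: ["Type A"] * 3 + ["Type B"] * 3,
    col_age: [30.0, 35.0, 40.0, 60.0, 65.0, 70.0],
})

# Median age at diagnosis per disease type, youngest first
median_age = toy.groupby(col_type)[col_age].median().sort_values()
print(median_age)

# One box per disease type
fig, ax = plt.subplots(figsize=(8, 5))
toy.boxplot(column=col_age, by=col_type, ax=ax, rot=45)
ax.set_title("Age at diagnosis by disease type")
ax.set_ylabel("Age at diagnosis (years)")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
```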

Is there a relationship between the primary diagnosis and the sample type?

This question is important to understand if certain diagnoses are more likely to be made from specific types of samples, affecting diagnostic strategies and the feasibility of certain tests.
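This question can be explored with a cross-tabulation shown as a heatmap. Note that the two column names used below, diagnoses.primary_diagnosis and samples.sample_type, are assumptions; check the output of df.columns for the actual names in the cleaned dataset. The rows are made up as well.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical column names; replace them with the real ones from df.columns
DIAGNOSIS_COL = "diagnoses.primary_diagnosis"
SAMPLE_TYPE_COL = "samples.sample_type"

# Toy data (made up) standing in for the cleaned dataframe
toy = pd.DataFrame({
    DIAGNOSIS_COL: ["Diagnosis X", "Diagnosis X", "Diagnosis X",
                    "Diagnosis Y", "Diagnosis Y"],
    SAMPLE_TYPE_COL: ["Sample type 1", "Sample type 1", "Sample type 2",
                      "Sample type 2", "Sample type 2"],
})

# Rows are diagnoses, columns are sample types, cells are sample counts
table = pd.crosstab(toy[DIAGNOSIS_COL], toy[SAMPLE_TYPE_COL])

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(table.values, cmap="Blues")
ax.set_xticks(range(table.shape[1]))
ax.set_xticklabels(list(table.columns), rotation=45, ha="right")
ax.set_yticks(range(table.shape[0]))
ax.set_yticklabels(list(table.index))
fig.colorbar(im, ax=ax, label="Number of samples")
ax.set_title("Primary diagnosis by sample type")
```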

#### 4.X Additional analysis

Now let's share the columns in our dataset with GitHub Copilot Chat and ask which visualizations and correlations it thinks we can create from them.
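A convenient way to hand the columns to the chat is to print them together with their data types and paste the text into the conversation. The snippet below is a small sketch of that idea, again using a made-up stand-in for the real dataframe.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (made-up values)
df = pd.DataFrame({
    "cases.disease_type": ["Type A", "Type B"],
    "demographic.gender": ["female", "male"],
    "diagnoses.age_at_diagnosis_years": [61.0, 45.0],
})

# One line per column: name followed by its data type
summary = df.dtypes.astype(str).to_string()
print(summary)
```

Including the data types helps the chat distinguish categorical columns from numeric ones when it suggests chart types.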