# Power Up Research Software Development with GitHub Copilot


In this notebook, you will be pulling data from the [Registry of Open Data on AWS](https://registry.opendata.aws/). The Registry of Open Data on AWS (RODA) makes it easy for people to find datasets that are publicly available through AWS.

You will also be using various features of GitHub Copilot to help you with your data exploration and data analysis processes.


For this workshop, you will be analyzing the Foundation Medicine Adult Cancer Clinical Dataset.

- [Link to instructions on how to access the dataset via AWS.](https://aws.amazon.com/marketplace/pp/prodview-suzlfg5oc67uy?sr=0-120&ref_=beagle&applicationId=AWSMPContessa)

- [Link to the dataset's documentation.](https://gdc.cancer.gov/about-gdc/contributed-genomic-data-cancer-research/foundation-medicine/foundation-medicine)



Taken from the [dataset's documentation](https://gdc.cancer.gov/about-gdc/contributed-genomic-data-cancer-research/foundation-medicine):
> The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI).
Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.

> The dataset is described in the accompanying publication: Hartmaier R.J. et al, “High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis”, Cancer Res. 2017 May 1;77(9):2464-2475 http://cancerres.aacrjournals.org/content/77/9/2464.long

*You are not expected to read the accompanying publication for this workshop. The notebook will help guide you in understanding the dataset.*

### 0. Prerequisites
The conda `environment.yml` file installs the following packages, which you will need for the analysis:

- [NumPy](https://numpy.org/): Fundamental package for numerical computing with support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.
- [pandas](https://pandas.pydata.org/): Provides high-performance, easy-to-use data structures and data analysis tools for working with structured data, such as data frames.
- [matplotlib](https://matplotlib.org/): Comprehensive library for creating static, animated, and interactive visualizations in Python. It's often used for creating plots, charts, and graphs.
- [Seaborn](https://seaborn.pydata.org/): Built on top of matplotlib, Seaborn provides a high-level interface for creating attractive and informative statistical graphics.
- [scikit-learn](https://scikit-learn.org/): Simple and efficient tools for data mining and data analysis. It includes a wide range of machine learning algorithms for classification, regression, clustering, and more.
- [awscli](https://aws.amazon.com/cli/): The AWS Command Line Interface (AWS CLI) is a tool for managing AWS services and resources via the command line.
- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html): AWS SDK for Python (Boto3) to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
- [ipykernel](https://pypi.org/project/ipykernel/): A package that provides the IPython kernel for Jupyter. It allows Jupyter notebooks to execute Python code in an interactive and modular way.


Let's double check to see if we have these packages installed.

In [None]:
import importlib.util
import sys

def check_packages(packages):
    for package in packages:
        if importlib.util.find_spec(package) is None:
            print(f"{package} not found.")
        else:
            print(f"{package} is already installed.")

# List of packages to check (import names, so scikit-learn is listed as 'sklearn')
packages = ['numpy', 'pandas', 'matplotlib', 'seaborn', 'sklearn', 'awscli', 'boto3', 'ipykernel']


# Call the function
check_packages(packages)

### 1.0 Set-up

#### 1.1 View S3 bucket content

In [None]:
!aws s3 ls --no-sign-request s3://gdc-fm-ad-phs001179-2-open/

Feel free to explore the folders within the S3 bucket.

In [None]:
!aws s3 ls --no-sign-request s3://gdc-fm-ad-phs001179-2-open/2bec6dfb-5acd-4174-bc50-a00c567d8f33/

#### 1.2 Prepare the dataset for exploration and analysis.
Currently, the dataset is distributed into many .tsv files. We want to combine those files into one object.

In [None]:
import boto3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config
from io import StringIO

# Create an S3 client that sends unsigned (anonymous) requests, since the bucket is public
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Replace with your bucket name
bucket_name = 'gdc-fm-ad-phs001179-2-open'

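# Get a list of all objects in the bucket. A single list_objects_v2 call
# returns at most 1,000 keys, so use a paginator to collect every page.
paginator = s3.get_paginator('list_objects_v2')
objects = [
    obj
    for page in paginator.paginate(Bucket=bucket_name)
    for obj in page.get('Contents', [])
]

# Initialize an empty list to hold the dataframes
dataframes = []

# Loop through each object in the bucket
for obj in objects: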
    key = obj['Key']
    
    # Check if the key represents a TSV file
    if key.endswith('.tsv'):
        # Fetch the file
        file_obj = s3.get_object(Bucket=bucket_name, Key=key)
        # Read the file content as string
        file_content = file_obj['Body'].read().decode('utf-8')
        # Convert the string to a DataFrame
        df = pd.read_csv(StringIO(file_content), sep='\t')
        # Append the dataframe to the list
        dataframes.append(df)

# Concatenate all the dataframes in the list into a single dataframe
combined_df = pd.concat(dataframes, ignore_index=True)

### 2.0 Data analysis

#### 2.1 Data exploration

In [None]:
# set the display to show all columns and all rows (use a number such as 1000 to cap the output)

pd.set_option("display.max_columns", None)  # or 1000
pd.set_option("display.max_rows", None)  # or 1000

In [None]:
# show first few records of the dataframe

In [None]:
# show the dataframe's dimensions

In [None]:
# show descriptive statistics of the dataframe

In [None]:
# show the columns and their data types

In [None]:
# show the number of missing values in each column in descending order

In addition to missing values, there are columns with 'Unknown' values in string format to represent missing values.

Let's take a look at which columns have 'Unknown' values.
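
As a point of reference, here is a minimal sketch of one way to count 'Unknown' values per column. It runs on a small made-up dataframe (`toy_df` and its values are hypothetical, not taken from the dataset), so the code Copilot generates for `combined_df` may look different.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy_df = pd.DataFrame({
    "case_id": ["a", "b", "c"],
    "gender": ["female", "Unknown", "Unknown"],
    "race": ["Unknown", "white", "asian"],
})

# Compare every cell to the string 'Unknown', then count the matches per column
unknown_counts = (toy_df == "Unknown").sum()

# Keep only columns with at least one 'Unknown', largest count first
unknown_counts = unknown_counts[unknown_counts > 0].sort_values(ascending=False)
print(unknown_counts)
```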

In [None]:
# show which columns have the value 'Unknown' in them and show how many each column has in descending order

In [None]:
# show the number of unique values in each column in descending order

In [None]:
# show 5 unique values of columns with unique values less than 100

In [None]:
# count how many records share the same case_id

In [None]:
# show the records with the case_id 40e57344-a8ad-4de4-92e4-6e681c0593b7

#### 2.2 Data processing

Before we handle duplicate records, let's normalize the notation for missing values first.

Currently, missing values are recorded either as the string 'Unknown' or as NaN. Let's convert them all to NaN for uniformity.
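
A minimal sketch of this normalization on a toy dataframe is shown below. The names `toy_df` and `toy_clean` and the values are made up for illustration; the assumption is that the placeholder is always the exact string 'Unknown'.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy_df = pd.DataFrame({
    "case_id": ["a", "b"],
    "gender": ["Unknown", "male"],
    "race": ["white", None],
})

# replace() returns a new dataframe; toy_df itself is left unchanged
toy_clean = toy_df.replace("Unknown", np.nan)

# Both kinds of missing value are now counted by isna()
print(toy_clean.isna().sum())
```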

In [None]:
# change values 'Unknown' to NaN in the dataframe using numpy and create a new dataframe

In [None]:
# check if any columns still have the value 'Unknown' in them and show how many each column has in descending order

##### 2.2.1 Handling duplicate records

Now we can start checking to see if there are duplicate records.
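
For orientation, here is a small sketch of counting and dropping exact duplicate rows with pandas. The dataframe is a made-up example, and "duplicate" here means that every column matches.

```python
import pandas as pd

# Toy data (hypothetical values): the first two rows are identical
toy_df = pd.DataFrame({
    "case_id": ["a", "a", "b"],
    "gender": ["female", "female", "male"],
})

# duplicated() flags each row that exactly repeats an earlier row
n_duplicates = toy_df.duplicated().sum()

# drop_duplicates() keeps the first occurrence and returns a new dataframe
toy_dedup = toy_df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, toy_dedup.shape)
```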

In [None]:
# show the number of duplicate records in the dataframe

In [None]:
# drop duplicate records in the dataframe and create a new dataframe

In [None]:
# show dataframe shape

For the next step, refer to the table from 2.1 that lists the number of unique values in each column.

Create a prompt to list the columns with more than 18000 unique values not including the age column.

In [None]:
# List the name of columns that have more than 18000 unique values


Many of the columns seem to serve as unique identifiers (UIDs). You only need one UID, so let's drop all of them except case_id.
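
The idea can be sketched on a toy dataframe as follows. The column names are illustrative assumptions, and the substring test `"id" in col` is deliberately simple: it would also match a column whose name merely contains those letters, so check the resulting list before dropping anything.

```python
import pandas as pd

# Toy data with illustrative (assumed) column names
toy_df = pd.DataFrame({
    "case_id": ["a", "b"],
    "diagnoses.diagnosis_id": ["d1", "d2"],
    "demographic.demographic_id": ["m1", "m2"],
    "gender": ["female", "male"],
})

# Collect every column with 'id' in its name, keeping case_id as the single UID
id_columns = [col for col in toy_df.columns if "id" in col and col != "case_id"]

# drop() returns a new dataframe without those columns
toy_reduced = toy_df.drop(columns=id_columns)
print(list(toy_reduced.columns))
```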

In [None]:
# List the names of the columns that have 'id' in their name, except for the column case_id

In [None]:
# drop the columns specified in id_columns from the dataframe and create a new dataframe

In [None]:
# show the shape of the new dataframe

In [None]:
# show the columns in the dataframe

In [None]:
# show columns that have more than 18000 unique values

You can remove the columns above (except for the case_id column) if you think it'll make your analysis easier.

In [None]:
# show the top 5 unique values of the columns that have more than 18000 unique values

In 2.1, you saw that there were records that shared the same case_id.

Let's check whether any other records share a case_id.


In [None]:
# count how many records share the same case_id

In [None]:
# show a bar graph with the x axis as the number of records per case_id and the y axis as the number of case_ids

Let's take a look at the instance where a case_id is shared between records.

Create a prompt below to generate code that shows the records sharing a case_id other than the one from 2.1.

In [None]:
# show the records with the case_id aff95088-8760-46d2-a404-b545807e0735

Based on the output above, it looks like these records complement each other: where one record has a value in a column, the other has a null value, and vice versa. This is most likely due to how the data was gathered.

Let's try to prove if this is the case for all records that share a case_id.
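
One possible way to express "complementary" in code is sketched below on two made-up records. It uses a strict definition, assumed here for illustration: every column other than case_id is filled by exactly one of the records. On the real data some columns may be filled in both records, so you may need to relax the condition.

```python
import pandas as pd

# Two toy records sharing a case_id (hypothetical values)
toy_df = pd.DataFrame({
    "case_id": ["x", "x"],
    "gender": ["female", None],
    "primary_site": [None, "Lung"],
})

# Select the records for one case_id and set the shared identifier aside
pair = toy_df[toy_df["case_id"] == "x"].drop(columns="case_id")

# Number of records that hold a value in each column
filled_per_column = pair.notna().sum()

# Complementary (strict sense): each column is filled by exactly one record
is_complementary = bool((filled_per_column == 1).all())
print(is_complementary)
```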

Prompt for GitHub Copilot Chat:

Write Python code that verifies whether the records with the case_id aff95088-8760-46d2-a404-b545807e0735 complement each other in terms of missing values after the first four columns. The code should output whether these records, when combined, fill in each other's missing values like a puzzle.

In [None]:
# Given two records with the case_id 'aff95088-8760-46d2-a404-b545807e0735', verify if combining them fills in each other's NaN values. Specifically, check if one record has missing values in certain columns that are filled in the other record.

In [None]:
# repeat the check above for every case_id shared by more than one record in the dataframe

In [None]:
# Group by 'case_id' and take the first non-null value for each group
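
As a hedged illustration of this step, `GroupBy.first()` in pandas skips null values by default, so it returns the first non-null value of each column within a group. The toy dataframe below is made up; the same call on the real dataframe is one option, not necessarily what Copilot will suggest.

```python
import pandas as pd

# Toy data: case 'x' is split across two complementary records (hypothetical values)
toy_df = pd.DataFrame({
    "case_id": ["x", "x", "y"],
    "gender": ["female", None, "male"],
    "primary_site": [None, "Lung", "Breast"],
})

# first() takes the first non-null value per column within each case_id group
toy_merged = toy_df.groupby("case_id", as_index=False).first()
print(toy_merged)
```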

Let's take a look at the shape and the first few records of our dataframes. 

In [None]:
# show number of unique values in each column in descending order

Let's check if there are still any empty values.

In [None]:
# check to see if there are any null values in the dataframe

In [None]:
# show the number of unique values of the columns that have null values

##### 2.2.2 Normalizing age column

Let's start by normalizing the age column. As you will see, the age column does not represent age in years.

In [None]:
# describe stats on diagnoses.age_at_diagnosis column

According to the publication associated with this dataset, the youngest participant is 19 years old.

Let's do some basic math to calculate our normalization factor.

In [None]:
6947/19

In [None]:
32493/19
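
The arithmetic can be spelled out as below. The numbers 6947 and 32493 come from the cells above; treating them as the minimum and maximum of the age column is an assumption made for this sketch. Dividing the smallest value by the youngest age gives roughly the number of days in a year, which suggests that the column stores age in days.

```python
# Values from the cells above; their roles as minimum and maximum are assumed
min_value = 6947
max_value = 32493
youngest_age_years = 19  # youngest participant, per the publication

# Units per year of age: close to 365, so the column is most likely in days
days_per_year = min_value / youngest_age_years
print(round(days_per_year, 1))  # 365.6

# Under that reading, the largest value corresponds to about 89 years
max_age_years = max_value / 365
print(round(max_age_years, 1))  # 89.0
```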

In [None]:
# create a new dataframe, create a new column 'diagnoses.age_at_diagnosis_years' by dividing 'diagnoses.age_at_diagnosis' by 365, and drop the 'diagnoses.age_at_diagnosis' column

In [None]:
# count how many records have a 'diagnoses.age_at_diagnosis_years' value greater than or equal to 89

In [None]:
# drop the records with 'diagnoses.age_at_diagnosis_years' greater than or equal to 89

In [None]:
# round down the age column and convert to integer
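
Put together, the age steps above might look like the following sketch on a toy dataframe. The three input values are illustrative, `toy_df` and `toy_years` are hypothetical names, and the factor of 365 days per year is the assumption worked out earlier in this section.

```python
import numpy as np
import pandas as pd

days_col = "diagnoses.age_at_diagnosis"
years_col = "diagnoses.age_at_diagnosis_years"

# Toy data: ages in days (illustrative values)
toy_df = pd.DataFrame({days_col: [6947, 20000, 32493]})

# Convert days to years in a new column and drop the original column
toy_years = toy_df.assign(**{years_col: toy_df[days_col] / 365}).drop(columns=days_col)

# Drop records aged 89 or older; copy() so the next assignment is unambiguous
toy_years = toy_years[toy_years[years_col] < 89].copy()

# Round down to whole years and store as integers
toy_years[years_col] = np.floor(toy_years[years_col]).astype(int)
print(toy_years)
```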

In [None]:
# show statistical summary of the age column

#### 2.3 Data visualization

In [None]:
# create a bar graph of the age column

Now let's share the columns in our dataset with GitHub Copilot Chat and ask which visualizations and correlations we can create from them.
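
One way to produce a compact description that can be pasted into the chat is sketched below. The toy dataframe and its column names are made up, and the JSON layout (column name mapped to dtype string) is just one reasonable choice.

```python
import json

import pandas as pd

# Toy data with hypothetical column names
toy_df = pd.DataFrame({"case_id": ["a"], "age_years": [54]})

# Map each column name to its dtype as a string so that it can be serialized
schema = {col: str(dtype) for col, dtype in toy_df.dtypes.items()}

# Pretty-print as JSON, ready to paste into a chat prompt
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```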

In [None]:
# list columns of the dataframe and datatype in json format

#### 2.4 Additional analysis