# Power Up Research Software Development with GitHub Copilot


In this notebook, you will be pulling data from the [Registry of Open Data on AWS](https://registry.opendata.aws/). The Registry of Open Data on AWS (RODA) makes it easy for people to find datasets that are publicly available through AWS.

You will also be using various features of GitHub Copilot to help you with your data exploration and data analysis processes.


For this workshop, you will be analyzing the Foundation Medicine Adult Cancer Clinical Dataset.

- [Link to instructions on how to access the dataset via AWS.](https://aws.amazon.com/marketplace/pp/prodview-suzlfg5oc67uy?sr=0-120&ref_=beagle&applicationId=AWSMPContessa)

- [Link to the dataset's documentation.](https://gdc.cancer.gov/about-gdc/contributed-genomic-data-cancer-research/foundation-medicine/foundation-medicine)



Taken from the [dataset's documentation](https://gdc.cancer.gov/about-gdc/contributed-genomic-data-cancer-research/foundation-medicine):
> The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI).
> Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.

> The dataset is described in the accompanying publication: Hartmaier R.J. et al, “High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis”, Cancer Res. 2017 May 1;77(9):2464-2475 http://cancerres.aacrjournals.org/content/77/9/2464.long

*You are not expected to read the accompanying publication for this workshop. The notebook will help guide you in understanding the dataset.*

### 0. Prerequisites
The analysis in this notebook relies on the following packages, which are listed in the `requirements.txt` file and should already be installed in your virtual environment (venv):

- [NumPy](https://numpy.org/): Fundamental package for numerical computing with support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.
- [pandas](https://pandas.pydata.org/): Provides high-performance, easy-to-use data structures and data analysis tools for working with structured data, such as data frames.
- [matplotlib](https://matplotlib.org/): Comprehensive library for creating static, animated, and interactive visualizations in Python. It's often used for creating plots, charts, and graphs.
- [Seaborn](https://seaborn.pydata.org/): Built on top of matplotlib, Seaborn provides a high-level interface for creating attractive and informative statistical graphics.
- [scikit-learn](https://scikit-learn.org/): Simple and efficient tools for data mining and data analysis. It includes a wide range of machine learning algorithms for classification, regression, clustering, and more.
- [awscli](https://aws.amazon.com/cli/): The AWS Command Line Interface (AWS CLI) is a tool for managing AWS services and resources via the command line.
- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html): AWS SDK for Python (Boto3) to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
- [ipykernel](https://pypi.org/project/ipykernel/): A package that provides the IPython kernel for Jupyter. It allows Jupyter notebooks to execute Python code in an interactive and modular way.


Let's double check to see if we have these packages installed.

In [None]:
import importlib.util
import sys

def check_packages(packages):
    for package in packages:
        if importlib.util.find_spec(package) is None:
            print(f"{package} not found.")
        else:
            print(f"{package} is already installed.")

# List of packages to check, by import name (scikit-learn is imported as sklearn)
packages = ['numpy', 'pandas', 'matplotlib', 'seaborn', 'sklearn', 'awscli', 'boto3', 'ipykernel']


# Call the function
check_packages(packages)
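
The check above only confirms that each package can be imported. If you also want to know which version is installed, a minimal sketch using the standard library's `importlib.metadata` is shown below. The helper name `installed_versions` is hypothetical, and the names passed in are distribution (pip) names, which can differ from import names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names, not import names: scikit-learn is imported as sklearn
for name, ver in installed_versions(['numpy', 'pandas', 'scikit-learn', 'boto3']).items():
    print(f"{name}: {ver or 'not installed'}")
```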

### 1.0 Data loading

#### 1.1 View S3 bucket content

We will start by listing all the files in the specified S3 bucket. This helps us understand the structure of the data and identify the `.tsv` files that we need to combine. The `--no-sign-request` option is used because the bucket is publicly accessible and does not require authentication.

In [None]:
!aws s3 ls --no-sign-request s3://gdc-fm-ad-phs001179-2-open/

Feel free to explore the folders within the S3 bucket by appending some of the folder names you see above.

In [None]:
!aws s3 ls --no-sign-request s3://gdc-fm-ad-phs001179-2-open/2bec6dfb-5acd-4174-bc50-a00c567d8f33/

#### 1.2 Load the dataset

Upon inspecting the S3 bucket, you'll notice that it contains numerous folders, many of which include a `.tsv` file. To streamline the data loading process, we have provided code that reads every `.tsv` file, concatenates them into a single dataframe, and saves the result as a `.csv` file for easy reference in the subsequent notebooks.

In [None]:
import boto3
import pandas as pd
import numpy as np
from botocore import UNSIGNED
from botocore.config import Config
from io import StringIO

# Create an anonymous (unsigned) S3 client; the bucket is public
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Name of the public FM-AD bucket
bucket_name = 'gdc-fm-ad-phs001179-2-open'

# list_objects_v2 returns at most 1,000 keys per call,
# so use a paginator to walk through every page of results
paginator = s3.get_paginator('list_objects_v2')

# Initialize an empty list to hold the dataframes
dataframes = []

# Loop through each page, then through each object on that page
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get('Contents', []):
        key = obj['Key']

        # Check if the key represents a TSV file
        if key.endswith('.tsv'):
            # Fetch the file
            file_obj = s3.get_object(Bucket=bucket_name, Key=key)
            # Read the file content as string
            file_content = file_obj['Body'].read().decode('utf-8')
            # Convert the string to a DataFrame
            df = pd.read_csv(StringIO(file_content), sep='\t')
            # Append the dataframe to the list
            dataframes.append(df)

# Concatenate all the dataframes in the list into a single dataframe
df = pd.concat(dataframes, ignore_index=True)

# Save dataframe to a CSV file
df.to_csv('combined_data.csv', index=False)
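
When the frames passed to `pd.concat` do not all share the same columns, pandas keeps the union of the columns and fills the gaps with `NaN`. The toy sketch below illustrates that behavior. It assumes, without having verified it against the bucket, that the `.tsv` files differ in their columns; the column names and values are made up.

```python
import pandas as pd

# Two toy frames standing in for TSV files with different columns (made-up values)
clinical = pd.DataFrame({'case_id': ['a', 'b'], 'gender': ['female', 'male']})
sample = pd.DataFrame({'case_id': ['a', 'b'], 'sample_type': ['Primary', 'Metastatic']})

# Rows are stacked; columns missing from one frame are filled with NaN
combined = pd.concat([clinical, sample], ignore_index=True)
print(combined)
```

If the real files behave this way, the combined dataframe would contain several partially filled rows per `case_id`, which is worth keeping in mind during the exploration steps.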

#### 1.3 Set output display

To effectively view and analyze the dataset, we need to configure pandas to display all columns and all rows of the dataframe.

In [None]:
pd.set_option("display.max_columns", None)  # or 1000
pd.set_option("display.max_rows", None)  # or 1000
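
Setting `display.max_rows` to `None` applies to every output for the rest of the session, so printing a large dataframe will print every row. If that becomes unwieldy, one alternative is to change the option only temporarily with `pd.option_context`. This is a minimal sketch on toy data, not a required step of the workshop.

```python
import pandas as pd

toy = pd.DataFrame({'value': range(100)})  # toy data, not the workshop dataset

before = pd.get_option('display.max_rows')

# Inside the block at most 10 rows are shown; the previous setting is restored on exit
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')
    print(toy)

after = pd.get_option('display.max_rows')
```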

### 2.0 Data exploration

#### 2.1 Inspect the Data

Let's take an initial look at the data to understand its structure and contents. We will display the first few rows of the dataframe.

In [None]:
# show first few records of the dataframe

#### 2.2 Data overview

We will now get a summary of the dataframe, which includes the number of rows and columns, column names, and the data types of each column.

In [None]:
# get an overview of the dataframe

Look at the number of rows and columns for the dataset.

In [None]:
# show the dataframe's dimensions

We will now display the columns and their data types. This is important to ensure that the data types are appropriate for the analyses we plan to perform.

In [None]:
# show the columns and their data types
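
If you want something to compare GitHub Copilot's suggestions against, the sketch below shows the pandas calls commonly used for the steps in 2.1 and 2.2. It runs on a small toy frame with hypothetical columns; in the notebook you would call the same methods on `df`. Copilot may suggest different but equally valid code.

```python
import pandas as pd

# Toy frame with made-up columns and values; the workshop dataframe is `df`
toy = pd.DataFrame({
    'case_id': ['a', 'b', 'c'],
    'age_at_diagnosis': [61.0, None, 47.0],
    'primary_site': ['Lung', 'Breast', 'Unknown'],
})

print(toy.head())   # first rows (up to 5 by default)
toy.info()          # row count, column names, non-null counts, and dtypes
print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of each column
```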

#### 2.3 Descriptive statistics

This step involves generating descriptive statistics of the dataframe. Descriptive statistics provide insights into the central tendency, dispersion, and shape of the dataset’s distribution.

In [None]:
# show descriptive statistics of the dataframe
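
As a point of comparison for this step, `DataFrame.describe()` summarizes numeric columns by default, and `include='all'` adds count, unique, top, and freq for non-numeric columns. The sketch below uses a toy frame with hypothetical columns rather than the workshop data.

```python
import pandas as pd

# Toy frame with made-up columns and values; the workshop dataframe is `df`
toy = pd.DataFrame({
    'case_id': ['a', 'b', 'c'],
    'age_at_diagnosis': [61.0, None, 47.0],
    'primary_site': ['Lung', 'Breast', 'Lung'],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include='all')  # numeric and non-numeric columns

print(numeric_summary)
print(full_summary)
```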

#### 2.4 Missing values

Here, we will identify the number of missing values in each column. This is essential for understanding the completeness of the dataset and for planning data cleaning steps.

In [None]:
# show the number of missing values in each column in descending order
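
One common way to express this step is to chain `isna()`, `sum()`, and `sort_values()`. The sketch below demonstrates the pattern on a toy frame with made-up columns; Copilot may propose an equivalent variant such as `isnull()`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up columns; the workshop dataframe is `df`
toy = pd.DataFrame({
    'complete': [1, 2, 3],
    'mostly_missing': [np.nan, np.nan, 'x'],
    'one_missing': [1.5, np.nan, 2.5],
})

# isna() marks missing cells, sum() counts them per column
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```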

#### 2.5 'Unknown' values

In addition to true missing values, some columns use the string 'Unknown' to represent missing data. Let's take a look at which columns contain 'Unknown' values.

In [None]:
# show which columns have the value 'Unknown' in them and show how many each column has in descending order
unknown_values = df.isin(['Unknown']).sum().sort_values(ascending=False)
unknown_values[unknown_values > 0]

#### 2.6 Unique values

We will now count the number of unique values in each column. This helps in understanding the variability and potential categorical nature of the data.

In [None]:
# show the number of unique values in each column in descending order
unique_values = df.nunique().sort_values(ascending=False)
unique_values

For columns with a relatively small number of unique values, we will display a sample of these values. This helps in understanding the categorical variables in the dataset.

In [None]:
# show up to 5 unique values for each column with fewer than 100 unique values
for col, n_unique in unique_values.items():
    if n_unique < 100:
        unique_vals = df[col].unique()
        print(f"{col}: {unique_vals[:5]}")

#### 2.7 Duplicate records

In this step, we will count the number of duplicate records in the dataframe. Duplicate records can arise due to various reasons such as data entry errors or merging datasets. Identifying and handling duplicates is important to ensure the integrity and accuracy of the analysis.

In [None]:
# show the number of duplicate records in the dataframe
n_duplicates = df.duplicated().sum()
n_duplicates

In 2.6, we observed that there are 18,004 unique values for the `case_id` column, but the dataset contains 72,016 records overall. If `case_id` were a unique identifier for each record, these two numbers would match, so some records must share the same `case_id`.

To investigate this, we will count the number of records that share the same `case_id`. This will help us identify potential duplicate records or multiple entries for the same case.

In [None]:
# count how many records share the same case_id
case_id_counts = df['case_id'].value_counts()
case_id_counts
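
Note that 72,016 / 18,004 = 4, which would be consistent with four records per case if the records are spread evenly (this is not confirmed by the counts alone). To check how the records are actually distributed, you can count the counts. The sketch below shows the idea on toy data; in the notebook you would apply it to `case_id_counts`.

```python
import pandas as pd

# Toy data: two cases with four records each and one case with a single record
toy = pd.DataFrame({'case_id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c']})

# Number of records for each case_id
toy_counts = toy['case_id'].value_counts()

# Number of case_ids that have a given number of records
records_per_case = toy_counts.value_counts()
print(records_per_case)
```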

Finally, we will inspect the records associated with a specific `case_id`. This allows us to examine the data for a particular case in detail, which can help in understanding how to potentially merge the data together.

In [None]:
# show the records with the case_id 40e57344-a8ad-4de4-92e4-6e681c0593b7
case_id = '40e57344-a8ad-4de4-92e4-6e681c0593b7'

df[df['case_id'] == case_id]
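
If the records for one `case_id` turn out to be partially filled rows that complement each other, one possible way to collapse them is `groupby('case_id').first()`, which keeps the first non-null value in each column. The sketch below is only an illustration on toy data with hypothetical columns, not the approach taken in the next notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: each case_id has two rows that fill in different columns (made-up values)
toy = pd.DataFrame({
    'case_id': ['a', 'a', 'b', 'b'],
    'gender': ['female', np.nan, 'male', np.nan],
    'sample_type': [np.nan, 'Primary', np.nan, 'Metastatic'],
})

# first() takes the first non-null value per column within each group
merged = toy.groupby('case_id', as_index=False).first()
print(merged)
```

One caveat of this design: if a column holds conflicting non-null values for the same `case_id`, `first()` silently keeps only one of them, so it is worth checking for conflicts before relying on it.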

This concludes the data exploration phase of our analysis. In the next notebook, [fm-ad-notebook-processing.ipynb](fm-ad-notebook-processing.ipynb), we will use the insights and information we've gathered to make informed decisions on how to clean the dataset.
