In [None]:
# 🧬 COSMIC68 Cancer Mutation Analysis using Athena + PyAthena

This notebook analyzes somatic mutations in the COSMIC68 dataset using Amazon Athena. It explores which cancer types appear most often among the sampled mutation records.

**Tools Used**:
- Amazon Athena + AWS S3
- PyAthena + Pandas
- Matplotlib
- COSMIC68 (hg19) Dataset


## Connect to Athena

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
from pyathena import connect

# Athena connection
conn = connect(
    s3_staging_dir="s3://athena-output-351869726285/",
    region_name="us-east-1",
    encryption_option="SSE_S3"
)


## Query COSMIC Dataset

In [10]:
query = '''
SELECT cosmic_info
FROM "1000_genomes".hg19_cosmic68
LIMIT 1000
'''
cosmic_df = pd.read_sql(query, conn)
cosmic_df.head()



Unnamed: 0,cosmic_info
0,ID=COSM917032;OCCURENCE=1(endometrium)
1,ID=COSM1262222;OCCURENCE=1(oesophagus)
2,"ID=COSM1651216,COSM917097;OCCURENCE=1(endometr..."
3,ID=COSM917132;OCCURENCE=1(endometrium)
4,ID=COSM268400;OCCURENCE=1(large_intestine)
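Each `cosmic_info` value shown above packs two `KEY=VALUE` fields separated by `;`: a comma-separated list of COSMIC IDs and an `OCCURENCE` entry. A minimal sketch of splitting one such string is given here; the helper name `parse_cosmic_info` is hypothetical, and it assumes every value follows the layout seen in these five rows.

```python
def parse_cosmic_info(info: str) -> dict:
    """Split one cosmic_info string into its ID list and raw OCCURENCE text."""
    # Fields are separated by ';' and each field has the form 'KEY=VALUE'.
    fields = dict(part.split("=", 1) for part in info.split(";") if "=" in part)
    # The ID field may hold several comma-separated COSMIC identifiers.
    ids = fields["ID"].split(",") if "ID" in fields else []
    return {"ids": ids, "occurrence": fields.get("OCCURENCE", "")}


# Example using a row taken from the output above.
parsed = parse_cosmic_info("ID=COSM268400;OCCURENCE=1(large_intestine)")
print(parsed)
```

Keeping the IDs as a list makes it possible to count distinct COSMIC entries per row later, if that turns out to be useful.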


In [13]:
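# Extract the cancer type from the first "<count>(<type>)" pair after OCCURENCE=
# ("OCCURENCE" is the spelling used in the data itself).
# expand=False returns a Series, which suits a single-column assignment.
cosmic_df['cancer_type'] = cosmic_df['cosmic_info'].str.extract(
    r'OCCURENCE=\d+\((.*?)\)', expand=False
)
# Drop rows where no cancer type could be parsed
cosmic_df = cosmic_df.dropna(subset=['cancer_type'])
cosmic_df.head()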


Unnamed: 0,cosmic_info,cancer_type
0,ID=COSM917032;OCCURENCE=1(endometrium),endometrium
1,ID=COSM1262222;OCCURENCE=1(oesophagus),oesophagus
2,"ID=COSM1651216,COSM917097;OCCURENCE=1(endometr...",endometrium
3,ID=COSM917132;OCCURENCE=1(endometrium),endometrium
4,ID=COSM268400;OCCURENCE=1(large_intestine),large_intestine
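`str.extract` keeps only the first `count(type)` pair in each row. If a row lists several pairs after `OCCURENCE=` (an assumption; the rows shown here each contain one, and row 2 is truncated in the display), `str.extractall` could capture all of them along with their counts. The sketch below runs on hypothetical sample rows rather than on the Athena result, and the second sample row is invented to illustrate the multi-pair case.

```python
import pandas as pd

# Hypothetical sample in the same format as cosmic_info; the second row's
# two-tissue OCCURENCE value is an invented illustration.
sample = pd.DataFrame({
    "cosmic_info": [
        "ID=COSM917032;OCCURENCE=1(endometrium)",
        "ID=COSM1651216,COSM917097;OCCURENCE=1(endometrium),2(large_intestine)",
    ]
})

# extractall returns one row per regex match, indexed by
# (original row, match number), with one column per named group.
pairs = sample["cosmic_info"].str.extractall(
    r"(?P<count>\d+)\((?P<tissue>[^)]+)\)"
)
pairs["count"] = pairs["count"].astype(int)

# Sum the reported occurrences per tissue across all rows.
tissue_totals = pairs.groupby("tissue")["count"].sum()
print(tissue_totals)
```

Unlike a plain row count, this weights each tissue by the occurrence number reported in the annotation, so the two approaches can rank cancer types differently.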


In [None]:
# Count frequency of cancer types
cancer_counts = cosmic_df['cancer_type'].value_counts().reset_index()
cancer_counts.columns = ['Cancer Type', 'Count']

# Plot the 10 most frequent cancer types (matplotlib is already imported above)
top10 = cancer_counts.head(10)

plt.figure(figsize=(10, 6))
plt.bar(top10['Cancer Type'], top10['Count'], color='salmon')
plt.title("Top 10 Cancer Types in COSMIC68 Dataset")
plt.ylabel("Number of Mutations")
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
plt.tight_layout()
plt.savefig("cosmic_cancer_types.png")
plt.show()
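Raw counts depend on how many rows were sampled, so percentage shares can be easier to compare across runs with a different `LIMIT`. A minimal sketch using `value_counts(normalize=True)` follows; the `cancer_type` Series here is a hypothetical stand-in for `cosmic_df['cancer_type']`.

```python
import pandas as pd

# Hypothetical labels standing in for cosmic_df['cancer_type'].
cancer_type = pd.Series(
    ["endometrium", "endometrium", "oesophagus", "large_intestine"]
)

# normalize=True returns fractions of the total; scale them to percentages.
share = cancer_type.value_counts(normalize=True).mul(100).round(1)
print(share)
```

The resulting Series can be passed to `plt.bar` in place of the raw counts, with the y-axis label changed to a percentage.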


In [None]:
## ✅ Conclusion

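We pulled 1,000 rows from the COSMIC68 (hg19) table and counted the cancer type recorded for each mutation. Because the query uses `LIMIT 1000` without an `ORDER BY`, the rows are an arbitrary subset, so the ranking describes this sample rather than the full dataset.

Tallies like this show which cancer types are most represented among the recorded mutations, a starting point for further work in cancer genomics and precision medicine.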
