# Downloading and Caching Large Datasets

## Why We Need This Notebook

In this course, we work with two large CSV files from the DepMap project:
- **CRISPR dependency data** (~350 MB): Gene essentiality scores across cancer cell lines
- **Gene expression data** (~450 MB): RNA expression levels for thousands of genes

### The Problem
Colab wipes the runtime's temporary storage every time a session restarts, so without a cache you would have to re-download roughly 800 MB each time you open a notebook.

### The Solution
We'll download these files **once** from Zenodo (a research data repository) and save them to your **Google Drive**. Your Google Drive is persistent storage that stays accessible across all your Colab sessions.

### Benefits
- **Faster**: Load from Google Drive in seconds instead of downloading each time
- **Reliable**: No need to depend on external servers every session
- **Efficient**: Save bandwidth and time

---

## Instructions

1. **Run this notebook ONCE** to download and cache the data
2. **In future notebooks**, simply mount Google Drive and load the cached files
3. The data will be saved in: `My Drive/Colab_Data/`

Let's get started!

## Step 1: Mount Google Drive

This allows Colab to access your Google Drive storage.

In [None]:
import pathlib


# Create a directory for our data if it doesn't exist
data_dir = pathlib.Path('data')
data_dir.mkdir(parents=True, exist_ok=True)

print(f"✓ Data directory ready: {data_dir}")
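
The cell above only creates a relative `data` folder, which in Colab sits on the runtime's temporary disk by default. The following is a minimal sketch of pointing `data_dir` at Google Drive instead, assuming the standard `google.colab.drive.mount` call and the `Colab_Data` folder named in the instructions. The helper name `resolve_data_dir` and the local fallback are illustrative assumptions, not part of the course code.

```python
import pathlib


def resolve_data_dir(folder_name: str = "Colab_Data") -> pathlib.Path:
    """Return a Drive-backed directory in Colab, or a local ./data elsewhere."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        # Not in Colab: fall back to a local folder (assumed behaviour)
        data_dir = pathlib.Path("data")
    else:
        # Asks for authorization the first time it runs in a session
        drive.mount("/content/drive")
        data_dir = pathlib.Path("/content/drive/MyDrive") / folder_name
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


data_dir = resolve_data_dir()
print(f"Data directory ready: {data_dir}")
```

Because the later cells only refer to `data_dir`, swapping in this helper should leave the rest of the notebook unchanged.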

## Step 2: Download CRISPR Dependency Data

This file contains gene essentiality scores, which indicate the genes that are critical for cancer cell survival.

**Note**: This may take 2-3 minutes depending on your internet connection.

In [None]:
import pandas as pd


# Zenodo URL for CRISPR dependency data
dependency_url = 'https://zenodo.org/records/17609722/files/dependency_df_nan.csv?download=1'

print("Downloading CRISPR dependency data from Zenodo...")
print("This may take a few minutes for the large file (~350 MB)...\n")


# Read the CSV directly from URL
dependency_df = pd.read_csv(dependency_url)

# Save into the data directory under our chosen filename
dependency_path = data_dir / 'dependency.csv'
dependency_df.to_csv(dependency_path, index=False)

print("✓ Download complete!")
print(f"✓ Saved as: {dependency_path}")
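
Reading the URL with `pd.read_csv` holds the whole table in memory and downloads it again on every run. As an alternative, here is a hedged sketch that streams the file straight to disk and skips the download when a cached copy already exists. The function name `download_if_missing` and the `.part` temporary suffix are assumptions for illustration. Unlike the cell above, this keeps the file's bytes exactly as served rather than rewriting the CSV through pandas.

```python
import pathlib
import shutil
import urllib.request


def download_if_missing(url: str, dest) -> bool:
    """Stream `url` to `dest` unless a non-empty `dest` already exists.

    Returns True if a download happened, False if the cached file was reused.
    """
    dest = pathlib.Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    # Copy in chunks so the whole file never has to fit in memory
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    # Rename only after the copy finishes, so an interrupted download
    # is never mistaken for a complete cache
    tmp.replace(dest)
    return True


# Example call (not executed here), using the names from the cell above:
# download_if_missing(dependency_url, data_dir / "dependency.csv")
```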


## Step 3: Download Gene Expression Data

This file contains RNA expression levels for thousands of genes across cancer cell lines.

**Note**: This may also take 2-3 minutes.

In [None]:
# Zenodo URL for gene expression data
expression_url = 'https://zenodo.org/records/17609575/files/expression_df_nan.csv?download=1'

print("Downloading gene expression data from Zenodo...")
print("This may take a few minutes for the large file (~450 MB)...\n")


# Read the CSV directly from URL
expression_df = pd.read_csv(expression_url)

# Save into the data directory under our chosen filename
expression_path = data_dir / 'expression.csv'
expression_df.to_csv(expression_path, index=False)

print("✓ Download complete!")
print(f"✓ Saved as: {expression_path}")


## Step 4: Verify the Cached Files

Let's test loading the files from Google Drive to make sure everything worked!

In [None]:
print("Testing: Loading cached files from Google Drive...\n")

# Test loading dependency data
print("Loading dependency.csv...")

test_dependency = pd.read_csv(data_dir / 'dependency.csv')

print("✓ Loaded data successfully")
print(f"  Shape: {test_dependency.shape}\n")

# Test loading expression data
print("Loading expression.csv...")

test_expression = pd.read_csv(data_dir / 'expression.csv')

print("✓ Loaded data successfully")
print(f"  Shape: {test_expression.shape}\n")

print("="*60)
print("SUCCESS! Both files are cached and ready to use.")
print("="*60)
print("\nYour files are stored at:")
print(f"  📁 {data_dir}")
print("     ├── dependency.csv")
print("     └── expression.csv")
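
Loading both files in full is the most thorough check, but it takes time and memory. A lighter sanity check, sketched below with only the standard library, reports each file's size and peeks at the header and first few rows. The helper name `peek_csv` is hypothetical, and the `data` path mirrors the directory created in Step 1.

```python
import csv
import pathlib


def peek_csv(path, n_rows: int = 3):
    """Return (size in MB, header row, first n_rows data rows) for a CSV file."""
    path = pathlib.Path(path)
    size_mb = path.stat().st_size / 1e6
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # zip stops after n_rows, so the rest of the file is never read
        rows = [row for _, row in zip(range(n_rows), reader)]
    return size_mb, header, rows


for name in ("dependency.csv", "expression.csv"):
    candidate = pathlib.Path("data") / name
    if candidate.exists():
        size_mb, header, _ = peek_csv(candidate)
        print(f"{name}: {size_mb:.0f} MB, {len(header)} columns")
```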

---

## How to Use These Files in Other Notebooks

Now that you've cached the data, here's how to load it in your assignment notebooks:

```python
# 1. Import libraries
import pathlib
import pandas as pd

# 2. Load the cached files
data_dir = pathlib.Path('data')

dependency_df = pd.read_csv(data_dir / 'dependency.csv')
expression_df = pd.read_csv(data_dir / 'expression.csv')

# 3. Start analyzing!
print(f"Dependency data: {dependency_df.shape}")
print(f"Expression data: {expression_df.shape}")
```
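
If the cache is missing, `pd.read_csv` fails with an error that does not point back to this notebook. One possible guard, sketched here with a hypothetical helper named `require_cached`, checks for both files first and says what to do when they are absent.

```python
import pathlib


def require_cached(data_dir, filenames=("dependency.csv", "expression.csv")):
    """Return the cached file paths, or raise if any of them is missing."""
    data_dir = pathlib.Path(data_dir)
    missing = [name for name in filenames if not (data_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing in {data_dir}: {', '.join(missing)}. "
            "Run the download notebook first."
        )
    return [data_dir / name for name in filenames]


# Example call (not executed here):
# dependency_path, expression_path = require_cached("data")
```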

### Tips
- Loading locally takes ~10-20 seconds (much faster than downloading!)
- You only need to run this download notebook once
- If the data gets updated, just re-run this notebook to refresh your cache
- Make sure to mount Google Drive in every new Colab session

---

## Troubleshooting


**Problem**: Download is very slow
- **Solution**: This is normal for large files. Be patient; you only need to do this once!

**Problem**: "Memory error" when loading
- **Solution**: Go to Runtime → Change runtime type → Select "High-RAM" option
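
If a larger runtime is not available, reading only the columns you need can also reduce memory use. The sketch below relies on the `usecols` and `chunksize` parameters of `pd.read_csv`. The helper name `load_columns` and any column names you pass to it are assumptions, since the actual headers depend on the files. The combined result must still fit in memory, so the saving comes mainly from selecting fewer columns.

```python
import pandas as pd


def load_columns(path, columns, chunksize: int = 1000) -> pd.DataFrame:
    """Read only `columns` from a CSV, `chunksize` rows at a time."""
    with pd.read_csv(path, usecols=columns, chunksize=chunksize) as reader:
        # Each chunk is a small DataFrame holding just the requested columns
        chunks = list(reader)
    return pd.concat(chunks, ignore_index=True)


# Example call (not executed here; the column names are placeholders):
# subset = load_columns("data/expression.csv", ["<id column>", "<gene column>"])
```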

---

**You're all set!** 🎉

Your data is now cached and ready for your assignments.