# Creating New DataFrames from Existing Data

Often you need to extract specific information from a DataFrame and organize it into a new DataFrame. This is a fundamental skill in data analysis!

## Learning Objectives

By the end of this notebook, you will be able to:
1. Extract specific columns and rows to create new DataFrames
2. Collect calculated values into a new DataFrame
3. Build summary DataFrames from loop iterations
4. Combine data from multiple sources into one DataFrame
5. Use list comprehensions for efficient DataFrame creation

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np

print('Libraries loaded!')

---

## Part 1: Basic Extraction - Selecting Columns and Rows

Let's start with a simple gene expression dataset.

In [None]:
# Sample gene expression data
data = {
    'sample_id': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    'patient_type': ['healthy', 'healthy', 'healthy', 'disease', 'disease', 'disease'],
    'BRCA1': [5.2, 5.5, 5.1, 8.2, 8.5, 8.1],
    'TP53': [6.8, 6.5, 6.9, 9.1, 9.3, 8.9],
    'MYC': [3.2, 3.5, 3.1, 3.3, 3.4, 3.2]
}

df = pd.DataFrame(data)
print('Gene expression data:')
print(df)

### Example 1: Extract Specific Columns

In [None]:
# Create new DataFrame with only gene columns
genes_only = df[['BRCA1', 'TP53', 'MYC']]

print('Genes only:')
print(genes_only)

### Example 2: Extract Rows Based on Condition

In [None]:
# Create DataFrame with only healthy patients
healthy_df = df[df['patient_type'] == 'healthy']

print('Healthy patients only:')
print(healthy_df)
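
Row filters can also combine several conditions. The sketch below rebuilds a stand-in for `df` so it runs on its own; the variable names `healthy_high_myc` and `picked` and the 3.1 cutoff are illustrative assumptions, not part of the original notebook. Each condition needs its own parentheses, and pandas uses `&` / `|` rather than `and` / `or`.

```python
import pandas as pd

# Stand-in for the tutorial's df (only the columns needed here)
df = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    'patient_type': ['healthy', 'healthy', 'healthy', 'disease', 'disease', 'disease'],
    'MYC': [3.2, 3.5, 3.1, 3.3, 3.4, 3.2],
})

# Two conditions joined with & (each wrapped in parentheses)
healthy_high_myc = df[(df['patient_type'] == 'healthy') & (df['MYC'] > 3.1)]

# .isin() keeps rows whose value appears in a list
picked = df[df['sample_id'].isin(['S2', 'S5'])]

print(healthy_high_myc)
print(picked)
```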

### Example 3: Extract Subset (Rows + Columns)

In [None]:
# Create DataFrame with disease patients and only sample_id, BRCA1, and TP53
# .loc selects rows (condition) and columns (list) in a single step
disease_subset = df.loc[df['patient_type'] == 'disease', ['sample_id', 'BRCA1', 'TP53']]

print('Diseased patients, BRCA1 and TP53 only:')
print(disease_subset)
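
If you plan to modify an extracted subset, it can help to take an explicit `.copy()` so that changes never affect (or warn about) the original DataFrame. This is a minimal sketch on a small stand-in for `df`; the `log2_BRCA1` column is a hypothetical addition for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in for the tutorial's df (same structure, fewer rows)
df = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S4', 'S5'],
    'patient_type': ['healthy', 'healthy', 'disease', 'disease'],
    'BRCA1': [5.2, 5.5, 8.2, 8.5],
    'TP53': [6.8, 6.5, 9.1, 9.3],
})

# Row condition and column list in one .loc call, then an independent copy
is_disease = df['patient_type'] == 'disease'
disease_subset = df.loc[is_disease, ['sample_id', 'BRCA1', 'TP53']].copy()

# Adding a column to the copy leaves the original df unchanged
disease_subset['log2_BRCA1'] = np.log2(disease_subset['BRCA1'])

print(disease_subset)
print('log2_BRCA1' in df.columns)
```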

### 📝 Practice Question 1

**Task:** From the original `df`, create a new DataFrame containing:
- Only samples where BRCA1 expression > 6.0
- Only columns: sample_id, patient_type, BRCA1

Print the result.

In [None]:
# YOUR CODE HERE
# Create new DataFrame with BRCA1 > 6.0


---

## Part 2: Creating DataFrames from Calculations

Often you calculate statistics and want to collect them into a new DataFrame.

### Example 4: Calculate Mean for Each Gene

In [None]:
# Calculate mean expression for each gene
gene_means = {
    'gene': ['BRCA1', 'TP53', 'MYC'],
    'mean_expression': [
        df['BRCA1'].mean(),
        df['TP53'].mean(),
        df['MYC'].mean()
    ]
}

mean_df = pd.DataFrame(gene_means)
print('Gene mean expressions:')
print(mean_df)

### Example 5: Calculate Multiple Statistics

In [None]:
# Calculate mean, median, std, min, and max for BRCA1
brca1_stats = {
    'statistic': ['mean', 'median', 'std', 'min', 'max'],
    'value': [
        df['BRCA1'].mean(),
        df['BRCA1'].median(),
        df['BRCA1'].std(),
        df['BRCA1'].min(),
        df['BRCA1'].max()
    ]
}

stats_df = pd.DataFrame(brca1_stats)
print('BRCA1 statistics:')
print(stats_df)
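
As an alternative sketch, pandas can compute several statistics in one call with `.agg()`. The example below uses a standalone Series holding the BRCA1 values from the dataset above; the column names `statistic` and `value` are chosen to match Example 5.

```python
import pandas as pd

# BRCA1 values from the sample dataset
brca1 = pd.Series([5.2, 5.5, 5.1, 8.2, 8.5, 8.1], name='BRCA1')

# .agg with a list of function names returns a Series indexed by statistic
stats = brca1.agg(['mean', 'median', 'std', 'min', 'max'])

# Convert the Series into a two-column DataFrame
stats_df = stats.rename_axis('statistic').reset_index(name='value')

print(stats_df)
```

Writing the values out by hand, as in Example 5, makes each step explicit; `.agg()` is shorter once the list of statistics grows.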

### 📝 Practice Question 2

**Task:** Create a DataFrame showing mean and standard deviation for each gene (BRCA1, TP53, MYC).

Your DataFrame should have columns: `gene`, `mean`, `std`

**Hint:** Create a dictionary with three lists, then convert to DataFrame.

In [None]:
# YOUR CODE HERE
# Create DataFrame with gene stats


---

## Part 3: Building DataFrames in a Loop

A very common pattern is to iterate through your data, collect one result per step, and build the DataFrame once at the end.

### Example 6: Loop Through Genes and Collect Stats

In [None]:
# Method 1: Build lists, then create DataFrame
genes = ['BRCA1', 'TP53', 'MYC']
gene_names = []
means = []
stds = []

for gene in genes:
    gene_names.append(gene)
    means.append(df[gene].mean())
    stds.append(df[gene].std())

summary_df = pd.DataFrame({
    'gene': gene_names,
    'mean': means,
    'std': stds
})

print('Gene summary (built with loop):')
print(summary_df)

### Example 7: Compare Groups in a Loop

In [None]:
# Compare healthy vs disease for each gene
genes = ['BRCA1', 'TP53', 'MYC']
results = []

for gene in genes:
    healthy_mean = df[df['patient_type'] == 'healthy'][gene].mean()
    disease_mean = df[df['patient_type'] == 'disease'][gene].mean()
    fold_change = disease_mean / healthy_mean
    
    results.append({
        'gene': gene,
        'healthy_mean': healthy_mean,
        'disease_mean': disease_mean,
        'fold_change': fold_change
    })

# Convert list of dictionaries to DataFrame
comparison_df = pd.DataFrame(results)
print('Healthy vs Disease comparison:')
print(comparison_df.round(2))
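
The same comparison can be sketched without an explicit loop by using `groupby`. This is one possible alternative, not a replacement for the loop above; the name `comparison_gb` is an assumption made for this example.

```python
import pandas as pd

# Same values as the sample dataset (sample_id omitted for brevity)
df = pd.DataFrame({
    'patient_type': ['healthy', 'healthy', 'healthy', 'disease', 'disease', 'disease'],
    'BRCA1': [5.2, 5.5, 5.1, 8.2, 8.5, 8.1],
    'TP53': [6.8, 6.5, 6.9, 9.1, 9.3, 8.9],
    'MYC': [3.2, 3.5, 3.1, 3.3, 3.4, 3.2],
})
genes = ['BRCA1', 'TP53', 'MYC']

# Mean of every gene per group; transpose so each gene becomes a row
group_means = df.groupby('patient_type')[genes].mean().T

comparison_gb = pd.DataFrame({
    'gene': group_means.index.tolist(),
    'healthy_mean': group_means['healthy'].to_numpy(),
    'disease_mean': group_means['disease'].to_numpy(),
})
comparison_gb['fold_change'] = comparison_gb['disease_mean'] / comparison_gb['healthy_mean']

print(comparison_gb.round(2))
```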

### 📝 Practice Question 3

**Task:** Loop through each sample (row) and create a DataFrame showing:
- sample_id
- patient_type  
- max_gene_expression (the highest expression among BRCA1, TP53, MYC)

**Hint:** Use `df.iterrows()` to iterate through rows.

In [None]:
# YOUR CODE HERE
# Create DataFrame with max expression per sample


---

## Part 4: Using List Comprehensions (Advanced)

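List comprehensions provide a more compact way to build the lists (or lists of dictionaries) that go into a DataFrame. For data of this size, any speed advantage over an explicit loop is usually small.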

### Example 8: List Comprehension for Statistics

In [None]:
# Same as Example 6, but with list comprehension
genes = ['BRCA1', 'TP53', 'MYC']

summary_df_lc = pd.DataFrame({
    'gene': genes,
    'mean': [df[gene].mean() for gene in genes],
    'std': [df[gene].std() for gene in genes]
})

print('Gene summary (list comprehension):')
print(summary_df_lc)

### Example 9: List Comprehension with Dictionaries

In [None]:
# Same as Example 7, but more concise
genes = ['BRCA1', 'TP53', 'MYC']

comparison_lc = pd.DataFrame([
    {
        'gene': gene,
        'healthy_mean': df[df['patient_type'] == 'healthy'][gene].mean(),
        'disease_mean': df[df['patient_type'] == 'disease'][gene].mean()
    }
    for gene in genes
])

# Add fold change
comparison_lc['fold_change'] = comparison_lc['disease_mean'] / comparison_lc['healthy_mean']

print('Healthy vs Disease (list comprehension):')
print(comparison_lc.round(2))

### 📝 Practice Question 4

**Task:** Use a list comprehension to create a DataFrame showing the range (max - min) for each gene.

Your DataFrame should have columns: `gene`, `range`

**Hint:** `[{'gene': gene, 'range': ...} for gene in genes]`

In [None]:
# YOUR CODE HERE
# Create DataFrame with gene ranges using list comprehension


---

## Part 5: Combining Multiple DataFrames

Sometimes you have data from different sources that need to be combined.

### Example 10: Concatenating DataFrames Vertically

In [None]:
# Two batches of experiments
batch1 = pd.DataFrame({
    'sample': ['A1', 'A2', 'A3'],
    'expression': [5.2, 5.5, 5.1]
})

batch2 = pd.DataFrame({
    'sample': ['B1', 'B2', 'B3'],
    'expression': [6.1, 6.3, 6.0]
})

print('Batch 1:')
print(batch1)
print('\nBatch 2:')
print(batch2)

# Combine them
combined = pd.concat([batch1, batch2], ignore_index=True)
print('\nCombined:')
print(combined)
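
After concatenation, nothing in `combined` records which batch a row came from. One way to keep that information, sketched here under the assumption that a new `batch` column is acceptable, is to tag each DataFrame with `.assign()` before stacking.

```python
import pandas as pd

batch1 = pd.DataFrame({
    'sample': ['A1', 'A2', 'A3'],
    'expression': [5.2, 5.5, 5.1]
})
batch2 = pd.DataFrame({
    'sample': ['B1', 'B2', 'B3'],
    'expression': [6.1, 6.3, 6.0]
})

# Tag each batch before stacking so the origin of every row is preserved
combined = pd.concat(
    [batch1.assign(batch='batch1'), batch2.assign(batch='batch2')],
    ignore_index=True,
)
print(combined)

# The label makes per-batch summaries straightforward
print(combined.groupby('batch')['expression'].mean())
```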

### Example 11: Merging DataFrames (Like SQL JOIN)

In [None]:
# Sample metadata
metadata = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3'],
    'patient_age': [45, 38, 52]
})

# Expression data
expression = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3'],
    'BRCA1': [5.2, 5.5, 5.1]
})

print('Metadata:')
print(metadata)
print('\nExpression:')
print(expression)

# Merge on sample_id
merged = pd.merge(metadata, expression, on='sample_id')
print('\nMerged:')
print(merged)
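
In Example 11 every `sample_id` appears in both tables. When that is not the case, the `how` argument of `pd.merge` controls which rows survive, and `indicator=True` adds a `_merge` column showing where each row came from. The sketch below uses a hypothetical expression table in which S3 is missing and S4 has no metadata.

```python
import pandas as pd

metadata = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3'],
    'patient_age': [45, 38, 52]
})

# Hypothetical data: S3 has no expression value, S4 has no metadata
expression = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S4'],
    'BRCA1': [5.2, 5.5, 8.2]
})

# Default how='inner': keep only sample_ids present in both tables
inner = pd.merge(metadata, expression, on='sample_id')

# how='outer': keep everything; missing values become NaN
outer = pd.merge(metadata, expression, on='sample_id', how='outer', indicator=True)

print(inner)
print(outer)
```

Checking the row count or the `_merge` column after a merge is a quick way to notice samples that were dropped silently.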

### 📝 Practice Question 5

**Task:** You have two DataFrames:
1. Patient info: sample_id, treatment (drug_A, drug_B, control)
2. Response data: sample_id, tumor_size_change (negative = shrinkage)

Create both DataFrames with 6 samples (2 per treatment), then merge them.

**Example data:**
- Samples: S1-S6
- Treatments: 2x drug_A, 2x drug_B, 2x control
- Tumor changes: drug_A (-3.5, -2.8), drug_B (-1.2, -0.9), control (+0.5, +0.8)

In [None]:
# YOUR CODE HERE
# Create two DataFrames and merge them


---

## Part 6: Real-World Example - Gene Correlation Analysis

Let's put it all together with a realistic biology example.

In [None]:
# Gene expression across 10 samples
np.random.seed(42)

gene_data = pd.DataFrame({
    'sample': [f'S{i}' for i in range(1, 11)],
    'BRCA1': np.random.normal(6.0, 1.0, 10),
    'TP53': np.random.normal(7.0, 1.2, 10),
    'MYC': np.random.normal(5.5, 0.8, 10),
    'PTEN': np.random.normal(4.5, 1.1, 10)
})

print('Gene expression data:')
print(gene_data.round(2))

In [None]:
# Task: Calculate correlation between BRCA1 and all other genes
from scipy.stats import pearsonr

target_gene = 'BRCA1'
other_genes = ['TP53', 'MYC', 'PTEN']

# Build results in a loop
correlation_results = []

for gene in other_genes:
    r, p = pearsonr(gene_data[target_gene], gene_data[gene])
    correlation_results.append({
        'gene': gene,
        'correlation': r,
        'p_value': p,
        'significant': 'Yes' if p < 0.05 else 'No'
    })

# Convert to DataFrame
corr_df = pd.DataFrame(correlation_results)

print(f'Correlation with {target_gene}:')
print(corr_df.round(3))
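
As a quick cross-check that needs only pandas, `DataFrame.corrwith` computes the Pearson correlation of each column with a target Series. It does not return p-values, so it complements rather than replaces the `pearsonr` loop. The data below is a stand-in generated with an arbitrary seed, not the exact values from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch
gene_data = pd.DataFrame({
    'BRCA1': rng.normal(6.0, 1.0, 10),
    'TP53': rng.normal(7.0, 1.2, 10),
    'MYC': rng.normal(5.5, 0.8, 10),
    'PTEN': rng.normal(4.5, 1.1, 10),
})

target_gene = 'BRCA1'
other_genes = ['TP53', 'MYC', 'PTEN']

# Pearson r of each selected column against the target gene
r_values = gene_data[other_genes].corrwith(gene_data[target_gene])

# Series -> DataFrame with one row per gene
corr_check = r_values.rename_axis('gene').reset_index(name='correlation')

print(corr_check.round(3))
```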

### 📝 Practice Question 6 (Challenge)

**Task:** Create a "pairwise correlation matrix" DataFrame.

Calculate the correlation between ALL pairs of genes in `gene_data` (BRCA1-TP53, BRCA1-MYC, TP53-MYC, etc.), with one row per pair.

Your DataFrame should have columns:
- `gene1`
- `gene2`
- `correlation`

**Hint:** Use nested loops or `itertools.combinations`

In [None]:
# YOUR CODE HERE
# Create pairwise correlation DataFrame


---

## Summary

### Key Patterns for Creating DataFrames:

**1. Extract columns/rows:**
```python
new_df = old_df[['col1', 'col2']]
new_df = old_df[old_df['col'] > 5]
```

**2. From dictionary:**
```python
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data)
```

**3. Build with lists in loop:**
```python
col1_data = []
col2_data = []
for item in items:
    col1_data.append(value1)
    col2_data.append(value2)
df = pd.DataFrame({'col1': col1_data, 'col2': col2_data})
```

**4. List of dictionaries:**
```python
results = []
for item in items:
    results.append({'col1': val1, 'col2': val2})
df = pd.DataFrame(results)
```

**5. List comprehension:**
```python
df = pd.DataFrame([
    {'col1': x, 'col2': calc(x)}
    for x in items
])
```

**6. Combining DataFrames:**
```python
# Vertical (stack rows)
pd.concat([df1, df2], ignore_index=True)

# Horizontal (join by key)
pd.merge(df1, df2, on='key_column')
```

### Common Use Cases in Biology:
- Extracting gene subsets for pathway analysis
- Building summary statistics tables
- Comparing treatment groups
- Collecting correlation and statistical test results
- Combining data from multiple experiments

### Best Practices:
1. Start simple (extract first, then build)
2. Use meaningful column names
3. Always check your results with `.head()` or print statements
4. Choose the method that makes your code most readable
5. List comprehensions are more concise (and often slightly faster), but loops are clearer for beginners