# PA1: College Data Visualization & PCA Analysis

**Learning Goals:**
- Practice data visualization and interpretation skills
- Compare naive feature selection vs. principled dimensionality reduction
- Develop communication skills for peer review
- Apply your custom PCA implementation to real data

**Submission Requirements:**
- Complete all code cells and written responses
- Export notebook as PDF for peer review
- Focus on clear explanations and visual communication

---

## Setup & Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from student_code import pca_fit_transform

# Set plotting style
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Load the college dataset
df = pd.read_csv('College.csv', index_col=0)
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
df.head()

In [None]:
# Basic dataset exploration
print("Dataset Info:")
df.info()  # info() prints its summary directly and returns None
print("\nBasic Statistics:")
df.describe()

## Part 1: Naive Feature Exploration

**Your Task:** Without using any dimensionality reduction techniques, explore the college dataset by selecting 2-3 features that you think might be most interesting or revealing about colleges. Create visualizations and provide your reasoning.

**Questions to consider:**
- Which features do you think would be most informative?
- What relationships do you expect to see?
- How might you distinguish between different types of colleges?
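
One hedged way to approach these questions is to check pairwise correlations before committing to features: two strongly correlated features largely repeat the same information, so plotting them against each other shows less than a weakly related pair would. The sketch below uses a small synthetic frame; the column names `tuition`, `spending`, and `ratio` are hypothetical stand-ins, not columns taken from `College.csv`. With the real data, the same calls could be applied to `df.select_dtypes(include=[np.number])`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are hypothetical, not from College.csv
rng = np.random.default_rng(0)
n = 200
tuition = rng.normal(10000, 3000, n)
toy = pd.DataFrame({
    "tuition": tuition,
    "spending": 0.8 * tuition + rng.normal(0, 1000, n),  # built to track tuition
    "ratio": rng.normal(14, 4, n),                         # built to be unrelated
})

# Pairwise Pearson correlations between numeric columns
corr = toy.corr()

# Keep only the upper triangle so each pair appears once, without self-pairs
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by strength of linear association
ranked = pairs.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only measures linear association, so a low value does not rule out a curved or clustered relationship; a scatter plot is still worth making.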

### 1.1 Feature Selection & Reasoning

**Write your response here:**

*Explain which 2-3 features you chose and why you think they would be most informative for understanding colleges. Consider what aspects of college experience or quality these features might capture.*

**My chosen features:**
1. Feature 1: [NAME] - [REASONING]
2. Feature 2: [NAME] - [REASONING] 
3. Feature 3 (optional): [NAME] - [REASONING]


In [None]:
# TODO: Define your chosen features here
feature1 = ""  # Replace with your chosen feature name
feature2 = ""  # Replace with your chosen feature name  
feature3 = ""  # Optional third feature

print(f"Selected features: {feature1}, {feature2}")
if feature3:
    print(f"Third feature: {feature3}")

### 1.2 Naive Visualization

In [None]:
# TODO: Create scatter plot of your chosen features
# Color by Private/Public if you want to see that distinction

plt.figure(figsize=(12, 8))

# Basic scatter plot
# TODO: Uncomment and modify this code block
# plt.scatter(df[feature1], df[feature2], alpha=0.6)
# plt.xlabel(feature1)
# plt.ylabel(feature2)
# plt.title(f'College Data: {feature1} vs {feature2}')

# Optional: Color by private/public
# private_colleges = df[df['Private'] == 'Yes']
# public_colleges = df[df['Private'] == 'No']
# plt.scatter(private_colleges[feature1], private_colleges[feature2], 
#            alpha=0.6, label='Private', s=50)
# plt.scatter(public_colleges[feature1], public_colleges[feature2], 
#            alpha=0.6, label='Public', s=50)
# plt.legend()

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# TODO: If you chose a third feature, create a 3D plot or additional 2D plots
# This cell is optional

if feature3:
    # Option 1: 3D scatter plot
    fig = plt.figure(figsize=(12, 9))
    ax = fig.add_subplot(111, projection='3d')
    
    # TODO: Uncomment and modify
    # ax.scatter(df[feature1], df[feature2], df[feature3], alpha=0.6)
    # ax.set_xlabel(feature1)
    # ax.set_ylabel(feature2)
    # ax.set_zlabel(feature3)
    # ax.set_title(f'3D View: {feature1}, {feature2}, {feature3}')
    
    plt.show()
    
    # Option 2: Multiple 2D plots
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # TODO: Create additional 2D combinations
    # axes[0].scatter(df[feature1], df[feature3], alpha=0.6)
    # axes[0].set_xlabel(feature1)
    # axes[0].set_ylabel(feature3)
    # axes[0].set_title(f'{feature1} vs {feature3}')
    
    # axes[1].scatter(df[feature2], df[feature3], alpha=0.6)
    # axes[1].set_xlabel(feature2)
    # axes[1].set_ylabel(feature3)
    # axes[1].set_title(f'{feature2} vs {feature3}')
    
    plt.tight_layout()
    plt.show()

### 1.3 Naive Analysis & Observations

**Write your analysis here:**

*Based on your visualizations above, describe what patterns, clusters, or relationships you observe. What insights about colleges can you draw from these features?*

**My observations:**
- Pattern 1: [DESCRIBE]
- Pattern 2: [DESCRIBE]  
- Surprising finding: [DESCRIBE]
- Limitations of this approach: [DESCRIBE]


---

## Part 2: PCA-Based Analysis

Now let's apply your custom PCA implementation to see what the data's principal components reveal. We'll use all numeric features to get a comprehensive view.
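
PCA directions are driven by variance, so a feature recorded in large units can dominate the first component purely because of its scale. Whether `pca_fit_transform` standardizes internally is not stated in this notebook, so check your own implementation. The sketch below, on synthetic data with made-up scales, shows one common choice (z-scoring each column) and how it changes the variance shares; `variance_shares` is a helper name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic, independent features on very different scales (hypothetical data)
X_toy = np.column_stack([
    rng.normal(5000.0, 2000.0, 300),   # large-scale feature, e.g. a dollar amount
    rng.normal(15.0, 4.0, 300),        # small-scale feature, e.g. a ratio
])

def variance_shares(X):
    """Eigenvalue shares of the sample covariance matrix, largest first."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Z-score each column: subtract the mean, divide by the sample standard deviation
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=1)

print("raw shares:         ", variance_shares(X_toy))
print("standardized shares:", variance_shares(X_std))
```

On the raw data the first share is close to 1 because the large-scale column swamps the other; after standardization the two independent columns contribute roughly equally.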

### 2.1 Data Preparation for PCA

In [None]:
# Prepare numeric data for PCA (exclude categorical 'Private' column)
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric features for PCA: {numeric_features}")
print(f"Number of features: {len(numeric_features)}")

# Extract numeric data as numpy array
X = df[numeric_features].values
print(f"\nData matrix shape: {X.shape}")
print("Data range check:")
print(f"Min values: {X.min(axis=0)[:5]}... (showing first 5)")
print(f"Max values: {X.max(axis=0)[:5]}... (showing first 5)")

### 2.2 Apply Your Custom PCA Implementation

In [None]:
# Apply PCA using your custom implementation
# Start with the first 4 components for visualization
n_components = 4

X_pca, components, explained_variance_ratio, feature_means = pca_fit_transform(
    X, n_components=n_components
)

print("PCA Results:")
print(f"Transformed data shape: {X_pca.shape}")
print(f"Components shape: {components.shape}")
print("\nExplained variance ratios:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"PC{i+1}: {ratio:.3f} ({ratio*100:.1f}%)")

print(f"\nTotal variance explained by first {n_components} components: {explained_variance_ratio.sum():.3f} ({explained_variance_ratio.sum()*100:.1f}%)")
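
One way to sanity-check the numbers printed by this cell is to compare them with a reference computed from NumPy's SVD. The helper `reference_pca` below is a hypothetical name introduced for this sketch. It assumes mean-centering without scaling and returns components with shape `(n_features, n_components)`; if your `pca_fit_transform` standardizes the data or uses a different layout, adapt the comparison accordingly.

```python
import numpy as np

def reference_pca(X, n_components):
    """Hypothetical reference: PCA via SVD of the mean-centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratios = (S ** 2) / np.sum(S ** 2)      # variance share per component
    components = Vt[:n_components].T        # shape (n_features, n_components)
    scores = Xc @ components                # data projected onto the components
    return scores, components, ratios[:n_components]

# Synthetic data for the demonstration only
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))

scores, comps, ratios = reference_pca(X_demo, n_components=2)
print(scores.shape, comps.shape, ratios)
```

With your own results, the explained variance ratios should match the reference directly. The projected data may differ by a sign flip per component, so a comparison such as `np.allclose(np.abs(X_pca), np.abs(scores))` is a more forgiving check than comparing the raw values.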

### 2.3 Visualize Principal Components

In [None]:
# Plot explained variance
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Component')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
cumulative_variance = np.cumsum(explained_variance_ratio)
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 'bo-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Scatter plot of first two principal components
plt.figure(figsize=(12, 8))

# Color by private/public for comparison
private_mask = (df['Private'] == 'Yes').to_numpy()  # plain boolean array for indexing X_pca

plt.scatter(X_pca[private_mask, 0], X_pca[private_mask, 1], 
           alpha=0.6, label='Private', s=50)
plt.scatter(X_pca[~private_mask, 0], X_pca[~private_mask, 1], 
           alpha=0.6, label='Public', s=50)

plt.xlabel(f'PC1 ({explained_variance_ratio[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({explained_variance_ratio[1]*100:.1f}% variance)')
plt.title('College Data in PCA Space (First Two Components)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 2.4 Interpret Principal Components

In [None]:
# Analyze what each principal component represents
# by looking at the feature loadings (components)

def plot_component_loadings(components, feature_names, component_idx, n_top=10):
    """Plot the top feature loadings for a principal component"""
    # Assumes components has shape (n_features, n_components), one column per PC
    loadings = components[:, component_idx]
    
    # Get indices of features with highest absolute loadings
    top_indices = np.argsort(np.abs(loadings))[-n_top:]
    
    plt.figure(figsize=(10, 6))
    y_pos = range(len(top_indices))
    
    plt.barh(y_pos, loadings[top_indices])
    plt.yticks(y_pos, [feature_names[i] for i in top_indices])
    plt.xlabel('Component Loading')
    plt.title(f'PC{component_idx + 1} Feature Loadings (Top {n_top})')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    return top_indices, loadings[top_indices]

# Analyze first two principal components
print("PC1 - Most Important Features:")
pc1_indices, pc1_loadings = plot_component_loadings(components, numeric_features, 0)

print("\nPC2 - Most Important Features:")
pc2_indices, pc2_loadings = plot_component_loadings(components, numeric_features, 1)

### 2.5 PCA Insights & Interpretation

**Write your analysis here:**

*Based on the PCA results and component loadings above, describe what you think each principal component represents in terms of college characteristics.*

**PC1 Interpretation:**
- This component seems to represent: [YOUR INTERPRETATION]
- Key features: [LIST TOP FEATURES AND THEIR SIGNS]
- Real-world meaning: [WHAT DOES THIS AXIS REPRESENT?]

**PC2 Interpretation:**
- This component seems to represent: [YOUR INTERPRETATION]  
- Key features: [LIST TOP FEATURES AND THEIR SIGNS]
- Real-world meaning: [WHAT DOES THIS AXIS REPRESENT?]
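
When listing signs, keep in mind that a principal component is only defined up to sign: flipping every loading of a component describes the same axis. What carries meaning is which features share a sign and which oppose each other. A small sketch on synthetic data, with made-up feature names, illustrates this:

```python
import numpy as np

rng = np.random.default_rng(3)
names = ["feat_a", "feat_b", "feat_c"]           # hypothetical feature names
base = rng.normal(size=200)
X_toy = np.column_stack([
    base + 0.1 * rng.normal(size=200),           # moves with base
    -base + 0.1 * rng.normal(size=200),          # moves against base
    rng.normal(size=200),                        # unrelated noise
])

Xc = X_toy - X_toy.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                                      # loadings of the first component

# Variance of the projection is identical for pc1 and -pc1
var_pos = np.var(Xc @ pc1)
var_neg = np.var(Xc @ (-pc1))

# Rank features by absolute loading, keeping the sign for interpretation
order = np.argsort(np.abs(pc1))[::-1]
for i in order:
    print(f"{names[i]:>7}: {pc1[i]:+.3f}")
print("same variance under sign flip:", np.isclose(var_pos, var_neg))
```

Here `feat_a` and `feat_b` receive loadings of opposite sign because they move against each other, and that contrast survives a sign flip of the whole component.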


---

## Part 3: Comparison & Critical Analysis

Now compare your naive feature selection approach with the PCA-based analysis.
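
Beyond eyeballing the two plots, one rough way to compare them is to ask how much of the total variance each 2-D view retains. On standardized data with `p` features, any pair of raw features retains exactly `2 / p` of the total variance, while the first two principal components retain at least that much. The sketch below illustrates the idea on synthetic data; it is an optional illustration, not part of the required analysis.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 6
# Synthetic correlated features: two latent factors mixed into p columns, plus noise
latent = rng.normal(size=(n, 2))
X_toy = latent @ rng.normal(size=(2, p)) + 0.3 * rng.normal(size=(n, p))

# Standardize so every feature contributes variance 1
Z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=1)

# Share of total variance kept by two raw (standardized) features
naive_share = Z[:, [0, 1]].var(axis=0, ddof=1).sum() / Z.var(axis=0, ddof=1).sum()

# Share kept by the first two principal components
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
pca_share = eigvals[:2].sum() / eigvals.sum()

print(f"two raw features: {naive_share:.3f}")   # equals 2 / p
print(f"first two PCs:    {pca_share:.3f}")
```

Retained variance is not the same thing as class separation: a view that keeps more variance does not necessarily separate private from public colleges better, so this number complements the plots rather than replacing them.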

### 3.1 Side-by-Side Comparison

In [None]:
# TODO: Create side-by-side comparison plots
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Left plot: Your naive approach (modify as needed)
# TODO: Recreate your naive visualization
private_colleges = df[df['Private'] == 'Yes']
public_colleges = df[df['Private'] == 'No']

# TODO: Uncomment and modify these lines with your chosen features
# axes[0].scatter(private_colleges[feature1], private_colleges[feature2], 
#                alpha=0.6, label='Private', s=50)
# axes[0].scatter(public_colleges[feature1], public_colleges[feature2], 
#                alpha=0.6, label='Public', s=50)
# axes[0].set_xlabel(feature1)
# axes[0].set_ylabel(feature2)
# axes[0].set_title('Naive Feature Selection')
# axes[0].legend()
# axes[0].grid(True, alpha=0.3)

# Right plot: PCA approach
private_mask = (df['Private'] == 'Yes').to_numpy()  # plain boolean array for indexing X_pca
axes[1].scatter(X_pca[private_mask, 0], X_pca[private_mask, 1], 
               alpha=0.6, label='Private', s=50)
axes[1].scatter(X_pca[~private_mask, 0], X_pca[~private_mask, 1], 
               alpha=0.6, label='Public', s=50)
axes[1].set_xlabel(f'PC1 ({explained_variance_ratio[0]*100:.1f}% variance)')
axes[1].set_ylabel(f'PC2 ({explained_variance_ratio[1]*100:.1f}% variance)')
axes[1].set_title('PCA-Based Analysis')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 3.2 Critical Comparison & Reflection

**Write your comprehensive analysis here:**

**Comparison Questions:**

1. **What new insights did PCA reveal that your naive approach missed?**
   - [YOUR RESPONSE]

2. **How do the patterns/clusters compare between the two approaches?**
   - [YOUR RESPONSE]

3. **Which approach better separates different types of colleges? Why?**
   - [YOUR RESPONSE]

4. **What are the advantages and disadvantages of each approach?**
   
   **Naive Feature Selection:**
   - Advantages: [LIST]
   - Disadvantages: [LIST]
   
   **PCA Approach:**
   - Advantages: [LIST] 
   - Disadvantages: [LIST]

5. **When might you prefer one approach over the other?**
   - [YOUR RESPONSE]

6. **How did implementing matrix multiplication by hand affect your understanding of PCA?**
   - [YOUR RESPONSE]


---

## Part 4: Communication & Peer Review Preparation

Prepare a clear summary of your findings for peer review.

### 4.1 Executive Summary

**Write a concise summary (1-3 paragraphs) that someone unfamiliar with your analysis could understand:**

**Dataset & Goal:**
[Describe the college dataset and what you were trying to understand]

**Approach:**
[Briefly explain both the naive and PCA approaches you used]

**Key Findings:**
[Summarize the most important insights from both approaches]

**Conclusions:**
[What did you learn about college data and about dimensionality reduction techniques?]


### 4.2 Questions for Peer Reviewers

**List 2-3 specific questions you'd like your peers to address when reviewing your work:**

1. [YOUR QUESTION - e.g., "Do you agree with my interpretation of PC1? What would you call it?"]

2. [YOUR QUESTION - e.g., "Are there other feature combinations I should have tried in my naive approach?"]

3. [YOUR QUESTION - e.g., "What limitations do you see in my analysis that I may have missed?"]


---

## Submission Checklist

Before submitting, verify you have:

- [ ] Completed all code cells with your custom implementations
- [ ] Provided thoughtful written responses to all reflection questions
- [ ] Created clear, well-labeled visualizations
- [ ] Written an executive summary suitable for peer review
- [ ] Prepared specific questions for your peer reviewers
- [ ] Tested that your PCA implementation works correctly with the college data
- [ ] Exported the notebook as PDF for submission

**Note:** Focus on clear communication and critical thinking. Your peers will be evaluating both your technical implementation and your ability to explain insights from data.