# Phase 1: Data Collection & Understanding

**Objective**: Load the student performance dataset, inspect its structure, and understand the data we're working with.

**Study Document**: `docs/phase_studies/Phase_01_Data_Collection_Understanding.md`

---

## Learning Goals

By the end of this notebook, you should be able to:
1. Load CSV data into a pandas DataFrame
2. Inspect dataset dimensions and structure
3. Identify data types for each feature
4. Detect missing values
5. Generate basic statistical summaries
6. Understand the target variable (G3)
7. Document initial observations

---


## 1. Import Libraries


In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)      # Show up to 100 rows
pd.set_option('display.width', None)        # Auto-detect width

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")


## 2. Load Dataset

**Note**: CSV files produced in European locales often use a semicolon (`;`) as the delimiter instead of a comma, because the comma serves as the decimal separator there.


In [None]:
# Load the dataset
# The note above suggests ';' as the delimiter; change sep if your copy uses ','
df = pd.read_csv('data/raw/student-mat.csv', sep=';')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
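To see why the delimiter matters, here is a minimal sketch that parses the same text twice. The three rows are made up, and the `school` column name is an assumption; only the semicolon layout mirrors the note above.

```python
import io

import pandas as pd

# Hypothetical three-row sample in a semicolon-delimited layout; not real data.
raw = "school;age;G3\nGP;18;6\nGP;17;10\nMS;16;15\n"

# With the default comma separator, every row is read as one wide column.
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep=';' splits the fields as intended.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)  # (3, 1)
print(right.shape)  # (3, 3)
```

A shape with a single column right after loading is usually the first sign that the wrong delimiter was used.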


## 3. Initial Inspection

### 3.1 Dataset Dimensions


In [None]:
# TODO: Display number of rows and columns
# TODO: Display column names
# TODO: Display first few rows
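One possible way to approach these TODOs, sketched on a tiny stand-in frame named `sample` so it does not overwrite `df`. The values are made up and the `school` column name is an assumption; the same calls should apply to the loaded dataset.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; replace `sample` with `df` in the notebook.
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 16],
    "G3": [6, 10, 15],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")

# Column names as a plain Python list
print(sample.columns.tolist())

# head() returns the first 5 rows unless another count is given
print(sample.head())
```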


### 3.2 Data Types


In [None]:
# TODO: Check data types of all columns
# TODO: Use df.info() to see types and non-null counts
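The two inspection calls named in the TODOs could look roughly like this. The frame below is a hypothetical stand-in (made-up values, assumed `school` column), so the printed types only illustrate the output format.

```python
import pandas as pd

# Stand-in frame with one text column and two integer columns (made-up values).
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 16],
    "G3": [6, 10, 15],
})

# One dtype per column
print(sample.dtypes)

# info() prints dtypes together with non-null counts and memory usage
sample.info()

# The helpers in pd.api.types give a programmatic check per column
age_is_int = pd.api.types.is_integer_dtype(sample["age"])
school_is_numeric = pd.api.types.is_numeric_dtype(sample["school"])
print(age_is_int, school_is_numeric)
```

Comparing the non-null counts from `info()` against the row count is also a quick first hint about missing values.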


### 3.3 Missing Values Analysis


In [None]:
# TODO: Check for missing values
# TODO: Count missing values per column
# TODO: Calculate percentage of missing values
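A minimal sketch of the three missing-value steps. The stand-in frame contains one deliberately inserted `NaN` so the output is not all zeros; it says nothing about whether the real dataset has gaps.

```python
import numpy as np
import pandas as pd

# Made-up values with one artificial gap in `age`.
sample = pd.DataFrame({
    "age": [18, 17, np.nan, 16],
    "G3": [6, 10, 15, 11],
})

# isna() marks each missing cell; sum() counts them per column
missing_count = sample.isna().sum()

# mean() of the boolean mask is the fraction missing; scale to percent
missing_pct = sample.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```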


## 4. Statistical Summary

### 4.1 Numerical Features


In [None]:
# TODO: Generate statistical summary for numerical columns
# Use df.describe() - shows count, mean, std, min, quartiles, max
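As a rough illustration of what `describe()` returns, the sketch below runs it on a stand-in frame with made-up values. The result is itself a DataFrame, so individual statistics can be looked up by row label.

```python
import pandas as pd

# Made-up numeric values; replace `sample` with `df` in the notebook.
sample = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "G3": [0, 10, 12, 14],
})

# By default, describe() summarizes numeric columns only
stats = sample.describe()
print(stats)

# Rows are labeled count, mean, std, min, 25%, 50%, 75%, max
print(stats.loc["mean", "age"])
print(stats.loc["max", "G3"])
```

With many columns, `stats.T` (one row per feature) can be easier to read than the default layout.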


### 4.2 Categorical Features


In [None]:
# TODO: Identify categorical columns
# TODO: For each categorical column, show unique values and their counts
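One way to tackle these TODOs is to treat every non-numeric column as categorical. This is an assumption: integer-coded categories would be missed by a dtype check and need to be listed by hand. The `school` and `sex` column names and all values below are illustrative only.

```python
import pandas as pd

# Stand-in frame with two text columns and one numeric column (made-up values).
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS", "GP"],
    "sex": ["F", "M", "F", "F"],
    "age": [18, 17, 16, 15],
})

# Everything that is not numeric is treated as categorical here
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

for col in categorical_cols:
    print(f"\n{col}: {sample[col].nunique()} unique values")
    # value_counts() lists each distinct value with its frequency
    print(sample[col].value_counts())
```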


## 5. Target Variable Analysis (G3)

G3 is the final grade and our target variable, so its range and distribution need to be understood before any modeling.


In [None]:
# TODO: Analyze G3 (final grade)
# - Mean, median, std
# - Min and max
# - Distribution (histogram)
# - Check for any zeros (students who didn't take exam?)
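The checklist above might be sketched as follows. The grades are made up, and the distribution is shown as a frequency table rather than a plot so the snippet runs without a display; calling `g3.hist()` would be the graphical equivalent when matplotlib is available.

```python
import pandas as pd

# Made-up final grades on the 0-20 scale; use df["G3"] in the notebook.
g3 = pd.Series([0, 0, 8, 10, 10, 11, 14, 15], name="G3")

# Central tendency and spread
print(f"mean={g3.mean():.2f}, median={g3.median():.1f}, std={g3.std():.2f}")
print(f"min={g3.min()}, max={g3.max()}")

# Frequency of each grade, ordered by grade: a text version of a histogram
distribution = g3.value_counts().sort_index()
print(distribution)

# Zeros deserve a separate look because they may not be ordinary grades
n_zeros = int((g3 == 0).sum())
print(f"Students with G3 == 0: {n_zeros}")
```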


## 6. Sample Records Inspection


In [None]:
# TODO: Display first 10 rows
# TODO: Display last 10 rows
# TODO: Display random sample of 10 rows
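A short sketch of the three views on a stand-in frame. The `student` column is hypothetical and only there to make row order visible; `random_state` is fixed so the random sample is reproducible between runs.

```python
import pandas as pd

# 25 made-up rows; `student` is a hypothetical identifier column.
sample = pd.DataFrame({
    "student": range(1, 26),
    "G3": [i % 21 for i in range(25)],
})

first_rows = sample.head(10)    # first 10 rows
last_rows = sample.tail(10)     # last 10 rows

# A fixed seed returns the same 10 rows every time the cell is run
random_rows = sample.sample(n=10, random_state=42)

print(first_rows)
print(last_rows)
print(random_rows)
```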


## 7. Data Quality Checks

### 7.1 Check for Duplicates


In [None]:
# TODO: Check for duplicate rows
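A possible duplicate check, shown on a stand-in frame where the last row deliberately repeats the first. The values and the `school` column name are made up for illustration.

```python
import pandas as pd

# Made-up rows; the fourth row is an exact copy of the first.
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS", "GP"],
    "age": [18, 17, 16, 18],
    "G3": [6, 10, 15, 6],
})

# duplicated() flags every repeat after the first occurrence
dup_mask = sample.duplicated()
n_dups = int(dup_mask.sum())
print(f"Duplicate rows: {n_dups}")

# keep=False flags all copies, which helps when inspecting them side by side
all_copies = sample[sample.duplicated(keep=False)]
print(all_copies)
```

Identical rows are not necessarily errors here: without a student identifier, two different students could legitimately share the same values.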


### 7.2 Value Range Checks


In [None]:
# TODO: Verify value ranges make sense
# - Age should be 15-22
# - Grades (G1, G2, G3) should be 0-20
# - Binary variables should only have 2 values
# - etc.
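The range rules listed above could be expressed as boolean masks, as in this sketch. The frame holds made-up values with two deliberate violations, and `schoolsup` is an assumed name for a binary column.

```python
import pandas as pd

# Made-up rows: age 23 and G3 = 21 intentionally break the rules above.
sample = pd.DataFrame({
    "age": [15, 18, 23],
    "G1": [5, 12, 20],
    "G2": [6, 11, 19],
    "G3": [6, 21, 18],
    "schoolsup": ["yes", "no", "no"],
})

# between() is inclusive on both ends by default
age_ok = sample["age"].between(15, 22)

# A row passes only if all three grades are within 0-20
grade_cols = ["G1", "G2", "G3"]
grades_ok = sample[grade_cols].apply(lambda col: col.between(0, 20)).all(axis=1)

# A binary column should contain exactly two distinct values
binary_ok = sample["schoolsup"].nunique() == 2

print(sample[~age_ok])      # rows with an out-of-range age
print(sample[~grades_ok])   # rows with an out-of-range grade
print(f"schoolsup is binary: {binary_ok}")
```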


## 8. Initial Observations & Questions

### Document your findings:

**Dataset Overview:**
- Number of students: 
- Number of features: 
- Target variable range: 

**Data Quality:**
- Missing values: 
- Duplicates: 
- Data type issues: 

**Interesting Observations:**
- 
- 
- 

**Questions for Next Phase:**
- 
- 
- 

**Challenges Identified:**
- 
- 
- 


## 9. Summary

Write a brief summary of what you learned in this phase.

---

**Next Steps**: 
1. Create conclusion document: `docs/phase_conclusions/Phase_01_Conclusion.md`
2. Document all findings and learnings
3. Prepare for Phase 2: Exploratory Data Analysis
