### **Lesson Plan: Learning Statistics Through School Data Analysis**  
**Dataset**: `data.csv` (student enrollment, courses, demographics, and instructors)  
**Tools**: Python (Pandas, NumPy, SciPy, scikit-learn, Matplotlib, Seaborn), Jupyter Notebook, Excel/Google Sheets  

---

### **Week 1: Data Exploration & Descriptive Statistics**  
**Objective**: Understand the dataset’s structure and basic statistical measures.  

#### **Key Concepts**  
1. **Variable Types**:  
   - **Categorical**: `gender`, `department`, `course_name`.  
   - **Numerical**: `credits`, `grade_level`, `enrollment_year`.  
   - **Date**: `date_of_birth`.  

2. **Descriptive Statistics**:  
   - **Measures of Central Tendency**: Mean, median, mode.  
   - **Measures of Spread**: Range, variance, standard deviation.  
   - **Frequency Distributions**: Counts, percentages.  
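
As a minimal sketch of how the variable types above map onto pandas dtypes, the snippet below builds a tiny made-up frame (the rows are illustrative and not taken from `data.csv`) and splits its columns by type with `select_dtypes`. It assumes dates arrive as strings, which is what `read_csv` produces unless told otherwise.

```python
import pandas as pd

# Illustrative rows only; the real data.csv may differ in columns and values.
sample = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "department": ["Math", "Science", "Math"],
    "credits": [3, 4, 3],
    "grade_level": [9, 10, 11],
    "date_of_birth": ["2008-05-01", "2007-11-23", "2006-02-14"],
})
# Dates read from CSV are plain strings, so convert them explicitly.
sample["date_of_birth"] = pd.to_datetime(sample["date_of_birth"])

numerical_cols = sample.select_dtypes(include="number").columns.tolist()
date_cols = sample.select_dtypes(include="datetime").columns.tolist()
# Whatever is neither numeric nor a date is treated as categorical here.
categorical_cols = sample.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
print("Date:", date_cols)
```

Note that a numeric dtype does not guarantee a variable is meaningfully numerical: `grade_level` is stored as a number but behaves more like an ordered category.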

#### **Practical Exercises**  
1. **Load and Inspect Data**:  
   ```python  
   import pandas as pd  
   df = pd.read_csv('data.csv')  
   print(df.head())  
   print(df.describe(include='all'))  
   ```  

2. **Data Cleaning**:  
   - Check for missing values: `df.isnull().sum()`.  
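   - Remove exact duplicate rows: `df = df.drop_duplicates()`. Deduplicate on `student_id` alone only if each student should appear once, since that drops the extra course rows of students enrolled in several courses.  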
   - Convert `date_of_birth` to age:  
     ```python  
     df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])  
     # Age in whole years as of a fixed reference date (365.25 days per year)  
     df['age'] = (pd.Timestamp('2023-01-01') - df['date_of_birth']).dt.days // 365.25  
     ```  

3. **Basic Descriptive Stats**:  
   - Mean/median credits: `df['credits'].mean()`, `df['credits'].median()`.  
   - Most common course: `df['course_name'].mode()`.  

4. **Visualization**:  
   - Age distribution histogram:  
     ```python  
     import seaborn as sns  
     sns.histplot(df['age'], bins=20, kde=True)  
     ```  
   - Bar chart of departments:  
     ```python  
     df['department'].value_counts().plot(kind='bar')  
     ```  
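
The measures of spread and the frequency distributions listed under Key Concepts can be computed along the same lines. The sketch below uses a short invented series of credit values rather than `data.csv`, so the numbers it prints are only illustrative.

```python
import pandas as pd

# Illustrative values only, not taken from data.csv.
credits = pd.Series([2, 3, 3, 4, 4, 4, 5], name="credits")

value_range = credits.max() - credits.min()
variance = credits.var()   # sample variance (pandas uses ddof=1 by default)
std_dev = credits.std()    # sample standard deviation

# Frequency distribution: raw counts and percentages per distinct value.
counts = credits.value_counts().sort_index()
percentages = credits.value_counts(normalize=True).sort_index() * 100

print(f"Range: {value_range}, variance: {variance:.2f}, std: {std_dev:.2f}")
print(pd.DataFrame({"count": counts, "percent": percentages.round(1)}))
```

The same calls should work on `df['credits']` or, for the frequency table, on a categorical column such as `df['department']`.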

**Expected Outcome**:  
- A cleaned dataset with calculated age.  
- Summary statistics and visualizations for key variables.  

---

### **Week 2: Inferential Statistics & Hypothesis Testing**  
**Objective**: Draw conclusions about the population from sample data.  

#### **Key Concepts**  
1. **Hypothesis Testing**:  
   - Null vs. alternative hypotheses.  
   - **p-values** and significance levels (α = 0.05).  
   - **Chi-square test** (categorical variables), **t-test** (numerical variables).  

2. **Confidence Intervals**:  
   - Calculate 95% CI for average credits.  
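
To make the decision rule concrete, here is a small worked sketch of a chi-square test on a hand-built 2×2 table. The counts are invented for illustration. Yates' continuity correction is switched off so the statistic matches the textbook formula (sum of (observed − expected)² / expected).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: rows = two groups, columns = two departments.
observed = np.array([[30, 20],
                     [20, 30]])

# correction=False disables Yates' continuity correction for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f} -> {decision}")
print("Expected counts under H0:")
print(expected)
```

Under the null hypothesis of independence every cell is expected to hold 25 observations, so each cell contributes 5²/25 = 1 to the statistic.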

#### **Practical Exercises**  
1. **Chi-Square Test**:  
   - **Question**: Is there a relationship between `gender` and `department`?  
   ```python  
   from scipy.stats import chi2_contingency  
   contingency_table = pd.crosstab(df['gender'], df['department'])  
   chi2, p, _, _ = chi2_contingency(contingency_table)  
   print(f"p-value: {p}")  # Reject null hypothesis if p < 0.05  
   ```  

2. **t-Test**:  
   - **Question**: Do Science and Math departments differ in average credits?  
   ```python  
   from scipy.stats import ttest_ind  
   science = df[df['department'] == 'Science']['credits']  
   math = df[df['department'] == 'Math']['credits']  
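   # Welch's t-test: does not assume equal variances in the two groups  
   t_stat, p = ttest_ind(science, math, equal_var=False, nan_policy='omit')  
   print(f"t = {t_stat:.2f}, p-value: {p:.4f}")  # Reject null hypothesis if p < 0.05  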
   ```  

3. **Confidence Interval**:  
   ```python  
   import numpy as np  
   mean_credits = df['credits'].mean()  
   std_credits = df['credits'].std()  
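   n = df['credits'].count()  # non-missing values only  
   margin = 1.96 * (std_credits / np.sqrt(n))  # normal approximation; fine for large n  
   print(f"95% CI: {mean_credits - margin:.2f} to {mean_credits + margin:.2f}")  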
   ```  
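
The interval above relies on the normal critical value 1.96, which is a reasonable approximation for large samples. For a small sample, a t-based interval is the usual alternative. The following is a sketch on an invented sample of ten credit loads, not on `data.csv`.

```python
import numpy as np
from scipy import stats

# Illustrative sample of credit loads, not taken from data.csv.
credits = np.array([3, 4, 3, 5, 4, 3, 4, 2, 4, 3])

n = credits.size
mean = credits.mean()
sem = stats.sem(credits)                 # standard error of the mean (ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value

lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"t critical value: {t_crit:.3f}")
print(f"95% CI (t-based): {lower:.2f} to {upper:.2f}")
```

With only nine degrees of freedom the critical value is noticeably larger than 1.96, so the t-based interval is wider than the normal approximation would suggest.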

**Expected Outcome**:  
- A reject or fail-to-reject decision for each hypothesis, backed by its p-value (e.g., "Math and Science departments have significantly different credit loads").  

---

### **Week 3: Trend Analysis & Correlation**  
**Objective**: Analyze relationships and trends over time.  

#### **Key Concepts**  
1. **Correlation**:  
   - **Pearson’s r** (linear relationships).  
   - **Spearman’s ρ** (monotonic relationships, including ordinal data).  

2. **Time Series**:  
   - Enrollment trends by year.  
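
The difference between the two coefficients is easiest to see on data that rises steadily but not along a straight line. The sketch below uses synthetic values (a cubic curve) purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data: y always increases with x, but along a curve.
x = np.arange(1, 11)
y = x ** 3

r, _ = pearsonr(x, y)      # measures linear association
rho, _ = spearmanr(x, y)   # measures monotonic association (rank-based)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman's ρ reaches 1, while Pearson's r stays below 1 because the points do not fall on a line.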

#### **Practical Exercises**  
1. **Correlation Matrix**:  
   ```python  
   numerical_df = df[['grade_level', 'credits', 'age']]  
   sns.heatmap(numerical_df.corr(), annot=True)  
   ```  

2. **Enrollment Trends**:  
   - Plot students per enrollment year:  
     ```python  
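     # Count distinct students per year, then plot the yearly totals  
     enrollment = df.groupby('enrollment_year')['student_id'].nunique()  
     enrollment.plot(kind='line', marker='o', ylabel='Students enrolled')  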
     ```  

3. **Grade vs. Credits**:  
   - Boxplot of credits by grade:  
     ```python  
     sns.boxplot(x='grade_level', y='credits', data=df)  
     ```  
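
Alongside the plots, a small table of year-over-year changes can make a trend easier to describe in words. This sketch assumes one row per student with `student_id` and `enrollment_year` columns, as used above. The records themselves are invented.

```python
import pandas as pd

# Illustrative enrollment records; values are invented.
records = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "enrollment_year": [2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021, 2021],
})

per_year = records.groupby("enrollment_year")["student_id"].nunique()
trend = pd.DataFrame({
    "students": per_year,
    "change": per_year.diff(),                  # absolute change vs. previous year
    "pct_change": per_year.pct_change() * 100,  # relative change in percent
})
print(trend)
print("Peak year:", per_year.idxmax())
```

The first year has no previous value to compare against, so its change columns are `NaN`.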

**Expected Outcome**:  
- Identify trends (e.g., "Enrollment peaked in 2021").  
- Determine whether higher grade levels are associated with more credits.  

---

### **Week 4: Advanced Topics (Optional)**  
**Objective**: Apply predictive modeling and segmentation.  

#### **Key Concepts**  
1. **Clustering**: Group students by behavior (e.g., course choices).  
2. **Regression**: Predict enrollment numbers.  

#### **Practical Exercises**  
1. **K-Means Clustering**:  
   ```python  
   from sklearn.cluster import KMeans  
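   X = df[['age', 'credits']].dropna()  # KMeans cannot handle missing values  
   # Fix n_init and random_state so the clusters are reproducible  
   kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)  
   df.loc[X.index, 'cluster'] = kmeans.labels_  
   sns.scatterplot(x='age', y='credits', hue='cluster', data=df)  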
   ```  

2. **Linear Regression**:  
   ```python  
   from sklearn.linear_model import LinearRegression  
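   # Aggregate to one row per year: number of distinct students enrolled  
   yearly = df.groupby('enrollment_year')['student_id'].nunique().reset_index(name='n_students')  
   X = yearly[['enrollment_year']]  
   y = yearly['n_students']  
   model = LinearRegression().fit(X, y)  
   print(f"Slope: {model.coef_[0]:.1f} students per year")  # Trend over time  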
   ```  
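
The choice of three clusters in the K-Means exercise is arbitrary. One common heuristic, sketched below, is to standardize the features and compare the inertia (within-cluster sum of squares) across several values of k, looking for the "elbow" where further clusters stop helping much. The data here are synthetic points generated around three centers, not values from `data.csv`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic (age, credits) points in three loose groups; not from data.csv.
rng = np.random.default_rng(0)
centers = np.array([[15, 2], [17, 4], [19, 6]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Put both features on the same scale so neither dominates the distance.
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_   # within-cluster sum of squared distances

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this the inertia drops sharply up to k = 3 and flattens afterwards. Real student data will rarely show such a clean elbow, so treat the result as a guide rather than a rule.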

**Expected Outcome**:  
- Segmented student groups (e.g., "Cluster 1: High-credit, older students").  
- An estimated enrollment trend (slope in students per year).  

---

### **Final Project**  
**Task**: Answer a real-world school question using the dataset.  
**Example**:  
- *"Are instructors in high-enrollment departments teaching more courses?"*  
- **Steps**:  
  1. Calculate courses per instructor.  
  2. Compare instructor workload across departments.  
  3. Use a bar chart and t-test to validate findings.  
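
The first two steps might be sketched as follows. The column name `instructor` is a hypothetical placeholder (the dataset description mentions instructors but not the exact header), and the rows are invented, so check the real column names in `data.csv` before adapting this.

```python
import pandas as pd

# Invented rows; `instructor` is a hypothetical column name.
rows = pd.DataFrame({
    "instructor": ["Lee", "Lee", "Lee", "Kim", "Kim", "Diaz", "Diaz", "Diaz"],
    "department": ["Math", "Math", "Math", "Math", "Math",
                   "Science", "Science", "Science"],
    "course_name": ["Algebra", "Calculus", "Algebra", "Geometry", "Geometry",
                    "Biology", "Chemistry", "Physics"],
})

# Step 1: distinct courses taught by each instructor.
workload = (
    rows.groupby(["department", "instructor"])["course_name"]
        .nunique()
        .rename("courses_taught")
        .reset_index()
)

# Step 2: average workload per department.
by_department = workload.groupby("department")["courses_taught"].mean()
print(workload)
print(by_department)
```

From here, `by_department.plot(kind='bar')` would give the bar chart for step 3, and the per-instructor `courses_taught` values for two departments could be passed to a t-test.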

---

### **Learning Resources**  
- **Books**: *Practical Statistics for Data Scientists* (O’Reilly).  
- **Courses**: [Kaggle Learn](https://www.kaggle.com/learn), [Coursera Data Science Specialization](https://www.coursera.org/specializations/jhu-data-science).  
- **Practice**: Replicate analyses in Excel for comparison.  

--- 

**Key Takeaway**: Focus on connecting statistical methods to real-world school problems (e.g., resource allocation, student retention). Document your process and iterate! 🎓