


### **1. Aggregating Time-Based Features**
From `studentVle.csv` and `studentRegistration.csv`:

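- **Average Clicks per Week**:
  `studentVle.csv` has no week column (`week_from` belongs to `vle.csv`), so derive the week from `date`, e.g. `week = date // 7`. Sum `sum_click` per `id_student` and week, then average the weekly totals.  
  Example: `avg_clicks_per_week = sum(sum_click) / num_weeks`, where `num_weeks` is either the number of active weeks or the full module length in weeks; pick one and apply it consistently.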

- **Days Registered Before Start**:  
  `date_registration` is measured in days relative to the module start, so a negative value means the student registered before the course began:  
  `days_registered_before_start = max(-date_registration, 0)`.

- **Days to Unregister**:  
  For students who unregistered, calculate how long they stayed enrolled (the value is missing for everyone else):  
  `days_to_unregister = date_unregistration - date_registration`.
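
The three features above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, assuming that `date` in `studentVle.csv` counts days from the module start, that a week is `date // 7`, and that the weekly average is taken over active weeks only. The frame names `student_vle` and `student_reg` are placeholders.

```python
import pandas as pd

# Toy rows shaped like studentVle.csv; `date` is days relative to module start.
student_vle = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 2],
    "date": [0, 3, 8, 1, 20],
    "sum_click": [4, 6, 10, 5, 5],
})

# Assumption: week index = date // 7 (week 0 covers days 0-6).
student_vle["week"] = student_vle["date"] // 7
weekly_clicks = student_vle.groupby(["id_student", "week"])["sum_click"].sum()
# Mean of weekly totals, over weeks in which the student was active.
avg_clicks_per_week = weekly_clicks.groupby(level="id_student").mean()

# Toy rows shaped like studentRegistration.csv.
student_reg = pd.DataFrame({
    "id_student": [1, 2, 3],
    "date_registration": [-30, 5, -10],
    "date_unregistration": [float("nan"), 40.0, 12.0],
})

# Negative registration dates mean "registered before the start".
student_reg["days_registered_before_start"] = (
    -student_reg["date_registration"]
).clip(lower=0)

# Stays NaN for students who never unregistered.
student_reg["days_to_unregister"] = (
    student_reg["date_unregistration"] - student_reg["date_registration"]
)
```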

---

### **2. Assessment Performance Features**
Using `studentAssessment.csv` and `assessments.csv`:

- **Weighted Assessment Score**:
  Calculate a weighted average score for each student:  
  `weighted_score = sum(score * weight) / sum(weight)`.

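- **Missed Assessments Count**:  
  Count the assessments of the student's module presentation (from `assessments.csv`) that have no matching row in `studentAssessment.csv`. A missing `score` alone is not a reliable signal, because an unsubmitted assessment usually has no row at all.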

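- **Time Taken for Submissions**:  
  Calculate the average gap between a student's submission and the assessment cut-off, where `date` is the cut-off column in `assessments.csv` (negative values mean early submission):  
  `time_to_submit = date_submitted - date`.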
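
A possible pandas sketch of these assessment features is shown below. It uses invented rows and makes two simplifying assumptions: all students belong to the same module presentation, and an assessment without a row in `studentAssessment.csv` counts as missed.

```python
import pandas as pd

# Toy rows shaped like assessments.csv (one module presentation only).
assessments = pd.DataFrame({
    "id_assessment": [10, 11, 12],
    "date": [20, 50, 90],        # cut-off day relative to module start
    "weight": [20, 30, 50],
})

# Toy rows shaped like studentAssessment.csv.
student_assessment = pd.DataFrame({
    "id_assessment": [10, 11, 10, 11, 12],
    "id_student": [1, 1, 2, 2, 2],
    "date_submitted": [18, 55, 20, 49, 90],
    "score": [80, 60, 50, 70, 90],
})

merged = student_assessment.merge(assessments, on="id_assessment")
merged["weighted"] = merged["score"] * merged["weight"]
merged["time_to_submit"] = merged["date_submitted"] - merged["date"]

per_student = merged.groupby("id_student").agg(
    weighted_sum=("weighted", "sum"),
    weight_sum=("weight", "sum"),
    avg_time_to_submit=("time_to_submit", "mean"),
    n_submitted=("id_assessment", "nunique"),
)

# A weight sum of 0 would need special handling on real data.
per_student["weighted_score"] = (
    per_student["weighted_sum"] / per_student["weight_sum"]
)

# Assumption: every assessment in `assessments` applies to every student here.
per_student["missed_assessments"] = len(assessments) - per_student["n_submitted"]
```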

---

### **3. Interaction Features**
From `studentVle.csv`:

- **Total Interactions**:
  Aggregate `sum_click` per student to get total interactions with learning materials.

- **Active Days**:
  Count the unique days a student interacted with learning materials (`count(unique(date))`).

- **Material Diversity**:
  Count the unique `id_site` values interacted with, representing material variety.
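
All three interaction features come out of a single group-by. The sketch below uses pandas named aggregation on a few made-up rows; the output column names are illustrative choices, not fixed by the dataset.

```python
import pandas as pd

# Toy rows shaped like studentVle.csv.
student_vle = pd.DataFrame({
    "id_student": [1, 1, 1, 2],
    "id_site": [100, 100, 200, 100],
    "date": [0, 0, 5, 3],
    "sum_click": [2, 3, 4, 7],
})

interaction_features = student_vle.groupby("id_student").agg(
    total_interactions=("sum_click", "sum"),   # all clicks
    active_days=("date", "nunique"),           # distinct days with activity
    material_diversity=("id_site", "nunique"), # distinct materials visited
)
```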

---

### **4. Module and Presentation Characteristics**
From `courses.csv` and `assessments.csv`:

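- **Assessment Density**:
  Calculate the average number of assessments per week for a module. `module_presentation_length` in `courses.csv` is given in days, so convert it to weeks first:  
  `assessment_density = num_assessments / (module_presentation_length / 7)`.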

- **Module Duration**:  
  Add `module_presentation_length` (in days) as a feature for students associated with the module presentation.

- **Final Exam Weight**:  
  Add the `weight` of the assessment whose `assessment_type` is `Exam` as a feature for comparing performance across modules.
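
The module-level features can be derived along these lines. The rows are invented, and the sketch assumes the module length is stored in days in `module_presentation_length` and that the final exam is the row with `assessment_type == "Exam"`.

```python
import pandas as pd

keys = ["code_module", "code_presentation"]

# Toy rows shaped like courses.csv.
courses = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2013J"],
    "module_presentation_length": [280, 70],   # days
})

# Toy rows shaped like assessments.csv.
assessments = pd.DataFrame({
    "code_module": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "code_presentation": ["2013J"] * 6,
    "assessment_type": ["TMA", "TMA", "CMA", "Exam", "TMA", "Exam"],
    "weight": [25, 25, 50, 100, 100, 100],
})

num_assessments = (
    assessments.groupby(keys).size().rename("num_assessments").reset_index()
)
exam_weight = (
    assessments[assessments["assessment_type"] == "Exam"]
    .groupby(keys)["weight"].max()
    .rename("final_exam_weight")
    .reset_index()
)

module_features = (
    courses.merge(num_assessments, on=keys, how="left")
           .merge(exam_weight, on=keys, how="left")
)

# Convert days to weeks before computing assessments per week.
weeks = module_features["module_presentation_length"] / 7
module_features["assessment_density"] = module_features["num_assessments"] / weeks
```

Merging `module_features` onto the student table by `code_module` and `code_presentation` then attaches these values to each student.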

---

### **5. Demographic and Socioeconomic Features**
From `studentInfo.csv`:

- **Education Level Encoding**:  
  Encode `highest_education` as ordinal values (e.g., `No Formal quals = 0`, `Post Graduate Qualification = 4`).

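- **Deprivation Proxy**:  
  Convert `imd_band` (percentage ranges such as `0-10%`) to a number, for example the lower bound of the band, then normalize. Decide explicitly how to treat missing bands.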

- **Age Group Encoding**:  
  Convert `age_band` into numeric bins or one-hot encode.

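- **Previous Attempts Ratio**:  
  `prev_attempt_ratio = num_of_prev_attempts / total_attempts_possible`.  
  `total_attempts_possible` is not a column in the dataset and has to be defined, for example as the maximum `num_of_prev_attempts` observed.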
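
One way to encode these demographic columns is sketched below. Several choices here are assumptions rather than part of the dataset: the ordinal ranks in `education_order`, using the lower bound of `imd_band` divided by 90 as the normalized value, and defining `total_attempts_possible` as the largest observed `num_of_prev_attempts`.

```python
import pandas as pd

# Toy rows shaped like studentInfo.csv.
student_info = pd.DataFrame({
    "id_student": [1, 2, 3],
    "highest_education": [
        "No Formal quals",
        "A Level or Equivalent",
        "Post Graduate Qualification",
    ],
    "imd_band": ["0-10%", "50-60%", None],
    "age_band": ["0-35", "35-55", "55<="],
    "num_of_prev_attempts": [0, 2, 1],
})

# Assumed ordinal ranking of education levels.
education_order = {
    "No Formal quals": 0,
    "Lower Than A Level": 1,
    "A Level or Equivalent": 2,
    "HE Qualification": 3,
    "Post Graduate Qualification": 4,
}
student_info["education_level"] = student_info["highest_education"].map(
    education_order
)

# Lower bound of the band ("50-60%" -> 50); missing bands stay NaN.
student_info["imd_lower"] = (
    student_info["imd_band"].str.extract(r"^(\d+)")[0].astype(float)
)
# Assumption: the highest band starts at 90, so dividing by 90 maps to [0, 1].
student_info["imd_normalized"] = student_info["imd_lower"] / 90

# One-hot encode the age bands.
age_dummies = pd.get_dummies(student_info["age_band"], prefix="age")

# Hypothetical denominator: the largest number of attempts seen in the data.
total_attempts_possible = student_info["num_of_prev_attempts"].max()
student_info["prev_attempt_ratio"] = (
    student_info["num_of_prev_attempts"] / total_attempts_possible
)
```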

---

### **6. Temporal Features**
From `studentVle.csv` and `studentAssessment.csv`:

- **Interaction Intensity Before Assessments**:  
  Sum the clicks in `studentVle.csv` during the 7 days before each assessment's cut-off `date`.  
  Example: `pre_assessment_interactions = sum(sum_click)` over rows whose `date` falls in the 7-day window before the cut-off.

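- **Midterm vs. Final Activity**:  
  Compare interaction rates or scores between a mid-course period and the final exam period. The dataset has no explicit midterm assessment type, so the mid-course window has to be defined, for example around the halfway point of `module_presentation_length`.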
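
The pre-assessment window can be computed by pairing click rows with assessment cut-offs and filtering on the day difference. The sketch below assumes a half-open window of 7 days ending the day before the cut-off, and it uses a cross join for brevity; on real data the join should also match `code_module` and `code_presentation`.

```python
import pandas as pd

# Toy rows shaped like studentVle.csv.
student_vle = pd.DataFrame({
    "id_student": [1, 1, 1, 1, 2],
    "date": [10, 14, 19, 20, 15],
    "sum_click": [5, 3, 2, 9, 4],
})

# Toy rows shaped like assessments.csv; `date` is the cut-off day.
assessments = pd.DataFrame({"id_assessment": [10], "date": [20]})

# Pair every click row with every assessment (requires pandas >= 1.2).
pairs = student_vle.merge(
    assessments, how="cross", suffixes=("_click", "_due")
)

# Keep clicks on days [cut-off - 7, cut-off).
in_window = (
    (pairs["date_click"] >= pairs["date_due"] - 7)
    & (pairs["date_click"] < pairs["date_due"])
)
pre_assessment_interactions = (
    pairs[in_window]
    .groupby(["id_student", "id_assessment"])["sum_click"]
    .sum()
)
```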

---

### **7. Combining Features Across Tables**
- **Cumulative Clicks to Score Correlation**:  
  Join `studentAssessment.csv` and `studentVle.csv` to examine how `sum_click` correlates with assessment scores.

- **Activity per Module**:  
  Combine `vle.csv` and `studentVle.csv` to derive per-module interaction statistics:  
  `module_activity = sum(sum_click) grouped by code_module`.
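
A rough sketch of both cross-table features follows, on invented rows. It assumes the click-to-score relationship is measured as the Pearson correlation between each student's total clicks and mean score, which is only one of several reasonable definitions.

```python
import pandas as pd

# Toy rows shaped like studentVle.csv.
student_vle = pd.DataFrame({
    "code_module": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "id_student": [1, 2, 3, 1, 4],
    "sum_click": [10, 20, 30, 5, 15],
})

# Toy rows shaped like studentAssessment.csv.
student_assessment = pd.DataFrame({
    "id_student": [1, 2, 3, 4],
    "score": [40, 60, 80, 70],
})

clicks_per_student = (
    student_vle.groupby("id_student")["sum_click"].sum().rename("total_clicks")
)
mean_score = (
    student_assessment.groupby("id_student")["score"].mean().rename("mean_score")
)

# Align on id_student and keep only students present in both tables.
joined = pd.concat([clicks_per_student, mean_score], axis=1, join="inner")
click_score_corr = joined["total_clicks"].corr(joined["mean_score"])

# Total clicks per module.
module_activity = student_vle.groupby("code_module")["sum_click"].sum()
```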

---

### **8. Predictive Features**
From `studentInfo.csv`, `studentRegistration.csv`, `studentVle.csv`, and `assessments.csv`:

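- **Risk of Withdrawal**:  
  Create a flag for students who unregistered before an assessment's cut-off `date` (from `assessments.csv`):  
  `is_withdrawn = 1 if date_unregistration < date else 0`.  
  Because `date_unregistration` is only filled in for students who withdrew, this flag is closer to a label than to an input feature and should be kept out of the predictors.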

- **Engagement to Outcome Ratio**:  
  Normalize total interactions by credits studied, then compare the ratio across `final_result` groups:  
  `engagement_ratio = total_clicks / studied_credits`.
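
Both features reduce to simple column operations once the tables are joined. In this sketch `first_assessment_date` is a hypothetical cut-off day chosen for illustration, and `total_clicks` stands for a per-student click total computed beforehand.

```python
import pandas as pd

# Toy rows shaped like studentRegistration.csv.
student_reg = pd.DataFrame({
    "id_student": [1, 2, 3],
    "date_unregistration": [float("nan"), 12.0, 100.0],
})

# Hypothetical cut-off day of the assessment used for the flag.
first_assessment_date = 30

# NaN compares as False, so students who never unregistered get 0.
student_reg["is_withdrawn"] = (
    student_reg["date_unregistration"] < first_assessment_date
).astype(int)

# Toy rows shaped like studentInfo.csv plus a precomputed click total.
student_info = pd.DataFrame({
    "id_student": [1, 2, 3],
    "studied_credits": [60, 120, 30],
})
total_clicks = pd.DataFrame({
    "id_student": [1, 2, 3],
    "total_clicks": [600, 240, 0],
})

features = student_info.merge(total_clicks, on="id_student", how="left")
features["engagement_ratio"] = (
    features["total_clicks"] / features["studied_credits"]
)
```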

---

### Implementation Steps:
1. **Data Integration**:  
   Merge tables using keys like `id_student`, `code_module`, and `code_presentation`.

2. **Aggregation**:  
   Use group-by operations to compute aggregated features for each student.

3. **Feature Scaling**:  
   Normalize features like `sum_click`, `time_to_submit`, etc., so they sit on comparable scales. Fit the scaling parameters on the training split only.

4. **Validation**:  
   Analyze feature importance using models like Random Forest or feature selection techniques.
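
The first three steps might look like the following on a toy example. The sketch assumes min-max scaling as the normalization method and fills missing click totals with 0; neither choice is prescribed by the steps above.

```python
import pandas as pd

keys = ["code_module", "code_presentation", "id_student"]

# Toy rows shaped like studentInfo.csv.
student_info = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J"] * 3,
    "id_student": [1, 2, 3],
    "final_result": ["Pass", "Withdrawn", "Fail"],
})

# Toy rows shaped like studentVle.csv.
student_vle = pd.DataFrame({
    "code_module": ["AAA", "AAA", "AAA", "BBB"],
    "code_presentation": ["2013J"] * 4,
    "id_student": [1, 1, 2, 3],
    "sum_click": [10, 30, 5, 20],
})

# Aggregation: one row per student and module presentation.
clicks = (
    student_vle.groupby(keys, as_index=False)["sum_click"].sum()
    .rename(columns={"sum_click": "total_clicks"})
)

# Integration: a left join keeps students with no VLE activity.
dataset = student_info.merge(clicks, on=keys, how="left")
dataset["total_clicks"] = dataset["total_clicks"].fillna(0)

# Scaling: min-max to [0, 1]; on real data, take min and max from the
# training split only.
col = dataset["total_clicks"]
dataset["total_clicks_scaled"] = (col - col.min()) / (col.max() - col.min())
```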