# Data Proficiency Project Plan

---

## 1. Project Overview

**Objective:**  
Analyze participant data from the Everything Data mentorship cohort to understand demographics, motivations, skill levels, and factors influencing graduation. Use data-driven insights to recommend improvements for future cohorts.

**Scope:**

- Data cleaning and preprocessing
- Exploratory data analysis (EDA)
- Predictive modeling for graduation status
- Evaluation of models and actionable recommendations

**Deliverables:**

- Cleaned dataset
- EDA visualizations and summary statistics
- Classification models predicting graduation status
- Model evaluation metrics (accuracy, precision, recall, F1-score)
- Recommendations for program improvement
- README explaining workflow

---

## 2. Project Workflow

### Step 1: Data Acquisition & Inspection

- Load dataset using Python (`pandas`).
- Inspect data types, missing values, and general structure.
- Identify categorical vs. numerical variables.
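The inspection steps above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the inline CSV and every column name in it (`age_range`, `hours_per_week`, `graduated`, and so on) are assumptions standing in for the real cohort export.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the cohort export; real columns may differ.
SAMPLE_CSV = """age_range,gender,country,hours_per_week,total_score,graduated
18-24,Female,Kenya,10,78,Yes
25-34,Male,Nigeria,,55,No
25-34,Female,Ghana,6,,No
35-44,Male,Kenya,12,88,Yes
"""


def inspect(df: pd.DataFrame) -> dict:
    """Summarize shape, missing values, and numerical vs. categorical columns."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "shape": df.shape,
        "missing": df.isna().sum().to_dict(),
        "numeric": numeric_cols,
        "categorical": categorical_cols,
    }


# For the real dataset, pass the file path to pd.read_csv instead of the sample.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = inspect(df)
print(df.dtypes)
print(summary)
```

Columns that hold numbers but contain stray text would be reported as categorical here, which is itself a useful signal that cleaning is needed.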

### Step 2: Data Cleaning & Preprocessing

- Handle missing values (imputation or removal).
- Standardize categorical entries (e.g., gender, country).
- Encode categorical variables (One-Hot Encoding for nominal variables; Label Encoding for ordinal variables or the target).
- Convert timestamps into usable features if necessary (e.g., month/year of registration).
- Check for duplicates and inconsistencies.
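A possible shape for these cleaning steps is sketched below. The records, the column names (`gender`, `country`, `hours_per_week`, `registered_at`), and the choice of median imputation are all assumptions made for illustration; the real dataset may call for different handling.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
raw = pd.DataFrame({
    "gender": [" female", "Male", "FEMALE ", "Male", "Male"],
    "country": ["kenya", "Nigeria", "Kenya", "Nigeria", "Nigeria"],
    "hours_per_week": [10.0, None, 6.0, 8.0, 8.0],
    "registered_at": ["2024-01-15", "2024-02-03", "2024-02-20",
                      "2024-03-01", "2024-03-01"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Standardize categorical entries: trim whitespace and unify casing.
    for col in ["gender", "country"]:
        out[col] = out[col].str.strip().str.title()
    # Drop exact duplicate rows (checked after standardization).
    out = out.drop_duplicates().reset_index(drop=True)
    # Impute missing numeric values with the column median (one option among several).
    out["hours_per_week"] = out["hours_per_week"].fillna(out["hours_per_week"].median())
    # Derive month/year features from the registration timestamp.
    registered = pd.to_datetime(out["registered_at"])
    out["reg_month"] = registered.dt.month
    out["reg_year"] = registered.dt.year
    out = out.drop(columns="registered_at")
    # One-hot encode the nominal columns.
    return pd.get_dummies(out, columns=["gender", "country"])


cleaned = clean(raw)
print(cleaned)
```

When the cleaned data feeds a model, fitting the imputation on the training split only avoids leaking test-set information.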

### Step 3: Exploratory Data Analysis (EDA)

**Demographics Analysis:**

- Age range distribution
- Gender and country breakdown
- Track-wise participant distribution

**Experience & Motivation:**

- Years of learning experience vs. graduation rate
- Weekly hours available vs. performance
- Analysis of motivations for joining (word cloud or category frequency)

**Performance Metrics:**

- Distribution of aptitude test completion and total scores
- Correlation matrix between numerical features and graduation status

**Visualizations:**  
Histograms, bar charts, boxplots, heatmaps (Matplotlib/Seaborn).
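The summary statistics behind several of these views can be computed with `pandas` alone, as in the sketch below. The data is made up, and the column names (`track`, `hours_per_week`, `graduated`) and the hour bands are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical cleaned data; graduated is coded 1 = yes, 0 = no.
df = pd.DataFrame({
    "hours_per_week": [4, 6, 8, 10, 12, 5, 9, 11],
    "total_score": [40, 55, 60, 75, 85, 45, 70, 80],
    "years_experience": [0, 1, 1, 2, 3, 0, 2, 2],
    "track": ["Analysis", "Science", "Analysis", "Science",
              "Analysis", "Science", "Analysis", "Science"],
    "graduated": [0, 0, 1, 1, 1, 0, 1, 1],
})

# Graduation rate per track (the mean of a 0/1 column is a rate).
grad_rate_by_track = df.groupby("track")["graduated"].mean()

# Bucket weekly hours into bands and compare graduation rates across them.
df["hours_band"] = pd.cut(df["hours_per_week"], bins=[0, 6, 9, 24],
                          labels=["<=6", "7-9", "10+"])
grad_rate_by_hours = df.groupby("hours_band", observed=True)["graduated"].mean()

# Pearson correlation of each numerical feature with graduation status.
corr_with_target = df.corr(numeric_only=True)["graduated"].drop("graduated")

print(grad_rate_by_track)
print(grad_rate_by_hours)
print(corr_with_target)
```

The full matrix from `df.corr(numeric_only=True)` can be passed to `seaborn.heatmap` to produce the heatmap, and the grouped rates plot directly as bar charts.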

### Step 4: Predictive Modeling

**Goal:** Predict graduation status.

**Feature Selection:** Experience, hours/week, total score, self-assessed skill level, motivation, etc.

**Models to Compare:**

- Logistic Regression
- Random Forest Classifier (or Gradient Boosting)

**Training Procedure:**

- Split data into training and test sets (e.g., 80/20)
- Train each model and tune hyperparameters where needed
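The split-and-train step might look like the sketch below. The features and labels are synthetic, generated only so the example runs; the column names, the label rule, the stratified split, and the hyperparameter values are assumptions rather than project decisions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned cohort data (illustrative only).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "hours_per_week": rng.integers(2, 20, size=n),
    "total_score": rng.integers(30, 100, size=n),
    "years_experience": rng.integers(0, 5, size=n),
})
signal = 0.15 * X["hours_per_week"] + 0.05 * X["total_score"] - 5.0
y = (signal + rng.normal(0, 1, size=n) > 0).astype(int)

# 80/20 split; stratifying keeps the graduation rate similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Scaling helps logistic regression converge; tree ensembles do not need it.
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

print(test_accuracy)
```

If tuning turns out to be necessary, `sklearn.model_selection.GridSearchCV` can wrap either model and search over a parameter grid using cross-validation on the training set.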

### Step 5: Model Evaluation

**Metrics to report:**

- Accuracy
- Precision
- Recall
- F1-score

**Model Comparison:**

- Visualize the confusion matrix for each model
- Compare models on the metrics above and choose the best-performing one
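The metrics listed above can be computed with `sklearn.metrics`, as sketched here. The label vectors are made up so the numbers are easy to check by hand, and the `evaluate` helper is a hypothetical convenience function, not part of any library.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical true labels and predictions (1 = graduated, 0 = did not graduate).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]


def evaluate(y_true, y_pred) -> dict:
    """Collect the four headline metrics plus the confusion matrix."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }


report = evaluate(y_true, y_pred)
print(report)
```

Calling `evaluate` once per model on the same test set gives directly comparable rows. If graduates and non-graduates are imbalanced, accuracy alone can mislead, so F1 or recall on the at-risk class may be the better basis for choosing a model. `sklearn.metrics.ConfusionMatrixDisplay.from_predictions` can draw the matrix as a plot.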

### Step 6: Insights & Recommendations

- Identify key factors influencing graduation
- Provide actionable recommendations:
  - Ideal participant profile
  - Suggested weekly learning hours
  - Support mechanisms for participants at risk of dropping out
- Present findings with clear visuals and narrative
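One way to surface candidate key factors is to rank the features of a fitted model by importance, as in this sketch. The data is synthetic and deliberately built so that only `hours_per_week` determines the label; the column names and the use of random forest importances are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "hours_per_week": rng.integers(2, 20, size=n),
    "total_score": rng.integers(30, 100, size=n),
    "years_experience": rng.integers(0, 5, size=n),
})
# Artificial rule: graduation depends solely on weekly hours.
y = (X["hours_per_week"] >= 8).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher means the feature drove more splits.
ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
print(ranked)
```

Importances describe what the model relied on, not what causes graduation, so any recommendation drawn from them (for example, a suggested number of weekly hours) is best cross-checked against the EDA findings.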

---

## 3. Timeline

| Task | Start Date | End Date |
| --- | --- | --- |
| Data Inspection & Cleaning | Aug 21 | Aug 23 |
| EDA & Visualization | Aug 24 | Aug 26 |
| Predictive Modeling | Aug 27 | Aug 30 |
| Model Evaluation | Aug 31 | Sep 1 |
| Insights & Recommendations | Sep 2 | Sep 4 |
| Prepare Report & Dashboard | Sep 5 | Sep 11 |
| Submission & Presentation | Sep 12 | Sep 12 |

---

## 4. Tools & Technologies

- Python: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`
- Optional: Jupyter Notebook or Google Colab
- Reporting: PDF/Markdown with visualizations or an interactive dashboard
