# Data Proficiency Project Plan

---

## 1. Project Overview

**Objective:**  
Analyze participant data from the Everything Data mentorship cohort to understand demographics, motivations, skill levels, and factors influencing graduation. Use data-driven insights to recommend improvements for future cohorts.

**Scope:**

- Data cleaning and preprocessing
- Exploratory data analysis (EDA)
- Predictive modeling for graduation status
- Evaluation of models and actionable recommendations

**Deliverables:**

- Cleaned dataset
- EDA visualizations and summary statistics
- Classification models predicting graduation status
- Model evaluation metrics (accuracy, precision, recall, F1-score)
- Recommendations for program improvement
- README explaining workflow

---

## 2. Project Workflow

### Step 1: Data Acquisition & Inspection

- Load dataset using Python (`pandas`).
- Inspect data types, missing values, and general structure.
- Identify categorical vs. numerical variables.
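
As a rough sketch of the last bullet, the split between categorical and numerical variables can be derived from the pandas dtypes. The helper name `split_column_types` and the toy frame below are illustrative assumptions, not part of the project code.

```python
import pandas as pd


def split_column_types(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (categorical, numerical) column names based on pandas dtypes."""
    # Anything pandas treats as a number counts as numerical ...
    numerical = df.select_dtypes(include="number").columns.tolist()
    # ... and every remaining column is treated as categorical/text.
    categorical = [col for col in df.columns if col not in numerical]
    return categorical, numerical


# Toy frame standing in for the real cohort file (values are illustrative).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Total score": [58.67, 70.0, 64.33],
    "Graduated": ["No", "No", "Yes"],
})

categorical_cols, numerical_cols = split_column_types(toy)
print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Columns that are stored as text but are not true categories (timestamps, identifiers, free-text answers) still land in the first list and need a manual review.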

### Step 2: Data Cleaning & Preprocessing

- Handle missing values (imputation or removal).
- Standardize categorical entries (e.g., gender, country).
- Encode categorical variables (One-Hot Encoding for nominal fields such as country or track; ordinal encoding for ordered fields such as age range or weekly hours).
- Convert timestamps into usable features if necessary (e.g., month/year of registration).
- Check for duplicates and inconsistencies.
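
The cleaning steps above could look roughly like the sketch below. The function name `clean_cohort`, the derived column names (`reg_month`, `reg_year`, `graduated_flag`), and the toy rows are assumptions for illustration; the timestamp format is inferred from values such as `12/1/2024 23:50:47` and should be checked against the full file.

```python
import pandas as pd


def clean_cohort(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize labels, derive date features, and drop duplicate rows."""
    out = df.copy()

    # Trim stray whitespace and unify capitalization of short labels.
    for col in ["Gender", "Country"]:
        out[col] = out[col].str.strip().str.title()

    # Parse the registration timestamp (assumed month/day/year order).
    ts = pd.to_datetime(out["Timestamp"], format="%m/%d/%Y %H:%M:%S")
    out["reg_month"] = ts.dt.month
    out["reg_year"] = ts.dt.year

    # Binary target: 1 for graduates, 0 otherwise.
    is_yes = out["Graduated"].str.strip().str.lower() == "yes"
    out["graduated_flag"] = is_yes.astype(int)

    # Remove exact duplicates that remain after standardization.
    return out.drop_duplicates().reset_index(drop=True)


# Toy rows (illustrative only); the last two are duplicates of each other.
toy = pd.DataFrame({
    "Timestamp": ["12/1/2024 23:50:47", "12/3/2024 9:35:19", "12/3/2024 9:35:19"],
    "Gender": [" male", "Female ", "Female "],
    "Country": ["kenya", "Kenya", "Kenya"],
    "Graduated": ["No", "Yes", "Yes"],
})

cleaned = clean_cohort(toy)
print(cleaned)
```

Title-casing is a simplification: it suits values like "Kenya" or "South Africa" but would mangle abbreviations, so an explicit mapping may be safer for the real data.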

### Step 3: Exploratory Data Analysis (EDA)

**Demographics Analysis:**

- Age range distribution
- Gender and country breakdown
- Track-wise participant distribution

**Experience & Motivation:**

- Years of learning experience vs. graduation rate
- Hours per week availability vs. performance
- Analysis of stated motivation for joining (word cloud or category frequency)

**Performance Metrics:**

- Distribution of aptitude test completion and total scores
- Correlation matrix between numerical features and graduation status

**Visualizations:**  
Histograms, bar charts, boxplots, heatmaps (Matplotlib/Seaborn).
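
Several of the comparisons listed above (experience or weekly hours vs. graduation rate) reduce to a grouped mean of a Yes/No flag. A minimal sketch follows, assuming the `Graduated` column holds the literal strings "Yes" and "No"; the helper name `graduation_rate_by` and the toy values are hypothetical.

```python
import pandas as pd

HOURS_COL = "How many hours per week can you commit to learning?"


def graduation_rate_by(df: pd.DataFrame, column: str,
                       target: str = "Graduated") -> pd.DataFrame:
    """Count participants and compute the share of graduates per category."""
    graduated = df[target].eq("Yes")  # boolean flag per participant
    summary = graduated.groupby(df[column]).agg(
        participants="count",
        graduation_rate="mean",  # mean of booleans = proportion of True
    )
    return summary.sort_values("graduation_rate", ascending=False)


# Toy data (illustrative values, not the real cohort).
toy = pd.DataFrame({
    HOURS_COL: ["7-14 hours", "7-14 hours", "more than 14 hours",
                "more than 14 hours", "less than 6 hours"],
    "Graduated": ["No", "Yes", "Yes", "Yes", "No"],
})

rates = graduation_rate_by(toy, HOURS_COL)
print(rates)
```

The resulting table can be passed directly to a bar chart. With only 63 participants, some categories will contain very few people, so the rates should be read alongside the `participants` column.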

### Step 4: Predictive Modeling

**Goal:** Predict graduation status.

**Feature Selection:** Experience, hours/week, total score, self-assessed skill level, motivation, etc.

**Models to Compare:**

- Logistic Regression
- Random Forest Classifier (or Gradient Boosting)

**Training Procedure:**

- Split data into train/test sets (e.g., 80/20)
- Train models and tune hyperparameters (if necessary)
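
One possible way to set up the two candidate models is a scikit-learn `Pipeline` that bundles encoding with the classifier, so both models see identical preprocessing. Everything specific below is an assumption for illustration: the helper `build_models`, the short column name `hours`, the synthetic toy data, and the hyperparameter values. Stratifying the split is a suggestion for a small, possibly imbalanced dataset rather than something the plan prescribes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_models(categorical, numerical, seed=42):
    """Return both candidate models, each wrapped with its own preprocessing."""
    def preprocessor():
        return ColumnTransformer([
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
            ("num", StandardScaler(), numerical),
        ])

    return {
        "logistic_regression": Pipeline([
            ("prep", preprocessor()),
            ("clf", LogisticRegression(max_iter=1000)),
        ]),
        "random_forest": Pipeline([
            ("prep", preprocessor()),
            ("clf", RandomForestClassifier(n_estimators=200, random_state=seed)),
        ]),
    }


# Synthetic stand-in data (not the real cohort).
toy = pd.DataFrame({
    "hours": ["7-14 hours", "more than 14 hours"] * 10,
    "Total score": [60.0 + i for i in range(20)],
    "Graduated": ["No", "Yes"] * 10,
})
X = toy[["hours", "Total score"]]
y = (toy["Graduated"] == "Yes").astype(int)

# 80/20 split, stratified so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scores = {}
for name, model in build_models(["hours"], ["Total score"]).items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # test-set accuracy

print(scores)
```

With 63 rows, an 80/20 split leaves about 13 test participants, so cross-validation may give a steadier comparison than a single split.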

### Step 5: Model Evaluation

**Metrics to report:**

- Accuracy
- Precision
- Recall
- F1-score

**Model Comparison:**

- Confusion matrix visualization
- Compare models and choose the best-performing one
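
To make the four metrics concrete, the sketch below computes them directly from confusion-matrix counts, treating "Yes" (graduated) as the positive class. The function name `classification_metrics` and the example labels are illustrative; in practice `sklearn.metrics` provides equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive="Yes"):
    """Compute accuracy, precision, recall, and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 graduates, 5 non-graduates.
y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
y_pred = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

If graduates are a minority in the cohort, accuracy alone can look good for a model that rarely predicts "Yes", which is why precision, recall, and F1 are reported alongside it.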

### Step 6: Insights & Recommendations

- Identify key factors influencing graduation
- Provide actionable recommendations:
  - Ideal participant profile
  - Suggested weekly learning hours
  - Support mechanisms for participants at risk of dropping out
- Present findings with clear visuals and narrative

---

## 3. Timeline

| Task | Start Date | End Date |
| --- | --- | --- |
| Data Inspection & Cleaning | Aug 21 | Aug 23 |
| EDA & Visualization | Aug 24 | Aug 26 |
| Predictive Modeling | Aug 27 | Aug 30 |
| Model Evaluation | Aug 31 | Sep 1 |
| Insights & Recommendations | Sep 2 | Sep 4 |
| Prepare Report & Dashboard | Sep 5 | Sep 11 |
| Submission & Presentation | Sep 12 | Sep 12 |

---

## 4. Tools & Technologies

- Python: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`
- Optional: Jupyter Notebook or Google Colab
- Reporting: PDF/Markdown with visualizations or an interactive dashboard


# Step 1: Data Acquisition & Inspection

In [3]:
# Load dataset
import pandas as pd 
df = pd.read_csv("Cohort 3 DS.csv")

# Shape of the dataset
print("Shape of the dataset:", df.shape)


Shape of the dataset: (63, 15)


In [5]:
# Preview the dataset
print("Preview of the dataset:")
df.head()

Preview of the dataset:


| | Timestamp | Id. No | Age range | Gender | Country | Where did you hear about Everything Data? | How many years of learning experience do you have in the field of data? | Which track are you applying for? | How many hours per week can you commit to learning? | What is your main aim for joining the mentorship program? | What is your motivation to join the Everything Data mentorship program? | How best would you describe your skill level in the track you are applying for? | Have you completed the everything data aptitude test for your track? | Total score | Graduated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12/1/2024 23:50:47 | DS301 | 18-24 years | Male | Kenya | Word of mouth | Less than six months | Data science | less than 6 hours | Upskill | to enter into the data analysis career | Beginner - I have NO learning or work experien... | Yes | 58.67 | No |
| 1 | 12/3/2024 9:35:19 | DS302 | 25-34 years | Male | Kenya | WhatsApp | 6 months - 1 year | Data science | more than 14 hours | Upskill | To grow and improve my skills in data science ... | Elementary - I have theoretical understanding ... | Yes | 70.0 | No |
| 2 | 12/3/2024 19:16:49 | DS303 | 18-24 years | Female | Kenya | WhatsApp | 6 months - 1 year | Data science | more than 14 hours | Upskill | I’m motivated to join Everything Data to enhan... | Intermediate - I have theoretical knowledge an... | Yes | 64.33 | Yes |
| 3 | 12/3/2024 12:52:36 | DS304 | 18-24 years | Female | Kenya | WhatsApp | 6 months - 1 year | Data science | 7-14 hours | Upskill | I'd like to upskill and Join the Data Community | Intermediate - I have theoretical knowledge an... | Yes | 75.0 | No |
| 4 | 12/3/2024 18:12:27 | DS305 | 18-24 years | Male | Kenya | WhatsApp | Less than six months | Data science | 7-14 hours | Upskill | I aim to join the mentorship program to enhanc... | Beginner - I have NO learning or work experien... | Yes | 59.0 | No |


In [15]:
# Check column names and data types
df.dtypes

Timestamp                                                                           object
Id. No                                                                              object
Age range                                                                           object
Gender                                                                              object
Country                                                                             object
Where did you hear about Everything Data?                                           object
How many years of learning experience do you have in the field of data?             object
Which track are you applying for?                                                   object
How many hours per week can you commit to learning?                                 object
What is your main aim for joining the mentorship program?                           object
What is your motivation to join the Everything Data mentorship program?             object

In [9]:
# Summary statistics (numerical only)
df.describe()

| | Total score |
| --- | --- |
| count | 63.0 |
| mean | 69.261905 |
| std | 7.238371 |
| min | 58.33 |
| 25% | 64.0 |
| 50% | 67.67 |
| 75% | 74.33 |
| max | 83.67 |


In [12]:
# Inspect unique values for categorical features
categorical_col = df.select_dtypes(include=["object"]).columns
for col in categorical_col:
    print(f"\nUnique values in '{col}':")
    print(df[col].unique())


Unique values in 'Timestamp':
['12/1/2024 23:50:47' '12/3/2024 9:35:19' '12/3/2024 19:16:49'
 '12/3/2024 12:52:36' '12/3/2024 18:12:27' '11/27/2024 10:40:39'
 '11/28/2024 14:42:45' '12/3/2024 12:58:05' '12/1/2024 22:02:23'
 '12/2/2024 13:29:00' '12/3/2024 16:47:48' '12/2/2024 23:59:16'
 '12/3/2024 14:19:12' '11/27/2024 11:25:40' '11/28/2024 9:11:46'
 '12/3/2024 11:18:47' '11/27/2024 9:19:48' '11/28/2024 20:17:51'
 '12/2/2024 23:06:17' '12/2/2024 13:33:40' '12/3/2024 7:07:40'
 '12/2/2024 16:35:54' '11/28/2024 16:27:08' '12/3/2024 12:59:55'
 '11/27/2024 10:19:12' '11/28/2024 19:57:13' '12/3/2024 13:30:40'
 '11/27/2024 17:00:08' '12/2/2024 18:02:53' '12/3/2024 12:39:32'
 '12/3/2024 16:53:38' '12/1/2024 22:16:29' '12/2/2024 13:45:42'
 '12/2/2024 14:45:22' '12/2/2024 16:11:30' '12/1/2024 18:49:03'
 '12/2/2024 23:41:04' '12/3/2024 20:23:34' '11/28/2024 18:20:02'
 '12/2/2024 1:55:38' '12/1/2024 12:40:47' '12/3/2024 10:11:01'
 '12/3/2024 16:00:30' '12/1/2024 21:44:14' '11/29/2024 12:54:58'
 '

In [13]:
# check for missing values
print("\nMissing values per column:")
df.isnull().sum()


Missing values per column:


Timestamp                                                                          0
Id. No                                                                             0
Age range                                                                          0
Gender                                                                             0
Country                                                                            0
Where did you hear about Everything Data?                                          0
How many years of learning experience do you have in the field of data?            0
Which track are you applying for?                                                  0
How many hours per week can you commit to learning?                                0
What is your main aim for joining the mentorship program?                          0
What is your motivation to join the Everything Data mentorship program?            0
How best would you describe your skill level in the track you are

In [14]:
print("\nNumber of duplicate rows:", df.duplicated().sum())



Number of duplicate rows: 0


# Step 1 Summary: Data Acquisition & Inspection

- **Dataset Shape:** 63 rows × 15 columns  
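- **Data Types:** 14 stored as `object` (mostly categorical, but also including the *Timestamp* string, the *Id. No* identifier, and the free-text motivation field), 1 numerical (`float64` → *Total score*)  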
- **Preview:** Data contains participant demographics, motivations, and scores.  
- **Numerical Summary (Total Score):**
  - Mean: 69.26
  - Min: 58.33
  - Max: 83.67
  - Std: 7.24
- **Categorical Highlights:**
  - Age Ranges: 18–24, 25–34, 35–44, 45–54
  - Gender: Male, Female
  - Country: Kenya, South Africa
  - Tracks: Data Science, Data Analysis
  - Aptitude Test: Yes/No
  - Graduation Status: Yes/No
- **Missing Values:** 0  
- **Duplicates:** 0  

**Conclusion:** Dataset is complete, clean at the structural level, and ready for **Step 2: Data Cleaning & Preprocessing**.
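
Structural cleanliness (no missing values, no duplicate rows) does not rule out inconsistent labels within a column, such as the same category typed with different capitalization or trailing spaces. The sketch below is one way to surface such cases before Step 2; the helper name `find_label_variants` and the sample values are hypothetical.

```python
import pandas as pd


def find_label_variants(series: pd.Series) -> dict[str, list[str]]:
    """Group raw labels that become identical after trimming and lowercasing."""
    raw = series.astype(str)
    normalized = raw.str.strip().str.lower()

    variants: dict[str, set[str]] = {}
    for key, value in zip(normalized, raw):
        variants.setdefault(key, set()).add(value)

    # Keep only normalized labels that have more than one raw spelling.
    return {key: sorted(vals) for key, vals in variants.items() if len(vals) > 1}


# Illustrative values: the second entry differs only by case and a trailing space.
toy = pd.Series(["Data science", "data science ", "Data analysis", "Data science"])

variants_found = find_label_variants(toy)
print(variants_found)
```

An empty result for every categorical column would support moving straight to encoding; any non-empty result points to labels that should be merged first.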
