## ✅ Realistic Project Workflow: **“Student Performance Analysis System”**

### 🔹 Step 1: Load and Inspect Data

* Generate or load a DataFrame with:

  * `Name` (str)
  * `Age` (int)
  * `Class` (str)
  * `Marks` (int)
  * `Attendance` (%)
  * `Extracurricular` (Yes/No)
  * `Guardian` (str)
* 20–30 students for realism

---

### 🔹 Step 2: Clean the Data

* Remove rows with missing **Name**
* Fill missing **Attendance** with the column mean
* Fill missing **Marks** with the column mean
* Encode `Extracurricular` as 1/0 in a new `ExtraCode` column

---

### 🔹 Step 3: Feature Engineering

* Add `Result`: Pass/Fail (Pass if Marks ≥ 40)
* Add `Grade`:

  * A: ≥ 90
  * B: 75 to < 90
  * C: 60 to < 75
  * D: 40 to < 60
  * F: < 40
* Add `Needs_Help`: "Yes" if Marks < 60 and Attendance < 75, otherwise "No"

---

### 🔹 Step 4: Filtering

* Students with good performance: Marks > 80 and Attendance > 85
* Students who need help (based on new column)
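
The two filters above can be sketched with boolean masks. This is a minimal illustration on made-up rows, not the notebook's dataset; the `Needs_Help` column name is assumed to match the flag created during feature engineering.

```python
import pandas as pd

# Toy rows invented for illustration; column names follow the notebook.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [92.0, 81.0, 45.0, 88.0],
    "Attendance": [98.0, 80.0, 70.0, 86.0],
    "Needs_Help": ["No", "No", "Yes", "No"],
})

# Combine conditions with & and wrap each comparison in parentheses.
good_performers = df[(df["Marks"] > 80) & (df["Attendance"] > 85)]

# Filter on the flag column (assumed to hold "Yes"/"No" strings).
needs_help = df[df["Needs_Help"] == "Yes"]

print(good_performers["Name"].tolist())
print(needs_help["Name"].tolist())
```

Student B is excluded from the first filter because both conditions must hold: marks of 81 pass the threshold, but attendance of 80 does not.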

---

### 🔹 Step 5: Aggregation & Grouping

* Average marks per `Class`
* Count of students in each `Grade`
* Group by `Guardian` to spot students who share a guardian (possibly siblings)
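
A minimal sketch of these three aggregations, assuming the column names used in the notebook. The rows and the helper names (`avg_marks_by_class`, `by_guardian`, `shared`) are invented for illustration.

```python
import pandas as pd

# Toy rows invented for illustration.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Class": ["8A", "8A", "9B", "9B", "10A"],
    "Marks": [70.0, 90.0, 50.0, 60.0, 85.0],
    "Grade": ["C", "A", "D", "C", "B"],
    "Guardian": ["Parent1", "Parent2", "Parent1", "Parent3", "Parent2"],
})

# Average marks per class.
avg_marks_by_class = df.groupby("Class")["Marks"].mean()

# Number of students in each grade.
grade_counts = df["Grade"].value_counts()

# Per-guardian summary using named aggregation.
by_guardian = df.groupby("Guardian").agg(
    students=("Name", "count"),
    avg_marks=("Marks", "mean"),
)

# Guardians linked to more than one student (possible siblings).
shared = by_guardian[by_guardian["students"] > 1]

print(avg_marks_by_class)
print(grade_counts)
print(shared)
```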

---

### 🔹 Step 6: Sorting & Ranking

* Sort students by Marks descending
* Top 5 performers
* Bottom 5 for intervention
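
One possible way to do the sorting and ranking, shown on made-up marks. `nlargest` and `nsmallest` are used here as a shortcut for sorting and then taking `head(5)`; the `Rank` column is an addition for illustration and is not part of the notebook.

```python
import pandas as pd

# Toy rows invented for illustration.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F", "G"],
    "Marks": [65.0, 92.0, 33.0, 79.0, 47.0, 88.0, 35.0],
})

# Full ordering, highest marks first, with a 1-based rank.
ranked = df.sort_values("Marks", ascending=False).reset_index(drop=True)
ranked["Rank"] = ranked["Marks"].rank(ascending=False, method="min").astype(int)

# Top and bottom five without sorting the whole frame by hand.
top_5 = df.nlargest(5, "Marks")
bottom_5 = df.nsmallest(5, "Marks")

print(ranked)
print(top_5["Name"].tolist())
print(bottom_5["Name"].tolist())
```

With `method="min"`, tied students would share the lower rank number, which is usually what a class ranking expects.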

---

### 🔹 Step 7: Final Output & Save

* Export final cleaned dataset to:

  * `final_student_report.csv`
  * Optionally, also generate:

    * `top_5_students.csv`
    * `students_needing_help.csv`

---



In [1]:
import pandas as pd
import numpy as np

# Step 1: Simulate a realistic dataset of 30 students
np.random.seed(42)

names = [
    "Aarav", "Vivaan", "Aditya", "Vihaan", "Arjun", "Sai", "Reyansh", "Ayaan", "Krishna", "Ishaan",
    "Shaurya", "Atharv", "Aryan", "Kabir", "Ranveer", "Anaya", "Myra", "Saanvi", "Aadhya", "Diya",
    "Pari", "Kiara", "Pihu", "Ira", "Meera", "Tara", "Riya", "Anika", "Nitya", "Aarohi"
]

ages = np.random.randint(13, 18, size=30)
classes = np.random.choice(["8A", "9B", "10A"], size=30)
marks = np.random.randint(30, 100, size=30).astype(float)
attendance = np.random.randint(60, 100, size=30).astype(float)
extracurricular = np.random.choice(["Yes", "No"], size=30)
guardians = np.random.choice(["Parent1", "Parent2", "Parent3", "Parent4"], size=30)

# Introduce some missing values
marks[[3, 7, 12]] = np.nan
attendance[[2, 10, 15]] = np.nan
names[5] = None  # Missing name

# Create DataFrame
df = pd.DataFrame({
    "Name": names,
    "Age": ages,
    "Class": classes,
    "Marks": marks,
    "Attendance": attendance,
    "Extracurricular": extracurricular,
    "Guardian": guardians
})

df.head(10)


      Name  Age Class  Marks  Attendance Extracurricular Guardian
0    Aarav   16   10A   65.0        83.0             Yes  Parent1
1   Vivaan   17    9B   79.0        70.0              No  Parent4
2   Aditya   15    9B   33.0         NaN             Yes  Parent1
3   Vihaan   17   10A    NaN        67.0             Yes  Parent4
4    Arjun   17    9B   35.0        94.0              No  Parent3
5     None   14   10A   83.0        94.0             Yes  Parent3
6  Reyansh   15   10A   33.0        92.0             Yes  Parent3
7    Ayaan   15    8A    NaN        64.0             Yes  Parent2
8  Krishna   15   10A   92.0        98.0             Yes  Parent4
9   Ishaan   17    8A   47.0        87.0             Yes  Parent2
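
Beyond `head()`, a quick inspection usually includes the shape, the dtypes, and a count of missing values per column. The sketch below uses three rows taken from the output above (indices 2, 3, and 5) so it runs on its own; the variable name `missing_per_column` is illustrative.

```python
import numpy as np
import pandas as pd

# Three rows from the head(10) output, each with one missing field.
df = pd.DataFrame({
    "Name": ["Aditya", "Vihaan", None],
    "Marks": [33.0, np.nan, 83.0],
    "Attendance": [np.nan, 67.0, 94.0],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # Marks and Attendance are float because they hold NaN

# isna() treats both None and NaN as missing.
missing_per_column = df.isna().sum()
print(missing_per_column)
```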


In [None]:
# ✅ Task 1.1: Drop rows where Name is missing - RESET INDEX
df = df[df['Name'].notna()].reset_index(drop=True)
print("\n✅ After dropping missing names:")
print(df)



✅ After dropping missing names:
       Name  Age Class  Marks  Attendance Extracurricular Guardian
0     Aarav   16   10A   65.0        83.0             Yes  Parent1
1    Vivaan   17    9B   79.0        70.0              No  Parent4
2    Aditya   15    9B   33.0         NaN             Yes  Parent1
3    Vihaan   17   10A    NaN        67.0             Yes  Parent4
4     Arjun   17    9B   35.0        94.0              No  Parent3
5   Reyansh   15   10A   33.0        92.0             Yes  Parent3
6     Ayaan   15    8A    NaN        64.0             Yes  Parent2
7   Krishna   15   10A   92.0        98.0             Yes  Parent4
8    Ishaan   17    8A   47.0        87.0             Yes  Parent2
9   Shaurya   16   10A   73.0         NaN              No  Parent2
10   Atharv   15   10A   63.0        68.0              No  Parent1
11    Aryan   17    8A    NaN        67.0              No  Parent2
12    Kabir   14    8A   43.0        71.0             Yes  Parent1
13  Ranveer   16   10A   77.0

In [3]:
#✅ Task 1.2: Fill missing Marks and Attendance with their column averages
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())
df['Attendance'] = df['Attendance'].fillna(df['Attendance'].mean())

print("\n✅ After filling missing Marks & Attendance:")
print(df)



✅ After filling missing Marks & Attendance:
       Name  Age Class      Marks  Attendance Extracurricular Guardian
0     Aarav   16   10A  65.000000   83.000000             Yes  Parent1
1    Vivaan   17    9B  79.000000   70.000000              No  Parent4
2    Aditya   15    9B  33.000000   81.307692             Yes  Parent1
3    Vihaan   17   10A  60.307692   67.000000             Yes  Parent4
4     Arjun   17    9B  35.000000   94.000000              No  Parent3
5   Reyansh   15   10A  33.000000   92.000000             Yes  Parent3
6     Ayaan   15    8A  60.307692   64.000000             Yes  Parent2
7   Krishna   15   10A  92.000000   98.000000             Yes  Parent4
8    Ishaan   17    8A  47.000000   87.000000             Yes  Parent2
9   Shaurya   16   10A  73.000000   81.307692              No  Parent2
10   Atharv   15   10A  63.000000   68.000000              No  Parent1
11    Aryan   17    8A  60.307692   67.000000              No  Parent2
12    Kabir   14    8A  43.00000
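
The cell above fills gaps with the mean. When marks are skewed by a few very high or very low scores, the median is a common alternative. A minimal sketch on made-up values, only to show how the two fills differ:

```python
import numpy as np
import pandas as pd

# Made-up marks with one outlier (95) and one missing value.
marks = pd.Series([30.0, 35.0, 40.0, np.nan, 95.0], name="Marks")

# mean() and median() both skip NaN by default.
mean_filled = marks.fillna(marks.mean())
median_filled = marks.fillna(marks.median())

print(mean_filled[3])
print(median_filled[3])
```

Here the mean fill is pulled up to 50.0 by the outlier, while the median fill stays at 37.5.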

✅ STEP 2: ENRICHING THE DATA
We'll now create new insights/features from the cleaned data:

In [5]:
#🎯 Task 2.1: Add Result column → "Pass" if Marks ≥ 40, else "Fail"
df['Result'] = df['Marks'].apply(lambda x: 'Pass' if x >= 40 else 'Fail')


In [6]:
#🎯 Task 2.2: Add Grade column
def assign_grade(marks):
    if marks >= 90:
        return 'A'
    elif marks >= 75:
        return 'B'
    elif marks >= 60:
        return 'C'
    elif marks >= 40:
        return 'D'
    else:
        return 'F'

df['Grade'] = df['Marks'].apply(assign_grade)
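
As an alternative to the row-wise `apply`, the same thresholds can be expressed as bins with `pd.cut`. This is a sketch on sample values, assuming the cut-offs from `assign_grade`; `right=False` makes each bin include its lower edge, which matches the `>=` comparisons.

```python
import numpy as np
import pandas as pd

# Sample marks chosen to sit on and around each threshold.
marks = pd.Series([95.0, 90.0, 89.0, 75.0, 60.307692, 59.9, 40.0, 39.0])

# Bins are [-inf, 40), [40, 60), [60, 75), [75, 90), [90, inf).
bins = [-np.inf, 40, 60, 75, 90, np.inf]
labels = ["F", "D", "C", "B", "A"]

grades = pd.cut(marks, bins=bins, labels=labels, right=False)
print(grades.astype(str).tolist())
```

The result is a categorical column, so `value_counts()` and `groupby` work on it directly; call `.astype(str)` if plain strings are needed.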


In [9]:
#🎯 Task 2.3: Flag students who need help → Marks < 60 and Attendance < 75
df['Needs_Help'] = df.apply(
    lambda row: 'Yes' if row['Marks'] < 60 and row['Attendance'] < 75 else 'No',
    axis=1
)
print(df)


       Name  Age Class      Marks  Attendance Extracurricular Guardian Result  \
0     Aarav   16   10A  65.000000   83.000000             Yes  Parent1   Pass   
1    Vivaan   17    9B  79.000000   70.000000              No  Parent4   Pass   
2    Aditya   15    9B  33.000000   81.307692             Yes  Parent1   Fail   
3    Vihaan   17   10A  60.307692   67.000000             Yes  Parent4   Pass   
4     Arjun   17    9B  35.000000   94.000000              No  Parent3   Fail   
5   Reyansh   15   10A  33.000000   92.000000             Yes  Parent3   Fail   
6     Ayaan   15    8A  60.307692   64.000000             Yes  Parent2   Pass   
7   Krishna   15   10A  92.000000   98.000000             Yes  Parent4   Pass   
8    Ishaan   17    8A  47.000000   87.000000             Yes  Parent2   Pass   
9   Shaurya   16   10A  73.000000   81.307692              No  Parent2   Pass   
10   Atharv   15   10A  63.000000   68.000000              No  Parent1   Pass   
11    Aryan   17    8A  60.3
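
The same flag can also be computed without a row-wise `apply`, using a boolean mask and `np.where`. A minimal sketch on four sample rows, assuming the thresholds from Task 2.3:

```python
import numpy as np
import pandas as pd

# Sample rows for illustration.
df = pd.DataFrame({
    "Marks": [33.0, 60.307692, 47.0, 79.0],
    "Attendance": [81.307692, 67.0, 70.0, 70.0],
})

# Both conditions must hold for a student to be flagged.
mask = (df["Marks"] < 60) & (df["Attendance"] < 75)
df["Needs_Help"] = np.where(mask, "Yes", "No")

print(df)
```

Only the third row is flagged: the first has low marks but adequate attendance, and the second sits just above the marks threshold.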

In [10]:
#🎯 Task 2.4: Encode Extracurricular into numeric (Yes → 1, No → 0)
df['ExtraCode'] = df['Extracurricular'].map({'Yes': 1, 'No': 0})
print(df)

       Name  Age Class      Marks  Attendance Extracurricular Guardian Result  \
0     Aarav   16   10A  65.000000   83.000000             Yes  Parent1   Pass   
1    Vivaan   17    9B  79.000000   70.000000              No  Parent4   Pass   
2    Aditya   15    9B  33.000000   81.307692             Yes  Parent1   Fail   
3    Vihaan   17   10A  60.307692   67.000000             Yes  Parent4   Pass   
4     Arjun   17    9B  35.000000   94.000000              No  Parent3   Fail   
5   Reyansh   15   10A  33.000000   92.000000             Yes  Parent3   Fail   
6     Ayaan   15    8A  60.307692   64.000000             Yes  Parent2   Pass   
7   Krishna   15   10A  92.000000   98.000000             Yes  Parent4   Pass   
8    Ishaan   17    8A  47.000000   87.000000             Yes  Parent2   Pass   
9   Shaurya   16   10A  73.000000   81.307692              No  Parent2   Pass   
10   Atharv   15   10A  63.000000   68.000000              No  Parent1   Pass   
11    Aryan   17    8A  60.3

✅ STEP 3: ANALYSIS & AGGREGATION (Real Insights for Decision Making)

In [11]:
#📊 Task 3.1: Count of students in each grade
print(df['Grade'].value_counts())


Grade
C    9
D    7
F    6
B    4
A    3
Name: count, dtype: int64


In [12]:
#📊 Task 3.2: Average Marks & Attendance by Result
print(df.groupby('Result')[['Marks', 'Attendance']].mean())


            Marks  Attendance
Result                       
Fail    34.333333   77.051282
Pass    67.083612   82.418060


In [13]:
#📊 Task 3.3: Average Marks for Students in Extracurricular Activities
print(df[df['Extracurricular'] == 'Yes']['Marks'].mean())


56.71659919028341


In [14]:
#📊 Task 3.4: % of students who need help
need_help_percent = (df['Needs_Help'] == 'Yes').mean() * 100
print(f"{need_help_percent:.2f}% students need help")


17.24% students need help


In [15]:
#📊 Task 3.5: Sort top 5 performers
print(df.sort_values(by='Marks', ascending=False).head(5))


       Name  Age Class  Marks  Attendance Extracurricular Guardian Result  \
25     Riya   15    8A   94.0        96.0             Yes  Parent3   Pass   
7   Krishna   15   10A   92.0        98.0             Yes  Parent4   Pass   
15     Myra   16    8A   91.0        82.0              No  Parent4   Pass   
20    Kiara   17    9B   89.0        81.0              No  Parent2   Pass   
17   Aadhya   13    9B   82.0        96.0             Yes  Parent4   Pass   

   Grade Needs_Help  ExtraCode  
25     A         No          1  
7      A         No          1  
15     A         No          0  
20     B         No          0  
17     B         No          1  


🧾 STEP 4: EXPORT CLEANED + ANALYZED DATA
This step simulates what you’d actually send to a manager, client, or team lead.

In [17]:
#💾 Task 4.1: Save final DataFrame to CSV
df.to_csv("final_student_report.csv", index=False)


In [18]:
#💡 Task 4.2: Save a filtered version (only students who need help)
df[df['Needs_Help'] == 'Yes'].to_csv("students_needing_help.csv", index=False)


In [19]:
#📂 Task 4.3: Save top 5 performers to separate CSV
df.sort_values(by='Marks', ascending=False).head(5).to_csv("top_5_students.csv", index=False)
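
A quick way to check an export is to read the file back and compare it with the DataFrame. The sketch below writes to a temporary directory so it leaves no files behind; the two-row frame is made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the final report.
df = pd.DataFrame({
    "Name": ["Aarav", "Vivaan"],
    "Marks": [65.0, 79.0],
    "Needs_Help": ["No", "No"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "final_student_report.csv"
    # index=False keeps the row index out of the file.
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(len(reloaded))
```

Without `index=False`, the index is written as an unnamed first column and comes back from `read_csv` as `Unnamed: 0`.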
