**Assignment Submission Guidelines**

**1. Submission Platform:**

- Submit your completed assignment through Google Classroom.

**2. Submission Format:**

- Submit the Google Colab Notebook (.ipynb file) provided as the assignment template.
- Do not create a new notebook. Fill in the provided template.

**3. Template Completion:**

The template notebook contains:
- The code to generate the student_performance_detailed_nan.csv dataset.
- Placeholders for your code and explanations for each question.

Follow the instructions within the template.
- Code Cells:
  - Place your code solutions directly in the designated code cells below each question.
- Markdown Cells:
  - Provide your explanations and justifications in the designated Markdown cells.
- Report section:
  - Complete the markdown section at the bottom of the notebook titled "Report".
  - In this section, compile your explanations for each question.
  - Answer the following data analysis questions:
    1. What are the key characteristics of the student population in this dataset?
    2. Which factors appear to have the strongest influence on student grades?
    3. What are the most common missing data patterns, and what implications might they have?
    4. Based on your analysis, what are 2-3 recommendations you would make to improve student performance?

- Do not modify the structure of the template notebook.

**4. File Naming:**

Ensure the file name remains as provided in the template. Do not rename the file.

**5. Timely Submission:**

- Submit your completed template notebook by the deadline: **24th of March, 2025**.
- Late submissions will be penalized as follows:
  - Submissions received by **5:00pm on the 26th of March, 2025** will receive a maximum of 5 marks for timely submission.
  - Submissions received after that time will receive 0 marks for timely submission.

**6. Report:**

- Complete the "Report" section at the end of your notebook.
- Ensure your report is:
  - Well-organized and easy to read.
  - Clear and concise.
  - Free of grammatical errors.

**7. Code Execution:**

Ensure your completed notebook runs without errors from top to bottom.
Before submitting, restart the kernel and run all cells to confirm reproducibility.
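
The data-generation code draws from both Python's `random` module and NumPy, so a restarted kernel produces different numbers unless both generators are seeded. The sketch below illustrates the idea only. The seed value and the helper name `draw_sample` are illustrative assumptions, not part of the template, and a seed should be added only where the template's instructions allow it.

```python
import random

import numpy as np

SEED = 42  # arbitrary illustrative value


def draw_sample(seed):
    """Seed both generators, then draw one value from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.randint(14, 18), float(np.random.normal(75, 10))


first = draw_sample(SEED)
second = draw_sample(SEED)
print(first == second)  # identical draws after re-seeding
```

Seeding only one of the two generators would leave the other half of the columns changing from run to run.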



**8. Academic Integrity:**

All work must be your own.
Plagiarism will result in a failing grade.
Cite any external resources you use.



**Tips for Success:**

- Start the assignment early.
- Read the instructions within the template carefully.
- Plan your approach before coding.
- Test your code thoroughly.
- Document your work clearly.
- Review the rubrics to understand the grading criteria.


**Grading Rubrics:**

Total 50 Marks

- Timely Submission: 10 Marks
- Report: 10 Marks
- Level 1 (Basic Questions): 5 Marks (1 x 5 = 5)
- Level 2 (Intermediate Questions): 10 Marks (2 x 5 = 10)
- Level 3 (Advanced Questions): 15 Marks (3 x 5 = 15)

## **Assignment**

**Background**

You are a data analyst working for "EduMetrics," a specialized educational consultancy. EduMetrics partners with schools, universities, and educational organizations to improve student outcomes and optimize resource allocation through data-driven insights.

Your team has been tasked with analyzing a comprehensive dataset of student performance and related factors. This dataset, which you've compiled, contains information on a diverse group of students, including their demographics, academic performance, behavioral indicators, and school-related factors.

Your goal is to leverage this data to uncover key factors that influence student success. By identifying these trends, you can provide actionable recommendations to educational institutions.

In [None]:
import pandas as pd
import numpy as np
import random

def generate_student_performance_data(num_students=1000):
    """Generates synthetic student performance data with more specific columns."""

    student_data = []

    for student_id in range(1, num_students + 1):
        # Student Demographics
        age = random.randint(14, 18)
        gender = random.choice(['Male', 'Female', 'Other'])
        race = random.choice(['White', 'Black', 'Asian', 'Hispanic', 'Other'])
        ses = random.choice(['Low', 'Medium', 'High'])
        special_ed = random.choice([True, False])

        # Academic Performance
        math_grade = np.random.normal(75, 10)
        science_grade = np.random.normal(80, 12)
        english_grade = np.random.normal(78, 11)
        attendance = random.randint(150, 180)  # Days attended
        homework_completion = random.uniform(0.6, 1.0)  # fraction of homework completed, 0.6 to 1.0

        # Behavioral and Psychological Factors
        counseling_type = random.choice(['Academic', 'Personal', 'Professional', None])
        motivation = np.random.normal(7, 1.5) # scale 1 to 10
        study_time = random.uniform(1, 10) # hours per week

        # School and Teacher Factors (simplified)
        teacher_experience = random.randint(1, 20)
        class_size = random.randint(20, 35)
        school_type = random.choice(['Public', 'Private', 'Charter'])

        # Extracurricular Activities
        sports_type = random.choice(['Basketball', 'Soccer', 'Tennis', 'Swimming', None])
        club_type = random.choice(['Math Club', 'Science Club', 'Debate Club', 'Art Club', None])

        # Disciplinary Actions
        disciplinary_action_type = random.choice(['Detention', 'Suspension', 'Warning', None])
        disciplinary_action_count = 0
        if disciplinary_action_type:
            disciplinary_action_count = random.randint(1, 3)

        student_data.append({
            'StudentID': student_id,
            'Age': age,
            'Gender': gender,
            'Race': race,
            'SES': ses,
            'SpecialEd': special_ed,
            'MathGrade': math_grade,
            'ScienceGrade': science_grade,
            'EnglishGrade': english_grade,
            'Attendance': attendance,
            'HomeworkCompletion': homework_completion,
            'CounselingType': counseling_type,
            'Motivation': motivation,
            'StudyTime': study_time,
            'TeacherExperience': teacher_experience,
            'ClassSize': class_size,
            'SchoolType': school_type,
            'SportsType': sports_type,
            'ClubType': club_type,
            'DisciplinaryActionType': disciplinary_action_type,
            'DisciplinaryActionCount': disciplinary_action_count
        })

    df = pd.DataFrame(student_data)
    return df

# Generate and save the dataset
student_df = generate_student_performance_data()
student_df.to_csv('student_performance_detailed.csv', index=False)

print("Synthetic student performance dataset generated: student_performance_detailed.csv")

Synthetic student performance dataset generated: student_performance_detailed.csv


**The Data**

The data comes from a compilation by EduMetrics, available in 'student_performance_detailed_nan.csv'. Each row represents a single student's performance record:

- StudentID - Unique identifier for each student.
- Age - Student's age in years.
- Gender - Student's gender
  - Male
  - Female
  - Other
- Race - Student's race or ethnicity
  - White
  - Black
  - Asian
  - Hispanic
  - Other
- SES - Student's socioeconomic status
  - Low
  - Medium
  - High
- SpecialEd - Indicates whether the student receives special education services (True/False).
- MathGrade - Student's grade in mathematics.
- ScienceGrade - Student's grade in science.
- EnglishGrade - Student's grade in English.
- Attendance - Number of days the student attended school.
- HomeworkCompletion - Fraction of homework completed (a value between 0 and 1).
- CounselingType - Type of counseling received
  - Academic
  - Personal
  - Professional
  - NaN if none
- Motivation - Student's level of motivation (numerical scale).
- StudyTime - Student's average study time per week (in hours).
- TeacherExperience - Years of teaching experience of the student's teacher.
- ClassSize - Number of students in the student's class.
- SchoolType - Type of school the student attends
  - Public
  - Private
  - Charter
- SportsType - Type of sport the student participates in
  - Basketball
  - Soccer
  - Tennis
  - Swimming
  - NaN if none
- ClubType - Type of club the student participates in
  - Math Club
  - Science Club
  - Debate Club
  - Art Club
  - NaN if none
- DisciplinaryActionType - Type of disciplinary action taken
  - Detention
  - Suspension
  - Warning
  - NaN if none
- DisciplinaryActionCount - Number of times the disciplinary action occurred.

## **Basic (RBT Levels: 2, 3):**

Total: 5 Marks

Each question carries 1 mark.

**Question 1: Missing Value Identification**

Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
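
As a hint, per-column missing counts can be sketched on a small toy DataFrame. The data below is invented for illustration and is not the assignment dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [15, np.nan, 17, 16],
    "SES": ["Low", "High", None, None],
    "ClassSize": [25, 30, 28, 22],
})

missing_counts = toy.isna().sum()  # number of missing values per column
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Filtering on `missing_counts > 0` leaves only the columns that actually contain gaps.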

In [None]:
# Question 1: Missing Value Identification
# Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 2: Basic Missing Value Handling**

Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.


In [None]:
# Question 2: Basic Missing Value Handling
# Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 3: Data Type Conversion**

Verify the data types of each column. Convert the 'Attendance' column to an integer data type and the 'HomeworkCompletion' column to a float data type. Explain why these data types are appropriate.
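
As a hint, type conversion can be sketched on invented toy data (not the assignment dataset). One assumption to note: a plain `int` column cannot hold NaN, so the sketch uses pandas' nullable `Int64` dtype, which matters only if missing values are still present at this step.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Attendance": [150.0, np.nan, 172.0],
    "HomeworkCompletion": ["0.75", "0.9", "1.0"],
})

print(toy.dtypes)  # inspect the current types first

# Nullable integer dtype keeps the missing entry instead of raising an error
toy["Attendance"] = toy["Attendance"].astype("Int64")
toy["HomeworkCompletion"] = toy["HomeworkCompletion"].astype(float)
```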


In [None]:
# Question 3: Data Type Conversion
# Verify the data types of each column. Convert the 'Attendance' column to an integer data type and the 'HomeworkCompletion' column to a float data type. Explain why these data types are appropriate.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 4: Renaming Columns**

Rename the 'StudentID' column to 'Student_ID' and the 'MathGrade' column to 'Math_Score'. Explain why renaming columns can be useful.


In [None]:
# Question 4: Renaming Columns
# Rename the 'StudentID' column to 'Student_ID' and the 'MathGrade' column to 'Math_Score'. Explain why renaming columns can be useful.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 5: Duplicate Row Removal**

Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?


In [None]:
# Question 5: Duplicate Row Removal
# Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?
# Your Code Here:

**Explanation**

[Your explanation here]

## **Intermediate (RBT Levels: 3, 4):**

Total: 10 Marks

Each question carries 2 marks.



**Question 6: Targeted Missing Value Imputation**

Impute the missing values in the 'CounselingType' column with the most frequent value (mode). Explain why you chose this imputation method.
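
As a hint, mode imputation can be sketched on invented toy data (not the assignment dataset):

```python
import pandas as pd

toy = pd.DataFrame(
    {"CounselingType": ["Academic", None, "Personal", "Academic", None]}
)

# mode() returns a Series because ties are possible, so take the first entry
most_frequent = toy["CounselingType"].mode()[0]
toy["CounselingType"] = toy["CounselingType"].fillna(most_frequent)
```

When two categories tie, `mode()` returns both in sorted order, so taking `[0]` is a convention rather than a statistically motivated choice.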


In [None]:
# Question 6: Targeted Missing Value Imputation
# Impute the missing values in the 'CounselingType' column with the most frequent value (mode). Explain why you chose this imputation method.
# Your Code Here:

**Explanation**

[Your explanation here]

Impute the missing values in the 'SportsType' and 'ClubType' columns with the string 'None'. Explain why you chose this imputation method.


In [None]:
# Impute the missing values in the 'SportsType' and 'ClubType' columns with the string 'None'. Explain why you chose this imputation method.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 7: Binning Numerical Data and Visualization**

Create a new categorical column called 'AgeGroup' by binning the 'Age' column into appropriate age ranges (e.g., 14-15, 16-17, 18). Explain your binning strategy. Create a bar chart showing the distribution of students in each age group.
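
As a hint, fixed-width binning with `pd.cut` can be sketched on invented toy data. The bin edges below follow the example ranges in the question and are one possible choice, not the required one.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [14, 15, 16, 17, 18, 16]})

# Right-inclusive bins: (13, 15], (15, 17], (17, 18]
toy["AgeGroup"] = pd.cut(
    toy["Age"], bins=[13, 15, 17, 18], labels=["14-15", "16-17", "18"]
)

# sort_index() orders the counts by bin rather than by frequency
group_counts = toy["AgeGroup"].value_counts().sort_index()
print(group_counts)
# group_counts.plot(kind="bar") would draw the bar chart (requires matplotlib)
```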


In [None]:
# Question 7: Binning Numerical Data and Visualization
# Create a new categorical column called 'AgeGroup' by binning the 'Age' column into appropriate age ranges (e.g., 14-15, 16-17, 18). Explain your binning strategy. Create a bar chart showing the distribution of students in each age group.
# Your Code Here:

**Explanation**

[Your explanation here]

Create a new categorical column called 'StudyTimeCategory' by binning the 'StudyTime' column into quantiles. Explain your binning strategy. Create a box plot showing the distribution of 'MathGrade' for each 'StudyTimeCategory'.
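
As a hint, quantile binning with `pd.qcut` can be sketched on invented toy data. Using four quantiles and the labels `Q1` to `Q4` is an illustrative assumption; the question does not fix the number of bins.

```python
import pandas as pd

toy = pd.DataFrame({
    "StudyTime": [1.0, 2.5, 3.0, 4.5, 6.0, 7.5, 8.0, 9.5],
    "MathGrade": [62, 70, 68, 75, 74, 81, 79, 88],
})

# q=4 splits the values into quartiles with roughly equal counts per bin
toy["StudyTimeCategory"] = pd.qcut(
    toy["StudyTime"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)
per_bin = toy["StudyTimeCategory"].value_counts().sort_index()
# toy.boxplot(column="MathGrade", by="StudyTimeCategory") would draw the plot
```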

In [None]:
# Create a new categorical column called 'StudyTimeCategory' by binning the 'StudyTime' column into quantiles. Explain your binning strategy. Create a box plot showing the distribution of 'MathGrade' for each 'StudyTimeCategory'.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 8: Outlier Detection and Removal**

Use the IQR method to identify and remove outliers from the 'MathGrade' and 'ScienceGrade' columns. Explain your outlier detection and removal process.
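
As a hint, the IQR rule can be sketched on one invented column. The 1.5 multiplier is the conventional choice and is an assumption here, since the question does not specify it.

```python
import pandas as pd

toy = pd.DataFrame({"MathGrade": [70, 72, 75, 77, 78, 80, 140]})

q1 = toy["MathGrade"].quantile(0.25)
q3 = toy["MathGrade"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose value lies inside the fences (inclusive)
filtered = toy[toy["MathGrade"].between(lower, upper)]
```

When filtering on two columns, the fences for each column are usually computed from the same unfiltered data before any rows are dropped, so the second filter does not depend on the first.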


In [None]:
# Question 8: Outlier Detection and Removal
# Use the IQR method to identify and remove outliers from the 'MathGrade' and 'ScienceGrade' columns. Explain your outlier detection and removal process.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 9: String Manipulation**

Clean the 'Race' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.


In [None]:
# Question 9: String Manipulation
# Clean the 'Race' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 10: Dummy Variable Creation and Stacked Bar Plot**

Create dummy variables for the 'Gender' and 'SchoolType' columns. Explain how dummy variables are used in data analysis. Create a stacked bar plot to visualize the distribution of 'Gender' within each 'SchoolType'.
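
As a hint, dummy variables and the count table behind a stacked bar plot can be sketched on invented toy data (not the assignment dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Other", "Male"],
    "SchoolType": ["Public", "Public", "Private", "Charter", "Private"],
})

# One indicator column per category, e.g. Gender_Female, SchoolType_Public
dummies = pd.get_dummies(toy, columns=["Gender", "SchoolType"])

# Counts of each Gender within each SchoolType
counts = pd.crosstab(toy["SchoolType"], toy["Gender"])
# counts.plot(kind="bar", stacked=True) would draw the chart (requires matplotlib)
```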


In [None]:
# Question 10: Dummy Variable Creation and Stacked Bar Plot
# Create dummy variables for the 'Gender' and 'SchoolType' columns. Explain how dummy variables are used in data analysis. Create a stacked bar plot to visualize the distribution of 'Gender' within each 'SchoolType'.
# Your Code Here:

**Explanation**

[Your explanation here]

## **Advanced (RBT Levels: 4, 5):**

Total: 15 Marks

Each question carries 3 marks.

**Question 11: Conditional Missing Value Imputation**

Impute missing values in the 'DisciplinaryActionCount' column. If 'DisciplinaryActionType' is NaN, impute 'DisciplinaryActionCount' with 0. Otherwise, impute with the mean of the existing values for the particular 'DisciplinaryActionType'. Explain your approach.
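
As a hint, the two-rule imputation can be sketched on invented toy data. Mapping per-type means back onto the rows is one possible approach among several.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DisciplinaryActionType": ["Detention", "Detention", "Detention",
                               None, "Warning", "Warning"],
    "DisciplinaryActionCount": [1.0, 3.0, np.nan, np.nan, 2.0, np.nan],
})

count = toy["DisciplinaryActionCount"]
no_action = toy["DisciplinaryActionType"].isna()

# Rule 1: no recorded action type means a count of 0
count = count.mask(no_action & count.isna(), 0)

# Rule 2: otherwise use the mean count observed for the same action type
type_means = toy.groupby("DisciplinaryActionType")["DisciplinaryActionCount"].mean()
fill_values = toy["DisciplinaryActionType"].map(type_means)
toy["DisciplinaryActionCount"] = count.fillna(fill_values)
```

A mean can be fractional, so whether to round the imputed counts is a decision worth justifying in the explanation.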

In [None]:
# Question 11: Conditional Missing Value Imputation
# Impute missing values in the 'DisciplinaryActionCount' column. If 'DisciplinaryActionType' is NaN, impute 'DisciplinaryActionCount' with 0. Otherwise, impute with the mean of the existing values for the particular 'DisciplinaryActionType'. Explain your approach.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 12: Custom Binning Function**

Write a custom function to create a 'MotivationCategory' column based on the 'Motivation' score. Categorize scores below 4 as 'Low', scores between 4 and 7 as 'Medium', and scores above 7 as 'High'. Apply this function to create the new column.
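
As a hint, a custom binning function can be sketched as below. Treating both boundary values (4 and 7) as 'Medium' and leaving missing scores uncategorized are assumptions, since the question does not say how to handle them. The function name `categorize_motivation` is illustrative.

```python
import pandas as pd


def categorize_motivation(score):
    """Map a numeric motivation score to 'Low', 'Medium', or 'High'."""
    if pd.isna(score):
        return None  # assumption: missing scores stay uncategorized
    if score < 4:
        return "Low"
    if score <= 7:  # assumption: 4 and 7 both count as 'Medium'
        return "Medium"
    return "High"


toy = pd.DataFrame({"Motivation": [2.5, 4.0, 6.9, 7.0, 9.1]})
toy["MotivationCategory"] = toy["Motivation"].apply(categorize_motivation)
```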


In [None]:
# Question 12: Custom Binning Function
# Write a custom function to create a 'MotivationCategory' column based on the
# 'Motivation' score. Categorize scores below 4 as 'Low', scores between 4 and 7
# as 'Medium', and scores above 7 as 'High'. Apply this function to create the new column.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 13: Grouped Transformations and Line Chart**

Calculate the average 'MathGrade' for each 'SchoolType'. Then create a new column called 'MathGradeNormalized' that represents each student's 'MathGrade' as a z-score relative to the mean and standard deviation of their school type. Create a line chart visualizing the average normalized MathGrade across school types, sorted by average normalized MathGrade.
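
As a hint, a per-group z-score can be sketched with `groupby(...).transform` on invented toy data. Using the sample standard deviation (pandas' default, `ddof=1`) is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "SchoolType": ["Public", "Public", "Public", "Private", "Private", "Private"],
    "MathGrade": [60.0, 70.0, 80.0, 75.0, 85.0, 95.0],
})

grouped = toy.groupby("SchoolType")["MathGrade"]
school_mean = grouped.transform("mean")  # group mean repeated on each row
school_std = grouped.transform("std")    # sample standard deviation (ddof=1)
toy["MathGradeNormalized"] = (toy["MathGrade"] - school_mean) / school_std
```

Z-scores computed within a group average to zero for that group by construction, which is worth keeping in mind when interpreting a chart of group averages.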


In [None]:
# Question 13: Grouped Transformations and Line Chart
# Calculate the average 'MathGrade' for each 'SchoolType'. Then create a new column called 'MathGradeNormalized' that represents each student's 'MathGrade' as a z-score relative to the mean and standard deviation of their school type. Create a line chart visualizing the average normalized MathGrade across school types, sorted by average normalized MathGrade.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 14: Data Sampling and Validation**

Randomly sample 20% of the dataset. Use this sample to calculate the mean 'StudyTime' for each 'SES' category. Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.
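
As a hint, the sample-versus-full comparison can be sketched on invented toy data. The `random_state` value is an arbitrary assumption that makes the draw repeatable.

```python
import pandas as pd

toy = pd.DataFrame({
    "SES": ["Low", "Medium", "High"] * 10,
    "StudyTime": [float(i % 10 + 1) for i in range(30)],
})

# frac=0.2 draws 20% of the rows without replacement
sample = toy.sample(frac=0.2, random_state=0)

comparison = pd.DataFrame({
    "full": toy.groupby("SES")["StudyTime"].mean(),
    "sample": sample.groupby("SES")["StudyTime"].mean(),
})
comparison["difference"] = comparison["sample"] - comparison["full"]
print(comparison)
```

A category that happens to be absent from the sample shows up as NaN in the `sample` column, which is itself a finding worth discussing.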


In [None]:
# Question 14: Data Sampling and Validation
# Randomly sample 20% of the dataset. Use this sample to calculate the mean 'StudyTime' for each 'SES' category. Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 15: Merging Hypothetical Data**

Imagine you have a second dataset with teacher demographic information (e.g., teacher qualifications). Merge this hypothetical dataset with the student performance dataset using the 'TeacherExperience' column as a key. Explain your merge strategy and how this merged data could be used for further analysis.
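
As a hint, the merge can be sketched with two invented tables. The `Qualification` column and its values are hypothetical, and the teacher table is assumed to have one row per experience value so the merge stays many-to-one.

```python
import pandas as pd

students = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "TeacherExperience": [5, 12, 5],
})

# Hypothetical teacher table; 'Qualification' is an invented column
teachers = pd.DataFrame({
    "TeacherExperience": [5, 12, 20],
    "Qualification": ["B.Ed", "M.Ed", "PhD"],
})

# A left merge keeps every student; validate raises if the key repeats on the right
merged = students.merge(
    teachers, on="TeacherExperience", how="left", validate="many_to_one"
)
```

Years of experience does not uniquely identify a teacher, so in real data a key like this can silently duplicate rows. The `validate` argument guards against that.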


In [None]:
# Question 15: Merging Hypothetical Data
# Imagine you have a second dataset with teacher demographic information (e.g., teacher qualifications). Merge this hypothetical dataset with the student performance dataset using the 'TeacherExperience' column as a key. Explain your merge strategy and how this merged data could be used for further analysis.
# Your Code Here:

**Explanation**

[Your explanation here]

**Report**

**Part 1**

- In this section, compile your explanations for each question.

**Part 2**

- Answer the following data analysis questions:
  1. What are the key characteristics of the student population in this dataset?
  2. Which factors appear to have the strongest influence on student grades?
  3. What are the most common missing data patterns, and what implications might they have?
  4. Based on your analysis, what are 2-3 recommendations you would make to improve student performance?