# Project SC1003: Team Allocation Simulator & Diversity Analysis
# SC1003 Project – AY25/26 Semester 1

Nguyen Hoang Duong hoangduo001@e.ntu.edu.sg (U2510950H)

Ke Qiyun qke003@e.ntu.edu.sg (U25011211J)

GAO XINYU xgao012@e.ntu.edu.sg (U2523895D)

**Problem Statement**

The "Introduction to Data Science" is a Year 1 course offered at NTU. This course attracts a
diverse group of students from various disciplines due to its widespread applicability and
popularity.

Recently, the course has experienced a significant increase in enrollment, with 6,000
students registered. These students are organized into 120 tutorial groups, each consisting
of 50 students. The course coordinator is facing challenges in efficiently forming teams for
a mini-project component of the course.

To address this issue, the course coordinator seeks your expertise in developing an
application capable of organizing students into teams of five for the data science
mini-project. You are provided with a CSV file (records.csv) consisting of 6,000 student records with
their Tutorial Group, Student ID, Name, School, Gender, and CGPA. Teams are to be formed
only using members from the same tutorial group (i.e., no team can contain two members
from different tutorial groups). The application must ensure fairness and diversity when
forming teams by considering the following factors:

**1. School Affiliation: To ensure a mix of knowledge and skills, no team should have a
majority of students from the same school.**

**2. Gender: To promote gender diversity, no team should have a majority of students
of the same gender.**

**3. Current CGPA: To balance academic performance, teams should not consist
predominantly of students with very high or very low CGPAs.**

The objective of your program is to strive for balanced and diverse team compositions,
taking into account the aforementioned criteria. Some tolerance is acceptable, if it happens
that a tutorial group is dominated by students from the same background or profile.

Our solution uses a greedy algorithm to build teams, then analyzes the diversity of the resulting teams and generates visualizations to report on its effectiveness.

## Computational Thinking

In this project, we applied the following computational thinking concepts.

**Decomposition**: We divided the problem into smaller subproblems and worked on them individually, using multiple helper functions to address each part of the problem.

**Abstraction**: We ignored details that are irrelevant to team formation, such as names and student IDs, to concentrate on the important information.

**Algorithm Design**: We solved the problem through a sequence of steps, following the flowchart we developed to keep the logic clear.

## Importing Modules

First, we install and import the necessary library. We use matplotlib.pyplot to generate our final analysis visualizations.

In [None]:
pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

## Flowchart

[![20251109185512.svg](https://raw.githubusercontent.com/GXY-Allen/images/main/2025/11/20251109185512.svg)](https://raw.githubusercontent.com/GXY-Allen/images/main/2025/11/20251109185512.svg)

## Data Input/Output Function

We use two helper functions to handle file operations: one to read the student data from the CSV file (records.csv) and one to write the final teams to a new CSV file.

### Reading Student Data

The read_csv function opens the specified file, skips the header row, and reads each subsequent line. It splits each line by the comma and returns a list of lists, where each inner list represents one student's data.

1. The list $lines$ contains every line of the CSV file, including the header, in the form below.

In [None]:
#[(header),'(tutorial group),(student ID),(School),(Name),(Gender),(CGPA)\n',...]
#Eg:['Tutorial Group,Student ID,School,Name,Gender,CGPA\n', 'G-1,5002,CCDS,Aarav Singh,Male,4.02\n', 'G-1,3838,EEE,Aarti Nair,Female,4.05\n',...]

2. The $for$ loop skips the header by iterating over $lines[1:]$.
3. $fields$ is a list created by the expression $line.strip().split(',')$, where $.strip()$ removes whitespace characters from both ends of a string and $.split(',')$ splits the string into a list at each comma.
4. The output $data$ contains all the students' information as a list of lists.

In [None]:
def read_csv(file_path):
    # Read student data from CSV file and return as list of lists
    data = []
    with open(file_path, 'r') as file:
        lines = file.readlines()

    for line in lines[1:]:  # Skip header
        fields = line.strip().split(',')
        data.append(fields)

    return data
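Splitting on every comma works as long as no field contains a comma itself. As a hedged alternative, the sketch below uses Python's standard `csv` module, which also handles quoted fields. The function name `read_csv_rows` and the quoted name in the sample are illustrative assumptions, not part of the project code or of records.csv.

```python
import csv
import io

def read_csv_rows(file_obj):
    """Return every data row (header skipped) as a list of strings."""
    reader = csv.reader(file_obj)
    next(reader, None)  # skip the header row; the default avoids an error on empty input
    return [row for row in reader if row]  # drop blank lines

# In-memory sample; the quoted name with a comma is a made-up test case
sample = io.StringIO(
    "Tutorial Group,Student ID,School,Name,Gender,CGPA\n"
    'G-1,5002,CCDS,"Singh, Aarav",Male,4.02\n'
    "G-1,3838,EEE,Aarti Nair,Female,4.05\n"
)
rows = read_csv_rows(sample)
print(rows[0])  # the quoted name stays in a single field
```

With a real file, the same function could be called as `read_csv_rows(open("records.csv", newline=""))` inside a `with` block.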

### Writing Team Data

The write_teams_to_csv function takes the final list of all teams and writes them to a formatted text file (though named .csv, the output is more of a human-readable report). It also adds group headers whenever the tutorial group changes.

In [None]:
# Write to csv file
def write_teams_to_csv(all_teams, filename="teams_output.csv"):
    # Write all teams to a CSV file in the specified format with group separation
    with open(filename, 'w') as file:
        overall_team_num = 1
        current_group = None

        for team in all_teams:
            # Check if we're in a new group
            group_name = team[0]['group']
            if group_name != current_group:
                current_group = group_name
                # Write group header
                file.write(f"Group {group_name}:\n")
                file.write("=" * 50 + "\n\n")

            # Write team header
            file.write(f"Team {overall_team_num}\n")
            file.write("-" * 50 + "\n")

            # Write each student in the team
            for student_num, student in enumerate(team, 1):
                file.write(f"{student_num}. {student['id']}, {student['name']}, {student['school']}, {student['gender']}, {student['cgpa']}\n")

            # Add blank line between teams
            file.write("\n")
            overall_team_num += 1

    print(f"Teams written to '{filename}'")
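If a machine-readable file is wanted in addition to the report, one option is to write one row per student with the standard `csv` module. This is only a sketch: the function name `write_teams_as_rows` and the column layout are assumptions made for illustration, and it writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

def write_teams_as_rows(all_teams, file_obj):
    """Write one CSV row per student, tagged with a running team number."""
    writer = csv.writer(file_obj, lineterminator="\n")
    writer.writerow(["Team", "Tutorial Group", "Student ID", "Name", "School", "Gender", "CGPA"])
    for team_num, team in enumerate(all_teams, 1):
        for s in team:
            writer.writerow([team_num, s["group"], s["id"], s["name"],
                             s["school"], s["gender"], s["cgpa"]])

# One small team in the same dictionary format the project uses
demo_teams = [[
    {"group": "G-1", "id": "5002", "name": "Aarav Singh", "school": "CCDS", "gender": "Male", "cgpa": 4.02},
    {"group": "G-1", "id": "3838", "name": "Aarti Nair", "school": "EEE", "gender": "Female", "cgpa": 4.05},
]]
buffer = io.StringIO()
write_teams_as_rows(demo_teams, buffer)
print(buffer.getvalue())
```

A file written this way can be loaded back with `csv.reader` or a spreadsheet, whereas the report format above is meant for reading by eye.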

## Data Structuring
The $create\_data\_structure$ function converts the flat list data into a dictionary nested by tutorial group and student ID.
1. From the list $data$, we retrieve each student's information.
2. From each student's information, we retrieve their tutorial group and student ID.
3. We arrange the records into a nested dictionary called $data\_p$.
4. We can access a student's information with $data\_p[tutorial\_group][student\_id]["school"/"name"/"gender"/"cgpa"]$.

In [None]:
def create_data_structure(data):
    # Convert flat list data into nested dictionary by tutorial group and student ID
    data_p = {}
    for row in data:
        tutorial_group = row[0]
        student_id = row[1]

        if tutorial_group not in data_p:
            data_p[tutorial_group] = {}

        data_p[tutorial_group][student_id] = {
            "school": row[2],
            "name": row[3],
            "gender": row[4],
            "cgpa": float(row[5])
        }

    return data_p
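To make the nested layout concrete, here is a small self-contained sketch that builds the same shape from three rows and then looks up one value. `build_nested` is a compact stand-in written for this example (it uses `dict.setdefault` instead of the explicit membership test), and the G-2 row is made up.

```python
def build_nested(rows):
    """Group rows as {tutorial_group: {student_id: {field: value}}}."""
    nested = {}
    for group, student_id, school, name, gender, cgpa in rows:
        # setdefault creates the inner dictionary the first time a group appears
        nested.setdefault(group, {})[student_id] = {
            "school": school, "name": name, "gender": gender, "cgpa": float(cgpa),
        }
    return nested

rows = [
    ["G-1", "5002", "CCDS", "Aarav Singh", "Male", "4.02"],
    ["G-1", "3838", "EEE", "Aarti Nair", "Female", "4.05"],
    ["G-2", "1234", "CCDS", "Example Student", "Female", "3.90"],  # made-up row
]
data_p = build_nested(rows)
print(sorted(data_p))                  # tutorial groups found
print(data_p["G-1"]["3838"]["cgpa"])   # one student's CGPA as a float
```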

**Defining Diversity Scores**

We define the following scoring functions:

1.   score_school() for school diversity
2.   score_gender() for gender balance
3.   average() for mean calculation
4.   score_cgpa() for CGPA balance


**School Diversity Score**



1.   school_counts: count how many times each school appears in the team
2.   max_school_count: find the largest count using the max function
3.   return a score of 100/50/0 based on max_school_count:
* If all students are from different schools (max count = 1) → 100 points.
* If at most two students share the same school (max count = 2) → 50 points.
* If three or more students share a school → 0 points.




In [None]:

def score_school(schools):
    # Score team based on school diversity
    if len(schools) == 0:
        return 0

    school_counts = {}
    for school in schools:
        school_counts[school] = school_counts.get(school, 0) + 1

    max_school_count = max(school_counts.values())

    if max_school_count == 1:
        return 100
    elif max_school_count == 2:
        return 50
    else:
        return 0
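The three cases can be checked on sample teams. The sketch below re-implements the same 100/50/0 rule with `collections.Counter` so that it runs on its own; `score_school_counter` and the school codes other than CCDS and EEE are illustrative assumptions.

```python
from collections import Counter

def score_school_counter(schools):
    """Same 100/50/0 rule as score_school, with Counter doing the tally."""
    if not schools:
        return 0
    largest = max(Counter(schools).values())
    if largest == 1:
        return 100
    if largest == 2:
        return 50
    return 0

print(score_school_counter(["CCDS", "EEE", "SPMS", "NBS", "MAE"]))    # all different -> 100
print(score_school_counter(["CCDS", "CCDS", "EEE", "NBS", "MAE"]))    # one pair -> 50
print(score_school_counter(["CCDS", "CCDS", "CCDS", "NBS", "MAE"]))   # three from one school -> 0
```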

**Gender Diversity Score**
1.  gender_counts: count the occurrences of each gender
2.  max_gender_count: find the largest count using the max function
3.  max count = 3 → 100 points; max count = 4 → 50 points; otherwise 0.
4.  return the points as the result of score_gender


In [None]:
def score_gender(genders):
    # Score team based on gender diversity
    if len(genders) == 0:
        return 0

    gender_counts = {}
    for g in genders:
        gender_counts[g] = gender_counts.get(g, 0) + 1

    max_gender_count = max(gender_counts.values())

    if max_gender_count == 3:
        return 100
    elif max_gender_count == 4:
        return 50
    else:
        return 0
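A few sample inputs show how this rule behaves, including on a partial team of the kind the greedy loop evaluates while a team is still being filled. The function `score_gender_sketch` is a stand-alone mirror of the rule in the cell above, written only for this illustration.

```python
from collections import Counter

def score_gender_sketch(genders):
    """100 if the largest gender count is 3, 50 if it is 4, otherwise 0."""
    if not genders:
        return 0
    largest = max(Counter(genders).values())
    return {3: 100, 4: 50}.get(largest, 0)

print(score_gender_sketch(["Male", "Male", "Male", "Female", "Female"]))  # 3-2 split -> 100
print(score_gender_sketch(["Male", "Male", "Male", "Male", "Female"]))    # 4-1 split -> 50
print(score_gender_sketch(["Male"] * 5))                                   # 5-0 split -> 0
print(score_gender_sketch(["Male", "Female"]))                             # partial team -> 0
```

Under this rule the gender score stays at 0 until one gender reaches three members, so it starts to influence the greedy choice from the third pick onward.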

**CGPA Balance Score**

**Average Function (average)**
1.   average = sum / length
2.   Compute the arithmetic mean safely. If the list is empty, return 0 to avoid a division-by-zero error.

**CGPA Balance Function**
1.   Find the averages of class_cgpas and team_cgpas.
2.   Compare the team's average CGPA with the average of the whole class (the tutorial group).
3.   Smaller difference = better balance = higher score:
* Difference ≤ 0.15 → 100 points
* ≤ 0.30 → 50 points
* ≤ 0.50 → 20 points
* More than 0.50 → 0 points
4.   Return the points as the result of score_cgpa.



In [None]:
def average(values):
    # Calculate average of a list of values
    if len(values) == 0:
        return 0
    return sum(values) / len(values)

def score_cgpa(team_cgpas, class_cgpas):
    # Score team based on CGPA balance compared to class average
    class_avg = average(class_cgpas)
    group_avg = average(team_cgpas)
    diff = abs(class_avg - group_avg)

    if diff <= 0.15:
        return 100
    elif diff <= 0.3:
        return 50
    elif diff <= 0.5:
        return 20
    else:
        return 0
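As a worked example of the thresholds, the sketch below scores four hypothetical teams against a class whose average CGPA is 4.0. The function `cgpa_band` and all CGPA values are made up for illustration; the values are chosen to sit well inside each band so that floating-point rounding cannot change the result.

```python
def mean(values):
    """Arithmetic mean, returning 0 for an empty list."""
    return sum(values) / len(values) if values else 0

def cgpa_band(team_cgpas, class_cgpas):
    """Map the gap between team mean and class mean to 100/50/20/0."""
    diff = abs(mean(class_cgpas) - mean(team_cgpas))
    if diff <= 0.15:
        return 100
    if diff <= 0.3:
        return 50
    if diff <= 0.5:
        return 20
    return 0

class_cgpas = [3.5, 4.0, 4.5]                      # class mean is 4.0
print(cgpa_band([4.0, 4.1, 4.2], class_cgpas))     # gap about 0.10 -> 100
print(cgpa_band([4.2, 4.3, 4.2], class_cgpas))     # gap about 0.23 -> 50
print(cgpa_band([4.4, 4.4, 4.4], class_cgpas))     # gap about 0.40 -> 20
print(cgpa_band([4.7, 4.8, 4.9], class_cgpas))     # gap about 0.80 -> 0
```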

**Total Team Score**
1.   Build the lists schools, genders, and cgpas from the team.
2.   Calculate score_school, score_gender, and score_cgpa.
3.   Add the three subscores together to get the team's total score.


In [None]:
def team_score(team, class_cgpas):
    # Calculate total score for a team
    schools = [p['school'] for p in team]
    genders = [p['gender'] for p in team]
    cgpas = [p['cgpa'] for p in team]

    return score_school(schools) + score_gender(genders) + score_cgpa(cgpas, class_cgpas)


**The Team Formation Algorithm (form_teams_for_group)**

This function forms the teams within a single tutorial group.
1.   Initialize the working lists:
* teams → stores completed teams.
* available_students → a copy of all students to choose from.
* class_cgpas → list of all CGPAs in this tutorial group, used to calculate the group's average CGPA.
2.   Outer while loop:
* Continue forming teams as long as there are enough students (≥ team_size).
3.   Inner while loop:
* Keep adding students until the team reaches the desired size.
4.   Greedy method:
* Tentatively add each available candidate to the current team (test_team) and calculate its score with team_score.
* Compare the scores to find the best available candidate.
* Add the best candidate to the team and remove him/her from available_students.
* Repeat until all team_size members (5 by default) are chosen.
* Append the team to the list teams.

In [None]:
def form_teams_for_group(students, team_size=5):
    # Form teams for a tutorial group using greedy algorithm
    teams = []
    available_students = students[:]
    class_cgpas = [s['cgpa'] for s in students]

    while len(available_students) >= team_size:
        team = []

        while len(team) < team_size and available_students:
            best_score = -999
            best_candidate = None

            for candidate in available_students:
                test_team = team + [candidate]
                test_score = team_score(test_team, class_cgpas)

                if test_score > best_score:
                    best_score = test_score
                    best_candidate = candidate

            if best_candidate:
                team.append(best_candidate)
                available_students.remove(best_candidate)

        if len(team) == team_size:
            teams.append(team)


**Handle Remaining Students**

The following lines complete form_teams_for_group.

1.   If there are students left over:
* distribute them one at a time, in round-robin order, across the existing teams
* if no complete team was formed, create one team with all remaining students


In [None]:
# Handle remaining students
    if available_students:
        if teams:
            # Distribute remaining students to existing teams
            for i, student in enumerate(available_students):
                teams[i % len(teams)].append(student)
        else:
            # If no complete teams formed, create one team with all remaining
            teams.append(available_students)

    return teams
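The greedy selection and the leftover handling can be traced by hand on a tiny input. The sketch below is a simplified, self-contained illustration rather than the project's function: `toy_score` is a hypothetical stand-in for team_score that simply counts distinct schools and genders, and the five students are made up.

```python
def toy_score(team):
    """Simplified stand-in for team_score: distinct schools plus distinct genders."""
    return len({s["school"] for s in team}) + len({s["gender"] for s in team})

def greedy_teams(students, team_size):
    """Build teams one member at a time, always taking the best-scoring candidate."""
    teams, pool = [], list(students)
    while len(pool) >= team_size:
        team = []
        while len(team) < team_size:
            # max() keeps the first of tied candidates, like a strict '>' comparison
            best = max(pool, key=lambda c: toy_score(team + [c]))
            team.append(best)
            pool.remove(best)
        teams.append(team)
    if teams:
        # Deal leftovers round-robin onto the finished teams
        for i, student in enumerate(pool):
            teams[i % len(teams)].append(student)
    elif pool:
        teams.append(pool)
    return teams

students = [
    {"name": "A", "school": "CCDS", "gender": "Male"},
    {"name": "B", "school": "CCDS", "gender": "Male"},
    {"name": "C", "school": "EEE", "gender": "Female"},
    {"name": "D", "school": "EEE", "gender": "Male"},
    {"name": "E", "school": "CCDS", "gender": "Female"},
]
result = [[s["name"] for s in team] for team in greedy_teams(students, team_size=2)]
print(result)  # [['A', 'C', 'E'], ['B', 'D']]
```

A starts the first team because all single-member scores tie; C then adds both a new school and a new gender. After B and D form the second team, the one leftover student, E, is appended to the first team.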

**Forming Teams for All Groups**
1. Iterate through each tutorial group in the dataset.
2. For each group, extract all students and convert them into a list.
3. Add the fields id and group for identification.
4. Call the greedy algorithm (form_teams_for_group) to form teams.
5. Combine all teams from different groups into one list.
6. Return the complete list all_teams.

In [None]:
def _sort_helper_(elem):
    # Function to help sort the tutorial group in order
    return int(elem[2:])

def form_all_teams(data_structure):
    # Form teams for all tutorial groups
    all_teams = []

    for group_name in sorted(data_structure.keys(), key = _sort_helper_):
        group_data = data_structure[group_name]

        # Convert dictionary to list of student dictionaries
        students = []
        for student_id, info in group_data.items():
            student = info.copy()
            student['id'] = student_id
            student['group'] = group_name
            students.append(student)

        teams = form_teams_for_group(students)
        all_teams.extend(teams)

    return all_teams
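The sort helper matters because group names are strings, and plain string sorting would place G-10 before G-2. A short check, assuming every name has the form `G-<number>` as in the sample data:

```python
groups = ["G-10", "G-2", "G-1", "G-120"]

by_text = sorted(groups)                               # character-by-character order
by_number = sorted(groups, key=lambda g: int(g[2:]))   # order by the number after "G-"

print(by_text)    # ['G-1', 'G-10', 'G-120', 'G-2']
print(by_number)  # ['G-1', 'G-2', 'G-10', 'G-120']
```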

**Analysis & Visualization**

After forming the teams, we evaluate how well the algorithm performed.

First, we use a helper function called calculate_diversity_stats to calculate diversity statistics. The function iterates through every team created and gathers data for our plots. It calculates:

- The number of unique schools in each team.

- The number of unique genders in each team.

- The CGPA range (max - min) in each team.

- The number of "violations" (defined as >60% of a team being from one school or gender).

The steps are as follows:

1. Define the function calculate_diversity_stats(all_teams).

2. Initialize a dictionary stats = { ... } to store the total team count, lists for team-specific scores, and counters for violations.

3. Iterate through each team with for team in all_teams.

4. Calculate school diversity by counting the unique_schools in the team, and check for a violation by testing whether max_school / len(team) > 0.6.

5. Calculate gender diversity by counting the unique_genders in the team, and check for a violation by testing whether max_gender / len(team) > 0.6.

6. Calculate the team's CGPA spread as cgpa_range = max(cgpas) - min(cgpas).

7. Calculate the averages across all teams, such as stats['avg_unique_schools'] = average(...).

8. Return the dictionary stats containing all calculated statistics.


In [None]:
def calculate_diversity_stats(all_teams):
    # Calculate diversity statistics for visualization
    # Dictionary stat to store
    stats = {
        'total_teams': len(all_teams),
        'school_diversity_scores': [],
        'gender_diversity_scores': [],
        'cgpa_ranges': [],
        'school_violations': 0,
        'gender_violations': 0
    }

    for team in all_teams:
        # School diversity
        schools = [s['school'] for s in team]
        unique_schools = len(set(schools))
        stats['school_diversity_scores'].append(unique_schools)

        school_counts = {}
        for school in schools:
            school_counts[school] = school_counts.get(school, 0) + 1
        max_school = max(school_counts.values())
        if max_school / len(team) > 0.6:
            stats['school_violations'] += 1

        # Gender diversity
        genders = [s['gender'] for s in team]
        unique_genders = len(set(genders))
        stats['gender_diversity_scores'].append(unique_genders)

        gender_counts = {}
        for gender in genders:
            gender_counts[gender] = gender_counts.get(gender, 0) + 1
        max_gender = max(gender_counts.values())
        if max_gender / len(team) > 0.6:
            stats['gender_violations'] += 1

        # CGPA range
        cgpas = [s['cgpa'] for s in team]
        cgpa_range = max(cgpas) - min(cgpas)
        stats['cgpa_ranges'].append(cgpa_range)

    stats['avg_unique_schools'] = average(stats['school_diversity_scores'])
    stats['avg_cgpa_range'] = average(stats['cgpa_ranges'])

    return stats
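The strict `> 0.6` comparison decides which teams count as violations: in a team of five, three members sharing an attribute (exactly 60%) is not a violation, while four is. The stand-alone sketch below illustrates this; `is_violation` is a hypothetical helper, not a function from the project.

```python
from collections import Counter

def is_violation(values, threshold=0.6):
    """True when the most common value makes up more than `threshold` of the team."""
    return max(Counter(values).values()) / len(values) > threshold

print(is_violation(["M", "M", "M", "F", "F"]))        # 3/5 = 0.6, not above -> False
print(is_violation(["M", "M", "M", "M", "F"]))        # 4/5 = 0.8 -> True
print(is_violation(["M", "M", "M", "M", "F", "F"]))   # 4/6 is about 0.67 -> True
```

The six-member case corresponds to a team that received one leftover student.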

After calculating the diversity statistics, we use the matplotlib library to visualize how well the algorithm works. The figure has four panels:

- Top-Left: A histogram showing the distribution of unique schools per team. (Ideally, we want high numbers, like 3, 4, or 5).

- Top-Right: A histogram showing the distribution of unique genders per team. (Ideally, we want all teams to have 2).

- Bottom-Left: A histogram of the CGPA range (max - min) within teams. (A lower average range is acceptable, but a wider range shows that teams mix students with higher and lower CGPAs.)

- Bottom-Right: A text box summarizing the key statistics, including violation rates.

This visualization is saved as diversity_analysis.png and also displayed to the user.

The steps are as follows:

1. Define the function visualize_diversity(stats).

2. Set up a 2x2 grid of charts (subplots) using fig, axes = plt.subplots(2, 2, ...) and add a main title.

3. Chart 1 (Top-Left): Create a histogram axes[0, 0].hist(...) to show the distribution of unique schools per team.

4. Add a red dashed line axes[0, 0].axvline(...) to mark the average number of unique schools.

5. Chart 2 (Top-Right): Create a histogram axes[0, 1].hist(...) to show the distribution of unique genders per team.

6. Chart 3 (Bottom-Left): Create a histogram axes[1, 0].hist(...) to show the distribution of CGPA ranges within teams.

7. Add a red dashed line axes[1, 0].axvline(...) to mark the average CGPA range.

8. Chart 4 (Bottom-Right): Turn off the axis axes[1, 1].axis('off') and create a formatted summary_text string.

9. Display the summary_text using axes[1, 1].text(...) to show key statistics like violation rates and averages.

10. Adjust the layout for clarity plt.tight_layout() and save the final composite image to a file plt.savefig(...).

In [None]:
def visualize_diversity(stats):
    # Create visualizations to show algorithm effectiveness
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle('Team Formation Algorithm', fontsize=16, fontweight='bold')

    # Chart 1: Unique Schools per Team
    axes[0, 0].hist(stats['school_diversity_scores'], bins=range(1, 8), edgecolor='black', color='skyblue')
    axes[0, 0].set_xlabel('Number of Unique Schools in Team')
    axes[0, 0].set_ylabel('Number of Teams')
    axes[0, 0].set_title('School Diversity Distribution')
    axes[0, 0].axvline(stats['avg_unique_schools'], color='red', linestyle='--',
                       label=f'Average: {stats["avg_unique_schools"]:.2f}')
    axes[0, 0].legend()

    # Chart 2: Gender Diversity
    axes[0, 1].hist(stats['gender_diversity_scores'], bins=[1, 2, 3], edgecolor='black', color='lightcoral')
    axes[0, 1].set_xlabel('Number of Unique Genders in Team')
    axes[0, 1].set_ylabel('Number of Teams')
    axes[0, 1].set_title('Gender Diversity Distribution')
    axes[0, 1].set_xticks([1, 2])

    # Chart 3: CGPA Range Distribution
    axes[1, 0].hist(stats['cgpa_ranges'], bins=20, edgecolor='black', color='lightgreen')
    axes[1, 0].set_xlabel('CGPA Range in Team')
    axes[1, 0].set_ylabel('Number of Teams')
    axes[1, 0].set_title('CGPA Distribution Across Teams')
    axes[1, 0].axvline(stats['avg_cgpa_range'], color='red', linestyle='--',
                       label=f'Average: {stats["avg_cgpa_range"]:.3f}')
    axes[1, 0].legend()

    # Chart 4: Summary Statistics
    axes[1, 1].axis('off')
    summary_text = f"""
    DIVERSITY SUMMARY

    Total Teams: {stats['total_teams']}

    School Diversity:
    - Violations (>60% same school): {stats['school_violations']}
    - Violation Rate: {stats['school_violations']/stats['total_teams']*100:.1f}%
    - Avg Unique Schools/Team: {stats['avg_unique_schools']:.2f}

    Gender Diversity:
    - Violations (>60% same gender): {stats['gender_violations']}
    - Violation Rate: {stats['gender_violations']/stats['total_teams']*100:.1f}%

    CGPA Balance:
    - Avg CGPA Range: {stats['avg_cgpa_range']:.3f}

    Algorithm Effectiveness:
    {100 - (stats['school_violations'] + stats['gender_violations'])/stats['total_teams']*50:.1f}%
    """
    axes[1, 1].text(0.1, 0.5, summary_text, fontsize=11, verticalalignment='center',
                    family='monospace', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

    plt.tight_layout()
    plt.savefig('diversity_analysis.png', dpi=300, bbox_inches='tight')
    print("Visualization saved as 'diversity_analysis.png'")
    plt.show()
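The "Algorithm Effectiveness" figure in the summary box is plain arithmetic: 100 minus half of the combined violation rate in percent. The sketch below isolates that formula in a hypothetical helper named `effectiveness`; the violation counts are invented for illustration, and 1,200 is the number of teams obtained when 6,000 students are split into teams of five.

```python
def effectiveness(school_violations, gender_violations, total_teams):
    """100 minus half of the combined violation rate, in percent."""
    return 100 - (school_violations + gender_violations) / total_teams * 50

# Made-up counts: 20 school violations and 40 gender violations among 1,200 teams
value = effectiveness(20, 40, 1200)
print(f"{value:.1f}%")  # 97.5%
```

Because each violation type contributes at most 50 points, a run in which every team violated both criteria would score 0.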


**Additional Requirement: Function to get a team size**

1.   Define the function get_team_size().
2.   Prompt the user to enter a team size between 4 and 10.
3.   Use try/except to catch the ValueError raised when the input is not an integer, and print an error message.
4.   If the input is an integer but outside the range, print an error and ask the user to try again.



In [None]:
# Function to get the team size
def get_team_size():
    # Prompt user to enter team size between 4-10
    while True:
        try:
            team_size = int(input("Enter the number of students per team (4-10): "))
            if 4 <= team_size <= 10:
                return team_size
            else:
                print("Error: Team size must be between 4 and 10. Please try again.")
        except ValueError:
            print("Error: Please enter a valid integer.")
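Because get_team_size reads from the keyboard, it is awkward to check automatically. One possible approach, sketched here as an assumption rather than as part of the project, is to move the validation into a pure function (`parse_team_size`, a hypothetical name) that takes the typed text and returns either a valid size or None.

```python
def parse_team_size(text, low=4, high=10):
    """Return the team size as an int, or None if text is not an integer in range."""
    try:
        size = int(text)  # int() tolerates surrounding whitespace but rejects "4.5"
    except ValueError:
        return None
    return size if low <= size <= high else None

for raw in ["5", " 7 ", "3", "11", "five", "4.5"]:
    print(repr(raw), "->", parse_team_size(raw))
```

Only the first two inputs yield a size (5 and 7); the others return None, which a prompt loop could treat as "ask again".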

**Main Execution**


In [None]:
# Executable Program
import matplotlib.pyplot as plt

# Function to read CSV file
def read_csv(file_path):
    # Read student data from CSV file and return as list of lists
    data = []
    with open(file_path, 'r') as file:
        lines = file.readlines()

    for line in lines[1:]:  # Skip header
        fields = line.strip().split(',')
        data.append(fields)

    return data

# Write to csv file
def write_teams_to_csv(all_teams, filename="teams_output.csv"):
    # Write all teams to a CSV file in the specified format with group separation
    with open(filename, 'w') as file:
        overall_team_num = 1
        current_group = None

        for team in all_teams:
            # Check if we're in a new group
            group_name = team[0]['group']
            if group_name != current_group:
                current_group = group_name
                # Write group header
                file.write(f"Group {group_name}:\n")
                file.write("=" * 50 + "\n\n")

            # Write team header
            file.write(f"Team {overall_team_num}\n")
            file.write("-" * 50 + "\n")

            # Write each student in the team
            for student_num, student in enumerate(team, 1):
                file.write(f"{student_num}. {student['id']}, {student['name']}, {student['school']}, {student['gender']}, {student['cgpa']}\n")

            # Add blank line between teams
            file.write("\n")
            overall_team_num += 1

    print(f"Teams written to '{filename}'")

# Convert data to nested dictionary structure
def create_data_structure(data):
    # Convert flat list data into nested dictionary by tutorial group and student ID
    data_p = {}
    for row in data:
        tutorial_group = row[0]
        student_id = row[1]

        if tutorial_group not in data_p:
            data_p[tutorial_group] = {}

        data_p[tutorial_group][student_id] = {
            "school": row[2],
            "name": row[3],
            "gender": row[4],
            "cgpa": float(row[5])
        }

    return data_p

# Scoring functions
def score_school(schools):
    # Score team based on school diversity
    if len(schools) == 0:
        return 0

    school_counts = {}
    for school in schools:
        school_counts[school] = school_counts.get(school, 0) + 1

    max_school_count = max(school_counts.values())

    if max_school_count == 1:
        return 100
    elif max_school_count == 2:
        return 50
    else:
        return 0

def score_gender(genders):
    # Score team based on gender diversity
    if len(genders) == 0:
        return 0

    gender_counts = {}
    for g in genders:
        gender_counts[g] = gender_counts.get(g, 0) + 1

    max_gender_count = max(gender_counts.values())

    if 2 <= max_gender_count <= 3:
        return 100
    elif max_gender_count == 1 or max_gender_count == 4:
        return 50
    else:
        return 0

def average(values):
    # Calculate average of a list of values
    if len(values) == 0:
        return 0
    return sum(values) / len(values)

def score_cgpa(team_cgpas, class_cgpas):
    # Score team based on CGPA balance compared to class average
    class_avg = average(class_cgpas)
    group_avg = average(team_cgpas)
    diff = abs(class_avg - group_avg)

    if diff <= 0.15:
        return 100
    elif diff <= 0.3:
        return 50
    elif diff <= 0.5:
        return 20
    else:
        return 0

def team_score(team, class_cgpas):
    # Calculate total score for a team
    schools = [p['school'] for p in team]
    genders = [p['gender'] for p in team]
    cgpas = [p['cgpa'] for p in team]

    return score_school(schools) + score_gender(genders) + score_cgpa(cgpas, class_cgpas)

def form_teams_for_group(students, team_size=5):
    # Form teams for a tutorial group
    teams = []
    available_students = students[:]
    class_cgpas = [s['cgpa'] for s in students]

    while len(available_students) >= team_size:
        team = []

        while len(team) < team_size and available_students:
            best_score = -999
            best_candidate = None

            for candidate in available_students:
                test_team = team + [candidate]
                test_score = team_score(test_team, class_cgpas)

                if test_score > best_score:
                    best_score = test_score
                    best_candidate = candidate

            if best_candidate:
                team.append(best_candidate)
                available_students.remove(best_candidate)

        if len(team) == team_size:
            teams.append(team)

    # Handle remaining students
    if available_students:
        if teams:
            # Distribute remaining students to existing teams
            for i, student in enumerate(available_students):
                teams[i % len(teams)].append(student)
        else:
            # If no complete teams formed, create one team with all remaining
            teams.append(available_students)

    return teams

def _sort_helper_(elem):
    return int(elem[2:])

def form_all_teams(data_structure, team_size=5):
    # Form teams for all tutorial groups
    all_teams = []
    for group_name in sorted(data_structure.keys(), key = _sort_helper_):
        group_data = data_structure[group_name]
        # Convert dictionary to list of student dictionaries
        students = []
        for student_id, info in group_data.items():
            student = info.copy()
            student['id'] = student_id
            student['group'] = group_name
            students.append(student)

        teams = form_teams_for_group(students, team_size)
        all_teams.extend(teams)

    return all_teams

def calculate_diversity_stats(all_teams):
    # Calculate diversity statistics for visualization
    # Dictionary stat to store
    stats = {
        'total_teams': len(all_teams),
        'school_diversity_scores': [],
        'gender_diversity_scores': [],
        'cgpa_ranges': [],
        'school_violations': 0,
        'gender_violations': 0
    }

    for team in all_teams:
        # School diversity
        schools = [s['school'] for s in team]
        unique_schools = len(set(schools))
        stats['school_diversity_scores'].append(unique_schools)

        school_counts = {}
        for school in schools:
            school_counts[school] = school_counts.get(school, 0) + 1
        max_school = max(school_counts.values())
        if max_school / len(team) > 0.6:
            stats['school_violations'] += 1

        # Gender diversity
        genders = [s['gender'] for s in team]
        unique_genders = len(set(genders))
        stats['gender_diversity_scores'].append(unique_genders)

        gender_counts = {}
        for gender in genders:
            gender_counts[gender] = gender_counts.get(gender, 0) + 1
        max_gender = max(gender_counts.values())
        if max_gender / len(team) > 0.6:
            stats['gender_violations'] += 1

        # CGPA range
        cgpas = [s['cgpa'] for s in team]
        cgpa_range = max(cgpas) - min(cgpas)
        stats['cgpa_ranges'].append(cgpa_range)

    stats['avg_unique_schools'] = average(stats['school_diversity_scores'])
    stats['avg_cgpa_range'] = average(stats['cgpa_ranges'])

    return stats

def visualize_diversity(stats):
    # Create visualizations to show algorithm effectiveness
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle('Team Formation Algorithm', fontsize=16, fontweight='bold')

    # Chart 1: Unique Schools per Team
    axes[0, 0].hist(stats['school_diversity_scores'], bins=range(1, 8), edgecolor='black', color='skyblue')
    axes[0, 0].set_xlabel('Number of Unique Schools in Team')
    axes[0, 0].set_ylabel('Number of Teams')
    axes[0, 0].set_title('School Diversity Distribution')
    axes[0, 0].axvline(stats['avg_unique_schools'], color='red', linestyle='--',
                       label=f'Average: {stats["avg_unique_schools"]:.2f}')
    axes[0, 0].legend()

    # Chart 2: Gender Diversity
    axes[0, 1].hist(stats['gender_diversity_scores'], bins=[1, 2, 3], edgecolor='black', color='lightcoral')
    axes[0, 1].set_xlabel('Number of Unique Genders in Team')
    axes[0, 1].set_ylabel('Number of Teams')
    axes[0, 1].set_title('Gender Diversity Distribution')
    axes[0, 1].set_xticks([1, 2])

    # Chart 3: CGPA Range Distribution
    axes[1, 0].hist(stats['cgpa_ranges'], bins=20, edgecolor='black', color='lightgreen')
    axes[1, 0].set_xlabel('CGPA Range in Team')
    axes[1, 0].set_ylabel('Number of Teams')
    axes[1, 0].set_title('CGPA Distribution Across Teams')
    axes[1, 0].axvline(stats['avg_cgpa_range'], color='red', linestyle='--',
                       label=f'Average: {stats["avg_cgpa_range"]:.3f}')
    axes[1, 0].legend()

    # Chart 4: Summary Statistics
    axes[1, 1].axis('off')
    summary_text = f"""
    DIVERSITY SUMMARY

    Total Teams: {stats['total_teams']}

    School Diversity:
    - Violations (>60% same school): {stats['school_violations']}
    - Violation Rate: {stats['school_violations']/stats['total_teams']*100:.1f}%
    - Avg Unique Schools/Team: {stats['avg_unique_schools']:.2f}

    Gender Diversity:
    - Violations (>60% same gender): {stats['gender_violations']}
    - Violation Rate: {stats['gender_violations']/stats['total_teams']*100:.1f}%

    CGPA Balance:
    - Avg CGPA Range: {stats['avg_cgpa_range']:.3f}

    Algorithm Effectiveness:
    {100 - (stats['school_violations'] + stats['gender_violations'])/stats['total_teams']*50:.1f}%
    """
    axes[1, 1].text(0.1, 0.5, summary_text, fontsize=11, verticalalignment='center',
                    family='monospace', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

    plt.tight_layout()
    plt.savefig('diversity_analysis.png', dpi=300, bbox_inches='tight')
    print("Visualization saved as 'diversity_analysis.png'")
    plt.show()

def get_team_size():
    """Prompt the user until they enter a team size between 4 and 10."""
    while True:
        try:
            team_size = int(input("Enter the number of students per team (4-10): "))
        except ValueError:
            print("Error: Please enter a valid integer.")
            continue
        if 4 <= team_size <= 10:
            return team_size
        print("Error: Team size must be between 4 and 10. Please try again.")

def run_pipeline(team_size=None):
    """Read records, form teams, export them, and visualize diversity.

    When team_size is None, form_all_teams falls back to its own default.
    """
    # Read and process data
    data = read_csv("records.csv")
    data_structure = create_data_structure(data)

    # Example: Print one tutorial group's data
    print("Sample data from G-1:")
    print(data_structure.get('G-1', {}))
    print("----------------------------------------------")

    # Form teams, passing the team size only when one was specified
    if team_size is None:
        all_teams = form_all_teams(data_structure)
    else:
        all_teams = form_all_teams(data_structure, team_size)

    # Print team information
    print(f"\nTotal teams formed: {len(all_teams)}")
    for i, team in enumerate(all_teams[:3], 1):  # Show first 3 teams
        print(f"\nTeam {i}:")
        for student in team:
            print(f"  - {student['name']} ({student['id']}) - {student['school']} - CGPA: {student['cgpa']}")

    # Write teams to CSV file
    write_teams_to_csv(all_teams, "teams_output.csv")

    # Calculate and visualize diversity
    stats = calculate_diversity_stats(all_teams)
    visualize_diversity(stats)

def main_with_custom_team_size():
    """Main execution with a user-specified team size."""
    team_size = get_team_size()
    print(f"\nForming teams with {team_size} students per team...\n")
    run_pipeline(team_size)

def main_with_default_team_size():
    """Main execution with the default team size."""
    run_pipeline()

# Main execution
if __name__ == "__main__":
    print("\nChoose an option:")
    print("1. Use default team size")
    print("2. Specify custom team size (4-10 students)")
    while True:
        choice = input("\nEnter your choice (1 or 2): ").strip()

        if choice == "1":
            main_with_default_team_size()
            break
        elif choice == "2":
            main_with_custom_team_size()
            break
        else:
            print("Error: Please enter either 1 or 2.")

**Problems Faced During the Project**

While working on this project, we encountered the following problems:
1. Some tutorial groups had a significantly imbalanced ratio of male to female students. One example is Group G-4, which has 16 male and 34 female students.
2. We had to allocate teams based on three different factors (School Affiliation, CGPA, and Gender) at the same time, which made the algorithm much more complicated.
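
To illustrate why a group such as G-4 cannot be perfectly balanced, here is a minimal sketch that is not part of our submitted program. It assumes teams of five and that a violation means more than 60% of a team (four or more of five members) share the same gender, matching the label in the summary panel of our visualization. The function name `min_violating_teams` and its parameters are hypothetical.

```python
def min_violating_teams(n_male, n_female, team_size=5, max_same=3):
    """Fewest teams that must exceed `max_same` members of one gender.

    Assumes two gender categories and that the group splits evenly
    into teams of `team_size` (an assumption for this illustration).
    """
    total = n_male + n_female
    if total % team_size != 0:
        raise ValueError("group size must be a multiple of team_size")
    n_teams = total // team_size

    INF = float("inf")
    # best[j] = fewest violations after placing j males in the teams so far
    best = [0] + [INF] * n_male
    for _ in range(n_teams):
        nxt = [INF] * (n_male + 1)
        for placed, cost in enumerate(best):
            if cost == INF:
                continue
            # Try every possible number of males in the next team
            for males in range(team_size + 1):
                if placed + males > n_male:
                    break
                females = team_size - males
                violation = int(males > max_same or females > max_same)
                nxt[placed + males] = min(nxt[placed + males], cost + violation)
        best = nxt
    return best[n_male]


# Group G-4: 16 males and 34 females in 10 teams of five
print(min_violating_teams(16, 34))  # prints 2
```

Under this definition, every non-violating team needs at least two male students, so 16 males can cover at most eight of the ten teams. An even spread (six teams with two males, four teams with one) would instead leave four teams with four female members each, which is why some tolerance is unavoidable for such groups.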

**Conclusion**

Our solution forms teams in a way that strives to maximize diversity, reaching about 95% effectiveness under a diversity scoring metric we created to evaluate our results. The visualizations we generated show that most teams are well balanced, with only a few exceptions where diversity was lower.
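
As a rough illustration of how such an effectiveness figure can be derived, the sketch below reproduces the formula used in the summary panel of our visualization, where school and gender violations are each weighted at 50%. It assumes the figure quoted above refers to this score. The function name `effectiveness_score` and the violation counts in the usage line are hypothetical examples, not our measured results.

```python
def effectiveness_score(school_violations, gender_violations, total_teams):
    """Return 100 minus half the combined violation rate, in percent."""
    if total_teams == 0:
        raise ValueError("total_teams must be positive")
    combined = school_violations + gender_violations
    return 100 - combined / total_teams * 50


# Hypothetical counts: 1,200 teams with 60 violations of each kind
print(f"{effectiveness_score(60, 60, 1200):.1f}%")  # prints 95.0%
```

Note that this score does not include CGPA balance, so it should be read together with the CGPA range chart rather than as a complete measure of team quality.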

Throughout this project, we learned to work together more effectively and saw the importance of clear communication and coordinated effort. Our programming skills improved as we tackled different technical challenges. We also gained an appreciation for patience during debugging, recognizing that short breaks can help us return with clearer thinking and better problem-solving strategies.

# APPENDIX B: USE OF AI TOOL(S) IN PROJECT WORK

Each team member should indicate either **A** or **B**.

**A.** I affirm that my contribution(s) to the lab work is my own, produced without help from any AI tool(s).

**B.** I affirm that my contribution(s) to the lab work has been produced with the help of AI tool(s).

---

## Team Member Declaration

| Full Name | Date | A or B |
|-----------|------|--------|
| Nguyen Hoang Duong | 9/11/2025 | A |
| Ke Qiyun | 9/11/2025 | A |
| GAO XINYU | 9/11/2025 | A |

---

By including this information in your Jupyter notebook, you declare that the above affirmation made is true and that you have read and understood NTU's policy on the use of AI tools.

---

## AI Tool Usage Documentation

**If any team member answered B, the team member(s) must indicate and replicate the table below for every instance in which an AI tool is used.**

| Field | Details |
|-------|---------|
| **Name of AI tool** | *< For example, ChatGPT >* |
| **Input prompt** | *< Insert the question that you asked ChatGPT >* |
| **Date generated** | |
| **Output generated** | *< Insert the response verbatim from ChatGPT >* |
| **Output screenshot** | |
| **Impact on submission** | *< Briefly explain which part of your submitted work was ChatGPT's response applied >* |