# Read CSV File

* Import the CSV file using the `csv` module

* Read the CSV file and push its data into the `students_by_groups` dictionary

* `students_by_groups` is a dictionary whose keys are the values of the `tutorial_group` column in the CSV file, and whose values are objects of the `TutorialGroup` datatype

* The `TutorialGroup` datatype contains an array of `Student` objects

* Each `Student` object has the same parameters as the original data

## Example:

If a student record has the following fields:

```py
{
    'tutorial_group': "G-1",
    'student_id': 5002,
    'school': "CCDS",
    'name': "Aarav Singh",
    'gender': "Male",
    'CGPA': 4.02
}
```

Then the `students_by_groups` would look like this:

```py
students_by_groups = {
    'G-1': TutorialGroup(group_id='G-1')
}

students_by_groups['G-1'].students = [
    Student(group_id='G-1', student_id=5002, school='CCDS', name='Aarav Singh', gender='Male', cgpa=4.02)
]
```
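
As a minimal sketch of the grouping step described above, the snippet below parses a small in-memory CSV with `csv.DictReader` and collects rows by tutorial group. The header names are taken from the example record and may not match the real `records.csv`. The placeholder rows, the `group_rows` helper and the use of plain dictionaries instead of `Student` and `TutorialGroup` objects are assumptions made purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the header names follow the example record above
# and may differ from the real records.csv.
SAMPLE_CSV = """tutorial_group,student_id,school,name,gender,CGPA
G-1,5002,CCDS,Aarav Singh,Male,4.02
G-1,9001,EEE,Placeholder One,Female,4.10
G-2,9002,SBS,Placeholder Two,Female,3.90
"""

def group_rows(csv_text):
    """Return a dict mapping each tutorial group to a list of row dicts."""
    rows_by_group = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Convert the numeric columns; everything else stays a string
        row["student_id"] = int(row["student_id"])
        row["CGPA"] = float(row["CGPA"])
        rows_by_group[row["tutorial_group"]].append(row)
    return dict(rows_by_group)

rows_by_group = group_rows(SAMPLE_CSV)
for group_id, rows in rows_by_group.items():
    print(group_id, [row["name"] for row in rows])
```

Reading by column name rather than by position keeps the code working if the column order changes, at the cost of depending on the exact header spelling.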

## Import required dependencies

In [1]:
import csv # For reading and writing csv files
import math # For ln function we use later
from typing import Dict, List # For type hints
import random # For random number generation

# We will define classes to organise the data and group related functions as methods

$$\color{green}{\Huge{\textbf{Student class}}}$$

In [2]:
class Student:
    def __init__(self, group_id: str, student_id: int, school: str, name: str, gender: str, cgpa: float):
        self.group_id = group_id
        self.student_id = student_id
        self.school = school
        self.name = name
        self.gender = gender
        self.cgpa = cgpa
    
    def __str__(self):
        return f"{self.student_id} {self.school} {self.name} {self.gender} {self.cgpa}"

$$\color{green}{\Huge{\textbf{Tutorial Group Class}}}$$

In [3]:
class TutorialGroup:
    def __init__(self, group_id: str):
        self.group_id = group_id
        self.students = []
        
    def __str__(self):
        return f"{self.group_id}:\n{[str(student) for student in self.students]}"
    
    def add_student(self, student: Student):
        self.students.append(student)


## Read and store data

In [35]:
students_by_groups: Dict[str, TutorialGroup] = {

}

students_by_id: Dict[str, Student] = {
    
}

with open('records.csv', mode='r') as file:
    # Create a CSV reader
    csv_reader = csv.reader(file)
    next(csv_reader)

    # Append students to corresponding tutorial groups
    for row in csv_reader:

        tutorial_group = row[0]
        student_id = int(row[1])  # Convert to int
        school = row[2]
        name = row[3]
        gender = row[4]
        cgpa = float(row[5])  # Convert to float

        if tutorial_group not in students_by_groups:
            students_by_groups[tutorial_group] = TutorialGroup(tutorial_group)

        # Create one Student object and share it between both lookups
        student = Student(tutorial_group, student_id, school, name, gender, cgpa)
        students_by_id[str(student_id)] = student
        students_by_groups[tutorial_group].add_student(student)
        

# Diversity Score Calculation

The diversity score of a group is the sum of the diversity scores of every pair of students within the group

Each student can be characterised by 3 parameters:

- School

- Gender

- CGPA
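
The group score described above, a sum over every unordered pair of members, can be sketched as follows. The names `group_score` and `toy_distance` are hypothetical, and the toy distance on plain numbers only stands in for the student distance defined in the next section.

```python
from itertools import combinations

def group_score(members, pair_distance):
    """Sum pair_distance over every unordered pair of members."""
    return sum(pair_distance(a, b) for a, b in combinations(members, 2))

def toy_distance(a, b):
    """Hypothetical stand-in distance: absolute difference of two numbers."""
    return abs(a - b)

# Three members give three pairs: (1, 4), (1, 6) and (4, 6)
print(group_score([1, 4, 6], toy_distance))  # 3 + 5 + 2 = 10
```

A group of $n$ students contributes $n(n-1)/2$ pairs, so scores of groups with different sizes are not directly comparable.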

## Formula

We can define the diversity score for each pair of students $A$ and $B$ as the distance between $A(school_A, gender_A, cgpa_A)$ and $B(school_B, gender_B, cgpa_B)$:

$$
d(A, B) = \sqrt{\text{diff}(school_A, school_B)^2 + \text{diff}(gender_A, gender_B)^2 + \text{diff}(cgpa_A, cgpa_B)^2 }
$$

### Difference of school

The difference of school between 2 students is a fixed constant:

- If the schools are the same, then the difference is 0

- If the schools are different, then the difference is set to a constant $w_s$

### Difference of gender

The difference of gender between 2 students can be formulated in the same way as the difference of school:

- If the genders are the same, then the difference is 0

- If the genders are different, then the difference is set to a constant $w_g$
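
Both categorical rules follow the same pattern: 0 when the values match and a fixed weight when they do not. A minimal sketch, where the helper name `categorical_diff` and the weight values are assumptions for illustration only:

```python
def categorical_diff(value_a, value_b, weight):
    """Return 0 when the two values match, otherwise the given weight."""
    return 0.0 if value_a == value_b else weight

# Illustrative weights; the actual values of w_s and w_g are assumptions
w_s, w_g = 1.0, 0.5
print(categorical_diff("CCDS", "EEE", w_s))   # different schools -> 1.0
print(categorical_diff("Male", "Male", w_g))  # same gender -> 0.0
```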

### Difference of CGPA

There are many ways to calculate the CGPA difference between 2 students. A linear function is the simplest:

$$ \text{diff} = \lvert cgpa_A - cgpa_B \rvert $$

This works in most cases, but it can produce an undesirable solution in which one student in the group has a very high CGPA while the rest have low CGPAs.

To mitigate this, we first normalise $\text{diff}_{CGPA}$ by dividing it by the maximum obtainable difference, $max_{CGPA} - min_{CGPA}$, taken over the whole tutorial group of 50 people.

Let $d$ be the output after normalisation:

$$ d = \frac{\lvert cgpa_A - cgpa_B \rvert}{max_{CGPA} - min_{CGPA}} $$
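
The normalisation can be sketched as below. The function name `normalised_cgpa_diff`, the sample CGPA bounds and the handling of a zero range are assumptions, not part of the original formula.

```python
def normalised_cgpa_diff(cgpa_a, cgpa_b, min_cgpa, max_cgpa):
    """Scale |cgpa_a - cgpa_b| into [0, 1] using the group's CGPA range."""
    cgpa_range = max_cgpa - min_cgpa
    if cgpa_range == 0:
        # Assumption: a group with identical CGPAs has no CGPA difference
        return 0.0
    return abs(cgpa_a - cgpa_b) / cgpa_range

# Hypothetical group whose CGPAs span 3.0 to 5.0
print(normalised_cgpa_diff(4.0, 4.5, 3.0, 5.0))  # 0.5 / 2.0 = 0.25
print(normalised_cgpa_diff(3.0, 5.0, 3.0, 5.0))  # extremes of the group -> 1.0
```

Because the bounds come from the whole tutorial group, $d$ reaches 1 only for the pair formed by the highest and lowest CGPA in that group.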

Next, we construct a function of $d$ that grows sharply as the normalised difference becomes large, so that large differences are penalised heavily. One possible function is defined by:

$$ x = 1 - \frac{1}{e^y} $$

$\color{red}{\text{where x ranges only from 0 to 1}}$

<img src="./FormulaPlot.png" alt="FormulaPlot" width="1200">

As the plot shows, the weight $y$ increases as $x$ approaches 1, rising ever more steeply and growing without bound. Solving for $y$ gives:

$$ y = -\ln(1 - x) $$

whose derivative is:

$$ \frac{dy}{dx} = \frac{1}{1 - x} $$

The derivative is positive for all $x < 1$, so $y$ is strictly increasing: $y(x_1) < y(x_2)$ whenever $x_1 < x_2 < 1$.

<img src="./Derivative.png" alt="Derivative" width="1200">

So now we have our final difference function:

$$ f(d) = -w_c \ln(1 - d) $$

where $w_c$ is the weight of the CGPA term relative to the other parameters
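
To get a feel for how quickly this penalty grows, the sketch below evaluates it at a few values of $d$. The helper name `cgpa_penalty` and the default $w_c = 1$ are assumptions. The cap just below 1 is needed because $\ln(0)$ is undefined, and its value mirrors the one used in the `diff` function later in this notebook.

```python
import math

def cgpa_penalty(d, w_c=1.0, cap=0.9999999999):
    """Return -ln(1 - d) * w_c, capping d just below 1 to keep the log finite."""
    d = min(d, cap)
    return -math.log(1 - d) * w_c

# The penalty rises slowly at first and steeply near d = 1
for d in (0.1, 0.5, 0.9):
    print(d, round(cgpa_penalty(d), 3))  # roughly 0.105, 0.693, 2.303
```

Going from $d = 0.1$ to $d = 0.9$ multiplies the penalty by more than 20, whereas the linear difference would only grow by a factor of 9.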

### $\color{red}{\text{Note}}$

One important consideration is the weight of each parameter. For example, since the data contains only 2 genders, <b>Male</b> and <b>Female</b>, the gender difference between two students should be weighted less (e.g. the other parameters are calculated as normal, while the gender difference is 0 if the genders are the same and 0.5 if they differ).
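
As a worked illustration with assumed values ($w_s = 1$, $w_g = 0.5$, $w_c = 1$), consider two students from different schools and of different genders whose normalised CGPA difference is $0.5$. The CGPA term is:

$$ f(0.5) = -\ln(1 - 0.5) \approx 0.693 $$

and the distance between the pair becomes:

$$ d(A, B) = \sqrt{1^2 + 0.5^2 + 0.693^2} \approx \sqrt{1.730} \approx 1.315 $$

With $w_g = 1$ instead, the same pair would score $\sqrt{1 + 1 + 0.480} \approx 1.575$, so halving the gender weight noticeably reduces how much a gender mismatch contributes.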


## Calculating the difference between each pair of students

We will first rewrite our `TutorialGroup` class to include `get_max_cgpa` and `get_min_cgpa` functions

In [9]:
class TutorialGroup:
    def __init__(self, group_id: str):
        self.group_id = group_id
        self.students: List[Student] = []
        # Cached CGPA bounds; None means "not computed yet"
        self.min_cgpa = None
        self.max_cgpa = None
        
    def __str__(self):
        return f"{self.group_id}:\n{[str(student) for student in self.students]}"
    
    def add_student(self, student: Student):
        self.students.append(student)
        # Invalidate the cached bounds because the membership changed
        self.min_cgpa = None
        self.max_cgpa = None
        
    def get_max_cgpa(self):
        # Compute the highest CGPA once and reuse it on later calls
        if self.max_cgpa is None:
            self.max_cgpa = max((student.cgpa for student in self.students), default=0.0)
        return self.max_cgpa
                
    def get_min_cgpa(self):
        # Compute the lowest CGPA once and reuse it on later calls
        if self.min_cgpa is None:
            self.min_cgpa = min((student.cgpa for student in self.students), default=0.0)
        return self.min_cgpa
        



## Re-read the data to match the updated class definition

In [None]:
students_by_groups: Dict[str, TutorialGroup] = {

}

students_by_id: Dict[str, Student] = {
    
}

with open('records.csv', mode='r') as file:
    # Create a CSV reader
    csv_reader = csv.reader(file)
    next(csv_reader)

    # Append students to corresponding tutorial groups
    for row in csv_reader:

        tutorial_group = row[0]
        student_id = int(row[1])  # Convert to int
        school = row[2]
        name = row[3]
        gender = row[4]
        cgpa = float(row[5])  # Convert to float

        if tutorial_group not in students_by_groups:
            students_by_groups[tutorial_group] = TutorialGroup(tutorial_group)

        # Create one Student object and share it between both lookups
        student = Student(tutorial_group, student_id, school, name, gender, cgpa)
        students_by_id[str(student_id)] = student
        students_by_groups[tutorial_group].add_student(student)
        

In [89]:
def diff(A: Student, B: Student, w_s: float, w_g: float, w_c: float) -> float:
    res = 0
    # If the school is not the same, then add w_s^2 to the result
    if A.school != B.school:
        res += w_s * w_s
    # If the gender is not the same, then add w_g^2 to the result
    if A.gender != B.gender:
        res += w_g * w_g
    # Normalise the CGPA difference by the CGPA range of A's tutorial group
    group = students_by_groups[A.group_id]
    cgpa_range = group.get_max_cgpa() - group.get_min_cgpa()
    # Guard against division by zero when every student has the same CGPA
    d = abs(A.cgpa - B.cgpa) / cgpa_range if cgpa_range > 0 else 0
    # If the difference reaches 1, cap it at 0.9999999999, because ln(0) is undefined
    if d >= 1:
        d = 0.9999999999
    diff_cgpa = - math.log(1 - d) * w_c
    res = res + diff_cgpa * diff_cgpa
    return math.sqrt(res)


A = students_by_id['5002']
B = students_by_id['3838']
C = students_by_id['4173']
D = students_by_id['615']
print(A)
print(B)
dAB = diff(A, B, 1, 1, 1)
print(dAB)
print(C)
print(D) 
dCD = diff(C, D, 1, 1, 1)
print(dCD)
print(dAB > dCD)

5002 CCDS Aarav Singh Male 4.02
3838 EEE Aarti Nair Female 4.05
1.4149553044500742
4173 SBS Evelyn Cheung Female 4.48
615 SPMS Gia Tsai Female 3.89
2.3487639607487933
False


# Matching Algorithm

## Basic Algorithm

The most obvious algorithm is to divide students into teams at random. Note that we don't need to pick a random student from the list each time. Instead, we can shuffle the list once and then pick students in order, which preserves the randomness.

- We will rewrite our `TutorialGroup` class to include `shuffle` and `assign_group` functions, and add a `students_by_teams` dictionary at initialisation

- We will also rewrite our `Student` class to include `team_id`

- Finally, we will create a new class named `TeamGroup` to hold each team of students after assignment, with a diversity score function built on the `diff` function defined above
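
The shuffle-then-slice idea can be sketched independently of the classes. The function name `random_teams`, the optional `seed` parameter and the roster of plain integer IDs are assumptions for illustration.

```python
import random

def random_teams(members, max_pax, seed=None):
    """Shuffle a copy of members and cut it into teams of at most max_pax."""
    rng = random.Random(seed)   # seed is optional; pass one for repeatable runs
    shuffled = list(members)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [shuffled[i:i + max_pax] for i in range(0, len(shuffled), max_pax)]

# Hypothetical roster of 12 student IDs split into teams of up to 5
teams = random_teams(range(1, 13), max_pax=5, seed=0)
print([len(team) for team in teams])  # [5, 5, 2]
```

When the roster size is not a multiple of `max_pax`, the last team is smaller than the others, which also lowers its pairwise diversity score.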

In [2]:
class Student:
    def __init__(self, group_id: str, student_id: int, school: str, name: str, gender: str, cgpa: float):
        self.group_id = group_id
        self.team_id = 0 # Not set yet
        self.student_id = student_id
        self.school = school
        self.name = name
        self.gender = gender
        self.cgpa = cgpa
    
    def assign_team(self, team_id: int):
        self.team_id = team_id
    
    def __str__(self):
        return f"{self.student_id} {self.school} {self.name} {self.gender} {self.cgpa}"

$$\color{green}{\Huge{\textbf{Team Group Class}}}$$

In [24]:
class TeamGroup:
    def __init__(self, team_id: int, group_id: str):
        self.group_id = group_id
        self.team_id = team_id
        self.students = []
        
    def __str__(self):
        return f"{self.group_id}:\n{[str(student) for student in self.students]}"
    
    def add_student(self, student: Student):
        self.students.append(student)
    
    def diversity_score(self):
        score = 0
        for i in range(len(self.students)):
            for j in range(i + 1, len(self.students)):
                score += diff(self.students[i], self.students[j], 1, 1, 1)
        return score

In [42]:
class TutorialGroup:
    def __init__(self, group_id: str):
        self.group_id = group_id
        self.students_by_teams: Dict[int, TeamGroup] = {}
        self.students: List[Student] = []
        # Cached CGPA bounds; None means "not computed yet"
        self.min_cgpa = None
        self.max_cgpa = None
        
    def __str__(self):
        return f"{self.group_id}:\n{[str(student) for student in self.students]}"
    
    def add_student(self, student: Student):
        self.students.append(student)
        # Invalidate the cached bounds because the membership changed
        self.min_cgpa = None
        self.max_cgpa = None
        
    def get_max_cgpa(self):
        # Compute the highest CGPA once and reuse it on later calls
        if self.max_cgpa is None:
            self.max_cgpa = max((student.cgpa for student in self.students), default=0.0)
        return self.max_cgpa
                
    def get_min_cgpa(self):
        # Compute the lowest CGPA once and reuse it on later calls
        if self.min_cgpa is None:
            self.min_cgpa = min((student.cgpa for student in self.students), default=0.0)
        return self.min_cgpa
    
    def shuffle(self):
        random.shuffle(self.students)
        
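    def assign_group(self, max_pax: int):
        # Shuffle once, then cut the list into consecutive teams of at most max_pax students
        self.shuffle()
        # Discard teams left over from any previous assignment
        self.students_by_teams = {}
        for i in range(0, len(self.students), max_pax):
            team_id = i // max_pax + 1
            team = TeamGroup(team_id, self.group_id)
            # Slicing stops at the end of the list, so the last team may be smaller
            for student in self.students[i:i + max_pax]:
                student.assign_team(team_id)
                team.add_student(student)
            self.students_by_teams[team_id] = team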

## Re-read the data to match the new class definitions

In [43]:
students_by_groups: Dict[str, TutorialGroup] = {

}

students_by_id: Dict[str, Student] = {
    
}

with open('records.csv', mode='r') as file:
    # Create a CSV reader
    csv_reader = csv.reader(file)
    next(csv_reader)

    # Append students to corresponding tutorial groups
    for row in csv_reader:

        tutorial_group = row[0]
        student_id = int(row[1])  # Convert to int
        school = row[2]
        name = row[3]
        gender = row[4]
        cgpa = float(row[5])  # Convert to float

        if tutorial_group not in students_by_groups:
            students_by_groups[tutorial_group] = TutorialGroup(tutorial_group)

        # Create one Student object and share it between both lookups,
        # so that a team assignment is visible through students_by_id as well
        student = Student(tutorial_group, student_id, school, name, gender, cgpa)
        students_by_id[str(student_id)] = student
        students_by_groups[tutorial_group].add_student(student)
        

In [117]:
group1 = students_by_groups['G-1']

total = 0

group1.assign_group(5)

for team in group1.students_by_teams.values():
    score = team.diversity_score()  # Compute once, then reuse
    print(score)
    total += score
    
mean = total / len(group1.students_by_teams)

print("Mean: ", mean)



9.284542981225126
11.712664997762674
8.58333463190153
12.457034716388417
15.707559511170095
12.833314676588415
10.71369605616415
9.690403461861237
10.371613208236324
11.833248816641738
Mean:  11.31874130579397
