# Project 4
The data for this project is real data provided by the HPI School of Design Thinking. The input is a table stored in CSV format in a file named ‘project4.csv’. The table has 321 rows and five columns. Each row corresponds to a student, and the columns are as follows:

< hash > < Sex > < Discipline > < Nationality > < Semester >

- The hash field contains a cryptographic hash of the student’s name (for privacy reasons).
- The Sex field contains ‘m’ for male and ‘f’ for female.
- The Discipline field contains one of the following seven entries:
    ‘Business’, ‘Creative Disciplines’, ‘Engineering’, ‘Humanities’, ‘Life Sciences’, ‘Media’ or ‘Social Sciences’.
- The Nationality field contains one of 37 nationalities, as self-reported by the student.
- The Semester field contains the semester in which the student was enrolled, stored as a code that indicates the term and year. For example, students in the Winter 2015 semester have the code WT-15 (WT for Winter Term), and students enrolled in the current semester have the code ST-17.
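
The field descriptions above can be turned into a small sanity check before any teaming is computed. The sketch below uses only the standard library and assumes that the CSV header uses the names `hash`, `Sex`, `Discipline`, `Nationality` and `Semester`; the helper names `parse_semester` and `validate_row`, the two sample rows and the nationality values are hypothetical and only serve as illustration.

```python
import csv
import io
import re

DISCIPLINES = {
    'Business', 'Creative Disciplines', 'Engineering', 'Humanities',
    'Life Sciences', 'Media', 'Social Sciences',
}
SEMESTER_CODE = re.compile(r'^(WT|ST)-(\d{2})$')  # e.g. WT-15, ST-17


def parse_semester(code):
    """Split a code such as 'WT-15' into (term, four-digit year)."""
    match = SEMESTER_CODE.match(code)
    if match is None:
        raise ValueError('unexpected semester code: {!r}'.format(code))
    return match.group(1), 2000 + int(match.group(2))


def validate_row(row):
    """Return the names of the fields that look wrong (empty list if none)."""
    problems = []
    if row['Sex'] not in ('m', 'f'):
        problems.append('Sex')
    if row['Discipline'] not in DISCIPLINES:
        problems.append('Discipline')
    try:
        parse_semester(row['Semester'])
    except ValueError:
        problems.append('Semester')
    return problems


# Hypothetical two-row sample; the real file has 321 rows.
sample = io.StringIO(
    'hash,Sex,Discipline,Nationality,Semester\n'
    'abc123,f,Engineering,German,WT-15\n'
    'def456,x,Media,French,ST-17\n'
)
for row in csv.DictReader(sample):
    print(row['hash'], validate_row(row))
```

In the notebook itself, the same check could be applied to the rows of `students`, for example via `students.to_dict('records')`.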

In [73]:
import imp
import numpy as np
import pandas as pd
import utils
from matplotlib import pyplot as plt
%matplotlib inline

students = pd.read_csv('project4.csv')

## Teaming 1 (Arbitrary teaming)

In [162]:
teaming1 = utils.process_semesters(students, utils.arbitrary_teaming)
utils.store_teaming(teaming1, "output/teaming1.out")

## Teaming 2 (Intra-team diversity)

#### Defining the multi-objective problem

In [170]:
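def metric_gender_balance(team):
    # 0 for the most even split a team of this size allows, 1 for a single-sex team.
    k = len(team)
    assert k in (5,6), team
    men_share = sum(team['Sex'] == 'm') / k
    women_share = 1 - men_share
    # Largest achievable value of men_share * women_share for team size k
    opt_balance = np.ceil(k/2) * np.trunc(k/2) / k**2
    return 1 - men_share * women_share / opt_balance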

def _concentration(team, column):
    # Normalised L2 norm of the value counts:
    # 0 if every member has a different value, 1 if all members share one.
    k = len(team)
    occurrences = team[column].value_counts()
    l2norm = np.sqrt((occurrences**2).sum())
    return (l2norm - np.sqrt(k)) / (k - np.sqrt(k))

def metric_discipline(team):
    return _concentration(team, 'Discipline')

def metric_nationality(team):
    return _concentration(team, 'Nationality')

def metric_collision(team, previous_teaming=None):
    # We have to consider the whole previous teaming across all semesters,
    # because one student appears in two semesters and may therefore
    # collide with students from both of their teams.
    if previous_teaming is None:
        return 0
    # TODO Count collisions
    return 0

def multi_objective(team, previous_teaming=None):
    metric = pd.Series(index=['Gender', 'Discipline', 'Nationality', 'Collision'], dtype=float)
    metric['Gender'] = metric_gender_balance(team)
    metric['Discipline'] = metric_discipline(team)
    metric['Nationality'] = metric_nationality(team)
    metric['Collision'] = metric_collision(team, previous_teaming)
    # print('{:.2f},{:.2f},{:.2f},{:.2f}'.format(
    #    metric['Gender'], metric['Discipline'], metric['Nationality'], metric['Collision']))
    # print(team.head())
    return metric
    
def sem_multi_objective(teaming, previous_teaming=None):
    results = []
    for i in range(16):
        team = teaming[teaming['Team'] == i+1]
        results.append(multi_objective(team, previous_teaming))
    results = pd.DataFrame(results)
    # Take the average for each objective
    return np.mean(results, axis=0)

def overall_multi_objective(teaming, previous_teaming=None):
    results = []
    for semester in ('WT-15', 'ST-16', 'WT-16', 'ST-17'):
        sem_teaming = teaming[teaming['Semester'] == semester]
        assert len(sem_teaming) in (80, 81),\
            'Expected 80 or 81 students but got {}'.format(len(sem_teaming))
        results.append(sem_multi_objective(sem_teaming, previous_teaming))
    return np.mean(results, axis=0)
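
To get a feel for the value ranges of the metrics defined above, the following plain-Python sketch mirrors the gender formula and the discipline/nationality formula on lists of values instead of DataFrames. The function names `gender_balance` and `concentration` are illustrative and not part of the notebook. Reading the formulas, 0 appears to be the best score and 1 the worst, so the objectives would be minimised.

```python
import math
from collections import Counter


def gender_balance(sexes):
    """0.0 for the most even split possible, 1.0 for a single-sex team."""
    k = len(sexes)
    men = sum(1 for s in sexes if s == 'm')
    women = k - men  # assumes only 'm' and 'f' occur
    # Best achievable product men * women for team size k. Dividing the
    # counts is equivalent to dividing the shares, because k**2 cancels.
    best = math.ceil(k / 2) * (k // 2)
    return 1 - men * women / best


def concentration(values):
    """0.0 if all values differ, 1.0 if all values are identical."""
    k = len(values)
    counts = Counter(values).values()
    l2norm = math.sqrt(sum(c * c for c in counts))
    return (l2norm - math.sqrt(k)) / (k - math.sqrt(k))


print(gender_balance(['m', 'm', 'm', 'f', 'f']))  # 0.0 (best split for k=5)
print(gender_balance(['m'] * 5))                  # 1.0 (single-sex team)
print(concentration(['A', 'B', 'C', 'D', 'E']))   # 0.0 (all different)
print(concentration(['A'] * 5))                   # 1.0 (all identical)
```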

In [175]:
teaming1_metric = overall_multi_objective(teaming1)
utils.print_metric(teaming1_metric, 'teaming 1')

Multi-objective metric for teaming 1: GenderBalance=0.29, Disciplines=0.38, Nationalities=0.63, Collision=0.00
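
The Collision value is 0.00 because `metric_collision` is still a stub. One possible way to fill it in, which is an assumption and not taken from the project, is to count the pairs of students in a team who already shared a team in an earlier teaming. The sketch below works on sets of student hashes; `previous_pairs` and `count_collisions` are hypothetical helper names.

```python
from itertools import combinations


def previous_pairs(previous_teams):
    """Collect every unordered pair of students that already shared a team."""
    pairs = set()
    for members in previous_teams:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def count_collisions(team, previous_teams):
    """Number of pairs in `team` that also appear in a previous team."""
    seen = previous_pairs(previous_teams)
    return sum(1 for pair in combinations(sorted(team), 2) if pair in seen)


# Hypothetical hashes, purely for illustration.
previous = [{'a', 'b', 'c'}, {'d', 'e', 'f'}]
print(count_collisions({'a', 'b', 'd'}, previous))  # only (a, b) repeats -> 1
```

To make such a count comparable with the other three metrics, it could be divided by the number of pairs in the team, k(k-1)/2, so that it also falls between 0 and 1.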


#### Applying SEMO

In [164]:
import importlib  # imp is deprecated in favour of importlib
importlib.reload(utils)
teaming2 = utils.process_semesters(students, utils.intra_diversity_teaming)
utils.store_teaming(teaming2)  # "output/teaming2.out")

'"hash","team","Semester"\n'

In [None]:
teaming2_metric = overall_multi_objective(teaming2)
utils.print_metric(teaming2_metric, 'teaming 2')

### Data analysis:

| Semester | Sex | Discipline | Nationality |
|:--------:|:----|:-----------|:------------|
| WT-15 | 34:46 | 18:06:23:10:10:11 | ? |
| ST-16 | 43:37 | 25:10:22:06:06:11 | ? |
| WT-16 | 42:38 | 20:10:23:03:06:18 | ? |
| ST-17 | 45:36 | 22:08:20:05:06:20 | ? |

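The values are not uniformly distributed within a semester, so some teams will inevitably be more diverse than others. For instance, a discipline with more than 16 students in a semester must appear at least twice in one of the 16 teams.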
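
Per-semester counts of this kind could be computed with a sketch like the following. It is not necessarily how the table was produced; the function name `distribution_by_semester` and the sample rows are hypothetical, and the rows are plain dictionaries keyed by the column names used in the notebook code.

```python
from collections import Counter, defaultdict


def distribution_by_semester(rows, column):
    """Count how often each value of `column` occurs within each semester."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row['Semester']][row[column]] += 1
    return counts


# Hypothetical rows; in the notebook they could come from
# students.to_dict('records').
rows = [
    {'Semester': 'WT-15', 'Sex': 'm'},
    {'Semester': 'WT-15', 'Sex': 'f'},
    {'Semester': 'WT-15', 'Sex': 'f'},
    {'Semester': 'ST-16', 'Sex': 'm'},
]
for semester, counter in distribution_by_semester(rows, 'Sex').items():
    print(semester, dict(counter))
```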

## Teaming 3 (Inter-team diversity)

## Teaming 4 (Double inter-team diversity)

# Paper notes

Regarding the problem definition and selection of a fitting algorithm:

Within a single team of a semester, the objectives can conflict: for example, improving the mix of disciplines may worsen the mix of nationalities. We therefore assume that this multi-objective problem is nontrivial, i.e. that no single teaming is optimal in every objective at once. An algorithm for this problem has to search for nondominated (Pareto-optimal) solutions.
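
The notion of a nondominated solution can be made concrete with a short sketch. It assumes that every objective is minimised; the names `dominates` and `pareto_front` and the example scores are hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere.

    Both arguments are equal-length sequences of objective values, and
    lower values are assumed to be better (minimisation).
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Hypothetical (Discipline, Nationality) scores for three candidate teamings.
candidates = [(0.2, 0.6), (0.5, 0.3), (0.6, 0.7)]
print(pareto_front(candidates))  # [(0.2, 0.6), (0.5, 0.3)]
```

The third candidate is worse than the first in both objectives and is dropped, while the first two trade one objective against the other and both remain.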

We have chosen the Simple Evolutionary Multi-objective Optimizer (SEMO) algorithm proposed by [Laumanns et al.][0].
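
As a rough illustration, one common formulation of SEMO keeps a population of mutually nondominated solutions, mutates a randomly chosen member, and accepts the child only if no member dominates it or has the same objective values. The sketch below follows that formulation under the assumption that all objectives are minimised. It is not the implementation in `utils`: the function names and the bit-string toy problem are hypothetical. For teamings, the mutation might instead swap two students between teams.

```python
import random


def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def semo(initial, mutate, evaluate, iterations, rng):
    """Return (solution, objectives) pairs that are mutually nondominated."""
    population = [(initial, evaluate(initial))]
    for _step in range(iterations):
        parent, _parent_score = rng.choice(population)
        child = mutate(parent, rng)
        child_score = evaluate(child)
        # Reject the child if a member dominates it or ties with it.
        if any(dominates(score, child_score) or score == child_score
               for _member, score in population):
            continue
        # Otherwise drop every member the child dominates, then add the child.
        population = [(member, score) for member, score in population
                      if not dominates(child_score, score)]
        population.append((child, child_score))
    return population


def flip_one_bit(bits, rng):
    """Toy mutation: flip one randomly chosen bit of a tuple of 0s and 1s."""
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]


def ones_and_zeros(bits):
    """Toy objectives that conflict: number of ones and number of zeros."""
    return (sum(bits), len(bits) - sum(bits))


front = semo((0, 0, 0, 0), flip_one_bit, ones_and_zeros, 200, random.Random(0))
print(sorted(score for _solution, score in front))
```

In the toy problem every bit string is Pareto-optimal, so the population can grow to at most five members, one per possible number of ones.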




[0]: http://repository.ias.ac.in/83516/1/20-a.pdf