# Taking a look at `test(genome)` 
```python
def test(genome):
    return genome.find(infection) != -1
```
`.find()` takes the string `infection` and searches for it inside `genome`.  
It returns the index at which `infection` first occurs.  
If `infection` cannot be found in `genome`, it returns -1.  
Comparing the result against -1 lets the function return `True` if the infection is present and `False` otherwise.  
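
To see this behaviour in isolation, here is a small sketch. The strings `toy_genome` and `toy_infection` are made up for illustration; the real `infection` sequence lives inside `test_kit` and is not shown here.

```python
# Toy strings, made up for illustration (not the real sequences)
toy_genome = 'AACCGGTTAC'
toy_infection = 'GGT'

print(toy_genome.find(toy_infection))  # index where the first match starts
print(toy_genome.find('TTT'))          # -1 because 'TTT' never appears

def toy_test(genome, infection=toy_infection):
    # Same idea as test(genome): True when the marker is found
    return genome.find(infection) != -1

# `in` asks the same yes/no question without dealing with indices
print(toy_test(toy_genome), toy_infection in toy_genome)
```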

Importing my favorite things in python.  
This segment will be heavily reliant on [Files, Folders and OS](https://sps.nus.edu.sg/sp2273/chapter-os.html)

In [1]:
import os
import glob
import shutil
import test_kit #This is for my test function
import numpy as np
import random

Taking it slow and steady for I hope to win the race.

In [2]:
os.getcwd()

'C:\\Users\\Chin Zhen Jie\\OneDrive\\Desktop\\SP2273\\GitGud\\learning-portfolio-ChinZJ\\Numero Uno'

In [3]:
os.chdir('C:\\Users\\Chin Zhen Jie\\OneDrive\\Desktop\\SP2273\\GitGud\\learning-portfolio-ChinZJ\\Numero Uno')

In my case, I have the extracted 'person-files' already in my folder 'Numero Uno' so I will just change the directory in there.

In [4]:
os.chdir('.\\person-files') #The backslash is doubled so that Python does not read '\p' as an escape sequence

`.` here refers to the current folder, i.e. 'Numero Uno'.  
It is the exact same thing as  
`os.chdir('C:\\Users\\Chin Zhen Jie\\OneDrive\\Desktop\\SP2273\\GitGud\\learning-portfolio-ChinZJ\\Numero Uno\\person-files')`

In [5]:
os.getcwd()

'C:\\Users\\Chin Zhen Jie\\OneDrive\\Desktop\\SP2273\\GitGud\\learning-portfolio-ChinZJ\\Numero Uno\\person-files'

## Why does this use '\\'?
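
A short sketch of what is going on, as far as I understand it: inside a normal Python string, `\` starts an escape sequence (`\n` is a newline, for example), so a literal backslash has to be written as `\\`. The notebook echoes the `repr` of the path, which is why the doubled form shows up. The path below is shortened and made up.

```python
path = 'C:\\Users\\someone'  # hypothetical path, written with escaped backslashes

print(path)        # what the string actually contains
print(repr(path))  # what the notebook echoes: backslashes shown doubled
print(len('\\'))   # two characters in the source, one character in the string

raw_path = r'C:\Users\someone'  # a raw string avoids the doubling
print(raw_path == path)
```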

In [6]:
glob.glob('*')

['person_00000.txt',
 'person_00001.txt',
 'person_00002.txt',
 'person_00003.txt',
 'person_00004.txt',
 'person_00005.txt',
 'person_00006.txt',
 'person_00007.txt',
 'person_00008.txt',
 'person_00009.txt',
 'person_00010.txt',
 'person_00011.txt',
 'person_00012.txt',
 'person_00013.txt',
 'person_00014.txt',
 'person_00015.txt',
 'person_00016.txt',
 'person_00017.txt',
 'person_00018.txt',
 'person_00019.txt',
 'person_00020.txt',
 'person_00021.txt',
 'person_00022.txt',
 'person_00023.txt',
 'person_00024.txt',
 'person_00025.txt',
 'person_00026.txt',
 'person_00027.txt',
 'person_00028.txt',
 'person_00029.txt',
 'person_00030.txt',
 'person_00031.txt',
 'person_00032.txt',
 'person_00033.txt',
 'person_00034.txt',
 'person_00035.txt',
 'person_00036.txt',
 'person_00037.txt',
 'person_00038.txt',
 'person_00039.txt',
 'person_00040.txt',
 'person_00041.txt',
 'person_00042.txt',
 'person_00043.txt',
 'person_00044.txt',
 'person_00045.txt',
 'person_00046.txt',
 'person_0004

As you can see, each file name contains a 5-digit number padded with zeros. The easiest way to cheese through this is to build the file name with an f-string.  
We only want to read the files, not change them, so they will be opened in read mode later on.  
We finally get to the first question, damn!

In [12]:
def get_genome(person_id):
    '''
    This takes in an integer (person_id), converts it to a 5 digit number that suits the .txt file before returning the entire
    sequence by reading the file
    '''
    file_name = f'person_{person_id:05}.txt' #:05 gives me a 5 digit number
    with open(file_name, 'r') as file: #r here refers to read
        genome_sequence = file.read() #this reads the file for me
    return genome_sequence #I use return instead of print within a function

For haters of f strings, we can go back to default string concatenation.
```python
def wahoo(person_id): #This takes in an integer
    if len(str(person_id)) == 1:
        file_name = 'person_0000' + str(person_id) + '.txt'
    elif len(str(person_id)) == 2:
        file_name = 'person_000' + str(person_id) + '.txt'
    elif len(str(person_id)) == 3:
        file_name = 'person_00' + str(person_id) + '.txt'
    else:
        file_name = 'person_0' + str(person_id) + '.txt'
    with open(file_name, 'r') as file: #r here refers to read
        genome_sequence = file.read() #this reads the file for me
    return genome_sequence #I use return instead of print within a function
```
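
For what it is worth, `str.zfill` does the padding without the `if` chain, and it agrees with the f-string version. `make_file_name` is a name I made up for this sketch.

```python
def make_file_name(person_id):
    # zfill(5) pads the number with leading zeros up to 5 characters
    return 'person_' + str(person_id).zfill(5) + '.txt'

for person_id in (0, 7, 42, 499, 12345):
    # Both approaches should build the same file name
    same = make_file_name(person_id) == f'person_{person_id:05}.txt'
    print(make_file_name(person_id), same)
```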

The function seems to be alright, let's try it out, wahoo!  
Let's check if the first 100 people have infections.  
The following people should have infections: 1, 7, 8, 27, 47, 57, 62, 63, and 78.  
These are the people that will give an output of `True`.

In [13]:
for i in range(100): #This starts from 0 and ends at 99. Note that the first .txt file is 00000
    patient_sequence = get_genome(i)
    result = test_kit.test(patient_sequence)
    if result: #This is already a boolean so I can do thangs like this.
        print(i)

1
7
8
27
47
57
62
63
78


I now need to randomly sample 100 people.  
For this I need to find the range of values that my random number can take.  
I can cheat by looking at the file and realising it is 499, or I can make my life hard.  
As you can see from above, glob.glob actually gives me a list. I will use that to my advantage

In [14]:
people_list = glob.glob('*')
people_list[-1]

'person_00499.txt'

The file name is always the same except for the 5 digits, so I can use string slicing to extract this value. I can also get min_num just for the lols.

In [15]:
min_num = int(people_list[0][7:12])
max_num = int(people_list[-1][7:12])
print(min_num, max_num)

0 499
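
One caveat: as far as I know, `glob.glob` does not promise any particular order, so relying on `people_list[0]` and `people_list[-1]` works here but is not guaranteed. A slightly safer sketch pulls every id out and takes `min` and `max`. The short list below stands in for the real `glob.glob('*')` output.

```python
# Stand-in for glob.glob('*'), deliberately out of order
file_names = ['person_00499.txt', 'person_00000.txt', 'person_00042.txt']

# 'person_' is 7 characters long, so the 5 digits sit at positions 7 to 11
ids = [int(name[7:12]) for name in file_names]

min_id, max_id = min(ids), max(ids)
print(min_id, max_id)
```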


Just because I can

In [16]:
random_list = random.sample(range(0, 500),100) #random.sample draws without replacement, so all 100 values are unique. A single randint call only gives one number, and repeated calls can give duplicates.
random_list

[41,
 10,
 6,
 132,
 224,
 317,
 305,
 52,
 89,
 185,
 302,
 295,
 229,
 81,
 299,
 350,
 121,
 135,
 344,
 109,
 48,
 134,
 451,
 402,
 156,
 300,
 65,
 303,
 175,
 397,
 265,
 413,
 454,
 198,
 319,
 358,
 477,
 264,
 149,
 36,
 260,
 230,
 35,
 90,
 5,
 470,
 59,
 219,
 157,
 448,
 412,
 492,
 359,
 453,
 0,
 193,
 273,
 244,
 70,
 395,
 136,
 267,
 432,
 49,
 346,
 69,
 212,
 337,
 112,
 438,
 55,
 384,
 97,
 104,
 186,
 375,
 61,
 278,
 364,
 82,
 18,
 100,
 447,
 456,
 76,
 189,
 45,
 106,
 127,
 461,
 455,
 417,
 494,
 225,
 113,
 218,
 256,
 331,
 401,
 425]
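
A quick sketch of the difference. `random.sample` draws without replacement, so every value is unique, whereas calling `random.randint` in a loop draws each value independently and may repeat. The seed is arbitrary and only there to make the run repeatable.

```python
import random

random.seed(2273)  # arbitrary seed, only for repeatability

sampled = random.sample(range(0, 500), 100)           # without replacement
drawn = [random.randint(0, 499) for _ in range(100)]  # independent draws, repeats possible

print(len(set(sampled)))  # always 100 since every value is unique
print(len(set(drawn)))    # can be less than 100 when values repeat
```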

In [17]:
def percentage(minimum, maximum, size):
    random_list = random.sample(range(minimum, maximum+1), size)
    infected_count = 0 #This is my running count. I avoid the name sum because it would shadow Python's built-in sum()
    for i in random_list: #This iterates through the list of random integers generated
        genome_sequence = get_genome(i) #This lets me store a string of my genome sequence
        if test_kit.test(genome_sequence): #This tests my genome sequence for the infection. if checks for a boolean which is what test returns
            infected_count += 1 #This gives me my count of infected people
    percent_infected = infected_count / size * 100 #I can do math!
    return percent_infected

In [18]:
percentage(min_num, max_num, 100)

19.0

In [19]:
def pool(list_of_id): #This takes in a list of integers
    compile_genome_id = ''
    for i in list_of_id:
        compile_genome_id += get_genome(i)
    return compile_genome_id

In [20]:
mixed_sample_1 = pool([20,21])
mixed_sample_2 = pool([21,22])
print(test_kit.test(get_genome(20))) #Tells me that person 20 is healthy
print(test_kit.test(get_genome(21))) #Tells me that person 21 is healthy
print(test_kit.test(get_genome(22))) #Tells me that person 22 is healthy
print(test_kit.test(mixed_sample_1)) #Tells me facts since 20 and 21 are healthy
print(test_kit.test(mixed_sample_2)) #Tells me facts since 21 and 22 are healthy

False
False
False
False
False
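
The same pooling idea with toy data, since the real files are not needed to see it. The genomes, the marker, and the `toy_` helpers are all made up for this sketch.

```python
toy_marker = 'XYZ'  # made-up infection marker
toy_genomes = {0: 'AAAA', 1: 'GGXYZGG', 2: 'CCCC'}  # made-up genomes; only person 1 carries the marker

def toy_test(genome):
    return genome.find(toy_marker) != -1

def toy_pool(list_of_id):
    # Join the genomes of everyone in the pool into one long string
    return ''.join(toy_genomes[i] for i in list_of_id)

print(toy_test(toy_pool([0, 2])))  # nobody in this pool carries the marker
print(toy_test(toy_pool([0, 1])))  # one carrier is enough to flag the whole pool
```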


In [21]:
test = list(range(100))
group = [test[i:i+10] for i in range(0,len(test),10)]
group

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
 [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]

In [22]:
#When I want to put 100 people into groups of 8. I have 12 remainder 4. This means I need 4 groups of 9 and 8 groups of 8
first_half = 4
group_size = 8
group_2 = [test[i:i+group_size+1] for i in range(0,(group_size+1)*first_half,group_size+1)]
print(group_2)
group_2 += [test[i:i+group_size] for i in range(((group_size+1)*first_half),int(len(test)),group_size)]
print(group_2)

[[0, 1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23, 24, 25, 26], [27, 28, 29, 30, 31, 32, 33, 34, 35]]
[[0, 1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23, 24, 25, 26], [27, 28, 29, 30, 31, 32, 33, 34, 35], [36, 37, 38, 39, 40, 41, 42, 43], [44, 45, 46, 47, 48, 49, 50, 51], [52, 53, 54, 55, 56, 57, 58, 59], [60, 61, 62, 63, 64, 65, 66, 67], [68, 69, 70, 71, 72, 73, 74, 75], [76, 77, 78, 79, 80, 81, 82, 83], [84, 85, 86, 87, 88, 89, 90, 91], [92, 93, 94, 95, 96, 97, 98, 99]]


If you are wondering what abomination of illegible code I have made above, don't panik, it's ok, I don't know either.  
Based on the sample context, I have a total of 100 people indexed from 0 to 99.  
Since I want to split them into groups with an uneven number of people, I first find how many extra people are left over. In this context, that number is 4. This means that the first 4 groups will each have one additional element (or person in this case).  
Thus the first list comprehension covers these first 4 groups.  
In this very specific case, you can interpret it this way:  
My range of `i` starts from 0 and goes up to 36 (the `stop` value 36 itself is excluded) with a `step` of 9 (this is the number of people I want in each of these lists).  
My first list will take `0 to 8` (similarly, the slice excludes its `stop`, which is `0 + 8 + 1` for the first iteration of `i`).  
My second iteration increases `i` by a step of 9, so `i` now takes the value 9.  
My second list will thus take `9 to 17`.  
This continues until I have cleared all remainder values.  
Now for the second half. I simply return to the original list comprehension strats, starting from where I left off, which would be `(group_size + 1) * first_half`, or 36 here.
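
The two list comprehensions can be wrapped into one helper. This is only a sketch of the same idea; `split_into_groups` and its parameter names are mine, and it assumes the remainder is no larger than the number of full groups (true for 100 people in groups of 8).

```python
def split_into_groups(people, base_size):
    # Number of leftover people once everyone is put in groups of base_size
    extras = len(people) % base_size
    # The first `extras` groups take one extra person each
    boundary = (base_size + 1) * extras
    groups = [people[i:i + base_size + 1] for i in range(0, boundary, base_size + 1)]
    # The remaining groups have exactly base_size people
    groups += [people[i:i + base_size] for i in range(boundary, len(people), base_size)]
    return groups

groups = split_into_groups(list(range(100)), 8)
print([len(g) for g in groups])
```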

In [104]:
def find_sick_people(ppl_list, group_size):
    '''
    Despite its name, group_size is the number of groups to form, not the number of people in each group.
    Returns a tuple of (list of positive individuals, total number of tests used).
    '''
    #The first part creates my groups
    if len(ppl_list)%group_size == 0: #Everyone splits evenly, so every group has the same number of people
        per_group = len(ppl_list)//group_size
        group_ppl_list = [ppl_list[i:i+per_group] for i in range(0,len(ppl_list),per_group)] #This generates my lists of groups of even people. Refer to the example code above
    else: #When I do not have a perfect split, the leftover people are spread over the first few groups.
        extras = len(ppl_list)%group_size #This lets me know how many extras I have. The first `extras` groups each get an additional patient.
        per_group = (len(ppl_list)-extras)//group_size
        group_ppl_list = [ppl_list[i:i+per_group+1] for i in range(0,(per_group+1)*extras,per_group+1)]
        group_ppl_list += [ppl_list[i:i+per_group] for i in range((per_group+1)*extras,len(ppl_list),per_group)]

    #The second part runs the pooled test on each group
    test_count = 0
    positive_group = [] #This will store the people from every group that tested positive
    for i in group_ppl_list: #This iterates through my list of lists
        test_count += 1
        if test_kit.test(pool(i)): #This checks whether any of my pooled people are positive
            positive_group += i #At the end of my iterations, I will have a list of potential individuals

    #The last part is a linear search through the remaining potential individuals. I return early if every group holds only 1 person, since the pooled test was already an individual test
    if group_size == len(ppl_list):
        return (positive_group, test_count)
    positive_individuals = []
    for i in positive_group:
        test_count += 1
        if test_kit.test(get_genome(i)): #This checks whether the individual is positive
            positive_individuals.append(i) #This is the other alternative to adding things into your list.
    return (positive_individuals,test_count)

Here is the English version of the disgusting code that I have made above.  
I have created the list of groups according to the explanations above.  
For each group (10 of them in the example run), I pool all the genome sequences together and test the pool against the infection. If the pool tests positive, I add everyone in that group to my temporary list `positive_group`.  
Thereafter, I check each individual by looping through `positive_group` and include any sick individual in my final list.  
What I return is a tuple containing the list of positive individuals and the total number of tests used.
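
To check the test counts without the genome files, here is a toy simulation of the same two-stage idea. The set `known_sick` uses the ids found earlier and stands in for `test_kit.test`; `simulate_two_stage` is a made-up name, and it only handles the case where the people split evenly into groups.

```python
known_sick = {1, 7, 8, 27, 47, 57, 62, 63, 78}  # ids found earlier, standing in for test_kit

def simulate_two_stage(people, num_groups):
    per_group = len(people) // num_groups  # assumes an even split
    groups = [people[i:i + per_group] for i in range(0, len(people), per_group)]
    tests = 0
    suspects = []
    for group in groups:
        tests += 1  # one pooled test per group
        if any(person in known_sick for person in group):
            suspects += group
    sick = []
    for person in suspects:
        tests += 1  # one individual test per suspect
        if person in known_sick:
            sick.append(person)
    return sick, tests

print(simulate_two_stage(list(range(100)), 10))
print(simulate_two_stage(list(range(100)), 5))
```

With 10 groups, 6 of the pools contain a sick person, so the count is 10 pooled tests plus 60 individual tests.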

In [105]:
def get_sick_individuals(sick_people):
    return f' The list of sick people are {sick_people[0]}'
def get_total_tests(sick_people):
    return f' The total number of tests used are {sick_people[1]}'

In [106]:
ppl_list = [i for i in range(100)]
sick_people_sample = find_sick_people(ppl_list, 10)
sick_people_sample

([1, 7, 8, 27, 47, 57, 62, 63, 78], 70)

In [107]:
print(get_sick_individuals(sick_people_sample))
print(get_total_tests(sick_people_sample))

 The list of sick people are [1, 7, 8, 27, 47, 57, 62, 63, 78]
 The total number of tests used are 70


In [108]:
sick_people_sample_2 = find_sick_people(ppl_list, 5)
sick_people_sample_2

([1, 7, 8, 27, 47, 57, 62, 63, 78], 85)

In [109]:
print(get_sick_individuals(sick_people_sample_2))
print(get_total_tests(sick_people_sample_2))

 The list of sick people are [1, 7, 8, 27, 47, 57, 62, 63, 78]
 The total number of tests used are 85


# Mald Session on optimization

The theory here is that the groups must be small enough that about half of the groups contain no infected person before I perform a linear search, according to my current algorithm.  
Implementing foresight with the power of coding, I know that there are 62 infected people out of the 500 people.  
This means that my maximum number of people in a group should be 8 (500 / 62 ≈ 8).  
63 groups (for 8 people)  
72 groups (for 7 people)  
84 groups (for 6 people)  
100 groups (for 5 people)  
125 groups (for 4 people)  
167 groups (for 3 people)  
250 groups (for 2 people)  
500 groups (for 1 person)
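
The group counts above are just 500 divided by the group size, rounded up. A small sketch to reproduce them:

```python
import math

total_people = 500
for people_per_group in range(8, 0, -1):
    # Round up so the leftover people still get a group
    num_groups = math.ceil(total_people / people_per_group)
    print(f'{num_groups} groups (for {people_per_group} people)')
```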

In [113]:
values_to_test = [62,63,64,71,72,73,83,84,85,100,125,166,167,168,250,500] #Note that for unequal divisions, I have included +-1
true_test = [i for i in range(500)]
for i in values_to_test:
    result = find_sick_people(true_test, i) #Run the search once and reuse the result
    print(i, len(result[0]), get_total_tests(result))

62 62  The total number of tests used are 400
63 62  The total number of tests used are 380
64 62  The total number of tests used are 377
71 62  The total number of tests used are 367
72 62  The total number of tests used are 370
73 62  The total number of tests used are 367
83 62  The total number of tests used are 349
84 62  The total number of tests used are 364
85 62  The total number of tests used are 361
100 62  The total number of tests used are 350
125 62  The total number of tests used are 337
166 62  The total number of tests used are 333
167 62  The total number of tests used are 329
168 62  The total number of tests used are 328
250 62  The total number of tests used are 366
500 62  The total number of tests used are 500


How do you derive things from the above code?  
Well, the print gives me 3 values. The first is the number of groups formed.  
The second is the total number of infected individuals. To avoid cluttering, I merely check the length of the list to ensure that my code is not going haywire.  
The last is the number of tests used.  
Among the values I tested, 168 groups (about 3 individuals in each group) uses the fewest tests, at 328.
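
As a rough sanity check, assuming infections are spread independently at random (which the real files may not satisfy), a group of k people needs 1 pooled test plus k individual tests whenever at least one member is infected. With p = 62/500, the expected number of tests per person is then 1/k + 1 - (1 - p)^k. The sketch below evaluates this; it is an estimate under that assumption, not a replacement for the measured counts, and `expected_tests` is a name I made up.

```python
p = 62 / 500  # infection rate found earlier
total_people = 500

def expected_tests(k):
    # 1 pooled test shared by k people, plus k follow-up tests if the pool is positive
    per_person = 1 / k + 1 - (1 - p) ** k
    return per_person * total_people

for k in range(2, 9):
    print(k, round(expected_tests(k), 1))

best_k = min(range(2, 9), key=expected_tests)
print(best_k)
```

Under this assumption, groups of 3 and 4 come out almost tied, which is in the same ballpark as the measured counts.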

If you are wondering why I did not bother scrambling the people, it is because of the way the testing works.  
After the pooled group tests, I perform a linear search, so regardless of the number of individuals, I will simply be testing everyone left in the potential pool.