# Chapter 03


## Getting started

A good way to get started is to read the first chapter of [Modern Information Retrieval](https://isbnsearch.org/isbn/9780321416919), which I recommend.

However, for now, let's look at another [introduction](https://learning.oreilly.com/library/view/information-architecture-4th/9781491913529/ch01.html#hellocomma_itunes), the one to [Information Architecture](https://isbnsearch.org/isbn/9781491911686), also mentioned in our [reading list](https://github.com/TomDeneire/InformationScience/blob/master/README.md). Even though this is an introduction about information **architecture** (yet another discipline!), i.e. it looks at things from the perspective of information **design** rather than **retrieval**, it still serves very well as a concrete example of the different considerations to take into account when it comes to information retrieval.

(For copyright reasons, I will share a PDF version of this chapter with you through personal email rather than posting it in the public GitHub repo. Please do not disseminate it yourselves.)

## Manipulating Information

The case of iTunes shows several things, but above all it makes it clear that information (or metadata in this case) never just *is*. It is always manipulated in order to present it in a certain way. So we could say that at the heart of information retrieval is **manipulating information**, i.e. selecting, grouping, filtering, ordering, sorting, ranking. (For those of you who know [SQL](https://en.wikipedia.org/wiki/SQL), notice how this resembles the `select` statement? For those of you who don't, don't worry, we'll look into it later on.)
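
As a rough illustration of these operations, here is a minimal Python sketch. The `tracks` list, its field names and its values are invented for the example; they do not come from iTunes or from the chapter.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical metadata records, invented for this illustration
tracks = [
    {"title": "Cedar", "artist": "Artist B", "year": 2001},
    {"title": "Alder", "artist": "Artist A", "year": 1999},
    {"title": "Birch", "artist": "Artist B", "year": 1999},
    {"title": "Elm", "artist": "Artist A", "year": 2001},
]

# Filtering: keep only the records that match a criterion
from_1999 = [t for t in tracks if t["year"] == 1999]

# Selecting: keep only the field we want to present
titles_1999 = [t["title"] for t in from_1999]

# Sorting: order the records by a key
titles_sorted = [t["title"] for t in sorted(tracks, key=itemgetter("title"))]

# Grouping: collect titles per artist (groupby expects input sorted by the same key)
by_artist = {
    artist: [t["title"] for t in group]
    for artist, group in groupby(
        sorted(tracks, key=itemgetter("artist")), key=itemgetter("artist")
    )
}

print(titles_1999)    # ['Alder', 'Birch']
print(titles_sorted)  # ['Alder', 'Birch', 'Cedar', 'Elm']
print(by_artist)      # {'Artist A': ['Alder', 'Elm'], 'Artist B': ['Cedar', 'Birch']}
```

Each step takes the same underlying records and produces a different presentation of them, which is the point the iTunes case makes.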

In programming terms, most of this boils down to string operations, such as testing metadata against certain criteria or sorting it. And while manipulating strings might seem easy, things can get complicated very quickly.

## Example: sorting strings

Let's look at the example of sorting strings. Suppose our information retrieval task is presenting an alphabetized list of contact persons. The alphabet is a recognizable and expected key for such a list, so that makes sense. 

Of course, in Python you can just do this:

In [7]:
contacts = ["Doe, John", "Poppins, Mary", "Doe, Jane"]
sorted_contacts = sorted(contacts)
print(sorted_contacts)

['Doe, Jane', 'Doe, John', 'Poppins, Mary']


But suppose you are dealing with a language that has no built-in sorting method. (And believe me, there are!) How would you go about sorting a list of strings?

Let me simplify the problem. Somewhere along the line you will have to represent individual characters as numbers, e.g. a = 1, b = 2, and then sort numbers.
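
To make the idea concrete, here is a minimal sketch of such a mapping. The helper name `to_numbers` and the restriction to the lowercase letters a to z are assumptions made for the illustration.

```python
import string

# Assumed mapping for the illustration: a = 1, b = 2, ..., z = 26
LETTER_VALUES = {
    letter: index for index, letter in enumerate(string.ascii_lowercase, start=1)
}

def to_numbers(word):
    """Translate a lowercase a-z word into a list of numbers (hypothetical helper)."""
    return [LETTER_VALUES[letter] for letter in word]

print(to_numbers("doe"))      # [4, 15, 5]
print(to_numbers("poppins"))  # [16, 15, 16, 16, 9, 14, 19]

# Python compares lists element by element, which mirrors alphabetical order
print(to_numbers("doe") < to_numbers("poppins"))  # True
```

Anything outside a to z (capitals, commas, spaces, accents) would raise a `KeyError` here, which already hints at the complications discussed further on.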

So let's think about the root issue: how do you sort a list of numbers?

In [10]:
numbers = [7, 8, 1, 7, 2]
sorted_numbers = sorted(numbers)
print(sorted_numbers)

[1, 2, 7, 7, 8]


Of course, sorting algorithms are a well-known chapter of Computer Science. Some of you might be familiar with different kinds of sorts, like merge sort, insertion sort or (my favourite) bubble sort. For some Python implementations, see this [Tutorialspoint article](https://www.tutorialspoint.com/python_data_structure/python_sorting_algorithms.htm).
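
For reference, bubble sort could look roughly like this. It is a minimal sketch of one common formulation, not the version from the linked article; the function and variable names are my own choices.

```python
def bubble_sort(values):
    """Return a new list with the values in ascending order."""
    items = list(values)  # copy, so the input list is left unchanged
    # After each pass the largest remaining value has "bubbled" to position `end`
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                # Neighbours are out of order: swap them
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break  # a pass without swaps means the list is already sorted
    return items

print(bubble_sort([7, 8, 1, 7, 2]))  # [1, 2, 7, 7, 8]
```

The algorithm only ever compares and swaps neighbouring elements, which makes it easy to follow but slow on long lists.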

But if you have never studied the topic, writing your own sort for the first time is not an easy exercise. I challenge you to try it, if you have never done so. For a bit of fun, here's another kind of sort I recently implemented in Python: *random sort* (also known as bogosort). Very time-inefficient, but perfectly functional!

In [12]:
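from random import shuffle

def random_sort(input_list):
    """Shuffle a copy of the list until it happens to be sorted."""
    items = list(input_list)  # copy, so the caller's list is not modified
    # Compare each element with its successor; this works for negative
    # numbers and for empty or one-element lists too
    while any(a > b for a, b in zip(items, items[1:])):
        shuffle(items)
    return items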

print(random_sort(numbers))

[1, 2, 7, 7, 8]


And that's only the first part of the problem: sorting lists of numbers. Now try to think how this would help to sort lists of strings. First of all, how would you translate strings to numbers? 

One way is to use [Unicode](https://en.wikipedia.org/wiki/Unicode) code points as the numbers:


In [28]:
for char in "Doe, John":
    print(ord(char), end=",")

68,111,101,44,32,74,111,104,110,

But of course when the case changes, the numbers will also change:

In [40]:
for char in "doe, john":
    print(ord(char),end=",")

100,111,101,44,32,106,111,104,110,
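
The effect on sorting can be seen in a small sketch. The `names` list is made up for the illustration.

```python
names = ["doe, john", "Poppins, Mary", "Doe, Jane"]

# Strings are compared code point by code point, so every uppercase
# letter (65-90) sorts before every lowercase letter (97-122)
print(sorted(names))
# ['Doe, Jane', 'Poppins, Mary', 'doe, john']
```

The lowercase `doe, john` ends up after `Poppins, Mary`, which is not what a reader of an alphabetized contact list would expect.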

You can account for that by converting all strings to lower case first, but what happens in the case of `Étienne` versus `Etienne`, which are usually interchangeable?

In [42]:
for char in "Étienne".lower():
    print(char + " = " + str(ord(char)))
print("\n")
for char in "Etienne".lower():
    print(char + " = " + str(ord(char)))

é = 233
t = 116
i = 105
e = 101
n = 110
n = 110
e = 101


e = 101
t = 116
i = 105
e = 101
n = 110
n = 110
e = 101
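
One possible way to treat `Étienne` and `Etienne` alike is to sort on a normalized copy of each string. The sketch below uses Python's standard `unicodedata` module; the helper name `sort_key` and the sample names are assumptions for the illustration, and stripping accents is a simplification that is not correct for every language (in Swedish, for instance, `å`, `ä` and `ö` are separate letters that sort after `z`).

```python
import unicodedata

def sort_key(text):
    """Hypothetical sort key: ignore case and strip combining accents."""
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Keep only the characters that are not combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "\u00c9" is É and "\u00eb" is ë, written as escapes to avoid ambiguity
names = ["Zo\u00eb", "\u00c9tienne", "Etienne", "eva"]

print(sorted(names))
# ['Etienne', 'Zoë', 'eva', 'Étienne']

print(sorted(names, key=sort_key))
# ['Étienne', 'Etienne', 'eva', 'Zoë']
```

With the key applied, both spellings of the name sit next to each other; because Python's sort is stable, they keep their original relative order.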


And, by the way, do you know the encoding of the strings the list will contain? And why does that matter?
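
One way to see why the encoding matters is to look at the bytes behind a string. This is a minimal sketch; UTF-8 and Latin-1 are simply two encodings picked for the comparison.

```python
name = "\u00c9tienne"  # "Étienne", with É written as an escape

utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("latin-1")

# The same string is stored as different bytes depending on the encoding
print(utf8_bytes)    # b'\xc3\x89tienne'
print(latin1_bytes)  # b'\xc9tienne'

# Decoding with the wrong encoding raises no error here,
# but it produces different characters
misread = utf8_bytes.decode("latin-1")
print([ord(ch) for ch in misread[:2]])  # [195, 137]
print(misread == name)                  # False
```

A list that mixes correctly and incorrectly decoded strings will sort on the wrong code points, however careful the sorting code itself is.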

You can see how complex seemingly trivial tasks of information retrieval, like alphabetizing a list, really are.