# Summer of Code: Problem 0 worked solution

How do you go about solving this problem? This post works through one solution, step by step.

If you've not done so already, [have a look at the problem](https://learn2.open.ac.uk/mod/quiz/view.php?id=1352295).

In summary, we're given an input file that looks like this…
```
Tilly, Daisy-May, Tori
Iona, Deniz, Kobe, Grayson, Luka
Demi, Reanne, Tori
Dafydd, Reanne, Rohit, Kai, Iona, Nojus
Tommy, Rosa, Demi
Daisy-May, Tilly, Grayson, Deniz, Kobe, Tommy, Rohit
Sultan, Iona, Dafydd, Rosa, Kobe, Devan
Tilly, Rohit, Tori, Deniz, Kobe, Jennie
Luka, Tori, Tommy
Kobe, Rosa, Demi
```
…and have to solve some tasks with it. 

I'll implement this solution in Python 3, but the ideas should translate across to any other programming language. 

# Part 1
To begin with, I'll only think about the first task. I'll deal with the second part once I've got the first part working.

In summary, the task is to find the maximum number of names on any one line.

## Where to start
Let's think about the overall shape of a solution to this task: the **algorithm**. We're given a text file of invitations and we have to read it. Once we've read the file, we have to split it into lines and then split each line into names, and then do something about counting names on a line.

So, the subtasks are:
1. Read the text file.
2. Split the file into lines.
3. For each line, split the line into names.
4. Count the number of names on each line.
5. Find the highest count.

None of that seems particularly complicated. There's lots of repetition, but no really complex logic to deal with.

Next, let's think about the **data structures** we'll need. I find that often, once the data structures are sorted, the rest of the program follows. When choosing data structures, I find it helpful to think about what's stored and what operations we need to perform on that data. 

In this case, we need to store a bunch of lines and a bunch of names within each line.

We don't do anything with a _name_ apart from count it, so storing it as a simple `string` should be sufficient.

A _line_ is a group of names. All we need to do is count how many names are in a line, so a simple container should do. Let's keep it simple and just use a `list` of names (a `list` of `string`s).

We also need to store the whole set of lines. We'll call that the _invitations_. All we need to do is process all the lines in the invitations to get the answer. Again, a simple container such as a `list` of lines will do. 
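
As a rough sketch of the shape we're aiming for, here are the first two lines of the sample above written out by hand as a `list` of `list`s of `string`s. The `number_of_lines` and `names_on_second_line` names are just for illustration.

```python
# Each name is a string, each line is a list of names,
# and the invitations are a list of those lines
invitations = [
    ['Tilly', 'Daisy-May', 'Tori'],
    ['Iona', 'Deniz', 'Kobe', 'Grayson', 'Luka'],
]

number_of_lines = len(invitations)          # how many lines we're holding
names_on_second_line = len(invitations[1])  # how many names are on the second line

print(number_of_lines, names_on_second_line)
```

Nothing here needs anything fancier than indexing and `len()`, which is why plain `list`s are enough.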

Let's now take each of these subtasks in turn.

## Steps 1 and 2: read a text file
In Python, the built-in `open()` function opens a file. By default, it opens the file for reading as text. It also conveniently allows us to process the file line by line. The standard Python idiom for this is:

```
for line in open('path/to/file'):
    # do some processing
```
As this datafile is in a strange place, store the path in a variable for reuse.

In [1]:
invitation_filename = '../../data/00-invites.txt'

However, Python leaves the trailing newline character on the end of each line. We can get rid of that with the `str.strip()` method.
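
Here's a minimal sketch of that trailing newline and what `strip()` does to it. It assumes a small temporary file as a stand-in for the real data: the file contents and the `sample_filename`, `raw_lines` and `stripped_lines` names are made up for illustration. It also wraps the `open()` call in a `with` block, a common variation that closes the file once we're finished with it.

```python
import os
import tempfile

# Write a tiny stand-in for the invitations file (sample data, not the real input)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as sample:
    sample.write('Tilly, Daisy-May, Tori\nDemi, Reanne, Tori\n')
    sample_filename = sample.name

# Read it back line by line, keeping both the raw and the stripped versions
raw_lines = []
stripped_lines = []
with open(sample_filename) as sample_file:
    for line in sample_file:
        raw_lines.append(line)
        stripped_lines.append(line.strip())

os.remove(sample_filename)  # tidy up the temporary file

print(raw_lines)       # each line still ends with '\n'
print(stripped_lines)  # the newlines have gone
```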

We also need to create the _invitations_ to store the lines in. This starts as an empty list, and we `append()` lines to it as they're read.

Note that the line-as-read is just a long `string` of all the names.

In [2]:
# Create the empty list
invitation_lines = []

# Iterate over all the lines in the file
for line in open(invitation_filename):
    invitation_lines.append(line.strip())
    
# Finish by saying how many lines we've read
len(invitation_lines)

200

## Step 3: split the invitation lines into names
Now we have the list of invitation lines, we can process each line and split it into names. Luckily, the built-in `str.split()` method does this for us. It works like this:

In [3]:
'Tom, Dick, Harry'.split(', ')

['Tom', 'Dick', 'Harry']

This splits a string into parts, using the given delimiter. 
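
The delimiter has to match exactly, which is worth a quick experiment. This is only a sketch, and the variable names are made up for illustration. It shows what happens if we forget the space after the comma, and what `split` does with an empty string.

```python
# ', ' is a comma followed by a space, matching how the names are written
with_space = 'Tom, Dick, Harry'.split(', ')

# Splitting on just ',' leaves the spaces stuck to the front of the later names
without_space = 'Tom, Dick, Harry'.split(',')

# An empty string still splits into one (empty) item
blank_line = ''.split(', ')

print(with_space)     # ['Tom', 'Dick', 'Harry']
print(without_space)  # ['Tom', ' Dick', ' Harry']
print(blank_line)     # ['']
```

That last case means a blank line in an input file would count as having one name, not zero. That's something to watch for if your data might contain blank lines.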

We can use that to `split` each invitation line into names. We follow the same overall pattern as above, creating an empty list of `invitations` and `append`ing to it as we process each line.

In [4]:
# Create the empty list
invitations = []

# Iterate over all the invitations
for line in invitation_lines:
    invitations.append(line.split(', '))
    
# Finish by saying how many lines we've processed, and the first split invitation
len(invitations), invitations[0]

(200,
 ['Caprice',
  'Marlene',
  'Carlie',
  'Fatema',
  'Glyn',
  'Kaycee',
  'Ainsley',
  'Cloe',
  'Zunaira',
  'Tyrell',
  'Annaliese',
  'Ameera',
  'Darrell',
  'Caiden',
  'Reyansh',
  'Oran',
  'Ebonie',
  'Corben',
  'Dionne',
  'Dafydd',
  'Harrison',
  'Mikolaj',
  'Tommy',
  'Marley',
  'Crystal',
  'Aryan',
  'Sebastian',
  'Xena'])

## Step 4: count the names in each line
The built-in `len()` function counts the number of items in a list. As we have a bunch of name lists, we can use `len()` to count the number of names in each line. 

Again, we'll use the same pattern of iterating over all the lines in the invitations.

In [5]:
# Create the empty list
invitation_lengths = []

# Iterate over all the invitations
for line in invitations:
    invitation_lengths.append(len(line))
    
# Finish by giving the first few counts
invitation_lengths[:5]

[28, 13, 20, 20, 22]

## Step 5: find the highest count
There are a couple of ways we can go here. The built-in `max()` function, when passed a list of numbers, finds the maximum value in that list. That means the solution is easy:

In [6]:
max(invitation_lengths)

33

But there's a more general pattern for when `max()` isn't applicable. It uses an _accumulator_ which starts off at some initial value and is updated as each item is processed. The accumulator could be something like a running total when going through a list of amounts. In this case, the accumulator is the most names we've seen on a line so far. If we see a longer line than any we've seen before, we update the most names accumulator.

That gives this pattern:

In [7]:
# Set the accumulator to some initial value
most_names = 0

# Iterate over all the invitation lengths
for name_count in invitation_lengths:
    if name_count > most_names:
        most_names = name_count

# Finish by giving the highest name count
most_names

33
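
The same accumulator shape covers the running total mentioned above. Here's a minimal sketch, using some made-up `amounts`:

```python
amounts = [12, 7, 30, 5]  # made-up numbers, just for illustration

# Set the accumulator to some initial value
running_total = 0

# Update the accumulator as each item is processed
for amount in amounts:
    running_total += amount

print(running_total)  # 54
```

The only things that change between the two uses are the initial value and the rule for updating the accumulator.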

## Refactoring
We have a working solution, but it's rather repetitive. Can we make it simpler? In particular, do we really need to walk over the list of invitations so many times?

_Refactoring_ is the process of tidying up and simplifying an already-working solution into something better.

Here's the solution we have.

In [8]:
# Create the empty list
invitation_lines = []

# Iterate over all the lines in the file
for line in open(invitation_filename):
    invitation_lines.append(line.strip())

invitations = []

# Iterate over all the invitations
for line in invitation_lines:
    invitations.append(line.split(', '))
    
invitation_lengths = []

# Iterate over all the invitations
for line in invitations:
    invitation_lengths.append(len(line))
    
# Give the solution
max(invitation_lengths)

33

As you can see, each stage involves cycling over the list of all invitations. As each stage just transforms individual lines, we can do all of the operations within one loop.

In [9]:
# Create the empty list
invitation_lengths = []

# Iterate over all the lines in the file
for line in open(invitation_filename):
    stripped_line = line.strip()
    names = stripped_line.split(', ')
    name_count = len(names)
    invitation_lengths.append(name_count)

# Give the solution
max(invitation_lengths)

33

If we include the "accumulator" version of finding the longest line, we don't need to store any of the invitation lines at all, like this:

In [10]:
# Set the accumulator to some initial value
most_names = 0

# Iterate over all the lines in the file
for line in open(invitation_filename):
    stripped_line = line.strip()
    names = stripped_line.split(', ')
    name_count = len(names)
    if name_count > most_names:
        most_names = name_count

# Finish by giving the highest name count
most_names

33

## A more Pythonic approach
If you're interested, a more "Pythonic" approach (using idiomatic Python) to this problem is to use _list comprehensions_ rather than lots of `append` calls. 

We can find `invitation_lines` like this:

In [11]:
invitation_lines = [line.strip() for line in open(invitation_filename)]

and then find the `invitations` from `invitation_lines`

In [12]:
invitations = [line.split(', ') for line in invitation_lines]

Or even combine the two steps in one call:

In [13]:
invitations = [line.strip().split(', ') for line in open(invitation_filename)]

From `invitations`, we can find the number of names on each line with another comprehension:

In [14]:
invitation_lengths = [len(names) for names in invitations]

and then find the maximum length:

In [15]:
max(invitation_lengths)

33

If you really want to go to town, you can combine all these steps into a one-liner:

In [16]:
max(len(line.strip().split(', ')) for line in open(invitation_filename))

33
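
Strictly speaking, the one-liner has no square brackets, so it's a _generator expression_ rather than a list comprehension: `max()` is fed the counts one at a time and no list of counts is built. Here's a small sketch of the difference, using two hand-written `sample_lines` in place of the file:

```python
# Two lines as they'd come out of the file, trailing newlines and all
sample_lines = ['Tilly, Daisy-May, Tori\n', 'Iona, Deniz, Kobe, Grayson, Luka\n']

# List comprehension: builds the whole list of counts first
lengths_list = [len(line.strip().split(', ')) for line in sample_lines]

# Generator expression: hands the counts to max() one at a time
longest = max(len(line.strip().split(', ')) for line in sample_lines)

print(lengths_list)  # [3, 5]
print(longest)       # 5
```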

# Part 2
Now to look at the second part. This is a bit more complex. We have to _count_ the number of times each name appears anywhere in the file, and then count how many names appear more than once.

The first chunk of the algorithm is the same as before. In summary, the subtasks are:
1. Read the text file.
2. Split the file into lines.
3. For each line, split the line into names.
4. _Something something count names in whole input._
5. Count how many names appear more than once.

The first three steps are the same as before.

Note the lack of detail in step 4: we don't yet know how we'll do that counting.

Perhaps things will be clearer if we think about the **data structure** for keeping track of the counts of all the names.

What we want is a data structure that holds a bunch of names, and associates each name with the number of times we've seen that name so far. When we first see a name in the file, we want to include it in the data structure. When we see a name _again_, we want to increment the number of times we've seen it.

In subtask 5, we can look at that data structure and pick out all the names with a count of two or more.

This association of a _name_ and a _number_ is a classic use of a _key-value store_, which crops up all over computing. In this case, the name is the key and the number of occurrences is the value. A simple built-in key-value store is called a `dict` in Python, a `Hash` in Ruby, and a `Map` in Java (and an `object` in JavaScript, but… well, [JavaScript](https://www.destroyallsoftware.com/talks/wat)).
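
If you've not met a `dict` before, here's a minimal sketch of the handful of operations we'll need. The names and counts are made up for illustration.

```python
# A dict associating each name (the key) with a count (the value)
name_counts = {'Tori': 2, 'Kobe': 1}

# Look up the value stored for a key
tori_count = name_counts['Tori']

# Update the value for a key that's already there
name_counts['Kobe'] += 1

# Add a new key with its value
name_counts['Rosa'] = 1

# Test whether a key is present
rosa_present = 'Rosa' in name_counts
luka_present = 'Luka' in name_counts

print(tori_count, name_counts, rosa_present, luka_present)
```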

That seems like a useful thing to use. Going back to the problem, we have the updated list of subtasks:

1. Read the text file.
2. Split the file into lines.
3. For each line, split the line into names.
4. For each name in the input:
   1. If it's in the name counts, increase its count by 1
   2. Otherwise, include the name in the name counts with a count of 1
5. Count how many names appear more than once.

This is a slightly more complex algorithm as it includes both loops and conditionals, but it's not too bad.

We can reuse the first three subtasks from Part 1, and assume the `invitations` list-of-lists-of-names already exists. 

## Step 4
When we implement step 4 using the logic above, the Python code looks like this:

In [17]:
# Empty dict
name_counts = {}

# for each name in the input
for invite in invitations:
    for name in invite:
        # record how many times we've now seen this name
        if name in name_counts:
            name_counts[name] += 1
        else:
            name_counts[name] = 1

If you squint, you can see something like the "accumulator" pattern again, with the `name_counts` `dict` being the accumulator.

(If you want a more Pythonic approach, take a look at the [`defaultdict`](https://docs.python.org/3/library/collections.html#defaultdict-objects) and [`Counter`](https://docs.python.org/3/library/collections.html#counter-objects) objects in the [`collections`](https://docs.python.org/3/library/collections.html) standard library.)
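
As a taste of the `Counter` route, here's a minimal sketch. It uses three hand-written lines as `sample_invitations` rather than the real input, and `repeated` is just an illustrative name.

```python
from collections import Counter

# Three lines from the sample, already split into names
sample_invitations = [
    ['Tilly', 'Daisy-May', 'Tori'],
    ['Demi', 'Reanne', 'Tori'],
    ['Tommy', 'Rosa', 'Demi'],
]

# Counter does the "have we seen this name before?" bookkeeping for us
name_counts = Counter(name for invite in sample_invitations for name in invite)

# Pick out the names that appear more than once
repeated = sorted(name for name, count in name_counts.items() if count > 1)

print(name_counts['Tori'])  # 2
print(repeated)             # ['Demi', 'Tori']
```

It does the same job as the loop above, so use whichever you find easier to read.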

## Step 5
Talking of accumulators, we use that pattern again to do step 5, but this time we're iterating over the `name_counts` rather than the input. In this case, the accumulator is the number of people invited.

In [18]:
number_invited = 0
for name in name_counts:
    if name_counts[name] > 1:
        number_invited += 1
        
number_invited

457

## An alternative way
There's another way of solving this task. As we only need to record the names we've seen twice or more, we can walk over the input building the list of people to invite directly.

The trick is to maintain two lists of names: one is the list of names to invite, and one is a list of names we've seen in the input. 

When we see a name for the first time, we add it to the list of seen names. When we see a name again, it's already in the seen names so we know to add it to the list of invited people.

In both cases, we don't want to add a name to either list if it's already there. That implementation gets easier if we use `set`s, which don't allow duplicate items. That means we don't use `list`s for `seen` and `invited`, but use `set`s instead. 

That gives the algorithm of:
1. For each name in the input
   1. If the name is in the seen set
      1. Add it to the invited set
   2. Add the name to the seen set
   
Note that the order of these steps is important: we want to check for inviting a person _before_ recording that we've seen them at all.

In [19]:
# create empty sets
seen = set() # all the names we've seen
invited = set() # the people to invite

# for each name in the input
for invite in invitations:
    for name in invite:
        # invite this person if we've seen them before
        if name in seen:
            invited.add(name)
        # record that we've now seen this person
        seen.add(name)
        
len(invited)

457
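
To see why the order of those two steps matters, here's a small sketch that runs both orders over a made-up list of `sample_names`. The `_wrong` variables are only there for comparison.

```python
sample_names = ['Tori', 'Demi', 'Tori', 'Rosa']  # made-up input; only Tori repeats

# Right order: check for an invite, then record that we've seen the name
seen = set()
invited = set()
for name in sample_names:
    if name in seen:
        invited.add(name)
    seen.add(name)

# Wrong order: record the name first, so the check always succeeds
seen_wrong = set()
invited_wrong = set()
for name in sample_names:
    seen_wrong.add(name)
    if name in seen_wrong:
        invited_wrong.add(name)

print(sorted(invited))        # ['Tori']
print(sorted(invited_wrong))  # ['Demi', 'Rosa', 'Tori']
```

With the steps swapped, every name is already in the seen set by the time we check it, so everyone gets invited.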

## Which is better?
Which of these two ways is better? It depends a lot on what you mean by "better". 

The approach with the `dict` of counts is perhaps easier to follow as a _process_. It's also more easily extendable to other conditions and tasks, such as different thresholds for getting an invite or finding the most popular person (i.e. the person who was mentioned the most).
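
To make that flexibility concrete, here's a minimal sketch of both extensions. It assumes a small, illustrative `name_counts`, and the `threshold`, `invited` and `most_popular` names are made up.

```python
# A few illustrative counts, standing in for the full name_counts dict
name_counts = {'Tori': 4, 'Kobe': 5, 'Demi': 3, 'Luka': 2, 'Jennie': 1}

# A different threshold for getting an invite: mentioned at least three times
threshold = 3
invited = [name for name in name_counts if name_counts[name] >= threshold]

# The most popular person: the key with the largest count
most_popular = max(name_counts, key=name_counts.get)

print(invited)       # ['Tori', 'Kobe', 'Demi']
print(most_popular)  # Kobe
```

Changing the rule is a one-line change, because the counts are all still there to ask questions of.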

The approach with `set`s of names is perhaps closer to a description of the _conditions_ the solution must fulfil: `invited` is the `set` of names we've `seen` and `seen` again. But it's less flexible: how would you use that approach to ensure you only invited people who were mentioned at least three times, or ten times?

But as a learning exercise, the main thing is seeing many different ways of solving problems so you have a range of alternatives to choose from when you come across fresh challenges.