# Session 8: Greedy algorithms and the knapsack problem

*Data Structures and Algorithms*

*Achyuthuni Sri Harsha*

------------------------------------------------------------------------

We will finish the course with a discussion on the limits of
computation. Even with all the computational resources at our disposal,
some problems turn out to be difficult to solve exactly at large
scale. We will talk about intractability and the use of approximations.
Examples include the knapsack problem and the traveling salesman
problem. We will conclude with some data analytics using the popular
pandas library.

------------------------------------------------------------------------

## Preparation

**Readings:**

Guttag. Chapter 12.1.

VanderPlas, Jake. Python Data Science Handbook.

-   <https://github.com/jakevdp/PythonDataScienceHandbook>

-   Chapter 3

Reda, Greg. Intro to pandas data structures.

-   <http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/>

Evans, Julia. Pandas cookbook.

-   <https://github.com/jvns/pandas-cookbook>

-   Chapters 1-2.

***Optional Readings:***

Moffitt, Chris. Common Excel Tasks Demonstrated in Pandas.

-   <http://pbpython.com/excel-pandas-comp.html>

10 Minutes to pandas

-   A succinct reference for key tools in the package.

-   <http://pandas.pydata.org/pandas-docs/stable/10min.html>

Rougier, Nicolas. Matplotlib tutorial.

-   <https://www.labri.fr/perso/nrougier/teaching/matplotlib/>

Augspurger, Tom. Modern pandas.

-   Advanced material on best practices.

-   <http://tomaugspurger.github.io/modern-1.html>

Bokeh 5-minute overview.

-   <http://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/master/quickstart/quickstart.ipynb>

Scikit-learn tutorial.

-   <http://scikit-learn.org/stable/tutorial/basic/tutorial.html>

**Questions:**

Please read the material above, and think about how you would explain to
your classmates:

-   What are the assumptions of the (0-1) knapsack problem, and what is
    the output? What is the computational complexity of the problem?

-   There is a greedy algorithm for the knapsack problem. What is a
    greedy algorithm, and what are the main benefits and problems of
    greedy algorithms?

------------------------------------------------------------------------


## The knapsack problem

During the lecture, we started looking at the knapsack problem. The
problem is simple: you want to pick the most valuable items to fill out
a 'knapsack' with limited capacity. But the applications are plentiful:
many problems with budget constraints in finance and operations
management are versions of the knapsack problem or closely related to
it.

More precisely, we start with a set of $n$ items that each have values
$v_i$ and weights $w_i$ and are trying to maximize the total value
of items to pick, subject to a capacity constraint $W$:

$$ \max_{x_i \in \{0,1\}} \sum_{i} x_i v_i $$

$$ \text{s.t. } \sum_{i} x_i w_i \le W $$

This is the binary (0/1) version of the knapsack problem, where we
either pick each item or do not.
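
To make the formulation concrete, the sketch below evaluates the objective and checks the capacity constraint for one candidate selection $x$. The numbers and the helper name `evaluate` are made up for illustration and are not part of the course files.

```python
# Illustrative data (assumed, not from the course files)
values = [60.0, 100.0, 120.0]   # v_i
weights = [10.0, 20.0, 30.0]    # w_i
W = 50.0                        # capacity

def evaluate(x, values, weights):
    """Return (total value, total weight) of the 0/1 selection x."""
    total_value = sum(xi * vi for xi, vi in zip(x, values))
    total_weight = sum(xi * wi for xi, wi in zip(x, weights))
    return total_value, total_weight

x = [0, 1, 1]  # pick the second and third items only
total_value, total_weight = evaluate(x, values, weights)
feasible = total_weight <= W
print(total_value, total_weight, feasible)  # 220.0 50.0 True
```

Picking all three items would give a higher value but a total weight of 60, which violates the constraint.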

Our goal is to fill a knapsack (a budget) with generic items (e.g.,
projects for a firm). To implement this in Python, our object-oriented
instincts suggest creating a `class Item`.

In [1]:
class Item(object):
    def __init__(self, n, v, w):
        self.name = n
        self.value = float(v)
        self.weight = float(w)
    # more code...

When creating a new knapsack item, we initialise it with a specified
name, value, and weight. The complete implementation of the class is
given in the Python file `ses08.py`.
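
The full class lives in `ses08.py`. As a rough idea of what the elided part might contain, here is a minimal sketch with accessor methods and a string representation. The method names `get_name`, `get_value` and `get_weight` are assumptions for illustration and may differ from the course code.

```python
class Item(object):
    """A knapsack item. Accessor names are assumed, not the course's API."""

    def __init__(self, n, v, w):
        self.name = n
        self.value = float(v)
        self.weight = float(w)

    def get_name(self):
        return self.name

    def get_value(self):
        return self.value

    def get_weight(self):
        return self.weight

    def __str__(self):
        # Readable form, handy when printing a knapsack's contents
        return '<' + self.name + ', ' + str(self.value) + ', ' + str(self.weight) + '>'

tea = Item('green tea', 30, 2)
print(tea)  # <green tea, 30.0, 2.0>
```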

### Filling the knapsack

Before turning to the binary knapsack problem, let's first look at the
simpler *fractional* version of it. In the fractional knapsack, we
assume that our items are divisible. For example, you're travelling to
Asia and want to bring back different kinds of tea leaves - you can
divide them into small amounts.

We developed *greedy approaches* for this choice in the lecture. Roughly
speaking, a greedy choice is myopic: it seeks to maximize the current
value added, without considering the future consequences of the choice.
We looked at several greedy choices: the maximum-value item, the
minimum-weight item, and the best item in terms of bang-for-buck. The
function `greedy` contains skeleton code for an implementation of this
choice.

In [2]:
def greedy(items, max_weight, key_function):
    """
    Greedy solution of the knapsack problem

    Parameters:
        items - list of potential knapsack items
        max_weight - number >= 0
        key_function - function that maps elements of items to floats

    Returns:
        list of items chosen by the greedy knapsack algorithm,
        where items are picked in the order given by key_function,
        and the total value of those items

    """
    # Sort items in decreasing order of key_function
    sorted_items = sorted(items, key=key_function, reverse=True)
    result = []
    total_value = 0.0 # knapsack value
    total_weight = 0.0 # knapsack weight <= max_weight
    # more code...

The function takes three inputs: a list of items, a capacity constraint,
and `key_function`, a function that defines your greedy choice.
Notice how Python allows us to pass a function as an argument to another
function in this convenient way.

The function `greedy` then proceeds to pass the `key_function` on to
Python's `sorted` function, which sorts the list of items by the value
the key function returns for each item.
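
As a small illustration of passing a function to `sorted`, the sketch below ranks a few made-up items under three different keys. Items are plain `(name, value, weight)` tuples here, which is an assumption for brevity; the course code works with `Item` objects instead.

```python
# Hypothetical items as (name, value, weight) tuples
items = [('clock', 175, 10), ('painting', 90, 9), ('radio', 20, 4), ('vase', 50, 2)]

def value(item):
    return item[1]

def weight_inverse(item):
    return 1.0 / item[2]

def density(item):
    return item[1] / item[2]

# The same call to sorted gives a different order for each key function
for key_function in (value, weight_inverse, density):
    ranked = sorted(items, key=key_function, reverse=True)
    print(key_function.__name__, [name for name, _, _ in ranked])
```

With `reverse=True`, the item with the largest key comes first, so `weight_inverse` puts the lightest item at the front.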

### Greedy

First, complete the key functions `value`, `weight_inverse` and
`density` we'll use for sorting. These functions should call the
relevant methods in the `Item` class to get the desired values. You can
try out sorting a list of items by the different functions using
`sorted`.

Then complete the function `greedy`. After sorting the list of items
based on the desired function, the function should consider each item in
turn, adding it to the knapsack if it still fits, and finally return:

-   `result`, a list containing the items in the knapsack.
-   `total_value`, the sum of the values of the items in the knapsack

Solve the knapsack problem with the different greedy choices. We saw in
the lecture that the bang-for-buck greedy algorithm seems like a
reasonable choice. Try your greedy algorithm out on different sets of
items. How different are your results with the different greedy
approaches?
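
One possible way to complete the loop is sketched below. It is not the reference solution: `Item` is replaced by a `namedtuple` stand-in with plain attributes, and the example data is invented.

```python
from collections import namedtuple

# Stand-in for the course's Item class (assumption: plain attributes)
Item = namedtuple('Item', ['name', 'value', 'weight'])

def density(item):
    return item.value / item.weight

def greedy(items, max_weight, key_function):
    sorted_items = sorted(items, key=key_function, reverse=True)
    result = []
    total_value = 0.0
    total_weight = 0.0
    for item in sorted_items:
        # Take the item only if it still fits in the remaining capacity
        if total_weight + item.weight <= max_weight:
            result.append(item)
            total_weight += item.weight
            total_value += item.value
    return result, total_value

items = [Item('clock', 175.0, 10.0), Item('painting', 90.0, 9.0),
         Item('radio', 20.0, 4.0), Item('vase', 50.0, 2.0)]
taken, total = greedy(items, 20, density)
print([item.name for item in taken], total)  # ['vase', 'clock', 'radio'] 245.0
```

On this data, sorting by value alone picks the clock and the painting for a total of 265, so the density rule is not guaranteed to be the best of the greedy choices.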

### Brute force

The solution from the greedy algorithm is not in general optimal in the
binary case. But what is the optimal solution?
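
A small made-up instance shows the gap. With capacity 50, the density-ordered greedy rule takes items A and B and then cannot fit C, although B and C together are feasible and worth more.

```python
# Illustrative instance (assumed data): (name, value, weight), capacity 50
items = [('A', 60.0, 10.0), ('B', 100.0, 20.0), ('C', 120.0, 30.0)]
capacity = 50.0

# Greedy by density (value / weight): A (6.0), then B (5.0), then C (4.0)
by_density = sorted(items, key=lambda it: it[1] / it[2], reverse=True)
greedy_value, used = 0.0, 0.0
for name, v, w in by_density:
    if used + w <= capacity:
        used += w
        greedy_value += v

# A feasible alternative the greedy rule misses: B and C together
alternative_weight = 20.0 + 30.0
alternative_value = 100.0 + 120.0

print(greedy_value, alternative_value)  # 160.0 220.0
```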

For relatively small problems, we can implement a 'brute force' solution
of the knapsack problem, which simply loops through all possible subsets
of items that we can create from the items we have, and picks the best
feasible one.

The problem with this approach is that there are *exponentially many*
subsets to go through. To create the subsets, the code contains a
function `power_set` that creates all possible subsets of a list of
items $L$. To do this, it uses a clever Python concept called a
*generator*: it does not try to create all the subsets at once and store
them in memory - with an exponential number of subsets, this could get
ugly! Instead, the generator creates subsets "live": whenever we need
one, it generates the next one. This is done with the keyword `yield`.
You can read more about generators in Chapter 8.3.1 of the textbook. The
result is that the output from the function `power_set` can simply be
iterated through. This is done in the `brute_force` function that
follows.
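
The idea can be sketched as follows. This is one possible way to write such a generator, using the bits of an integer to decide membership. The course's `power_set` and `brute_force` may be written differently, and the tuple-based items are an assumption.

```python
def power_set(items):
    """Yield every subset of items, one at a time (2**n subsets in total)."""
    n = len(items)
    for mask in range(2 ** n):
        # Bit j of mask decides whether items[j] is in this subset
        yield [items[j] for j in range(n) if (mask >> j) & 1]

def brute_force(items, max_weight):
    """Return (best subset, best value); items are (name, value, weight) tuples."""
    best_subset, best_value = [], 0.0
    for subset in power_set(items):
        weight = sum(w for _, _, w in subset)
        value = sum(v for _, v, _ in subset)
        if weight <= max_weight and value > best_value:
            best_subset, best_value = subset, value
    return best_subset, best_value

items = [('A', 60.0, 10.0), ('B', 100.0, 20.0), ('C', 120.0, 30.0)]
best_subset, best_value = brute_force(items, 50.0)
print([name for name, _, _ in best_subset], best_value)  # ['B', 'C'] 220.0
```

Only one subset exists in memory at any time, but the loop still visits all $2^n$ of them, so the running time doubles with every extra item.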

Try finding the optimal solution with the brute-force algorithm for
small problems. How fast is the greedy solution by comparison? How large
a problem can you solve via brute force? What is the complexity of the
brute-force solution?

### Greedy fractional

Complete the function `greedy_fractional` in `ses08_extra.py`, which
solves a fractional version of the knapsack problem. In this problem,
you can pick not just entire items but any fraction of an item. We pick
in the same order as before, but if we cannot fit the entire item, we
just use as much of that item as we can until the knapsack is full. In
this case, the greedy algorithm with density sorting provides the
optimal solution.
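
A minimal sketch of the fractional rule follows, assuming items are `(name, value, weight)` tuples with positive weights. The return format, a list of `(name, fraction)` pairs plus the total value, is also an assumption and may not match what `ses08_extra.py` expects.

```python
def greedy_fractional(items, max_weight):
    """Fill the knapsack in density order, splitting the last item if needed."""
    sorted_items = sorted(items, key=lambda it: it[1] / it[2], reverse=True)
    result = []
    total_value = 0.0
    remaining = float(max_weight)
    for name, value, weight in sorted_items:
        if remaining <= 0:
            break
        # Take the whole item if it fits, otherwise the fraction that does
        fraction = min(1.0, remaining / weight)
        result.append((name, fraction))
        total_value += fraction * value
        remaining -= fraction * weight
    return result, total_value

items = [('A', 60.0, 10.0), ('B', 100.0, 20.0), ('C', 120.0, 30.0)]
picked, total = greedy_fractional(items, 50.0)
print(picked, total)
```

Here A and B are taken whole and two thirds of C fills the rest, for a total of about 240, compared with 220 for the best 0/1 selection on the same data.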

### Question 4: Fantasy football

We'll now apply our algorithm to a potentially useful real-world
application: fantasy football (FF). In fantasy football, you act as a
manager of a football team in, e.g., the Premier League and try to build the
best team of players possible using a limited budget. FF leagues are
popular among football fans: for example, the Premier League organizes
an official league on its website, with significant prizes for best
'managers'.

How is the success of a FF team determined? The Premier League has a
set of rules for this: for example, your players will get a certain
number of points for scoring/assisting goals, keeping a clean sheet,
being named man of the match, and so on. The league also determines the
costs of acquiring different players into your team based on their
popularity and performance.

You're probably starting to see this as a knapsack problem: we have a
limited budget, as well as costs for each player. However, we're missing
the 'values' of different players. Determining a player's expected
future value for your team (how many goals will they score in following
games) may be more art than science, but one simple way to try to value
them is to use their past performance (in terms of points already
accumulated). The Premier League helpfully publishes the current scores
of each player on their website. If we download this data, we have
everything we need to solve a knapsack problem.

This data (current as of 20/9) is included in an HTML file downloaded
from the website. The Python file `ses08_extra.py` further includes
helpers for you to read this data into a list of players:

-   The `read_players` function takes JSON data downloaded from the
    Premier League website containing the players' values, and parses it
    to create lists of player attributes.
-   The `Player` class is very similar to the `Item` class, but adapted
    for player attributes.

In order to allow you to start building your team, we won't go into much
detail here - you can just run the functions. But going through these
functions later is a useful exercise to develop your skills in using
code written by others.

Go ahead and solve the problem using the greedy algorithm on the list of
players and a budget constraint. Try out different budget constraints.
What is the result? Is this what a football team looks like? What went
wrong?

You'll probably notice that you want exactly 11 players in your team,
and in specific positions too (at most one goalkeeper, and so on). In
the language of optimisation, this is a *cardinality constraint* on the
knapsack problem. That is, the optimisation problem would have an
additional constraint

$$ \sum_i x_i \le 11. $$

This is a somewhat more challenging problem. Try to implement a greedy
heuristic with a simple way of dealing with this: when picking in the
greedy order, we'll keep track of the numbers in different positions and
make sure the solution does not break any of the constraints. This is an
open problem for you to explore, so there are no tests for this
function.
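
One simple way to track positions while picking greedily is sketched below. Everything about the data layout is an assumption: players are `(name, position, cost, points)` tuples with positive costs, `position_limits` is a hypothetical dictionary of per-position maxima, and the demo uses a three-player "team" with invented names to keep the example short.

```python
def greedy_team(players, budget, position_limits, team_size=11):
    """Pick players in points-per-cost order without breaking any limit."""
    ranked = sorted(players, key=lambda p: p[3] / p[2], reverse=True)
    team, spent, points = [], 0.0, 0.0
    counts = {position: 0 for position in position_limits}
    for name, position, cost, pts in ranked:
        if len(team) == team_size:
            break
        if counts.get(position, 0) >= position_limits.get(position, 0):
            continue  # this position is already full (or not allowed)
        if spent + cost > budget:
            continue  # too expensive for the remaining budget
        team.append(name)
        counts[position] += 1
        spent += cost
        points += pts
    return team, spent, points

# Invented data for illustration only
players = [('gk_a', 'GK', 5.0, 50), ('gk_b', 'GK', 4.0, 48),
           ('def_a', 'DEF', 6.0, 54), ('fwd_a', 'FWD', 10.0, 80),
           ('fwd_b', 'FWD', 7.0, 63)]
limits = {'GK': 1, 'DEF': 1, 'FWD': 1}
team, spent, points = greedy_team(players, 18.0, limits, team_size=3)
print(team, spent, points)  # ['gk_b', 'def_a', 'fwd_b'] 17.0 165.0
```

The heuristic can stop with fewer than `team_size` players if the budget runs out first, which is one of the things to watch for when you experiment.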

This is probably not the best solution you could find. What could you do
to improve your solution? A brute-force algorithm would find the optimal
solution, but there are many players so it will be quite slow. There are
several optimisation methods that could help you, but these are beyond
the scope of this course so they are left for you to explore. For
example, you could look into dynamic programming and genetic algorithms.
Guttag's book shows you a dynamic programming algorithm for the knapsack
problem (without the extra constraints on the number of players etc);
for genetic algorithms, you could start by searching online for "genetic
algorithms fantasy football python".

## Pandas

Please complete the mandatory exercises in the Jupyter Notebook file
`ses08_pandas.ipynb`.

## All done (really)!

That's it for this module. 

## Review

How would you explain to a classmate:

-   What a greedy algorithm is?
-   What the knapsack problem is, what assumptions we make, and what its
    output is?