# Jupyter Notebook introduction

We'll start off with a very brief introduction on the basics of using Jupyter notebooks.

### Notebook cells

A notebook consists of a sequence of cells. A cell is a multi-line text input field whose contents can be executed by pressing `Shift-Enter` or by clicking the `Run` button in the toolbar. What exactly this does depends on the type of cell. There are four types of cells: *code cells*, *markdown cells*, *raw cells* and *heading cells*. We will focus only on the first two: code and markdown. Every cell starts off as a code cell, but its type can be changed with the dropdown in the toolbar (which initially shows `Code`).

In a code cell you can write *Python* code. When you run the cell (click on it and press `Shift-Enter`), the code runs and its output is displayed beneath the cell. Let's try out a very simple code cell:

In [1]:
x = 5
x = x + 2
print(x)

7


This produces the output you might expect: exactly the same result as executing that bit of *Python* code in a terminal. You can modify the contents of the code cell and run it again with `Shift-Enter` to see how the output changes.

Global variables are shared between cells. This means we can still use variables or functions from the first cell in a second cell, like so:

In [2]:
y = 2 * x
print(y)

14


Notebooks are expected to be run top to bottom, starting with the first cell and ending with the last. **Failing to run some cells or running cells out of order is likely to result in errors.** For example, if we were to run the second cell before the first, we would get an error saying `x` is not defined.
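
As a minimal sketch of what such an error looks like, the snippet below uses a name that no cell has defined. The name `undefined_total` is made up for this illustration; catching the `NameError` only serves to show its message without stopping the notebook.

```python
# `undefined_total` is a hypothetical name that no earlier cell has assigned.
try:
    print(undefined_total)
except NameError as error:
    message = str(error)
    print(message)
```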

Before you hand in this exercise, make sure that it can run without errors from top to bottom. Test this by selecting *Kernel -> Restart & Run All* in the menu.

Now to the actual assignment:

# Pandas

In this series of exercises you will learn to use the library `pandas`. Pandas is a very popular library for storing and manipulating data. For the exercises in this notebook, we will be relying on the following sources:
- Chapter 3 from the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- The official Pandas documentation: [Pandas Documentation](https://pandas.pydata.org/)

## Series 

Start by reading the introduction of `pandas` objects [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html).

Then, run the cell below to import the necessary libraries. In our distribution, we've also included a file named `answers.py`, which we will also import below. After each set of exercises, a function from `answers.py` is called that will check your answers and provide feedback in the form of assertions.

In [3]:
from pandas import Series
import pandas as pd
import answers

### Exercise 1

Create a `Series`-object named `income` containing the following *figures* and using the *sources* as its index.

In [4]:
from pandas import Series
import pandas as pd

income_sources = ["sales", "ads", "subscriptions", "donations"]
income_figures = [39041, 8702, 13200, 292]

# your code here
income = pd.Series(income_figures, index=income_sources)

display(income)

sales            39041
ads               8702
subscriptions    13200
donations          292
dtype: int64

Check your answer by running the cell below.

In [5]:
answers.test_1(income)

Testing dtype of answer: Success!
Testing indices: Success!
Testing values: Success!


### Exercise 2

Create a `Series` named `expenses` that uses the expense information below. Also create a `Series` named `profits` with the profit (income minus expenses). Create a variable named `total_profit` with the sum of all profits in `profits`.

In [25]:
expense_sources = ["ads", "sales", "donations", "subscriptions"]
expense_figures = [4713, 24282, 0, 3302]

# your code here
expenses = pd.Series(expense_figures, index=expense_sources)
# Subtraction aligns on the index labels, so the differing order of the sources does not matter.
profits = (income - expenses).reindex(expense_sources)
total_profit = profits.sum()
print(total_profit)

28938


Check your answer by running the cell below.

In [26]:
answers.test_2(expenses, profits, total_profit)

Testing dtypes: Success!
Testing values: Success!
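
Exercise 2 works because `pandas` matches values by index label rather than by position. A small sketch with made-up figures (the names `earned` and `spent` are not part of the exercise) may make this easier to see:

```python
import pandas as pd

# Hypothetical figures, unrelated to the exercise data.
earned = pd.Series([10, 20, 30], index=["x", "y", "z"])
spent = pd.Series([3, 2, 1], index=["z", "y", "x"])  # same labels, different order

# Subtraction pairs values by label, not by position.
left_over = earned - spent
print(left_over)

# Series.sum() adds up all values in one call.
total_left = left_over.sum()
print(total_left)
```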


## DataFrames

### Exercise 3

Create a `DataFrame` named `skittles` with the *columns* `amount` and `rating`, using the different colors as the *index*.

|&nbsp;      | amount | rating |
|------------|--------|--------|
| **red**    | 7      | 3      |
| **green**  | 4      | 4      |
| **blue**   | 6      | 2      |
| **purple** | 5      | 4      |
| **pink**   | 6      | 3.5    |

Using `Jupyter`'s `display()` makes sure we get a nicely formatted table.

In [30]:
from pandas import DataFrame

# your code here
amount_dict = {'red': 7, 'green': 4, 'blue': 6,
             'purple': 5, 'pink': 6}
amount = pd.Series(amount_dict)
rating_dict = {'red': 3, 'green': 4, 'blue': 2,
             'purple': 4, 'pink': 3.5}
rating = pd.Series(rating_dict)
skittles = pd.DataFrame({'amount': amount, 'rating': rating})

display(skittles)

|&nbsp;      | amount | rating |
|------------|--------|--------|
| **red**    | 7      | 3.0    |
| **green**  | 4      | 4.0    |
| **blue**   | 6      | 2.0    |
| **purple** | 5      | 4.0    |
| **pink**   | 6      | 3.5    |


Check your answer by running the cell below.

In [31]:
answers.test_3(skittles)

Testing dtype: Success!
Testing labels: Success!
Testing values: Success!


### Exercise 4

Calculate the mean `rating` and save it as `skittles_average`.

In [35]:
# your code here
skittles_average = skittles['rating'].mean()

display(skittles_average)

3.3

Check your answer by running the cell below.

In [36]:
answers.test_4(skittles_average)

Testing: Success!

### Exercise 5

Add a new column to the skittles `DataFrame` named `score`. The score of a color is equal to `amount * rating`.

In [38]:
# your code here
skittles.insert(2, "score", skittles['amount'] * skittles['rating'])
display(skittles)

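|&nbsp;      | amount | rating | score |
|------------|--------|--------|-------|
| **red**    | 7      | 3.0    | 21.0  |
| **green**  | 4      | 4.0    | 16.0  |
| **blue**   | 6      | 2.0    | 12.0  |
| **purple** | 5      | 4.0    | 20.0  |
| **pink**   | 6      | 3.5    | 21.0  |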


Check your answer by running the cell below.

In [39]:
assert "score" in skittles
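
One thing to keep in mind with `insert()`: running the cell a second time fails, because the column already exists. The sketch below, on a made-up two-row table, contrasts this with plain column assignment, which can be repeated safely.

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration.
table = pd.DataFrame({"amount": [2, 3], "rating": [4.0, 1.5]}, index=["a", "b"])

# Plain assignment creates the column, or overwrites it when the cell is rerun.
table["score"] = table["amount"] * table["rating"]
table["score"] = table["amount"] * table["rating"]  # rerunning is harmless

# insert() refuses to add a column name that already exists.
try:
    table.insert(2, "score", 0)
    insert_failed = False
except ValueError:
    insert_failed = True

print(table)
print(insert_failed)
```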

## Indexing and selection

Read the [next](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html) part of the reference.

### Exercise 6

For the given `DataFrame` select only columns 'a', 'c', and 'e', and rows 10, 20, 50, 60 and store the result again in the variable `frame`. As a clarification, the original `DataFrame` looks like:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>a</th>      <th>b</th>      <th>c</th>      <th>d</th>      <th>e</th>      <th>f</th>      <th>g</th>    </tr>  </thead>  <tbody>    <tr>      <th>10</th>      <td>0.0</td>      <td>1.0</td>      <td>2.0</td>      <td>3.0</td>      <td>4.0</td>      <td>5.0</td>      <td>6.0</td>    </tr>    <tr>      <th>20</th>      <td>7.0</td>      <td>8.0</td>      <td>9.0</td>      <td>10.0</td>      <td>11.0</td>      <td>12.0</td>      <td>13.0</td>    </tr>    <tr>      <th>30</th>      <td>14.0</td>      <td>15.0</td>      <td>16.0</td>      <td>17.0</td>      <td>18.0</td>      <td>19.0</td>      <td>20.0</td>    </tr>    <tr>      <th>40</th>      <td>21.0</td>      <td>22.0</td>      <td>23.0</td>      <td>24.0</td>      <td>25.0</td>      <td>26.0</td>      <td>27.0</td>    </tr>    <tr>      <th>50</th>      <td>28.0</td>      <td>29.0</td>      <td>30.0</td>      <td>31.0</td>      <td>32.0</td>      <td>33.0</td>      <td>34.0</td>    </tr>    <tr>      <th>60</th>      <td>35.0</td>      <td>36.0</td>      <td>37.0</td>      <td>38.0</td>      <td>39.0</td>      <td>40.0</td>      <td>41.0</td>    </tr>  </tbody></table>

After the selection, the `DataFrame` should look like:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>a</th>      <th>c</th>      <th>e</th>    </tr>  </thead>  <tbody>    <tr>      <th>10</th>      <td>0.0</td>      <td>2.0</td>      <td>4.0</td>    </tr>    <tr>      <th>20</th>      <td>7.0</td>      <td>9.0</td>      <td>11.0</td>    </tr>    <tr>      <th>50</th>      <td>28.0</td>      <td>30.0</td>      <td>32.0</td>    </tr>    <tr>      <th>60</th>      <td>35.0</td>      <td>37.0</td>      <td>39.0</td>    </tr>  </tbody></table>


In [75]:
import numpy as np

frame = DataFrame(np.arange(6 * 7.).reshape((6, 7)), index=[10, 20, 30, 40, 50, 60], columns=list('abcdefg'))

# your code here
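# .loc selects by label: first the row labels, then the column labels.
frame = frame.loc[[10, 20, 50, 60], ['a', 'c', 'e']]

display(frame)

|&nbsp;  | a    | c    | e    |
|--------|------|------|------|
| **10** | 0.0  | 2.0  | 4.0  |
| **20** | 7.0  | 9.0  | 11.0 |
| **50** | 28.0 | 30.0 | 32.0 |
| **60** | 35.0 | 37.0 | 39.0 |


Check your answer by running the cell below.

In [76]:
answers.test_6(frame)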

Testing: Success!
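
Because the row labels here are integers, it is easy to confuse label-based and position-based selection. The following sketch, on a hypothetical 3x3 frame, shows the same selection written with `.loc` (labels) and `.iloc` (positions).

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame whose integer row labels differ from the row positions.
small = pd.DataFrame(np.arange(9).reshape(3, 3), index=[10, 20, 30], columns=list("abc"))

by_label = small.loc[[10, 30], ["a", "c"]]   # rows labelled 10 and 30
by_position = small.iloc[[0, 2], [0, 2]]     # first and third row, first and third column

print(by_label)
print(by_label.equals(by_position))
```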


### Exercise 7

Replace all values in the data frame `frame` that are *divisible by 3* with the value *0*. Store the result in `frame`.

In [81]:
frame = DataFrame(np.arange(6 * 7.).reshape((6, 7)), index=[10, 20, 30, 40, 50, 60], columns=list('abcdefg'))

# your code here
frame[frame % 3 == 0] = 0

display(frame)

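|&nbsp;  | a    | b    | c    | d    | e    | f    | g    |
|--------|------|------|------|------|------|------|------|
| **10** | 0.0  | 1.0  | 2.0  | 0.0  | 4.0  | 5.0  | 0.0  |
| **20** | 7.0  | 8.0  | 0.0  | 10.0 | 11.0 | 0.0  | 13.0 |
| **30** | 14.0 | 0.0  | 16.0 | 17.0 | 0.0  | 19.0 | 20.0 |
| **40** | 0.0  | 22.0 | 23.0 | 0.0  | 25.0 | 26.0 | 0.0  |
| **50** | 28.0 | 29.0 | 0.0  | 31.0 | 32.0 | 0.0  | 34.0 |
| **60** | 35.0 | 0.0  | 37.0 | 38.0 | 0.0  | 40.0 | 41.0 |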


Check your answer by running the cell below.

In [82]:
answers.test_7(frame)

Testing: Success!
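
Assigning through a boolean mask changes the frame in place. If the original should stay untouched, `DataFrame.mask()` is one alternative that returns a new frame. A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding the numbers 0 to 5.
numbers = pd.DataFrame(np.arange(6.0).reshape(2, 3), columns=list("xyz"))

is_even = numbers % 2 == 0             # boolean DataFrame of the same shape
replaced = numbers.mask(is_even, -1)   # returns a new frame; `numbers` is unchanged

print(replaced)
```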


## Operating on dataframes

Read about [operations in pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pandas.html).

### Exercise 8

Create a `Series` named `series_c` with the result of the calculation `series_a - series_b`. Note that `series_a` and `series_b` do not use the same indexing. Replace any missing values in `series_b` with 0. Values that are in `series_b` but not in `series_a` need not be in `series_c`.

In [149]:
series_a = Series([500, 400, 300, 200, 100], index=["a", "b", "c ", "d", "e"])
series_b = Series([23, 46, 67, 79], index=["a", "c ", "f", "g"])

# your code here
# fill_value=0 treats a label that is missing from one Series as 0 before subtracting.
series_c = series_a.sub(series_b, fill_value=0)
# Keep only the labels of series_a, which drops "f" and "g".
series_c = series_c.reindex_like(series_a)
display(series_c)

a     477.0
b     400.0
c     254.0
d     200.0
e     100.0
dtype: float64

Check your answer by running the cell below.

In [150]:
answers.test_8(series_a, series_b, series_c)

Testing: Success!
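
To see what `fill_value` changes, it can help to compare it with the plain `-` operator. The sketch below uses two hypothetical Series with partly overlapping labels.

```python
import pandas as pd

# Hypothetical Series with partly overlapping labels.
left = pd.Series([10, 20], index=["p", "q"])
right = pd.Series([1, 2], index=["q", "r"])

plain = left - right                     # "p" and "r" lack a partner and become NaN
filled = left.sub(right, fill_value=0)   # a missing partner counts as 0

print(plain)
print(filled)
```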


## Map

`pandas` has its own `map()` function! As expected, it maps a function to every element of a `Series` or `DataFrame` and gives back a new `Series` or `DataFrame`. As an example:

```Py
tokens = Series(["hello", " ", "world!"])

lengths = []
for token in tokens:
    lengths.append(len(token))
lengths = Series(lengths)
```

This can be rewritten as:

```Py
tokens = Series(["hello", " ", "world!"])
lengths = tokens.map(len)
```
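
`Series.map()` accepts more than built-in functions. As a small sketch with made-up size codes, it also takes a dictionary (used as a lookup table) or a `lambda`:

```python
import pandas as pd

# Hypothetical size codes.
sizes = pd.Series(["s", "m", "l", "m"])

# A dictionary acts as a lookup table; values without a matching key become NaN.
names = sizes.map({"s": "small", "m": "medium", "l": "large"})

# Any callable that takes one element works, including a lambda.
shouted = sizes.map(lambda code: code.upper() + "!")

print(names.tolist())
print(shouted.tolist())
```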

### Exercise 9

Convert all words in the `Series` `words` to lowercase. Use `str.lower()`.

In [157]:
words = Series(["foo", "Bar", "baz", "QUX", "QuUuX"])

# your code here
words = words.str.lower()

display(words)

0      foo
1      bar
2      baz
3      qux
4    quuux
dtype: object

Check your answer by running the cell below.

In [158]:
answers.test_9(words)

Testing: Success!


### Exercise 10

Sort `frame` on column `c` in descending order and store the solution in `frame`.

In [164]:
data = [[0.691074, -1.272521, -0.968045, -2.066171, -0.670358, 1.399483, -1.148168], 
        [1.75378, 2.409629, 1.842674, 0.754906, -0.115614, 0.877219, 1.599362], 
        [-1.41176, 1.103801, 1.216514, 0.548866, 2.255482, -0.176342, 0.965265], 
        [-0.741689, 0.216645, -0.278025, 0.777175, 0.869239, -0.943004, -0.140957], 
        [-1.58593, 1.1796, -0.702286, 2.367875, 0.592748, 1.386158, 0.535978], 
        [0.58498, 0.62389, -0.425614, 0.530479, -1.818631, -1.593188, 1.591233]]

frame = DataFrame(data, columns=list('abcdefg'))

# your code here
frame = frame.sort_values(by='c', ascending=False)

display(frame)

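|&nbsp; | a         | b         | c         | d         | e         | f         | g         |
|-------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| **1** | 1.75378   | 2.409629  | 1.842674  | 0.754906  | -0.115614 | 0.877219  | 1.599362  |
| **2** | -1.41176  | 1.103801  | 1.216514  | 0.548866  | 2.255482  | -0.176342 | 0.965265  |
| **3** | -0.741689 | 0.216645  | -0.278025 | 0.777175  | 0.869239  | -0.943004 | -0.140957 |
| **5** | 0.58498   | 0.62389   | -0.425614 | 0.530479  | -1.818631 | -1.593188 | 1.591233  |
| **4** | -1.58593  | 1.1796    | -0.702286 | 2.367875  | 0.592748  | 1.386158  | 0.535978  |
| **0** | 0.691074  | -1.272521 | -0.968045 | -2.066171 | -0.670358 | 1.399483  | -1.148168 |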


Check your answer by running the cell below.

In [165]:
answers.test_10(frame)

Testing: Success!


## Missing values

Read about [Handling Missing Data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html).

### Exercise 11

In the `Series` `speeds` below, fill in the missing values. Use the speed from the previous datapoint available that is up to 3 datapoints away. Delete any datapoints that still do not have values afterwards.

In [175]:
speeds = Series(
         [49, 51, None, None, 50, 48, 47, 50,
          51, 47, 46, None, 46, 48, 48, 48,
          None, 49, None, None, None, None,
          None, 50, 50, 50, 51, 52, 51, 50,
          None, 50, None, None, None, None,
          None, 50, 49, 48, 49, None, 50, 50, 49])

# your code here
# Forward-fill from the previous value, bridging at most 3 consecutive gaps.
speeds = speeds.ffill(limit=3)
speeds = speeds.dropna()

display(speeds)

0     49.0
1     51.0
2     51.0
3     51.0
4     50.0
5     48.0
6     47.0
7     50.0
8     51.0
9     47.0
10    46.0
11    46.0
12    46.0
13    48.0
14    48.0
15    48.0
16    48.0
17    49.0
18    49.0
19    49.0
20    49.0
23    50.0
24    50.0
25    50.0
26    51.0
27    52.0
28    51.0
29    50.0
30    50.0
31    50.0
32    50.0
33    50.0
34    50.0
37    50.0
38    49.0
39    48.0
40    49.0
41    49.0
42    50.0
43    50.0
44    49.0
dtype: float64

Check your answer by running the cell below.

In [176]:
answers.test_11(speeds)

Testing: Success!
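
The effect of `limit` is easiest to see on a tiny example. The sketch below uses a hypothetical Series with a gap of three missing values and a limit of 2, so one gap remains and is then dropped.

```python
import pandas as pd

# Hypothetical readings with a gap of three missing values.
readings = pd.Series([1.0, None, None, None, 5.0])

# limit=2 fills at most two consecutive gaps after each valid value.
filled = readings.ffill(limit=2)
cleaned = filled.dropna()

print(filled.tolist())
print(cleaned.index.tolist())
```

Note that `dropna()` keeps the original index labels, which is why the label of the removed datapoint is simply skipped in the result.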


## Grouping

Read about [grouping data](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html).

### Exercise 12

Given the `grades` DataFrame below, group the data by course name. Do the following:

- Create a grouping called `groups`.
- Use a `for` loop to loop over the content of `groups`. Print the name and content of each group.
- Generate a `Series` called `average_course_grade` containing the average grade for each course.
- Generate a `Series` called `student_count` containing the total number of students for each course.

In [237]:
grades = pd.DataFrame([["Pascal", "Programming 2", 7.0], ["Morty", "Programming 1", 5.5], 
                       ["Slartibartfast", "Programming 1", 6.5], ["Ursula", "Programming 1", 9.5],
                       ["Morty", "Programming 2", 3.5], ["Marge", "Programming 1", 8.0],
                       ["Ursula", "Programming 2", 9.0]], 
                       columns = ["student_name", "course_name", "grade"])
                      
# your code here
display(grades)
groups = grades.groupby('course_name')
# Iterating over a grouping yields (group name, sub-DataFrame) pairs.
for course_name, group in groups:
    print(group, course_name)

# Reuse the grouping instead of grouping the data again for every statistic.
average_course_grade = groups['grade'].mean()
student_count = groups['course_name'].count()

display(average_course_grade)
display(student_count)

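|&nbsp; | student_name   | course_name   | grade |
|-------|----------------|---------------|-------|
| **0** | Pascal         | Programming 2 | 7.0   |
| **1** | Morty          | Programming 1 | 5.5   |
| **2** | Slartibartfast | Programming 1 | 6.5   |
| **3** | Ursula         | Programming 1 | 9.5   |
| **4** | Morty          | Programming 2 | 3.5   |
| **5** | Marge          | Programming 1 | 8.0   |
| **6** | Ursula         | Programming 2 | 9.0   |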


     student_name    course_name  grade
1           Morty  Programming 1    5.5
2  Slartibartfast  Programming 1    6.5
3          Ursula  Programming 1    9.5
5           Marge  Programming 1    8.0 Programming 1
  student_name    course_name  grade
0       Pascal  Programming 2    7.0
4        Morty  Programming 2    3.5
6       Ursula  Programming 2    9.0 Programming 2


course_name
Programming 1    7.375
Programming 2    6.500
Name: grade, dtype: float64

course_name
Programming 1    4
Programming 2    3
Name: course_name, dtype: int64

Check your answer by running the cell below.

In [238]:
answers.test_12(average_course_grade, student_count)

Testing: Success!
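
When several statistics per group are needed, `agg()` can compute them in one call. This is a sketch on hypothetical sales records, not a required part of the exercise:

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "shop": ["north", "south", "north", "south", "north"],
    "amount": [10, 20, 30, 40, 50],
})

# agg() computes several statistics per group in one call;
# the result has one row per shop and one column per statistic.
summary = sales.groupby("shop")["amount"].agg(["mean", "count"])
print(summary)
```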


You can also call the `head(n)` method on a grouping, much like `count()` or `mean()`. Unlike those aggregations, it does not reduce each group to a single value: it yields the first `n` entries of each group. Read more about [groupby + head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.head.html).
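
As a minimal sketch of how `head` behaves on a grouping, the example below uses made-up race times (the column names are hypothetical). Sorting first decides which rows count as the "first" ones in each group.

```python
import pandas as pd

# Hypothetical race times; lower is better.
times = pd.DataFrame({
    "runner": ["ann", "bob", "cat", "dan", "eve"],
    "heat": [1, 1, 1, 2, 2],
    "seconds": [12.1, 11.8, 12.5, 11.9, 12.0],
})

# Sort first, then head(1) keeps the fastest row of every heat.
fastest = times.sort_values("seconds").groupby("heat").head(1)
print(fastest)
```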

### Exercise 13

Use `groupby` and `head` to select the **top two students** of each course of the `grades` DataFrame from above. Store the result in `top_students`.

The result should look like this: 

      student_name    course_name  grade
    3       Ursula  Programming 1    9.5
    6       Ursula  Programming 2    9.0
    5        Marge  Programming 1    8.0
    0       Pascal  Programming 2    7.0

> Tip: you might have to sort the values of the `grades` DataFrame first.

In [245]:
# your code here
grades = grades.sort_values(by='grade', ascending=False)
# head(2) keeps the first two rows of each course; after sorting, those are the two highest grades.
top_students = grades.groupby('course_name').head(2)

display(top_students)


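|&nbsp; | student_name | course_name   | grade |
|-------|--------------|---------------|-------|
| **3** | Ursula       | Programming 1 | 9.5   |
| **6** | Ursula       | Programming 2 | 9.0   |
| **5** | Marge        | Programming 1 | 8.0   |
| **0** | Pascal       | Programming 2 | 7.0   |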


Check your answer by running the cell below.

In [246]:
answers.test_13(top_students)

Testing: Success!


## Pivot

Read about [pivot tables](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html).

### Exercise 14

Based on the `grades` DataFrame from the previous exercises, create a pivot table called `pivot_grades` that has the student names as rows and the course names as columns. The values in the table should be the grades of the corresponding combination of student and course. When a student only has a grade for one of the two courses, `pivot_table` automatically assigns the value `NaN`. This is fine; you can leave it as is.

In [248]:
# your code here
pivot_grades = grades.pivot_table(values='grade', index='student_name', columns='course_name')
display(pivot_grades)

| student_name       | Programming 1 | Programming 2 |
|--------------------|---------------|---------------|
| **Marge**          | 8.0           | NaN           |
| **Morty**          | 5.5           | 3.5           |
| **Pascal**         | NaN           | 7.0           |
| **Slartibartfast** | 6.5           | NaN           |
| **Ursula**         | 9.5           | 9.0           |


Check your answer by running the cell below.

In [249]:
answers.test_14(pivot_grades)

Testing: Success!
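
In `grades`, every combination of student and course occurs at most once. When a combination occurs more than once, `pivot_table` combines the values, using the mean by default. A sketch on hypothetical temperature measurements:

```python
import pandas as pd

# Hypothetical measurements with a repeated (city, year) pair.
temps = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Oslo"],
    "year": [2020, 2021, 2020, 2020],
    "temp": [5.0, 6.0, 15.0, 7.0],
})

# Duplicate (city, year) pairs are averaged by default (aggfunc="mean").
table = temps.pivot_table(values="temp", index="city", columns="year")
print(table)
```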


## Explode

Read about [exploding DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html).

### Exercise 15

Create an unnested version of the `movies` DataFrame below. So the result should look like this:

                          movie                actors
    0            Hababam Sinifi           Kemal Sunal
    0            Hababam Sinifi           Münir Özkul
    0            Hababam Sinifi        Halit Akçatepe
    0            Hababam Sinifi            Tarik Akan
    1  The Shawshank Redemption           Tim Robbins
    1  The Shawshank Redemption        Morgan Freeman
    1  The Shawshank Redemption            Bob Gunton
    1  The Shawshank Redemption        William Sadler
    2                  Aynabaji    Chanchal Chowdhury
    2                  Aynabaji  Masuma Rahman Nabila
    2                  Aynabaji    Bijori Barkatullah
    2                  Aynabaji          Partha Barua
    3             The Godfather         Marlon Brando
    3             The Godfather             Al Pacino
    3             The Godfather            James Caan
    3             The Godfather          Diane Keaton
    
Store the result in `exploded_actors`.

In [265]:
movies = pd.DataFrame([["Hababam Sinifi",
                           ["Kemal Sunal", "Münir Özkul", "Halit Akçatepe", "Tarik Akan"]],
                      ["The Shawshank Redemption",
                           ["Tim Robbins", "Morgan Freeman", "Bob Gunton", "William Sadler"]],
                      ["Aynabaji", 
                           ["Chanchal Chowdhury", "Masuma Rahman Nabila", "Bijori Barkatullah", "Partha Barua"]],
                      ["The Godfather",
                           ["Marlon Brando", "Al Pacino", "James Caan", "Diane Keaton"]]],
                     columns = ["movie", "actors"])

# your code here
exploded_actors = movies.explode('actors')

display(exploded_actors)

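|&nbsp; | movie                    | actors               |
|-------|--------------------------|----------------------|
| **0** | Hababam Sinifi           | Kemal Sunal          |
| **0** | Hababam Sinifi           | Münir Özkul          |
| **0** | Hababam Sinifi           | Halit Akçatepe       |
| **0** | Hababam Sinifi           | Tarik Akan           |
| **1** | The Shawshank Redemption | Tim Robbins          |
| **1** | The Shawshank Redemption | Morgan Freeman       |
| **1** | The Shawshank Redemption | Bob Gunton           |
| **1** | The Shawshank Redemption | William Sadler       |
| **2** | Aynabaji                 | Chanchal Chowdhury   |
| **2** | Aynabaji                 | Masuma Rahman Nabila |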


Check your answer by running the cell below.

In [267]:
answers.test15(exploded_actors)
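
As the expected result shows, `explode()` repeats the index label of the original row for every list element. If a fresh 0..n-1 index is preferred, the `ignore_index` parameter can be used. A sketch on a hypothetical playlist table:

```python
import pandas as pd

# Hypothetical playlists with a list-valued column.
playlists = pd.DataFrame({"owner": ["kim", "lee"], "songs": [["a", "b"], ["c"]]})

# explode() repeats the index label of the original row for every list element.
flat = playlists.explode("songs")
# ignore_index=True renumbers the rows instead.
renumbered = playlists.explode("songs", ignore_index=True)

print(flat.index.tolist())
print(renumbered.index.tolist())
```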

## Reading and writing data 

Look at how you can [store data as a csv file](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
Look at how you can [read the data back](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
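
A minimal round trip might look like the sketch below. To keep it self-contained, an in-memory buffer stands in for a file on disk; the table and its column names are made up.

```python
import io
import pandas as pd

# Hypothetical table; an in-memory buffer stands in for a file on disk.
pets = pd.DataFrame({"name": ["Rex", "Tom"], "age": [3, 5]})

buffer = io.StringIO()
pets.to_csv(buffer, index=False)   # index=False leaves out the row labels
buffer.seek(0)                     # rewind the buffer before reading it back
restored = pd.read_csv(buffer)

print(restored)
```

Passing a file path such as a `.csv` file name instead of the buffer writes to and reads from disk in the same way.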

### Exercise 16

As a final challenge, try to combine the pandas methods `read_csv`, `to_csv`, `map`, `explode`, `sort_values`, `groupby` and `head` into a more complex data transformation.

The file `data/recipes.csv` contains a few recipes with their lists of ingredients. The recipes were rated on a scale from 1 to 5 by the users of a food website. The file also contains the average rating of the recipes. The contents:

    recipe_name,rating,ingredients
    Caprese Salad,4.5,tomato;olive oil;basil;mozzerella
    Lasagna,4.8,beef;pork;bacon;onion;celery;carrot;wine;tomato
    Beef Bourguignon,4.3,beef;bacon;onion;carrot;celery;flour;garlic;wine
    Hamburger,3.8,beef;bacon;letuce;bread;onion;mayo;ketchup;pickle
    Lentil Burger,4.0,lemon;lentils;yogurt;garlic;mushrooms;miso;parika;flour;bread;pickles

As you can see, the first column is the name, the second column contains the average rating and the third column contains all the ingredients (separated by a semicolon `;`).

We want to know, for each ingredient, what the **top two recipes** are that contain that ingredient: the two recipes with the highest rating containing that ingredient. For example, the top two recipes for *bacon* are *Lasagna* and *Beef Bourguignon*. (*Hamburger* also contains *bacon*, but its rating is lower than that of the other two recipes, so we ignore that one.)

Write a piece of code that reads the file `data/recipes.csv`, does all the required transformations and writes a file `ingredients.csv` to the `data` folder, containing the information we want. A fragment of the contents of the output file you should produce is shown here:

    ingredients,recipe_name
    bacon,Lasagna
    bacon,Beef Bourguignon
    basil,Caprese Salad
    beef,Lasagna
    beef,Beef Bourguignon
    bread,Hamburger
    bread,Lentil Burger
    carrot,Lasagna
    carrot,Beef Bourguignon
    ...

As you can see, the first two entries are the top two recipes for *bacon*. After that we see only one entry for *basil*, because *basil* occurs in only one recipe. This is okay: If an ingredient only occurs in one recipe, the output should only contain that entry.

The output should be sorted by the name of the ingredient.

In [340]:
# your code here
recipes = pd.read_csv('data/recipes.csv')
recipes['ingredients'] = recipes['ingredients'].str.split(pat=';')
display(recipes)
recipes = recipes.explode('ingredients')
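# Take the first two rows per ingredient, keep the two output columns and sort by ingredient.
recipes = recipes.groupby('ingredients').head(2)[['ingredients', 'recipe_name']].sort_values('ingredients')
display(recipes)
# Write the result to the data folder, as the exercise asks.
recipes.to_csv('data/ingredients.csv', index=False, lineterminator='\n')
# Without a path, to_csv returns the CSV text, which the notebook echoes as the cell's output.
recipes.to_csv(index=False, lineterminator='\n')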


|&nbsp; | recipe_name      | rating | ingredients                                        |
|-------|------------------|--------|----------------------------------------------------|
| **0** | Caprese Salad    | 4.5    | [tomato, olive oil, basil, mozzerella]             |
| **1** | Lasagna          | 4.8    | [beef, pork, bacon, onion, celery, carrot, win...  |
| **2** | Beef Bourguignon | 4.3    | [beef, bacon, onion, carrot, celery, flour, ga...  |
| **3** | Hamburger        | 3.8    | [beef, bacon, letuce, bread, onion, mayo, ketc...  |
| **4** | Lentil Burger    | 4.0    | [lemon, lentils, yogurt, garlic, mushrooms, mi...  |



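|&nbsp; | ingredients | recipe_name      |
|-------|-------------|------------------|
| **2** | bacon       | Beef Bourguignon |
| **1** | bacon       | Lasagna          |
| **0** | basil       | Caprese Salad    |
| **1** | beef        | Lasagna          |
| **2** | beef        | Beef Bourguignon |
| **3** | bread       | Hamburger        |
| **4** | bread       | Lentil Burger    |
| **2** | carrot      | Beef Bourguignon |
| **1** | carrot      | Lasagna          |
| **2** | celery      | Beef Bourguignon |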


'ingredients,recipe_name\nbacon,Beef Bourguignon\nbacon,Lasagna\nbasil,Caprese Salad\nbeef,Lasagna\nbeef,Beef Bourguignon\nbread,Hamburger\nbread,Lentil Burger\ncarrot,Beef Bourguignon\ncarrot,Lasagna\ncelery,Beef Bourguignon\ncelery,Lasagna\nflour,Beef Bourguignon\nflour,Lentil Burger\ngarlic,Beef Bourguignon\ngarlic,Lentil Burger\nketchup,Hamburger\nlemon,Lentil Burger\nlentils,Lentil Burger\nletuce,Hamburger\nmayo,Hamburger\nmiso,Lentil Burger\nmozzerella,Caprese Salad\nmushrooms,Lentil Burger\nolive oil,Caprese Salad\nonion,Beef Bourguignon\nonion,Lasagna\nparika,Lentil Burger\npickle,Hamburger\npickles,Lentil Burger\npork,Lasagna\ntomato,Lasagna\ntomato,Caprese Salad\nwine,Beef Bourguignon\nwine,Lasagna\nyogurt,Lentil Burger\n'

Check your answer by running the cell below.

In [341]:
answers.test_16()

Testing: Success!
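
Note that `head(2)` simply takes the first two rows of each group, so it only returns the two best-rated recipes if the rows are ordered by rating beforehand. The sketch below shows one way to make that ranking explicit. It runs on a made-up three-recipe sample read from an in-memory string rather than on `data/recipes.csv`, and the `kind="stable"` sort is an assumption chosen so that the rating order within each ingredient is preserved.

```python
import io
import pandas as pd

# Hypothetical three-recipe sample, not the contents of data/recipes.csv.
csv_text = """recipe_name,rating,ingredients
Soup,3.0,salt;leek
Stew,4.5,salt;beef
Pie,4.0,salt;beef;flour
"""

recipes = pd.read_csv(io.StringIO(csv_text))
recipes["ingredients"] = recipes["ingredients"].str.split(";")

top_two = (
    recipes.sort_values("rating", ascending=False)   # best recipes first
    .explode("ingredients")                          # one row per ingredient
    .groupby("ingredients")
    .head(2)                                         # two best rows per ingredient
    [["ingredients", "recipe_name"]]
    .sort_values("ingredients", kind="stable")       # stable sort keeps the rating order
)
print(top_two.to_csv(index=False))
```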
