## Import the same libraries as in your first scraping notebook (just copy the "import" statements from the first notebook into this one and execute the cell)

## Exercise 1: Count the number of characters in the titles and in the summaries, and investigate how they are related to rating and rating counts.

1. Use [pd.read_csv](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html) to load one of the dataframes you stored in your first scraping notebook into this notebook. This could be the "carrots" recipes or the recipes for the search term you chose yourself.

2. How do you count the number of characters in a single string, e.g. the number of characters in the string "Counting characters is an essential human skill"? Hint: See [this link](https://www.w3schools.com/python/ref_func_len.asp) about the "len" function. Does this function count blank spaces? (You don't have to look for the answer in the documentation; you can just test it on your own.) How many characters are there in the string "Counting characters is an essential human skill"?

3. Now count the number of characters in each of the recipe titles. Use the function [pd.Series.apply](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) to apply the [len function](https://www.w3schools.com/python/ref_func_len.asp) to each of the rows in the "title" column of your dataframe. (You could also use a for loop for this, but using the apply function is much better pandas coding practice, since it can really speed up your code when the task is more complicated than counting characters.)

4. Store the number of characters in the titles as a new column called "title_len" in the dataframe. If you don't know how to add a new column in a pandas dataframe, then try to google "python pandas make new column" or something similar, and see if you can find out on your own.

5. Repeat steps 3-4 with the summaries instead of the titles, so you end up with a column "summary_len", which counts the number of characters in the summaries.

6. Use histograms to visualize the distributions of title_len and summary_len, and use scatterplots (or [regplot](https://seaborn.pydata.org/generated/seaborn.regplot.html)) to visualize the relationship between title_len, summary_len, rating and rating_count. What insights do you get?
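
As a minimal sketch of steps 2-5, assuming a small made-up dataframe in place of your scraped recipes (the titles and summaries below are invented for illustration, and the variable name `recipes` is hypothetical), counting characters with `len` and `apply` might look like this:

```python
import pandas as pd

# Toy data standing in for the scraped recipes (invented for illustration).
recipes = pd.DataFrame({
    "title": ["Carrot Cake", "Glazed Carrots"],
    "summary": ["A moist cake.", "Sweet and simple side dish."],
})

# len counts every character in a string, including blank spaces.
print(len("a b"))  # 3: two letters plus one space

# Apply len to each row of a column and store the result as a new column.
recipes["title_len"] = recipes["title"].apply(len)
recipes["summary_len"] = recipes["summary"].apply(len)

print(recipes[["title_len", "summary_len"]])
```

If a column in your real data contains missing values, `len` raises a `TypeError` on them, so you may need to fill or drop those rows first.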

## Exercise 2: Put recipes from different search terms into one big dataframe and compare the recipes

Just so you know, you could also do the comparisons between recipes from different search terms without putting everything into one large dataframe (and this would probably be easier in this particular case), but we are doing the "one-big-dataframe" as a coding exercise :)

1. Copy the code that defines the functions for scraping and formatting the scraped data from the first notebook into this notebook. Execute the cell (this defines the functions in this notebook).

2. Next, copy the code that scrapes allrecipes.com for the recipes relating to a search term and stores the results in a dataframe. Make dataframes for some different search terms that you find interesting to compare, e.g. "beef", "pork", "chicken" and "meat".

3. In each of these dataframes, make a column "search term", which contains the search term you used to scrape the recipes.

4. Use [pd.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) to combine the dataframes into a single dataframe. This dataframe should contain all the rows of the other dataframes together. It should have the same columns as the dataframes in your first scraping notebook, and additionally a "search term" column, which states the search term that gave rise to the recipe in the given row.

5. If you are feeling advanced, you can try to use a [for loop](https://www.w3schools.com/python/python_for_loops.asp) to perform steps 2-4 in a single for loop.

6. Store this dataframe locally on your computer.

7. Use [pandas row selection](https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values) to select the rows that belong to a specific search term.

8. Use row selection combined with seaborn histograms to visualize the distribution of ratings and rating counts for the different search terms.

9. Use [DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) combined with [pd.Series.mean](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html) to calculate the average rating for each of the search terms. Use groupby combined with [pd.Series.quantile](https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html) to calculate various quantiles of the rating distributions for the different search terms.
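
The following is a rough sketch of steps 3-9 (without the plotting), assuming two tiny invented dataframes in place of your scraped results. The names `scraped`, `all_recipes` and the file name in the comment are hypothetical, and the ratings are made up:

```python
import pandas as pd

# Toy dataframes standing in for the scraped results (values are invented).
scraped = {
    "beef": pd.DataFrame({"title": ["Beef Stew", "Beef Tacos"], "rating": [4.5, 4.0]}),
    "pork": pd.DataFrame({"title": ["Pulled Pork", "Pork Chops"], "rating": [5.0, 3.0]}),
}

frames = []
for term, df in scraped.items():
    df = df.copy()
    df["search term"] = term  # label each row with its search term
    frames.append(df)

# Stack all rows into one dataframe and renumber the index.
all_recipes = pd.concat(frames, ignore_index=True)

# Storing locally could look like this (hypothetical file name):
# all_recipes.to_csv("all_recipes.csv", index=False)

# Row selection: keep only the rows belonging to one search term.
beef_only = all_recipes[all_recipes["search term"] == "beef"]

# Average rating and quantiles per search term.
mean_rating = all_recipes.groupby("search term")["rating"].mean()
quartiles = all_recipes.groupby("search term")["rating"].quantile([0.25, 0.5, 0.75])

print(mean_rating)
print(quartiles)
```

Passing a list to `quantile` returns one value per search term and quantile, which makes it easy to compare the spread of ratings across search terms.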

## Exercise 3: Add names and sex to dataframe

Note that YOU SHOULD NOT STORE THE NAMES OF THE AUTHORS LOCALLY ON YOUR COMPUTER. Just keep the dataframes with the names in your running notebooks, so the author names disappear when you shut down the notebook. After you have made the "sex" column, you can remove the author name column from your dataframe and save the dataframe with the sex column.

Also note that there is a lot to discuss about the methodological and conceptual pros and cons of this method for constructing a sex column from the author names. Have this discussion in your group! 
Being critical of an analysis is a skill as essential as coding, and discussing the advantages and disadvantages (in terms of accuracy, relevance, theoretical validity and ethics) of various methods is core to this master's program :)

1. Add the author's name to the data that you scrape from allrecipes.com. As in the first notebook, you need to expand the dictionary below, such that you also scrape the author names.

```python
# Maps each dataframe column to the HTML tag and class it is scraped from.
variables = {'title': 'h3 class="card__title"',
             'rating_count': 'span class="card__ratingCount card__metaDetailText"',
             'summary': 'div class="card__summary"',
             'rating': 'span class="review-star-text"'}
```

2. Use pd.read_csv to load the names_gender.csv file into your notebook. As described [here](https://data.world/howarder/gender-by-name), this dataframe uses the birth names of US-born babies between 1930 and 2015 to assign probabilities of an individual having a given sex from the individual's name. (Of course, the validity of this method can be discussed.)

3. Make a function that takes the author name of a recipe as the input and uses the names_gender dataframe to output a guess on the sex of the individual who wrote the recipe. There are many ways to do this, so you should think up your own method and then define a function that implements this method. You should also allow the function to not make a guess in some cases, such that it outputs "NA" or something similar for the author_name in these cases.

4. Apply this function to the column of author names, such that you make a guess on the sex (including "NA") of each of the authors. Depending on your method, it might take a while to apply the function to your entire column of author names. Store the result as a "sex" column.

5. Remove the "author_name" column and store the dataframe.

6. Explore the patterns in the sex column. Look at some concrete examples: 
    - Do they make sense, or does there seem to be a technical or conceptual error in the function for guessing sex from names? 
    - What is the fraction of missing values? 
    - Of the rows with non-missing sex values, what is the proportion of females? 
    - How does the distribution of ratings and rating_counts depend on the constructed sex variable? 
    - What about the length of titles and summaries? 
    - Can you find other interesting patterns? 
 In all this pattern exploring, you should always keep in mind the method you have used to construct the sex column. Any pattern that you see might be influenced by this particular method, and would not be the same if you for instance had access to the authors' biological sex or self-identified gender.
 
7. Some research points to the fact that men more often than women cook meat dishes, whereas women relatively more often cook vegetable dishes and dessert. Can you find indications of this pattern (note that you could answer this question in many ways!)? 
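
One possible way to approach steps 3, 4 and 6 is sketched below. This is only one method among many, not the intended solution. It assumes that the lookup table has the columns "name", "gender" and "probability" (check the real names_gender.csv, since these column names are an assumption), it uses a tiny invented table and invented author names, and the function name `guess_sex` and the 0.9 threshold are arbitrary choices made for illustration:

```python
import pandas as pd

# Toy stand-in for names_gender.csv. The column names ("name", "gender",
# "probability") and all values are assumptions made for illustration.
names_gender = pd.DataFrame({
    "name": ["Mary", "John", "Casey"],
    "gender": ["F", "M", "F"],
    "probability": [0.99, 0.99, 0.58],
})

# Build a dictionary once so that each lookup is fast.
lookup = {
    name.lower(): (gender, probability)
    for name, gender, probability in zip(
        names_gender["name"], names_gender["gender"], names_gender["probability"]
    )
}

def guess_sex(author_name, threshold=0.9):
    """Guess sex from the first word of an author name, or return "NA"."""
    if not isinstance(author_name, str) or not author_name.strip():
        return "NA"
    first_name = author_name.split()[0].lower()
    if first_name not in lookup:
        return "NA"
    gender, probability = lookup[first_name]
    # Only make a guess when the name is strongly associated with one sex.
    return gender if probability >= threshold else "NA"

# Invented author names; these are kept in memory only and never saved.
authors = pd.Series(["Mary Smith", "JOHN", "Casey Jones", "chef123", None])
sex = authors.apply(guess_sex)

# Fraction of missing guesses, and share of females among the non-missing.
missing_fraction = (sex == "NA").mean()
female_share = (sex[sex != "NA"] == "F").mean()
print(sex.tolist(), missing_fraction, female_share)
```

Taking only the first word of the author name and requiring a high probability are design choices that trade coverage for accuracy. Usernames and ambiguous names end up as "NA", which is itself a pattern worth discussing in your group.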