To run this program, we need to install the wordcloud package. To install it, open the Anaconda Prompt and type >> conda install [package name]

You need to close and reopen everything once you have done that! 

First, let's celebrate the installation by running the cell below

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('LDZX4ooRsWs', width=640, height=360)

In [None]:
import pandas as pd
import requests
import time 
import tqdm
from bs4 import BeautifulSoup  
from wordcloud import WordCloud, STOPWORDS # Need an install! 
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### The code below is pretty advanced, so you don't need to understand it. What it basically does is it calls a website and asks for data. 
### Question for you: What is the website?
(click on the cell below and write the website).

Write your answer here: 


### The website answers with data in a format called JSON. Inside that JSON is a piece of HTML, which we turn into a more readable format using the package "BeautifulSoup", which can read HTML data. 
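
To make the JSON step concrete, here is a minimal sketch that uses only Python's built-in `json` module. The text in `raw` is a made-up stand-in for a reply from the website: the keys `html`, `hasNext` and `totalResults` are the ones the scraping code in this notebook reads, but the values are invented for illustration.

```python
import json

# Made-up example of the kind of text the website sends back (values are invented).
raw = '{"html": "<h3 class=\\"card__title\\">Glazed Carrots</h3>", "hasNext": false, "totalResults": 1}'

# json.loads turns the JSON text into an ordinary Python dictionary.
response_d = json.loads(raw)

print(type(response_d))            # a dict
print(response_d["totalResults"])  # how many hits the search reported
print(response_d["hasNext"])       # JSON false becomes Python False
print(response_d["html"])          # the HTML snippet that BeautifulSoup would parse
```

In the real code, `response.json()` does the same conversion directly on the reply from the website.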

### Now you need to run the cell below: Do you already know the keyboard Shortcut to make a cell run? 
If no: Use Google to find the answer, and try using that instead of clicking run 

### Exercise
In the cell below, the first function is called "make_query" and takes the arguments: query_term, page_num, iter_ and wait,
where the last three are given default values straight away. In the end it returns "response".  

In line 19 (cell below), you need to define the function "check_response", which takes the argument: response.  
In line 26 (cell below), you state that the function returns response_d.

In line 87, in the very last function, you need to ask the function to return data.

<font color="orange"> Beware that indentation matters! </font>
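
If defining functions is new to you, the following small sketch may help. The function `describe_search` is a hypothetical example and not part of the scraping code; it only shows the three ingredients used in this exercise: `def`, arguments with default values, and `return`.

```python
# Hypothetical example function (not part of the scraping code).
def describe_search(query_term, page_num=1, wait=0.5):
    # query_term has no default, so the caller must always supply it.
    # page_num and wait have defaults, so the caller may leave them out.
    message = "Searching for %s on page %d (waiting %.1f s)" % (query_term, page_num, wait)
    # Without this return statement the function would hand back None.
    return message

print(describe_search("carrots"))              # uses both defaults
print(describe_search("carrots", page_num=3))  # overrides one default
```

Note that every line inside the function is indented by the same amount, and that the `def` line ends with a colon.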



In [None]:
##########################################################################################################
# Code for scraping ######################################################################################
##########################################################################################################

def make_query(query_term,page_num=1,iter_=5,wait=0.5):
    ## Add iterations for reliability in case of standard http errors
    response = False
    time.sleep(wait)
    for i in range(iter_):
        try:
            response = requests.get('https://www.allrecipes.com/element-api/content-proxy/faceted-searches-load-more?search=%s&page=%d'%(query_term,page_num))
      #, headers=headers, params=params, cookies=cookies)
            if response.ok:
                return response
        except:
            continue
    return response

### FILL IN HERE
    try:
        response_d = response.json()
    except:
        return {'error':True}
    if 'error' in response_d:
        print(response_d)
### FILL IN HERE

def page_data(query,result_count):
    results = []

    response_length = 24
    for page_num in tqdm.tqdm(range(result_count//response_length)):
        page_num+=1
        response = make_query(query,page_num=page_num)
        response_d = check_response(response)

        if 'error' in response_d:
            break
        results.append(response_d['html'])
        if response_d['hasNext']==False:
            break
    return results

def scrape_recipes(query) :
    response = make_query(query)
    response_d = check_response(response)
    count = response_d['totalResults']
    return page_data(query,count), count

##########################################################################################################
# Code for formatting dataframe ##########################################################################
##########################################################################################################

alt_vars = {'recipe_link':('a class="card__titleLink manual-link-behavior"','href'),
            'author_link':('a class="card__authorNameLink"','href')}

def parse_recipe_item(item, var2tags):
    d = {}
    for name,tag_attr in var2tags.items():
        tag,attr = tag_attr.split(maxsplit=1)
        class_name,identifier = attr.strip('"').split('="')
        identifier = {class_name:identifier}
        val = item.find(tag,attrs=identifier)
  # print(name,val)
        try:
            val = str(val.text.strip())
        except:
            continue
        d[name]= val
    for name,(tag_attr,key) in alt_vars.items():
        try :
            tag,attr = tag_attr.split(maxsplit=1)
            class_name,identifier = attr.strip('"').split('="')
            identifier = {class_name:identifier}
            val = item.find(tag,attrs=identifier)
            d[name] = str(val[key])
        except :
            continue
    return d

def get_recipes(html, var2tags):
    soup = BeautifulSoup(html,features="html.parser") # Beautifulsoup reads html data
    item_locs = soup.find_all('div',attrs={'class':'component card card__recipe card__facetedSearchResult'})
    data = []
    for item in item_locs:
        data.append(parse_recipe_item(item, var2tags))
# Fill in here

## Exercise: Scrape recipes with carrots, fill in the search word below

By running the code in the cells below, you get all the recipes from [allrecipes.com](https://www.allrecipes.com/) which contain the search term "carrots".
Before running the cells, go to allrecipes, search for recipes with carrots, and write down what information you think we will extract (note that we only look at the search results page, and not the actual recipes). 

Now, fill in the search word below (remember to put it in ' '), and get running. 

In [None]:
query =  # Fill in

#Scrape the recipes
results, count = scrape_recipes(query)

### Exercise, INSPECT the website and find out how we ask for the summary data
Below we make 4 variables: title, rating_count, summary and rating. 
For three of them the code is complete, but for 'summary' you need to write the code yourself. To find out what to write, go to the website allrecipes.com and type in the search word carrot. Now you see all the recipes with carrots. Right-click the part you are interested in (the summary), choose Inspect, and find out the type of the HTML tag and the name of its class. Carefully follow the way it is written for the other variables. 
<font color="orange" > Tip: First try the inspect for the other variables we are defining, like title, and find out how the html code looks for this. Then try with the summary. </font>
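
To see why the exact spelling matters, here is a small sketch of how the code in `parse_recipe_item` takes one of these strings apart. It uses the `title` entry from this notebook and the same string operations as the function above; nothing here contacts the website.

```python
# One entry from the variables dictionary used in this notebook.
spec = 'h3 class="card__title"'

# Same steps as in parse_recipe_item: first separate the tag from the attribute...
tag, attr = spec.split(maxsplit=1)
# ...then separate the attribute name from its value.
attr_name, attr_value = attr.strip('"').split('="')

print(tag)          # h3
print(attr_name)    # class
print(attr_value)   # card__title
```

So each string needs the tag first, then a space, then the attribute name, an equals sign and the value in double quotes.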

In [None]:
variables = {'title':'h3 class="card__title"', 
             'rating_count':'span class="card__ratingCount card__metaDetailText"',
             'summary': #fill in,
             'rating':'span class="review-star-text"'}

# parse_results
data = [] 
for res in tqdm.tqdm(results):
    data += get_recipes(res,variables)


# load into dataframe
# You need to name the dataset below
df = pd.DataFrame(data).drop_duplicates()

## Congratulations! If all went well, you have now downloaded some data and put it into a dataframe. 
### Exercise: Now we want to see how the data looks! To do this, follow these steps: 
1 Make a "code" cell

2 Now write: df.head(10) 

3 Run it

### Exercise, how many rows of data?
The dataframe has

In [None]:
# use the dataframe's shape attribute here, and note that you should put 0 in square brackets after it: 

 rows.

We see from the website that there are xx search hits on "carrots". Thus, our dataframe does not contain all search hits from the website. We won't be too concerned with why this happened now, but when you are using scraping for scientific purposes (or any 'serious' purpose), it is very important to look thoroughly into missing data.
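
One possible contributor is visible in the scraping code itself: `page_data` decides how many pages to request with `result_count//response_length`, and `//` rounds down. The sketch below shows the effect with a hypothetical number of hits (100 is an assumption, not the real count). Removing duplicates with `drop_duplicates()` can reduce the number of rows further.

```python
import math

response_length = 24      # results per page, as set in page_data
total_results = 100       # hypothetical number of search hits

# page_data uses floor division, which rounds down.
pages_requested = total_results // response_length
# Rounding up instead would include the final, partly filled page.
pages_needed = math.ceil(total_results / response_length)

print(pages_requested, pages_requested * response_length)  # 4 pages, at most 96 results
print(pages_needed)                                        # 5 pages to cover all 100
```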

### Exercise: Compare scraped data with manual inspection of website

Each row in the dataframe above corresponds to a recipe containing the word "carrots". Go to  [allrecipes.com](https://www.allrecipes.com/)  and search for "carrots" in the search box. Compare the results on the website with the dataframe above. How has the information on the website been mapped to the rows and columns in the dataframe?

## **<font color="orange">STOP - time for a carrot challenge! </font>**

 **<font color="orange">Finding the different buildings at CSS is pretty confusing - almost as confusing as Python. Now your task is to find building 35. Do they sell carrots? Once you are there: Have a coffee or a carrot </font>**
 
 
 **<font color="orange"> Join your group and go find building 35 - the canteen is in the basement </font>**



### Exercise: Make a Word Cloud

A Word Cloud is a visualization of the most frequent words in a text corpus. The larger words are more frequent. The code below generates a word cloud from all the titles in the scraped data (where a list of so-called stopwords, such as "the", "and" and "a" are removed - we also remove the words "carrots" and "carrot", since they are very frequent by construction of our data, and therefore they are uninteresting). Run the cells below. Does the resulting word cloud make sense compared with your expectations about recipes with carrots?
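
To get a feeling for what the word cloud is based on, the sketch below counts words by hand with Python's built-in `Counter`. The titles and the short stopword list are made up for illustration; the real code uses the scraped titles and the much longer `STOPWORDS` list from the wordcloud package.

```python
from collections import Counter

# Made-up recipe titles, already in lowercase (not real scraped data).
titles = ["glazed carrots", "carrot cake", "roasted carrots and potatoes", "easy carrot cake"]

# A tiny hand-written stopword list; the wordcloud package ships a much longer one.
stopwords = {"and", "the", "a", "carrot", "carrots"}

# Split every title into words and keep only those that are not stopwords.
words = [w for title in titles for w in title.split() if w not in stopwords]

# Count how often each remaining word occurs.
counts = Counter(words)
print(counts.most_common(3))
```

The most frequent word in this count is the one that would be drawn largest in the cloud.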

### EXERCISE: 
In the code below you need to add a part that converts everything to lowercase!

In [None]:
#Collect all the titles into one large string
title_words='' 
 
for val in df.title:
     
    # typecast each val to string
    val = str(val)
 
    # split the value
    tokens = val.split()
     
    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = # FILL IN HERE
     
    title_words += " ".join(tokens)+" "

In the code below, you need to update the stopwords, because the words "carrots" and "carrot" are uninteresting in the word cloud. Put each word in "" and separate them with a comma, so that Python understands that they are strings. Once that is done, run it. 

In [None]:
stopwords = set(STOPWORDS)
stopwords.update([]) #Fill in here
 
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                collocations = False,
                min_font_size = 10).generate(title_words)
 
# plot the WordCloud image                      
plt.figure(figsize = (6, 6), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 


Do you see anything? No? 

Oh, perhaps we deleted the last line - try to google what you need to add. 

Add the missing piece and then re-run the cell

## **<font color="orange">STOP - time for a carrot challenge! </font>**

 **<font color="orange"> There is a nice library with some study rooms nearby: https://kub.ku.dk/biblioteker/samf/ </font>**
  **<font color="orange"> Join your group and go find that library - have a look around. Any books on carrots? </font>**


### Exercise: Make numeric variable from text data

 The "rating" column contains text, as seen below:
 
 We want to extract the actual rating from this text and store it as a numeric variable. This is done by the code below:

In [None]:
df["rating"] = df["rating"].str.replace("Rating: ", "")
df["rating"] = df["rating"].str.replace(" stars", "")
df["rating"] = df["rating"].astype(float)

 Now the "rating" column just contains the rating, have a look:

In [None]:
df.head()

and it is a decimal variable (aka. "float"), as seen by the command below:

In [None]:
df.dtypes

This command also reveals that "rating_count" is stored as a string. Let's turn it into a decimal variable (a "float"). Note that a counting variable (aka. an "integer") would make more sense, but some technical details make floats easier to use here. Follow the code from above to turn rating_count into a float variable. 

In [None]:
## make it float here

In [None]:
# Check if it worked by using the dtypes attribute

### Exercise: see the first few observations using the same command as you previously used. Put it into the cell below

### Exercise: Make scatter plot of ratings vs rating counts

Now that "rating" and "rating_count" are stored as numeric variables, we can visualize their distributions with histograms and a scatter plot. 

### Let's start out with a scatterplot that shows the rating (the stars) against how many times the recipe has been rated

In [None]:
sns.scatterplot(data=df,x='rating',y='rating_count', alpha=0.5);

## Oh wait: perhaps orange is more appropriate - we are talking about carrots after all. 
### Exercise: Change the color to orange: 
Write the same as above, but add a color statement (color=xxx). 

### Well done! You did a nice plot. You deserve a break and a song: 

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('9OuFbyyt8k0', width=640, height=360)

### Let's get back to business: 
### We can also do more plots in one cell: 

In [None]:
fig, axs = plt.subplots(1,3,figsize=(20,5))
sns.histplot(data=df,x='rating',ax=axs[0])
sns.histplot(data=df,x='rating_count',ax=axs[1])
sns.scatterplot(data=df,x='rating',y='rating_count', alpha=0.5, ax=axs[2]);

 We see that the histogram for _rating_count_ and the scatterplot are not very informative due to the outliers in the distribution of _rating_count_. Below we visualize the distributions for _rating_ between 3 and 5, and _rating_count_ between 1 and 1000.

In [None]:
df_restricted = df[(3<=df['rating'])&(df['rating_count']<=1000)]
fig, axs = plt.subplots(1,3,figsize=(20,5))
sns.histplot(data=df_restricted,x='rating',ax=axs[0])
sns.histplot(data=df_restricted,x='rating_count',ax=axs[1])
sns.scatterplot(data=df_restricted,x='rating',y='rating_count', alpha=0.5, ax=axs[2]);

### Exercise: Make a different restriction: You choose how! 

### What insights about the distributions of _rating_ and _rating_count_ do you get from inspecting the plots above?

Write some of your insights in orange in the cell below!

## **<font color="orange">STOP - time for a carrot challenge! </font>**

 **<font color="orange"> Time for a small tour de CSS: Find the Chr. Hansen Auditorium (maybe somebody left some carrots there???), take a look inside (if it is empty) , then take a stroll to the Study Library at CSS, take small break in the courtyard and come back :) </font>**


### Exercise: Store the data on your machine

In the following step we assign a directory path to tell where to save a file. You need to assign your own path. More specifically, you need to exchange "C:/Users/qtk365/Dropbox/Postdoc/Teaching/Intro-data-sprint/Data/" with the path to the folder on your computer, where you want to store the data.

HINT 1: You can open your computers "file explorer" and manually determine where you want to store the data. Now copy the file path into your code below. 

HINT 2: Beware of forward slashes and double quotes!

HINT 3: If you cannot make it work, then just leave the path empty (path = "") and the data will be stored in the folder, where your notebook is stored.
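
The sketch below shows how the file name is put together in the next cell, using a hypothetical folder (`C:/Users/your_name/Documents/Data/` is a placeholder, not a real location). Forward slashes are the safe choice because a backslash inside a Python string can start a special character such as `\n`.

```python
import os

# Hypothetical folder; replace it with a folder that exists on your computer.
path = "C:/Users/your_name/Documents/Data/"
filename = "recipes_carrots"

# This mirrors the cell below: the pieces are simply glued together,
# so the folder must end with a slash.
full_path = path + filename + ".csv"
print(full_path)

# With an empty path only the file name is left, which means
# "the folder where the notebook is stored".
print("" + filename + ".csv")

# Check whether the folder exists before saving; saving into a missing folder gives an error.
print(os.path.isdir(path))
```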

In [None]:
path = "C:/Users/qtk365/Dropbox/Postdoc/Teaching/Intro-data-sprint/Data/" # CHANGE HERE 
filename = "recipes_carrots"

df.to_csv(path+filename+'.csv',index=False)

### Go into the folder, where you have stored your data and confirm that the data is in there.

### Exercise: Choose your own search term (aka. "query")

Replace the empty search term in the cell below with a search term that you choose yourself, such as "ham", "apples", "awesome", "danish" or whatever you want (remember to put it inside the ' '). Then run the cells. Compare the resulting dataframe with the results from manually searching the website.

In [None]:
query = ''

#Scrape the recipes
results,count = scrape_recipes(query)

In [None]:
variables = {'title':'h3 class="card__title"', 
             'rating_count':'span class="card__ratingCount card__metaDetailText"',
             'summary':, # Fill in 
             'rating':} # Fill in

# parse_results
data = [] 
for res in tqdm.tqdm(results):
    data += get_recipes(res,variables)

# load into dataframe
df = pd.DataFrame(data).drop_duplicates()

In [None]:
#Print the first 10 rows of the dataframe

In [None]:
#Number of rows in the dataframe


## Exercise: Make Word Cloud with your new data

Execute the cells below to make a word cloud from the data for your own search term. Remember to fill in the stopwords in the second cell, for example

> stopwords.update(["ham"])

if your search term was "ham", so that the stopwords (words that are removed before making the word cloud) make sense for your chosen search term.

In [None]:
#Collect all the titles into one large string
title_words='' 
 
for val in df.title:
     
    # typecast each val to string
    val = str(val)
 
    # split the value
    tokens = val.split()
     
    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
     
    title_words += " ".join(tokens)+" "

In [None]:
stopwords = set(STOPWORDS)
stopwords.update([""]) # Fill in here
 
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                collocations = False,
                min_font_size = 10).generate(title_words)
 
# plot the WordCloud image                      
plt.figure(figsize = (6, 6), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
## Show your cloud... 

## Exercise: Visualize distributions of _rating_ and _rating_count_ with your new data

In [None]:
## Fill in this one, to convert the variables from string to float. Have a look above if you are in doubt

## **<font color="orange">STOP - time for a carrot challenge! </font>**

 **<font color="orange"> You came a long way. Now it is time to take a break. You can go to Netto and Fakta and buy some carrot-ice cream (??), and then take a walk around the lake! </font>**


In [None]:
fig, axs = plt.subplots(1,3,figsize=(20,5))
sns.histplot(data=df,x='rating',ax=axs[0])
sns.histplot(data=df,x='rating_count',ax=axs[1])
sns.scatterplot(data=df,x='rating',y='rating_count', alpha=0.5, ax=axs[2]);

In [None]:
df_restricted = df[(3<=df['rating'])&(df['rating_count']<=1000)]
fig, axs = plt.subplots(1,3,figsize=(20,5))
sns.histplot(data=df_restricted,x='rating',ax=axs[0])
sns.histplot(data=df_restricted,x='rating_count',ax=axs[1])
sns.scatterplot(data=df_restricted,x='rating',y='rating_count', alpha=0.5, ax=axs[2]);

## Store your new data

### Change the path name, so it points to the folder where you want to store your data. Change the filename, so it matches your search term.

In [None]:
path = "C:/Users/qtk365/Dropbox/Postdoc/Teaching/Intro-data-sprint/Data/" # CHANGE HERE!
filename = "recipes_XXX"

df.to_csv(path+filename+'.csv',index=False)

### Go to the folder to confirm that your new data is in there.

## Exercise: Play around. 

### You can try out more search terms. After you have tried out some different search terms, you can load the data, which you stored for the different terms, and then you can try to use visualizations to compare the different data sets.

If you now feel somewhat comfortable working with Python, you can try to investigate your data more. For instance, you can try to find out if there is any relationship between the length of the summary and the rating. If you made it this far: YOU are good, and you should try to write your own code. For example: search for both something healthy and something unhealthy, put the results into the same dataframe, and compare ratings and number of ratings (rating_count). What gets more positive reviews? 

Another possibility is to look at different cuisines and compare them, e.g. "Thai" vs. "French". Which is more popular? And which seems to be healthier?  
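
As a starting point, the sketch below shows the general pattern for loading several stored files and comparing them. It uses only Python's built-in modules and invented ratings for two hypothetical search terms ("salad" and "cake"), written to a temporary folder so that none of your own files are touched.

```python
import csv
import os
import statistics
import tempfile

# Invented ratings for two search terms (illustration only, not scraped data).
examples = {"salad": [4.5, 4.0, 4.8], "cake": [4.2, 3.9, 4.6]}

folder = tempfile.mkdtemp()   # temporary folder so nothing is written to your own files
for term, ratings in examples.items():
    with open(os.path.join(folder, "recipes_%s.csv" % term), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["rating"])
        writer.writerows([[r] for r in ratings])

# Load each stored file again and remember which search term it belongs to.
rows = []
for term in examples:
    with open(os.path.join(folder, "recipes_%s.csv" % term), newline="") as f:
        for row in csv.DictReader(f):
            rows.append({"query": term, "rating": float(row["rating"])})

# Compare the average rating per search term.
mean_rating = {}
for term in examples:
    mean_rating[term] = statistics.mean(r["rating"] for r in rows if r["query"] == term)
    print(term, round(mean_rating[term], 2))
```

With the real data you could follow the same pattern in pandas: read each file with `pd.read_csv`, add a column with the search term, and stack the dataframes with `pd.concat`.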

