# Homework 5

Load the packages needed for this exercise.

In [67]:
library(httr)
library(jsonlite)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(SnowballC)



## REST API

In this exercise, we fetch data from the Nobel Prize REST API for the 2022 Nobel Prize in Physics. A GET request to the API endpoint returns the prize information in JSON format. From this data, we extract the motivation given for each laureate's award, which explains why the laureate was honored. We then preprocess the text by converting it to lowercase, removing punctuation, numbers, English stopwords, and extra whitespace, and stemming the remaining words. Finally, we visualize the most frequent terms in the motivations as a word cloud, in which more frequent terms are drawn larger, to highlight the key themes of the awarded research.
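The motivations sit a few levels deep in the JSON. As a minimal sketch of how `fromJSON()` output can be navigated, the snippet below parses a hand-written sample. The sample's shape (an array holding one prize object with a `laureates` list whose entries carry `motivation.en`) is an assumption about the response, and the motivation strings are placeholders, not API output.

```r
library(jsonlite)

# Hand-written sample mimicking the ASSUMED response shape (not real API output)
sample_json <- '[
  {
    "awardYear": "2022",
    "laureates": [
      {"motivation": {"en": "for placeholder motivation one"}},
      {"motivation": {"en": "for placeholder motivation two"}}
    ]
  }
]'

parsed <- fromJSON(sample_json)

# fromJSON() simplifies the outer array into a one-row data frame.
# `laureates` becomes a list column holding one data frame per prize,
# and `motivation` is a nested data frame with one column per language.
motivations <- parsed$laureates[[1]]$motivation$en

print(motivations)
```

Inspecting the parsed object with `str(parsed)` is a quick way to confirm the actual nesting before hard-coding a path like this.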

In [68]:
# API URL for the Nobel Prize in Physics 2022
api_url <- "http://api.nobelprize.org/2.1/nobelPrize/phy/2022"

# Send a GET request to the API to fetch the data
response <- GET(api_url)

# Continue only if the API request was successful
if (status_code(response) == 200) {
    # Parse the JSON body of the response
    data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

    # Extract the English motivation for each laureate.
    # Assumption: the response is an array with one prize object whose
    # `laureates` entries each carry `motivation$en`.
    motivations <- data$laureates[[1]]$motivation$en

    # Combine all motivations into a single string
    motivations_text <- paste(motivations, collapse = " ")

    # Build a text corpus from the combined motivations
    motivation_corpus <- Corpus(VectorSource(motivations_text))

    # Convert to lowercase
    motivation_corpus <- tm_map(motivation_corpus, content_transformer(tolower))
    # Remove punctuation
    motivation_corpus <- tm_map(motivation_corpus, removePunctuation)
    # Remove numbers
    motivation_corpus <- tm_map(motivation_corpus, removeNumbers)
    # Remove common English stopwords
    motivation_corpus <- tm_map(motivation_corpus, removeWords, stopwords("english"))
    # Remove extra whitespace
    motivation_corpus <- tm_map(motivation_corpus, stripWhitespace)
    # Apply stemming to words
    motivation_corpus <- tm_map(motivation_corpus, stemDocument)

    wordcloud(
        motivation_corpus,
        # Range of word sizes (largest, smallest)
        scale = c(5, 0.5),
        # Maximum number of words to plot
        max.words = 100,
        # Plot words in decreasing order of frequency
        random.order = FALSE,
        # Proportion of words rotated by 90 degrees
        rot.per = 0.35,
        # Use C++ rather than R for collision detection
        use.r.layout = FALSE,
        # Color palette for the words
        colors = brewer.pal(8, "Dark2")
    )
} else {
    print(paste("Error:", status_code(response)))
}
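To check which terms a word cloud would emphasize, the term frequencies can be tabulated directly. The following is a minimal sketch in base R rather than `tm`; the input sentence and the five-word stop list are placeholders chosen for illustration, and no stemming is applied.

```r
# Placeholder text standing in for the combined motivations (not API output)
text <- "For experiments with light, and for pioneering experiments with atoms"

# Tiny stand-in stop list; tm's stopwords("english") is much longer
stopwords_small <- c("for", "with", "and", "the", "of")

# Same cleaning steps as in the exercise: lowercase, drop punctuation and digits
cleaned <- tolower(text)
cleaned <- gsub("[[:punct:]]", "", cleaned)
cleaned <- gsub("[[:digit:]]", "", cleaned)

# Split on whitespace and drop stopwords and empty tokens
tokens <- strsplit(cleaned, "\\s+")[[1]]
tokens <- tokens[!tokens %in% stopwords_small & nzchar(tokens)]

# Count each term and sort from most to least frequent
term_freq <- sort(table(tokens), decreasing = TRUE)
print(term_freq)
```

Terms at the top of `term_freq` are the ones a word cloud would draw largest.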



## Web Scraping

The goal of this part of the assignment is to scrape data from the website https://books.toscrape.com/. We want to create a table that contains key details about the books, specifically the UPC, title, price, and rating, from the first three pages of the site.

The code first defines a URL pattern so that the address of each page can be generated from its page number. It then creates an empty data frame with columns for UPC, title, price, and rating. A loop visits each of the first three pages, scrapes the book information with the helper function `scrape_books_and_details()`, and appends the result to the data frame. After the loop finishes, the first five rows of the table are printed.
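The cell that follows calls `scrape_books_and_details()`, whose definition is not shown in this section. As a rough sketch of what such a helper might look like, the code below uses the `rvest` package, which is not among the packages loaded above. The inner function name `parse_book_details()` is hypothetical, and the CSS selectors are assumptions about the site's markup that should be checked against the live pages.

```r
library(rvest)

# Hypothetical helper: extract the four fields from one parsed book detail page
parse_book_details <- function(detail_page) {
  # Assumed markup: a product table with header (th) and value (td) cells,
  # where the row labelled "UPC" holds the identifier
  headers <- html_text2(html_elements(detail_page, "table tr th"))
  values <- html_text2(html_elements(detail_page, "table tr td"))
  upc <- values[match("UPC", headers)]

  # Assumed markup: the rating is encoded in a class such as "star-rating Three"
  rating_class <- html_attr(html_element(detail_page, "p.star-rating"), "class")
  rating <- trimws(sub("star-rating", "", rating_class, fixed = TRUE))

  data.frame(
    upc = upc,
    title = html_text2(html_element(detail_page, "h1")),
    price = html_text2(html_element(detail_page, "p.price_color")),
    rating = rating,
    stringsAsFactors = FALSE
  )
}

# Sketch of the helper called in the loop: scrape every book linked from a listing page
scrape_books_and_details <- function(url) {
  listing <- read_html(url)

  # Assumed markup: each book on the listing page links to its detail page
  links <- html_attr(html_elements(listing, "article.product_pod h3 a"), "href")
  detail_urls <- xml2::url_absolute(links, url)

  # Fetch each detail page and stack the one-row data frames
  rows <- lapply(detail_urls, function(u) parse_book_details(read_html(u)))
  do.call(rbind, rows)
}
```

Splitting the parsing from the downloading keeps `parse_book_details()` usable on any parsed HTML document, which makes it easy to try out on a small hand-written page before running it against the site.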

In [42]:
# URL pattern for the catalogue pages; %s is a placeholder for the page number
url_pattern <- "https://books.toscrape.com/catalogue/page-%s.html"

# Empty data frame to store the book information
books_info <- data.frame(upc = character(), title = character(), price = character(), rating = character(), stringsAsFactors = FALSE)

# Loop through the first three pages, scraping the book details
# and appending them to the table
for (page_number in 1:3) {
  url <- sprintf(url_pattern, page_number)
  page_data <- scrape_books_and_details(url)
  books_info <- rbind(books_info, page_data)
}

# Print the first five rows of the table
print(head(books_info, n = 5))

               upc                                 title  price rating
1 a897fe39b1053632                  A Light in the Attic £51.77  Three
2 90fa61229261140a                    Tipping the Velvet £53.74    One
3 6957f44c3847a760                            Soumission £50.10    One
4 e00eb4fd7b871a48                         Sharp Objects £47.82   Four
5 4165285e1663650f Sapiens: A Brief History of Humankind £54.23   Five


As the output shows, the table has the format requested in the homework, with columns for UPC, title, price, and rating.