---
---
Problem Set 6: Web Scraping

Applied Data Science using Python

New York University, Abu Dhabi

Out: 7th Dec 2023 || **Due: 14th Dec 2023 at 23:59**

---
---
# Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of Web Scraping

### Specific Goals
- Learn how to identify websites that allow scraping
- Learn the basics of BeautifulSoup
- Learn to adjust the scraping code to more than one query

## Collaboration Policy
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html).
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses or results; you must submit your own code and results. All submitted code will be compared against all code submitted this semester and online using MOSS. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be mature and responsible enough to finish your work with full integrity.

## Late Submission Policy
You can submit the homework up to 3 days late. However, we will deduct **20 points** from your homework grade **for each late day you take**. We will not accept the homework after 3 late days.

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. We ask that you do **not** distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. We also ask that you do not post these problem sets or recitations online or on any public platform.

## Disclaimer
The number of points does not necessarily reflect the difficulty of a task.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **P6_YOUR NETID.ipynb**.

---

# General Instructions


This homework is worth 100 points. It has 4 parts. Below each part, we list the concepts required to complete that part. All parts must be completed in the Jupyter (Colab) Notebook attached to this handout.

<font color="red">**Important Note:** Please scrape the websites ethically and make sure you utilize the `sleep` function between requests. </font>
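As a rough illustration of what pausing between requests can look like, here is a minimal sketch. The helper name `fetch_politely`, the stand-in `fake_fetch` function, and the delay value are assumptions made for this example only; in the notebook you would pass in a real fetching function and choose a delay that suits the site.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 1.0) -> List[str]:
    """Fetch each URL in turn, sleeping between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


# Hypothetical stand-in fetcher so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

Keeping the fetching function as a parameter is only one possible design; it makes the delay logic easy to try out before pointing it at a live website.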


# Part I: To Scrape or Not to Scrape (5 points)

A friend wants us to write a function that scrapes product data from 3 different websites and recommends which product to buy based on price and reviews. However, we first need to figure out which of the websites below actually allow scraping:

- [Amazon.ae](https://www.amazon.ae/)
- [Noon.com](https://www.noon.com/uae-en/)
- [Instock.ae](https://www.instock.ae/)

Provide your answers and, in a couple of sentences, describe how you arrived at them.

*Hint: Two websites allow us to scrape their search results.*

Based on the analysis of the `robots.txt` files from the three websites, here are the conclusions regarding their scraping policies:

1. **[Amazon.ae](https://www.amazon.ae/)**: Amazon.ae's `robots.txt` file is quite restrictive. It disallows scraping in many parts of the website, including product, account, cart, and user interaction pages. Specific bots are entirely blocked. Thus, Amazon.ae does not generally allow scraping for product data.

2. **[Noon.com](https://www.noon.com/uae-en/)**: Noon.com's `robots.txt` file is more permissive. It only disallows access to paths under `/_svc/` while allowing all other paths. This indicates that Noon.com generally allows scraping, including for product data.

3. **[Instock.ae](https://www.instock.ae/)**: Instock.ae's `robots.txt` file restricts access to various parts of the site, including specific directories and functionalities related to products and customer interactions. However, it does not explicitly disallow scraping of product data from search results or general product pages.

Therefore, based on the `robots.txt` analysis, Noon.com and Instock.ae allow scraping of their search results to some extent, while Amazon.ae does not appear to permit scraping for product data.
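A check of this kind can also be done programmatically with the standard library's `urllib.robotparser`. The sketch below parses a made-up set of rules (they are not copied from any of the three sites, and the `example.com` URLs are placeholders) and asks whether two paths may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; not taken from a real website.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /_svc/
Disallow: /cart
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a generic crawler ("*") may fetch each placeholder URL.
search_allowed = parser.can_fetch("*", "https://example.com/search?q=laptop")
service_allowed = parser.can_fetch("*", "https://example.com/_svc/catalog")

print(search_allowed, service_allowed)
```

For a live site, `set_url()` followed by `read()` would load the real `robots.txt` instead of the sample text. Note that `robots.txt` is only one signal; a site's terms of service may impose further restrictions.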

### *Concepts required to complete this task:*

- Concepts of Ethical Scraping

## Rubric

- +5 points for the correct answer and for pointing to the place where you found the information

# Part II: Hindi Geet Mala (40 points)

[Hindi Geet Mala](https://www.hindigeetmala.net/) is a website containing information about Indian movies, songs, singers, etc. For this part, you will scrape information about all the movies in alphabetical order. Additionally, you will scrape information about songs.

More precisely, you will submit 2 CSV files:

* movies.csv: Title, Year, Number of Songs, Film Director, Film Producer, Film cast, Lyricist, Music Director, Singer, External links, Watch Full Movie

* songs.csv: Artists, Title, Rating, Number of Votes, Movie Title

**Notes:**

- You are NOT allowed to hardcode the list of letters.
- You are NOT allowed to use the Pandas function `read_html()`.

In [None]:
# Write your solution here
############# SOLUTION ###############

############# SOLUTION END ###############

### *Concepts required to complete this task:*

- Navigating through HTML code using functions
- DataFrame Creation and Writing to a file


## Rubric

- +30 points for correct output (15 points for each dataframe)
- +5 points for concise, logical code
- +3 points for ethical and mindful scraping
- +2 points for comments and variable names

# Part III: Yahoo Finance (30 points)

A friend who studies Economics recommended that we buy a stock if the weekly average of its closing price has been steadily increasing by at least 1% over the last 3 weeks.

Since we are interested in more than one company, let's write a function called `stock_decision()` that scrapes closing-price data for the given companies from [Yahoo Finance](https://finance.yahoo.com/), computes the weekly average for the past 3 weeks, and checks whether it has been increasing. If the weekly average has been steadily increasing by at least 1%, recommend buying; otherwise, recommend selling.

The output should be two lines:

```
"Sell: "

"Buy: "
```

Implement the function for these companies.
- Apple
- Microsoft
- Amazon
- Tesla
- Facebook

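For example, the URL for Facebook would be https://finance.yahoo.com/quote/FB/history/. Facebook's parent company has traded under the ticker `META` since 2022, so verify each ticker before building its URL.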



**Hint:** To group the data by week, look up the Pandas `resample()` function. Another Pandas function that will come in handy is `pct_change()`.
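To show how the two functions from the hint fit together, here is a small sketch on made-up closing prices (not real market data). It assumes one possible reading of the rule: every week-over-week change in the weekly mean must be at least 1%. The variable names are illustrative only.

```python
import pandas as pd

# Made-up closing prices for 15 business days (3 full weeks).
dates = pd.bdate_range("2023-11-13", periods=15)
close = pd.Series(
    [100, 101, 102, 101, 100,   # week 1, mean 100.8
     103, 104, 103, 105, 104,   # week 2, mean 103.8
     106, 107, 108, 107, 109],  # week 3, mean 107.4
    index=dates,
    name="Close",
    dtype=float,
)

# Group the daily prices into calendar weeks and average each week.
weekly_mean = close.resample("W").mean()

# Fractional change from one weekly mean to the next; the first value is NaN.
weekly_change = weekly_mean.pct_change().dropna()

# Assumed interpretation: every weekly change must be at least 1%.
steadily_increasing = bool((weekly_change >= 0.01).all())

print(weekly_mean)
print(steadily_increasing)
```

`resample()` needs a datetime index, so scraped date strings would first have to be converted, for example with `pd.to_datetime()`.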

**Note:**

- You are NOT allowed to use the Pandas function `read_html()`.

In [None]:
# Write your solution here
############# SOLUTION ###############

############# SOLUTION END ###############

### *Concepts required to complete this task:*

- Navigating through HTML code using functions
- DataFrame Creation and Manipulations
- Applying functions to a DataFrame


## Rubric

- +20 points for correct output
- +5 points for concise and logical code
- +3 points for ethical and mindful scraping
- +2 points for comments and logical variable names

# Part IV: WikiCFP (25 points)

[WikiCFP](https://wikicfp.com/) is a semantic wiki for Call for Papers in science and technology fields.

For this part, you will be writing a program that will do the following:

* scrape the first 10 pages of WikiCFP for the keyword `computer science`
* create a Pandas DataFrame with the columns `abbreviation, name, dates, place, deadline`
* create a new column called `country`, which will take the value `online` if the place contains `Online` or `Virtual`.
* remove missing values, calculate the distribution of places, and print the top and bottom countries.
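The last two steps can be sketched on a few made-up rows (not scraped from WikiCFP). The handout only defines the `online` case, so the rule used here for in-person events, taking the text after the last comma of `place`, is an assumption for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not real WikiCFP data.
events = pd.DataFrame({
    "abbreviation": ["CONF-A", "CONF-B", "CONF-C", "CONF-D"],
    "place": ["Paris, France", "Online", "Virtual Event", None],
})

# True where the place mentions "Online" or "Virtual"; missing places count as False.
is_online = events["place"].str.contains("Online|Virtual", na=False)

# Assumption: for in-person events, the country is the text after the last comma.
last_part = events["place"].str.split(",").str[-1].str.strip()

# Keep the parsed country for in-person events, use "online" otherwise.
events["country"] = last_part.where(~is_online, "online")

# Drop rows with no country, then count how often each country appears.
events = events.dropna(subset=["country"])
country_counts = events["country"].value_counts()

print(country_counts)
```

`value_counts()` sorts from most to least frequent, so `head()` and `tail()` on the result give the top and bottom countries.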

**Note:**

- You are NOT allowed to use the Pandas function `read_html()`.

In [None]:
# Write your solution here
############# SOLUTION ###############

############# SOLUTION END ###############

### *Concepts required to complete this task:*

- Navigating through HTML code using functions
- DataFrame Creation
- Adding new columns to an existing DataFrame
- String operations


## Rubric

- +17 points for correct output
- +3 points for concise and logical code
- +3 points for ethical and mindful scraping
- +2 points for comments and logical variable names