---
---
Recitation 12: Web Scraping

Applied Data Science in Python for Social Scientists

New York University, Abu Dhabi

Dated: 07th Dec 2023

---
---
# Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of web scraping
- Learn the fundamental concepts of ethical scraping

### Specific Goals
- Learn how to use Beautiful Soup
- Learn how to effectively navigate webpages
- Learn how to use web scraping to answer questions

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. Please do **not** distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. Please also do not post these problem sets and recitations online or on any public platform.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **R7_YOUR NETID.ipynb**.

---

# Part I: Hindi Geet Mala (50 points)

For this recitation, we will scrape the website https://www.hindigeetmala.net/ to answer the following question:

*Has the average number of songs in a movie increased in 2018 compared to the 1930s?*
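
Since ethical scraping is one of the learning goals, it can help to check a site's `robots.txt` rules before requesting pages. The sketch below uses the standard library's `urllib.robotparser` on a made-up set of rules. The rules, the `/admin/` path, and the `recitation-bot` user agent name are assumptions for illustration only and are not the actual rules of the site we scrape.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT copied from any real site.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /admin/",
]

def is_allowed(rules, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse rules from a list of lines (no network call)
    return parser.can_fetch(user_agent, url)

listing_ok = is_allowed(
    EXAMPLE_RULES, "recitation-bot",
    "https://www.hindigeetmala.net/movie/2018.php?page=1",
)
admin_ok = is_allowed(
    EXAMPLE_RULES, "recitation-bot",
    "https://www.hindigeetmala.net/admin/settings.htm",
)
print(listing_ok, admin_ok)
```

To check a live site instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the real `robots.txt`.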

In [14]:
import time
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import re

years = ['2018', '1930s']

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
}

In [27]:
############ Solution###############
def get_movie_urls(year, page):

    url = "https://www.hindigeetmala.net/movie/" + year + ".php?page=" + str(page)
    
    # Get the HTML content
    response = requests.get(url, headers=header)
    html = response.content
    
    # Parse the HTML
    soup = bs(html, "html.parser")
    
    # Get the movie table (classes b1, w760 and alcen). A dict literal with a
    # repeated "class" key keeps only the last value, so use a CSS selector.
    table = soup.select_one("table.b1.w760.alcen")
    
    # Get all tds in the table that have a w25p class
    tds = table.find_all("td", attrs={"class": "w25p"})
    # print(len(tds))
    
    # Get all the links in the tds
    links = []
    for td in tds:
        links.append(td.find("a").get("href"))
    # print(len(links))
    
    # Return the links
    return links

get_movie_urls("2018", 1)

['/movie/batti_gul_meter_chalu.htm',
 '/movie/dhadak.htm',
 '/movie/gold.htm',
 '/movie/loveratri.htm',
 '/movie/nawabzaade.htm',
 '/movie/satyameva_jayate.htm',
 '/movie/veere_di_wedding.htm',
 '/movie/stree_2018.htm',
 '/movie/102_not_out.htm',
 '/movie/3_storeys.htm',
 '/movie/aiyaary.htm',
 '/movie/angrezi_mein_kehte_hain.htm',
 '/movie/baa_baaa_black_sheep.htm',
 '/movie/baaghi_2.htm',
 '/movie/beyond_the_clouds.htm',
 '/movie/bhavesh_joshi_superhero.htm',
 '/movie/billu_ustaad.htm',
 '/movie/blackmail_2018.htm',
 '/movie/daas_dev.htm',
 '/movie/daddys_daughter.htm']
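
To experiment with the parsing step without hitting the website, the same idea can be tried on a small HTML string. This is a minimal sketch: the `SAMPLE_HTML` snippet and the `extract_links` helper are hypothetical, and the snippet only imitates the class names used in the solution above rather than reproducing the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating the structure the solution looks for.
SAMPLE_HTML = """
<table class="b1 w760 alcen">
  <tr>
    <td class="w25p"><a href="/movie/gold.htm">Gold</a></td>
    <td class="w25p"><a href="/movie/dhadak.htm">Dhadak</a></td>
    <td class="w25p">A cell without a link</td>
  </tr>
</table>
"""

def extract_links(html):
    """Return the href of every link inside td.w25p cells of the movie table."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches a table carrying all three classes at once.
    table = soup.select_one("table.b1.w760.alcen")
    if table is None:
        return []  # the page has no such table
    # "a[href]" skips cells without a link instead of raising an error.
    return [a["href"] for a in table.select("td.w25p a[href]")]

sample_links = extract_links(SAMPLE_HTML)
print(sample_links)
```

Returning an empty list when the table is missing is one possible design choice; it lets a caller treat "no table" and "no movies" the same way.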

In [33]:
############ Solution###############
def get_num_songs(movie_url):
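    # Strip any leading slash so hrefs such as "/movie/gold.htm" do not
    # produce a double slash after the domain
    url = "https://www.hindigeetmala.net/" + str(movie_url).lstrip("/")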
    
    # Get the HTML content
    response = requests.get(url, headers=header)
    
    # Parse the HTML
    soup = bs(response.content, "html.parser")
    
    # Get the trs with itemprop = "track"
    trs = soup.find_all("tr", attrs={"itemprop": "track"})
    
    # Return the number of trs
    return len(trs)

get_num_songs("movie/avtaar.htm")
# get_num_songs("movie/mr_and_mrs_55.htm") # Test case

6
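
The links returned by `get_movie_urls` start with a slash (`/movie/gold.htm`), while the test call above passes a path without one (`movie/avtaar.htm`). One way to handle both forms is `urllib.parse.urljoin` from the standard library. The `BASE_URL` constant and the `build_movie_url` helper below are illustrative names, not part of the original solution.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.hindigeetmala.net/"

def build_movie_url(movie_path):
    """Resolve a relative movie path against the site's base URL."""
    return urljoin(BASE_URL, movie_path)

with_slash = build_movie_url("/movie/gold.htm")
without_slash = build_movie_url("movie/avtaar.htm")
print(with_slash)
print(without_slash)
```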

In [34]:
############ Solution###############
def average_num_songs(year_list):
    song_counter = {}
    for year in year_list:
        song_counter[year] = []
        # Only the first listing page of each year is sampled
        movie_urls = get_movie_urls(year, 1)
        for movie_url in movie_urls:
            num_songs = get_num_songs(movie_url)
            song_counter[year].append(num_songs)
    
    for year, counts in song_counter.items():
        print(f"Average number of songs in {year} is {sum(counts) / len(counts)}")
    
    # return song_counter


average_num_songs(["1930s", "2018"])

Average number of songs in 1930s is 9.05
Average number of songs in 2018 is 5.25
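
These averages come from the first listing page of each period only. A possible extension, sketched below, walks through the pages until one comes back empty and pauses between requests. The helpers `collect_all_pages` and `average_by_group`, the stand-in `fake_fetch`, and all numbers in the demo are hypothetical. The sketch also assumes that a page past the end of the listing yields an empty list, which has not been verified for this website.

```python
import time
from statistics import mean

def collect_all_pages(fetch_page, max_pages=50, delay_seconds=1.0):
    """Call fetch_page(1), fetch_page(2), ... until a page returns no items."""
    items = []
    for page in range(1, max_pages + 1):
        page_items = fetch_page(page)
        if not page_items:
            break  # assumed signal that the listing has ended
        items.extend(page_items)
        time.sleep(delay_seconds)  # pause so the server is not flooded
    return items

def average_by_group(counts_by_group):
    """Return the mean of each non-empty list of counts, keyed by group."""
    return {
        group: mean(counts)
        for group, counts in counts_by_group.items()
        if counts  # skip empty groups to avoid an error in mean()
    }

# Stand-in for a real fetcher, so the sketch runs without network access.
FAKE_PAGES = {1: ["/movie/a.htm", "/movie/b.htm"], 2: ["/movie/c.htm"]}

def fake_fetch(page):
    return FAKE_PAGES.get(page, [])

all_urls = collect_all_pages(fake_fetch, delay_seconds=0)
# Made-up song counts, used only to show the shape of the result.
demo_averages = average_by_group({"1930s": [8, 10], "2018": [4, 6], "empty": []})
print(all_urls)
print(demo_averages)
```

With the functions from this recitation, the fetcher could be something like `lambda page: get_movie_urls("2018", page)`, provided `get_movie_urls` is adapted to return an empty list when the movie table is missing.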


## Rubric

- +40 points for correct output
- +5 points for conciseness
- +5 points for a reasonable choice of a data structure