# **Web Scraping Notes**

Web scraping is the process of automatically extracting data from websites. It allows you to collect and organize data from the web for analysis or other purposes.

---

## **Applications of Web Scraping**
1. **Data Collection for Research**: Gathering data for academic or market research.
2. **E-commerce**: Extracting product prices, reviews, or inventory details.
3. **Competitor Analysis**: Tracking competitors' offerings and updates.
4. **News Aggregation**: Collecting headlines or articles from news websites.
5. **Social Media Insights**: Analyzing trends or user-generated content.

---

## **Basic Steps in Web Scraping**
1. **Identify the Website**:
   - Choose a website with the data you want to scrape.
2. **Analyze the Website Structure**:
   - Use browser tools (Inspect Element) to examine the HTML structure of the webpage.
3. **Send an HTTP Request**:
   - Use tools like Python's `requests` library to fetch the webpage's content.
4. **Parse the HTML**:
   - Use libraries like `BeautifulSoup` to extract specific data.
5. **Store the Data**:
   - Save the extracted data in a file (e.g., CSV, JSON) or a database.
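
The parse and store steps above can be sketched on a small inline page. This is a minimal illustration, not a definitive implementation: `SAMPLE_HTML`, the class names `item`, `name` and `price`, and the helpers `parse_items` and `to_csv` are all hypothetical, and the HTML string stands in for the body that an HTTP request would return.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for the body of an HTTP response (step 3); in a real run this
# string would come from something like requests.get(url).text.
SAMPLE_HTML = """
<ul>
  <li class="item"><h3 class="name">Widget</h3><span class="price">9.99</span></li>
  <li class="item"><h3 class="name">Gadget</h3><span class="price">19.50</span></li>
</ul>
"""


def parse_items(html):
    """Step 4: pull the name and price out of every list item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="item"):
        rows.append({
            "name": item.find("h3", class_="name").get_text(strip=True),
            "price": float(item.find("span", class_="price").get_text(strip=True)),
        })
    return rows


def to_csv(rows):
    """Step 5: serialise the rows as CSV text (written to a file in practice)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_items(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

Looking up each field inside its own list item keeps the name and price of one product together, which is harder to guarantee when every field is collected into a separate flat list.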

---

## **Common Tools for Web Scraping**
1. **Programming Languages**: Python is a popular choice for web scraping, largely because of the libraries listed below.
2. **Libraries**:
   - `requests`: For sending HTTP requests to websites.
   - `BeautifulSoup`: For parsing and extracting data from HTML.
   - `pandas`: For storing and analyzing scraped data.
   - `Selenium`: For automating a browser to scrape JavaScript-heavy websites.

---


## **Scraping Top IMDb Movies**

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'https://www.imdb.com/search/title/?groups=top_250'
url

'https://www.imdb.com/search/title/?groups=top_250'

In [3]:
response = requests.get(url)
response

<Response [403]>

| **Status Code** | **Category**              | **Description**                                                                 |
|------------------|---------------------------|---------------------------------------------------------------------------------|
| **100**          | Informational Response   | Continue: The client can proceed with the request body.                        |
| **101**          | Informational Response   | Switching Protocols: The server is switching protocols as requested.           |
| **200**          | Success                  | OK: The request was successful, and the desired content was returned.          |
| **201**          | Success                  | Created: The request was successful, and a new resource was created.           |
| **204**          | Success                  | No Content: The server successfully processed the request, but no content returned. |
| **301**          | Redirection              | Moved Permanently: The resource has been moved to a new URL permanently.        |
| **302**          | Redirection              | Found: The resource is temporarily located at a different URL.                 |
| **304**          | Redirection              | Not Modified: The resource has not been modified since the last request.       |
| **400**          | Client Error             | Bad Request: The server could not understand the request due to invalid syntax.|
| **401**          | Client Error             | Unauthorized: Authentication is required to access the resource.               |
| **403**          | Client Error             | Forbidden: The server understands the request but refuses to authorize it.     |
| **404**          | Client Error             | Not Found: The requested resource was not found on the server.                 |
| **429**          | Client Error             | Too Many Requests: The client sent too many requests in a given time (rate-limiting). |
| **500**          | Server Error             | Internal Server Error: The server encountered an unexpected error.             |
| **502**          | Server Error             | Bad Gateway: The server received an invalid response from an upstream server.  |
| **503**          | Server Error             | Service Unavailable: The server is temporarily unavailable due to maintenance or overload. |
| **504**          | Server Error             | Gateway Timeout: The server did not receive a timely response from an upstream server. |
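
As a small sketch of how the table above can be used in code, the first digit of a status code gives its category, and the standard library's `http.HTTPStatus` supplies the reason phrase. The helper name `describe_status` and the `CATEGORIES` mapping are illustrative, not part of `requests`.

```python
from http import HTTPStatus

# Category names follow the table above, keyed by the first digit of the code.
CATEGORIES = {
    1: "Informational Response",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a code the standard library knows about.
        phrase = "Unknown"
    return category, phrase


for code in (403, 200, 429):
    print(code, *describe_status(code))
```

For the `<Response [403]>` seen earlier, `describe_status(response.status_code)` would report a client error, which suggests changing the request (for example its headers) rather than retrying it unchanged.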


In [4]:
# Reuse one session so the browser-like headers below are sent with every request.
# The session is created without a `with` block because later cells keep using it.
se = requests.Session()
se.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en"
})

In [5]:
response = se.get(url)
response

<Response [200]>

In [6]:
body = response.content

In [8]:
soup = BeautifulSoup(body, 'html.parser')

### Scrape the movie titles

In [11]:
# Target element, as copied from the browser's Inspect panel:
# <h3 class="ipc-title__text">1. Gladiator</h3>

In [12]:
movies_title = soup.find('h3', class_="ipc-title__text")
movies_title

<h3 class="ipc-title__text">1. The Shawshank Redemption</h3>

In [13]:
movies_title.get_text()

'1. The Shawshank Redemption'

In [14]:
movies_titles = soup.find_all('h3', class_="ipc-title__text")
movies_titles

[<h3 class="ipc-title__text">1. The Shawshank Redemption</h3>,
 <h3 class="ipc-title__text">2. Star Wars: Episode IV - A New Hope</h3>,
 <h3 class="ipc-title__text">3. The Dark Knight</h3>,
 <h3 class="ipc-title__text">4. Inception</h3>,
 <h3 class="ipc-title__text">5. Interstellar</h3>,
 <h3 class="ipc-title__text">6. The Godfather</h3>,
 <h3 class="ipc-title__text">7. Top Gun: Maverick</h3>,
 <h3 class="ipc-title__text">8. Gladiator</h3>,
 <h3 class="ipc-title__text">9. Oppenheimer</h3>,
 <h3 class="ipc-title__text">10. Pulp Fiction</h3>,
 <h3 class="ipc-title__text">11. Inglourious Basterds</h3>,
 <h3 class="ipc-title__text">12. The Lord of the Rings: The Fellowship of the Ring</h3>,
 <h3 class="ipc-title__text">13. Dune: Part Two</h3>,
 <h3 class="ipc-title__text">14. Goodfellas</h3>,
 <h3 class="ipc-title__text">15. The Silence of the Lambs</h3>,
 <h3 class="ipc-title__text">16. The Wolf of Wall Street</h3>,
 <h3 class="ipc-title__text">17. Schindler's List</h3>,
 <h3 class="ipc-t

In [15]:
# Collect the text of every matched heading
titles = [title.get_text() for title in movies_titles]
# Keep the first 25 entries, one per movie on the results page
titles = titles[:25]
titles

['1. The Shawshank Redemption',
 '2. Star Wars: Episode IV - A New Hope',
 '3. The Dark Knight',
 '4. Inception',
 '5. Interstellar',
 '6. The Godfather',
 '7. Top Gun: Maverick',
 '8. Gladiator',
 '9. Oppenheimer',
 '10. Pulp Fiction',
 '11. Inglourious Basterds',
 '12. The Lord of the Rings: The Fellowship of the Ring',
 '13. Dune: Part Two',
 '14. Goodfellas',
 '15. The Silence of the Lambs',
 '16. The Wolf of Wall Street',
 "17. Schindler's List",
 '18. Avengers: Endgame',
 '19. Fight Club',
 '20. The Wild Robot',
 '21. Se7en',
 '22. Parasite',
 '23. Back to the Future',
 '24. Star Wars: Episode V - The Empire Strikes Back',
 '25. The Truman Show']

In [16]:
len(titles)

25
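
Each scraped title still carries its rank prefix (for example `'1. The Shawshank Redemption'`). A minimal sketch of separating the two with a regular expression follows. The helper `split_rank` is hypothetical, and the unranked string `'Recently viewed'` is an invented example of a heading that does not follow the pattern.

```python
import re

# "<rank>. <title>", e.g. "12. The Lord of the Rings: ..."
RANK_TITLE = re.compile(r"^(\d+)\.\s+(.+)$")


def split_rank(text):
    """Split '1. The Shawshank Redemption' into (1, 'The Shawshank Redemption')."""
    match = RANK_TITLE.match(text.strip())
    if match is None:
        # No rank prefix: return the text unchanged with a missing rank.
        return None, text.strip()
    return int(match.group(1)), match.group(2)


# The last entry is a made-up heading without a rank.
sample = ["1. The Shawshank Redemption", "21. Se7en", "Recently viewed"]
parsed = [split_rank(text) for text in sample]
print(parsed)
```

Returning `None` for the rank makes it easy to drop headings that are not movie entries instead of relying on a fixed slice length.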

### Scrape the ratings

In [None]:
# Target element: <span class="ipc-rating-star--rating">8.5</span>

In [31]:
ratings_span = soup.find_all('span', class_="ipc-rating-star--rating")
ratings_span

[<span class="ipc-rating-star--rating">9.3</span>,
 <span class="ipc-rating-star--rating">8.6</span>,
 <span class="ipc-rating-star--rating">9.0</span>,
 <span class="ipc-rating-star--rating">8.8</span>,
 <span class="ipc-rating-star--rating">8.7</span>,
 <span class="ipc-rating-star--rating">9.2</span>,
 <span class="ipc-rating-star--rating">8.2</span>,
 <span class="ipc-rating-star--rating">8.5</span>,
 <span class="ipc-rating-star--rating">8.3</span>,
 <span class="ipc-rating-star--rating">8.9</span>,
 <span class="ipc-rating-star--rating">8.4</span>,
 <span class="ipc-rating-star--rating">8.9</span>,
 <span class="ipc-rating-star--rating">8.5</span>,
 <span class="ipc-rating-star--rating">8.7</span>,
 <span class="ipc-rating-star--rating">8.6</span>,
 <span class="ipc-rating-star--rating">8.2</span>,
 <span class="ipc-rating-star--rating">9.0</span>,
 <span class="ipc-rating-star--rating">8.4</span>,
 <span class="ipc-rating-star--rating">8.8</span>,
 <span class="ipc-rating-star--

In [18]:
ratings = []
for rating in ratings_span:
    score = rating.get_text()
    ratings.append(score)

In [19]:
ratings

['9.3',
 '8.6',
 '9.0',
 '8.8',
 '8.7',
 '9.2',
 '8.2',
 '8.5',
 '8.3',
 '8.9',
 '8.4',
 '8.9',
 '8.5',
 '8.7',
 '8.6',
 '8.2',
 '9.0',
 '8.4',
 '8.8',
 '8.2',
 '8.6',
 '8.5',
 '8.5',
 '8.7',
 '8.2']

In [20]:
len(ratings)

25

### Scrape the vote count

In [None]:
# Target element: <span class="ipc-rating-star--voteCount">&nbsp;(<!-- -->1.8M<!-- -->)</span>

In [33]:
vote_spans = soup.find_all('span', class_="ipc-rating-star--voteCount")
vote_spans

[<span class="ipc-rating-star--voteCount"> (<!-- -->3M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->1.5M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->3M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.7M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.3M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.1M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->775K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->1.8M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->891K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.3M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->1.7M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->2.1M<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount"> (<!-- -->630K<!-- -->)</span>,
 <span class="ipc-rating-star--voteCount">

In [22]:
vote_counts = []
for span in vote_spans:
    # get_text() drops the HTML comments but keeps the parentheses, e.g. '(3M)'
    text = span.get_text(strip=True)
    vote_counts.append(text)

In [23]:
x = vote_spans[0]
x

<span class="ipc-rating-star--voteCount"> (<!-- -->3M<!-- -->)</span>

In [24]:
x.contents

['\xa0(', ' ', '3M', ' ', ')']

In [25]:
vote_counts = []

for span in vote_spans:
    # span.contents mixes text nodes and HTML comments; keep only the count itself
    for content in span.contents:
        if 'M' in content or 'K' in content:
            vote_counts.append(content)

In [26]:
vote_counts

['3M',
 '1.5M',
 '3M',
 '2.7M',
 '2.3M',
 '2.1M',
 '775K',
 '1.8M',
 '891K',
 '2.3M',
 '1.7M',
 '2.1M',
 '630K',
 '1.3M',
 '1.6M',
 '1.7M',
 '1.5M',
 '1.4M',
 '2.5M',
 '168K',
 '1.9M',
 '1.1M',
 '1.4M',
 '1.4M',
 '1.3M']
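
The vote counts are abbreviated strings such as `'3M'` and `'775K'`, so they cannot be sorted or averaged directly. One possible way to turn them into integers is sketched below. The helper `parse_count` is hypothetical, and the result is only as precise as the abbreviation shown on the page.

```python
def parse_count(text):
    """Convert an abbreviated count such as '1.5M' or '775K' to an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip()
    if not text:
        raise ValueError("empty vote count")
    suffix = text[-1].upper()
    if suffix in multipliers:
        # round() guards against floating-point noise such as 2.7 * 1_000_000.
        return int(round(float(text[:-1]) * multipliers[suffix]))
    # No suffix: treat the text as a plain number, allowing thousands separators.
    return int(text.replace(",", ""))


sample_counts = ["3M", "1.5M", "775K", "2.7M"]
numeric_counts = [parse_count(text) for text in sample_counts]
print(numeric_counts)
```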

### Scrape the release year of the movies

In [None]:
# Target element: <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2000</span>

In [35]:
spans = soup.find_all('span', class_="sc-4b408797-8 iurwGb dli-title-metadata-item")
spans

[<span class="sc-4b408797-8 iurwGb dli-title-metadata-item">1994</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2h 22m</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">R</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">1977</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2h 1m</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">PG</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2008</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2h 32m</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">PG-13</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2010</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2h 28m</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">PG-13</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-metadata-item">2014</span>,
 <span class="sc-4b408797-8 iurwGb dli-title-me

In [37]:
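years = []
duration = []
certification = []

for span in spans:
    text = span.get_text()
    # A four-digit number is a release year
    if text.isdigit() and len(text) == 4:
        years.append(text)
    # Runtimes look like '2h 22m', '3h' or '45m'
    elif 'h' in text or 'm' in text:
        duration.append(text)
    # Anything else is treated as the certification (R, PG, PG-13, ...)
    else:
        certification.append(text)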

In [38]:
years

['1994',
 '1977',
 '2008',
 '2010',
 '2014',
 '1972',
 '2022',
 '2000',
 '2023',
 '1994',
 '2009',
 '2001',
 '2024',
 '1990',
 '1991',
 '2013',
 '1993',
 '2019',
 '1999',
 '2024',
 '1995',
 '2019',
 '1985',
 '1980',
 '1998']

In [39]:
duration

['2h 22m',
 '2h 1m',
 '2h 32m',
 '2h 28m',
 '2h 49m',
 '2h 55m',
 '2h 10m',
 '2h 35m',
 '3h',
 '2h 34m',
 '2h 33m',
 '2h 58m',
 '2h 46m',
 '2h 25m',
 '1h 58m',
 '3h',
 '3h 15m',
 '3h 1m',
 '2h 19m',
 '1h 42m',
 '2h 7m',
 '2h 12m',
 '1h 56m',
 '2h 4m',
 '1h 43m']
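
Runtimes arrive as text like `'2h 22m'` or `'3h'`. A short sketch of converting them to total minutes, which is easier to analyze, follows. The helper name `to_minutes` is an assumption for illustration.

```python
import re

# Optional hours part, optional minutes part: '2h 22m', '3h', '45m'.
DURATION = re.compile(r"^(?:(\d+)h)?\s*(?:(\d+)m)?$")


def to_minutes(text):
    """Convert a runtime such as '2h 22m' to a number of minutes."""
    match = DURATION.match(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


sample_durations = ["2h 22m", "3h", "1h 58m"]
minutes = [to_minutes(text) for text in sample_durations]
print(minutes)
```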

In [40]:
certification

['R',
 'PG',
 'PG-13',
 'PG-13',
 'PG-13',
 'R',
 'PG-13',
 'R',
 'R',
 'R',
 'R',
 'PG-13',
 'PG-13',
 'R',
 'R',
 'R',
 'R',
 'PG-13',
 'R',
 'PG',
 'R',
 'R',
 'PG',
 'PG',
 'PG']
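
Sorting the metadata into three independent lists works only while every movie has all three fields. If one movie lacked a certification, the `certification` list would be one entry short and later rows would shift. The sketch below groups the flat list into one record per movie instead. It assumes that each movie's metadata starts with its year, and `group_metadata` and the sample list (including the entry without a certification) are hypothetical.

```python
import re

YEAR = re.compile(r"^\d{4}$")
DURATION = re.compile(r"^(?:\d+h)?\s*(?:\d+m)?$")


def group_metadata(texts):
    """Group a flat list of metadata strings into one dict per movie.

    Assumes every movie's metadata begins with a four-digit year.
    """
    records = []
    for text in texts:
        text = text.strip()
        if YEAR.match(text):
            # A year starts a new movie record.
            records.append({"year": text, "duration": None, "certification": None})
        elif not records or not text:
            # Skip empty strings and anything seen before the first year.
            continue
        elif DURATION.match(text):
            records[-1]["duration"] = text
        else:
            records[-1]["certification"] = text
    return records


# Invented sample: the second movie has no certification.
flat = ["1994", "2h 22m", "R", "2024", "1h 42m", "1977", "2h 1m", "PG"]
records = group_metadata(flat)
print(records)
```

A missing field then shows up as `None` in its own row rather than silently misaligning the rows that follow.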

In [None]:
# Movie description: target element (shown truncated)
# <div class="ipc-html-content-inner-div" role="presentation">A banker convicted of uxoricide forms a friendship over a quarter century with a hardened convict, while maintaining his innocence and trying to remain hopeful through 

In [49]:
Description = soup.find_all('div', class_="ipc-html-content-inner-div")
Description

[<div class="ipc-html-content-inner-div" role="presentation">A banker convicted of uxoricide forms a friendship over a quarter century with a hardened convict, while maintaining his innocence and trying to remain hopeful through simple compassion.</div>,
 <div class="ipc-html-content-inner-div" role="presentation">Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the mysterious Darth Vader.</div>,
 <div class="ipc-html-content-inner-div" role="presentation">When a menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman, James Gordon and Harvey Dent must work together to put an end to the madness.</div>,
 <div class="ipc-html-content-inner-div" role="presentation">A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of 

In [48]:
# Description is a list of matches, so inspect the first one
Description[0].get_text()

'A banker convicted of uxoricide forms a friendship over a quarter century with a hardened convict, while maintaining his innocence and trying to remain hopeful through simple compassion.'

In [50]:
Descriptions = []
for div in Description:
    text = div.get_text(strip=True)
    Descriptions.append(text)

In [51]:
Descriptions

['A banker convicted of uxoricide forms a friendship over a quarter century with a hardened convict, while maintaining his innocence and trying to remain hopeful through simple compassion.',
 "Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the mysterious Darth Vader.",
 'When a menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman, James Gordon and Harvey Dent must work together to put an end to the madness.',
 'A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster.',
 'When Earth becomes uninhabitable in the future, a farmer and ex-NASA pilot, Joseph Cooper, is tasked to pilot a spacecraft, along with a team of researchers, to find a ne

In [52]:
len(Descriptions)

25

In [55]:
# store data
data = pd.DataFrame()

In [57]:
data['movies_titles'] = titles
data['ratings'] = ratings
data['vote'] = vote_counts
data['year'] = years
data['duration'] = duration
data['certification'] = certification
data['Description'] = Descriptions
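
Because each column is scraped separately, it can help to confirm that all lists have the same length before building the table. A mismatch would otherwise surface as a less descriptive error at assignment time. The helper `check_lengths` and the shortened sample columns below are illustrative only.

```python
def check_lengths(columns):
    """Raise ValueError unless every column has the same number of values."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # Return the shared length (0 when no columns were given).
    return next(iter(lengths.values()), 0)


# Shortened sample columns taken from the scraped output.
columns = {
    "movies_titles": ["1. The Shawshank Redemption", "2. Star Wars: Episode IV - A New Hope"],
    "ratings": ["9.3", "8.6"],
    "vote": ["3M", "1.5M"],
}
row_count = check_lengths(columns)
print(row_count)
```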

In [58]:
data

|   | movies_titles | ratings | vote | year | duration | certification | Description |
|---|---------------|---------|------|------|----------|---------------|-------------|
| 0 | 1. The Shawshank Redemption | 9.3 | 3M | 1994 | 2h 22m | R | A banker convicted of uxoricide forms a friend... |
| 1 | 2. Star Wars: Episode IV - A New Hope | 8.6 | 1.5M | 1977 | 2h 1m | PG | Luke Skywalker joins forces with a Jedi Knight... |
| 2 | 3. The Dark Knight | 9.0 | 3M | 2008 | 2h 32m | PG-13 | When a menace known as the Joker wreaks havoc ... |
| 3 | 4. Inception | 8.8 | 2.7M | 2010 | 2h 28m | PG-13 | A thief who steals corporate secrets through t... |
| 4 | 5. Interstellar | 8.7 | 2.3M | 2014 | 2h 49m | PG-13 | When Earth becomes uninhabitable in the future... |
| 5 | 6. The Godfather | 9.2 | 2.1M | 1972 | 2h 55m | R | The aging patriarch of an organized crime dyna... |
| 6 | 7. Top Gun: Maverick | 8.2 | 775K | 2022 | 2h 10m | PG-13 | The story involves Maverick confronting his pa... |
| 7 | 8. Gladiator | 8.5 | 1.8M | 2000 | 2h 35m | R | A former Roman General sets out to exact venge... |
| 8 | 9. Oppenheimer | 8.3 | 891K | 2023 | 3h | R | A dramatization of the life story of J. Robert... |
| 9 | 10. Pulp Fiction | 8.9 | 2.3M | 1994 | 2h 34m | R | The lives of two mob hitmen, a boxer, a gangst... |


In [59]:
data.to_csv('top_25_imdb_movies.csv', index=False)