# Web Scraping using BeautifulSoup
This notebook is converted from the provided PDF content.
It includes Data Gathering, Web Scraping, Image Scraping, and Data Preparation examples.

# Data Gathering
## Sources of Data
A vast amount of historical data can be found in files such as:
- MS Word documents
- Emails
- Spreadsheets
- PDFs
- HTML
- CSV, JSON, XML files

Public and private archives are another source of historical data.

CSV, JSON, and XML files use plaintext, a common format, and are compatible with a wide range of applications.

The Web can be mined for data using a web scraping application.

The IoT uses sensors to create data.

Sensors in smartphones, cars, airplanes, street lamps, and home appliances capture raw data.
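
To illustrate the point about CSV, JSON, and XML being plaintext formats, the sketch below writes one record in all three formats using only the Python standard library. The record itself (a street lamp light reading) and the helper names are made up for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sensor reading, used only for illustration.
record = {"sensor": "street_lamp_12", "lux": 340, "timestamp": "2024-01-01T18:00:00"}

def to_csv(rec):
    """Serialize one record as a CSV header line plus one data line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buffer.getvalue()

def to_json(rec):
    """Serialize one record as a JSON object."""
    return json.dumps(rec)

def to_xml(rec):
    """Serialize one record as an XML element with one child per field."""
    root = ET.Element("reading")
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

csv_text = to_csv(record)
json_text = to_json(record)
xml_text = to_xml(record)

print(csv_text)
print(json_text)
print(xml_text)
```

All three outputs are ordinary text, which is why they can be opened in a text editor and read by many different applications.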

## Open Data and Private Data
1. Open Data
The Open Knowledge Foundation describes Open Data as “any content, information or data that people are free to use, reuse, and redistribute without any legal, technological, or social restriction.”
2. Private Data
Data related to an expectation of privacy and regulated by a particular country/government.

## Structured and Unstructured Data
1. Structured Data
Data entered and maintained in fixed fields within a file or record. It is easily entered, classified, queried, and analyzed. Examples: relational databases and spreadsheets.
2. Unstructured Data
Raw data that lacks organization. Examples: photo contents, audio, video, web pages, blogs, books, journals, white papers, PowerPoint presentations, articles, email, wikis, word processing documents, and text in general.


## Example of Gathering Image Data using Webcam
Note: Run this snippet in a local Jupyter notebook.

In [1]:
import cv2

# Open the default webcam (device index 0)
webcam = cv2.VideoCapture(0)
while True:
    try:
        check, frame = webcam.read()
        print(check)  # prints True as long as the webcam is returning frames
        print(frame)  # prints the matrix of pixel values for each frame
        cv2.imshow("Capturing", frame)
        key = cv2.waitKey(1)
        if key == ord('s'):
            cv2.imwrite(filename='saved_img.jpg', img=frame)
            webcam.release()
            img_new = cv2.imread('saved_img.jpg', cv2.IMREAD_GRAYSCALE)
            cv2.imshow("Captured Image", img_new)  # imshow returns None, so nothing is assigned
            cv2.waitKey(1650)
            cv2.destroyAllWindows()
            print("Processing image...")
            img_ = cv2.imread('saved_img.jpg', cv2.IMREAD_ANYCOLOR)
            print("Converting RGB image to grayscale...")
            gray = cv2.cvtColor(img_, cv2.COLOR_BGR2GRAY)
            print("Converted RGB image to grayscale...")
            print("Resizing image to 28x28 scale...")
            img_ = cv2.resize(gray,(28,28))
            print("Resized...")
            img_resized = cv2.imwrite(filename='saved_img-final.jpg', img=img_)
            print("Image saved!")

            break
        elif key == ord('q'):
            print("Turning off camera.")
            webcam.release()
            print("Camera off.")
            print("Program ended.")
            cv2.destroyAllWindows()
            break

    except(KeyboardInterrupt):
        print("Turning off camera.")
        webcam.release()
        print("Camera off.")
        print("Program ended.")
        cv2.destroyAllWindows()
        break


ModuleNotFoundError: No module named 'cv2'

## Example of Gathering Voice Data using Microphone
Note: Run these code snippets in a local Jupyter notebook.

In [2]:
!pip3 install sounddevice

Defaulting to user installation because normal site-packages is not writeable
Collecting sounddevice
  Downloading sounddevice-0.5.5-py3-none-win_amd64.whl.metadata (1.4 kB)
Downloading sounddevice-0.5.5-py3-none-win_amd64.whl (365 kB)
Installing collected packages: sounddevice
Successfully installed sounddevice-0.5.5


In [3]:
!pip3 install wavio

Defaulting to user installation because normal site-packages is not writeable
Collecting wavio
  Downloading wavio-0.0.9-py3-none-any.whl.metadata (5.7 kB)
Downloading wavio-0.0.9-py3-none-any.whl (9.5 kB)
Installing collected packages: wavio
Successfully installed wavio-0.0.9


In [4]:
!pip3 install scipy

Defaulting to user installation because normal site-packages is not writeable
Collecting scipy
  Downloading scipy-1.17.1-cp313-cp313-win_amd64.whl.metadata (60 kB)
Downloading scipy-1.17.1-cp313-cp313-win_amd64.whl (36.5 MB)

In [5]:
!apt-get install libportaudio2

'apt-get' is not recognized as an internal or external command,
operable program or batch file.


In [40]:
# import required libraries
import sounddevice as sd
from scipy.io.wavfile import write
import wavio as wv

# Sampling frequency
freq = 44100

# Recording duration
duration = 5

# Start recorder with the given values
# of duration and sample frequency
recording = sd.rec(int(duration * freq),
                   samplerate=freq, channels=2)

# Block until the recording is finished
sd.wait()

# This will convert the NumPy array to an audio
# file with the given sampling frequency
write("recording0.wav", freq, recording)
# Convert the NumPy array to audio file
wv.write("recording1.wav", recording, freq, sampwidth=2)

PortAudioError: Error querying device -1
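
The error above suggests that no input device could be queried on this machine. As a hardware-free sketch of the same idea, the example below generates a synthetic tone instead of recording one, to show how the sampling frequency and duration determine the number of samples and how the samples end up in a WAV file. It uses the standard library `wave` module in place of `scipy` and `wavio`; the function name, the 440 Hz tone, and the output file name are assumptions made for illustration.

```python
import math
import os
import struct
import tempfile
import wave

def write_tone(path, freq=44100, duration=1, tone_hz=440.0):
    """Write a mono 16-bit sine tone and return the number of samples written."""
    n_samples = int(duration * freq)  # same formula as in the recording cell
    frames = bytearray()
    for i in range(n_samples):
        # Half-amplitude sine wave scaled to the signed 16-bit range
        sample = int(32767 * 0.5 * math.sin(2 * math.pi * tone_hz * i / freq))
        frames += struct.pack("<h", sample)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample, like sampwidth=2 above
        wav_file.setframerate(freq)
        wav_file.writeframes(bytes(frames))
    return n_samples

path = os.path.join(tempfile.gettempdir(), "tone_example.wav")
written = write_tone(path)

# Read the file back to confirm its length and sampling frequency
with wave.open(path, "rb") as wav_file:
    n_frames = wav_file.getnframes()
    rate = wav_file.getframerate()

print(written, n_frames, rate)
```

A one-second clip at 44100 Hz contains 44100 samples per channel, so a five-second stereo recording like the one above would hold 220500 samples in each of its two channels.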

## Web Scraping
**Web scraping**, also called **web harvesting** or **web data extraction**, is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol (HTTP) or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

## Image Scraping using BeautifulSoup and Requests

In [None]:
%pip install bs4

In [None]:
%pip install requests

In [41]:
import requests
from bs4 import BeautifulSoup

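def getdata(url):
    # Download the page and return its HTML as text
    r = requests.get(url)
    return r.text

htmldata = getdata("https://www.google.com/")
soup = BeautifulSoup(htmldata, 'html.parser')
# Print the source of every <img> tag; .get() avoids a KeyError on tags without src
for item in soup.find_all('img'):
    src = item.get('src')
    if src:
        print(src)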

/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png
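
The path printed above is relative to the site root, so it cannot be downloaded as it stands. A minimal sketch of one way to turn such `src` values into full URLs, using `urllib.parse.urljoin` from the standard library, is shown below. The helper name `absolute_image_urls` is hypothetical.

```python
from urllib.parse import urljoin

def absolute_image_urls(page_url, src_values):
    """Resolve each src value against the page URL, skipping empty values."""
    return [urljoin(page_url, src) for src in src_values if src]

page_url = "https://www.google.com/"
srcs = [
    "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png",
    None,  # an <img> tag without a src attribute
    "https://example.com/already_absolute.png",
]
urls = absolute_image_urls(page_url, srcs)
for url in urls:
    print(url)
```

Relative paths are joined to the page address, while values that are already absolute URLs pass through unchanged.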


In [None]:
!pip install selenium

## Image Scraping using Selenium
Note: Run this code snippet in a local Jupyter notebook.

In [42]:
#!pip install selenium
#!apt-get update # to update ubuntu to correctly run apt install
#!apt install chromium-chromedriver
#!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import requests
import shutil
import os
import getpass
import urllib.request
import io
from PIL import Image
user = getpass.getuser()
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Selenium 4 accepts the options through the `options` keyword
driver = webdriver.Chrome(options=chrome_options)
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
driver.get(search_url.format(q='Car'))
def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)#sleep_between_interactions
def getImageUrls(name,totalImgs,driver):

    search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
    driver.get(search_url.format(q=name))
    img_urls = set()
    img_count = 0
    results_start = 0

    while(img_count<totalImgs): #Extract actual images now

        scroll_to_end(driver)

        # Selenium 4 uses find_elements(By..., ...) in place of find_elements_by_*.
        # The class names below are tied to one version of the results page.
        thumbnail_results = driver.find_elements(By.XPATH, "//img[contains(@class,'Q4LuWd')]")
        totalResults=len(thumbnail_results)
        if totalResults <= results_start:
            # No new thumbnails were loaded, so stop instead of looping forever
            print("No more search results found.")
            break
        print(f"Found: {totalResults} search results. Extracting links from {results_start}:{totalResults}")

        for img in thumbnail_results[results_start:totalResults]:

            img.click()
            time.sleep(2)
            actual_images = driver.find_elements(By.CSS_SELECTOR, 'img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                    img_urls.add(actual_image.get_attribute('src'))

            img_count=len(img_urls)

            if img_count >= totalImgs:
                print(f"Found: {img_count} image links")
                break
            else:
                print("Found:", img_count, "looking for more image links ...")
                # Click the "load more" button only if it is present on the page
                driver.execute_script("const b = document.querySelector('.mye4qd'); if (b) b.click();")
                results_start = len(thumbnail_results)
    return img_urls
def downloadImages(folder_path,file_name,url):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        return  # nothing to save if the download failed
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')

        file_path = os.path.join(folder_path, file_name)

        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")

def saveInDestFolder(searchNames,destDir,totalImgs,driver):
    for name in list(searchNames):
        path=os.path.join(destDir,name)
        if not os.path.isdir(path):
            os.makedirs(path)  # also creates missing parent folders
        print('Current Path',path)
        totalLinks=getImageUrls(name,totalImgs,driver)
        print('totalLinks',totalLinks)
        # Download inside the loop so that every search name is processed
        if not totalLinks:
            print('images not found for :',name)
        else:
            for i, link in enumerate(totalLinks):
                file_name = f"{i:03d}.jpg"  # zero-padded index, e.g. 000.jpg
                downloadImages(path,file_name,link)

searchNames=['cat']
destDir=f'/content/drive/My Drive/Colab Notebooks/Dataset/'
totalImgs=5
saveInDestFolder(searchNames,destDir,totalImgs,driver)


ModuleNotFoundError: No module named 'PIL'

## Web Scraping of Movies Information using BeautifulSoup
We want to analyze the distributions of IMDb and Metacritic movie ratings to see if we find anything interesting. To do this, weʼll first scrape data for up to 1,000 movies (20 result pages of 50 titles each).




In [43]:
from requests import get
url = 'https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
print(response.text[:500])




In [44]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
headers = {'Accept-Language': 'en-US,en;q=0.8'}
type(html_soup)

bs4.BeautifulSoup

In [45]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
0
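
A count of 0 means that no element matched the selector, so every later cell that indexes `movie_containers` is bound to fail. One plausible reason (an assumption, not verified here) is that the page no longer uses these class names or that the request was not answered with the expected page. The sketch below shows a defensive pattern on a small, made-up HTML sample in the old `lister-item` layout: check that containers were found before indexing, and treat each field as optional. The helper names `text_or_none` and `parse_movies` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in the old "lister-item" layout; the second movie has no Metascore.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3><a>Movie A</a> <span class="lister-item-year">(2017)</span></h3>
  <strong>8.1</strong>
  <span class="metascore favorable">77</span>
</div>
<div class="lister-item mode-advanced">
  <h3><a>Movie B</a> <span class="lister-item-year">(I) (2017)</span></h3>
  <strong>6.4</strong>
</div>
"""

def text_or_none(parent, name, **filters):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, **filters)
    return tag.get_text(strip=True) if tag is not None else None

def parse_movies(html):
    """Return one dict per movie container, or an empty list if none matched."""
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div", class_="lister-item mode-advanced")
    if not containers:
        # Nothing matched: the selector is stale or the response is not the expected page
        print("No movie containers found; check the status code and the page's current HTML.")
        return []
    return [
        {
            "name": text_or_none(c, "a"),
            "year": text_or_none(c, "span", class_="lister-item-year"),
            "imdb": text_or_none(c, "strong"),
            "metascore": text_or_none(c, "span", class_="metascore"),
        }
        for c in containers
    ]

movies = parse_movies(SAMPLE_HTML)
empty = parse_movies("<html><body>Access denied</body></html>")
print(movies)
print(empty)
```

With this pattern an empty result produces one clear message instead of a chain of `IndexError` and `AttributeError` tracebacks, and a movie without a Metascore yields `None` rather than an exception.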


In [54]:
first_movie = movie_containers[0]
first_movie


IndexError: list index out of range

In [55]:
first_movie.div

AttributeError: ResultSet object has no attribute "div". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [56]:
first_movie.a

AttributeError: ResultSet object has no attribute "a". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [57]:
first_movie.h3

AttributeError: ResultSet object has no attribute "h3". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [58]:
first_movie.h3.a

AttributeError: ResultSet object has no attribute "h3". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [59]:
first_name = first_movie.h3.a.text
first_name

AttributeError: ResultSet object has no attribute "h3". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [60]:
first_year = first_movie.h3.find('span', class_ = 'lister-item-year text-muted unbold')
first_year

AttributeError: ResultSet object has no attribute "h3". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [61]:
first_year = first_year.text
first_year

NameError: name 'first_year' is not defined

In [62]:
first_movie.strong

AttributeError: ResultSet object has no attribute "strong". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [63]:
first_imdb = float(first_movie.strong.text)
first_imdb

AttributeError: ResultSet object has no attribute "strong". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [64]:
first_mscore = first_movie.find('span', class_ = 'metascore favorable')
first_mscore = int(first_mscore.text)
print(first_mscore)

AttributeError: ResultSet object has no attribute "find". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [65]:
first_votes = first_movie.find('span', attrs = {'name':'nv'})
first_votes

AttributeError: ResultSet object has no attribute "find". You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [66]:
first_votes['data-value']

NameError: name 'first_votes' is not defined

In [67]:
first_votes = int(first_votes['data-value'])

NameError: name 'first_votes' is not defined

In [68]:
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
# Extract data from individual movie container
for container in movie_containers:
    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))
        # The number of votes
        vote = container.find('span', attrs = {'name':'nv'})['data-value']
        votes.append(int(vote))

In [69]:
import pandas as pd
test_df = pd.DataFrame({'movie': names,
'year': years,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes
})
print(test_df.info())
test_df


<class 'pandas.DataFrame'>
RangeIndex: 0 entries
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      0 non-null      float64
 1   year       0 non-null      float64
 2   imdb       0 non-null      float64
 3   metascore  0 non-null      float64
 4   votes      0 non-null      float64
dtypes: float64(5)
memory usage: 132.0 bytes
None


Unnamed: 0,movie,year,imdb,metascore,votes


In [70]:
from time import time
from time import sleep
from random import randint
from warnings import warn
from IPython.display import clear_output
pages = [ '1','2','3','4','5']
years_url = [ '2017', '2018', '2019', '2020']
# Redeclaring the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
# Preparing the monitoring of the loop
start_time = time()
request_count = 0  # named so that it does not shadow the `requests` module
max_requests = len(pages) * len(years_url)
# For every year in the interval 2017-2020
for year_url in years_url:
    # For every page in the interval 1-5
    for page in pages:
        # Make a get request
        response = get('https://www.imdb.com/search/title?release_date=' + year_url +
        '&sort=num_votes,desc&page=' + page, headers = headers)
        # Pause the loop
        sleep(randint(8,15))
        # Monitor the requests
        request_count += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(request_count, request_count/elapsed_time))
        clear_output(wait = True)
        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request_count, response.status_code))
        # Break the loop if the number of requests is greater than expected
        if request_count > max_requests:
            warn('Number of requests was greater than expected.')
            break
        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')
        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')
        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:
                # Scrape the name
                name = container.h3.a.text
                names.append(name)
                # Scrape the year
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)
                # Scrape the IMDB rating
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)
                # Scrape the Metascore
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))
                # Scrape the number of votes
                vote = container.find('span', attrs = {'name':'nv'})['data-value']
                votes.append(int(vote))

In [71]:
movie_ratings = pd.DataFrame({'movie': names,
'year': years,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes
})
print(movie_ratings.info())
movie_ratings.head(10)

<class 'pandas.DataFrame'>
RangeIndex: 0 entries
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      0 non-null      float64
 1   year       0 non-null      float64
 2   imdb       0 non-null      float64
 3   metascore  0 non-null      float64
 4   votes      0 non-null      float64
dtypes: float64(5)
memory usage: 132.0 bytes
None


Unnamed: 0,movie,year,imdb,metascore,votes


In [72]:
movie_ratings.tail(10)

Unnamed: 0,movie,year,imdb,metascore,votes


## Data Preparation
- Collected data may not be compatible or formatted correctly
- Data must be prepared before it can be added to a data set
- Extract, Transform and Load (ETL): a process for collecting data from a variety of sources, transforming the data, and then loading the data into a database
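
The three ETL steps can be sketched with the standard library alone. In the example below, the raw CSV text, the column names, and the `movies` table are all made up for illustration; an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import re
import sqlite3

# Hypothetical raw export; the header names and values are made up for illustration.
RAW_CSV = """movie,year,votes
Movie A,(2017),"1,200"
Movie B,(I) (2018),850
"""

def extract(text):
    """Extract: read the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert year and votes from messy strings to integers."""
    cleaned = []
    for row in rows:
        year = int(re.search(r"\d{4}", row["year"]).group())
        votes = int(row["votes"].replace(",", ""))
        cleaned.append((row["movie"], year, votes))
    return cleaned

def load(records, connection):
    """Load: insert the cleaned records into a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS movies (movie TEXT, year INTEGER, votes INTEGER)"
    )
    connection.executemany("INSERT INTO movies VALUES (?, ?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
rows = connection.execute(
    "SELECT movie, year, votes FROM movies ORDER BY year"
).fetchall()
print(rows)
```

Keeping the three steps in separate functions makes it easy to swap the source (for example, a scraped page instead of CSV text) without touching the transformation or the loading code.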

**Data preprocessing**
Data preprocessing is the process of cleaning raw data: data collected in the real world is converted into a clean data set. Whenever data is gathered from different sources, it arrives in a raw format that is not suitable for analysis. The steps carried out to convert it into a clean data set are called data preprocessing.

Most real-world data is messy. Common types of messy data are:
1. **Missing data**: Data can be missing when it is not created continuously or because of technical issues in the application (for example, an IoT system).
2. **Noisy data**: This type of data, which includes outliers, can result from human error (people gathering the data manually) or from a technical problem with the device at the time of collection.
3. **Inconsistent data**: This type of data can result from human error (mistakes in names or values) or from duplication of data.

These are some of the basic preprocessing techniques that can be used to convert raw data:
1. **Conversion of data**: Most machine learning models can only handle numeric features, so categorical and ordinal data must be converted into numeric features.
2. **Ignoring the missing values**: When we encounter missing data in the data set, we can remove the affected row or column, depending on our need. This method is efficient, but it should not be used if the data set has many missing values.
3. **Filling the missing values**: When we encounter missing data in the data set, we can fill it in manually; most commonly the mean, median, or most frequent value (mode) is used.
4. **Machine learning**: If some data is missing, we can use the existing data to predict the value that belongs in the empty position.
5. **Outlier detection**: Some erroneous values deviate drastically from the other observations in a data set. Example: a human weight of 800 kg, caused by mistyping an extra 0.
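
Several of these techniques can be tried on a tiny table with pandas. The sketch below uses made-up records (one missing weight and one mistyped weight of 800 kg) and covers conversion, dropping, filling, and outlier detection; predicting missing values with a model is left out. The interquartile range (IQR) rule used for outliers is one common choice, picked here only for illustration.

```python
import pandas as pd

# Hypothetical records: one missing weight, one mistyped weight, one categorical column.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dan", "Eve"],
    "sex": ["F", "M", "F", "M", "F"],
    "weight_kg": [58.0, 80.0, None, 800.0, 62.0],
})

# Conversion of data: one-hot encode the categorical column into numeric columns.
encoded = pd.get_dummies(df, columns=["sex"])

# Ignoring the missing values: drop every row that contains a missing value.
dropped = df.dropna()

# Filling the missing values: use the median, which is less sensitive to outliers than the mean.
filled = df.assign(weight_kg=df["weight_kg"].fillna(df["weight_kg"].median()))

# Outlier detection: flag values more than 1.5 * IQR outside the middle 50% of the data.
q1, q3 = df["weight_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["weight_kg"] < q1 - 1.5 * iqr) | (df["weight_kg"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(encoded.columns.tolist())
print(len(dropped))
print(filled["weight_kg"].tolist())
print(outliers["name"].tolist())
```

In practice, outliers are usually examined or removed before the fill value is computed, so that a mistyped entry does not distort it.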

## Example of Data Preparation of movie_rating.csv

In [73]:
movie_ratings['year'].unique()

array([], dtype=float64)

In [74]:
movie_ratings.dtypes

movie        float64
year         float64
imdb         float64
metascore    float64
votes        float64
dtype: object

In [75]:
movie_ratings['year'] = (movie_ratings.year.apply(lambda x:x.replace('(I)','')))

In [76]:
movie_ratings['year'].unique()

array([], dtype=float64)

In [77]:
movie_ratings['year'] = (movie_ratings.year.apply(lambda x:x.replace('(II)','')))

In [78]:
movie_ratings['year'] = (movie_ratings.year.apply(lambda x:x.replace('(III)','')))

In [79]:
movie_ratings['year'].unique()

array([], dtype=float64)

In [80]:
movie_ratings['year'] = (movie_ratings.year.apply(lambda x:x.replace('(','')))

In [81]:
movie_ratings['year'].unique()

array([], dtype=float64)

In [82]:
movie_ratings['year'] = (movie_ratings.year.apply(lambda x:x.replace(')','')))

In [83]:
movie_ratings['year'].unique()

array([], dtype=float64)

In [84]:
movie_ratings['year'] = movie_ratings['year'].astype(int)

In [85]:
movie_ratings['year'].unique()

array([], dtype=int64)

In [86]:
movie_ratings.dtypes

movie        float64
year           int64
imdb         float64
metascore    float64
votes        float64
dtype: object

In [87]:
movie_ratings.head(10)

Unnamed: 0,movie,year,imdb,metascore,votes


In [88]:
movie_ratings.tail(10)

Unnamed: 0,movie,year,imdb,metascore,votes
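
The chain of `replace` calls above removes the markers `(I)`, `(II)`, `(III)` and the parentheses one step at a time. A single regular expression can do the same job in one pass. The sketch below runs on made-up sample strings, because the scraped DataFrame in this run is empty; the pattern and the `parse_year` helper are assumptions based on the markers handled above.

```python
import re

# A four-digit number that directly follows an opening parenthesis
YEAR_PATTERN = re.compile(r"\((\d{4})")

def parse_year(raw):
    """Return the year from strings such as '(2017)' or '(I) (2017)' as an int."""
    match = YEAR_PATTERN.search(raw)
    if match is None:
        raise ValueError(f"no year found in {raw!r}")
    return int(match.group(1))

samples = ["(2017)", "(I) (2018)", "(III) (2019)"]
parsed_years = [parse_year(s) for s in samples]
print(parsed_years)  # [2017, 2018, 2019]
```

With a populated DataFrame, `movie_ratings['year'].apply(parse_year)` would apply the same conversion to the whole column and return integers directly.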


# Web Scraping on a Chosen Website


In [2]:
import requests
from bs4 import BeautifulSoup

def getData(url):
    r = requests.get(url)
    print(r.status_code)
    return r.text

url = "https://www.google.com"
html = getData(url)

200


In [3]:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from webdriver_manager.firefox import GeckoDriverManager
import time

options = Options()
options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"

service = Service(GeckoDriverManager().install())

driver = webdriver.Firefox(service=service, options=options)

driver.get("https://myanimelist.net/topanime.php?limit=0")
time.sleep(5)

print("Page Title:", driver.title)

Page Title: Top Anime - MyAnimeList.net


In [36]:
url_base = "https://myanimelist.net/topanime.php?limit={}"

titles = []
ratings = []
eps = []
air_dates = []
members = []


for pages in range(0, 150, 50):
    url = url_base.format(pages)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    datas = soup.find_all("tr", class_='ranking-list')
    for data in datas:
        title_tag = data.find("h3",class_="fl-l fs14 fw-b anime_ranking_h3")
        title = title_tag.text.strip()
        rating_loc = data.find("td", class_="score ac fs14")
        rating = float(rating_loc.find("span", class_="score-label").text.strip())
        detail = data.find("div", class_="information di-ib mt4").text.strip()
        parts = detail.split("\n")
        anime_type_ep = parts[0]       
        air_date = parts[1]            
        member = parts[2]
        member = member.replace(" members", "").replace(",", "")

        eps.append(anime_type_ep)
        air_dates.append(air_date)
        members.append(member)
        titles.append(title)
        ratings.append(rating)
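
The loop above stores the type and episode count as one string and keeps the member count as text. A possible refinement is to parse the information block into typed fields before building the DataFrame. The sketch below works on sample text whose layout (three lines: type and episodes, air dates, member count) is inferred from the code and output in this notebook and may differ from the live page; `parse_information` is a hypothetical helper.

```python
import re

# "TV (28 eps)" -> kind "TV", count "28"
EPISODES_PATTERN = re.compile(r"^(?P<kind>.+?)\s*\((?P<count>\d+) eps\)$")

def parse_information(detail):
    """Split the three-line information block into typed fields."""
    lines = [line.strip() for line in detail.splitlines() if line.strip()]
    type_and_eps, air_date, member_text = lines[:3]
    match = EPISODES_PATTERN.match(type_and_eps)
    # Fall back to the raw text if the episode count is not a plain number
    kind = match.group("kind") if match else type_and_eps
    episodes = int(match.group("count")) if match else None
    members = int(member_text.replace(" members", "").replace(",", ""))
    return {"type": kind, "episodes": episodes, "air_date": air_date, "members": members}

# Sample text in the assumed layout, including the leading whitespace
sample_tv = "TV (28 eps)\n        Sep 2023 - Mar 2024\n        1,347,210 members"
sample_movie = "Movie (1 eps)\n        Sep 2025 - Sep 2025\n        408,735 members"

info_tv = parse_information(sample_tv)
info_movie = parse_information(sample_movie)
print(info_tv)
print(info_movie)
```

Stripping each line also removes the leading spaces that otherwise end up in the air date and member columns of the exported CSV.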

In [39]:
print(titles)
print(ratings)
print(eps)
print(air_dates)
print(members)
print(len(titles))

['Sousou no Frieren', 'Sousou no Frieren 2nd Season', 'Fullmetal Alchemist: Brotherhood', 'Chainsaw Man Movie: Reze-hen', 'Steins;Gate', 'Shingeki no Kyojin Season 3 Part 2', 'Gintama: The Final', 'Gintama°', 'Hunter x Hunter (2011)', 'Ginga Eiyuu Densetsu', "Gintama'", "Gintama': Enchousen", 'One Piece Fan Letter', 'Bleach: Sennen Kessen-hen', 'Gintama.', 'Kaguya-sama wa Kokurasetai: Ultra Romantic', 'Fruits Basket: The Final', 'Clannad: After Story', 'Gintama', 'Koe no Katachi', 'Kusuriya no Hitorigoto 2nd Season', 'Code Geass: Hangyaku no Lelouch R2', '3-gatsu no Lion 2nd Season', 'Gintama Movie 2: Kanketsu-hen - Yorozuya yo Eien Nare', 'Monster', 'Gintama. Shirogane no Tamashii-hen - Kouhan-sen', 'Owarimonogatari 2nd Season', 'Shingeki no Kyojin: The Final Season - Kanketsu-hen', 'Kusuriya no Hitorigoto', 'Kingdom 3rd Season', 'Violet Evergarden Movie', 'Shingeki no Kyojin Movie: Kanketsu-hen - The Last Attack', 'Kimi no Na wa.', 'Vinland Saga Season 2', 'Gintama. Shirogane no Tama

In [38]:
import pandas as pd
import csv
data_dict = {
    "titles": titles,
    "Episodes": eps,
    "Air dates": air_dates,
    "Members": members,
    "ratings": ratings    
}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,titles,Episodes,Air dates,Members,ratings
0,Sousou no Frieren,TV (28 eps),Sep 2023 - Mar 2024,1347210,9.28
1,Sousou no Frieren 2nd Season,TV (10 eps),Jan 2026 -,404202,9.18
2,Fullmetal Alchemist: Brotherhood,TV (64 eps),Apr 2009 - Jul 2010,3642367,9.11
3,Chainsaw Man Movie: Reze-hen,Movie (1 eps),Sep 2025 - Sep 2025,408735,9.10
4,Steins;Gate,TV (24 eps),Apr 2011 - Sep 2011,2784233,9.07
...,...,...,...,...,...
145,Zoku Natsume Yuujinchou,TV (13 eps),Jan 2009 - Mar 2009,265390,8.53
146,Fruits Basket 2nd Season,TV (25 eps),Apr 2020 - Sep 2020,578093,8.53
147,Bakuman. 3rd Season,TV (25 eps),Oct 2012 - Mar 2013,346704,8.52
148,Dr. Stone: Science Future Part 2,TV (12 eps),Jul 2025 - Sep 2025,205834,8.52


In [89]:
df.to_csv()

',titles,Episodes,Air dates,Members,ratings\r\n0,Sousou no Frieren,TV (28 eps),        Sep 2023 - Mar 2024,        1347210,9.28\r\n1,Sousou no Frieren 2nd Season,TV (10 eps),        Jan 2026 - ,        404202,9.18\r\n2,Fullmetal Alchemist: Brotherhood,TV (64 eps),        Apr 2009 - Jul 2010,        3642367,9.11\r\n3,Chainsaw Man Movie: Reze-hen,Movie (1 eps),        Sep 2025 - Sep 2025,        408735,9.1\r\n4,Steins;Gate,TV (24 eps),        Apr 2011 - Sep 2011,        2784233,9.07\r\n5,Shingeki no Kyojin Season 3 Part 2,TV (10 eps),        Apr 2019 - Jul 2019,        2556918,9.05\r\n6,Gintama: The Final,Movie (1 eps),        Jan 2021 - Jan 2021,        180537,9.05\r\n7,Gintama°,TV (51 eps),        Apr 2015 - Mar 2016,        685906,9.05\r\n8,Hunter x Hunter (2011),TV (148 eps),        Oct 2011 - Sep 2014,        3148284,9.03\r\n9,Ginga Eiyuu Densetsu,OVA (110 eps),        Jan 1988 - Mar 1997,        358500,9.02\r\n10,Gintama\',TV (51 eps),        Apr 2011 - Mar 2012,        607857,9.02

## Conclusion
In this activity, I followed the procedure and learned that Python can gather data from different sources such as websites, webcams, and microphones. Web scraping essentially means using Python to go through a selected website and collect the needed data with loops. It also taught me to look at a website through another lens, the browser's Inspect Element tool. This matters because getting the needed data requires understanding the structure of the website and its elements well enough to narrow down the search. When scraping my chosen website, I had a hard time getting the data I was looking for, either because I could not pin down the class of each element or because I picked a selector that narrowed the search area so much that other data was cut off.

Overall, this activity made me realize that it is important to browse through the elements of the website myself, rather than relying only on "Pick an element from the page", in order to understand its structure and what is required to get the data needed.
