## Web Scraping using BeautifulSoup

### Data Gathering

#### Sources of Data
A vast amount of historical data can be found in files such as:
- MS Word documents
- Emails
- Spreadsheets
- MS PowerPoints
- PDFs
- HTML
- Plaintext files

Public and Private Archives:
- CSV, JSON, and XML files are stored as plaintext, a common format that a wide range of applications can read.
- The Web can be mined for data using a web scraping application.
- The IoT uses sensors to create data.
- Sensors in smartphones, cars, airplanes, street lamps, and home appliances capture raw data.
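
Because CSV, JSON, and XML are all plaintext, the same record can be read from any of them with Python's standard library. The following is a minimal sketch; the sensor name and temperature value are hypothetical and chosen only for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical sensor reading expressed in three plaintext formats
csv_text = "sensor,temperature\nlamp-1,21.5\n"
json_text = '{"sensor": "lamp-1", "temperature": 21.5}'
xml_text = "<reading><sensor>lamp-1</sensor><temperature>21.5</temperature></reading>"

def read_csv_record(text):
    """Return the first data row of a CSV string as a dict."""
    row = next(csv.DictReader(io.StringIO(text)))
    return {"sensor": row["sensor"], "temperature": float(row["temperature"])}

def read_json_record(text):
    """Parse a JSON object into a dict."""
    return json.loads(text)

def read_xml_record(text):
    """Read the two child elements of a small XML document into a dict."""
    root = ET.fromstring(text)
    return {
        "sensor": root.findtext("sensor"),
        "temperature": float(root.findtext("temperature")),
    }

records = [
    read_csv_record(csv_text),
    read_json_record(json_text),
    read_xml_record(xml_text),
]
print(records)
```

All three parsers yield the same dictionary, which is what makes these formats convenient for exchanging data between applications.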

#### Open Data and Private Data
- **Open Data**: The Open Knowledge Foundation describes Open Data as “any content, information, or data that people are free to use, reuse, and redistribute without any legal, technological, or social restriction.”
- **Private Data**: Data that carries an expectation of privacy and whose use is regulated by the laws of a particular country or government.

### Structured and Unstructured Data

#### Structured Data
- Data entered and maintained in fixed fields within a file or record.
- Easily entered, classified, queried, and analyzed.
- Typically stored in relational databases or spreadsheets.

#### Unstructured Data
- Lacks a predefined organization, so it cannot be queried by field.
- Raw data: photo contents, audio, video, web pages, blogs, books, journals, white papers, PowerPoint presentations, articles, email, wikis, word processing documents, and text in general.
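
The practical difference can be sketched in a few lines. In this hypothetical example (names and ages are made up), the structured records can be filtered directly by field, while the same facts in free text have to be extracted with a pattern first.

```python
import re

# Structured: every record has the same named fields, so a query is a simple filter
structured = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 29},
]
over_30 = [row["name"] for row in structured if row["age"] > 30]

# Unstructured: the same facts are buried in free text and must be extracted first
unstructured = "Ada is 36 years old. Linus, who is 29, joined later."
ages = [int(n) for n in re.findall(r"\b(\d+)\b", unstructured)]

print(over_30)
print(ages)
```

Note that the extracted ages are no longer tied to a name; recovering that link from free text takes extra parsing work, which is why unstructured data usually needs preparation before analysis.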

### Example of Gathering Image Data Using Webcam

In [2]:
import cv2

webcam = cv2.VideoCapture(0)

try:
    while True:
        check, frame = webcam.read()
        if not check:
            # No frame was returned (camera missing, busy, or already released)
            print("Could not read a frame from the webcam.")
            break
        print(check)  # prints True as long as the webcam delivers frames
        print(frame)  # prints the pixel matrix (BGR values) of each frame
        cv2.imshow("Capturing", frame)
        key = cv2.waitKey(1) & 0xFF

        if key == ord('s'):
            cv2.imwrite('saved_img.jpg', frame)
            print("Processing image...")
            img_ = cv2.imread('saved_img.jpg', cv2.IMREAD_COLOR)
            print("Converting BGR image to grayscale...")
            gray = cv2.cvtColor(img_, cv2.COLOR_BGR2GRAY)
            print("Converted BGR image to grayscale...")
            print("Resizing image to 28x28 scale...")
            resized = cv2.resize(gray, (28, 28))
            # Output file name chosen here; the original cell did not write the resized image
            cv2.imwrite('saved_img_28x28.jpg', resized)
            print("Resized image saved!")
            cv2.imshow("Captured Image", gray)
            cv2.waitKey(1650)
            # Stop after saving. Reading again after webcam.release() returns an
            # empty frame, and cv2.imshow then fails with the assertion
            # "size.width>0 && size.height>0".
            break

        elif key == ord('q'):
            print("Turning off camera.")
            break

except KeyboardInterrupt:
    print("Turning off camera.")

finally:
    # Always release the camera and close the windows, however the loop ended
    webcam.release()
    cv2.destroyAllWindows()
    print("Camera off.")
    print("Program ended.")

True
[[[105 236 255]
  [ 80 206 231]
  [ 67 185 213]
  ...
  [205 255 251]
  [173 255 250]
  [154 255 247]]

 ...

 [[ 52 151 175]
  [ 53 152 176]
  [ 53 152 176]
  ...
  [ 58  91  73]
  [ 60  92  75]
  [ 61  93  76]]]
...


### Example of Gathering Voice Data Using Microphone

In [3]:
# Import required libraries
import sounddevice as sd
from scipy.io.wavfile import write
import wavio as wv

# Sampling frequency
freq = 44100

# Recording duration
duration = 5

# Start recorder with the given values of duration and sample frequency
recording = sd.rec(int(duration * freq), samplerate=freq, channels=2)

# Block until the recording has finished
sd.wait()

# Convert the NumPy array to an audio file with the given sampling frequency
write("recording0.wav", freq, recording)
wv.write("recording1.wav", recording, freq, sampwidth=2)
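
The relationship between sampling frequency, duration, and the number of samples can be checked without a microphone. The sketch below uses only the standard library: it synthesizes a 440 Hz tone as a stand-in for recorded input, writes it as a 16-bit WAV file, and reads the header back. The tone, the one-second duration, and the file name `tone_sketch.wav` are assumptions made for this illustration.

```python
import math
import os
import struct
import tempfile
import wave

freq = 44100      # samples per second, as in the recording cell
duration = 1      # seconds (shortened for the sketch)
n_samples = int(duration * freq)

# Synthesize a 440 Hz tone as 16-bit mono samples (stand-in for microphone input)
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / freq))
    for t in range(n_samples)
]

path = os.path.join(tempfile.gettempdir(), "tone_sketch.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)        # 2 bytes = 16 bits per sample
    wf.setframerate(freq)
    wf.writeframes(struct.pack(f"<{n_samples}h", *samples))

# Read the header back: frames divided by rate gives the duration in seconds
with wave.open(path, "rb") as wf:
    frames, rate = wf.getnframes(), wf.getframerate()

print(frames, rate, frames / rate)
```

The same arithmetic explains the `int(duration * freq)` argument passed to `sd.rec`: five seconds at 44100 Hz is 220500 frames per channel.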


### Web Scraping

Web scraping, web harvesting, or web data extraction is the process of extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. Web scraping can be done manually by a software user, but it typically refers to automated processes implemented using a bot or web crawler.

#### Image Scraping using BeautifulSoup and Requests

In [5]:
import requests
from bs4 import BeautifulSoup

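def getdata(url):
    # Fail early on network problems or HTTP error codes
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return r.text

htmldata = getdata("https://www.google.com/")
soup = BeautifulSoup(htmldata, 'html.parser')

# Print the source of every <img> tag, skipping tags without a src attribute
for item in soup.find_all('img'):
    src = item.get('src')
    if src:
        print(src)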

/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png
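
The path printed above is relative to the site root, so it cannot be downloaded as-is. A minimal sketch of resolving such paths against the page URL with the standard library's `urljoin` follows; the helper name `absolute_image_urls` and the `example.com` address are hypothetical.

```python
from urllib.parse import urljoin

def absolute_image_urls(page_url, srcs):
    """Resolve each src value against the URL of the page it was found on."""
    return [urljoin(page_url, src) for src in srcs if src]

srcs = [
    "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png",
    "https://example.com/already/absolute.png",
    None,  # an <img> tag without a src attribute
]
urls = absolute_image_urls("https://www.google.com/", srcs)
for url in urls:
    print(url)
```

Relative paths are prefixed with the page's scheme and host, while URLs that are already absolute pass through unchanged.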


#### Image Scraping using Selenium


In [6]:
# Import required libraries
import io
import os
import time

import requests
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Selenium 4 expects `options=`. The old `chrome_options=` keyword raises
# "TypeError: WebDriver.__init__() got an unexpected keyword argument 'chrome_options'".
driver = webdriver.Chrome(options=chrome_options)

def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # pause between interactions so new results can load

def getImageUrls(name, totalImgs, driver):
    search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
    driver.get(search_url.format(q=name))
    img_urls = set()
    results_start = 0

    while len(img_urls) < totalImgs:
        scroll_to_end(driver)
        # The class names used below depend on Google's page markup and may need updating
        thumbnail_results = driver.find_elements(By.XPATH, "//img[contains(@class,'Q4LuWd')]")
        totalResults = len(thumbnail_results)
        print(f"Found: {totalResults} search results. Extracting links from {results_start}:{totalResults}")
        if totalResults == results_start:
            print("No new thumbnails found; stopping.")
            break

        for img in thumbnail_results[results_start:totalResults]:
            try:
                img.click()
            except Exception:
                continue  # skip thumbnails that cannot be clicked
            time.sleep(2)
            # Clicking a thumbnail reveals the full-size image; collect its URL
            for actual_image in driver.find_elements(By.CSS_SELECTOR, 'img.n3VNCb'):
                src = actual_image.get_attribute('src')
                if src and src.startswith('https'):
                    img_urls.add(src)
            if len(img_urls) >= totalImgs:
                print(f"Found: {len(img_urls)} image links")
                break
        else:
            # Every thumbnail was processed without reaching totalImgs: load more results
            print("Found:", len(img_urls), "looking for more image links ...")
            driver.execute_script("var b = document.querySelector('.mye4qd'); if (b) b.click();")
        results_start = totalResults
    return img_urls

def downloadImages(folder_path, file_name, url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        image_content = response.content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        return  # nothing to save
    try:
        image = Image.open(io.BytesIO(image_content)).convert('RGB')
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")

def saveInDestFolder(searchNames, destDir, totalImgs, driver):
    for name in searchNames:
        path = os.path.join(destDir, name)
        os.makedirs(path, exist_ok=True)  # also creates destDir if it is missing
        print('Current Path', path)
        totalLinks = getImageUrls(name, totalImgs, driver)
        print('totalLinks', totalLinks)

        if not totalLinks:
            print('images not found for :', name)
        else:
            for i, link in enumerate(totalLinks):
                file_name = f"{i}.jpg"
                downloadImages(path, file_name, link)

searchNames = ['cat']
destDir = 'dataset'
totalImgs = 5
saveInDestFolder(searchNames, destDir, totalImgs, driver)
driver.quit()

#### Web Scraping of Movies Information Using BeautifulSoup

In [34]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the IMDB page
url = 'https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'

# Headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive"
}

# Making the request
response = requests.get(url, headers=headers)
print(response.text[:500])


<!DOCTYPE html><html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><script>window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
             


In [35]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)


bs4.BeautifulSoup

In [50]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the IMDB page
url = 'https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'

# Headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive"
}

# Making the request
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    html_soup = BeautifulSoup(response.text, 'html.parser')
    
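    # Extracting movie containers from the <ul> element
    movie_list = html_soup.find('ul', class_='ipc-metadata-list')
    # find() returns None if the page layout has changed, so fall back to an empty list
    movie_containers = movie_list.find_all('li', class_='ipc-metadata-list-summary-item') if movie_list else []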
    
    # Lists to store the scraped data
    names = []
    years = []
    durations = []
    ratings = []
    vote_counts = []
    metascores = []
    descriptions = []

    # Extract data from individual movie containers
    for container in movie_containers:
        name_tag = container.find('h3', class_='ipc-title__text')
        # Drop the leading rank ("3. Title"). Splitting only on the first period
        # keeps periods that belong to the title itself (e.g. "Vol. 2").
        # Always append a value so that all lists stay the same length.
        name = name_tag.text.split('.', 1)[-1].strip() if name_tag else 'N/A'
        names.append(name)

        # The first metadata item is the year, the second is the duration
        metadata = container.find_all('span', class_='dli-title-metadata-item')
        year = metadata[0].text.strip('()') if metadata else 'N/A'
        years.append(year)

        duration = metadata[1].text if len(metadata) > 1 else 'N/A'
        durations.append(duration)

        rating_span = container.find('span', class_='ipc-rating-star')
        if rating_span and 'aria-label' in rating_span.attrs:
            rating = rating_span['aria-label']
            # Third token of the label, digits only: "8.1" -> "81" -> 8.1
            # (assumes the rating is shown with exactly one decimal place)
            rating_value = ''.join(filter(str.isdigit, rating.split()[2]))
            ratings.append(float(rating_value) / 10)
        else:
            ratings.append('N/A')

        vote_tag = container.find('span', class_='ipc-rating-star--voteCount')
        vote = vote_tag.text if vote_tag else 'N/A'
        vote_counts.append(vote)

        metascore_tag = container.find('span', class_='metacritic-score-box')
        metascore = metascore_tag.text.strip() if metascore_tag else 'N/A'
        metascores.append(metascore)

        description_tag = container.find('div', class_='ipc-html-content-inner-div')
        description = description_tag.text.strip() if description_tag else 'N/A'
        descriptions.append(description)

    # Create a DataFrame
    movies_df = pd.DataFrame({
        'Name': names,
        'Year': years,
        'Duration': durations,
        'Rating': ratings,
        'Votes': vote_counts,
        'Metascore': metascores,
        'Description': descriptions
    })
    
    # Display the DataFrame
    print(movies_df.head())
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")


                             Name  Year Duration  Rating    Votes Metascore  \
0                           Logan  2017   2h 17m     8.1   (840K)        77   
1                  Thor: Ragnarok  2017   2h 10m     7.9   (822K)        74   
2  Guardians of the Galaxy Vol. 2  2017   2h 16m     7.6   (765K)        67   
3                         Dunkirk  2017   1h 46m     7.8   (746K)        94   
4          Spider-Man: Homecoming  2017   2h 13m     7.4   (726K)        73   

                                         Description  
0  In a future where mutants are nearly extinct, ...  
1  Imprisoned on the planet Sakaar, Thor must rac...  
2  The Guardians struggle to keep together as a t...  
3  Allied soldiers from Belgium, the British Comm...  
4  Peter Parker balances his life as an ordinary ...  


In [51]:
movies_df.to_csv('movieratings2017.csv', index=False)

### Data Preparation

Collected data may not be compatible or formatted correctly. Data must be prepared before it can be added to a dataset.

#### Extract, Transform and Load (ETL)
A process for collecting data from a variety of sources, transforming the data, and then loading the data into a database.
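
The three ETL stages can be sketched with the standard library alone. In this minimal example, an in-memory CSV string stands in for a source file and an in-memory SQLite database stands in for the target; the table name `movies` and the helper `to_minutes` are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a CSV source (an in-memory string stands in for a file)
raw_csv = "name,duration\nLogan,2h 17m\nDunkirk,1h 46m\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: convert a duration such as "2h 17m" into total minutes
def to_minutes(text):
    hours, minutes = text.split("h")
    return int(hours) * 60 + int(minutes.replace("m", "").strip())

transformed = [(r["name"], to_minutes(r["duration"])) for r in rows]

# Load: insert the transformed rows into a database table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO movies VALUES (?, ?)", transformed)
conn.commit()

loaded = conn.execute("SELECT name, minutes FROM movies ORDER BY minutes").fetchall()
print(loaded)
conn.close()
```

Once loaded, the data can be queried with SQL, which is the payoff of doing the transformation before loading.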

#### Data Preprocessing
- **Conversion of data**: Convert categorical and ordinal data into numeric features.
- **Ignoring missing values**: Remove the row or column that contains missing values.
- **Filling missing values**: Fill in missing data with the mean, median, or mode (the most frequent value).
- **Outlier detection**: Detect and handle erroneous data points that deviate drastically from other observations.
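
The four steps above can be sketched with pandas on a small, made-up table. The column names and values are hypothetical, and the interquartile-range (IQR) rule used for outliers is one common choice among several, not a method prescribed by this section.

```python
import pandas as pd

# Hypothetical toy data: one missing rating, one missing genre, one extreme runtime
df = pd.DataFrame({
    "genre": ["action", "drama", None, "action", "drama", "action"],
    "rating": [7.5, 8.0, 6.5, None, 7.0, 7.5],
    "minutes": [120, 110, 115, 125, 118, 900],
})

# Ignoring missing values: keep only the rows that are complete
complete_rows = df.dropna()

# Filling missing values: median for numeric data, most frequent value for categories
df["rating"] = df["rating"].fillna(df["rating"].median())
df["genre"] = df["genre"].fillna(df["genre"].mode()[0])

# Conversion of data: turn the categorical column into integer codes
df["genre_code"] = df["genre"].astype("category").cat.codes

# Outlier detection: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = df["minutes"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["minutes"] < q1 - 1.5 * iqr) | (df["minutes"] > q3 + 1.5 * iqr)]

print(df)
print(outliers)
```

Dropping rows is the simplest option but discards data (here, two of six rows), whereas filling keeps every row at the cost of introducing estimated values.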

#### Example of Data Preparation of movieratings2017.csv

In [83]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('movieratings2017.csv')

# Display the first few rows of the dataframe
print(df.head())

# Inspect the dataframe to understand its structure and identify any issues
print(df.info())
print(df.describe())

# Filter out non-movie entries (entries with 'TV-MA' as their duration);
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df = df[df['Duration'] != 'TV-MA'].copy()

# Data Cleaning and Transformation

# Convert 'Year' column to numeric, errors='coerce' will convert invalid parsing to NaN
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')

# Remove parentheses and replace 'K' with '000' in 'Votes' column;
# regex=False makes pandas treat '(' and ')' as literal characters
df['Votes'] = (
    df['Votes']
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('K', '000', regex=False)
    .astype(float)
)

# Function to convert duration to minutes
def convert_duration(duration):
    # Missing durations arrive as NaN because read_csv parses 'N/A' as a missing value
    if not isinstance(duration, str) or not duration.strip():
        return float('nan')
    if 'h' in duration:
        hours, minutes = duration.split('h')
        hours = int(hours.strip()) if hours.strip() else 0
        minutes = int(minutes.replace('m', '').strip()) if minutes.strip() else 0
        return hours * 60 + minutes
    return int(duration.replace('m', '').strip())

# Apply the function to the 'Duration' column
df['Duration'] = df['Duration'].apply(convert_duration)

# Convert 'Rating' column to numeric, errors='coerce' will convert invalid parsing to NaN
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Convert 'Metascore' column to numeric, errors='coerce' will convert invalid parsing to NaN
df['Metascore'] = pd.to_numeric(df['Metascore'], errors='coerce')

# Handle missing values
df.fillna({
    'Year': df['Year'].median(),
    'Votes': df['Votes'].median(),
    'Rating': df['Rating'].median(),
    'Metascore': df['Metascore'].median(),
}, inplace=True)

# Display the cleaned dataframe
print(df.head())
print(df.info())
print(df.describe())

# Save the cleaned dataframe to a new CSV file
df.to_csv('movieratings2017_clean.csv', index=False)


                             Name  Year Duration  Rating    Votes  Metascore  \
0                           Logan  2017   2h 17m     8.1   (840K)       77.0   
1                  Thor: Ragnarok  2017   2h 10m     7.9   (822K)       74.0   
2  Guardians of the Galaxy Vol. 2  2017   2h 16m     7.6   (765K)       67.0   
3                         Dunkirk  2017   1h 46m     7.8   (746K)       94.0   
4          Spider-Man: Homecoming  2017   2h 13m     7.4   (726K)       73.0   

                                         Description  
0  In a future where mutants are nearly extinct, ...  
1  Imprisoned on the planet Sakaar, Thor must rac...  
2  The Guardians struggle to keep together as a t...  
3  Allied soldiers from Belgium, the British Comm...  
4  Peter Parker balances his life as an ordinary ...  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 ...
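
Replacing `'K'` with `'000'` as a string works for whole-number counts such as `(840K)`, but it would turn a value like `1.5K` into `1.5000` and cannot handle an `M` suffix. Whether the scraped page ever shows such values is an assumption; if it does, a small parser along these lines may be more robust. The function name `parse_votes` is hypothetical.

```python
def parse_votes(text):
    """Convert strings such as "(840K)" or "(1.2M)" to a float count."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    cleaned = text.strip().strip("()").replace(",", "")
    # A trailing K or M scales the number; otherwise the text is a plain count
    if cleaned and cleaned[-1].upper() in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1].upper()]
    return float(cleaned)

for raw in ["(840K)", "(1.2M)", "(1.5K)", "(950)"]:
    print(raw, "->", parse_votes(raw))
```

With pandas, it could be applied to the non-missing entries of the column, for example via `df['Votes'].dropna().apply(parse_votes)`.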
