### Project outline
1. Introduction
2. Project motivation
3. Install the necessary libraries
4. Identify the website of interest, acquire the necessary info
5. Structure the data & EDA
6. System automation


### **1. Introduction**

**Python web scraping** allows you to collect and parse data from websites programmatically.

**Libraries** commonly used to fetch and manipulate HTML content:
1. urllib
2. Beautiful Soup 
3. MechanicalSoup

Web scraping is the automated extraction of large amounts of data from websites.
Data on web pages is usually unstructured; web scraping converts it into structured data, such as rows in a CSV file.
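
As a small illustration of turning unstructured markup into structured data, the sketch below parses a made-up HTML fragment into a list of dictionaries. It uses Python's built-in `html.parser` as a stand-in for Beautiful Soup so that it runs without any installs. The fragment, the class names `name` and `price`, and the `ProductParser` class are all invented for this example.

```python
from html.parser import HTMLParser

# Made-up HTML fragment; the class names are assumptions for this example
HTML = """
<ul>
  <li><span class="name">PS4 disc</span><span class="price">899.00</span></li>
  <li><span class="name">Controller</span><span class="price">4499.00</span></li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects one dict per <li>, keyed by the class of each <span>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # structured output
        self._field = None    # class of the <span> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.rows.append({})
        elif tag == "span":
            self._field = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        # Only keep text that sits inside a <span> within a list item
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

parser = ProductParser()
parser.feed(HTML)
print(parser.rows)
```

Each dictionary in `parser.rows` corresponds to one list item, which is the kind of record that can later be written to a CSV file.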

**Ways to scrape websites**
1. Online Services
2. APIs
3. Writing your own code

**Purpose**

To automate data collection and to receive an email alert when the price of a tracked product changes suddenly.

Alternatively, to get notified as soon as a product that rarely stays in stock for long, such as the PS5, becomes available.

### **2. Motivation and future plan**

A friend wanted to build this project around the PS5, but it was not available online, at least not on Amazon India. The main goal was to keep checking PS5 availability and order it the moment it came in stock. Since there was no PS5 listing on Amazon, she went with a PS4 disc instead.

The basics remain the same: with a small tweak to the code, this can be used as a PS5 scraper.


### **3. Install the necessary libraries**

1. **BeautifulSoup**

Used for web scraping by parsing HTML and XML documents.
Helps extract data from web pages, e.g. extracting headlines from a news website.

2. **Requests**

Used to send HTTP requests to web pages and fetch data.
Works well with **BeautifulSoup** for web scraping, e.g. downloading a web page's HTML content.

In [16]:
#First install the libraries
# Beautiful Soup 4 is published on PyPI as 'beautifulsoup4'.
# 'pip install BeautifulSoup' fetches the obsolete 3.2.2 release, whose setup.py fails with:
# "You're trying to run a very old release of Beautiful Soup under Python 3. This will not work."
!pip install beautifulsoup4
!pip install requests



In [17]:
# import libraries 

from bs4 import BeautifulSoup
import requests
import time
import datetime

import smtplib

In [18]:
current_time = datetime.datetime.now()
print("Current Time:", current_time)

Current Time: 2025-02-06 11:43:18.422345


### **4. Amazon, source PS4**

In [6]:
#connecting to a website

url = 'https://www.amazon.in/PS4-God-of-War/dp/B07YQ73Y8T/ref=sr_1_2?keywords=ps4%2Bgame&qid=1642854585&sr=8-2'
header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"}
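
The cell above only defines the URL and a browser-like `User-Agent` header; nothing is sent yet. As a rough sketch of how such a header is attached to a request, the example below uses `urllib` from the standard library (one of the libraries listed in the introduction) instead of `requests`. The URL and header value are placeholders, not the ones used in this project.

```python
from urllib.request import Request

# Placeholder values for illustration only
url = "https://example.com/some-product"
header = {"User-Agent": "Mozilla/5.0 (example)"}

# Build the request object; nothing is sent over the network here
req = Request(url, headers=header)

# urllib stores header names in capitalized form, hence "User-agent"
print(req.full_url)
print(req.get_header("User-agent"))

# To actually fetch the page you would call:
#   from urllib.request import urlopen
#   html = urlopen(req, timeout=10).read()
```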

### **5. Structure and explore the data**

In [7]:
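#getting the data from the website
page = requests.get(url, headers=header)

# Parse the raw HTML, then re-parse the prettified markup
# (prettify() adds the extra whitespace visible in the output below)
s1 = BeautifulSoup(page.content,'html.parser')
s2 = BeautifulSoup(s1.prettify(),'html.parser')

# A string passed as the second argument to find() is matched against the CSS class
title = s2.find('span','a-size-large product-title-word-break').get_text()
print(title)
price = s2.find('span', 'a-offscreen').get_text()
print(price)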


             PS4 GOW HITS
            

                             ₹899.00
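
`find()` returns `None` when no element matches, so calling `.get_text()` directly on its result raises an `AttributeError` if the page layout changes or the product page is not served. A minimal sketch of a more defensive lookup follows. The HTML is a made-up fragment that mimics the two spans used above, and `text_or_none` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the two spans used above
HTML = """
<span class="a-size-large product-title-word-break">  PS4 GOW HITS </span>
<span class="a-offscreen"> ₹899.00 </span>
"""

def text_or_none(soup, tag, css_class):
    """Return the stripped text of the first matching tag, or None if absent."""
    node = soup.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None

soup = BeautifulSoup(HTML, "html.parser")
title = text_or_none(soup, "span", "product-title-word-break")
price = text_or_none(soup, "span", "a-offscreen")
missing = text_or_none(soup, "span", "no-such-class")
print(title, price, missing)
```

Passing `strip=True` to `get_text()` also removes the surrounding whitespace, so a separate `.strip()` call is not needed in this variant.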
                            


In [8]:
#removing the rupee sign from the price

price = price.strip()[1:]
title = title.strip()

print(title,price)

PS4 GOW HITS 899.00
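
Slicing off the first character works for a price like `₹899.00`, but the result is still a string, and a price with a thousands separator would keep its comma. The sketch below shows one possible way to get a numeric value. `parse_price` is a hypothetical helper, and the assumed format (optional currency symbol, commas as thousands separators, dot as decimal point) is a guess at what the site returns.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Extract the first number from a price string and return it as a Decimal.

    Thousands separators (commas) are removed; returns None if no number is found.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))

print(parse_price("  ₹899.00 "))
print(parse_price("₹1,299.50"))
print(parse_price("Currently unavailable"))
```

`Decimal` is used here rather than `float` to avoid rounding surprises with currency amounts; `float` would also work for a simple threshold check.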


In [9]:
today = datetime.date.today()
print(today)

2025-02-06


**Saving the data from the Amazon website into a CSV file**

In [None]:
import csv 

h1 = ['Title','Price','Date']
data = [title,price,today]

with open('Scrapper_file.csv','w',newline='',encoding='UTF8') as f:
    writer = csv.writer(f)
    writer.writerow(h1)
    writer.writerow(data)

In [11]:
import pandas as pd

# Read the CSV back from the working directory
df = pd.read_csv("Scrapper_file.csv")
df.head(10)

|   | Title | Price | Date |
|---|-------|-------|------|
| 0 | PS4 GOW HITS | 899.0 | 2025-02-06 |


In [12]:
#checking if the appending works

with open('Scrapper_file.csv','a+',newline='',encoding='UTF8') as f:
    writer = csv.writer(f)
    writer.writerow(data)
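
Appending assumes the file already exists with a header row. One way to make a single function cover both the first write and later appends is to write the header only when the file is missing or empty. This is a sketch: `append_row` is a hypothetical helper, and the demo writes to a temporary directory rather than to `Scrapper_file.csv`.

```python
import csv
import datetime
import os
import tempfile

def append_row(path, row, header=("Title", "Price", "Date")):
    """Append one row; write the header first if the file is missing or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="UTF8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)

# Demo in a throwaway directory so no real data file is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    append_row(demo_path, ["PS4 GOW HITS", "899.00", datetime.date(2025, 2, 6)])
    append_row(demo_path, ["PS4 GOW HITS", "899.00", datetime.date(2025, 2, 7)])
    with open(demo_path, newline="", encoding="UTF8") as f:
        rows = list(csv.reader(f))

print(rows)
```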

### **6. Automation**

This function automates the entire process and keeps updating the CSV file on its own.

In [13]:
#automating the process

def automate():
    """Scrape the product title and price once and append a row to the CSV file."""
    url = 'https://www.amazon.in/PS4-God-of-War/dp/B07YQ73Y8T/ref=sr_1_2?keywords=ps4%2Bgame&qid=1642854585&sr=8-2'
    header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"}

    page = requests.get(url, headers=header)

    s1 = BeautifulSoup(page.content,'html.parser')
    s2 = BeautifulSoup(s1.prettify(),'html.parser')

    title = s2.find('span','a-size-large product-title-word-break').get_text()
    price = s2.find('span', 'a-offscreen').get_text()
    price = price.strip()[1:]  # drop the leading rupee sign
    title = title.strip()

    # datetime and csv are already imported in earlier cells
    today = datetime.date.today()

    data = [title,price,today]

    with open('Scrapper_file.csv','a+',newline='',encoding='UTF8') as f:
        writer = csv.writer(f)
        writer.writerow(data)
    # price is a string here, so convert it before comparing
    #if float(price.replace(',', '')) < 900:
    #    send_mail()
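
The commented-out check compares `price` with a number, but `price` is a string at that point, and in Python 3 comparing a `str` with an `int` using `<` raises a `TypeError`. The sketch below shows one way the alert decision could be written once prices are numeric. `price_dropped` and its `threshold` parameter are hypothetical names, and the sample prices are made up.

```python
def price_dropped(history, threshold=None):
    """Decide whether an alert is due for a list of numeric prices (oldest first).

    With a threshold: True if the latest price is below it.
    Without one: True if the latest price is below the previous price.
    """
    if not history:
        return False
    latest = history[-1]
    if threshold is not None:
        return latest < threshold
    return len(history) >= 2 and latest < history[-2]

print(price_dropped([899.0, 799.0]))          # latest is lower than previous
print(price_dropped([899.0, 899.0]))          # unchanged
print(price_dropped([899.0], threshold=900))  # below the fixed threshold
```

The price history could be read back from the `Price` column of the CSV file, converting each value to a number first.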



    

**Loop to keep the scraper running**

A 1-second gap is used here so that the data can be collected and validated quickly. Any interval works, but note that `time.sleep()` takes its argument in seconds (for example, 86400 for once a day).

In [None]:
#loop that keeps the process running after a defined interval
#note here time.sleep() is in seconds
while True:
    automate()
    time.sleep(1) #unit is seconds
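
The loop above runs forever and stops at the first exception, for example when a request fails. A possible variant, sketched below, keeps going after a failed run and can be limited to a fixed number of runs for testing. `run_periodically`, `max_runs`, and `fake_task` are hypothetical names introduced for this example; in the notebook, `automate` would be passed as the task.

```python
import time

def run_periodically(task, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call `task` repeatedly, pausing `interval_seconds` between calls.

    Exceptions from `task` are reported and the loop continues.
    Stops after `max_runs` calls when given; otherwise runs until interrupted.
    Returns the number of calls that completed without an exception.
    """
    runs = 0
    succeeded = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
            succeeded += 1
        except Exception as exc:  # keep going after a failed scrape
            print(f"run {runs + 1} failed: {exc}")
        runs += 1
        # Skip the pause after the final run
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return succeeded

# Demo with a stand-in task that fails on its second call
calls = []

def fake_task():
    calls.append(len(calls) + 1)
    if len(calls) == 2:
        raise RuntimeError("simulated network error")

ok = run_periodically(fake_task, interval_seconds=0, max_runs=3)
print(ok, calls)
```

Because only `Exception` is caught, pressing Ctrl+C (a `KeyboardInterrupt`) still stops the loop.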

**Send Mail to yourself**

Challenge: Google may block sign-in attempts from an app it does not recognise. The first time, you have to allow access manually; after that it seems to work fine.

In [15]:
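#sending a mail if the product becomes available or there is a price drop

def send_mail():
    # SMTP_SSL opens an encrypted connection on port 465, so starttls() is not needed
    server = smtplib.SMTP_SSL('smtp.gmail.com',465)
    server.ehlo()
    server.login('enter email','@@@@@')

    subject = "Price Drop Alert!"
    body = "God of War-PS4 just dropped in price might want to have a look"

    msg = f"Subject: {subject}\n\n{body}"

    # sendmail() needs the sender, the recipient(s), and the message
    server.sendmail(
        'enter email',  # from address
        'enter email',  # to address (the mail is sent to yourself)
        msg
    )
    server.quit()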
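
As an alternative sketch, the message can be built with the standard library's `email.message.EmailMessage`, which sets the headers for you, and the credentials can be read from environment variables instead of being typed into the notebook. The variable names `ALERT_EMAIL_USER` and `ALERT_EMAIL_PASSWORD`, the helper names, and the `me@example.com` address are all assumptions made for this example.

```python
import os
import smtplib
from email.message import EmailMessage

def build_alert(sender, recipient, subject, body):
    """Create an email message with From, To, and Subject headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

def send_alert(msg, host="smtp.gmail.com", port=465):
    """Send `msg` over SSL; credentials come from environment variables.

    ALERT_EMAIL_USER and ALERT_EMAIL_PASSWORD are hypothetical variable names.
    """
    user = os.environ["ALERT_EMAIL_USER"]
    password = os.environ["ALERT_EMAIL_PASSWORD"]
    # The context manager closes the connection when the block exits
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)

# Building the message needs no network access
msg = build_alert("me@example.com", "me@example.com",
                  "Price Drop Alert!", "God of War-PS4 just dropped in price")
print(msg["Subject"])
```

Calling `send_alert(msg)` would attempt the actual delivery, so it is left out of the demo.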