# WEB SCRAPING PROJECT

Web scraping is the process of extracting data from websites, and Python provides powerful libraries 
like Beautiful Soup 4 that make it easier to scrape and parse HTML and XML content.

Imports

In [95]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

We collect data on computer peripherals from the MD Computers site using Beautiful Soup 4.

Enter the item you want to collect information about.

In [96]:
search_item = input("Which item do you want to search? ")

Which item do you want to search? CPU


The site builds its search results from a dynamic URL, so we insert the search term into the URL's query string.

In [97]:
url = f"https://mdcomputers.in/index.php?submit_search=&route=product%2Fsearch&&search={search_item}"
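The f-string above works for a one-word term such as `CPU`, but a term with spaces or symbols would be placed into the URL unescaped. A minimal sketch of one way to handle that, using the standard library's `urlencode`; the helper name `build_search_url` and the `BASE_URL` constant are hypothetical, and the sketch assumes the site accepts the query without the `submit_search=` parameter (the paginated URL used later in this notebook omits it):

```python
from urllib.parse import urlencode

BASE_URL = "https://mdcomputers.in/index.php"  # taken from the URL used in this notebook


def build_search_url(search_item, page=None):
    """Return a search URL with every query value percent-encoded."""
    # Hypothetical helper: parameter names mirror the URLs used in this notebook.
    params = {"route": "product/search", "search": search_item}
    if page is not None:
        params["page"] = page
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("graphics card", page=2))
```

The same helper could build the first-page URL by leaving `page` unset.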

Fetching the page and parsing its HTML content.

In [98]:
page_ = requests.get(url).text
doc = BeautifulSoup(page_, "html.parser")

Finding the total number of result pages by locating the HTML element with the matching class name.

In [99]:
total_pages_x = doc.find(class_="col-sm-6 text-right")

In [100]:
total_pages = (str(total_pages_x).split("(")[1]).split(" ")[0]

Total number of result pages obtained for the search:

In [101]:
total_pages 

'17'

In [106]:
total_pages = int(total_pages)
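Splitting the element's string form on `(` and on a space depends on the exact layout of the markup. A regular expression over the element's text is one possible alternative; this is a sketch that assumes the pagination summary ends with something like `(17 Pages)`. The sample string and the helper name `parse_total_pages` are illustrative and not copied from the site.

```python
import re


def parse_total_pages(pagination_text):
    """Extract the page count from text such as '... (17 Pages)'."""
    match = re.search(r"\((\d+)\s+Pages?\)", pagination_text, flags=re.IGNORECASE)
    # Fall back to a single page when the summary is missing or shaped differently.
    return int(match.group(1)) if match else 1


sample = "Showing 1 to 24 of 402 (17 Pages)"  # assumed wording, for illustration only
print(parse_total_pages(sample))
```

In the notebook this could be called as `parse_total_pages(total_pages_x.get_text())`, after checking that `total_pages_x` is not `None`.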

Next we gather each item's name, its link, and its prices (old and new).

Creating empty lists for the data

In [107]:
link = []
item_name = []
new_price = []
old_price = []

Here is the main code, which scrapes the information from the HTML.

Using this URL, we iterate over the result pages, up to the total page count obtained earlier, and collect every product on each page.

In [108]:
for page in range(1, total_pages + 1):
    url = f"https://mdcomputers.in/index.php?route=product/search&page={page}&search={search_item}"
    page_ = requests.get(url).text
    doc = BeautifulSoup(page_, "html.parser")

    # Each product's data sits in a div with these classes.
    items = doc.find_all(class_="right-block right-b")

    for item in items:
        anchor = item.find("h4").find("a")
        price = item.find("div", class_="price")
        link.append(anchor["href"])  # product link from the href attribute
        item_name.append(anchor.string)  # product name from the h4 tag
        # [1:] skips the first character of the price text (the currency symbol).
        new_price.append(price.find("span", class_="price-new").string[1:])
        # Some products have no old price; store None for them instead of failing.
        old = price.find("span", class_="price-old")
        old_price.append(old.string[1:] if old is not None and old.string else None)
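The loop stores prices as strings, with `None` where no old price exists. If numeric values are needed later (for sorting or computing discounts), a small converter along these lines could be applied to each entry. This is a sketch: the helper name `parse_price` is hypothetical, and the assumed input format (an optional currency symbol and thousands separators, as in `₹60,499`) is a guess about the site's price text.

```python
import re


def parse_price(text):
    """Convert a scraped price string to an int, or None when it is missing."""
    if text is None:
        return None
    # Keep only digits and the decimal point; drops symbols, commas, and spaces.
    cleaned = re.sub(r"[^0-9.]", "", text)
    try:
        return int(float(cleaned))
    except ValueError:
        # Nothing numeric was left after cleaning.
        return None


print(parse_price("₹60,499"))
print(parse_price(None))
```

Applied to the lists above, this would look like `[parse_price(p) for p in new_price]`.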
    

Converting the lists to a dictionary and then to a DataFrame.

In [109]:
# Named `data` rather than `dict` to avoid shadowing the built-in type.
data = {'item_name': item_name, 'link': link, 'new_price': new_price, 'old_price': old_price}

In [110]:
df = pd.DataFrame(data)

Here is our DataFrame with required data

In [111]:
df

,item_name,link,new_price,old_price
0,PowerPlay IV Gaming Bundle (Asus Dual RTX 3060...,https://mdcomputers.in/powerplay-iv-gaming-bun...,60499,74799
1,PowerPlay VI Gaming Bundle (Asus Dual RTX 3060...,https://mdcomputers.in/asus-powerplay-vi-gamin...,76599,80647
2,EK-Quantum Magnitude - CPU Water Block - For A...,https://mdcomputers.in/ek-quantum-magnatude-am...,23503,50399
3,EK-Quantum Magnitude - CPU Water Block - For A...,https://mdcomputers.in/ek-quantum-magnatude-am...,30219,64799
4,EK-Quantum Magnitude - CPU Water Block - For A...,https://mdcomputers.in/ek-quantum-magnatude-am...,24623,52699
...,...,...,...,...
398,Thermaltake UX 210 ARGB Lighting 120mm CPU Air...,https://mdcomputers.in/thermaltake-ux-210-argb...,3650,3900
399,Thermaltake Water 3.0 120 ARGB Sync All In One...,https://mdcomputers.in/thermaltake-water-3.0-1...,4995,7500
400,CORSAIR Hydro X Series XC5 RGB PRO CPU Water B...,https://mdcomputers.in/corsair-hydro-x-xc5-rgb...,6480,7896
401,CORSAIR Hydro X Series XC5 RGB PRO CPU Water B...,https://mdcomputers.in/corsair-hydro-x-xc5-rgb...,6480,7896


Converting the DataFrame to an Excel document, with the file named after the item we searched for.

In [112]:
df.to_excel(f'{search_item}_list_mdcomputers.xlsx')
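Because the file name is built directly from the search term, a term containing spaces, slashes, or other special characters could produce an awkward or invalid file name. One possible precaution is sketched below; the helper name `safe_filename` and the fallback stem `search` are hypothetical choices.

```python
import re


def safe_filename(search_item, suffix="_list_mdcomputers.xlsx"):
    """Build a file name from the search term, replacing unsafe characters."""
    # Collapse every run of characters outside letters, digits, '_' and '-' into one '_'.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", search_item.strip()).strip("_")
    # Use a placeholder stem when nothing usable remains.
    return f"{stem or 'search'}{suffix}"


print(safe_filename("graphics card"))
```

The result could then be passed to the export call, for example `df.to_excel(safe_filename(search_item))`.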

Finally, we have successfully scraped the data from the MD Computers website using Beautiful Soup 4.