# !!! Web Scraping Project: (Extracting Laptop Information) !!!


Web scraping is the process of extracting data from websites. It involves automated techniques and tools to retrieve information from web pages. This data can then be used for various purposes, such as analysis, research, data mining, or populating databases.
                                    
                                                                  
                                                                                       


## Objective :

The primary objective of this project is to develop a robust and efficient web scraping solution for extracting laptop-related data from a specific e-commerce website. The project has the following specific goals:

- ### Data Source Selection: 
    Identify and select the target e-commerce website known for its comprehensive laptop listings and a user-friendly structure that facilitates web scraping.

- ### Data Retrieval Logic: 
    Develop a scraping script that navigates the website's structure, iterates through product pages, and extracts data systematically.

- ### Data Storage: 
    Determine the data storage format, such as CSV, JSON, or a database, and implement a mechanism to save the scraped data efficiently for further analysis.

- ### Rate Limiting and Politeness: 
    Set up rate limiting and polite scraping practices to avoid overloading the target website's server, minimizing the risk of IP bans or disruptions.

- ### Ethical Considerations: 
    Carefully review and adhere to the website's terms of service and robots.txt file to ensure ethical scraping practices. Send an identifying User-Agent header where the site asks for one, and honor the other directives it publishes, such as disallowed paths and crawl delays.

- ### Documentation: 
    Maintain comprehensive documentation of the scraping methodology, including code comments, step-by-step explanations, and a record of any changes or optimizations made during the project.

By focusing on these detailed aspects of web scraping, this project aims to develop a reliable and maintainable data collection process for laptop-related information from the chosen e-commerce website. The resulting dataset will serve as a valuable resource for subsequent analysis or research.
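
The rate limiting and robots.txt goals above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the robots.txt text, the `example.com` URLs, the `USER_AGENT` string, and the `Throttle` helper are all assumptions made for the example, and the rules are parsed from memory so that the sketch runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch, not the real file
# of any website). A real scraper would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout
Crawl-delay: 2
"""

USER_AGENT = "laptop-scraper-example"  # assumed identifier for this scraper


def load_rules(robots_text: str) -> RobotFileParser:
    """Build a robots.txt parser from text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class Throttle:
    """Sleep so that consecutive calls to wait() are at least `delay` seconds apart."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> float:
        """Pause if needed and return the number of seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


rules = load_rules(ROBOTS_TXT)
# Use the published crawl delay when there is one, else fall back to 1 second (assumed).
throttle = Throttle(rules.crawl_delay(USER_AGENT) or 1.0)

for url in ("https://example.com/search?q=laptops", "https://example.com/checkout"):
    print(url, "allowed" if rules.can_fetch(USER_AGENT, url) else "disallowed")
```

In a scraping loop, `throttle.wait()` would be called before each request, and any URL for which `rules.can_fetch()` returns `False` would be skipped.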




## Main Libraries used : 

### 1) Requests -:
The Requests library is a popular Python library for making HTTP requests. It is easy to use and provides a number of features that make it ideal for web scraping, such as:

- Simple and intuitive interface: The Requests library has a simple and intuitive interface that makes it easy to get started with web scraping.
- Support for different HTTP methods: The Requests library supports all of the major HTTP methods, including GET, POST, PUT, and DELETE. This allows you to make a variety of requests to web servers, which is useful for different web scraping tasks.
- Support for authentication: The Requests library ships with HTTP Basic and Digest authentication, and OAuth is available through the companion requests-oauthlib package. This allows you to scrape websites that require authentication.
- Support for cookies and sessions: The Requests library can automatically handle cookies and sessions, which is important for many web scraping tasks.
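- Robust error handling: The Requests library reports failures as exceptions such as `ConnectionError` and `Timeout`, and `response.raise_for_status()` turns 4xx and 5xx responses into an `HTTPError`, so a scraper can detect and handle failed requests.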

#### How Requests works

The Requests library works by sending HTTP requests to web servers and receiving the responses. The responses can then be parsed to extract the desired data.

To make an HTTP request with the Requests library, call the get(), post(), put(), or delete() method, depending on the HTTP method that you want to use. Pass the URL of the web page that you want to scrape as the first argument, followed by any other necessary parameters, such as query parameters, headers, or authentication credentials.

Once you have made the HTTP request, the Requests library will return a response object. The response object contains the HTTP status code, the response headers, and the response body.

The response body is the HTML content of the web page that you scraped. You can then parse the HTML content to extract the desired data.

We can use the Requests library to scrape a variety of different types of data from web pages, such as product information, news articles, and social media posts. The Requests library is a powerful and versatile tool for web scraping.
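
The request and response flow described above can be sketched as follows. `requests.Request(...).prepare()` and `Session.send()` are part of the Requests API, and preparing a request builds the final URL without contacting the server. The `USER_AGENT` string, the helper names, and the 10-second timeout are assumptions made for this example; the search URL is the one used later in this notebook, reduced to two query parameters.

```python
from typing import Optional

import requests

SEARCH_URL = "https://www.flipkart.com/search"
USER_AGENT = "laptop-scraper-example/0.1"  # assumed identifier, not required by the site


def build_search_request(page: int) -> requests.PreparedRequest:
    """Prepare a GET request for one result page without sending it."""
    request = requests.Request(
        "GET",
        SEARCH_URL,
        params={"q": "laptops", "page": page},  # encoded into the query string
        headers={"User-Agent": USER_AGENT},
    )
    return request.prepare()


def fetch_html(prepared: requests.PreparedRequest,
               session: Optional[requests.Session] = None,
               timeout: float = 10.0) -> str:
    """Send a prepared request and return the response body as text."""
    session = session or requests.Session()
    response = session.send(prepared, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text


prepared = build_search_request(page=2)
print(prepared.method, prepared.url)
```

Calling `fetch_html(prepared)` performs the actual network request, so it is left out of the sketch. For simple cases, `requests.get(url, params=..., headers=..., timeout=...)` does the same work in one call.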

### 2) Beautiful Soup -:

We use the BeautifulSoup library in web scraping because it is a powerful and easy-to-use Python library for parsing HTML and XML documents. BeautifulSoup converts the HTML or XML document into a tree of Python objects, which makes it easy to navigate and extract the desired data.

Here are some of the benefits of using BeautifulSoup for web scraping:

- Easy to use: BeautifulSoup has a simple and intuitive interface, making it easy to get started with web scraping.
- Powerful: BeautifulSoup provides a variety of features for parsing and extracting data from HTML and XML documents.
- Flexible: BeautifulSoup can be used to scrape a wide variety of different types of data from web pages, such as product information, news articles, and social media posts.

#### How BeautifulSoup works

When you use BeautifulSoup to parse an HTML or XML document, it first converts the document into a tree of Python objects. This tree represents the structure of the document, with each node in the tree representing a different element in the document.

Once the document has been parsed into a tree, you can use BeautifulSoup to navigate the tree and extract the desired data. For example, you can use BeautifulSoup to find all of the elements on a web page with a specific class name or to extract the text from a specific element.

You can use BeautifulSoup to extract a variety of different types of data from web pages, such as product information, news articles, and social media posts. BeautifulSoup is a powerful and versatile tool for web scraping.
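
As a small illustration of parsing and navigating the tree, the sketch below runs BeautifulSoup on an inline HTML snippet. The snippet and its class names (`product`, `name`, `rating`) are made up for the example and do not come from any real website.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this example.
html = """
<div class="listing">
  <div class="product"><div class="name">Laptop A</div><div class="rating">4.2</div></div>
  <div class="product"><div class="name">Laptop B</div></div>
</div>
"""

# Parse the document into a tree of Python objects.
soup = BeautifulSoup(html, "html.parser")

rows = []
# find_all() returns every matching element; find() returns the first match or None.
for product in soup.find_all("div", class_="product"):
    name = product.find("div", class_="name").get_text(strip=True)
    rating_tag = product.find("div", class_="rating")
    # The second product has no rating element, so find() returns None for it.
    rating = rating_tag.get_text(strip=True) if rating_tag is not None else None
    rows.append((name, rating))

print(rows)
```

Checking the result of `find()` for `None` before reading its text matters in practice, because listings on a real page often omit optional elements such as ratings.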


In [1]:
# Importing the "requests" library to make HTTP requests to the server
import requests

# Importing "BeautifulSoup" to parse the HTML or XML code
from bs4 import BeautifulSoup

# Importing basic libraries
import pandas as pd
import numpy as np

In [None]:
product_name = []
rating_star = []
processor = []
RAM = []
operating_system = []
SSD = []
display_size = []
economic = []
discount = []


In [2]:

# Standard-library module used to pause between requests
import time

# Looping through all the result pages (1 to 75)
for i in range(1,76):
    # Defining the URL of the webpage
    URL = f"https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page={i}"
    
    # Making a "GET" request to fetch the whole HTML code of the webpage and storing it in 'response'
    # (the 10-second timeout is an assumed value so that a stalled connection cannot hang the loop)
    response = requests.get(URL, timeout=10)
    # Pausing briefly between pages so the server is not overloaded (1 second is an assumed value)
    time.sleep(1)
    # Collecting the HTML content from 'response' as raw bytes
    collected_info = response.content
    # Parsing the HTML content
    soup = BeautifulSoup(collected_info,"html.parser")
    
    # Looking for the "div" element whose CSS classes mark the container with all the useful information for the current page
    div = soup.find("div",class_="_1YokD2 _3Mn1Gg")
    # Skipping the page if the container is missing (for example a blocked request or a changed page layout)
    if div is None:
        continue
    
    # Selecting those elements that have all the information about each product
    product_info = div.find_all("div",class_="_3pLy-c row")
    
    # Selecting those elements that store the "Ratings" of each product
    rating_info = div.find_all("div",class_="gUuXy-")
    
    # Selecting those elements that store information about product specifications
    specific_information = div.find_all("ul",class_="_1xgFaf")
    
    # Selecting those elements that store the price and discount of each product
    economic_info = div.find_all("div",class_="_25b18c")


    for tag in product_info:
        # Looking for the element that holds the product name
        product_tag=tag.find("div",class_="_4rR01T")
        if product_tag is None:
            product_name.append("No info")
        else:
            # Extracting the text part from the element
            product=product_tag.text
            product_name.append(product)


    for tag in rating_info:
        # Looking for the element that holds the product rating
        rating_tag=tag.find("div",class_="_3LWZlK")
        if rating_tag is None:
            # No rating shown for this product: recording 0 as a placeholder
            rating_star.append(0)
        else:
            # Extracting the text part from the element
            rating=rating_tag.text
            rating_star.append(rating)


    for tag in specific_information:
        # Looking for all the list elements that hold the product specifications
        specification_tag=tag.find_all("li")
        # Collecting the text of the first five list elements, in page order.
        # A missing position is filled with "No info" instead of raising an IndexError.
        spec_text=[li.text for li in specification_tag[:5]]
        spec_text+=["No info"]*(5-len(spec_text))

        # The fields are taken by position, so a listing that orders its
        # specifications differently will land in the wrong column.
        # First list element: processor
        processor.append(spec_text[0])
        # Second list element: RAM
        RAM.append(spec_text[1])
        # Third list element: operating system
        operating_system.append(spec_text[2])
        # Fourth list element: SSD capacity
        SSD.append(spec_text[3])
        # Fifth list element: display size
        display_size.append(spec_text[4])

    for tag in economic_info:
        # Selecting the element that holds the product price and extracting its text
        eco_tag=tag.find("div",class_="_30jeq3 _1_WHN1")
        if eco_tag is None:
            economic.append("No info")
        else:
            economic.append(eco_tag.text)

        # Selecting the element that holds the discount percent and extracting its text
        discount_tag=tag.find("div",class_="_3Ay6Sb")
        if discount_tag is None:
            # No discount shown for this product: recording 0 as a placeholder
            discount.append(0)
        else:
            discount.append(discount_tag.text)
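
The cell above fills each list from a separate page-wide search, so the lists only line up if every product exposes every element. One alternative, sketched below under stated assumptions, is to build one record per product card so that all fields of a row come from the same card. The HTML and its class names (`card`, `name`, `rating`, `price`, `discount`) are hypothetical and far simpler than the real page, and `parse_card` and `text_or_default` are illustrative helpers, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for two product cards; the second one lacks a rating,
# a discount, and two of the five specification items.
SAMPLE_HTML = """
<div class="card">
  <div class="name">Laptop A</div><div class="rating">4.2</div>
  <ul><li>Intel Core i5 Processor</li><li>8 GB DDR4 RAM</li><li>Windows 11</li>
      <li>512 GB SSD</li><li>15.6 Inch Display</li></ul>
  <div class="price">₹49,990</div><div class="discount">21% off</div>
</div>
<div class="card">
  <div class="name">Laptop B</div>
  <ul><li>AMD Ryzen 5 Processor</li><li>8 GB DDR4 RAM</li><li>Windows 11</li></ul>
  <div class="price">₹41,990</div>
</div>
"""

SPEC_FIELDS = ("Processor", "RAM", "Operating_system", "SSD_Capacity", "Display_size")


def text_or_default(parent, tag_name, class_name, default):
    """Return the stripped text of the first matching child, or `default` if absent."""
    tag = parent.find(tag_name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default


def parse_card(card):
    """Turn one product card into a dict with the same keys for every product."""
    specs = [li.get_text(strip=True) for li in card.find_all("li")][:len(SPEC_FIELDS)]
    specs += ["No info"] * (len(SPEC_FIELDS) - len(specs))  # pad missing positions
    record = {
        "Name": text_or_default(card, "div", "name", "No info"),
        "Rating_star": text_or_default(card, "div", "rating", 0),
    }
    record.update(zip(SPEC_FIELDS, specs))
    record["Price"] = text_or_default(card, "div", "price", "No info")
    record["Discount"] = text_or_default(card, "div", "discount", 0)
    return record


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_card(card) for card in soup.find_all("div", class_="card")]
print(records[1])
```

A list of such dicts can be passed directly to `pd.DataFrame(records)`, which yields one row per product with no need to transpose.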

In [3]:
# creating the dataframe

df = pd.DataFrame([product_name,rating_star,processor,RAM,operating_system,SSD,display_size,economic,discount]).T
df.columns=["Name","Rating_star","Processor","RAM","Operating_system","SSD_Capacity","Display_size","Price","Discount"]

In [4]:
df.head(10)

| Unnamed: 0 | Name | Rating_star | Processor | RAM | Operating_system | SSD_Capacity | Display_size | Price | Discount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Lenovo ThinkBook 15 Core i5 12th Gen 1235U - (... | 3.8 | Intel Core i5 Processor (12th Gen) | 8 GB DDR4 RAM | Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 Inch) Display | ₹49,990 | 21% off |
| 1 | MSI Ryzen 5 Hexa Core 7530U - (8 GB/512 GB SSD... | 4.4 | AMD Ryzen 5 Hexa Core Processor | 8 GB DDR4 RAM | Windows 11 Operating System | 512 GB SSD | 35.56 cm (14 Inch) Display | ₹41,990 | 32% off |
| 2 | HP Laptop Core i3 11th Gen 1115G4 - (8 GB/512 ... | 4.2 | Intel Core i3 Processor (11th Gen) | 8 GB DDR4 RAM | Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 Inch) Display | ₹38,990 | 20% off |
| 3 | HP 2023 Athlon Dual Core 3050U - (8 GB/512 GB ... | 4.0 | AMD Athlon Dual Core Processor | 8 GB DDR4 RAM | Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 Inch) Display | ₹30,990 | 11% off |
| 4 | ASUS Vivobook 15 Core i5 11th Gen 1135G7 - (8 ... | 4.3 | Intel Core i5 Processor (11th Gen) | 8 GB DDR4 RAM | 64 bit Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 Inch) Display | ₹42,990 | 38% off |
| 5 | DELL Inspiron Core i3 11th Gen 1115G4 - (8 GB/... | 4.1 | Intel Core i3 Processor (11th Gen) | 8 GB DDR4 RAM | 64 bit Windows 11 Operating System | 512 GB SSD | 96.52 cm (38 cm) Display | ₹35,990 | 39% off |
| 6 | Lenovo V15 Ryzen 5 Hexa Core 5500U - (8 GB/512... | 4.2 | AMD Ryzen 5 Hexa Core Processor | 8 GB DDR4 RAM | Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 Inch) Display | ₹34,604 | 50% off |
| 7 | Lenovo Ryzen 3 Quad Core 7320U - (8 GB/512 GB ... | 3.8 | AMD Ryzen 3 Quad Core Processor | 8 GB LPDDR5 RAM | Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 inch) Display | ₹35,990 | 39% off |
| 8 | Infinix INBook Y1 Plus Intel Core i3 10th Gen ... | 4.2 | Intel Core i3 Processor (10th Gen) | 8 GB LPDDR4X RAM | 64 bit Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 inch) Display | ₹26,990 | 46% off |
| 9 | HP (2023) Intel Core i5 11th Gen 1155G7 - (16 ... | 4.3 | Intel Core i5 Processor (11th Gen) | 16 GB DDR4 RAM | 64 bit Windows 11 Operating System | 512 GB SSD | 39.62 cm (15.6 Inch) Display | ₹55,091 | 7% off |
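
The scraped `Price`, `Discount`, and `Rating_star` values are text such as "₹49,990" and "21% off", mixed with the placeholder `0` for missing ratings and discounts. A possible cleaning step before analysis is sketched below on a two-row sample. The sample frame and the variable names are assumptions for illustration, and the sketch assumes every price and discount contains at least one digit (a "No info" value would need separate handling).

```python
import pandas as pd

# Small sample in the same shape as the scraped columns (values copied from the
# output above, plus the placeholder 0 used for missing ratings and discounts).
sample = pd.DataFrame({
    "Price": ["₹49,990", "₹41,990"],
    "Discount": ["21% off", 0],
    "Rating_star": ["3.8", 0],
})

cleaned = sample.assign(
    # Drop the currency sign and thousands separator, keep the digits.
    Price=sample["Price"].str.replace(r"[^\d]", "", regex=True).astype(int),
    # Pull the leading number out of strings like "21% off"; 0 stays 0.
    Discount=sample["Discount"].astype(str).str.extract(r"(\d+)", expand=False).astype(int),
    # Convert the mix of strings and integers to floats.
    Rating_star=pd.to_numeric(sample["Rating_star"]),
)

print(cleaned.dtypes)
```

With numeric columns, operations such as sorting by price or averaging ratings work directly on the dataframe.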


In [5]:
df.shape

(984, 9)

In [6]:
# Saving the dataframe as a CSV file

df.to_csv("Laptop_information.csv")
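
By default, `to_csv` also writes the row index as an unnamed first column, which `pd.read_csv` then loads back as a column called `Unnamed: 0`. The sketch below shows the difference that `index=False` makes, and a JSON alternative. It uses in-memory buffers and a two-row sample frame (both assumptions for the example) so that it runs without touching the file system.

```python
import io

import pandas as pd

# Two-row sample standing in for the scraped dataframe.
df_example = pd.DataFrame({
    "Name": ["Laptop A", "Laptop B"],
    "Price": ["₹49,990", "₹41,990"],
})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df_example.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
df_example.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

# JSON alternative: one object per row, keeping non-ASCII characters such as ₹ readable.
records_json = df_example.to_json(orient="records", force_ascii=False)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Passing a file name instead of a buffer, as in the cell above, works the same way.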

#### !!!! Thanks !!!!