## Q1. What is Web Scraping? Why is it Used? Give three areas where Web Scraping is used to get data.

Ans- 

Web scraping is the process of extracting data from websites. It involves automated retrieval of information from web pages, typically using specialized software tools or programming scripts. Web scraping allows users to gather large amounts of data from the internet quickly and efficiently.

**There are several reasons why web scraping is used :**

* **Data Collection :** 

Web scraping is commonly used to collect data from various websites for analysis, research, or other purposes. This could include gathering product information from e-commerce sites, extracting news articles from news websites, or compiling social media data.

* **Competitive Intelligence:**

Businesses use web scraping to gather competitive intelligence, such as monitoring competitors' prices, product offerings, and customer reviews. By analyzing this data, businesses can make informed decisions to stay competitive in the market.

* **Research and Analysis :**

Researchers often use web scraping to collect data for academic studies, market research, or data analysis projects. By scraping data from multiple sources on the internet, researchers can gain insights and identify trends in various domains.

**Three areas where web scraping is commonly used to get data are :**

* **E-commerce :**

Web scraping is widely used in e-commerce to collect product information, prices, and reviews from online retailers. This data can be used for price comparison, market analysis, and monitoring competitor pricing strategies.

* **Social Media Monitoring :**

Web scraping is employed to gather data from social media platforms such as Twitter, Facebook, and LinkedIn. This includes extracting user comments, posts, and engagement metrics for sentiment analysis, marketing research, or social listening.

* **Financial Services :**

In finance, web scraping is utilized to extract data from financial news websites, stock market platforms, and other sources. This data can include stock prices, company financials, economic indicators, and news articles, which are then used for investment research, algorithmic trading, and market analysis.

## Q2. What are the different methods used for Web Scraping?

Ans-

There are several methods used for web scraping, ranging from simple manual techniques to more sophisticated automated approaches.

* **Web Scraping Libraries :**

Various programming libraries and frameworks are available for web scraping, such as BeautifulSoup and Scrapy (for Python) and Puppeteer (for JavaScript). These libraries provide tools and functions to parse HTML and extract data from web pages programmatically.

* **Web Scraping Tools and Software :**

There are many web scraping tools and software applications available that allow users to scrape data from websites without writing code. These tools typically provide a user-friendly interface for specifying scraping parameters and exporting data in various formats.

* **Browser Extensions :**

Browser extensions such as "Web Scraper" and "Data Scraper" provide a graphical interface for extracting data from web pages. Users can define scraping rules using point-and-click actions, making it easier to scrape data without writing code.

* **Manual Copy-Pasting :** 

The simplest form of web scraping involves manually copying and pasting data from web pages into a spreadsheet or text file. While this method is straightforward, it is time-consuming and not suitable for large-scale data extraction.

* **APIs :** 

Some websites offer APIs (Application Programming Interfaces) that allow developers to access data in a structured format without the need for web scraping. APIs provide a more reliable and efficient way to access data, as they are designed specifically for this purpose. However, not all websites offer APIs, and some may have usage restrictions or require authentication.
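To make the difference between scraping and using an API concrete, here is a minimal sketch that reads the same data from an HTML fragment and from a JSON payload. The fragment, the payload, and the names `PriceParser`, `prices_from_html`, and `prices_from_api` are all hypothetical and chosen only for illustration; only the Python standard library is used.

```python
import json
from html.parser import HTMLParser

# Hypothetical page fragment and API payload carrying the same data.
HTML_PAGE = '<ul><li class="price">19.99</li><li class="price">5.49</li></ul>'
API_RESPONSE = '{"prices": [19.99, 5.49]}'


class PriceParser(HTMLParser):
    """Collects the text of every <li class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data))

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_price = False


def prices_from_html(html):
    """Scraping: walk the markup and pick out the values we need."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


def prices_from_api(payload):
    """API: the data is already structured, so one lookup is enough."""
    return json.loads(payload)["prices"]


print(prices_from_html(HTML_PAGE))
print(prices_from_api(API_RESPONSE))
```

The scraping path depends on the page's tag and class names and breaks if the markup changes, while the API path depends only on the documented response format.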

## Q3. What is Beautiful Soup? Why is it used?

Ans-

Beautiful Soup is a Python library used for parsing HTML and XML documents. It provides a convenient way to extract and manipulate data from web pages. Beautiful Soup works on top of an underlying parser (such as Python's built-in `html.parser` or `lxml`) and builds a parse tree from the document, which can then be traversed to extract data based on tags, attributes, and other criteria.

**Beautiful Soup is used for several reasons :**

* **Web Scraping :**

Beautiful Soup is commonly used for web scraping tasks, where data needs to be extracted from multiple web pages automatically. It simplifies the process of parsing and extracting data from HTML documents, making web scraping more efficient and manageable.

* **Data Extraction :**

It provides powerful tools for extracting data from web pages, such as finding elements by tag name, class, id, or other attributes. Users can extract text, links, images, and other content from HTML documents.

* **Data Cleaning :**

Beautiful Soup can be used to clean and normalize HTML data, removing unnecessary tags, attributes, or formatting to make it easier to work with.

* **Data Manipulation :**

In addition to extracting data, Beautiful Soup provides methods for manipulating the parsed document, such as modifying the structure, adding or removing elements, and navigating the parse tree.
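The extraction, cleaning, and manipulation uses listed above can be sketched on a small inline document. The HTML below, including its class names and review text, is made up for illustration and is not taken from any real site; the `bs4` package is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded review page.
HTML = """
<html><body>
  <div id="reviews">
    <div class="review"><p class="author">Asha</p><p class="comment">Good <b>value</b></p></div>
    <div class="review"><p class="author">Ravi</p><p class="comment">Battery drains fast</p></div>
  </div>
  <a href="/next">Next page</a>
</body></html>
"""

# Build the parse tree with Python's built-in parser.
soup = BeautifulSoup(HTML, "html.parser")

# Data extraction: find elements by tag and class, then read their text.
reviews = [
    {
        "author": r.find("p", class_="author").get_text(strip=True),
        "comment": r.find("p", class_="comment").get_text(" ", strip=True),
    }
    for r in soup.find_all("div", class_="review")
]

# Data extraction: collect link targets from attributes.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Data cleaning: drop <b> formatting tags but keep their text.
for bold in soup.find_all("b"):
    bold.unwrap()

# Data manipulation: remove the navigation link from the tree entirely.
soup.find("a").decompose()

print(reviews)
print(links)
```

`find` and `find_all` cover most lookups by tag, class, or attribute; `unwrap` and `decompose` are two of the tree-editing methods used for cleaning.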

## Q4. Why is Flask used in this Web Scraping project?

Ans-

Flask is used in this web scraping project for several reasons:

* **Web Interface :**

Flask allows the creation of a web interface for the web scraping project. In this case, it provides routes for displaying the home page (`/`) and the results page (`/review`), enhancing user interaction and experience.

* **Handling HTTP Requests :**

Flask handles HTTP requests (GET and POST) from the client's web browser. When the user submits a search query on the home page or requests to view the review comments, Flask routes these requests to the appropriate functions for processing.

* **Integration with Web Scraping Libraries :**

Flask view functions are ordinary Python code, so they can call web scraping libraries such as BeautifulSoup and Selenium directly. In this project, BeautifulSoup is used to parse HTML content, while Selenium is used to automate the web browser for dynamic content extraction.

* **Data Storage :**

Flask does not store data itself, but its route functions are where the project's storage logic runs. The functions that handle a request extract review comments from Flipkart, write them to a CSV file, and insert the same data into a MongoDB collection.

* **Cross-Origin Resource Sharing (CORS) :**

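The Flask-CORS extension is used to handle cross-origin requests. It adds CORS headers to the application's responses so that browser pages served from a different origin are allowed to call its routes. It does not affect the requests the scraper itself sends to other websites, since those are made on the server and are not subject to the browser's same-origin policy.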

* **Error Handling:** 

Flask provides mechanisms for error handling, allowing the project to gracefully handle exceptions that may occur during the web scraping process and display appropriate error messages to the user.
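The routing, request handling, and error handling points above can be sketched as a very small Flask app. This is an illustration under stated assumptions rather than the project's actual code: the form field name `content`, the helper `scrape_reviews`, and its canned return value are hypothetical, and the real project would call BeautifulSoup or Selenium where the helper is.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def scrape_reviews(query):
    """Hypothetical stand-in for the scraping step; returns canned data."""
    if not query:
        raise ValueError("empty search query")
    return [{"product": query, "comment": "sample comment"}]


@app.route("/", methods=["GET"])
def home():
    # Home page: a search form that posts to /review.
    return (
        '<form action="/review" method="post">'
        '<input name="content"><button>Search</button></form>'
    )


@app.route("/review", methods=["POST"])
def review():
    # Read the submitted search term and hand it to the scraping step.
    query = request.form.get("content", "").strip()
    return jsonify({"reviews": scrape_reviews(query)})


@app.errorhandler(ValueError)
def handle_bad_query(err):
    # Turn a scraping-side error into a readable 400 response.
    return jsonify({"error": str(err)}), 400


if __name__ == "__main__":
    app.run(debug=True)
```

Keeping the scraping logic in a separate function makes it possible to test the routes with canned data and to swap in the real scraper later.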

## Q5. Write the names of AWS services used in this project. Also, explain the use of each service.

Ans-

* **Amazon EC2 (Elastic Compute Cloud) :**

Amazon EC2 could be used to host the web scraping application. It provides resizable compute capacity in the cloud and allows users to run applications on virtual servers called instances. The Flask application could be deployed on an EC2 instance for hosting the web interface and handling web scraping requests.

* **AWS Elastic Beanstalk :**

Elastic Beanstalk is used in this project to deploy and run the Flask application without managing the underlying infrastructure directly. Once the application code is uploaded, Elastic Beanstalk handles capacity provisioning, load balancing, and scaling based on application demand, so developers can focus on writing code. It supports rolling updates, integrates with Amazon CloudWatch for monitoring and logging, and works with other AWS services such as Amazon RDS and Amazon S3. Pricing is pay-as-you-go: there is no additional charge for Elastic Beanstalk itself, only for the AWS resources it provisions, so no upfront infrastructure investment is needed. Together, these features make it a convenient platform for Flask-based web scraping projects and reduce operational overhead.

* **AWS CodePipeline :**

AWS CodePipeline could be used in a web scraping project to automate deployment as part of a continuous integration and continuous deployment (CI/CD) workflow. A pipeline can trigger a build and deployment automatically whenever changes are pushed to the repository, so updates to the scraping logic are released quickly. CodePipeline integrates with other AWS services such as AWS CodeBuild and AWS CodeDeploy, which lets a single pipeline cover the path from source to build to deployment. Automating these steps reduces manual deployment errors and makes releases more repeatable.
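For the Elastic Beanstalk deployment described above, the entry point might look like the sketch below. It assumes the Python platform's default settings, under which Elastic Beanstalk looks for a file named `application.py` exposing a WSGI callable named `application`; the route and its response text are placeholders, not the project's real pages.

```python
# application.py
from flask import Flask

# Named "application" so Elastic Beanstalk's default settings can find it.
application = Flask(__name__)


@application.route("/")
def home():
    # Placeholder response; the real project serves its search page here.
    return "Scraper is running"


if __name__ == "__main__":
    # Local development only; on Elastic Beanstalk the platform's
    # WSGI server imports "application" instead of running this block.
    application.run()
```

If the Flask object has a different name or lives in another file, the environment's WSGI path setting has to be changed to match.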