# **Introduction to Web Scraping and Implementation**

# What is Web Scraping?

**Web scraping**, also known as web harvesting or web data extraction, is the process of automatically extracting information from websites. It involves using a program or a script to access a website's HTML code, parse the data, and extract specific information from it. This extracted data can then be saved, analyzed, or used for various purposes. Here are some key points to understand about web scraping:

**Components of Web Scraping:**

1. **HTTP Requests:** Web scraping begins with making HTTP requests to the target website to retrieve its HTML code. This can be done using programming libraries such as Requests in Python.

2. **HTML Parsing:** After obtaining the HTML code, web scrapers use HTML parsers (e.g., BeautifulSoup or lxml in Python) to parse the code and extract relevant data.

3. **Data Extraction:** Web scrapers identify specific elements on the web page, such as headings, links, tables, or other structured data, and extract the content from those elements. This often involves selecting elements using CSS selectors or XPath expressions.
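
The parsing and extraction steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the HTML snippet, the class names (`products`, `item`, `price`), and the URLs in it are made up for the example, and the inline string stands in for the body of an HTTP response so the sketch runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; in practice this would come from an HTTP response,
# for example requests.get(url).text
html = """
<html><body>
  <h1 id="title">Sample Products</h1>
  <ul class="products">
    <li class="item"><a href="/p/1">Widget</a> <span class="price">9.99</span></li>
    <li class="item"><a href="/p/2">Gadget</a> <span class="price">19.50</span></li>
  </ul>
</body></html>
"""

# HTML parsing: build a navigable tree from the raw markup.
soup = BeautifulSoup(html, "html.parser")

# Data extraction: CSS selectors pick out the elements of interest.
title = soup.select_one("h1#title").get_text(strip=True)
products = [
    {
        "name": item.select_one("a").get_text(strip=True),
        "url": item.select_one("a")["href"],
        "price": float(item.select_one("span.price").get_text()),
    }
    for item in soup.select("ul.products li.item")
]

print(title)
print(products)
```

Each product becomes a plain dictionary, which keeps the extracted data easy to save or analyze later.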

**Use Cases of Web Scraping:**

1. **Data Collection:** Web scraping is commonly used to gather data from various websites, including product information from e-commerce sites, real estate listings, news articles, and more.

2. **Price Comparison:** Retailers and consumers use web scraping to compare prices of products across different websites and platforms.

3. **Market Research:** Companies can scrape data to analyze market trends, monitor competitors, and gather valuable insights.

4. **Content Aggregation:** News aggregators and content curation platforms use web scraping to collect news articles and blog posts from various sources.

5. **Lead Generation:** Businesses use web scraping to collect contact information (e.g., email addresses) of potential leads for marketing and sales purposes.

6. **Research and Analysis:** Researchers and analysts use web scraping to collect data for academic research, sentiment analysis, and more.

**Legality and Ethical Considerations:**

The legality and ethics of web scraping vary from one website to another and depend on local laws. Some websites may explicitly forbid scraping in their terms of service, while others may allow it to varying degrees. Ethical scraping practices include:

- Respecting the website's `robots.txt` file, which specifies which parts of the site can be crawled and scraped.
- Avoiding scraping that could overload a website's server and disrupt its operations.
- Not collecting and using personal or sensitive data without proper consent or legal basis.

It's important to be aware of the legal and ethical considerations when engaging in web scraping and to use this technique responsibly and in compliance with applicable laws and website policies.
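
One way to check `robots.txt` rules before scraping is Python's standard `urllib.robotparser` module. The sketch below parses a hypothetical `robots.txt` from a string so it runs offline; the user agent name `my-scraper` and the `example.com` URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file is served at <site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() reports whether the given user agent may request the URL.
allowed_article = parser.can_fetch("my-scraper", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

# crawl_delay() returns the requested pause between requests, if one is declared.
delay_seconds = parser.crawl_delay("my-scraper")

print(allowed_article, allowed_private, delay_seconds)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches and parses the real file instead of a string.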

# Workflow of Web Scraping

Performing web scraping involves a few key steps and principles:

**1. Identify Your Target Website:**
   - Choose the website from which you want to scrape data.
   - Ensure that you have the legal right to scrape data from this website, and be aware of any terms of service or `robots.txt` file restrictions.

**2. Select a Programming Language:**
   - You'll need a programming language to write your web scraping code. Python is a popular choice because it has libraries like Requests and BeautifulSoup for web scraping.

**3. Set Up Your Development Environment:**
   - Install the necessary libraries for your chosen programming language.

**4. Send an HTTP Request:**
   - Use your programming language to send an HTTP request to the URL of the web page you want to scrape. The response will contain the HTML content of the page.

**5. Parse the HTML:**
   - Use an HTML parsing library (e.g., BeautifulSoup in Python) to parse the HTML content and create a parse tree. This makes it easier to navigate and extract data from the HTML.

**6. Identify the Data to Scrape:**
   - Determine which parts of the HTML contain the data you want to scrape. You can often use CSS selectors or XPath expressions to locate specific HTML elements.

**7. Extract the Data:**
   - Once you've located the relevant HTML elements, extract the data from those elements. For example, you might extract text, links, or images.

**8. Store the Data:**
   - Store the scraped data in a structured format, such as a CSV file, a database, or a JSON file, depending on your needs.

**9. Handle Pagination:**
   - If the data you want is spread across multiple pages, you'll need to implement a method to follow pagination links and scrape data from multiple pages.

**10. Handle Dynamic Content (Optional):**
   - Some websites load content dynamically using JavaScript. You may need a browser automation tool like Selenium, which can drive a real or headless browser, to interact with the page and scrape dynamic content.
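
The pagination step (step 9) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: it assumes the site marks its next-page link with `rel="next"`, which not every site does, and the names `NextLinkParser`, `crawl_pages`, and `FAKE_SITE` are hypothetical helpers invented for this example. The `fetch` argument is any function that takes a URL and returns HTML text, so the sketch can run against an in-memory fake site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a rel="next"> link on a page."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("rel") == "next" and self.next_href is None:
            self.next_href = attr_map.get("href")


def crawl_pages(start_url, fetch, max_pages=10):
    """Follow rel="next" links from start_url and return each page's HTML."""
    pages = []
    seen = set()
    url = start_url
    # Stop when there is no next link, a URL repeats, or the page cap is hit.
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        parser = NextLinkParser()
        parser.feed(html)
        # Resolve relative links against the current page URL.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return pages


# Hypothetical two-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/list?page=1": '<p>first</p><a rel="next" href="/list?page=2">Next</a>',
    "https://example.com/list?page=2": "<p>second</p>",
}

pages = crawl_pages("https://example.com/list?page=1", FAKE_SITE.__getitem__)
print(len(pages))
```

With a real site, `fetch` could be a small wrapper around `requests.get(url, timeout=10).text`. The `max_pages` cap and the `seen` set guard against endless loops.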

**Principles Behind Web Scraping:**

1. **Respect the Website's Terms of Service:** Always review the website's terms of service and adhere to their scraping policies. Some websites may explicitly prohibit scraping.

2. **Respect `robots.txt`:** Many websites include a `robots.txt` file that specifies which parts of the site can and cannot be scraped. Always respect these guidelines.

3. **Avoid Overloading the Server:** Don't send too many requests too quickly, as this can overload the server and disrupt its operations. Use rate limiting and follow best practices for polite web scraping.

4. **Use Libraries and Tools:** Make use of existing libraries and tools for web scraping, such as Requests, BeautifulSoup, or Scrapy in Python. These tools simplify the process and handle common tasks.

5. **Keep Error Handling in Mind:** When scraping websites, be prepared to handle errors gracefully. Web scraping may encounter issues like HTTP errors, missing data, or changes in website structure.

6. **Data Privacy and Legal Compliance:** Be mindful of data privacy and legal compliance, especially when scraping personal or sensitive data.

7. **Monitor and Maintain:** Websites can change their structure, so it's important to periodically check and update your scraping code to ensure it remains effective.
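
Principles 3 and 5 (pausing between requests and handling errors gracefully) can be combined in a small retry helper. The sketch below is illustrative only: `fetch_politely` and `flaky_fetch` are hypothetical names, and the default of retrying on `OSError` is an assumption chosen because Python's built-in network errors, and the exceptions raised by Requests, derive from it.

```python
import time


def fetch_politely(fetch, url, retries=3, delay=1.0, backoff=2.0, retry_on=(OSError,)):
    """Call fetch(url), pausing and retrying when a listed error occurs."""
    wait = delay
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            time.sleep(wait)  # pause so the server is not hammered
            wait *= backoff   # wait longer before each further attempt


# Hypothetical fetcher that fails twice before succeeding, to show the retry path.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return f"<html>content of {url}</html>"


html = fetch_politely(flaky_fetch, "https://example.com/page", delay=0.01)
print(calls["count"])
```

The same `time.sleep` idea applies between successful requests: a fixed pause between pages is a simple form of rate limiting.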

Web scraping can be a powerful tool for gathering data from the internet, but it should be done responsibly and ethically, with consideration of both legal and ethical principles.

# Tools and Libraries Used in Web Scraping

Web scraping can be made more efficient and convenient by using various tools and libraries. Here are some of the popular tools and libraries used for web scraping:

**1. Requests (Library):**
   - Language: Python
   - Description: Requests is a Python library for making HTTP requests. It's commonly used for fetching the HTML content of web pages, which can then be parsed for data extraction.

**2. BeautifulSoup (Library):**
   - Language: Python
   - Description: BeautifulSoup is a Python library that simplifies the parsing of HTML documents. It makes it easy to navigate and extract data from HTML content.

**3. Scrapy (Framework):**
   - Language: Python
   - Description: Scrapy is a powerful web crawling and scraping framework for Python. It provides tools for defining how to follow links and extract data from websites.

**4. Selenium (Library/Tool):**
   - Language: Multiple (Python, Java, etc.)
   - Description: Selenium is a browser automation tool often used for scraping websites with dynamic content. It can simulate user interactions in a web browser to access data that's loaded via JavaScript.

**5. Puppeteer (Library/Tool):**
   - Language: JavaScript (Node.js)
   - Description: Puppeteer is a Node.js library developed by Google for controlling headless Chrome or Chromium browsers. It's commonly used for web scraping and automating browser tasks.

**6. Cheerio (Library):**
   - Language: JavaScript (Node.js)
   - Description: Cheerio is a fast, flexible, and jQuery-like library for parsing and manipulating HTML content on the server-side, making it a useful choice for web scraping in Node.js.

**7. lxml (Library):**
   - Language: Python
   - Description: lxml is a Python library for processing XML and HTML documents. It's known for its speed and efficiency in parsing and extracting data from HTML pages.

**8. Nokogiri (Library):**
   - Language: Ruby
   - Description: Nokogiri is a Ruby library for parsing and searching XML and HTML documents. It is widely used for web scraping in Ruby applications.

**9. BeautifulSoup4 (Python Library):**
   - Language: Python
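   - Description: BeautifulSoup4 is the current major version of the BeautifulSoup library described in item 2. It is installed as the `beautifulsoup4` package, imported as `bs4`, supports Python 3, and can work with different underlying parsers such as `html.parser`, `lxml`, or `html5lib`.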

**10. PyQuery (Python Library):**
    - Language: Python
    - Description: PyQuery is a Python library that allows you to make jQuery queries on XML documents. It can be a convenient choice for parsing and querying HTML content.

**11. Apache Nutch (Tool):**
    - Language: Java
    - Description: Apache Nutch is an open-source web crawling framework written in Java. It provides a scalable platform for web scraping and indexing web data.

**12. Octoparse (Tool):**
    - Language: Web-based (No coding required)
    - Description: Octoparse is a visual web scraping tool that doesn't require coding. Users can create scraping tasks through a point-and-click interface.

**13. Import.io (Tool):**
    - Language: Web-based (No coding required)
    - Description: Import.io is a cloud-based web scraping tool that allows users to extract data from web pages without writing code.

When choosing the right tool or library for web scraping, consider the complexity of the task, the website's structure, and your programming language preferences. Additionally, always ensure that you comply with the website's terms of service and legal regulations when scraping data.

# What is the `nltk` Library?

**NLTK (Natural Language Toolkit)** is a powerful Python library for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for tasks like tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK is widely used for natural language processing (NLP) and text analysis tasks. Here are some key features and functionalities of NLTK:

**1. Text Processing:** NLTK provides a range of text processing libraries that allow you to work with text data effectively. These include functions for tokenization (splitting text into words or sentences), stemming (reducing words to their root form), and lemmatization (reducing words to their base or dictionary form).

**2. Part-of-Speech Tagging:** NLTK includes tools for part-of-speech tagging, which is the process of identifying the grammatical category (e.g., noun, verb, adjective) of each word in a sentence. This is valuable for various NLP tasks.

**3. Parsing:** You can use NLTK for parsing sentences and extracting grammatical structures. It includes parsers for context-free grammars and dependency grammars, making it suitable for syntactic and semantic parsing.

**4. Named Entity Recognition (NER):** NLTK offers named entity recognition tools that can identify and classify named entities (e.g., names of people, organizations, locations) in text.

**5. WordNet Integration:** NLTK integrates with WordNet, a lexical database for English. This allows you to look up word meanings, synonyms, antonyms, and more.

**6. Text Corpora:** NLTK provides access to a wide range of text corpora, including large collections of text data for various languages. These corpora are useful for training and testing NLP models.

**7. Language Models:** You can build language models, including n-grams and hidden Markov models, for language understanding and generation tasks.

**8. Machine Learning Integration:** NLTK can be combined with machine learning libraries like Scikit-learn to create NLP and text classification models.

**9. Sentiment Analysis:** NLTK is often used for sentiment analysis, a task that involves determining the emotional tone of a piece of text, such as whether a review is positive or negative.

**10. Natural Language Understanding:** NLTK can be employed for a wide range of NLP tasks, from language understanding to text generation, making it a versatile library for working with human language data.

NLTK is a valuable resource for both beginners and experts in the field of NLP. It is widely used in research, education, and industry for tasks ranging from basic text processing to more complex language understanding and generation tasks. It provides a foundation for understanding and working with human language data in a computational context.
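
As a small illustration of the tokenization and stemming features described above, the sketch below uses two NLTK components that need no downloaded corpus data. The sample sentence is made up for the example, and `RegexpTokenizer` is used here only because it works without the `punkt` data that `word_tokenize` depends on.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

text = "The runners were running quickly through the parks."

# Tokenization: split the lowercased text into runs of word characters.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text.lower())

# Stemming: strip common suffixes; stems are not always dictionary words.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

Stemming maps related forms such as "running" and "run" to the same token, which is useful when counting or comparing words.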



# What is the `re` Library?

The `re` library, also known as the "regular expressions" or "regex" library, is a built-in Python library that provides support for regular expressions. Regular expressions are powerful tools for pattern matching and text manipulation. With the `re` library, you can search for, match, and manipulate strings using complex patterns rather than simple literal text. Here are some key features and use cases of the `re` library:

**1. Pattern Matching:** Regular expressions allow you to specify complex patterns to match in text data. These patterns can include a combination of characters, metacharacters, and quantifiers, enabling you to search for specific sequences or structures within strings.

**2. Search and Match:** The `re` library provides functions like `search()` and `match()` to find patterns in a string. The `search()` function searches for a pattern anywhere in the input string, while the `match()` function checks if the pattern matches at the beginning of the string.

**3. Replacement:** You can use regular expressions to replace matched patterns with other strings. This is particularly useful for data cleaning and manipulation tasks.

**4. Splitting:** Regular expressions can be used to split strings into substrings based on a pattern. This is helpful for parsing structured data.

**5. Grouping:** You can use parentheses to create groups within a regular expression pattern. This allows you to extract specific parts of a matched string.

**6. Quantifiers:** Regular expressions support quantifiers like `*` (zero or more occurrences), `+` (one or more occurrences), `?` (zero or one occurrence), and more, making it possible to describe the repetition of characters or groups.

**7. Character Classes:** Character classes like `[a-z]` or `[0-9]` allow you to match specific ranges of characters or digits.

**8. Metacharacters:** Metacharacters like `.` (matches any character except a newline by default), `^` (matches the start of a string), and `$` (matches the end of a string) enable you to express more complex matching conditions.

**9. Escape Sequences:** A backslash either introduces a special sequence or escapes a metacharacter so that it matches literally. For example, `\d` matches any digit, `\.` matches a literal period, and `\\` matches a backslash.

**10. Case-Insensitive Matching:** The `re` library provides an option to perform case-insensitive matching by specifying the `re.IGNORECASE` flag.

**11. Multiline Matching:** By specifying the `re.MULTILINE` flag, you make `^` and `$` match at the start and end of every line in multi-line text, not only at the start and end of the whole string.

Regular expressions are versatile and can be applied in a wide range of applications, such as data validation, text search and manipulation, data extraction from unstructured text, and more. The `re` library is a fundamental tool for handling textual data in Python and is commonly used in data cleaning, text processing, and text analysis tasks.
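
Several of the features listed above can be seen in one short sketch. The sample text and patterns are invented for illustration; only functions from the standard `re` module are used.

```python
import re

text = "Order 66 shipped on 2023-08-15.\norder 67 shipped on 2023-08-16."

# search() finds the first match anywhere in the string.
first_date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()

# match() only succeeds at the very start, so a digit pattern fails here.
starts_with_digits = re.match(r"\d+", text)

# Grouping: parentheses capture parts of the match.
year, month, day = re.search(r"(\d{4})-(\d{2})-(\d{2})", text).groups()

# Replacement and splitting.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)
words = re.split(r"\s+", "one  two\tthree")

# Flags: ignore letter case and let ^ match at the start of each line.
order_ids = re.findall(r"^order (\d+)", text, flags=re.IGNORECASE | re.MULTILINE)

print(first_date, starts_with_digits, (year, month, day))
print(masked)
print(words, order_ids)
```

Writing patterns as raw strings (`r"..."`) keeps backslashes from being interpreted by Python before the regex engine sees them.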


# Implementation

### Web Scraping Data from `Wikipedia`

In [1]:
import requests # to get the html source code of the page
from bs4 import BeautifulSoup # to parse the html source code
import re # to use regular expressions
from nltk.corpus import stopwords # to remove stopwords
from nltk.tokenize import word_tokenize # to tokenize words from sentences
from nltk.stem import PorterStemmer # to stem words to their root form (e.g. running -> run)
import nltk

In [2]:
nltk.download() # opens the interactive NLTK downloader; select the stopwords and punkt packages (only needed once)

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Fetch and Parse HTML

In [3]:
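url = "https://en.wikipedia.org/wiki/Independence_Day_(India)"  # Replace with the URL of the webpage you want to scrape
response = requests.get(url, timeout=10)  # timeout so the call cannot hang indefinitely
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')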

### Extract data

In [4]:
# Example: Extracting all paragraphs
paragraphs = soup.find_all('p')

# Extracting text from each paragraph
paragraph_texts = [paragraph.get_text() for paragraph in paragraphs]

### Text Preprocessing

In [5]:
# Convert to lowercase
lowercase_text = [text.lower() for text in paragraph_texts]

# Remove special characters using regex
cleaned_text = [re.sub(r'[^a-zA-Z0-9\s]', '', text) for text in lowercase_text]

# Tokenization
tokenized_text = [word_tokenize(text) for text in cleaned_text]

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_text = [[word for word in tokens if word not in stop_words] for tokens in tokenized_text]

# Stemming
stemmer = PorterStemmer()
stemmed_text = [[stemmer.stem(word) for word in tokens] for tokens in filtered_text]
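
The sample output at the end of this notebook contains tokens such as `india1` and `nation2`. These most likely appear because Wikipedia reference markers like `[1]` lose their brackets in the special-character step and fuse with the preceding word. One possible refinement, sketched below on a made-up sentence, is to drop numeric markers before removing punctuation. The pattern `\[\d+\]` is an assumption that only covers purely numeric markers.

```python
import re

raw = "Independence coincided with the partition of India,[1] in 1947.[2]"

# Remove numeric reference markers such as [1] before any other cleaning.
no_refs = re.sub(r"\[\d+\]", "", raw)

# Then lowercase and strip the remaining special characters, as in the cell above.
cleaned = re.sub(r"[^a-z0-9\s]", "", no_refs.lower())

print(cleaned)
```

Applied to `paragraph_texts`, the same substitution would run once per paragraph before the lowercase step.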

You can perform additional steps such as removing empty tokens, converting the processed text back to sentences or paragraphs, and so on, based on your requirements.

In [6]:
# Remove empty tokens
final_text = [[word for word in tokens if word.strip()] for tokens in stemmed_text]

# Join each paragraph's tokens back into a single string
sentences = [' '.join(tokens) for tokens in final_text]

# Join the paragraphs, separated by blank lines
processed_paragraphs = '\n\n'.join(sentences)

### Save the Processed Text

In [7]:
with open('processed_text.txt', 'w', encoding='utf-8') as file:
    file.write(processed_paragraphs)

print(processed_paragraphs)




independ day celebr annual 15 august public holiday india commemor nation independ unit kingdom 15 august 1947 day provis indian independ act transfer legisl sovereignti indian constitu assembl came effect india retain king georg vi head state transit republ constitut india came effect 26 januari 1950 celebr indian republ day replac dominion prefix dominion india enact sovereign law constitut india india attain independ follow independ movement note larg nonviol resist civil disobedi led indian nation congress leadership mahatma gandhi

independ coincid partit india1 british india divid dominion india pakistan partit accompani violent riot mass casualti displac nearli 15 million peopl due religi violenc 15 august 1947 first prime minist india jawaharl nehru rais indian nation flag lahori gate red fort delhi subsequ independ day incumb prime minist customarili rais flag give address nation2 entir event broadcast doordarshan india nation broadcast usual begin shehnai music ustad bismil

# Thank You!