Objective:

The objective of this assignment is to help trainees gain hands-on experience with Scrapy, a powerful web scraping framework in Python. By the end of this assignment, trainees should be able to create Scrapy projects, build spiders to extract data from websites, and store the scraped data in various formats.

Task 1: Install and Set Up Scrapy

Install Scrapy:
- Install Scrapy in your Python environment.

- Use the following command to install: pip install scrapy

In [2]:
pip install scrapy

Note: you may need to restart the kernel to use updated packages.
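
Because a package name that is one letter off (for example scapy instead of scrapy) installs without any error, it can help to confirm that the intended package is importable before continuing. The snippet below is a minimal sketch using only the standard library; the helper name is_installed is an assumption made for illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical check: report whether each package is available.
for name in ("scrapy", "scapy"):
    print(f"{name}: {'installed' if is_installed(name) else 'not installed'}")
```

If scrapy is reported as not installed, rerun the install command and restart the kernel.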


Create a Scrapy Project:
- Create a new Scrapy project named "web_scraper" in your working directory.

# Create the project (run in a terminal):
# scrapy startproject web_scraper

Task 2: Create a Spider to Scrape a Website
- Choose a Website: Select a simple, publicly accessible website to scrape.
- Examples include:

- http://quotes.toscrape.com (A website designed for practicing web scraping)

- Generate a Spider: Create a spider within your project to scrape the website.

- Name the spider based on the website, e.g., quotes_spider for the quotes website.

In [None]:
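import scrapy


class QuotesSpider(scrapy.Spider):
    """Scrape quotes from quotes.toscrape.com."""

    name = "quotes"
    # The domain and start URL must point at the real site being scraped.
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page sits inside a <div class="quote"> element.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }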
    


Extract Data:
- Extract the following data from the website:

- Quotes: Extract the text of the quotes.
- Authors: Extract the name of the author for each quote.
- Tags: Extract tags associated with each quote.

In [None]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                # Quote text, including the surrounding quotation marks.
                "text": quote.css("span.text::text").get(),
                # Author name shown under the quote.
                "author": quote.css("small.author::text").get(),
                # All tags attached to the quote, returned as a list.
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

In [None]:
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
2024-09-07 00:00:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}

Task 3: Save the Scraped Data
- Save Data to a JSON File: Run the spider and save the scraped data to a JSON file.

In [None]:
# scrapy crawl quotes -O quotes.json
[
{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "“Try not to become a man of success. Rather become a man of value.”", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "“It is better to be hated for what you are than to be loved for what you are not.”", "author": "André Gide", "tags": ["life", "love"]},
{"text": "“I have not failed. I've just found 10,000 ways that won't work.”", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "“A day without sunshine is like, you know, night.”", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
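
To confirm that the JSON feed was written correctly, the file can be loaded back with the standard library. This is a minimal sketch: the helper names load_quotes and summarize are assumptions for illustration, and the sample item is copied from the output above.

```python
import json
from pathlib import Path


def load_quotes(path):
    """Load the list of scraped items from a Scrapy JSON feed file."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize(items):
    """Return (number of items, sorted list of unique authors)."""
    authors = sorted({item["author"] for item in items})
    return len(items), authors


# In-memory sample with the same shape as one entry of quotes.json.
sample = [
    {
        "text": "“A day without sunshine is like, you know, night.”",
        "author": "Steve Martin",
        "tags": ["humor", "obvious", "simile"],
    }
]
print(summarize(sample))
```

For the real file, summarize(load_quotes("quotes.json")) would give the item count and the authors found, assuming the file is in the current directory.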

Save Data to a CSV File:
- Run the spider again and save the data to a CSV file.

In [None]:
# scrapy crawl quotes -O quotes.csv
text,author,tags
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,Albert Einstein,"change,deep-thoughts,thinking,world"
"“It is our choices, Harry, that show what we truly are, far more than our abilities.”",J.K. Rowling,"abilities,choices"
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”,Albert Einstein,"inspirational,life,live,miracle,miracles"
"“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",Jane Austen,"aliteracy,books,classic,humor"
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",Marilyn Monroe,"be-yourself,inspirational"
“Try not to become a man of success. Rather become a man of value.”,Albert Einstein,"adulthood,success,value"
“It is better to be hated for what you are than to be loved for what you are not.”,André Gide,"life,love"
"“I have not failed. I've just found 10,000 ways that won't work.”",Thomas A. Edison,"edison,failure,inspirational,paraphrased"
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”,Eleanor Roosevelt,misattributed-eleanor-roosevelt
"“A day without sunshine is like, you know, night.”",Steve Martin,"humor,obvious,simile"

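In the CSV output above, the tags list is flattened into a single comma-joined column. The sketch below shows one way to read the feed back and split that column into a list again. The helper name read_quotes_csv is an assumption for illustration, and the sample row is copied from the output above.

```python
import csv
import io


def read_quotes_csv(stream):
    """Parse a quotes CSV feed and turn the joined tags column back into a list."""
    rows = []
    for row in csv.DictReader(stream):
        row["tags"] = row["tags"].split(",") if row["tags"] else []
        rows.append(row)
    return rows


# In-memory sample: header plus one row in the same format as quotes.csv.
sample = io.StringIO(
    "text,author,tags\n"
    '"“A day without sunshine is like, you know, night.”",Steve Martin,"humor,obvious,simile"\n'
)
rows = read_quotes_csv(sample)
print(rows[0]["author"], rows[0]["tags"])
```

For the real file, pass open("quotes.csv", newline="", encoding="utf-8") instead of the in-memory sample.
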
Task 4: Implement Error Handling and Logging
- Add Error Handling: Modify your spider to include basic error handling, such as retrying failed requests or skipping certain elements if they are not found.

- At first I forgot to create the project, so I had to redo it.
- I made a spelling mistake.
- The crawl command was not found because I had forgotten to create the project.
- I deleted quotes.py and created it again.
- The title, quote, and author fields were empty when I saved to the JSON file.
- I forgot to update the code before running the crawl.
- After four attempts I was able to get the result.
- The fourth attempt was easy because I had noted down all my mistakes.
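
The error handling asked for in Task 4 (retrying failed requests and skipping elements that are not found) could be sketched as follows. This is an illustration under stated assumptions, not the code that was actually run: build_item and RETRY_SETTINGS are hypothetical names, and the retry count of 3 is an arbitrary choice. RETRY_ENABLED and RETRY_TIMES are real Scrapy settings.

```python
# Settings that could be assigned to the spider's `custom_settings` attribute.
RETRY_SETTINGS = {
    "RETRY_ENABLED": True,  # the retry middleware is on by default; stated for clarity
    "RETRY_TIMES": 3,       # extra attempts per failed request (assumed value)
}


def build_item(text, author, tags):
    """Return a cleaned item dict, or None if a required field is missing."""
    if not text or not author:
        return None
    return {
        "text": text.strip(),
        "author": author.strip(),
        "tags": [tag.strip() for tag in (tags or [])],
    }


# Inside QuotesSpider.parse, the helper might be used like this:
#     item = build_item(
#         quote.css("span.text::text").get(),
#         quote.css("small.author::text").get(),
#         quote.css("div.tags a.tag::text").getall(),
#     )
#     if item is None:
#         self.logger.warning("Skipping incomplete quote on %s", response.url)
#         continue
#     yield item

print(build_item(" some text ", "Some Author", ["tag"]))
print(build_item(None, "Some Author", []))
```

Because .get() returns None when a selector matches nothing, checking the fields before yielding avoids writing empty rows to the output file.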

Enable Logging:
- Configure Scrapy’s logging to monitor your spider’s activity. Write logs to a file for review.

- cd desktop
- cd web_scraper
- ls (shows scrapy.cfg and the web_scraper package directory)
- scrapy genspider quotes quotes.toscrape.com
- scrapy crawl quotes 
- scrapy crawl quotes -O quotes.json
- scrapy crawl quotes -O quotes.csv
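
The commands above run the spider but do not yet write logs to a file. One way to do that is to set Scrapy's logging options in the project's settings.py. The fragment below is a minimal sketch: LOG_ENABLED, LOG_FILE and LOG_LEVEL are real Scrapy settings, while the file name quotes.log and the INFO level are assumed values.

```python
# settings.py (fragment): the values below are assumptions, adjust as needed.
LOG_ENABLED = True        # logging is enabled by default; stated for clarity
LOG_FILE = "quotes.log"   # write log messages to this file instead of the console
LOG_LEVEL = "INFO"        # minimum severity to record (DEBUG, INFO, WARNING, ERROR, CRITICAL)
```

The same effect is available for a single run from the command line with scrapy crawl quotes --logfile quotes.log --loglevel INFO.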
