<a href="https://colab.research.google.com/github/MFahadHussain/Data-Engineering-/blob/main/WebScrapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Python Script
You'll need to install the necessary libraries if you haven't already:


In [1]:
!pip install requests beautifulsoup4 pandas




In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

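# Fetch one page and return its quotes as a list of dictionaries
def fetch_quotes(url):
    # A timeout keeps the script from hanging forever on a stalled connection
    response = requests.get(url, timeout=10)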
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all quotes on the page
    quotes = []
    for quote in soup.find_all('div', class_='quote'):
        text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
        quotes.append({'quote': text, 'author': author, 'tags': tags})

    return quotes

# Save the quotes to a CSV file, appending if the file already exists
def save_to_csv(quotes, filename='quotes.csv'):
    # Convert the list of quotes to a DataFrame
    df = pd.DataFrame(quotes)

    # If the CSV exists, append to it; otherwise create a new file
    try:
        df_existing = pd.read_csv(filename)
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df_existing = pd.concat([df_existing, df], ignore_index=True)
        df_existing.to_csv(filename, index=False)
    except FileNotFoundError:
        df.to_csv(filename, index=False)

# Main scraping flow
def main():
    base_url = 'http://quotes.toscrape.com/'
    all_quotes = []

    # Scrape 10 pages for demonstration
    for page_num in range(1, 11):
        url = f"{base_url}page/{page_num}/"  # Pages live at /page/<n>/
        print(f"Scraping page {page_num}...")
        quotes = fetch_quotes(url)
        all_quotes.extend(quotes)
        time.sleep(2)  # To avoid hitting the server too hard

    # Save all the quotes to CSV
    save_to_csv(all_quotes)

if __name__ == '__main__':
    main()


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...


Explanation:
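fetch_quotes(url): Downloads one page and returns its quotes as a list of dictionaries, each holding the quote text, the author, and a list of tags.

save_to_csv(quotes, filename): Converts the list of dictionaries to a Pandas DataFrame and saves it to a CSV file. If the file already exists, the new rows are appended to the existing ones, so running the script twice stores the same quotes twice.

main(): Controls the scraping loop, iterating over 10 pages of quotes with a 2-second pause between requests, then saves everything to the CSV.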
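
save_to_csv re-reads the whole file every time it appends. As a rough alternative sketch, the standard-library csv module can append rows directly. The function name append_quotes and the '|' tag separator are assumptions made for illustration; they are not part of the original script.

```python
import csv
import os

FIELDS = ['quote', 'author', 'tags']

def append_quotes(quotes, filename='quotes.csv'):
    """Append quote dictionaries to a CSV, writing the header only once."""
    # Write the header only when the file is new or empty
    write_header = not os.path.exists(filename) or os.path.getsize(filename) == 0
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for q in quotes:
            # Join the tag list into one string so it is easy to split again later
            writer.writerow({**q, 'tags': '|'.join(q['tags'])})
```

Opening the file in append mode avoids loading earlier rows into memory, which may matter once the CSV grows large.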

In [None]:
import pandas as pd
import json

# Function to convert JSON file to CSV
def json_to_csv(input_json_file, output_csv_file):
    # Load the JSON data
    with open(input_json_file, 'r') as f:
        data = json.load(f)  # Load the data from the JSON file

    # Convert the JSON data into a pandas DataFrame
    df = pd.json_normalize(data)  # Flatten nested JSON if needed

    # Save the DataFrame to a CSV file
    df.to_csv(output_csv_file, index=False)
    print(f"Data has been successfully saved to {output_csv_file}")

# Example usage
input_json_file = 'data.json'  # Replace with your JSON file path
output_csv_file = 'report.csv'  # Desired output CSV file name

json_to_csv(input_json_file, output_csv_file)
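
pd.json_normalize turns nested objects into columns whose names join the nested keys with a dot. The standard-library sketch below illustrates that idea for dictionaries only. The flatten helper and the sample record are hypothetical, and this is not how pandas implements it.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dictionaries into a single level with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            # Leave scalars and lists as they are
            flat[full_key] = value
    return flat

# Hypothetical sample record with two levels of nesting
record = {'quote': 'Hello', 'author': {'name': 'Ann', 'born': {'year': 1900}}, 'tags': ['a', 'b']}
print(flatten(record))
```

Each flattened key, such as author.born.year, corresponds to one column header in the resulting CSV.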


2. Automating with Cron Job
If you want this script to run periodically, you can set up a cron job in Linux.

Here’s how you can add a cron job to run this script every day at 8 AM:

Open the cron table:

In [4]:
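# Run these in a bash shell, not in the notebook.
# Open the cron table for editing:
# crontab -e

# Then add this line to run the script every day at 8 AM:
# 0 8 * * * /usr/bin/python3 /path/to/your/script.py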
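
A cron schedule has five fields: minute, hour, day of month, month, and day of week. The short sketch below labels the fields of the 8 AM schedule used here. describe_cron is a hypothetical helper for illustration; it only splits the string and does not validate the values.

```python
CRON_FIELDS = ['minute', 'hour', 'day_of_month', 'month', 'day_of_week']

def describe_cron(expression):
    """Map each of the five cron schedule fields to its name."""
    parts = expression.split()
    if len(parts) != 5:
        raise ValueError('expected five space-separated fields')
    return dict(zip(CRON_FIELDS, parts))

print(describe_cron('0 8 * * *'))
```

Reading the result, minute 0 of hour 8 on every day, month, and weekday gives a daily run at 8 AM.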