# **Are You Ready to Get the Data You Need?**

### **There are four ways to get the data:**
* Download the data.
* Web scraping.
* MySQL database.
* Using an API.

## **Download the Data**
* Choose the public dataset you need and download it by passing its link and a save path (including the file extension) to `download_dataset`.



* Popular open data repositories

    * [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/)
    * [Kaggle Datasets](https://www.kaggle.com/datasets)
    * [Amazon’s AWS datasets](https://registry.opendata.aws/)
* Meta portals (they list open data repositories)

    * [Data Portals](http://dataportals.org/)
    * [OpenDataMonitor](http://opendatamonitor.eu/)
    * [Quandl](http://quandl.com/)
* Other pages listing many popular open data repositories

    * [Wikipedia’s list of Machine Learning datasets](https://homl.info/9)
    * [Quora.com](https://homl.info/10)
    * [The datasets subreddit](https://www.reddit.com/r/datasets)


In [None]:
#!pip install requests
import requests

def download_dataset(dataset_link, save_path):
    """
    Download a dataset file from a given URL and save it to a specified local path.

    Parameters:
        dataset_link (str): The URL of the dataset file to be downloaded.
        save_path (str): The local path where the downloaded dataset file will be saved.
    """
    try:
        # Send an HTTP GET request to the dataset link
        response = requests.get(dataset_link, timeout=30)  # The timeout keeps a stalled server from hanging the call forever
        response.raise_for_status()  # Raise an exception if the response status code indicates an error

        # Open the specified save_path file in binary write mode
        with open(save_path, 'wb') as file:
            # Write the content of the response to the file
            file.write(response.content)

        print("Download complete.")  # Print a message indicating successful download
    except requests.exceptions.RequestException as e:
        print("Error:", e)  # Print an error message if there's an exception during the request
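
`download_dataset` expects a `save_path` whose extension matches the file format behind the link. A minimal sketch of deriving that path from the link itself is shown here; the helper names `filename_from_url` and `build_save_path`, the `data` folder, the `dataset.bin` fallback, and the example URLs are all assumptions made for illustration, not part of the original notebook.

```python
import os
from urllib.parse import urlparse, unquote

def filename_from_url(dataset_link, default_name="dataset.bin"):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlparse(dataset_link).path           # e.g. "/files/housing.tgz"
    name = os.path.basename(unquote(path))       # e.g. "housing.tgz"
    return name or default_name                  # fall back when the URL ends in "/"

def build_save_path(dataset_link, target_dir="data"):
    """Join a target folder with the file name taken from the link."""
    return os.path.join(target_dir, filename_from_url(dataset_link))

# Hypothetical link, used only to show the behaviour
example_link = "https://example.com/files/housing.tgz?download=1"
print(filename_from_url(example_link))
print(build_save_path(example_link))
```

The result of `build_save_path(link)` can then be passed as the second argument to `download_dataset(link, ...)`, provided the target folder already exists.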


### **Extract data from an archive (zip, tar.gz, tar.bz2, tar.xz)**

In [None]:
# shutil and os are part of Python's standard library, so no pip install is needed
import shutil
import os

def extract_archive(archive_path, extract_to):
    """
    Extract the contents of an archive (zip, tgz , tar.gz, tar.bz2, tar.xz) to a specified directory.

    Parameters:
        archive_path (str): The path to the archive file to be extracted.
        extract_to (str): The directory where the contents of the archive will be extracted.
    """
    # Map file-name suffixes to shutil format names. os.path.splitext would only
    # return the last suffix (".gz" for "data.tar.gz"), so match with endswith instead.
    formats = {
        '.zip': 'zip',
        '.tar.gz': 'gztar',
        '.tgz': 'gztar',
        '.tar.bz2': 'bztar',
        '.tar.xz': 'xztar',  # shutil names this format 'xztar', not 'xz'
    }

    try:
        for suffix, archive_format in formats.items():
            if archive_path.lower().endswith(suffix):
                # Unpack the archive with the matching shutil format
                shutil.unpack_archive(archive_path, extract_to, archive_format)
                print("Extraction complete.")  # Printed only after a successful extraction
                return

        print("Error: Unsupported archive format.")
    except Exception as e:
        print("Error:", e)  # Print an error message if there's an exception during extraction

# Example usage for various archive formats
archive_path = "path/to/your/archive.tar.gz"  # Replace with the path to your archive file
extract_to = "path/to/extract/folder"  # Replace with the directory where you want to extract

extract_archive(archive_path, extract_to)
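
One detail worth checking when dispatching on file extensions: `os.path.splitext` returns only the last suffix of a double extension such as `.tar.gz`, while `shutil.unpack_archive` can infer the format from the file name when no format argument is given. The round trip below is a small sketch that runs entirely in a temporary folder; the file names `demo` and `data.csv` are made up for the example.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny source folder containing one CSV file
    source_dir = os.path.join(workdir, "source")
    os.makedirs(source_dir)
    with open(os.path.join(source_dir, "data.csv"), "w") as f:
        f.write("id,value\n1,42\n")

    # make_archive appends the extension for the chosen format (.tar.gz for 'gztar')
    archive_path = shutil.make_archive(
        os.path.join(workdir, "demo"), "gztar", root_dir=source_dir
    )

    # splitext sees only the last suffix of the double extension
    last_suffix = os.path.splitext(archive_path)[1]

    # With no format argument, unpack_archive picks the unpacker from the file name
    extract_to = os.path.join(workdir, "extracted")
    shutil.unpack_archive(archive_path, extract_to)
    extracted_files = sorted(os.listdir(extract_to))

print(last_suffix)       # the suffix splitext reports for "demo.tar.gz"
print(extracted_files)   # the files found after extraction
```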



## **Web Scraping**
* **Several powerful packages are available for web scraping, such as:**

    * BeautifulSoup (Python).
    * Selenium Webdriver (Java, Python and Ruby).
    * Scrapy.

### **BeautifulSoup:**
* Beautiful Soup is a quick and easy choice for simple HTML/XML parsing tasks on websites that don't rely heavily on JavaScript.

In [None]:
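#!pip install beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup

# BeautifulSoup Scraper Function
def scrape_with_beautifulsoup(url, tag, element='article'):
    """
    Scrape the text of a specific tag inside each matching element using Beautiful Soup 4.

    Parameters:
        url (str): The URL of the website to scrape.
        tag (str): The tag to look for inside each element (for example 'h2').
        element (str): The container element to iterate over (default 'article').

    Returns:
        list: The text of the first matching tag in each element.
    """

    # List that collects the tag texts
    tag_contents = []

    # Download the page and fail early on HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, "html.parser")

    # Iterate over all container elements
    for elm in soup.find_all(element):

        # Find the first occurrence of the tag inside this element
        match = elm.find(tag)

        # Skip elements without the tag instead of raising AttributeError on None
        if match is not None:
            tag_contents.append(match.text)

    # Return all collected contents
    return tag_contents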
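
The same element-then-tag lookup can be tried without any network access by parsing an inline HTML string. This is only a sketch: the sample markup and the helper name `extract_tag_text` are invented for the example, and it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Made-up markup: three articles, one of them without an <h2>
SAMPLE_HTML = """
<html><body>
  <article><h2>First post</h2><p>Intro text.</p></article>
  <article><p>No heading here.</p></article>
  <article><h2>Second post</h2></article>
</body></html>
"""

def extract_tag_text(html, tag, element="article"):
    """Return the text of the first `tag` found inside each `element`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for elm in soup.find_all(element):
        match = elm.find(tag)          # None when the element has no such tag
        if match is not None:
            results.append(match.get_text(strip=True))
    return results

print(extract_tag_text(SAMPLE_HTML, "h2"))  # ['First post', 'Second post']
```

The article without a heading is skipped, which is why `find` results are checked for `None` before reading their text.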

### **Selenium:**
* If the website relies heavily on JavaScript, requires user interactions, or has dynamic content, Selenium is the way to go. 

In [None]:
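#!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium Scraper Function
def scrape_with_selenium(url, tag, element='article'):
    """
    Scrape the text of a specific tag inside each matching element using Selenium.

    Parameters:
        url (str): The URL of the website to scrape.
        tag (str): CSS selector to look for inside each element (for example 'h2').
        element (str): CSS selector of the container elements (default 'article').

    Returns:
        list: The text of the first matching tag in each element.
    """
    # List that collects the tag texts
    tag_contents = []

    # Requires Chrome and a matching ChromeDriver (recent Selenium releases can locate one automatically)
    driver = webdriver.Chrome()
    try:
        driver.get(url)

        # The find_element(s)_by_* helpers were removed in recent Selenium 4 releases; use By locators instead
        for elm in driver.find_elements(By.CSS_SELECTOR, element):
            # find_elements returns an empty list when nothing matches, so no exception is raised
            matches = elm.find_elements(By.CSS_SELECTOR, tag)
            if matches:
                tag_contents.append(matches[0].text)
    finally:
        # Always close the browser, even if scraping fails
        driver.quit()

    return tag_contents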

### **Scrapy:**
* Scrapy is a powerful framework for larger scraping projects; it supports efficient crawling, data extraction, and pipeline processing.

In [None]:
#!pip install scrapy
# Make sure to install the 'scrapy' library if not already installed.

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess

class ArticleSpider(scrapy.Spider):
    name = "article_spider"

    # No custom 'start_requests' is needed: Scrapy requests every URL in 'start_urls'
    # (passed in by scrape_with_scrapy) and sends each response to 'parse'.

    # The 'parse' method is the default callback function that processes the downloaded response.
    def parse(self, response):
        # Use CSS selectors to extract article titles from the response's HTML content.
        for article in response.css("article"):
            title = article.css("h2::text").get()  # Text of the first <h2>, or None if absent
            if title is not None:
                # A spider must yield items (for example dicts) or requests, not a plain list of strings
                yield {"title": title}

# Scrapy Scraper Function
def scrape_with_scrapy(url):
    """
    Scrape article titles from a website using Scrapy.

    Parameters:
        url (str): The URL of the website to scrape.

    Returns:
        list: A list of article titles.

    Note: the underlying Twisted reactor cannot be restarted, so this function
    can be called only once per Python process.
    """
    article_titles = []

    # Collect every scraped item through the 'item_scraped' signal
    def collect_title(item, response, spider):
        article_titles.append(item["title"])

    process = CrawlerProcess()  # Create a CrawlerProcess instance
    crawler = process.create_crawler(ArticleSpider)
    crawler.signals.connect(collect_title, signal=signals.item_scraped)
    process.crawl(crawler, start_urls=[url])  # Pass the start URLs to the spider
    process.start()  # Start crawling; blocks until the crawl is finished
    return article_titles  # Return the collected article titles


## **MySQL Database** 

In [None]:
#!pip install mysql-connector-python
import mysql.connector

# function to retrieve data using condition
def retrieve_data_from_mysql_condition(host, user, password, database, table, condition_column, condition_value):
    """
    Retrieve data from a MySQL database table based on a specified condition.

    Parameters:
        host (str): MySQL host name.
        user (str): MySQL username.
        password (str): MySQL password.
        database (str): Name of the MySQL database.
        table (str): Name of the table to query.
        condition_column (str): Column name for the condition.
        condition_value (str): Value for the condition.

    Returns:
        list: List of tuples representing the retrieved data rows.
    """
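    # Table and column names cannot be sent as query parameters, so check that
    # they are plain identifiers before placing them in the SQL text.
    if not (table.isidentifier() and condition_column.isidentifier()):
        print("Error: invalid table or column name.")
        return None

    connection = None
    try:
        connection = mysql.connector.connect(
            host=host,
            user=user,
            password=password,
            database=database
        )

        cursor = connection.cursor()

        # Only the condition value uses a placeholder (%s); the identifiers were validated above
        query = f"SELECT * FROM `{table}` WHERE `{condition_column}` = %s"

        # Execute the query with the condition value
        cursor.execute(query, (condition_value,))

        # Fetch all rows as a list of tuples
        retrieved_data = cursor.fetchall()

        cursor.close()
        return retrieved_data

    except mysql.connector.Error as err:
        print("Error:", err)
        return None

    finally:
        # Close the connection even when the query fails
        if connection is not None:
            connection.close()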


# function to retrieve data using query
def retrieve_data_from_mysql_query(host, user, password, database, query):
    """
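    Retrieve data from a MySQL database by running a complete SQL query.

    The query is executed exactly as given, so it should not be built from untrusted input.

    Parameters:
        host (str): MySQL host name.
        user (str): MySQL username.
        password (str): MySQL password.
        database (str): Name of the MySQL database.
        query (str): The full SQL query to execute (for example a SELECT statement).

    Returns:
        list: List of tuples representing the retrieved data rows.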
    """
    try:
        connection = mysql.connector.connect(
            host=host,
            user=user,
            password=password,
            database=database
        )
        
        cursor = connection.cursor()
        
        # Execute the query exactly as provided by the caller
        cursor.execute(query)
        
        # Fetch all rows as a list of tuples
        retrieved_data = cursor.fetchall()
        
        cursor.close()
        connection.close()
        
        return retrieved_data
    
    except mysql.connector.Error as err:
        print("Error:", err)
        return None
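
The placeholder pattern used in `retrieve_data_from_mysql_condition` can be tried without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` module purely as a stand-in, so it is not the MySQL connector: `sqlite3` writes placeholders as `?` where `mysql.connector` uses `%s`. The `customers` table, its rows, and the helper `retrieve_where` are invented for the example.

```python
import sqlite3

# In-memory database with a small made-up table
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cursor.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Cairo"), ("Bob", "Giza"), ("Carol", "Cairo")],
)
connection.commit()
cursor.close()

def retrieve_where(connection, city):
    """Return (name, city) rows whose city equals the given value."""
    cursor = connection.cursor()
    # The value travels separately from the SQL text, so it is never parsed as SQL
    cursor.execute(
        "SELECT name, city FROM customers WHERE city = ? ORDER BY id", (city,)
    )
    rows = cursor.fetchall()
    cursor.close()
    return rows

cairo_rows = retrieve_where(connection, "Cairo")
# An injection attempt is treated as an ordinary string and matches nothing
injection_rows = retrieve_where(connection, "Cairo' OR '1'='1")
connection.close()

print(cairo_rows)       # [('Alice', 'Cairo'), ('Carol', 'Cairo')]
print(injection_rows)   # []
```

Placeholders protect values only. Table and column names still have to be validated separately, because they cannot be passed as parameters.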

