
# Data Extraction and Transformation

This notebook demonstrates extract, transform, load (ETL) workflows for several data sources:
- PostgreSQL databases
- Web scraping
- IoT device APIs (e.g., ThingSpeak)
- REST APIs

Each section contains:
1. **Data Extraction:** Code to fetch data from the respective source.
2. **Data Transformation:** Cleaning and structuring the data.
3. **Exporting Data:** Saving the transformed data to a CSV file for further analysis.
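
The three steps above can be sketched as three small functions. This is a minimal illustration using only the standard library, not the code used in the sections that follow: the function names, the in-memory rows, and the threshold are assumptions chosen for the example.

```python
import csv
import io


def extract():
    """Stand-in for a real source: returns rows as dictionaries."""
    return [
        {"id": 1, "avg_length": 120.0},
        {"id": 2, "avg_length": 310.5},
    ]


def transform(rows, threshold=150):
    """Keep rows above the threshold and sort them in descending order."""
    kept = [row for row in rows if row["avg_length"] > threshold]
    return sorted(kept, key=lambda row: row["avg_length"], reverse=True)


def export(rows, handle):
    """Write rows as CSV to any file-like object."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Run the pipeline end to end, writing to an in-memory buffer
buffer = io.StringIO()
export(transform(extract()), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping each step in its own function makes it possible to swap the source (database, web page, API) without touching the transformation or export code.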

### Prerequisites
Install the required libraries before running the notebook. The leading `!` runs the command from a notebook cell; omit it when installing from a terminal:
```bash
!pip install psycopg2-binary requests beautifulsoup4 pandas
```



## Example 1: Extracting Data from PostgreSQL

We connect to a public PostgreSQL server, retrieve data, transform it, and save it to a CSV file.

**Database Details:**
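- Host: `hh-pgsql-public.ebi.ac.uk`
- Port: `5432`
- Database: `pfmegrnargs`
- User: `reader`
- Password: set in the `password` variable in the cell below

These values match the connection code in the cell below. To run the example against a different database, replace them with your own connection details.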
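
For a private database, connection settings are usually kept out of the notebook itself. The sketch below reads them from environment variables, assuming the libpq naming convention (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`). The helper name `pg_connection_kwargs` and the fallback values are hypothetical and not part of the original notebook.

```python
import os


def pg_connection_kwargs(env=None):
    """Build keyword arguments for psycopg2.connect() from environment variables.

    The variable names follow the libpq convention (PGHOST, PGPORT, ...).
    The helper name and the fallback values are illustrative assumptions.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("PGHOST", "localhost"),
        "port": int(env.get("PGPORT", "5432")),
        "database": env.get("PGDATABASE", "postgres"),
        "user": env.get("PGUSER", "postgres"),
        # No default: a missing password should raise KeyError, not be guessed.
        "password": env["PGPASSWORD"],
    }


# Example with an explicit mapping instead of the real environment
settings = pg_connection_kwargs({"PGHOST": "db.example.org", "PGPASSWORD": "secret"})
print({key: value for key, value in settings.items() if key != "password"})
```

The resulting dictionary uses the same keyword names as the connection code in this section, so it could be passed as `psycopg2.connect(**pg_connection_kwargs())`.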


In [None]:
import pandas as pd
import psycopg2

# PostgreSQL connection details
host = "hh-pgsql-public.ebi.ac.uk"
port = 5432
database = "pfmegrnargs"
user = "reader"
password = "NWDMCE5xdipIjRrp"

connection = None  # Defined up front so the finally block works even if connect() fails

try:
    # Connect to PostgreSQL
    connection = psycopg2.connect(
        host=host,
        port=port,
        database=database,
        user=user,
        password=password
    )
    print("Connected to the PostgreSQL database!")

    # Query to fetch data (use the correct columns based on schema inspection)
    query = """
    SELECT id, description, avg_length, min_length, max_length, num_sequences, num_organisms
    FROM rnc_database
    LIMIT 100;
    """
    data_frame = pd.read_sql_query(query, connection)
    print("Data Extracted:")
    print(data_frame.head())  # A bare expression inside try displays nothing, so print explicitly

except Exception as e:
    print(f"Error: {e}")

finally:
    if connection:
        connection.close()
        print("PostgreSQL connection is closed.")

# Data Transformation
try:
    # 1. Add a new column 'length_category' based on the average length of sequences
    data_frame['length_category'] = data_frame['avg_length'].apply(lambda x: 'Short' if x < 200 else 'Long')

    # 2. Filter data: Retain only entries where the average length is greater than 150
    filtered_data = data_frame[data_frame['avg_length'] > 150]

    # 3. Sort data: Sort by average length in descending order
    sorted_data = filtered_data.sort_values(by='avg_length', ascending=False)

    # 4. Select relevant columns for export
    transformed_data = sorted_data[['id', 'description', 'avg_length', 'length_category', 'num_sequences', 'num_organisms']]

    # Display the transformed data
    print("Transformed Data:")
    print(transformed_data.head())

    # Save the transformed data to a CSV file (only reached if the transformation succeeded)
    csv_file_path = "/content/transformed_data.csv"
    transformed_data.to_csv(csv_file_path, index=False)
    print(f"Transformed data written to {csv_file_path}")

except KeyError as e:
    print(f"KeyError: {e}. Ensure the column names match the table structure.")


Connected to the PostgreSQL database!
Data Extracted:
PostgreSQL connection is closed.
Transformed Data:
    id                                        description  avg_length  \
30  50  provides comprehensive genomic view of plant l...      6659.0   
45   7  is a database providing comprehensive annotati...      3086.0   
49  28  a collaborative effort between leading researc...      2340.0   
46  53  is a database of experimentally validated long...      2311.0   
11  48  provides computational access to molecular-int...      2251.0   

   length_category  num_sequences  num_organisms  
30            Long         936926             80  
45            Long             62             10  
49            Long          11124              1  
46            Long            933              3  
11            Long             95              8  
Transformed data written to /content/transformed_data.csv





## Example 2: Extracting Data from Web Pages (Web Scraping)

We scrape quotes and their authors from the [Quotes to Scrape](http://quotes.toscrape.com) website.

**Libraries Used:** `requests`, `BeautifulSoup` (from `beautifulsoup4`), `pandas`
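
The scraping cell below collects quote texts and author names with two separate `find_all` calls and relies on both lists having the same length and order. `pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length, but the message does not say why. A minimal sketch of making that assumption explicit is shown here; the helper name `pair_rows` is hypothetical and the strings are placeholders, not scraped data.

```python
def pair_rows(quotes, authors):
    """Pair each quote with its author, refusing to guess on a mismatch."""
    if len(quotes) != len(authors):
        raise ValueError(
            f"Found {len(quotes)} quotes but {len(authors)} authors; "
            "the page layout may have changed."
        )
    return [{"quote": q, "author": a} for q, a in zip(quotes, authors)]


# Placeholder strings standing in for the text of the scraped elements
rows = pair_rows(["First quote", "Second quote"], ["Author A", "Author B"])
print(rows)
```

With the scraped elements, the call might look like `pd.DataFrame(pair_rows([q.text for q in quotes], [a.text for a in authors]))`.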


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Web scraping URL for quotes
url = "http://quotes.toscrape.com"

# Send GET request to the website
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the page content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract quotes and authors
    quotes = soup.find_all('span', class_='text')
    authors = soup.find_all('small', class_='author')

    # Create a DataFrame
    data = {
        'quote': [quote.text for quote in quotes],
        'author': [author.text for author in authors]
    }
    data_frame = pd.DataFrame(data)
    print("Data Extracted:")
    print(data_frame.head())

    # Save the extracted data to CSV
    csv_file_path = 'quotes_data.csv'
    data_frame.to_csv(csv_file_path, index=False)
    print(f'Extracted data written to {csv_file_path}')
else:
    print(f"Failed to fetch data from website. Status code: {response.status_code}")


Data Extracted:
                                               quote           author
0  “The world as we have created it is a process ...  Albert Einstein
1  “It is our choices, Harry, that show what we t...     J.K. Rowling
2  “There are only two ways to live your life. On...  Albert Einstein
3  “The person, be it gentleman or lady, who has ...      Jane Austen
4  “Imperfection is beauty, madness is genius and...   Marilyn Monroe
Extracted data written to quotes_data.csv



## Example 3: Extracting Data from IoT Devices (Public APIs)

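We retrieve the ten most recent entries from a public ThingSpeak IoT channel and apply a threshold-based transformation to `field1`. The code treats `field1` as a temperature in °C for illustration. ThingSpeak field meanings are defined per channel, and the constant value of 270 in the sample output suggests that `field1` on this channel is not a Celsius temperature, so check the channel's field names before interpreting the results.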

**API URL:** `https://api.thingspeak.com/channels/12397/feeds.json?results=10`
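
ThingSpeak columns are named generically (`field1` to `field8`), so it helps to look up what each one measures. The sketch below assumes the JSON response carries a `channel` object next to `feeds`, with one entry per field giving its human-readable name. The helper name `field_labels` and the sample payload are hypothetical; the payload only mimics the response shape and is not real channel data.

```python
def field_labels(payload):
    """Map 'field1'..'field8' to the names declared in the channel metadata."""
    channel = payload.get("channel", {})
    return {
        key: name
        for key, name in channel.items()
        if key.startswith("field") and key[5:].isdigit()
    }


# Hypothetical payload with the same shape as a feeds.json response
sample = {
    "channel": {
        "id": 1,
        "name": "Demo",
        "field1": "Temperature (C)",
        "field2": "Humidity (%)",
    },
    "feeds": [
        {"created_at": "2024-12-05T08:36:46Z", "entry_id": 1, "field1": "21.5", "field2": "40"}
    ],
}
labels = field_labels(sample)
print(labels)
```

Applied to the parsed response, the mapping could be used to rename columns, for example `data_frame.rename(columns=field_labels(data))`.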


In [None]:
import requests
import pandas as pd

# Updated ThingSpeak API URL with a valid public channel
api_url = "https://api.thingspeak.com/channels/12397/feeds.json?results=10"  # Example channel with temperature data

# Send GET request to the ThingSpeak API
response = requests.get(api_url)

# Check if the request was successful
if response.status_code == 200:
    # Load the JSON data from the response
    data = response.json()

    # Extract the feed data (the sensor data from the API response)
    feeds = data['feeds']

    # Convert the feed data into a Pandas DataFrame
    data_frame = pd.DataFrame(feeds)
    print("Data Extracted:")
    print(data_frame.head())

    # Data Transformation: Example transformations
    try:
        # Convert field1 to numeric (temperature example)
        data_frame['field1'] = pd.to_numeric(data_frame['field1'], errors='coerce')

        # 1. Add a new column 'temp_status' indicating if temperature is above or below 25°C
        data_frame['temp_status'] = data_frame['field1'].apply(lambda x: 'Above 25°C' if x > 25 else 'Below 25°C')

        # 2. Filter data: Retain only rows where the temperature is above 25°C
        filtered_data = data_frame[data_frame['field1'] > 25]

        # 3. Sort data: Sort by the temperature (field1) in descending order
        sorted_data = filtered_data.sort_values(by='field1', ascending=False)

        # 4. Select relevant columns for export
        transformed_data = sorted_data[['created_at', 'field1', 'temp_status']]  # Adjust columns based on your IoT data
        print("Transformed Data:")
        print(transformed_data.head())  # Display transformed data

        # Save the transformed data to CSV
        csv_file_path = 'iot_transformed_data.csv'
        transformed_data.to_csv(csv_file_path, index=False)
        print(f'IoT device data written to {csv_file_path}')

    except KeyError as e:
        print(f"KeyError: {e}. Ensure the column names match the API structure.")
else:
    print(f"Failed to fetch data from ThingSpeak. Status code: {response.status_code}")


Data Extracted:
             created_at  entry_id field1 field2 field3 field4 field5 field6  \
0  2024-12-05T08:36:46Z   5229127    270      0     95   32.6      0  29.39   
1  2024-12-05T08:37:46Z   5229128    270      0     95   32.6      0  29.39   
2  2024-12-05T08:38:46Z   5229129    270      0     95   32.5      0  29.39   
3  2024-12-05T08:39:46Z   5229130    270      0     95   32.5      0  29.39   
4  2024-12-05T08:40:46Z   5229131    270      0     95   32.6      0  29.39   

  field7 field8  
0  4.069      0  
1  4.068      0  
2  4.064      0  
3  4.065      0  
4  4.061      0  
Transformed Data:
             created_at  field1 temp_status
0  2024-12-05T08:36:46Z     270  Above 25°C
1  2024-12-05T08:37:46Z     270  Above 25°C
2  2024-12-05T08:38:46Z     270  Above 25°C
3  2024-12-05T08:39:46Z     270  Above 25°C
4  2024-12-05T08:40:46Z     270  Above 25°C
IoT device data written to iot_transformed_data.csv
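
One detail of the transformation above: `pd.to_numeric(..., errors='coerce')` turns unparseable readings into `NaN`, and because any comparison with `NaN` is false, the lambda labels those rows 'Below 25°C'. A reading of exactly 25 gets the same label. A minimal sketch of a labelling function that keeps missing values visible follows; the function name `temp_status`, the 'Unknown' label, and the sample readings are assumptions for illustration.

```python
import math


def temp_status(value, threshold=25.0):
    """Label a reading relative to the threshold; missing values stay visible."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "Unknown"
    if value > threshold:
        return f"Above {threshold:g}°C"
    return f"At or below {threshold:g}°C"


# Illustrative readings: a high value, the boundary, and a missing value
readings = [270.0, 25.0, float("nan")]
statuses = [temp_status(reading) for reading in readings]
print(statuses)
```

In the notebook this could replace the lambda, for example `data_frame['field1'].apply(temp_status)`.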



## Example 4: Extracting Data from REST APIs

We fetch data from a sample REST API, perform transformations, and save the results to a CSV file.

**API URL:** `https://jsonplaceholder.typicode.com/users`
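
The user records returned by this API contain nested objects such as `address` and `company`, which pandas stores as dictionaries in a single column (visible in the output further down). Writing such a column to CSV stores the dictionary's text representation rather than separate values. The sketch below flattens nested dictionaries into scalar columns before building a DataFrame. The helper name `flatten` and the sample record are hypothetical; the record mirrors the structure only and its values are made up.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into a single level of scalar values."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing keys with the parent name
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative record shaped like a nested user object (values are made up)
user = {
    "id": 1,
    "name": "Example User",
    "address": {
        "street": "Main Street",
        "city": "Exampleville",
        "geo": {"lat": "0.0", "lng": "0.0"},
    },
    "company": {"name": "Example Co"},
}
flat_user = flatten(user)
print(flat_user)
```

pandas offers a comparable built-in, `pd.json_normalize(data, sep="_")`, which may be preferable when the whole response is a list of nested records.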


In [None]:
import requests
import pandas as pd

# API URL for JSONPlaceholder
api_url = "https://jsonplaceholder.typicode.com/users"

# Send a GET request to the API to fetch data
response = requests.get(api_url)

# Check if the request was successful
if response.status_code == 200:
    # Load the JSON data from the response
    data = response.json()

    # Convert the data into a Pandas DataFrame
    data_frame = pd.DataFrame(data)
    print("Data Extracted:")
    print(data_frame.head())  # Display the first few rows of the extracted data

    # Data Transformation: Example transformations
    try:
        # 1. Example transformation - Create a new column 'full_name' by combining 'name' and 'company'
        data_frame['full_name'] = data_frame['name'] + " (" + data_frame['company'].apply(lambda x: x['name']) + ")"

        # 2. Filter data: Retain only entries where the city of the user is 'Lebsackbury'
        filtered_data = data_frame[data_frame['address'].apply(lambda x: x['city'] == 'Lebsackbury')]

        # 3. Sort data: Sort by the user's name alphabetically
        sorted_data = filtered_data.sort_values(by='name')

        # 4. Select relevant columns for export
        transformed_data = sorted_data[['id', 'full_name', 'email', 'address', 'phone']]  # Adjust columns based on your needs
        print("Transformed Data:")
        print(transformed_data.head())  # Display transformed data

        # Save the transformed data to a CSV file (only reached if the transformation succeeded)
        csv_file_path = 'transformed_api_data.csv'
        transformed_data.to_csv(csv_file_path, index=False)
        print(f'Transformed data written to {csv_file_path}')

    except KeyError as e:
        print(f"KeyError: {e}. Ensure the column names match the API structure.")

else:
    print(f"Failed to fetch data from API. Status code: {response.status_code}")


Data Extracted:
   id              name   username                      email  \
0   1     Leanne Graham       Bret          Sincere@april.biz   
1   2      Ervin Howell  Antonette          Shanna@melissa.tv   
2   3  Clementine Bauch   Samantha         Nathan@yesenia.net   
3   4  Patricia Lebsack   Karianne  Julianne.OConner@kory.org   
4   5  Chelsey Dietrich     Kamren   Lucio_Hettinger@annie.ca   

                                             address                  phone  \
0  {'street': 'Kulas Light', 'suite': 'Apt. 556',...  1-770-736-8031 x56442   
1  {'street': 'Victor Plains', 'suite': 'Suite 87...    010-692-6593 x09125   
2  {'street': 'Douglas Extension', 'suite': 'Suit...         1-463-123-4447   
3  {'street': 'Hoeger Mall', 'suite': 'Apt. 692',...      493-170-9623 x156   
4  {'street': 'Skiles Walks', 'suite': 'Suite 351...          (254)954-1289   

         website                                            company  
0  hildegard.org  {'name': 'Romaguera-Crona', 'c