**Importing Libraries and Setting up API URL**



In [11]:
import requests
import pandas as pd

API_URL = "https://api.openbrewerydb.org/v1/breweries/random?size=50"


We import the libraries we need: requests for sending HTTP requests to the API and pandas for data manipulation. We also set the API URL used to fetch the data. It points to the API's random endpoint, and size=50 asks for 50 breweries per call.

**Defining the Function to Fetch Data**

In [12]:
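def fetch_data(url, iterations):
    """Request `url` `iterations` times and collect all returned records in one list."""
    all_data = []
    for _ in range(iterations):
        response = requests.get(url)
        if response.status_code == 200:
            # Each successful response is a JSON list of brewery records
            data = response.json()
            all_data.extend(data)
        else:
            # Report the failure and continue with the next request
            print(f"Failed to retrieve data from {url} (status {response.status_code})")
    return all_data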


We define a function that fetches data from the OpenBreweryDB API. It sends one request per iteration and appends the returned records to a single list. A request that does not return status 200 is reported and skipped.
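
The loop itself can be tried without network access. The following is a minimal sketch, not the notebook's code: collect_records and fetch_batch are hypothetical names, and fetch_batch stands in for the HTTP call, returning a list of records or None when a request fails.

```python
def collect_records(fetch_batch, iterations):
    """Call fetch_batch() `iterations` times and merge the results.

    fetch_batch is a hypothetical callable that returns a list of record
    dicts, or None when the request failed.
    """
    all_data = []
    failed = 0
    for _ in range(iterations):
        batch = fetch_batch()
        if batch is None:
            failed += 1  # count the failure and keep going
            continue
        all_data.extend(batch)
    return all_data, failed


# Stand-in for the API: two good batches and one failure, no network needed
batches = iter([[{"id": "a"}, {"id": "b"}], None, [{"id": "b"}, {"id": "c"}]])
records, failed = collect_records(lambda: next(batches), iterations=3)
print(len(records), failed)
```

In this sketch the record with id "b" is collected twice, because separate batches can overlap. The same can happen with randomly selected breweries, which is why duplicates are dropped later in the notebook.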

**Main Function for Data Processing**

In [13]:
def main():
    total_records = 10000
    records_per_iteration = 50
    iterations = total_records // records_per_iteration

    data = fetch_data(API_URL, iterations)

    # Transform the data into a pandas DataFrame
    df = pd.DataFrame(data)

    # Check the shape of the DataFrame
    print(f"Shape of the DataFrame: {df.shape}")

    # Drop any duplicates from the DataFrame if there are any
    df.drop_duplicates(inplace=True)

    # Replace missing phone numbers with a placeholder string
    # (assigning the result back avoids calling an inplace method on a column selection)
    df['phone'] = df['phone'].fillna("Phone number unknown")

    # Replace missing website URLs with a placeholder string
    df['website_url'] = df['website_url'].fillna("Website URL unknown")

    # Sample data from the DataFrame
    print(df.head())

    return df

if __name__ == '__main__':
    df = main()


Shape of the DataFrame: (10000, 16)
                                     id                                name  \
0  ba9ac1f0-2b7d-4cc2-ab55-9eaa479fee80              Crooked Pecker Brewing   
1  51652e93-207c-47c3-8c98-dd4f1c81fd39               Lumber Barons Brewery   
2  a0d63612-aa12-440a-89fe-da702b1bbeec  Inside Passage Brewing Company LLC   
3  5235f708-e30a-49ae-8681-0ff7167c8e76              Dock Street Brewing Co   
4  99eacd23-0305-4a47-87ed-086e0da1ae35                         Browar Roch   

  brewery_type          address_1 address_2 address_3            city  \
0     planning               None      None      None         Newbury   
1      brewpub   804 E Midland St      None      None        Bay City   
2       closed               None      None      None       Ketchikan   
3     planning               None      None      None    Philadelphia   
4        micro  Nowe Rochowice 22      None      None  Nowe Rochowice   

  state_province postal_code        country     lo

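This is the main function for data processing. It fetches the data from the API (200 requests of 50 records each), transforms it into a pandas DataFrame, drops duplicate rows, and replaces null values in the 'phone' and 'website_url' columns with placeholder strings. The shape is printed before duplicates are dropped, so the (10000, 16) shown above is the raw number of fetched rows, not the number of unique breweries.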
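
The cleaning steps can be checked on a small hand-made DataFrame. This is a sketch under stated assumptions: the three records and the name df_sample are made up for illustration and only carry a few of the 16 columns.

```python
import pandas as pd

# Hand-made records standing in for API results (values are illustrative)
sample = [
    {"id": "1", "name": "Alpha Brewing", "phone": "5551234", "website_url": None},
    {"id": "1", "name": "Alpha Brewing", "phone": "5551234", "website_url": None},
    {"id": "2", "name": "Beta Brewery", "phone": None, "website_url": "http://beta.example"},
]
df_sample = pd.DataFrame(sample)

# Remove rows that are identical in every column
df_sample = df_sample.drop_duplicates()

# Fill missing values by assigning the result back to the column
df_sample["phone"] = df_sample["phone"].fillna("Phone number unknown")
df_sample["website_url"] = df_sample["website_url"].fillna("Website URL unknown")

print(df_sample.shape)
print(df_sample)
```

The repeated row for id "1" is removed, leaving two rows, and each remaining null is replaced by its placeholder string.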

By dividing the code into multiple cells, we can better understand the purpose and functionality of each part.

In [14]:
import sqlite3

# Create a new SQLite database called "brewery_data.db"
conn = sqlite3.connect('brewery_data.db')
cur = conn.cursor()

# Create "US_data" and "Non_US_data" tables in the SQLite database
cur.execute('''
    CREATE TABLE IF NOT EXISTS US_data (
        id INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT,
        state TEXT,
        country TEXT,
        phone TEXT
    )
''')

cur.execute('''
    CREATE TABLE IF NOT EXISTS Non_US_data (
        id INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT,
        state TEXT,
        country TEXT,
        phone TEXT
    )
''')

# Load the transformed data into the appropriate SQLite tables based on the country
for index, row in df.iterrows():
    if row['country'] == 'United States':
        cur.execute('''
            INSERT INTO US_data (name, city, state, country, phone)
            VALUES (?, ?, ?, ?, ?)
        ''', (row['name'], row['city'], row['state'], row['country'], row['phone']))
    else:
        cur.execute('''
            INSERT INTO Non_US_data (name, city, state, country, phone)
            VALUES (?, ?, ?, ?, ?)
        ''', (row['name'], row['city'], row['state'], row['country'], row['phone']))

# Commit the changes and close the connection
conn.commit()
conn.close()


We establish a connection to an SQLite database named "brewery_data.db" and then create two tables within it: "US_data" and "Non_US_data".

These tables will store information about breweries, including their names, cities, states, countries, and phone numbers.

We then iterate through the DataFrame df, which contains the brewery data fetched from the OpenBreweryDB API. For each row, we check the value in the 'country' column.

1.   If the country is 'United States', the script inserts the corresponding brewery information into the "US_data" table.
2.   If the country is not 'United States', the script inserts the information into the "Non_US_data" table.

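The data is read from the DataFrame row by row, and each INSERT supplies values for the name, city, state, country, and phone columns. The id column is left out of the INSERT, so SQLite assigns it automatically. Because the tables are created with IF NOT EXISTS and rows are always appended, rerunning this cell inserts the same data again, which is a likely reason the counts reported below exceed the 10,000 records fetched.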
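
The routing by country can also be sketched with executemany on an in-memory database. This is an illustrative variant, not the notebook's code: the three rows are made up, and clearing each table before loading is an added assumption so that the load can be rerun without accumulating rows.

```python
import sqlite3

# Illustrative rows in (name, city, state, country, phone) order
rows = [
    ("Alpha Brewing", "Denver", "Colorado", "United States", "5551234"),
    ("Beta Brewery", "Dublin", "Dublin", "Ireland", "Phone number unknown"),
    ("Gamma Ales", "Austin", "Texas", "United States", "5559876"),
]

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
cur = conn.cursor()
for table in ("US_data", "Non_US_data"):
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            id INTEGER PRIMARY KEY,
            name TEXT, city TEXT, state TEXT, country TEXT, phone TEXT
        )
    """)
    # Assumption for this sketch: empty the table so a rerun does not append again
    cur.execute(f"DELETE FROM {table}")

# Split the rows by the country field (index 3)
us_rows = [r for r in rows if r[3] == "United States"]
other_rows = [r for r in rows if r[3] != "United States"]

insert_sql = "INSERT INTO {} (name, city, state, country, phone) VALUES (?, ?, ?, ?, ?)"
cur.executemany(insert_sql.format("US_data"), us_rows)
cur.executemany(insert_sql.format("Non_US_data"), other_rows)
conn.commit()

us_count = cur.execute("SELECT COUNT(*) FROM US_data").fetchone()[0]
non_us_count = cur.execute("SELECT COUNT(*) FROM Non_US_data").fetchone()[0]
print(us_count, non_us_count)
conn.close()
```

Two of the sample rows land in "US_data" and one in "Non_US_data". The table names are inserted with string formatting only because they are fixed constants; the row values still go through ? placeholders.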


In [15]:
import sqlite3
from tabulate import tabulate

# Establish a connection to the SQLite database
conn = sqlite3.connect('brewery_data.db')
cur = conn.cursor()

# Query to count the total number of records in "US_data" table
cur.execute("SELECT COUNT(*) FROM US_data")
us_data_count = cur.fetchone()[0]
print(f"Total number of records in US_data: {us_data_count}")

# Query to count the total number of records in "Non_US_data" table
cur.execute("SELECT COUNT(*) FROM Non_US_data")
non_us_data_count = cur.fetchone()[0]
print(f"Total number of records in Non_US_data: {non_us_data_count}")

# Query to count the total number of records grouped by state in "US_data" table
cur.execute("SELECT state, COUNT(*) FROM US_data GROUP BY state")
state_counts = cur.fetchall()
print("Total number of records grouped by state in US_data:")
headers = ["State", "Count"]
print(tabulate(state_counts, headers=headers, tablefmt="grid"))

# Close the connection to the database
conn.close()


Total number of records in US_data: 16807
Total number of records in Non_US_data: 567
Total number of records grouped by state in US_data:
+----------------------+---------+
| State                |   Count |
| Utah                 |       2 |
+----------------------+---------+
| Alabama              |      92 |
+----------------------+---------+
| Alaska               |     116 |
+----------------------+---------+
| Arizona              |     257 |
+----------------------+---------+
| Arkansas             |      96 |
+----------------------+---------+
| California           |    1919 |
+----------------------+---------+
| Colorado             |     922 |
+----------------------+---------+
| Connecticut          |     197 |
+----------------------+---------+
| Delaware             |      65 |
+----------------------+---------+
| District of Columbia |      32 |
+----------------------+---------+
| Florida              |     659 |
+----------------------+---------+
| Georgia            

**Querying for Counts**

We run two queries to count the total number of records in the "US_data" and "Non_US_data" tables.

1.   For the "US_data" table, the count is the number of stored records for breweries located in the United States.
2.   For the "Non_US_data" table, the count is the number of stored records for breweries located outside the United States.

We fetch each count with the cur.fetchone() method and store the results in the variables us_data_count and non_us_data_count.
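
A small sketch shows why the code indexes the result with [0]: cur.fetchone() returns the row as a tuple, even when the query selects a single value. The in-memory table and its two rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE US_data (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
cur.executemany(
    "INSERT INTO US_data (name, state) VALUES (?, ?)",
    [("Alpha Brewing", "Colorado"), ("Gamma Ales", "Texas")],
)

cur.execute("SELECT COUNT(*) FROM US_data")
row = cur.fetchone()  # a one-element tuple holding the count
count = row[0]        # the count itself
print(row, count)
conn.close()
```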

**Querying and Grouping Data**

We run a query that counts the records in the "US_data" table grouped by the 'state' column. This gives a breakdown of how many brewery records are stored for each state in the United States. The result, a list of (state, count) tuples, is stored in the state_counts variable.
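
The grouped query can be tried on a few made-up rows. This sketch adds an ORDER BY clause, which the notebook's query does not have, so that the states with the most records come first; the column alias n is a name chosen here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE US_data (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
cur.executemany(
    "INSERT INTO US_data (name, state) VALUES (?, ?)",
    [
        ("Alpha Brewing", "Colorado"),
        ("Gamma Ales", "Texas"),
        ("Delta Beer", "Colorado"),
        ("Epsilon Brewing", "Utah"),
    ],
)

# One output row per state; ties in the count are ordered by state name
cur.execute("""
    SELECT state, COUNT(*) AS n
    FROM US_data
    GROUP BY state
    ORDER BY n DESC, state
""")
state_counts = cur.fetchall()
print(state_counts)
conn.close()
```

Without ORDER BY, SQLite does not guarantee the order in which the groups are returned, so an explicit sort is the safer choice when the order of the table matters.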

**Tabulating Data**

We use the tabulate library to format and print the results as a table for better readability. The headers are set to "State" and "Count", and the table uses the "grid" format.