# ETL: Universities API → SQLite (Jupyter Notebook)

This notebook implements an ETL pipeline that:
1. **Extracts** the data from the public API `http://universities.hipolabs.com/search`;
2. **Transforms** and normalizes the records;
3. **Loads** the data into a local SQLite database;
4. Runs sample SQL queries that demonstrate how to use the database.

> Note: this notebook makes HTTP requests to the public API. Run the cells in your local environment (with internet access).

In [1]:
# Requirements: install the packages if needed
# Run only if needed: !pip install requests pandas tqdm
try:
    import sqlite3

    import pandas as pd
    import requests
except ImportError:
    # Only a missing package is expected here, so catch ImportError specifically
    print('You may need to install the dependencies: pip install requests pandas tqdm')
    raise

In [2]:

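# Note: this cell downloads the full JSON dump from the Hipo/university-domains-list
# repository instead of calling the search endpoint named in the introduction.
BASE_URL = "https://raw.githubusercontent.com/Hipo/university-domains-list/refs/heads/master/world_universities_and_domains.json"

# A timeout keeps the cell from hanging indefinitely on a stalled connection
response = requests.get(BASE_URL, timeout=30)
if response.status_code == 200:
    universities = response.json()
    print(f"Total records collected: {len(universities)}")
else:
    raise RuntimeError(f"Failed to download the JSON: {response.status_code}")


Total records collected: 10185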


In [3]:
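def transform_data(universities):
    """Flatten each record into a (name, country, state_province, web_pages, domains) tuple."""
    return [
        (
            uni.get("name"),
            uni.get("country"),
            uni.get("state-province"),
            # `or []` also covers keys that are present but null
            ",".join(uni.get("web_pages") or []),
            ",".join(uni.get("domains") or [])
        )
        for uni in universities
    ]

universities_data = transform_data(universities)
print(universities_data[:5])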

[('Engineering Institute of Technology', 'Australia', None, 'https://www.eit.edu.au/', 'student.eit.edu.au'), ('Universitas Nusa Putra', 'Indonesia', None, 'https://nusaputra.ac.id', 'nusaputra.ac.id'), ('University of Kyrenia', 'Turkey', None, 'https://kyrenia.edu.tr', 'std.kyrenia.edu.tr,kyrenia.edu.tr'), ('Regent University College of Science and Technology', 'Ghana', None, 'https://regent.edu.gh', 'regent.edu.gh'), ('Wroclaw Akademia Biznesu', 'Poland', None, 'https://wab.edu.pl', 'student.wab.edu.pl,wab.edu.pl')]
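
The load step named in the introduction is not shown among the cells here. As a minimal sketch only, assuming a `universities` table whose columns match the ones queried later (`id`, `name`, `country`, `state_province`, `web_pages`, `domains`), it could look like the following. The function name `load_data` and the exact schema are assumptions, not the notebook's original code.

```python
import sqlite3

DB_PATH = "universities.db"  # assumed to match the path used by the query cells


def load_data(rows, db_path=DB_PATH):
    """Hypothetical load step: recreate the table and bulk-insert the tuples.

    `rows` is a list of (name, country, state_province, web_pages, domains)
    tuples, the shape produced by transform_data.
    """
    conn = sqlite3.connect(db_path)
    try:
        # `with conn` wraps the statements in one transaction and commits on success
        with conn:
            # Dropping first makes the cell safe to re-run without duplicating rows
            conn.execute("DROP TABLE IF EXISTS universities")
            conn.execute(
                """
                CREATE TABLE universities (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    name TEXT,
                    country TEXT,
                    state_province TEXT,
                    web_pages TEXT,
                    domains TEXT
                )
                """
            )
            conn.executemany(
                """
                INSERT INTO universities
                    (name, country, state_province, web_pages, domains)
                VALUES (?, ?, ?, ?, ?)
                """,
                rows,
            )
        # Return the row count so the caller can sanity-check the load
        return conn.execute("SELECT COUNT(*) FROM universities").fetchone()[0]
    finally:
        conn.close()
```

In the notebook this would be called as `load_data(universities_data)`, and the returned count should equal the number of records collected.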


In [4]:


DB_PATH = 'universities.db'

# def query_total_by_country(limit=20):
#     conn = sqlite3.connect(DB_PATH)
#     df = pd.read_sql_query("""
#         SELECT country, COUNT(*) AS total
#         FROM universities
#         GROUP BY country
#         ORDER BY total DESC
#         LIMIT :limit
#     """, conn, params={'limit': limit})
#     conn.close()
#     return df

def query_universities_by_country(country):
    """Return all universities in `country` (case-insensitive match), sorted by name."""
    conn = sqlite3.connect(DB_PATH)
    try:
        df = pd.read_sql_query("""
            SELECT id, name, state_province, domains, web_pages
            FROM universities
            WHERE lower(country) = lower(:country)
            ORDER BY name
        """, conn, params={'country': country})
    finally:
        # Close the connection even if the query raises
        conn.close()
    return df

def search_universities_by_name(term, limit=100):
    """Return up to `limit` universities whose name contains `term`."""
    conn = sqlite3.connect(DB_PATH)
    term_like = f"%{term}%"
    try:
        df = pd.read_sql_query("""
            SELECT id, name, country, domains, web_pages
            FROM universities
            WHERE lower(name) LIKE lower(:term)
            ORDER BY country, name
            LIMIT :limit
        """, conn, params={'term': term_like, 'limit': limit})
    finally:
        # Close the connection even if the query raises
        conn.close()
    return df
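
One caveat about the name search: because the term is wrapped in `%...%` and passed to `LIKE`, any `%` or `_` typed by the user acts as a wildcard. The sketch below shows one way to match such characters literally, using SQLite's `ESCAPE` clause. The helper names `escape_like` and `search`, the in-memory table, and the two sample rows are made up for illustration and are not part of the notebook.

```python
import sqlite3


def escape_like(term, escape="\\"):
    """Escape LIKE wildcards so that `term` is matched literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape character first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


# Illustrative in-memory table with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE universities (name TEXT)")
conn.executemany(
    "INSERT INTO universities (name) VALUES (?)",
    [("A_B College",), ("AXB College",)],
)


def search(term):
    """Substring search in which `%` and `_` in `term` are treated as plain text."""
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        "SELECT name FROM universities WHERE name LIKE :term ESCAPE '\\' ORDER BY name",
        {"term": pattern},
    ).fetchall()
    return [name for (name,) in rows]


matches = search("a_b")  # only the name containing a literal underscore matches
```

Without the escaping, the pattern `%a_b%` would also match `AXB College`, since `_` stands for any single character. SQLite's `LIKE` is already case-insensitive for ASCII letters by default, which is why the lowercase term still finds `A_B College`.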

In [9]:
with sqlite3.connect(DB_PATH) as conn:
    df_brasil = pd.read_sql_query("""
        SELECT name, state_province, domains, web_pages
        FROM universities
        WHERE country = 'Brazil'
        ORDER BY name
    """, conn)
    display(df_brasil)


|  | name | state_province | domains | web_pages |
|---|---|---|---|---|
| 0 | Centro Regional Universitário de Espiríto Sant... |  | creupi.br | http://www.creupi.br/ |
| 1 | Centro Universitário Antônio Eufrásio de Toled... | Presidente Prudente | toledoprudente.edu.br | https://toledoprudente.edu.br/ |
| 2 | Centro Universitário Barao de Maua |  | baraodemaua.br | http://www.baraodemaua.br/ |
| 3 | Centro Universitário Claretiano |  | claretiano.edu.br | http://www.claretiano.edu.br/ |
| 4 | Centro Universitário De Goiás - UNIGOIÁS | Goiânia | unigoias.com.br | https://unigoias.com.br/ |
| ... | ... | ... | ... | ... |
| 185 | Universidade do Sagrado Coração |  | usc.br | http://www.usc.br/ |
| 186 | Universidade do Sul de Santa Catarina |  | unisul.br | http://www.unisul.br/ |
| 187 | Universidade do Vale do Itajaí |  | univali.rct-sc.br | http://www.univali.rct-sc.br/ |
| 188 | Universidade do Vale do Paraíba – Univap | São José dos Campos | univap.br | https://www.univap.br/univap/ |
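
A detail about the `with sqlite3.connect(...) as conn:` pattern used in these query cells: the connection's context manager commits or rolls back the transaction, but it does not close the connection. A minimal sketch of one way to guarantee the close, using `contextlib.closing` from the standard library, is shown below. `run_query` is a hypothetical helper name, not something defined in the notebook.

```python
import sqlite3
from contextlib import closing


def run_query(db_path, sql, params=()):
    """Hypothetical helper: run one query and always close the connection."""
    # closing() calls conn.close() when the block exits, even on error
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(sql, params).fetchall()
```

With the notebook's database this could be used as `run_query('universities.db', "SELECT COUNT(*) FROM universities")`. The same `closing(...)` wrapper also works around `pd.read_sql_query` calls.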


In [7]:
with sqlite3.connect(DB_PATH) as conn:
    df_top10 = pd.read_sql_query("""
        SELECT country, COUNT(*) AS total
        FROM universities
        GROUP BY country
        ORDER BY total DESC
        LIMIT 10
    """, conn)
    display(df_top10)


|  | country | total |
|---|---|---|
| 0 | United States | 2348 |
| 1 | Japan | 572 |
| 2 | India | 473 |
| 3 | China | 397 |
| 4 | Germany | 318 |
| 5 | Russian Federation | 309 |
| 6 | France | 297 |
| 7 | Korea, Republic of | 244 |
| 8 | United Kingdom | 195 |
| 9 | Iran | 193 |


In [8]:
with sqlite3.connect(DB_PATH) as conn:
    df_termo = pd.read_sql_query("""
        SELECT name, country, web_pages
        FROM universities
        WHERE name LIKE '%Technology%'
        ORDER BY country, name
        LIMIT 50
    """, conn)
    display(df_termo)


|  | name | country | web_pages |
|---|---|---|---|
| 0 | Engineering Institute of Technology | Australia | https://www.eit.edu.au/ |
| 1 | Institute Of Technology, Australia | Australia | http://www.iota.edu.au/ |
| 2 | Queensland University of Technology | Australia | http://www.qut.edu.au/ |
| 3 | Royal Melbourne Institute of Technology | Australia | http://www.rmit.edu.au/ |
| 4 | Swinburne University of Technology | Australia | http://www.swin.edu.au/ |
| 5 | University of Technology Sydney | Australia | http://www.uts.edu.au/ |
| 6 | Institute of Science and Technology | Austria | http://www.ist.ac.at/ |
| 7 | New York Instiute of Technology | Bahrain | http://www.nyit.edu.bh/ |
| 8 | Ahsanullah University of Science & Technology | Bangladesh | http://www.aust.edu/ |
| 9 | Bangladesh University of Business & Technology | Bangladesh | http://www.bubt.edu.bd/ |
