# Project - Data Extraction I

## System for Monitoring Advances in the Field of Genomics

### Context:

The group works on the data engineering team at HealthGen, a company specializing in genomics and personalized medicine research. Genomics, the study of an organism's complete set of genes, plays a fundamental role in personalized medicine and biomedical research. It enables DNA analysis to identify genetic variants and mutations associated with diseases, and it makes it possible to tailor treatments to each patient's individual genetic characteristics.

The company needs to stay up to date on the latest advances in genomics, identify opportunities for research and development of personalized treatments, and track genomics trends that may influence its research and development strategies. With this in mind, the data team proposed building a system that collects, analyzes, and presents the latest news on genomics and personalized medicine, and that also studies how the field has advanced in recent years.

The data engineering team's goal is to build and maintain a reliable, stable data pipeline. The main activities are:

1. **Data ingestion with the News API**:
    - Implement a mechanism to consume news data from reliable sources specializing in genomics and personalized medicine, using the News API:
      [https://newsapi.org/](https://newsapi.org/)

2. **Define Relevance Criteria**:
    - Develop precise relevance criteria to filter the news. For example, the team may focus on news that mentions advances in DNA sequencing, personalized gene therapies, or discoveries related to specific genetic diseases.

3. **Batch Loads**:
    - Store the relevant news in a structured, easily accessible format for later queries and analysis. This load must run once per hour. If the extracted news items were already stored in the previous load, the process must skip them rather than store them again; the loaded data must not contain duplicates.

4. **Transformed data for end-user queries**:
    - Starting from the loaded data, apply the following transformations and store the final result for end-user queries:
        - Number of news items per year, month, and day of publication;
        - Number of news items per source and author;
        - Number of occurrences of 3 keywords per year, month, and day of publication (the same 3 keywords used for the relevance filters in item 2).
    - Update the transformed data once per day.
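
The hourly load in item 3 must skip articles that were already stored. One way to sketch this is shown below. It assumes each article's `url` identifies it uniquely (an assumption; the brief does not name a key), and `select_new_articles` is a hypothetical helper name, not part of the project.

```python
from typing import Iterable


def select_new_articles(stored_urls: set[str], fetched: Iterable[dict]) -> list[dict]:
    """Return only the fetched articles whose URL has not been stored yet.

    Assumption: the article URL is a stable unique key. `stored_urls` is
    updated in place so repeated calls (one per hourly batch) stay consistent.
    """
    new_articles = []
    for article in fetched:
        url = article.get("url")
        if url is None or url in stored_urls:
            continue  # skip articles without a key or already loaded
        stored_urls.add(url)
        new_articles.append(article)
    return new_articles


# Usage: two consecutive batches that overlap on one article (placeholder URLs).
stored = set()
batch_1 = [{"url": "https://example.org/a"}, {"url": "https://example.org/b"}]
batch_2 = [{"url": "https://example.org/b"}, {"url": "https://example.org/c"}]
first = select_new_articles(stored, batch_1)
second = select_new_articles(stored, batch_2)
print(len(first), len(second))
```

In a real pipeline the set of stored URLs would be read from the persisted table at the start of each run rather than kept in memory.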

In [0]:
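import os

import requests
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Read the News API key from the environment instead of hardcoding it in the notebook.
# NEWSAPI_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ['NEWSAPI_KEY']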

base_url = 'https://newsapi.org/v2/everything'

query = '(epigenetics OR epigenetic OR epigenomics OR epigenomic) AND (disease OR sickness OR sick) AND (genomic OR genomics OR gene)'

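# Call the News API and fail fast on HTTP errors (e.g. invalid key or rate limit).
response = requests.get(url=base_url, params={'q': query, 'apiKey': API_KEY}, timeout=30)
response.raise_for_status()
# Keep only the list of article dicts from the JSON payload.
response = response.json()['articles']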

spark = SparkSession.builder.appName('example').getOrCreate()

schema = StructType([
    StructField("source", StructType([
        StructField("id", StringType(), True),
        StructField("name", StringType(), True)
    ]), True),
    StructField("author", StringType(), True),
    StructField("title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("url", StringType(), True),
    StructField("urlToImage", StringType(), True),
    StructField("publishedAt", StringType(), True),
    StructField("content", StringType(), True),
])

epigenetics_df = spark.createDataFrame(response, schema=schema)

epigenetics_df.display()



source,author,title,description,url,urlToImage,publishedAt,content
"List(null, Scientific American)","Tulika Bose, Tanya Lewis",Hunger in Gaza Could Affect Survivors' Health for Decades,Epigenetics research reveals how famines can cause health problems later in life — and how those changes might be passed down to future generations.,https://www.scientificamerican.com/podcast/episode/hunger-in-gaza-could-affect-survivors-health-for-decades/,https://static.scientificamerican.com/dam/m/796975bfa0d8e4df/webimage-GAZA_FAMINE_SQUARE.png?w=1200,2024-03-11T23:00:00Z,Tanya Lewis: The situation in Gaza right now is desperate. A large percentage of the population is experiencing hunger or even dying of starvation. [Kamala Harris news clip] Tulika Bose: Videos sh… [+8169 chars]
"List(null, Phys.Org)",Science X,Biologists discover the secrets of how gene traits are passed on,A research team has recently made a significant breakthrough in understanding how the DNA copying machine helps pass on epigenetic information to maintain gene traits at each cell division.,https://phys.org/news/2024-03-biologists-secrets-gene-traits.html,https://scx2.b-cdn.net/gfx/news/2024/biologists-discover-th-1.jpg,2024-03-07T14:46:37Z,A research team has recently made a significant breakthrough in understanding how the DNA copying machine helps pass on epigenetic information to maintain gene traits at each cell division. Understa… [+6078 chars]
"List(null, New Atlas)",Paul McClure,‘Bad’ cholesterol gene silenced without altering the DNA sequence,"By silencing the gene responsible for regulating ‘bad’ cholesterol without altering the primary DNA sequence, researchers have shown that it’s possible to use epigenetic editing to treat diseases rather than conventional DNA-breaking gene editing technology, …",https://newatlas.com/science/epigenetic-editing-cholesterol-gene-silenced/,https://assets.newatlas.com/dims4/default/48fd060/2147483647/strip/true/crop/2000x1050+0+142/resize/1200x630!/quality/90/?url=http%3A%2F%2Fnewatlas-brightspot.s3.amazonaws.com%2F1a%2F11%2Fb5b84b0f432c84e8962c950b577c%2Fepigenetic-editing-copy.jpg&na.image_optimisation=0,2024-02-29T07:30:00Z,"By silencing the gene responsible for regulating bad cholesterol without altering the primary DNA sequence, researchers have shown that its possible to use epigenetic editing to treat diseases rather… [+3063 chars]"
"List(null, Frontiersin.org)",,"Acetate Revisited: A Key Biomolecule at the Nexus of Metabolism, Epigenetics","Acetate, the shortest chain fatty acid, has been implicated in providing health benefits whether it is derived from the diet or is generated from microbial f...",https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2020.580171/full,https://www.frontiersin.org/files/MyHome%20Article%20Library/580171/580171_Thumb_400.jpg,2024-03-26T11:46:02Z,"Introduction Diabetes, obesity, heart disease, cancer, and liver disease have all been linked in various ways to acetate availability, metabolism, and signaling. Acetate supplementation induces phys… [+163834 chars]"
"List(null, Phys.Org)",Science X,New tool helps decipher gene behavior,Scientists have extensively researched the structure and sequence of genetic material and its interactions with proteins in the hope of understanding how our genetics and environment interact with diseases. This research has partly focused on 'epigenetic mark…,https://phys.org/news/2024-02-tool-decipher-gene-behavior.html,https://scx2.b-cdn.net/gfx/news/hires/2024/new-tool-helps-deciphe.jpg,2024-02-28T21:49:39Z,Scientists have extensively researched the structure and sequence of genetic material and its interactions with proteins in the hope of understanding how our genetics and environment interact with di… [+3034 chars]
"List(null, Singularity Hub)",Shelly Fan,Gene Silencing Slashes Cholesterol in Mice—No Gene Edits Required,"With just one shot, scientists have slashed cholesterol levels in mice. The treatment lasted for at least half their lives. The shot may sound like gene editing, but it’s not. Instead, it relied on an up-and-coming method to control genetic activity—without d…",https://singularityhub.com/2024/02/29/gene-silencing-slashes-cholesterol-in-mice-no-gene-edits-required/,https://singularityhub.com/wp-content/uploads/2023/09/deepmind-dna-double-helix-visualization.jpeg,2024-02-29T21:26:17Z,"With just one shot, scientists have slashed cholesterol levels in mice. The treatment lasted for at least half their lives. The shot may sound like gene editing, but its not. Instead, it relied on a… [+7085 chars]"
"List(null, Science Daily)",,Cracking epigenetic inheritance: Biologists discovered the secrets of how gene traits are passed on,A research team has recently made a significant breakthrough in understanding how the DNA copying machine helps pass on epigenetic information to maintain gene traits at each cell division. Understanding how this coupled mechanism could lead to new treatments…,https://www.sciencedaily.com/releases/2024/03/240307110735.htm,https://www.sciencedaily.com/images/scidaily-icon.png,2024-03-07T16:07:35Z,"A research team led by Professor Yuanliang ZHAI at the School of Biological Sciences, The University of Hong Kong (HKU) collaborating with Professor Ning GAO and Professor Qing LI from Peking Univers… [+5534 chars]"
"List(null, Forbes)","Victoria Forster, Contributor, Victoria Forster, Contributor  https://www.forbes.com/sites/victoriaforster/",AI identifies New Type Of Prostate Cancer,"There are two distinct types of prostate cancer, which may open the door to more personalized therapies, according to a new AI-driven study published today.",https://www.forbes.com/sites/victoriaforster/2024/02/29/ai-identifies-new-type-of-prostate-cancer/,https://imageio.forbes.com/specials-images/imageserve/65dfaa1c4adb4c8f69357735/0x0.jpg?format=jpg&height=900&width=1600&fit=bounds,2024-02-29T16:00:00Z,"There are two distinct genetic types of prostate cancer according to new research which used AI ... [+] analysis getty There are two distinct types of prostate cancer, according to a new AI-driven … [+3340 chars]"
"List(null, Science Daily)",,Sniffing our way to better health,"Imagine if we could inhale scents that delay the onset of cancer, inflammation, or neurodegenerative disease. Researchers are poised to bring this futuristic technology closer to reality.",https://www.sciencedaily.com/releases/2024/02/240227172140.htm,https://www.sciencedaily.com/images/scidaily-icon.png,2024-02-27T22:21:40Z,"Imagine if we could inhale scents that delay the onset of cancer, inflammation, or neurodegenerative disease. Researchers at the University of California, Riverside, are poised to bring this futurist… [+5311 chars]"
"List(null, Science Daily)",,'Junk DNA' no more: New method to identify cancers from repeat elements of genetic code,"Repeats of DNA sequences, often referred to as 'junk DNA' or 'dark matter,' that are found in chromosomes and could contribute to cancer or other diseases have been challenging to identify and characterize. Now, researchers have developed a novel approach tha…",https://www.sciencedaily.com/releases/2024/03/240313185102.htm,https://www.sciencedaily.com/images/scidaily-icon.png,2024-03-13T22:51:02Z,"Repeats of DNA sequences, often referred to as ""junk DNA"" or ""dark matter,"" that are found in chromosomes and could contribute to cancer or other diseases have been challenging to identify and charac… [+8429 chars]"

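
The daily aggregations listed in item 4 could be prototyped as follows. This is a minimal plain-Python sketch over article dictionaries shaped like the rows above, not the notebook's Spark implementation. The function names and the keyword list are illustrative assumptions, and keywords are counted as case-insensitive substring matches in the title and description.

```python
from collections import Counter
from datetime import datetime

# Assumed keywords, taken from terms in the notebook's query; adjust as needed.
KEYWORDS = ("epigenetic", "disease", "gene")


def publication_date(article: dict) -> tuple[int, int, int]:
    """Parse `publishedAt` (e.g. '2024-03-11T23:00:00Z') into (year, month, day)."""
    ts = datetime.strptime(article["publishedAt"], "%Y-%m-%dT%H:%M:%SZ")
    return ts.year, ts.month, ts.day


def count_by_date(articles):
    """Number of articles per (year, month, day) of publication."""
    return Counter(publication_date(a) for a in articles)


def count_by_source_author(articles):
    """Number of articles per (source name, author); author may be None."""
    return Counter((a["source"]["name"], a.get("author")) for a in articles)


def keyword_counts_by_date(articles, keywords=KEYWORDS):
    """Occurrences of each keyword per publication date, keyed by (date, keyword)."""
    counts = Counter()
    for a in articles:
        text = f"{a.get('title') or ''} {a.get('description') or ''}".lower()
        date = publication_date(a)
        for kw in keywords:
            n = text.count(kw.lower())
            if n:
                counts[(date, kw)] += n
    return counts


# Two rows abbreviated from the output above (descriptions omitted).
sample = [
    {"source": {"id": None, "name": "Phys.Org"}, "author": "Science X",
     "title": "New tool helps decipher gene behavior",
     "description": None, "publishedAt": "2024-02-28T21:49:39Z"},
    {"source": {"id": None, "name": "Science Daily"}, "author": None,
     "title": "Sniffing our way to better health",
     "description": None, "publishedAt": "2024-02-27T22:21:40Z"},
]

by_date = count_by_date(sample)
by_source_author = count_by_source_author(sample)
by_keyword = keyword_counts_by_date(sample)
print(by_date, by_source_author, by_keyword, sep="\n")
```

Substring matching means `gene` also matches words such as `genetic`; a word-boundary regular expression would be stricter if that matters for the reports.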