**Objective:** starting from the names and aliases of several companies, find their mentions in news articles and try to ...

1. graph of associated words/people/topics [check whether the term/person is positive or negative]

2. relationship between news and stock price

3. ...

The work must have 3 parts:

1. project structure + data acquisition

2. exploratory data analysis and visualization

3. results & discussion

Data source: arquivo.pt (https://github.com/arquivo/pwa-technologies/wiki/Arquivo.pt-API)

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime

---

# data01.parquet

**websites from which the news will be collected**

In [6]:
# news from https://www.kadaza.pt

def news(txtFile = 'noticias.txt'):
    """
    grab the news websites from a text file
    """
    with open(txtFile, 'r') as file:
        links = file.read().splitlines()
    return ",".join(links)

#news()

**how the API requests will work / choose the companies (PSI20) to analyse / make the API requests in 3-year groups**

*1 year to 3 years is long enough to smooth out short-term fluctuations and identify underlying trends. Charts with weekly or monthly intervals over these periods show developments over full economic/market cycles.*
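The three-year grouping can be sketched as a small helper that lists the half-year `[from, to]` windows requested for each group. The name `half_year_windows` and its parameters are illustrative assumptions, not part of the notebook; the half-year split and the 2000 to 2020 range follow the request loop used in this notebook.

```python
def half_year_windows(first_year=2000, last_year=2020, years_per_cluster=3):
    """Map each cluster start year to its list of [from, to] half-year windows.

    Hypothetical helper: dates use the YYYYMMDD integer form that
    api_request expects, e.g. [20030101, 20030630].
    """
    clusters = {}
    for cluster in range(first_year, last_year + 1, years_per_cluster):
        windows = []
        # Guard against running past last_year in the final cluster.
        cluster_end = min(cluster + years_per_cluster, last_year + 1)
        for year in range(cluster, cluster_end):
            windows.append([int(f"{year}0101"), int(f"{year}0630")])
            windows.append([int(f"{year}0701"), int(f"{year}1231")])
        clusters[cluster] = windows
    return clusters


clusters = half_year_windows()
print(list(clusters))       # cluster start years
print(clusters[2000][:2])   # the two half-year windows of the year 2000
```

Splitting each year into two windows keeps each request smaller, which should make it less likely to hit the `maxItems=500` cap for a frequently mentioned alias.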

In [7]:
def api_request(search, websites, date):
    """
    search: expression/word (what to look for)
    websites: comma separated websites (where to look for)
    date: list such as [20030101, 20031231] (when to look for)
    -
    returns the response_items from the arquivo.pt API
    """
    search = f"q=%22{search.replace(' ', '%20')}%22"
    websites = f"&siteSearch={websites}"
    date = f"&from={date[0]}&to={date[1]}"    
    url = (
        f"https://arquivo.pt/textsearch?{search}{websites}{date}"
        "&fields=linkToArchive,linkToExtractedText,tstamp"
        "&maxItems=500&dedupValue=25&dedupField=url&prettyPrint=false&type=html"
        )
    json = requests.get(url).json()
    data = json["response_items"]
    if len(data) == 500:
        print(f"You might have lost some data: {search, date}")
    return data

In [None]:
def datav1(companies):
    """
    this is the function where we choose the companies which will be in study
    -
    companies should be a dictionary
        {"company1": [aliases or other names the company is or was known by],
        "company2": [...]}
    -
    the companies, their aliases and the API responses are saved into a parquet file for future use

    also this will do the api requests .... get this better
    """
    # CREATING DF WITH COMPANIES AND THEIR ALIASES
    companies_data = {"companies": [], "aliases": []}
    for company in companies.keys():
        companies_data["companies"].append(company)
        companies_data["aliases"].append(companies[company])
    df = pd.DataFrame(companies_data).set_index("companies")

    # SITES OF WHERE TO LOOK FOR NEWS
    websites = news()

    # INITIALIZING API REQUESTS
    # groups of 3 years, from 2000 to 2020
    for cluster in range(2000, 2021, 3):
        api_cluster = [] #reset api_cluster for each cluster (group of 3 years)
        print(f"Processing cluster: {cluster}")
        print("Processing company:", end=" ")
        # iterate over each company
        for company_aliases in df["aliases"]:
            api_company = [] #reset api_company for each company
            print(f"{company_aliases[0]}", end = "; ")
            # iterate over each company's aliases
            for alias in company_aliases:
                # iterate over each cluster's year
                for year in range(cluster, cluster + 3):                        
                    api_aliasS1 = api_request(alias, websites, [int(f"{year}0101"), int(f"{year}0630")])
                    api_aliasS2 = api_request(alias, websites, [int(f"{year}0701"), int(f"{year}1231")])
                    api_company += api_aliasS1 + api_aliasS2
            # save company data
            api_cluster.append(api_company)

        # save cluster (group of 3 years) data
        df[f"api.{cluster}"] = api_cluster
        print(f"{cluster} OK.")

    # save all data
    df.to_parquet("data01.parquet")
    print("Finished.")
    return df

companies = {"Banco Comercial Português": ["Banco Comercial Português", "BCP"],
             "Galp Energia": ["Galp Energia", "GALP"],
             "EDP": ["EDP", "Energias de Portugal", "Electricidade de Portugal"],
             "Sonae": ["Sonae", "SON"],
             "Mota-Engil": ["Mota-Engil", "EGL"]}
df01 = datav1(companies)
df01

In [None]:
df01.map(lambda x: len(x))

--- 

# data02.parquet

because `&dedupValue=25&dedupField=url` and several aliases were used, some information is repeated

**filter out duplicates and texts that do not mention any alias**
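As a minimal sketch of this filtering step, separated from any network access: the function below assumes each record is a dict that already carries its text under an `ExtractedText` key. The name `filter_records` and the sample records are hypothetical and only illustrate the idea.

```python
def filter_records(records, aliases):
    """Keep records whose text mentions an alias, dropping repeated texts.

    Hypothetical helper: `records` is a list of dicts with an
    "ExtractedText" key, `aliases` is a list of company names.
    """
    lowered_aliases = [alias.lower() for alias in aliases]
    seen_texts = set()
    kept = []
    for record in records:
        text = record.get("ExtractedText")
        # Skip missing/empty texts and texts that were already kept.
        if not text or text in seen_texts:
            continue
        # Case-insensitive substring match against every alias.
        if any(alias in text.lower() for alias in lowered_aliases):
            kept.append(record)
            seen_texts.add(text)
    return kept


sample = [
    {"tstamp": "20030101", "ExtractedText": "A EDP anunciou resultados."},
    {"tstamp": "20030102", "ExtractedText": "A EDP anunciou resultados."},
    {"tstamp": "20030103", "ExtractedText": "Outra notícia qualquer."},
    {"tstamp": "20030104", "ExtractedText": ""},
]
kept = filter_records(sample, ["EDP", "Energias de Portugal"])
print(len(kept))
```

Because the match is a plain substring test, short aliases such as "SON" also match inside longer words (for example "personagem"), so a word-boundary check may be worth considering for those.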

problems overcome:

- the API has usage limits (250 req/min, error 429): `time.sleep(60)` and retry

- API error 404 for some URLs: return 0 (False) and skip it

- extracting the text takes a long time: filter and save column by column

note: the text processing could have been done online right away, but I did not want to depend on the wifi
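The 429 handling listed above can also be sketched with a bounded number of retries instead of an open-ended loop. This is an illustration under assumptions: `fetch_with_retry` is a hypothetical name, and whether arquivo.pt sends a `Retry-After` header is not confirmed here, so the code falls back to a fixed 60-second wait when the header is absent or not a plain number of seconds.

```python
import time


def fetch_with_retry(fetch, url, max_retries=5, default_wait=60, sleep=time.sleep):
    """Call fetch(url) until it stops answering 429 or retries run out.

    Hypothetical helper. `fetch` is any callable returning an object with
    .status_code, .headers and .text (requests.get has this shape).
    Returns the response text, or None on 404 and other failures.
    """
    for _ in range(max_retries):
        response = fetch(url)
        if response.status_code == 200:
            return response.text
        if response.status_code == 429:
            # Use Retry-After when it is a number of seconds, else the default.
            retry_after = response.headers.get("Retry-After", "")
            wait = int(retry_after) if retry_after.isdigit() else default_wait
            sleep(wait)
            continue
        # 404 or any other status: give up on this URL.
        return None
    return None
```

With `requests` installed, a call would look like `fetch_with_retry(requests.get, link)`. Passing `fetch` and `sleep` as arguments keeps the retry logic testable without network access or real waiting.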

In [None]:
def extracText(linkToExtractedText):
    # Infinite loop to handle retry logic in case of 429 Too Many Requests
    while True:
        response = requests.get(linkToExtractedText)
        status_code = response.status_code
        
        if status_code == 200:
            # If the request is successful (200 OK), return the extracted text
            soup = BeautifulSoup(response.content, "html.parser")
            return soup.get_text()
        elif status_code == 429:
            # Handle 429 Too Many Requests: wait 60 s (the limit is 250 req/min), then retry
            print(" (...)", end = "")
            time.sleep(60)  # Pause before the next attempt
        elif status_code == 404:
            return 0
        else:
            # For any other status code (e.g., 500), print it and leave the loop; the function then returns None
            print(f"Request failed with status code {status_code}. Link was {linkToExtractedText}")
            break

# Function to process each column
def filterColumn(column, aliases):
    """extract the text of each record, drop repeated texts and keep only texts that mention an alias"""
    global stats
    filtered_column = []

    for row in aliases.index:
        filtered_cell = []
        seen_text = set()
        print(f"; {row}", end = "")
        for i in column.loc[row]:
            
            # Extract text from 'linkToExtractedText'
            text = extracText(i['linkToExtractedText'])
                

            # Skip if the text has already been processed
            if text in seen_text:
                stats["duplicate"] += 1
                continue

            elif not text: #ERROR 404
                stats["404"] += 1
                continue
            
            # Check if any alias is found in the text
            elif any(alias.lower() in text.lower() for alias in aliases.loc[row]):
                i["ExtractedText"] = text  # Add extracted text to the record
                
                # Remove unwanted fields
                i.pop('linkToExtractedText', None)
                
                # Append the processed record
                filtered_cell.append(i)
                
                # Mark this text as processed
                seen_text.add(text)

        filtered_column.append(filtered_cell)
                
    return filtered_column


def processColumns(col_to_proc):
    print(f"Starting: {datetime.now()}")
    try:
        # resume from the DataFrame saved by a previous run
        df = pd.read_parquet("data02.parquet")
    except FileNotFoundError:
        # first run: create the working DataFrame from data01.parquet
        df = pd.read_parquet("data01.parquet")
        df.to_parquet("data02.parquet")
    for column in col_to_proc:
        has_link = "linkToExtractedText" in df.iloc[-1][column][-1]
        has_extracText = "ExtractedText" in df.iloc[-1][column][-1]
        if not has_link and has_extracText:
            print(f"\n{column} already done. Skipping.")
        else:
            print(f"\nProcessing {column}", end = ": ")
            df[column] = filterColumn(df[column], df["aliases"])
            df.to_parquet("data02.parquet")
    print(f"\nEnded: {datetime.now()}.")

stats = {"404": 0, "duplicate": 0}
processColumns(["api.2000", "api.2003", "api.2006", "api.2009", "api.2012", "api.2015", "api.2018"])
print(stats)

In [None]:
pd.read_parquet("data02.parquet").map(lambda x: len(x)) - pd.read_parquet("data01.parquet").map(lambda x: len(x))

In [None]:
pd.read_parquet("data02.parquet").map(lambda x: len(x))

---