# Classify news by political spectrum and show trends in newspapers and journalists by political spectrum
Classify news articles within the political spectrum, according to whether they favor more libertarian or authoritarian policies and more free-market or closed-market policies.

Show where each newspaper trends within the political spectrum, based on the number of news articles and their textual sentiment (positive, neutral or negative).

Show the same information by journalist.
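
As a rough illustration of what a "trend" could mean here, the sketch below counts sentiment labels per group and reduces them to a mean score. The record layout, the `SCORE` mapping and the outlet names are assumptions made for this example and are not taken from the project's data; the same function would work per journalist by using the journalist as the group key.

```python
from collections import Counter, defaultdict

# Hypothetical records: (group, sentiment label). Placeholder data, not real results.
SAMPLE = [
    ("outlet_a", "positive"),
    ("outlet_a", "negative"),
    ("outlet_a", "positive"),
    ("outlet_b", "neutral"),
]

# Assumed numeric encoding of the three sentiment labels.
SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def sentiment_trend(records):
    """Return {group: {"counts": Counter, "n": int, "net": float}}.

    `net` is the mean sentiment score of the group, in [-1, 1].
    """
    counts = defaultdict(Counter)
    for group, label in records:
        counts[group][label] += 1

    trend = {}
    for group, c in counts.items():
        n = sum(c.values())
        net = sum(SCORE[label] * k for label, k in c.items()) / n
        trend[group] = {"counts": c, "n": n, "net": net}
    return trend


if __name__ == "__main__":
    for group, stats in sentiment_trend(SAMPLE).items():
        print(group, stats["n"], dict(stats["counts"]), round(stats["net"], 2))
```

Keeping the raw counts next to the mean makes it possible to tell an outlet with few articles apart from one with many articles that cancel out.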

## How to reach this classification?
- Get the topic of the news.
- Try to identify the main subject (person) of the news.
- If the main subject is a politician, try to get the political party of the politician.
- Determine whether the sentiment of the news is positive, neutral or negative, and try to infer the political leaning that this sentiment implies.
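
One possible reading of these steps, sketched under strong assumptions: each party gets a point on two axes, and an article's sentiment towards that party's politician decides whether the article is counted towards that point or away from it. The coordinates in `PARTY_POSITIONS`, the axis orientation and the sign rule are all hypothetical placeholders for illustration; they are not measured positions and not necessarily the method this project ends up using.

```python
# Hypothetical coordinates on (economic, social) axes, each in [-1, 1].
#   economic: -1 = closed-market ... +1 = free-market   (assumed orientation)
#   social:   -1 = libertarian   ... +1 = authoritarian (assumed orientation)
# Placeholder values only; these are NOT real party positions.
PARTY_POSITIONS = {
    "PARTY_X": (-0.5, 0.2),
    "PARTY_Y": (0.6, 0.4),
}

SENTIMENT_SIGN = {"positive": 1, "neutral": 0, "negative": -1}


def article_lean(party, sentiment, positions=PARTY_POSITIONS):
    """Place one article on the two axes, or return None for an unknown party."""
    if party not in positions:
        return None
    sign = SENTIMENT_SIGN[sentiment]
    econ, social = positions[party]
    # Positive coverage counts towards the party's position, negative away from it.
    return (sign * econ, sign * social)


def mean_lean(articles, positions=PARTY_POSITIONS):
    """Average the lean of (party, sentiment) pairs; None if nothing is usable."""
    leans = [article_lean(p, s, positions) for p, s in articles]
    leans = [lean for lean in leans if lean is not None]
    if not leans:
        return None
    n = len(leans)
    return (sum(e for e, _ in leans) / n, sum(s for _, s in leans) / n)


if __name__ == "__main__":
    print(article_lean("PARTY_X", "negative"))
    print(mean_lean([("PARTY_X", "positive"), ("PARTY_Y", "positive")]))
```

Mirroring the coordinates for negative sentiment is a crude simplification: criticism of one party does not necessarily imply support for the opposite position.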

<img src="./images/political_spectrum.png" alt="political_spectrum" width="400"/>
<img src="./images/PT_Political_Compass.png" alt="PT_Political_Compass" width="550"/>

## Similar Works / State of the Art
- [Politiquices](https://github.com/politiquices) - [2021 arquivo awards](https://sobre.arquivo.pt/pt/conheca-os-vencedores-do-premio-arquivo-pt-2021/), 2nd place \
  This project shows relations between politicians and events in the political life of Portugal.
  
- [Memória Política](https://arquivo.pt/wayback/20230710141426/http://memoria-politica.pt/) - [2023 arquivo awards](https://sobre.arquivo.pt/pt/conheca-os-vencedores-do-premio-arquivo-pt-2023/), 3rd place   
  This is a relevant project because it shows a sentiment analysis of the news for each political party.
  
- [Perfil Público](https://aclanthology.org/2024.propor-2.27.pdf) - Research paper \
  This project generates profiles of different authors based on their writing style.
  
- [Arquivo do Parlamento](https://arquivo-parlamento.pt/) - [2022 arquivo awards](https://sobre.arquivo.pt/pt/conheca-os-vencedores-do-premio-arquivo-pt-2022/), 1st place \
  This project aggregates all the news about each politician and party across all the legislatures in Portugal.

## Data Sources
- [arquivo.pt](https://arquivo.pt/)
  - [Público](https://www.publico.pt/)
  - [Jornal de Notícias](https://www.jn.pt/)
  - [Diário de Notícias](https://www.dn.pt/)
  - [Expresso](https://expresso.pt/)
  - [Observador](https://observador.pt/)
  - [Sapo](https://www.sapo.pt/)
  - [RTP](https://www.rtp.pt/)
  - [TVI](https://tvi.iol.pt/)
  - [Correio da Manhã](https://www.cmjornal.pt/)
  - [Jornal i](https://ionline.sapo.pt/)
  - [Sol](https://sol.sapo.pt/)
  - [Jornal Económico](https://jornaleconomico.sapo.pt/)
  - [Notícias ao Minuto](https://www.noticiasaominuto.com/)
  - [SIC Notícias](https://sicnoticias.pt/)
  - [Renascença](https://rr.sapo.pt/)
  - [Jornal de Negócios](https://www.jornaldenegocios.pt/)
  - [Visão](https://visao.sapo.pt/)
  - [Sábado](https://www.sabado.pt/)
- [wikidata.org](https://www.wikidata.org/)
- [dados.gov.pt](https://dados.gov.pt/)
- [parlamento.pt](https://www.parlamento.pt/Cidadania/Paginas/DadosAbertos.aspx)

## Methodology
- Get the names of the politicians active in a given time period.
- Limit our scope to the news articles that mention these politicians and their parties in the same time period.
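
A minimal sketch of the time-window restriction, assuming ISO dates (`YYYY-MM-DD`) for the legislature bounds and 14-digit capture timestamps (`YYYYMMDDhhmmss`) on the archived pages. The helper names `to_arquivo_date` and `in_window` are made up for this example.

```python
from datetime import datetime


def to_arquivo_date(iso_date):
    """Convert 'YYYY-MM-DD' to 'YYYYMMDD', validating the date on the way."""
    return datetime.strptime(iso_date, "%Y-%m-%d").strftime("%Y%m%d")


def in_window(tstamp, start_iso, end_iso):
    """True if a 14-digit capture timestamp falls within [start, end], by day."""
    day = tstamp[:8]  # keep YYYYMMDD, drop hhmmss
    # Fixed-width digit strings compare correctly as plain strings.
    return to_arquivo_date(start_iso) <= day <= to_arquivo_date(end_iso)


if __name__ == "__main__":
    print(to_arquivo_date("2005-03-10"))
    print(in_window("20071221214514", "2005-03-10", "2009-10-14"))
```

Going through `datetime.strptime` instead of only stripping the dashes means a malformed date raises an error early rather than producing a wrong query.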

In [1]:
import pandas as pd
import requests
import json
from pprint import pprint
with open("./Legislaturas/X.json") as json_file:
    legislature_json = json.load(json_file)

legislature = legislature_json["Legislatura"]

l_init_date = legislature["DetalheLegislatura"]["dtini"]  # 2005-03-10
l_end_date = legislature["DetalheLegislatura"]["dtfim"]  # 2009-10-14

deputies = legislature["Deputados"]["pt_ar_wsgode_objectos_DadosDeputadoSearch"]
parties = legislature["GruposParlamentares"]["pt_gov_ar_objectos_GPOut"]

pprint(len(deputies))

352


Deputies of the X legislature of the Assembly of the Republic of Portugal:

In [2]:
dp_df = pd.DataFrame(deputies, columns=['depId','depNomeParlamentar'])
party_df = pd.DataFrame(parties)
party_df

| | sigla | nome |
|---|---|---|
| 0 | PS | Partido Socialista |
| 1 | PSD | Partido Social Democrata |
| 2 | PCP | Partido Comunista Português |
| 3 | CDS-PP | Partido Popular |
| 4 | BE | Bloco De Esquerda |
| 5 | PEV | Partido Ecologista "Os Verdes" |


In [3]:
dp_df

| | depId | depNomeParlamentar |
|---|---|---|
| 0 | 2249 | ABEL BAPTISTA |
| 1 | 2139 | ABÍLIO DIAS FERNANDES |
| 2 | 2269 | ADÃO SILVA |
| 3 | 2091 | AFONSO CANDAL |
| 4 | 2495 | AGOSTINHO BRANQUINHO |
| ... | ... | ... |
| 347 | 2180 | VÍTOR BAPTISTA |
| 348 | 2147 | VÍTOR HUGO SALGADO |
| 349 | 2170 | VÍTOR PEREIRA |
| 350 | 2340 | VITOR RAMALHO |


In [4]:
init_date = l_init_date.replace("-", "")
end_date = l_end_date.replace("-", "")
maxItems = 100
domains = [
    #"publico.pt",
    #"www.publico.pt",
    #"jornal.publico.pt",
    #"dossiers.publico.pt",
    #"desporto.publico.pt",
    #"www.publico.clix.pt",
    #"digital.publico.pt",
    #"blogues.publico.pt",
    #"economia.publico.pt",
    #"m.publico.pt",
    #"ultimahora.publico.pt",
    "observador.pt",
    "www.dn.pt",
    "dn.sapo.pt",
    "www.dn.sapo.pt",
    "expresso.pt",
    "aeiou.expresso.pt",
    "expresso.sapo.pt",
    "www.correiomanha.pt",
    "www.correiodamanha.pt",
    "www.cmjornal.xl.pt",
    "www.cmjornal.pt",
    "www.jn.pt",
    "jn.pt",
    "jn.sapo.pt",
    "abola.pt",
    "www.abola.pt",
    "abola.pt:80",
    "www.sabado.pt",
    "www.sabado.pt:80",
    "www.sabado.xl.pt",
    "www.sabado.xl.pt:80",
    "sabado.pt",
    "visaoonline.clix.pt:80",
    "visao.clix.pt:80",
    "aeiou.visao.pt",
    "visao.sapo.pt",
]

news_per_deputy = {}
total_dep = len(deputies)
depts = []

# search for news for each deputy in the years of the legislature
for index, dep in enumerate(deputies):
    dep_id = dep["depId"]
    dep_name = dep["depNomeParlamentar"]
    deputy = {
        "id": dep_id,
        "name": dep_name
    }
    depts.append(deputy)
    query = f"{dep_name}"

    print(f"{index + 1}/{total_dep} - {dep_name}")
    print(f"Searching news for {dep_name}...")

    payload = {
        "q": query,
        "maxItems": maxItems,
        "siteSearch": ",".join(domains),
        "from": init_date,
        "to": end_date,
    }

    # r = requests.get("https://arquivo.pt/textsearch", params=payload)
    # 
    # json_res = r.json()
    # items = json_res["response_items"]
    # 
    # news_per_deputy[dep_name] = {
    #     "estimated_nr_results": json_res["estimated_nr_results"],
    #     "items": items,
    # }
    # 
    # print(f"Found {json_res['estimated_nr_results']} news for {dep_name}.\n")

df = pd.DataFrame(
    [
        (dep, news_per_deputy[dep]["estimated_nr_results"])
        for dep in news_per_deputy
    ],
    columns=["Deputy", "N_news"],
)
df.sort_values(by="N_news", ascending=False, inplace=True)

1/352 - ABEL BAPTISTA
Searching news for ABEL BAPTISTA...
2/352 - ABÍLIO DIAS FERNANDES
Searching news for ABÍLIO DIAS FERNANDES...
3/352 - ADÃO SILVA
Searching news for ADÃO SILVA...
4/352 - AFONSO CANDAL
Searching news for AFONSO CANDAL...
5/352 - AGOSTINHO BRANQUINHO
Searching news for AGOSTINHO BRANQUINHO...
6/352 - AGOSTINHO GONÇALVES
Searching news for AGOSTINHO GONÇALVES...
7/352 - AGOSTINHO LOPES
Searching news for AGOSTINHO LOPES...
8/352 - ALBERTO ANTUNES
Searching news for ALBERTO ANTUNES...
9/352 - ALBERTO ARONS DE CARVALHO
Searching news for ALBERTO ARONS DE CARVALHO...
10/352 - ALBERTO COSTA
Searching news for ALBERTO COSTA...
11/352 - ALBERTO MARTINS
Searching news for ALBERTO MARTINS...
12/352 - ALCÍDIA LOPES
Searching news for ALCÍDIA LOPES...
13/352 - ALDA MACEDO
Searching news for ALDA MACEDO...
14/352 - ALDEMIRA PINHO
Searching news for ALDEMIRA PINHO...
15/352 - ALTINO BESSA
Searching news for ALTINO BESSA...
16/352 - ÁLVARO CASTELLO-BRANCO
Searching news for ÁLVAR

In [5]:
# write news_per_deputy to a JSON Lines file (one JSON object per line)
with open("data.json", "w") as f:
    for dep in news_per_deputy:
        json.dump({
            "deputy": dep,
            "estimated_nr_results": news_per_deputy[dep]["estimated_nr_results"],
            "items": news_per_deputy[dep]["items"],
        }, f)
        f.write("\n")

# write df to csv
df.to_csv("news_per_deputy.csv", index=False)

In [6]:
news_per_deputy = pd.read_csv("news_per_deputy.csv")
news_per_deputy.head()
# Sum the number of news 
news_per_deputy["N_news"].sum()


0


In [8]:
import time


def try_request(endpoint, params=None, timeout=30, attempts=10):
    """GET `endpoint` and return the parsed JSON, or False if every attempt fails."""
    for i in range(attempts):
        try:
            # The timeout grows with each attempt.
            request = requests.get(endpoint, params=params, timeout=timeout * ((i + 1) * 2))
            if request.status_code == 404:
                return False
            if request.status_code == 429:  # too many requests: wait before the next attempt
                time.sleep(10)
            if request.status_code != 200:
                raise Exception("Bad status code %s" % request.status_code)
            # 'Ã' without any 'ç' suggests the body was decoded with the wrong charset
            if 'Ã' in request.text and "ç" not in request.text:
                request.encoding = "utf-8"
            return request.json()
        except Exception as e:
            print("[%s] for [%s] (attempt %d/%d)" % (e, params, i + 1, attempts))
    return False  # end of attempts


def search(
    query_terms,
    site_search,
    _from=init_date,
    _to=end_date,
    page_type="html",
    max_items=500,
    fields="title,tstamp,originalURL,linkToOriginalFile,linkToArchive,fileName",
    pretty_print="false",
    next_page=True
):
    endpoint = "https://arquivo.pt/textsearch"
    timeout = 30
    attempts = 2
    params = {
        "q": query_terms,
        "from": _from,
        "to": _to,
        "type": page_type,
        "siteSearch": site_search,
        "dedupValue": 0,
        "dedupField": "url",
        "maxItems": max_items,
        "fields": fields,
        "prettyPrint": pretty_print,
    }
    items = []

    response = try_request(endpoint=endpoint, params=params, timeout=timeout, attempts=attempts)
    if not response:
        return []

    for result in response["response_items"]:
        items.append(result)

    if next_page:
        # Follow the pagination links until there are none left or a request fails.
        while "next_page" in response:
            response = try_request(endpoint=response["next_page"], timeout=timeout, attempts=attempts)
            if not response:
                break  # try_request returned False; stop instead of indexing into it
            items.extend(response["response_items"])
    return items


publico = ["publico.pt", "www.publico.pt", "jornal.publico.pt", "dossiers.publico.pt", "desporto.publico.pt",
           "www.publico.clix.pt", "digital.publico.pt", "blogues.publico.pt", "economia.publico.pt", "m.publico.pt", "ultimahora.publico.pt"]
observador = ["observador.pt"]
dn = ["www.dn.pt", "dn.sapo.pt", "www.dn.sapo.pt"]
expresso = ["expresso.pt", "aeiou.expresso.pt", "expresso.sapo.pt"]
cm = ["www.correiomanha.pt", "www.correiodamanha.pt", "www.cmjornal.xl.pt", "www.cmjornal.pt"]
jn = ["www.jn.pt", "jn.pt", "jn.sapo.pt"]
abola = ["abola.pt", "www.abola.pt", "abola.pt:80"]
visao = ["aeiou.visao.pt", "visao.sapo.pt"]
sabado = ["www.sabado.pt", "www.sabado.xl.pt", "www.sabado.xl.pt:80", "sabado.pt"]

terms = []
for dep in depts:
    terms.append(dep["name"])

for party in parties:
    terms.append(party["sigla"])
    terms.append(party["nome"])
# print(terms)

def dedup_results(arr):
    # Drop results whose originalURL or fileName has already been seen.
    vals_url = []
    vals_title = []
    vals = []
    for val in arr:
        if val["originalURL"] in vals_url or val["fileName"] in vals_title:
            continue
        else:
            vals.append(val)
            vals_url.append(val["originalURL"])
            vals_title.append(val["fileName"])
    return vals
    

searches = {}

for term in terms:
    try:
        items = search(query_terms=term, site_search=",".join(domains),
                        max_items=500, next_page=True)
        searches[term] = items
        print(f"Found {len(items)} news for {term}.\n")
    except Exception as e:
        print(term, e)

with open("searches.jsonl", "w") as f:
    for term in searches:
        for result in searches[term]:
            json.dump({
                "term": term,
                "items": result,
            }, f)
            f.write("\n")



Found 161 news for ABEL BAPTISTA.

Found 137 news for ABÍLIO DIAS FERNANDES.

Found 109 news for ADÃO SILVA.

Found 63 news for AFONSO CANDAL.

Found 182 news for AGOSTINHO BRANQUINHO.

Found 128 news for AGOSTINHO GONÇALVES.

Found 725 news for AGOSTINHO LOPES.

Found 762 news for ALBERTO ANTUNES.

Found 30 news for ALBERTO ARONS DE CARVALHO.

Found 2000 news for ALBERTO COSTA.

Found 2000 news for ALBERTO MARTINS.

Found 0 news for ALCÍDIA LOPES.

Found 81 news for ALDA MACEDO.

Found 0 news for ALDEMIRA PINHO.

Found 0 news for ALTINO BESSA.

Found 16 news for ÁLVARO CASTELLO-BRANCO.

Found 160 news for ÁLVARO SARAIVA.

Found 8 news for ANA CATARINA MENDONÇA MENDES.

Found 119 news for ANA COUTO.

Found 179 news for ANA DRAGO.

Found 800 news for ANA MANSO.

Found 543 news for ANA MARIA ROCHA.

Found 2000 news for ANA PAULA VITORINO.

Found 2000 news for ANA RODRIGUES.

Found 26 news for ANA ZITA GOMES.

Found 329 news for ANDRÉ ALMEIDA.

Found 377 news for ANTÓNIO ALMEIDA HENRIQUES
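
The per-outlet host lists defined in the cell above (`dn`, `expresso`, `jn`, ...) are not used by the search itself, but they could serve to attribute each result to a newspaper. The sketch below shows one way to do that for a subset of the lists; `OUTLET_DOMAINS`, `HOST_TO_OUTLET` and `outlet_for_url` are names invented for this example.

```python
from urllib.parse import urlsplit

# Subset of the per-outlet host lists from the notebook, keyed by outlet name.
OUTLET_DOMAINS = {
    "Diário de Notícias": ["www.dn.pt", "dn.sapo.pt", "www.dn.sapo.pt"],
    "Expresso": ["expresso.pt", "aeiou.expresso.pt", "expresso.sapo.pt"],
    "Jornal de Notícias": ["www.jn.pt", "jn.pt", "jn.sapo.pt"],
}

# Reverse index: host -> outlet, for constant-time lookups.
HOST_TO_OUTLET = {
    host: outlet for outlet, hosts in OUTLET_DOMAINS.items() for host in hosts
}


def outlet_for_url(url):
    """Return the outlet name for a result URL, or None if the host is unknown."""
    host = urlsplit(url).hostname  # lowercased, without any ':80' port suffix
    return HOST_TO_OUTLET.get(host)


if __name__ == "__main__":
    print(outlet_for_url("http://dn.sapo.pt/inicio/interior.aspx?content_id=605127"))
    print(outlet_for_url("http://www.jn.pt:80/some/page"))
```

Because `hostname` drops the port, entries such as `abola.pt:80` would not need a separate key in this mapping.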

In [14]:
def dedup_results(arr):
    vals_url = []
    vals_title = []
    vals = []
    for val in arr:
        for result in arr[val]:
            if result["originalURL"] in vals_url or result["fileName"] in vals_title:
                continue
            else:
                vals.append({"term": val, "items": result})
                vals_url.append(result["originalURL"])
                vals_title.append(result["fileName"])
    return vals 
    
new_searches = dedup_results(searches)
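
The list-based membership checks above get slower as the number of results grows. A set-based variant is sketched here as an alternative, not as a drop-in replacement: it deduplicates on `originalURL` only, because the source does not state what `fileName` identifies, and `dedup_by_url` is a name made up for this example.

```python
def dedup_by_url(searches):
    """Flatten {term: [result, ...]} into a list of {"term", "items"} with unique URLs.

    The first term under which a URL appears keeps the result.
    """
    seen = set()
    unique = []
    for term, results in searches.items():
        for result in results:
            url = result["originalURL"]
            if url in seen:
                continue
            seen.add(url)
            unique.append({"term": term, "items": result})
    return unique


if __name__ == "__main__":
    # Hypothetical miniature input with one URL repeated across two terms.
    sample = {
        "TERM A": [{"originalURL": "http://example.org/1"},
                   {"originalURL": "http://example.org/2"}],
        "TERM B": [{"originalURL": "http://example.org/1"}],
    }
    print(len(dedup_by_url(sample)))
```

Set lookups take roughly constant time, so the whole pass is linear in the number of results.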


In [15]:
with open("searches_dedup.jsonl", "w") as f:
    for term in new_searches:
        # Printing every record here exceeded the Jupyter IOPub data rate limit, so only write to the file.
        json.dump(term, f)
        f.write("\n")



In [8]:
pip install newspaper3k  

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install --upgrade lxml_html_clean

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [16]:
from newspaper import Article

## Parse / Web Scraping HTML Pages
- [newspaper3k](https://pypi.org/project/newspaper3k/)

In [17]:
import json
import pandas as pd

In [None]:
searches_2 = []
with open("searches_dedup.jsonl") as f:
    for line in f:
        searches_2.append(json.loads(line))

data = []

# The loop variable is `entry`, not `search`, so the search() function is not shadowed.
for entry in searches_2:
    try:
        term = entry["term"]
        item = entry["items"]
        print(f"\nDownloading contents from {term}...")

        print(item)
        url = item["linkToOriginalFile"]
        article = Article(url, language="pt", memoize_articles=False, fetch_images=False)
        article.download()
        article.parse()
        title = item["title"]
        # Transform text into a single line
        text = article.text.replace("\n", " ")

        data.append([term, url, text, title])
        # The CSV is rewritten after every article, so partial results are on disk if the run stops.
        print("Saving results to csv...")
        df = pd.DataFrame(data, columns=["term", "url", "text", "title"])
        df.to_csv('results.csv', index=False)

    except Exception as e:
        print(e)




Downloading contents from ABEL BAPTISTA...
{'title': 'DN Online: Funcionária com reforma recusada pede nova junta médica', 'originalURL': 'http://dn.sapo.pt/2007/12/19/sociedade/funcionaria_reforma_recusada_pede_no.html', 'linkToArchive': 'https://arquivo.pt/wayback/20071221214514/http://dn.sapo.pt/2007/12/19/sociedade/funcionaria_reforma_recusada_pede_no.html', 'tstamp': '20071221214514', 'linkToOriginalFile': 'https://arquivo.pt/noFrame/replay/20071221214514id_/http://dn.sapo.pt/2007/12/19/sociedade/funcionaria_reforma_recusada_pede_no.html', 'fileName': 'FCCN-PT-HISTORICAL-ia400112.20080826133952'}
Saving results to csv...

Downloading contents from ABEL BAPTISTA...
{'title': 'Jorge Ferreira enfrenta Paulo Portas em Aveiro - dn - DN', 'originalURL': 'http://dn.sapo.pt/inicio/interior.aspx?content_id=605127', 'linkToArchive': 'https://arquivo.pt/wayback/20090924205957/http://dn.sapo.pt/inicio/interior.aspx?content_id=605127', 'tstamp': '20090924205957', 'linkToOriginalFile': 'https: