# Classify news by political spectrum and show trends by newspaper and journalist
Classify news articles within the political spectrum, according to whether they favor more libertarian or more authoritarian policies, and more free-market or more closed-market policies.

For each newspaper, show its trend within the political spectrum, based on the number of news articles and their textual sentiment (positive, neutral or negative).

Show the same information by journalist.
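
As a rough illustration of the per-newspaper and per-journalist view, the sketch below counts articles by sentiment label with pandas. The table layout, the column names (`newspaper`, `journalist`, `sentiment`) and the sample rows are assumptions made for this example; they are not data produced by the notebook.

```python
import pandas as pd

# Hypothetical, hand-made rows: one row per already-classified article.
articles = pd.DataFrame(
    [
        {"newspaper": "Outlet A", "journalist": "Reporter 1", "sentiment": "positive"},
        {"newspaper": "Outlet A", "journalist": "Reporter 1", "sentiment": "negative"},
        {"newspaper": "Outlet A", "journalist": "Reporter 2", "sentiment": "positive"},
        {"newspaper": "Outlet B", "journalist": "Reporter 3", "sentiment": "neutral"},
    ]
)


def sentiment_counts(df, by):
    """Count articles per sentiment label for each value of column `by`."""
    counts = pd.crosstab(df[by], df["sentiment"])
    # Keep all three labels as columns, even when one is absent from the data.
    return counts.reindex(columns=["positive", "neutral", "negative"], fill_value=0)


by_newspaper = sentiment_counts(articles, "newspaper")
by_journalist = sentiment_counts(articles, "journalist")
print(by_newspaper)
print(by_journalist)
```

The same helper serves both views because only the grouping column changes.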

## How to reach this classification?
- Get the topic of the article.
- Try to identify the main character of the article.
- If the main character is a politician, try to get that politician's political party.
- Get the sentiment of the article (positive, neutral or negative) and use it to infer where the article falls on the political spectrum.
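
One possible way to combine these steps is sketched below. It is only an assumption about how the pieces could fit together, not the project's method: the party names, their coordinates and the rule "positive sentiment pulls toward the party's position, negative pushes away, neutral contributes nothing" are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical party positions on two axes, each in [-1, 1]:
#   economic: -1 = closed market, +1 = free market
#   social:   -1 = libertarian,   +1 = authoritarian
# These placeholder values are NOT real measurements.
PARTY_POSITION = {
    "PARTY_X": (-0.5, 0.25),
    "PARTY_Y": (0.5, -0.25),
}

# Assumed rule: the sentiment label scales the party's position.
SENTIMENT_WEIGHT = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


@dataclass
class Article:
    topic: str
    main_character: str
    party: Optional[str]  # None when the main character is not a politician
    sentiment: str        # "positive", "neutral" or "negative"


def article_lean(article):
    """Return an (economic, social) lean for one article, or None if no party is known."""
    if article.party not in PARTY_POSITION:
        return None
    weight = SENTIMENT_WEIGHT[article.sentiment]
    economic, social = PARTY_POSITION[article.party]
    return (weight * economic, weight * social)


critical = Article("economy", "Some Politician", "PARTY_X", "negative")
print(article_lean(critical))  # negative coverage of PARTY_X leans away from its position
```

Averaging these per-article leans over a newspaper or a journalist would then give one point per source on the compass.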

<img src="./images/political_spectrum.png" alt="political_spectrum" width="400"/>
<img src="./images/PT_Political_Compass.png" alt="PT_Political_Compass" width="550"/>

## Similar Works:
- [politiquices](https://github.com/politiquices): 2nd place in the [Arquivo.pt Award 2021](https://sobre.arquivo.pt/pt/conheca-os-vencedores-do-premio-arquivo-pt-2021/)

## Data Sources:
- [arquivo.pt](https://arquivo.pt/)
  - [Público](https://www.publico.pt/)
  - [Jornal de Notícias](https://www.jn.pt/)
  - [Diário de Notícias](https://www.dn.pt/)
  - [Expresso](https://expresso.pt/)
  - [Observador](https://observador.pt/)
  - [Sapo](https://www.sapo.pt/)
  - [RTP](https://www.rtp.pt/)
  - [TVI](https://tvi.iol.pt/)
  - [Correio da Manhã](https://www.cmjornal.pt/)
  - [Jornal i](https://ionline.sapo.pt/)
  - [Sol](https://sol.sapo.pt/)
  - [Jornal Económico](https://jornaleconomico.sapo.pt/)
  - [Notícias ao Minuto](https://www.noticiasaominuto.com/)
  - [SIC Notícias](https://sicnoticias.pt/)
  - [Renascença](https://rr.sapo.pt/)
  - [Jornal de Negócios](https://www.jornaldenegocios.pt/)
  - [Visão](https://visao.sapo.pt/)
  - [Sábado](https://www.sabado.pt/)
- [wikidata.org](https://www.wikidata.org/)
- [dados.gov.pt](https://dados.gov.pt/)
- [parlamento.pt](https://www.parlamento.pt/Cidadania/Paginas/DadosAbertos.aspx)

In [2]:
import pandas as pd
import requests
import json
from pprint import pprint
with open("./Legislaturas/X.json") as json_file:
    legislature_json = json.load(json_file)

legislature = legislature_json["Legislatura"]

l_init_date = legislature["DetalheLegislatura"]["dtini"]  # 2005-03-10
l_end_date = legislature["DetalheLegislatura"]["dtfim"]  # 2009-10-14

deputies = legislature["Deputados"]["pt_ar_wsgode_objectos_DadosDeputadoSearch"]
parties = legislature["GruposParlamentares"]["pt_gov_ar_objectos_GPOut"]

pprint(len(deputies))

352


In [12]:
init_date = l_init_date.replace("-", "")
end_date = l_end_date.replace("-", "")
maxItems = 100
domains = [
    "publico.pt",
    "www.publico.pt",
    "jornal.publico.pt",
    "dossiers.publico.pt",
    "desporto.publico.pt",
    "www.publico.clix.pt",
    "digital.publico.pt",
    "blogues.publico.pt",
    "economia.publico.pt",
    "m.publico.pt",
    "ultimahora.publico.pt",
    "observador.pt",
    "www.dn.pt",
    "dn.sapo.pt",
    "www.dn.sapo.pt",
    "expresso.pt",
    "aeiou.expresso.pt",
    "expresso.sapo.pt",
    "www.correiomanha.pt",
    "www.correiodamanha.pt",
    "www.cmjornal.xl.pt",
    "www.cmjornal.pt",
    "www.jn.pt",
    "jn.pt",
    "jn.sapo.pt",
    "abola.pt",
    "www.abola.pt",
    "abola.pt:80",
    "www.sabado.pt",
    "www.sabado.pt:80",
    "www.sabado.xl.pt",
    "www.sabado.xl.pt:80",
    "sabado.pt",
    "visaoonline.clix.pt:80",
    "visao.clix.pt:80",
    "aeiou.visao.pt",
    "visao.sapo.pt",
]

news_per_deputy = {}
total_dep = len(deputies)
depts = []

# search for news for each deputy in the years of the legislature
for index, dep in enumerate(deputies):
    dep_id = dep["depId"]
    dep_name = dep["depNomeParlamentar"]
    deputy = {
        "id": dep_id,
        "name": dep_name
    }
    depts.append(deputy)
    query = f"{dep_name}"

    print(f"{index + 1}/{total_dep} - {dep_name}")
    print(f"Searching news for {dep_name}...")

    payload = {
        "q": query,
        "maxItems": maxItems,
        "siteSearch": ",".join(domains),
        "from": init_date,
        "to": end_date,
    }

    # r = requests.get("https://arquivo.pt/textsearch", params=payload)
    #
    # json_res = r.json()
    # items = json_res["response_items"]
    #
    # news_per_deputy[dep_name] = {
    #     "estimated_nr_results": json_res["estimated_nr_results"],
    #     "items": items,
    # }
    #
    # print(f"Found {json_res['estimated_nr_results']} news for {dep_name}.\n")
#
# df = pd.DataFrame(
#     [
#         (dep, news_per_deputy[dep]["estimated_nr_results"])
#         for dep in news_per_deputy
#     ],
#     columns=["Deputy", "N_news"],
# )
# df.sort_values(by="N_news", ascending=False, inplace=True)

1/352 - ABEL BAPTISTA
Searching news for ABEL BAPTISTA...
2/352 - ABÍLIO DIAS FERNANDES
Searching news for ABÍLIO DIAS FERNANDES...
3/352 - ADÃO SILVA
Searching news for ADÃO SILVA...
4/352 - AFONSO CANDAL
Searching news for AFONSO CANDAL...
5/352 - AGOSTINHO BRANQUINHO
Searching news for AGOSTINHO BRANQUINHO...
6/352 - AGOSTINHO GONÇALVES
Searching news for AGOSTINHO GONÇALVES...
7/352 - AGOSTINHO LOPES
Searching news for AGOSTINHO LOPES...
8/352 - ALBERTO ANTUNES
Searching news for ALBERTO ANTUNES...
9/352 - ALBERTO ARONS DE CARVALHO
Searching news for ALBERTO ARONS DE CARVALHO...
10/352 - ALBERTO COSTA
Searching news for ALBERTO COSTA...
11/352 - ALBERTO MARTINS
Searching news for ALBERTO MARTINS...
12/352 - ALCÍDIA LOPES
Searching news for ALCÍDIA LOPES...
13/352 - ALDA MACEDO
Searching news for ALDA MACEDO...
14/352 - ALDEMIRA PINHO
Searching news for ALDEMIRA PINHO...
15/352 - ALTINO BESSA
Searching news for ALTINO BESSA...
16/352 - ÁLVARO CASTELLO-BRANCO
Searching news for ÁLVAR

In [4]:
# write news_per_deputy to data.json, one JSON object per line
# with open("data.json", "w") as f:
#     for dep in news_per_deputy:
#         json.dump({
#             "deputy": dep,
#             "estimated_nr_results": news_per_deputy[dep]["estimated_nr_results"],
#             "items": news_per_deputy[dep]["items"],
#         }, f)
#         f.write("\n")
#
# # write df to csv
# df.to_csv("news_per_deputy.csv", index=False)
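
Because the cell above writes one JSON object per line, the resulting file is not a single JSON document and has to be read back line by line. The sketch below shows that round trip on a temporary file; the helper name `read_json_lines` and the sample records are assumptions for illustration only.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Read a file that holds one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Hypothetical records with the same keys the cell above writes.
sample = [
    {"deputy": "DEPUTY A", "estimated_nr_results": 3, "items": []},
    {"deputy": "DEPUTY B", "estimated_nr_results": 0, "items": []},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    # Write in the same one-object-per-line layout.
    with open(path, "w", encoding="utf-8") as f:
        for record in sample:
            json.dump(record, f, ensure_ascii=False)
            f.write("\n")
    loaded = read_json_lines(path)

print(len(loaded))
```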

In [52]:
import time


def try_request(endpoint, params=None, timeout=30, attempts=10):
    """GET `endpoint` and return the parsed JSON, or False if every attempt fails."""
    for i in range(attempts):
        try:
            # the timeout grows with each attempt
            response = requests.get(endpoint, params=params, timeout=timeout * ((i + 1) * 2))
            if response.status_code == 404:
                return False
            if response.status_code == 429:  # too many requests: wait before retrying
                time.sleep(10)
            if response.status_code != 200:
                raise Exception("Bad status code %s" % response.status_code)
            # 'Ã' without any 'ç' suggests the body was decoded with the wrong charset
            if 'Ã' in response.text and "ç" not in response.text:
                response.encoding = "utf-8"
            return response.json()
        except Exception as e:
            print("[%s] for [%s] (attempt %d/%d)" % (e, params, i + 1, attempts))
    return False  # end of attempts


def search(
    query_terms,
    site_search,
    _from=init_date,
    _to=end_date,
    page_type="html",
    max_items=2000,
    fields="title,tstamp,originalURL,linkToOriginalFile,linkToArchive",
    pretty_print="false",
    next_page=False
):
    endpoint = "https://arquivo.pt/textsearch"
    timeout = 30
    attempts = 1
    params = {
        "q": query_terms,
        "from": _from,
        "to": _to,
        "type": page_type,
        "siteSearch": site_search,
        "maxItems": max_items,
        "fields": fields,
        "prettyPrint": pretty_print,
    }
    items = []

    response = try_request(endpoint=endpoint, params=params, timeout=timeout, attempts=attempts)
    if not response:
        return []

    for result in response["response_items"]:
        items.append(result)

    if next_page:
        # follow the "next_page" links until there are none left or a request fails
        while response and "next_page" in response:
            response = try_request(endpoint=response["next_page"], timeout=timeout, attempts=attempts)
            if response:
                items.extend(response["response_items"])
    return items


publico = ["publico.pt", "www.publico.pt", "jornal.publico.pt", "dossiers.publico.pt", "desporto.publico.pt",
           "www.publico.clix.pt", "digital.publico.pt", "blogues.publico.pt", "economia.publico.pt", "m.publico.pt", "ultimahora.publico.pt"]
observador = ["observador.pt"]
dn = ["www.dn.pt", "dn.sapo.pt", "www.dn.sapo.pt"]
expresso = ["expresso.pt", "aeiou.expresso.pt", "expresso.sapo.pt"]
cm = ["www.correiomanha.pt", "www.correiodamanha.pt", "www.cmjornal.xl.pt", "www.cmjornal.pt"]
jn = ["www.jn.pt", "jn.pt", "jn.sapo.pt"]
abola = ["abola.pt", "www.abola.pt", "abola.pt:80"]
visao = ["aeiou.visao.pt", "visao.sapo.pt"]
sabado = ["www.sabado.pt", "www.sabado.xl.pt", "www.sabado.xl.pt:80", "sabado.pt"]

terms = []
for dep in depts:
    terms.append(dep["name"])

for party in parties:
    terms.append(party["sigla"])
    terms.append(party["nome"])

item_df = pd.DataFrame()
for term in terms:
    for journal in domains:
        items = search(query_terms=term, site_search=journal,
                       max_items=2000, next_page=False)
        print(items)

[]
[{'title': 'Composição do III Governo Constitucional', 'originalURL': 'http://www.publico.pt/servico/notinuse/proggov/governo3/comp3.html', 'linkToArchive': 'https://arquivo.pt/wayback/20050621190134/http://www.publico.pt/servico/notinuse/proggov/governo3/comp3.html', 'tstamp': '20050621190134', 'linkToOriginalFile': 'https://arquivo.pt/noFrame/replay/20050621190134id_/http://www.publico.pt/servico/notinuse/proggov/governo3/comp3.html'}, {'title': 'PUBLICO - Quem somos nós', 'originalURL': 'http://www.publico.pt/nos/fichatecnica.html', 'linkToArchive': 'https://arquivo.pt/wayback/20050621182756/http://www.publico.pt/nos/fichatecnica.html', 'tstamp': '20050621182756', 'linkToOriginalFile': 'https://arquivo.pt/noFrame/replay/20050621182756id_/http://www.publico.pt/nos/fichatecnica.html'}]
[{'title': 'PUBLICO.PT', 'originalURL': 'http://jornal.publico.pt/indice.asp?a=2005&m=03&d=24', 'linkToArchive': 'https://arquivo.pt/wayback/20050324071358/http://jornal.publico.pt/indice.asp?a=2005&

[{'title': 'DN Online: Sete meses de prisão para juiz por causa de crucifixo', 'originalURL': 'http://www.dn.sapo.pt/2005/12/10/sociedade/sete_meses_prisao_para_juiz_causa_cr.html', 'linkToArchive': 'https://arquivo.pt/wayback/20070609140404/http://www.dn.sapo.pt/2005/12/10/sociedade/sete_meses_prisao_para_juiz_causa_cr.html', 'tstamp': '20070609140404', 'linkToOriginalFile': 'https://arquivo.pt/noFrame/replay/20070609140404id_/http://www.dn.sapo.pt/2005/12/10/sociedade/sete_meses_prisao_para_juiz_causa_cr.html'}, {'title': 'DN Online: Daniel Campelo promete liderar um bastião oposicionista no Alto Minho', 'originalURL': 'http://www.dn.sapo.pt/2007/04/09/nacional/daniel_campelo_promete_liderar_basti.html', 'linkToArchive': 'https://arquivo.pt/wayback/20070610150243/http://www.dn.sapo.pt/2007/04/09/nacional/daniel_campelo_promete_liderar_basti.html', 'tstamp': '20070610150243', 'linkToOriginalFile': 'https://arquivo.pt/noFrame/replay/20070610150243id_/http://www.dn.sapo.pt/2007/04/09/na

KeyboardInterrupt: 
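
The printed lists are hard to inspect, so a natural next step is to collect the items into a table. The sketch below is one way to do that under stated assumptions: the helper `items_to_frame` and the added columns `term`, `archived_at` and `domain` are hypothetical names, and the two sample items reuse only the `title`, `originalURL` and `tstamp` fields seen in the output above.

```python
import pandas as pd
from urllib.parse import urlparse

# Two items shaped like the search results printed above (fields trimmed).
items = [
    {
        "title": "PUBLICO - Quem somos nós",
        "originalURL": "http://www.publico.pt/nos/fichatecnica.html",
        "tstamp": "20050621182756",
    },
    {
        "title": "DN Online: Sete meses de prisão para juiz por causa de crucifixo",
        "originalURL": "http://www.dn.sapo.pt/2005/12/10/sociedade/sete_meses_prisao_para_juiz_causa_cr.html",
        "tstamp": "20070609140404",
    },
]


def items_to_frame(items, term):
    """Turn a list of result items into a DataFrame tagged with the search term."""
    if not items:
        return pd.DataFrame(
            columns=["title", "originalURL", "tstamp", "term", "archived_at", "domain"]
        )
    frame = pd.DataFrame(items)
    frame["term"] = term
    # tstamp is a 14-digit capture time: YYYYMMDDhhmmss
    frame["archived_at"] = pd.to_datetime(frame["tstamp"], format="%Y%m%d%H%M%S")
    frame["domain"] = frame["originalURL"].map(lambda url: urlparse(url).netloc)
    # The same page can be returned more than once; keep one row per URL.
    return frame.drop_duplicates(subset="originalURL")


frame = items_to_frame(items, "SOME TERM")
print(frame[["term", "domain", "archived_at"]])
```

Frames built this way for each term and domain could be combined with `pd.concat` instead of being printed one by one.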