This notebook gets the feed from xadrez verbal podcast from its website and stores it into a csv file using pandas.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import json

When testing, I discovered that the website accessed an api to get posts to show. Each post is an episode of the podcast given in HTML containing the name, links and other infos.

In [19]:
url = f"https://public-api.wordpress.com/wpcom/v2/sites/53884404/articles?className=is-style-borders&showExcerpt=0&imageShape=uncropped&moreButton=1&showAuthor=0&postsToShow=100&mediaPosition=left&categories%5B0%5D=548679890&imageScale=2&textColor=black&excerptLength=55&showReadMore=0&readMoreLabel=Keep%20reading&showDate=1&showImage=1&showCaption=0&disableImageLazyLoad=0&minHeight=0&moreButtonText&showAvatar=1&showCategory=0&postLayout=list&columns=3&colGap=3&&&&&&typeScale=4&mobileStack=0&sectionHeader&specificMode=0&customTextColor&singleMode=0&showSubtitle=0&postType%5B0%5D=post&textAlign=left&includedPostStatuses%5B0%5D=publish&page=2"

In [20]:
ret = requests.get(url)

The response is a json file, so it's better to use a json parser.

In [21]:
articles_json = json.loads(ret.text)

articles_json.keys()

dict_keys(['items', 'ids', 'next'])

Items contains a list of posts and we can get the html and link by parsing the HTML using bs4

In [62]:
test = bs(articles_json['items'][0]['html'])

In [63]:
test.a['href']

'https://xadrezverbal.com/2020/02/15/xadrez-verbal-podcast-222-eua-el-salvador-e-europa/'

In [64]:
test.text.split('\n')[6].replace('\xa0', ' ')

'Xadrez Verbal Podcast #222 – EUA, El Salvador e Europa '

I don't know how many episodes there are and the api denies me when I input a giant number to 'postsToShow' parameter. <br>
There is a parameter used by the website to exclude already loaded posts so I collect that data and used it to input into the query. <br>
The loaded ids is given in the json file.

In [65]:
ids_string = ""

# Iterates over all ids and stores it separated by commas into a string.
for article in articles_json['ids']:
    ids_string = ids_string + str(article) + ','

ids_string = ids_string.removesuffix(',')

The way to build the new url is to input the loaded id string.

In [66]:
f"&exclude_ids={ids_string}"

'&exclude_ids=8412,8398,8388,8366,8335,8321,8296,8271,8247,8222,8206,8187,8165,8153,8108,8084,8058,8035,7999,7972,7942,7922,7903,7878,7862,7834,7823,7800,7766,7729,7711,7701,7692,7676,7667,7658,7644,7630,7621,7613,7600,7582,7569,7557,7550,7537,7525,7513,7501,7492,7483,7475,7465,7450,7439,7429,7415,7399,7391,7377,7366,7353,7343,7325,7289,7275,7255,7245,7229,7217,7170,7151,7142,7133,7119,7105,7090,7077,7068,7059,7054,7045,7040,7030,7017,7002,6983,6971,6949,6923,6898,6880,6860,6854,6837,6824,6817,6794,6606,6595'

Now we create the logger for the next queries

In [51]:
import logging
import os
# Color number definition
BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE = range(8)

# These are the sequences need to get colored output
RESET_SEQ = "\033[0m"
COLOR_SEQ = "\033[1;%dm"
BOLD_SEQ = "\033[1m"

COLORS = {
    'WARNING': YELLOW,
    'INFO': WHITE,
    'DEBUG': BLUE,
    'CRITICAL': RED,
    'ERROR': RED
}

# Special function used to ease a message formatting edition


def formatter_message(message, use_color=True):
    if use_color:
        message = message.replace(
            "$RESET", RESET_SEQ).replace("$BOLD", BOLD_SEQ)
    else:
        message = message.replace("$RESET", "").replace("$BOLD", "")
    return message

# Format log level name color accordinly


class ColoredFormatter(logging.Formatter):
    def __init__(self, msg, use_color=True):
        logging.Formatter.__init__(self, msg)
        self.use_color = use_color

    def format(self, record):
        levelname = record.levelname
        if self.use_color and levelname in COLORS:
            levelname_color = COLOR_SEQ % (
                30 + COLORS[levelname]) + levelname + RESET_SEQ
            record.levelname = levelname_color
        return logging.Formatter.format(self, record)

# Logger class used in all logging operations


class log(logging.Logger):
    # Message format with collors \033[1;35m = Magenta
    FORMAT = '\033[35m%(asctime)s\033[0m [$BOLD%(levelname)-18s$RESET]\033[35m [%(processName)s][%(threadName)s][%(module)s]\033[0m %(message)s - $BOLDLine:%(lineno)d$RESET'
    COLOR_FORMAT = formatter_message(FORMAT, True)

    def __init__(self, name='my_logger'):
        # Create logger with debug level
        logging.Logger.__init__(self, name, logging.DEBUG)

        # create console handler and set level to debug
        if not self.handlers:
            color_formatter = ColoredFormatter(self.COLOR_FORMAT)
            ch = logging.StreamHandler()
            ch.setLevel(logging.DEBUG)
            # add formatter to ch
            ch.setFormatter(color_formatter)
            # add ch to logger
            self.addHandler(ch)

            log_file = logging.FileHandler(
                filename=os.getcwd()+f'/logs/{name}.log', mode='w+', encoding='utf8')
            formatter = logging.Formatter(
                '%(asctime)s [%(levelname)-18s][%(processName)s][%(threadName)s][%(module)s] %(message)s - %(lineno)d')
            log_file.setFormatter(formatter)
            log_file.setLevel(logging.DEBUG)
            self.addHandler(log_file)

_log = log('my_log')

For convenience I created a function for getting all articles. It stops when the list received by the API has less than 2 items. Those 2 items are useless info for this context.

In [52]:
def get_articles():
    searching = True
    exclude_string = "&exclude_ids="
    articles_html = []
    
    # Searches untill no more articles are found.
    while(searching):
        _log.debug("Getting articles...")
        url = f"https://public-api.wordpress.com/wpcom/v2/sites/53884404/articles?className=is-style-borders&showExcerpt=0&imageShape=uncropped&moreButton=1&showAuthor=0&postsToShow=100&mediaPosition=left&categories%5B0%5D=548679890&imageScale=2&textColor=black&excerptLength=55&showReadMore=0&readMoreLabel=Keep%20reading&showDate=1&showImage=1&showCaption=0&disableImageLazyLoad=0&minHeight=0&moreButtonText&showAvatar=1&showCategory=0&postLayout=list&columns=3&colGap=3&&&&&&typeScale=4&mobileStack=0&sectionHeader&specificMode=0&customTextColor&singleMode=0&showSubtitle=0&postType%5B0%5D=post&textAlign=left&includedPostStatuses%5B0%5D=publish&page=2" + exclude_string
        ret = requests.get(url)

        _log.debug(f"Request response: {ret}")

        articles_json = json.loads(ret.text)

        _log.debug( f"{len(articles_json['items'])} articles found")

        # Checks items received.
        if len(articles_json['items']) > 2:
            for article in articles_json['items']:
                articles_html += bs(article['html'])

            # Stores the ids to exclude
            ids_string = ""
            for id in articles_json['ids']:
                ids_string = ids_string + str(id) + ','
            ids_string = ids_string.removesuffix(',')
            exclude_string = exclude_string + f"{ids_string}"

            _log.debug(exclude_string)
        else:
            searching = False

    # Returns all article html found
    return articles_html

In [53]:
articles_list = get_articles()

[35m2023-02-02 19:51:29,307[0m [[1m[1;34mDEBUG[0m  [0m][35m [MainProcess][MainThread][298439224][0m Getting articles... - [1mLine:7[0m
[35m2023-02-02 19:51:31,443[0m [[1m[1;34mDEBUG[0m  [0m][35m [MainProcess][MainThread][298439224][0m Request response: <Response [200]> - [1mLine:10[0m
[35m2023-02-02 19:51:31,445[0m [[1m[1;34mDEBUG[0m  [0m][35m [MainProcess][MainThread][298439224][0m 100 articles found - [1mLine:12[0m
[35m2023-02-02 19:51:31,503[0m [[1m[1;34mDEBUG[0m  [0m][35m [MainProcess][MainThread][298439224][0m &exclude_ids=8412,8398,8388,8366,8335,8321,8296,8271,8247,8222,8206,8187,8165,8153,8108,8084,8058,8035,7999,7972,7942,7922,7903,7878,7862,7834,7823,7800,7766,7729,7711,7701,7692,7676,7667,7658,7644,7630,7621,7613,7600,7582,7569,7557,7550,7537,7525,7513,7501,7492,7483,7475,7465,7450,7439,7429,7415,7399,7391,7377,7366,7353,7343,7325,7289,7275,7255,7245,7229,7217,7170,7151,7142,7133,7119,7105,7090,7077,7068,7059,7054,7045,7040,7030,7017,700

Checks the list length to see if it is a feasible number.

In [54]:
len(articles_list)

331

Transforms the results into a dataframe to store it into a csv file.

In [55]:
import pandas as pd

xadrez_verbal_feed = pd.DataFrame(columns=["name", "link"])

For each article, extracts is name and link at the same time as it is added into the dataframe.

In [57]:
for article in articles_list:
    xadrez_verbal_feed.loc[xadrez_verbal_feed.shape[0]] = [article.text.split('\n')[6].replace('\xa0', ' '), article.a['href']]

Checks the result

In [58]:
xadrez_verbal_feed

Unnamed: 0,name,link
0,"Xadrez Verbal Podcast #222 – EUA, El Salvador ...",https://xadrezverbal.com/2020/02/15/xadrez-ver...
1,Xadrez Verbal Podcast #221 – Fim do Brexit e d...,https://xadrezverbal.com/2020/02/08/xadrez-ver...
2,"Xadrez Verbal Podcast #220 – Europa, América L...",https://xadrezverbal.com/2020/01/31/xadrez-ver...
3,"Xadrez Verbal Podcast #219 – Europa, Oriente M...",https://xadrezverbal.com/2020/01/25/xadrez-ver...
4,Xadrez Verbal Podcast #218 – Virada de 2019 pa...,https://xadrezverbal.com/2020/01/18/xadrez-ver...
...,...,...
326,"Xadrez Verbal Podcast #5 – Jerusalém, Armênia ...",https://xadrezverbal.com/2015/06/12/xadrez-ver...
327,Xadrez Verbal Podcast #4 – A semana na polític...,https://xadrezverbal.com/2015/05/29/xadrez-ver...
328,Xadrez Verbal Podcast #3 – A semana na polític...,https://xadrezverbal.com/2015/05/22/xadrez-ver...
329,Xadrez Verbal Podcast #2 – A semana na polític...,https://xadrezverbal.com/2015/05/15/xadrez-ver...


Stores into a csv

In [61]:
xadrez_verbal_feed.to_csv("xadrez_verbal_feed.csv", sep=";", index=False)