### Explaining Sitenews Preprocessing


In [None]:
# Loading all sitenews takes a long time.
# The result of this function is already saved in ./datasets/final_df.parquet.

# final_df = run_async(load_all_sitenews, max_connections=20)
# save_dataframe_to_parquet(final_df, "final_df.parquet")

In [16]:
from isswrapper.util.helpers import read_parquet_into_dataframe
import pandas as pd 
from bs4 import BeautifulSoup
import os

In [26]:
current_path = os.getcwd()
project_path = os.path.dirname(current_path)
datasets_folder_path = os.path.join(project_path, 'datasets') 
df = read_parquet_into_dataframe(os.path.join(datasets_folder_path, "final_df.parquet"))
print(df.shape)
df.head()


(40799, 6)


Unnamed: 0,id,tag,title,published_at,modified_at,body
0,63318,site,О выявленном несоответствии ценных бумаг,2023-08-21 19:00:00,2023-08-21 19:00:05,<p>В соответствии с Правилами листинга ПАО Мос...
1,63316,site,Об установлении риск-параметров на фондовом ры...,2023-08-21 18:54:24,2023-08-21 18:54:24,<p>С 22.08.2023 решением НКО НКЦ (АО) устанавл...
2,63315,site,О вступлении в силу внутренних документов Моск...,2023-08-21 18:53:22,2023-08-21 18:53:23,"<p><span>Информируем, что 28 августа 2023 года..."
3,63314,site,Об изменении уровня листинга ценных бумаг,2023-08-21 18:24:25,2023-08-21 18:24:26,<p>В соответствии с Правилами листинга ПАО Мос...
4,63313,site,О регистрации программ биржевых облигаций,2023-08-21 18:19:22,2023-08-21 18:19:22,<p><span>В соответствии c Правилами листинга П...


The dataset contains all kinds of news articles, but our focus is on news related to unusual behavior in securities. We've noticed that when an article contains the phrase 'отклонения цен заявок' (order price deviations), it includes tables describing changes in risk parameters for specific securities. Such a change either relaxes existing restrictions or introduces new ones. We are mainly interested in the second case: it follows either from something genuinely positive happening to the security, or from someone with questionable intentions attempting a pump-and-dump.

So, our first step is to filter the records that match this criterion.

In [28]:
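# regex=False matches the phrase literally; na=False treats missing bodies as non-matches.
filtered_df = df[df["body"].str.contains("отклонения цен заявок", regex=False, na=False)]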
filtered_df.shape

(140, 6)
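
To see the phrase filter in isolation, here is a minimal sketch on a toy frame. The three rows are invented for illustration and are not taken from the dataset. It assumes a literal substring match is what we want (`regex=False`) and that rows with a missing body should simply not match (`na=False`).

```python
import pandas as pd

PHRASE = "отклонения цен заявок"

# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "body": [
            "<p>Изменены границы отклонения цен заявок</p>",
            "<p>О регистрации программ биржевых облигаций</p>",
            None,  # a missing body must not break the filter
        ],
    }
)

# Literal substring match; missing bodies count as "no match".
mask = toy["body"].str.contains(PHRASE, regex=False, na=False)
matches = toy[mask]
print(matches["id"].tolist())
```

Only the first toy row contains the phrase, so it is the only one kept.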

In [32]:
def has_table(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    return table is not None

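# .copy() gives an independent frame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
w_table_df = filtered_df[filtered_df["body"].apply(has_table)].copy()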
w_table_df.shape

(140, 6)

Next, we extract these tables from the news articles and strip out the details we don't need.

Not all tables in this dataset are well formed, though. Older ones may lack the 'th' tag; in that case we treat the first row as the header.

The 'extract_table' function handles this: it takes the HTML body, parses the first table in it, and returns that table as a pandas DataFrame.
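
The header fallback can be illustrated without pandas. The sketch below is not the notebook's implementation: it swaps in the standard library's `html.parser`, and the names `FirstTableParser` and `split_header`, the column labels, and the ticker `ABCD` are all hypothetical. It picks the first row that contains a `th` cell as the header and falls back to the first row when no such cell exists.

```python
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Collect the rows of the first <table>, remembering which rows use <th>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # each row is a list of cell texts
        self.th_rows = []    # parallel list: True if the row contains a <th>
        self._in_table = False
        self._finished = False
        self._cell = None    # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if self._finished:
            return
        if tag == "table":
            self._in_table = True
        elif self._in_table and tag == "tr":
            self.rows.append([])
            self.th_rows.append(False)
        elif self._in_table and tag in ("td", "th") and self.rows:
            self._cell = []
            if tag == "th":
                self.th_rows[-1] = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._in_table:
            self._in_table = False
            self._finished = True  # ignore any later tables


def split_header(html):
    """Return (header, body_rows), falling back to the first row without <th>."""
    parser = FirstTableParser()
    parser.feed(html)
    if not parser.rows:
        raise ValueError("no table rows found")
    header_idx = parser.th_rows.index(True) if any(parser.th_rows) else 0
    header = parser.rows[header_idx]
    body = [row for i, row in enumerate(parser.rows) if i != header_idx]
    return header, body


# Made-up tables: the same content with and without <th> cells.
old_style = "<table><tr><td>Код</td><td>Лимит</td></tr><tr><td>ABCD</td><td>40%</td></tr></table>"
new_style = "<table><tr><th>Код</th><th>Лимит</th></tr><tr><td>ABCD</td><td>40%</td></tr></table>"

print(split_header(old_style))
print(split_header(new_style))
```

Both layouts yield the same header and body, which is the property the fallback is meant to guarantee.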

The 'preprocess' function then drops the uninformative columns, namely the row numbers ('№') and the full names ('Название'), renames the remaining columns for convenience, and converts the percentage strings into plain numbers.

In [33]:
from io import StringIO


def extract_table(html):
    """Return the first HTML table in `html` as a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    # Tables without <th> cells (older news) use their first row as the header.
    header = None if soup.find_all("th") else 0
    # Wrap in StringIO: newer pandas deprecates passing literal HTML strings.
    tables = pd.read_html(StringIO(str(soup)), header=header)
    return tables[0]


def preprocess(df):
    """Keep the ticker and limit columns; convert percentage strings to floats."""
    if df.shape[1] == 5:
        df = df.drop(columns="№")  # row-number column, present in some tables
    df = df.drop(columns="Название")  # full security name
    df = df.rename(
        columns={
            df.columns[0]: "token",
            df.columns[1]: "current_limit",
            df.columns[2]: "new_limit",
        }
    )
    df["current_limit"] = df["current_limit"].str.rstrip("%").astype(float)
    df["new_limit"] = df["new_limit"].str.rstrip("%").astype(float)
    return df


w_table_df["table"] = w_table_df["body"].apply(extract_table)
w_table_df["t_shape"] = w_table_df["table"].apply(lambda t: t.shape[1])
# Skip the few non-standard tables with more than 6 columns.
w_table_df = w_table_df[w_table_df["t_shape"] <= 6].copy()
w_table_df["table"] = w_table_df["table"].apply(preprocess)
w_table_df.shape


(136, 8)
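
The cleaning step is easier to follow on a small example. The sketch below uses a condensed, hypothetical variant called `tidy_limits` rather than the notebook's own function, and the toy table is invented: only the '№' and 'Название' column names come from the data, while 'Код', 'Текущее', 'Новое' and the tickers are assumptions made for illustration.

```python
import pandas as pd


def tidy_limits(table):
    """Condensed, illustrative variant of the cleaning step (returns a new frame)."""
    # Drop the row-number and full-name columns when they are present.
    table = table.drop(columns=["№", "Название"], errors="ignore").copy()
    table.columns = ["token", "current_limit", "new_limit"]
    for col in ("current_limit", "new_limit"):
        # "40%" -> 40.0
        table[col] = table[col].str.rstrip("%").astype(float)
    return table


# Made-up table in the 5-column layout.
five_col = pd.DataFrame(
    {
        "№": [1, 2],
        "Код": ["AAAA", "BBBB"],
        "Название": ["Компания А", "Компания Б"],
        "Текущее": ["40%", "10%"],
        "Новое": ["10%", "40%"],
    }
)
# The same table in the 4-column layout (no row numbers).
four_col = five_col.drop(columns="№")

result = tidy_limits(five_col)
print(result)
```

Both layouts reduce to the same three columns, so the rest of the pipeline does not need to know which variant a news item used.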

Why do we omit tables with more than 6 columns? In short, this affects only 4 of the 140 entries (the shape drops from 140 to 136 rows), and these tables don't hold significant data. They are also not standardized, which makes automatic extraction difficult, so it is simpler to skip them altogether.


In [34]:
pd_list = w_table_df["table"].tolist()
f_df = pd.concat(pd_list)
f_df

Unnamed: 0,token,current_limit,new_limit
0,DZRD,40.0,10.0
1,RKKE,20.0,10.0
0,VRSB,10.0,40.0
1,MRKS,10.0,40.0
2,ROST,10.0,40.0
...,...,...,...
0,LPSB,10.0,40.0
0,RUSI,10.0,40.0
0,UCSS,40.0,10.0
0,LPSB,40.0,10.0
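
The concatenated frame repeats index values (0, 1, 0, 1, 2, ...) and no longer shows which news item each row came from. One possible way to keep that link is to pass a dict of tables to `pd.concat`, sketched below. The news ids 101 and 102 are made up; in the notebook the dict could presumably be built from the `id` and `table` columns, assuming the ids are unique.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables, keyed by a made-up news id.
tables_by_id = {
    101: pd.DataFrame(
        {
            "token": ["DZRD", "RKKE"],
            "current_limit": [40.0, 20.0],
            "new_limit": [10.0, 10.0],
        }
    ),
    102: pd.DataFrame(
        {"token": ["VRSB"], "current_limit": [10.0], "new_limit": [40.0]}
    ),
}

combined = (
    pd.concat(tables_by_id, names=["news_id", "row"])  # dict keys become an index level
    .reset_index(level="news_id")                      # move the news id into a column
    .reset_index(drop=True)                            # replace the repeated row index
)
print(combined)
```

With the news id in place, each limit change can later be joined back to its publication date.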


Now let's focus on the entries where new limitations have been introduced (new_limit < current_limit).

In [35]:
f_df[f_df["current_limit"] > f_df["new_limit"]]

Unnamed: 0,token,current_limit,new_limit
0,DZRD,40.0,10.0
1,RKKE,20.0,10.0
0,VSYD,40.0,10.0
1,VSYDP,40.0,10.0
2,GTRK,100.0,10.0
...,...,...,...
0,MGVM,40.0,10.0
0,IDVP,40.0,10.0
0,UCSS,40.0,10.0
0,LPSB,40.0,10.0
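
Beyond a yes/no filter, it can help to see how large each tightening is. The following sketch adds a hypothetical `change` column (new limit minus current limit) and sorts by it. The three rows reuse values that appear in the outputs above, but the frame itself is a toy.

```python
import pandas as pd

# Toy frame using (token, current, new) triples seen in the outputs above.
limits = pd.DataFrame(
    {
        "token": ["DZRD", "VRSB", "GTRK"],
        "current_limit": [40.0, 10.0, 100.0],
        "new_limit": [10.0, 40.0, 10.0],
    }
)

# Negative change = the limit was tightened; positive = it was relaxed.
limits["change"] = limits["new_limit"] - limits["current_limit"]

# Keep tightened limits only, largest reduction first.
tightened = limits[limits["change"] < 0].sort_values("change")
print(tightened)
```

Here GTRK comes first with a change of -90 points, followed by DZRD with -30, while VRSB is dropped because its limit was relaxed.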


At this point, we have identified the securities that warrant scrutiny for potential pump-and-dump activity.

In [36]:
f_df.loc[f_df["current_limit"] > f_df["new_limit"], "token"].unique()

array(['DZRD', 'RKKE', 'VSYD', 'VSYDP', 'GTRK', 'ASSB', 'TNSE', 'KCHEP',
       'MISBP', 'NNSBP', 'KCHE', 'VRSBP', 'RTSB', 'ROST', 'VRSB', 'NKHP',
       'TGKN', 'ELTZ', 'KUBE', 'BSPBP', 'SVAV', 'GECO', 'VGSBP', 'KTSB',
       'KTSBP', 'LNZLP', 'ROSB', 'KROTP', 'KROT', 'CHKZ', 'BLNG', 'MSTT',
       'UWGN', 'ORUP', 'RUSI', 'RU000A101NK4', 'MRKS', 'MSST', 'DASB',
       'LENT', 'IGST', 'IGSTP', 'DVEC', 'LPSB', 'UKUZ', 'RTSBP', 'LVHK',
       'ALBK', 'LNZL', 'KUZB', 'KRKOP', 'BELU', 'SVET', 'VJGZ', 'VJGZP',
       'RLMNP', 'ABRD', 'MERF', 'UNKL', 'MOBB', 'ISKJ', 'GTLC', 'TGKBP',
       'YAKG', 'FXRW', 'FXWO', 'RDRB', 'PAZA', 'MISB', 'IDVP', 'MGVM',
       'UCSS'], dtype=object)
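
`unique()` hides how often a ticker was restricted. As a possible follow-up, the sketch below counts tightening events per ticker with `value_counts`. The event rows are invented for illustration and do not reflect real counts from the dataset.

```python
import pandas as pd

# Made-up events: LPSB is tightened twice, UCSS once, RUSI is relaxed.
events = pd.DataFrame(
    {
        "token": ["LPSB", "UCSS", "LPSB", "RUSI"],
        "current_limit": [40.0, 40.0, 40.0, 10.0],
        "new_limit": [10.0, 10.0, 10.0, 40.0],
    }
)

tightened = events[events["new_limit"] < events["current_limit"]]

# Number of tightening events per ticker, most frequent first.
counts = tightened["token"].value_counts()
print(counts.to_dict())
```

Tickers that show up repeatedly may deserve a closer look than those restricted only once.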