<h1><center>DATA PREPROCESSING</center></h1>

Structure:

1. [x] Repeat findings from EDA notebook
2. [] Run preprocessing:
    - [x] title
    - [x] published_at
    - [x] author
    - [] short_description
    - [] description

# 1. Preparations

To recap, these are the preprocessing steps based on the EDA findings:
1. **title**: get rid of specially encoded characters and strip whitespace from the beginning and end of the text.
2. **published_at**: convert to RFC 3339 format.
3. **author**: if the author is missing, replace it with the "Unknown" token.
4. **short_description**: if missing, take the beginning of the description (the first sentence or the first 1500 chars, which is the average length of short_description).
5. **description**: if missing, take raw_description (if it exists) and strip the HTML tags; if raw_description is also missing, take short_description (if it exists). If all three descriptions are missing, drop the article.

Required imports.

In [1]:
import os
import sys

import pandas as pd
import spacy
from bs4 import BeautifulSoup
from omegaconf import OmegaConf

pd.set_option("display.max_colwidth", 200)


In [2]:
ROOT = os.path.relpath("../")
config = OmegaConf.load(os.path.join(ROOT, "src/config/config.yaml"))


Helper utils.

In [3]:
def show_missing_values_statistics(data: pd.DataFrame, column: str) -> None:
    missing_count = data[column].isna().sum()
    print(
        f'Articles with missing "{column}": {missing_count} of {len(data)}'
        f" ({missing_count / len(data) * 100:.1f}%)."
    )


Reading the data that will be preprocessed.

In [4]:
data = pd.read_csv(os.path.join(ROOT, config.data.raw))
data.head(1)


Unnamed: 0,title,url,published_at,author,publisher,short_description,keywords,header_image,raw_description,description,scraped_at
0,Santoli’s Wednesday market notes: Could September’s stock shakeout tee up strength for the fourth quarter?,https://www.cnbc.com/2021/09/29/santolis-wednesday-market-notes-could-septembers-stock-shakeout-tee-up-strength-for-the-fourth-quarter.html,2021-09-29T17:09:39+0000,Michael Santoli,CNBC,"This is the daily notebook of Mike Santoli, CNBC's senior markets commentator, with ideas about trends, stocks and market statistics.","cnbc, Premium, Articles, Investment strategy, Markets, Investing, PRO Home, CNBC Pro, Pro: Santoli on Stocks, source:tagname:CNBC US Source",https://image.cnbcfm.com/api/v1/image/106949602-1632934577499-FINTECH_ETF_9-29.jpg?v=1632934691,"<div class=""group""><p><em>This is the daily notebook of Mike Santoli, CNBC's senior markets commentator, with ideas about trends, stocks and market statistics.</em></p><ul><li>A muted, inconclusiv...","This is the daily notebook of Mike Santoli, CNBC's senior markets commentator, with ideas about trends, stocks and market statistics.A muted, inconclusive bounce that has left the indexes fully wi...",2021-10-30 14:11:23.709372


In [5]:
data.columns.tolist()


['title',
 'url',
 'published_at',
 'author',
 'publisher',
 'short_description',
 'keywords',
 'header_image',
 'raw_description',
 'description',
 'scraped_at']

# 2. Preprocessing

## 2.1. Title

In [6]:
show_missing_values_statistics(data, "title")


Articles with missing "title": 0 of 625 (0.0%).


No articles with missing title.

In [7]:
data.loc[2, "title"]


'Europe&#039;s recovery depends on Renzi&#039;s Italy'

As the EDA showed, and as the output above confirms, titles in some articles contain HTML-encoded characters (here `&#039;` stands for an apostrophe).<br>
We can decode them with the BeautifulSoup package.

In [8]:
data["title"] = data["title"].apply(lambda x: BeautifulSoup(x, "html.parser").text)


In [9]:
data.loc[2, "title"]


"Europe's recovery depends on Renzi's Italy"

Now everything is fine.
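
When titles contain only character references and no real markup, the standard library can do the same decoding. The snippet below is a minimal sketch under that assumption; `decode_title` is a hypothetical helper name and is not part of this notebook.

```python
import html


def decode_title(title: str) -> str:
    """Decode HTML character references and trim surrounding whitespace."""
    return html.unescape(title).strip()


print(decode_title("Europe&#039;s recovery depends on Renzi&#039;s Italy"))
# Europe's recovery depends on Renzi's Italy
```

Unlike BeautifulSoup, `html.unescape` does not remove tags, so it only fits titles that are known to be free of markup.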

## 2.2. Published_at

In [10]:
show_missing_values_statistics(data, "published_at")


Articles with missing "published_at": 0 of 625 (0.0%).


No articles with missing published_at property.

In [11]:
data.loc[0, "published_at"]


'2021-09-29T17:09:39+0000'

Almost everything is fine, except that Weaviate expects datetimes in RFC 3339 format: instead of "+0000", the UTC offset should be written as "+00:00".

In [12]:
data["published_at"] = data["published_at"].apply(lambda x: pd.to_datetime(x, utc=True).isoformat())


In [13]:
data.loc[0, "published_at"]


'2021-09-29T17:09:39+00:00'

Now it's fine.

One problem I noticed: if a datetime is not in the proper format, loading data into the Weaviate instance may fail without raising any error, and the failure affects more than the datetime property, because all other properties end up with null values as well. At least this is true for the Weaviate version used in this project (1.10.1).
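
Because a malformed datetime can fail silently, it may be worth checking the strings before loading. The sketch below uses only the standard library; `to_rfc3339`, `is_rfc3339` and the regular expression are hypothetical helpers, and the pattern is a simplified shape check rather than a full RFC 3339 validator.

```python
import re
from datetime import datetime, timezone

# Simplified RFC 3339 shape: the offset is "Z" or a colon-separated "+HH:MM" / "-HH:MM".
RFC3339_PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def to_rfc3339(value: str) -> str:
    """Parse a timestamp like '2021-09-29T17:09:39+0000' and return it in UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    return parsed.astimezone(timezone.utc).isoformat()


def is_rfc3339(value: str) -> bool:
    """Check only the shape of the string, not whether the date itself is valid."""
    return RFC3339_PATTERN.match(value) is not None


print(is_rfc3339("2021-09-29T17:09:39+0000"))  # False
print(to_rfc3339("2021-09-29T17:09:39+0000"))  # 2021-09-29T17:09:39+00:00
```

Running `is_rfc3339` over the whole column before the import is a cheap way to catch rows that would otherwise be loaded with null properties.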

## 2.3. Author

In [14]:
show_missing_values_statistics(data, "author")


Articles with missing "author": 228 of 625 (36.5%).


In [15]:
data.loc[14:17, "author"]


14                         NaN
15                     Tae Kim
16    Dawn Kopecki,Rich Mendez
17                         NaN
Name: author, dtype: object

As we can see, some articles don't have an author (~36% of articles). One way to handle this is to replace the missing values with the "Unknown" token.

In [16]:
data["author"] = data["author"].fillna(config.preprocessing.author.fillna_value)


In [17]:
data.loc[14:17, "author"]


14                     Unknown
15                     Tae Kim
16    Dawn Kopecki,Rich Mendez
17                     Unknown
Name: author, dtype: object

Now, if we want to find articles that don't have an author, we can use the "Unknown" token as a filter value.

## 2.4. Short_description

In [18]:
show_missing_values_statistics(data, "short_description")


Articles with missing "short_description": 16 of 625 (2.6%).


There are 16 articles with a missing short_description property. As short_description is not the most important part of an article (the description property is), we will not drop articles because of it, but will try to fill in the missing values instead.

As shown in the EDA, short_description is a shortened version of description (hence the name), so we can take the first part of the description in cases where there is no short_description but the description exists.

Let's see in how many cases there is a description when short_description is missing:

In [19]:
short_description_df = data.query("short_description.isnull() and description.notna()")
short_description_df[["short_description", "description"]].head()


Unnamed: 0,short_description,description
4,,"President Donald Trump hailed the U.S.-led intervention in Syria as ""perfectly executed,"" adding that the military campaign to degrade Bashar Assad's chemical weapons capability had accomplished i..."
19,,"In Monday’s Web Extra, Pete Najarian reveals where he’s seeing put buying. Also why stocks are plunging in Japan. This content is only available online - you won't find these trades on TV. _______..."
84,,"COOPERSTOWN, N.Y., Oct. 1, 2012 /PRNewswire/ -- The National Baseball Hall of Fame and Museum is adding to its art collection with the donation of a portrait depicting one of the sport's most famo..."
157,,"What was Wall Street saying about earnings season, Googlehitting an all-time high, Facebook’s1 billion users and European bank stocks? Find out in this week’s CNBC.com Stock Blog Roundup.Third-qua..."
184,,"Ireland's High Court on Thursday ruled that a 850 million euro ($1 billion) data center planned by Apple in the west of Ireland may proceed, dismissing an environmental challenge made by three peo..."


To take the first sentence, we could split the text by "." and take the first element of the list. But this is a very unreliable approach, because abbreviations such as "U.S." also contain periods.

For example:

In [20]:
text = short_description_df["description"].iloc[0]

print(text[:1000])
print(" - " * 50)
print(text.split(".")[0])


President Donald Trump hailed the U.S.-led intervention in Syria as "perfectly executed," adding that the military campaign to degrade Bashar Assad's chemical weapons capability had accomplished its goals.Less than a day after U.S., British and French forces targeted suspected chemical weapons sites in retaliation to an attack that left dozens of civilians dead last week, Trump thanked the U.S. coalition partners.Yet in an echo of former president George W. Bush, Trump used words that ultimately came back to haunt his predecessor, by pronouncing "Mission Accomplished." That characterization raised questions about whether Western forces would intervene again if Assad used chemical weapons again, or if the conflict escalated amid Russia's growing bellicosity."A perfectly executed strike last night. Thank you to France and the United Kingdom for their wisdom and the power of their fine Military. Could not have had a better result. Mission Accomplished!" Trump said in a Twitter post.Defens

It's much better to use a more sophisticated approach, such as spaCy's sentence splitter:

<div class="alert alert-block alert-info"><b>NOTE</b>: Don't forget to download the spaCy model beforehand.</div>

For example:
```bash
python -m spacy download en_core_web_sm
```

In [21]:
def get_first_sentence(text: str, nlp: spacy.Language) -> str:
    """Return the first sentence of the text as detected by spaCy.

    Parameters
    ----------
    text : str
        full text
    nlp : spacy.Language
        loaded spaCy pipeline that sets sentence boundaries

    Returns
    -------
    str
        the first sentence of the full text
    """
    sents_generator = nlp(text).sents

    return next(sents_generator).text


nlp = spacy.load(config.preprocessing.spacy.model)
# for speed, disable the pipes that sentence splitting does not need
nlp.disable_pipes(
    "tagger",
    "attribute_ruler",
    "lemmatizer",
    "ner",
)
print(f"Spacy pipes: {nlp.pipe_names}")


Spacy pipes: ['tok2vec', 'parser']


In [22]:
print(get_first_sentence(text, nlp))


President Donald Trump hailed the U.S.-led intervention in Syria as "perfectly executed," adding that the military campaign to degrade Bashar Assad's chemical weapons capability had accomplished its goals.


As we can see, spaCy did a much better job. Now let's apply it to all missing short_descriptions.

In [23]:
for idx in short_description_df.index:
    description = data.loc[idx, "description"]
    short_description = get_first_sentence(description, nlp)
    data.loc[idx, "short_description"] = short_description

data.loc[short_description_df.index, ["short_description", "description"]].head()


Unnamed: 0,short_description,description
4,"President Donald Trump hailed the U.S.-led intervention in Syria as ""perfectly executed,"" adding that the military campaign to degrade Bashar Assad's chemical weapons capability had accomplished i...","President Donald Trump hailed the U.S.-led intervention in Syria as ""perfectly executed,"" adding that the military campaign to degrade Bashar Assad's chemical weapons capability had accomplished i..."
19,"In Monday’s Web Extra, Pete Najarian reveals where he’s seeing put buying.","In Monday’s Web Extra, Pete Najarian reveals where he’s seeing put buying. Also why stocks are plunging in Japan. This content is only available online - you won't find these trades on TV. _______..."
84,"COOPERSTOWN, N.Y., Oct. 1, 2012 /PRNewswire/ -- The National Baseball Hall of Fame and Museum is adding to its art collection with the donation of a portrait depicting one of the sport's most famo...","COOPERSTOWN, N.Y., Oct. 1, 2012 /PRNewswire/ -- The National Baseball Hall of Fame and Museum is adding to its art collection with the donation of a portrait depicting one of the sport's most famo..."
157,"What was Wall Street saying about earnings season, Googlehitting an all-time high, Facebook’s1 billion users and European bank stocks?","What was Wall Street saying about earnings season, Googlehitting an all-time high, Facebook’s1 billion users and European bank stocks? Find out in this week’s CNBC.com Stock Blog Roundup.Third-qua..."
184,"Ireland's High Court on Thursday ruled that a 850 million euro ($1 billion) data center planned by Apple in the west of Ireland may proceed, dismissing an environmental challenge made by three peo...","Ireland's High Court on Thursday ruled that a 850 million euro ($1 billion) data center planned by Apple in the west of Ireland may proceed, dismissing an environmental challenge made by three peo..."


In [24]:
data.loc[short_description_df.index[0], "short_description"]


'President Donald Trump hailed the U.S.-led intervention in Syria as "perfectly executed," adding that the military campaign to degrade Bashar Assad\'s chemical weapons capability had accomplished its goals.'

Cool, looks like it worked like a charm.
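
The plan in section 1 also mentions falling back to the first 1500 chars, and `next()` on an empty generator raises `StopIteration` if a description yields no sentences. The sketch below combines both safeguards. All names are hypothetical; the sentence splitter is passed in as a callable so that a spaCy-based one can be plugged in, and `naive_splitter` is only a stand-in for the demo.

```python
from typing import Callable, Iterable, List

MAX_CHARS = 1500  # average short_description length reported in the EDA


def first_sentence_or_prefix(
    text: str,
    split_sentences: Callable[[str], Iterable[str]],
    max_chars: int = MAX_CHARS,
) -> str:
    """Return the first sentence, capped at max_chars; empty text stays empty."""
    text = text.strip()
    # next() with a default avoids StopIteration when no sentence is found
    first = next(iter(split_sentences(text)), "") if text else ""
    # fall back to the whole text if the splitter returned nothing useful
    first = first.strip() or text
    return first[:max_chars].rstrip()


def naive_splitter(text: str) -> List[str]:
    """Stand-in for a real sentence splitter, used only for this demo."""
    head, sep, _ = text.partition(". ")
    return [head + sep.strip()]


print(first_sentence_or_prefix("First one. Second one.", naive_splitter))
# First one.
```

With the pipeline loaded above, the splitter could be something like `lambda t: (s.text for s in nlp(t).sents)`.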

## 2.5. Description

In [25]:
show_missing_values_statistics(data, "description")


Articles with missing "description": 32 of 625 (5.1%).


Now, the most crucial part. When Weaviate creates the vector representation of an article, it concatenates all string properties and vectorizes the result (to be precise, it sends the concatenated string to the vectorizer module). This vector is then used when running a search query.

As description usually contains the majority of the information about an article, it's important to fill in the missing values. If that's not possible, the article should be dropped.

Based on the EDA findings, the strategy for filling in missing description values is:
- if raw_description exists, strip the HTML tags from it and use it as the replacement
- if raw_description doesn't exist but short_description does, use the shortened version as the replacement
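
The fallback order above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: `strip_html` and `fill_description` are hypothetical helpers, missing values are assumed to be `None` (pandas would use NaN), and tags are stripped with the standard library's `HTMLParser` instead of BeautifulSoup to keep the example dependency-free. A raw_description that is empty after stripping is treated as missing.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of markup with all tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.chunks).strip()


def fill_description(
    description: Optional[str],
    raw_description: Optional[str],
    short_description: Optional[str],
) -> Optional[str]:
    """Pick the first usable text; None means the article should be dropped."""
    stripped_raw = strip_html(raw_description) if raw_description else None
    for candidate in (description, stripped_raw, short_description):
        if candidate and candidate.strip():
            return candidate.strip()
    return None


print(fill_description(None, "<p>Hello <em>world</em></p>", "Short text."))
# Hello world
print(fill_description(None, '<div class="group"></div>', "Short text."))
# Short text.
```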

### 2.5.1. Fill up with raw_description

In [26]:
raw_description_df = data.query("description.isna() and raw_description.notna()")
print(raw_description_df.shape[0])
raw_description_df[["description", "raw_description"]]


1


Unnamed: 0,description,raw_description
372,,"<div class=""group""></div>"


There is only one article where description is missing but raw_description is not. Unfortunately, its raw_description contains only an empty div block, which also explains why description is empty.

### 2.5.2. Fill up with short_description

In [27]:
short_description_df = data.query("description.isna() and short_description.notna()")
print(short_description_df.shape[0])
short_description_df[["description", "short_description"]].head()


30


Unnamed: 0,description,short_description
1,,"This commentary originally ran on Facebook. Boris Johnson – The former London mayor and grudging ""Leave"" supporter-turned enthusiastic ""Leave"" leader chose the winning side. He's now a front-runne..."
2,,"In spring, ambitious reforms began in Italy. Under Matteo Renzi, the ailing economy will either begin a real recovery, or slide further. The outcome is vital to Italy, Europe and the global econom..."
6,,"Founders: Eran Barak, Barak Klinghofer (chief product officer), Idan Levin (CTO) Launched: 2014 Headquarters: Boston Funding: $10.5 million The three founders of Hexadite, a Boston-based cyberse..."
31,,"As the coronavirus spreads across the United States, American life has come to a halt: schools are closing, sports leagues are suspended, events are canceled and officials from major cities are an..."
34,,"Japan has overtaken the U.K. as the most expensive place to send overseas employees to work, a new report has found.The average expatriate package in Japan costs employers $405,685 – more than any..."


There are 30 articles where description is empty but short_description is not.
We can use short_description to fill them in.

In [28]:
data.loc[short_description_df.index, "description"] = data.loc[short_description_df.index, "short_description"]


In [29]:
show_missing_values_statistics(data, "description")


Articles with missing "description": 2 of 625 (0.3%).


Out of 32 articles with missing description only 2 are left.

## 2.6. Drop articles with missing values

In [30]:
data.isna().sum().to_frame(name="Number of missing values").style.bar()


Unnamed: 0,Number of missing values
title,0
url,0
published_at,0
author,0
publisher,0
short_description,2
keywords,0
header_image,0
raw_description,31
description,2


There are articles with missing:
- 31 raw_description (we don't use it anyway)
- 2 short_description (not good, not terrible)
- 2 description - these articles have to be dropped.

In [31]:
data = data.dropna(subset=["description"]).reset_index(drop=True)
data["description"].isna().sum()


0

# 3. Save data

The last step is to save the preprocessed data into the intermediate folder. Later this data will be loaded into the Weaviate instance.

As we don't need some of the columns, such as scraped_at, header_image or raw_description, it's better to save some space and save only the required columns.

In [32]:
columns_to_save = [
    "title",
    "url",
    "published_at",
    "author",
    "short_description",
    "description",
    "keywords",
]

data[columns_to_save].to_csv(os.path.join(ROOT, config.data.interim), index=False)
