# Data Cleaning

This notebook performs the data cleaning.

The goal is to produce the cleanest possible data to feed to the ML model later, so that the model can give the best results.

The earlier EDA (Exploratory Data Analysis) gave us several insights about the data, some of which are crucial for cleaning it correctly.

The following steps are performed:
1. Read all of the `.parquet` data files and concatenate them into a single DataFrame
2. Remove HTML tags (prefix & suffix)
3. Remove prices from descriptions
4. Remove clause about the offer (sometimes at the end)
5. Remove links
6. Remove emojis
7. Remove rows with empty description
8. Remove rows whose description is shorter than 15 words

The descriptions are also lowercased before any of these patterns are applied.

In [2]:
import pandas as pd
import os
import re
import glob
from demoji import replace
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
ROOT_DIR = os.getcwd()

In [4]:
raw_data = pd.concat([pd.read_parquet(data_file) for data_file in glob.glob(f'{ROOT_DIR}/data/*.parquet')], ignore_index=True)
data = raw_data.copy()

## Function declarations

Following the steps listed above, we first define a function for each of them.

In [5]:
def make_text_lowercase(data: pd.Series) -> pd.Series:
    """
    A function to lowercase every string in a pd.Series

    Args:
        data (pd.Series): pandas Series object to be processed
    Returns:
        pd.Series: pandas Series object with lowercase strings
    """
    return data.str.lower()

In [6]:
def clean_data_from_html(data: pd.Series, text_to_remove: list[str], is_regex: bool = False) -> pd.Series:
    """
    A function to clean prefix & suffix html tags from the pd.Series[str]
    
    Args:
        data (pd.Series): pandas Series object to be processed
        text_to_remove (list[str]): text to be removed from each record [prefix, suffix]
        is_regex (bool): is the pattern a regex
    Returns:
        pd.Series: processed pandas Series object with removed text    
    """

    def _remove_html_tags(data: pd.Series, text_to_remove: str, is_regex: bool = False) -> pd.Series:
        """
        A function to remove specific text from pd.Series[str]

        Args:
            data (pd.Series): pandas Series object to be processed
            text_to_remove (str): text to be removed from each record
            is_regex (bool): is the pattern a regex
        Returns:
            pd.Series: processed pandas Series object with removed text
        """
        return data.str.replace(text_to_remove, "", case=False, regex=is_regex)
        
    cleaned_prefix = _remove_html_tags(data, text_to_remove[0], is_regex)
    cleaned_data = _remove_html_tags(cleaned_prefix, text_to_remove[1], is_regex)
    
    return cleaned_data
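
To see what the two patterns passed to `clean_data_from_html` strip, here is a minimal sketch that applies them to one made-up description with plain `re`. The sample text is hypothetical. The patterns are copied from the `HTML_PREFIX` and `HTML_SUFFIX` constants of this notebook, and `re.sub` with `re.IGNORECASE` stands in for `Series.str.replace(..., case=False, regex=True)`.

```python
import re

# Patterns copied from the constants cell of this notebook.
HTML_PREFIX = r'^.*?\n.*?\n'
HTML_SUFFIX = r'\n.*?\n.*?\n.*?\n.*?$'

# Hypothetical scraped description: two wrapper lines, the body, four wrapper lines.
sample = "\n".join([
    "<div class='offer'>",
    "<p>",
    "zadbany samochód, pierwszy właściciel",
    "serwisowany w aso",
    "</p>",
    "</div>",
    "<footer>",
    "</footer>",
])

# Strip the first two lines, then the last four lines.
without_prefix = re.sub(HTML_PREFIX, "", sample, flags=re.IGNORECASE)
cleaned = re.sub(HTML_SUFFIX, "", without_prefix, flags=re.IGNORECASE)
print(cleaned)
```

Note that both patterns are positional: they drop the first two and the last four lines of a description whatever those lines contain, so they rely on every record having the same wrapper layout.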


In [7]:
def remove_with_regex(text: str, regex_pattern: str) -> str:
    """
    A function to remove a pattern from text using a regex

    Args:
        text (str): text from which the matches should be removed
        regex_pattern (str): regular expression to remove (matched case-insensitively, in multiline mode)
    Returns:
        str: text without the parts matched by the pattern
    """
    return re.sub(regex_pattern, '', text, flags=re.IGNORECASE | re.MULTILINE)
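
A short usage sketch may help to show what `remove_with_regex` does with a line-oriented pattern. The sample strings are invented for illustration, and `REGEX_PRICE` is copied from the constants cell of this notebook. Because of `re.MULTILINE` and the `.*` on both sides of the keyword group, the whole line containing a keyword is removed, not just the keyword.

```python
import re

def remove_with_regex(text: str, regex_pattern: str) -> str:
    # Same body as the notebook function above.
    return re.sub(regex_pattern, '', text, flags=re.IGNORECASE | re.MULTILINE)

# Copied from the constants cell of this notebook.
REGEX_PRICE = r'.*(?:cena|brutto|netto|23%|rabat|pln|eur|usd|zł).*$'

# Hypothetical description with a price on the middle line.
sample = "silnik 2.0 tdi\ncena 45 000 zł brutto\nstan bardzo dobry"
result = remove_with_regex(sample, REGEX_PRICE)
print(repr(result))  # 'silnik 2.0 tdi\n\nstan bardzo dobry'

# Keywords match as substrings, so "europy" (contains "eur") drops its line too.
print(repr(remove_with_regex("sprowadzony z europy", REGEX_PRICE)))  # ''
```

The second call shows a side effect worth keeping in mind: the keywords are not anchored to word boundaries, so unrelated lines can be removed as well.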

In [8]:
def remove_rows_without_description(data: pd.DataFrame) -> pd.DataFrame:
    """
    A function to remove records without description
    
    Args:
        data (pd.DataFrame): data frame to process
    Returns:
        pd.DataFrame: processed data frame object
    """
    rows_to_keep = []
    for index, row in data.iterrows():
        if not re.search(REGEX_EMPTY, row['description'], flags=re.IGNORECASE) and len(row['description'].split()) > 0:
            rows_to_keep.append(index)
    
    return data.loc[rows_to_keep]

In [9]:
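def remove_short_descriptions(data: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """
    A function to remove records with short descriptions
    
    Args:
        data (pd.DataFrame): data frame with a 'description' column
        threshold (int): minimum description length in words (rows below it are dropped)
    Returns:
        pd.DataFrame: data frame without the short descriptions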
    """
    rows_to_keep = []
    for index, row in data.iterrows():
        if len(row['description'].split()) >= threshold:
            rows_to_keep.append(index)

    return data.loc[rows_to_keep]
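
Both row filters above loop over the frame with `iterrows`. As a possible alternative, the sketch below does the same filtering with vectorized string methods. The function name `filter_descriptions` and the toy frame are hypothetical, `REGEX_EMPTY` is copied from the constants cell, and the sketch assumes the `description` column contains no missing values.

```python
import re
import pandas as pd

# Copied from the constants cell of this notebook.
REGEX_EMPTY = r'brak opisu pojazdu\. aby uzyskać więcej informacji, skontaktuj się ze sprzedawcą\.'

def filter_descriptions(data: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Hypothetical vectorized equivalent of the two row filters above."""
    description = data['description']
    # True where the "no description" placeholder text appears.
    is_placeholder = description.str.contains(REGEX_EMPTY, flags=re.IGNORECASE, regex=True)
    # Number of whitespace-separated words in each description.
    word_count = description.str.split().str.len()
    return data[~is_placeholder & (word_count >= threshold)]

# Toy frame: a placeholder, a too-short description, and a valid one.
toy = pd.DataFrame({'description': [
    'brak opisu pojazdu. aby uzyskać więcej informacji, skontaktuj się ze sprzedawcą.',
    'krótki opis',
    'auto w bardzo dobrym stanie technicznym',
]})
kept = filter_descriptions(toy, threshold=5)
print(kept['description'].tolist())  # ['auto w bardzo dobrym stanie technicznym']
```

For any threshold of at least one word, the word-count condition also covers the "empty description" check done in `remove_rows_without_description`.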

In [10]:
HTML_PREFIX = r'^.*?\n.*?\n'
HTML_SUFFIX = r'\n.*?\n.*?\n.*?\n.*?$'

REGEX_PRICE = r'.*(?:cena|brutto|netto|23%|rabat|pln|eur|usd|zł).*$'
REGEX_CLAUSE = r'ogłoszenie\s+nie\s+stanowi\s+oferty.*$'
REGEX_LINK = r'https?://[^\s]+|www\.[^\s]+'
REGEX_EMPTY = r'brak opisu pojazdu\. aby uzyskać więcej informacji, skontaktuj się ze sprzedawcą\.'

WORD_COUNT_THRESHOLD = 15
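
To illustrate the clause and link patterns, here is a small sketch on an invented description. The sample text and the `example.com` addresses are hypothetical, and the flags are the ones used by `remove_with_regex`.

```python
import re

# Copied from the constants above.
REGEX_CLAUSE = r'ogłoszenie\s+nie\s+stanowi\s+oferty.*$'
REGEX_LINK = r'https?://[^\s]+|www\.[^\s]+'
FLAGS = re.IGNORECASE | re.MULTILINE  # same flags as remove_with_regex

# Hypothetical description with two links and the legal clause.
sample = (
    "więcej zdjęć: https://example.com/auto?id=1 oraz www.example.com\n"
    "ogłoszenie nie stanowi oferty w rozumieniu kodeksu cywilnego\n"
    "zapraszamy"
)
no_clause = re.sub(REGEX_CLAUSE, '', sample, flags=FLAGS)
no_links = re.sub(REGEX_LINK, '', no_clause, flags=FLAGS)
print(repr(no_links))
```

Two details are visible in the result: the clause is removed only up to the end of its own line (the last line survives, because `.` does not match a newline), and removing links leaves the surrounding spaces and empty lines in place.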

In [11]:
# 1. Lowercase
data['description'] = make_text_lowercase(data['description'])

# 2. Html clean
data['description'] = clean_data_from_html(data['description'], [HTML_PREFIX, HTML_SUFFIX], True)

# 3. Remove prices from description
data['description'] = data['description'].apply(lambda x: remove_with_regex(x, REGEX_PRICE))

# 4. Remove the clause 'Ogloszenie nie stanowi oferty ...'
data['description'] = data['description'].apply(lambda x: remove_with_regex(x, REGEX_CLAUSE))

# 5. Remove links
data['description'] = data['description'].apply(lambda x: remove_with_regex(x, REGEX_LINK))

# 6. Remove emojis
data['description'] = data['description'].apply(replace)

# 7. Remove rows with no description
data = remove_rows_without_description(data)

# 8. Remove records with description shorter than 15 words
data = remove_short_descriptions(data, WORD_COUNT_THRESHOLD)

In [12]:
print(f'Before: {raw_data.shape[0]} records')
print(f'After:  {data.shape[0]} records')

Before: 25537 records
After:  23752 records
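
The share of removed records follows directly from the two counts printed above. A minimal sketch, with the counts typed in by hand:

```python
# Record counts taken from the output above.
before, after = 25537, 23752

removed = before - after
print(f'Removed: {removed} records ({removed / before:.1%})')
# Removed: 1785 records (7.0%)
```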


In [13]:
data.to_csv("cleaned_data.csv", sep=';', index=False, encoding="utf-8")

## Tokenization

Tokenization is done with the **transformers** library from *Hugging Face*.

For this you will need:
- protobuf (pip install protobuf)
- transformers (pip install transformers)
- PyTorch or TensorFlow (pip install torch torchvision, or pip install tensorflow)
- sacremoses (pip install sacremoses)
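
A missing package may only surface once the tokenizer is loaded, so it can be convenient to check the environment first. The sketch below is one way to do that with the standard library. The helper name `find_missing` is hypothetical, and the list assumes the PyTorch backend (swap `torch` for `tensorflow` if needed).

```python
import importlib.util

def find_missing(modules: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised e.g. when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing

# Import names of the packages listed above; protobuf is imported as google.protobuf.
REQUIRED = ['google.protobuf', 'transformers', 'torch', 'sacremoses']
print('Missing:', find_missing(REQUIRED) or 'nothing')
```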

In [14]:
from transformers import AutoTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [15]:
model_name = "allegro/herbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

ImportError: 
 requires the protobuf library but it was not found in your environment. Check out the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.


In [4]:
data = pd.read_csv('./cleaned_data.csv', sep=';')

In [7]:
tokens = tokenizer(
    data['description'].tolist(),
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'    # use 'tf' here for TensorFlow
)
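
The call above tokenizes every description at once and pads each one to 256 tokens, so the result is a single tensor with one row per description. If that ever becomes too large for memory, the descriptions could be tokenized in batches instead. The helper below is a hypothetical sketch of the batching part only. The name `iter_batches` and the batch sizes are assumptions, and the tokenizer call is shown as a comment so the example runs without the model.

```python
from typing import Iterator

def iter_batches(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Stand-in data so the sketch runs without the dataset.
texts = [f'opis {i}' for i in range(10)]
batches = list(iter_batches(texts, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]

# In the notebook, each batch would be passed to the tokenizer, e.g.:
# for batch in iter_batches(data['description'].tolist(), 512):
#     tokens = tokenizer(batch, padding='max_length', truncation=True,
#                        max_length=256, return_tensors='pt')
```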

In [9]:
print('Done!')

Done!
