# Synthetic Data Cleaning

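This notebook cleans the labeled synthetic dataset, adds URL and title length columns, and extracts additional features from each URL (domain, path depth, query count, character counts) and title (word count, average word length, character counts).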

In [13]:
%pip install pandas bs4

Note: you may need to restart the kernel to use updated packages.


In [14]:
import pandas as pd
from bs4 import BeautifulSoup
import string
from urllib.parse import urlparse

## Load Data

Load the parquet file and drop unnecessary columns.

In [15]:
df = pd.read_parquet('../data/raw/classified_data.parquet')  # Synthetic data with labels
df = df.drop(columns=['content_type', 'server'])

## Clean Category Labels

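Clean the `category` column: split comma-separated label strings into lists, and drop the `Uncategorized` label whenever it appears alongside a real category. Rows whose only label is `Uncategorized` keep it.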

In [16]:
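def clean_labels(labels_str):
    """Return the labels, dropping 'Uncategorized' when other labels are present."""
    if labels_str is None:
        # Missing value: pass it through so the later dropna step can remove the row
        return None
    if isinstance(labels_str, str):
        # Comma-separated string: split and trim whitespace around each label
        labels = [label.strip() for label in labels_str.split(',')]
    else:
        # Already a sequence of labels
        labels = labels_str

    if "Uncategorized" in labels:
        cleaned = [label for label in labels if label != "Uncategorized"]
        # Keep 'Uncategorized' only when it is the sole label
        return cleaned if len(cleaned) > 0 else ["Uncategorized"]
    else:
        return labels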

df["category"] = df["category"].apply(clean_labels)
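To make the label rule concrete, here is a minimal sketch on made-up label strings. The function name `strip_uncategorized` and the sample inputs are hypothetical; the logic is meant to mirror `clean_labels` above, not replace it.

```python
def strip_uncategorized(labels_str):
    """Illustrative stand-in for clean_labels (hypothetical name)."""
    if isinstance(labels_str, str):
        labels = [label.strip() for label in labels_str.split(",")]
    else:
        labels = labels_str
    if "Uncategorized" in labels:
        cleaned = [label for label in labels if label != "Uncategorized"]
        # 'Uncategorized' survives only when nothing else is left
        return cleaned if cleaned else ["Uncategorized"]
    return labels

# Made-up inputs covering the three cases
examples = ["Education, Uncategorized", "Uncategorized", "Shop, Travel"]
results = [strip_uncategorized(e) for e in examples]
for raw, cleaned in zip(examples, results):
    print(repr(raw), "->", cleaned)
```

The first input loses its redundant `Uncategorized` label, the second keeps it because it is the only label, and the third is returned unchanged.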

## Data Overview

Display the first few rows and summary statistics of the dataframe.

In [17]:
print(df.head())
print(df.describe())

                                                 url  \
0  http://01088888317.com/bbs/board.php?bo_table=...   
1                http://3d.jzsc.net/search_3225.html   
2           http://22gl.nmjrjx.com/v_info/45979.html   
3              http://88yokohama.com/ishidatami.html   
4                  http://8p.wanjxx.com/hr/index.php   

                                               title  \
0  - 010-8888-8317 29 | -O1O-8888-8317,,,,,,,,,,,...   
1                                        ,,,,,su 3d,   
2                                                 __   
3                                                  U   
4                      Office of Human Resources | -   

                                             snippet language  \
0  - 010-8888-8317 29 | -O1O-8888-8317,,,,,,,,,,,...       ko   
1                                        ,,,,,su 3d,       ko   
2  __ : : 0.0 : : : 2022 : 720P/2025-02-12 21:19:...    zh-cn   
3  U lqlG Vy[W U U Walking in Okinawa Naha Ishida...       bn   
4

## Drop Missing Values and Duplicates

Remove rows with missing values in any of the key columns, then drop duplicate rows by `url` and by `warc_id`, keeping the first occurrence in each case.

In [18]:
df = df.dropna(subset=['url', 'title', 'snippet', 'language', 'warc_id', 'meta_description', 'category'])
df = df.drop_duplicates(subset=['url'], keep='first')
df = df.drop_duplicates(subset=['warc_id'], keep='first')
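Because the two deduplication passes run one after the other, a row can survive the `url` pass and still be removed by the `warc_id` pass. The sketch below shows this on a small made-up frame; the URLs and IDs are hypothetical and only two of the key columns are included.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "url": ["http://a.example/x", "http://a.example/x", "http://b.example/y", None],
    "warc_id": ["id-1", "id-2", "id-1", "id-3"],
})

step1 = toy.dropna(subset=["url", "warc_id"])                # drops row 3 (missing url)
step2 = step1.drop_duplicates(subset=["url"], keep="first")  # drops row 1 (repeated url)
step3 = step2.drop_duplicates(subset=["warc_id"], keep="first")  # drops row 2 (repeated warc_id)

print(len(toy), len(step1), len(step2), len(step3))
print(step3)
```

Only the first row remains: each pass keeps the earliest occurrence, so the original row order determines which duplicate survives.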

## Clean Text Fields

Cast the text columns to pandas' `string` dtype so that they are handled consistently in the steps that follow.

In [19]:
df = df.astype({
    'url': 'string',
    'title': 'string',
    'snippet': 'string',
    'language': 'string',
    'warc_id': 'string',
    'meta_description': 'string'
})
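One practical effect of the `string` dtype is how missing values behave in `.str` methods. The following sketch uses a made-up series (not the notebook's data, where missing rows have already been dropped) to show that a missing entry stays missing after the cast and after `.str.len()`.

```python
import pandas as pd

# Made-up values; None stands in for a missing text field
raw = pd.Series(["abc", None, "de"], dtype="object")

as_string = raw.astype("string")   # missing entry is represented as pd.NA
lengths = as_string.str.len()      # length of the missing entry is also missing

print(as_string.dtype)
print(lengths)
```

With the `string` dtype, the lengths of the non-missing entries stay integers rather than being converted to floats to make room for a missing value.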

## Feature Engineering: URL and Title Lengths

Add new columns that capture the length of the URL and the title.

In [20]:
df['url_length'] = df['url'].str.len()
df['title_length'] = df['title'].str.len()

print(df[['url', 'title', 'url_length', 'title_length']].head())

                                                 url  \
0  http://01088888317.com/bbs/board.php?bo_table=...   
1                http://3d.jzsc.net/search_3225.html   
2           http://22gl.nmjrjx.com/v_info/45979.html   
3              http://88yokohama.com/ishidatami.html   
4                  http://8p.wanjxx.com/hr/index.php   

                                               title  url_length  title_length  
0  - 010-8888-8317 29 | -O1O-8888-8317,,,,,,,,,,,...         103            61  
1                                        ,,,,,su 3d,          35            11  
2                                                 __          40             2  
3                                                  U          37             1  
4                      Office of Human Resources | -          33            29  


## Additional URL Feature Extraction

Extract features from each URL: domain, path depth (number of non-empty path segments), number of query parameters, digit count, and special-character count (characters in `string.punctuation`).

In [21]:
def extract_url_features(url):
    parsed = urlparse(url)
    domain = parsed.netloc
    path = parsed.path
    query = parsed.query
    
    # Count the number of non-empty segments in the path
    path_depth = len([seg for seg in path.split('/') if seg])
    
    # Count query parameters if any exist
    query_count = len(query.split('&')) if query else 0
    
    # Count digits and special characters in the URL
    digit_count = sum(c.isdigit() for c in url)
    special_char_count = sum(c in string.punctuation for c in url)
    
    return pd.Series({
        'domain': domain,
        'path_depth': path_depth,
        'query_count': query_count,
        'url_digit_count': digit_count,
        'url_special_count': special_char_count
    })

# Apply URL feature extraction
df = df.join(df['url'].apply(extract_url_features))
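As a worked example of the features above, the sketch below applies the same computations to a single made-up URL (`sample_url` is hypothetical). It also shows, as an optional alternative rather than what the notebook does, how `urllib.parse.parse_qsl` counts parameters when the query string contains an empty segment.

```python
import string
from urllib.parse import urlparse, parse_qsl

# Hypothetical URL, invented for illustration
sample_url = "http://example.com/bbs/board.php?bo_table=free&wr_id=12"
parsed = urlparse(sample_url)

features = {
    "domain": parsed.netloc,
    # Non-empty path segments: "bbs" and "board.php"
    "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    # Same rule as the notebook: split the query string on "&"
    "query_count": len(parsed.query.split("&")) if parsed.query else 0,
    "url_digit_count": sum(c.isdigit() for c in sample_url),
    "url_special_count": sum(c in string.punctuation for c in sample_url),
}
print(features)

# A query with a doubled "&" yields an empty segment when split naively
messy_query = "a=1&&b=2"
split_count = len(messy_query.split("&"))
parsed_count = len(parse_qsl(messy_query, keep_blank_values=True))
print(split_count, parsed_count)
```

Splitting on `&` counts the empty segment as a parameter, while `parse_qsl` skips it. Whether that difference matters depends on how often malformed query strings occur in the data.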

## Additional Title Feature Extraction

Extract features from the title such as word count, average word length, punctuation count, and digit count.

In [22]:
def extract_title_features(title):
    words = title.split()
    word_count = len(words)
    avg_word_len = sum(len(word) for word in words) / word_count if word_count > 0 else 0
    punctuation_count = sum(c in string.punctuation for c in title)
    digit_count = sum(c.isdigit() for c in title)
    
    return pd.Series({
        'title_word_count': word_count,
        'title_avg_word_len': avg_word_len,
        'title_punctuation_count': punctuation_count,
        'title_digit_count': digit_count
    })

# Apply Title feature extraction
df = df.join(df['title'].apply(extract_title_features))
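The sketch below runs the same title computations on one title taken from the sample rows (`Office of Human Resources | -`). The helper name `title_features` is hypothetical. It also illustrates why the count columns can appear as floats such as `5.0`: a `pd.Series` that mixes integers with a float average is stored as `float64`.

```python
import string
import pandas as pd

def title_features(title):
    """Illustrative stand-in for extract_title_features, returning a plain dict."""
    words = title.split()
    word_count = len(words)
    return {
        "title_word_count": word_count,
        "title_avg_word_len": sum(len(w) for w in words) / word_count if word_count else 0,
        "title_punctuation_count": sum(c in string.punctuation for c in title),
        "title_digit_count": sum(c.isdigit() for c in title),
    }

sample_title = "Office of Human Resources | -"
feats = title_features(sample_title)
print(feats)

# Mixing int counts with a float average upcasts the whole Series to float64
as_series = pd.Series(feats)
print(as_series.dtype)

# If integer counts are preferred, the count entries can be cast back
counts = as_series.drop("title_avg_word_len").astype("int64")
print(counts)
```

Note that whitespace splitting treats the standalone `|` and `-` as words, so this title has six words, two of which are punctuation only.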

## Feature Overview

Display selected columns to verify the extracted features.

In [23]:
print(df[['url', 'domain', 'path_depth', 'query_count', 'url_digit_count', 'url_special_count',
          'title', 'title_word_count', 'title_avg_word_len', 'title_punctuation_count', 'title_digit_count']].head())

                                                 url           domain  \
0  http://01088888317.com/bbs/board.php?bo_table=...  01088888317.com   
1                http://3d.jzsc.net/search_3225.html      3d.jzsc.net   
2           http://22gl.nmjrjx.com/v_info/45979.html  22gl.nmjrjx.com   
3              http://88yokohama.com/ishidatami.html   88yokohama.com   
4                  http://8p.wanjxx.com/hr/index.php    8p.wanjxx.com   

   path_depth  query_count  url_digit_count  url_special_count  \
0           2            8               13                 25   
1           1            0                5                  8   
2           2            0                7                  9   
3           1            0                2                  6   
4           2            0                1                  8   

                                               title  title_word_count  \
0  - 010-8888-8317 29 | -O1O-8888-8317,,,,,,,,,,,...               5.0   
1               

## Save Cleaned and Enhanced Data

Save the cleaned and feature-enhanced dataframe to a new parquet file.

In [24]:
df.to_parquet('../data/processed/cleaned_classified.parquet')

In [25]:
df.tail()

Unnamed: 0,url,title,snippet,language,warc_id,warc_date,meta_description,category,url_length,title_length,domain,path_depth,query_count,url_digit_count,url_special_count,title_word_count,title_avg_word_len,title_punctuation_count,title_digit_count
49394,http://cloudninehotspring.com/yorkinstruments_...,96,"96, avav,48,,88 | | | AAV | 777 | | | | | av |...",zh-cn,<urn:uuid:e503aea0-8050-4b59-8ad0-b5d5c8f8c5a8>,2025-02-13T23:10:50Z,"96,,,,",[Uncategorized],69,3,cloudninehotspring.com,1,0,8,9,1.0,3.0,1.0,2.0
49395,http://cmuir.cmu.ac.th/browse?type=author&sort...,CMU Intellectual Repository: Browsing DSpace,CMU Intellectual Repository: Browsing DSpace S...,en,<urn:uuid:bdcecebd-1b72-4284-83d0-293a10e8068a>,2025-02-13T22:13:12Z,,"[Education, Technology]",111,44,cmuir.cmu.ac.th,1,7,5,25,5.0,8.0,1.0,0.0
49396,http://cse.google.co.vi/url?sa=i&url=https://p...,Redirect Notice,Redirect Notice Redirect Notice The previous p...,en,<urn:uuid:df6a28da-84c5-4433-8ac6-15997dbb6843>,2025-02-13T21:41:35Z,,"[Shop, Travel]",118,15,cse.google.co.vi,1,2,7,26,2.0,7.0,0.0,0.0
49397,http://cod61.ru/page,,- 29.08.2024 09:47 21.04.2024 16:50 08.03.2024...,ru,<urn:uuid:6551db1d-f777-487b-8f28-3fe6d8b55621>,2025-02-13T21:50:15Z,,[Uncategorized],20,0,cod61.ru,1,0,2,5,0.0,0.0,0.0,0.0
49398,http://coloriagedisney.50webs.com/images/bambi...,Colorier Bambi,Colorier Bambi Pour imprimer l'image : clic dr...,fr,<urn:uuid:7099d167-2163-45b8-bab1-5a5afc11060f>,2025-02-13T22:32:19Z,,"[Entertainment, Education]",57,14,coloriagedisney.50webs.com,3,0,3,9,2.0,6.5,0.0,0.0
