# Data preprocessing pipeline for DST-GPT

The DST-GPT corpus is drawn mainly from the [Wiki Fandom on DST](https://dontstarve.fandom.com/wiki/Don%27t_Starve_Together#Return_of_Them).

<style>
    .centered-caption {
        text-align: center;
    }
</style>

<figure class="centered-caption">
  <img src="../assets/DST1.png" alt="Preview of Wiki Fandom">
  <figcaption>Preview of Wiki Fandom</figcaption>
</figure>

The data preprocessing pipeline is divided into the following parts.

**1. Preview data**  
**2. Crawl raw data from webpages**  
**3. Categorize raw data into structured data**

## Preliminary

Prepare the environment.

In [None]:
# install required packages
%pip install beautifulsoup4
%pip install requests

## **1. Preview data** 

Crawl a batch of webpages. Here we create a class named `WebCrawlerTest` for testing.

In [3]:
import requests
from bs4 import BeautifulSoup
import json

class WebCrawlerTest:
    def __init__(self):
        self.url_list = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3']
        
    def test_crawl_webpages(self):
        data_list = self.crawl_webpages(self.url_list)
        
        # Assert that the result is not empty
        assert data_list
        
        # Assert that each item in the result has 'url' and 'content' keys
        for data in data_list:
            assert 'url' in data
            assert 'content' in data
            print(json.dumps(data, indent=4))

    def crawl_webpages(self, url_list):
        data_list = []
        
        for url in url_list:
            try:
                # Use a timeout so that an unresponsive server cannot hang the preview
                response = requests.get(url, timeout=10)
                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, 'html.parser')
                    raw_text = soup.get_text()
                    data = {'url': url, 'content': raw_text}
                    data_list.append(data)
            except requests.RequestException as e:
                print(f"Request error: {e}")
            except Exception as e:
                print(f"Error occurred: {e}")
        
        return data_list

In [None]:
# Create an instance for testing
test_instance = WebCrawlerTest()
# Set the URLs to test
test_urls=[
    "https://dontstarve.fandom.com/wiki/Don%27t_Starve_Wiki",
    "https://dontstarve.fandom.com/wiki/File:Abigail_Sounds.ogg",
    "https://forums.kleientertainment.com/forums/topic/102600-game-update-307715/?tab=comments#comment-1151985",
    "https://dontstarve.fandom.com/wiki/Talk:Winona?action=edit&redlink=1",
    "https://dontstarve.fandom.com/wiki/Winona",
    "https://dontstarve.fandom.com/wiki/Winona#Trivia"
]
test_instance.url_list=test_urls
# Run testing
test_instance.test_crawl_webpages()

The preview shows that some `href` links point to websites outside Wiki Fandom, mostly `youtube.com`. Webpages whose URL does not start with `"https://dontstarve.fandom.com/wiki"` are therefore skipped. In addition, URLs with the following features are omitted while crawling, because most of them carry little useful information.

**1. Contains '?' or '='**

> URLs that contain '?' and '=' typically carry query parameters and point to dynamically generated pages. Here they refer to pages with little useful information, such as update histories and comments.

**2. Contains 'File:'**

> URLs that contain 'File:' typically refer to resources such as images and audio files.

**3. Contains 'Template:'**

> URLs that contain 'Template:' typically refer to wiki template pages rather than article content.

**4. Contains 'Talk:'**

> URLs that contain 'Talk:' typically refer to discussion pages that hold players' comments.

**5. Contains '#'**

> URLs that contain '#' typically refer to anchor links within a webpage. These anchors navigate to a specific section of the same page, so the page content is a duplicate of the URL without the anchor.
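
The rules above can be collected into a single predicate. The following is a minimal sketch, assuming a hypothetical helper named `should_crawl` and a marker list copied from the five rules; it is applied to the preview URLs from the test cell.

```python
WIKI_PREFIX = "https://dontstarve.fandom.com/wiki"
# Substrings taken from the five rules above
EXCLUDED_MARKERS = ("?", "=", "File:", "Template:", "Talk:", "#")

def should_crawl(url: str) -> bool:
    """Return True if the URL stays on the wiki and matches none of the rules."""
    if not url.startswith(WIKI_PREFIX):
        return False
    return not any(marker in url for marker in EXCLUDED_MARKERS)

preview_urls = [
    "https://dontstarve.fandom.com/wiki/Don%27t_Starve_Wiki",
    "https://dontstarve.fandom.com/wiki/File:Abigail_Sounds.ogg",
    "https://forums.kleientertainment.com/forums/topic/102600-game-update-307715/?tab=comments#comment-1151985",
    "https://dontstarve.fandom.com/wiki/Talk:Winona?action=edit&redlink=1",
    "https://dontstarve.fandom.com/wiki/Winona",
    "https://dontstarve.fandom.com/wiki/Winona#Trivia",
]

kept = [url for url in preview_urls if should_crawl(url)]
print(kept)
```

Only the wiki main page and the `Winona` article pass the filter; the file page, the forum thread, the talk page and the anchored URL are all rejected.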


## **2. Crawl raw data from webpages**  

Here we create a class named `WebCrawler` for crawling text information from webpages. Its main member functions are as follows:

**1. is_valid_url(url)**

> Decide whether a webpage is valid, i.e. whether it is needed for the corpus.

**2. clean_text(text)**

> Remove unnecessary characters from the text crawled from webpages.

**3. append_json_data(new_data)**

> Append newly crawled data to the local JSON file incrementally, without rewriting the whole file.

**4. crawl_page(url)**

> Crawl the text of a page, then follow its links recursively to traverse the remaining URLs.
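
To make the cleaning step concrete, here is a minimal standalone sketch of the whitespace normalization that `clean_text` performs. The function name `normalize_whitespace` and the sample string are illustrative and not part of the crawler.

```python
def normalize_whitespace(text: str) -> str:
    """Drop left-to-right marks and collapse whitespace runs into single spaces."""
    text = text.replace("\u200e", "")
    # str.split() without arguments splits on any run of whitespace, including \n and \t
    return " ".join(text.split())

sample = "Winona\u200e\n\n\tis a   playable\tcharacter.\n"
print(normalize_whitespace(sample))
# → Winona is a playable character.
```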



In [4]:
import json
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
import time
import os
import re

class WebCrawler:
    def __init__(self, base_url=None, start_url=None, json_file_path=None, history_crawled_urls=None):
        self.base_url = base_url or 'https://dontstarve.fandom.com/wiki/'
        self.start_url = start_url or 'https://dontstarve.fandom.com/wiki/Don%27t_Starve_Wiki'
        self.json_file_path = json_file_path or '../data/dst_raw_text.json'
        self.history_crawled_urls = history_crawled_urls or set()
        self.history_crawled_urls.add(self.start_url)
        self.set_crawled_urls()
        self.mainbranch_urls=set()
        self.set_mainbranch_urls()

    def is_valid_url(self, url):
        if not isinstance(url, str):
            raise ValueError("URL must be a string.")
        invalid_symbols = ['?', 'File:', 'Template:', 'Talk:', '#','Special:','Message_Wall:','User:','Help:','User_blog:','Board_Thread:','Thread:']
        return all(invalid_symbol not in url for invalid_symbol in invalid_symbols) and url.startswith(self.base_url)
    
    def remove_invalid_elements(self):
        with open(self.json_file_path, 'r', encoding='utf-8') as file:
            data_list = json.load(file)
        
        data_list = [data for data in data_list if self.is_valid_url(data['source'])]
        
        with open(self.json_file_path, 'w', encoding='utf-8') as file:
            json.dump(data_list, file, ensure_ascii=False, indent=4)
        

    def clean_text(self, text):
        if not isinstance(text, str):
            raise ValueError("Text must be a string.")
        text = text.replace('\n', ' ').replace('\t', ' ')
        text = re.sub('\u200E', '', text)
        text = ' '.join(text.split())
        return text

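    def append_json_data(self, new_data):
        # Create the file with an empty JSON list if it is missing or empty
        if not os.path.exists(self.json_file_path) or os.path.getsize(self.json_file_path) == 0:
            with open(self.json_file_path, 'w', encoding='utf-8') as file:
                json.dump([], file)

        payload = json.dumps(new_data, ensure_ascii=False, indent=4).encode('utf-8')

        # Append in binary mode so that seek offsets are plain byte positions
        with open(self.json_file_path, 'rb+') as file:
            size = file.seek(0, os.SEEK_END)
            # Inspect only the tail of the file to locate the closing ']'
            tail_start = max(size - 64, 0)
            file.seek(tail_start)
            tail = file.read().rstrip()
            if not tail.endswith(b']'):
                raise ValueError(f"{self.json_file_path} does not end with a JSON list")
            # The list is empty when '[' directly precedes the closing ']';
            # in that case no separating comma may be written
            is_empty = tail[:-1].rstrip().endswith(b'[')

            # Replace the closing ']' with the new item, then close the list again
            file.seek(tail_start + len(tail) - 1)
            file.truncate()
            file.write((b'' if is_empty else b',') + payload + b']')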

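    def set_crawled_urls(self):
        # Nothing to load on the first run, before the JSON file exists
        if not os.path.exists(self.json_file_path):
            return

        # Read the JSON file
        with open(self.json_file_path, 'r', encoding='utf-8') as file:
            data_list = json.load(file)

        # Mark every stored page as already crawled
        for data in data_list:
            self.history_crawled_urls.add(data['source'])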
            
    def clean_duplicate_sources(self):
        # Read the JSON file
        with open(self.json_file_path, 'r', encoding='utf-8') as file:
            data_list = json.load(file)

        # Keep the first entry for each source and record the sources seen again
        seen_sources = set()
        duplicate_sources = set()
        cleaned_data_list = []
        for data in data_list:
            source = data.get('source')
            if source in seen_sources:
                duplicate_sources.add(source)
            else:
                seen_sources.add(source)
                cleaned_data_list.append(data)

        # Print duplicate sources
        if duplicate_sources:
            print("Duplicate sources found:")
            for source in duplicate_sources:
                print(source)
        else:
            print("No duplicate source found!")

        # Write cleaned data back to the JSON file
        with open(self.json_file_path, 'w', encoding='utf-8') as file:
            json.dump(cleaned_data_list, file, ensure_ascii=False, indent=4)

    def set_mainbranch_urls(self):
        self.mainbranch_urls = set()
        try:
            response = requests.get(self.start_url, timeout=10)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                for link in soup.find_all('a', href=True):
                    child_url = urljoin(self.start_url, link['href'])
                    self.mainbranch_urls.add(child_url)
                    
        except requests.RequestException as e:
            print(f"Request error: {e}")
        except Exception as e:
            print(f"An error occurred: {e}")
     
    def crawl_page(self, url):
        if not self.is_valid_url(url) :
            return
        else:  # valid url
            try:
                
                response = requests.get(url, timeout=10)
                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, 'html.parser')
                    cleaned_text = self.clean_text(soup.get_text())
                    data = {'text': cleaned_text, 'source': url}
                    if url not in self.history_crawled_urls:
                        self.append_json_data(data)
                    self.history_crawled_urls.add(url)

                    for link in soup.find_all('a', href=True):
                        child_url = urljoin(url, link['href'])
                        # print(child_url)
                        # Follow only unseen wiki links that are not handled by the main-branch loop
                        if (child_url.startswith(self.base_url)
                                and child_url != self.start_url
                                and child_url not in self.mainbranch_urls
                                and child_url not in self.history_crawled_urls):
                            self.crawl_page(child_url)
            except requests.RequestException as e:
                print(f"Request error: {e}")
            except Exception as e:
                print(f"An error occurred: {e}")
    def traverse_mainbranch_pages(self):
        for url in self.mainbranch_urls:
            self.crawl_page(url)
    
    def print_info(self):
        print("base_url:", self.base_url)
        print("start_url:", self.start_url)
        print("json_file_path:", self.json_file_path)
        print(f"Num of history_crawled_urls:{len(self.history_crawled_urls)}")
        print(f"Num of mainbranch_urls:{len(self.mainbranch_urls)}")
        # print("mainbranch_urls:", self.mainbranch_urls)
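
`append_json_data` keeps the output a single JSON array by rewriting the closing bracket on every append. One alternative, not used in this notebook, is the JSON Lines layout, where each record occupies one line and appending never touches earlier bytes. The following is a minimal sketch of that alternative; the helpers `append_record` and `load_records` and the temporary file path are assumptions made for illustration.

```python
import json
import os
import tempfile

def append_record(path: str, record: dict) -> None:
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as file:
        file.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_records(path: str) -> list:
    """Read all records back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]

# Demonstration on a temporary file (the path is illustrative)
demo_path = os.path.join(tempfile.mkdtemp(), "dst_raw_text.jsonl")
append_record(demo_path, {"text": "Winona is a playable character.",
                          "source": "https://dontstarve.fandom.com/wiki/Winona"})
append_record(demo_path, {"text": "English Tiếng Việt",
                          "source": "https://dontstarve.fandom.com/wiki/Axe"})
records = load_records(demo_path)
print(len(records))
```

The trade-off is that such a file can no longer be read with a single `json.load` call, so any code that loads the raw data would need a line-by-line reader like `load_records`.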






In [9]:
# Start crawling
crawler = WebCrawler()
crawler.clean_duplicate_sources()
crawler.remove_invalid_elements()
crawler.print_info()


base_url: https://dontstarve.fandom.com/wiki/
start_url: https://dontstarve.fandom.com/wiki/Don%27t_Starve_Wiki
json_file_path: ./data/dst_raw_text.json
Num of history_crawled_urls:5164
Num of mainbranch_urls:575


In [None]:
crawler.traverse_mainbranch_pages()
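
`crawl_page` follows links recursively, so a long chain of pages can run into Python's recursion limit. The same traversal can be written iteratively with an explicit queue. The sketch below is one way to do this, not the crawler's own implementation: `fetch_links` is a hypothetical callable standing in for the request-and-parse step, and a small in-memory link graph replaces real network access.

```python
from collections import deque

def crawl_iteratively(start_urls, fetch_links, is_valid_url):
    """Breadth-first traversal that visits every reachable valid URL once."""
    visited = set()
    order = []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in visited or not is_valid_url(url):
            continue
        visited.add(url)
        order.append(url)
        # fetch_links is a placeholder for requests.get plus link extraction
        for child_url in fetch_links(url):
            if child_url not in visited:
                queue.append(child_url)
    return order

# Toy link graph (hypothetical pages) used in place of network access
BASE = "https://dontstarve.fandom.com/wiki/"
link_graph = {
    BASE + "Winona": [BASE + "Axe", BASE + "Winona#Trivia"],
    BASE + "Axe": [BASE + "Winona", BASE + "Tools"],
    BASE + "Tools": [],
}

def is_valid(url):
    return url.startswith(BASE) and "#" not in url

order = crawl_iteratively([BASE + "Winona"], lambda u: link_graph.get(u, []), is_valid)
print(order)
```

Each page is visited once even though `Axe` links back to `Winona`, and the anchored URL is dropped by the validity check.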

## **3. Categorize raw data into structured data**

In [4]:
import json
import random

class Corpus:
    def __init__(self, raw_data_filepath=None, refined_data_filepath=None, raw_data=None):
        self.raw_data_filepath = raw_data_filepath or '../data/dst_raw_text.json'
        self.refined_data_filepath = refined_data_filepath or '../data/refined_data.json'
        self.raw_data = raw_data or self.load_data()
        self.refined_data = None
        self.common_text=None
        self.find_common_text()
    
    def load_data(self):
        with open(self.raw_data_filepath, 'r', encoding='utf-8') as file:
            raw_data = json.load(file)
        return raw_data

    def get_raw_data_info(self):
        num_elements = len(self.raw_data)
        print("Number of elements:", num_elements)
        
        if num_elements > 0:
            sample_elements = random.sample(self.raw_data, k=min(num_elements, 2))
            num_fields = len(sample_elements[0])
            field_names = sample_elements[0].keys()
            
            print("Number of fields:", num_fields)
            print("Field names:", field_names)
            
            print("Sample elements:")
            for element in sample_elements:
                print(json.dumps(element, ensure_ascii=False, indent=4))
        else:
            print("No elements found in raw data.")
            
    def get_refined_data_info(self):
        num_elements = len(self.refined_data)
        print("Number of refined data elements:", num_elements)
        
        if num_elements > 0:
            sample_elements = random.sample(self.refined_data, k=min(num_elements, 2))
            num_fields = len(sample_elements[0])
            field_names = sample_elements[0].keys()
            
            print("Number of fields:", num_fields)
            print("Field names:", field_names)
            
            print("Sample elements:")
            for element in sample_elements:
                print(json.dumps(element, ensure_ascii=False, indent=4))
        else:
            print("No elements found in refined data.")

    def process_raw_data(self):
        self.refined_data = []
        for element in self.raw_data:
            text = element.get('text', '')
            if '|' in text:
                filename = text.split('|', 1)[0].strip()
                element['filename'] = filename
                if self.common_text:
                    text = text.replace(self.common_text, '')
            element['text'] = text  # store the (possibly cleaned) text in the refined record
            self.refined_data.append(element)
    
    def save_refined_data(self):
        with open(self.refined_data_filepath, 'w', encoding='utf-8') as file:
            json.dump(self.refined_data, file, ensure_ascii=False, indent=4)
    
    def find_common_text(self):
        if len(self.raw_data) >= 2:
            random_elements = random.sample(self.raw_data, k=2)
            text1 = random_elements[0].get('text', '')
            text2 = random_elements[1].get('text', '')
            common_text = ''
            if '|' in text1 and '|' in text2:
                index1 = text1.index('|') + 1
                index2 = text2.index('|') + 1
                while index1 < len(text1) and index2 < len(text2) and text1[index1] == text2[index2]:
                    common_text += text1[index1]
                    index1 += 1
                    index2 += 1
            self.common_text = common_text
        else:
            print("Insufficient data to find common text.")
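
`find_common_text` compares two randomly chosen pages, so the detected boilerplate can differ between runs. A more stable variant takes the longest common prefix over several pages. This is a minimal sketch under that assumption: `common_boilerplate` and the toy records are illustrative, and `os.path.commonprefix` is used only because it compares strings character by character.

```python
import os

def common_boilerplate(records, sample_size=20):
    """Longest prefix shared by the text that follows the first '|' of each record."""
    tails = [
        record["text"].split("|", 1)[1]
        for record in records[:sample_size]  # first pages, so the result is deterministic
        if "|" in record.get("text", "")
    ]
    if len(tails) < 2:
        return ""
    return os.path.commonprefix(tails)

toy_records = [
    {"text": "Axe | Don't Starve Wiki | Fandom Explore Main Page Axe is a tool."},
    {"text": "Wilson | Don't Starve Wiki | Fandom Explore Main Page Wilson is a character."},
    {"text": "Caves | Don't Starve Wiki | Fandom Explore Main Page Caves are underground."},
]
print(repr(common_boilerplate(toy_records)))
```

As with the two-page comparison, the prefix can swallow the first characters of page-specific text when several pages happen to start the same way, so the result is worth checking by eye.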

In [5]:
corpus=Corpus()
corpus.get_raw_data_info()
print("common_text:",corpus.common_text)

Number of elements: 5164
Number of fields: 2
Field names: dict_keys(['text', 'source'])
Sample elements:
{
    "text": "Category:Woodie Filter | Don't Starve Wiki | Fandom Don't Starve Wiki Explore Main Page Discuss All Pages Community Interactive Maps Recent Blog Posts Don't Starve Features Hunger Health Sanity Darkness Seasons Freezing Rain Caves Ruins Nightmare Cycle Characters Wilson Willow Wolfgang Wendy WX-78 Wickerbottom Woodie Wes Maxwell Wagstaff Crafting Tools Light Survival Food Science Fight Structures Refine Magic Dress Structures Campfire Science Machine Crock Pot Ice Box Chest Drying Rack Tent Touch Stone Wooden Thing Mobs Chester Pig Beefalo Tallbird Spider Krampus Deerclops MacTusk N' Son Shadow Creatures Abigail Lore The Constant Adventure Mode Timeline Charlie Codex Umbra Nightmare Throne Ancient Civilization Puzzles DLC Reign of Giants Shipwrecked Hamlet Features Wetness Overheating Boats Poison Strong Winds Volcano Pig Shops Pig Traders Housing Ancient Pig Ruins Ch

In [6]:
corpus.process_raw_data()
corpus.save_refined_data()
corpus.get_refined_data_info()

Number of refined data elements: 5164
Number of fields: 3
Field names: dict_keys(['text', 'source', 'filename'])
Sample elements:
{
    "text": "Tin Fishin' Bin |Craftable Structures, Don't Starve Together, Structures, and 6 more Craftable Items Seafaring Filter Storage Solutions Filter Fishing Filter Return of Them Non-Flammable English Tiếng Việt Tin Fishin' Bin Edit Edit source View history Talk (0) Exclusive to: Tin Fishin' Bin \"Keep fish as fresh as the day you net them.\" Crafting Recipe ×1 ×3 Filter Tier Perk Refresh spoilage of Ocean Fishes placed inside. Code \"fish_box\" “They're stuffed in there like sardines!” –Wilson “Did we just... put a hole in the boat?” –Willow “Is new home for fish. For now.” –Wolfgang “It feels a bit cruel to trap them within sight of freedom.” –Wendy “THE DISGUSTING CREATURES WILL PUTRIFY SLOWER IN THERE” –WX-78 “A clever contraption to keep seafood fresh.” –Wickerbottom “Reminds me of the old salmon fisheries.” –Woodie “Ugh, the smell... the thing

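The effect of `process_raw_data` on a single record can be seen on a toy example. The sketch below assumes a hypothetical `refine_record` helper and a made-up boilerplate string; unlike the method in `Corpus`, it returns a new dictionary instead of modifying the raw record.

```python
def refine_record(record, common_text):
    """Extract the page title before the first '|' and strip shared boilerplate."""
    refined = dict(record)  # copy so the raw record is left untouched
    text = refined.get("text", "")
    if "|" in text:
        refined["filename"] = text.split("|", 1)[0].strip()
        if common_text:
            text = text.replace(common_text, "")
    refined["text"] = text
    return refined

raw = {
    "text": "Axe | Don't Starve Wiki | Fandom Explore Main Page Axe is a tool.",
    "source": "https://dontstarve.fandom.com/wiki/Axe",
}
boilerplate = " Don't Starve Wiki | Fandom Explore Main Page "  # illustrative value
refined = refine_record(raw, boilerplate)
print(refined["filename"], "->", refined["text"])
```

The title stays at the front of the text, followed directly by the page-specific content, which matches the shape of the refined sample shown above.
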
In [7]:
import json

# Read the refined_data.json file (written as UTF-8 with non-ASCII characters kept)
with open('../data/refined_data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Create a set to track unique filename field values
seen_filenames = set()

# Filter the data based on item field values
filtered_data = []
for item in data:
    filename = item.get('filename')
    if filename in ['Axe', 'Wilson', 'Hunger', 'Food Tab', 'Caves'] and filename not in seen_filenames:
        seen_filenames.add(filename)
        print(len(item.get('text')))
        filtered_data.append(item)

# Save the filtered data as sample_data.json, indented and with non-ASCII characters kept as-is
with open('../data/sample_data.json', 'w', encoding='utf-8') as file:
    json.dump(filtered_data, file, indent=4, ensure_ascii=False)

6230
14403
10830
2151
15233
