Import NumPy, Pandas and os.

In [2]:
import numpy as np
import pandas as pd
import os

SCP 1 to 6999 by C ZE (https://www.kaggle.com/datasets/czzzzzzz/scp1to7) will be used as the source of training data. It is already present in the repository.

In [3]:
path_to_df = os.path.abspath("data/scp6999.csv")
df = pd.read_csv(path_to_df, encoding='utf-8')

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6999 entries, 0 to 6998
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   code            6999 non-null   object 
 1   title           6999 non-null   object 
 2   text            6999 non-null   object 
 3   image captions  3105 non-null   object 
 4   rating          6611 non-null   float64
 5   state           6999 non-null   object 
 6   tags            6596 non-null   object 
 7   link            6999 non-null   object 
dtypes: float64(1), object(7)
memory usage: 437.6+ KB


In [14]:
df.sample(1)

Unnamed: 0,code,title,text,image captions,rating,state,tags,link
5827,SCP-5828,"""As in Life, So in Death""","""NOTICE TO ALL RESEARCHERS: The following SCP ...",,59.0,active,_licensebox cadaver gaseous humanoid immobile ...,https://scp-wiki.wikidot.com/scp-5828


In [18]:
df['text'].sample(1)

4801    ""Image name: Screaming Seagull\nAvailable via...
Name: text, dtype: object

Most of the columns, apart from image captions, are populated well enough to use. However, only code, title and text will be used.

In [5]:
useful_df = df[['code', 'title', 'text']]
useful_df.sample(1)

Unnamed: 0,code,title,text
659,SCP-660,"""Earthen Womb""","""Item #: SCP-660 \n Object Class: Safe \n Spec..."


The text column holds web-scraped page content without any editing, so it will be parsed to keep only the Object Class, Special Containment Procedures and Description parts. To make the neural network's work easier, the special character '☺' is inserted between text parts, so it is easier to conclude that 'SCP-XXXXX ☺"' marks the start of a "file".

In [6]:
def get_description(s: str):
    # Section labels whose lines are kept; every other line is discarded.
    tags = ('Description:', 'Object Class:', 'Special Containment Procedures:')

    lines = [line.strip() for line in s.splitlines()]
    description = [line for line in lines if line.startswith(tags)]

    # Join the kept lines with '☺' markers; None flags entries with no labelled line.
    return '\n☺'.join(description) if description else None
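
As a quick check of what the function keeps, the sketch below runs the same filtering on a made-up page text. The sample string is invented for illustration and is not taken from the dataset. Only lines that start with one of the three labels survive, so unlabelled lines and continuation paragraphs are dropped.

```python
def get_description(s):
    # Same logic as the cell above: keep lines starting with a section label.
    tags = ('Description:', 'Object Class:', 'Special Containment Procedures:')
    lines = [line.strip() for line in s.splitlines()]
    kept = [line for line in lines if line.startswith(tags)]
    return '\n☺'.join(kept) if kept else None

# Hypothetical sample, invented for illustration only.
sample = (
    "Item #: SCP-XXXX \n"
    " Object Class: Safe \n"
    " Special Containment Procedures: Keep in a locked box. \n"
    " Description: A small grey stone. \n"
    " Addendum: This line has no matching label and is dropped."
)

parsed = get_description(sample)
print(parsed)

# A text without any of the labels yields None and is filtered out later.
no_labels = get_description("Nothing useful here")
print(no_labels)
```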

In [7]:
useful_df = useful_df.assign(text=useful_df['text'].apply(get_description))
useful_df = useful_df[useful_df['text'].notna()]
useful_df.sample(1)

Unnamed: 0,code,title,text
3199,SCP-3200,"""Chronos""",Object Class: Keter\n☺Special Containment Proc...


In [8]:
texts_df = useful_df.apply(lambda row: '\n'.join('☺' + value for value in row), axis=1)
texts_df.sample(1)

5638    ☺SCP-5639\n☺"The Prospero Complex"\n☺Descripti...
dtype: object
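
To make the resulting record layout concrete, here is a minimal sketch of the same per-row assembly in plain Python. The row values are hypothetical and invented for illustration; they are not taken from the dataset.

```python
# Hypothetical row values, invented for illustration only.
row = {
    'code': 'SCP-XXXX',
    'title': '"Example Title"',
    'text': 'Object Class: Safe\n☺Description: A small grey stone.',
}

# Prefix every field with '☺' and join the fields with newlines,
# mirroring what the apply() call does for each row.
record = '\n'.join('☺' + value for value in row.values())
print(record)
```

Every line of the assembled record starts with '☺', so the marker always sits directly in front of the code, the title and each kept section.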

One of the main RNN model parameters is vocabulary size. This dataset contains many characters that are used in only one or two "files". Such characters have almost no effect on the output while still enlarging the vocabulary, so entries containing them are filtered out.

In [9]:
reduced_vocab = ['\n', '\r', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\xa0', '¢', '£', '§', '©', '®', '°', '±', '²', '³', '´', 'µ', 'º', '¼', '½', '¾', 'Æ', 'É', 'Ó', '×', 'Ø', 'Ü', 'ß', 'á', 'æ', 'é', 'í', 'ï', 'ð', 'ó', '÷', 'ø', 'ú', 'þ', 'ć', 'ı', 'ś', 'ź', 'Ɣ', 'Ɵ', 'Ʒ', 'Ƹ', 'ə', 'ʊ', 'ˈ', '˚', 'Δ', 'Ι', 'Σ', 'Ω', 'α', 'β', 'γ', 'δ', 'ε', 'η', 'θ', 'λ', 'μ', 'ν', 'ο', 'π', 'ρ', 'ς', 'σ', 'τ', 'υ', 'ψ', 'ω', '‑', '–', '—', '‘', '’', '“', '”', '†', '•', '…', '′', '″', '€', '℃', '℞', '™', 'Ⅱ', '∆', '−', '∞', '≈', '≡', '≤', '≥', '⊃', '█', '☺', '⚠', '➡',]
set_vocab = set(reduced_vocab)

texts_df = texts_df.apply(lambda x: x if set(x).issubset(set_vocab) else None)
texts_df = texts_df.dropna()
texts_df.info()

<class 'pandas.core.series.Series'>
Int64Index: 6091 entries, 1 to 6998
Series name: None
Non-Null Count  Dtype 
--------------  ----- 
6091 non-null   object
dtypes: object(1)
memory usage: 95.2+ KB


Finally, the data can be saved.

In [10]:
path_to_file = os.path.join("data", "scp6999.txt")
np.savetxt(path_to_file, texts_df.values, fmt='%s', newline="\n\n\n", encoding="utf-8")
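
When the file is read back later, the triple newline written after each entry can serve as the record separator. The following is a minimal sketch, assuming that no entry contains three consecutive newlines itself; `read_entries` is a hypothetical helper and not part of this notebook. The usage example writes a small file in the same layout with plain Python so that it runs without the dataset.

```python
import os
import tempfile

def read_entries(path):
    # Read the whole file and split on the separator written after each entry.
    with open(path, encoding='utf-8') as f:
        content = f.read()
    # The separator also follows the last entry, so drop the empty tail.
    return [entry for entry in content.split('\n\n\n') if entry]

# Toy entries, invented for illustration only.
toy_entries = [
    '☺SCP-XXXX\n☺"First"\n☺Object Class: Safe',
    '☺SCP-YYYY\n☺"Second"\n☺Object Class: Keter',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.txt')
    # Write each entry followed by the triple-newline separator.
    with open(toy_path, 'w', encoding='utf-8') as f:
        for entry in toy_entries:
            f.write(entry + '\n\n\n')
    loaded = read_entries(toy_path)

print(loaded == toy_entries)
```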