# RedPajama Dataset: StackExchange data extraction and processing

> This notebook was presented at PyData London 2023.
>
> A link will be provided to the talk video when available!

First, download the XML data and convert it to JSON Lines for easy ingestion into Daft.

This step runs outside the live demo because bandwidth throttling makes the download very slow!

In [None]:
###
# Taken from: https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/stack_exchange/download.py
###

# import os
# import pandas as pd
# from tqdm import tqdm

# BASE_URL="https://archive.org/download/stackexchange/"

# table = pd.read_html(BASE_URL)[0]
# sources = [x.replace(" (View Contents)", "") for x in table['Name'].tolist()]
# sources = [x for x in sources if x.endswith(".7z")]
# for source in tqdm(sources):
#     # if ".meta." not in source:
#     print(f"source: {source}")
#     os.system("wget "+BASE_URL+source+" -O "+"./data/"+source)
#     os.system("7z x ./data/"+source+" -o./data/"+source[:-3])
#     os.system(f"mv ./data/{source[:-3]}/Posts.xml ./data/{source[:-3]}.xml")
#     os.system(f"rm -rf ./data/{source[:-3]}")
#     os.system(f"rm ./data/{source}")

###
# Convert data from XML to JSON
###

# import os
# from lxml import etree
# import json

# for filename in os.listdir("data"):
#     if ".meta." in filename:
#         continue
#     if not filename.endswith(".xml"):
#         continue
#     with open(f"data/{filename}") as xml_f, open(f"data/{filename[:-3]}jsonl", "w") as json_f:
#         for line in xml_f:
#             if not line.strip().startswith("<row"):
#                 continue
        
#             # `line` is a <row /> XML object
#             obj = etree.fromstring(line)
#             data = {key: obj.get(key) for key in obj.keys()}
#             data["site"] = filename[:-4]
#             json_f.write(json.dumps(data))
#             json_f.write("\n")
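
The conversion above boils down to one step per line: parse a `<row ... />` element and dump its attributes as a JSON object. Here is a minimal sketch of that step. It uses the standard library's `xml.etree.ElementTree` in place of `lxml` (a swap made only to keep the example dependency-free), and the helper name `row_to_record` and the sample row are illustrative, not part of the original script.

```python
import json
import xml.etree.ElementTree as ET


def row_to_record(line: str, site: str) -> dict:
    """Turn one `<row ... />` line from Posts.xml into a flat dict of its attributes."""
    element = ET.fromstring(line)
    record = dict(element.attrib)  # every attribute value is a string
    record["site"] = site          # tag the record with the site it came from
    return record


# Illustrative sample row (values borrowed from the table shown later in this notebook).
sample_line = '<row Id="5" PostTypeId="1" Score="40" Title="How is PLA different from ABS material?" />'
record = row_to_record(sample_line, "3dprinting.stackexchange.com")
json_line = json.dumps(record)  # one line of the resulting .jsonl file
```

Every XML attribute value arrives as a string, which is consistent with all columns loading as `Utf8` in the next section.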


## Reading Data

Read the JSON Lines files into a Daft dataframe.

In [1]:
# How much data are we working with here? About 10 GB!
!ls -la data | grep jsonl | awk '{ total += $5 }; END { print total / 1000000000 }'

9.86196


In [2]:
import daft

df = daft.read_json("data/*.jsonl")
df.collect()

2023-06-04 11:56:49.648 | INFO     | daft.context:runner:80 - Using PyRunner


Id Utf8,PostTypeId Utf8,AcceptedAnswerId Utf8,CreationDate Utf8,Score Utf8,ViewCount Utf8,Body Utf8,OwnerUserId Utf8,LastActivityDate Utf8,Title Utf8,Tags Utf8,AnswerCount Utf8,CommentCount Utf8,ContentLicense Utf8,site Utf8,LastEditorUserId Utf8,LastEditDate Utf8,ParentId Utf8,ClosedDate Utf8,OwnerDisplayName Utf8,FavoriteCount Utf8,CommunityOwnedDate Utf8,LastEditorDisplayName Utf8
1,1,51.0,2016-01-12T18:45:19.963,10,409.0,<p>When I've printed an object I've had to choose between...,16,2017-10-31T02:31:08.560,How to obtain high resolution prints in a shorter period ...,<resolution><speed><quality>,2.0,6,CC BY-SA 3.0,3dprinting.stackexchange.com,,,,,,,,
2,1,12.0,2016-01-12T18:45:51.287,33,7149.0,"<p>I would like to buy a 3D printer, but I'm concerned ab...",20,2019-06-10T23:18:34.190,Is 3D printing safe for your health?,<print-material><safety><health>,4.0,1,CC BY-SA 3.0,3dprinting.stackexchange.com,334.0,2016-11-15T16:16:11.163,,,,,,
3,1,152.0,2016-01-12T18:46:22.083,18,2670.0,<p>I know the minimum layer height will effect how detail...,11,2016-09-19T15:41:06.537,How important is the minimum layer height on a 3d printer?,<quality><resolution>,3.0,5,CC BY-SA 3.0,3dprinting.stackexchange.com,11.0,2016-01-12T22:00:36.347,,,,,,
4,1,1289.0,2016-01-12T18:50:55.973,17,376.0,<p>Plastic is used in 3D FDM/FFF printing partly because ...,16,2016-06-10T13:32:20.493,Are there any metals that exhibit a large glass state?,<fdm><material><print-material><metal-parts>,4.0,0,CC BY-SA 3.0,3dprinting.stackexchange.com,98.0,2016-06-09T02:10:35.890,,,,,,
5,1,77.0,2016-01-12T18:53:53.623,40,3952.0,<p>What are the main differences when using ABS over PLA ...,11,2017-08-02T09:49:07.263,How is PLA different from ABS material?,<filament><abs><fdm><pla>,5.0,5,CC BY-SA 3.0,3dprinting.stackexchange.com,20.0,2016-01-15T17:02:37.707,,,,,,
6,1,27.0,2016-01-12T18:57:13.350,11,620.0,<p>My MakerBot printer supports only two filaments at the...,20,2018-09-16T12:35:19.097,Multi-color printing with desktop 3D printer?,<filament><makerbot><dual-nozzle><color>,5.0,0,CC BY-SA 3.0,3dprinting.stackexchange.com,8884.0,2018-09-16T12:35:19.097,,,,,,
7,5,,2016-01-12T18:57:48.103,0,,<p>Filament is the plastic strands used as the print mate...,11,2016-01-15T17:04:10.283,,,,0,CC BY-SA 3.0,3dprinting.stackexchange.com,11.0,2016-01-15T17:04:10.283,,,,,,
8,4,,2016-01-12T18:57:48.103,0,,For questions related to different filaments used as the ...,11,2016-01-15T17:04:07.180,,,,0,CC BY-SA 3.0,3dprinting.stackexchange.com,11.0,2016-01-15T17:04:07.180,,,,,,


Count the number of posts per site

In [3]:
site_counts = (df
    .groupby("site")
    .agg([
        (daft.col("site").alias("site_count"), "count")
    ])
    .sort(
        "site_count",
        desc=True,
    )
)
site_counts.show()

site Utf8,site_count UInt64
askubuntu.com,920821
electronics.stackexchange.com,493128
english.stackexchange.com,415984
es.stackoverflow.com,399582
apple.stackexchange.com,312809
gaming.stackexchange.com,269357
ell.stackexchange.com,264967
dba.stackexchange.com,234877


Keep only posts from the 28 sites with the most posts

In [4]:
sites = site_counts.limit(28)
sites.collect()

site Utf8,site_count UInt64
askubuntu.com,920821
electronics.stackexchange.com,493128
english.stackexchange.com,415984
es.stackoverflow.com,399582
apple.stackexchange.com,312809
gaming.stackexchange.com,269357
ell.stackexchange.com,264967
dba.stackexchange.com,234877


In [5]:
df = df.join(sites, on="site")
df = df.select(
    "Id",
    "ParentId",
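    # Caution: StackExchange scores can be negative, which uint64 cannot represent;
    # a signed type such as int64 would preserve them.
    df["Score"].cast(daft.DataType.uint64()),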
    "Body",
    "Title",
    "site",
)
df.collect()

Id Utf8,ParentId Utf8,Score UInt64,Body Utf8,Title Utf8,site Utf8
1,,12,<p>How can I change the default Apache Solr URL path from...,How can I change the Apache Solr search URL?,drupal.stackexchange.com
2,,97,<p>How can I change a user's password from the command li...,How can I change a user's password from the command line ...,drupal.stackexchange.com
3,,37,"<p>In Drupal 7, the API documentation for <code>node_load...",What's the proper way to use EntityFieldQuery?,drupal.stackexchange.com
4,1.0,4,<p>This should work if you place it in settings.php:</p> ...,,drupal.stackexchange.com
7,2.0,15,"<p>If you are using Drush 4, you can use the user-passwor...",,drupal.stackexchange.com
8,2.0,139,"<p>In Drush 9, the command is <a href=""https://drushcomma...",,drupal.stackexchange.com
9,,80,"<p>Basically, one of the greatest questions of all time: ...","Suggestions for settings.php - Local dev, Development ser...",drupal.stackexchange.com
10,,0,<p><strong>Drush</strong> (<em>DRUpal SHell</em>) is a co...,,drupal.stackexchange.com
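
Joining against the 28-row `sites` table acts as a filter: only posts whose `site` appears in that table survive, assuming the join keeps only matching rows (an inner join). The same idea in plain Python, as a sketch over a tiny made-up list of posts. The helper `keep_top_sites` and the sample data are hypothetical, and `n=2` stands in for 28.

```python
from collections import Counter


def keep_top_sites(posts, n):
    """Keep only posts that belong to the n sites with the most posts."""
    counts = Counter(post["site"] for post in posts)
    top_sites = {site for site, _count in counts.most_common(n)}
    return [post for post in posts if post["site"] in top_sites]


# Made-up sample: 3 posts on one site, 2 on another, 1 on a third.
posts = (
    [{"Id": str(i), "site": "askubuntu.com"} for i in range(3)]
    + [{"Id": str(i), "site": "dba.stackexchange.com"} for i in range(2)]
    + [{"Id": "0", "site": "3dprinting.stackexchange.com"}]
)

kept = keep_top_sites(posts, n=2)
print(len(kept))  # 5: the single post from the smallest site is dropped
```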


## Clean up Data

Construct a Questions dataframe and an Answers dataframe

In [6]:
questions_df = df.where(df["ParentId"].is_null() & ~df["Title"].is_null())
questions_df.collect()

Id Utf8,ParentId Utf8,Score UInt64,Body Utf8,Title Utf8,site Utf8
1,,12,<p>How can I change the default Apache Solr URL path from...,How can I change the Apache Solr search URL?,drupal.stackexchange.com
2,,97,<p>How can I change a user's password from the command li...,How can I change a user's password from the command line ...,drupal.stackexchange.com
3,,37,"<p>In Drupal 7, the API documentation for <code>node_load...",What's the proper way to use EntityFieldQuery?,drupal.stackexchange.com
9,,80,"<p>Basically, one of the greatest questions of all time: ...","Suggestions for settings.php - Local dev, Development ser...",drupal.stackexchange.com
13,,6,<p>What are some of the biggest differences between Drush...,What are the major differences between Drush versions 3 & 4?,drupal.stackexchange.com
16,,89,<p>How can I take a site offline using Drush?</p>,How to take a site offline using Drush?,drupal.stackexchange.com
22,,93,"<p>Does having Drupal modules present, but not enabled ha...",Do non-enabled modules affect performance?,drupal.stackexchange.com
24,,9,<p>I'm building a plugin using the WYSIWYG API module for...,How do you load extra javascript files required for wysiw...,drupal.stackexchange.com


Sort the answers dataframe by parent question and by answer score, both in descending order

In [7]:
answers_df = df.where(~df["ParentId"].is_null())
answers_df = answers_df.where(~answers_df["Score"].is_null())
answers_df = answers_df.sort(["ParentId", "Score"], desc=True)
answers_df.collect()

Id Utf8,ParentId Utf8,Score UInt64,Body Utf8,Title Utf8,site Utf8
1000027,999999,40,<p>I solved the problem this way:</p> <p><code>sudo apt-...,,askubuntu.com
1001329,999997,0,"<p>Found, how to fix that. I have to intall the latest ve...",,askubuntu.com
1000000,999997,0,<p>I remember having similar issues when switching betwee...,,askubuntu.com
100012,99999,76,<p><strong>Don't walk. Run.</strong></p> <p>Your advisor...,,academia.stackexchange.com
100006,99999,5,<p>Yo lo que suelo hacer cuando tengo una función que sac...,,es.stackoverflow.com
100015,99999,2,<p>I have had success with 5 minute epoxy. The epoxy bond...,,diy.stackexchange.com
100001,99999,2,"<p>Well, a possible loop invariant is ""<span class=""math-...",,cs.stackexchange.com
100007,99999,1,"<p>The <a href=""https://www.blockchain.com/btc/address/1C...",,bitcoin.stackexchange.com


## Join Answers with Questions

1. Sort the answers by score
2. Compute a `list` of answers for each question, sorted by score!

In [8]:
answers_df = answers_df.groupby("site", "ParentId").agg([
    ("Body", "list"),
    ("Score", "list"),
])

In [9]:
answers_df.collect()

site Utf8,ParentId Utf8,Body List[Body:_local_list:Utf8],Score List[Score:_local_list:UInt64]
english.stackexchange.com,63673,['<p>While I agree with @andrewdotnich that I\'d probably...,"[5, 2]"
electronics.stackexchange.com,467994,['<p>In the world of scales multiple load cells are simpl...,[1]
askubuntu.com,303238,['<p>Thers is a very easy way to do so for any and all pr...,"[1, 0]"
ell.stackexchange.com,273625,['<p>This construction is regarded as informal. It actual...,"[3, 3]"
ell.stackexchange.com,91824,"['<p>This is the correct way to say it, if you really wan...",[1]
academia.stackexchange.com,51762,"[""<p>The editor, who selects the reviewers, may not be fa...",[2]
dsp.stackexchange.com,72937,"['<p><strong>HINT:</strong></p>\n<p><span class=""math-con...",[1]
es.stackoverflow.com,131181,['<p>Puedes utilizar <code>array_unique()</code></p>\n\n<...,[1]
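
What the aggregation computes can be sketched in plain Python: bucket answers by `(site, ParentId)`, then emit each bucket's bodies from highest to lowest score. Grouping on `site` as well as `ParentId` matters because post IDs are only unique within a site (the sorted output above shows `ParentId` 99999 on several different sites). This sketch sorts inside each bucket explicitly instead of relying on an earlier sort. The helper name `group_answers` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def group_answers(answers):
    """Map (site, ParentId) to that question's answer bodies, best score first."""
    buckets = defaultdict(list)
    for answer in answers:
        buckets[(answer["site"], answer["ParentId"])].append(answer)
    return {
        key: [a["Body"] for a in sorted(bucket, key=lambda a: a["Score"], reverse=True)]
        for key, bucket in buckets.items()
    }


# Made-up sample: the same ParentId appears on two different sites.
answers = [
    {"site": "askubuntu.com", "ParentId": "7", "Score": 1, "Body": "low"},
    {"site": "askubuntu.com", "ParentId": "7", "Score": 9, "Body": "high"},
    {"site": "dba.stackexchange.com", "ParentId": "7", "Score": 4, "Body": "other site"},
]
grouped = group_answers(answers)
```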


In [10]:
joined_df = answers_df.join(questions_df, left_on=["site", "ParentId"], right_on=["site", "Id"])
joined_df = joined_df.select(
    joined_df["Id"].alias("question_id"),
    joined_df["Title"].alias("question_title"),
    joined_df["right.Body"].alias("question_body"),
    joined_df["Body"].alias("answers_ordered"),
)
joined_df.collect()

question_id Utf8,question_title Utf8,question_body Utf8,answers_ordered List[Body:_local_list:Utf8]
73,Custom query in Views?,<p>At some point I found the need to modify an SQL query ...,['<p>You can also use <code>hook_views_query_alter()</cod...
76,"Running Drupal in a Windows environment (IIS, SQL Server)?",<p>We are in the process of evaluating Drupal to replace ...,['<p>Easiest way is to use the Microsoft Web Platform Ins...
126,Any way to add CSS for a single page/node?,<p>I'm cleaning up my big crazy style sheets (possibly pe...,"['<p>This is the sort of thing that I\'d do by code, but ..."
234,Site stuck in maintenance mode,<p>I have put a site in maintenance mode. Before I could ...,"['<p>You should be able to log in by going to <a href=""ht..."
371,Programmatically insert webform result,<p>I'm using drupal 6 with the webform module installed.<...,['<p>Have you tried using PHP to manually send the POST d...
893,Change $<field_name>_rendered output?,<p>In a node I get the values of CCK fields as $ array or...,"['<p>The <a href=""https://www.drupal.org/project/custom_f..."
1283,Creating a View filter for a CCK field,<p>I have created a custom filter which displays some sel...,"['<p>To find out what is missing, implement <code>hook_vi..."
2050,Best practice for building modules using classes,<p>I'm looking to start building my modules as classes no...,"['<p><a href=""http://groups.drupal.org/node/20728#comment..."
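
The join pairs each grouped answer list with its question by matching `(site, ParentId)` on the answers side to `(site, Id)` on the questions side. A plain-Python sketch, assuming inner-join semantics (questions without answers, and answers without a matching question, are both dropped). The function name and sample data are hypothetical.

```python
def join_answers_to_questions(grouped_answers, questions):
    """Attach each (site, ParentId) answer list to the question with the same (site, Id)."""
    questions_by_key = {(q["site"], q["Id"]): q for q in questions}
    joined = []
    for (site, parent_id), bodies in grouped_answers.items():
        question = questions_by_key.get((site, parent_id))
        if question is None:
            continue  # no matching question: dropped, as in an inner join
        joined.append(
            {
                "question_id": question["Id"],
                "question_title": question["Title"],
                "question_body": question["Body"],
                "answers_ordered": bodies,
            }
        )
    return joined


# Made-up sample: one answered question, one unanswered, one orphaned answer list.
questions = [
    {"site": "askubuntu.com", "Id": "7", "Title": "T7", "Body": "B7"},
    {"site": "askubuntu.com", "Id": "8", "Title": "T8", "Body": "B8"},
]
grouped_answers = {
    ("askubuntu.com", "7"): ["high", "low"],
    ("dba.stackexchange.com", "7"): ["other site"],
}
joined = join_answers_to_questions(grouped_answers, questions)
```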


## Post-Processing

1. Build the "text" column in our "Q:/A:/A:..." format
2. Infer the language of each entry!

In [11]:
prefixed_joined_answers = daft.lit("A: ") + joined_df["answers_ordered"].list.join("\nA: ")
prefixed_question = daft.lit("Q: ") + joined_df["question_body"]

text_df = joined_df.with_column(
    "text",
    prefixed_question + daft.lit("\n") + prefixed_joined_answers
)
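
For a single row, the expression above amounts to the following string assembly. This is a sketch in plain Python; the function `format_qa` and the sample strings are illustrative only.

```python
def format_qa(question_body, answers_ordered):
    """Build the "Q: ... / A: ... / A: ..." text for one question and its answers."""
    joined_answers = "A: " + "\nA: ".join(answers_ordered)
    return "Q: " + question_body + "\n" + joined_answers


text = format_qa("<p>How?</p>", ["<p>Like this.</p>", "<p>Or this.</p>"])
print(text)
# Q: <p>How?</p>
# A: <p>Like this.</p>
# A: <p>Or this.</p>
```

Joining with `"\nA: "` and prepending a single `"A: "` gives every answer its own prefixed line without a trailing separator.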

In [12]:
text_df.show()

question_id Utf8,question_title Utf8,question_body Utf8,answers_ordered List[Body:_local_list:Utf8],text Utf8
73,Custom query in Views?,<p>At some point I found the need to modify an SQL query ...,['<p>You can also use <code>hook_views_query_alter()</cod...,Q: <p>At some point I found the need to modify an SQL que...
76,"Running Drupal in a Windows environment (IIS, SQL Server)?",<p>We are in the process of evaluating Drupal to replace ...,['<p>Easiest way is to use the Microsoft Web Platform Ins...,Q: <p>We are in the process of evaluating Drupal to repla...
126,Any way to add CSS for a single page/node?,<p>I'm cleaning up my big crazy style sheets (possibly pe...,"['<p>This is the sort of thing that I\'d do by code, but ...",Q: <p>I'm cleaning up my big crazy style sheets (possibly...
234,Site stuck in maintenance mode,<p>I have put a site in maintenance mode. Before I could ...,"['<p>You should be able to log in by going to <a href=""ht...",Q: <p>I have put a site in maintenance mode. Before I cou...
371,Programmatically insert webform result,<p>I'm using drupal 6 with the webform module installed.<...,['<p>Have you tried using PHP to manually send the POST d...,Q: <p>I'm using drupal 6 with the webform module installe...
893,Change $<field_name>_rendered output?,<p>In a node I get the values of CCK fields as $ array or...,"['<p>The <a href=""https://www.drupal.org/project/custom_f...",Q: <p>In a node I get the values of CCK fields as $ array...
1283,Creating a View filter for a CCK field,<p>I have created a custom filter which displays some sel...,"['<p>To find out what is missing, implement <code>hook_vi...",Q: <p>I have created a custom filter which displays some ...
2050,Best practice for building modules using classes,<p>I'm looking to start building my modules as classes no...,"['<p><a href=""http://groups.drupal.org/node/20728#comment...",Q: <p>I'm looking to start building my modules as classes...


In [15]:
@daft.udf(return_dtype=daft.DataType.string())
class PredictLanguage:

    def __init__(self):
        import fasttext

        pretrained_lang_model = "lid.176.bin"
        self.model = fasttext.load_model(pretrained_lang_model)

    def __call__(self, text: daft.Series):
        preds = []
        for t in text.to_pylist():
            pred = self.model.predict(t.replace("\n", " "), k=1)
            preds.append(pred[0][0].replace("__label__", ""))
        return preds
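
The per-row logic inside `__call__` is small enough to sketch without the model: flatten newlines (fastText's `predict` handles one line of text per call), take the top label, and strip the `__label__` prefix. In this sketch `fake_predict` is a hypothetical stand-in for `self.model.predict(text, k=1)`, and passing `None` through untouched is an added assumption, not something the class above does.

```python
LABEL_PREFIX = "__label__"


def strip_label_prefix(label):
    """Turn a fastText-style label such as '__label__en' into 'en'."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def predict_languages(texts, predict_one):
    """Apply predict_one to each text; predict_one stands in for model.predict(text, k=1)."""
    languages = []
    for text in texts:
        if text is None:
            # Assumption: pass nulls through instead of failing on None.replace(...)
            languages.append(None)
            continue
        labels, _probabilities = predict_one(text.replace("\n", " "))
        languages.append(strip_label_prefix(labels[0]))
    return languages


def fake_predict(text):
    """Hypothetical stand-in for the real model: returns (labels, probabilities)."""
    assert "\n" not in text  # newlines must be flattened before prediction
    label = "__label__es" if "¿" in text else "__label__en"
    return (label,), (0.99,)


languages = predict_languages(["Q: How?\nA: So.", "Q: ¿Cómo?\nA: Así.", None], fake_predict)
```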


In [16]:
text_df = text_df.with_column("language", PredictLanguage(text_df["text"]))

In [17]:
text_df.show()

question_id Utf8,question_title Utf8,question_body Utf8,answers_ordered List[Body:_local_list:Utf8],text Utf8,language Utf8
73,Custom query in Views?,<p>At some point I found the need to modify an SQL query ...,['<p>You can also use <code>hook_views_query_alter()</cod...,Q: <p>At some point I found the need to modify an SQL que...,en
76,"Running Drupal in a Windows environment (IIS, SQL Server)?",<p>We are in the process of evaluating Drupal to replace ...,['<p>Easiest way is to use the Microsoft Web Platform Ins...,Q: <p>We are in the process of evaluating Drupal to repla...,en
126,Any way to add CSS for a single page/node?,<p>I'm cleaning up my big crazy style sheets (possibly pe...,"['<p>This is the sort of thing that I\'d do by code, but ...",Q: <p>I'm cleaning up my big crazy style sheets (possibly...,en
234,Site stuck in maintenance mode,<p>I have put a site in maintenance mode. Before I could ...,"['<p>You should be able to log in by going to <a href=""ht...",Q: <p>I have put a site in maintenance mode. Before I cou...,en
371,Programmatically insert webform result,<p>I'm using drupal 6 with the webform module installed.<...,['<p>Have you tried using PHP to manually send the POST d...,Q: <p>I'm using drupal 6 with the webform module installe...,en
893,Change $<field_name>_rendered output?,<p>In a node I get the values of CCK fields as $ array or...,"['<p>The <a href=""https://www.drupal.org/project/custom_f...",Q: <p>In a node I get the values of CCK fields as $ array...,en
1283,Creating a View filter for a CCK field,<p>I have created a custom filter which displays some sel...,"['<p>To find out what is missing, implement <code>hook_vi...",Q: <p>I have created a custom filter which displays some ...,en
2050,Best practice for building modules using classes,<p>I'm looking to start building my modules as classes no...,"['<p><a href=""http://groups.drupal.org/node/20728#comment...",Q: <p>I'm looking to start building my modules as classes...,en


In [18]:
%%time

text_df.limit(20000).collect()

CPU times: user 5.08 s, sys: 1.46 s, total: 6.53 s
Wall time: 6.51 s


question_id Utf8,question_title Utf8,question_body Utf8,answers_ordered List[Body:_local_list:Utf8],text Utf8,language Utf8
73,Custom query in Views?,<p>At some point I found the need to modify an SQL query ...,['<p>You can also use <code>hook_views_query_alter()</cod...,Q: <p>At some point I found the need to modify an SQL que...,en
76,"Running Drupal in a Windows environment (IIS, SQL Server)?",<p>We are in the process of evaluating Drupal to replace ...,['<p>Easiest way is to use the Microsoft Web Platform Ins...,Q: <p>We are in the process of evaluating Drupal to repla...,en
126,Any way to add CSS for a single page/node?,<p>I'm cleaning up my big crazy style sheets (possibly pe...,"['<p>This is the sort of thing that I\'d do by code, but ...",Q: <p>I'm cleaning up my big crazy style sheets (possibly...,en
234,Site stuck in maintenance mode,<p>I have put a site in maintenance mode. Before I could ...,"['<p>You should be able to log in by going to <a href=""ht...",Q: <p>I have put a site in maintenance mode. Before I cou...,en
371,Programmatically insert webform result,<p>I'm using drupal 6 with the webform module installed.<...,['<p>Have you tried using PHP to manually send the POST d...,Q: <p>I'm using drupal 6 with the webform module installe...,en
893,Change $<field_name>_rendered output?,<p>In a node I get the values of CCK fields as $ array or...,"['<p>The <a href=""https://www.drupal.org/project/custom_f...",Q: <p>In a node I get the values of CCK fields as $ array...,en
1283,Creating a View filter for a CCK field,<p>I have created a custom filter which displays some sel...,"['<p>To find out what is missing, implement <code>hook_vi...",Q: <p>I have created a custom filter which displays some ...,en
2050,Best practice for building modules using classes,<p>I'm looking to start building my modules as classes no...,"['<p><a href=""http://groups.drupal.org/node/20728#comment...",Q: <p>I'm looking to start building my modules as classes...,en


## Distributed Processing

In [1]:
import daft
import ray

RAY_ADDRESS = "ray://localhost:10001"

ray.init(address=RAY_ADDRESS, runtime_env={"pip": ["getdaft[aws]"]})
daft.context.set_runner_ray(address=RAY_ADDRESS)

DaftContext(runner_config=_RayRunnerConfig(address='ray://localhost:10001', max_task_backlog=None), disallow_set_runner=True)

In [2]:
%%time

df = daft.read_json("s3://daft-public-data/redpajama-1t/stackexchange-raw-xml-to-jsonl/")

2023-06-02 18:23:47.246 | INFO     | daft.context:runner:71 - Using RayRunner
2023-06-02 18:23:47,398	INFO client_builder.py:252 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
2023-06-02 18:23:47,398	INFO client_connect.py:39 -- Calling ray.init() again after it has already been called. Reusing the existing Ray client connection.


CPU times: user 77.3 ms, sys: 21 ms, total: 98.3 ms
Wall time: 6.08 s
