# Data cleaner for JupyterLite (2025)

This notebook helps you clean social media data for text analysis. As the variable names below suggest, the code was originally written for Twitter content, but it works with any other plain-text input you provide.
Run the cells below in order and follow the instructions in the output. The first step is to import the necessary packages (collections of code).

In [1]:
import re
import numpy as np
from IPython.display import FileLink, display
import ipywidgets as widgets

print("Import complete!")

Import complete!


The second step is to define the cleaning operations, including a default list of stopwords. You will be able to edit the stopwords in a text box when you run the cleaning step, so please do not change them in the cell below.

In [3]:
# Stopwords (editable)
DEFAULT_STOPWORDS = set([
    "a","an","the","is","are","was","were","be","been","being",
    "and","or","but","if","then","else","when","while","for","to","of","in","on","at","by","with","from",
    "this","that","these","those","it","its","as","not","no","so","too","very",
    # social / URL bits
    "http","https","www","com","co","org","#","@"
])
stopwords_text = widgets.Textarea(
    value=" ".join(sorted(DEFAULT_STOPWORDS)),
    description="Stopwords",
    layout=widgets.Layout(width='100%', height='110px')
)

# Cleaning function

def clean_tweet(tweet: str, stopwords):
    if tweet is None:
        return ""
    if isinstance(tweet, (float, np.floating)):
        return ""
    if not isinstance(tweet, str):
        tweet = str(tweet)

    t = tweet.lower()

    # Normalize and remove patterns
    t = re.sub(r"'", "", t)                  # strip apostrophes so contractions stay one token (can't -> cant)
    t = re.sub(r"@[A-Za-z0-9_]+", " ", t)    # mentions (also the @domain part of e-mail addresses, up to the first dot)
    t = re.sub(r"http\S+", " ", t)           # URLs
    t = re.sub(r"[()!?]", " ", t)            # some punctuation
    t = re.sub(r"\[.*?\]", " ", t)           # bracketed text
    t = re.sub(r"[^a-z0-9]", " ", t)         # replace everything except a-z and 0-9 (including accented letters) with a space

    tokens = [w for w in t.split() if w and w not in stopwords]
    return " ".join(tokens)
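
To make the effect of each cleaning step easier to see, here is a minimal, self-contained sketch that applies the same regular expressions one at a time and prints the intermediate result. The sample line and the two-word stopword set are hypothetical and chosen only for illustration; the redundant punctuation step is left out because the final pattern already covers it.

```python
import re

# Hypothetical sample line; not taken from any real dataset.
sample = "@Alice I can't wait for the show!! See https://example.com [photo] #Fun"

# Hypothetical, shortened stopword set for this illustration only.
stop = {"for", "the"}

# Same patterns as in the cleaning cell, applied in the same order.
steps = [
    (r"'", ""),                 # apostrophes
    (r"@[A-Za-z0-9_]+", " "),   # mentions
    (r"http\S+", " "),          # URLs
    (r"\[.*?\]", " "),          # bracketed text
    (r"[^a-z0-9]", " "),        # anything that is not a-z or 0-9
]

t = sample.lower()
for pattern, repl in steps:
    t = re.sub(pattern, repl, t)
    print(f"{pattern!r:22} -> {t!r}")

# Split on whitespace and drop stopwords, as the cleaning function does.
result = " ".join(w for w in t.split() if w not in stop)
print(result)
```

Note that the last pattern also replaces accented and other non-English letters with a space, so a word such as "café" would come out as "caf". If your data is not in English, you may want to adapt that pattern before running the cleaner.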

In the third step, you can upload your data in .txt format and apply the stopwords and cleaning function. You will see widgets to upload a file, paste text directly, and edit the stopwords. Do not close the browser while the script runs, and do not forget to download your cleaned data at the end.

In [4]:
# Inputs: upload or paste
uploader = widgets.FileUpload(accept='.txt', multiple=False)
paste_area = widgets.Textarea(
    placeholder='Or paste a few lines of text here (one tweet per line)...',
    layout=widgets.Layout(width='100%', height='110px')
)

def _read_uploaded_text(upl: widgets.FileUpload):
    """Return (name, text) or (None, None) if nothing uploaded. Supports ipywidgets v7/v8."""
    if not upl.value:
        return None, None
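    # v8: tuple of dict-like entries; v7: dict keyed by file name
    if isinstance(upl.value, (list, tuple)):
        entry = upl.value[0]
    else:
        entry = list(upl.value.values())[0]
    # v8 delivers the content as a memoryview; bytes() handles both versions
    data = bytes(entry.get('content', b''))
    name = entry.get('name') or entry.get('metadata', {}).get('name', 'upload.txt')
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this fallback cannot fail
        text = data.decode('latin-1')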
    return name, text

def get_input_lines():
    name, text = _read_uploaded_text(uploader)
    if text and text.strip():
        return (name or 'uploaded.txt'), text.splitlines()
    pasted = paste_area.value or ''
    if pasted.strip():
        return ('pasted.txt', pasted.splitlines())
    return (None, [])

# Action + output
run_button = widgets.Button(description='Clean tweets', button_style='success')
out = widgets.Output()

ui = widgets.VBox([
    widgets.HTML("<h3>Upload .txt (one tweet per line)</h3>"),
    uploader,
    widgets.HTML("<b>Or paste text</b>"),
    paste_area,
    widgets.HTML("<b>Edit stopwords (space-separated)</b>"),
    stopwords_text,
    run_button,
    out
])
display(ui)

@run_button.on_click
def _on_click(btn):
    out.clear_output()
    with out:
        fname, lines = get_input_lines()
        stops = set(w.strip() for w in stopwords_text.value.split() if w.strip())

        if not lines:
            print("No input provided. Upload a .txt file or paste text above.")
            return

        cleaned = [clean_tweet(line, stops) for line in lines]

        # Save and offer download
        out_path = 'cleaned_tweets.txt'
        with open(out_path, 'w', encoding='utf-8') as f:
            for line in cleaned:
                f.write(line + '\n')

        # Preview + download link
        preview_n = min(10, len(cleaned))
        print(f"Input: {fname} | Cleaned {len(cleaned)} lines. Preview of first {preview_n}:")
        for i in range(preview_n):
            print(cleaned[i])
        display(widgets.HTML("<hr><b>Download cleaned file:</b>"))
        display(FileLink(out_path))

VBox(children=(HTML(value='<h3>Upload .txt (one tweet per line)</h3>'), FileUpload(value={}, accept='.txt', de…
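
The upload helper above tries UTF-8 first and falls back to Latin-1 if the file cannot be decoded. The sketch below isolates that idea so you can see how it behaves; the function name `decode_text` is a hypothetical helper introduced here for illustration, not part of the notebook.

```python
def decode_text(data: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every possible byte,
        # so this branch always returns a string.
        return data.decode("latin-1")

# The same word saved in two different encodings.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")

print(decode_text(utf8_bytes))
print(decode_text(latin1_bytes))
```

Because the Latin-1 fallback never raises an error, a file saved in some other encoding would still load but might show wrong characters. If the preview looks garbled, re-saving the file as UTF-8 before uploading is probably the simplest fix.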

Once your data download is complete, you can close the browser or move on to another script.
