# Emotional and Linguistic Framing of Digital Detox

#### Notebook 2: Preprocessing and Cleaning

This notebook performs text cleaning and tokenisation on Reddit posts collected in Notebook 1, preparing the data for linguistic and emotional analysis.

The process includes:

- Removing noise (posts of 50 characters or fewer, non-English posts, and posts containing links)
- Tokenising text into individual words
- Filtering out stopwords, punctuation, and other non-alphabetic tokens
- Balancing the dataset by sampling an equal number of detox and control posts

The cleaned and balanced datasets are saved for use in later analysis notebooks.

In [1]:
!pip install nltk pandas langdetect

# imports
import pandas as pd
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
from langdetect import detect
nltk.download("punkt")
nltk.download("stopwords")



[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# load dataset 
combined_df = pd.read_csv('/home/jovyan/XXX/Back up/XXX/combined_reddit_digital_detox_study_dataframe.csv')

In [3]:
# assign id identifier to help with merging both datasets
combined_df['post_id'] = combined_df.index

In [4]:
# https://stackoverflow.com/questions/43916600/text-language-detection-in-python

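# make langdetect deterministic so reruns filter the same posts
from langdetect import DetectorFactory
DetectorFactory.seed = 0

# detect language; return 'unknown' for empty or undetectable text
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return 'unknown'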

# preprocessing function before tokenisation to clean data 
def clean_text_data(df):
    # keep only posts with more than 50 characters
    df = df[df['body'].str.len() > 50]

    # keep only English posts
    df = df[df['body'].apply(detect_language) == 'en']

    # remove posts containing links (spam, bots)
    df = df[~df['body'].str.contains('http|www', case=False)]

    return df
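
To illustrate the length and link filters in `clean_text_data`, here is a minimal sketch on a toy DataFrame. The rows are invented for illustration, and the language-detection step is left out so the example runs without `langdetect`. The `na=False` argument is an addition not used in the cell above; it makes the link check treat missing bodies as containing no link.

```python
import pandas as pd

# toy posts, invented for illustration only
toy_df = pd.DataFrame({
    "body": [
        "too short",
        "I deleted every social app last month and my sleep has improved a lot.",
        "Check out my detox guide at https://example.com for tips and tricks today!",
        None,
    ]
})

# keep only posts with more than 50 characters (missing bodies compare as False)
long_enough = toy_df["body"].str.len() > 50

# flag posts containing links; na=False treats missing bodies as "no link"
has_link = toy_df["body"].str.contains("http|www", case=False, na=False)

filtered = toy_df[long_enough & ~has_link]
print(filtered["body"].tolist())
```

Only the second post survives: the first is too short, the third contains a link, and the fourth has no body. Because the pattern is an unanchored substring match, any post containing `http` or `www` anywhere in its text is removed.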

In [5]:
# code adapted from notebook - data classification

# initialise stopwords and punctuation sets
stops = set(stopwords.words('english'))
punct = set(string.punctuation)

# tokenisation function without lemmatisation
def good_tokens(text):
    # handle missing values
    if pd.isna(text):  
        return []
    
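    # split into tokens, then lowercase and keep purely alphabetic tokens
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    # drop stopwords (the punctuation check is a safeguard; isalpha() already excludes it)
    tokens = [word for word in words if word not in stops and word not in punct]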
    return tokens
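
As a rough illustration of what `good_tokens` returns, the sketch below mimics its logic using only the standard library. It is an approximation under stated assumptions: `toy_good_tokens` and `toy_stops` are hypothetical names, the regex tokeniser stands in for NLTK's `word_tokenize`, and the stopword set is a tiny stand-in for NLTK's English list.

```python
import re

# tiny stand-in stopword list; the notebook uses NLTK's full English list
toy_stops = {"i", "my", "the", "a", "and", "for", "is", "to", "of"}

def toy_good_tokens(text):
    """Approximate good_tokens with a regex tokeniser (an assumption, not NLTK)."""
    # handle missing values
    if text is None:
        return []
    # pull out runs of letters, lowercase them, then drop stopwords
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    return [w for w in words if w not in toy_stops]

print(toy_good_tokens("I quit Instagram for 30 days, and my focus is back!"))
# ['quit', 'instagram', 'days', 'focus', 'back']
```

Numbers and punctuation disappear, and the remaining words are lowercased. Contractions are one place where this sketch is likely to differ from the NLTK-based function, since the two tokenisers split them differently.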

In [6]:
# clean 
combined_df = clean_text_data(combined_df)

# split into groups (copy so the new column is not set on a slice of combined_df)
detox_df = combined_df[combined_df['group'] == 'detox'].copy()
control_df = combined_df[combined_df['group'] == 'control'].copy()

# tokenise cleaned data
detox_df['body_tokens'] = detox_df['body'].apply(good_tokens)
control_df['body_tokens'] = control_df['body'].apply(good_tokens)


In [11]:
# sample size
sample_size = 3500

# sample 3500 posts from each
balanced_detox = detox_df.sample(n=sample_size, random_state=42)
balanced_control = control_df.sample(n=sample_size, random_state=42)
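
`DataFrame.sample` raises a `ValueError` when `n` exceeds the number of rows (unless `replace=True`), so the fixed sample size only works while both groups keep at least 3500 posts after cleaning. A minimal sketch of one way to guard against that follows; `balanced_sample` is a hypothetical helper, not part of this notebook, and the toy frames are invented for illustration.

```python
import pandas as pd

def balanced_sample(df_a, df_b, requested, seed=42):
    """Sample the same number of rows from both frames without exceeding either."""
    # cap the request at the size of the smaller group
    n = min(requested, len(df_a), len(df_b))
    return df_a.sample(n=n, random_state=seed), df_b.sample(n=n, random_state=seed)

# toy groups, invented for illustration
toy_detox = pd.DataFrame({"body": [f"detox post {i}" for i in range(5)]})
toy_control = pd.DataFrame({"body": [f"control post {i}" for i in range(3)]})

sample_a, sample_b = balanced_sample(toy_detox, toy_control, requested=4)
print(len(sample_a), len(sample_b))  # 3 3
```

Capping at the smaller group keeps the two samples the same size, at the cost of a smaller dataset than requested.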

In [12]:
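# save to CSV (post_id already holds the original index, so the index is not written again)
balanced_detox.to_csv("/home/jovyan/XXX/Back up/XXX/balanced_detox_posts.csv", index=False)
balanced_control.to_csv("/home/jovyan/XXX/Back up/XXX/balanced_control_posts.csv", index=False)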
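
One point to keep in mind when reading these files back: CSV stores the `body_tokens` lists as their string representation, so a notebook that loads them gets strings rather than lists. The sketch below shows one way to restore the lists with `ast.literal_eval`, using an in-memory buffer and a toy frame invented for illustration.

```python
import ast
import io

import pandas as pd

# toy frame with a list-valued column, invented for illustration
toy = pd.DataFrame({"post_id": [0, 1], "body_tokens": [["quit", "instagram"], []]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# after reloading, each cell is a string such as "['quit', 'instagram']"
reloaded = pd.read_csv(buffer)

# parse each string back into a Python list
reloaded["body_tokens"] = reloaded["body_tokens"].apply(ast.literal_eval)
print(reloaded["body_tokens"].tolist())
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this conversion.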