# Phase 2: Data Preprocessing & Quality Assurance
**Project:** Multiplex Dynamics of Polarization
**Objective:** Prepare raw Twitter data for Master-level Network Analysis and Topic Modeling (LDA).

## The Pipeline
To ensure validity, we apply a strict 3-layer filter (the pipeline code additionally runs a language filter first, keeping English tweets only):
1.  **Noise Filter:** Removes duplicates and short texts (< 4 words) to prevent "Garbage In, Garbage Out".
2.  **Bot Filter:** Removes hyper-active users (the top 0.5% of the per-user tweet-count distribution) so they do not skew network centrality (RQ1).
3.  **NLP Normalization:** Uses **spaCy** for lemmatization (reducing vocabulary dimensionality) and part-of-speech filtering (RQ2).
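
The first two layers can be sketched in plain pandas. This is only an illustration under stated assumptions, not the code in `preprocessing.py`: the function names `filter_noise_sketch` and `remove_hyperactive_sketch` are hypothetical, duplicates are assumed to mean identical tweet text, and the bot threshold is assumed to be the 99.5th percentile of per-user tweet counts.

```python
import pandas as pd


def filter_noise_sketch(df, text_col="tweet", min_words=4):
    """Drop duplicate texts, then tweets with fewer than `min_words` words.

    Hypothetical helper: assumes "duplicate" means identical tweet text.
    """
    deduped = df.drop_duplicates(subset=text_col)
    word_counts = deduped[text_col].fillna("").str.split().str.len()
    return deduped[word_counts >= min_words].copy()


def remove_hyperactive_sketch(df, user_col="user_id", top_fraction=0.005):
    """Drop all tweets from users above the (1 - top_fraction) quantile of activity.

    Hypothetical helper: assumes the threshold is a quantile of tweets per user.
    Returns the filtered frame and the threshold that was applied.
    """
    counts = df[user_col].value_counts()            # tweets per user
    threshold = counts.quantile(1 - top_fraction)   # e.g. 99.5th percentile
    flagged = counts[counts > threshold].index      # hyper-active accounts
    return df[~df[user_col].isin(flagged)].copy(), threshold


demo = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "tweet": [
        "vote now",
        "this is a longer tweet",
        "this is a longer tweet",
        "another tweet with enough words",
    ],
})
print(filter_noise_sketch(demo))
```

Returning the threshold alongside the filtered frame makes it easy to log the cut-off actually applied, since a fixed percentile maps to a different tweet count in each dataset.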

In [13]:
# [Cell 2] Imports
import sys
import os
import pandas as pd
import importlib 

# Auto-install dependencies
try:
    import fasttext
except ImportError:
    !pip install fasttext-wheel

sys.path.append(os.path.abspath('../src'))
import eda
import preprocessing as prep

# FORCE RELOAD to catch your new changes
importlib.reload(prep)
print("✅ Preprocessing module reloaded.")

# Config
TRUMP_PATH = '../data/raw/hashtag_donaldtrump.csv'
BIDEN_PATH = '../data/raw/hashtag_joebiden.csv'

✅ spaCy model 'en_core_web_sm' loaded.
✅ Preprocessing module reloaded.


Load Raw Data

In [14]:
# [Cell 3] Load Data
LIMIT = None  # None loads the full dataset; set a number for a quick test run
print(f"--- Loading Raw Data (Limit={LIMIT}) ---")
df_trump_raw = eda.load_data(TRUMP_PATH, limit=LIMIT)
df_biden_raw = eda.load_data(BIDEN_PATH, limit=LIMIT)



--- Loading Raw Data (Limit=None) ---
📂 Loading data from: ../data/raw/hashtag_donaldtrump.csv...
✅ Loaded 970,919 tweets.
📂 Loading data from: ../data/raw/hashtag_joebiden.csv...
✅ Loaded 776,886 tweets.


Pipeline

In [15]:
def run_forked_pipeline(df, candidate_name):
    """Run language, noise, and bot filtering, then build the LDA and BERT text columns."""
    if df is None:
        return None

    prefix = candidate_name.lower()
    print(f"\n🚀 Processing {candidate_name} Network...")

    # 1. Language filter (saves the removed foreign-language tweets to CSV)
    if hasattr(prep, 'filter_language'):
        df = prep.filter_language(df, save_prefix=prefix)
    else:
        print("❌ Error: filter_language not found! Continuing without language filtering.")

    # 2. Noise filter
    df = prep.filter_noise(df)

    # 3. Bot filter (saves the removed tweets to CSV, e.g. 'trump_bots_removed.csv')
    df = prep.remove_bots(df, save_prefix=prefix)

    # 4. LDA & BERT prep; copy first so the new columns are not written to a slice
    df = df.copy()
    print("   🔹 Generating LDA Text...")
    df['lda_text'] = prep.spacy_clean(df['tweet'].tolist())

    print("   🔹 Generating BERT Text...")
    df['bert_text'] = prep.bert_clean(df['tweet'].tolist())

    # Drop rows whose text became empty after cleaning
    df = df[(df['lda_text'] != "") & (df['bert_text'] != "")]

    print(f"✅ {candidate_name} Done. Count: {len(df):,}")
    return df

Execute Pipeline (Step-by-Step)

In [16]:
# [Cell 5] Execute
df_trump_final = run_forked_pipeline(df_trump_raw, "TRUMP")
df_biden_final = run_forked_pipeline(df_biden_raw, "BIDEN")


🚀 Processing TRUMP Network...
✅ FastText model loaded.
   🌍 [Language Filter] Checking 970,919 tweets...
      -> 💾 Saved 237,821 foreign tweets to: ../data/processed/trump_foreign_removed.csv
      -> Retained 733,098 English tweets.
   🧹 [Noise Filter] Starting with 733,098 tweets...
      -> Retained 701,078 high-quality tweets.
   🤖 [Bot Filter] Identifying top 0.5% active users...
      -> Found 1,050 bot accounts (Threshold > 64 tweets).
      -> Removed 145,398 tweets total.
      -> 💾 Saved bot tweets to: ../data/processed/trump_bots_removed.csv
   🔹 Generating LDA Text...
   🧠 [LDA Prep] Heavy cleaning 555,680 tweets...
   🔹 Generating BERT Text...
   🤖 [BERT Prep] Light cleaning 555,680 tweets...
✅ TRUMP Done. Count: 536,906

🚀 Processing BIDEN Network...
   🌍 [Language Filter] Checking 776,886 tweets...
      -> 💾 Saved 177,782 foreign tweets to: ../data/processed/biden_foreign_removed.csv
      -> Retained 599,104 English tweets.
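
To see where the TRUMP data is lost, the logged counts can be turned into per-stage retention rates. The sketch below only re-uses the numbers printed in the log above; the helper name `retention_table` and the stage labels are illustrative, and the last stage is assumed to be the drop of rows whose cleaned text came out empty.

```python
# Stage counts copied from the TRUMP log output above.
stages = [
    ("raw", 970_919),
    ("after language filter", 733_098),
    ("after noise filter", 701_078),
    ("after bot filter", 555_680),
    ("after empty-text drop", 536_906),
]


def retention_table(stages):
    """Return (name, count, share of previous stage, share of raw) per stage."""
    rows = []
    first = stages[0][1]
    prev = first
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


for name, count, step, overall in retention_table(stages):
    print(f"{name:<24} {count:>9,}  step: {step:6.1%}  overall: {overall:6.1%}")
```

The bot filter is the costliest step after language filtering: it removes 145,398 of 701,078 tweets, roughly one fifth, while flagging only 1,050 accounts.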


Save Processed Data

In [17]:
# [Cell 6] Save Outputs
os.makedirs('../data/processed', exist_ok=True)

# 1. Save LDA Versions
df_trump_final[['tweet', 'lda_text', 'user_id', 'created_at']].to_csv('../data/processed/trump_lda_ready.csv', index=False)
df_biden_final[['tweet', 'lda_text', 'user_id', 'created_at']].to_csv('../data/processed/biden_lda_ready.csv', index=False)

# 2. Save BERT Versions
df_trump_final[['tweet', 'bert_text', 'user_id', 'created_at']].to_csv('../data/processed/trump_bert_ready.csv', index=False)
df_biden_final[['tweet', 'bert_text', 'user_id', 'created_at']].to_csv('../data/processed/biden_bert_ready.csv', index=False)

print("\n💾 All files saved successfully:")
print("   1. *_foreign_removed.csv (Garbage)")
print("   2. *_lda_ready.csv (For Topic Modeling)")
print("   3. *_bert_ready.csv (For Sentiment/Deep Learning)")


💾 All files saved successfully:
   1. *_foreign_removed.csv (Garbage)
   2. *_lda_ready.csv (For Topic Modeling)
   3. *_bert_ready.csv (For Sentiment/Deep Learning)


Bot Audit Check
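
The audit reports a posting speed per account. A plausible way to compute it is tweet count divided by the span between an account's first and last tweet; the sketch below assumes that definition and is not the actual `prep.analyze_bot_file` implementation. The function name `posting_rate_sketch` and the one-day minimum span (to avoid dividing by zero for single-tweet accounts) are assumptions.

```python
import pandas as pd


def posting_rate_sketch(df, user_col="user_id", time_col="created_at"):
    """Tweets per day for each account, most active accounts first.

    Hypothetical helper: the span is clipped to at least one day.
    """
    times = pd.to_datetime(df[time_col])
    grouped = times.groupby(df[user_col])
    summary = pd.DataFrame({
        "tweets": grouped.size(),
        "span_days": (grouped.max() - grouped.min()).dt.total_seconds() / 86_400,
    })
    summary["tweets_per_day"] = summary["tweets"] / summary["span_days"].clip(lower=1.0)
    return summary.sort_values("tweets", ascending=False)


demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "created_at": [
        "2020-10-15 00:00:00",
        "2020-10-16 00:00:00",
        "2020-10-17 00:00:00",
        "2020-10-15 12:00:00",
    ],
})
print(posting_rate_sketch(demo))
```

Under this definition, 1,338 tweets over 23.3 days gives about 57.4 tweets per day, close to the 57.5 shown in the audit output; the small gap is presumably due to the span being rounded in the log.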

In [18]:


print("--- 🕵️‍♂️ FINAL AUDIT OF REMOVED DATA ---")

# Audit Trump Bots
prep.analyze_bot_file('../data/processed/trump_bots_removed.csv')

# Audit Biden Bots
prep.analyze_bot_file('../data/processed/biden_bots_removed.csv')

# Audit Trump Foreign (Optional Check)
# prep.analyze_bot_file('../data/processed/trump_foreign_removed.csv')

--- 🕵️‍♂️ FINAL AUDIT OF REMOVED DATA ---

🕵️‍♂️ BOT FILE AUDIT: trump_bots_removed.csv
   • Total Suspicious Tweets: 145,398
   • Total Suspicious Accounts: 1,050

   🚩 TOP OFFENDER (User 74268619.0):
      - Posted 1338 times in 23.3 days.
      - Speed: 57.5 tweets/day

   🤖 Top Sources in this file:
source
Twitter Web App        62739
Twitter for iPhone     30228
Twitter for Android    28092
Twitter for iPad        7751
TweetDeck               2944

   📝 Sample Content:
['LE FIGARO \nhttps://t.co/icqQ2j2b3u #r2p #LeFigaro #France #Macron #USA #Covid #Coronavirus #Biden #Trump #DupondMoretti #PS #LREM #Insoumis #EELV #LR #Zuckerberg #Castex #Facebook #Merkel #Beyrouth #Loukachenko #Terrorisme #BorisJohnson #Attentat #Erdogan #Nadal #RG #CouvreFeu https://t.co/uDO2Z5Tnb8'
 'LIBERATION \nhttps://t.co/n75plp13S6 #r2p #Liberation #UE #Europe #USA #Macron #Biden #Trump #LR #LREM #EELV #PS #CAC40 #FMI #Insoumis #Covid #Poutine #Coronavirus #Vaccin #Mbapp