# Milestone #1: Data Preparation + Exploratory Data Analysis

## Dataset Access

Following the instructions from the [r/Fakeddit paper](https://arxiv.org/pdf/1911.03854), we obtained the dataset from the official [Fakeddit GitHub repository](https://github.com/entitize/fakeddit).  
The repository provides a link to the dataset’s Google Drive folder:  
<https://drive.google.com/drive/folders/1jU7qgDqU1je9Y0PMKJ_f31yXRo5uWGFm?usp=sharing>

Since our project focuses on **multimodal analysis**, we use only the multimodal samples, which contain **both text and images**.


## Script Instructions

To run this script, please download the following data files from the Google Drive link provided above:

- `multimodal_test_public.tsv`  
- `multimodal_train.tsv`  
- `multimodal_validate.tsv`  

Then, organize your local directory as follows:

```text
data/
├── multimodal_test_public.tsv
├── multimodal_train.tsv
└── multimodal_validate.tsv
```

## Environment Setup

In [1]:
import pandas as pd
import numpy as np
import torch
import requests
from PIL import Image
from io import BytesIO
import os

# Define data directory
DATA_DIR = "data"

# Define file paths
TRAIN_DATA_FILE = os.path.join(DATA_DIR, "multimodal_train.tsv")
VALIDATION_DATA_FILE = os.path.join(DATA_DIR, "multimodal_validate.tsv")
TEST_DATA_FILE = os.path.join(DATA_DIR, "multimodal_test_public.tsv")
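
Before loading, it can help to confirm that the three files are where the paths above expect them. The following is a minimal sketch, not part of the original notebook: `find_missing_files` and `EXPECTED_FILES` are hypothetical names introduced here, and the check assumes the `data/` layout described earlier.

```python
from pathlib import Path

# Filenames taken from the download instructions; the tuple name is illustrative.
EXPECTED_FILES = (
    "multimodal_train.tsv",
    "multimodal_validate.tsv",
    "multimodal_test_public.tsv",
)


def find_missing_files(data_dir, filenames=EXPECTED_FILES):
    """Return the expected filenames that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in filenames if not (root / name).is_file()]


missing = find_missing_files("data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found.")
```

Running a check like this first gives a clearer message than the `FileNotFoundError` that `pd.read_csv` would raise on the first missing file.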

## Load Data

In [2]:
TRAIN_DATA = pd.read_csv(TRAIN_DATA_FILE, sep="\t")
VALIDATION_DATA = pd.read_csv(VALIDATION_DATA_FILE, sep="\t")
TEST_DATA = pd.read_csv(TEST_DATA_FILE, sep="\t")

In [3]:
TRAIN_DATA.head()

| | author | clean_title | created_utc | domain | hasImage | id | image_url | linked_submission_id | num_comments | score | subreddit | title | upvote_ratio | 2_way_label | 3_way_label | 6_way_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alexithymia | my walgreens offbrand mucinex was engraved wit... | 1551641000.0 | i.imgur.com | True | awxhir | https://external-preview.redd.it/WylDbZrnbvZdB... | | 2.0 | 12 | mildlyinteresting | My Walgreens offbrand Mucinex was engraved wit... | 0.84 | 1 | 0 | 0 |
| 1 | VIDCAs17 | this concerned sink with a tiny hat | 1534727000.0 | i.redd.it | True | 98pbid | https://preview.redd.it/wsfx0gp0f5h11.jpg?widt... | | 2.0 | 119 | pareidolia | This concerned sink with a tiny hat | 0.99 | 0 | 2 | 2 |
| 2 | prometheus1123 | hackers leak emails from uae ambassador to us | 1496511000.0 | aljazeera.com | True | 6f2cy5 | https://external-preview.redd.it/6fNhdbc6K1vFA... | | 1.0 | 44 | neutralnews | Hackers leak emails from UAE ambassador to US | 0.92 | 1 | 0 | 0 |
| 3 | | puppy taking in the view | 1471341000.0 | i.imgur.com | True | 4xypkv | https://external-preview.redd.it/HLtVNhTR6wtYt... | | 26.0 | 250 | photoshopbattles | PsBattle: Puppy taking in the view | 0.95 | 1 | 0 | 0 |
| 4 | 3rikR3ith | i found a face in my sheet music too | 1525318000.0 | i.redd.it | True | 8gnet9 | https://preview.redd.it/ri7ut2wn8kv01.jpg?widt... | | 2.0 | 13 | pareidolia | I found a face in my sheet music too! | 0.84 | 0 | 2 | 2 |


In [4]:
TRAIN_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564000 entries, 0 to 563999
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   author                535290 non-null  object 
 1   clean_title           564000 non-null  object 
 2   created_utc           564000 non-null  float64
 3   domain                396143 non-null  object 
 4   hasImage              564000 non-null  bool   
 5   id                    564000 non-null  object 
 6   image_url             562466 non-null  object 
 7   linked_submission_id  167857 non-null  object 
 8   num_comments          396143 non-null  float64
 9   score                 564000 non-null  int64  
 10  subreddit             564000 non-null  object 
 11  title                 564000 non-null  object 
 12  upvote_ratio          396143 non-null  float64
 13  2_way_label           564000 non-null  int64  
 14  3_way_label           564000 non-null  int64  
 15  

In [5]:
VALIDATION_DATA.head()

| | author | clean_title | created_utc | domain | hasImage | id | image_url | linked_submission_id | num_comments | score | subreddit | title | upvote_ratio | 2_way_label | 3_way_label | 6_way_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | singingdart7854 | my xbox controller says hi | 1567436000.0 | i.redd.it | True | cypw96 | https://preview.redd.it/l0ga0tug17k31.jpg?widt... | | 4.0 | 25 | mildlyinteresting | My Xbox controller says hi | 0.72 | 1 | 0 | 0 |
| 1 | mandal0re | new image from the mandalorian | 1567745000.0 | i.imgur.com | True | d0bzlq | https://external-preview.redd.it/VX7bXDu9Gl8UZ... | | 5.0 | 21 | photoshopbattles | PsBattle: New image from The Mandalorian | 0.92 | 1 | 0 | 0 |
| 2 | HE_WHO_DRUELS | say hello to my little friend | 1461468000.0 | | True | d2ezoob | http://i.imgur.com/F1Zbl3D.jpg | 4g6bp9 | | 10 | psbattle_artwork | Say hello to my little friend! | | 0 | 2 | 4 |
| 3 | eNaRDe | watch your step little one | 1408047000.0 | | True | cjqctpw | http://i.imgur.com/KRyMjn1.jpg | 2diyh3 | | 1 | psbattle_artwork | Watch your step little one | | 0 | 2 | 4 |
| 4 | Thebubster2001 | this tree i found with a solo cup on it | 1558186000.0 | i.redd.it | True | bq3yuk | https://preview.redd.it/bxp58zf01zy21.jpg?widt... | | 8.0 | 6 | mildlyinteresting | This tree I found with a solo cup on it | 0.62 | 1 | 0 | 0 |


In [6]:
VALIDATION_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59342 entries, 0 to 59341
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   author                56279 non-null  object 
 1   clean_title           59342 non-null  object 
 2   created_utc           59342 non-null  float64
 3   domain                41532 non-null  object 
 4   hasImage              59342 non-null  bool   
 5   id                    59342 non-null  object 
 6   image_url             59169 non-null  object 
 7   linked_submission_id  17810 non-null  object 
 8   num_comments          41532 non-null  float64
 9   score                 59342 non-null  int64  
 10  subreddit             59342 non-null  object 
 11  title                 59342 non-null  object 
 12  upvote_ratio          41532 non-null  float64
 13  2_way_label           59342 non-null  int64  
 14  3_way_label           59342 non-null  int64  
 15  6_way_label        

In [7]:
TEST_DATA.head()

| | author | clean_title | created_utc | domain | hasImage | id | image_url | linked_submission_id | num_comments | score | subreddit | title | upvote_ratio | 2_way_label | 3_way_label | 6_way_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | trustbytrust | stargazer | 1425139000.0 | | True | cozywbv | http://i.imgur.com/BruWKDi.jpg | 2xct9d | | 3 | psbattle_artwork | stargazer | | 0 | 2 | 4 |
| 1 | | yeah | 1438173000.0 | | True | ctk61yw | http://i.imgur.com/JRZT727.jpg | 3f0h7o | | 2 | psbattle_artwork | yeah | | 0 | 2 | 4 |
| 2 | chaseoes | pd phoenix car thief gets instructions from yo... | 1560492000.0 | abc15.com | True | c0gl7r | https://external-preview.redd.it/1A2_4VwgS8Qd2... | | 2.0 | 16 | nottheonion | PD: Phoenix car thief gets instructions from Y... | 0.89 | 1 | 0 | 0 |
| 3 | SFepicure | as trump accuses iran he has one problem his o... | 1560606000.0 | nytimes.com | True | c0xdqy | https://external-preview.redd.it/9BKRcgvaobpTo... | | 4.0 | 45 | neutralnews | As Trump Accuses Iran, He Has One Problem: His... | 0.78 | 1 | 0 | 0 |
| 4 | fragments_from_Work | believers hezbollah | 1515139000.0 | i.imgur.com | True | 7o9rmx | https://external-preview.redd.it/rbwXHncnjVh51... | | 40.0 | 285 | propagandaposters | "Believers" - Hezbollah 2011 | 0.95 | 0 | 1 | 5 |


In [8]:
TEST_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59319 entries, 0 to 59318
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   author                56251 non-null  object 
 1   clean_title           59319 non-null  object 
 2   created_utc           59319 non-null  float64
 3   domain                41847 non-null  object 
 4   hasImage              59319 non-null  bool   
 5   id                    59319 non-null  object 
 6   image_url             59163 non-null  object 
 7   linked_submission_id  17472 non-null  object 
 8   num_comments          41847 non-null  float64
 9   score                 59319 non-null  int64  
 10  subreddit             59319 non-null  object 
 11  title                 59319 non-null  object 
 12  upvote_ratio          41847 non-null  float64
 13  2_way_label           59319 non-null  int64  
 14  3_way_label           59319 non-null  int64  
 15  6_way_label        

## Feature Selection and Rationale

Our dataset contains several columns from Reddit posts. Below is a summary of why we are keeping or dropping certain features for our analysis and modeling:

| Column Name             | Action      | Reason |
|-------------------------|------------|--------|
| `author`                | Drop       | Not relevant for fake news analysis. The author's identity does not provide information about the content or veracity of the post, and Reddit usernames can be arbitrary.|
| `clean_title`           | Keep       | Already cleaned for us in the r/Fakeddit paper. Represents the text content of the post, essential for NLP analysis. |
| `created_utc`           | Keep       | Useful for downstream temporal analysis, e.g., examining when fake news spikes over time. |
| `domain`                | Keep       | Can help explore if posts from certain domains are more or less likely to be fake news. |
| `hasImage`              | Keep (drop after sanity checks) | Indicates whether the post contains an image. Used to filter samples in the Sanity Checks section, then dropped. |
| `id`                    | Drop       | Unique identifier, not informative for modeling. |
| `image_url`             | Keep       | Necessary to access image data for multimodal modeling. |
| `linked_submission_id`  | Drop       | Missing for about 70% of training rows (only 167,857 of 564,000 are non-null) and not relevant for our analysis. |
| `num_comments`          | Keep       | Can provide insights into engagement and post virality. |
| `score`                 | Keep       | Represents post popularity; potentially correlates with the spread of fake news. |
| `subreddit`             | Keep       | Useful for understanding community context and post categorization. |
| `title`                 | Drop       | Original title is redundant with `clean_title`. |
| `upvote_ratio`          | Keep       | Indicates community approval; may provide signals for fake vs real news. |
| `2_way_label`, `3_way_label`, `6_way_label` | Keep | These are the target labels used for classification tasks. |

In summary, we drop columns that are either identifiers (`id`, `linked_submission_id`), redundant (`title`), or not informative for fake news detection (`author`). We retain features that provide textual, temporal, engagement, or community context, as well as the labels needed for our deep learning tasks.
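
The missing-value reasoning in the table can be checked directly. The sketch below is an illustration on a made-up toy frame, not output from the real data; `missing_fraction` is a hypothetical helper name. Applied to the loaded training split, it would be expected to show `linked_submission_id` near 0.70, given the 167,857 non-null entries out of 564,000 reported by `info()`.

```python
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, sorted from most to least missing."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for a Fakeddit split (all values are made up).
toy = pd.DataFrame({
    "clean_title": ["a", "b", "c", "d"],
    "domain": ["i.redd.it", None, None, "i.imgur.com"],
    "linked_submission_id": [None, "x1", "x2", None],
    "upvote_ratio": [0.9, None, None, 0.8],
})

print(missing_fraction(toy))
```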


In [9]:
RELEVANT_COLUMNS = ['clean_title', 'created_utc', 'domain', 'hasImage', 'image_url', 'num_comments',
                    'score', 'subreddit', 'upvote_ratio', '2_way_label', '3_way_label', '6_way_label']

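# Select the relevant columns in each split.
# .copy() makes each result an independent DataFrame, which avoids
# SettingWithCopyWarning if columns are assigned later.
TRAIN_DATA = TRAIN_DATA[RELEVANT_COLUMNS].copy()
VALIDATION_DATA = VALIDATION_DATA[RELEVANT_COLUMNS].copy()
TEST_DATA = TEST_DATA[RELEVANT_COLUMNS].copy()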

## Sanity Checks

Before proceeding with our analysis, we perform the following checks to ensure data quality for our **multimodal dataset**:

1. **Remove incomplete samples**: Drop any rows where `clean_title` or `image_url` is null.  
2. **Verify image availability**: Keep only rows where `hasImage` is `True`, then remove the now-constant `hasImage` column.  
3. **Confirm labeling**: Drop any rows missing a `2_way_label`, `3_way_label`, or `6_way_label` value, so that every sample has a label for our supervised learning task.

In [10]:
def sanity_checks(data: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a split: rows with text, image, and all labels."""
    # 1. Drop rows with a null clean_title or image_url
    data = data.dropna(subset=['clean_title', 'image_url'], how='any')

    # 2. Keep only rows where hasImage is True, then drop the column
    data = data[data['hasImage']].drop(columns=['hasImage'])

    # 3. Drop rows where any label is missing
    label_columns = ['2_way_label', '3_way_label', '6_way_label']
    data = data.dropna(subset=label_columns, how='any')

    # 4. Reset the index so it is contiguous after the row drops
    return data.reset_index(drop=True)

In [11]:
# Perform Sanity Checks on ALL splits of data
TRAIN_DATA = sanity_checks(TRAIN_DATA)
VALIDATION_DATA = sanity_checks(VALIDATION_DATA)
TEST_DATA = sanity_checks(TEST_DATA)
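
The three checks can also be written as post-conditions that fail loudly if a cleaned split violates them. This is a sketch under assumptions rather than part of the original pipeline: `assert_clean` and `LABEL_COLUMNS` are hypothetical names, and the toy frame holds made-up values.

```python
import pandas as pd

LABEL_COLUMNS = ["2_way_label", "3_way_label", "6_way_label"]


def assert_clean(df: pd.DataFrame) -> None:
    """Raise AssertionError if df violates the post-conditions of the sanity checks."""
    required = ["clean_title", "image_url", *LABEL_COLUMNS]
    assert "hasImage" not in df.columns, "hasImage should have been dropped"
    assert not df[required].isna().any().any(), "required columns contain nulls"
    assert df.index.equals(pd.RangeIndex(len(df))), "index was not reset"


# Toy frame shaped like a cleaned split (values are made up).
toy_clean = pd.DataFrame({
    "clean_title": ["a sink with a hat", "puppy taking in the view"],
    "image_url": ["http://example.com/1.jpg", "http://example.com/2.jpg"],
    "2_way_label": [0, 1],
    "3_way_label": [2, 0],
    "6_way_label": [2, 0],
})

assert_clean(toy_clean)
print("toy frame passes the post-condition checks")
```

On the real data, calling `assert_clean` on each split right after `sanity_checks` would be one way to catch regressions if the cleaning steps are edited later.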

In [12]:
TRAIN_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562466 entries, 0 to 562465
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   clean_title   562466 non-null  object 
 1   created_utc   562466 non-null  float64
 2   domain        394609 non-null  object 
 3   image_url     562466 non-null  object 
 4   num_comments  394609 non-null  float64
 5   score         562466 non-null  int64  
 6   subreddit     562466 non-null  object 
 7   upvote_ratio  394609 non-null  float64
 8   2_way_label   562466 non-null  int64  
 9   3_way_label   562466 non-null  int64  
 10  6_way_label   562466 non-null  int64  
dtypes: float64(3), int64(4), object(4)
memory usage: 47.2+ MB


In [13]:
VALIDATION_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59169 entries, 0 to 59168
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   clean_title   59169 non-null  object 
 1   created_utc   59169 non-null  float64
 2   domain        41359 non-null  object 
 3   image_url     59169 non-null  object 
 4   num_comments  41359 non-null  float64
 5   score         59169 non-null  int64  
 6   subreddit     59169 non-null  object 
 7   upvote_ratio  41359 non-null  float64
 8   2_way_label   59169 non-null  int64  
 9   3_way_label   59169 non-null  int64  
 10  6_way_label   59169 non-null  int64  
dtypes: float64(3), int64(4), object(4)
memory usage: 5.0+ MB


In [14]:
TEST_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59163 entries, 0 to 59162
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   clean_title   59163 non-null  object 
 1   created_utc   59163 non-null  float64
 2   domain        41691 non-null  object 
 3   image_url     59163 non-null  object 
 4   num_comments  41691 non-null  float64
 5   score         59163 non-null  int64  
 6   subreddit     59163 non-null  object 
 7   upvote_ratio  41691 non-null  float64
 8   2_way_label   59163 non-null  int64  
 9   3_way_label   59163 non-null  int64  
 10  6_way_label   59163 non-null  int64  
dtypes: float64(3), int64(4), object(4)
memory usage: 5.0+ MB
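
The `info()` outputs before and after cleaning make it possible to work out how many rows the sanity checks removed. The short calculation below copies the row counts and the non-null `image_url` counts from those outputs; the dictionary names are illustrative only.

```python
# Row counts copied from the info() outputs before and after the sanity checks.
rows_before = {"train": 564000, "validation": 59342, "test": 59319}
rows_after = {"train": 562466, "validation": 59169, "test": 59163}

# Non-null image_url counts copied from the pre-cleaning info() outputs.
image_url_non_null = {"train": 562466, "validation": 59169, "test": 59163}

dropped = {split: rows_before[split] - rows_after[split] for split in rows_before}

for split, n_dropped in dropped.items():
    null_urls = rows_before[split] - image_url_non_null[split]
    share = n_dropped / rows_before[split]
    print(f"{split}: dropped {n_dropped} rows ({share:.2%}), "
          f"null image_url rows: {null_urls}")
```

With these counts, the dropped rows (1,534 for train, 173 for validation, 156 for test) equal the number of null `image_url` values in each split, which suggests that missing image URLs account for all of the removals.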


## Convert Data Types

We convert the following columns to appropriate data types to ensure consistency and facilitate analysis; the remaining numeric and label columns keep their original dtypes:

- `clean_title`: string  
- `created_utc`: convert from Unix epoch seconds (UTC) to `datetime`  
- `domain`: string  
- `image_url`: string  
- `subreddit`: string
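
To illustrate what these conversions do, here is a minimal sketch on a made-up two-row frame (the values are not from the dataset). It shows `pd.to_datetime` with `unit="s"` interpreting floats as epoch seconds, and the pandas `string` dtype keeping missing values as `<NA>` rather than turning them into text.

```python
import pandas as pd

# Toy frame: two epoch timestamps one day apart, and one missing domain.
toy = pd.DataFrame({
    "created_utc": [0.0, 86400.0],   # Unix epoch seconds
    "domain": ["i.redd.it", None],   # one missing value
})

converted = toy.assign(
    created_utc=pd.to_datetime(toy["created_utc"], unit="s"),
    domain=toy["domain"].astype("string"),
)

print(converted.dtypes)
print(converted)
```

Because missing values stay missing after the cast, the non-null counts for `domain` reported by `info()` should be unchanged by the conversion.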


In [15]:
def convert_data_types(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with text columns as pandas strings and created_utc as datetime."""
    data = data.copy()  # avoid mutating the caller's DataFrame
    for column in ['clean_title', 'domain', 'image_url', 'subreddit']:
        data[column] = data[column].astype('string')
    # created_utc holds Unix epoch seconds
    data['created_utc'] = pd.to_datetime(data['created_utc'], unit='s')
    return data

In [16]:
# Perform Data Conversions on ALL splits of data
TRAIN_DATA = convert_data_types(TRAIN_DATA)
VALIDATION_DATA = convert_data_types(VALIDATION_DATA)
TEST_DATA = convert_data_types(TEST_DATA)

In [17]:
TRAIN_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562466 entries, 0 to 562465
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   clean_title   562466 non-null  string        
 1   created_utc   562466 non-null  datetime64[ns]
 2   domain        394609 non-null  string        
 3   image_url     562466 non-null  string        
 4   num_comments  394609 non-null  float64       
 5   score         562466 non-null  int64         
 6   subreddit     562466 non-null  string        
 7   upvote_ratio  394609 non-null  float64       
 8   2_way_label   562466 non-null  int64         
 9   3_way_label   562466 non-null  int64         
 10  6_way_label   562466 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(4), string(4)
memory usage: 47.2 MB


In [18]:
TRAIN_DATA.head()

| | clean_title | created_utc | domain | image_url | num_comments | score | subreddit | upvote_ratio | 2_way_label | 3_way_label | 6_way_label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | my walgreens offbrand mucinex was engraved wit... | 2019-03-03 19:27:24 | i.imgur.com | https://external-preview.redd.it/WylDbZrnbvZdB... | 2.0 | 12 | mildlyinteresting | 0.84 | 1 | 0 | 0 |
| 1 | this concerned sink with a tiny hat | 2018-08-20 01:10:13 | i.redd.it | https://preview.redd.it/wsfx0gp0f5h11.jpg?widt... | 2.0 | 119 | pareidolia | 0.99 | 0 | 2 | 2 |
| 2 | hackers leak emails from uae ambassador to us | 2017-06-03 17:26:38 | aljazeera.com | https://external-preview.redd.it/6fNhdbc6K1vFA... | 1.0 | 44 | neutralnews | 0.92 | 1 | 0 | 0 |
| 3 | puppy taking in the view | 2016-08-16 09:51:30 | i.imgur.com | https://external-preview.redd.it/HLtVNhTR6wtYt... | 26.0 | 250 | photoshopbattles | 0.95 | 1 | 0 | 0 |
| 4 | i found a face in my sheet music too | 2018-05-03 03:30:18 | i.redd.it | https://preview.redd.it/ri7ut2wn8kv01.jpg?widt... | 2.0 | 13 | pareidolia | 0.84 | 0 | 2 | 2 |


In [19]:
VALIDATION_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59169 entries, 0 to 59168
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   clean_title   59169 non-null  string        
 1   created_utc   59169 non-null  datetime64[ns]
 2   domain        41359 non-null  string        
 3   image_url     59169 non-null  string        
 4   num_comments  41359 non-null  float64       
 5   score         59169 non-null  int64         
 6   subreddit     59169 non-null  string        
 7   upvote_ratio  41359 non-null  float64       
 8   2_way_label   59169 non-null  int64         
 9   3_way_label   59169 non-null  int64         
 10  6_way_label   59169 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(4), string(4)
memory usage: 5.0 MB


In [20]:
VALIDATION_DATA.head()

| | clean_title | created_utc | domain | image_url | num_comments | score | subreddit | upvote_ratio | 2_way_label | 3_way_label | 6_way_label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | my xbox controller says hi | 2019-09-02 14:47:48 | i.redd.it | https://preview.redd.it/l0ga0tug17k31.jpg?widt... | 4.0 | 25 | mildlyinteresting | 0.72 | 1 | 0 | 0 |
| 1 | new image from the mandalorian | 2019-09-06 04:43:01 | i.imgur.com | https://external-preview.redd.it/VX7bXDu9Gl8UZ... | 5.0 | 21 | photoshopbattles | 0.92 | 1 | 0 | 0 |
| 2 | say hello to my little friend | 2016-04-24 03:21:05 | | http://i.imgur.com/F1Zbl3D.jpg | | 10 | psbattle_artwork | | 0 | 2 | 4 |
| 3 | watch your step little one | 2014-08-14 20:11:37 | | http://i.imgur.com/KRyMjn1.jpg | | 1 | psbattle_artwork | | 0 | 2 | 4 |
| 4 | this tree i found with a solo cup on it | 2019-05-18 13:24:40 | i.redd.it | https://preview.redd.it/bxp58zf01zy21.jpg?widt... | 8.0 | 6 | mildlyinteresting | 0.62 | 1 | 0 | 0 |


In [21]:
TEST_DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59163 entries, 0 to 59162
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   clean_title   59163 non-null  string        
 1   created_utc   59163 non-null  datetime64[ns]
 2   domain        41691 non-null  string        
 3   image_url     59163 non-null  string        
 4   num_comments  41691 non-null  float64       
 5   score         59163 non-null  int64         
 6   subreddit     59163 non-null  string        
 7   upvote_ratio  41691 non-null  float64       
 8   2_way_label   59163 non-null  int64         
 9   3_way_label   59163 non-null  int64         
 10  6_way_label   59163 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(4), string(4)
memory usage: 5.0 MB


In [22]:
TEST_DATA.head()

| | clean_title | created_utc | domain | image_url | num_comments | score | subreddit | upvote_ratio | 2_way_label | 3_way_label | 6_way_label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | stargazer | 2015-02-28 15:51:00 | | http://i.imgur.com/BruWKDi.jpg | | 3 | psbattle_artwork | | 0 | 2 | 4 |
| 1 | yeah | 2015-07-29 12:29:55 | | http://i.imgur.com/JRZT727.jpg | | 2 | psbattle_artwork | | 0 | 2 | 4 |
| 2 | pd phoenix car thief gets instructions from yo... | 2019-06-14 05:58:56 | abc15.com | https://external-preview.redd.it/1A2_4VwgS8Qd2... | 2.0 | 16 | nottheonion | 0.89 | 1 | 0 | 0 |
| 3 | as trump accuses iran he has one problem his o... | 2019-06-15 13:38:48 | nytimes.com | https://external-preview.redd.it/9BKRcgvaobpTo... | 4.0 | 45 | neutralnews | 0.78 | 1 | 0 | 0 |
| 4 | believers hezbollah | 2018-01-05 07:53:31 | i.imgur.com | https://external-preview.redd.it/rbwXHncnjVh51... | 40.0 | 285 | propagandaposters | 0.95 | 0 | 1 | 5 |
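
With the splits cleaned and typed, a natural next step for the exploratory analysis is to compare label balance across splits. The following is a rough sketch on made-up toy frames, not a result from the dataset; `label_shares` is a hypothetical helper name, and the real splits would be passed in place of `toy_splits`.

```python
import pandas as pd


def label_shares(splits: dict, label: str = "2_way_label") -> pd.DataFrame:
    """Share of each label value per split (rows: label value, columns: split)."""
    shares = {
        name: df[label].value_counts(normalize=True)
        for name, df in splits.items()
    }
    # Label values absent from a split get a share of 0.0.
    return pd.DataFrame(shares).fillna(0.0).sort_index()


# Toy splits standing in for the cleaned DataFrames (values are made up).
toy_splits = {
    "train": pd.DataFrame({"2_way_label": [1, 1, 0, 0]}),
    "validation": pd.DataFrame({"2_way_label": [1, 0, 0, 0]}),
}

print(label_shares(toy_splits))
```

For the real data, a call along the lines of `label_shares({"train": TRAIN_DATA, "validation": VALIDATION_DATA, "test": TEST_DATA}, label="6_way_label")` would show whether the fine-grained classes are distributed similarly in each split.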
