# 🧠 Digital Forensics Analysis from Social Media Data
**Author:** Ashif Rabbani

**Environment:** Google Colab

**Data Source:** Social Media Platforms, GitHub

This project leverages Natural Language Processing (NLP) and Blockchain technology to identify and preserve potential digital forensic evidence from social media posts.

# Step 1: Explore and Clean Raw Data

📌 Objective

Clone a public GitHub repository containing sample social media datasets, load the data, and take an initial look at the structure and contents.

In [None]:
# Clone the repository (only needs to be done once per runtime session)
!git clone https://github.com/luminati-io/Social-media-dataset-samples.git

# Change directory to where the datasets are stored
%cd Social-media-dataset-samples

In [None]:
!ls

 Facebook-datasets.csv	  README.md		      TikTok-datasets.csv
 Instagram-datasets.csv   Social-media-datasets.png  'Twitter- datasets.csv'


**1. Explore Twitter dataset**

In [None]:
import pandas as pd

# The filename contains a space after the hyphen, so it must be quoted exactly
df_twitter = pd.read_csv('Twitter- datasets.csv')

# Quick preview of the first few rows
df_twitter.head()


Unnamed: 0,id,user_posted,name,description,date_posted,photos,url,tagged_users,replies,reposts,...,posts_count,profile_image_link,following,is_verified,quotes,bookmarks,parent_post_details,external_image_urls,videos,quoted_post
0,1868428607451799983,Glo███ews███,Glo███ews███,"Com o fim da ditadura Assad, muitos sírios con...","""2024-12-15T22:51:08.000Z""",,https://x.com/GloboNews/status/186842860745179...,,2,1,...,222223,https://pbs.twimg.com/profile_images/155910271...,122,False,1,1,"{""post_id"":null,""profile_id"":null,""profile_nam...",,"[{""duration"":148167,""video_url"":""https://video...","{""data_posted"":null,""description"":null,""photos..."
1,1868159094567121215,bil███ard███,bil███ard███,Brian Austin Green Tells MGK to ‘Grow Up’ Afte...,"""2024-12-15T05:00:11.000Z""",,https://x.com/billboard/status/186815909456712...,,7,3,...,357584,https://pbs.twimg.com/profile_images/169657720...,3784,False,1,2,"{""post_id"":null,""profile_id"":null,""profile_nam...","[""https://pbs.twimg.com/card_img/1867636129563...",,"{""data_posted"":null,""description"":null,""photos..."
2,1868451534708883739,TNT███rts███,TNT███ort███R,VENCE O PSG NO CLÁSSICO! 💪🇫🇷 Nossa @claalbuque...,"""2024-12-16T00:22:14.000Z""",,https://x.com/TNTSportsBR/status/1868451534708...,"[{""biography"":null,""followers"":null,""following...",2,1,...,456734,https://pbs.twimg.com/profile_images/180701304...,859,False,0,1,"{""post_id"":null,""profile_id"":null,""profile_nam...",,"[{""duration"":94861,""video_url"":""https://video....","{""data_posted"":null,""description"":null,""photos..."
3,1868441382022717466,TNT███rts███,TNT███ort███R,ÍDOLO E AGORA PRESIDENTE! 🇦🇷🇦🇷 O ex-atacante D...,"""2024-12-15T23:41:54.000Z""","[""https://pbs.twimg.com/media/Ge4JhOgXcAAQy5K....",https://x.com/TNTSportsBR/status/1868441382022...,"[{""biography"":null,""followers"":null,""following...",6,5,...,456734,https://pbs.twimg.com/profile_images/180701304...,859,False,1,3,"{""post_id"":null,""profile_id"":null,""profile_nam...",,,"{""data_posted"":null,""description"":null,""photos..."
4,1868418260892565925,Glo███ews███,Glo███ews███,.@DanielaLima_ : cirurgia de Lula travou negoc...,"""2024-12-15T22:10:01.000Z""",,https://x.com/GloboNews/status/186841826089256...,"[{""biography"":null,""followers"":null,""following...",50,4,...,222223,https://pbs.twimg.com/profile_images/155910271...,122,False,3,3,"{""post_id"":null,""profile_id"":null,""profile_nam...",,"[{""duration"":127067,""video_url"":""https://video...","{""data_posted"":null,""description"":null,""photos..."


In [None]:
# Rename file to avoid issues with spaces
!mv 'Twitter- datasets.csv' twitter_datasets.csv

# Re-load with clean name
df_twitter = pd.read_csv('twitter_datasets.csv')
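
The `!mv` shell escape above only works inside IPython or Colab. As a rough alternative, the same rename could be done in plain Python with `pathlib`. This is a minimal sketch, not part of the original notebook: the helper name `normalise_filename` is hypothetical, and the demonstration runs in a temporary directory rather than in the cloned repository.

```python
import tempfile
from pathlib import Path


def normalise_filename(path: Path) -> Path:
    """Rename a file so its name is lowercase with underscores instead of spaces or hyphens."""
    # "Twitter- datasets.csv" -> "twitter_datasets.csv"
    clean_stem = "_".join(path.stem.replace("-", " ").split()).lower()
    target = path.with_name(clean_stem + path.suffix)
    if target != path:
        path.rename(target)
    return target


# Demonstrate on a throwaway directory (assumption: not the real dataset folder).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "Twitter- datasets.csv"
    original.write_text("id,description\n1,hello\n", encoding="utf-8")
    renamed = normalise_filename(original)
    renamed_name = renamed.name
    renamed_exists = renamed.exists()
```

In the notebook this would be called once on the Twitter file before `pd.read_csv`.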

In [None]:
# See all column names to choose what to keep
df_twitter.columns.tolist()

['id',
 'user_posted',
 'name',
 'description',
 'date_posted',
 'photos',
 'url',
 'tagged_users',
 'replies',
 'reposts',
 'likes',
 'views',
 'external_url',
 'hashtags',
 'followers',
 'biography',
 'posts_count',
 'profile_image_link',
 'following',
 'is_verified',
 'quotes',
 'bookmarks',
 'parent_post_details',
 'external_image_urls',
 'videos',
 'quoted_post']

In [None]:
# Select relevant columns
df_relevant = df_twitter[['id', 'user_posted', 'description', 'date_posted', 'likes', 'replies', 'reposts', 'views']].copy()  # copy so new columns can be added safely

# Show first 5 rows with clean formatting
df_relevant.head().style.set_properties(**{'text-align': 'left'})


Unnamed: 0,id,user_posted,description,date_posted,likes,replies,reposts,views
0,1868428607451799983,Glo███ews███,"Com o fim da ditadura Assad, muitos sírios consideram voltar para casa. 12 milhões fugiram desde o início da guerra civil, há 13 anos. Países europeus estão revendo políticas de refúgio a sírios. ➡ Assista à #GloboNews:","""2024-12-15T22:51:08.000Z""",33,2,1,8369
1,1868159094567121215,bil███ard███,Brian Austin Green Tells MGK to ‘Grow Up’ After Musician’s Split From Megan Fox: ‘She’s Pregnant’,"""2024-12-15T05:00:11.000Z""",43,7,3,25007
2,1868451534708883739,TNT███rts███,"VENCE O PSG NO CLÁSSICO! 💪🇫🇷 Nossa @claalbuquerque traz os detalhes de mais um triunfo do time da capital, dessa vez contra o Lyon, pela #Ligue1 e destaca a atuação de uma jovem promessa da equipe comandada por Luis Enrique!","""2024-12-16T00:22:14.000Z""",33,2,1,15497
3,1868441382022717466,TNT███rts███,ÍDOLO E AGORA PRESIDENTE! 🇦🇷🇦🇷 O ex-atacante Diego Milito foi confirmado como novo presidente do Racing! 🗞️ @TNTSportsAR,"""2024-12-15T23:41:54.000Z""",230,6,5,18267
4,1868418260892565925,Glo███ews███,".@DanielaLima_ : cirurgia de Lula travou negociações para reforma ministerial, que ficou para início de 2025. ➡ Assista à #GloboNews:","""2024-12-15T22:10:01.000Z""",84,50,4,9569


In [None]:
# Some texts are not in English.
# Install the libraries needed to detect the language and translate to English.

!pip install deep-translator langdetect
from deep_translator import GoogleTranslator
from langdetect import detect

def translate_if_not_english(text):
    """Return `text` translated to English, or unchanged if it is already English."""
    try:
        lang = detect(text)
        if lang != 'en':
            return GoogleTranslator(source='auto', target='en').translate(text)
        return text
    except Exception:
        # Fall back to the original on any failure (e.g., empty or emoji-only text)
        return text
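
Applying the translator row by row sends one request per comment, even when the same text appears many times. The sketch below shows one possible way to translate each distinct text only once and to skip missing or blank values. It is an illustration under assumptions, not the notebook's method: `translate_unique`, `fake_detect`, and `fake_translate` are hypothetical names, and the stand-in functions exist only so the example runs offline.

```python
from typing import Callable, Dict, Iterable, List


def translate_unique(
    texts: Iterable[object],
    detect_fn: Callable[[str], str],
    translate_fn: Callable[[str], str],
) -> List[object]:
    """Translate non-English strings, calling the translator once per distinct text."""
    cache: Dict[str, str] = {}
    results: List[object] = []
    for text in texts:
        # Leave missing values, numbers, and blank strings untouched.
        if not isinstance(text, str) or not text.strip():
            results.append(text)
            continue
        if text not in cache:
            try:
                is_english = detect_fn(text) == "en"
                cache[text] = text if is_english else translate_fn(text)
            except Exception:
                # Keep the original text if detection or translation fails.
                cache[text] = text
        results.append(cache[text])
    return results


# Stand-ins so the sketch runs without network access (hypothetical helpers).
translate_calls = []


def fake_detect(text: str) -> str:
    return "pt" if "Assista" in text else "en"


def fake_translate(text: str) -> str:
    translate_calls.append(text)
    return text.replace("Assista", "Watch")


sample = ["Assista agora", "Well done!!", "Assista agora", None, ""]
translated = translate_unique(sample, fake_detect, fake_translate)
```

In the notebook, `detect_fn` would be `detect` from `langdetect` and `translate_fn` would be `GoogleTranslator(source='auto', target='en').translate`.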

In [None]:
# Translate the description column entries into English

df_relevant['description_translated'] = df_relevant['description'].apply(translate_if_not_english)

In [None]:
df_relevant[['user_posted', 'description_translated', 'date_posted']].head().style.set_properties(**{'text-align': 'left'})

Unnamed: 0,user_posted,description_translated,date_posted
0,Glo███ews███,"With the end of the Assad dictatorship, many Syrians consider coming home. 12 million have fled since the beginning of the civil war 13 years ago. European countries are reviewing refuge policies to Syrians. ➡ Watch #globonews:","""2024-12-15T22:51:08.000Z"""
1,bil███ard███,Brian Austin Green Tells MGK to ‘Grow Up’ After Musician’s Split From Megan Fox: ‘She’s Pregnant’,"""2024-12-15T05:00:11.000Z"""
2,TNT███rts███,"PSG beats in the classic! 💪🇫🇷 Our @claalbuquerque brings the details of another triumph of the capital team, this time against Lyon, for #Ligue1 and highlights the performance of a young promise of the team led by Luis Enrique!","""2024-12-16T00:22:14.000Z"""
3,TNT███rts███,Idol and now President! 🇦🇷🇦🇷 Former striker Diego Milito was confirmed as new president of Racing! 🗞️ @tntsportsar,"""2024-12-15T23:41:54.000Z"""
4,Glo███ews███,".@Danielalima_: Lula surgery had negotiated ministerial reform, which was early 2025. ➡ Watch #globonews:","""2024-12-15T22:10:01.000Z"""


**Saving clean translated dataset**

In [None]:
# Select relevant columns
df_cleaned = df_relevant[['id', 'user_posted', 'description_translated', 'date_posted', 'likes', 'replies', 'reposts', 'views']]

# Save the cleaned dataset to a CSV file in the current session
file_path = '/content/cleaned_twitter_dataset.csv'
df_cleaned.to_csv(file_path, index=False)

# Display message with file path
print(f"Cleaned dataset saved to session storage: {file_path}")


Cleaned dataset saved to session storage: /content/cleaned_twitter_dataset.csv
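
Since the project aims to preserve potential evidence, it may be useful to record a cryptographic digest of each saved file so that later changes can be detected. The notebook does not do this at this step; the following is only a sketch using the standard library, with a hypothetical helper name `sha256_of_file` and a throwaway demo file in place of the real CSV.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo file standing in for a cleaned dataset (assumption: not the real data).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "cleaned_demo.csv"
    demo.write_bytes(b"id,description\n1,hello\n")
    digest_before = sha256_of_file(demo)

    # Any modification, however small, yields a different digest.
    demo.write_bytes(b"id,description\n1,hello!\n")
    digest_after = sha256_of_file(demo)
```

Storing `digest_before` alongside the file path would give a simple reference value to check against later.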


**Cleaning remaining datasets**

**2. Facebook dataset**

In [None]:
# Load the Facebook dataset
df_facebook = pd.read_csv('Facebook-datasets.csv')

# See all column names to choose what to keep
df_facebook.columns.tolist()

['url',
 'post_id',
 'post_url',
 'comment_id',
 'user_name',
 'user_id',
 'user_url',
 'date_created',
 'comment_text',
 'num_likes',
 'num_replies',
 'attached_files',
 'video_length',
 'source_type',
 'subtype',
 'type']

In [None]:
# Keep only the relevant columns (copy to avoid SettingWithCopyWarning)
df_facebook = df_facebook[['user_id', 'comment_text', 'date_created', 'num_likes', 'num_replies']].copy()

# Replace missing counts with 0 and convert them from float to int.
# Only the count columns are filled, so missing text is not turned into 0.
count_cols = ['num_likes', 'num_replies']
df_facebook[count_cols] = df_facebook[count_cols].fillna(0).astype(int)

# Translate the comment if it is not in English
df_facebook['comment_text'] = df_facebook['comment_text'].apply(translate_if_not_english)

df_facebook.head().style.set_properties(**{'text-align': 'left'})

Unnamed: 0,user_id,comment_text,date_created,num_likes,num_replies
0,pfbid02QbSc7Zs4wR67DWEXtcGGyDdP6MjktDUBe7MHb2bNJQZsTgRbJ4dPQNeSXudv6w8rl,Joy House,"""2024-01-12T13:41:08.000Z""",0,1
1,pfbid02DAD4uXif7L2ibVx7zZ4B8vbza7Tqrqzdn1kF3T1qkUSX1otAu4uMvHoXyA1YtezYl,"Yet if a Verizon customer travels outside of the USA and loses their phone, they are screwed because Verizon will not help.","""2024-01-19T22:33:51.000Z""",0,1
2,pfbid02DVKaLESKQqoYHcmo2HeMBuUs2JNK18KZDMa1hMLfh4TiGJPQ9ByYkarcYSh4QmdVl,Where is the 5g we was promised 2 years ago,"""2024-01-16T01:39:54.000Z""",0,1
3,pfbid0WfY4JxPXbR2Xp4tsUCaaVtwGqwArAbCEL3BM7QXS5X496Gu23LQe1H1KhptCVnoFl,"Your customer service may be the worst on earth. Today I tried to update my wireless account to the tune of an $8 per month increase. I have been an AT&T customer for over 30 years. I spent 4 hours in the store and on the phone with your “fraud” people giving them more information than my bank required for my first mortgage and they could not “identify” me. Needless to say, I could not make any change. What was worse was they refused to tell me what the problem was. How can I reach a solution to a problem when I don’t know what the problem is? Tomorrow I am filing a formal complaint with the NC Attorney General. I will be shopping for a new wireless provider as well.","""2024-02-06T02:46:19.000Z""",0,1
4,100004055062562,And what does this picture suppose to mean? That the man in the picture said what your caption says? Are you guys serious at all?,"""2024-11-28T08:40:43.000Z""",0,1


In [None]:
# Save the cleaned dataset to a CSV file in the current session
file_path = '/content/cleaned_facebook_dataset.csv'
df_facebook.to_csv(file_path, index=False)

# Display message with file path
print(f"Cleaned dataset saved to session storage: {file_path}")

Cleaned dataset saved to session storage: /content/cleaned_facebook_dataset.csv


**3. Instagram dataset**

In [None]:
# Load the Instagram dataset
df_instagram = pd.read_csv('Instagram-datasets.csv')

# See all column names to choose what to keep
df_instagram.columns.tolist()

['url',
 'comment_user',
 'comment_user_url',
 'comment_date',
 'comment',
 'likes_number',
 'replies_number',
 'replies',
 'hashtag_comment',
 'tagged_users_in_comment',
 'post_url',
 'post_user',
 'comment_id',
 'post_id']

In [None]:
# Keep only the relevant columns (copy to avoid SettingWithCopyWarning)
df_instagram = df_instagram[['comment_user', 'comment', 'comment_date', 'likes_number', 'replies_number']].copy()

# Translate the comment if it is not in English
df_instagram['comment'] = df_instagram['comment'].apply(translate_if_not_english)

df_instagram.head().style.set_properties(**{'text-align': 'left'})

Unnamed: 0,comment_user,comment,comment_date,likes_number,replies_number
0,fab███vf,👏👏👏,2024-11-13T20:01:57.000Z,1,1
1,des███opi███,😍😍😍,2024-11-13T17:11:39.000Z,1,1
2,mar███osb███,"My dear @Euwanderson7 who was our delegate representing our state and the Sabito Valley, Novo East of Piauí.",2024-11-13T23:00:46.000Z,1,1
3,tic███cau███,"With Professor Matheus Carvalho, really sensational 👏👏👏👏",2024-11-14T00:16:19.000Z,3,1
4,bru███nca███cao███,"@Rafael.Fonteles every year is that. The only municipality in South America, which runs out of water when the water arrives. The citizens can no longer stand. When will you solve this? Delegates someone to solve this problem, my buddy. There is no such thing as a budget, 6 days without water. You don't solve because they don't want to.",2024-11-14T17:33:11.000Z,1,1


In [None]:
# Save the cleaned dataset to a CSV file in the current session
file_path = '/content/cleaned_instagram_dataset.csv'
df_instagram.to_csv(file_path, index=False)

# Display message with file path
print(f"Cleaned dataset saved to session storage: {file_path}")

Cleaned dataset saved to session storage: /content/cleaned_instagram_dataset.csv


**4. TikTok dataset**

In [None]:
# Load the TikTok dataset
df_tiktok = pd.read_csv('TikTok-datasets.csv')

# See all column names to choose what to keep
df_tiktok.columns.tolist()

['url',
 'post_url',
 'post_id',
 'post_date_created',
 'date_created',
 'comment_text',
 'num_likes',
 'num_replies',
 'commenter_user_name',
 'commenter_id',
 'commenter_url',
 'comment_id',
 'comment_url']

In [None]:
# Keep only the relevant columns (copy to avoid SettingWithCopyWarning)
df_tiktok = df_tiktok[['comment_text', 'commenter_user_name', 'comment_id', 'date_created', 'num_likes', 'num_replies']].copy()

# Translate the comment if it is not in English.
# Note: the translation is stored in a new 'comment' column; the original 'comment_text' is kept.
df_tiktok['comment'] = df_tiktok['comment_text'].apply(translate_if_not_english)

df_tiktok.head().style.set_properties(**{'text-align': 'left'})

Unnamed: 0,comment_text,commenter_user_name,comment_id,date_created,num_likes,num_replies,comment
0,What is the dividend yield on this fund ?,fin███ Ai███n,7381564492096340768,"""2024-06-17T20:00:29.000Z""",1,1,What is the dividend yield on this fund ?
1,Too much stress with stocks bro,kap███ate███19,7412215320918655777,"""2024-09-08T10:21:17.000Z""",2,2,Too much stress with stocks bro
2,"wouldn't really call it a crash , maybe use the term ""falling "" crash would be like double digit % decline within 48 hours , unless you just using buzz words for algorithms",Narz,7412192541975790369,"""2024-09-08T08:52:47.000Z""",0,2,"wouldn't really call it a crash , maybe use the term ""falling "" crash would be like double digit % decline within 48 hours , unless you just using buzz words for algorithms"
3,Delivery hero.. wish I’d put kkkk in that 🤣,tra███e13███,7438652110797849376,"""2024-11-18T16:09:45.000Z""",1,1,Delivery hero.. wish I’d put kkkk in that 🤣
4,"I made $82,000 in 7 days investing in trading @PAULWHEELERFX start your investment to make a change",kar███,7400426764684264198,"""2024-08-07T15:55:35.000Z""",0,0,"I made $82,000 in 7 days investing in trading @PAULWHEELERFX start your investment to make a change"


In [None]:
# Save the cleaned dataset to a CSV file in the current session
file_path = '/content/cleaned_tiktok_dataset.csv'
df_tiktok.to_csv(file_path, index=False)

# Display message with file path
print(f"Cleaned dataset saved to session storage: {file_path}")

Cleaned dataset saved to session storage: /content/cleaned_tiktok_dataset.csv
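
The four cleaning cells above follow the same pattern: select columns, fill missing counts, cast to integers, translate the text. One way to avoid the copy-and-paste is a small shared helper. The sketch below is an assumption about how that might look; `clean_comments` and its parameters are hypothetical, and the toy frame only imitates the Facebook column layout.

```python
import pandas as pd


def clean_comments(df, keep_cols, text_col, count_cols, translate=lambda t: t):
    """Apply the select / fill / cast / translate steps to one platform's data."""
    out = df[keep_cols].copy()  # copy so later assignments do not touch a view
    for col in count_cols:
        # Missing counts become 0; the text column is left alone.
        out[col] = out[col].fillna(0).astype(int)
    out[text_col] = out[text_col].apply(translate)
    return out


# Toy data with Facebook-style column names (illustrative values only).
raw = pd.DataFrame({
    "user_id": ["a", "b"],
    "comment_text": ["Well done!!", None],
    "num_likes": [3.0, None],
    "num_replies": [None, 1.0],
    "video_length": [None, None],
})
cleaned = clean_comments(
    raw,
    keep_cols=["user_id", "comment_text", "num_likes", "num_replies"],
    text_col="comment_text",
    count_cols=["num_likes", "num_replies"],
)
```

Passing `translate=translate_if_not_english` would reproduce the translation step; the default leaves the text unchanged so the sketch runs offline.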




> All the cleaned datasets are saved in a GitHub repository for further use.



# Step 2: Data Pre-processing

**Building a unified social media dataset**

In [None]:
!git clone https://github.com/Ashif-1/NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data.git

In [8]:
!ls NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data/cleaned_datasets

cleaned_facebook_dataset.csv   cleaned_tiktok_dataset.csv
cleaned_instagram_dataset.csv  cleaned_twitter_dataset.csv


In [15]:
import pandas as pd

# Load one cleaned dataset and map it onto the shared schema
def load_and_prepare(path, rename_map, source_name):
    df = pd.read_csv(path)

    # Drop columns whose names are already duplicated in the file
    df = df.loc[:, ~df.columns.duplicated(keep='first')]

    # Rename columns to standard names
    df = df.rename(columns=rename_map)

    # Keep only necessary columns (copy so adding 'source' does not touch a view)
    df = df[list(rename_map.values())].copy()
    df['source'] = source_name

    return df

# Define common columns and rename mappings
common_columns = ['user_id', 'comment_text', 'date_created', 'num_likes', 'num_replies', 'source']

facebook_path = 'NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data/cleaned_datasets/cleaned_facebook_dataset.csv'
instagram_path = 'NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data/cleaned_datasets/cleaned_instagram_dataset.csv'
tiktok_path = 'NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data/cleaned_datasets/cleaned_tiktok_dataset.csv'
twitter_path = 'NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data/cleaned_datasets/cleaned_twitter_dataset.csv'

# Load and prepare all datasets
# The cleaned Facebook file already uses the standard column names,
# so its mapping is the identity.
df_fb = load_and_prepare(facebook_path, {
    'user_id': 'user_id',
    'comment_text': 'comment_text',
    'date_created': 'date_created',
    'num_likes': 'num_likes',
    'num_replies': 'num_replies'
}, 'facebook')

df_ig = load_and_prepare(instagram_path, {
    'comment_user': 'user_id',
    'comment': 'comment_text',
    'comment_date': 'date_created',
    'likes_number': 'num_likes',
    'replies_number': 'num_replies'
}, 'instagram')

df_tt = load_and_prepare(tiktok_path, {
    'commenter_user_name': 'user_id',
    'comment': 'comment_text',
    'date_created': 'date_created',
    'num_likes': 'num_likes',
    'num_replies': 'num_replies'
}, 'tiktok')

df_tw = load_and_prepare(twitter_path, {
    'user_posted': 'user_id',
    'description_translated': 'comment_text',
    'date_posted': 'date_created',
    'likes': 'num_likes',
    'replies': 'num_replies'
}, 'twitter')

# Sanity check before concat (TikTok still has a duplicate 'comment_text' at this point)
for df, name in zip([df_fb, df_ig, df_tt, df_tw], ['Facebook', 'Instagram', 'TikTok', 'Twitter']):
    print(f"{name} columns: {df.columns.tolist()}")

# The cleaned TikTok file contains both the original 'comment_text' and the
# translated 'comment', so the rename above produced two 'comment_text' columns.
# Rebuild df_tt from the file to fix this.
df_tt_raw = pd.read_csv(tiktok_path)

# Drop the original comment_text if both exist
if 'comment_text' in df_tt_raw.columns and 'comment' in df_tt_raw.columns:
    df_tt_raw = df_tt_raw.drop(columns=['comment_text'])

# Then rename (columns already using the standard names need no entry)
df_tt = df_tt_raw.rename(columns={
    'commenter_user_name': 'user_id',
    'comment': 'comment_text'
})

# Retain the relevant columns and add the source label
df_tt = df_tt[['user_id', 'comment_text', 'date_created', 'num_likes', 'num_replies']].copy()
df_tt['source'] = 'tiktok'


# Combine all datasets
df_combined = pd.concat([df_fb, df_ig, df_tt, df_tw], ignore_index=True)

# Shuffle the dataset
df_combined = df_combined.sample(frac=1, random_state=42).reset_index(drop=True)

# Save unified dataset
output_path = 'NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data/cleaned_datasets/unified_social_media_dataset.csv'
df_combined.to_csv(output_path, index=False)

# Confirm where the file was saved
print(f"\n✅ Unified dataset saved to: {output_path}")


Facebook columns: ['user_id', 'comment_text', 'date_created', 'num_likes', 'num_replies', 'source']
Instagram columns: ['user_id', 'comment_text', 'date_created', 'num_likes', 'num_replies', 'source']
TikTok columns: ['user_id', 'comment_text', 'comment_text', 'date_created', 'num_likes', 'num_replies', 'source']
Twitter columns: ['user_id', 'comment_text', 'date_created', 'num_likes', 'num_replies', 'source']

✅ Unified dataset saved to: NLP-Based-Digital-Forensics-Analysis-from-Social-Media-Data/cleaned_datasets/unified_social_media_dataset.csv
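
The duplicate `comment_text` in the TikTok line of the sanity-check output appears because the file holds both `comment_text` and `comment`, and renaming `comment` to `comment_text` creates a second column with that name. A rename helper could guard against this by dropping the colliding column first. The following is a sketch under that assumption; `rename_without_collisions` is a hypothetical name and the toy frame only mimics the TikTok layout.

```python
import pandas as pd


def rename_without_collisions(df, rename_map):
    """Rename columns, first dropping any existing column that a rename would duplicate."""
    # A target collides when it already exists and is not simply mapped to itself.
    collisions = [
        new for old, new in rename_map.items()
        if old != new and old in df.columns and new in df.columns
    ]
    df = df.drop(columns=collisions).rename(columns=rename_map)
    return df[list(rename_map.values())].copy()


# Toy frame: original text in 'comment_text', translation in 'comment'.
tiktok_like = pd.DataFrame({
    "comment_text": ["olá"],
    "commenter_user_name": ["user_a"],
    "date_created": ["2024-06-17T20:00:29.000Z"],
    "num_likes": [1],
    "num_replies": [1],
    "comment": ["hello"],
})
prepared = rename_without_collisions(tiktok_like, {
    "commenter_user_name": "user_id",
    "comment": "comment_text",
    "date_created": "date_created",
    "num_likes": "num_likes",
    "num_replies": "num_replies",
})
```

With this guard the translated text wins and each column name appears once, so a separate TikTok-specific pass would not be needed.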


In [18]:
# df_combined.head()

from IPython.display import display, HTML

# Show first 20 rows
df_display = df_combined.head(20)

# HTML style settings
html = df_display.to_html(escape=False, index=False)

# Add CSS to wrap text and set column widths
style = """
<style>
table {
    table-layout: fixed;
    width: 100%;
    word-wrap: break-word;
}
th, td {
    padding: 8px;
    text-align: right;
    vertical-align: top;
}
th:nth-child(1), td:nth-child(1) { width: 15%; }
th:nth-child(2), td:nth-child(2) { width: 50%; }
th:nth-child(3), td:nth-child(3) { width: 15%; }
th:nth-child(4), td:nth-child(4) { width: 10%; }
th:nth-child(5), td:nth-child(5) { width: 10%; }
th:nth-child(6), td:nth-child(6) { width: 10%; }
</style>
"""

# Display styled table
display(HTML(style + html))

user_id,comment_text,date_created,num_likes,num_replies,source
100060392475788,"Wow, congratulations on reaching such an incredible milestone! 🎉 Fifty years of dedication to North Texas is a testament to the quality and trust you bring to the community. The celebration looks like it was a beautiful way to honor everyone who's been part of the journey—especially with that delicious BBQ from Hurtado! 🍖 Here’s to many more years of keeping Texas homes and businesses safe and sound under a Lon Smith roof","""2024-11-11T10:41:43.000Z""",0,0,facebook
nyp███,Billie Eilish gets hit in the face by a necklace while performing at Arizona concert,"""2024-12-15T15:34:40.000Z""",79,64,twitter
pfbid026KZKzgAK7LPCc4rdHeHMxXBityqcnbbLr64qiRCSD9xANTBmJBTTaA13Mf6QfVs7l,Hello husband,"""2023-04-20T14:46:04.000Z""",0,0,facebook
F1,From breakthrough victories to jaw-dropping comebacks 💪\n\nLewis Hamilton's top 10 greatest moments for @MercedesAMGF1 👇\n\n#F1,"""2024-12-15T09:23:00.000Z""",1924,42,twitter
⠀⠀ ███ ⠀s███,"As a Muslim who wants to invest but interests are haram, so what best way to invest?","""2024-10-03T08:11:29.000Z""",2,2,tiktok
pfbid0WDSDFkeVYwg6X4K9Uae2xhNnugSM2BiRTd4QXZq2V382YtQnFBRRMGdDcKuiyoS3l,"Just switched my internet from AT&T, and I couldn't be happier... bad costumer service bad internet service bad bad bad all around","""2024-04-11T23:16:15.000Z""",0,1,facebook
shi███ato███n,Love❤️❤️❤️❤️,2024-11-24T16:54:37.000Z,1,1,instagram
pfbid0KaHhbEzyVkuaGqfmi5vFvNRUbrjUjigG9XbHss8M86DjBDGXkve1aoLaoMWvgfSml,So I check my bank account today to find out ATT has charged my business account that I’ve had with them for umteen years almost 3000.00!!!!!!!! What???? So after 2.5 hours on the phone and 2 phone calls later ( the first guy left me on hold for 1.5 hours before hanging up on me ) they saw “ their” mistake and will refund my money…… in 3-5 biz days. Meanwhile I’m out countless overdraft fees and have other bills to pay. They get a large amount from me every month as it is for like 12 or more years. I’m thinking I may need to make a change,"""2024-01-17T22:40:13.000Z""",0,2,facebook
pfbid0nLGr4bBJPp5rCQRGFQrWaaeZLinxXM9Cp6Uxbe8Lm3GS7u3kKKZDyNzVrpLTYNfgl,My phone is not working,"""2023-03-29T20:56:28.000Z""",0,2,facebook
pfbid0dDG3ksF1NbaCthwuZfx1tAuzycuhay6nRzvpcb9qZY8uyKnaJAwHJVTaXdrCBFb5l,Well done!!,"""2024-07-31T01:41:21.000Z""",0,0,facebook


In [19]:
# Basic insights
num_entries = len(df_combined)
num_sources = df_combined['source'].nunique()
entries_per_source = df_combined['source'].value_counts()
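# Note: date_created is still a string here, and some values carry literal quotes,
# so min()/max() compare text rather than timestamps.
date_range = df_combined['date_created'].min(), df_combined['date_created'].max()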
most_liked = df_combined.sort_values(by='num_likes', ascending=False).head(1)
most_replied = df_combined.sort_values(by='num_replies', ascending=False).head(1)

print(f"🔢 Total entries: {num_entries}")
print(f"📱 Sources found: {num_sources} => {df_combined['source'].unique().tolist()}")
print(f"\n🗂️ Entries per platform:\n{entries_per_source}")
print(f"\n🗓️ Date range: {date_range[0]} → {date_range[1]}")

print("\n🔥 Most liked post:")
display(most_liked[['source', 'comment_text', 'user_id', 'num_likes']])

print("\n💬 Most replied post:")
display(most_replied[['source', 'comment_text', 'user_id', 'num_replies']])


🔢 Total entries: 4000
📱 Sources found: 4 => ['facebook', 'twitter', 'tiktok', 'instagram']

🗂️ Entries per platform:
source
facebook     1000
twitter      1000
tiktok       1000
instagram    1000
Name: count, dtype: int64

🗓️ Date range: "2021-10-16T22:42:03.000Z" → 2024-12-14T00:31:37.000Z

🔥 Most liked post:


Unnamed: 0,source,comment_text,user_id,num_likes
994,twitter,"Really thrilled to tell you this!! Mexico, Arg...",tay███swi███3,592538



💬 Most replied post:


Unnamed: 0,source,comment_text,user_id,num_replies
994,twitter,"Really thrilled to tell you this!! Mexico, Arg...",tay███swi███3,28180
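
The date range printed above comes from comparing `date_created` as strings. Some platforms store the value with literal double quotes and Instagram does not, so the quoted values sort before the unquoted ones and the reported maximum can be too early (the Twitter preview in Step 1 includes a post dated 2024-12-16). A possible fix is to strip the quotes and parse the column before taking the minimum and maximum. This is a sketch on a three-row sample, not the notebook's actual data.

```python
import pandas as pd

dates = pd.Series([
    '"2021-10-16T22:42:03.000Z"',  # quoted, as in the Facebook/TikTok/Twitter rows
    '2024-12-14T00:31:37.000Z',    # unquoted, as in the Instagram rows
    '"2024-12-16T00:22:14.000Z"',  # quoted and later than the unquoted value
])

# Plain string comparison: '"' sorts before any digit, so every quoted value
# is treated as smaller than every unquoted one.
string_max = dates.max()

# Strip the quotes, then parse to timezone-aware timestamps.
parsed = pd.to_datetime(dates.str.strip('"'), utc=True, errors="coerce")
true_min, true_max = parsed.min(), parsed.max()
```

Applied to `df_combined['date_created']`, the same two steps would give a chronological range and a column that can be sorted or filtered by time.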
