***
# Dataset Tidying
This notebook aggregates and standardizes data collected from social media platforms (Reddit, Instagram, LinkedIn) and various online newspapers into a single, unified data structure. The integrated dataset is meant to support sentiment analysis by giving every source the same columns and value formats, so that entries can be compared and manipulated together.


The unified data structure incorporates the following fields for each entry:

- Post Title: Title of the post or article.
- Post URL: Link to the original post or article.
- Comment Body: Content of a user comment.
- Comment Date: Timestamp of the comment.
- Source Category: Origin of the data (e.g., Reddit, LinkedIn).
- Flag: Indicates the kind of content (e.g., news, user opinion).
- Type Category: Whether the entry is a post, a comment, or an article.
- Like: Count of likes or similar engagements on the post or comment.


Data was collected programmatically from multiple sources, each with its own structure and content format. The notebook therefore cleans and standardizes it, paying particular attention to date formats and other variable formats so that the dataset is consistent. For example, relative dates such as "12 weeks ago" on Instagram are converted to absolute dates based on the extraction timestamp, and the 'Reply' label that the scraper captured in place of a like count on some comments is removed.
Missing data were imputed where possible using logical assumptions based on the data context. Records with imputed values were flagged for easy identification during analysis.
Data from the various sources was then merged with pandas. This involved checks for consistency and duplication to protect the integrity of the final dataset.
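As a minimal sketch of what "a single, unified data structure" can look like in pandas, the helper below renames source-specific columns and reindexes the frame to one fixed column list. The names `to_unified` and `UNIFIED_COLUMNS` are assumptions made for illustration; the column names themselves are the ones the cells in this notebook converge on.

```python
from typing import Optional

import pandas as pd

# Assumed target schema, taken from the column names used in this notebook.
UNIFIED_COLUMNS = [
    "text_content", "url", "post_creator", "date",
    "source_category", "flag", "type_category", "likes",
]

def to_unified(df: pd.DataFrame, rename: Optional[dict] = None) -> pd.DataFrame:
    """Rename source-specific columns and keep only the unified ones.

    Columns that the source does not provide are created and filled with NaN.
    """
    renamed = df.rename(columns=rename or {})
    return renamed.reindex(columns=UNIFIED_COLUMNS)

# Tiny hypothetical frame shaped like the Reddit export.
sample = pd.DataFrame({
    "comment_body": ["example comment"],
    "post_url": ["https://example.com/post"],
    "comment_score": [3],
    "comment_date": ["2024-01-09"],
    "source_category": ["Reddit"],
})
unified = to_unified(sample, {
    "comment_body": "text_content",
    "post_url": "url",
    "comment_score": "likes",
    "comment_date": "date",
})
```

Reindexing to a fixed list keeps the column order identical across sources, which makes a later `pd.concat` easier to inspect.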





In [1]:
import re
import requests
import pandas as pd
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import datetime
import os
import numpy as np

***
# Data Loading
Each source is loaded from CSV with `pandas.read_csv` and its columns are renamed to the shared names before merging. We start with the Reddit comments.

In [2]:
reddit_df=pd.read_csv('ferragni_balocco_comments-2.csv')
reddit_df

Unnamed: 0,post_title,post_url,comment_body,comment_score,comment_date,source_category,flag,type_category
0,Caso Balocco: Alessandra Balocco e Chiara Ferr...,https://www.agi.it/cronaca/news/2024-01-08/chi...,Qualcuno disse che tutto ciò che sapeva su Chi...,317,2024-01-09,Reddit,User Opinion,Comment
1,Caso Balocco: Alessandra Balocco e Chiara Ferr...,https://www.agi.it/cronaca/news/2024-01-08/chi...,"“Io non li volevo quei soldi, me li hanno adde...",88,2024-01-09,Reddit,User Opinion,Comment
2,Caso Balocco: Alessandra Balocco e Chiara Ferr...,https://www.agi.it/cronaca/news/2024-01-08/chi...,In che senso Alessandra Balocco? Non c'è nessu...,202,2024-01-09,Reddit,User Opinion,Comment
3,Caso Balocco: Alessandra Balocco e Chiara Ferr...,https://www.agi.it/cronaca/news/2024-01-08/chi...,Io non godo. Sarebbe stato meglio che non foss...,160,2024-01-09,Reddit,User Opinion,Comment
4,Caso Balocco: Alessandra Balocco e Chiara Ferr...,https://www.agi.it/cronaca/news/2024-01-08/chi...,Non so quanto possa tenere botta una contestaz...,61,2024-01-09,Reddit,User Opinion,Comment
...,...,...,...,...,...,...,...,...
2591,"Chiara Ferragni, dopo il pandoro le uova di Pa...",https://www.open.online/2023/12/19/chiara-ferr...,"No. Il Presidente del Consiglio è nominato, no...",1,2023-12-19,Reddit,User Opinion,Comment
2592,"Chiara Ferragni, dopo il pandoro le uova di Pa...",https://www.open.online/2023/12/19/chiara-ferr...,"Oook. Nel conto metto anche: ""Fa fatica con le...",6,2023-12-19,Reddit,User Opinion,Comment
2593,"Chiara Ferragni, dopo il pandoro le uova di Pa...",https://www.open.online/2023/12/19/chiara-ferr...,Sono meri passaggi teorici che non cambiano co...,2,2023-12-19,Reddit,User Opinion,Comment
2594,"Chiara Ferragni, dopo il pandoro le uova di Pa...",https://www.open.online/2023/12/19/chiara-ferr...,"Si, sei l’unico intelligente qua. Se usi analo...",1,2023-12-19,Reddit,User Opinion,Comment


In [3]:
reddit_df.rename(columns={'post_url':'url', 'comment_score':'likes', 'comment_date':'date', 'comment_body':'text_content'}, inplace=True)
reddit_df.drop(columns=['post_title'], inplace=True)
reddit_df

Unnamed: 0,url,text_content,likes,date,source_category,flag,type_category
0,https://www.agi.it/cronaca/news/2024-01-08/chi...,Qualcuno disse che tutto ciò che sapeva su Chi...,317,2024-01-09,Reddit,User Opinion,Comment
1,https://www.agi.it/cronaca/news/2024-01-08/chi...,"“Io non li volevo quei soldi, me li hanno adde...",88,2024-01-09,Reddit,User Opinion,Comment
2,https://www.agi.it/cronaca/news/2024-01-08/chi...,In che senso Alessandra Balocco? Non c'è nessu...,202,2024-01-09,Reddit,User Opinion,Comment
3,https://www.agi.it/cronaca/news/2024-01-08/chi...,Io non godo. Sarebbe stato meglio che non foss...,160,2024-01-09,Reddit,User Opinion,Comment
4,https://www.agi.it/cronaca/news/2024-01-08/chi...,Non so quanto possa tenere botta una contestaz...,61,2024-01-09,Reddit,User Opinion,Comment
...,...,...,...,...,...,...,...
2591,https://www.open.online/2023/12/19/chiara-ferr...,"No. Il Presidente del Consiglio è nominato, no...",1,2023-12-19,Reddit,User Opinion,Comment
2592,https://www.open.online/2023/12/19/chiara-ferr...,"Oook. Nel conto metto anche: ""Fa fatica con le...",6,2023-12-19,Reddit,User Opinion,Comment
2593,https://www.open.online/2023/12/19/chiara-ferr...,Sono meri passaggi teorici che non cambiano co...,2,2023-12-19,Reddit,User Opinion,Comment
2594,https://www.open.online/2023/12/19/chiara-ferr...,"Si, sei l’unico intelligente qua. Se usi analo...",1,2023-12-19,Reddit,User Opinion,Comment


***
Next comes the Instagram data that David provided, starting with his already concatenated file.

In [4]:
david_df=pd.read_csv('concatenatedD_file.csv')
david_df

Unnamed: 0,text_content,url,post_creator,date,source_category,flag,type_category,likes
0,"Dopo le accuse per il caso Balocco, Chiara Fer...",https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,"December 18, 2023",Instagram,news,post,192
1,È ovunque non ne posso piu,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,17w,Instagram,user opinion,comment,5 likes
2,"Io sono con la Ferragni, inoltre ha fatto semp...",https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,17w,Instagram,user opinion,comment,Reply
3,L'importante è riconoscere l'errore. Hai fatto...,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,17w,Instagram,user opinion,comment,Reply
4,👏👏👏Il tuo chiarimento e le tue scuse dimostran...,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,17w,Instagram,user opinion,comment,2 likes
...,...,...,...,...,...,...,...,...
11138,I giornalisti sono CAROGNE patentate!!!! Ingig...,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,14w,Instagram,user opinion,comment,Reply
11139,Truffagnez 😂😂,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,14w,Instagram,user opinion,comment,Reply
11140,Brava adesso devolvi un MILIONE pure alla guar...,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,14w,Instagram,user opinion,comment,Reply
11141,Pagliaccia😂,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,14w,Instagram,user opinion,comment,Reply


In [5]:
os.listdir(r'C:\Users\Utente\Desktop\Luiss\DS_inAction\Final Project\David_to_merge')

['CFB_Insta_1.csv',
 'CFB_Insta_2.csv',
 'CFB_Insta_3.csv',
 'CF_Insta_1.csv',
 'CF_Insta_2.csv',
 'CF_Insta_3.csv',
 'CF_Insta_4.csv',
 'CF_Insta_5.csv',
 'CF_Insta_6.csv',
 'CF_Insta_7.csv',
 'CF_Insta_9.csv',
 'Journal_Insta_29.csv',
 'Journal_Insta_30.csv',
 'LinkedIn_1.csv',
 'LinkedIn_2.csv',
 'LinkedIn_3.csv',
 'Paparazzi_Insta_1.csv',
 'Paparazzi_Insta_19.csv',
 'Paparazzi_Insta_2.csv',
 'Paparazzi_Insta_20.csv',
 'Paparazzi_Insta_21.csv',
 'Paparazzi_Insta_3.csv',
 'Paparazzi_Insta_4.csv',
 'Paparazzi_Insta_5.csv',
 'Paparazzi_Insta_6.csv']

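The Instagram comment files are read and stacked into a single DataFrame. Note the relative `Comment_Date` values (for example `18w`), which are converted to absolute dates further down.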

In [6]:
filelist=['CFB_Insta_1.csv',
 'CFB_Insta_2.csv',
 'CFB_Insta_3.csv',
 'CF_Insta_1.csv',
 'CF_Insta_2.csv',
 'CF_Insta_3.csv',
 'CF_Insta_4.csv',
 'CF_Insta_5.csv',
 'CF_Insta_6.csv',
 'CF_Insta_7.csv',
 'CF_Insta_9.csv',
 'Paparazzi_Insta_1.csv',
 'Paparazzi_Insta_2.csv',
 'Paparazzi_Insta_3.csv',
 'Paparazzi_Insta_4.csv',
 'Paparazzi_Insta_5.csv',
 'Paparazzi_Insta_6.csv']

insta_df=pd.DataFrame()
for filename in filelist:
    path=r'C:\Users\Utente\Desktop\Luiss\DS_inAction\Final Project\David_to_merge'
    df=pd.read_csv(filepath_or_buffer=path+'\\'+filename)
    insta_df=pd.concat([insta_df, df], ignore_index=True)
    
insta_df

Unnamed: 0,Page_URL,Post_Creator,Post_Date,Comment,Comment_Date,Comment_Likes
0,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",,18w,142 likes
1,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",No ma usare tua figlia appena nata come modell...,18w,252 likes
2,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Passare da oltre 1500 commenti a poche centina...,10w,61 likes
3,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Ma quindi la beneficenza la facciamo a te ?,17w,21 likes
4,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Sa come fregare la gente,17w,31 likes
...,...,...,...,...,...,...
6767,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Da che pulpito viene la predica😮,11w,Reply
6768,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Brava Giorgia,5w,Reply
6769,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023","Giorgia ,hai fatto bene a smascherarla .",4w,Reply
6770,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Parla la Meloni che non ne ha combinato nessun...,3w,Reply
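The loading loop above recurs for several file groups in this notebook. As a sketch under stated assumptions, it could be wrapped in a small helper that builds the paths with `pathlib` and concatenates once at the end. The function name `load_csvs` and the throwaway demo files are hypothetical; in the notebook the folder would be the local `David_to_merge` directory.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_csvs(folder, filenames):
    """Read each CSV in `filenames` from `folder` and stack them row-wise."""
    folder = Path(folder)
    frames = [pd.read_csv(folder / name) for name in filenames]
    # One concat over the whole list avoids re-copying a growing frame on every iteration.
    return pd.concat(frames, ignore_index=True)

# Demo on two throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Comment": ["a"], "Comment_Date": ["18w"]}).to_csv(
        Path(tmp) / "part_1.csv", index=False)
    pd.DataFrame({"Comment": ["b"], "Comment_Date": ["3d"]}).to_csv(
        Path(tmp) / "part_2.csv", index=False)
    demo_df = load_csvs(tmp, ["part_1.csv", "part_2.csv"])
```

Building paths with `Path` rather than string concatenation also keeps the code independent of the Windows path separator.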


In [7]:
insta_df.dropna(inplace=True)
insta_df.rename(columns={'Page_URL':'url', 'Post_Creator':'post_creator', 'Comment':'text_content', 'Comment_Likes':'likes', 'Comment_Date':'date'}, inplace=True)
insta_df

Unnamed: 0,url,post_creator,Post_Date,text_content,date,likes
1,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",No ma usare tua figlia appena nata come modell...,18w,252 likes
2,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Passare da oltre 1500 commenti a poche centina...,10w,61 likes
3,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Ma quindi la beneficenza la facciamo a te ?,17w,21 likes
4,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Sa come fregare la gente,17w,31 likes
5,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Ma questa qua come può avere 29 milioni di cog...,14w,181 likes
...,...,...,...,...,...,...
6766,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Brava,12w,Reply
6767,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Da che pulpito viene la predica😮,11w,Reply
6768,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Brava Giorgia,5w,Reply
6769,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023","Giorgia ,hai fatto bene a smascherarla .",4w,Reply


In [8]:
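# 'Reply' is a scraping placeholder, not a like count: treat it as missing
insta_df.loc[insta_df['likes'] == 'Reply', 'likes'] = np.nan
# Keep only the digits from strings such as "252 likes" (raw string avoids an invalid-escape warning)
insta_df['likes'] = insta_df['likes'].str.extract(r'(\d+)').astype('Int64')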
insta_df

Unnamed: 0,url,post_creator,Post_Date,text_content,date,likes
1,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",No ma usare tua figlia appena nata come modell...,18w,252
2,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Passare da oltre 1500 commenti a poche centina...,10w,61
3,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Ma quindi la beneficenza la facciamo a te ?,17w,21
4,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Sa come fregare la gente,17w,31
5,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Ma questa qua come può avere 29 milioni di cog...,14w,181
...,...,...,...,...,...,...
6766,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Brava,12w,
6767,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Da che pulpito viene la predica😮,11w,
6768,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Brava Giorgia,5w,
6769,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023","Giorgia ,hai fatto bene a smascherarla .",4w,


***
### Converting Instagram's relative dates to absolute dates


In [9]:
collection_date = datetime.datetime(2024, 4, 16)

# Function for converting Instagram format dates
def calculate_date_from_insta(date_str):
    if pd.isna(date_str):
        return None
    number = int(date_str[:-1])  # Extract the number part
    unit = date_str[-1]  # Extract the unit (w, d, h)

    if unit == 'w':
        return collection_date - datetime.timedelta(weeks=number)
    elif unit == 'd':
        return collection_date - datetime.timedelta(days=number)
    elif unit == 'h':
        return collection_date - datetime.timedelta(hours=number)
    else:
        return None

insta_df['actual_date'] = insta_df['date'].apply(calculate_date_from_insta)
insta_df['actual_date'] = insta_df['actual_date'].dt.strftime('%Y-%m-%d')  # Format the date

insta_df

Unnamed: 0,url,post_creator,Post_Date,text_content,date,likes,actual_date
1,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",No ma usare tua figlia appena nata come modell...,18w,252,2023-12-12
2,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Passare da oltre 1500 commenti a poche centina...,10w,61,2024-02-06
3,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Ma quindi la beneficenza la facciamo a te ?,17w,21,2023-12-19
4,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Sa come fregare la gente,17w,31,2023-12-19
5,https://www.instagram.com/p/C0vxCg4NO0H/,chiaraferragnibrand,"December 12, 2023",Ma questa qua come può avere 29 milioni di cog...,14w,181,2024-01-09
...,...,...,...,...,...,...,...
6766,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Brava,12w,,2024-01-23
6767,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Da che pulpito viene la predica😮,11w,,2024-01-30
6768,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023",Brava Giorgia,5w,,2024-03-12
6769,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,"December 17, 2023","Giorgia ,hai fatto bene a smascherarla .",4w,,2024-03-19
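The same conversion can also be written without a row-wise `apply`. The sketch below is an alternative, not the notebook's method: it splits each value into a number and a unit with a regular expression, turns both into a time offset, and subtracts that from the collection date. The names `relative_to_absolute` and `HOURS_PER_UNIT` are illustrative; the reference date is the `collection_date` used in the cell above. Values that do not match the pattern (such as a written-out month) come back as missing instead of raising an error.

```python
import pandas as pd

COLLECTION_DATE = pd.Timestamp("2024-04-16")  # collection_date from the cell above
HOURS_PER_UNIT = {"w": 7 * 24, "d": 24, "h": 1}  # illustrative lookup table

def relative_to_absolute(series, collected=COLLECTION_DATE):
    """Convert strings like '18w', '3d', '5h' to 'YYYY-MM-DD'; anything else becomes NaN."""
    # Column 0 holds the number, column 1 the unit; non-matching rows are NaN in both.
    parts = series.astype(str).str.extract(r"^\s*(\d+)\s*([wdh])\s*$")
    amount = pd.to_numeric(parts[0], errors="coerce")
    hours = parts[1].map(HOURS_PER_UNIT)
    offset = pd.to_timedelta(amount * hours, unit="h")
    return (collected - offset).dt.strftime("%Y-%m-%d")

converted = relative_to_absolute(pd.Series(["18w", "3d", "5h", "January 25", None]))
```

Either way, the resulting dates are only as precise as the unit Instagram reports: a comment shown as `18w` could have been written on any day of that week.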


In [10]:
insta_df.drop(columns=['Post_Date', 'date'], inplace=True)
insta_df.rename(columns={'actual_date':'date'}, inplace=True)
insta_df['source_category']='Instagram'
insta_df['flag']='user_opinion'
insta_df['type_category']='comment'

In [11]:
mask_posts = david_df['type_category'] == 'post'
david_df.loc[mask_posts, 'likes'] = david_df.loc[mask_posts, 'likes'].astype(str).str.replace(',', '').astype(int)

# Replace 'Reply' with NaN in 'likes' column
david_df.loc[david_df['likes'] == 'Reply', 'likes'] = np.nan

# Convert 'likes' for comments: extract numbers and handle 'Reply'
mask_comments = david_df['type_category'] == 'comment'
david_df.loc[mask_comments, 'likes'] = david_df.loc[mask_comments, 'likes'].astype(str).str.extract(r'(\d+)')[0].astype('Int64')

In [12]:
# Define masks for posts and comments
mask_posts = david_df['type_category'] == 'post'
mask_comments = david_df['type_category'] == 'comment'

# For comments: convert relative dates with calculate_date_from_insta (defined above)
david_df.loc[mask_comments, 'actual_date'] = david_df.loc[mask_comments, 'date'].apply(calculate_date_from_insta)

# For posts: parse absolute dates with to_datetime, assuming 2024 when the year is missing.
# Caution: the regex appends " 2024" after every month name, including dates that already
# carry a year, so "December 18, 2023" is parsed as 2024-12-18 (see the first row of the output).
david_df.loc[mask_posts, 'actual_date'] = pd.to_datetime(david_df.loc[mask_posts, 'date'].replace(r'(\bJanuary\b|\bFebruary\b|\bMarch\b|\bApril\b|\bMay\b|\bJune\b|\bJuly\b|\bAugust\b|\bSeptember\b|\bOctober\b|\bNovember\b|\bDecember\b)', r'\1 2024', regex=True), errors='coerce')

# Format the dates uniformly
david_df['actual_date'] = david_df['actual_date'].dt.strftime('%Y-%m-%d')

# Clean up: drop the original 'date' column and rename 'actual_date' to 'date'
david_df.drop(columns=['date'], inplace=True)
david_df.rename(columns={'actual_date': 'date'}, inplace=True)

In [13]:
david_df = david_df[david_df['text_content'].notna()]

In [14]:
david_df

Unnamed: 0,text_content,url,post_creator,source_category,flag,type_category,likes,date
0,"Dopo le accuse per il caso Balocco, Chiara Fer...",https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,news,post,192,2024-12-18
1,È ovunque non ne posso piu,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,5,2023-12-19
2,"Io sono con la Ferragni, inoltre ha fatto semp...",https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,,2023-12-19
3,L'importante è riconoscere l'errore. Hai fatto...,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,,2023-12-19
4,👏👏👏Il tuo chiarimento e le tue scuse dimostran...,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,2,2023-12-19
...,...,...,...,...,...,...,...,...
11137,Con tutti i milioni che fa che bisogno c’era d...,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,Instagram,user opinion,comment,,2024-01-09
11138,I giornalisti sono CAROGNE patentate!!!! Ingig...,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,Instagram,user opinion,comment,,2024-01-09
11139,Truffagnez 😂😂,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,Instagram,user opinion,comment,,2024-01-09
11140,Brava adesso devolvi un MILIONE pure alla guar...,https://www.instagram.com/p/C12ObHFMAlH/,larepubblica,Instagram,user opinion,comment,,2024-01-09
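The first row of the output above carries the date 2024-12-18 for a post whose original date was "December 18, 2023", because the year 2024 is inserted after the month name whether or not a year is already present. A minimal sketch of one way to add the default year only when it is missing follows. The function name `parse_post_date` is hypothetical, and the `"%B %d, %Y"` format and the default year 2024 are assumptions based on the sample values and the code comment in the cell above.

```python
import pandas as pd

def parse_post_date(series, default_year=2024):
    """Parse 'December 18, 2023' or 'January 25' (year missing) into timestamps."""
    text = series.astype(str).str.strip()
    # A standalone four-digit number is taken to be the year.
    has_year = text.str.contains(r"\b\d{4}\b", regex=True)
    # Append the default year only to values that lack one.
    completed = text.where(has_year, text + f", {default_year}")
    # Values that do not fit the format (e.g. '17w') become NaT.
    return pd.to_datetime(completed, format="%B %d, %Y", errors="coerce")

dates = parse_post_date(pd.Series(["December 18, 2023", "January 25", "17w"]))
formatted = dates.dt.strftime("%Y-%m-%d")
```

With this variant a dated post keeps its own year, and only year-less values such as "January 25" fall back to the assumed one.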


***
### Combining Marco's and David's Instagram data

In [15]:
insta_df=pd.concat([david_df, insta_df], ignore_index=True, join='outer')
insta_df

Unnamed: 0,text_content,url,post_creator,source_category,flag,type_category,likes,date
0,"Dopo le accuse per il caso Balocco, Chiara Fer...",https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,news,post,192,2024-12-18
1,È ovunque non ne posso piu,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,5,2023-12-19
2,"Io sono con la Ferragni, inoltre ha fatto semp...",https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,,2023-12-19
3,L'importante è riconoscere l'errore. Hai fatto...,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,,2023-12-19
4,👏👏👏Il tuo chiarimento e le tue scuse dimostran...,https://www.instagram.com/reel/C0_7OWprCMd/,notizieit,Instagram,user opinion,comment,2,2023-12-19
...,...,...,...,...,...,...,...,...
16205,Brava,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,Instagram,user_opinion,comment,,2024-01-23
16206,Da che pulpito viene la predica😮,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,Instagram,user_opinion,comment,,2024-01-30
16207,Brava Giorgia,https://www.instagram.com/p/C09dYbuOJYy/,notizieit,Instagram,user_opinion,comment,,2024-03-12
16208,"Giorgia ,hai fatto bene a smascherarla .",https://www.instagram.com/p/C09dYbuOJYy/,notizieit,Instagram,user_opinion,comment,,2024-03-19
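The combined output above contains both `user opinion` and `user_opinion` in the `flag` column, and the Reddit frame uses capitalized labels such as `User Opinion` and `Comment`. If the labels are meant to be comparable across sources, they need one spelling. The sketch below shows one possible normalization; the function name `normalize_label` and the chosen convention (lower case, spaces instead of underscores) are assumptions, not something the notebook prescribes.

```python
import pandas as pd

def normalize_label(series):
    """Lower-case, trim, and replace underscores with spaces in a label column."""
    return (
        series.astype("string")
        .str.strip()
        .str.lower()
        .str.replace("_", " ", regex=False)
    )

labels = normalize_label(
    pd.Series(["User Opinion", "user_opinion", "user opinion ", "news"])
)
```

Applied to `flag` and `type_category` after each concatenation, this would make group-by counts treat the three spellings as one category.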


In [16]:
filelist=[
'Journal_Insta_29.csv',
 'Journal_Insta_30.csv',
 'Paparazzi_Insta_19.csv',
 'Paparazzi_Insta_20.csv',
 'Paparazzi_Insta_21.csv'
]

insta_df2=pd.DataFrame()
for filename in filelist:
    path=r'C:\Users\Utente\Desktop\Luiss\DS_inAction\Final Project\David_to_merge'
    df=pd.read_csv(filepath_or_buffer=path+'\\'+filename)
    insta_df2=pd.concat([insta_df2, df], ignore_index=True)
    
insta_df2

Unnamed: 0,text_content,url,post_creator,date,source_category,flag,type_category,likes
0,"📱 Il Consiglio dei ministri di oggi, 25 gennai...",https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,January 25,Instagram,news,post,1458
1,Peccato che lei odia selvaggia che è colei gra...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,12w,Instagram,user opinion,comment,63 likes
2,... e blocca tutti i commenti negativi sui suo...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,12w,Instagram,user opinion,comment,82 likes
3,Sono curioso vedere la sua strategia di rilanc...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,12w,Instagram,user opinion,comment,9 likes
4,L’ok della ferragni è l’ultimo step prima che ...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,12w,Instagram,user opinion,comment,105 likes
...,...,...,...,...,...,...,...,...
685,,https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,,Instagram,user opinion,comment,
686,"Ma solo a me non frega un caxxo? Se ha ""rubato...",https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,17w,Instagram,user opinion,comment,Reply
687,"Cioè, si era intascata i soldi raccolti con FI...",https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,17w,Instagram,user opinion,comment,Reply
688,Piccina..😒 chissà perché..!😂😂😂,https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,16w,Instagram,user opinion,comment,Reply


***
### Removing the 'Reply' placeholder and standardizing likes and dates

In [17]:
insta_df2 = insta_df2[insta_df2['text_content'].notna()]

mask_posts = insta_df2['type_category'] == 'post'
insta_df2.loc[mask_posts, 'likes'] = insta_df2.loc[mask_posts, 'likes'].astype(str).str.replace(',', '').astype(int)

# Replace 'Reply' with NaN in 'likes' column
insta_df2.loc[insta_df2['likes'] == 'Reply', 'likes'] = np.nan

# Convert 'likes' for comments: extract numbers and handle 'Reply'
mask_comments = insta_df2['type_category'] == 'comment'
insta_df2.loc[mask_comments, 'likes'] = insta_df2.loc[mask_comments, 'likes'].astype(str).str.extract(r'(\d+)')[0].astype('Int64')

# Define masks for posts and comments
mask_posts = insta_df2['type_category'] == 'post'
mask_comments = insta_df2['type_category'] == 'comment'

# Apply conversions to 'date' based on the type category
# For comments: convert relative dates with calculate_date_from_insta (defined above)
insta_df2.loc[mask_comments, 'actual_date'] = insta_df2.loc[mask_comments, 'date'].apply(calculate_date_from_insta)

# For posts: parse absolute dates with to_datetime, assuming 2024 when the year is missing.
# Caution: " 2024" is appended after every month name, even when the date already has a year.
insta_df2.loc[mask_posts, 'actual_date'] = pd.to_datetime(insta_df2.loc[mask_posts, 'date'].replace(r'(\bJanuary\b|\bFebruary\b|\bMarch\b|\bApril\b|\bMay\b|\bJune\b|\bJuly\b|\bAugust\b|\bSeptember\b|\bOctober\b|\bNovember\b|\bDecember\b)', r'\1 2024', regex=True), errors='coerce')

# Format the dates uniformly
insta_df2['actual_date'] = insta_df2['actual_date'].dt.strftime('%Y-%m-%d')

# Clean up: drop the original 'date' column and rename 'actual_date' to 'date'
insta_df2.drop(columns=['date'], inplace=True)
insta_df2.rename(columns={'actual_date': 'date'}, inplace=True)

insta_df2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  insta_df2.loc[mask_posts, 'likes'] = insta_df2.loc[mask_posts, 'likes'].astype(str).str.replace(',', '').astype(int)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  insta_df2.loc[insta_df2['likes'] == 'Reply', 'likes'] = np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  insta_df2.loc[mask_comments, 'likes'] = insta_df2.loc[mask_comments, 'likes'].astype(str).str.extract('(\d+)')[0].astype('Int64')
A value is trying to be set on a copy of a slice from a DataFrame


Unnamed: 0,text_content,url,post_creator,source_category,flag,type_category,likes,date
0,"📱 Il Consiglio dei ministri di oggi, 25 gennai...",https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,Instagram,news,post,1458,2024-01-25
1,Peccato che lei odia selvaggia che è colei gra...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,Instagram,user opinion,comment,63,2024-01-23
2,... e blocca tutti i commenti negativi sui suo...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,Instagram,user opinion,comment,82,2024-01-23
3,Sono curioso vedere la sua strategia di rilanc...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,Instagram,user opinion,comment,9,2024-01-23
4,L’ok della ferragni è l’ultimo step prima che ...,https://www.instagram.com/p/C2h-WocsG84/,open_giornaleonline,Instagram,user opinion,comment,105,2024-01-23
...,...,...,...,...,...,...,...,...
683,Solo una con poca morale si presta a questi gi...,https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,Instagram,user opinion,comment,,2023-12-12
684,Ma che bella figura di m. 🫣!,https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,Instagram,user opinion,comment,,2023-12-19
686,"Ma solo a me non frega un caxxo? Se ha ""rubato...",https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,Instagram,user opinion,comment,,2023-12-19
687,"Cioè, si era intascata i soldi raccolti con FI...",https://www.instagram.com/p/C038iTOMTwJ/,wireditalia,Instagram,user opinion,comment,,2023-12-19


In [23]:
insta_df2[insta_df2['type_category']=='comment']['likes']

1        63
2        82
3         9
4       105
5       283
       ... 
683    <NA>
684    <NA>
686    <NA>
687    <NA>
688    <NA>
Name: likes, Length: 627, dtype: object
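The `dtype: object` in the output above indicates that the `likes` column still holds generic Python objects: writing nullable integers into part of an object column through `.loc` does not change the column's dtype. A minimal sketch of one way to end up with a true `Int64` column follows, converting the whole column in a single step. The demo frame is hypothetical, and the `.copy()` after filtering is one common way to avoid pandas' SettingWithCopyWarning on later assignments.

```python
import pandas as pd

# Hypothetical rows mixing a post count, comment strings, and the 'Reply' placeholder.
raw = pd.DataFrame({
    "text_content": ["a post", "a comment", None, "another comment"],
    "type_category": ["post", "comment", "comment", "comment"],
    "likes": ["1,458", "63 likes", None, "Reply"],
})

# .copy() makes the filtered frame independent of `raw`.
clean = raw[raw["text_content"].notna()].copy()

digits = (
    clean["likes"].astype(str)
    .str.replace(",", "", regex=False)    # "1,458" -> "1458"
    .str.extract(r"(\d+)", expand=False)  # no digits (e.g. "Reply") -> NaN
)
# Convert the entire column at once so the dtype becomes Int64 rather than object.
clean["likes"] = pd.to_numeric(digits, errors="coerce").astype("Int64")
```

A numeric dtype matters later on, since aggregations such as `mean` or `sum` on an object column are slower and can fail or behave unexpectedly.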

***
# Data Cleaning and Standardization
With each source cleaned, the frames are concatenated step by step (the Instagram batches first, then Reddit, then LinkedIn), and the remaining date formats and categorical labels are standardized along the way.


In [24]:
insta_df=pd.concat([insta_df, insta_df2], ignore_index=True, join='outer')

In [25]:
insta_reddit_df=pd.concat([insta_df, reddit_df], ignore_index=True, join='outer')

In [27]:
# Save the combined Instagram + Reddit frame, as the file name indicates
insta_reddit_df.to_csv(path_or_buf='merged_instagram_reddit_posts_comments.csv', index=False, header=True)

In [28]:
filelist=['LinkedIn_1.csv',
 'LinkedIn_2.csv',
 'LinkedIn_3.csv']

linkedin_df=pd.DataFrame()
for filename in filelist:
    path=r'C:\Users\Utente\Desktop\Luiss\DS_inAction\Final Project\David_to_merge'
    df=pd.read_csv(filepath_or_buffer=path+'\\'+filename)
    linkedin_df=pd.concat([linkedin_df, df], ignore_index=True)
    
linkedin_df

Unnamed: 0,text_content,url,post_creator,date,source_category,flag,type_category
0,In Defence of Chiara Ferragni“Have they no bre...,https://www.linkedin.com/pulse/defence-chiara-...,Edoardo Sala,"December 27, 2023",LinkedIn,Opinion,Article
1,Chiara Ferragni: crisis management & brand rep...,https://www.linkedin.com/pulse/chiara-ferragni...,Francesca C.,"January 8, 2024",LinkedIn,Opinion,Article
2,Difficulties are piling up for the influencer ...,https://www.linkedin.com/pulse/difficulties-pi...,Luxury Tribune,"January 9, 2024",LinkedIn,Opinion,Article


In [30]:
linkedin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   text_content     3 non-null      object
 1   url              3 non-null      object
 2   post_creator     3 non-null      object
 3   date             3 non-null      object
 4   source_category  3 non-null      object
 5   flag             3 non-null      object
 6   type_category    3 non-null      object
dtypes: object(7)
memory usage: 300.0+ bytes


In [32]:
linkedin_df['flag']='user opinion'
linkedin_df['type_category']='article'
linkedin_df['date'] = pd.to_datetime(linkedin_df['date'])
linkedin_df['date'] = linkedin_df['date'].dt.strftime('%Y-%m-%d')

***
### The (small) LinkedIn dataset

In [33]:
linkedin_df

Unnamed: 0,text_content,url,post_creator,date,source_category,flag,type_category
0,In Defence of Chiara Ferragni“Have they no bre...,https://www.linkedin.com/pulse/defence-chiara-...,Edoardo Sala,2023-12-27,LinkedIn,user opinion,Article
1,Chiara Ferragni: crisis management & brand rep...,https://www.linkedin.com/pulse/chiara-ferragni...,Francesca C.,2024-01-08,LinkedIn,user opinion,Article
2,Difficulties are piling up for the influencer ...,https://www.linkedin.com/pulse/difficulties-pi...,Luxury Tribune,2024-01-09,LinkedIn,user opinion,Article


In [34]:
insta_reddit_linkedin_df=pd.concat([insta_reddit_df, linkedin_df], ignore_index=True, join='outer')
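The introduction mentions duplication checks on the merged data. As a hedged sketch of what such a check might look like after a concatenation, the helper below drops rows that repeat the same key columns and reports how many were removed. The function name `drop_exact_duplicates` and the choice of key columns (`url`, `text_content`, `date`) are assumptions; note that two different users posting identical text under the same post on the same day would be collapsed by this rule.

```python
import pandas as pd

def drop_exact_duplicates(df, keys=("url", "text_content", "date")):
    """Drop rows repeating the same key columns, keeping the first occurrence."""
    before = len(df)
    deduped = df.drop_duplicates(subset=list(keys), keep="first").reset_index(drop=True)
    return deduped, before - len(deduped)

# Hypothetical merged frame in which the first comment appears twice.
demo = pd.DataFrame({
    "url": ["https://example.com/p/1"] * 3,
    "text_content": ["Brava", "Brava", "Brava Giorgia"],
    "date": ["2024-01-23", "2024-01-23", "2024-03-12"],
})
deduped, removed = drop_exact_duplicates(demo)
```

Reporting the number of removed rows makes it easy to spot a file that was loaded twice.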

***
### The Twitter (X) data

In [35]:
os.listdir(r'C:\Users\Utente\Desktop\Luiss\DS_inAction\Final Project\PostTwitter')

['Twitter_adtoomuch_04_03_2024.csv',
 'Twitter_ansa_08_03_2024.csv',
 'Twitter_ansa_12_01_2024.csv',
 'Twitter_ansa_13_01_2024.csv',
 'Twitter_ansa_14_01_2024.csv',
 'Twitter_ansa_14_02_2024.csv',
 'Twitter_ANSA_27_12_2023.csv',
 'Twitter_bonifacio_castellane_06_02_2024.csv',
 'Twitter_camilla_28_02_2024.csv',
 'Twitter_change_gardiner_23_01_2024.csv',
 'Twitter_claudio_piazzotta_22_02_2024.csv',
 'Twitter_corriere_20_01_2024.csv',
 'Twitter_david_scifo_23_01_2024.csv',
 'Twitter_esterrefatta_08_03_2024.csv',
 'Twitter_esterrefatta_26_02_2024.csv',
 'Twitter_fanpage_23_01_2024.csv',
 'Twitter_fanpage_27_01_2024.csv',
 'Twitter_fanpage_28_12_2023.csv',
 'Twitter_fatto_quotidiano_03_02_2024.csv',
 'Twitter_ferdinando_cotugno_04_03_2024.csv',
 'Twitter_Ferragni_01_11_2023.csv',
 'Twitter_Ferragni_11_11_2023.csv',
 'Twitter_Ferragni_19_09_2023.csv',
 'Twitter_francesca_totolo_23_01_2024.csv',
 'Twitter_francesco_04_03_2024.csv',
 'Twitter_genio78_03_03_2024.csv',
 'Twitter_giancarlo_de_ris

In [47]:
filelist=['Twitter_adtoomuch_04_03_2024.csv',
 'Twitter_ansa_08_03_2024.csv',
 'Twitter_ansa_12_01_2024.csv',
 'Twitter_ansa_13_01_2024.csv',
 'Twitter_ansa_14_01_2024.csv',
 'Twitter_ansa_14_02_2024.csv',
 'Twitter_ANSA_27_12_2023.csv',
 'Twitter_bonifacio_castellane_06_02_2024.csv',
 'Twitter_camilla_28_02_2024.csv',
 'Twitter_change_gardiner_23_01_2024.csv',
 'Twitter_claudio_piazzotta_22_02_2024.csv',
 'Twitter_corriere_20_01_2024.csv',
 'Twitter_david_scifo_23_01_2024.csv',
 'Twitter_esterrefatta_08_03_2024.csv',
 'Twitter_esterrefatta_26_02_2024.csv',
 'Twitter_fanpage_23_01_2024.csv',
 'Twitter_fanpage_27_01_2024.csv',
 'Twitter_fanpage_28_12_2023.csv',
 'Twitter_fatto_quotidiano_03_02_2024.csv',
 'Twitter_ferdinando_cotugno_04_03_2024.csv',
 'Twitter_Ferragni_01_11_2023.csv',
 'Twitter_Ferragni_11_11_2023.csv',
 'Twitter_Ferragni_19_09_2023.csv',
 'Twitter_francesca_totolo_23_01_2024.csv',
 'Twitter_francesco_04_03_2024.csv',
 'Twitter_genio78_03_03_2024.csv',
 'Twitter_giancarlo_de_risi_04_02_2024.csv',
 'Twitter_giancarlo_de_risi_28_12_2023.csv',
 'Twitter_giorgio_gori_23_12_2023.csv',
 'Twitter_ilfattoquotidiano_03_03_2024.csv',
 'Twitter_ilgiornale_09_02_2024.csv',
 'Twitter_ilgiornale_21_03_2024.csv',
 'Twitter_IlGiornale_29_12_2023.csv',
 'Twitter_ilmaziano_22_02_2024.csv',
 'Twitter_IlPolitico_22_12_2023.csv',
 'Twitter_iltempo_24_02_2024.csv',
 'Twitter_Lercio_17_12_2023.csv',
 'Twitter_luciano_capone_21_02_2024.csv',
 'Twitter_marcello_crescentini_23_12_2023.csv',
 'Twitter_martinoloiacono_27_12_2023.csv',
 'Twitter_massimo_falcioni_22_02_2024.csv',
 'Twitter_open_25_02_2024.csv',
 'Twitter_pietro_diomede_04_03_2024.csv',
 'Twitter_repubblica_03_03_2024.csv',
 'Twitter_repubblica_12_01_2024.csv',
 'Twitter_ricpuglisi_23_12_2023.csv',
 'Twitter_ruggiero_quarto_24_02_2024.csv',
 'Twitter_sabrina_f_15_01_2024.csv',
 'Twitter_selvaggia_lucarelli_03_03_2024.csv',
 'Twitter_sofi_03_03_2024.csv',
 'Twitter_sole24ore_30_12_2023.csv',
 'Twitter_strummer_15_02_2024.csv',
 'Twitter_thegreekboy_12_01_2024.csv',
 'Twitter_tommaso_cerno_14_01_2024.csv',
 'Twitter_virna_30_01_2024.csv',
 'Twitter_vitalba_azzollini_24_03_2024.csv']

# Folder containing the scraped Twitter/X CSV files
path=r'C:\Users\Utente\Desktop\Luiss\DS_inAction\Final Project\PostTwitter'

# Read each file once and concatenate everything in a single call
twitter_df=pd.concat(
    [pd.read_csv(os.path.join(path, filename)) for filename in filelist],
    ignore_index=True)

twitter_df

Unnamed: 0,text_content,url,date,source_category,flag,type_category,post_creator
0,C’è stato un fraintendimento,https://twitter.com/adtoomuch/status/176460349...,"11:47 AM · Mar 4, 2024",X,user opinion,post,ADTOOMUCH
1,Che poi il problema non era il cartiglio in sé...,https://twitter.com/adtoomuch/status/176460349...,Mar 4,X,user opinion,comment,ADTOOMUCH
2,Brava ahahhah,https://twitter.com/adtoomuch/status/176460349...,Mar 4,X,user opinion,comment,ADTOOMUCH
3,"Quello è dovuto ad un accordo con la Balocco, ...",https://twitter.com/adtoomuch/status/176460349...,Mar 4,X,user opinion,comment,ADTOOMUCH
4,Un accordo comunque truffaldino. E comunque no...,https://twitter.com/adtoomuch/status/176460349...,Mar 5,X,user opinion,comment,ADTOOMUCH
...,...,...,...,...,...,...,...
2825,As usual…,https://twitter.com/vitalbaa/status/1771813607...,Mar 24,X,user opinion,comment,Vitalba Azzollini
2826,A \n@QRepubblica\n Fittipaldi racconta che a s...,https://twitter.com/vitalbaa/status/1771813607...,15h,X,user opinion,comment,Vitalba Azzollini
2827,Questa mattina è accaduta una cosa meraviglios...,https://twitter.com/vitalbaa/status/1771813607...,20h,X,user opinion,comment,Vitalba Azzollini
2828,Se pensi sticazzi condividi,https://twitter.com/vitalbaa/status/1771813607...,16h,X,user opinion,comment,Vitalba Azzollini


In [48]:
from datetime import datetime

def parse_twitter_date(date_str):
    """Convert a scraped Twitter/X date string to 'YYYY-MM-DD'.

    Relative timestamps such as '15h' return 'REMOVE';
    strings that cannot be parsed return None.
    """
    # Relative timestamps: digits followed by 'h' (e.g. '15h')
    if any(token[:-1].isdigit() and token.endswith('h') for token in date_str.split()):
        return 'REMOVE'
    # Full timestamps look like '11:47 AM · Mar 4, 2024': keep only the date part
    if '·' in date_str:
        date_str = date_str.split('·')[1].strip()
    # Short dates such as 'Mar 4' carry no year: December is assigned to 2023,
    # every other month to 2024
    if ',' not in date_str:
        year = '2024' if not date_str.startswith('Dec') else '2023'
        date_str = f"{date_str}, {year}"
    try:
        date_obj = datetime.strptime(date_str, "%b %d, %Y")
        return date_obj.strftime('%Y-%m-%d')
    except ValueError:
        return None

# Apply the function to the 'date' column
twitter_df['date'] = twitter_df['date'].apply(parse_twitter_date)

# Turn the 'REMOVE' marker into NaN (unparseable dates are already None)
twitter_df['date'] = twitter_df['date'].replace('REMOVE', np.nan)

# Drop rows where 'date' is missing
twitter_df.dropna(subset=['date'], inplace=True)

# Display the first dates to verify the changes
print(twitter_df['date'].head(20))

0     2024-03-04
1     2024-03-04
2     2024-03-04
3     2024-03-04
4     2024-03-05
5     2024-03-04
6     2024-03-05
7     2024-03-05
8     2024-03-04
9     2024-03-05
10    2024-03-05
11    2024-03-05
12    2024-03-04
13    2024-03-04
14    2024-03-04
15    2024-03-04
16    2024-03-04
17    2024-03-04
18    2024-03-05
19    2024-03-05
Name: date, dtype: object
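
The year rule in `parse_twitter_date` is hard-coded (December maps to 2023, every other month to 2024). As a possible alternative, the sketch below infers the year from the `DD_MM_YYYY` suffix of each file name. It rests on two assumptions that the notebook does not state explicitly: that the suffix is the date of the original post (the sample rows are consistent with this), and that comments are never older than the post they belong to. The helper names `post_date_from_filename` and `resolve_short_date` are hypothetical.

```python
import re
from datetime import datetime

def post_date_from_filename(filename):
    """Read the trailing DD_MM_YYYY token of a file name (assumed to be the post date)."""
    match = re.search(r'(\d{2})_(\d{2})_(\d{4})\.csv$', filename)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    return datetime(year, month, day)

def resolve_short_date(date_str, posted_on):
    """Turn a year-less date such as 'Mar 5' into 'YYYY-MM-DD'.

    The year of the post is tried first; if the result would fall before
    the post date, the following year is tried instead.
    """
    for year in (posted_on.year, posted_on.year + 1):
        try:
            candidate = datetime.strptime(f"{date_str.strip()} {year}", "%b %d %Y")
        except ValueError:
            # Not a 'Mon D' string (e.g. '15h'), or a day that does not exist in this year
            continue
        if candidate >= posted_on:
            return candidate.strftime('%Y-%m-%d')
    return None

posted_on = post_date_from_filename('Twitter_adtoomuch_04_03_2024.csv')
print(resolve_short_date('Mar 5', posted_on))
print(resolve_short_date('Jan 2', datetime(2023, 12, 28)))
```

With this approach a comment dated `Jan 2` under a post from late December 2023 would be placed in 2024 without a month-specific rule. Note that `%b` depends on the active locale, so the month abbreviations are assumed to be English.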


In [49]:
twitter_df

Unnamed: 0,text_content,url,date,source_category,flag,type_category,post_creator
0,C’è stato un fraintendimento,https://twitter.com/adtoomuch/status/176460349...,2024-03-04,X,user opinion,post,ADTOOMUCH
1,Che poi il problema non era il cartiglio in sé...,https://twitter.com/adtoomuch/status/176460349...,2024-03-04,X,user opinion,comment,ADTOOMUCH
2,Brava ahahhah,https://twitter.com/adtoomuch/status/176460349...,2024-03-04,X,user opinion,comment,ADTOOMUCH
3,"Quello è dovuto ad un accordo con la Balocco, ...",https://twitter.com/adtoomuch/status/176460349...,2024-03-04,X,user opinion,comment,ADTOOMUCH
4,Un accordo comunque truffaldino. E comunque no...,https://twitter.com/adtoomuch/status/176460349...,2024-03-05,X,user opinion,comment,ADTOOMUCH
...,...,...,...,...,...,...,...
2821,Le solite leggette di propaganda della destra.,https://twitter.com/vitalbaa/status/1771813607...,2024-03-24,X,user opinion,comment,Vitalba Azzollini
2822,Ma si il disegno di legge finirà in qualche ca...,https://twitter.com/vitalbaa/status/1771813607...,2024-03-24,X,user opinion,comment,Vitalba Azzollini
2823,Ce lo aspettavamo?\nSi.,https://twitter.com/vitalbaa/status/1771813607...,2024-03-25,X,user opinion,comment,Vitalba Azzollini
2824,Visto il comma 10 tris dell'art. 10 della legg...,https://twitter.com/vitalbaa/status/1771813607...,2024-03-24,X,user opinion,comment,Vitalba Azzollini


In [50]:
insta_reddit_linkedin_twitter_df=pd.concat([insta_reddit_linkedin_df, twitter_df], ignore_index=True, join='outer')

In [51]:
insta_reddit_linkedin_twitter_df['source_category'].unique()

array(['Instagram', 'Reddit', 'LinkedIn', 'X'], dtype=object)

In [52]:
insta_reddit_linkedin_twitter_df.to_csv(path_or_buf='insta_reddit_linkedin_twitter_scraping.csv', index=False, header=True)

In [53]:
articles_df=pd.read_csv('merged_scraped_articles.csv')
articles_df

Unnamed: 0,text_content,url,date,source_category,flag,type_category
0,Analista studia il caso Ferragni: quanti milio...,https://www.stranotizie.it/analista-studia-il-...,2024-03-14,stranotizie,news,article
1,Chiara Ferragni intervista Fabio Fazio e Miche...,https://www.vogue.it/article/chiara-ferragni-i...,2021-10-05,vogue,news,article
2,Balocco-Ferragni ancora insieme: arriva il pan...,https://www.lercio.it/balocco-ferragni-ancora-...,2023-12-17,lercio,news,article
3,"Caso Ferragni, Balocco al Codacons: ecco perch...",https://www.repubblica.it/cronaca/2024/01/14/n...,2024-01-14,repubblica,news,article
4,"Chiara Ferragni-Balocco, l'Antitrust: «Commist...",https://www.leggo.it/gossip/fedez_ferragni/chi...,2024-04-17,leggo,news,article
...,...,...,...,...,...,...
93,Abstract: A worldwide media storm hit hard the...,https://www.lexology.com/library/detail.aspx?g...,2024-02-07,lexology,news,article
94,Italian influencer Chiara Ferragni sorry for h...,https://www.bbc.co.uk/news/world-europe-67759633,2023-12-19,bbc,news,article
95,"EntertainmentJanuary 05 2024Chiara Ferragni, C...",https://www.napolike.com/chiara-ferragni-coca-...,2024-01-05,napolike,news,article
96,A popular Italian influencer has been placed u...,https://www.hurriyetdailynews.com/italian-infl...,2024-01-10,hurriyetdailynews,news,article


In [56]:
# Reset the index so that row labels are unique across all sources
merged_df=pd.concat([insta_reddit_linkedin_twitter_df, articles_df], ignore_index=True)
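
The introduction of this notebook mentions duplication checks on the merged data. A minimal sketch of one such check is shown below on a toy frame that reuses the notebook's column names. The helper `find_duplicate_rows` and the choice of `text_content`, `url` and `date` as the identifying columns are assumptions, not the check that was actually run.

```python
import pandas as pd

def find_duplicate_rows(df, keys=('text_content', 'url', 'date')):
    """Return every row that shares the same key values with at least one other row."""
    mask = df.duplicated(subset=list(keys), keep=False)
    return df[mask]

# Toy data standing in for merged_df
toy_df = pd.DataFrame({
    'text_content': ['Brava', 'Brava', 'As usual'],
    'url': ['u1', 'u1', 'u2'],
    'date': ['2024-03-04', '2024-03-04', '2024-03-24'],
    'source_category': ['X', 'X', 'X'],
})

duplicates = find_duplicate_rows(toy_df)

# Keep the first occurrence of each duplicated row and renumber the index
deduplicated = toy_df.drop_duplicates(
    subset=['text_content', 'url', 'date'], ignore_index=True)

print(len(duplicates), len(deduplicated))
```

Inspecting `duplicates` before dropping anything makes it easier to tell genuine repeats (the same comment scraped twice) from short comments that legitimately occur more than once.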

***
# The Structure of Our Dataset and the Sources of Our Data

In [73]:
merged_df.groupby('source_category').count()

source_category,text_content,url,post_creator,flag,type_category,likes,date
Instagram,16842,16842,16842,16842,16842,9244,16815
LinkedIn,3,3,3,3,3,0,3
Reddit,2596,2596,0,2596,2596,2596,2596
X,2782,2782,1946,2782,2782,0,2782
aboutresilience,1,1,0,1,1,0,1
acrimonia,1,1,0,1,1,0,1
agenzianova,1,1,0,1,1,0,1
agi,1,1,0,1,1,0,1
ansa,11,11,0,11,11,0,11
apnews,1,1,0,1,1,0,1
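
The counts above show that `likes` and `post_creator` are empty for several sources. To make such gaps easier to compare, the share of missing values per source can be computed directly. The sketch below uses a toy frame with invented values and a hypothetical helper, `missing_share_by_source`; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for merged_df (values are illustrative only)
toy_df = pd.DataFrame({
    'source_category': ['Instagram', 'Instagram', 'X', 'ansa'],
    'text_content': ['a', 'b', 'c', 'd'],
    'likes': [10, np.nan, np.nan, np.nan],
})

def missing_share_by_source(df, group_col='source_category'):
    """Fraction of missing values in each column, computed per source."""
    is_missing = df.drop(columns=group_col).isna()
    return is_missing.groupby(df[group_col]).mean()

missing_share = missing_share_by_source(toy_df)
print(missing_share)
```

A value of 1.0 means the column is entirely missing for that source, which would correspond to a count of 0 in the table above.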


***
# Data Export

In [57]:
merged_df.to_csv(path_or_buf='scraped_articles_instagram_reddit_linkedin_twitter.csv', index=False, header=True)