***
# Article Scraping 
This notebook gathers and processes article data from a range of online news sources using Crawlbase web scraping. The work relies on Crawlbase's CrawlingAPI and ScraperAPI to extract the relevant web content. The extracted data is intended for analysis of media coverage and sentiment around specific trending topics.


We use the following libraries:
CrawlingAPI and ScraperAPI (Crawlbase): APIs that provide the web scraping tools.
BeautifulSoup: Parses HTML content so that structured data can be extracted from web pages.
Supporting Python libraries: `requests` (third-party) handles HTTP requests and `datetime` (standard library) handles dates during scraping and processing.

Data Collection: Automated scripts collect article URLs from targeted sources, then filter and deduplicate them to keep the selection relevant.
Data Cleaning and Standardization: The raw data is cleaned to standardize date formats and textual content.


In [34]:
from crawlbase import CrawlingAPI, ScraperAPI, LeadsAPI, ScreenshotsAPI, StorageAPI
import re
import requests
import pandas as pd
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import datetime

In [2]:
scraper_api = ScraperAPI({ 'token': 'tehpiQNcS1QfnF-eVazfMQ' })

In [3]:
# Regular expression for matching URLs
url_regex = r'https?://[^\s]+'

# List to store found URLs
found_urls = []

# Open and read the file
with open('articles_to_scrape.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    
    # Find all URLs using the regular expression
    found_urls = re.findall(url_regex, content)

# Print the found URLs
for url in found_urls:
    print(url)

https://www.lercio.it/balocco-ferragni-ancora-insieme-arriva-il-pandoro-per-pagare-la-multa-dellantitrust/
https://www.ilgiornale.it/news/nazionale/ora-anche-fedez-perde-followers-2261517.html?utm_source=instagram&utm_medium=social&utm_campaign=newsfeed
https://www.ansa.it/sito/notizie/cronaca/2023/12/27/gdf-acquisisce-materiale-antitrust-su-caso-ferragni-balocco_b96544c9-a9cd-404b-80fc-08a064d62657.html
https://www.fanpage.it/milano/chiara-ferragni-pandoro-balocco-beneficienza-procura-indagine/
https://www.ilsole24ore.com/art/caso-balocco-ferragni-sono-quattro-procure-che-indagano-AFtUNfCC?utm_medium=TWSole24Ore&utm_source=Twitter#Echobox=1703923266-2
https://www.ansa.it/lombardia/notizie/2024/01/12/caso-ferragni-al-codacons-250-segnalazioni-da-consumatori_bd8b9eff-433e-4899-8b71-bfea5ede659d.html
https://www.ilfattoquotidiano.it/2024/01/15/balocco-replica-a-codacons-sul-caso-ferragni-costo-del-pandoro-giustificato-da-elementi-peculiari-ecco-quali/7410561/?utm_term=Autofeed&utm_campai

In [4]:
filtered_urls = [url for url in found_urls if not ('reddit.com' in url or 'instagram.com' in url)]
filtered_urls = list(set(filtered_urls))
for url in filtered_urls:
    print(url)
        
print(f'{len(filtered_urls)} URLs to look up')

https://www.stranotizie.it/analista-studia-il-caso-ferragni-quanti-milioni-di-euro-avrebbe-perso-dopo-lo-scandalo-balocco/
https://www.vogue.it/article/chiara-ferragni-intervista-crisi-social-privacy
https://www.lercio.it/balocco-ferragni-ancora-insieme-arriva-il-pandoro-per-pagare-la-multa-dellantitrust/
https://www.repubblica.it/cronaca/2024/01/14/news/ferragni_pandoro_risposta_balocco_a_codacons-421874981/
https://www.leggo.it/gossip/fedez_ferragni/chiara_ferragni_balocco_antitrust_ultimissime_oggi_17_4_2024-8062634.html
https://milano.corriere.it/notizie/cronaca/24_febbraio_24/chiara-ferragni-intervista-4808c99d-7e19-4b1c-ae33-d5920ab70xlk.shtml
https://www.ilmessaggero.it/italia/chiara_ferragni_truffa_pandoro_avvocati_procura_domumenti_joker_copertina_cosa_succede-7982692.html
https://www.ilgiornale.it/news/nazionale/chiara-ferragni-quasi-380mila-followers-persi-due-mesi-2281291.html?utm_source=social&utm_medium=social&utm_campaign=newsfeed
https://www.dailymail.co.uk/news/article
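
Several of the collected URLs carry tracking parameters (`utm_source`, `utm_medium`, `fbclid`) or fragments, so `set()` may treat two links to the same article as different entries. A minimal sketch of a normalization step, assuming it is acceptable to drop those parameters; `normalize_url` and `TRACKING_PREFIXES` are hypothetical names, not part of the notebook:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical setting: query keys treated as tracking noise (an assumption)
TRACKING_PREFIXES = ('utm_', 'fbclid')

def normalize_url(url):
    """Drop tracking query parameters and the fragment from a URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.lower().startswith(TRACKING_PREFIXES)]
    return urlunparse(parts._replace(query=urlencode(kept), fragment=''))

urls = [
    'https://www.ilgiornale.it/news/nazionale/ora-anche-fedez-perde-followers-2261517.html?utm_source=instagram&utm_medium=social&utm_campaign=newsfeed',
    'https://www.ilgiornale.it/news/nazionale/ora-anche-fedez-perde-followers-2261517.html',
]
# Both variants collapse to the same normalized URL
unique_urls = sorted({normalize_url(u) for u in urls})
print(unique_urls)
```

Normalizing before deduplicating also keeps the stored `url` column free of campaign parameters, which may make later joins between datasets more reliable.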

In [5]:
response = scraper_api.get('https://www.ansa.it/sito/notizie/cronaca/2023/12/15/antitrustmaxi-multa-a-chiara-ferragni-e-balocco-per-pandoro_17fa1374-8eee-4ca5-82ee-09c290121ae9.html?fbclid=IwAR04__zM6OIlDVF9onvVsiWuYKdIsuaVNapiRl3mn_b09rN_kuWBvW818-Y')
if response['status_code'] == 200:
    print(response)  

{'headers': {'original_status': '200', 'pc_status': '200', 'url': 'https://www.ansa.it/sito/notizie/cronaca/2023/12/15/antitrustmaxi-multa-a-chiara-ferragni-e-balocco-per-pandoro_17fa1374-8eee-4ca5-82ee-09c290121ae9.html?fbclid=IwAR04__zM6OIlDVF9onvVsiWuYKdIsuaVNapiRl3mn_b09rN_kuWBvW818-Y'}, 'status_code': 200, 'body': b'{"remaining_requests":958,"original_status":200,"pc_status":200,"url":"https://www.ansa.it/sito/notizie/cronaca/2023/12/15/antitrustmaxi-multa-a-chiara-ferragni-e-balocco-per-pandoro_17fa1374-8eee-4ca5-82ee-09c290121ae9.html?fbclid=IwAR04__zM6OIlDVF9onvVsiWuYKdIsuaVNapiRl3mn_b09rN_kuWBvW818-Y","body":{"alert":"A generic web scraper has been selected. Please contact support if you require a more detailed scraper for your given URL.","title":"Pasticcio di Natale Balocco-Ferragni, multa Antitrust - Notizie - Ansa.it","favicon":"https://www.ansa.it/sito/imgnew/favicon.svg","meta":{"description":"L\'influencer: \'Ingiusta la decisione, la impugner\xc3\xb2\' (ANSA)","keyword

In [7]:
response['json'].get('content')

'Se hai scelto di non accettare i cookie di profilazione e tracciamento, puoi aderire all’abbonamento "Consentless" a un costo molto accessibile, oppure scegliere un altro abbonamento per accedere ad ANSA.it. e 10 contenuti ogni 30 giorni a €16,99/anno Servizio equivalente a quello accessibile prestando il consenso ai cookie di profilazione pubblicitaria e tracciamento Durata annuale (senza rinnovo automatico) Un pop-up ti avvertirà che hai raggiunto i contenuti consentiti in 30 giorni (potrai continuare a vedere tutti i titoli del sito, ma per aprire altri contenuti dovrai attendere il successivo periodo di 30 giorni) Pubblicità presente ma non profilata o gestibile mediante il pannello delle preferenze Iscrizione alle Newsletter tematiche curate dalle redazioni ANSA. Per accedere senza limiti a tutti i contenuti di ANSA.it Scegli il piano di abbonamento più adatto alle tue esigenze. Se hai cambiato idea e non ti vuoi abbonare, puoi sempre esprimere il tuo consenso ai cookie di profil

In [8]:
article_details = []

for url in filtered_urls:    
    response = scraper_api.get(url)
    if response['status_code'] == 200:
        title = response['json'].get('title', '')
        content = response['json'].get('content', '')
        text_content = title + " " + content
        
        details = {
            'text_content': text_content,  
            'url': url,
            'date': '',
            'source_category': '',
            'flag': 'news', 
            'type_category': 'article'
        }
        article_details.append(details)
    else:
        print(f"Could not scrape {url}, code {response['status_code']}")

# Creating a pandas DataFrame
df = pd.DataFrame(article_details, columns=['text_content', 'url', 'date', 'source_category', 'flag', 'type_category'])

Could not scrape https://www.nytimes.com/2024/01/14/style/influencers-scandal-chiara-ferragni-milan-fashion-week.html, code 400
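
The loop above prints failed URLs but does not keep them, so they have to be copied from the output to be handled by hand. A possible variation records the failures for later follow-up. It is sketched here with a stand-in `fetch` callable so that it runs without a Crawlbase token; `scrape_all` and `fake_fetch` are hypothetical names, the URLs are placeholders, and in the notebook `fetch` would be `scraper_api.get`:

```python
def scrape_all(urls, fetch):
    """Split URLs into scraped records and (url, status_code) failures."""
    records, failures = [], []
    for url in urls:
        response = fetch(url)
        if response['status_code'] == 200:
            data = response['json']
            records.append({
                'text_content': f"{data.get('title', '')} {data.get('content', '')}".strip(),
                'url': url,
            })
        else:
            failures.append((url, response['status_code']))
    return records, failures

# Stand-in for scraper_api.get, only to make the sketch runnable offline
def fake_fetch(url):
    if 'nytimes.com' in url:
        return {'status_code': 400, 'json': {}}
    return {'status_code': 200, 'json': {'title': 'Title', 'content': 'Body'}}

records, failures = scrape_all(
    ['https://example.com/article', 'https://www.nytimes.com/example'],
    fake_fetch,
)
print(failures)
```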


In [9]:
df

Unnamed: 0,text_content,url,date,source_category,flag,type_category
0,Analista studia il caso Ferragni: quanti milio...,https://www.stranotizie.it/analista-studia-il-...,,,news,
1,Chiara Ferragni intervista Fabio Fazio e Miche...,https://www.vogue.it/article/chiara-ferragni-i...,,,news,
2,Balocco-Ferragni ancora insieme: arriva il pan...,https://www.lercio.it/balocco-ferragni-ancora-...,,,news,
3,"Caso Ferragni, Balocco al Codacons: ecco perch...",https://www.repubblica.it/cronaca/2024/01/14/n...,,,news,
4,"Chiara Ferragni-Balocco, l'Antitrust: «Commist...",https://www.leggo.it/gossip/fedez_ferragni/chi...,,,news,
5,"Chiara Ferragni: «Sono imperfetta anche io, ma...",https://milano.corriere.it/notizie/cronaca/24_...,,,news,
6,"Ferragni, avvocati in Procura per depositare d...",https://www.ilmessaggero.it/italia/chiara_ferr...,,,news,
7,"Chiara Ferragni, quasi 380mila followers persi...",https://www.ilgiornale.it/news/nazionale/chiar...,,,news,
8,Italy's biggest influencer Chiara Ferragni now...,https://www.dailymail.co.uk/news/article-12941...,,,news,
9,Non solo la Ferragni in caduta libera. Ora anc...,https://www.ilgiornale.it/news/nazionale/ora-a...,,,news,


***
### Identifying and fixing missing dates and source categories

In [10]:
df['type_category']='article'

In [11]:
df

Unnamed: 0,text_content,url,date,source_category,flag,type_category
0,Analista studia il caso Ferragni: quanti milio...,https://www.stranotizie.it/analista-studia-il-...,,,news,article
1,Chiara Ferragni intervista Fabio Fazio e Miche...,https://www.vogue.it/article/chiara-ferragni-i...,,,news,article
2,Balocco-Ferragni ancora insieme: arriva il pan...,https://www.lercio.it/balocco-ferragni-ancora-...,,,news,article
3,"Caso Ferragni, Balocco al Codacons: ecco perch...",https://www.repubblica.it/cronaca/2024/01/14/n...,,,news,article
4,"Chiara Ferragni-Balocco, l'Antitrust: «Commist...",https://www.leggo.it/gossip/fedez_ferragni/chi...,,,news,article
5,"Chiara Ferragni: «Sono imperfetta anche io, ma...",https://milano.corriere.it/notizie/cronaca/24_...,,,news,article
6,"Ferragni, avvocati in Procura per depositare d...",https://www.ilmessaggero.it/italia/chiara_ferr...,,,news,article
7,"Chiara Ferragni, quasi 380mila followers persi...",https://www.ilgiornale.it/news/nazionale/chiar...,,,news,article
8,Italy's biggest influencer Chiara Ferragni now...,https://www.dailymail.co.uk/news/article-12941...,,,news,article
9,Non solo la Ferragni in caduta libera. Ora anc...,https://www.ilgiornale.it/news/nazionale/ora-a...,,,news,article


In [12]:
new_data=pd.DataFrame([{
    'text_content': '''Besieged Influencer Chiara Ferragni Is the Talk of Milan Fashion Week Embroiled in a scandal, the social media queen faces criminal charges, sponsor defection and a loss of faith among her 29 million Instagram followers. What becomes of a famous influencer when followers suddenly drop her by the hundreds of thousands, sponsors start running for the exits and the reputation underpinning all that influence is suddenly derailed? That question was on a lot of minds during men’s fashion week in Milan, where even excited chatter about a surprise front-row appearance by Jeff Bezos and his fiancée, Lauren Sánchez, at the Dolce & Gabbana show was quickly overtaken by whispered updates on the weird case of Chiara Ferragni. As many are aware, Ms. Ferragni is a digital entrepreneur with her own production agency, her own Prime Video series, 29 million Instagram followers and more problems at the moment than the best glow-up can conceal. By far Italy’s most glamorous and celebrated influencer, Ms. Ferragni, 36, pioneered the business of self-branding and paid posts in Milan, starting as a fashion blogger before shifting her storytelling from documenting designer get-ups and handbags and her tastes in nail art, makeup palettes and drip coffee to tracking every dimension of her personal life. Following her marriage to the Italian rapper Federico Leonardo Lucia — stage name Fedez — in 2018, she went on to showcase her daily existence minutely across social media. Using a variety of formats and platforms, Ms. Ferragni posted details of her idyllic wedding and subsequent relationship bobbles, including sessions in couples therapy. When she became pregnant, she posted prenatal ultrasounds of her children, Vittoria and Leone, and would later track their growth spurts and elf costumes in so much detail it attracted the attention of child welfare advocates. 
She deployed social media to track her husband’s cancer treatment and family ski holidays in the Alps, and she routinely beckoned followers into the closet of a multimillion-dollar penthouse she owns in the luxury CityLife complex in central Milan as well as a retreat they recently constructed near George Clooney’s villa on the shores of Lake Como. Then, abruptly, in the weeks before Christmas, Ms. Ferragni went dark. The reasons became clear when accusations resurfaced of a charitable giving scam involving her and first reported on in December 2022. “Italian social media star faces sticky questions over charity cake fraud,” The Financial Times wrote of an investigation by the Milan prosecutor’s office into the sale of a pink-boxed Christmas cake, or pandoro, produced that year by the venerable Balocco bakery conglomerate and branded with Ms. Ferragni’s name. Consumers flocked to buy the traditional cake despite its $10 price tag — more than two and a half times the cost of a normal pandoro — lured by the Ferragni association but also the influencer’s implication on social media posts that money from the sales would be directed toward buying equipment for a children’s cancer hospital. As it happened, the cause the influencer had earmarked for charity was herself. (The Balocco bakery donated money to the hospital months before the cakes were ever put on sale.) “In reality, Chiara Ferragni earned a million euros for putting her name on the pandoro,” Selvaggia Lucarelli, a journalist who first reported the story in the newspaper Domani, wrote in an email. But in 2022 “the story had limited media resonance,” she said, “because Chiara Ferragni was powerful and untouchable.” But that was before an Italian consumer group brought a class-action suit against her, and local officials initiated a criminal investigation charging her with aggravated fraud. Ms. 
Ferragni, while appealing the class-action fine, would ultimately donate a substantial sum to a women’s rights association to fund anti-violence centers and would claim that pandoro-gate was the result of a “communication error.” Despite these gestures of contrition, she found herself abandoned by major sponsors like Coca-Cola, the eyewear giant Safilo and then by followers in the hundreds of thousands. When, at last, Ms. Ferragni resurfaced on Instagram, it was to post a distinctly unglamorous mea culpa reel. In it, the social media star appeared starkly deglamorized, wearing scant makeup and dressed in a drab gray shirt resembling prison garb, to issue a public apology and to announce a genuine personal donation, this one a million euros to the Regina Margherita Hospital in Turin, Italy. Even at that, it took no time for critics to seize on her misjudged optics, noting that the shirt she was wearing sells for 600 euros and is cashmere, and for memes to proliferate poking fun at her wardrobe choices, her no longer exalted status as one half of a glamour couple known as “Ferragnez” and even the family dog. “Unfortunately, she must have bad people around her and made all the wrong choices,” said Raffaello Napoleone, the chief executive of Pitti Immagine, the Italian fashion and design trade group, before the Neil Barrett men's wear show on Saturday. “When you make an apology, you have to appear as you really are. She appeared as a nun, and she is not a nun.” It was no help to Ms. Ferragni’s cause that Giorgia Meloni, Italy’s prime minister, took to attacking her in fiery public speeches as an affront to decency, honesty and the very core of “Italianity,” or Italianness. And in that, some saw not only political opportunism (Fedez has been a vocal critic of the right-wing leader), but also “more than a whiff of misogyny,” as one fashion critic noted at the Dsquared show, speaking anonymously in adherence with her publication’s employment guidelines. 
“Yes, she overstepped, as maybe all the influencers overstep by becoming gurus,” the critic added. “But the woman-on-woman attack feeds into a prevailing anti-feminist rhetoric.” Whether that is so, there is no doubting the pass-along effects Ms. Ferragni’s missteps have had on what Rupert Younger, the director of the Center for Corporate Reputation at the University of Oxford, termed in The Financial Times as “reputational risk.” Consider that, despite deep seasonal discounts posted on the shiny branded goods that fill the racks and windows of the slick Ferragni store near Corso Como this busy past weekend, there was barely a shopper to be seen. No one around was spotted toting one of Chiara Ferragni’s signature bags with their logo of an oversize eye. The scene was similar at the Ferragni outpost in Rome, according to a report in La Repubblica. Gawkers stopped to gape at the windows. Then they walked right past. “The criminal charges are not the most important part of the story,” Ms. Lucarelli, the journalist, said. “They might lead to nothing.” What is important to watch, she said, is what happens after a powerful influencer’s reputation falters and with it the “adoration of her followers, who felt betrayed.” What, in other words, is left to sell once you’ve sold out?''',  
    'url': 'https://www.nytimes.com/2024/01/14/style/influencers-scandal-chiara-ferragni-milan-fashion-week.html',
    'date': '2024-01-14',
    'source_category': 'nytimes',
    'flag': 'news',
    'type_category': 'article'
}])

df = pd.concat([df, new_data], ignore_index=True)

In [31]:
print(df)

                                         text_content  \
0   Analista studia il caso Ferragni: quanti milio...   
1   Chiara Ferragni intervista Fabio Fazio e Miche...   
2   Balocco-Ferragni ancora insieme: arriva il pan...   
3   Caso Ferragni, Balocco al Codacons: ecco perch...   
4   Chiara Ferragni-Balocco, l'Antitrust: «Commist...   
5   Chiara Ferragni: «Sono imperfetta anche io, ma...   
6   Ferragni, avvocati in Procura per depositare d...   
7   Chiara Ferragni, quasi 380mila followers persi...   
8   Italy's biggest influencer Chiara Ferragni now...   
9   Non solo la Ferragni in caduta libera. Ora anc...   
10  «Molto Fake e poco Chiara». Selvaggia Lucarell...   
11  Cover Espresso-Ferragni, l'inchiesta: "È a cap...   
12  Fashion influencer Chiara Ferragni apologises ...   
13  Gdf acquisisce materiale antitrust su caso Fer...   
14  Da Ferragni al Ddl "Beneficenza": serviva una ...   
15  Italian cake maker in influencer charity scand...   
16  Chiara Ferragni in lacrime 

In [71]:
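def extract_info(url):
    """Return the first label of the host name, without a leading 'www.'."""
    parsed_url = urlparse(url)
    # Note: for hosts with a subdomain (e.g. milano.corriere.it) this yields the subdomain
    domain = parsed_url.netloc.replace('www.', '').split('.')[0]
    return domain

df['source_category'] = df['url'].apply(extract_info)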
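
Taking the first label of the host works for `www.repubblica.it` but returns the subdomain for hosts such as `milano.corriere.it` or `tg24.sky.it`. A rough alternative is sketched below, assuming a small hand-maintained set of two-part suffixes is enough for this corpus. `extract_site_name` and `TWO_PART_SUFFIXES` are hypothetical names; a complete solution would consult the Public Suffix List.

```python
from urllib.parse import urlparse

# Assumption: only these multi-label public suffixes occur in the corpus
TWO_PART_SUFFIXES = {'co.uk'}

def extract_site_name(url):
    """Return the label just left of the public suffix, e.g. 'corriere'."""
    labels = urlparse(url).netloc.lower().split('.')
    suffix_len = 2 if '.'.join(labels[-2:]) in TWO_PART_SUFFIXES else 1
    # Guard against hosts that consist of the suffix only
    if len(labels) <= suffix_len:
        return labels[0]
    return labels[-(suffix_len + 1)]

for u in ['https://milano.corriere.it/notizie/',
          'https://www.dailymail.co.uk/news/',
          'https://tg24.sky.it/cronaca/']:
    print(extract_site_name(u))
```

Note that this maps `tg24.sky.it` to `sky` rather than `tg24`, so it would change some of the category labels shown in this notebook.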

In [22]:
df['url'].values

array(['https://www.stranotizie.it/analista-studia-il-caso-ferragni-quanti-milioni-di-euro-avrebbe-perso-dopo-lo-scandalo-balocco/',
       'https://www.vogue.it/article/chiara-ferragni-intervista-crisi-social-privacy',
       'https://www.lercio.it/balocco-ferragni-ancora-insieme-arriva-il-pandoro-per-pagare-la-multa-dellantitrust/',
       'https://www.repubblica.it/cronaca/2024/01/14/news/ferragni_pandoro_risposta_balocco_a_codacons-421874981/',
       'https://www.leggo.it/gossip/fedez_ferragni/chiara_ferragni_balocco_antitrust_ultimissime_oggi_17_4_2024-8062634.html',
       'https://milano.corriere.it/notizie/cronaca/24_febbraio_24/chiara-ferragni-intervista-4808c99d-7e19-4b1c-ae33-d5920ab70xlk.shtml',
       'https://www.ilmessaggero.it/italia/chiara_ferragni_truffa_pandoro_avvocati_procura_domumenti_joker_copertina_cosa_succede-7982692.html',
       'https://www.ilgiornale.it/news/nazionale/chiara-ferragni-quasi-380mila-followers-persi-due-mesi-2281291.html?utm_source=social&ut

In [27]:
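def fetch_date_from_webpage(url):
    """Return the publication date from the page's meta tags, or None if unavailable."""
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    # Prefer the Open Graph publication time, then fall back to a generic date meta tag
    date_tag = soup.find('meta', {'property': 'article:published_time'}) or soup.find('meta', {'name': 'date'})
    if date_tag and date_tag.get('content'):
        return pd.to_datetime(date_tag['content']).date()
    return None

# Fill in the date for every row whose page exposes one
for index, row in df.iterrows():
    fetched_date = fetch_date_from_webpage(row['url'])
    if fetched_date:
        df.at[index, 'date'] = fetched_date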

In [29]:
df[['url', 'date']]

Unnamed: 0,url,date
0,https://www.stranotizie.it/analista-studia-il-...,2024-03-14
1,https://www.vogue.it/article/chiara-ferragni-i...,2021-10-05
2,https://www.lercio.it/balocco-ferragni-ancora-...,2023-12-17
3,https://www.repubblica.it/cronaca/2024/01/14/n...,2024-01-14
4,https://www.leggo.it/gossip/fedez_ferragni/chi...,2024-04-17
5,https://milano.corriere.it/notizie/cronaca/24_...,2024-02-23
6,https://www.ilmessaggero.it/italia/chiara_ferr...,2024-03-08
7,https://www.ilgiornale.it/news/nazionale/chiar...,2024-02-09
8,https://www.dailymail.co.uk/news/article-12941...,2024-01-09
9,https://www.ilgiornale.it/news/nazionale/ora-a...,2023-12-29


***
### Managing Missing Values and Adding Date and Source Category to the Dataset



In [36]:
missing_dates_mask = df['date'].apply(lambda x: not isinstance(x, datetime.date))
df_missing_dates = df[missing_dates_mask]
df_missing_dates

Unnamed: 0,text_content,url,date,source_category,flag,type_category
15,Italian cake maker in influencer charity scand...,https://www.washingtontimes.com/news/2024/jan/...,2024,washingtontimes,news,article
19,"Ferragni Balocco, Antitrust sul caso pandoro: ...",https://tg24.sky.it/cronaca/2024/04/17/ferragn...,2024,tg24,news,article
20,La difesa di Ferragni sulla beneficenza per il...,https://www.fanpage.it/milano/la-difesa-di-fer...,la-difesa-di-ferragni-sulla-beneficenza-per-il...,fanpage,news,article
22,Italian social media star faces sticky questio...,https://www.ft.com/content/bd79dc79-e331-4b3e-...,bd79dc79-e331-4b3e-b17e-925f1f94aaa3,ft,news,article
27,"Il caso Ferragni, oltre che etico, è stato per...",https://www.ilfoglio.it/economia/2024/02/21/ne...,2024,ilfoglio,news,article
36,Perché più Procure stanno indagando su Chiara ...,https://www.fanpage.it/milano/chiara-ferragni-...,chiara-ferragni-pandoro-balocco-beneficienza-p...,fanpage,news,article
38,#iltempodioshø – Il Tempo a a a Si affaccia un...,https://www.iltempo.it/tempodiosh/2024/02/24/n...,2024,iltempo,news,article
39,Besieged Influencer Chiara Ferragni Is the Tal...,https://www.nytimes.com/2024/01/14/style/influ...,01,nytimes,news,article
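
The `isinstance` check only recognizes `datetime.date` objects, so it would flag every row once the dates have been saved to CSV and read back as strings. A sketch of a type-agnostic check follows, assuming valid dates always render as `YYYY-MM-DD`; `invalid_date_mask` is a hypothetical helper and the sample values mirror the kinds of entries seen in the table above:

```python
import datetime
import pandas as pd

def invalid_date_mask(dates):
    """True where the value does not render as a plain YYYY-MM-DD string.

    This checks the format only; it does not verify calendar validity.
    """
    return ~dates.astype(str).str.fullmatch(r'\d{4}-\d{2}-\d{2}')

sample = pd.Series([datetime.date(2024, 3, 14), '2024-01-14', '2024', '01', ''])
print(invalid_date_mask(sample).tolist())
```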


In [38]:
df_missing_dates['url'].values

array(['https://www.washingtontimes.com/news/2024/jan/9/italian-cake-maker-in-influencer-charity-scandal-s/',
       'https://tg24.sky.it/cronaca/2024/04/17/ferragni-balocco-pandoro-antitrust',
       'https://www.fanpage.it/milano/la-difesa-di-ferragni-sulla-beneficenza-per-il-caso-pandoro-perche-e-tutto-regolare-secondo-linfluencer/',
       'https://www.ft.com/content/bd79dc79-e331-4b3e-b17e-925f1f94aaa3',
       'https://www.ilfoglio.it/economia/2024/02/21/news/il-caso-ferragni-oltre-che-etico-e-stato-per-balocco-un-disastro-commerciale-numeri-alla-mano-6242639/',
       'https://www.fanpage.it/milano/chiara-ferragni-pandoro-balocco-beneficienza-procura-indagine/',
       'https://www.iltempo.it/tempodiosh/2024/02/24/news/chiara-ferragni-fedez-follower-indagini-osho-social-38542161/',
       'https://www.nytimes.com/2024/01/14/style/influencers-scandal-chiara-ferragni-milan-fashion-week.html'],
      dtype=object)

In [59]:
df.loc[df['url'] == 'https://www.washingtontimes.com/news/2024/jan/9/italian-cake-maker-in-influencer-charity-scandal-s/', 'date'] = pd.to_datetime('2024-01-09')
df.loc[df['url'] == 'https://tg24.sky.it/cronaca/2024/04/17/ferragni-balocco-pandoro-antitrust', 'date'] = pd.to_datetime('2024-04-17')
df.loc[df['url'] == 'https://www.fanpage.it/milano/la-difesa-di-ferragni-sulla-beneficenza-per-il-caso-pandoro-perche-e-tutto-regolare-secondo-linfluencer/', 'date'] = pd.to_datetime('2024-02-24')
df.loc[df['url'] == 'https://www.ft.com/content/bd79dc79-e331-4b3e-b17e-925f1f94aaa3', 'date'] = pd.to_datetime('2024-01-13')
df.loc[df['url'] == 'https://www.ilfoglio.it/economia/2024/02/21/news/il-caso-ferragni-oltre-che-etico-e-stato-per-balocco-un-disastro-commerciale-numeri-alla-mano-6242639/', 'date'] = pd.to_datetime('2024-02-21')
df.loc[df['url'] == 'https://www.fanpage.it/milano/chiara-ferragni-pandoro-balocco-beneficienza-procura-indagine/', 'date'] = pd.to_datetime('2023-12-28')
df.loc[df['url'] == 'https://www.iltempo.it/tempodiosh/2024/02/24/news/chiara-ferragni-fedez-follower-indagini-osho-social-38542161/', 'date'] = pd.to_datetime('2024-02-24')
df.loc[df['url'] == 'https://www.nytimes.com/2024/01/14/style/influencers-scandal-chiara-ferragni-milan-fashion-week.html', 'date'] = pd.to_datetime('2024-01-14')

df['date']=df['date'].apply(lambda x: pd.to_datetime(x).date())
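
Several of the dates filled in by hand above are already present in the URL path (for example `/2024/04/17/`). A minimal sketch of a fallback that reads them from the URL, assuming a `YYYY/MM/DD` or `YYYY-MM-DD` pattern; `date_from_url` and `URL_DATE_RE` are hypothetical names, and the function returns `None` for URLs that carry no such date, such as the FT one:

```python
import datetime
import re

# Year, month, day separated by '/' or '-', not embedded in a longer number
URL_DATE_RE = re.compile(r'(?<!\d)(20\d{2})[/-](\d{1,2})[/-](\d{1,2})(?!\d)')

def date_from_url(url):
    """Return the first calendar-valid date found in the URL, else None."""
    for year, month, day in URL_DATE_RE.findall(url):
        try:
            return datetime.date(int(year), int(month), int(day))
        except ValueError:
            continue  # e.g. month 13: keep looking
    return None

print(date_from_url('https://tg24.sky.it/cronaca/2024/04/17/ferragni-balocco-pandoro-antitrust'))
print(date_from_url('https://www.ft.com/content/bd79dc79-e331-4b3e-b17e-925f1f94aaa3'))
```

The date in a URL can differ from the publication date in the page metadata, so this is probably best used only when `fetch_date_from_webpage` returns nothing.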

***
# The structure of our dataset following the scraping process

In [60]:
df

Unnamed: 0,text_content,url,date,source_category,flag,type_category
0,Analista studia il caso Ferragni: quanti milio...,https://www.stranotizie.it/analista-studia-il-...,2024-03-14,stranotizie,news,article
1,Chiara Ferragni intervista Fabio Fazio e Miche...,https://www.vogue.it/article/chiara-ferragni-i...,2021-10-05,vogue,news,article
2,Balocco-Ferragni ancora insieme: arriva il pan...,https://www.lercio.it/balocco-ferragni-ancora-...,2023-12-17,lercio,news,article
3,"Caso Ferragni, Balocco al Codacons: ecco perch...",https://www.repubblica.it/cronaca/2024/01/14/n...,2024-01-14,repubblica,news,article
4,"Chiara Ferragni-Balocco, l'Antitrust: «Commist...",https://www.leggo.it/gossip/fedez_ferragni/chi...,2024-04-17,leggo,news,article
5,"Chiara Ferragni: «Sono imperfetta anche io, ma...",https://milano.corriere.it/notizie/cronaca/24_...,2024-02-23,milano,news,article
6,"Ferragni, avvocati in Procura per depositare d...",https://www.ilmessaggero.it/italia/chiara_ferr...,2024-03-08,ilmessaggero,news,article
7,"Chiara Ferragni, quasi 380mila followers persi...",https://www.ilgiornale.it/news/nazionale/chiar...,2024-02-09,ilgiornale,news,article
8,Italy's biggest influencer Chiara Ferragni now...,https://www.dailymail.co.uk/news/article-12941...,2024-01-09,dailymail,news,article
9,Non solo la Ferragni in caduta libera. Ora anc...,https://www.ilgiornale.it/news/nazionale/ora-a...,2023-12-29,ilgiornale,news,article


In [61]:
df.to_csv('article_links_scraping.csv', header=True, index=False)

In [64]:
df1=pd.read_csv('reddit_links_scraping.csv', header=0, index_col=False)
df1

Unnamed: 0,url,title,content
0,https://www.open.online/2023/06/14/antitrust-i...,L'Antitrust bacchetta i pandori Balocco di Chi...,14 Giugno 2023 - 11:46 La guardia di finanza n...
1,https://www.ansa.it/english/news/business/2024...,Ferragni-Balocco case could become 'fraud' pro...,Se hai scelto di non accettare i cookie di pro...
2,https://www.repubblica.it/economia/2023/12/15/...,Chiara Ferragni multata dall’Antitrust per il ...,Domande e risposte Ultim'ora 19.12 di Aldo Fon...
3,https://tg24.sky.it/cronaca/2024/01/15/balocco...,Balocco svela in una lettera al Codacons perch...,Polvere rosa e stencil per le decorazioni. Que...
4,https://www.ansa.it/sito/notizie/cronaca/2023/...,"Pasticcio di Natale Balocco-Ferragni, multa An...",Se hai scelto di non accettare i cookie di pro...
5,https://www.leggo.it/gossip/fedez_ferragni/fer...,"Ferragni, Balocco ricorre al Tar del Lazio con...","Balocco comunica che ""in data odierna ha impug..."
6,https://www.fanpage.it/spettacolo/personaggi/c...,"Chiara Ferragni sulla vicenda Balocco: ""Chiedo...",14:59 Chiara Ferragni sulla vicenda Balocco: “...
7,https://www.leggo.it/gossip/fedez_ferragni/chi...,"Chiara Ferragni, 1 milione di euro di multa da...",59 share di Redazione web Sanzione di oltre 1 ...
8,https://www.ilsole24ore.com/art/caso-balocco-f...,"Caso Balocco-Ferragni, sono quattro le procure...","Caso Balocco-Ferragni, sono quattro le procure..."
9,https://www.open.online/2023/12/23/chiara-ferr...,Pandori Ferragni-Balocco venduti anche a 120 e...,23 Dicembre 2023 - 19:32 C’è chi ha messo all’...


In [66]:
df1['text_content']=df1['title']+' '+df1['content']
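
String concatenation in pandas yields `NaN` whenever either operand is missing, so a row with a title but no scraped content would lose its title as well. A small sketch of a more forgiving combination, assuming an empty string is an acceptable stand-in for a missing field (the `sample` frame is illustrative, not data from the notebook):

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['Headline', 'Only a title', None],
    'content': ['Body text', None, 'Only a body'],
})

# Replace missing pieces with empty strings before joining, then trim spaces
sample['text_content'] = (
    sample['title'].fillna('') + ' ' + sample['content'].fillna('')
).str.strip()
print(sample['text_content'].tolist())
```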

In [67]:
df1.drop(columns=['title', 'content'], inplace=True)

In [68]:
df1

Unnamed: 0,url,text_content
0,https://www.open.online/2023/06/14/antitrust-i...,L'Antitrust bacchetta i pandori Balocco di Chi...
1,https://www.ansa.it/english/news/business/2024...,Ferragni-Balocco case could become 'fraud' pro...
2,https://www.repubblica.it/economia/2023/12/15/...,Chiara Ferragni multata dall’Antitrust per il ...
3,https://tg24.sky.it/cronaca/2024/01/15/balocco...,Balocco svela in una lettera al Codacons perch...
4,https://www.ansa.it/sito/notizie/cronaca/2023/...,"Pasticcio di Natale Balocco-Ferragni, multa An..."
5,https://www.leggo.it/gossip/fedez_ferragni/fer...,"Ferragni, Balocco ricorre al Tar del Lazio con..."
6,https://www.fanpage.it/spettacolo/personaggi/c...,"Chiara Ferragni sulla vicenda Balocco: ""Chiedo..."
7,https://www.leggo.it/gossip/fedez_ferragni/chi...,"Chiara Ferragni, 1 milione di euro di multa da..."
8,https://www.ilsole24ore.com/art/caso-balocco-f...,"Caso Balocco-Ferragni, sono quattro le procure..."
9,https://www.open.online/2023/12/23/chiara-ferr...,Pandori Ferragni-Balocco venduti anche a 120 e...


In [72]:
df1['flag']='news'
df1['type_category']='article'
df1['source_category']=df1['url'].apply(extract_info)
df1

Unnamed: 0,url,text_content,flag,type_category,source_category
0,https://www.open.online/2023/06/14/antitrust-i...,L'Antitrust bacchetta i pandori Balocco di Chi...,news,article,open
1,https://www.ansa.it/english/news/business/2024...,Ferragni-Balocco case could become 'fraud' pro...,news,article,ansa
2,https://www.repubblica.it/economia/2023/12/15/...,Chiara Ferragni multata dall’Antitrust per il ...,news,article,repubblica
3,https://tg24.sky.it/cronaca/2024/01/15/balocco...,Balocco svela in una lettera al Codacons perch...,news,article,tg24
4,https://www.ansa.it/sito/notizie/cronaca/2023/...,"Pasticcio di Natale Balocco-Ferragni, multa An...",news,article,ansa
5,https://www.leggo.it/gossip/fedez_ferragni/fer...,"Ferragni, Balocco ricorre al Tar del Lazio con...",news,article,leggo
6,https://www.fanpage.it/spettacolo/personaggi/c...,"Chiara Ferragni sulla vicenda Balocco: ""Chiedo...",news,article,fanpage
7,https://www.leggo.it/gossip/fedez_ferragni/chi...,"Chiara Ferragni, 1 milione di euro di multa da...",news,article,leggo
8,https://www.ilsole24ore.com/art/caso-balocco-f...,"Caso Balocco-Ferragni, sono quattro le procure...",news,article,ilsole24ore
9,https://www.open.online/2023/12/23/chiara-ferr...,Pandori Ferragni-Balocco venduti anche a 120 e...,news,article,open


In [75]:
df1['date']=''
# Fill in the date for every row whose page exposes one
for index, row in df1.iterrows():
    fetched_date = fetch_date_from_webpage(row['url'])
    if fetched_date:
        df1.at[index, 'date'] = fetched_date

In [76]:
df1

Unnamed: 0,url,text_content,flag,type_category,source_category,date
0,https://www.open.online/2023/06/14/antitrust-i...,L'Antitrust bacchetta i pandori Balocco di Chi...,news,article,open,2023-06-14
1,https://www.ansa.it/english/news/business/2024...,Ferragni-Balocco case could become 'fraud' pro...,news,article,ansa,2024-01-08
2,https://www.repubblica.it/economia/2023/12/15/...,Chiara Ferragni multata dall’Antitrust per il ...,news,article,repubblica,2023-12-15
3,https://tg24.sky.it/cronaca/2024/01/15/balocco...,Balocco svela in una lettera al Codacons perch...,news,article,tg24,
4,https://www.ansa.it/sito/notizie/cronaca/2023/...,"Pasticcio di Natale Balocco-Ferragni, multa An...",news,article,ansa,2023-12-15
5,https://www.leggo.it/gossip/fedez_ferragni/fer...,"Ferragni, Balocco ricorre al Tar del Lazio con...",news,article,leggo,2024-02-13
6,https://www.fanpage.it/spettacolo/personaggi/c...,"Chiara Ferragni sulla vicenda Balocco: ""Chiedo...",news,article,fanpage,
7,https://www.leggo.it/gossip/fedez_ferragni/chi...,"Chiara Ferragni, 1 milione di euro di multa da...",news,article,leggo,2023-12-15
8,https://www.ilsole24ore.com/art/caso-balocco-f...,"Caso Balocco-Ferragni, sono quattro le procure...",news,article,ilsole24ore,2023-12-30
9,https://www.open.online/2023/12/23/chiara-ferr...,Pandori Ferragni-Balocco venduti anche a 120 e...,news,article,open,2023-12-23


In [78]:
missing_dates_mask = df1['date'].apply(lambda x: not isinstance(x, datetime.date))
df_missing_dates = df1[missing_dates_mask]
df_missing_dates['url'].values

array(['https://tg24.sky.it/cronaca/2024/01/15/balocco-pandoro-chiara-ferragni-prezzo',
       'https://www.fanpage.it/spettacolo/personaggi/chiara-ferragni-sulla-vicenda-balocco-chiedo-scusa-devolvero-un-milione-di-euro-per-le-cure-dei-bambini/',
       'https://tg24.sky.it/cronaca/2023/12/16/ferragni-balocco',
       'https://www.agi.it/cronaca/news/2024-01-08/chiara-ferragni-pandori-balocco-ipotesi-truffa-procura-24737337/'],
      dtype=object)
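
Several of the URLs listed above carry a date in their path (for example `/2024/01/15/` or `/2024-01-08/`). As a fallback for pages where no date could be fetched, that fragment can be parsed directly. The following is a minimal sketch, not the notebook's own method: `date_from_url` is a hypothetical helper, and the date in a URL is only a heuristic that may differ from the actual publication date.

```python
import re
from datetime import date

# Assumption: dates appear as YYYY/MM/DD or YYYY-MM-DD somewhere in the URL.
_URL_DATE = re.compile(r'(?<!\d)(20\d{2})[/-](\d{2})[/-](\d{2})(?!\d)')

def date_from_url(url):
    """Return a datetime.date found in the URL, or None if there is none."""
    match = _URL_DATE.search(url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits that look like a date but are not one
        return None

sky = date_from_url('https://tg24.sky.it/cronaca/2024/01/15/balocco-pandoro-chiara-ferragni-prezzo')
agi = date_from_url('https://www.agi.it/cronaca/news/2024-01-08/chiara-ferragni-pandori-balocco-ipotesi-truffa-procura-24737337/')
fanpage = date_from_url('https://www.fanpage.it/spettacolo/personaggi/chiara-ferragni-sulla-vicenda-balocco-chiedo-scusa-devolvero-un-milione-di-euro-per-le-cure-dei-bambini/')
```

URLs without a date fragment, such as the fanpage one, still need to be dated by hand.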

***
### Continue with the scraping


In [86]:
df1.loc[df1['url'] == 'https://tg24.sky.it/cronaca/2024/01/15/balocco-pandoro-chiara-ferragni-prezzo', 'date'] = pd.to_datetime('2024-01-15').date()
df1.loc[df1['url'] == 'https://www.fanpage.it/spettacolo/personaggi/chiara-ferragni-sulla-vicenda-balocco-chiedo-scusa-devolvero-un-milione-di-euro-per-le-cure-dei-bambini/', 'date'] = pd.to_datetime('2023-12-18').date()
df1.loc[df1['url'] == 'https://tg24.sky.it/cronaca/2023/12/16/ferragni-balocco', 'date'] = pd.to_datetime('2023-12-16').date()
df1.loc[df1['url'] == 'https://www.agi.it/cronaca/news/2024-01-08/chiara-ferragni-pandori-balocco-ipotesi-truffa-procura-24737337/', 'date'] = pd.to_datetime('2024-01-08').date()
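
The same manual corrections can be kept in a single dictionary, which makes them easier to review and extend. This is a sketch under assumptions: `df_demo` is a small stand-in for `df1`, `manual_dates` is a hypothetical name, and the `example.com` row is invented for illustration.

```python
import pandas as pd

# Stand-in for df1: two rows with an empty date, one already dated (hypothetical URL).
df_demo = pd.DataFrame({
    'url': [
        'https://tg24.sky.it/cronaca/2023/12/16/ferragni-balocco',
        'https://tg24.sky.it/cronaca/2024/01/15/balocco-pandoro-chiara-ferragni-prezzo',
        'https://example.com/already-dated',
    ],
    'date': ['', '', pd.to_datetime('2023-12-23').date()],
})

# URL -> ISO date, checked by hand on the article page.
manual_dates = {
    'https://tg24.sky.it/cronaca/2023/12/16/ferragni-balocco': '2023-12-16',
    'https://tg24.sky.it/cronaca/2024/01/15/balocco-pandoro-chiara-ferragni-prezzo': '2024-01-15',
}

for url, iso_date in manual_dates.items():
    df_demo.loc[df_demo['url'] == url, 'date'] = pd.to_datetime(iso_date).date()
```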

In [87]:
df1

Unnamed: 0,url,text_content,flag,type_category,source_category,date
0,https://www.open.online/2023/06/14/antitrust-i...,L'Antitrust bacchetta i pandori Balocco di Chi...,news,article,open,2023-06-14
1,https://www.ansa.it/english/news/business/2024...,Ferragni-Balocco case could become 'fraud' pro...,news,article,ansa,2024-01-08
2,https://www.repubblica.it/economia/2023/12/15/...,Chiara Ferragni multata dall’Antitrust per il ...,news,article,repubblica,2023-12-15
3,https://tg24.sky.it/cronaca/2024/01/15/balocco...,Balocco svela in una lettera al Codacons perch...,news,article,tg24,2024-01-15
4,https://www.ansa.it/sito/notizie/cronaca/2023/...,"Pasticcio di Natale Balocco-Ferragni, multa An...",news,article,ansa,2023-12-15
5,https://www.leggo.it/gossip/fedez_ferragni/fer...,"Ferragni, Balocco ricorre al Tar del Lazio con...",news,article,leggo,2024-02-13
6,https://www.fanpage.it/spettacolo/personaggi/c...,"Chiara Ferragni sulla vicenda Balocco: ""Chiedo...",news,article,fanpage,2023-12-18
7,https://www.leggo.it/gossip/fedez_ferragni/chi...,"Chiara Ferragni, 1 milione di euro di multa da...",news,article,leggo,2023-12-15
8,https://www.ilsole24ore.com/art/caso-balocco-f...,"Caso Balocco-Ferragni, sono quattro le procure...",news,article,ilsole24ore,2023-12-30
9,https://www.open.online/2023/12/23/chiara-ferr...,Pandori Ferragni-Balocco venduti anche a 120 e...,news,article,open,2023-12-23


In [88]:
# Save df1, the frame whose dates were just completed
df1.to_csv('reddit_article_links_scraping.csv', header=True, index=False)

In [91]:
df3=pd.read_json('articles.json')
df3

Unnamed: 0,url,content,publish_date
0,https://thenewglobalorder.com/newsroom/lets-tu...,"Let's Tune In,NewsRoomLet’s Tune In To The EU’...",2024-01-31T16:59:16+01:00
1,https://apnews.com/article/italy-pandoro-pink-...,Copyright 2024 The Associated Press. All Right...,No date found
2,https://lavocedinewyork.com/en/news/2024/01/09...,Mayor Adams Announces Most Recent New York Cit...,No date found
3,https://www.aboutresilience.com/can-we-boycott...,HomeCase studiesCan we boycott Chiara Ferragni...,2024-01-03T14:30:31+01:00
4,https://katten.com/let-them-eat-cake-italys-an...,Article Published byKattison Avenue/Katten Kat...,No date found
5,https://lifeinitaly.com/pandoro-gate-recap-of-...,Home»News»Pandoro-Gate: Recap of Chiara Ferrag...,2024-01-15T01:01:32+02:00
6,https://www.elle.com/it/showbiz/gossip/a461461...,Un anno dopo il lancio del progetto e la nasci...,No date found
7,https://www.ilmessaggero.it/en/chiara_ferragni...,Chiara Ferragni Facing Legal Consequences Over...,No date found
8,https://www.agenzianova.com/en/news/caso-ferra...,"Ferragni case, Codacons: The antitrust denies ...",2024-01-14T15:14:02+01:00
9,https://www.dailymail.co.uk/news/article-12941...,Italy's biggest influencer Chiara Ferragni now...,2024-01-09T08:58:08+0000


***
### Extract the data source, category, and date from this dataset

In [92]:
df3['flag']='news'
df3['type_category']='article'
df3['source_category']=df3['url'].apply(extract_info)
df3

Unnamed: 0,url,content,publish_date,flag,type_category,source_category
0,https://thenewglobalorder.com/newsroom/lets-tu...,"Let's Tune In,NewsRoomLet’s Tune In To The EU’...",2024-01-31T16:59:16+01:00,news,article,thenewglobalorder
1,https://apnews.com/article/italy-pandoro-pink-...,Copyright 2024 The Associated Press. All Right...,No date found,news,article,apnews
2,https://lavocedinewyork.com/en/news/2024/01/09...,Mayor Adams Announces Most Recent New York Cit...,No date found,news,article,lavocedinewyork
3,https://www.aboutresilience.com/can-we-boycott...,HomeCase studiesCan we boycott Chiara Ferragni...,2024-01-03T14:30:31+01:00,news,article,aboutresilience
4,https://katten.com/let-them-eat-cake-italys-an...,Article Published byKattison Avenue/Katten Kat...,No date found,news,article,katten
5,https://lifeinitaly.com/pandoro-gate-recap-of-...,Home»News»Pandoro-Gate: Recap of Chiara Ferrag...,2024-01-15T01:01:32+02:00,news,article,lifeinitaly
6,https://www.elle.com/it/showbiz/gossip/a461461...,Un anno dopo il lancio del progetto e la nasci...,No date found,news,article,elle
7,https://www.ilmessaggero.it/en/chiara_ferragni...,Chiara Ferragni Facing Legal Consequences Over...,No date found,news,article,ilmessaggero
8,https://www.agenzianova.com/en/news/caso-ferra...,"Ferragni case, Codacons: The antitrust denies ...",2024-01-14T15:14:02+01:00,news,article,agenzianova
9,https://www.dailymail.co.uk/news/article-12941...,Italy's biggest influencer Chiara Ferragni now...,2024-01-09T08:58:08+0000,news,article,dailymail


***
### Rename the columns and fetch the missing dates

In [116]:
df3.rename(columns={"publish_date": "date", "content": "text_content"}, inplace=True)
for index, row in df3.iterrows():
    if row['date']=='No date found':
        fetched_date = fetch_date_from_webpage(row['url'])
        if fetched_date:
            df3.at[index, 'date'] = fetched_date
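
The loop above can be wrapped in a small function that also reports how many dates were looked up and how many were actually found. This is a sketch under assumptions: `fill_missing_dates` and `stub_fetcher` are hypothetical names, and the stub stands in for `fetch_date_from_webpage` so that the example runs without network access.

```python
import datetime
import pandas as pd

def fill_missing_dates(frame, fetcher, sentinel='No date found'):
    """Fill sentinel dates in place; return (attempted, found) counts."""
    attempted = found = 0
    for index in frame.index[frame['date'] == sentinel]:
        attempted += 1
        fetched = fetcher(frame.at[index, 'url'])
        if fetched:
            frame.at[index, 'date'] = fetched
            found += 1
    return attempted, found

# Hypothetical data and a stub fetcher that only knows one page.
demo = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    'date': ['2024-01-31T16:59:16+01:00', 'No date found', 'No date found'],
})
known = {'https://example.com/b': datetime.date(2024, 1, 9)}
stub_fetcher = known.get

counts = fill_missing_dates(demo, stub_fetcher)
```

Rows that still hold the sentinel afterwards are the ones that need a manual date.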

In [99]:
df3

Unnamed: 0,url,text_content,date,flag,type_category,source_category
0,https://thenewglobalorder.com/newsroom/lets-tu...,"Let's Tune In,NewsRoomLet’s Tune In To The EU’...",2024-01-31T16:59:16+01:00,news,article,thenewglobalorder
1,https://apnews.com/article/italy-pandoro-pink-...,Copyright 2024 The Associated Press. All Right...,2024-01-09,news,article,apnews
2,https://lavocedinewyork.com/en/news/2024/01/09...,Mayor Adams Announces Most Recent New York Cit...,2024-01-09,news,article,lavocedinewyork
3,https://www.aboutresilience.com/can-we-boycott...,HomeCase studiesCan we boycott Chiara Ferragni...,2024-01-03T14:30:31+01:00,news,article,aboutresilience
4,https://katten.com/let-them-eat-cake-italys-an...,Article Published byKattison Avenue/Katten Kat...,No date found,news,article,katten
5,https://lifeinitaly.com/pandoro-gate-recap-of-...,Home»News»Pandoro-Gate: Recap of Chiara Ferrag...,2024-01-15T01:01:32+02:00,news,article,lifeinitaly
6,https://www.elle.com/it/showbiz/gossip/a461461...,Un anno dopo il lancio del progetto e la nasci...,2023-12-15,news,article,elle
7,https://www.ilmessaggero.it/en/chiara_ferragni...,Chiara Ferragni Facing Legal Consequences Over...,2023-12-28,news,article,ilmessaggero
8,https://www.agenzianova.com/en/news/caso-ferra...,"Ferragni case, Codacons: The antitrust denies ...",2024-01-14T15:14:02+01:00,news,article,agenzianova
9,https://www.dailymail.co.uk/news/article-12941...,Italy's biggest influencer Chiara Ferragni now...,2024-01-09T08:58:08+0000,news,article,dailymail


In [103]:
df3['date'] = df3['date'].apply(lambda x: pd.to_datetime(x).date() if x != 'No date found' else None)
missing_dates_mask = df3['date'].apply(lambda x: not isinstance(x, datetime.date))
df_missing_dates = df3[missing_dates_mask]
df_missing_dates['url'].values

array(['https://katten.com/let-them-eat-cake-italys-antitrust-and-advertising-authorities-crack-down-on-influencers',
       'https://acrimonia.it/en/articles/dal-pandoro-gate-al-ricorso-contro-lantitrust',
       'https://www.engage.it/web-marketing/pandoro-gate-balocco-chiara-ferragni-chiedo-scusa-1-milione-di-euro-al-regina-margherita.aspx',
       'https://www.engage.it/web-marketing/pandoro-gate-linfluencer-chiara-ferragni-fa-ricorso-al-tar-sul-caso-balocco-illegittima-.aspx',
       'https://www.lexology.com/library/detail.aspx?g=0b201b67-a6c1-4622-bcdb-e32d10e1fd21'],
      dtype=object)
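
The `date` column mixes ISO timestamps with different UTC offsets, plain dates, and the `No date found` sentinel. A slightly more defensive conversion could look like the sketch below. `to_date` is a hypothetical helper, not part of the notebook. It keeps the local calendar date of each timestamp and returns `None` for anything that cannot be parsed.

```python
import datetime
import pandas as pd

def to_date(value):
    """Return a datetime.date, or None when the value cannot be parsed."""
    if isinstance(value, datetime.datetime):  # also covers pd.Timestamp
        return value.date()
    if isinstance(value, datetime.date):
        return value
    parsed = pd.to_datetime(value, errors='coerce')
    return None if pd.isna(parsed) else parsed.date()

with_offset = to_date('2024-01-15T01:01:32+02:00')
compact_offset = to_date('2024-01-09T08:58:08+0000')
sentinel = to_date('No date found')
already_date = to_date(datetime.date(2023, 12, 15))
```

Applied as `df3['date'].apply(to_date)`, this would leave `None` only in the rows that still need a manual date.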

In [104]:
df3.loc[df3['url'] == 'https://katten.com/let-them-eat-cake-italys-antitrust-and-advertising-authorities-crack-down-on-influencers', 'date'] = pd.to_datetime('2024-01-31').date()
df3.loc[df3['url'] == 'https://acrimonia.it/en/articles/dal-pandoro-gate-al-ricorso-contro-lantitrust', 'date'] = pd.to_datetime('2024-02-15').date()
df3.loc[df3['url'] == 'https://www.engage.it/web-marketing/pandoro-gate-balocco-chiara-ferragni-chiedo-scusa-1-milione-di-euro-al-regina-margherita.aspx', 'date'] = pd.to_datetime('2023-12-18').date()
df3.loc[df3['url'] == 'https://www.engage.it/web-marketing/pandoro-gate-linfluencer-chiara-ferragni-fa-ricorso-al-tar-sul-caso-balocco-illegittima-.aspx', 'date'] = pd.to_datetime('2024-02-15').date()
df3.loc[df3['url'] == 'https://www.lexology.com/library/detail.aspx?g=0b201b67-a6c1-4622-bcdb-e32d10e1fd21', 'date'] = pd.to_datetime('2024-02-07').date()

In [105]:
df3

Unnamed: 0,url,text_content,date,flag,type_category,source_category
0,https://thenewglobalorder.com/newsroom/lets-tu...,"Let's Tune In,NewsRoomLet’s Tune In To The EU’...",2024-01-31,news,article,thenewglobalorder
1,https://apnews.com/article/italy-pandoro-pink-...,Copyright 2024 The Associated Press. All Right...,2024-01-09,news,article,apnews
2,https://lavocedinewyork.com/en/news/2024/01/09...,Mayor Adams Announces Most Recent New York Cit...,2024-01-09,news,article,lavocedinewyork
3,https://www.aboutresilience.com/can-we-boycott...,HomeCase studiesCan we boycott Chiara Ferragni...,2024-01-03,news,article,aboutresilience
4,https://katten.com/let-them-eat-cake-italys-an...,Article Published byKattison Avenue/Katten Kat...,2024-01-31,news,article,katten
5,https://lifeinitaly.com/pandoro-gate-recap-of-...,Home»News»Pandoro-Gate: Recap of Chiara Ferrag...,2024-01-15,news,article,lifeinitaly
6,https://www.elle.com/it/showbiz/gossip/a461461...,Un anno dopo il lancio del progetto e la nasci...,2023-12-15,news,article,elle
7,https://www.ilmessaggero.it/en/chiara_ferragni...,Chiara Ferragni Facing Legal Consequences Over...,2023-12-28,news,article,ilmessaggero
8,https://www.agenzianova.com/en/news/caso-ferra...,"Ferragni case, Codacons: The antitrust denies ...",2024-01-14,news,article,agenzianova
9,https://www.dailymail.co.uk/news/article-12941...,Italy's biggest influencer Chiara Ferragni now...,2024-01-09,news,article,dailymail


***
### Merge all the datasets

In [106]:
df3.to_csv('david_articles_scraping.csv', header=True, index=False)

In [113]:
df.columns

Index(['text_content', 'url', 'date', 'source_category', 'flag',
       'type_category'],
      dtype='object')

In [118]:
df1.columns

Index(['url', 'text_content', 'flag', 'type_category', 'source_category',
       'date'],
      dtype='object')

In [117]:
df3.columns

Index(['url', 'text_content', 'date', 'flag', 'type_category',
       'source_category'],
      dtype='object')
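
`pd.concat` aligns columns by name, so a column whose name differs in one frame ends up as a separate, mostly empty column in the result. Comparing each frame against the expected column set before merging catches this early. The sketch below uses empty demo frames and a hypothetical helper, `column_mismatches`. The misspelled `type_cateogory` in the demo is only an illustration.

```python
import pandas as pd

expected = {'url', 'text_content', 'date', 'flag', 'type_category', 'source_category'}

def column_mismatches(frames, expected_columns):
    """Map each offending frame name to its (missing, unexpected) column sets."""
    report = {}
    for name, frame in frames.items():
        columns = set(frame.columns)
        missing = expected_columns - columns
        unexpected = columns - expected_columns
        if missing or unexpected:
            report[name] = (missing, unexpected)
    return report

# Demo frames: one with the expected columns, one with a misspelled name.
good = pd.DataFrame(columns=sorted(expected))
bad = good.rename(columns={'type_category': 'type_cateogory'})

report = column_mismatches({'good': good, 'bad': bad}, expected)
```

An empty report means the frames can be concatenated without creating stray columns.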

In [121]:
df_merged=pd.concat([df, df1, df3], ignore_index=True)
df_merged

Unnamed: 0,text_content,url,date,source_category,flag,type_category
0,Analista studia il caso Ferragni: quanti milio...,https://www.stranotizie.it/analista-studia-il-...,2024-03-14,stranotizie,news,article
1,Chiara Ferragni intervista Fabio Fazio e Miche...,https://www.vogue.it/article/chiara-ferragni-i...,2021-10-05,vogue,news,article
2,Balocco-Ferragni ancora insieme: arriva il pan...,https://www.lercio.it/balocco-ferragni-ancora-...,2023-12-17,lercio,news,article
3,"Caso Ferragni, Balocco al Codacons: ecco perch...",https://www.repubblica.it/cronaca/2024/01/14/n...,2024-01-14,repubblica,news,article
4,"Chiara Ferragni-Balocco, l'Antitrust: «Commist...",https://www.leggo.it/gossip/fedez_ferragni/chi...,2024-04-17,leggo,news,article
...,...,...,...,...,...,...
93,Abstract: A worldwide media storm hit hard the...,https://www.lexology.com/library/detail.aspx?g...,2024-02-07,lexology,news,article
94,Italian influencer Chiara Ferragni sorry for h...,https://www.bbc.co.uk/news/world-europe-67759633,2023-12-19,bbc,news,article
95,"EntertainmentJanuary 05 2024Chiara Ferragni, C...",https://www.napolike.com/chiara-ferragni-coca-...,2024-01-05,napolike,news,article
96,A popular Italian influencer has been placed u...,https://www.hurriyetdailynews.com/italian-infl...,2024-01-10,hurriyetdailynews,news,article
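
Because the three frames come from separate URL lists, the same article could in principle appear more than once after the concatenation. The output above does not show whether that happens here. A minimal check on hypothetical demo rows, using `DataFrame.duplicated` on the `url` column:

```python
import pandas as pd

# Hypothetical rows: the same article reached through two source lists.
demo = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b', 'https://example.com/a'],
    'source_category': ['example', 'example', 'example'],
})

# Mark every repeat of a URL after its first occurrence, then drop those rows.
duplicate_mask = demo.duplicated(subset='url', keep='first')
deduplicated = demo[~duplicate_mask].reset_index(drop=True)
```

Note that this only catches exact matches. URLs that differ by a trailing slash or a tracking parameter would need to be normalized first.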


***
### Save all the articles to a single file

In [122]:
df_merged.to_csv(path_or_buf='merged_scraped_articles.csv', header=True, index=False)
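
When the file is read back, the dates come in as plain strings unless `parse_dates` is passed to `pd.read_csv`. The round trip can be sketched as follows, using an in-memory buffer and a hypothetical one-row frame instead of the real file:

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for df_merged.
demo = pd.DataFrame({
    'url': ['https://example.com/a'],
    'date': [pd.to_datetime('2024-01-09').date()],
})

# Write to an in-memory buffer with the same options used for the real file.
buffer = io.StringIO()
demo.to_csv(buffer, header=True, index=False)
buffer.seek(0)

# parse_dates turns the ISO strings back into datetime64 values.
restored = pd.read_csv(buffer, parse_dates=['date'])
```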