# Data Science - Social Media Analytics SoSe 24 📊🔍

## Problemset 1 📝

This notebook is our submission for the weekly tasks in Social Media Analytics for the summer semester of 2024.

### Authors 👥
- **Martin Brucker** (942815) 🧑‍💻
- **Frederik Brinkmann** (943915) 🧑‍💻

**Due**: Friday, 5th April 2024, 11:59 PM

**Contact Information**: martin.brucker@student.fh-kiel.de 📧


# Exercise 1 🤠

Your task is to scrape articles from Deutsche Welle (DW) along with additional metadata about
the articles.

Use the Python Package BeautifulSoup to extract the following information from this
website and store it in a Python dictionary:

1. author
1. date
1. title
1. summary (text printed in bold letters at the beginning)
1. main text
1. related topics (Minorities, Women’s rights, ...)


• Analyze the article: what additional information (metadata or other) related to this article
is available? Make a list of such items. Then pick one of these items, extract it also from
this website, and update your dictionary with this item.


• Print the dictionary

In [2]:
# import the required libraries
import requests
import pandas as pd

# !pip install beautifulsoup4
from bs4 import BeautifulSoup

In [3]:
## create an instance of the BeautifulSoup class and set up the HTML parser

# url is the given URL from which the HTML code is downloaded
url = "https://www.dw.com/en/why-south-korean-women-arent-having-babies/a-68419317"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

In [29]:
# Extracting information directly without creating separate variables
resultDict = {
    "author": soup.find("div", class_="author-details").get_text(),
    "date": soup.time.get_text(),
    "title": soup.title.get_text(),
    "summary": soup.find(class_="teaser-text").get_text(),
    "main_text": [p.get_text() for p in soup.find_all("p") 
                  if "cookie__text" not in p.get("class", []) and "vjs-no-js" not in p.get("class", [])],
    "related_topics": [a.get_text() for a in soup.aside.find_all("a")]
}

# Note: The 'main_text' filters out unnecessary <p> elements
# with classes 'cookie__text' and 'vjs-no-js'
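The dictionary above calls `.get_text()` directly on the result of `soup.find(...)`, which raises an `AttributeError` if an element is missing. A minimal sketch of a more defensive variant follows; the helper name `text_or_none` and the sample HTML are hypothetical and only illustrate the idea, they are not taken from the DW page.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical sample page, deliberately without a <time> tag
sample_html = """
<html><head><title>Sample title</title></head>
<body>
  <div class="author-details">Jane Doe</div>
  <p class="teaser-text">Short summary.</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_dict = {
    "author": text_or_none(sample_soup.find("div", class_="author-details")),
    "date": text_or_none(sample_soup.time),  # soup.time is None here
    "summary": text_or_none(sample_soup.find(class_="teaser-text")),
}
print(sample_dict)
```

With such a helper, a missing element ends up as `None` in the dictionary instead of aborting the whole cell.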

#### Brainstorming additional information

* Image information (amount/url/size/alt text)
* Subsection Headlines (amount/text)
* Links to other pages (amount/connections)
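As an illustration of the first item in the list, image information could be collected roughly as follows. This is only a sketch on hypothetical sample HTML; the key names `image_count`, `url` and `alt` are assumptions, and the real DW markup may need a narrower selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one image with alt text, one without
sample_html = """
<article>
  <img src="https://example.org/a.jpg" alt="First picture">
  <img src="https://example.org/b.jpg">
</article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect URL and alt text for every <img>; a missing alt becomes ""
images = [
    {"url": img.get("src"), "alt": img.get("alt", "")}
    for img in sample_soup.find_all("img")
]
image_info = {"image_count": len(images), "images": images}
print(image_info)
```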


In [43]:
## getting the sub-section headlines from the article

# add the sub-section headlines to the previously created dictionary
resultDict["sub_headlines"] = [h2.get_text() for h2 in soup.find_all("h2", class_=None)]


In [44]:
# printing the final dict
resultDict

{'author': 'Julian Ryall',
 'date': '03/01/2024',
 'title': "Why South Korean women aren't having babies – DW – 03/01/2024",
 'summary': 'New statistics show a record low number of children were born last year in South Korea, with women citing a desire for a career and to push back against a male-dominated society as key reasons.',
 'main_text': ['New statistics show a record low number of children were born last year in South Korea, with women citing a desire for a career and to push back against a male-dominated society as key reasons.',
  'When she was younger, Hyobin Lee yearned to be a mother. There came a point, however, when she had to make a difficult decision. Ultimately, she chose her career over a family and is now a successful academic in the South Korean city of Daejeon.',
  "Lee, now 44, is just one of millions of Korean women who are making a conscious decision to remain childless — resulting in the nation's fertility rate dropping to a new record low.",
  'The fertility

# Exercise 2 🤠

In [24]:
import datetime
import time

In the following, the task is split into two code blocks.
In the first block, all URLs of articles published since 01.03.2024 are collected.

In the second block, the relevant data is extracted from each of the collected links.
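The first block has to decide when to stop paging, which comes down to comparing two date strings that use different formats. A minimal sketch of that comparison, assuming the search results show dates as `dd.mm.yyyy` and the target date is given as `yyyy-mm-dd`; the function name `is_older_than` is hypothetical.

```python
from datetime import datetime


def is_older_than(date_text, target_text,
                  date_format="%d.%m.%Y", target_format="%Y-%m-%d"):
    """Return True if date_text lies strictly before target_text."""
    date = datetime.strptime(date_text, date_format)
    target = datetime.strptime(target_text, target_format)
    return date < target


print(is_older_than("31.03.2024", "2024-04-01"))  # True
print(is_older_than("02.04.2024", "2024-04-01"))  # False
```

Parsing both strings into `datetime` objects avoids comparing them as text, which would give wrong results for the `dd.mm.yyyy` format.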

In [25]:
articleLinks = []
page = 10
more_pages = True
target_date = "2024-04-01"

while more_pages:

    print(f"Now scraping page {page}")

    url = f"https://www.dw.com/search/?languageCode=en&sort=DATE&resultsCounter={page}"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    
    news = soup.select("div.searchResult")
    articleLinks.extend([article.a["href"] for article in news])

    if news:
        latest_date = soup.find_all(name="span", attrs={"class" : "date"})[-1].getText()

        ## stop condition: the last article on this page is older than target_date
        if datetime.datetime.strptime(latest_date, "%d.%m.%Y") < datetime.datetime.strptime(target_date, "%Y-%m-%d"):
            more_pages = False

    if soup.select("a.addPage"):
        page += 10
    else:
        more_pages = False


    time.sleep(1)   

print(f"Finished scraping {len(articleLinks)} article URL's")

Now scraping page 10


Now scraping page 20
Now scraping page 30
Now scraping page 40
Now scraping page 50
Now scraping page 60
Now scraping page 70
Now scraping page 80
Finished scraping 360 article URL's


In [26]:
# to cache the links as csv file
pd.DataFrame(articleLinks, columns=["links"]).to_csv("article_links.csv", index=False)

In [27]:
i = 0
result = []

for article in articleLinks:
    if i % 10 == 0:
        print(f"Finished {i/len(articleLinks)} %")

    url = f"https://www.dw.com{article}"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    data = {}

    # sadly the author is not easy to extract, so the following three common author formats are tried in turn
    if soup.find_all("div", class_="author-details"):
        data["author"] = soup.find_all("div", class_="author-details")
    elif soup.find_all("span", class_="author"):
        data["author"] = soup.find_all("span", class_="author")
    else:
        data["author"] = soup.find_all("div", class_="author")
    
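    try:
        data["date"] = soup.time.getText()
    except AttributeError:  # soup.time is None if the page has no <time> tag
        print("noDate")

    try:
        data["title"] = soup.title.getText()
    except AttributeError:  # soup.title is None if the page has no <title> tag
        print("no title")

    try:
        data["summary"] = soup.find_all(class_="teaser-text")[0].getText()
    except IndexError:  # no element with class "teaser-text" was found
        print("no summary")

    try:
        data["text"] = [p.getText() for p in soup.find_all("p", class_=lambda x: x not in ["cookie__text", "vjs-no-js"])]
    except AttributeError:
        print("no text")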
    
    data["related_topics"] = soup.aside

    result.append(data)
    i += 1

Finished 0.0 %
noDate
no summary
Finished 0.027777777777777776 %
noDate
no summary
Finished 0.05555555555555555 %
Finished 0.08333333333333333 %
noDate
no summary
Finished 0.1111111111111111 %
Finished 0.1388888888888889 %
noDate
no summary
noDate
no summary
Finished 0.16666666666666666 %
noDate
no summary
Finished 0.19444444444444445 %
Finished 0.2222222222222222 %
noDate
no summary
noDate
no summary
Finished 0.25 %
no summary
Finished 0.2777777777777778 %
noDate
no summary
Finished 0.3055555555555556 %
Finished 0.3333333333333333 %
noDate
no summary
noDate
no summary
Finished 0.3611111111111111 %
no summary
Finished 0.3888888888888889 %
Finished 0.4166666666666667 %
noDate
no summary
Finished 0.4444444444444444 %
Finished 0.4722222222222222 %
noDate
no summary
noDate
no summary
Finished 0.5 %
no summary
Finished 0.5277777777777778 %
Finished 0.5555555555555556 %
Finished 0.5833333333333334 %
noDate
no summary
Finished 0.6111111111111112 %
Finished 0.6388888888888888 %
noDate
no summa
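The loop above stores the raw `find_all` results for the author and prints the progress as a fraction followed by a percent sign. A possible cleanup is sketched below on hypothetical sample HTML: the helper names `extract_authors` and `progress` are assumptions, while the three tag/class pairs are the ones tried in the loop.

```python
from bs4 import BeautifulSoup

# The three author formats tried in the loop, in the same order
AUTHOR_SELECTORS = [("div", "author-details"), ("span", "author"), ("div", "author")]


def extract_authors(soup):
    """Return author names as plain strings from the first matching format."""
    for tag_name, css_class in AUTHOR_SELECTORS:
        tags = soup.find_all(tag_name, class_=css_class)
        if tags:
            return [tag.get_text(strip=True) for tag in tags]
    return []  # no known author format on this page


def progress(done, total):
    """Format the share of processed articles as a percentage string."""
    return f"{done / total:.1%}"


sample_soup = BeautifulSoup('<span class="author">Jane Doe</span>', "html.parser")
print(extract_authors(sample_soup))
print(progress(90, 360))
```

Returning plain strings keeps `Tag` objects out of the resulting DataFrame, which makes the `author` column easier to read and to export.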

In [39]:
articlesAsDataFrame = pd.DataFrame(result)
articlesAsDataFrame.head()

| | author | date | title | summary | text | related_topics |
|---|---|---|---|---|---|---|
| 0 | [[Flourish Ubanyi ]] | 04/02/2024 | Under pressure from the online beauty craze? –... | Does social media influence our perception of ... | [Does social media influence our perception of... | [[Lifestyle]] |
| 1 | [] | 04/02/2024 | Inflation in Germany drops again in March – DW... | Year-on-year inflation dropped to 2.2% last mo... | [Year-on-year inflation dropped to 2.2% last m... | |
| 2 | `[[[<a class="sc-ivDvhZ iBEcrE sc-bZCZoC jpmmcy...` | 04/02/2024 | What's behind high joblessness among India's y... | The outlook for millions of young people enter... | [The outlook for millions of young people ente... | [[BRICS], [Hindu nationalism], [Manipur], [Nar... |
| 3 | [] | 04/02/2024 | By train through Vietnam – DW – 04/02/2024 | A journey of about 1,700 kilometers. In Vietna... | [A journey of about 1,700 kilometers. In Vietn... | |
| 4 | [] | | Checking facts and building trust in Burkina F... | | `[\nDW Akademie and Fasocheck believe that acce...` | |


In [38]:
# transform date into datetime
articlesAsDataFrame["date"] = pd.to_datetime(articlesAsDataFrame["date"], format="%m/%d/%Y")

# TODO: transform related_topics into a cleaner format,
# maybe by splitting it into multiple columns

# check the maximum related_topics length, taking care of None values
articlesAsDataFrame["related_topics"].apply(lambda x: len(x) if x else 0).max()

articlesAsDataFrame["related_topics"]

0                                          [[Lifestyle]]
1                                                   None
2      [[BRICS], [Hindu nationalism], [Manipur], [Nar...
3                                                   None
4                                                   None
                             ...                        
355    [[Black Sea], [BRICS], [Dmitry Medvedev], [Rus...
356    [[Black Sea], [Turkey elections], [Turkey-Syri...
357                     [[Israel], [Benjamin Netanyahu]]
358                     [[Israel], [Benjamin Netanyahu]]
359                                        [[The Hague]]
Name: related_topics, Length: 360, dtype: object

In [None]:
# transform the data into a form that is useful for further research

# transform date into datetime
# articlesAsDataFrame["date"] = pd.to_datetime(articlesAsDataFrame["date"])

# unwrap the outer array of related_topics with the soup's .getText(), checking for None

# change dtype of the related_topics into categories
# articlesAsDataFrame["related_topics"] = articlesAsDataFrame["related_topics"].astype("category")
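The transformations planned in the comments above could look roughly like the following sketch. It assumes that `related_topics` has already been converted to plain lists of strings (the column in this notebook still holds BeautifulSoup objects), and it uses a small hypothetical DataFrame named `sample` instead of the scraped data.

```python
import pandas as pd

# Hypothetical stand-in for the scraped articles; None means "no topics found"
sample = pd.DataFrame({
    "title": ["a", "b", "c"],
    "related_topics": [["Lifestyle"], None, ["BRICS", "Manipur"]],
})

# Maximum number of topics per article, counting None as 0
max_topics = sample["related_topics"].apply(
    lambda topics: len(topics) if isinstance(topics, list) else 0
).max()

# One row per (article, topic) pair; articles without topics are dropped
topics_long = (
    sample[["title", "related_topics"]]
    .explode("related_topics")
    .dropna(subset=["related_topics"])
    .assign(related_topics=lambda df: df["related_topics"].astype("category"))
)

print(max_topics)
print(topics_long)
```

A long format with one topic per row works with the `category` dtype, whereas a column of lists cannot be converted to categories directly because lists are not hashable.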

In [None]:
articlesAsDataFrame["related_topics"]

0     [[Minorities], [Women's rights], [Taliban], [E...
1     [[Black Sea], [BRICS], [Dmitry Medvedev], [Rus...
2                                                  None
3     [[Artificial intelligence], [Asia], [South Kor...
4     [[EU migration policy], [BRICS], [Taiwan], [Uy...
5        [[Palestinian territories], [Israel], [Hamas]]
6                [[LGBTQ+ rights in Africa], [Namibia]]
7                 [[Michael Schumacher], [Formula One]]
8     [[BRICS], [Hindu nationalism], [Manipur], [Nar...
9                                [[Olympics], [France]]
10    [[Law and Justice], [Medical marijuana], [Rhin...
11    [[Medical marijuana], [Rhine River], [Poverty ...
12    [[Bayern Munich], [Borussia Dortmund], [Bundes...
13    [[BRICS], [China-Taiwan crisis], [Taiwan], [Uy...
14    [[Black Sea], [Turkey elections], [Turkey-Syri...
15    [[BRICS], [Hindu nationalism], [Manipur], [Nar...
16    [[Rhine River], [Poverty in Germany], [Robert ...
Name: related_topics, dtype: object