# From Extracting Data through Web Scraping to Organizing Structural Information

## Muzammil Mushtaq

## Task 1

Analyzing the website "https://www.dw.com/search/?languageCode=en" using the Python library BeautifulSoup.
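
To make the parsing step concrete before the full scraper, here is a minimal sketch of how BeautifulSoup pulls a title, summary and link out of one search result. The HTML snippet is hypothetical: it is modelled on the selectors used in the notebook (`div.searchResult`, `h2`, `p`, `a`), not copied from the live page, and `parse_results` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the selectors used in this notebook.
# The real DW search page may be structured differently.
SAMPLE_HTML = """
<div class="searchResult">
  <a href="/en/example-article/a-1">
    <h2>Example headline 09.10.2023</h2>
    <p>Short teaser text.</p>
  </a>
</div>
"""

def parse_results(html):
    """Return one dict per search result with title, summary and relative link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for result in soup.find_all("div", class_="searchResult"):
        rows.append({
            "title": result.select_one("h2").get_text(strip=True),
            "summary": result.select_one("p").get_text(strip=True),
            "link": result.find("a")["href"],
        })
    return rows

print(parse_results(SAMPLE_HTML))
```

Working on a static string like this makes it possible to check the selectors without sending any requests.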

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
main_url = "https://www.dw.com/search/?languageCode=en"
short_url = "https://www.dw.com"
def scrape_dw(input_date1, input_date2):
    df = pd.DataFrame()
    def modify_url(url):

        first_response = requests.get(url)
        if first_response.status_code != 200:
            # Without the search page there is nothing to parse, so stop here
            raise RuntimeError(f"Request to {url} failed with status {first_response.status_code}")
        html_content = first_response.text

        soup = BeautifulSoup(html_content, "html.parser")

        # Read the search form fields needed to build the dated search URL.

        input_navigator = soup.find("input", {"name": "searchNavigationId"})
        if input_navigator:        
            input_dates = soup.find_all("input",{"class":"cal datepicker"})
            if input_dates:
                from_date = input_dates[0]["name"]
                to_date = input_dates[1]["name"]      
                if url != main_url:
                    total_pages = soup.find("span", class_="hits all")
                    if total_pages:
                        total_pages_number = total_pages.text
                    else:
                        print ("Span with class 'hits all' not found.")
            else:
                print ("Input with class 'cal datepicker' not found.")
        else:
            print("Input with name 'searchNavigationId' not found.")

        # Build the modified search URL.
        if url == main_url:
            modification_url =   f"https://www.dw.com/search/?languageCode=en&searchNavigationId={input_navigator['value']}&{from_date}={input_date1}&{to_date}={input_date2}"   
            return modification_url, soup
        else:
            modification_url = f"https://www.dw.com/search/?languageCode=en&searchNavigationId=9097&{from_date}={input_date1}&{to_date}={input_date2}&resultsCounter={str(total_pages_number).strip()}"   
            return modification_url, soup  
    link, soup = modify_url(main_url)
    link2, soup = modify_url(link)
    link3, soup = modify_url(link2)

    df = pd.DataFrame()
    def articles():
        articles = soup.find_all("div", class_="searchResult")
        Title, Summary, Link = [], [], []
        for article in articles:
            title = article.select_one('h2').text
            summary = article.select_one('p').text
            link = article.find('a')['href']  # Access the href attribute
            # Append
            Title.append(title)
            Summary.append(summary)
            Link.append(short_url+link)
        # saved in Dataframe
        df['Mix_Title'] = Title
        df['Summary'] = Summary
        df['Link'] = Link
        # Separate Mix_Title into Title and Publication_Date (the last token is the date)
        df['Title'] = df['Mix_Title'].apply(lambda x: " ".join(x.split()[:-1]))
        df['Publication_Date'] = df['Mix_Title'].apply(lambda x: x.split()[-1])

    articles()

    def access_articles(df):
        # Define all variables
        merge_subheading = []
        merge_related_topic_and_author = []
        merge_text = []
        merge_CAT = []
        merge_REG = []


        for link in df['Link']:      # df['Link'] holds the links of all articles in the date range
            response = requests.get(link)
            # Check if the request was successful (status code 200)
            if response.status_code == 200:
                html_content = response.text
            else:
                print("Error:", response.status_code)
                # Parse an empty document so this row gets empty values instead of stale ones
                html_content = ""
            soup = BeautifulSoup(html_content, "html.parser")

            # Sub-headings: collect <h2> texts, skipping navigation and teaser headings
            excluded_subheadings = ['Regions', 'Topics', 'Categories', 'Related topics', 'About DW', 'DW offers', 'Service', 'B2B', 'Follow us on']
            excluded_fragments = ['About', 'Similar', 'Show', 'More on']
            real_subheading = []
            for subheading in soup.select('h2'):
                heading_text = subheading.get_text()
                if heading_text in excluded_subheadings:
                    continue
                if any(fragment in heading_text for fragment in excluded_fragments):
                    continue
                real_subheading.append(heading_text)
            merge_subheading.append(real_subheading)

            # Related topics and author details
            Spans = soup.find_all("aside", class_="link-wrapper")
            real_related_topic_and_author = []
            for span in Spans:
                names = [a.text for a in span.find_all("a")]
                result = ', '.join(names)
                real_related_topic_and_author.append(result)
            merge_related_topic_and_author.append(real_related_topic_and_author)


            # Article text: every <p> except the video-player notice
            texts = soup.find_all("p")
            Text = []
            for text in texts:
                if 'To view this video' not in text.text: 
                    Text.append(text.text)
            merge_text.append(Text)


            # Category and region: the first two matching <span> elements
            category_and_region = soup.find_all('span', class_="")
            i = 0 
            CAT, REG = [], []
            for cat_reg in category_and_region: # split the category with region
                if i == 0:
                    CAT.append(cat_reg.text)
                elif i == 1:
                    REG.append(cat_reg.text)
                elif i == 2:
                    break
                i+=1
            merge_CAT.append(CAT)
            merge_REG.append(REG)

        df['SubHeadings'] = merge_subheading
        df['Related_topics'] = merge_related_topic_and_author
        df['Text'] = merge_text
        df['Category'] = merge_CAT
        df['Region'] = merge_REG
        columns_to_process = ['Text','Category','Region']
        for column in columns_to_process:
            df[column] = df[column].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

        df['Publication_Date'] = df['Publication_Date'].str.replace(r'\D', '', regex=True)
        df['Publication_Date'] = pd.to_datetime(df['Publication_Date'], format='%d%m%Y', errors='coerce')
        df = df.drop('Mix_Title',axis=1)

        column_order = ['Link','Publication_Date','Category','Region','Title','Summary','Text','SubHeadings','Related_topics']
        df = df[column_order]
        return df
    data = access_articles(df)

    return data

Data = scrape_dw('08.10.2023', '09.10.2023')
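
The scraper above assembles its search URLs with f-strings. As an alternative sketch (not the method used in the cell above), the same query string can be built with the standard library's `urllib.parse.urlencode`, which also escapes any special characters. The helper name `build_search_url` and the field names `"from"` and `"to"` are placeholders; the notebook reads the real date field names from the page at run time.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.dw.com/search/"

def build_search_url(navigation_id, from_field, to_field,
                     date_from, date_to, results_counter=None):
    """Assemble the search URL from its query parameters."""
    params = {
        "languageCode": "en",
        "searchNavigationId": navigation_id,
        from_field: date_from,
        to_field: date_to,
    }
    # The result counter is only known after the first dated search.
    if results_counter is not None:
        params["resultsCounter"] = str(results_counter).strip()
    return f"{BASE_URL}?{urlencode(params)}"

# "from" and "to" are hypothetical field names used for illustration only.
print(build_search_url("9097", "from", "to", "08.10.2023", "09.10.2023"))
```

Keeping the parameters in a dictionary makes it easier to see which parts of the URL change between the first and the later requests.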

In [3]:
Data.head(5)

| | Link | Publication_Date | Category | Region | Title | Summary | Text | SubHeadings | Related_topics |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.dw.com/en/israelis-look-for-loved-... | 2023-10-09 | Conflicts | Israel | Israelis look for loved ones following Hamas a... | \nWARNING: DISTURBING CONTENT Photos of missin... | WARNING: DISTURBING CONTENT Photos of missing ... | [] | [Israel, Hamas, Israel at war] |
| 1 | https://www.dw.com/en/hamas-attacks-on-israel-... | 2023-10-09 | Conflicts | Germany | Hamas attacks on Israel triggers debate in Ger... | \nIn the wake of the terrorist attack by Islam... | In the wake of the terrorist attack by Islamis... | [Diverse Muslim community, Central Council of ... | [Rhine River, Robert Habeck, Poverty in German... |
| 2 | https://www.dw.com/en/scholz-and-macron-conven... | 2023-10-09 | Politics | Germany | Scholz and Macron convene for 'strategic' retr... | \nChancellor Olaf Scholz welcomed President Em... | Chancellor Olaf Scholz welcomed President Emma... | [Franco-German ties 'more important than ever'... | [Emmanuel Macron, French elections, Rhine Rive... |
| 3 | https://www.dw.com/en/eyewitness-recounts-hama... | 2023-10-09 | Conflicts | Israel | Eyewitness recounts Hamas attack on Israeli mu... | \nUp to 250 young Israeli and foreign partygoe... | Up to 250 young Israeli and foreign partygoers... | [] | [Israel, Hamas, Israel at war] |
| 4 | https://www.dw.com/en/black-and-german-the-afr... | 2023-10-09 | Human Rights | Black and German - The Afrodeutsch story | Black and German - The Afrodeutsch story | \nBlack and German still just doesn’t add up f... | Black and German still just doesn’t add up for... | [] | [Discrimination, Diversity] |
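
The `Title` and `Publication_Date` columns come from splitting one scraped heading on whitespace and treating the last token as the date. A slightly stricter sketch is shown below, assuming the heading ends in a `DD.MM.YYYY` date (an assumption inferred from the `'%d%m%Y'` parsing step, not verified against the page). The helper `split_title_and_date` is hypothetical and uses only the standard library.

```python
import re
from datetime import datetime

# Assumed format: the heading ends with a date written as DD.MM.YYYY.
DATE_SUFFIX = re.compile(r"\s*(\d{2}\.\d{2}\.\d{4})\s*$")

def split_title_and_date(mixed_title):
    """Return (title, date); date is None when no valid date suffix is found."""
    match = DATE_SUFFIX.search(mixed_title)
    if match is None:
        return mixed_title.strip(), None
    try:
        date = datetime.strptime(match.group(1), "%d.%m.%Y").date()
    except ValueError:
        # Looks like a date but is not a real calendar day, e.g. 99.99.2023
        return mixed_title.strip(), None
    return mixed_title[:match.start()].strip(), date

print(split_title_and_date("Example headline 09.10.2023"))
```

Unlike a plain whitespace split, this leaves the title untouched when a heading carries no date, instead of cutting off its last word.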


In [4]:
Data['Text'][1]

'In the wake of the terrorist attack by Islamist militant Hamas on Israeli soldiers and civilians, Muslim associations in Germany have come under pressure to position themselves clearly. , It began with a tweet by the German Minister of Food and Agriculture Cem Özdemir on X, the platform formerly known as Twitter. "Resounding silence from Muslim associations in [Germany] regarding terror against #Israel. Or words that relativize..." And then: "In the face of terror, murder & kidnappings, there has to be an end to the naivety when dealing with Islamic associations finally!", Hundreds of gunmen from the Islamist terrorist group Hamas crossed the border into Israel from the Gaza Strip in the early morning hours of October 7, killing and kidnapping soldiers and civilians. According to official figures, more than 700 people have since died in Israel, which has declared a state of war and retaliated in Gaza, where at least 560 people have been killed., The parties in the German Bundestag, ex

In [5]:
Data['Summary'][1]

'\nIn the wake of the terrorist attack by Islamist militant Hamas on Israeli soldiers and civilians, Muslim associations in Germany have come under pressure to position themselves clearly. \n'
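
The summary above still carries leading and trailing `\n` characters, and the article text joins paragraphs with `', '`, which produces sequences such as `clearly. , It began`. One possible cleanup, sketched here with two hypothetical helpers (`clean_text` and `join_paragraphs`) that are not part of the scraper above, is to normalize whitespace and join paragraphs with blank lines instead.

```python
def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(raw.split())

def join_paragraphs(paragraphs):
    """Join the non-empty, cleaned paragraphs with a blank line between them."""
    cleaned = [clean_text(p) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)

print(repr(clean_text("\nIn the wake of the terrorist attack ... \n")))
print(join_paragraphs(["First paragraph. ", "", " Second paragraph."]))
```

If this were adopted, `clean_text` could be applied to the `Summary` column and `join_paragraphs` could replace the `', '.join(...)` step for the `Text` column.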

In [5]:
Data.to_excel('Output_Deutsche_Welle_Articles.xlsx', index=False)