## Documentation
This script scrapes news articles related to specific sectors in Singapore and Malaysia using DuckDuckGo Search.

### Functionality
1) Extracts news articles based on keywords, country, and sector.
2) Includes title, content, source, sector, and country for each article.
3) Saves the extracted articles in separate CSV files for each sector (e.g., ddgs_realestate.csv).

### Code Breakdown
##### Function extract_content

This function takes three arguments:
1) keywords: Keywords to search for in news articles (string).
2) country: Country label attached to each article (string). It is not passed to the search, so the country name must also appear in keywords.
3) sector: Sector label attached to each article (string).

It calls DDGS().news with the provided keywords and the following fixed parameters:
- region="wt-wt": Specifies worldwide search.
- safesearch="off": Disables safe search.
- timelimit="m": Limits search to the past month.
- max_results=200: Sets the maximum number of results to 200.

It iterates through the search results and extracts the title, body, and source of each article. The Content column holds the title and body joined by a newline.<br>
It creates a Pandas DataFrame with the extracted information and adds columns for sector and country.<br>
It returns the DataFrame containing the scraped news articles.
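
The table-building step described above can be tried without touching the network. The sketch below is illustrative only: `results_to_frame` is a hypothetical helper (it does not exist in the script), and the sample articles are made up. They are shaped like the dicts the function reads, with `title`, `body`, and `source` keys.

```python
import pandas as pd


def results_to_frame(results, sector, country):
    """Hypothetical helper: turn a list of result dicts into the article table."""
    rows = []
    for result in results:
        # Assumption: a missing title or body is treated as an empty string
        title = result.get("title") or ""
        body = result.get("body") or ""
        rows.append({
            "Title": title,
            "Content": "\n".join([title, body]),
            "Site-Name": result.get("source"),
        })
    df = pd.DataFrame(rows, columns=["Title", "Content", "Site-Name"])
    # Sector and country are plain labels repeated on every row
    df["Sector"] = sector
    df["Country"] = country
    return df


# Made-up sample results, not real search output
sample = [
    {"title": "REIT posts higher income", "body": "Rental revenue rose.", "source": "Example News"},
    {"title": "Developer reports results", "body": None, "source": "Sample Daily"},
]
frame = results_to_frame(sample, sector="Real Estate", country="Singapore")
print(frame[["Title", "Sector", "Country"]])
```

Separating the search call from the table construction like this makes the column layout easy to check on its own.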

##### Script Execution

The script defines several sectors (Real Estate, Healthcare, Construction, etc.). For each sector, it:
- Calls the extract_content function for news articles related to Singapore.
- Calls the extract_content function for news articles related to Malaysia.
- Concatenates the results from both countries into a single DataFrame.
- Saves the DataFrame to a CSV file named after the sector (e.g., ddgs_realestate.csv).

A time.sleep(10) call between sectors avoids overwhelming DuckDuckGo's servers with too many requests at once.
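
The per-sector steps repeat for every sector, so they could also be expressed as one loop. This is a minimal sketch, not how the notebook is written: `SECTOR_QUERIES` and `scrape_sectors` are hypothetical names, only two sectors are listed, and the search function is passed in as `fetch` so the loop can be exercised with a stand-in.

```python
import os
import time

import pandas as pd

# Hypothetical config: file slug -> list of (keywords, country, sector) queries.
# Keywords are copied from two of the notebook cells; other sectors would follow the same shape.
SECTOR_QUERIES = {
    "logistics": [
        ("Logistics earnings Singapore", "Singapore", "Logistics"),
        ("Logistics earnings malaysia", "Malaysia", "Logistics"),
    ],
    "oilgas": [
        ("oil gas earnings Singapore", "Singapore", "Oil and Gas"),
        ("oil gas earnings malaysia", "Malaysia", "Oil and Gas"),
    ],
}


def scrape_sectors(fetch, sector_queries, pause_seconds=10, out_dir="."):
    """Run every query, write one CSV per sector, and return the written paths."""
    paths = []
    for i, (slug, queries) in enumerate(sector_queries.items()):
        if i > 0:
            # Pause between sectors, mirroring the time.sleep(10) calls
            time.sleep(pause_seconds)
        frames = [fetch(keywords, country, sector) for keywords, country, sector in queries]
        combined = pd.concat(frames)
        path = os.path.join(out_dir, f"ddgs_{slug}.csv")
        combined.to_csv(path, index=False)
        paths.append(path)
    return paths
```

In the notebook this would be called as `scrape_sectors(extract_content, SECTOR_QUERIES)` once `extract_content` is defined. Keeping the keywords in one table also makes copy-paste mistakes between sectors easier to spot.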

In [1]:
from duckduckgo_search import DDGS
import pandas as pd
import time

In [2]:
def extract_content(keywords, country, sector):
    """
    Extracts news articles based on keywords, country, and sector.

    Parameters:
    - keywords (str): Keywords to search for in news articles.
    - country (str): Country label stored with each article (not used to filter the search).
    - sector (str): Sector label stored with each article.

    Returns:
    - df (DataFrame): DataFrame containing extracted news articles with titles, content, source, sector, and country.
    """
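    # Only the keywords drive the search; country and sector are labels added below
    results = DDGS().news(keywords=keywords,
                          region="wt-wt",
                          safesearch="off",
                          timelimit="m",
                          max_results=200)

    info = []
    titles = []
    source = []
    for result in results:
        # Fall back to an empty string so the join below never receives None
        body = result.get('body') or ''
        title = result.get('title') or ''
        info.append('\n'.join([title, body]))  # Content = title and body on separate lines
        titles.append(title)
        source.append(result.get('source'))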

    df = pd.DataFrame({'Title': titles, 'Content': info, 'Site-Name': source})
    df['Sector'] = sector
    df['Country'] = country

    return df


In [3]:
real_estate_sg = extract_content('real estate singapore sg earnings', 'Singapore', 'Real Estate')
# The Malaysia query previously reused the Singapore keywords
real_estate_my = extract_content('real estate malaysia earnings', 'Malaysia', 'Real Estate')
real_estate_df = pd.concat([real_estate_sg, real_estate_my])
real_estate_df.to_csv('ddgs_realestate.csv', index=False)

In [5]:
time.sleep(10)
healthcare_my = extract_content(' healthcare finance malaysia ' , 'Malaysia' , 'Healthcare')
healthcare_sg = extract_content(' healthcare finance singapore ' , 'Singapore' , 'Healthcare')
healthcare_df = pd.concat ([healthcare_sg,healthcare_my])
healthcare_df.to_csv('ddgs_healthcare.csv' , index=False)

In [6]:
time.sleep(10)
construction_sg = extract_content('construction earnings singapore' , 'Singapore' , 'Construction')
construction_my = extract_content('construction earnings malaysia' , 'Malaysia' , 'Construction')
construction_df = pd.concat([construction_sg, construction_my])
construction_df.to_csv('ddgs_construction.csv', index=False)

In [7]:
time.sleep(10)
logistics_my = extract_content('Logistics earnings malaysia' , 'Malaysia' , 'Logistics')
logistics_sg = extract_content('Logistics earnings Singapore' , 'Singapore' , 'Logistics')
logistics_df = pd.concat ([logistics_sg,logistics_my])
logistics_df.to_csv('ddgs_logistics.csv' , index=False)

In [8]:
time.sleep(10)
industrials_my = extract_content('industrials earnings malaysia' , 'Malaysia' , 'Industrials')
industrials_sg = extract_content('industrials earnings Singapore' , 'Singapore' , 'Industrials')
industrials_df = pd.concat ([industrials_sg,industrials_my])
industrials_df.to_csv('ddgs_industrials.csv' , index=False)

In [9]:
time.sleep(10)
oilgas_my = extract_content('oil gas earnings malaysia' , 'Malaysia' , 'Oil and Gas')
oilgas_sg = extract_content('oil gas earnings Singapore' , 'Singapore' , 'Oil and Gas')
oilgas_df = pd.concat ([oilgas_sg,oilgas_my])
oilgas_df.to_csv('ddgs_oilgas.csv' , index=False)

In [10]:
time.sleep(10)
financials_my = extract_content('Financials earnings bank insurance malaysia' , 'Malaysia' , 'Financials')
financials_sg = extract_content('Financials earnings bank insurance Singapore' , 'Singapore' , 'Financials')
financials_df = pd.concat ([financials_sg,financials_my])
financials_df.to_csv('ddgs_financials.csv' , index=False)

In [11]:
time.sleep(10)
# These queries previously reused the Financials keywords
technology_my = extract_content('Technology earnings malaysia' , 'Malaysia' , 'Technology')
technology_sg = extract_content('Technology earnings Singapore' , 'Singapore' , 'Technology')
technology_df = pd.concat ([technology_sg,technology_my])
technology_df.to_csv('ddgs_technology.csv' , index=False)

In [12]:
time.sleep(10)
cg_my = extract_content('Consumer goods earnings malaysia' , 'Malaysia' , 'Consumer Goods')
cg_sg = extract_content('Consumer goods earnings Singapore' , 'Singapore' , 'Consumer Goods')
time.sleep(10)
cg_sg_1 = extract_content('retail earnings Singapore' , 'Singapore' , 'Consumer Goods')
cg_my_1= extract_content('retail earnings malaysia' , 'Malaysia' , 'Consumer Goods')
cg_df = pd.concat ([cg_sg, cg_sg_1 , cg_my , cg_my_1])
cg_df.to_csv('ddgs_consumergoods.csv' , index=False)
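
Consumer Goods is the one sector that combines two queries per country, so the same article may appear in both result sets. The notebook does not deduplicate. The sketch below shows one possible way to do it, under the assumption that a matching title and source identify the same article. `combine_unique` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def combine_unique(frames, keys=("Title", "Site-Name")):
    """Hypothetical helper: concatenate frames and keep the first copy of each article."""
    combined = pd.concat(frames, ignore_index=True)
    # Assumption: rows with the same title and source are the same article
    unique = combined.drop_duplicates(subset=list(keys), keep="first")
    return unique.reset_index(drop=True)


# Made-up rows standing in for the "consumer goods" and "retail" query results
goods = pd.DataFrame({
    "Title": ["Retailer lifts profit", "Grocer expands"],
    "Site-Name": ["Example News", "Sample Daily"],
})
retail = pd.DataFrame({
    "Title": ["Retailer lifts profit", "Mall sales rise"],
    "Site-Name": ["Example News", "Example News"],
})
unique = combine_unique([goods, retail])
print(unique)
```

In the cell above, this would replace the plain `pd.concat` call before writing ddgs_consumergoods.csv.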