<p style="color: darkred; font-size: 50px; text-align: center;"><b>Web Scraping Project</b></p>
<p style="color: darkred; font-size: 30px; text-align: center;">Kun.uz News Articles</p>
<p style="font-size: 20px; text-align: center;">Mukhammadkodir Abdusalomov, Elbek Majidov</p>
<p style="font-size: 20px; text-align: center;">Spring 2025</p>
<p align="center">
  <img src="img/wne-logo2.png" width="498" height="107">
</p>

# 📰 Web Scraping Project: News Articles from kun.uz

## 🎯 Objective
The aim of this project is to scrape and analyze news articles from [kun.uz](https://kun.uz), a leading news platform in Uzbekistan. Using web scraping techniques, we will collect data from different news categories, prepare it for further analysis, and document the entire process.

## 🧾 Why kun.uz?
- It offers a wide range of news categories such as **Technology**, **Sports**, **Society**, etc.
- The content is updated regularly, ensuring a large dataset.
- News data provides strong opportunities for **natural language processing (NLP)** and **topic-based analysis**.
- The site's `robots.txt` does not disallow crawling of the news category pages, which supports scraping them at a modest request rate for academic purposes.
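
The `robots.txt` point can be checked programmatically. The sketch below uses the standard library's `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT` are hypothetical and only there so the example runs offline, so they should not be read as the actual content of kun.uz's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real file at
# https://kun.uz/robots.txt may differ and should be checked directly.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
""".splitlines()


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://kun.uz/en/news/category/tech"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://kun.uz/admin/login"))
```

Against the live site, the same parser can be pointed at the real file with `parser.set_url("https://kun.uz/robots.txt")` followed by `parser.read()` instead of `parser.parse(...)`.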

> 💡 All tools are used within a **Jupyter Notebook** environment for interactive development and documentation.

## 🗂 Project Workflow
1. **Define Target Categories**  
   - A list of 7 category URLs (e.g., politics, tech, tourism) is prepared for scraping.

2. **Initialize Selenium WebDriver**  
   - Chrome browser is automated using Selenium.
   - Cookie consent banners are detected and accepted programmatically.

3. **Load Dynamic Content**  
   - The "Load More News" button is clicked multiple times using Selenium to reveal additional articles.

4. **Store Raw Page Source**  
   - For each category, the final loaded HTML is stored for parsing.

5. **Parse HTML with BeautifulSoup**  
   - Extract article links, headlines, and publication dates from each category page.

6. **Fetch Full Articles with Requests**  
   - Each article is visited individually using the Requests library to extract full content paragraphs.

7. **Clean and Format Data**  
   - Dates are extracted using **regex** and standardized using `datetime`.
   - All relevant information is stored in a structured format (list of dictionaries).

8. **Export to CSV**  
   - The final dataset is converted into a Pandas DataFrame and saved as `Kun-Uz.csv`.
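
Step 7 can be sketched as a small standalone function. It assumes the date appears somewhere in the label as `dd.mm.yyyy`; the sample strings are hypothetical, and `normalize_date` is an illustrative name rather than something defined in the notebook.

```python
import re
from datetime import datetime

# Matches a dd.mm.yyyy date anywhere inside a longer label.
DATE_PATTERN = re.compile(r"(\d{2})\.(\d{2})\.(\d{4})")


def normalize_date(raw_text, missing="N/A"):
    """Return the first dd.mm.yyyy date in raw_text as YYYY-MM-DD, or `missing`."""
    match = DATE_PATTERN.search(raw_text)
    if not match:
        return missing
    try:
        parsed = datetime.strptime(match.group(0), "%d.%m.%Y")
    except ValueError:
        # The digits matched the pattern but are not a real calendar date.
        return missing
    return parsed.strftime("%Y-%m-%d")


# Hypothetical label formats, for illustration only.
for label in ["12:30 / 05.04.2025", "Today, 14:05", "31.02.2025"]:
    print(repr(label), "->", normalize_date(label))
```

Catching `ValueError` keeps one malformed label from stopping a long scraping run; the record simply gets the placeholder instead.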

> ⚠️ _Note: This project is for educational purposes only. All data is used in compliance with kun.uz’s robots.txt file and scraping best practices._


## 🛠️ Libraries and Tools Setup

Before we begin the web scraping process, let's ensure that all necessary libraries are installed and properly set up. This project utilizes the following Python libraries:

- **BeautifulSoup (`bs4`)**: For parsing and extracting data from HTML content.
- **Selenium**: For automating web browser interactions, particularly useful for handling dynamic content like “Load More” buttons or cookie banners.
- **Pandas (`pd`)**: For organizing and analyzing the scraped data, and exporting to CSV.
- **Requests**: For sending HTTP requests to fetch individual article content.
- **NumPy (`np`)**: Used for generating randomized delays to mimic human behavior.
- **Regular Expressions (`re`)**: Used for cleaning and formatting dates from raw text.
- **Datetime**: Used for converting raw date strings into a consistent date format (`YYYY-MM-DD`).
- **TQDM**: Provides progress bars to visually track scraping progress.
- **Webdriver Manager**: Automatically handles downloading and managing the correct ChromeDriver version for Selenium.
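
The randomized delay mentioned for NumPy can be illustrated without any third-party dependency. The sketch below swaps in the standard library's `random.gammavariate`, relying on the fact that a chi-square distribution with 1 degree of freedom is a Gamma distribution with shape 0.5 and scale 2. The `cap` parameter and the name `polite_delay` are assumptions added here to bound the long tail; they are not part of the notebook.

```python
import random


def polite_delay(base=5.0, cap=15.0, rng=random):
    """Return a wait time in seconds: `base` plus chi-square(1) jitter, capped."""
    # chi-square with 1 degree of freedom == Gamma(shape=0.5, scale=2.0);
    # its mean is 1, so the typical wait is about base + 1 seconds.
    jitter = rng.gammavariate(0.5, 2.0)
    return min(base + jitter, cap)


seeded = random.Random(42)  # seeded only so repeated runs give the same numbers
delays = [polite_delay(rng=seeded) for _ in range(5)]
print([round(d, 2) for d in delays])
```

In a scraping loop the returned value would be passed to `time.sleep(...)`.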

Make sure to install these packages if you haven’t already:

In [88]:
# !pip install beautifulsoup4 selenium pandas requests numpy tqdm

In [95]:
# !pip install tensorflow

In [89]:
# !pip install webdriver_manager

In [100]:
# !pip install nltk

In [90]:
# !pip install torch

In [102]:
# !pip install transformers

In [25]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import requests
import time
import selenium
import numpy as np
import re
from tqdm import tqdm
from datetime import datetime, timedelta
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [27]:
import warnings
warnings.filterwarnings('ignore')

In [29]:
url = "https://kun.uz/en"

In [31]:
chromepath = ChromeDriverManager().install()
service_chrome = Service(executable_path=chromepath)
options_chrome = webdriver.ChromeOptions()
driver_chrome = webdriver.Chrome(service=service_chrome, options=options_chrome)
driver_chrome.maximize_window()

def handle_cookie_consent():
    try:
        consent_button = WebDriverWait(driver_chrome, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id="accept-consent"]'))
        )
        consent_button.click()
    except Exception as e:
        print("No cookie consent banner appeared or it was already handled.", e)
        
category_urls = [
    "https://kun.uz/en/news/category/politics",
    "https://kun.uz/en/news/category/society",
    "https://kun.uz/en/news/category/business",
    "https://kun.uz/en/news/category/tech",
    "https://kun.uz/en/news/category/culture",
    "https://kun.uz/en/news/category/sport-en",
    "https://kun.uz/en/news/category/tourism"
]
load_more_xpath = "//button[@class='point-view__footer-btn']//span[contains(text(),'More News')]"
all_pages = {}
try:
    for url in category_urls:
        driver_chrome.get(url)
        time.sleep(3)
        handle_cookie_consent()      
        for i in range(5):
            try:
                time.sleep(np.random.chisquare(1) + 5)
                # Wait until the button is clickable, then click the element the wait returns
                button = WebDriverWait(driver_chrome, 15).until(
                    EC.element_to_be_clickable((By.XPATH, load_more_xpath))
                )
                button.click()
            except Exception as e:
                print(f"Load More failed on: {url}", e)
                break
        all_pages[url] = driver_chrome.page_source  # Save source for each category
finally:
    driver_chrome.quit()

Load More failed on: https://kun.uz/en/news/category/politics Message: 
Stacktrace: [ChromeDriver native frames omitted]

No cookie consent banner appeared or it was already handled. Message: 

Load More failed on: https://kun.uz/en/news/category/society Message: 
Stacktrace: [ChromeDriver native frames omitted]


In [33]:
len(all_pages)

7
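
The "Load More failed" messages come from the loop treating any exception as the signal to stop clicking and move on to the next category. That control flow can be sketched independently of Selenium. `click_until_exhausted` and `FakeButton` are hypothetical names used only for this illustration; in the notebook, `click_once` would wrap the `WebDriverWait(...).until(...)` call and the `.click()` that follows it.

```python
def click_until_exhausted(click_once, max_clicks=5):
    """Call click_once() up to max_clicks times and stop at the first failure.

    Returns the number of clicks that succeeded.
    """
    clicks = 0
    for _ in range(max_clicks):
        try:
            click_once()
        except Exception as exc:  # broad on purpose, mirroring the notebook loop
            print(f"Stopped after {clicks} click(s): {type(exc).__name__}")
            break
        clicks += 1
    return clicks


class FakeButton:
    """Stand-in for a 'More News' button that disappears after a few clicks."""

    def __init__(self, pages_available):
        self.pages_available = pages_available

    def click(self):
        if self.pages_available == 0:
            raise TimeoutError("button not found")
        self.pages_available -= 1


print(click_until_exhausted(FakeButton(2).click))   # button runs out early
print(click_until_exhausted(FakeButton(10).click))  # limit of 5 is reached
```

Returning the click count makes it easy to log how many extra batches each category actually loaded before its page source was saved.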

In [35]:
scraped_data = []
category_topics = {
    "https://kun.uz/en/news/category/politics": "Politics",
    "https://kun.uz/en/news/category/society": "Society",
    "https://kun.uz/en/news/category/business": "Business",
    "https://kun.uz/en/news/category/tech": "Tech",
    "https://kun.uz/en/news/category/culture": "Culture",
    "https://kun.uz/en/news/category/sport-en": "Sport",
    "https://kun.uz/en/news/category/tourism": "Tourism"
}
for url, page_source in tqdm(all_pages.items(), desc="Scraping all categories"):
    soup = BeautifulSoup(page_source, 'html.parser')
    topic = category_topics.get(url, 'Unknown')
    main_container = soup.find('div', {'id': 'news-list'})
    if not main_container:
        print(f"No article container found on {url}")
        continue
    article_links = main_container.find_all('a', class_='news-page__item')
    for article in article_links:
        relative_url = article.get('href', '')
        full_url = 'https://kun.uz' + relative_url if relative_url else "N/A"
        headline_tag = article.find('h3', class_='news-page__item-title')
        headline = headline_tag.get_text(strip=True) if headline_tag else "N/A"
        date_div = article.find('div', class_='gray-date')
        # Pull the dd.mm.yyyy date out of the label and normalize it to YYYY-MM-DD
        raw_date = date_div.get_text(strip=True) if date_div else "N/A"
        match = re.search(r'(\d{2})\.(\d{2})\.(\d{4})', raw_date)
        if match:
            day, month, year = match.groups()
            date = datetime.strptime(f"{day}.{month}.{year}", "%d.%m.%Y").strftime("%Y-%m-%d")
        else:
            date = "N/A"
        article_content = "N/A"
        if full_url != "N/A":
            try:
                response = requests.get(full_url, timeout=10)
                if response.status_code == 200:
                    article_soup = BeautifulSoup(response.text, 'html.parser')
                    content_div = article_soup.find('div', class_='news-inner__content-page')
                    if content_div:
                        paragraphs = content_div.find_all('p')
                        article_content = '\n'.join(p.get_text(strip=True) for p in paragraphs)
            except Exception as e:
                print(f"Failed to fetch content from {full_url}: {e}")
        scraped_data.append({
            'date': date,
            'topic': topic,
            'headline': headline,
            'content': article_content,
            'category_url': url,
            'link': full_url, 
        })
        time.sleep(0.5)
        
df = pd.DataFrame(scraped_data)

Scraping all categories: 100%|██████████| 7/7 [01:14<00:00, 10.68s/it]
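
The scraping loop builds each article URL by prefixing `https://kun.uz` to the `href` value, which works as long as every link is site-relative. A slightly more defensive variant, sketched here with the standard library's `urljoin`, also copes with links that are already absolute. The example paths are hypothetical, and `build_article_url` is an illustrative name, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://kun.uz"


def build_article_url(href, base=BASE_URL):
    """Resolve an href from a category page into an absolute URL.

    Returns None for a missing or empty href so the caller can skip the fetch.
    """
    if not href:
        return None
    return urljoin(base, href)


# Hypothetical href values, for illustration only.
print(build_article_url("/en/news/2025/04/07/example-article"))
print(build_article_url("https://kun.uz/en/news/2025/04/07/example-article"))
print(build_article_url(""))
```

Returning `None` rather than the string `"N/A"` is a design choice for this sketch: it lets pandas treat the value as missing later on.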


In [63]:
df.head()

| | date | topic | headline | content | category_url | link |
|---|---|---|---|---|---|---|
| 0 | | Politics | Former Presidential Administration chief Zayni... | Nizomiddinov previously served as Chief of Sta... | https://kun.uz/en/news/category/politics | https://kun.uz/en/news/2025/04/07/former-presi... |
| 1 | | Politics | President Mirziyoyev receives Russian Federati... | The meeting focused on further strengthening t... | https://kun.uz/en/news/category/politics | https://kun.uz/en/news/2025/04/07/president-mi... |
| 2 | | Politics | Shavkat Mirziyoyev: Uzbekistan cut poverty rat... | He emphasized that ensuring social protection ... | https://kun.uz/en/news/category/politics | https://kun.uz/en/news/2025/04/07/shavkat-mirz... |
| 3 | | Politics | World Bank allocates $153M to restore Uzbekist... | The World Bank will provide $153 million in fi... | https://kun.uz/en/news/category/politics | https://kun.uz/en/news/2025/04/07/world-bank-a... |
| 4 | 2025-04-05 | Politics | Uzbekistan’s population reaches 37.7 million: ... | The National Statistics Committeereportsthat, ... | https://kun.uz/en/news/category/politics | https://kun.uz/en/news/2025/04/05/uzbekistans-... |


In [59]:
df.tail()

| | date | topic | headline | content | category_url | link |
|---|---|---|---|---|---|---|
| 100 | 2024-11-28 | Tourism | Tourism Malaysia promotes Malaysian tourism at... | The Ambassador of Malaysia to Uzbekistan, His ... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/28/tourism-mala... |
| 101 | 2024-11-20 | Tourism | Foreign tourist arrivals in Uzbekistan surge b... | October alone saw nearly 800,000 international... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/20/foreign-tour... |
| 102 | 2024-11-15 | Tourism | Kazakhstan and Uzbekistan to launch historic S... | The four-day international train tour offers t... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/15/kazakhstan-a... |
| 103 | 2024-11-07 | Tourism | Tashkent to launch train service to Issyk-Kul ... | This agreement followed a bilateral meeting be... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/07/tashkent-to-... |
| 104 | 2024-10-19 | Tourism | Uzbekistan and Switzerland collaborate to deve... | In his opening remarks, Ambassador Dilshod Akh... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/10/19/uzbekistan-a... |


In [79]:
df.iloc[100, 3]

'The Ambassador of Malaysia to Uzbekistan, His Excellency Ilham Tuah Illias, along with Mr. Nor Shazly Azmi, Director of Tourism Malaysia Almaty, visited the Malaysian Pavilion. Representing Malaysia’s tourism industry, nine prominent companies showcased their offerings:\nLangkawi Development Authority\nAsia Premium Holidays Sdn Bhd\nMardhiyyah Hotel & Suites\nFun & Sun Holidays Sdn Bhd\nEmbassy Alliance Travel Sdn Bhd\nIdeal Hotel Kuala Lumpur\nLangkawi SkyCab\nLexis Hotels & Resorts Sdn Bhd\nBatik Air\nThese companies aim to solidify Malaysia\'s position in the Central Asian tourism market.\n\nTourism and cultural exchange\nSpeaking at the event, Ambassador Ilham Tuah Illias highlighted the growing ties between Uzbekistan and Malaysia:\n“In 2023, approximately 12,000 Malaysian tourists visited Uzbekistan, and the numbers are increasing, with most opting for historical cities like Samarkand, Bukhara, and Khiva. Similarly, Uzbek tourists are drawn to Malaysia for its diverse landsc

In [65]:
today = datetime.today()

In [67]:
df['Scraped_Date'] = today.strftime('%Y-%m-%d')

In [71]:
df.tail()

| | date | topic | headline | content | category_url | link | Scraped_Date |
|---|---|---|---|---|---|---|---|
| 100 | 2024-11-28 | Tourism | Tourism Malaysia promotes Malaysian tourism at... | The Ambassador of Malaysia to Uzbekistan, His ... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/28/tourism-mala... | 2025-04-07 |
| 101 | 2024-11-20 | Tourism | Foreign tourist arrivals in Uzbekistan surge b... | October alone saw nearly 800,000 international... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/20/foreign-tour... | 2025-04-07 |
| 102 | 2024-11-15 | Tourism | Kazakhstan and Uzbekistan to launch historic S... | The four-day international train tour offers t... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/15/kazakhstan-a... | 2025-04-07 |
| 103 | 2024-11-07 | Tourism | Tashkent to launch train service to Issyk-Kul ... | This agreement followed a bilateral meeting be... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/11/07/tashkent-to-... | 2025-04-07 |
| 104 | 2024-10-19 | Tourism | Uzbekistan and Switzerland collaborate to deve... | In his opening remarks, Ambassador Dilshod Akh... | https://kun.uz/en/news/category/tourism | https://kun.uz/en/news/2024/10/19/uzbekistan-a... | 2025-04-07 |


In [81]:
df.shape

(105, 7)

In [83]:
df.isna().sum()

date            0
topic           0
headline        0
content         0
category_url    0
link            0
Scraped_Date    0
dtype: int64
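
One caveat about the zeros above: `isna()` only counts real missing values such as `NaN` or `None`, so the `"N/A"` placeholder strings written by the scraping loop are not reflected in them. A minimal sketch of counting those placeholders directly on the list of dictionaries follows. The sample records are hypothetical, and `count_placeholders` is an illustrative helper, not part of the notebook.

```python
from collections import Counter


def count_placeholders(records, placeholder="N/A"):
    """Count, per field, how many records hold the placeholder string."""
    counts = Counter()
    for record in records:
        for field, value in record.items():
            if value == placeholder:
                counts[field] += 1
    return dict(counts)


# Hypothetical records shaped like the entries in scraped_data.
sample_records = [
    {"date": "2025-04-05", "headline": "Example A", "content": "Body text"},
    {"date": "N/A", "headline": "Example B", "content": "N/A"},
    {"date": "N/A", "headline": "Example C", "content": "Body text"},
]

print(count_placeholders(sample_records))
```

On the DataFrame itself, the equivalent check would be `(df == "N/A").sum()`.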

In [77]:
df.value_counts()

date        topic     headline        content

In [85]:
df.to_csv('Kun-Uz.csv', index=False)

> ⚠️ **Disclaimer**
>
> This project is intended for educational purposes only. Sharing or copying its content is prohibited.
>
> **Authors:** Mukhammadkodir Abdusalomov & Elbek Majidov, students at UW WNE.
