<p style="color: darkred; font-size: 50px; text-align: center;"><b>Web Scraping Project</b></p>
<p style="color: darkred; font-size: 30px; text-align: center;">Kun.uz News Articles</p>
<p style="font-size: 20px; text-align: center;">Mukhammadkodir Abdusalomov, Elbek Majidov</p>
<p style="font-size: 20px; text-align: center;">Spring 2025</p>
<p align="center">
  <img src="img/wne-logo2.png" width="498" height="107">
</p>

# 📰 Web Scraping Project: News Articles from kun.uz

## 🎯 Objective
The aim of this project is to scrape and analyze news articles from [kun.uz](https://kun.uz), a leading news platform in Uzbekistan. Using web scraping techniques, we will collect data from different news categories, prepare it for further analysis, and document the entire process.

## 🧾 Why kun.uz?
- It offers a wide range of news categories such as **Technology**, **Sports**, **Society**, etc.
- The content is updated regularly, ensuring a large dataset.
- News data provides strong opportunities for **natural language processing (NLP)** and **topic-based analysis**.
- The site's `robots.txt` does not disallow the news category pages, which supports responsible scraping for academic purposes.
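
The `robots.txt` point above can be checked programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`; the helper name `is_path_allowed` and the sample rules are assumptions made for illustration and are not the real kun.uz `robots.txt`.

```python
from urllib.robotparser import RobotFileParser


def is_path_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only (NOT the real kun.uz robots.txt).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

print(is_path_allowed(SAMPLE_ROBOTS, "https://kun.uz/en/news/category/tech"))
print(is_path_allowed(SAMPLE_ROBOTS, "https://kun.uz/admin/login"))
```

To check the live file instead of a pasted copy, `RobotFileParser` also offers `set_url()` followed by `read()`.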

## 🛠 Tools & Technologies
- **BeautifulSoup** – for parsing and extracting HTML content.
- **Selenium** – for interacting with dynamic content.
- **Pandas** – for structuring and cleaning the data.
- **Jupyter Notebook** – for organizing, documenting, and running the project code.

## 🗂 Project Workflow
1. **Select Target Categories** from kun.uz (e.g., Technology, Sports).
2. **Scrape Articles** using BeautifulSoup and Selenium.
3. **Extract Key Data Fields**: title, publication date, article content, category, and URL.
4. **Clean and Prepare** the data for analysis.
5. **Document and Report** the process and findings.

> ⚠️ _Note: This project is for educational purposes only. All data is used in compliance with kun.uz’s robots.txt file and scraping best practices._


## 🛠️ Libraries and Tools Setup

Before we begin the web scraping process, let's ensure that all necessary libraries are installed and properly set up. This project utilizes the following Python libraries:

- **BeautifulSoup**: For parsing and extracting data from HTML and XML files.
- **Selenium**: For automating web browser interaction, especially useful for dynamic content.
- **Pandas**: For data manipulation and analysis.
- **Requests**: For making HTTP requests to retrieve web pages.

### Installation

To install these libraries, run the following commands in a code cell:

```python
!pip install beautifulsoup4 selenium pandas requests
!pip install tensorflow
!pip install webdriver_manager
!pip install torch
!pip install nltk
!pip install transformers
```




In [88]:
# !pip install beautifulsoup4 selenium pandas requests

In [95]:
# !pip install tensorflow

In [89]:
# !pip install webdriver_manager

In [100]:
# !pip install nltk

In [90]:
# !pip install torch

In [102]:
# !pip install transformers

In [2]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import requests
import time
import selenium
import numpy as np
import re
from tqdm import tqdm
from datetime import datetime, timedelta
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait  # wait until an element reaches a given state
from selenium.webdriver.support import expected_conditions as EC  # conditions such as visibility or clickability

In [123]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
url = "https://kun.uz/en"

In [8]:
chromepath = ChromeDriverManager().install()
service_chrome = Service(executable_path=chromepath)
options_chrome = webdriver.ChromeOptions()
driver_chrome = webdriver.Chrome(service=service_chrome, options=options_chrome)
driver_chrome.maximize_window()

def handle_cookie_consent():
    try:
        consent_button = WebDriverWait(driver_chrome, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id="accept-consent"]'))
        )
        consent_button.click()
    except Exception as e:
        print("No cookie consent banner appeared or it was already handled.", e)
        
category_urls = [
    "https://kun.uz/en/news/category/politics",
    "https://kun.uz/en/news/category/society",
    "https://kun.uz/en/news/category/business",
    "https://kun.uz/en/news/category/tech",
    "https://kun.uz/en/news/category/culture",
    "https://kun.uz/en/news/category/sport-en",
    "https://kun.uz/en/news/category/tourism"
]
# XPath of the "More News" button that loads further articles on a category page
load_more_xpath = "//button[@class='point-view__footer-btn']//span[contains(text(),'More News')]"
all_pages = {}
try:
    for url in category_urls:
        driver_chrome.get(url)
        time.sleep(3)
        handle_cookie_consent()      
        for _ in range(5):
            try:
                # Random pause of at least 5 seconds between clicks
                time.sleep(np.random.chisquare(1) + 5)
                # Wait for the "More News" button to become visible, then click it
                button = WebDriverWait(driver_chrome, 15).until(
                    EC.visibility_of_element_located((By.XPATH, load_more_xpath))
                )
                button.click()
            except Exception as e:
                print(f"Load More failed on: {url}", e)
                break
        all_pages[url] = driver_chrome.page_source  # Save source for each category
finally:
    driver_chrome.quit()

Load More failed on: https://kun.uz/en/news/category/politics Message: 
Stacktrace:
	GetHandleVerifier [0x009EC7F3+24435]
	(No symbol) [0x00972074]
	(No symbol) [0x008406E3]
	(No symbol) [0x00888B39]
	(No symbol) [0x00888E8B]
	(No symbol) [0x008D1AC2]
	(No symbol) [0x008AD804]
	(No symbol) [0x008CF20A]
	(No symbol) [0x008AD5B6]
	(No symbol) [0x0087C54F]
	(No symbol) [0x0087D894]
	GetHandleVerifier [0x00CF70A3+3213347]
	GetHandleVerifier [0x00D0B0C9+3295305]
	GetHandleVerifier [0x00D0558C+3271948]
	GetHandleVerifier [0x00A87360+658144]
	(No symbol) [0x0097B27D]
	(No symbol) [0x00978208]
	(No symbol) [0x009783A9]
	(No symbol) [0x0096AAC0]
	BaseThreadInitThunk [0x76A45D49+25]
	RtlInitializeExceptionChain [0x77C5CF0B+107]
	RtlGetAppContainerNamedObjectPath [0x77C5CE91+561]

No cookie consent banner appeared or it was already handled. Message: 

Load More failed on: https://kun.uz/en/news/category/society Message: 
Stacktrace:
	GetHandleVerifier [0x009EC7F3+24435]
	(No symbol) [0x00972074]


In [51]:
len(all_pages)

7

In [28]:
scraped_data = []
category_topics = {
    "https://kun.uz/en/news/category/politics": "Politics",
    "https://kun.uz/en/news/category/society": "Society",
    "https://kun.uz/en/news/category/business": "Business",
    "https://kun.uz/en/news/category/tech": "Tech",
    "https://kun.uz/en/news/category/culture": "Culture",
    "https://kun.uz/en/news/category/sport-en": "Sport",
    "https://kun.uz/en/news/category/tourism": "Tourism"
}

for url, page_source in tqdm(all_pages.items(), desc="Scraping all categories"):
    soup = BeautifulSoup(page_source, 'html.parser')
    topic = category_topics.get(url, 'Unknown')
    main_container = soup.find('div', {'id': 'news-list'})
    if not main_container:
        print(f"No article container found on {url}")
        continue
    article_links = main_container.find_all('a', class_='news-page__item')
    for article in article_links:
        relative_url = article.get('href', '')
        full_url = 'https://kun.uz' + relative_url if relative_url else "N/A"
        # Extract the headline
        headline_tag = article.find('h3', class_='news-page__item-title')
        headline = headline_tag.get_text(strip=True) if headline_tag else "N/A"
        # Extract the publication date
        date_div = article.find('div', class_='gray-date')
        date = date_div.get_text(strip=True) if date_div else "N/A"
        scraped_data.append({
            'date': date,
            'topic': topic,
            'headline': headline,
            'link': full_url,
            'category_url': url   
        })
        
df = pd.DataFrame(scraped_data)

Scraping all categories: 100%|███████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  8.52it/s]
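
The cell above builds each article URL by concatenating `'https://kun.uz'` with the `href` value, which assumes every link is site-relative. As a hedged alternative, `urllib.parse.urljoin` handles both relative and absolute links. The helper name `absolute_link` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://kun.uz"


def absolute_link(href: str, base: str = BASE_URL) -> str:
    """Resolve an href from a listing page against the site root."""
    if not href:
        return "N/A"  # same placeholder the notebook uses for missing values
    return urljoin(base, href)


# Hypothetical article paths, used only to show the behavior
print(absolute_link("/en/news/2025/04/04/example-article"))
print(absolute_link("https://kun.uz/en/news/2025/04/04/example-article"))
print(absolute_link(""))
```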


In [30]:
df.shape

(135, 5)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   category_url  135 non-null    object
 1   headline      135 non-null    object
 2   link          135 non-null    object
 3   topic         135 non-null    object
 4   date          135 non-null    object
dtypes: object(5)
memory usage: 5.4+ KB


In [24]:
df.describe(include='O')

Unnamed: 0,category_url,headline,link,topic,date
count,135,135,135,135,135
unique,7,135,135,7,135
top,https://kun.uz/en/news/category/tech,"President Mirziyoyev backs closer scientific, ...",https://kun.uz/en/news/2025/04/04/president-mi...,Tech,17:17 / 04.04.2025
freq,45,1,1,45,1


In [32]:
df.head()

Unnamed: 0,date,topic,headline,link,category_url
0,17:17 / 04.04.2025,Politics,"President Mirziyoyev backs closer scientific, ...",https://kun.uz/en/news/2025/04/04/president-mi...,https://kun.uz/en/news/category/politics
1,16:12 / 04.04.2025,Politics,Sadyr Japarov urges EU investment in Kyrgyz hy...,https://kun.uz/en/news/2025/04/04/sadyr-japaro...,https://kun.uz/en/news/category/politics
2,15:58 / 04.04.2025,Politics,Uzbekistan proposes long-term roadmap for Cent...,https://kun.uz/en/news/2025/04/04/uzbekistan-p...,https://kun.uz/en/news/category/politics
3,15:56 / 04.04.2025,Politics,President Mirziyoyev warns of security threats...,https://kun.uz/en/news/2025/04/04/president-mi...,https://kun.uz/en/news/category/politics
4,14:51 / 04.04.2025,Politics,“Diplomacy is the only path forward in resolvi...,https://kun.uz/en/news/2025/04/04/diplomacy-is...,https://kun.uz/en/news/category/politics
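
The `date` column is stored as text in the form `17:17 / 04.04.2025`. A minimal sketch for converting such strings to `datetime` objects is shown below. It assumes the order is day, then month (which the sample rows cannot confirm, since day and month are both 04), and the helper name `parse_listing_date` is hypothetical.

```python
from datetime import datetime

# Assumed layout: hour:minute / day.month.year
KUN_UZ_DATE_FORMAT = "%H:%M / %d.%m.%Y"


def parse_listing_date(raw: str):
    """Parse a listing timestamp such as '17:17 / 04.04.2025'; return None on failure."""
    try:
        return datetime.strptime(raw.strip(), KUN_UZ_DATE_FORMAT)
    except ValueError:
        # Covers the "N/A" placeholder and any unexpected format
        return None


print(parse_listing_date("17:17 / 04.04.2025"))
print(parse_listing_date("N/A"))
```

With pandas, the same function could be applied column-wise, for example `df['date'].apply(parse_listing_date)`.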


In [125]:
scraped_data = []
category_topics = {
    "https://kun.uz/en/news/category/politics": "Politics",
    "https://kun.uz/en/news/category/society": "Society",
    "https://kun.uz/en/news/category/business": "Business",
    "https://kun.uz/en/news/category/tech": "Tech",
    "https://kun.uz/en/news/category/culture": "Culture",
    "https://kun.uz/en/news/category/sport-en": "Sport",
    "https://kun.uz/en/news/category/tourism": "Tourism"
}
for url, page_source in tqdm(all_pages.items(), desc="Scraping all categories"):
    soup = BeautifulSoup(page_source, 'html.parser')
    topic = category_topics.get(url, 'Unknown')
    main_container = soup.find('div', {'id': 'news-list'})
    if not main_container:
        print(f"No article container found on {url}")
        continue
    article_links = main_container.find_all('a', class_='news-page__item')
    for article in article_links:
        relative_url = article.get('href', '')
        full_url = 'https://kun.uz' + relative_url if relative_url else "N/A"
        headline_tag = article.find('h3', class_='news-page__item-title')
        headline = headline_tag.get_text(strip=True) if headline_tag else "N/A"
        date_div = article.find('div', class_='gray-date')
        date = date_div.get_text(strip=True) if date_div else "N/A"
        # Fetch the article page itself to collect the full text
        article_content = "N/A"
        if full_url != "N/A":
            try:
                response = requests.get(full_url, timeout=10)
                if response.status_code == 200:
                    article_soup = BeautifulSoup(response.text, 'html.parser')
                    content_div = article_soup.find('div', class_='news-inner__content-page')
                    if content_div:
                        paragraphs = content_div.find_all('p')
                        article_content = '\n'.join(p.get_text(strip=True) for p in paragraphs)
            except Exception as e:
                print(f"Failed to fetch content from {full_url}: {e}")
        scraped_data.append({
            'date': date,
            'topic': topic,
            'headline': headline,
            'content': article_content,
            'category_url': url,
            'link': full_url, 
        })
        time.sleep(0.5)  # brief pause between article requests
        
df2 = pd.DataFrame(scraped_data)

Scraping all categories: 100%|███████████████████████████████████████████████████████████| 7/7 [01:55<00:00, 16.47s/it]


In [129]:
df2.head()

Unnamed: 0,date,topic,headline,content,category_url,link
0,17:17 / 04.04.2025,Politics,"President Mirziyoyev backs closer scientific, ...",The president began by emphasizing the importa...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/president-mi...
1,16:12 / 04.04.2025,Politics,Sadyr Japarov urges EU investment in Kyrgyz hy...,Speaking before leaders from Central Asia and ...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/sadyr-japaro...
2,15:58 / 04.04.2025,Politics,Uzbekistan proposes long-term roadmap for Cent...,“A historic decision will be adopted at the co...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/uzbekistan-p...
3,15:56 / 04.04.2025,Politics,President Mirziyoyev warns of security threats...,In his speech at the first Central Asia–Europe...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/president-mi...
4,14:51 / 04.04.2025,Politics,“Diplomacy is the only path forward in resolvi...,"At the beginning of his speech, the president ...",https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/diplomacy-is...


In [131]:
df2.iloc[1, 3]

'Speaking before leaders from Central Asia and the European Union, President Japarov reaffirmed Kyrgyzstan’s commitment to strengthening mutually beneficial ties with its closest neighbors and external partners. “Deepening cooperation with neighboring countries remains a top priority of Kyrgyzstan’s foreign policy,” he stated, highlighting that close collaboration among Central Asian states plays a pivotal role in maintaining security and sustainable development in the region.\nJaparov underlined recent breakthroughs in regional diplomacy, including the formal resolution of long-standing border disputes. “In 2022, we signed a historic agreement with Uzbekistan to formalize our state border. Now, the Kyrgyz-Tajik border has also been completely resolved,” he noted. He recalled the recent visit of Tajik President Emomali Rahmon to Bishkek, during which the treaty on the state border was signed and subsequently ratified by Kyrgyzstan’s parliament on March 25.\n“These landmark agreements w

In [133]:
today = datetime.today()

In [135]:
df2['Scraped_Date'] = today.strftime('%Y-%m-%d')

In [137]:
df2.head()

Unnamed: 0,date,topic,headline,content,category_url,link,Scraped_Date
0,17:17 / 04.04.2025,Politics,"President Mirziyoyev backs closer scientific, ...",The president began by emphasizing the importa...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/president-mi...,2025-04-05
1,16:12 / 04.04.2025,Politics,Sadyr Japarov urges EU investment in Kyrgyz hy...,Speaking before leaders from Central Asia and ...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/sadyr-japaro...,2025-04-05
2,15:58 / 04.04.2025,Politics,Uzbekistan proposes long-term roadmap for Cent...,“A historic decision will be adopted at the co...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/uzbekistan-p...,2025-04-05
3,15:56 / 04.04.2025,Politics,President Mirziyoyev warns of security threats...,In his speech at the first Central Asia–Europe...,https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/president-mi...,2025-04-05
4,14:51 / 04.04.2025,Politics,“Diplomacy is the only path forward in resolvi...,"At the beginning of his speech, the president ...",https://kun.uz/en/news/category/politics,https://kun.uz/en/news/2025/04/04/diplomacy-is...,2025-04-05


In [139]:
df2.shape

(135, 7)

In [145]:
df2.isna().sum()

date            0
topic           0
headline        0
content         0
category_url    0
link            0
Scraped_Date    0
dtype: int64
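
The zero counts above do not rule out missing values: the scraping code stores the string `"N/A"` when a field is not found, and an article with no paragraphs yields an empty string, neither of which `isna()` treats as missing. The sketch below counts such placeholders per field on a list of dictionaries shaped like `scraped_data`. The helper name `count_placeholders` and the sample records are made up for illustration.

```python
from collections import Counter


def count_placeholders(records, placeholder="N/A"):
    """Count, per field, how many records hold the placeholder or an empty string."""
    counts = Counter()
    for record in records:
        for field, value in record.items():
            if value == placeholder or value == "":
                counts[field] += 1
    return counts


# Made-up records in the same shape as `scraped_data`
sample = [
    {"headline": "Example headline", "content": "Some text", "link": "https://kun.uz/en/news/x"},
    {"headline": "N/A", "content": "", "link": "https://kun.uz/en/news/y"},
    {"headline": "Another headline", "content": "N/A", "link": "N/A"},
]
print(dict(count_placeholders(sample)))
```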

In [151]:
df2.value_counts()

date  topic  headline  content

In [143]:
df2.to_csv('Kun-Uz.csv', index=False)

# 📰 Web Scraping & NLP Project: Analyzing News from kun.uz

## 🎯 Objective
To scrape news articles from [kun.uz](https://kun.uz/en), extract relevant information, and perform basic natural language processing (NLP) tasks such as tokenization, stopword analysis, and sentiment classification.

---

## 💡 Why This Project?

- **Real-world NLP**: News articles provide rich textual data suitable for natural language processing tasks like sentiment analysis and keyword extraction.
- **Structured Content**: kun.uz organizes content into categories (e.g., Tech, Politics), facilitating thematic classification and analysis.
- **Legal & Ethical**: The site's `robots.txt` does not disallow the news category pages, and the data is collected for educational use only.

---

## 🧰 Tools Used

| Tool/Library       | Purpose                                         |
|--------------------|-------------------------------------------------|
| `Selenium`         | Interacting with dynamic web content            |
| `BeautifulSoup`    | Parsing and extracting HTML data                |
| `Pandas`           | Structuring and cleaning the extracted data     |
| `Requests`         | Fetching article content                        |
| `NLTK`             | Performing tokenization, stopword removal, and lemmatization |
| `Transformers`     | Conducting sentiment analysis using pre-trained models |
| `Jupyter Notebook` | Developing and documenting the workflow         |

---

## 🗂 Workflow Summary

### 🔍 Data Collection
1. **Category Selection**: Targeted categories such as **Politics**, **Tech**, and **Tourism**.
2. **Dynamic Content Handling**: Utilized Selenium to load additional articles by simulating user interactions (e.g., clicking "Load More").
3. **Data Extraction**: Employed BeautifulSoup to parse HTML and extract details like headlines, publication dates, URLs, and full article content.

### 🧹 Data Processing
- **Data Structuring**: Organized the extracted data into a Pandas DataFrame with columns `date`, `topic`, `headline`, `content`, `category_url`, and `link`, plus a `Scraped_Date` column recording the collection date.
- **Text Cleaning**: Performed preprocessing tasks including tokenization, stopword removal, and lemmatization using NLTK.

### 📊 NLP Analysis
- **Word Statistics**: Calculated metrics such as total word count, unique word count, and stopword count for each article.
- **Sentiment Analysis**:
  - Implemented the `transformers` library's pipeline with the `distilbert-base-uncased-finetuned-sst-2-english` model.
  - Generated sentiment labels (`POSITIVE`/`NEGATIVE`) along with confidence scores for each article.
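
The word statistics listed above can be sketched as follows. This is a simplified illustration, not the project's code: it uses a plain regular-expression tokenizer and a tiny hand-written stopword set, whereas the project relies on NLTK. The function name `word_stats` is hypothetical.

```python
import re

# Tiny illustrative stopword list; the project itself uses NLTK's English list
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "on", "for"}


def word_stats(text: str) -> dict:
    """Return total, unique, and stopword counts for lowercase word tokens."""
    # Simple tokenizer: runs of ASCII letters and straight apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "total_words": len(tokens),
        "unique_words": len(set(tokens)),
        "stopwords": sum(token in STOPWORDS for token in tokens),
    }


print(word_stats("The president spoke on the future of the region."))
```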

---

## 📁 Output

The final dataset comprises:
- Headlines
- Full Article Content
- Topics
- Publication Dates
- Sentiment Labels & Scores
- Word Statistics (total words, unique words, stopwords)

To save the DataFrame as a CSV file in the current working directory:

```python
df.to_csv("kunuz_scraped_data.csv", index=False)
```


> ⚠️ **Disclaimer**
>
> This project is intended for educational purposes only. Sharing or copying its content is prohibited.
>
> **Author:** Mukhammadkodir Abdusalomov, Student at UW WNE.
