# ARMENIAN KINDERGARTEN MULTI-CITY ANALYSIS
## Multi-City Data Scraping, Cleaning, and Comprehensive Statistical Analysis across 10 Cities

#### Students: Iren Stepanyan, Lusine Stepanyan
#### Course: Algorithms and Programming Language (APL)
#### Professor: V. Avetisyan  
##### Date: November 2025


## Project Objective:
Collect and analyze kindergarten data from 10 Armenian cities to understand:
1. Regional variations in demand and capacity
2. Urban vs rural differences
3. Patterns across different municipalities
4. Statistical significance of inter-city differences

## Data Sources:
1. Yerevan: https://mankapartez.yerevan.am/order-view
2. Ijevan: https://ijevan.infosys.am/Pages/KinderGarten/List.aspx
3. Gyumri: https://gyumricity.am/Pages/KinderGarten/List.aspx
4. Vanadzor: https://cmis.vanadzor.am/Pages/KinderGarten/List.aspx
5. Armavir: https://armavircity.am/Pages/KinderGarten/List.aspx
6. Kapan: https://kapan.am/pages/KinderGarten/List.aspx
7. Abovyan: https://abovyan.am/Pages/KinderGarten/List.aspx
8. Sevan: https://sevancity.am/Pages/KinderGarten/List.aspx
9. Vaxarshapat: https://docs.ejmiatsin.am/Pages/KinderGarten/List.aspx
10. Artashat: https://artashat.am/Pages/KinderGarten/List.aspx

## Programming Language
* `Python`

## Libraries used: 

**Data Handling and Analysis:**

* `pandas` – for data manipulation and analysis
* `numpy` – for numerical operations

**Visualization:**

* `matplotlib.pyplot` – for plotting graphs
* `seaborn` – for statistical data visualization

**Statistics / Scientific Computing:**

* `scipy.stats` – statistical tests (`chi2_contingency`, `f_oneway`, `ttest_ind`, `normaltest`, `kruskal`, `levene`, `shapiro`, `mannwhitneyu`)

**Web Scraping / Automation:**

* `selenium` – browser automation (`webdriver`, `By`, `WebDriverWait`, `expected_conditions`)
* `bs4` (BeautifulSoup) – HTML parsing
* `re` – regular expressions
* `time` – for delays/waits during scraping

**Utilities:**

* `warnings` – for suppressing warning messages

**Visualization Style Settings:**

* `plt.style.use('seaborn-v0_8-darkgrid')`
* `sns.set_palette("husl")`

## Other tools: 
1. **Google Colab:** used to store and share the code.
2. **CSV files:** used as an intermediate format for saving and loading data.

# Stage 0: Environment Setup and Imports

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import (chi2_contingency, f_oneway, ttest_ind, normaltest, 
                         kruskal, levene, shapiro, mannwhitneyu)
import warnings
warnings.filterwarnings('ignore')

# Web scraping
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
import re
from bs4 import BeautifulSoup

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

#print("All libraries imported successfully")
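
The style name `seaborn-v0_8-darkgrid` used above is only available in newer Matplotlib releases (the `seaborn-v0_8-*` names were introduced in Matplotlib 3.6, to the best of our knowledge). A minimal sketch of a fallback, assuming the older name `seaborn-darkgrid` is the one to try next; the helper `pick_style` is a hypothetical name and is not part of the original notebook:

```python
import matplotlib.pyplot as plt


def pick_style(available, candidates=('seaborn-v0_8-darkgrid', 'seaborn-darkgrid')):
    """Return the first candidate style that is installed, else 'default'."""
    for name in candidates:
        if name in available:
            return name
    # 'default' is always accepted by plt.style.use
    return 'default'


style_name = pick_style(plt.style.available)
plt.style.use(style_name)
print(f"Using matplotlib style: {style_name}")
```

With this in place, the notebook would still run on an environment whose Matplotlib version does not ship the newer style name.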

# Stage 1: Data Scraping

In [9]:
# Define Armenian cities and their kindergarten URLs
ARMENIAN_CITIES = {
    'Yerevan': {
        'url': 'https://mankapartez.yerevan.am/order-view',
        'type': 'angular',  # Dynamic Angular site
        'population': 1093000,
        'region': 'Capital'
    },
    'Ijevan': {
        'url': 'https://ijevancity.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',  # ASP.NET site
        'population': 21000,
        'region': 'Tavush'
    },
    'Gyumri': {
        'url': 'https://gyumricity.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 121000,
        'region': 'Shirak'
    },
    'Vanadzor': {
        'url': 'https://cmis.vanadzor.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 82000,
        'region': 'Lori'
    },
    'Armavir': {
        'url': 'https://armavircity.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 29000,
        'region': 'Armavir'
    },
    'Kapan': {
        'url': 'https://kapan.am/pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 43000,
        'region': 'Syunik'
    },
    'Abovyan': {
        'url': 'https://abovyan.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 46000,
        'region': 'Kotayk'
    },
    'Sevan': {
        'url': 'https://sevancity.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 19000,
        'region': 'Gegharkunik'
    },
    'Vaxarshapat': {
        'url': 'https://docs.ejmiatsin.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 47446,
        'region': 'Armavir'
    },
    'Artashat': {
        'url': 'https://artashat.am/Pages/KinderGarten/List.aspx',
        'type': 'aspnet',
        'population': 29040,
        'region': 'Ararat'
    }
}

def scrape_yerevan_style(driver, url, city_name):
    """Scrape Angular-based site (Yerevan style)"""
    driver.get(url)
    time.sleep(5)
    
    kindergartens = []
    kinder_elements = driver.find_elements(By.CSS_SELECTOR, '.kinder-content')
    
    for element in kinder_elements:
        try:
            kg = {'city': city_name}
            kg['name'] = element.find_element(By.CSS_SELECTOR, '.kinder-title a').text.strip()
            
            list_items = element.find_elements(By.CSS_SELECTOR, '.kinder-body ul li')
            for item in list_items:
                text = item.text.strip()
                if 'Հերթագրված է' in text:
                    match = re.search(r'(\d+)\s*երեխա', text)
                    if match: kg['order_count'] = int(match.group(1))
                elif 'Գործում է' in text and 'խումբ' in text:
                    match = re.search(r'(\d+)\s*խումբ', text)
                    if match: kg['groups_count'] = int(match.group(1))
                elif 'Հաշվառված է' in text:
                    match = re.search(r'(\d+)\s*երեխա', text)
                    if match: kg['registered_count'] = int(match.group(1))
                else:
                    if 'address' not in kg and len(text) > 10:
                        kg['address'] = text
                    elif 'district' not in kg:
                        kg['district'] = text
            
            if kg.get('name'):
                kindergartens.append(kg)
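        except Exception as e:
            # Report the failure instead of silently dropping the entry
            print(f"Error parsing item in {city_name}: {e}")
            continue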
    
    return kindergartens

def scrape_aspnet_style(driver, url, city_name):
    """Scrape ASP.NET-based sites (Ijevan, Gyumri, etc.)"""
    driver.get(url)
    time.sleep(3)
    
    kindergartens = []
    
    # Get page source and parse with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    # Find all kindergarten items
    items = soup.find_all('a', id=lambda x: x and 'rptGardenList' in str(x) and 'LbCategory' in str(x))
    
    for item in items:
        try:
            kg = {'city': city_name}
            
            # Extract name
            name_div = item.find('div', class_='CategoryName')
            if name_div:
                kg['name'] = name_div.get_text(strip=True)
            
            # Extract from table rows
            rows = item.find_all('tr')
            for row in rows:
                text = row.get_text(strip=True)
                
                # Extract address (contains map icon)
                if 'glyphicon-map-marker' in str(row):
                    kg['address'] = text.replace('glyphicon', '').strip()
                
                # Extract phone (first run of digits, spaces, and + - ( ) characters)
                elif 'glyphicon-earphone' in str(row):
                    phone_match = re.search(r'[\d\s\-+()]+', text)
                    if phone_match:
                        kg['phone'] = phone_match.group(0).strip()
                
                # Extract groups (Գործում է X խումբ)
                elif 'Գործում է' in text and 'խումբ' in text:
                    match = re.search(r'(\d+)\s*խումբ', text)
                    if match:
                        kg['groups_count'] = int(match.group(1))
                
                # Extract capacity (Ցուցակային թիվ)
                elif 'Ցուցակային' in text or 'թիվ' in text:
                    match = re.search(r'(\d+)', text)
                    if match:
                        kg['capacity'] = int(match.group(1))
                
                # Extract registered (Հաշվառված է X երեխա / Հաճախում է)
                elif 'Հաշվառված' in text or 'Հաճախում' in text:
                    match = re.search(r'(\d+)\s*երեխա', text)
                    if match:
                        kg['registered_count'] = int(match.group(1))
                
                # Extract waiting list (Հերթագրված է X երեխա)
                elif 'Հերթագրված' in text:
                    match = re.search(r'(\d+)\s*երեխա', text)
                    if match:
                        kg['order_count'] = int(match.group(1))
            
            if kg.get('name'):
                kindergartens.append(kg)
                
        except Exception as e:
            print(f"Error parsing item: {e}")
            continue
    
    return kindergartens

def scrape_all_cities():
    """Scrape kindergarten data from all Armenian cities"""
    print("\n" + "="*70)
    print("MULTI-CITY DATA SCRAPING")
    print("="*70)
    
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-blink-features=AutomationControlled')
    
    all_data = []
    
    driver = webdriver.Chrome(options=chrome_options)
    
    try:
        for city_name, city_info in ARMENIAN_CITIES.items():
            try:
                if city_info['type'] == 'angular':
                    city_data = scrape_yerevan_style(driver, city_info['url'], city_name)
                else:
                    city_data = scrape_aspnet_style(driver, city_info['url'], city_name)
                
                # Add city metadata
                for kg in city_data:
                    kg['population'] = city_info['population']
                    kg['region'] = city_info['region']
                
                all_data.extend(city_data)                
                time.sleep(2)  # Be polite to servers
                
            except Exception as e:
                print(f" Error scraping {city_name}: {e}")
                continue
        
        print(f"Total kindergartens scraped: {len(all_data)}")
        print(f"Cities successfully scraped: {len({kg['city'] for kg in all_data})}")
        
    finally:
        driver.quit()
    
    return all_data

# Execute multi-city scraping
all_kindergartens = scrape_all_cities()
df_raw = pd.DataFrame(all_kindergartens)

print(f"\nRaw data shape: {df_raw.shape}")
print(f"Columns: {df_raw.columns.tolist()}")
print("\nCities in dataset:")
print(df_raw['city'].value_counts())


MULTI-CITY DATA SCRAPING
Total kindergartens scraped: 328
Cities successfully scraped: 10

Raw data shape: (328, 11)
Columns: ['city', 'name', 'district', 'address', 'order_count', 'population', 'region', 'groups_count', 'registered_count', 'phone', 'capacity']

Cities in dataset:
city
Yerevan        162
Artashat        32
Gyumri          24
Vanadzor        23
Kapan           18
Armavir         17
Ijevan          16
Sevan           14
Abovyan         12
Vaxarshapat     10
Name: count, dtype: int64
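
Both scrapers repeat the same `re.search` pattern for each Armenian label. A minimal sketch of how that extraction could be factored into one helper; `FIELD_PATTERNS` and `extract_counts` are hypothetical names, and the sample lines are invented to mirror the patterns used above rather than copied from any of the sites:

```python
import re

# Hypothetical mapping: output field -> (label that must appear, regex capturing the number)
FIELD_PATTERNS = {
    'order_count': ('Հերթագրված', r'(\d+)\s*երեխա'),
    'groups_count': ('Գործում է', r'(\d+)\s*խումբ'),
    'registered_count': ('Հաշվառված', r'(\d+)\s*երեխա'),
}


def extract_counts(lines):
    """Return a dict of the numeric fields found in an iterable of text lines."""
    result = {}
    for text in lines:
        for field, (label, pattern) in FIELD_PATTERNS.items():
            if label in text:
                match = re.search(pattern, text)
                if match:
                    result[field] = int(match.group(1))
                break  # each line is treated as describing at most one field
    return result


# Invented sample lines for illustration only
sample = [
    'Հերթագրված է 35 երեխա',
    'Գործում է 6 խումբ',
    'Հաշվառված է 140 երեխա',
    'Some address line',
]
counts = extract_counts(sample)
print(counts)
```

Keeping the labels in one table would make it easier to add a new field or adjust a pattern for a single city without touching the loop.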


# Stage 2: Data Cleaning

In [10]:
print("\n" + "="*70)
print("DATA CLEANING")
print("="*70)

def clean_multi_city_data(df):
    """Comprehensive cleaning for multi-city dataset"""
    df_clean = df.copy()
    
    print("\nInitial data shape:", df_clean.shape)
    print("Initial missing values:\n", df_clean.isnull().sum())
    
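    # 1. Clean text columns
    # Note: astype(str) turns missing values into literal strings such as 'nan',
    # so these columns report 0 missing values afterwards and the dropna in
    # step 4 cannot detect a missing name.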
    text_cols = ['name', 'address', 'phone']
    for col in text_cols:
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].astype(str).str.strip()
            df_clean[col] = df_clean[col].str.replace(r'\s+', ' ', regex=True)
    
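    # 2. Handle numeric columns
    # Note: fillna(0) records unknown values as 0 (for example, 'capacity' is
    # missing for 162 rows of the raw data), so a zero here can mean "not reported".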
    numeric_cols = ['order_count', 'registered_count', 'groups_count', 'capacity']
    for col in numeric_cols:
        if col in df_clean.columns:
            df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce').fillna(0).astype(int)
    
    # 3. Remove duplicates
    before_dup = len(df_clean)
    df_clean = df_clean.drop_duplicates(subset=['city', 'name'], keep='first')
    print(f"→ Removed {before_dup - len(df_clean)} duplicate rows")
    
    # 4. Remove rows with missing critical data
    before_null = len(df_clean)
    df_clean = df_clean.dropna(subset=['name', 'city'], how='any')
    print(f"→ Removed {before_null - len(df_clean)} rows with missing critical data")
    
    # 5. Standardize city names and regions
    df_clean['city'] = df_clean['city'].str.strip()
    df_clean['region'] = df_clean['region'].str.strip()
    
    # 6. Add city size category
    def categorize_city_size(pop):
        if pop > 500000:
            return 'Major City'
        elif pop > 100000:
            return 'Large City'
        elif pop > 50000:
            return 'Medium City'
        elif pop > 20000:
            return 'Small City'
        else:
            return 'Town'
    
    df_clean['city_size'] = df_clean['population'].apply(categorize_city_size)
    
    df_clean = df_clean.reset_index(drop=True)
    
    print(f"\nCleaned data shape: {df_clean.shape}")
    print("Final missing values:\n", df_clean.isnull().sum())
    print("\nKindergartens per city:")
    print(df_clean['city'].value_counts())
    
    return df_clean

df_clean = clean_multi_city_data(df_raw)

# Save cleaned data
df_clean.to_csv('multi_city_kindergartens_cleaned.csv', index=False, encoding='utf-8-sig')
print("\nCleaned data saved to: multi_city_kindergartens_cleaned.csv")


DATA CLEANING

Initial data shape: (328, 11)
Initial missing values:
 city                  0
name                  0
district            166
address               0
order_count           0
population            0
region                0
groups_count         15
registered_count     15
phone               200
capacity            162
dtype: int64
→ Removed 0 duplicate rows
→ Removed 0 rows with missing critical data

Cleaned data shape: (328, 12)
Final missing values:
 city                  0
name                  0
district            166
address               0
order_count           0
population            0
region                0
groups_count          0
registered_count      0
phone                 0
capacity              0
city_size             0
dtype: int64

Kindergartens per city:
city
Yerevan        162
Artashat        32
Gyumri          24
Vanadzor        23
Kapan           18
Armavir         17
Ijevan          16
Sevan           14
Abovyan         12
Vaxarshapat     10
Name: count, dtype: int64
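
In the cleaning cell, `astype(str)` and `fillna(0)` are the reason the final report shows 0 missing values for `phone` and `capacity`, even though the raw data had 200 and 162 gaps respectively. A minimal sketch of an alternative that keeps missing values visible, assuming pandas 1.0 or later for the nullable `string` and `Int64` dtypes; the function name `clean_preserving_na` and the toy rows are illustrative and not part of the original notebook:

```python
import pandas as pd


def clean_preserving_na(df):
    """Normalize text and numeric columns without hiding missing values."""
    out = df.copy()
    for col in ('name', 'address', 'phone'):
        if col in out.columns:
            # The nullable 'string' dtype keeps <NA> instead of producing 'nan'
            out[col] = (out[col].astype('string')
                                .str.strip()
                                .str.replace(r'\s+', ' ', regex=True))
    for col in ('order_count', 'registered_count', 'groups_count', 'capacity'):
        if col in out.columns:
            # Nullable Int64 stores integers and still allows <NA>
            out[col] = pd.to_numeric(out[col], errors='coerce').astype('Int64')
    return out


# Toy rows for illustration only
toy = pd.DataFrame({
    'name': ['  Kindergarten   A ', 'Kindergarten B'],
    'phone': [None, ' 010 123456 '],
    'capacity': [120, None],
})
cleaned = clean_preserving_na(toy)
print(cleaned.isna().sum())
```

With this variant, later statistics on `capacity` would skip the unknown entries by default instead of averaging in zeros, at the cost of having to handle `<NA>` explicitly wherever a plain integer is required.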

# Stage 3: Data Preprocessing