
# Tomods Store Location Info
* Homepage store location search page: https://shop.tomods.jp/all

In [None]:
# Import packages
import requests
import time
import random
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import re

In [None]:
url = "https://shop.tomods.jp/all"

# Send a GET request to the website
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the buttons with the specified class and extract the store names
buttons = soup.find_all('button', class_='font-bold ml-1.5 text-left text-[14px] hover:opacity-50 transition-opacity')

# Find all the divs with the specified class and extract the store hours and store address
store_info = soup.find_all('div', class_='text-sm mb-3')

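store_data = []
# Pair each store-name button with its info block; zip() stops at the shorter
# list, so a count mismatch cannot raise an IndexError
for button, info in zip(buttons, store_info):
    store_name = button.text
    store_hours = info.find('div', class_='mb-1').text
    store_address = info.find_all('div')[1].text
    store_data.append([store_name, store_hours, store_address])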

# Create a pandas DataFrame from the scraped data
df = pd.DataFrame(store_data, columns=['store_name', 'store_hours', 'store_address'])

# Print the DataFrame
#df.head()

In [None]:
# Filter rows based on the condition: store_name contains 'トモズ'
df = df[df['store_name'].str.contains('トモズ')]

# Reset the DataFrame index after filtering
df = df.reset_index(drop=True)
df

|  | store_name | store_hours | store_address |
| --- | --- | --- | --- |
| 0 | 薬局トモズ 成城コルティ店 | 営業時間外 - 営業開始時間 9:00 | 東京都世田谷区成城6-5-34 成城コルティ 3F |
| 1 | トモズ 白金高輪店 | 営業時間外 - 営業開始時間 9:00 | 東京都港区高輪 1丁目3-1 プレミストタワー白金高輪 1F |
| 2 | トモズ アークヒルズ店 | 営業時間外 - 営業開始時間 8:30 | 東京都港区赤坂1-12-32アーク森ビル 2F |
| 3 | トモズ 青葉台店 | 営業時間外 - 営業開始時間 9:00 | 神奈川県横浜市青葉区青葉台1-6-14エキニア青葉台 |
| 4 | トモズ 青葉台東急スクエア店 | 営業時間外 - 営業開始時間 10:00 | 神奈川県横浜市青葉区青葉台2-1-1青葉台東急スクエアNorth-1　1·2F |
| ... | ... | ... | ... |
| 222 | トモズ ららぽーと海老名店 | 営業時間外 - 営業開始時間 10:00 | 神奈川県海老名市扇町１３−１ららぽーと海老名 1F |
| 223 | トモズ ららぽーと湘南平塚店 | 営業時間外 - 営業開始時間 10:00 | 神奈川県平塚市天沼１０−１ららぽーと湘南平塚1F |
| 224 | 薬局トモズ 両国店 | 営業時間外 - 営業開始時間 9:00 | 東京都墨田区横網1-10-12 高安ビル |
| 225 | トモズ 六本木ヒルズ店 | 営業時間外 - 営業開始時間 8:30 | 東京都港区六本木6-10-3六本木ヒルズ ウェストウォーク1F |


In [None]:
df.to_excel('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/tomods.xlsx', index=False)

# Ito-Yokado Store Location Info
* Store information pages follow the pattern https://stores.itoyokado.co.jp/XXX, where "XXX" is a three-digit store number (substitute any three-digit number to reach an individual store page)
  * E.g., イトーヨーカドー 武蔵小金井店 = https://stores.itoyokado.co.jp/242
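
The URL pattern described above can be sketched as a small helper. The name `store_url` and the 1 to 999 bounds check are illustrative assumptions; the zero-padding mirrors the `{store_number:03d}` format used in the scraping cell, and whether the site actually serves zero-padded URLs for low numbers is not confirmed here.

```python
BASE_URL = "https://stores.itoyokado.co.jp"

def store_url(store_number: int) -> str:
    """Build a store page URL, zero-padding the number to three digits.

    Hypothetical helper: the range check assumes store numbers run from 001 to 999.
    """
    if not 1 <= store_number <= 999:
        raise ValueError("store_number must be between 1 and 999")
    return f"{BASE_URL}/{store_number:03d}"

print(store_url(242))  # https://stores.itoyokado.co.jp/242
print(store_url(7))    # https://stores.itoyokado.co.jp/007
```

Keeping the URL construction in one place makes it easier to adjust if the numbering scheme turns out to differ.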

In [None]:
import re  # used below to strip the postal code from the address
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_store_details(store_number):
    # Construct the URL
    url = f"https://stores.itoyokado.co.jp/{store_number:03d}"

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the page exists
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")

        # Scrape the store name
        store_name_element = soup.find("span", class_="LocationName-brand")
        store_name = store_name_element.text if store_name_element else "N/A"

        # Scrape the store location
        store_location_element = soup.find("address", class_="Address-content")
        store_location = store_location_element.text if store_location_element else "N/A"
        store_location = re.sub(r'〒\d{3}-\d{4}', '', store_location) #Remove postal code


        # Scrape the store hours
        store_hours_element = soup.find("div", class_="c-hours-details")
        store_hours = store_hours_element.text.strip() if store_hours_element else "N/A"

        # Return the scraped details as a dictionary
        return {
            "store_name": store_name,
            "store_hours": store_hours,
            "store_address": store_location
        }
    else:
        return None

# Scrape store details for store numbers 001 to 999
results = []
for store_number in range(1, 1000):
    store_details = scrape_store_details(store_number)
    if store_details is not None:
        results.append(store_details)

# Create a data frame from the results
df = pd.DataFrame(results)

# Print the data frame
df

|  | store_name | store_hours | store_address |
| --- | --- | --- | --- |
| 0 | イトーヨーカドー 高砂店 | 全曜日10:00 - 22:00 | 東京都葛飾区高砂3-12-5 |
| 1 | イトーヨーカドー 柏店 | 全曜日10:00 - 22:00 | 千葉県柏市柏2-15 |
| 2 | イトーヨーカドー 上板橋店 | 全曜日10:00 - 22:00 | 東京都板橋区常盤台4-26-1 |
| 3 | イトーヨーカドー 相模原店 | 全曜日10:00 - 22:00 | 神奈川県相模原市南区松が枝町17-1 |
| 4 | イトーヨーカドー 浦和店 | 全曜日10:00 - 21:00 | 埼玉県さいたま市浦和区仲町1-7-1 |
| ... | ... | ... | ... |
| 121 | イトーヨーカドー 食品館川越店 | 全曜日10:00 - 22:00 | 埼玉県川越市新富町1-20-1 |
| 122 | イトーヨーカドー 新田店 | 全曜日10:00 - 21:00 | 埼玉県草加市旭町六丁目15-30 |
| 123 | イトーヨーカドー 朝霞店 | 全曜日09:00 - 20:00 | 埼玉県朝霞市根岸台3-20－1 |
| 124 | イトーヨーカドー 西川口店 | 全曜日10:00 - 21:00 | 埼玉県川口市西川口2-3-5 |


In [None]:
df.to_excel('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/ito_yokado.xlsx', index=False)

# Matsumoto-Kiyoshi Store Location Info

## Extract from official website
* The Matsumoto Kiyoshi website has a unique page for each store, identified by an 8-digit number.
  * Example: ドラッグストアマツモトキヨシ中園店: https://www.matsukiyo.co.jp/map?kid=20805482 (the 8-digit code for this store is 20805482)

* The official website may have a mechanism that restricts web crawling, so extracting data from it was unsuccessful
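
One way to look into crawl restrictions before scraping is to consult a site's `robots.txt`. The sketch below uses the standard library's `urllib.robotparser` on an in-memory example. The rules in `EXAMPLE_ROBOTS_TXT` are made up for illustration and are not the actual rules of the Matsumoto Kiyoshi site, and `is_allowed` is a hypothetical helper name. It does not establish why the scrape in this notebook failed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /map
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Under the made-up rules above, the store page path is disallowed
print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://www.matsukiyo.co.jp/map?kid=20805482"))
# ...while the site root is not covered by the Disallow rule
print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://www.matsukiyo.co.jp/"))
```

For a live check, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of parsing a string.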

In [None]:
# Set headers and user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def scrape_store_details(store_number):
    # Construct the URL
    url = f"https://www.matsukiyo.co.jp/map?kid={store_number:08d}"

    try:
        # Send a GET request to the URL with headers
        with requests.Session() as session:
            response = session.get(url, headers=headers)
            response.raise_for_status()  # Raise an exception for non-200 status codes

        # Introduce a random delay between requests
        delay = random.uniform(0.5, 2.0)
        time.sleep(delay)

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")

        # Scrape the store name
        # NOTE: <br> is a void element with no text, so this lookup yields an empty string at best
        store_name_element = soup.find("br", class_="spBr")
        store_name = store_name_element.text.strip() if store_name_element else "N/A"

        # Scrape the store location
        # select_one() returns None when any step of the path is missing,
        # whereas chained find() calls would raise an uncaught AttributeError
        store_location_element = soup.select_one("ul.detailIcon li.iconzip li.iconAdd")
        store_location = store_location_element.text.strip() if store_location_element else "N/A"

        # Scrape the store hours
        store_hours_element = soup.select_one("table.detailTbl tbody tr th td")
        store_hours = store_hours_element.text.strip() if store_hours_element else "N/A"

        # Return the scraped details as a dictionary
        return {
            "store_name": store_name,
            "store_hours": store_hours,
            "store_address": store_location
        }
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while scraping store {store_number}: {str(e)}")
        return None

In [None]:
# Scrape store details for a small test range of store numbers using parallel processing
results = []
store_numbers = range(20805485, 20805487)
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(scrape_store_details, store_number) for store_number in store_numbers]
    for future in futures:
        store_details = future.result()
        if store_details is not None:
            results.append(store_details)

In [None]:
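# Build a DataFrame from the scraped results; without this step, df would
# still refer to the DataFrame from the previous section
df = pd.DataFrame(results, columns=['store_name', 'store_hours', 'store_address'])

# Filter rows based on the condition: store_name contains 'マツモトキヨシ'
df = df[df['store_name'].str.contains('マツモトキヨシ')]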

# Reset the DataFrame index after filtering
df = df.reset_index(drop=True)
df

In [None]:
# Save the data frame to a CSV file in the specified folder
df.to_csv('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/matsumoto-kiyoshi.csv', index=False)

## Try alternative Data set
* Store data is also available in an online phone directory called *Mapion*
  * Link to the website: https://www.mapion.co.jp/phonebook/M02021Cc01/1.html
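
The listing is paginated, with the page number as the last path segment of the URL. As a minimal sketch, the paging can be driven by a fetch function that returns `None` once a page is missing, rather than by a hard-coded page count. The names `iter_pages` and `fake_fetch` are hypothetical, and how Mapion actually responds past the last page is an assumption not verified here.

```python
from typing import Callable, Iterator, Optional, Tuple

PAGE_URL = "https://www.mapion.co.jp/phonebook/M02021Cc01/{page}.html"

def iter_pages(
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, html) until fetch returns None or max_pages is reached."""
    for page in range(1, max_pages + 1):
        html = fetch(PAGE_URL.format(page=page))
        if html is None:
            break
        yield page, html

def fake_fetch(url: str) -> Optional[str]:
    """Offline stand-in for an HTTP fetcher: pretends only pages 1 to 3 exist."""
    page = int(url.rsplit("/", 1)[1].split(".")[0])
    return f"<html>page {page}</html>" if page <= 3 else None

pages = [number for number, _ in iter_pages(fake_fetch)]
print(pages)  # [1, 2, 3]
```

In real use, `fetch` could wrap `requests.get` and return `None` on a non-200 status; `max_pages` guards against looping indefinitely if the site never signals the end.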

In [None]:
# Function to extract data from a given page
def extract_data(page_number):
    # Construct the URL
    url = f"https://www.mapion.co.jp/phonebook/M02021Cc01/{page_number}.html"

    # Send a GET request to the URL
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Initialize empty lists to store the scraped data
    store_names = []
    store_locations = []

    # Find all store name elements
    name_elements = soup.find_all('a', {'title': re.compile('.*')})

    # Extract the store names and locations
    for name_element in name_elements:
        store_name = name_element.get_text()
        if 'マツモトキヨシ' in store_name:
            store_names.append(store_name)
            location_element = name_element.find_next('td')
            store_locations.append(location_element.get_text())

    # Create a DataFrame from the scraped data
    data = {'store_name': store_names, 'store_address': store_locations}
    df = pd.DataFrame(data)

    return df

# Iterate through the page numbers
dfs = []
for page_number in range(1, 17):
    df = extract_data(page_number)
    dfs.append(df)

# Concatenate all the DataFrames
combined_df = pd.concat(dfs, ignore_index=True)
combined_df

|  | store_name | store_address |
| --- | --- | --- |
| 0 | マツモトキヨシ札幌南１条店 | 北海道札幌市中央区南１条西３丁目８ |
| 1 | マツモトキヨシ札幌狸小路店 | 北海道札幌市中央区南２条西３丁目１-５ |
| 2 | マツモトキヨシプロム山鼻店 | 北海道札幌市中央区南２２条西１２-１-２ |
| 3 | マツモトキヨシ札幌狸小路Ｐａｒｔ２店 | 北海道札幌市中央区南三条西３-１３-１ |
| 4 | マツモトキヨシ札幌すすきの店 | 北海道札幌市中央区南四条西３-９-２ |
| ... | ... | ... |
| 1479 | マツモトキヨシハンビータウン店 | 沖縄県中頭郡北谷町北前１-２-３ |
| 1480 | マツモトキヨシ北谷はまがわ店 | 沖縄県中頭郡北谷町字宮城１-３７ |
| 1481 | マツモトキヨシなかぐすく店 | 沖縄県中頭郡中城村字南上原７９５ |
| 1482 | マツモトキヨシ板良敷店 | 沖縄県島尻郡与那原町字板良敷６２１番地 |


In [None]:
# Save the combined DataFrame to an Excel file in the specified folder
combined_df.to_excel('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/matsumoto_kiyoshi.xlsx', index=False)

# Merge the Three DataFrames

In [None]:
tomods = pd.read_excel('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/tomods.xlsx')
ito_yokado = pd.read_excel('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/ito_yokado.xlsx')
matsumoto_kiyoshi = pd.read_excel('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/matsumoto_kiyoshi.xlsx')

In [None]:
# Create a franchise name column for all three datasets
tomods['franchise_name'] = 'トモズ'
ito_yokado['franchise_name'] = 'イトーヨーカドー'
matsumoto_kiyoshi['franchise_name'] = 'マツモトキヨシ'

In [None]:
appended_df = pd.concat([tomods, ito_yokado, matsumoto_kiyoshi])
appended_df

|  | store_name | store_hours | store_address | franchise_name |
| --- | --- | --- | --- | --- |
| 0 | 薬局トモズ 成城コルティ店 | 営業時間外 - 営業開始時間 9:00 | 東京都世田谷区成城6-5-34 成城コルティ 3F | トモズ |
| 1 | トモズ 白金高輪店 | 営業時間外 - 営業開始時間 9:00 | 東京都港区高輪 1丁目3-1 プレミストタワー白金高輪 1F | トモズ |
| 2 | トモズ アークヒルズ店 | 営業時間外 - 営業開始時間 8:30 | 東京都港区赤坂1-12-32アーク森ビル 2F | トモズ |
| 3 | トモズ 青葉台店 | 営業時間外 - 営業開始時間 9:00 | 神奈川県横浜市青葉区青葉台1-6-14エキニア青葉台 | トモズ |
| 4 | トモズ 青葉台東急スクエア店 | 営業時間外 - 営業開始時間 10:00 | 神奈川県横浜市青葉区青葉台2-1-1青葉台東急スクエアNorth-1　1·2F | トモズ |
| ... | ... | ... | ... | ... |
| 1479 | マツモトキヨシハンビータウン店 |  | 沖縄県中頭郡北谷町北前１-２-３ | マツモトキヨシ |
| 1480 | マツモトキヨシ北谷はまがわ店 |  | 沖縄県中頭郡北谷町字宮城１-３７ | マツモトキヨシ |
| 1481 | マツモトキヨシなかぐすく店 |  | 沖縄県中頭郡中城村字南上原７９５ | マツモトキヨシ |
| 1482 | マツモトキヨシ板良敷店 |  | 沖縄県島尻郡与那原町字板良敷６２１番地 | マツモトキヨシ |


In [None]:
appended_df.to_excel('/content/drive/MyDrive/EY Recruitment Documents/franchise_info/store_info_all.xlsx', index=False)