# Yemeksepeti Data Collection

This notebook uses web scraping techniques to gather restaurant data from the food delivery platform Yemeksepeti. The collected data is stored in JSON format in the `collected_data` directory.

The dataset covers a wide range of information about the restaurants, such as their menu items, pricing, customer comments, ratings, and the corresponding timestamps.

The primary goal is then to run semantic analysis on the dataset to gain deeper insight into consumer behavior patterns in Turkey.

---

This dataset is **FAIR** since it is:
- **Findable:** uploaded to GitHub and [Google Drive](https://drive.google.com/drive/folders/1l4J1IXDtvGCOBzbD7jX-Y-Kud4FOj86S?usp=sharing) with proper metadata.
- **Accessible:** the data is stored in `.json` format, and the protocol is open, free, and universally implementable.
- **Interoperable:** proper language is used, and it can be integrated with other data (e.g. data collected from Getir).
- **Reusable:** the (meta)data are properly described with accurate attributes.
---

This notebook is divided into 3 sections:
1. Data Collection
2. How to Read the Data? (in python)
3. Meta Data (data description, format, attributes)


#### 📁 The files are too large (13.9 MB + 84.4 MB) to be uploaded to GitHub; they can be accessed through [Google Drive](https://drive.google.com/drive/folders/1l4J1IXDtvGCOBzbD7jX-Y-Kud4FOj86S?usp=sharing).

In [1]:
# import libraries

import grequests
import requests
import json
import os

In [2]:
# directories 

RESTAURANT_LIST_DIR = '../collected_data/yemeksepeti_restaurants_list_per_city/'
REVIEWS_COLLECTION_DIR = '../collected_data/yemeksepeti_reviews_collection_per_city/'

# 1. Data Collection

## 1.1 A List of Restaurants in Every City in Turkey

In this section, our goal is to gather a list of the restaurants in every city in Turkey, with detailed information such as their address, budget, cuisines, and, most importantly, the URLs of their pages on Yemeksepeti.

To accomplish this, we send one request per city to the listing API that serves Yemeksepeti's data and save the JSON response. This keeps the gathered information easy to organize and explore.

In [3]:
# get detailed data for all restaurants in a city and return the response
def get_restaurants_data_for_city(city_id):
    # Yemeksepeti stores its data on the Delivery Hero servers, hence we can get the restaurants' data by sending a request to Delivery Hero
    # the arguments ...&offset=0&limit=500&... can be added to the url to specify the data request size
    yemeksepeti_request_url = f"https://disco.deliveryhero.io/listing/api/v1/pandora/vendors?language_id=2&vertical=restaurants&country=tr&include=characteristics&configuration=Variant1&sort=&city_id={city_id}"

    # the headers are copied from the requests sent by yemeksepeti's webpage
    # perseus-client-id and perseus-session-id might need to be updated occasionally
    headers = {
        'accept': 'application/json, text/plain, */*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'dnt': '1',
        'origin': 'https://www.yemeksepeti.com',
        'perseus-client-id': '1683402284518.770138949646337300.d4t6e4b8h2',
        'perseus-session-id': '1686229407319.641123708405837200.lsjp7m4zlk',
        'referer': 'https://www.yemeksepeti.com/',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
        'x-disco-client-id': 'web',
        'x-fp-api-key': 'volo '
    }

    # send the request; the timeout keeps a stalled connection from hanging forever
    response = requests.get(yemeksepeti_request_url, headers=headers, timeout=30)

    return response
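
The `offset` and `limit` arguments mentioned in the comment above could be used to page through cities with many restaurants. The following is only a minimal sketch: it assumes the endpoint accepts them as ordinary query parameters and that `available_count` in the response gives the total number of restaurants. `build_vendors_url` and `page_offsets` are hypothetical helpers and are not part of the original notebook.

```python
from urllib.parse import urlencode

# Base endpoint taken from the request URL used in this notebook.
VENDORS_ENDPOINT = "https://disco.deliveryhero.io/listing/api/v1/pandora/vendors"


def build_vendors_url(city_id, offset=0, limit=500):
    """Build the listing URL for one page of results (hypothetical helper)."""
    params = {
        "language_id": 2,
        "vertical": "restaurants",
        "country": "tr",
        "include": "characteristics",
        "configuration": "Variant1",
        "sort": "",
        "city_id": city_id,
        # assumption: offset/limit behave like standard paging parameters
        "offset": offset,
        "limit": limit,
    }
    return f"{VENDORS_ENDPOINT}?{urlencode(params)}"


def page_offsets(available_count, limit=500):
    """Return the offsets needed to cover `available_count` items in pages of `limit`."""
    return list(range(0, available_count, limit))
```

For a city that reports `available_count == 1200`, `page_offsets(1200)` yields three offsets, and each one can be passed to `build_vendors_url` to request the corresponding page.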


In [4]:
# A list of all cities in Turkey, ordered so that index + 1 is the city's id (il kodu)
# e.g. cities[33] == cities[34 - 1] == "İstanbul"

cities = ["Adana", "Adıyaman", "Afyon", "Ağrı", "Amasya", "Ankara", "Antalya", "Artvin", "Aydın", "Balıkesir", "Bilecik", "Bingöl", "Bitlis", "Bolu", "Burdur", "Bursa", "Çanakkale", "Çankırı", "Çorum", "Denizli", "Diyarbakır", "Edirne", "Elazığ", "Erzincan", "Erzurum", "Eskişehir", "Gaziantep", "Giresun", "Gümüşhane", "Hakkari", "Hatay", "Isparta", "İçel (Mersin)", "İstanbul", "İzmir", "Kars", "Kastamonu", "Kayseri", "Kırklareli", "Kırşehir", "Kocaeli", "Konya", "Kütahya", "Malatya", "Manisa", "Kahramanmaraş", "Mardin", "Muğla", "Muş", "Nevşehir", "Niğde", "Ordu", "Rize", "Sakarya", "Samsun", "Siirt", "Sinop", "Sivas", "Tekirdağ", "Tokat", "Trabzon", "Tunceli", "Şanlıurfa", "Uşak", "Van", "Yozgat", "Zonguldak", "Aksaray", "Bayburt", "Karaman", "Kırıkkale", "Batman", "Şırnak", "Bartın", "Ardahan", "Iğdır", "Yalova", "Karabük", "Kilis", "Osmaniye", "Düzce"]

In [5]:
# Save the restaurant data of every city as separate json files
def save_restaurants_data_json():
    # create the output directory on first run so that os.listdir() does not fail
    os.makedirs(RESTAURANT_LIST_DIR, exist_ok=True)
    restaurants_data_files_list = os.listdir(RESTAURANT_LIST_DIR)

    for city_index, city_name in enumerate(cities):
        city_id = city_index + 1  # city ids (il kodu) start at 1
        filename = f'{city_id}_yemeksepeti_{city_name}_restaurants_data.json'

        # only fetch the cities whose file is still missing
        if filename in restaurants_data_files_list:
            continue

        print(f'({city_id} / {len(cities)}) Fetching the data for {city_name} restaurants:')
        restaurants_data = get_restaurants_data_for_city(city_id)

        # do not save failed responses, so that the city is retried on the next run
        if restaurants_data.status_code != 200:
            print(f"request failed with status {restaurants_data.status_code}, skipping.\n")
            continue

        # save the raw response body as json
        print("saving the json data...")
        data_export_path = RESTAURANT_LIST_DIR + filename
        with open(data_export_path, 'wb') as of:
            of.write(restaurants_data.content)

        print(f"successfully exported the data. path: `{data_export_path}`\n")

In [6]:
save_restaurants_data_json()

## 1.2 Restaurant Reviews Compilation
With the list of restaurants across all cities in Turkey in hand, we now turn to capturing the customer experience through customer reviews.

In this section, our aim is to collect the reviews through the appropriate APIs and store them in JSON format.

In [7]:
# grequests is used to get the responses asynchronously since the data is too large
 
def get_restaurant_reviews(restaurant_code):
    restaurant_ratings_request_url = f"https://reviews-api-tr.fd-api.com/reviews/vendor/{restaurant_code}?global_entity_id=YS_TR"
    
    request = grequests.get(restaurant_ratings_request_url)
    return request


In [8]:
def exception_handler(request, exception):
    print(f"Failed to fetch reviews. request: {request}")

In [9]:
# returns a dictionary of all restaurant reviews in {'code': review_data, ...} format

def get_restaurant_reviews_from_city(city_id, city_name):
    # note: city_id is the 0-based index into `cities`; the file names use the city's id (index + 1)
    restaurants_data_filename = f'{city_id + 1}_yemeksepeti_{city_name}_restaurants_data.json'
    data_file_path = RESTAURANT_LIST_DIR + restaurants_data_filename
    with open(data_file_path, 'r', encoding='utf-8') as file:
        json_data = json.load(file)

    restaurants = json_data['data']['items']
    restaurants_count = len(restaurants) # or = json_data['data']['available_count']
    print(f"Fetching reviews from {restaurants_count} restaurants in {city_name}.", flush=True)

    # build one unsent request per restaurant
    # (the list is not named `requests`, to avoid shadowing the requests module)
    restaurant_codes = [restaurant['code'] for restaurant in restaurants]
    pending_requests = [get_restaurant_reviews(code) for code in restaurant_codes]

    # send the requests concurrently; a failed request yields None in its position
    responses = grequests.map(pending_requests, exception_handler=exception_handler)

    # all of the reviews are stored as {'code': review_data, ...} in the dictionary below
    reviews_collection = {}
    for response, restaurant_code in zip(responses, restaurant_codes):
        if response is not None and response.status_code == 200:
            reviews_collection[restaurant_code] = response.json()

    return reviews_collection
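
Sending every request of a large city at once may put a lot of load on the reviews API. One possible mitigation, sketched here under that assumption, is to split the prepared requests into smaller batches and map each batch separately. `batched` is a hypothetical helper that is not part of the original notebook; as far as we know, `grequests.map` also accepts a `size` argument that caps the number of concurrent requests, which may be the simpler option.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example with plain values; in the notebook the items would be the
# unsent grequests objects, and each batch would be passed to grequests.map().
for batch in batched(["a", "b", "c", "d", "e"], 2):
    print(batch)
```

Because the batches preserve the original order, the responses of all batches can be concatenated and still be zipped with the list of restaurant codes.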

In [10]:
# Save the customer reviews for all restaurants in every city as separate json files
def save_customer_reviews_json():
    # create the output directory on first run so that os.listdir() does not fail
    os.makedirs(REVIEWS_COLLECTION_DIR, exist_ok=True)
    review_file_list = os.listdir(REVIEWS_COLLECTION_DIR)
    for city_id, city_name in enumerate(cities):

        restaurants_reviews_filename = f'{city_id + 1}_yemeksepeti_{city_name}_restaurants_reviews.json'

        # only fetch the cities whose reviews file is still missing
        if restaurants_reviews_filename not in review_file_list:

            reviews_collection = get_restaurant_reviews_from_city(city_id, city_name)

            reviews_export_path = REVIEWS_COLLECTION_DIR + restaurants_reviews_filename
            with open(reviews_export_path, "w") as json_file:
                json.dump(reviews_collection, json_file)

In [11]:
save_customer_reviews_json()

# 2. How to Read the Data? (in python)

Using the functions below, you can read the files and access the data.

If your files are stored in a different directory, change the directory variables.

`data_filename_for_city()` and `reviews_filename_for_city()` return the file name of the restaurant information dataset or the restaurant reviews, respectively, for the given city code.

After the function definitions, there is an example code block that reads the information and displays it.

In [12]:
# import libraries
import json

In [13]:
# directories 

RESTAURANT_LIST_DIR = '../collected_data/yemeksepeti_restaurants_list_per_city/'
REVIEWS_COLLECTION_DIR = '../collected_data/yemeksepeti_reviews_collection_per_city/'

In [14]:
# A list of all cities in Turkey, ordered so that index + 1 is the city's id (il kodu)
# e.g. cities[33] == cities[34 - 1] == "İstanbul"

cities = ["Adana", "Adıyaman", "Afyon", "Ağrı", "Amasya", "Ankara", "Antalya", "Artvin", "Aydın", "Balıkesir", "Bilecik", "Bingöl", "Bitlis", "Bolu", "Burdur", "Bursa", "Çanakkale", "Çankırı", "Çorum", "Denizli", "Diyarbakır", "Edirne", "Elazığ", "Erzincan", "Erzurum", "Eskişehir", "Gaziantep", "Giresun", "Gümüşhane", "Hakkari", "Hatay", "Isparta", "İçel (Mersin)", "İstanbul", "İzmir", "Kars", "Kastamonu", "Kayseri", "Kırklareli", "Kırşehir", "Kocaeli", "Konya", "Kütahya", "Malatya", "Manisa", "Kahramanmaraş", "Mardin", "Muğla", "Muş", "Nevşehir", "Niğde", "Ordu", "Rize", "Sakarya", "Samsun", "Siirt", "Sinop", "Sivas", "Tekirdağ", "Tokat", "Trabzon", "Tunceli", "Şanlıurfa", "Uşak", "Van", "Yozgat", "Zonguldak", "Aksaray", "Bayburt", "Karaman", "Kırıkkale", "Batman", "Şırnak", "Bartın", "Ardahan", "Iğdır", "Yalova", "Karabük", "Kilis", "Osmaniye", "Düzce"]

In [15]:
# return the filename for restaurants data in a city

def data_filename_for_city(city_id):
    filename = None
    if 1 <= city_id <= len(cities):
        filename = RESTAURANT_LIST_DIR + f'{city_id}_yemeksepeti_{cities[city_id - 1]}_restaurants_data.json'
    else:
        print(f"city id ({city_id}) out of range. range: [1,{len(cities)}]")
    return filename

In [16]:
# return the filename for restaurants reviews in a city

def reviews_filename_for_city(city_id):
    filename = None
    if 1 <= city_id <= len(cities):
        filename = REVIEWS_COLLECTION_DIR + f'{city_id}_yemeksepeti_{cities[city_id - 1]}_restaurants_reviews.json'
    else:
        print(f"city id ({city_id}) out of range. range: [1,{len(cities)}]")
    return filename

In [17]:
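def json_file_to_dict(filename):
    # read explicitly as UTF-8 so that Turkish characters load correctly on every platform
    with open(filename, encoding='utf-8') as f:
        data = json.load(f)
    return data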
    

In [18]:
city_id = 38 # Kayseri code: 38

info_filename = data_filename_for_city(city_id)
info = json_file_to_dict(info_filename)

reviews_filename = reviews_filename_for_city(city_id)
reviews = json_file_to_dict(reviews_filename)

rst_index = 9 # index of the restaurant to display (0-based)

restaurant_info = info['data']['items'][rst_index]
print(f"Restaurant Name: {restaurant_info['name']}")
print(f"code: {restaurant_info['code']}")
print(f"url: {restaurant_info['redirection_url']}\n")

restaurant_reviews = reviews[restaurant_info['code']]['data']
if len(restaurant_reviews) > 0:
    print(f"first review: {restaurant_reviews[0]['text']}")
    print(f"time: {restaurant_reviews[0]['createdAt']}")

    print("ratings:")
    for rating in restaurant_reviews[0]['ratings']:
        print(f"   |{rating['topic']}: {rating['score']} / 10 ")


Restaurant Name: Ristorante Picco Bello Pizza
code: kdin
url: https://yemeksepeti.com/restaurant/kdin/ristorante-picco-bello-pizza

first review: Çok beğendim çok lezzetliydi ekmeğinin kıtırlığı inceliği kekiği kaşarı nefisti hızlı geldi sıcaktıda tekrr sipariş vericem kesin
time: 2023-06-06T06:03:01Z
ratings:
   |overall: 10 / 10 
   |restaurant_food: 10 / 10 
   |service: 10 / 10 
   |speed: 10 / 10 
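
For analysis it is often easier to work with one flat record per review than with the nested dictionary. The sketch below shows one way to do that, assuming the reviews file has the `{'code': {'data': [...]}}` layout used in the example above. `flatten_reviews` is a hypothetical helper and is not part of the original notebook.

```python
def flatten_reviews(reviews_by_code):
    """Turn {'code': {'data': [review, ...]}, ...} into a flat list of review rows."""
    rows = []
    for restaurant_code, payload in reviews_by_code.items():
        # restaurants without reviews are assumed to carry an empty or missing 'data'
        for review in payload.get('data') or []:
            # map each rating topic (overall, service, ...) to its score
            scores = {r['topic']: r['score'] for r in review.get('ratings') or []}
            rows.append({
                'restaurant_code': restaurant_code,
                'created_at': review.get('createdAt'),
                'text': review.get('text'),
                'overall': scores.get('overall'),
            })
    return rows
```

Calling `flatten_reviews(reviews)` on the dictionary loaded above returns a list of dictionaries that can be handed to a tabular tool of your choice.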


# 3. Metadata

This section describes how the data is structured and what its keys are.

The keys are listed below; you can also print them yourself using `print_all_keys()` (defined below).

For example:
```python
city_id = 4
filename = data_filename_for_city(city_id) # or reviews_filename_for_city(city_id)
data = json_file_to_dict(filename)
 
print_all_keys(data)
```

In [19]:
def print_all_keys(dictionary, padding=''):
    for key, value in dictionary.items():
        print(padding, '|', key, end="")
        if isinstance(value, dict):
            print(': {}')
            print_all_keys(value, padding + '    ')
        elif isinstance(value, list) and len(value) > 0 and isinstance(value[0], dict):
            # list of objects: describe the keys of the first element
            print(': [{}]')
            print_all_keys(value[0], padding + '    ')
        else:
            # scalars, empty lists and lists of scalars: print the type and end the line
            print(f' ({type(value).__name__})')
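
If the keys are needed as data rather than as printed text, for instance to compare the structure of two city files, a variant that returns the key paths can be useful. This is only a sketch that follows the same traversal idea as `print_all_keys()`. `collect_key_paths` is a hypothetical helper and is not part of the original notebook.

```python
def collect_key_paths(dictionary, prefix=''):
    """Return the dotted paths of all keys, descending into dicts and lists of dicts."""
    paths = []
    for key, value in dictionary.items():
        path = f'{prefix}{key}'
        paths.append(path)
        if isinstance(value, dict):
            paths.extend(collect_key_paths(value, path + '.'))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # like print_all_keys(), only the first element of a list is inspected
            paths.extend(collect_key_paths(value[0], path + '[].'))
    return paths
```

The difference between `set(collect_key_paths(a))` and `set(collect_key_paths(b))` then shows which keys appear in only one of two loaded files.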

### Restaurant Information Dataset Structure/Keys


<details>
<summary>Show</summary>

```
 | status_code (int)
 | message (str)
 | data: {}
     | available_count (int)
     | returned_count (int)
     | events (list)
     | close_reasons (list)
     | banner (str)
     | items: [{}]
         | id (int)
         | code (str)
         | accepts_instructions (bool)
         | address (str)
         | address_line2 (str)
         | budget (int)
         | chain: {}
             | code (str)
             | name (str)
             | main_vendor_code (str)
             | url_key (str)
         | city: {}
             | name (str)
         | cuisines: [{}]
             | id (int)
             | name (str)
             | url_key (str)
             | main (bool)
         | custom_location_url (str)
         | customer_type (str)
         | delivery_box (str)
         | delivery_fee_type (str)
         | description (str)
         | distance (float)
         | food_characteristics: [{}]
             | id (int)
             | name (str)
             | is_halal (bool)
             | is_vegetarian (bool)
         | has_delivery_provider (bool)
         | hero_image (str)
         | hero_listing_image (str)
         | is_new_until (str)
         | premium_position (int)
         | latitude (float)
         | logo (str)
         | longitude (float)
         | loyalty_percentage_amount (float)
         | loyalty_program_enabled (bool)
         | maximum_express_order_amount (int)
         | metadata: {}
             | has_discount (bool)
             | timezone (str)
             | close_reasons (list)
             | available_in (str)
             | events (list)
             | is_delivery_available (bool)
             | is_pickup_available (bool)
             | is_dine_in_available (bool)
             | is_express_delivery_available (bool)
             | is_temporary_closed (bool)
             | is_flood_feature_closed (bool)
         | minimum_delivery_fee (float)
         | minimum_delivery_time (float)
         | minimum_order_amount (float)
         | minimum_pickup_time (float)
         | name (str)
         | payment_types (list)
         | post_code (str)
         | primary_cuisine_id (int)
         | rating (float)
         | redirection_url (str)
         | review_number (int)
         | review_with_comment_number (int)
         | score (float)
         | service_fee_percentage_amount (int)
         | service_tax_percentage_amount (int)
         | tag (str)
         | tags (list)
         | url_key (str)
         | vat_percentage_amount (int)
         | characteristics: {}
             | cuisines: [{}]
                 | id (int)
                 | name (str)
                 | url_key (str)
                 | main (bool)
             | food_characteristics: [{}]
                 | id (int)
                 | name (str)
                 | is_halal (bool)
                 | is_vegetarian (bool)
             | primary_cuisine: {}
                 | id (int)
                 | name (str)
                 | url_key (str)
                 | main (bool)
         | vendor_points (int)
         | vertical (str)
         | vertical_segment (str)
         | vertical_parent (str)
         | web_path (str)
         | website (str)
         | has_online_payment (bool)
         | discounts_info (list)
         | discounts (list)
         | vendor_legal_information: {}
             | legal_name (str)
             | trade_register_number (str)
         | disclaimers (list)
         | customer_phone (str)
         | vertical_type_ids (list)
         | delivery_provider (str)
         | is_active (bool)
         | is_best_in_city (bool)
         | is_checkout_comment_enabled (bool)
         | is_delivery_enabled (bool)
         | is_new (bool)
         | is_pickup_enabled (bool)
         | is_premium (bool)
         | is_preorder_enabled (bool)
         | is_replacement_dish_enabled (bool)
         | is_service_fee_enabled (bool)
         | is_service_tax_enabled (bool)
         | is_service_tax_visible (bool)
         | is_test (bool)
         | is_vat_disabled (bool)
         | is_vat_included_in_product_price (bool)
         | is_vat_visible (bool)
         | is_voucher_enabled (bool)
         | is_promoted (bool)
         | tag_ids (list)
     | num_platform_delivery (int)
     | num_vendor_delivery (int)
     | num_platform_delivery_page (int)
     | num_vendor_delivery_page (int)
     | dynamic_pricing_session (NoneType)
     | discount_labels_metadata: {}
     | aggregations: {}
         | cuisines: [{}]
             | id (int)
             | title (str)
             | count (int)
             | slug (str)
         | events (list)
         | discount_labels: [{}]
             | title (str)
             | count (int)
         | discounts: [{}]
             | id (str)
             | title (str)
             | type (str)
             | count (int)
         | close_reasons (list)
         | banner (str)
         | location_event (NoneType)
         | partners: [{}]
             | id (str)
             | title (str)
             | logo_url (str)
             | count (int)
             | image_url (str)
         | payment_types: [{}]
             | id (str)
             | title (str)
             | highlighted (bool)
             | count (int)
         | foodCharacteristics: [{}]
             | id (int)
             | title (str)
             | count (int)
             | slug (str)
         | quickFilters: [{}]
             | id (int)
             | title (str)
             | count (int)
             | slug (NoneType)
         | foodCharacteristicsTypes: {}
             | live_tracking: [{}]
                 | id (int)
                 | title (str)
                 | count (int)
                 | type (str)
     | tags: {}
```
</details>
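
As an illustration of how the keys above can be used, the following sketch reduces one entry of `data.items` to a few fields that are commonly needed. It assumes the field names shown in the structure above. `summarize_restaurant` is a hypothetical helper and is not part of the original notebook.

```python
def summarize_restaurant(item):
    """Pick a few commonly used fields from one entry of data['items']."""
    cuisines = item.get('cuisines') or []
    # the main cuisine is the entry flagged with main == True, if there is one
    main_cuisine = next((c['name'] for c in cuisines if c.get('main')), None)
    return {
        'code': item.get('code'),
        'name': item.get('name'),
        'rating': item.get('rating'),
        'review_number': item.get('review_number'),
        'main_cuisine': main_cuisine,
    }
```

For example, `[summarize_restaurant(item) for item in info['data']['items']]` gives a compact overview of all restaurants in a loaded city file.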


### Restaurant Reviews Dataset Structure/Keys

<details>
<summary>Show (Restaurants that have a review)</summary>

```
| restaurant_code: {}
    | data: [{}]
        | uuid
        | createdAt
        | updatedAt
        | text
        | isAnonymous
        | reviewerName
        | reviewerId
        | ratings: [{}]
            | topic
            | score
        | type
        | automatedText
        | replies
        | commentPrettyDate: {}
            | translationKey
            | timeDuration
        | likeCount
        | isLiked
    | pageKey
    | heroImage
```
</details>



<details>
<summary>Show (Restaurants that don't have a review)</summary>

```
| restaurant_code: {}
    | data
    | pageKey
    | heroImage
```
</details>
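
Since the two layouts above differ only in whether `data` contains any reviews, the share of restaurants with reviews can be checked with a few lines. This is a sketch that assumes `data` is empty or missing for restaurants without reviews. `count_review_coverage` is a hypothetical helper and is not part of the original notebook.

```python
def count_review_coverage(reviews_by_code):
    """Return (restaurants with at least one review, restaurants without any)."""
    with_reviews = sum(1 for payload in reviews_by_code.values() if payload.get('data'))
    return with_reviews, len(reviews_by_code) - with_reviews
```

Calling `count_review_coverage(reviews)` on a loaded reviews file returns the two counts as a tuple.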
