# Dynamic Website Scraping with Selenium: Conquering Infinite Scrolls and Swipes

## Table of Contents

- Overview: #overview
- Key Features: #key-features
- Technologies Used: #technologies-used
- Setup: #setup
- Usage: #usage
- Handling Infinite Scroll: #handling-infinite-scroll
- Simulating Left-Right Swipes: #simulating-left-right-swipes
- Extracting Data: #extracting-data
- Additional Tips: #additional-tips
- Author and License: #author-and-license

## Overview

This project demonstrates how to leverage Selenium to effectively scrape dynamic websites that present challenges like infinite scrolling and swipe-based navigation. It provides practical code examples and guidance for overcoming these common obstacles.

## Key Features

- **Infinite Scroll Handling:** Employs techniques to detect scroll ends and trigger further content loading, ensuring complete data capture.
- **Left-Right Swipe Simulation:** Uses JavaScript execution within Selenium to replicate swiping actions, effectively navigating websites that rely on horizontal swipes.
- **Data Extraction:** Demonstrates methods to extract relevant information from the scraped content, tailoring extraction techniques to the specific website's structure.
- **Clear Code Examples and Explanations:** Provides well-structured code with detailed comments, aiding understanding and adaptability to different scenarios.

## Technologies Used

- Selenium WebDriver
- Python (or your preferred programming language)
- WebDriver for your chosen browser (e.g., ChromeDriver for Chrome)

## Setup

1. Install required libraries:
   ```bash
   pip install selenium
   ```
2. Download the appropriate WebDriver for your browser.

## Usage

(Provide code examples and explanations for:)

- Setting up the WebDriver
- Navigating to the target website
- Identifying elements for scraping
- Handling infinite scroll
- Simulating left-right swipes
- Extracting data
- Saving the extracted data

## Handling Infinite Scroll

(Describe specific techniques used for infinite scroll handling)

## Simulating Left-Right Swipes

(Explain how JavaScript execution is used to simulate swipes)

## Extracting Data

(Detail methods used for data extraction, accounting for website structure)

## Additional Tips

- Adjust wait times and element locators to match the target website's behavior.
- Handle potential errors gracefully (e.g., network issues, website changes).
- Consider using a headless browser for faster execution.
- Respect website terms of service and robots.txt.

## Author and License

Written by Anh Nhat Nguyen

License: This project is licensed under the MIT License.


#### Requirements
```
 - selenium
 - json
 - googletrans==4.0.0-rc1
```


## Getting started

In [11]:
## import
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ECondition 
from selenium.webdriver.chrome.service import Service
import json
import time

## create an object of the chrome webdriver
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
option = webdriver.ChromeOptions()
# option.add_argument('headless')
option.add_argument("--window-size=1920,1080")
option.add_argument(f'user-agent={USER_AGENT}')
service = Service(executable_path=r'../chromedriver-win64/chromedriver.exe')
driver = webdriver.Chrome(service=service, options=option)


In [5]:
driver.get(url = "https://migo.travel/Destination/vietnam-hanoi")
pagesource = driver.page_source

In [None]:
pagesource

---

### Banner slide section

In [57]:
#banner
banner_imgs_src_list = []
swiper_banner = driver.find_elements(By.XPATH, value='/html/body/div[1]/main/div[2]/div[1]')
banner_imgs = swiper_banner[0].find_elements(By.TAG_NAME, 'img')
for banner_img in banner_imgs:
    banner_imgs_src = banner_img.get_property('src')
    banner_imgs_src_list.append(banner_imgs_src)

In [115]:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
for i in range(0, 30):
    ActionChains(driver).click_and_hold(swiper_banner[0]).move_by_offset(-200 , -20).release().perform()
    time.sleep(0.5)
ActionChains(driver).send_keys(Keys.ESCAPE).perform()

In [48]:
banner_imgs_src_list

['https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230725/destination-hanoi-migo-10-y5bkpgl2.webp?f=XcGpbyY_gUap2QK_1BRffA',
 'https://files.migo.travel/20230508/hanoi---cot-co-ha-noi-t4flesoi.webp?f=Z7qhniZcy0ilGLAIhcTjbQ',
 'https://files.migo.travel/20230725/destination-hanoi-migo-2-gme5lqrf.webp?f=jlIXmSyJ-0CtKgH4Lr9RCA',
 'https://files.migo.travel/20230725/destination-hanoi-migo-8-p2pldi3d.webp?f=D8cj4NSBJ0qCIPwxSXyMvA',
 'https://files.migo.travel/20230725/destination-hanoi-migo-8-wun3iha0.webp?f=rVnXOPUsVUSs5E5rldGh6Q',
 'https://files.migo.travel/20230725/destination-hanoi-migo-9-teaaq4z3.webp?f=ne22EXBl206enycWJI_cTQ',
 'https://files.migo.travel/20230725/destination-hanoi-migo-18-pej5wel4.webp?f=mDIzf-z7iU6qRpjn1MX2Hw',
 'https://files.migo.travel/20230725/destination-hanoi-migo-5-gw2kok0a.webp?f=9i2j5ln7_0CGbKtGawrGSA',
 'https://files.migo.travel/20230725/destination-hanoi-migo-4-5ezym

---

### About section

In [None]:
#about-container
about_container = driver.find_elements(By.XPATH,'/html/body/div[1]/main/div[2]/div[4]')



In [None]:
about_container_papragraph = [p.text for p in about_container[0].find_elements(By.TAG_NAME, 'p')]
about_container_papragraph

In [None]:
about_container_href = [h.get_property('href') for h in about_container[0].find_elements(By.TAG_NAME, 'a')]
about_container_href

In [None]:
about_container_imgs = [h.get_property('src') for h in about_container[0].find_elements(By.TAG_NAME, 'img')]
about_container_imgs

---

### Explore Hanoi section

In [149]:
expore_section = driver.find_elements(By.ID, 'lstExploreEvent')[-1]

In [165]:
list_attraction = expore_section.find_elements(By.CLASS_NAME, 'list-attraction')[0]

In [None]:
list_attraction.text

In [170]:
list_attraction_imgs_src = []
prev_height = driver.execute_script("return arguments[0].scrollHeight",expore_section)
while True:
    # do scrolling
    driver.execute_script("arguments[0].scrollBy(0,arguments[0].scrollHeight)",expore_section)
    time.sleep(0.5)
    new_current_height = driver.execute_script("return arguments[0].scrollHeight",expore_section)
    print(new_current_height)
    if new_current_height - prev_height == 0:
        break
    prev_height = new_current_height

driver.execute_script("arguments[0].scrollIntoView()",list_attraction)
curr_height = 0
while curr_height <= prev_height:
    driver.execute_script("arguments[0].scrollBy(0,500)",expore_section)
    curr_height += 500
    time.sleep(0.05)
    list_attraction_imgs = list_attraction.find_elements(By.TAG_NAME,'img')
    list_attraction_imgs_lst = [i.get_property('src') for i in list_attraction_imgs if str(i.get_property('src')) != ""]
    list_attraction_imgs_src += list_attraction_imgs_lst
    

39696


In [171]:
list_attraction_imgs_src

['https://files.migo.travel/20231218/artemispastryattractionmigo43-d3vokegk.webp?f=9d8x72pPeEOOI1BhrFIfIw',
 'https://files.migo.travel/20231218/357104289_137513002694758_5010721774976859316_n-1b454zwq.webp?f=k2JPW6JHkkS3TnicuRT3Hw',
 'https://files.migo.travel/20231218/kasaya-nhxe0hxe0ngchayvxe0cafeattractionmigo18-mvydtpwy.webp?f=95lG4p1Do06Rpy5rKIVMLA',
 'https://files.migo.travel/20231218/cxe1imxe2mbistrosignatureveganattractionmigo16-jwaftmbu.webp?f=YRDXia0wnUm9pnPA7lMn9w',
 'https://files.migo.travel/20231218/lxe1lx1ed1t-vietnamesecuisine-edrak12o.webp?f=yKSMDGSpH0OIxc5sH5PZoQ',
 'https://files.migo.travel/20231129/phx1ee5ngthxe0nhtraditionalcuisineattractionmigo13-eka3mw2x.webp?f=6H0OWYPX9k674K0JplIRLQ',
 'https://files.migo.travel/20231128/kumihimo-jwmarriotthanoiattractionmigo28-kc4jffz3.webp?f=b-VRPZpuD0y6c3A2q5mXjQ',
 'https://files.migo.travel/20231114/hangdauwatertankattractionhanoimigo8-oy0kwnlj.webp?f=3Lj0tqsFTkOdlx-XkYB-_w',
 'https://files.migo.travel/20231114/hummingb

In [127]:
list_attraction_imgs = list_attraction.find_elements(By.TAG_NAME,'img')
len(list_attraction_imgs)

198

---

### Lastest stories section

In [6]:
while True:
    try:
        loadMoreButton = driver.find_element(By.XPATH,'//*[@id="load-more"]')
        time.sleep(1)
        loadMoreButton.click()
    except Exception as e:
        print(e)
        break

Message: element not interactable
  (Session info: chrome=120.0.6099.217)
Stacktrace:
	GetHandleVerifier [0x00007FF727CE2142+3514994]
	(No symbol) [0x00007FF727900CE2]
	(No symbol) [0x00007FF7277A74C3]
	(No symbol) [0x00007FF7277F2D29]
	(No symbol) [0x00007FF7277E6A0F]
	(No symbol) [0x00007FF727815FEA]
	(No symbol) [0x00007FF7277E63B6]
	(No symbol) [0x00007FF727816490]
	(No symbol) [0x00007FF7278328F6]
	(No symbol) [0x00007FF727815D93]
	(No symbol) [0x00007FF7277E4BDC]
	(No symbol) [0x00007FF7277E5C64]
	GetHandleVerifier [0x00007FF727D0E16B+3695259]
	GetHandleVerifier [0x00007FF727D66737+4057191]
	GetHandleVerifier [0x00007FF727D5E4E3+4023827]
	GetHandleVerifier [0x00007FF727A304F9+689705]
	(No symbol) [0x00007FF72790C048]
	(No symbol) [0x00007FF727908044]
	(No symbol) [0x00007FF7279081C9]
	(No symbol) [0x00007FF7278F88C4]
	BaseThreadInitThunk [0x00007FF9B4EB7344+20]
	RtlUserThreadStart [0x00007FF9B5A626B1+33]



In [11]:
stories_content_rows = driver.find_elements(By.XPATH,'/html/body/div[1]/main/div[2]/div[7]/div[2]')[0]
stories_content_rows

<selenium.webdriver.remote.webelement.WebElement (session="e7a52f40c5d64ba254db1e2c60ff4180", element="803F638E1ABC4431400FFE8BA70BBAEA_element_317")>

In [14]:
stories_content = stories_content_rows.find_elements(By.TAG_NAME,'a')
len(stories_content)

612

In [24]:
substory_text_href = [t.get_attribute('href') for t in stories_content]
len(substory_text_href), substory_text_href

(612,
 ['https://migo.travel/Experience/5-nha-hang-am-thuc-tay-ban-nha-an-tuong-tai-ha-noi',
  'https://migo.travel/Experience/5-nha-hang-am-thuc-tay-ban-nha-an-tuong-tai-ha-noi',
  'https://migo.travel/Experience/5-nha-hang-am-thuc-tay-ban-nha-an-tuong-tai-ha-noi',
  'https://migo.travel/Experience/5-nha-hang-am-thuc-tay-ban-nha-an-tuong-tai-ha-noi',
  'https://migo.travel/Experience/tiec-toi-trong-gu-bistronomy-nha-hang-fine-dining-hang-dau-tai-ha-noi',
  'https://migo.travel/Pillar/FoodAndDrink',
  'https://migo.travel/Experience/tiec-toi-trong-gu-bistronomy-nha-hang-fine-dining-hang-dau-tai-ha-noi',
  'https://migo.travel/Experience/tiec-toi-trong-gu-bistronomy-nha-hang-fine-dining-hang-dau-tai-ha-noi',
  'https://migo.travel/Experience/thuong-thuc-nhung-bua-an-ngon-mieng-voi-nha-hang-am-thuc-phap-tai-ha-noi',
  'https://migo.travel/Pillar/FoodAndDrink',
  'https://migo.travel/Experience/thuong-thuc-nhung-bua-an-ngon-mieng-voi-nha-hang-am-thuc-phap-tai-ha-noi',
  'https://migo.trav

In [26]:
def unique_list(in_list):
    list_set = set(in_list)
    unique_list = (list(list_set))
    return unique_list

In [27]:
substory_text_href_uniq = unique_list(substory_text_href)
len(substory_text_href_uniq)

158

In [25]:
substory_text_headers = " ".join([t.text for t in stories_content]).split("Read more")
len(substory_text_headers),substory_text_headers    

(154,
 ['  5 impressive Spanish cuisine restaurants in Hanoi ',
  '  Food & Drink · Dinner in GU Bistronomy, the leading fine dining restaurant in Hanoi ',
  '  Food & Drink · Enjoy delicious meals with French cuisine restaurants in Hanoi ',
  '  City & Culture · Enjoy the culture the way Hanoians - winter eating Trang Tien ice cream ',
  '  Food & Drinks · Enjoy 10 street foods in winter in Hanoi ',
  '  Food & Drinks · Beautiful Christmas decorations and restaurants in Hanoi ',
  '  Food & Drinks · Delicious Vietnamese rice at Xoi Rice ',
  "  Food & Drinks · Metropole Hanoi's Spice Garden Restaurant Reopens ",
  "  Food & Drinks · What's special about Tanh Split and SMOKE – two restaurants in Hanoi ",
  '  City & Culture · Hanoi nights are attractive with tourism products ',
  '  Food & Drinks · Go for a drink at Kumihimo Bar & Terrace ',
  '  City & Culture · Experience the heritage train, the 120-year-old Gia Lam railway factory ',
  '  City & Culture · Trains running through Hano

In [32]:
stories_content_paragraph_container = driver.find_elements(By.XPATH,'/html/body/div[1]/main/div[2]/div[7]')[0]
stories_content_paragraph = stories_content_paragraph_container.find_elements(By.TAG_NAME,'div')
stories_content_paragraph_text = [t.text for t in stories_content_paragraph]
stories_content_paragraph_text

['Latest Stories from Hanoi',
 'Latest Stories from Hanoi',
 '',
 'Food & Drinks 11/01/2024\n5 impressive Spanish cuisine restaurants in Hanoi\nIn addition to enjoying the delicious and nutritious flavors, diners also enjoy a luxurious and classy restaurant space. Even the most demanding diners, or diners who have never tried Spanish cuisine, will be intrigued.\nRead more\nFood & Drink · 10/01/2024\nDinner in GU Bistronomy, the leading fine dining restaurant in Hanoi\nOn the journey to experience culinary quintessence, GU is a destination for diners with taste, to satisfy their own culinary taste with creative dishes, valuable and rare wine collections and fine dining standard services.\nRead more\nFood & Drink · 04/01/2024\nEnjoy delicious meals with French cuisine restaurants in Hanoi\nEnjoying French cuisine is an art, as it lies not only in the taste of the dish but also in the presentation and space of enjoyment. Let\'s explore the standard French restaurants below with Migo.\nRea

In [34]:
stories_content_imgs_section = stories_content_paragraph_container.find_elements(By.TAG_NAME,'img')
stories_content_imgs = [t.get_attribute('src') for t in stories_content_imgs_section]
stories_content_imgs

['https://files.migo.travel/20230727/spanish-tapas-and-sangria-on-wooden-table-top-view_519793093-t4wncsyk.webp?f=u5NfbHEAv0uPrtD44OX9Uw',
 'https://files.migo.travel/20230727/spanish-tapas-and-sangria-on-wooden-table-top-view_519793093-t4wncsyk.webp?f=u5NfbHEAv0uPrtD44OX9Uw',
 'https://files.migo.travel/20230929/avargu-q3hzipkn.webp?f=JNV3vnSebEObyGr8Xlagug',
 'https://files.migo.travel/20230724/343562324_964238148088815_7477640552227212310_n-5kvbae0l.webp?f=NsGyw7leNUKuyplbil6UNQ',
 'https://files.migo.travel/20231221/ce13-ce4b-4bf1-a946-8913678355e9_slmx-ujfsiewn.webp?f=_eeR0tZW00u0rIK0VAY8Eg',
 'https://files.migo.travel/20231218/banh-troi-tau-quynh-mai-1702367772-1l5c0dey.webp?f=JsSh-ZwTN0qqYmkhGTmbbw',
 'https://files.migo.travel/20231208/407953667_318473287728201_1261982254385967867_n-btq4kv1i.webp?f=Gi_wpr8AGES_ZLzg6MknoQ',
 'https://files.migo.travel/20231208/347443049_631189431885961_3463678550118900056_n-dhe4u0qf.webp?f=UbsRqXzX90qVQwmxLHabrQ',
 'https://files.migo.travel/20

In [35]:
stories_content_vid_section = stories_content_paragraph_container.find_elements(By.TAG_NAME,'iframe')
stories_content_vid = [t.get_attribute('src') for t in stories_content_imgs_section]
stories_content_vid

['https://files.migo.travel/20230727/spanish-tapas-and-sangria-on-wooden-table-top-view_519793093-t4wncsyk.webp?f=u5NfbHEAv0uPrtD44OX9Uw',
 'https://files.migo.travel/20230727/spanish-tapas-and-sangria-on-wooden-table-top-view_519793093-t4wncsyk.webp?f=u5NfbHEAv0uPrtD44OX9Uw',
 'https://files.migo.travel/20230929/avargu-q3hzipkn.webp?f=JNV3vnSebEObyGr8Xlagug',
 'https://files.migo.travel/20230724/343562324_964238148088815_7477640552227212310_n-5kvbae0l.webp?f=NsGyw7leNUKuyplbil6UNQ',
 'https://files.migo.travel/20231221/ce13-ce4b-4bf1-a946-8913678355e9_slmx-ujfsiewn.webp?f=_eeR0tZW00u0rIK0VAY8Eg',
 'https://files.migo.travel/20231218/banh-troi-tau-quynh-mai-1702367772-1l5c0dey.webp?f=JsSh-ZwTN0qqYmkhGTmbbw',
 'https://files.migo.travel/20231208/407953667_318473287728201_1261982254385967867_n-btq4kv1i.webp?f=Gi_wpr8AGES_ZLzg6MknoQ',
 'https://files.migo.travel/20231208/347443049_631189431885961_3463678550118900056_n-dhe4u0qf.webp?f=UbsRqXzX90qVQwmxLHabrQ',
 'https://files.migo.travel/20

---

### Footer section

In [36]:
footer_section = driver.find_elements(By.TAG_NAME, 'footer')[0]

In [38]:
footer_text = footer_section.text
footer_text

'Where your journey begins\nmarketing@migo.travel\nABOUT MIGO\nAbout Us\nTerms & Conditions\nPrivacy Policy\nPOPULAR SITES\nDestinations\nExperiences\nTours\nEvents\nSOCIAL MEDIA\n© 2023 Exploria Vietnam. All rights reserved.'

In [40]:
footer_section_a = footer_section.find_elements(By.TAG_NAME, 'a')
footer_section_a_href = [a.get_attribute('href') for a in footer_section_a]
footer_section_a_href

['https://migo.travel/',
 'mailto:marketing@migo.travel',
 'https://migo.travel/about',
 'https://migo.travel/support/terms',
 'https://migo.travel/support/policy',
 'https://migo.travel/Destinations',
 'https://migo.travel/Pillar',
 'https://migo.travel/Tour',
 'https://migo.travel/Event',
 'https://www.facebook.com/migotravel.vietnam',
 'https://www.instagram.com/migotravel.vietnam/']

---

## Put it all together

In [20]:
## import
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ECondition 
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import json
import time

## create an object of the chrome webdriver
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
option = webdriver.ChromeOptions()
# option.add_argument('headless') # headless <<- no windows
option.add_argument("--no-sandbox")
option.add_argument("--disable-dev-shm-usage")
option.add_argument("--window-size=1920,1080")
option.add_argument(f'user-agent={USER_AGENT}')
service = Service(executable_path=r'../chromedriver-win64/chromedriver.exe')
driver = webdriver.Chrome(service=service, options=option)

target_url = "https://migo.travel/Destination/vietnam-hanoi"

driver.get(url = target_url)
pagesource = driver.page_source

media_src_and_link_list = []
text_content_list = []

time.sleep(3)

## Banner
banner_imgs_src_list = []
swiper_banner = driver.find_elements(By.XPATH, value='/html/body/div[1]/main/div[2]/div[1]')
# Do swipe left
for i in range(0, 21):
    ActionChains(driver).click_and_hold(swiper_banner[0]).move_by_offset(-200 , -20).release().perform()
    time.sleep(0.05)
# Escape the image gallery view
ActionChains(driver).send_keys(Keys.ESCAPE).perform()
# get all banner images source
banner_imgs = swiper_banner[0].find_elements(By.TAG_NAME, 'img')
for banner_img in banner_imgs:
    banner_imgs_src = banner_imgs[0].get_property('src')
    banner_imgs_src_list.append(banner_imgs_src)
# add to media list
media_src_and_link_list += banner_imgs_src_list

## about-container
about_container = driver.find_elements(By.XPATH,'/html/body/div[1]/main/div[2]/div[4]')
about_container_papragraph = [p.text for p in about_container[0].find_elements(By.TAG_NAME, 'p')]
about_container_href = [h.get_property('href') for h in about_container[0].find_elements(By.TAG_NAME, 'a')]
about_container_imgs = [h.get_property('src') for h in about_container[0].find_elements(By.TAG_NAME, 'img')]
text_content_list += about_container_papragraph
media_src_and_link_list += about_container_href
media_src_and_link_list += about_container_imgs

## Eplore section
expore_section = driver.find_elements(By.ID, 'lstExploreEvent')[-1]
list_attraction = expore_section.find_elements(By.CLASS_NAME, 'list-attraction')[0]
list_attraction_text = list_attraction.text
text_content_list.append(list_attraction_text)

list_attraction_imgs_src = []
prev_height = driver.execute_script("return arguments[0].scrollHeight",expore_section)
while True:
    # do scrolling
    driver.execute_script("arguments[0].scrollBy(0,arguments[0].scrollHeight)",expore_section)
    time.sleep(0.5)
    new_current_height = driver.execute_script("return arguments[0].scrollHeight",expore_section)
    print(new_current_height)
    if new_current_height - prev_height == 0:
        break
    prev_height = new_current_height

driver.execute_script("arguments[0].scrollIntoView()",list_attraction)
curr_height = 0
while curr_height <= prev_height:
    driver.execute_script("arguments[0].scrollBy(0,500)",expore_section)
    curr_height += 500
    time.sleep(0.05)
    list_attraction_imgs = list_attraction.find_elements(By.TAG_NAME,'img')
    list_attraction_imgs_lst = [i.get_property('src') for i in list_attraction_imgs if str(i.get_property('src')) != ""]
    list_attraction_imgs_src += list_attraction_imgs_lst

media_src_and_link_list += list_attraction_imgs_src

## Lastest stories
while True:
    try:
        loadMoreButton = driver.find_element(By.XPATH,'//*[@id="load-more"]')
        time.sleep(1)
        loadMoreButton.click()
    except Exception as e:
        break

stories_content_rows = driver.find_elements(By.XPATH,'/html/body/div[1]/main/div[2]/div[7]/div[2]')[0]
stories_content = stories_content_rows.find_elements(By.TAG_NAME,'a')

def unique_list(in_list):
    list_set = set(in_list)
    unique_list = (list(list_set))
    return unique_list

substory_text_href = [t.get_attribute('href') for t in stories_content]
substory_text_href_uniq = unique_list(substory_text_href)

substory_text_headers = " ".join([t.text for t in stories_content]).split("Read more")
substory_text_headers_uniq = unique_list(substory_text_headers)

text_content_list += substory_text_headers
media_src_and_link_list += substory_text_headers_uniq

stories_content_paragraph_container = driver.find_elements(By.XPATH,'/html/body/div[1]/main/div[2]/div[7]')[0]
stories_content_paragraph = stories_content_paragraph_container.find_elements(By.TAG_NAME,'div')
stories_content_paragraph_text = [t.text for t in stories_content_paragraph]
text_content_list += stories_content_paragraph_text

stories_content_imgs_section = stories_content_paragraph_container.find_elements(By.TAG_NAME,'img')
stories_content_imgs = [t.get_attribute('src') for t in stories_content_imgs_section]
media_src_and_link_list += stories_content_imgs

stories_content_vid_section = stories_content_paragraph_container.find_elements(By.TAG_NAME,'iframe')
stories_content_vid = [t.get_attribute('src') for t in stories_content_imgs_section]
media_src_and_link_list += stories_content_vid

## Footer
footer_section = driver.find_elements(By.TAG_NAME, 'footer')[0]
footer_text = footer_section.text
text_content_list.append(footer_text)

footer_section_a = footer_section.find_elements(By.TAG_NAME, 'a')
footer_section_a_href = [a.get_attribute('href') for a in footer_section_a]
media_src_and_link_list += footer_section_a_href

4880
7208
9656
12032
14408
16832
19280
21608
23888
26240
28640
30824
33176
35552
37832
37866
39236
39236


In [21]:
len(text_content_list), text_content_list

(1246,
 ['Hanoi has experienced a long history for more than 1000 years with 36 streets called Old Quarter. Hanoi nowadays is much more different than the past. The ancient city is being invigorated with modern cafes, bar, world-class restaurants and interesting art galleries.',
  '',
  '★ World Cultural Heritage Site Central Sector of the Imperial Citadel of Thang Long - Hanoi (2010)',
  'Best Time To Visit Hanoi',
  'The best time to visit the capital is around March, April when spring flowers bloom, and from August to November when it is autumn with cool and pleasant temperatures.',
  'Transport',
  'Noi Bai International Airport is 45km away from the city center. There are some means of transportation you can choose to get around the city, such as taxi, technology motorbike taxi, bus, or rental motorbikes. You should give a try on cyclo in the Old Quarter to leisurely go sightseeing.',
  '5 fine dining restaurants for the perfect dinner in Hanoi',
  'Late afternoon sunset on the Sk

In [22]:
len("".join(text_content_list))

239788

In [23]:
len(media_src_and_link_list)

8585

In [2]:
media_src_and_link_list

['https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://files.migo.travel/20230508/hanoi---hang-rong---do-luu-niem-p0savsud.webp?f=plVDT9mpvkS0W-57KuxEZw',
 'https://

In [3]:
media_src_images = [t for t in media_src_and_link_list if str(t).__contains__("webp") or str(t).__contains__("jpg") or str(t).__contains__("jpeg")  or str(t).__contains__("png") or str(t).__contains__("jfif")]
media_src_images = unique_list(media_src_images)
media_src_images

['https://files.migo.travel/20230601/278532569_7431371436903991_8148799056713987647_n-c0sfkfyw.webp?f=ZSA276SnzkyB9VBi9JGZUA',
 'https://files.migo.travel/20230705/w-pho-ga-cham-26-1-1253-tt2ueshy.webp?f=3h0BTmsw5EqRwcvRBec6DA',
 'https://files.migo.travel/20230531/316692282_552244700245098_835379531507972983_n-n12zm1cg.webp?f=9Y7OeDj1PEiYbVgTTsAhRA',
 'https://files.migo.travel/20230704/dieu-gi-khien-gioi-tre-cho-ca-thang-bat-ke-ngay-dem-den-dia-nguc-tran-gian-o-ha-noi-9dc29b2ed6af4e18902c8fe856326639-cyvjxxhh.webp?f=8YJPqy3jR0GfznnblotVsw',
 'https://files.migo.travel/20230802/lasalsa5-h2twapew.webp?f=OJjYTicpuUyorjqj7qhhjQ',
 'https://files.migo.travel/20220525/1613962032-khong-gian-nha-qua-thong-lon-nha-ben-rung-u-lesa-homestay-soc-son-ha-noi-14-5ty1fdmq.jpeg',
 'https://files.migo.travel/20230526/117762468_10157690524506089_5079710354293681107_n-qckwhttg.webp?f=ItsfXqtby0SaVGrD94AyFg',
 'https://files.migo.travel/20230804/khruabaanmigo15-veumvj5y.webp?f=L1tX-bNSwUKuQA3_mqTrQQ',
 '

In [48]:
# link_list = [i for i in media_src_and_link_list if i not in media_src_images]
# link_list = unique_list(link_list)
# link_list

----

## BLIP

In [49]:
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

In [50]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base",cache_dir="./cache")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base",cache_dir="./cache")

In [51]:
media_src_images[0]

'https://files.migo.travel/20231003/divax27sloungeattractionmigo1-pgvr02uh.webp?f=glrWe3cJukeZmv7x1Pdksw'

In [74]:
def vqa_blip(input_url="", question=""):
    img_url = input_url
    if img_url != "":
        raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

        question = question if question != "" else "Describe the image" 
        inputs = processor(raw_image, question, return_tensors="pt")

        out = model.generate(**inputs)
        response = processor.decode(out[0], skip_special_tokens=True)
        print(response + "\n" + img_url)
        return response
    else:
        raise Exception.with_traceback()

In [75]:
for url in media_src_images[:5]:
    vqa_blip(url)



tea party
https://files.migo.travel/20231003/divax27sloungeattractionmigo1-pgvr02uh.webp?f=glrWe3cJukeZmv7x1Pdksw
house
https://files.migo.travel/20210925/01.jpg
hotel room
https://files.migo.travel/20230523/405426005-qrb0wvie.webp?f=rK5M6FrpI02CygECTxHsrg
restaurant
https://files.migo.travel/20230807/legardenattractionmigo1-se0iww1d.webp?f=7SUOFpPMeUK7RrYunBv4mw
food
https://files.migo.travel/20230807/kappouishidarestaurantattractionmigo6-g403rzv5.webp?f=WEWo3q5-yEOV6t83kJ6LmQ
food
https://files.migo.travel/20220530/Kali4-chmtp2kj.jpeg
restaurant
https://files.migo.travel/20230804/cocarestaurant-authenticthaifoodmigo3-dkbv5cwe.webp?f=2UgFssxNf0yjPTrw2RCY9Q
street scene
https://files.migo.travel/20221003/mua-thu-ha-noi-15-nguyen-trong-nam-crq2cjoh.webp?f=zsm4GIOQpkiqp0GemWhNTA
sign for restaurant
https://files.migo.travel/20230703/130046727_3549006128540316_2444567996992417044_n-pldscw43.webp?f=mcfmThxjzEyvB1vQ1ka7qA
broccoli
https://files.migo.travel/20220124/1.png
red and black
https

KeyboardInterrupt: 

**_________________________**

#### ** BLIP IMAGE CAPTIONING LARGE **

In [4]:
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

  from .autonotebook import tqdm as notebook_tqdm


In [5]:
processor_captioning = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large",cache_dir="./cache")
model_captioning = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large",cache_dir="./cache")


In [17]:
def image_captioning(input_url):
    img_url = input_url
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
    # conditional image captioning
    text = "a photography of"
    # inputs = processor_captioning(raw_image, text, return_tensors="pt")
    inputs = processor_captioning(raw_image, return_tensors="pt")
    out = model_captioning.generate(**inputs)
    response = processor_captioning.decode(out[0], skip_special_tokens=True)
    print(response)
    return response
    

In [9]:
for url in media_src_images[:5]:
    time_start = time.time()
    image_captioning(url)
    print("Infer time: ", time.time() - time_start)

a photography of a building with a flag flying in the air
Infer time:  5.83645224571228
a photography of a bowl of soup with chopsticks and a bowl of vegetables
Infer time:  8.331492185592651
a photography of a restaurant with a large window and a view of the city
Infer time:  8.440248489379883
a photography of a person holding a ticket in front of a building
Infer time:  8.395581722259521
a photography of a restaurant with a large chandelier and a dining room
Infer time:  8.218007802963257


In [10]:
# Testing
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
img_url2 = "https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/intermediary/f/a30f6f6e-e2ca-4cb8-a4ff-b772bbf44cda/d87ibte-64cb75fb-2631-41ae-9305-23b92e0c1d46.jpg"
img_url3 = "https://www.babe.today/pic/justteensporn/justteensporn-model/information-handjob-36-dd/hd-justteensporn-model-1.jpg"
img_url4 = "https://genk.mediacdn.vn/thumb_w/640/2019/11/16/photo-1-157391848492999759146.jpg"
img_url5 = "https://crystalsblogonconflict.files.wordpress.com/2019/10/political-meme.png"
img_url6 = "https://steamuserimages-a.akamaihd.net/ugc/2054247397434456937/3CD4F24BF86EFCEE6459A92C17D55CF81DE5ABF1/?imw=5000&imh=5000&ima=fit&impolicy=Letterbox&imcolor=%23000000&letterbox=false?interpolation=lanczos-none&output-format=jpeg&output-quality=70&fit=inside|637:358&composite-to=*,*|637:358&background-color=f0f0f0"
img_urls = [img_url, img_url2, img_url3, img_url4, img_url5, img_url6]
for url in img_urls:
    time_start = time.time()
    image_captioning(url)
    print("Infer time: ", time.time() - time_start)



a photography of a woman and her dog on the beach
Infer time:  6.603883981704712
a photography of a woman in a bikini laying on a bed
Infer time:  7.035293817520142
a photography of a woman in a bikini is sitting on a couch
Infer time:  8.429671287536621
a photography of a woman in a bikini posing for a picture
Infer time:  7.957149267196655
a photography of a cartoon of a man smoking a cigarette and a gun
Infer time:  8.531107187271118
a photography of a woman in a bikini is looking at her cell phone
Infer time:  9.423405647277832


**_________________________**

#### ** BLIP2-OPT-2.7B **

In [76]:
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor_blip2 = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b",cache_dir="./cache")
model_blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b",cache_dir="./cache")

preprocessor_config.json: 100%|██████████| 432/432 [00:00<00:00, 432kB/s]
tokenizer_config.json: 100%|██████████| 904/904 [00:00<00:00, 903kB/s]
vocab.json: 100%|██████████| 798k/798k [00:00<00:00, 1.20MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 908kB/s]
tokenizer.json: 100%|██████████| 2.11M/2.11M [00:00<00:00, 2.44MB/s]
special_tokens_map.json: 100%|██████████| 548/548 [00:00<00:00, 272kB/s]
config.json: 100%|██████████| 6.96k/6.96k [00:00<?, ?B/s]
pytorch_model.bin.index.json: 100%|██████████| 122k/122k [00:00<00:00, 394kB/s]
pytorch_model-00001-of-00002.bin:  11%|█▏        | 1.13G/10.0G [02:07<16:36, 8.89MB/s]
Downloading shards:   0%|          | 0/2 [02:08<?, ?it/s]


KeyboardInterrupt: 

In [None]:
def vqa_blip(input_url="", question=""):
    img_url = input_url
    if img_url != "":
        raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

        question = question if question != "" else "Describe the image" 
        inputs = processor(raw_image, question, return_tensors="pt")

        out = model.generate(**inputs)
        response = processor.decode(out[0], skip_special_tokens=True)
        print(response + "\n" + img_url)
        return response
    else:
        raise Exception.with_traceback()

-----

## Stage 2 crawler --------------------------------

In [29]:
substory_text_href_uniq

['https://migo.travel/Experience/mam-co-tet-cua-nguoi-ha-noi',
 'https://migo.travel/Experience/3-diem-nghi-duong-giua-thien-nhien-xanh-gan-ha-noi',
 'https://migo.travel/Experience/luu-niem-tu-viet-nam-chuon-chuon-tre-ruc-ro-sac-mau',
 'https://migo.travel/Experience/cam-nang-du-lich-ha-noi-tu-a-den-z-chi-tiet',
 'https://migo.travel/Experience/top-nha-hang-y-khong-he-xot-vi-giua-long-ha-noi',
 'https://migo.travel/Experience/khung-troi-ha-noi-qua-o-cua-cua-homie-coffee',
 'https://migo.travel/Experience/thuong-thuc-bua-trua-bun-cha-va-tra-da-tai-ha-noi',
 'https://migo.travel/Experience/xin-chao-cafe-shibuya-thu-gian-trong-quan-cafe-nhat-ban-o-ha-noi',
 'https://migo.travel/Experience/checkin-giang-sinh-tai-nhung-quan-ca-phe-dep-lung-linh-o-ha-noi',
 'https://migo.travel/Experience/5-diem-ly-tuong-di-choi-noel-o-ha-noi',
 'https://migo.travel/Experience/top-5-khong-gian-nghe-thuat-duong-dai-an-tuong-nhat-ha-noi',
 'https://migo.travel/Experience/ngon-kho-cuong-nhung-mon-an-vat-nong-h

In [46]:
substory_text_href_uniq_lang_en = []
for i in substory_text_href_uniq:
    if i != None and i.split('/')[-2] == "Experience":
        substory_text_href_uniq_lang_en.append(f"https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2F{i.split('/')[-2]}%2F{i.split('/')[-1]}")
substory_text_href_uniq_lang_en

['https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2Fmam-co-tet-cua-nguoi-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2F3-diem-nghi-duong-giua-thien-nhien-xanh-gan-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2Fluu-niem-tu-viet-nam-chuon-chuon-tre-ruc-ro-sac-mau',
 'https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2Fcam-nang-du-lich-ha-noi-tu-a-den-z-chi-tiet',
 'https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2Ftop-nha-hang-y-khong-he-xot-vi-giua-long-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2Fkhung-troi-ha-noi-qua-o-cua-cua-homie-coffee',
 'https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2Fthuong-thuc-bua-trua-bun-cha-va-tra-da-tai-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=en&returnUrl=~%2FExperience%2Fxin-chao-cafe-shibuya-thu-gian-trong-quan-cafe-nhat-

In [48]:
substory_text_href_uniq_lang_vn = []
for i in substory_text_href_uniq:
    if i != None and i.split('/')[-2] == "Experience":
        substory_text_href_uniq_lang_vn.append(f"https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2F{i.split('/')[-2]}%2F{i.split('/')[-1]}")
substory_text_href_uniq_lang_vn

['https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2Fmam-co-tet-cua-nguoi-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2F3-diem-nghi-duong-giua-thien-nhien-xanh-gan-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2Fluu-niem-tu-viet-nam-chuon-chuon-tre-ruc-ro-sac-mau',
 'https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2Fcam-nang-du-lich-ha-noi-tu-a-den-z-chi-tiet',
 'https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2Ftop-nha-hang-y-khong-he-xot-vi-giua-long-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2Fkhung-troi-ha-noi-qua-o-cua-cua-homie-coffee',
 'https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2Fthuong-thuc-bua-trua-bun-cha-va-tra-da-tai-ha-noi',
 'https://migo.travel/language/ChangeLang?lang=vn&returnUrl=~%2FExperience%2Fxin-chao-cafe-shibuya-thu-gian-trong-quan-cafe-nhat-

In [53]:
def get_substory_content(input_en_url, input_vn_url):
    texts = []
    driver.get(input_en_url)
    time.sleep(0.5)
    second_stage_body_en = driver.find_element(By.TAG_NAME, 'body')
    texts.append(str(second_stage_body_en.text))
    second_stage_all_imgs = second_stage_body_en.find_elements(By.TAG_NAME, 'img')
    second_stage_all_imgs_src = [i.get_attribute('src') for i in second_stage_all_imgs if i.get_attribute('src') != None]
    time.sleep(0.5)
    driver.get(input_vn_url)
    second_stage_body_vn = driver.find_element(By.TAG_NAME, 'body')
    texts.append(str(second_stage_body_vn.text))
    return texts, second_stage_all_imgs_src


In [54]:
second_stage_text_list = []
second_stage_img_src_list = []

for en_url, vn_url in zip(substory_text_href_uniq_lang_en, substory_text_href_uniq_lang_vn):
    texts, second_stage_all_imgs_src = get_substory_content(en_url, vn_url)
    second_stage_text_list += texts
    second_stage_img_src_list += second_stage_all_imgs_src


In [55]:
len(second_stage_img_src_list)

2838

In [57]:
len(second_stage_text_list)

306

In [83]:
second_stage_text_list[:5]

["Destinations\nExperiences\nTours\nEvents\nVietnam Tips\nLog in Sign Up\nExperiences\n›\nCity & Culture\n›\nTet tray of Hanoians\nLocal culture Cuisine Delicious food to try Traditional festival\nViews\n281\nSave to Collection\nShare to Facebook\nCopy link\nCity & Culture 11:47 AM - Jan 24, 2022\nTet tray of Hanoians\nHanoi\nTranslated by Bing\nIn the traditional culture of Vietnamese people, on any occasion from marriage to anniversary, people always display a decent tray. But perhaps the New Year's Day tray is more special.\nFor Hanoians, Tet tray always requires sophistication and sophistication in each dish, showing the culinary quintessence of Ha Thanh land.\n\n\nTraditionally, the Tet tray must have four pillars of 4 bowls and 4 plates symbolizing the four seasons and four directions. The well-off house can make the tray bigger with 6 bowls of 6 plates or 8 bowls of 8 plates.\n\nEach house is different, but there are dishes that have become an indispensable feature on Tet Day. H

----

### Text classification

In [11]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax


In [12]:
MODEL = f"cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL, cache_dir='./cache')
config = AutoConfig.from_pretrained(MODEL, cache_dir='./cache')
# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL, cache_dir='./cache')
model.save_pretrained(save_directory='./cache')


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [85]:
len(second_stage_text_list[0])

4272

In [92]:
second_stage_text_list[0]

"Destinations\nExperiences\nTours\nEvents\nVietnam Tips\nLog in Sign Up\nExperiences\n›\nCity & Culture\n›\nTet tray of Hanoians\nLocal culture Cuisine Delicious food to try Traditional festival\nViews\n281\nSave to Collection\nShare to Facebook\nCopy link\nCity & Culture 11:47 AM - Jan 24, 2022\nTet tray of Hanoians\nHanoi\nTranslated by Bing\nIn the traditional culture of Vietnamese people, on any occasion from marriage to anniversary, people always display a decent tray. But perhaps the New Year's Day tray is more special.\nFor Hanoians, Tet tray always requires sophistication and sophistication in each dish, showing the culinary quintessence of Ha Thanh land.\n\n\nTraditionally, the Tet tray must have four pillars of 4 bowls and 4 plates symbolizing the four seasons and four directions. The well-off house can make the tray bigger with 6 bowls of 6 plates or 8 bowls of 8 plates.\n\nEach house is different, but there are dishes that have become an indispensable feature on Tet Day. Ha

In [15]:
test_string = "Destinations\nExperiences\nTours\nEvents\nVietnam Tips\nLog in Sign Up\nExperiences\n›\nCity & Culture\n›\nTet tray of Hanoians\nLocal culture Cuisine Delicious food to try Traditional festival\nViews\n281\nSave to Collection\nShare to Facebook\nCopy link\nCity & Culture 11:47 AM - Jan 24, 2022\nTet tray of Hanoians\nHanoi\nTranslated by Bing\nIn the traditional culture of Vietnamese people, on any occasion from marriage to anniversary, people always display a decent tray. But perhaps the New Year's Day tray is more special.\nFor Hanoians, Tet tray always requires sophistication and sophistication in each dish, showing the culinary quintessence of Ha Thanh land.\n\n\nTraditionally, the Tet tray must have four pillars of 4 bowls and 4 plates symbolizing the four seasons and four directions. The well-off house can make the tray bigger with 6 bowls of 6 plates or 8 bowls of 8 plates.\n\nEach house is different, but there are dishes that have become an indispensable feature on Tet Day. Hanoians in particular and Northerners in general often celebrate Tet with banh chong, boiled chicken, cooked pork, silk spring rolls, cinnamon rolls, fried spring rolls, ball soup, stir-fried plates, sticky rice, onion melon, ... Each dish will be neatly displayed in small bowls and plates, with the typical blue enamel ancient motif of Bat Trang.\n\nThe offering chicken must be a flower rooster. When boiling chicken, boiling water must be boiled, so that the old hot water fills the chicken to help the boiled water not become fishy and cloudy. The pot of chicken boils again, it is necessary to lower the heat to simmer, both boiling and skimming. The chicken broth is fragrant and in the night cooks a bowl of sweet, soothing ball soup. Finished boiled chicken must retain its golden skin. If left whole, the chicken must tie fairy wings, mouth a crimson velvet rose, if chopped into pieces, it must be very even, arranged on a very plump plate and sprinkled on top of a few strands of sliced lemon leaves. In particular, accompanied by a plate of boiled chicken, it is indispensable for a plate of lemon salt and pepper, which is really full of typical flavors for this dish.\n\nWhen wrapped, the meat is not arranged vertically, but serrated and rolled into a circle. By doing so, when chopping the cake into 8 pieces, each piece has enough skin, fat, lean. After boiling, the cake will be offered to the family first.\n\n\nVegetables on New Year's Day at that time were mainly only kohlrabi, carrots, and beans, but the mothers processed them very well. Kohlrabi carrots are divided into different parts, the square place is trimmed or chopped only to make a dummy, the excess distortion is peeled to make rounded legs to make a bowl of balls or diced to make stir-fried almonds.\n\nThe more elaborate the tray of rice offerings, the more it must be used when lowered. If the house has guests, first invite guests to loosen the ball, then the mushroom piece, the tenderloin, then the curd meat plate, silk spring plate, stir-fry plate ,... The plump dishes that fit in small dishes are continued like that until the end in sequence from taste to taste. Thanks to that, guests are both delicious and feel the thoughtfulness and thoughtfulness of the host.\n\nOver time, the traditional Tet tray of Hanoians has had many changes, partly because modern Hanoians update many dishes with new cooking methods, partly because the conditions and taste preferences of each person are different. However, the Vietnamese way of making Tet tray still has in common that is the sincerity to pay respect to the ancestors and the love of family members placed in the reunion rice tray.\n\nImage source: Multiple authors\nYOU MAY ALSO LIKE\nCity & Culture\n6 important traditional Tet days in Vietnamese culture\nFood & Drinks\nTypical Tet dishes in 3 regions\nCity & Culture\nPainting Khuc cake village in spring\nFood & Drinks\nLa Vong fish cake: Exquisite Hanoi taste for guests from all over the world\nFood & Drinks\n8 delicious dishes not to be missed when coming to Lang Son\nComments\nWrite a comment\nComments (0)\nWhere your journey begins\nmarketing@migo.travel\nABOUT MIGO\nAbout Us\nTerms & Conditions\nPrivacy Policy\nPOPULAR SITES\nDestinations\nExperiences\nTours\nEvents\nSOCIAL MEDIA\n© 2023 Exploria Vietnam. All rights reserved."

In [3]:
from googletrans import Translator
translator = Translator()
translator.detect("Destinations\nExperiences\nTours\nEvents\nVietnam Tips\nLog in Sign Up\nExperiences\n›\nCity & Culture\n›\nTet tray of Hanoians\nLocal culture Cuisine Delicious food to try Traditional festival\nViews\n281\nSave to Collection\nShare to Facebook\nCopy link\nCity & Culture 11:47 AM - Jan 24, 2022\nTet tray of Hanoians\nHanoi\nTranslated by Bing\nIn the traditional culture of Vietnamese people, on any occasion from marriage to anniversary, people always display a decent tray. But perhaps the New Year's Day tray is more special.\nFor Hanoians, Tet tray always requires sophistication and sophistication in each dish, showing the culinary quintessence of Ha Thanh land.\n\n\nTraditionally, the Tet tray must have four pillars of 4 bowls and 4 plates symbolizing the four seasons and four directions. The well-off house can make the tray bigger with 6 bowls of 6 plates or 8 bowls of 8 plates.\n\nEach house is different, but there are dishes that have become an indispensable feature on Tet Day. Hanoians in particular and Northerners in general often celebrate Tet with banh chong, boiled chicken, cooked pork, silk spring rolls, cinnamon rolls, fried spring rolls, ball soup, stir-fried plates, sticky rice, onion melon, ... Each dish will be neatly displayed in small bowls and plates, with the typical blue enamel ancient motif of Bat Trang.\n\nThe offering chicken must be a flower rooster. When boiling chicken, boiling water must be boiled, so that the old hot water fills the chicken to help the boiled water not become fishy and cloudy. The pot of chicken boils again, it is necessary to lower the heat to simmer, both boiling and skimming. The chicken broth is fragrant and in the night cooks a bowl of sweet, soothing ball soup. Finished boiled chicken must retain its golden skin. If left whole, the chicken must tie fairy wings, mouth a crimson velvet rose, if chopped into pieces, it must be very even, arranged on a very plump plate and sprinkled on top of a few strands of sliced lemon leaves. In particular, accompanied by a plate of boiled chicken, it is indispensable for a plate of lemon salt and pepper, which is really full of typical flavors for this dish.\n\nWhen wrapped, the meat is not arranged vertically, but serrated and rolled into a circle. By doing so, when chopping the cake into 8 pieces, each piece has enough skin, fat, lean. After boiling, the cake will be offered to the family first.\n\n\nVegetables on New Year's Day at that time were mainly only kohlrabi, carrots, and beans, but the mothers processed them very well. Kohlrabi carrots are divided into different parts, the square place is trimmed or chopped only to make a dummy, the excess distortion is peeled to make rounded legs to make a bowl of balls or diced to make stir-fried almonds.\n\nThe more elaborate the tray of rice offerings, the more it must be used when lowered. If the house has guests, first invite guests to loosen the ball, then the mushroom piece, the tenderloin, then the curd meat plate, silk spring plate, stir-fry plate ,... The plump dishes that fit in small dishes are continued like that until the end in sequence from taste to taste. Thanks to that, guests are both delicious and feel the thoughtfulness and thoughtfulness of the host.\n\nOver time, the traditional Tet tray of Hanoians has had many changes, partly because modern Hanoians update many dishes with new cooking methods, partly because the conditions and taste preferences of each person are different. However, the Vietnamese way of making Tet tray still has in common that is the sincerity to pay respect to the ancestors and the love of family members placed in the reunion rice tray.\n\nImage source: Multiple authors\nYOU MAY ALSO LIKE\nCity & Culture\n6 important traditional Tet days in Vietnamese culture\nFood & Drinks\nTypical Tet dishes in 3 regions\nCity & Culture\nPainting Khuc cake village in spring\nFood & Drinks\nLa Vong fish cake: Exquisite Hanoi taste for guests from all over the world\nFood & Drinks\n8 delicious dishes not to be missed when coming to Lang Son\nComments\nWrite a comment\nComments (0)\nWhere your journey begins\nmarketing@migo.travel\nABOUT MIGO\nAbout Us\nTerms & Conditions\nPrivacy Policy\nPOPULAR SITES\nDestinations\nExperiences\nTours\nEvents\nSOCIAL MEDIA\n© 2023 Exploria Vietnam. All rights reserved.").lang

'en'

In [13]:
# !pip install googletrans==4.0.0-rc1
from googletrans import Translator
def translate_text(input_text="", target_lang='en'):
    translator = Translator()
    detected_lang_input = translator.detect(input_text).lang
    if detected_lang_input == 'en':
        del translator
        return input_text
    else:
        translator_output = translator.translate(text=input_text, dest=target_lang).text
        del translator
        return translator_output

In [14]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = t.strip()
        new_text.append(t)
    return " ".join(new_text)

def split_string_into_chunks(input_string, chunk_length):
    """
    Split a long string into chunks of the specified length.

    Parameters:
    - input_string (str): The input string to be split.
    - chunk_length (int): The desired length of each chunk.

    Returns:
    - List[str]: A list containing the split chunks.
    """
    return [input_string[i:i + chunk_length] for i in range(0, len(input_string), chunk_length)]


def get_text_prediction(input_text=""):
    if input_text != "":
        text = input_text if len(input_text)>20 else "Covid cases are increasing fast!"
        text = preprocess(text)
        text_chunks = []
        if len(text) <= 1024:
            text_chunks.append(text)
        else:
            text_chunks = split_string_into_chunks(text, 1024)
            print("splitted into chunks: ", len(text_chunks))

        result_dict = {
            "positive":[],
            "negative":[],
            "neutral":[]
        }

        start_timer = time.time()
        for sub_text in text_chunks:
            encoded_input = tokenizer(sub_text, return_tensors='pt')
            output = model(**encoded_input)
            scores = output[0][0].detach().numpy()
            scores = softmax(scores)
            
            ranking = np.argsort(scores)
            ranking = ranking[::-1]
            for i in range(scores.shape[0]):
                l = config.id2label[ranking[i]]
                s = scores[ranking[i]]
                result_dict[l].append(np.round(float(s), 4))
        infer_time = round(time.time() - start_timer, 3)
        # Post process results dictionary
        def calc_average(lst): 
            return sum(lst) / len(lst) 
        result_dict['positive'] = calc_average(result_dict['positive'])
        result_dict['negative'] = calc_average(result_dict['negative'])
        result_dict['neutral'] = calc_average(result_dict['neutral'])
        max_l = max(result_dict, key=lambda key: result_dict[key])
        max_s = result_dict[max_l]
        # Print labels and scores
        print(f"Inference time (seconds): {infer_time }s \nContext: {text} \n -> {max_l}: {max_s} \n {'-'*25} ")
        return max_l,max_s
    else:
        print(f"Input text is None: {input_text}")

In [29]:
get_text_prediction("Nhà hàng Kalí được thành lập bởi đầu bếp Thomas Etievant và Marco Yanes đến từ Milan, Ý. Cái tên Kali bắt nguồn từ từ Kalimera trong tiếng Hy Lạp có nghĩa thay cho lời chào, thể hiện phần nào sự thân thiện và thoải mái hay cũng chính là phong cách mà nơi đây hướng tới. Nằm ở vị trí đắc địa bên Hồ Tây lộng gió, bước vào đây như bước vào một không gian tươi mới, thoáng mát với tầm nhìn siêu rộng - điều khó có thể kiếm được tại những khu vực trung tâm thủ đô. ")

Inference time (seconds): 0.787s 
Context: Nhà hàng Kalí được thành lập bởi đầu bếp Thomas Etievant và Marco Yanes đến từ Milan, Ý. Cái tên Kali bắt nguồn từ từ Kalimera trong tiếng Hy Lạp có nghĩa thay cho lời chào, thể hiện phần nào sự thân thiện và thoải mái hay cũng chính là phong cách mà nơi đây hướng tới. Nằm ở vị trí đắc địa bên Hồ Tây lộng gió, bước vào đây như bước vào một không gian tươi mới, thoáng mát với tầm nhìn siêu rộng - điều khó có thể kiếm được tại những khu vực trung tâm thủ đô.  
 -> neutral: 0.8278 
 ------------------------- 


('neutral', 0.8278)

In [31]:
get_text_prediction("The Palestinians are like crocodiles; the more you give them meat, they want more")

Inference time (seconds): 0.082s 
Context: The Palestinians are like crocodiles; the more you give them meat, they want more 
 -> negative: 0.8244 
 ------------------------- 


('negative', 0.8244)

In [40]:
get_text_prediction(test_string)

splitted into chunks:  5
Inference time (seconds): 1.667s 
Context: Destinations
Experiences
Tours
Events
Vietnam Tips
Log in Sign Up
Experiences
›
City & Culture
›
Tet tray of Hanoians
Local culture Cuisine Delicious food to try Traditional festival
Views
281
Save to Collection
Share to Facebook
Copy link
City & Culture 11:47 AM - Jan 24, 2022
Tet tray of Hanoians
Hanoi
Translated by Bing
In the traditional culture of Vietnamese people, on any occasion from marriage to anniversary, people always display a decent tray. But perhaps the New Year's Day tray is more special.
For Hanoians, Tet tray always requires sophistication and sophistication in each dish, showing the culinary quintessence of Ha Thanh land.


Traditionally, the Tet tray must have four pillars of 4 bowls and 4 plates symbolizing the four seasons and four directions. The well-off house can make the tray bigger with 6 bowls of 6 plates or 8 bowls of 8 plates.

Each house is different, but there are dishes that have become

('neutral', 0.6268)

### Chain two model vqa and text

In [18]:
# Testing
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
img_url2 = "https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/intermediary/f/a30f6f6e-e2ca-4cb8-a4ff-b772bbf44cda/d87ibte-64cb75fb-2631-41ae-9305-23b92e0c1d46.jpg"
img_url3 = "https://www.babe.today/pic/justteensporn/justteensporn-model/information-handjob-36-dd/hd-justteensporn-model-1.jpg"
img_url4 = "https://genk.mediacdn.vn/thumb_w/640/2019/11/16/photo-1-157391848492999759146.jpg"
img_url5 = "https://crystalsblogonconflict.files.wordpress.com/2019/10/political-meme.png"
img_url6 = "https://steamuserimages-a.akamaihd.net/ugc/2054247397434456937/3CD4F24BF86EFCEE6459A92C17D55CF81DE5ABF1/?imw=5000&imh=5000&ima=fit&impolicy=Letterbox&imcolor=%23000000&letterbox=false?interpolation=lanczos-none&output-format=jpeg&output-quality=70&fit=inside|637:358&composite-to=*,*|637:358&background-color=f0f0f0"
img_urls = [img_url, img_url2, img_url3, img_url4, img_url5, img_url6]
for url in img_urls:
    time_start = time.time()
    response = image_captioning(url)
    get_text_prediction(response)
    print("Infer time: ", time.time() - time_start)

woman sitting on the beach with her dog and a cell phone
Inference time (seconds): 0.057s 
Context: woman sitting on the beach with her dog and a cell phone 
 -> neutral: 0.8619 
 ------------------------- 
Infer time:  6.250829219818115
a close up of a woman in a bikini laying on a bed
Inference time (seconds): 0.067s 
Context: a close up of a woman in a bikini laying on a bed 
 -> neutral: 0.8805 
 ------------------------- 
Infer time:  7.1714253425598145
arafed woman in a bikini and socks is sitting on a couch
Inference time (seconds): 0.068s 
Context: arafed woman in a bikini and socks is sitting on a couch 
 -> neutral: 0.6848 
 ------------------------- 
Infer time:  7.654696941375732
a close up of a woman in a bikini posing for a picture
Inference time (seconds): 0.081s 
Context: a close up of a woman in a bikini posing for a picture 
 -> neutral: 0.8392 
 ------------------------- 
Infer time:  7.619798898696899
a cartoon of a man smoking a cigarette while holding a gun
Infere

----

### CLIP zero shot classification

In [1]:
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
model_clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", cache_dir="./cache")
processor_clip = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14", cache_dir="./cache")

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.


In [16]:
def zeroshot_classification(input_url=""):
    if input_url != "" and input_url.startswith(("http://", "https://")):
        image = Image.open(requests.get(input_url, stream=True).raw)
        classes = ["food","drink","restaurant","sexy","violent","gun","bikini","porn","naked","politic","building","hotel","animal","music","landscape","weapon","racing","vehicle","religion"]
        inputs = processor_clip(text=classes, images=image, return_tensors="pt", padding=True)
        outputs = model_clip(**inputs)
        logits_per_image = outputs.logits_per_image # this is the image-text similarity score
        probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
        result_idx = probs.detach().numpy()[0].argmax()
        print(f"{classes[result_idx]}: {probs.detach().numpy()[0][result_idx]}")
        return probs

In [17]:
# Testing
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
img_url2 = "https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/intermediary/f/a30f6f6e-e2ca-4cb8-a4ff-b772bbf44cda/d87ibte-64cb75fb-2631-41ae-9305-23b92e0c1d46.jpg"
img_url3 = "https://www.babe.today/pic/justteensporn/justteensporn-model/information-handjob-36-dd/hd-justteensporn-model-1.jpg"
img_url4 = "https://genk.mediacdn.vn/thumb_w/640/2019/11/16/photo-1-157391848492999759146.jpg"
img_url5 = "https://crystalsblogonconflict.files.wordpress.com/2019/10/political-meme.png"
img_url6 = "https://steamuserimages-a.akamaihd.net/ugc/2054247397434456937/3CD4F24BF86EFCEE6459A92C17D55CF81DE5ABF1/?imw=5000&imh=5000&ima=fit&impolicy=Letterbox&imcolor=%23000000&letterbox=false?interpolation=lanczos-none&output-format=jpeg&output-quality=70&fit=inside|637:358&composite-to=*,*|637:358&background-color=f0f0f0"
img_urls = [img_url, img_url2, img_url3, img_url4, img_url5, img_url6]
for url in img_urls:
    probs = zeroshot_classification(url)
    # break



animal: 0.3779453635215759
sexy: 0.4406830966472626
porn: 0.45160791277885437
sexy: 0.7678896188735962
weapon: 0.5070120692253113
porn: 0.429307222366333
