<a href="https://colab.research.google.com/github/TK-Problem/Python-mokymai/blob/master/Scripts/bdtechtalks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Import packages

# the playwright library is used to fetch the page's HTML
# pin a specific version so that it works on Google Colab
!pip install playwright==1.25.00
!playwright install-deps
!playwright install webkit
!pip install nest_asyncio

# in a notebook only Playwright's async API works; nest_asyncio lets
# asyncio.run() be called inside the already running event loop
import nest_asyncio
nest_asyncio.apply()
import asyncio

# import the async version of playwright
from playwright.async_api import async_playwright

# bs4 is used to extract the required information from the HTML
from bs4 import BeautifulSoup

# needed to measure how long the script runs
import time

# packages for working with numbers and data
import pandas as pd
import numpy as np

# clear_output is used to clear the cell output (installation logs)
from IPython.display import clear_output
clear_output()

# Downloading the data

The function works in the following steps:

* creates a `playwright` webdriver (webkit),
* sets a fake `user_agent` so that the site treats you as a real user,
* goes to the page,
* scrolls to the bottom of the page 5 times with the mouse wheel (a quick but inefficient way to reach the "load more" button),
* while the "load more" button is visible, clicks it and scrolls down again,
* reads the HTML and, with the help of BS4, saves the data to a pandas DataFrame.

The main goal of this script is to scrape data from a page with an "infinite scroll" feature.
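
The last step (parsing with BS4) can be tried without a browser. The sketch below is a minimal illustration, not the scraper itself: `SAMPLE_HTML` is invented markup that only reuses the class names the scraper function looks for, and `parse_blocks` and `_text` are helper names introduced here. It uses Python's built-in `html.parser` and tolerates blocks with a missing author or excerpt.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<div class="td-item-details">
  <h3><a rel="bookmark" href="https://example.com/a">First post</a></h3>
  <span class="td-post-author-name">Author A</span>
  <div class="td-excerpt"> Short summary A </div>
</div>
<div class="td-item-details">
  <h3><a rel="bookmark" href="https://example.com/b">Second post</a></h3>
  <div class="td-excerpt">Summary without an author</div>
</div>
"""

def _text(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None

def parse_blocks(html):
    """Collect title, author, URL and excerpt from each article block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", {"class": "td-item-details"}):
        link = block.find("a", {"rel": "bookmark"})
        if link is None:
            continue  # skip blocks without a title link
        rows.append([
            _text(link),
            _text(block.find("span", {"class": "td-post-author-name"})),
            link.get("href"),
            _text(block.find("div", {"class": "td-excerpt"})),
        ])
    return pd.DataFrame(rows, columns=["Title", "Author", "URL", "Excerpt"])

sample_df = parse_blocks(SAMPLE_HTML)
print(sample_df)
```

Checking each tag for `None` before reading its text avoids an `AttributeError` when a block lacks one of the fields.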

In [2]:
#@title TechTalk Blog function
async def TechTalk():
    """
    Scrape all articles listed in the bdtechtalks.com blog category.
    Inputs:
      none
    Output:
      returns pandas DataFrame with Title, Author, URL and Excerpt columns
    """
    async with async_playwright() as p:

        # create webdriver/webkit
        browser = await p.webkit.launch()

        # create user agent for the webdriver
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'

        # create new page, i.e. new tab in your browser
        page = await browser.new_page(user_agent=user_agent)
        
        # URL of the blog category page
        url = "https://bdtechtalks.com/category/blog/"
        
        # visit page
        await page.goto(url)

        # click cookie button
        await page.click("//a[@id='cookie_action_close_header']")

        # fixed pause to let the page settle (does not block the event loop)
        await asyncio.sleep(2)

        # page.mouse.wheel(delta_x, delta_y): a positive delta_y scrolls
        # down, a negative one scrolls up
        # https://stackoverflow.com/questions/69183922/playwright-auto-scroll-to-bottom-of-infinite-scroll-page
        for i in range(5):
          # make the range as long as needed
          await page.mouse.wheel(0, 15000)
          # wait till page loads
          await asyncio.sleep(2)
          await page.screenshot(path=f"{str(i).zfill(3)}.png")

        # record start time
        script_start_time = time.time()

        # check if button is visible
        while await page.locator("text=Load more").is_visible():
          # click load more button
          await page.locator("text=Load more").click()
          
          # wait till page loads
          await asyncio.sleep(2)
          
          # scroll down
          await page.mouse.wheel(0, 15000)
          # i carries over from the scroll loop above (it ends at 4)
          i += 1
          
          # take status screenshot (uncomment for debugging)
          # await page.screenshot(path=f"{str(i).zfill(3)}.png")

          if (i + 5) % 20 == 0:

            # print status message
            print(f"{(i + 5)} pages loaded. Script is running for {time.time() - script_start_time:.1f} secs.")

          if i > 200:
            break

        # get page html contents
        page_source = await page.content()

        # convert to bs4 object
        soup = BeautifulSoup(page_source, "lxml")

        # article blocks
        blocks = soup.find_all('div', {'class': 'td-item-details'})

        # tmp list to store data
        lst = list()

        # iterate over all blocks
        for block in blocks:
          # get article title
          _title = block.find('a', {'rel': 'bookmark'})
          
          # article author
          _author = block.find('span', {'class': 'td-post-author-name'}).text
          
          # url to an article
          _url = _title['href']
          
          # short description/headline
          _desc = block.find('div', {'class': 'td-excerpt'}).text

          # add to list
          lst.append([_title.text, _author, _url, _desc])

        # save image to your environment (for debugging)
        # one can comment out this line
        await page.screenshot(path="last_status.png")
        
        # close webkit
        await browser.close()

        # return pandas DataFrame
        return pd.DataFrame(lst, columns = ['Title', 'Author', 'URL', 'Excerpt'])
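
A note on fixed pauses in async code: inside an `async` function, `time.sleep` blocks the whole event loop, whereas `await asyncio.sleep` suspends only the current coroutine. The sketch below illustrates the difference with helper coroutines (`blocking_pause`, `cooperative_pause`, `timed`, `compare`) that are introduced here for the example and are not part of the scraper; the measured times are approximate.

```python
import asyncio
import time

async def blocking_pause(seconds):
    # time.sleep halts the event loop; no other task can run meanwhile
    time.sleep(seconds)

async def cooperative_pause(seconds):
    # asyncio.sleep suspends only this coroutine
    await asyncio.sleep(seconds)

async def timed(pause, seconds=0.2, n=3):
    """Run n pauses concurrently and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(pause(seconds) for _ in range(n)))
    return time.perf_counter() - start

async def compare():
    # the blocking pauses run one after another, the cooperative ones overlap
    return await timed(blocking_pause), await timed(cooperative_pause)

blocking_time, cooperative_time = asyncio.run(compare())
print(f"blocking: {blocking_time:.2f} s, cooperative: {cooperative_time:.2f} s")
```

In a single-task script such as this one both forms produce the same result, but the cooperative form is the more idiomatic choice in async code.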

In [3]:
#@title Download the data

# call the function and store the data (the first 5 articles are shown at the end)
df = asyncio.run(TechTalk())

# clean up the excerpts: strip surrounding whitespace, including '\n'
df.Excerpt = df.Excerpt.str.strip()

# report how many articles were found
print(f"{len(df)} articles found.")

df.head()

20 pages loaded. Script is running for 25.2 secs.
40 pages loaded. Script is running for 71.6 secs.
60 pages loaded. Script is running for 119.7 secs.
623 articles found.


|   | Title | Author | URL | Excerpt |
|---|-------|--------|-----|---------|
| 0 | What we learned from a decade of Siri, Alexa a... | Contributor | https://bdtechtalks.com/2022/12/21/voice-inter... | What are today’s most popular voice assistants... |
| 1 | The common traits of successful MLOps | Ben Dickson | https://bdtechtalks.com/2022/12/12/successful-... | A study by UC Berkeley sheds light on the best... |
| 2 | Artificial general intelligence (AGI) and nati... | Contributor | https://bdtechtalks.com/2022/12/09/agi-nationa... | When thinking machines take over that dominant... |
| 3 | What to (not) expect from OpenAI’s ChatGPT | Ben Dickson | https://bdtechtalks.com/2022/12/05/openai-chat... | OpenAI's ChatGPT, with all its successes and f... |
| 4 | The real “Bitter Lesson” of artificial intelli... | Rich Heimann | https://bdtechtalks.com/2022/12/01/rich-sutton... | In a popular blog post titled “The Bitter Less... |
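
Before saving, a couple of sanity checks can be useful. The sketch below is illustrative only: `demo` is a small invented DataFrame with the same columns as the scraped table (its rows are not real data). It strips whitespace from the excerpts, as done above, and drops rows that repeat a URL, on the assumption that repeated loading could in principle yield duplicates.

```python
import pandas as pd

# Invented rows with the same columns as the scraped table
demo = pd.DataFrame(
    [
        ["Post A", "Author A", "https://example.com/a", "\nSummary A\n"],
        ["Post B", "Author B", "https://example.com/b", "  Summary B "],
        ["Post A", "Author A", "https://example.com/a", "\nSummary A\n"],
    ],
    columns=["Title", "Author", "URL", "Excerpt"],
)

# remove leading/trailing whitespace, including '\n'
demo["Excerpt"] = demo["Excerpt"].str.strip()

# keep the first row for each URL and renumber the index
deduped = demo.drop_duplicates(subset="URL").reset_index(drop=True)

print(f"{len(demo) - len(deduped)} duplicate rows removed.")
```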


# Saving the data

To download the data to your computer, click the "Files" icon in the Colab sidebar and download the file you need. If you do not, the files will be deleted when you close Google Colab.

![image info](https://i.stack.imgur.com/mYWnb.png)

In [4]:
# save data
df.to_csv('techtalk.csv', index=False)
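
To confirm that the file was written as expected, it can be read back. The sketch below is a minimal round-trip check, assuming a small invented DataFrame `sample` in place of the scraped `df` and a temporary directory in place of the Colab working directory. It also shows what `index=False` is for: without it the row index is written as an extra, unnamed column.

```python
import os
import tempfile

import pandas as pd

# Invented row; in the notebook this would be the scraped `df`
sample = pd.DataFrame(
    [["Post A", "Author A", "https://example.com/a", "Summary A"]],
    columns=["Title", "Author", "URL", "Excerpt"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "techtalk.csv")

    # index=False keeps the row index out of the file
    sample.to_csv(path, index=False)
    restored = pd.read_csv(path)

    # with the index written, reading back adds an "Unnamed: 0" column
    sample.to_csv(path)
    with_index = pd.read_csv(path)

print(list(restored.columns))
print(list(with_index.columns))
```

In Colab, `from google.colab import files` followed by `files.download('techtalk.csv')` is another way to trigger a browser download of the saved file.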