## Web scraping:
* Web scraping means automatically collecting data from websites with code, so that the data can be used for a specific task.
### Purpose:
* When you don't have data to begin with, you can often extract it from websites
* It is essential for real-time (live, current) analysis
### Sources:
* Open source websites
* Government websites
* Media websites
### Workflow:
* Send a request: the code asks the server for the webpage that holds the data
* Receive the HTML page in the response
* Parse the HTML that comes back
* Use a library to locate the parts of the page to extract: BeautifulSoup, Selenium, Scrapy, AutoScraper
* Extract the data and save/structure it as a DataFrame
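The parse, extract, and structure steps can be sketched offline before touching a real site. This is a minimal sketch, not a definitive implementation: `SAMPLE_HTML`, the `item`/`name`/`price` class names, and the `parse_items` helper are all made up for illustration and stand in for whatever a real request would return.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for the page a request would return.
SAMPLE_HTML = """
<html><body>
  <div class="item"><span class="name">Alpha</span><span class="price">10</span></div>
  <div class="item"><span class="name">Beta</span><span class="price">25</span></div>
</body></html>
"""

def parse_items(html):
    """Parse the page and return one dict per item block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="item"):
        rows.append({
            "Name": item.find("span", class_="name").get_text(strip=True),
            "Price": int(item.find("span", class_="price").get_text(strip=True)),
        })
    return rows

items = parse_items(SAMPLE_HTML)   # extract
items_df = pd.DataFrame(items)     # structure as a DataFrame
print(items_df)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selectors on a saved page before sending any requests.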

In [5]:
# !pip install selenium
# !pip install beautifulsoup4

In [3]:
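from bs4 import BeautifulSoup   # HTML parsing
import pandas as pd             # tabular storage
import requests                 # HTTP requests for static pages
from selenium import webdriver  # browser automation for JavaScript-rendered pages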

In [5]:
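url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)  # fail fast instead of hanging on a slow server
response.raise_for_status()  # stop here on a 4xx/5xx response
soup = BeautifulSoup(response.text, "html.parser")  # parse the raw HTML into a searchable tree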

In [7]:
data = []
quotes = soup.find_all("div",class_="quote")

In [9]:
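for quote in quotes:
    # each quote block keeps its text in <span class="text"> and its author in <small class="author">
    text = quote.find("span", class_="text").get_text(strip=True)
    author = quote.find("small", class_="author").get_text(strip=True)
    data.append({"Quote": text, "Author": author})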
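The loop above reads only the page that was requested. One way to collect further pages is to follow the "Next" link until none is left. The sketch below assumes the link sits in an `<li class="next">` element, which should be checked against the live page. `parse_quotes`, `scrape_all`, `fake_fetch`, and the `example.com` pages are hypothetical names used for illustration. The fetch step is passed in as a function so the logic can be tried without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_quotes(html):
    """Return (rows, next_href) for one page of quotes."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for quote in soup.find_all("div", class_="quote"):
        rows.append({
            "Quote": quote.find("span", class_="text").get_text(strip=True),
            "Author": quote.find("small", class_="author").get_text(strip=True),
        })
    next_link = soup.select_one("li.next a")  # assumed pagination markup
    next_href = next_link.get("href") if next_link else None
    return rows, next_href

def scrape_all(fetch, start_url, max_pages=20):
    """Follow 'next' links from start_url; fetch(url) must return HTML text."""
    rows, url, pages = [], start_url, 0
    while url and pages < max_pages:
        page_rows, next_href = parse_quotes(fetch(url))
        rows.extend(page_rows)
        url = urljoin(url, next_href) if next_href else None  # resolve relative links
        pages += 1
    return rows

# Two made-up pages that imitate the assumed structure.
FAKE_PAGES = {
    "https://example.com/": (
        '<div class="quote"><span class="text">A</span>'
        '<small class="author">X</small></div>'
        '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "https://example.com/page/2/": (
        '<div class="quote"><span class="text">B</span>'
        '<small class="author">Y</small></div>'
    ),
}

def fake_fetch(url):
    return FAKE_PAGES[url]

all_rows = scrape_all(fake_fetch, "https://example.com/")
print(all_rows)
```

For a live run, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. The `max_pages` cap is a safety limit so a broken "Next" link cannot cause an endless loop.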

In [11]:
data

[{'Quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  'Author': 'Albert Einstein'},
 {'Quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  'Author': 'J.K. Rowling'},
 {'Quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  'Author': 'Albert Einstein'},
 {'Quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  'Author': 'Jane Austen'},
 {'Quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  'Author': 'Marilyn Monroe'},
 {'Quote': '“Try not to become a man of success. Rather become a man of value.”',
  'Author': 'Albert Einstein'},
 {'Quote': '“It is better to be hated for what you are than to be loved for what you are not.”',
  'Author': 'And

In [13]:
df = pd.DataFrame(data)
df.head()

| | Quote | Author |
|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |


In [15]:
df.to_csv("quotes.csv")
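By default `to_csv` also writes the row index as an unnamed first column, and reading the file back with `read_csv` then produces an extra `Unnamed: 0` column. The sketch below shows the difference on a small made-up frame, using in-memory buffers in place of files. The `sample` frame and the buffer names are for illustration only.

```python
import io
import pandas as pd

sample = pd.DataFrame({"Quote": ["A", "B"], "Author": ["X", "Y"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Quote', 'Author']
print(list(reloaded_clean.columns))       # ['Quote', 'Author']
```

If the index carries no meaning, `df.to_csv("quotes.csv", index=False)` keeps the saved file limited to the scraped columns.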

In [17]:
driver = webdriver.Chrome()  # or webdriver.Edge() / webdriver.Firefox()
driver.get("https://www.bbc.com/news")

In [18]:
soup = BeautifulSoup(driver.page_source, "html.parser")
headline = soup.find("h2", class_="sc-fa814188-3 iCfgww")  # class name copied from the live page; replace it with the current one
print(headline.get_text(strip=True))

Cuba defiant as it braces for post-Maduro era
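Class names such as `sc-fa814188-3 iCfgww` look auto-generated and are likely to change when the site is rebuilt. When nothing matches, `find` returns `None` and the `.get_text` call fails with an `AttributeError`. A minimal sketch of a more forgiving lookup follows. The `first_headline` helper, `SAMPLE_PAGE`, and the `card-title` class are hypothetical and not taken from the BBC page.

```python
from bs4 import BeautifulSoup

def first_headline(html, css_class=None):
    """Return the text of the first matching <h2>, or None if the page has none.

    Tries the given class first, then falls back to any <h2>.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h2", class_=css_class) if css_class else None
    if tag is None:
        tag = soup.find("h2")  # fallback when the class is missing or renamed
    return tag.get_text(strip=True) if tag else None

# Made-up markup for illustration.
SAMPLE_PAGE = '<h2>Top story</h2><h2 class="card-title">Second story</h2>'

print(first_headline(SAMPLE_PAGE, "card-title"))    # Second story
print(first_headline(SAMPLE_PAGE, "renamed-class")) # Top story
print(first_headline("<p>No headings here</p>"))    # None
```

With Selenium, the same helper could be called as `first_headline(driver.page_source, "...")`, passing whatever class name the page currently uses.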


In [21]:
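soup = BeautifulSoup(driver.page_source, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]  # text of every <h2> on the page
driver.quit()  # close the browser once the page source has been parsed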
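Because `find_all("h2")` collects every `<h2>` on the page, the list may contain empty strings or the same headline more than once. This depends on the page and has not been verified for this run. A minimal clean-up sketch, using a hypothetical `clean_titles` helper and made-up sample titles:

```python
def clean_titles(raw_titles):
    """Drop empty strings and repeated headlines, keeping first-seen order."""
    stripped = (title.strip() for title in raw_titles)
    # dict.fromkeys keeps the first occurrence of each key in insertion order
    return list(dict.fromkeys(title for title in stripped if title))

sample_titles = ["Story A", "", "Story B", "Story A", "  Story C  "]
print(clean_titles(sample_titles))  # ['Story A', 'Story B', 'Story C']
```

Applying it as `titles = clean_titles(titles)` before building the DataFrame would leave one row per distinct headline.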

In [23]:
titles

['Venezuela swears in interim president after defiant Maduro pleads not guilty in US court',
 'Cuba defiant as it braces for post-Maduro era',
 'Skiers create heart-shaped tribute for Switzerland fire victims',
 'Car giant Hyundai to use human-like robots in factories',
 "Which countries could be in Trump's sights after Venezuela?",
 "Selfies and smiles: South Korea seeks 'new phase' in ties with China",
 "Netflix pulls Chinese drama after Vietnam's outcry over disputed map",
 "The 'magical' blue flower changing farmers' fortunes in India",
 "Nvidia unveils 'reasoning' AI technology for self-driving cars",
 'Ten found guilty of cyber-bullying Brigitte Macron',
 "Timothée's shoutout for Kylie Jenner and other moments from Critics' Choice Awards",
 'UK police force to be questioned over Israeli football fan ban',
 'US seizes Maduro',
 "'I'm a prisoner of war' - In the room for Maduro's dramatic court hearing",
 'US sharply criticised by foes and friends over Maduro seizure',
 'Misleading

In [25]:
df2 = pd.DataFrame(titles, columns=["Headlines"])
df2.head()

| | Headlines |
|---|---|
| 0 | Venezuela swears in interim president after de... |
| 1 | Cuba defiant as it braces for post-Maduro era |
| 2 | Skiers create heart-shaped tribute for Switzer... |
| 3 | Car giant Hyundai to use human-like robots in ... |
| 4 | Which countries could be in Trump's sights aft... |


In [27]:
df2.shape

(59, 1)