<a href="https://colab.research.google.com/github/rishavb123/Web-Scraping-Demo/blob/master/Web_Scraping_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping and Python Basics

## Python Variables and Data Types

### Numbers

In [0]:
x = 0
y = 1
z = 2
print(x, y, z) # printing all the variables
x += 2 # increase x by 2
print(x) 
print(x / 2 + y * z) # order of operations
print(type(x), type(x/2)) # int vs float


0 1 2
2
3.0
<class 'int'> <class 'float'>
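
The `int` versus `float` distinction matters most when dividing. Here is a minimal sketch of the three division-related operators; the names `a` and `b` are illustrative and do not appear elsewhere in this notebook.

```python
a = 7
b = 2

true_div = a / b    # true division always returns a float
floor_div = a // b  # floor division of two ints returns an int
remainder = a % b   # modulo gives the remainder of the division

print(true_div, floor_div, remainder)   # 3.5 3 1
print(type(true_div), type(floor_div))  # <class 'float'> <class 'int'>
```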


### Strings

In [0]:
s = "Hello World"
print(s)
s += '!' # adding to a string, can use both double and single quotes
s += str(x) # int to string
print(s)
s = "10" 
print(int(s), float(s)) # string to int and float

Hello World
Hello World!2
10 10.0
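
Converting a string to a number fails with a `ValueError` when the text is not a valid literal, which is common with scraped data. The helper below, `to_int`, is a hypothetical name used only for this sketch of one way to handle that case.

```python
def to_int(text, default=None):
    """Return int(text), or `default` when text is not a valid integer literal."""
    try:
        return int(text)
    except ValueError:
        return default


print(to_int("10"))      # 10
print(to_int("10.5"))    # None, because int() does not accept a decimal point
print(to_int("ten", 0))  # 0, the fallback value we passed in
```

To turn a string such as `"10.5"` into a whole number, convert it to a float first: `int(float("10.5"))` gives `10`.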


### Lists

In [0]:
l = [1, 2, 3, 4] # create a python list
print(l)

# Indexing
print(l[0])
print(l[1:])
print(l[:-2])
print(l[1:2])

[1, 2, 3, 4]
1
[2, 3, 4]
[1, 2]
[2]
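
Slices follow the pattern `list[start:stop:step]`, where `stop` is excluded and negative numbers count from the end. A short sketch with an illustrative list called `letters` (not part of the original demo):

```python
letters = ["a", "b", "c", "d", "e"]

print(letters[-1])    # 'e', the last element
print(letters[1:4])   # ['b', 'c', 'd'], the stop index 4 is excluded
print(letters[::2])   # ['a', 'c', 'e'], every second element
print(letters[::-1])  # ['e', 'd', 'c', 'b', 'a'], a reversed copy
print(letters[10:])   # [], slicing past the end does not raise an error
```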


### For Loops
For loops iterate over iterable objects such as lists.

In [0]:
for i in l:
  print(i)
print() 
 
for i in range(10):
  print(i)

1
2
3
4

0
1
2
3
4
5
6
7
8
9
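
Two common variations on these loops are `enumerate`, which yields the index alongside each element, and `range` with explicit start, stop, and step arguments. The list `colors` below is made up for illustration.

```python
colors = ["red", "green", "blue"]

# enumerate yields (index, value) pairs
for index, color in enumerate(colors):
    print(index, color)

# range(start, stop, step): stop is excluded, step sets the increment
evens = list(range(0, 10, 2))
print(evens)  # [0, 2, 4, 6, 8]
```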


### Functions
Let's make a function to print each element in a list

In [0]:
def print_list(lis):
  for obj in lis:
    print(obj)
  print()
  
print_list(l)

1
2
3
4
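
Functions can also hand a value back to the caller with `return` instead of printing it. The sketch below uses a hypothetical function, `count_matching`, and a made-up list of titles.

```python
def count_matching(items, word):
    """Return how many strings in `items` contain `word`."""
    count = 0
    for item in items:
        if word in item:  # substring check
            count += 1
    return count


titles = ["apple pie", "banana bread", "apple tart"]
print(count_matching(titles, "apple"))  # 2
```

Returning the value lets the caller store it, compare it, or pass it to another function.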



## Package Manager

pip is a Python package manager that lets you install libraries that other people have written. For this demo we need a library called Beautiful Soup to parse HTML for us, so we install it with:

In [0]:
!pip install bs4
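
To confirm that the install worked without importing the package, one option is to ask Python whether it can locate the module. This is a sketch using the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module called `module_name` can be found."""
    return importlib.util.find_spec(module_name) is not None


# True once Beautiful Soup is installed in the current environment
print(is_installed("bs4"))
```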



## Basic Setup

Import all of the libraries we will use.

In [0]:
from bs4 import BeautifulSoup
import urllib.request

Set up some constants: the [url](https://news.google.com/?hl=en-US&gl=US&ceid=US:en) we will scrape, the keyword we are searching for, and the maximum number of results we want.

In [0]:
url = "https://news.google.com/?hl=en-US&gl=US&ceid=US:en"
max_results = 3
keyword = "Trump"
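
The part of the URL after `?` is a query string of `key=value` pairs. As a sketch, the same URL can be assembled from a dictionary with `urllib.parse.urlencode`, which makes it easier to change one parameter; the names `base`, `params`, and `built_url` are illustrative.

```python
from urllib.parse import urlencode

base = "https://news.google.com/"
params = {"hl": "en-US", "gl": "US", "ceid": "US:en"}

# safe=":" keeps the colon in "US:en" from being percent-encoded
built_url = base + "?" + urlencode(params, safe=":")
print(built_url)  # https://news.google.com/?hl=en-US&gl=US&ceid=US:en
```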

## Web Scraping
We will scrape all of the headlines containing "Trump" from Google News.

Request the HTML content and pass it to a BeautifulSoup object.

In [0]:
page = urllib.request.urlopen(url) # request the website's content
    
soup = BeautifulSoup(page, 'html.parser') # give it to our html parser
print(soup)

<!DOCTYPE doctype html>
<html dir="ltr" lang="en"><head><base href="https://news.google.com/"/><meta content="origin" name="referrer"/><link href="https://news.google.com/" rel="canonical"/><meta content="width=device-width,initial-scale=1,minimal-ui" name="viewport"/><meta content="app-id=459182288" name="apple-itunes-app"/><meta content="AcBy5YFny2HQgVUCR18tO5YUTf6MpVlcJqGTd-a9-SI" name="google-site-verification"/><meta content="yes" name="mobile-web-app-capable"/><meta content="yes" name="apple-mobile-web-app-capable"/><meta content="News" name="application-name"/><meta content="News" name="apple-mobile-web-app-title"/><meta content="black" name="apple-mobile-web-app-status-bar-style"/><meta content="white" name="theme-color"/><meta content="no" name="msapplication-tap-highlight"/><link href="https://lh3.googleusercontent.com/-DR60l-K8vnyi99NZovm9HlXyZwQ85GMDxiwJWzoasZYCUrPuUM_P_4Rb7ei03j-0nRs0c4F=w16" rel="shortcut icon" sizes="16x16"/><link href="https://lh3.googleusercontent.com/
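
A bare `urlopen(url)` call has no timeout and raises an exception if the request fails. The sketch below shows one possible, more defensive variant using only `urllib`. The helper names `build_request` and `fetch_html`, the User-Agent string, and the 10-second timeout are assumptions made for illustration, not part of the original demo.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (demo scraper)"):
    """Create a Request that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if a URLError occurs."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print("Request failed:", err)
        return None


# Building the request does not touch the network; only fetch_html does.
request = build_request("https://news.google.com/?hl=en-US&gl=US&ceid=US:en")
print(request.get_header("User-agent"))  # Mozilla/5.0 (demo scraper)
```

The text returned by `fetch_html(url)` can be passed to `BeautifulSoup(..., 'html.parser')` in the same way as `page`.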

In [0]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


Now let's find all of the headlines (h3 elements).

In [0]:
headlines = soup.find_all('h3') # finds all the headlines
print_list(headlines)

<h3 class="ipQwMb ekueJc RD0gLb"><a class="DY5T1d" href="./articles/CAIiEFQDwt55qmsRuvLHZqZJ5EgqGAgEKg8IACoHCAowjtSUCjC30XQw36e5AQ?hl=en-US&amp;gl=US&amp;ceid=US%3Aen">Republicans storm closed-door impeachment hearing as escalating Ukraine scandal threatens Trump</a></h3>
<h3 class="ipQwMb ekueJc RD0gLb"><a class="DY5T1d" href="./articles/CBMidGh0dHBzOi8vdGhlaGlsbC5jb20vaG9tZW5ld3Mvc2VuYXRlLzQ2NzE0Ny1uby0yLWdvcC1zZW5hdG9yLXBpY3R1cmUtY29taW5nLW91dC1vZi1kaXBsb21hdHMtdGVzdGltb255LW5vdC1hLWdvb2Qtb25l0gF4aHR0cHM6Ly90aGVoaWxsLmNvbS9ob21lbmV3cy9zZW5hdGUvNDY3MTQ3LW5vLTItZ29wLXNlbmF0b3ItcGljdHVyZS1jb21pbmctb3V0LW9mLWRpcGxvbWF0cy10ZXN0aW1vbnktbm90LWEtZ29vZC1vbmU_YW1w?hl=en-US&amp;gl=US&amp;ceid=US%3Aen">No. 2 GOP senator: 'Picture coming out of' diplomat's testimony 'not a good one' | TheHill</a></h3>
<h3 class="ipQwMb ekueJc RD0gLb"><a class="DY5T1d" href="./articles/CAIiEMK_qBAdfyVu0TCksmbilVQqGQgEKhAIACoHCAowocv1CjCSptoCMPrTpgU?hl=en-US&amp;gl=US&amp;ceid=US%3Aen">Judge orders State Departmen
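
Each headline is an `<h3>` element wrapping an `<a>` link whose `href` is relative (`./articles/...`). To illustrate what collecting those elements involves, here is a sketch that uses the standard library's `HTMLParser` as a stand-in. It is not how Beautiful Soup works internally, and the `HeadlineCollector` class and the `sample_html` snippet are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadlineCollector(HTMLParser):
    """Collect the text and first link of every <h3> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_h3 = False
        self.results = []  # list of (text, absolute_url) tuples
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
            self._text, self._href = [], None
        elif tag == "a" and self.in_h3 and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.in_h3:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_h3:
            # urljoin turns "./articles/..." into an absolute URL
            link = urljoin(self.base_url, self._href) if self._href else None
            self.results.append(("".join(self._text), link))
            self.in_h3 = False


# Made-up HTML that mimics the structure of the real page
sample_html = (
    '<h3 class="ipQwMb"><a class="DY5T1d" href="./articles/abc?hl=en-US&amp;gl=US">'
    "Example headline one</a></h3>"
    "<p>not a headline</p>"
    '<h3><a href="./articles/def">Example headline two</a></h3>'
)

collector = HeadlineCollector("https://news.google.com/")
collector.feed(sample_html)
collector.close()

for text, link in collector.results:
    print(text, "->", link)
```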

Now, filter out any headlines that do not contain our keyword ("Trump").

In [0]:
trump_headlines = []
for headline in headlines:
  if keyword in headline.get_text(): # checks if our keyword is in the text of the headline element
    trump_headlines.append(headline.get_text()) # adds it to our list
print_list(trump_headlines)

Republicans storm closed-door impeachment hearing as escalating Ukraine scandal threatens Trump
Trump: ‘We’re building a wall in Colorado’
Ukrainian President and advisers discussed pressure from Trump weeks before taking office
Graham urges Trump to listen to commanders on Syria, not 'policy-shop civilians'
MLB investigating umpire who threatened to buy rifle over Trump impeachment
Trump Accused Of Showing Middle Finger To Astronaut Who Corrected Him
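
The `in` check is case-sensitive, so a keyword typed as "trump" would match nothing. A possible variant is sketched below: it lowercases both sides and uses a list comprehension. The function name `filter_by_keyword` is hypothetical, and the sample strings are copied from the outputs above.

```python
def filter_by_keyword(texts, keyword):
    """Return the strings in `texts` that contain `keyword`, ignoring case."""
    needle = keyword.lower()
    return [text for text in texts if needle in text.lower()]


sample = [
    "Trump: ‘We’re building a wall in Colorado’",
    "No. 2 GOP senator: 'Picture coming out of' diplomat's testimony 'not a good one' | TheHill",
    "MLB investigating umpire who threatened to buy rifle over Trump impeachment",
]

matches = filter_by_keyword(sample, "trump")
print(len(matches))  # 2
```

This is still a plain substring match, so the keyword would also match inside longer words.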



Now let's show only a limited number of results (set by our `max_results` variable).

In [0]:
print_list(trump_headlines[:max_results]) # only shows the first few elements (according to max_results)

Republicans storm closed-door impeachment hearing as escalating Ukraine scandal threatens Trump
Trump: ‘We’re building a wall in Colorado’
Ukrainian President and advisers discussed pressure from Trump weeks before taking office
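
Slicing with `[:max_results]` is safe even when there are fewer matches than the limit. When the list of candidates is large, an alternative is to stop filtering as soon as enough matches are found. The sketch below does that with `itertools.islice`; the function name `first_matches` and the placeholder titles are made up for illustration.

```python
from itertools import islice


def first_matches(texts, keyword, limit):
    """Return at most `limit` strings from `texts` that contain `keyword`."""
    matches = (text for text in texts if keyword in text)  # lazy generator
    return list(islice(matches, limit))  # stops once `limit` items are taken


titles = ["Trump A", "Weather B", "Trump C", "Trump D", "Trump E"]

print(first_matches(titles, "Trump", 3))  # ['Trump A', 'Trump C', 'Trump D']
print(titles[:10])                        # a slice longer than the list returns the whole list
```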

