# How to use Python to tell if a webpage contains a word (string) or not

How can you get a `True`/`False` result telling you if a page contains a particular string?

Let's import some libraries to scrape a page first...

In [2]:
#import our libraries
import requests
from bs4 import BeautifulSoup

Then scrape a page we can deal with...

In [3]:
#store the url to run our scraper on
testurl = "https://www.thebureauinvestigates.com/"
#fetch the page at that url
page = requests.get(testurl)
#test that it's worked - Response 200 means it's not broken
page

<Response [200]>

In [4]:
#check the content
page.content

b'<!DOCTYPE html>\n<html lang="en">\n\n\t<head>\n    <meta charset="UTF-8">\n\n    \n    <title>The Bureau of Investigative Journalism (en-GB)</title>\n    <meta name="description" content="Home of the Bureau of Investigative Journalism, an independent, not-for-profit media organisation that holds power to account." />\n    \n    <link rel="home" href="https://www.thebureauinvestigates.com/" />\n\n    \n    \n        <meta name="twitter:card" content="summary_large_image">\n    <meta name="twitter:site" content="@TBIJ">\n    <meta name="twitter:creator" content="@TBIJ">\n    <meta name="twitter:title" content="The Bureau of Investigative Journalism (en-GB)">\n    <meta name="twitter:description" content="Home of the Bureau of Investigative Journalism, an independent, not-for-profit media organisation that holds power to account.">\n    <meta name="twitter:image:src" content="https://d3cocnzdt9u6c9.cloudfront.net/eyJidWNrZXQiOiJhc3NldHMyLnRoZWJ1cmVhdWludmVzdGlnYXRlcy5jb20iLCJrZXkiOiJ1cG

## Convert your bytes object to a string

If we try to test if the string 'Bureau' is in the page, we will get an error with this code:

In [None]:
#This will give you an error because it's not a string object - it's a 'bytes'
"Bureau" in page.content

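TypeError: a bytes-like object is required, not 'str'

[Google the error!](https://stackoverflow.com/questions/33054527/typeerror-a-bytes-like-object-is-required-not-str-when-writing-to-a-file-in): that question is about a file opened in binary mode, but the cause is the same here. `page.content` is a `bytes` object (that's the `b` at the start when it's printed out), and Python won't search bytes for an ordinary string.

One quick fix is to convert it to a string using the `str()` function. Note that `str()` gives you the printed form of the bytes, including the `b'` at the start and escape codes such as `\n`, which is fine for checking plain English words like this: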

In [None]:
#Convert page.content to a string by using the str() function first
"Bureau" in str(page.content)

True
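
Two alternatives avoid the `str()` detour. The sketch below uses a short hand-made bytes value (`sample`, an assumption standing in for `page.content`) to show searching with a bytes literal, and decoding to text first. It assumes the page is encoded as UTF-8, as the `<meta charset="UTF-8">` in the output above suggests.

```python
# A made-up stand-in for page.content (hypothetical sample, not the real page)
sample = b'<title>The Bureau of Investigative Journalism</title>'

# Option 1: search bytes with bytes by putting b in front of the search string
bytes_match = b"Bureau" in sample

# Option 2: decode the bytes into a real string, then search with a normal string
# (assumes the page is encoded as UTF-8)
decoded = sample.decode("utf-8")
text_match = "Bureau" in decoded

print(bytes_match, text_match)
```

With a live response, `page.text` should also give you an already-decoded string, using the encoding that requests detects.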

## Searching a 'soup' object

If we try the same approach on a BeautifulSoup object, it won't work: `in` only checks the soup's top-level items, not the text inside them.

In [5]:
#turn into soup, naming the parser to avoid a warning
soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
"Bureau" in soup

False

Instead you need to drill down into particular parts.

The simplest way to do this is to grab all matches for the `<html>` tag. This tag contains the entire page. 

You need to remember that `.select()` will *always return a list* even if, as in this case, there's only one match, so you still need to select the *first match* when testing if your string is in it. 

In [7]:
#grab all the html tags (there's only one)
body = soup.select('html')
#check if the text contents of the first (and only) html tag contains your string.
"Bureau" in body[0].get_text()

True
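
As a shortcut, `.get_text()` can also be called on the soup object itself, which skips the `.select()` step. A minimal sketch, using a tiny made-up HTML string (`sample_html`, hypothetical) instead of a live page:

```python
from bs4 import BeautifulSoup

# A tiny made-up page (hypothetical sample, not the real site)
sample_html = "<html><head><title>The Bureau</title></head><body><p>Stories</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# 'in' on the soup itself only checks its direct children, so this is False
in_soup = "Bureau" in sample_soup

# get_text() pulls out all the text in the page as one string, so this is True
in_text = "Bureau" in sample_soup.get_text()

print(in_soup, in_text)
```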

In [15]:
#store in a variable
bodytext = body[0].get_text()
#show the first 100 characters
bodytext[0:100]

'\n\n\nThe Bureau of Investigative Journalism (en-GB)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n            {"@context":"http:\\/\\/'

In [16]:
#split on the word 'Bureau'
bureausplit = bodytext.split("Bureau")
#show how many items are created
len(bureausplit)

22
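
`.split()` always creates one more item than there are matches, so 22 items means 'Bureau' appears 21 times. A minimal sketch with a made-up string (`sample_text`, hypothetical) that checks this against `.count()`:

```python
# A made-up string standing in for bodytext (hypothetical sample)
sample_text = "The Bureau has a newsletter. Subscribe to the Bureau newsletter."

# Splitting on the word gives the pieces before, between and after each match
pieces = sample_text.split("Bureau")

# .count() counts the matches directly
matches = sample_text.count("Bureau")

# There is always one more piece than there are matches
print(len(pieces), matches)
```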

We can print the first item to show you what comes before the first match of 'Bureau'.

In [22]:
bureausplit[0]

'\n\n\nThe '

And the second item, which comes after it. Notice that the word 'Bureau' itself is removed when you use `.split()`.

In [23]:
bureausplit[1]

' of Investigative Journalism (en-GB)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n            {"@context":"http:\\/\\/schema.org","@type":"Organization","legalName":"The '

So to see the context around the word you need to print the first (or last) few characters that came after (or before) it.

We've picked 30 characters below, but only after trial and error: try different numbers to see what you get and whether you want more or less.

In [27]:
#for each item we get when we split
for i in bureausplit:
  #print the first 30 characters
  print(i[0:30])




The 
 of Investigative Journalism (
 of Investigative Journalism (
 of Investigative Journalism (
 Website


The 
 on Facebook


The 
 on Twitter












Storie
 Global

Smoke Screen
Coronavi
 Local

Stories
About the Proj

The Trust for TBIJ



Got a S

How to Talk to a Journalist



            
Donate now















The Housing Crisis

B
 has found hundreds of mortgag
 co-publishes its stories with
 announces Rozina Breen as new

Investigative journalism is v
 newsletter
Subscribe to the 
 newsletter, and hear when our

How to Talk to a Journalist


 of Investigative Journalism
P
 Local
                       


We can add "Bureau" and exclude the first item (because it's *before* "Bureau") like this:

In [29]:
#for each item we get when we split
for i in bureausplit[1:]:
  #print the first 30 characters
  print("Bureau"+i[0:30])

Bureau of Investigative Journalism (
Bureau of Investigative Journalism (
Bureau of Investigative Journalism (
Bureau Website


The 
Bureau on Facebook


The 
Bureau on Twitter












Storie
Bureau Global

Smoke Screen
Coronavi
Bureau Local

Stories
About the Proj
Bureau
The Trust for TBIJ



Got a S
Bureau
How to Talk to a Journalist


Bureau
            
Donate now






Bureau








The Housing Crisis

B
Bureau has found hundreds of mortgag
Bureau co-publishes its stories with
Bureau announces Rozina Breen as new
Bureau
Investigative journalism is v
Bureau newsletter
Subscribe to the 
Bureau newsletter, and hear when our
Bureau
How to Talk to a Journalist


Bureau of Investigative Journalism
P
Bureau Local
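
Splitting only shows what comes after each match. To see text on both sides, one option is to find the position of each match and slice around it. The sketch below is an alternative, not part of the original notebook: `show_context` is a hypothetical helper and `sample_text` is made up.

```python
def show_context(text, word, width=30):
    """Return a list of snippets showing `width` characters either side of each match."""
    snippets = []
    # find() returns the position of the next match, or -1 when there are no more
    position = text.find(word)
    while position != -1:
        # max() stops the start of the slice going below zero
        start = max(position - width, 0)
        end = position + len(word) + width
        snippets.append(text[start:end])
        # look for the next match after this one
        position = text.find(word, position + len(word))
    return snippets

# A made-up string standing in for bodytext (hypothetical sample)
sample_text = "Subscribe to the Bureau newsletter, and hear when our stories are published."

for snippet in show_context(sample_text, "Bureau", width=10):
    print(snippet)
```

As with the 30 characters above, the `width` value is something to adjust by trial and error.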
                       


## Trying this on a Russian page

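The `str(page.content)` approach above runs into trouble on a page with Cyrillic text: the Russian characters show up as escape codes such as `\xd0\xa3` rather than readable letters.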

In [None]:
#fetch a page
page = requests.get("https://utro.ru/news/ukraine/2022/03/07/1507101.shtml")
#show the content
page.content

b'\n<!DOCTYPE html>\n<html class="no-js" lang="">\n\n<head>\n\n\n\n<title>\xd0\xa3\xd0\xba\xd1\x80\xd0\xb0\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\xbf\xd0\xbe\xd0\xbe\xd0\xb1\xd0\xb5\xd1\x89\xd0\xb0\xd0\xbb\xd0\xb0 \xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd0\xb8 \xd1\x81\xd1\x8e\xd1\x80\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb7 - \xd0\xb3\xd0\xbb\xd0\xb0\xd0\xb2\xd0\xb0 \xd0\x9c\xd0\xb8\xd0\xbd\xd0\xbe\xd0\xb1\xd0\xbe\xd1\x80\xd0\xbe\xd0\xbd\xd1\x8b \xd0\xa0\xd0\xb5\xd0\xb7\xd0\xbd\xd0\xb8\xd0\xba \xd1\x81\xd0\xbe\xd0\xbe\xd0\xb1\xd1\x89\xd0\xb8\xd0\xbb \xd0\xbe \xd0\xbf\xd0\xbe\xd1\x81\xd1\x82\xd0\xb0\xd0\xb2\xd0\xba\xd0\xb0\xd1\x85 \xd0\xbe\xd1\x80\xd1\x83\xd0\xb6\xd0\xb8\xd1\x8f, \xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbe\xd1\x81\xd1\x82\xd0\xb8 \xd0\xb4\xd0\xbd\xd1\x8f :: \xd0\x9d\xd0\xbe\xd0\xb2\xd0\xbe\xd1\x81\xd1\x82\xd0\xb8 \xd0\xa3\xd0\xba\xd1\x80\xd0\xb0\xd0\xb8\xd0\xbd\xd1\x8b</title>\n\n\n<meta http-equiv="Content-Type" content="text/xml; charset=UTF-8" />\n<meta http-equiv="x-ua-compatible" content="ie

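In [8]:
#Check if a string is in it that we know should be
#Careful: this tests soup.content, not page.content
"news" in str(soup.content)

False

That `False` is misleading. The cell tests `soup.content` rather than `page.content`, and BeautifulSoup reads `soup.content` as "find a `<content>` tag". There isn't one, so it returns `None` and we end up searching the string `'None'`. Even with `page.content`, a Russian search word would not match the escape codes shown above. So let's try the soup approach.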

In [None]:
#convert to a soup object
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)


<!DOCTYPE html>

<html class="no-js" lang="">
<head>
<title>Украина пообещала России сюрприз - глава Минобороны Резник сообщил о поставках оружия, новости дня :: Новости Украины</title>
<meta content="text/xml; charset=utf-8" http-equiv="Content-Type"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="index,follow" name="Robots">
<meta content="последние новости, лента новостей, новости сегодня, новости России и мира, утро ру" name="keywords"/>
<meta content="В Незалежной похвалились поставками оружия" name="description"/>
<meta content="article" property="og:type"/>
<meta content="Украина пообещала России сюрприз - глава Минобороны Резник сообщил о поставках оружия, новости дня" property="og:title"/>
<meta content="В Незалежной похвалились поставками оружия" property="og:description"/>
<meta content="https://utro.ru/news/ukraine/2022/03/07/1507101.shtml" property="og:url"/>
<meta content="https://pics.utro.ru/utro_photos/2022/03/07/1507101big.jpg" property="og:im

In [9]:
#grab the html tag
html = soup.select('html')
#check if our string is in it
"news" in html[0].get_text()

True
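
The `\xd0\xa3`-style sequences in `page.content` above are the UTF-8 bytes for Cyrillic letters. `str()` keeps them as escape codes, so a Russian search word will not match until the bytes are decoded. A minimal sketch, using a made-up bytes value (`sample`, hypothetical) and assuming UTF-8 as declared in the page's `charset`:

```python
# A made-up stand-in for page.content: the word "Украина" encoded as UTF-8
sample = "<title>Украина</title>".encode("utf-8")

# str() gives the printed form of the bytes, with escape codes instead of letters
as_printed = str(sample)
print("Украина" in as_printed)

# decode() turns the bytes back into readable text, so the word can be found
as_text = sample.decode("utf-8")
print("Украина" in as_text)
```

This is why the soup approach works here: BeautifulSoup decodes the bytes for you before you call `.get_text()`.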

### Try to drill down to tags you expect the text to be in

We can also try to be more specific about the tags we're looking in. We would expect 'yandex.ru' to be inside links, for example.

In [None]:
#grab all links
links = soup.select('a')
#see how many we get
len(links)

175

In [None]:
#loop through all the a tags we grabbed
for i in links:
  #print the link (not the text)
  print(i['href'])
  #check if the string "yandex" is in that link and print the result (True/False)
  print("yandex" in i['href'])

/
False
/online/ukraine.shtml
False
/ukraine.shtml
False
/politics.shtml
False
/online/pensii.shtml
False
/horoscope.shtml
False
/showbiz.shtml
False
/pr.shtml
False
/coronavirus.shtml
False
/life.shtml
False
/recepty.shtml
False
/online/primety.shtml
False
/online/crimean.shtml
False
/economics.shtml
False
/accidents.shtml
False
/online/ogorod.shtml
False
https://zen.yandex.ru/ytro.ru
True
https://news.google.com/publications/CAAiEOF7XMpY3eSH2Jkm8AEzhb0qFAgKIhDhe1zKWN3kh9iZJvABM4W9?hl=ru&gl=RU&ceid=RU%3Aru
False
/news/ukraine/2022/03/07/1507105.shtml
False
https://ria.ru/20220307/ukraina-1777009831.html
False
/author/%D0%9A%D1%80%D0%B8%D1%81%D1%82%D0%B8%D0%BD%D0%B0%20%D0%A7%D0%95%D0%A0%D0%9D%D0%98%D0%9A%D0%9E%D0%92%D0%90/
False
/news/ukraine/2022/03/07/1505982.shtml
False
/news/ukraine/2022/03/05/1507038.shtml
False
/persons/putin.shtml
False
/news/politics/2022/03/05/1506977.shtml
False
/news/life/2022/03/05/1506983.shtml
False
/news/politics/2022/03/05/1506993.shtml
False
/news/life

### Use an `if` test to only show results that match 

We might not want to watch 175 results go past, or scan through them all to find what we're after.

Instead, we can add an `if` to our loop so that it only prints a link *if* our string is in it.

In [None]:
#loop through all the a tags we grabbed
for i in links:
  #check if the string "yandex" is in that link and store the result (True/False)
  yandextf = "yandex" in i['href']
  #only if it is True
  if yandextf:
    #print the matched a (tag and contents)
    print(i)

<a href="https://zen.yandex.ru/ytro.ru" target="_blank"><i class="icon-yandex"></i></a>
<a href="https://zen.yandex.ru/ytro.ru"><i class="icon-zen"></i>Яндекс.Дзен</a>
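
Not every `<a>` tag has an `href`, and `i['href']` raises a `KeyError` when it is missing. A slightly more defensive sketch uses `.get()`, which returns a default instead. The HTML here is a made-up sample and `matching_links` is a hypothetical name:

```python
from bs4 import BeautifulSoup

# A made-up page with three links, one of which has no href (hypothetical sample)
sample_html = """
<a href="https://zen.yandex.ru/ytro.ru">Zen</a>
<a href="/politics.shtml">Politics</a>
<a name="top">No href here</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the href of each link that contains our string
# .get('href', '') gives an empty string when a link has no href
matching_links = [
    a.get("href", "")
    for a in sample_soup.select("a")
    if "yandex" in a.get("href", "")
]

print(matching_links)
```

Storing the matches in a list, rather than printing them, also makes it easy to count them with `len()` or reuse them later.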
