# Web Scraping for Slogans

To build a dataset, we need to scrape webpages that list slogans for different companies. For this we will use **Beautiful Soup** (the `bs4` package).

In [4]:
from bs4 import BeautifulSoup
import urllib.request

Targeting our desired webpage and saving it for scraping:

In [2]:
# The URL
slogan_page = 'https://www.thebalancecareers.com/best-advertising-taglines-ever-39208'

In [9]:
# Query the website and store the HTTP response in the variable `page`
page = urllib.request.urlopen(slogan_page)
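
Some servers reject requests that lack a `User-Agent` header, and `urlopen` without a timeout can block for a long time. A minimal sketch of a more defensive fetch follows. The helper names `build_request` and `fetch_html` and the `USER_AGENT` value are assumptions made for illustration and are not part of the original notebook.

```python
import urllib.request

# Hypothetical placeholder; substitute a string that identifies your own script.
USER_AGENT = "slogan-scraper-example/0.1"


def build_request(url):
    """Return a Request object carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes.

    Raises urllib.error.URLError (or its subclass HTTPError) on failure.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

With this in place, `page = fetch_html(slogan_page)` would return bytes, which `BeautifulSoup` accepts in the same way as the response object used above.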

Now we create the **Beautiful Soup** object for parsing:

In [12]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')

Let's see it working:

In [13]:
print(soup.prettify())

<!DOCTYPE html>
<html class="comp flexTemplate html mntl-layout-html taxlevel-3 theme-1" data-ab="99,41,99,99,78,99,55" data-mantle-resource-version="3.9.23" data-money-careers-resource-version="3.10.0" data-money-resource-version="3.10.0" data-resource-version="3.10.0" id="flexTemplate_1-0" lang="en">
 <!--
<globe-environment environment="prod" application="money-careers" dataCenter="us-east-1"/>
-->
 <head class="loc head">
  <script type="text/javascript">
  </script>
  <link href="//adservice.google.com" rel="preconnect"/>
  <link href="//securepubads.g.doubleclick.net" rel="preconnect"/>
  <link href="//stats.g.doubleclick.net" rel="preconnect"/>
  <link href="//js-sec.indexww.com" rel="preconnect"/>
  <link href="//bcp.crwdcntrl.net" rel="preconnect"/>
  <link href="//ad.crwdcntrl.net" rel="preconnect"/>
  <link href="//c.amazon-adsystem.com" rel="preconnect"/>
  <link href="//cdn.adsafeprotected.com" rel="preconnect"/>
  <link href="//as-sec.casalemedia.com" rel="preconnect"/>
 

Great! Now let's target the text we want and pull it into a list called **slogans**:

In [26]:
unparsed_slogans = soup.find_all('strong')

slogans = []

for slogan in unparsed_slogans:
    slogans.append(slogan.get_text())

slogans[:10]

['What Is a Tagline?',
 ' a memorable dramatic phrase ',
 "reinforce and strengthen the audience's memory ",
 'How is a Tagline Created?',
 'The TOP 100 Advertising Taglines Ever',
 'A DIAMOND IS FOREVER – DeBeers, 1948',
 'A LITTLE',
 'DAB’LL',
 'DO YA – Brylcreem,',
 '1950s']
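
To see what `find_all` and `get_text` do in isolation, here is a small sketch that runs them on a hand-written HTML fragment instead of the live page. The fragment and the names `sample_html`, `sample_soup` and `sample_slogans` are made up for illustration; the `strip=True` argument is an optional extra that trims the surrounding whitespace visible in some of the entries above.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; it is not the real page markup.
sample_html = """
<h2><strong>What Is a Tagline?</strong></h2>
<p>Plain text without emphasis is ignored.</p>
<p><strong> A DIAMOND IS FOREVER – DeBeers, 1948 </strong></p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every <strong> tag in document order;
# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
sample_slogans = [tag.get_text(strip=True) for tag in sample_soup.find_all("strong")]

print(sample_slogans)
```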

Okay, let's now get rid of the company names. Luckily, each name follows the '–' character (an en dash) in its slogan line, so we can split on that character and keep only the first part:

In [29]:
parsed_slogans = []

for slogan in slogans:
    parsed_slogans.append(slogan.split('–')[0])

parsed_slogans[:10]

['What Is a Tagline?',
 ' a memorable dramatic phrase ',
 "reinforce and strengthen the audience's memory ",
 'How is a Tagline Created?',
 'The TOP 100 Advertising Taglines Ever',
 'A DIAMOND IS FOREVER ',
 'A LITTLE',
 'DAB’LL',
 'DO YA ',
 '1950s']
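
The output still contains headings and stray fragments, because `split` returns the whole string when no '–' is present. One possible refinement, sketched below under the assumption that only lines containing the en dash are real slogans, is to keep just those lines and strip the trailing space. The helper `extract_slogan` and the sample list are hypothetical.

```python
SEPARATOR = "–"  # en dash, the character that precedes the company name


def extract_slogan(line):
    """Return the text before the separator, or None if the line has no separator."""
    text, found, _company = line.partition(SEPARATOR)
    return text.strip() if found else None


# A few entries copied from the output above.
sample = [
    "What Is a Tagline?",
    "A DIAMOND IS FOREVER – DeBeers, 1948",
    "DO YA – Brylcreem,",
]

# Drop entries without a separator (extract_slogan returns None for them).
cleaned = [s for s in map(extract_slogan, sample) if s]

print(cleaned)
```

Note the limitation: a slogan that the page splits across several `<strong>` tags, such as 'A LITTLE' / 'DAB’LL' / 'DO YA', is reduced to its last fragment by this filter, so those cases would still need separate handling.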

For the final part, let's write the slogans to a CSV file:

In [33]:
import csv

slogans = parsed_slogans

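# 'a' appends, so re-running this cell adds the rows again;
# newline='' lets the csv module control line endings.
with open('./Data/slogans.csv', 'a', newline='', encoding='utf-8') as csv_file: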
    writer = csv.writer(csv_file)
    for slogan in slogans:
        writer.writerow([slogan])
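
As a quick sanity check, the write step can be paired with a read-back. The sketch below is an illustration, not the notebook's own code: it writes to a temporary directory in place of `./Data`, and opens the file in `'w'` mode so that repeated runs overwrite the file and do not duplicate rows.

```python
import csv
import os
import tempfile

rows = ["A DIAMOND IS FOREVER", "DO YA"]

# A temporary directory stands in for ./Data so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "slogans.csv")

    # Write one slogan per row, overwriting any existing file.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerows([row] for row in rows)

    # Read the file back and collect the first (only) column of each row.
    with open(path, newline="", encoding="utf-8") as csv_file:
        read_back = [record[0] for record in csv.reader(csv_file)]

print(read_back)
```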