# Scrape Text Data from Webpages

The web is full of untapped data. In this template, you specify the URL you want to scrape, and the code turns that page into analyzable text data using HTTP requests and [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/).

In [1]:
# Load packages
import pandas
import requests
from bs4 import BeautifulSoup

# Specify the url you want to scrape
url = "https://workspace-docs.datacamp.com/work/working-in-the-workspace"

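# Send the GET request and capture the response (the timeout avoids hanging forever)
r = requests.get(url, timeout=10)

# Stop early if the server returned an HTTP error status (4xx or 5xx)
r.raise_for_status()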

# Extract the response as HTML
html_doc = r.text

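# Create a BeautifulSoup object from the HTML, naming the parser explicitly
# so the result does not depend on which parsers happen to be installed
soup = BeautifulSoup(html_doc, "html.parser")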

## Extracting text elements of the webpage
The `soup` object contains the parsed HTML of the webpage, which usually needs further processing before it is useful. The code below extracts specific elements of the webpage: its title, its text, and its links. These are common starting points for natural language processing projects.

In [2]:
# Get the title of the webpage
soup.title.string

'Managing a workspace - Workspace Docs'

In [3]:
# Get the text of the webpage
soup.text

'\n\n\n\n\nManaging a workspace - Workspace Docs\n\n\n\n\n\n\n\n\n\nWorkspace DocsWorkspace HomeView TemplatesSearch…⌃KLinksWhat is DataCamp Workspace?Getting StartedWorkCreating a workspaceSharing a workspaceVersion HistoryCode cellText cellSQL cellChart cellIncluding imagesWorking with packagesHiding and showing cellsLong-running cellsManaging a workspacePublishCreating a PublicationHiding Cells in a PublicationExploring PublicationsIntegrationsWhat is an integration?PostgreSQLMySQLRedshiftGoogle BigQueryMicrosoft SQL ServerMariaDBOracle DatabaseGit and GitHubDropboxS3 BucketGuidesImporting data from flat filesResizing plotsResourcesWorkspace for EducationTechnical RequirementsPricingManage Group SettingsAddressing Slow CodeSupportPowered By GitBookManaging a workspaceHow to work in workspaceMaximum workspace sizeIf the total size of all the files in your workspace exceeds 5GB, you will no longer be able to access your workspace. If you unintentionally ended up in this situation and 
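In the output above, navigation labels run together (for example `Workspace DocsWorkspace Home`) because `soup.text` concatenates text nodes without a separator. A minimal sketch of one way to get cleaner text uses `get_text()` with its `separator` and `strip` arguments. The `sample_html` snippet and the variable names are hypothetical stand-ins for a downloaded page, not part of the original template.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
sample_html = """
<html><head><title>Demo page</title></head>
<body><nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
<h1>Managing a workspace</h1>
<p>First paragraph.</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Strip each text node, skip the empty ones, and join the rest with one space
clean_text = sample_soup.get_text(separator=" ", strip=True)
print(clean_text)
```

On the real page, the same call would be `soup.get_text(separator=" ", strip=True)`. Whether a space is the right separator depends on the downstream analysis.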

In [4]:
# Print the href attribute of every <a> tag
for link in soup.find_all("a"):
    print(link.get("href"))

/
https://www.datacamp.com/workspace
https://www.datacamp.com/workspace/templates
/
/getting-started
/work/workspaces
/work/sharing-a-workspace
/work/version-history
/work/code-cell
/work/text-cell
/work/sql-cell
/work/chart-cell
/work/including-images
/work/packages
/work/hiding-and-showing-cells
/work/long-running-cells
/work/managing-a-workspace
/publish/publications
/publish/hiding-cells-in-a-publication
/publish/explore-page
/integrations/what-is-an-integration
/integrations/postgresql
/integrations/mysql
/integrations/redshift
/integrations/bigquery
/integrations/sql-server
/integrations/mariadb
/integrations/oracle-db
/integrations/github
/integrations/dropbox
/integrations/s3-bucket
/guides/importing-data-from-flat-files
/guides/resizing-plots
/resources/workspace-for-education
/resources/technical-requirements
/resources/pricing
/resources/manage-group-settings
/resources/addressing-slow-code
/resources/support
https://www.gitbook.com/?utm_source=content&utm_medium=trademark&u
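Most of the links printed above are relative paths such as `/getting-started`, which cannot be requested on their own. One possible way to turn them into absolute URLs is `urllib.parse.urljoin` from the standard library, sketched below. The `sample_html` snippet and the name `absolute_links` are assumptions made for illustration; `base_url` reuses the URL from the first cell.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The page the links were scraped from (same URL as in the first cell)
base_url = "https://workspace-docs.datacamp.com/work/working-in-the-workspace"

# Hypothetical snippet with relative, absolute, and missing hrefs
sample_html = """
<a href="/">Home</a>
<a href="/getting-started">Getting Started</a>
<a href="https://www.datacamp.com/workspace">Workspace</a>
<a>No href here</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute
absolute_links = [
    urljoin(base_url, a["href"])
    for a in sample_soup.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```

Filtering with `href=True` also avoids the `None` values that `link.get("href")` returns for anchors without an `href` attribute.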

For more information on extracting other elements of a webpage, see the [Beautiful Soup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?l#find-all).
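As a rough illustration of what extracting other elements can look like, the sketch below uses `find_all` to collect headings, paragraphs, and paragraphs with a particular CSS class. The `sample_html` snippet, the `note` class, and the variable names are hypothetical; the tags and classes on a real page will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; inspect the real page to find the tags and classes you need
sample_html = """
<h1>Managing a workspace</h1>
<h2>Maximum workspace size</h2>
<p>Keep the total size of your files small.</p>
<p class="note">Delete files you no longer need.</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A list of tag names matches any of those tags, in document order
headings = [h.get_text(strip=True) for h in sample_soup.find_all(["h1", "h2"])]

# A single tag name matches every occurrence of that tag
paragraphs = [p.get_text(strip=True) for p in sample_soup.find_all("p")]

# class_ (with a trailing underscore) filters by CSS class
notes = [p.get_text(strip=True) for p in sample_soup.find_all("p", class_="note")]

print(headings)
print(paragraphs)
print(notes)
```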