# Scrape Text Data from Webpages

The web is full of untapped data! In this template, you specify the URL you want to scrape, and the code turns that page into analyzable text data using HTTP requests and [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/).

In [4]:
# Load packages
import pandas
import requests
from bs4 import BeautifulSoup

# Specify the url you want to scrape
url = "https://workspace-docs.datacamp.com/work/working-in-the-workspace"

# Package the request, send the request and catch the response
r = requests.get(url)

# Extract the response as HTML
html_doc = r.text

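# Create a BeautifulSoup object from the HTML
# (naming the parser explicitly avoids a warning and keeps results consistent across machines)
soup = BeautifulSoup(html_doc, "html.parser")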
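Network requests can fail or hang, so it may help to wrap the steps above in a small helper. The sketch below is one possible variant, not part of the original template: `fetch_soup` is a hypothetical name, the 10-second timeout is an arbitrary assumption, and `"html.parser"` is chosen only because it ships with Python.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not from the template)."""
    # The timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Name the parser explicitly so the result does not depend on installed extras
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup(url)` with the `url` defined above would then return a `soup` object like the one used in the rest of this template.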

## Extracting text elements of the webpage
The `soup` object holds the parsed HTML of the webpage, which usually needs further processing before it is useful. The code below extracts specific elements of the page: its title, its text, and its links. These are common starting points for natural language processing projects.

In [5]:
# Get the title of the webpage
soup.title.string

'Working in Workspace - Workspace Docs'

In [6]:
# Get the text of the webpage
soup.text

"\n\n\n\n\nWorking in Workspace - Workspace Docs\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWorkspace DocsWorkspace HomeView TemplatesSearch…What is DataCamp Workspace?WorkCreating a WorkspaceWorking in WorkspaceWorking with PackagesChanging AppearancePublishCreating a PublicationExploring PublicationsIntegrationsEnvironment VariablesGitHubPostgreSQLDropboxS3 BucketMySQLRedshift ClusterResourcesSupportVideo TutorialsCode-alongsPowered By GitBookWorking in WorkspaceHow to work in workspaceWorkspaces are privateOnly you can see all the workspaces that you own and their contents. In other words, your work is completely private unless you decide to publish your workspace. It is currently not possible to share your workspace with other people to collaborate, but you are able to share workspace publications with others (see Publications section for more).Maximum workspace sizeIf the total size of all the files in your workspace exceeds 5GB, you will no longer be able to access your works
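The raw `soup.text` output above contains long runs of newlines, and words from neighboring elements are fused together. Here is a minimal sketch of one way to tidy it, assuming Beautiful Soup's `get_text` with a separator and `strip=True`, followed by whitespace normalization. The `sample_html` string is invented for illustration so the snippet runs without a network call.

```python
from bs4 import BeautifulSoup

# Invented sample page (an assumption, not the real DataCamp HTML)
sample_html = """
<html><head><title>Demo page</title></head>
<body>
  <nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
  <h1>Heading</h1>
  <p>First paragraph.</p>
  <p>Second   paragraph.</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# separator=" " puts a space between text from different elements;
# strip=True trims each piece and drops whitespace-only pieces
joined_text = sample_soup.get_text(separator=" ", strip=True)

# Collapse any remaining runs of whitespace inside a piece into single spaces
clean_text = " ".join(joined_text.split())
print(clean_text)
```

The same two lines should work on the `soup` object from this template; whether a space or a newline is the better separator depends on the downstream analysis.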

In [7]:
# Print the href attribute of every 'a' (anchor) tag
for link in soup.find_all("a"):
    print(link.get("href"))

/
https://www.datacamp.com/workspace
https://www.datacamp.com/workspace/templates
/
/work/workspaces
/work/working-in-the-workspace
/work/packages
/work/changing-appearance
/publish/publications
/publish/explore-page
/integrations/environment-variables
/integrations/github-integration
/integrations/postgresql
/integrations/dropbox
/integrations/s3-bucket
/integrations/mysql
/integrations/redshift-cluster
/resources/support
/resources/workspace-video-tutorials
/resources/code-alongs
https://www.gitbook.com/?utm_source=content&utm_medium=trademark&utm_campaign=-MZqboFGZzD87nn7oPsm
https://app.gitbook.com/@datacamp-1/s/workspace/~/drafts/-Me0G3lJVxqutyNKs_an/publish/publications
/resources/support
https://jupyterlab.readthedocs.io/en/stable/
https://docs.rstudio.com
/work/workspaces
/work/packages
/work/working-in-the-workspace#workspaces-are-private
/work/working-in-the-workspace#maximum-workspace-size
/work/working-in-the-workspace#saving-your-edits
/work/working-in-the-workspace#renami
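Many of the links printed above are relative paths such as `/work/packages`, which cannot be requested on their own. One way to turn them into full URLs is `urljoin` from the standard library, sketched below. The `sample_html` snippet is a made-up stand-in for the page, and `base_url` is assumed to be the URL the page was fetched from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base: the URL that was scraped earlier in this template
base_url = "https://workspace-docs.datacamp.com/work/working-in-the-workspace"

# Made-up sample markup mixing relative, absolute, and href-less anchors
sample_html = """
<a href="/work/packages">Packages</a>
<a href="https://docs.rstudio.com">RStudio</a>
<a>No href here</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True skips anchors without an href attribute;
# urljoin resolves relative paths and leaves absolute URLs unchanged
absolute_links = [
    urljoin(base_url, a["href"])
    for a in sample_soup.find_all("a", href=True)
]
print(absolute_links)
```

Filtering with `href=True` also avoids the `None` values that `link.get("href")` returns for anchors that have no `href`.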

For more information on how to extract other elements of a webpage, see the [Beautiful Soup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?l#find-all).
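As an illustration of extracting other elements, the sketch below passes a list of tag names to `find_all` and collects the results in a pandas DataFrame for analysis. The sample HTML, the choice of tags, and the column names `tag` and `text` are all assumptions made for this example.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample markup so the example runs offline
sample_html = """
<h1>Working in Workspace</h1>
<p>Workspaces are private.</p>
<h2>Maximum workspace size</h2>
<p>Keep files small.</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all accepts a list of tag names and returns matches in document order
rows = [
    {"tag": element.name, "text": element.get_text(strip=True)}
    for element in sample_soup.find_all(["h1", "h2", "p"])
]

# One row per element, ready for filtering or text analysis
elements = pd.DataFrame(rows)
print(elements)
```

From here, something like `elements[elements["tag"] == "p"]` would keep only the paragraph text.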