## Importing a dataset directly from a web URL and saving it locally

In [2]:
from urllib.request import urlretrieve
import pandas as pd
url='https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
urlretrieve(url,'winequality-red.csv')                    # Save file locally
df = pd.read_csv('winequality-red.csv', sep=';')         # Read file into a DataFrame
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
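
The `sep=';'` argument matters because this file separates fields with semicolons rather than commas. The sketch below illustrates the difference on a small in-memory sample modelled on the rows shown above (only three columns are kept, and the exact header layout of the real file is an assumption):

```python
import io

import pandas as pd

# Hypothetical sample in a semicolon-delimited layout, values taken from the table above.
sample = "fixed acidity;volatile acidity;quality\n7.4;0.7;5\n7.8;0.88;5\n"

# With the default sep=',' each line is read as a single field.
wrong = pd.read_csv(io.StringIO(sample))

# With sep=';' the three columns are split correctly.
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
print(right.columns.tolist())
```

If a local copy is not needed, `pd.read_csv()` also accepts the URL itself in place of a file name.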


## Import and save a spreadsheet from the web

In [13]:
url = 'https://assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Open the Excel file (gives access to all of its sheets)
xls = pd.ExcelFile(url)

# Save the file locally, keeping the .xls extension of the source
urlretrieve(url, 'Country_war.xls')

# Print the names of all sheets in the workbook
print(xls.sheet_names)

['1700', '1900']


Print the head of the first sheet

In [12]:
df1=xls.parse('1700')
df1.head()

Unnamed: 0,country,1700
0,Afghanistan,34.565
1,Akrotiri and Dhekelia,34.616667
2,Albania,41.312
3,Algeria,36.72
4,American Samoa,-14.307


# **Performing HTTP requests**

Fetch the HTML source of a webpage.

In [14]:
import requests
url="http://www.datacamp.com/teach/documentation"

# Package the request, send it, and catch the response
r = requests.get(url)

# Extract the response body as text
txt = r.text

# Print the HTML of the webpage
print(txt)

<!DOCTYPE html>
<html lang="en-US">
<head>
    <title>Just a moment...</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta name="robots" content="noindex,nofollow">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
    <meta http-equiv="refresh" content="35">

</head>
<body class="no-js">
    <div class="main-wrapper" role="main">
    <div class="main-content">
        <h1 class="zone-name-title h1">
            <img class="heading-favicon" src="/favicon.ico" alt="Icon for www.datacamp.com"
                 onerror="this.onerror=null;this.parentNode.removeChild(this)">
            www.datacamp.com
        </h1>
        <h2 class="h2" id="challenge-running">
            Checking if the site connection is secure
        </h2>
        <noscript>
            <div id="challenge-error-title">
                <d
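
The output above is not the requested documentation page: its title is "Just a moment..." and it says "Checking if the site connection is secure", so the server answered with a challenge page instead. Such responses are often, though not always, sent with a non-2xx status code, so it can help to check the status before using `r.text`. A minimal sketch, where `fetch_html` and its `timeout` default are hypothetical names chosen for illustration:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx.
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out here):
# html = fetch_html("http://www.datacamp.com/teach/documentation")
```

A successful status does not guarantee that the body is the page you wanted, so a quick look at the returned HTML is still worthwhile.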

# **Web Scraping**


---



## We can use the BeautifulSoup package to parse, prettify and extract information from HTML.

**Parsing HTML with BeautifulSoup**

`prettify()` is a BeautifulSoup method that returns the parsed document as a string with each tag and text node on its own line, indented according to its nesting depth. This makes complex HTML much easier to read, which is mainly useful for debugging and for inspecting a page's structure before extracting data from it.
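
As a quick illustration before working with a real page, here is a minimal sketch on a hand-written HTML string (the string and the name `demo_soup` are made up for this example):

```python
from bs4 import BeautifulSoup

# A one-line HTML document, hard to read as written.
html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"

# Parse with Python's built-in parser and re-indent the document.
demo_soup = BeautifulSoup(html, "html.parser")
pretty = demo_soup.prettify()

print(pretty)
```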

In [17]:
import requests as rqst
from bs4 import BeautifulSoup as bfs

url = 'https://www.python.org/~guido/'

# Package the request, send it, and catch the response
r = rqst.get(url)

# Extract the response body as HTML text
html_doc = r.text

# Create a BeautifulSoup object from the HTML
# (naming the parser explicitly avoids a warning and keeps results consistent)
soup = bfs(html_doc, 'html.parser')

# Prettify the BeautifulSoup object
pretty_soup = soup.prettify()
print(pretty_soup)

<html>
 <head>
  <title>
   Guido's Personal Home Page
  </title>
 </head>
 <body bgcolor="#FFFFFF" text="#000000">
  <!-- Built from main -->
  <h1>
   <a href="pics.html">
    <img border="0" src="images/IMG_2192.jpg"/>
   </a>
   Guido van Rossum - Personal Home Page
   <a href="pics.html">
    <img border="0" height="216" src="images/guido-headshot-2019.jpg" width="270"/>
   </a>
  </h1>
  <p>
   <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
    <i>
     "Gawky and proud of it."
    </i>
   </a>
  </p>
  <h3>
   <a href="images/df20000406.jpg">
    Who I Am
   </a>
  </h3>
  <p>
   Read
my
   <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
    "King's
Day Speech"
   </a>
   for some inspiration.
  </p>
  <p>
   I am the author of the
   <a href="http://www.python.org">
    Python
   </a>
   programming language.  See also my
   <a href="Resume.html">
    resume
   </a>
   and my
   <a href="Public

# **Turning a webpage into data using BeautifulSoup: getting the text**


---

Here, we extract the text content from the parsed HTML.

**Get the title of Guido's webpage**

In [18]:
guido_title=soup.title
guido_title

<title>Guido's Personal Home Page</title>

**Get all the text on Guido's page**

In [19]:
guido_text=soup.get_text()
guido_text

'\n\nGuido\'s Personal Home Page\n\n\n\n\n\nGuido van Rossum - Personal Home Page\n\n\n"Gawky and proud of it."\nWho I Am\nRead\nmy "King\'s\nDay Speech" for some inspiration.\n\nI am the author of the Python\nprogramming language.  See also my resume\nand my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some\npictures of me,\nmy new blog, and\nmy old\nblog on Artima.com.  I am\n@gvanrossum on Twitter.\n\nI am currently a Distinguished Engineer at Microsoft.\nI have worked for Dropbox, Google, Elemental Security, Zope\nCorporation, BeOpen.com, CNRI, CWI, and SARA.  (See\nmy resume.)  I created Python while at CWI.\n\nHow to Reach Me\nYou can send email for me to guido (at) python.org.\nI read everything sent there, but I receive too much email to respond\nto everything.\n\nMy Name\nMy name often poses difficulties for Americans.\n\nPronunciation: in Dutch, the "G" in Guido is a hard G,\npronounced roughly like the "ch" in Scottish "
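
The raw result of `get_text()` keeps every newline from the page, and `soup.title` returns the whole tag rather than just its text. A minimal sketch of two ways to tidy this up, shown on a made-up HTML string (`demo` is a hypothetical name):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>Hello\n   <b>world</b></p></body></html>"
demo = BeautifulSoup(html, "html.parser")

# .string gives the text inside the <title> tag, without the tag itself.
title_text = demo.title.string

# strip=True trims each piece of text and drops empty ones;
# separator=" " joins the remaining pieces with single spaces.
clean_text = demo.get_text(separator=" ", strip=True)

print(title_text)
print(clean_text)
```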

# **Turning a webpage into data using BeautifulSoup: getting the hyperlinks**


---

We'll figure out how to extract the URLs of the hyperlinks from the webpage. In the process, we'll become close friends with the soup method `find_all()`.


Use `find_all()` to find all hyperlinks in `soup`. Hyperlinks are defined by the HTML tag `<a>`, but the tag name is passed to `find_all()` without angle brackets. Store the result in the variable `a_tags`.

In [21]:
# Find all 'a' tags (which define hyperlinks)
a_tags=soup.find_all('a')

The variable `a_tags` is a `ResultSet`, which behaves like a list of tags. Iterate over it with a `for` loop and, for every element `link`, print the URL of the hyperlink with `link.get('href')`.

In [22]:
for link in a_tags:
    print(link.get('href'))

pics.html
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
images/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
Resume.html
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif


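In this way, we can list every hyperlink on the page. Note that several of them, such as `pics.html`, are relative URLs that only make sense in combination with the page's own address.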
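
To follow or download the links, relative URLs need to be resolved against the address of the page they came from. A minimal sketch using the standard library's `urljoin`, with a few of the `href` values printed above copied into a list (`base_url`, `hrefs`, and `absolute_urls` are names chosen for this example):

```python
from urllib.parse import urljoin

# Address of the page the links were scraped from.
base_url = "https://www.python.org/~guido/"

# A few href values taken from the output above.
hrefs = ["pics.html", "images/license.jpg", "http://www.python.org"]

# Relative links are resolved against base_url; absolute links are left unchanged.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for u in absolute_urls:
    print(u)
```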