Learn to scrape a basic website step by step using Python. It is straightforward to do, and you can refer to my work as you follow along.
You will learn:
- Basic Python
- How to scrape basic data from a website
- Data handling
Here are the steps we will follow. The tools we will use are BeautifulSoup, pandas, and requests, all in Python.
- We are going to scrape http://quotes.toscrape.com/page/1/ (I would recommend using this website, since it exists specifically for scraping practice).
- We will get a list of quotes, and for each quote we will grab its author's name.
Here's how we will do it:
- Use requests to download the webpage
- Use BeautifulSoup to parse the page and extract the info we want
- Convert the results to a pandas DataFrame and save them as a CSV file
pip install requests
To install the library (run this in the terminal, not in Python)
import requests
To import it
pip install beautifulsoup4
To install BeautifulSoup (again, in the terminal)
from bs4 import BeautifulSoup
To import it
response = requests.get('http://quotes.toscrape.com/page/1/')
Downloads the page
page_contents = response.text
doc = BeautifulSoup(page_contents, 'html.parser')
This parses the downloaded HTML using BeautifulSoup
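As a side note, if the request ever hangs or a site blocks bare requests, a slightly more defensive version of the same download step can help. This is just a sketch: the User-Agent string and the 10-second timeout are arbitrary choices of mine, and quotes.toscrape.com works fine without them.
## A timeout stops the request from hanging forever, and a User-Agent
## header makes the request look like a normal browser visit
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://quotes.toscrape.com/page/1/', headers=headers, timeout=10)
doc = BeautifulSoup(response.text, 'html.parser')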
print(response.status_code)
To check whether the page was downloaded successfully (it should return 200 if it was)
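If you would rather get an exception than check the number yourself, requests has a built-in helper for this:
## Raises requests.exceptions.HTTPError if the status code is 4xx or 5xx,
## and does nothing if the download succeeded
response.raise_for_status()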
If you are getting '401' as the output instead, the server refused the request because it was not authenticated. This can occur for a number of reasons, including:
- The server requires a specific type of authentication that has not been provided in the request.
- The provided credentials are invalid or have expired.
- The resource is protected by a login page, and the client has not provided the necessary credentials to access the resource.
Try logging in to the website to get credentials, or pass them with the requests.auth.HTTPBasicAuth class from the requests library, as shown below.
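For example, if a page you are scraping sits behind basic authentication, you can pass credentials like this. This is only a sketch: the URL and credentials are placeholders, and quotes.toscrape.com itself does not require any authentication.
from requests.auth import HTTPBasicAuth
## Placeholder URL and credentials, replace them with your own
response = requests.get('http://example.com/protected/', auth=HTTPBasicAuth('my_username', 'my_password'))
print(response.status_code) ## Should print 200 once the credentials are accepted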
print(len(page_contents))
It is not recommended to print the whole page contents (it is a lot of text), but checking the length is a quick sanity check
Now you have to locate the tags you want to scrape from the website. In my case, I want to scrape all of the quotes, and on this site each quote lives inside a 'span' tag.
To locate all of the span tags, use the find_all method
span_tags = doc.find_all('span')
Now there are 32 span tags on the webpage (use print(len(span_tags)) to check), but we obviously don't want all of them. We only want the span tags that contain the quotes, so we can be more specific by checking their classes.
To check that, right-click on a quote on the page and choose 'Inspect' to open the browser's developer tools. You can see that the quote spans all have the class 'text'. So to be more specific we can use this: span_tags = doc.find_all('span', {'class': 'text'})
Now if we check the length, you will see 10 as the output, which means we are going in the right direction.
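As a side note, BeautifulSoup also supports CSS selectors, so the same filtering can be written with the select method. This is just an equivalent way of doing what find_all did above; the variable name span_tags_css is mine, to keep it separate from the list we already have.
## 'span.text' means: span tags whose class is 'text'
span_tags_css = doc.select('span.text')
print(len(span_tags_css)) ## Should also print 10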
Now if we print them with print(span_tags), you should see a list of the ten matching span tags, with each quote still wrapped in its HTML tag.
Now what we want is the text (the quote itself) inside each tag. You can use the .text attribute for that: print(span_tags[0].text)
You should see the first quote printed as plain text, without the surrounding HTML.
Now we will create a list, quotes = [], and write a for loop that converts each tag into the text we want and appends it to the list, then prints the list outside the loop. Make sure the print is outside the for loop, otherwise the list will be printed once for every quote.
##A loop which separates the quotes we want and puts them in a list
quotes = []
for tag in span_tags:
    ##Converts the tag into text and appends it to the list
    quotes.append(tag.text)
print(quotes)
Output: a Python list containing all ten quotes as strings.
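By the way, the same loop can be written as a one-line list comprehension, which is the more idiomatic Python way to build a list like this. It produces exactly the same quotes list as the loop above.
## Builds the same list of quote strings in one line
quotes = [tag.text for tag in span_tags]
print(quotes)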
We will repeat the same for the author names.
small_tags = doc.find_all('small') ##Finds all the small tags inside the html document
##A loop which separates the author names we want and puts them in a list
author_names = []
for tag in small_tags:
    author_names.append(tag.text)
print(author_names)
Output: a list of the ten author names as strings.
If you are not sure which tag the author names (or anything else) live in, inspect the page the same way as before and then use doc.find_all('Tag_Name') with the tag name you found.
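On this particular site, if you inspect an author name you should see that each small tag also carries the class 'author', so you can be specific here too, the same way we were with the quotes:
## Same idea as with the quotes: filter by class to be safe
small_tags = doc.find_all('small', {'class': 'author'})
print(len(small_tags)) ## Should print 10, one author per quote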
Now we are almost done. We just need to install the pandas library and convert our data into a pandas DataFrame.
pip install pandas
To install the pandas library (again, in the terminal)
import pandas as pd
To import it
##Create a dictionary from the two lists we built
quotes_dict = {'QUOTES_LIST': quotes,
               'AUTHORS_LIST': author_names}
##Creates the data frame
table_df = pd.DataFrame(quotes_dict)
##Converts it to a csv file ('file_name.csv' can be any name you like)
table_df.to_csv('file_name.csv', index=False)
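To make sure the export worked, you can load the CSV back with pandas and look at the first few rows ('file_name.csv' is just the placeholder name used above):
## Reads the CSV back in and shows the first five rows
check_df = pd.read_csv('file_name.csv')
print(check_df.head())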
Thank you for reading. If you found this helpful please recommend my work to others and star this repo.