# Scraping Yoga and Nutrition Related Articles using Python

**Yoga** is an ancient practice that involves physical poses, concentration, and deep breathing. **Yoga nutrition** aims at cleansing, strengthening, and developing all levels of human existence.
With that in mind, this web scraping project gathers data on **Yoga and Nutrition** from Google search results.

![banner image](https://i.imgur.com/CjeKM2H.png)

We search for many things on the internet. This information is readily available, but it is not easy to save for later use. One option is to copy the data manually and save it on your desktop, but this is very time-consuming. Web scraping is handy in such cases.

**Web Scraping** is a technique used to automatically extract large amounts of data from websites and save it to a file or database. The scraped data is usually stored in a tabular or spreadsheet format (e.g., a CSV file).

![](https://i.imgur.com/m5lV5m9.png)

In this project, we will scrape data from Google search results.

We'll use the Python libraries requests and beautifulsoup4 to scrape the web pages.

Here's an outline of the steps we'll follow:

1. Download the webpage using requests.
2. Parse the HTML source code using beautifulsoup4.
3. Extract title, link and description.
4. Compile the extracted information into Python lists and dictionaries.
5. Extract and combine data from multiple pages.
6. Save the extracted information to a CSV file.

By the end of the project, we'll create a CSV file in the following format:



![](https://i.imgur.com/rqN5uDY.png)

### Installing and importing the required libraries
Before starting the project, it is a good idea to install and import the dependencies. The current project uses the following libraries:

* Requests - to download the web page in text format.
* BeautifulSoup - to parse the text downloaded by Requests for further processing.
* pandas - to collect the extracted results in a DataFrame and save them to a CSV file.


In [1]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


Normally, requests works without custom headers or cookies. In some situations, however, a request for the page content returns a status code of 403 or 503, which means we cannot access the page contents. In such cases we pass headers (and, if needed, cookies) as arguments to the requests.get() function.

In [3]:

headers = {'User-agent':
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}

base_url = 'https://google.com/'
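
As a rough illustration of what the headers argument does, the sketch below prepares a request without sending it and inspects the headers that would go out. The Session object, the `params` argument and the example.com URL are assumptions added for illustration; the notebook itself passes headers directly to requests.get().

```python
import requests

# Same User-Agent string as in the cell above.
headers = {'User-agent':
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}

# Assumption: a Session is used only so the request can be prepared
# and inspected without any network traffic.
session = requests.Session()
session.headers.update(headers)

# Placeholder URL for illustration; nothing is downloaded.
request = requests.Request('GET', 'https://example.com/search', params={'q': 'Yoga'})
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers['User-Agent'])
```

Header names are case-insensitive, so the `'User-agent'` key replaces the default `User-Agent` that requests would otherwise send.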



### Use Beautiful Soup to parse and extract information


Getting the response from a particular URL using Requests.

The requests.get function from the requests library downloads the web page and returns it in a response object.
To check that we received the page as required for further processing, we use the status_code property of the response object.

Reusable function to download a web page and check its status.

In [4]:

def get_url(url):
    # Fetch the page with requests.get and store the result in the variable response.
    response = requests.get(url, headers=headers)
    # Any status code other than 200 means the page was not loaded successfully.
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    return response
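
To see how the status check behaves without contacting any server, here is a minimal sketch that applies the same test to hand-built response objects. The helper name `ensure_ok` and the example.com URL are hypothetical and not part of the notebook.

```python
import requests

def ensure_ok(response, url):
    # Mirrors the check in get_url: anything other than 200 is treated as a failure.
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    return response

# Assumption: Response objects are built by hand so that no request is sent.
blocked = requests.Response()
blocked.status_code = 403

ok = requests.Response()
ok.status_code = 200

try:
    ensure_ok(blocked, 'https://example.com/blocked')
    message = 'loaded'
except Exception as error:
    message = str(error)

print(message)
print(ensure_ok(ok, 'https://example.com/ok').status_code)
```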
   
        

Function to get the title, link and description of each article from the HTML tags.

HTML tag information

Title tag: h3 tag

![](https://i.imgur.com/GA5FA78.png)

Description tag: span

![](https://i.imgur.com/fyrtl6c.png)

Link: href attribute of the a tag

![](https://i.imgur.com/i5PxmOd.png)


In [5]:
def scrape_articles(url):
    response = get_url(url)
    doc = BeautifulSoup(response.text, 'html.parser')

    # Despite its name, articles_dict is a list holding one dictionary per search result.
    articles_dict = []
    for result in doc.select('.tF2Cxc'):
        title = result.select_one('.DKV0Md').text
        link = result.select_one('.yuRUbf a')['href']

        # Sometimes there is no description: select_one then returns None
        # and accessing .text raises AttributeError.
        try:
            desc = result.select_one('#rso .lyLwlc').text
        except AttributeError:
            desc = None

        articles_dict.append({
            'Title': title,
            'Link': link,
            'Description': desc
        })
    return articles_dict
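
The extraction logic can be tried offline on a small HTML snippet. The sketch below is only an illustration: the sample markup is invented to mirror the class names used above (real Google markup differs and changes over time), and it uses an explicit None check in place of try/except.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from scrape_articles.
SAMPLE_HTML = """
<div id="rso">
  <div class="tF2Cxc">
    <div class="yuRUbf"><a href="https://example.com/yoga"><h3 class="DKV0Md">Yoga basics</h3></a></div>
    <div class="lyLwlc">An introduction to yoga.</div>
  </div>
  <div class="tF2Cxc">
    <div class="yuRUbf"><a href="https://example.com/diet"><h3 class="DKV0Md">Yogic diet</h3></a></div>
  </div>
</div>
"""

doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

rows = []
for result in doc.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    # The second result has no description element, so select_one returns None.
    desc_tag = result.select_one('.lyLwlc')
    desc = desc_tag.text if desc_tag is not None else None
    rows.append({'Title': title, 'Link': link, 'Description': desc})

for row in rows:
    print(row)
```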

Getting the next page links from 'a' tags with class 'fl' and scraping data for the first ten pages.

Next page link tag information

![](https://i.imgur.com/Zhs3rVz.png)

In [6]:
def first_ten_pages(url):
    response = get_url(url)
    doc = BeautifulSoup(response.text, 'html.parser')
    pagination_url_tags = doc.find_all('a', class_='fl')

    # Create a list of next page links by concatenating base_url with tag['href'].
    next_page_links = []
    for tag in pagination_url_tags:
        next_page_links.append(base_url + tag['href'])

    # Scrape the first page.
    articles_dict = scrape_articles(url)

    # Scrape each of the following pages and add its results.
    for next_url in next_page_links:
        articles_dict.extend(scrape_articles(next_url))

    # Convert the list of dictionaries to a DataFrame.
    articles_df = pd.DataFrame(articles_dict)
    return articles_df
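
When the base URL ends with a slash and the href starts with one, plain concatenation produces a doubled slash, which a server may or may not tolerate. As an alternative sketch, `urljoin` from the standard library resolves the relative link against the base. The href value below is a made-up example of the kind a pagination link might carry.

```python
from urllib.parse import urljoin

base_url = 'https://google.com/'
# Hypothetical href for illustration only.
href = '/search?q=Yoga&start=10'

concatenated = base_url + href
joined = urljoin(base_url, href)

print(concatenated)
print(joined)
```

`urljoin` gives the same result whether or not the base URL carries a trailing slash.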
         

### Bringing it all together in one block
Putting all the functions in one place.

In [7]:
#Header
headers = {'User-agent':
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}

base_url = 'https://google.com/'
url = 'https://google.com/search?q='

#Reusable function to download a web page and check its status.
def get_url(url):
    # Fetch the page with requests.get and store the result in the variable response.
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    return response

#Function to get Title, Link and Description of each article from html tags
def scrape_articles(url):
    response = get_url(url)
    doc = BeautifulSoup(response.text, 'html.parser')

    # Despite its name, articles_dict is a list holding one dictionary per search result.
    articles_dict = []
    for result in doc.select('.tF2Cxc'):
        title = result.select_one('.DKV0Md').text
        link = result.select_one('.yuRUbf a')['href']

        # Sometimes there is no description: select_one then returns None
        # and accessing .text raises AttributeError.
        try:
            desc = result.select_one('#rso .lyLwlc').text
        except AttributeError:
            desc = None

        articles_dict.append({
            'Title': title,
            'Link': link,
            'Description': desc
        })
    return articles_dict

#Function to scrape the first results page and the pages linked from its pagination.
def first_ten_pages(url):
    response = get_url(url)
    doc = BeautifulSoup(response.text, 'html.parser')
    pagination_url_tags = doc.find_all('a', class_='fl')

    # Create a list of next page links by concatenating base_url with tag['href'].
    next_page_links = []
    for tag in pagination_url_tags:
        next_page_links.append(base_url + tag['href'])

    # Scrape the first page.
    articles_dict = scrape_articles(url)

    # Scrape each of the following pages and add its results.
    for next_url in next_page_links:
        articles_dict.extend(scrape_articles(next_url))

    # Convert the list of dictionaries to a DataFrame.
    articles_df = pd.DataFrame(articles_dict)
    return articles_df


Getting the title, link and description for the topics "Yoga" and "Yoga Nutrition" (first 10 pages).

Take the default Google search URL 'https://google.com/search?q=' and our search keyword as two strings and concatenate them.

In [8]:
#Topics
topics = ["Yoga", "Yoga Nutrition"]
url_list = []

#Generating list of URLs for topics
for text in topics:
    url = 'https://google.com/search?q=' + text
    url_list.append(url)
url_list

#Scraping data for each topic URL and collecting the resulting dataframes.
frames = []
for topic_url in url_list:
    frames.append(first_ten_pages(topic_url))

#Merging the topic dataframes. DataFrame.append was removed in pandas 2.0, so pd.concat is used.
df = pd.concat(frames)
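
A topic such as "Yoga Nutrition" contains a space, and plain concatenation relies on the HTTP library to encode it. One possible alternative, sketched here with the standard library's `urlencode`, encodes the query explicitly. The `search_url` variable and the `encoded_urls` name are assumptions for illustration.

```python
from urllib.parse import urlencode

topics = ["Yoga", "Yoga Nutrition"]
# Assumed base search URL without the query string.
search_url = 'https://google.com/search'

# urlencode turns {'q': 'Yoga Nutrition'} into 'q=Yoga+Nutrition'.
encoded_urls = [search_url + '?' + urlencode({'q': topic}) for topic in topics]

for encoded in encoded_urls:
    print(encoded)
```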
 
    
    

In [9]:
#Resetting index
df = df.reset_index(drop = True)
df

| Unnamed: 0 | Title | Link | Description |
|---|---|---|---|
| 0 | Yoga - Wikipedia | https://en.wikipedia.org/wiki/Yoga | Yoga is a group of physical, mental, and spiri... |
| 1 | Yoga for Everyone - The New York Times | https://www.nytimes.com/guides/well/beginner-yoga | Yoga 101. A set of specific exercises, called ... |
| 2 | Yoga: What You Need To Know \| NCCIH | https://www.nccih.nih.gov/health/yoga-what-you... | Yoga is an ancient and complex practice, roote... |
| 3 | Yoga: Methods, types, philosophy, and risks | https://www.medicalnewstoday.com/articles/286745 | 08-Jul-2021 — Yoga is an ancient practice that... |
| 4 | All About Yoga: Poses, Types, Benefits, and More | https://www.everydayhealth.com/yoga/ | Research has shown that yoga can help lower ph... |
| ... | ... | ... | ... |
| 189 | Total Nutrition - TriBalance Yoga Center | https://tribalance.com/total-nutrition-program/ | What is the Tribalance Total Nutrition Program... |
| 190 | what is yogic diet - sattvik food \| Lifestyle ... | https://www.lifestyleyogaworld.com/what-is-yog... | 21-Jun-2021 — Our blog talks about the importa... |
| 191 | The Benefits of Yoga Put to The Test \| Nutriti... | https://nutritionfacts.org/webinar/the-benefit... | Yoga is practiced by millions of Americans and... |
| 192 | Dietry Principles \| Kriyayoga Meditation | https://www.kriyayoga-yogisatyam.org/science-o... | The Science of Nutrition. When we prepare food... |
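
To make the effect of merging and resetting the index concrete, here is a small sketch with two invented DataFrames standing in for the per-topic results. The titles and example.com links are placeholders, not scraped data.

```python
import pandas as pd

# Placeholder frames standing in for the results of two topics.
yoga_df = pd.DataFrame([
    {'Title': 'Yoga basics', 'Link': 'https://example.com/yoga'},
    {'Title': 'Yoga poses', 'Link': 'https://example.com/poses'},
])
nutrition_df = pd.DataFrame([
    {'Title': 'Yogic diet', 'Link': 'https://example.com/diet'},
])

combined = pd.concat([yoga_df, nutrition_df])
# Each frame keeps its own row labels, so 0 appears twice here.
index_before = list(combined.index)

combined = combined.reset_index(drop=True)
# After the reset the rows are numbered consecutively.
index_after = list(combined.index)

print(index_before)
print(index_after)
```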


### Create CSV file(s) with the extracted information



In [10]:
#Final dataframe to csv file.
df.to_csv('Yoga_nutrition_articles.csv')
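
By default, to_csv also writes the DataFrame index as a first column without a name, and pandas labels that column `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False`, writing to an in-memory buffer instead of a file. The sample row is a placeholder, and which variant is preferable depends on whether the row numbers are wanted in the file.

```python
import io
import pandas as pd

# Placeholder data for illustration.
sample_df = pd.DataFrame([{'Title': 'Yoga basics', 'Link': 'https://example.com/yoga'}])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```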


### Summary

In this project, yoga and nutrition articles were scraped from the first 10 Google results pages.

The following steps were followed in this notebook:
1. Downloaded the webpage using requests.
2. Parsed the HTML source code using beautifulsoup4.
3. Extracted title, link and description details.
4. Compiled the extracted information into Python lists and dictionaries.
5. Extracted and combined data from multiple pages.
6. Saved the extracted information to a CSV file.

The CSV file we created has this format:
![](https://i.imgur.com/iJrM3qz.png)

### Future work

* We can scrape each article for more information, such as Written by, Reviewed by and Date.
* We can get information for other user-defined topics and scrape the data accordingly from Google results.
* We can use this data for NLP projects.

### References
1. https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis

2. https://practicaldatascience.co.uk/data-science/how-to-scrape-google-search-results-using-python

3. https://medium.com/analytics-vidhya/web-scraping-amazon-reviews-a36bdb38b257