#### Web scraping, also called web data mining or web harvesting, is the process of building an agent that automatically extracts, parses, downloads and organizes useful information from the web.

#### The information extracted by web scraping can be republished on another website or used for data analysis. Typical data elements include names, addresses and prices.

#### Web scraping automatically extracts data and presents it in a format you can easily make sense of.

#### Some of the important uses of web scraping are discussed here:

#### E-commerce Websites: Web scrapers can collect data, especially prices, for a specific product from various e-commerce websites so that they can be compared.

#### Content Aggregators: Web scraping is widely used by content aggregators, such as news aggregators and job aggregators, to provide updated data to their users.

#### Marketing and Sales Campaigns: Web scrapers can be used to collect data such as email addresses and phone numbers for sales and marketing campaigns.

#### Search Engine Optimization (SEO): Web scraping is widely used by SEO tools like SEMRush and Majestic to tell businesses how they rank for the search keywords that matter to them.

#### Data for Machine Learning Projects

#### We are going to use Python as our scraping language, together with a simple and powerful library, BeautifulSoup.

### Scraping Rules
#### You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
#### Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
#### The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
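The one-request-per-second guideline above can be enforced in code rather than left to habit. The sketch below is one possible approach, not part of the original notebook: `PoliteThrottle` is a hypothetical helper name, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class PoliteThrottle:
    """Enforce a minimum gap between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A typical use would be to create `throttle = PoliteThrottle(1.0)` once and call `throttle.wait()` immediately before each `requests.get(...)` inside a scraping loop.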

In [8]:
# import libraries
import requests # pip install requests 
from bs4 import BeautifulSoup  # pip install bs4 / conda install bs4 

In [22]:
# use the get function from the requests module
# it fetches the webpage at the URL passed as an argument
result=requests.get(url="https://timesofindia.indiatimes.com/") 

In [23]:
# to make sure the website is accessible, check that we get a 200 OK response,
# which indicates that the page was found and returned
print(result.status_code) # the full list of HTTP status codes is available on Wikipedia 

200
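A bare number like 200 is easy to misread once other codes start appearing. As a small sketch using only the standard library, the hypothetical helpers below turn a status code into a readable label and check whether it falls in the 2xx success range. With `requests` itself, `result.ok` and `result.raise_for_status()` serve a similar purpose.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '200 OK' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # the code is not one of the statuses known to the standard library
        return f"{code} (unrecognized status)"


def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300
```

For example, `describe_status(result.status_code)` could be printed instead of the raw number.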


In [24]:
print(result.headers)  # inspect the HTTP response headers to verify that the correct page was reached

{'x-amz-id-2': 'xnk77DVG29uQhIsg54QsUOyHldQglK6QGsMwYszx7OyXMeLlB4JHhJVX7zUHXbTeA/xpb8onVZc=', 'x-amz-request-id': '0F4RX6TBVMYASRE6', 'Last-Modified': 'Thu, 03 Mar 2022 07:10:36 GMT', 'ETag': '"f603278a95b773c306b3d680bdc7dc33"', 'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Content-Type': 'text/html; charset=utf-8', 'Server': 'AmazonS3', 'Content-Length': '116699', 'Vary': 'Accept-Encoding', 'Expires': 'Thu, 03 Mar 2022 07:11:27 GMT', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Date': 'Thu, 03 Mar 2022 07:11:27 GMT', 'Connection': 'keep-alive', 'Access-Control-Max-Age': '86400', 'Access-Control-Allow-Credentials': 'false', 'Access-Control-Allow-Headers': 'Origin,X-Requested-With,Content-Type,Accept', 'Access-Control-Allow-Methods': 'GET,POST', 'Strict-Transport-Security': 'max-age=86400'}


In [25]:
content=result.text # store the page body in a variable; .text gives a decoded string, .content gives raw bytes 

In [13]:
# print(content)

In [26]:
# now that the content is stored, we use the bs4 module to parse and process the page source
# we create a BeautifulSoup object from the content variable
soup=BeautifulSoup(content,"html.parser") # bs4 parses the HTML and lets us pull individual data points out of its tags 

In [18]:
# print(soup.prettify())

In [27]:
print(soup.title.text)

News - Latest News, Breaking News, Bollywood, Sports, Business and Political News | Times of India


In [28]:
headlines=soup.find('div',class_="_3MUkE").text

In [29]:
print(headlines)

Ukraine crisis live: 19 evacuation flights to bring back 3,726 Indians today


In [56]:
list_news=soup.find('div',class_='_2r4Y_ grid_wrapper') 

In [46]:
# list_news[0]

In [59]:
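list_items=list_news.find_all('div',class_='col_l_6') # find_all is the current name for findAll
news_item=[]
list_items=list_items[0:15] # keep only the first 15 items
for news in list_items:
    item=news.find('figcaption')
    try:
        news_item.append(item.text)
    except AttributeError: # find() returned None, so this item has no figcaption
        news_item.append('No text')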
    

In [60]:
news_item

['No report of Indian students being held hostage in Ukraine: MEA',
 'Why are Chechens fighting for both Ukraine and Russia?',
 'UP polls live: BJP will win over 80% seats, says Yogi',
 "US recalls cable saying India, UAE in 'Russia’s camp': Report",
 'How Indians fleeing Ukraine ran into racism',
 "Explained in charts: BJP's dominance in UP Phase 6 seats in 2017",
 'What is the chain of command for potential Russian nuclear strikes?',
 'Asus 8Z vs OnePlus 9RT vs Xiaomi 11T Pro 5G',
 'Empowering career growth with blended education',
 'Experience seamless working on these powerful laptops!',
 'Building personal brands with social commerce',
 "PM Modi to participate in Quad Leaders' meet amid Ukraine crisis",
 'WhatsApp Group vs WhatsApp Broadcast: Difference',
 "Covid live: India's active cases stand at 0.18%",
 'What are 360-degree parking cameras in cars']

In [61]:
# second approach 

news=soup.find_all('div',class_='col_l_6')

In [62]:
len(news)

44

In [63]:
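headlines_list=[] # avoid naming a variable 'list', which would shadow the built-in type
for new in news:
    try:
        item=new.find('a',class_='_3SqZy').text
    except AttributeError: # no matching <a> tag in this block
        item='-'
    headlines_list.append(item) 

In [64]:
len(headlines_list)

44

In [66]:
# headlines_list 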

### Scraping IMDb

In [3]:
data_list = []

page_list = [1, 51, 101, 151, 201] # start offsets for 5 pages of 50 movies each, 250 movies in total 

In [4]:
# data_divs[0]

In [None]:
for page in page_list: # page is the start offset: 1, 51, 101, ...
#     print(page)
    url = f"https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start={page}&ref_=adv_nxt"
    response = requests.get(url) # request the results page 
    data = response.text # .text holds the response body as a string 
    soup = BeautifulSoup(data, 'html.parser')  # soup object used to navigate the HTML tags of the response text 
    data_divs = soup.find_all('div', {'class' : 'lister-item-content'})
    
    for data_div in data_divs:
        data_dict = dict()
        # header text looks like '\n1.\nThe Shawshank Redemption\n(1994)\n',
        # so split('\n') gives ['', '1.', 'The Shawshank Redemption', '(1994)', '']
        header_parts = data_div.find('h3', {'class' : 'lister-item-header'}).text.split('\n')
        data_dict['Rank'] = header_parts[1]
        data_dict['Movie'] = header_parts[2]
        try:
            data_dict['Year'] = header_parts[3]
        except IndexError: # header has no year part
            data_dict['Year'] = None
        try:
            data_dict['Certificate'] = data_div.find('span', {'class' : 'certificate'}).text
        except AttributeError: # find() returned None, so no certificate is listed
            data_dict['Certificate'] = None
        data_dict['Runtime'] = data_div.find('span', {'class' : 'runtime'}).text
        data_dict['Genre'] = data_div.find('span', {'class' : 'genre'}).text
        data_dict['Rating'] = float(data_div.find('strong').text)
        data_dict['Movie description']=data_div.find('p',class_='text-muted').text
        # this takes the first span with name="nv"; confirm on the live page that it is the gross value
        data_dict['Gross Value']=data_div.find('span',{'name':'nv'}).text 
        data_list.append(data_dict) # List of dicts 

In [None]:
# data_list

In [None]:
import pandas as pd 
df = pd.DataFrame(data_list)
df.head() # first five rows of the DataFrame built from the list of dictionaries 

In [None]:
df=df[['Rank','Movie','Certificate','Rating','Movie description','Genre','Runtime','Year','Gross Value']]

In [None]:
# df

In [None]:
# clean the data using regex
import re 

In [6]:
# def clean_movie_name(movie):
#     movie_first_clean=re.sub(r"(\d)(.)",'',movie)
#     movie_name=re.sub(r"[()]",'',movie_first_clean)
#     return movie_name.strip()

# movie= '\n1.\nThe Shawshank Redemption\n(1994)\n'
# print(clean_movie_name(movie))

In [8]:
# df['Movie_title']=df['Movie'].apply(clean_movie_name)
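The commented-out `clean_movie_name` above removes every digit followed by any character, which would also strip digits that belong to a title such as "12 Angry Men". As an alternative sketch, the raw header string can be split on newlines and each part handled separately. `parse_header` is a hypothetical name, and the input format is assumed to match the sample string shown in the comment above.

```python
import re


def parse_header(header_text):
    """Split a raw header like '\\n1.\\nTitle\\n(1994)\\n' into (rank, title, year)."""
    # drop empty pieces produced by leading/trailing newlines
    parts = [p.strip() for p in header_text.split("\n") if p.strip()]
    rank = int(parts[0].rstrip("."))  # '1.' -> 1
    title = parts[1]                  # title is kept untouched, digits included
    year = None
    if len(parts) > 2:
        match = re.search(r"\d{4}", parts[2])  # '(1994)' -> '1994'
        if match:
            year = int(match.group())
    return rank, title, year
```

Because the title is taken as a whole line, no regex ever touches it, so numeric titles survive intact.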

In [111]:
def clean_genre(genre):
    genre=re.sub(r"\n",'',genre)
    return genre.strip() 

# genre= '\nAction, Crime, Drama'
# print(clean_genre(genre))

In [112]:
df['genre']=df['Genre'].apply(clean_genre) # create a new column 'genre' by applying clean_genre to the original 'Genre' column

In [114]:
# df

In [115]:
def clean_year(year):
    year=re.sub(r"[()]",'',year)
    return year.strip()

# year= '(1994)'
# print(clean_year(year))

In [116]:
df['year']=df['Year'].apply(clean_year)

In [118]:
df.dtypes

Rank                  object
Movie                 object
Certificate           object
Rating               float64
Movie description     object
Genre                 object
Runtime               object
Year                  object
Gross Value           object
genre                 object
year                  object
dtype: object
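The dtypes output shows that every column except Rating is still stored as `object`, meaning strings. For numeric work such as sorting by runtime or year, those strings need converting first. The helper below is a minimal sketch using only `re`; `first_int` is a hypothetical name, and the example inputs (such as a runtime of the form '142 min') are assumptions about what the scraped values look like.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    if text is None:
        return None
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

It could be applied column by column, for example `df['Runtime'].apply(first_int)`. When a column already holds clean digit strings, `pd.to_numeric` is the more direct option.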

In [116]:
df= df[['Rank','Movie', 'year', 'genre','Runtime','Certificate','Rating']] # selecting cols to store back in dataframe and ignoring others 

In [118]:
# df

In [37]:
df.to_csv("Imdb_top_250.csv") 

#### Another example 

Repository: https://github.com/kayode-adechinan/pyscrap (used to create a local server to scrape) 

In [41]:
page = requests.get("http://localhost:8000/products.html")
soup = BeautifulSoup(page.text, 'html.parser') # BeautifulSoup accepts either page.text or page.content 
# print(soup.prettify())

In [42]:
def retrieve_all_products():
    print(soup.find_all('li', class_='span4')) # alternative syntax to this --> {'class':'span4'}

if __name__ == '__main__': 
    retrieve_all_products()

[<li class="span4">
<div class="thumbnail">
<a class="overlay" href="product_details.html"></a>
<a class="zoomTool" href="product_details.html" title="add to cart"><span class="icon-search"></span> QUICK VIEW</a>
<a href="product_details.html"><img alt="" src="assets/img/a.jpg"/></a>
<div class="caption cntr">
<p>Product A</p>
<p><strong> $22.00</strong></p>
<h4><a class="shopBtn" href="#" title="add to cart"> Add to cart </a></h4>
<div class="actionList">
<a class="pull-left" href="#">Add to Wish List </a>
<a class="pull-left" href="#"> Add to Compare </a>
</div>
<br class="clr"/>
</div>
</div>
</li>, <li class="span4">
<div class="thumbnail">
<a class="overlay" href="product_details.html"></a>
<a class="zoomTool" href="product_details.html" title="add to cart"><span class="icon-search"></span> QUICK VIEW</a>
<a href="product_details.html"><img alt="" src="assets/img/b.jpg"/></a>
<div class="caption cntr">
<p>Product B</p>
<p><strong> $24.00</strong></p>
<h4><a class="shopBtn" href="#

In [43]:
def retrieve_first_product_price():
    all_products = soup.find_all('li', class_='span4')
    product_one = all_products[0]
    product_one_price = product_one.find("strong")
    print(product_one_price.get_text()) # equivalent to .text 
    print(product_one_price.get_text().strip().strip('$')) # drop whitespace and the $ sign

if __name__ == '__main__':
    retrieve_first_product_price()

 $22.00
22.00


First, we get all products. Then we take the first result and search within it for the price, which sits inside a strong tag. After finding the price we display it, and we can also remove the $ character. As you can see, you can search for an element within the result of a previous search. Unlike the find_all method, which returns a list of elements or an empty list, the find method returns a single element or None.
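The difference in return values matters whenever an element might be missing. The sketch below runs on a small inline HTML string instead of the local server. The markup is a simplified assumption modelled on the product list above, and `price_of` is a hypothetical helper that checks for None before reading the text.

```python
from bs4 import BeautifulSoup

# simplified stand-in for the products page; the second product has no price
html = """
<ul>
  <li class="span4"><p>Product A</p><p><strong> $22.00</strong></p></li>
  <li class="span4"><p>Product B</p></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def price_of(product):
    """Return the price as a float, or None when the product has no <strong> tag."""
    tag = product.find("strong")  # a single Tag, or None if nothing matches
    if tag is None:
        return None
    return float(tag.get_text().strip().lstrip("$"))


products = demo_soup.find_all("li", class_="span4")      # list of matching tags
prices = [price_of(p) for p in products]                  # nested search per product
missing = demo_soup.find_all("li", class_="no-such-class")  # empty list, not None
```

Checking for None explicitly avoids the AttributeError that calling `.get_text()` on a missing element would raise.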

Suppose we want to compare our products using price as the criterion.

In [45]:
def lazy_comparator():
    all_products = soup.find_all('li', class_='span4')
    products = {}
    for product in all_products:
        products[product.find("p").get_text().strip()] = product.find("strong").get_text().strip()
    print(products)
#     print (sorted([(v, k) for k, v in products.items()]))

if __name__ == '__main__':
    lazy_comparator()

{'Product A': '$22.00', 'Product B': '$24.00', 'Product C': '$19.00', 'Product D': '$32.00', 'Product E': '$21.00', 'Product F': '$13.00', 'Product G': '$22.00', 'Product H': '$27.00', 'Product I': '$23.00', 'Product J': '$31.00', 'Product K': '$42.00', 'Product L': '$15.00'}
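The dictionary above stores prices as strings, so sorting it directly would compare text rather than numbers; as strings, '$9.00' sorts after '$13.00'. One way to finish the comparison, sketched here with hypothetical helper names and a subset of the output above, is to convert each price to a float and use that as the sort key.

```python
def price_to_float(price_text):
    """Convert a price string such as '$22.00' to a float."""
    return float(price_text.strip().lstrip("$").replace(",", ""))


def sort_by_price(products):
    """Return (name, price_text) pairs ordered from cheapest to most expensive."""
    return sorted(products.items(), key=lambda item: price_to_float(item[1]))


# subset of the scraped dictionary shown above
sample_products = {
    'Product A': '$22.00',
    'Product B': '$24.00',
    'Product C': '$19.00',
    'Product F': '$13.00',
    'Product K': '$42.00',
}
ranked = sort_by_price(sample_products)
cheapest, most_expensive = ranked[0], ranked[-1]
```

Keeping the original price text in the result means the values can still be displayed exactly as they appeared on the page.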


Reference: https://realpython.com/beautiful-soup-web-scraper-python/

In [102]:
# # Linkedin_Scraper

# import requests 
# from bs4 import BeautifulSoup
# import pandas as pd
# import re
# from time import sleep 

# ids=[]
# titles=[]
# urls=[]
# comp_names=[]
# locations=[]
# dates=[]
# jd=[]
# for i in range(0,100,25): # 100 for 4 pages , if you want to return 1000 jobs , run the loop till 1000 
#     url='https://www.linkedin.com/jobs/search/'

#     params={
#         'f_TPR':'r86400',
#         'geoId':'105214831', 
#         'keywords':'full stack engineer',
#         'location':'India',
#         'start':''
#     }
#     params['start']=str(i)
#     response=requests.get(url=url,params=params)
#     # print(response) # 200 success response
#     soup=BeautifulSoup(response.text,'html.parser')
#     job_items=soup.find('ul',class_='jobs-search__results-list')
#     job_ids=[i['data-entity-urn'] for i in job_items.findAll('div',class_='base-card base-card--link base-search-card base-search-card--link job-search-card')]
#     job_title=[i.text.strip() for i in job_items.findAll('h3',class_='base-search-card__title')]
#     job_url=[url['href'] for url in job_items.findAll('a',class_='base-card__full-link')]
#     descp=[]
#     for url in job_url: # 25 times 
#         sleep(1)
#         response=requests.get(url) 
#         soup=BeautifulSoup(response.text,'html.parser')
#         try:
#             description=soup.find('div',class_='description__text description__text--rich').text.strip()
#             descp.append(description)
#         except:
#             descp.append('NOT FOUND')
#     company=[comp.text.strip() for comp in job_items.findAll('h4',class_='base-search-card__subtitle')] 
#     location=[loc.text.strip() for loc in job_items.findAll('span',class_='job-search-card__location')]
#     date_posted=[date.text.strip() for date in job_items.findAll('time',class_='job-search-card__listdate--new')]
#     ids.extend(job_ids)
#     titles.extend(job_title)
#     urls.extend(job_url)
#     comp_names.extend(company)
#     locations.extend(location)
#     dates.extend(date_posted) 
#     jd.extend(descp)
    
# dict={
#     'Job_ID':ids,
#     'Title':titles,
#     'Job_url':urls,
#     'Company':comp_names,
#     'Location':locations,
#     'Time Posted':dates,
#     'JD':jd
# }

# df=pd.DataFrame.from_dict(dict,orient='index')
# df = df.transpose()
# # df

# def clean_id(ids):
#     ids_=re.sub('urn:li:jobPosting:','',ids)
#     return ids_
    
# df['Job_ID']=df['Job_ID'].apply(clean_id)

# def clean_descp(descp):
#     y=re.sub('\n+','',descp)
#     z=re.sub('Show more                Show less','',y)
#     return z.strip()
# df['JD']=df['JD'].apply(clean_descp)

# # df.head()
# df[['Job_ID']]=df[['Job_ID']].apply(pd.to_numeric) # converting data type of one column into another type 
# df