# Web Scraping Using BeautifulSoup


Data science projects often need data that is only available on websites. Web scraping lets us collect that data and turn it into a dataframe.
Here I scrape data from a website using BeautifulSoup and build a dataframe from the scraped data. For this example I have chosen the public crime data that **Bangladesh Police** publish on their website.
The website is: https://www.police.gov.bd/

So first we will import the necessary libraries. 

**N.B.** To follow along properly, please download this repository, as I have used some images to explain the steps.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import warnings
warnings.filterwarnings("ignore")

We import pandas to build the dataframe from the scraped data.

numpy converts the scraped list into a 2D array, which is easy to turn into a dataframe.

BeautifulSoup parses the HTML and lets us pull the data out of the page.

requests fetches the whole document of a web page so that we can work on its contents as needed.

warnings is used only to suppress warning messages.


link: https://www.police.gov.bd/en/crime_statistic/year/2010

If you open this link, you should see something like this.

<img src="table.png">


We are going to scrape the data shown in this table. Let's do this part by part.

In [2]:
data = requests.get("https://www.police.gov.bd/en/crime_statistic/year/2010")

We start with the link above because **Bangladesh Police** have published their crime report data for 2010-2019.
Our target is to make a dataframe for each year and combine them into a single dataframe using concatenation.
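The request above can be made a little more defensive. The sketch below is only one way to do it: the helper name `fetch_html`, its `session` parameter and the 10-second timeout are my own assumptions, not part of the notebook. It relies on the `timeout` argument and `raise_for_status()` from requests, so a failed download stops early instead of handing an error page to the parser.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of a page as text, raising on HTTP errors.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections across the yearly pages, or leave it as None.
    """
    http = session if session is not None else requests.Session()
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx replies
    return response.text


# Example call (needs network access, so it is left commented out):
# html = fetch_html("https://www.police.gov.bd/en/crime_statistic/year/2010")
```

The returned text plays the same role as `data.text` in the cells that follow.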

The variable **'data'** now holds the response for the above URL, including the full HTML of the page. Before we work on it, look at the following two images. <img src="inspect.png">

Put your cursor on the table and right-click. A menu like this should appear. Select the **Inspect** option.

You should now see something like this <img src="element.png">

Here you can see that the data we need sits inside **td** tags, so we are going to collect everything under the **td** tags.
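To see what selecting **td** tags gives us, here is a minimal sketch on a tiny hand-written table. The HTML is a made-up stand-in for the real page, using only the first few values of the DMP and CMP rows from the 2010 data. `get_text(strip=True)` is used here as a shortcut for reading the text of each cell.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the crime table; the real page has many more columns.
sample_html = """
<table>
  <tr><th>Unit Name</th><th>Dacoity</th><th>Robbery</th></tr>
  <tr><td>DMP</td><td>47</td><td>220</td></tr>
  <tr><td>CMP</td><td>16</td><td>108</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select('td') returns the td elements in document order; th cells are not matched.
cells = [td.get_text(strip=True) for td in soup.select("td")]
print(cells)  # ['DMP', '47', '220', 'CMP', '16', '108']
```

The header cells are left out because they use **th** rather than **td**, which is why the column names have to be supplied by hand later.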

In [3]:
# Parse the HTML we received in the data variable so that we can search it for td tags.
# This variable contains all the HTML elements of the page.
Parser = BeautifulSoup(data.text, 'html.parser')
# Select every td element. This variable is a list of td tags, each still wrapped in its markup.
Select = Parser.select('td')
# Join the td elements into one string and parse that string again.
Parser_again = BeautifulSoup(("".join(str(x) for x in Select)), 'html.parser')
# Collect only the text nodes, which gives a flat list of the cell values.
Result = Parser_again.find_all(text=True)
Result[:18] # First 18 values of the list

['DMP',
 '47',
 '220',
 '245',
 '363',
 '3',
 '1370',
 '139',
 '155',
 '555',
 '1915',
 '7228',
 '518',
 '82',
 '10535',
 '144',
 '11279',
 '23519']

We have finished parsing the data. Now we are going to make a dataframe from this list.

In [4]:
# Create a numpy ndarray with 19 rows and 18 columns from the list, matching the 19 rows of the table.
P_Data = np.array(Result).reshape(19,18)
# The table itself has 18 columns (the unit name plus 17 numeric columns), so we add an extra column
# named Year to make it clear which year the data belongs to.
df = pd.DataFrame({'Unit_Name': P_Data[:, 0], 'Dacoity': P_Data[:, 1], 'Robbery': P_Data[:, 2], 'Murder': P_Data[:, 3], 'Speedy_Trial': P_Data[:, 4], 
                   'Riot': P_Data[:, 5],'Woman&Child_Repression': P_Data[:, 6], 'Kidnapping': P_Data[:, 7],'Police_Assault': P_Data[:, 8], 'Burglary': P_Data[:, 9],
                   'Theft': P_Data[:, 10],'Other_Cases': P_Data[:, 11],'Arms_Act': P_Data[:, 12],'Explosive': P_Data[:, 13],'Narcotics': P_Data[:, 14]
                   ,'Smuggling': P_Data[:, 15],'Total_Recovered_Cases': P_Data[:, 16],'Total_Cases': P_Data[:, 17],'Year': 2010})
# Created the dataframe for the year 2010

In [5]:
df

Unnamed: 0,Unit_Name,Dacoity,Robbery,Murder,Speedy_Trial,Riot,Woman&Child_Repression,Kidnapping,Police_Assault,Burglary,Theft,Other_Cases,Arms_Act,Explosive,Narcotics,Smuggling,Total_Recovered_Cases,Total_Cases,Year
0,DMP,47,220,245,363,3,1370,139,155,555,1915,7228,518,82,10535,144,11279,23519,2010
1,CMP,16,108,94,31,7,455,37,31,123,314,1831,51,0,866,99,1016,4063,2010
2,KMP,3,9,29,25,0,153,11,4,65,91,551,19,2,792,13,826,1767,2010
3,RMP,4,20,21,9,15,157,9,12,53,106,578,3,4,332,248,587,1571,2010
4,BMP,8,12,19,21,0,112,6,8,24,83,557,17,0,155,117,289,1139,2010
5,SMP,12,33,33,34,1,104,14,12,33,154,866,14,0,154,20,188,1484,2010
6,Dhaka Range,162,199,1153,362,7,4272,171,71,643,1477,19966,309,30,4459,993,5791,34274,2010
7,Mymensingh Range,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2010
8,Chittagong Range,153,122,639,245,32,2915,111,87,429,998,12985,235,20,4730,612,5597,24313,2010
9,Sylhet Range,85,43,245,75,17,848,41,19,186,524,5266,30,4,905,172,1111,8460,2010
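Every value in the scraped list is a string, so the count columns of the dataframe hold text rather than numbers. If you want to do arithmetic on them, one possible approach is sketched below with `pd.to_numeric`. The small `df_sample` frame is a hypothetical stand-in built from a few values in the table above. The same pattern should carry over to the full `df`, assuming every count cell contains a plain integer.

```python
import pandas as pd

# Hypothetical stand-in for the scraped dataframe: counts arrive as strings.
df_sample = pd.DataFrame({
    "Unit_Name": ["DMP", "CMP"],
    "Dacoity": ["47", "16"],
    "Robbery": ["220", "108"],
    "Year": 2010,
})

# Every column except the unit name and the year holds a count.
count_cols = df_sample.columns.drop(["Unit_Name", "Year"])

# Convert each count column from text to numbers.
df_sample[count_cols] = df_sample[count_cols].apply(pd.to_numeric)

print(df_sample["Dacoity"].sum())  # 63
```

Without the conversion, summing a column would join the strings together instead of adding the numbers.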


We now have the dataset for 2010. Next we fetch the data for 2011-2019 using a for loop.

In [6]:
link = "https://www.police.gov.bd/en/crime_statistic/year/"
for year in range(2011, 2020):  # 2011 through 2019
    # Download and parse the page for this year, exactly as we did for 2010.
    data_from_web = requests.get(link+str(year))
    Parser = BeautifulSoup(data_from_web.text, 'html.parser')
    Select = Parser.select('td')
    Parser_again = BeautifulSoup(("".join(str(x) for x in Select)), 'html.parser')
    Result = Parser_again.find_all(text=True)
    P_Data = np.array(Result).reshape(19,18)
    dd = pd.DataFrame({'Unit_Name': P_Data[:, 0], 'Dacoity': P_Data[:, 1], 'Robbery': P_Data[:, 2], 'Murder': P_Data[:, 3], 'Speedy_Trial': P_Data[:, 4], 
                   'Riot': P_Data[:, 5],'Woman&Child_Repression': P_Data[:, 6], 'Kidnapping': P_Data[:, 7],'Police_Assault': P_Data[:, 8], 'Burglary': P_Data[:, 9],
                   'Theft': P_Data[:, 10],'Other_Cases': P_Data[:, 11],'Arms_Act': P_Data[:, 12],'Explosive': P_Data[:, 13],'Narcotics': P_Data[:, 14]
                   ,'Smuggling': P_Data[:, 15],'Total_Recovered_Cases': P_Data[:, 16],'Total_Cases': P_Data[:, 17],'Year': year})
    df = pd.concat([df,dd], ignore_index=True) # Concatenate each year's data so that we end up with a single dataset.
df

Unnamed: 0,Unit_Name,Dacoity,Robbery,Murder,Speedy_Trial,Riot,Woman&Child_Repression,Kidnapping,Police_Assault,Burglary,Theft,Other_Cases,Arms_Act,Explosive,Narcotics,Smuggling,Total_Recovered_Cases,Total_Cases,Year
0,DMP,47,220,245,363,3,1370,139,155,555,1915,7228,518,82,10535,144,11279,23519,2010
1,CMP,16,108,94,31,7,455,37,31,123,314,1831,51,0,866,99,1016,4063,2010
2,KMP,3,9,29,25,0,153,11,4,65,91,551,19,2,792,13,826,1767,2010
3,RMP,4,20,21,9,15,157,9,12,53,106,578,3,4,332,248,587,1571,2010
4,BMP,8,12,19,21,0,112,6,8,24,83,557,17,0,155,117,289,1139,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
185,Railway Range,0,1,2,0,0,0,0,0,0,5,9,0,0,55,12,67,84,2019
186,GMP,2,3,3,1,0,22,1,2,2,8,65,3,0,130,2,135,244,2019
187,RPMP,0,0,1,0,0,12,1,0,0,6,33,0,0,68,0,68,121,2019
188,ATU,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2019
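The 2010 cell and the loop repeat the same parsing steps. One way to avoid the repetition, sketched here under my own assumptions, is a small helper that turns one page into a dataframe, followed by a single `pd.concat` over all years. The names `table_to_frame`, `columns` and `pages` are hypothetical. The HTML is made up so that the example runs without a network connection, and the 2011 values are placeholders, not real figures. Using `reshape(-1, len(columns))` lets numpy work out the row count instead of hard-coding 19.

```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(html, columns, year):
    """Parse the td cells of one page into a dataframe tagged with its year."""
    soup = BeautifulSoup(html, "html.parser")
    cells = [td.get_text(strip=True) for td in soup.select("td")]
    # -1 asks numpy to infer the number of rows; this raises ValueError
    # if the cell count is not a multiple of the number of columns.
    rows = np.array(cells).reshape(-1, len(columns))
    frame = pd.DataFrame(rows, columns=columns)
    frame["Year"] = year
    return frame


columns = ["Unit_Name", "Dacoity", "Robbery"]

# Made-up pages keyed by year; in the notebook these would be downloaded HTML.
pages = {
    2010: "<table><tr><td>DMP</td><td>47</td><td>220</td></tr>"
          "<tr><td>CMP</td><td>16</td><td>108</td></tr></table>",
    2011: "<table><tr><td>Unit A</td><td>1</td><td>2</td></tr>"
          "<tr><td>Unit B</td><td>3</td><td>4</td></tr></table>",
}

# Build one frame per year, then concatenate once at the end.
frames = [table_to_frame(html, columns, year) for year, html in pages.items()]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the frames in a list and concatenating once also avoids copying the growing dataframe on every pass of the loop.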


In [7]:
df.to_csv("Police_Data.csv") # Save the dataset as a CSV file so that we don't need to scrape the data again and again.
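The `Unnamed: 0` column in the tables above is what pandas usually shows when a CSV that was saved together with its index is read back. If that extra column is unwanted, one option is to pass `index=False` when saving. The sketch below demonstrates this with an in-memory buffer instead of a file; `df_sample` is a hypothetical stand-in that borrows two totals from the 2010 table.

```python
import io

import pandas as pd

# Hypothetical stand-in for the scraped dataframe.
df_sample = pd.DataFrame({
    "Unit_Name": ["DMP", "CMP"],
    "Total_Cases": [23519, 4063],
    "Year": 2010,
})

# Write without the index column; a StringIO buffer stands in for a file on disk.
buffer = io.StringIO()
df_sample.to_csv(buffer, index=False)

# Read the CSV text back to check which columns come out.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())  # ['Unit_Name', 'Total_Cases', 'Year']
```

If the index was already saved, reading the file with `pd.read_csv("Police_Data.csv", index_col=0)` should have a similar effect.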

I don't know whether this is the best approach for scraping a website and building a dataset from it, but I kept it as simple as I could. Suggestions for better approaches to web scraping and dataset building are welcome.

For any kind of suggestion, reach me on Facebook: <a href="https://www.facebook.com/P0l0kN"> Hasibul Islam Polok </a>


by email: <a href="mailto:polok.hasibul@gmail.com"> polok.hasibul@gmail.com </a>


or on LinkedIn: <a href="https://www.linkedin.com/in/polokn"> Md Hasibul Islam </a>