# **Web Scraping using Python:**
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. This notebook walks through extracting data from an Instagram profile: the account holder's name, the number of followers, the number of accounts followed, and the total number of posts.

### **Prerequisites:**
- requests
- BeautifulSoup
- html5lib
- pandas

## Procedure to follow:
1. Downloading web page using the requests library
2. Inspecting the HTML source code of a web page
3. Parsing parts of a website using Beautiful Soup
4. Writing parsed information into CSV files

In [None]:
# pip install requests
# pip install html5lib
# pip install bs4
# pip install pandas

In [132]:
# Import the libraries: requests downloads the web page from Instagram, BeautifulSoup parses it, and pandas writes the CSV file.
import requests
import html5lib
from bs4 import BeautifulSoup
import pandas as pd

### 1. Downloading web page using the requests library

In [3]:
insta_url = "https://www.instagram.com/sushantsinghrajput/"

html = requests.get(insta_url)
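
A bare `requests.get` call can wait a long time if the server stalls, and it does not raise an exception for HTTP error codes. The sketch below wraps the download in a small helper. The name `fetch_page`, the `HEADERS` value, and the 10-second timeout are illustrative assumptions, not part of the original notebook. Instagram may also answer automated requests with a login page, so a successful status code does not guarantee that profile data is present.

```python
import requests

# Hypothetical header value; some sites reject requests that carry no User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; profile-scraper-demo)"}


def fetch_page(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Allow a caller-supplied session (useful for testing or connection reuse).
    session = session if session is not None else requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, the download step would read `page_contents = fetch_page(insta_url)`.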

### 2. Inspecting the HTML source code of a web page

In [4]:
html

<Response [200]>

In [5]:
type(html)

requests.models.Response

In [6]:
html.status_code

200

In [7]:
page_contents = html.text
page_contents[:1000]

'<!DOCTYPE html>\n<html lang="en" class="no-js not-logged-in client-root">\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n        <title>\nSushant Singh Rajput (@sushantsinghrajput) • Instagram photos and videos\n</title>\n\n        \n        <meta name="robots" content="noimageindex, noarchive">\n        <meta name="apple-mobile-web-app-status-bar-style" content="default">\n        <meta name="mobile-web-app-capable" content="yes">\n        <meta name="theme-color" content="#ffffff">\n        <meta id="viewport" name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, viewport-fit=cover">\n        <link rel="manifest" href="/data/manifest.json">\n\n        <link rel="preload" href="/static/bundles/metro/ConsumerUICommons.css/c522a20f31bf.css" as="style" type="text/css" crossorigin="anonymous" />\n<link rel="preload" href="/static/bundles/metro/Consumer.css/635f5c6a587e.css" as="style"

In [8]:
# The Instagram profile page of Sushant Singh Rajput contains more than 200,000 characters.
len(page_contents)

202460

In [9]:
# save the contents to a file with the .html extension.

with open('sushant_singh_rajput_insta.html', 'w', encoding="utf-8") as file:
    file.write(page_contents)
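
Once the page is saved, later parsing experiments can read the local copy instead of requesting the profile again. A minimal sketch, assuming two hypothetical helpers `save_html` and `load_html` built on the standard-library `pathlib` module:

```python
from pathlib import Path


def save_html(path, contents):
    """Write page contents to disk as UTF-8."""
    Path(path).write_text(contents, encoding="utf-8")


def load_html(path):
    """Read a previously saved page back into a string."""
    return Path(path).read_text(encoding="utf-8")
```

For example, `page_contents = load_html('sushant_singh_rajput_insta.html')` would restore the text written by the cell above.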

### 3. Parsing parts of a website using Beautiful Soup

In [10]:
soup = BeautifulSoup(html.text, 'html.parser')

In [11]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js not-logged-in client-root" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <title>
   Sushant Singh Rajput (@sushantsinghrajput) • Instagram photos and videos
  </title>
  <meta content="noimageindex, noarchive" name="robots"/>
  <meta content="default" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="#ffffff" name="theme-color"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, viewport-fit=cover" id="viewport" name="viewport"/>
  <link href="/data/manifest.json" rel="manifest"/>
  <link as="style" crossorigin="anonymous" href="/static/bundles/metro/ConsumerUICommons.css/c522a20f31bf.css" rel="preload" type="text/css">
   <link as="style" crossorigin="anonymous" href="/static/bundles/metro/Consumer.css/635f5c6a587e.css" rel="preload" type="text/css"/>
   <link as="style" crossorigin="anony

In [12]:
type(soup)

bs4.BeautifulSoup

In [13]:
url = soup.find("meta", property="og:url").get("content")
url

'https://www.instagram.com/sushantsinghrajput/'

In [16]:
# Get the og:description meta tag, which summarises followers, following and posts.
# Equivalent CSS-selector form: item = soup.select_one("meta[property='og:description']")
item = soup.find("meta", property="og:description")
item

<meta content="13.2m Followers, 6,350 Following, 87 Posts - See Instagram photos and videos from Sushant Singh Rajput (@sushantsinghrajput)" property="og:description"/>
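
`soup.find` returns `None` when no matching tag exists, so chaining `.get("content")` directly fails with an `AttributeError` if the page lacks the expected meta tag (for instance, when a login page is served instead of the profile). The sketch below shows one way to guard against that. The helper name `meta_content` and the inline `SAMPLE_HTML` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Small inline page standing in for the downloaded profile HTML.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Sushant Singh Rajput (@sushantsinghrajput) • Instagram photos and videos"/>
</head></html>
"""


def meta_content(soup, prop):
    """Return the content of <meta property=prop>, or None if it is absent."""
    tag = soup.find("meta", attrs={"property": prop})
    return tag.get("content") if tag is not None else None


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(meta_content(sample_soup, "og:title"))
print(meta_content(sample_soup, "og:description"))  # tag is missing, so None
```

Checking the return value for `None` before splitting the string gives a clear failure point when the page layout changes.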

In [105]:
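# og:title is expected to look like "Name (@handle) • Instagram photos and videos".
# Keep the text before "•", then drop the "(@handle)" part.
name1 = soup.find("meta", property="og:title").get("content").split("•")[0]
name = name1.split(" (")[0]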
name

'Sushant Singh Rajput'

In [114]:
# The text before the first comma is "13.2m Followers"; keep only the number.
followers1 = item.get("content").split(",")[0]
followers = followers1.split()[0]
followers

'13.2m'
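
The follower count is still a string with a suffix, which cannot be sorted or summed. A minimal sketch of converting it to an integer, assuming the suffixes `k`, `m` and `b` stand for thousand, million and billion (the helper name `parse_count` is hypothetical). Because Instagram rounds the displayed value, the result is an approximation of the true count.

```python
from decimal import Decimal

# Assumed suffix meanings: k = thousand, m = million, b = billion.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text):
    """Convert strings such as '13.2m' or '6,350' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    suffix = cleaned[-1]
    if suffix in MULTIPLIERS:
        # Decimal avoids binary floating-point rounding in 13.2 * 1_000_000.
        return int(Decimal(cleaned[:-1]) * MULTIPLIERS[suffix])
    return int(cleaned)


print(parse_count("13.2m"))  # 13200000
print(parse_count("6,350"))  # 6350
print(parse_count("87"))     # 87
```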

In [111]:
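# The count "6,350" contains a comma, so split(",") breaks it into "6" and "350 Following".
# Rejoin the two pieces; this indexing only works when the count has exactly one thousands separator.
following1 = item.get("content").split(",")[1].strip()
following2 = item.get("content").split(",")[2].strip()
following3 = following1 + following2
following = following3.split()[0]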
following

'6350'

In [106]:
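# Index 3 is the text after the third comma; this relies on the following count ("6,350") containing one comma.
post1 = item.get("content").split(",")[3].strip()
post = post1.split("-")[0].split()[0]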
post

'87'
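
The comma-based splitting above depends on where thousands separators happen to fall in the description. As an alternative sketch, a regular expression can pull all three counts out in one pass. The pattern assumes the "`<n> Followers, <n> Following, <n> Posts`" wording seen in the `og:description` tag, and `parse_stats` is a hypothetical helper name.

```python
import re

# Pattern assumes the "<n> Followers, <n> Following, <n> Posts" wording shown above.
STATS_PATTERN = re.compile(
    r"(?P<followers>[\d.,]+[kmb]?)\s+Followers,\s+"
    r"(?P<following>[\d.,]+[kmb]?)\s+Following,\s+"
    r"(?P<posts>[\d.,]+[kmb]?)\s+Posts",
    re.IGNORECASE,
)


def parse_stats(description):
    """Extract the three counts from an og:description string, or return None."""
    match = STATS_PATTERN.search(description)
    if match is None:
        return None
    # Drop thousands separators so "6,350" becomes "6350".
    return {key: value.replace(",", "") for key, value in match.groupdict().items()}


description = (
    "13.2m Followers, 6,350 Following, 87 Posts - See Instagram photos "
    "and videos from Sushant Singh Rajput (@sushantsinghrajput)"
)
print(parse_stats(description))
```

Returning `None` on a failed match makes it easy to detect a page whose description no longer follows this wording.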

In [125]:
# Collect the parsed fields in a dictionary (outside a notebook, use print(data) to display it).

data = {"Name": name, "Followers": followers, "Following": following, "Posts": post}
data

{'Name': 'Sushant Singh Rajput',
 'Followers': '13.2m',
 'Following': '6350',
 'Posts': '87'}

### 4. Writing parsed information into CSV files

In [129]:
import pandas as pd

df = pd.DataFrame(data, index=[0])
df

                   Name  Followers  Following  Posts
0  Sushant Singh Rajput      13.2m       6350     87


In [131]:
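# index=False keeps the DataFrame's row index out of the file, so no "Unnamed: 0" column appears when it is read back.
df.to_csv("sushant_singh_rajput_insta_Scrap.csv", index=False)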
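
When several profiles are scraped, one dictionary per profile can be collected in a list and written out together. The sketch below uses the standard-library `csv` module rather than pandas, purely so that it runs without extra dependencies. The helper name `rows_to_csv` and the `FIELDS` list are assumptions for illustration.

```python
import csv
import io

# Column order for the output file (matches the keys of the data dictionary).
FIELDS = ["Name", "Followers", "Following", "Posts"]


def rows_to_csv(rows):
    """Serialise a list of profile dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One record per scraped profile; further dictionaries can be appended to this list.
rows = [
    {"Name": "Sushant Singh Rajput", "Followers": "13.2m", "Following": "6350", "Posts": "87"},
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

The returned text can be written to disk with an ordinary `open(..., "w", encoding="utf-8")` call.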