# Problem Statement
Indianbloggers.org is a directory of Indian bloggers. It lists the most prominent bloggers from India. Our task is to scrape the posts of these bloggers and group them using Topic Modelling, so that posts with related content end up in the same cluster.

### Methodology
* First, we scrape the list of authors and their respective websites.
* Second, we scrape each website to get the list of blog posts and their URLs.
* Third, we scrape the content at each URL.
* Fourth, we clean the text.
* Fifth, we create a CSV file with the author, title, link and text data.

This CSV file will be used for Topic Modelling (day03).


## Activity: Scraping blog content of Indian Bloggers


### First Level Extraction
Extracting all **Authors** and their corresponding **blog links** at **level 1:**

https://indianbloggers.org/


### Second Level Extraction
For each blogger, we will scrape their respective blog page and get the links to the other blog posts by the same blogger.

### Third Level web scraping
In the third level we will extract the **blog contents** of each blog.  



## Import packages

In [1]:
import urllib.request as url
from bs4 import BeautifulSoup as bs
import re
import requests
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

import random
random.seed(123)

## First Level Extraction
Extracting all **Authors** and their corresponding **blog links** at **level 1:**

https://indianbloggers.org/

In [2]:
## Link of the website we want to scrape

link = "https://indianbloggers.org/"

In [3]:
## Read the webpage from the url

html = url.urlopen(link).read()
html
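
Some sites reject requests that lack a browser-like User-Agent, and `urlopen` has no timeout by default. A minimal sketch of a more defensive fetch, assuming hypothetical helper names (`build_request`, `fetch_html`), an arbitrary User-Agent string, and an arbitrary 10-second timeout, none of which are part of the original notebook:

```python
import urllib.request as url

def build_request(link, user_agent="Mozilla/5.0 (tutorial scraper)"):
    """Build a Request that carries an explicit User-Agent header."""
    return url.Request(link, headers={"User-Agent": user_agent})

def fetch_html(link, timeout=10):
    """Return the raw bytes of a page, or None if the request fails."""
    try:
        with url.urlopen(build_request(link), timeout=timeout) as response:
            return response.read()
    except Exception as err:  # URLError, HTTPError, timeouts, malformed URLs
        print(f"Could not fetch {link}: {err}")
        return None
```

Returning `None` on failure lets the caller decide whether to skip the page or retry it.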



In [4]:
## Load the data in a beautiful soup specific format

soup = bs(html, 'html.parser')
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="top indian bloggers, popular indian blogs, india, blogging" name="keywords">
<meta content="Directory of most popular blogs in India. You can meet some of the best Indian Bloggers here and even add your own blog to the bloggers directory." name="description">
<title>Best Indian Blogs - Directory of Most Popular Blogs in India</title>
<meta content="The Best Indian Bloggers" property="og:title">
<meta content="website" property="og:type"/>
<meta content="https://indianbloggers.org/" property="og:url"/>
<meta content="http://img.labnol.org/files/di.png" property="og:image"/>
<meta content="The Best Indian Bloggers" property="og:site_name"/>
<meta content="500808182" property="fb:admins"/>
<meta content="India Blogs is a frequently updated directory of the most popular blogs from India."

In [5]:
type(soup)

bs4.BeautifulSoup

In [6]:
## Beautiful soup helps us find the HTML elements very easily

table = soup.find_all('a')  ## all anchor (<a>) elements
table

[<a href="#intro">Introduction
                                         <span class="indicator">
 <i class="fa fa-angle-right"></i>
 </span>
 </a>,
 <a href="#featured">Featured Blogs
                                         <span class="indicator">
 <i class="fa fa-angle-right"></i>
 </span>
 </a>,
 <a href="#addblog">Add Your Blog
                                         <span class="indicator">
 <i class="fa fa-angle-right"></i>
 </span>
 </a>,
 <a href="#contact">Get in Touch
                                         <span class="indicator">
 <i class="fa fa-angle-right"></i>
 </span>
 </a>,
 <a class="nav_slide_button" href="#" id="nav-toggle">
 <span></span>
 </a>,
 <a class="social-btn" href="https://www.facebook.com/indian.bloggers">
 <i class="fa fa-facebook"></i>
 </a>,
 <a class="social-btn" href="https://twitter.com/indiablogging">
 <i class="fa fa-twitter"></i>
 </a>,
 <a class="social-btn" href="mailto:amit@labnol.org">
 <i class="fa fa-envelope"></i>
 </a>,
 <a href="http

In [7]:
## The first anchor element

table[0]

<a href="#intro">Introduction
                                        <span class="indicator">
<i class="fa fa-angle-right"></i>
</span>
</a>

In [8]:
## The anchor elements that we do not want

table[5]

<a class="social-btn" href="https://www.facebook.com/indian.bloggers">
<i class="fa fa-facebook"></i>
</a>

In [9]:
## The anchor element we want

table[8]

<a href="http://www.labnol.org/">Amit Agarwal</a>

### Text with the anchor elements

In [10]:
table[0].text

'Introduction\n                                        \n\n\n'

In [11]:
table[5].text

'\n\n'

In [12]:
table[8].text

'Amit Agarwal'

### After stripping, we must get at least 2 characters

In [13]:
table[0].text.strip()

'Introduction'

In [14]:
table[5].text.strip()

''

In [15]:
table[8].text.strip()

'Amit Agarwal'

### We get only the link using ['href']

In [16]:
table[0]['href']

'#intro'

In [17]:
table[5]['href']

'https://www.facebook.com/indian.bloggers'

In [18]:
table[8]['href']

'http://www.labnol.org/'

### Look at all the links

In [19]:
## We do not want facebook,twitter,etc
for i in table:
    print(i['href'])

#intro
#featured
#addblog
#contact
#
https://www.facebook.com/indian.bloggers
https://twitter.com/indiablogging
mailto:amit@labnol.org
http://www.labnol.org/
#addblog
#featured
https://indianbloggers.org/blogs/food/
https://indianbloggers.org/blogs/food/
https://indianbloggers.org/blogs/travel/
https://indianbloggers.org/blogs/travel/
https://indianbloggers.org/blogs/technology/
https://indianbloggers.org/blogs/technology/
https://indianbloggers.org/youtube/fashion/
https://indianbloggers.org/youtube/fashion/
https://indianbloggers.org/youtube/food/
https://indianbloggers.org/youtube/food/
https://indianbloggers.org/youtube/mobile/
https://indianbloggers.org/youtube/mobile/
https://indianbloggers.org/youtube/music/
https://indianbloggers.org/youtube/music/
https://indianbloggers.org/blogs/geeks/
https://indianbloggers.org/blogs/geeks/
http://www.kamat.com/jyotsna/blog/
http://www.indiauncut.com/
http://www.whatay.com/
http://hawkeyeview.blogspot.in/
http://www.withinandwithout.com/
htt

In [20]:
# A dictionary stores the data we retrieve, so that we keep
# each anchor's text as well as its link.

## This is a generic model function.
## The link.text filter is not needed for indianbloggers.org itself, but we do need it for the individual blogs later.

## cat holds counts of hosting categories (blogspot, wordpress, others) - just for the sake of curiosity

start_url = "https://indianbloggers.org/"

def extract_web(link):
    html = url.urlopen(link).read()
    soup = bs(html, 'html.parser')
    d = {'title':[],'links':[]}
    #initializing blog hosting category
    cat = {'blogspot':0,'wordpress':0,'others':0}
    # Keep an anchor only if its text has at least 2 characters, its href starts
    # with http, and neither the href nor the text matches the exclusion patterns.
    # Note: an anchor without an href attribute raises KeyError here.
    # Note: the backslash continuations are inside the string literals, so the
    # leading spaces of the continued lines become part of the patterns.
    for a in soup.find_all('a'):
        if len(a.text.strip()) > 1 and bool(re.match('^http',a['href'])) and not bool(
            re.search('indianblogginers|indianbloggers|twitter|facebook|images|\
            youtube|docs.google.com',a['href'])) and not bool(
            re.search('next page|about|store|meeting|google|contact|jan|feb|mar|apr|\
            jun|jul|aug|sep|oct|nov|dec|january|february|march|april|may|june|july|august|\
            september|october|november|december|f.a.q.|faq',a.text.lower())):

            d['title'].append(a.text)
            d['links'].append(a['href'])
            #finding the blog hosting type
            if re.search('blogspot',a['href']):
                cat['blogspot']+=1
            elif re.search('wordpress',a['href']):
                cat['wordpress']+=1
            else:
                cat['others']+=1

    blog_list = pd.DataFrame(d)
    return blog_list,cat

blog_list,cat = extract_web(start_url)
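
One caveat about the filter above: a backslash at the end of a line inside a string literal joins the two lines but keeps the leading spaces of the second one, so alternatives such as `youtube` end up with spaces in front of them inside the pattern. A minimal sketch of one way to avoid this, assuming hypothetical names (`BLOCKED_DOMAINS`, `is_blocked`) and a shortened keyword list; it is not the filter that produced the outputs in this notebook:

```python
import re

# Pitfall: the continuation keeps the indentation of the second line,
# so this pattern is "images|" followed by four spaces and "youtube".
broken = 'images|\
    youtube'

# Joining a list of alternatives keeps the pattern free of stray whitespace.
BLOCKED_DOMAINS = ['indianbloggers', 'twitter', 'facebook', 'images', 'youtube']
blocked_re = re.compile('|'.join(re.escape(word) for word in BLOCKED_DOMAINS))

def is_blocked(href):
    """Return True if the URL contains any blocked keyword."""
    return bool(blocked_re.search(href))

print(bool(re.search(broken, 'https://www.youtube.com/watch')))  # False
print(is_blocked('https://www.youtube.com/watch'))               # True
```

Keeping the keywords in a list also makes it easier to review or extend the exclusions later.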

In [21]:
## Dataframe of Each author and the index page of their blog

blog_list.head(3)

| | title | links |
|---|---|---|
| 0 | Amit Agarwal | http://www.labnol.org/ |
| 1 | Jyotsna Kamat | http://www.kamat.com/jyotsna/blog/ |
| 2 | Amit Varma | http://www.indiauncut.com/ |


In [22]:
## Shape of the dataframe

blog_list.shape

(339, 2)

In [23]:
## Count of categories
cat

{'blogspot': 102, 'wordpress': 45, 'others': 192}
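
The hosting counts above come from a substring search over the whole URL. A hedged alternative, assuming a hypothetical helper `hosting_category`, looks only at the hostname so that a path that merely contains the word `wordpress` is not miscounted:

```python
from collections import Counter
from urllib.parse import urlparse

def hosting_category(link):
    """Classify a blog URL as blogspot, wordpress, or others by its hostname."""
    host = (urlparse(link).hostname or "").lower()
    if "blogspot" in host:
        return "blogspot"
    if "wordpress" in host:
        return "wordpress"
    return "others"

# Three links taken from the first-level listing above.
sample_links = [
    "http://hawkeyeview.blogspot.in/",
    "http://www.labnol.org/",
    "http://www.indiauncut.com/",
]
print(Counter(hosting_category(link) for link in sample_links))
```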

In [24]:
blog_list.to_csv("first_level.csv",index=None)

## Web scraping of the second level from the above table

## Second Level Extraction
For each blogger, we will scrape their respective blog index page and get all of their blog posts.


In [25]:
## We have this:
## title - Name of the Author
## links - link to their index page

blog_list.columns

Index(['title', 'links'], dtype='object')

In [26]:
## Look at a sample link

blog_list['links'][2]

'http://www.indiauncut.com/'

In [27]:
## Small example to show the kind of output we need

sd1 = pd.DataFrame(columns=['title', 'links'])

try:
    sd2,cat = extract_web(blog_list['links'][0])
    sd3 = pd.concat([sd1, sd2])
except:
    pass

In [28]:
## We want this for each author

sd1.head()


| | title | links |
|---|---|---|


## This doesn't work all the time, so we should include exception handling

In [29]:
## For this blog, we are getting an error

sd1 = pd.DataFrame(columns=['title', 'links'])
try:
    sd2,cat = extract_web(blog_list['links'][0])
    sd3 = pd.concat([sd1, sd2])
except:
    pass

In [30]:
sd2.head()

| | title | links |
|---|---|---|
| 0 | Ask a Question | https://digitalinspiration.support |
| 1 | Mail Merge for Gmail | https://gsuite.google.com/marketplace/app/mail... |
| 2 | Document Studio | https://gsuite.google.com/marketplace/app/docu... |
| 3 | Email Studio for Gmail | https://emailstudio.pro |
| 4 | Creator Studio for Slides | https://gsuite.google.com/marketplace/app/crea... |


In [31]:
sd3.head()

| | title | links |
|---|---|---|
| 0 | Ask a Question | https://digitalinspiration.support |
| 1 | Mail Merge for Gmail | https://gsuite.google.com/marketplace/app/mail... |
| 2 | Document Studio | https://gsuite.google.com/marketplace/app/docu... |
| 3 | Email Studio for Gmail | https://emailstudio.pro |
| 4 | Creator Studio for Slides | https://gsuite.google.com/marketplace/app/crea... |


In [32]:
## The link which broke our code

bad_link = blog_list['links'][1]

In [33]:
## Find all anchor elements

html = url.urlopen(bad_link).read()
soup = bs(html, 'html.parser')
table = soup.find_all("a")

In [34]:
## The point where we are getting an error is 384

for index,i in enumerate(table):
    print(index)
    if(bool(re.match('^http',i['href']))):
        print(i['href'])

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
http://www.kamat.org/network/app/login.aspx
34
http://www.kamat.org/network/
35
36
http://www.kamat.org/network/MemberDirectory.aspx
37
http://www.kamat.org/community/
38
https://kamat.com/jyotsna/blog/blog.php?BlogID=1562
39
https://kamat.com/jyotsna/blog/blog.php?BlogID=1545
40
https://kamat.com/kalranga/kar/leaders/pb_desai.htm
41
https://kamat.com/kalranga/people/pioneers/w-jones.htm
42
https://kamat.com/kalranga/kar/writers/5454.htm
43
https://kamat.com/jyotsna/blog/blog.php?BlogID=1539
44
https://kamat.com/kalranga/karavali/honavar/index.htm
45
https://kamat.com/kalranga/kar/literature/epics.htm
46
https://kamat.com/indica/hometown/karki.htm
47
https://kamat.com/kalranga/dharwad/
48
https://kamat.com/jyotsna/blog/blog.php?BlogID=1540
49
https://kamat.com/glossary/?whoID=54
50
https://kamat.com/glossary/?whoID=772
51
https://kamat.com/glossary/?whoID=311
52
https://kamat.com/glossary/?whoID

KeyError: 'href'

In [35]:
## Good link - the href starts with http
table[81]

<a href="https://kamat.com/jyotsna/blog/sesha_shastry_sixty.htm">Seshasastry is Sixty!</a>

In [36]:
## Matched
bool(re.match('^http',table[81]['href']))

True

In [37]:
## Bad link - a relative href that doesn't start with http
table[83]

<a href="/jyotsna/blog/archives.htm">More
      Entries...</a>

In [38]:
## Doesn't match

bool(re.match('^http',table[83]['href']))

False
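
A relative `href` such as `/jyotsna/blog/archives.htm` is still a valid link; it is only written relative to the page it appears on. If such links should be kept rather than skipped, a minimal sketch using the standard library's `urljoin` could look like this (the helper name `absolute_link` is an assumption, not part of the notebook):

```python
from urllib.parse import urljoin

def absolute_link(page_url, href):
    """Resolve an href against the URL of the page it was found on."""
    return urljoin(page_url, href)

page = "http://www.kamat.com/jyotsna/blog/"
print(absolute_link(page, "/jyotsna/blog/archives.htm"))
# http://www.kamat.com/jyotsna/blog/archives.htm
print(absolute_link(page, "https://kamat.com/kalranga/dharwad/"))
# https://kamat.com/kalranga/dharwad/
```

An `href` that is already absolute is returned unchanged, so the helper can be applied to every link before the `^http` check.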

In [39]:
## Error link - this anchor has a name attribute but no href

table[384]

<a name="BottomOfThePage"><p><table border="0" cellpadding="5" cellspacing="0" class="table table-responsive" width="100%"><tr><td colspan="2"><p><font size="4">Merchandise and Link Suggestions</font></p></td></tr><tr><td valign="top"><!---  Google Ad --->
<!--- No Ad for Now -->
<!--- End of Google Ad ---></td><td nowrap="" valign="top"></td></tr><tr><td colspan="2"><p><ul><li><a href="/database/">India Reference Database</a> -- Journals, Research Abstracts, and Primary Sources</li><li><b>Explore More:</b> <a href="/explore/?tag=Amma%27s">Amma's</a>, <a href="/explore/?tag=Column">Column</a></li></ul></p></td></tr><tr><td bgcolor="white" colspan="2"><p align="right"><a href="#">Top of Page</a></p></td></tr></table></p>
<p> 
</p>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(wind

In [40]:
## No href attribute, so indexing with ['href'] raises a KeyError

re.match('^http',table[384]['href'])

KeyError: 'href'

### Our code must be robust to errors like these
#### The web is dynamic, and we can encounter markup that doesn't follow the norm
#### Our code must skip such pages and move on to the next page
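
For the missing-`href` case specifically, there is a lighter option than catching the exception: BeautifulSoup tags support `.get()`, which returns `None` when an attribute is absent. A minimal sketch, assuming a hypothetical helper `http_links`; plain dicts stand in for tags here so that the example runs without fetching a page:

```python
import re

def http_links(anchors):
    """Return the hrefs that start with http, skipping anchors without an href.

    `anchors` can be the result of soup.find_all('a'); anything with a
    dict-style .get() method works.
    """
    links = []
    for anchor in anchors:
        href = anchor.get('href')  # None instead of KeyError when missing
        if href and re.match('^http', href):
            links.append(href)
    return links

# Stand-ins for three anchors seen above: absolute, relative, and name-only.
anchors = [
    {'href': 'https://kamat.com/jyotsna/blog/sesha_shastry_sixty.htm'},
    {'href': '/jyotsna/blog/archives.htm'},
    {'name': 'BottomOfThePage'},
]
print(http_links(anchors))
# ['https://kamat.com/jyotsna/blog/sesha_shastry_sixty.htm']
```

With this approach a single malformed anchor no longer discards every other link on the same page.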


In [41]:
## Create a dataframe with empty columns
blog_list_2nd = pd.DataFrame(columns=['Author','title', 'links'])

## First 10 Authors
for rowum, blogs in blog_list[0:10].iterrows():
    #print(blogs['links'])
    author = blogs['title']
    
    ## Exception handling
    try :
        temp,cat = extract_web(blogs['links'])
#         author = author 
        temp['Author'] = author
        blog_list_2nd = pd.concat([blog_list_2nd, temp])
        print(temp)
        
        ## When we encounter an error - skip it
    except:
        pass

                          title  \
0                Ask a Question   
1          Mail Merge for Gmail   
2               Document Studio   
3        Email Studio for Gmail   
4     Creator Studio for Slides   
5   Mail Merge with Attachments   
6                      Download   
7                     Tutorials   
8                         Video   
9   Save Emails and Attachments   
10                     Download   
11                        Video   
12             Twitter Archiver   
13                     Download   
14                        Video   
15                     Download   
16                    Tutorials   
17                        Video   
18              Document Studio   
19                     Download   
20                        Video   
21                     Download   
22                        Video   
23              YouTube channel   
24         Subscribe on YouTube   
25         Mail Merge for Gmail   
26              Document Studio   
27                  
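
The bare `except: pass` in the loop above hides which authors were skipped and why. A sketch of the same loop that records failures, assuming hypothetical names (`scrape_authors`, `fake_extract`) and plain lists in place of DataFrames so that it runs without network access:

```python
def scrape_authors(authors, extract):
    """Call extract(link) for each (author, link); collect rows and failures."""
    rows, failures = [], []
    for author, link in authors:
        try:
            for title, post_link in extract(link):
                rows.append({'Author': author, 'title': title, 'links': post_link})
        except Exception as err:
            # Remember the failure instead of silently dropping the author.
            failures.append((author, link, repr(err)))
    return rows, failures

def fake_extract(link):
    """Stand-in for extract_web: fails for one blog, like the KeyError above."""
    if 'kamat' in link:
        raise KeyError('href')
    return [('Sample post', link + 'sample-post')]

authors = [('Amit Agarwal', 'http://www.labnol.org/'),
           ('Jyotsna Kamat', 'http://www.kamat.com/jyotsna/blog/')]
rows, failures = scrape_authors(authors, fake_extract)
print(len(rows), failures)
```

The `failures` list can be reviewed after the run to decide whether a blog needs a custom rule or can be dropped.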

In [42]:
## look at the first 10 blog-posts

blog_list_2nd.head(10)

| | Author | title | links |
|---|---|---|---|
| 0 | Amit Agarwal | Ask a Question | https://digitalinspiration.support |
| 1 | Amit Agarwal | Mail Merge for Gmail | https://gsuite.google.com/marketplace/app/mail... |
| 2 | Amit Agarwal | Document Studio | https://gsuite.google.com/marketplace/app/docu... |
| 3 | Amit Agarwal | Email Studio for Gmail | https://emailstudio.pro |
| 4 | Amit Agarwal | Creator Studio for Slides | https://gsuite.google.com/marketplace/app/crea... |
| 5 | Amit Agarwal | Mail Merge with Attachments | https://gsuite.google.com/marketplace/app/mail... |
| 6 | Amit Agarwal | Download | https://gsuite.google.com/marketplace/app/mail... |
| 7 | Amit Agarwal | Tutorials | https://digitalinspiration.com/docs/mail-merge |
| 8 | Amit Agarwal | Video | https://www.youtube.com/watch?v=F07Py7sraDg |
| 9 | Amit Agarwal | Save Emails and Attachments | https://gsuite.google.com/marketplace/app/save... |


In [43]:
blog_list_2nd.to_csv("second_level.csv",index=None)

## Third Level web scraping
In the third level we will extract the **blog contents** of each blog.  


In [44]:
## We have this now:
## Author - Name of the author
## links - Link to each post
## title - Title of that post

blog_list_2nd.columns

Index(['Author', 'title', 'links'], dtype='object')

In [45]:
## Number of blog-post links collected (rows, columns)
blog_list_2nd.shape

(279, 3)

### First 10 Blog-Posts

In [46]:
## Save the data as a dictionary, so that we can convert it into a dataframe later
data = {'Author':[], 'title':[], 'link':[], 'text':[]}

for rownum, row in blog_list_2nd.iloc[0:10,:].iterrows():
    print(row['links'])
    author = row['Author']
    title = row['title']
    link = row['links']

    ## Reset for every post, so a failed request cannot reuse the previous post's text
    text_data = ""

    try:
        html = url.urlopen(link).read()
        soup = bs(html, 'html.parser')

        ## Remove forms nested directly inside paragraphs
        for form in soup.select('p > form'):
            form.decompose()

        ## Remove divs whose id matches the pattern (intended to catch navigation blocks).
        ## Note: [nav] is a character class, so this matches any id whose
        ## second-to-last character is n, a or v.
        for nav in soup.find_all('div', id=re.compile(".*[nav].$")):
            nav.decompose()

        ## Remove links nested directly inside paragraphs
        for text_link in soup.select('p > a'):
            text_link.decompose()

        ## Concatenate the text of all paragraphs, skipping copyright notices
        for text in soup.find_all('p'):
            if 'copyright' not in text.text.lower():
                text_data = text_data + " " + text.text
    except:
        ## When we encounter an error - skip this post
        pass

    if text_data != "":
        data['Author'].append(author)
        data['title'].append(title)
        data['link'].append(link)
        data['text'].append(text_data)

https://digitalinspiration.support
https://gsuite.google.com/marketplace/app/mail_merge_with_attachments/223404411203
https://gsuite.google.com/marketplace/app/document_studio/429444628321
https://emailstudio.pro
https://gsuite.google.com/marketplace/app/creator_studio/509621243108
https://gsuite.google.com/marketplace/app/mail_merge_with_attachments/223404411203
https://gsuite.google.com/marketplace/app/mail_merge_with_attachments/223404411203
https://digitalinspiration.com/docs/mail-merge
https://www.youtube.com/watch?v=F07Py7sraDg
https://gsuite.google.com/marketplace/app/save_emails_and_attachments/513239564707
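
The printed list shows the same Mail Merge URL three times, so the same page is fetched repeatedly. A minimal sketch of order-preserving de-duplication, assuming a hypothetical helper `unique_links`; on the DataFrame itself, `blog_list_2nd.drop_duplicates(subset='links')` would be the pandas counterpart, at the cost of keeping only the first title for each link:

```python
def unique_links(links):
    """Drop repeated URLs while keeping the first occurrence of each."""
    return list(dict.fromkeys(links))

links = [
    "https://digitalinspiration.support",
    "https://gsuite.google.com/marketplace/app/mail_merge_with_attachments/223404411203",
    "https://emailstudio.pro",
    "https://gsuite.google.com/marketplace/app/mail_merge_with_attachments/223404411203",
]
print(unique_links(links))
```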


In [47]:
#Understanding the data

df = pd.DataFrame(data)
df.head()

| | Author | title | link | text |
|---|---|---|---|---|
| 0 | Amit Agarwal | Ask a Question | https://digitalinspiration.support | Please consult our user guides for and . If ... |
| 1 | Amit Agarwal | Mail Merge for Gmail | https://gsuite.google.com/marketplace/app/mail... | Your review, profile name and photo will app... |
| 2 | Amit Agarwal | Document Studio | https://gsuite.google.com/marketplace/app/docu... | Your review, profile name and photo will app... |
| 3 | Amit Agarwal | Email Studio for Gmail | https://emailstudio.pro | Email Studio adds power tools to Gmail includ... |
| 4 | Amit Agarwal | Creator Studio for Slides | https://gsuite.google.com/marketplace/app/crea... | Your review, profile name and photo will app... |


#### Check if the text column has empty values (i.e. no text)

In [48]:
print(df[df['text'] == ""])

Empty DataFrame
Columns: [Author, title, link, text]
Index: []


#### Saving the scraped data to a CSV file (contingency plan, in case the HTML pages are no longer available to scrape)

In [49]:
df.to_csv("Extracted_Blogs_10.csv",sep=',', columns=['Author', 'title', 'link', 'text'], header=True, index=False)

### All blog-posts

In [50]:
## Observations:

# We still see some common lines that are part of text like "india  google apps script  g suite apis", 
#"Posted by Amit Varma", "Essays and Op-Eds"
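
One way to reduce such recurring lines is to strip known phrases and collapse whitespace before saving. A minimal sketch, assuming a hypothetical `clean_text` helper and a hand-written phrase list built from the examples noted above; a real list would have to be collected per blog:

```python
import re

# Phrases taken from the observations above; extend per blog as needed.
BOILERPLATE = ["Posted by Amit Varma", "Essays and Op-Eds"]

def clean_text(text, phrases=BOILERPLATE):
    """Remove known boilerplate phrases and normalise whitespace."""
    for phrase in phrases:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Essays and Op-Eds \n The post body.  Posted by Amit Varma "))
# The post body.
```

Applying such a function to the `text` column before writing the CSV keeps the stored data closer to the actual post content.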

In [51]:
#Understanding the data
print(pd.DataFrame(data).head(4))

         Author                   title  \
0  Amit Agarwal          Ask a Question   
1  Amit Agarwal    Mail Merge for Gmail   
2  Amit Agarwal         Document Studio   
3  Amit Agarwal  Email Studio for Gmail   

                                                link  \
0                 https://digitalinspiration.support   
1  https://gsuite.google.com/marketplace/app/mail...   
2  https://gsuite.google.com/marketplace/app/docu...   
3                            https://emailstudio.pro   

                                                text  
0   Please consult our user guides for  and . If ...  
1    Your review, profile name and photo will app...  
2    Your review, profile name and photo will app...  
3   Email Studio adds power tools to Gmail includ...  


#### Check if the text column has empty values (i.e. no text)

In [52]:
df = pd.DataFrame(data)
print(df[df['text'] == ""])

Empty DataFrame
Columns: [Author, title, link, text]
Index: []


In [53]:
pd.DataFrame(data).head(6)

| | Author | title | link | text |
|---|---|---|---|---|
| 0 | Amit Agarwal | Ask a Question | https://digitalinspiration.support | Please consult our user guides for and . If ... |
| 1 | Amit Agarwal | Mail Merge for Gmail | https://gsuite.google.com/marketplace/app/mail... | Your review, profile name and photo will app... |
| 2 | Amit Agarwal | Document Studio | https://gsuite.google.com/marketplace/app/docu... | Your review, profile name and photo will app... |
| 3 | Amit Agarwal | Email Studio for Gmail | https://emailstudio.pro | Email Studio adds power tools to Gmail includ... |
| 4 | Amit Agarwal | Creator Studio for Slides | https://gsuite.google.com/marketplace/app/crea... | Your review, profile name and photo will app... |
| 5 | Amit Agarwal | Mail Merge with Attachments | https://gsuite.google.com/marketplace/app/mail... | Your review, profile name and photo will app... |


#### Saving the scraped data to a CSV file (contingency plan, in case the HTML pages are no longer available to scrape)

In [54]:
pd.DataFrame(data).to_csv("Extracted_Blogs.csv",sep=',', columns=['Author', 'title', 'link', 'text'], header=True, index=False)