Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It’s a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

### Pick a website and describe your objects

Strategy:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories. For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file.
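
One way to pin down the objects described in the strategy is to write them as small record types before scraping anything. The sketch below is only an illustration: the class names `Topic` and `Repo` and their field names are assumptions chosen to mirror the bullet points, not something the notebook defines.

```python
from dataclasses import dataclass, asdict


@dataclass
class Topic:
    """One entry from the topics listing page (hypothetical record type)."""
    title: str
    description: str
    url: str


@dataclass
class Repo:
    """One repository listed under a topic (hypothetical record type)."""
    name: str
    username: str
    stars: int
    url: str


# Example record built from values that appear on the topics page.
example_topic = Topic(
    title="3D",
    description="3D modeling is the process of virtually developing "
                "the surface and structure of a 3D object.",
    url="https://github.com/topics/3d",
)

# asdict() turns the record into a plain dict, which maps naturally to one CSV row.
print(asdict(example_topic))
```

Writing the fields down first makes it clear which pieces of each page need to be located in the HTML.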

### Use the requests library to download web pages

#### Import Necessary Libraries

In [3]:
import requests

In [4]:
topics_url = 'https://github.com/topics'

In [5]:
response = requests.get(topics_url)

In [6]:
# status_code indicates whether the request was successful (200 means OK)
response.status_code

200
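
HTTP status codes in the range 200 to 299 signal success, while 4xx and 5xx codes signal client and server errors. A minimal sketch of checking a code before parsing the page is shown below; the helper names `is_success` and `describe_status` are hypothetical, and the example uses only the standard library so that it runs without a network connection.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True when the status code is in the 2xx (success) range."""
    return 200 <= status_code < 300


def describe_status(status_code: int) -> str:
    """Return a short label such as '200 OK' for a standard status code."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


for code in (200, 404, 503):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```

With the `response` object from the cell above, the same check would read `is_success(response.status_code)`. The `requests` library also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.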

In [7]:
len(response.text)  # total number of characters in the downloaded page

134320

In [8]:
page_contents = response.text

In [9]:
# show the first 1000 characters of the page's HTML
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-J/5cWm5rrVuxkSgldaK1emf5j30Bs5mRgu0uhuHrG+iwf9mD2LOrkQ32SyN5PADLWzkSDxLS3bW/ScsiM44wzw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-27fe5c5a6e6bad5bb191282575a2b57a.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-W0Cb3tYIxIb58LtOmiY++k5siW1IkzkqaHOXMJpsrZBWMGoaw8M3r5f7RRxa1heGJEDanaTJmAqCJUoMytKNxA==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-5b409bded608c486f9f0bb4e9a263efa.css" />\n    \n    \

### Use Beautiful Soup to parse and extract information

In [10]:
from bs4 import BeautifulSoup  # the module is bs4; the package you install is beautifulsoup4

In [11]:
doc = BeautifulSoup(page_contents, 'html.parser')

# doc is a BeautifulSoup object that holds all the HTML in parsed form,
# so we can now find things in it using queries

In [13]:
p_tags = doc.find_all('p')

In [14]:
len(p_tags)

67

In [15]:
p_tags[:5]

[<p class="f4 color-text-secondary col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         PostgreSQL
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">PostgreSQL is an open source database system.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Vagrant
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Vagrant is an open-source software product for building and maintaining portable virtual software development environments.</p>]

In [17]:
# find_all('p') returns every <p> tag on the page, and most of them are not topic titles,
# so we narrow the search by looking for a specific class

In [25]:
topic_title_tags = doc.find_all('p',{'class' : 'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [26]:
len(topic_title_tags)

30
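
When `find_all` is given a string containing several class names, Beautiful Soup compares it with the tag's full `class` attribute as written, so the same classes in a different order do not match. A CSS selector passed to `select` matches on the individual classes regardless of order. The sketch below illustrates the difference on a made-up HTML snippet; the snippet and variable names are assumptions for the example and are not taken from the GitHub page.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the same three classes written in two different orders.
sample_html = """
<p class="f3 lh-condensed Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed">Ajax</p>
<p class="f5 color-text-secondary">A description, not a title.</p>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact string match: only the tag whose class attribute is written in this order.
exact_matches = sample_doc.find_all("p", {"class": "f3 lh-condensed Link--primary"})

# CSS selector: every <p> that carries both classes, in any order.
selector_matches = sample_doc.select("p.f3.Link--primary")

print([tag.text for tag in exact_matches])
print([tag.text for tag in selector_matches])
```

Selecting on one or two distinctive classes is usually less brittle than copying the whole class string, because sites often reorder or add utility classes.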

In [24]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [27]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [34]:
# we have obtained the p tags corresponding to the 30 topic titles

# let's try and grab the topic description
topic_desc_tags = doc.find_all('p', {'class' : 'f5 color-text-secondary mb-0 mt-1'})

In [35]:
topic_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [39]:
topic_title_tag = topic_title_tags[0]
topic_title_tag

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [40]:
topic_title_tag.parent
# .parent gives us the enclosing <div>, which contains both <p> tags (title and description)

<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
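
Because the title and its description share a parent `<div>`, the parent can be used to keep the two paired. Collecting titles and descriptions with two separate `find_all` calls works only when every topic has both. The sketch below is one possible approach on made-up HTML that mirrors the structure shown above; the shortened class names are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML mirroring the structure above; the second topic
# deliberately has no description.
sample_html = """
<div class="flex-auto"><p class="f3">3D</p><p class="f5">3D modeling.</p></div>
<div class="flex-auto"><p class="f3">Ajax</p></div>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

topics = []
for title_tag in sample_doc.find_all("p", {"class": "f3"}):
    # Look for the description only inside this title's own parent <div>.
    desc_tag = title_tag.parent.find("p", {"class": "f5"})
    topics.append({
        "title": title_tag.text.strip(),
        "description": desc_tag.text.strip() if desc_tag else "",
    })

print(topics)
```

A missing description then becomes an empty string for that topic instead of shifting every later description up by one position.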

In [41]:
div_tag = topic_title_tag.parent

In [42]:
topic_link_tags = doc.find_all('a',{'class':'d-flex no-underline'})

In [43]:
len(topic_link_tags)

30

In [44]:
topic_link_tags[0]  # the link tag for the 3D topic

<a class="d-flex no-underline" data-ga-click="Explore, go to 3d, location:All featured topics" href="/topics/3d">
<div class="color-bg-info f4 color-text-tertiary text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
<div class="d-sm-flex flex-auto">
<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
<div class="d-inline-block js-toggler-container starring-container">
<a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
        action:topics#index;
        text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
<svg aria-hidden="true" class="octicon octicon-star

In [45]:
topic_link_tags[0]['href']

'/topics/3d'

In [47]:
topic0_url = "http://github.com" + topic_link_tags[0]['href']
print(topic0_url)

http://github.com/topics/3d
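
String concatenation works here because every `href` starts with `/`. As a hedged alternative, `urljoin` from the standard library resolves a relative link against a base URL and also copes with a trailing slash on the base or an `href` that is already absolute. The `https://` base below is an assumption; the cell above uses `http://`.

```python
from urllib.parse import urljoin

base_url = "https://github.com"  # assumption: HTTPS instead of the http:// prefix used above

# A root-relative href is resolved against the base.
print(urljoin(base_url, "/topics/3d"))

# A trailing slash on the base does not produce a double slash.
print(urljoin(base_url + "/", "/topics/3d"))

# An href that is already absolute is returned unchanged.
print(urljoin(base_url, "https://example.com/other"))
```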


In [50]:
topic_title_tags[0].text

'3D'

In [52]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

# we now have the whole list of topic titles

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [54]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text)

print(topic_descs)
    

['\n              3D modeling is the process of virtually developing the surface and structure of a 3D object.\n            ', '\n              Ajax is a technique for creating interactive web applications.\n            ', '\n              Algorithms are self-contained sequences that carry out a variety of tasks.\n            ', '\n              Amp is a non-blocking concurrency framework for PHP.\n            ', '\n              Android is an operating system built by Google designed for mobile devices.\n            ', '\n              Angular is an open source web application platform.\n            ', '\n              Ansible is a simple and powerful automation engine.\n            ', '\n              An API (Application Programming Interface) is a collection of protocols and subroutines for building software.\n            ', '\n              Arduino is an open source hardware and software company and maker community.\n            ', '\n              ASP.NET is a web framework for bu

In [57]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())   # strip() removes leading and trailing whitespace

topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
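
The append loop above can also be written as a list comprehension. The sketch below uses plain strings as stand-ins for the `tag.text` values so that it runs on its own; `raw_texts` and `cleaned` are names made up for the example.

```python
# Stand-ins for tag.text values, with the surrounding whitespace seen in the raw output.
raw_texts = [
    "\n              Ajax is a technique for creating interactive web applications.\n            ",
    "\n              Amp is a non-blocking concurrency framework for PHP.\n            ",
]

# strip() removes whitespace (spaces, tabs, newlines) at both ends only;
# spaces between words are left untouched.
cleaned = [text.strip() for text in raw_texts]

print(cleaned)
```

With the notebook's variables, the equivalent line would be `topic_descs = [tag.text.strip() for tag in topic_desc_tags]`.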

In [65]:
topic_urls = []
base_url = 'http://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['http://github.com/topics/3d',
 'http://github.com/topics/ajax',
 'http://github.com/topics/algorithm',
 'http://github.com/topics/amphp',
 'http://github.com/topics/android',
 'http://github.com/topics/angular',
 'http://github.com/topics/ansible',
 'http://github.com/topics/api',
 'http://github.com/topics/arduino',
 'http://github.com/topics/aspnet',
 'http://github.com/topics/atom',
 'http://github.com/topics/awesome',
 'http://github.com/topics/aws',
 'http://github.com/topics/azure',
 'http://github.com/topics/babel',
 'http://github.com/topics/bash',
 'http://github.com/topics/bitcoin',
 'http://github.com/topics/bootstrap',
 'http://github.com/topics/bot',
 'http://github.com/topics/c',
 'http://github.com/topics/chrome',
 'http://github.com/topics/chrome-extension',
 'http://github.com/topics/cli',
 'http://github.com/topics/clojure',
 'http://github.com/topics/code-quality',
 'http://github.com/topics/code-review',
 'http://github.com/topics/compiler',
 'http://github.com/to

In [67]:
import pandas as pd

In [68]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}
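
`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary have different lengths, which can happen if one of the searches matched fewer tags than the others. A small check before building the DataFrame gives a clearer error message. The helper name `check_equal_lengths` is hypothetical and the sample data is made up for the example.

```python
def check_equal_lengths(columns: dict) -> int:
    """Return the shared length of all lists, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


sample_columns = {
    "title": ["3D", "Ajax"],
    "description": ["3D modeling.", "Ajax is a technique."],
    "url": ["http://github.com/topics/3d", "http://github.com/topics/ajax"],
}

print(check_equal_lengths(sample_columns))
```

In the notebook, the call would be `check_equal_lengths(topics_dict)`, placed just before `pd.DataFrame(topics_dict)`.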

In [71]:
topics_df = pd.DataFrame(topics_dict)
topics_df

|   | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | http://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | http://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | http://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | http://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | http://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | http://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | http://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | http://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | http://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | http://github.com/topics/aspnet |


### Create CSV file(s) with the extracted information

In [72]:
topics_df.to_csv('topics.csv')
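
By default, `to_csv` also writes the row index as an extra first column with an empty header. When such a file is read back with `pd.read_csv`, that column appears as `Unnamed: 0`. The sketch below shows the difference that `index=False` makes, using a small made-up DataFrame and in-memory strings instead of files.

```python
import io

import pandas as pd

sample_df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "url": ["http://github.com/topics/3d", "http://github.com/topics/ajax"],
})

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = sample_df.to_csv()
# index=False leaves the index out of the file.
without_index = sample_df.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the first variant back shows where an "Unnamed: 0" column comes from.
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
```

If the index carries no meaning, as with the default 0, 1, 2, ... numbering here, `topics_df.to_csv('topics.csv', index=False)` would keep the file limited to the three data columns.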

We now have a CSV file that holds the extracted information in a structured form.
