# WEB SCRAPING IN GITHUB

### WEB SCRAPING:
- Extracting data from a website's pages programmatically is called WEB SCRAPING.
- When you're unable to download the data from a website in any ready-made format, use WEB SCRAPING.
- It is most useful when the data is spread across a number of pages.
    

## Pick a website and describe your objective:
- Browse through different sites and pick one to scrape.
- Identify the information you'd want to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.


## Project Outline:
- I am going to scrape https://github.com/topics
- I'll get a list of topics. For each topic, I'll get the topic title, topic page URL, and topic description.
- For each topic, I'll get the top 25 repositories in the topic from the topic page.
- For each repository, I'll grab the repo name, username, stars, and repo URL.
- For each topic, I'll create a CSV file in the following format:


Repo Name, Username, Stars, Repo URL

three.js, mrdoob, 69700, https://github.com/mrdoob/three.js

libgdx, libgdx, 18300, https://github.com/libgdx/libgdx
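
As a rough illustration of the target format, the sketch below writes the two sample rows above with Python's built-in `csv` module. The `rows` list and the in-memory buffer are stand-ins chosen for this example; the notebook itself writes its files with pandas later on.

```python
import csv
import io

# Column order taken from the project outline above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

# The two sample rows from the outline, used here as placeholder data.
rows = [
    ["three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"],
    ["libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"],
]

# Write to an in-memory buffer so the sketch leaves no file behind;
# swap it for open("3D.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```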





## Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
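
One possible shape for the download helper described in the last bullet is sketched below. The names `topic_page_url` and `download_page`, the `pages/` folder in the usage note, and the 30-second timeout are assumptions made for this example, not part of the notebook.

```python
import os


def topic_page_url(topic_slug, base_url="https://github.com/topics"):
    """Build the URL of a topic page, e.g. '3d' -> https://github.com/topics/3d."""
    return "{}/{}".format(base_url, topic_slug)


def download_page(url, path):
    """Download `url`, save the HTML to `path`, and return the HTML text."""
    # Imported here so the URL helper above can be used without requests installed.
    import requests

    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(url))

    # Create the target folder if the path includes one.
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)

    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text
```

A call such as `download_page(topic_page_url('3d'), 'pages/3d.html')` would then fetch one topic page and keep a local copy.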

In [1]:
import requests

In [2]:
topic_url = 'https://github.com/topics'

In [3]:
print(topic_url)

https://github.com/topics


In [4]:
response = requests.get(topic_url)

In [5]:
response.status_code

200

In [6]:
len(response.text)

166276

In [7]:
page_contents = response.text

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-b92e9647318f.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" m

In [9]:
with open('webpage.html','w',encoding = 'utf-8') as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.

In [10]:
!pip install beautifulsoup4 --quiet

In [11]:
from bs4 import BeautifulSoup

In [12]:
doc = BeautifulSoup(page_contents,'html.parser')

In [13]:
type(doc)

bs4.BeautifulSoup

In [14]:
topic_title_tags = doc.find_all('p')

In [15]:
len(topic_title_tags)

69

In [16]:
topic_title_tags[:10]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Maven
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Maven is a build automation tool used primarily for Java projects.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         SpaceVim
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">SpaceVim is a community-driven distribution of the vim editor that allows managing your plugins in layers.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Markdown
       </p>,

In [17]:
select_class="f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = doc.find_all('p',{'class': select_class})
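
A note on why the full class string matters: when `find_all` is given a multi-class string, Beautiful Soup compares it with the tag's whole `class` attribute, so the same classes in a different order do not match. The sketch below demonstrates this on two sample tags modelled on the topic cards (not the live page) and shows a CSS selector as a less order-sensitive alternative.

```python
from bs4 import BeautifulSoup

# Two tags modelled on the topic cards in this notebook (sample markup only).
sample_html = (
    '<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>'
    '<p class="f5 color-fg-muted mb-0 mt-1">A description</p>'
)
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact class string, in the same order as the markup: matches the title tag.
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# Same classes in a different order: no match.
reordered = sample_doc.find_all("p", {"class": "Link--primary f3 lh-condensed mb-0 mt-1"})

# CSS selector: matches any <p> carrying both classes, whatever the order.
by_css = sample_doc.select("p.f3.Link--primary")

print(len(exact), len(reordered), len(by_css))
```

Either approach depends on GitHub's current class names, so the selector may need updating whenever the page markup changes.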

In [18]:
len(topic_title_tags)

30

In [19]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [20]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class': desc_selector})

In [21]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [22]:
topic_title_tag0= topic_title_tags[0]

In [23]:
parent_tag = topic_title_tag0.parent  # the enclosing <a> tag, not a <div>

In [24]:
parent_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

In [25]:
topic_link_tags = doc.find_all('a' ,{'class':'no-underline flex-1 d-flex flex-column'})

In [26]:
len(topic_link_tags)

30
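
Because each `<a>` card wraps both the title and the description, an alternative to collecting three separate lists is to parse one card at a time into a dictionary. The sketch below assumes the card structure shown in the output above; `parse_topic_card` and `sample_doc` are hypothetical names, and the live markup may differ.

```python
from bs4 import BeautifulSoup

# One topic card, copied from the structure printed earlier in this notebook.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
"""


def parse_topic_card(link_tag, base_url="https://github.com"):
    """Turn one <a> topic card into a dictionary (assumes exactly two <p> children)."""
    title_tag, desc_tag = link_tag.find_all("p")
    return {
        "title": title_tag.text.strip(),
        "description": desc_tag.text.strip(),
        "url": base_url + link_tag["href"],
    }


sample_doc = BeautifulSoup(sample_html, "html.parser")
cards = [parse_topic_card(a) for a in sample_doc.find_all("a")]
print(cards)
```

Parsing per card keeps each title paired with its own description and URL, which avoids misaligned rows if one card is missing a field.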

In [27]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
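
String concatenation works here because every `href` on the page is a site-relative path. As a slightly more defensive variant, the standard library's `urljoin` also copes with an `href` that is already absolute. The helper name `absolute_url` is made up for this sketch.

```python
from urllib.parse import urljoin

base_url = "https://github.com"


def absolute_url(href):
    """Resolve an href from the page against the site's base URL."""
    return urljoin(base_url, href)


# A site-relative path, as found on the topics page.
print(absolute_url("/topics/3d"))

# An href that is already absolute is returned unchanged.
print(absolute_url("https://github.com/topics/ajax"))
```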


In [28]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [29]:
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
print(topic_descs)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

In [30]:
topic_urls = []
base_url = "https://github.com"
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [31]:
import pandas as pd

In [32]:
topic_dict = {'title' : topic_titles , 'description' : topic_descs , 'url' : topic_urls}

In [33]:
topic_df = pd.DataFrame(topic_dict)
topic_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [34]:
topic_df.to_csv("topics.csv",index=None)
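
The DataFrame above only lines up correctly if the three lists have the same length and order; pandas raises an error when the lists in such a dictionary differ in length. A small check before building the table can make that failure easier to read. `build_topic_rows` is a hypothetical helper written for this sketch.

```python
def build_topic_rows(titles, descriptions, urls):
    """Combine three parallel lists into one list of row dictionaries."""
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            "Length mismatch: {} titles, {} descriptions, {} urls".format(
                len(titles), len(descriptions), len(urls)
            )
        )
    return [
        {"title": t, "description": d, "url": u}
        for t, d, u in zip(titles, descriptions, urls)
    ]


# Placeholder data shaped like the lists built in the cells above.
rows = build_topic_rows(
    ["3D", "Ajax"],
    ["3D graphics and modeling.", "A technique for interactive web apps."],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(rows[0])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`.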

## Getting Information out of a Topic Page

In [35]:
topic_page_url = topic_urls[0]

In [36]:
print(topic_page_url)

https://github.com/topics/3d


In [37]:
response = requests.get(topic_page_url)

In [38]:
response.status_code

200

In [39]:
len(response.text)

484095

In [40]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [41]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

In [42]:
len(repo_tags)

20

In [43]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href

In [44]:
a_tags=repo_tags[0]

In [45]:
a_tags = repo_tags[0].find_all('a')

In [46]:
a_tags[0].text.strip()

'mrdoob'

In [47]:
a_tags[1].text.strip()

'three.js'

In [48]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


### Return all the required info about a repository

In [49]:
def get_repo_info(h3_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    return username,repo_name,repo_url

In [50]:
get_repo_info(repo_tags[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js')
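
The project outline also lists a star count for each repository, but the cells above extract only the username, repository name, and URL. If the star count is scraped as display text, it would still need converting to a number. The helper below is a hypothetical sketch that assumes abbreviated strings such as `69.7k`; the notebook does not show the relevant markup, so no selector is given here.

```python
def parse_star_count(stars_str):
    """Convert a display string such as '69.7k' or '812' to an integer."""
    stars_str = stars_str.strip().lower().replace(",", "")
    if stars_str.endswith("k"):
        # '69.7k' -> 69.7 * 1000; round() absorbs floating-point noise.
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)


print(parse_star_count("69.7k"))
print(parse_star_count("812"))
```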

In [51]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'repo_url' : [],
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['repo_url'].append(repo_info[2])

In [52]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [53]:
topic_repos_df

Unnamed: 0,username,repo_name,repo_url
0,mrdoob,three.js,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,https://github.com/CesiumGS/cesium
9,MonoGame,MonoGame,https://github.com/MonoGame/MonoGame


## FINAL CODE

In [110]:
import os

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup and return the parsed document
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    return username,repo_name,repo_url


def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, username and repo URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'repo_url': [],
    }
    # Get repo info for each h3 tag
    for repo_tag in repo_tags:
        repo_info = get_repo_info(repo_tag)
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['repo_url'].append(repo_info[2])

    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url , path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path ,index = None)

In [74]:
get_topic_repos(get_topic_page(topic_urls[7])).to_csv("api.csv", index = None)

### Write a single function to:
  - 1. Get a list of topics from the topics page
  - 2. Get the list of top repos from each individual topic page
  - 3. For each topic, create a CSV of the top repos for that topic

In [111]:
def get_topic_titles(doc):
    select_class="f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p',{'class': select_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p',{'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a' ,{'class':'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls  

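def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse the freshly downloaded page; without this line the helpers
    # would silently fall back to the global `doc` from the earlier cells
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description' : get_topic_descs(doc),
        'url' : get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)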

In [112]:
def scrape_topics_repos():
    print('scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data',exist_ok=True)
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
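
The function above uses each topic title directly as a file name. That works for the titles listed in this notebook, but a title containing a path separator such as `/` would be read as a subfolder. The sketch below shows one way to build a safer path; `topic_filename` and its character whitelist are assumptions made for this example.

```python
import re


def topic_filename(title, folder="data"):
    """Return a CSV path for a topic title using only filesystem-safe characters."""
    # Replace each run of characters other than letters, digits, '.', '+', '_' and '-'.
    safe = re.sub(r"[^A-Za-z0-9.+_-]+", "_", title.strip())
    return "{}/{}.csv".format(folder, safe)


print(topic_filename("Command line interface"))
print(topic_filename("C++"))
```

With this helper, the call inside the loop could become `scrape_topic(row['url'], topic_filename(row['title']))`.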
        

In [113]:
scrape_topics_repos()

scraping list of topics
scraping top repositories for "3D"
scraping top repositories for "Ajax"
scraping top repositories for "Algorithm"
scraping top repositories for "Amp"
scraping top repositories for "Android"
scraping top repositories for "Angular"
scraping top repositories for "Ansible"
scraping top repositories for "API"
scraping top repositories for "Arduino"
scraping top repositories for "ASP.NET"
scraping top repositories for "Atom"
scraping top repositories for "Awesome Lists"
scraping top repositories for "Amazon Web Services"
scraping top repositories for "Azure"
scraping top repositories for "Babel"
scraping top repositories for "Bash"
scraping top repositories for "Bitcoin"
scraping top repositories for "Bootstrap"
scraping top repositories for "Bot"
scraping top repositories for "C"
scraping top repositories for "Chrome"
scraping top repositories for "Chrome extension"
scraping top repositories for "Command line interface"
scraping top repositories for "Clojure"
scrapin