# Scraping-github-topics-repositories

### Pick a website and describe the objective
- Browse through different websites and pick one to scrape
- Identify the information you would like to scrape from the site
- Decide the format of the output (here, CSV files)
- Summarise your project idea and outline your strategy in a Jupyter notebook


- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic URL and description
- For each topic, we'll get the top 25 repositories
- For each topic, we'll create a new CSV file

### Use the requests library to download web pages

In [1]:
!pip install requests --upgrade --quiet

In [2]:
!pip install --upgrade pip



In [3]:
import requests

In [4]:
topics_URL = 'https://github.com/topics'

In [5]:
response = requests.get(topics_URL)

In [6]:
response.status_code

200

In [7]:
len(response.text)

140564

In [8]:
page_contents = response.text

In [9]:
page_contents[:1000] 

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [10]:
with open('webpage.html', 'w',encoding="utf-8") as f:
    f.write(page_contents)
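
Saving the page to disk also makes it possible to re-run later cells without downloading the page again. The helper below is a minimal sketch of that idea, not part of the original notebook: `load_or_fetch` and its `fetch` argument are hypothetical names, and `fetch` is assumed to be any zero-argument callable that returns the page text (for example `lambda: requests.get(topics_URL).text`).

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the page text stored at `path`, downloading it only if missing.

    `fetch` is a hypothetical zero-argument callable that returns the
    page text as a string; it is only called when the file does not exist.
    """
    path = Path(path)
    if path.exists():
        # Reuse the copy that was saved earlier
        return path.read_text(encoding='utf-8')
    text = fetch()
    # Cache the freshly downloaded text for the next run
    path.write_text(text, encoding='utf-8')
    return text
```

With this in place, `page_contents = load_or_fetch('webpage.html', lambda: requests.get(topics_URL).text)` would hit the network only on the first run.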

### Use Beautiful Soup to parse and extract information

In [11]:
!pip install beautifulsoup4 --upgrade --quiet

In [12]:
from bs4 import BeautifulSoup

In [13]:
doc= BeautifulSoup(page_contents, 'html.parser')


In [14]:
selection_class =  'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})

In [15]:
len(topic_title_tags)

30

In [16]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
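
Passing the full class string to `find_all` matches the `class` attribute as one exact string, so the match depends on the classes appearing in that exact order. To make the idea of class-based extraction concrete, here is a standard-library sketch that collects the text of `<p>` tags carrying a given set of classes in any order. It is an illustration only, not how Beautiful Soup is implemented; `ClassTextCollector` is a hypothetical name, and the sketch assumes matching tags are never nested inside each other.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every `tag` element that has all the wanted classes."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes.split())
        self.inside = False   # True while inside a matching element
        self.texts = []       # one entry per matching element

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get('class') or '').split())
        if tag == self.tag and self.wanted <= classes:
            self.inside = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


# A small hand-written snippet modelled on the tags shown above
sample_html = (
    '<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>'
    '<p class="f5 color-fg-muted mb-0 mt-1"> 3D modeling </p>'
    '<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>'
)
collector = ClassTextCollector('p', 'Link--primary f3')
collector.feed(sample_html)
sample_titles = [text.strip() for text in collector.texts]
```

Here `sample_titles` holds only the two title paragraphs, even though the wanted classes were given in a different order from the HTML.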

In [17]:
desc_selector =  "f5 color-fg-muted mb-0 mt-1"
                   
topic_description_tags = doc.find_all('p',{'class': desc_selector})

In [18]:
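len(desc_selector)  # counts the characters in the selector string; len(topic_description_tags) would count the matched tags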

27

In [19]:
topic_description_tags[:3]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>]

In [20]:
topic_title_tag0= topic_title_tags[0]

In [21]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [22]:
div_tag= topic_title_tag0.parent


In [23]:
topic_link_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})



In [24]:
len(topic_link_tags)

30

In [25]:
topic_link_tags[0]['href']

'/topics/3d'

In [26]:
topic0_url= "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


#### Building the lists for the dictionary below

In [27]:
topic_titles= []
for tag in topic_title_tags:
    topic_titles.append(tag.text)

topic_titles[:3]

['3D', 'Ajax', 'Algorithm']

In [28]:
topic_desc=[]
for tag in topic_description_tags:
    topic_desc.append(tag.text.strip())

topic_desc[:3]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

In [29]:
topic_urls = []
for tag in topic_link_tags:
    topic_urls.append(tag['href'])
topic_urls[:3]

['/topics/3d', '/topics/ajax', '/topics/algorithm']

#### These URLs don't include the base URL, so we add it below

In [30]:
topic_urls = []
base_url = "https://github.com"

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
topic_urls[:3]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm']
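
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves a relative link against a base URL and leaves links that are already absolute untouched. The names `example_hrefs` and `joined_urls` are made up for this illustration.

```python
from urllib.parse import urljoin

github_base = 'https://github.com'

# Relative links as they appear in the href attributes above
example_hrefs = ['/topics/3d', '/topics/ajax', '/topics/algorithm']

# urljoin resolves each relative path against the base URL
joined_urls = [urljoin(github_base, href) for href in example_hrefs]
```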

#### Step 1 is done: collecting info from a page

In [31]:
!pip install pandas --quiet

In [32]:
import pandas as pd

In [33]:
WS_df = pd.DataFrame()

In [34]:
print (WS_df)

Empty DataFrame
Columns: []
Index: []


The lists built earlier (`topic_titles`, `topic_desc` and `topic_urls`) become the columns of the DataFrame:


In [35]:
topics_dict = {'Titles': topic_titles, 
               'Description': topic_desc, 
               'Links':topic_urls } 

In [36]:
WS_df = pd.DataFrame(topics_dict)

In [37]:
WS_df

| | Titles | Description | Links |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### Create CSV file(s) with the extracted information

In [38]:
WS_df.to_csv('topics.csv')

### Getting information out of a topic page

In [39]:
topic_page_url = topic_urls[0]

In [40]:
topic_page_url

'https://github.com/topics/3d'

In [41]:
response = requests.get(topic_page_url)

In [42]:
response.status_code

200

In [43]:
len(response.text)

435963

In [44]:
topic_doc= BeautifulSoup(response.text, 'html.parser')

#### Next, extract the username, repository name, star count and repository link, using the same procedure as above

In [45]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'

repo_tags =  topic_doc.find_all('h3',{'class': h3_selection_class})


In [46]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [47]:
len(repo_tags)

20

In [48]:
a_tags= repo_tags[0].find_all('a')

In [49]:
a_tags[0].text.strip()

'mrdoob'

In [50]:
a_tags[0]

<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [51]:
a_tags[0].text.strip()

'mrdoob'

In [52]:
a_tags[1].text.strip()

'three.js'

#### We have the link to the repo in the second a_tag 

In [53]:
a_tags[1]['href']

'/mrdoob/three.js'

In [54]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']

In [55]:
repo_url

'https://github.com/mrdoob/three.js'

#### Now for the stars 

In [56]:
star_tags =  topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [57]:
len(star_tags)

20

In [58]:
star_tags[0]

<span aria-label="85080 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="85,080">85.1k</span>

In [59]:
star_tags[3].text.strip()

'18.3k'

In [60]:
len(star_tags)

20

#### To convert a count such as '85.1k' into a number, define a function

#### The slice [:-1] selects everything except the last character

In [61]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [62]:
star_tags[0].text.strip()

'85.1k'

In [63]:
stars_str = star_tags[0].text.strip()
float(stars_str[:-1])

85.1

In [64]:
int(float(stars_str[:-1])*1000)

85100

In [65]:
parse_star_count(star_tags[0].text.strip())

85100
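
The abbreviated text loses precision (85,080 stars becomes 85100), while the `title` attribute of the same tag carries the exact figure with a thousands separator, as the `star_tags[0]` output shows. The sketch below handles both forms and uses `Decimal` to avoid floating-point rounding. It is a hypothetical variant, not the notebook's function: the name `parse_count` is made up, and support for an `m` (millions) suffix is an assumption, since no such value appears in the scraped output.

```python
from decimal import Decimal

# Multipliers for abbreviated counts; 'm' is an assumption, only 'k' appears in the data
SUFFIX_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}


def parse_count(text):
    """Parse counts such as '85.1k', '85,080' or '987' into an int."""
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in SUFFIX_MULTIPLIERS:
        # Decimal keeps '85.1' exact, so the product has no rounding error
        return int(Decimal(text[:-1]) * SUFFIX_MULTIPLIERS[text[-1]])
    return int(text)
```

For an exact count, `parse_count(star_tags[0]['title'])` could be used in place of the abbreviated text.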

In [66]:
star_tags[:2]

[<span aria-label="85080 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="85,080">85.1k</span>,
 <span aria-label="20419 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="20,419">20.4k</span>]

In [67]:
# returns all the reqd info ABOUT A REPOSITORY

def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [68]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 85100, 'https://github.com/mrdoob/three.js')

In [69]:
topic_repos_dict = {
    'username': [],
    'repo_name':[],
    'stars': [],
    'repo_url' : []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
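
Indexing the tuple by position (`repo_info[0]`, `repo_info[1]`, ...) works, but it is easy to mix up the order. As a sketch of one alternative, a `namedtuple` gives each field a name, and the dictionary of columns can then be built from those names. `RepoInfo` and `to_columns` are hypothetical names introduced for this illustration; the two sample rows are taken from the output above.

```python
from collections import namedtuple

# One row per repository, with named fields instead of positions
RepoInfo = namedtuple('RepoInfo', ['username', 'repo_name', 'stars', 'repo_url'])


def to_columns(rows):
    """Convert a list of RepoInfo rows into a dict of column lists."""
    return {
        field: [getattr(row, field) for row in rows]
        for field in RepoInfo._fields
    }


sample_rows = [
    RepoInfo('mrdoob', 'three.js', 85100, 'https://github.com/mrdoob/three.js'),
    RepoInfo('libgdx', 'libgdx', 20400, 'https://github.com/libgdx/libgdx'),
]
sample_columns = to_columns(sample_rows)
```

The resulting dictionary has the same shape as `topic_repos_dict`, so it could be passed to `pd.DataFrame` in the same way.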

In [70]:
topic_repos_df=pd.DataFrame(topic_repos_dict)

In [71]:
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 85100 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 20400 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 19400 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 18300 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 14600 | https://github.com/ssloy/tinyrenderer |
| 5 | aframevr | aframe | 14500 | https://github.com/aframevr/aframe |
| 6 | lettier | 3d-game-shaders-for-beginners | 13600 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 12100 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 9300 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 9200 | https://github.com/CesiumGS/cesium |


In [72]:
def get_topic_page(topic_url):
    # Download the page and check for a successful response
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    #Parse using BeautifulSoup    
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


# Returns all the required info about a repository
def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars,repo_url



def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)   
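
The loop in `get_topic_repos` pairs `repo_tags[i]` with `star_tags[i]`, which assumes both selectors matched the same number of elements. If the page layout changes and one selector matches more tags than the other, rows could be paired wrongly or an `IndexError` raised. The sketch below shows one way to make that assumption explicit; `pair_tags` is a hypothetical helper, not part of the notebook.

```python
def pair_tags(repo_tags, star_tags):
    """Pair each repo tag with its star tag, failing loudly on a length mismatch."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            'Found {} repo tags but {} star tags'.format(len(repo_tags), len(star_tags))
        )
    return list(zip(repo_tags, star_tags))
```

Inside `get_topic_repos`, the loop could then read `for h3_tag, star_tag in pair_tags(repo_tags, star_tags):`.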

In [73]:
url4=topic_urls[0]

In [74]:
url4

'https://github.com/topics/3d'

In [75]:
topic4_doc = get_topic_page(url4)

In [76]:
topic4_repos = get_topic_repos(topic4_doc)

In [77]:
topic4_repos

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 85100 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 20400 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 19400 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 18300 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 14600 | https://github.com/ssloy/tinyrenderer |
| 5 | aframevr | aframe | 14500 | https://github.com/aframevr/aframe |
| 6 | lettier | 3d-game-shaders-for-beginners | 13600 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 12100 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 9300 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 9200 | https://github.com/CesiumGS/cesium |


In [78]:
get_topic_repos(get_topic_page(topic_urls[2]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | jwasham | coding-interview-university | 232000 | https://github.com/jwasham/coding-interview-un... |
| 1 | CyC2018 | CS-Notes | 156000 | https://github.com/CyC2018/CS-Notes |
| 2 | trekhleb | javascript-algorithms | 150000 | https://github.com/trekhleb/javascript-algorithms |
| 3 | TheAlgorithms | Python | 143000 | https://github.com/TheAlgorithms/Python |
| 4 | yangshun | tech-interview-handbook | 78000 | https://github.com/yangshun/tech-interview-han... |
| 5 | kdn251 | interviews | 58100 | https://github.com/kdn251/interviews |
| 6 | azl397985856 | leetcode | 49300 | https://github.com/azl397985856/leetcode |
| 7 | TheAlgorithms | Java | 47600 | https://github.com/TheAlgorithms/Java |
| 8 | algorithm-visualizer | algorithm-visualizer | 39000 | https://github.com/algorithm-visualizer/algori... |
| 9 | youngyangyang04 | leetcode-master | 31300 | https://github.com/youngyangyang04/leetcode-ma... |


#### Write a single function to: 
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a csv of the top repos of the topic
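
Step 3 needs one file name per topic, and titles such as `ASP.NET` are better normalised before being used as file names. The following is a minimal sketch of that step using only the standard library's `csv` module rather than pandas. `topic_csv_name` and `write_topic_csv` are hypothetical helpers, and the assumption is that each repository is represented as a dict with the four fields used earlier.

```python
import csv
import re
from pathlib import Path

REPO_FIELDS = ['username', 'repo_name', 'stars', 'repo_url']


def topic_csv_name(topic_title):
    """Turn a topic title such as 'ASP.NET' into a lowercase file name."""
    slug = re.sub(r'[^A-Za-z0-9]+', '-', topic_title).strip('-').lower()
    return slug + '.csv'


def write_topic_csv(topic_title, repos, out_dir='.'):
    """Write one CSV for a topic; `repos` is a list of dicts keyed by REPO_FIELDS."""
    path = Path(out_dir) / topic_csv_name(topic_title)
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=REPO_FIELDS)
        writer.writeheader()
        writer.writerows(repos)
    return path
```

With pandas, the equivalent would be calling `to_csv` on the DataFrame returned by `get_topic_repos`, using the same generated file name.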

In [79]:
def get_topic_titles(doc):
    selection_class =  'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles 


def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_descs_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_descs_tags:
        # Append to the local list (the original appended to the global topic_desc)
        topic_descs.append(tag.text.strip())
    return topic_descs


def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    return topic_urls


    


In [80]:
import requests
import pandas as pd

In [81]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page instead of relying on the global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [82]:
scrape_topics()