# Top GitHub Repositories by Topic

## Pick a website and describe your objective
1. Browse through different sites and pick one to scrape. Check the project ideas section for inspiration.
2. Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
3. Summarize your project idea and outline your strategy in a Jupyter notebook.


#### Outline:

1. We are going to scrape https://github.com/topics.
2. We will get a list of topics. For each topic, we will get the topic title, topic page URL, and topic description.
3. For each topic, we will get the top 25 repositories from the topic page.
4. For each repository, we will grab the repo name, username, stars, and repo URL.
5. For each topic, we will create a CSV file.

## Use the requests library to download web pages.


In [21]:
!pip install requests --upgrade



In [23]:
import requests

In [27]:
topics_URL = 'https://github.com/topics'

In [29]:
response = requests.get(topics_URL)   # requests.get returns a Response object
# requests has now fetched the URL and downloaded the web page

In [31]:
# To check whether the request was successful, look at the response's status_code
response.status_code     # a successful response has a status code between 200 and 299
# For more about status codes, search for "HTTP status codes"

200
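
The 200-299 rule from the comment above can be written as a small check. This is a sketch rather than part of the original notebook: `is_success` is a hypothetical helper name, and it works on any integer status code, so it does not need a live request. With `requests`, calling `response.raise_for_status()` is a related option that raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True when the status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299


# HTTPStatus members compare like plain integers.
print(is_success(HTTPStatus.OK))         # HTTPStatus.OK is 200
print(is_success(HTTPStatus.NOT_FOUND))  # HTTPStatus.NOT_FOUND is 404
```

In the notebook this could be used as `is_success(response.status_code)` before reading `response.text`.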

In [33]:
# The contents of the web page are in response.text (don't print the entire content)
page_contents = response.text

In [35]:
# Print the first 1000 characters of page_contents
page_contents[:1000]
# The output below is written in a language called HTML

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  \n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-3e154969b9f9.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-9c5b7a476542.css" /><link data-color-theme="dark_dimmed" cross

In [39]:
# You can save page_contents to an HTML file
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)
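
A saved copy also makes it possible to keep working without downloading the page again. The sketch below is one way to do that, under stated assumptions: `load_or_download` is a hypothetical helper, and the `download` argument is any function that takes a URL and returns the page text (in this notebook, something like `lambda url: requests.get(url).text`).

```python
from pathlib import Path
from typing import Callable


def load_or_download(url: str, cache_path: str, download: Callable[[str], str]) -> str:
    """Return the cached HTML if the file exists; otherwise download it and cache it."""
    path = Path(cache_path)
    if path.exists():
        # Reuse the copy on disk instead of hitting the site again.
        return path.read_text(encoding="utf-8")
    html = download(url)
    path.write_text(html, encoding="utf-8")
    return html
```

The cache never expires in this sketch, so delete the file when you want fresh data.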

## Use Beautiful Soup to parse and extract information.

In [41]:
pip install beautifulsoup4 --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [45]:
from bs4 import BeautifulSoup

In [47]:
# Parse the HTML stored in page_contents with BeautifulSoup
soup = BeautifulSoup(page_contents, 'html.parser')

In [49]:
type(soup)

bs4.BeautifulSoup

In [59]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"   # found with the browser's Inspect tool
# selection_class is the class of the "3D" heading on the GitHub topics page
topic_title_tags = soup.find_all('p', {'class': selection_class})
len(topic_title_tags)

30
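
Passing the whole class string to `find_all` matches the `class` attribute exactly as it is written in the page, so the lookup depends on the class order staying the same. A CSS selector via `select` matches individual classes instead. The sketch below illustrates the difference on a made-up HTML snippet (the second `<p>` and its class order are invented for the demonstration, not taken from GitHub).

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the topics page; the second <p> lists its classes in a different order.
html = """
<a href="/topics/3d"><p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p></a>
<a href="/topics/ajax"><p class="mt-1 f3 Link--primary lh-condensed mb-0">Ajax</p></a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Full class string: only tags whose attribute is written exactly this way are found.
exact = demo_soup.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: matches on individual classes, in any order.
by_css = demo_soup.select("p.f3.Link--primary")

print([tag.text for tag in exact])
print([tag.text for tag in by_css])
```

Either approach still breaks if GitHub renames its classes, so it is worth checking the number of matches after every run.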

In [61]:
topic_title_tags   # the <p> tags corresponding to the 30 topic titles

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Bash</p>,
 <p class="f3 lh-condensed m

In [111]:
topic_title_tags_lst = [tag.text for tag in topic_title_tags]   # text of each title tag

In [113]:
topic_title_tags_lst[:5]   # the first five entries of the title list

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [63]:
# Next, get the topic descriptions, which appear just below the topic titles
desc_class = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tags = soup.find_all('p', {'class': desc_class})    # 'p' because the descriptions are also in <p> tags
len(topic_desc_tags)

30

In [65]:
topic_desc_tags[:5]   # let's see the first five 

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [107]:
# Create a list of the descriptions for all topics, removing newlines and surrounding whitespace
topic_desc_tags_lst = [tag.text.replace('\n', '').strip() for tag in topic_desc_tags]

In [109]:
topic_desc_tags_lst[:5]     # we got the description list

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
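
Removing newlines and then stripping works for the descriptions above because each one sits on a single line. If a description were ever wrapped across lines, a more general cleanup is to collapse every run of whitespace into one space. A minimal sketch, with `clean_text` as a hypothetical helper and an invented wrapped sample string:

```python
def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw.split())


sample = "\n           Ajax is a technique for creating\n   interactive web applications.\n         "
print(clean_text(sample))
```

In the notebook this could replace the `.replace('\n', '').strip()` chain, for example `clean_text(tag.text)`.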

In [99]:
# Now we want the URL for each topic, i.e. the href attribute.
# The <p> title tag itself has no href; its parent tag does.
# For example, the first tag in topic_title_tags corresponds to '3D'.

# Create a list of the relative URLs (paths) for all topics
topic_URL_tags_lst = [tag.parent['href'] for tag in topic_title_tags]

In [101]:
topic_URL_tags_lst[:5]   # the relative URLs of the first five topics

['/topics/3d',
 '/topics/ajax',
 '/topics/algorithm',
 '/topics/amphp',
 '/topics/android']

In [119]:
# Build the complete URL for each topic in the list
URL_lst = ['https://github.com' + path for path in topic_URL_tags_lst]

In [121]:
URL_lst[:5]  # the complete URLs of the first five topics

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
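
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` builds the same URLs and also leaves already-absolute links untouched. `BASE_URL` and `to_absolute` are names chosen for this illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def to_absolute(href: str) -> str:
    """Resolve a relative href such as '/topics/3d' against the GitHub base URL."""
    return urljoin(BASE_URL, href)


print(to_absolute("/topics/3d"))
print(to_absolute("https://github.com/topics/ajax"))  # already absolute, returned as-is
```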

In [123]:
import pandas as pd

In [125]:
topic_df = pd.DataFrame({'Title' : topic_title_tags_lst ,
                        'Description': topic_desc_tags_lst ,
                        'URLs' : URL_lst})

In [127]:
topic_df.head()  # we have created a dataframe based on the info we got

| | Title | Description | URLs |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


## Create CSV file(s) with the extracted information.

In [131]:
# Convert this DataFrame to a CSV file
topic_df.to_csv('titles.csv', index=False)      # 'titles.csv' is the name of the output file
# This creates the CSV file in the current working directory
# index=False leaves the DataFrame's row index out of the CSV file

## Getting information out of the individual topic pages

In [134]:
# Getting inside the first URL

In [138]:
topic_page = URL_lst[0]
topic_page

'https://github.com/topics/3d'

In [140]:
response = requests.get(topic_page)

In [142]:
# checking status code of our response
response.status_code

200

In [144]:
len(response.text)

510872

In [148]:
topic_doc = BeautifulSoup(response.text , 'html.parser')

In [150]:
topic_cls_selc = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.find_all('h3' , {'class': topic_cls_selc})     

In [154]:
len(repo_tags)

20

In [156]:
repo_tags[0]    # the first repo tag
# This h3 tag contains two <a> tags. The first corresponds to the username;
# the second corresponds to the repository name and its URL.

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href

In [160]:
# finding a-tags inside the repo-tags
a_tags = repo_tags[0].find_all('a')

In [162]:
a_tags    # as you can see, we got two <a> tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [164]:
# extracting the name of the user from the FIRST A-TAG
a_tags[0].text.strip()

'mrdoob'

In [166]:
a_tags[1]

<a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
            three.js
</a>

In [168]:
# extracting the repository name from the second <a> tag
a_tags[1].text.strip()

'three.js'

In [170]:
# extracting the link for the repository from the second A-Tag
a_tags[1]['href']

'/mrdoob/three.js'

In [210]:
star_tags = topic_doc.find_all('span' , {'class':"Counter js-social-count"})

In [212]:
len(star_tags)

20

In [216]:
star_tags[0]

<span aria-label="102009 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="102,009">102k</span>

In [232]:
star_tags[0].text

'102k'

In [224]:
star_tags[0].text[:-1]    # getting everything except the last character

'102'

In [228]:
# let's define a function to convert the star count to integers
def parse_star_count(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1])*1000)
    return int(star_str)

In [234]:
# let's check our function
parse_star_count(star_tags[0].text)

102000
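
`parse_star_count` handles the `k` suffix, and the rounded text loses precision: `'102k'` becomes 102000 although the tag's `title` attribute shown earlier holds `102,009`. The sketch below is a hypothetical variant, `parse_count`, that also accepts comma-separated exact counts and an `m` suffix. The `m` case is an assumption about how very large counts might be displayed; it does not appear in the output above.

```python
def parse_count(text: str) -> int:
    """Convert strings such as '102k', '1.2m' or '102,009' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed suffix
    suffix = cleaned[-1:]
    if suffix in multipliers:
        # round() guards against floating-point results that land just off the integer.
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


print(parse_count("102k"))
print(parse_count("102,009"))
```

In the notebook, `parse_count(star_tags[0]['title'])` would read the exact count from the attribute instead of the abbreviated text.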

In [293]:
# Takes a single h3 repo tag and its matching star-count tag (one element each from repo_tags and star_tags)
def get_repo_info(h3_repo_tags, star_count_tags):
    a_tags = h3_repo_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_URL = 'https://github.com'+ a_tags[1]['href']
    stars = parse_star_count(star_count_tags.text.strip())
    return username , repo_name , stars , repo_URL

get_repo_info(repo_tags[0] , star_tags[0])

('mrdoob', 'three.js', 102000, 'https://github.com/mrdoob/three.js')

In [240]:
# Now we can do this for every repository on the topic page.
# We use a 'for' loop and store each field in a dictionary of lists.
# Create the empty dict first
topic_repos_dict = {'Username': [],
                    'Repository_name': [],
                    'Stars': [],
                    'repo_URL': []}

for i in range(len(repo_tags)):
    repo_data = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['Username'].append(repo_data[0])
    topic_repos_dict['Repository_name'].append(repo_data[1])
    topic_repos_dict['Stars'].append(repo_data[2])
    topic_repos_dict['repo_URL'].append(repo_data[3])
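
The loop above pairs each repo tag with the star tag at the same position, which relies on both lists having the same length (20 each in this run). If the page layout changes and the counts differ, rows could be misaligned or the loop could fail. One possible guard is sketched below; `pair_up` is a hypothetical helper, and plain strings stand in for the BeautifulSoup tags.

```python
def pair_up(repo_items, star_items):
    """Pair each repository entry with its star entry, refusing mismatched lengths."""
    if len(repo_items) != len(star_items):
        raise ValueError(
            f"found {len(repo_items)} repositories but {len(star_items)} star counts"
        )
    return list(zip(repo_items, star_items))


# Illustrative stand-ins for repo_tags and star_tags.
pairs = pair_up(["three.js", "react-three-fiber"], ["102k", "27.2k"])
print(pairs)
```

In the notebook the loop could then read `for repo_tag, star_tag in pair_up(repo_tags, star_tags):`.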

In [242]:
topic_repos_dict

{'Username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'FreeCAD',
  'lettier',
  'aframevr',
  'blender',
  'CesiumGS',
  'MonoGame',
  'isl-org',
  'mapbox',
  '4ian',
  'metafizzy',
  'timzhang642',
  'nerfstudio-project',
  'FyroxEngine',
  'domlysz',
  'a1studmuffin'],
 'Repository_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  'FreeCAD',
  '3d-game-shaders-for-beginners',
  'aframe',
  'blender',
  'cesium',
  'MonoGame',
  'Open3D',
  'mapbox-gl-js',
  'GDevelop',
  'zdog',
  '3D-Machine-Learning',
  'nerfstudio',
  'Fyrox',
  'BlenderGIS',
  'SpaceshipGenerator'],
 'Stars': [102000,
  27200,
  23200,
  23100,
  20400,
  19300,
  17800,
  16600,
  12900,
  12800,
  11400,
  11300,
  11100,
  10500,
  10400,
  9700,
  9300,
  7700,
  7700,
  7700],
 'repo_URL': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/

In [246]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [248]:
topic_repos_df.head()

| | Username | Repository_name | Stars | repo_URL |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 102000 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 27200 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 23200 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 23100 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 20400 | https://github.com/ssloy/tinyrenderer |


In [295]:
def getting_inside_topic(page_url):    # page_url is the URL of a single topic page on GitHub
    # Download the topic page with the requests library
    response = requests.get(page_url)
    response.raise_for_status()    # raise an error if the request failed (4xx or 5xx status)
    # Parse the downloaded HTML with BeautifulSoup
    docs_topic = BeautifulSoup(response.text, 'html.parser')
    # Get the h3 tags that hold the username and repository name
    h3_repo_tags = docs_topic.find_all('h3', {'class': "f3 color-fg-muted text-normal lh-condensed"})
    # Get the span tags that hold the star counts
    star_count_tags = docs_topic.find_all('span', {'class': "Counter js-social-count"})

    topic_repos_dict = {'Username':[],
                  'Repository_name' : [],
                  'Stars' : [] ,
                  'repo_URL' : []}
    for i in range(len(h3_repo_tags)):
        repo_data = get_repo_info(h3_repo_tags[i] , star_count_tags[i])
        topic_repos_dict['Username'].append(repo_data[0])
        topic_repos_dict['Repository_name'].append(repo_data[1])
        topic_repos_dict['Stars'].append(repo_data[2])
        topic_repos_dict['repo_URL'].append(repo_data[3])

    return pd.DataFrame(topic_repos_dict)

In [301]:
# We have defined three functions:
# 1. parse_star_count(): converts a star-count string into an integer
# 2. get_repo_info(): gets the username, repo name, star count and repo URL for one repository on a topic page
# 3. getting_inside_topic(): downloads a topic page, parses it, collects the data with get_repo_info() and returns a DataFrame

In [303]:
# let's try another topic URL from URL_lst (index 3, i.e. the fourth entry)
url_third = URL_lst[3]
url_third

'https://github.com/topics/amphp'

In [305]:
# parse_star_count() is called inside get_repo_info()
# get_repo_info() is called inside getting_inside_topic()

getting_inside_topic(url_third)  # as you can see, we have successfully created a DataFrame for this topic

| | Username | Repository_name | Stars | repo_URL |
|---|---|---|---|---|
| 0 | amphp | amp | 4200 | https://github.com/amphp/amp |
| 1 | danog | MadelineProto | 2800 | https://github.com/danog/MadelineProto |
| 2 | unreal4u | telegram-api | 790 | https://github.com/unreal4u/telegram-api |
| 3 | amphp | parallel | 776 | https://github.com/amphp/parallel |
| 4 | amphp | http-client | 701 | https://github.com/amphp/http-client |
| 5 | amphp | byte-stream | 365 | https://github.com/amphp/byte-stream |
| 6 | amphp | mysql | 358 | https://github.com/amphp/mysql |
| 7 | php-service-bus | service-bus | 348 | https://github.com/php-service-bus/service-bus |
| 8 | amphp | parallel-functions | 269 | https://github.com/amphp/parallel-functions |
| 9 | xtrime-ru | TelegramRSS | 230 | https://github.com/xtrime-ru/TelegramRSS |


In [309]:
# Let's also save this topic's DataFrame to a CSV file (this call downloads the page again)
getting_inside_topic(url_third).to_csv('amphp.csv', index=False)
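
Step 5 of the outline calls for one CSV file per topic. A minimal sketch of that final loop follows, under stated assumptions: `save_all_topics`, `out_dir` and `delay_seconds` are hypothetical names, and `scrape_topic` is any function that takes a topic URL and returns an object with a pandas-style `to_csv` method (in this notebook, `getting_inside_topic`). The file name is taken from the last part of the URL.

```python
import time
from pathlib import Path


def save_all_topics(topic_urls, scrape_topic, out_dir="topic_csvs", delay_seconds=1.0):
    """Scrape every topic URL, write one CSV per topic, and return the written paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for url in topic_urls:
        topic_name = url.rstrip("/").split("/")[-1]  # e.g. 'amphp'
        frame = scrape_topic(url)                    # expected to return a DataFrame
        csv_path = out_path / f"{topic_name}.csv"
        frame.to_csv(csv_path, index=False)
        written.append(csv_path)
        time.sleep(delay_seconds)                    # pause between requests
    return written
```

In the notebook this would be called as `save_all_topics(URL_lst, getting_inside_topic)`. The pause between requests is a courtesy to the server; the one-second default is an arbitrary choice.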