# Scraping Top Repositories for Topics on GitHub

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page url, and topic description
- For each topic, we'll get the top repositories listed on the topic page (the first page returned 20 in the run below)
- For each repository, we'll grab the repository name, username, stars, and repository url
- For each topic, we'll create a CSV file.

Objective:
The goal of this project is to create a web scraper that extracts information about the top repositories within various topics on GitHub. The scraped data will include details about topics, such as title, description, and URL, as well as information about the top repositories within each topic, including repository name, username, stars, and repository URL. The final output will be organized into CSV files, with one file for each topic.

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
url = "https://github.com/topics"

In [4]:
response = requests.get(url)

In [5]:
response.status_code

200

In [7]:
len(response.text)

170999

In [15]:
page_contents = response.text

page_contents[:500]


'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubus'

In [16]:
with open('github_topics.html', 'w', encoding="utf-8") as file:
    file.write(page_contents)
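
The download-and-save steps above could be wrapped in one small helper. The following is only a sketch and not part of the original notebook: `download_page`, its `session` parameter, and the 10-second timeout are hypothetical choices. It reuses the saved copy when one exists, so re-running the notebook does not request the page again.

```python
from pathlib import Path


def download_page(url, path, session=None, timeout=10):
    """Return the HTML for `url`, caching it at `path` (hypothetical helper)."""
    path = Path(path)
    if path.exists():
        # Reuse the local copy instead of downloading again.
        return path.read_text(encoding="utf-8")

    if session is None:
        # Imported here so a stand-in object with a .get() method can be
        # passed in for offline testing; a requests.Session also works.
        import requests
        session = requests.Session()

    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises an error on 4xx/5xx responses
    path.write_text(response.text, encoding="utf-8")
    return response.text
```

With this in place, `page_contents = download_page("https://github.com/topics", "github_topics.html")` would cover cells 3 to 16 in a single call.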

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions that extract information from the page into lists and dictionaries.

In [18]:
# install the library
! pip install beautifulsoup4 --upgrade --quiet

# import the library
from bs4 import BeautifulSoup

In [19]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [36]:
p_tags = doc.find_all('p')

p_tags[:15]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Java
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Java is an object-oriented programming language used mainly for web, desktop, embedded devices and mobile applications.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Webpack
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Webpack is a bundler that takes modules with dependencies and creates static assets.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
 

In [33]:
title_selector = "f3 lh-condensed mb-0 mt-1 Link--primary"

topic_title_tags = doc.find_all('p', {'class': title_selector})


In [34]:
len(topic_title_tags)

30
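
Passing a full class string to `find_all`, as above, matches the `class` attribute as one exact string, so the order of the classes matters. The sketch below uses a made-up two-paragraph snippet (not the real page) to show this and compares it with a CSS selector via `select`, which ignores order. On the real page a looser selector may also pick up the featured topics, whose `<p>` tags carry an extra `text-center` class, so treat it as an alternative to weigh rather than a drop-in replacement.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the same classes in two different orders.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Java</p>
"""
sample = BeautifulSoup(html, "html.parser")

# Exact-string match: only the first <p> has this exact attribute value.
exact = sample.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: matches any <p> that has all three classes, in any order.
loose = sample.select("p.f3.lh-condensed.Link--primary")

exact_titles = [tag.text.strip() for tag in exact]
loose_titles = [tag.text.strip() for tag in loose]
print(exact_titles)
print(loose_titles)
```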

In [35]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [45]:
# topic description

desc_selector = "f5 color-fg-muted mb-0 mt-1"
topic_descs_tags = doc.find_all('p', {'class': desc_selector})

In [38]:
len(topic_descs_tags)

30

In [46]:
topic_descs_tags[:2]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>]

In [47]:
# topic URL

url_selector = "no-underline flex-1 d-flex flex-column"
topic_link_tags = doc.find_all('a', {'class': url_selector})

In [48]:
len(topic_link_tags)

30

In [50]:
topic_link_tags[0]['href']

'/topics/3d'

In [52]:
topic_url = 'https://github.com' + topic_link_tags[0]['href']
print(topic_url)

https://github.com/topics/3d
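
String concatenation works here because every `href` on the page starts with `/`. As a hedged alternative, the standard library's `urljoin` resolves a relative link against a base URL and leaves an already-absolute link alone. The `hrefs` list below is sample data taken from the links shown in this notebook.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Relative hrefs, like those found on the topics page.
hrefs = ["/topics/3d", "/topics/ajax"]
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)

# An href that is already absolute is returned unchanged.
absolute = urljoin(base_url, "https://github.com/topics/api")
print(absolute)
```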


In [54]:
# to get the first topic_title

topic_title_tags[0].text

'3D'

##### - To get all the topic_titles

In [56]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


##### - To get all the topic_descs

In [60]:
topic_descs = []
for tag in topic_descs_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:3]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

##### - To get all the topic_url

In [64]:
topic_url = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_url.append(base_url + tag['href'])
    
topic_url

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [66]:
# Creating a dataframe for the topics

topic_dicts = {
                'title': topic_titles,
                'description': topic_descs,
                'url':topic_url
}

topic_dicts

{'title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'description': ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency library for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interface) is a 
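
A dictionary of parallel lists only lines up if the three lists have the same length and the same order. As a sketch, a hypothetical helper named `build_topic_rows` could check this before the data is handed to pandas. The sample values are the first two topics from the output above.

```python
def build_topic_rows(titles, descriptions, urls):
    """Combine three parallel lists into one dict per topic (hypothetical helper)."""
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            f"Length mismatch: {len(titles)} titles, "
            f"{len(descriptions)} descriptions, {len(urls)} URLs"
        )
    return [
        {"title": title, "description": desc, "url": url}
        for title, desc, url in zip(titles, descriptions, urls)
    ]


rows = build_topic_rows(
    ["3D", "Ajax"],
    [
        "3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(rows[0]["title"], rows[0]["url"])
```

`pd.DataFrame(rows)` accepts a list of dictionaries like this one and produces the same three columns.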

In [69]:
import pandas as pd

# Putting it as a dataframe
topic_df = pd.DataFrame(topic_dicts)
topic_df.head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [72]:
import os

filename = 'topic.csv'

if os.path.exists(filename):
    raise FileExistsError(f"The file '{filename}' already exists. Please choose a different filename or remove the existing file.")

topic_df.to_csv(filename, index=False)

# If the code reaches this point, the CSV file has been successfully created
print(f"The file '{filename}' has been created.")


The file 'topic.csv' has been created.
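
Checking `os.path.exists` and then writing leaves a small window in which something else could create the file. Opening the file in exclusive-creation mode (`'x'`) does the check and the creation in one step. The sketch below uses the standard `csv` module rather than pandas, and `write_rows_once` is a hypothetical name, not something from the notebook.

```python
import csv


def write_rows_once(path, fieldnames, rows):
    """Write rows to a new CSV file; raise FileExistsError if it already exists."""
    # Mode 'x' creates the file and fails if the path is already taken.
    with open(path, "x", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Here `rows` is expected to be a list of dictionaries keyed by the names in `fieldnames`, for example `{'title': '3D', 'url': 'https://github.com/topics/3d'}`.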


## Getting Information from the first topic

For each repository we'll need the `username`, `repository_name`, `star_count`, and `repo_url`.

In [73]:
# Starting from the first topic "3d"

topic_page_url = topic_url[0]
topic_page_url


'https://github.com/topics/3d'

In [74]:
response = requests.get(topic_page_url)
response.status_code

200

In [75]:
len(response.text)

488735

In [77]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [146]:
repo_selector = "f3 color-fg-muted text-normal lh-condensed"

repo_tags = topic_doc.find_all('h3', {"class": repo_selector} )

In [147]:
len(repo_tags)

20

In [135]:
# h3_tag

repo_tags[0] # the first h3

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href

In [136]:
a_tags = repo_tags[0].find_all('a')

a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [137]:
a_tags[0].text.strip()

'mrdoob'

In [138]:
a_tags[1].text.strip()

'three.js'

In [139]:
# repo_url
a_tags[1]['href']

'/mrdoob/three.js'

In [140]:
# pulling for star_count
star_selector = "Counter js-social-count"
star_tags = topic_doc.find_all('span', {'class': star_selector})
star_tags

[<span aria-label="96440 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="96,440">96.4k</span>,
 <span aria-label="24786 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="24,786">24.8k</span>,
 <span aria-label="22289 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="22,289">22.3k</span>,
 <span aria-label="21822 users starred this repository" class="Counter js-social-count" data-p

In [141]:
len(star_tags)

20

In [142]:
star_tags[0].text

'96.4k'

In [143]:
# convert to number
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1]=="k":
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)


In [144]:
parse_star_count(star_tags[0].text)

96400
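
`parse_star_count` above handles only the `k` suffix and uses `int()`, which truncates. A slightly more defensive sketch follows, under two assumptions that are not confirmed by the page: that very large counts might use an `m` suffix, and that a count might contain thousands separators. The name `parse_count` is hypothetical.

```python
def parse_count(stars_str):
    """Convert strings such as '96.4k', '1.2m', '96,440' or '987' to an int."""
    text = stars_str.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids losing 1 when the float product lands just below
        # a whole number, which int() would truncate.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_count("96.4k"))
print(parse_count("96,440"))
```

The `span` tags printed above also carry a `title` attribute such as `"96,440"`, so `parse_count(star_tags[0]['title'])` would give the exact count rather than the rounded one, assuming that attribute is present on every tag.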

In [154]:
# Let's wrap that in a function as well
def get_repo_info(h3_tag, star_tag):
    # Return the username, repository name, star count, and URL for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repository_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    star_count = parse_star_count(star_tag.text)
    
    return username, repository_name, star_count, repo_url

In [160]:
# Check the function

get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 96400, 'https://github.com/mrdoob/three.js')

In [161]:
# Creating a dataframe for the repositories
topic_repos_dicts = {
                    'username': [],
                    'repository_name': [],
                    'star_count': [],
                    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dicts['username'].append(repo_info[0])
    topic_repos_dicts['repository_name'].append(repo_info[1])
    topic_repos_dicts['star_count'].append(repo_info[2])
    topic_repos_dicts['repo_url'].append(repo_info[3])
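
`get_repo_info` returns a plain tuple, so the loop above has to remember that index 2 is the star count. One possible alternative, sketched here with a hypothetical `RepoInfo` named tuple, lets each field be read by name. The tuples below are the first two repositories from this topic, standing in for what `get_repo_info` returns.

```python
from typing import NamedTuple


class RepoInfo(NamedTuple):
    username: str
    repository_name: str
    star_count: int
    repo_url: str


# Stand-ins for what get_repo_info would return for each repository.
raw_infos = [
    ("mrdoob", "three.js", 96400, "https://github.com/mrdoob/three.js"),
    ("pmndrs", "react-three-fiber", 24800, "https://github.com/pmndrs/react-three-fiber"),
]

repos = [RepoInfo(*info) for info in raw_infos]
# One dictionary per repository, keyed by field name.
records = [repo._asdict() for repo in repos]

print(repos[0].repository_name)
print(records[1]["star_count"])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`, which removes the need for four separate `append` calls.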

In [162]:
topic_repos_df = pd.DataFrame(topic_repos_dicts)
topic_repos_df

| | username | repository_name | star_count | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 96400 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 24800 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 22300 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21800 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 18500 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 16500 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | FreeCAD | FreeCAD | 16000 | https://github.com/FreeCAD/FreeCAD |
| 7 | aframevr | aframe | 15900 | https://github.com/aframevr/aframe |
| 8 | CesiumGS | cesium | 11300 | https://github.com/CesiumGS/cesium |
| 9 | MonoGame | MonoGame | 10400 | https://github.com/MonoGame/MonoGame |


## Getting Information from All the Topics


In [167]:
# Function to download and parse the topic page
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check for a successful response (status code 200)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
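
`get_topic_page` raises on the first failed response. When many topic pages are requested in a row, a temporary failure (for example a rate-limit response) could stop the whole run. The following is a sketch of a generic retry wrapper with exponential backoff. `fetch_with_retries` and its parameters are hypothetical, and the delays are arbitrary starting points rather than values recommended by GitHub.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

It could be used as `topic_doc = fetch_with_retries(get_topic_page, topic_page_url)`. The `sleep` parameter exists only so the waiting can be replaced in tests.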


In [168]:
# Function to extract repository information from an h3 tag and its star tag
def get_repo_info(h3_tag, star_tag):
    # Extract relevant information from h3 tag
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repository_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    
    # Extract star count using a separate function (parse_star_count)
    star_count = parse_star_count(star_tag.text)
    
    return username, repository_name, star_count, repo_url

In [185]:
# Function to extract repository information from the topic page
def get_topic_repos(topic_doc):
    # Get h3 tags containing username, repo_name, and repo_url
    repo_selector = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', {"class": repo_selector})
    
    # Get star tags containing star_count
    star_selector = "Counter js-social-count"
    star_tags = topic_doc.find_all('span', {'class': star_selector})
    
    # Creating a dictionary to store repository information
    topic_repos_dicts = {
        'username': [],
        'repository_name': [],
        'star_count': [],
        'repo_url': []
    }
    
    # Loop through each repository tag and extract information
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dicts['username'].append(repo_info[0])
        topic_repos_dicts['repository_name'].append(repo_info[1])
        topic_repos_dicts['star_count'].append(repo_info[2])
        topic_repos_dicts['repo_url'].append(repo_info[3])

    # Return the extracted repository information as a DataFrame
    return pd.DataFrame(topic_repos_dicts)

In [204]:
import os

def scrape_topic(topic_url, path):
    
    # Check if the CSV file already exists
    if os.path.exists(path):
        print(f'The file {path} already exists. Skipping...')
        # If the file exists, skip the scraping process
        return
    
    # Scrape topic repositories and create a DataFrame
    topic_df = get_topic_repos(get_topic_page(topic_url))
    
    # Save the DataFrame to a CSV file
    topic_df.to_csv(path, index=False)
    print(f'The file {path} has been created.')


In [205]:
url4 = topic_url[6]

In [206]:
url4

'https://github.com/topics/ansible'

In [207]:
topic4_doc = get_topic_page(url4)

In [208]:
topic4_repos = get_topic_repos(topic4_doc)

In [209]:
topic4_repos

| | username | repository_name | star_count | repo_url |
|---|---|---|---|---|
| 0 | bregman-arie | devops-exercises | 60000 | https://github.com/bregman-arie/devops-exercises |
| 1 | ansible | ansible | 59700 | https://github.com/ansible/ansible |
| 2 | trailofbits | algo | 27900 | https://github.com/trailofbits/algo |
| 3 | MichaelCade | 90DaysOfDevOps | 24400 | https://github.com/MichaelCade/90DaysOfDevOps |
| 4 | StreisandEffect | streisand | 23000 | https://github.com/StreisandEffect/streisand |
| 5 | kubernetes-sigs | kubespray | 14900 | https://github.com/kubernetes-sigs/kubespray |
| 6 | ansible | awx | 12900 | https://github.com/ansible/awx |
| 7 | easzlab | kubeasz | 9800 | https://github.com/easzlab/kubeasz |
| 8 | ansible-semaphore | semaphore | 8500 | https://github.com/ansible-semaphore/semaphore |
| 9 | geerlingguy | ansible-for-devops | 7700 | https://github.com/geerlingguy/ansible-for-devops |


In [197]:
get_topic_repos(get_topic_page(topic_url[6])).to_csv('ansible.csv', index=False)

## Putting It All Together

In [218]:
def get_topic_titles(doc):
    title_selector = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p', {'class': title_selector})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


In [219]:
def get_topic_descs(doc):
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_descs_tags = doc.find_all('p', {'class': desc_selector})
    
    topic_descs = []
    for tag in topic_descs_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [220]:
def get_topic_urls(doc):
    url_selector = "no-underline flex-1 d-flex flex-column"
    topic_link_tags = doc.find_all('a', {'class': url_selector})

    topic_url = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_url.append(base_url + tag['href'])

    return topic_url

In [224]:
def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    
    # Check for a successful response (status code 200)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topic_dicts = {
                   'title': get_topic_titles(doc),
                   'description': get_topic_descs(doc),
                   'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dicts)

In [231]:
def scrape_topic_repos():
    print('scraping list of topics')
    topic_df = scrape_topics()
    
    os.makedirs('data', exist_ok = True)
    for index, row in topic_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
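
The file names above come from topic titles such as "Awesome Lists", "ASP.NET", and "C++", which contain spaces and punctuation, and a title containing `/` would be read as a directory separator. As a sketch of one alternative, the file name could be built from the last segment of the topic URL instead. `topic_csv_path` is a hypothetical helper, and it assumes every topic URL ends in a slug like the ones listed earlier.

```python
from pathlib import Path
from urllib.parse import urlparse


def topic_csv_path(topic_url, folder="data"):
    """Build '<folder>/<slug>.csv' from a topic URL (hypothetical helper)."""
    # The last path segment of a topic URL is its slug, e.g. 'chrome-extension'.
    slug = urlparse(topic_url).path.rstrip("/").split("/")[-1]
    return Path(folder) / f"{slug}.csv"


print(topic_csv_path("https://github.com/topics/chrome-extension"))
print(topic_csv_path("https://github.com/topics/aspnet"))
```

Inside the loop this would replace the formatted string with `topic_csv_path(row['url'])`.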

In [232]:
scrape_topic_repos()

scraping list of topics
scraping top repositories for "3D"
The file data/3D.csv has been created.
scraping top repositories for "Ajax"
The file data/Ajax.csv has been created.
scraping top repositories for "Algorithm"
The file data/Algorithm.csv has been created.
scraping top repositories for "Amp"
The file data/Amp.csv has been created.
scraping top repositories for "Android"
The file data/Android.csv has been created.
scraping top repositories for "Angular"
The file data/Angular.csv has been created.
scraping top repositories for "Ansible"
The file data/Ansible.csv has been created.
scraping top repositories for "API"
The file data/API.csv has been created.
scraping top repositories for "Arduino"
The file data/Arduino.csv has been created.
scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv has been created.
scraping top repositories for "Atom"
The file data/Atom.csv has been created.
scraping top repositories for "Awesome Lists"
The file data/Awesome Lists.csv has been