# Top Repositories For GitHub Topics

### Pick a website and describe your objective

- Get a website
- Identify info you want to scrape
- Summarize project idea

### Outline:

- Scrape https://github.com/topics
- Get the list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top repositories on the topic page (20 in the run below)
- For each repo, we'll get the repo name, username, stars, and repo URL
- For each topic, we'll create a CSV file

# Use the requests library to download web pages

!pip install requests --upgrade --quiet

The --quiet flag hides the installation details.

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code

200

In [6]:
len(response.text) #the length of the text in the response 

196042

In [7]:
page_contents= response.text

In [8]:
page_contents[:100]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-t'

In [9]:
# write html contents into a file on my local computer
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)


Why we added encoding='utf-8':

Without it, open() uses the platform's default encoding (cp1252 here), which cannot encode some characters in page_contents, so f.write raises a UnicodeEncodeError.

Passing encoding='utf-8' writes the file as UTF-8, which can represent every Unicode character, including the ones that caused the error.
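
To see the difference between the two encodings in isolation, here is a minimal sketch. The sample string is made up for illustration (it is not taken from the GitHub page); it contains an arrow character that cp1252 has no byte for.

```python
# Hypothetical sample text; '\u2192' (rightwards arrow) is not part of cp1252
sample = 'topics \u2192 repositories'

try:
    sample.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    # cp1252 cannot represent the arrow, so encoding fails
    cp1252_ok = False

# UTF-8 can encode the same text and decode it back unchanged
utf8_bytes = sample.encode('utf-8')

print(cp1252_ok)                             # False
print(utf8_bytes.decode('utf-8') == sample)  # True
```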

# Use beautiful soup to parse and extract information

In [10]:
from bs4 import BeautifulSoup

In [11]:
# parse the downloaded HTML
doc = BeautifulSoup(page_contents, 'html.parser')

In [12]:
type(doc)

bs4.BeautifulSoup

### Topics names

In [19]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})  # equivalent: doc.find_all('p', class_=selection_class)

In [20]:
len(topic_title_tags)

30

In [21]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

We successfully collected the topic names!
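
The two ways of passing the class to find_all can be compared on a small fragment. This is only a sketch: the HTML below is a made-up snippet that mimics the structure of the topics page, and `sample_doc`, `by_dict` and `by_keyword` are illustrative names.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment that mimics the structure of the topics page
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')
title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

# Attribute dictionary form
by_dict = sample_doc.find_all('p', {'class': title_class})
# Keyword form (class_ because class is a reserved word in Python)
by_keyword = sample_doc.find_all('p', class_=title_class)

print([tag.text for tag in by_dict])     # ['3D', 'Ajax']
print([tag.text for tag in by_keyword])  # ['3D', 'Ajax']
```

Both calls match on the full class string, so the description paragraph is skipped.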

### Topic description

In [42]:
topic_desc_tags= doc.find_all('p', class_ ='f5 color-fg-muted mb-0 mt-1')

In [43]:
len(topic_desc_tags)

30

In [24]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

### Topic URL

In [32]:
topic_URL_tags = doc.find_all('a', class_ = 'no-underline flex-1 d-flex flex-column')

In [33]:
len(topic_URL_tags)

30

In [35]:
topic_URL_tags[:1]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>
 </a>]

In [38]:
topic_urls = ["https://github.com" + tag['href'] for tag in topic_URL_tags]
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

Note: a TypeError occurs if you try to read href from topic_URL_tags directly, because topic_URL_tags is a list of elements, not a single element. Iterate over the list (as in the cell above) or select a specific element before accessing its href attribute.
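
The same point can be reproduced without the network. In this sketch a plain list of href strings stands in for the tags, and urljoin from the standard library is used as an alternative to string concatenation; `base_url`, `hrefs` and `absolute_urls` are illustrative names.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
# Stand-ins for tag['href'] read from each <a> tag
hrefs = ['/topics/3d', '/topics/ajax']

# Indexing the list itself with a string key fails with a TypeError
try:
    hrefs['href']
    raised = False
except TypeError:
    raised = True

# Reading the value one element at a time works
absolute_urls = [urljoin(base_url, href) for href in hrefs]

print(raised)         # True
print(absolute_urls)  # ['https://github.com/topics/3d', 'https://github.com/topics/ajax']
```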

# Create a csv file

In [39]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [47]:
topic_description=[]

for tag in topic_desc_tags:
    topic_description.append(tag.text.strip())
    
topic_description[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
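
The strip() call matters because the raw tag text carries the newline and indentation seen in the tag output earlier. A small sketch of the cleanup follows; the second string is a made-up example showing that split and join additionally collapse whitespace inside the text, which strip() alone does not.

```python
# Raw text in the shape returned by the first description tag
raw = ('\n           3D refers to the use of three-dimensional graphics, '
       'modeling, and animation in various industries.\n         ')

stripped = raw.strip()            # removes leading and trailing whitespace only
collapsed = ' '.join(raw.split()) # also collapses any internal whitespace runs

# Hypothetical description with a line break in the middle
wrapped = '  Ajax is a technique\n   for creating web applications.  '

print(stripped == collapsed)      # True
print(wrapped.strip())            # the internal newline and indentation remain
print(' '.join(wrapped.split()))  # Ajax is a technique for creating web applications.
```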

In [51]:
topic_urls = ["https://github.com" + tag['href'] for tag in topic_URL_tags]
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [49]:
import pandas as pd

In [55]:
# named topics_dict so the built-in dict is not shadowed
topics_dict = {'Title': topic_titles, 'Description': topic_description, 'URL': topic_urls}

In [56]:
df = pd.DataFrame(topics_dict)

In [58]:
df[:5]

|   | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [60]:
df.to_csv('topics.csv', index=False)
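
For readers curious what the CSV step amounts to, here is a rough sketch using only the standard library csv module instead of pandas. The `columns` dictionary is a shortened, illustrative version of the one above, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Column-oriented data in the same shape as the dictionary above (shortened)
columns = {
    'Title': ['3D', 'Ajax'],
    'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

buffer = io.StringIO()                    # in-memory stand-in for a file
writer = csv.writer(buffer)
writer.writerow(columns.keys())           # header row
writer.writerows(zip(*columns.values()))  # zip turns the columns into rows

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
# ['Title,URL', '3D,https://github.com/topics/3d', 'Ajax,https://github.com/topics/ajax']
```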

# Getting information out of a topic page

In [61]:
topic_page_url = topic_urls[0]

In [62]:
topic_page_url

'https://github.com/topics/3d'

In [63]:
response = requests.get(topic_page_url)
response.status_code

200

In [64]:
len(response.text)

515695

In [65]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [68]:
# each h3 tag holds the owner link and the repository link for one repo
repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')

In [67]:
len(repo_tags)

20

In [75]:
a_tags = []
for tag in repo_tags:
    a_tags.extend(tag.find_all('a'))

a_tags[:2]

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [73]:
a_tags[0].text.strip()

'mrdoob'

In [74]:
a_tags[1].text.strip()

'three.js'

In [80]:
# Keep only hrefs that contain a slash after the leading one ('/owner/repo');
# hrefs like '/owner' point to user profiles, not repositories.
repo_urls = ["https://github.com" + tag['href'] for tag in a_tags if '/' in tag['href'][1:]]

repo_urls[:5]

['https://github.com/mrdoob/three.js',
 'https://github.com/pmndrs/react-three-fiber',
 'https://github.com/libgdx/libgdx',
 'https://github.com/BabylonJS/Babylon.js',
 'https://github.com/ssloy/tinyrenderer']
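
The filter condition can be checked on its own with the two href values shown earlier. `is_repo_href` is a hypothetical helper name used only for this sketch.

```python
def is_repo_href(href):
    """Return True for '/owner/repo' style paths, False for '/owner'."""
    # Skip the leading slash, then look for a second one
    return '/' in href[1:]

hrefs = ['/mrdoob', '/mrdoob/three.js']
repo_hrefs = [href for href in hrefs if is_repo_href(href)]

print(repo_hrefs)  # ['/mrdoob/three.js']
```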

### Num stars

In [135]:
star_tags = topic_doc.find_all('span', class_ = 'Counter js-social-count')

In [136]:
len(star_tags)

20

In [137]:
star_tags[0].text

'100k'

In [138]:
star_texts = [tag.text for tag in star_tags]

# Remove 'k' and convert to integer
star_ints = [int(float(text.replace('k', '')) * 1000) for text in star_texts]

# Display the first 5 results
print(star_ints[:5])


[100000, 26600, 22900, 22800, 19800]


### We can do it using functions


In [106]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
parse_star_count(star_texts[0])


100000
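
One possible refinement, offered as a sketch rather than a required change: floating-point multiplication may land slightly below the intended whole number, and int() truncates, so round() is a safer conversion. `parse_star_count_rounded` is a hypothetical name, chosen so it does not replace the function above.

```python
def parse_star_count_rounded(stars_str):
    """Variant of parse_star_count that rounds instead of truncating."""
    stars_str = stars_str.strip()
    if stars_str.endswith('k'):
        # '26.6k' -> 26.6 * 1000, rounded to the nearest integer
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

print(parse_star_count_rounded('100k'))   # 100000
print(parse_star_count_rounded('26.6k'))  # 26600
print(parse_star_count_rounded(' 987 '))  # 987
```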

### Get repo info

In [139]:
def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, repo URL and star count for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()    # first link: owner
    repo_name = a_tags[1].text.strip()   # second link: repository
    repo_url = "https://github.com" + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars


In [140]:
repos_info = []
for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    repos_info.append(repo_info)
repos_info[:1]

[('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 100000)]

In [141]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

# Check lengths of repo_tags and star_tags
print(len(repo_tags), len(star_tags))

# Iterate over the shorter of the two lists to avoid an IndexError
for i in range(min(len(repo_tags), len(star_tags))):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    # get_repo_info returns (username, repo_name, repo_url, stars)
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['repo_url'].append(repo_info[2])
    topic_repos_dict['stars'].append(repo_info[3])


20 20
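
The comment in the cell above talks about iterating over the shorter of the two lists. Python's built-in zip does this on its own, since it stops at the end of the shortest input. A minimal sketch with made-up stand-ins for repo_tags and star_tags:

```python
# Hypothetical stand-ins for repo_tags and star_tags with unequal lengths
repos = ['three.js', 'react-three-fiber', 'libgdx']
stars = ['100k', '26.6k']

# zip pairs items by position and stops at the shorter list
pairs = list(zip(repos, stars))

print(pairs)  # [('three.js', '100k'), ('react-three-fiber', '26.6k')]
```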


In [144]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

# Ensure repo_tags and star_tags have the same length
if len(repo_tags) != len(star_tags):
    print("Error: repo_tags and star_tags have different lengths.")

else:
    # Iterate over repo_tags and star_tags to populate topic_repos_dict
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        # get_repo_info returns (username, repo_name, repo_url, stars)
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[3])
        topic_repos_dict['repo_url'].append(repo_info[2])



In [143]:
topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'FreeCAD',
  'lettier',
  'aframevr',
  'CesiumGS',
  'blender',
  'MonoGame',
  'mapbox',
  'isl-org',
  'metafizzy',
  'timzhang642',
  'nerfstudio-project',
  'a1studmuffin',
  '4ian',
  'FyroxEngine',
  'domlysz'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  'FreeCAD',
  '3d-game-shaders-for-beginners',
  'aframe',
  'cesium',
  'blender',
  'MonoGame',
  'mapbox-gl-js',
  'Open3D',
  'zdog',
  '3D-Machine-Learning',
  'nerfstudio',
  'SpaceshipGenerator',
  'GDevelop',
  'Fyrox',
  'BlenderGIS'],
 'stars': [100000,
  26600,
  22900,
  22800,
  19800,
  ...

In [145]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [146]:
topic_repos_df.head()

|   | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 100000 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 26600 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 22900 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 22800 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 19800 | https://github.com/ssloy/tinyrenderer |


# Automate it using functions

In [175]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = "https://github.com" + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars


def get_topic_repos(topic_doc):
    # h3 tags holding the owner and repository links for each repo
    repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    # span tags holding the star counts
    star_tags = topic_doc.find_all('span', class_='Counter js-social-count')

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'repo_url': [],
        'stars': []
    }

    # Pair each repo tag with its star tag; zip stops at the shorter list
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, repo_url, stars = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['repo_url'].append(repo_url)
        topic_repos_dict['stars'].append(stars)
    return pd.DataFrame(topic_repos_dict)

# Testing our functions

In [152]:
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [153]:
url4 = topic_urls[4]

In [156]:
url4

'https://github.com/topics/android'

In [176]:
topic4_doc = get_topic_page(url4)

In [177]:
topic4_repos = get_topic_repos(topic4_doc)

In [179]:
topic4_repos.head()

|   | username | repo_name | repo_url | stars |
|---|---|---|---|---|
| 0 | flutter | flutter | https://github.com/flutter/flutter | 163000 |
| 1 | facebook | react-native | https://github.com/facebook/react-native | 117000 |
| 2 | justjavac | free-programming-books-zh_CN | https://github.com/justjavac/free-programming-... | 110000 |
| 3 | Genymobile | scrcpy | https://github.com/Genymobile/scrcpy | 105000 |
| 4 | Hack-with-Github | Awesome-Hacking | https://github.com/Hack-with-Github/Awesome-Ha... | 79900 |


We can do it in one step:

In [182]:
get_topic_repos(get_topic_page(topic_urls[5])).head()

|   | username | repo_name | repo_url | stars |
|---|---|---|---|---|
| 0 | justjavac | free-programming-books-zh_CN | https://github.com/justjavac/free-programming-... | 110000 |
| 1 | angular | angular | https://github.com/angular/angular | 95200 |
| 2 | storybookjs | storybook | https://github.com/storybookjs/storybook | 83400 |
| 3 | leonardomso | 33-js-concepts | https://github.com/leonardomso/33-js-concepts | 62600 |
| 4 | ionic-team | ionic-framework | https://github.com/ionic-team/ionic-framework | 50700 |


In [183]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv('angular.csv', index=False)
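
The file name above is typed by hand. To create one CSV per topic, as the outline proposes, the name could be derived from the topic URL instead. The sketch below is one way to do that; `topic_csv_name` is a hypothetical helper, not part of the notebook. It uses the last path segment of the URL rather than the topic title, on the assumption that URL segments are unique while sanitized titles (for example 'C' and 'C++') might collide.

```python
import posixpath
from urllib.parse import urlparse

def topic_csv_name(topic_url):
    """Hypothetical helper: 'https://github.com/topics/3d' -> '3d.csv'."""
    # Take the last segment of the URL path and append the extension
    slug = posixpath.basename(urlparse(topic_url).path.rstrip('/'))
    return slug + '.csv'

urls = ['https://github.com/topics/3d', 'https://github.com/topics/android']
names = [topic_csv_name(url) for url in urls]

print(names)  # ['3d.csv', 'android.csv']
```

In a loop over topic_urls, the result of get_topic_repos(get_topic_page(url)) could then be written with to_csv(topic_csv_name(url), index=False).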

# Document and share your work