# Top Repositories for GitHub Topics

### Project Outline
- **Scraping GitHub Topics**
    - We will scrape the website https://github.com/topics to gather information about various topics.
- **Topic Details**
    - For each topic, we will extract the following information:
        - Topic title
        - Topic page URL
        - Topic description
- **Top Repositories**
    - For each topic, we will retrieve the top repositories listed on the topic page (20 per page at the time of scraping).
- **Repository Details**
    - For each repository, we will collect the following information:
        - Repository name
        - Username
        - Number of stars
        - Repository URL
- **CSV File Creation**
    - For each topic, we will generate a CSV file containing the repository details in the following format:
    
    Repo Name, Username, Stars, Repo URL
    three.js, mrdoob, 69700, https://github.com/mrdoob/three.js
    libgdx, libgdx, 18300, https://github.com/libgdx/libgdx
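
As a rough illustration of the target layout, the sketch below writes two rows in that format with Python's standard `csv` module. The `rows` list, the `rows_to_csv` helper, and the in-memory buffer are assumptions made for this example only; the notebook itself writes its files through pandas.

```python
import csv
import io

# Hypothetical sample rows in the outlined column order
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

def rows_to_csv(rows):
    """Serialize repository rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```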

## Using the requests library to download web pages

In [107]:
!pip install requests



In [108]:
import requests as rq

In [109]:
topics_url="https://github.com/topics"

In [110]:
response=rq.get(topics_url)

In [111]:
response.status_code

200

In [112]:
len(response.text)

156636

In [113]:
page_contents=response.text

In [114]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-946902aac6a1.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-030e28cb8394.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="

In [115]:
import requests

# Make a request to the webpage
response = requests.get('https://github.com/topics')

# Get the page contents
page_contents = response.text

# Write the page contents to a file using UTF-8 encoding
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## Utilizing Beautiful Soup for HTML Parsing and Data Extraction

In [116]:
from bs4 import BeautifulSoup

In [117]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [118]:
type(doc)

bs4.BeautifulSoup

In [119]:
p_tags=doc.find_all('p')

In [120]:
len(p_tags)

67

In [121]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         R
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">R is a free programming language and software environment for statistical computing and graphics.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Windows
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Windows is Microsoft's GUI-based operating system.</p>]

In [122]:
selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags=doc.find_all('p',{'class':selection_class})

In [123]:
len(topic_title_tags)

30

In [124]:
topic_title_tags[:10]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>]

In [125]:
desc_selector="f5 color-fg-muted mb-0 mt-1"
topic_desc_tags=doc.find_all('p',{'class':desc_selector})

In [126]:
len(topic_desc_tags)

30

In [127]:
topic_desc_tags[:10]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (App

In [128]:
topic_title_tag_0=topic_title_tags[0]

In [129]:
link_tag=topic_title_tag_0.parent

In [130]:
link_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

In [131]:
topic_link_tags = doc.find_all('a', {'class': "no-underline flex-1 d-flex flex-column"})

In [132]:
len(topic_link_tags)

30

In [133]:
topic_link_tags[0]

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

In [134]:
topic_link_tags[0]['href']

'/topics/3d'

In [135]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
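
String concatenation works here because every `href` on the page starts with `/`. A slightly more defensive option, sketched below, is `urllib.parse.urljoin` from the standard library, which also leaves already-absolute links untouched. The `absolute_url` helper and the `BASE_URL` constant are names invented for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"  # same base as used in the notebook

def absolute_url(href, base=BASE_URL):
    """Resolve a (possibly relative) href against the base URL."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```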


In [136]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [137]:
topic_title_tags[0]

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [138]:
topic_title_tags[0].text

'3D'

In [139]:
# Create an empty list to store the topic titles
topic_titles = []

# Iterate over each tag in the topic_title_tags list
for tag in topic_title_tags:
    # Extract the text from the current tag and append it to the topic_titles list
    topic_titles.append(tag.text)

# Print the list of topic titles
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [140]:
# Create an empty list to store the topic descriptions
topic_descs = []

# Iterate over each tag in the topic_desc_tags list
for tag in topic_desc_tags:
    # Extract the text from the current tag, remove leading/trailing whitespaces, and append it to the topic_descs list
    topic_descs.append(tag.text.strip())

# Print the first ten topic descriptions
topic_descs[:10]


['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.']

In [221]:
# Create an empty list to store the topic URLs
topic_urls = []

# Base URL for constructing the complete topic URLs
base_url = 'https://github.com'

# Iterate over each tag in the topic_link_tags list
for tag in topic_link_tags:
    # Extract the 'href' attribute value from the current tag and append it to the base URL
    # This constructs the complete topic URL
    topic_urls.append(base_url + tag['href'])

# Print the list of topic URLs
topic_urls


['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [222]:
import pandas as pd

In [223]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [224]:
topics_dict

{'title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'description': ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency library for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interface) is a 

In [145]:
topics_df = pd.DataFrame(topics_dict)

In [146]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


## Creating a CSV File with the extracted Info

In [147]:
topics_df.to_csv('topics.csv', index=False)

## Getting information out of a topic page

In [148]:
topic_page_url=topic_urls[0]

In [149]:
topic_page_url

'https://github.com/topics/3d'

In [150]:
response=rq.get(topic_page_url)

In [151]:
response

<Response [200]>

In [152]:
response.status_code

200

In [153]:
len(response.text)

466260

In [154]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [155]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )

In [156]:
len(repo_tags)

20

In [157]:
repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/thr

In [158]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [159]:
a_tags = repo_tags[0].find_all('a')

In [160]:
a_tags

[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [161]:
a_tags[0]

<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [162]:
a_tags[0].text

'\n            mrdoob\n'

In [163]:
a_tags[0].text.strip()

'mrdoob'

In [164]:
a_tags[1].text

'\n            three.js\n'

In [165]:
a_tags[1].text.strip()

'three.js'

In [166]:
base_url

'https://github.com'

In [167]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [168]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [169]:
len(star_tags)

20

In [170]:
star_tags[0]

<span aria-label="92788 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="92,788">92.8k</span>

In [171]:
star_tags[0].text

'92.8k'

In [172]:
star_tags[0].text.strip()

'92.8k'

In [173]:
def parse_star_count(stars_str):
    # Remove any leading or trailing whitespace
    stars_str = stars_str.strip()

    # Check if the star count ends with 'k'
    if stars_str[-1] == 'k':
        # If it does, multiply the numeric part by 1000 and convert it to an integer
        return int(float(stars_str[:-1]) * 1000)
    
    # If the star count doesn't end with 'k', convert it to an integer
    return int(stars_str)


In [174]:
# Calling the parse_star_count function on the text of the first star tag
parse_star_count(star_tags[0].text.strip())

92800
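
The abbreviated text `92.8k` loses precision, while the `title` attribute of the same tag (shown in the output above) holds the exact figure, `92,788`. The sketch below is one possible parser that accepts both forms. The `parse_count` name and the `m` suffix for millions are assumptions for illustration; only the `k` suffix appears in the scraped output.

```python
def parse_count(text):
    """Convert strings such as '92.8k', '1.2m' or '92,788' to an int."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed suffix
    if text and text[-1] in multipliers:
        # round() guards against floating-point results like 5699.999...
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count("92.8k"))
print(parse_count("92,788"))
```

With a real tag, this could be called as `parse_count(star_tags[0]['title'])` to get the exact count, assuming the `title` attribute is present on every star tag.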

In [175]:
def get_repo_info(h3_tag, star_tag):
    # Function to extract all the required information about a repository
    # Find all anchor tags within the h3 tag
    a_tags = h3_tag.find_all('a')
    # Extract the username from the first anchor tag and remove any leading or trailing whitespace
    username = a_tags[0].text.strip()
    # Extract the repository name from the second anchor tag and remove any leading or trailing whitespace
    repo_name = a_tags[1].text.strip()
    # Construct the full repository URL by appending the href attribute of the second anchor tag to the base URL
    repo_url =  base_url + a_tags[1]['href']
    # Extract the star count from the star tag and convert it to an integer using the parse_star_count function
    stars = parse_star_count(star_tag.text.strip())
    # Return the extracted information as a tuple: (username, repo_name, stars, repo_url)
    return username, repo_name, stars, repo_url

In [176]:
# Calling the get_repo_info function on the first repository tag and its corresponding star tag
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 92800, 'https://github.com/mrdoob/three.js')

In [177]:
# Generating a range of indices based on the length of the repo_tags list
range(len(repo_tags))

range(0, 20)

In [178]:
# Dictionary to store the repository information
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}
# Iterate over the range of indices based on the length of the repo_tags list
for i in range(len(repo_tags)):
    # Call the get_repo_info function on the i-th repository tag and its corresponding star tag
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    # Append the extracted information to the respective lists in the topic_repos_dict dictionary
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
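
The loop above relies on `repo_tags` and `star_tags` having the same length and order. If the page layout changes and one selector matches more elements than the other, the rows would be misaligned or the loop would raise an `IndexError`. A minimal guard, sketched here with plain strings standing in for the tags (the `pair_up` helper is hypothetical), could look like this:

```python
def pair_up(repo_items, star_items):
    """Pair items positionally, refusing lists of unequal length."""
    if len(repo_items) != len(star_items):
        raise ValueError(
            "Got {} repository entries but {} star entries".format(
                len(repo_items), len(star_items)))
    return list(zip(repo_items, star_items))

# Plain strings stand in for the h3 and span tags here
pairs = pair_up(["three.js", "libgdx"], ["92.8k", "21.6k"])
print(pairs)
```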


In [179]:
topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'aframevr',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'isl-org',
  'timzhang642',
  'blender',
  'a1studmuffin',
  'domlysz',
  'FyroxEngine',
  'google',
  'openscad',
  'nerfstudio-project',
  'spritejs'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'FreeCAD',
  'cesium',
  'zdog',
  'Open3D',
  '3D-Machine-Learning',
  'blender',
  'SpaceshipGenerator',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'openscad',
  'nerfstudio',
  'spritejs'],
 'stars': [92800,
  23000,
  21600,
  20900,
  17200,
  15600,
  15500,
  14300,
  10600,
  9900,
  9100,
  8900,
  8800,
  7400,
  6400,
  6300,
  5700,
  5700,
  5600,
  5200],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.j

In [180]:
topic_repos_df=pd.DataFrame(topic_repos_dict)

### The Top 20 Repositories

In [181]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,92800,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,23000,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21600,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20900,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17200,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15600,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15500,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14300,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10600,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9900,https://github.com/metafizzy/zdog


## Final Code

In [197]:
import os

# Function to retrieve the topic page using the provided URL
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check for successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse the page using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

# Function to extract repository information from the h3 tag and star tag
def get_repo_info(h3_tag, star_tag):
    # Retrieve required information about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

# Function to scrape and retrieve topic repositories
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repository title, URL, and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )
    
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    # Dictionary to store repository information
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Get repository info
    for i in range(len(repo_tags)):
        # Call the get_repo_info function to retrieve repository information
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        
        # Append the repository info to the respective lists in the topic_repos_dict dictionary
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    # Create a pandas DataFrame from the topic_repos_dict
    return pd.DataFrame(topic_repos_dict)


# Function to scrape a topic and save the data to a CSV file
def scrape_topic(topic_url, path):
    # Check if the file already exists
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return

    # Get the topic repositories dataframe
    topic_df = get_topic_repos(get_topic_page(topic_url))

    # Save the dataframe as a CSV file
    topic_df.to_csv(path, index=False)

In [198]:
url4=topic_urls[4]

In [199]:
url4

'https://github.com/topics/android'

In [200]:
topic4_doc=get_topic_page(url4)

In [201]:
topic4_repos=get_topic_repos(topic4_doc)

In [202]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,155000,https://github.com/flutter/flutter
1,facebook,react-native,110000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,103000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,86600,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,66300,https://github.com/Hack-with-Github/Awesome-Ha...
5,google,material-design-icons,48200,https://github.com/google/material-design-icons
6,Solido,awesome-flutter,47300,https://github.com/Solido/awesome-flutter
7,wasabeef,awesome-android-ui,46700,https://github.com/wasabeef/awesome-android-ui
8,square,okhttp,44100,https://github.com/square/okhttp
9,android,architecture-samples,42800,https://github.com/android/architecture-samples


In [204]:
get_topic_repos(get_topic_page(topic_urls[4]))

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,155000,https://github.com/flutter/flutter
1,facebook,react-native,110000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,103000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,86600,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,66300,https://github.com/Hack-with-Github/Awesome-Ha...
5,google,material-design-icons,48200,https://github.com/google/material-design-icons
6,Solido,awesome-flutter,47300,https://github.com/Solido/awesome-flutter
7,wasabeef,awesome-android-ui,46700,https://github.com/wasabeef/awesome-android-ui
8,square,okhttp,44100,https://github.com/square/okhttp
9,android,architecture-samples,42800,https://github.com/android/architecture-samples


In [205]:
topic_urls[0]

'https://github.com/topics/3d'

In [207]:
get_topic_repos(get_topic_page(topic_urls[0]))

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,92800,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,23000,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21600,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20900,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17200,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15600,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15500,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14300,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10600,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9900,https://github.com/metafizzy/zdog


In [208]:
# Call get_topic_page to retrieve the topic page for the first URL in topic_urls
topic_page = get_topic_page(topic_urls[0])

# Call get_topic_repos to extract repository information from the topic page
topic_repos = get_topic_repos(topic_page)

# Save the extracted repository information as a CSV file named "3D.csv" without including the index column
topic_repos.to_csv('3D.csv', index=False)


## Objective

Write a single function to achieve the following tasks:

1. Get the list of topics from the topics page.
2. Get the list of top repositories from the individual topic pages.
3. For each topic, create a CSV file containing the top repositories for that topic.
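
Since step 3 uses topic titles as file names, titles containing characters such as `/` could produce invalid paths. The sketch below shows one way to normalize a title first. The `safe_filename` helper, the allowed character set, and the `CI/CD` sample title are assumptions for illustration, not part of the notebook.

```python
import re

def safe_filename(title, extension=".csv"):
    """Build a filesystem-friendly file name from a topic title."""
    # Replace every run of characters outside the allowed set with "_"
    stem = re.sub(r"[^A-Za-z0-9._+-]+", "_", title).strip("_")
    # Fall back to a generic name if nothing usable is left
    return (stem or "topic") + extension

for title in ["3D", "Command line interface", "C++", "CI/CD"]:
    print(title, "->", safe_filename(title))
```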

In [237]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_topic_titles(doc):
    # Extracts the titles of the topics from the topic page
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    # Extracts the descriptions of the topics from the topic page
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    # Extracts the URLs of the topics from the topic page
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    # Scrapes the topics page to get the list of topics
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
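    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page; without this, the helpers below would read a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    # Create a dictionary to store the topics' titles, descriptions, and URLs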
    topics_dict = {
        'title': get_topic_titles(doc),       # Get the titles of the topics
        'description': get_topic_descs(doc),  # Get the descriptions of the topics
        'url': get_topic_urls(doc)            # Get the URLs of the topics
    }
    
    return pd.DataFrame(topics_dict)


In [238]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [242]:
# Iterate over each row in the topics DataFrame
for index, row in topics_df.iterrows():
    # Print the title and URL of the current topic
    print(row['title'], row['url'])

3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topics/clojure
Code quality h

In [243]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()  # Scrape the topics and get the DataFrame
    
    os.makedirs('data', exist_ok=True)  # Create a directory to store the scraped data
    
    # Iterate over each topic in the DataFrame
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        # Scrape the repositories for the current topic and save it to a CSV file
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))


In [244]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

In [245]:
import zipfile
import os

def zip_data(directory, zip_filename):
    # Create a new ZIP file
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        # Iterate over each file in the specified directory
        for root, _, files in os.walk(directory):
            for file in files:
                # Get the full path of the file
                file_path = os.path.join(root, file)
                # Add the file to the ZIP archive
                zipf.write(file_path, os.path.relpath(file_path, directory))

    print(f'Successfully created {zip_filename}')

def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()

    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

    # Zip the data directory
    zip_filename = 'data.zip'
    zip_data('data', zip_filename)

    # Delete the data directory (optional)
    # os.rmdir only removes empty directories, so use shutil.rmtree instead:
    # import shutil
    # shutil.rmtree('data', ignore_errors=True)

# Call the function to scrape topics and repositories and zip the data
scrape_topics_repos()


Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre