# Scraping Top Repositories for GitHub Topics

### Todo:

- Introduction about web scraping
- Introduction about Github and the problem statement
- Mention the tools you're using

#### Web Scraping:

Web scraping is the process of automatically extracting data from websites. 
It involves using software tools to extract the desired data from the HTML code of a webpage and save it in a structured format, such as a spreadsheet or a database. 
Web scraping can be useful for a variety of purposes, such as market research, data analysis, and price monitoring. 
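
To make the idea concrete, here is a minimal sketch of the extract-then-save cycle using only Python's standard library. The HTML fragment, the `ItemCollector` class, and the `name` column are made up for illustration; the notebook itself uses Requests and Beautiful Soup instead.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="item">Widget</li>
  <li class="item">Gadget</li>
</ul>
"""


class ItemCollector(HTMLParser):
    """Collects the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())


collector = ItemCollector()
collector.feed(SAMPLE_HTML)

# Save the extracted values in a structured (CSV) form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name"])
writer.writerows([item] for item in collector.items)
print(buffer.getvalue())
```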

#### GitHub:

GitHub is a web-based platform for version control and collaboration that is widely used by developers and software teams. It provides source code management, issue tracking, project management, and collaboration tools that help teams work together on software development projects. Developers can host and review code, track changes, collaborate with others on projects, and distribute their code to other users. It also has a social networking side: developers can follow other users, comment on code, and contribute to open-source projects.

#### Tools Used:

- Python
- Requests
- Beautiful Soup
- Pandas

### Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the first page of the topic page (20 in the run shown below)
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:


   Repo Name,Username,Stars,Repo URL
   three.js,mrdoob,69700,https://github.com/mrdoob/three.js
   libgdx,libgdx,18300,https://github.com/libgdx/libgdx
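
As a rough sketch of how rows in this format could be produced, the snippet below writes the two sample rows above with Python's `csv` module. The `rows_to_csv` helper is a hypothetical name used only here; the notebook itself writes its files with pandas.

```python
import csv
import io

# Rows taken from the sample format above: repo name, username, stars, repo URL.
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]


def rows_to_csv(rows):
    """Return CSV text: a header row followed by one line per repository."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```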




## Step 1: Use the requests library to download web pages

In [1]:
# !pip install requests --quiet

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code

200

In [6]:
# response.text  # uncomment to print the whole web page
len(response.text)

152549

In [7]:
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="https:/

## Use Beautiful Soup to parse and extract information

In [8]:
# !pip install beautifulsoup4 --quiet

In [9]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_contents,'html.parser')

In [11]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p',{'class': selection_class})
len(topic_title_tags)

30

In [12]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
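
The selection above matches the full class string, so it depends on GitHub keeping exactly these classes in exactly this order. A minimal sketch of a looser match on a single class is shown below. It runs on a hand-written fragment modelled on the output above rather than on the live page, and whether `Link--primary` alone is specific enough on the real page is an assumption that would need checking.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modelled on the tags printed above (not fetched from GitHub).
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">3D refers to three-dimensional graphics.</p>
</a>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact class-string match, as in the cell above.
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# Single-class match: any <p> whose class list contains Link--primary.
by_one_class = sample_doc.find_all("p", class_="Link--primary")

print([tag.text for tag in exact])
print([tag.text for tag in by_one_class])
```

On this fragment both calls select the same title tag; the second keeps working if the other utility classes are reordered or renamed.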

In [13]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tag = doc.find_all('p',{'class':desc_selector})
topic_desc_tag[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [14]:
topic_title_tag0 = topic_title_tags[0]
topic_title_tag = topic_title_tag0.parent
topic_title_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

In [15]:
topic0_url = "https://github.com" + topic_title_tag['href']
print(topic0_url)

https://github.com/topics/3d
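
String concatenation works here because the `href` is a path starting with `/`. As a hedged alternative, the standard library's `urljoin` handles both relative and already-absolute links; the second URL below is only an illustrative input.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A relative href, as found in the <a> tag above.
relative_result = urljoin(base_url, "/topics/3d")
# An href that is already absolute is returned unchanged.
absolute_result = urljoin(base_url, "https://github.com/topics/ajax")

print(relative_result)
print(absolute_result)
```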


In [16]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [17]:
topic_desc = []
for tag in topic_desc_tag:
    topic_desc.append(tag.text.strip())

topic_desc[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [18]:
topic_link_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})
len(topic_link_tags)

30

In [19]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil
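
The titles, descriptions, and URLs above come from three separate `find_all` calls, so the lists only line up if every topic card contains all three pieces. One possible alternative, sketched below on a hand-written fragment, walks each link tag and reads the title and description from inside it. The `parse_topic_cards` helper and the second card with no description are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modelled on the topic cards shown earlier (not live data).
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to the use of three-dimensional graphics.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""


def parse_topic_cards(doc, base_url="https://github.com"):
    """Return one dict per topic card; a missing title or description becomes ''."""
    topics = []
    for link in doc.find_all("a", {"class": "no-underline flex-1 d-flex flex-column"}):
        title_tag = link.find("p", class_="Link--primary")
        desc_tag = link.find("p", class_="color-fg-muted")
        topics.append({
            "title": title_tag.text.strip() if title_tag else "",
            "description": desc_tag.text.strip() if desc_tag else "",
            "url": base_url + link["href"],
        })
    return topics


topics = parse_topic_cards(BeautifulSoup(sample_html, "html.parser"))
for topic in topics:
    print(topic)
```

Because each dict is built from a single card, a card without a description cannot shift the remaining rows out of alignment.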

## Create a DataFrame with pandas

In [20]:
import pandas as pd

In [21]:
topic_dict = {
    'title' : topic_titles,
    'description' : topic_desc,
    'url' : topic_urls
}


In [22]:
topic_df = pd.DataFrame(topic_dict)
topic_df.head(5)

|   | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


## Create CSV file(s) with the extracted information

In [24]:
# index=False leaves the DataFrame's row index out of the file
topic_df.to_csv('topics.csv', index=False)

## Getting information out of a topic page

In [25]:
topic_page_url = topic_urls[0]

In [26]:
topic_page_url

'https://github.com/topics/3d'

In [27]:
response = requests.get(topic_page_url)

In [28]:
response.status_code

200

In [29]:
len(response.text)

456030

In [30]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [31]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

In [32]:
len(repo_tags)

20

In [34]:
a_tags = repo_tags[0].find_all('a')

In [37]:
a_tags[0].text.strip()

'mrdoob'

In [38]:
a_tags[1].text.strip()

'three.js'

In [40]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)


https://github.com/mrdoob/three.js


In [44]:
star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count'})

In [45]:
len(star_tags)

20

In [47]:
star_tags[0].text.strip()

'89k'

In [48]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [49]:
parse_star_count(star_tags[0].text.strip())

89000
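
`parse_star_count` covers plain integers and the `k` suffix. The variant below is a sketch of a slightly more tolerant parser; the name `parse_count` is hypothetical, and whether GitHub ever renders counts with commas or an `m` suffix on topic pages is an assumption, not something observed in this notebook.

```python
def parse_count(text):
    """Convert strings such as '89k', '1.2k', '845' or '1,024' to an int."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    if cleaned and cleaned[-1] in multipliers:
        # round() rather than int(), so a float product that lands just
        # below a whole number is not truncated downward
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


for sample in ["89k", " 1.2k ", "845", "1,024", "2m"]:
    print(repr(sample), "->", parse_count(sample))
```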

In [52]:
def get_repo_info(h3_tag,star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

In [53]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 89000, 'https://github.com/mrdoob/three.js')

In [56]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars' : [],
    'repo_url': []
}
for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
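
The loop above indexes `repo_tags` and `star_tags` in parallel, which assumes both lists have the same length and order. A small guard like the one sketched here would fail loudly if the page layout ever yields a different number of star counters. `pair_tags` is a hypothetical helper, and plain strings stand in for the Beautiful Soup tags.

```python
def pair_tags(repo_tags, star_tags):
    """Pair each repository tag with its star tag, failing loudly on a mismatch."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            f"found {len(repo_tags)} repo tags but {len(star_tags)} star tags"
        )
    return list(zip(repo_tags, star_tags))


# Plain strings stand in for the Beautiful Soup tags used in the notebook.
pairs = pair_tags(["mrdoob/three.js", "libgdx/libgdx"], ["89k", "18.3k"])
for repo, stars in pairs:
    print(repo, stars)
```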
    

In [73]:

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc


def get_repo_info(h3_tag,star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url


def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

    # Get star tags
    star_tags = topic_doc.find_all('span',{'class': 'Counter js-social-count'})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars' : [],
        'repo_url': []
    }

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)
    
    
    



In [80]:
get_topic_repos(get_topic_page(topic_urls[4])).to_csv('Android.csv', index=False)
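
The file name above is typed by hand for one topic. To cover every topic, the name would have to be derived from the topic title, and some titles (for example 'Chrome extension') contain characters that are awkward in file names. The helper below is one possible sketch; `topic_csv_filename` and its replacement rule are assumptions, not part of the notebook.

```python
import re


def topic_csv_filename(topic_title):
    """Build a CSV file name from a topic title (hypothetical helper)."""
    # Replace each run of characters other than letters, digits,
    # '.', '+', '-' or '_' with a single underscore.
    safe = re.sub(r"[^A-Za-z0-9.+_-]+", "_", topic_title.strip())
    return safe + ".csv"


for title in ["Android", "Chrome extension", "C++", "COVID-19"]:
    print(title, "->", topic_csv_filename(title))
```

With the notebook's own names, looping over `zip(topic_titles, topic_urls)` and calling `get_topic_repos(get_topic_page(url)).to_csv(topic_csv_filename(title), index=False)` would then write one file per topic. That loop is not run here because it needs network access.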