Web Scraping Project: TOP GITHUB REPOSITORIES

- Objective: To scrape and analyze data for the most popular repositories on GitHub

- Steps:
    - Finding the URL of the GitHub page that lists the top repositories by topic
    - Using a web scraping library or tool to extract the names, descriptions, and other relevant information such as the URLs of the repositories
    - Storing the scraped data in a structured format such as a JSON or CSV file


In [1]:
# Importing the required Libraries:

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

Beautiful Soup is a Python library for web scraping. It parses HTML documents and lets us extract data from them by filtering on tags and attributes.
Here we use Beautiful Soup together with the requests library: requests sends the HTTP request to the GitHub website and returns the response.
Beautiful Soup then finds the tags that contain the information we need, so that we can read their attributes and text.
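
As a minimal sketch of the parsing step described above, the snippet below runs Beautiful Soup on a small inline HTML string instead of a live page. The markup, the `title` class and the `demo_` variable names are made up for illustration and are not taken from GitHub.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, loosely modelled on a topic listing.
html = """
<div>
  <a href="/topics/3d"><p class="title">3D</p></a>
  <a href="/topics/ajax"><p class="title">Ajax</p></a>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find_all filters by tag name and attributes; .text returns the tag's text.
demo_titles = [p.text for p in demo_soup.find_all("p", {"class": "title"})]

# Attributes are read with dictionary-style access on the tag.
demo_links = [a["href"] for a in demo_soup.find_all("a")]

print(demo_titles)
print(demo_links)
```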

In [3]:
url = "https://github.com/topics"

In [4]:
response = requests.get(url)

In [5]:
### Checking that the request succeeded by printing the status code and the length of the page:
print(response.status_code)
print(len(response.text))

200
165353
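
Printing the status code works for a manual check. One possible way to make the check automatic is sketched below: `raise_for_status()` from requests raises an error for 4xx and 5xx responses. The helper name `fetch_html`, its `get` parameter and the 10 second timeout are assumptions made for this sketch, not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body, raising if the request did not succeed.

    `get` is injectable so the function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

With this helper, the download step could be written as `page_contents = fetch_html("https://github.com/topics")`.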


In [6]:
topics = response.text
topics[:100]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="d'

In [7]:
### Saving the page contents to a file:
page_contents = response.text

with open('webpage.html', 'w', encoding='utf-8', errors='ignore') as f:
    f.write(page_contents)
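
A saved copy can be parsed again later without another request to GitHub. The round trip below is a small sketch of that idea; it writes to a temporary directory, and the helper names `save_page` and `load_page` are hypothetical.

```python
import tempfile
from pathlib import Path


def save_page(text, path):
    """Write the page text to disk as UTF-8."""
    Path(path).write_text(text, encoding="utf-8")


def load_page(path):
    """Read a previously saved page back as text."""
    return Path(path).read_text(encoding="utf-8")


# Round trip in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "webpage.html"
    save_page("<html><p>hello</p></html>", demo_path)
    reloaded = load_page(demo_path)

print(reloaded)
```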

In [8]:
soup = BeautifulSoup(page_contents, 'html.parser')

In [9]:
type(soup)

bs4.BeautifulSoup

In [10]:
### Inspecting the page's HTML shows that the relevant information sits in <p> tags:
p_tags = soup.find_all('p')
len(p_tags)  # number of <p> tags on the page

69

In [11]:
p_tags[:10]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Ratchet
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Ratchet is a set of libraries to handle WebSockets asynchronously in PHP.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Bash
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Bash is a shell and command language interpreter for the GNU operating system.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Phaser
       </p>,
 <p class="f5 color-fg-m

In [12]:
select = 'f3 lh-condensed mb-0 mt-1 Link--primary'
repos_title_tags = soup.find_all('p', {'class': select})
len(repos_title_tags)

30
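
Passing a space-separated class string to `find_all` matches the exact value of the `class` attribute, so a tag carrying the same classes in a different order (or with extra ones) is not matched. The sketch below contrasts this with a CSS selector via `select`, which ignores class order. The three `<p>` tags reuse class strings seen in the outputs above; the text "A description" and the variable names are placeholders.

```python
from bs4 import BeautifulSoup

html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Ratchet</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first tag qualifies.
exact = demo.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: every listed class must be present, in any order.
by_css = demo.select("p.f3.lh-condensed.Link--primary")

print([p.text for p in exact])
print([p.text for p in by_css])
```

Which variant is preferable depends on the page: the exact string is stricter and filters out look-alike tags, while the CSS selector is more tolerant of small markup changes.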

In [13]:
repos_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [14]:
### Topic description tags:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
repos_desc_tags = soup.find_all('p', {'class': desc_selector})

In [15]:
repos_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [16]:
repos_title_tag0 = repos_title_tags[0]
repos_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [17]:
repos_title_tag0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
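
Since the parent `<a>` tag wraps both the title and the description, one alternative is to walk the `<a>` tags and read all three fields from each card, which keeps title, description and URL aligned even if a card lacks a description. The sketch below assumes the card layout shown above; `parse_topic_cards` is a hypothetical helper, and the second card in the sample markup is invented to show the missing-description case.

```python
from bs4 import BeautifulSoup

html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""


def parse_topic_cards(soup, base_url="https://github.com"):
    """Return one dict per card with its title, description and URL."""
    records = []
    link_class = "no-underline flex-1 d-flex flex-column"
    for link in soup.find_all("a", {"class": link_class}):
        paragraphs = link.find_all("p")
        # First <p> is assumed to be the title, second the description.
        title = paragraphs[0].text.strip() if paragraphs else ""
        desc = paragraphs[1].text.strip() if len(paragraphs) > 1 else ""
        records.append(
            {"Title": title, "Description": desc, "URL": base_url + link["href"]}
        )
    return records


cards = parse_topic_cards(BeautifulSoup(html, "html.parser"))
for card in cards:
    print(card)
```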

In [18]:
repos_link_tags = soup.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
len(repos_link_tags)

30

In [19]:
repos_link_tags[0]['href']

'/topics/3d'

In [20]:
repos0_url = 'https://github.com' + repos_link_tags[0]['href']
print(repos0_url)

https://github.com/topics/3d
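
String concatenation works here because every `href` starts with `/`. As a sketch of a more general approach, `urljoin` from the standard library resolves relative links against a base and leaves absolute links unchanged. The helper name `absolute_url` is an assumption for this example.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_url(href, base_url=BASE_URL):
    """Resolve a possibly relative href against the base URL."""
    return urljoin(base_url, href)


print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```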


In [21]:
repos_title_tags[0].text

'3D'

In [22]:
repos_titles = [] 
            
for tag in repos_title_tags: 
    repos_titles.append(tag.text)

print(repos_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [23]:
repos_descs = [] 
            
for tag in repos_desc_tags: 
    repos_descs.append(tag.text.strip())

print(repos_descs)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

In [24]:
repos_urls = []
base_url = 'https://github.com'

for tag in repos_link_tags:
    repos_urls.append(base_url + tag['href'])
    
repos_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [25]:
repos_dict = {'Title': repos_titles, 'Description': repos_descs, 'URL': repos_urls}
repos_df = pd.DataFrame(repos_dict)
repos_df

   Title      Description                                        URL
0  3D         3D refers to the use of three-dimensional grap...  https://github.com/topics/3d
1  Ajax       Ajax is a technique for creating interactive w...  https://github.com/topics/ajax
2  Algorithm  Algorithms are self-contained sequences that c...  https://github.com/topics/algorithm
3  Amp        Amp is a non-blocking concurrency library for ...  https://github.com/topics/amphp
4  Android    Android is an operating system built by Google...  https://github.com/topics/android
5  Angular    Angular is an open source web application plat...  https://github.com/topics/angular
6  Ansible    Ansible is a simple and powerful automation en...  https://github.com/topics/ansible
7  API        An API (Application Programming Interface) is ...  https://github.com/topics/api
8  Arduino    Arduino is an open source platform for buildin...  https://github.com/topics/arduino
9  ASP.NET    ASP.NET is a web framework for building modern...  https://github.com/topics/aspnet
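
Building a DataFrame from a dictionary of lists only works when all lists have the same length; pandas raises a `ValueError` otherwise. A small guard like the one sketched below can make such a mismatch easier to diagnose before the DataFrame is created. The function name `build_columns` is hypothetical, and its result can be passed to `pd.DataFrame`.

```python
def build_columns(titles, descs, urls):
    """Return a column dict after checking that all lists line up."""
    lengths = {"Title": len(titles), "Description": len(descs), "URL": len(urls)}
    if len(set(lengths.values())) != 1:
        # Report every length so the short column is easy to spot.
        raise ValueError(f"column lengths differ: {lengths}")
    return {"Title": list(titles), "Description": list(descs), "URL": list(urls)}


columns = build_columns(
    ["3D", "Ajax"],
    ["3D description", "Ajax description"],  # placeholder descriptions
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(columns["Title"])
```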


In [26]:
### Writing the data to a CSV file (index=False leaves out the row index column):
repos_df.to_csv('topics.csv', index=False)
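
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows the difference on a two-row frame built only for this illustration; it writes to an in-memory string rather than a file.

```python
import io

import pandas as pd

# Small illustrative frame, not the full scraped data.
demo_df = pd.DataFrame(
    {
        "Title": ["3D", "Ajax"],
        "URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
    }
)

# to_csv() without a path returns the CSV text as a string.
with_index = pd.read_csv(io.StringIO(demo_df.to_csv()))
without_index = pd.read_csv(io.StringIO(demo_df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```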