### Top Repositories for GitHub Topics

### Objectives:

    A: Browse the GitHub topics site and select the top topics to scrape.
    B: Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
    C: Summarize your project idea and outline your strategy in a Jupyter notebook.


### Project Outline

- The site to scrape: https://github.com/topics
- I'll extract a list of topics from the site. For each topic, I'll extract the topic title, topic page URL and topic description.
- For each topic, I'll get the top 25 repositories from the topic page.
- For each repository, I'll grab the repo name, username, stars and repo URL.
- For each topic, I'll create a CSV file in the following format:

   Repo Name,Username,Stars,Repo URL 
   three.js,mrdoob,69700,https://github.com/mrdoob/three.js 
   libgdx,libgdx,18300,https://github.com/libgdx/libgdx
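
The format above can be produced with Python's built-in `csv` module. The following is a minimal sketch rather than the notebook's final implementation: the function name `write_repos_csv` and the dictionary keys are hypothetical, and the two sample rows are the ones listed above.

```python
import csv
import io

# Column order taken from the sample output above.
FIELDNAMES = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def write_repos_csv(rows, file_obj):
    """Write repository rows (a list of dicts keyed by FIELDNAMES) as CSV."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)

sample_rows = [
    {'Repo Name': 'three.js', 'Username': 'mrdoob', 'Stars': 69700,
     'Repo URL': 'https://github.com/mrdoob/three.js'},
    {'Repo Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18300,
     'Repo URL': 'https://github.com/libgdx/libgdx'},
]

# StringIO stands in for a real file opened with open(path, 'w', newline='').
buffer = io.StringIO()
write_repos_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Using `DictWriter` keeps the column order in one place, so a row with its keys in a different order still lands in the right columns.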


### Using the requests library to download web pages

In [6]:
!pip install requests --upgrade --quiet



In [1]:
import requests

In [17]:
topics_url = 'https://github.com/topics'

In [18]:
response = requests.get(topics_url)

In [19]:
#verifying the HTTP status code of the response
response.status_code

200
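
A status code of 200 means the request succeeded; with any other code, `response.text` may not contain the topics page. A small guard, sketched below, makes that check explicit. The helper name `page_text_or_raise` is hypothetical, and a `SimpleNamespace` stands in for a real `requests` response so the sketch runs without network access.

```python
from types import SimpleNamespace

def page_text_or_raise(response):
    """Return response.text if the status code is 200, otherwise raise."""
    if response.status_code != 200:
        raise RuntimeError(f'Failed to load page, status code {response.status_code}')
    return response.text

# Stand-ins exposing the two attributes the helper reads.
ok_response = SimpleNamespace(status_code=200, text='<html></html>')
missing_response = SimpleNamespace(status_code=404, text='Not Found')

print(page_text_or_raise(ok_response))

try:
    page_text_or_raise(missing_response)
except RuntimeError as error:
    print(error)
```

With a real response this would be called as `page_text_or_raise(requests.get(topics_url))`. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx status codes.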

In [20]:
#getting the length of the webpage content
len(response.text)

185461

In [21]:
page_contents = response.text

In [28]:
#Viewing the first 1000 characters of the page content
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0eace2597ca3.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-a167e256da9c.css" /><link data-color-theme="dark_dimmed" crossor

In [29]:
# Saving the full page_contents to an HTML file
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

### Using Beautiful Soup to parse and extract information

In [30]:
!pip install beautifulsoup4 --upgrade --quiet



In [74]:
from bs4 import BeautifulSoup

In [75]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [76]:
type(doc)

bs4.BeautifulSoup

In [77]:
#the topic titles sit inside <p> tags,
#so start by searching for all <p> tags

topic_title_tags = doc.find_all('p')

In [78]:
len(topic_title_tags)

69

In [79]:
#slicing the first 5 p_tags
topic_title_tags[:5]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         React Native
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">React Native is a JavaScript mobile framework developed by Facebook.</p>]

In [80]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p',{'class':selection_class})

In [81]:
len(topic_title_tags)

30

In [82]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
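
Passing a multi-class string such as `selection_class` to `find_all` matches the `class` attribute as an exact string, so a tag whose classes appear in a different order, or with an extra class, is missed. A CSS selector via `select` matches each class independently. The sketch below illustrates the difference on a hand-written HTML snippet (an assumption for illustration, not the live page); the second `<p>` reuses the class order seen in the earlier `find_all('p')` output.

```python
from bs4 import BeautifulSoup

sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">React Native</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description, not a title.</p>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact string match on the class attribute: order and extra classes matter.
exact_matches = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: every listed class must be present, in any order.
css_matches = sample_doc.select('p.f3.lh-condensed.Link--primary')

print([tag.text for tag in exact_matches])
print([tag.text for tag in css_matches])
```

If GitHub reorders the classes on the page, the selector form is less likely to return an empty list without warning.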

In [83]:
#Extracting description
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class':desc_selector})

In [84]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [85]:
#Extracting the URL of the first topic page
#step 1: finding the topic link tags
topic_link_tags = doc.find_all('a', {'class':'no-underline flex-grow-0'})

In [86]:
len(topic_link_tags)

30

In [87]:
#prepending the GitHub base URL to the href of the first link tag
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
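
String concatenation works here because the `href` values seen so far are root-relative paths. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with an `href` that is already absolute. The second `href` below is a made-up example for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A root-relative href, as found in the topic link tags.
relative_url = urljoin(base_url, '/topics/3d')

# An already-absolute href is returned as-is instead of being doubled up.
absolute_url = urljoin(base_url, 'https://github.com/topics/ajax')

print(relative_url)
print(absolute_url)
```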


In [89]:
#extracting a list of the topic titles from the topic_title_tags using a for loop
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles) 

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [92]:
topic_link_tags[0]

<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>

In [106]:
topic_desc_tags[:2]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>]

In [115]:
#extracting a list of the topic descriptions from the topic_desc_tags using a for loop

topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip()) #.text drops the tag markup, .strip() removes the surrounding whitespace
print(topic_descs[:5])

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [101]:
#extracting a list of the topic URLs from the topic_link_tags using a for loop
topic_urls = []
base_url = "https://github.com"

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
print(topic_urls)
    

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t
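
The titles, descriptions and URLs are collected in three separate lists that are related only by position, so a selector that misses a tag would misalign them. A length check before combining them guards against that. This is a minimal sketch; `build_topic_rows` is a hypothetical helper, and the sample values are taken from the output above.

```python
def build_topic_rows(titles, descs, urls):
    """Combine three index-aligned lists into one dict per topic."""
    if not (len(titles) == len(descs) == len(urls)):
        raise ValueError(
            f'Length mismatch: {len(titles)} titles, '
            f'{len(descs)} descriptions, {len(urls)} URLs'
        )
    return [
        {'Title': title, 'Description': desc, 'Link': url}
        for title, desc, url in zip(titles, descs, urls)
    ]

topic_rows = build_topic_rows(
    ['3D', 'Ajax'],
    ['3D refers to the use of three-dimensional graphics, modeling, '
     'and animation in various industries.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(topic_rows[0]['Title'], topic_rows[0]['Link'])
```

A list of dicts in this shape can also be passed directly to `pandas.DataFrame`.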

### Using Pandas to Create a DataFrame from the Extracted Data and Saving it to a CSV File

In [121]:
import pandas as pd

In [138]:
topics_dict = {
    'Title':topic_titles,
    'Description':topic_descs,
    'Link':topic_urls,
    
}

In [139]:
topics_df = pd.DataFrame(topics_dict)

In [142]:
topics_df

| | Title | Description | Link |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [141]:
# Saving the created dataframe to a csv file
topics_df.to_csv('topics.csv', index=False)

### Getting information out of a topic page
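
The outline stores star counts as plain integers such as 69700. Assuming the topic page displays them in abbreviated form (for example `69.7k`), which is an assumption not shown in this notebook, a small converter is needed. `parse_star_count` below is a hypothetical helper sketching that conversion.

```python
def parse_star_count(stars_str):
    """Convert a star count such as '69.7k' or '512' to an int."""
    stars_str = stars_str.strip().lower()
    if stars_str.endswith('k'):
        # round() guards against floating-point error in the multiplication.
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

print(parse_star_count('69.7k'))
print(parse_star_count('18.3k'))
print(parse_star_count('512'))
```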

### Create a CSV

### Document and Share the Work