# Top Repositories for GitHub Topics

# Objectives:


###  1: Pick a website and describe your objectives 
        •	Browse through different sites and pick one to scrape. 
        •	Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
        •	Summarize your project idea and outline your strategy in a Jupyter notebook.

        

Outline:
- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic, we'll create a CSV file in the following format: Repo Name, Username, Stars, Repo URL


### 2 :	Use the requests library to download web pages

        •	Inspect the website's HTML source and identify the right URLs to download.
        •	Download and save web pages locally using the requests library.
        •	Create a function to automate downloading for different topics/search queries
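
The third bullet can be sketched as a small helper. This is a minimal sketch rather than the notebook's own code: `download_page` and its `fetch` parameter are hypothetical names. `fetch` defaults to `requests.get` and exists only so the helper can be tried without network access.

```python
import os
import tempfile


def download_page(url, path, fetch=None):
    """Download `url`, save its HTML to `path`, and return the HTML text.

    `fetch` is a hypothetical hook: any callable that takes a URL and returns
    an object with `status_code` and `text`. It defaults to `requests.get`.
    """
    if fetch is None:
        import requests  # imported lazily so a stub can be used offline
        fetch = requests.get
    response = fetch(url)
    if response.status_code != 200:
        raise RuntimeError(
            'Failed to load page {} (status {})'.format(url, response.status_code))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text


class FakeResponse:
    """Stand-in for a requests response, used only for the offline demo."""
    status_code = 200
    text = '<html><body>topics</body></html>'


# Offline demo: the stub returns a canned page instead of contacting GitHub.
demo_path = os.path.join(tempfile.mkdtemp(), 'topics.html')
demo_html = download_page('https://github.com/topics', demo_path,
                          fetch=lambda url: FakeResponse())
print(len(demo_html))
```

Calling `download_page('https://github.com/topics/3d', '3d.html')` without `fetch` would go through `requests.get`, so the same helper can be reused for every topic URL.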


In [5]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url) # downloading contents of the page

In [6]:
response.status_code #  status codes indicate whether a specific HTTP request has been successfully completed. 

200

In [7]:
len(response.text)  # display the response length

166113

In [8]:
page_contents = response.text

In [9]:
page_contents[:10000]  # display the first 10,000 characters of the page


'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-b92e9647318f.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" m

In [13]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents) # saving the downloaded page


### 3:	Use Beautiful Soup to parse and extract information
        •	Parse and explore the structure of downloaded web pages using Beautiful soup.
        •	Use the right properties and methods to extract the required information.
        •	Create functions to extract from the page into lists and dictionaries.
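
The last bullet can be sketched on a small, hand-written HTML fragment. The class names are the ones used in this notebook. The fragment itself, the `extract_topics` name, and the assumption that each topic link wraps its own title and description paragraphs are illustrative, so check them against the live page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the assumed structure of two topic cards.
SAMPLE_HTML = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
'''


def extract_topics(doc, base_url='https://github.com'):
    """Return one dict per topic card found in a parsed page."""
    topics = []
    for link in doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'}):
        # Look up title and description inside the same link so they stay paired.
        title = link.find('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
        desc = link.find('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
        topics.append({
            'title': title.text.strip(),
            'description': desc.text.strip(),
            'url': base_url + link['href'],
        })
    return topics


sample_topics = extract_topics(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
for topic in sample_topics:
    print(topic)
```

Reading each card as a unit keeps title, description and URL aligned even if one card is missing a field, which three separate lists cannot guarantee.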
       

In [14]:

!pip install beautifulsoup4 --quiet 

In [15]:
from bs4 import BeautifulSoup # importing Beautiful Soup

In [16]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [18]:
p_tags = doc.find_all('p')

In [19]:
len(p_tags)

69

In [20]:
#topic title parsing
topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': topic_title_class})  # selecting only the <p> tags I'm interested in (topic titles)


In [22]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [23]:
# topic description parsing
topic_description_class = 'f5 color-fg-muted mb-0 mt-1'
topic_description_tags = doc.find_all('p',{'class':topic_description_class})

In [24]:
len(topic_description_tags)

30

In [25]:
topic_description_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [26]:
# topic link 
topic_link_class = 'no-underline flex-1 d-flex flex-column'
topic_link_tags = doc.find_all('a',{'class': topic_link_class})

In [27]:
len(topic_link_tags)

30

In [28]:
topic_link_tags[0]['href']# link to the first topic

'/topics/3d'

In [29]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']

In [30]:
print(topic0_url)

https://github.com/topics/3d
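
String concatenation works here because every `href` on the page starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The sample hrefs below are illustrative.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Relative hrefs are resolved against the base; absolute URLs pass through unchanged.
print(urljoin(base_url, '/topics/3d'))
print(urljoin(base_url, 'https://github.com/topics/ajax'))
```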


In [31]:
topic_title_tags[0].text

'3D'

In [32]:
# creating a list of topic titles
topic_titles = [] 
for tags in topic_title_tags:
    topic_titles.append(tags.text)
    
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [33]:
# creating a list of topic descriptions
topic_descriptions = []
for i in topic_description_tags:
    topic_descriptions.append(i.text.strip())  # strip() removes leading and trailing whitespace

topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [34]:
# creating a list of topic links
topic_urls = []
base_url = 'https://github.com'
for i in topic_link_tags:
    topic_urls.append(base_url + i['href'])
    
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

### 4.	Create CSV file(s) with the extracted information
        •	Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
        •	Execute the function with different inputs to create a dataset of CSV files.
        •	Verify the information in the CSV files by reading them back using Pandas.
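
A minimal sketch of the verification step, using a two-row frame in place of the scraped data (the rows and the temporary file paths are illustrative). It also shows that saving with the default index produces an extra `Unnamed: 0` column when the file is read back, while `index=False` does not.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({
    'Title': ['3D', 'Ajax'],
    'Url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

tmp_dir = tempfile.mkdtemp()
with_index = os.path.join(tmp_dir, 'with_index.csv')
without_index = os.path.join(tmp_dir, 'without_index.csv')

sample_df.to_csv(with_index)                  # index is written as an unnamed first column
sample_df.to_csv(without_index, index=False)  # only the data columns are written

# Read both files back and compare the column names with the original frame.
print(pd.read_csv(with_index).columns.tolist())
print(pd.read_csv(without_index).columns.tolist())
```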


In [35]:
!pip install pandas --quiet

In [36]:
import pandas as pd

In [37]:
# creating a dictionary that will store the created lists above
topics_dict = {
    'Title' :topic_titles,
    'Description' : topic_descriptions,
    'Url' : topic_urls
}


In [38]:
# creating a dataframe from the dictionary
topics_df = pd.DataFrame(topics_dict)

topics_df[:5]

|   | Title | Description | Url |
|---|-------|-------------|-----|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [39]:
topics_df.to_csv('topics.csv')

In [40]:
topics_url

'https://github.com/topics'

# Getting Information out of a topic page

In [41]:
topic_page_url = topic_urls[0]

In [42]:
topic_page_url

'https://github.com/topics/3d'

In [43]:
response = requests.get(topic_page_url)

In [44]:
response.status_code # checking whether the request was successful

200

In [45]:
len(response.text)

476015

In [46]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [51]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

In [52]:
repo_tags[0].text.strip()

'mrdoob\n          /\n          \n            three.js'

In [53]:
a_tags = repo_tags[0].find_all('a')

In [54]:
a_tags[0].text.strip()

'mrdoob'

In [55]:
a_tags[1].text.strip()

'three.js'

In [56]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [57]:
star_tags_class = 'Counter js-social-count'
star_tags = topic_doc.find_all('span',{'class':star_tags_class})

In [58]:
star_tags[0].text.strip()

'93.3k'

In [59]:
# function to convert a star count string such as '93.3k' into a number

In [60]:
def parse_star_count(stars_str):
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)


In [61]:
parse_star_count(star_tags[0].text.strip())

93300
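
`parse_star_count` covers the `k` suffix seen above. A slightly more defensive variant is sketched below. The `m` suffix and the comma handling are assumptions about formats that might appear, not something observed in this notebook, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert strings such as '93.3k' or '1,234' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against float products landing just below a whole number
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)


for sample in ['93.3k', ' 23.3k ', '987', '1,234', '1.2m']:
    print(sample.strip(), '->', parse_count(sample))
```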

In [65]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 93300, 'https://github.com/mrdoob/three.js')

In [66]:
get_repo_info(repo_tags[1], star_tags[1])

('pmndrs',
 'react-three-fiber',
 23300,
 'https://github.com/pmndrs/react-three-fiber')

In [67]:
len(repo_tags)

20

In [68]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : [] 
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
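
The index-based loop above assumes `repo_tags` and `star_tags` have the same length. A hedged alternative is to pair the two sequences with `zip` after checking their lengths, building one record per repository. The tuples below are illustrative stand-ins for values pulled from the tags, and `build_rows` is a hypothetical helper.

```python
def build_rows(repo_names, star_counts):
    """Pair each (username, repo_name) with its star count, one dict per repo."""
    if len(repo_names) != len(star_counts):
        raise ValueError('got {} repos but {} star counts'.format(
            len(repo_names), len(star_counts)))
    rows = []
    for (username, repo_name), stars in zip(repo_names, star_counts):
        rows.append({
            'username': username,
            'repo_name': repo_name,
            'stars': stars,
            'repo_url': 'https://github.com/{}/{}'.format(username, repo_name),
        })
    return rows


rows = build_rows(
    [('mrdoob', 'three.js'), ('pmndrs', 'react-three-fiber')],
    [93300, 23300],
)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame`, which yields the same four columns as `topic_repos_dict`.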
    

In [9]:
# a dict cannot be sliced, so wrap it in a DataFrame to preview the first five rows
pd.DataFrame(topic_repos_dict)[:5]

In [7]:
 

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    base_url = 'https://github.com'  # defined locally so the function does not rely on a global
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    

def get_topic_repos(topic_doc):
    #repo tags
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'    
    repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class})
     
    #star tags
    star_tags_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class':star_tags_class})
        
    #get repo info
    topic_repos_dict = {
            'username' : [],
            'repo_name' : [],
            'stars' : [],
            'repo_url' : [] 
        }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

In [71]:
topic_page_url

'https://github.com/topics/3d'

In [72]:
get_topic_repos(get_topic_page(topic_urls[0])).to_csv('3D.csv' ,index = None)

Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic

# Final Code

In [4]:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup
def get_topic_titles(doc):
    # topic title parsing    
    topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class':topic_title_class}  )
     # creating a list of topic titles
    topic_titles = [] 
    for tags in topic_title_tags:
        topic_titles.append(tags.text)
    return topic_titles
    
def get_topic_descs(doc):
     # topic description parsing
    topic_description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_description_tags = doc.find_all('p',{'class':topic_description_class})
    # creating a list of topic descriptions
    topic_descriptions = []
    for i in topic_description_tags:
        topic_descriptions.append(i.text.strip())  # strip() removes leading and trailing whitespace
    return topic_descriptions

def get_topic_urls(doc):
     # topic urls tags
    topic_link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a',{'class': topic_link_class})
     # creating a list of topic urls
    topic_urls = []
    base_url = 'https://github.com'
    for i in topic_link_tags:
        topic_urls.append(base_url + i['href'])
    return topic_urls


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # parse the page so the helper functions can search it
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

def scrape_topic(topic_url, path):
    # skip topics whose CSV file has already been written
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [83]:
scrape_topics()

|   | title | description | url |
|---|-------|-------------|-----|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [5]:
import os
help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.
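
A quick sketch of what `exist_ok` changes, run inside a temporary directory so nothing is left behind. The `data` folder name mirrors the one used in this notebook; the temporary path is illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'data')

    os.makedirs(target, exist_ok=True)   # creates the folder
    os.makedirs(target, exist_ok=True)   # second call is a no-op

    try:
        os.makedirs(target)              # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

    created = os.path.isdir(target)

print(created, raised)
```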



In [6]:
def scrape_topics_repos():
    print ('Scraping list of topics')
    topics_df = scrape_topics()
    # create a folder to save the CSV files
    os.makedirs('data',exist_ok=True)
    for index, row in topics_df.iterrows():  # iterating over the rows of the DataFrame
        print ('Scrapping top repositories for "{}" '.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))
        
    

In [88]:
scrape_topics_repos()

Scraping list of topics
Scrapping top repositories for "3D" 
Scrapping top repositories for "Ajax" 
Scrapping top repositories for "Algorithm" 
Scrapping top repositories for "Amp" 
Scrapping top repositories for "Android" 
Scrapping top repositories for "Angular" 
Scrapping top repositories for "Ansible" 
Scrapping top repositories for "API" 
Scrapping top repositories for "Arduino" 
Scrapping top repositories for "ASP.NET" 
Scrapping top repositories for "Atom" 
Scrapping top repositories for "Awesome Lists" 
Scrapping top repositories for "Amazon Web Services" 
Scrapping top repositories for "Azure" 
Scrapping top repositories for "Babel" 
Scrapping top repositories for "Bash" 
Scrapping top repositories for "Bitcoin" 
Scrapping top repositories for "Bootstrap" 
Scrapping top repositories for "Bot" 
Scrapping top repositories for "C" 
Scrapping top repositories for "Chrome" 
Scrapping top repositories for "Chrome extension" 
Scrapping top repositories for "Command line interface" 
S