# Pick a website and describe your objective


- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


### Outline
- First, scrape https://github.com/topics.
- Get the list of topics. For each topic, collect the topic title, topic page URL, and topic description.
- For each topic, get the top 25 repos from the topic page.
- For each repo, collect the repo name, username, stars, and repo URL.
- For each topic, create a CSV file.
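
The per-topic CSV described in the outline could look like the sketch below. The column names and their order are one possible layout based on the outline, not a fixed requirement, and the single row is sample data for illustration.

```python
import csv
import io

# Column order for the per-topic CSV (an assumed layout based on the outline)
FIELDNAMES = ['repo_name', 'username', 'stars', 'repo_url']

# One illustrative row; real rows would come from the scraper
rows = [
    {'repo_name': 'three.js', 'username': 'mrdoob', 'stars': 81800,
     'repo_url': 'https://github.com/mrdoob/three.js'},
]

# Write to an in-memory buffer so the sketch does not touch the disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```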

# Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
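
The last step in the list above might be sketched as a small helper. The name `download_page` and the optional `session` parameter are assumptions made for illustration; `session` only needs a requests-style `.get()` method, which also makes the function easy to try out without network access.

```python
import requests

def download_page(url, path, session=None):
    """Download `url`, save the HTML to `path`, and return the text."""
    # `session` is any object with a requests-style .get(); defaults to the requests module
    http = session if session is not None else requests
    response = http.get(url, timeout=30)
    # raise_for_status() raises an HTTPError for 4xx/5xx responses
    response.raise_for_status()
    # Write with an explicit encoding so non-ASCII characters survive on any platform
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text
```

A call such as `download_page('https://github.com/topics', 'topics.html')` would then fetch and save the topics page in one step.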

In [8]:
#!pip install requests --quiet
import requests

In [9]:
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)

In [10]:
response.status_code #200 - 299 -> successful

200

In [11]:
len(response.text)

139398

In [14]:
page_content = response.text

In [15]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

# Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
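
As a rough illustration of the parse-then-extract steps above, the sketch below runs Beautiful Soup on a tiny hand-written HTML snippet. The snippet and its class names (`title`, `desc`) are placeholders, not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only mimics the general shape of a topic listing
html = """
<div>
  <a href="/topics/3d"><p class="title">3D</p><p class="desc"> 3D modeling. </p></a>
  <a href="/topics/ajax"><p class="title">Ajax</p><p class="desc"> Ajax technique. </p></a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect one dictionary per <a> tag
topics = []
for a_tag in soup.find_all('a'):
    topics.append({
        'title': a_tag.find('p', class_='title').text.strip(),
        'description': a_tag.find('p', class_='desc').text.strip(),
        'url': 'https://github.com' + a_tag['href'],
    })

print(topics)
```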

In [19]:
!pip install beautifulsoup4 --upgrade --quiet

In [20]:
from bs4 import BeautifulSoup

In [23]:
doc = BeautifulSoup(page_content, 'html.parser')

In [26]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_p_tags = doc.find_all('p', 
                      {'class': selection_class})

In [28]:
len(topic_title_p_tags)

30
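
Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the lookup can break if the site reorders or adds a class. A CSS selector via `select` is one possible alternative, sketched here on a made-up snippet in which the second tag has its classes in a different order.

```python
from bs4 import BeautifulSoup

# Two hand-written tags; the second lists the same classes in another order
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary">Ajax</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact string match on the whole class attribute
exact = soup.find_all('p', class_="f3 lh-condensed mb-0 mt-1 Link--primary")

# CSS selector: any <p> that carries both classes, in any order
by_selector = soup.select('p.f3.Link--primary')

print([tag.text for tag in exact])
print([tag.text for tag in by_selector])
```

Which subset of classes identifies the tags reliably is a judgment call that depends on the page being scraped.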

In [30]:
selection_class = "f5 color-fg-muted mb-0 mt-1"
topic_desc_p_tags = doc.find_all('p', class_=selection_class)

In [31]:
len(topic_desc_p_tags)

30

In [32]:
selection_class = "no-underline flex-1 d-flex flex-column"
url_a_tag = doc.find_all('a', class_=selection_class)

In [37]:
topic0_url = "https://github.com" + url_a_tag[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [38]:
topic_titles = [ tag.text for tag in topic_title_p_tags]
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [42]:
topic_desc = [ tag.text.strip() for tag in topic_desc_p_tags]
topic_desc[:3]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

In [43]:
base_url = "https://github.com"
topic_urls = [base_url+url['href'] for url in url_a_tag]
topic_urls[:3]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm']

In [44]:
import pandas as pd

In [49]:
topic_dict = {"title":topic_titles, 
              "description":topic_desc,
              "url": topic_urls}

topics_df = pd.DataFrame(topic_dict)

In [50]:
topics_df.head()

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
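
Building the DataFrame from three parallel lists assumes they all have the same length; pandas raises a `ValueError` otherwise. A minimal sketch of an explicit check followed by a row-wise build, using two sample topics and shortened placeholder descriptions:

```python
import pandas as pd

# Sample values standing in for the scraped lists
topic_titles = ['3D', 'Ajax']
topic_desc = ['3D modeling ...', 'Ajax is a technique ...']
topic_urls = ['https://github.com/topics/3d', 'https://github.com/topics/ajax']

# Fail early with a clear message if one selector matched fewer tags
lengths = {len(topic_titles), len(topic_desc), len(topic_urls)}
if len(lengths) != 1:
    raise ValueError('title, description, and url lists differ in length')

# One dictionary per topic keeps each row's fields together
records = [
    {'title': title, 'description': desc, 'url': url}
    for title, desc, url in zip(topic_titles, topic_desc, topic_urls)
]
topics_df = pd.DataFrame(records)
print(topics_df)
```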


# Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.



In [51]:
topics_df.to_csv('topics.csv', index=None)
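
To verify a saved file, it can be read back with `pd.read_csv` and compared against the original. The sketch below uses a small sample DataFrame and a temporary directory so that it does not overwrite `topics.csv`.

```python
import os
import tempfile

import pandas as pd

# A small sample frame standing in for topics_df
sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Write without the index column, then read the file back
csv_path = os.path.join(tempfile.mkdtemp(), 'topics.csv')
sample_df.to_csv(csv_path, index=False)
loaded_df = pd.read_csv(csv_path)

# The round trip should preserve column names, row count, and values
assert list(loaded_df.columns) == list(sample_df.columns)
assert loaded_df.shape == sample_df.shape
assert loaded_df['title'].tolist() == sample_df['title'].tolist()
```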

# Get information out of a topic page

In [52]:
topic_page_url = topic_urls[0]

In [53]:
topic_page_url

'https://github.com/topics/3d'

In [55]:
response = requests.get(topic_page_url)
response.status_code

200

In [56]:
len(response.text)

634360

In [57]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [143]:
h3_class = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.find_all('h3', class_=h3_class)

In [144]:
atags = repo_tags[0].find_all('a')

In [145]:
atags[0].text.strip()

'mrdoob'

In [146]:
atags[1].text.strip()

'three.js'

In [147]:
atags[1]['href']

'/mrdoob/three.js'

In [148]:
base_url

'https://github.com'

In [149]:
usernames = []
repo_name = []
repo_urls = []
for tag in repo_tags:
    atags = tag.find_all('a')
    usernames.append(atags[0].text.strip())
    repo_name.append(atags[1].text.strip())
    repo_urls.append(base_url + atags[1]['href'])

In [150]:
usernames[-1]

'antvis'

In [151]:
repo_name[-1]

'L7'

In [152]:
print(repo_urls[-1])

https://github.com/antvis/L7


In [153]:
star_class = "d-flex flex-items-center ml-3"
star_tags = topic_doc.find_all('div', class_=star_class)

In [154]:
len(star_tags)

30

In [155]:
star_class = 'repo-stars-counter-star'
star_tags = topic_doc.find_all('span', {'id':star_class})
len(star_tags)

30

In [156]:
star_tags[0].text

'81.8k'

In [157]:
stars = [int(float(star.text[:-1])*1000) if star.text[-1] == 'k' else int(star.text) for star in star_tags ]

In [175]:
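def parse_star_tag(star):
    # Convert a star count such as '81.8k' or '950' to an integer.
    star = star.strip()
    if star[-1] == 'k':
        # round() guards against float results such as 1099.9999999
        return int(round(float(star[:-1]) * 1000))
    else:
        return int(star)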

In [176]:
stars[:3]

[81800, 20000, 17900]
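
The star labels seen above end in `k` or are plain integers. A slightly more defensive parser might also tolerate thousands separators and an `m` suffix. Both are assumptions about formats that do not appear in the output above, and `parse_star_count` is a hypothetical name.

```python
def parse_star_count(text):
    """Convert a star label such as '81.8k' or '1,234' to an int."""
    # Normalize whitespace, case, and thousands separators (commas are an assumption)
    text = text.strip().lower().replace(',', '')
    # Suffix multipliers; 'm' is an assumed format, 'k' is the one seen on the page
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids off-by-one results from float multiplication
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_star_count('81.8k'))
print(parse_star_count('950'))
```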

In [180]:
def get_repo_info(repo_tag, star_tag):
    atags = repo_tag.find_all('a')
    username = atags[0].text.strip()
    repo_name = atags[1].text.strip()
    repo_url = base_url + atags[1]['href']
    stars = parse_star_tag(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [181]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 81800, 'https://github.com/mrdoob/three.js')

In [182]:
topic_repos_dict ={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}
for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [184]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'blender',
  'domlysz',
  'spritejs',
  'openscad',
  'jagenjo',
  'tensorspace-team',
  'YadiraF',
  'AaronJackson',
  'google',
  'ssloy',
  'mosra',
  'FyroxEngine',
  'tengbao',
  'cleardusk',
  'jasonlong',
  'cnr-isti-vclab',
  'antvis'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'blender',
  'BlenderGIS',
  'spritejs',
  'openscad',
  'webglstudio.js',
  'tensorspace',
  'PRNet',
  'vrn',
  'model-viewer',
  'tinyraytracer',
  'magnum',
  'Fyrox',
  'vanta',
  '3DDFA',
  'isometric-contributions',
  'meshlab',
  'L7'],
 'stars': [81800,
  20000,
  17900,
  17100,
  14100,
  13700,
  12800,
  11

In [185]:
topic_repo_df = pd.DataFrame(topic_repos_dict)

In [186]:
topic_repo_df.head()

,username,repo_name,stars,repo_url
0,mrdoob,three.js,81800,https://github.com/mrdoob/three.js
1,libgdx,libgdx,20000,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,17900,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,17100,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,14100,https://github.com/aframevr/aframe
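
The index-based loop that fills `topic_repos_dict` could also be written row-wise, which avoids repeating the positional indexes `repo_info[0]` through `repo_info[3]`. A sketch using two sample tuples in the shape returned by `get_repo_info`:

```python
# Sample tuples in the (username, repo_name, stars, repo_url) shape
repo_infos = [
    ('mrdoob', 'three.js', 81800, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 20000, 'https://github.com/libgdx/libgdx'),
]

columns = ['username', 'repo_name', 'stars', 'repo_url']

# Pair each column name with the value at the same position
rows = [dict(zip(columns, info)) for info in repo_infos]

# If the column-oriented dictionary is still wanted, it can be derived from the rows
columns_dict = {name: [row[name] for row in rows] for name in columns}

print(rows[0])
```

Either `rows` or `columns_dict` can be passed to `pd.DataFrame`.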


## Write a function to:
- get list of topics from topic page
- get the list of top repos from the individual topic pages
- for each topic, create a csv of the top repos for that topic


# Final Code

In [None]:
def get_repo_info(repo_tag, star_tag):
    atags = repo_tag.find_all('a')
    username = atags[0].text.strip()
    repo_name = atags[1].text.strip()
    repo_url = base_url + atags[1]['href']
    stars = parse_star_tag(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_topic_repos(topic_url):
    #download the page
    response = requests.get(topic_url)
    #check response
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    #parse using Beautiful Soup
    topic_doc =  BeautifulSoup(response.text, 'html.parser')
    # get h3 tags containing repo title, url and username
    h3_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', class_=h3_class)
    # get stars tags
    star_class = 'repo-stars-counter-star'
    star_tags = topic_doc.find_all('span', {'id':star_class})
    #get repo info
    topic_repos_dict ={
                        'username':[],
                        'repo_name':[],
                        'stars':[],
                        'repo_url':[]
                    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [191]:
def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_p_tags = doc.find_all('p', 
                          {'class': selection_class})
    topic_titles = [ tag.text for tag in topic_title_p_tags]
    return topic_titles

def get_topic_desc(doc):
    selection_class = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_p_tags = doc.find_all('p', class_=selection_class)
    topic_desc = [ tag.text.strip() for tag in topic_desc_p_tags]
    return topic_desc

def get_topic_urls(doc):
    selection_class = "no-underline flex-1 d-flex flex-column"
    url_a_tag = doc.find_all('a', class_=selection_class)
    topic_urls = [base_url+url['href'] for url in url_a_tag]
    return topic_urls
    
def scrape_topics():
    topic_url = 'https://github.com/topics'
    #download the page
    response = requests.get(topic_url)
    #check response
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    # parse the page just downloaded, not the earlier global page_content
    doc = BeautifulSoup(response.text, 'html.parser')
    

    topic_titles = get_topic_titles(doc)
    topic_desc = get_topic_desc(doc)
    topic_urls = get_topic_urls(doc)
    
    topic_dict = {"title":topic_titles, 
                  "description":topic_desc,
                  "url": topic_urls}

    topics_df = pd.DataFrame(topic_dict)
    return topics_df

In [193]:
topic_df_scraped = scrape_topics()
topic_df_scraped.head()

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [213]:
import os

In [217]:
def scrape_topic(topic_url, topic_name):
    fname = "github_scrape/" + topic_name + '.csv'
    # check before downloading so existing topics are not fetched again
    if os.path.exists(fname):
        print("The file {} already exists. Skipping..".format(fname))
        return
    topic_repo_df = get_topic_repos(topic_url)
    topic_repo_df.to_csv(fname, index = None)
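
`scrape_topic` uses the topic title directly as the file name, so titles containing spaces (or, potentially, path separators) end up in the path as-is. One possible way to normalize them is sketched below. `topic_to_filename` is a hypothetical helper, and the set of characters it keeps is an arbitrary choice.

```python
import re

def topic_to_filename(topic_name, extension='.csv'):
    """Turn a topic title into a filesystem-friendly file name."""
    # Replace each run of characters other than letters, digits, '.', '+', '-' with '_'
    safe = re.sub(r'[^A-Za-z0-9.+-]+', '_', topic_name.strip())
    # Drop underscores left over at either end
    return safe.strip('_') + extension

print(topic_to_filename('Awesome Lists'))
print(topic_to_filename('C++'))
```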

In [224]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('github_scrape', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], row['title'])
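
The loop above stops at the first exception and sends its requests back to back. A more forgiving variant might record failures and pause between topics, as in this sketch. `scrape_all`, its parameters, and the one-second default delay are assumptions for illustration, not part of the notebook.

```python
import time

def scrape_all(topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, title) for each (title, url) pair, pausing between calls."""
    failed = []
    for title, url in topics:
        try:
            scrape_one(url, title)
        except Exception as error:
            # Record the failure and keep going with the remaining topics
            print('failed to scrape "{}": {}'.format(title, error))
            failed.append(title)
        # Pause so the requests are spread out over time
        time.sleep(delay_seconds)
    return failed
```

With the notebook's names, a call could look like `scrape_all(zip(topics_df['title'], topics_df['url']), scrape_topic)`; the returned list names the topics that need a retry.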

In [225]:
scrape_topics_repos()

Scraping list of topics
scraping top repositories for "3D"
scraping top repositories for "Ajax"
scraping top repositories for "Algorithm"
scraping top repositories for "Amp"
scraping top repositories for "Android"
scraping top repositories for "Angular"
scraping top repositories for "Ansible"
scraping top repositories for "API"
scraping top repositories for "Arduino"
scraping top repositories for "ASP.NET"
scraping top repositories for "Atom"
scraping top repositories for "Awesome Lists"
scraping top repositories for "Amazon Web Services"
scraping top repositories for "Azure"
scraping top repositories for "Babel"
scraping top repositories for "Bash"
scraping top repositories for "Bitcoin"
scraping top repositories for "Bootstrap"
scraping top repositories for "Bot"
scraping top repositories for "C"
scraping top repositories for "Chrome"
scraping top repositories for "Chrome extension"
scraping top repositories for "Command line interface"
scraping top repositories for "Clojure"
scrapin