# Web Scraping Project Version 1
- https://www.youtube.com/watch?v=RKsLLG-bzEY&t=1s

## Outline:

- We're going to scrape https://github.com
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file.
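
The outline implies one record per repository and one CSV file per topic. A minimal sketch of that layout, using only the standard library, might look as follows. The `RepoRecord` class and the `records_to_csv` helper are illustrative names that do not appear in the notebook; the sample row is the first repository scraped later for the 3D topic.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class RepoRecord:
    # Hypothetical container; the field names mirror the outline
    username: str
    repo_name: str
    stars: int
    repo_url: str


def records_to_csv(records):
    """Serialise a list of RepoRecord objects to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(RepoRecord)],
        lineterminator='\n',
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


sample = [RepoRecord('mrdoob', 'three.js', 93700, 'https://github.com/mrdoob/three.js')]
csv_text = records_to_csv(sample)
print(csv_text)
```

The notebook itself builds these tables with pandas; the sketch only shows the intended shape of the data.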

## Scraping information from a single page

- https://github.com/topics

### Use the requests library to download the webpage

In [3]:
! pip install requests  --quiet

In [4]:
import requests

In [5]:
topics_url = 'https://github.com/topics'

In [10]:
response = requests.get(topics_url)

In [11]:
response.status_code

200

In [12]:
len(response.text) # printing response.text directly is not a good idea: it is very long

164745

In [17]:
page_contents = response.text
page_contents[:200]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="false">\n  <head>\n    <meta chars'

In [18]:
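# write with an explicit encoding so non-ASCII characters are saved the same way on every platform
with open('webpage.html','w',encoding='utf-8') as f:
    f.write(page_contents)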

The web page is now downloaded and saved to a local file. To extract the information we need from it, we use Beautiful Soup.
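
To make "extracting the needed information" concrete without depending on the live page, here is a rough sketch that pulls the text of `<p>` tags with a given class out of a hard-coded snippet. It deliberately uses the standard library's `html.parser` as a stand-in, not Beautiful Soup, and the `ParagraphCollector` class and `SAMPLE_HTML` snippet are made up for illustration. Beautiful Soup does the same job with far less code.

```python
from html.parser import HTMLParser

# Hand-written snippet imitating two paragraphs of the topics page
SAMPLE_HTML = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics.
</p>
"""


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> whose class attribute equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'p' and dict(attrs).get('class') == self.target_class:
            self._inside = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._inside = False

    def handle_data(self, data):
        # Only keep text that sits inside a matching paragraph
        if self._inside:
            self.texts[-1] += data


collector = ParagraphCollector("f3 lh-condensed mb-0 mt-1 Link--primary")
collector.feed(SAMPLE_HTML)
collector.close()
sample_titles = [text.strip() for text in collector.texts]
print(sample_titles)
```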

In [19]:
! pip install beautifulsoup4 --quiet

In [20]:
from bs4 import BeautifulSoup

In [21]:
doc = BeautifulSoup(page_contents,'html.parser')

In [22]:
type(doc)

bs4.BeautifulSoup

In [28]:
p_tags = doc.find_all('p')
len(p_tags)

69

In [29]:
p_tags[:5]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a href="https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Rust
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Rust is a systems programming language created by Mozilla.</p>]

### extracting tags

In [32]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = doc.find_all('p',{'class': selection_class})

In [33]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

### extracting description

In [34]:
selection_class2 = "f5 color-fg-muted mb-0 mt-1"
descrp_titles = doc.find_all('p',{'class': selection_class2})

In [35]:
descrp_titles[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

### extracting URL

In [40]:
selection_class3 = "no-underline flex-grow-0"
tag_url = doc.find_all('a',{'class': selection_class3})

In [44]:
tag_url[0]['href']

'/topics/3d'

In [42]:
len(tag_url)

30

In [46]:
topic0_url = 'https://github.com' + tag_url[0]['href']
topic0_url

'https://github.com/topics/3d'

### creating a data structure from the extracted data

In [55]:

topic_titles = []
for tags in topic_title_tags:
    topic_titles.append(tags.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [62]:
topic_descrp = []
for descrp in descrp_titles:
    topic_descrp.append(descrp.text.strip())

print(topic_descrp)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

In [65]:
topic_url = []
base_url = 'https://github.com' 
for url in tag_url:
    topic_url.append(base_url + url['href'])


print(topic_url)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

Now we create a DataFrame from the lists built above.
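
Building a table from parallel lists only works if the lists line up: `pd.DataFrame` raises a `ValueError` when the columns have different lengths, which can happen if one CSS selector matches more tags than another. A small sketch of checking this up front, in plain Python; `build_rows` is a hypothetical helper, not part of the notebook, and the sample values are taken from the outputs above.

```python
def build_rows(titles, descriptions, urls):
    """Zip three parallel lists into row dicts, refusing mismatched lengths."""
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            'column lengths differ: {} titles, {} descriptions, {} urls'.format(
                len(titles), len(descriptions), len(urls)
            )
        )
    return [
        {'titles': title, 'description': description, 'urls': url}
        for title, description, url in zip(titles, descriptions, urls)
    ]


rows = build_rows(
    ['3D', 'Ajax'],
    [
        '3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(rows[1]['urls'])
```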

In [66]:
import pandas as pd

In [71]:
data = { 'titles':topic_titles,
        'description':topic_descrp,
        'urls':topic_url}
dataframe = pd.DataFrame(data)

In [72]:
df = dataframe.copy()
df

Unnamed: 0,titles,description,urls
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [73]:
df.to_csv('webscraping_mini.csv',index=None)

## Scraping information from multiple pages

In [151]:
topic_page_url = topic_url[0]
topic_page_url

'https://github.com/topics/3d'

In [152]:
response = requests.get(topic_page_url)

In [153]:
response.status_code

200

In [154]:
len(response.text)

476395

In [155]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [156]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class})

In [157]:
len(repo_tags)

20

In [158]:
a_tags = repo_tags[0].find_all('a')

In [159]:
a_tags[0].text.strip() # user name

'mrdoob'

In [160]:
a_tags[1].text.strip() # repo name

'three.js'

In [161]:
repo_url = base_url + a_tags[1]['href'] # rep url
repo_url

'https://github.com/mrdoob/three.js'

In [162]:
star_class = 'Counter js-social-count'
star_tags = topic_doc.find_all('span',{'class':star_class})

In [163]:
len(star_tags)

20

In [164]:
star_tags[0].text.strip()

'93.7k'

In [165]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [166]:
parse_star_count(star_tags[0].text.strip())

93700
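
`parse_star_count` handles the two formats seen on this page: a plain integer and a value with a lowercase `k` suffix. A more lenient variant could be sketched as below. It assumes, without having checked against GitHub's markup, that counts might also appear with an `m` suffix, an uppercase suffix, or thousands separators; `parse_star_count_lenient` is a hypothetical name.

```python
def parse_star_count_lenient(stars_str):
    """Hypothetical variant of parse_star_count accepting k/m suffixes and commas."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = stars_str.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError('empty star count')
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() instead of int(), so a product that lands fractionally
        # below a whole number is not truncated downwards
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


for text in ['93.7k', ' 980 ', '1,234', '1.2m']:
    print(text, '->', parse_star_count_lenient(text))
```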

In [167]:
def get_repo_info(h3_tag,star_tag):
    # returns the username, repo name, star count and repo URL for one repository
    a_tags = h3_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    # use the star_tag argument instead of the global star_tags[i]
    stars = parse_star_count(star_tag.text.strip())
    return user_name,repo_name,stars,repo_url

In [168]:
topic_repos_dict = {
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])


In [120]:

topic_repos = pd.DataFrame(topic_repos_dict)
topic_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,93700,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,23400,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21800,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21100,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17600,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15800,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15600,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14700,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10800,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,10000,https://github.com/metafizzy/zdog


This is the information scraped from a single topic page. There are many more topic pages to process, so we wrap the steps above in functions to automate the work.

### Using functions to extract from multiple pages

In [169]:
def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    # parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc
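
`get_topic_page` raises on the first response that is not 200, so a single transient failure would abort a run over many topic pages. One possible mitigation is to retry a few times with a growing pause. The sketch below is an assumption about how that could look, not something the notebook does: `fetch_with_retry` and `fake_fetch` are hypothetical, and the download is passed in as a callable returning a `(status_code, text)` pair so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out."""
    last_status = None
    for attempt in range(1, attempts + 1):
        status, text = fetch(url)
        if status == 200:
            return text
        last_status = status
        if attempt < attempts:
            # wait a little longer after each failed attempt
            sleep(delay_seconds * attempt)
    raise RuntimeError(
        'Failed to load page {} (last status {})'.format(url, last_status)
    )


# Stand-in for a real download: fails once, then succeeds
calls = []


def fake_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        return 429, ''
    return 200, '<html></html>'


page = fetch_with_retry(fake_fetch, 'https://github.com/topics/3d', sleep=lambda seconds: None)
print(len(calls), page)
```

With `requests`, the `fetch` argument could be a small function that calls `requests.get(url)` and returns `(response.status_code, response.text)`.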


In [171]:
def get_topic_repos(topic_doc):

    # getting h3 tags containing repo title , repo url and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class})
    # getting star tags
    star_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class':star_class})

    def get_repo_info(h3_tag,star_tag):
        # returns the username, repo name, star count and repo URL for one repository
        a_tags = h3_tag.find_all('a')
        user_name = a_tags[0].text.strip()
        repo_name = a_tags[1].text.strip()
        repo_url = base_url + a_tags[1]['href']
        # use the star_tag argument instead of indexing star_tags with the loop variable
        stars = parse_star_count(star_tag.text.strip())
        return user_name,repo_name,stars,repo_url

    # to store the all repos
    topic_repos_dict = {
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }

    # getting all repos info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

#### Testing the function

In [178]:
url4 = df.urls[4]

In [179]:
topic4_doc = get_topic_page(url4)

In [180]:
topic4_repos = get_topic_repos(topic4_doc)

##### in a single line

In [184]:
get_topic_repos(get_topic_page(df.urls[6]))

Unnamed: 0,username,repo_name,stars,repo_url
0,ansible,ansible,58200,https://github.com/ansible/ansible
1,bregman-arie,devops-exercises,54400,https://github.com/bregman-arie/devops-exercises
2,trailofbits,algo,27500,https://github.com/trailofbits/algo
3,MichaelCade,90DaysOfDevOps,23500,https://github.com/MichaelCade/90DaysOfDevOps
4,StreisandEffect,streisand,23000,https://github.com/StreisandEffect/streisand
5,kubernetes-sigs,kubespray,14300,https://github.com/kubernetes-sigs/kubespray
6,ansible,awx,12500,https://github.com/ansible/awx
7,easzlab,kubeasz,9300,https://github.com/easzlab/kubeasz
8,ansible-semaphore,semaphore,7400,https://github.com/ansible-semaphore/semaphore
9,geerlingguy,ansible-for-devops,7300,https://github.com/geerlingguy/ansible-for-devops


## Scaling the project

Write a single function to:
1. Get the list of topics from the topics page.
2. Get the list of top repos from the individual topic pages.
3. For each topic, create a CSV of the top repos for the topic.
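
Step 3 turns topic titles into file names and should skip files that already exist. A rough sketch of both ideas, independent of the scraping code: `topic_csv_path` and `write_once` are hypothetical helpers, and the rule for which characters count as "safe" in a file name is an assumption chosen for illustration.

```python
import os
import re
import tempfile


def topic_csv_path(title, out_dir='data'):
    """Map a topic title to a CSV path with a conservative file name."""
    # collapse every run of characters other than letters, digits,
    # '.', '+', '_' and '-' into a single underscore
    safe = re.sub(r'[^A-Za-z0-9.+_-]+', '_', title).strip('_')
    return os.path.join(out_dir, safe + '.csv')


def write_once(path, text):
    """Write text to path unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = topic_csv_path('Command line interface', out_dir=tmp)
    first = write_once(path, 'username,repo_name,stars,repo_url\n')
    second = write_once(path, 'ignored\n')
    file_name = os.path.basename(path)

print(file_name, first, second)
```

The existence check uses the full path including the `.csv` extension, so the check and the write refer to the same file.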

In [212]:
import os

In [224]:
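# this function belongs to the section above; it saves one topic's repos to <path>.csv
def scrape_topic(topic_url,path):
    csv_path = path + '.csv'
    # check the same file name that is written below, so existing files are really skipped
    if os.path.exists(csv_path):
        print('The file {} already exists. skipping...'.format(csv_path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(csv_path,index=None)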

In [185]:
# for topic title 
def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p',{'class': selection_class})

    topic_titles = []
    for tags in topic_title_tags:
        topic_titles.append(tags.text)
    return topic_titles

In [202]:
# for topic description
def get_topic_description(doc):
    selection_class2 = "f5 color-fg-muted mb-0 mt-1"
    descrp_titles = doc.find_all('p',{'class': selection_class2})

    topic_descs = []
    for descrp in descrp_titles:
        topic_descs.append(descrp.text.strip())
    return topic_descs

In [187]:
# for topic URLs
def get_topic_url(doc):
    selection_class3 = "no-underline flex-grow-0"
    tag_url = doc.find_all('a',{'class': selection_class3})
    
    topic_url = []
    base_url = 'https://github.com' 
    for url in tag_url:
        topic_url.append(base_url + url['href'])
    return topic_url

In [204]:
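def scrape_topics():

    topics_url = 'https://github.com/topics'
    # keep the response so its status and body can be used below
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    # parse the freshly downloaded page instead of relying on the global doc
    doc = BeautifulSoup(response.text,'html.parser')

    topics_dict = {
        'title': get_topic_titles(doc),
        'description':get_topic_description(doc),
        'url':get_topic_url(doc)
    }

    return pd.DataFrame(topics_dict)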

In [227]:
def scrape_topics_repos():
    print('Scraping list of topics from github')
    topics_df = scrape_topics()

    os.makedirs('data',exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], 'data/{}'.format(row['title']))

In [228]:
scrape_topics_repos()

Scraping list of topics from github
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
