# Web Scraping
![](https://i.imgur.com/6zM7JBq.png)

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

**We define several helper functions for the following purposes:**
- `get_topic_titles`, `get_topic_descs` and `get_topic_urls` extract information about all the topics.
- `get_topic_page`, `get_repo_info` and `get_topic_repos` collect information about the repositories listed under a particular topic.

## Scrape the list of topics from GitHub
Our objective in this section is to:

- use requests to download the page
- use BS4 (Beautiful Soup) to parse the page and extract information
- convert the result to a Pandas dataframe

Let's write a function to download the page.

In [1]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    # Download the GitHub topics page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Fail early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML so it can be searched with find and find_all
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [2]:
doc = get_topics_page()
# doc

`doc` holds the parsed HTML of the whole page, which `get_topics_page` downloads.

- The browser's Inspect tool shows the HTML behind any element on the page, and hovering over a line of HTML highlights the corresponding area of the website.

&nbsp;

![](https://i.imgur.com/OnzIdyP.png)

Now we define a function `get_topic_titles` to get the list of titles.
This function works as follows:
1. Using Inspect, we observe that each topic title sits in a `<p>` (paragraph) tag with the class `f3 lh-condensed mb-0 mt-1 Link--primary`.
2. Calling `find_all` with that tag and class returns a list (`topic_title_tags`) of all the paragraph tags containing the required information.
3. We iterate over `topic_title_tags` and store the text of each tag in a list named `topic_titles`.
4. Finally, `topic_titles` is returned.

Let us look at the code before putting it into a function, because almost all the functions defined here rely on these `find` and `find_all` calls.

In [3]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})
topic_title_tags[0:3]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>]

In [4]:
topic_title_tags[2].text

'Algorithm'

In [5]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

Let us see what we have scraped:

In [6]:
get_topic_titles(doc)[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
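
To experiment with `find_all` without downloading anything, the same lookup can be run on a small hand-written HTML snippet. This is only a minimal sketch: the snippet and the names `sample_html` and `sample_doc` are made up for illustration, and only the class strings are taken from the page inspected above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a fragment of the topics page (illustrative only)
sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description paragraph.</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
'''

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Same lookup as in get_topic_titles: <p> tags with the title class string
title_tags = sample_doc.find_all(
    'p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
titles = [tag.text for tag in title_tags]
print(titles)
```

Because the full class string is passed as written, only tags whose `class` attribute is exactly that string are returned, so the selector would need updating if GitHub changes or reorders these classes.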

Similarly, we define functions for the descriptions and URLs.

In [7]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [8]:
get_topic_descs(doc)[:3]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

In [9]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In the code above, `tag['href']` does not contain the whole URL, so `base_url` is prepended to it to build the full link to the topic page. For instance, https://github.com/topics/3d is the URL of the 3D topic, but `tag['href']` contains only `'/topics/3d'`, so appending it to `base_url` gives the complete URL.
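
As an alternative to string concatenation, the standard library's `urljoin` can combine a base URL with a relative `href`. The sketch below is not what the notebook uses; it just shows the same joining step, with `relative_href` and `absolute_href` as illustrative variable names.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A site-relative href, like the value of tag['href'] on the topics page
relative_href = '/topics/3d'
full_url = urljoin(base_url, relative_href)
print(full_url)

# An href that is already absolute is returned unchanged
absolute_href = 'https://github.com/topics/ajax'
unchanged_url = urljoin(base_url, absolute_href)
print(unchanged_url)
```

One possible advantage is that `urljoin` also copes with links that are already absolute, whereas plain concatenation would produce a malformed URL in that case.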

In [10]:
get_topic_urls(doc)[:3]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm']

In [11]:
import pandas as pd

### Putting it all together into a single function

In [12]:
def scrape_topics():
    # Download the GitHub topics page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # Extract each column, then combine the columns into a dataframe
    topic_titles = get_topic_titles(doc)
    topic_descs = get_topic_descs(doc)
    topic_urls = get_topic_urls(doc)
    dictionary = {'Title': topic_titles,
                  'Description': topic_descs,
                  'URL': topic_urls}
    return pd.DataFrame(dictionary)

In [13]:
topics_df = scrape_topics()
topics_df.loc[0:4]

Unnamed: 0,Title,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
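
Building the dataframe this way relies on the three lists having the same length, which would stop being true if one of the class selectors missed some tags. A minimal sketch of a sanity check that could run before `pd.DataFrame` is shown below. The helper name `check_equal_lengths` is hypothetical and the sample values are shortened placeholders.

```python
def check_equal_lengths(**columns):
    """Raise ValueError unless every named list has the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

# Placeholder data standing in for the scraped lists
columns = check_equal_lengths(
    Title=['3D', 'Ajax'],
    Description=['3D modeling is ...', 'Ajax is ...'],
    URL=['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(list(columns))
```

The returned dictionary has the same shape as `dictionary` in `scrape_topics`, so it could be passed to `pd.DataFrame` directly.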


## Scraping top repositories
Objectives of this section:
- For a topic (say Python), I'll fetch the names of the top repositories, their usernames and their star counts.
- To achieve this, several functions will be defined so that they can be reused for the next topic.
- All the scraped data will then be stored in a CSV file.

- First, the `get_topic_page` function is defined. It uses the requests library to retrieve the page for the corresponding topic (say 3D).
- The response text is then passed to Beautiful Soup, which returns an object that you can use for further searches or to extract its contents.

In [14]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [15]:
doc = get_topic_page('https://github.com/topics/3d')

In [16]:
# doc # Uncomment to see what it contains

**After inspecting the various parts of the website, I observed that the article tag contains all the necessary information regarding the repo title, repo URL and username.**
Now a function `get_repo_info()` is defined, which returns username, repo_name, stars and repo_url.

In [17]:
base_url = 'https://github.com'

In [18]:
def get_repo_info(article_tag, star_tag):
    # returns all the required info about a repository
    a_tags = article_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = star_tag.text.strip()
    return username, repo_name, stars, repo_url

Here is an example showing what this function takes as input and what it returns:

In [19]:
# Get the article tags containing repo title, repo URL and username
repo_tags = doc.find_all('article',{'class':'border rounded color-shadow-small color-bg-subtle my-4'})
a_tag = repo_tags[0].find_all('a') 

# Get star tags
star_tags=doc.find_all('span',{'id':'repo-stars-counter-star'})
# print(a_tag[0], "\n") # Uncomment to see what it contains
print(star_tags[0])

<span aria-label="83599 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="83,599">83.6k</span>


In [20]:
get_repo_info(repo_tags[0], star_tags[0])

('', 'mrdoob', '83.6k', 'https://github.com/mrdoob')
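
The target CSV format lists stars as plain integers, while the scraped text is abbreviated (`'83.6k'`). The span printed above also carries the exact count in its `title` attribute (`"83,599"`). The following is a minimal sketch of a converter that handles both forms. The name `parse_star_count` and the set of suffixes are assumptions, not part of the notebook.

```python
from decimal import Decimal

# Suffixes assumed to appear in abbreviated counts (an assumption)
SUFFIX_MULTIPLIERS = {'k': 1000, 'm': 1000000}

def parse_star_count(stars_text):
    """Convert text such as '83.6k' or '83,599' to an integer."""
    text = stars_text.strip().lower().replace(',', '')
    multiplier = 1
    if text and text[-1] in SUFFIX_MULTIPLIERS:
        multiplier = SUFFIX_MULTIPLIERS[text[-1]]
        text = text[:-1]
    # Decimal avoids floating-point rounding surprises with values like 83.6
    return int(Decimal(text) * multiplier)

print(parse_star_count('83.6k'))
print(parse_star_count('83,599'))
```

Inside `get_repo_info`, this could be applied to `star_tag.text` or, for the exact figure, to `star_tag['title']`.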

- Now we define the `get_topic_repos()` function, which takes the doc returned by `get_topic_page()` and uses `get_repo_info()` to fetch the data for each repository.
- This data is collected in a dictionary, `topic_repos_dict`, which is finally converted to a pandas dataframe.

In [21]:
def get_topic_repos(topic_doc):
    # Get the article tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('article',{'class':'border rounded color-shadow-small color-bg-subtle my-4'})

    # Get star tags
    star_tags=topic_doc.find_all('span',{'id':'repo-stars-counter-star'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [22]:
topic_urls = get_topic_urls(get_topics_page())
topic_urls[:3]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm']

In [23]:
get_topic_repos(get_topic_page(topic_urls[2])).head()

Unnamed: 0,username,repo_name,stars,repo_url
0,,jwasham,225k,https://github.com/jwasham
1,CyC2018,CS-Notes,154k,https://github.com/CyC2018/CS-Notes
2,,trekhleb,146k,https://github.com/trekhleb
3,TheAlgorithms,Python,140k,https://github.com/TheAlgorithms/Python
4,yangshun,tech-interview-handbook,74.4k,https://github.com/yangshun/tech-interview-han...
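
In the output above, some rows have an empty username and a URL that stops at the owner, which suggests the `<a>` tags are not in the same positions in every article. One possible workaround, sketched below under that assumption, is to pick the link by the shape of its `href` (`/owner/repo`) rather than by position. The helper `split_repo_href`, the `RESERVED_PREFIXES` tuple and the sample href lists are all hypothetical.

```python
# First path segments assumed NOT to be usernames (an assumption)
RESERVED_PREFIXES = ('topics', 'login')

def split_repo_href(hrefs):
    """Return (username, repo_name) from the first href shaped like '/owner/repo'."""
    for href in hrefs:
        if not href.startswith('/'):
            continue  # skip absolute or malformed links
        parts = href.strip('/').split('/')
        if len(parts) == 2 and all(parts) and parts[0] not in RESERVED_PREFIXES:
            return parts[0], parts[1]
    return None  # no repository-shaped link found

# Hypothetical href lists, as might be collected from one article tag
print(split_repo_href(['/mrdoob', '/mrdoob/three.js']))
print(split_repo_href(['/topics/3d']))
```

The hrefs themselves could be gathered with `[a['href'] for a in article_tag.find_all('a', href=True)]`. Whether this matches the real page layout would need to be checked with Inspect.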


Next is a function that saves the dataframe returned by `get_topic_repos` as a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

In [24]:
import os
def csv_converter(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [25]:
path = os.path.join(os.getcwd(), get_topic_titles(get_topics_page())[21] + ".csv")
csv_converter(topic_urls[21], path)

The file C:\Users\HP\jupytr notebook\Web Scrapping\Chrome extension.csv already exists...


Now let's read the CSV file above.

In [26]:
csv_file = pd.read_csv(path)
print(csv_file.to_string())

        username                 repo_name  stars                                                  repo_url
0     iamadamdev    bypass-paywalls-chrome  26.1k      https://github.com/iamadamdev/bypass-paywalls-chrome
1            NaN                jaywcjlove  18.7k                             https://github.com/jaywcjlove
2            NaN            refined-github  17.9k                         https://github.com/refined-github
3            NaN                   checkly  14.2k                                https://github.com/checkly
4            NaN                darkreader    14k                             https://github.com/darkreader
5          unbug                    codelf  12.9k                           https://github.com/unbug/codelf
6        Anarios    return-youtube-dislike   9.4k         https://github.com/Anarios/return-youtube-dislike
7            NaN                     crimx   9.2k                                  https://github.com/crimx
8            NaN            
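
The file read above has the columns `username, repo_name, stars, repo_url`, whereas the target format stated earlier is `Repo Name,Username,Stars,Repo URL`. A minimal sketch of writing rows in the target layout with the standard `csv` module follows. The rows reuse the sample values from the format example, and the in-memory buffer stands in for a real file.

```python
import csv
import io

# Sample rows in the (username, repo_name, stars, repo_url) order used above
rows = [
    ('mrdoob', 'three.js', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]

buffer = io.StringIO()  # stands in for open(path, 'w', newline='')
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
for username, repo_name, stars, repo_url in rows:
    # Reorder the fields to match the target header
    writer.writerow([repo_name, username, stars, repo_url])

csv_text = buffer.getvalue()
print(csv_text)
```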

## Finally, Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [27]:
topics_df.head() # Defined earlier; the function below builds the same dataframe.

Unnamed: 0,Title,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


This function works as follows:
1. It calls `scrape_topics()` to get the pandas dataframe shown above.
2. It creates a directory named `data` using the os module.
3. All the CSV files produced by `csv_converter` are stored in the `data` directory.
4. It uses `iterrows()` to iterate over the rows of the dataframe. Each iteration yields the row's index together with the row itself as a pandas Series, whose values can be accessed by column name to get the required results.
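
The row iteration in step 4 can be tried on a small dataframe without any scraping. This is only an illustrative sketch: `sample_df` and `planned` are made-up names, and the loop records what would be passed to `csv_converter` instead of calling it.

```python
import pandas as pd

# A two-row stand-in for topics_df
sample_df = pd.DataFrame({
    'Title': ['3D', 'Ajax'],
    'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

planned = []
for index, row in sample_df.iterrows():
    # row is a pandas Series; values are looked up by column name
    planned.append((row['URL'], 'data/{}.csv'.format(row['Title'])))

for url, path in planned:
    print(url, '->', path)
```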

In [28]:
def scrape_topic_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics() # See topics_df above for better understanding
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows(): # looping over rows in pandas dataframe
        print('Scraping top repositories for "{}"'.format(row['Title']))
        csv_converter(row['URL'], 'data/{}.csv'.format(row['Title']))
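
The file name here is built directly from the topic title, so a title containing a character such as `/` would produce an invalid or unexpected path. A minimal sketch of a sanitizing helper is given below. The name `safe_filename` and the choice of allowed characters are assumptions, and `'C/C++'` is a hypothetical title used only to show the effect.

```python
import re

def safe_filename(title, extension='.csv'):
    """Build a conservative file name from a topic title."""
    # Replace runs of anything other than letters, digits, '.', '_' or '-'
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    # Avoid names that are empty or made only of dots and underscores
    cleaned = cleaned.strip('._') or 'untitled'
    return cleaned + extension

print(safe_filename('Amazon Web Services'))
print(safe_filename('ASP.NET'))
print(safe_filename('C/C++'))  # hypothetical title
```

It could be used as `os.path.join('data', safe_filename(row['Title']))` in place of the formatted string.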

### Here's the magic of web scraping

In [29]:
scrape_topic_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"


Exception: Failed to load page https://github.com/topics/bootstrap
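
The run above stops at the first page that fails to load. The notebook does not say why the request failed. If the failure is temporary (for example, the site throttling rapid requests, which is only a guess), retrying after a pause may help. Below is a minimal sketch of a generic retry helper. `retry` and `flaky` are hypothetical names, and the demonstration uses a fake function instead of a real download.

```python
import time

def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and try again, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: re-raise the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

# Demonstration with a made-up function that fails twice, then succeeds
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise Exception('Failed to load page')
    return 'ok'

waits = []  # record the requested delays instead of actually sleeping
result = retry(flaky, attempts=3, base_delay=1.0, sleep=waits.append)
print(result, waits)
```

In the notebook, a download could be wrapped as `retry(lambda: get_topic_page(topic_url))`, since `get_topic_page` raises an exception on a non-200 response.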

Let's see the data directory in which all the csv files have been stored.

In [None]:
os.listdir('data')

**Now reading 3D.csv**

In [None]:
csv_file = pd.read_csv(os.path.join('data', '3D.csv'))
print(csv_file.to_string())