# WEB SCRAPING PROJECT

### Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It’s a useful technique for creating datasets for research and learning.

# Topic Of Project

# Scraping Top Repositories For Topics On GitHub

## Introduction:


- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL

### Pick a website and describe your objective

- We chose GitHub's topic pages because they cover a wide range of topics
- We will extract the data and parse it into a readable form
- Then we'll reshape the parsed data to suit our needs

- GitHub has a lot of topics and many more repositories under each topic
- Extracting every topic and every repository would be impractical
- Therefore, we will extract the top repositories of the top 25 topics on GitHub

## Step 1: Install and Import All Required Libraries

### Use the requests library to download web pages

- Requests is a Python library that lets you send HTTP/1.1 requests very easily. There’s no need to manually add query strings to your URLs

In [2]:
!pip install requests --upgrade --quiet

In [6]:
import requests
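
To illustrate the point about query strings, the sketch below builds a request without sending it and inspects the URL that Requests would use. The `page` parameter is a hypothetical example chosen for illustration; the project itself only downloads the first page of https://github.com/topics.

```python
import requests

# Build (but do not send) a GET request; Requests encodes `params` into the URL.
# The `page` parameter is an assumption for illustration, not part of the project.
request = requests.Request("GET", "https://github.com/topics", params={"page": 2})
prepared = request.prepare()

print(prepared.url)
```

Because the request is only prepared, this runs without any network access.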

### Use Beautiful Soup to parse and extract information

- Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

In [22]:
!pip install beautifulsoup4 --upgrade --quiet

In [7]:
from bs4 import BeautifulSoup
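
As a minimal sketch of how the parse tree is searched, the example below parses a small hand-written HTML string rather than a downloaded page. The snippet and its `title` class are assumptions made up for illustration; GitHub's real markup uses different classes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>Topics</h1>
  <p class="title">3D</p>
  <p class="title">Ajax</p>
</body></html>
"""

doc = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, find_all() returns every match.
heading = doc.find("h1").text
titles = [p.text for p in doc.find_all("p", {"class": "title"})]

print(heading, titles)
```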

### Use Pandas to save the file as CSV 

- pandas is an open-source Python library designed for working with relational or labeled data easily and intuitively.

In [3]:
!pip install pandas --quiet

In [8]:
import pandas as pd
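
As a small illustration of labeled data, the sketch below turns a dict of lists into a DataFrame and writes it out as CSV text. The two rows are made-up sample values, and the CSV goes to an in-memory buffer rather than a file so that nothing is left on disk.

```python
import io

import pandas as pd

# Hypothetical sample rows; each dict key becomes a column label.
topics = {
    "Title": ["3D", "Ajax"],
    "URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
df = pd.DataFrame(topics)

# Write to an in-memory buffer; index=False leaves out the row numbers.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```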

### Import OS to select the Path where Data will be stored

In [20]:
import os

### Getting Topic Information From URL

- Using the Requests library to load the URL
- Using Beautiful Soup to parse the downloaded page

- Using the tags and classes in the parsed page to extract the following:
    1. Title Of Topics
    2. Description Of Topics
    3. URL Of Topics

#### Getting Topic Titles:

In [9]:
def get_topic_titles(doc):
    selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags=doc.find_all('p',{'class':selection_class})

    topic_titles=[]

    for tags in topic_title_tags:
        topic_titles.append(tags.text)

    return topic_titles
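
One detail of the lookup in `get_topic_titles` is worth a sketch. When the `class` value passed to `find_all` contains spaces, Beautiful Soup compares it against the tag's whole `class` attribute string, so the match depends on the exact order of the classes. A CSS selector passed to `select` matches regardless of order. The HTML below is made up for illustration, and the second paragraph with reordered classes is an assumption, not something observed on GitHub.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes on both tags, but in a different order.
html = (
    '<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>'
    '<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>'
)
doc = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first tag qualifies.
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
exact = [p.text for p in doc.find_all("p", {"class": selection_class})]

# CSS selector: any <p> carrying both classes, in any order.
by_selector = [p.text for p in doc.select("p.f3.Link--primary")]

print(exact, by_selector)
```

The selector form tolerates reordered classes but may also match more tags than intended, so which one fits depends on how stable the page markup is.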

#### Getting Topic Description:

In [10]:
def get_topic_descs(doc):
    desc_class='f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags=doc.find_all('p',{'class':desc_class})

    topic_descs=[]

    for tags in topic_desc_tags:
        topic_descs.append(tags.text.strip())

    return topic_descs

#### Getting Topic URL :

In [11]:
def get_topic_url(doc):
    # Each topic links to its page through an <a> tag with this class
    topic_link=doc.find_all('a',{'class':'no-underline flex-grow-0'})

    topic_urls=[]
    base_url='https://github.com'
    for tags in topic_link:
        # The href is a relative path, so prepend the site root
        topic_urls.append(base_url+tags['href'])

    return topic_urls
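
`get_topic_url` builds each link by concatenating `base_url` and the `href`. A possible alternative, sketched below, is `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_url(href, base_url=BASE_URL):
    """Resolve a relative or absolute href against base_url."""
    return urljoin(base_url, href)


# A relative path is joined to the site root.
print(absolute_url("/topics/3d"))
# An absolute URL is returned unchanged instead of being doubled up.
print(absolute_url("https://github.com/topics/ajax"))
```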

### Making Topics Dataframe
- Providing The URL link 
- Getting the Response Object using Requests Library
- Using Beautiful Soup to Parse
- Calling the other functions to get Required Data
- Using pandas to make a Dataframe of the returned Data

In [16]:
def scrape_topic():
    topic_url= 'https://github.com/topics'
    response= requests.get(topic_url)
    doc = BeautifulSoup(response.text,'html.parser')
    
    topics_dict={'Title':get_topic_titles(doc),'Description':get_topic_descs(doc),'URL':get_topic_url(doc)}

    return pd.DataFrame(topics_dict)
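
`pd.DataFrame` raises a `ValueError` when the lists in the dict have different lengths, which could happen if one of the three selectors stops matching. The sketch below shows one way to check the lengths first and report which column is short. The helper `check_equal_lengths` and the sample values are hypothetical.

```python
def check_equal_lengths(columns):
    """Return columns unchanged, or raise ValueError if list lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return columns


# Hypothetical scraped columns with matching lengths.
topics = check_equal_lengths({
    "Title": ["3D", "Ajax"],
    "Description": ["first description", "second description"],
    "URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})
print(len(topics["Title"]))
```

In `scrape_topic`, the dict could be passed through such a check before it reaches `pd.DataFrame`.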

### Getting Topic Repositories Data
- Getting the following items:
    1. Repository Username
    2. Repository Name
    3. Repository Stars
    4. Repository URL

### Using Topic Dataframe to Parse
- Using the Requests Library to Load the Repositories URL
- Using Beautiful Soup to Parse the Given URL

#### Getting Topic Repositories Page :

In [None]:
def get_topic_page(topic_url):
    response= requests.get(topic_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc
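
`get_topic_page` gives up on the first response that is not 200. When many pages are requested in a row, a simple retry with a pause may help. The sketch below is one possible approach, not part of the original project: `fetch_with_retries`, its parameters, and `FakeResponse` are hypothetical names, and the fetch function is passed in so the example runs offline.

```python
import time


def fetch_with_retries(url, http_get, retries=3, delay_seconds=1.0):
    """Call http_get(url) until it returns status 200 or the retries run out."""
    last_status = None
    for attempt in range(1, retries + 1):
        response = http_get(url)
        last_status = response.status_code
        if last_status == 200:
            return response.text
        if attempt < retries:
            time.sleep(delay_seconds)  # pause before trying again
    raise Exception("Failed to load page {} (last status {})".format(url, last_status))


class FakeResponse:
    """Offline stand-in exposing the two attributes the helper reads."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text


# Simulate one failed attempt followed by a successful one.
responses = iter([FakeResponse(429), FakeResponse(200, "<html></html>")])
html = fetch_with_retries(
    "https://github.com/topics/3d", lambda url: next(responses), delay_seconds=0
)
print(html)
```

With the real library, `requests.get` could be passed as `http_get`, since its response objects expose `status_code` and `text`.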

#### Getting Repositories Data :

In [None]:
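def get_repo_info(repo_tags,repo_stars):
    # Site root used to turn the relative href into a full URL
    base_url='https://github.com'
    # The h3 tag holds two links: the owner first, then the repository
    a_tags=repo_tags.find_all('a')
    username= a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    # The title attribute holds the star count with comma separators
    stars=int(repo_stars['title'].replace(",", ""))
    return username, repo_name,stars,repo_url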
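
`get_repo_info` converts the star count by stripping commas from the `title` attribute. If the count were instead read from abbreviated text such as `1.5k`, the conversion would need one more step. The sketch below handles both forms; the helper `parse_star_count` is hypothetical, and the abbreviated format is an assumption about the page, not something the code above relies on.

```python
def parse_star_count(text):
    """Convert '12,345' or an abbreviated form like '1.5k' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # Abbreviated form: scale the number by its suffix.
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)


print(parse_star_count("12,345"))
print(parse_star_count(" 1.5k "))
```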

#### Making a Dataframe of all the Information :

In [None]:
def get_topic_repos(topic_doc):
    
    
    h3_selection_class='f3 color-fg-muted text-normal lh-condensed'
    repo_tags=topic_doc.find_all('h3',{'class':h3_selection_class})
    
    star_selection_class='repo-stars-counter-star'
    repo_stars=topic_doc.find_all('span',{'id':star_selection_class})
    
    repo_dict={'Username':[],'Repo_Name':[],'Stars':[],'Repo_URL':[]}
    
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],repo_stars[i])
        repo_dict['Username'].append(repo_info[0])
        repo_dict['Repo_Name'].append(repo_info[1])
        repo_dict['Stars'].append(repo_info[2])
        repo_dict['Repo_URL'].append(repo_info[3])
    return pd.DataFrame(repo_dict)

#### Calling the functions to make the Dataframe and saving it as a CSV file

In [17]:
def scrape_topic_repo(topic_url,path):
    # Skip topics that were already saved so reruns do not download them again
    if os.path.exists(path):
        print('The File {} already exists. Skipping...'.format(path))
        return
    topic_repo_df=get_topic_repos(get_topic_page(topic_url))
    topic_repo_df.to_csv(path, index=False)
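
The skip-if-exists check is what makes reruns cheap: a file written once is never downloaded again. The sketch below shows the same pattern in isolation. It uses the standard `csv` module instead of pandas and a temporary folder instead of `Github_Data`, and the helper `save_rows_once` and the sample row are hypothetical.

```python
import csv
import os
import tempfile


def save_rows_once(rows, fieldnames, path):
    """Write rows to path unless it already exists; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return True


# Made-up sample row using the same column names as the project.
rows = [{
    "Username": "example-user",
    "Repo_Name": "example-repo",
    "Stars": 1,
    "Repo_URL": "https://github.com/example-user/example-repo",
}]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "3D.csv")
    first = save_rows_once(rows, list(rows[0]), path)   # file is created
    second = save_rows_once(rows, list(rows[0]), path)  # file exists, skipped

print(first, second)
```

One consequence of this design is that stale files are never refreshed, so deleting a CSV is the way to force a new scrape for that topic.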

#### Defining a function to do all the tasks in one go :

In [18]:
def master_function():
    print('Scraping list of topics')
    topics_df= scrape_topic()
    os.makedirs('Github_Data',exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping Top Repositories for "{}"'.format(row['Title']))
        scrape_topic_repo(row['URL'],'Github_Data/{}.csv'.format(row['Title']))
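
`master_function` builds each file path directly from `row['Title']`. A title containing a character such as `/` would be read as a directory separator and the write would fail. The sketch below shows one way to sanitize the title first; the helper `safe_filename` is hypothetical, and `CI/CD` is a made-up title used only to show the effect.

```python
import re


def safe_filename(title, extension=".csv"):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + extension


print(safe_filename("ASP.NET"))
print(safe_filename("CI/CD"))
```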

### Calling the function

In [19]:
master_function()

Scraping list of topics
Scraping Top Repositories for "3D"
The File Github_Data/3D.csv already exists. Skipping...
Scraping Top Repositories for "Ajax"
The File Github_Data/Ajax.csv already exists. Skipping...
Scraping Top Repositories for "Algorithm"
The File Github_Data/Algorithm.csv already exists. Skipping...
Scraping Top Repositories for "Amp"
The File Github_Data/Amp.csv already exists. Skipping...
Scraping Top Repositories for "Android"
The File Github_Data/Android.csv already exists. Skipping...
Scraping Top Repositories for "Angular"
The File Github_Data/Angular.csv already exists. Skipping...
Scraping Top Repositories for "Ansible"
The File Github_Data/Ansible.csv already exists. Skipping...
Scraping Top Repositories for "API"
The File Github_Data/API.csv already exists. Skipping...
Scraping Top Repositories for "Arduino"
The File Github_Data/Arduino.csv already exists. Skipping...
Scraping Top Repositories for "ASP.NET"
The File Github_Data/ASP.NET.csv already exists. Skippi