# Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
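
As a rough illustration of this format, the sketch below builds one row with pandas and renders it as CSV text. The column names mirror the four fields listed above, and the single row is the `libgdx` example from the outline. Passing `index=False` is an assumption made here so that no extra index column appears in the output.

```python
import pandas as pd

# One repository row, taken from the libgdx example in the outline
rows = {
    "Repo Name": ["libgdx"],
    "Username": ["libgdx"],
    "Stars": [18300],
    "Repo URL": ["https://github.com/libgdx/libgdx"],
}

# With no path given, to_csv returns the CSV text as a string;
# index=False leaves out the DataFrame index column
csv_text = pd.DataFrame(rows).to_csv(index=False)
csv_lines = csv_text.splitlines()

print(csv_lines[0])
print(csv_lines[1])
```

Without `index=False`, pandas writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back.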

# Steps to do:
1. Pick a website and describe your objective.
2. Use the requests library to download web pages.
3. Use Beautiful Soup to parse and extract information.
4. Create CSV file(s) with the extracted information.
5. Document and share your work.

## 1. Pick a website and describe your objective.
- Browse through 'https://github.com/topics' and scrape the website.
- Identify the information you'd like to scrape from the website.


In [1]:
## github topic url
base_url = "https://github.com/topics"
base_url

'https://github.com/topics'

## 2. Use the requests library to download web pages.
Libraries used in this notebook:

1. requests
2. BeautifulSoup
3. os
4. pandas

In [2]:

## Importing the required libraries
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
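
A minimal sketch of the download step, assuming a helper named `download_page` (a hypothetical name, not part of the notebook). It uses `requests.get` with a timeout and `raise_for_status()`. The 10-second timeout and the optional `session` parameter are illustrative choices.

```python
import requests

def download_page(url, session=None, timeout=10):
    """Return the HTML text of `url`, raising an error on a failed request.

    `session` is a hypothetical hook: any object with a compatible
    `get(url, timeout=...)` method (for example a requests.Session).
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `download_page("https://github.com/topics")` would return the page source as a string, ready to pass to Beautiful Soup.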

## 3. Use Beautiful Soup to parse and extract information.
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
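
The extraction pattern can be tried offline first. The sketch below parses a small hand-written HTML snippet instead of the live page. The class names `title` and `desc` and the description text are made up for illustration and are not GitHub's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded topics page
sample_html = """
<div>
  <a href="/topics/3d"><p class="title">3D</p><p class="desc"> Example description one </p></a>
  <a href="/topics/ajax"><p class="title">Ajax</p><p class="desc"> Example description two </p></a>
</div>
"""

doc = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; .text gives the tag's text content
titles = [tag.text for tag in doc.find_all("p", class_="title")]
descriptions = [tag.text.strip() for tag in doc.find_all("p", class_="desc")]

# href values are relative, so prefix them with the site root
urls = ["https://github.com" + tag["href"] for tag in doc.find_all("a")]

print(titles)
print(descriptions)
print(urls)
```

The three lists line up by position, which is what lets them be combined into a dictionary of columns.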


In [3]:
## Main web page scraping

## Download the page at base_url and parse it with the HTML parser
def read_website_as_html(base_url):
    ## Download the page
    response = requests.get(base_url)
    ## Any status code other than 200 means the page was not downloaded successfully
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(base_url))
    ## A status code of 200 means the page was read successfully,
    ## so parse the HTML text into a BeautifulSoup document
    topics_html = BeautifulSoup(response.text, 'html.parser')
    return topics_html


In [4]:
## Getting the topic titles, descriptions, and URLs
def get_topics_info(topics_html):
    ## To get the topic titles
    topic_title_tags = topics_html.find_all("p", {'class': "f3 lh-condensed mb-0 mt-1 Link--primary"})
    topic_titles = list()
    for tag in topic_title_tags:
        topic_titles.append(tag.text)

    ## To get the topic descriptions
    topic_desc_tags = topics_html.find_all("p", {"class": "f5 color-fg-muted mb-0 mt-1"})
    topic_descriptions = list()
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())

    ## To get the topic URLs; each href is relative, so prefix it with the GitHub URL
    git_url = 'https://github.com'
    topic_link_tags = topics_html.find_all("a", {'class': "no-underline flex-1 d-flex flex-column"})
    topic_urls = list()
    for tag in topic_link_tags:
        topic_urls.append(git_url + tag['href'])

    ## Creating the dictionary for topic info
    topics_dict = {
        'Topic Title': topic_titles,
        'Description': topic_descriptions,
        'Topic_URLs': topic_urls
    }
    return topics_dict

## 4. Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
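
The read-back check in the last bullet could look roughly like the sketch below. It writes a two-topic table to a temporary directory (so nothing in the project folder is touched), reads it back with `pd.read_csv`, and compares the result with the original data. The temporary directory and the `round_trip_ok` variable are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Two topics and their URLs, as listed on the topics page
topics_dict = {
    "Topic Title": ["3D", "Ajax"],
    "Topic_URLs": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Topics.csv")
    # index=False keeps the index out of the file
    pd.DataFrame(topics_dict).to_csv(csv_path, index=False)
    # Read the file back while the temporary directory still exists
    loaded_df = pd.read_csv(csv_path)

# Compare column by column as plain Python lists
round_trip_ok = loaded_df.to_dict("list") == topics_dict
print(round_trip_ok)
```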

In [5]:
def creating_dataframe(topics_dict):
    topics_df = pd.DataFrame(topics_dict)
    return topics_df


## Creating Pipeline

In [6]:
## Calling all the functions
base_url = 'https://github.com/topics'
topics_html = read_website_as_html(base_url)
topics_info = get_topics_info(topics_html)
df = creating_dataframe(topics_info)
print(df)

               Topic Title                                        Description  \
0                       3D  3D refers to the use of three-dimensional grap...   
1                     Ajax  Ajax is a technique for creating interactive w...   
2                Algorithm  Algorithms are self-contained sequences that c...   
3                      Amp  Amp is a non-blocking concurrency library for ...   
4                  Android  Android is an operating system built by Google...   
5                  Angular  Angular is an open source web application plat...   
6                  Ansible  Ansible is a simple and powerful automation en...   
7                      API  An API (Application Programming Interface) is ...   
8                  Arduino  Arduino is an open source platform for buildin...   
9                  ASP.NET  ASP.NET is a web framework for building modern...   
10                    Atom  Atom is a open source text editor built with w...   
11           Awesome Lists  

In [7]:
#path=os.path.join('Github Topics','Topics')
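path = './Github Topics'
try:
    os.mkdir(path)
except Exception as e:
    ## The directory may already exist from an earlier run
    print(e)
## Write the CSV outside the try block so it is saved even when the directory already exists
df.to_csv(os.path.join(path, 'Topics.csv'))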

[WinError 183] Cannot create a file when that file already exists: './Github Topics'
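
The error above appears because `os.mkdir` fails when the directory already exists. One possible alternative is `os.makedirs` with `exist_ok=True`, which also creates missing parent directories. The sketch runs inside a temporary directory, which is an assumption made only to keep the example self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "Github Topics", "Topics")

    # Creates both "Github Topics" and "Github Topics/Topics"
    os.makedirs(target, exist_ok=True)
    # Calling it again is harmless: exist_ok=True suppresses the "already exists" error
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)

print(created)
```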


# Scrape each topic inside GitHub Topics

In [8]:

## Convert the star count string into an integer.
## A trailing 'k' means thousands, so the numeric part is multiplied by 1000.
def stars_int(star):
    star = star.strip()
    if star[-1] == 'k':
        return int(float(star[:-1]) * 1000)
    else:
        return int(star)

## Extract the username, repo name, total number of stars, and repo URL for one repository
def get_repo_info(repo_tag, stars_tag):
    base_url = 'https://github.com'
    ## The first link in the heading is the user; the second is the repository
    a_tags = repo_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = stars_int(stars_tag)
    repo_url = base_url + a_tags[1]['href']
    return user_name, repo_name, stars, repo_url


## Scrape the repository tags and star counts
## for a single topic, given the topic URL
def get_topic_repos(topic_url):
    ## Download the topic page and check the status code
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    doc = BeautifulSoup(response.text, 'html.parser')

    repo_tags = doc.find_all('h3', {'class': "f3 color-fg-muted text-normal lh-condensed"})
    star_tags = doc.find_all('span', {'class': "Counter js-social-count"})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    ## Adding all values to the dictionary.
    ## zip pairs each repository with its star count and stops at the shorter list.
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        repo_info = get_repo_info(repo_tag, star_tag.text)
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    ## Creating the pandas DataFrame from the dictionary.
    df = pd.DataFrame(topic_repos_dict)

    return df
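
The `stars_int` helper above relies on `int(float(...) * 1000)`, which truncates toward zero. A slightly more defensive variant is sketched below under the hypothetical name `parse_star_count`. It rounds before converting, accepts an upper-case `K`, and drops thousands separators. These are assumptions about inputs the page might produce, not behavior confirmed by the notebook.

```python
def parse_star_count(text):
    """Convert a star label such as '18.3k' or '987' into an integer."""
    # Normalize whitespace, letter case, and thousands separators
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.endswith("k"):
        # round() guards against float results such as 8199.999... being truncated
        return int(round(float(cleaned[:-1]) * 1000))
    return int(cleaned)

print(parse_star_count("18.3k"))
print(parse_star_count(" 987 "))
```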
    
        

In [9]:
## Print each topic URL
topic_urls = pd.read_csv('Github Topics/Topics.csv')['Topic_URLs']
for url in topic_urls:
    print(url)

https://github.com/topics/3d
https://github.com/topics/ajax
https://github.com/topics/algorithm
https://github.com/topics/amphp
https://github.com/topics/android
https://github.com/topics/angular
https://github.com/topics/ansible
https://github.com/topics/api
https://github.com/topics/arduino
https://github.com/topics/aspnet
https://github.com/topics/atom
https://github.com/topics/awesome
https://github.com/topics/aws
https://github.com/topics/azure
https://github.com/topics/babel
https://github.com/topics/bash
https://github.com/topics/bitcoin
https://github.com/topics/bootstrap
https://github.com/topics/bot
https://github.com/topics/c
https://github.com/topics/chrome
https://github.com/topics/chrome-extension
https://github.com/topics/cli
https://github.com/topics/clojure
https://github.com/topics/code-quality
https://github.com/topics/code-review
https://github.com/topics/compiler
https://github.com/topics/continuous-integration
https://github.com/topics/covid-19
https://github.com/

In [10]:
path = './Github Topics/Topics'
try:
    os.mkdir(path)
except Exception as e:
    print(e)
## Read the topics CSV once and reuse it for both columns
topics_df = pd.read_csv('Github Topics/Topics.csv')
topic_url = topics_df['Topic_URLs']
topic_title = topics_df['Topic Title']
for i in range(len(topic_url)):
    df = get_topic_repos(topic_url[i])
    ## Save each topic's repositories inside the Topics directory created above
    df.to_csv(os.path.join(path, str(topic_title[i]) + '.csv'))
    print(f'Scraping is successfully completed {topic_title[i]}.')
print('Scraping was done...')
    

[WinError 183] Cannot create a file when that file already exists: './Github Topics/Topics'
Scraping is successfully completed 3D.
Scraping is successfully completed Ajax.
Scraping is successfully completed Algorithm.
Scraping is successfully completed Amp.
Scraping is successfully completed Android.
Scraping is successfully completed Angular.
Scraping is successfully completed Ansible.
Scraping is successfully completed API.
Scraping is successfully completed Arduino.
Scraping is successfully completed ASP.NET.
Scraping is successfully completed Atom.
Scraping is successfully completed Awesome Lists.
Scraping is successfully completed Amazon Web Services.
Scraping is successfully completed Azure.
Scraping is successfully completed Babel.
Scraping is successfully completed Bash.
Scraping is successfully completed Bitcoin.
Scraping is successfully completed Bootstrap.
Scraping is successfully completed Bot.
Scraping is successfully completed C.
Scraping is successfully completed Chrome.
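
Topic titles such as "Awesome Lists" and "Amazon Web Services" contain spaces, and other titles may contain characters that are awkward in file names. One way to build a safe path per topic is sketched below. The helper name `topic_csv_path` and the replace-with-underscore rule are assumptions for illustration, not part of the notebook.

```python
import os
import re

def topic_csv_path(directory, topic_title):
    """Build a CSV path for a topic, replacing unsafe characters with '_'."""
    # Keep letters, digits, dots, underscores, and hyphens; collapse everything else
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", topic_title.strip())
    return os.path.join(directory, safe_name + ".csv")

for title in ["3D", "ASP.NET", "Awesome Lists"]:
    print(topic_csv_path("Github Topics/Topics", title))
```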