# Scraping the top trending repositories on GitHub

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

## Pick a website and describe your objective

  -  Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
  -  Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
  -  Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.



- This project scrapes https://github.com/topics.
- For a selection of the topics listed there, it collects the top 20 repositories (owner, name, URL, and star count) and writes one CSV file per topic.

## Use the requests library to download web pages

  -  Inspect the website's HTML source and identify the right URLs to download.
  -  Download and save web pages locally using the requests library.
  -  Create a function to automate downloading for different topics/search queries.



In [16]:
!pip3 install requests --upgrade --quiet

In [17]:
import requests

url = 'https://github.com/topics'
r = requests.get(url)
r.status_code

200

In [18]:
# with open('topics.html', 'w') as page:
#     page.write(r.text)
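
The commented-out save step and the "create a function" item above can be combined into one helper. This is a minimal sketch, not the notebook's own code: `page_filename`, `download_page`, and the `pages` output directory are hypothetical names chosen for illustration, and the 30-second timeout is an arbitrary assumption.

```python
from pathlib import Path
from urllib.parse import urlparse

import requests


def page_filename(url):
    """Derive a local file name from the last path segment of a URL."""
    last_segment = urlparse(url).path.rstrip('/').split('/')[-1]
    # Fall back to "index" when the URL has no path (assumed naming choice)
    return f"{last_segment or 'index'}.html"


def download_page(url, out_dir='pages'):
    """Download a page, fail loudly on a bad status, and save the HTML locally."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses

    out_path = Path(out_dir) / page_filename(url)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(response.text, encoding='utf-8')
    return out_path
```

Calling `download_page('https://github.com/topics')` would save the page as `pages/topics.html`. Keeping a local copy lets you rerun the parsing steps without requesting the page again.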

## Use Beautiful Soup to parse and extract information

  -  Parse and explore the structure of downloaded web pages using Beautiful Soup.
  -  Use the right properties and methods to extract the required information.
  -  Create functions to extract from the page into lists and dictionaries.

In [19]:
!pip3 install beautifulsoup4 --upgrade --quiet

In [20]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

In [21]:
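# Each topic on the page sits in a <div> with the class "py-4"
topic_divs = soup.find_all('div', {'class' : 'py-4'})
topic_titles = []
topic_urls = []
topic_descs = []
for div in topic_divs:
    # The first link in the block points to the topic page (relative URL)
    link = 'https://github.com' + div.find('a')['href']
    topic_urls.append(link)

    # The first <p> holds the title, the second the description
    topic_info = div.find_all('p')

    title = topic_info[0].text.strip()
    topic_titles.append(title)

    desc = topic_info[1].text.strip()
    topic_descs.append(desc)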
    

In [22]:
topics = {
    "title" : topic_titles,
    "description" : topic_descs,
    "url" : topic_urls
}

In [23]:
!pip3 install pandas --upgrade --quiet

In [24]:
import pandas as pd
df = pd.DataFrame(topics)
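
The parsing loop above can also be wrapped in a function that returns a list of dictionaries, as the third bullet of this section suggests. The sketch below assumes the same page structure as the loop (a `div` with class `py-4` holding a link and two `p` tags). `parse_topics` is a hypothetical name, and `sample_html` is a made-up fragment for illustration, not real GitHub markup.

```python
from bs4 import BeautifulSoup


def parse_topics(html, base_url='https://github.com'):
    """Extract title, description, and URL for each topic block in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    topics = []
    for div in soup.find_all('div', class_='py-4'):
        link = div.find('a')
        paragraphs = div.find_all('p')
        # Skip blocks that do not look like a topic entry
        if link is None or len(paragraphs) < 2:
            continue
        topics.append({
            'title': paragraphs[0].text.strip(),
            'description': paragraphs[1].text.strip(),
            'url': base_url + link['href'],
        })
    return topics


# Made-up HTML fragment that mimics the assumed structure
sample_html = """
<div class="py-4">
  <a href="/topics/3d">
    <p>3D</p>
    <p> A placeholder description of the topic. </p>
  </a>
</div>
<div class="py-4"><p>No link here</p></div>
"""

sample_topics = parse_topics(sample_html)
print(sample_topics)
```

A list of dictionaries can be passed straight to `pd.DataFrame`, and skipping incomplete blocks avoids an `IndexError` or `TypeError` if the page contains other `py-4` elements.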

## Scraping the top repositories for each topic

In [25]:
def get_repo_df(topic_url):
    """Scrape one topic page and return its repositories as a DataFrame."""
    r = requests.get(topic_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Each repository header holds the owner link, the repo link, and the star counter
    repos = soup.find_all('div', {'class' : 'd-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3'})

    repo_urls = []
    repo_names = []
    repo_users = []
    repo_stars = []

    for repo in repos:
        # The first two "Link" anchors are the owner and the repository name
        repo_info = repo.find_all('a', {'class' : "Link"})

        user = repo_info[0].text.strip()
        repo_users.append(user)

        repo_name = repo_info[1].text.strip()
        repo_names.append(repo_name)

        repo_url = 'https://github.com/' + user + '/' + repo_name
        repo_urls.append(repo_url)

        # The star count is kept as displayed text
        stars = repo.find('span', {'id' : 'repo-stars-counter-star'}).text.strip()
        repo_stars.append(stars)

    repos_info = {
        "repository owner" : repo_users,
        "repository name" : repo_names,
        "repository url" : repo_urls,
        "repository stars" : repo_stars
    }

    repo_df = pd.DataFrame(repos_info)
    return repo_df
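
The function above stores star counts as the text shown on the page. If numeric values are needed for sorting or filtering, a small converter can be applied afterwards. This sketch assumes the counter text is either a plain number or abbreviated with a `k` or `m` suffix (for example `1.2k`); `parse_star_count` is a hypothetical helper, not part of the original notebook.

```python
def parse_star_count(text):
    """Convert a star label such as '1.2k' or '12,345' to an integer."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # Abbreviated form: scale the numeric part by the suffix
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    # Plain form: the text is already a whole number
    return int(cleaned)


print(parse_star_count('1.2k'))
print(parse_star_count(' 987 '))
```

Applied to a DataFrame, this could look like `repo_df['repository stars'].map(parse_star_count)`. Note that abbreviated counts are rounded by the site, so the converted numbers are approximate.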

## Create CSV file(s) with the extracted information

  -  Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
  -  Execute the function with different inputs to create a dataset of CSV files.
  -  Verify the information in the CSV files by reading them back using Pandas.
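
The read-back check in the last bullet can be sketched as follows. `verify_csv` is a hypothetical helper, and the row written here is placeholder data in a temporary directory, used only so the example runs on its own. The column names match those produced by `get_repo_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = [
    'repository owner',
    'repository name',
    'repository url',
    'repository stars',
]


def verify_csv(csv_path, expected_columns=EXPECTED_COLUMNS):
    """Read a CSV back and return a list of problems found (empty if none)."""
    df = pd.read_csv(csv_path)
    problems = []
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if df.empty:
        problems.append("file has no rows")
    if df.isna().any().any():
        problems.append("file has empty cells")
    return problems


# Placeholder row for illustration only
sample_df = pd.DataFrame({
    'repository owner': ['example-owner'],
    'repository name': ['example-repo'],
    'repository url': ['https://github.com/example-owner/example-repo'],
    'repository stars': ['1.2k'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample.csv'
    sample_df.to_csv(sample_path, index=False)
    problems = verify_csv(sample_path)

print(problems)
```

An empty list means the file has the expected columns, at least one row, and no empty cells. The same function could be run over every file in `csv-files/` once the loop below has finished.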



In [26]:
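import os

# to_csv does not create missing folders, so make sure the output folder exists
os.makedirs("csv-files", exist_ok=True)

for topic in topics["url"][:21]:
    # Use the last URL segment (the topic slug) as the file name
    df_name = topic.split('/')[-1]
    print(f"collecting data from : {df_name}")
    repo_df = get_repo_df(topic)
    repo_df.to_csv(f"csv-files/{df_name}.csv", index=False)
    print(f"processing done for : {df_name} ")
    print("*" * 20 + "\n\n")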

collecting data from : 3d
processing done for : 3d 
********************


collecting data from : ajax
processing done for : ajax 
********************


collecting data from : algorithm
processing done for : algorithm 
********************


collecting data from : amphp
processing done for : amphp 
********************


collecting data from : android
processing done for : android 
********************


collecting data from : angular
processing done for : angular 
********************


collecting data from : ansible
processing done for : ansible 
********************


collecting data from : api
processing done for : api 
********************


collecting data from : arduino
processing done for : arduino 
********************


collecting data from : aspnet
processing done for : aspnet 
********************


collecting data from : atom
processing done for : atom 
********************


collecting data from : awesome
processing done for : awesome 
********************


collecting d