# Scraping the top trending repositories on GitHub

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

## Pick a website and describe your objective

  -  Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
  -  Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
  -  Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.



- This project uses https://github.com/topics as the starting page.
- The goal is to scrape the top 20 repositories listed under a selection of topics.

## Use the requests library to download web pages

  -  Inspect the website's HTML source and identify the right URLs to download.
  -  Download and save web pages locally using the requests library.
  -  Create a function to automate downloading for different topics/search queries.
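
The last step in the list above can be sketched as a pair of small helpers. This is a minimal sketch rather than the notebook's own code: the names `topic_page_url` and `download_page` and the `timeout` default are assumptions introduced for illustration, and it assumes each topic page lives at `https://github.com/topics/<name>`.

```python
import requests

BASE_URL = 'https://github.com/topics'


def topic_page_url(topic):
    """Build the page URL for a topic name such as 'machine-learning'."""
    return f'{BASE_URL}/{topic}'


def download_page(url, path, timeout=10):
    """Download `url`, save its HTML to `path`, and return the HTML text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    # Write as UTF-8 so non-ASCII characters survive on every platform.
    with open(path, 'w', encoding='utf-8') as page:
        page.write(response.text)
    return response.text
```

With these helpers, a call such as `download_page(topic_page_url('python'), 'python.html')` would fetch and save one topic page. Looping over a list of topic names would cover several.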



In [1]:
!pip3 install requests --upgrade --quiet

In [2]:
import requests

url = 'https://github.com/topics'
r = requests.get(url)
r.status_code

200

In [4]:
# Save the downloaded HTML locally so it can be inspected without another request.
with open('topics.html', 'w', encoding='utf-8') as page:
    page.write(r.text)

## Use Beautiful Soup to parse and extract information

  -  Parse and explore the structure of downloaded web pages using Beautiful Soup.
  -  Use the right properties and methods to extract the required information.
  -  Create functions to extract data from the page into lists and dictionaries.

In [7]:
!pip3 install beautifulsoup4 --upgrade --quiet

In [9]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

In [51]:
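# This code assumes each topic on the page sits in a <div> with the class 'py-4'.
topic_divs = soup.find_all('div', {'class': 'py-4'})
topic_titles = []
topic_urls = []
topic_descs = []
for div in topic_divs:
    # The first link in the div holds the topic's relative URL.
    link = 'https://github.com' + div.find('a')['href']
    topic_urls.append(link)

    # The first <p> is the title and the second is the description.
    topic_info = div.find_all('p')

    title = topic_info[0].text.strip()
    topic_titles.append(title)

    desc = topic_info[1].text.strip()
    topic_descs.append(desc)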
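
The loop above could also be wrapped in a function that returns one dictionary per topic, which makes it easy to reuse on other downloaded pages. The sketch below is illustrative only: `parse_topics` is a hypothetical name, and `SAMPLE_HTML` is a made-up snippet that imitates the structure the loop relies on, not real GitHub markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the assumed structure: a 'py-4' div holding
# one link and two paragraphs (title, then description).
SAMPLE_HTML = """
<div class="py-4">
  <a href="/topics/example">
    <p>Example topic</p>
    <p>  Example description.  </p>
  </a>
</div>
"""


def parse_topics(html):
    """Return a list of {'title', 'description', 'url'} dicts found in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    topics = []
    for div in soup.find_all('div', {'class': 'py-4'}):
        link = div.find('a')
        paragraphs = div.find_all('p')
        # Skip divs that share the class but lack the expected link or text.
        if link is None or len(paragraphs) < 2:
            continue
        topics.append({
            'title': paragraphs[0].text.strip(),
            'description': paragraphs[1].text.strip(),
            'url': 'https://github.com' + link['href'],
        })
    return topics


topics_list = parse_topics(SAMPLE_HTML)
```

Skipping divs that lack a link or a second paragraph keeps the function from failing with an `IndexError` or `TypeError` if an unrelated element happens to use the same class.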
    

In [None]:
# Collect the three parallel lists into one dictionary, keyed by column name.
topics = {
    "title": topic_titles,
    "description": topic_descs,
    "url": topic_urls
}

In [36]:
!pip3 install pandas --upgrade --quiet

In [54]:
import pandas as pd
# Build a table with one row per topic from the dictionary of lists.
df = pd.DataFrame(topics)
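
Since the objective mentions an output CSV file, the resulting table could be written out with `DataFrame.to_csv`. The sketch below uses a hypothetical one-row dictionary in place of the scraped lists, and the file name `topics.csv` is an assumption.

```python
import pandas as pd

# Hypothetical data standing in for the scraped titles, descriptions, and URLs.
sample_topics = {
    'title': ['Example topic'],
    'description': ['Example description.'],
    'url': ['https://github.com/topics/example'],
}

sample_df = pd.DataFrame(sample_topics)

# index=False keeps the DataFrame's row numbers out of the file.
sample_df.to_csv('topics.csv', index=False)

# Read the file back to confirm the columns were written as expected.
restored = pd.read_csv('topics.csv')
```

Passing `index=False` is a design choice: the default row index carries no information here, so leaving it out keeps the CSV limited to the three scraped columns.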