# What Is Web Scraping?

Web scraping is an automated method for obtaining large amounts of data from websites. The data usually arrives as unstructured HTML, which is then converted into structured data in a spreadsheet or stored in a database. A scraper can also replicate an entire website's content elsewhere.

Web scraping is used in a variety of businesses that rely on data harvesting. Legitimate use cases include:

 1) Search engine bots crawling a site, analyzing its content and then ranking it.
 2) Price comparison sites deploying bots to auto-fetch prices and product descriptions from allied seller websites.
 3) Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).

## Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree from the page source so that you can navigate, search, and extract data from the HTML.
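
As a minimal sketch of what "interacting with HTML" looks like, the snippet below parses a small hand-written HTML string (hypothetical, not taken from any real site) and reads a heading, a paragraph, and a link attribute. It assumes the `bs4` package is installed and uses Python's built-in `html.parser`, so no extra parser is required.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (hypothetical, for illustration only).
html = """
<html>
  <body>
    <h1 id="title">Topics</h1>
    <p class="desc">Browse popular topics.</p>
    <a href="/topics/python">Python</a>
  </body>
</html>
"""

# "html.parser" ships with Python, so no additional install is needed.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1").get_text(strip=True)                    # text inside <h1>
description = soup.find("p", class_="desc").get_text(strip=True)  # text inside <p class="desc">
link = soup.find("a")["href"]                                     # value of the href attribute

print(heading)      # Topics
print(description)  # Browse popular topics.
print(link)         # /topics/python
```

Tag text is read with `get_text()`, while attributes such as `href` are read with dictionary-style access on the tag.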

## Project Overview

In this project, our aim is to scrape all the **topics of repositories available on GitHub**. **GitHub** is a hosting platform that developers use to store their code online and track changes to it as they work on their projects.

GitHub repositories are the central place for storing a project's files along with their version history. With repositories, developers can organize, monitor, and save changes to their code in a remote environment.

**Topics** are labels that create subject-based connections between GitHub repositories and let you explore projects by type, technology, intended purpose, language, subject area and more. With topics, you can explore repositories in a particular subject area, find projects to contribute to, and discover new solutions to a specific problem. 

If you want to explore repositories about a certain topic, find projects to contribute to, or learn which topics are most popular on GitHub, you can search topics with search qualifiers. This project simplifies that by extracting the list of topics under which repositories are available on GitHub.


## Roadmap of Project

- Identify the page you are going to scrape.
- Import the necessary libraries before starting, e.g. Requests and Beautiful Soup.
- Pull down the content of the page into a Python (string) variable. For simpler web scraping tasks you can do this with the requests package.
- Parse the HTML page with Beautiful Soup.
- Use the find and find_all functions to locate the data you need.
- Grab the necessary data and export it to CSV using pandas.
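
The difference between the two search functions named in the roadmap can be sketched on a small, hand-written HTML list (hypothetical markup, not GitHub's). `find` returns the first matching tag, or `None` when nothing matches, while `find_all` returns every match as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to illustrate find vs. find_all.
html = """
<ul>
  <li class="topic">3D</li>
  <li class="topic">Ajax</li>
  <li class="other">Not a topic</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_topic = soup.find("li", class_="topic")       # first match only
all_topics = soup.find_all("li", class_="topic")    # every match, as a list
missing = soup.find("li", class_="does-not-exist")  # None when nothing matches

topic_names = [li.get_text(strip=True) for li in all_topics]
print(first_topic.get_text(strip=True))
print(topic_names)
print(missing)
```

Because `find` can return `None`, it is worth checking its result before calling methods such as `get_text()` on it.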

### Importing Libraries

In [1]:
import requests  # HTTP library used to download the pages
import pandas as pd  # used to build the DataFrame and export it to CSV
from bs4 import BeautifulSoup as bs  # HTML parser

In [2]:
# Check that we can iterate through the pages of GitHub topics (8 pages in total are expected).
# range(1, 8) stops before 8, so pages 1 through 7 are visited; use range(1, 9) to include page 8.
for page in range(1, 8):
    url = f'https://github.com/topics?page={page}'
    print(url)
    

https://github.com/topics?page=1
https://github.com/topics?page=2
https://github.com/topics?page=3
https://github.com/topics?page=4
https://github.com/topics?page=5
https://github.com/topics?page=6
https://github.com/topics?page=7
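
The same list of URLs can also be built up front with a small helper, which makes the page range explicit. This is only a sketch: `build_page_urls` and `BASE_TOPICS_URL` are hypothetical names introduced here, and the page range is passed in rather than hard-coded.

```python
BASE_TOPICS_URL = "https://github.com/topics"

def build_page_urls(first_page, last_page):
    """Return the topic-listing URLs for first_page..last_page (inclusive)."""
    return [
        f"{BASE_TOPICS_URL}?page={page}"
        for page in range(first_page, last_page + 1)
    ]

urls = build_page_urls(1, 7)  # same pages as the loop above
for url in urls:
    print(url)
```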


In [3]:
titles = []       # titles of topics
topic_descs = []  # descriptions of topics
topic_urls = []   # URL of each topic, used to navigate to its repos
base_url = 'https://github.com'

# Exact class strings of the tags that hold each piece of data
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_link_tags = 'no-underline flex-1 d-flex flex-column'

for page in range(1, 8):  # pages 1 through 7
    url = f'{base_url}/topics?page={page}'
    response = requests.get(url)  # download the page
    response.raise_for_status()   # stop early on an HTTP error (4xx/5xx)
    soup = bs(response.content, 'lxml')  # parse the HTML (requires the lxml package)

    for p in soup.find_all('p', {'class': selection_class}):
        titles.append(p.get_text(strip=True))
    for p in soup.find_all('p', {'class': desc_selector}):
        topic_descs.append(p.get_text(strip=True))
    for a in soup.find_all('a', {'class': topic_link_tags}):
        topic_urls.append(base_url + a['href'])
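
Collecting titles, descriptions, and URLs in three separate passes relies on every topic having all three pieces; if one description were missing, the lists would fall out of step. The sketch below shows one possible alternative that extracts each topic as a unit. It assumes that the title and description `<p>` tags are nested inside the topic's `<a>` tag, which should be checked against the live page. `extract_topics` is a hypothetical helper, and the sample HTML is hand-written so the example runs offline.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"
LINK_CLASS = "no-underline flex-1 d-flex flex-column"
TITLE_CLASS = "f3 lh-condensed mb-0 mt-1 Link--primary"
DESC_CLASS = "f5 color-fg-muted mb-0 mt-1"

def extract_topics(html):
    """Return one dict per topic link, with None for any missing field."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.find_all("a", {"class": LINK_CLASS}):
        # Search inside this link only, so fields stay paired with their topic.
        title_tag = a.find("p", {"class": TITLE_CLASS})
        desc_tag = a.find("p", {"class": DESC_CLASS})
        rows.append({
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "description": desc_tag.get_text(strip=True) if desc_tag else None,
            "Url": BASE_URL + a["href"],
        })
    return rows

# Hand-written sample (assumed structure); the second topic has no description.
sample_html = """
<a href="/topics/3d" class="no-underline flex-1 d-flex flex-column">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">3D refers to three-dimensional graphics.</p>
</a>
<a href="/topics/example" class="no-underline flex-1 d-flex flex-column">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Example</p>
</a>
"""

rows = extract_topics(sample_html)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`, with missing descriptions showing up as empty values instead of shifting later rows.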
       

In [4]:
titles  # check that all topics were extracted

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++',
 'Cryptocurrency',
 'Crystal',
 'C#',
 'CSS',
 'Data structures',
 'Data visualization',
 'Database',
 'Deep learning',
 'Dependency management',
 'Deployment',
 'Django',
 'Docker',
 'Documentation',
 '.NET',
 'Electron',
 'Elixir',
 'Emacs',
 'Ember',
 'Emoji',
 'Emulator',
 'ESLint',
 'Ethereum',
 'Express',
 'Firebase',
 'Firefox',
 'Flask',
 'Font',
 'Framework',
 'Front end',
 'Game engine',
 'Git',
 'GitHub API',
 'Go',
 'Google',
 'Gradle',
 'GraphQL',
 'Gulp',
 'Hacktoberfest',
 'Haskell',
 'Homebrew',
 'Homebridge',
 'HTML',
 'HTTP',
 'Icon font',
 'iOS',
 'IPFS',
 'Java',
 'JavaScript',
 'Je

In [5]:
topic_descs  # check that all topic descriptions were extracted

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azure 

In [6]:
topic_urls  # check that all URLs were extracted

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [7]:
# Put the three lists together in a dictionary (one key per column)
topic_dict = {'title': titles,
              'description': topic_descs,
              'Url': topic_urls}

In [8]:
# Convert the dictionary to a DataFrame using pandas
topics_df = pd.DataFrame(topic_dict)
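
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, so a quick length check before the conversion can make a scraping mismatch easier to diagnose. The helper below is a hypothetical sketch using only the standard library; the sample lists are made up for illustration.

```python
def check_equal_lengths(columns):
    """Return each column's length, raising ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Made-up sample data standing in for the scraped lists.
sample_dict = {
    "title": ["3D", "Ajax"],
    "description": ["About 3D.", "About Ajax."],
    "Url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample_dict))
```

Calling `check_equal_lengths(topic_dict)` just before building the DataFrame would report which list is short if the selectors ever stop matching some topics.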

In [9]:
topics_df

| | title | description | Url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| ... | ... | ... | ... |
| 175 | Windows | Windows is Microsoft's GUI-based operating sys... | https://github.com/topics/windows |
| 176 | WordPlate | WordPlate is a modern WordPress stack which si... | https://github.com/topics/wordplate |
| 177 | WordPress | WordPress is a popular content management syst... | https://github.com/topics/wordpress |
| 178 | Xamarin | Xamarin is a platform for developing iOS and A... | https://github.com/topics/xamarin |


#### Converting the DataFrame to CSV

In [11]:
# Write the DataFrame to a CSV file in the current working directory
topics_df.to_csv('topics_of_repos_available_in_github_webscraping.csv')
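
By default, `to_csv` also writes the DataFrame's row index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. The sketch below shows the difference that `index=False` makes. It uses a small made-up DataFrame and in-memory buffers instead of files, so it is an illustration of the option and not a change to the export above.

```python
import io

import pandas as pd

# Small made-up DataFrame standing in for topics_df.
df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "Url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

with_index = io.StringIO()
df.to_csv(with_index)  # default: the row index is written as the first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # omit the row index

# Rewind the buffers and read both versions back.
with_index.seek(0)
without_index.seek(0)
reloaded_with = pd.read_csv(with_index)
reloaded_without = pd.read_csv(without_index)

print(list(reloaded_with.columns))     # ['Unnamed: 0', 'title', 'Url']
print(list(reloaded_without.columns))  # ['title', 'Url']
```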