### Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

1. Pick a website and describe your objective

- Browse through different sites and pick on to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Juptyer notebook.
2. Use the requests library to download web pages.

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
- Use Beautiful Soup to parse and extract information

3. Parse and explore the structure of downloaded web pages using Beautiful soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- Use a REST API to acquire additional information if required.
4. Create CSV file(s) with the extracted information.

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
5. Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Write a blog post about your project and share it online.

## Scraping Top Repositories for Topics on GitHub


### Scrape the list of topics from Github

### 1. Pick a website and describe your objective

- Browse through different sites and pick on to scrape. 
- Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Juptyer notebook. 

### 2. Use the requests library to download web pages

Explain how you'll do it.

- request is a library which is used to get the web pages. use requests to downlaod the page
- user BS4 to parse and extract information
- convert to a Pandas dataframe

Let's write a function to download the page.

- In jupyter, whenever "!" symbol is used in jupyter notebook/lab, jupyter assumes that this is not a python code but this is something to execute on the system directly within the terminal or shell or the command line, jupyter will consider this kind of code (!) is not a python code and takes directly to the system to execute it on the terminal.
- And use quiet command not to see the output of it.

In [1]:
! pip install requests --upgrade --quiet

You should consider upgrading via the 'c:\users\davea\anaconda3\python.exe -m pip install --upgrade pip' command.


In [2]:
# import requests library
import requests

### Scraping has two parts
- 1. The information get from the web page
- 2. Then parse the web page

### requests.get
- creats response object 

In [3]:
topicsURL = 'https://github.com/topics'

In [4]:
response = requests.get(topicsURL)

In [5]:
# CHECK STATUS CODE OF RESPONSE
response.status_code

200

#### check for HTTP response status code in the google and find out which response class has beeen successfully completed.
- Informational responses (100–199)
- Successful responses (200–299)
- Redirects (300–399)
- Client errors (400–499)
- Server errors (500–599)

### Before printing the content of the web as text, check the size of the text

In [6]:
len(response.text)

141561

- The number indicates 141496 number of characters

In [7]:
# can see the full details of text
# Use response.text
# instead use limited data to see as below
pageContents = response.text
pageContents[:100]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="d'

In [8]:
pageContents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-g92Fp85+/vMzJIfgnpKFYqJ8VrYPHFHaqHRTgF8llOI+TnY0Ey75gxhQtNyXtv67Zv3ub5HufJFDJpvPPMO4UQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-83dd85a7ce7efef3332487e09e928562.css" /><link crossorigin="anonymous" media="all" integrity="sha512-KWZpEQ6YJCe1nP9i4+VQwGbQNoPXXgeXQpeA9oeM2AGGBQvAjvL+pCHVV7HF8P29jlbcEcDuA7/ka9iM

- the content is from the web page of where we are looking for
- its a code return in html language
-it can be saved with html format

In [9]:
with open('webpage.html', 'w', encoding="utf-8") as f:
         f.write(pageContents)

- from the above request, it will give a webpage.html file which saves at same location along with local system address even when you try to open you can see the same.
- Once you open the webpage, you can see lot of content in the page and that can be grab using the library called Beautifulsoap

## 3. Parse and explore the structure of downloaded web pages using Beautiful soup.

In [10]:
! pip install beautifulsoup4 --upgrade --quiet

You should consider upgrading via the 'c:\users\davea\anaconda3\python.exe -m pip install --upgrade pip' command.


In [11]:
from bs4 import BeautifulSoup

- beautifulsoup4 is a package which we install
- bs4 is a module available for beautifulsoup
- beautifulsoup is a class which we import from the module.
- html, json ..etc data/text can be parsed using BeautifulSoup library.

In [12]:
docParsed = BeautifulSoup(pageContents , 'html.parser')
type(docParsed)

bs4.BeautifulSoup

- docParsed is a beautifulsoup Object which contains html content in parsed format

### Now we can find the content in the webpage by doing some queries

- look for what to extract from the page, there right cick and select an option name is "inspect", there seperate window is opened and find developers tool spin to position the tools at bottom/top/right/left sides.
- Observe the highlighted row in html page on html tag, understand that what you have selected on pages will be highlighted on html tag in the html page.
- further to understand the tags belongs to which part of the webpage characters, filter the  tags with specified name like <p>
- html tag will have many attributes with tag class, so copy the same for filtering the same.

In [13]:
pTags = docParsed.find_all("p")

In [14]:
pTags

[<p class="f4 color-text-secondary col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Nim
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Nim is a statically typed, compiled, garbage-collected systems programming language.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         PICO-8
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">PICO-8 is a fantasy console for making, sharing and playing tiny games and other computer programs in Lua.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Vue.js
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Vue.js is a JavaScript framework for building interactive web applications.</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing 

In [15]:
len(pTags)

67

- to accurately grap the tags which are required from the html page, we need to filter with suitable attribute in the tags
- 67 tags may be more than what we suppose to extract from the html page, so specifically need to query for the required attribute.
- to undertand this, extract top 5 p tags

In [16]:
pTags[:5]

[<p class="f4 color-text-secondary col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Nim
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Nim is a statically typed, compiled, garbage-collected systems programming language.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         PICO-8
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">PICO-8 is a fantasy console for making, sharing and playing tiny games and other computer programs in Lua.</p>]

- Django is a repository which we planned to extract, like wise same p tags need to be extracted.

In [17]:
pTags[:2]

[<p class="f4 color-text-secondary col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Nim
       </p>]

- So, this p class tags is required to extract the top github repositories

In [18]:
pTags = docParsed.find_all('p', {'class' : "f3 lh-condensed mb-0 mt-1 Link--primary"})

In [19]:
pTags[:3]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>]

In [20]:
len(pTags)

30

- otherway of doing is let us create a variable out of this p tags for simple query

In [21]:
selectionPClass = "f3 lh-condensed mb-0 mt-1 Link--primary"
pTags = docParsed.find_all('p', {'class' : selectionPClass})

In [22]:
pTags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

- Another way of doing the same is with class_ function

In [23]:
selectionTopicTitle = "f3 lh-condensed mb-0 mt-1 Link--primary"
topicTitleTags = docParsed.find_all('p', class_ = selectionTopicTitle)

In [24]:
topicTitleTags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

- The above html tags are "topic titles" from the gibhut webpages

## then find the title discription of top topics

In [25]:
selectionTopicDiscreption = "f5 color-text-secondary mb-0 mt-1"
topicDiscreptionTags = docParsed.find_all('p', {'class' : selectionTopicDiscreption})

In [26]:
len(topicDiscreptionTags)

30

In [27]:
topicDiscreptionTags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

- P class tag is inside of Div class tag, which is inside of "a" tag, which is having href, which is a parent tag of html tag of topic

In [28]:
topicTitleTags0 = topicTitleTags[0]

In [29]:
topicTitleTags0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [30]:
topicTitleTags0.parent

<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>

- as above mentioned, .parent function will give title and discreption of tag.
- .parnt.parent gives the url (url link) details of the tag.

In [31]:
topicTitleTags0.parent.parent

<div class="d-sm-flex flex-auto">
<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
<div class="d-inline-block js-toggler-container starring-container">
<a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
        action:topics#index;
        text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
<svg aria-hidden="true" class="octicon octicon-star mr-1" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1

- .parent.parent gives the url (url link) details of the tag.but unfortunately which has not given the "a href tag", so check for document to retrive the data of a tag.
- So, thru documentation, find for the same or 
- Look for a tag attribute from the web page to extract directly.

In [32]:
selectionTopicLink = "d-flex no-underline"
topicLinkTags = docParsed.find_all('a', {'class' : selectionTopicLink})

In [33]:
len(topicLinkTags)

30

In [34]:
topicLinkTags[0]

<a class="d-flex no-underline" data-ga-click="Explore, go to 3d, location:All featured topics" href="/topics/3d">
<div class="color-bg-info f4 color-text-tertiary text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
<div class="d-sm-flex flex-auto">
<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
<div class="d-inline-block js-toggler-container starring-container">
<a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
        action:topics#index;
        text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
<svg aria-hidden="true" class="octicon octicon-star

- here we extracted link to tags from the html page.
- we are interested in href tags which will have topic title.

In [35]:
# sample check 
topicLinkTags[0]['href']

'/topics/3d'

In [36]:
# sample check
topicZeroUrl = "https://github.com"+topicLinkTags[0]['href']
print(topicZeroUrl)

https://github.com/topics/3d


- successfully generated the link to extract the details of webpage
- we have constructed url to the page

In [37]:
topicTitleTags[0].text

'3D'

In [38]:
topicTitles  = []
for titlesTag in topicTitleTags:
    topicTitles.append(titlesTag.text)
    
print(topicTitles)    

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [39]:
topicDiscreptions  = []
for disrepitonsTag in topicDiscreptionTags:
    topicDiscreptions.append(disrepitonsTag.text)
    
print(topicDiscreptions)

['\n              3D modeling is the process of virtually developing the surface and structure of a 3D object.\n            ', '\n              Ajax is a technique for creating interactive web applications.\n            ', '\n              Algorithms are self-contained sequences that carry out a variety of tasks.\n            ', '\n              Amp is a non-blocking concurrency framework for PHP.\n            ', '\n              Android is an operating system built by Google designed for mobile devices.\n            ', '\n              Angular is an open source web application platform.\n            ', '\n              Ansible is a simple and powerful automation engine.\n            ', '\n              An API (Application Programming Interface) is a collection of protocols and subroutines for building software.\n            ', '\n              Arduino is an open source hardware and software company and maker community.\n            ', '\n              ASP.NET is a web framework for bu

- from the above, we are successful in construction the url to webpage and then generating topic tiles and its descreptions.
- descreption output looks like space at bginning and end of the sentence, which can be done by using function called .strip()

In [40]:
topicDiscreptions  = []
for discrepitonsTag in topicDiscreptionTags:
    topicDiscreptions.append(discrepitonsTag.text.strip())
    
print(topicDiscreptions[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [41]:
# now topic urllink 
topicUrlLinksTag  = []

for urlLinksTag in topicLinkTags:
    topicUrlLinksTag.append(urlLinksTag['href'])
    
print(topicUrlLinksTag[:5])

['/topics/3d', '/topics/ajax', '/topics/algorithm', '/topics/amphp', '/topics/android']


In [42]:
topicUrlLinksTag[:5]

['/topics/3d',
 '/topics/ajax',
 '/topics/algorithm',
 '/topics/amphp',
 '/topics/android']

In [43]:
    # now topic urllink 
topicUrlLinksTag  = []
baseUrl = 'https://github.com'

for urlLinksTag in topicLinkTags:
        topicUrlLinksTag.append(baseUrl + urlLinksTag['href'])
    
topicUrlLinksTag[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [44]:
print(topicUrlLinksTag[:5])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android']


In [45]:
import pandas as pd

In [46]:
topicDictionary = {
    'Topic_title' : topicTitles,
    'Topic_descreption' : topicDiscreptions,
    'Topic_Url_link' : topicUrlLinksTag
    }

In [47]:
topicDictionary

{'Topic_title': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++'],
 'Topic_descreption': ['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
  'Ajax is a technique for creating interactive web applications.',
  'Algorithms are self-contained sequences that carry out a variety of tasks.',
  'Amp is a non-blocking concurrency framework for PHP.',
  'Android is an operating system built by Google designed for mobile devices.',
  'Angular is an open source web application platform.',
  'Ansible is a simple and powerful automation engine.',
  'An API (Application Programming Interfac

In [48]:
topicDataFrame = pd.DataFrame(topicDictionary)

In [49]:
topicDataFrame

Unnamed: 0,Topic_title,Topic_descreption,Topic_Url_link
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [50]:
topicDataFrame.to_csv('TopicdetailsfromGithubrepositories.csv')

In [51]:
pwd

'D:\\Webscrapping using python'

In [52]:
# to remove the index numbers in csv 
topicDataFrame.to_csv('TopicdetailsfromGithubrepositories1.csv', index=None)

In [53]:
pwd

'D:\\Webscrapping using python'

- Upto here we have extracted the data of what we have planned for.

## Getting the information out of a topic page
### understand the text from a topic

In [54]:
topicPageUrl = topicUrlLinksTag[0]

In [55]:
topicPageUrl

'https://github.com/topics/3d'

- This gives the page of 1st topic and which has bunch of repositories.

#### Repeat the same process which we have done for title, discription and url link for the information from the topic

In [56]:
responsePage = requests.get(topicPageUrl)

In [57]:
responsePage.status_code

200

- 200 status code is successful resposes, which is satisfactory.

In [58]:
len(responsePage.text)

632061

- length of the above response is having 632726 characters in the topic.
- LOOKS PREETY LARGE, AND REQUIRED TO PRINT ALL NOW

In [59]:
responsePage

<Response [200]>

In [60]:
topicDocParsed = BeautifulSoup(responsePage.text , 'html.parser')

In [61]:
type(topicDocParsed)

bs4.BeautifulSoup

In [62]:
topicDocParsed


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-83dd85a7ce7efef3332487e09e928562.css" integrity="sha512-g92Fp85+/vMzJIfgnpKFYqJ8VrYPHFHaqHRTgF8llOI+TnY0Ey75gxhQtNyXtv67Zv3ub5HufJFDJpvPPMO4UQ==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-296669110e982427b59cff62e3e550c0.css" integrity="sha512-KWZpEQ6YJCe1nP9i4+VQwGbQNoPXXgeXQpe

- Get the information like Usename, repository link and stars

- Open the 1st topic page and look for 1st repository with username, right click over there and find html page with "a" tage in the html page.
- Understand "a"tage line of code in the html page for related attributes for filtering with requesting query.

In [63]:
# usernameTags = topicDocParsed.find_all('a',  {'class' : "height:auto;" alt="Avatar" width="260" height="260" class="avatar avatar-user width-full border color-bg-primary" src="https://avatars.githubusercontent.com/u/97088?v=4})

- the above code line for attribute in "a"tag is very difficult to find directly.
- this is part of h3 class tag which is having two "a" tags which we will try to extract for username attribute.

- So grab the tag "h3 class" and extract from there to find the attribute of topic repository.

In [64]:
# repository tags
h3SelectionClass = "f3 color-text-secondary text-normal lh-condensed"
repoTags = topicDocParsed.find_all('h3', {'class' : h3SelectionClass })

In [65]:
len(repoTags)

30

-- Under h3 tag, there are two "a"tags , first one contains Username tag and 2nd one is repository name and link

In [66]:
repoTags[0]

<h3 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d

In [67]:
aTags = repoTags[0].find_all('a')

In [68]:
aTags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [69]:
aTags[0] # gives an information of Username--mrdoob

<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [70]:
aTags[0].text

'\n            mrdoob\n'

In [71]:
aTags[0].text.strip()

'mrdoob'

In [72]:
aTags[1] # gives an information of repository

<a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="true" href="/mrdoob/three.js">
            three.js
</a>

In [73]:
aTags[1].text.strip()

'three.js'

In [74]:
aTags[1]['href']

'/mrdoob/three.js'

In [75]:
baseUrl

'https://github.com'

In [76]:
repoUrl = baseUrl + aTags[1]['href']

In [77]:
repoUrl

'https://github.com/mrdoob/three.js'

In [78]:
print(repoUrl)

https://github.com/mrdoob/three.js


- So for we got Username, repo name and repo url
- then look for stars to filter top repositories in topic at gibhut.com
- right click on stars and find inspect tounderstand the highlighted html tags in html web page.
- find "a" tag with class..

In [79]:
starTags = topicDocParsed.find_all('a' , {'class' : "social-count float-none"})

In [80]:
len(starTags)

30

In [81]:
starTags[0]

<a class="social-count float-none" data-ga-click="Explore, go to repository stargazers, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"STARGAZERS","click_visual_representation":"STARGAZERS_NUMBER","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4f3c0fb1ad4e5a9f72ed698531bf27b302fcc5846d9458e33ceeeeb05888b64c" data-view-component="true" href="/mrdoob/three.js/stargazers">
          74.6k
</a>

In [82]:
starTags[0].text

'\n          74.6k\n'

In [83]:
starTags[0].text.strip()

'74.6k'

- Parse the string of th star count

In [84]:
starsString = '74.6k'

In [85]:
starsString[-1]

'k'

In [86]:
starsString[:-1]

'74.6'

In [87]:
float(starsString[:-1])

74.6

In [88]:
float(starsString[:-1])*1000

74600.0

In [89]:
int(float(starsString[:-1])*1000)

74600

- if "k" is not there in the starts count then

In [90]:
starsString = '74.6'

In [91]:
int(float(starsString[:])*1000)

74600

- Now write the function for stars in numbers using the above steps

In [92]:
def parseStarCount(starsString):
    starsString = starsString.strip()
    if starsString[-1] == 'k':
        return int(float(starsString[:-1])*1000)
    return int(float(starsString[:])*1000)    

In [93]:
parseStarCount(starsString)

74600

In [94]:
def getRepoInfo (repoTags, starTags):
    # returns User name, repository name, Url link and stars count
    aTags = repoTags.find_all('a')
    username = aTags[0].text.strip()
    reponame = aTags[1].text.strip()
    repoUrl = baseUrl + aTags[1]['href']
    stars = parseStarCount(starsString)
    return username, reponame, repoUrl, stars

In [95]:
getRepoInfo(repoTags[0], starTags[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 74600)

- the above is for 1st one in the repository

In [96]:
len(repoTags)

30

In [97]:
range(len(repoTags))

range(0, 30)

In [98]:
repoDictionary = {
    'username' : [],
    'reponame' : [],
    'repoUrl' : [],
    'stars' : []
}

for i in range(len(repoTags)):

    repoInfo = getRepoInfo(repoTags[i], starTags[i])
    repoDictionary['username'].append(repoInfo[0])
    repoDictionary['reponame'].append(repoInfo[1])
    repoDictionary['repoUrl'].append(repoInfo[2])
    repoDictionary['stars'].append(repoInfo[3])


In [99]:
repoDictionary

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'spritejs',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'domlysz',
  'openscad',
  'ssloy',
  'mosra',
  'blender',
  'google',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'rg3dengine',
  'antvis',
  'cnr-isti-vclab'],
 'reponame': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'BlenderGIS',
  'openscad',
  'tinyraytracer',
  'magnum',
  'blender',
  'model-viewer',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'rg3d',
  'L7',
  'meshlab'],
 'repoUrl': ['https://github.com/mrdoob/three.js',
  'http

In [100]:
# data into dataframe
repoDictionaryDF = pd.DataFrame(repoDictionary)

In [101]:
repoDictionaryDF

Unnamed: 0,username,reponame,repoUrl,stars
0,mrdoob,three.js,https://github.com/mrdoob/three.js,74600
1,libgdx,libgdx,https://github.com/libgdx/libgdx,74600
2,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,74600
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,74600
4,aframevr,aframe,https://github.com/aframevr/aframe,74600
5,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,74600
6,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,74600
7,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,74600
8,metafizzy,zdog,https://github.com/metafizzy/zdog,74600
9,CesiumGS,cesium,https://github.com/CesiumGS/cesium,74600


### SO FAR we have extracted the data for one topic repository from GITHUB, And now, defining the function to extract for many topic repositories details.


In [102]:
topicUrlLinksTag

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [103]:
topicUrlLinksTag[4]

'https://github.com/topics/android'

In [107]:
def getTopicPages(topicPageUrl):
    
    #Down load the page
    responsePage = requests.get(topicPageUrl)
    # Check successful response
    if response.status_code !=200:
        raise exception("Failed to load the page".format(topicPageUrl))
        # Parse unsing beautiful soup
        topicPageUrl = BeautifulSoup(responsePage.text , 'html.parser')
        return topicPageUrl

def getRepoInfo (repoTags, starTags):
    
    # returns User name, repository name, Url link and stars count
    aTags = repoTags.find_all('a')
    username = aTags[0].text.strip()
    reponame = aTags[1].text.strip()
    repoUrl = baseUrl + aTags[1]['href']
    stars = parseStarCount(starsString)
    return username, reponame, repoUrl, stars

def getTopicRepos(topicPageUrl):
    
        # Get h3 tag containg Username, reponame and repoUrl
        h3SelectionClass = "f3 color-text-secondary text-normal lh-condensed"
        repoTags = topicPageUrl.find_all('h3', {'class' : h3SelectionClass} )
                       
        # Get star tag for star counting
        starTags = topicPageUrl.find_all('a' , {'class' : "social-count float-none"})
        
        # Get the repo information
        repoDictionary = {
            'username' : [],
            'reponame' : [],
            'repoUrl' : [],
            'stars' : []
            }


        for i in range(len(repoTags)):
            repoInfo = getRepoInfo(repoTags[i], starTags[i])
            repoDictionary['username'].append(repoInfo[0])
            repoDictionary['reponame'].append(repoInfo[1])
            repoDictionary['repoUrl'].append(repoInfo[2])
            repoDictionary['stars'].append(repoInfo[3])
            
        return pd.DataFrame(repoDictionary)

In [108]:
getTopicRepos

<function __main__.getTopicRepos(topicPageUrl)>

In [109]:
getTopicRepos(topicPageUrl)

AttributeError: 'str' object has no attribute 'find_all'

In [110]:
# topicUrlLinksTag

In [None]:
url4 = topicUrlLinksTag[4]

In [None]:
url4

In [None]:
topic4Doc = getTopicPages(url4)

In [None]:
topic4Doc

In [None]:
topic4Repos = getTopicRepos(topic4Doc)

In [None]:
topic4Repos