<a href="https://colab.research.google.com/github/AnuragBalasahebChumble/PortfolioProjects/blob/main/WebScrappingFromGithubPage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping GitHub topic repositories



## Pick a website and describe the objective.
- Pick the website from which we want to scrape information/data.
- Identify the information that we want to scrape.

### Project approach
1. We are going to scrape the page here:
https://github.com/topics

2. We'll get the list of topics. Then, for each topic, we'll get the i) topic title, ii) topic page URL, iii) topic description.

3. For each topic, we'll try to get the top 25 repositories from the topic page.

4. For each repository we'll fetch the i)repo name, ii)username, iii)stars, iv)repo URL

5. For each topic, we'll create a CSV file in the following format.
```
RepoName, UserName, Stars, RepoURL
```

## Use the requests library to download the webpage.

In [None]:
!pip install requests --quiet

In [None]:
import requests

In [None]:
topics_url = 'https://github.com/topics'

In [None]:
response = requests.get(topics_url)

In [None]:
type(response)

requests.models.Response

In [None]:
response.status_code

200

### HTTP status code classes:
Let's check the status code: a value between 200 and 299 means the request succeeded.

1. Informational responses (100–199)
2. Successful responses (200–299)
3. Redirection messages (300–399)
4. Client error responses (400–499)
5. Server error responses (500–599)

```So, our request was a success.```
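
The five classes listed above depend only on the hundreds digit of the code. Below is a minimal sketch of that mapping; `status_class` is a hypothetical helper written for illustration and is not part of `requests`.

```python
def status_class(code: int) -> str:
    """Return the name of the HTTP status class that `code` falls into."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # The hundreds digit alone determines the class.
    return classes[code // 100]


print(status_class(200))  # success
print(status_class(404))  # client error
```

In practice, `requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses, so an explicit range check is often unnecessary.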

In [None]:
len(response.text)

142102

```
- The output tells us that the page contains 142,102 characters.
```


In [None]:
page_contents = response.text

In [None]:
page_contents[:300]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubu'

```
This is what the web page content looks like. It is HTML code.
```

## Use Beautiful Soup to parse and extract information. 

In [None]:
!pip install beautifulsoup4 --quiet 

In [None]:
from bs4 import BeautifulSoup

```
The module name is bs4, and from it we import the BeautifulSoup class.
```

In [None]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [None]:
type(doc)

bs4.BeautifulSoup

```
- The doc variable contains all the HTML code in parsed form. It is a BeautifulSoup object.
- Now we can find things by querying this BeautifulSoup object (the doc variable).
```

### Let's fetch the topic titles from the page.

In [None]:

topic_title_tags = doc.find_all('p', {'class':"f3 lh-condensed mb-0 mt-1 Link--primary"})

In [None]:
len(topic_title_tags)

30

```
We have 30 topics inside the page. 
```

In [None]:
topic_title_tags[:3]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>]

```
- Points to note:
1. The doc.find_all method returns a list-like object (a bs4 ResultSet).
2. This object contains all the p tags with the required class.
3. The topic names are the text inside these p tags.
```

### Now, let's fetch the "topic description tags".  

In [None]:
topic_description_tags = doc.find_all('p', {'class':"f5 color-fg-muted mb-0 mt-1"})

In [None]:
len(topic_description_tags)

30

```
We have 30 descriptions. Remember that there are 30 topics, so we now have a description for each of them.
```

In [None]:
topic_description_tags[:1]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>]

### Let's fetch the topic URL/link

In [None]:
topic_link_tags0 = topic_title_tags[0].parent

```
- We are using the parent-child relationship between tags to get the required tag.
- By indexing into the list (topic_title_tags) and reading .parent, we reach the enclosing a tag.
```

In [None]:
topic_link_tags0

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [None]:
type(topic_link_tags0)

bs4.element.Tag

```
We can read the attributes of a 'bs4.element.Tag' object as if it were a dictionary.
```

In [None]:
topic_link_tags0['href']

'/topics/3d'

```
Let's build the URL for the first topic on the web page.
```

In [None]:
topic0_url = "https://github.com" + topic_link_tags0["href"]

In [None]:
topic0_url

'https://github.com/topics/3d'

```
We are able to build the URL of the GitHub topic page for the topic "3D".
```
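
String concatenation works here because every `href` starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against a base URL and leaves already-absolute links unchanged. `absolute_url` and `BASE_URL` below are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_url(href: str) -> str:
    """Resolve a link taken from the page against the site's base URL."""
    return urljoin(BASE_URL, href)


print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```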

```
Let's make a list of topic titles.
```

In [None]:
topic_titles = []
for tag in topic_title_tags:
  topic_titles.append(tag.text)
topic_titles[:7]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible']

```
Similarly, let's make a list of topic descriptions.
```

In [None]:
topic_descriptions = []
for desc in topic_description_tags:
  topic_descriptions.append(desc.text.strip())
topic_descriptions[:3]


['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

```
Finally, let's build a list of topic URLs.
```

In [None]:
topic_title_tags[0].parent['href']

'/topics/3d'

In [None]:
topic_urls = []
for link in topic_title_tags:
  url = "https://github.com" + link.parent['href']
  topic_urls.append(url)
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

```
- Let's create a CSV file from these lists: 1) topic titles, 2) topic descriptions, 3) topic URLs.
- We can use a pandas DataFrame to create the CSV file.
```

In [None]:
!pip install pandas --quiet

In [None]:
import pandas as pd


In [None]:
topics_dict = {
    "title" : topic_titles,
    "description" : topic_descriptions,
    "URL" : topic_urls
}

In [None]:
topics_df = pd.DataFrame(topics_dict)

In [None]:
topics_df[:7]

|   | title | description | URL |
|---|-------|-------------|-----|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |


## Create CSV file with extracted information.

In [None]:
topics_df.to_csv('topics.csv', index=False)  # 'topics.csv' is the output file name; index=False leaves out the row index
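
To make the resulting file layout concrete, here is a small sketch that builds the same kind of CSV text from parallel lists using only the standard library's `csv` module. This is a swapped-in technique for illustration, not what the notebook runs; `rows_to_csv` is a hypothetical helper, and the sample rows are the first two topics shown above.

```python
import csv
import io


def rows_to_csv(header, *columns):
    """Write parallel lists as CSV text, one list per column."""
    # Parallel lists must have equal lengths, otherwise rows would be misaligned.
    if len({len(col) for col in columns}) > 1:
        raise ValueError("all columns must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(zip(*columns))  # zip turns columns into rows
    return buffer.getvalue()


titles = ["3D", "Ajax"]
descriptions = [
    "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
    "Ajax is a technique for creating interactive web applications.",
]
urls = ["https://github.com/topics/3d", "https://github.com/topics/ajax"]

csv_text = rows_to_csv(["title", "description", "URL"], titles, descriptions, urls)
print(csv_text)
```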

```
- Next, we'll visit each topic page. Remember that we have 30 topics.
- From each topic page, we'll fetch information about its repositories.
```

## Getting information out of a topic page.

In [None]:
topic_page_url = topic_urls[0]

In [None]:
topic_page_url

'https://github.com/topics/3d'

```
We'll repeat the same process to download and parse this web page.
```

In [None]:
response = requests.get(topic_page_url)

```
Let's check whether the request was successful using the status code.
```

In [None]:
response.status_code

200

In [None]:
len(response.text)

644877

```
Let's parse it using Beautiful Soup to extract useful information from it.
```

In [None]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

```
We want to fetch the following information from the page:
i)UserName
ii)RepositoryName
iii)URL
iv)StarCount
```

In [None]:
len(topic_doc)

49

In [None]:
type(topic_doc)

bs4.BeautifulSoup

In [None]:
repo_tags = topic_doc.find_all('h3', {'class':"f3 color-fg-muted text-normal lh-condensed"})

In [None]:
len(repo_tags)

30

```
- There are 30 repositories on the page.
- Each tag in repo_tags contains the username, the repository name, and the link to the repository (URL).
```

In [None]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac

In [None]:
a_tags = repo_tags[0].find_all('a')

In [None]:
a_tags[0].text.strip()

'mrdoob'

```
This gives the username.
```

In [None]:
a_tags[1].text.strip()

'three.js'

```
This gives the repository name.
```

In [None]:
base_url = "https://github.com"
repo_url = base_url + a_tags[1]['href']

In [None]:
repo_url

'https://github.com/mrdoob/three.js'

```
- We have been able to construct the URL of the first repository listed on the 3D topic page.
```

In [None]:
star_tags = topic_doc.find_all('span', {'id':"repo-stars-counter-star"})

In [None]:
len(star_tags)

30

```
The length is 30, which matches the number of repositories, so we are good to go.
```

In [None]:
star_tags[29].text

'3k'

```
Let's write a function to convert the star count into an int.
```

In [None]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()  # strip whitespace so the last character is the 'k' suffix, if present
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)
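
The function above covers the formats seen on this page. As a hedged sketch, the variant below also tolerates thousands separators and an `m` suffix, and uses `Decimal` so the multiplication is exact rather than subject to floating-point rounding. Whether GitHub ever renders counts as `1,234` or `1.2m` is an assumption here, and `parse_star_count_exact` is an illustrative name.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, only 'k' appears in the scraped page.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_star_count_exact(stars_str: str) -> int:
    """Convert strings such as '83.1k', '3k' or '950' to an int."""
    text = stars_str.strip().lower().replace(",", "")
    if not text:
        raise ValueError("empty star count")
    multiplier = _SUFFIXES.get(text[-1], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before converting
    # Decimal keeps '83.1' exact, so 83.1 * 1000 is exactly 83100.
    return int(Decimal(text) * multiplier)


print(parse_star_count_exact("83.1k"))
print(parse_star_count_exact(" 3k "))
```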

In [None]:
print(parse_star_count(star_tags[0].text))

83100


```
Let's define a function "get_repo_info()"
```

In [None]:
def get_repo_info(h3_tags, star_tags):
  a_tags = h3_tags.find_all('a')  
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tags.text)
  return [username, repo_name, stars, repo_url]


In [None]:
print(get_repo_info(repo_tags[0], star_tags[0]))

['mrdoob', 'three.js', 83100, 'https://github.com/mrdoob/three.js']


```
- We have programmatically extracted the details of the first repository on the 3D topic page.
- We can use a for loop to get the same details for every repository on the 3D topic page.
```

In [None]:
topic_repos_dict = {
    'username' : [],
    'repo_name': [],
    'stars' : [],
    'url' : []
}
for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i], star_tags[i])
  topic_repos_dict['username'].append(repo_info[0])
  topic_repos_dict['repo_name'].append(repo_info[1])
  topic_repos_dict['stars'].append(repo_info[2])
  topic_repos_dict['url'].append(repo_info[3])
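
The loop above appends each field by position. A minimal sketch of the same row-to-column conversion, written with `zip` so that column names and values stay paired, might look like this. `build_columns` is a hypothetical helper, and the sample rows are taken from the outputs shown in this notebook.

```python
def build_columns(rows, columns):
    """Turn a list of row lists into a dict mapping column name to a list of values."""
    table = {name: [] for name in columns}
    for row in rows:
        # A row with the wrong number of fields would silently misalign the columns.
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
        for name, value in zip(columns, row):
            table[name].append(value)
    return table


rows = [
    ["mrdoob", "three.js", 83100, "https://github.com/mrdoob/three.js"],
    ["libgdx", "libgdx", 20100, "https://github.com/libgdx/libgdx"],
]
table = build_columns(rows, ["username", "repo_name", "stars", "url"])
print(table["repo_name"])
print(table["stars"])
```

The resulting dictionary has the same shape as `topic_repos_dict`, so it can be passed to `pd.DataFrame` in the same way.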


In [None]:
topic_repos_dict

```
Let's convert the dictionary into a DataFrame.
```

In [None]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [None]:
topic_repos_df[:3]

|   | username | repo_name | stars | url |
|---|----------|-----------|-------|-----|
| 0 | mrdoob | three.js | 83100 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 20100 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 18500 | https://github.com/pmndrs/react-three-fiber |


```
- The DataFrame above holds the repositories for the topic 3D, which is one of the topics on the primary topics page.

- https://github.com/topics/3d is the page URL that we parsed using BeautifulSoup.

- As noted earlier, there are 30 topics on the primary topics page.
- Let's see how we can get the information from the remaining 29 topic pages.
```

## We'll write a function get_topic_repos() to fetch the information from the topic pages.

In [None]:
def get_topic_page(topic_page_url):
  # Download the page
  response = requests.get(topic_page_url)
  # Check successful response 
  if response.status_code != 200:
    raise Exception('failed to load page {}'.format(topic_page_url))
  # let's parse the page using BeautifulSoup
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  return topic_doc


def get_repo_info(repo_tags, star_tags):
  # return all the required information for the particular/single repo.
  a_tags = repo_tags.find_all('a')  
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tags.text)
  return [username, repo_name, stars, repo_url]


def get_topic_repos(topic_doc):
  # Get the repo_tags.
  # repo_tags contains i) username, ii) repository name, iii) link to repository (URL)
  repo_tags = topic_doc.find_all('h3', {'class':"f3 color-fg-muted text-normal lh-condensed"})
  # Get the star_tags
  star_tags = topic_doc.find_all('span', {'id':"repo-stars-counter-star"})
  
  # create dict object and then create DataFrame from it.
  topic_repos_dict = {
    'username' : [],
    'repo_name': [],
    'stars' : [],
    'url' : []
  }
  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['url'].append(repo_info[3])
  return pd.DataFrame(topic_repos_dict)




```
Let's check that it works.
```

In [None]:
url4 = topic_urls[4] 

In [None]:
url4

'https://github.com/topics/android'

In [None]:
topic_doc4 = get_topic_page(url4)

In [None]:
len(topic_doc4)

49

In [None]:
df4 = get_topic_repos(topic_doc4)

In [None]:
df4[:3]

|   | username | repo_name | stars | url |
|---|----------|-----------|-------|-----|
| 0 | flutter | flutter | 142000 | https://github.com/flutter/flutter |
| 1 | justjavac | free-programming-books-zh_CN | 93600 | https://github.com/justjavac/free-programming-... |
| 2 | Genymobile | scrcpy | 66800 | https://github.com/Genymobile/scrcpy |


```
We can also write this in one line.
```

In [None]:
topic_urls[3]

'https://github.com/topics/amphp'

In [None]:
get_topic_repos(get_topic_page(topic_urls[3]))

```
- Above, we have created a DataFrame for the URL at index 3 (the fourth URL) in the list.
```

```
Let's also write it to a CSV file.
```

In [None]:
get_topic_repos(get_topic_page(topic_urls[3])).to_csv('amphp.csv', index=False)

```
A CSV file named amphp.csv is created in the current working directory.
```
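
To repeat this for every topic without typing each file name by hand, the file name can be derived from the topic URL. The sketch below shows one way to do that with the standard library; `topic_csv_name` is a hypothetical helper and the URL list is a shortened sample of `topic_urls`.

```python
from urllib.parse import urlparse


def topic_csv_name(topic_url: str) -> str:
    """Derive a CSV file name from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip("/").rsplit("/", 1)[-1]
    if not slug:
        raise ValueError(f"no topic name found in URL: {topic_url}")
    return f"{slug}.csv"


sample_topic_urls = [
    "https://github.com/topics/3d",
    "https://github.com/topics/ajax",
    "https://github.com/topics/amphp",
]
for url in sample_topic_urls:
    print(url, "->", topic_csv_name(url))
```

Inside a loop over `topic_urls`, a call such as `get_topic_repos(get_topic_page(url)).to_csv(topic_csv_name(url))` would then write one file per topic.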