# GitHub Top Repository Topics
This project scrapes the GitHub website and extracts information that helps answer analytical questions, such as identifying the top 10-20 topics, repositories, and URLs. The same techniques can be used to extract further information.

## Project objectives

- Browse through the sites to scrape.
- Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site.
- Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.







## Use the requests library to download web pages




In [1]:
import requests

In [2]:
topics_url = 'https://github.com/topics'

In [3]:
response = requests.get(topics_url)

In [4]:
response.status_code

200

In [5]:
len(response.text)

154175

In [28]:
page_contents = response.text  # the page's HTML source as a string
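
The cells above assume the request succeeds. A minimal sketch of a more defensive download step follows. The helper name `download_page`, the 10-second timeout, and the injectable `get` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def download_page(url, timeout=10, get=requests.get):
    """Return the HTML text of `url`, raising an exception on failure.

    `download_page`, `timeout=10`, and the `get` parameter are hypothetical
    choices for this sketch; `get` exists only so the function can be
    exercised without network access.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the download step could be written as `page_contents = download_page(topics_url)`.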

## Use Beautiful Soup to parse and extract information



In [7]:
from bs4 import BeautifulSoup

In [8]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [9]:
type(doc)

bs4.BeautifulSoup

In [10]:
# class attribute shared by the <p> tags that hold the topic titles
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [11]:
# check the number of topic titles found on the first page

len(topic_title_tags)

30

In [12]:
# take a look at the first 5 topic title tags
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
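
The lookup above matches the full `class` string, so it may stop matching if GitHub reorders or adds classes. A minimal sketch of an alternative using a CSS selector is shown here, run against a small inline sample rather than the live page. The sample markup and the choice of `f3` and `Link--primary` as the identifying classes are assumptions.

```python
from bs4 import BeautifulSoup

# Small inline sample that mimics the markup shown above (an assumption:
# the live page may change its class names at any time).
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 mt-1">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">Not a title</p>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# A CSS selector matches on individual classes, regardless of their order
# or of any extra classes on the tag.
title_tags = sample_doc.select("p.f3.Link--primary")
titles = [tag.get_text(strip=True) for tag in title_tags]
print(titles)  # ['3D', 'Ajax']
```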

In [13]:
# select all the description tags
topic_desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})

In [14]:
#view first 2 desc tags
topic_desc_tags[:2]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>]

In [15]:
#select all topic urls
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})

In [16]:
len(topic_link_tags)

30

In [17]:
#get only the href of the link tag
topic_link_tags[0]['href']

'/topics/3d'

In [18]:
#get the full url
topic0_url = 'https://github.com' + topic_link_tags[0]['href']

In [19]:
print(topic0_url)

https://github.com/topics/3d
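
String concatenation works here because every `href` is a root-relative path. As a hedged alternative, the standard library's `urljoin` resolves an `href` against a base URL and also copes with links that are already absolute. The absolute example link below is made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# urljoin resolves a relative href against the base URL
topic0_url = urljoin(base_url, "/topics/3d")
print(topic0_url)  # https://github.com/topics/3d

# An href that is already absolute is returned unchanged
already_absolute = urljoin(base_url, "https://github.com/topics/ajax")
print(already_absolute)  # https://github.com/topics/ajax
```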


In [20]:
# build the list of topic titles
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles[:3])

['3D', 'Ajax', 'Algorithm']


In [21]:
# build the list of topic descriptions
topic_desc = []

for desc in topic_desc_tags:
    topic_desc.append(desc.text.strip())
    
print(topic_desc[:3])

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.']


In [22]:
# build the list of full topic URLs
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
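
The three extraction loops above can be gathered into a single function. The sketch below uses the same selectors as the notebook, but the function name `parse_topics`, the constant `BASE_URL`, and the inline sample markup are hypothetical and only approximate the live page.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"


def parse_topics(doc):
    """Return titles, descriptions, and URLs found in a parsed topics page."""
    title_tags = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
    desc_tags = doc.find_all("p", {"class": "f5 color-fg-muted mb-0 mt-1"})
    link_tags = doc.find_all("a", {"class": "no-underline flex-grow-0"})
    return {
        "Topic_titles": [tag.text.strip() for tag in title_tags],
        "Topic_descriptions": [tag.text.strip() for tag in desc_tags],
        "Topic_urls": [BASE_URL + tag["href"] for tag in link_tags],
    }


# Inline sample (an assumption) so the function can run without a download
sample_html = """
<a class="no-underline flex-grow-0" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to the use of three-dimensional graphics.
  </p>
</a>
"""

result = parse_topics(BeautifulSoup(sample_html, "html.parser"))
print(result)
```

Returning a dict keyed by column name means the result can be passed straight to `pd.DataFrame`.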

In [23]:
import pandas as pd 

In [24]:
# collect the 3 lists in a dict, one key per DataFrame column
topic_dict = {'Topic_titles': topic_titles, 
              'Topic_descriptions': topic_desc, 
              'Topic_urls': topic_urls}

In [25]:
topic_df = pd.DataFrame(topic_dict)

In [26]:
topic_df[:5]

| | Topic_titles | Topic_descriptions | Topic_urls |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
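
The three lists are paired by position, so each must hold exactly one entry per topic. A minimal sketch of a guard that fails early with a clear message when the lists are misaligned follows. The helper name `build_topic_frame` is an assumption, not part of the original notebook.

```python
import pandas as pd


def build_topic_frame(titles, descriptions, urls):
    """Build the topics DataFrame, failing early if the lists are misaligned."""
    lengths = (len(titles), len(descriptions), len(urls))
    if len(set(lengths)) != 1:
        # Report all three lengths so the mismatching selector is easy to spot
        raise ValueError(f"Column lengths differ (titles, descriptions, urls): {lengths}")
    return pd.DataFrame({
        "Topic_titles": titles,
        "Topic_descriptions": descriptions,
        "Topic_urls": urls,
    })
```

In the notebook this could replace the two cells above with `topic_df = build_topic_frame(topic_titles, topic_desc, topic_urls)`.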


## Create CSV file(s) with the extracted information

In [27]:
# save the DataFrame to a CSV file, leaving out the row index
topic_df.to_csv('Topics.csv', index=False)
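
To confirm that the file was written as intended, it can be read back with `pd.read_csv`. The sketch below does a round trip on a one-row sample frame in a temporary directory. The sample data and the temporary path are assumptions chosen so the example does not overwrite `Topics.csv`.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for topic_df (an assumption for this sketch)
sample_df = pd.DataFrame({
    "Topic_titles": ["3D"],
    "Topic_urls": ["https://github.com/topics/3d"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Topics.csv")
    # index=False keeps the row index out of the file
    sample_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))  # ['Topic_titles', 'Topic_urls']
```

If the index were written to the file, reading it back would produce an extra column named `Unnamed: 0`.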

## Document and share your work

- The dataset was saved as a CSV file for further analysis and dashboard visualisation.
- The dataset was shared with different stakeholders.