## TODO (Intro):

Introduction about web scraping
Introduction about GitHub and the problem statement
Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)

# Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

      Repo Name,Username,Stars,Repo URL
      three.js,mrdoob,69700,https://github.com/mrdoob/three.js
      libgdx,libgdx,18300,https://github.com/libgdx/libgdx
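
As a rough illustration of the target file format, the sketch below serializes the two sample rows with Python's built-in `csv` module. The helper name `rows_to_csv_text` is hypothetical, and the notebook itself relies on pandas for this step; this is only meant to make the expected layout concrete.

```python
import csv
import io

FIELDS = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv_text(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS)   # header row
    writer.writerows(rows)    # one line per repository
    return buffer.getvalue()

# Sample values copied from the format shown above
sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
csv_text = rows_to_csv_text(sample_rows)
print(csv_text)
```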

### Getting the First 30 topics alphabetically

## Use the requests library to download web pages

In [2]:
!pip install requests --quiet

In [3]:
import requests

In [18]:
topics_url='https://github.com/topics'

In [19]:
response = requests.get(topics_url)

In [20]:
response.status_code

200
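
A status code of 200 means the request succeeded. A minimal sketch of a download helper that fails loudly on any other outcome is shown below. The name `fetch_html`, the optional `session` parameter and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # Use the given session if there is one, otherwise the module-level API
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

In this notebook, a call such as `fetch_html(topics_url)` would stand in for the separate `requests.get` and `status_code` steps.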

In [21]:
page_contents = response.text
len(page_contents)

155303

In [76]:
page_contents[:10]

'\n\n<!DOCTYP'

In [23]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information


In [24]:
!pip install beautifulsoup4 --upgrade --quiet

In [25]:
from bs4 import BeautifulSoup

In [26]:
doc = BeautifulSoup(page_contents,'html.parser')
len(doc)

5

In [46]:
# class attribute shared by the topic title <p> tags
topic_title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [47]:
len(topic_title_tags)

30

In [48]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
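
Searching with the full class string depends on the attribute matching that exact string, including the order of the classes. As a hedged alternative, the sketch below uses Beautiful Soup's `select` with a CSS selector that names only two of the classes. The HTML is a hand-written stand-in modeled on the output above, and which classes are stable enough to rely on is an assumption.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the tags shown above (not fetched from GitHub)
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">Not a title</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# The CSS selector matches any <p> carrying both classes, in any order
title_tags = sample_doc.select('p.f3.Link--primary')
titles = [tag.get_text(strip=True) for tag in title_tags]
print(titles)
```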

In [32]:
desc_selector='f5 color-fg-muted mb-0 mt-1'
topic_desc_tags=doc.find_all('p',{'class':desc_selector})

In [33]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [37]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

In [42]:
topic0_url = 'https://github.com'+topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [50]:
topic_titles = []
for topic in topic_title_tags:
    topic_titles.append(topic.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [54]:
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [60]:
topic_urls=[]
base_url='https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url+tag['href'])
print(topic_urls[:5])

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android']
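
Building three separate lists works as long as every topic has a title, a description and a link, all in the same order. A possible alternative, sketched below, reads all three fields from each link tag so they cannot drift out of step. It assumes that each topic's title and description `<p>` tags are nested inside its `<a>` tag; the sample HTML is hand-written to reflect that assumption, and the helper name `parse_topic_cards` is hypothetical.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

# Hand-written sample; the nesting of <p> inside <a> is an assumption
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to the use of three-dimensional graphics.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""

def parse_topic_cards(doc):
    """Return one dict per topic link, with '' for a missing field."""
    topics = []
    for link in doc.select('a.no-underline.flex-1.d-flex.flex-column'):
        title_tag = link.select_one('p.f3')
        desc_tag = link.select_one('p.f5')
        topics.append({
            'title': title_tag.get_text(strip=True) if title_tag else '',
            'description': desc_tag.get_text(strip=True) if desc_tag else '',
            'url': BASE_URL + link['href'],
        })
    return topics

topics = parse_topic_cards(BeautifulSoup(sample_html, 'html.parser'))
print(topics)
```

A list of dicts like this can be passed directly to `pd.DataFrame`.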


In [63]:
!pip install pandas --quiet

In [64]:
import pandas as pd

In [66]:
topics_dict={
    'title':topic_titles,
    'description':topic_descs,
    'url':topic_urls
}

In [80]:
topics_df=pd.DataFrame(topics_dict)
topics_df.head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


## Create CSV file(s) with the extracted information

In [81]:
topics_df.to_csv('Topics.csv', index=False)
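
To illustrate why the index is left out of the file, the sketch below writes a small DataFrame with and without the index and reads each version back. Using `io.StringIO` instead of a file on disk is an assumption made to keep the example self-contained; the two rows are taken from the topics shown above.

```python
import io
import pandas as pd

sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Without the index, the CSV holds only the data columns
no_index = io.StringIO()
sample_df.to_csv(no_index, index=False)
no_index.seek(0)
columns_without_index = list(pd.read_csv(no_index).columns)

# With the index written, reading back yields an extra unnamed column
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

print(columns_without_index)
print(columns_with_index)
```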

## Getting information out of a topic page

In [82]:
topic_page_url = topic_urls[0]

In [83]:
topic_page_url

'https://github.com/topics/3d'

In [85]:
response = requests.get(topic_page_url)
response.status_code

200

In [86]:
len(response.text)

463451

In [87]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [91]:
repo_tags=topic_doc.find_all('h3',{'class':"f3 color-fg-muted text-normal lh-condensed"})
len(repo_tags)

20

In [93]:
a_tags= repo_tags[0].find_all('a')

In [95]:
a_tags[0].text.strip()

'mrdoob'

In [98]:
base_url='https://github.com'
repo_url=base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [100]:
star_tags=topic_doc.find_all('span',{'class':"Counter js-social-count"})
len(star_tags)
star_tags[:1]

[<span aria-label="92043 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="92,043">92k</span>]

In [101]:
star_tags[0].text.strip()

'92k'

In [102]:
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [103]:
parse_star_count(star_tags[0].text.strip())

92000
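
The abbreviated text loses precision: '92k' becomes 92000 even though the tag's `title` attribute shows 92,043. A hedged variant is sketched below that also accepts comma-separated values and an 'm' suffix. Whether GitHub ever renders an 'm' suffix is an assumption, and the name `parse_count` is hypothetical.

```python
SUFFIX_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_count(text):
    """Convert strings such as '92k', '1.5m' or '92,043' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    # Look up the last character; anything unrecognized means no suffix
    multiplier = SUFFIX_MULTIPLIERS.get(cleaned[-1:], 1)
    if multiplier != 1:
        cleaned = cleaned[:-1]
    # round() rather than int() as a precaution against float truncation
    return round(float(cleaned) * multiplier)

print(parse_count('92k'))      # abbreviated text of the tag
print(parse_count('92,043'))   # exact value as shown in the title attribute
```

In the notebook, `parse_count(star_tags[0]['title'])` would return the exact count, assuming every star tag carries a `title` attribute like the one shown above.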

In [106]:
def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [107]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 92000, 'https://github.com/mrdoob/three.js')

In [114]:
topic_repos_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
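
The four `append` calls can be collapsed by collecting one tuple per repository and transposing the result with `zip`. The sketch below shows the pattern on plain tuples standing in for the values returned by `get_repo_info`. The first row is the result shown earlier; the second is borrowed from the sample CSV purely as illustrative data.

```python
# Stand-ins for the tuples returned by get_repo_info (illustrative data)
repo_rows = [
    ('mrdoob', 'three.js', 92000, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]
column_names = ['username', 'repo_name', 'stars', 'repo_url']

# zip(*repo_rows) transposes rows into columns; the outer zip pairs
# each column with its name
repos_dict = {
    name: list(values)
    for name, values in zip(column_names, zip(*repo_rows))
}
print(repos_dict['stars'])
```

In the notebook, `repo_rows` could be built as `[get_repo_info(h, s) for h, s in zip(repo_tags, star_tags)]`, which also stops cleanly if the two tag lists differ in length.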
    

In [130]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_topic_repos(topic_doc):
    # Get the h3 tags containing the repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})
    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Collect the info for each repository
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

### Write a single function to:

1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic
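
One possible shape for such a function is sketched below. It is not the notebook's implementation, and it covers only steps 2 and 3, taking the result of step 1 as input. The parameters `get_page` and `get_repos` are hypothetical hooks (in this notebook they would be `get_topic_page` and `get_topic_repos`), `topics` is assumed to be an iterable of (title, URL) pairs, and the output directory and file naming are illustrative choices.

```python
import os

def scrape_topics(topics, get_page, get_repos, out_dir='data'):
    """Write one CSV of top repositories per topic; return the file paths.

    topics    -- iterable of (title, url) pairs
    get_page  -- callable: url -> parsed page
    get_repos -- callable: parsed page -> object with a to_csv method
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for title, url in topics:
        repos = get_repos(get_page(url))
        # Assumes the topic title is usable as a file name
        path = os.path.join(out_dir, '{}.csv'.format(title))
        repos.to_csv(path, index=False)
        written.append(path)
    return written
```

With the objects defined earlier, a call might look like `scrape_topics(zip(topic_titles, topic_urls), get_topic_page, get_topic_repos)`.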

In [None]:
def scrape_topics():
    pass  # TODO: combine the three steps listed above

In [125]:
url4=topic_urls[4]
url4

'https://github.com/topics/android'

In [126]:
topic4_doc=get_topic_page(url4)

In [131]:
topic4_repos=get_topic_repos(topic4_doc)

In [134]:
get_topic_repos(get_topic_page(topic_urls[5]))

In [117]:
topic_repos_df=pd.DataFrame(topic_repos_dict)


## Document and share your work