### Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.

In [697]:
import requests

In [698]:
topics_url = "https://github.com/topics"

In [699]:
response = requests.get(topics_url)

In [700]:
response.status_code

200

In [701]:
len(response.text)

205601

In [702]:
page_contents = response.text

In [703]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  \n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" /><link data-color-theme="dark_dimmed" cross

In [704]:
with open("webpage.html", "w", encoding="utf-8") as f:
    f.write(page_contents)
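
The download and save steps above can be wrapped in a single function, as the last bullet of this section suggests. This is a minimal sketch rather than the notebook's own code: `download_page` and its `fetch` parameter are hypothetical names, and `fetch` exists only so the function can be tried without network access (it falls back to `requests.get`).

```python
def download_page(url, path, fetch=None):
    """Download `url`, save the HTML to `path`, and return the HTML text."""
    if fetch is None:
        # Imported lazily so a stand-in `fetch` can be passed without using requests.
        import requests
        fetch = requests.get

    response = fetch(url)
    # Fail loudly on anything other than a successful response.
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load {url} (status {response.status_code})")

    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text
```

A call such as `download_page("https://github.com/topics", "webpage.html")` would then repeat the cells above in one step, and the same function could be reused for individual topic pages.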

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

In [705]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_contents, "html.parser")

In [706]:
Topic_title_tags = soup.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
Topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [707]:
Topic_titles = []

for tag in Topic_title_tags:
    Topic_titles.append(tag.text)
    
print(Topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [708]:
Topic_desc_tags = soup.find_all("p", {"class":"f5 color-fg-muted mb-0 mt-1"})
Topic_desc_tags[:1]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>]

In [709]:
topic_desc = []

for tag in Topic_desc_tags:
    topic_desc.append(tag.text.strip())

topic_desc

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azure is a cloud computing service created by Microsoft.',
 'Babel is a c

In [710]:
topic_url_tags = soup.find_all("a", {"class": "no-underline flex-grow-0"})

In [711]:
topic0_url = "https://github.com" + topic_url_tags[8]["href"]
topic0_url

'https://github.com/topics/arduino'

In [712]:
topic_url = []
base_url = "https://github.com"
for tag in topic_url_tags:
    topic_url.append(base_url + tag["href"])

topic_url[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
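
The three extraction steps above (titles, descriptions, URLs) can be gathered into one function that returns a dictionary of lists. The sketch below is one possible shape for it. `parse_topics` and the constant names are hypothetical, the class strings are copied from the cells above and may stop matching whenever GitHub changes its markup, and `sample_html` is a hand-written stand-in rather than real page source.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"

# Class strings copied from the cells above; GitHub may change them at any time.
TITLE_CLASS = "f3 lh-condensed mb-0 mt-1 Link--primary"
DESC_CLASS = "f5 color-fg-muted mb-0 mt-1"
LINK_CLASS = "no-underline flex-grow-0"


def parse_topics(html):
    """Return a dict of parallel lists: topic titles, descriptions, and URLs."""
    soup = BeautifulSoup(html, "html.parser")
    title_tags = soup.find_all("p", {"class": TITLE_CLASS})
    desc_tags = soup.find_all("p", {"class": DESC_CLASS})
    link_tags = soup.find_all("a", {"class": LINK_CLASS})
    return {
        "title": [tag.text.strip() for tag in title_tags],
        "description": [tag.text.strip() for tag in desc_tags],
        "url": [BASE_URL + tag["href"] for tag in link_tags],
    }


# Hand-written stand-in for a single topic entry, not real GitHub markup.
sample_html = """
<a class="no-underline flex-grow-0" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to the use of three-dimensional graphics.
  </p>
</a>
"""
topics = parse_topics(sample_html)
print(topics)
```

Returning a dictionary of equal-length lists keeps the result ready to pass straight to `pd.DataFrame`.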

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [713]:
import pandas as pd
topics_dict = {
    "title": Topic_titles,
    "description": topic_desc,
    "url": topic_url
}
topics_df = pd.DataFrame(topics_dict)


### Create CSV file with the extracted information


In [714]:
topics_df.to_csv("topics.csv", index=False)
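
The last bullet of this section asks for the CSV to be verified by reading it back with Pandas. A minimal sketch of that check follows. It uses a small stand-in DataFrame and a temporary file so it runs on its own; `sample_df`, `csv_path`, and the column names are assumptions for illustration, not the notebook's data.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for topics_df so the example runs on its own.
sample_df = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "description": [
        "3D refers to the use of three-dimensional graphics.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

csv_path = os.path.join(tempfile.mkdtemp(), "topics_sample.csv")
sample_df.to_csv(csv_path, index=False)

# Read the file back and compare it with what was written.
loaded_df = pd.read_csv(csv_path)
same_shape = loaded_df.shape == sample_df.shape
same_columns = list(loaded_df.columns) == list(sample_df.columns)
same_values = loaded_df.values.tolist() == sample_df.values.tolist()
print(same_shape, same_columns, same_values)
```

The same three comparisons could be run against `topics.csv` and `topics_df` from the cells above.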

### Getting information out of a topic page

In [715]:
topic_page_url = topic_url[0]

In [716]:
topic_page_url

'https://github.com/topics/3d'

In [717]:
response = requests.get(topic_page_url)

In [718]:
page = BeautifulSoup(response.text, "html.parser")

In [719]:
repo_tags = page.find_all('h3', {"class": "f3 color-fg-muted text-normal lh-condensed"})
a_tags = repo_tags[0].find_all('a')
a_tags[0].text.strip()

'mrdoob'

In [720]:
a_tags[1].text

'three.js'

In [721]:
base_url = "https://github.com"
repo_url = base_url + a_tags[1]["href"]
print(repo_url)

https://github.com/mrdoob/three.js


In [722]:
star_tags = page.find_all("span", {"class": "Counter js-social-count"})
len(star_tags)

20

In [723]:
star_tags[0].text.strip()

'104k'

In [724]:
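def parse_star_count(stars_str):
    # Convert a star count such as '104k' or '987' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == "k":
        # round() avoids off-by-one results from floating-point multiplication
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)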

In [725]:
parse_star_count(star_tags[0].text.strip())

104000
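
The topic page only shows rounded counts such as '104k', so the parsed value is an approximation. If exact numbers matter, the optional REST API step mentioned earlier could be used for this. The sketch below is an assumption-laden illustration: `get_exact_star_count` and its `fetch` parameter are hypothetical names, while the `GET /repos/{owner}/{repo}` endpoint and its `stargazers_count` field belong to the GitHub REST API. Unauthenticated requests are rate limited, so this suits a handful of repositories better than a bulk scrape.

```python
def get_exact_star_count(owner, repo, fetch=None):
    """Return the exact star count of a repository from the GitHub REST API."""
    if fetch is None:
        # Imported lazily so a stand-in `fetch` can be passed without using requests.
        import requests
        fetch = requests.get

    api_url = f"https://api.github.com/repos/{owner}/{repo}"
    response = fetch(api_url)
    # Raise an exception for 4xx/5xx responses (for example, when rate limited).
    response.raise_for_status()
    return response.json()["stargazers_count"]
```

For the repository above, the call would be `get_exact_star_count("mrdoob", "three.js")`.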

In [726]:
def get_repo_info(h3_tag, star_tag):
    # Returns the username, repository name, URL, and star count for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = "https://github.com" + a_tags[1]["href"]
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [727]:
topic_repos_dict = {
    "username": [],
    "repo_name": [],
    "repo_url": [],
    "stars": [],
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict["username"].append(repo_info[0])
    topic_repos_dict["repo_name"].append(repo_info[1])
    topic_repos_dict["repo_url"].append(repo_info[2])
    topic_repos_dict["stars"].append(repo_info[3])

In [728]:
{key: values[0] for key, values in topic_repos_dict.items()}

{'username': 'mrdoob',
 'repo_name': 'three.js',
 'repo_url': 'https://github.com/mrdoob/three.js',
 'stars': 104000}
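
Once the repositories of a topic are collected in a dictionary of lists, they can be written to one CSV file per topic, which is what the "dataset of CSV files" bullet calls for. The following is a sketch under assumptions: `save_topic_csv`, its parameters, and the `data` output folder are hypothetical names chosen for illustration.

```python
import os

import pandas as pd


def save_topic_csv(topic_name, repos_dict, out_dir="data"):
    """Write one topic's repositories to <out_dir>/<topic_name>.csv and return the path."""
    # Create the output folder on first use.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{topic_name}.csv")
    pd.DataFrame(repos_dict).to_csv(path, index=False)
    return path
```

With the dictionary built above, `save_topic_csv("3d", topic_repos_dict)` would write `data/3d.csv`. Calling it inside a loop over the topic URLs would give one file per topic.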

### Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.