# Notebook for scraping data

In [1]:
!pip install jovian --upgrade --quiet



In [2]:
import jovian

In [3]:
jovian.commit(project="python-web-scraping-project-guide")

[jovian] Detected Colab notebook...
[jovian] jovian.commit() is no longer required on Google Colab. If you ran this notebook from Jovian,
then just save this file in Colab using Ctrl+S/Cmd+S and it will be updated on Jovian.
Also, you can also delete this cell, it's no longer necessary.


# Scraping Top Repositories for Topics on GitHub

TODO  (Intro):
- Introduction about web scraping
- Introduction about GitHub and the problem statement
- Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)


Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
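As a rough illustration of the target format, the sketch below builds a small DataFrame from the two sample rows above and serializes it with `to_csv`. The variable names (`sample_repos`, `csv_text`) are illustrative only and are not part of the scraper.

```python
import pandas as pd

# Two sample rows taken from the format shown above
sample_repos = pd.DataFrame({
    'Repo Name': ['three.js', 'libgdx'],
    'Username': ['mrdoob', 'libgdx'],
    'Stars': [69700, 18300],
    'Repo URL': ['https://github.com/mrdoob/three.js',
                 'https://github.com/libgdx/libgdx'],
})

# index=False leaves out the row numbers so only the four columns are written
csv_text = sample_repos.to_csv(index=False)
print(csv_text)
```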

## Scrape the list of topics from GitHub

Explain how you'll do it.

- use requests to download the page
- use BS4 to parse and extract information
- convert to a Pandas dataframe
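The three steps can be tried offline before touching the live site. The sketch below is a minimal illustration that assumes a static HTML string in place of `response.text`; the markup, the class names `title` and `desc`, and the `sample_` variables are made up for the example and do not reflect GitHub's actual page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for response.text; the markup and class names here are made up
sample_html = """
<div>
  <p class="title">3D</p><p class="desc">3D modeling</p>
  <p class="title">Ajax</p><p class="desc">Async web requests</p>
</div>
"""

# Parse the HTML and pull out the text of the matching tags
sample_doc = BeautifulSoup(sample_html, 'html.parser')
sample_titles = [tag.text for tag in sample_doc.find_all('p', {'class': 'title'})]
sample_descs = [tag.text for tag in sample_doc.find_all('p', {'class': 'desc'})]

# Combine the parallel lists into a DataFrame, one row per topic
sample_df = pd.DataFrame({'title': sample_titles, 'description': sample_descs})
print(sample_df)
```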

Let's write a function to download the page.

In [4]:
import requests
from bs4 import BeautifulSoup

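def get_topics_page():
    # URL of the page that lists GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML so it can be searched with Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc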

Add some explanation

In [5]:
doc = get_topics_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)

In [6]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
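One detail of the class-based lookup is worth trying on a static snippet: when the `class` value passed to `find_all` contains several space-separated names, Beautiful Soup compares it against the whole attribute string, so the order of the names matters. The sketch below uses made-up markup to contrast this with a CSS selector, which should match regardless of order. It is an illustration of the matching behaviour, not a claim about how GitHub's pages are structured.

```python
from bs4 import BeautifulSoup

# Made-up markup: the same two classes in different orders
sample_html = '<p class="f3 lh-condensed">A</p><p class="lh-condensed f3">B</p>'
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# A multi-class string is compared against the whole class attribute value
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed'})

# A CSS selector matches any tag that carries both classes, in any order
flexible = sample_doc.select('p.f3.lh-condensed')

print([tag.text for tag in exact])
print([tag.text for tag in flexible])
```

If the site adds, removes or reorders a class name, an exact-string lookup may silently return an empty list, which is one reason to check the number of results after each selection.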

`get_topic_titles` can be used to get the list of titles

In [7]:
titles = get_topic_titles(doc)

In [8]:
len(titles)

16

In [9]:
titles[:5]

['Awesome Lists', 'Chrome', 'Code quality', 'Compiler', 'CSS']

Similarly, we define functions to get the topic descriptions and URLs.

In [10]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs


TODO - example and explanation

In [11]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
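`get_topic_urls` builds each full URL by concatenating the base URL with the tag's `href`. The sketch below compares that with `urllib.parse.urljoin` from the standard library, which is one possible alternative. The value `/topics/3d` is an assumed example of a relative `href`; the real attribute values should be checked on the page.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
relative_href = '/topics/3d'   # assumed example of an href as it appears in the page

# Plain concatenation, as in get_topic_urls
concatenated = base_url + relative_href

# urljoin gives the same result here and also copes with absolute hrefs
joined = urljoin(base_url, relative_href)
already_absolute = urljoin(base_url, 'https://github.com/topics/3d')

print(concatenated)
print(joined)
print(already_absolute)
```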

Let's put this all together into a single function

In [12]:
import pandas as pd

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # Each helper returns one list; together they become the DataFrame columns
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [13]:
import jovian

In [14]:
jovian.commit()

[jovian] Detected Colab notebook...
[jovian] jovian.commit() is no longer required on Google Colab. If you ran this notebook from Jovian,
then just save this file in Colab using Ctrl+S/Cmd+S and it will be updated on Jovian.
Also, you can also delete this cell, it's no longer necessary.


## Get the top 25 repositories from a topic page

TODO - explanation and step

In [15]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [16]:
doc = get_topic_page('https://github.com/topics/3d')

TODO - talk about the h1 tags

In [17]:
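base_url = 'https://github.com'

def parse_star_count(stars_str):
    # Assumes counts look like '69.7k' or '512'; converts them to an integer
    if stars_str.endswith('k'):
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

def get_repo_info(h1_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url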

TODO - show an example
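One way to try `get_repo_info` without downloading anything is to feed it tags parsed from a small static snippet. The sketch below is self-contained, so it repeats the function along with a `base_url` value and a `parse_star_count` helper. The helper and the markup are assumptions made for the example (star counts written like `69.7k`, two links inside the `h1`), not a record of what GitHub currently serves.

```python
from bs4 import BeautifulSoup

base_url = 'https://github.com'

def parse_star_count(stars_str):
    # Hypothetical helper: turns '69.7k' into 69700 and '512' into 512
    stars_str = stars_str.strip()
    if stars_str.endswith('k'):
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

def get_repo_info(h1_tag, star_tag):
    # The first link holds the username, the second the repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

# Made-up markup imitating one repository entry on a topic page
sample_html = """
<h1><a href="/mrdoob">mrdoob</a> / <a href="/mrdoob/three.js">three.js</a></h1>
<a class="social-count float-none">69.7k</a>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')
sample_h1 = sample_doc.find('h1')
sample_star = sample_doc.find('a', {'class': 'social-count float-none'})

sample_info = get_repo_info(sample_h1, sample_star)
print(sample_info)
```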

In [18]:
def get_topic_repos(topic_doc):
    # Get the h1 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class} )
    # Get star tags
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})

    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

TODO - show an example
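`get_topic_repos` indexes `repo_tags` and `star_tags` in parallel, which raises an `IndexError` if fewer star tags than repository tags are found. A possible alternative, sketched below with made-up values rather than real tags, is to pair the inputs with `zip` and collect one dict per repository before building the DataFrame. All `sample_` names are illustrative.

```python
import pandas as pd

# Made-up values standing in for what get_repo_info would return per repository
sample_usernames = ['mrdoob', 'libgdx']
sample_repo_names = ['three.js', 'libgdx']
sample_stars = [69700, 18300]
sample_urls = ['https://github.com/mrdoob/three.js',
               'https://github.com/libgdx/libgdx']

# One dict per repository keeps the four fields of a row together
records = [
    {'username': user, 'repo_name': name, 'stars': stars, 'repo_url': url}
    for user, name, stars, url in zip(sample_usernames, sample_repo_names,
                                      sample_stars, sample_urls)
]

sample_repos_df = pd.DataFrame(records)
print(sample_repos_df)
```

Note that `zip` stops at the shortest input, so unequal inputs shorten the table instead of raising an error. Whether that is preferable to a loud failure depends on how much a silently missing row matters.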

In [19]:
import os

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file has already been written
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
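The skip-if-exists check makes it possible to rerun the scraper without downloading topics that are already saved. The sketch below isolates that pattern in a hypothetical `write_once` helper and exercises it in a temporary directory; it writes plain text rather than a DataFrame so that it needs only the standard library.

```python
import os
import tempfile

def write_once(path, text):
    # Hypothetical helper mirroring the skip-if-exists check in scrape_topic
    if os.path.exists(path):
        return False   # file already there, nothing written
    with open(path, 'w') as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, '3D.csv')
    first = write_once(sample_path, 'Repo Name,Username,Stars,Repo URL\n')
    second = write_once(sample_path, 'this text is never written\n')
    with open(sample_path) as f:
        saved = f.read()

print(first, second)
```

A side effect of this design is that stale files are never refreshed; deleting the `data` directory forces a full rescrape.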

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [20]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()

    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [22]:
import pandas as pd
scrape_topics_repos()

Scraping list of topics


ValueError: All arrays must be of the same length

This error means the DataFrame is being built from lists of different lengths. pandas requires every column to have the same number of elements when a DataFrame is constructed from a dict of lists. Since the failure happens right after `Scraping list of topics` is printed, the mismatch is among the three lists combined in `scrape_topics`: titles, descriptions and URLs.

What the traceback tells you
- The error originates in pandas internals when it tries to align multiple 1-D inputs into a 2-D structure.
- Specifically, the message: "All arrays must be of the same length" means at least one of the arrays you’re passing has a length that differs from the others.

Common causes
- Mismatched lengths in the data sources you’re aggregating (e.g., topics, repos, descriptions, URLs).
- A filtering operation reduced one list but not the others.
- A miscount when you’re iterating and appending to lists inside a loop (off-by-one or conditional path skipping items).
- Returning different numbers of items from a function for each field.

How to diagnose and fix
1. Inspect lengths before creating the DataFrame
   - Print lengths of all arrays/lists you plan to combine:
     - `print(len(topics), len(repos), len(descriptions), len(urls))`
   - If you’re building from a DataFrame or a dict, check each value’s length:
     - `for key, value in data.items(): print(key, len(value) if hasattr(value, '__len__') else 'N/A')`

2. Find where the mismatch occurs
   - If you’re inside a function like `scrape_topics_repos()`, add debug statements or use a small subset:
     - Collect items in a list of dicts, then `pd.DataFrame(list_of_dicts)`; this often helps identify which field is short.
   - Compare lengths after each processing step (e.g., after filtering, deduplication, or splitting strings).

3. Ensure consistent construction
   - Build a list of records (dicts) and convert once:
     - ```
       records = []
       for item in items:
           rec = {
               "topic": item.topic,
               "repo": item.repo,
               "description": item.description,
               "url": item.url
           }
           records.append(rec)
       df = pd.DataFrame(records)
       ```
   - If you must use separate lists, ensure they’re appended in lockstep:
     - ```
       topics, repos, descriptions, urls = [], [], [], []
       for item in items:
           topics.append(item.topic)
           repos.append(item.repo)
           descriptions.append(item.description)
           urls.append(item.url)
       # then:
       df = pd.DataFrame({"topic": topics, "repo": repos, "description": descriptions, "url": urls})
       ```

4. Handle missing data gracefully
   - If some fields are missing for certain items, decide on a strategy:
     - Pad with None/NaN to maintain equal lengths.
     - Use an inner join approach when merging multiple sources so only complete records are kept.

5. Quick checks you can run
   - After collecting each field, print the lengths:
     - `print(len(topics), len(repos), len(descriptions), len(urls))`
   - If you’re parsing HTML, ensure your selectors consistently return results for every item; missing selectors can cause empty lists for some fields.
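For this notebook, the length check can be applied to the dict that `scrape_topics` passes to `pd.DataFrame`. The sketch below uses a hypothetical `column_lengths` helper and made-up data in which one selector matched fewer tags than the others; in the notebook itself the dict values would be the results of `get_topic_titles(doc)`, `get_topic_descs(doc)` and `get_topic_urls(doc)`.

```python
def column_lengths(columns):
    # Map each column name to the number of items collected for it
    return {name: len(values) for name, values in columns.items()}

# Made-up data imitating a selector that matched fewer tags than the others
sample_columns = {
    'title': ['3D', 'Ajax', 'Algorithm'],
    'description': ['3D modeling', 'Async web requests'],
    'url': ['https://github.com/topics/3d',
            'https://github.com/topics/ajax',
            'https://github.com/topics/algorithm'],
}

lengths = column_lengths(sample_columns)
mismatched = len(set(lengths.values())) > 1
print(lengths, mismatched)
```

The column whose length differs points at the selector to re-inspect in the browser, since class names on the live page may have changed since the helper functions were written.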


We can check that the CSVs were created properly

In [23]:
# read and display a CSV using Pandas
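A minimal sketch of such a check, assuming pandas is available: `pd.read_csv` accepts either a file path or a file-like object, so the example reads from an in-memory string to stay runnable without the scraped files. With real output, the argument would be a path inside the `data` directory (for instance a hypothetical `data/3D.csv`), and the column names would be those written by `scrape_topic`.

```python
import io
import pandas as pd

# Stand-in for a scraped file; the rows come from the sample format shown earlier
sample_csv = io.StringIO(
    'Repo Name,Username,Stars,Repo URL\n'
    'three.js,mrdoob,69700,https://github.com/mrdoob/three.js\n'
    'libgdx,libgdx,18300,https://github.com/libgdx/libgdx\n'
)

# read_csv accepts a path or a file-like object
check_df = pd.read_csv(sample_csv)
print(check_df.head())
```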

## References and Future Work

Summary of what we did

- ?
- ?


References to links you found useful

- ?
- ?

Ideas for future work

- ?
- ?