# Acquisition + Preparation
For this project, you will have to build a dataset yourself. Decide on a list of GitHub repositories to scrape, and write the Python code necessary to extract the text of the README file and the primary language of each repository.

You can find the language of a repository in either of these places:
1. The **Languages** section at the bottom right of the repository page
2. The HTML element `<ul class="list-style-none">` in the page source

Which repositories you use are up to you, but you should include at least 100 repositories in your data set.

This notebook uses repositories written in Java, JavaScript, Python, and Swift.

As examples of where to find repositories, see [GitHub's trending repositories](https://github.com/trending), the [most forked repositories](https://github.com/search?o=desc&q=stars:%3E1&s=forks&type=Repositories), and the [most starred repositories](https://github.com/search?q=stars%3A%3E0&s=stars&type=Repositories).

In [1]:
# Imports

import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
import os

In [2]:
headers = {'User-Agent': 'GitHub'}

# Trending with Language Python
response = get('https://github.com/trending/python?since=daily', headers=headers)

In [3]:
response.ok

True

In [4]:
print(response.text[:400])






<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
  <link rel="dns-prefetch" href="http


In [5]:
soup = BeautifulSoup(response.content, 'html.parser')

In [6]:
soup.title.string

'Trending Python repositories on GitHub today · GitHub'

In [7]:
article = soup.find('h1', class_='h3 lh-condensed')
repo_name = article.text

In [8]:
repo_name = repo_name.replace("\n","")
repo_name = repo_name.replace(" ","")

In [9]:
repo_name

'ytdl-org/youtube-dl'
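
The two `replace` calls remove only newlines and spaces. A minimal sketch of an alternative, assuming the heading text contains nothing but the owner, a slash, and the repository name surrounded by whitespace: splitting on any whitespace and rejoining also removes tabs and similar characters. The helper name `clean_repo_name` is hypothetical.

```python
def clean_repo_name(raw_text: str) -> str:
    """Remove every whitespace character from a scraped heading."""
    # str.split() with no arguments splits on runs of any whitespace
    return "".join(raw_text.split())


sample = "\n\n      ytdl-org /\n\n      youtube-dl\n"
print(clean_repo_name(sample))  # ytdl-org/youtube-dl
```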

In [10]:
soup.find_all('h1', class_='h3 lh-condensed')

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":1039520,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="b20fbb2ae4d398ce0be32669206fb73ade4f4b8f35ba525f6af4ab0c53f99964" href="/ytdl-org/youtube-dl">
 <svg aria-hidden="true" class="octicon octicon-repo mr-1 text-gray" height="16" version="1.1" viewbox="0 0 16 16" width="16"><path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path></svg>
 <span cl

Reference for `find` vs. `find_all`: https://divyanshushekhar.com/python-beautifulsoup-find-findall/

In [11]:
h1 = soup.find_all('h1', class_='h3 lh-condensed')

repo_names = []

for h in h1:
    repo_name = h.get_text()
    repo_name = repo_name.replace("\n","")
    repo_name = repo_name.replace(" ","")
    repo_names.append(repo_name)
    
repo_names

['ytdl-org/youtube-dl',
 'TsinghuaAI/CPM-Generate',
 'y1ndan/genshin-impact-helper',
 'scikit-learn/scikit-learn',
 'PyTorchLightning/pytorch-lightning',
 'joke2k/faker',
 'horovod/horovod',
 'microsoft/restler-fuzzer',
 'bitcoin/bips',
 'bridgecrewio/checkov',
 'jiupinjia/SkyAR',
 'tornadoweb/tornado',
 'zulip/zulip',
 'netbox-community/netbox',
 'apache/incubator-superset',
 'kizniche/Mycodo',
 'githubharald/SimpleHTR',
 'elyra-ai/elyra',
 '1nfinityLoop/Sudoku-Solver-AI',
 'jupyterhub/jupyterhub',
 'NVIDIA/DeepLearningExamples',
 'Strip3s/PhoenixBot',
 'facebookresearch/Detectron',
 'huggingface/transformers',
 'ghidraninja/game-and-watch-hacking']

In [12]:
len(repo_names)

25

# Acquire Repo URLs Function

In [13]:
def acquire_repo_urls(language, period):
    
    headers = {'User-Agent': 'GitHub'}

    # Trending page for the given language and time period
    response = get(f'https://github.com/trending/{language}?since={period}&spoken_language_code=en', headers=headers)
    
    print(response.ok)
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    h1 = soup.find_all('h1', class_='h3 lh-condensed')

    repo_names = []

    for h in h1:
        repo_name = h.get_text()
        repo_name = repo_name.replace("\n","")
        repo_name = repo_name.replace(" ","")
        repo_name = 'https://github.com/' + repo_name
        repo_names.append(repo_name)
    
    return repo_names

In [14]:
REPOS = acquire_repo_urls('javascript','weekly')

True
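
The trending page scraped earlier returned 25 repositories, so reaching 100 takes several calls. The sketch below is one way to combine them, assuming `acquire_repo_urls` from the cell above is passed in as `fetch`. The helper `collect_repo_urls` and the stand-in `fake_fetch` are hypothetical and exist only so the example runs without network access.

```python
from typing import Callable, Iterable, List


def collect_repo_urls(
    fetch: Callable[[str, str], List[str]],
    languages: Iterable[str],
    periods: Iterable[str],
) -> List[str]:
    """Call fetch(language, period) for every pair and drop duplicate URLs."""
    periods = list(periods)  # materialize so it can be reused for each language
    seen = set()
    urls = []
    for language in languages:
        for period in periods:
            for url in fetch(language, period):
                # A repo can trend in several periods; keep the first sighting
                if url not in seen:
                    seen.add(url)
                    urls.append(url)
    return urls


# Stand-in for acquire_repo_urls (hypothetical data, no network access)
def fake_fetch(language: str, period: str) -> List[str]:
    return [
        f"https://github.com/{language}/shared",
        f"https://github.com/{language}/{period}",
    ]


demo = collect_repo_urls(fake_fetch, ["java", "python"], ["daily", "weekly"])
print(len(demo))  # 6
```

With the real function, a call such as `collect_repo_urls(acquire_repo_urls, ['java', 'javascript', 'python', 'swift'], ['daily', 'weekly', 'monthly'])` would gather the four languages used in this notebook.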


# Now we need a function to scrape the README files using these URLs

In [15]:
"""
A module for obtaining repo readme and language data from the github API.
Before using this module, read through it, and follow the instructions marked
TODO.
After doing so, run it like this:
    python acquire.py
To create the `data.json` file that contains the data.
"""
import os
import json
from typing import Dict, List, Optional, Union, cast
import requests

from env import github_token, github_username

# TODO: Make a github personal access token.
#     1. Go here and generate a personal access token https://github.com/settings/tokens
#        You do _not_ need to select any scopes, i.e. leave all the checkboxes unchecked
#     2. Save it in your env.py file under the variable `github_token`
# TODO: Add your github username to your env.py file under the variable `github_username`
# TODO: Add more repositories to the `REPOS` list below.

headers = {"Authorization": f"token {github_token}", "User-Agent": github_username}

if headers["Authorization"] == "token " or headers["User-Agent"] == "":
    raise Exception(
        "You need to follow the instructions marked TODO in this script before trying to use it"
    )


def github_api_request(url: str) -> Union[List, Dict]:
    response = requests.get(url, headers=headers)
    response_data = response.json()
    if response.status_code != 200:
        raise Exception(
            f"Error response from github api! status code: {response.status_code}, "
            f"response: {json.dumps(response_data)}"
        )
    return response_data


def get_repo_language(repo: str) -> str:
    url = f"https://api.github.com/repos/{repo}"
    repo_info = github_api_request(url)
    if type(repo_info) is dict:
        repo_info = cast(Dict, repo_info)
        return repo_info.get("language", None)
    raise Exception(
        f"Expecting a dictionary response from {url}, instead got {json.dumps(repo_info)}"
    )


def get_repo_contents(repo: str) -> List[Dict[str, str]]:
    url = f"https://api.github.com/repos/{repo}/contents/"
    contents = github_api_request(url)
    if type(contents) is list:
        contents = cast(List, contents)
        return contents
    raise Exception(
        f"Expecting a list response from {url}, instead got {json.dumps(contents)}"
    )


def get_readme_download_url(files: List[Dict[str, str]]) -> str:
    """
    Takes in a response from the github api that lists the files in a repo and
    returns the url that can be used to download the repo's README file.
    """
    for file in files:
        if file["name"].lower().startswith("readme"):
            return file["download_url"]
    return ""


def process_repo(repo: str) -> Dict[str, str]:
    """
    Takes a repo name like "gocodeup/codeup-setup-script" and returns a
    dictionary with the language of the repo and the readme contents.
    """
    contents = get_repo_contents(repo)
    readme_download_url = get_readme_download_url(contents)
    if readme_download_url == "":
        readme_contents = None
    else:
        readme_contents = requests.get(readme_download_url).text
    return {
        "repo": repo,
        "language": get_repo_language(repo),
        "readme_contents": readme_contents,
    }


def scrape_github_data() -> List[Dict[str, str]]:
    """
    Loop through all of the repos and process them. Returns the processed data.
    """
    return [process_repo(repo) for repo in REPOS]


if __name__ == "__main__":
    data = scrape_github_data()
    # Use a context manager so the file is flushed and closed
    with open("data.json", "w") as f:
        json.dump(data, f, indent=1)

Exception: Error response from github api! status code: 404, response: {"message": "Not Found", "documentation_url": "https://docs.github.com/rest"}
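
A likely cause of the 404 above, assuming `REPOS` still holds the output of `acquire_repo_urls`: each entry is a full URL beginning with `https://github.com/`, while `process_repo` expects the short `owner/name` form and builds `https://api.github.com/repos/{repo}` from it. A minimal sketch of a conversion helper follows; the name `repo_slug` is hypothetical.

```python
from urllib.parse import urlparse


def repo_slug(repo_url: str) -> str:
    """Turn 'https://github.com/owner/name' into 'owner/name'."""
    path = urlparse(repo_url).path  # e.g. '/odensc/ttv-ublock'
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Not a repository URL: {repo_url}")
    # Keep only owner and name, ignoring any trailing path segments
    return "/".join(parts[:2])


print(repo_slug("https://github.com/odensc/ttv-ublock"))  # odensc/ttv-ublock
```

Running `REPOS = [repo_slug(url) for url in REPOS]` before calling `scrape_github_data()` would then pass names in the form the API URLs are built from.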

In [16]:
# Function that automates the parsing process for a list of links

def get_content_df(links):
    '''Takes in a list of URLs, parses each page, and returns a DataFrame
    with the README text of each repository.'''
    
    content = []
    for url in links:
        response = get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # The README text is read from <article itemprop="text">
        article = soup.find('article', itemprop='text')
        content.append({'content': article.get_text()})
    # Build the DataFrame once, after every page has been parsed
    return pd.DataFrame(content)

In [50]:
test_url = REPOS[0]

response = get(test_url)
soup = BeautifulSoup(response.content, 'html.parser')

text = soup.find_all('a', class_='social-count')

In [69]:
counts = []
watchers = []
stars = []
forks = []

for h in text:
    counts.append(h.get_text().replace('\n','').replace(' ',''))
    
watchers.append(counts[0])
stars.append(counts[1])
forks.append(counts[2])

In [82]:
counts.append('20')

In [83]:
counts

['57', '588', '53', 20, '20']

In [60]:
soup.find_all('a', class_='social-count')

[<a aria-label="57 users are watching this repository" class="social-count" href="/odensc/ttv-ublock/watchers">
       57
     </a>,
 <a aria-label="588 users starred this repository" class="social-count js-social-count" href="/odensc/ttv-ublock/stargazers">
       588
     </a>,
 <a aria-label="53 users forked this repository" class="social-count" href="/odensc/ttv-ublock/network/members">
         53
       </a>]
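
The indexing `counts[0]`, `counts[1]`, `counts[2]` relies on the three links always appearing in the order watchers, stars, forks. As a sketch of a less position-dependent approach, the `aria-label` text from the output above can be matched directly. It assumes the labels keep the wording shown here; the names `metric_from_label`, `LABEL_TO_METRIC`, and `LABEL_PATTERN` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Words that appear in the aria-label text shown above, mapped to column names
LABEL_TO_METRIC = {"watching": "watchers", "starred": "stars", "forked": "forks"}
LABEL_PATTERN = re.compile(r"(\d[\d,]*) users? (?:are |is )?(watching|starred|forked)")


def metric_from_label(label: str) -> Optional[Tuple[str, int]]:
    """Return (metric, count) for a social-count aria-label, or None if it does not match."""
    match = LABEL_PATTERN.search(label)
    if match is None:
        return None
    count = int(match.group(1).replace(",", ""))
    return LABEL_TO_METRIC[match.group(2)], count


print(metric_from_label("588 users starred this repository"))  # ('stars', 588)
```

With BeautifulSoup, each tag's label is available as `tag.get('aria-label')`, so the result can be stored under the returned metric name instead of a list position.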

In [62]:
counts

['57', '588', '53']

In [84]:
# Function that automates the parsing process for a list of links

def get_content_df(links):
    '''Takes in a list of URLs, parses each page, and returns a DataFrame with
    the README text plus the watcher, star, and fork counts of each repository.'''
    
    content = []
    watchers = []
    stars = []
    forks = []
    
    for url in links:
        response = get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # README text
        article = soup.find('article', itemprop='text')
        content.append({'content': article.get_text()})
        
        # The social-count links appear in the order watchers, stars, forks
        counts = [a.get_text().replace('\n', '').replace(' ', '')
                  for a in soup.find_all('a', class_='social-count')]
        
        watchers.append(counts[0])
        stars.append(counts[1])
        forks.append(counts[2])
    
    df = pd.DataFrame(content)
    df['watchers'] = watchers
    df['stars'] = stars
    df['forks'] = forks
    
    return df

In [85]:
get_content_df(REPOS)

| | content | watchers | stars | forks |
|---|---|---|---|---|
| 0 | TTV ad-block\nWorks best when paired with uBlo... | 57 | 588 | 53 |
| 1 | IPTV\nCollection of 5000+ publicly available I... | 850 | 21.3k | 155 |
| 2 | clean-code-javascript\nTable of Contents\n\nIn... | 1.4k | 41.4k | 5k |
| 3 | JavaScript Algorithms and Data Structures\n\n\... | 3.5k | 86.1k | 14.5k |
| 4 | Tech Interview Handbook\n\n\n\n\n\n\n\n\n\n\nC... | 1.7k | 46.8k | 6.6k |
| 5 | \nAbout\ndiagrams.net, previously draw.io, is ... | 511 | 20.5k | 4.2k |
| 6 | mermaid \n\n🏆 Mermaid was nominated and wo... | 563 | 32.9k | 2.2k |
| 7 | WebGL Fluid Simulation\nPlay here\n\nReference... | 209 | 10.1k | 929 |
| 8 | Front End Interview Handbook\n\n\n\n\n\n\nCred... | 768 | 26.6k | 3.8k |
| 9 | faker.js - generate massive amounts of fake da... | 359 | 28k | 2.4k |
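
The count columns are strings, and larger values are abbreviated (for example `21.3k`), so they cannot be sorted or averaged as numbers directly. Below is a minimal sketch of a converter, assuming `k` stands for thousands. The `m` suffix is an assumption that does not appear in the output above, and the names `parse_count` and `MULTIPLIERS` are hypothetical.

```python
from decimal import Decimal

# Suffix multipliers; only 'k' appears in the table above, 'm' is an assumption
MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_count(text: str) -> int:
    """Convert strings such as '588' or '21.3k' to an integer."""
    text = text.strip().lower().replace(",", "")
    if text and text[-1] in MULTIPLIERS:
        # Decimal avoids float rounding surprises in the multiplication
        return int(Decimal(text[:-1]) * MULTIPLIERS[text[-1]])
    return int(text)


print([parse_count(c) for c in ["57", "21.3k", "5k"]])  # [57, 21300, 5000]
```

Applied to the DataFrame, this could look like `df['watchers'] = df['watchers'].apply(parse_count)`. The recovered values are only as precise as the abbreviated text, so `21.3k` becomes 21300 regardless of the exact count.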


# Reading Final CSV of Data to Use in Project

In [None]:
df = pd.read_csv('train_validate.csv')

In [None]:
df