# Acquisition + Preparation
For this project, you will have to build a dataset yourself. Decide on a list of GitHub repositories to scrape, and write the Python code necessary to extract the text of the README file and the primary language of each repository.

You can find the language of a repository in either of two places:
1. The **Languages** section at the bottom right of the repository page.
2. The HTML element ```<ul class="list-style-none">``` in the page source.
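
As a rough illustration of the second option, the sketch below pulls the first entry out of a `<ul class="list-style-none">` element. It uses the standard library's `html.parser` so it runs without extra installs; the same lookup could be done with BeautifulSoup. The `SAMPLE_HTML` snippet, the `LanguageListParser` class, and the `primary_language` helper are all hypothetical, and GitHub's real markup may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modeled on a "Languages" list; not copied from GitHub
SAMPLE_HTML = """
<ul class="list-style-none">
  <li><span>Python</span> <span>98.5%</span></li>
  <li><span>Shell</span> <span>1.5%</span></li>
</ul>
"""


class LanguageListParser(HTMLParser):
    """Collect the text of each <li> inside a <ul class="list-style-none">."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0   # > 0 while inside the target <ul>
        self.current = None   # text pieces of the <li> being read
        self.items = []       # one list of text pieces per <li>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ul" and (self.list_depth or "list-style-none" in classes):
            self.list_depth += 1
        elif tag == "li" and self.list_depth and self.current is None:
            self.current = []

    def handle_endtag(self, tag):
        if tag == "ul" and self.list_depth:
            self.list_depth -= 1
        elif tag == "li" and self.current is not None:
            self.items.append(self.current)
            self.current = None

    def handle_data(self, data):
        # Keep only non-blank text found inside a list item
        if self.current is not None and data.strip():
            self.current.append(data.strip())


def primary_language(html):
    """Return the first language name in the list, or None if no list is found."""
    parser = LanguageListParser()
    parser.feed(html)
    parser.close()
    return parser.items[0][0] if parser.items and parser.items[0] else None


print(primary_language(SAMPLE_HTML))
```

This assumes the list is ordered with the most used language first, which should be checked against the live page before relying on it.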

Which repositories you use is up to you, but you should include at least 100 repositories in your dataset.

As examples of which repositories to use, here are links to [GitHub's trending repositories](https://github.com/trending), the [most forked repositories](https://github.com/search?o=desc&q=stars:%3E1&s=forks&type=Repositories), and the [most starred repositories](https://github.com/search?q=stars%3A%3E0&s=stars&type=Repositories).

In [1]:
# Imports

import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
import os

In [11]:
headers = {'User-Agent': 'GitHub'}

# Trending with Language Python
response = get('https://github.com/trending/python?since=daily', headers=headers)

In [12]:
response.ok

True

In [13]:
print(response.text[:400])






<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
  <link rel="dns-prefetch" href="http


In [14]:
soup = BeautifulSoup(response.content, 'html.parser')

In [15]:
soup.title.string

'Trending Python repositories on GitHub today · GitHub'

In [16]:
article = soup.find('h1', class_='h3 lh-condensed')
repo_name = article.text

In [17]:
repo_name = repo_name.replace("\n","")
repo_name = repo_name.replace(" ","")

In [18]:
repo_name

'ytdl-org/youtube-dl'
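
The two `replace` calls above can also be folded into one step. A minimal sketch, assuming the heading text contains only the owner, a slash, and the repository name separated by whitespace; `clean_repo_name` is a hypothetical helper name, not part of the notebook:

```python
def clean_repo_name(raw_text):
    """Collapse a heading's text into 'owner/repo' by dropping all whitespace."""
    # str.split() with no arguments splits on any run of spaces, tabs, or newlines
    return "".join(raw_text.split())


# Text shaped like the output of h.get_text() for one trending heading
raw = "\n\n        ytdl-org /\n\n      youtube-dl\n "
print(clean_repo_name(raw))
```

Unlike the chained `replace` calls, this version also removes tabs and other whitespace characters.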

In [22]:
soup.find_all('h1', class_='h3 lh-condensed')

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":1039520,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="b20fbb2ae4d398ce0be32669206fb73ade4f4b8f35ba525f6af4ab0c53f99964" href="/ytdl-org/youtube-dl">
 <svg aria-hidden="true" class="octicon octicon-repo mr-1 text-gray" height="16" version="1.1" viewbox="0 0 16 16" width="16"><path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path></svg>
 <span cl

Reference on BeautifulSoup's `find` and `find_all`: https://divyanshushekhar.com/python-beautifulsoup-find-findall/

In [29]:
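# Collect every repository heading on the page before looping over them
h1 = soup.find_all('h1', class_='h3 lh-condensed')
for h in h1:
    print(h.get_text())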





        ytdl-org /

      youtube-dl
 




        rtcatc /

      Packer-Fuzzer
 




        microsoft /

      Bringing-Old-Photos-Back-to-Life
 




        TsinghuaAI /

      CPM-Generate
 




        hzwer /

      arXiv2020-RIFE
 




        PyTorchLightning /

      pytorch-lightning
 




        joke2k /

      faker
 




        soimort /

      you-get
 




        davidteather /

      TikTok-Api
 




        faif /

      python-patterns
 




        KalleHallden /

      pwManager
 




        bitcoinbook /

      bitcoinbook
 




        facebook /

      prophet
 




        apple /

      ml-hypersim
 




        open-mmlab /

      OpenPCDet
 




        pyserial /

      pyserial
 




        PaddlePaddle /

      PARL
 




        rougier /

      numpy-100
 




        parsampsh /

      pashmak
 




        rskmoi /

      namedivider-python
 




        MrS0m30n3 /

      youtube-dl-gui
 




        lehaifeng /

      T-GCN
 




        de

In [30]:
h1 = soup.find_all('h1', class_='h3 lh-condensed')

repo_names = []

for h in h1:
    repo_name = h.get_text()
    repo_name = repo_name.replace("\n","")
    repo_name = repo_name.replace(" ","")
    repo_names.append(repo_name)
    
repo_names

['ytdl-org/youtube-dl',
 'rtcatc/Packer-Fuzzer',
 'microsoft/Bringing-Old-Photos-Back-to-Life',
 'TsinghuaAI/CPM-Generate',
 'hzwer/arXiv2020-RIFE',
 'PyTorchLightning/pytorch-lightning',
 'joke2k/faker',
 'soimort/you-get',
 'davidteather/TikTok-Api',
 'faif/python-patterns',
 'KalleHallden/pwManager',
 'bitcoinbook/bitcoinbook',
 'facebook/prophet',
 'apple/ml-hypersim',
 'open-mmlab/OpenPCDet',
 'pyserial/pyserial',
 'PaddlePaddle/PARL',
 'rougier/numpy-100',
 'parsampsh/pashmak',
 'rskmoi/namedivider-python',
 'MrS0m30n3/youtube-dl-gui',
 'lehaifeng/T-GCN',
 'deepfakes/faceswap',
 'benedekrozemberczki/pytorch_geometric_temporal',
 'coleifer/peewee']

In [28]:
len(repo_names)

25

# Acquire Repo URLs Function

In [43]:
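def acquire_repo_urls(language, period):
    """Return the repository URLs listed on GitHub's trending page.

    language: language slug used in the trending URL, e.g. 'python'
    period: trending window passed as `since`, e.g. 'daily' or 'weekly'
    """
    headers = {'User-Agent': 'GitHub'}

    # Request the trending page for the given language and period
    response = get(f'https://github.com/trending/{language}?since={period}', headers=headers)

    # Report whether the request succeeded
    print(response.ok)

    soup = BeautifulSoup(response.content, 'html.parser')

    # Each trending repository is named in an <h1 class="h3 lh-condensed"> heading
    h1 = soup.find_all('h1', class_='h3 lh-condensed')

    repo_urls = []

    for h in h1:
        # Strip newlines and spaces to get 'owner/repo', then build the full URL
        repo_name = h.get_text()
        repo_name = repo_name.replace("\n","")
        repo_name = repo_name.replace(" ","")
        repo_urls.append('https://github.com/' + repo_name)

    return repo_urls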
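
Since one trending page lists about 25 repositories, reaching 100 means combining several pages. The sketch below shows one possible way to do that: it loops over language and period combinations, skips duplicates, and stops once a minimum count is reached. `collect_repo_urls`, its `fetch` parameter, and `fake_fetch` are hypothetical names; in the notebook, `acquire_repo_urls` would be passed as `fetch`.

```python
from itertools import product


def collect_repo_urls(fetch, languages, periods, minimum=100):
    """Gather unique repository URLs across several trending pages.

    fetch: callable (language, period) -> list of URLs, e.g. acquire_repo_urls
    """
    seen = set()
    urls = []
    for language, period in product(languages, periods):
        for url in fetch(language, period):
            # The same repository can trend under several periods; keep it once
            if url not in seen:
                seen.add(url)
                urls.append(url)
        if len(urls) >= minimum:
            break
    return urls


# Stand-in for acquire_repo_urls so the sketch runs without network access
def fake_fetch(language, period):
    pages = [f'https://github.com/{language}-org/repo-{i}' for i in range(25)]
    return pages + ['https://github.com/shared/repo']


urls = collect_repo_urls(fake_fetch, ['python', 'javascript'], ['daily', 'weekly'], minimum=50)
print(len(urls))
```

Passing the fetch function in as an argument keeps the deduplication logic separate from the scraping, so it can be checked without hitting GitHub.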

In [44]:
acquire_repo_urls('javascript','weekly')

True


['https://github.com/lxk0301/jd_scripts',
 'https://github.com/Marak/faker.js',
 'https://github.com/iptv-org/iptv',
 'https://github.com/mermaid-js/mermaid',
 'https://github.com/ryanmcdermott/clean-code-javascript',
 'https://github.com/yangshun/tech-interview-handbook',
 'https://github.com/trekhleb/javascript-algorithms',
 'https://github.com/PavelDoGreat/WebGL-Fluid-Simulation',
 'https://github.com/odensc/ttv-ublock',
 'https://github.com/ryanhanwu/How-To-Ask-Questions-The-Smart-Way',
 'https://github.com/w37fhy/QuantumultX',
 'https://github.com/lerna/lerna',
 'https://github.com/yogeshojha/rengine',
 'https://github.com/Advanced-Frontend/Daily-Interview-Question',
 'https://github.com/ArugaZ/whatsapp-bot',
 'https://github.com/wekan/wekan',
 'https://github.com/GoogleChrome/lighthouse',
 'https://github.com/electerm/electerm',
 'https://github.com/dortania/OpenCore-Install-Guide',
 'https://github.com/yangshun/front-end-interview-handbook',
 'https://github.com/mapbox/mapbox-gl