## Understanding how to extract repository information through the GitHub API

Links: 
- [Search](https://developer.github.com/v3/search/)
- [Searching for repositories](https://help.github.com/en/articles/searching-for-repositories#search-by-repository-name-description-or-contents-of-the-readme-file)
- List of qualifiers: https://help.github.com/en/articles/searching-code
- Search documentation: https://developer.github.com/v3/search/

In [72]:
import requests
import pandas as pd
import time
from datetime import datetime

Building the search URL for the repositories with the most stars.
The query stars:>1 selects every repository with more than 1 star. This has no negative impact on the search, since there are about 3,305,358 repositories and we will only analyze 2,500.

In [138]:
q = 'q=stars:>1'
sort = '&sort=stars&order=desc'
url_base = 'https://api.github.com/search/repositories?'
url_final = url_base+q+sort
credentials=('lorenaps','token')
url_final

'https://api.github.com/search/repositories?q=stars:>1&sort=stars&order=desc'
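
The same URL can also be assembled from a parameter dictionary, which takes care of percent-encoding. A minimal sketch, assuming a helper named `montar_url_busca` (hypothetical, not part of the notebook) and the `per_page` and `page` query parameters of the search endpoint:

```python
from urllib.parse import urlencode

URL_BASE = "https://api.github.com/search/repositories?"

def montar_url_busca(min_stars=1, page=1, per_page=30):
    """Hypothetical helper: build the search URL from a dict of query parameters."""
    params = {
        "q": "stars:>{0}".format(min_stars),
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    # urlencode percent-encodes ':' and '>' (stars:>1 becomes stars%3A%3E1)
    return URL_BASE + urlencode(params)

url = montar_url_busca()
print(url)
```

The encoded form `stars%3A%3E1` is the same one that appears in the `next` URLs returned by the API further down.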

Checking the rate limit

In [94]:
t = requests.get('https://api.github.com/rate_limit', auth=credentials)
t.json()

{'rate': {'limit': 5000, 'remaining': 5000, 'reset': 1562436829},
 'resources': {'core': {'limit': 5000, 'remaining': 5000, 'reset': 1562436829},
  'graphql': {'limit': 5000, 'remaining': 5000, 'reset': 1562436829},
  'integration_manifest': {'limit': 5000,
   'remaining': 5000,
   'reset': 1562436829},
  'search': {'limit': 30, 'remaining': 30, 'reset': 1562433289}}}
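
The `search` entry in this response is the one that matters for the queries below: 30 requests per window, with `reset` given as a Unix timestamp. A rough sketch of how that payload could be turned into a wait time; the function name and the `agora` parameter are illustrative assumptions:

```python
import time

def segundos_ate_reset(rate_json, recurso="search", agora=None):
    """Return how many seconds to wait before `recurso` can be queried again.

    Returns 0 while requests remain in the current window.
    """
    info = rate_json["resources"][recurso]
    if info["remaining"] > 0:
        return 0
    agora = time.time() if agora is None else agora
    return max(0, int(info["reset"] - agora))

# Payload shaped like the response above, with the quota exhausted
exemplo = {"resources": {"search": {"limit": 30, "remaining": 0, "reset": 1562433289}}}
print(segundos_ate_reset(exemplo, agora=1562433259))  # 30
```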

In [87]:
colunas=['id',
         'owner_type', 
           'owner_url',
           'owner_html_url',
           'html_url',
           'url',
           'fork',
           'created_at',
           'updated_at',
           'size',
           'stargazers_count',
           'language',
           'has_issues',
           'has_wiki',
           'forks_count',
           'forks',
           'open_issues',
           'watchers',
         'timestamp_extract']

In [127]:
def add_resultado(item):
    df = pd.DataFrame([[
                        item.get('id'),
                        item.get('owner').get('type', None),
                        item.get('owner').get('url', None),
                        item.get('owner').get('html_url', None),
                        item.get('html_url', None),
                        item.get('url', None),
                        item.get('fork', None),
                        item.get('created_at', None),
                        item.get('updated_at', None),
                        item.get('size', None),
                        item.get('stargazers_count', None),
                        item.get('language', None),
                        item.get('has_issues', None),
                        item.get('has_wiki', None),
                        item.get('forks_count', None),
                        item.get('forks', None),
                        item.get('open_issues', None),
                        item.get('watchers', None),
                        str(time.time()).split('.')[0]]], columns=colunas)

    return df    

In [128]:
def extrair_dados(data, resultados):
    # Default to an empty list so a response without 'items' (e.g. an error payload) does not raise
    for item in data.get('items', []):
        resultados = pd.concat([resultados, add_resultado(item)], ignore_index=True, sort=False)
        
    return resultados
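
Calling `pd.concat` once per item copies the accumulated frame every time. An alternative sketch, assuming it is acceptable to collect plain dictionaries first and build the DataFrame once at the end; `achatar_item` and `CAMPOS` are hypothetical names, and the field list is shortened for brevity:

```python
import time

# Shortened field list for illustration; the notebook's `colunas` has more entries
CAMPOS = ["id", "html_url", "stargazers_count", "language"]

def achatar_item(item, agora=None):
    """Flatten one search result into a plain dict (hypothetical helper)."""
    owner = item.get("owner") or {}  # tolerate a missing or null owner
    linha = {campo: item.get(campo) for campo in CAMPOS}
    linha["owner_type"] = owner.get("type")
    linha["timestamp_extract"] = int(time.time() if agora is None else agora)
    return linha

item = {"id": 1, "owner": {"type": "User"}, "stargazers_count": 10}
linhas = [achatar_item(item, agora=1562435050)]
print(linhas[0]["owner_type"], linhas[0]["language"])  # User None
```

The resulting list can then be passed to `pd.DataFrame(linhas)` in a single call.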

In [129]:
def get_total_paginas(data):
    
    itens_por_pagina = len(data.get('items', []))
    if itens_por_pagina == 0:
        return 0  # no items on this page, so there is nothing to paginate

    # Ceiling division, so a partially filled last page is also counted
    total_paginas = -(-data.get('total_count') // itens_por_pagina)

    print('Total de registros:{0} , Registros por página:{1}, Total de Páginas:{2}'.format(
        data.get('total_count'), itens_por_pagina, total_paginas))
    
    return total_paginas

Fetching around 2,500 records, as in the original article, would take 84 iterations at 30 records per page.
However, the API limits search to the first 1,000 results, so we will use 34 iterations.
This can be checked by requesting page 35 of the results, which falls beyond that limit.

In [130]:
mais_de_mil = 'https://api.github.com/search/repositories?q=stars%3A%3E1&sort=stars&order=desc&page=35'
t = requests.get(mais_de_mil, auth=credentials)
t.json()

{'documentation_url': 'https://developer.github.com/v3/search/',
 'message': 'Only the first 1000 search results are available'}
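
The iteration counts quoted above follow from ceiling division over the page size. A small sketch, assuming the default of 30 results per page and the 1,000-result cap reported by the API; the function name is illustrative:

```python
import math

# Cap reported by the API: "Only the first 1000 search results are available"
LIMITE_BUSCA = 1000

def paginas_necessarias(registros, por_pagina=30, limite=None):
    """Number of pages needed to fetch `registros` results, optionally capped at `limite`."""
    if limite is not None:
        registros = min(registros, limite)
    return math.ceil(registros / por_pagina)

print(paginas_necessarias(2500))                       # 84
print(paginas_necessarias(2500, limite=LIMITE_BUSCA))  # 34
```

Page 34 is therefore the last one that still contains results, and page 35 triggers the error shown above.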

In [131]:
def percorrendo_paginas(url):
    
    resultados = pd.DataFrame(columns=['id', 
                                       'owner_type', 
                                   'owner_url',
                                   'owner_html_url',
                                   'html_url',
                                   'url',
                                   'fork',
                                   'created_at',
                                   'updated_at',
                                   'size',
                                   'stargazers_count',
                                   'language',
                                   'has_issues',
                                   'has_wiki',
                                   'forks_count',
                                   'forks',
                                   'open_issues',
                                   'watchers',
                                   'timestamp_extract',
                                   'commits',
                                   'contributors',])
    
    print('Extraindo página:1')
    results = requests.get(url, auth=credentials)    
    data = dict(results.json())
    resultados = extrair_dados(data, resultados)

    iteracoes = 34 
    
    for iteracao in range(1, iteracoes):
        print("\n>>>>>>>> Iteracao:{0}".format(iteracao+1))
        print("Tempo atual:{0}".format(datetime.now()))

        # For unauthenticated requests the API allows 10 requests per minute;
        # for authenticated requests, 30 per minute.
        #if iteracao % 10 == 0:
        #    print("sleep 1 minuto")
        #    time.sleep(60)
        
        header = dict(results.links)
        next_url = header.get('next').get('url')
        print("Next url extraída: {0}".format(next_url))
        
        print('Extraindo página:{0}'.format(iteracao+1))
        results = requests.get(next_url, auth=credentials)
        print('Status:{0}'.format(results))
        
        data = dict(results.json())
            
        resultados = extrair_dados(data, resultados)
        
    return resultados
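
The loop above assumes that every response carries a `next` link. A hedged variant that stops as soon as the link is missing, sketched with an injected `obter` callable (an assumption made so the logic can be exercised without network access); `iterar_paginas` is a hypothetical name:

```python
def iterar_paginas(url, obter, max_paginas=34):
    """Yield the decoded JSON of each page, following the `next` link.

    `obter` is any callable that takes a URL and returns an object with
    `.json()` and `.links`, as `requests.get` does.
    """
    paginas = 0
    while url and paginas < max_paginas:
        resposta = obter(url)
        yield resposta.json()
        paginas += 1
        # `links` has no 'next' entry on the last page, so `url` becomes None
        url = resposta.links.get("next", {}).get("url")
```

With the notebook's names this could be driven by `iterar_paginas(url_final, lambda u: requests.get(u, auth=credentials))`, feeding each yielded page to `extrair_dados`.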

In [132]:
str(time.time()).split('.')[0]

'1562435050'

In [133]:
%%time
resultados = percorrendo_paginas(url_final)

Extraindo página:1

>>>>>>>> Iteracao:2
Tempo atual:2019-07-06 14:44:12.723108
Next url extraída: https://api.github.com/search/repositories?q=stars%3A%3E1&sort=stars&order=desc&page=2
Extraindo página:2
Status:<Response [200]>

>>>>>>>> Iteracao:3
Tempo atual:2019-07-06 14:44:14.959309
Next url extraída: https://api.github.com/search/repositories?q=stars%3A%3E1&sort=stars&order=desc&page=3
Extraindo página:3
Status:<Response [200]>

>>>>>>>> Iteracao:4
Tempo atual:2019-07-06 14:44:17.274794
Next url extraída: https://api.github.com/search/repositories?q=stars%3A%3E1&sort=stars&order=desc&page=4
Extraindo página:4
Status:<Response [200]>

>>>>>>>> Iteracao:5
Tempo atual:2019-07-06 14:44:19.055233
Next url extraída: https://api.github.com/search/repositories?q=stars%3A%3E1&sort=stars&order=desc&page=5
Extraindo página:5
Status:<Response [200]>

>>>>>>>> Iteracao:6
Tempo atual:2019-07-06 14:44:20.865714
Next url extraída: https://api.github.com/search/repositories?q=stars%3A%3E1&sort=sta

In [135]:
resultados.describe()

Unnamed: 0,id,owner_type,owner_url,owner_html_url,html_url,url,fork,created_at,updated_at,size,...,language,has_issues,has_wiki,forks_count,forks,open_issues,watchers,timestamp_extract,commits,contributors
count,1020,1020,1020,1020,1020,1020,1020,1020,1020,1020,...,902,1020,1020,1020,1020,1020,1020,1020,0.0,0.0
unique,1020,2,859,859,1020,1020,1,1020,1002,1003,...,40,2,2,949,949,479,989,40,0.0,0.0
top,3214406,Organization,https://api.github.com/users/facebook,https://github.com/facebook,https://github.com/gionkunz/chartist-js,https://api.github.com/repos/sindresorhus/awes...,False,2010-03-18T22:32:22Z,2019-07-06T15:32:45Z,284,...,JavaScript,True,True,1203,1203,2,12466,1562435071,,
freq,1,621,17,17,1,1,1020,1,2,2,...,327,974,685,3,3,18,2,30,,


In [136]:
resultados.to_csv('../dados/repositorios_06_07_2019.csv', index=False)

In [137]:
resultados.columns.values

array(['id', 'owner_type', 'owner_url', 'owner_html_url', 'html_url',
       'url', 'fork', 'created_at', 'updated_at', 'size',
       'stargazers_count', 'language', 'has_issues', 'has_wiki',
       'forks_count', 'forks', 'open_issues', 'watchers',
       'timestamp_extract', 'commits', 'contributors'], dtype=object)

## Extracting commits and contributors with the PyGithub library

Links:
- https://github.com/PyGithub/PyGithub

In [139]:
from github import Github
from github import Repository
from github import ContentFile
from bs4 import BeautifulSoup as bs

In [140]:
token = 'token'

In [141]:
g = Github(token)

In [143]:
resultados_csv = pd.read_csv('../dados/repositorios_06_07_2019.csv')
resultados_csv.head()

Unnamed: 0,id,owner_type,owner_url,owner_html_url,html_url,url,fork,created_at,updated_at,size,...,language,has_issues,has_wiki,forks_count,forks,open_issues,watchers,timestamp_extract,commits,contributors
0,28457823,Organization,https://api.github.com/users/freeCodeCamp,https://github.com/freeCodeCamp,https://github.com/freeCodeCamp/freeCodeCamp,https://api.github.com/repos/freeCodeCamp/free...,False,2014-12-24T17:49:19Z,2019-07-06T17:41:39Z,117553,...,JavaScript,True,False,22196,22196,1433,303703,1562435052,,
1,177736533,User,https://api.github.com/users/996icu,https://github.com/996icu,https://github.com/996icu/996.ICU,https://api.github.com/repos/996icu/996.ICU,False,2019-03-26T07:31:14Z,2019-07-06T15:23:35Z,59169,...,Rust,False,False,21325,21325,16696,246419,1562435052,,
2,11730342,Organization,https://api.github.com/users/vuejs,https://github.com/vuejs,https://github.com/vuejs/vue,https://api.github.com/repos/vuejs/vue,False,2013-07-29T03:24:51Z,2019-07-06T17:33:13Z,27671,...,JavaScript,True,True,20674,20674,312,143036,1562435052,,
3,2126244,Organization,https://api.github.com/users/twbs,https://github.com/twbs,https://github.com/twbs/bootstrap,https://api.github.com/repos/twbs/bootstrap,False,2011-07-29T21:19:00Z,2019-07-06T16:37:43Z,143925,...,JavaScript,True,False,65975,65975,366,134418,1562435052,,
4,10270250,Organization,https://api.github.com/users/facebook,https://github.com/facebook,https://github.com/facebook/react,https://api.github.com/repos/facebook/react,False,2013-05-24T16:15:54Z,2019-07-06T16:14:15Z,143121,...,JavaScript,True,True,24471,24471,751,132169,1562435052,,


In [26]:
resultados_csv[355:]

Unnamed: 0,id,owner_type,owner_url,owner_html_url,html_url,url,fork,created_at,updated_at,size,...,language,has_issues,has_wiki,forks_count,forks,open_issues,watchers,commits,contributors,readme
355,90528830,User,https://api.github.com/users/Solido,https://github.com/Solido,https://github.com/Solido/awesome-flutter,https://api.github.com/repos/Solido/awesome-fl...,False,2017-05-07T11:45:27Z,2019-07-01T19:40:07Z,1372,...,Dart,False,False,2428,2428,27,19302,1189.0,173,
356,18840003,Organization,https://api.github.com/users/kriasoft,https://github.com/kriasoft,https://github.com/kriasoft/react-starter-kit,https://api.github.com/repos/kriasoft/react-st...,False,2014-04-16T13:08:18Z,2019-07-01T14:13:06Z,4622,...,JavaScript,True,True,3879,3879,486,19294,,,
357,4037197,Organization,https://api.github.com/users/ycm-core,https://github.com/ycm-core,https://github.com/ycm-core/YouCompleteMe,https://api.github.com/repos/ycm-core/YouCompl...,False,2012-04-16T03:12:14Z,2019-07-01T18:53:40Z,34969,...,Python,True,True,2153,2153,52,19293,,,
358,19141383,User,https://api.github.com/users/fouber,https://github.com/fouber,https://github.com/fouber/blog,https://api.github.com/repos/fouber/blog,False,2014-04-25T09:44:42Z,2019-07-01T10:27:13Z,19107,...,,True,True,2186,2186,11,19263,,,
359,60325062,User,https://api.github.com/users/terryum,https://github.com/terryum,https://github.com/terryum/awesome-deep-learni...,https://api.github.com/repos/terryum/awesome-d...,False,2016-06-03T06:48:30Z,2019-07-01T17:50:13Z,147,...,TeX,True,True,3703,3703,28,19196,,,
360,8859474,User,https://api.github.com/users/skylot,https://github.com/skylot,https://github.com/skylot/jadx,https://api.github.com/repos/skylot/jadx,False,2013-03-18T17:08:21Z,2019-07-01T14:11:30Z,13473,...,Java,True,True,2108,2108,90,19137,,,
361,6007295,User,https://api.github.com/users/fxsjy,https://github.com/fxsjy,https://github.com/fxsjy/jieba,https://api.github.com/repos/fxsjy/jieba,False,2012-09-29T07:52:01Z,2019-07-01T16:22:31Z,44251,...,Python,True,True,4969,4969,482,19127,,,
362,12244426,User,https://api.github.com/users/rstacruz,https://github.com/rstacruz,https://github.com/rstacruz/nprogress,https://api.github.com/repos/rstacruz/nprogress,False,2013-08-20T13:58:02Z,2019-07-01T14:19:58Z,344,...,JavaScript,True,True,1588,1588,114,19122,,,
363,790359,Organization,https://api.github.com/users/sequelize,https://github.com/sequelize,https://github.com/sequelize/sequelize,https://api.github.com/repos/sequelize/sequelize,False,2010-07-22T07:11:11Z,2019-07-01T17:54:03Z,25711,...,JavaScript,True,False,3005,3005,537,19120,,,
364,33895378,User,https://api.github.com/users/bevacqua,https://github.com/bevacqua,https://github.com/bevacqua/dragula,https://api.github.com/repos/bevacqua/dragula,False,2015-04-13T21:35:38Z,2019-07-01T12:56:36Z,1832,...,JavaScript,True,False,1639,1639,212,19092,,,


In [145]:
id_repos = resultados_csv.id
len(id_repos)

1020

In [148]:
id_repos

0        28457823
1       177736533
2        11730342
3         2126244
4        10270250
5        45717250
6        13491895
7        21737465
8        14440270
9          291137
10        6498492
11        1062897
12         943149
13       85077558
14       60493101
15       41881900
16       29028775
17        2325298
18        9384267
19      121395510
20       21289110
21       63537249
22       31792824
23       83222441
24       27193779
25       23088740
26        2561582
27       23096959
28        3470471
29       35955666
          ...    
990      26784827
991      73929422
992      32689863
993      28704549
994        908607
995      22553797
996     103749180
997      14376285
998      19463625
999     189890377
1000      2609642
1001       312262
1002     11393110
1003     53632140
1004       138312
1005      2826918
1006     25449064
1007       331603
1008     45936895
1009     47561138
1010     18405734
1011     27835638
1012    138331573
1013     41215439
1014    12

In [231]:
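def get_contributors(url_repo, indice):
    # Scrape the repository's HTML page; several counters share this CSS class
    page = requests.get(url_repo)
    page_parse = bs(page.text, 'html.parser')
    lista = page_parse.findAll('span', attrs={'class': 'num text-emphasized'}) 
    
    # The counter at position `indice` is taken as the contributors count
    text = lista[indice].text
    # Strip spaces, newlines and thousands separators; the result stays a string and may be ''
    contributors = text.replace(" ", "").replace("\n", "").replace(",", "")
    #contributors = int(contributors)
    
    return(contributors)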
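
Because the function returns the cleaned text as-is, an empty `<span>` ends up stored as `''`. A minimal sketch of a parser that makes the missing case explicit; `parse_contagem` is a hypothetical helper, not something the notebook defines:

```python
def parse_contagem(texto):
    """Convert a scraped counter such as ' 1,060 ' to int, or None when it is empty."""
    limpo = texto.replace(" ", "").replace("\n", "").replace(",", "")
    return int(limpo) if limpo.isdigit() else None

print(parse_contagem("\n      1,060\n    "))  # 1060
print(parse_contagem(""))                      # None
```

Returning `None` instead of `''` would let `isna()` report the missing counts directly.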

In [None]:
it = 0
indice = 3 # TODO: explain why this index is used

for id_repo in id_repos:
    
    it = it + 1
    print("\nIteração: ", it)
    
    #if it % 10 == 0:
    #    print("sleep 1 minuto")
    #    time.sleep(60)

    repo = g.get_repo(id_repo)
    print("Nome do repositório:{0}".format(repo.name))

    commits = repo.get_commits().totalCount
    print("Commits: ", commits)

    url_repo = resultados_csv.loc[resultados_csv["id"] == id_repo]['html_url'].to_list()[0]
    print("Url: ", url_repo)

    contributors = get_contributors(url_repo, indice)
    print("Contributors: ", contributors)

    print("Atualizando data frame :)")
    resultados_csv.loc[resultados_csv["id"] == id_repo, 'commits'] = commits 
    resultados_csv.loc[resultados_csv["id"] == id_repo, 'contributors'] = contributors

    resultados_csv.to_csv('../dados/repositorios_06_07_2019_update.csv', index=False)

In [233]:
resultados_csv.tail()

Unnamed: 0,id,owner_type,owner_url,owner_html_url,html_url,url,fork,created_at,updated_at,size,...,language,has_issues,has_wiki,forks_count,forks,open_issues,watchers,timestamp_extract,commits,contributors
1015,45896779,Organization,https://api.github.com/users/jikexueyuanwiki,https://github.com/jikexueyuanwiki,https://github.com/jikexueyuanwiki/tensorflow-zh,https://api.github.com/repos/jikexueyuanwiki/t...,False,2015-11-10T07:59:30Z,2019-07-06T08:10:34Z,44266,...,TeX,True,True,4090,4090,28,10977,1562435119,714.0,
1016,58327877,Organization,https://api.github.com/users/browsh-org,https://github.com/browsh-org,https://github.com/browsh-org/browsh,https://api.github.com/repos/browsh-org/browsh,False,2016-05-08T19:34:15Z,2019-07-06T16:21:52Z,7371,...,JavaScript,True,True,308,308,108,10972,1562435119,372.0,18.0
1017,43482568,Organization,https://api.github.com/users/jobbole,https://github.com/jobbole,https://github.com/jobbole/awesome-java-cn,https://api.github.com/repos/jobbole/awesome-j...,False,2015-10-01T06:53:06Z,2019-07-06T16:17:25Z,257,...,,True,True,3879,3879,1,10971,1562435119,239.0,
1018,63935685,User,https://api.github.com/users/gztchan,https://github.com/gztchan,https://github.com/gztchan/awesome-design,https://api.github.com/repos/gztchan/awesome-d...,False,2016-07-22T08:06:02Z,2019-07-06T17:22:52Z,241,...,,True,True,881,881,8,10968,1562435119,194.0,
1019,29900673,User,https://api.github.com/users/phanan,https://github.com/phanan,https://github.com/phanan/htaccess,https://api.github.com/repos/phanan/htaccess,False,2015-01-27T06:25:41Z,2019-07-05T01:35:41Z,211,...,,True,True,1144,1144,6,10968,1562435119,156.0,30.0


# Checking the extracted data

In [170]:
resultados_csv.isna().sum()

id                     0
owner_type             0
owner_url              0
owner_html_url         0
html_url               0
url                    0
fork                   0
created_at             0
updated_at             0
size                   0
stargazers_count       0
language             118
has_issues             0
has_wiki               0
forks_count            0
forks                  0
open_issues            0
watchers               0
timestamp_extract      0
commits                0
contributors           0
dtype: int64

### Repositories where the language comes back as NaN
Apparently these are repositories of course notes, lists, and texts.

In [175]:
resultados_csv.loc[resultados_csv['language'].isnull()]

Unnamed: 0,id,owner_type,owner_url,owner_html_url,html_url,url,fork,created_at,updated_at,size,...,language,has_issues,has_wiki,forks_count,forks,open_issues,watchers,timestamp_extract,commits,contributors
6,13491895,Organization,https://api.github.com/users/EbookFoundation,https://github.com/EbookFoundation,https://github.com/EbookFoundation/free-progra...,https://api.github.com/repos/EbookFoundation/f...,False,2013-10-11T06:50:37Z,2019-07-06T17:40:54Z,4868,...,,True,True,31180,31180,42,125013,1562435052,5017.0,1060
7,21737465,User,https://api.github.com/users/sindresorhus,https://github.com/sindresorhus,https://github.com/sindresorhus/awesome,https://api.github.com/repos/sindresorhus/awesome,False,2014-07-11T13:42:37Z,2019-07-06T17:41:01Z,729,...,,True,False,14724,14724,18,111599,1562435052,842.0,398
8,14440270,User,https://api.github.com/users/getify,https://github.com/getify,https://github.com/getify/You-Dont-Know-JS,https://api.github.com/repos/getify/You-Dont-K...,False,2013-11-16T02:37:24Z,2019-07-06T17:00:48Z,7285,...,,True,True,20641,20641,257,104485,1562435052,1487.0,160
11,1062897,Organization,https://api.github.com/users/github,https://github.com/github,https://github.com/github/gitignore,https://api.github.com/repos/github/gitignore,False,2010-11-08T20:17:14Z,2019-07-06T16:41:30Z,1956,...,,False,False,42421,42421,95,86170,1562435052,3138.0,1068
13,85077558,User,https://api.github.com/users/kamranahmedse,https://github.com/kamranahmedse,https://github.com/kamranahmedse/developer-roa...,https://api.github.com/repos/kamranahmedse/dev...,False,2017-03-15T13:45:52Z,2019-07-06T17:40:33Z,23113,...,,True,True,13358,13358,181,83991,1562435052,201.0,20
14,60493101,User,https://api.github.com/users/jwasham,https://github.com/jwasham,https://github.com/jwasham/coding-interview-un...,https://api.github.com/repos/jwasham/coding-in...,False,2016-06-06T02:34:12Z,2019-07-06T17:28:20Z,9499,...,,True,False,23501,23501,62,79326,1562435052,1273.0,132
29,35955666,User,https://api.github.com/users/jlevy,https://github.com/jlevy,https://github.com/jlevy/the-art-of-command-line,https://api.github.com/repos/jlevy/the-art-of-...,False,2015-05-20T15:11:03Z,2019-07-06T17:39:42Z,2557,...,,True,False,6528,6528,149,60235,1562435052,1185.0,153
38,14098069,User,https://api.github.com/users/justjavac,https://github.com/justjavac,https://github.com/justjavac/free-programming-...,https://api.github.com/repos/justjavac/free-pr...,False,2013-11-04T01:59:19Z,2019-07-06T16:42:32Z,848,...,,True,False,16955,16955,164,51994,1562435054,885.0,
57,44571718,Organization,https://api.github.com/users/vuejs,https://github.com/vuejs,https://github.com/vuejs/awesome-vue,https://api.github.com/repos/vuejs/awesome-vue,False,2015-10-20T00:16:14Z,2019-07-06T17:40:37Z,4019,...,,True,False,6475,6475,103,46726,1562435054,2748.0,1495
59,132750724,User,https://api.github.com/users/danistefanovic,https://github.com/danistefanovic,https://github.com/danistefanovic/build-your-o...,https://api.github.com/repos/danistefanovic/bu...,False,2018-05-09T12:03:18Z,2019-07-06T16:21:05Z,821,...,,True,False,3131,3131,23,45917,1562435054,355.0,53


### Some contributors were not extracted. Why?

In [234]:
contributors_faill = resultados_csv.loc[resultados_csv['contributors'] == '']
contributors_faill

Unnamed: 0,id,owner_type,owner_url,owner_html_url,html_url,url,fork,created_at,updated_at,size,...,language,has_issues,has_wiki,forks_count,forks,open_issues,watchers,timestamp_extract,commits,contributors
115,211666,Organization,https://api.github.com/users/nodejs,https://github.com/nodejs,https://github.com/nodejs/node-v0.x-archive,https://api.github.com/repos/nodejs/node-v0.x-...,False,2009-05-27T16:29:46Z,2019-07-06T12:14:12Z,144238,...,,True,True,7779,7779,571,35462,1562435059,2.0,
122,69629434,Organization,https://api.github.com/users/FreeCodeCampChina,https://github.com/FreeCodeCampChina,https://github.com/FreeCodeCampChina/freecodec...,https://api.github.com/repos/FreeCodeCampChina...,False,2016-09-30T03:13:43Z,2019-07-06T17:35:11Z,30794,...,CSS,True,True,1302,1302,125,34804,1562435060,8843.0,
147,5271882,User,https://api.github.com/users/astaxie,https://github.com/astaxie,https://github.com/astaxie/build-web-applicati...,https://api.github.com/repos/astaxie/build-web...,False,2012-08-02T11:49:35Z,2019-07-06T16:21:16Z,38118,...,Go,True,True,8578,8578,96,30990,1562435060,5093.0,
161,10187082,User,https://api.github.com/users/Unitech,https://github.com/Unitech,https://github.com/Unitech/pm2,https://api.github.com/repos/Unitech/pm2,False,2013-05-21T03:25:25Z,2019-07-06T12:40:45Z,12088,...,JavaScript,True,False,1990,1990,712,29677,1562435062,4545.0,
201,10865436,User,https://api.github.com/users/dypsilon,https://github.com/dypsilon,https://github.com/dypsilon/frontend-dev-bookm...,https://api.github.com/repos/dypsilon/frontend...,False,2013-06-22T13:23:55Z,2019-07-06T12:55:00Z,1048,...,,True,False,4229,4229,108,26069,1562435065,390.0,
207,10788737,Organization,https://api.github.com/users/codepath,https://github.com/codepath,https://github.com/codepath/android_guides,https://api.github.com/repos/codepath/android_...,False,2013-06-19T10:24:45Z,2019-07-06T14:18:41Z,1093,...,,True,True,6183,6183,143,25544,1562435065,54.0,
215,75830968,Organization,https://api.github.com/users/exacity,https://github.com/exacity,https://github.com/exacity/deeplearningbook-ch...,https://api.github.com/repos/exacity/deeplearn...,False,2016-12-07T11:46:51Z,2019-07-06T14:05:33Z,8859,...,TeX,True,True,7379,7379,41,25293,1562435067,727.0,
220,68957920,User,https://api.github.com/users/justjavac,https://github.com/justjavac,https://github.com/justjavac/awesome-wechat-weapp,https://api.github.com/repos/justjavac/awesome...,False,2016-09-22T20:04:48Z,2019-07-06T12:00:18Z,659,...,,True,False,5395,5395,2,25063,1562435067,515.0,
231,136328388,User,https://api.github.com/users/imhuay,https://github.com/imhuay,https://github.com/imhuay/Algorithm_Interview_...,https://api.github.com/repos/imhuay/Algorithm_...,False,2018-06-06T12:53:14Z,2019-07-06T16:32:45Z,233283,...,Python,True,True,7465,7465,26,24267,1562435067,517.0,
242,10446890,Organization,https://api.github.com/users/bilibili,https://github.com/bilibili,https://github.com/bilibili/ijkplayer,https://api.github.com/repos/bilibili/ijkplayer,False,2013-06-03T04:12:04Z,2019-07-06T11:05:33Z,8034,...,C,True,True,6293,6293,2276,23859,1562435069,2584.0,


In [235]:
contributors_faill[['id', 'html_url', 'contributors']]

Unnamed: 0,id,html_url,contributors
115,211666,https://github.com/nodejs/node-v0.x-archive,
122,69629434,https://github.com/FreeCodeCampChina/freecodec...,
147,5271882,https://github.com/astaxie/build-web-applicati...,
161,10187082,https://github.com/Unitech/pm2,
201,10865436,https://github.com/dypsilon/frontend-dev-bookm...,
207,10788737,https://github.com/codepath/android_guides,
215,75830968,https://github.com/exacity/deeplearningbook-ch...,
220,68957920,https://github.com/justjavac/awesome-wechat-weapp,
231,136328388,https://github.com/imhuay/Algorithm_Interview_...,
242,10446890,https://github.com/bilibili/ijkplayer,


In [245]:
page = requests.get('https://github.com/nodejs/node-v0.x-archive')
page_parse = bs(page.text, 'html.parser')
lista = page_parse.findAll('span', attrs={'class': 'num text-emphasized'}) 
lista

[<span class="num text-emphasized">
                 2
               </span>, <span class="num text-emphasized">
               143
             </span>, <span class="num text-emphasized">
               263
             </span>, <span class="num text-emphasized"></span>]

In [246]:
len(lista)

4

In [248]:
lista[3]

<span class="num text-emphasized"></span>

In [249]:
lista[2]

<span class="num text-emphasized">
              263
            </span>

In [247]:
text = lista[3].text
contributors = text.replace(" ", "").replace("\n", "").replace(",", "")
contributors

''

Running the code step by step with the URLs of the repositories whose
contributors field was left empty ('') did not reveal any error.

I will run the extraction function again.

In [253]:
id_repos = resultados_csv.loc[resultados_csv['contributors'] == '']['id']
len(id_repos)

146

In [255]:
it = 0
total = len(id_repos)

for id_repo in id_repos:
    
    it = it + 1
    print("\nIteração:{0}/{1}".format(it, total))

    url_repo = resultados_csv.loc[resultados_csv["id"] == id_repo]['html_url'].to_list()[0]
    print("Url: ", url_repo)

    contributors = get_contributors(url_repo, 3)
    print("Contributors: ", contributors)

    print("Atualizando data frame :)")
    resultados_csv.loc[resultados_csv["id"] == id_repo, 'contributors'] = contributors

    resultados_csv.to_csv('../dados/repositorios_06_07_2019_update.csv', index=False)


Iteração:1/146
Url:  https://github.com/nodejs/node-v0.x-archive
Contributors:  
Atualizando data frame :)

Iteração:2/146
Url:  https://github.com/codepath/android_guides
Contributors:  
Atualizando data frame :)

Iteração:3/146
Url:  https://github.com/exacity/deeplearningbook-chinese
Contributors:  
Atualizando data frame :)

Iteração:4/146
Url:  https://github.com/justjavac/awesome-wechat-weapp
Contributors:  146
Atualizando data frame :)

Iteração:5/146
Url:  https://github.com/imhuay/Algorithm_Interview_Notes-Chinese
Contributors:  
Atualizando data frame :)

Iteração:6/146
Url:  https://github.com/shengxinjing/programmer-job-blacklist
Contributors:  
Atualizando data frame :)

Iteração:7/146
Url:  https://github.com/xitu/gold-miner
Contributors:  
Atualizando data frame :)

Iteração:8/146
Url:  https://github.com/aosabook/500lines
Contributors:  
Atualizando data frame :)

Iteração:9/146
Url:  https://github.com/raywenderlich/swift-algorithm-club
Contributors:  225
Atualizando 

Contributors:  
Atualizando data frame :)

Iteração:77/146
Url:  https://github.com/AllThingsSmitty/css-protips
Contributors:  
Atualizando data frame :)

Iteração:78/146
Url:  https://github.com/swoole/swoole-src
Contributors:  109
Atualizando data frame :)

Iteração:79/146
Url:  https://github.com/android10/Android-CleanArchitecture
Contributors:  9
Atualizando data frame :)

Iteração:80/146
Url:  https://github.com/nswbmw/N-blog
Contributors:  
Atualizando data frame :)

Iteração:81/146
Url:  https://github.com/PerfectlySoft/Perfect
Contributors:  
Atualizando data frame :)

Iteração:82/146
Url:  https://github.com/dvajs/dva
Contributors:  
Atualizando data frame :)

Iteração:83/146
Url:  https://github.com/1c7/chinese-independent-developer
Contributors:  
Atualizando data frame :)

Iteração:84/146
Url:  https://github.com/realm/realm-cocoa
Contributors:  104
Atualizando data frame :)

Iteração:85/146
Url:  https://github.com/FormidableLabs/webpack-dashboard
Contributors:  
Atualiza
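
Some repositories only returned a value on this second pass, while others stayed empty. A rough sketch of a bounded retry wrapper, assuming the empty values are transient (the notebook does not establish the cause); `tentar_ate_preencher` and its parameters are hypothetical:

```python
import time

def tentar_ate_preencher(funcao, tentativas=3, espera=2.0, dormir=time.sleep):
    """Call `funcao()` until it returns a non-empty string or attempts run out."""
    valor = ""
    for tentativa in range(tentativas):
        valor = funcao()
        if valor != "":
            break
        if tentativa < tentativas - 1:
            dormir(espera)  # pause before the next attempt
    return valor
```

Inside the loop above it could be used as `tentar_ate_preencher(lambda: get_contributors(url_repo, 3))`. If the value is still `''` after the last attempt, it is returned unchanged so the row can be inspected later.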

Fetching the README

In [None]:
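repo = g.get_repo("freeCodeCamp/freeCodeCamp")  # get_repo expects "owner/name" or a numeric id
contents = repo.get_contents("README.md")
print(contents.decoded_content.decode("utf-8"))  # decoded_content is bytes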
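
For context, the GitHub contents endpoint returns file bodies base64-encoded, and `decoded_content` in PyGithub yields the decoded bytes. A small offline sketch of that decoding step, using a made-up payload rather than a real API response; `decodificar_conteudo` is a hypothetical name:

```python
import base64

def decodificar_conteudo(payload):
    """Decode the `content` field of a contents-style payload into text."""
    if payload.get("encoding") != "base64":
        raise ValueError("unexpected encoding: {0}".format(payload.get("encoding")))
    return base64.b64decode(payload["content"]).decode("utf-8")

# Made-up payload for illustration, shaped like the API's JSON
payload = {"encoding": "base64", "content": base64.b64encode(b"# Hello").decode("ascii")}
print(decodificar_conteudo(payload))  # # Hello
```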