# Extracting data from GitHub

Using the [GitHub API](https://developer.github.com/v3/) to search for initiatives and projects that use Open Government Data.

In [1]:
import requests
import pandas as pd
import time
import logging

Configuring logging to write to a log file

In [2]:
logging.basicConfig(level=logging.DEBUG, 
                    filename="log_file.txt", 
                    filemode="a+",
                    format="%(asctime)s - %(levelname)s - %(funcName)s - %(message)s")

logging.info("Extracting data from GitHub")

In [3]:
search_strings = [
            'dados abertos',
            'dados abertos brasil',
            'dados abertos governo',
            'dados abertos governamentais',
            'dados governamentais',
            'dados publicos abertos',
            'dados do governo',
            'analise de dados do governo',
            'analise de dados governamentais',
            'portal de dados do governo',
            'portal de dados governamentais',
            'portal publico do governo',
            'portal de dados abertos do governo',
        ]

Some GitHub API features, such as a higher request limit, require authentication. Information about authentication can be found [here](https://developer.github.com/v3/#authentication).

In [5]:
credentials = ('<user_name>','<token>')
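
To keep the token out of the notebook, the credentials tuple could be read from environment variables instead of being typed into a cell. This is only a sketch: the variable names `GITHUB_USER` and `GITHUB_TOKEN` and the helper `load_credentials` are assumptions made for illustration, not something defined by GitHub or by this notebook.

```python
import os


def load_credentials(env=os.environ):
    """Return a (user, token) tuple for requests' `auth=` argument, or None.

    GITHUB_USER and GITHUB_TOKEN are hypothetical variable names chosen
    for this sketch.
    """
    user = env.get("GITHUB_USER")
    token = env.get("GITHUB_TOKEN")
    if user and token:
        return (user, token)
    # No credentials found: requests accepts auth=None and sends the
    # request unauthenticated (with the lower request limit).
    return None


credentials = load_credentials()
```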

Request limit without authentication

In [6]:
t = requests.get('https://api.github.com/rate_limit')
t.json()

{'rate': {'limit': 60, 'remaining': 58, 'reset': 1580332669},
 'resources': {'core': {'limit': 60, 'remaining': 58, 'reset': 1580332669},
  'graphql': {'limit': 0, 'remaining': 0, 'reset': 1580334748},
  'integration_manifest': {'limit': 5000,
   'remaining': 5000,
   'reset': 1580334748},
  'search': {'limit': 10, 'remaining': 10, 'reset': 1580331208},
  'source_import': {'limit': 5, 'remaining': 5, 'reset': 1580331208}}}

Request limit with authentication

In [7]:
t = requests.get('https://api.github.com/rate_limit', auth=credentials)
t.json()

{'rate': {'limit': 5000, 'remaining': 5000, 'reset': 1580334752},
 'resources': {'core': {'limit': 5000, 'remaining': 5000, 'reset': 1580334752},
  'graphql': {'limit': 5000, 'remaining': 5000, 'reset': 1580334752},
  'integration_manifest': {'limit': 5000,
   'remaining': 5000,
   'reset': 1580334752},
  'search': {'limit': 30, 'remaining': 30, 'reset': 1580331212},
  'source_import': {'limit': 100, 'remaining': 100, 'reset': 1580331212}}}
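
The `reset` values in these responses are Unix timestamps, so a script could work out how long to pause once a quota is used up. The helper below is a minimal sketch under that reading; `seconds_until_reset` is a hypothetical name, and the sample dictionary only mimics the shape of the response shown above.

```python
import time


def seconds_until_reset(rate_limit, resource="search", now=None):
    """Return how many seconds to wait before `resource` has quota again.

    `rate_limit` is the decoded JSON of the /rate_limit endpoint.
    Returns 0 when requests remain or the window has already reset.
    """
    info = rate_limit["resources"][resource]
    if info["remaining"] > 0:
        return 0
    now = time.time() if now is None else now
    return max(0, info["reset"] - int(now))


# Sample shaped like the authenticated response, with the search quota spent.
sample = {"resources": {"search": {"limit": 30, "remaining": 0, "reset": 1580331212}}}
wait = seconds_until_reset(sample, now=1580331200)
```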

Checking the API's limit on data extraction

In [8]:
page_35 = 'https://api.github.com/search/repositories?q=stars%3A%3E1&sort=stars&order=desc&page=35'
t = requests.get(page_35, auth=credentials)
t.json()

{'documentation_url': 'https://developer.github.com/v3/search/',
 'message': 'Only the first 1000 search results are available'}
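
Given the 1000-result cap in that message, the number of pages worth requesting can be computed up front. The sketch below assumes the default page size of 30 results (the same value the pagination code in this notebook divides by); `reachable_pages` is a hypothetical helper.

```python
import math

MAX_SEARCH_RESULTS = 1000  # cap stated in the API message above


def reachable_pages(total_count, per_page=30):
    """Return how many result pages can actually be fetched."""
    reachable = min(total_count, MAX_SEARCH_RESULTS)
    return math.ceil(reachable / per_page)
```

With 30 results per page this gives 34 pages for any query with 1000 or more matches, which is consistent with page 35 being rejected.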

Information about the API's search feature can be found [here](https://developer.github.com/v3/search/).

In [9]:
url_base = 'https://api.github.com/search/repositories?q='

We can sort the results, for example by number of _stars_ in descending order.

In [10]:
sort = '&sort=stars&order=desc'
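
As an alternative to concatenating string fragments, the query string could be assembled with the standard library so that spaces and accented characters are always encoded. This is a sketch, not the approach used in the cells of this notebook; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(query, sort="stars", order="desc"):
    """Return a repository search URL with an encoded query string."""
    params = {"q": query, "sort": sort, "order": order}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("dados abertos")
```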

## Extracting general information

In [11]:
def extract_results(data):
    
    items_list = []
    
    # Default to an empty list so an error payload without 'items' does not raise.
    for item in data.get('items', []):
        
        
        item_dict = {
                'id': item.get('id'),
                'full_name': item.get('full_name', None),
                'description': item.get('description', None),      
                'owner_type': item.get('owner').get('type', None),
                'owner_api_url': item.get('owner').get('url', None),
                'owner_url': item.get('owner').get('html_url', None),
                'api_url': item.get('url', None),
                'url': item.get('html_url', None),
                'fork': item.get('fork', None),
                'created_at': item.get('created_at', None),
                'updated_at': item.get('updated_at', None),
                'pushed_at': item.get('pushed_at', None),
                'size': item.get('size', None),
                'stargazers_count': item.get('stargazers_count', None),
                'language': item.get('language', None),
                'has_issues': item.get('has_issues', None),
                'has_wiki': item.get('has_wiki', None),
                'forks_count': item.get('forks_count', None),
                'forks': item.get('forks', None),
                'open_issues': item.get('open_issues', None),
                'license': item.get('license').get('name', None) if item.get('license', None) else None,
                'timestamp_extract': str(time.time()).split('.')[0]
        }

        items_list.append(item_dict)
            
    return items_list
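
Chained lookups such as `item.get('owner').get('type')` raise an `AttributeError` when the intermediate value is `None`, which is why the `license` field needs a conditional. A small helper could handle every nested field the same way. This is a sketch; `dig` is a hypothetical name and not part of the notebook.

```python
def dig(mapping, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any step is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


item = {"owner": {"type": "Organization"}, "license": None}
owner_type = dig(item, "owner", "type")
license_name = dig(item, "license", "name")
```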

In [14]:
def scroll_pages(url):
        
    results = requests.get(url, auth=credentials)    
    data = results.json()
    # Error payloads (for example when rate limited) carry no 'total_count'.
    total = data.get('total_count', 0)

    logging.info("Found {0} results. Extracting...".format(total))

    items_list = extract_results(data)

    # Follow the 'next' link of the Link header until there are no more pages.
    while results.links.get('next'):
        next_url = results.links['next']['url']

        results = requests.get(next_url, auth=credentials)
        data = results.json()

        items_list = items_list + extract_results(data)
    
    return items_list
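
The page-following logic can also be written as a generator that is independent of how the request is made, which makes it testable without network access. The sketch below assumes `fetch` returns an object with `.json()` and `.links`, as `requests.get` does; `iter_pages` and the `max_pages` default (34 pages of 30 results, matching the 1000-result cap) are choices made for this illustration.

```python
def iter_pages(first_url, fetch, max_pages=34):
    """Yield the decoded JSON of each page, following 'next' links.

    `fetch` is any callable that takes a URL and returns a response-like
    object, for example `lambda u: requests.get(u, auth=credentials)`.
    """
    url = first_url
    for _ in range(max_pages):
        response = fetch(url)
        yield response.json()
        next_link = response.links.get("next")
        if not next_link:
            break
        url = next_link["url"]
```

The results of `extract_results` could then be accumulated with one loop over `iter_pages(url, fetch)`.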

In [15]:
%%time

items_list = []
repositories_df = None

for string in search_strings:
    url = url_base + string + sort
    
    logging.info("Searching repositories for the string: '{0}'".format(string))
    
    items_list = items_list + scroll_pages(url)
        
repositories_df = pd.DataFrame(items_list)

CPU times: user 845 ms, sys: 52 ms, total: 897 ms
Wall time: 42.5 s


Number of results:

In [17]:
len(repositories_df)

608

Removing duplicate records, since different search strings can lead to the same repository.

In [18]:
repositories_df = repositories_df.drop_duplicates(['id', 'api_url'])
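
The same de-duplication could be done on the list of dictionaries before the DataFrame is built, keeping the first occurrence of each repository. This is a plain-Python sketch of that idea, not a replacement for `drop_duplicates`; `dedupe_items` is a hypothetical helper.

```python
def dedupe_items(items, keys=("id", "api_url")):
    """Return `items` without repeats, keeping the first of each key tuple."""
    seen = set()
    unique = []
    for item in items:
        marker = tuple(item.get(key) for key in keys)
        if marker not in seen:
            seen.add(marker)
            unique.append(item)
    return unique


found = [
    {"id": 1, "api_url": "a"},
    {"id": 2, "api_url": "b"},
    {"id": 1, "api_url": "a"},  # same repository found by another search string
]
unique_found = dedupe_items(found)
```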

In [20]:
len(repositories_df)

445

In [21]:
repositories_df.describe()

Unnamed: 0,forks,forks_count,id,open_issues,size,stargazers_count
count,445.0,445.0,445.0,445.0,445.0,445.0
mean,0.802247,0.802247,117762300.0,0.961798,9735.101124,2.752809
std,3.33376,3.33376,68391660.0,10.072524,42848.278656,12.182435
min,0.0,0.0,813115.0,0.0,0.0,0.0
25%,0.0,0.0,56194710.0,0.0,23.0,0.0
50%,0.0,0.0,124591300.0,0.0,350.0,0.0
75%,0.0,0.0,171327400.0,0.0,3295.0,1.0
max,44.0,44.0,236543400.0,209.0,555527.0,141.0


Number of columns:

In [22]:
len(repositories_df.columns)

22

In [25]:
repositories_df.head()

Unnamed: 0,api_url,created_at,description,fork,forks,forks_count,full_name,has_issues,has_wiki,id,...,open_issues,owner_api_url,owner_type,owner_url,pushed_at,size,stargazers_count,timestamp_extract,updated_at,url
0,https://api.github.com/repos/CamaraDosDeputado...,2015-01-14T17:32:49Z,Repositório do serviço de Dados Abertos da Câm...,False,7,7,CamaraDosDeputados/dados-abertos,True,True,29256552,...,209,https://api.github.com/users/CamaraDosDeputados,Organization,https://github.com/CamaraDosDeputados,2019-12-13T15:13:19Z,34007,141,1580331534,2020-01-23T14:21:59Z,https://github.com/CamaraDosDeputados/dados-ab...
1,https://api.github.com/repos/dadosgovbr/catalo...,2015-07-17T14:02:34Z,Mapeamento de iniciativas (e catálogos) de dad...,False,40,40,dadosgovbr/catalogos-dados-brasil,True,True,39256926,...,1,https://api.github.com/users/dadosgovbr,Organization,https://github.com/dadosgovbr,2019-12-02T18:58:47Z,91,139,1580331534,2020-01-19T19:44:00Z,https://github.com/dadosgovbr/catalogos-dados-...
2,https://api.github.com/repos/prefeiturasp/dado...,2016-11-10T13:35:40Z,Análises e tutoriais das bases de dados aberto...,False,18,18,prefeiturasp/dados-educacao,True,True,73385196,...,1,https://api.github.com/users/prefeiturasp,Organization,https://github.com/prefeiturasp,2019-10-02T18:43:26Z,2737,48,1580331534,2019-08-17T00:49:49Z,https://github.com/prefeiturasp/dados-educacao
3,https://api.github.com/repos/dadosgovbr/aplica...,2016-01-15T13:29:14Z,Mapeamento de aplicativos e visualizações que ...,False,9,9,dadosgovbr/aplicativos-dados-brasil,True,True,49720381,...,1,https://api.github.com/users/dadosgovbr,Organization,https://github.com/dadosgovbr,2019-08-30T02:07:18Z,2357,48,1580331534,2019-06-16T17:26:35Z,https://github.com/dadosgovbr/aplicativos-dado...
4,https://api.github.com/repos/mapaslivres/local...,2015-04-24T19:50:16Z,Dados em formato aberto sobre municípios e uni...,False,13,13,mapaslivres/localidades,True,False,34538558,...,6,https://api.github.com/users/mapaslivres,Organization,https://github.com/mapaslivres,2018-12-25T11:09:43Z,6357,41,1580331534,2019-12-11T15:03:40Z,https://github.com/mapaslivres/localidades


## Extracting _Commits_, _Contributors_, and _Owner_ data

In [26]:
# .copy() keeps an independent snapshot; plain assignment would only alias the DataFrame.
repo_copia = repositories_df.copy()

In [27]:
def extract_commits(url_repo):
    
    commits_url = url_repo + '/commits'  
    results = requests.get(commits_url, auth=credentials)
    
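    # 409 means an empty repository; any other non-200 reply is an error payload, not a list of commits.
    if results.status_code != 200:
        return None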
    
    commits = len(results.json())

    header = results.links
    
    while header.get('next', False):
        next_url = header.get('next').get('url')        
        results = requests.get(next_url, auth=credentials)
        commits = commits + len(results.json())    
        header = results.links


    return commits
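
Walking every page can be slow for repositories with long histories. A commonly used shortcut is to request the listing with `per_page=1` and read the page number from the `last` link, which then equals the number of items. The sketch below only covers the parsing step and assumes the response carries a `rel="last"` link; `count_from_last_link` is a hypothetical helper, not the method used in the cells of this notebook.

```python
from urllib.parse import urlparse, parse_qs


def count_from_last_link(links, items_on_page):
    """Estimate the total item count from a per_page=1 response.

    `links` is the dict that requests exposes as `response.links`;
    `items_on_page` is len(response.json()) for that same response.
    """
    last = links.get("last")
    if not last:
        # Single page: the page itself holds everything.
        return items_on_page
    query = parse_qs(urlparse(last["url"]).query)
    return int(query["page"][0])


# Illustrative link dictionary; the URL is made up for this example.
example_links = {"last": {"url": "https://api.github.com/repositories/1/commits?per_page=1&page=43", "rel": "last"}}
total_commits = count_from_last_link(example_links, 1)
```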

In [28]:
def extract_contributors(url_repo):
    
    contributors_url = url_repo + '/contributors'
    results = requests.get(contributors_url, auth=credentials)
    
    # 204 means no content; any other non-200 reply is an error payload, not a list of contributors.
    if results.status_code != 200:
        return None
    
    contributors = len(results.json())

    header = results.links
    
    while header.get('next', False):
        next_url = header.get('next').get('url')
        results = requests.get(next_url, auth=credentials)
        contributors = contributors + len(results.json())
        header = results.links
    
    return contributors

In [29]:
def extract_owner_data(owner_api_url):
    
    results = requests.get(owner_api_url, auth=credentials)
    data = results.json()

    owner_data = {
        'owner_location': data.get('location', None),
        'owner_email': data.get('email', None),
        'owner_blog': data.get('blog', None),
        'owner_name': data.get('name', None)
    }
    
    return owner_data

In [32]:
%%time
urls = repositories_df['api_url']

for url in urls:

    owner_api_url = repositories_df.loc[repositories_df["api_url"] == url]['owner_api_url'].item()
    owner_data = extract_owner_data(owner_api_url)
    commits = extract_commits(url)
    contributors = extract_contributors(url)
    
    logging.info("Repository: {0}".format(url))
    logging.info("Has {0} Commits - {1} Contributors".format(commits,contributors))
    logging.info("Owner location: {0}".format(owner_data.get('owner_location')))

    repositories_df.loc[repositories_df["api_url"] == url, 'commits'] = commits
    repositories_df.loc[repositories_df["api_url"] == url, 'contributors'] = contributors
    repositories_df.loc[repositories_df["api_url"] == url, 'owner_location'] = owner_data.get('owner_location')
    repositories_df.loc[repositories_df["api_url"] == url, 'owner_email'] = owner_data.get('owner_email')
    repositories_df.loc[repositories_df["api_url"] == url, 'owner_blog'] = owner_data.get('owner_blog')
    repositories_df.loc[repositories_df["api_url"] == url, 'owner_name'] = owner_data.get('owner_name')

CPU times: user 1min 12s, sys: 2.84 s, total: 1min 15s
Wall time: 20min 52s
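
Several repositories share an owner (for example `dadosgovbr` appears more than once in the table above), so the owner lookup repeats requests. One possible way to cut the run time is to memoize the lookup by URL. The sketch below uses a stand-in function so it runs offline; in the notebook the wrapped function would be `extract_owner_data`.

```python
from functools import lru_cache

calls = []


def fake_extract_owner_data(owner_api_url):
    # Stand-in for extract_owner_data so the sketch runs without network access.
    calls.append(owner_api_url)
    return {"owner_name": owner_api_url.rsplit("/", 1)[-1]}


# Each distinct URL is fetched once; repeats are served from the cache.
cached_owner_data = lru_cache(maxsize=None)(fake_extract_owner_data)

owner_urls = [
    "https://api.github.com/users/dadosgovbr",
    "https://api.github.com/users/prefeiturasp",
    "https://api.github.com/users/dadosgovbr",
]
owners = [cached_owner_data(u) for u in owner_urls]
```

Note that cached calls return the same dictionary object, so it should be treated as read-only.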


We should now have 6 more columns.

In [34]:
len(repositories_df.columns)

28

In [35]:
repositories_df.head()

Unnamed: 0,api_url,created_at,description,fork,forks,forks_count,full_name,has_issues,has_wiki,id,...,stargazers_count,timestamp_extract,updated_at,url,commits,contributors,owner_location,owner_email,owner_blog,owner_name
0,https://api.github.com/repos/CamaraDosDeputado...,2015-01-14T17:32:49Z,Repositório do serviço de Dados Abertos da Câm...,False,7,7,CamaraDosDeputados/dados-abertos,True,True,29256552,...,141,1580331534,2020-01-23T14:21:59Z,https://github.com/CamaraDosDeputados/dados-ab...,43.0,4.0,Brazil,,http://www.camara.leg.br,Câmara dos Deputados do Brasil
1,https://api.github.com/repos/dadosgovbr/catalo...,2015-07-17T14:02:34Z,Mapeamento de iniciativas (e catálogos) de dad...,False,40,40,dadosgovbr/catalogos-dados-brasil,True,True,39256926,...,139,1580331534,2020-01-19T19:44:00Z,https://github.com/dadosgovbr/catalogos-dados-...,50.0,5.0,Brazil,,dados.gov.br,dados.gov.br
2,https://api.github.com/repos/prefeiturasp/dado...,2016-11-10T13:35:40Z,Análises e tutoriais das bases de dados aberto...,False,18,18,prefeiturasp/dados-educacao,True,True,73385196,...,48,1580331534,2019-08-17T00:49:49Z,https://github.com/prefeiturasp/dados-educacao,18.0,2.0,"São Paulo, SP",tecnologia@prefeitura.sp.gov.br,http://www.capital.sp.gov.br,Prefeitura Municipal de São Paulo
3,https://api.github.com/repos/dadosgovbr/aplica...,2016-01-15T13:29:14Z,Mapeamento de aplicativos e visualizações que ...,False,9,9,dadosgovbr/aplicativos-dados-brasil,True,True,49720381,...,48,1580331534,2019-06-16T17:26:35Z,https://github.com/dadosgovbr/aplicativos-dado...,40.0,5.0,Brazil,,dados.gov.br,dados.gov.br
4,https://api.github.com/repos/mapaslivres/local...,2015-04-24T19:50:16Z,Dados em formato aberto sobre municípios e uni...,False,13,13,mapaslivres/localidades,True,False,34538558,...,41,1580331534,2019-12-11T15:03:40Z,https://github.com/mapaslivres/localidades,69.0,4.0,,,,


Checking for null values

In [36]:
len(repositories_df.loc[repositories_df['commits'].isnull()][['api_url', 'commits', 'contributors']])

17

Some repositories really have no commits, such as [ccdpoa/ocida](https://github.com/ccdpoa/ocida).

In [42]:
repositories_df.loc[repositories_df['contributors'].isnull()][['id', 'url', 'api_url', 'commits', 'contributors']]

Unnamed: 0,id,url,api_url,commits,contributors
155,193599332,https://github.com/renatachagasc/Api-DadosAber...,https://api.github.com/repos/renatachagasc/Api...,,
162,13354150,https://github.com/ccdpoa/ocida,https://api.github.com/repos/ccdpoa/ocida,,
167,176943542,https://github.com/Kassio-Ferreira/perfil_raci...,https://api.github.com/repos/Kassio-Ferreira/p...,,
189,212679438,https://github.com/GabrielLimaSnT/ProjetoOpeDa...,https://api.github.com/repos/GabrielLimaSnT/Pr...,,
217,209605170,https://github.com/DINALVAGOMES/INSS_dados-abe...,https://api.github.com/repos/DINALVAGOMES/INSS...,,
225,144884415,https://github.com/eduponto21/dados_abertos_cu...,https://api.github.com/repos/eduponto21/dados_...,,
245,86007528,https://github.com/MarxSteel/MDIO-InteractBrasil,https://api.github.com/repos/MarxSteel/MDIO-In...,,
249,59862291,https://github.com/yelken/livecity,https://api.github.com/repos/yelken/livecity,,
291,128829155,https://github.com/danielmbicalho/Dados_reposi...,https://api.github.com/repos/danielmbicalho/Da...,,
326,69135911,https://github.com/klismark/vigieSeuDeputado,https://api.github.com/repos/klismark/vigieSeu...,,


Saving the repositories with a null contributor count (their commit count is null as well) for manual verification later.

In [41]:
null_commits = repositories_df.loc[repositories_df['contributors'].isnull()][['id', 'url', 'api_url', 'commits', 'contributors']]

In [44]:
null_commits.to_csv('../data/repositories_with_null_commits_' + str(time.time()).split('.')[0] + '.csv', index=False)

Saving all repositories.

In [45]:
repositories_df.to_csv('../data/repositories_' + str(time.time()).split('.')[0] + '.csv', index=False)
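
The timestamped file name is built the same way in every save cell, so it could be factored into one helper. This is a sketch; `timestamped_csv` is a hypothetical name, and `int()` truncates the fractional seconds just as the `split('.')[0]` expression does.

```python
import time


def timestamped_csv(prefix, directory="../data", now=None):
    """Return a path like '../data/<prefix>_<unix seconds>.csv'."""
    stamp = int(time.time() if now is None else now)
    return "{0}/{1}_{2}.csv".format(directory, prefix, stamp)


path = timestamped_csv("repositories", now=1580336014.9)
```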

## Extracting repository contributors

In [46]:
def get_contributors(data, repo_data):

    list_contributors = []

    for item in data:        
        contributor = {
            'repo_id': repo_data.get('repo_id', None),
            'repo_name': repo_data.get('repo_name', None),
            'repo_url': repo_data.get('repo_url', None),
            'repo_api_url': repo_data.get('repo_api_url', None),
            'contributor_id': item.get('id', None),
            'contributor_login': item.get('login', None),
            'contributor_type': item.get('type', None),
            'contributor_url': item.get('html_url', None),
            'contributor_api_url': item.get('url', None),
            'timestamp_extract': str(time.time()).split('.')[0]
        }

        list_contributors.append(contributor)

    return list_contributors

In [50]:
def scroll_contributors(url, repo_data):

    list_contributors = []
    results = requests.get(url, auth=credentials)
    
    if results.status_code == 204:
        return None
    
    data = results.json()
    list_contributors = get_contributors(data, repo_data)
    header = results.links
    
    while header.get('next', False):
        
        next_url = header.get('next').get('url')            
        results = requests.get(next_url, auth=credentials)
        data = results.json()
        list_contributors = list_contributors + get_contributors(data, repo_data)  
        header = results.links
        
    return list_contributors

In [51]:
def search_contributors(repositories_df):
    
    urls = repositories_df['api_url']
    list_contributors_all_repo = []
    
    for url in urls:
        logging.info('Extracting contributors from: {0}'.format(url))
        
        repo_data = {
                'repo_id': repositories_df.loc[repositories_df["api_url"] == url, 'id'].values[0],
                'repo_name': repositories_df.loc[repositories_df["api_url"] == url, 'full_name'].values[0],
                'repo_url': repositories_df.loc[repositories_df["api_url"] == url, 'url'].values[0],
                'repo_api_url': url,
            }
        
        url_contributors = url + '/contributors'        
        contributors = scroll_contributors(url_contributors, repo_data)
        
        if contributors:
            list_contributors_all_repo = list_contributors_all_repo + contributors
    
    contributors_df = pd.DataFrame(list_contributors_all_repo)     
        
    return contributors_df

In [52]:
%%time
contributors_df = search_contributors(repositories_df)

CPU times: user 14.2 s, sys: 542 ms, total: 14.7 s
Wall time: 4min 20s


In [54]:
contributors_df.head()

Unnamed: 0,contributor_api_url,contributor_id,contributor_login,contributor_type,contributor_url,repo_api_url,repo_id,repo_name,repo_url,timestamp_extract
0,https://api.github.com/users/FabricioRocha,19875696,FabricioRocha,User,https://github.com/FabricioRocha,https://api.github.com/repos/CamaraDosDeputado...,29256552,CamaraDosDeputados/dados-abertos,https://github.com/CamaraDosDeputados/dados-ab...,1580336014
1,https://api.github.com/users/EquipeDadosAbertosCD,16920325,EquipeDadosAbertosCD,User,https://github.com/EquipeDadosAbertosCD,https://api.github.com/repos/CamaraDosDeputado...,29256552,CamaraDosDeputados/dados-abertos,https://github.com/CamaraDosDeputados/dados-ab...,1580336014
2,https://api.github.com/users/JoaoCarabetta,19963732,JoaoCarabetta,User,https://github.com/JoaoCarabetta,https://api.github.com/repos/CamaraDosDeputado...,29256552,CamaraDosDeputados/dados-abertos,https://github.com/CamaraDosDeputados/dados-ab...,1580336014
3,https://api.github.com/users/labhacker,7976552,labhacker,User,https://github.com/labhacker,https://api.github.com/repos/CamaraDosDeputado...,29256552,CamaraDosDeputados/dados-abertos,https://github.com/CamaraDosDeputados/dados-ab...,1580336014
4,https://api.github.com/users/augusto-herrmann,1058414,augusto-herrmann,User,https://github.com/augusto-herrmann,https://api.github.com/repos/dadosgovbr/catalo...,39256926,dadosgovbr/catalogos-dados-brasil,https://github.com/dadosgovbr/catalogos-dados-...,1580336015


Checking whether any contributor appears more than once for the same repository.

In [55]:
contributors_df[contributors_df.duplicated(['contributor_id', 'repo_id'])]

Unnamed: 0,contributor_api_url,contributor_id,contributor_login,contributor_type,contributor_url,repo_api_url,repo_id,repo_name,repo_url,timestamp_extract


Saving the dataframe that maps repositories to contributors.

In [57]:
contributors_df.to_csv('../data/contributors_' + str(time.time()).split('.')[0] + '.csv', index=False)