# Documentation of data collection / processing

The following sections describe the steps that were necessary to gather the data for the research on BDD glue code.
The code was executed on the following date:

In [7]:
import datetime
print(datetime.datetime.now())

2020-11-30 01:29:45.600423


# Import used libraries

In [3]:
# Used to read environment variables
import os
# Used to serialize and save Python objects
import pickle

# Used to query and display dataframes
import pandas as pd
# Can be used to make direct HTTP requests
import requests
# Library to query the GitHub API (PyGithub)
from github import Github

# Define the GitHub API query object; the token is read from the GITHUBTOKEN environment variable
g = Github(os.environ['GITHUBTOKEN'])

# Find repositories that contain Gherkin language

Unfortunately, the GitHub API only allows finding repositories (repos) by their main programming language.
Therefore, a generic query is used that returns repositories that can then be iterated over.


## Get Repository list

In [2]:
repos = g.search_repositories(query='stars:>500', sort='stars', order='desc')
repos.totalCount

1000

The PyGithub library only reports a `totalCount` of 1000, because the GitHub Search API returns at most 1,000 results per query.

If we run the same query on the GitHub website, the reported result count is different:

![./repo_count.png](./repo_count.png)

After iterating over these results, a new request has to be made.
The chosen approach is to get the star count of the last repo and limit the next search to repos whose star count lies in the range `500..<stars_of_last_repo>`.

In [29]:
repos = g.search_repositories(query='stars:500..41689', sort='stars', order='desc')
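The star-window approach described above can be sketched as a loop. This is a minimal sketch under assumptions, not the script that was actually used: `search` is a hypothetical callable standing in for `g.search_repositories`, expected to return objects with `full_name` and `stargazers_count` attributes, sorted by stars in descending order and capped at `page_cap` results per query. The names `build_query`, `collect_repos` and `page_cap` are illustrative only.

```python
def build_query(min_stars, upper=None):
    """Build the star filter: open-ended at first, then a closed range."""
    if upper is None:
        return f"stars:>{min_stars}"
    return f"stars:{min_stars}..{upper}"


def collect_repos(search, min_stars=500, page_cap=1000):
    """Collect repos by repeatedly lowering the upper star bound.

    `search` is a hypothetical stand-in for the real search call.
    """
    seen = {}
    upper = None
    while True:
        batch = list(search(build_query(min_stars, upper)))
        # Repos at the boundary star count show up again in the next
        # window, so they are deduplicated by their full name.
        new = [repo for repo in batch if repo.full_name not in seen]
        for repo in new:
            seen[repo.full_name] = repo
        # Stop when the window was not truncated, or when nothing new
        # arrived (e.g. more than page_cap repos share one star count).
        if len(batch) < page_cap or not new:
            break
        # The next window ends at the star count of the last repo seen.
        upper = min(repo.stargazers_count for repo in batch)
    return list(seen.values())
```

Reusing the boundary star count as the new upper limit returns some repos twice, which is why the sketch deduplicates by `full_name`; skipping to one star below the boundary would avoid the duplicates but could miss repos that share that star count.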

## Check if they contain gherkin

To check whether a repo contains Gherkin files, we check the list of languages returned by the GitHub API.
The following code example works for a small number of requests. Since the GitHub API has a rate limit, the script that was actually used is a little more complicated (see `find_gherkin_repos.py`).

In [32]:
contains_gherkin_files = {}
for repo in repos[:5]:
    contains_gherkin_files[repo.full_name] = 'Gherkin' in repo.get_languages()

pd.DataFrame.from_dict(contains_gherkin_files, orient='index', columns=['contains_gherkin'])

Unnamed: 0,contains_gherkin
jekyll/jekyll,True
Hack-with-Github/Awesome-Hacking,False
necolas/normalize.css,False
scutan90/DeepLearning-500-questions,False
google/material-design-icons,False
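One way to cope with the rate limit mentioned above is to retry a call after waiting. The following is a minimal sketch under assumptions and does not reproduce `find_gherkin_repos.py`: the helper name `call_with_retry`, the fixed `wait_seconds` and the `max_attempts` parameter are all hypothetical. The exception type is passed in so that the sketch does not depend on a specific library.

```python
import time


def call_with_retry(func, rate_limit_error, wait_seconds=60,
                    max_attempts=3, sleep=time.sleep):
    """Call func(); if it raises rate_limit_error, wait and try again.

    Hypothetical helper: after max_attempts failed attempts the
    exception is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except rate_limit_error:
            if attempt == max_attempts:
                raise
            # Wait before the next attempt; a fixed delay keeps the
            # sketch simple.
            sleep(wait_seconds)
```

With PyGithub, `rate_limit_error` could be `github.RateLimitExceededException` and `func` could be `lambda: 'Gherkin' in repo.get_languages()`. A more precise variant would wait until the reset time reported by the API instead of using a fixed delay.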


# Get List of Feature Files

The code used to get the feature files is shown in the following code section.
Again, this is only an example of the approach.

In [7]:
def get_feature_files(repo):
    files = g.search_code(query=f'repo:{repo.full_name} extension:feature')
    return [ file.path for file in files ]

example_repo = g.get_repo('influxdata/influxdb')
feature_files = get_feature_files(example_repo)
pd.DataFrame(feature_files, columns=['feature_files'])

Unnamed: 0,feature_files
0,e2e/features/homePage/homePage.feature
1,e2e/features/influx/influx.feature
2,e2e/features/loadData/clientlib.feature
3,e2e/features/loadData/loadData.feature
4,e2e/features/onboarding/onboarding.feature
5,e2e/features/settings/labels.feature
6,e2e/features/settings/settings.feature
7,e2e/features/settings/variables.feature
8,e2e/features/signin/signin.feature
9,e2e/features/monitoring/history.feature


<https://github.com/influxdata/influxdb/blob/master/e2e/features/homePage/homePage.feature>
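A link like the one above can be built from the repository name and a file path returned by `get_feature_files`. The following is a small sketch; the helper name `blob_url` is hypothetical, and the default branch `master` is an assumption taken from the example link, so it may not hold for other repositories.

```python
from urllib.parse import quote


def blob_url(full_name, path, branch="master"):
    """Build the GitHub web URL of a file in a repository.

    `branch` defaults to "master" as in the example link; this is an
    assumption and may differ per repository.
    """
    # quote() leaves "/" intact and escapes characters such as spaces.
    return f"https://github.com/{full_name}/blob/{quote(branch)}/{quote(path)}"


url = blob_url("influxdata/influxdb", "e2e/features/homePage/homePage.feature")
print(url)
```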