# Documentation of data collection / processing

The following sections describe the steps that were needed to gather the data for the research on BDD glue code.
The date on which the code was executed is:

In [7]:
import datetime
print(datetime.datetime.now())

2020-11-30 01:29:45.600423


# Import used libraries

In [3]:
# Library to query the GitHub API (PyGithub)
from github import Github
# Library to query and display dataframes
import pandas as pd
# Used to read environment variables
import os
# Can be used to make direct HTTP requests
import requests
# Used to serialize and save Python objects
import pickle

# Define the GitHub API query object
g = Github(os.environ['GITHUBTOKEN'])

# Find repositories that contain Gherkin language

Unfortunately, the GitHub API only allows searching for repositories (repos) by their main programming language.
Therefore, a generic query is used that returns repositories which can then be iterated over.


## Get Repository list

In [2]:
repos = g.search_repositories(query='stars:>500', sort='stars', order='desc')
repos.totalCount

1000

The PyGithub library only returns a `totalCount` of 1000, because the GitHub Search API provides at most 1000 results per query.

If we run the same query on the GitHub homepage, the results are a little different:

![./repo_count.png](./repo_count.png)



After iterating over these results, a new request has to be made.
The chosen approach is to get the star count of the last repo and limit the next search to repos with a star count in the range `500-<stars_of_last_repo>`.

In [29]:
repos = g.search_repositories(query='stars:500..41689', sort='stars', order='desc')
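
The windowing approach described above can be sketched as a small loop. This is only an illustration, not the script that was actually used: `star_query`, `iter_repos_by_star_window` and the `search` callable are hypothetical names, and `search` is assumed to return `(full_name, stars)` pairs sorted by stars in descending order, truncated to the API's result cap.

```python
def star_query(min_stars, max_stars=None):
    """Build a search qualifier for a star-count window (mirrors the queries above)."""
    if max_stars is None:
        return f"stars:>{min_stars}"
    return f"stars:{min_stars}..{max_stars}"


def iter_repos_by_star_window(search, min_stars=500):
    """Yield (full_name, stars) pairs across successive star windows.

    `search` is a hypothetical callable that takes a query string and returns
    a list of (full_name, stars) pairs, sorted by stars descending and capped.
    """
    seen = set()
    max_stars = None
    while True:
        batch = search(star_query(min_stars, max_stars))
        # The upper bound is inclusive, so repos with the same star count as
        # the last one are returned again and have to be filtered out.
        fresh = [(name, stars) for name, stars in batch if name not in seen]
        if not fresh:
            return
        for name, stars in fresh:
            seen.add(name)
            yield name, stars
        # Narrow the next window to the star count of the last repo seen.
        max_stars = fresh[-1][1]
```

With PyGithub, `search` could presumably be a thin wrapper around `g.search_repositories(query=..., sort='stars', order='desc')` that reads `full_name` and `stargazers_count` from each result. One limitation of this sketch: if more repos share a single star count than one query can return, the loop stops early.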

## Check if they contain Gherkin

To check whether a repo contains Gherkin files, we check the list of languages returned by the GitHub API.
The following code example works for a small number of requests. However, since the GitHub API has a rate limit, the script that was actually used is a little more complicated (see `find_gherkin_repos.py`).

In [32]:
contains_gherkin_files = {}
for repo in repos[:5]:
    contains_gherkin_files[repo.full_name] = 'Gherkin' in repo.get_languages()

pd.DataFrame.from_dict(contains_gherkin_files, orient='index', columns=['contains_gherkin'])

Unnamed: 0,contains_gherkin
jekyll/jekyll,True
Hack-with-Github/Awesome-Hacking,False
necolas/normalize.css,False
scutan90/DeepLearning-500-questions,False
google/material-design-icons,False
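
One way to cope with the rate limit mentioned above is to retry a request after waiting. The following is a minimal sketch under stated assumptions, not the content of `find_gherkin_repos.py`: `retry_on_rate_limit` and its parameters are hypothetical, and the exception type that signals a rate limit is passed in by the caller.

```python
import time


def retry_on_rate_limit(func, rate_limit_error, wait_seconds=60,
                        max_attempts=5, sleep=time.sleep):
    """Call `func()` and retry after a pause when `rate_limit_error` is raised.

    `rate_limit_error` is the exception class that signals a rate limit;
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except rate_limit_error:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(wait_seconds)
```

With PyGithub, the call might look like `retry_on_rate_limit(repo.get_languages, RateLimitExceededException)`, assuming that exception class is imported from the `github` package. A fixed wait is the simplest choice; waiting until the reset time reported by the API would waste less time.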


# Gather characteristics of the Repositories




The following generic characteristics are gathered:

- `num_stars`: Number of GitHub stars
- `num_watchers`: Number of watchers
- `num_commits`: Number of commits
- `comments`: Number of comments
- `languages`: Programming languages (sorted by code size; the GitHub API reports bytes of code per language)
- `date_of_last_commit`: Date of the last commit

Special characteristics interesting for BDD glue code study:

- **`commits_with_given`: Commits that contain keyword 'Given'**
- **`file_cucumber`: Gherkin Feature files that are used with Cucumber**
- **`file_given`: Files that contain keyword 'Given'**
- **`files_feature`: Files that have the extension '.feature'**

More details about the characteristics and how they are collected can be found in the scripts:

- [fetch_repo_stats.py](fetch_data/fetch_repo_stats.py)

- [repo_stats.py](fetch_data/repo_stats.py)
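
To make the file-based characteristics more concrete, the following sketch shows how `files_feature` and `file_given` could be derived from file paths and contents that have already been downloaded. It is an assumption about the general idea, not the logic of the linked scripts; the function names and the `files` mapping (path to text content) are hypothetical.

```python
import re

# Matches the keyword 'Given' as a whole word
GIVEN = re.compile(r"\bGiven\b")


def feature_files(paths):
    """Return the paths that have the extension '.feature'."""
    return [path for path in paths if path.endswith(".feature")]


def files_with_given(files):
    """Return the sorted paths whose text content contains the keyword 'Given'.

    `files` is a hypothetical mapping of file path to file content.
    """
    return sorted(path for path, text in files.items() if GIVEN.search(text))
```

The extension alone does not guarantee Gherkin content: a file such as `tools/build/Makefile.feature` matches the extension check without being a feature file, which is why the keyword-based characteristics are useful in addition.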


## Get Feature Files

The following code shows how the files that have the file extension `.feature` are obtained for the example repository `influxdata/influxdb`.

In [7]:
def get_feature_files(repo):
    # Search the repository's code for files with the extension `.feature`
    files = g.search_code(query=f'repo:{repo.full_name} extension:feature')
    return [file.path for file in files]

example_repo = g.get_repo('influxdata/influxdb')
feature_files = get_feature_files(example_repo)
# A flat list of column names avoids creating a MultiIndex column
pd.DataFrame(feature_files, columns=['feature_files'])

Unnamed: 0,feature_files
0,e2e/features/homePage/homePage.feature
1,e2e/features/influx/influx.feature
2,e2e/features/loadData/clientlib.feature
3,e2e/features/loadData/loadData.feature
4,e2e/features/onboarding/onboarding.feature
5,e2e/features/settings/labels.feature
6,e2e/features/settings/settings.feature
7,e2e/features/settings/variables.feature
8,e2e/features/signin/signin.feature
9,e2e/features/monitoring/history.feature


<https://github.com/influxdata/influxdb/blob/master/e2e/features/homePage/homePage.feature>

In [17]:
# Load the manually gathered list of repositories
with open('./data/manual_feature_file_list.pickle', 'rb') as file:
    repos = pickle.load(file)
df = pd.DataFrame(repos)
df[['full_name','num_stars','num_commits','languages']].head(10)

Unnamed: 0,full_name,num_stars,num_commits,languages
0,torvalds/linux,101487,968187,"{'C': 853161843, 'C++': 11419111, 'Assembly': ..."
1,iluwatar/java-design-patterns,62226,2978,"{'Java': 3390250, 'HTML': 20964, 'CSS': 11102,..."
2,jekyll/jekyll,41693,11200,"{'Ruby': 696738, 'Gherkin': 224137, 'JavaScrip..."
3,eugenp/tutorials,23873,21061,"{'Java': 18787946, 'JavaScript': 2946071, 'HTM..."
4,hashicorp/consul,20701,13255,"{'Go': 10117251, 'JavaScript': 1082533, 'SCSS'..."
5,github/hub,20563,3306,"{'Go': 358904, 'Gherkin': 289265, 'Shell': 392..."
6,influxdata/influxdb,20017,34431,"{'Go': 14021584, 'TypeScript': 3730682, 'JavaS..."
7,elastic/kibana,15187,38439,"{'TypeScript': 76447863, 'JavaScript': 1268035..."
8,diaspora/diaspora,12570,20279,"{'Ruby': 2291824, 'JavaScript': 754337, 'Haml'..."
9,nextcloud/server,12369,56380,"{'PHP': 19363568, 'JavaScript': 10473779, 'Vue..."
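
The pickle file loaded above appears to hold a list of dictionaries, one per repository. The following sketch shows how such a list could be written and read back. The helper names are hypothetical, and the example record is illustrative only: it uses a subset of the fields and only the first two language entries shown for `jekyll/jekyll`.

```python
import os
import pickle
import tempfile


def save_repos(repos, path):
    """Serialize the list of repository dictionaries to `path`."""
    with open(path, "wb") as fh:
        pickle.dump(repos, fh)


def load_repos(path):
    """Load the list of repository dictionaries from `path`."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Illustrative record; the real records may contain more fields.
example = [{
    "full_name": "jekyll/jekyll",
    "num_stars": 41693,
    "num_commits": 11200,
    "languages": {"Ruby": 696738, "Gherkin": 224137},
    "feature_files": ["features/data.feature"],
}]

# Round trip through a temporary file to show that the data is preserved.
with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "repos.pickle")
    save_repos(example, pickle_path)
    restored = load_repos(pickle_path)
```

Pickle keeps nested Python objects such as the `languages` dictionary intact, but files from untrusted sources should not be unpickled.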


In [24]:
from IPython.display import display
for repo in repos:
    temp_df = pd.DataFrame(data=repo['feature_files'], columns=[repo['full_name']])
    display(temp_df)

Unnamed: 0,torvalds/linux
0,tools/build/Makefile.feature


Unnamed: 0,iluwatar/java-design-patterns
0,naked-objects/integtests/src/test/java/domaina...


Unnamed: 0,jekyll/jekyll
0,features/collections_dir.feature
1,features/data.feature
2,features/drafts.feature
3,features/frontmatter_defaults.feature
4,features/include_tag.feature
5,features/incremental_rebuild.feature
6,features/layout_data.feature
7,features/site_data.feature
8,features/highlighting.feature
9,features/create_sites.feature


Unnamed: 0,eugenp/tutorials
0,spring-cucumber/src/test/resources/baelung.fea...
1,spring-cucumber/src/test/resources/version.fea...
2,testing-modules/testing-libraries/src/test/res...
3,testing-modules/rest-testing/src/test/resource...
4,testing-modules/testing-libraries/src/test/res...
5,testing-modules/testing-libraries/src/test/res...
6,testing-modules/testing-libraries/src/test/res...
7,testing-modules/testing-libraries/src/test/res...
8,testing-modules/testing-libraries/src/test/res...
9,testing-modules/testing-libraries/src/test/res...


Unnamed: 0,hashicorp/consul
0,ui/packages/consul-ui/tests/acceptance/compone...
1,ui/packages/consul-ui/tests/acceptance/compone...
2,ui/packages/consul-ui/tests/acceptance/compone...
3,ui/packages/consul-ui/tests/acceptance/compone...
4,ui/packages/consul-ui/tests/acceptance/compone...
...,...
103,ui/packages/consul-ui/tests/acceptance/dc/serv...
104,ui/packages/consul-ui/tests/acceptance/dc/serv...
105,ui/packages/consul-ui/tests/acceptance/dc/serv...
106,ui/packages/consul-ui/tests/acceptance/page-na...


Unnamed: 0,github/hub
0,features/alias.feature
1,features/cherry_pick.feature
2,features/delete.feature
3,features/fish_completion.feature
4,features/init.feature
5,features/merge.feature
6,features/push.feature
7,features/zsh_completion.feature
8,features/ci_status.feature
9,features/pr-checkout.feature


Unnamed: 0,influxdata/influxdb
0,e2e/features/homePage/homePage.feature
1,e2e/features/influx/influx.feature
2,e2e/features/loadData/clientlib.feature
3,e2e/features/loadData/loadData.feature
4,e2e/features/onboarding/onboarding.feature
5,e2e/features/settings/labels.feature
6,e2e/features/settings/settings.feature
7,e2e/features/settings/variables.feature
8,e2e/features/signin/signin.feature
9,e2e/features/monitoring/history.feature


Unnamed: 0,elastic/kibana
0,x-pack/plugins/apm/e2e/cypress/integration/csm...
1,x-pack/plugins/apm/e2e/cypress/integration/apm...


Unnamed: 0,diaspora/diaspora
0,features/desktop/activity_stream.feature
1,features/desktop/aspect_navigation.feature
2,features/desktop/blocks_user.feature
3,features/desktop/change_settings.feature
4,features/desktop/closes_account.feature
...,...
65,features/desktop/registrations.feature
66,features/mobile/registrations.feature
67,features/desktop/post_with_a_poll.feature
68,features/mobile/drawer.feature


Unnamed: 0,nextcloud/server
0,tests/acceptance/features/apps.feature
1,build/integration/ldap_features/ldap-openldap....
2,build/integration/ldap_features/openldap-numer...
3,build/integration/capabilities_features/capabi...
4,build/integration/features/auth.feature
5,build/integration/features/caldav.feature
6,build/integration/features/checksums.feature
7,build/integration/features/download.feature
8,build/integration/features/maintenance-mode.fe...
9,build/integration/features/ocs-v1.feature
