# Extracting features from text - a demo of using GitHub Copilot

This script shows how GitHub Copilot can quickly create a basic, but still useful, feature extraction tool for text. The script is based on my upcoming book.

The goal of the demonstration is twofold: to show how Copilot speeds up development, and to discuss how much of the code can be attributed to me and how much to Copilot, from both the development perspective and the legal perspective.

In [1]:
# import the libraries to operate on the dataframes
import pandas as pd
import numpy as np

# import the libraries to plot the data
import matplotlib.pyplot as plt
import seaborn as sns

# import the libraries to perform the feature extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer



In [4]:
# read the dataset
df = pd.read_csv('./gerrit_reviews.csv', sep=';')

# show the head
df.head()

Unnamed: 0,change_id,revision-id,filename,line,start_line,end_line,LOC,message
0,sdc~master~Iab23331c5eb2e8ff5526a877adab8babba...,96f3585b6d36769568268134c9d59b4b8a925cb6,openecomp-be/lib/openecomp-sdc-vendor-software...,201.0,195,201,if (CollectionUtils.isNotEmpty(data)) {,Could this cause backwards compatibility issue...
1,sdc~master~Iab23331c5eb2e8ff5526a877adab8babba...,96f3585b6d36769568268134c9d59b4b8a925cb6,openecomp-be/lib/openecomp-sdc-vendor-software...,201.0,195,201,return data.stream(),Could this cause backwards compatibility issue...
2,sdc~master~Iab23331c5eb2e8ff5526a877adab8babba...,96f3585b6d36769568268134c9d59b4b8a925cb6,openecomp-be/lib/openecomp-sdc-vendor-software...,201.0,195,201,.anyMatch(fileData -> fileData...,Could this cause backwards compatibility issue...
3,sdc~master~Iab23331c5eb2e8ff5526a877adab8babba...,96f3585b6d36769568268134c9d59b4b8a925cb6,openecomp-be/lib/openecomp-sdc-vendor-software...,201.0,195,201,} else {,Could this cause backwards compatibility issue...
4,sdc~master~Iab23331c5eb2e8ff5526a877adab8babba...,96f3585b6d36769568268134c9d59b4b8a925cb6,openecomp-be/lib/openecomp-sdc-vendor-software...,201.0,195,201,return artifact.toUpperCase().cont...,Could this cause backwards compatibility issue...


In [5]:
# get only the column with the message to extract the features
dfMessage = df['message']

# extract the features using the CountVectorizer
countVectorizer = CountVectorizer()
countVectorizer.fit(dfMessage)
countVectorizer.vocabulary_



{'could': 206,
 'this': 859,
 'cause': 135,
 'backwards': 87,
 'compatibility': 169,
 'issues': 455,
 'for': 342,
 'any': 60,
 'packages': 626,
 'relying': 703,
 'on': 600,
 'the': 851,
 'cba': 137,
 'files': 331,
 'being': 100,
 'identified': 403,
 'by': 123,
 'naming': 561,
 'convention': 196,
 'if': 405,
 'file': 328,
 'type': 894,
 'is': 449,
 'not': 585,
 'present': 662,
 'but': 122,
 'perhaps': 649,
 'needs': 566,
 'to': 870,
 'be': 93,
 'done': 268,
 'order': 612,
 'of': 597,
 'operation': 607,
 'should': 766,
 'static': 803,
 'final': 334,
 'definedconfigresolver': 241,
 'keycloak': 477,
 'configured': 176,
 'probably': 667,
 'throw': 863,
 'more': 550,
 'specific': 789,
 'exception': 305,
 'declared': 237,
 'at': 82,
 'top': 877,
 'actions': 32,
 'handledrequest': 386,
 'get': 358,
 'rid': 732,
 'extra': 315,
 'statement': 801,
 'match': 521,
 'and': 58,
 'remove': 706,
 'duplicate': 272,
 'code': 155,
 'block': 108,
 'you': 969,
 'don': 267,
 'need': 564,
 'assert': 77,
 'her
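To make the structure of `vocabulary_` easier to read, here is a minimal sketch on two made-up messages (`toy_messages` and the other names are illustrative and not part of the Gerrit dataset). With the default settings, `CountVectorizer` lowercases the text, keeps tokens of two or more alphanumeric characters, and numbers the terms in alphabetical order, so each value in the dictionary is the column index of that term in the feature matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up review comments (illustrative only, not from gerrit_reviews.csv)
toy_messages = ["Could this cause issues", "this should be final"]

# learn the vocabulary with the default settings
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_messages)

# vocabulary_ maps each term to its column index in the feature matrix
vocabulary = toy_vectorizer.vocabulary_

# list the terms in column order (which is alphabetical order)
terms_by_column = sorted(vocabulary, key=vocabulary.get)
print(terms_by_column)
```

The dictionary itself is printed in the order in which the terms were first seen, which is why the output above starts with 'could' rather than with the term in column 0.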

In [7]:
# extract the features using the TfidfVectorizer
tfidfVectorizer = TfidfVectorizer()
tfidfVectorizer.fit(dfMessage)
# list the learned terms in alphabetical order
sorted(tfidfVectorizer.vocabulary_)


['103',
 '11',
 '110',
 '12',
 '122',
 '132505',
 '14',
 '143',
 '148',
 '16',
 '2022',
 '21',
 '2a',
 '2b',
 '326',
 '330',
 '3317',
 '377',
 '4288',
 '62',
 '70',
 '71',
 '75',
 '87',
 '95',
 'about',
 'above',
 'abstractcomparablepropertyconstraint',
 'abstractpropertyconstraint',
 'ack',
 'acknowledged',
 'aclass',
 'actions',
 'actuallly',
 'actually',
 'add',
 'added',
 'adds',
 'adjust',
 'advise',
 'after',
 'again',
 'ah',
 'algorithm',
 'align',
 'all',
 'allargsconstructor',
 'already',
 'also',
 'alternate',
 'alternative',
 'alternatively',
 'always',
 'am',
 'an',
 'anchor',
 'anchorrepository',
 'anchors',
 'and',
 'another',
 'any',
 'anymore',
 'anystring',
 'anything',
 'anyway',
 'anywhere',
 'apart',
 'api',
 'apiparameters',
 'apiversion',
 'app',
 'appended',
 'applies',
 'are',
 'argument',
 'array',
 'as',
 'assert',
 'asserttrue',
 'assigned',
 'assigning',
 'async',
 'at',
 'avc',
 'avoid',
 'avoided',
 'backward',
 'backwards',
 'badly',
 'baeldung',
 'bandit

In [9]:
# convert one message into a vector of features
message = dfMessage[0]

# convert the message into a vector of features using the CountVectorizer
res = countVectorizer.transform([message]).toarray()

# print the vector of features
print(res)

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [10]:
# convert the message into a vector of features using the TfidfVectorizer
res = tfidfVectorizer.transform([message]).toarray()

# print the vector of features
print(res)

[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.15326135 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.19574091 0.         0.
  0.         0.         0.         0.11045615 0.         0.
  0.         0.         0.         0.   

In [11]:
# make one dataframe with the features of the single message
# note: res still holds the TfidfVectorizer output from the previous cell
dfTfidfVectorizer = pd.DataFrame(res, columns=tfidfVectorizer.get_feature_names_out())

# show the head
dfTfidfVectorizer.head()

Unnamed: 0,103,11,110,12,122,132505,14,143,148,16,...,xml,yaml,yangtextschemasourceset,yeah,year,yes,yesterday,yml,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# transform all messages into vectors of features using the CountVectorizer
res = countVectorizer.transform(dfMessage).toarray()

# make one dataframe with the features extracted using the CountVectorizer
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
dfCountVectorizer = pd.DataFrame(res, columns=countVectorizer.get_feature_names_out())

# show the head
dfCountVectorizer.head()

Unnamed: 0,103,11,110,12,122,132505,14,143,148,16,...,xml,yaml,yangtextschemasourceset,yeah,year,yes,yesterday,yml,you,your
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
