Add Words and Chars primitives #51

Merged
merged 6 commits into from Jan 2, 2018

4 participants
@Seth-Rothschild
Contributor

Seth-Rothschild commented Dec 22, 2017

Add two text primitives, NumWords and NumCharacter, which count the number of words and the number of characters, respectively, when the variable type is Text.

Seth-Rothschild added some commits Dec 22, 2017

@codecov-io

codecov-io commented Dec 22, 2017

Codecov Report

Merging #51 into master will increase coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   87.14%   87.19%   +0.04%     
==========================================
  Files          74       74              
  Lines        6946     6973      +27     
==========================================
+ Hits         6053     6080      +27     
  Misses        893      893

Impacted Files                                          Coverage Δ
primitives/transform_primitive.py                       97.46% <0%> (+0.13%) ⬆️
.../feature_function_tests/test_transform_features.py   85.45% <0%> (+0.37%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d30e76...687e1b0.

@kmax12

Good to merge after these changes

@@ -336,6 +337,30 @@ class Weekday(DatetimeUnitBasePrimitive):
    name = "weekday"


class NumCharacter(TransformPrimitive):

@kmax12

kmax12 Dec 26, 2017

Member

Should we name this NumCharacters to be consistent with NumWords below?

@Seth-Rothschild

Seth-Rothschild Jan 1, 2018

Contributor

Changed

    return_type = Numeric
    def get_function(self):
        return lambda array: pd.Series([len(x) for x in array])

@kmax12

kmax12 Dec 26, 2017

Member

I think you can use more of pandas' built-in syntax for this, since the array variable is going to be a pandas Series (please double check this though)

In [1]: array = pd.Series(["1","12 2","1212 3"])

In [2]: array.str.len()
Out[2]: 
0    1
1    4
2    6
dtype: int64

so, it would just be

def get_function(self):
    return lambda array: array.str.len()

@Seth-Rothschild

Seth-Rothschild Jan 1, 2018

Contributor

Array is a numpy.ndarray

@kmax12

kmax12 Jan 1, 2018

Member

Thanks for double checking. I don't have a strong preference, but perhaps do this to avoid the list comprehension:

pd.Series(array).str.len()
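
A minimal sketch of that suggestion, assuming the function receives a numpy.ndarray of strings as reported above. The sample values reuse the earlier IPython session; the `num_characters` name is hypothetical and not part of the PR.

```python
import numpy as np
import pandas as pd

# Assumed input: an ndarray of strings, as reported in the thread.
array = np.array(["1", "12 2", "1212 3"], dtype=object)


def num_characters(array):
    # Hypothetical helper: wrap the ndarray in a Series to get the
    # vectorized .str accessor instead of a list comprehension.
    return pd.Series(array).str.len()


lengths = num_characters(array)
print(lengths.tolist())  # [1, 4, 6]
```

The result matches the `array.str.len()` output shown earlier, but it also works when the input is not already a Series.
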

    return_type = Numeric
    def get_function(self):
        return lambda array: pd.Series([len(x.split(" ")) for x in array])

@kmax12

kmax12 Dec 26, 2017

Member

similar to above, you can do

def get_function(self):
    return lambda array: array.str.split(" ").str.len()

@kmax12

kmax12 Dec 26, 2017

Member

actually, this is probably better.

def get_function(self):
    return lambda array: array.str.count(" ") + 1

easier to read and might be up to 25% faster
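
The speed figure is not measured anywhere in the thread. One rough way to check it is to time both expressions with `timeit`. The sample Series `s` and the repetition counts below are assumptions, and no timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Made-up sample data, repeated to give the timer something to measure.
s = pd.Series(["the quick brown fox", "hello world", "single"] * 1000)

# The two candidate expressions from the review comments.
split_counts = s.str.split(" ").str.len()
space_counts = s.str.count(" ") + 1

# On input with single spaces and no padding, both give the same counts.
same = split_counts.tolist() == space_counts.tolist()

# Time each expression; lower is faster.
t_split = timeit.timeit(lambda: s.str.split(" ").str.len(), number=20)
t_count = timeit.timeit(lambda: s.str.count(" ") + 1, number=20)

print(same)
print(f"split: {t_split:.4f}s  count: {t_count:.4f}s")
```
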

@PaulHobbs

PaulHobbs Dec 29, 2017

Those aren't quite equivalent - what if there's some leading or trailing whitespace, or if there's more than one space between characters?

@kmax12

kmax12 Dec 29, 2017

Member

I think both pieces of code handle leading or trailing whitespace and multiple spaces the same way, because they look for just " ". To handle multiple spaces, the only way I can think of is a regex, and I'd rather keep it simple for now. However, for trailing or leading whitespace we could do

def get_function(self):
    return lambda array: array.str.strip().str.count(" ") + 1
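
To make the edge cases concrete, here is a small sketch on made-up inputs. It assumes a pandas Series as input; the variable names are illustrative only. It compares the split-based count, the `count(" ") + 1` version, and the `strip()` variant above.

```python
import pandas as pd

# Made-up inputs: plain, padded with spaces, and with a double space.
s = pd.Series(["one two", " one two ", "one  two"])

by_split = s.str.split(" ").str.len()
by_count = s.str.count(" ") + 1
by_strip = s.str.strip().str.count(" ") + 1

print(by_split.tolist())  # [2, 4, 3]
print(by_count.tolist())  # [2, 4, 3]
print(by_strip.tolist())  # [2, 2, 3]
```

The first two agree on every input, padding included. `strip()` fixes the padded string, but the double space is still counted as an extra word.
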

@Seth-Rothschild

Seth-Rothschild Jan 1, 2018

Contributor

Multiple spaces could be taken care of in the str.split(" ") case since they'll show up as empty strings. Could just remove with

new_list = [x for x in split_list if x != '']

For now, it has been changed to str.count(" ") + 1
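
For reference, a sketch of what the filtering idea could look like, next to an alternative that is not proposed in the thread: calling `str.split()` with no pattern, which splits on runs of whitespace. Neither is what was merged, and the sample data and variable names are assumptions.

```python
import pandas as pd

# Made-up inputs, including padding, a double space, and an empty string.
s = pd.Series(["one two", " one two ", "one  two", ""])

# Filtering idea from the comment above: split on " " and drop empty strings.
filtered = s.str.split(" ").apply(
    lambda parts: len([p for p in parts if p != ""])
)

# Alternative (an assumption, not from the PR): with no pattern, str.split()
# splits on runs of whitespace, so no empty strings are produced.
whitespace = s.str.split().str.len()

print(filtered.tolist())    # [2, 2, 2, 0]
print(whitespace.tolist())  # [2, 2, 2, 0]
```

Both variants count an empty string as zero words, whereas `str.count(" ") + 1` would report one.
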

Seth-Rothschild added some commits Jan 1, 2018

@kmax12 kmax12 merged commit 1656c82 into master Jan 2, 2018

2 checks passed

ci/circleci: Your tests passed on CircleCI!
license/cla: Contributor License Agreement is signed.

@Seth-Rothschild Seth-Rothschild deleted the words-and-chars branch Jan 2, 2018

@rwedge rwedge referenced this pull request Jan 18, 2018

Merged

Release v0.1.17 #72