mnl-ws-norm

Light-weight tool for normalizing whitespace and accurately tokenizing words (no regex). Multiple natural languages supported. Useful for scrapping, machine learning, and data analysis.

Installation

pip install mnl-ws-norm

Code Examples (Splitting)

split_by_spaces(input_str)

input_str is the string from which words are to be tokenized.

input_str must be passed as a str type.

from mnl_ws_norm.normalizer import split_by_spaces

#Source string 1 with half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009).
source_str1 = "Hey, everybody,  how are you doing?"

#Source string 2 with half-width spaces and a \n character (Unicode: U+000A).
source_str2 = "Hey, everybody\nhow are you doing?"

#Source string 3 with half-width spaces and a full-width space (Unicode: U+3000).
source_str3 = "Hey, everybody,	how are you doing?"

print("source_str1: {}".format(split_by_spaces(source_str1)))
print("source_str2: {}".format(split_by_spaces(source_str2)))
print("source_str3: {}".format(split_by_spaces(source_str3)))

#There may be some cases where you want to split a string into lines and then split those lines by whitespace character.
#In such a case, you can use the splitlines() method.

source_str = "Hey, everybody.\nHow are you doing?\rI am alright."

line_list = source_str.splitlines()

for i in range(len(line_list)):
	print("Line {}: ".format(i) + str(split_by_spaces(line_list[i])))

Code Examples (Normalization)

norm_spaces(input_str, space_type, remove_extra_spaces = False)

Required arguments -> input_str, space_type

input_str is the string in which the whitespace characters are to be replaced.

input_str must be passed as a str type.

space_type is the string used to replace all whitespace characters in input_str.

space_type must be passed as a str type.

Optional argument -> remove_extra_spaces

By default, extra whitespace characters are not removed from input_str.

Specifying remove_extra_spaces as True removes extra whitespace characters from input_str.

Note: Regardless of the value of remove_extra_spaces, the returned string may have leading/trailing whitespace characters, so you may want to use the strip() method as necessary.

from mnl_ws_norm.normalizer import norm_spaces

#Source string with consecutive half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009).
source_str = "  Hey,  everybody, 	how  are  you  doing?  "

#Spaces in source_str are replaced with a half-width space, while extra spaces are ignored.
print(norm_spaces(source_str, " "))

#Spaces in source_str are replaced with a half-width space, and extra spaces are removed.
print(norm_spaces(source_str, " ", True))

#Spaces in source_str are replaced with a full-width space, while extra spaces are ignored.
print(norm_spaces(source_str, "　"))

#Spaces in source_str are replaced with a full-width space (Unicode: U+3000), and extra spaces are removed.
print(norm_spaces(source_str, "　", True))

Other languages

JavaScript -> https://github.com/Rairye/js-mnl-ws-norm

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
mnl_ws_norm		mnl_ws_norm
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mnl-ws-norm

Installation

Code Examples (Splitting)

Code Examples (Normalization)

Other languages

About

Releases 4

Packages

Languages

License

Rairye/mnl-ws-norm

Folders and files

Latest commit

History

Repository files navigation

mnl-ws-norm

Installation

Code Examples (Splitting)

Code Examples (Normalization)

Other languages

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 4

Packages 0

Languages

Packages