brown_pos_converter

I use this script to generate one big brown.pos file for training part-of-speech models for natural language processing. It builds one large, easily parsable file of POS-tagged sentences (one token per line, sentences separated by empty lines) from the Brown POS corpus files. The script skips some malformed sentences and performs token mappings like "I'm" -> "I" "am".
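As a rough illustration of the core transformation (not the actual code in convert_brown_pos.py), the sketch below assumes the raw Brown format of whitespace-separated `token/tag` pairs with one sentence per line. The function name `convert_line` and the policy of dropping a whole sentence when one item is malformed are assumptions made for this example, and the contraction mappings are left out.

```python
TAG_TOKEN_SEPARATOR = "#!#"


def convert_line(line, separator=TAG_TOKEN_SEPARATOR):
    """Turn one raw 'token/tag token/tag ...' sentence into output lines.

    Returns None for an empty line or when any item lacks a token or a
    tag, so the caller can skip the sentence (hypothetical policy).
    """
    rows = []
    for item in line.split():
        # Split on the last slash so tokens that contain '/' stay intact.
        token, slash, tag = item.rpartition("/")
        if not slash or not token or not tag:
            return None
        rows.append(token + separator + tag)
    return rows or None
```

Each returned list would be written out one entry per line, followed by an empty line to mark the sentence boundary.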

Just set the path variable BROWN_POS_PATH to the directory containing your Brown POS training files and run the script. You can also choose the token/tag separator string TAG_TOKEN_SEPARATOR.

Output with TAG_TOKEN_SEPARATOR='#!#' looks like this:

The#!#at
Fulton#!#np
County#!#nn
Grand#!#jj
Jury#!#nn
said#!#vbd
Friday#!#nr
an#!#at
investigation#!#nn
of#!#in
Atlanta's#!#np$
recent#!#jj
primary#!#nn
election#!#nn
produced#!#vbd
no#!#at
evidence#!#nn
that#!#cs
any#!#dti
irregularities#!#nns
took#!#vbd
place#!#nn
.#!#.

The#!#at
jury#!#nn
further#!#rbr
said#!#vbd
in#!#in
term-end#!#nn
presentments#!#nns
that#!#cs
the#!#at
City#!#nn
Executive#!#jj
Committee#!#nn
,#!#,
which#!#wdt
had#!#hvd
over-all#!#jj
charge#!#nn
of#!#in
the#!#at
election#!#nn
,#!#,
deserves#!#vbz
the#!#at
praise#!#nn
and#!#cc
thanks#!#nns
of#!#in
the#!#at
City#!#nn
of#!#in
Atlanta#!#np
for#!#in
the#!#at
manner#!#nn
in#!#in
which#!#wdt
the#!#at
election#!#nn
was#!#bedz
conducted#!#vbn
.#!#.

...
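Reading the file back is straightforward. Here is a minimal sketch, assuming the format shown above (one `token<separator>tag` pair per line, sentences separated by empty lines). The helper name `read_tagged_sentences` is hypothetical and not part of this repository.

```python
def read_tagged_sentences(lines, separator="#!#"):
    """Yield each sentence as a list of (token, tag) tuples."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if separator not in line:
            # Not a token/tag line; this sketch simply skips it.
            continue
        # Split on the last separator so the tag is always the final part.
        token, _, tag = line.rpartition(separator)
        sentence.append((token, tag))
    if sentence:
        # The file may end without a trailing empty line.
        yield sentence
```

For example, passing an open file handle for brown.pos to `read_tagged_sentences` and wrapping the result in `list(...)` should give one list of `(token, tag)` pairs per sentence.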
