An Introduction to Sentence Segmentation

Sentence segmentation means splitting a given paragraph of text into sentences by identifying the sentence boundaries. In many cases a full stop is all that is required to identify the end of a sentence, but the task is not always that simple.

This is an open-ended challenge to which there are no perfect solutions. Try to break up the given paragraphs of text into individual sentences. Even if you don't manage to segment the text perfectly, the more sentences you identify and display correctly, the more you will score.

Abbreviations: Dr. W. Watson is amazing. In this case, the first and second "." occur after Dr (Doctor) and W (an initial in the person's name) and should not be mistaken for the end of the sentence.

Sentences enclosed in quotes: "What good are they? They're led about just for show!" remarked another. All of this should be identified as just one sentence.

Questions and exclamations: "Who is it?" is a question and should be identified as a sentence. "I am tired!" is an exclamation and should also be identified as a sentence.
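The three cases above can also be handled with a small character scanner instead of a single regular expression. The following is a minimal sketch, not a complete solution: the function name split_sentences_sketch and the ABBREVIATIONS set are illustrative assumptions, and real text will contain abbreviations that this set does not cover.

```python
# Hypothetical, deliberately tiny list of titles that end in a period
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof"}

def split_sentences_sketch(text):
    sentences = []
    start = 0
    in_quote = False
    for i, ch in enumerate(text):
        if ch == '"':
            # Toggle on every double quote; text inside quotes is never split
            in_quote = not in_quote
            continue
        if not ch.isspace() or in_quote or i == 0:
            continue
        prev = text[i - 1]
        if prev not in ".?!":
            continue
        if prev == ".":
            # Look at the word just before the period
            words = text[start:i - 1].split()
            last = words[-1] if words else ""
            is_initial = len(last) == 1 and last.isupper()
            if last.lower() in ABBREVIATIONS or is_initial:
                continue
        sentences.append(text[start:i].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    sample = 'Dr. W. Watson is amazing. Who is it? I am tired!'
    for sentence in split_sentences_sketch(sample):
        print(sentence)
```

Tracking quote state keeps a quoted passage and its attribution together, at the cost of mishandling text with unbalanced quotation marks.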

INPUT FORMAT

You will be given a chunk of text containing several sentences, questions, statements and exclamations, all on one line.

Constraints

The number of characters in every input does not exceed 10000.
The number of words in every input does not exceed 1000.
There will be more than 1 sentence in each input, and no more than 30.
There will be more than 2 characters in every expected sentence, and no more than 10000.
There will be more than 2 characters in every test file, and no more than 10000.

OUTPUT FORMAT

You will split the chunk of text into sentences, and display one sentence per line.

In [None]:
import re

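def split_sentences(text):
    # Split on whitespace that follows '.', '?', '!' or ':', unless the
    # preceding text looks like an abbreviation:
    #   (?<!\w\.\w.)       dotted forms such as "e.g." or "i.e."
    #   (?<![A-Z][a-z]\.)  two-letter titles such as "Dr." or "Mr."
    #   (?<!\b[A-Z]\.)     single-letter initials such as "W."
    # Heuristic only: a quoted passage containing several sentences is
    # still split at its inner punctuation.
    sentence_endings = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<!\b[A-Z]\.)(?<=[.?!:])\s'
    return re.split(sentence_endings, text)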

# Read input text from stdin
input_text = input().strip()

# Split the text into sentences
sentences = split_sentences(input_text)

# Print each sentence on a separate line
for sentence in sentences:
    print(sentence.strip())


The perplexity of a bigram model is 170. Compute its cross-entropy, correct to 2 decimal places.

You may either submit the final answer in the plain-text mode, or you may submit a program in the language of your choice to compute the required value.

Your answer should look like this:

5.50  
Do not use any extra leading or trailing spaces or newlines.

In [1]:
# Perplexity PP and cross-entropy H (in bits) satisfy PP = 2**H, so H = log2(PP)
import math

perplexity = 170
cross_entropy = math.log2(perplexity)

print("{:.2f}".format(cross_entropy))


7.41
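As a sanity check on the value above, perplexity can be read as 2 raised to the cross-entropy measured in bits, where the cross-entropy is the average negative log2 probability the model assigns per token. The sketch below uses a made-up sequence in which every bigram is assigned probability 1/170, an assumption chosen only so that the perplexity comes out to 170. The helper names are illustrative.

```python
import math

def cross_entropy_bits(probabilities):
    # Average negative log2 probability per token
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)

def perplexity(probabilities):
    # Perplexity is 2 raised to the cross-entropy in bits
    return 2 ** cross_entropy_bits(probabilities)

# Hypothetical model output: five bigrams, each with probability 1/170
toy_probabilities = [1 / 170] * 5

print("{:.2f}".format(cross_entropy_bits(toy_probabilities)))
print(round(perplexity(toy_probabilities)))
```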


Deterministic URL and HashTag Segmentation

This problem introduces the segmentation of domain names and social media hashtags into English words. To give you a quick idea of what segmentation means, here are a few examples of domain names and hashtags that have been segmented.

Domain Name Examples:

www.checkdomain.com => [check domain]

www.bigrock.com => [big rock]

www.namecheap.com => [name cheap]

www.appledomains.in => [apple domains]

Twitter Hash Tag Examples:

#honestyhour => [honesty hour]

#beinghuman => [being human]

#followback => [follow back]

#socialmedia => [social media]

#30secondstoearth => [30 seconds to earth]

The segmentation should be based on the list of the 5000 most common words linked from the problem statement. Apart from the words in this list, you should also pick up numbers (both integer and decimal) such as 100 or 200.10.

At this stage, we are going to use a very simple algorithm. If the input is a domain name, ignore the www. prefix and/or the extension (.com, .edu, .org, .in, etc.). If the input is a hashtag, ignore the leading # symbol. Split the input string into a sequence of tokens. A token can be either:

A word from the provided lexicon/dictionary.
An integer or decimal number.

There may be cases where an input string can be split into tokens in more than one way.

currentratesoughttogodown

This can be split into:

current rate sought to go down
current rates ought to go down

thisisinsane

This can be split into:

this is in sane
this is insane

Write your splitter so that, as you tokenise a string from left to right, whenever there are multiple possible ways to split the string, you select the longest possible token from the left side such that the remaining string can still be split into valid tokens. So, for the two cases above, the appropriate splits are:

current rates ought to go down
this is insane

If there is no valid way to split the string into a sequence of valid tokens, output the original string itself, after scrubbing out the # for hashtags, or the www. and extension for domain names.
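To make the ambiguity and the tie-breaking rule concrete, the sketch below enumerates every valid split of a string and then picks the one preferred by the rule above. It is illustrative only: TOY_LEXICON is a hypothetical stand-in for the real word list, the function names are assumptions, and the exhaustive enumeration is exponential in the worst case, so it suits short strings rather than a final submission.

```python
import re

# A numeric token is an integer or a decimal such as 100 or 200.10
NUMBER = re.compile(r"\d+(?:\.\d+)?")

# Hypothetical miniature lexicon, just large enough for the examples
TOY_LEXICON = {
    "this", "is", "in", "sane", "insane",
    "current", "rate", "rates", "sought", "ought", "to", "go", "down",
    "seconds", "earth",
}

def all_splits(text, lexicon):
    # Yield every way of splitting text into lexicon words and numbers
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon or NUMBER.fullmatch(head):
            for rest in all_splits(text[end:], lexicon):
                yield [head] + rest

def pick_longest_first(splits):
    # Comparing the lists of token lengths prefers the longest first token,
    # then the longest second token, and so on. Returns None if no split exists.
    return max(splits, key=lambda tokens: [len(t) for t in tokens], default=None)

if __name__ == "__main__":
    for raw in ["thisisinsane", "currentratesoughttogodown", "30secondstoearth"]:
        best = pick_longest_first(all_splits(raw, TOY_LEXICON))
        print(" ".join(best) if best else raw)
```

Because every candidate covers the same string, comparing the length lists left to right is equivalent to repeatedly choosing the longest leading token whose remainder is still splittable.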

Input Format

The first line will contain the number of test cases, N.

This will be followed by N inputs on separate lines, each a Twitter hashtag or a domain name that you need to segment.

There will be a file named "words.txt" in the run directory of your program that contains all of the words, each separated by a new line. It is the same file that is linked earlier in the problem statement.

Output Format

(Everything should be in lower case)

Segmentation for Input 1  
Segmentation for Input 2
             .
             .
Segmentation for Input N

Sample Input

3
#isittime  
www.whatismyname.com  
#letusgo  

Sample Output

is it time  
what is my name  
let us go

The sample input is just to give an idea of what to do. Your program is not expected to be able to run it. You can also read the corpus of words by making your program read the file "words.txt" from its current directory.

Please note that the "words.txt" file is a list of several common words, but it will not necessarily produce the ideal natural language segmentation for each of the examples, samples or tests. That is the expected behavior: we are only trying to gauge how well you can segment these hashtags and domain names with this limited list of words.

Scoring

All test cases have equal weightage.

Score = M * C / N, where:
M = Maximum score for the test case.
C = Number of correct answers in your output.
N = Total number of tests.

In [None]:
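# Enter your code here. Read input from STDIN. Print output to STDOUT
import re
from functools import lru_cache

# A numeric token is an integer or a decimal such as 100 or 200.10
NUMBER = re.compile(r'\d+(?:\.\d+)?')

def load_words(filename):
    with open(filename, 'r') as file:
        words = set(word.strip().lower() for word in file)
    return words

def clean_input(input_string):
    input_string = input_string.strip().lower()
    if input_string.startswith('#'):
        return input_string[1:]  # Remove leading '#' for hashtags
    if input_string.startswith('www.'):
        input_string = input_string[4:]  # Remove leading 'www.' for domain names
    # Remove trailing extensions such as '.com' or '.co.in', keeping decimals like '200.10'
    return re.sub(r'(\.[a-z]+)+$', '', input_string)

def segment_string(input_string, dictionary):
    text = clean_input(input_string)

    @lru_cache(maxsize=None)
    def split_from(start):
        # Return the tokens for text[start:], or None if no valid split exists
        if start == len(text):
            return ()
        # Try the longest candidate first; keep it only if the rest can be split too
        for end in range(len(text), start, -1):
            token = text[start:end]
            if token in dictionary or NUMBER.fullmatch(token):
                rest = split_from(end)
                if rest is not None:
                    return (token,) + rest
        return None

    tokens = split_from(0)
    # Fall back to the cleaned input when no valid segmentation exists
    return ' '.join(tokens) if tokens else text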

# Load dictionary
dictionary = load_words("words.txt")

# Read number of test cases
N = int(input())

# Process each test case
for _ in range(N):
    input_string = input().strip()
    segmented_string = segment_string(input_string, dictionary)
    print(segmented_string)


You are provided with the POS-tagged version of a sentence. The tags used are from the most commonly used standard, the Penn Treebank tagset.

Your task is to fill in the missing tags which have been replaced by question marks. The missing tags will be restricted to the set of tags which you already see in the POS tagged version of this sentence.
If there are two question marks (??), it indicates a 2-letter tag (CC, JJ, NN etc.).
If there are three question marks (???), it indicates a 3-letter tag (NNP, PPS, VBP).

The/DT planet/NN Jupiter/NNP and/CC its/PPS moons/NNS are/VBP in/IN effect/NN a/DT minisolar/JJ system/?? ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/?? that/IN never/RB caught/??? fire/NN ./.

In the plain-text box below, submit the sentence as is, with only the question marks filled in appropriately and nothing else changed.

For example, your answer may look like this (This is not the correct answer):

The/DT planet/NN Jupiter/NNP and/CC its/PPS moons/NNS are/VBP in/IN effect/NN a/DT minisolar/JJ system/CC ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/CC that/IN never/RB caught/NNP fire/NN ./.
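The two constraints in the task (a missing tag has the same length as its run of question marks, and it must already appear elsewhere in the sentence) narrow each blank down to a short candidate list. The sketch below only collects those candidates and does not choose among them. The function name candidate_tags is an illustrative assumption.

```python
from collections import defaultdict

def candidate_tags(tagged_sentence):
    # Group the alphabetic tags seen in the sentence by their length
    candidates = defaultdict(set)
    for token in tagged_sentence.split():
        tag = token.rpartition('/')[2]
        if tag.isalpha():  # Skips '??' placeholders and punctuation tags
            candidates[len(tag)].add(tag)
    return {length: sorted(tags) for length, tags in candidates.items()}

sentence = ("The/DT planet/NN Jupiter/NNP and/CC its/PPS moons/NNS are/VBP "
            "in/IN effect/NN a/DT minisolar/JJ system/?? ,/, and/CC Jupiter/NNP "
            "itself/PRP is/VBZ often/RB called/VBN a/DT star/?? that/IN "
            "never/RB caught/??? fire/NN ./.")

for length, tags in sorted(candidate_tags(sentence).items()):
    print(length, tags)
```

Choosing the right candidate for each blank still requires looking at the surrounding words, for example whether the blank follows a determiner or an adverb.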

In [2]:
def fill_missing_tags(sentence):
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        word, _, tag = token.rpartition('/')
        # Only tags made up entirely of question marks need filling in
        if not tag or set(tag) != {'?'}:
            continue
        # Crude heuristic: copy the nearest preceding tag of the same length.
        # It respects the tag-length constraint but ignores context, so the
        # chosen tag is not guaranteed to be the correct one.
        for j in range(i - 1, -1, -1):
            previous_tag = tokens[j].rpartition('/')[2]
            if len(previous_tag) == len(tag) and previous_tag.isalpha():
                tokens[i] = word + '/' + previous_tag
                break

    return ' '.join(tokens)

# Provided test case
test_case = "The/DT planet/NN Jupiter/NNP and/CC its/PPS moons/NNS are/VBP in/IN effect/NN a/DT minisolar/JJ system/?? ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/?? that/IN never/RB caught/??? fire/NN ./."

# Fill in missing POS tags
result = fill_missing_tags(test_case)

# Print the result
print(result)


The/DT planet/NN Jupiter/NNP and/CC its/PPS moons/NNS are/VBP in/IN effect/NN a/DT minisolar/JJ system/JJ ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/DT that/IN never/RB caught/VBN fire/NN ./.
