# 2.1 Sentence Segmentation

Here I will compare two libraries: Stanza and Spacy. The code will provide with following information:
* Total number of sentences in the parsed data
* Number of sentences recognized by Spacy
* Number of sentences recognized by Stanza
* Number of shared sentences

In [21]:
import spacy
import stanza
import pandas as pd
import numpy as np
import nltk
import string
import collections
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /Users/anna/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:
df = pd.read_csv("part2_dataset.csv")

In [23]:
df

Unnamed: 0,texts
0,Some notable French Huguenots or people with F...
1,Abel Boyer (1667? – 16 November 1729) was a Fr...
2,"Abolitionism, or the abolitionist movement, is..."
3,"In the United States, abolitionism, the moveme..."
4,Abraham Bosse (c. 1604 – 14 February 1676) was...
5,Abraham Faure (29 August 1795 – 28 March 1875)...
6,"Jean Humbert de Superville (Amsterdam, 7 May 1..."
7,Some notable French Huguenots or people with F...
8,Some notable French Huguenots or people with F...
9,Abraham Mazel (5 September 1677 – 17 October 1...


In [24]:
def sentence_segmentation(df):
    
    '''
    performs sentence segmentation,
    calculates the number of sentences in the text according to stanza and spacy,
    finds the shared sentences between stanza and spacy,
    finds the unique sentences for stanza and spacy,
    
    '''
    
    nlp1 = stanza.Pipeline(lang='en', processors='tokenize')
    nlp2 = spacy.load('en_core_web_sm')
    
    df['sentences_stanza'] = df['texts'].apply(lambda x: [sentence.text for sentence in nlp1(x).sentences])
    df['sentences_spacy'] = df['texts'].apply(lambda x: [sentence.text for sentence in nlp2(x).sents])
    df['num_sentences_stanza'] = df['sentences_stanza'].apply(len) # number of sentences in the text according to stanza
    df['num_sentences_spacy'] = df['sentences_spacy'].apply(len) # number of sentences in the text according to spacy
    
    
    shared_sentences = set() # set of shared sentences
    unique_sentences_stanza = set() # set of unique sentences for stanza
    unique_sentences_spacy = set() # set of unique sentences for spacy
    
    for i in range(len(df)):
        
        shared_sentences.update(set(df['sentences_stanza'][i]).intersection(set(df['sentences_spacy'][i])))  # intersection of sentences
        unique_sentences_stanza.update(set(df['sentences_stanza'][i]).difference(set(df['sentences_spacy'][i])))
        unique_sentences_spacy.update(set(df['sentences_spacy'][i]).difference(set(df['sentences_stanza'][i])))
        
    df['unique_sent_stanza'] = pd.Series(list(unique_sentences_stanza))
    df['unique_sent_spacy'] = pd.Series(list(unique_sentences_spacy))  
      
    SharedSentences = pd.DataFrame({"shared_sentences": list(shared_sentences)})
    SharedSentences.to_csv("shared_sentences.csv", index=False)
    df.to_csv("part2_processed_dataset.csv", index=False)
    
    
    print("--------" * 10)
    print("Total number of sentences in the dataset: ", df['num_sentences_stanza'].sum() + df['num_sentences_spacy'].sum())
    print("Number of sentences in the text according to stanza: ", df['num_sentences_stanza'].sum())
    print("Number of sentences in the text according to spacy: ", df['num_sentences_spacy'].sum())
    print("Number of shared sentences: ", len(shared_sentences))
    print("Number of unique sentences for stanza: ", len(unique_sentences_stanza))
    print("Number of unique sentences for spacy: ", len(unique_sentences_spacy))
        
    return df, SharedSentences

In [25]:
#!python -m spacy download en_core_web_sm

In [26]:
df, shared_sent_def = sentence_segmentation(df)

2023-05-14 04:32:44 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 216kB [00:00, 22.3MB/s]                    
2023-05-14 04:32:44 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2023-05-14 04:32:44 INFO: Using device: cpu
2023-05-14 04:32:44 INFO: Loading: tokenize
2023-05-14 04:32:44 INFO: Done loading processors!


--------------------------------------------------------------------------------
Total number of sentences in the dataset:  31281
Number of sentences in the text according to stanza:  15943
Number of sentences in the text according to spacy:  15338
Number of shared sentences:  4740
Number of unique sentences for stanza:  3786
Number of unique sentences for spacy:  3406


In [27]:
shared_sent_def

Unnamed: 0,shared_sentences
0,"An attempt to resupply the fort on January 9, ..."
1,The second was a civil war with all thirteen s...
2,The Grimké sisters' public speaking played a c...
3,The debates were covered in detail by American...
4,Galland was upset about the director's decisio...
...,...
4735,"Charles-Édouard Babut (1835-1916), pastor, Nîmes."
4736,"Instead, the fighter force was committed to th..."
4737,"Knowlton's Rangers, which included Nathan Hale..."
4738,Benjamin Franklin and James Madison each helpe...


# 2.2 Tokenization

### In this part we will compare tokens which appear firtst in texts without segmentation and then - with segmentation.

### 2.2.1 SharedTokensNoSentences

Here we will need our preprocessing function again.

In [28]:
def preprocess(tokens_list, remove_stop_words=True):
    
    """ 
    removes non-ASCII chars, performs tokenization, lowercases the tokens,
    delete stopwords, punctuation and non-alphabetical characters,
    converts tokens to normal form
    
    """
    tokens = [token.lower() for token in tokens_list] # lowercase
    tokens = [token for token in tokens if token not in string.punctuation] # remove punctuation
    tokens = [token for token in tokens if token.isalpha()] # remove non-alphabetical characters
    
    if remove_stop_words:
        tokens = [token for token in tokens if token not in stop_words] # remove stopwords

    lemmatizer = WordNetLemmatizer()  
    tokens = [lemmatizer.lemmatize(token) for token in tokens] # convert to normal form

    return tokens

In [29]:
df = pd.read_csv("part2_processed_dataset.csv")
df.head(5)

Unnamed: 0,texts,sentences_stanza,sentences_spacy,num_sentences_stanza,num_sentences_spacy,unique_sent_stanza,unique_sent_spacy
0,Some notable French Huguenots or people with F...,['Some notable French Huguenots or people with...,['Some notable French Huguenots or people with...,1642,1630,"Jean Jacques Favre, pastor.","Antoine Barnave (1761-1783), French revolution..."
1,Abel Boyer (1667? – 16 November 1729) was a Fr...,['Abel Boyer (1667? – 16 November 1729) was a ...,['Abel Boyer (1667? – 16 November 1729) was a ...,54,51,Glen Buxton said he could listen to Barrett's ...,[The psychiatric evaluation of Jesus.
2,"Abolitionism, or the abolitionist movement, is...","['Abolitionism, or the abolitionist movement, ...","['Abolitionism, or the abolitionist movement, ...",332,302,"Francis Durand, convert from Roman Catholicism...","Faneuil hall and Faneuil Hall Market: or, Pete..."
3,"In the United States, abolitionism, the moveme...","['In the United States, abolitionism, the move...","['In the United States, abolitionism, the move...",545,518,"Renaud (1952-), pop-rock singer, anti-military...","Michael Pertwee (1916-1991), playwright and sc..."
4,Abraham Bosse (c. 1604 – 14 February 1676) was...,['Abraham Bosse (c.\u20091604 – 14 February 16...,['Abraham Bosse (c.\u20091604 – 14 February 16...,65,75,"Charles Chauvel (1897–1959), Australian film-m...","Ludwig Devrient (1784–1832), German actor.\n"


In [30]:
import spacy
import stanza

def tokenization_before_segmentation(df):
    
    nlp1 = stanza.Pipeline(lang='en', processors='tokenize') # stanza
    df['tokens_stanza'] = df['texts'].apply(lambda x: [token.text for sentence in nlp1(x).sentences for token in sentence.tokens])
    
    nlp2 = spacy.load('en_core_web_sm') # spacy
    df['tokens_spacy'] = df['texts'].apply(lambda x: [token.text for token in nlp2(x)])
    
    # preprocess our tokens
    df['tokens_stanza'] = df['tokens_stanza'].apply(preprocess) 
    df['tokens_spacy'] = df['tokens_spacy'].apply(preprocess)
    
    df["tokens_occurence_stanza"] = df["tokens_stanza"].apply(lambda x: collections.Counter(x)) # tokens occurence for stanza
    df["tokens_occurence_spacy"] = df["tokens_spacy"].apply(lambda x: collections.Counter(x)) # tokens occurence for spacy
    
    # creating vocabularies of unique tokens for each library
    vocab_stanza = set(token for tokens in df['tokens_stanza'] for token in tokens)
    vocab_spacy = set(token for tokens in df['tokens_spacy'] for token in tokens)
    
    # tokens which simalteneously present in both vocabularies
    both_vocab_tokens = vocab_stanza.intersection(vocab_spacy)
    
    # all tokens which present in both vocabularies
    common_tokens_stanza = vocab_stanza.union(vocab_spacy)
    
    only_spacy_tokens = vocab_spacy.difference(vocab_stanza)
    only_stanza_tokens = vocab_stanza.difference(vocab_spacy)
    
    print("-----------------" * 10)
    print("Number of tokens in the vocabulary of stanza: ", len(vocab_stanza))
    print("STANZA TOKENS\n", vocab_stanza) 
    
    print("-----------------" * 10)
    print("Number of tokens in the vocabulary of spacy: ", len(vocab_spacy))
    print("SPACY TOKENS\n", vocab_spacy)
    
    print("-----------------" * 10)
    print("Number of tokens in the vocabulary of both libraries: ", len(both_vocab_tokens))
    print("Tokens in the vocabulary of both libraries\n", both_vocab_tokens)
    
    print("-----------------" * 10)
    print("Number of all tokens: ", len(common_tokens_stanza))
    print("All tokens\n", common_tokens_stanza)
    
    print("-----------------" * 10)
    print("Number of tokens that only in the vocabulary of stanza: ", len(only_stanza_tokens))
    print("UNIQUE FOR STANZA", only_stanza_tokens)
    
    print("-----------------" * 10)
    print("Number of tokens that only in the vocabulary of spacy: ", len(only_spacy_tokens))
    print("UNIQUE FOR SPACY", only_spacy_tokens)
    
    
    print("-----------------" * 10)
    print("Top 30 most common tokens for stanza", df["tokens_occurence_stanza"].sum().most_common(30))
    
    print("-----------------" * 10)
    print("Top 30 most common tokens for spacy", df["tokens_occurence_spacy"].sum().most_common(30))
    
    return df


In [31]:
df = tokenization_before_segmentation(df)
df.head(5)
df.to_csv("tokenization_before_segmentation.csv", index=False)

2023-05-14 04:36:20 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 216kB [00:00, 24.1MB/s]                    
2023-05-14 04:36:20 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2023-05-14 04:36:20 INFO: Using device: cpu
2023-05-14 04:36:20 INFO: Loading: tokenize
2023-05-14 04:36:20 INFO: Done loading processors!


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of tokens in the vocabulary of stanza:  15382
STANZA TOKENS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of tokens in the vocabulary of spacy:  15413
SPACY TOKENS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of tokens in the vocabulary of both libraries:  15349
Tokens in the vocabulary of both libraries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of all tokens:  15446
All tokens
----------------------------------------

### 2.2.2 SharedTokensInSentences

Now we will work with shared segmented sentences.

In [32]:
shared_sent_df = pd.read_csv("shared_sentences.csv")

In [33]:
shared_sent_df.head(5)

Unnamed: 0,shared_sentences
0,"An attempt to resupply the fort on January 9, ..."
1,The second was a civil war with all thirteen s...
2,The Grimké sisters' public speaking played a c...
3,The debates were covered in detail by American...
4,Galland was upset about the director's decisio...


In [40]:
def tokenize_after_segmentation(shared_sent_df):
    
    nlp1 = stanza.Pipeline(lang='en', processors='tokenize')
    nlp2 = spacy.load('en_core_web_sm')
    
    shared_sent_df['tokens_stanza'] = shared_sent_df['shared_sentences'].apply(lambda x: [token.text for sentence in nlp1(x).sentences for token in sentence.tokens])
    shared_sent_df['tokens_spacy'] = shared_sent_df['shared_sentences'].apply(lambda x: [token.text for token in nlp2(x)])
    
    shared_sent_df["cleaned_tokens_stanza"] = shared_sent_df["tokens_stanza"].apply(lambda x: preprocess(x))
    shared_sent_df["cleaned_tokens_spacy"] = shared_sent_df["tokens_spacy"].apply(lambda x: preprocess(x))
    
    shared_sent_df["tokens_occurence_stanza"] = shared_sent_df["cleaned_tokens_stanza"].apply(lambda x: collections.Counter(x))
    shared_sent_df["tokens_occurence_spacy"] = shared_sent_df["cleaned_tokens_spacy"].apply(lambda x: collections.Counter(x))
    

    
     # creating vocabularies of unique tokens for each library
    vocab_stanza = set(token for tokens in df['tokens_stanza'] for token in tokens)
    vocab_spacy = set(token for tokens in df['tokens_spacy'] for token in tokens)
    
    # tokens which simalteneously present in both vocabularies
    both_vocab_tokens = vocab_stanza.intersection(vocab_spacy)
    
    # all tokens which present in both vocabularies
    common_tokens_stanza = vocab_stanza.union(vocab_spacy)
    
    only_spacy_tokens = vocab_spacy.difference(vocab_stanza)
    only_stanza_tokens = vocab_stanza.difference(vocab_spacy)
    
    print("-----------------" * 10)
    print("Number of tokens in the vocabulary of stanza: ", len(vocab_stanza))
    print("STANZA TOKENS\n", vocab_stanza) 
    
    print("-----------------" * 10)
    print("Number of tokens in the vocabulary of spacy: ", len(vocab_spacy))
    print("SPACY TOKENS\n", vocab_spacy)
    
    print("-----------------" * 10)
    print("Number of tokens in the vocabulary of both libraries: ", len(both_vocab_tokens))
    print("Tokens in the vocabulary of both libraries\n", both_vocab_tokens)
    
    print("-----------------" * 10)
    print("Number of all tokens: ", len(common_tokens_stanza))
    print("All tokens\n", common_tokens_stanza)
    
    print("-----------------" * 10)
    print("Number of tokens that only in the vocabulary of stanza: ", len(only_stanza_tokens))
    print("UNIQUE FOR STANZA", only_stanza_tokens)
    
    print("-----------------" * 10)
    print("Number of tokens that only in the vocabulary of spacy: ", len(only_spacy_tokens))
    print("UNIQUE FOR SPACY", only_spacy_tokens)
    
    
    print("-----------------" * 10)
    print("Top 30 most common tokens for stanza", df["tokens_occurence_stanza"].sum().most_common(30))
    
    print("-----------------" * 10)
    print("Top 30 most common tokens for spacy", df["tokens_occurence_spacy"].sum().most_common(30))
    
    df.to_csv("tokenization_after_segmentation.csv", index=False)
    return df
    
    

In [41]:
df_tokens = tokenize_after_segmentation(shared_sent_df)

2023-05-14 04:50:30 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 216kB [00:00, 31.1MB/s]                    
2023-05-14 04:50:30 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2023-05-14 04:50:30 INFO: Using device: cpu
2023-05-14 04:50:30 INFO: Loading: tokenize
2023-05-14 04:50:30 INFO: Done loading processors!


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of tokens in the vocabulary of stanza:  15382
STANZA TOKENS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of tokens in the vocabulary of spacy:  15413
SPACY TOKENS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of tokens in the vocabulary of both libraries:  15349
Tokens in the vocabulary of both libraries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of all tokens:  15446
All tokens
----------------------------------------