# COGS 109: (Data Modeling & Analysis) Final Project: Sentence Classifier 

## Contributors:
* **Philip Leo Pascual**
* **Benjamin Isip**
* **Gustav Santo-Tomas**

# Introduction:
**With natural language processing as our motivation, we chose a topic that uses text-based data for our analysis. In pursuit of a more novel topic, we chose to implement a sentence classifier as opposed to a document classifier. This in part led us to use the [Sentence Classification Dataset](http://archive.ics.uci.edu/ml/datasets/Sentence+Classification) from the UCI Machine Learning Repository.**

**By implementing a sentence classifier, we hope to gain insight into how the features of a sentence contribute to its "meaning/purpose". This analysis looks into methods for classifying sentences within the context of research papers, in order to identify key parts of a given text. We hope that this project raises awareness of how aspects of a sentence affect its intended usage.**


## Dataset:



# Pre-Processing Step

In [4]:
import re
import pprint

import nltk
import pandas as pd
from nltk import word_tokenize

## Single Text File

In [5]:
# Unlabeled text file
# Note: Move an unlabeled text file into the same directory
#       as this Jupyter notebook
with open("unlabeled.txt", "r") as unlab_file:
    unlab_text = unlab_file.read()
unlab_list = unlab_text.split('\n')

# Note: Sentences are separated by '\n'

# List of sentences
unlab_list[:5]

['### abstract ###',
 'Fitness functions based on test cases are very common in Genetic Programming (GP)',
 'This process can be assimilated to a learning task, with the inference of models from a limited number of samples',
 'This paper is an investigation on two methods to improve generalization in GP-based learning: 1) the selection of the best-of-run individuals using a three data sets methodology, and 2) the application of parsimony pressure in order to reduce the complexity of the solutions',
 'Results using GP in a binary classification setup show that while the accuracy on the test sets is preserved, with less variances compared to baseline results, the mean tree size obtained with the tested methods is significantly reduced']

In [6]:
# Labeled text file
# Note: I chose a random labeled text file
with open("arxiv_annotate1_13_3.txt", "r") as text_file:
    body_text = text_file.read()

# Note: Each sentence's label appears at the start of its line
#       (e.g. 'MISC Although ...'), so a label token marks where
#       one sentence ends and the next begins
body_text

'### abstract ###\nMISC Although the Internet AS-level topology has been extensively studied over the past few years, little is known about the details of the AS taxonomy\nMISC An AS "node" can represent a wide variety of organizations, e g , large ISP, or small private business, university, with vastly different network characteristics, external connectivity patterns, network growth tendencies, and other properties that we can hardly neglect while working on veracious Internet representations in simulation environments\nAIMX In this paper, we introduce a radically new approach based on machine learning techniques to map all the ASes in the Internet into a natural AS taxonomy\nOWNX We successfully classify ~95.3\\% of ASes with expected accuracy of ~78.1\\%\nOWNX We release to the community the AS-level topology dataset augmented with: 1) the AS taxonomy information and 2) the set of AS attributes we used to classify ASes\nOWNX We believe that this dataset will serve as an invaluable a

In [7]:
############ Word Tokenization ############
text_list = nltk.word_tokenize(body_text)

############ Sentence Segmentation ############
sents_list = []   # List of sentences from text 
temp_list = []    # Temporarily stores words to form sentences

# Iterates through the list of words to extract sents
for word in text_list:
    # If the current word is a label do the following
    if re.fullmatch(r'(AIMX|OWNX|CONT|BASE|MISC)', word):
        # Dictionary for each sentence
        sents_temp = {}
        sents_temp["Sentence"] = " ".join(temp_list)
        sents_temp["Label"] = word
        sents_list.append(sents_temp)
        
        # Reset the list for the next sentence
        temp_list = [] 
    else:
        temp_list.append(word)

# List of dictionaries where each entry represents a sentence & its label
sents_list[:5]

[{'Label': 'MISC', 'Sentence': '# # # abstract # # #'},
 {'Label': 'MISC',
  'Sentence': 'Although the Internet AS-level topology has been extensively studied over the past few years , little is known about the details of the AS taxonomy'},
 {'Label': 'AIMX',
  'Sentence': "An AS `` node '' can represent a wide variety of organizations , e g , large ISP , or small private business , university , with vastly different network characteristics , external connectivity patterns , network growth tendencies , and other properties that we can hardly neglect while working on veracious Internet representations in simulation environments"},
 {'Label': 'OWNX',
  'Sentence': 'In this paper , we introduce a radically new approach based on machine learning techniques to map all the ASes in the Internet into a natural AS taxonomy'},
 {'Label': 'OWNX',
  'Sentence': 'We successfully classify ~95.3\\ % of ASes with expected accuracy of ~78.1\\ %'}]

In [8]:
# Number of sentences
print("{}: {} sentences".format("arxiv_annotate1_13_3.txt", len(sents_list)))

arxiv_annotate1_13_3.txt: 34 sentences
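
In the raw text shown above, each label comes before its sentence (for example `MISC Although the Internet ...`). Because of this, pairing a label token with the tokens collected before it shifts every label by one sentence and attaches a label to the `### abstract ###` header. The sketch below is one possible line-based alternative that reads the label and the sentence from the same line. The helper name `parse_labeled_text` and the abbreviated sample text are assumptions for illustration and are not part of the original notebook.

```python
import re

# Labels taken from the regular expression used in the cell above.
LABEL_PATTERN = re.compile(r'^(AIMX|OWNX|CONT|BASE|MISC)\s+(.*)$')

def parse_labeled_text(raw_text):
    """Return a list of {'Sentence', 'Label'} dicts, one per labeled line."""
    parsed = []
    for line in raw_text.split('\n'):
        match = LABEL_PATTERN.match(line.strip())
        # Lines without a leading label (e.g. '### abstract ###') are skipped
        if match:
            parsed.append({"Sentence": match.group(2), "Label": match.group(1)})
    return parsed

# Abbreviated sample in the same layout as the labeled file (hypothetical text)
sample = (
    "### abstract ###\n"
    "MISC Although the Internet AS-level topology has been extensively studied\n"
    "AIMX In this paper, we introduce a radically new approach\n"
)

parsed_sample = parse_labeled_text(sample)
for entry in parsed_sample:
    print(entry["Label"], "->", entry["Sentence"])
```

Note that switching to this approach would change the sentence counts reported in this notebook, since section headers would no longer be counted as sentences.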


## Multiple Text Files

In [9]:
################ Text File Extraction ################
import os
# Place this Jupyter notebook into the 'labeled_articles' directory
# Replace the path with your own
root = "C:/Users/Leo Pascual/Desktop/UCSD/Fall_Quarter_2017/Cogs_109/Final_Project/SentenceCorpus/labeled_articles"
textfiles = []

# Directory that contains all the text files
directory = os.listdir(root)

# We only want to look at the third annotator's files
for file in directory:
    if file.endswith("3.txt"):
        textfiles.append(file)

print('Number of text files: {}'.format(len(textfiles)))

# 30 annotated text files
textfiles

Number of text files: 30


['arxiv_annotate10_7_3.txt',
 'arxiv_annotate1_13_3.txt',
 'arxiv_annotate2_66_3.txt',
 'arxiv_annotate3_80_3.txt',
 'arxiv_annotate4_168_3.txt',
 'arxiv_annotate5_240_3.txt',
 'arxiv_annotate6_52_3.txt',
 'arxiv_annotate7_268_3.txt',
 'arxiv_annotate8_81_3.txt',
 'arxiv_annotate9_279_3.txt',
 'jdm_annotate10_210_3.txt',
 'jdm_annotate1_103_3.txt',
 'jdm_annotate2_107_3.txt',
 'jdm_annotate3_120_3.txt',
 'jdm_annotate4_220_3.txt',
 'jdm_annotate5_228_3.txt',
 'jdm_annotate6_32_3.txt',
 'jdm_annotate7_265_3.txt',
 'jdm_annotate8_177_3.txt',
 'jdm_annotate9_45_3.txt',
 'plos_annotate10_1140_3.txt',
 'plos_annotate1_6_3.txt',
 'plos_annotate2_336_3.txt',
 'plos_annotate3_798_3.txt',
 'plos_annotate4_1052_3.txt',
 'plos_annotate5_1375_3.txt',
 'plos_annotate6_1032_3.txt',
 'plos_annotate7_1233_3.txt',
 'plos_annotate8_123_3.txt',
 'plos_annotate9_1187_3.txt']
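
The `endswith("3.txt")` filter works for the file names listed above, but it would also match a name ending in, say, `_13.txt`. As a hedged alternative, the sketch below uses `pathlib` and matches on the `_3.txt` suffix only. The helper `find_annotator_files` is a hypothetical name, and the demonstration runs on a temporary directory rather than on the real corpus.

```python
import tempfile
from pathlib import Path

def find_annotator_files(root, annotator=3):
    """Return the sorted names of files in `root` ending in '_<annotator>.txt'."""
    pattern = "*_{}.txt".format(annotator)
    return sorted(path.name for path in Path(root).glob(pattern))

# Demonstration on a throwaway directory with made-up empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["arxiv_annotate1_13_1.txt",
                 "arxiv_annotate1_13_3.txt",
                 "jdm_annotate2_107_3.txt"]:
        (Path(tmp) / name).write_text("")
    demo_files = find_annotator_files(tmp)

print(demo_files)
```

Sorting the result keeps the file order reproducible, since directory listings are not guaranteed to come back in any particular order.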

In [53]:
textfile_tokens = []    # Stores each text file's list of words

############ Word Tokenization ############
# Opens, reads, and tokenizes each file (each file is closed right after reading)
for file in textfiles:
    with open(os.path.join(root, file), 'r') as wrapper:
        body_text = wrapper.read()
    textfile_tokens.append(nltk.word_tokenize(body_text))


################ Sentence Segmentation ################
sents_list = []   # List of sentences from text
temp_list = []    # Temporarily stores words to form sentences

# Iterates through each list of tokens
for text_list in textfile_tokens:
    # Iterates through the list of words to extract sents
    for word in text_list:
        # If the current word is a label, do the following
        if re.fullmatch(r'(AIMX|OWNX|CONT|BASE|MISC)', word):
            # Dictionary for each sentence
            sents_temp = {}
            sents_temp["Sentence"] = " ".join(temp_list)
            sents_temp["Label"] = word
            sents_list.append(sents_temp)

            # Reset the list
            temp_list = []
        else:
            # Forms the sentence until a label is found
            temp_list.append(word)

# List of sentences and their label
sents_list[:10]

[{'Label': 'OWNX', 'Sentence': '# # # abstract # # #'},
 {'Label': 'MISC',
  'Sentence': 'The Minimum Description Length principle for online sequence estimation/prediction in a proper learning setup is studied'},
 {'Label': 'MISC',
  'Sentence': 'If the underlying model class is discrete , then the total expected square loss is a particularly interesting performance measure : ( a ) this quantity is finitely bounded , implying convergence with probability one , and ( b ) it additionally specifies the convergence speed'},
 {'Label': 'AIMX',
  'Sentence': 'For MDL , in general one can only have loss bounds which are finite but exponentially larger than those for Bayes mixtures'},
 {'Label': 'OWNX',
  'Sentence': 'We show that this is even the case if the model class contains only Bernoulli distributions'},
 {'Label': 'OWNX',
  'Sentence': 'We derive a new upper bound on the prediction error for countable Bernoulli classes'},
 {'Label': 'OWNX',
  'Sentence': 'This implies a small bound ( 

In [11]:
# Number of labeled sentences out of the 30 text files
print('Number of training sentences: {}'.format(len(sents_list)))

Number of training sentences: 1040
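
Before training a classifier, it can help to check how the sentences are distributed across the five labels, since a heavily skewed distribution affects how accuracy should be read. The following is a minimal sketch using `collections.Counter` on a list shaped like `sents_list`. The helper name `label_counts` and the example sentences are assumptions for illustration only and do not come from the dataset.

```python
from collections import Counter

def label_counts(sentences):
    """Count how many entries carry each label."""
    return Counter(entry["Label"] for entry in sentences)

# Made-up entries in the same {'Sentence', 'Label'} format as sents_list
example_sents = [
    {"Sentence": "we propose a new method", "Label": "AIMX"},
    {"Sentence": "we derive an upper bound", "Label": "OWNX"},
    {"Sentence": "the bound holds for countable classes", "Label": "OWNX"},
    {"Sentence": "classification is an important example", "Label": "MISC"},
]

counts = label_counts(example_sents)
for label, n in counts.most_common():
    print("{}: {}".format(label, n))
```

With the real data, `label_counts(sents_list)` would give the per-label totals for the training set.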


# Labeled Text: Training Set

In [62]:
########### Dictionary to Data Frame ###########
# Training set
df = pd.DataFrame.from_dict(sents_list, orient = 'columns')
df = df[['Sentence', 'Label']]
df.iloc[50:70,]

|    | Sentence | Label |
|----|----------|-------|
| 50 | The description length SYMBOL of a parameter S... | MISC |
| 51 | A prior weight may then be defined by SYMBOL | MISC |
| 52 | If a string SYMBOL is generated by a Bernoulli... | MISC |
| 53 | That is , the two-part complexity with respect... | MISC |
| 54 | Many Machine Learning tasks are or can be redu... | MISC |
| 55 | An important example is classification | MISC |
| 56 | The task of classifying a new instance SYMBOL ... | MISC |
| 57 | Typically the ( instance , class ) pairs are iid | MISC |
| 58 | Cumulative loss bounds for prediction usually ... | OWNX |
| 59 | Then we can solve classification problems in t... | OWNX |


# Unlabeled Text: Testing Set

In [13]:
################ Directory Extraction ################
# Place the three (unlabeled file) directories within labeled_articles directory
# Replace path with your own
root = "C:/Users/Leo Pascual/Desktop/UCSD/Fall_Quarter_2017/Cogs_109/Final_Project/SentenceCorpus/labeled_articles"
unlabeled_folders = ["arxiv_unlabeled", "jdm_unlabeled", "plos_unlabeled"]
list_unlabeled = []

# List of directory paths to a folder
for file in unlabeled_folders:
    list_unlabeled.append(root + "/" + file)

list_unlabeled

['C:/Users/Leo Pascual/Desktop/UCSD/Fall_Quarter_2017/Cogs_109/Final_Project/SentenceCorpus/labeled_articles/arxiv_unlabeled',
 'C:/Users/Leo Pascual/Desktop/UCSD/Fall_Quarter_2017/Cogs_109/Final_Project/SentenceCorpus/labeled_articles/jdm_unlabeled',
 'C:/Users/Leo Pascual/Desktop/UCSD/Fall_Quarter_2017/Cogs_109/Final_Project/SentenceCorpus/labeled_articles/plos_unlabeled']

In [64]:
################ Text File Extraction ################
unlabeled_textfiles = []    # List of (list of text files) for each folder
for path in list_unlabeled:
    temp_list = []          # Stores files
    directory = os.listdir(path)
    for file in directory:
        temp_list.append(file)
    unlabeled_textfiles.append(temp_list) 
 
################ Sentence segmentation ################
total_sents = []            # List of (list of sentences) for each file 
for i, list_textfiles in enumerate(unlabeled_textfiles):
    for file in list_textfiles:
        # Use a context manager so each file is closed after it is read
        with open(list_unlabeled[i] + "/" + file, encoding="Latin-1") as unlab_file:
            unlab_text = unlab_file.read()
        total_sents.append(unlab_text.split('\n'))  # Note: Each sentence is separated by a "\n"

total_sents[0][0:10]

['### abstract ###',
 'Fitness functions based on test cases are very common in Genetic Programming (GP)',
 'This process can be assimilated to a learning task, with the inference of models from a limited number of samples',
 'This paper is an investigation on two methods to improve generalization in GP-based learning: 1) the selection of the best-of-run individuals using a three data sets methodology, and 2) the application of parsimony pressure in order to reduce the complexity of the solutions',
 'Results using GP in a binary classification setup show that while the accuracy on the test sets is preserved, with less variances compared to baseline results, the mean tree size obtained with the tested methods is significantly reduced',
 '### introduction ###',
 'GP is particularly suited for problems that can be assimilated to learning tasks, with the minimization of the error between the obtained and desired outputs for a limited number of test cases -- the training data, using a ML te

## Number of Sentences Within Training & Testing Sets

In [15]:
################ Dataset Size ################
# Counts the total number of sents within our dataset
count = sum(len(file) for file in total_sents)

print('Number of training sentences: {}'.format(len(sents_list)))        
print('Number of sents within our testing set: {}'.format(count))     

Number of training sentences: 1040
Number of sents within our testing set: 37970


# Cleaning the data

In [63]:
################ Lowercase Conversion ################

print(df[50:56])

# Training set
# Vectorized lowercase conversion (avoids chained assignment on df.Sentence[i])
df['Sentence'] = df['Sentence'].str.lower()

print("\n")  
print(df[50:56])   



# Boundary line between outputs
separator = "-" * 59

print("\n")
print(separator)
print(total_sents[0][0:5])

# Testing set
new_total_sents = []
for txt_file in total_sents:
    for sent in txt_file:
        new_total_sents.append(sent.lower())

print("\n")
print(new_total_sents[0:5])

                                             Sentence Label
50  The description length SYMBOL of a parameter S...  MISC
51       A prior weight may then be defined by SYMBOL  MISC
52  If a string SYMBOL is generated by a Bernoulli...  MISC
53  That is , the two-part complexity with respect...  MISC
54  Many Machine Learning tasks are or can be redu...  MISC
55             An important example is classification  MISC


                                             Sentence Label
50  the description length symbol of a parameter s...  MISC
51       a prior weight may then be defined by symbol  MISC
52  if a string symbol is generated by a bernoulli...  MISC
53  that is , the two-part complexity with respect...  MISC
54  many machine learning tasks are or can be redu...  MISC
55             an important example is classification  MISC


-----------------------------------------------------------
['### abstract ###', 'Fitness functions based on test cases are very common in Genetic Programmi

In [66]:
# Requires the NLTK stopwords corpus: nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words("english"))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 'she',
 'should',
 'shouldn',
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 'the',
 'their',
 'theirs',
 'them',
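
Once a stop word set is available, filtering a sentence comes down to dropping every token that appears in the set. The sketch below shows one way to do this on whitespace-separated tokens. The helper `remove_stop_words`, its optional `keep_words` parameter (for stop words that should survive filtering), and the small word sets are all assumptions for illustration; the real notebook would pass in the NLTK `stop_words` set instead.

```python
def remove_stop_words(sentence, stop_words, keep_words=frozenset()):
    """Lowercase a sentence and drop tokens that are stop words.

    Tokens listed in `keep_words` are retained even if they are stop words.
    """
    tokens = sentence.lower().split()
    return [tok for tok in tokens if tok not in stop_words or tok in keep_words]

# Small illustrative sets (not the full NLTK list)
illustrative_stop_words = {"we", "this", "a", "the", "of", "in"}
illustrative_keep_words = {"we", "this"}

filtered = remove_stop_words(
    "We show the convergence of this bound",
    illustrative_stop_words,
    illustrative_keep_words,
)
print(filtered)
```

Using sets for both word collections keeps each membership check constant-time, which matters when filtering tens of thousands of sentences.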
 

In [79]:
# Compile the list of stopwords we would want to filter out

################ Text File Extraction ################
# Note: Place the word_lists directory into the labeled_articles directory
# Note: Remove the stopwords text file - we will be using nltk's stop words list
root = "C:/Users/Leo Pascual/Desktop/UCSD/Fall_Quarter_2017/" \
       "Cogs_109/Final_Project/SentenceCorpus/labeled_articles/word_lists"
directory = os.listdir(root)

stopwords_files = []
for file in directory:
    if file.endswith(".txt"):
        stopwords_files.append(file)
  
wanted_stopwords = []
# Opens each word list, reads it, and prints its contents
for file in stopwords_files:
    with open(root + "/" + file, 'r') as wrapper:
        body_text = wrapper.read()
    print(body_text)
    

we
this
paper
study
show
present
new
model
introduce
current
investigated
compare
designed
argue
examines
propose
interested
address
give
discuss
discover
how
identification
developed
influences
main
goal
motivate
establish
applied
formalize
assess
purpose
examine
to
survey
use
have
investigation
investigate
validate
influence
make
answer
suggest
work
analyze
investigates
theoretical
was
concentrate
approach

using
extend
relies
spirit
previously
foundation
on
reuse
similar
legacy
refined
earlier
extended

not
difficult
limited
however
assume
only
many
although
typically
directly
little
may
unlike
without
while
sometimes
infeasible
shortcomings
expensive
ignore
reasonable
less
handful
disparate
seems
accurately
confidence
must
variable
substantial
nevertheless
fall
readily
unfortunately
often
trades-offs
relatively
require
yet
artificial
restricted
variability
might
capture
tailored
apparently
generally
notorious
somewhat
full
whereas
obstacles
even
most
missing
sparse
formidable
drawb