# Project Description

This project is a command line utility for basic analysis and cleanup of text files. It can build word frequency charts and modify the contents of a text file. Users select an operation with a command line flag and supply any arguments that operation requires through additional flags. The tool was designed to speed up basic preprocessing steps that are common in fields such as Natural Language Processing (NLP). The idea came from a project I am working on with a small group of others that will use NLP for sentiment analysis on text data to try to improve mental health. The command line interface supports the following operations:

* count: gets the number of words in a provided text file
* replace: given a word and a new word, finds all instances of the old word and replaces them with the new word. The output is written to a specified new file
* create word frequency chart: given a file, shows the frequencies of the words in the file as a bar chart
* create preprocessed word frequency chart: given a file, preprocesses the file (removes stopwords and applies stemming) and then shows the frequency of each word as a bar chart
* create next word frequency chart for given word: given a file and a word, creates a frequency chart of the words that directly follow the provided word
* remove stopwords: removes all stopwords from a given file and writes the output to a specified file
* stem words: stems each word in a given file and writes the output to a specified file
* preprocess file: removes stopwords and stems words in a given file and writes the output to a specified file
* sentence generation: builds a word frequency distribution and a word map in which each word is a key that stores the frequencies of the words that come after it. A sentence is then generated by starting with the most common word and repeatedly choosing the most common next word.


## Acknowledgements
All of the data that I used in this demonstration can be found at the following links:

Star Wars Episode IV Script: https://www.kaggle.com/xvivancos/star-wars-movie-scripts

Sonnet 18: https://www.poetryfoundation.org/poems/45087/sonnet-18-shall-i-compare-thee-to-a-summers-day

# Project Code/Demonstration

Here is the complete list of commands for the text utilities command line tool, run from the scripts/ directory:
* count: python interface.py -c -i <infile\>
* create word frequency chart: python interface.py -d -i <infile\>
* remove stopwords and write to file: python interface.py --rs -i <infile\> -o <outfile\>
* stem words and write to file: python interface.py --sf -i <infile\> -o <outfile\>
* preprocess (stopwords and stemming): python interface.py -p -i <infile\> -o <outfile\>
* show preprocessed word frequency bar chart: python interface.py --dp -i <infile\>
* generate sentence from file: python interface.py -s -i <infile\> -l <sentence_length\>
* show next word frequency chart for specified word: python interface.py --dn -i <infile\> -w <word\>
* find and replace, write to file: python interface.py -u -i <infile\> -o <outfile\> -r <word_to_replace\> -w <word\>

Below are the usage messages printed whenever a user omits a required argument:

In [1]:
# All usage messages to assist users
!python scripts/interface.py -c     # count
!python scripts/interface.py -d     # create word frequency chart
!python scripts/interface.py --rs   # remove stopwords and write to file
!python scripts/interface.py --sf   # stem words and write to file
!python scripts/interface.py -p     # preprocess (stopwords and stemming)
!python scripts/interface.py --dp   # show preprocessed word frequency bar chart
!python scripts/interface.py -s     # generate sentence from file
!python scripts/interface.py --dn   # show next word frequency chart for specified word
!python scripts/interface.py -u     # find and replace, write to file

usage: python script.py -c -i <infile>
usage: python script.py -d -i <infile>
usage: python script.py --rs -i <infile> -o <outfile>
usage: python script.py --sf -i <infile> -o <outfile>
usage: python script.py -p -i <infile> -o <outfile>
usage: python script.py --dp -i <infile>
usage: python script.py -s -i <infile> -l <sentence_length>
usage: python script.py --dn -i <infile> -w <word>
usage: python script.py -u -i <infile> -o <outfile> -r <word_to_replace> -w <word>
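
The argument checking behind these messages could be sketched roughly as follows, using the standard `optparse` module. This is only an illustration under stated assumptions: the function names, the `dest` names, and the `USAGE` table are hypothetical and are not taken from interface.py, and only two of the operations are covered.

```python
from optparse import OptionParser

# Hypothetical table of usage strings; the wording mirrors the messages above.
USAGE = {
    "count": "usage: python script.py -c -i <infile>",
    "remove_stopwords": "usage: python script.py --rs -i <infile> -o <outfile>",
}


def build_parser():
    """Build a parser with two operation flags and two argument flags."""
    parser = OptionParser()
    # Operation flags: booleans that select what to do.
    parser.add_option("-c", action="store_true", dest="count", default=False)
    parser.add_option("--rs", action="store_true", dest="remove_stopwords", default=False)
    # Argument flags: values that the selected operation requires.
    parser.add_option("-i", dest="infile")
    parser.add_option("-o", dest="outfile")
    return parser


def check_args(argv):
    """Return a usage message if a required argument is missing, else None."""
    options, _ = build_parser().parse_args(argv)
    if options.count and not options.infile:
        return USAGE["count"]
    if options.remove_stopwords and not (options.infile and options.outfile):
        return USAGE["remove_stopwords"]
    return None


print(check_args(["-c"]))
```

Keeping the operation flags as booleans and the file names as separate value options means each operation only has to check the handful of values it needs.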


### Count

The following demonstration shows that the script of Star Wars Episode IV has 14140 words in it.

In [2]:
!python scripts/interface.py -c -i EpIV.txt

14140
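
A minimal sketch of how such a count might be computed, assuming that a "word" is any whitespace-separated token. The function names are hypothetical, and the tool itself may tokenize differently.

```python
def count_words(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())


def count_words_in_file(path):
    """Read a text file and count its whitespace-separated tokens."""
    with open(path, encoding="utf-8") as infile:
        return count_words(infile.read())


print(count_words("Shall I compare thee to a summer's day?"))  # prints 8
```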


### Remove Stopwords and Write to File


In [3]:
!cat Shakespeare_sonnet_18

Sonnet 18: Shall I compare thee to a summer’s day?
BY WILLIAM SHAKESPEARE

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.

In [4]:
!python scripts/interface.py --rs -i Shakespeare_sonnet_18 -o sonnet_18_no_stopwords

In [5]:
!cat sonnet_18_no_stopwords

Sonnet 18: Shall I compare thee summer’s day?
BY WILLIAM SHAKESPEARE

Shall I compare thee summer’s day?
Thou art lovely temperate:
Rough winds shake darling buds May,
And summer’s lease hath short date;
Sometime hot eye heaven shines,
And often gold complexion dimm'd;
And every fair fair sometime declines,
By chance nature’s changing course untrimm'd;
But thy eternal summer shall fade,
Nor lose possession fair thou ow’st;
Nor shall death brag thou wander’st shade,
When eternal lines time thou grow’st:
So long men breathe eyes see,
So long lives this, gives life thee.

As can be seen, words such as 'to', 'in', and 'more' have been removed from the text above.
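
One way this kind of filtering could work is sketched below. It assumes a small, hypothetical stopword set and an exact, case-sensitive comparison on whitespace-separated tokens, which would be consistent with capitalized words such as 'And' surviving in the output above. The actual stopword list and matching rules used by the tool may differ.

```python
# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"to", "a", "in", "more", "and", "the", "of", "do"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopword tokens line by line, keeping the original line breaks."""
    cleaned_lines = []
    for line in text.split("\n"):
        kept = [token for token in line.split() if token not in stopwords]
        cleaned_lines.append(" ".join(kept))
    return "\n".join(cleaned_lines)


sample = (
    "Thou art more lovely and more temperate:\n"
    "Rough winds do shake the darling buds of May,"
)
print(remove_stopwords(sample))
```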

### Stem Words and Write to File
Refer to the remove stopwords example above for the original text.

In [6]:
!python scripts/interface.py --sf -i Shakespeare_sonnet_18 -o sonnet_18_stemmed

In [7]:
!cat sonnet_18_stemmed

Sonnet 18: shall I compar thee to a summer’ day?
bi william shakespeare

shal I compar thee to a summer’ day?
thou art more love and more temperate:
rough wind do shake the darl bud of may,
and summer’ leas hath all too short a date;
sometim too hot the eye of heaven shines,
and often is hi gold complexion dimm'd;
and everi fair from fair sometim declines,
bi chanc or nature’ chang cours untrimm'd;
but thi etern summer shall not fade,
nor lose possess of that fair thou ow’st;
nor shall death brag thou wander’st in hi shade,
when in etern line to time thou grow’st:
so long as men can breath or eye can see,
so long live this, and thi give life to thee.

Above is the result of stemming Shakespeare's Sonnet 18.

### Preprocessing (Removing Stopwords and Stemming)
Refer to the remove stopwords example above for the original text.

In [8]:
!python scripts/interface.py -p -i Shakespeare_sonnet_18 -o sonnet_18_preprocessed

In [9]:
!cat sonnet_18_preprocessed

Sonnet 18: shall I compar thee summer’ day?
bi william shakespeare

shal I compar thee summer’ day?
thou art love temperate:
rough wind shake darl bud may,
and summer’ leas hath short date;
sometim hot eye heaven shines,
and often gold complexion dimm'd;
and everi fair fair sometim declines,
bi chanc nature’ chang cours untrimm'd;
but thi etern summer shall fade,
nor lose possess fair thou ow’st;
nor shall death brag thou wander’st shade,
when etern line time thou grow’st:
so long men breath eye see,
so long live this, give life thee.

When compared to the individual calls above, the combined call simplifies the text body even further.

### Sentence Generation
Using word frequencies and next word mappings, here is an original sentence that the text utility came up with using Star Wars Episode IV:

In [10]:
!python scripts/interface.py -s -i EpIV.txt -l 10

 i dont know what happened to be able to.
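
The word map and greedy selection described for this feature can be sketched as follows. This is an illustration under assumptions, not the tool's code: the function names are hypothetical, ties and punctuation handling are ignored, and the real implementation evidently differs in some details.

```python
from collections import Counter, defaultdict


def build_word_map(words):
    """Map each word to a Counter of the words that directly follow it."""
    word_map = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        word_map[current][following] += 1
    return word_map


def generate_sentence(words, length):
    """Start at the most common word, then keep taking the most common successor."""
    if not words or length <= 0:
        return ""
    word_map = build_word_map(words)
    current = Counter(words).most_common(1)[0][0]
    sentence = [current]
    # Stop early if the current word was never followed by anything.
    while len(sentence) < length and word_map[current]:
        current = word_map[current].most_common(1)[0][0]
        sentence.append(current)
    return " ".join(sentence) + "."


words = "the cat saw the dog and the cat saw the bird".split()
print(generate_sentence(words, 5))  # prints "the cat saw the cat."
```

Because every step takes the single most common successor, a greedy walk like this can fall into loops, which is one reason generated sentences tend to repeat words.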


### Find and Replace Word Instances With New Word
Once again, refer to the remove stopwords example for the original text.

In [11]:
!python scripts/interface.py -u -i Shakespeare_sonnet_18 -o updated_sonnet -r thee -w COGS18

In [12]:
!cat updated_sonnet

Sonnet 18: Shall I compare COGS18 to a summer’s day?
BY WILLIAM SHAKESPEARE

Shall I compare COGS18 to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to COGS18.

As can be seen above, all instances of 'thee' in the text were replaced by 'COGS18'.
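
A whole-word replacement of this kind can be sketched with the standard `re` module. The function name is hypothetical, and the tool may match words differently; the sketch assumes that punctuation attached to a word, such as the period after 'thee', should be left in place.

```python
import re


def replace_word(text, old_word, new_word):
    """Replace whole-word occurrences of old_word, keeping attached punctuation."""
    pattern = r"\b" + re.escape(old_word) + r"\b"
    # A function replacement avoids backslash processing in new_word.
    return re.sub(pattern, lambda _match: new_word, text)


print(replace_word("So long lives this, and this gives life to thee.", "thee", "COGS18"))
```

The `\b` word boundaries keep the pattern from matching inside longer words, so replacing 'the' would leave 'thee' and 'there' untouched.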

### Top 100 Word Frequencies
Here is how to get the top 100 word frequencies for the Star Wars Episode IV script. Unfortunately, the way I wrote the code prevents the chart from displaying directly in Jupyter notebooks, so I captured images of the charts and inserted them instead.

In [13]:
!python scripts/interface.py -d -i EpIV.txt

Figure(640x480)


![title](SW_EPIV_Frequencies.png)

From this chart, it appears that the most common word in the Episode IV script was 'the', followed by 'you' and then 'luke'. The file can also be preprocessed before the word frequency chart is generated.
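
The data behind a chart like this can be sketched with `collections.Counter`. The function name is hypothetical, and lowercasing plus whitespace splitting are assumptions about tokenization. Plotting is omitted; the resulting pairs could be handed to any bar chart routine.

```python
from collections import Counter


def top_word_frequencies(text, n=100):
    """Return the n most frequent lowercase tokens as (word, count) pairs."""
    words = text.lower().split()
    return Counter(words).most_common(n)


sample = "the force is with you luke the force is strong the end"
print(top_word_frequencies(sample, 3))
```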

### Top 100 Word Frequencies After Preprocessing
Here is the word frequency chart for Star Wars Episode IV after removing all stopwords and stemming the text. The results are fairly similar; the main change is that some new words, such as 'stay' and 'intercom', appear towards the start of the chart.

In [14]:
!python scripts/interface.py --dp -i EpIV.txt

Figure(640x480)


![title](SW_EPIV_preprocessed_frequencies.png)

### Top 100 Most Frequently Occurring Words After 'luke'
Below is the chart for the frequencies of the words that occur directly after the word 'luke'.

In [15]:
!python scripts/interface.py --dn -i EpIV.txt -w luke

Figure(640x480)


![title](word_frequencies_after_luke.png)
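
The counts behind a next-word chart can be sketched by pairing each word with its successor. The function name is hypothetical, and the sketch assumes the text has already been split into lowercase words.

```python
from collections import Counter


def next_word_frequencies(words, target):
    """Count the words that appear directly after each occurrence of target."""
    return Counter(
        following
        for current, following in zip(words, words[1:])
        if current == target
    )


words = "luke turns and luke runs while luke turns back".split()
print(next_word_frequencies(words, "luke").most_common())  # [('turns', 2), ('runs', 1)]
```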


***


I have been programming for a little over a year now as a Computer Science major. In terms of my Python experience, I knew all of the essentials of the language, up to inherited classes and basic multithreading, coming into the course. I have used Python extensively with the OpenCV computer vision library as the computer vision team lead for Triton Robotics, one of the engineering organizations on campus.

This project goes above the minimum requirements because I had to learn the basics of the optparse module from the Python standard library to implement my command line utility. This was challenging for me because I had never handled more than one or two flags using getopt, and had never had to handle the possibility that 3 to 4 flags could be passed in a single call to my command line tool. Many of my methods also relied on word maps stored in dictionaries, so I had to figure out some more advanced techniques for modifying, sorting, and updating dictionary values. This got especially confusing when I was creating my word map, which is a dictionary whose keys are words and whose values are themselves dictionaries of word-to-integer key-value pairs.