## Assignment one 
### Language Analytics
February 2021

_Marie Damsgaard Mortensen_

__Basic scripting with Python__

Using the corpus called 100-english-novels found on the cds-language GitHub repo, write a Python programme which does the following:

- Calculate the total word count for each novel
- Calculate the total number of unique words for each novel
- Save result as a single file consisting of three columns: filename, total_words, unique_words

__General instructions__

- For this exercise, you can upload either a standalone script OR a Jupyter Notebook
- Save your script as word_counts.py OR word_counts.ipynb
- You can either upload the script/notebook here or push to GitHub and include a link - or both!
- Your code should be clearly documented in a way that allows others to easily follow the structure of your script.
- Similarly, remember to use descriptive variable names! A name like word_count is more readable than wcnt.

__Purpose__

This assignment is designed to test that you have an understanding of:

1. how to structure, document, and share a Python script;
2. how to effectively make use of native Python data structures, functions, and flow control;
3. how to load, save, and process text files.

### Loading files

Importing the modules and using os to specify the path to the corpus.

In [1]:
import os
import pandas as pd
from pathlib import Path

In [2]:
data_path = os.path.join("..", "data", "100_english_novels", "corpus")

In [3]:
# checking that the path looks correct
data_path

'../data/100_english_novels/corpus'
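
Since Path is already imported, the same path could also be built with pathlib alone. The following is only a sketch of that alternative, not what the notebook does; the variable names joined_with_os and joined_with_pathlib are made up for the comparison.

```python
import os
from pathlib import Path

# the path as it is built in the notebook
joined_with_os = os.path.join("..", "data", "100_english_novels", "corpus")

# the same path built with pathlib's "/" operator
joined_with_pathlib = Path("..") / "data" / "100_english_novels" / "corpus"

# both spellings refer to the same path
print(Path(joined_with_os) == joined_with_pathlib)  # True
```

Either version works with Path(data_path).glob("*.txt") later on, because Path() accepts a plain string as well as another Path object.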

In [4]:
# defining empty lists to collect the results for each novel
file_name = []
total_words = []
unique_words = []

Below, I have made a loop that reads every file and saves the filename, the total number of words and the number of unique words for each file.

The keyword _for_ starts a loop that performs the same actions on each file found in the path. Path from _pathlib_ turns data_path into a path object, and glob("*.txt") yields every text file in that folder, so the loop visits one novel at a time.

Inside the loop, the content of the text file is read with file.read() and saved in the variable loaded_text. This string is then split at every run of whitespace, so the words are separated and each becomes an element of a list called split_text.

The length of split_text is appended to the total_words list defined in the cell above, and the name of the file is appended to file_name.

To count the unique words in every novel, the split text is converted to the _set_ data type, which contains no duplicates. The length of this set is the number of unique words, and it is appended to the unique_words list.
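
The split-and-set counting described above can be tried on a short string first. The sample sentence and the variable names below are made up for illustration. The second half is an optional variant, not part of the notebook, that lowercases the tokens and strips surrounding punctuation; whether that is wanted depends on how a "word" is defined.

```python
import string

sample_text = "The cat saw the other cat. The end."

# str.split() with no argument splits on any run of whitespace
sample_tokens = sample_text.split()
sample_total = len(sample_tokens)

# a set keeps only one copy of each distinct token
sample_unique = len(set(sample_tokens))

print(sample_total, sample_unique)  # 8 7

# optional variant: "The"/"the" and "cat"/"cat." are merged
normalised_tokens = [token.strip(string.punctuation).lower() for token in sample_tokens]
normalised_tokens = [token for token in normalised_tokens if token]
normalised_unique = len(set(normalised_tokens))

print(normalised_unique)  # 5
```

With plain whitespace splitting, differently capitalised forms and words with attached punctuation count as separate unique words, which is why the first unique count is higher than the second.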

In [5]:
for filename in Path(data_path).glob("*.txt"):
    with open(filename, "r", encoding="utf-8") as file:
        loaded_text = file.read()

        # splitting the text wherever there is whitespace
        split_text = loaded_text.split()
        total_words.append(len(split_text))

        # saving the filename without the folder path
        file_name.append(filename.name)

        # counting unique words by removing duplicates with set()
        unique_words.append(len(set(split_text)))

Now, it is time to combine the three lists file_name, total_words and unique_words into one dataframe. Using the pandas DataFrame constructor, each list becomes a column.

In [6]:
#collecting lists to make a dataframe
novel_info = pd.DataFrame({'filename': file_name, 'total_words': total_words, 'unique_words': unique_words})
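
As an alternative to three parallel lists, the rows could be collected as one list of dictionaries inside a function, which is closer to how a standalone script might be organised. This is only a sketch under assumptions: the function name summarise_corpus and the two toy files are made up, and sorted() is added because the order in which glob() returns files is not guaranteed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def summarise_corpus(corpus_dir):
    """Return one row per .txt file with its total and unique word counts."""
    rows = []
    # sorted() makes the row order reproducible between runs
    for text_path in sorted(Path(corpus_dir).glob("*.txt")):
        words = text_path.read_text(encoding="utf-8").split()
        rows.append({
            "filename": text_path.name,
            "total_words": len(words),
            "unique_words": len(set(words)),
        })
    return pd.DataFrame(rows, columns=["filename", "total_words", "unique_words"])


# toy corpus in a temporary folder (made-up files, not the real novels)
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "b_novel.txt").write_text("one two two", encoding="utf-8")
    Path(tmp_dir, "a_novel.txt").write_text("red red red red", encoding="utf-8")
    toy_info = summarise_corpus(tmp_dir)

print(toy_info)
```

Keeping each novel's values together in one dictionary means the three columns cannot drift out of step, which can happen if one of several parallel lists misses an append.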

In [93]:
#inspecting the dataframe
print(novel_info)

Unnamed: 0,filename,total_words,unique_words
0,Cbronte_Villette_1853.txt,196557,29084
1,Forster_Angels_1905.txt,50477,9464
2,Woolf_Lighthouse_1927.txt,70185,11157
3,Meredith_Richmond_1871.txt,214985,28892
4,Stevenson_Treasure_1883.txt,68448,10831
...,...,...,...
95,Chesterton_Thursday_1908.txt,58299,10385
96,Burnett_Lord_1886.txt,58698,8131
97,Braddon_Phantom_1883.txt,180676,22474
98,Gaskell_Ruth_1855.txt,161797,18148


Lastly, I make a new output folder where the result is saved as a .csv file. The output path is built in the same way as data_path at the beginning. The method to_csv() saves the dataframe novel_info to the specified path.

In [7]:
# making a directory to keep track of output files
# exist_ok=True avoids an error if the folder already exists
os.makedirs("output_marie", exist_ok=True)

In [8]:
outpath = os.path.join("output_marie", "novel_info.csv")

In [9]:
# index=False keeps the file to the three requested columns
novel_info.to_csv(outpath, index=False)
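
By default, to_csv() also writes the dataframe index as an unnamed first column, which shows up as "Unnamed: 0" when the file is read back with read_csv(). The sketch below uses made-up numbers and writes to in-memory buffers instead of a file on disk, to show the effect of passing index=False.

```python
import io

import pandas as pd

# made-up values, only for illustration
toy_info = pd.DataFrame({
    "filename": ["a_novel.txt", "b_novel.txt"],
    "total_words": [4, 3],
    "unique_words": [1, 2],
})

# default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy_info.to_csv(with_index)
with_index.seek(0)

# index=False writes only the three named columns
without_index = io.StringIO()
toy_info.to_csv(without_index, index=False)
without_index.seek(0)

columns_with_index = list(pd.read_csv(with_index).columns)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)     # ['Unnamed: 0', 'filename', 'total_words', 'unique_words']
print(columns_without_index)  # ['filename', 'total_words', 'unique_words']
```

Since the assignment asks for a file with exactly three columns, index=False is probably the closer match to the requested output.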