# Assignment 1: Basic Scripting with Python (word counts)

Nicole Dwenger - 2021/02/03

---

### Instructions

__Basic scripting with Python__
Using the corpus called 100-english-novels found on the cds-language GitHub repo, write a Python programme which does the following:
- Calculate the total word count for each novel
- Calculate the total number of unique words for each novel
- Save result as a single file consisting of three columns: filename, total_words, unique_words

__General instructions__
- For this exercise, you can upload either a standalone script OR a Jupyter Notebook
- Save your script as word_counts.py OR word_counts.ipynb
- You can either upload the script/notebook here or push to GitHub and include a link - or both!
- Your code should be clearly documented in a way that allows others to easily follow the structure of your script.
- Similarly, remember to use descriptive variable names! A name like word_count is more readable than wcnt.

__Purpose__
This assignment is designed to test that you have an understanding of:
- how to structure, document, and share a Python script;
- how to effectively make use of native Python data structures, functions, and flow control;
- how to load, save, and process text files.

-------

### Dependencies and Data

In [2]:
# load necessary dependencies
import os
from pathlib import Path
import pandas as pd

In [3]:
# define path to load data files
data_path = os.path.join("..", "data", "100_english_novels", "corpus")
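Because the path above is relative, it only resolves when the notebook is run from the expected folder. A minimal sketch of a sanity check, assuming the same folder layout as in the cell above (the name `text_files` is introduced here for illustration only):

```python
from pathlib import Path

# same relative location as data_path above, built with pathlib instead of os.path
data_path = Path("..") / "data" / "100_english_novels" / "corpus"

# list the .txt files only if the folder exists, otherwise fall back to an empty list
text_files = sorted(data_path.glob("*.txt")) if data_path.is_dir() else []

if text_files:
    print(f"Found {len(text_files)} text files in {data_path}")
else:
    print(f"No text files found in {data_path}, check the working directory")
```

Running this before the main loop makes a wrong working directory visible straight away, instead of silently producing an empty data frame.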

### Iterating over files to save information in a dataframe

In [4]:
# create empty list so that one row per novel can be collected in it
word_count_rows = []

In [5]:
# loop to get number of words and unique words, then save info in df_word_counts
for filepath in Path(data_path).glob("*.txt"):
    with open(filepath, "r", encoding="utf-8") as file:
        loaded_text = file.read()  # read file

    # get info from the loaded text
    filename = filepath.name  # extract filename
    words = loaded_text.split()  # split into words on whitespace
    unique_words = set(words)  # keep the unique words

    # store row with info for this novel
    word_count_rows.append({"filename": filename,
                            "total_words": len(words),
                            "unique_words": len(unique_words)})

# build the data frame from the collected rows
df_word_counts = pd.DataFrame(word_count_rows, columns=["filename", "total_words", "unique_words"])

In [6]:
# check df
df_word_counts

| | filename | total_words | unique_words |
|---|---|---|---|
| 0 | Cbronte_Villette_1853.txt | 196557 | 29084 |
| 1 | Forster_Angels_1905.txt | 50477 | 9464 |
| 2 | Woolf_Lighthouse_1927.txt | 70185 | 11157 |
| 3 | Meredith_Richmond_1871.txt | 214985 | 28892 |
| 4 | Stevenson_Treasure_1883.txt | 68448 | 10831 |
| ... | ... | ... | ... |
| 95 | Chesterton_Thursday_1908.txt | 58299 | 10385 |
| 96 | Burnett_Lord_1886.txt | 58698 | 8131 |
| 97 | Braddon_Phantom_1883.txt | 180676 | 22474 |
| 98 | Gaskell_Ruth_1855.txt | 161797 | 18148 |
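The counts above come from plain whitespace splitting, so punctuation stays attached to words and capitalisation matters. A minimal sketch on a made-up sentence makes this concrete (the helper name `count_words` and the sample text are illustrative assumptions, not part of the assignment):

```python
def count_words(text):
    """Return (total_words, unique_words) using whitespace splitting."""
    words = text.split()  # split on any whitespace
    return len(words), len(set(words))  # set() keeps each distinct token once


# made-up sample text, for illustration only
sample_text = "The cat saw the dog. The dog saw the cat!"
total_words, unique_words = count_words(sample_text)

# "The"/"the", "dog."/"dog" and "cat"/"cat!" are all counted as different tokens
print(total_words, unique_words)  # prints: 10 7
```

If case-insensitive counts without punctuation were wanted instead, the text would need to be lower-cased and stripped of punctuation before splitting. That is a different counting rule from the one used in this notebook.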


In [7]:
# save as csv file (index=False so that the file only has the three required columns)
df_word_counts.to_csv("word_counts.csv", index=False)

# if the file should not be saved in the current directory, a path could be specified here and then used in .to_csv()
# csv_path = os.path.join("..", "data", "100_english_novels", "word_counts.csv")
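When a data frame is written with its row index, the CSV gains an extra unnamed first column. A small sketch, assuming pandas is installed and using made-up rows and a temporary folder (neither comes from the corpus), shows how a round trip can confirm that only the three required columns are saved:

```python
import os
import tempfile

import pandas as pd

# made-up rows for illustration only, not real corpus results
example_df = pd.DataFrame([
    {"filename": "example_a.txt", "total_words": 10, "unique_words": 7},
    {"filename": "example_b.txt", "total_words": 5, "unique_words": 5},
])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "word_counts.csv")
    example_df.to_csv(csv_path, index=False)  # index=False leaves out the row index
    reloaded_columns = list(pd.read_csv(csv_path).columns)  # read back the header

print(reloaded_columns)
```

The same check could be run on the real `word_counts.csv` by reading it back with `pd.read_csv` and inspecting `.columns`.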