<a href="https://colab.research.google.com/github/Abir866/AbirGist/blob/main/Text%20Analysis" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align = "center">Computer Prog. & Problem Solving</h1>
<h2 align = "center">CSCI 1227 - Spring, 2022</h2>

<h3 align = "center"><u>Project 1</u>: File IO and Text Processing</h3>
<h3 align = "center"><u> Prepared by</u>: M. Shamsuzzaman, PhD</h3>

<h5 align = "center"><u> Document version#</u>: draft 1</h5>


## Objective and Goals

- How to work with a text file in Python (open, read/write, cleaning, etc.)
- Bonus (optional): computation of entropy of letters  

## Project Description

In many cases, we need to read from a file and/or write to a file as part of a task. In this project, you are going to use the book **Through the Looking-Glass** by _Lewis Carroll_. The text version of the book, `Through the Looking Glass.txt`, is provided on **BrightSpace**. You can also get the book from [here](https://www.gutenberg.org/ebooks/12) as part of [Project Gutenberg](https://www.gutenberg.org/).

The character set encoding of the text is `UTF-8`.  

Your task is to write a program that can:
1. Read the file and clean the text (remove all punctuation and non-ASCII characters, convert everything to lower case, etc.)
2. Split the text into words
3. Count the total number of unique words in the book
4. Find the longest word in the document, and its length
5. Calculate the frequencies of "word lengths"
6. Find the top 10 most used words and their frequencies
7. Calculate the frequencies of letters
8. Calculate the entropy of each letter (optional/bonus)
9. Write your findings (tasks 3 to 8) to a text file (`summary.txt`).
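
Tasks 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions, not the required solution: the helper name `clean_text` and the sample sentence are hypothetical, and removing digits is an assumption that the task list does not explicitly ask for.

```python
import string

def clean_text(raw: str) -> str:
    """Drop non-ASCII characters, punctuation, and digits, then lower-case."""
    # Characters that cannot be encoded as ASCII are silently dropped
    ascii_only = raw.encode("ascii", "ignore").decode("ascii")
    # Translation table that deletes every punctuation mark and digit
    drop = str.maketrans("", "", string.punctuation + string.digits)
    return ascii_only.translate(drop).lower()

# Hypothetical sample sentence (note the curly, non-ASCII apostrophe)
sample = "Alice’s cat, Dinah, had 2 kittens!"
words = clean_text(sample).split()
print(words)
```

Calling `split()` with no argument splits on any run of whitespace, so the gap left behind by a removed character does not produce empty words.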

Note that:
- _The frequencies of "word lengths"_ in task 5 means the total counts of 1-letter words, 2-letter words, 3-letter words, etc.
- Similarly, in task 6, word frequency means the total count of a particular word. For example, the frequency of the word 'the' is the total number of times the word 'the' appears in the document.
- In task 7, you need to calculate the total counts of the individual letters (a-z) in the document.
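
The three kinds of counts described above can be illustrated on a tiny word list. This sketch assumes that `collections.Counter` from the standard library is allowed in the project; the sample sentence is made up for the example.

```python
from collections import Counter

# Hypothetical, already-cleaned sample text
words = "the cat sat on the mat by a door".split()

word_counts = Counter(words)                             # word -> occurrences
length_counts = Counter(len(w) for w in words)           # word length -> number of words
letter_counts = Counter(ch for w in words for ch in w)   # letter -> occurrences

print(word_counts.most_common(1))      # the single most frequent word
print(sorted(length_counts.items()))   # word-length frequencies, shortest first
print(letter_counts["t"])              # how often the letter 't' occurs
```

A plain dictionary with `counts[key] = counts.get(key, 0) + 1` gives the same counts if you prefer not to use `Counter`.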

### Entropy

[Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) is a measure of information content. To calculate the entropy in units of `bits`, we use the following equation:

$$H = - \sum_{i=1}^{n} p(x_i) \log_2{(p(x_i))}$$

Where:

- $x_i$ are the data values, and
- $p(x_i)$ is the probability of the value $x_i$

In our case, $n$ is 26 (as there are 26 letters in the English alphabet), and the probability of a letter in the document is the frequency of that letter divided by the sum of the frequencies of all the letters.
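
As a small worked example of the equation, take a made-up text with only three distinct letters and the counts a = 2, b = 1, c = 1. The probabilities are 0.5, 0.25, and 0.25, so the per-letter terms are 0.5, 0.5, and 0.5 bits and $H = 1.5$ bits. The function name `entropy_terms` below is hypothetical, and skipping letters with a count of zero is an assumption based on the usual convention that $0 \log_2 0 = 0$.

```python
from math import log2

def entropy_terms(counts: dict) -> dict:
    """Return the term -p * log2(p) for each letter in a frequency dictionary."""
    total = sum(counts.values())
    terms = {}
    for letter, count in counts.items():
        p = count / total
        # A letter that never occurs contributes nothing (log2(0) is undefined)
        terms[letter] = -p * log2(p) if p > 0 else 0.0
    return terms

terms = entropy_terms({"a": 2, "b": 1, "c": 1, "d": 0})
total_entropy = sum(terms.values())
print(terms)
print(total_entropy)  # 1.5
```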

><u>Note</u>: This project is basically an extension of Labs 6 and 7 combined.

In [None]:
%%writefile P1_Tufan.py
# Written in Google Colab
import string
from math import log
def read_file(name: str) -> str:
  """Read the file at the path `name` and return its cleaned text.

  Cleaning removes digits, punctuation, and non-ASCII characters,
  then converts everything to lower case.
  """
  # The book is UTF-8 encoded, so state the encoding explicitly
  with open(name, "r", encoding="utf-8") as input_file:
    lines = input_file.read()
  st_no_digit = "".join([n for n in lines if not n.isdigit()])
  st_no_digit_punct = "".join([alphabet for alphabet in st_no_digit if alphabet not in string.punctuation])
  # Drop every character that cannot be encoded as ASCII
  encoded_st = st_no_digit_punct.encode("ascii", "ignore")
  decode_st = encoded_st.decode()
  text = decode_st.lower()
  return text
def unique_words(text: str, x: dict = None) -> tuple:
  """Count how often each word occurs in the given text.

  Return a tuple of (number of unique words, dictionary of word counts).
  """
  if x is None:  # avoid a mutable default argument that is shared between calls
    x = {}
  for i in text.split():
    x[i] = x.get(i, 0) + 1
  return len(x), x

def find_longest_word(x: dict) -> tuple:
  """Find the longest word(s) among the keys of the given dictionary.

  Return a flat tuple (word, length, ...) with one pair per longest word.
  """
  length = max(len(i) for i in x)
  tpl = ()
  for i in x:
    if length == len(i):
      tpl += (i, length)
  return tpl
def find_word_length_frequencies(y: str, x: dict = None) -> dict:
  """Count how many words of each length occur in the string y.

  Return a dictionary mapping word length to count, in ascending order of length.
  """
  if x is None:  # avoid a mutable default argument that is shared between calls
    x = {}
  gather = [len(i) for i in y.split()]
  for i in sorted(gather):
    x[i] = x.get(i, 0) + 1
  return x

def find_ten_frequent_words(x: dict) -> dict:
  """Find the ten most frequent words in the given word-count dictionary.

  Return them as a dictionary ordered from most to least frequent.
  """
  # Sort the (word, count) pairs by count, highest first, and keep the first ten.
  # Slicing guarantees exactly ten entries even when several words share a count.
  ranked = sorted(x.items(), key=lambda pair: pair[1], reverse=True)
  return dict(ranked[:10])

def find_letter_frequency(text: str, x: dict) -> dict:
  """Count the occurrences of each letter (a-z) in the given text.

  Return the counts in the given dictionary x.
  """
  alphabets = string.ascii_lowercase
  for i in alphabets:
    x[i] = text.count(i)
  return x

def calculate_entropy(x: dict) -> dict:
  """Calculate the entropy term -p*log2(p) of each letter from the letter frequencies in x.

  Return the terms as a dictionary keyed by letter.
  """
  y = {}
  h = sum(x.values())
  for i in x:
    z = x[i] / h
    # A letter that never occurs contributes 0, because log(0) is undefined
    y[i] = -z * log(z, 2) if z > 0 else 0.0
  return y

# Open the output file for writing (working in Google Colab)
out_file = open("summary.txt","w")

# Call read_file with the path to the input file, which sits in the temporary Google Colab working directory
text = read_file("/content/Through the Looking Glass.txt")

# Write the cleaned text in the file
#out_file.write(text)

dic ={} # Create an empty dictionary to collect the unique words

# Call the unique_words function
ln,dic = unique_words(text, dic)

# Write the number of unique words to the file
print("-"*50+"\nUniqueWords\n"+"-"*50+"\n", file = out_file)
print(f"{ln}",file = out_file)
# print('wwwgutenbergorg' in dic)

# Call find_longest_word to get the longest word(s) and their length
long_word = find_longest_word(dic)

# Write the longest word(s) to the file
print("-"*50+"\nLongest word\n"+"-"*50+"\n", file = out_file)
for i in range(len(long_word)):
  if (i%2) == 0:
   print(f"{long_word[i]}".ljust(30),file =out_file,end="")
  else:
   print(f"Length : {long_word[i]}".rjust(10),file = out_file)
# Remove the previous dictionary
del(dic)

# Call to find_word_length_frequencies and store the result in a new dictionary
dic = find_word_length_frequencies(text)

# Write the word-length frequencies to the file
print("-"*50+"\nWord length frequencies\n"+"-"*50+"\n", file = out_file)
print("Word length".ljust(30)+"Frequency".rjust(10),file = out_file)
for itm in dic:
 print(f"{itm}".ljust(30)+f"{dic.get(itm)}".rjust(10),file = out_file)

#Refer to a new dictionary
dic ={}

# Call unique_words again to rebuild the unique words and their counts
ln,dic = unique_words(text,dic)
#print(dic)

# Call find_ten_frequent_words, passing in the word-count dictionary
dic = find_ten_frequent_words(dic)

#Print the ten most frequent words with their frequencies in the file
print("-"*50+"\nTen Most Frequent Words\n"+"-"*50+"\n", file = out_file)
print(f"Word".ljust(30)+f"Frequency".rjust(10),file = out_file)
for i in dic:
 print(f"{i}".ljust(30)+f"{dic.get(i)}".rjust(10),file = out_file)

# Refer to a new dictionary again
dic = {}

# Call the find_letter_frequency function and store the result in that dictionary
dic = find_letter_frequency(text,dic)

#Print the letters with their frequency in the file
print("-"*50+"\nLetter Frequency\n"+"-"*50+"\n", file = out_file)
print(f"Letter".ljust(30)+f"Frequency".rjust(10),file = out_file)
for it in dic:
 print(f"{it}".ljust(30)+f"{dic.get(it)}".rjust(10),file = out_file)

# Call the calculate_entropy function with the letter-frequency dictionary populated above
dic = calculate_entropy(dic)

#Print the letter with their entropies in the file
print("-"*50+"\nLetter Entropy\n"+"-"*50+"\n", file = out_file)
print(f"Letter".ljust(30)+f"Entropy".rjust(10),file = out_file)
for value in dic:
 print(f"{value}".ljust(30)+f"{dic.get(value)}".rjust(10),file = out_file)
out_file.close()
print("Done")


Writing P1_Tufan.py


In [None]:
  for i in x:
    if x[i] in values:
      print()
      ten_words[i] = x.get(i,0)
      values.insert(0,0)
      print(len(values))

### Your Tasks:

Complete all the tasks described in the project description and submit your completed script file in the designated folder on BrightSpace.


Hints:
- Use a function to complete each task
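
For the reporting step, one way to follow this hint is a small helper that writes a single titled section, so that every finding is formatted the same way. This is a sketch only: the name `write_section`, the column widths, and the sample rows are hypothetical, and `io.StringIO` stands in for the real `summary.txt` file so the example runs without touching the disk.

```python
import io

def write_section(out_file, title, rows):
    """Write a titled two-column table to an already open text file."""
    # Title framed by two rules, one item per line
    print("-" * 40, title, "-" * 40, sep="\n", file=out_file)
    for key, value in rows:
        # Left-align the key in 30 columns, right-align the value in 10
        print(f"{str(key):<30}{str(value):>10}", file=out_file)

# In the project this would be: with open("summary.txt", "w") as out_file: ...
buffer = io.StringIO()
write_section(buffer, "Letter Frequency", [("a", 12), ("b", 3)])
report = buffer.getvalue()
print(report)
```

Opening the real file in a `with` block closes it automatically, even if one of the calculations raises an error.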

Feel free to ask if you have any questions.

>**Notify me immediately if you find any ambiguity and/or error in this document.**


### Grading scheme :
-    [1 pt] The program reads the file and cleans the text (removes all punctuation and non-ASCII characters, converts everything to lower case, etc.)
-    [1.5 pts] The program reports the total number of unique words in the book correctly
-    [1.5 pts] The program reports the longest word in the document and its length correctly
-    [1.5 pts] The program calculates the frequencies of "word lengths" correctly
-    [1.5 pts] The program reports the top 10 most used words and their frequencies correctly
-    [1 pt] The program reports the frequencies of letters correctly
-    [1 pt] Correct use of functions, variable names, docstrings
-    [1 pt] The `summary.txt` file contains the correct findings
-    [1.5 pts: BONUS] The program calculates and reports the entropy of each letter correctly

#### Submission deadline

- The submission deadline for this project is **June 20, 2022**.