CRIS-CS-Python/frequency-analysis

Frequency Analysis

We will create a program that

  • reads a text file from a path given on the command line.
  • counts only the letters and numbers in the text file. Letters should be converted to upper case.
  • prints a frequency table (called a histogram) in descending order (from most frequent to least), with each line containing freq<tab>character<tab>percentage where
    • freq is the frequency of occurrence for a character (how many times that character occurs in the file).
    • character is each unique character.
    • percentage is the ratio of freq to the total, printed as a fraction (for example, 0.124 rather than 12.4%).
  • prints the total number of characters (letters & numbers) counted.

The starter code in count_characters.py checks for proper command line usage and opens the text file, but you will need to implement the code that counts the frequencies and prints the histogram.
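One possible shape for the counting and printing logic is sketched below. It is not the starter code or a reference solution. The names count_characters, format_histogram and ALLOWED are hypothetical, and restricting the count to ASCII letters and digits is an assumption based on the sample output, which lists only A-Z and 0-9.

```python
import string
from collections import Counter

# Assumption: only ASCII letters and digits are counted.
ALLOWED = set(string.ascii_uppercase + string.digits)


def count_characters(text):
    """Return a Counter of upper-cased letters and digits found in text."""
    counts = Counter()
    for ch in text:
        upper = ch.upper()
        if upper in ALLOWED:
            counts[upper] += 1
    return counts


def format_histogram(counts):
    """Return output lines: freq<tab>character<tab>fraction, most frequent first."""
    total = sum(counts.values())
    lines = []
    # most_common() yields (character, count) pairs sorted by descending count.
    for char, freq in counts.most_common():
        lines.append(f"{freq}\t{char}\t{freq / total}")
    lines.append(f"Total characters: {total}")
    return lines


if __name__ == "__main__":
    # Small in-memory sample instead of a file, to keep the sketch self-contained.
    sample = "Hello, World! 2024"
    print("\n".join(format_histogram(count_characters(sample))))
```

In the real program the text would come from the file opened by the starter code rather than from a hard-coded string. Characters with equal counts may appear in a different order than in another implementation, since the task description does not specify tie-breaking.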

Finally, you will need to create a file brown-freqs.txt containing the output of your program when it analyzes brown.txt.

Brown Corpus

This repository contains a file brown.txt which contains the Brown Corpus, a large English-language document that contains text from a well-balanced set of categories. It is especially used for the "scientific study of the frequency and distribution of word categories in everyday language use."

The file was fetched from here.

Usage & Output

You can test it with any text file, but a correct program will analyze brown.txt to produce the following output.

$ py count_characters.py brown.txt
592982  E       0.12433649370089686
438979  T       0.09204513740438328
381728  A       0.08004074502675622
360310  O       0.07554981777755505
345777  I       0.07250253765277026
336729  N       0.07060535258643483
310710  S       0.06514968744043777
290953  R       0.061007038749501745
258019  H       0.05410143607767472
196168  L       0.04113251548329888
188251  D       0.03947247855025538
147237  C       0.030872661097704403
128805  U       0.02700783846920146
120667  M       0.025301462245744595
110706  F       0.023212839296389245
95951   P       0.020119010201143973
92531   G       0.019401904439995964
89151   W       0.01869318588073273
81735   Y       0.017138198651296
72822   B       0.015269320391321678
47272   V       0.00991199518742356
31188   K       0.006539501309556734
9439    X       0.0019791699647590745
7640    J       0.0016019555599914533
5217    1       0.001093900805821389
5103    Q       0.0010699972804497889
4516    Z       0.0009469150927907597
4473    0       0.0009378988507646297
2690    2       0.0005640393267509175
2186    5       0.0004583605830027902
2134    9       0.0004474572205525866
1778    3       0.00037281112377811576
1491    4       0.0003126329502548766
1467    6       0.00030760062912401335
1289    8       0.0002702775807367779
1077    7       0.0002258254107474863
Total characters: 4769171
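
As a quick sanity check on the third column, the value for E can be recomputed from the first column and the total. This is only an illustration using numbers copied from the output above; the variable names are hypothetical.

```python
# Values copied from the expected output for brown.txt.
freq_e = 592982
total = 4769171

# The third column is freq divided by the total, printed as a plain fraction.
ratio = freq_e / total
print(ratio)

# The same value formatted as a percentage with two decimal places.
print(f"{ratio:.2%}")
```

The second line of output shows that E accounts for roughly 12.43% of the counted characters.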

Output Redirection

$ py count_characters.py brown.txt > brown-freqs.txt

git add, commit and push

When your files are complete and tested, add, commit, and push them.

# Check repository status
git status

# add new files or files that changed
git add "<changed files>"

# commit the added files, giving a good message describing the changes
git commit -m 'description of change'

# push changes to github
git push

About

This is an exercise in counting tokens (characters or words) and analyzing the results.
