High-Occurance-Pairs

Using Python, given an input TXT file Artist_lists_small.txt,

find pairs that appear at least 50x most frequent
sort it by frequency
then output it to outputFile.csv

The application can be modified:

minimum appearance can be changed here: https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L8
input file name can be changed here: https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L6
output file name can be changed here: https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L7

To Run The Program

Follow the setup instruction to install python and cloning the repo
cd into the root of the program directory cd High-Occurance-Pairs
run python find_high_occurance_pairs.py

To Run The Unit Test (IMPORTANT! TEST NEED TO BE RUN FROM ROOT DUE)

Follow the setup instruction to install python and cloning the repo
cd into the root of the program directory cd High-Occurance-Pairs
run python test/classes/permutation_test.py

.............

Ran 10 tests in 0.001s

OK

Setup Instructions

Let install homebrew, a package manager that will help us install python

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install Python through brew

brew install python

Clone the Repo

git clone https://github.com/Mr-Ming/High-Occurance-Pairs.git

Optimization / Choice of Language

Originally I was going to write this script using PHP that does the following

Read the Input File, then iterate through everything, then generates the pairs through a double for-loop
Save each pair as an index of an array and the value of the array would be its # of occurance
Finally Save the result to the output file while occurance >= 50

But then using that method has a big O complexity of O(n^3) due to 3 nested-loops (1st-reading from input, 2nd & 3rd is generating the pair)

So I though this would be a better time to explore python and read more about it, because python is very good for performing complex computation as its widely use in Machine Learning

My research led me to 2 things

itertool.combination (https://docs.python.org/2/library/itertools.html) -- Which is a very powerful python package that lets you quickly generate permutation of result in various length you like
collections.defaultdict(int) (https://docs.python.org/2/library/collections.html#collections.defaultdict) -- Which is the best data structure to store exactly this specific type of data because I wanted a data structure type that allows me to --- (1) Store both the pair and # of occurance as 1 array --- (2) Have O(1) constant time for searching

In this scenario the big O complexity is O(n^2) due to 2 nested-loops because now it only cost 1 loop to generate the pair instead of 2 from the php case.

Screenshots

input

2. output

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
classes		classes
test/classes		test/classes
.gitignore		.gitignore
Artist_lists_small.txt		Artist_lists_small.txt
LICENSE		LICENSE
README.md		README.md
find_high_occurance_pairs.py		find_high_occurance_pairs.py
outputFile.csv		outputFile.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

High-Occurance-Pairs

To Run The Program

To Run The Unit Test (IMPORTANT! TEST NEED TO BE RUN FROM ROOT DUE)

.............

Setup Instructions

Optimization / Choice of Language

Screenshots

About

Releases

Packages

Languages

License

Mr-Ming/High-Occurance-Pairs

Folders and files

Latest commit

History

Repository files navigation

High-Occurance-Pairs

To Run The Program

To Run The Unit Test (IMPORTANT! TEST NEED TO BE RUN FROM ROOT DUE)

.............

Setup Instructions

Optimization / Choice of Language

Screenshots

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages