Skip to content

Using Python, given an input csv file, find pairs that appear the most frequency

License

Notifications You must be signed in to change notification settings

Mr-Ming/High-Occurance-Pairs

Repository files navigation

High-Occurance-Pairs

Using Python, given an input TXT file Artist_lists_small.txt,

  1. find pairs that appear at least 50x most frequent
  2. sort it by frequency
  3. then output it to outputFile.csv

The application can be modified:

  1. minimum appearance can be changed here: https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L8
  2. input file name can be changed here: https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L6
  3. output file name can be changed here: https://github.com/Mr-Ming/High-Occurance-Pairs/blob/master/find_high_occurance_pairs.py#L7

To Run The Program

  1. Follow the setup instruction to install python and cloning the repo
  2. cd into the root of the program directory cd High-Occurance-Pairs
  3. run python find_high_occurance_pairs.py

To Run The Unit Test (IMPORTANT! TEST NEED TO BE RUN FROM ROOT DUE)

  1. Follow the setup instruction to install python and cloning the repo
  2. cd into the root of the program directory cd High-Occurance-Pairs
  3. run python test/classes/permutation_test.py

.............

Ran 10 tests in 0.001s

OK

screen shot 2018-11-22 at 12 58 04 am

Setup Instructions

  1. Let install homebrew, a package manager that will help us install python
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  1. Install Python through brew
brew install python
  1. Clone the Repo
git clone https://github.com/Mr-Ming/High-Occurance-Pairs.git

Optimization / Choice of Language

Originally I was going to write this script using PHP that does the following

  • Read the Input File, then iterate through everything, then generates the pairs through a double for-loop
  • Save each pair as an index of an array and the value of the array would be its # of occurance
  • Finally Save the result to the output file while occurance >= 50

But then using that method has a big O complexity of O(n^3) due to 3 nested-loops (1st-reading from input, 2nd & 3rd is generating the pair)

So I though this would be a better time to explore python and read more about it, because python is very good for performing complex computation as its widely use in Machine Learning

My research led me to 2 things

In this scenario the big O complexity is O(n^2) due to 2 nested-loops because now it only cost 1 loop to generate the pair instead of 2 from the php case.

Screenshots

  1. input

screen shot 2018-11-22 at 1 00 46 am 2. output

screen shot 2018-11-22 at 1 00 36 am

About

Using Python, given an input csv file, find pairs that appear the most frequency

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages