GitHub - Kbarias/BigData_Project2

*Prerequisites for running in Jupyter Notebook (recommended; output in the terminal is shown without decimal precision)
1. Have Java installed (specifically version 8)
2. Have Apache Spark downloaded
3. Have Anaconda installed
4. Have Jupyter Notebook installed (useful link: https://medium.com/@naomi.fridman/install-pyspark-to-run-on-jupyter-notebook-on-windows-4ec2009de21f)
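
A quick way to see which of the prerequisites above are visible on the current machine is sketched below. This is only an illustration: the helper name missing_prerequisites is made up for this README, and treating a set SPARK_HOME variable (or a spark-submit executable on the PATH) as "Spark is available" is an assumption, not something the project requires. It does not check the Java version or the Anaconda install.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return the names of prerequisites that could not be found.

    Hypothetical helper: `env` and `which` are injectable so the check
    can be exercised without touching the real machine.
    """
    env = os.environ if env is None else env
    missing = []
    # Java must be on the PATH (the version is not checked here).
    if which("java") is None:
        missing.append("java")
    # Assumption: Spark counts as available if SPARK_HOME is set
    # or spark-submit can be found on the PATH.
    if not env.get("SPARK_HOME") and which("spark-submit") is None:
        missing.append("spark")
    # Jupyter ships a `jupyter` launcher executable.
    if which("jupyter") is None:
        missing.append("jupyter")
    return missing


if __name__ == "__main__":
    print(missing_prerequisites() or "all prerequisites found")
```

An empty list means all three were found; anything listed still needs to be installed or added to the PATH.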

How to run the project on Jupyter Notebook
1. Open Jupyter Notebook via Anaconda
2. Open tf-idf.py in a Jupyter Notebook
3. Press the 'Run' button.


How to run the project in terminal
1. Navigate to folder where tf-idf.py file is downloaded
2. Comment out the lines 'import findspark' and 'findspark.init()' in the file
3. Type into terminal: spark-submit tf-idf.py
4. Press Enter
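
As an alternative to commenting the findspark lines in and out by hand, the two lines could be guarded so the same file works in both setups. The sketch below is not what tf-idf.py currently does; needs_findspark and init_spark_path are hypothetical names, and the approach assumes that pyspark is already importable when the script is launched through spark-submit.

```python
import importlib.util


def needs_findspark(find_spec=importlib.util.find_spec):
    """Return True when pyspark is not importable yet but findspark is.

    `find_spec` is injectable only so the decision can be tested
    without installing or removing packages.
    """
    return find_spec("pyspark") is None and find_spec("findspark") is not None


def init_spark_path():
    """Call findspark.init() only when it is actually needed."""
    if needs_findspark():
        # Assumption: findspark can locate Spark (e.g. via SPARK_HOME).
        import findspark
        findspark.init()
```

Calling init_spark_path() at the top of the script, before any pyspark import, would replace the two findspark lines.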

**To get the tf x idf values for terms matching the pattern 'gene_xxx_gene', uncomment the line: #print(filtered_terms.collect())
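
For illustration, selecting terms of the form 'gene_xxx_gene' from a list of (term, score) pairs can be sketched in plain Python as below. This is not the code from tf-idf.py: the function name filter_gene_terms, the sample data, and the reading of 'xxx' as "one or more characters" are all assumptions.

```python
import re

# Assumption: 'xxx' stands for any non-empty run of characters.
GENE_PATTERN = re.compile(r"^gene_.+_gene$")


def filter_gene_terms(scored_terms):
    """Keep only the (term, score) pairs whose term looks like gene_xxx_gene."""
    return [(term, score) for term, score in scored_terms
            if GENE_PATTERN.match(term)]


if __name__ == "__main__":
    # Made-up sample data, not results from the project.
    sample = [
        ("gene_brca1_gene", 0.5),
        ("disease_cancer_disease", 0.2),
        ("gene_tp53_gene", 0.3),
    ]
    print(filter_gene_terms(sample))
```

On a Spark RDD the same predicate could be passed to filter() before calling collect().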