Project Report can be seen here.
- Python virtual environment has been set up using `pipenv`. You need `pipenv` installed (learn more about installation and usage).
- Even though we have converted the data into JSON files using Tika, you may want to do it yourself. To learn more, check out the notes we have written below and the Tika documentation.
- There are several other packages/tools you may want to use along the way. You should check out the instructions for this assignment.
- First and foremost, build the `pipenv` environment by running the `pipenv install` command in this working directory. We use Jupyter notebooks for all of our coding, so you may want to install the ipykernel as well. To do so:

  ```
  $ pipenv shell  # this takes you to the virtual environment
  $ python -m ipykernel install --user --name=<my-virtualenv-name>  # change the kernel name as you see fit
  $ jupyter lab  # run a JupyterLab instance on your localhost
  ```
- [Task 4] Download the fraudulent emails dataset from Kaggle and put it into the `data` directory. Convert the Tika output to JSON:

  ```
  $ java -jar tika-app-2.0.0-ALPHA.jar -J -t -r data/fradulent_emails.txt > data/fradulent_emails_t.json
  ```

  Explanation (learn more):
  - `-J` is recursive JSON. [doc] `-J or --jsonRecursive: Output metadata and content from all embedded files (choose content type with -x, -h, -t or -m; default is -x)`
  - `-t` is output plain text content. [doc] `-t or --text: Output plain text content`
  - `-r` is pretty print.

  We converted the data with all of these flag options, but we mainly used the `-t` output.
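Once Tika has produced the JSON file, the extracted text can be pulled out in Python. This is a minimal sketch, assuming the `-J` output is a JSON array of metadata objects whose extracted text lives under the `"X-TIKA:content"` key (the file path and key name here are illustrative, not taken from the repo):

```python
import json

def load_tika_text(path):
    # Load Tika -J output: a JSON array with one metadata object per
    # (embedded) document; extracted text is under "X-TIKA:content".
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [r.get("X-TIKA:content", "") for r in records]

# Demo with an in-memory record instead of the real output file:
sample = [{"X-TIKA:content": "Dear friend, ..."}]
texts = [r.get("X-TIKA:content", "") for r in sample]
print(texts[0])
```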
- [Task 5] Jupyter notebooks in Task 5.
  Run through each cell in the notebooks; they either generate a new feature JSON file or upload each of the features to Firebase, where our team stores the data. As long as you are using the virtual environment kernel mentioned in Step 0 of the Build Instructions, you should have the packages you need.
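The shape of what these notebooks do can be sketched as follows. This is a hypothetical example, not the actual notebook code: the feature names (`length`, `has_urgent`) and the output filename are made up for illustration.

```python
import json

# Compute one toy feature per email and dump it to a JSON file,
# mirroring the "generate a new feature JSON file" step.
emails = ["URGENT business proposal", "Re: meeting notes"]
features = {str(i): {"length": len(text),
                     "has_urgent": "urgent" in text.lower()}
            for i, text in enumerate(emails)}

with open("feature_demo.json", "w") as f:
    json.dump(features, f, indent=2)

print(features["0"]["has_urgent"])  # True
```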
- [Task 6] Jupyter notebooks in Task 6.
  Run through each cell in the three notebooks; each notebook handles one dataset. We used Firebase to store our data, but we have also accommodated the grader with a local version via a JSON dump.
- [Task 7] Export PDF files in the visualization directory. We offer circle packing and dynamic circle packing clustering visualizations. We have also saved all the `circle.json` and `cluster.json` files from each similarity metric.

  To re-run the visualization:
  - Step 0: copy circle-packing-for-all.py to the compiled `tika-similarity` repo.
  - Run:

    ```
    $ python3 circle-packing-for-all.py --inputCSV <path-to-tika-output-csv> --cluster 2
    ```

    The above step generates the `circle.json` and `cluster.json` files.
  - Run:

    ```
    $ python3 -m http.server 8000
    ```

    The above step starts a localhost server at port 8000. Then open dynamic circle packing or circle packing to see the visualizations.

  Sample visualizations (edit-distance, dynamic circle packing): edit-distance-viz
  Sample visualizations (cosine, circle packing): cosine-viz
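If you want to inspect the generated files before opening the pages, here is a small sketch. It assumes the `circle.json` uses a D3-style nested hierarchy (`{"name": ..., "children": [...]}`), which is what circle packing pages typically consume — the exact structure produced by tika-similarity may differ.

```python
import json

def count_leaves(node):
    # Count leaf nodes (documents) in a nested name/children hierarchy.
    children = node.get("children")
    if not children:
        return 1
    return sum(count_leaves(c) for c in children)

# In-memory stand-in for json.load(open("circle.json")):
circle = {"name": "root", "children": [
    {"name": "cluster0", "children": [{"name": "doc1"}, {"name": "doc2"}]},
    {"name": "cluster1", "children": [{"name": "doc3"}]},
]}
print(count_leaves(circle))  # 3
```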
- [Task 8] TSV generation: Jupyter notebook [here](notebooks/Task7-TikaSimilarity/TSV generation & data for tika-smilarity.ipynb). Output is in the `data` directory.

Firebase URL: https://copydsci550.firebaseio.com/
We stored additional data in Firebase. There is a local backup here. If you want to access the data using the REST API, you can use `curl`:

```
$ curl '<firebase-URL>.json'
```
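The same REST call can be made from Python. A minimal sketch — the `features` path is a placeholder, not the project's actual database layout; Firebase's REST API returns JSON when `.json` is appended to a node's URL:

```python
import json
from urllib.request import urlopen

def firebase_rest_url(base, path=""):
    # Build the REST URL for a database node, appending ".json"
    # per Firebase's REST convention.
    return base.rstrip("/") + "/" + path.strip("/") + ".json"

url = firebase_rest_url("https://copydsci550.firebaseio.com/", "features")
print(url)  # https://copydsci550.firebaseio.com/features.json

# To actually fetch (requires network access):
# data = json.loads(urlopen(url).read())
```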
- Python virtual environment has been set up using `pipenv`. You need `pipenv` installed (learn more). Then run:

  ```
  $ pipenv install
  ```

  `pipenv` will install all Python packages in the virtual environment. In the future, use

  ```
  $ pipenv install <wanted-package>
  ```

  to install a Python package; `pipenv` will keep track of the packages used in our project.
- `fradulent_emails.txt` has been converted to read-only. To modify the data, run this command in the `data` directory:

  ```
  $ new_file_name="<your-new-file-name>" bash -c 'cp fradulent_emails.txt ${new_file_name}; chmod 0644 ${new_file_name}'
  ```

  The command makes a copy of the data that can be read and written.
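The same copy-and-make-writable step can be done in Python. A sketch, demonstrated on a throwaway file rather than the real dataset:

```python
import os
import shutil
import stat

def writable_copy(src, dst):
    # Copy the read-only data file, then make the copy readable and
    # writable (mode 0644), matching the cp + chmod command above.
    shutil.copyfile(src, dst)
    os.chmod(dst, 0o644)
    return dst

# Demo on a throwaway file instead of fradulent_emails.txt:
with open("demo_src.txt", "w") as f:
    f.write("sample")
os.chmod("demo_src.txt", 0o444)  # read-only, like the original dataset
writable_copy("demo_src.txt", "demo_copy.txt")
print(oct(stat.S_IMODE(os.stat("demo_copy.txt").st_mode)))  # 0o644
```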
Please feel free to fork the repo and open a pull request. If you encounter any problem, feel free to email me.
This is Assignment 1 from DSCI 550 Spring 2021 at USC Viterbi School of Engineering. This repo is a collaboration among a group of six.
Team members: Zixi Jiang, Peizhen Li, Xiaoyu Wang, Xiuwen Zhang, Yuchen Zhang, Nat Zheng