CMPS242_proj_KRJB algorithm.pdf


This is the class project for CMPS 242, using data from the Yelp Dataset Challenge. Currently only Python 2 is supported, and the scripts will crash on machines with less than 8 GB of RAM.


Tools we need:

  • pandas
sudo pip install pandas
  • nltk
sudo pip install nltk
  • download the required NLTK data packages
>>> import nltk
>>> nltk.download()
  • scikit-learn for comparison
sudo pip install -U scikit-learn

Data we need:

  • yelp_academic_dataset_business.json
  • yelp_academic_dataset_review.json

First, use the script from this repository to convert the JSON files into CSV format (yelp_academic_dataset_business.json and yelp_academic_dataset_review.json become yelp_academic_dataset_business.csv and yelp_academic_dataset_review.csv).

Run these two commands in the shell:

python yelp_academic_dataset_business.json
python yelp_academic_dataset_review.json
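
For readers who want to see what the conversion step involves, here is a minimal sketch. It is not the converter used by this project. It assumes the Yelp files contain one JSON object per line, and the function name `json_lines_to_csv` and the two sample records are made up for illustration.

```python
import csv
import io
import json


def json_lines_to_csv(json_file, csv_file):
    """Convert newline-delimited JSON objects into CSV rows.

    Hypothetical helper: returns the number of records written.
    """
    # Parse every non-empty line as one JSON object.
    records = [json.loads(line) for line in json_file if line.strip()]
    # Use the sorted union of all keys as the header so no field is dropped.
    columns = sorted({key for record in records for key in record})
    writer = csv.DictWriter(csv_file, fieldnames=columns, restval="")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return len(records)


# Two made-up records standing in for the review file.
sample = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great tacos"}\n'
    '{"review_id": "r2", "stars": 2, "text": "Slow service"}\n'
)
out = io.StringIO()
n_written = json_lines_to_csv(sample, out)
print(out.getvalue())
```

This sketch holds all records in memory and writes nested values (such as attribute dictionaries) as their string representation, so a real converter would need to stream the input and flatten nested fields.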

Put these two files into the 'data' directory. Then run the feature-generation script to generate pickled feature files. Here we randomly sample 1% of the dataset, since processing the entire dataset would take too much time. Optionally pass flags (-u, -b, -l, -a, -t) to select which features to use. Run the following command to see detailed messages about the options.

python -h

For instance, run the command below to generate feature files using unigrams, LIWC scores, and TF-IDF frequency:

python -u -l -t

The generated feature files will then be written to the output directory.
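
The 1% sampling and pickling steps described above might look roughly like the following sketch. It uses a synthetic DataFrame instead of the real review CSV, and the column names and the random seed are assumptions, not values taken from this project.

```python
import pickle

import pandas as pd

# Synthetic stand-in for the review table (the real file is much larger).
reviews = pd.DataFrame({
    "review_id": ["r%d" % i for i in range(1000)],
    "stars": [1 + i % 5 for i in range(1000)],
})

# Draw a reproducible 1% random sample of the rows.
sample = reviews.sample(frac=0.01, random_state=242)
print(len(sample))

# Round-trip the sample through pickle, as a feature file would be.
restored = pickle.loads(pickle.dumps(sample))
```

Fixing `random_state` makes the sample repeatable between runs, which helps when comparing feature sets on the same subset of reviews.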


## Modeling and Prediction

Run the file with the required keyword arguments (-c, -d) to train a model and predict. Run the following command to see detailed messages about the required arguments.

python -h

For instance, run the command below to build a Naive Bayes classifier trained on the features selected above (unigrams, LIWC scores, and TF-IDF frequency).

python -c nb -d pickle-l-t-u
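
To give a feel for what a Naive Bayes classifier does with unigram features, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is an illustration only and not this project's implementation: the function names, the whitespace tokenizer, the "pos"/"neg" labels, and the four toy reviews are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and unigram occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / float(total_docs))
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # Ignore words never seen in training.
            # Add-one smoothed log likelihood of the token given the class.
            count = word_counts[label][token] + 1.0
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (made up for illustration).
docs = [
    "great food and friendly staff",
    "terrible service and cold food",
    "friendly staff great coffee",
    "cold coffee terrible staff",
]
labels = ["pos", "neg", "pos", "neg"]
model = train_naive_bayes(docs, labels)
print(predict(model, "great friendly place"))
print(predict(model, "terrible cold food"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word that is unseen in one class from driving that class's probability to zero.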

The prediction result will be printed to sys.stdout, showing the accuracy, precision, recall, and F1 score. These are defined as:

  • Accuracy = fraction of all predictions that agree with the annotated data
  • Precision = fraction of returned (predicted positive) results that are correct
  • Recall = fraction of the truly positive items that are returned
  • F1 score = harmonic mean of precision and recall

Sample output:

  Accuracy:     0.752411455812
  Precision:    0.793605698051
  Recall:       0.930122403039
  F1 score:     0.856458090426
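
As a worked example of how these four numbers relate, the sketch below computes them from a small set of made-up binary labels and then checks that the F1 score reported above is the harmonic mean of the reported precision and recall. The function name `classification_metrics` and the toy labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / float(len(pairs))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Toy labels (made up): 4 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)

# Consistency check on the precision and recall from the sample output.
reported_p, reported_r = 0.793605698051, 0.930122403039
reported_f1 = 2 * reported_p * reported_r / (reported_p + reported_r)
print(round(reported_f1, 6))
```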

word_category_counter is a LIWC simulator script created and distributed within the NLDS lab. It uses the LIWC.dic data file to simulate the functionality of LIWC.