Text-Mining-Yelp-Star-Rating-Prediction

Guanxu Su, Shurong Gu, Yuwei Sun

This is a course project that aims to find out what makes a review positive or negative, based on the review text and a small set of attributes, and to propose a model that predicts the star rating of a review from its text.

This repository contains the data, image, and code folders, the summary notebook summary.ipynb, and the presentation slides presentation_slides.pptx.

data

GitHub has a file size limit of 100 MB, so we put the large files, together with all other relevant files, at the following Google Drive link: https://drive.google.com/open?id=1GTTO_KAm55bn2m3ZsT2xsLjwOyxv8Kvg

result1.csv, result2.csv, result3.csv, and result4.csv are the final result files for the Kaggle competition.

code

step1_output_rawtext.R : pre-clean the text data of the training set (remove line feeds, commas, and quotation marks), then sample the training data evenly.

Input : train_data.csv ; Output: raw_text.csv
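The actual script here is R; the following is only a rough Python sketch of the same idea, where the "text" and "stars" column names and the per-star sample size are assumptions:

```python
import pandas as pd

# Sketch of the precleaning step; the repository implements this in R.
train = pd.read_csv("train_data.csv")

# Remove line feeds, commas, and quotation marks so the text survives a plain CSV round trip.
for ch in ["\n", ",", '"']:
    train["text"] = train["text"].str.replace(ch, " ", regex=False)

# Sample evenly across star ratings (100,000 reviews per star is illustrative, not the repo's figure).
sampled = (train.groupby("stars", group_keys=False)
                .apply(lambda g: g.sample(n=min(len(g), 100_000), random_state=1)))
sampled.to_csv("raw_text.csv", index=False)
```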

step2_process_text_(1-5).py : clean the review texts of the sampled training set by removing punctuation, digits, and extra whitespace and converting everything to lower case, then perform spelling correction and lemmatization in parallel.

Input : raw_text.csv ; Output : processed_text(1-5).csv
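A minimal sketch of one chunk's cleaning pass, assuming NLTK's WordNetLemmatizer for lemmatization and TextBlob's correct() as a stand-in for the unspecified spell checker; the chunk indexing mirrors the (1-5) split:

```python
import re
import pandas as pd
from nltk.stem import WordNetLemmatizer  # needs the NLTK "wordnet" corpus downloaded
from textblob import TextBlob            # stand-in spell checker; the actual library isn't stated

lemmatizer = WordNetLemmatizer()

def clean_review(text: str) -> str:
    # Lowercase, drop punctuation and digits, collapse extra whitespace.
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Spelling correction is slow, which is why the work is split into five parallel scripts.
    text = str(TextBlob(text).correct())
    # Lemmatize each token.
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.split())

chunk_id = 1  # each of the five scripts handles its own slice of raw_text.csv
raw = pd.read_csv("raw_text.csv")
part = raw.iloc[(chunk_id - 1) * len(raw) // 5 : chunk_id * len(raw) // 5].copy()
part["text"] = part["text"].apply(clean_review)
part.to_csv(f"processed_text{chunk_id}.csv", index=False)
```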

step2_process_text_merge.py : merge the outputs of step2_process_text_(1-5).py into a single file.

Input : processed_text(1-5).csv ; Output : processed_text.csv (rows: 500,000)
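The merge itself is straightforward; a sketch, assuming the five chunks are named processed_text1.csv through processed_text5.csv:

```python
import pandas as pd

# Concatenate the five cleaned chunks back into one 500,000-row file.
parts = [pd.read_csv(f"processed_text{i}.csv") for i in range(1, 6)]
pd.concat(parts, ignore_index=True).to_csv("processed_text.csv", index=False)
```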

step3_fiture.R : remove the businesses with fewer than 3 reviews and fewer than 14 running days. Extract phrases and select words for the interpretable model on the training data of size 400,000, and extract features for the interpretable model for both the 400,000-row training set and the test set of about 80,000 rows.

Input : train_data.csv, processed_text.csv ; Output : bottom_right.csv, phrase.csv, phrase_united.csv, byword.csv, words_score.csv, train_sample.csv, train_val.csv
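As a hypothetical pandas version of just the business filter (the repository does this in R; the "business_id" and "date" column names are assumptions, and this sketch keeps only businesses that meet both thresholds):

```python
import pandas as pd

# Filter out businesses with too few reviews or too short an active period.
train = pd.read_csv("train_data.csv", parse_dates=["date"])

stats = train.groupby("business_id")["date"].agg(
    n_reviews="count",
    running_days=lambda d: (d.max() - d.min()).days,
)
keep = stats[(stats["n_reviews"] >= 3) & (stats["running_days"] >= 14)].index
train = train[train["business_id"].isin(keep)]
```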

step4_output_rawtext_test.R : pre-clean the text data of the test and validation sets (remove line feeds, commas, and quotation marks), then sample the test and validation data evenly.

Input : testval_data.csv ; Output: raw_text_test.csv

step5_test_text_(1-5).py : clean the review texts of the test and validation sets by removing punctuation, digits, and extra whitespace and converting everything to lower case, then perform spelling correction and lemmatization in parallel (the same cleaning as step 2).

Input : raw_text_test.csv ; Output : test_text(1-5).csv

step5_test_text_merge.py : merge the outputs of step5_test_text_(1-5).py into a single file.

Input : test_text(1-5).csv ; Output : test_text.csv

step5_test_features.R : handle the phrases in the test and validation sets and extract features for them.

Input : testval_data.csv, test_text.csv, phrase_united.csv, phrase.csv, words_score.csv ; Output : test_clean.csv
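One part of this step can be illustrated with a hypothetical word-score lookup. The sketch below assumes words_score.csv has "word" and "score" columns and that the feature is simply the average score of the known words in each review; how the scores are actually combined is not described in this README:

```python
import pandas as pd

# Hypothetical application of the learned word scores to the cleaned test reviews.
scores = pd.read_csv("words_score.csv").set_index("word")["score"]
test = pd.read_csv("test_text.csv")

def review_score(text: str) -> float:
    # Average the scores of the words that appear in the learned vocabulary.
    vals = [scores[tok] for tok in str(text).split() if tok in scores.index]
    return sum(vals) / len(vals) if vals else 0.0

test["word_score"] = test["text"].apply(review_score)
test.to_csv("test_clean.csv", index=False)
```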

step6_interpretable.R : fit a CART model on the extracted features and test it on a held-out set of about 80,000 rows (extracted from train_data.csv and not involved in the word-score calculation or phrase-extraction procedures).

Input : train_val.csv, train_sample.csv; Output : None
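The repository fits the CART model in R; as a rough scikit-learn analogue (the "stars" label column, numeric feature columns, and the tree hyperparameters are assumptions):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Interpretable CART model on the extracted features.
train = pd.read_csv("train_sample.csv")
heldout = pd.read_csv("train_val.csv")

features = [c for c in train.columns if c != "stars"]
cart = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50)  # a shallow tree keeps it interpretable
cart.fit(train[features], train["stars"])

pred = cart.predict(heldout[features])
print("held-out accuracy:", accuracy_score(heldout["stars"], pred))
```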

step7_kaggle prediction.py : fit the high-accuracy model on the sparse feature matrix and predict on the test and validation sets.

Input : train_sample.csv, test_clean.csv ; Output : result1.csv, result2.csv, result3.csv, result4.csv
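This README does not name the model, so the sketch below uses logistic regression on a sparse matrix purely as an illustrative stand-in; the "stars" label column, numeric feature columns, the submission column names, and the choice of model are all assumptions:

```python
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in for the high-accuracy model; the actual model is not specified here.
train = pd.read_csv("train_sample.csv")
test = pd.read_csv("test_clean.csv")

features = [c for c in train.columns if c != "stars"]
X_train = sparse.csr_matrix(train[features].values)  # the step works on a sparse matrix
X_test = sparse.csr_matrix(test[features].values)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train["stars"])

# Write one submission file (the column names are guesses at the Kaggle format).
pd.DataFrame({"Id": range(1, len(test) + 1),
              "Expected": model.predict(X_test)}).to_csv("result1.csv", index=False)
```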

