Text-Mining-Yelp-Star-Rating-Prediction

This is a course project aims at finding out what makes a review positive or negative based on the review and a small set of attributes and proposing a prediction model to predict the ratings of reviews based on the text. predicting review stars based on yelp review text features

There are 4 files in this repository: data, image and code, summary.ipynb and the presentation slides presentation_slides.pptx.

data

There is a file size limit of 100MB on github. Hence we put the large files together with all other relavent files at the google drive link: https://drive.google.com/open?id=1GTTO_KAm55bn2m3ZsT2xsLjwOyxv8Kvg

result1.csv, result2.csv, result3.csv, result4.csv are the final result files for kaggle competation.

code

step1_output_rawtext.R : Precleaning the text data of training set (remove line feed, "," and ""). Sample training data evenly.

Input : train_data.csv ; Output: raw_text.csv

step2_process_text_(1-5).py : clean the review texts of sampled training set, by removing punctuation, digits, extra whitespaces and transforming all cases to lower cases. Then we performed spelling correction and lemmatization, parallely.

Input : raw_text.csv ; Output : processed_text(1-5).csv

step2_process_text_merge.py : merge the output of step2_process_text_(1-5).py together.

Input : processed_text(1-5).csv ; Output : processed_text.csv (rows: 500,000)

step3_fiture.R : Remove the businesses with less than 3 comments and 14 running days. Extract phrases, select words for interpretable model on training data of size 400,000. Extract features for interpretable model for both training data of size 400,000 and testset of size about 80,000.

Input : train_data.csv, processed_text.csv ; Output : bottom_right.csv, phrase.csv, phrase_united.csv, byword.csv, words_score.csv, train_sample.csv, train_val.csv

step4_output_rawtext_test.R : Precleaning the text data of test and validation set (remove line feed, "," and ""). Sample test and validation data evenly.

Input : testval_data.csv ; Output: raw_text_test.csv

step5_test_text_(1-5).py : clean the review texts of test and validation set, by removing punctuation, digits, extra whitespaces and transforming all cases to lower cases. Then we performed spelling correction and lemmatization, parallely.

Input : raw_text_test.csv ; Output : test_text(1-5).csv

step5_test_text_merge.py : merge the output of step5_test_text_(1-5).py together.

Input : test_text(1-5).csv ; Output : test_text.csv

step5_test_features.R : deal with the phrases in test and validation set. Extract features for test and validation set.

Input : testval_data.csv, test_text.csv, phrase_united.csv, phrase.csv, words_score.csv ; Output : test_clean.csv

step6_interpretable.R : perform CART model on the extracted features, and test the model on testset of size about 80,000 (extract from train_data.csv and do not envolve in words score calculation and phrase extraction procedure).

Input : train_val.csv, train_sample.csv; Output : None

step7_kaggle prediction.py : perform high accuracy model on the sparse matrix, and predict on test and validation set. Input : train_sample.csv, test_clean.csv; Output : result1.csv, result2.csv, result3.csv, result4.csv

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
code		code
data		data
image		image
README.md		README.md
Summary.ipynb		Summary.ipynb
presentation_slides.pptx		presentation_slides.pptx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code

code

data

data

image

image

README.md

README.md

Summary.ipynb

Summary.ipynb

presentation_slides.pptx

presentation_slides.pptx

Repository files navigation

Text-Mining-Yelp-Star-Rating-Prediction

data

result1.csv, result2.csv, result3.csv, result4.csv are the final result files for kaggle competation.

code

result1.csv, result2.csv, result3.csv, result4.csv are the final result files for kaggle competation.

About

Releases

Packages

Languages

ModelerGuanxuSu/Text-Mining-Yelp-Star-Rating-Prediction

Folders and files

Latest commit

History

Repository files navigation

Text-Mining-Yelp-Star-Rating-Prediction

data

result1.csv, result2.csv, result3.csv, result4.csv are the final result files for kaggle competation.

code

result1.csv, result2.csv, result3.csv, result4.csv are the final result files for kaggle competation.

About

Resources

Stars

Watchers

Forks

Languages