# Step 1: BUSINESS UNDERSTANDING

* foody.vn hosts many restaurants with rich and diverse menus. To increase interaction with customers, each restaurant has a comment area where customers can leave reviews, comments, and scores. Analyzing these comments is faster and more effective than running customer survey campaigns.
* Sentiment is complex and sometimes subjective. Manual analysis (i.e., reading and classifying each comment) takes a lot of time and gives inconsistent results across analysts.
* Natural language processing (NLP) algorithms, especially for sentiment analysis, are well developed. Applying them to predict customer sentiment helps restaurants improve service quality in a timely manner.

# Step 2: DATA UNDERSTANDING

* The dataset has 3 columns: restaurant, review_text, review_score.
* review_score: ranges from 1 to 10. The higher the score, the higher the satisfaction.
* Create a new feature "review_score_new": class 1 if review_score < 7, class 0 if review_score >= 7. The classes are imbalanced: class 0 has more samples than class 1, but the focus is on class 1 (dislike).
* review_text: written in Vietnamese.
* Use Underthesea and CountVectorizer to transform the data into a bag of words.
* Use WordCloud to view important words.
* Save all datasets to csv files.
* Note: do not remove additional stopwords at first. Train the models on the original data, then remove stopwords and train the models again.
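
The labeling rule above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `to_label` and the sample scores are hypothetical and not taken from the project notebooks.

```python
from collections import Counter

# Threshold from the rule above: scores below 7 count as "dislike".
DISLIKE_THRESHOLD = 7


def to_label(review_score: float) -> int:
    """Return 1 (dislike) if review_score < 7, otherwise 0 (like)."""
    return 1 if review_score < DISLIKE_THRESHOLD else 0


# Hypothetical scores, for illustration only (not from the dataset).
sample_scores = [9.2, 6.5, 7.0, 3.0, 8.4, 10.0]
labels = [to_label(score) for score in sample_scores]

# Counting the labels is a quick way to see the class imbalance.
class_counts = Counter(labels)
print(labels)
print(class_counts)
```

Note that a score of exactly 7 falls into class 0, because the rule uses a strict `<` for class 1.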

# Step 3: DATA PREPARATION

* Do a train/test split with a 70:30 ratio => X_train, X_test, y_train, y_test
* Do undersampling => X_train_us, X_test_us, y_train_us, y_test_us
* Do oversampling => X_train_os, X_test_os, y_train_os, y_test_os
* Save all datasets to csv files.
* Note: when building and evaluating models, reuse these saved datasets where possible; if Data Preparation has to be redone, run it again.
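
To make the three operations above concrete, here is a minimal sketch of a 70:30 split, random undersampling, and random oversampling using only the standard library. The text does not say which library or which resampling strategy the project uses, nor the order in which splitting and resampling are combined, so the helper names (`split_70_30`, `random_undersample`, `random_oversample`) and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict


def split_70_30(samples, seed=42):
    """Shuffle the samples and cut them into 70% train and 30% test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def _group_by_label(samples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def random_undersample(samples, seed=42):
    """Shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def random_oversample(samples, seed=42):
    """Grow every class to the size of the largest class by redrawing with replacement."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 7 "like" (0) reviews and 3 "dislike" (1) reviews.
toy = [(f"review {i}", 0) for i in range(7)] + [(f"review {i}", 1) for i in range(7, 10)]

train, test = split_70_30(toy)
under = random_undersample(toy)
over = random_oversample(toy)
print(len(train), len(test), len(under), len(over))
```

Undersampling discards majority-class reviews, while oversampling repeats minority-class reviews, which is why both variants are worth comparing during modeling.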

# Step 4 & 5: MODELING & EVALUATION / ANALYZE AND REPORT

## 2A: Use postag, don't remove additional stopwords:
* Run LazyClassifier to identify which models are suitable.
* According to the ROC AUC, F1-score, and Accuracy of the models benchmarked by LazyClassifier, the Naive Bayes and tree-based models are the most suitable.
* Build and evaluate each model: MultinomialNB, BernoulliNB, LGBMClassifier, ExtraTreesClassifier, RandomForestClassifier.
* Build an RNN LSTM model.

* **Conclusion:** Focusing on the recall of class 1 (dislike), 2 models perform well:

 MultinomialNB: recall of class 1 = 74%, recall of class 0 = 89%, Macro F1-score = 80% (precision of class 1 = 65%, precision of class 0 = 92%)
 
 ExtraTreesClassifier: recall of class 1 = 77%, recall of class 0 = 87%, Macro F1-score = 80% (precision of class 1 = 63%, precision of class 0 = 93%)
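
As a reading aid for the metrics above: the macro F1-score is the unweighted mean of the per-class F1-scores, and each F1-score is the harmonic mean of precision and recall. The sketch below recomputes the macro F1 from the precision and recall values reported in this conclusion. The dictionary keys `naive_bayes` and `extra_trees` are labels chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class):
    """Unweighted mean of the F1-scores of all classes.

    per_class is a list of (precision, recall) pairs, one per class.
    """
    scores = [f1_score(precision, recall) for precision, recall in per_class]
    return sum(scores) / len(scores)


# (precision, recall) for class 1 and class 0, as reported above.
reported = {
    "naive_bayes": [(0.65, 0.74), (0.92, 0.89)],
    "extra_trees": [(0.63, 0.77), (0.93, 0.87)],
}

for name, per_class in reported.items():
    print(name, round(macro_f1(per_class), 2))
```

Both models come out at roughly 0.80, which agrees with the reported Macro F1-score. Because the macro average weights both classes equally, the minority class 1 counts as much as the larger class 0.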

# TRY STEPS 3, 4, 5 AGAIN WITH SOME OPTIONS:

## 2B: Try to **remove more stopwords**:
* Because transforming the data takes a long time, I selected about 2500 value words related to sentiment. The other words carry no value for sentiment.
* Use the saved X_train_us, X_test_us, X_train_os, X_test_os (after undersampling and oversampling) already in the folder and drop the columns that are not value words, so only about 2500 columns remain.
* Build the models again with the clean data.
* **Conclusion:** Focusing on the recall of class 1 (dislike), MultinomialNB is the best model, with recall of class 1 = 73%, recall of class 0 = 87% and Macro F1-score = 78% (precision of class 1 = 62%, precision of class 0 = 92%). So removing stopwords does not improve the model.
* Note: see the ipynb file with suffix: "_stopwords"

## 2C: Try Vietnamese **without accents**:
* Don't use Underthesea; just convert the text to Vietnamese without accents.
* Repeat steps 3, 4, 5 with the unaccented data.
* **Conclusion:** Focusing on the recall of class 1 (dislike), 2 models perform well:

 MultinomialNB: recall of class 1 = 74%, recall of class 0 = 87%, Macro F1-score = 79% (precision of class 1 = 63%, precision of class 0 = 92%)
 
 ExtraTreesClassifier: recall of class 1 = 77%, recall of class 0 = 88%, Macro F1-score = 81% (precision of class 1 = 66%, precision of class 0 = 93%)

* Note: see the ipynb file with suffix: **"_noAccent"**
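
One possible way to strip Vietnamese accents uses Unicode decomposition from the standard library. This is a sketch of the general technique, not necessarily how the notebook does it; the function name `remove_vietnamese_accents` is hypothetical.

```python
import unicodedata


def remove_vietnamese_accents(text: str) -> str:
    """Return the text with Vietnamese diacritics removed."""
    # "đ" and "Đ" are separate letters that do not decompose under NFD,
    # so they are mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep the base letters.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(remove_vietnamese_accents("Món ăn rất ngon, phục vụ nhiệt tình"))
print(remove_vietnamese_accents("Đồ ăn dở"))
```

Removing accents merges words that differ only by tone or vowel mark, which shrinks the vocabulary but can also merge words with different meanings.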

## 2D: Try to define **my own text processing**:
* Use only the word tokenizer from the Underthesea library, and define the remaining text cleaning steps myself.
* Convert emoticons/emojis to words: extract them from the text dataset, review each one, and map it to words that express its meaning. Load the edited list for data cleaning.
* Remove some special symbols: list the special symbols, such as '~','`','!','@','#','$',... and remove them.
* Remove some typing mistakes: "  " (2 spaces), " _" (1 space, 1 underscore), "_ " (1 underscore, 1 space)
* Link some special words with other words: link "không", "ko", "kg", "chả", "chẳng" to the word right after them. Ex: "không thích" => "không_thích"
* Remove stopwords:
1. First, apply all the text cleaning steps above and use CountVectorizer to convert the text to a bag of words.
2. Save this bag of words to a csv file.
3. Review each word and decide which words to remove and which to keep.
4. Save the chosen words to a txt file.
5. Load the chosen words: words in a document that are among the chosen words are kept; all others are removed.
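
The cleaning steps above can be sketched roughly as follows. This is an illustration under several assumptions: the emoticon map is a hypothetical two-entry stand-in for the hand-built list, the symbol set contains only the symbols named in the text, the Underthesea tokenization step is omitted, and the function names `clean_review` and `keep_chosen_words` are invented here.

```python
import re
import unicodedata

# Hypothetical emoticon-to-word map; the project builds its own list by hand.
EMOTICON_WORDS = {":)": "vui", ":(": "buồn"}

# Only the symbols named in the text; the full list is not given there.
SPECIAL_SYMBOLS = "~`!@#$"

# Negation words that are linked to the following word.
NEGATION_WORDS = ("không", "ko", "kg", "chả", "chẳng")
NEGATION_PATTERN = re.compile(
    r"\b("
    + "|".join(unicodedata.normalize("NFC", word) for word in NEGATION_WORDS)
    + r")\s+(\w+)"
)


def clean_review(text: str) -> str:
    """Apply the cleaning steps to one review and return the cleaned text."""
    # Use one Unicode form so accented letters compare and match consistently.
    text = unicodedata.normalize("NFC", text).lower()
    # 1. Replace emoticons by words before symbols are stripped.
    for emoticon, word in EMOTICON_WORDS.items():
        text = text.replace(emoticon, f" {word} ")
    # 2. Remove special symbols.
    text = text.translate(str.maketrans("", "", SPECIAL_SYMBOLS))
    # 3. Repair stray underscores and repeated spaces.
    text = text.replace(" _", " ").replace("_ ", " ")
    text = re.sub(r"\s+", " ", text).strip()
    # 4. Link a negation word to the word right after it.
    return NEGATION_PATTERN.sub(r"\1_\2", text)


def keep_chosen_words(text: str, chosen_words: set) -> str:
    """Keep only the tokens that appear in the chosen-word list."""
    return " ".join(token for token in text.split() if token in chosen_words)


cleaned = clean_review("Không  thích món này :(")
print(cleaned)
print(keep_chosen_words(cleaned, {"không_thích", "buồn"}))
```

Linking the negation word keeps "không_thích" as a single bag-of-words feature, so it is not counted as evidence for "thích".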

* **Conclusion:** Focusing on the recall of class 1 (dislike), MultinomialNB is the best model, with recall of class 1 = 78%, recall of class 0 = 90% and Macro F1-score = 82% (precision of class 1 = 66%, precision of class 0 = 94%).

* Note: see the ipynb file with suffix: **"_self"**

## 2E: Try **crawling additional data** from foody.vn: 
* Crawl data from foody.vn with selenium in a Jupyter notebook: about 18,948 samples. Appended to the existing data, this gives 58,873 samples.
* Pre-process the crawled data with my own text processing.
* Append the new, cleaned crawled data to the existing data.
* Build the model again.
* **Conclusion:** Focusing on the recall of class 1 (dislike), MultinomialNB is the best model, with recall of class 1 = 74%, recall of class 0 = 89% and Macro F1-score = 80% (precision of class 1 = 66%, precision of class 0 = 93%).
* But this result is not better than the previous model (option 2D).
* Note: see the ipynb file with suffix: "_crawling"

# Step 6: DEPLOYMENT

* Choose the MultinomialNB model from option 2D, because it has the highest performance, especially the recall of class 1 (dislike).
* Create a pipeline so that end users can use the model easily.
* Create a new ipynb notebook and use the pipeline to predict on new data.
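
A pipeline of this kind could look roughly like the sketch below. It chains scikit-learn's `CountVectorizer` with `MultinomialNB`, which I take to be the Naive Bayes model chosen above. The training reviews are hypothetical, already-cleaned toy examples, and the real pipeline would also need the project's own text cleaning in front of the vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned toy reviews (1 = dislike, 0 = like).
train_texts = [
    "món ngon phục_vụ tốt",
    "rất ngon giá hợp_lý",
    "không_ngon phục_vụ chậm",
    "dở không_thích quay_lại",
]
train_labels = [0, 0, 1, 1]

# Bag-of-words features followed by a multinomial Naive Bayes classifier.
sentiment_pipeline = Pipeline([
    ("bag_of_words", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
sentiment_pipeline.fit(train_texts, train_labels)

# The end user only calls predict() on raw (cleaned) review text.
new_reviews = ["phục_vụ chậm dở", "món ngon tốt"]
predictions = sentiment_pipeline.predict(new_reviews)
print(list(predictions))
```

Wrapping both steps in one object means the same vocabulary is applied at training and prediction time, and the fitted pipeline can be saved and loaded as a single artifact.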