medbuddy

{"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import pandas as pd #Analysis \nimport matplotlib.pyplot as plt #Visualization\nimport seaborn as sns #Visualization\nimport numpy as np #Analysis \nfrom scipy.stats import norm #Analysis \nfrom sklearn.preprocessing import StandardScaler #Analysis \nfrom scipy import stats #Analysis \nimport warnings \nwarnings.filterwarnings('ignore')\n%matplotlib inline\nimport gc\n\nimport os\nimport string\ncolor = sns.color_palette()\n\nfrom plotly import tools\nimport plotly.offline as py\npy.init_notebook_mode(connected=True)\nimport plotly.graph_objs as go\n\nfrom sklearn import model_selection, preprocessing, metrics, ensemble, naive_bayes, linear_model\nfrom sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\nfrom sklearn.decomposition import TruncatedSVD\nimport lightgbm as lgb\n\npd.options.mode.chained_assignment = None\npd.options.display.max_columns = 999","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","_kg_hide-input":true,"execution":{"iopub.status.busy":"2021-10-10T06:52:03.536642Z","iopub.execute_input":"2021-10-10T06:52:03.536979Z","iopub.status.idle":"2021-10-10T06:52:03.561680Z","shell.execute_reply.started":"2021-10-10T06:52:03.536926Z","shell.execute_reply":"2021-10-10T06:52:03.559975Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"markdown","source":"## 1. Exploratory Data Analysis\n\n### 1.1. Data understanding\n\n\nFirst, we will import the train and test data. 
The sizes of the two datasets are printed below.\n\nThe data comes from https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29 and consists of reviews crawled from online pharmaceutical review sites.","metadata":{"_uuid":"6ba47a95deb9764565521d373d3de8bbd24c6bd7"}},{"cell_type":"code","source":"import os\nprint(os.listdir(\"../input\"))","metadata":{"_uuid":"2db18f8c9faff8d19ecb0d3aff3969b37d4bc7ed","execution":{"iopub.status.busy":"2021-10-10T06:52:05.836341Z","iopub.execute_input":"2021-10-10T06:52:05.836662Z","iopub.status.idle":"2021-10-10T06:52:05.842891Z","shell.execute_reply.started":"2021-10-10T06:52:05.836608Z","shell.execute_reply":"2021-10-10T06:52:05.841850Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"code","source":"df_train = pd.read_csv(\"../input/kuc-hackathon-winter-2018/drugsComTrain_raw.csv\", parse_dates=[\"date\"])\ndf_test = pd.read_csv(\"../input/kuc-hackathon-winter-2018/drugsComTest_raw.csv\", parse_dates=[\"date\"])","metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","execution":{"iopub.status.busy":"2021-10-10T06:52:06.358505Z","iopub.execute_input":"2021-10-10T06:52:06.358849Z","iopub.status.idle":"2021-10-10T06:52:27.740947Z","shell.execute_reply.started":"2021-10-10T06:52:06.358792Z","shell.execute_reply":"2021-10-10T06:52:27.740214Z"},"trusted":true},"execution_count":4,"outputs":[]},{"cell_type":"code","source":"print(\"Train shape :\" ,df_train.shape)\nprint(\"Test shape :\", df_test.shape)","metadata":{"_uuid":"3ab95e3f3ad6f51426d9732fabb29398c50223b8","execution":{"iopub.status.busy":"2021-10-10T06:52:27.742541Z","iopub.execute_input":"2021-10-10T06:52:27.742826Z","iopub.status.idle":"2021-10-10T06:52:27.748840Z","shell.execute_reply.started":"2021-10-10T06:52:27.742780Z","shell.execute_reply":"2021-10-10T06:52:27.748092Z"},"trusted":true},"execution_count":5,"outputs":[]},{"cell_type":"markdown","source":"This is the result of looking at the data 
through the head() command. Apart from the unique ID that identifies each reviewer, there are six variables, and review is the key one.","metadata":{"_uuid":"99dad5eb1affeacf030fae4f4a15661c9848979e"}},{"cell_type":"code","source":"df_train.head()","metadata":{"_uuid":"7fd9b5423836819b3841d444b91492854ab8865a","execution":{"iopub.status.busy":"2021-10-10T06:52:27.751002Z","iopub.execute_input":"2021-10-10T06:52:27.751475Z","iopub.status.idle":"2021-10-10T06:52:27.777836Z","shell.execute_reply.started":"2021-10-10T06:52:27.751425Z","shell.execute_reply":"2021-10-10T06:52:27.776893Z"},"trusted":true},"execution_count":6,"outputs":[]},{"cell_type":"markdown","source":"These are additional explanations for variables.\n\n- drugName (categorical): name of drug \n- condition (categorical): name of condition\n- review (text): patient review \n- rating (numerical): 10 star patient rating \n- date (date): date of review entry \n- usefulCount (numerical): number of users who found review useful\n\nEach row describes a patient with a unique ID who purchases a drug for their condition and, on the given date, writes a review and a rating for that drug. Afterwards, other users who read the review and find it helpful can click usefulCount, which increments the variable by 1.","metadata":{"_uuid":"5531c6797f0f890bd3b8c7d4328d25c11d375386"}},{"cell_type":"markdown","source":"### 1.2. Data understanding\n\nFirst, we will start exploring variables, starting from uniqueID. 
We compared the number of distinct unique IDs with the length of the train data to see whether the same customer has written multiple reviews, and no customer has more than one review.","metadata":{"_uuid":"1180500f856ae3f1b772e3b8e2faa393809fdf23"}},{"cell_type":"code","source":"print(\"unique values count of train : \" ,len(set(df_train['uniqueID'].values)))\nprint(\"length of train : \" ,df_train.shape[0])","metadata":{"_uuid":"8bf7685ab8278b0709c391aed2560386536552a4","execution":{"iopub.status.busy":"2021-10-10T06:52:27.780472Z","iopub.execute_input":"2021-10-10T06:52:27.780977Z","iopub.status.idle":"2021-10-10T06:52:27.821805Z","shell.execute_reply.started":"2021-10-10T06:52:27.780733Z","shell.execute_reply":"2021-10-10T06:52:27.820644Z"},"trusted":true},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":"DrugName is closely related to condition, so we have analyzed them together. The two variables have 3671 and 917 unique values, respectively, so there are about 4 drugs per condition on average. 
Let's go ahead and visualize this in more detail.","metadata":{"_uuid":"5f5f13d517dd6de104f3303ce7ab9c7da5eacd11"}},{"cell_type":"code","source":"df_all = pd.concat([df_train,df_test])","metadata":{"_kg_hide-input":true,"_uuid":"1ce8f6dad73bddd2bf2c2a3a9cab10f2595679c2","execution":{"iopub.status.busy":"2021-10-10T06:52:27.823609Z","iopub.execute_input":"2021-10-10T06:52:27.824256Z","iopub.status.idle":"2021-10-10T06:52:27.853115Z","shell.execute_reply.started":"2021-10-10T06:52:27.823892Z","shell.execute_reply":"2021-10-10T06:52:27.852320Z"},"trusted":true},"execution_count":8,"outputs":[]},{"cell_type":"code","source":"condition_dn = df_all.groupby(['condition'])['drugName'].nunique().sort_values(ascending=False)\ncondition_dn[0:20].plot(kind=\"bar\", figsize = (14,6), fontsize = 10,color=\"green\")\nplt.xlabel(\"\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Top20 : The number of drugs per condition.\", fontsize = 20)","metadata":{"_uuid":"bb2b7aac9798cb96aa4cd032aee6fca2ddb74b4a","execution":{"iopub.status.busy":"2021-10-10T06:52:27.853985Z","iopub.execute_input":"2021-10-10T06:52:27.854331Z","iopub.status.idle":"2021-10-10T06:52:28.727327Z","shell.execute_reply.started":"2021-10-10T06:52:27.854285Z","shell.execute_reply":"2021-10-10T06:52:28.726487Z"},"trusted":true},"execution_count":9,"outputs":[]},{"cell_type":"markdown","source":"As you can see from the picture above, the number of drugs for top eight conditions is about 100 for each condition. On the other hand, it should be noted that the phrase \"3</span> users found this comment helpful\" appears in the condition, which seems like an error in the crawling process. 
We looked into these rows in more detail.","metadata":{"_uuid":"b1b2d517b68e747298b5a17fdb3e6e60aaa1cdaf"}},{"cell_type":"code","source":"df_all[df_all['condition']=='3</span> users found this comment helpful.'].head(3)","metadata":{"_uuid":"d0a89c7ebae84c817b3f2e6aab294088c18b261f","execution":{"iopub.status.busy":"2021-10-10T06:52:28.728551Z","iopub.execute_input":"2021-10-10T06:52:28.729069Z","iopub.status.idle":"2021-10-10T06:52:28.781527Z","shell.execute_reply.started":"2021-10-10T06:52:28.729006Z","shell.execute_reply":"2021-10-10T06:52:28.780775Z"},"trusted":true},"execution_count":10,"outputs":[]},{"cell_type":"markdown","source":"Given the structure of the phrase '</span> users found this comment helpful.', we expect it to appear not only with 3 but also with 4, as shown above, and with other numbers as well. We will remove these rows during preprocessing.\n\nThe following are the bottom 20 conditions by 'drugs per condition'. As you can see, each of them has exactly 1 drug. For a recommendation system, it is not feasible to recommend anything when there is only one product. 
Therefore, we will analyze only the conditions that have at least 2 drugs per condition.","metadata":{"_uuid":"5585d77c9e6c06707fae1e82aea83abc49d75acb"}},{"cell_type":"code","source":"condition_dn = df_all.groupby(['condition'])['drugName'].nunique().sort_values(ascending=False)\n\ncondition_dn[condition_dn.shape[0]-20:condition_dn.shape[0]].plot(kind=\"bar\", figsize = (14,6), fontsize = 10,color=\"green\")\nplt.xlabel(\"\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Bottom20 : The number of drugs per condition.\", fontsize = 20)","metadata":{"_uuid":"f46bf457b8c4b0a7b4ac2d476f42e38323ca483a","execution":{"iopub.status.busy":"2021-10-10T06:52:28.782467Z","iopub.execute_input":"2021-10-10T06:52:28.782957Z","iopub.status.idle":"2021-10-10T06:52:29.800635Z","shell.execute_reply.started":"2021-10-10T06:52:28.782725Z","shell.execute_reply":"2021-10-10T06:52:29.799998Z"},"trusted":true},"execution_count":11,"outputs":[]},{"cell_type":"markdown","source":"Next, let's have a look at the review. 
First, noticeable parts are the html strings like \\ r \\ n, and the parts that express emotions in parentheses such as (very unusual for him) and (a good thing) and words in capital letters like MUCH.","metadata":{"_uuid":"fd9d70cb99c89d6def5496da845533aa682e68be"}},{"cell_type":"code","source":"df_train['review'][1]","metadata":{"_uuid":"1c283d11b69ada1b9defd9f5c9fab9d69022b631","execution":{"iopub.status.busy":"2021-10-10T06:52:29.801447Z","iopub.execute_input":"2021-10-10T06:52:29.801811Z","iopub.status.idle":"2021-10-10T06:52:29.812115Z","shell.execute_reply.started":"2021-10-10T06:52:29.801649Z","shell.execute_reply":"2021-10-10T06:52:29.810955Z"},"trusted":true},"execution_count":12,"outputs":[]},{"cell_type":"markdown","source":"In addition, there were some words with errors like didn&# 039;t for didn't, and also characters like ...","metadata":{"_uuid":"89a40ddaee7f0c89ca9d0ad8d14c9e56c0f47b0b"}},{"cell_type":"code","source":"df_train['review'][2]","metadata":{"_uuid":"c7d59b1d8ab8cd481ef1d99b91862fbcd34032d4","execution":{"iopub.status.busy":"2021-10-10T06:52:29.813501Z","iopub.execute_input":"2021-10-10T06:52:29.814019Z","iopub.status.idle":"2021-10-10T06:52:29.820131Z","shell.execute_reply.started":"2021-10-10T06:52:29.813768Z","shell.execute_reply":"2021-10-10T06:52:29.819080Z"},"trusted":true},"execution_count":13,"outputs":[]},{"cell_type":"markdown","source":"We will delete these parts in preprocessing as well.","metadata":{"_uuid":"253b7284fb0685ddff06c1328d2e52a54c1ee884"}},{"cell_type":"markdown","source":"Next up, it's Word Cloud.","metadata":{"_uuid":"a9a3e123470563259697ac03491b08d125493757"}},{"cell_type":"code","source":"\nfrom wordcloud import WordCloud, STOPWORDS\n\n# Thanks : https://www.kaggle.com/aashita/word-clouds-of-various-shapes ##\ndef plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0), \n                   title = None, title_size=40, image_color=False):\n    stopwords = set(STOPWORDS)\n    
more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}\n    stopwords = stopwords.union(more_stopwords)\n\n    wordcloud = WordCloud(background_color='white',\n                    stopwords = stopwords,\n                    max_words = max_words,\n                    max_font_size = max_font_size, \n                    random_state = 42,\n                    width=800, \n                    height=400,\n                    mask = mask)\n    wordcloud.generate(str(text))\n    \n    plt.figure(figsize=figure_size)\n    if image_color:\n        image_colors = ImageColorGenerator(mask);\n        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation=\"bilinear\");\n        plt.title(title, fontdict={'size': title_size,  \n                                  'verticalalignment': 'bottom'})\n    else:\n        plt.imshow(wordcloud);\n        plt.title(title, fontdict={'size': title_size, 'color': 'black', \n                                  'verticalalignment': 'bottom'})\n    plt.axis('off');\n    plt.tight_layout()  \n    \nplot_wordcloud(df_all[\"review\"], title=\"Word Cloud of review\")","metadata":{"_kg_hide-input":true,"_uuid":"ba8744fb8f95488d63b517eb64d61ee43afbee0d","execution":{"iopub.status.busy":"2021-10-10T06:52:29.824987Z","iopub.execute_input":"2021-10-10T06:52:29.825380Z","iopub.status.idle":"2021-10-10T06:52:31.115310Z","shell.execute_reply.started":"2021-10-10T06:52:29.825216Z","shell.execute_reply":"2021-10-10T06:52:31.114499Z"},"trusted":true},"execution_count":14,"outputs":[]},{"cell_type":"markdown","source":"Next, we will classify 1 ~ 5 as negative, and 6 ~ 10 as positive, and we will check through 1 ~ 4 grams which corpus best classifies emotions.","metadata":{"_uuid":"ab15f4931cc199933f76bb659d4825287827fe3d"}},{"cell_type":"code","source":"from collections import defaultdict\ndf_all_6_10 = df_all[df_all[\"rating\"]>5]\ndf_all_1_5 = 
df_all[df_all[\"rating\"]<6]","metadata":{"_kg_hide-input":true,"_uuid":"72569a2d173760ec20f1893ee67c74e42773f186","execution":{"iopub.status.busy":"2021-10-10T06:52:31.116564Z","iopub.execute_input":"2021-10-10T06:52:31.116901Z","iopub.status.idle":"2021-10-10T06:52:31.145551Z","shell.execute_reply.started":"2021-10-10T06:52:31.116842Z","shell.execute_reply":"2021-10-10T06:52:31.144711Z"},"trusted":true},"execution_count":15,"outputs":[]},{"cell_type":"code","source":"## custom function for ngram generation ##\ndef generate_ngrams(text, n_gram=1):\n    token = [token for token in text.lower().split(\" \") if token != \"\" if token not in STOPWORDS]\n    ngrams = zip(*[token[i:] for i in range(n_gram)])\n    return [\" \".join(ngram) for ngram in ngrams]\n\n## custom function for horizontal bar chart ##\ndef horizontal_bar_chart(df, color):\n    trace = go.Bar(\n        y=df[\"word\"].values[::-1],\n        x=df[\"wordcount\"].values[::-1],\n        showlegend=False,\n        orientation = 'h',\n        marker=dict(\n            color=color,\n        ),\n    )\n    return trace\n\n## Get the bar chart from rating  8 to 10 review ##\nfreq_dict = defaultdict(int)\nfor sent in df_all_1_5[\"review\"]:\n    for word in generate_ngrams(sent):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace0 = horizontal_bar_chart(fd_sorted.head(50), 'blue')\n\n## Get the bar chart from rating  4 to 7 review ##\nfreq_dict = defaultdict(int)\nfor sent in df_all_6_10[\"review\"]:\n    for word in generate_ngrams(sent):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace1 = horizontal_bar_chart(fd_sorted.head(50), 'blue')\n\n# Creating two subplots\nfig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,\n                          subplot_titles=[\"Frequent 
words of rating 1 to 5\", \n                                          \"Frequent words of rating 6 to 10\"])\nfig.append_trace(trace0, 1, 1)\nfig.append_trace(trace1, 1, 2)\nfig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title=\"Word Count Plots\")\npy.iplot(fig, filename='word-plots')","metadata":{"_kg_hide-input":true,"_uuid":"2d28e3c380acba0078318c33063d51ee86943555","execution":{"iopub.status.busy":"2021-10-10T06:52:31.146536Z","iopub.execute_input":"2021-10-10T06:52:31.146778Z","iopub.status.idle":"2021-10-10T06:52:39.858404Z","shell.execute_reply.started":"2021-10-10T06:52:31.146735Z","shell.execute_reply":"2021-10-10T06:52:39.857378Z"},"trusted":true},"execution_count":16,"outputs":[]},{"cell_type":"markdown","source":"When you use 1-gram, you can see that the top 5 words have the same contents, although the order of left (negative) and right (positive) are different. This means when we analyze the text with a single corpus, it does not classify the emotion well. 
So, we will expand the corpus.","metadata":{"_uuid":"2d12cc47d486b171f127de48d6dbe640bfbdc4ad"}},{"cell_type":"code","source":"freq_dict = defaultdict(int)\nfor sent in df_all_1_5[\"review\"]:\n    for word in generate_ngrams(sent,2):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace1 = horizontal_bar_chart(fd_sorted.head(50), 'orange')\n\nfreq_dict = defaultdict(int)\nfor sent in df_all_6_10[\"review\"]:\n    for word in generate_ngrams(sent,2):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace2 = horizontal_bar_chart(fd_sorted.head(50), 'orange')\n\n# Creating two subplots\nfig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,\n                          subplot_titles=[\"Frequent biagrams of rating 1 to 5\", \n                                          \"Frequent biagrams of rating 6 to 10\"])\nfig.append_trace(trace1, 1, 1)\nfig.append_trace(trace2, 1, 2)\nfig['layout'].update(height=1200, width=1000, paper_bgcolor='rgb(233,233,233)', title=\"Bigram Count Plots\")\npy.iplot(fig, filename='word-plots')","metadata":{"_kg_hide-input":true,"_uuid":"03e56f5eb12b5d2e738191a2ebfb61a85a2b963a","execution":{"iopub.status.busy":"2021-10-10T06:52:39.859378Z","iopub.execute_input":"2021-10-10T06:52:39.859617Z","iopub.status.idle":"2021-10-10T06:52:54.174980Z","shell.execute_reply.started":"2021-10-10T06:52:39.859574Z","shell.execute_reply":"2021-10-10T06:52:54.173953Z"},"trusted":true},"execution_count":17,"outputs":[]},{"cell_type":"markdown","source":"Likewise, in 2-gram, the contents of the top five corpus are similar, and it is hard to classify positive and negative. In addition, 'side effects' and 'side effects.' are interpreted differently, which means preprocessing of review data is necessary. 
However, you can see that this is better to classify emotions rather than previous 1-grams, like side effects, weight gain, and highly recommend.","metadata":{"_uuid":"faa15e62d9bd51b7e538a84f89ad6061efc4c48b"}},{"cell_type":"code","source":"freq_dict = defaultdict(int)\nfor sent in df_all_1_5[\"review\"]:\n    for word in generate_ngrams(sent,3):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace1 = horizontal_bar_chart(fd_sorted.head(50), 'green')\n\nfreq_dict = defaultdict(int)\nfor sent in df_all_6_10[\"review\"]:\n    for word in generate_ngrams(sent,3):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace2 = horizontal_bar_chart(fd_sorted.head(50), 'green')\n\n# Creating two subplots\nfig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,\n                          subplot_titles=[\"Frequent trigrams of rating 1 to 5\", \n                                          \"Frequent trigrams of rating 6 to 10\"])\nfig.append_trace(trace1, 1, 1)\nfig.append_trace(trace2, 1, 2)\nfig['layout'].update(height=1200, width=1600, paper_bgcolor='rgb(233,233,233)', title=\"Trigram Count Plots\")\npy.iplot(fig, filename='word-plots')","metadata":{"_kg_hide-input":true,"_uuid":"cae502a170fe1665fd9aff62a257c04d111887b8","execution":{"iopub.status.busy":"2021-10-10T06:52:54.176287Z","iopub.execute_input":"2021-10-10T06:52:54.176698Z","iopub.status.idle":"2021-10-10T06:53:10.781802Z","shell.execute_reply.started":"2021-10-10T06:52:54.176533Z","shell.execute_reply":"2021-10-10T06:53:10.780150Z"},"trusted":true},"execution_count":18,"outputs":[]},{"cell_type":"markdown","source":"From 3-gram you can see that there is a difference between positive and negative corpus. 
Bad side effects, birth control pills, negative side effects are corpus that classify positive and negative. However, both positive and negative parts can be thought that it has missing parts that reverses the context, such as' not' in front of a corpus.","metadata":{"_uuid":"d2ff80f6d7cf321f40f65c6330a2ad902f59659a"}},{"cell_type":"code","source":"freq_dict = defaultdict(int)\nfor sent in df_all_1_5[\"review\"]:\n    for word in generate_ngrams(sent,4):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace1 = horizontal_bar_chart(fd_sorted.head(50), 'red')\n\nfreq_dict = defaultdict(int)\nfor sent in df_all_6_10[\"review\"]:\n    for word in generate_ngrams(sent,4):\n        freq_dict[word] += 1\nfd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\nfd_sorted.columns = [\"word\", \"wordcount\"]\ntrace2 = horizontal_bar_chart(fd_sorted.head(50), 'red')\n\n# Creating two subplots\nfig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,\n                          subplot_titles=[\"Frequent 4-grams of rating 1 to 5\", \n                                          \"Frequent 4-grams of rating 6 to 10\"])\nfig.append_trace(trace1, 1, 1)\nfig.append_trace(trace2, 1, 2)\nfig['layout'].update(height=1200, width=1600, paper_bgcolor='rgb(233,233,233)', title=\"4-grams Count Plots\")\npy.iplot(fig, filename='word-plots')","metadata":{"_kg_hide-input":true,"_uuid":"b59a9e02961f89349684d69e793fc3270b103022","execution":{"iopub.status.busy":"2021-10-10T06:53:10.782728Z","iopub.execute_input":"2021-10-10T06:53:10.782967Z","iopub.status.idle":"2021-10-10T06:53:27.435251Z","shell.execute_reply.started":"2021-10-10T06:53:10.782917Z","shell.execute_reply":"2021-10-10T06:53:27.434639Z"},"trusted":true},"execution_count":19,"outputs":[]},{"cell_type":"markdown","source":"Clearly, 4-gram classifies emotions much betther than 
other grams. Therefore, we will use 4-grams to build the deep learning model.","metadata":{"_uuid":"d2ce70ff984cda025a8edbd820f756bf6feacd5e"}},{"cell_type":"markdown","source":"Next, we will look at the relationship between rating and date. First of all, we will count the number of reviews for each rating value.","metadata":{"_uuid":"8617d65924307b6e97c5828b680a0f6bedf4a360"}},{"cell_type":"code","source":"rating = df_all['rating'].value_counts().sort_values(ascending=False)\nrating.plot(kind=\"bar\", figsize = (14,6), fontsize = 10,color=\"green\")\nplt.xlabel(\"\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Count of rating values\", fontsize = 20)","metadata":{"_uuid":"6d119068cf249e999a8f216e0167af8e628fd70a","execution":{"iopub.status.busy":"2021-10-10T06:53:27.436082Z","iopub.execute_input":"2021-10-10T06:53:27.436316Z","iopub.status.idle":"2021-10-10T06:53:27.775957Z","shell.execute_reply.started":"2021-10-10T06:53:27.436273Z","shell.execute_reply":"2021-10-10T06:53:27.775076Z"},"trusted":true},"execution_count":20,"outputs":[]},{"cell_type":"markdown","source":"Most people choose one of four values: 10, 9, 1, or 8, and there are more than twice as many 10s as any other value. 
With this, we can see that positive ratings outnumber negative ones and that people's reactions tend to be extreme.","metadata":{"_uuid":"d23525d7fd8b400a4786f86216d663facc7c46ca"}},{"cell_type":"markdown","source":"Next, we will check the number of reviews and the mean rating by date.","metadata":{"_uuid":"3f7386932ed0289c2623059e0636084af3ede94c"}},{"cell_type":"code","source":"\n\ncnt_srs = df_all['date'].dt.year.value_counts()\ncnt_srs = cnt_srs.sort_index()\nplt.figure(figsize=(14,6))\nsns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8, color='green')\nplt.xticks(rotation='vertical')\nplt.xlabel('year', fontsize=12)\nplt.ylabel('', fontsize=12)\nplt.title(\"Number of reviews in year\")\nplt.show()","metadata":{"_uuid":"14b215ad07c7d5733a4d950043d17f905732c222","execution":{"iopub.status.busy":"2021-10-10T06:53:27.779179Z","iopub.execute_input":"2021-10-10T06:53:27.781403Z","iopub.status.idle":"2021-10-10T06:53:28.247281Z","shell.execute_reply.started":"2021-10-10T06:53:27.781335Z","shell.execute_reply":"2021-10-10T06:53:28.246371Z"},"trusted":true},"execution_count":21,"outputs":[]},{"cell_type":"code","source":"df_all['year'] = df_all['date'].dt.year\nrating = df_all.groupby('year')['rating'].mean()\nrating.plot(kind=\"bar\", figsize = (14,6), fontsize = 10,color=\"green\")\nplt.xlabel(\"\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Mean rating in year\", fontsize = 20)","metadata":{"_uuid":"16dcf7a747d2eb06895c49ee6fd476e179aed2ba","execution":{"iopub.status.busy":"2021-10-10T06:53:28.250581Z","iopub.execute_input":"2021-10-10T06:53:28.252884Z","iopub.status.idle":"2021-10-10T06:53:28.596650Z","shell.execute_reply.started":"2021-10-10T06:53:28.252824Z","shell.execute_reply":"2021-10-10T06:53:28.595766Z"},"trusted":true},"execution_count":22,"outputs":[]},{"cell_type":"code","source":"\ncnt_srs = df_all['date'].dt.month.value_counts()\ncnt_srs = 
cnt_srs.sort_index()\nplt.figure(figsize=(14,6))\nsns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8, color='green')\nplt.xticks(rotation='vertical')\nplt.xlabel('month', fontsize=12)\nplt.ylabel('', fontsize=12)\nplt.title(\"Number of reviews in month\")\nplt.show()","metadata":{"_uuid":"7dfdffd170217e7d753ea6d28a6630b7f494041c","execution":{"iopub.status.busy":"2021-10-10T06:53:28.597790Z","iopub.execute_input":"2021-10-10T06:53:28.598333Z","iopub.status.idle":"2021-10-10T06:53:28.950890Z","shell.execute_reply.started":"2021-10-10T06:53:28.598283Z","shell.execute_reply":"2021-10-10T06:53:28.949857Z"},"trusted":true},"execution_count":23,"outputs":[]},{"cell_type":"code","source":"df_all['month'] = df_all['date'].dt.month\nrating = df_all.groupby('month')['rating'].mean()\nrating.plot(kind=\"bar\", figsize = (14,6), fontsize = 10,color=\"green\")\nplt.xlabel(\"\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Mean rating in month\", fontsize = 20)","metadata":{"_uuid":"0a6c95402fb6d936ea4799d93f2c9eeab47585db","execution":{"iopub.status.busy":"2021-10-10T06:53:28.954283Z","iopub.execute_input":"2021-10-10T06:53:28.954734Z","iopub.status.idle":"2021-10-10T06:53:29.329373Z","shell.execute_reply.started":"2021-10-10T06:53:28.954553Z","shell.execute_reply":"2021-10-10T06:53:29.328469Z"},"trusted":true},"execution_count":24,"outputs":[]},{"cell_type":"markdown","source":"Interestingly, you can see that the average rating differs by year, but it is similar by month.","metadata":{"_uuid":"25da863d027db802c5e3cf11b6b77c6ecf606ebe"}},{"cell_type":"code","source":"df_all['day'] = df_all['date'].dt.day\nrating = df_all.groupby('day')['rating'].mean()\nrating.plot(kind=\"bar\", figsize = (14,6), fontsize = 10,color=\"green\")\nplt.xlabel(\"\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Mean rating in day\", fontsize = 
20)","metadata":{"_uuid":"808970cc32f9a7e573ab81b54b2afe12f86729c4","execution":{"iopub.status.busy":"2021-10-10T06:53:29.332666Z","iopub.execute_input":"2021-10-10T06:53:29.335007Z","iopub.status.idle":"2021-10-10T06:53:29.817013Z","shell.execute_reply.started":"2021-10-10T06:53:29.334939Z","shell.execute_reply":"2021-10-10T06:53:29.816144Z"},"trusted":true},"execution_count":25,"outputs":[]},{"cell_type":"markdown","source":"We checked whether the day of the month affects the rating (for example, around payday), but it does not make a big difference.","metadata":{"_uuid":"0049b5a13be3ddb3b0147ebc83ba42d1952eb109"}},{"cell_type":"code","source":"plt.figure(figsize=(14,6))\nsns.distplot(df_all[\"usefulCount\"].dropna(),color=\"green\")\nplt.xticks(rotation='vertical')\nplt.xlabel('', fontsize=12)\nplt.ylabel('', fontsize=12)\nplt.title(\"Distribution of usefulCount\")\nplt.show()","metadata":{"_uuid":"46c8c44e5e61e6a4139db9cbc9b0c10533923f7e","execution":{"iopub.status.busy":"2021-10-10T06:53:29.817841Z","iopub.execute_input":"2021-10-10T06:53:29.818092Z","iopub.status.idle":"2021-10-10T06:53:30.197074Z","shell.execute_reply.started":"2021-10-10T06:53:29.818048Z","shell.execute_reply":"2021-10-10T06:53:30.196216Z"},"trusted":true},"execution_count":26,"outputs":[]},{"cell_type":"code","source":"df_all[\"usefulCount\"].describe()","metadata":{"_uuid":"0b8c382bf6e31efc48656e098dfefc76cf88deff","execution":{"iopub.status.busy":"2021-10-10T06:53:30.218348Z","iopub.execute_input":"2021-10-10T06:53:30.218933Z","iopub.status.idle":"2021-10-10T06:53:30.246217Z","shell.execute_reply.started":"2021-10-10T06:53:30.218876Z","shell.execute_reply":"2021-10-10T06:53:30.245606Z"},"trusted":true},"execution_count":27,"outputs":[]},{"cell_type":"markdown","source":"If you look at the distribution of usefulCount, you can see that the range between the minimum and maximum is 1291, which is large. In addition, the standard deviation is also large, at 36. 
The likely reason is that the more a drug is searched for, the more people read its reviews, whether their content is good or bad, which pushes usefulCount very high. So when we create the model, we will normalize usefulCount by condition to account for how accessible each drug's reviews are.","metadata":{"_uuid":"4bb7102e270fcfa98d9659a577de26ef9611e016"}},{"cell_type":"markdown","source":"### 1.3 Missing value","metadata":{"_uuid":"541ea018457f9107abf3f2c11051b39b2b17198d"}},{"cell_type":"code","source":"percent = (df_all.isnull().sum()).sort_values(ascending=False)\npercent.plot(kind=\"bar\", figsize = (14,6), fontsize = 10, color='green')\nplt.xlabel(\"Columns\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Total Missing Value \", fontsize = 20)","metadata":{"_uuid":"fe78d3aec92fbac8f64723535bb47c09a40231c3","execution":{"iopub.status.busy":"2021-10-10T06:53:30.247827Z","iopub.execute_input":"2021-10-10T06:53:30.248108Z","iopub.status.idle":"2021-10-10T06:53:30.694267Z","shell.execute_reply.started":"2021-10-10T06:53:30.248063Z","shell.execute_reply":"2021-10-10T06:53:30.693241Z"},"trusted":true},"execution_count":28,"outputs":[]},{"cell_type":"code","source":"print(\"Missing value (%):\", 1200/df_all.shape[0] *100)","metadata":{"_uuid":"bcbd81bf2a4013c6ce9cb81067ab15684171203c","execution":{"iopub.status.busy":"2021-10-10T06:53:30.697801Z","iopub.execute_input":"2021-10-10T06:53:30.700130Z","iopub.status.idle":"2021-10-10T06:53:30.707537Z","shell.execute_reply.started":"2021-10-10T06:53:30.698108Z","shell.execute_reply":"2021-10-10T06:53:30.706710Z"},"trusted":true},"execution_count":29,"outputs":[]},{"cell_type":"markdown","source":"We will delete the rows with missing values because they make up less than 1% of the data.","metadata":{"_uuid":"97c340b894003a1ad2fc0b6e589ae351cbd3e253"}},{"cell_type":"markdown","source":"## 2. Data Preprocessing","metadata":{"_uuid":"0c9a363d9fc7b266ee9c8a390d3c841b71126037"}},{"cell_type":"markdown","source":"### 2.1. 
Missing Values Removal","metadata":{"_uuid":"32e167512298155137e85532b01158049819c100"}},{"cell_type":"code","source":"df_train = df_train.dropna(axis=0)\ndf_test = df_test.dropna(axis=0)","metadata":{"_uuid":"af5d3f3c7fccd92a93413514030b6a33b0a89052","execution":{"iopub.status.busy":"2021-10-10T06:53:30.708763Z","iopub.execute_input":"2021-10-10T06:53:30.709186Z","iopub.status.idle":"2021-10-10T06:53:30.814481Z","shell.execute_reply.started":"2021-10-10T06:53:30.709138Z","shell.execute_reply":"2021-10-10T06:53:30.813552Z"},"trusted":true},"execution_count":30,"outputs":[]},{"cell_type":"code","source":"df_all = pd.concat([df_train,df_test]).reset_index()\ndel df_all['index']\npercent = (df_all.isnull().sum()).sort_values(ascending=False)\npercent.plot(kind=\"bar\", figsize = (14,6), fontsize = 10, color='green')\nplt.xlabel(\"Columns\", fontsize = 20)\nplt.ylabel(\"\", fontsize = 20)\nplt.title(\"Total Missing Value \", fontsize = 20)","metadata":{"_uuid":"1012e7c255ad73cf21048fd6534cb20bbcf6f1ce","execution":{"iopub.status.busy":"2021-10-10T06:53:30.815525Z","iopub.execute_input":"2021-10-10T06:53:30.815799Z","iopub.status.idle":"2021-10-10T06:53:31.282444Z","shell.execute_reply.started":"2021-10-10T06:53:30.815752Z","shell.execute_reply":"2021-10-10T06:53:31.281562Z"},"trusted":true},"execution_count":31,"outputs":[]},{"cell_type":"markdown","source":"### 2.2 Condition Preprocessing","metadata":{"_uuid":"aaceaccc91bf0a846df8a6377558d89824a01c9c"}},{"cell_type":"markdown","source":"We will delete the sentences with the form above.","metadata":{"_uuid":"127f1391018fc3049ea14df7e5ea43c96812a9c3"}},{"cell_type":"code","source":"all_list = set(df_all.index)\nspan_list = []\nfor i,j in enumerate(df_all['condition']):\n    if '</span>' in j:\n        
span_list.append(i)","metadata":{"_uuid":"f4269e93fb9a64b79f9e74db338d4df28e2ba0f8","execution":{"iopub.status.busy":"2021-10-10T06:53:31.283528Z","iopub.execute_input":"2021-10-10T06:53:31.283821Z","iopub.status.idle":"2021-10-10T06:53:31.354602Z","shell.execute_reply.started":"2021-10-10T06:53:31.283771Z","shell.execute_reply":"2021-10-10T06:53:31.353778Z"},"trusted":true},"execution_count":32,"outputs":[]},{"cell_type":"code","source":"new_idx = all_list.difference(set(span_list))\ndf_all = df_all.iloc[list(new_idx)].reset_index()\ndel df_all['index']","metadata":{"_uuid":"8545b5cdc904897a886bfad95fadf0196515f420","execution":{"iopub.status.busy":"2021-10-10T06:53:31.355725Z","iopub.execute_input":"2021-10-10T06:53:31.356124Z","iopub.status.idle":"2021-10-10T06:53:31.451205Z","shell.execute_reply.started":"2021-10-10T06:53:31.356074Z","shell.execute_reply":"2021-10-10T06:53:31.446918Z"},"trusted":true},"execution_count":33,"outputs":[]},{"cell_type":"markdown","source":"Next, we will delete conditions with only one drug.","metadata":{"_uuid":"b1a5f5508ba8367b0e1fe86bbd9356965d104a98"}},{"cell_type":"code","source":"df_condition = df_all.groupby(['condition'])['drugName'].nunique().sort_values(ascending=False)\ndf_condition = pd.DataFrame(df_condition).reset_index()\ndf_condition.tail(20)","metadata":{"_uuid":"13324f1aaed4de4d20c53cd52a6ee2dca301d42a","execution":{"iopub.status.busy":"2021-10-10T06:53:31.452548Z","iopub.execute_input":"2021-10-10T06:53:31.453101Z","iopub.status.idle":"2021-10-10T06:53:31.934511Z","shell.execute_reply.started":"2021-10-10T06:53:31.452895Z","shell.execute_reply":"2021-10-10T06:53:31.933783Z"},"trusted":true},"execution_count":34,"outputs":[]},{"cell_type":"code","source":"df_condition_1 = 
df_condition[df_condition['drugName']==1].reset_index()\ndf_condition_1['condition'][0:10]","metadata":{"_uuid":"3a1341f12498cbbd65e8e7cd1ab9789e94f5468e","execution":{"iopub.status.busy":"2021-10-10T06:53:31.937239Z","iopub.execute_input":"2021-10-10T06:53:31.939206Z","iopub.status.idle":"2021-10-10T06:53:31.952120Z","shell.execute_reply.started":"2021-10-10T06:53:31.939155Z","shell.execute_reply":"2021-10-10T06:53:31.951430Z"},"trusted":true},"execution_count":35,"outputs":[]},{"cell_type":"code","source":"all_list = set(df_all.index)\ncondition_list = []\nfor i,j in enumerate(df_all['condition']):\n    for c in list(df_condition_1['condition']):\n        if j == c:\n            condition_list.append(i)\n            \nnew_idx = all_list.difference(set(condition_list))\ndf_all = df_all.iloc[list(new_idx)].reset_index()\ndel df_all['index']","metadata":{"_uuid":"edf19fcd7a99264a5436034d648b20d0bff2fa7b","execution":{"iopub.status.busy":"2021-10-10T06:53:31.954768Z","iopub.execute_input":"2021-10-10T06:53:31.956889Z","iopub.status.idle":"2021-10-10T06:53:39.181961Z","shell.execute_reply.started":"2021-10-10T06:53:31.956835Z","shell.execute_reply":"2021-10-10T06:53:39.181178Z"},"trusted":true},"execution_count":36,"outputs":[]},{"cell_type":"markdown","source":"### 2.3 Review Preprocessing","metadata":{"_uuid":"cb80a22040f4f9ca03311ce9e1ea782eb8ee89ad"}},{"cell_type":"code","source":"from bs4 import BeautifulSoup\nimport nltk\nfrom nltk.corpus import stopwords\nfrom nltk.stem.snowball import SnowballStemmer","metadata":{"_uuid":"5afbc9c8b87abc646ed24fb5a0feadd5996169a1","execution":{"iopub.status.busy":"2021-10-10T06:53:39.182824Z","iopub.execute_input":"2021-10-10T06:53:39.183066Z","iopub.status.idle":"2021-10-10T06:53:39.680856Z","shell.execute_reply.started":"2021-10-10T06:53:39.183025Z","shell.execute_reply":"2021-10-10T06:53:39.680141Z"},"trusted":true},"execution_count":37,"outputs":[]},{"cell_type":"markdown","source":"- \\r\\n : we need to convert html 
grammar\n- ... , &#039; : handle non-alphabetic characters","metadata":{"_uuid":"da532e45210e8a15eb1361aad88fc749eebea620"}},{"cell_type":"code","source":"stops = set(stopwords.words('english'))\n#stops","metadata":{"_kg_hide-output":true,"_uuid":"6f546ffb47ae31a30903d4075e8c514a45c7a951","execution":{"iopub.status.busy":"2021-10-10T06:53:39.681685Z","iopub.execute_input":"2021-10-10T06:53:39.681918Z","iopub.status.idle":"2021-10-10T06:53:39.696905Z","shell.execute_reply.started":"2021-10-10T06:53:39.681876Z","shell.execute_reply":"2021-10-10T06:53:39.695993Z"},"trusted":true},"execution_count":38,"outputs":[]},{"cell_type":"code","source":" \nfrom wordcloud import WordCloud, STOPWORDS, ImageColorGenerator\n\n\ndef plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0), \n                   title = None, title_size=40, image_color=False):\n    stopwords = set(STOPWORDS)\n    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}\n    stopwords = stopwords.union(more_stopwords)\n\n    wordcloud = WordCloud(background_color='white',\n                    stopwords = stopwords,\n                    max_words = max_words,\n                    max_font_size = max_font_size, \n                    random_state = 42,\n                    width=800, \n                    height=400,\n                    mask = mask)\n    wordcloud.generate(str(text))\n    \n    plt.figure(figsize=figure_size)\n    if image_color:\n        image_colors = ImageColorGenerator(mask);\n        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation=\"bilinear\");\n        plt.title(title, fontdict={'size': title_size,  \n                                  'verticalalignment': 'bottom'})\n    else:\n        plt.imshow(wordcloud);\n        plt.title(title, fontdict={'size': title_size, 'color': 'black', \n                                  'verticalalignment': 'bottom'})\n    plt.axis('off');\n    plt.tight_layout()  \n    \nplot_wordcloud(stops, title=\"Word Cloud of 
stops\")","metadata":{"_kg_hide-input":true,"_uuid":"dde27caf5b374b0f3f0c2acd9d038daba43c4c31","execution":{"iopub.status.busy":"2021-10-10T06:53:39.698056Z","iopub.execute_input":"2021-10-10T06:53:39.698326Z","iopub.status.idle":"2021-10-10T06:53:40.927097Z","shell.execute_reply.started":"2021-10-10T06:53:39.698280Z","shell.execute_reply":"2021-10-10T06:53:40.926319Z"},"trusted":true},"execution_count":39,"outputs":[]},{"cell_type":"markdown","source":"First, let's see which words are treated as stopwords. Many of them contain a negation, like needn't. Negations are key to sentiment analysis, so we will remove them from the stopword list.","metadata":{"_uuid":"85cf983ca3865accc2b5630f8ef4430c4fd70e92"}},{"cell_type":"code","source":"not_stop = [\"aren't\",\"couldn't\",\"didn't\",\"doesn't\",\"don't\",\"hadn't\",\"hasn't\",\"haven't\",\"isn't\",\"mightn't\",\"mustn't\",\"needn't\",\"no\",\"nor\",\"not\",\"shan't\",\"shouldn't\",\"wasn't\",\"weren't\",\"wouldn't\"]\nfor i in not_stop:\n    stops.remove(i)","metadata":{"_uuid":"275f9e83f9935d0891fa11a6440b6caf4604e5d3","execution":{"iopub.status.busy":"2021-10-10T06:53:40.928222Z","iopub.execute_input":"2021-10-10T06:53:40.928686Z","iopub.status.idle":"2021-10-10T06:53:40.934061Z","shell.execute_reply.started":"2021-10-10T06:53:40.928628Z","shell.execute_reply":"2021-10-10T06:53:40.933332Z"},"trusted":true},"execution_count":40,"outputs":[]},{"cell_type":"code","source":"from sklearn import model_selection, preprocessing, metrics, ensemble, naive_bayes, linear_model\nfrom sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\nfrom sklearn.decomposition import TruncatedSVD\nimport lightgbm as lgb\n\npd.options.mode.chained_assignment = None\npd.options.display.max_columns = 999\nfrom bs4 import BeautifulSoup\nimport re\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.pipeline import Pipeline\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn import 
metrics\n\nfrom keras.preprocessing.text import Tokenizer\nfrom keras.preprocessing.sequence import pad_sequences\nfrom keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D\nfrom keras.layers import Bidirectional, GlobalMaxPool1D\nfrom keras.models import Model\nfrom keras import initializers, regularizers, constraints, optimizers, layers","metadata":{"_kg_hide-input":true,"_uuid":"386e3e954a0b1db1d5e7e739f6928ba245e3b9c0","execution":{"iopub.status.busy":"2021-10-10T06:53:40.935175Z","iopub.execute_input":"2021-10-10T06:53:40.935665Z","iopub.status.idle":"2021-10-10T06:53:41.236847Z","shell.execute_reply.started":"2021-10-10T06:53:40.935616Z","shell.execute_reply":"2021-10-10T06:53:41.236114Z"},"trusted":true},"execution_count":41,"outputs":[]},{"cell_type":"code","source":"stemmer = SnowballStemmer('english')\n\ndef review_to_words(raw_review):\n    # 1. Delete HTML \n    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()\n    # 2. Make a space\n    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)\n    # 3. lower letters\n    words = letters_only.lower().split()\n    # 5. Stopwords \n    meaningful_words = [w for w in words if not w in stops]\n    # 6. Stemming\n    stemming_words = [stemmer.stem(w) for w in meaningful_words]\n    # 7. 
space join words\n    return( ' '.join(stemming_words))","metadata":{"_uuid":"8368a97c8e435adb9f07d144f494fc60a083a5a1","execution":{"iopub.status.busy":"2021-10-10T06:53:41.237806Z","iopub.execute_input":"2021-10-10T06:53:41.238053Z","iopub.status.idle":"2021-10-10T06:53:41.244271Z","shell.execute_reply.started":"2021-10-10T06:53:41.238009Z","shell.execute_reply":"2021-10-10T06:53:41.243179Z"},"trusted":true},"execution_count":42,"outputs":[]},{"cell_type":"code","source":"%time df_all['review_clean'] = df_all['review'].apply(review_to_words)","metadata":{"_uuid":"be46dc37dd26328efedf9ca166b5635e0f13fb53","execution":{"iopub.status.busy":"2021-10-10T06:53:41.245516Z","iopub.execute_input":"2021-10-10T06:53:41.246153Z","iopub.status.idle":"2021-10-10T06:56:40.730094Z","shell.execute_reply.started":"2021-10-10T06:53:41.245782Z","shell.execute_reply":"2021-10-10T06:56:40.729068Z"},"trusted":true},"execution_count":43,"outputs":[]},{"cell_type":"markdown","source":"## 3. Model","metadata":{"_uuid":"1bdd3b519e1df8d4959c77274910f24fb0d35440"}},{"cell_type":"markdown","source":"### 3.1. 
Deep Learning Model Using N-gram","metadata":{"_uuid":"6f6301c9d03a198e3ce7b22b172970152ee21b6a"}},{"cell_type":"code","source":"# Make a rating\ndf_all['sentiment'] = df_all[\"rating\"].apply(lambda x: 1 if x > 5 else 0)","metadata":{"_uuid":"6adae84cb2d9d4e0588cb10033c17447bf976d58","execution":{"iopub.status.busy":"2021-10-10T06:56:40.731372Z","iopub.execute_input":"2021-10-10T06:56:40.731813Z","iopub.status.idle":"2021-10-10T06:56:40.865127Z","shell.execute_reply.started":"2021-10-10T06:56:40.731636Z","shell.execute_reply":"2021-10-10T06:56:40.864369Z"},"trusted":true},"execution_count":44,"outputs":[]},{"cell_type":"code","source":"df_train, df_test = train_test_split(df_all, test_size=0.33, random_state=42) ","metadata":{"_uuid":"ba7f1a884407f9247915b3b6757219501cf13a15","execution":{"iopub.status.busy":"2021-10-10T06:56:40.866121Z","iopub.execute_input":"2021-10-10T06:56:40.866357Z","iopub.status.idle":"2021-10-10T06:56:40.961244Z","shell.execute_reply.started":"2021-10-10T06:56:40.866315Z","shell.execute_reply":"2021-10-10T06:56:40.960486Z"},"trusted":true},"execution_count":45,"outputs":[]},{"cell_type":"code","source":"\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.pipeline import Pipeline\n\nvectorizer = CountVectorizer(analyzer = 'word', \n                             tokenizer = None,\n                             preprocessor = None, \n                             stop_words = None, \n                             min_df = 2, # 토큰이 나타날 최소 문서 개수\n                             ngram_range=(4, 4),\n                             max_features = 20000\n                            
)\nvectorizer","metadata":{"_uuid":"d6a9555e8c90e09fab2edaf2107c6b6bfcb588fd","execution":{"iopub.status.busy":"2021-10-10T06:56:40.962163Z","iopub.execute_input":"2021-10-10T06:56:40.962407Z","iopub.status.idle":"2021-10-10T06:56:40.970470Z","shell.execute_reply.started":"2021-10-10T06:56:40.962362Z","shell.execute_reply":"2021-10-10T06:56:40.969640Z"},"trusted":true},"execution_count":46,"outputs":[]},{"cell_type":"code","source":"\npipeline = Pipeline([\n    ('vect', vectorizer),\n])","metadata":{"_uuid":"1b983fc4de6f9e7e3be07445c418ef1060d6f02f","execution":{"iopub.status.busy":"2021-10-10T06:56:40.971780Z","iopub.execute_input":"2021-10-10T06:56:40.972294Z","iopub.status.idle":"2021-10-10T06:56:40.981408Z","shell.execute_reply.started":"2021-10-10T06:56:40.972235Z","shell.execute_reply":"2021-10-10T06:56:40.980627Z"},"trusted":true},"execution_count":47,"outputs":[]},{"cell_type":"code","source":"%time train_data_features = pipeline.fit_transform(df_train['review_clean'])\n# Reuse the vocabulary fitted on the training set; refitting on the test set would misalign the feature columns\n%time test_data_features = pipeline.transform(df_test['review_clean'])","metadata":{"_uuid":"3b1e9c628a7dec80e1e863f7acffdae64e3ae7ab","execution":{"iopub.status.busy":"2021-10-10T06:56:40.982305Z","iopub.execute_input":"2021-10-10T06:56:40.982548Z","iopub.status.idle":"2021-10-10T06:57:32.469285Z","shell.execute_reply.started":"2021-10-10T06:56:40.982506Z","shell.execute_reply":"2021-10-10T06:57:32.468330Z"},"trusted":true},"execution_count":48,"outputs":[]},{"cell_type":"code","source":"from tensorflow.python.keras.models import Sequential\nfrom tensorflow.python.keras.layers import Dense, Bidirectional, LSTM, BatchNormalization, Dropout\nfrom tensorflow.python.keras.preprocessing.sequence import 
pad_sequences","metadata":{"_uuid":"f71f9a7ad9cec9777814bf2746bc4237116b8b35","execution":{"iopub.status.busy":"2021-10-10T06:57:32.470161Z","iopub.execute_input":"2021-10-10T06:57:32.470400Z","iopub.status.idle":"2021-10-10T06:57:32.475555Z","shell.execute_reply.started":"2021-10-10T06:57:32.470357Z","shell.execute_reply":"2021-10-10T06:57:32.474774Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"code","source":"\nimport numpy as np\nimport keras\nfrom keras.models import Sequential\nfrom keras.layers import Dense\nimport random\n\n# 1. Dataset\ny_train = df_train['sentiment']\ny_test = df_test['sentiment']\nsolution = y_test.copy()\n\n# 2. Model Structure\nmodel = keras.models.Sequential()\n\nmodel.add(keras.layers.Dense(200, input_shape=(20000,)))\nmodel.add(keras.layers.BatchNormalization())\nmodel.add(keras.layers.Activation('relu'))\nmodel.add(keras.layers.Dropout(0.5))\n\nmodel.add(keras.layers.Dense(300))\nmodel.add(keras.layers.BatchNormalization())\nmodel.add(keras.layers.Activation('relu'))\nmodel.add(keras.layers.Dropout(0.5))\n\nmodel.add(keras.layers.Dense(100, activation='relu'))\nmodel.add(keras.layers.Dense(1, activation='sigmoid'))\n\n# 3. 
Model compile\nmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])","metadata":{"_uuid":"ce3dfe83a105f9628b61ee107c3fed27f252fb40","execution":{"iopub.status.busy":"2021-10-10T06:57:32.476915Z","iopub.execute_input":"2021-10-10T06:57:32.477439Z","iopub.status.idle":"2021-10-10T06:57:32.799319Z","shell.execute_reply.started":"2021-10-10T06:57:32.477346Z","shell.execute_reply":"2021-10-10T06:57:32.798452Z"},"trusted":true},"execution_count":50,"outputs":[]},{"cell_type":"code","source":"model.summary()","metadata":{"_uuid":"ea8db26bcd3cba382ff381684b3f229d381bca5f","execution":{"iopub.status.busy":"2021-10-10T06:57:32.800105Z","iopub.execute_input":"2021-10-10T06:57:32.800337Z","iopub.status.idle":"2021-10-10T06:57:32.811547Z","shell.execute_reply.started":"2021-10-10T06:57:32.800295Z","shell.execute_reply":"2021-10-10T06:57:32.810605Z"},"trusted":true},"execution_count":51,"outputs":[]},{"cell_type":"code","source":"# 4. Train model\nhist = model.fit(train_data_features, y_train, epochs=10, batch_size=64)\n\n# 5. Training process\n%matplotlib inline\nimport matplotlib.pyplot as plt\n\nfig, loss_ax = plt.subplots()\n\nacc_ax = loss_ax.twinx()\n\nloss_ax.set_ylim([0.0, 1.0])\nacc_ax.set_ylim([0.0, 1.0])\n\n# The accuracy key is 'acc' in older Keras releases and 'accuracy' in newer ones\ntrain_acc = hist.history.get('acc', hist.history.get('accuracy'))\n\nloss_ax.plot(hist.history['loss'], 'y', label='train loss')\nacc_ax.plot(train_acc, 'b', label='train acc')\n\nloss_ax.set_xlabel('epoch')\nloss_ax.set_ylabel('loss')\nacc_ax.set_ylabel('accuracy')\n\nloss_ax.legend(loc='upper left')\nacc_ax.legend(loc='lower left')\n\nplt.show()\n\n# 6. 
Evaluation\nloss_and_metrics = model.evaluate(test_data_features, y_test, batch_size=32)\nprint('loss_and_metrics : ' + str(loss_and_metrics))","metadata":{"_uuid":"48b6724a5d7b994c15bd4505bccdb6b3e602ba21","execution":{"iopub.status.busy":"2021-10-10T06:57:32.812427Z","iopub.execute_input":"2021-10-10T06:57:32.812671Z","iopub.status.idle":"2021-10-10T07:01:47.922283Z","shell.execute_reply.started":"2021-10-10T06:57:32.812628Z","shell.execute_reply":"2021-10-10T07:01:47.921475Z"},"trusted":true},"execution_count":52,"outputs":[]},{"cell_type":"code","source":"sub_preds_deep = model.predict(test_data_features,batch_size=32)","metadata":{"_uuid":"d678d5dd391fc455de2cad0b4b8f6fe76374583c","execution":{"iopub.status.busy":"2021-10-10T07:01:47.923189Z","iopub.execute_input":"2021-10-10T07:01:47.923426Z","iopub.status.idle":"2021-10-10T07:01:53.793014Z","shell.execute_reply.started":"2021-10-10T07:01:47.923383Z","shell.execute_reply":"2021-10-10T07:01:53.792170Z"},"trusted":true},"execution_count":53,"outputs":[]},{"cell_type":"markdown","source":"### 3.2 Lightgbm","metadata":{"_uuid":"603ee53139269ed2a9b009a65a9b026b9ce11bf4"}},{"cell_type":"markdown","source":"To improve the low accuracy, we will use machine learning. 
First of all, this is the sentiment analysis model using only usefulCount.","metadata":{"_uuid":"fee63f2564ce6ad240d6b2e3d1aaf028e98c15a8","trusted":true}},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score\nfrom sklearn.model_selection import KFold\nfrom lightgbm import LGBMClassifier\nfrom sklearn.metrics import confusion_matrix\n\n#folds = KFold(n_splits=5, shuffle=True, random_state=546789)\ntarget = df_train['sentiment']\nfeats = ['usefulCount']\n\nsub_preds = np.zeros(df_test.shape[0])\n\ntrn_x, val_x, trn_y, val_y = train_test_split(df_train[feats], target, test_size=0.2, random_state=42) \nfeature_importance_df = pd.DataFrame() \n    \nclf = LGBMClassifier(\n        n_estimators=2000,\n        learning_rate=0.05,\n        num_leaves=30,\n        #colsample_bytree=.9,\n        subsample=.9,\n        max_depth=7,\n        reg_alpha=.1,\n        reg_lambda=.1,\n        min_split_gain=.01,\n        min_child_weight=2,\n        silent=-1,\n        verbose=-1,\n        )\n        \nclf.fit(trn_x, trn_y, \n        eval_set= [(trn_x, trn_y), (val_x, val_y)], \n        verbose=100, early_stopping_rounds=100  #30\n    )\n\nsub_preds = clf.predict(df_test[feats])\n        \nfold_importance_df = pd.DataFrame()\nfold_importance_df[\"feature\"] = feats\nfold_importance_df[\"importance\"] = clf.feature_importances_\nfeature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)","metadata":{"_kg_hide-input":true,"_uuid":"2c4d7f24d1f25d99979687c2101f60bf202d2f04","execution":{"iopub.status.busy":"2021-10-10T07:01:53.800276Z","iopub.execute_input":"2021-10-10T07:01:53.800733Z","iopub.status.idle":"2021-10-10T07:01:59.086440Z","shell.execute_reply.started":"2021-10-10T07:01:53.800535Z","shell.execute_reply":"2021-10-10T07:01:59.085644Z"},"trusted":true},"execution_count":54,"outputs":[]},{"cell_type":"code","source":"solution = 
df_test['sentiment']\nconfusion_matrix(y_pred=sub_preds, y_true=solution)","metadata":{"_uuid":"1684caab656ca830228cc230871ac3e3d961352a","execution":{"iopub.status.busy":"2021-10-10T07:01:59.088016Z","iopub.execute_input":"2021-10-10T07:01:59.088459Z","iopub.status.idle":"2021-10-10T07:01:59.174662Z","shell.execute_reply.started":"2021-10-10T07:01:59.088283Z","shell.execute_reply":"2021-10-10T07:01:59.173407Z"},"trusted":true},"execution_count":55,"outputs":[]},{"cell_type":"markdown","source":"We will add variables for higher accuracy.","metadata":{"_uuid":"29474d09e5ec8832fbacc39d5738472c8470d768"}},{"cell_type":"code","source":"len_train = df_train.shape[0]\ndf_all = pd.concat([df_train,df_test])\ndel df_train, df_test;\ngc.collect()","metadata":{"_uuid":"8e6032557347dab3092ecfa45baec6f214b49017","execution":{"iopub.status.busy":"2021-10-10T07:01:59.176094Z","iopub.execute_input":"2021-10-10T07:01:59.176583Z","iopub.status.idle":"2021-10-10T07:01:59.549081Z","shell.execute_reply.started":"2021-10-10T07:01:59.176374Z","shell.execute_reply":"2021-10-10T07:01:59.548054Z"},"trusted":true},"execution_count":56,"outputs":[]},{"cell_type":"code","source":"df_all['date'] = pd.to_datetime(df_all['date'])\ndf_all['day'] = df_all['date'].dt.day\ndf_all['year'] = df_all['date'].dt.year\ndf_all['month'] = df_all['date'].dt.month","metadata":{"_uuid":"0ae19226fc6ec7c02e0510a51ec7dd05724da876","execution":{"iopub.status.busy":"2021-10-10T07:01:59.550273Z","iopub.execute_input":"2021-10-10T07:01:59.550748Z","iopub.status.idle":"2021-10-10T07:01:59.591559Z","shell.execute_reply.started":"2021-10-10T07:01:59.550520Z","shell.execute_reply":"2021-10-10T07:01:59.590715Z"},"trusted":true},"execution_count":57,"outputs":[]},{"cell_type":"code","source":"from textblob import TextBlob\nfrom tqdm import tqdm\nreviews = df_all['review_clean']\n\nPredict_Sentiment = []\nfor review in tqdm(reviews):\n    blob = TextBlob(review)\n    Predict_Sentiment += 
[blob.sentiment.polarity]\ndf_all[\"Predict_Sentiment\"] = Predict_Sentiment\ndf_all.head()","metadata":{"_uuid":"bd4f0e2c792a6d084979c548596df92848d06856","execution":{"iopub.status.busy":"2021-10-10T07:01:59.592485Z","iopub.execute_input":"2021-10-10T07:01:59.592727Z","iopub.status.idle":"2021-10-10T07:04:35.932082Z","shell.execute_reply.started":"2021-10-10T07:01:59.592685Z","shell.execute_reply":"2021-10-10T07:04:35.931242Z"},"trusted":true},"execution_count":58,"outputs":[]},{"cell_type":"code","source":"np.corrcoef(df_all[\"Predict_Sentiment\"], df_all[\"rating\"])","metadata":{"_uuid":"1091f4a36ae81c2973ecdb6fac964c646d93ba0f","execution":{"iopub.status.busy":"2021-10-10T07:04:35.933011Z","iopub.execute_input":"2021-10-10T07:04:35.933238Z","iopub.status.idle":"2021-10-10T07:04:35.944435Z","shell.execute_reply.started":"2021-10-10T07:04:35.933197Z","shell.execute_reply":"2021-10-10T07:04:35.943479Z"},"trusted":true},"execution_count":59,"outputs":[]},{"cell_type":"code","source":"np.corrcoef(df_all[\"Predict_Sentiment\"], df_all[\"sentiment\"])","metadata":{"_uuid":"6deb741f0941fa04ac1a87e9343f9079b5c1a83c","execution":{"iopub.status.busy":"2021-10-10T07:04:35.945497Z","iopub.execute_input":"2021-10-10T07:04:35.945807Z","iopub.status.idle":"2021-10-10T07:04:35.955995Z","shell.execute_reply.started":"2021-10-10T07:04:35.945708Z","shell.execute_reply":"2021-10-10T07:04:35.954778Z"},"trusted":true},"execution_count":60,"outputs":[]},{"cell_type":"code","source":"reviews = df_all['review']\n\nPredict_Sentiment = []\nfor review in tqdm(reviews):\n    blob = TextBlob(review)\n    Predict_Sentiment += [blob.sentiment.polarity]\ndf_all[\"Predict_Sentiment2\"] = 
Predict_Sentiment","metadata":{"_uuid":"38f83afcdd82bb469b8d6b6b90e9b60b6921219e","execution":{"iopub.status.busy":"2021-10-10T07:04:35.957509Z","iopub.execute_input":"2021-10-10T07:04:35.958340Z","iopub.status.idle":"2021-10-10T07:08:47.689100Z","shell.execute_reply.started":"2021-10-10T07:04:35.958196Z","shell.execute_reply":"2021-10-10T07:08:47.688425Z"},"trusted":true},"execution_count":61,"outputs":[]},{"cell_type":"code","source":"np.corrcoef(df_all[\"Predict_Sentiment2\"], df_all[\"rating\"])","metadata":{"_uuid":"8952070fd762f0d27667940b8a9c764a75eff67e","execution":{"iopub.status.busy":"2021-10-10T07:08:47.690089Z","iopub.execute_input":"2021-10-10T07:08:47.690329Z","iopub.status.idle":"2021-10-10T07:08:47.703036Z","shell.execute_reply.started":"2021-10-10T07:08:47.690285Z","shell.execute_reply":"2021-10-10T07:08:47.702406Z"},"trusted":true},"execution_count":62,"outputs":[]},{"cell_type":"code","source":"np.corrcoef(df_all[\"Predict_Sentiment2\"], df_all[\"sentiment\"])","metadata":{"_uuid":"69fa8972589337694b89e823a10e8134cd0b8ed2","execution":{"iopub.status.busy":"2021-10-10T07:08:47.703917Z","iopub.execute_input":"2021-10-10T07:08:47.704148Z","iopub.status.idle":"2021-10-10T07:08:47.714146Z","shell.execute_reply.started":"2021-10-10T07:08:47.704106Z","shell.execute_reply":"2021-10-10T07:08:47.713126Z"},"trusted":true},"execution_count":63,"outputs":[]},{"cell_type":"code","source":"\ndf_all['count_sent']=df_all[\"review\"].apply(lambda x: len(re.findall(\"\\n\",str(x)))+1)\n\n\ndf_all['count_word']=df_all[\"review_clean\"].apply(lambda x: len(str(x).split()))\n\n\ndf_all['count_unique_word']=df_all[\"review_clean\"].apply(lambda x: len(set(str(x).split())))\n\n\ndf_all['count_letters']=df_all[\"review_clean\"].apply(lambda x: len(str(x)))\n\n\ndf_all[\"count_punctuations\"] = df_all[\"review\"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))\n\n\ndf_all[\"count_words_upper\"] = df_all[\"review\"].apply(lambda x: len([w for w in 
str(x).split() if w.isupper()]))\n\n\ndf_all[\"count_words_title\"] = df_all[\"review\"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))\n\n\ndf_all[\"count_stopwords\"] = df_all[\"review\"].apply(lambda x: len([w for w in str(x).lower().split() if w in stops]))\n\n\ndf_all[\"mean_word_len\"] = df_all[\"review_clean\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))","metadata":{"_uuid":"6488abff05ebf6698e01cfae7550d977fb8a6b02","execution":{"iopub.status.busy":"2021-10-10T07:08:47.715589Z","iopub.execute_input":"2021-10-10T07:08:47.716117Z","iopub.status.idle":"2021-10-10T07:09:13.868535Z","shell.execute_reply.started":"2021-10-10T07:08:47.715861Z","shell.execute_reply":"2021-10-10T07:09:13.867484Z"},"trusted":true},"execution_count":64,"outputs":[]},{"cell_type":"markdown","source":"We added a season variable.","metadata":{"_uuid":"cfeeba96c727574a58ade17ffdab2f1d7d7ebf62"}},{"cell_type":"code","source":"df_all['season'] = df_all[\"month\"].apply(lambda x: 1 if ((x>2) & (x<6)) else(2 if (x>5) & (x<9) else (3 if (x>8) & (x<12) else 4)))","metadata":{"_uuid":"6a4129d1c7c60b57014db60a990bbeef5e73e223","execution":{"iopub.status.busy":"2021-10-10T07:09:13.869639Z","iopub.execute_input":"2021-10-10T07:09:13.869951Z","iopub.status.idle":"2021-10-10T07:09:14.053809Z","shell.execute_reply.started":"2021-10-10T07:09:13.869904Z","shell.execute_reply":"2021-10-10T07:09:14.052899Z"},"trusted":true},"execution_count":65,"outputs":[]},{"cell_type":"markdown","source":"We normalized useful count.","metadata":{"_uuid":"025efc594ab90e1c2e4c0f32a3eca56aa063976c"}},{"cell_type":"code","source":"df_train = df_all[:len_train]\ndf_test = 
df_all[len_train:]","metadata":{"_kg_hide-input":true,"_uuid":"d001d32f4216747d3ad768c3d136ab4974c21f0f","execution":{"iopub.status.busy":"2021-10-10T07:09:14.054834Z","iopub.execute_input":"2021-10-10T07:09:14.055073Z","iopub.status.idle":"2021-10-10T07:09:14.125468Z","shell.execute_reply.started":"2021-10-10T07:09:14.055031Z","shell.execute_reply":"2021-10-10T07:09:14.124776Z"},"trusted":true},"execution_count":66,"outputs":[]},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score\nfrom sklearn.model_selection import KFold\nfrom lightgbm import LGBMClassifier\n\n#folds = KFold(n_splits=5, shuffle=True, random_state=546789)\ntarget = df_train['sentiment']\nfeats = ['usefulCount','day','year','month','Predict_Sentiment','Predict_Sentiment2', 'count_sent',\n 'count_word', 'count_unique_word', 'count_letters', 'count_punctuations',\n 'count_words_upper', 'count_words_title', 'count_stopwords', 'mean_word_len', 'season']\n\nsub_preds = np.zeros(df_test.shape[0])\n\ntrn_x, val_x, trn_y, val_y = train_test_split(df_train[feats], target, test_size=0.2, random_state=42) \nfeature_importance_df = pd.DataFrame() \n    \nclf = LGBMClassifier(\n        n_estimators=10000,\n        learning_rate=0.10,\n        num_leaves=30,\n        #colsample_bytree=.9,\n        subsample=.9,\n        max_depth=7,\n        reg_alpha=.1,\n        reg_lambda=.1,\n        min_split_gain=.01,\n        min_child_weight=2,\n        silent=-1,\n        verbose=-1,\n        )\n        \nclf.fit(trn_x, trn_y, \n        eval_set= [(trn_x, trn_y), (val_x, val_y)], \n        verbose=100, early_stopping_rounds=100  #30\n    )\n\nsub_preds = clf.predict(df_test[feats])\n        \nfold_importance_df = pd.DataFrame()\nfold_importance_df[\"feature\"] = feats\nfold_importance_df[\"importance\"] = clf.feature_importances_\nfeature_importance_df = pd.concat([feature_importance_df, fold_importance_df], 
axis=0)","metadata":{"_kg_hide-input":true,"_uuid":"99577a2f98a7e997ff0d1c317da32219cbadf506","execution":{"iopub.status.busy":"2021-10-10T07:09:14.126386Z","iopub.execute_input":"2021-10-10T07:09:14.126618Z","iopub.status.idle":"2021-10-10T07:12:04.995025Z","shell.execute_reply.started":"2021-10-10T07:09:14.126576Z","shell.execute_reply":"2021-10-10T07:12:04.994284Z"},"trusted":true},"execution_count":67,"outputs":[]},{"cell_type":"code","source":"confusion_matrix(y_pred=sub_preds, y_true=solution)","metadata":{"_uuid":"a666baea3f58fe9c6d5292cfe24391f0ebb903e8","execution":{"iopub.status.busy":"2021-10-10T07:12:04.996047Z","iopub.execute_input":"2021-10-10T07:12:04.996285Z","iopub.status.idle":"2021-10-10T07:12:05.078204Z","shell.execute_reply.started":"2021-10-10T07:12:04.996242Z","shell.execute_reply":"2021-10-10T07:12:05.077088Z"},"trusted":true},"execution_count":68,"outputs":[]},{"cell_type":"code","source":"cols = feature_importance_df[[\"feature\", \"importance\"]].groupby(\"feature\").mean().sort_values(\n    by=\"importance\", ascending=False)[:50].index\n\nbest_features = feature_importance_df.loc[feature_importance_df.feature.isin(cols)]\n\nplt.figure(figsize=(14,10))\nsns.barplot(x=\"importance\", y=\"feature\", data=best_features.sort_values(by=\"importance\", ascending=False))\nplt.title('LightGBM Features (avg over folds)')\nplt.tight_layout()\nplt.savefig('lgbm_importances.png')","metadata":{"_uuid":"208f503ba23bc2d8412b27c61772371315ffef78","execution":{"iopub.status.busy":"2021-10-10T07:12:05.079506Z","iopub.execute_input":"2021-10-10T07:12:05.079946Z","iopub.status.idle":"2021-10-10T07:12:05.803412Z","shell.execute_reply.started":"2021-10-10T07:12:05.079767Z","shell.execute_reply":"2021-10-10T07:12:05.802649Z"},"trusted":true},"execution_count":69,"outputs":[]},{"cell_type":"markdown","source":"### 3.3 Dictionary_Sentiment_Analysis","metadata":{"_uuid":"c8feb72b512e8032016da05083eebe80c184c58a"}},{"cell_type":"markdown","source":"Because the 
package used to compute the Predict_Sentiment features was built from movie review data, it may be unsuitable for this project, which analyzes drug reviews. To make up for this, we conducted an additional sentiment analysis using the Harvard General Inquirer dictionary.","metadata":{"_uuid":"6d26dbb8f59e4a7d4d9518e6b14e485cd79b6a08"}},{"cell_type":"code","source":"# import dictionary data\nword_table = pd.read_csv(\"../input/dictionary/inquirerbasic.csv\")","metadata":{"_uuid":"62f0f62033a5f3661c119213ed322b880a545da4","execution":{"iopub.status.busy":"2021-10-10T07:12:05.804560Z","iopub.execute_input":"2021-10-10T07:12:05.805025Z","iopub.status.idle":"2021-10-10T07:12:05.833781Z","shell.execute_reply.started":"2021-10-10T07:12:05.804969Z","shell.execute_reply":"2021-10-10T07:12:05.833082Z"},"trusted":true},"execution_count":70,"outputs":[]},{"cell_type":"code","source":"word_table.head()","metadata":{"_uuid":"e3ea9c71d4d8c47a991b2ddd11143699af5c6ac0","execution":{"iopub.status.busy":"2021-10-10T07:12:05.834568Z","iopub.execute_input":"2021-10-10T07:12:05.834832Z","iopub.status.idle":"2021-10-10T07:12:05.850142Z","shell.execute_reply.started":"2021-10-10T07:12:05.834786Z","shell.execute_reply":"2021-10-10T07:12:05.849309Z"},"trusted":true},"execution_count":71,"outputs":[]},{"cell_type":"code","source":"##1. 
make list of sentiment\nimport re  # needed for the re.sub calls below\n#Positiv word list   \ntemp_Positiv = []\nPositiv_word_list = []\nfor i in range(0,len(word_table.Positiv)):\n    if word_table.iloc[i,2] == \"Positiv\":\n        temp = word_table.iloc[i,0].lower()\n        temp1 = re.sub(r'\\d+', '', temp)  # strip digits\n        temp2 = re.sub('#', '', temp1)  # strip '#'\n        temp_Positiv.append(temp2)\n\nPositiv_word_list = list(set(temp_Positiv))  # remove duplicates\nlen(temp_Positiv)\nlen(Positiv_word_list)  #del temp_Positiv\n\n#Negativ word list          \ntemp_Negativ = []\nNegativ_word_list = []\nfor i in range(0,len(word_table.Negativ)):\n    if word_table.iloc[i,3] == \"Negativ\":\n        temp = word_table.iloc[i,0].lower()\n        temp1 = re.sub(r'\\d+', '', temp)  # strip digits\n        temp2 = re.sub('#', '', temp1)  # strip '#'\n        temp_Negativ.append(temp2)\n\nNegativ_word_list = list(set(temp_Negativ))  # remove duplicates\nlen(temp_Negativ)\nlen(Negativ_word_list)  #del temp_Negativ","metadata":{"_uuid":"384ba6b02f776f9080d57c18a7f3740b3a2236d3","execution":{"iopub.status.busy":"2021-10-10T07:12:05.851057Z","iopub.execute_input":"2021-10-10T07:12:05.851270Z","iopub.status.idle":"2021-10-10T07:12:06.382198Z","shell.execute_reply.started":"2021-10-10T07:12:05.851230Z","shell.execute_reply":"2021-10-10T07:12:06.381103Z"},"trusted":true},"execution_count":72,"outputs":[]},{"cell_type":"markdown","source":"We counted the number of words in review_clean that appear in the dictionary.","metadata":{"_uuid":"73ac447f7fad4fdef122155d74cbe58e00fc57ec"}},{"cell_type":"code","source":"##2. 
counting the word 98590\nimport numpy as np\nfrom sklearn.feature_extraction.text import CountVectorizer\n\nvectorizer = CountVectorizer(vocabulary = Positiv_word_list)\ncontent = df_test['review_clean']\nX = vectorizer.fit_transform(content)\nf = X.toarray()\nf = pd.DataFrame(f)\nf.columns=Positiv_word_list\ndf_test[\"num_Positiv_word\"] = f.sum(axis=1)\n\nvectorizer2 = CountVectorizer(vocabulary = Negativ_word_list)\ncontent = df_test['review_clean']\nX2 = vectorizer2.fit_transform(content)\nf2 = X2.toarray()\nf2 = pd.DataFrame(f2)\nf2.columns=Negativ_word_list\ndf_test[\"num_Negativ_word\"] = f2.sum(axis=1)","metadata":{"_uuid":"bddbe0aacbe8480869d1935d49bac6397827cd6c","execution":{"iopub.status.busy":"2021-10-10T07:12:06.385139Z","iopub.execute_input":"2021-10-10T07:12:06.385556Z","iopub.status.idle":"2021-10-10T07:12:15.157157Z","shell.execute_reply.started":"2021-10-10T07:12:06.385383Z","shell.execute_reply":"2021-10-10T07:12:15.156425Z"},"trusted":true},"execution_count":73,"outputs":[]},{"cell_type":"code","source":"##3. decide sentiment\ndf_test[\"Positiv_ratio\"] = df_test[\"num_Positiv_word\"]/(df_test[\"num_Positiv_word\"]+df_test[\"num_Negativ_word\"])\n# The ratio is NaN when a review has no dictionary words; NaN and exact ties (0.5) fall through to neutral (0.5)\ndf_test[\"sentiment_by_dic\"] = df_test[\"Positiv_ratio\"].apply(lambda x: 1 if (x>0.5) else (0 if (x<0.5) else 0.5))\n\ndf_test.head()","metadata":{"_uuid":"63f678116cb243d737982bb6ae8e9dbd3da49094","scrolled":true,"execution":{"iopub.status.busy":"2021-10-10T07:12:15.158023Z","iopub.execute_input":"2021-10-10T07:12:15.158255Z","iopub.status.idle":"2021-10-10T07:12:15.326214Z","shell.execute_reply.started":"2021-10-10T07:12:15.158212Z","shell.execute_reply":"2021-10-10T07:12:15.325523Z"},"trusted":true},"execution_count":74,"outputs":[]},{"cell_type":"markdown","source":"We defined Positiv_ratio = (number of positive words) / (number of positive words + number of negative words). If the ratio is lower than 0.5, we classified the review as negative, and if it is higher than 0.5, we classified it as positive. 
The remaining reviews, including those that contain no positive or negative dictionary words, were classified as neutral.","metadata":{"_uuid":"da172c9523937834f34ba3d2a137d73c0ad41443"}},{"cell_type":"markdown","source":"As mentioned earlier, we normalized usefulCount by condition, dividing each review's usefulCount by the number of reviews for its condition, to correct for the bias that usefulCount shows across conditions. We then add the three predicted sentiment values and multiply the sum by the normalized usefulCount to get the final predicted value.\n\nNow we can recommend drugs for each condition in order of the final predicted value.","metadata":{"_uuid":"ebc76b1ffbc7eafc6632f8c89949040812a704eb"}},{"cell_type":"code","source":"def userful_count(data):\n    # attach the number of reviews per condition as 'user_size'\n    grouped = data.groupby(['condition']).size().reset_index(name='user_size')\n    data = pd.merge(data,grouped,on='condition',how='left')\n    return data\n#___________________________________________________________\ndf_test =  userful_count(df_test) \ndf_test['usefulCount'] = df_test['usefulCount']/df_test['user_size']","metadata":{"_uuid":"85cf8b344e09601d6fb5951ece7c9e2601c3ed8d","execution":{"iopub.status.busy":"2021-10-10T07:12:15.327315Z","iopub.execute_input":"2021-10-10T07:12:15.327561Z","iopub.status.idle":"2021-10-10T07:12:15.447903Z","shell.execute_reply.started":"2021-10-10T07:12:15.327520Z","shell.execute_reply":"2021-10-10T07:12:15.447014Z"},"trusted":true},"execution_count":75,"outputs":[]},{"cell_type":"code","source":"df_test['deep_pred'] = sub_preds_deep\ndf_test['machine_pred'] = sub_preds\n\ndf_test['total_pred'] = (df_test['deep_pred'] + df_test['machine_pred'] + 
df_test['sentiment_by_dic'])*df_test['usefulCount']","metadata":{"_uuid":"ff0e4b2e3a4c9d6b82b2f32a9ac8601f2a18573d","execution":{"iopub.status.busy":"2021-10-10T07:12:15.448892Z","iopub.execute_input":"2021-10-10T07:12:15.449141Z","iopub.status.idle":"2021-10-10T07:12:15.458458Z","shell.execute_reply.started":"2021-10-10T07:12:15.449099Z","shell.execute_reply":"2021-10-10T07:12:15.457537Z"},"trusted":true},"execution_count":76,"outputs":[]},{"cell_type":"code","source":"df_test = df_test.groupby(['condition','drugName']).agg({'total_pred' : ['mean']})\ndf_test","metadata":{"_uuid":"46e8ff4f7569243d85c489e5034e37850ce623f3","scrolled":true,"execution":{"iopub.status.busy":"2021-10-10T07:12:15.460120Z","iopub.execute_input":"2021-10-10T07:12:15.460887Z","iopub.status.idle":"2021-10-10T07:12:15.530978Z","shell.execute_reply.started":"2021-10-10T07:12:15.460547Z","shell.execute_reply":"2021-10-10T07:12:15.529952Z"},"trusted":true},"execution_count":77,"outputs":[]},{"cell_type":"markdown","source":"## 4. Result","metadata":{"_uuid":"f03520f613d39b0721baae76ebca2be0f722cdb3"}},{"cell_type":"markdown","source":"Our team set out to recommend the right medicine for a patient's condition based on reviews, and carried out the project in three stages: data exploration, data preprocessing, and modeling. In the data exploration section, we examined the shape of the data using visualization and statistical techniques. We also looked for the n-grams that best represent sentiment, and at the relationship between date and rating. The next step was to preprocess the data according to our topic, for example by removing conditions that have only one drug to recommend. In the modeling stage, we used a deep learning model with n-grams, and additionally used a machine learning model, LightGBM, to overcome the limitations of natural language processing. 
In addition, we conducted sentiment analysis using a sentiment word dictionary to overcome the limitations of a package trained on movie data. We also normalized usefulCount by condition for better reliability. These steps allowed us to calculate the final predicted value and recommend the appropriate drug for each condition in order of that value.","metadata":{"_uuid":"1fac2f7625a0a7f7e8d349adcff64e0e50bf75e6"}},{"cell_type":"markdown","source":"## 5. Limitations","metadata":{"_uuid":"24aedde8fec7cde0d040d027ba5e9ff76667de70"}},{"cell_type":"markdown","source":"In conclusion, these are the limitations we encountered during the project.\n\n1. Sentiment analysis using the sentiment word dictionary has low reliability when the number of positive and negative words is small. For example, a review with 0 positive words and 1 negative word is classified as negative. Therefore, we could exclude observations that contain 5 or fewer sentiment words.\n2. To ensure the reliability of the predicted values, we normalized usefulCount and multiplied the predicted values by it. However, usefulCount may tend to be higher for older reviews, because the cumulative number of site visitors grows over time. Therefore, we should also have considered time when normalizing usefulCount.\n3. If the sentiment is positive, the reliability weight should push the value in the positive direction, and if it is negative, in the negative direction. However, we simply multiplied by usefulCount for reliability and did not take this into account. We should have applied usefulCount with a sign that depends on the sentiment.","metadata":{"_uuid":"959209a6bc1c99eb5526699b8fc60702521e5bfe"}}]}