## This notebook analyzes the description column in winemag-data_first150k.csv.
The description column contains free text, and I have just learned some NLP techniques, so I try them here. This is a kind of sentiment analysis. I add a new column called tasety based on the points column: if points is greater than or equal to 88, tasety is 1; otherwise tasety is 0. The threshold 88 is chosen because it is the median of the points column over the whole data set. The goal of this notebook is to find the words most strongly associated with positive descriptions and with negative descriptions.
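
The labeling rule above can be sketched on a small frame. This is a minimal sketch, not the code that produced description_tasety.csv: the `wine` frame and its rows are made up for illustration, and computing the threshold with `median()` is an assumption (the notebook may simply hardcode 88).

```python
import pandas as pd

# Hypothetical toy rows; the real file has one description and one points value per review.
wine = pd.DataFrame({
    "description": [
        "thin and simple",
        "long, elegant finish",
        "pure fruit",
        "easy everyday red",
        "lovely, will age for years",
    ],
    "points": [85, 88, 90, 87, 92],
})

# Use the median of points as the cut-off (88 for this toy data, as in the notebook).
threshold = wine["points"].median()

# tasety is 1 when points >= threshold, otherwise 0.
wine["tasety"] = (wine["points"] >= threshold).astype(int)
print(wine[["points", "tasety"]])
```

Splitting at the median keeps the two classes roughly the same size, which is why the later sampling step can draw 5000 rows from each side.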


In [1]:
# import pandas
import pandas as pd

import string

# import sklearn libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# import nltk library to do text analysis
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

### Select 5000 samples from the positive data and 5000 samples from the negative data. The raw data is too large to process in full; the notebook kernel dies.

In [6]:
description_tasety = pd.read_csv("/Users/rwang/PycharmProjects/Kaggle_projects/Wine data analysis/description_tasety.csv", index_col=0)
description_tasety_pos = description_tasety[description_tasety.tasety==1].sample(5000, random_state=0)
description_tasety_neg = description_tasety[description_tasety.tasety==0].sample(5000, random_state=0)
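# combine both classes and shuffle the rows (pd.concat replaces DataFrame.append, which newer pandas no longer has)
description_tasety = pd.concat([description_tasety_pos, description_tasety_neg]).sample(frac=1).reset_index(drop=True)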

In [7]:
# check for duplicated rows; many rows are duplicated, so remove them.
description_tasety.duplicated().value_counts()

False    9695
True      305
dtype: int64

In [8]:
duplicateRows = description_tasety[description_tasety.duplicated()]
description_tasety.drop_duplicates(inplace=True)

In [9]:
# check the number of positive and negative rows
description_tasety.tasety.value_counts()

1.0    4849
0.0    4846
Name: tasety, dtype: int64

### Text preprocessing begins below

In [10]:
# convert all strings in description to lower case
description_tasety.loc[:, 'description'] = description_tasety['description'].str.lower()

In [11]:
# remove punctuations in description
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

In [12]:
description_tasety.loc[:, 'description'] = description_tasety['description'].apply(remove_punctuations)
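
As a side note, the same stripping can be done in one pass with `str.translate`. This is a minimal sketch of an alternative, not the notebook's method; `remove_punctuations_fast` and `sample` are illustrative names that do not appear elsewhere in the notebook.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuations_fast(text):
    """Remove punctuation in a single pass over the string."""
    return text.translate(_PUNCT_TABLE)

sample = "it's soft, easy-drinking (and simple)!"
print(remove_punctuations_fast(sample))  # its soft easydrinking and simple
```

Either version merges hyphenated words ('easy-drinking' becomes 'easydrinking') and turns contractions such as "it's" into 'its', which matters later when tokens are compared against the stop word list.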

In [13]:
# remove numbers in description (raw string and regex=True so the pattern is treated as a regular expression)
description_tasety.loc[:, 'description'] = description_tasety['description'].str.replace(r'\d+', '', regex=True)

In [14]:
# tokenize description into words. Each entry becomes a list of words without punctuation
# or numbers
description_tasety['description_token'] = description_tasety['description'].apply(word_tokenize)

In [16]:
# remove stop words from the token lists (a set makes the membership test fast)
stop = set(stopwords.words('english'))
description_tasety['description_token'] = description_tasety['description_token'].apply(lambda x: [item for item in x if item not in stop])

In [17]:
# join each list of words back into a string for the next steps
description_tasety['description_token']=description_tasety['description_token'].apply(lambda x: " ".join(x) )

In [24]:
# build a bag of 1-grams weighted with tf-idf
tfidf = TfidfVectorizer(ngram_range=(1, 1))
features = tfidf.fit_transform(description_tasety['description_token'])
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
X = pd.DataFrame(features.toarray(),
            columns=tfidf.get_feature_names_out())
target = description_tasety['tasety']
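
Converting the tf-idf matrix to a dense DataFrame is probably what uses most of the memory here. As a hedged sketch, assuming the dense frame is not needed for anything else, the sparse matrix can be passed to LogisticRegression directly. The `docs` and `labels` below are a made-up toy corpus standing in for description_token and tasety.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; 1 = positive description, 0 = negative description.
docs = [
    "long elegant finish",
    "simple easy everyday wine",
    "pure lovely fruit ages years",
    "soft simple flavors",
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))
features_sparse = vectorizer.fit_transform(docs)  # SciPy sparse matrix, never densified

model = LogisticRegression(C=1)
model.fit(features_sparse, labels)  # scikit-learn estimators accept sparse input

print(sparse.issparse(features_sparse), model.coef_.shape)
```

With this approach it may be possible to fit on more than 10000 rows, although that has not been tried in this notebook.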

### End of text preprocessing. Use LogisticRegression to create the model.

In [25]:
final_model = LogisticRegression(C=1)
final_model.fit(X, target)



LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [26]:
# find the top positive and negative words by model coefficient
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
feature_to_coef = {str(word): float(coef) for word, coef in zip(tfidf.get_feature_names_out(), final_model.coef_[0])}

positives = sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)
for best_positive in positives[:5]:
    print("top positive word: ", best_positive)

print('-'*60)
negatives = sorted(feature_to_coef.items(), key=lambda x: x[1])

for best_negative in negatives[:5]:
    print("top negative word: ", best_negative)

top positive word:  ('long', 4.857741021932491)
top positive word:  ('years', 4.519974764546312)
top positive word:  ('elegant', 3.423864927559103)
top positive word:  ('lovely', 3.2824082492823266)
top positive word:  ('pure', 3.1929487910152043)
------------------------------------------------------------
top negative word:  ('simple', -4.828173651004282)
top negative word:  ('easy', -3.1711054901580664)
top negative word:  ('flavors', -2.957474059668208)
top negative word:  ('everyday', -2.7993960121546495)
top negative word:  ('soft', -2.7436186687785096)


### Conclusion:
The top positive words are 'long', 'years', 'elegant', 'lovely', and 'pure'. 'long' and 'years' are difficult to explain on their own: since I only use 1-gram tf-idf, the model gives single words, and we have to guess the phrases they come from, such as 'long sunshine' or 'many years'. Read that way, they are consistent with how we describe good wines. <br>
The top negative words are 'simple', 'easy', 'flavors', 'everyday', and 'soft'. Again guessing the phrases, these could be 'bad flavors' or 'it's soft'. They are common words we use when we feel a wine is not good.
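
Instead of guessing the phrases, one could count the word pairs that actually contain a given word. The sketch below is one possible way to do that; `neighbours` is a hypothetical helper, and `toy` is made-up data, not results from the wine reviews.

```python
from collections import Counter

def neighbours(token_lists, word):
    """Count adjacent word pairs (bigrams) that contain `word`."""
    counts = Counter()
    for tokens in token_lists:
        for left, right in zip(tokens, tokens[1:]):
            if word in (left, right):
                counts[(left, right)] += 1
    return counts

# Hypothetical token lists, shaped like description_token before the join step.
toy = [
    "long finish ripe fruit".split(),
    "elegant wine long finish".split(),
    "age several years".split(),
]

print(neighbours(toy, "long").most_common(3))
```

On the real data this would be called with the token lists from the tokenization step. Another option is to set `ngram_range=(1, 2)` in TfidfVectorizer so that two-word phrases get their own coefficients, at the cost of a much larger feature matrix.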