Text Mining
Dataset: sentiment_movies.csv contains film reviews from an industry portal, together with a variable indicating whether each review is positive or negative.

- Identify which words are most characteristic of positive reviews and which of negative ones.
- Remember to filter out words that are noise, and to select the categories of words that are relevant to the task.
- Using the machine learning methods you know, check whether there are any clusters of co-occurring words.

- https://www.kaggle.com/oumaimahourrane/sentiment-analysis-ml-models-comparison
- https://www.kaggle.com/oumaimahourrane/imdb-reviews/kernels
- https://www.kaggle.com/sergiadi/iet-x-mlda-workshop
- https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
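
One possible way to approach the first task is to compare how often each word appears in positive and in negative reviews. The sketch below is only an illustration under stated assumptions: the four toy reviews are invented, `log_odds` is a hypothetical helper name, and add-one smoothing is just one of several reasonable choices. It is not the method used later in this notebook.

```python
import math
from collections import Counter

# Toy reviews invented for illustration; 1 = positive, 0 = negative.
reviews = [
    ("great acting wonderful plot", 1),
    ("wonderful film great cast", 1),
    ("boring plot awful acting", 0),
    ("awful film boring cast", 0),
]

# Count word occurrences separately for each class.
counts = {0: Counter(), 1: Counter()}
for text, label in reviews:
    counts[label].update(text.split())

vocabulary = set(counts[0]) | set(counts[1])
totals = {label: sum(c.values()) for label, c in counts.items()}


def log_odds(word, smoothing=1.0):
    """Log ratio of the smoothed relative frequency: positive vs. negative."""
    denom_pos = totals[1] + smoothing * len(vocabulary)
    denom_neg = totals[0] + smoothing * len(vocabulary)
    pos = (counts[1][word] + smoothing) / denom_pos
    neg = (counts[0][word] + smoothing) / denom_neg
    return math.log(pos / neg)


# Most negative scores first, most positive scores last.
ranked = sorted(vocabulary, key=log_odds)
most_negative = ranked[:2]
most_positive = ranked[-2:]
print("negative:", most_negative)
print("positive:", most_positive)
```

Words that occur equally often in both classes (here "plot" or "film") score zero, which is why a ratio tends to be more informative for this task than raw frequency.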

In [13]:
import pandas as pd
import re
from nltk.corpus import stopwords

In [7]:
# 2 columns, comma-separated values
data = pd.read_csv('sentiment_movies.csv', encoding='latin-1')
data.head()

Unnamed: 0,SentimentText,Sentiment
0,"first think another Disney movie, might good, ...",1
1,"Put aside Dr. House repeat missed, Desperate H...",0
2,"big fan Stephen King's work, film made even gr...",1
3,watched horrid thing TV. Needless say one movi...,0
4,truly enjoyed film. acting terrific plot. Jeff...,1


In [8]:
data.describe()

Unnamed: 0,Sentiment
count,25000.0
mean,0.5
std,0.50001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


Feature Extraction

In [9]:
# Number of words
# Hypothesis: negative reviews contain fewer words than positive ones
data['word_count'] = data['SentimentText'].apply(lambda x: len(str(x).split(" ")))
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count
0,"first think another Disney movie, might good, ...",1,52
1,"Put aside Dr. House repeat missed, Desperate H...",0,86
2,"big fan Stephen King's work, film made even gr...",1,193
3,watched horrid thing TV. Needless say one movi...,0,63
4,truly enjoyed film. acting terrific plot. Jeff...,1,65
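
The hypothesis in the comment can be checked by comparing the mean word count per sentiment label. The sketch below uses only the five preview rows shown above, so it illustrates the mechanics and says nothing about the full data set; the variable names are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# (Sentiment, word_count) pairs copied from the five preview rows above.
preview = [(1, 52), (0, 86), (1, 193), (0, 63), (1, 65)]

# Group the word counts by sentiment label.
by_label = defaultdict(list)
for label, n_words in preview:
    by_label[label].append(n_words)

mean_words = {label: mean(values) for label, values in by_label.items()}
print(mean_words)
```

On the full DataFrame the same comparison can be written as `data.groupby('Sentiment')['word_count'].mean()`.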


In [10]:
# Number of characters
# str.len() counts every character, including spaces
data['char_count'] = data['SentimentText'].str.len()
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count
0,"first think another Disney movie, might good, ...",1,52,314
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565
2,"big fan Stephen King's work, film made even gr...",1,193,1268
3,watched horrid thing TV. Needless say one movi...,0,63,414
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477


In [11]:
# Average word length (0.0 for texts with no words, to avoid division by zero)
def avg_word(sentence):
    words = sentence.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

data['avg_word'] = data['SentimentText'].apply(avg_word)
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word
0,"first think another Disney movie, might good, ...",1,52,314,5.057692
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846


In [14]:
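# Number of stopwords (requires the NLTK corpus: nltk.download('stopwords'))
# Note: the match is case-sensitive and the NLTK list is lower-case
stop = set(stopwords.words('english'))
data['stopwords'] = data['SentimentText'].apply(lambda text: sum(word in stop for word in text.split()))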
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word,stopwords
0,"first think another Disney movie, might good, ...",1,52,314,5.057692,1
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395,2
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513,3
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302,1
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846,2
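
The stopword counts above are low, which may mean the texts were already stripped of most stopwords, but it also reflects how the match is done: a plain `split()` leaves capital letters and trailing punctuation attached to the tokens. The sketch below shows the difference. The five-word `stop` set is a stand-in for the NLTK list so that the example runs without a corpus download, and `count_stopwords` is a hypothetical helper.

```python
# A tiny stand-in for the NLTK stopword list (assumption for this sketch).
stop = {"the", "a", "is", "it", "this"}


def count_stopwords(text, normalize=False):
    """Count tokens that appear in the stopword set."""
    words = text.split()
    if normalize:
        # Lower-case and trim surrounding punctuation before the lookup.
        words = [w.lower().strip(".,!?;:\"'()") for w in words]
    return sum(w in stop for w in words)


sample = "The film is long, but I enjoyed it."
raw_count = count_stopwords(sample)                         # exact-match tokens only
normalized_count = count_stopwords(sample, normalize=True)  # also "The" and "it."
print(raw_count, normalized_count)
```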


In [15]:
# Number of numerics
data['numerics'] = data['SentimentText'].apply(lambda text: sum(word.isdigit() for word in text.split()))
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word,stopwords,numerics
0,"first think another Disney movie, might good, ...",1,52,314,5.057692,1,2
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395,2,4
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513,3,1
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302,1,0
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846,2,0


In [16]:
# Number of Uppercase words
# Anger or rage is quite often expressed by writing in UPPERCASE words
data['upper'] = data['SentimentText'].apply(lambda text: sum(word.isupper() for word in text.split()))
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word,stopwords,numerics,upper
0,"first think another Disney movie, might good, ...",1,52,314,5.057692,1,2,0
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395,2,4,1
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513,3,1,0
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302,1,0,2
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846,2,0,0
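
Note that `str.isupper()` is also true for one-letter tokens such as "I" and for abbreviations such as "TV.", so the count mixes emphasis with ordinary capitalised tokens. A minimal sketch of a length threshold follows; `count_upper` and its `min_length` parameter are hypothetical and introduced only for this illustration.

```python
def count_upper(text, min_length=1):
    """Count all-uppercase tokens that have at least min_length characters."""
    return sum(w.isupper() and len(w) >= min_length for w in text.split())


sample = "I watched this on TV. It was AWFUL"
all_upper = count_upper(sample)                 # counts "I", "TV." and "AWFUL"
longer_upper = count_upper(sample, min_length=2)  # drops the pronoun "I"
print(all_upper, longer_upper)
```

A length threshold cannot tell "TV" from "AWFUL"; separating abbreviations from emphasis would need an additional word list.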


Preprocessing

In [18]:
# Lower-case the text and collapse runs of whitespace
data['SentimentText'] = data['SentimentText'].apply(lambda text: " ".join(text.lower().split()))
data['SentimentText'].head()

0    first think another disney movie, might good, ...
1    put aside dr. house repeat missed, desperate h...
2    big fan stephen king's work, film made even gr...
3    watched horrid thing tv. needless say one movi...
4    truly enjoyed film. acting terrific plot. jeff...
Name: SentimentText, dtype: object

In [None]:
# url and html tags removal?
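
If the reviews do contain markup, one way to fill in this step is with two regular expressions. This is a rough sketch and not a full HTML parser: the patterns `TAG_RE` and `URL_RE` and the helper `strip_markup` are assumptions made for this example, and the sample string is invented.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                    # anything between < and >
URL_RE = re.compile(r"https?://\S+|www\.\S+")      # http(s) links and bare www. links


def strip_markup(text):
    """Replace HTML tags and URLs with spaces, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


sample = "Great film!<br /><br />See http://example.com/review for more."
cleaned = strip_markup(sample)
print(cleaned)
```

Applied to the DataFrame this would be `data['SentimentText'].apply(strip_markup)`. It has to run before punctuation removal, because stripping `<`, `>` and `://` first would leave tag names and URL fragments behind as ordinary words.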

In [19]:
# Remove punctuation
data['SentimentText'] = data['SentimentText'].str.replace(r'[^\w\s]', '', regex=True)
data['SentimentText'].head()

0    first think another disney movie might good it...
1    put aside dr house repeat missed desperate hou...
2    big fan stephen kings work film made even grea...
3    watched horrid thing tv needless say one movie...
4    truly enjoyed film acting terrific plot jeff c...
Name: SentimentText, dtype: object
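
The pattern `[^\w\s]` deletes every character that is neither a word character nor whitespace. As the output shows, "king's" becomes "kings"; hyphenated words are merged in the same way, while digits and underscores are kept. The sketch below makes this visible on an invented sample and compares it with replacing punctuation by a space instead, which is an alternative and not what the cell above does.

```python
import re

PUNCT_RE = re.compile(r"[^\w\s]")

sample = "stephen king's work, a must-see film_2 (really)!"

# Deleting punctuation glues the surrounding characters together.
no_punct = PUNCT_RE.sub("", sample)

# Replacing it with a space keeps the word parts separate.
spaced = " ".join(PUNCT_RE.sub(" ", sample).split())

print(no_punct)
print(spaced)
```

Which variant is preferable depends on whether tokens such as "mustsee" or a stray "s" are more harmful for the later word counts.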