Text Mining
Dataset: sentiment_movies.csv contains movie reviews from an industry portal, together with a variable indicating whether each review is positive or negative.

- Identify which words are most characteristic of positive reviews and which of negative ones.
- Remember to filter out words that are noise, and to select the categories of words that are relevant to the task.
- Using the machine learning methods you know, check whether there are any clusters of co-occurring words.

- https://www.kaggle.com/oumaimahourrane/sentiment-analysis-ml-models-comparison
- https://www.kaggle.com/oumaimahourrane/imdb-reviews/kernels
- https://www.kaggle.com/sergiadi/iet-x-mlda-workshop
- https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/

In [27]:
import pandas as pd
import re
import nltk

from nltk.corpus import stopwords
from textblob import TextBlob, Word

nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rafal_000\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rafal_000\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

# 1. Read data

In [3]:
# 2 columns, comma-separated values
data = pd.read_csv('sentiment_movies.csv', encoding='latin-1')
data.head()

Unnamed: 0,SentimentText,Sentiment
0,"first think another Disney movie, might good, ...",1
1,"Put aside Dr. House repeat missed, Desperate H...",0
2,"big fan Stephen King's work, film made even gr...",1
3,watched horrid thing TV. Needless say one movi...,0
4,truly enjoyed film. acting terrific plot. Jeff...,1


In [4]:
data.describe()

Unnamed: 0,Sentiment
count,25000.0
mean,0.5
std,0.50001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


# 2. Data description

## 2.1 Number of words

In [5]:
# Hypothesis: negative reviews contain fewer words than positive ones
data['word_count'] = data['SentimentText'].apply(lambda x: len(str(x).split(" ")))
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count
0,"first think another Disney movie, might good, ...",1,52
1,"Put aside Dr. House repeat missed, Desperate H...",0,86
2,"big fan Stephen King's work, film made even gr...",1,193
3,watched horrid thing TV. Needless say one movi...,0,63
4,truly enjoyed film. acting terrific plot. Jeff...,1,65


## 2.2 Number of characters

In [6]:
# Character count per review; spaces are included
data['char_count'] = data['SentimentText'].str.len()
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count
0,"first think another Disney movie, might good, ...",1,52,314
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565
2,"big fan Stephen King's work, film made even gr...",1,193,1268
3,watched horrid thing TV. Needless say one movi...,0,63,414
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477


## 2.3 Average word length

In [7]:
def avg_word(sentence):
    """Mean word length in characters; 0.0 for an empty review."""
    words = sentence.split()
    return sum(len(word) for word in words) / len(words) if words else 0.0

data['avg_word'] = data['SentimentText'].apply(avg_word)
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word
0,"first think another Disney movie, might good, ...",1,52,314,5.057692
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846


## 2.4 Number of stopwords

In [8]:
stop = set(stopwords.words('english'))  # set membership is faster than scanning a list
data['stopwords'] = data['SentimentText'].apply(lambda text: sum(word in stop for word in text.split()))
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word,stopwords
0,"first think another Disney movie, might good, ...",1,52,314,5.057692,1
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395,2
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513,3
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302,1
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846,2


## 2.5 Number of numerics

In [9]:
data['numerics'] = data['SentimentText'].apply(lambda text: sum(token.isdigit() for token in text.split()))
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word,stopwords,numerics
0,"first think another Disney movie, might good, ...",1,52,314,5.057692,1,2
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395,2,4
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513,3,1
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302,1,0
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846,2,0


## 2.6 Number of uppercase words

In [10]:
# Anger or rage could be expressed by writing in UPPERCASE
data['upper'] = data['SentimentText'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
data.head()

Unnamed: 0,SentimentText,Sentiment,word_count,char_count,avg_word,stopwords,numerics,upper
0,"first think another Disney movie, might good, ...",1,52,314,5.057692,1,2,0
1,"Put aside Dr. House repeat missed, Desperate H...",0,86,565,5.581395,2,4,1
2,"big fan Stephen King's work, film made even gr...",1,193,1268,5.57513,3,1,0
3,watched horrid thing TV. Needless say one movi...,0,63,414,5.587302,1,0,2
4,truly enjoyed film. acting terrific plot. Jeff...,1,65,477,6.353846,2,0,0
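
The hypotheses noted in the comments above (negative reviews being shorter, upper-case words signalling anger) can be checked by averaging each descriptive feature per sentiment class. A minimal sketch on a tiny made-up frame: the `toy` rows are illustrative assumptions, and on the real data the same `groupby` call would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy rows standing in for `data`; the column names follow the notebook.
toy = pd.DataFrame({
    'Sentiment':  [1, 1, 0, 0],
    'word_count': [50, 70, 60, 100],
    'upper':      [0, 2, 1, 3],
})

# Mean of each feature per class (0 = negative, 1 = positive).
summary = toy.groupby('Sentiment')[['word_count', 'upper']].mean()
print(summary)
```

A difference in means is only a first indication; whether it is meaningful on the full dataset would still need a proper test.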


# 3. Preprocessing

## 3.1 Lower case

In [11]:
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x.lower() for x in x.split()))
data['SentimentText'].head()

0    first think another disney movie, might good, ...
1    put aside dr. house repeat missed, desperate h...
2    big fan stephen king's work, film made even gr...
3    watched horrid thing tv. needless say one movi...
4    truly enjoyed film. acting terrific plot. jeff...
Name: SentimentText, dtype: object

## 3.2 Remove URLs

In [12]:
# Strip http(s) URLs and @mentions (the pattern covers both)
data['SentimentText'] = data['SentimentText'].apply(lambda x: re.sub(r'(?:@|https?://)\S+', '', x))
data['SentimentText'].head()

0    first think another disney movie, might good, ...
1    put aside dr. house repeat missed, desperate h...
2    big fan stephen king's work, film made even gr...
3    watched horrid thing tv. needless say one movi...
4    truly enjoyed film. acting terrific plot. jeff...
Name: SentimentText, dtype: object

## 3.3 Remove html tags

In [13]:
data['SentimentText'] = data['SentimentText'].apply(lambda x: re.sub(r'<[^>]+>', '', x))
data['SentimentText'].head()

0    first think another disney movie, might good, ...
1    put aside dr. house repeat missed, desperate h...
2    big fan stephen king's work, film made even gr...
3    watched horrid thing tv. needless say one movi...
4    truly enjoyed film. acting terrific plot. jeff...
Name: SentimentText, dtype: object
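
The two regular expressions from sections 3.2 and 3.3 can be tried on a single made-up string to see what they remove. In this sketch `strip_markup` and its `tag_replacement` parameter are hypothetical names. Replacing a tag with a space instead of the empty string keeps neighbouring words apart, which may explain glued tokens such as `hotelsin` in the rare-word list later in the notebook.

```python
import re

URL_OR_MENTION = re.compile(r'(?:\@|https?\://)\S+')  # pattern from section 3.2
HTML_TAG = re.compile(r'<[^>]+>')                      # pattern from section 3.3

def strip_markup(text, tag_replacement=' '):
    """Drop URLs and @mentions, replace HTML tags, then collapse whitespace."""
    text = URL_OR_MENTION.sub('', text)
    text = HTML_TAG.sub(tag_replacement, text)
    return ' '.join(text.split())

sample = "nice hotels.<br /><br />in short, see http://example.com/x by @critic"
print(strip_markup(sample))                      # tags become spaces
print(strip_markup(sample, tag_replacement=''))  # tags vanish, neighbours get glued
```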

## 3.4 Remove punctuation

In [14]:
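# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
data['SentimentText'] = data['SentimentText'].str.replace(r'[^\w\s]', '', regex=True)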
data['SentimentText'].head()

0    first think another disney movie might good it...
1    put aside dr house repeat missed desperate hou...
2    big fan stephen kings work film made even grea...
3    watched horrid thing tv needless say one movie...
4    truly enjoyed film acting terrific plot jeff c...
Name: SentimentText, dtype: object

## 3.5 Remove stop words

In [15]:
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
data['SentimentText'].head()

0    first think another disney movie might good ki...
1    put aside dr house repeat missed desperate hou...
2    big fan stephen kings work film made even grea...
3    watched horrid thing tv needless say one movie...
4    truly enjoyed film acting terrific plot jeff c...
Name: SentimentText, dtype: object
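
One side effect of the ordering above: punctuation is stripped in 3.4 before stop words are filtered in 3.5, so a contraction such as `don't` has already become `dont` and may no longer match the stop list (forms such as `dont` do survive in later outputs). A small sketch of the effect, using a hypothetical three-word stop set instead of the NLTK list:

```python
import re

STOP = {"don't", "i", "the"}  # hypothetical mini stop list, not the NLTK one

def drop_punct(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text)

def drop_stop(text):
    """Remove tokens that appear in the stop set."""
    return ' '.join(w for w in text.split() if w not in STOP)

review = "i don't like the ending"
punct_first = drop_stop(drop_punct(review))  # order used in the notebook
stop_first = drop_punct(drop_stop(review))   # stop words filtered first
print(punct_first)
print(stop_first)
```

Filtering stop words first, or stripping apostrophes from the stop list as well, are two possible ways around this.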

## 3.6 Common words

### 3.6.1 Count common words

In [16]:
freq_common = pd.Series(' '.join(data['SentimentText']).split()).value_counts()[:10]
freq_common

movie     41797
film      37455
one       25147
like      19558
good      14508
even      12325
would     12124
time      11781
really    11636
story     11425
dtype: int64
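
The same counting can be split by label as a first look at which words lean positive or negative. A minimal sketch on made-up reviews: `toy_reviews` and the simple count difference used as a score are assumptions for illustration, not the method of this notebook.

```python
from collections import Counter

# Hypothetical (text, sentiment) pairs; 1 = positive, 0 = negative.
toy_reviews = [
    ("superb acting superb plot", 1),
    ("terrific superb pacing", 1),
    ("horrid acting dull plot", 0),
    ("dull script", 0),
]

counts = {0: Counter(), 1: Counter()}
for text, label in toy_reviews:
    # Count each word once per review (document frequency).
    counts[label].update(set(text.split()))

vocab = set(counts[0]) | set(counts[1])
# Score > 0: the word occurs in more positive reviews; score < 0: in more negative ones.
score = {w: counts[1][w] - counts[0][w] for w in vocab}
most_positive = max(score, key=score.get)
most_negative = min(score, key=score.get)
print(most_positive, most_negative)
```

With balanced classes, as in this dataset, a raw difference is a usable starting point; with unbalanced classes the counts would need to be normalised by class size first.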

### 3.6.2 Remove common words

In [17]:
freq_common = list(freq_common.index)
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join(x for x in x.split() if x not in freq_common))
data['SentimentText'].head()

0    first think another disney might kids watch ca...
1    put aside dr house repeat missed desperate hou...
2    big fan stephen kings work made greater fan ki...
3    watched horrid thing tv needless say movies wa...
4    truly enjoyed acting terrific plot jeff combs ...
Name: SentimentText, dtype: object

## 3.7 Rare words

In [18]:
freq_rare = pd.Series(' '.join(data['SentimentText']).split()).value_counts()[-100000:]
freq_rare

kitchener     2
laurenti      2
infraction    2
defoes        2
beefs         2
             ..
hotelsin      1
rolexand      1
commonthis    1
whimpered     1
handticks     1
Length: 100000, dtype: int64
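
The cell above lists the rare words but does not drop them. If removal is wanted, one possible sketch filters every token seen fewer than a threshold number of times. The toy `texts` list and the `min_count` name are assumptions. Keeping the rare tokens in a `set` makes the membership test cheap, which matters for a list as long as the 100,000 entries above.

```python
from collections import Counter

texts = ["plot twist plot", "weak plot rolexand", "twist ending"]  # hypothetical reviews
min_count = 2  # assumed threshold: keep words seen at least twice

# Corpus-wide token frequencies.
freq = Counter(word for text in texts for word in text.split())
rare = {word for word, n in freq.items() if n < min_count}

# Rebuild each review without the rare tokens.
cleaned = [' '.join(w for w in text.split() if w not in rare) for text in texts]
print(cleaned)
```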

## 3.8 Correct spelling

In [None]:
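# Slow on 25,000 reviews; the result must be assigned back, otherwise the correction is discarded
data['SentimentText'] = data['SentimentText'].apply(lambda x: str(TextBlob(x).correct()))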
data['SentimentText'].head()

## 3.9 Lemmatization - reduce each word to its dictionary form (lemma)

In [None]:
data['SentimentText'] = data['SentimentText'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
data['SentimentText'].head()

# 4. Feature extraction

## 4.1 N-grams
N-grams capture local language structure, for example which letter or word is likely to follow a given one.

In [28]:
TextBlob(data['SentimentText'][0]).ngrams(2)

[WordList(['first', 'think']),
 WordList(['think', 'another']),
 WordList(['another', 'disney']),
 WordList(['disney', 'might']),
 WordList(['might', 'kids']),
 WordList(['kids', 'watch']),
 WordList(['watch', 'cant']),
 WordList(['cant', 'help']),
 WordList(['help', 'enjoy']),
 WordList(['enjoy', 'ages']),
 WordList(['ages', 'love']),
 WordList(['love', 'first']),
 WordList(['first', 'saw']),
 WordList(['saw', '10']),
 WordList(['10', '8']),
 WordList(['8', 'years']),
 WordList(['years', 'later']),
 WordList(['later', 'still']),
 WordList(['still', 'love']),
 WordList(['love', 'danny']),
 WordList(['danny', 'glover']),
 WordList(['glover', 'superb']),
 WordList(['superb', 'could']),
 WordList(['could', 'play']),
 WordList(['play', 'part']),
 WordList(['part', 'better']),
 WordList(['better', 'christopher']),
 WordList(['christopher', 'lloyd']),
 WordList(['lloyd', 'hilarious']),
 WordList(['hilarious', 'perfect']),
 WordList(['perfect', 'part']),
 WordList(['part', 'tony']),
 WordList
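
To see the "what is likely to follow" idea in numbers, bigram counts can be turned into next-word frequencies. A minimal sketch using only the standard library on a made-up token list (not `TextBlob`); `follow_counts` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def follow_counts(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following

tokens = "first think another disney first saw first saw".split()
following = follow_counts(tokens)
print(following['first'].most_common(1))
```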

## 4.2 Term frequency - to be refined
TF = (Number of times term T appears in the particular row) / (number of terms in that row)
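
As a worked instance of the formula, a minimal sketch on a made-up row of text. `term_frequency` is a hypothetical helper, and it divides by the row length, which the raw counts in the next cell do not.

```python
from collections import Counter

def term_frequency(text):
    """Return {term: count / number of terms} for one row of text."""
    terms = text.split()
    if not terms:
        return {}  # an empty row has no terms to score
    counts = Counter(terms)
    return {term: n / len(terms) for term, n in counts.items()}

tf = term_frequency("dinosaurs dont say dinosaurs")
print(tf)
```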

In [40]:
# Raw term counts for the review at position 1; divide by counts.sum() to get the TF defined above
counts = data['SentimentText'][1:2].apply(lambda x: pd.Series(x.split()).value_counts()).sum(axis=0)
tf1 = counts.reset_index().set_axis(['words', 'tf'], axis=1)  # reset_index() has to be called, not just referenced
counts

dont         2
say          2
id           2
dinosaurs    2
thought      2
            ..
missed       1
actors       1
device       1
puts         1
office       1
Length: 62, dtype: int64

## 4.3 Inverse Document Frequency - to be refined
IDF = log(N/n), where N is the total number of rows and n is the number of rows in which the word is present.

In [38]:
import numpy as np  # needed for np.log; it was missing from the imports

# n counts the rows that contain the word as a whole token, not as a substring
for i, word in enumerate(tf1['words']):
    n = data['SentimentText'].str.contains(r'\b' + re.escape(word) + r'\b').sum()
    tf1.loc[i, 'idf'] = np.log(data.shape[0] / n)

tf1
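
For reference, the IDF formula can be exercised on a toy corpus with the standard library alone. The three `docs` are made up, and a word counts as present only when it appears as a whole token.

```python
import math

docs = ["great plot", "dull plot", "great acting great cast"]  # hypothetical rows
N = len(docs)

def idf(word):
    """log(N / n), where n is the number of rows containing the word.

    Assumes the word occurs in at least one row; otherwise n is 0.
    """
    n = sum(1 for doc in docs if word in doc.split())
    return math.log(N / n)

print(idf("plot"), idf("cast"))
```

A word found in most rows gets a value close to zero, while a word confined to a single row gets the largest value. Multiplying a row's TF by this value gives the usual TF-IDF weight.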