### Dataset

> The dataset is taken from https://www.win.tue.nl/~mpechen/projects/smm/.

> It consists of 5331 positive and 5331 negative movie reviews in Turkish.

### Load Dataset

In [1]:
!curl https://www.win.tue.nl/~mpechen/projects/smm/Turkish_Movie_Sentiment.zip -o ./Turkish_Movie_Sentiment.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  520k  100  520k    0     0   526k      0 --:--:-- --:--:-- --:--:--  526k


In [2]:
!unzip /content/Turkish_Movie_Sentiment.zip -d /content

Archive:  /content/Turkish_Movie_Sentiment.zip
  inflating: /content/tr_polarity.neg  
  inflating: /content/tr_polarity.pos  


> Each line in the review files is one review.

In [3]:
# Positive Reviews
with open("./tr_polarity.pos", "rb") as f:
  review_pos_list = f.read().decode("iso-8859-9").replace("\r", "").split("\n")

# Negative Reviews
with open("./tr_polarity.neg", "rb") as f:
  review_neg_list = f.read().decode("iso-8859-9").replace("\r", "").split("\n")
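
> The totals printed by the next cell (5332) are one higher than the 5331 reviews per class stated above. One plausible cause, assumed here rather than verified against the files, is a trailing newline: `str.split("\n")` keeps the empty string that follows the last separator. A minimal sketch on a made-up two-line string (the sample text is hypothetical, not taken from the dataset):

```python
# Hypothetical file content: two reviews, Windows line endings, trailing newline.
raw = "harika bir film\r\nçok kötü\r\n"

# Approach used in the cell above: whatever follows the last "\n" becomes an item,
# so a trailing newline yields one extra empty string.
with_split = raw.replace("\r", "").split("\n")

# str.splitlines() understands "\r\n" and emits no trailing empty item.
with_splitlines = raw.splitlines()

print(with_split)       # ['harika bir film', 'çok kötü', '']
print(with_splitlines)  # ['harika bir film', 'çok kötü']
```

> Note that `splitlines()` also breaks on a few less common line separators, so it is not a drop-in replacement for every file. The notebook filters empty strings when it builds the DataFrame, so the extra item does no harm there.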

In [4]:
print(f"First 5 positive reviews of total {len(review_pos_list)}:\n ", *review_pos_list[:5], sep="\n")
print(f"\n\nFirst 5 negative reviews of total {len(review_neg_list)}:\n ", *review_neg_list[:5], sep="\n")

First 5 positive reviews of total 5332:
 
gerçekten harika bir yapim birçok kez izledim gene izlerim özgürlük askini ve ingilizlerin ne kadar vahset olduklarini gözler önüne seren bir film ve tabi ki ask.... 
her izledigimde hayranlik duydugum gerçek klasik diyebilecegimiz filmlerden . içinde teknik hatalar barindirsa bile sinema olgusunun en üst noktalarindan.. 
gerçekten tarihi savas filmleri arasinda tartismasiz en iyisi , 12 yil boyunca acaba ikincisi çekirimi diye bekledigim bir film ,belki william wallace babasinin ölümünden sonra amcasi yanina almisti onu yetistirmisti belki bunu anlatan mükkemmel bir filim olablilr=). 
aldigi ödülleri sonuna dek hak eden muhtesem bir basyapit . 
özgürlük denilince aklima gelen ilk film.bir basyapit.. 


First 5 negative reviews of total 5332:
 
giseye oynayan bir film.mel gibson'in oyunculugu yine çok kötü.film bastan sona duygu sömürüsü ama anlayan nerde!. 
bircok yonden sahip olduklari zayifliklari populerligi iyi kullanmasiyla gidermis zayif 

In [5]:
import pandas as pd

In [12]:
# Convert reviews to a pandas DataFrame (label 1 = positive, 0 = negative), skipping empty lines
dataset = pd.DataFrame(
    [[1, prev] for prev in review_pos_list if prev]
    + [[0, nrev] for nrev in review_neg_list if nrev],
    columns=["label", "review"],
)
dataset

| | label | review |
|---|---|---|
| 0 | 1 | gerçekten harika bir yapim birçok kez izledim ... |
| 1 | 1 | her izledigimde hayranlik duydugum gerçek klas... |
| 2 | 1 | gerçekten tarihi savas filmleri arasinda tarti... |
| 3 | 1 | aldigi ödülleri sonuna dek hak eden muhtesem b... |
| 4 | 1 | özgürlük denilince aklima gelen ilk film.bir b... |
| ... | ... | ... |
| 10656 | 0 | yarisina bile gelmeden sikilip biraktim,murat ... |
| 10657 | 0 | rezalet bir senaryo rezalet oyunculuklar(tuba ... |
| 10658 | 0 | nerden bulmuslar böyle yönetmeni oyuncuyu bast... |
| 10659 | 0 | konu:bilindik senaryo:basit kurgu:çakma geriye... |


### Tokenization, Punctuation, Stop Words, Case Folding

> Tokenizing the reviews, removing punctuation and stop words, and applying case folding.

In [14]:
# NLTK provides the tokenizer and the Turkish stop-word list
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [15]:
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = stopwords.words("turkish")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
print(stop_words)

['acaba', 'ama', 'aslında', 'az', 'bazı', 'belki', 'biri', 'birkaç', 'birşey', 'biz', 'bu', 'çok', 'çünkü', 'da', 'daha', 'de', 'defa', 'diye', 'eğer', 'en', 'gibi', 'hem', 'hep', 'hepsi', 'her', 'hiç', 'için', 'ile', 'ise', 'kez', 'ki', 'kim', 'mı', 'mu', 'mü', 'nasıl', 'ne', 'neden', 'nerde', 'nerede', 'nereye', 'niçin', 'niye', 'o', 'sanki', 'şey', 'siz', 'şu', 'tüm', 've', 'veya', 'ya', 'yani']


In [17]:
from nltk.tokenize import word_tokenize
import re
# Tokenize, keep alphabetic tokens only (drops punctuation and numbers), lowercase, remove stop words
for idx, row in dataset.iterrows():
    # Pad periods with spaces so glued words such as "film.bir" are split apart
    text = re.sub(r"\.", " . ", row["review"])
    tokens = word_tokenize(text)
    dataset.at[idx, "review"] = [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]
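
> To see what this preprocessing does to a single review without downloading any NLTK data, here is a rough stand-in. It is an approximation, not the notebook's method: it replaces `word_tokenize` with a regular expression, so tokens that mix letters with apostrophes or digits are split rather than dropped. The helper name `preprocess` and the three-word stop list are hypothetical.

```python
import re

def preprocess(text, stop_words):
    """Lowercase the text, extract alphabetic tokens, and drop stop words."""
    # [^\W\d_]+ matches runs of Unicode letters only, so digits and punctuation
    # (including a period glued between two words) act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Hypothetical mini stop list for the demo; the notebook uses NLTK's Turkish list.
demo_stop_words = {"çok", "ve", "en"}

print(preprocess("Gerçekten çok iyi film.ve en iyisi!", demo_stop_words))
# ['gerçekten', 'iyi', 'film', 'iyisi']
```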

###Train-Test Splits

In [18]:
# Train Dataset
train_dataset = dataset.sample(frac=0.8, random_state=1)

In [19]:
train_dataset

| | label | review |
|---|---|---|
| 1777 | 1 | [iyi, film, noluyor, lan, demekten, filme, kon... |
| 6818 | 0 | [das, experimenti, yeniden, çekmisler, gerek, ... |
| 1305 | 1 | [yaw, herhangi, bir, kanal, filmi, yayinlasa, ... |
| 6106 | 0 | [filmi, dün, gece, izledim, bekledigim, çikmad... |
| 1185 | 1 | [tom, hanks, senaryo, iyi, film, zaten, asmisss] |
| ... | ... | ... |
| 7062 | 0 | [acayip, zorlama, bir, film, olmus, puani, pop... |
| 4690 | 1 | [dvd, sini, nerdeyse, hafta, önce, izledim, sa... |
| 6497 | 0 | [serinin, iyi, filmiydi, serinin, ilk, filmi, ... |
| 10558 | 0 | [seyrettim, foktan, filimdi, gidip, korsanini,... |


In [20]:
len(train_dataset)

8529

In [21]:
train_dataset["label"].value_counts()

1    4275
0    4254
Name: label, dtype: int64

In [22]:
# Test Dataset
test_dataset = dataset.drop(train_dataset.index)
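
> The split above samples 80% of the rows for training and keeps everything else for testing, so the two sets are disjoint and together cover the whole dataset. The same idea can be sketched with the standard library only. This is an analogue, not a reimplementation of `DataFrame.sample`: the function name `split_indices` is hypothetical, and the rows it selects will differ from pandas' selection even with the same seed.

```python
import random

def split_indices(n_rows, frac=0.8, seed=1):
    """Return (train, test) index lists that partition range(n_rows)."""
    rng = random.Random(seed)
    n_train = round(n_rows * frac)
    train = rng.sample(range(n_rows), n_train)  # sampled without replacement
    train_set = set(train)
    # Everything not sampled goes to the test set, mirroring drop(train.index).
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```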

In [23]:
test_dataset

| | label | review |
|---|---|---|
| 0 | 1 | [gerçekten, harika, bir, yapim, birçok, izledi... |
| 2 | 1 | [gerçekten, tarihi, savas, filmleri, arasinda,... |
| 15 | 1 | [önce, bir, çikarip, izledim, senedir, arada, ... |
| 18 | 1 | [iskoçyanin, evlatlari, baslayip, özgürlügü, a... |
| 20 | 1 | [kesinlikle, kaçirilmamasi, gerekne, filmlerin... |
| ... | ... | ... |
| 10639 | 0 | [hafta, sonu, izledim, basrol, oyuncusunun, ka... |
| 10644 | 0 | [yakinda, apoya, film, çekilirse, sasirmam] |
| 10647 | 0 | [kendileri, inanacakki, karsiyida, inandirarak... |
| 10650 | 0 | [oldu, olacak, oscara, aday, gösterelim, filmm... |


In [24]:
len(test_dataset)

2132

In [25]:
test_dataset["label"].value_counts()

0    1076
1    1056
Name: label, dtype: int64

### Build Vocabulary

In [26]:
total_tokens = 0
token_set = set()
for _, tokens in train_dataset["review"].items():
  total_tokens += len(tokens)
  for token in tokens:
    token_set.add(token)

# Add <UNKNOWN> to the token set to stand in for words that do not occur in the training corpus.
token_set.add("<UNKNOWN>")

In [27]:
print(f"Corpus size: {total_tokens}, Vocabulary size: {len(token_set)}\n")
print("First 10 tokens:")
for token in list(token_set)[:10]:
  print(token)

Corpus size: 145979, Vocabulary size: 25227

First 10 tokens:
aldigim
vice
korel
yabanciydi
uçlu
boguk
pismanin
amy
sayilamaz
senaryosunda


### Vectorization of Movie Reviews:

In [28]:
# term-to-index mapping
term2idx = {term: idx for idx, term in enumerate(token_set)}
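
> Because `token_set` is a Python `set` of strings, its iteration order can change between interpreter sessions (string hashing is randomized by default), so the indices produced by `enumerate(token_set)` are only stable within one session. That is enough for this notebook, but if the mapping ever needs to be saved and reused, sorting the vocabulary first is one simple option. A minimal sketch on a hypothetical four-term vocabulary:

```python
# Hypothetical toy vocabulary; the real one is built from the training reviews.
demo_tokens = {"film", "iyi", "kötü", "<UNKNOWN>"}

# Sorting fixes the order, so the same vocabulary always gets the same indices.
term2idx_demo = {term: idx for idx, term in enumerate(sorted(demo_tokens))}

print(term2idx_demo)
# {'<UNKNOWN>': 0, 'film': 1, 'iyi': 2, 'kötü': 3}
```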

In [29]:
print(f"Size: {len(term2idx)}\n\nTerms and Indexes:")
for key, idx in list(term2idx.items())[:10]:
  print(f"{key}: {idx}")

Size: 25227

Terms and Indexes:
aldigim: 0
vice: 1
korel: 2
yabanciydi: 3
uçlu: 4
boguk: 5
pismanin: 6
amy: 7
sayilamaz: 8
senaryosunda: 9


In [30]:
# Document vectors of Train Set
train_vector_df = None
vocab_size = len(token_set)
doc_vectors = []
for _, row in train_dataset.iterrows():
  vector = [0] * vocab_size # initial vector
  for token in row["review"]:
    vector[term2idx[token]] += 1
  doc_vectors.append([row["label"], vector])

train_vector_df = pd.DataFrame(doc_vectors, columns= ["label", "vector"])

### Model Training


In [31]:
import numpy as np

# Vectors for positive & negative sentiments
V = len(token_set)
pos_vector = np.zeros(V, dtype=np.int64)  # summed term counts over positive reviews
neg_vector = np.zeros(V, dtype=np.int64)  # summed term counts over negative reviews

for _, row in train_vector_df.iterrows():
  vector = np.array(row["vector"])
  if row["label"] == 0:
    neg_vector += vector
  else:
    pos_vector += vector
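
> "Training" here amounts to adding up the count vectors of each class, which gives one aggregate term-frequency vector per sentiment. A toy illustration in plain Python, with a hypothetical three-term vocabulary and a hypothetical helper name `sum_by_label`:

```python
def sum_by_label(labelled_vectors, size):
    """Add up the count vectors separately for label 0 and label 1."""
    totals = {0: [0] * size, 1: [0] * size}
    for label, vector in labelled_vectors:
        totals[label] = [a + b for a, b in zip(totals[label], vector)]
    return totals

# Hypothetical training data: (label, count vector over a 3-term vocabulary).
demo_docs = [(1, [2, 0, 1]), (1, [1, 1, 0]), (0, [0, 3, 1])]
demo_totals = sum_by_label(demo_docs, size=3)

print(demo_totals[1])  # [3, 1, 1]  positive sentiment vector
print(demo_totals[0])  # [0, 3, 1]  negative sentiment vector
```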


In [32]:
print(f"positive sentiment vector of size: {len(pos_vector)}\n", pos_vector[:100])
print(f"negative sentiment vector of size: {len(neg_vector)}\n", neg_vector[:100])

positive sentiment vector of size: 25227
 [ 6  0  0  0  1  0  0  3  0  2  0 81  1  1  0  0  1  1  1  3  1  6  0  0
  0 10  1  0  0  1  0  1  2  1  1  0  3  1  2  1  2  0  0  0  1  1  1  1
  1  1  1  2  2  0  0  1  1  0  0  4  0  0  2  0  3  0 30  0 17  1  1  1
  1  1  1  1  1  0  1  1  0  1  3  0  1  0  0  0  1  1  2  1  1  1  1  1
  1  0  1  0]
negative sentiment vector of size: 25227
 [ 1  1  1  1  0  1  1  0  1  1  1 36  0  0  2  1 48  0  0  2  0  7  1  1
  1  3  0  2  1  1  2  0  0  0  0  1  0  1  0  0  1  2  1  1  1  0  0  0
  0  0  0  5  0  2  1  0  0  1  1  0  1  1  0  2  4  1 15  1  2  0  2  1
  0  2  0  4  0  1  0  0  1  0  0  1  0  1  1  1  0  0  0  0  0  0  1  0
  1  2  0  1]


### Similarity Measure: Cosine

In [33]:
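def cosine_sim(vector1, vector2):
  dot_product = np.dot(vector1, vector2)  # dot product of the vectors
  magnitude1 = np.linalg.norm(vector1)    # Euclidean norm (length) of vector1
  magnitude2 = np.linalg.norm(vector2)    # Euclidean norm (length) of vector2

  # A review left with no tokens after preprocessing has zero length;
  # return 0 instead of dividing by zero (which would produce NaN).
  if magnitude1 == 0 or magnitude2 == 0:
    return 0.0

  return dot_product / (magnitude1 * magnitude2)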
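
> Cosine similarity is $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, so it depends only on the direction of the vectors, not on their length. That suits this task: a short review is compared with a class vector that sums thousands of reviews. A small worked example in plain Python (the name `cosine_sim_demo` and the toy vectors are illustrative only):

```python
import math

def cosine_sim_demo(v1, v2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm1 * norm2)

print(cosine_sim_demo([2, 0], [4, 0]))  # 1.0: same direction, different length
print(cosine_sim_demo([1, 0], [0, 3]))  # 0.0: no terms in common
print(cosine_sim_demo([1, 0], [1, 1]))  # about 0.7071 (1 / sqrt(2))
```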

### Testing and Results

In [34]:
# Document vectors of Test Set
test_vector_df = None
vocab_size = len(token_set)
test_doc_vectors = []
for _, row in test_dataset.iterrows():
  vector = [0] * vocab_size # initial vector
  for token in row["review"]:
    idx = term2idx.get(token, term2idx.get("<UNKNOWN>"))
    vector[idx] += 1
  test_doc_vectors.append([row["label"], vector])

test_vector_df = pd.DataFrame(test_doc_vectors, columns= ["label", "vector"])
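
> The only difference from the training vectorization is the fallback: a token missing from `term2idx` is counted under `<UNKNOWN>`. A toy sketch with a hypothetical four-term mapping and a hypothetical helper name `vectorize`:

```python
def vectorize(tokens, term2idx, unknown="<UNKNOWN>"):
    """Turn a token list into a count vector; unseen tokens go to the unknown slot."""
    vector = [0] * len(term2idx)
    for token in tokens:
        vector[term2idx.get(token, term2idx[unknown])] += 1
    return vector

# Hypothetical mapping; "muhtesem" is deliberately absent from it.
demo_term2idx = {"<UNKNOWN>": 0, "film": 1, "iyi": 2, "kötü": 3}

print(vectorize(["iyi", "film", "muhtesem", "film"], demo_term2idx))
# [1, 2, 1, 0]
```

> Since `<UNKNOWN>` never appears in a training review, its entry in both sentiment vectors is 0. Unknown tokens therefore add nothing to either dot product; they only lengthen the test vector, and that length is the same factor in both similarities, so it does not change which of the two is larger.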

In [38]:
test_labels, test_vectors  = test_vector_df.iloc[:,0], test_vector_df.iloc[:,1]
test_labels.value_counts()

0    1076
1    1056
Name: label, dtype: int64

In [36]:
# make all predictions
predicted_labels = []
for vector in test_vectors:
  pos_sim = cosine_sim(np.array(vector), pos_vector)
  neg_sim = cosine_sim(np.array(vector), neg_vector)
  predicted_labels.append((1 if pos_sim > neg_sim else 0))
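
> The decision rule is a nearest-class comparison: a review is labelled positive only if it is strictly more similar to the positive vector, so ties (including a review with no tokens left) fall to negative. A self-contained toy version, with made-up class vectors and hypothetical helper names:

```python
import math

def cosine(v1, v2):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def predict(vector, pos_vec, neg_vec):
    """Return 1 (positive) or 0 (negative); ties go to negative."""
    return 1 if cosine(vector, pos_vec) > cosine(vector, neg_vec) else 0

# Made-up class vectors over a 3-term vocabulary.
pos_demo = [5, 1, 0]
neg_demo = [1, 4, 2]

print(predict([2, 0, 0], pos_demo, neg_demo))  # 1
print(predict([0, 1, 1], pos_demo, neg_demo))  # 0
print(predict([0, 0, 0], pos_demo, neg_demo))  # 0 (tie at 0.0)
```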



In [39]:
# Check predictions
len(predicted_labels)

2132

In [40]:
# Compare each prediction with its true label and fill in the confusion counts
total_predictions = len(predicted_labels)
correct_predictions = 0
pos_pos = 0 # predicted pos, it is pos   (correct prediction)
neg_neg = 0 # predicted neg, it is neg   (correct prediction)
pos_neg = 0 # predicted pos, it is neg   (incorrect prediction)
neg_pos = 0 # predicted neg, it is pos   (incorrect prediction)
for idx, prediction in enumerate(predicted_labels):
  label = test_labels[idx]

  if prediction and label:
    pos_pos += 1
  elif prediction and not label:
    pos_neg += 1
  elif not prediction and not label:
    neg_neg += 1
  else:
    neg_pos += 1

  correct_predictions += 1 if prediction == label else 0

print(f"Success rate: {correct_predictions / total_predictions:.4g}")
print(f"\nPrediction\tLabel")
print(f"positive\tpositive\t{pos_pos / total_predictions:.4g}")
print(f"negative\tnegative\t{neg_neg / total_predictions:.4g}")
print(f"positive\tnegative\t{pos_neg / total_predictions:.4g}")
print(f"negative\tpositive\t{neg_pos / total_predictions:.4g}")

Success rate: 0.8007

Prediction	Label
positive	positive	0.4029
negative	negative	0.3977
positive	negative	0.1069
negative	positive	0.0924
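
> The four counters above form a confusion matrix, from which precision, recall, and F1 for the positive class follow directly. A minimal sketch, assuming the mapping `tp = pos_pos`, `fp = pos_neg`, `fn = neg_pos`, `tn = neg_neg`; the function name and the demo counts are hypothetical and are not the notebook's results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 (positive class) from confusion counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up counts for illustration only.
demo_metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.4g}")
```

> With the notebook's variables this would be called as `classification_metrics(pos_pos, pos_neg, neg_pos, neg_neg)`.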
