### Dataset

> Dataset: [Turkish_Movie_Sentiment](https://www.win.tue.nl/~mpechen/projects/smm/)

> It consists of 5331 positive and 5331 negative movie reviews in Turkish.

### Load Dataset

> Download the dataset, unzip it, and build a dataframe with the columns **[label, review]**.

> **Either download the dataset from its source with the curl command *(1)*, or upload it yourself *(2)*.**

In [1]:
# Option 1: download the archive with curl
!curl https://www.win.tue.nl/~mpechen/projects/smm/Turkish_Movie_Sentiment.zip -o ./Turkish_Movie_Sentiment.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  520k  100  520k    0     0   506k      0  0:00:01  0:00:01 --:--:--  507k


In [2]:
# Option 2: upload the archive manually (Google Colab only)
#from google.colab import files
#uploaded = files.upload()

In [3]:
!unzip /content/Turkish_Movie_Sentiment.zip -d /content > /dev/null

> **Each line is a review.**

In [4]:
# Positive Reviews
with open("./tr_polarity.pos", "rb") as f:
  review_pos_list = f.read().decode("iso-8859-9").replace("\r", "").split("\n")

# Negative Reviews
with open("./tr_polarity.neg", "rb") as f:
  review_neg_list = f.read().decode("iso-8859-9").replace("\r", "").split("\n")
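
Splitting on `"\n"` leaves an empty string at the end of each list when the file ends with a newline, which is presumably why the counts printed in the next cell are one higher than the 5331 reviews per class. A minimal sketch of the same decoding step, assuming a hypothetical helper named `decode_reviews` (not part of the notebook), drops blank lines up front:

```python
def decode_reviews(raw, encoding="iso-8859-9"):
    """Decode raw file bytes and return one stripped review per non-empty line."""
    text = raw.decode(encoding)
    # splitlines() treats "\n", "\r\n" and "\r" (plus a few rarer Unicode
    # line boundaries) as line breaks, so no manual "\r" replacement is needed
    return [line.strip() for line in text.splitlines() if line.strip()]


# Toy input standing in for the contents of tr_polarity.pos
sample = "güzel bir film\r\nçok kötü\r\n".encode("iso-8859-9")
print(decode_reviews(sample))
```

Unlike the cell above, this sketch also strips surrounding whitespace from each review. That does not affect the later tokenization.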

In [5]:
print(f"First 5 of total {len(review_pos_list)} positive reviews:\n ", *review_pos_list[:5], sep="\n")
print(f"\n\nFirst 5 of total {len(review_neg_list)} negative reviews:\n ", *review_neg_list[:5], sep="\n")

First 5 of total 5332 positive reviews:
 
gerçekten harika bir yapim birçok kez izledim gene izlerim özgürlük askini ve ingilizlerin ne kadar vahset olduklarini gözler önüne seren bir film ve tabi ki ask.... 
her izledigimde hayranlik duydugum gerçek klasik diyebilecegimiz filmlerden . içinde teknik hatalar barindirsa bile sinema olgusunun en üst noktalarindan.. 
gerçekten tarihi savas filmleri arasinda tartismasiz en iyisi , 12 yil boyunca acaba ikincisi çekirimi diye bekledigim bir film ,belki william wallace babasinin ölümünden sonra amcasi yanina almisti onu yetistirmisti belki bunu anlatan mükkemmel bir filim olablilr=). 
aldigi ödülleri sonuna dek hak eden muhtesem bir basyapit . 
özgürlük denilince aklima gelen ilk film.bir basyapit.. 


First 5 of total 5332 negative reviews:
 
giseye oynayan bir film.mel gibson'in oyunculugu yine çok kötü.film bastan sona duygu sömürüsü ama anlayan nerde!. 
bircok yonden sahip olduklari zayifliklari populerligi iyi kullanmasiyla gidermis zayif

In [6]:
import pandas as pd

In [7]:
# Convert reviews to pandas dataframe
dataset = pd.DataFrame([[1, prev] for prev in review_pos_list if prev] + [[0, nrev] for nrev in review_neg_list if nrev] , columns=["label", "review"])
dataset

Unnamed: 0,label,review
0,1,gerçekten harika bir yapim birçok kez izledim ...
1,1,her izledigimde hayranlik duydugum gerçek klas...
2,1,gerçekten tarihi savas filmleri arasinda tarti...
3,1,aldigi ödülleri sonuna dek hak eden muhtesem b...
4,1,özgürlük denilince aklima gelen ilk film.bir b...
...,...,...
10656,0,"yarisina bile gelmeden sikilip biraktim,murat ..."
10657,0,rezalet bir senaryo rezalet oyunculuklar(tuba ...
10658,0,nerden bulmuslar böyle yönetmeni oyuncuyu bast...
10659,0,konu:bilindik senaryo:basit kurgu:çakma geriye...


### Tokenization, Punctuation, Stop Words, Case Folding

> Tokenize, remove punctuation and stop words, and apply case folding.

In [8]:
# nltk library for tokenizing, punctuations and stopwords
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [9]:
from nltk.corpus import stopwords;
nltk.download("stopwords");
stop_words = stopwords.words("turkish");

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
print(stop_words)

['acaba', 'ama', 'aslında', 'az', 'bazı', 'belki', 'biri', 'birkaç', 'birşey', 'biz', 'bu', 'çok', 'çünkü', 'da', 'daha', 'de', 'defa', 'diye', 'eğer', 'en', 'gibi', 'hem', 'hep', 'hepsi', 'her', 'hiç', 'için', 'ile', 'ise', 'kez', 'ki', 'kim', 'mı', 'mu', 'mü', 'nasıl', 'ne', 'neden', 'nerde', 'nerede', 'nereye', 'niçin', 'niye', 'o', 'sanki', 'şey', 'siz', 'şu', 'tüm', 've', 'veya', 'ya', 'yani']


In [11]:
from nltk.tokenize import word_tokenize
import re
# Tokenize, remove punctuation and stop words, and apply case folding
for idx, row in dataset.iterrows():
    # Pad periods with spaces so words glued together (e.g. "film.bir") become separate tokens
    spaced_review = re.sub(r"\.", " . ", row["review"])
    dataset.at[idx, "review"] = [token.lower() for token in word_tokenize(spaced_review)
                                 if token.isalpha() and (token.lower() not in stop_words)]
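
The cell above pads every period with spaces before tokenizing, presumably because reviews often omit the space after a period (`film.bir`), and a token that contains a period fails the `isalpha()` filter. The filtering logic can be tried on a single string with the sketch below. It swaps NLTK's `word_tokenize` for a simple regular-expression tokenizer, which is an assumption made to keep the example dependency-free, and the name `preprocess` is hypothetical.

```python
import re


def preprocess(text, stop_words):
    """Tokenize, drop punctuation/numbers and stop words, and lowercase the rest."""
    # Stand-in tokenizer (assumption): runs of word characters, or single
    # punctuation marks. This is not how NLTK's word_tokenize works internally.
    raw_tokens = re.findall(r"\w+|[^\w\s]", text)
    return [
        token.lower()
        for token in raw_tokens
        if token.isalpha() and token.lower() not in stop_words
    ]


print(preprocess("Harika bir film.çok iyi 10/10!", {"bir", "çok"}))
```

Only purely alphabetic tokens outside the stop word list survive, so the rating `10/10` and the punctuation are dropped.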

### Train-Test Splits

> Split the dataset into two parts: **train dataset** (80%) and **test dataset** (20%).

In [12]:
# Train Dataset
train_dataset = dataset.sample(frac=0.8, random_state=1)

In [13]:
train_dataset

Unnamed: 0,label,review
1777,1,"[iyi, film, noluyor, lan, demekten, filme, kon..."
6818,0,"[das, experimenti, yeniden, çekmisler, gerek, ..."
1305,1,"[yaw, herhangi, bir, kanal, filmi, yayinlasa, ..."
6106,0,"[filmi, dün, gece, izledim, bekledigim, çikmad..."
1185,1,"[tom, hanks, senaryo, iyi, film, zaten, asmisss]"
...,...,...
7062,0,"[acayip, zorlama, bir, film, olmus, puani, pop..."
4690,1,"[dvd, sini, nerdeyse, hafta, önce, izledim, sa..."
6497,0,"[serinin, iyi, filmiydi, serinin, ilk, filmi, ..."
10558,0,"[seyrettim, foktan, filimdi, gidip, korsanini,..."


In [14]:
len(train_dataset)

8529

In [15]:
train_dataset["label"].value_counts()

label
1    4275
0    4254
Name: count, dtype: int64

In [16]:
# Test Dataset
test_dataset = dataset.drop(train_dataset.index)

In [17]:
test_dataset

Unnamed: 0,label,review
0,1,"[gerçekten, harika, bir, yapim, birçok, izledi..."
2,1,"[gerçekten, tarihi, savas, filmleri, arasinda,..."
15,1,"[önce, bir, çikarip, izledim, senedir, arada, ..."
18,1,"[iskoçyanin, evlatlari, baslayip, özgürlügü, a..."
20,1,"[kesinlikle, kaçirilmamasi, gerekne, filmlerin..."
...,...,...
10639,0,"[hafta, sonu, izledim, basrol, oyuncusunun, ka..."
10644,0,"[yakinda, apoya, film, çekilirse, sasirmam]"
10647,0,"[kendileri, inanacakki, karsiyida, inandirarak..."
10650,0,"[oldu, olacak, oscara, aday, gösterelim, filmm..."


In [18]:
len(test_dataset)

2132
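
The split works because `sample` picks 80% of the row labels and `drop` keeps the rest, so the two parts are disjoint and together cover every row. The sketch below illustrates the same idea on plain indices with a seeded shuffle. It is only an illustration: `split_indices` is a hypothetical helper, and it will not select the same rows as pandas does with `random_state=1`.

```python
import random


def split_indices(n_rows, train_frac=0.8, seed=1):
    """Return (train, test) index lists that are disjoint and cover range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded shuffle for reproducibility
    n_train = round(n_rows * train_frac)   # number of training rows, rounded
    return indices[:n_train], sorted(indices[n_train:])


# 10661 is the total implied by the notebook's output (8529 + 2132)
train_idx, test_idx = split_indices(10661)
print(len(train_idx), len(test_idx))
```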

### Build Vocabulary

> Create a token set that holds every token in the train dataset plus a special token ***\<UNKNOWN>***, which stands in for tokens that appear in the test dataset but not in the train dataset.

In [19]:
total_tokens = 0
token_set = set()
for _, tokens in train_dataset["review"].items():
  total_tokens += len(tokens)
  for token in tokens:
    token_set.add(token)

# Add <UNKNOWN> to the token set as a placeholder for words that do not occur in the training corpus.
token_set.add("<UNKNOWN>")

In [20]:
print(f"Corpus size: {total_tokens}, Vocabulary size: {len(token_set)}\n")
print("First 10 tokens:")
i = 0
for token in token_set:
  if i >= 10:
    break
  print(token)
  i += 1

Corpus size: 145979, Vocabulary size: 25227

First 10 tokens:
anlarmadan
izlememesi
süphem
ersoy
notebook
vermeyecegim
sigara
mizansenler
anlarsin
çevirdigi
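
The corpus size counts every token occurrence, while the vocabulary size counts distinct tokens plus the `<UNKNOWN>` placeholder. A compact way to get both numbers, sketched here with `collections.Counter` and a hypothetical helper named `build_vocabulary`, is:

```python
from collections import Counter


def build_vocabulary(tokenized_reviews, unknown="<UNKNOWN>"):
    """Return (corpus_size, vocabulary) for an iterable of token lists."""
    counts = Counter(token for review in tokenized_reviews for token in review)
    corpus_size = sum(counts.values())      # total tokens, repeats included
    vocabulary = set(counts) | {unknown}    # distinct tokens plus the placeholder
    return corpus_size, vocabulary


# Toy reviews standing in for train_dataset["review"]
toy_reviews = [["iyi", "film"], ["kötü", "film"], []]
corpus_size, vocabulary = build_vocabulary(toy_reviews)
print(corpus_size, sorted(vocabulary))
```

Keeping the `Counter` around also gives per-token frequencies, which the set-based loop above does not record.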


### Vectorization


> token2idx: {token: token_idx}

> Create train_vector_df with the structure [label, vector].
Each vector holds the term frequencies of the review's tokens, normalized by the review length. Its size equals the vocabulary size (vocab_size).

In [21]:
# token to index mapping
token2idx = {}
for idx, term in enumerate(token_set):
  token2idx[term] = idx

In [22]:
print(f"Size: {len(token2idx)}\n\nTerms and Indexes:")
i = 0
for key in token2idx:
  if i >= 10:
    break
  print(f"{key}: {token2idx[key]}")
  i += 1

Size: 25227

Terms and Indexes:
anlarmadan: 0
izlememesi: 1
süphem: 2
ersoy: 3
notebook: 4
vermeyecegim: 5
sigara: 6
mizansenler: 7
anlarsin: 8
çevirdigi: 9
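
Because `token_set` is a Python set, its iteration order (and therefore the index assigned to each token) can differ between interpreter runs. If reproducible indices matter, one option is to sort the vocabulary before enumerating it. The sketch below assumes two hypothetical helpers, `build_index` and `lookup`:

```python
def build_index(vocabulary):
    """Map each token to a stable integer index by sorting the vocabulary first."""
    return {token: idx for idx, token in enumerate(sorted(vocabulary))}


def lookup(token2idx, token, unknown="<UNKNOWN>"):
    """Return the index of a token, falling back to the placeholder's index."""
    return token2idx.get(token, token2idx[unknown])


index = build_index({"film", "iyi", "kötü", "<UNKNOWN>"})
print(index)
print(lookup(index, "harika"))  # unseen token, resolved to the <UNKNOWN> index
```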


In [23]:
# Document vectors of Train Set
train_vector_df = None
vocab_size = len(token_set)
doc_vectors = []
for _, row in train_dataset.iterrows():
  vector = [0.0] * vocab_size # initial vector
  for token in row["review"]:
    vector[token2idx[token]] += 1

  # Normalize the vector by dividing each term frequency by the number of tokens in the review
  for token in set(row["review"]):
    idx = token2idx.get(token, token2idx.get("<UNKNOWN>"))
    vector[idx] /= len(row["review"])

  doc_vectors.append([row["label"], vector])

train_vector_df = pd.DataFrame(doc_vectors, columns= ["label", "vector"])
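
Each document vector is a length-normalized term-frequency vector: entry `i` is the number of times token `i` occurs in the review, divided by the number of tokens in the review. A self-contained sketch of that computation for one review, using a hypothetical `tf_vector` helper and a toy index, is:

```python
from collections import Counter


def tf_vector(tokens, token2idx, unknown="<UNKNOWN>"):
    """Return a length-normalized term-frequency vector as a plain list."""
    vector = [0.0] * len(token2idx)
    if not tokens:                  # an empty review stays an all-zero vector
        return vector
    # Map tokens to indices first, so all unseen tokens share one counter
    idx_counts = Counter(token2idx.get(t, token2idx[unknown]) for t in tokens)
    for idx, count in idx_counts.items():
        vector[idx] = count / len(tokens)
    return vector


toy_index = {"<UNKNOWN>": 0, "film": 1, "iyi": 2}
print(tf_vector(["iyi", "film", "film", "harika"], toy_index))
```

For a non-empty review the entries sum to 1, so long and short reviews contribute on the same scale.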

### Model Training

> Compute one vector per class by summing, element-wise, the document vectors of the positive reviews and of the negative reviews.

In [25]:
import numpy as np

# Vectors for positive & negative sentiments
V = len(token_set)
pos_vector = np.zeros(V) # Initial positive sentiment vector
neg_vector = np.zeros(V) # Initial negative sentiment vector

for _, row in train_vector_df.iterrows():
  vector = np.array(row["vector"])
  if row["label"] == 0:
      neg_vector += vector
  else:
      pos_vector += vector
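
The "training" step is therefore just two running sums, one per class. Since cosine similarity ignores vector length, using the sum instead of the mean of each class does not change which class a review is closer to. A dependency-free sketch of the accumulation, with a hypothetical `class_vectors` helper and toy inputs:

```python
def class_vectors(labels, vectors):
    """Sum document vectors per label and return (positive_sum, negative_sum)."""
    size = len(vectors[0])
    sums = {0: [0.0] * size, 1: [0.0] * size}
    for label, vector in zip(labels, vectors):
        target = sums[label]
        for i, value in enumerate(vector):
            target[i] += value      # element-wise addition
    return sums[1], sums[0]


toy_labels = [1, 0, 1]
toy_vectors = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
pos_sum, neg_sum = class_vectors(toy_labels, toy_vectors)
print(pos_sum, neg_sum)
```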


In [26]:
print(f"positive sentiment vector of size: {len(pos_vector)}\n", pos_vector[:100])
print(f"negative sentiment vector of size: {len(neg_vector)}\n", neg_vector[:100])

positive sentiment vector of size: 25227
 [0.         0.         0.03571429 0.06666667 0.         0.
 0.03846154 0.         0.10846561 0.10416667 0.         0.10998133
 0.         0.         0.03448276 0.         1.98722964 0.07107843
 0.         0.04166667 0.09444444 0.07142857 0.14285714 0.
 0.         0.04       0.         0.         0.         0.07142857
 0.09978214 0.         0.38453397 0.05882353 0.45900794 0.04
 0.03030303 0.02777778 0.         0.125      0.         0.
 0.24798535 0.         0.05       0.         0.         0.075
 0.         0.         0.         0.         0.37618629 0.04545455
 0.84761777 0.         1.68522441 0.         0.         0.02702703
 0.04545455 0.14818713 0.14285714 0.08703704 0.03333333 0.
 0.         0.         0.20238095 0.02941176 0.04761905 0.
 0.         0.05       0.         0.04545455 0.40465749 0.
 0.08608059 0.03333333 2.66223697 0.         0.03333333 0.02941176
 0.03846154 0.         0.14136905 0.         0.         0.
 0.0707196  0.      

### Similarity Measure: Cosine

> Define a function that computes the cosine similarity between two vectors of the same dimension.

In [27]:
def cosine_sim(vector1, vector2):
  dot_product = np.dot(vector1, vector2)  # dot product of the vectors
  magnitude1 = np.linalg.norm(vector1)    # length of vector1
  magnitude2 = np.linalg.norm(vector2)    # length of vector2

  cosine_similarity = dot_product / (magnitude1 * magnitude2)
  return cosine_similarity
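
The function computes cos(a, b) = (a · b) / (‖a‖ ‖b‖). If a review loses all of its tokens during preprocessing, its vector is all zeros, the denominator is zero, and the NumPy division yields `nan` with a runtime warning. The sketch below is a dependency-free variant with an explicit guard. The name `safe_cosine_sim` and the choice to return 0.0 for zero vectors are assumptions, not part of the notebook.

```python
import math


def safe_cosine_sim(vector1, vector2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot_product = sum(a * b for a, b in zip(vector1, vector2))
    magnitude1 = math.sqrt(sum(a * a for a in vector1))
    magnitude2 = math.sqrt(sum(b * b for b in vector2))
    if magnitude1 == 0.0 or magnitude2 == 0.0:
        return 0.0   # convention chosen for this sketch
    return dot_product / (magnitude1 * magnitude2)


print(safe_cosine_sim([1.0, 1.0], [1.0, 0.0]))
print(safe_cosine_sim([0.0, 0.0], [1.0, 2.0]))
```

With this convention an empty review gets similarity 0.0 to both class vectors, so a strict `pos_sim > neg_sim` comparison still assigns it to the negative class, just as a comparison between two `nan` values would.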

### Testing and Result

> Calculate the term-frequency vector for each review in test_dataset. Predict the label of every review using cosine similarity. Finally, compare the predictions with the true labels and compute the success rate (accuracy).

In [28]:
# Document vectors of Test Set
test_vector_df = None
vocab_size = len(token_set)
test_doc_vectors = []
for _, row in test_dataset.iterrows():
  vector = [0.0] * vocab_size # initial vector
  for token in row["review"]:
    idx = token2idx.get(token, token2idx.get("<UNKNOWN>"))
    vector[idx] += 1

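  # Normalize the vector by dividing each term frequency by the number of tokens in the review.
  # Each index is divided exactly once, so the shared <UNKNOWN> slot is not divided repeatedly
  # when a review contains several distinct unseen tokens.
  for idx in {token2idx.get(token, token2idx["<UNKNOWN>"]) for token in row["review"]}:
    vector[idx] /= len(row["review"])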

  test_doc_vectors.append([row["label"], vector])

test_vector_df = pd.DataFrame(test_doc_vectors, columns= ["label", "vector"])

In [29]:
test_labels, test_vectors  = test_vector_df.iloc[:,0], test_vector_df.iloc[:,1]
test_labels.value_counts()

label
0    1076
1    1056
Name: count, dtype: int64

In [30]:
# make all predictions
predicted_labels = []
for vector in test_vectors:
  pos_sim = cosine_sim(np.array(vector), pos_vector)
  neg_sim = cosine_sim(np.array(vector), neg_vector)
  predicted_labels.append((1 if pos_sim > neg_sim else 0))



In [31]:
# Check predictions
len(predicted_labels)

2132
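
The classifier is a nearest-class-vector rule: a review is labeled positive only when it is strictly more similar to the positive vector than to the negative one, so ties fall to the negative class. The sketch below shows that decision rule in isolation. The helper names `cosine` and `predict` and the toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length lists (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def predict(vector, pos_vector, neg_vector):
    """Return 1 if the review is closer to the positive class vector, else 0."""
    # Strict '>' means ties (including all-zero reviews) go to the negative class
    return 1 if cosine(vector, pos_vector) > cosine(vector, neg_vector) else 0


pos_proto, neg_proto = [1.5, 0.5, 0.0], [0.0, 0.5, 0.5]
toy_reviews = ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
print([predict(v, pos_proto, neg_proto) for v in toy_reviews])
```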

In [32]:
# Check whether each prediction is correct and count the four outcomes
total_predictions = len(predicted_labels)
correct_predictions = 0
pos_pos = 0 # predicted pos, it is pos   (correct prediction)
neg_neg = 0 # predicted neg, it is neg   (correct prediction)
pos_neg = 0 # predicted pos, it is neg   (incorrect prediction)
neg_pos = 0 # predicted neg, it is pos   (incorrect prediction)
for idx, prediction in enumerate(predicted_labels):
  label = test_labels[idx]

  if prediction and label:
    pos_pos += 1
  elif prediction and not label:
    pos_neg += 1
  elif not prediction and not label:
    neg_neg += 1
  else:
    neg_pos += 1

  correct_predictions += int(prediction == test_labels[idx])

print(f"Success rate: {correct_predictions / total_predictions:.4g}")
print(f"\nPrediction\tLabel")
print(f"positive\tpositive\t{pos_pos / total_predictions:.4g}")
print(f"negative\tnegative\t{neg_neg / total_predictions:.4g}")
print(f"positive\tnegative\t{pos_neg / total_predictions:.4g}")
print(f"negative\tpositive\t{neg_pos / total_predictions:.4g}")

Success rate: 0.811

Prediction	Label
positive	positive	0.4081
negative	negative	0.4029
positive	negative	0.1018
negative	positive	0.08724
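
The four rates above are the cells of a confusion matrix divided by the number of test reviews, and the success rate is the sum of the two correct cells (0.4081 + 0.4029 = 0.811). The same bookkeeping can be sketched with `collections.Counter`, using a hypothetical `evaluate` helper and toy labels:

```python
from collections import Counter


def evaluate(predictions, labels):
    """Return accuracy and a Counter keyed by (prediction, label) pairs."""
    pairs = Counter(zip(predictions, labels))
    correct = pairs[(1, 1)] + pairs[(0, 0)]   # the two correct cells
    return correct / len(labels), pairs


accuracy, pairs = evaluate([1, 0, 1, 0], [1, 0, 0, 1])
print(f"Success rate: {accuracy:.4g}")
print("predicted positive, actually negative:", pairs[(1, 0)])
print("predicted negative, actually positive:", pairs[(0, 1)])
```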
