# IMDB Sentiment Analysis with NLP - IMDB Text Recognition Software with Artificial Intelligence


# In this project we will build sentiment analysis software using natural language processing (NLP). The dataset comes from Kaggle, a platform owned by Google.

# The dataset contains English-language IMDB movie reviews. The model developed in this project will automatically classify each review as positive or negative.

# The goal is to learn the core NLP workflow quickly, without getting bogged down in theory.


In [91]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score 
from bs4 import BeautifulSoup
import re
import nltk
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

In [92]:
# CountVectorizer is scikit-learn's bag-of-words vectorizer; it turns text into word-count features.

# Load the dataset.
df = pd.read_csv("NLPlabeledData.tsv", delimiter="\t", quoting=3)


# The `delimiter="\t"` and `quoting=3` parameters tell `pd.read_csv()` how to parse the file:

# 1. **delimiter="\t"**:
# - Specifies the character that separates fields. `"\t"` is the tab character, so the file is read as tab-separated: the fields on each line are separated by tabs.

# 2. **quoting=3**:
# - Controls how quote characters are handled. `3` corresponds to `csv.QUOTE_NONE`, which tells the parser not to treat quote characters specially: a double quote (`"`) is read as ordinary text rather than as a field wrapper. This suits free-text reviews, which can contain unbalanced quotes. As a consequence, the quotes that surround a value in the file remain part of the loaded string.

# Both settings must match the structure of the file for the data to load correctly.
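
# As a rough illustration of what `quoting=3` means, the sketch below uses Python's standard `csv` module, whose `QUOTE_NONE` constant has the integer value passed to pandas here. The two-line `sample_tsv` string is a made-up stand-in for `NLPlabeledData.tsv`, not data taken from the file.

```python
import csv
import io

# Hypothetical tab-separated sample (an assumption for illustration only).
sample_tsv = 'id\tsentiment\treview\n"5814_8"\t1\t"Great film"\n'

# quoting=3 in pd.read_csv is the integer value of csv.QUOTE_NONE.
print(csv.QUOTE_NONE)

# With QUOTE_NONE, quote characters stay in the data as ordinary text.
rows_quote_none = list(
    csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
)

# With the default quoting mode, quotes that wrap a field are stripped.
rows_default = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t"))

print(rows_quote_none[1])
print(rows_default[1])
```

# This is consistent with the quoted `id` and `review` values that appear when the DataFrame is displayed.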

In [94]:
# Let's look at our data.

df.head()

         id  sentiment  review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...


In [95]:
len(df)

25000

# len() is a built-in Python function that returns the number of items in an object (for example a list, string, tuple, or other collection). For a DataFrame it returns the number of rows.

In [97]:
len(df["review"])          # The review column also has 25,000 entries.

25000

In [98]:
# To remove stopwords, we first need to download NLTK's stopword lists to our computer.
# We do this with nltk.download().

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ubuntu26/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# nltk.download("stopwords") downloads the "stopwords" corpus of NLTK (Natural Language Toolkit). Stopwords are very frequent words that usually carry little meaning on their own (for example "and", "a", "I", "with").

# They are usually removed before text analysis because they add little meaningful content. Running nltk.download("stopwords") fetches the files that contain the stopword lists.

### DATA CLEANING PROCEDURES

# First, we will remove the HTML tags from the review text using the BeautifulSoup module.

# To show how each step works, let's first walk through a single sample review.

In [101]:
sample_review = df.review[0]    # This review will be cleaned.
sample_review

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [102]:
# After removing the HTML tags...
sample_review = BeautifulSoup(sample_review, "html.parser").get_text()
sample_review

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2

In [103]:
# Remove punctuation and digits by replacing every non-letter character with a space. (Digits are not used in sentiment analysis here because they can be misleading.)
sample_review = re.sub("[^a-zA-Z]", ' ', sample_review)
sample_review

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    m

# re.sub() is a function in Python's re (regular expression) module. It replaces every part of a string that matches a regular expression with a replacement string. Here the pattern [^a-zA-Z] matches any character that is not an ASCII letter.
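
# A minimal sketch of the same substitution on a short, made-up sentence (the `text` value is an assumption, not taken from the dataset). It shows that each non-letter character becomes exactly one space, so the string length is unchanged and contractions are split into fragments.

```python
import re

# Hypothetical input used only for illustration.
text = "Only on for 20 minutes, isn't it?"

# Replace every character that is not an ASCII letter with a single space.
cleaned = re.sub("[^a-zA-Z]", " ", text)

print(repr(cleaned))
print(cleaned.split())
```

# The fragments left behind by contractions (such as a lone "t") explain tokens like "ve" or "kay" that can show up after cleaning.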

# We convert the text to lowercase so that the machine learning algorithm does not treat capitalized words (e.g. "The" and "the") as different words.

In [106]:
sample_review = sample_review.lower()
sample_review

' with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    m
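
# To see why lowercasing matters for word counts, here is a small sketch using `collections.Counter` on a made-up sentence (the sentence and the variable names are assumptions for illustration). Without lowercasing, "The" and "the" are counted as two separate words.

```python
from collections import Counter

# Hypothetical sentence used only for illustration.
words = "The film was good and the cast was good".split()

# Case-sensitive counts keep "The" and "the" apart.
counts_raw = Counter(words)

# Lowercasing first merges them into a single entry.
counts_lower = Counter(w.lower() for w in words)

print(counts_raw["The"], counts_raw["the"])
print(counts_lower["the"])
```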

# We do not want the model to use stopwords (grammatical words such as "the", "is", "are").
# First we split the text into a list of words with split(); then we remove the stopwords.

In [138]:
swords = set(stopwords.words("english"))
sample_review = sample_review.split()                                # split the text into a list of words
sample_review = [w for w in sample_review if w not in swords]        # words such as "are", "is", "at" are dropped
sample_review

['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'wiz',
 'watched',
 'moonwalker',
 'maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'really',
 'cool',
 'eighties',
 'maybe',
 'make',
 'mind',
 'whether',
 'guilty',
 'innocent',
 'moonwalker',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'remember',
 'going',
 'see',
 'cinema',
 'originally',
 'released',
 'subtle',
 'messages',
 'mj',
 'feeling',
 'towards',
 'press',
 'also',
 'obvious',
 'message',
 'drugs',
 'bad',
 'kay',
 'visually',
 'impressive',
 'course',
 'michael',
 'jackson',
 'unless',
 'remotely',
 'like',
 'mj',
 'anyway',
 'going',
 'hate',
 'find',
 'boring',
 'may',
 'call',
 'mj',
 'egotist',
 'consenting',
 'making',
 'movie',
 'mj',
 'fans',
 'would',
 'say',
 'made',
 'fans',
 'true',
 'really',
 'nice',
 'actual',
 'feature',
 'film',
 'bit',
 'finally',
 'starts',
 'minutes',
 'excluding',
 'smooth',
 'crim
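
# The filtering step depends on splitting first, because iterating over a string yields single characters rather than words. The sketch below illustrates this with a tiny hand-written stopword set; `stop_words` is a hypothetical subset chosen for the example, not NLTK's actual English list.

```python
# Hypothetical stopword subset (an assumption; NLTK's list is much longer).
stop_words = {"the", "is", "at", "this", "and"}

text = "this film is the best"

# Correct: split into words, then filter word by word.
tokens = text.split()
kept_words = [w for w in tokens if w not in stop_words]

# Pitfall: iterating the unsplit string compares single characters,
# none of which is in the stopword set, so nothing is removed.
kept_chars = [c for c in text if c not in stop_words]

print(kept_words)
print(len(kept_chars) == len(text))
```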

# Now that each cleaning step has been explained, we will clean all the reviews in our DataFrame in a loop. Let's write a function for this.



In [143]:
def process(review):
    # strip HTML tags
    review = BeautifulSoup(review, "html.parser").get_text()
    # replace punctuation and digits with spaces
    review = re.sub("[^a-zA-Z]", ' ', review)
    # convert to lowercase and split into words so stopwords can be removed
    review = review.lower()
    review = review.split()
    # remove stopwords
    swords = set(stopwords.words("english"))                    # a set makes membership tests fast
    review = [w for w in review if w not in swords]
    # join the remaining words back into a single space-separated string
    return " ".join(review)

# We clean the training data with the function above.
# A line is printed after every 1000 reviews so that we can follow the progress.

In [146]:
train_x_tum = []
for r in range(len(df["review"])):
    # clean every review; the append must not be nested under the progress check
    train_x_tum.append(process(df["review"][r]))
    if (r+1)%1000 == 0:
        print("No of reviews processed =", r+1)

No of reviews processed = 1000
No of reviews processed = 2000
No of reviews processed = 3000
No of reviews processed = 4000
No of reviews processed = 5000
No of reviews processed = 6000
No of reviews processed = 7000
No of reviews processed = 8000
No of reviews processed = 9000
No of reviews processed = 10000
No of reviews processed = 11000
No of reviews processed = 12000
No of reviews processed = 13000
No of reviews processed = 14000
No of reviews processed = 15000
No of reviews processed = 16000
No of reviews processed = 17000
No of reviews processed = 18000
No of reviews processed = 19000
No of reviews processed = 20000
No of reviews processed = 21000
No of reviews processed = 22000
No of reviews processed = 23000
No of reviews processed = 24000
No of reviews processed = 25000


# The loop is meant to clean every review in the "review" column of the DataFrame; the print statement only fires at every 1000th review, as intermediate output for tracking progress. For this to hold, the append call must sit at the loop's indentation level, outside the if block. Otherwise only every 1000th review would be cleaned and the result would contain 25 items instead of 25,000.
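
# The progress-reporting pattern can be sketched without the dataset. In the example below, `clean` is a hypothetical stand-in for `process()` and `reviews` is a generated list, both assumptions made so the snippet runs on its own. Every item is cleaned, and a message is printed only at multiples of 1000.

```python
def clean(text):
    # Hypothetical stand-in for process(); it only lowercases the text.
    return text.lower()

# Generated placeholder data, not real reviews.
reviews = [f"Review {i}" for i in range(2500)]

cleaned = []
for i, review in enumerate(reviews, start=1):
    # Runs on every iteration, regardless of the progress check.
    cleaned.append(clean(review))
    # Runs only at every 1000th item.
    if i % 1000 == 0:
        print("No of reviews processed =", i)

print(len(cleaned))
```

# Checking that the output list is as long as the input is a quick way to catch an append that was accidentally indented under the if.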

In [157]:
x = train_x_tum
y = np.array(df["sentiment"])

# train test split
# train_x, test_x, y_train, y_test = train_test_split(x,y, test_size = 0.1, random_state =42)