# Logistic Regression (Movie Reviews Sentiment Prediction)

## Explanation

Build a machine learning model that predicts whether a movie review is positive or negative, using logistic regression.
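
As a rough illustration of the idea behind the algorithm: logistic regression turns a weighted sum of features into a probability with the sigmoid function, then thresholds that probability at 0.5. The sketch below assumes two word-count features; the weights, bias, and helper names are made-up values for illustration, not parameters learned from this dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Return P(positive) for one review, given its feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical features: counts of the words ["wonderful", "bad"] in a review.
weights = [1.5, -2.0]   # assumed weights, not learned from the data
bias = 0.0

p_pos = predict_proba([2, 0], weights, bias)   # "wonderful" appears twice
p_neg = predict_proba([0, 1], weights, bias)   # "bad" appears once

label_pos = int(p_pos >= 0.5)   # 1 = positive
label_neg = int(p_neg >= 0.5)   # 0 = negative
```

Training a real model amounts to finding the weights and bias that make these probabilities match the labelled reviews as closely as possible.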

### Column Descriptions

- ***review*** = User review for the movie
- ***sentiment (label)*** = Review classification (Negative / Positive)

## A. Data Preparation

### A.1 Import Libraries

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt') # Needed to split text into words (tokenizing)
nltk.download('punkt_tab') # Just in case newer NLTK versions raise a punkt error

list_stopwords = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nahls\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nahls\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\nahls\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nahls\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\nahls\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


### A.2 Load Data

In [3]:
try:
    df = pd.read_csv("imdb_reviews.csv")
    print("Data loaded successfully")
except Exception as exc:
    # Report the reason instead of silently swallowing every error
    print(f"Failed to load data: {exc}")

Data loaded successfully


### A.3 Viewing Data Dimensions

In [4]:
df.shape

(50000, 2)

### A.4 Viewing Data Information

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


### A.5 Viewing Data Statistics

#### Not applicable: the dataset has no numeric columns to summarize

### A.6 Viewing Top 5 Data and Bottom 5 Data

In [6]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [7]:
df.tail()

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


### A.7 Viewing Duplicated Data

In [11]:
df.duplicated().sum()

np.int64(0)

In [10]:
df.drop_duplicates(inplace=True)

### A.8 Viewing Missing Data

In [9]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

### A.9 Viewing Outlier Data

#### Not applicable: the dataset has no numeric columns to check for outliers

## B. Data Preprocessing

### B.1 Mapping Label

In [14]:
df_clean = df.copy()

In [None]:
sentiment_mapping = {'negative' : 0, 'positive' : 1}

df_clean['sentiment'] = df['sentiment'].map(sentiment_mapping)
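
pandas `Series.map` returns `NaN` for any value missing from the dictionary, so it can be worth confirming that every label is covered before mapping. The following is a minimal pure-Python sketch of that check; `find_unmapped` and the sample labels are hypothetical and not part of the notebook.

```python
sentiment_mapping = {'negative': 0, 'positive': 1}

def find_unmapped(labels, mapping):
    """Return the sorted distinct label values that the mapping does not cover."""
    return sorted({label for label in labels if label not in mapping})

# Hypothetical sample; 'Positive ' (capitalised, trailing space) is deliberately malformed.
sample_labels = ['positive', 'negative', 'Positive ', 'negative']
unmapped = find_unmapped(sample_labels, sentiment_mapping)

# Normalising case and whitespace first lets the mapping cover every sample label.
normalised = [label.strip().lower() for label in sample_labels]
encoded = [sentiment_mapping[label] for label in normalised]
```

In the notebook, the same check could presumably be run on the distinct values of the `sentiment` column before calling `.map`.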

## C. Text Preprocessing

### Text Cleaning & Lemmatizing

In [None]:
def clean_text(text):
    # 1. Lowercase so that 'Good' and 'good' are treated as the same word
    text = text.lower()

    # 2. Remove HTML tags (IMDB reviews contain many <br /> tags);
    #    substitute a space so the words on either side are not fused together
    text = re.sub(r'<.*?>', ' ', text)

    # 3. Remove everything except letters and whitespace (digits and symbols are dropped)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # 4. Tokenization: split the text into words
    words = word_tokenize(text)

    # 5. Stopword removal and lemmatization, combined in one pass for speed
    # "movies" -> "movie"; with the default noun POS, verb forms such as "running" stay unchanged
    # "is", "the", "and" -> removed
    cleaned_words = [lemmatizer.lemmatize(word) for word in words if word not in list_stopwords]

    # Join the words back into a single string
    return " ".join(cleaned_words)

# Apply to the DataFrame (this takes a while on 50k rows)
print("Cleaning data... (may take 1-2 minutes)")
df_clean['review_clean'] = df_clean['review'].apply(clean_text)

Cleaning data... (may take 1-2 minutes)
                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                        review_clean  sentiment  
0  one reviewer mentioned watching oz episode you...          1  
1  wonderful little production filming technique ...          1  
2  thought wonderful way spend time hot summer we...          1  
3  basically there family little boy jake think t...          0  
4  petter matteis love time money visually stunni...          1  
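
The regex steps of the cleaning function can be tried on a single string without NLTK. The sketch below covers only lowercasing, tag removal, and non-letter removal (no tokenization, stopword removal, or lemmatization). It compares two ways of handling HTML tags: deleting them outright, or substituting a space. `strip_markup` and the sample sentence are hypothetical, written only for this illustration.

```python
import re

sample = "Great film.<br /><br />The cast was superb, 10/10!"

def strip_markup(text: str, tag_replacement: str) -> str:
    """Lowercase, replace HTML tags, drop non-letters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'<.*?>', tag_replacement, text)   # non-greedy: one tag at a time
    text = re.sub(r'[^a-z\s]', '', text)             # keep only letters and whitespace
    return " ".join(text.split())

glued = strip_markup(sample, '')     # tags deleted outright
spaced = strip_markup(sample, ' ')   # tags replaced by a space

print(glued)    # great filmthe cast was superb
print(spaced)   # great film the cast was superb
```

When a tag sits directly between two words, deleting it fuses them into a token such as `filmthe`, which no stopword list or lemmatizer will recognise. Substituting a space avoids this.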


In [18]:
df_clean[['review', 'review_clean', 'sentiment']].head()

Unnamed: 0,review,review_clean,sentiment
0,One of the other reviewers has mentioned that ...,one reviewer mentioned watching oz episode you...,1
1,A wonderful little production. <br /><br />The...,wonderful little production filming technique ...,1
2,I thought this was a wonderful way to spend ti...,thought wonderful way spend time hot summer we...,1
3,Basically there's a family where a little boy ...,basically there family little boy jake think t...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",petter matteis love time money visually stunni...,1


## C. Exploratory Data Analysis (EDA)