# Natural Language Processing 🤖

## Case 2: Text Mining

a. The application must be built around a recommendation theme 
(an example is shown below) and must use several text processing 
techniques, including language modelling (n-gram), word 
embedding using TF-IDF, Word2Vec, or GloVe, and 
an implementation of Named Entity Recognition (NER).
<br><br>
b. The model produced with the Naïve Bayes architecture must reach an accuracy 
of at least 80%.
<br><br>
c. For the Named Entity Recognition (NER) part, you are asked to display the 
Named Entities from the previously processed data. An illustration is shown 
below.

## 0. Import Libraries

In [61]:
### Import Libraries

import pandas as pd
import numpy as np
import re
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
from sklearn.metrics.pairwise import cosine_similarity

## 1. Install Dependencies

##### Note: later sections import additional libraries and dependencies specific to their topics

In [62]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2. Read Dataset

In [63]:
import pandas as pd

result_clean = pd.read_csv('IMBD_Train.csv', encoding="utf-8", nrows=45000)

result_clean.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


## 3. N-Gram Model

#### N-Gram Testing

In [64]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from scipy.sparse import csr_matrix

class NGramLanguageModel:
    def __init__(self, n):
        self.vectorizer = CountVectorizer(ngram_range=(n, n))
    
    def fit_transform(self, corpus):
        return self.vectorizer.fit_transform(corpus)

    def transform(self, query):
        return self.vectorizer.transform(query)


corpus = ["text data for training", "another example of text data"]


n = 1
ngram_model = NGramLanguageModel(n)

matrix = ngram_model.fit_transform(corpus)
query_v = ngram_model.transform(["some query text"])

print(f"{n}-Gram Model:")

print(pd.DataFrame(matrix.toarray(), columns=ngram_model.vectorizer.get_feature_names_out()))
print(query_v)

1-Gram Model:
   another  data  example  for  of  text  training
0        0     1        0    1   0     1         1
1        1     1        1    0   1     1         0
  (0, 5)	1
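
To make it concrete what `ngram_range=(n, n)` counts for `n = 2`, here is a minimal plain-Python sketch. It assumes `CountVectorizer`'s default tokenization (lowercasing, tokens of two or more word characters); `word_ngrams` is a hypothetical helper, not part of the notebook.

```python
import re
from collections import Counter

def word_ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings."""
    # Mirrors CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ["text data for training", "another example of text data"]

# Count every bigram across the toy corpus
bigram_counts = Counter(gram for doc in corpus for gram in word_ngrams(doc, 2))
print(bigram_counts.most_common(3))
```

Only the bigram "text data" occurs in both toy documents, so it is the only feature two documents would share in a 2-gram model.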


#### N-Gram using Real Dataset

In [65]:
class NGramLanguageModel:
  def __init__(self, n):
    self.n = n
    self.vectorizer = CountVectorizer(analyzer='word', ngram_range=(n,n), max_features=10000)
  def fit_transform(self, corpus):
    return self.vectorizer.fit_transform(corpus)
  def transform(self, corpus):
    return self.vectorizer.transform(corpus)

def calculate_cosine_similarity(matrix, query_v):
  similarities = cosine_similarity(query_v, matrix)
  return similarities
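
As a worked example of what `cosine_similarity` computes, the sketch below applies the formula (dot product divided by the product of the vector lengths) to the toy count vectors printed in the N-Gram Testing cell. The `cosine` helper is hypothetical and written in plain Python for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Columns: another, data, example, for, of, text, training
doc_0 = [0, 1, 0, 1, 0, 1, 1]
doc_1 = [1, 1, 1, 0, 1, 1, 0]
query = [0, 0, 0, 0, 0, 1, 0]  # "some query text": only "text" is in the vocabulary

print(cosine(query, doc_0))  # 1 / (1 * 2) = 0.5
print(cosine(query, doc_1))  # 1 / (1 * sqrt(5)), about 0.447
```

A review that shares no vocabulary term with the query gets a similarity of exactly 0, which is why many rows in a unigram comparison score 0.000000.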

In [66]:
# Prepare data for processing

corpus = result_clean['text'].tolist()
query = "Movie"

In [67]:
n = 1
ngram_model = NGramLanguageModel(n)

matrix = ngram_model.fit_transform(corpus)
# Vectorize a placeholder query; pass [query] instead to score the reviews against `query`
query_v = ngram_model.transform(["some query text"])

print(f"{n}-Gram Model:")

print(pd.DataFrame(matrix.toarray(), columns=ngram_model.vectorizer.get_feature_names_out()))
print(query_v)

1-Gram Model:
       00  000  007  10  100  1000  101  11  12  13  ...  zizek  zoey  zombi  \
0       0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
1       0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
2       0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
3       0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
4       0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
...    ..  ...  ...  ..  ...   ...  ...  ..  ..  ..  ...    ...   ...    ...   
39995   0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
39996   0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
39997   0    0    0   0    0     0    0   0   0   0  ...      0     0      0   
39998   0    0    0   1    0     0    0   0   0   0  ...      0     0      0   
39999   0    2    0   0    0     0    0   0   0   0  ...      0     0      0   

       zombie  zombies  z

In [71]:
similarities = calculate_cosine_similarity(matrix, query_v)

data = {'IMBD Review': corpus, 'Similarity': similarities[0]}
df = pd.DataFrame(data)

print(query)
df

Movie


Unnamed: 0,IMBD Review,Similarity
0,I grew up (b. 1965) watching and loving the Th...,0.000000
1,"When I put this movie in my DVD player, and sa...",0.079368
2,Why do people who do not know what a particula...,0.000000
3,Even though I have great interest in Biblical ...,0.000000
4,Im a die hard Dads Army fan and nothing will e...,0.000000
...,...,...
39995,"""Western Union"" is something of a forgotten cl...",0.053243
39996,This movie is an incredible piece of work. It ...,0.085126
39997,My wife and I watched this movie because we pl...,0.000000
39998,"When I first watched Flatliners, I was amazed....",0.000000


## 4. TF-IDF Model Word Embedding

In [70]:
# TF-IDF Model

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
query_tfidf_v = tfidf_vectorizer.transform([query])
tfidf_similarities = calculate_cosine_similarity(tfidf_matrix, query_tfidf_v)

df_tfidf = pd.DataFrame({
    'IMBD Review': corpus,
    'TF-IDF Similarity': tfidf_similarities[0]
})

print("TF-IDF Similarities:")
print(df_tfidf)

TF-IDF Similarities:
                                             IMBD Review  TF-IDF Similarity
0      I grew up (b. 1965) watching and loving the Th...           0.023821
1      When I put this movie in my DVD player, and sa...           0.163534
2      Why do people who do not know what a particula...           0.023389
3      Even though I have great interest in Biblical ...           0.110682
4      Im a die hard Dads Army fan and nothing will e...           0.000000
...                                                  ...                ...
39995  "Western Union" is something of a forgotten cl...           0.020033
39996  This movie is an incredible piece of work. It ...           0.053480
39997  My wife and I watched this movie because we pl...           0.083131
39998  When I first watched Flatliners, I was amazed....           0.024990
39999  Why would this film be so good, but only gross...           0.000000

[40000 rows x 2 columns]
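
To show where the TF-IDF weights come from, here is a plain-Python sketch of the weighting. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); `tokenize` and `tfidf_vectors` are hypothetical helpers, and the toy corpus stands in for the reviews.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep tokens of 2+ word characters (assumed vectorizer default)."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def tfidf_vectors(corpus):
    """Return (vocabulary, idf, vectors) with smoothed IDF and L2-normalized rows."""
    tokenized = [tokenize(doc) for doc in corpus]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, idf, vectors

corpus = ["text data for training", "another example of text data"]
vocabulary, idf, vectors = tfidf_vectors(corpus)
for term in ("text", "training"):
    print(term, round(idf[term], 3))
```

A term that appears in every document ("text") gets the minimum IDF of 1.0, while a term found in only one document ("training") is weighted higher, so rarer words contribute more to the similarity score.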


## 5. Dependency Parsing

In [9]:
import spacy

In [72]:
# Initialize spaCy model

nlp = spacy.load("en_core_web_sm")

def extract_named_entities_and_parse_tree(sentence):
    doc = nlp(sentence)

    # Extract parts of speech
    pos_tags = {
        "punctuations": [token.text for token in doc if token.pos_ == "PUNCT"],
        "verbs": [token.text for token in doc if token.pos_ == "VERB"],
        "nouns": [token.text for token in doc if token.pos_ in ["NOUN", "PROPN"]],
    }

    # Display formatted dependency parse tree
    print("Formatted Dependency Parse Tree:")
    for token in doc:
        print(f"{token.text} --{token.dep_}--> {token.head.text} ({token.pos_})")

    return pos_tags

def main():

    # Limit to the first 50 rows (to keep the output manageable)

    for index, row in result_clean.head(50).iterrows():
        extract_text = row['text']
        print(f"\nAnalyzing Review {index+1}:")
        print("Review Text:", extract_text)
        pos_tags = extract_named_entities_and_parse_tree(extract_text)

        print("\nExtracted Parts of Speech:")
        print("Punctuations:", pos_tags["punctuations"])
        print("Verbs:", pos_tags["verbs"])
        print("Nouns:", pos_tags["nouns"])

if __name__ == "__main__":
    main()


Analyzing Review 1:
Review Text: I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.
Formatted Dependency Parse Tree:
I --nsubj--> grew (PRON)
grew --ROOT--> grew (VERB)
up --p
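
The function above reports part-of-speech tags and dependency arcs; the named entities themselves live in spaCy's `doc.ents`, where each span carries `.text` and `.label_`. A minimal sketch of listing them, assuming the same `en_core_web_sm` model is installed; `group_entities` and `show_entities` are hypothetical helpers, not part of the notebook.

```python
from collections import defaultdict

def group_entities(doc):
    """Group entity texts by label for a spaCy Doc (anything exposing .ents works)."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

def show_entities(text, model_name="en_core_web_sm"):
    """Run the spaCy pipeline on `text` and print its named entities by label."""
    import spacy  # imported lazily so group_entities has no hard dependency on spaCy

    nlp = spacy.load(model_name)
    for label, texts in group_entities(nlp(text)).items():
        print(f"{label}: {texts}")
```

In the notebook this could be called as `show_entities(result_clean['text'][0])` to list the entities of the first review; which labels appear depends on the model.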

## 6. Grammar Parsing using NLTK

In [11]:
import nltk
from nltk import CFG
from nltk.parse import ChartParser

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [73]:
def demonstrate_nlp_parsing(sentence, grammar):

    # Tokenize sentence
    words = nltk.word_tokenize(sentence)
    parser = ChartParser(grammar)

    try:

        parses = list(parser.parse(words))
        # print(parses)
        if parses:
            print(f"Parsed Trees for: \"{sentence}\"")
            for tree in parses:
                # print(tree)
                tree.pretty_print()
        else:
            print(f"No parses found for: \"{sentence}\"")

    except Exception as e:
        print(f"Error during parsing: {e}")


##### Grammar Parsing: Sentences from the IMDB Review

In [74]:
nlp_grammar = CFG.fromstring("""
  S -> NP VP | PP VP | S Conj S
  NP -> Det N | Det N PP | 'i' | 'I' | 'people' | 'this' | 'the' | 'a' | 'The'
  VP -> V NP | V NP PP | 'have' | 'had' | 'has' | 'been' | 'am' | 'is' | 'may' | 'think' | 'needs' | 'just'
  Det -> 'the' | 'a' | 'every' | 'some' | 'my'
  N -> 'upgrade' | 'life' | 'device' | 'battery' | 'reviews' | 'fan' | 'S' | 'Samsung' | 'phone' | 'series' | 'favorite' | 'diehard' | 'far' | 'defective' | 'amazing'
  V -> 'saw' | 'ate' | 'chased' | 'think' | 'have' | 'had' | 'is' | 'am' | 'started' | 'needs' | 'may' | 'replaced' | 'be' | 'amazed' | 'been'
  PP -> P NP
  P -> 'in' | 'on' | 'with' | 'at' | 'before' | 'since' | 'As' | 'so' | 'of' | 'to' | 'that'
  Conj -> ',' | ' '
""")

# Get the sentences

nlp_sentences = result_clean[:5]["text"]
print(nlp_sentences)


0    I grew up (b. 1965) watching and loving the Th...
1    When I put this movie in my DVD player, and sa...
2    Why do people who do not know what a particula...
3    Even though I have great interest in Biblical ...
4    Im a die hard Dads Army fan and nothing will e...
Name: text, dtype: object


In [75]:
for i, sentence in enumerate(nlp_sentences):
  print(f"Sentence {i+1}: {nlp_sentences[i]}")
  demonstrate_nlp_parsing(sentence, nlp_grammar)
  print("\n")

Sentence 1: I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.
Error during parsing: Grammar does not cover some of the input words: '\'grew\', \'up\', \'(\', \'b\', \'.\', \'1

##### Grammar Parsing: Manually Written Sentences

In [80]:
nlp_grammar = CFG.fromstring("""
  S -> NP VP | VP | S Conj S | S PP | NP VP PP
  NP -> Det N | Det Adj N | N | Pron | N PP
  VP -> V NP | V NP PP | V Adj | V | V PP
  PP -> P NP
  Adj -> 'amazing' | 'great' | 'defective' | 'diehard' | 'favorite' | 'far' | 'long'
  Det -> 'the' | 'a' | 'every' | 'some' | 'my' | 'this' | 'The' | 'This'
  N -> 'upgrade' | 'life' | 'device' | 'battery' | 'reviews' | 'fan' | 'Samsung' | 'phone' | 'series' | 'world'
  V -> 'have' | 'had' | 'has' | 'been' | 'am' | 'is' | 'may' | 'think' | 'needs' | 'just' | 'started' | 'replaced' | 'be' | 'amazed' | 'long' | 'lasts'
  P -> 'in' | 'on' | 'with' | 'at' | 'before' | 'since' | 'as' | 'so' | 'of' | 'to'
  Conj -> 'and' | 'or' | ',' | 'but' |
  Pron -> 'I' | 'i' | 'you' | 'we' | 'they' | 'he' | 'she' | 'it'
""")

nlp_sentences = [
    "I think this phone is amazing",
    "The battery lasts long",
    "Samsung has replaced my defective phone",
    "This is my favorite series",
    "The phone with great reviews is amazing"
]

In [81]:
for i, sentence in enumerate(nlp_sentences):
  print(f"Sentence {i+1}: {nlp_sentences[i]}")
  demonstrate_nlp_parsing(sentence, nlp_grammar)
  print("\n")

Sentence 1: I think this phone is amazing
Parsed Trees for: "I think this phone is amazing"
                     S                                
       ______________|______________                   
      S         |                   S                 
  ____|____     |          _________|_______           
 NP        VP   |         NP                VP        
 |         |    |     ____|____          ___|_____     
Pron       V   Conj Det        N        V        Adj  
 |         |    |    |         |        |         |    
 I       think ...  this     phone      is     amazing

                          S                       
             _____________|_____________           
            S                  |        |         
  __________|___               |        |          
 |              VP             |        S         
 |      ________|____          |        |          
 NP    |             NP        |        VP        
 |     |         ____|____     |     ___|_____  

### Explanation

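- Grammar Parsing: sentences from the IMDB reviews -> too complex for the hand-written grammar -> parsing fails with a grammar coverage error
- Grammar Parsing: manually written sentences -> moderately complex -> some sentences parse, others return no parse (which is not an error)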
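
The coverage error on the reviews arises because `ChartParser` rejects any input word that no rule of the grammar produces. One way to see in advance which words are missing is sketched below. It assumes NLTK's convention that terminals are plain strings on the right-hand side of productions; `uncovered_words` and `demo` are hypothetical helpers, and the small grammar is for illustration only.

```python
def uncovered_words(grammar, tokens):
    """Return the tokens that no production of `grammar` yields as a terminal."""
    terminals = {
        symbol
        for production in grammar.productions()
        for symbol in production.rhs()
        if isinstance(symbol, str)  # NLTK stores terminals as plain strings
    }
    return [token for token in tokens if token not in terminals]

def demo():
    """Check a short phrase against a tiny illustrative grammar."""
    from nltk import CFG  # imported lazily so the helper above does not require NLTK

    grammar = CFG.fromstring("""
      S -> NP VP
      NP -> 'I' | Det N
      VP -> V NP
      Det -> 'this'
      N -> 'phone'
      V -> 'think' | 'like'
    """)
    print(uncovered_words(grammar, "I grew up".split()))
```

An empty result means every word is covered, so the parser will run; it may still return no trees if the rules cannot combine the words into a sentence, which is the "not found" case rather than an error.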

## 7. Movie Recommendation Application

##### Notes:
- This cell is self-contained and can be run without any of the cells above
- The dataset is not well suited to movie recommendation: it holds only review text and a binary label, with no movie titles :(

In [18]:
# Import Libraries

import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity

# Load the data

result_clean = pd.read_csv('IMBD_Train.csv', encoding="utf-8", nrows=45000)

# Prepare data for classification

corpus = result_clean['text'].tolist()  # Reviews
categories = result_clean['label'].tolist()  # Categories


# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(corpus, categories, test_size=0.3, random_state=42)


# Train Naive Bayes model

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB())
])

pipeline.fit(X_train, y_train)


# Predict on the test set and calculate accuracy


y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Naive Bayes Accuracy: {accuracy * 100:.2f}%")


# Initialize spaCy model

nlp = spacy.load("en_core_web_sm")


# POS tags and dependency parse (used by menu option 3)


def extract_named_entities_and_parse_tree(sentence):
    doc = nlp(sentence)
    pos_tags = {
        "punctuations": [token.text for token in doc if token.pos_ == "PUNCT"],
        "verbs": [token.text for token in doc if token.pos_ == "VERB"],
        "nouns": [token.text for token in doc if token.pos_ in ["NOUN", "PROPN"]],
    }
    print("Formatted Dependency Parse Tree:")
    for token in doc:
        print(f"{token.text} --{token.dep_}--> {token.head.text} ({token.pos_})")
    return pos_tags


# Movie Recommend Function


def recommend_movie(review):

    # Predict the category of the review
    category = pipeline.predict([review])[0]
    
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

    query_tfidf_v = tfidf_vectorizer.transform([review])

    tfidf_similarities = cosine_similarity(tfidf_matrix, query_tfidf_v)

    df_tfidf = pd.DataFrame({
        'Document': corpus,
        'Category': categories,
        'TF-IDF Similarity': tfidf_similarities.flatten()
    })

    recommendations = df_tfidf[df_tfidf['Category'] == category]

    top_recommendations = recommendations.sort_values(by='TF-IDF Similarity', ascending=False).head(5)

    top_recommendations['Category'] = top_recommendations['Category'].astype(str)

    result_clean['label'] = result_clean['label'].astype(str)

    top_recommendations = top_recommendations.merge(result_clean[['text', 'label']], 
                                                   left_on='Category', right_on='label', how='left')

    # Return the top 5 movie recommendations
    return top_recommendations[['label', 'text', 'TF-IDF Similarity']].head(5)


# Main Menu


def menu():

    review = ""
    category = "UNKNOWN"

    while True:
        
        print("\nMOVIE RECOMMENDATION APPLICATION BASED ON IMBD REVIEWS")
        print(f"YOUR REVIEW: {review if review else 'NO REVIEW'}")
        print(f"YOUR REVIEW CATEGORY: {category}")
        print("1. WRITE YOUR REVIEW")
        print("2. VIEW MOVIE RECOMMENDATION")
        print("3. VIEW NAMED ENTITIES RECOGNITION")
        print("4. EXIT")
        choice = input("Choose an option: ")

        if choice == "1":

            review = input("Write your review: ")
            category = pipeline.predict([review])[0]
            print(f"Review categorized as: {category}")

        elif choice == "2":

            if review:
                recommendations = recommend_movie(review)
                print("\nTOP MOVIE RECOMMENDATIONS FOR YOU:")
                if recommendations.empty:
                    print("No recommendations found.")
                else:
                    for idx, row in recommendations.iterrows():
                        print(f"{idx + 1}: Category {row['label']}")
                        print(f"   Review: {row['text']}")
                        print(f"   Similarity: {row['TF-IDF Similarity']:.4f}")
            else:
                print("Please write a review first.")

        elif choice == "3":

            if review:
                pos_tags = extract_named_entities_and_parse_tree(review)
                print("\nExtracted Parts of Speech:")
                print("Punctuations:", pos_tags["punctuations"])
                print("Verbs:", pos_tags["verbs"])
                print("Nouns:", pos_tags["nouns"])
            else:
                print("Please write a review first.")

        elif choice == "4":
            print("\nExiting...")
            break

        else:
            print("Invalid choice. Please try again.")

if __name__ == "__main__":
    menu()


Naive Bayes Accuracy: 84.84%

MOVIE RECOMMENDATION APPLICATION BASED ON IMBD REVIEWS
YOUR REVIEW: NO REVIEW
YOUR REVIEW CATEGORY: UNKNOWN
1. WRITE YOUR REVIEW
2. VIEW MOVIE RECOMMENDATION
3. VIEW NAMED ENTITIES RECOGNITION
4. EXIT
Review categorized as: 0

MOVIE RECOMMENDATION APPLICATION BASED ON IMBD REVIEWS
YOUR REVIEW: zombieeeeeeeeee zombie zombie
YOUR REVIEW CATEGORY: 0
1. WRITE YOUR REVIEW
2. VIEW MOVIE RECOMMENDATION
3. VIEW NAMED ENTITIES RECOGNITION
4. EXIT

TOP MOVIE RECOMMENDATIONS FOR YOU:
1: Category 0
   Review: I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with 
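
The first recommendation above is the Thunderbirds review, which is also the first row of the dataset, even though the query only mentions zombies. One likely cause is the merge on `Category`/`label` in `recommend_movie`: thousands of reviews share each label, so that join pairs every top row with every review of the same label, and `.head(5)` then returns the first reviews in the dataset. Below is a minimal plain-Python sketch of ranking without the merge; `top_k_in_category` and the toy data are hypothetical, not part of the original notebook.

```python
def top_k_in_category(documents, labels, similarities, category, k=5):
    """Return up to k (label, text, similarity) tuples from `category`, most similar first."""
    candidates = [
        (label, text, score)
        for text, label, score in zip(documents, labels, similarities)
        if label == category
    ]
    # Sort by similarity only, highest first; no join back to the full dataset is needed
    candidates.sort(key=lambda row: row[2], reverse=True)
    return candidates[:k]

# Toy stand-ins for `corpus`, `categories`, and the flattened TF-IDF similarities
documents = ["a dull puppet remake", "zombie film, badly acted", "great zombie comedy"]
labels = [0, 0, 1]
similarities = [0.00, 0.62, 0.71]

for label, text, score in top_k_in_category(documents, labels, similarities, category=0, k=2):
    print(f"Category {label}: {text} (similarity {score:.4f})")
```

With the notebook's objects, the call would take `corpus`, `categories`, `tfidf_similarities.flatten()`, and the predicted `category`, so each returned text is the review that actually earned the similarity score.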

### Case 2 Criteria

- 2#Dependency Parsing: Done
- 2#Grammar Parsing: Done
- 2#Language Model: Done
- 2#NER: Done
- 2#Text Preprocessing: Done

<code>Made by: NW25-1</code>