
# POS Tagging Analysis of Reviews Dataset

This notebook performs part-of-speech (POS) tagging on customer reviews using the NLTK library. The dataset contains reviews and their ratings; the POS tags are extracted from the stemmed version of each review (the `review_stemmed` column).

## Steps Included:
1. Load the dataset.
2. Tokenize and POS tag the text.
3. Display the POS tagging results and save them to CSV.
4. Count the ten most frequent words.

### Dataset Information
The dataset contains columns such as:
- `content`: Original review text.
- `score`: Numerical rating.
- `sentiment_rating`: Sentiment label (Positive/Negative).
- `review_stemmed`: Stemmed version of the review.
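
Before tagging, it can help to confirm that the uploaded file actually contains the columns listed above. The following is a minimal sketch; the helper name `missing_columns` and the `REQUIRED_COLUMNS` tuple are illustrative assumptions, not part of the original notebook. The helper accepts any iterable of names, so `df.columns` from pandas could be passed directly.

```python
# Hypothetical helper (not in the original notebook): report which
# expected columns are absent from a loaded dataset.
REQUIRED_COLUMNS = ("content", "score", "sentiment_rating", "review_stemmed")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are not present, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example using the column header that this notebook's dataset displays.
header = [
    "Unnamed: 0", "content", "score", "sentiment_rating", "wordCount",
    "review_with_stopwords", "wordCount_after_stopwords", "review_stemmed",
]
print(missing_columns(header))  # prints []
```

An empty list means every expected column is available; a non-empty list names the columns to investigate before running the tagging cells.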


In [1]:
import pandas as pd
from google.colab import files

# Upload the CSV file
uploaded = files.upload()

# Load the CSV file
file_path = list(uploaded.keys())[0]
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()


Saving df_yesdok_stemmed.csv to df_yesdok_stemmed (2).csv


Unnamed: 0,content,score,sentiment_rating,wordCount,review_with_stopwords,wordCount_after_stopwords,review_stemmed
0,"gak bisa bayar yg mau di bpilih , bingung aneh...",2,Negative,15,gak bayar bpilih bingung aneh gak bantuan sma,8,gak bayar bpilih bingung aneh gak bantuan sma
1,sgt berguna utk tahu kesehatan dan shared soal...,5,Positive,9,sgt berguna utk kesehatan shared kesehatan,6,sgt berguna utk kesehatan share kesehatan
2,sudah war jadwal dokter berbulan2 pas hari h d...,1,Negative,66,war jadwal dokter berbulan pas h dokternya ter...,38,war jadwal dokter berbulan pa h dokternya terk...
3,dokternya sibuk terus,1,Negative,3,dokternya sibuk,2,dokternya sibuk
4,aplikasi yang sangat kurang relevan buat dipak...,1,Negative,77,aplikasi relevan dipake pelayanan yesdok berma...,28,aplikasi relevan dipak pelayanan yesdok berman...


In [2]:
from nltk import pos_tag, word_tokenize
import nltk
import pandas as pd

# Download necessary NLTK data (only needs to be done once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# 'df' is the DataFrame loaded in the previous cell; 'review_stemmed' holds the stemmed reviews
# Create an empty list to collect one {"Word", "POS Tag"} record per token
all_word_pos_data = []

# Iterate over each review in the DataFrame
for review in df['review_stemmed']:
    if isinstance(review, str):  # Skip missing values (pandas reads empty cells as float NaN)
        # Tokenize the review
        tokens = word_tokenize(review)
        # Get POS tags for the tokens
        pos_tags = pos_tag(tokens)
        # Create a list of dictionaries for each word and its POS tag
        word_pos_list = [{"Word": word, "POS Tag": tag} for word, tag in pos_tags]
        # Append the result for this review to the main list
        all_word_pos_data.extend(word_pos_list)

# Convert the list of dictionaries to a pandas DataFrame
df_word_pos = pd.DataFrame(all_word_pos_data)

# Display the DataFrame
print(df_word_pos)

# Save the DataFrame to a CSV file
df_word_pos.to_csv("word_pos_tags.csv", index=False)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


           Word POS Tag
0           gak      NN
1         bayar      NN
2        bpilih      NN
3       bingung      NN
4          aneh     VBP
...         ...     ...
68398       via      IN
68399     vcall      NN
68400      bagu      NN
68401      help      NN
68402  membantu      NN

[68403 rows x 2 columns]
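
NLTK's `pos_tag` uses an English-trained tagger by default, while these reviews are in Indonesian, so tags such as `NN` for `gak` or `bingung` should be read with caution. One way to gauge this is to inspect the tag distribution: a heavy skew toward `NN` may indicate that the tagger is guessing on words it has not seen. The sketch below uses only the ten rows visible in the output above as sample data; `tag_distribution` is an illustrative name, not part of the notebook.

```python
from collections import Counter


def tag_distribution(word_tag_pairs):
    """Return {tag: share of tokens}, ordered from most to least common."""
    counts = Counter(tag for _, tag in word_tag_pairs)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.most_common()}


# Sample data: the ten (word, tag) rows visible in the printed output.
sample = [
    ("gak", "NN"), ("bayar", "NN"), ("bpilih", "NN"), ("bingung", "NN"),
    ("aneh", "VBP"), ("via", "IN"), ("vcall", "NN"), ("bagu", "NN"),
    ("help", "NN"), ("membantu", "NN"),
]
for tag, share in tag_distribution(sample).items():
    print(f"{tag}: {share:.0%}")
```

For the full table, the same function could be called as `tag_distribution(zip(df_word_pos["Word"], df_word_pos["POS Tag"]))`.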


In [3]:
from nltk import pos_tag, word_tokenize
import nltk
import pandas as pd
from collections import Counter

# Download necessary NLTK data (only needs to be done once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# 'df' is the DataFrame loaded in In [1]; 'review_stemmed' holds the stemmed reviews
# Rebuild the word/POS records (same steps as In [2]) so this cell can run on its own
all_word_pos_data = []

# Iterate over each review in the DataFrame
for review in df['review_stemmed']:
    if isinstance(review, str):  # Check if the entry is a string
        # Tokenize the review
        tokens = word_tokenize(review)
        # Get POS tags for the tokens
        pos_tags = pos_tag(tokens)
        # Create a list of dictionaries for each word and its POS tag
        word_pos_list = [{"Word": word, "POS Tag": tag} for word, tag in pos_tags]
        # Append the result for this review to the main list
        all_word_pos_data.extend(word_pos_list)

# Convert the list of dictionaries to a pandas DataFrame
df_word_pos = pd.DataFrame(all_word_pos_data)

# Count the frequency of each word
word_counts = Counter(df_word_pos['Word'])

# Get the 10 most common words
top_10_words = word_counts.most_common(10)

# Create a DataFrame for the top 10 words
df_top_10_words = pd.DataFrame(top_10_words, columns=['Word', 'Frequency'])

# Save the top-10 table to a CSV file
df_top_10_words.to_csv("top_10_words.csv", index=False)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
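
The cell above re-runs tokenization and tagging only to count words. Because the counts depend on the tokens alone, they can also be derived from records that have already been built. A minimal sketch, assuming the `(word, tag)` records from the earlier tagging cell are available; `top_words` and its optional `tag` filter are hypothetical additions for illustration.

```python
from collections import Counter


def top_words(word_tag_pairs, n=10, tag=None):
    """Return the n most frequent words, optionally restricted to one POS tag."""
    words = (
        word for word, word_tag in word_tag_pairs
        if tag is None or word_tag == tag
    )
    return Counter(words).most_common(n)


# Toy input for illustration only; not taken from the dataset.
pairs = [
    ("aplikasi", "NN"), ("membantu", "NN"), ("aplikasi", "NN"),
    ("via", "IN"), ("aplikasi", "NN"), ("membantu", "VB"),
]
print(top_words(pairs, n=2))            # [('aplikasi', 3), ('membantu', 2)]
print(top_words(pairs, n=1, tag="IN"))  # [('via', 1)]
```

Applied to the notebook's table, the call would look like `top_words(zip(df_word_pos["Word"], df_word_pos["POS Tag"]))`.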


In [4]:
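# Reload the saved table under its own name so the reviews DataFrame 'df' is not overwritten
df_top_10_words = pd.read_csv("top_10_words.csv")

# Display the ten most frequent words
df_top_10_words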

Unnamed: 0,Word,Frequency
0,aplikasi,5474
1,membantu,3579
2,yesdok,2615
3,konsultasi,2541
4,dokter,2210
5,bagu,2099
6,kesehatan,1934
7,bermanfaat,1225
8,aplikasinya,849
9,mudah,844
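
Raw counts are easier to interpret relative to corpus size. The tagging cell reports 68403 tokens in total, so each frequency can be expressed as a share of all tokens. The sketch below hard-codes the values from the table above; `relative_frequency` is an illustrative helper, not part of the notebook.

```python
TOTAL_TOKENS = 68403  # row count reported for the word/POS table

# Frequencies copied from the top-10 table above.
top_10 = {
    "aplikasi": 5474, "membantu": 3579, "yesdok": 2615, "konsultasi": 2541,
    "dokter": 2210, "bagu": 2099, "kesehatan": 1934, "bermanfaat": 1225,
    "aplikasinya": 849, "mudah": 844,
}


def relative_frequency(count, total=TOTAL_TOKENS):
    """Return a word's count as a fraction of all tokens."""
    return count / total


for word, count in top_10.items():
    print(f"{word:<12} {count:>5} {relative_frequency(count):6.2%}")

# Combined share of the ten most frequent words.
print(f"top 10 combined: {relative_frequency(sum(top_10.values())):.2%}")
```

Together the ten words account for 23370 of the 68403 tokens, roughly a third of the corpus.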
