# 1. Multiclass sentiment analysis of the users
    
<strong>Thread app dataset: 37000 entities</strong>
    



<p style="text-align:right">V&iacute;ctor Viloria  (<em>ComputingVictor</em>)</p>



<hr style="border:1px solid gray">

# Structure

[Introduction](#introduccion) 

[1. Python libraries](#librerias) 

[2. Data Loading](#lectura) 

[3. Exploratory Data Analysis ](#EDA) 

   - 3.1 Shape and types
   - 3.2 Nulls
   - 3.3 Numerical Analysis
   - 3.4 Temporal Series Analysis
   
[4. Text transformation](#text) 


   - 4.1 Tokenizer
   - 4.2 Vectorization (TF-IDF)
   
[5. ML Models](#ml) 

   - 5.1 Logistic Regression
   - 5.2 LinearSVC
   - 5.3 LightGBM

[6. Conclusions](#conclusions) 

<hr style="border:1px solid gray">

# Introduction 

In this notebook we use the *Thread app dataset: 37000 entities*. We first preprocess the reviews posted on the Threads app for natural language processing (NLP), and then train several machine learning (ML) models to predict the sentiment of each review, using its rating as the target.

# 1. Python libraries

In [None]:
# Import pandas.

import pandas as pd
import numpy as np

# Import nltk.

import nltk
nltk.download('punkt')

# Import nltk stopwords.

nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter

# Import wordcloud and matplotlib.

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns

# Vectorizer.

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# ML models.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder

# Scores.

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import fbeta_score


# 2. Data Loading 


In [None]:
# Load the csv file.

threads_df = pd.read_csv("/kaggle/input/37000-reviews-of-thread-app-dataset/37000_reviews_of_thread_app.csv",index_col=[0])

# Display the first rows.

threads_df.head(5)

#  3. Exploratory data analysis 

### 3.1 Shape and types

In [None]:
# Print the shape of the dataframe.

print("The dataframe has {} rows and {} columns.".format(threads_df.shape[0], threads_df.shape[1]))

print("---------------------------------------------------------------------------------------")

# Print the column names and datatypes.

threads_df.info()

### 3.2 Nulls

In [None]:
# Print the number of null values in each column.

threads_df.isnull().sum()

We notice that some columns have a high number of nulls and are irrelevant for our analysis, so we drop them. For `thumbs_up`, we convert the nulls to 0.

In [None]:
# Drop the developer columns, appVersion and review_title variable.

threads_df = threads_df.drop(['review_title', 'developer_response', 'developer_response_date', 'appVersion'], axis=1)

# Convert nulls of thumbs_up to 0.

threads_df['thumbs_up'] = threads_df['thumbs_up'].fillna(0)

# Display the first rows of "threads_df".

threads_df.head(5)

In [None]:
# Print the number of null values in each column.

threads_df.isnull().sum()

### 3.3 Numerical Analysis

#### 3.3.1. Unique Values

First, we check the unique values of the most important variables.

In [None]:
# Check unique values
unique_sources = threads_df['source'].unique()
unique_ratings = threads_df['rating'].unique()
unique_languages = threads_df['laguage_code'].unique()
unique_countries = threads_df['country_code'].unique()
unique_thumbs_up = threads_df['thumbs_up'].unique()

print("Unique values for source:", unique_sources)
print("Unique values for rating:", unique_ratings)
print("Unique values for language_code:", unique_languages)
print("Unique values for country_code:", unique_countries)

We found multiple values in `rating` and `source`. Let's check the number of reviews by rating.

#### 3.3.2. Number of Reviews by rating

In [None]:
# Group by 'rating' and count reviews.
ratings_count = threads_df['rating'].value_counts().sort_index()

# Plot.
ratings_count.plot(kind='bar', color='skyblue')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.title('Number of Reviews by Rating')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

As we can see, the ratings are polarized: ratings 1 and 5 each have more than 10k reviews, while the other ratings have far fewer.
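The degree of polarization can also be expressed as numbers rather than read off the bar chart. The following is a minimal sketch on a hypothetical list of ratings (the values are invented for illustration and are not taken from the dataset); it computes the share of each rating and the ratio between the most and the least frequent class.

```python
from collections import Counter

# Hypothetical ratings, chosen only to mimic a polarized distribution.
ratings = [1, 1, 1, 1, 2, 3, 4, 5, 5, 5]

counts = Counter(ratings)
total = len(ratings)

# Share of each rating, ordered by rating value.
shares = {rating: counts[rating] / total for rating in sorted(counts)}

# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(imbalance_ratio)
```

On the real data, the same figures could be obtained from `ratings_count` by dividing it by its sum.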

#### 3.3.3. Top 5 of total thumbs in the dataset

In [None]:
# Count how many reviews have each 'thumbs_up' value.
thumbs_count = threads_df['thumbs_up'].value_counts().sort_values(ascending=False)

thumbs_count.head(5).plot(kind='bar')
plt.xlabel('Thumbs Up')
plt.ylabel('Number of Reviews')
plt.title('Top 5 Most Frequent Thumbs-Up Values')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

#### 3.3.4. Number of Reviews by Store

In [None]:
# Group by 'source' and count reviews.
source_count = threads_df['source'].value_counts().sort_index()

# Plot.
source_count.plot(kind='bar')
plt.xlabel('Store')
plt.ylabel('Number of Reviews')
plt.title('Number of Reviews by Store')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Most of the reviews come from Google Play.

#### 3.3.5. Users with most reviews.

In [None]:
# Count reviews by 'user_name', most frequent names first.
user_count = threads_df['user_name'].value_counts().sort_values(ascending=False)

user_count.head(5).plot(kind='bar')
plt.xlabel('User')
plt.ylabel('Number of Reviews')
plt.title('Number of Reviews by User')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

We can see that several user names appear more than once, either because different people share the same name or because the same account posted several reviews.

### 3.4 Temporal Series Analysis

In [None]:
# Convert 'review_date' to datetime and extract only the date part.

threads_df['review_date'] = pd.to_datetime(threads_df['review_date']).dt.date

# Group by 'review_date' (only the date) and count reviews.

daily_reviews = threads_df.groupby('review_date').size()


# Plot.

daily_reviews.plot(kind='line', marker='o')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.title('Total Reviews per Date')
plt.tight_layout()
plt.grid(True)
plt.xticks(rotation=45) 
plt.subplots_adjust(bottom=0.25)  
plt.show()

In the time series plot, we can see that most of the reviews arrived during the first days after the app's release. Seven days later, the number of reviews dropped considerably, reaching a low in August.
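One way to put a number on this observation is to compute the share of reviews posted within seven days of the first review date. The sketch below uses hypothetical dates (invented for illustration); on the real data they would come from the `review_date` column.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical review dates, not taken from the dataset.
review_dates = (
    [date(2023, 7, 6)] * 5
    + [date(2023, 7, 8)] * 3
    + [date(2023, 7, 20)] * 1
    + [date(2023, 8, 2)] * 1
)

daily_counts = Counter(review_dates)
first_day = min(daily_counts)
cutoff = first_day + timedelta(days=7)

# Reviews posted within seven days of the first review date.
first_week = sum(n for day, n in daily_counts.items() if day < cutoff)
share = first_week / len(review_dates)

print(f"{share:.0%} of the reviews arrived in the first week")
```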

# 4. Text transformation

### 4.1. Tokenizer

We apply some transformations to standardize the text: we convert it to lowercase, remove text emoticons and punctuation marks, and replace any line breaks with spaces.

In [None]:
# Convert the text column to lowercase.

threads_df['review_description'] = threads_df['review_description'].str.lower()

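# Delete text emoticons such as :) or ;-(.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
# The text is already lowercase, so "d" is matched as well as "D".

threads_df['review_description'] = threads_df['review_description'].str.replace(r'[:;=][-^]?[()\[\]{}@Dd|Pp$*+#]', '', regex=True)

# Delete punctuation signs.

threads_df['review_description'] = threads_df['review_description'].str.replace(r'[^\w\s]', '', regex=True)

# Replace line breaks (\n) with spaces.

threads_df['review_description'] = threads_df['review_description'].str.replace('\n', ' ', regex=False)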
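The cleaning steps above can be checked on a single string before applying them to the whole column. This is a minimal sketch using only the standard `re` module; the helper name `clean_text` is hypothetical, and the emoticon pattern is an assumption adapted from the one in the notebook.

```python
import re

# Text emoticons such as :) ;-( =D :p (assumed pattern, adapted from the notebook).
EMOTICON_RE = re.compile(r"[:;=][-^]?[()\[\]{}@Dd|Pp$*+#]")
# Anything that is neither a word character nor whitespace.
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_text(text: str) -> str:
    """Lowercase, strip emoticons and punctuation, and flatten line breaks."""
    text = text.lower()
    text = EMOTICON_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    return text.replace("\n", " ")


print(repr(clean_text("Great app :)\nBetter than Twitter!!!")))
```

The result still contains a double space where the emoticon and the line break used to be. That is harmless here, because the later tokenization step splits on whitespace.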

Once the transformation is done, we tokenize the text and delete the stopwords, so that the text is ready to be fed to the different models.

In [None]:
# Convert the review_description column to string.

threads_df['review_description'] = threads_df['review_description'].astype(str)

# Tokenize the text.

threads_df['review_description'] = threads_df['review_description'].apply(nltk.word_tokenize)

# Delete the stopwords from text.

stop_words = set(stopwords.words('english'))

threads_df['review_description'] = threads_df['review_description'].apply(lambda x: [item for item in x if item not in stop_words])

# Join the tokens of review_description back into a string, with a space between words.

threads_df['review_description'] = threads_df['review_description'].apply(lambda x: ' '.join(x))

# Display first 5 rows

threads_df.head()
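The tokenize, filter, and join sequence above can be illustrated without downloading any nltk resources. The sketch below makes two simplifying assumptions: it splits on whitespace instead of calling `nltk.word_tokenize`, and it uses a tiny hypothetical stop word set instead of nltk's English list.

```python
# A tiny, hypothetical stop word list; nltk's English list is much longer.
STOP_WORDS = {"the", "is", "a", "and", "to", "this"}


def remove_stop_words(text: str) -> str:
    """Split on whitespace, drop stop words, and join the rest with spaces."""
    tokens = text.split()  # simplification: nltk.word_tokenize is more careful
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("this app is a copy of twitter"))
```

One design point worth keeping in mind: nltk's English stop word list includes negations such as "not", so a review like "not good" is reduced to "good". Whether that matters for this task would need to be checked against the model scores.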

Let's check the wordclouds.

In [None]:
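# Create a wordcloud of the first 100 reviews.
# The reviews are joined into one string; str() on a Series only returns its truncated preview.

wordcloud = WordCloud().generate(' '.join(threads_df['review_description'].head(100)))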

plt.imshow(wordcloud, interpolation='bilinear')

plt.axis("off")

plt.show()


The most common words relate to the app's competitor, Twitter, and to the opposing opinions behind the two most common ratings we saw before: 1 and 5. Let's look at the most used words:

In [None]:
# Histogram of the top 10 most common words.

top_10_words = Counter(" ".join(threads_df['review_description']).split()).most_common(10)

top_10_df = pd.DataFrame(top_10_words)

top_10_df.columns=["Word", "Frequency"]

sns.barplot(x="Word", y="Frequency", data=top_10_df)

plt.title("Top 10 most common words")

plt.xlabel("Word")

plt.ylabel("Frequency")

plt.xticks(rotation=45)

plt.show()

Finally we will drop variables that won't be relevant for the training of the models.

In [None]:
# Drop irrelevant variables

threads_df = threads_df.drop(['user_name','laguage_code', 'country_code', 'review_id', 'source','review_date','thumbs_up'], axis=1)

# Display the first rows.

threads_df.head(5)

### 4.2. Vectorization (TF-IDF)

In [None]:
# Load the vectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.99,min_df=0.01)

# Transform the tokens.

tfidf_matrix = tfidf_vectorizer.fit_transform(threads_df['review_description'])
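To make the role of `min_df` and `max_df` more concrete, here is a plain-Python sketch of the idea. It is not scikit-learn's implementation: the function names and the toy corpus are hypothetical. It does use the smoothed idf form documented for scikit-learn's default settings, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. Terms whose document frequency falls outside the two bounds are dropped from the vocabulary.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned reviews.
docs = ["good app", "bad app", "good good design", "app crashes"]


def fit_idf(docs, min_df=0.0, max_df=1.0):
    """Return {term: idf} for terms whose document frequency is within bounds.

    min_df and max_df are fractions of the number of documents.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in df.items()
        if min_df * n_docs <= count <= max_df * n_docs
    }


def tfidf(doc, idf):
    """L2-normalised tf-idf weights of one document over the fitted vocabulary."""
    tf = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


idf = fit_idf(docs, min_df=0.5, max_df=1.0)
print(sorted(idf))
print(tfidf("good app", idf))
```

With `min_df=0.5`, only "app" and "good" survive in this toy corpus, and the rarer of the two ("good") receives the higher weight. In the notebook, `min_df=0.01` plays the same role: words that appear in fewer than 1% of the reviews are discarded.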

# 5. ML Models

Once the text is vectorized, we split the data into train and test sets. Since the target has several rating classes, we will try multiclass ML models.

In [None]:
# Split into X and y.

X = tfidf_matrix
y = threads_df['rating']

# Split into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12345, stratify=y)
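The `stratify=y` argument matters here because the ratings are imbalanced: it keeps the share of each rating roughly the same in the train and test sets. The sketch below illustrates the idea in plain Python; `stratified_split` is a hypothetical helper, not scikit-learn's algorithm, and the labels are invented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical polarized ratings: many 1s and 5s, few 3s.
labels = [1] * 40 + [3] * 8 + [5] * 52
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in test_idx))
```

Each class contributes a quarter of its rows to the test set, so even the rare rating 3 is represented there.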

### 5.1. Logistic Regression

In [None]:
# Load the Logistic Regression classifier.

clf = LogisticRegression()

# Train the model.

clf.fit(X_train, y_train)

In [None]:
# Predict the X_test.

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)


# Evaluation.

print(classification_report(y_test,y_pred))

# Print: the accuracy score.

print("Accuracy:",accuracy_score(y_test, y_pred))

# Print: F2 (micro average).

print("F2 micro:",fbeta_score(y_test, y_pred, beta=2, average='micro'))

# Print: F2 (macro average).

print("F2 macro:",fbeta_score(y_test, y_pred, beta=2, average='macro'))
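The difference between the micro and macro averages printed above can be sketched by hand. The functions below are hypothetical helpers written for illustration, not scikit-learn's code, and the labels are invented. The macro average gives every class the same weight, so a rare class that is never predicted correctly pulls it down. The micro average pools all decisions, and in a single-label multiclass problem like this one it coincides with the accuracy.

```python
def fbeta_per_class(y_true, y_pred, label, beta=2.0):
    """F-beta of one class, treating it as positive and the rest as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def macro_fbeta(y_true, y_pred, beta=2.0):
    """Unweighted mean of the per-class scores: every class counts the same."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [fbeta_per_class(y_true, y_pred, label, beta) for label in labels]
    return sum(scores) / len(scores)


def micro_fbeta(y_true, y_pred, beta=2.0):
    """Pool tp, fp and fn over all classes before computing the score."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # Each error is one false positive (predicted class) and one false negative (true class).
    errors = len(y_true) - tp
    denom = (1 + beta**2) * tp + beta**2 * errors + errors
    return (1 + beta**2) * tp / denom if denom else 0.0


# Hypothetical labels: ratings 1 and 5 are predicted well, rating 3 never is.
y_true = [1, 1, 1, 1, 5, 5, 5, 5, 3, 3]
y_pred = [1, 1, 1, 1, 5, 5, 5, 5, 1, 5]

print("F2 micro:", micro_fbeta(y_true, y_pred))
print("F2 macro:", macro_fbeta(y_true, y_pred))
```

In this toy case the micro score is 0.8, equal to the accuracy, while the macro score is noticeably lower because rating 3 scores zero.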

### 5.2. LinearSVC

In [None]:
# Load SVC and train the model.

clf = LinearSVC()
clf.fit(X_train, y_train)

# Predict the X_test.

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

In [None]:
# Evaluation.

print(classification_report(y_test,y_pred))

# Print: the accuracy score.

print("Accuracy:",accuracy_score(y_test, y_pred))

# Print: F2 (micro average).

print("F2 micro:",fbeta_score(y_test, y_pred, beta=2, average='micro'))

# Print: F2 (macro average).

print("F2 macro:",fbeta_score(y_test, y_pred, beta=2, average='macro'))

### 5.3. LightGBM

In [None]:
# Due to the format of the target we will use Label Encoder.

le = LabelEncoder()
y = le.fit_transform(threads_df['rating'])

# Then split the data into train and test sets again.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)


In [None]:
# LightGBM format.

d_train = lgb.Dataset(X_train, label=y_train)

# LightGBM parameters; adjust them as needed.
params = {
    'objective': 'multiclass',
    'num_class': len(np.unique(y)),
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# Train the model.
clf = lgb.train(params, d_train)

# Predict class probabilities for the test set.
y_pred = clf.predict(X_test, num_iteration=clf.best_iteration)

# Pick the class with the highest probability for each review.
y_pred_class = np.argmax(y_pred, axis=1)

# Evaluate the model.
accuracy = accuracy_score(y_test, y_pred_class)
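With the `multiclass` objective, the prediction is a matrix of class probabilities, and the predicted classes are encoded labels rather than ratings. The sketch below shows both steps in plain Python on invented probabilities. It assumes the ratings are 1 to 5 and that the label encoder sorts them, so they map to 0 to 4.

```python
# Hypothetical class probabilities: one row per review, one column per class.
probabilities = [
    [0.70, 0.10, 0.05, 0.05, 0.10],
    [0.10, 0.10, 0.15, 0.05, 0.60],
    [0.20, 0.15, 0.30, 0.25, 0.10],
]

# Assumed mapping: encoded label i corresponds to classes[i].
classes = [1, 2, 3, 4, 5]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


encoded = [argmax(row) for row in probabilities]
decoded = [classes[index] for index in encoded]

print(encoded)
print(decoded)
```

This is why a classification report computed on the encoded labels lists classes 0 to 4 instead of ratings 1 to 5. In the notebook, `le.inverse_transform(y_pred_class)` would perform the same decoding.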

In [None]:
# Evaluation.

print(classification_report(y_test,y_pred_class))

# Print: the accuracy score.

print("Accuracy:",accuracy_score(y_test, y_pred_class))

# Print: F2 (micro average).

print("F2 micro:",fbeta_score(y_test, y_pred_class, beta=2, average='micro'))

# Print: F2 (macro average).

print("F2 macro:",fbeta_score(y_test, y_pred_class, beta=2, average='macro'))

# 6. Conclusions 

After several trials with different models, the accuracy is acceptable (60-61%). However, the macro F2 and F1 scores are poor, because ratings 2, 3 and 4 are heavily underrepresented in the dataset compared with ratings 1 and 5.