# <h1 align="center"> Analyzing Visitor Comments on Hespress</h1>

# **1. Data Description and Representation**

## Datasets Source
- **Dataset Sources:**
  - [Tweet Sentiment Multilingual Dataset on Hugging Face](https://huggingface.co/datasets/cardiffnlp/tweet_sentiment_multilingual)
  - [DZ Sentiment YT Comments Dataset on Hugging Face](https://huggingface.co/datasets/Abdou/dz-sentiment-yt-comments)

## Datasets Overview
The project utilizes two datasets for multilingual sentiment analysis, sourced from Hugging Face:
1. **`tweet_sentiment_multilingual`**
2. **`dz-sentiment-yt-comments`**

This dataset consists of 50,016 samples of comments extracted from Algerian YouTube channels. It is manually annotated with 3 classes (the label column) and is not balanced. Here are the number of rows of each class:
- **`0`**: Negative 17,033 (34.06%)
- **`1`**: Neutral 11,136 (22.26%)
- **`2`**: Positive 21,847 (43.68%)

---

## File Formats

### **1. JSONL Files (from `tweet_sentiment_multilingual`)**
Each file in this dataset (`train.jsonl`, `test.jsonl`, `validate.jsonl`) is in JSON Lines format. Each line is a JSON object representing one data sample. 

#### Structure of JSONL File
Each line in the JSONL file has the following key-value pairs:
- **`text`**: A string containing the text of the tweet.
- **`label`**: An integer (0, 1, or 2) representing the sentiment of the text.

#### Example Entry in JSONL
 the text within the JSONL files is encoded in Unicode, specifically using escape sequences like \uXXXX for non-ASCII characters. This is common when dealing with different languages, special characters, or symbols that cannot be represented directly in ASCII.
```json
{"text": "RT @user: \u0625\u062d\u0635\u0627\u0626\u064a\u0629.. \u0627\u0633\u062a\u0634\u0647\u0627\u062f 96 \ufec3\ufed4\ufefc\u064b ...", "label": "0"}
{"text": "\u0644\u0627 \u0627\u0644\u0647 \u0627\u0644\u0627 \u0627\u0644\u0644\u0647\ud83d\udc9c#\u0623\u064a\u0641\u0648\u0646_\u0627\u0644\u0628\u0631\u0648\u0641\u064a\u0633\u0648\u0631 ...", "label": "1"}

Here, \u0625\u062d\u0635\u0627\u0626\u064a\u0629 is the Unicode sequence for إحصائية (Arabic word for "statistics").


# **2. Importing Required Libraries**

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import nltk
from nltk.corpus import stopwords
import string
import re
import matplotlib.pyplot as plt
import pandas as pd
import csv
import json

1. **TensorFlow (`tensorflow`)**: 
   - TensorFlow is an open-source library developed by Google for numerical computation and machine learning. It is widely used for building and training deep learning models.

2. **NumPy (`numpy`)**: 
   - NumPy is a powerful library for numerical operations in Python. It is particularly useful for handling multi-dimensional arrays and performing mathematical computations efficiently.

3. **NLTK (`nltk`)**:
   - NLTK (Natural Language Toolkit) is a powerful Python library used for working with human language data (text). It provides tools and resources for a variety of tasks related to natural language processing (NLP)

3. **Matplotlib (`matplotlib.pyplot`)**: 
   - Matplotlib is a plotting library for Python. It allows the creation of static, interactive, and animated visualizations, such as graphs and charts.

4. **Pandas (`pandas`)**: 
   - Pandas is a data manipulation and analysis library. It provides data structures like `DataFrame` and `Series` to handle and process structured data efficiently.

5. **CSV (`csv`)**: 
   - The CSV module is part of Python’s standard library. It provides functionality for reading from and writing to CSV (Comma-Separated Values) files.

6. **JSON (`json`)**: 
   - The JSON module is used for parsing and working with JSON (JavaScript Object Notation) data. JSON is a popular format for data interchange between systems.

This combination of libraries sets up the environment for machine learning, data manipulation, visualization, and working with structured data formats (CSV and JSON).


# **3. Exploring the Data**

In [None]:
# Load the data
data = pd.read_csv('data\dz-sentiment-yt-comments\ADArabic-3labels-50016.csv', sep=',', header=None)
# Print the first 5 rows of the dataframe.
print(data.head())
print('-'*50)
# Print the statistics of the data
print(data.describe())
print('-'*50)
# Check for missing values
print(data.isna().sum())


                                                   0      1
0                                               text  label
1       يا سي كريم الرئيس الذى تشتكى له هو أصله معين      1
2  حتى السعودية قاتلكم ماكمش عرب،واش بقا يا بعاصي...      0
3                              Thbliiii bravo souade      2
4          تحيالي ناس بن يزقن و لقصور في الغيبة 🌹🇩🇿🤣      2
--------------------------------------------------
            0      1
count   50017  50017
unique  50017      4
top      text      2
freq        1  21847
--------------------------------------------------
0    0
1    0
dtype: int64


1. **Load the Data**:
   - The dataset is loaded using `pandas.read_csv()`:
     - File: `'data\dz-sentiment-yt-comments\ADArabic-3labels-50016.csv'`
     - Parameters:
       - `sep=','`: Specifies that the data is comma-separated.
       - `header=None`: Indicates that the dataset does not have a header row, so no column names are assigned automatically.

2. **View the First 5 Rows**:
   - `data.head()`: Displays the first five rows of the dataset. This provides a quick preview of the structure and content of the data.

3. **Print Summary Statistics**:
   - `data.describe()`: Generates a summary of basic statistical details:
     - Includes count, mean, standard deviation, minimum, maximum, and quartile values for numeric columns.

4. **Check for Missing Values**:
   - `data.isna().sum()`: Identifies missing values in each column by summing up `NaN` (Not a Number) occurrences.

This step is crucial for understanding the dataset's structure, its key properties, and any data quality issues, such as missing values or irregularities.

---


In [13]:
jsonl_data = []
with open('data\\tweet_sentiment_multilingual\\train.jsonl', 'r', encoding='utf-8') as jsonl_file:
    for line in jsonl_file:
        json_line = json.loads(line)
        jsonl_data.append({'text': json_line['text'], 'label': json_line['label']})

print(jsonl_data[0])
print(jsonl_data[1])

{'text': 'RT @user: @user @user   وصلنا لاقتصاد اسوء من سوريا والعراق ومن غير حربانجاز ده ولا مش انجاز يا متعلمين يا بتوع المدا…', 'label': '0'}
{'text': 'كاني ويست، دريك، نيكي، بيونسيه، قاقا http', 'label': '1'}


#### Code Breakdown:
1. **Initialize Data Storage**:
   - `jsonl_data = []`: Creates an empty list to store the processed data.

2. **Open the JSONL File**:
   - `with open('data\\tweet_sentiment_multilingual\\train.jsonl', 'r', encoding='utf-8') as jsonl_file`:
     - Opens the `train.jsonl` file in read mode with UTF-8 encoding.
     - JSONL files contain JSON objects, one per line.

3. **Parse Each Line**:
   - The file is read line by line:
     - `json.loads(line)`: Converts each line from JSON format to a Python dictionary.
     - Extracted Data:
       - `'text'`: The text of the tweet/comment.
       - `'label'`: The sentiment label.
     - Appends the extracted information as a dictionary to the `jsonl_data` list.

4. **Preview the Data**:
   - `print(jsonl_data[0])` and `print(jsonl_data[1])`:
     - Prints the first two entries to verify the structure and content.

#### Purpose:
This process extracts the necessary data (`text` and `label`) from the **`tweet_sentiment_multilingual`** dataset, making it ready for further analysis or combination with the second dataset.

# **4. Data Loading and Preparation**

In [14]:
# method to load JSONL data
def load_jsonl(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as jsonl_file:
        for line in jsonl_file:
            json_line = json.loads(line)
            data.append({'text': json_line['text'], 'label': json_line['label']})
    return data
# Load JSONL data from three files
jsonl_data1 = load_jsonl('data\\tweet_sentiment_multilingual\\test.jsonl')
jsonl_data2 = load_jsonl('data\\tweet_sentiment_multilingual\\train.jsonl')
jsonl_data3 = load_jsonl('data\\tweet_sentiment_multilingual\\validation.jsonl')
# Load CSV data
csv_data = []
with open('data\\dz-sentiment-yt-comments\\ADArabic-3labels-50016.csv', 'r', encoding='utf-8') as csv_file:
    reader = csv.reader(csv_file)
    next(reader, None)  # Skip the header if there is one
    for row in reader:
        text = row[0]  # Assuming 'text' is in the first column
        label = row[1]  # Assuming 'label' is in the second column
        csv_data.append({'text': text, 'label': label})

# Combine the datasets
combined_data = jsonl_data1 + jsonl_data2 + jsonl_data2 + csv_data


# Define the labels and sentences
sentences = [data['text'] for data in combined_data]
labels = [data['label'] for data in combined_data]
# Convert labels to integers
labels = [int(label) for label in labels]
# Convert labels to one-hot encodings
labels = tf.keras.utils.to_categorical(labels, num_classes=3) # converting it to numpy array instead is also an option
# Print the first five sentences and labels
print(f"First five sentences:\n\n{sentences[:5]}")
print(f"First five labels:\n\n{labels[:5]}")


First five sentences:

['نوال الزغبي (الشاب خالد ليس عالمي) هههههههه أتفرجي على ها الفيديو يا مبتدئة http vía @user', 'تقول نوال الزغبي : http', 'نوال الزغبي لطيفه الفنانه الوحيده اللي كل الفيديو كليبات تبعها ماتسبب تلوث بصري ولا سمعي لو صوتها اقل من عادي', 'لما قالت نوال الزغبي لابقلها هاللقب فرحوا فانزها 😂😂😂كان لازم ياخدوها اهانة مش ثناء http', 'الفنانة نوال الزغبي سنة 90 http']
First five labels:

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]


#### Method to Load JSONL Data:
- **Function: `load_jsonl(file_path)`**
  - This function reads a JSONL (JSON Lines) file and extracts:
    - `'text'`: The content of the tweet/comment.
    - `'label'`: The sentiment label.
  - Each entry is stored as a dictionary and appended to a list.
  - Returns a list of dictionaries containing the loaded data.

#### Load Data from Multiple Sources:
1. **Load JSONL Data**:
   - The `load_jsonl()` function is used to read three JSONL files:
     - `test.jsonl`
     - `train.jsonl`
     - `validation.jsonl`
   - The data from these files is loaded into `jsonl_data1`, `jsonl_data2`, and `jsonl_data3`.

2. **Load CSV Data**:
   - The CSV file `ADArabic-3labels-50016.csv` is read using the `csv` module:
     - `text`: Assumed to be in the first column.
     - `label`: Assumed to be in the second column.
   - The data is stored as a list of dictionaries with `text` and `label` keys.

#### Combine Datasets:
- The data from JSONL files and the CSV file is combined into a single list called `combined_data`.

#### Data Transformation:
1. **Extract Sentences and Labels**:
   - `sentences`: A list containing all the text entries from `combined_data`.
   - `labels`: A list containing all the sentiment labels from `combined_data`.

2. **Convert Labels to Integers**:
   - The labels are converted from strings to integers using `int()` to ensure compatibility with machine learning models.

3. **One-Hot Encoding of Labels**:
   - `tf.keras.utils.to_categorical()` is used to convert the integer labels into one-hot encoded format with three classes (positive, neutral, negative).

#### Preview the Data:
- The first five sentences and their corresponding labels are printed to verify the data structure and transformation:
  - `sentences[:5]`: Displays the first five text entries.
  - `labels[:5]`: Displays the first five one-hot encoded labels.

#### Purpose:
This process integrates multiple datasets from different formats (JSONL and CSV) into a unified structure and prepares the data for training machine learning models. The use of one-hot encoding ensures compatibility with models expecting categorical labels.

---

### Removing Punctuation
- Use a regular expression to remove any non-word characters (except spaces) from the text.
- This will strip punctuation marks like periods, commas, exclamation marks, etc.

Example: `"Hello, world!"` becomes `"Hello world"`


In [None]:
# Function to remove punctuation
def remove_punctuation(text):
    # Remove punctuation using regex
    return re.sub(r'[^\w\s]', '', text)

# Example text
sample_text = "Hello, world! This is an example text with punctuation."

# Clean the text
cleaned_text = remove_punctuation(sample_text)
print(cleaned_text)  # Output: "Hello world This is an example text with punctuation"

### Removing Stop Words
- Stop words are common words in a language that are usually removed to improve model performance.
- You can use the `nltk` library to filter out stop words in the text.

Example: `"This is an example sentence"` becomes `"example sentence"`


In [None]:
# Download the stop words list
nltk.download('stopwords')

# Function to remove stop words
def remove_stopwords(text):
    stop_words = set(stopwords.words('arabic'))
    words = text.split()
    # Remove stop words
    return " ".join([word for word in words if word.lower() not in stop_words])

# Example text
sample_text = "This is an example sentence that contains stop words."

# Clean the text
cleaned_text = remove_stopwords(sample_text)
print(cleaned_text)  # Output: "example sentence contains stop words"

### Removing Duplicates
- Duplicates in the dataset can be removed by using `drop_duplicates` in pandas, ensuring each entry is unique.
- This step is especially useful when you have repeated text samples in the dataset.

Example: 
```python
df.drop_duplicates(subset='text')


In [None]:
# Example DataFrame
data = {'text': ["Hello world", "Hello world", "Python is great", "Python is great", "I love coding"]}
df = pd.DataFrame(data)

# Remove duplicate rows
df_unique = df.drop_duplicates(subset='text')

### Splitting the Data into Training and Validation Sets

In [15]:
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets this also shuffles the data
training_sentences, validation_sentences, training_labels, validation_labels = train_test_split(sentences, labels, test_size=0.2, random_state=42)

#### Importing `train_test_split`:
- **`train_test_split`** is a utility from the `sklearn.model_selection` module used to split datasets into training and testing (or validation) sets.
- It also shuffles the data to ensure a random distribution of samples in each split.

#### Splitting the Data:
1. **Inputs**:
   - `sentences`: List of all text entries (features).
   - `labels`: Corresponding one-hot encoded labels (targets).

2. **Output Variables**:
   - `training_sentences`: Sentences used for training the model.
   - `validation_sentences`: Sentences set aside for validation/testing.
   - `training_labels`: Labels corresponding to the training sentences.
   - `validation_labels`: Labels corresponding to the validation sentences.

3. **Parameters**:
   - `test_size=0.2`: Specifies that 20% of the data will be used for validation, while 80% is used for training.
   - `random_state=42`: Ensures reproducibility by controlling the randomness of the split. The same seed (`42`) will always produce the same split.

#### Purpose:
- Splitting the dataset allows the model to be trained on one portion of the data (`training_sentences` and `training_labels`) while being evaluated on an unseen portion (`validation_sentences` and `validation_labels`).
- This ensures the model generalizes well to new, unseen data.

---


### Tokenizing and Padding Sentences

In [20]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# Tokenize the sentences
def tokenize(sentences, vocab_size, oov_token, trunc_type, padding_type, max_length):
    tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences, padding=padding_type, truncating=trunc_type, maxlen=max_length)
    return sequences, padded, word_index, tokenizer

vocab_size = 10000
oov_token = "<OOV>"
trunc_type = 'post'
padding_type = 'post'
embedding_dim = 16
max_length = 100 # tweets don't usaually exceed 50 words

train_sequences, train_padded, word_index, tokenizer = tokenize(training_sentences, vocab_size, oov_token, trunc_type, padding_type, max_length)

test_padded = tokenizer.texts_to_sequences(validation_sentences)
test_padded = pad_sequences(test_padded, padding=padding_type, truncating=trunc_type, maxlen=max_length)

print(f"First sentence:\n\n {training_sentences[0]}")
print(f"First sentence tokenized:\n\n {train_sequences[0]}")
print(f"First sentence padded:\n\n {train_padded[0]}")
print(f"padded shape:\n\n {train_padded.shape}")
print("-"*50)
print(f"First validation sentence:\n\n {validation_sentences[0]}")
print(f"First validation sentence tokenized:\n\n {test_padded[0]}")
print(f"padded shape:\n\n {test_padded.shape}")



First sentence:

 هل تعلم ان النقطه (. )هيه نفسها 👈 (*)بس مسويه شعرها مثل ميريام فارس 🌚😂✋
First sentence tokenized:

 [95, 1423, 18, 1, 1, 1861, 1, 407, 1, 4665, 170, 821, 488, 1]
First sentence padded:

 [  95 1423   18    1    1 1861    1  407    1 4665  170  821  488    1
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]
padded shape:

 (43651, 100)
--------------------------------------------------
First validation sentence:

 شيئ تقشعر له الأبدان
First validation sentence tokenized:

 [1140    1  104    1    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0 

#### Text Tokenization and Padding:
1. **Purpose**:
   - Neural networks cannot directly process raw text, so text needs to be converted into numerical representations. This is done using tokenization and padding.

2. **Function: `tokenize()`**:
   - **Parameters**:
     - `sentences`: List of text data to tokenize.
     - `vocab_size`: Maximum size of the vocabulary. The most frequent `vocab_size` words are kept.
     - `oov_token`: Token used to replace words not found in the vocabulary (Out of Vocabulary).
     - `trunc_type`: Specifies how sequences longer than `max_length` are truncated (`'post'` truncates at the end).
     - `padding_type`: Specifies how sequences shorter than `max_length` are padded (`'post'` adds padding at the end).
     - `max_length`: Maximum allowed length of sequences.
   - **Process**:
     - `Tokenizer()`: Initializes a tokenizer with the specified vocabulary size and OOV token.
     - `fit_on_texts(sentences)`: Maps words in the `sentences` to unique integer indices.
     - `texts_to_sequences(sentences)`: Converts the sentences into sequences of integers.
     - `pad_sequences(sequences)`: Pads or truncates sequences to the specified `max_length`.

3. **Outputs**:
   - `sequences`: Tokenized sequences of integers.
   - `padded`: Padded/truncated sequences.
   - `word_index`: Dictionary mapping words to their token indices.
   - `tokenizer`: Tokenizer object for later use.

#### Hyperparameters:
- `vocab_size = 10000`: Limits the vocabulary to the top 10,000 words.
- `oov_token = "<OOV>"`: Assigns a special token for out-of-vocabulary words.
- `trunc_type = 'post'` and `padding_type = 'post'`: Truncates and pads at the end of sequences.
- `max_length = 50`: Limits the sequence length to 50 tokens (suitable for short text like tweets).

#### Tokenize Training Sentences:
- `train_sequences`: Tokenized integer sequences for the training data.
- `train_padded`: Padded/truncated sequences for training, ensuring uniform length.

#### Tokenize Validation Sentences:
- Validation data is tokenized and padded using the same tokenizer (`tokenizer`) to ensure consistency.

#### Results Preview:
1. **First Training Sentence**:
   - Raw text: The original sentence.
   - Tokenized: Converted to integers using `word_index`.
   - Padded: Adjusted to the `max_length` with padding added as necessary.

2. **First Validation Sentence**:
   - Similar steps as training data to ensure consistency.

3. **Shape of Padded Data**:
   - The shape of the padded arrays confirms uniform sequence length across the dataset.

#### Purpose:
- Tokenization and padding prepare the text data for input into machine learning models, ensuring compatibility with deep learning frameworks and maintaining sequence length uniformity.


# **5. Training The Model**

In [None]:
import tensorflow as tf

# Define the model architecture
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),  # Embedding layer
    tf.keras.layers.Dropout(0.2),  # Dropout to prevent overfitting
    tf.keras.layers.LSTM(64, return_sequences=False),  # LSTM layer with 64 units
    tf.keras.layers.Dense(64, activation='relu'),  # Dense layer with 64 units
    tf.keras.layers.Dense(3, activation='softmax')  # Output layer for 3 classes
])

# Compile the model
model.compile(loss='categorical_crossentropy',  # Using categorical crossentropy for multi-class classification
              optimizer='adam', 
              metrics=['accuracy'])

# LSTM Model for Text Classification

In this section, we build and train an LSTM-based neural network for text classification. LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) designed to capture long-term dependencies in sequential data. It is particularly effective for tasks like sentiment analysis, language modeling, and time series prediction.

We will use an LSTM layer to process the sequential data, followed by dense layers to output the predictions.

## Model Architecture

1. **Embedding Layer**: Converts words into dense vectors of fixed size.
2. **LSTM Layer**: Captures sequential dependencies in the text.
3. **Dense Layers**: Perform the final classification.
4. **Output Layer**: Softmax activation for multi-class classification.


## Explanation of the Model Layers

1. **Embedding Layer**: 
   - The embedding layer converts the input tokens (words) into dense vectors of fixed size. This is important because it maps the sparse one-hot encoded vectors into a continuous vector space where semantically similar words are closer together.

2. **Dropout Layer**: 
   - Dropout is used to reduce overfitting during training. It randomly sets a fraction of the input units to zero during training.

3. **LSTM Layer**: 
   - The LSTM layer is the core component of the model. It processes the sequence data and remembers information over long sequences. The `return_sequences=False` means that we are only interested in the final output of the LSTM and not the sequence of outputs at each time step.

4. **Dense Layer (64 units)**: 
   - This layer further processes the data by using a fully connected layer with 64 units and ReLU activation to learn the relationships between features.

5. **Output Layer (3 units)**: 
   - The output layer has 3 units, corresponding to the 3 classes of our classification task. The `softmax` activation ensures that the model outputs probabilities for each class.

---

In [31]:
history = model.fit(train_padded, training_labels, epochs=20, validation_data=(test_padded, validation_labels), verbose=2)

Epoch 1/20
1365/1365 - 6s - loss: 0.5121 - accuracy: 0.5923 - val_loss: 0.4336 - val_accuracy: 0.6781 - 6s/epoch - 4ms/step
Epoch 2/20
1365/1365 - 5s - loss: 0.3900 - accuracy: 0.7165 - val_loss: 0.4003 - val_accuracy: 0.7169 - 5s/epoch - 3ms/step
Epoch 3/20
1365/1365 - 8s - loss: 0.3422 - accuracy: 0.7695 - val_loss: 0.3861 - val_accuracy: 0.7346 - 8s/epoch - 6ms/step
Epoch 4/20
1365/1365 - 7s - loss: 0.3117 - accuracy: 0.7924 - val_loss: 0.3882 - val_accuracy: 0.7333 - 7s/epoch - 5ms/step
Epoch 5/20
1365/1365 - 8s - loss: 0.2887 - accuracy: 0.8095 - val_loss: 0.3947 - val_accuracy: 0.7370 - 8s/epoch - 6ms/step
Epoch 6/20
1365/1365 - 6s - loss: 0.2698 - accuracy: 0.8231 - val_loss: 0.4135 - val_accuracy: 0.7337 - 6s/epoch - 4ms/step
Epoch 7/20
1365/1365 - 7s - loss: 0.2538 - accuracy: 0.8335 - val_loss: 0.4191 - val_accuracy: 0.7322 - 7s/epoch - 5ms/step
Epoch 8/20
1365/1365 - 9s - loss: 0.2382 - accuracy: 0.8440 - val_loss: 0.4430 - val_accuracy: 0.7286 - 9s/epoch - 7ms/step
Epoch 9/

# **6. Model Training Results**

After training the model for 20 epochs, we can evaluate its performance on the validation set. The `fit` function will return the training and validation accuracy and loss, which can be plotted to analyze the model's performance over time.



In [None]:
# Get the class predictions from the probabilities
predictions = model.predict(test_padded)
predictions = np.argmax(predictions, axis=-1)  # Convert probabilities to class labels (integer)

if len(validation_labels.shape) > 1:  # Check if they are one-hot encoded
    validation_labels = np.argmax(validation_labels, axis=-1)

# Evaluate metrics
accuracy = accuracy_score(validation_labels, predictions)
precision = precision_score(validation_labels, predictions, average='weighted')
recall = recall_score(validation_labels, predictions, average='weighted')
f1 = f1_score(validation_labels, predictions, average='weighted')

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

Accuracy: 0.7214
Precision: 0.7140
Recall: 0.7214
F1-Score: 0.7161


# **7.Saving the model and tokenizer** 

In [33]:
# save the model')
history.model.save('model.keras')
tokenizer_json = tokenizer.to_json()
with open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(tokenizer_json)

After training the model, it's important to save both the model and tokenizer for future use (e.g., for inference or resuming training).

1. **Saving the Model**:
   - The model can be saved using the `model.save()` method in Keras. It will save the entire model architecture, weights, and training configuration.

2. **Saving the Tokenizer**:
   - The tokenizer can be saved as a JSON file using `tokenizer.to_json()`.

# **7. Simple Test**

In [41]:
# Load the saved model
model = tf.keras.models.load_model('model.keras')

# Load the saved tokenizer
with open('tokenizer.json', 'r', encoding='utf-8') as f:
    tokenizer_json = f.read()
    tokenizer = tokenizer_from_json(tokenizer_json)

# Example Arabic sentence
arabic_sentence = "احسن لاعب في أفريقيا"  

# Preprocess the sentence: Convert to sequence and pad
sequence = tokenizer.texts_to_sequences([arabic_sentence])
max_length = 100  # Ensure this matches the max_length used during training
padded_sequence = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=max_length, padding='post')

# Make predictions on the sentence
predictions = model.predict(padded_sequence)

# Get the predicted class (index of the highest probability)
predicted_class = np.argmax(predictions, axis=1)

# Print the predicted class
print(f"Predicted class: {predicted_class[0]}")  # Print the index of the predicted class

# Optionally, if you have class labels, you can map them to actual labels
class_labels = ["neutral", "negative", "positive"]  # Replace with your actual class labels
print(f"Predicted class label: {class_labels[predicted_class[0]]}")

Predicted class: 2
Predicted class label: positive
