# Text Classification Exam

Welcome to the Text Classification Practical Exam. In this exam, you will be tasked with building, training, and evaluating an NLP model to classify text data. You are provided with a labeled dataset containing both the text and its corresponding class labels.

Your objective is to develop a model that accurately predicts the class of the given text. Make sure to follow best practices in data preprocessing, model selection, and evaluation to achieve optimal results.

Good luck!
___

# Install and Import Needed Libraries

You can use `pyarabic` or any other library to pre-process and clean the Arabic text.

In [25]:
!pip install pyarabic



In [82]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyarabic.araby as araby
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,LSTM, Dense
from sklearn.model_selection import train_test_split

# Download the Dataset

Please note that you are allowed to take a subset of this dataset, the reason for that is it might take a long time to train the model on.

In [27]:
!kaggle datasets download -d khaledzsa/sanad
!unzip sanad.zip

Dataset URL: https://www.kaggle.com/datasets/khaledzsa/sanad
License(s): unknown
sanad.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  sanad.zip
replace sanad.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [83]:
df = pd.read_csv('/content/sanad.csv')
df.head()

Unnamed: 0,text,label
0,https://example.com/resource/الشاٌرقة -ْ محمِد...,Culture
1,https://example.com/resource/اَنِطٌلقّتَ ٍفٍيّ...,Culture
2,https://example.com/resource/أُقيًمٌتِ مِساءُ ...,Culture
3,https://example.com/resource/بٍاسُمةَ يًوٌنٍس ...,Culture
4,https://example.com/resource/قُرر اَتحِاد اًلْ...,Culture


In [84]:
df

Unnamed: 0,text,label
0,https://example.com/resource/الشاٌرقة -ْ محمِد...,Culture
1,https://example.com/resource/اَنِطٌلقّتَ ٍفٍيّ...,Culture
2,https://example.com/resource/أُقيًمٌتِ مِساءُ ...,Culture
3,https://example.com/resource/بٍاسُمةَ يًوٌنٍس ...,Culture
4,https://example.com/resource/قُرر اَتحِاد اًلْ...,Culture
...,...,...
39880,https://example.com/resource/أعلّنت شّركٌةً بّ...,Tech
39881,https://example.com/resource/بُتٍاَرٌيًخَ 28ْ ...,Tech
39882,https://example.com/resource/دبَيُ:َ «ُاَلخليج...,Tech
39883,https://example.com/resource/LٌG GًS2ْ9ً0 Coْo...,Tech


# Data Exploration

Before diving into preprocessing and model building, it’s important to first explore the dataset to understand its structure, distribution, and key characteristics. This step will help you gain insights into the data and guide your decisions in subsequent steps. Here’s what to consider:

1. **Inspect the Data**:
   Start by looking at the first few rows of the dataset to get a sense of its structure. Check the columns, data types, and a few sample entries. This helps to ensure that the data is loaded correctly and gives you an initial overview of the content.

2. **Check for Missing Values**:
   Identify if there are any missing values in the dataset.

3. **Distribution of Labels**:
   Examine the distribution of the target labels (classes).

4. **Text Data Characteristics (Bonus)**:
   Analyze the length of the text data. It is useful to calculate the number of words or characters in each text sample to understand how long the texts are. This will help you set a suitable `max_length` for tokenization and padding later. You can plot a histogram of text lengths to visualize the distribution.

5. **Common Words and Vocabulary (Bonus)**:
   Explore the most frequent words in the text data.

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39885 entries, 0 to 39884
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    39885 non-null  object
 1   label   39885 non-null  object
dtypes: object(2)
memory usage: 623.3+ KB


In [86]:
df.describe()

Unnamed: 0,text,label
count,39885,39885
unique,39885,7
top,https://example.com/resource/الشاٌرقة -ْ محمِد...,Politics
freq,1,6334


In [87]:
df.isnull().sum()

Unnamed: 0,0
text,0
label,0


In [88]:
df.duplicated().sum()

0

In [89]:
df.drop_duplicates(inplace=True)

In [90]:
df.duplicated().sum()

0

In [121]:
max_length = df['text']
for i in range(len(df)):
  if len(df['text'][i])>len(max_length):
    max_length = df['text'][i]
print(max_length)

0        https://example.com/resource/الشاٌرقة -ْ محمِد...
1        https://example.com/resource/اَنِطٌلقّتَ ٍفٍيّ...
2        https://example.com/resource/أُقيًمٌتِ مِساءُ ...
3        https://example.com/resource/بٍاسُمةَ يًوٌنٍس ...
4        https://example.com/resource/قُرر اَتحِاد اًلْ...
                               ...                        
39880    https://example.com/resource/أعلّنت شّركٌةً بّ...
39881    https://example.com/resource/بُتٍاَرٌيًخَ 28ْ ...
39882    https://example.com/resource/دبَيُ:َ «ُاَلخليج...
39883    https://example.com/resource/LٌG GًS2ْ9ً0 Coْo...
39884    https://example.com/resource/تنطًلقٍ اٌليومّ َ...
Name: text, Length: 39885, dtype: object


In [92]:
max_length = df['text'].apply(lambda x: len(x.split())).max()
max_length

700

# Data Cleaning and Preprocessing

In this section, we will focus on cleaning and filtering the dataset, preparing it for the text classification task. We will implement the following steps:

1. **Remove missing values**:
   First, we eliminate any rows with missing values to ensure the dataset is complete and consistent.

2. **Filter by text length (Bonus)**:
   To maintain a uniform dataset, we will filter the text samples by a specified word count range. This ensures that the texts are neither too short to lack context nor too long to introduce unnecessary complexity.

3. **Arabic stopwords loading**:
   We load a list of Arabic stopwords to filter out commonly used but contextually insignificant words. This is an important step for improving the performance of the model, as stopwords do not contribute valuable information.

4. **Text cleaning**:
   We apply a series of text cleaning steps to standardize and simplify the text data. This involves:
   - **Removing links (URLs)**: Any URLs present in the text are removed as they are not meaningful for classification purposes.
   - **Removing special characters and punctuation**: This step removes any non-alphabetical characters, ensuring the text only contains meaningful words.
   - **Removing Arabic diacritics (Tashkeel) and elongated letters (Tatweel)**: Diacritical marks and elongated letters are stripped out to standardize the text.
   - **Removing Arabic stopwords**: Words that are part of the stopwords list are removed, as they do not add value to the classification task.
   - **Stemming or Lemmmatization**: Either stemming or lemmatization is applied to reduce words to their root or base form.
   - **Normalizing Hamza**: Any variation of the Hamza character is normalized for consistency.

   **Note:** Most of these you can do using the library [PyArabic](https://pyarabic.readthedocs.io/ar/latest/README.html#features)

5. **Final cleanup**:
   Apply the cleanup function to the feature column.

By following these steps, the text will be cleaned, filtered, and ready for tokenization!

In [152]:
from pyarabic.araby import strip_tashkeel, strip_tatweel, normalize_hamza
import re

In [160]:
def remove_urls(text):
    url_pattern = re.compile(r'http\S+|www.\S+')
    return url_pattern.sub(r'', text)

df['text'] = df['text'].apply(remove_urls)

print(df['text'])

0         -ْ محمِد وِلدٌ َمحمْدُ ٌسّاٍلمَعرٍضت مًسُاءٍ ...
1         ٍفٍيّ مثلَ َهًذَه ّالأيِامٌ َمنْ الُعًاْمَ ًا...
2         مِساءُ َأمٍسٍ اٌلأَوٌل فيُ ِإكسّبٍو اٍلَشارًق...
3         يًوٌنٍس ْحّينًماً قًاْلُ صاحّبَ ُاٌلٌسَمَوِّ ...
4         اَتحِاد اًلْأَدًباًءْ ًوالْكِتًّاٌب الموريْتْ...
                               ...                        
39880     شّركٌةً بّاَنِاسٌوَنُيّك عن اٍطَلاّقٌ ّسلُسًل...
39881     28ْ ُمّارسٍ/ٍآذاٍر الماًضِي وّبَيّن ْالٌسّاعة...
39882     «ُاَلخليجٍ» أَبرٍمِت بلدّية دبِيٌ مع بلدٍيَة ...
39883     GًS2ْ9ً0 Coْoْkِie Frْesًhُهٍاْتًف متحرّك ٌجد...
39884     اٌليومّ َفّيٍ مركَزٌ ٍمعارضُ ًمُطار دبًي اُلد...
Name: text, Length: 39885, dtype: object


In [167]:
print(df['text'])

0         -ْ محمِد وِلدٌ َمحمْدُ ٌسّاٍلمَعرٍضت مًسُاءٍ ...
1         ٍفٍيّ مثلَ َهًذَه ّالأيِامٌ َمنْ الُعًاْمَ ًا...
2         مِساءُ َأمٍسٍ اٌلأَوٌل فيُ ِإكسّبٍو اٍلَشارًق...
3         يًوٌنٍس ْحّينًماً قًاْلُ صاحّبَ ُاٌلٌسَمَوِّ ...
4         اَتحِاد اًلْأَدًباًءْ ًوالْكِتًّاٌب الموريْتْ...
                               ...                        
39880     شّركٌةً بّاَنِاسٌوَنُيّك عن اٍطَلاّقٌ ّسلُسًل...
39881     28ْ ُمّارسٍ/ٍآذاٍر الماًضِي وّبَيّن ْالٌسّاعة...
39882     «ُاَلخليجٍ» أَبرٍمِت بلدّية دبِيٌ مع بلدٍيَة ...
39883     GًS2ْ9ً0 Coْoْkِie Frْesًhُهٍاْتًف متحرّك ٌجد...
39884     اٌليومّ َفّيٍ مركَزٌ ٍمعارضُ ًمُطار دبًي اُلد...
Name: text, Length: 39885, dtype: object


In [168]:
import pyarabic.number
an = pyarabic.number.ArNumbers()

In [169]:
pyarabic.araby.strip_tatweel(df['text'][0])

' -ْ محمِد وِلدٌ َمحمْدُ ٌسّاٍلمَعرٍضت مًسُاءٍ أًمسَ اَلأَول علَىِ ٍخْشبٌةّ مسرح قصَر ّالثْقًافةً ٌفي اِلشاْرقُة ٍاًلمسرِحية ٍاٍلسْعٍودْيَة ً"ْبَعيُدْاًّ عن اٌلسيطرة"ّ ًلِفرقةَ مَسّرُحَ ًاِلطِاٌئف،ّ ْمن ّتٌأٌليُف فُهًدِ ّرّدِةَ الحاّرثي، ّوٌإُخراًج َسٍاٌمَي ًصّاَلٍحٍ الزَهرانيّ،ْ ٌوُذّلكّ فٌيً َرٌابعٍةٌ ٌليالي ِالدورْةَ اَلأوٌلى ٌمن مًهرًجِانٌ اْلشْارِقُة لّلّمَسُرٌحْ الخليجي ٌ.تًبًدَأْ اْلْمسرٍحيٍة بثًلاٍثّة َأشخاٍصٍ ٍيجٍلْسِونَ ُفي قاعةِ مّكتبةِ،ٌ ًيِنهمُك كٌل َمْنهٌمٌ ّفي ُالقرِاءة بْشغَف،ِ ُثُم يبِدأِوْن ًفيَ الٍحوٌاُر لنٍكُتشُفُ ُأِنهمّ كاْنوّاٌ ْيِقُرَأّونْ ِروٍاياّت لُأستُاّذهم ًالكِاتِب اّلمٌبدَعَ ُالٍذيٍ مِاْت ُوْتِرْك ُرِوايٌاتَ فريَدَةً،ٍ ُرٌسم فْيهًا ِشْخُصًياُت ٍغاية فُي الُدُقٌة، ٌويتحدٌثونّ ْعنُ ضرورة تكِرٍيمَ ِأستْاّذُهًم،ٌ ويٍتِفِقون ُعُلْىّ طرٍيقّة ًخاصةّ َلًلتكرّيم وُهيٌ ِإُخرٍاّج ُشِخصِيٌاتَه ًمنُ ْرِواياٍتها ٍلٍتَعٍيٍش في ّاٍلوْاقًع، وِيًنتِقّونً شِخُصيًاتٌ مركزيُة، ٍأْولها ْاُلحِلاق ٍاٍلذي ٍكٍان ًطَيًبًاًُ،ٌ حْافٍظاً لأُسرّاٌر أهل الّحٍيُ، وكْاٍنً الٍجِميِعُ يًحبّ

In [170]:
pyarabic.araby.strip_tashkeel(df['text'][0])

' - محمد ولد محمد سالمعرضت مساء أمس الأول على خشبة مسرح قصر الثقافة في الشارقة المسرحية السعودية "بعيدا عن السيطرة" لفرقة مسرح الطائف، من تأليف فهد ردة الحارثي، وإخراج سامي صالح الزهراني، وذلك في رابعة ليالي الدورة الأولى من مهرجان الشارقة للمسرح الخليجي .تبدأ المسرحية بثلاثة أشخاص يجلسون في قاعة مكتبة، ينهمك كل منهم في القراءة بشغف، ثم يبدأون في الحوار لنكتشف أنهم كانوا يقرأون روايات لأستاذهم الكاتب المبدع الذي مات وترك روايات فريدة، رسم فيها شخصيات غاية في الدقة، ويتحدثون عن ضرورة تكريم أستاذهم، ويتفقون على طريقة خاصة للتكريم وهي إخراج شخصياته من رواياتها لتعيش في الواقع، وينتقون شخصيات مركزية، أولها الحلاق الذي كان طيبا، حافظا لأسرار أهل الحي، وكان الجميع يحبه، وحين لا يكون الشخص لديه ما يدفعه مقابل الحلاقة فإنه لا يطالبه بشيء، ثم يستخرجون حفار القبور الذي كان يقبر الجميع، ويردد دائما أن الدنيا فانية، وأن البقاء لله وحده، ثم يستخرجون الشاب المحب الذي ظل سنوات طويلة يحمل وردة وينتظر حبيبته التي رحلت عنه ولم تعد إليه حتى مات .يأخذ طلاب الأستاذ تلك الشخصيات إلى عالم جديد لتعيش فيه، ونك

In [171]:
pyarabic.araby.normalize_hamza(df['text'][0])

' -ْ محمِد وِلدٌ َمحمْدُ ٌسّاٍلمَعرٍضت مًسُاءٍ ءًمسَ اَلءَول علَىِ ٍخْشبٌةّ مسرح قصَر ّالثْقًافةً ٌفي اِلشاْرقُة ٍاًلمسرِحية ٍاٍلسْعٍودْيَة ً"ْبَعيُدْاًّ عن اٌلسيطرة"ّ ًلِفرقةَ مَسّرُحَ ًاِلطِاٌءف،ّ ْمن ّتٌءٌليُف فُهًدِ ّرّدِةَ الحاّرثي، ّوٌءُخراًج َسٍاٌمَي ًصّاَلٍحٍ الزَهرانيّ،ْ ٌوُذّلكّ فٌيً َرٌابعٍةٌ ٌليالي ِالدورْةَ اَلءوٌلى ٌمن مًهرًجِانٌ اْلشْارِقُة لّلّمَسُرٌحْ الخليجي ٌ.تًبًدَءْ اْلْمسرٍحيٍة بثًلاٍثّة َءشخاٍصٍ ٍيجٍلْسِونَ ُفي قاعةِ مّكتبةِ،ٌ ًيِنهمُك كٌل َمْنهٌمٌ ّفي ُالقرِاءة بْشغَف،ِ ُثُم يبِدءِوْن ًفيَ الٍحوٌاُر لنٍكُتشُفُ ُءِنهمّ كاْنوّاٌ ْيِقُرَءّونْ ِروٍاياّت لُءستُاّذهم ًالكِاتِب اّلمٌبدَعَ ُالٍذيٍ مِاْت ُوْتِرْك ُرِوايٌاتَ فريَدَةً،ٍ ُرٌسم فْيهًا ِشْخُصًياُت ٍغاية فُي الُدُقٌة، ٌويتحدٌثونّ ْعنُ ضرورة تكِرٍيمَ ِءستْاّذُهًم،ٌ ويٍتِفِقون ُعُلْىّ طرٍيقّة ًخاصةّ َلًلتكرّيم وُهيٌ ِءُخرٍاّج ُشِخصِيٌاتَه ًمنُ ْرِواياٍتها ٍلٍتَعٍيٍش في ّاٍلوْاقًع، وِيًنتِقّونً شِخُصيًاتٌ مركزيُة، ٍءْولها ْاُلحِلاق ٍاٍلذي ٍكٍان ًطَيًبًاًُ،ٌ حْافٍظاً لءُسرّاٌر ءهل الّحٍيُ، وكْاٍنً الٍجِميِعُ يًحبّ

In [172]:
pyarabic.araby.strip_diacritics(df['text'][0])

' - محمد ولد محمد سالمعرضت مساء أمس الأول على خشبة مسرح قصر الثقافة في الشارقة المسرحية السعودية "بعيدا عن السيطرة" لفرقة مسرح الطائف، من تأليف فهد ردة الحارثي، وإخراج سامي صالح الزهراني، وذلك في رابعة ليالي الدورة الأولى من مهرجان الشارقة للمسرح الخليجي .تبدأ المسرحية بثلاثة أشخاص يجلسون في قاعة مكتبة، ينهمك كل منهم في القراءة بشغف، ثم يبدأون في الحوار لنكتشف أنهم كانوا يقرأون روايات لأستاذهم الكاتب المبدع الذي مات وترك روايات فريدة، رسم فيها شخصيات غاية في الدقة، ويتحدثون عن ضرورة تكريم أستاذهم، ويتفقون على طريقة خاصة للتكريم وهي إخراج شخصياته من رواياتها لتعيش في الواقع، وينتقون شخصيات مركزية، أولها الحلاق الذي كان طيبا، حافظا لأسرار أهل الحي، وكان الجميع يحبه، وحين لا يكون الشخص لديه ما يدفعه مقابل الحلاقة فإنه لا يطالبه بشيء، ثم يستخرجون حفار القبور الذي كان يقبر الجميع، ويردد دائما أن الدنيا فانية، وأن البقاء لله وحده، ثم يستخرجون الشاب المحب الذي ظل سنوات طويلة يحمل وردة وينتظر حبيبته التي رحلت عنه ولم تعد إليه حتى مات .يأخذ طلاب الأستاذ تلك الشخصيات إلى عالم جديد لتعيش فيه، ونك

In [173]:
pyarabic.araby.strip_harakat(df['text'][0])

' - محمد ولد محمد سّالمعرضت مساء أمس الأول على خشبةّ مسرح قصر ّالثقافة في الشارقة المسرحية السعودية "بعيداّ عن السيطرة"ّ لفرقة مسّرح الطائف،ّ من ّتأليف فهد ّرّدة الحاّرثي، ّوإخراج سامي صّالح الزهرانيّ، وذّلكّ في رابعة ليالي الدورة الأولى من مهرجان الشارقة لّلّمسرح الخليجي .تبدأ المسرحية بثلاثّة أشخاص يجلسون في قاعة مّكتبة، ينهمك كل منهم ّفي القراءة بشغف، ثم يبدأون في الحوار لنكتشف أنهمّ كانوّا يقرأّون رواياّت لأستاّذهم الكاتب اّلمبدع الذي مات وترك روايات فريدة، رسم فيها شخصيات غاية في الدقة، ويتحدثونّ عن ضرورة تكريم أستاّذهم، ويتفقون علىّ طريقّة خاصةّ للتكرّيم وهي إخراّج شخصياته من رواياتها لتعيش في ّالواقع، وينتقّون شخصيات مركزية، أولها الحلاق الذي كان طيبا، حافظا لأسرّار أهل الّحي، وكان الجميع يحبّه، وحين لا يكون الشخص لديه مّا يدفّعه مقابل الحلاقة فإنهّ لا يطالبه بشيء، ثّم يستخرجّون حفار الّقبّور الذي كان يقبر الجميع، ويرددّ دائما أنّ الدنيا فانية، وأن البقاء لله وحده، ثم يستّخرجون الشاب اّلمحب الذي ظل سنوات طويلة يحمل وردة وينتظر حبّيبته التي رحلتّ عنّه وّلم تّعد إليه حتىّ مات .يأخ

In [174]:
araby.normalize_hamza(df['text'], method='tasheel')

Unnamed: 0,text
0,-ْ محمِد وِلدٌ َمحمْدُ ٌسّاٍلمَعرٍضت مًسُاءٍ ...
1,ٍفٍيّ مثلَ َهًذَه ّالأيِامٌ َمنْ الُعًاْمَ ًا...
2,مِساءُ َأمٍسٍ اٌلأَوٌل فيُ ِإكسّبٍو اٍلَشارًق...
3,يًوٌنٍس ْحّينًماً قًاْلُ صاحّبَ ُاٌلٌسَمَوِّ ...
4,اَتحِاد اًلْأَدًباًءْ ًوالْكِتًّاٌب الموريْتْ...
...,...
39880,شّركٌةً بّاَنِاسٌوَنُيّك عن اٍطَلاّقٌ ّسلُسًل...
39881,28ْ ُمّارسٍ/ٍآذاٍر الماًضِي وّبَيّن ْالٌسّاعة...
39882,«ُاَلخليجٍ» أَبرٍمِت بلدّية دبِيٌ مع بلدٍيَة ...
39883,GًS2ْ9ً0 Coْoْkِie Frْesًhُهٍاْتًف متحرّك ٌجد...


In [175]:
def filter_word(texts, min_words, max_words):
    filtered_texts = []
    for text in texts:
        word_count = len(text.split())
        if min_words <= word_count <= max_words:
            filtered_texts.append(text)
    return filtered_texts
    print(filtered_texts)
print(df['text'])

0         -ْ محمِد وِلدٌ َمحمْدُ ٌسّاٍلمَعرٍضت مًسُاءٍ ...
1         ٍفٍيّ مثلَ َهًذَه ّالأيِامٌ َمنْ الُعًاْمَ ًا...
2         مِساءُ َأمٍسٍ اٌلأَوٌل فيُ ِإكسّبٍو اٍلَشارًق...
3         يًوٌنٍس ْحّينًماً قًاْلُ صاحّبَ ُاٌلٌسَمَوِّ ...
4         اَتحِاد اًلْأَدًباًءْ ًوالْكِتًّاٌب الموريْتْ...
                               ...                        
39880     شّركٌةً بّاَنِاسٌوَنُيّك عن اٍطَلاّقٌ ّسلُسًل...
39881     28ْ ُمّارسٍ/ٍآذاٍر الماًضِي وّبَيّن ْالٌسّاعة...
39882     «ُاَلخليجٍ» أَبرٍمِت بلدّية دبِيٌ مع بلدٍيَة ...
39883     GًS2ْ9ً0 Coْoْkِie Frْesًhُهٍاْتًف متحرّك ٌجد...
39884     اٌليومّ َفّيٍ مركَزٌ ٍمعارضُ ًمُطار دبًي اُلد...
Name: text, Length: 39885, dtype: object


# Tokenization, Padding, and Data Splitting

In this step, we will prepare the text data for input into a model by converting the text into numerical sequences, padding them to a uniform length, and splitting the dataset into training and testing sets. Here's an overview of the steps involved:

1. **Tokenization**:
   We use a tokenizer to convert the cleaned text into numerical sequences. You can use `Tokenizer` tokenizer from `tensorflow.keras.preprocessing.text` package or any other tokenizer you like.

2. **Text to sequences**:
   After fitting the tokenizer on the cleaned text, we transform each text into a sequence of numbers, where each number corresponds to a token (word) in the text.

3. **Padding the sequences**:
   Since different texts may vary in length, we pad the sequences to ensure they all have the same length.

4. **Label encoding**:
   The labels (target values) also need to be converted into numerical form if they are not encoded.

5. **Train-test split**:
   The dataset is divided into training and testing sets. We allocate 80% of the data for training the model and reserve 20% for testing its performance.
   
   - The **training data** consists of the padded sequences used to train the model.
   - The **training labels** are the encoded labels corresponding to the training data.
   - The **testing data** is used to assess the model’s performance after training.
   - The **testing labels** are the encoded labels corresponding to the testing data.

6. **Data shape confirmation**:
   After splitting the data, we print the shape (dimensions) of both the training and testing sets to confirm that the data is properly divided and formatted.

By the end of this step, the text data will be transformed into padded numerical sequences, the labels will be encoded, and the data will be split into training and testing sets for model development and evaluation.

In [176]:
text_data = df['text']
label = df['label']

In [177]:
max_length = 10
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')

In [178]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(label)

In [179]:
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

print("after train:", X_train.shape)
print("aftre test:", X_test.shape)

after train: (31908, 10)
aftre test: (7977, 10)


# Building the Classifier

In this step, you will design and build a NLP Classifier model to classify text data. Below is a breakdown of the key components you'll implement, but it's up to you to decide how to configure them based on your understanding and experimentation:

1. **Model Type**:
   You will use a Sequential model, which allows you to stack layers in a linear sequence.

2. **Input Layer**:
   Define the shape of the input data. Consider the dimensions of your padded sequences and set the input shape accordingly.

3. **Embedding Layer**:
   The embedding layer will convert input tokens (integers) into dense vector representations. You will need to determine the size of the input dimension (based on your vocabulary) and the output dimension (embedding size).

4. **Bidirectional Simple RNN/LSTM Layers**:
   You can add one or more recurrent layers. Consider using Bidirectional layers to capture contextual information from both directions (forward and backward). You can chose SimpleRNN/GRU/LSTM to perform this step.

5. **Dense Layers**:
   Add one or more fully connected (Dense) layers to process the output from the RNN/GRU/LSTM layers.

6. **Output Layer**:
   The output layer should match the type of classification task you're working on. Consider using appropriate activation function with appropriate number of units.

7. **Model Summary**:
   After defining your model architecture, print a summary to review the number of layers, types of layers, and total parameters.

8. **Model Compilation**:
   Finally, compile the model by selecting an optimizer, a loss function, and metrics.

In [180]:
vocab_size = 10000
max_length = 250

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])

train_data = pad_sequences(sequences, maxlen=max_length, padding='post')
test_data = pad_sequences(sequences, maxlen=max_length, padding='post')

In [181]:
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_length,)),
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()



In [182]:
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)

# Defining Batch Size, Creating Datasets, and Training the Model

In this step, you will define the batch size, create TensorFlow Datasets for both training and testing, and train the model. The key elements to consider are outlined below, and it is up to you to choose the specific configurations based on your preferences and experimentation:

1. **Batch Size**:
   Select a batch size for training and testing. The batch size determines how many samples will be processed together in one forward and backward pass during training.

2. **Creating Datasets**:
   Use TensorFlow’s `Dataset.from_tensor_slices()` to create datasets from the training and testing data.

3. **Batching the Datasets**:
   Batch the datasets by grouping the data into batches of the specified size.

4. **Training the Model**:
   Train the model by fitting it on the training dataset for a specified number of epochs. You will also need to provide the validation data to monitor the model’s performance on unseen data during training.

5. **Tracking Training History**:
   During training, the model’s performance metrics (such as loss and accuracy) will be tracked over the epochs, and the results will be stored in the `history` object.

In [183]:
batch_size = 32

train_data = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
test_data = tf.data.Dataset.from_tensor_slices((test_data, test_labels))

train_dataset = train_data.batch(batch_size)
test_dataset = test_data.batch(batch_size)

NameError: name 'train_labels' is not defined

In [184]:
history = model.fit(
    train_dataset,
    epochs=5,
    validation_data=test_dataset,
)

Epoch 1/5


InvalidArgumentError: Graph execution error:

Detected at node sequential_8_1/embedding_8_1/GatherV2 defined at (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code

  File "/usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py", line 37, in <module>

  File "/usr/local/lib/python3.10/dist-packages/traitlets/config/application.py", line 992, in launch_instance

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelapp.py", line 619, in start

  File "/usr/local/lib/python3.10/dist-packages/tornado/platform/asyncio.py", line 195, in start

  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 685, in <lambda>

  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 738, in _run_callback

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 825, in inner

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 786, in run

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 361, in process_one

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 261, in dispatch_shell

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 539, in execute_request

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py", line 302, in do_execute

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/zmqshell.py", line 539, in run_cell

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 2975, in run_cell

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3030, in _run_cell

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3257, in run_cell_async

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3473, in run_ast_nodes

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code

  File "<ipython-input-184-e5ba00a028a7>", line 1, in <cell line: 1>

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 318, in fit

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 121, in one_step_on_iterator

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 108, in one_step_on_data

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 51, in train_step

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py", line 882, in __call__

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py", line 46, in __call__

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/models/sequential.py", line 209, in call

  File "/usr/local/lib/python3.10/dist-packages/keras/src/models/functional.py", line 175, in call

  File "/usr/local/lib/python3.10/dist-packages/keras/src/ops/function.py", line 171, in _run_through_graph

  File "/usr/local/lib/python3.10/dist-packages/keras/src/models/functional.py", line 556, in call

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/layers/layer.py", line 882, in __call__

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/ops/operation.py", line 46, in __call__

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/layers/core/embedding.py", line 140, in call

  File "/usr/local/lib/python3.10/dist-packages/keras/src/ops/numpy.py", line 4875, in take

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/numpy.py", line 1951, in take

indices[0,0] = 6060343 is not in [0, 10000)
	 [[{{node sequential_8_1/embedding_8_1/GatherV2}}]] [Op:__inference_one_step_on_iterator_34360]

# Model Evaluation

Once the model is trained, the next step is to evaluate its performance on the testing dataset.

1. **Evaluate the Model**:
   You will use the `evaluate()` method to assess the model’s performance on the test dataset.

2. **Testing Dataset**:
   Ensure that the testing dataset is properly prepared and batched, just like the training dataset.

4. **Loss Curve**:
   A loss curve plots the loss values for both the training and validation datasets over the epochs.

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Curve Over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Model Inference

In this step, you will use the trained model to make predictions on new, unseen data (inference). Here’s an outline of the key points:

1. **Create Test Sample**:
   Create a string to test your modelm the goal here is to give the model Before making predictions, ensure that the new data is preprocessed in the same way as the training data. This includes tokenization, padding, and any other transformations you applied during the data preprocessing step. The data can be single text to see the result of the prediction.

2. **Model Prediction**:
   Use the `predict()` method to feed new samples into the trained model and obtain predictions. The model will output probabilities or predicted class labels based on the type of classification task (binary or multi-class).

3. **Interpreting Predictions**:
   The model will return probabilities for each class.

# Notebook Question:
- How did you handle text preprocessing? Why did you choose this approach?

- Why did you choose this model design?

- Why did you pick this number of layers or units for the model?

- Why did you select these evaluation methods?

- Does your model show signs of overfitting or underfitting? How do you know?

- What changes could you make to improve the model and fix overfitting or underfitting?

Answer Here: