# Text Preprocessing and Word2Vec Representation
## Agenda:
* 1- **Introduction** - Overview of text preprocessing and Word2Vec.
* 2- **Arabic Text Preprocessing** - Explanation and implementation.
* 3- **English Text Preprocessing** - Explanation and implementation.
* 4- **Word2Vec Representation** - Explanation and implementation.
* 5- **TF-IDF Representation** - Computing TF-IDF weights for the cleaned tokens.
* 6- **Summary** - Reflection and key takeaways.

# 1- **Introduction**
In this notebook,
> 1- **we will preprocess two texts**,
 
 >> 1- arabic_text = "مرحباً بكم في AI، انه ممتع جدا ورائع! 😉 هذا نص تجريبي باللغة العربية."
 
 >> 2- english_text = "Hello, welcome to AI! It's super fun & amaizing :) This is a test text."

 > 2- **generate Word2Vec embeddings**
 
 > 3- **compute TF-IDF weights for the cleaned tokens**.

# **2- Arabic Text Preprocessing**
> Arabic text often contains diacritics, punctuation, and a mix of English and Arabic words. We'll normalize, tokenize, remove stopwords, and correct spelling issues to prepare the text.

## 2.1- Arabic text

In [31]:
! pip install farasapy
!pip install pyspellchecker
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from farasa.segmenter import FarasaSegmenter
from spellchecker import SpellChecker
import nltk

nltk.download('stopwords')
nltk.download('punkt_tab')

# Sample Arabic text
arabic_text = "مرحباً بكم في AI، انه ممتع جدا ورائع! 😉 هذا نص تجريبي باللغة العربية."



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


## 2.2- Normalize Text

In [33]:
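def normalize_arabic(text):
    # Unify alef variants (إ, أ, آ) into a bare alef
    text = re.sub(r'[إأآا]', 'ا', text)
    # Map hamza carriers (ئ, ؤ) to a standalone hamza; this is why ورائع becomes وراءع in the output
    text = re.sub(r'ء|ئ|ؤ', 'ء', text)
    # Drop everything that is not a word character or whitespace (punctuation, emoji, diacritics)
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase embedded Latin-script words such as "AI"
    text = text.lower()
    return text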

normalized_arabic_text = normalize_arabic(arabic_text)
print(arabic_text)
print(normalized_arabic_text)

مرحباً بكم في AI، انه ممتع جدا ورائع! 😉 هذا نص تجريبي باللغة العربية.
مرحبا بكم في ai انه ممتع جدا وراءع  هذا نص تجريبي باللغة العربية
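The output above shows that `[^\w\s]` also removes the tanween on مرحباً, because Python's `\w` does not match combining marks. When punctuation needs to be kept, diacritics can be stripped on their own. The following is a minimal sketch, assuming the common tashkeel range U+064B to U+0652 plus the tatweel (U+0640) is sufficient; `strip_diacritics` is a hypothetical helper name, not part of the notebook.

```python
import re

# Assumption: tashkeel marks U+064B..U+0652 and the tatweel U+0640 cover the
# diacritics of interest; extend the character class for other marks.
ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')

def strip_diacritics(text):
    """Remove Arabic diacritics and tatweel, leaving letters and punctuation intact."""
    return ARABIC_DIACRITICS.sub('', text)

# The greeting from the sample text, written with escapes, followed by "!"
sample = "\u0645\u0631\u062d\u0628\u0627\u064b!"
print(strip_diacritics(sample))
```

Unlike the punctuation-stripping regex, this leaves the exclamation mark in place, so it can run before or after other cleaning steps.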


## 2.3- Tokenize and remove stopwords

In [35]:
from nltk.tokenize import word_tokenize
arabic_tokens = word_tokenize(normalized_arabic_text)

arabic_stopwords = set(stopwords.words('arabic'))
cleaned_arabic_tokens = [word for word in arabic_tokens if word not in arabic_stopwords]

print(arabic_text)
print(cleaned_arabic_tokens)

مرحباً بكم في AI، انه ممتع جدا ورائع! 😉 هذا نص تجريبي باللغة العربية.
['مرحبا', 'ai', 'انه', 'ممتع', 'جدا', 'وراءع', 'نص', 'تجريبي', 'باللغة', 'العربية']


## 2.4- Spelling correction

In [40]:
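spell = SpellChecker(language='ar')
# correction() can return None when no candidate is found, so fall back to the original token
corrected_arabic_tokens = [spell.correction(word) or word for word in cleaned_arabic_tokens]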

print("Cleaned Arabic Tokens:", corrected_arabic_tokens)

Cleaned Arabic Tokens: ['مرحبا', 'من', 'انه', 'ممتع', 'جدا', 'وراء', 'نص', 'تجريبي', 'باللغة', 'العربية']
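In the output above, the Latin-script token `ai` was replaced with `من` and `وراءع` was shortened to `وراء`, so running the Arabic dictionary over every token can do harm. One possible guard, sketched below, leaves Latin-script tokens untouched and keeps the original token whenever the corrector returns nothing. `correct_safely` and the `correct` callable are hypothetical names; in this notebook `correct` would be `spell.correction`.

```python
import re

# Tokens made only of ASCII letters and digits are treated as Latin-script.
LATIN_TOKEN = re.compile(r'[A-Za-z0-9]+')

def correct_safely(tokens, correct):
    """Apply `correct` to non-Latin tokens, falling back to the original token."""
    corrected = []
    for token in tokens:
        if LATIN_TOKEN.fullmatch(token):
            corrected.append(token)  # skip the Arabic dictionary for e.g. "ai"
            continue
        suggestion = correct(token)
        corrected.append(suggestion if suggestion else token)
    return corrected

# Stand-in corrector for illustration: it has no suggestion for any token.
tokens = ["ai", "\u0646\u0635"]
print(correct_safely(tokens, lambda token: None))
```

This guard does not address the hamza issue: whether `وراءع` should be corrected at all depends on the normalization step, which is a separate design decision.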


# 3- **English Text Preprocessing**
> English text issues include contractions, slang, and punctuation problems. We'll normalize, tokenize, remove stopwords, and correct spelling.

## 3.1- English text

In [43]:
# Sample English text
english_text = "Hello, welcome to AI! It's super fun & amaizing :) This is a test text."

## 3.2- Normalize text

In [46]:
def normalize_english(text):
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    return text

normalized_english_text = normalize_english(english_text)
print(english_text)
print(normalized_english_text)

Hello, welcome to AI! It's super fun & amaizing :) This is a test text.
hello welcome to ai its super fun  amaizing  this is a test text
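The output above turns "It's" into "its", which changes the word rather than expanding the contraction. A minimal sketch of one alternative follows: expand contractions first, then strip punctuation and collapse repeated spaces. The `CONTRACTIONS` mapping and the function name `normalize_english_expanded` are illustrative assumptions, and a real mapping would need many more entries.

```python
import re

# Assumption: a tiny hand-written mapping, for illustration only.
CONTRACTIONS = {"it's": "it is", "don't": "do not", "i'm": "i am"}

def normalize_english_expanded(text):
    """Lowercase, expand known contractions, drop punctuation, collapse spaces."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(normalize_english_expanded(
    "Hello, welcome to AI! It's super fun & amaizing :) This is a test text."
))
```

Replacing punctuation with a space instead of an empty string avoids gluing words together, and the final substitution removes the double spaces visible in the original output.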


## 3.3- Tokenize and remove stopwords

In [49]:
from nltk.tokenize import word_tokenize

english_tokens = word_tokenize(normalized_english_text)

english_stopwords = set(stopwords.words('english'))
cleaned_english_tokens = [word for word in english_tokens if word not in english_stopwords]

print(normalized_english_text)
print(cleaned_english_tokens)

hello welcome to ai its super fun  amaizing  this is a test text
['hello', 'welcome', 'ai', 'super', 'fun', 'amaizing', 'test', 'text']


## 3.4- Spelling correction

In [52]:
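spell = SpellChecker()
# correction() can return None when no candidate is found, so fall back to the original token
corrected_english_tokens = [spell.correction(word) or word for word in cleaned_english_tokens]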

print("Cleaned English Tokens:", corrected_english_tokens)

Cleaned English Tokens: ['hello', 'welcome', 'ai', 'super', 'fun', 'amazing', 'test', 'text']


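# **4- Word2Vec Representation**
> Word2Vec maps each word to a dense vector so that words used in similar contexts end up close together in the vector space. Each model below is trained on a single short sentence, so the resulting vectors demonstrate the API rather than meaningful semantics.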


In [63]:
! pip install scipy
! pip install gensim

from gensim.models import Word2Vec

# Train Word2Vec models
arabic_model = Word2Vec([corrected_arabic_tokens], vector_size=100, window=5, min_count=1, workers=4)
english_model = Word2Vec([corrected_english_tokens], vector_size=100, window=5, min_count=1, workers=4)

# Extract embeddings
arabic_words = list(arabic_model.wv.index_to_key)[:5]
english_words = list(english_model.wv.index_to_key)[:5]

print("\nArabic Words and Embeddings:\n", arabic_words)
# print([arabic_model.wv[word] for word in arabic_words])  # ---> vector

print("\nEnglish Words and Embeddings:\n", english_words)
# print([english_model.wv[word] for word in english_words])  # ---> vector



Arabic Words and Embeddings:
 ['العربية', 'باللغة', 'تجريبي', 'نص', 'وراء']

English Words and Embeddings:
 ['text', 'test', 'amazing', 'fun', 'super']
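Vectors in this space are usually compared with cosine similarity, which gensim also exposes through `model.wv.similarity(w1, w2)`. As a minimal sketch of the underlying computation, the example below uses plain Python lists in place of real embeddings; the vectors are made-up illustrations, not output from the models above, and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Made-up 2-D vectors: same direction, orthogonal, and opposite direction.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions, and values near -1 indicate opposite directions.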


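# **5- TF-IDF**
> TF-IDF weights each term by how often it appears in a document, scaled down by how many documents contain it.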

In [65]:

from sklearn.feature_extraction.text import TfidfVectorizer

# The cleaned (not spell-corrected) tokens are used here, so 'amaizing', 'ai', and 'وراءع' appear unchanged:
# cleaned_english_tokens = ['hello', 'welcome', 'ai', 'super', 'fun', 'amaizing', 'test', 'text']
# cleaned_arabic_tokens = ['مرحبا', 'ai', 'انه', 'ممتع', 'جدا', 'وراءع', 'نص', 'تجريبي', 'باللغة', 'العربية']

english_doc = ' '.join(cleaned_english_tokens)  
arabic_doc = ' '.join(cleaned_arabic_tokens)

# Create TfidfVectorizer instances
tfidf_vectorizer_english = TfidfVectorizer()
tfidf_vectorizer_arabic = TfidfVectorizer()

# Fit and transform the "documents"
english_tfidf = tfidf_vectorizer_english.fit_transform([english_doc])
arabic_tfidf = tfidf_vectorizer_arabic.fit_transform([arabic_doc])

# Output results
print("English TF-IDF Feature Names:\n", tfidf_vectorizer_english.get_feature_names_out())
print("English TF-IDF Values:\n", english_tfidf.toarray())

print("\nArabic TF-IDF Feature Names:\n", tfidf_vectorizer_arabic.get_feature_names_out())
print("Arabic TF-IDF Values:\n", arabic_tfidf.toarray())


English TF-IDF Feature Names:
 ['ai' 'amaizing' 'fun' 'hello' 'super' 'test' 'text' 'welcome']
English TF-IDF Values:
 [[0.35355339 0.35355339 0.35355339 0.35355339 0.35355339 0.35355339
  0.35355339 0.35355339]]

Arabic TF-IDF Feature Names:
 ['ai' 'العربية' 'انه' 'باللغة' 'تجريبي' 'جدا' 'مرحبا' 'ممتع' 'نص' 'وراءع']
Arabic TF-IDF Values:
 [[0.31622777 0.31622777 0.31622777 0.31622777 0.31622777 0.31622777
  0.31622777 0.31622777 0.31622777 0.31622777]]
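Every value in each row above is identical because each vectorizer sees exactly one document in which every term occurs once. With scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`), the IDF of every term is then ln(2/2) + 1 = 1, and L2 normalization leaves each weight at 1/sqrt(n) for n distinct terms. The sketch below reproduces that arithmetic in plain Python; `single_doc_tfidf` is a hypothetical helper that only handles the one-document case.

```python
import math

def single_doc_tfidf(tokens):
    """TF-IDF for a one-document corpus, assuming smoothed IDF and L2 normalization."""
    n_docs = 1
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # Every term appears in the only document, so its document frequency is 1.
    idf = math.log((1 + n_docs) / (1 + 1)) + 1
    raw = {term: count * idf for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

english = ['hello', 'welcome', 'ai', 'super', 'fun', 'amaizing', 'test', 'text']
weights = single_doc_tfidf(english)
print(round(weights['ai'], 8))  # 1 / sqrt(8)
```

The weights only become informative when the vectorizer is fitted on several documents, so that document frequencies differ between terms.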


In [85]:
# Convert to DataFrame
import pandas as pd
english_df = pd.DataFrame(english_tfidf.toarray(), columns=tfidf_vectorizer_english.get_feature_names_out())
arabic_df = pd.DataFrame(arabic_tfidf.toarray(), columns=tfidf_vectorizer_arabic.get_feature_names_out())

In [81]:
english_df

| | ai | amaizing | fun | hello | super | test | text | welcome |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.353553 | 0.353553 | 0.353553 | 0.353553 | 0.353553 | 0.353553 | 0.353553 | 0.353553 |


In [83]:
arabic_df

| | ai | العربية | انه | باللغة | تجريبي | جدا | مرحبا | ممتع | نص | وراءع |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 | 0.316228 |


# **6- Summary**
> In this notebook, we:

>> - Preprocessed Arabic and English texts by removing noise and normalizing the content.
>> - Trained Word2Vec models to generate embeddings.
>> - Computed TF-IDF weights for the cleaned tokens.