#### Group members

Mostafa Allahmoradi - 9087818
Jarius Bedward - 8841640

## Imports


In [7]:
import os
import re
import string

import nltk
from nltk.corpus import brown, stopwords
from nltk.tokenize import word_tokenize

from gensim.models import KeyedVectors, Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec

## Setup

In [8]:
# Warning: This download will copy files to your home directory.
# For example, on Linux, it will copy files to ~/.nltk_data.
# In Windows, it will copy files to C:\Users\YourAccount\AppData\Roaming
# nltk.download('punkt')

# A better way to handle the download is to:
# Ensure 'punkt' is available and nltk_data path is set
nltk_data_path = os.path.join(os.getcwd(), "nltk_data")
print("Downloading tokenizer resources...")

nltk.download('brown')
nltk.download('stopwords')
nltk.download("punkt", download_dir=nltk_data_path, force=True)
nltk.download("punkt_tab", download_dir=nltk_data_path, force=True)

# makes sure path is used by nltk
if nltk_data_path not in nltk.data.path:
    nltk.data.path.append(nltk_data_path)

print("Active nltk paths:", nltk.data.path)
print("Contents of nltk_data:", os.listdir(nltk_data_path))

Downloading tokenizer resources...


[nltk_data] Downloading package brown to C:\Users\jjbed/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jjbed/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\jjbed\Downloads\ML
[nltk_data]     prog lab9\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\jjbed\Downloads\ML prog lab9\nltk_data...


Active nltk paths: ['C:\\Users\\jjbed/nltk_data', 'C:\\Users\\jjbed\\AppData\\Local\\Programs\\Python\\Python313\\nltk_data', 'C:\\Users\\jjbed\\AppData\\Local\\Programs\\Python\\Python313\\share\\nltk_data', 'C:\\Users\\jjbed\\AppData\\Local\\Programs\\Python\\Python313\\lib\\nltk_data', 'C:\\Users\\jjbed\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data', 'C:\\Users\\jjbed\\Downloads\\ML prog lab9\\nltk_data']
Contents of nltk_data: ['tokenizers']


[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


## Document Collection


In [9]:

# Use the first 500 Brown sentences for the demo; brown.sents() yields each sentence as a list of tokens.
documents = [" ".join(sent) for sent in brown.sents()[:500]]

print("Number of documents collected:", len(documents))
print("Sample document:\n", documents[0])

Number of documents collected: 500
Sample document:
 The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .


- We use the Brown corpus and `brown.sents()` to get a list of sentences; each sentence is joined back into a string and treated as one document.
- `len(documents)` gives the length of that list, i.e. the number of documents collected (500 here).

##  Tokenizer, Normalization Pipeline

In [10]:
# Normalization

def normalize(text):
    # convert to lowercase
    text = text.lower()
    # remove URLs first, while "://" and "." are still present
    text = re.sub(r"http\S+|www\S+", "", text)
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # remove numbers
    text = re.sub(r"\d+", "", text)
    # collapse extra whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

# apply normalization to every document in the list
normalize_docs = [normalize(doc) for doc in documents]

print("Normalized Sample:\n", normalize_docs[0])

# Tokenization pipeline

# Use a new name so the imported `stopwords` module is not overwritten
# (re-running the cell would otherwise fail).
stop_words = set(stopwords.words("english"))

def tokenize(text):
    tokens = word_tokenize(text)
    # remove stop words
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

# tokenize each normalized document
tokenize_docs = [tokenize(doc) for doc in normalize_docs]

# final output
print("Tokenized sample:\n", tokenize_docs[0])


Normalized Sample:
 the fulton county grand jury said friday an investigation of atlantas recent primary election produced no evidence that any irregularities took place
Tokenized sample:
 ['fulton', 'county', 'grand', 'jury', 'said', 'friday', 'investigation', 'atlantas', 'recent', 'primary', 'election', 'produced', 'evidence', 'irregularities', 'took', 'place']


###### Normalization
- We define a `normalize` function that lowercases the text and uses `str.translate` and regular expressions to strip punctuation, numbers, URLs, and extra whitespace.
- We then apply it to every document in the list.
###### Tokenization
- We split each normalized document into tokens with `word_tokenize` and drop English stop words with a list comprehension.
- Tokenization runs on the normalized list, so the tokens are already lowercase and free of punctuation.
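
The two stages above can be tried on a single string without downloading any NLTK data. This is only a minimal sketch: `str.split()` stands in for `word_tokenize`, and `TOY_STOP_WORDS` is a hypothetical hand-written list rather than NLTK's English stop-word list. The function names are illustrative too.

```python
import re
import string

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "then", "a", "an", "of", "and"}

def normalize_text(text):
    text = text.lower()
    # Strip URLs while "://" and "." are still intact.
    text = re.sub(r"http\S+|www\S+", "", text)
    # Drop punctuation, then digits, then collapse whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize_text(text):
    # str.split() is a stand-in for nltk's word_tokenize in this sketch.
    return [tok for tok in text.split() if tok not in TOY_STOP_WORDS]

sample = "Visit https://example.com/docs, then read Chapter 12!"
normalized = normalize_text(sample)
tokens = tokenize_text(normalized)
print(normalized)  # visit then read chapter
print(tokens)      # ['visit', 'read', 'chapter']
```

On real text `word_tokenize` handles contractions and similar cases differently from `str.split()`, so the sketch only illustrates the order of the steps.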

## Implement a Word2Vec predictive model using the knowledge corpus.

In [11]:


model_word2vec = Word2Vec(
    sentences=tokenize_docs,  # tokenized corpus: a list of token lists
    vector_size=100,          # embedding dimension
    window=5,                 # context window size
    min_count=1,              # keep every word (for demo purposes)
    workers=4,                # number of worker threads
    sg=1,                     # 1 = skip-gram, 0 = CBOW
)

print("Words2Vec model trained")

# Example check: look up the learned vector for a word that occurs in the corpus
# (raises KeyError if the word is not in the vocabulary)
word = "money"
vector_word2Vec = model_word2vec.wv[word]  # word vectors are stored in model.wv
print(f"Word vector for '{word}' using Word2Vec: {vector_word2Vec}")





Words2Vec model trained
Word vector for 'money' using Word2Vec: [ 0.00213936 -0.00513772  0.00957934 -0.00377395 -0.00815433 -0.00564388
  0.00576452  0.00625592  0.00330616  0.00837108 -0.00678436  0.00397323
  0.00097306  0.00790533  0.00656204  0.00115702  0.00996519 -0.01281296
  0.00022501  0.00131961  0.0115654   0.00826382  0.00155255 -0.00216188
  0.00765863  0.00422351  0.00334267 -0.00833297 -0.00199157 -0.00869753
  0.00224814 -0.00806683 -0.00173035  0.00755662 -0.00856033  0.01140152
 -0.00545522  0.00373608  0.00374476 -0.011393   -0.00721612 -0.00719035
 -0.00944927 -0.00501105  0.00360177 -0.00920607 -0.01294341  0.00538896
  0.00534256 -0.00224518  0.00428087 -0.00349704  0.00885019  0.00220814
  0.00173936 -0.00806178 -0.00615809 -0.00350766  0.00408438 -0.00687911
  0.00386494  0.00066182 -0.00565066  0.0051219  -0.00864518  0.01070338
 -0.00858912  0.0035071  -0.00486206 -0.00380146  0.00038248 -0.00548745
  0.0086353  -0.00861872 -0.00439417 -0.00572507  0.00331686

- The Word2Vec model is trained on the tokenized corpus to learn a dense vector representation for each word, capturing contextual and semantic relationships.
- `vector_size=100` sets the embedding dimension and `window=5` the context window; `sg=1` selects the skip-gram architecture, which predicts surrounding words from a target word (`sg=0` would select CBOW).
- Once trained, each word in the vocabulary is represented as a numerical vector that can feed downstream tasks such as sentiment analysis, text classification, and analogy reasoning.
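
To make "predict surrounding words from a target word" concrete, the sketch below lists the (target, context) pairs a skip-gram model would learn from for one short sentence. It is a conceptual illustration only: `skipgram_pairs` is a hypothetical helper, and gensim's actual training loop differs in its details.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["grand", "jury", "said", "friday"]
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

With `window=1` each interior word contributes two pairs and each edge word one, so this four-token sentence yields six training pairs. A larger `window` pairs each word with more distant neighbours.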

## Implement a GloVe count-based model using the knowledge corpus.

In [12]:


# Requires the third-party `glove` module (glove-python);
# the import fails with ModuleNotFoundError if it is not installed.
from glove import Glove, Corpus
import numpy as np

# Build the co-occurrence matrix from the tokenized knowledge corpus
corpus = Corpus()
corpus.fit(tokenize_docs, window=5)

glove_model = Glove(no_components=100, learning_rate=0.05)
glove_model.fit(corpus.matrix, epochs=20, no_threads=4, verbose=True)
glove_model.add_dictionary(corpus.dictionary)

query_word = "money"
print(glove_model.word_vectors[glove_model.dictionary[query_word]])
print(glove_model.most_similar(query_word))

# Download GloVe pretrained embeddings from: http://nlp.stanford.edu/data/glove.6B.zip

def embedding_for_vocab(filepath, word_index, embedding_dim):
    # word_index maps word -> row index starting at 1; row 0 is reserved
    vocab_size = len(word_index) + 1
    embedding_matrix_vocab = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab

# Vocabulary of the tokenized corpus, indexed from 1
vocab = sorted({token for doc in tokenize_docs for token in doc})
word_index = {token: i for i, token in enumerate(vocab, start=1)}

embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab('../glove.6B.50d/glove.6B.50d.txt', word_index, embedding_dim)

print("Dense vector for first word is => ", embedding_matrix_vocab[1])

ModuleNotFoundError: No module named 'glove'
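
Because the `glove` import fails in this environment, the count-based step that GloVe starts from can still be illustrated with the standard library alone. The sketch below builds a symmetric word-word co-occurrence table for a toy corpus. `cooccurrence_counts` is a hypothetical helper, and the 1/distance weighting is an assumption made for illustration, not a reproduction of what `Corpus.fit` does internally.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Symmetric co-occurrence counts; a pair d positions apart adds 1/d."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right and record both directions,
            # which keeps the table symmetric.
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                weight = 1.0 / (j - i)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

toy_sentences = [["grand", "jury", "said"], ["jury", "said", "friday"]]
counts = cooccurrence_counts(toy_sentences, window=2)
print(counts[("jury", "said")])   # 2.0 (adjacent in both sentences)
print(counts[("grand", "said")])  # 0.5 (two positions apart, once)
```

A GloVe model then fits word vectors to a table like this one, whereas Word2Vec learns from local context windows one pair at a time.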

## 🧠 Learning Objectives
- Teams of 2 (individual evaluation in class).
- Implement **Word2Vec**  and **GloVe** using real-world data during the NLP process.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into markdown comments.


## 🧩 Workshop Structure (In Class)
1. **Set up teams of 2 people** – Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Jupyter Notebook Development** *(In class)* – NLP Pipeline (if needed) and Probabilistic Model method implementations + Markdown documentation (work as teams)
3. **Push to GitHub** – Teams commit and push the notebook. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** - The instructor will go around in class, take notes, and provide coaching as needed, during the **Peer Review Round**


## 💻 Submission Checklist
- ✅ `EmbeddingClusteringVectorizationWorkshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline on a relevant corpus.
  - Demo code: Implement a Word2Vec predictive model using the knowledge corpus.
  - Demo code: Implement a GloVe count-based model using the knowledge corpus.
  - Markdown explanations for each major step
  - A table that compares **Word2Vec** against **GloVe** in the context of the use case that makes use of the knowledge corpus.
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `EmbeddingClusteringVectorizationWorkshop`
  - **Markdowns and meaningful talking points**