# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *K*

**Names:**

* *Mathieu Sauser*
* *Luca Mouchel*
* *Jérémy Chaverot*
* *Heikel Jebali*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [45]:
import pickle
import re

import nltk
import numpy as np
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

from utils import load_json, load_pkl

# Tokenizer models required by nltk.word_tokenize
nltk.download('punkt')

courses = load_json('data/courses.txt')
# Stopword list shipped with the lab data (nltk.corpus.stopwords is not imported,
# since binding this name would shadow it)
stopwords = load_pkl('data/stopwords.pkl')

[nltk_data] Downloading package punkt to /home/jebali/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Exercise 4.1: Pre-processing

In [46]:
freqs = {}                 # accumulated frequency of every stemmed unigram
stemmer = PorterStemmer()  # created once and reused for every course
nGramRange = (2, 3)        # smallest and largest n-gram size to add

for course in courses:
    # Step 1: Lowercase the description
    description = course['description'].lower()

    # Step 2: Remove punctuation marks
    description = re.sub(r'[^\w\s]', '', description)

    # Step 3: Tokenize the text into words
    tokens = nltk.word_tokenize(description)

    # Step 4: Remove stopwords
    tokens = [token for token in tokens if token not in stopwords]

    # Step 5: Stem words
    stemmedTokens = [stemmer.stem(token) for token in tokens]

    # Step 6: Accumulate token frequencies over all courses.
    # NOTE: the loop visits every occurrence, so a token seen k times in a
    # course adds k * k here; iterate over freqDist.items() to add k instead.
    freqDist = FreqDist(stemmedTokens)
    for token in stemmedTokens:
        freqs[token] = freqs.get(token, 0) + freqDist[token]

    # Step 7: Add n-grams (bigrams and trigrams) to the vocabulary
    ngrams = []
    for n in range(nGramRange[0], nGramRange[1] + 1):
        ngrams.extend(nltk.ngrams(stemmedTokens, n))

    vocabulary = stemmedTokens + [' '.join(ngram) for ngram in ngrams]

    # Replace the raw description by its list of processed terms
    course['description'] = vocabulary
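
The n-gram step above can be illustrated on a toy token list. This is a minimal sketch that uses only the standard library instead of `nltk.ngrams`; the helper name `add_ngrams` and the example tokens are illustrative assumptions, not part of the lab code.

```python
def add_ngrams(tokens, n_min=2, n_max=3):
    """Return the tokens followed by every space-joined n-gram, n_min <= n <= n_max."""
    expanded = list(tokens)
    for n in range(n_min, n_max + 1):
        # Zipping n shifted views of the list yields all runs of n consecutive tokens
        shifted = (tokens[k:] for k in range(n))
        expanded.extend(' '.join(gram) for gram in zip(*shifted))
    return expanded


# Hypothetical stemmed tokens, used only for illustration
print(add_ngrams(['social', 'network', 'ecommerc']))
```

A list shorter than `n` simply produces no n-grams of that size, so very short descriptions keep only their unigrams.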

In [47]:
# Token with the highest accumulated frequency (the first one wins in case of a tie)
mostFreqToken, mostFreq = max(freqs.items(), key=lambda item: item[1])

print(f'{mostFreqToken} is the most used (stemmed) word, with {mostFreq} occurrences')

student is the most used (stemmed) word, with 9887 occurrences


In [48]:
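# Tokens above 60% of the top frequency are considered too common to be informative
freqWords = [word for word, freq in freqs.items() if freq > mostFreq * 0.6]
# Tokens with an accumulated frequency below 4 are considered too rare;
# a set keeps the membership tests in the filtering step fast
infreqWords = {word for word, freq in freqs.items() if freq < 4}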

freqWords

['learn', 'student', 'model', 'method', 'system']
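
The two thresholds can be checked on a small example. The sketch below assumes the same rule as the cell above (strictly more than 60% of the top frequency, or strictly fewer than 4); the function name `split_by_frequency` and the toy counts are hypothetical.

```python
from collections import Counter


def split_by_frequency(freqs, high_ratio=0.6, low_count=4):
    """Return (too_frequent, too_rare) sets of tokens from a token -> frequency mapping."""
    top = max(freqs.values())
    too_frequent = {word for word, freq in freqs.items() if freq > top * high_ratio}
    too_rare = {word for word, freq in freqs.items() if freq < low_count}
    return too_frequent, too_rare


# Made-up counts, for illustration only
toy = Counter({'student': 100, 'learn': 70, 'graph': 12, 'auction': 3})
frequent, rare = split_by_frequency(toy)
print(sorted(frequent), sorted(rare))
```

Because the upper threshold is relative to the most frequent token, it adapts to the corpus size, while the lower threshold is an absolute count.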

In [55]:
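for course in courses:
    # freqs only holds unigrams, so this filter removes single tokens;
    # n-grams are never in freqWords or infreqWords and are always kept
    course['description'] = [word for word in course['description']
                             if word not in freqWords and word not in infreqWords]

    # Print the processed terms of the IX class (COM-308) in alphabetical order
    if course['courseId'] == 'COM-308':
        print(sorted(course['description']))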

['20', '20 midterm', '20 midterm 30', '30', '30 final', '30 final exam', '50', 'acquir', 'acquir lectur', 'acquir lectur handson', 'activ', 'activ lectur', 'activ lectur homework', 'ad', 'ad', 'ad auction', 'ad auction', 'ad auction learn', 'ad auction provid', 'advertis class', 'advertis class explor', 'algebra', 'algebra', 'algebra algorithm', 'algebra algorithm data', 'algebra markov', 'algebra markov chain', 'algorithm', 'algorithm', 'algorithm data', 'algorithm data structur', 'algorithm statist', 'algorithm statist graph', 'analysi', 'analysi user', 'analysi user data', 'analyt', 'analyt', 'analyt applic', 'analyt applic social', 'analyt collect', 'analyt collect model', 'apach spark', 'apach spark keyword', 'applic', 'applic', 'applic inspir', 'applic inspir current', 'applic social', 'applic social network', 'assess', 'assess method', 'assess method project', 'auction', 'auction', 'auction learn', 'auction learn prerequisit', 'auction provid', 'auction provid good', 'balanc', '

## Exercise 4.2: Term-document matrix
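
For reference, the weighting computed in the cell below can be written as follows, where $\mathrm{tf}(t, d)$ is the raw count of term $t$ in course description $d$, $\mathrm{df}(t)$ is the number of descriptions containing $t$, and $N$ is the number of courses (the symbol names are our notation, read off the code):

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\frac{N}{1 + \mathrm{df}(t)}
$$

With this smoothing, the weight is zero when $\mathrm{df}(t) = N - 1$ and slightly negative when a term appears in every description, since $\ln\frac{N}{1 + N} < 0$.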

In [57]:
# Step 1: Build the vocabulary and the per-document term frequencies
vocabulary = set()
doc_term_freq = []
ix_class_index = 0  # overwritten below once COM-308 (the IX class) is found

for i, course in enumerate(courses):
    term_freq = FreqDist(course['description'])
    doc_term_freq.append(term_freq)
    vocabulary.update(term_freq.keys())

    if course['courseId'] == 'COM-308':
        ix_class_index = i

vocabulary = sorted(vocabulary)
term_index = {term: i for i, term in enumerate(vocabulary)}  # term -> row index
num_terms = len(vocabulary)
num_docs = len(courses)

# Step 2: Calculate inverse document frequency (IDF)
# doc_freq[i] counts the documents that contain term i
doc_freq = np.zeros(num_terms)
for term_freq in doc_term_freq:
    for term in term_freq:
        doc_freq[term_index[term]] += 1
idf = np.log(num_docs / (1 + doc_freq))

# Step 3: Compute TF-IDF scores (rows are terms, columns are documents)
# Only terms present in a document are visited; all other entries stay at zero
tf_idf = np.zeros((num_terms, num_docs))
for j, term_freq in enumerate(doc_term_freq):
    for term, tf in term_freq.items():
        i = term_index[term]
        tf_idf[i, j] = tf * idf[i]

# Step 4: Construct sparse term-document matrix
X = csr_matrix(tf_idf)

# Step 5: Print 15 terms with the highest TF-IDF scores in the IX class description
# np.argsort sorts in ascending order, so the terms are printed from the
# 15th highest score up to the highest one
ix_class_tfidf_scores = tf_idf[:, ix_class_index]
top_terms_indices = np.argsort(ix_class_tfidf_scores)[-15:]
top_terms = [vocabulary[i] for i in top_terms_indices]
print("Top 15 terms in the IX class description:")
for term in top_terms:
    print(term)

Top 15 terms in the IX class description:
network ecommerc
ad auction
recommend system cluster
cluster commun detect
system cluster commun
system cluster
social network ecommerc
cluster commun
mine
social network
data mine
explor
social
realworld
onlin


## Exercise 4.3: Document similarity search