# Session 2 - Word representation

In this lesson, we will learn how to represent words as numerical features that a model can use, covering the basics of feature representation.

## [15-20 min] Review one group's work on the first task


In [None]:
import random
from collections import Counter

numbers = [random.randint(1, 3) for _ in range(20)]
poll = Counter(numbers)

selected = max(poll.items(), key=lambda x: x[1])

print(f"Group to present will be group number {selected[0]} with {selected[1]} votes")

Group to present will be group number 2 with 12 votes


## [30 min] How to represent our text to the model

In [None]:
import pandas as pd
import sys
import os
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [None]:
# Load the 20 Newsgroups dataset from scikit-learn and wrap it in a pandas DataFrame
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
dataset = pd.DataFrame({"text": data["data"], "label": data["target"]})

In [None]:
with open("../assets/stopwords.txt", "r") as f:  # type:ignore[name-defined]
    STOPWORDS = [i.strip().lower() for i in f.readlines()]

In [None]:
sentences = dataset["text"].values

### Preprocessing

Let's generalize and create a single function that can be easily customized.


In [None]:
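def get_preprocessing_function(
    use_lower: bool = True,
    use_alpha: bool = True,
    use_stemming: bool = False
):
    """Build a preprocessing function from the selected steps."""

    def lower(text: str) -> str:
        return text.lower() if use_lower else text

    def alpha(text: str) -> str:
        # Replace every run of non-letter characters with a single space.
        # Upper-case letters are kept so the step also works when use_lower=False.
        return re.sub("[^a-zA-Z]+", " ", text) if use_alpha else text

    def stemming(text: str) -> str:
        # TODO Write this function (it should only stem when use_stemming is True)
        return text

    def preprocess(text: str) -> str:
        # Apply the steps in order: lower-case, keep letters only, stem
        steps = [lower, alpha, stemming]
        for step in steps:
            text = step(text)
        return text

    return preprocess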

In [None]:
preprocess = get_preprocessing_function(
    use_lower = True,
    use_alpha = True,
    use_stemming = True
)
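
The `stemming` step is left as a TODO in this notebook. As a starting point, here is a deliberately crude sketch of what a stemmer does. The suffix list and the names `SUFFIXES`, `naive_stem` and `stem_text` are assumptions made for illustration; real stemmers such as the Porter stemmer apply many more rules.

```python
# Hypothetical, deliberately small suffix list (checked in this order)
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text: str) -> str:
    """Stem every whitespace-separated token of an already cleaned text."""
    return " ".join(naive_stem(token) for token in text.split())


print(stem_text("the speakers were speaking and talked about processes"))
# -> the speaker were speak and talk about process
```

A function like `stem_text` could be called inside `stemming` when `use_stemming` is true. The example also shows the limits of naive suffix stripping: "were" is untouched and "speakers" only loses its plural.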

In [None]:
sample = dataset.sample(100)
sample["text"] = sample["text"].fillna(".")
sample["text"] = sample["text"].astype(str)
sample["text"] = sample["text"].apply(preprocess)

In [None]:
sample["text"].values[1]

'from ffritze hpwad wad hp com fromut fritze subject re anyone know stacker s email address organization hewlett packard waldbronn germany lines does anybody know if stacker has a e mail address and if so what it is i know they have a bbs and something on compuserve but i m hoping someone know s their e mail address john white from stac electronics can be reached at compuserv as for me compuserve com would as email address work from internet internet ffritze hpwbe wad hp com phone germany address fromut fritze waldbronn analytic division r d hewlett packard str d waldbronn germany '

In [None]:
sentence = random.choice(list(sentences))

In [None]:
processed_sentence = preprocess(sentence)

print(f"""
Non processed corpus:
{sentence}
------------------------
Processed corpus:
{processed_sentence}
""")



Non processed corpus:
Subject: [ANNOUNCE] Ivan Sutherland to speak at Harvard
From: eekim@husc11.harvard.edu (Eugene Kim)
Distribution: harvard
Organization: Harvard University Science Center
Nntp-Posting-Host: husc11.harvard.edu
Lines: 21

The Harvard Computer Society is pleased to announce its third lecture of
the spring.  Ivan Sutherland, the father of computer graphics and an
innovator in microprocessing, will be speaking at Harvard University on
Tuesday, April 20, 1993, at 4:00 pm in Aiken Computations building, room
101.  The title of his talk is "Logical Effort and the Conflict over the
Control of Information."

Cookies and tea will be served at 3:30 pm in the Aiken Lobby.  Admissions
is free, and all are welcome.

Aiken is located north of the Science Center near the Law School.

For more information, send e-mail to eekim@husc.harvard.edu.

The lecture will be videotaped, and a tape will be made available.

Thanks.

-- 
Eugene Kim '96                     |   "Give me a place t

### Vectorization

Now that we have a function to clean our text, we want to build a vector representation of each sentence so that it can be processed by a model.

#### CountVectorizer



In [None]:
count_vec = CountVectorizer(
    preprocessor=preprocess,
    tokenizer=lambda s: s.split(),
    stop_words=STOPWORDS,
    min_df=4,
    max_df=0.8,
    max_features=10    
)

In [None]:
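# The vectorizer must be fitted before calling transform. Fitting on [sentence]
# alone fails because min_df=4 requires at least 4 documents, so we fit on the corpus.
count_vec = count_vec.fit(sentences)
vector = count_vec.transform([sentence])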

In [None]:
vector.todense()

matrix([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

In [None]:
sorted(count_vec.vocabulary_.items(), key=lambda x: x[1])

[('auto', 0),
 ('new', 1),
 ('organization', 2),
 ('paper', 3),
 ('post', 4),
 ('r', 5),
 ('virginia', 6),
 ('warren', 7),
 ('writer', 8),
 ('yorker', 9)]
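
To make the link between the vocabulary and the vector explicit, here is a minimal pure-Python sketch of the counting step. The vocabulary is copied from the output above; the function name `count_vector` and the example text are assumptions, and the real `CountVectorizer` returns a sparse matrix rather than a list.

```python
from collections import Counter

# Vocabulary copied from the output above: term -> column index
vocabulary = {
    "auto": 0, "new": 1, "organization": 2, "paper": 3, "post": 4,
    "r": 5, "virginia": 6, "warren": 7, "writer": 8, "yorker": 9,
}


def count_vector(text: str, vocabulary: dict) -> list:
    """Count how often each vocabulary term occurs; unknown tokens are ignored."""
    counts = Counter(text.split())
    vector = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        vector[index] = counts[term]
    return vector


# Hypothetical preprocessed text, not taken from the dataset
print(count_vector("organization harvard university science center", vocabulary))
# -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

Only "organization" is in the vocabulary, so only column 2 is non-zero. Every other token is simply dropped.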

### Explanation of the CountVectorizer parameters

- max_df: Ignore terms that appear in more than this many documents of the training set (an integer is an absolute document count, a float is a proportion of the documents).

- min_df: Same as max_df, but as a lower bound: ignore terms that appear in fewer documents than this.

- stop_words: We saw these in the last session. Make sure the stop words match the output of your tokenizer and preprocessor, otherwise they will not be removed.

- max_features: Maximum number of words in the vocabulary, keeping the most frequent ones (this is optional; a good configuration of max_df and min_df should be enough).

- ngram_range: The range (min_n, max_n) of n-gram sizes to extract, for example (1, 2) for unigrams and bigrams.
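
The point about stop words matching the preprocessor can be illustrated with a small sketch. The `preprocess` below only mimics the lower-case and letters-only steps of this notebook, and `raw_stopwords` is a hypothetical list; the real stop-word file may look different.

```python
import re


def preprocess(text: str) -> str:
    """Lower-case the text, then replace runs of non-letters with a space."""
    return re.sub("[^a-z]+", " ", text.lower())


# Hypothetical raw stop-word list
raw_stopwords = ["don't", "I'm", "the"]

tokens = preprocess("I'm sure the cat doesn't care").split()

# Raw stop words such as "I'm" can never match: the cleaned text has no apostrophes
naive = [t for t in tokens if t not in raw_stopwords]

# Passing the stop words through the same preprocessing makes them comparable
clean_stopwords = {piece for word in raw_stopwords for piece in preprocess(word).split()}
aligned = [t for t in tokens if t not in clean_stopwords]

print(naive)    # ['i', 'm', 'sure', 'cat', 'doesn', 't', 'care']
print(aligned)  # ['sure', 'cat', 'doesn', 'care']
```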

In [None]:
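# Parameters that we can tune
NGRAM = (1, 1)  # Widen the range, e.g. (1, 2), to add features when context is needed
MIN_DF = 0.1  # Drop terms found in fewer than 10% of documents; raising it keeps only more common terms
MAX_DF = 0.3  # Drop terms found in more than 30% of documents; lowering it keeps more specific terms
MAX_FEATURES = 100  # Upper bound on the length of the vocabulary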
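
To see what `MIN_DF` and `MAX_DF` do, here is a minimal sketch of document-frequency filtering on a toy corpus. The function name `select_vocabulary` and the four documents are assumptions for illustration; it follows the same convention as `CountVectorizer` (integers are document counts, floats are proportions) but is not its implementation.

```python
from collections import Counter


def select_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, freq in df.items() if low <= freq <= high)


# Hypothetical toy corpus
docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]

print(select_vocabulary(docs, min_df=2, max_df=0.75))
# -> ['cat', 'flew', 'sat']
```

"the" appears in all four documents and exceeds `max_df`, while "dog" and "bird" appear only once and fall below `min_df`.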

In [None]:
count_vec = CountVectorizer(
    preprocessor=preprocess,
    ngram_range=NGRAM,
    tokenizer=lambda s: s.split(),
    stop_words=STOPWORDS,
    min_df=MIN_DF,
    max_df=MAX_DF,
    max_features=MAX_FEATURES    
)
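
The `ngram_range` parameter controls which token sequences become features. A minimal sketch of n-gram extraction, assuming whitespace tokens as in the tokenizer above (the function name `ngrams` is hypothetical):

```python
def ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams with min_n <= n <= max_n, joined by a space."""
    min_n, max_n = ngram_range
    result = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


tokens = "ivan sutherland to speak".split()
print(ngrams(tokens, (1, 1)))
# -> ['ivan', 'sutherland', 'to', 'speak']
print(ngrams(tokens, (1, 2)))
# -> ['ivan', 'sutherland', 'to', 'speak', 'ivan sutherland', 'sutherland to', 'to speak']
```

With (1, 2) the vocabulary grows quickly, which is why wider ranges are usually combined with stricter `min_df` or `max_features` settings.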

In [None]:
count_vec = count_vec.fit(sentences)

In [None]:
voc1 = list(count_vec.vocabulary_.keys())
print(voc1)

['edu', 'nntp', 'posting', 'host', 'university', 'anyone', 'could', 'really', 'know', 'years', 'please', 'e', 'mail', 'thanks', 'u', 'article', 'number', 'two', 'computer', 'distribution', 'usa', 'well', 'k', 'way', 'back', 'new', 'c', 'make', 'since', 'like', 'much', 'better', 'good', 'people', 'use', 'question', 'might', 'news', 'time', 'f', 'w', 'p', 'world', 'com', 'x', 'j', 'writes', 'information', 'n', 'h', 'cs', 'system', 'things', 'right', 'see', 'r', 'apr', 'v', 'many', 'need', 'government', 'would', 'say', 'believe', 'even', 'must', 'using', 'year', 'first', 'point', 'reply', 'file', 'last', 'state', 'may', 'still', 'problem', 'said', 'think', 'go', 'going', 'one', 'help', 'b', 'work', 'something', 'want', 'ca', 'god', 'never', 'g', 'space', 'used', 'take', 'l', 'windows', 'q', 'max', 'z', 'ax']


In [None]:
voc2 = list(count_vec.vocabulary_.keys())
print(voc2)

['thing', 'anyone', 'could', 'really', 'know', 'years', 'please', 'e', 'mail', 'thanks', 'u', 'two', 'computer', 'distribution', 'usa', 'well', 'way', 'back', 'new', 'c', 'make', 'since', 'got', 'much', 'better', 'good', 'people', 'use', 'question', 'might', 'news', 'time', 'world', 'x', 'cs', 'system', 'things', 'right', 'see', 'r', 'apr', 'many', 'need', 'say', 'believe', 'even', 'must', 'using', 'read', 'first', 'point', 'another', 'reply', 'sure', 'last', 'state', 'long', 'may', 'still', 'problem', 'said', 'think', 'go', 'going', 'help', 'b', 'work', 'something', 'without', 'want', 'ca', 'case', 'never', 'let', 'used', 'take', 'someone', 'etc']


#### TfidfVectorizer

How can CountVectorizer be improved?
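
One common answer is to weight each count by how rare the term is across documents, so that words appearing everywhere contribute less. The sketch below computes the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the formula scikit-learn documents for `smooth_idf=True`. The toy corpus and the function name `smooth_idf` are assumptions for illustration.

```python
import math
from collections import Counter


def smooth_idf(docs):
    """Smoothed idf per term: ln((1 + n_docs) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # document frequency, not raw count
    return {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in df.items()}


# Hypothetical toy corpus
docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]
idf = smooth_idf(docs)

for term in sorted(idf, key=idf.get):
    print(f"{term:>5}: {idf[term]:.3f}")
```

"the" occurs in every document and gets the minimum weight of 1.0, while terms that occur in a single document, such as "dog", get the largest weight.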

In [None]:
idf_vec = TfidfVectorizer(
    preprocessor=preprocess,
    ngram_range=NGRAM,
    tokenizer=lambda s: s.split(),
    stop_words=STOPWORDS,
    min_df=MIN_DF,
    max_df=MAX_DF,
    max_features=MAX_FEATURES,
    use_idf=True,
    smooth_idf=True
)

In [None]:
idf_vec = idf_vec.fit(sentences)

In [None]:
voc_idf = list(idf_vec.vocabulary_.keys())

In [None]:
print(voc_idf)

['thing', 'anyone', 'could', 'really', 'know', 'years', 'please', 'e', 'mail', 'thanks', 'u', 'two', 'computer', 'distribution', 'usa', 'well', 'way', 'back', 'new', 'c', 'make', 'since', 'got', 'much', 'better', 'good', 'people', 'use', 'question', 'might', 'news', 'time', 'world', 'x', 'cs', 'system', 'things', 'right', 'see', 'r', 'apr', 'many', 'need', 'say', 'believe', 'even', 'must', 'using', 'read', 'first', 'point', 'another', 'reply', 'sure', 'last', 'state', 'long', 'may', 'still', 'problem', 'said', 'think', 'go', 'going', 'help', 'b', 'work', 'something', 'without', 'want', 'ca', 'case', 'never', 'let', 'used', 'take', 'someone', 'etc']


In [None]:
idf_vec = TfidfVectorizer(
    preprocessor=preprocess,
    ngram_range=NGRAM,
    tokenizer=lambda s: s.split(),
    stop_words=STOPWORDS,
    min_df=0,
    max_df=10,
    max_features=10,
    use_idf=True,
    smooth_idf=True
)

In [None]:
vector = idf_vec.fit_transform([sentence])

In [None]:
vector.todense()

matrix([[0.24413654, 0.32551538, 0.24413654, 0.16275769, 0.73240961,
         0.32551538, 0.16275769, 0.16275769, 0.16275769, 0.16275769]])
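
When the vectorizer is fitted on a single document, every term has the same idf, so the result is just the term counts scaled to unit L2 norm (the default normalization of `TfidfVectorizer`). The sketch below reproduces the values above under one assumption: the counts are inferred from the ratios between the printed values, not read from the data.

```python
import math

# Assumption: counts inferred from the ratios in the output above
counts = [3, 4, 3, 2, 9, 4, 2, 2, 2, 2]

n_docs = 1
doc_freq = 1  # every term appears in the only document
idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # ln(1) + 1 = 1.0

weighted = [count * idf for count in counts]
norm = math.sqrt(sum(w * w for w in weighted))  # L2 norm
tfidf = [w / norm for w in weighted]

print([round(value, 8) for value in tfidf])
```

The idf term only starts to matter once the vectorizer is fitted on several documents, where terms differ in document frequency.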

## [10 min] Questions


## [5 min] Next assignment

In [None]:
groups = [1, 2, 3]
random.shuffle(groups)
print(groups)

[2, 3, 1]


Group 2:
- Non-negative matrix factorization (NMF or NNMF)

Group 3:
- Latent Dirichlet Allocation (LDA)

Group 1:
- K-means
