# Session 3 - Model design

In this lesson, we will bring all the pieces together, applying what we have learned so far to our problem. We will then design the pipeline that produces our final classification model. Finally, you will look for the system's limitations and for new ideas suggested by the final result. During the lesson, we will also go through the assignment results and some of your questions once you have designed your model.


# [30-40 min] Presentation of each group's model

- How does the model work?
- How did it perform on your dataset?
- What are the limitations of that model?

### [10-15 min] Presentation of K-means

### [10-15 min] Presentation of NMF

### [10-15 min] Presentation of LDA


### [20 min] Presentation of an alternative method

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
     |████████████████████████████████| 45.4 MB 123 kB/s
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.1.0
You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [None]:
import pandas as pd
import sys
import numpy as np
import os
import re
import random
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import spacy

In [None]:
# Fetch the 20 newsgroups dataset with scikit-learn and load it into a pandas DataFrame
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
dataset = pd.DataFrame({"text": data["data"], "label": data["target"]})

In [None]:
sentences = dataset["text"].values

In [None]:
with open("../assets/stopwords.txt", "r") as f:  # type:ignore[name-defined]
    STOPWORDS = [i.strip().lower() for i in f.readlines()]

In [None]:
def get_preprocessing_function(
    use_lower: bool = True,
    use_alpha: bool = True,
    use_stemming: bool = False
):
    
    def alpha(text: str):
        return re.sub("[^a-z]+", " ", text) if use_alpha else text

    def lower(text: str):
        return text.lower() if use_lower else text
        
    def stemming(text: str):
        # Placeholder: no stemmer is applied yet, so use_stemming currently has no effect
        return text
    
    def preprocess(text: str):
        #Create list of steps
        steps = [lower, alpha, stemming]
        for step in steps:
            text = step(text)
        return text
    
    return preprocess

In [None]:
preprocess = get_preprocessing_function(
    use_lower = True,
    use_alpha = True,
    use_stemming = True
)

In [None]:
dataset["text"] = dataset["text"].fillna(".")
dataset["text"] = dataset["text"].astype(str)
dataset["text"] = dataset["text"].apply(preprocess)
sentences = dataset["text"].values

### What are word embeddings?

A word embedding is a learned representation for text where words that have similar meanings have similar representations.

This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles training a neural network, which is why the technique is often grouped with deep learning.

[ref number 1](https://machinelearningmastery.com/what-are-word-embeddings/)

![image_1](https://developers.google.com/machine-learning/guides/text-classification/images/WordEmbeddings.png)

![image_3](https://www.researchgate.net/profile/Scott-Cohen-5/publication/344886229/figure/fig3/AS:950864126685191@1603715082766/Word-co-occurrence-network-of-the-original-scientific-article-n-130-paragraphs.png)
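
To make the idea of "similar meaning, similar vector" concrete, here is a minimal sketch using hand-written toy vectors. The `toy_embeddings` values and the `cosine_similarity` helper are assumptions made up for illustration; they are not taken from `en_core_web_md` or any trained model.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-written for illustration only.
# Real models such as en_core_web_md learn 300-dimensional vectors from data.
toy_embeddings = {
    "car":   np.array([0.9, 0.8, 0.1, 0.0]),
    "truck": np.array([0.8, 0.9, 0.2, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9, 0.8]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_car_truck = cosine_similarity(toy_embeddings["car"], toy_embeddings["truck"])
sim_car_apple = cosine_similarity(toy_embeddings["car"], toy_embeddings["apple"])
print(f"car vs truck: {sim_car_truck:.3f}")
print(f"car vs apple: {sim_car_apple:.3f}")
```

In this toy setup, the two vehicle words point in almost the same direction, while "car" and "apple" are close to orthogonal.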


In [None]:
model = spacy.load('en_core_web_md', disable=['parser', 'ner'])

In [None]:
sentence_1 = random.choice(sentences)
sentence_2 = random.choice(sentences)

In [None]:
doc_1 = model(sentence_1)
doc_2 = model(sentence_2)
doc_1_vector = doc_1.vector
doc_2_vector = doc_2.vector

### How do we get a vector for a sentence?

![image_2](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmiro.medium.com%2Fmax%2F1920%2F1*ytRLNPOlDQ7kV6XhwH4baA.png&f=1&nofb=1)
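
By default, spaCy's `Doc.vector` is the average of the token vectors, which is why the sentence vector has the same size as a single word vector. The sketch below reproduces that averaging with NumPy only; the `token_vectors` values are hypothetical and much smaller than the 300-dimensional vectors of the real model.

```python
import numpy as np

# Hypothetical 3-dimensional token vectors for the sentence "the car is fast".
token_vectors = np.array([
    [0.1, 0.0, 0.2],   # "the"
    [0.9, 0.7, 0.1],   # "car"
    [0.0, 0.1, 0.1],   # "is"
    [0.6, 0.2, 0.8],   # "fast"
])

# The sentence vector is the element-wise mean over all tokens,
# so it keeps the dimensionality of a single word vector.
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector.shape)
print(sentence_vector)
```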

In [None]:
print(doc_1_vector.shape)

(300,)


In [None]:
print(doc_1_vector)

[-5.62302135e-02  2.22206473e-01 -1.75952137e-01  8.17859359e-03
 -1.52418232e-02 -7.95962736e-02  3.75552778e-03 -1.96965709e-01
  7.44718164e-02  2.26508260e+00 -2.30198309e-01  6.71647117e-02
  4.37277071e-02 -5.09226993e-02 -5.99096976e-02 -1.74977034e-02
 -4.30246070e-02  1.14461374e+00 -1.78094506e-01 -1.57537255e-02
 -1.95053162e-03 -5.09126708e-02 -1.14611484e-01 -3.73650752e-02
  3.47329937e-02  3.07766423e-02 -3.19175329e-03 -2.02922542e-02
  5.69847077e-02 -5.63087277e-02 -1.03423499e-01  7.66070336e-02
 -2.24631727e-02  1.01324013e-02  3.41261141e-02 -2.13441462e-03
 -1.47219524e-02 -1.88007820e-02 -7.40992203e-02 -6.73977807e-02
  4.75566722e-02  9.04916301e-02  1.66198555e-02 -5.97303137e-02
  4.83734533e-02  7.20232353e-02 -1.64188579e-01 -6.84722373e-03
  2.49541551e-02  4.20619398e-02 -1.05918601e-01  5.67717627e-02
  6.47750497e-02 -8.72547254e-02  8.88657197e-02  1.15228062e-02
  1.68119092e-02 -7.63470381e-02  3.27668488e-02 -9.88457799e-02
 -5.42579629e-02 -1.04056

In [None]:
doc_1.similarity(doc_2)

0.9663938703589886

### Still an issue?

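The more words a text contains, the more the average smooths out the individual word vectors. As a result, the document vectors all end up very close to each other, which is why two random documents get a similarity of about 0.97 above.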
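
This smoothing effect can be simulated without any real model. The sketch below is a toy experiment under a stated assumption: every word vector is a shared offset plus random noise, which roughly mimics the common component that word vectors tend to share. The names `vocab`, `random_doc_vector` and `similarities` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: every word vector is a shared offset (0.5) plus random noise.
# The shared offset is an assumption, not a property measured on en_core_web_md.
dim, vocab_size = 50, 1000
vocab = 0.5 + rng.normal(size=(vocab_size, dim))


def random_doc_vector(n_words):
    """Average the vectors of n_words words drawn at random from the vocabulary."""
    idx = rng.integers(0, vocab_size, size=n_words)
    return vocab[idx].mean(axis=0)


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


similarities = {}
for n_words in (5, 50, 500):
    # Average over several random document pairs to reduce sampling noise.
    sims = [
        cosine_similarity(random_doc_vector(n_words), random_doc_vector(n_words))
        for _ in range(20)
    ]
    similarities[n_words] = float(np.mean(sims))
    print(n_words, round(similarities[n_words], 3))
```

In this simulation, the average similarity between two unrelated random documents grows with document length, because the noise cancels out and only the shared component remains.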

In [None]:
MIN_DF = 2
MAX_DF = 0.4
vec = TfidfVectorizer(
    preprocessor=lambda s: s,
    tokenizer=lambda s: s.split(),
    stop_words=STOPWORDS,
    min_df=MIN_DF,
    max_df=MAX_DF,
    use_idf=True,
    smooth_idf=True
)

In [None]:
vec = vec.fit(sentences)
vectors = vec.transform(sentences)

  "The parameter 'token_pattern' will not be used"


In [None]:
len(vec.get_feature_names_out())  # scikit-learn >= 1.0; use get_feature_names() on older versions

46678

In [None]:
print(vec.get_feature_names_out())  # scikit-learn >= 1.0; use get_feature_names() on older versions



### Still an issue?

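The vectorizer still keeps a lot of features (46,678 here). Filtering with `min_df`, `max_df` and the stop words alone is unlikely to solve the smoothing issue, because most documents still contain many words. Instead, we keep only the `top_n` words with the highest TF-IDF score in each document.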

In [None]:
top_n = 25

words = np.array(vec.get_feature_names_out())  # scikit-learn >= 1.0; use get_feature_names() on older versions
res = []
for i in range(vectors.shape[0]):
    # Keep the top_n words with the highest TF-IDF score in document i.
    # np.argsort sorts in ascending order, so we negate the scores to get the highest first.
    s = np.argsort(np.asarray(-vectors[i, :].todense()).flatten())
    res.append(" ".join(words[s[:top_n]]))
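
The selection step can be checked on a tiny example. The sketch below uses a hypothetical dense score matrix instead of the real TF-IDF output, and the `top_words` helper is an illustrative name. Unlike the loop above, it also drops words whose score is zero, an extra step that may be useful for documents with fewer than `top_n` distinct words.

```python
import numpy as np

# Hypothetical TF-IDF scores: 2 documents (rows) x 5 vocabulary words (columns).
words = np.array(["car", "engine", "the", "space", "orbit"])
scores = np.array([
    [0.7, 0.5, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.6, 0.8],
])


def top_words(row, words, top_n):
    """Return the top_n words with the highest non-zero score in a row."""
    order = np.argsort(-row)[:top_n]   # indices sorted by descending score
    order = order[row[order] > 0]      # drop words absent from the document
    return " ".join(words[order])


top_n = 2
res_toy = [top_words(row, words, top_n) for row in scores]
print(res_toy)
```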



In [None]:
res[0]

'car lerxst wam umd tellme bricklin funky rac enlighten bumper neighborhood doors maryland specs production sports anyone door park separate il engine addition brought late'

In [None]:
sentence_1 = random.choice(res)
sentence_2 = random.choice(res)

In [None]:
doc_1 = model(sentence_1)
doc_2 = model(sentence_2)
doc_1_vector = doc_1.vector
doc_2_vector = doc_2.vector

In [None]:
doc_1.similarity(doc_2)

0.5709277623677915

In [None]:
doc_1_vector

array([ 4.08852994e-02,  2.88990647e-01, -2.00382527e-02, -1.27978846e-01,
        8.50051939e-02, -2.46315617e-02, -6.13432452e-02, -9.62310731e-02,
        3.78384367e-02,  7.85024643e-01, -2.03909919e-01, -4.99030016e-02,
        2.61485577e-02, -1.36638075e-01, -2.11598143e-01, -7.46200308e-02,
       -1.02811448e-01,  1.14155114e+00, -8.23028013e-02, -9.92530882e-02,
        1.11317351e-01, -8.51118639e-02,  8.32710937e-02, -1.14716671e-01,
       -7.12008178e-02,  3.56472358e-02, -2.48487011e-01, -7.88615737e-03,
        9.39784721e-02,  2.11245120e-01,  1.12766931e-02, -5.24664074e-02,
        7.83825852e-03,  4.02679779e-02, -3.83816063e-02, -1.11462079e-01,
        2.61193532e-02,  4.74296585e-02, -5.59849590e-02, -1.06795607e-02,
        6.04381859e-02, -6.29040375e-02,  1.67846054e-01,  2.02239633e-01,
        4.01060805e-02,  2.83332448e-02,  8.85285586e-02, -1.76660493e-01,
        6.19931705e-03,  1.14666350e-01,  8.17067176e-02,  7.74377137e-02,
        1.17067851e-01, -
