<a href="https://colab.research.google.com/github/willewiik/Text-Mining/blob/main/lab1/TM-Lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class="alert alert-info">
    
➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).

➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to these source(s) (for example as a comment above your code).

➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**.  You normally shouldn't need to modify any of the other cells.

</div>

# L1: Information Retrieval

In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the [Google Play Store](https://play.google.com/store/apps?hl=en). From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.

In [50]:
#from google.colab import drive
#drive.mount('/gdrive')
#%cd /gdrive/MyDrive/TExtmining

In [51]:
!git clone https://github.com/willewiik/Text-Mining.git
%cd "Text-Mining/lab1"

Cloning into 'Text-Mining'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 20 (delta 3), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (20/20), 950.38 KiB | 3.91 MiB/s, done.
Resolving deltas: 100% (3/3), done.
/content/Text-Mining/lab1/Text-Mining/lab1


In [52]:
# Define some helper functions that are used in this notebook
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))

## Dataset

The app descriptions come in the form of a compressed [JSON](https://en.wikipedia.org/wiki/JSON) file. Start by loading this file into a [Pandas](https://pandas.pydata.org) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).

In [53]:
import bz2
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', 500)

with bz2.open('app-descriptions.json.bz2', mode='rt', encoding='utf-8') as source:
    df = pd.read_json(source, encoding='utf-8')

In Pandas, a DataFrame is a table with indexed rows and labelled columns of potentially different types. You can access data in a DataFrame in various ways, including by row and column. To give an example, the code in the next cell shows rows 200–205 (label-based slicing with `.loc` includes both endpoints):

In [54]:
df.loc[200:205]

Unnamed: 0,name,description
200,Brick Breaker Star: Space King,"Introducing the best Brick Breaker game that everyone can enjoy.\nEnjoy various missions and addictively simple play control.\n\n[Features]\n- Hundreds of stages and various missions\n- No limit to play such as Heart, play as much as you can!\n- 5 kinds of various items and items reinforcement system\n- No network required\n- game file is as low as 20M, light-weight download!\n- supports tablet screen\n- supports Google Play Leaderboards, Achievement, Multiplay\n- supports 14 languages\n\nHo..."
201,Brick Classic - Brick Game,"Classic Brick Game!\n\nBrick Classic is a popular and addictive puzzle game!\n\nHow to play?\n- Simply drag the bricks to move them.\n- Create full lines on the grid vertically or horizontally to break bricks.\n\nTips:\n- Classic brick game without time limits.\n- Place the bricks in a reasonable position.\n- The more brick break, the more scores you have.\n- Bricks can't be rotated.\n\nWho's the best brick breaker? Challenge it now!!!"
202,Bricks Breaker - Glow Balls,"Bricks Breaker - Glow Balls is a addictive and challenging brick game.\nJust play it to relax your brain. Be focus on breaking bricks and you will find it more funny and exciting.\n\nHow to play\n- Hold the screen with your finger and move to aim.\n- Find best positions and angles to hit all bricks.\n- When the durability of brick reaches 0, destroyed.\n- Never let bricks reach the bottom or game is over.\n\nFeatures\n- Colorful glow skins.\n- Free to play.\n- Easy game controls with one fin..."
203,Bricks Breaker Quest,"How to play\n- The ball flies to wherever you touched.\n- Clear the stages by removing bricks on the board.\n- Break the bricks and never let them hit the bottom.\n- Find best positions and angles to hit every brick.\n\nFeature\n- Free to play\n- Tons of stages\n- Various types of balls\n- Easy to play, Simplest game system, Designed for one handheld gameplay.\n- Off-line (without internet connection) gameplay supported \n- Multi-play supported\n- Tablet device supported\n- Achievement & lea..."
204,Brothers in Arms® 3,"Fight brave soldiers from around the globe on the frenzied multiplayer battlegrounds of World War 2 or become Sergeant Wright and experience a dramatic, life-changing single-player journey, in the aftermath of the D-Day invasion.\n\nCLIMB THE ARMY RANKS IN MULTIPLAYER \n> 4 maps to master and enjoy. \n> 2 gameplay modes to begin with: Free For All and Team Deathmatch.\n> Unlock game-changing perks by playing with each weapon class!\n> A soldier’s only as deadly as his weapon. Be sure to upgr..."
205,Brown Dust - Tactical RPG,"The Empire has fallen, and the Age of Great Mercenaries Now Begins!\nCreate Your Ultimate Team And Strike Down Your Enemies!\n\nCAPTIVATING AND STUNNING ARTWORK\n- Experience the high-quality anime illustrations you have never seen before.\n- Meet Brown Dust's charming Mercenaries now.\n\nASSEMBLE LEGENDARY MERCENARIES\n- Over 300 Mercenaries and a Variety of Skills.\n- Discover the Unique Mercenaries, 6 Devils and Dominus Octo.\n- All Mercenaries can reach max level and the highest rank.\n\..."


As you can see, there are two labelled columns: `name` (the name of the app) and `description` (a textual description). The next cell shows how to access only the description field from row 200:

In [55]:
df.loc[200, 'description']

'Introducing the best Brick Breaker game that everyone can enjoy.\nEnjoy various missions and addictively simple play control.\n\n[Features]\n- Hundreds of stages and various missions\n- No limit to play such as Heart, play as much as you can!\n- 5 kinds of various items and items reinforcement system\n- No network required\n- game file is as low as 20M, light-weight download!\n- supports tablet screen\n- supports Google Play Leaderboards, Achievement, Multiplay\n- supports 14 languages\n\nHomepage:\nhttps://play.google.com/store/apps/dev?id=4931745640662708567\n\nFacebook: \nhttps://www.facebook.com/spcomesgames/'
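Both lookups above use label-based indexing. The following minimal sketch contrasts `.loc` with position-based `.iloc` on a made-up three-row frame; the `toy` DataFrame and its contents are hypothetical and not part of the lab data.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the two indexers
toy = pd.DataFrame(
    {"name": ["A", "B", "C"], "description": ["first", "second", "third"]},
    index=[200, 201, 202],
)

by_label = toy.loc[200:201]                 # label-based: both endpoints included
by_position = toy.iloc[0:1]                 # position-based: stop position excluded
single_cell = toy.loc[201, "description"]   # one row label, one column label

print(len(by_label), len(by_position), single_cell)
```

Keeping the two indexers apart matters later in the lab, where search results come back as row positions rather than labels.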

## Problem 1: What's in a vector?

We start by vectorising the data — more specifically, we map each app description to a tf–idf vector. This is very simple with a library like [scikit-learn](https://scikit-learn.org/stable/), which provides a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class for exactly this purpose.  If we instantiate this class, and call `fit_transform()` on all of our app descriptions, scikit-learn will preprocess and tokenize each app description, compute tf–idf values for each of them, and return a vectorised representation:

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['description'])
X

<1614x27877 sparse matrix of type '<class 'numpy.float64'>'
	with 267110 stored elements in Compressed Sparse Row format>
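To make the numbers in this matrix less opaque, here is a library-free sketch of how a tf–idf weight can be computed. It assumes the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed idf, L2 normalisation per document); the three toy documents are hypothetical and already tokenized.

```python
import math
from collections import Counter

# Hypothetical, already tokenized toy documents
docs = [
    ["pile", "up", "pancakes"],
    ["pile", "up", "bricks"],
    ["you", "can", "pile", "bricks"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Return {token: weight} for one document, L2-normalised."""
    counts = Counter(doc)
    raw = {
        tok: tf * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
        for tok, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()}

weights = tfidf(docs[0])
for tok, w in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{tok}: {w:.3f}")
```

In this toy collection, "pancakes" occurs in only one document and therefore receives the largest weight, while "pile" occurs in all three and receives the smallest.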

Let’s pick the app "Pancake Tower", which has a rather short description text, to see how it has been vectorised:

In [57]:
# We can use 'toarray' to convert the sparse matrix object into a "normal" array
vec = X[1032].toarray()[0]

# The app description & its corresponding vector
df.loc[1032, 'description'], vec

("Let's see how many pancakes you can pile up!!",
 array([0., 0., 0., ..., 0., 0., 0.]))

That's not very informative yet.  We know that the vector contains tf–idf values, and that each dimension of the vector corresponds to a token in the vectorizer’s vocabulary; let's extract these for this specific example.

Your **first task** is to find out how to access the `vectorizer`’s vocabulary, for example by [checking the documentation of `TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and print all the tokens that are represented in the vector with a tf–idf value greater than zero (i.e., only the tokens that are actually part of this app’s description) _in descending order of the tf–idf values_.  In other words, the token with the highest tf–idf value should be at the top of your output, and the token with the lowest tf–idf value at the bottom.   Before you implement this, think about what you would expect the output to look like, for example which words you would expect to have the highest/lowest tf–idf values in this example.

Your final output should look something like this:

```
<token 1>: <tf-idf value 1>
<token 2>: <tf-idf value 2>
...
```

In [58]:
"""Print the tokens and their tf–idf values, in descending order."""

# YOUR CODE HERE
idx = np.flatnonzero(vec > 0)  # indices of the tokens that occur in this description

tokens = vectorizer.get_feature_names_out()[idx]  # the tokens at those indices
values = vec[idx]  # their tf–idf values

# Sort by tf–idf value, highest first, and print one token per line
order = np.argsort(values)[::-1]
for i in order:
    print(f' {tokens[i]}:  {values[i]}')




 pancakes:  0.6539332651185913
 pile:  0.5304701435508047
 let:  0.2615287714771797
 see:  0.2557630827415271
 many:  0.23491959669849022
 how:  0.21153246225085887
 up:  0.17216837691451817
 can:  0.13047602895910532
 you:  0.10276923239718011


## Problem 2: Finding the nearest vectors

To build a small search engine, we need to be able to turn _queries_ (for example the string "pile up pancakes") into _query vectors_, and then find out which of our app description vectors are closest to the query vector.

For the first part (turning queries into query vectors), we can simply re-use the `vectorizer` that we used for the app descriptions. For the second part, an easy way to find the closest vectors is to use scikit-learn’s [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. This class needs to be _fit_ on a set of vectors (the "training set"; in our case the app descriptions) and can then be used with any vector to find its _nearest neighbors_ in the vector space.
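As a rough illustration of what such a lookup computes, here is a brute-force sketch in plain Python. It assumes cosine distance (one minus cosine similarity) as the metric, which is one possible choice and has to be requested explicitly in `NearestNeighbors`; the helper names and the toy vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(query, vectors, k):
    """Return (index, distance) pairs of the k vectors closest to the query."""
    scored = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical 3-dimensional "document" vectors
vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
top = nearest([1.0, 0.1, 0.0], vectors, k=2)
print(top)
```

A distance close to 0 means that the two vectors point in nearly the same direction; a distance of 1 means that they share no non-zero dimension.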

**First,** instantiate and fit a class that returns the _ten (10)_ nearest neighbors:

In [59]:
"""Instantiate and fit a class that returns the 10 nearest neighboring vectors."""

# YOUR CODE HERE
from sklearn.neighbors import NearestNeighbors as nn

# Use cosine distance so that similarity depends on the angle between vectors
neigh = nn(n_neighbors=10, metric='cosine')  # return the 10 nearest neighbors

neigh.fit(X)  # fit on the app description vectors



**Second,** implement a function that uses the vectorizer and the fitted class to find the nearest neighbours for a given query string:

In [60]:
def search(query):
    """Find the nearest neighbors in `df` for a query string.

    Arguments:
      query (str): A query string.

    Returns:
      The 10 apps most similar (in terms of cosine similarity) to the given
      query, as a Pandas DataFrame with each app's name and cosine distance.
    """
    query_vec = vectorizer.transform([query])  # turn the query into a tf–idf vector
    dist, index = neigh.kneighbors(query_vec)  # distances and row positions of the nearest neighbors
    # Build a DataFrame with the names of the closest documents and their distances
    df_new = pd.DataFrame({'name': df.iloc[index[0]]["name"], 'Distance': dist[0]})
    return df_new

### 🤞 Test your code

Test your implementation by running the following cell, which will sanity-check your return value and show the 10 best search results for the query _"pile up pancakes"_:

In [61]:
"""Check that searching for "pile up pancakes" returns a DataFrame with ten results,
   and that the top result is "Pancake Tower"."""

result = search('pile up pancakes')
display(result)
assert isinstance(result, pd.DataFrame), "search() function should return a Pandas DataFrame"
assert len(result) == 10, "search() function should return 10 search results"
assert result.iloc[0]["name"] == "Pancake Tower", "Top search result should be 'Pancake Tower'"
success()

Unnamed: 0,name,Distance
1032,Pancake Tower,0.140541
326,Cooking School: Games for Girls,0.831773
656,"Hell’s Cooking — crazy chef burger, kitchen fever",0.937301
1235,Solitaire,0.94035
1164,Rummy - Free,0.943607
1181,Sago Mini Trucks and Diggers,0.952065
436,Dr. Panda's Ice Cream Truck,0.954849
1442,Turbo Dismount™,0.955981
1446,UNO!™,0.95601
1326,TO-FU Oh!SUSHI,0.959162


Before continuing with the next problem, play around a bit with this simple search functionality by trying out different search queries, and see if the results look like what you would expect:

In [62]:
# Example — try out your own queries!
search("dodge trains")

Unnamed: 0,name,Distance
1428,Train Conductor World,0.680156
1301,Subway Surfers,0.695724
1300,Subway Princess Runner,0.773085
998,No Humanity - The Hardest Game,0.861715
228,Bus Rush 2,0.907905
1465,Virus War - Space Shooting Game,0.912711
360,Dancing Road: Color Ball Run!,0.913174
184,Bob - jigsaw puzzles free games for kids & parents,0.923505
179,Blocky Highway: Traffic Racing,0.927082
271,Cat Runner: Decorate Home,0.930841


In [63]:
search('survival shooters')

Unnamed: 0,name,Distance
372,"Day R Survival – Apocalypse, Lone Survivor and RPG",0.792994
354,DEAD WARFARE: Zombie Shooting - Gun Games Free,0.803232
1313,Survival on Raft: Ocean Nomad - Simulator,0.805102
1558,Zombie Hunter Sniper: Apocalypse Shooting Games,0.871971
408,Don't Starve: Pocket Edition,0.883075
847,MAD ZOMBIES : Offline Zombie Games,0.895886
141,Beetle.io,0.901796
799,Last Shelter: Survival,0.901825
1377,This War of Mine,0.902319
1220,Slingshot Championship,0.903289


In [64]:
search('football stats')

Unnamed: 0,name,Distance
327,Cool Goal!,0.722585
1016,Online Soccer Manager (OSM) - 2019/2020,0.737138
338,Crazy Kick!,0.790452
1233,"SofaScore - Live Scores, Fixtures & Standings",0.801238
550,Football Strike - Multiplayer Soccer,0.819802
256,Captain Tsubasa ZERO -Miracle Shot-,0.864666
1230,Soccer Scores - FotMob,0.880596
1538,World Soccer League,0.887208
1445,UDisc Disc Golf App,0.921225
807,Leghe Fantacalcio ®,0.92254


## Problem 3: Custom preprocessing & tokenization

In Problem 1, you should have seen that `TfidfVectorizer` already performs some preprocessing by default and also does its own tokenization of the input data. This is great for getting started, but often we want to have more control over these steps. We can customize some aspects of the preprocessing through arguments when instantiating `TfidfVectorizer`, but for this exercise, we want to do _all_ of our preprocessing & tokenizing outside of scikit-learn.

Concretely, we want to use [spaCy](https://spacy.io), a library that we will make use of in later labs as well.  Here is a brief example of how to load and use a spaCy model:

In [65]:
import spacy
# Load the small English model, disabling some components that we don't need right now
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])

# Take an example sentence and print every token from it separately
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)


Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


**Your task** is to write a preprocessing function that uses spaCy to perform the following steps:
- tokenization
- lemmatization
- stop word removal
- removing tokens containing non-alphabetical characters

We recommend that you go through the [Linguistic annotations](https://spacy.io/usage/spacy-101#annotations) section of the spaCy&nbsp;101, which demonstrates how you can get the relevant kind of information via the spaCy library.
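The order of the four steps can be sketched without any NLP library. In the snippet below, `STOP_WORDS` and `LEMMAS` are hypothetical toy stand-ins for the information that spaCy exposes through token attributes such as `token.is_stop` and `token.lemma_`, and whitespace splitting stands in for spaCy's tokenizer; it is meant only to illustrate the pipeline, not as a replacement for it.

```python
# Hypothetical stand-ins for spaCy's stop word list and lemmatizer
STOP_WORDS = {"is", "at", "for", "a", "the"}
LEMMAS = {"looking": "look", "buying": "buy"}

def toy_preprocess(text):
    """Tokenize naively, drop stop words, lemmatize, keep alphabetic lemmas."""
    tokens = text.split()                                          # 1. tokenization
    content = [t for t in tokens if t.lower() not in STOP_WORDS]   # 2. stop word removal
    lemmas = [LEMMAS.get(t, t) for t in content]                   # 3. lemmatization (table lookup)
    return [lemma for lemma in lemmas if lemma.isalpha()]          # 4. alphabetic lemmas only

result = toy_preprocess("Apple is looking at buying U.K. startup for $1 billion")
print(result)
```

Note that `str.isalpha()` rejects any string containing a digit, punctuation mark, or symbol, which is why "U.K." and "$1" are dropped in the last step.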

Implement your preprocessor by completing the following function:

In [66]:
def preprocess(text):
    """Preprocess the given text by tokenising it, removing any stop words,
    replacing each remaining token with its lemma (base form), and discarding
    all lemmas that contain non-alphabetical characters.

    Arguments:
      text (str): The text to preprocess.

    Returns:
      The list of remaining lemmas after preprocessing (represented as strings).
    """
    # YOUR CODE HERE
    doc = nlp(text)

    # Tokenization & lemmatization
    lemmas = [token.lemma_ for token in doc]

    # Stop word removal (the check is applied to each lemma)
    filtered_lemmas = [lemma for lemma in lemmas if not nlp.vocab[lemma].is_stop]

    # Discard lemmas containing non-alphabetical characters
    return [lemma for lemma in filtered_lemmas if lemma.isalpha()]

### 🤞 Test your code

Test your implementation by running the following cell:

In [67]:
"""Check that the preprocessing returns the correct output for a number of test cases."""

assert (
    preprocess('Apple is looking at buying U.K. startup for $1 billion') ==
    ['Apple', 'look', 'buy', 'startup', 'billion']
)
assert (
    preprocess('"Love Story" is a country pop song written and sung by Taylor Swift.') ==
    ['Love', 'Story', 'country', 'pop', 'song', 'write', 'sing', 'Taylor', 'Swift']
)
success()

## Problem 4: The effect of preprocessing

To make use of the new `preprocess` function from Problem 3, we need to make sure that we incorporate it into `TfidfVectorizer` and disable all preprocessing & tokenization that `TfidfVectorizer` performs by default. Afterwards, we also need to re-fit the vectorizer and the nearest-neighbors class. To make this a bit easier to handle, let’s take everything we have done so far and put it in a single class `AppSearcher`.

### Task 4.1

**Your first task** is to complete the stub of the `AppSearcher` class given below. Keep in mind:
- The `fit()` function should fit both the vectorizer (from Problem 1) and the nearest-neighbors class (from Problem 2).  Make sure to modify the call to `TfidfVectorizer` to _disable all preprocessing & tokenization_ that it would do by default, and replace it with a call to the `preprocess()` function _defined in `AppSearcher`_.
- For the `preprocess()` function, you can start by copying your solution from Problem 3.
- For the `search()` function, you can copy your solution from Problem 2.
- Make sure to adapt your code to store everything (data, vectorizer, nearest-neighbors class) within the `AppSearcher` class, so that your solution is independent of the code you wrote above!

In [74]:
class AppSearcher:
    def fit(self, df):
        """Instantiate and fit all the classes required for the search engine (cf. Problems 1 and 2)."""
        self.df = df
        # YOUR CODE HERE
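        # Replace the default preprocessing step (including lowercasing) with our own;
        # the default tokenizer still splits the string that preprocess() returns
        vectorizer = TfidfVectorizer(preprocessor=self.preprocess)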
        X = vectorizer.fit_transform(df['description'])  # Fit and transform the descriptions

        # Fit the NearestNeighbors model
        self.neigh = nn(n_neighbors = 10, metric = 'cosine')
        self.neigh.fit(X)  # Fit on the transformed vectorized data
        self.vectorizer = vectorizer  # Store the vectorizer


    def preprocess(self, text):
        """Preprocess the given text (cf. Problem 3)."""
        # YOUR CODE HERE
        doc = nlp(text)

        # Tokenization & Lemmatization
        lemmas = []
        for token in doc:
            lemmas.append(token.lemma_)

        # Stop word removal
        filtered_lemmas = []
        for lemma in lemmas:
            if nlp.vocab[lemma].is_stop == False:  # Only keep non-stop words
                filtered_lemmas.append(lemma)

        # Removing non-alphabetic characters
        final_lemmas = []
        for lemma in filtered_lemmas:
            if lemma.isalpha():
                final_lemmas.append(lemma)


        return " ".join(final_lemmas)



    def search(self, query):
        """Find the nearest neighbors in `df` for a query string (cf. Problem 2)."""
        # YOUR CODE HERE
        query_vec = self.vectorizer.transform([query])
        dist, index = self.neigh.kneighbors(query_vec)

        # Create a DataFrame with the names of the closest documents
        df_new = pd.DataFrame({'name': self.df.iloc[index[0]]["name"], 'Distance':dist[0]})
        return df_new



#### 🤞 Test your code

The following cell demonstrates how your class should be used. Note that it can take a bit longer to train it on the data than before, since we’re now calling spaCy for the preprocessing.

In [75]:
apps = AppSearcher()
apps.fit(df)
apps.search("pile up pancakes")

Unnamed: 0,name,Distance
1032,Pancake Tower,0.042465
326,Cooking School: Games for Girls,0.802253
1235,Solitaire,0.888335
1181,Sago Mini Trucks and Diggers,0.89807
1263,Spider Solitaire,0.905475
1164,Rummy - Free,0.940506
436,Dr. Panda's Ice Cream Truck,0.949923
1245,Solitaire Free,0.951328
1442,Turbo Dismount™,0.952369
427,Dr. Panda Ice Cream Truck Free,0.959701


### Task 4.2

**Your second task** is to experiment with the effect of using (or not using) different preprocessing steps.  We always need to _tokenize_ the text, but other preprocessing steps are optional and require a conscious decision whether to use them or not, such as:
- lemmatization
- lowercasing all characters
- removing stop words
- removing tokens containing non-alphabetical characters

**Modify the definition of the `preprocess()` function** of `AppSearcher` to include/exclude individual preprocessing steps, run some searches, and observe if and how the results change.  Which search queries you try out is up to you — you could compare searching for "pile up pancakes" with "pancake piling", for example; or you could try entirely different search queries aimed at different kinds of apps.  (You can modify the class directly by changing the cell above under Task 4.1, or copy the definitions to the cells below, whichever you prefer; there is no separate code to show for this task, but you will use your observations here for the individual reflection.)

In [87]:
class AppSearcher:
    def fit(self, df):
        """Instantiate and fit all the classes required for the search engine (cf. Problems 1 and 2)."""
        self.df = df
        # YOUR CODE HERE
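        # Replace the default preprocessing step (including lowercasing) with our own;
        # the default tokenizer still splits the string that preprocess() returns
        vectorizer = TfidfVectorizer(preprocessor=self.preprocess)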
        X = vectorizer.fit_transform(df['description'])  # Fit and transform the descriptions

        # Fit the NearestNeighbors model
        self.neigh = nn(n_neighbors = 10, metric = 'cosine')
        self.neigh.fit(X)  # Fit on the transformed vectorized data
        self.vectorizer = vectorizer  # Store the vectorizer


    def preprocess(self, text):
        """Preprocess the given text (cf. Problem 3)."""
        # YOUR CODE HERE
        doc = nlp(text)

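        # Tokenization only: lemmatization is switched off for this experiment,
        # so the list holds the original token texts rather than lemmas
        lemmas = []
        for token in doc:
            lemmas.append(token.text)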

        # Stop word removal
        filtered_lemmas = []
        for lemma in lemmas:
            if nlp.vocab[lemma].is_stop == False:  # Only keep non-stop words
                filtered_lemmas.append(lemma)

        # Removing non-alphabetic characters
        final_lemmas = []
        for lemma in filtered_lemmas:
            if lemma.isalpha():
                final_lemmas.append(lemma)


        return " ".join(final_lemmas)



    def search(self, query):
        """Find the nearest neighbors in `df` for a query string (cf. Problem 2)."""
        # YOUR CODE HERE
        query_vec = self.vectorizer.transform([query])
        dist, index = self.neigh.kneighbors(query_vec)

        # Create a DataFrame with the names of the closest documents
        df_new = pd.DataFrame({'name': self.df.iloc[index[0]]["name"], 'Distance':dist[0]})
        return df_new



In [88]:
apps = AppSearcher()
apps.fit(df)
apps.search("pile up pancakes")

Unnamed: 0,name,Distance
1032,Pancake Tower,0.066075
326,Cooking School: Games for Girls,0.803875
656,"Hell’s Cooking — crazy chef burger, kitchen fever",0.930242
1235,Solitaire,0.931532
1164,Rummy - Free,0.938789
436,Dr. Panda's Ice Cream Truck,0.946209
1442,Turbo Dismount™,0.95092
1245,Solitaire Free,0.954851
427,Dr. Panda Ice Cream Truck Free,0.956827
1326,TO-FU Oh!SUSHI,0.964055


In [81]:
apps.search("football stats")

Unnamed: 0,name,Distance
327,Cool Goal!,0.719588
1016,Online Soccer Manager (OSM) - 2019/2020,0.736827
550,Football Strike - Multiplayer Soccer,0.776025
338,Crazy Kick!,0.790515
1233,"SofaScore - Live Scores, Fixtures & Standings",0.799786
1538,World Soccer League,0.860266
256,Captain Tsubasa ZERO -Miracle Shot-,0.863801
1230,Soccer Scores - FotMob,0.867418
96,BLEACH Brave Souls,0.918519
807,Leghe Fantacalcio ®,0.918578


## Individual reflection

<div class="alert alert-info">
    <strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below.  Remember:
    <ul>
        <li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
        <li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
    </ul>
</div>

1. In Problem 1, which token had the highest tf–idf score, which the lowest?  Based on your knowledge of how tf–idf works, how would you explain this result?
2. Based on your observations in Problem 4, which preprocessing steps do you think are the most appropriate for this "search engine" example?  Why?

**Congratulations on finishing this lab! 👍**

<div class="alert alert-info">
    
➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors.  For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).

</div>