# <center>HW #1</center>


<div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>


**Instructions**:

- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order they appear in the problem description


**Problem Description**

In this assignment, you'll write functions that analyze an article to find its word distribution and key concepts.

The packages you'll need for this assignment include numpy and pandas.


## Q1. Define a function to analyze word counts in an input sentence

Define a function named `tokenize(text)` which does the following:

- accepts a sentence (i.e., `text` parameter) as an input
- splits the sentence into a list of tokens by **whitespace** (spaces, tabs, and newlines).
  - e.g., `it's a hello world!!!` will be split into tokens `["it's", "a", "hello", "world!!!"]`
- removes the **leading/trailing punctuation or spaces** of each token, if any
  - e.g., `world!!!` becomes `world`, while `it's` does not change
  - hint: import the `string` module, use `string.punctuation` to get a string of punctuation characters (say `puncts`), and then call `strip(puncts)` on each token to remove its leading or trailing punctuation
- only keeps tokens with 2 or more characters, i.e. `len(token)>1`
- converts all tokens into lower case
- finds the count of each unique token and saves the counts as a dictionary, e.g., `{'world': 1, 'hello': 1, ...}`
- returns the dictionary
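
The splitting and stripping steps above can be checked in isolation. The snippet below is only a small illustration of how `str.split()` and `str.strip()` behave on the example sentence; the variable names `raw` and `cleaned` are made up for this sketch, and it is not the full `tokenize` function.

```python
import string

# split() with no argument splits on any run of whitespace:
# spaces, tabs, and newlines are all treated as separators
raw = "it's a\thello\nworld!!!".split()

# strip(chars) removes only leading/trailing characters found in chars,
# so the apostrophe inside "it's" is left alone
cleaned = [t.strip(string.punctuation) for t in raw]

print(raw)
print(cleaned)
```

Because `strip` only touches the two ends of a token, punctuation inside a word (apostrophes, hyphens) survives this step.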


In [18]:
import string
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [19]:
def tokenize(text):

    vocab = {}

    # split on whitespace (spaces, tabs, and newlines)
    tokens = text.split()

    # characters to strip from both ends of each token
    puncts = string.punctuation

    for token in tokens:
        # remove leading/trailing punctuation and convert to lower case
        token = token.strip(puncts).lower()
        # keep only tokens with 2 or more characters and count them
        if len(token) > 1:
            vocab[token] = vocab.get(token, 0) + 1

    return vocab


In [20]:
# test your code
text = """it's a hello world!!!
           it is hello world again."""
tokenize(text)


{"it's": 1, 'hello': 2, 'world': 2, 'it': 1, 'is': 1, 'again': 1}

## Q2. Generate a document term matrix (DTM) as a numpy array

Define a function `get_dtm(sents)` as follows:

- accepts a list of sentences, i.e., `sents`, as an input
- uses `tokenize` function you defined in Q1 to get the count dictionary for each sentence
- pools the words from all the sentences together to get a list of unique words, denoted as `unique_words`
- creates a numpy array, say `dtm`, with a shape (# of sentences x # of unique words), and sets the initial values to 0.
- fills cell `dtm[i,j]` with the count of the `j`th word in the `i`th sentence
- returns `dtm` and `unique_words`
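
To make the target shape concrete, here is a toy illustration of the steps above. The two count dictionaries are invented stand-ins for `tokenize` output, and the helper name `col` is an assumption of this sketch, not part of the assignment.

```python
import numpy as np

# hypothetical count dictionaries, standing in for tokenize() output
counts = [{"hello": 2, "world": 1}, {"world": 1, "again": 1}]

# pool unique words in first-seen order
unique_words = []
for c in counts:
    for w in c:
        if w not in unique_words:
            unique_words.append(w)

# map each word to its column index
col = {w: j for j, w in enumerate(unique_words)}

# one row per sentence, one column per unique word
dtm = np.zeros((len(counts), len(unique_words)))
for i, c in enumerate(counts):
    for w, n in c.items():
        dtm[i, col[w]] = n

print(unique_words)
print(dtm)
```

Row `i` of `dtm` holds the counts for sentence `i`, and a zero means the word does not occur in that sentence.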


In [21]:
def get_dtm(sents):

    # tokenize each sentence once and keep its count dictionary
    counts = [tokenize(s) for s in sents]

    # pool unique words in first-seen order (the set makes lookups fast)
    unique_words = []
    seen = set()
    for c in counts:
        for w in c:
            if w not in seen:
                seen.add(w)
                unique_words.append(w)

    # map each word to its column index
    col = {w: j for j, w in enumerate(unique_words)}

    # fill dtm[i, j] with the count of word j in sentence i
    dtm = np.zeros((len(counts), len(unique_words)))
    for i, c in enumerate(counts):
        for w, n in c.items():
            dtm[i, col[w]] = n

    return dtm, unique_words


In [22]:
# A test document. This document can be found at https://hbr.org/2022/04/the-power-of-natural-language-processing

sents = pd.read_csv("sents.csv")
sents.head()


Unnamed: 0,text
0,The Power of Natural Language Processing.
1,"Until recently, the conventional wisdom was th..."
2,But in the past two years language-based AI ha...
3,It has been used to write an article for The G...
4,AI even excels at cognitive tasks like program...


In [23]:
dtm, all_words = get_dtm(sents.text)

# Check if the array is correct

# randomly check one sentence
idx = 3

# get the dictionary using the function in Q1
vocab = tokenize(sents["text"].loc[idx])
print(sorted(vocab.items(), key=lambda item: item[0]))

# get all non-zero entries in dtm[idx] and create a dictionary
# these two dictionaries should be the same
sents.loc[idx]
vocab1 = {all_words[j]: dtm[idx][j] for j in np.where(dtm[idx] > 0)[0]}
print(sorted(vocab1.items(), key=lambda item: item[0]))


[('ago', 1), ('ai-authored', 1), ('an', 1), ('and', 1), ('article', 1), ('been', 1), ('blog', 1), ('feats', 1), ('few', 1), ('for', 1), ('gone', 1), ('guardian', 1), ('has', 1), ('have', 1), ('it', 1), ('possible', 1), ('posts', 1), ('that', 1), ('the', 1), ('to', 1), ('used', 1), ('viral', 1), ('weren’t', 1), ('write', 1), ('years', 1)]


text    It has been used to write an article for The G...
Name: 3, dtype: object

[('ago', 1.0), ('ai-authored', 1.0), ('an', 1.0), ('and', 1.0), ('article', 1.0), ('been', 1.0), ('blog', 1.0), ('feats', 1.0), ('few', 1.0), ('for', 1.0), ('gone', 1.0), ('guardian', 1.0), ('has', 1.0), ('have', 1.0), ('it', 1.0), ('possible', 1.0), ('posts', 1.0), ('that', 1.0), ('the', 1.0), ('to', 1.0), ('used', 1.0), ('viral', 1.0), ('weren’t', 1.0), ('write', 1.0), ('years', 1.0)]


## Q3. Analyze DTM Array

**Don't use any loop in this task**. You should use array operations to take advantage of numpy's vectorized, high-performance computation.


Define a function named `analyze_dtm(dtm, words, sents)` which:

- takes $dtm$, $words$, and $sents$ as inputs, where $dtm$ is the array you get in Q2 with a shape $(m \times n)$, $words$ is an array of words corresponding to the columns of $dtm$, and $sents$ contains the original sentences corresponding to the rows of $dtm$ (needed to print the longest sentence).
- calculates the sentence frequency for each word, say $j$, i.e., how many sentences contain word $j$. Save the result to array $df$ ($df$ has shape of $(n,)$ or $(1, n)$).
- normalizes the word count per sentence: divides word count, i.e., $dtm_{i,j}$, by the total number of words in sentence $i$. Save the result as an array named $tf$ ($tf$ has shape of $(m,n)$).
- for each $dtm_{i,j}$, calculates $tf\_idf_{i,j} = \frac{tf_{i, j}}{df_j}$, i.e., divides each normalized word count by the sentence frequency of the word. The reason is that a word appearing in most sentences has little discriminative power (such words are often called `stop` words), and dividing by $df$ downgrades the weight of such words. $tf\_idf$ has shape of $(m,n)$
- prints out the following:
  - the total number of words in the document represented by $dtm$
  - the top 10 most frequent words in this document
  - words with the top 10 largest $df$ values (show words and their $df$ values)
  - the longest sentence (i.e., the one with the most words)
  - top-10 words with the largest $tf\_idf$ values in the longest sentence (show words and values)
- returns the $tf\_idf$ array.
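
The three array computations above can be verified by hand on a tiny matrix. The sketch below uses an invented $2 \times 3$ array named `toy_dtm`; it illustrates the broadcasting involved and is not the full `analyze_dtm` function.

```python
import numpy as np

# toy document-term matrix: 2 sentences, 3 words (made-up values)
toy_dtm = np.array([[2.0, 1.0, 0.0],
                    [0.0, 1.0, 1.0]])

# df: number of sentences containing each word, shape (3,)
df = np.count_nonzero(toy_dtm, axis=0)

# tf: each row divided by its own total; keepdims=True keeps shape (2, 1)
# so the division broadcasts across the columns of each row
tf = toy_dtm / toy_dtm.sum(axis=1, keepdims=True)

# tf_idf: shape (2, 3) divided by shape (3,) broadcasts across the rows
tf_idf = tf / df

print(df)
print(tf)
print(tf_idf)
```

Here the second word occurs in both sentences, so its $tf$ values are halved, while the words unique to one sentence keep their full weight.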

Note, for all the steps, **do not use any loop**. Just use array functions and broadcasting for high performance computation.
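
Top-k selection can also be done without an explicit loop. The following is a minimal sketch on invented data (`toy_words`, `scores`), assuming the words are stored in a numpy array so they can be indexed with an array of positions.

```python
import numpy as np

# made-up words and scores for illustration
toy_words = np.array(["hello", "world", "again"])
scores = np.array([0.1, 0.5, 0.3])

k = 2
# argsort sorts ascending, so negate the scores to get descending order
top = np.argsort(-scores)[:k]

# index both arrays with the same positions to pair words with scores
pairs = list(zip(toy_words[top].tolist(), scores[top].tolist()))
print(pairs)
```

The same pattern applies to word counts, $df$ values, or one row of $tf\_idf$.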


In [24]:
def analyze_dtm(dtm, words, sents):

    k = 10  # number of top entries to report
    words = np.asarray(words)  # allow indexing with an array of positions

    # sentence frequency: number of sentences containing each word, shape (n,)
    df = np.count_nonzero(dtm, axis=0)

    # normalize counts by sentence length; keepdims gives shape (m, 1) for broadcasting
    # np.maximum avoids division by zero for a sentence with no tokens
    tf = dtm / np.maximum(dtm.sum(axis=1, keepdims=True), 1)

    # down-weight words that appear in many sentences, shape (m, n)
    tf_idf = tf / df

    print(f"The total number of words:\n{dtm.sum()}\n")

    # total count of each word across the document
    frequent = dtm.sum(axis=0)
    top = np.argsort(-frequent)[:k]
    output = list(zip(words[top], frequent[top]))
    print(f"The top {k} frequent words:\n{output}\n")

    top = np.argsort(-df)[:k]
    output = list(zip(words[top], df[top]))
    print(f"The top {k} words with highest df values:\n{output}\n")

    # longest sentence: the row with the largest total word count
    idx = np.argmax(dtm.sum(axis=1))
    print(f"The longest sentence :\n{sents[idx]}\n")

    top = np.argsort(-tf_idf[idx])[:k]
    output = list(zip(words[top], tf_idf[idx][top]))
    print(
        f"The top {k} words with highest tf-idf values in the longest sentence:\n{output}\n")

    return tf_idf


In [25]:
words = np.array(all_words)

analyze_dtm(dtm, words, sents.text)


The total number of words:
1853.0

The top 10 frequent words:
[('the', 68.0), ('to', 65.0), ('and', 52.0), ('of', 50.0), ('for', 37.0), ('ai', 25.0), ('in', 24.0), ('is', 23.0), ('are', 22.0), ('like', 20.0)]

The top 10 words with highest df values:
[('the', 46), ('to', 42), ('and', 41), ('of', 36), ('for', 32), ('in', 21), ('ai', 21), ('like', 20), ('is', 20), ('tasks', 19)]

The longest sentence :
Language models are already reshaping traditional text analytics, but GPT-3 was an especially pivotal language model because, at 10x larger than any previous model upon release, it was the first large language model, which enabled it to perform even more advanced tasks like programming and solving high school–level math problems.

The top 10 words with highest tf-idf values in the longest sentence:
[('problems', 0.02), ('pivotal', 0.02), ('math', 0.02), ('10x', 0.02), ('larger', 0.02), ('upon', 0.02), ('reshaping', 0.02), ('release', 0.02), ('enabled', 0.02), ('perform', 0.02)]



array([[0.00362319, 0.16666667, 0.00462963, ..., 0.        , 0.        ,
        0.        ],
       [0.00074963, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00088731, 0.        , 0.00113379, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.00094518, 0.        , 0.00120773, ..., 0.        , 0.        ,
        0.        ],
       [0.00074963, 0.        , 0.        , ..., 0.03448276, 0.        ,
        0.        ],
       [0.00086957, 0.        , 0.        , ..., 0.        , 0.04      ,
        0.04      ]])

## Q4. Find keywords of the document (Bonus)

Can you leverage the $dtm$ array you generated to find a few keywords that can be used to tag this document (e.g., AI, language models, tools)?

Please use a narrative to describe your ideas and also implement your ideas.
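
One possible idea, offered as a sketch rather than the expected answer: a keyword should occur often in the document but not in every sentence. Under that assumption, each word can be scored by its total count multiplied by $\log(m / df_j)$, where $m$ is the number of sentences, so that a word present in all sentences scores 0. The helper name `top_keywords` and the toy data below are hypothetical, and this heuristic only finds single words, not phrases such as "language models".

```python
import numpy as np

def top_keywords(dtm, words, k=5):
    """Hypothetical helper: rank words by total count times log(m / df)."""
    words = np.asarray(words)
    m = dtm.shape[0]

    # number of sentences containing each word
    df = np.count_nonzero(dtm, axis=0)
    # total count of each word in the document
    total = dtm.sum(axis=0)

    # log(m / df) is 0 for a word found in every sentence;
    # np.maximum guards against df == 0, which a Q2-style dtm should not contain
    idf = np.log(m / np.maximum(df, 1))

    score = total * idf
    top = np.argsort(-score)[:k]
    return words[top].tolist()

# made-up example: "the" occurs in every sentence, "tools" is concentrated in one
toy_words = ["the", "ai", "language", "tools"]
toy_dtm = np.array([[1, 2, 0, 0],
                    [1, 1, 1, 0],
                    [1, 0, 0, 2]], dtype=float)

keywords = top_keywords(toy_dtm, toy_words, k=2)
print(keywords)
```

On the toy data, "the" drops out because it appears in all three sentences, while the more concentrated words rank first. Whether this scoring works well on the real article is something to check against the printed results.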
