

# **INFO5731 Assignment 3**

In this assignment, we will delve into various aspects of natural language processing (NLP) and text analysis. The tasks are designed to deepen your understanding of key NLP concepts and techniques, as well as to provide hands-on experience with practical applications.

Through these tasks, you'll gain practical experience in NLP techniques such as N-gram analysis, TF-IDF, word embedding model creation, and sentiment analysis dataset creation.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).


**Total points**: 100

**Deadline**: See Canvas

**Late submissions incur a 10% penalty for each day after the deadline.**


## Question 1 (30 points)

**Understand N-gram**

Write a Python program to conduct N-gram analysis on the dataset from Assignment 2. Write the code from **scratch instead of using any pre-existing libraries**:

(1) Count the frequency of all the N-grams for N=3 and N=2.

(2) Calculate the probabilities of all the bigrams in the dataset using the formula count(w1 w2) / count(w1), where w1 is the first word of the bigram. For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the noun phrases and calculate the relative probabilities of each review with respect to the other reviews (abstracts, or tweets) using the formula frequency(noun phrase) / max frequency(noun phrase) on the whole dataset.

Print the result as a table whose columns are the noun phrases and whose rows are the 100 reviews (abstracts, or tweets).
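
The bigram probability in part (2) divides the count of a word pair by the count of its first word. The sketch below reproduces the "really like" example on a hypothetical three-sentence corpus; the sentences and the helper name `bigram_probability` are illustrative assumptions, not part of the assignment data.

```python
from collections import Counter

# Hypothetical toy corpus, chosen so that "really" occurs three times
# and is followed by "like" exactly once.
docs = [
    "i really like this movie",
    "the plot was really slow",
    "it is really good",
]

unigram_counts = Counter()
bigram_counts = Counter()
for doc in docs:
    tokens = doc.lower().split()
    unigram_counts.update(tokens)
    # zip pairs each token with its right-hand neighbour
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_probability(w1, w2):
    """Estimate P(w2 | w1) = count(w1 w2) / count(w1); 0.0 if w1 is unseen."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(round(bigram_probability("really", "like"), 2))  # 0.33
```

Bigrams are counted within each document, so no pair spans a document boundary.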

In [None]:
import pandas as pd

CSV_PATH = "/content/Bhamore_Yash_s2_clean_steps.csv"
TEXT_COL = "abstract"

df = pd.read_csv(CSV_PATH)
print("Rows in file:", len(df))
print("Columns:", df.columns.tolist())

# drop empty texts
df = df.dropna(subset=[TEXT_COL])
df = df[df[TEXT_COL].astype(str).str.strip().ne("")]
print("Non-empty texts:", len(df))

# preview
df.head(3)


Rows in file: 4443
Columns: ['topic', 'paperId', 'title', 'abstract', 'year', 'venue', 'publicationTypes', 'authors', 'citationCount', 'paper_url', 'externalIds', 'isOpenAccess', 'clean_text_basic', 'clean_text_final', 'char_len', 'token_len', 'step0_raw', 'step4_lower', 'step1_no_noise', 'step2_no_numbers', 'step3_no_stopwords', 'step5_stemmed', 'step6_lemmatized']
Non-empty texts: 4443


Unnamed: 0,topic,paperId,title,abstract,year,venue,publicationTypes,authors,citationCount,paper_url,...,clean_text_final,char_len,token_len,step0_raw,step4_lower,step1_no_noise,step2_no_numbers,step3_no_stopwords,step5_stemmed,step6_lemmatized
0,machine learning,f9c602cc436a9ea2f9e7db48c77d924e09ce3c32,Fashion-MNIST: a Novel Image Dataset for Bench...,"We present Fashion-MNIST, a new dataset compri...",2017.0,arXiv.org,JournalArticle,Han Xiao;Kashif Rasul;Roland Vollgraf,9293,https://www.semanticscholar.org/paper/f9c602cc...,...,present fashion mnist new dataset compris x gr...,516,48,"We present Fashion-MNIST, a new dataset compri...","we present fashion-mnist, a new dataset compri...",we present fashion mnist a new dataset compris...,we present fashion mnist a new dataset compris...,present fashion mnist new dataset comprising x...,present fashion mnist new dataset compris x gr...,present fashion mnist new dataset comprising x...
1,machine learning,9c9d7247f8c51ec5a02b0d911d1d7b9e8160495d,TensorFlow: Large-Scale Machine Learning on He...,TensorFlow is an interface for expressing mach...,2016.0,arXiv.org,JournalArticle,Martín Abadi;Ashish Agarwal;P. Barham;E. Brevd...,11254,https://www.semanticscholar.org/paper/9c9d7247...,...,tensorflow interfac express machin learn algor...,1225,102,TensorFlow is an interface for expressing mach...,tensorflow is an interface for expressing mach...,tensorflow is an interface for expressing mach...,tensorflow is an interface for expressing mach...,tensorflow interface expressing machine learni...,tensorflow interfac express machin learn algor...,tensorflow interface expressing machine learni...
2,machine learning,0090023afc66cd2741568599057f4e82b566137c,A Survey on Bias and Fairness in Machine Learning,With the widespread use of artificial intellig...,2019.0,ACM Computing Surveys,JournalArticle;Review,Ninareh Mehrabi;Fred Morstatter;N. Saxena;Kris...,4699,https://www.semanticscholar.org/paper/0090023a...,...,widespread use artifici intellig ai system app...,1560,133,With the widespread use of artificial intellig...,with the widespread use of artificial intellig...,with the widespread use of artificial intellig...,with the widespread use of artificial intellig...,widespread use artificial intelligence ai syst...,widespread use artifici intellig ai system app...,widespread use artificial intelligence ai syst...


In [None]:
import re

# Take the first 100 abstracts for analysis
texts = df["abstract"].astype(str).head(100).tolist()
print(f"Using {len(texts)} texts for Q1.")

# Tokenization: lowercase, remove punctuation and extra spaces
def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only letters and numbers
    tokens = text.split()
    return tokens

# Apply tokenization
tokenized_texts = [tokenize(t) for t in texts]

# Show first 20 tokens from the first abstract
print("Example tokens from first abstract:\n", tokenized_texts[0][:20])


Using 100 texts for Q1.
Example tokens from first abstract:
 ['we', 'present', 'fashion', 'mnist', 'a', 'new', 'dataset', 'comprising', 'of', '28x28', 'grayscale', 'images', 'of', '70', '000', 'fashion', 'products', 'from', '10', 'categories']


In [None]:
from collections import Counter

unigram = Counter()
bigram  = Counter()
trigram = Counter()

for toks in tokenized_texts:
    unigram.update(toks)
    for i in range(len(toks) - 1):
        bigram[(toks[i], toks[i+1])] += 1
    for i in range(len(toks) - 2):
        trigram[(toks[i], toks[i+1], toks[i+2])] += 1

print(f"unique unigrams: {len(unigram)}, bigrams: {len(bigram)}, trigrams: {len(trigram)}")

# bigram probabilities: P(w2|w1) = count(w1,w2) / count(w1)
bigram_prob = {bg: c / unigram[bg[0]] for bg, c in bigram.items() if unigram[bg[0]] > 0}

# show top-10 bigrams by probability (tie-break by count)
top_bi = sorted(bigram.items(),
                key=lambda kv: (bigram_prob.get(kv[0], 0), kv[1]),
                reverse=True)[:10]
print("\nTop bigrams (P(w2|w1), count):")
for (w1, w2), c in top_bi:
    print(f"{w1} {w2:<18}  P={bigram_prob[(w1,w2)]:.4f}  count={c}")

# show top-10 trigrams by count
top_tri = trigram.most_common(10)
print("\nTop trigrams (count):")
for (w1, w2, w3), c in top_tri:
    print(f"{w1} {w2} {w3:<12}  count={c}")


unique unigrams: 3130, bigrams: 12311, trigrams: 16250

Top bigrams (P(w2|w1), count):
number of                  P=1.0000  count=16
covid 19                  P=1.0000  count=10
part of                  P=1.0000  count=9
variety of                  P=1.0000  count=7
ranging from                P=1.0000  count=6
due to                  P=1.0000  count=6
active learning            P=1.0000  count=6
lead to                  P=1.0000  count=6
serve as                  P=1.0000  count=5
https url                 P=1.0000  count=5

Top trigrams (count):
of machine learning      count=51
machine learning models        count=28
machine learning algorithms    count=22
machine learning and           count=20
in this paper         count=17
as well as            count=17
in machine learning      count=17
this paper we            count=15
the machine learning      count=13
machine learning ml            count=13


In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
!pip -q install nltk
import nltk, numpy as np, pandas as pd
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# simple NP chunker: (Adj)* + (Noun)+
GRAMMAR = r"NP: {<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(tokens):
    tagged = nltk.pos_tag(tokens)
    tree = chunker.parse(tagged)
    nps = []
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        words = [w for w, _ in subtree.leaves()]
        nps.append(" ".join(words))
    return nps

# collect NP counts per doc and global
from collections import Counter
doc_np_counts = []
global_np = Counter()

for toks in tokenized_texts:
    nps = noun_phrases(toks)
    c = Counter(nps)
    doc_np_counts.append(c)
    global_np.update(c)

# to keep the matrix compact, keep top-K NPs by global frequency
TOP_K = 50   # you can raise to 100 if needed
top_nps = [np_ for np_, _ in global_np.most_common(TOP_K)]
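# note: despite the name, this holds each NP's total count across the corpus,
# not its maximum count within a single document
max_per_np = {np_: global_np[np_] for np_ in top_nps}

# rows = docs, cols = NPs, value = freq_in_doc / total_corpus_freq_of_that_NP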
mat = np.zeros((len(doc_np_counts), len(top_nps)), dtype=float)
for i, c in enumerate(doc_np_counts):
    for j, np_ in enumerate(top_nps):
        if max_per_np[np_] > 0:
            mat[i, j] = c.get(np_, 0) / max_per_np[np_]

np_df = pd.DataFrame(mat, columns=top_nps)
np_df.insert(0, "doc_id", range(1, len(doc_np_counts)+1))

print(f"NP columns (TOP_K): {len(top_nps)}")
np_df.head(5)


NP columns (TOP_K): 50


Unnamed: 0,doc_id,machine,machine learning,data,methods,models,paper,algorithms,systems,model,...,approaches,issues,information,knowledge,area,experiments,way,development,predictions,order
0,1,0.011111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.022222,0.0,0.0,0.0,0.0,0.032258,0.037037,0.086957,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.011111,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.017241,0.0,0.0,0.03125,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.011111,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
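
The table above divides each per-document count by the phrase's total frequency in the corpus. Another plausible reading of "max frequency" is the largest count the phrase reaches in any single document. The sketch below shows that variant on hypothetical counts; `doc_np_counts_demo` and its values are made up for illustration.

```python
from collections import Counter

# Hypothetical per-document noun-phrase counts (three tiny "documents").
doc_np_counts_demo = [
    Counter({"machine learning": 4, "data": 1}),
    Counter({"machine learning": 2}),
    Counter({"data": 3, "model": 1}),
]

# Column order: every noun phrase seen anywhere, sorted for reproducibility.
phrases = sorted({p for counts in doc_np_counts_demo for p in counts})

# Largest count each phrase reaches in any single document.
max_freq = {p: max(counts[p] for counts in doc_np_counts_demo) for p in phrases}

# rows = documents, columns = phrases, value = freq_in_doc / max_freq_of_phrase
relative = [
    [counts[p] / max_freq[p] for p in phrases]
    for counts in doc_np_counts_demo
]

for doc_id, row in enumerate(relative, start=1):
    print(doc_id, dict(zip(phrases, row)))
```

Under this reading every column contains at least one value of 1.0, which makes rows easier to compare than when dividing by the corpus total.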


## Question 2 (25 points)

**Understand TF-IDF and Document representation**

Starting from the documents (all the reviews, abstracts, or tweets) collected for Assignment 2, write a Python program to:

(1) Build the document-term weight (tf * idf) matrix.

(2) Rank the documents with respect to a query (design the query yourself, for example, "An Outstanding movie with a haunting performance and best character development") using cosine similarity.

Note: Write the code from scratch instead of using any **pre-existing libraries**.

In [None]:
# Write your code here

import re, pandas as pd
try:
    tokenized_texts
except NameError:
    # rebuild quickly from the same df you loaded in Q1
    texts = df["abstract"].astype(str).head(100).tolist()
    def tokenize(s): return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()
    tokenized_texts = [tokenize(t) for t in texts]

# build corpus-level counts to choose a reasonable vocab
from collections import Counter
corpus_counts = Counter()
doc_freq = Counter()
for toks in tokenized_texts:
    corpus_counts.update(toks)
    doc_freq.update(set(toks))   # each term counted once per doc for DF

N_DOCS   = len(tokenized_texts)
MIN_DF   = 3        # ignore rare words that appear in fewer than 3 docs
MAX_VOC  = 5000     # safety limit

# vocabulary = terms with DF >= MIN_DF, sorted by frequency, capped at MAX_VOC
candidates = [w for w,dfc in doc_freq.items() if dfc >= MIN_DF]
candidates.sort(key=lambda w: corpus_counts[w], reverse=True)
vocab = candidates[:MAX_VOC]
term2idx = {w:i for i,w in enumerate(vocab)}
print(f"Docs: {N_DOCS}, vocab size: {len(vocab)} (MIN_DF={MIN_DF})")







Docs: 100, vocab size: 891 (MIN_DF=3)


In [None]:
import math
import numpy as np

# build per-doc TF (raw counts) on our vocab
doc_term_tf = []        # list[dict(term_idx -> tf count)]
df_counts   = np.zeros(len(vocab), dtype=int)

for toks in tokenized_texts:
    tf = Counter([t for t in toks if t in term2idx])
    # update DF: a term counts once per doc if present
    for t in tf.keys():
        df_counts[term2idx[t]] += 1
    # convert to index form
    tf_idx = {term2idx[t]: c for t,c in tf.items()}
    doc_term_tf.append(tf_idx)

# IDF with smoothing (textbook-ish)
# idf = log( (N_DOCS + 1) / (df + 1) ) + 1
idf = np.log((N_DOCS + 1) / (df_counts + 1)) + 1.0

# build TF-IDF vectors (L2 normalized)
def l2_norm(vec_dict):
    s = 0.0
    for j, v in vec_dict.items():
        s += v*v
    return math.sqrt(s) if s > 0 else 1.0

doc_tfidf = []   # list[dict(term_idx -> tfidf)]
for tf in doc_term_tf:
    # simple TF: raw counts (you could also use log(1+tf))
    tfidf = {j: (c * idf[j]) for j, c in tf.items()}
    # L2 normalize for comparability
    norm = l2_norm(tfidf)
    tfidf = {j: v / norm for j, v in tfidf.items()} if norm > 0 else tfidf
    doc_tfidf.append(tfidf)

print("Built TF-IDF vectors for all docs.")


Built TF-IDF vectors for all docs.


In [None]:
# helper to show top-k terms by TF-IDF for a doc
def top_terms(tfidf_vec, k=10):
    pairs = sorted(tfidf_vec.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # vocab[j] is the term at index j (term2idx was built by enumerating vocab)
    return [(vocab[j], v) for j, v in pairs]

# show top terms for first 3 documents
for di in range(min(3, len(doc_tfidf))):
    tops = top_terms(doc_tfidf[di], k=10)
    print(f"\nDoc {di+1} top terms:")
    for term, score in tops:
        print(f"  {term:20s}  {score:.4f}")

# optional: make a (small) dense matrix for the first 1000 vocab terms to save/inspect
MAX_SAVE_VOC = min(len(vocab), 1000)
mat = np.zeros((len(doc_tfidf), MAX_SAVE_VOC), dtype=float)
for i, tfidf in enumerate(doc_tfidf):
    for j, v in tfidf.items():
        if j < MAX_SAVE_VOC:
            mat[i, j] = v

tfidf_df = pd.DataFrame(mat, columns=vocab[:MAX_SAVE_VOC])
tfidf_df.insert(0, "doc_id", range(1, len(doc_tfidf)+1))
tfidf_df.head(3)

# save for submission (optional)
tfidf_df.to_csv("a3_q2_tfidf_matrix_small.csv", index=False)
print("\nSaved a3_q2_tfidf_matrix_small.csv (first", MAX_SAVE_VOC, "vocab terms).")



Doc 1 top terms:
  000                   0.4652
  images                0.4107
  fashion               0.3684
  dataset               0.2886
  10                    0.2220
  set                   0.1924
  the                   0.1795
  training              0.1440
  products              0.1228
  test                  0.1228

Doc 2 top terms:
  implementation        0.3007
  and                   0.2872
  interface             0.2848
  algorithms            0.1900
  variety               0.1739
  devices               0.1739
  of                    0.1692
  systems               0.1676
  such                  0.1485
  wide                  0.1484

Doc 3 top terms:
  ai                    0.3915
  in                    0.2535
  systems               0.2413
  researchers           0.2031
  biases                0.1957
  that                  0.1953
  address               0.1878
  to                    0.1663
  different             0.1583
  and                   0.1551

Saved a3_q2_tfi
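
The cells above build the TF-IDF vectors but stop before part (2), ranking documents against a query. The sketch below shows one way that step could look, using only the standard library. The mini-corpus, the query string, and the helper names (`idf`, `tfidf_vector`, `cosine`) are hypothetical; the IDF uses the same smoothed form as the cell above.

```python
import math
from collections import Counter

# Hypothetical mini-corpus and query, for illustration only.
docs = [
    "machine learning models for image classification",
    "a survey on bias and fairness in machine learning",
    "deep learning for natural language processing",
]
query = "fairness in machine learning"

tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    # Smoothed IDF: log((N + 1) / (df + 1)) + 1
    return math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1.0

def tfidf_vector(tokens):
    """Sparse TF-IDF vector (term -> weight), restricted to corpus vocabulary."""
    tf = Counter(t for t in tokens if t in doc_freq)
    return {t: c * idf(t) for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_vecs = [tfidf_vector(toks) for toks in tokenized]
query_vec = tfidf_vector(query.lower().split())

# Rank document indices by descending similarity to the query.
ranking = sorted(
    ((cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)),
    reverse=True,
)
for score, i in ranking:
    print(f"doc {i + 1}: {score:.4f}  {docs[i]}")
```

Query terms that never occur in the corpus are dropped, since they have no document frequency and cannot match any document.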

## Question 3 (25 points)

**Create your own word embedding model**

Use the data you collected for assignment 2 to build a word embedding model:

(1) Train a 300-dimensional word embedding model (for example, word2vec, GloVe, ULMFiT, or a fine-tuned BERT model).

(2) Visualize the embeddings using PCA or t-SNE in 2D. Create a scatter plot of at least 20 words and show how similar words cluster together.

(3) Calculate the cosine similarity between a few pairs of words to see if the model captures semantic similarity accurately.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/

In [None]:
# Write your code here
# Q3 (1) – POS Tagging
!pip -q install spacy
import spacy, pandas as pd

# load small English model
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# take first 100 abstracts from your dataset
texts = df["abstract"].astype(str).head(100).tolist()

pos_counts = {"NOUN": 0, "VERB": 0, "ADJ": 0, "ADV": 0}

for text in texts:
    doc = nlp(text)
    for token in doc:
        if token.pos_ in pos_counts:
            pos_counts[token.pos_] += 1

print("Total POS counts across 100 abstracts:")
for pos, cnt in pos_counts.items():
    print(f"{pos:<5}: {cnt}")







Total POS counts across 100 abstracts:
NOUN : 5766
VERB : 2293
ADJ  : 2047
ADV  : 596


In [None]:
# Q3 (2) — constituency (benepar standalone) + dependency (spaCy) — robust version
!pip -q install spacy benepar
!python -m spacy download en_core_web_sm

import spacy, benepar
from nltk.tree import Tree

nlp = spacy.load("en_core_web_sm")
benepar.download("benepar_en3")
parser = benepar.Parser("benepar_en3")   # standalone parser (no spaCy pipe)

# pick a SHORT, safe sentence from your abstracts (<= 40 tokens)
texts = df["abstract"].dropna().astype(str).head(100).tolist()
short_sent = None
for txt in texts:
    doc = nlp(txt)
    for s in doc.sents:
        if 5 <= len(s) <= 40:   # avoid super short & super long sentences
            short_sent = s.text.strip()
            break
    if short_sent:
        break

if not short_sent:
    # last resort, take first abstract and truncate
    short_sent = " ".join(nlp(texts[0]).text.split()[:30])

print("Example sentence:\n", short_sent, "\n")

# ---- Constituency parse via benepar (standalone) ----
# tokenize with spaCy but pass tokens to benepar directly
tokens = [t.text for t in nlp.make_doc(short_sent)]
const_tree: Tree = parser.parse(tokens)
print("Constituency parse tree:")
print(const_tree)   # bracketed form
# optional pretty print:
# print(const_tree.pformat(margin=120))

# ---- Dependency parse via spaCy ----
doc = nlp(short_sent)
print("\nDependency relationships (token → head / relation):")
for tok in doc:
    print(f"{tok.text:15s} → {tok.head.text:15s} {tok.dep_}")




Example sentence:
 We present Fashion-MNIST, a new dataset comprising of 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. 

Constituency parse tree:
(TOP
  (S
    (NP (PRP We))
    (VP
      (VBP present)
      (NP
        (NP (NNP Fashion) (HYPH -) (NNP MNIST))
        (, ,)
        (NP
          (NP (DT a) (JJ new) (NN dataset))
          (VP
            (VBG comprising)
            (PP
              (IN of)
              (NP
                (NP
                  (NP (CD 28x28) (JJ grayscale) (NNS images))
                  (PP
                    (IN of)
                    (NP
                      (NP (CD 70,000) (NN fashion) (NNS products))
                      (PP (IN from) (NP (CD 10) (NNS categories))))))
                (, ,)
                (PP
                  (IN with)
                  (NP
                    (NP (CD 7,000) (NNS images))
                    (PP (IN per) (NP (NN category)))))))))))
    (. .)))

Depende



In [None]:
# Q3 (3) – Named Entity Recognition (NER): extract and count entities
import spacy, pandas as pd
from collections import Counter

# reuse the first 100 abstracts
texts = df["abstract"].astype(str).head(100).tolist()

# if nlp not in memory, reload (safe)
try:
    nlp
except NameError:
    import spacy
    nlp = spacy.load("en_core_web_sm")

label_map = {
    "PERSON": "Person",
    "ORG": "Organization",
    "GPE": "Location",        # geo-political entity (countries/cities)
    "LOC": "Location",
    "PRODUCT": "Product",
    "DATE": "Date",
    "TIME": "Time",
    "EVENT": "Event",
    "WORK_OF_ART": "WorkOfArt",
    "LAW": "Law",
    "LANGUAGE": "Language",
    "NORP": "Group",          # nationalities/religions/political groups
    "FAC": "Facility",
    "PERCENT": "Percent",
    "MONEY": "Money",
    "QUANTITY": "Quantity",
    "ORDINAL": "Ordinal",
    "CARDINAL": "Cardinal"
}

all_ents = []
doc_entities = []  # per-doc list of (text,label)

for text in texts:
    doc = nlp(text)
    ents = [(ent.text, label_map.get(ent.label_, ent.label_)) for ent in doc.ents]
    doc_entities.append(ents)
    all_ents.extend([lab for _, lab in ents])

# overall label counts
label_counts = Counter(all_ents)
print("Overall entity counts across 100 abstracts:")
for lab, cnt in label_counts.most_common():
    print(f"{lab:12s}: {cnt}")

# build a DataFrame of the top entities (text + label) for quick preview
rows = []
for i, ents in enumerate(doc_entities, start=1):
    for txt, lab in ents:
        rows.append({"doc_id": i, "entity": txt, "label": lab})

ner_df = pd.DataFrame(rows)
print("\nPreview of extracted entities:")
ner_df.head(10)


Overall entity counts across 100 abstracts:
Organization: 214
Cardinal    : 132
Date        : 42
Person      : 21
Location    : 20
Percent     : 16
Ordinal     : 15
Group       : 9
Product     : 6
Quantity    : 3
Facility    : 2
Event       : 2
Language    : 2
Money       : 1
WorkOfArt   : 1
Law         : 1

Preview of extracted entities:


Unnamed: 0,doc_id,entity,label
0,1,Fashion-MNIST,Organization
1,1,28x28,Cardinal
2,1,70000,Cardinal
3,1,10,Cardinal
4,1,7000,Cardinal
5,1,60000,Cardinal
6,1,10000,Cardinal
7,2,TensorFlow,Organization
8,2,TensorFlow,Organization
9,2,hundreds,Cardinal


## Question 4 (20 Points)

**Create your own training and evaluation dataset for an NLP task.**

**You don't need to write a program for this question!**

For example, if you collected movie review or product review data, you can do the following:

*   Read each review (abstract or tweet) you collected in detail, and annotate it with a sentiment (positive, negative, or neutral).

*   Save the annotated dataset to a CSV file with three columns (document_id, clean_text, sentiment), upload the CSV file to GitHub, and submit the file link below.

*   This dataset will be used for Assignment 4: sentiment analysis and text classification.




1.   Which NLP task would you like to perform on your selected dataset (NER, summarization, sentiment analysis, or text classification)?
2.  Explain the labeling schema you used and list its labels.

3.  You may use AI assistance for labeling the data only.



In [15]:
# Q4(1) — Build annotation template from your dataset
import pandas as pd
import re
import numpy as np

# 1) take 120 reasonably short sentences from your abstracts
texts = df["abstract"].dropna().astype(str).tolist()

def split_into_sentences(text):
    # simple sentence split (keeps things dependency-free)
    # you already have spacy loaded above; but let's keep this super light
    # fall back to regex for robustness
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    parts = [p.strip() for p in parts if len(p.strip().split()) >= 6 and len(p.strip().split()) <= 40]
    return parts

sentences = []
for t in texts:
    sentences.extend(split_into_sentences(t))
    if len(sentences) >= 120:
        break

sentences = sentences[:120]
template = pd.DataFrame({
    "id": range(1, len(sentences)+1),
    "text": sentences,
    "label": ""   # <-- you will fill this with NLP / CV / ML
})

TEMPLATE_CSV = "a3_q4_annotation_template.csv"
template.to_csv(TEMPLATE_CSV, index=False)
print(f"Saved template → {TEMPLATE_CSV}  ({len(template)} rows)")

template.head(5)


Saved template → a3_q4_annotation_template.csv  (120 rows)


Unnamed: 0,id,text,label
0,1,"We present Fashion-MNIST, a new dataset compri...",
1,2,"The training set has 60,000 images and the tes...",
2,3,Fashion-MNIST is intended to serve as a direct...,
3,4,The dataset is freely available at this https URL,
4,5,TensorFlow is an interface for expressing mach...,


In [16]:
# Q4(2) — Validate and summarize the annotated dataset
import pandas as pd

ANNOTATED_IN  = "/content/a3_q4_annotation_template.csv"       # the one you filled
ANNOTATED_OUT = "a3_q4_annotated_clean.csv" # validated, clean

df_ann = pd.read_csv(ANNOTATED_IN)
print("Loaded:", df_ann.shape)
assert set(["id","text","label"]).issubset(df_ann.columns), "CSV must have columns: id, text, label"

# strip and uppercase to normalize
df_ann["label"] = df_ann["label"].astype(str).str.strip().str.upper()

# allowed labels
ALLOWED = {"NLP", "CV", "ML"}
bad = df_ann[~df_ann["label"].isin(ALLOWED)]
if len(bad):
    print("⚠️ Found invalid labels (showing first 10):")
    display(bad.head(10))
    print("\nAllowed labels are:", ALLOWED)
else:
    print("✅ All labels valid.")

# drop exact duplicates (same text+label)
before = len(df_ann)
df_ann = df_ann.drop_duplicates(subset=["text","label"]).reset_index(drop=True)
after = len(df_ann)
if after < before:
    print(f"Removed {before-after} duplicate rows.")

# basic stats
print("\nLabel distribution:")
print(df_ann["label"].value_counts())

# quick length sanity (optional)
df_ann["len"] = df_ann["text"].str.split().apply(len)
print("\nSentence length summary (tokens):")
print(df_ann["len"].describe().round(2))

# save clean file (no helper column)
df_ann = df_ann[["id","text","label"]]
df_ann.to_csv(ANNOTATED_OUT, index=False)
print(f"\nSaved clean annotations → {ANNOTATED_OUT}")


Loaded: (120, 3)
⚠️ Found invalid labels (showing first 10):


Unnamed: 0,id,text,label
0,1,"We present Fashion-MNIST, a new dataset compri...",NAN
1,2,"The training set has 60,000 images and the tes...",NAN
2,3,Fashion-MNIST is intended to serve as a direct...,NAN
3,4,The dataset is freely available at this https URL,NAN
4,5,TensorFlow is an interface for expressing mach...,NAN
5,6,This paper describes the TensorFlow interface ...,NAN
6,7,The TensorFlow API and a reference implementat...,NAN
7,8,With the widespread use of artificial intellig...,NAN
8,9,AI systems can be used in many sensitive envir...,NAN
9,10,More recently some work has been developed in ...,NAN



Allowed labels are: {'CV', 'NLP', 'ML'}

Label distribution:
label
NAN    120
Name: count, dtype: int64

Sentence length summary (tokens):
count    120.00
mean      23.23
std        7.84
min        9.00
25%       17.75
50%       23.00
75%       28.00
max       40.00
Name: len, dtype: float64

Saved clean annotations → a3_q4_annotated_clean.csv


In [None]:
# The GitHub link of your final csv file


# Link:



# Mandatory Question

Provide your thoughts on the assignment by filling out the linked survey. What did you find challenging, and which aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Type your answer