# 0. Data Setup

In [1]:
!gdown 1Ale0UvXib0oiOQmtlbAJCARxPxCg3yOk
!gdown 16_nlvHX92GyDyIhchvs39p7KSnywdpd6
!gdown 1I8jfE53cHlMyxQDeTbUmpOnjV7PKs2Bg

Downloading...
From: https://drive.google.com/uc?id=1Ale0UvXib0oiOQmtlbAJCARxPxCg3yOk
To: /content/train.csv
100% 200M/200M [00:10<00:00, 19.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=16_nlvHX92GyDyIhchvs39p7KSnywdpd6
To: /content/dev.csv
100% 24.7M/24.7M [00:00<00:00, 44.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1I8jfE53cHlMyxQDeTbUmpOnjV7PKs2Bg
To: /content/test.csv
100% 24.8M/24.8M [00:00<00:00, 39.1MB/s]


In [2]:
!pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
train_df = pd.read_csv('train.csv')
dev_df = pd.read_csv('dev.csv')
test_df = pd.read_csv('test.csv')

In [5]:
train_df = train_df.sample(4600, random_state = 42)
dev_df = dev_df.sample(575, random_state = 42)
test_df = test_df.sample(575, random_state = 42)

# 1. Data Preprocessing

Preprocess the reviews with NLTK:

*   Apply lowercasing
*   Remove punctuation
*   Apply tokenization
*   Remove stop words

In [6]:
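# build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
  # lower-case
  text = text.lower()
  # strip punctuation; contractions such as "don't" become "dont",
  # which may no longer match the stop-word list
  text = text.translate(str.maketrans('', '', string.punctuation))
  # tokenize
  tokens = word_tokenize(text)
  # remove stop words
  tokens = [token for token in tokens if token not in STOP_WORDS]
  return tokens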
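
To make the steps concrete without downloading any NLTK data, here is a dependency-free sketch of the same pipeline. It is only an approximation: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, `str.split` stands in for `word_tokenize`, and the stop-word set is a deliberately tiny stand-in for NLTK's English list.

```python
import string

# Hypothetical, deliberately tiny stop-word set (NLTK's English list is much longer).
TOY_STOP_WORDS = {"it", "its", "to", "for", "you", "can", "the", "a", "don't"}

def toy_preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()  # stand-in for word_tokenize
    return [t for t in tokens if t not in TOY_STOP_WORDS]

print(toy_preprocess("Works great for its simplicity. You can use it to control water flow."))
# ['works', 'great', 'simplicity', 'use', 'control', 'water', 'flow']
print(toy_preprocess("Don't buy it."))
# ['dont', 'buy']
```

The second call shows a side effect of stripping punctuation before filtering: "don't" is in the stop-word set, but by the time the filter runs it has become "dont" and survives.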

In [7]:
train_df['clean_reviews'] = train_df['reviewText'].apply(preprocess_text)
dev_df['clean_reviews'] = dev_df['reviewText'].apply(preprocess_text)
test_df['clean_reviews'] = test_df['reviewText'].apply(preprocess_text)

In [9]:
for i in range(4):
  print("Original Review:", train_df['reviewText'].iloc[i])
  print("Clean Review:", train_df['clean_reviews'].iloc[i])

Original Review: Works great for its simplicity. You can use it to control water flow.
Clean Review: ['works', 'great', 'simplicity', 'use', 'control', 'water', 'flow']
Original Review: This review applies to CD number 821 233-2, a two-CD package.

The tracks are said to have been mastered from the original mono tapes and 78 rpm discs*.
DISC 1:
1. Move It On Over
2. A Mansion on the Hill
3. Lovesick Blues*
4. Wedding Bells
5. Mind Your Own Business
6. You're Gonna Change
7. Lost Highway
8. My Bucket's Got a Hole In It
9. I'm So Lonesome I Could Cry
10. I Just Don't Like This Kind of Living
11. Long Gone Lonesome Blues
12. My Son Calls Another Man Daddy
13. Why Don't You Love Me
14. Why Should We Try Anymore
15. They'll Never Take Her Love From Me
16. Moanin' the Blues
17. Nobody's Lonesome for Me
18. Cold Cold Heart
19. Dear John
20. Howlin' at the Moon

DISC 2:
1. I Can't Help It
2. Hey, Good Lookin'
3. Crazy Heart
4. Lonesome Whistle
5. Baby, We're Really in Love
6. Ramblin' Man
7. H

# 2. Simple Baseline Model

The baseline outputs *n* words sampled at random, without replacement, from the preprocessed review. Reviews with *n* or fewer tokens are returned whole:

In [10]:
import random

In [11]:
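def simple_model(review, n):
  # review is a list of tokens; sample n of them without replacement
  if len(review) > n:
    return ' '.join(random.sample(review, n))
  else:
    # reviews with n or fewer tokens are returned whole
    return ' '.join(review)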
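
Since the baseline draws from Python's global random state, its output changes from run to run. One possible way to make it repeatable is to pass a seed and sample from a local generator. This is a sketch, not part of the notebook: `seeded_simple_model` and its `seed` parameter are hypothetical.

```python
import random

def seeded_simple_model(review, n, seed=42):
    """Return up to n words sampled without replacement, joined by spaces."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    if len(review) > n:
        return ' '.join(rng.sample(review, n))
    return ' '.join(review)

tokens = ['works', 'great', 'simplicity', 'use', 'control', 'water', 'flow']
first = seeded_simple_model(tokens, 3)
second = seeded_simple_model(tokens, 3)
print(first == second)                    # True: same seed, same sample
print(seeded_simple_model(['works'], 3))  # works
```

Reusing one seed for every review means reviews of equal length are sampled at the same positions; deriving the seed from the row index would avoid that if it matters.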

Sample Outputs:

In [36]:
for i in range(9,14):
  print("Original Review:", train_df['reviewText'].iloc[i])
  print("Generated Summary:", simple_model(train_df['clean_reviews'].iloc[i], 3))
  print("Original Summary:", train_df['summary'].iloc[i])

Original Review: It works.
Generated Summary: works
Original Summary: Five Stars
Original Review: Rock & Roll Over is the Kiss album sandwiched between the more famous Destroyer and the more controversial Love Gun, so it tends to be overlooked. However, it might be the best of those three. After the bombastic polished production of Destroyer Kiss went on stage with Eddie Kramer recording and belted out a batch of energetic, fun, straight-shooting rockers. In retrospect, it was a magic time when they had the success of Destroyer under their belts but didn't get to excess of the Love Gun/Alive II era. Kind of like surfing a killer wave while it's breaking but hasn't crested yet. Is "Calling Dr Love" the greatest Gene Simmons gem in an arsenal full of rock classics? I say yes. Turn that song up to 10 from the opening note and it just blows you away. Dr Love also contains my all-time favorite Ace Frehely guitar solo. Stanley's "I Want You" might be the best lead-off track ever. "Makin Love

# 3. Evaluation and Parameter Tuning

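We score generated summaries against the reference summaries with ROUGE-1, ROUGE-2, and ROUGE-L F-measures, using the `rouge-score` package.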

In [25]:
!pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=4c08bfb748aa74ca9fc06bba319f6b5c8c1b210bebd6e89dc10a07ae3863cb09
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [59]:
from rouge_score import rouge_scorer

In [60]:
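def rouge_score(prediction, ground_truth):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # RougeScorer.score expects (target, prediction); swapping them leaves the
    # F-measure unchanged but exchanges precision and recall
    scores = scorer.score(ground_truth, prediction)
    return scores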
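
To show what these F-measures count, here is a simplified, pure-Python sketch of ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence). It is not the `rouge-score` implementation: it splits on whitespace and does no stemming, and the `toy_*` names are hypothetical.

```python
from collections import Counter

def f_measure(overlap, ref_len, pred_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)

def toy_rouge1_f(reference, prediction):
    ref, pred = reference.lower().split(), prediction.lower().split()
    # clipped unigram overlap: each word counts at most as often as it appears in both
    overlap = sum((Counter(ref) & Counter(pred)).values())
    return f_measure(overlap, len(ref), len(pred))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def toy_rougeL_f(reference, prediction):
    ref, pred = reference.lower().split(), prediction.lower().split()
    return f_measure(lcs_length(ref, pred), len(ref), len(pred))

reference = "works great"
prediction = "great product works"
print(round(toy_rouge1_f(reference, prediction), 3))  # 0.8
print(round(toy_rougeL_f(reference, prediction), 3))  # 0.4
```

Both reference words appear in the prediction, so ROUGE-1 is high, but they appear in the opposite order, so the longest common subsequence has length 1 and ROUGE-L is lower. A model that samples words at random ignores word order, which is one reason its ROUGE-L can trail its ROUGE-1.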

In [61]:
def evaluate(predictions, ground_truths):
  rouge1_f = 0
  rouge2_f = 0
  rougeL_f = 0

  num_reviews = len(predictions)
  for pred, actual in zip(predictions, ground_truths):
    scores = rouge_score(pred, actual)
    # each entry is a Score(precision, recall, fmeasure) named tuple
    rouge1_f += scores['rouge1'].fmeasure
    rouge2_f += scores['rouge2'].fmeasure
    rougeL_f += scores['rougeL'].fmeasure

  rouge1_f = rouge1_f / num_reviews
  rouge2_f = rouge2_f / num_reviews
  rougeL_f = rougeL_f / num_reviews

  return (rouge1_f, rouge2_f, rougeL_f)

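We want to find the number of sampled words that maximizes the ROUGE-L F-score on the training sample. Because the baseline samples at random, the scores vary slightly between runs.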

In [64]:
possible_k = [1, 3, 5, 10, 20, 50, 100]
best_k = 0
best_rougeL_f = 0

for k in possible_k:
  predicted_summaries = train_df['clean_reviews'].apply(lambda x: simple_model(x, k))
  actual_summaries = train_df['summary']

  rouge1_f, rouge2_f, rougeL_f = evaluate(predicted_summaries, actual_summaries)
  print(f"Current K: {k}")
  print(f"Current rougeL_f: {rougeL_f}")
  if rougeL_f > best_rougeL_f:
    best_k = k
    best_rougeL_f = rougeL_f
print(f'Best K: {best_k}')
print(f'Best RougeL_f: {best_rougeL_f}')

Current K: 1
Current rougeL_f: 0.03873137394280891
Current K: 3
Current rougeL_f: 0.06931777111593089
Current K: 5
Current rougeL_f: 0.08046215994327646
Current K: 10
Current rougeL_f: 0.09747161171996421
Current K: 20
Current rougeL_f: 0.10648441168399525
Current K: 50
Current rougeL_f: 0.10826242572892045
Current K: 100
Current rougeL_f: 0.10776902613193819
Best K: 50
Best RougeL_f: 0.10826242572892045
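
The gap between K = 50 and K = 100 above is small, and a single random draw per K is a noisy estimate. One possible refinement is to average the score over several seeds before picking K. The sketch below is illustrative only: the toy reviews and summaries are invented, `unigram_f1` is a simplified stand-in for ROUGE, and all function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def unigram_f1(reference, prediction):
    """Simplified stand-in for ROUGE-1 F (whitespace tokens, no stemming)."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    overlap = sum((Counter(ref) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

def sample_words(tokens, k, rng):
    """Same rule as the baseline, but drawing from an explicit generator."""
    return ' '.join(rng.sample(tokens, k)) if len(tokens) > k else ' '.join(tokens)

def mean_score_for_k(reviews, summaries, k, seeds):
    """Average the corpus-level score over several independently seeded runs."""
    per_seed = []
    for seed in seeds:
        rng = random.Random(seed)
        preds = [sample_words(tokens, k, rng) for tokens in reviews]
        per_seed.append(mean(unigram_f1(s, p) for s, p in zip(summaries, preds)))
    return mean(per_seed)

# Invented toy data, only to make the sketch runnable.
reviews = [['works', 'great', 'simplicity', 'use', 'control', 'water', 'flow'],
           ['great', 'album', 'classic', 'songs', 'love', 'guitar', 'solo']]
summaries = ['works great', 'classic album']

scores = {k: mean_score_for_k(reviews, summaries, k, seeds=range(5)) for k in (1, 3, 5)}
toy_best_k = max(scores, key=scores.get)
print(scores, toy_best_k)
```

With more seeds the averages stabilize, at the cost of proportionally more evaluation time.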


# 4. Generate Results

In [65]:
dev_predictions = dev_df['clean_reviews'].apply(lambda x: simple_model(x, best_k))
dev_references = dev_df['summary']

In [66]:
test_predictions = test_df['clean_reviews'].apply(lambda x: simple_model(x, best_k))
test_references = test_df['summary']

In [67]:
train_predictions = train_df['clean_reviews'].apply(lambda x: simple_model(x, best_k))
train_references = train_df['summary']

In [69]:
print("Training Eval")
rouge1_f, rouge2_f, rougeL_f = evaluate(train_predictions, train_references)
print("ROUGE-1 F-Score: ", (rouge1_f))
print("ROUGE-2 F-Score: ", (rouge2_f))
print("ROUGE-L F-Score: ", (rougeL_f))
print("-------------")
print("Dev Eval")
rouge1_f, rouge2_f, rougeL_f = evaluate(dev_predictions, dev_references)
print("ROUGE-1 F-Score: ", (rouge1_f))
print("ROUGE-2 F-Score: ", (rouge2_f))
print("ROUGE-L F-Score: ", (rougeL_f))
print("-------------")
print("Test Eval")
rouge1_f, rouge2_f, rougeL_f = evaluate(test_predictions, test_references)
print("ROUGE-1 F-Score: ", (rouge1_f))
print("ROUGE-2 F-Score: ", (rouge2_f))
print("ROUGE-L F-Score: ", (rougeL_f))

Training Eval
ROUGE-1 F-Score:  0.11334907227288749
ROUGE-2 F-Score:  0.0334312193990944
ROUGE-L F-Score:  0.10815748086678201
-------------
Dev Eval
ROUGE-1 F-Score:  0.10947841711559926
ROUGE-2 F-Score:  0.028037174461862543
ROUGE-L F-Score:  0.10346037447741731
-------------
Test Eval
ROUGE-1 F-Score:  0.10039754230148261
ROUGE-2 F-Score:  0.027238618868406284
ROUGE-L F-Score:  0.09228933912071954


### Write our results to a file

In [18]:
def write_to_file(data, file_name):
    with open(file_name, 'w') as txtfile:
        for row in data:
            txtfile.write(str(row) + '\n')
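
A summary that contains a line break would span two lines in the output file, so the prediction and reference files would no longer line up row by row. A possible variant, assuming one row per line is the intended format, collapses internal whitespace before writing. `write_lines` is a hypothetical name, and the example rows and the temporary path are for illustration only.

```python
import os
import tempfile

def write_lines(rows, file_name):
    """Write one row per line, collapsing internal whitespace (including newlines)."""
    with open(file_name, 'w', encoding='utf-8') as txtfile:
        for row in rows:
            txtfile.write(' '.join(str(row).split()) + '\n')

rows = ['Five Stars', 'Great album\nbut short', 'works']
path = os.path.join(tempfile.mkdtemp(), 'example_ref.txt')
write_lines(rows, path)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(len(lines) == len(rows))  # True
```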

In [28]:
write_to_file(dev_predictions, 'weak_dev_pred.txt')
write_to_file(dev_references, 'weak_dev_ref.txt')

In [29]:
write_to_file(test_predictions, 'weak_test_pred.txt')
write_to_file(test_references, 'weak_test_ref.txt')

In [30]:
write_to_file(train_predictions, 'weak_train_pred.txt')
write_to_file(train_references, 'weak_train_ref.txt')