In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("discussion.ipynb")

# Discussion 6

### Due Saturday May 14th, 11:59:59PM


# Regex and NLP

In [2]:
import os
import numpy as np
import pandas as pd
import requests
import time
import re

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
from discussion import *

## Regular Expressions

### Resources

**Online Simulators**

 - https://pythex.org/

 - https://regex101.com/
 
**Cheat sheets**

 - https://dsc80.com/resources/other/berkeley-regex-reference.pdf

 - https://www.debuggex.com/cheatsheet/regex/python

 - https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf

**Question 1**: Identify duplicate words in a sentence

Given an input sentence as a string, return a list of the words that are duplicated. If there is no duplication, return an empty list.

**Hint 1:** Check out capture groups and backreferences in regex

**Hint 2:** Use the online simulators with the doctests as test cases for faster experimentation and debugging
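
To make Hint 1 concrete, here is a minimal sketch of a capture group combined with a backreference, applied to a different task (finding doubled letters) so that it is not a solution to the question. The helper name `doubled_letters` is made up for this illustration.

```python
import re

def doubled_letters(text):
    """Return each character that appears twice in a row (hypothetical helper)."""
    # (\w) captures one word character; \1 then requires that same character again.
    return re.findall(r'(\w)\1', text)

print(doubled_letters('a balloon will pass'))
```

Because the pattern has exactly one capture group, `re.findall` returns the captured text rather than the whole match.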

In [6]:
# Use capture group number to identify duplicates

In [9]:
# (\w+) captures a word; (?:\s+\1\b)+ requires that same word to repeat.
# The repeat group is non-capturing, so findall returns only the captured word.
re.findall(r'\b(\w+)(?:\s+\1\b)+', 'cool cool')

['cool']

In [8]:
grader.check("q1")

**Question 2**: Extract laptop specifications

Given a DataFrame of product descriptions, return the DataFrame with added columns `processor` (i3, i5), `generation` (9th Gen, 10th Gen), `storage` (512 GB SSD, 1 TB HDD), and `display_in_inch` (15.6 inch, 14 inch). The image below gives the column names and the exact patterns.

If a description does not contain a given specification, keep a null (`NaN`) value in that column.

**Hint:** You can pass a regex pattern to the pandas `.str.extract()` method. Note that it returns one column per capture group in the pattern.

<img src='imgs/laptop_specs.PNG'>
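
As a small illustration of the hint about capture groups, the sketch below runs `.str.extract()` on made-up strings (not rows from `laptop_details.csv`). The variable names are assumptions chosen for this example.

```python
import pandas as pd

# Toy descriptions invented for this example.
specs = pd.Series(['8 GB DDR4 RAM', '16 GB DDR3 RAM', 'no memory listed'])

# Two capture groups -> a DataFrame with two columns (labelled 0 and 1).
two_groups = specs.str.extract(r'(\d+) GB (DDR\d)')

# One capture group with expand=False -> a Series; rows that do not match become NaN.
one_group = specs.str.extract(r'(\d+) GB', expand=False)

print(two_groups)
print(one_group)
```

Rows where the pattern does not match are filled with `NaN`, which lines up with the null-value requirement above.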

In [6]:
# Use 'pd_column.str.extract(r'pattern')' to extract the required pattern
df = pd.read_csv('data/laptop_details.csv')

In [151]:
re.findall(r'([0-9]{0,2}[.]{0,2}[0-9]{0,2}\sinch)', df['laptop_description'][1])

[]

In [94]:
# Plain strings have no .extract (it is a pandas Series.str method), so use re.findall here
re.findall(r'\(\s*(\d*\w+ Gen)\s*\)', 'aosida  512 GB SSD adasd')

[]

In [105]:
duplicate_words('I went to the market market with my my my family')

['cool']

In [7]:
df

# Raw strings keep regex escapes such as \d and \s intact.
# Each pattern has a single capture group, so each extract yields one column.
df['processor'] = df['laptop_description'].str.extract(r'Intel Core (i\d) Processor')
df['generation'] = df['laptop_description'].str.extract(r'\(\s*([0-9]{1,2}[a-z]{1,2}\s+Gen)\s*')
df['storage'] = df['laptop_description'].str.extract(r'(\d+ \w+ [HDS]{3})')
df['display_inch'] = df['laptop_description'].str.extract(r'([0-9]{0,2}[.]{0,2}[0-9]{0,2}\sinch)')
df



Unnamed: 0,laptop_description,processor,generation,storage,display_inch
0,"Intel Core i5 Processor (9th Gen), 8 GB DDR4 R...",i5,9th Gen,512 GB SSD,15.65 inch
1,"Intel Core i3 Processor (2nd Gen), 8 GB DDR4...",i3,2nd Gen,1 TB HDD,
2,"Intel Core i5 Processor ( 10th Gen), 64 bit W...",i5,10th Gen,512 GB SSD,15.6 inch
3,"Intel Core i5 Processor (10th Gen ), 256 GB SS...",i5,10th Gen,256 GB SSD,14 inch
4,"Intel Core i3 Processor, 4 GB DDR4 RAM, 64 bit...",i3,,1 TB HDD,15.6 inch
5,"Intel Core i3 Processor (10th Gen), 4 GB DDR4 ...",i3,10th Gen,1 TB HDD,15.65 inch
6,"Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,1 TB HDD,14 inch
7,"Intel Core i5 Processor (10th Gen), 8 GB DDR4 ...",i5,10th Gen,512 GB HDD,14 inch
8,"Intel Core i3 Processor (10th Gen), 64 bit Win...",i3,10th Gen,256 GB SSD,14 inch
9,"Intel Core i5 Processor (10th Gen), 1 TB HDD, ...",i5,10th Gen,1 TB HDD,14.2 inch


In [9]:
# don't change this cell -- it is needed for the tests to work
out = laptop_details(pd.read_csv('data/laptop_details.csv'))

In [10]:
grader.check("q2")

## Natural Language Processing - Dealing with Text Data

- Unstructured data is everywhere: everything you read, see, and listen to
- Quantifying text data and extracting features from it is important to generate insights and build models


- Text representation is a huge area of study: representing a piece of text as a vector of numbers (BoW, TF-IDF, semantic embeddings, etc.)

- In this section, we will focus on Bag-of-Words representations using uni-grams and bi-grams

Let's use the musical instruments reviews dataset, which contains information on reviews and ratings.

In [11]:
review_df = pd.read_csv('data/musical_instruments_reviews.csv')
review_df

Unnamed: 0,reviewerID,reviewText,overall,summary
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac...",5,good
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...,5,Jake
2,A195EZSQDW3E21,The primary job of this device is to block the...,5,It Does The Job Well
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...,5,GOOD WINDSCREEN FOR THE MONEY
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...,5,No more pops when I record my vocals.
...,...,...,...,...
195,A2DKLC2FJTY9OI,Good all around mike. If you are looking for a...,5,Best in price range
196,A1MI9FDCNB3CMR,"Seriously? The Shure SM57 sets the standard, ...",5,Industry Standard
197,A37AQI4AU3JWSR,If it's good enough to track Tom Petty's vox o...,5,Classic. the last of Shures good mics
198,A37U8NH2CD9EDX,There's a reason every mic cabinet has at leas...,5,"It's an SM57, what's there to say?"


### N-grams in text

 - A uni-gram consists of a single word from a text sequence
 - Extending this, an n-gram consists of 'n' consecutive words from a text sequence
 - E.g., for `text = 'i love data science'`, the uni-grams are `['i', 'love', 'data', 'science']` and the bi-grams are `['i love', 'love data', 'data science']`
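
The definition above can be sketched as a sliding window over the word list. The helper name `ngrams` is hypothetical and only serves this illustration.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as space-joined strings (hypothetical helper)."""
    words = text.split()
    # Slide a window of width n across the word list.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = 'i love data science'
unigrams_demo = ngrams(text, 1)
bigrams_demo = ngrams(text, 2)
print(unigrams_demo)
print(bigrams_demo)
```

A text with fewer than `n` words has no n-grams, so the helper returns an empty list in that case.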

### Getting the uni-grams and their counts

In [52]:
# First normalize the reviews: convert to lowercase and replace all punctuation with spaces
reviews = review_df['reviewText'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True)
reviews = reviews.tolist()
reviews

['not much to write about here  but it does exactly what it s supposed to  filters out the pop sounds  now my recordings are much more crisp  it is one of the lowest prices pop filters on amazon so might as well buy it  they honestly work the same despite their pricing ',
 'the product does exactly as it should and is quite affordable i did not realized it was double screened until it arrived  so it was even better than i had expected as an added bonus  one of the screens carries a small hint of the smell of an old grape candy i used to buy  so for reminiscent s sake  i cannot stop putting the pop filter next to my nose and smelling it after recording   dif you needed a pop filter  this will work just as well as the expensive ones  and it may even come with a pleasing aroma like mine did buy this product    ',
 'the primary job of this device is to block the breath that would otherwise produce a popping sound  while allowing your voice to pass through with no noticeable reduction of vo

In [16]:
# Getting all the unigrams from all the reviews
unigrams = []
for review in reviews:
    words = review.split()
    unigrams.extend(words)
unigrams[:10]

['not',
 'much',
 'to',
 'write',
 'about',
 'here',
 'but',
 'it',
 'does',
 'exactly']

In [21]:
# Getting unigram counts
pd.Series(unigrams).value_counts()

- Does this make sense? Both the values and their counts?
- What are the positives/drawbacks of using unigram bag-of-words for text representations?

### How does 'reviewText' differ from 'summary'?

In [36]:
reviews = review_df['summary'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
reviews = reviews.tolist()

# Getting all the unigrams from all the reviews
unigrams = []
for review in reviews:
    words = review.split()
    unigrams.extend(words)

pd.Series(unigrams).value_counts()

good         45
the          38
cable        38
great        28
for          25
             ..
easily        1
breaking      1
complaint     1
which         1
by            1
Length: 328, dtype: int64

**Question 3**: Create bi-gram counts of the whole reviews text corpus.

Given a DataFrame like `review_df` and a column name (either `reviewText` or `summary`),
return a Series with the bi-gram counts of that column, sorted in descending order. Each index entry of the Series should be a bi-gram tuple, and its value should be the number of times that bi-gram appears in the whole corpus.

Before creating the bi-gram counts, perform the same text normalization (lowercase conversion and removal of all punctuation) as in the uni-gram case.

**Hint:** Use splitting and zipping to create bi-gram combinations
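
The sketch below shows the split-and-zip idea on a toy corpus invented for this example. It is one possible approach, not necessarily the intended implementation of `bigram_counts`.

```python
import pandas as pd

# Toy corpus made up for this illustration.
toy_corpus = ['good cable', 'good cable for the price', 'great for the price']

pairs = []
for sentence in toy_corpus:
    words = sentence.split()
    # zip stops at the shorter input, so each word is paired with its successor.
    pairs.extend(zip(words, words[1:]))

# value_counts sorts by count in descending order by default.
toy_counts = pd.Series(pairs).value_counts()
print(toy_counts)
```

Pairs are built per sentence, so no bi-gram spans the boundary between two documents.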

In [15]:
review_df.head()

Unnamed: 0,reviewerID,reviewText,overall,summary
0,A2IBPI20UZIR0U,"Not much to write about here, but it does exac...",5,good
1,A14VAT5EAX3D9S,The product does exactly as it should and is q...,5,Jake
2,A195EZSQDW3E21,The primary job of this device is to block the...,5,It Does The Job Well
3,A2C00NNG1ZQQG2,Nice windscreen protects my MXL mic and preven...,5,GOOD WINDSCREEN FOR THE MONEY
4,A94QU4C90B1AX,This pop filter is great. It looks and perform...,5,No more pops when I record my vocals.


In [17]:

reviews

['not much to write about here but it does exactly what its supposed to filters out the pop sounds now my recordings are much more crisp it is one of the lowest prices pop filters on amazon so might as well buy it they honestly work the same despite their pricing',
 'the product does exactly as it should and is quite affordablei did not realized it was double screened until it arrived so it was even better than i had expectedas an added bonus one of the screens carries a small hint of the smell of an old grape candy i used to buy so for reminiscents sake i cannot stop putting the pop filter next to my nose and smelling it after recording dif you needed a pop filter this will work just as well as the expensive ones and it may even come with a pleasing aroma like mine didbuy this product ',
 'the primary job of this device is to block the breath that would otherwise produce a popping sound while allowing your voice to pass through with no noticeable reduction of volume or high frequencie

In [48]:
reviews = review_df['reviewText'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
reviews = reviews.tolist()

# Pair each word with the word that follows it to form the bi-grams
bigrams = []
for review in reviews:
    words = review.split()
    bigrams.extend(zip(words, words[1:]))

pd.Series(bigrams).value_counts()




(for, the)          36
(of, the)           35
(i, have)           32
(is, a)             31
(to, the)           28
                    ..
(used, some)         1
(seen, and)          1
(proper, wayive)     1
(roll, them)         1
(keep, some)         1
Length: 8470, dtype: int64

In [56]:
# don't change this cell -- it is needed for the tests to work
out_bigrams_text = bigram_counts(pd.read_csv('data/musical_instruments_reviews.csv'), 'reviewText')
out_bigrams_summary = bigram_counts(pd.read_csv('data/musical_instruments_reviews.csv'), 'summary')

In [55]:
out_bigrams_summary.shape# == (8470,)

(8470,)

In [29]:
out_bigrams_text

(for, the)           20
(i, bought)          20
(this, is)           19
(and, it)            15
(to, the)            14
                     ..
(thatyou, should)     1
(always, have)        1
(extra, cables)       1
(strings, with)       1
(these, mics)         1
Length: 4717, dtype: int64

In [57]:
grader.check("q3")

### Bag-of-Words

- The bag of words model represents texts (e.g. review, summary) as vectors of word counts.
- It is called 'bag of words' because it doesn't consider order.
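
To illustrate the two points above, here is a minimal sketch on made-up documents. The names `docs`, `vocab`, and `bow_vector` are assumptions for this example only.

```python
from collections import Counter

# Toy documents invented for this illustration.
docs = ['good cable', 'cable good', 'great cable great price']

# The vocabulary fixes which word each vector position refers to.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Return the word counts of `doc`, ordered by `vocab` (hypothetical helper)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

The first two documents contain the same words in a different order, so they map to the same vector.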

### Creating the Bag-of-Words Count Matrix

Let's create a BoW count matrix of 'summary' using 'bi-grams'

In [58]:
out_bigrams_summary = bigram_counts(pd.read_csv('data/musical_instruments_reviews.csv'), 'summary')
out_bigrams_summary

(for, the)          8
(guitar, cable)     7
(the, best)         6
(it, works)         6
(good, quality)     5
                   ..
(as, advertised)    1
(low, cost)         1
(its, purpose)      1
(it, serves)        1
(old, stand)        1
Length: 559, dtype: int64

In [59]:
reviews = review_df['summary'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
# Keep `reviews` as a Series (no .tolist()) so that .str.count works below

# Keep only the k most frequent bi-grams; a smaller k gives a less sparse matrix
k = 1000

counts_dict = {}
for bigram in out_bigrams_summary.index[:k]:
    bigram = ' '.join(bigram)
    regex_pattern = fr'\b{bigram}\b'
    counts_dict[bigram] = reviews.str.count(regex_pattern).astype(int).tolist()
    
counts_df = pd.DataFrame(counts_dict)
counts_df

Unnamed: 0,for the,guitar cable,the best,it works,good quality,the job,the price,good for,works great,is the,...,well built,quality guitar,nice high,for practice,perfect for,as advertised,low cost,its purpose,it serves,old stand
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
196,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
197,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
counts_df = pd.concat([reviews.to_frame(), counts_df], axis=1).set_index('summary')
counts_df

summary,for the,guitar cable,the best,it works,good quality,the job,the price,good for,works great,is the,...,well built,quality guitar,nice high,for practice,perfect for,as advertised,low cost,its purpose,it serves,old stand
good,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
jake,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
it does the job well,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
good windscreen for the money,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
no more pops when i record my vocals,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
best in price range,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
industry standard,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
classic the last of shures good mics,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
its an sm57 whats there to say,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### TF-IDF (Term Frequency - Inverse Document Frequency)

- Addresses the BoW drawback of giving high weight to common words
- TF-IDF tries to **give high weight to words that are frequent in a particular document but rare in the rest of the corpus**
- For comparison, BoW uses only the term counts (TF), with no IDF factor


- TF-IDF = Term Frequency * Inverse Document Frequency
    - TF is a function of that document
    - IDF is a function of the corpus
    

- Refer to Lecture 19 for detailed explanations and calculations
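
As a rough sketch of the product above, the example below assumes one common convention: TF is the term's count divided by the document length, and IDF is the natural log of the number of documents divided by the number of documents containing the term. Lecture 19 may use a different log base or smoothing. The documents and function names are made up for this illustration.

```python
import math

# Toy corpus invented for this illustration.
docs = ['good cable', 'cheap cable', 'great cable great price']
tokenized = [doc.split() for doc in docs]

def tf(term, words):
    """Term frequency: share of the document's words equal to `term`."""
    return words.count(term) / len(words)

def idf(term, corpus):
    """Inverse document frequency (assumed form: ln(N / document frequency)).

    Raises ZeroDivisionError if `term` appears in no document.
    """
    n_containing = sum(term in words for words in corpus)
    return math.log(len(corpus) / n_containing)

def tfidf(term, words, corpus):
    return tf(term, words) * idf(term, corpus)

print(tfidf('cable', tokenized[0], tokenized))  # 'cable' is in every document
print(tfidf('great', tokenized[2], tokenized))  # 'great' is in one document only
```

Under this convention, a word that appears in every document gets an IDF of zero, which is how TF-IDF discounts common words.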

### Questions to Ponder

- Pros and cons of Bag-of-Words? Can there be better representations than just counts?
- Pros and cons of TF-IDF?
- Pros and cons of uni-grams vs. bi-grams? Which are suitable in what cases?
- Pros and cons of long text vs. short text?
- What combinations of the above might work well?

## Congratulations! You're done!

* Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded. Make sure that all of your work is in the `.py` file and not here by running the doctests: `python -m doctest discussion.py`.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [61]:
grader.check_all()

q1 results: All test cases passed!

q2 results: All test cases passed!

q3 results: All test cases passed!