# Homework 1 (Due Thursday, October 28th, 2021 at 6:29pm PST)


**Rubric**


Every day late is -10%.

You are a business analyst working for a major US toy retailer:

A **Product Marketing Manager** in the marketing department wants to find out the most frequently used words in positive reviews (five stars) and negative reviews (one star) in order to determine what occasion the toys are purchased for (Christmas, birthdays, and anniversaries.). She would like your opinion on **which age groups**(what ages) recipients of toys seem to have the highest reviews, and **which occasion has the most positive reviews.**

* There are **malformed characters in the review** text. For instance, notice the `&#34;` - these are examples of incorrectly decoded [HTML encodings](https://krypted.com/utilities/html-encoding-reference/).
```
"amazing quality first of all, these cards are amazing proxies (but don't try to use em in &#34;official duels&#34; unless a judge is okay with it, if you have the real thing to show) and look amazing in your binder!"
```
Please clean up all instances of these incorrect decodings. 

Perform the same word count analysis using the reviews received from Amazon to answer your marketing manager's question. They are stored in two files, (`poor_amazon_toy_reviews.txt`) and (`good-amazon-toy-reviews.txt`). **Provide a few sentences with your findings and business recommendations.** Make any assumptions you'd like to- this is a fictitious company after all. I just want you to get into the habit of "finishing" your analysis: to avoid delivering technical numbers to a non-technical manager.

Some considerations in your analysis:

* Use **regular expressions to parse out all references to recipients and gift occassions**, and account for the possibility that people may spell words "son" / "children" / "4 year old" as both singular and plural, upper or lower-cased.

* Explain what some of **pitfalls/limitations** are of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

**Submit everything as a new notebook and Slack direct message to me (Yu Chen) and the TA the HW as an attachment.**

**NOTE**: Name the notebook `lastname_firstname_HW1.ipynb`.

In [2]:
import sys
import re
import pandas as pd
import numpy as np

In [3]:
positive_text = open("good_amazon_toy_reviews.txt")
print(positive_text)
positive_text.readline()

<_io.TextIOWrapper name='good_amazon_toy_reviews.txt' mode='r' encoding='UTF-8'>


'Excellent!!!\n'

In [4]:
#how big is the file? 
positive_text.seek(0)
lines = positive_text.readlines()
print(f'This positive reviews file has {len(lines)} lines (reviews)')

This positive reviews file has 102217 lines (reviews)


In [5]:
#code taken from lecture
from collections import Counter

def count_words(lines, delimiter=" "):
    
    words = Counter() # instantiate a Counter object called words
    for line in lines:
        for word in line.split(delimiter):
            words[word] += 1 # increment count for word
    return words

count_words(lines).most_common(10)

[('the', 96606),
 ('and', 86557),
 ('a', 66611),
 ('to', 63339),
 ('I', 48358),
 ('for', 48350),
 ('', 44934),
 ('is', 42321),
 ('it', 41543),
 ('of', 36593)]

In [6]:
#we want to remove occasions of &#34 and all of the rest of these HTML invalid characters
pos_df = pd.DataFrame(open("good_amazon_toy_reviews.txt", "r"), columns = ["line"])
pos_df["line"] = pos_df["line"].str.replace("\n", "")
pos_df["line"] = pos_df["line"].str.replace(r'&#\d{1,2}', "")
count_words(pos_df["line"])
count_words(pos_df["line"]).most_common(10)

[('the', 96607),
 ('and', 86557),
 ('a', 66617),
 ('to', 63352),
 ('for', 48408),
 ('I', 48362),
 ('', 44963),
 ('it', 44347),
 ('is', 42328),
 ('of', 36598)]

# I) Occasions Parsing

In [8]:
reg_birthday = r'(birthdays?|bdays?)'
reg_halloween = r'halloweens?'
reg_presents = r'(presents?|gifts?)'
reg_christian_holidays = r'(christmas?|easters?)'
reg_life_occasion = r'(anniversary|weddings?|graduation|party)'

occasion_parse_list = [reg_birthday, reg_halloween, reg_presents, reg_christian_holidays, reg_life_occasion]

occ_dict = {}

for i in occasion_parse_list:
    print(f'Out of the {len(lines)} positive reviews, occasions that look similar to\n {i}\
    are represented {pos_df.line.str.findall(i, flags = re.IGNORECASE).apply(len).apply(bool).sum()} times')
    occ_dict[i] = pos_df.line.str.findall(i, flags = re.IGNORECASE).apply(len).apply(bool).sum()
    
# sorting the dictionary in descending order code is taken from stack overflow 
#https://stackoverflow.com/questions/20577840/python-dictionary-sorting-in-descending-order-based-on-values/41866830
final_dictionary = sorted(occ_dict.items(), key = lambda kv: kv[1], reverse = True)

print(f'\nSorting this list by frequency, we get {final_dictionary}')

Out of the 102217 positive reviews, occasions that look similar to
 (birthdays?|bdays?)    are represented 3968 times
Out of the 102217 positive reviews, occasions that look similar to
 halloweens?    are represented 372 times
Out of the 102217 positive reviews, occasions that look similar to
 (presents?|gifts?)    are represented 6163 times
Out of the 102217 positive reviews, occasions that look similar to
 (christmas?|easters?)    are represented 1217 times
Out of the 102217 positive reviews, occasions that look similar to
 (anniversary|weddings?|graduation|party)    are represented 2502 times

Sorting this list by frequency, we get [('(presents?|gifts?)', 6163), ('(birthdays?|bdays?)', 3968), ('(anniversary|weddings?|graduation|party)', 2502), ('(christmas?|easters?)', 1217), ('halloweens?', 372)]


# II) Age Parsing

In [160]:
reg_age1 = r'(\d{1,2}) (?:yo|yr)'
reg_age3 = r'(\d{1,2})+-y'
reg_age4 = r'(\d{1,2})+ (?:year) olds?'
reg_age5 = r'(\d{1,2}).. (?:bday|birthday)'
reg_age6 = r'(\d{1,2}) (?:mos)'
reg_age7 = r'(\d{1,2}) months? olds?'

age_parse_list = [reg_age1, reg_age3, reg_age4, reg_age5]
month_parse_list = [reg_age6, reg_age7]

age_list = {}

for i in age_parse_list:
    x = pos_df.line.str.findall(i, flags = re.IGNORECASE)
    x = x.explode().dropna()
    # add to counter in dictionary if the age is in there
    # if new age, start counter at 1
    for j in x:
        if j in age_list:
            age_list[j] += 1
        else:
            age_list[j] = 1

month_list = {}

for k in month_parse_list:
    y = pos_df.line.str.findall(i, flags = re.IGNORECASE)
    y = y.explode().dropna()
    #could divide by 12 for year age and round for better grouping
    for z in y:
        if z in month_list:
            month_list[z] += 1
        else:
            month_list[z] = 1

# stack overflow code for how to add a string to each key in a dictionary
#https://stackoverflow.com/questions/48681634/adding-a-string-to-all-keys-in-dictionary-python
month_list = {k +' month old': v for k, v in month_list.items()}

final_age_dict = {**age_list, **month_list}

final_age_dict = sorted(final_age_dict.items(), key = lambda kv: kv[1], reverse = True)
final_age_dict

[('2', 1378),
 ('3', 1349),
 ('5', 1053),
 ('4', 1032),
 ('6', 618),
 ('7', 480),
 ('8', 425),
 ('1', 422),
 ('9', 254),
 ('10', 253),
 ('1 month old', 184),
 ('2 month old', 184),
 ('3 month old', 164),
 ('11', 156),
 ('4 month old', 142),
 ('5 month old', 126),
 ('6 month old', 124),
 ('12', 122),
 ('7 month old', 82),
 ('13', 41),
 ('14', 41),
 ('9 month old', 38),
 ('10 month old', 36),
 ('8 month old', 34),
 ('11 month old', 24),
 ('15', 21),
 ('16', 20),
 ('30', 14),
 ('12 month old', 14),
 ('50 month old', 12),
 ('17', 11),
 ('20', 10),
 ('13 month old', 10),
 ('21 month old', 10),
 ('75', 9),
 ('40', 8),
 ('19', 7),
 ('21', 7),
 ('50', 6),
 ('30 month old', 6),
 ('25 month old', 6),
 ('40 month old', 6),
 ('22', 5),
 ('18', 5),
 ('32', 4),
 ('60', 4),
 ('24', 4),
 ('35', 4),
 ('25', 4),
 ('16 month old', 4),
 ('14 month old', 4),
 ('20 month old', 4),
 ('80 month old', 4),
 ('24 month old', 4),
 ('17 month old', 4),
 ('70', 3),
 ('81', 3),
 ('23', 3),
 ('37', 3),
 ('90', 3),
 (

## Final Recommendations for Age and Occasion

#### **Highest Frequency**
- **Occasion:** gifts and birthdays are the occasions with the most positive reviews, mentioned in roughly 10% of them
- **Ages:** 2 and 3 year olds account for the most positive reviews (about 1350 mentions each ~1-1.5%), with 5 and 4 year olds (roughly 1000 mentions each ~1%) in a close third and fourth
 
### =>  Our product marketing manager should target birthday gifts for kids aged 2-5

#### **Middle Frequency**
- **Occasion:** life occasions such as anniversaries, weddings, graduation, and parties show up about 2.5% of the time.
- **Ages:** For the 1% to 0.25% age frequency range, 6, 7, 8, 1, 9, 10 year olds (in order by frequency) have the next set of positive reviews
 
### =>  If our highest frequency marketing campaign is successful and our manager needs a new set to target, she should look for 6-10 year olds with one of the above occasions coming up

#### **Low Frequency** (note, even if low, still higher than a lot of other groups)
- **Occasion:** Holidays such as Christmas, Easter, and Halloween have low frequency but still show up over 1% of the time
- **Ages:** 9-12 year AND 1-6 month olds each have over 0.1% frequency

### =>  If our manager is successful in the highest and middle frequency categories, she should then target the lower frequency category: older children (aged 9-12) or very young toddlers (1-6 months) during Christian holiday seasons and Halloween

## **Pitfalls & Limitations:**
- Besides knowing that the reviews were either 1/5 star, we don't assess what the reviews are actually saying
- To verify our conclusions about ages, we would need to make sure the numbers we are capturing ALL refer to ages.
- To verify our conclusions about occasions, we could look at date fields around holidays and if we had more data about birthdays we could relate the review data to a birthday.
- Sometimes people write things like "80 month old" which is kindof outrageous 
- It could be helpful to do paired analysis, i.e. what occasions AND ages are commonly reviewed together as well as how they complement each other