In [1]:
poor = open("poor_amazon_toy_reviews.txt", "r", encoding='utf8')
good = open("good_amazon_toy_reviews.txt", "r", encoding='utf8')

In [2]:
goodlines = good.readlines()
poorlines = poor.readlines()

In [3]:
from collections import Counter

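def count_words(lines, delimiter=" "):
    """Count how often each delimiter-separated token appears across lines."""
    words = Counter()  # maps token -> number of occurrences
    for line in lines:
        # Splitting on a single delimiter keeps punctuation and trailing newlines attached
        for word in line.split(delimiter):
            words[word] += 1  # increment count for word
    return words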

In [4]:
good_count = count_words(goodlines)
poor_count = count_words(poorlines)

In [5]:
import pandas as pd

# Build one (word, frequency) row per distinct token
good_count_df = pd.DataFrame(list(good_count.items()), columns=["word", "frequency"])
poor_count_df = pd.DataFrame(list(poor_count.items()), columns=["word", "frequency"])

In [6]:
good_count_df.sort_values(by='frequency',ascending=False)

index,word,frequency
15,the,96606
28,and,86557
50,a,66611
14,to,63339
73,I,48358
...,...,...
73731,fees,1
73730,kick-ass,1
73729,Joe.<br,1
73728,"League&#34;,",1


In [7]:
poor_count_df.sort_values(by='frequency',ascending=False)

index,word,frequency
15,the,18992
14,and,10909
8,I,9629
66,to,9589
37,a,9203
...,...,...
19133,strong.\n,1
19132,10.00\n,1
19131,color.....instead,1
19129,buckets,1
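
The tails of both tables contain tokens such as `strong.\n` and `Joe.<br`, which suggests that splitting on a single space leaves newlines, punctuation and HTML residue attached to words. Below is a minimal sketch of a stricter tokenizer, assuming that lowercase matching is acceptable for this analysis. `count_words_normalized` is a hypothetical helper and is not part of the original notebook; the sample lines are made up.

```python
import html
import re
from collections import Counter

def count_words_normalized(lines):
    """Count lowercase word tokens, ignoring punctuation, newlines and <br> tags."""
    words = Counter()
    for line in lines:
        text = html.unescape(line)              # turn entities such as &#34; back into characters
        text = re.sub(r"<br\s*/?>", " ", text)  # replace HTML line-break tags with a space
        # Keep runs of letters/digits, allowing inner apostrophes and hyphens
        words.update(re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower()))
    return words

# Made-up sample lines for illustration only
sample = ["Very strong.\n", "Not strong<br />at all, &#34;kick-ass&#34; toy\n"]
counts = count_words_normalized(sample)
```

With this normalization, `strong.` and `strong` are counted as the same token, so the frequencies would differ from the tables above.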


In [8]:
import re
goodtext = ''.join(goodlines)
poortext = ''.join(poorlines)

# re.IGNORECASE already covers "christmas", "Christmas" and "CHRISTMAS"
Christmas_good = len(re.findall(r'\bChristmas\b', goodtext, flags=re.IGNORECASE))
Christmas_poor = len(re.findall(r'\bChristmas\b', poortext, flags=re.IGNORECASE))

In [9]:
print(f"Christmas appears {Christmas_good} times in good reviews")
print(f"Christmas appears {Christmas_poor} times in poor reviews")

Christmas appears 1198 times in good reviews
Christmas appears 74 times in poor reviews


In [10]:
birthday_good = len(re.findall(r'\b(birthday|birthdays)\b', goodtext, flags=re.IGNORECASE)) 
birthday_poor = len(re.findall(r'\b(birthday|birthdays)\b', poortext, flags=re.IGNORECASE)) 
print(f"birthday appears {birthday_good} times in good reviews")
print(f"birthday appears {birthday_poor} times in poor reviews")

birthday appears 3988 times in good reviews
birthday appears 445 times in poor reviews


In [11]:
anniversary_good = len(re.findall(r'\b(anniversary|anniversaries)\b', goodtext, flags=re.IGNORECASE)) 
anniversary_poor = len(re.findall(r'\b(anniversary|anniversaries)\b', poortext, flags=re.IGNORECASE)) 
print(f"anniversary appears {anniversary_good} times in good reviews")
print(f"anniversary appears {anniversary_poor} times in poor reviews")

anniversary appears 53 times in good reviews
anniversary appears 4 times in poor reviews


Overall, birthdays are mentioned most often in good reviews (3,988 times), but they also have the lowest ratio of good to poor mentions of the three occasions: roughly 9.0 for birthdays, versus 16.2 for Christmas and 13.3 for anniversaries. If we want to maximize the number of positive reviews on the website, we should focus our marketing on birthday occasions.
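
The good-to-poor ratios behind this comparison can be checked directly from the printed counts. The sketch below copies the six numbers from the cell outputs above; the dictionary name `occasion_counts` is introduced here for illustration only.

```python
# (good mentions, poor mentions), copied from the printed cell outputs above
occasion_counts = {
    "Christmas": (1198, 74),
    "birthday": (3988, 445),
    "anniversary": (53, 4),
}

# Ratio of mentions in good reviews to mentions in poor reviews
ratios = {name: good / poor for name, (good, poor) in occasion_counts.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.2f} good mentions per poor mention")
```

This prints ratios of about 16.19 for Christmas, 13.25 for anniversary and 8.96 for birthday. Note that these are ratios of raw mention counts, so they also reflect the fact that the good-review file is much larger than the poor-review file.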

In [12]:
male_pattern = re.compile(r'\b(husband|son|sons|boyfriend|boyfriends|father|dad|daddy|dads)\b', flags=re.IGNORECASE)

# Keep the first match object for every review line that mentions a male term
good_m_lines = [m for m in map(male_pattern.search, goodlines) if m]
poor_m_lines = [m for m in map(male_pattern.search, poorlines) if m]

In [13]:
len(good_m_lines)/len(goodlines)

0.07779527867184519

In [14]:
len(poor_m_lines)/len(poorlines)

0.05346456692913386

In [15]:
female_pattern = re.compile(r'\b(wife|daughter|daughters|girlfriend|girlfriends|mother|mom|mommy|mommies)\b', flags=re.IGNORECASE)

# Keep the first match object for every review line that mentions a female term
good_f_lines = [m for m in map(female_pattern.search, goodlines) if m]
poor_f_lines = [m for m in map(female_pattern.search, poorlines) if m]

In [16]:
len(good_f_lines)/len(goodlines)

0.07823551855366524

In [17]:
len(poor_f_lines)/len(poorlines)

0.03897637795275591

Here we assume that any mention of these words refers to the recipient of the gift. Male terms appear in 5.3% of poor reviews (0.053) and female terms in 3.9% (0.039), while the two shares are almost identical in good reviews (about 0.078 each). Taken at face value, toys purchased for male recipients seem more likely to receive a poor review. Note, however, that the denominator here is the total number of poor reviews, not the number of toys bought for each gender. We do not know how many toys in total were purchased for male and for female recipients, so the higher share for male recipients could simply be due to far more toys being purchased for male recipients than for female recipients in the population. Because we cannot split the population by gender and calculate a poor-review percentage for each group, we cannot conclude that we should focus on female recipients just because their calculated share is lower.
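
The shares computed in the cells above all follow the same recipe: the number of lines that mention at least one term, divided by the number of lines in that file. A minimal sketch of that calculation as a reusable function follows. `mention_rate` is a hypothetical helper, and the toy reviews are made up rather than taken from the review files.

```python
import re

def mention_rate(lines, terms):
    """Return the fraction of lines that contain at least one of the given terms."""
    if not lines:
        return 0.0
    # Whole-word, case-insensitive match on any of the terms
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    return sum(1 for line in lines if pattern.search(line)) / len(lines)

# Toy data for illustration only
toy_reviews = ["My son loves it", "Bought for my wife", "Broke in a day", "My Dad and son agree"]
male_terms = ["husband", "son", "sons", "dad"]
rate = mention_rate(toy_reviews, male_terms)  # 2 of the 4 lines mention a male term
```

The result is the share of reviews in one file that mention a term, not the share of a recipient group's reviews that are poor, which is why it cannot settle the question on its own.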

In [18]:
gift_good = re.findall(r'\b(son|sons|child|children|Christmas|Christmases)\b', goodtext, flags=re.IGNORECASE)
gift_poor = re.findall(r'\b(son|sons|child|children|Christmas|Christmases)\b', poortext, flags=re.IGNORECASE)

In [19]:
gift_good

['Children',
 'son',
 'children',
 'children',
 'son',
 'son',
 'son',
 'son',
 'child',
 'son',
 'son',
 'son',
 'child',
 'child',
 'son',
 'son',
 'son',
 'son',
 'son',
 'children',
 'child',
 'child',
 'Children',
 'Christmas',
 'child',
 'son',
 'son',
 'son',
 'son',
 'child',
 'children',
 'children',
 'child',
 'child',
 'son',
 'Children',
 'son',
 'children',
 'son',
 'children',
 'children',
 'children',
 'children',
 'son',
 'son',
 'Christmas',
 'child',
 'children',
 'son',
 'son',
 'child',
 'SON',
 'child',
 'children',
 'child',
 'son',
 'son',
 'children',
 'Christmas',
 'son',
 'child',
 'son',
 'son',
 'son',
 'son',
 'son',
 'son',
 'children',
 'son',
 'Christmas',
 'child',
 'child',
 'son',
 'children',
 'son',
 'children',
 'child',
 'son',
 'Son',
 'son',
 'CHILD',
 'son',
 'son',
 'child',
 'Christmas',
 'Child',
 'children',
 'Christmas',
 'son',
 'son',
 'son',
 'children',
 'son',
 'child',
 'children',
 'Christmas',
 'child',
 'son',
 'child',
 'children

In [20]:
gift_poor

['child',
 'son',
 'son',
 'CHILDREN',
 'son',
 'sons',
 'son',
 'son',
 'child',
 'child',
 'son',
 'son',
 'son',
 'son',
 'son',
 'child',
 'children',
 'son',
 'son',
 'son',
 'son',
 'Christmas',
 'children',
 'son',
 'child',
 'son',
 'son',
 'son',
 'child',
 'son',
 'Christmas',
 'son',
 'child',
 'son',
 'children',
 'children',
 'son',
 'son',
 'son',
 'Christmas',
 'Christmas',
 'child',
 'child',
 'son',
 'son',
 'son',
 'son',
 'son',
 'children',
 'children',
 'Child',
 'child',
 'children',
 'children',
 'son',
 'child',
 'son',
 'son',
 'sons',
 'son',
 'son',
 'son',
 'child',
 'son',
 'children',
 'son',
 'child',
 'Christmas',
 'sons',
 'child',
 'children',
 'son',
 'son',
 'Christmas',
 'son',
 'child',
 'son',
 'child',
 'son',
 'sons',
 'son',
 'son',
 'son',
 'son',
 'child',
 'son',
 'son',
 'child',
 'son',
 'Children',
 'child',
 'son',
 'children',
 'son',
 'son',
 'child',
 'son',
 'child',
 'child',
 'child',
 'son',
 'children',
 'son',
 'child',
 'child'

In [25]:
# Load one review per row; both files are read as UTF-8, matching the earlier cells
poor_df = pd.DataFrame(open("poor_amazon_toy_reviews.txt", "r", encoding='utf8'), columns=['line'])
good_df = pd.DataFrame(open("good_amazon_toy_reviews.txt", "r", encoding='utf8'), columns=['line'])

# Note: unlike the re.findall calls above, these searches are case-sensitive
poor_df["line"] = poor_df["line"].str.replace("\n", "")
poor_df["results"] = poor_df["line"].str.findall(r'\b(son|sons|child|children|Christmas|Christmases)\b')

good_df["line"] = good_df["line"].str.replace("\n", "")
good_df["results"] = good_df["line"].str.findall(r'\b(son|sons|child|children|Christmas|Christmases)\b')

In [36]:
poor_slctd = poor_df[poor_df.results.map(len)>0]
good_slctd = good_df[good_df.results.map(len)>0]

In [37]:
poor_slctd.head()

index,line,results
5,This was much too small for an average size ch...,[child]
19,failed to function right out of the box. piece...,[son]
21,"""If I could give this zero stars I would. Part...",[son]
32,"""These are fake. They are flimsy, broken or br...",[son]
45,"""This lasted about two hours before two of the...","[sons, son]"


In [38]:
good_slctd.head()

index,line,results
10,I got this item for me and my son to play arou...,[son]
17,Purchased these Lego's to help aid me with tea...,"[children, children]"
23,I ordered these for my 3 year old son's birthd...,[son]
28,My 5 year old son loves this.,[son]
42,My son LOVES it! He wouldn't put it down for t...,[son]


The word count approach has several limitations when it comes to inference. To start with, a word count does not take the context or sentiment of a review into consideration, so we have no idea why customers mention a specific word. Without context or sentiment analysis, we can't associate a word with a positive meaning just because it appears in good reviews, and vice versa. In addition, word counts and the proportion of reviews in which a word appears do not establish a causal relationship between the word and good or bad reviews. It would be problematic to decide on a marketing strategy based on word counts alone.
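
The context limitation is easy to see on a toy case. In the sketch below, two made-up reviews with opposite sentiment produce exactly the same count for the word of interest. The sentences are invented for illustration and do not come from the review files.

```python
from collections import Counter

# Two made-up reviews with opposite sentiment (illustration only)
positive = "my son loves this toy"
negative = "my son hates this toy"

positive_counts = Counter(positive.split())
negative_counts = Counter(negative.split())

# The word of interest is counted identically in both reviews
print(positive_counts["son"], negative_counts["son"])  # 1 1
```

A count of `son` therefore says nothing about whether the recipient liked the toy.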

Therefore, we need further analysis of the context, sentiment and meaning of each review in order to uncover the relationships between holiday occasions, recipient gender and the positivity of reviews on the website.