## Data Science Interview Challenge

For this exercise, you will analyze a dataset from Amazon.

<b>A. (Suggested duration: 90 mins)</b>  
With the given data for 548552 products, perform exploratory analysis and make
suggestions for further analysis on the following aspects.  
  
<b>1. Trustworthiness of ratings</b>  
Ratings are susceptible to manipulation, bias, etc.

Data format:

**Id:** Product id (number 0, ..., 548551)

**ASIN:** Amazon Standard Identification Number

**title:** Name/title of the product

**group:** Product group (Book, DVD, Video or Music)

**salesrank:** Amazon Salesrank

**similar:** ASINs of co-purchased products (people who buy X also buy Y)

**categories:** Location in product category hierarchy to which the product belongs (separated by |, category id in [])

**reviews:** Product review information: time, user id, rating, total number of votes on the review, total number of helpfulness votes (how many people found the review to be helpful)
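
To work with individual reviews (for example, to relate ratings to helpfulness votes), each review line has to be split into its fields. The following is a minimal sketch, assuming every review line follows the layout of the sample records printed further down, where the raw file spells the key `cutomer:`. `parse_review_line` is a hypothetical helper and not part of the original notebook.

```python
def parse_review_line(line):
    """Parse one review line into a dict; assumes the layout of the sample records."""
    tokens = line.split()
    # tokens: [date, 'cutomer:', id, 'rating:', n, 'votes:', n, 'helpful:', n]
    year, month, day = (int(part) for part in tokens[0].split('-'))
    return {
        'date': (year, month, day),
        'customer': tokens[2],
        'rating': int(tokens[4]),
        'votes': int(tokens[6]),
        'helpful': int(tokens[8]),
    }

review = parse_review_line(
    '2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9')
# share of voters who found this review helpful
print(review['rating'], review['helpful'] / review['votes'])
```

The helpfulness share (`helpful / votes`) is one possible signal of how much weight a rating deserves. Reviews with zero votes would need separate handling.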

---

### What can you say (quantitatively speaking) about the ratings in this dataset?

#### Import packages

In [1]:
import pandas as pd
import datetime
from collections import Counter

#### Open File

In [2]:
# Read in the data
import gzip
# text mode decodes the lines; the with-block closes the file automatically
with gzip.open('amazon-meta.txt.gz', 'rt', encoding='utf8') as f:
    file_content = [line.strip() for line in f]

#### Clean Data

In [3]:
# Forms a list of objects containing each book's given data

book_list = []

add = []

for string in file_content:
    if string != '':
        add.append(string)
    else:
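        # a blank line ends the current record: store its lines as one list
        book_list.append(add)
        add = []

# keep the final record if the file does not end with a blank line
if add:
    book_list.append(add)
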
book_list[0:3] 

[['# Full information about Amazon Share the Love products',
  'Total items: 548552'],
 ['Id:   0', 'ASIN: 0771044445', 'discontinued product'],
 ['Id:   1',
  'ASIN: 0827229534',
  'title: Patterns of Preaching: A Sermon Sampler',
  'group: Book',
  'salesrank: 396585',
  'similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  'categories: 2',
  '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  'reviews: total: 2  downloaded: 2  avg rating: 5',
  '2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5']]

In [4]:
# Extract the review summary of each book (total, downloaded, avg rating),
# keyed by product id, so it can be built into a dataframe

rating_dict = {}

for book in book_list:
    ident, total, downloaded, avg_rating = '', '', '', ''
    for item in book:
        if item.startswith('Id:'):
            ident = item.split()[-1]
        elif item.startswith('reviews:'):
            # e.g. 'reviews: total: 2  downloaded: 2  avg rating: 5'
            fields = item.split()
            total, downloaded, avg_rating = fields[2], fields[4], fields[7]
    # discontinued products have no 'reviews:' line and keep empty strings
    rating_dict[ident] = [total, downloaded, avg_rating]

In [5]:
# delete the entry with an empty id (created by the file header block)
del rating_dict['']

# Bring the ratings dictionary into a dataframe
rating_df = pd.DataFrame.from_dict(rating_dict)
# transpose so that each product is a row
rating_df = rating_df.T.reset_index(drop=False)
# rename columns
rating_df.columns = ['id', 'total_reviews', 'downloaded', 'avg_rating']
# drop the first row (Id 0, a discontinued product without review data)
rating_df = rating_df.iloc[1:]

# object/string values change to numerical
rating_df = rating_df.apply(pd.to_numeric)

In [6]:
# Check for NaN values and get an overview of the data on hand
print(rating_df.isnull().sum())
print('\n')
print(rating_df.info())

id                  0
total_reviews    5867
downloaded       5867
avg_rating       5867
dtype: int64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548551 entries, 1 to 548551
Data columns (total 4 columns):
id               548551 non-null int64
total_reviews    542684 non-null float64
downloaded       542684 non-null float64
avg_rating       542684 non-null float64
dtypes: float64(3), int64(1)
memory usage: 16.7 MB
None


In [7]:
# Remove rows with at least one NaN value
rating_df = rating_df.dropna(axis=0, how='any')

In [8]:
# Recheck for nan values
print(rating_df.isnull().sum())
print('\n')
print(rating_df.info())
print('\n')
print(rating_df.dtypes)

id               0
total_reviews    0
downloaded       0
avg_rating       0
dtype: int64


<class 'pandas.core.frame.DataFrame'>
Int64Index: 542684 entries, 1 to 548551
Data columns (total 4 columns):
id               542684 non-null int64
total_reviews    542684 non-null float64
downloaded       542684 non-null float64
avg_rating       542684 non-null float64
dtypes: float64(3), int64(1)
memory usage: 20.7 MB
None


id                 int64
total_reviews    float64
downloaded       float64
avg_rating       float64
dtype: object


In [9]:
# cast the count columns from float to integer
rating_df['total_reviews']  = rating_df['total_reviews'].astype(int)
rating_df['downloaded']  = rating_df['downloaded'].astype(int)

# sorts reviews by id
rating_df = rating_df.sort_values('id')
rating_df.head()

| | id | total_reviews | downloaded | avg_rating |
|---|---|---|---|---|
| 1 | 1 | 2 | 2 | 5.0 |
| 111112 | 2 | 12 | 12 | 4.5 |
| 222223 | 3 | 1 | 1 | 5.0 |
| 333334 | 4 | 1 | 1 | 4.0 |
| 444445 | 5 | 0 | 0 | 0.0 |


### Data Analysis

#### How many books have more ratings than downloads?

In [10]:
print(len(rating_df[rating_df['total_reviews'] > rating_df['downloaded']]), 'books have more ratings than actual downloads')

8615 books have more ratings than actual downloads


In [11]:
# Take a look at the range of values
rating_df['discrepancy'] = rating_df['total_reviews'] - rating_df['downloaded']
rating_df = rating_df.sort_values('discrepancy', ascending=False)
rating_df.head(5)

| | id | total_reviews | downloaded | avg_rating | discrepancy |
|---|---|---|---|---|---|
| 53542 | 148185 | 5034 | 5 | 5.0 | 5029 |
| 547983 | 99487 | 5033 | 5 | 5.0 | 5028 |
| 31862 | 128673 | 4922 | 5 | 5.0 | 4917 |
| 310737 | 379661 | 2925 | 5 | 4.5 | 2920 |
| 168341 | 251503 | 2925 | 5 | 4.5 | 2920 |


In [12]:
print('There are', len(rating_df[rating_df['discrepancy'] >= 100]),
      'books with a discrepancy of 100 reviews or more')

There are 302 books with a discrepancy of 100 reviews or more


Looking at the discrepancy column, some books show an extreme difference between the number of reviews and the number of downloads. Some books may have been purchased through other channels and then reviewed online, but the number of such cases should be small. Discrepancies of more than 5,000 reviews seem to point to fictitious reviews.
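
As a suggestion for further analysis, the absolute discrepancy could be complemented by a relative one, since a gap of 100 reviews means something different for a book with 105 reviews than for one with 5,000. The sketch below uses plain Python. `missing_share` is a hypothetical helper, and the three rows are copied from the tables above (id, total_reviews, downloaded).

```python
def missing_share(total_reviews, downloaded):
    """Fraction of a product's reviews not covered by its 'downloaded' count."""
    if total_reviews == 0:
        return 0.0  # no reviews at all, so nothing can be missing
    return (total_reviews - downloaded) / total_reviews

# (id, total_reviews, downloaded), taken from the outputs above
rows = [(148185, 5034, 5), (379661, 2925, 5), (1, 2, 2)]
shares = {pid: missing_share(total, downloaded) for pid, total, downloaded in rows}
for pid, share in shares.items():
    print(pid, round(share, 4))
```

Ranking by this share instead of the raw difference would show whether the gap is concentrated in a few heavily reviewed books or spread across the catalogue.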

### 2. Category bloat

Consider the product group named 'Books'. Each product in this group is associated with categories. Naturally, with categorization, there are tradeoffs between how broad or specific the categories must be.

For this dataset, quantify the following:

**a.** Is there redundancy in the categorization? How can it be identified/removed?

**b.** Is it possible to reduce the number of categories drastically (say to 10% of existing categories) by sacrificing relatively few category entries (say close to 10%)?

In [13]:
book_list[:3]

[['# Full information about Amazon Share the Love products',
  'Total items: 548552'],
 ['Id:   0', 'ASIN: 0771044445', 'discontinued product'],
 ['Id:   1',
  'ASIN: 0827229534',
  'title: Patterns of Preaching: A Sermon Sampler',
  'group: Book',
  'salesrank: 396585',
  'similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  'categories: 2',
  '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  'reviews: total: 2  downloaded: 2  avg rating: 5',
  '2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5']]

Redundancy is visible at first sight in the categories section: the sample book is registered under two category branches that share their first five levels and differ only in the last one (Preaching vs. Sermons).
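
One way to quantify this kind of overlap is to count how many leading levels two branches of the same product have in common. The following is a minimal sketch using the two branches of the sample book. `shared_prefix` is a hypothetical helper, not part of the original analysis.

```python
def shared_prefix(branch_a, branch_b):
    """Return the leading category levels that two '|'-separated branches share."""
    levels_a = branch_a.strip('|').split('|')
    levels_b = branch_b.strip('|').split('|')
    common = []
    for a, b in zip(levels_a, levels_b):
        if a != b:
            break  # branches diverge here
        common.append(a)
    return common

preaching = ('|Books[283155]|Subjects[1000]|Religion & Spirituality[22]'
             '|Christianity[12290]|Clergy[12360]|Preaching[12368]')
sermons = ('|Books[283155]|Subjects[1000]|Religion & Spirituality[22]'
           '|Christianity[12290]|Clergy[12360]|Sermons[12370]')
common = shared_prefix(preaching, sermons)
print(len(common), 'levels shared; deepest shared level:', common[-1])
```

Applied to every pair of branches per book, the distribution of shared-prefix lengths would indicate how much of the stored hierarchy is repeated.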

In [14]:
# pull category info from each book into a list
category_list = []
book_count = 0

for group in book_list:
    prod_type = ''
    for item in group:
        if item.startswith('group:'):
            prod_type = item.split()[-1]
        if prod_type == 'Book':
            if item.startswith('|'):
                category_list.append(item.strip())
    if prod_type == 'Book':
        book_count += 1
        
# sub-categories
sub_category_list = []
for branch in category_list:
    for sub_cat in branch.split('|')[1:]:
        sub_category_list.append(sub_cat)
            
print('There are {} total books.'.format(book_count))
print('There are {} total category branches across all books.'.format(len(category_list)))
print('There are {} unique category branches across all books.'.format(len(set(category_list))))
print('There are {} total sub-categories across all books.'.format(len(sub_category_list)))
print('There are {} unique sub-categories across all books.'.format(len(set(sub_category_list))))
print(sub_category_list[:10])

There are 393561 total books.
There are 1440329 total category branches across all books.
There are 12853 unique category branches across all books.
There are 7891047 total sub-categories across all books.
There are 14923 unique sub-categories across all books.
['Books[283155]', 'Subjects[1000]', 'Religion & Spirituality[22]', 'Christianity[12290]', 'Clergy[12360]', 'Preaching[12368]', 'Books[283155]', 'Subjects[1000]', 'Religion & Spirituality[22]', 'Christianity[12290]']


In [15]:
# dataframe with sub-category counts

# build the counter dictionary into a dataframe
category_count = pd.DataFrame.from_dict(Counter(sub_category_list), orient='index').reset_index()
# sort values from greatest to least
category_count = category_count.sort_values(0, ascending=False)
category_count.head()

| | index | 0 |
|---|---|---|
| 8015 | Books[283155] | 1286848 |
| 5030 | Subjects[1000] | 1222638 |
| 5375 | Children's Books[4] | 134263 |
| 3198 | Amazon.com Stores[285080] | 123925 |
| 9488 | [265523] | 123925 |


We could look into the most common sub-categories across all books and replace them with more specific sub-categories.

In [16]:
# Determine what share of all sub-category entries the ~10% most frequent sub-category titles cover
top = category_count.iloc[0:1400]
print('The top {:.2f}% of sub-category titles account for {:.2f}% of all sub-category entries.'
      .format(len(top) / len(category_count) * 100,
              top[0].sum() / category_count[0].sum() * 100))

The top 9.38% of sub-category titles account for 90.04% of all sub-category entries.


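Yes. Keeping only the roughly 10% most frequent sub-category titles (the top 1,400 of 14,923) retains about 90% of all sub-category entries, so the number of titles can be cut by about 90% while sacrificing roughly 10% of the entries.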
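
The cut-off of 1,400 titles was picked by hand. The same question can be asked the other way round: how many of the most frequent titles are needed to reach a given share of entries? The sketch below works on a `Counter` like the one built from `sub_category_list`. `titles_needed` is a hypothetical helper, and the toy counts are made up for illustration.

```python
from collections import Counter

def titles_needed(counts, target_share):
    """Smallest number of most-frequent titles whose entries reach target_share of all entries."""
    total = sum(counts.values())
    covered = 0
    for n_titles, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target_share * total:
            return n_titles
    return len(counts)

# toy data, not taken from the dataset
toy = Counter({'Books': 50, 'Subjects': 30, 'Fiction': 12, 'Poetry': 5, 'Drama': 3})
print(titles_needed(toy, 0.9))
```

On the real data, `titles_needed(Counter(sub_category_list), 0.9)` would give the exact number of titles behind the 90% figure instead of a hand-picked cut-off.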

<b>B. (Suggested duration: 30 mins)</b>  
Give the number crunching a rest! Just think about these problems.

<b>1. Algorithm thinking</b>

**How would you build the product categorization from scratch, using similar/co-purchased information?**

A: I would use a clustering method, creating product categories based on similarity in co-purchase information.
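
As a rough illustration of that idea, products can be treated as nodes and `similar` entries as edges, with each connected group of products forming a candidate category. The sketch below uses union-find to compute connected components, which is only the simplest stand-in for a real clustering method. On the full co-purchase graph a finer community-detection step would likely be needed, since many products may end up in one large component. `copurchase_clusters` and the toy labels are hypothetical.

```python
def copurchase_clusters(similar):
    """Group products into connected components of the co-purchase graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for product, neighbours in similar.items():
        root = find(product)
        for other in neighbours:
            parent[find(other)] = root  # merge the neighbour's component into this one

    clusters = {}
    for product in parent:
        clusters.setdefault(find(product), set()).add(product)
    # largest clusters first, ties broken by smallest label
    return sorted(clusters.values(), key=lambda c: (-len(c), min(c)))

# toy labels standing in for ASINs, not taken from the dataset
toy = {'A': ['B', 'C'], 'B': ['A'], 'D': ['E'], 'F': []}
clusters = copurchase_clusters(toy)
print(clusters)
```

Here `A`, `B` and `C` end up in one group, `D` and `E` in another, and `F` stays on its own.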

<b>2. Product thinking</b>  
Now, put on your 'product thinking' hat:

**a. Is it a good idea to show users the categorization hierarchy for items?**

A: Yes, it gives customers a more structured way to find books under the categories they prefer.

**b. Is it a good idea to show users similar/co-purchased items?**

A: Absolutely. I think this is one of the most effective ways nowadays to drive more sales: it serves as a virtual
sales assistant that tries to predict what the customer likes in order to stimulate additional sales.

**c. Is it a good idea to show users reviews and ratings for items?**

A: Yes, it adds value by showing customers what others have experienced with the product they are looking at, and it builds trust because the feedback comes from actual customers and not biased groups.

**d. For each of the above, why? How will you establish the same?**

Everything needs to be very customer-centric; customers are the key to the ongoing life of every business.