## Data Science Interview Challenge

For this exercise, you will analyze a dataset from Amazon.

<b>A. (Suggested duration: 90 mins)</b>  
With the given data for 548552 products, perform exploratory analysis and make
suggestions for further analysis on the following aspects.  
  
<b>1. Trustworthiness of ratings</b>
Ratings are susceptible to manipulation, bias, etc.

Data format:

**Id:** Product id (number 0, ..., 548551)

**ASIN:** Amazon Standard Identification Number

**title:** Name/title of the product

**group:** Product group (Book, DVD, Video or Music)

**salesrank:** Amazon Salesrank

**similar:** ASINs of co-purchased products (people who buy X also buy Y)

**categories:** Location in product category hierarchy to which the product belongs (separated by |, category id in [])

**reviews:** Product review information: time, user id, rating, total number of votes on the review, total number of helpfulness votes (how many people found the review to be helpful)
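As an illustration of the review format described above, the sketch below splits one review line into typed fields. The helper name `parse_review_line`, the pattern name `REVIEW_RE`, and the keys of the returned dictionary are assumptions made for this example. The misspelled `cutomer:` label is how the raw file spells it.

```python
import re
from datetime import date

# Pattern for one review line; "cutomer" is the spelling used in the raw file.
REVIEW_RE = re.compile(
    r"(\d{4})-(\d{1,2})-(\d{1,2})\s+cutomer:\s+(\S+)\s+"
    r"rating:\s+(\d+)\s+votes:\s+(\d+)\s+helpful:\s+(\d+)"
)

def parse_review_line(line):
    """Return a dict of typed fields, or None if the line is not a review."""
    match = REVIEW_RE.match(line.strip())
    if match is None:
        return None
    year, month, day, customer, rating, votes, helpful = match.groups()
    return {
        "date": date(int(year), int(month), int(day)),
        "customer": customer,
        "rating": int(rating),
        "votes": int(votes),
        "helpful": int(helpful),
    }

review = parse_review_line(
    "2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9"
)
# Share of voters who found this review helpful
print(review["rating"], review["helpful"] / review["votes"])
```

The ratio of helpful votes to total votes is one possible input when judging how much weight a single rating deserves.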

---

### What can you say (quantitatively speaking) about the ratings in this dataset?

#### Open packages

In [1]:
import pandas as pd
import datetime

#### Open File

In [2]:
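# Read the gzipped metadata file and strip surrounding whitespace from each line
import gzip
with gzip.open('amazon-meta.txt.gz', 'rb') as f:
    file_content = [x.decode('utf8').strip() for x in f.readlines()]
# no explicit f.close() needed: the with-block closes the file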

#### Clean Data

In [33]:
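# Form a list of records, one per product, each holding that product's lines

grouped = []
add = []

for string in file_content:
    if string != '':
        add.append(string)
    else:
        # a blank line closes the current record
        grouped.append(add)
        add = []

# keep the final record if the file does not end with a blank line
if add:
    grouped.append(add)

grouped[0:3]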

[['# Full information about Amazon Share the Love products',
  'Total items: 548552'],
 ['Id:   0', 'ASIN: 0771044445', 'discontinued product'],
 ['Id:   1',
  'ASIN: 0827229534',
  'title: Patterns of Preaching: A Sermon Sampler',
  'group: Book',
  'salesrank: 396585',
  'similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  'categories: 2',
  '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  'reviews: total: 2  downloaded: 2  avg rating: 5',
  '2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5']]
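The same blank-line grouping can also be done lazily, without building the full list of lines first. The following is a minimal sketch using `itertools.groupby`. The function name `iter_records` and the `sample` lines are made up for illustration. A final record that is not followed by a blank line is still returned.

```python
from itertools import groupby

def iter_records(lines):
    """Yield each blank-line-separated record as a list of stripped lines."""
    stripped = (line.strip() for line in lines)
    # Consecutive lines are grouped by whether they are blank or not
    for is_blank, chunk in groupby(stripped, key=lambda s: s == ""):
        if not is_blank:
            yield list(chunk)

# Small stand-in for the decoded file content
sample = [
    "Id:   0\n",
    "ASIN: 0771044445\n",
    "  discontinued product\n",
    "\n",
    "Id:   1\n",
    "ASIN: 0827229534\n",
    "  group: Book\n",
]
records = list(iter_records(sample))
print(len(records))
```

Passing the open file handle itself (opened in text mode) instead of `sample` would process the file one line at a time.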

In [34]:
# Extract each product's review summary (total, downloaded, avg rating)
# into a dictionary keyed by product id

rating_dict = {}

for group in grouped:
    ident, total, downloaded, avg_rating = '', '', '', ''
    for item in group:
        if item.startswith('Id:'):
            ident = item.split()[-1]
        elif item.startswith('reviews:'):
            fields = item.split()
            total, downloaded, avg_rating = fields[2], fields[4], fields[7]
    # discontinued products have no 'reviews:' line and keep the empty strings
    rating_dict[ident] = [total, downloaded, avg_rating]
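Indexing into `item.split()` depends on the exact number of tokens in the summary line. A regular expression is one way to make the expected layout explicit and to fail visibly on anything else. This is only a sketch: `SUMMARY_RE` and `parse_review_summary` are hypothetical names, and the pattern assumes the layout of the sample record shown earlier.

```python
import re

# Expected layout: "reviews: total: 2  downloaded: 2  avg rating: 5"
SUMMARY_RE = re.compile(
    r"reviews:\s+total:\s+(\d+)\s+downloaded:\s+(\d+)\s+avg rating:\s+([\d.]+)"
)

def parse_review_summary(line):
    """Return (total, downloaded, avg_rating) or None if the line does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    total, downloaded, avg_rating = match.groups()
    return int(total), int(downloaded), float(avg_rating)

summary = parse_review_summary("reviews: total: 2  downloaded: 2  avg rating: 5")
print(summary)
```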

In [35]:
# delete the entry with an empty id (the file header record)
del rating_dict['']

# bring the dictionary into a dataframe
rating_df = pd.DataFrame.from_dict(rating_dict)
# transpose so each product is a row, and turn the id index into a column
rating_df = rating_df.T.reset_index(drop=False)
# rename columns
rating_df.columns = ['id', 'total', 'downloaded', 'avg_rating']
# drop the first row (Id 0, a discontinued product with no review data)
rating_df = rating_df.iloc[1:]

# convert object/string values to numeric
rating_df = rating_df.apply(pd.to_numeric)
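The dictionary-to-dataframe step can also be written with `orient='index'`, which makes each key a row directly and avoids the transpose. Below is a minimal sketch on a three-entry stand-in for `rating_dict`. The toy values are illustrative, and `errors='coerce'` is an addition that turns any non-numeric string into NaN.

```python
import pandas as pd

# Toy stand-in for rating_dict: id -> [total, downloaded, avg_rating]
toy_rating_dict = {
    "1": ["2", "2", "5"],
    "2": ["12", "12", "4.5"],
    "0": ["", "", ""],  # discontinued product: no review fields
}

toy_df = (
    pd.DataFrame.from_dict(
        toy_rating_dict,
        orient="index",
        columns=["total", "downloaded", "avg_rating"],
    )
    .rename_axis("id")   # name the index so reset_index yields an 'id' column
    .reset_index()
)

# Convert every column to numbers; unparseable strings become NaN
toy_df = toy_df.apply(pd.to_numeric, errors="coerce")
print(toy_df.dtypes)
```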

In [36]:
# Check for NaN values and get an overview of the data on hand
print(rating_df.isnull().sum())
print('\n')
print(rating_df.info())

id               0
total         5867
downloaded    5867
avg_rating    5867
dtype: int64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548551 entries, 1 to 548551
Data columns (total 4 columns):
id            548551 non-null int64
total         542684 non-null float64
downloaded    542684 non-null float64
avg_rating    542684 non-null float64
dtypes: float64(3), int64(1)
memory usage: 16.7 MB
None
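One plausible reading of the 5867 missing values is that they belong to discontinued products, whose records carry no `reviews:` line (as with Id 0 in the sample above). A quick cross-check is to count such records directly. `count_discontinued` and `toy_grouped` are names invented for this sketch.

```python
def count_discontinued(records):
    """Count records that contain a 'discontinued product' line."""
    return sum(
        any(item.strip() == "discontinued product" for item in record)
        for record in records
    )

# Toy stand-in for the grouped list built earlier
toy_grouped = [
    ["Id:   0", "ASIN: 0771044445", "discontinued product"],
    ["Id:   1", "ASIN: 0827229534",
     "reviews: total: 2  downloaded: 2  avg rating: 5"],
]
n_discontinued = count_discontinued(toy_grouped)
print(n_discontinued)
```

If the count on the full `grouped` list matches the number of NaN rows (plus the dropped Id 0 row), the missing values are explained by discontinued products rather than by a parsing problem.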


In [37]:
# Remove rows with at least one NaN value
rating_df = rating_df.dropna(axis=0, how='any')

In [38]:
# Recheck for nan values
print(rating_df.isnull().sum())
print('\n')
print(rating_df.info())
print('\n')
print(rating_df.dtypes)

id            0
total         0
downloaded    0
avg_rating    0
dtype: int64


<class 'pandas.core.frame.DataFrame'>
Int64Index: 542684 entries, 1 to 548551
Data columns (total 4 columns):
id            542684 non-null int64
total         542684 non-null float64
downloaded    542684 non-null float64
avg_rating    542684 non-null float64
dtypes: float64(3), int64(1)
memory usage: 20.7 MB
None


id              int64
total         float64
downloaded    float64
avg_rating    float64
dtype: object


In [39]:
# cast the count columns back to integers now that the NaN rows are gone
rating_df['total'] = rating_df['total'].astype(int)
rating_df['downloaded'] = rating_df['downloaded'].astype(int)

# sort products by id
rating_df = rating_df.sort_values('id')
rating_df.head()

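|        | id | total | downloaded | avg_rating |
|--------|----|-------|------------|------------|
| 1      | 1  | 2     | 2          | 5.0        |
| 111112 | 2  | 12    | 12         | 4.5        |
| 222223 | 3  | 1     | 1          | 5.0        |
| 333334 | 4  | 1     | 1          | 4.0        |
| 444445 | 5  | 0     | 0          | 0.0        |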
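With the cleaned table in hand, a first quantitative look at the ratings could separate products that have no reviews (their `avg_rating` of 0.0 is a placeholder, as for id 5 above) from products that do. The sketch below runs on the five rows shown in the preview. The variable names are illustrative, and the review-count-weighted mean is only one of several reasonable summaries.

```python
import pandas as pd

# The five rows from the preview above
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "total": [2, 12, 1, 1, 0],
    "downloaded": [2, 12, 1, 1, 0],
    "avg_rating": [5.0, 4.5, 5.0, 4.0, 0.0],
})

# Products without reviews would drag the mean toward 0, so set them aside
reviewed = toy[toy["total"] > 0]
share_unreviewed = (toy["total"] == 0).mean()

# Plain mean treats every product equally; the weighted mean counts every review
mean_rating = reviewed["avg_rating"].mean()
weighted_mean = (
    (reviewed["avg_rating"] * reviewed["total"]).sum() / reviewed["total"].sum()
)

# Products where fewer reviews were downloaded than exist
partial_download = int((toy["downloaded"] < toy["total"]).sum())

print(share_unreviewed, mean_rating, weighted_mean, partial_download)
```

On the full `rating_df`, `rating_df['avg_rating'].describe()` and a histogram of `avg_rating` for reviewed products would be natural follow-ups.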


### Data Analysis

In [71]:
rating_df = pd.DataFrame.from_dict(rating_dict)

In [72]:
rating_df = rating_df.T.reset_index(drop=False)

In [73]:
rating_df.columns = ['id', 'total', 'downloaded', 'avg_rating']

In [74]:
rating_df = rating_df.iloc[1:]

In [75]:
rating_df['id']  = rating_df['id'].astype(int)

In [76]:
rating_df.head(2)

|   | id | total | downloaded | avg_rating |
|---|----|-------|------------|------------|
| 1 | 1  | 2     | 2          | 5          |
| 2 | 10 | 6     | 6          | 4          |


In [88]:
rating_df['downloaded']  = rating_df['downloaded'].astype(int)

ValueError: invalid literal for int() with base 10: ''
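The error comes from rows whose fields are empty strings: `int('')` is not defined, so `astype(int)` fails on the whole column. The following is a minimal sketch of the failure and of one possible workaround, `pd.to_numeric` with `errors='coerce'` followed by pandas' nullable `Int64` dtype, which keeps integer values while allowing missing entries. The series `raw` is a toy example.

```python
import pandas as pd

# Toy column with one empty string, as produced by a record without reviews
raw = pd.Series(["2", "12", ""], name="total")

try:
    raw.astype(int)
    failed = False
except ValueError as err:
    # Same failure as in the cell above
    failed = True
    print("astype(int) failed:", err)

# Empty strings become NaN, then the column is stored as nullable integers
total = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(total.isna().sum())
```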

In [87]:
rating_df.dtypes

id             int32
total         object
downloaded    object
avg_rating    object
dtype: object

In [86]:
rating_df['total']  = rating_df['total'].astype(int)

ValueError: invalid literal for int() with base 10: ''

In [48]:
rating_df.head(5)

|   | id   | total | downloaded | avg_rating |
|---|------|-------|------------|------------|
| 1 | 0    |       |            |            |
| 2 | 1    |       |            |            |
| 3 | 10   |       |            |            |
| 4 | 100  |       |            |            |
| 5 | 1000 |       |            |            |


In [92]:
rating_df = rating_df.apply(pd.to_numeric)

In [93]:
rating_df.dtypes

id              int32
total         float64
downloaded    float64
avg_rating    float64
dtype: object

In [35]:
rating_df = pd.DataFrame.from_dict(rating_dict)
rating_df = rating_df.T.reset_index(drop=False)
rating_df.columns = ['id', 'total', 'downloaded', 'avg_rating']
rating_df = rating_df.iloc[1:]
rating_df['id']  = rating_df['id'].astype(int)
rating_df['total']  = rating_df['total'].astype(int)
rating_df['downloaded']  = rating_df['downloaded'].astype(int)
rating_df['avg_rating']  = rating_df['avg_rating'].astype(float)
rating_df = rating_df.sort_values('id')
rating_df.head()

ValueError: invalid literal for int() with base 10: ''

In [28]:
# Extract rating info for each product into a dataframe, skipping discontinued products
rating_dict = {}

for group in grouped:
    ident, total, downloaded, avg_rating = '', '', '', ''
    skip = False  # reset the flag for every record
    for item in group:
        # lines were stripped on load, so the prefixes carry no leading spaces
        if item.startswith('Id:'):
            ident = item.split()[-1]
        elif item.startswith('reviews:'):
            total = item.split()[2]
            downloaded = item.split()[4]
            avg_rating = item.split()[7]
        elif item.startswith('discontinued product'):
            skip = True  # assignment, not the comparison `skip == True`
    # keep only non-discontinued products; the header record has no id
    if not skip and ident != '':
        rating_dict[ident] = [total, downloaded, avg_rating]

rating_df = pd.DataFrame.from_dict(rating_dict)
rating_df = rating_df.T.reset_index(drop=False)
rating_df.columns = ['id', 'total', 'downloaded', 'avg_rating']
# no iloc[1:] here: the header record is already excluded above
rating_df['id'] = rating_df['id'].astype(int)
rating_df['total'] = rating_df['total'].astype(int)
rating_df['downloaded'] = rating_df['downloaded'].astype(int)
rating_df['avg_rating'] = rating_df['avg_rating'].astype(float)
rating_df = rating_df.sort_values('id')
rating_df.head()