## Analysis of an E-Commerce Dataset
### Introduction
### Part 1. EDA
1. Remove records where:
 * gender/rating/helpfulness is missing
 * review is 'none'

In [5]:
# import pandas and csv file
import pandas as pd
ec = pd.read_csv('E-commerce Dataset (p1).csv')

# display dataframe - before 
ec.head()

Unnamed: 0,userId,timestamp,review,item,rating,helpfulness,gender,category,item_id,item_price,user_city
0,4051,12807,Great job for what it is!,eBay,5.0,2.0,F,Online Stores & Services,88,149.0,39
1,4052,122899,Free Access Worth your Time,NetZero,5.0,0.0,F,Online Stores & Services,46,53.0,39
2,33,12700,AOL..I love you!!!!!!!!!!!!,AOL (America Online),5.0,4.0,F,Online Stores & Services,0,145.84,31
3,33,21000,EBAY!!! I LOVE YOU!!!! :-)*,eBay,5.0,4.0,F,Online Stores & Services,88,149.0,31
4,33,22300,Blair Witch...Oh Come On.......,Blair Witch Project,1.0,4.0,F,Movies,12,44.0,31


In [6]:
# count Null values - before
print(ec.isnull().sum())

# print length before removing records
print(len(ec))

userId          0
timestamp       0
review          0
item            0
rating         17
helpfulness    22
gender         27
category        0
item_id         0
item_price      0
user_city       0
dtype: int64
20000


In [7]:
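# remove missing data: dropna() drops a row if any column is null,
# which here affects only rating, helpfulness and gender (see the null counts above)
ec_clean = ec.dropna()
# remove records whose review text is the literal string 'none'
ec_clean = ec_clean[ec_clean.review != 'none']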

# print length after
print(len(ec_clean))

19916
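
As a variation on the cleaning step, the null check can be restricted to the three columns named in the task. The sketch below is only illustrative: `toy` is a made-up frame standing in for the real CSV, and `toy_clean` is a hypothetical name.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV (an assumption, not the dataset)
toy = pd.DataFrame({
    "review": ["good", "none", "bad", "fine", "ok"],
    "rating": [5.0, 4.0, None, 3.0, 2.0],
    "helpfulness": [2.0, 3.0, 4.0, None, 1.0],
    "gender": ["F", "M", "F", "M", None],
})

# Drop a row only when gender, rating or helpfulness is missing
toy_clean = toy.dropna(subset=["gender", "rating", "helpfulness"])
# Then drop rows whose review text is the literal string 'none'
toy_clean = toy_clean[toy_clean["review"] != "none"]

print(len(toy), "->", len(toy_clean))
```

Using `subset` keeps rows that are only missing a value in some other column, which plain `dropna()` would discard.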


#### Q2. Descriptive Statistics

* Q2.1 total number of unique users, unique reviews, unique items, and unique categories
* Q2.2 descriptive statistics, e.g., the total number, mean, std, min and max regarding all rating records
* Q2.3 descriptive statistics, e.g., mean, std, max, and min of the number of items rated by different genders 
* Q2.4 descriptive statistics, e.g., mean, std, max, min of the number of ratings received by each item

In [8]:
## Q2.1
print("Number of unique users:", ec_clean.userId.nunique())
print("Number of unique reviews:", ec_clean.review.nunique())
print("Number of unique items:", ec_clean.item.nunique())
print("Number of unique categories:", ec_clean.category.nunique())

Number of unique users: 8562
Number of unique reviews: 19459
Number of unique items: 89
Number of unique categories: 9


In [9]:
## Q2.2 - for all rating records
ec_clean.rating.describe()

count    19916.000000
mean         3.701798
std          1.404451
min          1.000000
25%          3.000000
50%          4.000000
75%          5.000000
max          5.000000
Name: rating, dtype: float64

In [10]:
## Q2.3 - for the number of items rated by different genders
ec_clean.groupby("gender")["item"].count().describe()

count        2.000000
mean      9958.000000
std        233.345238
min       9793.000000
25%       9875.500000
50%       9958.000000
75%      10040.500000
max      10123.000000
Name: item, dtype: float64

In [11]:
## Q2.4 - for the number of ratings received by each item
ec_clean.groupby("item")["rating"].count().describe()

count     89.000000
mean     223.775281
std      116.418988
min      139.000000
25%      162.000000
50%      187.000000
75%      245.000000
max      939.000000
Name: rating, dtype: float64
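
Note that `count()` after a `groupby` counts rating records, not distinct items. The sketch below, on a made-up frame (`toy` and the three result names are hypothetical), contrasts `count()` with `nunique()`. Which reading Q2.3 intends is an assumption either way.

```python
import pandas as pd

# Hypothetical toy data (not the real dataset)
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M"],
    "item": ["eBay", "eBay", "AOL", "eBay", "NetZero"],
    "rating": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Rating records per gender: what .count() measures
records_per_gender = toy.groupby("gender")["item"].count()
# Distinct items per gender: an alternative reading of "number of items"
items_per_gender = toy.groupby("gender")["item"].nunique()
# Ratings received per item, which can then be summarised with describe()
ratings_per_item = toy.groupby("item")["rating"].count()

print(records_per_gender.to_dict())
print(items_per_gender.to_dict())
print(ratings_per_item.describe())
```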

#### Q3. Plotting and Analysis

Please try to explore the correlation between gender/helpfulness/category and ratings; for instance, do female/male users tend to provide higher ratings than male/female users? Hint: you may use the boxplot function to plot figures for comparison (___Challenge___)
    
You may need to select the most suitable graphic forms for ease of presentation. Most importantly, for each figure or subfigure, please summarise ___what each plot shows___ (i.e. observations and explanations). Finally, you may need to provide an overall summary of the data.

In [None]:
# your code and solutions 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# boxplot for gender vs rating 
ec_clean.boxplot('rating', by='gender').get_figure().suptitle('')
plt.title('Boxplot of Ratings by Gender')
plt.xlabel('Gender')
plt.ylabel('Rating')

### Analysis
The boxplot shows that the rating distributions for the two genders are nearly identical in both centre and spread. This suggests that there is no meaningful association between gender and rating; in other words, rating does not appear to depend on gender.
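
A visual comparison can be backed up with per-group summary statistics. This is a minimal sketch on made-up data (`toy`, `by_gender` and `gap` are hypothetical names), not a result from the real dataset.

```python
import pandas as pd

# Hypothetical toy data (not the real dataset)
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "rating": [5.0, 4.0, 3.0, 4.0, 5.0, 4.0, 3.0, 4.0],
})

# One row per gender with count, mean, std and quartiles of the ratings
by_gender = toy.groupby("gender")["rating"].describe()
print(by_gender[["mean", "25%", "50%", "75%"]])

# Difference in mean rating between the two groups
gap = by_gender.loc["F", "mean"] - by_gender.loc["M", "mean"]
print(gap)
```

If the quartiles and means are close for both groups, that supports what the boxplot shows.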

In [None]:
# boxplot for helpfulness vs rating
ec_clean.boxplot('rating', by='helpfulness').get_figure().suptitle('')
plt.title('Boxplot of Ratings by Helpfulness')
plt.xlabel('Helpfulness')
plt.ylabel('Rating')

### Analysis
The boxplot suggests that there is also little correlation between the helpfulness score and the product rating, since the distributions and spreads are similar across helpfulness values. For most helpfulness values, the bulk of the ratings lies between 3.0 and 5.0. The exception is a helpfulness of 2.0, where the ratings range from 2.0 to 5.0.
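
Because helpfulness and rating are both ordered scores, a rank correlation is one way to put a number on this observation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, on made-up data (`toy` and `rho` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy data (not the real dataset)
toy = pd.DataFrame({
    "helpfulness": [0.0, 1.0, 2.0, 3.0, 4.0, 4.0],
    "rating":      [3.0, 5.0, 2.0, 4.0, 1.0, 5.0],
})

# Spearman's rho is the Pearson correlation of the ranked values;
# rank() assigns average ranks to ties
rho = toy["helpfulness"].rank().corr(toy["rating"].rank())
print(round(rho, 3))
```

A value close to 0 would be consistent with little monotonic relationship between the two columns.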

In [None]:
# category vs rating
ec_clean.boxplot('rating', by='category').get_figure().suptitle('')
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.title('Boxplot of Ratings by Categories')
plt.xlabel('Category')
plt.ylabel('Rating')

### Analysis
The plot shows how ratings vary by category. The rating distributions differ between categories, so there is an association between the two.
- For Books: people rate this category very highly, with most ratings at 5.0.
- For Games: people also rate this category quite highly, with the distribution ranging from 4.0 to 5.0 and some outliers at 1.0 and 2.0.
- For Media: this is the category that receives the lowest ratings. The distribution ranges from 1.0 to 4.0, with 50% of the ratings between 1.0 and 2.0.
- For Movies: again, the ratings are quite high, with the distribution ranging from 3.0 to 5.0.
- The remaining categories have similar spreads and distributions.
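
The per-category comparison can also be tabulated, which makes it easier to rank categories than reading the boxes. This is a rough sketch on made-up data (`toy` and `summary` are hypothetical names, and the numbers are not from the real dataset).

```python
import pandas as pd

# Hypothetical toy data (not the real dataset)
toy = pd.DataFrame({
    "category": ["Books", "Books", "Books", "Media", "Media", "Media"],
    "rating": [5.0, 5.0, 4.0, 1.0, 2.0, 4.0],
})

# Median, mean and number of ratings per category, highest median first
summary = (
    toy.groupby("category")["rating"]
       .agg(["median", "mean", "count"])
       .sort_values("median", ascending=False)
)
print(summary)
```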

### Summary of the data
- This dataset contains 20000 records. After removing records with missing values or a review of 'none', 19916 records remain, which is a relatively small dataset.
- There are 8562 unique users, 19459 unique reviews, 89 unique items, and 9 unique categories.
- The average rating that people give is 3.7. The average helpfulness score is lower, at 2.6.
- Helpfulness and gender do not appear to affect item ratings. However, ratings differ noticeably between categories.

#### Q4. Detect and remove outliers

We may define outlier users, reviews and items with three rules (if a record meets one of the rules, it is regarded as an outlier):

* reviews of which the helpfulness is no more than 2
* users who rate fewer than 7 items
* items that receive fewer than 11 ratings

Please remove the corresponding records in the csv file that involve outlier users, reviews and items. After that, __print the length of the data__.
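
The three rules can be sketched as one function using `groupby(...).transform("count")`. This assumes the rules are applied in order, with counts recomputed after each filter. The function name `remove_outliers`, its parameters and the toy frame are hypothetical.

```python
import pandas as pd

def remove_outliers(df, max_outlier_helpfulness=2,
                    min_user_ratings=7, min_item_ratings=11):
    """Apply the three outlier rules one after another (assumed order)."""
    # Rule 1: drop reviews whose helpfulness is no more than the threshold
    out = df[df["helpfulness"] > max_outlier_helpfulness]
    # Rule 2: drop users with fewer than min_user_ratings remaining ratings
    user_counts = out.groupby("userId")["item"].transform("count")
    out = out[user_counts >= min_user_ratings]
    # Rule 3: drop items with fewer than min_item_ratings remaining ratings
    item_counts = out.groupby("item")["rating"].transform("count")
    return out[item_counts >= min_item_ratings]

# Hypothetical toy data, with small thresholds so the effect is visible
toy = pd.DataFrame({
    "userId": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "item": ["A", "B", "A", "B", "A", "C"],
    "rating": [5.0, 4.0, 3.0, 2.0, 1.0, 5.0],
    "helpfulness": [4.0, 3.0, 4.0, 1.0, 4.0, 3.0],
})
result = remove_outliers(toy, 2, 2, 2)
print(len(result))
```

`transform("count")` returns one count per row, so each filter is a single boolean mask and no intermediate list of ids is needed.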

In [None]:
# your code and solutions
# rule 1
ec1 = ec_clean[~(ec_clean.helpfulness <= 2)]
ec1.head()

In [None]:
# rule 2
# count how many items each remaining user has rated
r2 = (ec1.groupby('userId')['item'].count()).reset_index()

# keep only the users who rated fewer than 7 items
r2 = r2[(r2.item < 7)]

# collect the userIds of those outlier users
r2Id = r2.userId.tolist()

# drop outliers
ec2 = ec1[~(ec1.userId.isin(r2Id))]
ec2.head()

In [None]:
# rule 3 
# count how many ratings each remaining item has received
r3 = (ec2.groupby('item')['rating'].count()).reset_index()

# keep only the items with fewer than 11 ratings
r3 = r3[(r3.rating < 11)]

# collect the names of those outlier items
r3Id = r3.item.tolist()

# drop outliers
ec3 = ec2[~(ec2.item.isin(r3Id))]
ec3.head()

# print length of the final df
print(len(ec3))