## Data Analysis with Python Course Project
***

In this project, you'll analyse listing data from an e-commerce platform.

The dataset is stored in the `data/project_data.xlsx` file. It contains listing information posted on the platform.

Each row in the dataset corresponds to a single listing.

The dataset has 12 columns and 464,433 rows.

Here is a brief description of each column:
- `itemid`: a unique ID of the product
- `shopid`: a unique ID of the shop
- `item_name`: product title
- `item_description`: detailed product description
- `item_variation`: variations of a product (e.g. different colours or sizes, in a format like {variation 1 name: variation 1 price, variation 2 name: variation 2 price})
- `price`: the price at which the item is sold
- `stock`: how many units are left in stock
- `category`: the category the product belongs to
- `cb_option`: 1 indicates the product is sold by a cross-border shop
- `is_preferred`: 1 indicates the product is sold by a preferred shop
- `sold_count`: how many units of the product have been sold
- `item_creation_date`: when the product was uploaded by the seller


In [5]:
import pandas as pd
import numpy as np

df = pd.read_excel('../data/project_data.xlsx', 
                   sheet_name = 'listing_data', 
                   dtype = {'itemid': str, 'shopid':str,
                            'cb_option':str, 'is_preferred':str})

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 464433 entries, 0 to 464432
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   itemid              464433 non-null  object        
 1   shopid              464433 non-null  object        
 2   item_name           464409 non-null  object        
 3   item_description    463336 non-null  object        
 4   item_variation      464433 non-null  object        
 5   price               464433 non-null  float64       
 6   stock               464433 non-null  int64         
 7   category            464422 non-null  object        
 8   cb_option           464433 non-null  object        
 9   is_preferred        464433 non-null  object        
 10  sold_count          464433 non-null  int64         
 11  item_creation_date  464433 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2), object(8)
memory usage: 42.5+ MB


### Questions

1. How many unique shops are in the dataset?

In [6]:
len(set(df['shopid']))

7856

In [7]:
len(df['shopid'].unique())

7856

2. How many unique shops are both preferred and cross-border?

In [11]:
# boolean indexing
len(df[(df.cb_option=='1') & (df.is_preferred=='1')]['shopid'].unique())

158

In [15]:
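# loc + np.where (np.where returns positions; they match the labels here
# only because df has the default RangeIndex)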
print(np.where((df['cb_option']=='1') & (df['is_preferred']=='1')))

len(df.loc[np.where((df['cb_option']=='1') & (df['is_preferred']=='1'))]['shopid'].unique())

(array([   947,    948,    949, ..., 463426, 463427, 463428]),)


158
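Both approaches above build the filtered frame and then take `len(...unique())`. A slightly more compact variant is to select the column with `.loc` and call `Series.nunique()`. The sketch below runs on a small made-up frame (`toy` and its values are invented for illustration and are not taken from the project data):

```python
import pandas as pd

# Toy data, invented for illustration; the flags are strings,
# mirroring the dtype used when loading the real dataset.
toy = pd.DataFrame({
    'shopid':       ['s1', 's1', 's2', 's3', 's3'],
    'cb_option':    ['1',  '1',  '1',  '0',  '1'],
    'is_preferred': ['1',  '1',  '0',  '1',  '1'],
})

# Rows belonging to shops that are both cross-border and preferred
mask = (toy['cb_option'] == '1') & (toy['is_preferred'] == '1')

# nunique() counts distinct non-null values directly
n_shops = toy.loc[mask, 'shopid'].nunique()
print(n_shops)
```

Using a boolean mask with `.loc` selects by label, so it does not rely on the index being a default `RangeIndex`.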

3. How many products have zero sold count?

4. How many products were created in the year 2018?

In [None]:
# TO DO
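As a hint for Questions 3 and 4, one possible pattern is to build a boolean Series and sum it, since `True` counts as 1. The sketch below uses a made-up frame (`toy` and its values are hypothetical) and assumes the date column is already a datetime dtype, as `df.info()` reports for `item_creation_date`:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'itemid': ['a', 'b', 'c', 'd'],
    'sold_count': [0, 5, 0, 2],
    'item_creation_date': pd.to_datetime(
        ['2017-12-31', '2018-01-01', '2018-06-15', '2019-02-01']),
})

# Summing a boolean Series counts the rows where the condition holds
n_zero_sold = (toy['sold_count'] == 0).sum()

# The .dt accessor exposes datetime components such as the year
n_created_2018 = (toy['item_creation_date'].dt.year == 2018).sum()

print(n_zero_sold, n_created_2018)
```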



5. Show the shopid of the top 3 preferred shops with the largest number of unique products

In [None]:
# TO DO
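One way to approach Question 5 is filter, group, count distinct, then take the largest values. This is only a sketch on a made-up frame (`toy` and its values are hypothetical); it is one possible approach, not the expected solution:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'shopid': ['s1', 's1', 's1', 's2', 's2', 's3', 's4', 's4', 's4', 's4'],
    'itemid': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
    'is_preferred': ['1', '1', '1', '1', '1', '1', '0', '0', '0', '0'],
})

# Keep only listings from preferred shops
preferred = toy[toy['is_preferred'] == '1']

# Count distinct items per shop, then keep the 3 largest counts
top3 = preferred.groupby('shopid')['itemid'].nunique().nlargest(3)
print(top3)
```

The same filter, group, `nunique`, `nlargest` pattern would also fit Question 6, grouping by `category` and filtering on `cb_option` instead.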



6. Show the top 3 categories with the largest number of unique cross-border products

In [None]:
# TO DO



7. Find the top 3 shopid values with the highest revenue (assumption: product prices have not changed)

In [None]:
# TO DO
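Under the stated assumption that prices have not changed, revenue per listing can be approximated as `price * sold_count`. The sketch below shows this on a made-up frame (`toy`, its values, and the helper column name `revenue` are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'shopid': ['s1', 's1', 's2', 's3', 's4'],
    'price': [10.0, 5.0, 100.0, 1.0, 20.0],
    'sold_count': [2, 4, 1, 3, 0],
})

# Revenue per listing, assuming the price never changed
toy = toy.assign(revenue=toy['price'] * toy['sold_count'])

# Sum revenue per shop and keep the 3 largest totals
top3 = toy.groupby('shopid')['revenue'].sum().nlargest(3)
print(top3)
```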



8. Find the number of products that have more than 3 variations (do not include products with 3 or fewer variations)

In [None]:
# TO DO
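How to count variations depends on how `item_variation` is actually stored, which the column description only outlines. The sketch below assumes, purely for illustration, that each cell is a JSON object string mapping variation names to prices; check a few real values first (for example with `df['item_variation'].head()`) and adapt the parsing if the format differs. The `toy` frame and its values are made up:

```python
import json
import pandas as pd

# Toy data, invented for illustration; the JSON-string format is an
# assumption and may not match the real column.
toy = pd.DataFrame({
    'itemid': ['a', 'b', 'c'],
    'item_variation': [
        '{}',
        '{"S": 10.0, "M": 10.0, "L": 11.0}',
        '{"red": 5.0, "blue": 5.0, "green": 5.0, "black": 6.0}',
    ],
})

# Parse each string into a dict and count its keys
n_variations = toy['item_variation'].apply(lambda s: len(json.loads(s)))

# Strictly more than 3 variations
n_more_than_3 = (n_variations > 3).sum()
print(n_more_than_3)
```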



9. Identify duplicated listings within each shop (if listings A and B in shop S have exactly the same product title, detailed product description, and price, both A and B are considered duplicated listings)

In [None]:
# TO DO



10. Mark duplicated listings with True and all other listings with False, and store the result in a new column named `is_duplicated`

In [None]:
# TO DO
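For Questions 9 and 10, the definition "both A and B are duplicated" maps naturally onto `DataFrame.duplicated` with `keep=False`, which flags every member of a duplicate group rather than only the later copies. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'shopid': ['s1', 's1', 's1', 's2', 's2'],
    'item_name': ['cup', 'cup', 'cup', 'cup', 'mug'],
    'item_description': ['blue cup', 'blue cup', 'red cup',
                         'blue cup', 'big mug'],
    'price': [3.0, 3.0, 3.0, 3.0, 7.0],
})

# Columns that must all match for two listings to count as duplicates
keys = ['shopid', 'item_name', 'item_description', 'price']

# keep=False marks all rows in a duplicate group, not just the repeats
toy['is_duplicated'] = toy.duplicated(subset=keys, keep=False)
print(toy)
```

Since `item_name` and `item_description` contain some missing values in the real data, it is worth checking how those rows are flagged and deciding whether listings with a missing title or description should count as duplicates at all.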



11. Find duplicated listings that have a sold count of less than 2 and store the result in a new Excel file named `duplicated_listings.xlsx`

In [None]:
# TO DO



12. Find the shopid of the preferred shop that has the largest number of duplicated listings

In [None]:
# TO DO

