# Exercise 1: 
# Evaluating and Cleaning up Sales Data for UK E-commerce Companies

## Analysis Objectives

The purpose of this data analysis is to uncover the best-selling products based on market sales data **in order to develop more effective marketing strategies to increase revenue.**

This project is also practice in assessing how tidy and how clean the data is and, based on that assessment, cleaning it so that it can be analysed in the next step.

## Introduction

The original dataset records all the transactions of a UK-based online retail company between 1 December 2010 and 9 December 2011, covering data from the company's operations in different countries and regions around the world. The company sells gifts covering a variety of scenarios, including but not limited to birthday gifts, wedding souvenirs, Christmas gifts and more. The company's customer base mainly consists of wholesalers and individual consumers, with wholesalers accounting for a significant proportion.

Data description:
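- `InvoiceNo`: The invoice number. A 6-digit number that serves as a unique identifier for the transaction. If this code starts with the letter "C", the transaction is cancelled.
- `StockCode`: The product code. Usually 5 digits, sometimes followed by a letter (for example `85123A`), used as a unique identifier for the product. Non-product entries use special codes such as `POST` (postage).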
- `Description`: Product name.
- `Quantity`: The quantity of the product in the transaction.
- `InvoiceDate`: Invoice date and time. The date and time the transaction occurred.
- `UnitPrice`: Unit price. The price is in Pounds Sterling (£).
- `CustomerID`: Customer ID. A 5-digit number that serves as a unique identifier for the customer.
- `Country`: Country name. The name of the country in which the customer resides.
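
The column descriptions above can be turned into quick checks. The following is a minimal sketch on a few hand-made rows; the `sample` frame and the `is_cancelled` and `is_five_digit_code` column names are illustrative assumptions, not part of the dataset. It flags invoices whose number starts with the letter C and stock codes that are exactly five digits.

```python
import pandas as pd

# Hypothetical rows shaped like the dataset; values chosen for illustration.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366"],
    "StockCode": ["71053", "D", "85123A"],
})

# A cancelled transaction has an invoice number starting with the letter C
# (compared case-insensitively here).
sample["is_cancelled"] = sample["InvoiceNo"].str.upper().str.startswith("C")

# True only when the stock code consists of exactly five digits.
sample["is_five_digit_code"] = sample["StockCode"].str.fullmatch(r"\d{5}")

print(sample)
```

Codes that fail the five-digit check are not necessarily errors; they may be letter-suffixed variants or special entries, so the flag is only a starting point for inspection.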

## Loading Data

In [28]:
import pandas as pd

In [30]:
df = pd.read_csv('e_commerce.csv')
df.head(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850.0,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850.0,United Kingdom
7,536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850.0,United Kingdom
8,536366,22632,HAND WARMER RED POLKA DOT,6,12/1/2010 8:28,1.85,17850.0,United Kingdom
9,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,12/1/2010 8:34,1.69,13047.0,United Kingdom


## Evaluating Data

The assessment covers two areas: structure (neatness) and content (cleanliness). Structural problems mean that the data violates one of three criteria: one variable per column, one observation per row, and one value per cell. Content problems include missing, duplicated, inconsistent, and invalid data.

### Assessment of Data Neatness (Structure)

In [32]:
df.sample(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
336788,566402,21232,STRAWBERRY CERAMIC TRINKET BOX,24,9/12/2011 13:34,1.25,14997.0,United Kingdom
139740,548349,22087,PAPER BUNTING WHITE LACE,40,3/30/2011 14:39,2.55,15812.0,United Kingdom
484165,577523,22778,GLASS CLOCHE SMALL,4,11/20/2011 13:33,3.95,12597.0,Spain
333266,566191,21931,JUMBO STORAGE BAG SUKI,10,9/9/2011 13:26,2.08,14388.0,United Kingdom
171745,551462,84077,WORLD WAR 2 GLIDERS ASSTD DESIGNS,48,4/28/2011 16:11,0.29,15210.0,United Kingdom
537381,581256,22896,PEG BAG APPLES DESIGN,2,12/8/2011 11:21,5.79,,United Kingdom
472018,576665,23348,CHILDRENS TOY COOKING UTENSIL SET,6,11/16/2011 11:46,2.08,17841.0,United Kingdom
443119,574690,POST,POSTAGE,4,11/6/2011 13:11,40.0,12638.0,Sweden
454773,575602,23344,JUMBO BAG 50'S CHRISTMAS,4,11/10/2011 12:27,2.08,17059.0,United Kingdom
257171,559515,22883,NUMBER TILE VINTAGE FONT 4,1,7/8/2011 15:58,4.13,,United Kingdom


**Summary:**
This dataset has no structural problems.

### Assessment of Data Cleanliness (Content)

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


**Summary:**
* 'Description' and 'CustomerID' contain null values.
* 'InvoiceDate' should be converted to a datetime dtype and 'CustomerID' to a string dtype.
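
The null counts can also be read off directly instead of comparing the non-null counts in `df.info()` by eye. A minimal sketch, assuming a small hand-made frame named `toy` (hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the pattern of missing values matters here.
toy = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "POSTAGE"],
    "CustomerID": [17850.0, np.nan, np.nan],
    "Country": ["United Kingdom", "United Kingdom", "Sweden"],
})

# Count missing values per column, then keep only columns that have any.
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])
```

On the full dataset the same two lines would be applied to `df` in place of `toy`.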

#### 1. Missing Data

##### 'Description' is Null

In [34]:
df[df['Description'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,12/1/2010 11:52,0.0,,United Kingdom
1970,536545,21134,,1,12/1/2010 14:32,0.0,,United Kingdom
1971,536546,22145,,1,12/1/2010 14:33,0.0,,United Kingdom
1972,536547,37509,,1,12/1/2010 14:33,0.0,,United Kingdom
1987,536549,85226A,,1,12/1/2010 14:34,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
535322,581199,84581,,-2,12/7/2011 18:26,0.0,,United Kingdom
535326,581203,23406,,15,12/7/2011 18:31,0.0,,United Kingdom
535332,581209,21620,,6,12/7/2011 18:35,0.0,,United Kingdom
536981,581234,72817,,27,12/8/2011 10:33,0.0,,United Kingdom


*Guess 1: whenever 'Description' is null, 'UnitPrice' is 0.*

In [46]:
df[(df['Description'].isnull()) & (df['UnitPrice'] !=0)]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


**Summary:** Guess 1 is correct: every row with a null 'Description' has a 'UnitPrice' of 0, so these rows can be deleted.
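
The check above works by searching for counterexamples: if no row has a missing 'Description' together with a non-zero 'UnitPrice', the guess holds. A minimal sketch of the same pattern on hypothetical data (the `toy` frame and the `guess_holds` name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows: two with a missing description and a zero price.
toy = pd.DataFrame({
    "Description": [None, "POLKADOT RAIN HAT", None],
    "UnitPrice": [0.0, 0.85, 0.0],
})

# Rows that would contradict the guess: description missing, price non-zero.
counterexamples = toy[toy["Description"].isnull() & (toy["UnitPrice"] != 0)]

# The guess holds when no counterexample exists.
guess_holds = counterexamples.empty
print(guess_holds)
```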


##### 'CustomerID' is Null

In [47]:
df[df['CustomerID'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,12/1/2010 11:52,0.00,,United Kingdom
1443,536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,12/1/2010 14:32,2.51,,United Kingdom
1444,536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,12/1/2010 14:32,2.51,,United Kingdom
1445,536544,21786,POLKADOT RAIN HAT,4,12/1/2010 14:32,0.85,,United Kingdom
1446,536544,21787,RAIN PONCHO RETROSPOT,2,12/1/2010 14:32,1.66,,United Kingdom
...,...,...,...,...,...,...,...,...
541536,581498,85099B,JUMBO BAG RED RETROSPOT,5,12/9/2011 10:26,4.13,,United Kingdom
541537,581498,85099C,JUMBO BAG BAROQUE BLACK WHITE,4,12/9/2011 10:26,4.13,,United Kingdom
541538,581498,85150,LADIES & GENTLEMEN METAL SIGN,1,12/9/2011 10:26,4.96,,United Kingdom
541539,581498,85174,S/4 CACTI CANDLES,1,12/9/2011 10:26,10.79,,United Kingdom


*Guess 2: in rows where both 'CustomerID' and 'Description' are null, 'UnitPrice' is 0.*

In [50]:
df[(df['CustomerID'].isnull()) & (df['UnitPrice'] != 0) & (df['Description'].isnull())]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


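**Summary:** Guess 2 is correct in the sense tested above: no row has a null 'CustomerID', a null 'Description' and a non-zero 'UnitPrice'. Note that many rows with a null 'CustomerID' still have a valid 'Description' and 'UnitPrice' (see the output of `In [47]`), so a null 'CustomerID' alone is not a reason to delete a row.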
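
The filter above only looks at rows where both columns are null. To see how the missing values in the two columns overlap in general, a cross-tabulation is one option. This is a sketch on hypothetical rows, with the `toy`, `cust`, `desc` and `missing_table` names assumed for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering three of the four possible combinations.
toy = pd.DataFrame({
    "CustomerID": [np.nan, np.nan, 17850.0, np.nan],
    "Description": [None, "POLKADOT RAIN HAT", "WHITE METAL LANTERN", None],
})

# Turn each column into a "missing" / "present" label.
labels = {True: "missing", False: "present"}
cust = toy["CustomerID"].isnull().map(labels).rename("CustomerID")
desc = toy["Description"].isnull().map(labels).rename("Description")

# Rows: state of CustomerID. Columns: state of Description.
missing_table = pd.crosstab(cust, desc)
print(missing_table)
```

A non-zero count in the cell for a missing 'CustomerID' and a present 'Description' means the two columns are not always missing together.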

#### 2. Duplicated Data

Based on the meaning of the variables, InvoiceNo, StockCode, and CustomerID are all identifiers, but none of them has to be unique per row. A single transaction can contain multiple items, so InvoiceNo can repeat. Different transactions can contain the same item, so StockCode can repeat. A customer can make multiple transactions or order multiple items, so CustomerID can also repeat. For this dataset, therefore, we don't need to evaluate duplicates.
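
Repeated identifiers are expected for the reasons given above. Rows that are identical in every column are a separate question that this notebook does not evaluate; if one did want to count them, a sketch could look like the following (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows: the first two are identical in every column.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365"],
    "StockCode": ["71053", "71053", "22752"],
    "Quantity": [6, 6, 2],
})

# Repeated InvoiceNo values are expected: one invoice, several items.
repeated_invoice_rows = toy["InvoiceNo"].duplicated().sum()

# Rows that repeat an earlier row in all columns are counted separately.
identical_rows = toy.duplicated().sum()

print(repeated_invoice_rows, identical_rows)
```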

#### 3. Inconsistent Data

In [52]:
df['Country'].value_counts()

United Kingdom          495266
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
China                      288
Singapore                  229
USA                        218
UK                         211
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United States               73
United Arab Emirates        68
European

**Summary:**
* "USA" should be unified with "United States".
* "UK" and "U.K." should be unified with "United Kingdom".

#### 4. Invalid Data

In [53]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


**Summary:** 'Quantity' and 'UnitPrice' contain outliers, including negative values (minimums of -80995 and -11062.06).

In [55]:
df[df['Quantity'] < 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,12/1/2010 9:41,27.50,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,12/1/2010 9:49,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,12/1/2010 10:24,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
...,...,...,...,...,...,...,...,...
540449,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,-11,12/9/2011 9:57,0.83,14397.0,United Kingdom
541541,C581499,M,Manual,-1,12/9/2011 10:28,224.69,15498.0,United Kingdom
541715,C581568,21258,VICTORIAN SEWING BOX LARGE,-5,12/9/2011 11:57,10.95,15311.0,United Kingdom
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,12/9/2011 11:58,1.25,17315.0,United Kingdom


*Guess 3: when 'Quantity' < 0, every 'InvoiceNo' starts with 'C'.*

In [57]:
df[(df["Quantity"] < 0) & (df["InvoiceNo"].str[0] != "C")]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
2406,536589,21777,,-10,12/1/2010 16:50,0.0,,United Kingdom
4347,536764,84952C,,-38,12/2/2010 14:42,0.0,,United Kingdom
7188,536996,22712,,-20,12/3/2010 15:30,0.0,,United Kingdom
7189,536997,22028,,-20,12/3/2010 15:30,0.0,,United Kingdom
7190,536998,85067,,-6,12/3/2010 15:30,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
535333,581210,23395,check,-26,12/7/2011 18:36,0.0,,United Kingdom
535335,581212,22578,lost,-1050,12/7/2011 18:38,0.0,,United Kingdom
535336,581213,22576,check,-30,12/7/2011 18:38,0.0,,United Kingdom
536908,581226,23090,missing,-338,12/8/2011 9:56,0.0,,United Kingdom


**Summary:** Guess 3 is incorrect. Next, test whether 'UnitPrice' equals 0 in all of these rows.

In [64]:
df[(df["Quantity"] < 0) & (df["InvoiceNo"].str[0] != "C") & (df['UnitPrice'] != 0)]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


In [66]:
df[df["UnitPrice"] < 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
299983,A563186,B,Adjust bad debt,1,8/12/2011 14:51,-11062.06,,United Kingdom
299984,A563187,B,Adjust bad debt,1,8/12/2011 14:52,-11062.06,,United Kingdom


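**Summary:** Rows with a negative 'Quantity' are either cancellations or have a 'UnitPrice' of 0, and the only rows with a negative 'UnitPrice' are two bad-debt adjustments. All of these rows should be deleted.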
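
Before deleting the invalid rows, it can help to see which kinds of invoices they come from. The sketch below groups them by the leading letter of 'InvoiceNo'; the `toy` frame, the `prefix` column and the `"none"` label are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows: a cancellation, a zero-price stock adjustment,
# a bad-debt adjustment, and one ordinary sale.
toy = pd.DataFrame({
    "InvoiceNo": ["C536379", "536589", "A563186", "536365"],
    "Quantity": [-1, -10, 1, 6],
    "UnitPrice": [27.50, 0.0, -11062.06, 2.55],
})

# Leading letter of the invoice number, or "none" if it starts with a digit.
toy["prefix"] = (
    toy["InvoiceNo"].str.extract(r"^([A-Za-z])", expand=False).fillna("none")
)

# Rows that the cleaning step would remove.
invalid = toy[(toy["Quantity"] < 0) | (toy["UnitPrice"] < 0)]
prefix_counts = invalid.groupby("prefix").size()
print(prefix_counts)
```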

## Cleaning Data

* Convert 'InvoiceDate' to datetime and 'CustomerID' to string.
* Delete the rows where 'Description' is null (their 'UnitPrice' is 0).
* Replace "USA" with "United States", and "UK" and "U.K." with "United Kingdom".
* Delete the rows where 'Quantity' < 0 or 'UnitPrice' < 0.

##### 1. Convert 'InvoiceDate' to datetime and 'CustomerID' to string

In [68]:
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["InvoiceDate"]

0        2010-12-01 08:26:00
1        2010-12-01 08:26:00
2        2010-12-01 08:26:00
3        2010-12-01 08:26:00
4        2010-12-01 08:26:00
                 ...        
541904   2011-12-09 12:50:00
541905   2011-12-09 12:50:00
541906   2011-12-09 12:50:00
541907   2011-12-09 12:50:00
541908   2011-12-09 12:50:00
Name: InvoiceDate, Length: 541909, dtype: datetime64[ns]
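
The raw strings follow a month/day/year layout (for example `12/1/2010 8:26`), which `pd.to_datetime` inferred correctly here. Passing the layout explicitly is one way to make that assumption visible. A minimal sketch on two strings copied from the outputs above; the `raw_dates` and `parsed` names are illustrative:

```python
import pandas as pd

# Two raw InvoiceDate strings as they appear in the CSV.
raw_dates = pd.Series(["12/1/2010 8:26", "9/12/2011 13:34"])

# Month/day/year followed by a 24-hour time.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %H:%M")
print(parsed)
```

With an explicit `format`, a string that does not match the layout raises an error instead of being parsed under a different interpretation.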

In [73]:
df['CustomerID'] = df['CustomerID'].astype('str')
df['CustomerID']

0         17850.0
1         17850.0
2         17850.0
3         17850.0
4         17850.0
           ...   
541904    12680.0
541905    12680.0
541906    12680.0
541907    12680.0
541908    12680.0
Name: CustomerID, Length: 541909, dtype: object

In [74]:
df['CustomerID'] = df['CustomerID'].str.slice(0, -2)
df['CustomerID']

0         17850
1         17850
2         17850
3         17850
4         17850
          ...  
541904    12680
541905    12680
541906    12680
541907    12680
541908    12680
Name: CustomerID, Length: 541909, dtype: object
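
One caveat with the two steps above: `astype('str')` turns a missing 'CustomerID' into the text `'nan'`, and slicing off the last two characters then leaves `'n'`, which is why `df.info()` later reports no null values in this column. If missing IDs should stay missing, one alternative is to go through pandas' nullable integer and string dtypes. This is a sketch on a hypothetical series (`ids` and `ids_as_text` are assumed names), not the method used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical IDs stored as floats, with one missing value.
ids = pd.Series([17850.0, np.nan, 12680.0])

# Float -> nullable integer -> nullable string; missing values stay missing.
ids_as_text = ids.astype("Int64").astype("string")
print(ids_as_text)
```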

##### 2. Delete the rows where 'Description' is null

In [76]:
df.dropna(subset=['Description'], inplace=True)
df['Description'].isnull().sum()

0

##### 3. Replace "USA" with "United States", and "UK" and "U.K." with "United Kingdom"

In [86]:
df['Country'] = df['Country'].replace({"USA": "United States", "UK": "United Kingdom", "U.K.":"United Kingdom"})
df[(df['Country'] == 'USA') | (df['Country'] == 'UK') | (df['Country'] == 'U.K.')] 

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


##### 4. Delete the rows where 'Quantity' < 0 or 'UnitPrice' < 0

In [90]:
df = df[df['Quantity'] >= 0]
df = df[df['UnitPrice'] >= 0]
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680,France


In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530691 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    530691 non-null  object        
 1   StockCode    530691 non-null  object        
 2   Description  530691 non-null  object        
 3   Quantity     530691 non-null  int64         
 4   InvoiceDate  530691 non-null  datetime64[ns]
 5   UnitPrice    530691 non-null  float64       
 6   CustomerID   530691 non-null  object        
 7   Country      530691 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 36.4+ MB


In [92]:
df.describe()

Unnamed: 0,Quantity,UnitPrice
count,530691.0,530691.0
mean,10.605855,3.903303
std,156.638147,35.896047
min,1.0,0.0
25%,1.0,1.25
50%,3.0,2.08
75%,10.0,4.13
max,80995.0,13541.33
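
The `df.info()` and `df.describe()` outputs above are inspected by eye. The same expectations can be written as explicit checks. The sketch below runs on a small hypothetical frame named `cleaned`; the check names and the `failed` list are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned data.
cleaned = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", "POSTAGE"],
    "Quantity": [6, 4],
    "UnitPrice": [3.39, 40.0],
    "Country": ["United Kingdom", "Sweden"],
})

# One boolean per cleaning rule listed in the Cleaning Data section.
checks = {
    "no missing Description": cleaned["Description"].notnull().all(),
    "no negative Quantity": (cleaned["Quantity"] >= 0).all(),
    "no negative UnitPrice": (cleaned["UnitPrice"] >= 0).all(),
    "no legacy country labels": not cleaned["Country"]
    .isin(["USA", "UK", "U.K."])
    .any(),
}

# Names of the rules that were violated; empty when everything passes.
failed = [name for name, ok in checks.items() if not ok]
print(failed)
```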
