# Data Collection and Preprocessing

This notebook covers:
- Data Cleaning and Transformation
    - Handling missing values
    - Removing duplicates
    - Correcting data types (if any are incorrect)
    - Standardizing data formats (e.g., dates, currency)

## Data Cleaning and Transformation

In [1]:
# Import libraries and read the CSV file
import pandas as pd
df = pd.read_csv("data.csv",
                 encoding='unicode_escape')

# Remove rows that contain any missing value
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()
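
Before dropping rows, it can help to see how much data each step would remove. The following is a minimal sketch on a hypothetical toy frame (the values and the names `toy`, `missing_per_column`, and `duplicate_rows` are illustrative assumptions, not taken from `data.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for the real file
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "Description": ["HOLDER", "HOLDER", None, "LANTERN"],
    "CustomerID": [17850.0, 17850.0, 13047.0, None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

# Same two cleaning steps as in the cell above
toy_cleaned = toy.dropna().drop_duplicates()
rows_removed = len(toy) - len(toy_cleaned)

print(missing_per_column)
print("duplicates:", duplicate_rows, "rows removed:", rows_removed)
```

Note that `dropna()` with default arguments removes a row if any column is missing, so a column with many gaps can remove a large share of the data.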

In [2]:
# Check the data type of each column
print(df_cleaned.info())

<class 'pandas.core.frame.DataFrame'>
Index: 401604 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    401604 non-null  object 
 1   StockCode    401604 non-null  object 
 2   Description  401604 non-null  object 
 3   Quantity     401604 non-null  int64  
 4   InvoiceDate  401604 non-null  object 
 5   UnitPrice    401604 non-null  float64
 6   CustomerID   401604 non-null  float64
 7   Country      401604 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 27.6+ MB
None


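The dtypes for InvoiceDate and CustomerID should be datetime and int, respectively; they are currently object and float64.

InvoiceNo and StockCode are stored as object. Whether that is correct depends on whether their values contain non-numeric characters, which the next cells check.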

In [3]:
# Check whether InvoiceNo has the wrong dtype

# Select rows whose InvoiceNo contains a letter and display them
InvoiceNo_mask = df_cleaned['InvoiceNo'].str.contains('[a-zA-Z]')
Anomalous_InvoiceNo = df_cleaned[InvoiceNo_mask]
print(Anomalous_InvoiceNo)

       InvoiceNo StockCode                       Description  Quantity  \
141      C536379         D                          Discount        -1   
154      C536383    35004C   SET OF 3 COLOURED  FLYING DUCKS        -1   
235      C536391     22556    PLASTERS IN TIN CIRCUS PARADE        -12   
236      C536391     21984  PACK OF 12 PINK PAISLEY TISSUES        -24   
237      C536391     21983  PACK OF 12 BLUE PAISLEY TISSUES        -24   
...          ...       ...                               ...       ...   
540449   C581490     23144   ZINC T-LIGHT HOLDER STARS SMALL       -11   
541541   C581499         M                            Manual        -1   
541715   C581568     21258        VICTORIAN SEWING BOX LARGE        -5   
541716   C581569     84978  HANGING HEART JAR T-LIGHT HOLDER        -1   
541717   C581569     20979     36 PENCILS TUBE RED RETROSPOT        -5   

            InvoiceDate  UnitPrice  CustomerID         Country  
141      12/1/2010 9:41      27.50     14527.0

Items in InvoiceNo are either 6-digit numbers or the letter 'C' followed by a 6-digit number, so object (string) is the correct dtype for InvoiceNo.
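
The format described above can be checked with a regular expression. This is a small sketch on a few invoice numbers copied from the output above; the pattern and the names `invoice_sample`, `invoice_pattern`, and `has_c_prefix` are assumptions made for illustration:

```python
import pandas as pd

# A few invoice numbers taken from the printed output
invoice_sample = pd.Series(["536365", "C536379", "C581569", "581578"])

# Optional leading 'C', then exactly six digits
invoice_pattern = r"C?\d{6}"

# fullmatch requires the whole string to fit the pattern
matches_pattern = invoice_sample.str.fullmatch(invoice_pattern)

# Flag the values that carry the 'C' prefix
has_c_prefix = invoice_sample.str.startswith("C")

print(matches_pattern.all())
print(has_c_prefix.tolist())
```

Running the same `fullmatch` on the full column and checking `.all()` would confirm whether every invoice number follows this format.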

In [4]:
# Check whether StockCode has the wrong dtype

# Select rows whose StockCode contains a letter and display them
StockCode_mask = df_cleaned['StockCode'].str.contains('[a-zA-Z]')
Anomalous_StockCode = df_cleaned[StockCode_mask]
print(Anomalous_StockCode)

       InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
45        536370      POST                              POSTAGE         3   
...          ...       ...                                  ...       ...   
541778    581578    84997C      CHILDRENS CUTLERY POLKADOT BLUE         8   
541809    581579    85099C       JUMBO  BAG BAROQUE BLACK WHITE        10   
541838    581580    84993A            75 GREEN PETIT FOUR CASES         2   
541844    581580    85049A        TRADITIONAL CHRISTMAS RIBBONS         1   
541853    581580    85049E            SCANDINAVIAN REDS RIBBONS         2   

            InvoiceDate  UnitPrice  CustomerID         Country  
0        1

Items in StockCode are 5-digit numbers optionally followed by one or more letters. Some stock codes do not match this pattern (for example POST, D, M, and C2), and their rows should be removed. Because regular codes can contain letters, object (string) is the correct dtype for StockCode.
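
To illustrate the pattern, here is a short sketch that splits a handful of stock codes from the outputs in this notebook into regular and special codes. The names `stock_sample`, `stock_pattern`, `regular_codes`, and `special_codes` are hypothetical:

```python
import pandas as pd

# Stock codes taken from the printed outputs in this notebook
stock_sample = pd.Series(["85123A", "22556", "POST", "D", "M", "C2"])

# Exactly five digits, optionally followed by letters
stock_pattern = r"\d{5}[a-zA-Z]*"

# True where the whole code fits the pattern
is_regular = stock_sample.str.fullmatch(stock_pattern)

regular_codes = stock_sample[is_regular].tolist()
special_codes = stock_sample[~is_regular].tolist()

print(regular_codes)
print(special_codes)
```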

In [5]:
# Select rows whose StockCode does not match the pattern above and display them
df_StockCode_mismatch = df_cleaned[~df_cleaned['StockCode'].str.match(r'^\d{5}[a-zA-Z]*$')]
print(df_StockCode_mismatch)

# Drop the non-matching rows
df_cleaned = df_cleaned.drop(df_StockCode_mismatch.index)

       InvoiceNo StockCode Description  Quantity      InvoiceDate  UnitPrice  \
45        536370      POST     POSTAGE         3   12/1/2010 8:45      18.00   
141      C536379         D    Discount        -1   12/1/2010 9:41      27.50   
386       536403      POST     POSTAGE         1  12/1/2010 11:27      15.00   
1123      536527      POST     POSTAGE         1  12/1/2010 13:04      18.00   
1423      536540        C2    CARRIAGE         1  12/1/2010 14:05      50.00   
...          ...       ...         ...       ...              ...        ...   
541216    581494      POST     POSTAGE         2  12/9/2011 10:13      18.00   
541541   C581499         M      Manual        -1  12/9/2011 10:28     224.69   
541730    581570      POST     POSTAGE         1  12/9/2011 11:59      18.00   
541767    581574      POST     POSTAGE         2  12/9/2011 12:09      18.00   
541768    581578      POST     POSTAGE         3  12/9/2011 12:16      18.00   

        CustomerID         Country  
45

In [6]:
# Convert InvoiceDate from string to datetime
df_cleaned['InvoiceDate'] = pd.to_datetime(df_cleaned['InvoiceDate'], format='%m/%d/%Y %H:%M')
# Convert CustomerID from float to int (no missing values remain after dropna)
df_cleaned['CustomerID'] = df_cleaned['CustomerID'].astype(int)
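
Both conversions can fail or silently lose information on unexpected input. As a hedged sketch on hypothetical toy data, the snippet below counts dates that do not fit the expected format and confirms that the IDs are whole numbers before casting. The frame `toy` and the names `parsed`, `unparseable_count`, and `ids_are_whole` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data; the last date is deliberately malformed
toy = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:45", "12/9/2011 12:16", "not a date"],
    "CustomerID": [14527.0, 13047.0, 12583.0],
})

# errors="coerce" turns values that do not fit the format into NaT
parsed = pd.to_datetime(
    toy["InvoiceDate"], format="%m/%d/%Y %H:%M", errors="coerce"
)
unparseable_count = int(parsed.isna().sum())

# Cast to int only if no value has a fractional part
ids_are_whole = bool((toy["CustomerID"] % 1 == 0).all())
if ids_are_whole:
    customer_ids = toy["CustomerID"].astype("int64")

print(unparseable_count, ids_are_whole)
```

Without `errors="coerce"`, a value that does not fit the format raises an error instead, which is the behavior of the cell above.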

In [7]:
# Double-check the dtypes after conversion
print(df_cleaned.info())

<class 'pandas.core.frame.DataFrame'>
Index: 399689 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    399689 non-null  object        
 1   StockCode    399689 non-null  object        
 2   Description  399689 non-null  object        
 3   Quantity     399689 non-null  int64         
 4   InvoiceDate  399689 non-null  datetime64[ns]
 5   UnitPrice    399689 non-null  float64       
 6   CustomerID   399689 non-null  int64         
 7   Country      399689 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 27.4+ MB
None


In [8]:
# Save the cleaned DataFrame to a new CSV file
df_cleaned.to_csv('clean_data.csv', index=False)
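
CSV files do not store dtypes, so the conversions above are not preserved automatically when the file is read again. The sketch below shows one way to restore them on load, using an in-memory buffer and hypothetical toy rows instead of `clean_data.csv`:

```python
import io
import pandas as pd

# Hypothetical toy rows with the cleaned dtypes
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536370"],
    "StockCode": ["85123A", "22556"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2010-12-01 08:45"]),
    "CustomerID": [17850, 12583],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Keep the code columns as strings and parse the date column on load
reloaded = pd.read_csv(
    buffer,
    dtype={"InvoiceNo": str, "StockCode": str},
    parse_dates=["InvoiceDate"],
)

print(reloaded.dtypes)
```

The same `dtype` and `parse_dates` arguments can be passed when reading the saved file in a later notebook.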