# Orderlines

## Import

In [1]:
import pandas as pd
orderlines_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/orderlines.csv"
orderlines_df = pd.read_csv(orderlines_url)
orderlines = orderlines_df.copy()

## Drop Duplicates

In [2]:
orderlines.duplicated().sum()

0

## Missing Values

In [3]:
orderlines.isna().sum()

id                  0
id_order            0
product_id          0
product_quantity    0
sku                 0
unit_price          0
date                0
dtype: int64

## Datatypes & Format

In [4]:
orderlines.dtypes

id                   int64
id_order             int64
product_id           int64
product_quantity     int64
sku                 object
unit_price          object
date                object
dtype: object

In [5]:
# Parse timestamps, treat order ids as labels, and drop the unused id columns
orderlines['date'] = pd.to_datetime(orderlines['date'])
orderlines['id_order'] = orderlines['id_order'].astype(str)
orderlines = orderlines.drop(columns=['id', 'product_id'])

In [6]:
orderlines['unit_price']

0          18.99
1         399.00
2         474.05
3          68.39
4          23.74
           ...  
293978     42.99
293979    141.58
293980      9.99
293981     19.99
293982     13.99
Name: unit_price, Length: 293983, dtype: object

In [7]:
print(orderlines.unit_price.str.contains(r"\d+\.\d+\.\d+"))

0         False
1         False
2         False
3         False
4         False
          ...  
293978    False
293979    False
293980    False
293981    False
293982    False
Name: unit_price, Length: 293983, dtype: bool


Show the rows whose price contains two dots.

In [8]:
orderlines[orderlines['unit_price'].str.contains(r"\d+\.\d+\.\d+", na=False)]

        id_order  product_quantity      sku  unit_price                 date
6         299544                 1  APP1582    1.137.99  2017-01-01 01:17:21
11        299549                 1  PAC0929    2.565.99  2017-01-01 02:07:42
15        299553                 1  APP1854    3.278.99  2017-01-01 02:14:47
43        299582                 1  PAC0961    2.616.99  2017-01-01 08:54:00
59        299596                 1  PAC1599    2.873.99  2017-01-01 09:53:11
...          ...               ...      ...         ...                  ...
293862    452946                 1  APP2075    2.999.00  2018-03-14 13:03:33
293887    527321                 1  PAC2148    3.497.00  2018-03-14 13:10:15
293889    527324                 1  PAC2117    3.075.00  2018-03-14 13:10:56
293911    527342                 1  APP2492    1.329.00  2018-03-14 13:24:51
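As a cross-check on the regex filter, the number of dots in each price string can be counted directly. This is a minimal sketch on made-up toy prices that mirror the formats seen in the output; the names `prices`, `dot_counts`, and `two_dots` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy prices (invented for illustration) in the two formats seen in the data
prices = pd.Series(["18.99", "1.137.99", "399.00", "2.565.99"])

# Count literal dots per string; two or more suggests a thousands separator
dot_counts = prices.str.count(r"\.")
two_dots = dot_counts >= 2

# How many prices have one dot, how many have two
print(dot_counts.value_counts().sort_index())
# The affected prices themselves
print(prices[two_dots])
```

On the real column, the same count would also reveal strings with zero dots or more than two, which the regex alone does not distinguish.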


After checking the products table and its prices, it is clear that the format used here, n.nnn.nn, stands for a price of nnnn,nn € (or, in US notation, $nnnn.nn): the first dot is a thousands separator and the second is the decimal point.

To use a decimal comma instead: df['price'] = df['price'].str.replace('.', '', regex=False).str[:-2] + ',' + df['price'].str[-2:]

In [9]:
orderlines['unit_price'] = orderlines['unit_price'].str.replace('.', '', regex=False).str[:-2]+'.'+orderlines['unit_price'].str[-2:]

In [10]:
orderlines["unit_price"] = pd.to_numeric(orderlines["unit_price"])

In [11]:
orderlines.dtypes

id_order                    object
product_quantity             int64
sku                         object
unit_price                 float64
date                datetime64[ns]
dtype: object

In [12]:
# orderlines = orderlines[orderlines['unit_price'] >= 0]

In [13]:
# orderlines.loc[orderlines['unit_price']<0]

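Negative prices were meant to be dropped here, but the filter in In [12] is currently commented out, so any such rows remain in the exported data. Remember that many prices are zero as well. Why is that, and should they be dropped?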
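Before deciding what to drop, it may help to count how many rows are affected. This is a minimal sketch on an invented toy frame; the column name `unit_price` matches the notebook, while the values and the names `toy`, `n_negative`, `n_zero`, and `non_negative` are assumptions for illustration.

```python
import pandas as pd

# Toy data (invented): one negative and two zero prices
toy = pd.DataFrame({
    "sku": ["A", "B", "C", "D"],
    "unit_price": [18.99, 0.0, -5.0, 0.0],
})

# Count the suspicious rows before removing anything
n_negative = int((toy["unit_price"] < 0).sum())
n_zero = int((toy["unit_price"] == 0).sum())
print("negative:", n_negative, "zero:", n_zero)

# Keep only non-negative prices; zero prices are retained for now
non_negative = toy[toy["unit_price"] >= 0]
print(non_negative)
```

Keeping the zero prices separate from the negative ones leaves room to investigate them (for example as free items or data errors) before choosing whether to remove them.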

## Export

In [14]:
orderlines.to_csv('/Users/merlesteffen/Documents/GitHub/Eniac/Data-Eniac/Data_Cleaned/orderlines_clean.csv', index=False)
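CSV files do not store dtypes, so the string `id_order` and the datetime `date` columns set above would be inferred again when the file is read back. The sketch below shows one way to restore them with `pd.read_csv`, using an in-memory buffer and invented toy rows instead of the real file path.

```python
import io
import pandas as pd

# Toy frame (invented) with the same dtypes as the cleaned data
toy = pd.DataFrame({
    "id_order": ["299544", "299549"],
    "unit_price": [1137.99, 2565.99],
    "date": pd.to_datetime(["2017-01-01 01:17:21", "2017-01-01 02:07:42"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the intended dtypes explicitly when reading back
restored = pd.read_csv(buffer, dtype={"id_order": str}, parse_dates=["date"])
print(restored.dtypes)
```

Without the `dtype` and `parse_dates` arguments, `id_order` would typically be read as an integer and `date` as plain text.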