### Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Importing Data

In [4]:
path = "C:\\My projects\\pandas\\data\\orderlines.csv"
orderlines = pd.read_csv(path)
orderlines.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [6]:
path = "C:\\My projects\\pandas\\data\\orders.csv"
orders = pd.read_csv(path)
orders.head()

Unnamed: 0,order_id,created_date,total_paid,state
0,241319,2017-01-02 13:35:40,44.99,Cancelled
1,241423,2017-11-06 13:10:02,136.15,Completed
2,242832,2017-12-31 17:40:03,15.76,Completed
3,243330,2017-02-16 10:59:38,84.98,Completed
4,243784,2017-11-24 13:35:19,157.86,Cancelled


In [8]:
path = "C:\\My projects\\pandas\\data\\products.csv"
products = pd.read_csv(path)
products.head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99,1,1364


In [10]:
path = "C:\\My projects\\pandas\\data\\brands.csv"
brands = pd.read_csv(path)
brands.head()

Unnamed: 0,short,long
0,8MO,8Mobility
1,ACM,Acme
2,ADN,Adonit
3,AII,Aiino
4,AKI,Akitio


### Data Exploration

#### 1: orderlines

In [12]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


As you can see, the type of `date` is **object**. I want to change it to **datetime**.

The type of `unit_price` is also **object**. I want to change it to **float**.

First, `date`:

In [14]:
orderlines['date'] = pd.to_datetime(orderlines['date'])

In [16]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                293983 non-null  int64         
 1   id_order          293983 non-null  int64         
 2   product_id        293983 non-null  int64         
 3   product_quantity  293983 non-null  int64         
 4   sku               293983 non-null  object        
 5   unit_price        293983 non-null  object        
 6   date              293983 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 15.7+ MB
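
When the timestamp layout is known, passing an explicit `format` makes the parsing stricter. The following is a minimal sketch on invented values, assuming the `YYYY-MM-DD HH:MM:SS` layout seen in the preview of `orderlines`; the variable names are illustrative only.

```python
import pandas as pd

# Toy values in the same layout as the `date` column (invented for illustration)
raw = pd.Series(["2017-01-01 00:07:19", "2017-01-01 00:19:45", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of values that could not be parsed
print(parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see whether any row did not follow the expected layout.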


Now `unit_price`:

In [18]:
# first solution

#orderlines['unit_price'] = pd.to_numeric(orderlines['unit_price'])

In [20]:
# Second solution

orderlines['unit_price'] = orderlines['unit_price'].astype(float)

ValueError: could not convert string to float: '1.137.99'

As you can see, we have an issue with **string values** (note the ValueError):

` ValueError: could not convert string to float: '1.137.99' `

The value '1.137.99' contains two dots (.), so it cannot be parsed as a float.

The most important question is: how can we handle it?
How many corrupted values like this are there?

Nice challenge! Let's check it :-)

For this purpose I want to use **pattern matching**.

In [18]:
pattern = r'\d+\.\d+\.\d+'
corrupted = orderlines['unit_price'].str.match(pattern)

In [20]:
corrupted.value_counts()

unit_price
False    257814
True      36169
Name: count, dtype: int64
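
To see what this pattern does and does not flag, here is a small sketch on invented strings. `Series.str.match` anchors the pattern at the start of each string, so only values that begin with digits, a dot, digits, a second dot, and more digits are marked.

```python
import pandas as pd

# Invented examples: a normal price, an integer-like value, two with an extra dot
prices = pd.Series(["18.99", "399", "1.137.99", "2.565.99"])

pattern = r"\d+\.\d+\.\d+"
flagged = prices.str.match(pattern)

print(flagged.tolist())          # [False, False, True, True]
print(prices[flagged].tolist())  # ['1.137.99', '2.565.99']
```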

In [22]:
print(f"{corrupted.mean():.2%} of values in the unit_price column are corrupted")

12.30% of values in the unit_price column are corrupted


*Let's look at corrupted data carefully*

In [24]:
orderlines.loc[corrupted , :].head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
6,1119115,299544,0,1,APP1582,1.137.99,2017-01-01 01:17:21
11,1119126,299549,0,1,PAC0929,2.565.99,2017-01-01 02:07:42
15,1119131,299553,0,1,APP1854,3.278.99,2017-01-01 02:14:47
43,1119195,299582,0,1,PAC0961,2.616.99,2017-01-01 08:54:00
59,1119214,299596,0,1,PAC1599,2.873.99,2017-01-01 09:53:11


*I want to merge the orderlines table with the products table. My goal is to identify which products have these types of prices. If I can find a logical pattern in this data, I will consider keeping the corrupted entries with some adjustments.*

In [26]:
corr_check = orderlines.merge(products,
                              on = 'sku',
                              how = 'inner')
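
An inner merge silently drops order lines whose `sku` is missing from the product table, and it multiplies rows if a `sku` appears more than once there. The sketch below uses invented toy tables to show both effects; `validate` is an existing `merge` parameter, but whether the real `products` table passes this check is not known here.

```python
import pandas as pd

# Toy stand-ins for the two tables (all values are invented)
lines = pd.DataFrame({"sku": ["A1", "A1", "B2", "Z9"],
                      "unit_price": ["1.137.99", "18.99", "2.565.99", "5.00"]})
prods = pd.DataFrame({"sku": ["A1", "B2"],
                      "name": ["Apple iMac", "QNAP NAS"]})

# validate="many_to_one" raises MergeError if `sku` is duplicated in `prods`
merged = lines.merge(prods, on="sku", how="inner", validate="many_to_one")

# "Z9" has no match in `prods`, so the inner join drops that order line
print(len(lines), len(merged))  # 4 3
```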

In [28]:
corr = corr_check['unit_price'].str.match(pattern)

In [30]:
corr_check.loc[corr , :].sample(10)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,name,desc,price,promo_price,in_stock,type
111346,1247045,355429,0,1,APP1814,2.029.99,2017-05-13 10:06:08,"Apple MacBook Pro 13 ""with Touch Bar GHz Core ...",New MacBook Pro 13-inch Core i5 Touch Bar to 2...,2199.0,20.365.849,0,2158
118135,1254719,359037,0,1,PAC1593,2.673.99,2017-05-25 12:27:59,"Apple iMac 27 ""Core i5 3.2GHz Retina 5K | 16GB...",Desktop computer iMac 27-inch 3.2GHz Core i5 5...,3409.0,26.739.899,0,"5,74E+15"
237477,1455806,447227,0,1,APP1008,2.348.59,2017-11-27 09:36:07,"Apple MacBook Pro Retina 15 ""i7 28 Ghz | RAM 1...",New MacBook Pro 15-inch Retina screen Core i7 ...,2569.0,23.890.046,0,1282
184251,1367943,407586,0,1,APP2082,1.885.59,2017-10-02 20:08:28,"Apple MacBook Pro 13 ""Core i5 with Touch Bar 3...",New MacBook Pro 13 inch Touch Bar 31 GHz Core ...,2005.59,18.850.046,0,"1,02E+12"
63892,1198629,331980,0,1,APP1879,2.937.32,2017-03-06 00:08:00,"Apple MacBook Pro 15 ""Core i7 Touch Bar 26GHz ...",New MacBook Pro 15-inch Touch Bar to 26GHz Cor...,3179.0,2.875.594,0,2158
91853,1226280,345223,0,1,APP1199,2.189.99,2017-04-11 10:36:35,"Apple iMac 27 ""Core i5 3.2GHz Retina 5K | 8GB ...",IMac desktop computer 27 inch 5K Retina 8GB RA...,2329.0,2.189.005,0,"5,74E+15"
123325,1260407,361695,0,1,PAC0960,2.308.99,2017-06-02 14:21:15,"Apple iMac 27 ""Core i5 3.2GHz Retina 5K | 16GB...",IMac desktop computer 27 inch 5K Retina i5 3.2...,2569.0,24.150.045,0,"5,74E+15"
4955,1124471,301934,0,1,PAC1622,3.433.99,2017-01-03 20:23:03,"Apple iMac 27 ""Core i5 3.3GHz Retina 5K | 8GB ...",Desktop computer iMac 27-inch 3.3GHz Core i5 5...,4369.0,34.999.903,0,"5,74E+15"
5873,1124493,301934,0,1,PAC1612,3.390.99,2017-01-03 20:24:11,"Apple iMac 27 ""Core i5 3.3GHz Retina 5K | 32GB...",Desktop computer iMac 27-inch 3.3GHz Core i5 5...,4789.0,34.399.901,0,"5,74E+15"
68832,1204234,334807,0,1,PAC1055,3.435.99,2017-03-13 21:00:41,"Apple iMac 27 ""Core i7 Retina 5K 4GHz | 32GB |...",IMac desktop computer 27 inch 5K Retina 4GHz i...,3949.0,33.669.896,0,"5,74E+15"


*As you can see, all the samples are related to expensive Apple products. If I simply remove the first dot from the prices, they would make sense. However, let’s not rush to modify the data just yet. Instead, I want to apply another filter that includes "Apple" and check for non-Apple products.*

In [32]:
pattern = r'apple'

apple_filter = corr_check['name'].str.contains(pattern, case = False)

Regex pattern:

apple        : `str.contains` searches anywhere in the string, so no leading or trailing `.*` is needed.

case = False : Makes the match case-insensitive, so both "Apple" and "apple" are found.

Note         : Spaces inside a regex are literal characters, so `Apple | apple` would require a space next to the word. Do not pad `|` with spaces.
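
A small illustration of case-insensitive filtering on product names, using invented names. It also shows one caveat: a plain substring search matches words that merely contain "apple", which a word boundary (`\b`) avoids.

```python
import pandas as pd

# Invented product names for illustration
names = pd.Series(["Apple iMac 27", "QNAP TS-453U-RP",
                   "Case for apple watch", "Pineapple stand"])

# Substring search, ignoring case
loose = names.str.contains("apple", case=False)
print(loose.tolist())   # [True, False, True, True]

# Whole-word search: "Pineapple" no longer matches
strict = names.str.contains(r"\bapple\b", case=False)
print(strict.tolist())  # [True, False, True, False]
```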

In [34]:
corr_check.loc[ corr & ~(apple_filter) , :].sample(10)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,name,desc,price,promo_price,in_stock,type
184337,1368075,407182,0,1,QNA0142,1.090.59,2017-10-02 22:39:30,QNAP TS-453U-RP,4-bay NAS Server for Mac and PC.,1148.29,11.479.899,0,12175397
278703,1509239,468710,0,1,PAC2264,1.383.37,2017-12-21 11:26:51,Synology DS918 + NAS Server | 4GB | 24TB (4x6T...,NAS server of the Plus Series for companies se...,1609.99,13.383.677,0,12175397
124087,1260943,361985,0,1,QNA0170,2.054.99,2017-06-03 17:16:44,QNAP NAS server TVS-682T i3 8GB RAM Thunderbol...,NAS 6 bays 2 2 3 HDMI ports Thunderbolt DDR4 8...,2297.79,22.969.902,0,12175397
116274,1252399,350846,0,1,PAC1417,1.027.99,2017-05-22 13:59:42,Synology DS716 + II Pack | 8GB RAM | WD 12TB N...,Synology DS716 + II with 8GB of RAM memory + 1...,1061.89,9.791.792,0,12175397
144454,1290919,376512,0,1,WAC0030,1.799.99,2017-07-12 17:05:19,Wacom Cintiq 22HD Interactive Pen Display Mac ...,LCD graphic display with 215-inch Full HD and ...,1799.0,15.120.039,0,1405
127857,1265430,364301,0,1,LGE0041-A,1.185.31,2017-06-10 20:26:55,"Open - LG 38UC99-W Monitor 38 ""UHD 4K USB 3.0 ...",Monitor curved 38 inch IPS 4K ultra high defin...,1299.0,1120.99,0,1298
44805,1176864,322482,0,1,LGE0047,1.259.99,2017-02-11 00:51:52,"LG 38UC99-W Monitor 38 ""UHD 4K USB 3.0 USB-C",Monitor curved 38 inch IPS 4K ultra high defin...,1299.0,998.999,1,1296
179038,1359482,403486,0,1,SYN0164,5.389.00,2017-09-22 23:51:56,Synology RackStation RS4017xs + Nas Server,Nas Format rack 16 bays with Intel Xeon proces...,5389.0,5389,0,12175397
210164,1412992,428151,0,1,WAC0030,1.699.99,2017-11-18 11:24:32,Wacom Cintiq 22HD Interactive Pen Display Mac ...,LCD graphic display with 215-inch Full HD and ...,1799.0,15.120.039,0,1405
80448,1215659,340171,0,1,WAC0030,1.429.99,2017-03-28 19:52:07,Wacom Cintiq 22HD Interactive Pen Display Mac ...,LCD graphic display with 215-inch Full HD and ...,1799.0,15.120.039,0,1405


The corrupted prices are concentrated in important product categories (high-value items such as Apple products, NAS servers and monitors).
So:

*I decide to keep them* :)))))

For this purpose I want to use this method:

**x.replace('.', '', n)**: removes the first n occurrences of the dot, counting from the left.

In [36]:
orderlines['unit_price'] = orderlines['unit_price'].apply(lambda x: x.replace('.', '', x.count('.') - 1))

Explanation:


    x.count('.') - 1: Counts the total dots and subtracts 1, so only the last dot is kept.
    x.replace('.', '', n): Removes n occurrences of the dot from the left.
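
The same expression can be tried on a few strings in plain Python to check the edge cases. This is only a sketch; `drop_extra_dots` is a hypothetical helper name, and the inputs are invented.

```python
def drop_extra_dots(value: str) -> str:
    """Remove every dot except the last one (hypothetical helper)."""
    # count - 1 is 0 for a single dot (nothing is replaced) and -1 for no dot
    # (replace all, but there is nothing to replace), so both stay unchanged.
    return value.replace(".", "", value.count(".") - 1)

for raw in ["1.137.99", "18.99", "399", "1.234.567.89"]:
    print(raw, "->", drop_extra_dots(raw))
# 1.137.99 -> 1137.99
# 18.99 -> 18.99
# 399 -> 399
# 1.234.567.89 -> 1234567.89
```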

Now, let's try again

In [38]:
orderlines['unit_price'] = orderlines['unit_price'].astype(float)

In [40]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                293983 non-null  int64         
 1   id_order          293983 non-null  int64         
 2   product_id        293983 non-null  int64         
 3   product_quantity  293983 non-null  int64         
 4   sku               293983 non-null  object        
 5   unit_price        293983 non-null  float64       
 6   date              293983 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 15.7+ MB


*Congratulations to ourselves! It worked!!* :)))

Now I want to check the column names to make sure none of them contains white space.

In [42]:
orderlines.columns

Index(['id', 'id_order', 'product_id', 'product_quantity', 'sku', 'unit_price',
       'date'],
      dtype='object')

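As you can see, there are no null values, but let's check for duplicates.

In [44]:
# incorrect code: count() returns the number of rows, not the number of duplicates
# orderlines.duplicated().count()

# correct code: sum() counts the True values, i.e. the duplicated rows
orderlines.duplicated().sum()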

0
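
The difference between `count()` and `sum()` on the boolean mask is easy to see on a tiny invented frame with one repeated row:

```python
import pandas as pd

# Invented frame: the third row repeats the second one
df = pd.DataFrame({"id": [1, 2, 2, 3], "sku": ["A", "B", "B", "C"]})

mask = df.duplicated()  # True for each row that repeats an earlier row
print(mask.count())     # 4 -> number of rows, not number of duplicates
print(mask.sum())       # 1 -> number of True values, i.e. duplicated rows
```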

Now we can draw a sample from the table as a final check.

In [46]:
orderlines.sample(8)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
53394,1227355,345845,0,1,IFX0055,82.64,2017-04-12 23:42:21
74011,1263410,363261,0,1,OWC0195,474.99,2017-06-07 17:42:18
147453,1407711,426884,0,1,HOC0010,24.99,2017-11-14 19:07:29
217790,1525877,476688,0,1,GTE0140,57.99,2017-12-29 19:17:49
223897,1537754,482122,0,1,APP1216,130.0,2018-01-04 16:55:10
105886,1320521,390678,0,1,SPE0191,39.99,2017-08-19 12:17:04
141380,1396764,421570,0,1,CEL0044,8.09,2017-11-03 16:21:55
197184,1488998,459890,0,1,NTE0062,105.99,2017-12-11 14:31:32


#### 2: Orders

In [48]:
orders.sample(7)

Unnamed: 0,order_id,created_date,total_paid,state
2460,301916,2017-01-03 19:04:52,59.99,Shopping Basket
17917,317397,2017-01-30 12:31:43,22.98,Completed
202643,503129,2018-01-31 18:37:46,1099.0,Place Order
3029,302496,2017-01-04 14:45:04,71.97,Cancelled
68230,367776,2017-06-21 12:13:52,274.77,Pending
40591,340079,2017-03-28 15:27:43,574.99,Cancelled
23068,322548,2017-02-11 11:36:57,25.98,Completed


In [50]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


I want to change the data type of `created_date`.

In [52]:
orders['created_date'] = pd.to_datetime(orders['created_date'])

In [54]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226904 non-null  float64       
 3   state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


*Now, I want to check null values*

In [56]:
orders.isnull().sum()

order_id        0
created_date    0
total_paid      5
state           0
dtype: int64

*I simply want to remove them*

In [58]:
orders.dropna(inplace = True)
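
As an alternative sketch, `dropna` can be limited to the column that actually has missing values and can return a new frame instead of modifying in place. The toy frame below is invented; `reset_index(drop=True)` is optional and only renumbers the rows so the index has no gaps after the drop.

```python
import numpy as np
import pandas as pd

# Invented stand-in for `orders` with one missing `total_paid`
orders_toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "total_paid": [44.99, np.nan, 15.76],
    "state": ["Cancelled", "Completed", "Completed"],
})

# Only rows missing `total_paid` are removed; the original frame is untouched
cleaned = orders_toy.dropna(subset=["total_paid"]).reset_index(drop=True)

print(len(orders_toy), len(cleaned))  # 3 2
```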

In [60]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
Index: 226904 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226904 non-null  int64         
 1   created_date  226904 non-null  datetime64[ns]
 2   total_paid    226904 non-null  float64       
 3   state         226904 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 8.7+ MB


*And finally let's check duplicates*

In [62]:
orders.duplicated().sum()

0

#### 3: products

In [64]:
products.head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99,1,1364


In [74]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19326 entries, 0 to 19325
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          19326 non-null  object
 1   name         19326 non-null  object
 2   desc         19319 non-null  object
 3   price        19280 non-null  object
 4   promo_price  19326 non-null  object
 5   in_stock     19326 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 906.0+ KB


First of all, let's deal with the `NaN` values.

In [123]:
products.isna().sum()

sku             0
name            0
desc            7
price          46
promo_price     0
in_stock        0
dtype: int64

*I drop them:*

In [126]:
products.dropna(inplace = True)

In [128]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19273 entries, 0 to 19325
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          19273 non-null  object
 1   name         19273 non-null  object
 2   desc         19273 non-null  object
 3   price        19273 non-null  object
 4   promo_price  19273 non-null  object
 5   in_stock     19273 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 1.0+ MB


**Next problem**:

*As you can see, we have a problem with these three columns:*

`price`, `promo_price`, and `type`.

*Their data types need to be changed.*

*But do we really need the type column?*
*Each product is assigned a number, but is it necessary for our project?*
*My answer is no! So, I simply want to remove this column and focus on fixing the data types of the other two columns.*

*Let’s do it!*

1: dropping the `type` column

In [72]:
products.drop(columns = 'type' , inplace = True)

2: changing the data type of the `price` column

In [132]:
products['price'] = products['price'].astype(float)

ValueError: could not convert string to float: '1.639.792'

*Oh! This is the same problem we had before.*

*Let’s check if promo_price has the same issue!*

In [134]:
products['promo_price'] = products['promo_price'].astype(float)

ValueError: could not convert string to float: '1.119.976'

*Exactly! Let’s fix them the same way we did before.*

In [138]:
products['price'] = products['price'].apply(lambda x: x.replace('.', '', x.count('.') -1))

In [144]:
products['price'] = products['price'].astype(float)

In [140]:
products['promo_price'] = products['promo_price'].apply(lambda x: x.replace('.', '', x.count('.') -1))

In [146]:
products['promo_price'] = products['promo_price'].astype(float)
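
Since the same two steps are repeated for each column, they could be wrapped in one function. This is a sketch under assumptions: `to_float_keep_last_dot` is a hypothetical helper, the toy values are taken from the error messages and preview above, and `errors="coerce"` (which yields `NaN` for anything still not numeric) is a choice made here, not something the notebook uses.

```python
import pandas as pd

def to_float_keep_last_dot(col: pd.Series) -> pd.Series:
    """Hypothetical helper: strip all dots but the last, then convert to float."""
    cleaned = col.astype(str).apply(lambda x: x.replace(".", "", x.count(".") - 1))
    # errors="coerce" yields NaN for values that are still not numeric
    return pd.to_numeric(cleaned, errors="coerce")

# Toy frame with the same kind of strings as `price` and `promo_price`
toy = pd.DataFrame({"price": ["59.99", "1.639.792"],
                    "promo_price": ["499.899", "1.119.976"]})

for name in ["price", "promo_price"]:
    toy[name] = to_float_keep_last_dot(toy[name])

print(toy.dtypes)
```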

In [148]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19273 entries, 0 to 19325
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sku          19273 non-null  object 
 1   name         19273 non-null  object 
 2   desc         19273 non-null  object 
 3   price        19273 non-null  float64
 4   promo_price  19273 non-null  float64
 5   in_stock     19273 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 1.0+ MB


*It worked. So let's check the last table*

#### 4: brands

In [157]:
brands.head()

Unnamed: 0,short,long
0,8MO,8Mobility
1,ACM,Acme
2,ADN,Adonit
3,AII,Aiino
4,AKI,Akitio


In [159]:
brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   short   187 non-null    object
 1   long    187 non-null    object
dtypes: object(2)
memory usage: 3.1+ KB


**That's OK. Now I want to save the cleaned data.**

In [162]:
orders_cleaned = orders.copy()
orderlines_cleaned = orderlines.copy()
products_cleaned = products.copy()
brands_cleaned = brands.copy()

In [170]:
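# index = False keeps the row index out of the files; the .csv extension makes them easy to reopen
orders_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\orders_cleaned.csv', index = False)
orderlines_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\orderlines_cleaned.csv', index = False)
products_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\products_cleaned.csv', index = False)
brands_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\brands_cleaned.csv', index = False)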
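
CSV files do not store dtypes, so the datetime conversion done earlier has to be repeated when the cleaned files are loaded again. A minimal round-trip sketch, using an in-memory buffer in place of a file on disk and a two-row invented frame:

```python
import io
import pandas as pd

# Invented two-row stand-in for the cleaned `orders` table
df = pd.DataFrame({
    "order_id": [241319, 241423],
    "created_date": pd.to_datetime(["2017-01-02 13:35:40", "2017-11-06 13:10:02"]),
    "total_paid": [44.99, 136.15],
})

buffer = io.StringIO()          # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column on reload
buffer.seek(0)

# parse_dates restores the datetime type that plain CSV text cannot carry
restored = pd.read_csv(buffer, parse_dates=["created_date"])
print(restored.dtypes)
```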