### Importing Libraries

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Importing Data

In [28]:
path = "C:\\My projects\\pandas\\data\\orderlines.csv"
orderlines = pd.read_csv(path)
orderlines.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [30]:
path = "C:\\My projects\\pandas\\data\\orders.csv"
orders = pd.read_csv(path)
orders.head()

Unnamed: 0,order_id,created_date,total_paid,state
0,241319,2017-01-02 13:35:40,44.99,Cancelled
1,241423,2017-11-06 13:10:02,136.15,Completed
2,242832,2017-12-31 17:40:03,15.76,Completed
3,243330,2017-02-16 10:59:38,84.98,Completed
4,243784,2017-11-24 13:35:19,157.86,Cancelled


In [32]:
path = "C:\\My projects\\pandas\\data\\products.csv"
products = pd.read_csv(path)
products.head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99,1,1364


In [34]:
path = "C:\\My projects\\pandas\\data\\brands.csv"
brands = pd.read_csv(path)
brands.head()

Unnamed: 0,short,long
0,8MO,8Mobility
1,ACM,Acme
2,ADN,Adonit
3,AII,Aiino
4,AKI,Akitio


### Data Exploration

#### 1: orderlines

In [36]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


As you can see, the type of `date` is **object**. I want to change it to **datetime**.

The type of `unit_price` is also **object**. I want to change it to **float**.

First, `date`:

In [38]:
orderlines['date'] = pd.to_datetime(orderlines['date'])
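If a column might contain malformed timestamps, a stricter variant of this conversion can surface them instead of raising an error. The following is a minimal sketch on made-up values (not taken from `orderlines.csv`); the format string is an assumption based on how the dates look in the preview above.

```python
import pandas as pd

# Toy values in the same style as the `date` column; the last one is deliberately invalid.
raw_dates = pd.Series(["2017-01-01 00:07:19", "2017-01-01 00:19:45", "not a date"])

# An explicit format avoids guessing per value; errors="coerce" turns bad values into NaT.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count how many values could not be parsed.
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```

Counting the `NaT` values afterwards shows how many rows would need a closer look.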

In [40]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                293983 non-null  int64         
 1   id_order          293983 non-null  int64         
 2   product_id        293983 non-null  int64         
 3   product_quantity  293983 non-null  int64         
 4   sku               293983 non-null  object        
 5   unit_price        293983 non-null  object        
 6   date              293983 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 15.7+ MB


Now `unit_price`:

In [42]:
# first solution

#orderlines['unit_price'] = pd.to_numeric(orderlines['unit_price'])

In [44]:
# Second solution

orderlines['unit_price'] = orderlines['unit_price'].astype(float)

ValueError: could not convert string to float: '1.137.99'

As you can see, some **string values** cannot be converted (note the ValueError):

`ValueError: could not convert string to float: '1.137.99'`

The value '1.137.99' contains two dots (.), so it cannot be parsed as a float.
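One way to see which strings fail to parse, without stopping at the first error, is `pd.to_numeric` with `errors="coerce"`. This is only a sketch on a few made-up strings in the same style as `unit_price`, not the notebook's actual data.

```python
import pandas as pd

# Toy price strings; two of them contain an extra dot.
prices = pd.Series(["18.99", "399.0", "1.137.99", "2.565.99"])

# Unparseable strings become NaN instead of raising ValueError.
as_number = pd.to_numeric(prices, errors="coerce")

# Select the original strings that could not be converted.
unparseable = prices[as_number.isna()]
print(unparseable.tolist())
```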

The key questions are: how should we handle these values, and how many corrupted values like this are there?

Nice challenge! Let's check it :-)

For this purpose I want to use **regex pattern matching**:

In [46]:
pattern = r'\d+\.\d+\.\d+'
corrupted = orderlines['unit_price'].str.match(pattern)

In [48]:
corrupted.value_counts()

unit_price
False    257814
True      36169
Name: count, dtype: int64

In [50]:
print(f" {round((36169 * 100 / 293983), 2)}% of values in unit_price column are corrupted")

 12.3% of values in unit_price column are corrupted
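The share of corrupted values is the number of `True` entries divided by the total number of rows, which is exactly the mean of the boolean mask. A minimal sketch on made-up strings shows how to compute it without hard-coding the counts:

```python
import pandas as pd

# Toy price strings: 2 of the 5 values have two dots.
prices = pd.Series(["18.99", "1.137.99", "399.0", "2.565.99", "23.74"])

# Same pattern as above: digits, dot, digits, dot, digits.
mask = prices.str.match(r"\d+\.\d+\.\d+")

# The mean of a boolean Series is the fraction of True values.
share = mask.mean() * 100
print(f"{round(share, 2)}% of values are corrupted")
```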


*Let's look at corrupted data carefully*

In [52]:
orderlines.loc[corrupted , :].head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
6,1119115,299544,0,1,APP1582,1.137.99,2017-01-01 01:17:21
11,1119126,299549,0,1,PAC0929,2.565.99,2017-01-01 02:07:42
15,1119131,299553,0,1,APP1854,3.278.99,2017-01-01 02:14:47
43,1119195,299582,0,1,PAC0961,2.616.99,2017-01-01 08:54:00
59,1119214,299596,0,1,PAC1599,2.873.99,2017-01-01 09:53:11


*I want to merge the orderlines table with the products table. My goal is to identify which products have these types of prices. If I can find a logical pattern in this data, I will consider keeping the corrupted entries with some adjustments.*

In [54]:
corr_check = orderlines.merge(products,
                              on = 'sku',
                              how = 'inner')
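An inner merge keeps only the SKUs found in both tables, and it repeats an order line once for every matching row in the other table. The sampled row labels further down (for example 372220) are larger than the 293,983 order lines, which may be a sign of this. The sketch below uses made-up frames to show the effect; `validate="many_to_one"` is an optional check that raises if the right-hand keys are not unique.

```python
import pandas as pd

# Toy order lines and a toy catalog in which SKU "A1" appears twice.
lines = pd.DataFrame({"sku": ["A1", "B2", "C3"],
                      "unit_price": ["1.137.99", "18.99", "5.00"]})
catalog = pd.DataFrame({"sku": ["A1", "A1", "B2"],
                        "name": ["Apple iMac", "Apple iMac", "LG Monitor"]})

# "A1" is matched twice and "C3" has no match, so the result has 3 rows.
merged = lines.merge(catalog, on="sku", how="inner")

# De-duplicating the catalog by SKU first keeps one row per order line.
deduped = lines.merge(catalog.drop_duplicates(subset="sku"),
                      on="sku", how="inner", validate="many_to_one")
print(len(merged), len(deduped))
```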

In [56]:
corr = corr_check['unit_price'].str.match(pattern)

In [58]:
corr_check.loc[corr , :].sample(10)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,name,desc,price,promo_price,in_stock,type
372220,1643641,524575,0,1,APP2293,2.443.00,2018-03-10 08:41:24,"Apple MacBook Pro 13 ""Core i7 Touch Bar 35GHz ...",New MacBook Pro 13 inch Touch Bar 35 GHz Core ...,2599.0,24.430.045,0,"1,02E+12"
296849,1539597,482784,0,1,PAC0960,1.995.59,2018-01-05 09:51:07,"Apple iMac 27 ""Core i5 3.2GHz Retina 5K | 16GB...",IMac desktop computer 27 inch 5K Retina i5 3.2...,2569.0,24.150.045,0,"5,74E+15"
11978,1136626,305499,0,1,PAC1047,3.340.99,2017-01-09 10:10:57,"Apple iMac 27 ""Core i7 Retina 5K 4GHz | 32GB |...",IMac desktop computer 27 inch 5K Retina i7 4GH...,3889.0,33.099.901,0,"5,74E+15"
121411,1258505,360767,0,1,PAC1057,3.273.99,2017-05-30 23:36:56,"Apple iMac 27 ""Core i7 Retina 5K 4GHz | 16GB |...",IMac desktop computer 27 inch 5K Retina 4GHz i...,3589.0,32.739.902,0,"5,74E+15"
127401,1264832,363979,0,1,PAC1038,2.314.05,2017-06-09 15:50:10,"Apple iMac 27 ""Core i5 3.3GHz Retina 5K | 32GB...",IMac desktop computer 27 inch 5K Retina i5 3.3...,3469.0,2800,0,"5,74E+15"
155305,1309287,385201,0,1,QNA0197,1.510.99,2017-08-01 21:32:47,QNAP TS-1635 Server Nas,NAS system with 16 bays 2.5-inch and 3.5-inch ...,1511.29,15.109.899,0,12175397
253462,1476242,455089,0,1,APP2264,2.713.59,2017-12-04 16:00:28,"Apple Macbook Pro 13 ""Core i7 Touch Bar 35GHz ...",New MacBook Pro 13 inch Touch Bar 35 GHz Core ...,2849.0,27.070.047,1,"1,02E+12"
83773,1218575,341568,0,1,APP1581,1.016.99,2017-04-01 12:09:47,"Apple MacBook Air 13 ""i5 16 Ghz | 8GB RAM | 12...",laptop MacBook Air 13 inch i5 16GHz 8GB RAM 12...,1099.0,9.855.813,0,1282
87901,1222902,343608,0,1,APP1200,2.489.99,2017-04-05 13:53:51,"Apple iMac 27 ""Core i5 3.3GHz Retina 5K | 8GB ...",IMac desktop computer 27 inch 5K Retina i5 33G...,2629.0,22.055.844,0,"5,74E+15"
118633,1255266,359275,0,2,APP1812,1.829.99,2017-05-26 01:23:45,"Apple MacBook Pro 13 ""with Touch Bar GHz Core ...",New MacBook Pro 13-inch Core i5 Touch Bar to 2...,1999.0,17.855.946,0,2158


*As you can see, most of the samples are expensive Apple products (one is a QNAP server). If I simply remove the first dot from the prices, they make sense. However, let’s not rush to modify the data just yet. Instead, I want to build a filter on the name "Apple" and check the corrupted prices of non-Apple products.*

In [60]:
apple_filter = corr_check['name'].str.contains('apple', case=False)

Filter explanation:

`str.contains('apple')` : Matches any name containing "apple" anywhere in the string, so no leading or trailing `.*` is needed.

`case=False`          : Makes the match case-insensitive, so "Apple" and "apple" are both matched without an `|` (OR) alternative.

`^` and `$`           : Not needed here; anchors are only required when the pattern must match the entire string.
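A name filter like this is easy to sanity-check on a handful of product names before applying it to the merged table. The names below are made up for illustration; the sketch uses `str.contains` with `case=False` so that both "Apple" and "apple" are caught.

```python
import pandas as pd

# Toy product names, two of which mention Apple in different cases.
names = pd.Series(["Apple iMac 27", "Mighty Mouse apple Mouse",
                   "LG Monitor 31", "QNAP NAS server"])

# Case-insensitive substring search.
has_apple = names.str.contains("apple", case=False)

# Negating the mask selects the non-Apple products.
non_apple = names[~has_apple]
print(non_apple.tolist())
```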

In [62]:
corr_check.loc[ corr & ~(apple_filter) , :].sample(10)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,name,desc,price,promo_price,in_stock,type
82103,1217321,340942,0,1,LGE0032,1.049.99,2017-03-30 19:59:53,"31MU97Z-B LG Monitor 31 ""4K pivotable Thunderb...",31-inch 4K monitor screen split and multiple p...,1099,7.990.005,1,1296
349883,1615060,512073,0,1,WAC0163,2.039.99,2018-02-15 14:55:32,Wacom Cintiq Interactive Pen & Touch 27QHD Mac...,Professional graphics tablet touch screen with...,2549.9,22.139.975,0,1405
14005,1139807,306770,0,1,LGE0041-A,1.191.42,2017-01-10 20:11:49,"Open - LG 38UC99-W Monitor 38 ""UHD 4K USB 3.0 ...",Monitor curved 38 inch IPS 4K ultra high defin...,1299,1120.99,0,1298
226778,1441261,440951,0,1,PAC2006,2.075.24,2017-11-24 19:16:14,Synology DS1517 + | 8GB RAM | 40TB (5x8TB) Sea...,NAS with 8GB RAM and 40TB for Mac and PC,27.187.885,20.069.629,0,12175397
131479,1270830,366245,0,1,LGE0032,1.005.00,2017-06-19 10:52:20,"31MU97Z-B LG Monitor 31 ""4K pivotable Thunderb...",31-inch 4K monitor screen split and multiple p...,1099,7.990.005,1,1296
377021,1647973,526295,0,1,QNA0170,2.296.99,2018-03-13 09:56:29,QNAP NAS server TVS-682T i3 8GB RAM Thunderbol...,NAS 6 bays 2 2 3 HDMI ports Thunderbolt DDR4 8...,2297.79,22.969.902,0,12175397
356739,1625737,516821,0,1,PAC2418,2.202.37,2018-02-24 14:40:48,DS418play Synology NAS Server | 16GB RAM | 48T...,4-bay NAS server to accommodate 4K Ultra HD files,2677.67,20.863.679,0,12175397
13552,1138913,306422,0,1,WAC0163,2.599.99,2017-01-10 12:15:14,Wacom Cintiq Interactive Pen & Touch 27QHD Mac...,Professional graphics tablet touch screen with...,2549.9,22.139.975,0,1405
46674,1179323,323657,0,1,PAC1433,1.190.34,2017-02-13 17:05:58,Synology DS916 + Pack | 8GB RAM | WD 16TB Network,Synology DS916 + with 8GB of RAM memory + 16TB...,1305.2,11.863.675,0,12175397
135724,1277483,370236,0,1,LGE0041-A,1.259.92,2017-06-29 09:59:33,"Open - LG 38UC99-W Monitor 38 ""UHD 4K USB 3.0 ...",Monitor curved 38 inch IPS 4K ultra high defin...,1299,1120.99,0,1298


The corrupted prices are concentrated in important product categories (for example, high-value items such as Apple products), and they look plausible once the extra dot is removed.
So:

*I decided to keep them* :)

For this purpose I want to use this method:

**x.replace('.', '', n)**: removes the first n occurrences of the dot, counting from the left.

In [64]:
orderlines['unit_price'] = orderlines['unit_price'].apply(lambda x: x.replace('.', '', x.count('.') - 1))

Explanation:

`x.count('.') - 1`: counts the dots and subtracts 1, so every dot except the last one is removed.

For a value with a single dot, n is 0 and the value is left unchanged.
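The dot-removal rule can be tried on a few strings before touching the real column. The helper name `drop_extra_dots` and the sample values are hypothetical; the `max(..., 0)` guard is an addition that makes the no-dot case explicit.

```python
import pandas as pd

def drop_extra_dots(value: str) -> str:
    """Remove every dot except the last one (hypothetical helper)."""
    # max(..., 0) makes a value without any dot an explicit no-op.
    n_to_remove = max(value.count(".") - 1, 0)
    return value.replace(".", "", n_to_remove)

# Toy strings: two dots, one dot, no dot, and a three-decimal case.
raw = pd.Series(["1.137.99", "18.99", "399", "24.430.045"])

cleaned = raw.map(drop_extra_dots)
as_float = cleaned.astype(float)
print(cleaned.tolist())
```

Note that the rule assumes the last dot is always the decimal point, which is the same assumption the notebook makes.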

Now, let's try again

In [66]:
orderlines['unit_price'] = orderlines['unit_price'].astype(float)

In [68]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                293983 non-null  int64         
 1   id_order          293983 non-null  int64         
 2   product_id        293983 non-null  int64         
 3   product_quantity  293983 non-null  int64         
 4   sku               293983 non-null  object        
 5   unit_price        293983 non-null  float64       
 6   date              293983 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 15.7+ MB


*Congratulations to ourselves! It worked!!* :)))

Now I want to check the column names to make sure none of them contains whitespace.

In [70]:
orderlines.columns

Index(['id', 'id_order', 'product_id', 'product_quantity', 'sku', 'unit_price',
       'date'],
      dtype='object')

As the `info()` output shows, there are no null values. Let's check for duplicates:

In [72]:
# incorrect: count() returns the number of rows, not the number of duplicates
# orderlines.duplicated().count()

# correct: sum() counts the True values, i.e. the duplicated rows
orderlines.duplicated().sum()

0
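The difference between `count()` and `sum()` on the duplicate mask is easiest to see on a tiny frame. This is a sketch with made-up rows, not the notebook's data:

```python
import pandas as pd

# Toy frame with one fully duplicated row (the second "2, B").
df = pd.DataFrame({"id": [1, 2, 2, 3], "sku": ["A", "B", "B", "C"]})

# duplicated() marks every repeat of an earlier row as True.
dup_mask = df.duplicated()

n_rows = int(dup_mask.count())  # counts all non-null entries, i.e. every row
n_dups = int(dup_mask.sum())    # counts only the True entries
print(n_rows, n_dups)
```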

Now we can draw a sample from the table as a final check.

In [74]:
orderlines.sample(8)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
8472,1141021,307281,0,1,APP1690,49.99,2017-01-11 13:46:08
62527,1243088,353479,0,1,APP1570,41.99,2017-05-07 16:39:50
226085,1540846,483332,0,1,PAC2078,2492.59,2018-01-06 06:48:55
108840,1325829,393252,0,1,AP20206,29.99,2017-08-28 01:54:47
44254,1210996,337990,0,1,MOS0139,31.49,2017-03-22 22:32:08
139748,1393768,420164,0,1,MUV0198,17.99,2017-10-31 02:54:20
223420,1536985,481774,0,1,SEV0029,19.99,2018-01-04 12:49:18
76081,1266976,353207,0,1,APP2148,602.48,2017-06-13 00:00:54


#### 2: Orders

In [76]:
orders.sample(7)

Unnamed: 0,order_id,created_date,total_paid,state
220457,520948,2018-03-04 18:32:45,152.99,Shopping Basket
17127,316606,2017-01-29 11:45:35,0.0,Shopping Basket
207471,507962,2018-02-08 01:04:44,92.09,Completed
105492,405077,2017-09-26 21:52:42,1267.58,Pending
1406,300808,2017-01-02 19:25:35,64.99,Completed
147900,448015,2017-11-27 12:54:04,1994.98,Shopping Basket
225308,525800,2018-03-12 10:35:52,129.99,Pending


In [78]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


I want to change the data type of `created_date` to **datetime**.

In [80]:
orders['created_date'] = pd.to_datetime(orders['created_date'])

In [82]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226904 non-null  float64       
 3   state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


*Now, I want to check null values*

In [84]:
orders.isnull().sum()

order_id        0
created_date    0
total_paid      5
state           0
dtype: int64

*Only 5 rows are affected, so I simply remove them*

In [86]:
orders.dropna(inplace = True)
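Dropping rows leaves gaps in the row labels, which is why the next `info()` output reports an `Index` instead of a `RangeIndex`. If contiguous labels are preferred, the index can be reset after dropping. The following sketch uses a made-up frame; restricting `dropna` to `total_paid` via `subset` is an optional variation, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy orders with two missing payments.
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [44.99, np.nan, 15.76, np.nan],
    "state": ["Completed", "Pending", "Completed", "Cancelled"],
})

# Drop rows where total_paid is missing, then renumber the rows from 0.
cleaned = toy_orders.dropna(subset=["total_paid"]).reset_index(drop=True)
print(cleaned.index.tolist())
```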

In [88]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
Index: 226904 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226904 non-null  int64         
 1   created_date  226904 non-null  datetime64[ns]
 2   total_paid    226904 non-null  float64       
 3   state         226904 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 8.7+ MB


*And finally let's check duplicates*

In [90]:
orders.duplicated().sum()

0

#### 3: products

In [92]:
products.head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99,1,1364


In [94]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19326 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          19326 non-null  object
 1   name         19326 non-null  object
 2   desc         19319 non-null  object
 3   price        19280 non-null  object
 4   promo_price  19326 non-null  object
 5   in_stock     19326 non-null  int64 
 6   type         19276 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.0+ MB


First of all, let's handle the `NaN` values

In [96]:
products.isna().sum()

sku             0
name            0
desc            7
price          46
promo_price     0
in_stock        0
type           50
dtype: int64

*I drop them:*

In [98]:
products.dropna(inplace = True)

In [100]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19223 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          19223 non-null  object
 1   name         19223 non-null  object
 2   desc         19223 non-null  object
 3   price        19223 non-null  object
 4   promo_price  19223 non-null  object
 5   in_stock     19223 non-null  int64 
 6   type         19223 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.2+ MB


**Next problem**:

*As you can see, we have a problem with these three columns:*

`price`, `promo_price`, and `type`.

*Their data types need to be changed.*

*But do we really need the type column?*
*Each product is assigned a number, but is it necessary for our project?*
*My answer is no! So, I simply want to remove this column and focus on fixing the data types of the other two columns.*

*Let’s do it!*

1: dropping the `type` column

In [102]:
products.drop(columns = 'type' , inplace = True)

2: changing the data type of the `price` column

In [104]:
products['price'] = products['price'].astype(float)

ValueError: could not convert string to float: '1.639.792'

*Oh! This is the same problem we had before.*

*Let’s check if promo_price has the same issue!*

In [106]:
products['promo_price'] = products['promo_price'].astype(float)

ValueError: could not convert string to float: '1.119.976'

*Exactly! Let’s fix them the same way we did before.*

In [108]:
products['price'] = products['price'].apply(lambda x: x.replace('.', '', x.count('.') -1))

In [110]:
products['price'] = products['price'].astype(float)

In [112]:
products['promo_price'] = products['promo_price'].apply(lambda x: x.replace('.', '', x.count('.') -1))

In [114]:
products['promo_price'] = products['promo_price'].astype(float)

In [116]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19223 entries, 0 to 19325
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sku          19223 non-null  object 
 1   name         19223 non-null  object 
 2   desc         19223 non-null  object 
 3   price        19223 non-null  float64
 4   promo_price  19223 non-null  float64
 5   in_stock     19223 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 1.0+ MB


*Let's check duplicates*

In [118]:
products.duplicated().sum()

8746

*Wow! I want to remove duplicates*

In [120]:
products.drop_duplicates(inplace=True)
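`drop_duplicates()` without arguments removes only rows that are identical in every column, so the same `sku` can still appear more than once with different values. A possible follow-up check, sketched here on made-up rows, is to look for SKUs that remain duplicated:

```python
import pandas as pd

# Toy catalog: "A1" is an exact duplicate, "B2" appears with two different prices.
toy_products = pd.DataFrame({
    "sku": ["A1", "A1", "B2", "B2"],
    "name": ["Stand", "Stand", "Mouse", "Mouse"],
    "price": [10.0, 10.0, 20.0, 25.0],
})

# Removes only the fully identical row.
exact = toy_products.drop_duplicates()

# keep=False marks every row whose sku occurs more than once.
sku_conflicts = exact[exact.duplicated(subset="sku", keep=False)]
print(len(exact), sku_conflicts["sku"].tolist())
```

Whether such conflicting rows exist in the real `products` table is not shown in this notebook.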

In [122]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10477 entries, 0 to 19325
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sku          10477 non-null  object 
 1   name         10477 non-null  object 
 2   desc         10477 non-null  object 
 3   price        10477 non-null  float64
 4   promo_price  10477 non-null  float64
 5   in_stock     10477 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 573.0+ KB


*So let's check the last table*

#### 4: brands

In [124]:
brands.head()

Unnamed: 0,short,long
0,8MO,8Mobility
1,ACM,Acme
2,ADN,Adonit
3,AII,Aiino
4,AKI,Akitio


In [126]:
brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   short   187 non-null    object
 1   long    187 non-null    object
dtypes: object(2)
memory usage: 3.1+ KB


**That's OK. Now I want to save the cleaned data**

In [128]:
orders_cleaned = orders.copy()
orderlines_cleaned = orderlines.copy()
products_cleaned = products.copy()
brands_cleaned = brands.copy()

In [170]:
orders_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\orders_cleaned.csv', index=False)
orderlines_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\orderlines_cleaned.csv', index=False)
products_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\products_cleaned.csv', index=False)
brands_cleaned.to_csv('C:\\My projects\\pandas\\cleaned_data\\brands_cleaned.csv', index=False)
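When a frame is written with its index, reading the file back produces an extra unnamed column. A small round trip shows how `index=False` avoids that. This sketch uses a temporary directory and a made-up frame, so the file name and columns are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for one of the cleaned tables.
frame = pd.DataFrame({"sku": ["A1", "B2"], "price": [10.0, 20.5]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "products_cleaned.csv"
    # index=False keeps the row labels out of the file.
    frame.to_csv(target, index=False)
    reloaded = pd.read_csv(target)

print(reloaded.columns.tolist())
```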