### __Jr Data Analysis Tasks 01__

#### 📁 __Data Ingestion & Preprocessing__

##### ✅ _CSV Import with Custom Header and Separator_
        
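- Load the CSV file, which uses ';' as the separator and ',' as the decimal mark.

- Inspect the resulting DataFrame: dtypes, a random sample, and NaN counts per column.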

In [1]:
import pandas as pd

df_csv = pd.read_csv('DataSets/dummy_data.csv', sep=';', header='infer', decimal=',')

df_csv.info()
print()
print('Sample: \n\n', df_csv.sample(10, random_state=333))
print()
print('Counts NaN: \n\n', df_csv.isna().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212 entries, 0 to 211
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Region            209 non-null    object
 1   Product           209 non-null    object
 2   Sales             197 non-null    object
 3   Profit            193 non-null    object
 4   Currency          174 non-null    object
 5   Customer_ID       209 non-null    object
 6   Invoice_ID        209 non-null    object
 7   Description       209 non-null    object
 8   Transaction_Date  209 non-null    object
 9   ExcesiveNA        12 non-null     object
dtypes: object(10)
memory usage: 16.7+ KB

Sample: 

     Region  Product    Sales   Profit Currency Customer_ID Invoice_ID  \
32    East   Laptop  2089,77   113,74      EUR     Raymond    INV1858   
64    West   Tablet  3619,89   1159,9      USD     Raymond    INV5505   
192   West   Laptop  4651,55  1407,77        -   Elizabeth    INV4770 

In [2]:
import pandas as pd

df_xls = pd.read_excel('DataSets/dummy_data.xlsx', sheet_name='Sales_Data')

df_xls.info()
print()
print('Sample: \n\n', df_xls.sample(10, random_state=333))
print()
print('Counts NaN: \n\n', df_xls.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212 entries, 0 to 211
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Region            209 non-null    object        
 1   Product           209 non-null    object        
 2   Sales             197 non-null    float64       
 3   Profit            193 non-null    float64       
 4   Currency          174 non-null    object        
 5   Customer_ID       209 non-null    object        
 6   Invoice_ID        209 non-null    object        
 7   Description       209 non-null    object        
 8   Transaction_Date  209 non-null    datetime64[ns]
 9   ExcesiveNA        10 non-null     object        
dtypes: datetime64[ns](1), float64(2), object(7)
memory usage: 16.7+ KB

Sample: 

     Region  Product    Sales   Profit Currency Customer_ID Invoice_ID  \
32    East   Laptop  2089.77   113.74      EUR     Raymond    INV1858   
64    West   Tabl

##### __Note__

If you want date columns from CSV to behave like those from Excel, do this when reading:

_df = pd.read_csv("file.csv", parse_dates=["your_date_column"])_

That way, Pandas will convert the date strings into datetime64[ns] objects, and they’ll behave just like the Excel ones.
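
As a minimal sketch of the note above, the snippet below reads the same text twice, with and without parse_dates. The two-row CSV and the name `raw` are made up for illustration; only the column name Transaction_Date comes from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row CSV that mimics the file's layout (semicolon separator).
raw = "Invoice_ID;Transaction_Date\nINV0001;2024-09-04\nINV0002;2025-01-06\n"

# Without parse_dates the dates stay as plain text.
plain = pd.read_csv(io.StringIO(raw), sep=";")

# With parse_dates the column is converted to a datetime dtype while reading.
parsed = pd.read_csv(io.StringIO(raw), sep=";", parse_dates=["Transaction_Date"])

print(plain["Transaction_Date"].dtype)
print(parsed["Transaction_Date"].dtype)
print(parsed["Transaction_Date"].dt.month.tolist())
```

Once the column is a datetime, the .dt accessor (month, year, to_period, and so on) becomes available, which is what the Excel version gets for free.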

##### ✅ _Handling Missing Values_

- Read an Excel file where missing values are represented as "N/A" or "-".

- Use keep_default_na=False to interpret these manually, then use isna() to count nulls per column.

In [3]:
import pandas as pd

df_csv = pd.read_csv('DataSets/dummy_data.csv', sep=';', header='infer', decimal=',', keep_default_na=False)

df_csv.info()
print()
print('Sample: \n\n', df_csv.sample(10, random_state=333))
print()
print('Counts NaN: \n\n', df_csv.isna().sum())
print()
print("Region: ", df_csv['Region'].unique(), df_csv['Region'].nunique())
print("Product: ", df_csv['Product'].unique(), df_csv['Product'].nunique())
print("Sales: ", df_csv['Sales'].unique(), df_csv['Sales'].nunique())
print("Profit: ", df_csv['Profit'].unique(), df_csv['Profit'].nunique())
print("Currency: ", df_csv['Currency'].unique(), df_csv['Currency'].nunique())
print("Customer_ID: ", df_csv['Customer_ID'].unique(), df_csv['Customer_ID'].nunique())
print("Invoice_ID: ", df_csv['Invoice_ID'].unique(), df_csv['Invoice_ID'].nunique())
print("Description: ", df_csv['Description'].unique(), df_csv['Description'].nunique())
print("Transaction_Date : ", df_csv['Transaction_Date'].unique(), df_csv['Transaction_Date'].nunique())
print("\n\n")
df_csv.replace(["N/A", "-", ""], pd.NA, inplace=True)

df_csv['Sales'] = pd.to_numeric(df_csv['Sales'].str.replace(',', '.', regex=False), errors='coerce')
df_csv['Profit'] = pd.to_numeric(df_csv['Profit'].str.replace(',', '.', regex=False), errors='coerce')

print()
print(df_csv.dtypes)
print()
print('Counts NaN: \n\n', df_csv.isna().sum())
print()
print("Region: ", df_csv['Region'].unique(), df_csv['Region'].nunique())
print("Product: ", df_csv['Product'].unique(), df_csv['Product'].nunique())
print("Sales: ", df_csv['Sales'].unique(), df_csv['Sales'].nunique())
print("Profit: ", df_csv['Profit'].unique(), df_csv['Profit'].nunique())
print("Currency: ", df_csv['Currency'].unique(), df_csv['Currency'].nunique())
print("Customer_ID: ", df_csv['Customer_ID'].unique(), df_csv['Customer_ID'].nunique())
print("Invoice_ID: ", df_csv['Invoice_ID'].unique(), df_csv['Invoice_ID'].nunique())
print("Description: ", df_csv['Description'].unique(), df_csv['Description'].nunique())
print("Transaction_Date : ", df_csv['Transaction_Date'].unique(), df_csv['Transaction_Date'].nunique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212 entries, 0 to 211
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Region            212 non-null    object
 1   Product           212 non-null    object
 2   Sales             212 non-null    object
 3   Profit            212 non-null    object
 4   Currency          212 non-null    object
 5   Customer_ID       212 non-null    object
 6   Invoice_ID        212 non-null    object
 7   Description       212 non-null    object
 8   Transaction_Date  212 non-null    object
 9   ExcesiveNA        212 non-null    object
dtypes: object(10)
memory usage: 16.7+ KB

Sample: 

     Region  Product    Sales   Profit Currency Customer_ID Invoice_ID  \
32    East   Laptop  2089,77   113,74      EUR     Raymond    INV1858   
64    West   Tablet  3619,89   1159,9      USD     Raymond    INV5505   
192   West   Laptop  4651,55  1407,77        -   Elizabeth    INV4770 

In [4]:
import pandas as pd

df_xls = pd.read_excel('DataSets/dummy_data.xlsx', sheet_name='Sales_Data', keep_default_na=False)

df_xls.info()
print()
print('Sample: \n\n', df_xls.sample(10, random_state=333))
print()
print('Counts NaN: \n\n', df_xls.isna().sum())
print()
print("Region: ", df_xls['Region'].unique(), df_xls['Region'].nunique())
print("Product: ", df_xls['Product'].unique(), df_xls['Product'].nunique())
print("Sales: ", df_xls['Sales'].unique(), df_xls['Sales'].nunique())
print("Profit: ", df_xls['Profit'].unique(), df_xls['Profit'].nunique())
print("Currency: ", df_xls['Currency'].unique(), df_xls['Currency'].nunique())
print("Customer_ID: ", df_xls['Customer_ID'].unique(), df_xls['Customer_ID'].nunique())
print("Invoice_ID: ", df_xls['Invoice_ID'].unique(), df_xls['Invoice_ID'].nunique())
print("Description: ", df_xls['Description'].unique(), df_xls['Description'].nunique())
print("Transaction_Date : ", df_xls['Transaction_Date'].unique(), df_xls['Transaction_Date'].nunique())
print("\n\n")
df_xls.replace(["N/A", "-", ""], pd.NA, inplace=True)

df_xls["Sales"] = pd.to_numeric(df_xls["Sales"], errors="coerce")
df_xls["Profit"] = pd.to_numeric(df_xls["Profit"], errors="coerce")

print()
print(df_xls.dtypes)
print()
print('Counts NaN: \n\n', df_xls.isna().sum())
print()
print("Region: ", df_xls['Region'].unique(), df_xls['Region'].nunique())
print("Product: ", df_xls['Product'].unique(), df_xls['Product'].nunique())
print("Sales: ", df_xls['Sales'].unique(), df_xls['Sales'].nunique())
print("Profit: ", df_xls['Profit'].unique(), df_xls['Profit'].nunique())
print("Currency: ", df_xls['Currency'].unique(), df_xls['Currency'].nunique())
print("Customer_ID: ", df_xls['Customer_ID'].unique(), df_xls['Customer_ID'].nunique())
print("Invoice_ID: ", df_xls['Invoice_ID'].unique(), df_xls['Invoice_ID'].nunique())
print("Description: ", df_xls['Description'].unique(), df_xls['Description'].nunique())
print("Transaction_Date : ", df_xls['Transaction_Date'].unique(), df_xls['Transaction_Date'].nunique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212 entries, 0 to 211
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Region            212 non-null    object
 1   Product           212 non-null    object
 2   Sales             212 non-null    object
 3   Profit            212 non-null    object
 4   Currency          212 non-null    object
 5   Customer_ID       212 non-null    object
 6   Invoice_ID        212 non-null    object
 7   Description       212 non-null    object
 8   Transaction_Date  212 non-null    object
 9   ExcesiveNA        212 non-null    object
dtypes: object(10)
memory usage: 16.7+ KB

Sample: 

     Region  Product    Sales   Profit Currency Customer_ID Invoice_ID  \
32    East   Laptop  2089.77   113.74      EUR     Raymond    INV1858   
64    West   Tablet  3619.89   1159.9      USD     Raymond    INV5505   
192   West   Laptop  4651.55  1407.77        -   Elizabeth    INV4770 

##### __Note__

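By default (keep_default_na=True), pandas automatically interprets certain strings as NaN, such as:
"NA", "N/A", "n/a", "#N/A", "NULL", "" (empty string), among others. Note that "-" is not on that list, which is why it still shows up as text in the first sample above.

When you use keep_default_na=False, pandas does NOT convert these strings to NaN automatically. Only the markers you pass through na_values, or replace afterwards, become missing values.

✅ pd.NA

This is the missing-value marker introduced in pandas 1.0 (the documentation still flags it as experimental). It is designed to represent a missing value consistently across the nullable integer, boolean, and string dtypes.

✅ It is a convenient choice when replacing markers in text or mixed columns, because .isna() detects it just as it detects np.nan and None. In plain float64 columns pandas still stores missing values as NaN.

| Feature | np.nan | None | pd.NA |
| --- | --- | --- | --- |
| Works on strings | ❌ (stored as a float) | ✅ | ✅ |
| Works on integers | ❌ (column becomes float64) | ❌ | ✅ (with Int64Dtype) |
| Boolean operations | May fail | May fail | ✅ three-valued logic |
| Unified standard | ❌ | ❌ | ✅ |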
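
The differences listed above can be checked with a short experiment. This is only a sketch on made-up values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# np.nan is a float, so an integer column that contains it is upcast to float64.
with_nan = pd.Series([1, 2, np.nan])

# pd.NA together with the nullable Int64 dtype keeps the integers as integers.
with_na = pd.Series([1, 2, pd.NA], dtype="Int64")

print(with_nan.dtype, with_na.dtype)
print(with_na.isna().sum())

# Three-valued logic: the result is only "unknown" when it really depends on the missing value.
print(pd.NA | True)
print(pd.NA & False)
print(pd.NA & True)
```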

##### _Diagnosis: Sales is of object type, not float64_
This confirms that even though the file was read with:

_df = pd.read_csv("DataSets/dummy_data.csv", sep=';', decimal=',', keep_default_na=False)_

the Sales column was not parsed as float64 but as object, meaning its values are stored as strings like "4922,76", not as numbers like 4922.76.

This usually happens because:

- decimal=',' only takes effect when every value in the column can be parsed as a number.

- With keep_default_na=False, markers such as "N/A" and "" are kept as literal text, and "-" is never treated as missing by default. A single non-numeric string is enough for pandas to keep the whole column as object, commas included.

That is why the cells above first replace the markers with pd.NA and then convert Sales and Profit with pd.to_numeric().
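
One way to confirm this kind of diagnosis is to list the values that refuse to convert. The sketch below uses a small made-up Series in place of the real Sales column.

```python
import pandas as pd

# Hypothetical stand-in for the Sales column as it arrives from the CSV (all text).
sales = pd.Series(["4922,76", "-", "N/A", "710,66"])

# Swap the decimal comma for a point, then coerce anything non-numeric to NaN.
converted = pd.to_numeric(sales.str.replace(",", ".", regex=False), errors="coerce")

# The original strings that became NaN are the ones blocking a numeric dtype.
offenders = sales[converted.isna()].unique().tolist()

print(converted.dtype)
print(offenders)
```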

##### ✅ _Remove Empty Rows_

- Drop rows __where all values are missing__.

In [5]:
print(df_csv.tail(10))
print()
df_csv = df_csv.dropna(how='all')
print(df_csv.tail(10))

    Region     Product    Sales   Profit Currency Customer_ID Invoice_ID  \
202   West  Headphones  3465.60  1860.22      EUR      Travis    INV7105   
203   West      Tablet   710.66      NaN     <NA>     Raymond    INV8147   
204   <NA>        <NA>      NaN      NaN     <NA>        <NA>       <NA>   
205   <NA>        <NA>      NaN      NaN     <NA>        <NA>       <NA>   
206   <NA>        <NA>      NaN      NaN     <NA>        <NA>       <NA>   
207  North      Tablet  1272.30   210.85  Unknown      Dalton    INV7730   
208   East     Monitor  4185.25  1836.55     <NA>      Joshua    INV3076   
209   West      Tablet  1268.87   594.40  Unknown       Heidi    INV5938   
210  North      Tablet  1071.82  1992.36      EUR      Travis    INV1693   
211  North  Smartphone  1311.69   180.62     <NA>      Rachel    INV1897   

              Description     Transaction_Date ExcesiveNA  
202     Item 7636 - smile  2024-09-04 00:00:00  Not empty  
203     Item 8138 - class  2025-01-06 00:00

In [6]:
print(df_xls.tail(10))
print()
df_xls = df_xls.dropna(how='all')
print(df_xls.tail(10))

    Region     Product    Sales   Profit Currency Customer_ID Invoice_ID  \
202   West  Headphones  3465.60  1860.22      EUR      Travis    INV7105   
203   West      Tablet   710.66      NaN     <NA>     Raymond    INV8147   
204   <NA>        <NA>      NaN      NaN     <NA>        <NA>       <NA>   
205   <NA>        <NA>      NaN      NaN     <NA>        <NA>       <NA>   
206   <NA>        <NA>      NaN      NaN     <NA>        <NA>       <NA>   
207  North      Tablet  1272.30   210.85  Unknown      Dalton    INV7730   
208   East     Monitor  4185.25  1836.55     <NA>      Joshua    INV3076   
209   West      Tablet  1268.87   594.40  Unknown       Heidi    INV5938   
210  North      Tablet  1071.82  1992.36      EUR      Travis    INV1693   
211  North  Smartphone  1311.69   180.62     <NA>      Rachel    INV1897   

              Description     Transaction_Date ExcesiveNA  
202     Item 7636 - smile  2024-09-04 00:00:00  Not empty  
203     Item 8138 - class  2025-01-06 00:00

##### ✅ _Fill Missing Sales Values_

- Replace all missing values in the Sales column with the median of existing sales.

In [9]:
df_csv["Sales"] = df_csv["Sales"].fillna(df_csv["Sales"].median())
print("Null values after:", df_csv['Sales'].isna().sum())

Null values after: 0


In [8]:
df_xls["Sales"] = df_xls["Sales"].fillna(df_xls["Sales"].median())
print("Null values after:", df_xls['Sales'].isna().sum())

Null values after: 0


##### ✅ _Drop Columns with Excessive Missing Data_

- Drop any column with more than 60% missing values.

In [10]:
print("Columns before: ", df_csv.dtypes)
df_csv = df_csv.dropna(axis='columns', thresh=(len(df_csv) * 0.4))
print()
print("Columns after: ", df_csv.dtypes)

Columns before:  Region               object
Product              object
Sales               float64
Profit              float64
Currency             object
Customer_ID          object
Invoice_ID           object
Description          object
Transaction_Date     object
ExcesiveNA           object
dtype: object

Columns after:  Region               object
Product              object
Sales               float64
Profit              float64
Currency             object
Customer_ID          object
Invoice_ID           object
Description          object
Transaction_Date     object
dtype: object


In [11]:
print("Columns before: ", df_xls.dtypes)
df_xls = df_xls.dropna(axis='columns', thresh=(len(df_xls) * 0.4))
print()
print("Columns after: ", df_xls.dtypes)

Columns before:  Region               object
Product              object
Sales               float64
Profit              float64
Currency             object
Customer_ID          object
Invoice_ID           object
Description          object
Transaction_Date     object
ExcesiveNA           object
dtype: object

Columns after:  Region               object
Product              object
Sales               float64
Profit              float64
Currency             object
Customer_ID          object
Invoice_ID           object
Description          object
Transaction_Date     object
dtype: object
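
An alternative to thresh that some readers find easier to audit is to compute the share of missing values per column and filter on it directly. The sketch below uses a made-up five-row frame; the column name Mostly_Empty is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sales": [1.0, 2.0, np.nan, 4.0, 5.0],                    # 20% missing
    "Mostly_Empty": [np.nan, np.nan, np.nan, np.nan, "x"],    # 80% missing
})

# isna().mean() gives the fraction of missing values in each column.
missing_share = df.isna().mean()

# Keep only the columns whose missing share is at most 60%.
kept = df.loc[:, missing_share <= 0.60]

print(missing_share)
print(kept.columns.tolist())
```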


##### ✅ _Check and Drop Duplicates_

- Identify and drop fully duplicated rows.

In [12]:
print("DataFrame before deleting explicit duplicate rows: ", df_csv.index)

df_csv = df_csv.drop_duplicates().reset_index(drop=True)

print("DataFrame after deleting explicit duplicate rows: ", df_csv.index)

DataFrame before deleting explicit duplicate rows:  Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       199, 200, 201, 202, 203, 207, 208, 209, 210, 211],
      dtype='int64', length=209)
DataFrame after deleting explicit duplicate rows:  RangeIndex(start=0, stop=204, step=1)


In [13]:
print("DataFrame before deleting explicit duplicate rows: ", df_xls.index)

df_xls = df_xls.drop_duplicates().reset_index(drop=True)

print("DataFrame after deleting explicit duplicate rows: ", df_xls.index)

DataFrame before deleting explicit duplicate rows:  Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       199, 200, 201, 202, 203, 207, 208, 209, 210, 211],
      dtype='int64', length=209)
DataFrame after deleting explicit duplicate rows:  RangeIndex(start=0, stop=204, step=1)


##### ✅ _Drop Partial Duplicates_

- Drop duplicates only considering the columns ['Customer_ID', 'Invoice_ID'].

In [14]:
df_cid_dups = df_csv.duplicated(subset=['Customer_ID', 'Invoice_ID'])
print("Customer_ID & Invoice_ID duplicates found:", df_cid_dups.sum())
print()
df_csv = df_csv.drop_duplicates(subset=["Customer_ID", "Invoice_ID"]).reset_index(drop=True)
print("Customer_ID & Invoice_ID duplicates found:", df_csv.duplicated(subset=['Customer_ID', 'Invoice_ID']).any())


Customer_ID & Invoice_ID duplicates found: 4

Customer_ID & Invoice_ID duplicates found: False


In [15]:
df_cid_dupx = df_xls.duplicated(subset=['Customer_ID', 'Invoice_ID'])
print("Customer_ID & Invoice_ID duplicates found:", df_cid_dupx.sum())
print()
df_xls = df_xls.drop_duplicates(subset=["Customer_ID", "Invoice_ID"]).reset_index(drop=True)
print("Customer_ID & Invoice_ID duplicates found:", df_xls.duplicated(subset=['Customer_ID', 'Invoice_ID']).any())

Customer_ID & Invoice_ID duplicates found: 4

Customer_ID & Invoice_ID duplicates found: False


##### __Note__

✅ By default, .drop_duplicates() keeps the first occurrence and removes all others.
The parameters that control this behavior are:

_df.drop_duplicates(subset=None, keep='first', inplace=False)_

- subset=None → compares all columns unless you specify a subset.

- keep='first' → keeps the first occurrence of each duplicated row and drops the rest.

Other options:

- keep='last' → keeps the last occurrence instead.

- keep=False → drops every row that has a duplicate, including the first occurrence.
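
The three keep options can be compared side by side on a tiny frame. The data below is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": ["Ann", "Ann", "Bob"],
    "Invoice_ID": ["INV1", "INV1", "INV2"],
    "Sales": [10, 20, 30],
})
key = ["Customer_ID", "Invoice_ID"]

first = df.drop_duplicates(subset=key)               # keeps the row with Sales 10
last = df.drop_duplicates(subset=key, keep="last")   # keeps the row with Sales 20
none = df.drop_duplicates(subset=key, keep=False)    # drops both "Ann" rows

print(first["Sales"].tolist(), last["Sales"].tolist(), none["Sales"].tolist())
```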

##### ✅ _Custom Null Markers_

- Read a file where empty cells are marked as "Unknown", "NA", or "Missing".

In [None]:
null_markers = ["Unknown", "NA", "Missing", "-", ""]

# Count, per column, the cells that match any of the markers
print("Null Markers amount: \n", df_csv[df_csv.isin(null_markers)].count())
print()

# Repeat the count for each marker on its own
for marker in null_markers:
    print(f"***{marker}*** marker amount: \n", df_csv[df_csv.isin([marker])].count())
    print()
    

Null Markers amount: 
 Region               0
Product              8
Sales                0
Profit               0
Currency            44
Customer_ID          0
Invoice_ID           0
Description          0
Transaction_Date     0
dtype: int64

***Unknown*** marker amount: 
 Region               0
Product              0
Sales                0
Profit               0
Currency            44
Customer_ID          0
Invoice_ID           0
Description          0
Transaction_Date     0
dtype: int64

***NA*** marker amount: 
 Region              0
Product             0
Sales               0
Profit              0
Currency            0
Customer_ID         0
Invoice_ID          0
Description         0
Transaction_Date    0
dtype: int64

***Missing*** marker amount: 
 Region              0
Product             8
Sales               0
Profit              0
Currency            0
Customer_ID         0
Invoice_ID          0
Description         0
Transaction_Date    0
dtype: int64

***-*** marker amount: 

In [18]:
nll_mrkr = ["Unknown", "NA", "Missing", "-", ""]

# Count, per column, the cells that match any of the markers
print("Null Markers amount: \n", df_xls[df_xls.isin(nll_mrkr)].count())
print()

# Repeat the count for each marker on its own
for mrkr in nll_mrkr:
    print(f"***{mrkr}*** marker amount: \n", df_xls[df_xls.isin([mrkr])].count())
    print()
    

Null Markers amount: 
 Region               0
Product              8
Sales                0
Profit               0
Currency            44
Customer_ID          0
Invoice_ID           0
Description          0
Transaction_Date     0
dtype: int64

***Unknown*** marker amount: 
 Region               0
Product              0
Sales                0
Profit               0
Currency            44
Customer_ID          0
Invoice_ID           0
Description          0
Transaction_Date     0
dtype: int64

***NA*** marker amount: 
 Region              0
Product             0
Sales               0
Profit              0
Currency            0
Customer_ID         0
Invoice_ID          0
Description         0
Transaction_Date    0
dtype: int64

***Missing*** marker amount: 
 Region              0
Product             8
Sales               0
Profit              0
Currency            0
Customer_ID         0
Invoice_ID          0
Description         0
Transaction_Date    0
dtype: int64

***-*** marker amount: 
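
The cells above count the markers in an already loaded DataFrame. To have pandas treat them as missing at read time, the na_values parameter of read_csv can be used. The sketch below assumes a made-up three-row CSV rather than the real file.

```python
import io

import pandas as pd

# Hypothetical CSV where missing cells are written as "Missing", "Unknown" or "-".
raw = "Product;Currency\nLaptop;USD\nMissing;Unknown\nTablet;-\n"

custom_markers = ["Unknown", "NA", "Missing", "-"]

# na_values adds these strings to the ones pandas already reads as NaN.
df = pd.read_csv(io.StringIO(raw), sep=";", na_values=custom_markers)

print(df)
print(df.isna().sum())
```

The same na_values argument is also accepted by read_excel.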

##### ✅ _Validate Column Uniqueness_

Use .nunique() and .duplicated() to validate if a supposed ID column is truly unique.

- Use .nunique() to check how many unique values exist in the Invoice_ID column.

- Use .duplicated() to identify any duplicated values.

- If there are duplicates, display them for review.

In [23]:
print(f"Unique values in Invoice_ID: {df_csv['Invoice_ID'].nunique()} out of {len(df_csv)}\n")
print(df_csv["Invoice_ID"].unique())
print()
print("duplicated values in DataFrame: ", df_csv["Invoice_ID"].duplicated().sum())
print()
if df_csv["Invoice_ID"].duplicated().any():
    df_dups = df_csv[df_csv['Invoice_ID'].duplicated(keep=False)].sort_values('Invoice_ID')
    print(df_dups)
else:
    print("All Invoice_IDs are unique ✅")

Unique values in Invoice_ID: 199 out of 200

['INV8450' 'INV2501' 'INV4475' 'INV1349' 'INV1828' 'INV6464' 'INV4990'
 'INV3063' 'INV4362' 'INV2124' 'INV4394' 'INV4538' 'INV4817' 'INV6383'
 'INV3417' 'INV1046' 'INV5542' 'INV3370' 'INV3129' 'INV9850' 'INV5106'
 'INV3858' 'INV2801' 'INV1422' 'INV3159' 'INV1243' 'INV6869' 'INV4898'
 'INV6304' 'INV1258' 'INV3854' 'INV5347' 'INV1858' 'INV3076' 'INV7897'
 'INV9619' 'INV2862' 'INV2041' 'INV8802' 'INV8344' 'INV6931' 'INV9408'
 'INV2786' 'INV8405' 'INV9254' 'INV4629' 'INV1710' 'INV9543' 'INV5941'
 'INV8504' 'INV1510' 'INV1996' 'INV8847' 'INV7580' 'INV7984' 'INV2768'
 'INV9032' 'INV8267' 'INV2204' 'INV2323' 'INV6277' 'INV3430' 'INV2076'
 'INV3067' 'INV5505' 'INV9984' 'INV6327' 'INV7240' 'INV9692' 'INV5831'
 'INV8433' 'INV9282' 'INV8048' 'INV2624' 'INV2874' 'INV4522' 'INV8046'
 'INV8398' 'INV4743' 'INV7779' 'INV6553' 'INV8430' 'INV7532' 'INV7815'
 'INV2557' 'INV6120' 'INV7992' 'INV5176' 'INV7132' 'INV3500' 'INV8770'
 'INV2099' 'INV2494' 'INV2398' '

In [24]:
print(f"Unique values in Invoice_ID: {df_xls['Invoice_ID'].nunique()} out of {len(df_xls)}\n")
print(df_xls["Invoice_ID"].unique())
print()
print("duplicated values in DataFrame: ", df_xls["Invoice_ID"].duplicated().sum())
print()
if df_xls["Invoice_ID"].duplicated().any():
    df_dupx = df_xls[df_xls['Invoice_ID'].duplicated(keep=False)].sort_values('Invoice_ID')
    print(df_dupx)
else:
    print("All Invoice_IDs are unique ✅")

Unique values in Invoice_ID: 199 out of 200

['INV8450' 'INV2501' 'INV4475' 'INV1349' 'INV1828' 'INV6464' 'INV4990'
 'INV3063' 'INV4362' 'INV2124' 'INV4394' 'INV4538' 'INV4817' 'INV6383'
 'INV3417' 'INV1046' 'INV5542' 'INV3370' 'INV3129' 'INV9850' 'INV5106'
 'INV3858' 'INV2801' 'INV1422' 'INV3159' 'INV1243' 'INV6869' 'INV4898'
 'INV6304' 'INV1258' 'INV3854' 'INV5347' 'INV1858' 'INV3076' 'INV7897'
 'INV9619' 'INV2862' 'INV2041' 'INV8802' 'INV8344' 'INV6931' 'INV9408'
 'INV2786' 'INV8405' 'INV9254' 'INV4629' 'INV1710' 'INV9543' 'INV5941'
 'INV8504' 'INV1510' 'INV1996' 'INV8847' 'INV7580' 'INV7984' 'INV2768'
 'INV9032' 'INV8267' 'INV2204' 'INV2323' 'INV6277' 'INV3430' 'INV2076'
 'INV3067' 'INV5505' 'INV9984' 'INV6327' 'INV7240' 'INV9692' 'INV5831'
 'INV8433' 'INV9282' 'INV8048' 'INV2624' 'INV2874' 'INV4522' 'INV8046'
 'INV8398' 'INV4743' 'INV7779' 'INV6553' 'INV8430' 'INV7532' 'INV7815'
 'INV2557' 'INV6120' 'INV7992' 'INV5176' 'INV7132' 'INV3500' 'INV8770'
 'INV2099' 'INV2494' 'INV2398' '

#### 📊 __Analysis and Aggregation__

##### ✅ _Frequency Analysis_

Use .value_counts() to find the most common product sold per region.

In [56]:
df_prod_count_region = df_csv.groupby("Region")["Product"].value_counts()

print("* Products sold per region: \n\n", df_prod_count_region)
print()

df_max_region = df_prod_count_region.groupby(level=0).transform('max')
df_most_sold = df_prod_count_region[df_prod_count_region == df_max_region]

print("* Product most sold per region: \n\n", df_most_sold)

* Products sold per region: 

 Region  Product   
East    Laptop         9
        Monitor        9
        Tablet         9
        Smartphone     8
        Headphones     6
        Missing        2
North   Laptop        15
        Monitor       14
        Tablet        10
        Smartphone     8
        Headphones     7
        Missing        2
South   Smartphone    17
        Laptop        14
        Tablet        10
        Headphones     8
        Monitor        6
        Missing        2
West    Tablet        13
        Headphones     8
        Laptop         7
        Monitor        7
        Smartphone     7
        Missing        2
Name: count, dtype: int64

* Product most sold per region: 

 Region  Product   
East    Laptop         9
        Monitor        9
        Tablet         9
North   Laptop        15
South   Smartphone    17
West    Tablet        13
Name: count, dtype: int64


In [57]:
df_prd_cnt_rgn = df_xls.groupby("Region")["Product"].value_counts()

print("* Products sold per region: \n\n", df_prd_cnt_rgn)
print()

df_mx_rgn = df_prd_cnt_rgn.groupby(level=0).transform('max')
df_mst_sld = df_prd_cnt_rgn[df_prd_cnt_rgn == df_mx_rgn]

print("* Product most sold per region: \n\n", df_mst_sld)

* Products sold per region: 

 Region  Product   
East    Laptop         9
        Monitor        9
        Tablet         9
        Smartphone     8
        Headphones     6
        Missing        2
North   Laptop        15
        Monitor       14
        Tablet        10
        Smartphone     8
        Headphones     7
        Missing        2
South   Smartphone    17
        Laptop        14
        Tablet        10
        Headphones     8
        Monitor        6
        Missing        2
West    Tablet        13
        Headphones     8
        Laptop         7
        Monitor        7
        Smartphone     7
        Missing        2
Name: count, dtype: int64

* Product most sold per region: 

 Region  Product   
East    Laptop         9
        Monitor        9
        Tablet         9
North   Laptop        15
South   Smartphone    17
West    Tablet        13
Name: count, dtype: int64


##### __Note__

✅ Calling .value_counts() on a grouped column (groupby("Region")["Product"]) doesn't return a DataFrame; it returns a Series with a MultiIndex (Region, Product).

In that Series:

- The index has two levels, Region and Product → that's a MultiIndex.

- The values are the counts.

So:

- You don’t get column names like in a DataFrame.

- You just get Region + Product as the index, and a single numeric Series (named count) as the values.

✅ transform() preserves the shape of its input, so you can compare the result element-wise with the original Series.

- level=0 → means you're grouping by the first level of the index, which is Region.

- transform("max") → gives you the max value for each Region and repeats it for every row in that group.
This is key because each row can then be compared to the max in its group, and only the rows that equal it are kept. Ties are all kept, which is why East shows three products.
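
To see the broadcast that transform("max") performs, here is a reduced version of the same steps on five made-up rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "East", "West", "West"],
    "Product": ["Laptop", "Laptop", "Tablet", "Tablet", "Tablet"],
})

# Series with a (Region, Product) MultiIndex holding the counts.
counts = df.groupby("Region")["Product"].value_counts()

# Same index as counts, but every row holds the maximum count of its Region.
group_max = counts.groupby(level=0).transform("max")

# Keep only the rows whose count equals the maximum of their Region.
top = counts[counts == group_max]

print(group_max.tolist())
print(top)
```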

##### ✅ _Group Aggregation by Region_

Group by "Region" and compute:

- Total Sales

- Mean Profit

- Max Sales
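
The three metrics can be computed in a single groupby call with named aggregation. This is a sketch on invented numbers; the output column names Total_Sales, Mean_Profit and Max_Sales are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Sales": [100.0, 300.0, 50.0],
    "Profit": [10.0, 30.0, 5.0],
})

# Each keyword is an output column: (source column, aggregation function).
summary = df.groupby("Region").agg(
    Total_Sales=("Sales", "sum"),
    Mean_Profit=("Profit", "mean"),
    Max_Sales=("Sales", "max"),
)

print(summary)
```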

In [80]:
print(df_csv["Currency"].unique())
df_csv["Currency"] = df_csv["Currency"].fillna("Unknown")
print(df_csv["Currency"].unique())
print()
print(df_csv["Sales"].unique())
df_csv["Sales"] = df_csv["Sales"].fillna(0)
print(df_csv["Sales"].unique())
print()
print(df_csv["Profit"].unique())
df_csv["Profit"] = df_csv["Profit"].fillna(0)
print(df_csv["Profit"].unique())
print()

print("Sales per region per currency: ", df_csv.groupby(["Region","Currency"])["Sales"].sum())
print()
print(f"Amount of Unknown sales: {df_csv.loc[df_csv['Currency'] == 'Unknown', 'Sales'].count()}")
print(f"Total Unknown sales: {df_csv.loc[df_csv['Currency'] == 'Unknown', 'Sales'].sum():,.2f}")
print(f"Mean Unknown profit: {df_csv.loc[df_csv['Currency'] == 'Unknown', 'Profit'].mean():,.2f}")
print(f"Max Unknown sales: {df_csv.loc[df_csv['Currency'] == 'Unknown', 'Sales'].max()}")
print()
print(f"Amount of EUR sales: {df_csv.loc[df_csv['Currency'] == 'EUR', 'Sales'].count()}")
print(f"Total EUR sales: € {df_csv.loc[df_csv['Currency'] == 'EUR', 'Sales'].sum():,.2f}")
print(f"Mean EUR profit: € {df_csv.loc[df_csv['Currency'] == 'EUR', 'Profit'].mean():,.2f}")
print(f"Max EUR sales: € {df_csv.loc[df_csv['Currency'] == 'EUR', 'Sales'].max()}")
print()
print(f"Amount of USD sales: {df_csv.loc[df_csv['Currency'] == 'USD', 'Sales'].count()}")
print(f"Total USD sales: $ {df_csv.loc[df_csv['Currency'] == 'USD', 'Sales'].sum():,.2f}")
print(f"Mean USD profit: $ {df_csv.loc[df_csv['Currency'] == 'USD', 'Profit'].mean():,.2f}")
print(f"Max USD sales: $ {df_csv.loc[df_csv['Currency'] == 'USD', 'Sales'].max()}")
print()
print(f"Amount of MXN sales: {df_csv.loc[df_csv['Currency'] == 'MXN', 'Sales'].count()}")
print(f"Total MXN sales: $ {df_csv.loc[df_csv['Currency'] == 'MXN', 'Sales'].sum():,.2f}")
print(f"Mean MXN profit: $ {df_csv.loc[df_csv['Currency'] == 'MXN', 'Profit'].mean():,.2f}")
print(f"Max MXN sales: $ {df_csv.loc[df_csv['Currency'] == 'MXN', 'Sales'].max()}")

['USD' 'Unknown' 'MXN' 'EUR']
['USD' 'Unknown' 'MXN' 'EUR']

[4922.76  666.66 4507.57 1031.39  317.49 2236.68 2647.61 4051.9  3465.6
 4707.29 2816.62 3711.48 1065.47 2213.36 4749.48 4611.78 3153.46 3350.6
  710.66 4510.23 2584.9  3367.78 1698.3  3516.32 1040.02 3358.15 1957.81
 3765.86  952.73 2888.62 2089.77 4185.25 1588.99 1129.9  3950.22 3072.64
 1679.02 2264.75 3410.57 2604.75 3989.3  4803.18 3706.21 3328.37 1490.55
 3352.88 3134.38  557.6  4764.8  1250.86 1621.06 4052.16  822.03  326.45
 4921.29 3095.24 3865.6  2331.54 4442.07 2920.79 3619.89 1981.49 2058.3
  823.1  3469.35 4474.03 4316.16 4438.78 3914.26 1172.32 4040.14 3509.97
 2376.29 2831.29 4595.6   694.66  753.36 2376.75 2702.55 2838.73 1654.76
 3800.73 2268.61 4093.53 4470.9  2191.03 4543.59 2284.98  879.87 4321.33
 2305.34 3783.71 4215.33 1458.83 3910.61 2474.53 1272.3  2255.38 3596.37
 1249.02 1745.65 4475.83  494.83  839.07 1976.87  848.75 1148.37 2132.85
 1721.32 2383.03  405.1  4181.17 2008.44 3872.   4735.7   195.71 4

In [81]:
print(df_xls["Currency"].unique())
df_xls["Currency"] = df_xls["Currency"].fillna("Unknown")
print(df_xls["Currency"].unique())
print()
print(df_xls["Sales"].unique())
df_xls["Sales"] = df_xls["Sales"].fillna(0)
print(df_xls["Sales"].unique())
print()
print(df_xls["Profit"].unique())
df_xls["Profit"] = df_xls["Profit"].fillna(0)
print(df_xls["Profit"].unique())
print()

print("Sales per region per currency: ", df_xls.groupby(["Region","Currency"])["Sales"].sum())
print()
print(f"Amount of Unknown sales: {df_xls.loc[df_xls['Currency'] == 'Unknown', 'Sales'].count()}")
print(f"Total Unknown sales: {df_xls.loc[df_xls['Currency'] == 'Unknown', 'Sales'].sum():,.2f}")
print(f"Mean Unknown profit: {df_xls.loc[df_xls['Currency'] == 'Unknown', 'Profit'].mean():,.2f}")
print(f"Max Unknown sales: {df_xls.loc[df_xls['Currency'] == 'Unknown', 'Sales'].max()}")
print()
print(f"Amount of EUR sales: {df_xls.loc[df_xls['Currency'] == 'EUR', 'Sales'].count()}")
print(f"Total EUR sales: € {df_xls.loc[df_xls['Currency'] == 'EUR', 'Sales'].sum():,.2f}")
print(f"Mean EUR profit: € {df_xls.loc[df_xls['Currency'] == 'EUR', 'Profit'].mean():,.2f}")
print(f"Max EUR sales: € {df_xls.loc[df_xls['Currency'] == 'EUR', 'Sales'].max()}")
print()
print(f"Amount of USD sales: {df_xls.loc[df_xls['Currency'] == 'USD', 'Sales'].count()}")
print(f"Total USD sales: $ {df_xls.loc[df_xls['Currency'] == 'USD', 'Sales'].sum():,.2f}")
print(f"Mean USD profit: $ {df_xls.loc[df_xls['Currency'] == 'USD', 'Profit'].mean():,.2f}")
print(f"Max USD sales: $ {df_xls.loc[df_xls['Currency'] == 'USD', 'Sales'].max()}")
print()
print(f"Amount of MXN sales: {df_xls.loc[df_xls['Currency'] == 'MXN', 'Sales'].count()}")
print(f"Total MXN sales: $ {df_xls.loc[df_xls['Currency'] == 'MXN', 'Sales'].sum():,.2f}")
print(f"Mean MXN profit: $ {df_xls.loc[df_xls['Currency'] == 'MXN', 'Profit'].mean():,.2f}")
print(f"Max MXN sales: $ {df_xls.loc[df_xls['Currency'] == 'MXN', 'Sales'].max()}")

['USD' 'Unknown' 'MXN' <NA> 'EUR']
['USD' 'Unknown' 'MXN' 'EUR']

[4922.76  666.66 4507.57 1031.39  317.49 2236.68 2647.61 4051.9  3465.6
 4707.29 2816.62 3711.48 1065.47 2213.36 4749.48 4611.78 3153.46 3350.6
  710.66 4510.23 2584.9  3367.78 1698.3  3516.32 1040.02 3358.15 1957.81
 3765.86  952.73 2888.62 2089.77 4185.25 1588.99 1129.9  3950.22 3072.64
 1679.02 2264.75 3410.57 2604.75 3989.3  4803.18 3706.21 3328.37 1490.55
 3352.88 3134.38  557.6  4764.8  1250.86 1621.06 4052.16  822.03  326.45
 4921.29 3095.24 3865.6  2331.54 4442.07 2920.79 3619.89 1981.49 2058.3
  823.1  3469.35 4474.03 4316.16 4438.78 3914.26 1172.32 4040.14 3509.97
 2376.29 2831.29 4595.6   694.66  753.36 2376.75 2702.55 2838.73 1654.76
 3800.73 2268.61 4093.53 4470.9  2191.03 4543.59 2284.98  879.87 4321.33
 2305.34 3783.71 4215.33 1458.83 3910.61 2474.53 1272.3  2255.38 3596.37
 1249.02 1745.65 4475.83  494.83  839.07 1976.87  848.75 1148.37 2132.85
 1721.32 2383.03  405.1  4181.17 2008.44 3872.   4735.7   195

##### ✅ _Multi-Level Aggregation_

- Group by both "Region" and "Product" to compute sum of "Sales" and count of "Invoice_ID".
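
A possible shape for this task, sketched on four made-up rows. The output names Total_Sales and Invoices are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "East", "West"],
    "Product": ["Laptop", "Laptop", "Tablet", "Laptop"],
    "Sales": [100.0, 200.0, 50.0, 400.0],
    "Invoice_ID": ["INV1", "INV2", "INV3", "INV4"],
})

# Grouping by two columns produces one row per (Region, Product) pair.
by_region_product = df.groupby(["Region", "Product"]).agg(
    Total_Sales=("Sales", "sum"),
    Invoices=("Invoice_ID", "count"),
)

print(by_region_product)
```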

##### ✅ _Sort Aggregated Data_

- After groupby-agg, sort the results by "Total Sales" descending.
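
A minimal sketch of sorting an aggregated result, using invented totals. The Series is renamed to "Total Sales" to match the task wording.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West", "North"],
    "Sales": [100.0, 200.0, 400.0, 50.0],
})

# Aggregate first, then sort the aggregated values from highest to lowest.
ranked = (
    df.groupby("Region")["Sales"]
    .sum()
    .rename("Total Sales")
    .sort_values(ascending=False)
)

print(ranked)
```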

##### ✅ _String Cleaning with .str accessor_

- Use .str.strip(), .str.lower(), .str.replace() to clean customer names.
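
The .str methods can be chained. The messy names below are made up to show each step.

```python
import pandas as pd

names = pd.Series(["  Raymond ", "ELIZABETH", "Tra-vis"])

cleaned = (
    names.str.strip()                          # remove leading/trailing spaces
    .str.lower()                               # normalize case
    .str.replace("-", "", regex=False)         # drop stray hyphens (literal match)
)

print(cleaned.tolist())
```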

##### ✅ _Filtering with Sets_

- From a set of known loyal customers (set()), filter rows where the Customer_ID is part of that set.
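
A short sketch with .isin(), which accepts a set directly. The loyal customer names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": ["Raymond", "Heidi", "Travis", "Dalton"],
    "Invoice_ID": ["INV1", "INV2", "INV3", "INV4"],
})

# Hypothetical set of loyal customers.
loyal_customers = {"Raymond", "Travis"}

# Boolean mask: True where Customer_ID belongs to the set.
loyal_rows = df[df["Customer_ID"].isin(loyal_customers)]

print(loyal_rows)
```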

##### ✅ _Categorize with Dictionary Mapping_

- Map the "Currency" column using a dictionary like {'USD': 'Dollar', 'EUR': 'Euro'}.
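
With .map(), keys that are not in the dictionary become NaN. The sketch below shows that, plus one possible fallback that keeps the original code.

```python
import pandas as pd

currency = pd.Series(["USD", "EUR", "MXN"])
currency_names = {"USD": "Dollar", "EUR": "Euro"}

# "MXN" is not a key in the dictionary, so it maps to NaN.
mapped = currency.map(currency_names)

# Optional fallback: keep the original code where no mapping exists.
with_fallback = mapped.fillna(currency)

print(mapped.tolist())
print(with_fallback.tolist())
```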

##### ✅ _Top N Products_

- Use .value_counts() and slicing to find the top 5 most frequent products.
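
value_counts() returns the counts sorted in descending order, so slicing the first five entries is enough. The product list is invented.

```python
import pandas as pd

products = pd.Series(
    ["Laptop"] * 6 + ["Tablet"] * 5 + ["Monitor"] * 4
    + ["Smartphone"] * 3 + ["Headphones"] * 2 + ["Missing"] * 1
)

# Counts are already sorted from most to least frequent.
top5 = products.value_counts().iloc[:5]

print(top5)
```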

##### ✅ _Percentage of Nulls_

- Create a dictionary with column names as keys and percentage of missing values as values.
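
One compact way to build that dictionary, sketched on a made-up four-row frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sales": [1.0, np.nan, 3.0, np.nan],
    "Region": ["East", None, "West", "West"],
})

# isna().mean() is the fraction of missing cells; multiply by 100 for a percentage.
pct_missing = (df.isna().mean() * 100).round(2).to_dict()

print(pct_missing)
```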

##### ✅ _Filter and Aggregate in One Line_

- Use a one-liner (with comprehension if needed) to filter all rows where Sales > 1000 and group by "Region" to get the average profit.
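
A boolean filter followed by groupby fits on one line and needs no comprehension. The numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Sales": [1500.0, 500.0, 2000.0, 3000.0],
    "Profit": [100.0, 999.0, 200.0, 400.0],
})

# Filter first, then average Profit per Region over the remaining rows.
avg_profit = df[df["Sales"] > 1000].groupby("Region")["Profit"].mean()

print(avg_profit)
```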

#### 🧠 __Advanced Manipulations__

##### ✅ _List Comprehension for Filtered Rows_

- Use a comprehension to create a list of Invoice_ID for which "Profit" is negative and "Sales" is above average.
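
A sketch using zip() over the three columns. The data is invented so that exactly one invoice qualifies.

```python
import pandas as pd

df = pd.DataFrame({
    "Invoice_ID": ["INV1", "INV2", "INV3", "INV4"],
    "Sales": [100.0, 400.0, 300.0, 200.0],
    "Profit": [-5.0, -10.0, 20.0, -1.0],
})

avg_sales = df["Sales"].mean()  # 250.0 for this toy data

# Keep the invoice when profit is negative and sales are above the average.
flagged = [
    invoice
    for invoice, sales, profit in zip(df["Invoice_ID"], df["Sales"], df["Profit"])
    if profit < 0 and sales > avg_sales
]

print(flagged)
```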

##### ✅ _Create a Dictionary of Unique Values per Column_

- Create a dictionary where each key is a column and the value is the list of its unique values.
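
A dictionary comprehension over the columns is one way to do this. The frame below is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "West", "East"],
    "Currency": ["USD", "USD", "EUR"],
})

# unique() preserves the order of first appearance; tolist() gives plain Python lists.
unique_values = {col: df[col].unique().tolist() for col in df.columns}

print(unique_values)
```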

##### ✅ _Pivot Analysis_

- Pivot the data to show Product as rows and Region as columns with total sales.
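
A minimal pivot_table sketch on invented rows. fill_value=0 is an optional choice here so that combinations with no sales show 0 instead of NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Laptop", "Tablet"],
    "Region": ["East", "East", "West", "West"],
    "Sales": [100.0, 50.0, 200.0, 70.0],
})

# Rows = Product, columns = Region, cells = total Sales.
pivot = pd.pivot_table(
    df, index="Product", columns="Region", values="Sales",
    aggfunc="sum", fill_value=0,
)

print(pivot)
```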

##### ✅ _Tuple of Column Summary_

- Create a tuple that stores (column_name, min_value, max_value) for numeric columns.
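
One reading of this task is a tuple with one (column_name, min_value, max_value) entry per numeric column. That interpretation, and the toy data, are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "West", "East"],
    "Sales": [10.0, 30.0, 20.0],
    "Profit": [-1.0, 5.0, 2.0],
})

# Restrict to numeric columns, then build one (name, min, max) tuple per column.
numeric = df.select_dtypes(include="number")
summary = tuple(
    (col, numeric[col].min(), numeric[col].max()) for col in numeric.columns
)

print(summary)
```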

##### ✅ _String Extraction_

- Use .str.extract() to pull out numbers from product descriptions like "Item 1234 - ABC".
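
A sketch of the extraction with a capture group. The first two descriptions follow the pattern seen in the dataset; the third is made up to show what happens when nothing matches.

```python
import pandas as pd

descriptions = pd.Series(["Item 7636 - smile", "Item 8138 - class", "No number here"])

# The parentheses mark the part to extract; expand=False returns a Series.
item_text = descriptions.str.extract(r"Item (\d+)", expand=False)

# Extracted values are text; convert them to nullable integers if needed.
item_number = pd.to_numeric(item_text).astype("Int64")

print(item_text.tolist())
print(item_number.tolist())
```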

##### ✅ _Customer Retention Insight_

- From a sorted DataFrame of dates, identify customers with more than one transaction in different months (use sets and groupby()).
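
One possible approach: collect the set of year-month labels per customer and keep the customers whose set has more than one element. The transactions below are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": ["Raymond", "Raymond", "Travis", "Travis", "Heidi"],
    "Transaction_Date": pd.to_datetime(
        ["2024-09-04", "2025-01-06", "2024-09-01", "2024-09-20", "2024-10-10"]
    ),
}).sort_values("Transaction_Date")

# One set of "YYYY-MM" labels per customer; repeated months collapse inside the set.
months_by_customer = {
    customer: set(dates.dt.strftime("%Y-%m"))
    for customer, dates in df.groupby("Customer_ID")["Transaction_Date"]
}

# Customers who bought in at least two different months.
returning = {c for c, months in months_by_customer.items() if len(months) > 1}

print(returning)
```

Travis has two transactions but both fall in the same month, so only Raymond is reported.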

##### ✅ _Sorting with Multiple Criteria_

- Sort the data by Region ascending and Sales descending.
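
sort_values() accepts a list of columns and a matching list of directions. A sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["West", "East", "East", "West"],
    "Sales": [10, 20, 30, 40],
})

# Region A→Z first; within each Region, Sales from highest to lowest.
ordered = df.sort_values(["Region", "Sales"], ascending=[True, False])

print(ordered)
```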

##### ✅ _Nested Dictionary for Aggregation_

- Use groupby().agg() with a dictionary like: {'Sales': ['sum', 'mean'], 'Profit': ['min', 'max']}.
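
With a dictionary of lists the result has two-level column names. The sketch below also shows one optional way to flatten them; the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Sales": [100.0, 300.0, 50.0],
    "Profit": [10.0, 30.0, 5.0],
})

summary = df.groupby("Region").agg(
    {"Sales": ["sum", "mean"], "Profit": ["min", "max"]}
)

# Columns are tuples such as ("Sales", "sum"); join them for flat names.
flat_names = ["_".join(col) for col in summary.columns]

print(summary)
print(flat_names)
```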

##### ✅ _Create a Clean Subset_

Generate a clean subset of your dataset excluding rows with:

- Missing values in any numeric columns

- Duplicate Customer_ID
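
A sketch of both filters chained together. It assumes that "excluding duplicates" means keeping the first row of each Customer_ID; pass keep=False to drop_duplicates if every repeated customer should go. The data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": ["Raymond", "Raymond", "Heidi", "Travis", "Dalton"],
    "Sales": [10.0, 20.0, np.nan, 40.0, 50.0],
    "Profit": [1.0, 2.0, 3.0, np.nan, 5.0],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()

clean = (
    df.dropna(subset=numeric_cols)             # drop rows missing any numeric value
    .drop_duplicates(subset="Customer_ID")     # keep the first row per customer
    .reset_index(drop=True)
)

print(clean)
```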

##### ✅ _Export Clean Data_

- After applying all cleaning steps, export the clean DataFrame to Excel, one sheet for each Region.