
# Lab 4 – Data Quality Assessment & Preprocessing
Dataset: Chocolate_Sales.csv



- Task 1: Identify data quality issues
- Task 2: Apply one missing value strategy and explain why
- Task 3: Detect and handle outliers using IQR
- Task 4: Normalize numerical features using Min-Max and Z-score
- Task 5: Apply PCA and interpret explained variance


In [18]:
import pandas as pd # Data loading and manipulation
import numpy as np # Numerical operations
from sklearn.preprocessing import MinMaxScaler, StandardScaler # Feature scaling
from sklearn.decomposition import PCA # Dimensionality reduction

In [19]:
df = pd.read_csv("Chocolate_Sales.csv") # Load the CSV file into a dataframe
df.head() # Show the first few rows of the dataframe

| | Sales Person | Country | Product | Date | Amount | Boxes Shipped |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 04/01/2022 | $5,320.00 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 01/08/2022 | $7,896.00 | 94 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 07/07/2022 | $4,501.00 | 91 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 27/04/2022 | $12,726.00 | 342 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 24/02/2022 | $13,685.00 | 184 |



## Task 1: Identify Data Quality Issues

We examine:
- Dataset shape
- Missing values
- Duplicate records
- Data types


In [20]:
# Show dataset shape (rows, columns)
print("Dataset Shape:", df.shape)

# Show missing values in each column
print("\nMissing Values:")
print(df.isnull().sum())

# Show number of duplicate rows
print("\nDuplicate Rows:", df.duplicated().sum())

# Show data types and non-null counts
df.info()

Dataset Shape: (3282, 6)

Missing Values:
Sales Person     0
Country          0
Product          0
Date             0
Amount           0
Boxes Shipped    0
dtype: int64

Duplicate Rows: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3282 entries, 0 to 3281
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Sales Person   3282 non-null   object
 1   Country        3282 non-null   object
 2   Product        3282 non-null   object
 3   Date           3282 non-null   object
 4   Amount         3282 non-null   object
 5   Boxes Shipped  3282 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 154.0+ KB
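
The `info()` output above lists `Amount` and `Date` as `object`, so both are stored as text even though one holds currency values and the other holds dates. With no missing values and no duplicates, these type problems are the main quality issue found here. The sketch below shows one way to confirm that such columns can be parsed cleanly. It uses a few rows copied from the preview rather than the full file; the names `sample`, `amount_num`, and `date_parsed` are hypothetical, and the day-first date format is an assumption based on values such as 27/04/2022.

```python
import pandas as pd

# Hypothetical sample copied from the first rows of the preview (not the full file).
sample = pd.DataFrame({
    "Date": ["04/01/2022", "01/08/2022", "27/04/2022"],
    "Amount": ["$5,320.00", "$7,896.00", "$12,726.00"],
})

# Strip the currency symbol and thousands separators, then convert to numbers.
# Values that still cannot be parsed become NaN instead of raising an error.
amount_num = pd.to_numeric(
    sample["Amount"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Parse dates as day/month/year (assumed format); mismatches become NaT.
date_parsed = pd.to_datetime(sample["Date"], format="%d/%m/%Y", errors="coerce")

print("Unparseable amounts:", amount_num.isna().sum())
print("Unparseable dates:  ", date_parsed.isna().sum())
```

A non-zero count from either check would point to rows that need manual inspection before conversion.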



## Task 2: Handle Missing Values

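Strategy used:
- Numerical columns → median
- Categorical columns → mode

Reason:
The median is robust to outliers and works well when numerical data is skewed.
The mode is appropriate for categorical variables because it fills gaps with the most frequent category.

Note: Task 1 found no missing values, so the fill step that follows leaves the Chocolate_Sales data unchanged. Amount is still stored as text at this point, so it would fall under the categorical (mode) branch rather than the numerical (median) one.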


In [21]:
# Fill missing values in numerical columns with median
for col in df.select_dtypes(include=np.number):
    df[col] = df[col].fillna(df[col].median())

# Fill missing values in categorical columns with mode
for col in df.select_dtypes(include='object'):
    df[col] = df[col].fillna(df[col].mode()[0])

# Check missing values again
print("Missing values after handling:")
print(df.isnull().sum())

Missing values after handling:
Sales Person     0
Country          0
Product          0
Date             0
Amount           0
Boxes Shipped    0
dtype: int64
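
Because the dataset has no gaps, the effect of the median and mode strategy is not visible in the output above. As a minimal sketch with made-up values (the series `boxes` and `country` are hypothetical and not taken from the file), the following shows why the median is the safer fill value when an extreme value is present:

```python
import numpy as np
import pandas as pd

# Made-up numeric values with one missing entry and one extreme value.
boxes = pd.Series([90.0, 94.0, 180.0, 184.0, np.nan, 5000.0])

# The mean is pulled upward by the extreme value; the median is not.
print("Mean:  ", boxes.mean())
print("Median:", boxes.median())

# Fill the gap with the median, as in the strategy described above.
median_fill = boxes.fillna(boxes.median())

# Made-up categorical values with one missing entry, filled with the mode.
country = pd.Series(["UK", "India", "UK", None])
mode_fill = country.fillna(country.mode()[0])

print(median_fill.tolist())
print(mode_fill.tolist())
```

Filling with the mean here would insert a value far above every typical entry, while the median stays within the bulk of the data.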



## Task 3: Detect and Handle Outliers using IQR

IQR Method:
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1
- Outliers are values below (Q1 - 1.5*IQR) or above (Q3 + 1.5*IQR)
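
The rule above can be traced on a small array. This is a sketch with made-up numbers (`values` is hypothetical); it uses `np.percentile` with its default linear interpolation, which matches the default of pandas `quantile`.

```python
import numpy as np

# Made-up values; 95 is far from the rest.
values = np.array([10, 12, 13, 14, 15, 16, 18, 95])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower, upper)
print("Outliers:", outliers)
```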


In [22]:
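# Remove outliers using the IQR method.
# Note: Amount is still a string column here, so only Boxes Shipped is checked.
for col in df.select_dtypes(include=np.number):
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR  # lower fence
    upper = Q3 + 1.5 * IQR  # upper fence

    # Keep rows inside the fences; copy to avoid chained-assignment warnings later
    df = df[(df[col] >= lower) & (df[col] <= upper)].copy()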

print("Shape after removing outliers:", df.shape)

Shape after removing outliers: (3204, 6)



## Task 4: Normalization

Two normalization techniques are applied:

1) Min-Max Scaling  
   Rescales each feature to the range [0, 1]: x' = (x - min) / (max - min).

2) Z-score Scaling (Standardization)  
   Centers each feature at mean = 0 and scales it to standard deviation = 1: z = (x - mean) / std.
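
Both techniques reduce to one-line formulas. As a sketch on made-up values (the array `x` is hypothetical), the manual computation can be compared against the scikit-learn scalers. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature as a column vector (scikit-learn expects 2-D input).
x = np.array([[10.0], [20.0], [40.0]])

# Min-Max: (x - min) / (max - min)
manual_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Z-score: (x - mean) / std, using the population standard deviation
manual_z = (x - x.mean(axis=0)) / x.std(axis=0)

# The scikit-learn scalers should give the same results.
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_z, StandardScaler().fit_transform(x)))
```

Min-Max scaling depends only on the two extreme values, so it is sensitive to outliers that remain in the data; Z-score scaling uses every value through the mean and standard deviation.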


In [23]:
# Strip "$" and "," from Amount and convert it to float (it was loaded as text)
df['Amount'] = df['Amount'].replace(r'[\$,]', '', regex=True).astype(float)
num_cols = df.select_dtypes(include=np.number).columns # Get the names of all numeric columns

minmax = MinMaxScaler() # Create a Min-Max scaler tool
df_minmax = df.copy() # Make a copy of the data for Min-Max scaling
df_minmax[num_cols] = minmax.fit_transform(df[num_cols]) # Scale number columns to be between 0 and 1

df_minmax.head() # Show the first few rows of the Min-Max scaled data

| | Sales Person | Country | Product | Date | Amount | Boxes Shipped |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 04/01/2022 | 0.203066 | 0.382479 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 01/08/2022 | 0.301522 | 0.198718 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 07/07/2022 | 0.171763 | 0.192308 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 27/04/2022 | 0.486127 | 0.728632 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 24/02/2022 | 0.52278 | 0.391026 |


In [24]:
scaler = StandardScaler() # Create a Standard Scaler tool
df_zscore = df.copy() # Make a copy of the data for Z-score scaling
df_zscore[num_cols] = scaler.fit_transform(df[num_cols]) # Scale number columns to have a mean of 0 and std of 1

df_zscore.head() # Show the first few rows of the Z-score scaled data

| | Sales Person | Country | Product | Date | Amount | Boxes Shipped |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 04/01/2022 | -0.165259 | 0.225279 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 01/08/2022 | 0.419152 | -0.560471 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 07/07/2022 | -0.351063 | -0.587881 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 27/04/2022 | 1.514922 | 1.705412 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 24/02/2022 | 1.732488 | 0.261825 |



## Task 5: Principal Component Analysis (PCA)

PCA reduces dimensionality while preserving as much variance as possible.

We examine:
- Explained variance ratio
- Cumulative explained variance

If the first few principal components explain ~80% or more of the variance,
we can reduce dimensionality with minimal information loss.


In [25]:
pca = PCA() # Create a PCA tool
principal_components = pca.fit_transform(df_zscore[num_cols]) # Apply PCA to the scaled number columns

explained_variance = pca.explained_variance_ratio_ # Get how much each new component explains the data
cumulative_variance = np.cumsum(explained_variance) # Calculate the total explained variance up to each component

print("Explained Variance Ratio:\n", explained_variance) # Show how much variance each component captures
print("\nCumulative Explained Variance:\n", cumulative_variance) # Show the running total of explained variance

Explained Variance Ratio:
 [0.50089201 0.49910799]

Cumulative Explained Variance:
 [0.50089201 1.        ]



### Interpretation

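- PC1 explains about 50.1% of the variance and PC2 about 49.9%, so neither component dominates.
- With only two standardized features, the two ratios equal (1 + |r|)/2 and (1 - |r|)/2, where r is the correlation between the features. The values here imply |r| ≈ 0.002, so Amount and Boxes Shipped are almost uncorrelated.
- Reaching the ~80% threshold requires both components, so PCA offers no useful dimensionality reduction for this dataset.
- In general, PCA simplifies modeling and reduces computational cost only when a few components capture most of the variance.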
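
For two standardized features, the explained variance ratios depend only on the correlation r between them: they equal (1 + |r|)/2 and (1 - |r|)/2. The sketch below checks this on synthetic data (the arrays `a` and `b` and the seed are hypothetical and unrelated to the chocolate sales file), which helps when reading a near 50/50 split like the one printed above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately correlated features (not from the dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.8 * a + 0.6 * rng.normal(size=500)

# Standardize, then measure the correlation between the two columns.
X = StandardScaler().fit_transform(np.column_stack([a, b]))
r = np.corrcoef(X, rowvar=False)[0, 1]

# Ratios reported by PCA versus the closed-form values.
ratios = PCA().fit(X).explained_variance_ratio_
expected = np.array([(1 + abs(r)) / 2, (1 - abs(r)) / 2])

print("Correlation:", r)
print("PCA ratios: ", ratios)
print("Closed form:", expected)
```

The stronger the correlation, the more variance PC1 absorbs; when r is close to zero, the split approaches 50/50 and dropping a component discards about half of the information.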
