# LESSON PLAN:
- Section 1: Understanding Data Types
- Section 2: Converting Data Types
- Section 3: Understanding Missing Values
- Section 4: Filling Missing Values

---

# SECTION 1: UNDERSTANDING DATA TYPES

## 1. What are Data Types?
Data types define the kind of data stored in each column.  
Python/Pandas has several data types:

### Common Data Types in Pandas
- **int64**: Integer numbers (whole numbers)  
  Example: 1, 100, -50  

- **float64**: Decimal numbers (floating point)  
  Example: 3.14, 100.5, -2.75  

- **object**: Text/string data (also mixed types)  
  Example: "Apple", "Product Name", "Category A"  

- **bool**: Boolean values  
  Example: True, False  

- **datetime64**: Date and time values  
  Example: 2024-01-15, 2024-12-16 10:30:00  

- **category**: Categorical data (limited unique values)  
  Example: "Small", "Medium", "Large"  

---

## 2. Why are Data Types Important?
- **Memory Efficiency**  
  - Different types use different amounts of memory  
  - int32 uses less memory than int64  
  - category type saves memory for repeated values  

- **Operations**  
  - Mathematical operations require numeric types  
  - String operations (the `.str` methods) require text data, stored here as `object`  
  - Date operations require datetime type  

- **Analysis Accuracy**  
  - Wrong type can lead to incorrect results  
  - "100" (string) + "200" (string) = "100200" not 300  
  - Proper types enable correct calculations  

- **Machine Learning**  
  - Most ML algorithms need numeric data  
  - Categorical data needs proper encoding  
  - Wrong types cause errors in models  
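
The memory points above can be checked directly. The following is a minimal sketch on made-up data (the `labels` and `ids` Series are illustrative assumptions, not columns from the BigBasket dataset); it compares `object` against `category` and `int64` against `int32`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: three labels repeated many times, stored as object
labels = pd.Series(["Small", "Medium", "Large"] * 10_000, dtype="object")

# deep=True also counts the Python string objects, not just the pointers
object_bytes = labels.memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)

# Integer width: 8 bytes per value for int64, 4 bytes per value for int32
ids = pd.Series(np.arange(30_000), dtype="int64")
int64_bytes = ids.memory_usage(index=False)
int32_bytes = ids.astype("int32").memory_usage(index=False)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"int64:    {int64_bytes:,} bytes")
print(f"int32:    {int32_bytes:,} bytes")
```

The exact byte counts for the text column depend on the Python and Pandas versions, but the `category` version should be far smaller because each label is stored once and the rows hold only small integer codes.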

---

## 3. Common Data Type Problems
- Numbers stored as strings (object)  
- Categories stored as strings (object)  
- Dates stored as strings (object)  
- Using float64 when int64 would work  
- Not using category type for repeated values  
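
As an illustration of the first and third problems, here is a small sketch that assumes a hypothetical `raw` DataFrame in which every column was read as text. It uses `pd.to_numeric` and `pd.to_datetime`, which are one common way to repair such columns.

```python
import pandas as pd

# Hypothetical raw data where numbers and dates arrived as strings
raw = pd.DataFrame(
    {
        "price": ["100", "200", "n/a"],
        "order_date": ["2024-01-15", "2024-12-16", "2025-03-01"],
    },
    dtype="object",
)

# Adding two strings concatenates them instead of summing
concatenated = raw["price"].iloc[0] + raw["price"].iloc[1]

# errors="coerce" turns unparseable entries such as "n/a" into NaN
price = pd.to_numeric(raw["price"], errors="coerce")
order_date = pd.to_datetime(raw["order_date"], format="%Y-%m-%d")

print(concatenated)                  # 100200
print(price.sum())                   # 300.0
print(order_date.dt.year.tolist())   # [2024, 2024, 2025]
```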


In [69]:
# import the libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('BigBasket Products.csv')

In [70]:
# Check the DataFrame shape
df.shape

(27555, 10)

In [71]:
print("CURRENT DATA TYPES")
print(df.dtypes)

CURRENT DATA TYPES
index             int64
product          object
category         object
sub_category     object
brand            object
sale_price      float64
market_price    float64
type             object
rating          float64
description      object
dtype: object


In [72]:
# Display first few rows
df.head()

|   | index | product | category | sub_category | brand | sale_price | market_price | type | rating | description |
|---|-------|---------|----------|--------------|-------|------------|--------------|------|--------|-------------|
| 0 | 1 | Garlic Oil - Vegetarian Capsule 500 mg | Beauty & Hygiene | Hair Care | Sri Sri Ayurveda | 220.0 | 220.0 | Hair Oil & Serum | 4.1 | This Product contains Garlic Oil that is known... |
| 1 | 2 | Water Bottle - Orange | Kitchen, Garden & Pets | Storage & Accessories | Mastercook | 180.0 | 180.0 | Water & Fridge Bottles | 2.3 | Each product is microwave safe (without lid), ... |
| 2 | 3 | Brass Angle Deep - Plain, No.2 | Cleaning & Household | Pooja Needs | Trm | 119.0 | 250.0 | Lamp & Lamp Oil | 3.4 | A perfect gift for all occasions, be it your m... |
| 3 | 4 | Cereal Flip Lid Container/Storage Jar - Assort... | Cleaning & Household | Bins & Bathroom Ware | Nakoda | 149.0 | 176.0 | Laundry, Storage Baskets | 3.7 | Multipurpose container with an attractive desi... |
| 4 | 5 | Creme Soft Soap - For Hands & Body | Beauty & Hygiene | Bath & Hand Wash | Nivea | 162.0 | 162.0 | Bathing Bars & Soaps | 4.4 | Nivea Creme Soft Soap gives your skin the best... |


In [73]:
# Check memory usage
df.memory_usage(deep=True)

Index               132
index            220440
product         2291020
category        1876118
sub_category    1809064
brand           1583467
sale_price       220440
market_price     220440
type            1814320
rating           220440


In [74]:
print(f"\nTotal memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")


Total memory usage: 31.79 MB


In [75]:
# Analyze each column
for col in df.columns:
    print(f"\nColumn: {col}")
    print(f"  Data Type: {df[col].dtype}")
    print(f"  Non-Null Count: {df[col].notna().sum()}")
    print(f"  Null Count: {df[col].isnull().sum()}")
    print(f"  Unique Values: {df[col].nunique()}")
    if df[col].dtype == 'object':
        print(f"  Sample Values: {df[col].dropna().head(3).tolist()}")



Column: index
  Data Type: int64
  Non-Null Count: 27555
  Null Count: 0
  Unique Values: 27555

Column: product
  Data Type: object
  Non-Null Count: 27554
  Null Count: 1
  Unique Values: 23540
  Sample Values: ['Garlic Oil - Vegetarian Capsule 500 mg', 'Water Bottle - Orange', 'Brass Angle Deep - Plain, No.2']

Column: category
  Data Type: object
  Non-Null Count: 27555
  Null Count: 0
  Unique Values: 11
  Sample Values: ['Beauty & Hygiene', 'Kitchen, Garden & Pets', 'Cleaning & Household']

Column: sub_category
  Data Type: object
  Non-Null Count: 27555
  Null Count: 0
  Unique Values: 90
  Sample Values: ['Hair Care', 'Storage & Accessories', 'Pooja Needs']

Column: brand
  Data Type: object
  Non-Null Count: 27554
  Null Count: 1
  Unique Values: 2313
  Sample Values: ['Sri Sri Ayurveda ', 'Mastercook', 'Trm']

Column: sale_price
  Data Type: float64
  Non-Null Count: 27555
  Null Count: 0
  Unique Values: 3256

Column: market_price
  Data Type: float64
  Non-Null Count: 27555

# Object Columns Summary

| Column       | Non-Null Count | Null Count | Unique Values | Sample Values                                                                 |
|--------------|----------------|------------|---------------|-------------------------------------------------------------------------------|
| product      | 27554          | 1          | 23540         | ['Garlic Oil - Vegetarian Capsule 500 mg', 'Water Bottle - Orange', 'Brass Angle Deep - Plain, No.2'] |
| category     | 27555          | 0          | 11            | ['Beauty & Hygiene', 'Kitchen, Garden & Pets', 'Cleaning & Household']        |
| sub_category | 27555          | 0          | 90            | ['Hair Care', 'Storage & Accessories', 'Pooja Needs']                         |
| brand        | 27554          | 1          | 2313          | ['Sri Sri Ayurveda ', 'Mastercook', 'Trm']                                    |
| type         | 27555          | 0          | 426           | ['Hair Oil & Serum', 'Water & Fridge Bottles', 'Lamp & Lamp Oil']             |
| description  | 27440          | 115        | 21944         | ['This Product contains Garlic Oil that is known...', 'Each product is microwave safe...', 'A perfect gift for all occasions...'] |


# SECTION 2: CONVERTING DATA TYPES (15 minutes)


## 4. Why Convert Data Types?
Looking at our dataset, we need to convert data types for:

**a) CATEGORY COLUMNS**  
- Columns: `category`, `sub_category`, `brand`, `type`  
- Currently `object` but have limited unique values  
- Converting to `category` saves memory and improves performance  

**b) INDEX COLUMN**  
- Currently `int64` but could be `int32` (smaller numbers)  
- Saves memory  

**c) PRICE COLUMNS**  
- Currently `float64` but could check if `int` is sufficient  
- Depends on whether we have decimal values  

**d) RATING COLUMN**  
- Currently `float64`, which is appropriate for ratings  
- Keep as `float64`  
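
For point (c), one way to check whether a float column actually contains decimal values is to test the fractional part of each entry. This is a sketch on hypothetical Series (`prices` and `whole` are made up for illustration); on the lesson's data the same check could be run on `df['sale_price']`.

```python
import pandas as pd

# Hypothetical prices, one of which has a fractional part
prices = pd.Series([220.0, 180.0, 119.5, 162.0])

# A float column can safely become an integer column only if
# no value has a fractional part and nothing is missing
has_decimals = bool((prices.dropna() % 1 != 0).any())
print(has_decimals)   # True, so keep this column as float

# A column with only whole numbers and no NaN can be converted
whole = pd.Series([220.0, 180.0, 119.0])
if whole.notna().all() and not (whole % 1 != 0).any():
    whole = whole.astype("int64")
print(whole.dtype)    # int64
```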

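Let’s convert these systematically!

## Step 1: Create a Small Sample Dataset
We’ll first practice on a small DataFrame before converting the full dataset.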


In [76]:
# Small sample dataset
data = {
    "index": [1, 2, 3],
    "category": ["Beauty & Hygiene", "Kitchen", "Cleaning"],
    "brand": ["Nivea", "Mastercook", "Trm"],
    "sale_price": [220.0, 180.0, 119.0],
    "rating": [4.1, 2.3, 3.4]
}

df_small = pd.DataFrame(data)
print(df_small.dtypes)
df_small


index           int64
category       object
brand          object
sale_price    float64
rating        float64
dtype: object


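|   | index | category | brand | sale_price | rating |
|---|-------|----------|-------|------------|--------|
| 0 | 1 | Beauty & Hygiene | Nivea | 220.0 | 4.1 |
| 1 | 2 | Kitchen | Mastercook | 180.0 | 2.3 |
| 2 | 3 | Cleaning | Trm | 119.0 | 3.4 |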


## Step 2: Convert Data Types in Sample Data
- Convert `category` and `brand` from `object` → `category`  
- Convert `index` from `int64` → `int32`  
- Keep `sale_price` and `rating` as `float64` (since they have decimals).

In [77]:
# Convert categorical columns
df_small['category'] = df_small['category'].astype('category')

In [78]:
df_small['brand'] = df_small['brand'].astype('category')

In [79]:
# Convert index to int32
df_small['index'] = df_small['index'].astype('int32')

In [80]:
print(df_small.dtypes)
df_small

index            int32
category      category
brand         category
sale_price     float64
rating         float64
dtype: object


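|   | index | category | brand | sale_price | rating |
|---|-------|----------|-------|------------|--------|
| 0 | 1 | Beauty & Hygiene | Nivea | 220.0 | 4.1 |
| 1 | 2 | Kitchen | Mastercook | 180.0 | 2.3 |
| 2 | 3 | Cleaning | Trm | 119.0 | 3.4 |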


## Step 3: Apply the Same Logic to Full Dataset
Now that we understand the process, let’s apply it to the full dataset (`df`).  
We’ll convert:
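- `category`, `sub_category`, `brand`, `type`, `product` → `category` (note that `product` has 23,540 unique values in 27,555 rows, so its savings will be much smaller than for the other four columns)  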
- `index` → `int32` (if values fit)  
- Keep `sale_price`, `market_price`, `rating` as `float64`.

In [81]:
# Create a copy for conversion
df_converted = df.copy()

In [82]:
# Convert categorical columns
categorical_columns = ['category', 'sub_category', 'brand', 'type', 'product']
for col in categorical_columns:
    if col in df_converted.columns:
        df_converted[col] = df_converted[col].astype('category')

In [83]:
# Convert index to int32 only if every value fits in the int32 range
int32_limits = np.iinfo(np.int32)
if df_converted['index'].min() >= int32_limits.min and df_converted['index'].max() <= int32_limits.max:
    df_converted['index'] = df_converted['index'].astype('int32')

1. Why check against the `int32` limits?  
`np.iinfo(np.int32).max` is 2147483647, the maximum value of a 32-bit signed integer (`int32`).  

**Range of int32:**  
- Minimum: −2,147,483,648  
- Maximum: +2,147,483,647  

If every value in your column lies within this range (minimum ≥ −2,147,483,648 and maximum ≤ 2,147,483,647), then all values fit safely in `int32`.  
This allows you to downcast from `int64` → `int32`, halving the column's memory without losing information.  
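
The range check can also be tried on a small, made-up Series. The sketch below uses `np.iinfo` to look up the limits and `pd.to_numeric(..., downcast="integer")` as an alternative that picks the smallest integer type automatically; the `ids` and `too_big` Series are illustrative assumptions.

```python
import numpy as np
import pandas as pd

limits = np.iinfo(np.int32)
print(limits.min, limits.max)   # -2147483648 2147483647

# Hypothetical ID column similar in size to the lesson's index column
ids = pd.Series([1, 2, 27555], dtype="int64")
fits_int32 = bool(ids.between(limits.min, limits.max).all())

# A value one above the int32 maximum does not fit
too_big = pd.Series([2_147_483_648], dtype="int64")
too_big_fits = bool(too_big.between(limits.min, limits.max).all())

# downcast="integer" chooses the smallest integer dtype that holds every value
downcast = pd.to_numeric(ids, downcast="integer")
print(fits_int32, too_big_fits, downcast.dtype)   # True False int16
```

Automatic downcasting can go further than `int32` (here to `int16`), which saves more memory but leaves less headroom if larger values are appended later.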

---

2. Why keep `float64` instead of `float32`?  
`float64` (double precision) has higher precision than `float32`.  

- `float32` → ~7 decimal digits of precision  
- `float64` → ~15–16 decimal digits of precision  

**In your dataset:**  
- `sale_price`, `market_price` → monetary values. Precision matters if you later do calculations (discounts, averages, aggregations).  
- `rating` → decimal values (like 4.1, 3.4). Precision is important for statistical analysis.  

**Trade-off:**  
- `float32` uses 4 bytes per value (saves memory).  
- `float64` uses 8 bytes per value (more memory, but safer for calculations).  

Since prices and ratings are numeric values where precision matters, it’s safer to keep them as `float64`.  
If you were working with huge datasets and memory was a bigger concern, you could experiment with `float32` — but you risk rounding errors.
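
To make the rounding risk concrete, here is a short sketch using NumPy scalars. The specific numbers are chosen only for illustration and do not come from the dataset.

```python
import numpy as np

# Approximate number of reliable decimal digits for each type
print(np.finfo(np.float32).precision)
print(np.finfo(np.float64).precision)

# 16,777,217 (2**24 + 1) cannot be stored exactly in float32
as_float32 = float(np.float32(16_777_217.0))
as_float64 = float(np.float64(16_777_217.0))
print(as_float32)   # 16777216.0
print(as_float64)   # 16777217.0

# Even a simple decimal like 0.1 is stored slightly differently in float32
same_value = float(np.float32(0.1)) == 0.1
print(same_value)   # False
```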


In [84]:
# Keep price and rating as float64
print(df_converted.dtypes)

index              int32
product         category
category        category
sub_category    category
brand           category
sale_price       float64
market_price     float64
type            category
rating           float64
description       object
dtype: object


# SECTION 3: UNDERSTANDING MISSING VALUES

## 5. What are Missing Values?
Missing values are data points that are not present in the dataset.

In Pandas, missing values appear as:
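- `NaN` (Not a Number), used for missing numeric values  
- `None`, which Pandas also treats as missing  
- `NaT` (Not a Time), the missing marker in datetime columns  
- Empty cells in a source file, which `read_csv` loads as `NaN`  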
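
A quick way to see these forms in practice is sketched below. The CSV text is a made-up three-row example read from memory, not the lesson's file.

```python
import io

import numpy as np
import pandas as pd

# None and np.nan are both reported as missing
values = pd.Series([1.5, np.nan, None])
print(values.isna().tolist())   # [False, True, True]

# Empty CSV cells are read as NaN by default
csv_text = "product,rating\nSoap,4.5\nShampoo,\n,3.8\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
missing_per_column = df_csv.isna().sum().to_dict()
print(missing_per_column)       # {'product': 1, 'rating': 1}
```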

---

## 6. Why Do Missing Values Occur?
- Data not collected  
- Data entry errors  
- Data corruption  
- Optional fields not filled  
- Privacy concerns  

---

## 7. Why Handle Missing Values?
- Arithmetic propagates `NaN`, and some operations fail outright  
- Many machine learning models cannot process `NaN`  
- Statistical calculations may be misleading, because Pandas silently skips missing values by default  
- Visualization issues  
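
The effect on statistics is easy to overlook, so here is a small sketch on a hypothetical `ratings` Series. It shows that Pandas skips `NaN` by default while plain NumPy propagates it.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.5, np.nan, 3.8, 4.0])

pandas_mean = ratings.mean()               # skips NaN (skipna=True is the default)
strict_mean = ratings.mean(skipna=False)   # NaN propagates into the result
numpy_mean = np.mean(ratings.to_numpy())   # plain NumPy also returns NaN

print(round(pandas_mean, 2))   # 4.1
print(strict_mean)             # nan
print(numpy_mean)              # nan
```

The default mean of 4.1 is computed from three values, not four, which is worth knowing before comparing it with counts taken from the full column.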

### Step 1: Create a Small Sample Dataset
We’ll first build a small DataFrame with intentional missing values.  
This helps learners understand how Pandas detects and visualizes missing data before applying it to the full dataset.

In [85]:
# Small sample dataset with missing values
data_small = {
    "index": [1, 2, 3, 4],
    "product": ["Soap", "Shampoo", None, "Toothpaste"],
    "category": ["Beauty", "Beauty", "Household", None],
    "sale_price": [50.0, np.nan, 120.0, 80.0],
    "rating": [4.5, 3.8, np.nan, 4.0]
}

df_small = pd.DataFrame(data_small)
df_small

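|   | index | product | category | sale_price | rating |
|---|-------|---------|----------|------------|--------|
| 0 | 1 | Soap | Beauty | 50.0 | 4.5 |
| 1 | 2 | Shampoo | Beauty |  | 3.8 |
| 2 | 3 |  | Household | 120.0 |  |
| 3 | 4 | Toothpaste |  | 80.0 | 4.0 |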


### Step 2: Detect Missing Values in Sample Data
We’ll calculate:
- Count of missing values per column  
- Percentage of missing values per column  
- Data type of each column  

Then display only the columns that have missing values.


In [86]:
# Check for missing values
missing_count = df_small.isnull().sum()
missing_percentage = (missing_count / len(df_small)) * 100

In [87]:
missing_count

index         0
product       1
category      1
sale_price    1
rating        1


In [88]:
missing_percentage

index          0.0
product       25.0
category      25.0
sale_price    25.0
rating        25.0


In [89]:
missing_info = pd.DataFrame({
    'Column': missing_count.index,
    'Missing_Count': missing_count.values,
    'Missing_Percentage': missing_percentage.values,
    'Data_Type': df_small.dtypes.values
})

In [90]:
missing_info

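|   | Column | Missing_Count | Missing_Percentage | Data_Type |
|---|--------|---------------|--------------------|-----------|
| 0 | index | 0 | 0.0 | int64 |
| 1 | product | 1 | 25.0 | object |
| 2 | category | 1 | 25.0 | object |
| 3 | sale_price | 1 | 25.0 | float64 |
| 4 | rating | 1 | 25.0 | float64 |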


In [91]:
# Show only columns with missing values
missing_info = missing_info[missing_info['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

In [92]:
missing_info

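|   | Column | Missing_Count | Missing_Percentage | Data_Type |
|---|--------|---------------|--------------------|-----------|
| 1 | product | 1 | 25.0 | object |
| 2 | category | 1 | 25.0 | object |
| 3 | sale_price | 1 | 25.0 | float64 |
| 4 | rating | 1 | 25.0 | float64 |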


# Understanding `.to_string()` in Pandas

## What is `.to_string()`?
- `.to_string()` is a Pandas method that converts a **DataFrame** or **Series** into a **string representation**.  
- It is mainly used when printing data, ensuring the **entire DataFrame** is shown without truncation.  
- By default, Pandas may shorten the output if the DataFrame is large, but `.to_string()` forces it to display all rows and columns.
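
The difference between the default output and `.to_string()` can be seen with a sketch like the following. It assumes Pandas' default display options and a made-up 100-row DataFrame named `df_long`.

```python
import pandas as pd

# Hypothetical frame long enough to be truncated by the default display settings
df_long = pd.DataFrame({"n": range(100), "square": [i * i for i in range(100)]})

default_text = repr(df_long)       # middle rows are replaced by "..."
full_text = df_long.to_string()    # header plus all 100 rows

# index=False drops the row labels, which is handy for summary tables
no_index = df_long.head(3).to_string(index=False)

print(len(default_text.splitlines()), len(full_text.splitlines()))
print(no_index)
```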

---

## How Does It Work?
1. **Basic Usage**
   ```python
   print(df.to_string())
   ```


In [93]:
print("Columns with Missing Values:")
print(missing_info.to_string(index=False))

Columns with Missing Values:
    Column  Missing_Count  Missing_Percentage Data_Type
   product              1                25.0    object
  category              1                25.0    object
sale_price              1                25.0   float64
    rating              1                25.0   float64


### Step 3: Apply the Same Logic to Full Dataset
Now that we understand the process, let’s apply the same missing value detection to our full dataset (`df_converted`).


In [94]:
# Check for missing values in full dataset
missing_count = df_converted.isnull().sum()
missing_percentage = (missing_count / len(df_converted)) * 100

In [95]:
missing_count

index              0
product            1
category           0
sub_category       0
brand              1
sale_price         0
market_price       0
type               0
rating          8626
description      115


In [96]:
missing_percentage

index            0.0
product          0.003629
category         0.0
sub_category     0.0
brand            0.003629
sale_price       0.0
market_price     0.0
type             0.0
rating          31.304663
description      0.417347


In [97]:
missing_info = pd.DataFrame({
    'Column': missing_count.index,
    'Missing_Count': missing_count.values,
    'Missing_Percentage': missing_percentage.values,
    'Data_Type': df_converted.dtypes.values
})

In [98]:
missing_info

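|   | Column | Missing_Count | Missing_Percentage | Data_Type |
|---|--------|---------------|--------------------|-----------|
| 0 | index | 0 | 0.0 | int32 |
| 1 | product | 1 | 0.003629 | category |
| 2 | category | 0 | 0.0 | category |
| 3 | sub_category | 0 | 0.0 | category |
| 4 | brand | 1 | 0.003629 | category |
| 5 | sale_price | 0 | 0.0 | float64 |
| 6 | market_price | 0 | 0.0 | float64 |
| 7 | type | 0 | 0.0 | category |
| 8 | rating | 8626 | 31.304663 | float64 |
| 9 | description | 115 | 0.417347 | object |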


In [99]:
# Show only columns with missing values
missing_info = missing_info[missing_info['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

In [100]:
missing_info

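|   | Column | Missing_Count | Missing_Percentage | Data_Type |
|---|--------|---------------|--------------------|-----------|
| 8 | rating | 8626 | 31.304663 | float64 |
| 9 | description | 115 | 0.417347 | object |
| 4 | brand | 1 | 0.003629 | category |
| 1 | product | 1 | 0.003629 | category |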


In [101]:
print("Columns with Missing Values:")
print(missing_info.to_string(index=False))

Columns with Missing Values:
     Column  Missing_Count  Missing_Percentage Data_Type
     rating           8626           31.304663   float64
description            115            0.417347    object
      brand              1            0.003629  category
    product              1            0.003629  category


# SECTION 4: FILLING MISSING VALUES

## 8. Strategies for Filling Missing Values

### For Numerical Data
- Mean: Average value  
- Median: Middle value (robust to outliers)  
- Mode: Most frequent value  
- Forward Fill: Use previous value  
- Backward Fill: Use next value  
- Constant: Specific value (e.g., 0)  

### For Categorical Data
- Mode: Most frequent category  
- Forward Fill: Use previous category  
- Backward Fill: Use next category  
- Unknown: New category for missing  
- Custom value: Domain-specific value  
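
Several of these strategies are not used later in the lesson, so here is a brief sketch of them on made-up Series (`prices` and `categories` are illustrative assumptions).

```python
import numpy as np
import pandas as pd

prices = pd.Series([50.0, np.nan, np.nan, 120.0])

forward = prices.ffill()                   # carry the previous value forward
backward = prices.bfill()                  # pull the next value backward
constant = prices.fillna(0)                # replace with a fixed value
mean_fill = prices.fillna(prices.mean())   # mean of 50 and 120 is 85

print(forward.tolist())     # [50.0, 50.0, 50.0, 120.0]
print(backward.tolist())    # [50.0, 120.0, 120.0, 120.0]
print(constant.tolist())    # [50.0, 0.0, 0.0, 120.0]
print(mean_fill.tolist())   # [50.0, 85.0, 85.0, 120.0]

# Categorical example: mark missing entries with an explicit "Unknown" label
categories = pd.Series(["Beauty", None, "Household"], dtype="object")
unknown_fill = categories.fillna("Unknown")
print(unknown_fill.tolist())   # ['Beauty', 'Unknown', 'Household']
```

Forward and backward fill only make sense when row order carries meaning (for example, time series); for an unordered product list they would copy values from unrelated rows.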


In [102]:
# Step 1: Create a small sample dataset with missing values
data_small = {
    "index": [1, 2, 3, 4],
    "product": ["Soap", None, "Shampoo", "Toothpaste"],
    "brand": ["Nivea", "Nivea", None, "Colgate"],
    "sale_price": [50.0, 80.0, np.nan, 120.0],
    "rating": [4.5, np.nan, 3.8, 4.0],
    "description": ["Good soap", None, "Hair care product", "Dental hygiene"]
}

In [103]:
df_small = pd.DataFrame(data_small)
df_small

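|   | index | product | brand | sale_price | rating | description |
|---|-------|---------|-------|------------|--------|-------------|
| 0 | 1 | Soap | Nivea | 50.0 | 4.5 | Good soap |
| 1 | 2 |  | Nivea | 80.0 |  |  |
| 2 | 3 | Shampoo |  |  | 3.8 | Hair care product |
| 3 | 4 | Toothpaste | Colgate | 120.0 | 4.0 | Dental hygiene |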


In [104]:
# Missing value summary:
df_small.isnull().sum()

index          0
product        1
brand          1
sale_price     1
rating         1
description    1


## Step 2: Fill Missing Values in Sample Data
We’ll apply different strategies:
- Numerical column (`rating`) → fill with **median**  
- Numerical column (`sale_price`) → fill with **mean**  
- Categorical column (`product`) → fill with `"Unknown Product"`  
- Categorical column (`brand`) → fill with **mode**  
- Text column (`description`) → fill with `"No description available"`  

In [105]:
df_filled_small = df_small.copy()

# Fill numerical columns safely
df_filled_small['rating'] = df_filled_small['rating'].fillna(df_filled_small['rating'].median())
df_filled_small['sale_price'] = df_filled_small['sale_price'].fillna(df_filled_small['sale_price'].mean())

In [106]:
# Fill categorical/text columns safely
df_filled_small['product'] = df_filled_small['product'].fillna("Unknown Product")
df_filled_small['brand'] = df_filled_small['brand'].fillna(df_filled_small['brand'].mode()[0])
df_filled_small['description'] = df_filled_small['description'].fillna("No description available")

In [107]:
print("\nSample Data (After Filling):")
df_filled_small.head()


Sample Data (After Filling):


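|   | index | product | brand | sale_price | rating | description |
|---|-------|---------|-------|------------|--------|-------------|
| 0 | 1 | Soap | Nivea | 50.0 | 4.5 | Good soap |
| 1 | 2 | Unknown Product | Nivea | 80.0 | 4.0 | No description available |
| 2 | 3 | Shampoo | Nivea | 83.333333 | 3.8 | Hair care product |
| 3 | 4 | Toothpaste | Colgate | 120.0 | 4.0 | Dental hygiene |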


In [108]:
print("\nMissing Values Summary After Filling:")
print(df_filled_small.isnull().sum())


Missing Values Summary After Filling:
index          0
product        0
brand          0
sale_price     0
rating         0
description    0
dtype: int64


## Step 3: Apply the Same Logic to Full Dataset
Now that we understand the process, let’s apply the same filling strategies to our full dataset (`df_converted`).


In [109]:
# Create a working copy
df_filled = df_converted.copy()

In [110]:
# STEP 1: FILLING 'RATING' COLUMN (NUMERICAL DATA)
# Missing values before
df_filled['rating'].isnull().sum()

np.int64(8626)

In [111]:
# Fill with median
df_filled['rating'] = df_filled['rating'].fillna(df_filled['rating'].median())

In [112]:
# Missing values after
df_filled['rating'].isnull().sum()

np.int64(0)

In [113]:
# STEP 2: FILLING 'PRODUCT' COLUMN (CATEGORICAL DATA)
# Missing values before
df_filled['product'].isnull().sum()

np.int64(1)

In [114]:
# Add new category before filling
df_filled['product'] = df_filled['product'].cat.add_categories(["Unknown Product"])
df_filled['product'] = df_filled['product'].fillna("Unknown Product")

In [115]:
# Missing values after
df_filled['product'].isnull().sum()

np.int64(0)
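
The `add_categories` call in the step above is needed because a `category` column only accepts values that are already in its list of categories. The sketch below, on a made-up `products` Series, shows the failure and the fix. The exact exception type has varied between Pandas versions, so both `TypeError` and `ValueError` are caught here as an assumption.

```python
import pandas as pd

products = pd.Series(["Soap", None, "Shampoo"], dtype="category")

# Filling with a label that is not an existing category is rejected
try:
    products.fillna("Unknown Product")
    raised = False
except (TypeError, ValueError):
    raised = True
print(raised)   # True

# Register the new label first, then fill
filled = products.cat.add_categories(["Unknown Product"]).fillna("Unknown Product")
print(filled.tolist())   # ['Soap', 'Unknown Product', 'Shampoo']
```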

In [116]:
# STEP 3: FILLING 'BRAND' COLUMN (CATEGORICAL DATA)
# Missing values before
df_filled['brand'].isnull().sum()

np.int64(1)

In [117]:
# Fill with the mode; the mode is always an existing category, so no new category is needed
brand_mode = df_filled['brand'].mode()[0]
df_filled['brand'] = df_filled['brand'].fillna(brand_mode)

In [118]:
brand_mode = df_filled['brand'].mode()[0]
brand_mode

'Fresho'

# Understanding `mode()[0]` in Pandas

## What is `.mode()`?
- The **mode** is the most frequently occurring value in a dataset.
- In Pandas, `.mode()` returns a **Series** containing the most frequent value(s).
- If there are multiple modes (ties), `.mode()` will return all of them.

## Why use `[0]`?
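- `.mode()` returns a Series, not a single value.
- `[0]` selects the **first mode value** from that Series. When several values tie, the modes are returned in sorted order, so `[0]` picks the first one in that order.
- This ensures we get a plain scalar value (e.g., `"Nike"`) that can be used to fill missing values.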
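
Two edge cases are worth seeing once: ties, and a column with no non-missing values at all. This is a minimal sketch on made-up Series; the `"Unknown"` fallback is an illustrative choice, not part of the lesson's dataset.

```python
import pandas as pd

# Tie: "Adidas" and "Puma" both appear twice
tied = pd.Series(["Puma", "Adidas", "Adidas", "Puma", "Nike"])
modes = tied.mode()
print(modes.tolist())   # ['Adidas', 'Puma']
first_mode = modes.iloc[0]
print(first_mode)       # Adidas

# All values missing: mode() returns an empty Series, so guard before indexing
all_missing = pd.Series([None, None], dtype="object")
empty_modes = all_missing.mode()
fill_value = empty_modes.iloc[0] if not empty_modes.empty else "Unknown"
print(fill_value)       # Unknown
```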

---

## Example



In [119]:
import pandas as pd

# Sample data with missing values
# (named df_demo so the BigBasket DataFrame `df` is not overwritten)
data_demo = {"brand": ["Nike", "Adidas", "Nike", None, "Puma", "Nike", "Adidas", None]}
df_demo = pd.DataFrame(data_demo)
df_demo.head(10)

|   | brand |
|---|-------|
| 0 | Nike |
| 1 | Adidas |
| 2 | Nike |
| 3 |  |
| 4 | Puma |
| 5 | Nike |
| 6 | Adidas |
| 7 |  |


In [120]:
# Find the mode (stored under a separate name so brand_mode keeps the full-dataset value)
demo_mode = df_demo['brand'].mode()[0]
print("Mode of brand column:", demo_mode)

Mode of brand column: Nike


In [121]:
# Fill missing values with the mode
df_demo['brand'] = df_demo['brand'].fillna(demo_mode)
print(df_demo)

    brand
0    Nike
1  Adidas
2    Nike
3    Nike
4    Puma
5    Nike
6  Adidas
7    Nike


In [122]:
# Missing values after
df_filled['brand'].isnull().sum()

np.int64(0)

In [123]:
# STEP 4: FILLING 'DESCRIPTION' COLUMN (TEXT DATA)
# Missing values before
df_filled['description'].isnull().sum()

np.int64(115)

In [124]:
# Fill missing description values with a placeholder
# (description is a plain object column, so no new category is needed)
df_filled['description'] = df_filled['description'].fillna("No description available")

In [125]:
# Missing values after
df_filled['description'].isnull().sum()

np.int64(0)

In [126]:
# FINAL VERIFICATION
# Missing Values Summary After Filling
df_filled.isnull().sum()

index           0
product         0
category        0
sub_category    0
brand           0
sale_price      0
market_price    0
type            0
rating          0
description     0


In [127]:
# Sample of Cleaned Data
df_filled.head(10)

|   | index | product | category | sub_category | brand | sale_price | market_price | type | rating | description |
|---|-------|---------|----------|--------------|-------|------------|--------------|------|--------|-------------|
| 0 | 1 | Garlic Oil - Vegetarian Capsule 500 mg | Beauty & Hygiene | Hair Care | Sri Sri Ayurveda | 220.0 | 220.0 | Hair Oil & Serum | 4.1 | This Product contains Garlic Oil that is known... |
| 1 | 2 | Water Bottle - Orange | Kitchen, Garden & Pets | Storage & Accessories | Mastercook | 180.0 | 180.0 | Water & Fridge Bottles | 2.3 | Each product is microwave safe (without lid), ... |
| 2 | 3 | Brass Angle Deep - Plain, No.2 | Cleaning & Household | Pooja Needs | Trm | 119.0 | 250.0 | Lamp & Lamp Oil | 3.4 | A perfect gift for all occasions, be it your m... |
| 3 | 4 | Cereal Flip Lid Container/Storage Jar - Assort... | Cleaning & Household | Bins & Bathroom Ware | Nakoda | 149.0 | 176.0 | Laundry, Storage Baskets | 3.7 | Multipurpose container with an attractive desi... |
| 4 | 5 | Creme Soft Soap - For Hands & Body | Beauty & Hygiene | Bath & Hand Wash | Nivea | 162.0 | 162.0 | Bathing Bars & Soaps | 4.4 | Nivea Creme Soft Soap gives your skin the best... |
| 5 | 6 | Germ - Removal Multipurpose Wipes | Cleaning & Household | All Purpose Cleaners | Nature Protect | 169.0 | 199.0 | Disinfectant Spray & Cleaners | 3.3 | Stay protected from contamination with Multipu... |
| 6 | 7 | Multani Mati | Beauty & Hygiene | Skin Care | Satinance | 58.0 | 58.0 | Face Care | 3.6 | Satinance multani matti is an excellent skin t... |
| 7 | 8 | Hand Sanitizer - 70% Alcohol Base | Beauty & Hygiene | Bath & Hand Wash | Bionova | 250.0 | 250.0 | Hand Wash & Sanitizers | 4.0 | 70%Alcohol based is gentle of hand leaves skin... |
| 8 | 9 | Biotin & Collagen Volumizing Hair Shampoo + Bi... | Beauty & Hygiene | Hair Care | StBotanica | 1098.0 | 1098.0 | Shampoo & Conditioner | 3.5 | An exclusive blend with Vitamin B7 Biotin, Hyd... |
| 9 | 10 | Scrub Pad - Anti- Bacterial, Regular | Cleaning & Household | Mops, Brushes & Scrubs | Scotch brite | 20.0 | 20.0 | Utensil Scrub-Pad, Glove | 4.3 | Scotch Brite Anti- Bacterial Scrub Pad thoroug... |


In [128]:
# Data Types in Cleaned Dataset
df_filled.dtypes

index              int32
product         category
category        category
sub_category    category
brand           category
sale_price       float64
market_price     float64
type            category
rating           float64
description       object


# Decision on Data Types

In our dataset, most categorical fields (such as `product`, `category`, `sub_category`, `brand`, and `type`) have been converted to **category dtype** to save memory and improve efficiency.  
Numerical fields (`sale_price`, `market_price`, `rating`) are kept as **float64** for precision.  
The `index` column is stored as **int32** since its values fit safely within the 32-bit integer range.

## Why keep `description` as object?
- The `description` column contains **free‑form text data** (long sentences or product descriptions).  
- Converting it to `category` would not be efficient because:
  - It has many unique values, so memory savings would be minimal.  
  - Treating descriptions as categories would limit flexibility in text processing.  
- Keeping it as `object` ensures:
  - Full support for string operations (`.str` methods in Pandas).  
  - Flexibility for future NLP tasks (tokenization, embeddings, etc.).  
  - Avoids unnecessary conversion overhead.
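
One rough way to decide between `object` and `category` is to look at how many of a column's values are distinct. The sketch below uses a hypothetical helper, `unique_ratio`, and two made-up columns; it is an illustration of the reasoning above, not a rule built into Pandas.

```python
import pandas as pd


def unique_ratio(series: pd.Series) -> float:
    """Share of non-null values that are distinct (hypothetical helper)."""
    non_null = series.dropna()
    return non_null.nunique() / len(non_null) if len(non_null) else 0.0


# Few distinct labels repeated often, versus free text that is unique per row
low_card = pd.Series(["Beauty", "Household"] * 500, dtype="object")
high_card = pd.Series([f"Description of product {i}" for i in range(1000)], dtype="object")

for name, col in [("low_card", low_card), ("high_card", high_card)]:
    as_object = col.memory_usage(deep=True)
    as_category = col.astype("category").memory_usage(deep=True)
    print(name, round(unique_ratio(col), 3), as_object, as_category)
```

A ratio close to 0 suggests `category` will pay off; a ratio close to 1, as with free-form descriptions, suggests the savings largely disappear because every distinct string still has to be stored.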


In [129]:
%pip install ydata-profiling



In [130]:
#from ydata_profiling import ProfileReport

#profile = ProfileReport(df_filled, title="EDA Report", explorative=True)
#profile.to_file("eda_report.html")   # Generates full report as HTML