![dataset-cover.jpg](attachment:e3f74c73-1982-4912-9624-5892e9d6a351.jpg)

# **Project Tutorial: Amazon Products**

## I. **Project Description**: 

**This tutorial project focuses on cleaning and exploring the Amazon products dataset downloaded from Kaggle:** [Datasource](https://www.kaggle.com/datasets/spypsc07/amazon-products)

## II. **Dataset Description**: 

**The dataset contains information about products sold on Amazon, with feature attributes such as _price_, _ratings_, _availability_, and _sales volume_.**

## **1. Loading the Dependencies**

In [186]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
pd.set_option("display.max_columns", 200)

## **2. Loading Dataset**

In [187]:
amazon_prod = pd.read_csv("amazon_product.csv")
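
The `Unnamed: 0` column that shows up later in this notebook is usually a row index that was written to the CSV by an earlier export. A minimal sketch, using a small in-memory CSV as a hypothetical stand-in for `amazon_product.csv`, of reading such a column back as the index with `index_col=0`:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for amazon_product.csv. The empty first
# header field is what pandas would otherwise label "Unnamed: 0".
sample_csv = io.StringIO(
    ",asin,product_price\n"
    "0,B0BQ118F2T,$99.99\n"
    "1,B0CTD47P22,$149.74\n"
)

# index_col=0 tells pandas to use the first column as the row index
# instead of keeping it as a regular data column.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

This tutorial instead keeps the default `read_csv` call and drops the column explicitly during cleaning, which works equally well.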

## 3. Feature Engineering Process: **_Understanding Dataset_**
 * **Head or First 5 rows of data**
 * **Size of dataset: shape**
 * **Data columns**
 * **Data types, grouped into object (text), numeric, and boolean columns**
 * **Information about dataset**
 * **Statistical description of dataset**
 * **Checking for NaN or missing values and sort in ascending order for clarity**

### **i. Head or First 5 rows of data**

In [188]:
amazon_prod.head(5) # Display the first 5 rows of the Amazon dataset

Unnamed: 0.1,Unnamed: 0,asin,product_title,product_price,product_original_price,currency,product_star_rating,product_num_ratings,product_url,product_photo,product_num_offers,product_minimum_offer_price,is_best_seller,is_amazon_choice,is_prime,climate_pledge_friendly,sales_volume,delivery,has_variations,product_availability,unit_price,unit_count
0,0,B0BQ118F2T,Moto G Play 2023 3-Day Battery Unlocked Made f...,$99.99,$169.99,USD,4.0,2929,https://www.amazon.com/dp/B0BQ118F2T,https://m.media-amazon.com/images/I/61K1Fz5Lxv...,10,$64.89,False,False,True,False,6K+ bought in past month,"FREE delivery Tue, Aug 6",True,,,
1,1,B0CTD47P22,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",$149.74,$158.00,USD,4.2,135,https://www.amazon.com/dp/B0CTD47P22,https://m.media-amazon.com/images/I/51QhB2CfqS...,8,$145.87,False,False,True,False,3K+ bought in past month,"FREE delivery Wed, Aug 7 Only 7 left in stock ...",False,Only 7 left in stock - order soon.,,
2,2,B0CHH6X6H2,Total by Verizon | Samsung Galaxy A03s | Locke...,$49.88,,USD,3.9,205,https://www.amazon.com/dp/B0CHH6X6H2,https://m.media-amazon.com/images/I/812woqv69C...,1,$49.88,False,False,True,False,2K+ bought in past month,"FREE delivery Tue, Aug 6",False,,,
3,3,B0BZ9XNBRB,Google Pixel 7a - Unlocked Android Cell Phone ...,$335.00,$499.00,USD,4.3,2248,https://www.amazon.com/dp/B0BZ9XNBRB,https://m.media-amazon.com/images/I/61r7cCpQPl...,30,$289.99,False,False,False,False,10K+ bought in past month,FREE delivery Aug 6 - 8,True,,,
4,4,B0CN1QSH8Q,"SAMSUNG Galaxy A15 5G A Series Cell Phone, 128...",$199.99,,USD,4.1,423,https://www.amazon.com/dp/B0CN1QSH8Q,https://m.media-amazon.com/images/I/61s0ZzwzSC...,2,$150.09,False,False,True,True,3K+ bought in past month,"FREE delivery Tue, Aug 6",True,,,


### **ii. Size of dataset: shape**

In [189]:
amazon_prod.shape

(64, 22)

#### **_Observation_**: Dataset has **64 rows** and **22 feature columns**

### **iii. Data columns**

In [190]:
amazon_prod.columns

Index(['Unnamed: 0', 'asin', 'product_title', 'product_price',
       'product_original_price', 'currency', 'product_star_rating',
       'product_num_ratings', 'product_url', 'product_photo',
       'product_num_offers', 'product_minimum_offer_price', 'is_best_seller',
       'is_amazon_choice', 'is_prime', 'climate_pledge_friendly',
       'sales_volume', 'delivery', 'has_variations', 'product_availability',
       'unit_price', 'unit_count'],
      dtype='object')

### **iv. Data types, grouped into object (text), numeric, and boolean columns**

In [191]:
amazon_prod.dtypes

Unnamed: 0                       int64
asin                            object
product_title                   object
product_price                   object
product_original_price          object
currency                        object
product_star_rating            float64
product_num_ratings              int64
product_url                     object
product_photo                   object
product_num_offers               int64
product_minimum_offer_price     object
is_best_seller                    bool
is_amazon_choice                  bool
is_prime                          bool
climate_pledge_friendly           bool
sales_volume                    object
delivery                        object
has_variations                    bool
product_availability            object
unit_price                      object
unit_count                     float64
dtype: object

#### **Grouping into various data types:**

In [192]:
nominal_col = amazon_prod.select_dtypes(include = ["object"]).columns.tolist()

In [193]:
ordinal_col = amazon_prod.select_dtypes(include = ["number"]).columns.tolist()

In [194]:
bool_col = amazon_prod.select_dtypes(include = ["bool"]).columns.tolist()

In [195]:
nominal_col

['asin',
 'product_title',
 'product_price',
 'product_original_price',
 'currency',
 'product_url',
 'product_photo',
 'product_minimum_offer_price',
 'sales_volume',
 'delivery',
 'product_availability',
 'unit_price']

In [196]:
len(nominal_col) # We have 12 object (text) columns

12

In [197]:
ordinal_col

['Unnamed: 0',
 'product_star_rating',
 'product_num_ratings',
 'product_num_offers',
 'unit_count']

In [198]:
len(ordinal_col) # We have 5 numeric columns

5

In [199]:
bool_col 

['is_best_seller',
 'is_amazon_choice',
 'is_prime',
 'climate_pledge_friendly',
 'has_variations']

In [200]:
len(bool_col) # We have 5 boolean columns

5

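#### **Observation**: We have **_12 object (text) columns_**, **_5 numeric columns_**, and **_5 boolean columns_**

The names `nominal_col` and `ordinal_col` are loose labels: `select_dtypes` groups columns by how pandas stores them, not by measurement scale. Price strings such as `$99.99` therefore land in the object group even though they represent quantities.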
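
The grouping above can be sketched on a small hypothetical frame that mirrors a few of the dataset's columns. The sample values are made up for illustration, and `dtype="object"` is stated explicitly so the sketch behaves the same across pandas versions:

```python
import pandas as pd

# Hypothetical three-row sample mirroring a few columns of the dataset.
sample = pd.DataFrame({
    "product_title": pd.Series(["Phone A", "Phone B", "Phone C"], dtype="object"),
    "product_price": pd.Series(["$99.99", "$149.74", "$49.88"], dtype="object"),
    "product_star_rating": [4.0, 4.2, 3.9],
    "is_prime": [True, True, False],
})

# Group column names by storage dtype. "number" does not include bool,
# which is why the boolean columns are collected separately.
by_dtype = {
    "object": sample.select_dtypes(include=["object"]).columns.tolist(),
    "number": sample.select_dtypes(include=["number"]).columns.tolist(),
    "bool": sample.select_dtypes(include=["bool"]).columns.tolist(),
}
print(by_dtype)
```

Note that `product_price` is grouped with the text columns here, just as in the real dataset, until the `$` sign is removed and the column is converted to a numeric type.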

### **v. Information about Dataset**

In [201]:
amazon_prod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 22 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   64 non-null     int64  
 1   asin                         64 non-null     object 
 2   product_title                64 non-null     object 
 3   product_price                64 non-null     object 
 4   product_original_price       27 non-null     object 
 5   currency                     64 non-null     object 
 6   product_star_rating          54 non-null     float64
 7   product_num_ratings          64 non-null     int64  
 8   product_url                  64 non-null     object 
 9   product_photo                64 non-null     object 
 10  product_num_offers           64 non-null     int64  
 11  product_minimum_offer_price  64 non-null     object 
 12  is_best_seller               64 non-null     bool   
 13  is_amazon_choice      

#### **_Observation_**: Some columns have fewer than 64 non-null entries (for example, `product_original_price` has 27 and `product_star_rating` has 54), so the dataset contains missing values

### **vi. Statistical Description of Dataset**

In [202]:
amazon_prod.describe().T # This statistical description covers the 5 numeric columns

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,64.0,31.5,18.618987,0.0,15.75,31.5,47.25,63.0
product_star_rating,54.0,4.244444,0.558187,1.8,4.1,4.4,4.6,4.8
product_num_ratings,64.0,52101.640625,157685.541926,0.0,47.75,967.5,13846.25,1015448.0
product_num_offers,64.0,4.859375,8.145243,1.0,1.0,1.0,4.0,34.0
unit_count,5.0,15.2,25.390943,0.0,2.0,3.0,11.0,60.0


### **vii. Checking for NaN or missing values and sort in ascending order for clarity**

In [203]:
amazon_prod.isna().sum().sort_values()

Unnamed: 0                      0
has_variations                  0
climate_pledge_friendly         0
is_prime                        0
is_amazon_choice                0
is_best_seller                  0
product_minimum_offer_price     0
product_photo                   0
product_num_offers              0
product_num_ratings             0
currency                        0
product_price                   0
product_title                   0
asin                            0
product_url                     0
delivery                        1
sales_volume                    3
product_star_rating            10
product_original_price         37
unit_price                     59
unit_count                     59
product_availability           63
dtype: int64

### **Observation**: _Columns with missing values_:
* **delivery: 1**
* **sales_volume: 3**
* **product_star_rating: 10**
* **product_original_price: 37**
* **unit_price: 59**
* **unit_count: 59**
* **product_availability: 63**
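
Raw counts are easier to judge as a share of the 64 rows: 63 missing values is about 98% of `product_availability`, while 10 missing values is about 16% of `product_star_rating`. A minimal sketch of computing that percentage per column, on a hypothetical four-row sample rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with different amounts of missing data per column.
sample = pd.DataFrame({
    "product_price": ["$99.99", "$149.74", "$49.88", "$335.00"],
    "product_star_rating": [4.0, np.nan, 3.9, 4.3],
    "unit_count": [np.nan, np.nan, np.nan, 2.0],
})

# isna() yields True/False per cell; the mean of booleans is the fraction
# of True values, so this gives the percentage of missing cells per column.
missing_pct = (sample.isna().mean() * 100).sort_values()
print(missing_pct)
```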

## 4. Feature Engineering Process: **_First Data Cleaning Steps_**

* **Deleting redundant columns.**
* **Renaming the columns.**
* **Dropping duplicates.**
* **Remove the NaN values from the dataset**
* **Cleaning individual columns.**
* **Check for some more Transformations**

### i. **Deleting Redundant Columns.**

In [204]:
amazon_prod.columns

Index(['Unnamed: 0', 'asin', 'product_title', 'product_price',
       'product_original_price', 'currency', 'product_star_rating',
       'product_num_ratings', 'product_url', 'product_photo',
       'product_num_offers', 'product_minimum_offer_price', 'is_best_seller',
       'is_amazon_choice', 'is_prime', 'climate_pledge_friendly',
       'sales_volume', 'delivery', 'has_variations', 'product_availability',
       'unit_price', 'unit_count'],
      dtype='object')

In [205]:
# Drop the nine columns treated as redundant in this tutorial.
amazon_prod.drop(columns=['Unnamed: 0', 'asin', 'product_url', 'product_photo',
                          'product_num_offers', 'product_minimum_offer_price',
                          'delivery', 'currency', 'sales_volume'], inplace=True)

In [206]:
amazon_prod.head(5)

Unnamed: 0,product_title,product_price,product_original_price,product_star_rating,product_num_ratings,is_best_seller,is_amazon_choice,is_prime,climate_pledge_friendly,has_variations,product_availability,unit_price,unit_count
0,Moto G Play 2023 3-Day Battery Unlocked Made f...,$99.99,$169.99,4.0,2929,False,False,True,False,True,,,
1,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",$149.74,$158.00,4.2,135,False,False,True,False,False,Only 7 left in stock - order soon.,,
2,Total by Verizon | Samsung Galaxy A03s | Locke...,$49.88,,3.9,205,False,False,True,False,False,,,
3,Google Pixel 7a - Unlocked Android Cell Phone ...,$335.00,$499.00,4.3,2248,False,False,False,False,True,,,
4,"SAMSUNG Galaxy A15 5G A Series Cell Phone, 128...",$199.99,,4.1,423,False,False,True,True,True,,,


In [207]:
amazon_prod.shape

(64, 13)

#### **_Observation_**: New shape is _64 rows and 13 feature columns_
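
When a drop list is assembled by hand, a misspelled or already-removed label raises a `KeyError`. A minimal sketch, on a hypothetical two-row frame, of using `errors="ignore"` so that labels missing from the frame are skipped:

```python
import pandas as pd

# Hypothetical frame with a subset of the dataset's columns.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "asin": ["B0BQ118F2T", "B0CTD47P22"],
    "product_title": ["Phone A", "Phone B"],
    "product_price": ["$99.99", "$149.74"],
})

# "product_url" is deliberately absent from the sample frame.
to_drop = ["Unnamed: 0", "asin", "product_url"]

# errors="ignore" skips labels that are not present instead of raising.
# Without inplace=True, drop returns a new frame and leaves `sample` intact.
trimmed = sample.drop(columns=to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

The trade-off is that `errors="ignore"` also hides genuine typos, so the default of raising may be preferable when the column list is expected to match exactly.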

### **ii. Renaming the columns**.

In [208]:
amazon_prod.columns

Index(['product_title', 'product_price', 'product_original_price',
       'product_star_rating', 'product_num_ratings', 'is_best_seller',
       'is_amazon_choice', 'is_prime', 'climate_pledge_friendly',
       'has_variations', 'product_availability', 'unit_price', 'unit_count'],
      dtype='object')

In [209]:
New_col_named = []
for x in amazon_prod.columns:
    New_col_named.append(x.title())

In [210]:
New_col_named

['Product_Title',
 'Product_Price',
 'Product_Original_Price',
 'Product_Star_Rating',
 'Product_Num_Ratings',
 'Is_Best_Seller',
 'Is_Amazon_Choice',
 'Is_Prime',
 'Climate_Pledge_Friendly',
 'Has_Variations',
 'Product_Availability',
 'Unit_Price',
 'Unit_Count']

In [211]:
amazon_prod.columns = New_col_named
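
The loop-and-assign steps above can also be written as a single call. A minimal sketch, on a hypothetical empty frame with three of the original column names, that passes the built-in `str.title` to `DataFrame.rename`:

```python
import pandas as pd

# Hypothetical empty frame; only the column labels matter here.
sample = pd.DataFrame(columns=["product_title", "is_best_seller", "unit_count"])

# rename applies the callable to every column label and returns a new frame.
renamed = sample.rename(columns=str.title)
print(renamed.columns.tolist())
```

`str.title` capitalizes the first letter after each non-letter character, which is why the underscores are preserved and each word segment is capitalized.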

In [212]:
amazon_prod.head(5)

Unnamed: 0,Product_Title,Product_Price,Product_Original_Price,Product_Star_Rating,Product_Num_Ratings,Is_Best_Seller,Is_Amazon_Choice,Is_Prime,Climate_Pledge_Friendly,Has_Variations,Product_Availability,Unit_Price,Unit_Count
0,Moto G Play 2023 3-Day Battery Unlocked Made f...,$99.99,$169.99,4.0,2929,False,False,True,False,True,,,
1,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",$149.74,$158.00,4.2,135,False,False,True,False,False,Only 7 left in stock - order soon.,,
2,Total by Verizon | Samsung Galaxy A03s | Locke...,$49.88,,3.9,205,False,False,True,False,False,,,
3,Google Pixel 7a - Unlocked Android Cell Phone ...,$335.00,$499.00,4.3,2248,False,False,False,False,True,,,
4,"SAMSUNG Galaxy A15 5G A Series Cell Phone, 128...",$199.99,,4.1,423,False,False,True,True,True,,,


### **iii. Dropping Duplicates.**

In [213]:
amazon_prod.duplicated(keep = 'first').sum()

0

#### **_Observation_**: We have no duplicates in the dataset
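
This dataset has no duplicate rows, so nothing needs to be dropped. For reference, a minimal sketch on a hypothetical frame that does contain one repeated row, showing how the count relates to `drop_duplicates`:

```python
import pandas as pd

# Hypothetical frame where the third row repeats the first.
sample = pd.DataFrame({
    "Product_Title": ["Phone A", "Phone B", "Phone A"],
    "Product_Price": ["$99.99", "$149.74", "$99.99"],
})

# duplicated(keep="first") marks every repeat after the first occurrence.
n_dupes = sample.duplicated(keep="first").sum()

# drop_duplicates removes exactly the rows marked above.
deduped = sample.drop_duplicates(keep="first").reset_index(drop=True)
print(n_dupes, len(deduped))
```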

### **iii. Remove the NaN values from the dataset**

In [214]:
amazon_prod.isna().sum().sort_values()

Product_Title               0
Product_Price               0
Product_Num_Ratings         0
Is_Best_Seller              0
Is_Amazon_Choice            0
Is_Prime                    0
Climate_Pledge_Friendly     0
Has_Variations              0
Product_Star_Rating        10
Product_Original_Price     37
Unit_Price                 59
Unit_Count                 59
Product_Availability       63
dtype: int64

#### **a. Dropping columns with excessive missing values, plus three complete boolean columns that are not carried forward**

In [215]:
# Drop four sparse columns and three boolean columns that are not carried forward.
amazon_prod.drop(columns=['Product_Original_Price', 'Unit_Price', 'Unit_Count',
                          'Product_Availability', 'Climate_Pledge_Friendly',
                          'Has_Variations', 'Is_Prime'], inplace=True)

In [216]:
amazon_prod.head(5)

Unnamed: 0,Product_Title,Product_Price,Product_Star_Rating,Product_Num_Ratings,Is_Best_Seller,Is_Amazon_Choice
0,Moto G Play 2023 3-Day Battery Unlocked Made f...,$99.99,4.0,2929,False,False
1,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",$149.74,4.2,135,False,False
2,Total by Verizon | Samsung Galaxy A03s | Locke...,$49.88,3.9,205,False,False
3,Google Pixel 7a - Unlocked Android Cell Phone ...,$335.00,4.3,2248,False,False
4,"SAMSUNG Galaxy A15 5G A Series Cell Phone, 128...",$199.99,4.1,423,False,False


In [217]:
amazon_prod.shape # New number of columns = 6

(64, 6)
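
The sparse columns were picked by eye from the missing-value counts. A minimal sketch of selecting them by rule instead, on a hypothetical four-row sample. The 50% cutoff is an assumption for illustration, not a value taken from this notebook, although on the real counts it would flag the same four sparse columns (37, 59, 59, and 63 missing out of 64) and keep `Product_Star_Rating` (10 missing):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one mostly-empty column.
sample = pd.DataFrame({
    "Product_Price": ["$99.99", "$149.74", "$49.88", "$335.00"],
    "Product_Star_Rating": [4.0, np.nan, 3.9, 4.3],
    "Unit_Count": [np.nan, np.nan, np.nan, 2.0],
})

max_missing = 0.5  # assumed cutoff: drop columns that are more than 50% empty

# Fraction of missing cells per column, then the labels above the cutoff.
missing_frac = sample.isna().mean()
sparse_cols = missing_frac[missing_frac > max_missing].index.tolist()

reduced = sample.drop(columns=sparse_cols)
print(sparse_cols, reduced.columns.tolist())
```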

In [218]:
amazon_prod.isna().sum().sort_values()

Product_Title           0
Product_Price           0
Product_Num_Ratings     0
Is_Best_Seller          0
Is_Amazon_Choice        0
Product_Star_Rating    10
dtype: int64

In [219]:
amazon_prod.dropna(how ='any', inplace = True)

In [220]:
amazon_prod.isna().sum()

Product_Title          0
Product_Price          0
Product_Star_Rating    0
Product_Num_Ratings    0
Is_Best_Seller         0
Is_Amazon_Choice       0
dtype: int64
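
Because `Product_Star_Rating` was the only remaining column with missing values, `dropna(how='any')` removes the 10 rows without a rating and leaves 54 rows. A minimal sketch, on a hypothetical three-row frame, that names the column explicitly with `subset` and records how many rows were removed:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing rating.
sample = pd.DataFrame({
    "Product_Title": ["Phone A", "Phone B", "Phone C"],
    "Product_Star_Rating": [4.0, np.nan, 3.9],
})

rows_before = len(sample)

# subset limits the NaN check to the listed columns, which documents
# exactly which column is responsible for the dropped rows.
cleaned = sample.dropna(subset=["Product_Star_Rating"]).reset_index(drop=True)

rows_removed = rows_before - len(cleaned)
print(rows_removed)
```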

### **iv. Cleaning Individual Columns**: checking for duplicates across a subset of columns

In [221]:
amazon_prod.duplicated(subset=['Product_Title', 'Product_Price', 'Product_Star_Rating', 'Is_Best_Seller']).sum()

0

In [222]:
amazon_prod.head(5)

Unnamed: 0,Product_Title,Product_Price,Product_Star_Rating,Product_Num_Ratings,Is_Best_Seller,Is_Amazon_Choice
0,Moto G Play 2023 3-Day Battery Unlocked Made f...,$99.99,4.0,2929,False,False
1,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",$149.74,4.2,135,False,False
2,Total by Verizon | Samsung Galaxy A03s | Locke...,$49.88,3.9,205,False,False
3,Google Pixel 7a - Unlocked Android Cell Phone ...,$335.00,4.3,2248,False,False
4,"SAMSUNG Galaxy A15 5G A Series Cell Phone, 128...",$199.99,4.1,423,False,False


### **v. Check for some more Transformations**

#### a. Removing $ from the Product_Price

In [223]:
amazon_prod['Product_Price'] = amazon_prod['Product_Price'].str.replace("$", "", regex=False)  # treat "$" as a literal character, not a regex anchor

In [224]:
amazon_prod.head(3)

Unnamed: 0,Product_Title,Product_Price,Product_Star_Rating,Product_Num_Ratings,Is_Best_Seller,Is_Amazon_Choice
0,Moto G Play 2023 3-Day Battery Unlocked Made f...,99.99,4.0,2929,False,False
1,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",149.74,4.2,135,False,False
2,Total by Verizon | Samsung Galaxy A03s | Locke...,49.88,3.9,205,False,False


In [225]:
amazon_prod.dtypes

Product_Title           object
Product_Price           object
Product_Star_Rating    float64
Product_Num_Ratings      int64
Is_Best_Seller            bool
Is_Amazon_Choice          bool
dtype: object

In [226]:
amazon_prod["Product_Price"] = amazon_prod["Product_Price"].astype(float)

In [227]:
amazon_prod.dtypes

Product_Title           object
Product_Price          float64
Product_Star_Rating    float64
Product_Num_Ratings      int64
Is_Best_Seller            bool
Is_Amazon_Choice          bool
dtype: object
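
The two-step conversion above (strip `$`, then `astype(float)`) works for this dataset, but `astype(float)` would raise on a price with a thousands separator or a non-numeric placeholder. A minimal sketch of a more tolerant variant. The values `$1,299.00` and `N/A` are hypothetical and do not come from the dataset:

```python
import pandas as pd

# Hypothetical price strings, including two awkward cases.
prices = pd.Series(["$99.99", "$1,299.00", "$49.88", "N/A"])

# Inside a character class, "$" is a literal, so this strips "$" and ",".
stripped = prices.str.replace(r"[$,]", "", regex=True)

# errors="coerce" turns anything that still is not a number into NaN
# instead of raising, so bad values can be inspected afterwards.
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric)
```

Coercing to `NaN` hides bad inputs rather than reporting them, so it is worth checking `numeric.isna().sum()` after the conversion.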

#### b. Converting the boolean columns Is_Best_Seller and Is_Amazon_Choice to 1 (Yes) and 0 (No)

In [228]:
amazon_prod["Is_Best_Seller"] = amazon_prod["Is_Best_Seller"].astype("int64")  # True -> 1, False -> 0

In [229]:
amazon_prod["Is_Amazon_Choice"] = amazon_prod["Is_Amazon_Choice"].astype("int64")  # True -> 1, False -> 0

In [230]:
amazon_prod.head(5)

Unnamed: 0,Product_Title,Product_Price,Product_Star_Rating,Product_Num_Ratings,Is_Best_Seller,Is_Amazon_Choice
0,Moto G Play 2023 3-Day Battery Unlocked Made f...,99.99,4.0,2929,0,0
1,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",149.74,4.2,135,0,0
2,Total by Verizon | Samsung Galaxy A03s | Locke...,49.88,3.9,205,0,0
3,Google Pixel 7a - Unlocked Android Cell Phone ...,335.0,4.3,2248,0,0
4,"SAMSUNG Galaxy A15 5G A Series Cell Phone, 128...",199.99,4.1,423,0,0


In [234]:
amazon_prod["Is_Best_Seller"].unique()

array([0, 1], dtype=int64)

In [235]:
amazon_prod["Is_Amazon_Choice"].unique()

array([0, 1], dtype=int64)
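
Both flag columns can also be converted in one step. A minimal sketch, on a hypothetical three-row frame, that passes a column-to-dtype mapping to `DataFrame.astype`:

```python
import pandas as pd

# Hypothetical flags for three products.
flags = pd.DataFrame({
    "Product_Title": ["Phone A", "Phone B", "Phone C"],
    "Is_Best_Seller": [False, True, False],
    "Is_Amazon_Choice": [True, False, False],
})

flag_cols = ["Is_Best_Seller", "Is_Amazon_Choice"]

# astype accepts a {column: dtype} mapping; unlisted columns are unchanged.
# Casting bool to an integer dtype maps True -> 1 and False -> 0.
flags = flags.astype({col: "int64" for col in flag_cols})
print(flags.dtypes)
```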

### **vi. Saving the Final Cleaned Data**

In [237]:
amazon_prod.to_csv("Cleaned_amazon_prod.csv", index = False)

In [238]:
Cleaned_amazon_prod = pd.read_csv("Cleaned_amazon_prod.csv")

In [239]:
Cleaned_amazon_prod.head()

Unnamed: 0,Product_Title,Product_Price,Product_Star_Rating,Product_Num_Ratings,Is_Best_Seller,Is_Amazon_Choice
0,Moto G Play 2023 3-Day Battery Unlocked Made f...,99.99,4.0,2929,0,0
1,"SAMSUNG Galaxy A15 5G (SM-156M/DSN), 128GB 6GB...",149.74,4.2,135,0,0
2,Total by Verizon | Samsung Galaxy A03s | Locke...,49.88,3.9,205,0,0
3,Google Pixel 7a - Unlocked Android Cell Phone ...,335.0,4.3,2248,0,0
4,"SAMSUNG Galaxy A15 5G A Series Cell Phone, 128...",199.99,4.1,423,0,0
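
Writing with `index=False` is what keeps a new `Unnamed: 0` column from appearing when the file is read back. A minimal sketch of checking that round trip, using an in-memory buffer and a hypothetical two-row frame in place of the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame({
    "Product_Title": ["Phone A", "Phone B"],
    "Product_Price": [99.99, 149.74],
    "Is_Best_Seller": [0, 1],
})

# Write to an in-memory buffer instead of disk, without the row index.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, dtypes, or column labels differ.
pd.testing.assert_frame_equal(cleaned, reloaded)
print(reloaded.columns.tolist())
```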
