<a href="https://colab.research.google.com/github/ElenaBara21/Portfolio/blob/main/Amazon_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon project - Collagen

The most common types of collagen used in supplements are:
* Type I
* Type II
* Type III
* Type V
* Type X

These collagen types come in three different forms:
* Hydrolyzed collagen (also sold as collagen peptides, collagen powder, collagen hydrolysate, or hydrolyzed gelatin)
* Gelatin
* Undenatured type II collagen (UC-II)

Collagen supplement ingredients come from a variety of sources:
* Bovine collagen, made from cows, contains types I & III.
* Marine collagen, made from fish, contains types I & II.
* Poultry collagen, made from chickens, contains type II.
* Eggshell membrane collagen contains types I & V.

We are interested in type II collagen, which is found in both marine and chicken products. People with allergies to fish or chicken should consult their doctor before taking supplements containing type II collagen.

from pandas.api.types import CategoricalDtype
## Data Description
* **Product**
* **brand**
* **Price_AED**
* **Sales** - quantity sold
* **Revenue** - quantity_sold x price_aed
* **Fees_AED** - FBA fee + referral fee
* **Active_Seller** - number of active sellers
* **Ratings** - Amazon rating
* **Review_Count** - total number of reviews
* **Images** - number of images in the listing
* **Dimensions**
* **Weight**
* **Creation_Date**
* **Category**
* **Review_velocity** - change in review count over the last 30 days
* **Buy_Box** - the Buy Box owner. Each product page has a ‘Buy Now’ button on the right side, adjacent to the product image and description, with a hierarchical list of other sellers below it. Sometimes this coveted spot is awarded based on past sales history, and sometimes it is rotated between sellers. Amazon decides who earns the Buy Box position, but using Helium 10 tools to increase visibility and sales can help a merchant gain it. If you sell unique products under a private label, you are the most likely to hold the Buy Box because no one else is selling the same product.
* **Size_tier** - the category a product falls into on Amazon depending on its size and weight: Large Envelope, Standard Envelope, Standard Parcel
* **Fulfillment** - AMZ = FBA. FBA stands for Fulfillment by Amazon, a program that allows sellers to store their products in Amazon's fulfillment centers. MFN stands for Merchant Fulfilled Network, a method of fulfilling Amazon orders where the seller handles the entire process: the seller stores the products in their own facilities and is responsible for packaging, shipping, and customer service.
* **BSR** - Best Seller Rank. Amazon tracks and publicizes its best-selling products to drive sales. Every product sold on Amazon is ranked in at least one category, and often in several sub-categories. Ranking affects where and when products appear in a customer’s organic search. Most sellers try to rank high enough to be on the first results page of a customer search.
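
The Revenue definition above (quantity sold x price) can be checked directly. The sketch below uses a small hand-built frame whose values are copied from the first rows displayed later in this notebook; the column `revenue_ok` is a hypothetical name introduced only for this check.

```python
import numpy as np
import pandas as pd

# Toy rows copied from the head of the dataset shown in this notebook
sample = pd.DataFrame(
    {
        "price_aed": [295.0, 144.0, 60.37],
        "sales": [88.0, 5.0, 89.0],
        "revenue": [25960.0, 720.0, 5372.93],
    }
)

# Revenue should equal quantity sold times unit price (up to rounding)
expected = sample["sales"] * sample["price_aed"]
sample["revenue_ok"] = np.isclose(sample["revenue"], expected, atol=0.01)
print(sample)
```

Rows where `revenue_ok` is False would be worth inspecting before any imputation.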


***The Buy Box is the block on an Amazon product page where the shopper can click "Buy Now" and purchase from a specific seller. It sits on the right side of the product page, next to the price and other product details. The Buy Box shows one seller at a time, but Amazon may show different sellers in this spot based on factors such as price, seller rating, stock availability, and others. Holding the Buy Box is very important for sellers because it brings them more sales on Amazon.



### Import libraries

In [170]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns


In [171]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    # histogram; pass `bins` only when the caller supplied a value
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [172]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [173]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # place the legend outside the plot area, to the right
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [174]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

In [175]:
df = pd.read_csv('Helium_10_Xray_2023-05-08.csv')

### Create a copy of the dataset so that the original data is not lost

In [176]:
df1=df.copy()

### Look at the first 5 rows

In [177]:
df1.head()

Unnamed: 0,Product Details,ASIN,URL,Image URL,Brand,Price AED,Sales,Revenue,BSR,Fees AED,...,Review Count,Images,Review velocity,Buy Box,Category,Size Tier,Fulfillment,Dimensions,Weight,Creation Date
0,"($) Nutrili Marine Collagen Shots (20) | Hair,...",B09YD7RHC2,https://www.amazon.ae/dp/B09YD7RHC2?psc=1,https://m.media-amazon.com/images/I/61asHjsoyP...,Nutrili,295.0,88.0,25960.0,811,52.75,...,26,11,4,nutrili,Health,Standard Parcel,FBA,4.4x3.4x4.6,1.72,4/20/2022
1,($) Swisse Beauty Collagen Glow Powder with 25...,B08NDYTG3D,https://www.amazon.ae/dp/B08NDYTG3D?psc=1,https://m.media-amazon.com/images/I/71VnTFRUH4...,Swisse,144.0,5.0,720.0,23351,29.6,...,9,3,0,Pattern MENA,Health,Standard Parcel,FBA,2.8x2.7x4.9,0.37,
2,($) Snaktive Collagen Chocolate - 6x40g bars,B0B9SWK8XQ,https://www.amazon.ae/dp/B0B9SWK8XQ?psc=1,https://m.media-amazon.com/images/I/61Id91ioPE...,SNAKTIVE,60.0,4.0,240.0,26620,16.5,...,2,4,0,Le Chocolat LLC,Health,Large Envelope,FBA,1.8x5.1x2.8,0.55,
3,"Youtheory Collagen Advanced with Vitamin C, 12...",B006VAZYLS,https://www.amazon.ae/dp/B006VAZYLS?psc=1,https://m.media-amazon.com/images/I/71m5dU+vK4...,Youtheory,60.37,89.0,5372.93,753,17.0,...,4676,9,88,NBL General Trading L.L.C,Health,Standard Parcel,FBA,3.1x3.0x3.6,0.42,
4,Neocell Super Collagen PlUS C - 250 Tablets,B00028NGEC,https://www.amazon.ae/dp/B00028NGEC?psc=1,https://m.media-amazon.com/images/I/71TDUkdGRZ...,Neocell,135.0,542.0,73170.0,200,28.25,...,18342,14,140,Amazon,Health,Standard Parcel,AMZ,3.1x2.8x5.3,0.66,03/06/2019


#### Remove spaces and bring all column names to one format

In [178]:
# Rename columns to lowercase with underscores instead of whitespace
new_columns = {col: col.lower().replace(' ', '_') for col in df1.columns}
df1 = df1.rename(columns=new_columns)
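
As a small illustration of what this renaming does, the sketch below applies the same rule to an empty frame whose headers only mimic the raw export (they are assumptions, not read from the file). Note that `#` survives the rename, so a name like `active_sellers_#` cannot be used with attribute-style access; the extra rename to `active_sellers` is a hypothetical convenience.

```python
import pandas as pd

# Hypothetical headers that mimic the raw Helium 10 export
raw = pd.DataFrame(columns=["Product Details", "Price AED", "Active Sellers #", "Buy Box"])

# Same rule as above: lowercase, spaces replaced by underscores
renamed = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
print(list(renamed.columns))

# Optional: drop the trailing '#' so the column name is a valid identifier
clean = renamed.rename(columns={"active_sellers_#": "active_sellers"})
print(list(clean.columns))
```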

#### Let's check the data again

In [179]:
df1.head(2)

Unnamed: 0,product_details,asin,url,image_url,brand,price_aed,sales,revenue,bsr,fees_aed,...,review_count,images,review_velocity,buy_box,category,size_tier,fulfillment,dimensions,weight,creation_date
0,"($) Nutrili Marine Collagen Shots (20) | Hair,...",B09YD7RHC2,https://www.amazon.ae/dp/B09YD7RHC2?psc=1,https://m.media-amazon.com/images/I/61asHjsoyP...,Nutrili,295.0,88.0,25960.0,811,52.75,...,26,11,4,nutrili,Health,Standard Parcel,FBA,4.4x3.4x4.6,1.72,4/20/2022
1,($) Swisse Beauty Collagen Glow Powder with 25...,B08NDYTG3D,https://www.amazon.ae/dp/B08NDYTG3D?psc=1,https://m.media-amazon.com/images/I/71VnTFRUH4...,Swisse,144.0,5.0,720.0,23351,29.6,...,9,3,0,Pattern MENA,Health,Standard Parcel,FBA,2.8x2.7x4.9,0.37,


### Understand the shape of the dataset

In [180]:
df1.shape

(55, 22)

* 22 columns and 55 rows

#### Check whether there is any missing data

In [181]:
print(df1.isnull().sum())

product_details     0
asin                0
url                 0
image_url           0
brand               0
price_aed           0
sales               4
revenue             4
bsr                 0
fees_aed            0
active_sellers_#    0
ratings             0
review_count        0
images              0
review_velocity     0
buy_box             0
category            0
size_tier           0
fulfillment         1
dimensions          0
weight              0
creation_date       8
dtype: int64


* Missing data in the following columns:
 creation_date = 8,
 fulfillment = 1,
 sales = 4,
 revenue = 4

 

In [182]:
missing_values = df1.isna().any()

# Printing the columns with missing values
print(missing_values)

product_details     False
asin                False
url                 False
image_url           False
brand               False
price_aed           False
sales                True
revenue              True
bsr                 False
fees_aed            False
active_sellers_#    False
ratings             False
review_count        False
images              False
review_velocity     False
buy_box             False
category            False
size_tier           False
fulfillment          True
dimensions          False
weight              False
creation_date        True
dtype: bool


* The sales, revenue, fulfillment, and creation_date columns contain missing ('n/a') values. These must be handled before analysis, because they could otherwise lead to misleading interpretations.
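
One detail that may explain why `isnull()` already counts these entries: by default `pd.read_csv` treats the string `n/a` (among others) as a missing value. The sketch below demonstrates this on a tiny in-memory CSV whose contents are made up for illustration.

```python
import io

import pandas as pd

# Tiny made-up CSV with 'n/a' placeholders
csv_text = "brand,sales,creation_date\nA,88,4/20/2022\nB,n/a,n/a\n"

# Default behaviour: 'n/a' is parsed as NaN
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.isna().sum())

# With keep_default_na=False the placeholder stays a plain string
demo_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(demo_raw["sales"].tolist())
```

If the file was loaded with default settings, an explicit `.replace('n/a', np.nan)` is harmless but should not find anything left to replace.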

#### Let's treat missing data

Replace 'n/a' with NaN in the 'sales' column to avoid errors later:

In [183]:
df1['sales'] = df1['sales'].replace('n/a', np.nan)  # Replace 'n/a' with NaN
df1['sales'] = pd.to_numeric(df1['sales'], errors='coerce')  # Convert to numeric data type


In the 'revenue' column:

In [184]:
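df1['revenue'] = df1['revenue'].replace('n/a', np.nan)  # Replace 'n/a' with NaN
# Strip thousands separators first; to_numeric would otherwise coerce text such as "1,234.00" to NaN
df1['revenue'] = pd.to_numeric(df1['revenue'].astype(str).str.replace(',', ''), errors='coerce')  # Convert to numeric data type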
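
A pitfall worth checking at this step: if revenue is stored as text with thousands separators (an assumption, suggested by the `object` dtype and the comma-stripping cell later in this notebook), `pd.to_numeric(..., errors='coerce')` silently turns those values into NaN. The sketch below shows the difference on made-up strings.

```python
import pandas as pd

# Made-up revenue strings; the comma format is an assumption about the raw file
raw = pd.Series(["25,960.00", "720.00", "n/a"])

# Coercing directly loses every value that contains a comma
naive = pd.to_numeric(raw, errors="coerce")

# Removing the separators first keeps them
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(pd.DataFrame({"raw": raw, "naive": naive, "cleaned": cleaned}))
```

Only the genuine placeholder should end up as NaN; a sudden jump in the NaN count after conversion is a sign that valid numbers were coerced away.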

In the 'fulfillment' column:

In [185]:
df1['fulfillment'] = df1['fulfillment'].replace('n/a', np.nan)  # Replace 'n/a' with NaN


In the 'creation_date' column:

In [186]:
# df1['creation_date'] = df1['creation_date'].replace('n/a', np.nan)  # Replace 'n/a' with NaN
# df1['creation_date'] = pd.to_numeric(df1['creation_date'], errors='coerce')  # Convert to numeric data type

Let's fill the missing values in the sales and revenue columns with the column means:

In [187]:
mean_sales = df1['sales'].mean()
df1['sales'] = df1['sales'].fillna(mean_sales)

In [188]:
mean_revenue = df1['revenue'].mean()
df1['revenue'] = df1['revenue'].fillna(mean_revenue)
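
Mean imputation is one option. As an alternative sketch (not the approach used in this notebook), the median is less sensitive to a few very large sellers, and missing revenue can be rebuilt from the definition revenue = sales x price so the two columns stay consistent. The toy values below are invented.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing sales/revenue pair and one large outlier
toy = pd.DataFrame(
    {
        "price_aed": [100.0, 50.0, 200.0, 80.0],
        "sales": [10.0, np.nan, 30.0, 500.0],
        "revenue": [1000.0, np.nan, 6000.0, 40000.0],
    }
)

filled = toy.copy()

# Fill missing sales with the median instead of the mean
filled["sales"] = filled["sales"].fillna(filled["sales"].median())

# Recompute only the missing revenue values from sales x price
missing_rev = filled["revenue"].isna()
filled.loc[missing_rev, "revenue"] = (
    filled.loc[missing_rev, "sales"] * filled.loc[missing_rev, "price_aed"]
)
print(filled)
```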

We will drop the one row with a missing value in the fulfillment column:

In [189]:
df1.dropna(subset=['fulfillment'], inplace=True)

#### creation_date

In [190]:
df1['creation_date'] = pd.to_datetime(df1['creation_date'], errors='coerce')


In [191]:
df1['creation_date'] = df1['creation_date'].replace('n/a', np.nan)


### Treat missing data in the creation_date column


In [192]:
# Convert 'creation_date' to datetime; unparseable entries such as 'n/a' become NaT
df1['creation_date'] = pd.to_datetime(df1['creation_date'], errors='coerce')

# Create a separate column to indicate missing dates
df1['is_missing_date'] = df1['creation_date'].isna()

# Extract the year from the 'creation_date' column
df1['year'] = df1['creation_date'].dt.year

In [193]:
# df1.dropna(subset=['creation_date'], inplace=True)


In [194]:
print(df1.isnull().sum())

product_details     0
asin                0
url                 0
image_url           0
brand               0
price_aed           0
sales               0
revenue             0
bsr                 0
fees_aed            0
active_sellers_#    0
ratings             0
review_count        0
images              0
review_velocity     0
buy_box             0
category            0
size_tier           0
fulfillment         0
dimensions          0
weight              0
creation_date       8
is_missing_date     0
year                8
dtype: int64


In [195]:
#### For Better interpretation of this column data

In [196]:
df1.head()

Unnamed: 0,product_details,asin,url,image_url,brand,price_aed,sales,revenue,bsr,fees_aed,...,review_velocity,buy_box,category,size_tier,fulfillment,dimensions,weight,creation_date,is_missing_date,year
0,"($) Nutrili Marine Collagen Shots (20) | Hair,...",B09YD7RHC2,https://www.amazon.ae/dp/B09YD7RHC2?psc=1,https://m.media-amazon.com/images/I/61asHjsoyP...,Nutrili,295.0,88.0,444.464,811,52.75,...,4,nutrili,Health,Standard Parcel,FBA,4.4x3.4x4.6,1.72,2022-04-20,False,2022.0
1,($) Swisse Beauty Collagen Glow Powder with 25...,B08NDYTG3D,https://www.amazon.ae/dp/B08NDYTG3D?psc=1,https://m.media-amazon.com/images/I/71VnTFRUH4...,Swisse,144.0,5.0,720.0,23351,29.6,...,0,Pattern MENA,Health,Standard Parcel,FBA,2.8x2.7x4.9,0.37,NaT,True,
2,($) Snaktive Collagen Chocolate - 6x40g bars,B0B9SWK8XQ,https://www.amazon.ae/dp/B0B9SWK8XQ?psc=1,https://m.media-amazon.com/images/I/61Id91ioPE...,SNAKTIVE,60.0,4.0,240.0,26620,16.5,...,0,Le Chocolat LLC,Health,Large Envelope,FBA,1.8x5.1x2.8,0.55,NaT,True,
3,"Youtheory Collagen Advanced with Vitamin C, 12...",B006VAZYLS,https://www.amazon.ae/dp/B006VAZYLS?psc=1,https://m.media-amazon.com/images/I/71m5dU+vK4...,Youtheory,60.37,89.0,444.464,753,17.0,...,88,NBL General Trading L.L.C,Health,Standard Parcel,FBA,3.1x3.0x3.6,0.42,NaT,True,
4,Neocell Super Collagen PlUS C - 250 Tablets,B00028NGEC,https://www.amazon.ae/dp/B00028NGEC?psc=1,https://m.media-amazon.com/images/I/71TDUkdGRZ...,Neocell,135.0,542.0,444.464,200,28.25,...,140,Amazon,Health,Standard Parcel,AMZ,3.1x2.8x5.3,0.66,2019-03-06,False,2019.0


#### Numerical and categorical data

In [20]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_details   55 non-null     object 
 1   asin              55 non-null     object 
 2   url               55 non-null     object 
 3   image_url         55 non-null     object 
 4   brand             55 non-null     object 
 5   price_aed         55 non-null     float64
 6   sales             51 non-null     float64
 7   revenue           51 non-null     object 
 8   bsr               55 non-null     object 
 9   fees_aed          55 non-null     float64
 10  active_sellers_#  55 non-null     int64  
 11  ratings           55 non-null     float64
 12  review_count      55 non-null     object 
 13  images            55 non-null     int64  
 14  review_velocity   55 non-null     int64  
 15  buy_box           55 non-null     object 
 16  category          55 non-null     object 
 17 

* We have 8 numerical and 14 categorical (object) columns, but some columns have the wrong type: creation_date, dimensions, review_count, and revenue hold dates or numbers yet are stored as objects.
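
Counts like these can be read off programmatically rather than from the `info()` printout. A minimal sketch on an invented frame, using `select_dtypes`:

```python
import pandas as pd

# Invented frame: two numeric columns and two text columns
toy = pd.DataFrame(
    {
        "price_aed": [295.0, 144.0],
        "images": [11, 3],
        "brand": ["Nutrili", "Swisse"],
        "review_count": ["26", "18,342"],  # numbers stored as text
    }
)

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numeric:", numeric_cols)
print(len(other_cols), "non-numeric:", other_cols)
```

Columns such as `review_count` that show up in the non-numeric list despite holding numbers are the ones that need conversion.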


#### Statistical info


In [21]:
df1.describe().T 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price_aed,55.0,155.209636,103.916063,26.99,84.91,126.0,200.785,608.99
sales,51.0,100.27451,121.228722,1.0,29.5,60.0,123.0,569.0
fees_aed,55.0,31.455273,16.001868,8.66,20.735,26.95,38.215,101.85
active_sellers_#,55.0,4.218182,2.960799,1.0,2.0,3.0,7.0,11.0
ratings,55.0,4.483636,0.284646,3.4,4.35,4.5,4.6,5.0
images,55.0,6.727273,3.418114,1.0,4.0,7.0,9.0,16.0
review_velocity,55.0,36.8,86.848612,-87.0,1.0,6.0,24.0,497.0
weight,55.0,0.979455,0.842387,0.09,0.465,0.62,1.335,4.37


#### Check asin, url, and image_url for uniqueness:

In [14]:
df1.image_url.nunique()

53

In [18]:
df1.url.nunique()

53

In [None]:
df1.asin.nunique()

53

* Almost every row in these columns is unique (53 distinct values in 55 rows), so they act as identifiers and hold no analytical value for us. Therefore, we can drop them.
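
When `nunique()` is smaller than the number of rows, some identifiers repeat, and it can be worth looking at those rows before dropping the columns. The sketch below uses an invented three-row frame; whether repeated listings in the real data should be removed is a judgment call that this example does not settle.

```python
import pandas as pd

# Invented frame in which one ASIN appears twice
toy = pd.DataFrame(
    {
        "asin": ["B09YD7RHC2", "B08NDYTG3D", "B09YD7RHC2"],
        "brand": ["Nutrili", "Swisse", "Nutrili"],
    }
)

n_unique = toy["asin"].nunique()

# keep=False marks every member of a duplicated group, not just the repeats
dupes = toy[toy.duplicated(subset="asin", keep=False)]

# One possible treatment: keep the first occurrence of each ASIN
deduped = toy.drop_duplicates(subset="asin", keep="first")

print(n_unique, len(toy), len(deduped))
print(dupes)
```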

In [22]:
df1 = df1.drop(["image_url"], axis=1)

In [23]:
df1 = df1.drop(["url"], axis=1)

In [24]:
df1 = df1.drop(["asin"], axis=1)

In [25]:
df1 = df1.drop(["category"], axis=1)

In [None]:
# Replace 'n/a' placeholders in the 'sales' column with NaN
df1['sales'] = df1['sales'].replace('n/a', np.nan)

#### The following columns are stored as text (object) and need to be converted to a numerical type:

*   revenue
*   review count
*   bsr


In [26]:
df1['revenue'] = df1['revenue'].str.replace(',', '').astype(float)

In [27]:
df1['bsr'] = df1['bsr'].str.replace(',', '').astype(float)

In [28]:
df1['review_count'] = df1['review_count'].str.replace(',', '').astype(float)

Now let's parse creation_date, which is stored in mm/dd/yyyy format, into a datetime column:

In [29]:
df1['creation_date'] = pd.to_datetime(df1['creation_date'], format='%m/%d/%Y')


In [30]:
df1['creation_date'] = pd.to_datetime(df1['creation_date'], format='%m/%d/%Y', errors='coerce')



#### Let's check the format again:

In [197]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54 entries, 0 to 54
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   product_details   54 non-null     object        
 1   asin              54 non-null     object        
 2   url               54 non-null     object        
 3   image_url         54 non-null     object        
 4   brand             54 non-null     object        
 5   price_aed         54 non-null     float64       
 6   sales             54 non-null     float64       
 7   revenue           54 non-null     float64       
 8   bsr               54 non-null     object        
 9   fees_aed          54 non-null     float64       
 10  active_sellers_#  54 non-null     int64         
 11  ratings           54 non-null     float64       
 12  review_count      54 non-null     object        
 13  images            54 non-null     int64         
 14  review_velocity   54 non-nul

* Now we have 11 numeric and 6 object columns; dimensions will be modified later.
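
One way the dimensions column could be made numeric is to split strings such as `4.4x3.4x4.6` on `x`. This is only a sketch: the names `dim_1`, `dim_2`, `dim_3`, and `volume` are hypothetical, and the source does not say which side is length, width, or height, or which unit is used.

```python
import pandas as pd

# Two dimension strings copied from the notebook output, plus a missing value
toy = pd.DataFrame({"dimensions": ["4.4x3.4x4.6", "2.8x2.7x4.9", None]})

# Split into three parts and convert each part to a number
parts = toy["dimensions"].str.split("x", expand=True)
parts = parts.apply(pd.to_numeric, errors="coerce")
parts.columns = ["dim_1", "dim_2", "dim_3"]  # hypothetical names

toy = toy.join(parts)

# A derived size feature; missing dimensions propagate as NaN
toy["volume"] = toy["dim_1"] * toy["dim_2"] * toy["dim_3"]
print(toy)
```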

In [None]:
df1.head()

Unnamed: 0,product_details,brand,price_aed,sales,revenue,bsr,fees_aed,active_sellers_#,ratings,review_count,images,review_velocity,buy_box,size_tier,fulfillment,dimensions,weight,creation_date
0,"($) Nutrili Marine Collagen Shots (20) | Hair,...",Nutrili,295.0,88.0,25960.0,811.0,52.75,1,4.6,26.0,11,4,nutrili,Standard Parcel,FBA,4.4x3.4x4.6,1.72,2022-04-20
1,($) Swisse Beauty Collagen Glow Powder with 25...,Swisse,144.0,5.0,720.0,23351.0,29.6,1,4.9,9.0,3,0,Pattern MENA,Standard Parcel,FBA,2.8x2.7x4.9,0.37,NaT
2,($) Snaktive Collagen Chocolate - 6x40g bars,SNAKTIVE,60.0,4.0,240.0,26620.0,16.5,1,5.0,2.0,4,0,Le Chocolat LLC,Large Envelope,FBA,1.8x5.1x2.8,0.55,NaT
3,"Youtheory Collagen Advanced with Vitamin C, 12...",Youtheory,60.37,89.0,5372.93,753.0,17.0,8,4.5,4676.0,9,88,NBL General Trading L.L.C,Standard Parcel,FBA,3.1x3.0x3.6,0.42,NaT
4,Neocell Super Collagen PlUS C - 250 Tablets,Neocell,135.0,542.0,73170.0,200.0,28.25,2,4.4,18342.0,14,140,Amazon,Standard Parcel,AMZ,3.1x2.8x5.3,0.66,2019-03-06


In [198]:
df1.describe().T 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price_aed,54.0,157.195,103.833579,26.99,85.0,130.5,204.8925,608.99
sales,54.0,100.27451,117.747744,1.0,31.25,69.5,117.25,569.0
revenue,54.0,444.464,79.518445,231.0,444.464,444.464,444.464,800.0
fees_aed,54.0,31.722963,16.027337,8.66,20.75,27.6,38.8775,101.85
active_sellers_#,54.0,4.259259,2.972739,1.0,2.0,3.0,7.0,11.0
ratings,54.0,4.474074,0.27826,3.4,4.325,4.5,4.6,5.0
images,54.0,6.722222,3.450002,1.0,4.0,6.5,9.0,16.0
review_velocity,54.0,37.407407,87.54611,-87.0,1.0,6.5,24.0,497.0
weight,54.0,0.986111,0.848836,0.09,0.4425,0.62,1.3725,4.37
year,46.0,2019.956522,1.332608,2017.0,2019.0,2019.0,2021.0,2022.0


### **Feature Engineering**
Let's create two new features: time_since_creation and seasonality.
The code below uses the date on which the notebook is run as the baseline, so time_since_creation changes from run to run.


### time_since_creation

In [199]:
# Convert 'creation_date' to datetime type if it's not already
df1['creation_date'] = pd.to_datetime(df1['creation_date'], errors='coerce')

# Calculate the time difference between 'creation_date' and current date
current_date = pd.Timestamp.now()
df1['time_since_creation'] = (current_date - df1['creation_date']).dt.days
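
Because `pd.Timestamp.now()` changes every day, the resulting feature is not reproducible. A sketch of an alternative is to fix the reference date; here it is assumed to be 2023-05-08, taken from the export file name, which may or may not be the true collection date.

```python
import pandas as pd

# Two dates copied from the notebook output, plus a missing value
toy = pd.DataFrame({"creation_date": ["4/20/2022", "03/06/2019", None]})
toy["creation_date"] = pd.to_datetime(
    toy["creation_date"], format="%m/%d/%Y", errors="coerce"
)

# Assumed reference date, read off the file name Helium_10_Xray_2023-05-08.csv
reference_date = pd.Timestamp("2023-05-08")

# Age of each listing in days; missing dates stay missing
toy["time_since_creation"] = (reference_date - toy["creation_date"]).dt.days
print(toy)
```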

### Seasonality

In [200]:
# Extract the month and create binary seasonality features
df1['month'] = df1['creation_date'].dt.month

# Create binary seasonality features
df1['is_winter'] = df1['month'].isin([12, 1, 2]).astype(int)
df1['is_spring'] = df1['month'].isin([3, 4, 5]).astype(int)
df1['is_summer'] = df1['month'].isin([6, 7, 8]).astype(int)
df1['is_autumn'] = df1['month'].isin([9, 10, 11]).astype(int)
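
With the four flags above, a product whose creation date is missing gets 0 in every season, which is indistinguishable from "no season" in a model. A sketch of one alternative keeps a single `season` column (a hypothetical name) in which missing months stay missing, and derives the flags from it when needed.

```python
import pandas as pd

# Invented months, including one missing value
toy = pd.DataFrame({"month": [4.0, 12.0, 7.0, None]})

season_of_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# na_action="ignore" leaves missing months as NaN instead of calling the function
toy["season"] = toy["month"].map(
    lambda m: season_of_month[int(m)], na_action="ignore"
)

# One-hot flags derived from the single categorical column
dummies = pd.get_dummies(toy["season"], prefix="is")
print(toy.join(dummies))
```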


In [201]:
df1.head()

Unnamed: 0,product_details,asin,url,image_url,brand,price_aed,sales,revenue,bsr,fees_aed,...,weight,creation_date,is_missing_date,year,time_since_creation,month,is_winter,is_spring,is_summer,is_autumn
0,"($) Nutrili Marine Collagen Shots (20) | Hair,...",B09YD7RHC2,https://www.amazon.ae/dp/B09YD7RHC2?psc=1,https://m.media-amazon.com/images/I/61asHjsoyP...,Nutrili,295.0,88.0,444.464,811,52.75,...,1.72,2022-04-20,False,2022.0,393.0,4.0,0,1,0,0
1,($) Swisse Beauty Collagen Glow Powder with 25...,B08NDYTG3D,https://www.amazon.ae/dp/B08NDYTG3D?psc=1,https://m.media-amazon.com/images/I/71VnTFRUH4...,Swisse,144.0,5.0,720.0,23351,29.6,...,0.37,NaT,True,,,,0,0,0,0
2,($) Snaktive Collagen Chocolate - 6x40g bars,B0B9SWK8XQ,https://www.amazon.ae/dp/B0B9SWK8XQ?psc=1,https://m.media-amazon.com/images/I/61Id91ioPE...,SNAKTIVE,60.0,4.0,240.0,26620,16.5,...,0.55,NaT,True,,,,0,0,0,0
3,"Youtheory Collagen Advanced with Vitamin C, 12...",B006VAZYLS,https://www.amazon.ae/dp/B006VAZYLS?psc=1,https://m.media-amazon.com/images/I/71m5dU+vK4...,Youtheory,60.37,89.0,444.464,753,17.0,...,0.42,NaT,True,,,,0,0,0,0
4,Neocell Super Collagen PlUS C - 250 Tablets,B00028NGEC,https://www.amazon.ae/dp/B00028NGEC?psc=1,https://m.media-amazon.com/images/I/71TDUkdGRZ...,Neocell,135.0,542.0,444.464,200,28.25,...,0.66,2019-03-06,False,2019.0,1534.0,3.0,0,1,0,0
