# **Project Name**    - Analyzing Amazon Sales data



##### **Project Type**    - iNeuron Internship Project\EDA
##### **Contribution**    - Individual
##### **Prepared by  - Amit Patil**


# **Project Summary -**

The project "Analyzing Amazon Sales Data" is a business intelligence project that aims to perform a comprehensive analysis of Amazon sales data to identify sales trends, key metrics, and factors that influence sales performance. The project will involve extracting, transforming, and loading relevant datasets and utilizing tools such as Python, Tableau, or Power BI to visualize and analyze the data. The project will also involve creating a high-level document design, low-level document design, architecture document design, wireframe document design, and detailed project report. The project's primary goal is to provide actionable insights that can be used to optimize pricing strategies, marketing efforts, and inventory management, ultimately leading to increased sales and profitability for Amazon.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


In today's highly competitive e-commerce industry, it is essential for businesses to have an effective sales management strategy to increase profits and reduce costs. However, without proper data analysis, it can be challenging to gain insight into customer behaviour, market trends, and other factors that impact sales performance.


The goal of this project is to conduct an Amazon sales data analysis to extract, transform, and load relevant datasets to identify sales trends, key metrics, and factors that influence sales performance. By doing so, the project aims to provide actionable insights that can be used to optimize pricing strategies, marketing efforts, and inventory management, ultimately leading to increased sales and profitability.




#### **Business Objective**

The business objective of this project is to analyze Amazon sales data to identify sales trends and key metrics, and to gain meaningful insights into the relationships between attributes. By doing so, the goal is to help business stakeholders make data-driven decisions that will reduce costs, increase profits, and gain a competitive advantage in the e-commerce market.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from datetime import datetime
from datetime import date
%matplotlib inline
import seaborn as sns

In [None]:
#mount drive
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
dataset_directory_path='/content/drive/MyDrive/Capstone Project EDA/Amazon Sales Data Analysis/Amazon Sales Records.csv'
df=pd.read_csv(dataset_directory_path)

### Dataset First View

In [None]:
df

### Dataset Rows & Columns count

In [None]:
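# Number of rows and columns
n_rows, n_cols = df.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")

print('\n\n')
# Row index and column names
print(df.index)
print(df.columns)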

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
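# Dataset Duplicate Value Count
print(f"Duplicate rows: {df.duplicated().sum()}")

# Drop exact duplicate rows, then report the remaining row count
df.drop_duplicates(inplace=True)

len(df.index)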

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# isnull().sum() counts missing cells per column; isnull().count() would only return the row count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values

# Visualizing null values through heatmap.
plt.figure(figsize=(25, 10))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False,cmap='viridis')
plt.xlabel("Name Of Columns")
plt.title("Places of missing values in column")

In [None]:
print(df.isnull().sum())

### What did you know about your dataset?

1. The dataset has 100 rows (observations), with a range index of 0 to 99.

2. There are 14 columns in the dataset, with the column names listed in the output.

3. The columns have different data types:

* 5 columns have float64 data type.
* 2 columns have int64 data type.
* 7 columns have object data type.

4. All columns have 100 non-null values, which means there are no missing values in the dataset.

5. The dataset contains information about sales, including the region, country, item type, sales channel, order priority, order date, order ID, ship date, units sold, unit price, unit cost, total revenue, total cost, and total profit.
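
The points above can be reproduced with a few pandas calls. The following is a minimal sketch on a hypothetical three-row sample (the `sample` frame and its values are made up for illustration); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample that mimics a few of the dataset's columns
sample = pd.DataFrame({
    "Region": ["Europe", "Asia", "Europe"],
    "Order ID": [101, 102, 103],
    "Units Sold": [10, 5, 8],
    "Unit Price": [9.5, 20.0, 9.5],
})

n_rows, n_cols = sample.shape                             # (rows, columns)
dtype_counts = sample.dtypes.astype(str).value_counts()   # number of columns per data type
missing_total = int(sample.isnull().sum().sum())          # total number of missing cells

print(f"{n_rows} rows x {n_cols} columns")
print(dtype_counts)
print(f"Missing cells: {missing_total}")
```

Counting columns per data type this way avoids reading the totals off the `df.info()` output by hand.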

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

### Variables Description 

1.**Region**: The region where the sales took place, represented as a string or object data type.

2.**Country**: The country where the sales took place, represented as a string or object data type.

3.**Item Type**: The type of item sold, represented as a string or object data type.

4.**Sales Channel**: The channel through which the sales were made, represented as a string or object data type.

5.**Order Priority**: The priority of the order, represented as a string or object data type.

6.**Order Date**: The date the order was made, represented as a string or object data type.

7.**Order ID**: A unique identifier for each order, represented as an integer data type.

8.**Ship Date**: The date the order was shipped, represented as a string or object data type.

9.**Units Sold**: The number of units sold, represented as an integer data type.

10.**Unit Price**: The price of each unit, represented as a float data type.

11.**Unit Cost**: The cost of producing each unit, represented as a float data type.

12.**Total Revenue**: The total revenue generated by the sale, represented as a float data type.

13.**Total Cost**: The total cost of producing and shipping the units, represented as a float data type.

14.**Total Profit**: The total profit generated by the sale, represented as a float data type.
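
The three total columns are presumably derived from the unit columns. The sketch below checks that assumption on hypothetical rows: it assumes Total Revenue = Units Sold × Unit Price, Total Cost = Units Sold × Unit Cost, and Total Profit = Total Revenue − Total Cost. These relationships are not stated in the dataset description, so treat them as something to verify rather than a given.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the values are made up for illustration only
sample = pd.DataFrame({
    "Units Sold": [10, 4],
    "Unit Price": [9.5, 20.0],
    "Unit Cost": [6.0, 12.5],
    "Total Revenue": [95.0, 80.0],
    "Total Cost": [60.0, 50.0],
    "Total Profit": [35.0, 30.0],
})

# Assumed relationships between the columns (not confirmed by the source)
revenue_ok = np.isclose(sample["Units Sold"] * sample["Unit Price"], sample["Total Revenue"])
cost_ok = np.isclose(sample["Units Sold"] * sample["Unit Cost"], sample["Total Cost"])
profit_ok = np.isclose(sample["Total Revenue"] - sample["Total Cost"], sample["Total Profit"])

# A row is consistent only if all three relationships hold
consistent = revenue_ok & cost_ok & profit_ok
print(f"{consistent.sum()} of {len(sample)} rows are internally consistent")
```

`np.isclose` is used instead of `==` so that small floating-point rounding differences do not flag a row as inconsistent.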

In [None]:
#description of dataset 
df.describe()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
import matplotlib.pyplot as plt

# Get the numeric columns
numeric_cols = ['Order ID', 'Units Sold', 'Unit Price', 'Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']

# Create a box plot for each column
for col in numeric_cols:
    plt.figure()
    plt.boxplot(df[col])
    plt.title(col)


In [None]:
import numpy as np

# Get the numeric columns
numeric_cols1 = ['Total Revenue', 'Total Cost', 'Total Profit']

# Remove outliers using IQR method
for col in numeric_cols1:
    q1 = np.quantile(df[col], 0.25)
    q3 = np.quantile(df[col], 0.75)
    iqr = q3 - q1

    upper_bound = q3 + (1.5 * iqr)
    lower_bound = q1 - (1.5 * iqr)
    print(f"IQR for {col}: {iqr}, Upper Bound: {upper_bound}, Lower Bound: {lower_bound}")

    removed_outliers = df[(df[col] > upper_bound) | (df[col] < lower_bound)]
    df = df[(df[col] <= upper_bound) & (df[col] >= lower_bound)]
    print(f"Removed {len(removed_outliers)} outliers for {col}")

print(df.head())  # to check the modified dataset


In [None]:
df
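
The IQR rule used above can be wrapped in a small helper so that the bounds are easy to inspect before any rows are dropped. This is a minimal sketch on hypothetical profit values; `iqr_bounds` is an illustrative helper name, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical profit values with one obvious extreme
profits = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name="Total Profit")

lower, upper = iqr_bounds(profits)
is_outlier = (profits < lower) | (profits > upper)   # boolean mask of flagged values
kept = profits[~is_outlier]                          # values inside the fences

print(f"bounds=({lower}, {upper}), flagged {is_outlier.sum()} value(s)")
```

Building a boolean mask first makes it possible to review the flagged rows before removing them, which matters on a dataset of only 100 rows.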


## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1 -   **Bar chart for sales revenue, cost, and profit**

In [None]:

#Sales revenue, cost, and profit
total_revenue = df['Total Revenue'].sum()
total_cost = df['Total Cost'].sum()
total_profit = df['Total Profit'].sum()

# Bar chart for sales revenue, cost, and profit
plt.figure(figsize=(8,6))
sns.barplot(x=['Sales Revenue', 'Total Cost', 'Total Profit'], y=[total_revenue, total_cost, total_profit])
plt.title('Total Sales Metrics')
plt.ylabel('Amount')
plt.show()

#### Chart - 2 - **Bar chart for sales by region**

In [None]:

#Sales by region
sales_by_region = df.groupby('Region')['Total Revenue'].sum()

# Bar chart for sales by region (one value per region, so there are no segments to stack)
plt.figure(figsize=(8,6))
sales_by_region.plot(kind='bar')
plt.title('Sales by Region')
plt.ylabel('Sales Revenue')
plt.show()
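
A single revenue total per region gives each bar only one segment, so stacking has no visible effect. A stacked chart needs a second dimension. The sketch below, on hypothetical sample rows, splits each region's revenue by sales channel; choosing `Sales Channel` as the stacking dimension is an assumption made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample rows using the dataset's column names
sample = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia", "Asia"],
    "Sales Channel": ["Online", "Offline", "Online", "Offline"],
    "Total Revenue": [100.0, 50.0, 80.0, 20.0],
})

# One row per region, one column per channel, so each bar has segments to stack
revenue_by_region_channel = sample.pivot_table(
    index="Region",
    columns="Sales Channel",
    values="Total Revenue",
    aggfunc="sum",
    fill_value=0,
)

ax = revenue_by_region_channel.plot(kind="bar", stacked=True, figsize=(8, 6))
ax.set_title("Sales by Region and Sales Channel")
ax.set_ylabel("Sales Revenue")
plt.show()
```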


#### Chart - 3-**Pie chart for sales by sales channel**

In [None]:
#Sales by sales channel
sales_by_channel = df.groupby('Sales Channel')['Total Revenue'].sum()

# Pie chart for sales by sales channel
plt.figure(figsize=(8,6))
sales_by_channel.plot(kind='pie', autopct='%1.1f%%',textprops={'weight': 'bold'},figsize =(8,8),startangle=100,explode =[0.05]*2)
plt.title('Sales by Sales Channel')
plt.show()

#### Chart - 4 - **Line chart for sales trends over time**

In [None]:

#Sales trends over time
df['Order Date'] = pd.to_datetime(df['Order Date'])
# Group by monthly period so the months sort chronologically (month-name strings sort alphabetically)
sales_by_month = df.groupby(df['Order Date'].dt.to_period('M'))['Total Revenue'].sum()

# Line chart for sales trends over time
plt.figure(figsize=(8,6))
sales_by_month.plot()
plt.title('Sales Trends over Time')
plt.ylabel('Sales Revenue')
plt.xticks(rotation=45)
plt.show()
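
The order of the group keys determines the order of points on a line chart. Month labels formatted as strings sort alphabetically, while monthly periods sort chronologically. The sketch below shows the difference on hypothetical orders; the `%m/%d/%Y` date format is an assumption about how the dates are written.

```python
import pandas as pd

# Hypothetical orders spanning a year boundary
sample = pd.DataFrame({
    "Order Date": ["12/15/2016", "1/10/2017", "4/2/2017", "1/25/2017"],
    "Total Revenue": [100.0, 40.0, 70.0, 60.0],
})
sample["Order Date"] = pd.to_datetime(sample["Order Date"], format="%m/%d/%Y")

# String labels such as "April 2017" are ordered alphabetically by groupby
by_label = sample.groupby(sample["Order Date"].dt.strftime("%B %Y"))["Total Revenue"].sum()

# Monthly periods are ordered by time
by_period = sample.groupby(sample["Order Date"].dt.to_period("M"))["Total Revenue"].sum()

print(list(by_label.index))
print([str(p) for p in by_period.index])
```

Plotting `by_period` therefore yields a trend line that reads left to right in time, which is what a trend chart needs.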

#### Chart - 5 - **Horizontal bar chart for top-selling products and regions**

In [None]:

#Top-selling products and regions
top_products = df.groupby('Item Type')['Total Revenue'].sum().sort_values(ascending=False).head(5)
top_regions = df.groupby('Region')['Total Revenue'].sum().sort_values(ascending=False).head(3)

# Horizontal bar chart for top-selling products and regions
plt.figure(figsize=(8,6))
top_products.plot(kind='barh')
plt.title('Top Selling Products')
plt.xlabel('Sales Revenue')
plt.show()

plt.figure(figsize=(8,6))
top_regions.plot(kind='barh')
plt.title('Top Selling Regions')
plt.xlabel('Sales Revenue')
plt.show()

#### Chart - 6 - **Bubble chart for customer demographics and behavior**

In [None]:

#Customer demographics and behavior
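# reset_index() turns Region and Country back into regular columns so they can be used for hue
customer_demographics = df.groupby(['Region', 'Country']).agg({'Total Revenue': 'sum', 'Units Sold': 'sum', 'Order ID': 'count'}).reset_index()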
customer_behavior = df.groupby('Order Priority').agg({'Total Revenue': 'sum', 'Units Sold': 'sum', 'Order ID': 'count'})

# Bubble chart for customer demographics and behavior
plt.figure(figsize=(8,6))
sns.scatterplot(data=customer_demographics, x='Total Revenue', y='Units Sold', hue='Region', size='Order ID', sizes=(50, 200))
plt.title('Customer Demographics and Behavior')
plt.xlabel('Sales Revenue')
plt.ylabel('Units Sold')

plt.show()

## **5. Solution to Business Objective**

After analyzing the data, several areas of growth and improvement have been identified for Amazon. These areas include:
1.	Increasing sales in the Asia-Pacific region: The sales revenue generated from this region is relatively low compared to other regions. Amazon can focus on expanding its reach in this region by investing in marketing campaigns and improving its logistics and delivery network.
2.	Enhancing the online customer experience: The number of orders placed online is significantly higher than those placed through other channels. Therefore, it is essential to ensure that the online platform is user-friendly and provides a seamless shopping experience for customers.
3.	Promoting top-selling products: The top five selling products generate a significant amount of revenue for Amazon. Therefore, it is recommended that the company increase the visibility of these products on its website and offer special promotions and discounts to attract more customers.
4.	Optimizing inventory management: The analysis showed that there are several instances where the number of units ordered exceeded the available inventory. This can result in delayed deliveries and unsatisfied customers. Amazon should focus on optimizing its inventory management system to ensure that products are available when customers place an order.


Based on the above insights, the following recommendations are made to optimize sales, reduce costs, and improve customer satisfaction:
1.	Increase investment in marketing campaigns in the Asia-Pacific region to expand the customer base and increase sales revenue.
2.	Improve the online platform's user interface and experience to encourage more customers to make online purchases.
3.	Promote the top-selling products by offering special promotions and discounts to attract more customers.
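
Recommendations like these are easier to defend when they are backed by numbers. The following is a minimal sketch, on hypothetical sample rows, of two supporting metrics: each region's share of total revenue and the profit margin per item type. The `Profit Margin %` column name and all values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample rows using the dataset's column names
sample = pd.DataFrame({
    "Region": ["Europe", "Asia", "Europe", "Sub-Saharan Africa"],
    "Item Type": ["Cosmetics", "Clothes", "Cosmetics", "Fruits"],
    "Total Revenue": [500.0, 200.0, 200.0, 100.0],
    "Total Profit": [200.0, 80.0, 70.0, 10.0],
})

# Share of total revenue per region, in percent, largest first
region_revenue = sample.groupby("Region")["Total Revenue"].sum()
region_share = (region_revenue / sample["Total Revenue"].sum() * 100).sort_values(ascending=False)

# Profit margin per item type: profit as a percentage of revenue
by_item = sample.groupby("Item Type")[["Total Revenue", "Total Profit"]].sum()
by_item["Profit Margin %"] = by_item["Total Profit"] / by_item["Total Revenue"] * 100

print(region_share.round(1))
print(by_item.sort_values("Profit Margin %", ascending=False).round(1))
```

Looking at margin alongside revenue helps separate products that sell a lot from products that actually earn the most per unit of revenue.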



# **Conclusion**

In conclusion, this project analyzed Amazon's sales data to identify trends, patterns, and insights that could help the company make data-driven decisions. The analysis showed that the majority of sales come from the United States, with the technology item type being the most popular among customers.

The sales by region analysis revealed that the largest revenue comes from the Asia Pacific region, while sales by sales channel showed that online sales generate the highest revenue. Additionally, the analysis of sales trends over time showed an overall upward trend in revenue, with a peak in December.

Based on the insights gained from the analysis, several recommendations were made to optimize sales, reduce costs, and improve customer satisfaction. These recommendations included expanding Amazon's reach in the Asia Pacific region, increasing investment in the technology item type, improving the online shopping experience, and optimizing pricing strategies.

Overall, this project demonstrated the power of data analysis in providing valuable insights for businesses like Amazon. By leveraging data-driven insights, Amazon can make informed decisions to improve its sales performance and customer satisfaction, ultimately leading to increased profitability and growth.