# **CAPSTONE DEFENSE PROJECT**
# **POWER BI** 


**Business Understanding**

The objective of this Power BI Dashboard project is to transform the raw transactional data collected by the client in the year 2019 into actionable insights. By leveraging business intelligence tools, we aim to empower our client to make informed decisions to drive sales and enhance operational efficiency.

**Objective:**

This project uses the client's 2019 transactional data to address the following key points:

1. Revenue Analysis: Determine the total revenue generated throughout the year 2019, providing a clear understanding of the financial performance over the specified period.

2. Seasonality Assessment: Identify any recurring patterns or seasonality in sales data to facilitate better resource allocation, inventory management, and marketing strategies.

3. Product Performance Evaluation: Analyze sales data to identify the best-selling and worst-selling products, enabling the optimization of product offerings and inventory management practices.

4. Sales Trend Analysis: Compare sales performance across different time periods (months or weeks) to identify trends, fluctuations, and potential areas for improvement or expansion.

5. Geographical Insights: Determine the distribution of product deliveries across various cities to enable targeted marketing efforts and optimize logistics operations.

6. Product Category Comparison: Compare revenue generated and quantities ordered across different product categories, providing insights into the performance of various product lines and guiding future product development strategies.

7. Additional Details Integration: Incorporate additional details from the data findings to provide a comprehensive understanding of business performance, including the classification of products into high-level and basic categories based on unit prices.

By achieving these objectives, the Power BI Dashboard will empower our client to make data-driven decisions, enhance sales strategies, optimize operations, and drive overall business growth and efficiency.
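Objective 7 mentions classifying products into high-level and basic categories by unit price. The sketch below shows one way this could be done. The cut-off of 100 and the names `PRICE_THRESHOLD` and `classify_price_tier` are assumptions for illustration only, since the text above does not state the actual threshold.

```python
import pandas as pd

# Hypothetical cut-off: the project text does not state the real threshold.
PRICE_THRESHOLD = 100.0

def classify_price_tier(price_each: pd.Series, threshold: float = PRICE_THRESHOLD) -> pd.Series:
    """Label each unit price as 'High-level' (above threshold) or 'Basic'."""
    prices = pd.to_numeric(price_each, errors="coerce")
    tiers = prices.gt(threshold).map({True: "High-level", False: "Basic"})
    # Leave missing or unparseable prices unlabeled instead of calling them 'Basic'.
    return tiers.where(prices.notna())

sample = pd.DataFrame({
    "Product": ["iPhone", "Wired Headphones", "27in FHD Monitor"],
    "Price Each": ["700", "11.99", "149.99"],
})
sample["Price Tier"] = classify_price_tier(sample["Price Each"])
print(sample)
```

Keeping the threshold as a parameter makes it easy to swap in whatever cut-off the client agrees on.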

**The Hypothesis**

  **Null Hypothesis:** There is no difference in revenue generated between different product categories.
  
  **Alternative Hypothesis:** Certain product categories generate significantly more revenue compared to others.
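The hypotheses above do not name a statistical test. As a minimal sketch, assuming a rank-based Kruskal-Wallis test is acceptable, per-row revenues could be compared across categories as follows. The category labels and revenue figures are synthetic stand-ins, not results from the dataset.

```python
import pandas as pd
from scipy import stats

# Synthetic line-item revenues; the category labels are illustrative only.
sales = pd.DataFrame({
    "Category": ["Phones"] * 4 + ["Cables"] * 4 + ["Monitors"] * 4,
    "Revenue": [700.0, 600.0, 400.0, 700.0,
                14.95, 11.95, 29.90, 11.95,
                149.99, 389.99, 379.99, 109.99],
})

# One sample of revenues per category.
groups = [g["Revenue"].to_numpy() for _, g in sales.groupby("Category")]

# Kruskal-Wallis H-test: a rank-based alternative to one-way ANOVA.
result = stats.kruskal(*groups)
alpha = 0.05
print(f"H = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print("Reject H0" if result.pvalue < alpha else "Fail to reject H0")
```

This test compares the distribution of per-row revenue between groups rather than category totals, so it should be read as one possible way to examine the hypothesis.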

**The Analytical Questions**

1. How much money did we make this year? 

2. Can we identify any seasonality in the sales? 

3. What are our best and worst-selling products? 

4. How do sales compare to previous months or weeks? 

5. Which cities are our products delivered to most? 

6. How do product categories compare in revenue generated and quantities ordered? 

7. What additional details do the findings in the data reveal?

**Data Understanding:**

The dataset provided contains the following fields:

1. Order ID: Unique identifier for each order placed.
2. Product: Name or description of the product purchased.
3. Quantity Ordered: The number of units of the product ordered in each transaction.
4. Price Each: The unit price of the product.
5. Order Date: Date and time when the order was placed.
6. Purchase Address: Address where the purchase was made or where the products were delivered.
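Revenue is not a field in the dataset, so it has to be derived from Quantity Ordered and Price Each. A minimal sketch on three sample rows follows; the `Revenue` column name is an assumption introduced here.

```python
import pandas as pd

orders = pd.DataFrame({
    "Order ID": ["141234", "141235", "141236"],
    "Product": ["iPhone", "Lightning Charging Cable", "Wired Headphones"],
    "Quantity Ordered": ["1", "1", "2"],
    "Price Each": ["700", "14.95", "11.99"],
})

# CSV values may be read as strings, so convert before multiplying.
quantity = pd.to_numeric(orders["Quantity Ordered"], errors="coerce")
price = pd.to_numeric(orders["Price Each"], errors="coerce")

# 'Revenue' is a derived column, not part of the original dataset.
orders["Revenue"] = quantity * price
total_revenue = orders["Revenue"].sum()
print(orders[["Product", "Revenue"]])
print(f"Total revenue: {total_revenue:.2f}")
```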

Import the necessary modules/packages

In [100]:
# %pip install pandas
# %pip install matplotlib
# %pip install seaborn
# %pip install plotly
# %pip install scipy
# %pip install scikit-learn

In [101]:
# Database connectivity and configuration packages
import pyodbc
from sqlalchemy import create_engine
from dotenv import dotenv_values

# Data manipulation packages
import pandas as pd
import numpy as np

# Data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Load the sales data for January through June 2019

In [102]:
# Load data
df_sales_January_2019 = pd.read_csv('Power BI Capstione Data - (Jan -May)/Sales_January_2019.csv')
df_sales_February_2019 = pd.read_csv('Power BI Capstione Data - (Jan -May)/Sales_February_2019.csv')
df_sales_March_2019 = pd.read_csv('Power BI Capstione Data - (Jan -May)/Sales_March_2019.csv')
df_sales_April_2019 = pd.read_csv('Power BI Capstione Data - (Jan -May)/Sales_April_2019.csv')
df_sales_May_2019 = pd.read_csv('Power BI Capstione Data - (Jan -May)/Sales_May_2019.csv')
df_sales_June_2019 = pd.read_csv('Power BI Capstione Data - (Jan -May)/Sales_June_2019.csv')
# Display the first few rows of the DataFrame
df_sales_January_2019.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,141234,iPhone,1,700.0,01/22/19 21:25,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,01/28/19 14:15,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,01/17/19 13:33,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,01/05/19 20:33,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,01/25/19 11:59,"387 10th St, Austin, TX 73301"
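The six `read_csv` calls follow the same file-name pattern, so they could be condensed into a loop. This is a sketch, assuming every file is named `Sales_<Month>_<year>.csv` in a single folder; `load_monthly_sales` and `MONTHS` are hypothetical names.

```python
from pathlib import Path
import pandas as pd

MONTHS = ("January", "February", "March", "April", "May", "June")

def load_monthly_sales(data_dir, months=MONTHS, year=2019):
    """Return {month: DataFrame}, reading Sales_<Month>_<year>.csv from data_dir."""
    frames = {}
    for month in months:
        csv_path = Path(data_dir) / f"Sales_{month}_{year}.csv"
        frames[month] = pd.read_csv(csv_path)
    return frames

# Usage (folder name copied from the loading cell):
# monthly = load_monthly_sales("Power BI Capstione Data - (Jan -May)")
# monthly["January"].head()
```

Holding the frames in a dictionary keyed by month lets later checks run in a loop instead of being repeated per variable.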


In [103]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
df_sales_January_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9723 entries, 0 to 9722
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          9697 non-null   object
 1   Product           9697 non-null   object
 2   Quantity Ordered  9697 non-null   object
 3   Price Each        9697 non-null   object
 4   Order Date        9697 non-null   object
 5   Purchase Address  9697 non-null   object
dtypes: object(6)
memory usage: 455.9+ KB


In [104]:
# View the column names
df_sales_January_2019.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

In [105]:
# Iterate over columns and print the number of unique values in each column
for column in df_sales_January_2019.columns:
    unique_values = df_sales_January_2019[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order ID': 9269
Number of unique values in 'Product': 20
Number of unique values in 'Quantity Ordered': 8
Number of unique values in 'Price Each': 19
Number of unique values in 'Order Date': 8077
Number of unique values in 'Purchase Address': 9161


In [106]:
# Iterate over columns and view unique values
for column in df_sales_January_2019.columns:
    unique_values = df_sales_January_2019[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)


Unique values in 'Order ID':
['141234' '141235' '141236' ... '150499' '150500' '150501']
Unique values in 'Product':
['iPhone' 'Lightning Charging Cable' 'Wired Headphones' '27in FHD Monitor'
 'AAA Batteries (4-pack)' '27in 4K Gaming Monitor' 'USB-C Charging Cable'
 'Bose SoundSport Headphones' 'Apple Airpods Headphones'
 'Macbook Pro Laptop' 'Flatscreen TV' 'Vareebadd Phone'
 'AA Batteries (4-pack)' 'Google Phone' '20in Monitor'
 '34in Ultrawide Monitor' 'ThinkPad Laptop' 'LG Dryer'
 'LG Washing Machine' nan 'Product']
Unique values in 'Quantity Ordered':
['1' '2' '3' '5' '4' nan '7' 'Quantity Ordered' '6']
Unique values in 'Price Each':
['700' '14.95' '11.99' '149.99' '2.99' '389.99' '11.95' '99.99' '150'
 '1700' '300' '400' '3.84' '600' '109.99' '379.99' '999.99' '600.0' nan
 'Price Each']
Unique values in 'Order Date':
['01/22/19 21:25' '01/28/19 14:15' '01/17/19 13:33' ... '01/21/19 14:31'
 '01/15/19 14:21' '01/13/19 16:43']
Unique values in 'Purchase Address':
['944 Walnut St, Bo

Observations:

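1. Order ID: The 'Order ID' column contains numerical order identifiers stored as strings. They are not unique per row: there are 9,269 distinct values among 9,697 non-null entries, so some IDs repeat across rows. The remaining 26 of the 9,723 rows have no Order ID.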
2. Product: The 'Product' column contains the names of various products sold. It includes a variety of products such as iPhones, charging cables, headphones, monitors, batteries, laptops, TVs, and washing machines. However, there are two unusual entries: 'nan' and 'Product', which might indicate missing or placeholder values.
3. Quantity Ordered: The 'Quantity Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some unusual entries such as 'nan' and 'Quantity Ordered', which may indicate missing or placeholder values.
4. Price Each: The 'Price Each' column contains the price of each product. Most entries are numerical values representing prices, but there are some unusual entries such as 'nan' and 'Price Each', which may indicate missing or placeholder values. Additionally, there are some duplicate values in a different format ('600' and '600.0').
5. Order Date: The 'Order Date' column contains the date and time when each order was placed. Entries are in the format 'MM/DD/YY HH:mm'.
6. Purchase Address: The 'Purchase Address' column contains the addresses where the purchases were made. Each entry includes the street address, city, state, and ZIP code.

Overall, most columns contain relevant information, but the 'Product', 'Quantity Ordered', and 'Price Each' columns show anomalies that need investigation and cleaning. Missing values ('nan') should be addressed, as should entries that repeat the column name ('Product', 'Quantity Ordered', 'Price Each'), which suggest header rows repeated inside the data.
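The cleaning implied by these observations can be sketched on a small stand-in frame. This assumes the rows that repeat the column names are stray header rows and that blank rows carry no information; the actual cleaning steps used later in the project may differ.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Order ID": ["141234", np.nan, "Order ID", "141235"],
    "Product": ["iPhone", np.nan, "Product", "Lightning Charging Cable"],
    "Quantity Ordered": ["1", np.nan, "Quantity Ordered", "1"],
    "Price Each": ["700", np.nan, "Price Each", "14.95"],
})

# 1. Drop rows where every field is missing.
cleaned = raw.dropna(how="all")

# 2. Drop rows that merely repeat the header (column name used as a value).
is_header_row = cleaned["Order ID"].eq("Order ID")
cleaned = cleaned.loc[~is_header_row].copy()

# 3. Convert the numeric columns from strings.
cleaned["Quantity Ordered"] = cleaned["Quantity Ordered"].astype(int)
cleaned["Price Each"] = cleaned["Price Each"].astype(float)

print(cleaned.dtypes)
print(cleaned)
```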

In [107]:
# View the dimensions of the DataFrame
df_sales_January_2019.shape

(9723, 6)

In [108]:
# Display descriptive statistics of the DataFrame
df_sales_January_2019.describe()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
count,9697,9697,9697,9697.0,9697,9697
unique,9269,20,8,19.0,8077,9161
top,Order ID,USB-C Charging Cable,1,11.95,Order Date,Purchase Address
freq,16,1171,8795,1171.0,16,16


In [109]:
# Count the number of missing values in each column
df_sales_January_2019.isnull().sum()

Order ID            26
Product             26
Quantity Ordered    26
Price Each          26
Order Date          26
Purchase Address    26
dtype: int64

In [110]:
# Count the number of duplicate rows in the DataFrame
df_sales_January_2019.duplicated().sum()

50
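`duplicated()` compares whole rows, so the count may also include repeated header rows and fully blank rows, since those are identical to one another. A small sketch on made-up rows shows how the duplicates could be inspected before dropping them.

```python
import pandas as pd

df = pd.DataFrame({
    "Order ID": ["141234", "141235", "141235", "Order ID", "Order ID"],
    "Product": ["iPhone", "Cable", "Cable", "Product", "Product"],
})

# duplicated() flags the second and later copies of a full-row match.
n_duplicates = df.duplicated().sum()

# keep=False flags every member of a duplicate group, for inspection.
duplicate_groups = df[df.duplicated(keep=False)]

deduplicated = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(duplicate_groups), len(deduplicated))
```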

## Dataset of df_sales_February_2019

In [111]:
# Display the first few rows of the DataFrame
df_sales_February_2019.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,150502,iPhone,1,700.0,02/18/19 01:35,"866 Spruce St, Portland, ME 04101"
1,150503,AA Batteries (4-pack),1,3.84,02/13/19 07:24,"18 13th St, San Francisco, CA 94016"
2,150504,27in 4K Gaming Monitor,1,389.99,02/18/19 09:46,"52 6th St, New York City, NY 10001"
3,150505,Lightning Charging Cable,1,14.95,02/02/19 16:47,"129 Cherry St, Atlanta, GA 30301"
4,150506,AA Batteries (4-pack),2,3.84,02/28/19 20:32,"548 Lincoln St, Seattle, WA 98101"


In [112]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
df_sales_February_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12036 entries, 0 to 12035
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          12004 non-null  object
 1   Product           12004 non-null  object
 2   Quantity Ordered  12004 non-null  object
 3   Price Each        12004 non-null  object
 4   Order Date        12004 non-null  object
 5   Purchase Address  12004 non-null  object
dtypes: object(6)
memory usage: 564.3+ KB


In [113]:
# View the column names
df_sales_February_2019.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

In [114]:
# Iterate over columns and print the number of unique values in each column
for column in df_sales_February_2019.columns:
    unique_values = df_sales_February_2019[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order ID': 11508
Number of unique values in 'Product': 20
Number of unique values in 'Quantity Ordered': 8
Number of unique values in 'Price Each': 19
Number of unique values in 'Order Date': 9627
Number of unique values in 'Purchase Address': 11316


In [115]:
# Iterate over columns and view unique values
for column in df_sales_February_2019.columns:
    unique_values = df_sales_February_2019[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)


Unique values in 'Order ID':
['150502' '150503' '150504' ... '162006' '162007' '162008']
Unique values in 'Product':
['iPhone' 'AA Batteries (4-pack)' '27in 4K Gaming Monitor'
 'Lightning Charging Cable' 'Apple Airpods Headphones'
 'USB-C Charging Cable' 'Bose SoundSport Headphones' '27in FHD Monitor'
 'Wired Headphones' 'Macbook Pro Laptop' 'Flatscreen TV' '20in Monitor'
 'LG Dryer' 'AAA Batteries (4-pack)' 'ThinkPad Laptop'
 '34in Ultrawide Monitor' nan 'Google Phone' 'Vareebadd Phone'
 'LG Washing Machine' 'Product']
Unique values in 'Quantity Ordered':
['1' '2' '4' '3' nan '5' '7' 'Quantity Ordered' '6']
Unique values in 'Price Each':
['700' '3.84' '389.99' '14.95' '150' '11.95' '99.99' '149.99' '11.99'
 '1700' '300' '109.99' '600.0' '2.99' '999.99' '379.99' nan '600' '400'
 'Price Each']
Unique values in 'Order Date':
['02/18/19 01:35' '02/13/19 07:24' '02/18/19 09:46' ... '02/04/19 20:44'
 '02/24/19 06:31' '02/24/19 19:09']
Unique values in 'Purchase Address':
['866 Spruce St, Po

Observations:

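1. Order ID: The 'Order ID' column contains numerical order identifiers stored as strings. They are not unique per row: there are 11,508 distinct values among 12,004 non-null entries, so some IDs repeat across rows. The remaining 32 of the 12,036 rows have no Order ID.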
2. Product: The 'Product' column contains the names of various products sold. It includes a variety of products such as iPhones, batteries, monitors, charging cables, headphones, laptops, TVs, and washing machines. However, there are two unusual entries: 'nan' and 'Product', which might indicate missing or placeholder values.
3. Quantity Ordered: The 'Quantity Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some unusual entries such as 'nan' and 'Quantity Ordered', which may indicate missing or placeholder values.
4. Price Each: The 'Price Each' column contains the price of each product. Most entries are numerical values representing prices, but there are some unusual entries such as 'nan' and 'Price Each', which may indicate missing or placeholder values. Additionally, there are some duplicate values in a different format ('600' and '600.0').
5. Order Date: The 'Order Date' column contains the date and time when each order was placed. Entries are in the format 'MM/DD/YY HH:mm'.
6. Purchase Address: The 'Purchase Address' column contains the addresses where the purchases were made. Each entry includes the street address, city, state, and ZIP code.

As in January, anomalies exist in the 'Product', 'Quantity Ordered', and 'Price Each' columns that need further investigation and cleaning. Missing values ('nan') and entries that repeat the column name ('Product', 'Quantity Ordered', 'Price Each') should be handled appropriately to ensure the accuracy of the dataset.
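Because 'Order Date' follows the 'MM/DD/YY HH:mm' pattern, it can be parsed with an explicit format string. A minimal sketch, assuming that stray header values and blanks should become missing timestamps:

```python
import pandas as pd

order_dates = pd.Series(["02/18/19 01:35", "02/13/19 07:24", "Order Date", None])

# Explicit format for 'MM/DD/YY HH:mm'; unparseable entries become NaT.
parsed = pd.to_datetime(order_dates, format="%m/%d/%y %H:%M", errors="coerce")

print(parsed)
print(parsed.dt.month.dropna().astype(int).tolist())
```

Once parsed, the month, weekday, or hour can be pulled out through the `.dt` accessor for the seasonality and trend questions.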


In [116]:
# View the dimensions of the DataFrame
df_sales_February_2019.shape

(12036, 6)

In [117]:
# Display descriptive statistics of the DataFrame
df_sales_February_2019.describe()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
count,12004,12004,12004,12004.0,12004,12004
unique,11508,20,8,19.0,9627,11316
top,Order ID,USB-C Charging Cable,1,11.95,Order Date,Purchase Address
freq,18,1514,10863,1514.0,18,18


In [118]:
# Count the number of missing values in each column
df_sales_February_2019.isnull().sum()

Order ID            32
Product             32
Quantity Ordered    32
Price Each          32
Order Date          32
Purchase Address    32
dtype: int64

In [119]:
# Count the number of duplicate rows in the DataFrame
df_sales_February_2019.duplicated().sum()

66
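The same checks are repeated for every month, so they could be gathered into one helper and tabulated. This is a sketch with tiny stand-in frames; `summarize_month` is a hypothetical name, and in the notebook the dictionary would hold the real monthly DataFrames.

```python
import numpy as np
import pandas as pd

def summarize_month(df: pd.DataFrame) -> dict:
    """Collect the basic quality checks into one record."""
    return {
        "rows": len(df),
        "empty_rows": int(df.isna().all(axis=1).sum()),
        "header_rows": int(df["Order ID"].eq("Order ID").sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Tiny stand-in frames for illustration only.
frames = {
    "January": pd.DataFrame({"Order ID": ["1", "1", np.nan], "Product": ["A", "A", np.nan]}),
    "February": pd.DataFrame({"Order ID": ["2", "Order ID"], "Product": ["B", "Product"]}),
}
summary = pd.DataFrame({month: summarize_month(df) for month, df in frames.items()}).T
print(summary)
```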

## Dataset of df_sales_March_2019

In [120]:
# Display the first few rows of the DataFrame
df_sales_March_2019.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,162009,iPhone,1,700.0,03/28/19 20:59,"942 Church St, Austin, TX 73301"
1,162009,Lightning Charging Cable,1,14.95,03/28/19 20:59,"942 Church St, Austin, TX 73301"
2,162009,Wired Headphones,2,11.99,03/28/19 20:59,"942 Church St, Austin, TX 73301"
3,162010,Bose SoundSport Headphones,1,99.99,03/17/19 05:39,"261 10th St, San Francisco, CA 94016"
4,162011,34in Ultrawide Monitor,1,379.99,03/10/19 00:01,"764 13th St, San Francisco, CA 94016"


In [121]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
df_sales_March_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15226 entries, 0 to 15225
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          15189 non-null  object
 1   Product           15189 non-null  object
 2   Quantity Ordered  15189 non-null  object
 3   Price Each        15189 non-null  object
 4   Order Date        15189 non-null  object
 5   Purchase Address  15189 non-null  object
dtypes: object(6)
memory usage: 713.8+ KB


In [122]:
# View the column names
df_sales_March_2019.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

In [123]:
# Iterate over columns and print the number of unique values in each column
for column in df_sales_March_2019.columns:
    unique_values = df_sales_March_2019[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order ID': 14550
Number of unique values in 'Product': 20
Number of unique values in 'Quantity Ordered': 8
Number of unique values in 'Price Each': 19
Number of unique values in 'Order Date': 11784
Number of unique values in 'Purchase Address': 14247


In [124]:
# Iterate over columns and view unique values
for column in df_sales_March_2019.columns:
    unique_values = df_sales_March_2019[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)


Unique values in 'Order ID':
['162009' '162010' '162011' ... '176555' '176556' '176557']
Unique values in 'Product':
['iPhone' 'Lightning Charging Cable' 'Wired Headphones'
 'Bose SoundSport Headphones' '34in Ultrawide Monitor'
 'AA Batteries (4-pack)' 'USB-C Charging Cable' 'AAA Batteries (4-pack)'
 'LG Washing Machine' 'Apple Airpods Headphones' '27in 4K Gaming Monitor'
 'Google Phone' 'Macbook Pro Laptop' '27in FHD Monitor' 'ThinkPad Laptop'
 'Vareebadd Phone' 'Flatscreen TV' '20in Monitor' 'Product' 'LG Dryer' nan]
Unique values in 'Quantity Ordered':
['1' '2' '5' '3' '4' '6' 'Quantity Ordered' nan '7']
Unique values in 'Price Each':
['700' '14.95' '11.99' '99.99' '379.99' '3.84' '11.95' '2.99' '600.0'
 '150' '389.99' '600' '1700' '149.99' '999.99' '400' '300' '109.99'
 'Price Each' nan]
Unique values in 'Order Date':
['03/28/19 20:59' '03/17/19 05:39' '03/10/19 00:01' ... '03/22/19 20:27'
 '03/14/19 10:29' '03/30/19 12:32']
Unique values in 'Purchase Address':
['942 Church St, Aus

Observations:

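1. Order ID: The 'Order ID' column contains numerical order identifiers stored as strings. They are not unique per row: there are 14,550 distinct values among 15,189 non-null entries, and the first rows show order 162009 spanning three products. The remaining 37 of the 15,226 rows have no Order ID.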
2. Product: The 'Product' column contains the names of various products sold. It includes a variety of products such as iPhones, charging cables, headphones, monitors, batteries, laptops, TVs, and washing machines. However, there are two unusual entries: 'nan' and 'Product', which might indicate missing or placeholder values.
3. Quantity Ordered: The 'Quantity Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some unusual entries such as 'nan' and 'Quantity Ordered', which may indicate missing or placeholder values.
4. Price Each: The 'Price Each' column contains the price of each product. Most entries are numerical values representing prices, but there are some unusual entries such as 'nan' and 'Price Each', which may indicate missing or placeholder values. Additionally, there are some duplicate values in a different format ('600' and '600.0').
5. Order Date: The 'Order Date' column contains the date and time when each order was placed. Entries are in the format 'MM/DD/YY HH:mm'.
6. Purchase Address: The 'Purchase Address' column contains the addresses where the purchases were made. Each entry includes the street address, city, state, and ZIP code.

As in the previous months, anomalies exist in the 'Product', 'Quantity Ordered', and 'Price Each' columns that need further investigation and cleaning. Missing values ('nan') and entries that repeat the column name ('Product', 'Quantity Ordered', 'Price Each') should be handled appropriately to ensure the accuracy of the dataset.
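Since each address includes street, city, state, and ZIP code, the city needed for the geographical question can be pulled out with a regular expression. A sketch, assuming every valid address has the shape `<street>, <city>, <STATE> <ZIP>`; the output column names are introduced here for illustration.

```python
import pandas as pd

addresses = pd.Series([
    "942 Church St, Austin, TX 73301",
    "261 10th St, San Francisco, CA 94016",
    "Purchase Address",
    None,
])

# Expected shape: '<street>, <city>, <STATE> <ZIP>'. Non-matching rows yield NaN.
pattern = r"^(?P<Street>[^,]+),\s*(?P<City>[^,]+),\s*(?P<State>[A-Z]{2})\s+(?P<Zip>\d{5})$"
parts = addresses.str.extract(pattern)

print(parts)
print(parts["City"].value_counts())
```

Keeping the state alongside the city helps separate cities that share a name, such as Portland, OR and Portland, ME.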


In [125]:
# View the dimensions of the DataFrame
df_sales_March_2019.shape

(15226, 6)

In [126]:
# Display descriptive statistics of the DataFrame
df_sales_March_2019.describe()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
count,15189,15189,15189,15189.0,15189,15189
unique,14550,20,8,19.0,11784,14247
top,Order ID,USB-C Charging Cable,1,11.95,Order Date,Purchase Address
freq,35,1770,13779,1770.0,35,35


In [127]:
# Count the number of missing values in each column
df_sales_March_2019.isnull().sum()

Order ID            37
Product             37
Quantity Ordered    37
Price Each          37
Order Date          37
Purchase Address    37
dtype: int64

In [128]:
# Count the number of duplicate rows in the DataFrame
df_sales_March_2019.duplicated().sum()

95

## Dataset of df_sales_April_2019

In [129]:
# Display the first few rows of the DataFrame
df_sales_April_2019.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,176558.0,USB-C Charging Cable,2.0,11.95,04/19/19 08:46,"917 1st St, Dallas, TX 75001"
1,,,,,,
2,176559.0,Bose SoundSport Headphones,1.0,99.99,04/07/19 22:30,"682 Chestnut St, Boston, MA 02215"
3,176560.0,Google Phone,1.0,600.0,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"
4,176560.0,Wired Headphones,1.0,11.99,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"


In [130]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
df_sales_April_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18383 entries, 0 to 18382
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          18324 non-null  object
 1   Product           18324 non-null  object
 2   Quantity Ordered  18324 non-null  object
 3   Price Each        18324 non-null  object
 4   Order Date        18324 non-null  object
 5   Purchase Address  18324 non-null  object
dtypes: object(6)
memory usage: 861.8+ KB


In [131]:
# View the column names
df_sales_April_2019.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

In [132]:
# Iterate over columns and print the number of unique values in each column
for column in df_sales_April_2019.columns:
    unique_values = df_sales_April_2019[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order ID': 17538
Number of unique values in 'Product': 20
Number of unique values in 'Quantity Ordered': 8
Number of unique values in 'Price Each': 19
Number of unique values in 'Order Date': 13584
Number of unique values in 'Purchase Address': 17120


In [133]:
# Iterate over columns and view unique values
for column in df_sales_April_2019:
    unique_values = df_sales_April_2019[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)


Unique values in 'Order ID':
['176558' nan '176559' ... '194092' '194093' '194094']
Unique values in 'Product':
['USB-C Charging Cable' nan 'Bose SoundSport Headphones' 'Google Phone'
 'Wired Headphones' 'Macbook Pro Laptop' 'Lightning Charging Cable'
 '27in 4K Gaming Monitor' 'AA Batteries (4-pack)'
 'Apple Airpods Headphones' 'AAA Batteries (4-pack)' 'iPhone'
 'Flatscreen TV' '27in FHD Monitor' '20in Monitor' 'LG Dryer'
 'ThinkPad Laptop' 'Vareebadd Phone' 'LG Washing Machine'
 '34in Ultrawide Monitor' 'Product']
Unique values in 'Quantity Ordered':
['2' nan '1' '3' '5' 'Quantity Ordered' '4' '7' '6']
Unique values in 'Price Each':
['11.95' nan '99.99' '600' '11.99' '1700' '14.95' '389.99' '3.84' '150'
 '2.99' '700' '300' '149.99' '109.99' '600.0' '999.99' '400' '379.99'
 'Price Each']
Unique values in 'Order Date':
['04/19/19 08:46' nan '04/07/19 22:30' ... '04/15/19 16:02'
 '04/14/19 15:09' '04/18/19 11:08']
Unique values in 'Purchase Address':
['917 1st St, Dallas, TX 75001' nan '

Observations:

1. Order ID: The 'Order ID' column contains unique numerical identifiers for each order. However, there are some 'nan' values present in this column, indicating missing values.
2. Product: The 'Product' column contains the names of various products sold. Similar to previous observations, there are 'nan' and 'Product' entries, indicating missing or placeholder values.
3. Quantity Ordered: The 'Quantity Ordered' column contains the number of units ordered for each product. Similar to previous observations, there are 'nan' and 'Quantity Ordered' entries, indicating missing or placeholder values.
4. Price Each: The 'Price Each' column contains the price of each product. Similar to previous observations, there are 'nan' and 'Price Each' entries, indicating missing or placeholder values. Additionally, there are some duplicate values in a different format ('600' and '600.0').
5. Order Date: The 'Order Date' column contains the date and time when each order was placed. Similar to previous observations, there are 'nan' entries, indicating missing values.
6. Purchase Address: The 'Purchase Address' column contains the addresses where the purchases were made. Similar to previous observations, there are 'nan' entries, indicating missing values.

As before, the anomalies in the 'Product', 'Quantity Ordered', 'Price Each', 'Order Date', and 'Purchase Address' columns need further investigation and cleaning to ensure the accuracy of the dataset. Additionally, missing values should be handled appropriately to maintain data integrity.
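Before deciding how to handle the missing values, it helps to check whether they occur only in fully blank rows. A minimal sketch on stand-in data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Order ID": ["176558", np.nan, "176559"],
    "Product": ["USB-C Charging Cable", np.nan, "Bose SoundSport Headphones"],
    "Price Each": ["11.95", np.nan, "99.99"],
})

rows_all_missing = df.isna().all(axis=1)
rows_any_missing = df.isna().any(axis=1)

# If the two counts match, every missing value sits in a fully blank row.
only_blank_rows = rows_all_missing.sum() == rows_any_missing.sum()
print(int(rows_all_missing.sum()), int(rows_any_missing.sum()), only_blank_rows)

without_blank_rows = df.loc[~rows_all_missing]
```

If the counts match, dropping the blank rows removes all missing values without discarding any partially filled order.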

In [134]:
# View the dimensions of the DataFrame
df_sales_April_2019.shape

(18383, 6)

In [135]:
# Display descriptive statistics of the DataFrame
df_sales_April_2019.describe()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
count,18324,18324,18324,18324.0,18324,18324
unique,17538,20,8,19.0,13584,17120
top,Order ID,Lightning Charging Cable,1,14.95,Order Date,Purchase Address
freq,35,2201,16558,2201.0,35,35


In [136]:
# Count the number of missing values in each column
df_sales_April_2019.isnull().sum()

Order ID            59
Product             59
Quantity Ordered    59
Price Each          59
Order Date          59
Purchase Address    59
dtype: int64

In [137]:
# Count the number of duplicate rows in the DataFrame
df_sales_April_2019.duplicated().sum()

114

## Dataset of df_sales_May_2019

In [138]:
# Display the first few rows of the DataFrame
df_sales_May_2019.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,194095,Wired Headphones,1,11.99,05/16/19 17:14,"669 2nd St, New York City, NY 10001"
1,194096,AA Batteries (4-pack),1,3.84,05/19/19 14:43,"844 Walnut St, Dallas, TX 75001"
2,194097,27in FHD Monitor,1,149.99,05/24/19 11:36,"164 Madison St, New York City, NY 10001"
3,194098,Wired Headphones,1,11.99,05/02/19 20:40,"622 Meadow St, Dallas, TX 75001"
4,194099,AAA Batteries (4-pack),2,2.99,05/11/19 22:55,"17 Church St, Seattle, WA 98101"


In [139]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
df_sales_May_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16635 entries, 0 to 16634
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          16587 non-null  object
 1   Product           16587 non-null  object
 2   Quantity Ordered  16587 non-null  object
 3   Price Each        16587 non-null  object
 4   Order Date        16587 non-null  object
 5   Purchase Address  16587 non-null  object
dtypes: object(6)
memory usage: 779.9+ KB


In [140]:
# View the column names
df_sales_May_2019.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

In [141]:
# Iterate over columns and print the number of unique values in each column
for column in df_sales_May_2019.columns:
    unique_values = df_sales_May_2019[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order ID': 15827
Number of unique values in 'Product': 20
Number of unique values in 'Quantity Ordered': 8
Number of unique values in 'Price Each': 22
Number of unique values in 'Order Date': 12665
Number of unique values in 'Purchase Address': 15461


In [142]:
# Iterate over columns and view unique values
for column in df_sales_May_2019:
    unique_values = df_sales_May_2019[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)


Unique values in 'Order ID':
['194095' '194096' '194097' ... '209918' '209919' '209920']
Unique values in 'Product':
['Wired Headphones' 'AA Batteries (4-pack)' '27in FHD Monitor'
 'AAA Batteries (4-pack)' 'iPhone' 'USB-C Charging Cable'
 'Lightning Charging Cable' 'ThinkPad Laptop' '34in Ultrawide Monitor'
 'Google Phone' 'Apple Airpods Headphones' 'LG Dryer'
 'Bose SoundSport Headphones' 'Flatscreen TV' '27in 4K Gaming Monitor' nan
 'Macbook Pro Laptop' '20in Monitor' 'Vareebadd Phone'
 'LG Washing Machine' 'Product']
Unique values in 'Quantity Ordered':
['1' '2' '3' nan '4' '5' 'Quantity Ordered' '6' '7']
Unique values in 'Price Each':
['11.99' '3.84' '149.99' '2.99' '700.0' '11.95' '14.95' '999.99' '379.99'
 '600.0' '150.0' '99.99' '300.0' '389.99' nan '700' '150' '1700' '109.99'
 '600' '400' '300' 'Price Each']
Unique values in 'Order Date':
['05/16/19 17:14' '05/19/19 14:43' '05/24/19 11:36' ... '05/24/19 22:02'
 '05/04/19 12:46' '05/18/19 23:07']
Unique values in 'Purchase Addre

Observations:

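1. Order ID: The 'Order ID' column contains numerical order identifiers stored as strings. They are not unique per row: there are 15,827 distinct values among 16,587 non-null entries, so some IDs repeat across rows. The remaining 48 of the 16,635 rows have no Order ID.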
2. Product: The 'Product' column contains the names of various products sold. However, there are two unusual entries: 'nan' and 'Product', which might indicate missing or placeholder values.
3. Quantity Ordered: The 'Quantity Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some unusual entries such as 'nan' and 'Quantity Ordered', which may indicate missing or placeholder values.
4. Price Each: The 'Price Each' column contains the price of each product. Most entries are numerical values representing prices, but there are some unusual entries such as 'nan' and 'Price Each', which may indicate missing or placeholder values. Additionally, several prices appear in two formats ('600' and '600.0', '700' and '700.0', '150' and '150.0', '300' and '300.0'), which raises the unique count to 22 compared with 19 in the earlier months.
5. Order Date: The 'Order Date' column contains the date and time when each order was placed. Entries are in the format 'MM/DD/YY HH:mm'.
6. Purchase Address: The 'Purchase Address' column contains the addresses where the purchases were made. Each entry includes the street address, city, state, and ZIP code.

Similar to the previous observations, anomalies exist in the 'Product', 'Quantity Ordered', and 'Price Each' columns that need further investigation and cleaning. Additionally, missing values ('nan') and placeholder values ('Product' and 'Quantity Ordered') should be handled appropriately to ensure the accuracy of the dataset.
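The cleaning that these observations call for could be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual cleaning step: the helper name `clean_monthly_csv_frame`, the sample order IDs, and the sample rows are all assumptions made for the example.

```python
import pandas as pd

# Toy frame (made-up rows) mimicking the issues seen above: a repeated
# header row, a fully empty row, and numbers stored as strings.
raw = pd.DataFrame(
    {
        "Order ID": ["1001", "Order ID", None, "1002"],
        "Product": ["Wired Headphones", "Product", None, "iPhone"],
        "Quantity Ordered": ["1", "Quantity Ordered", None, "2"],
        "Price Each": ["11.99", "Price Each", None, "700"],
    }
)


def clean_monthly_csv_frame(df):
    """Hypothetical helper: drop empty and header rows, then cast numerics."""
    out = df.dropna(how="all")
    # A repeated header row carries the column name as its own value.
    out = out[out["Product"] != "Product"].copy()
    out["Quantity Ordered"] = pd.to_numeric(out["Quantity Ordered"]).astype(int)
    out["Price Each"] = pd.to_numeric(out["Price Each"])
    return out.reset_index(drop=True)


cleaned = clean_monthly_csv_frame(raw)
print(cleaned.dtypes)
```

Dropping the header rows before calling `pd.to_numeric` matters here, because the header text cannot be parsed as a number.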

In [143]:
# View the dimensions of the DataFrame
df_sales_May_2019.shape

(16635, 6)

In [144]:
# Display descriptive statistics of the DataFrame
df_sales_May_2019.describe()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
count,16587,16587,16587,16587.0,16587,16587
unique,15827,20,8,22.0,12665,15461
top,Order ID,Lightning Charging Cable,1,14.95,Order Date,Purchase Address
freq,33,1932,14977,1932.0,33,33


In [145]:
# Count the number of missing values in each column
df_sales_May_2019.isnull().sum()

Order ID            48
Product             48
Quantity Ordered    48
Price Each          48
Order Date          48
Purchase Address    48
dtype: int64

In [146]:
# Count the number of duplicate rows in the DataFrame
df_sales_May_2019.duplicated().sum()

93

## Dataset of df_sales_June_2019

In [147]:
# Display the first few rows of the DataFrame
df_sales_June_2019.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,209921,USB-C Charging Cable,1,11.95,06/23/19 19:34,"950 Walnut St, Portland, ME 04101"
1,209922,Macbook Pro Laptop,1,1700.0,06/30/19 10:05,"80 4th St, San Francisco, CA 94016"
2,209923,ThinkPad Laptop,1,999.99,06/24/19 20:18,"402 Jackson St, Los Angeles, CA 90001"
3,209924,27in FHD Monitor,1,149.99,06/05/19 10:21,"560 10th St, Seattle, WA 98101"
4,209925,Bose SoundSport Headphones,1,99.99,06/25/19 18:58,"545 2nd St, San Francisco, CA 94016"


In [148]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
df_sales_June_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13622 entries, 0 to 13621
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          13579 non-null  object
 1   Product           13579 non-null  object
 2   Quantity Ordered  13579 non-null  object
 3   Price Each        13579 non-null  object
 4   Order Date        13579 non-null  object
 5   Purchase Address  13579 non-null  object
dtypes: object(6)
memory usage: 638.7+ KB


In [149]:
# View the column names
df_sales_June_2019.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

In [150]:
# Iterate over columns and print the number of unique values in each column
for column in df_sales_June_2019:
    unique_values = df_sales_June_2019[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order ID': 12990
Number of unique values in 'Product': 20
Number of unique values in 'Quantity Ordered': 8
Number of unique values in 'Price Each': 23
Number of unique values in 'Order Date': 10742
Number of unique values in 'Purchase Address': 12720


In [151]:
# Iterate over columns and view unique values
for column in df_sales_June_2019:
    unique_values = df_sales_June_2019[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)

Unique values in 'Order ID':
['209921' '209922' '209923' ... '222907' '222908' '222909']
Unique values in 'Product':
['USB-C Charging Cable' 'Macbook Pro Laptop' 'ThinkPad Laptop'
 '27in FHD Monitor' 'Bose SoundSport Headphones'
 'Apple Airpods Headphones' 'Lightning Charging Cable' 'Wired Headphones'
 'Flatscreen TV' 'AA Batteries (4-pack)' 'AAA Batteries (4-pack)'
 '34in Ultrawide Monitor' 'iPhone' 'Google Phone' '27in 4K Gaming Monitor'
 '20in Monitor' 'Product' 'LG Dryer' 'Vareebadd Phone'
 'LG Washing Machine' nan]
Unique values in 'Quantity Ordered':
['1' '3' '2' 'Quantity Ordered' '5' nan '4' '9' '6']
Unique values in 'Price Each':
['11.95' '1700.0' '999.99' '149.99' '99.99' '150.0' '14.95' '11.99'
 '300.0' '3.84' '2.99' '379.99' '700.0' '600.0' '389.99' '109.99'
 'Price Each' '1700' '150' '400' '600' '300' '700' nan]
Unique values in 'Order Date':
['06/23/19 19:34' '06/30/19 10:05' '06/24/19 20:18' ... '06/09/19 22:07'
 '06/26/19 18:35' '06/25/19 14:33']
Unique values in 'Purch

Observations:

1. Order ID: The 'Order ID' column contains order identifiers stored as strings. The identifiers are not unique per row: the describe() output reports 12990 distinct values among 13579 non-null entries, and the most frequent value is the literal string 'Order ID' (23 occurrences), which points to header rows repeated inside the data.
2. Product: The 'Product' column contains the names of the products sold. It also holds two non-product entries: 'nan' (missing values) and 'Product' (a repeated header row).
3. Quantity Ordered: The 'Quantity Ordered' column contains the number of units ordered for each product. Most entries are numeric strings, but 'nan' and the header text 'Quantity Ordered' also appear.
4. Price Each: The 'Price Each' column contains the unit price of each product. Most entries are numeric strings, but 'nan' and the header text 'Price Each' also appear. Additionally, the same price appears in two formats (for example '600' and '600.0'), so the column must be converted to a numeric type before distinct prices are counted.
5. Order Date: The 'Order Date' column contains the date and time when each order was placed. Entries are in the format 'MM/DD/YY HH:mm'.
6. Purchase Address: The 'Purchase Address' column contains the address recorded for each purchase. Each entry includes the street address, city, state, and ZIP code.

As with the previous datasets, the 'Product', 'Quantity Ordered', and 'Price Each' columns contain anomalies that need cleaning. Missing values ('nan') and repeated header rows (for example 'Product', 'Quantity Ordered', and 'Price Each') should be removed, and the numeric columns converted from strings, to ensure the accuracy of the dataset.
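The '600' versus '600.0' duplication can be checked directly on the 23 distinct 'Price Each' strings printed above for June. The sketch below uses the standard library's `Decimal` in place of pandas, and the helper name `to_price` is an assumption made for the example.

```python
from decimal import Decimal, InvalidOperation

# The distinct 'Price Each' strings printed for June (nan excluded).
june_price_strings = [
    "11.95", "1700.0", "999.99", "149.99", "99.99", "150.0", "14.95", "11.99",
    "300.0", "3.84", "2.99", "379.99", "700.0", "600.0", "389.99", "109.99",
    "Price Each", "1700", "150", "400", "600", "300", "700",
]


def to_price(text):
    """Hypothetical helper: parse a price string, or return None for header text."""
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


# Decimal('600') and Decimal('600.0') compare equal, so the set merges them.
distinct_prices = {to_price(s) for s in june_price_strings} - {None}
print(len(june_price_strings), len(distinct_prices))  # 23 17
```

Once the header text is dropped and the formats are unified, the 23 strings collapse to 17 distinct prices.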

In [152]:
# View the dimensions of the DataFrame
df_sales_June_2019.shape

(13622, 6)

In [153]:
# Display descriptive statistics of the DataFrame
df_sales_June_2019.describe()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
count,13579,13579,13579,13579.0,13579,13579
unique,12990,20,8,23.0,10742,12720
top,Order ID,Lightning Charging Cable,1,14.95,Order Date,Purchase Address
freq,23,1564,12233,1564.0,23,23


In [154]:
# Count the number of missing values in each column
df_sales_June_2019.isnull().sum()

Order ID            43
Product             43
Quantity Ordered    43
Price Each          43
Order Date          43
Purchase Address    43
dtype: int64

In [155]:
# Count the number of duplicate rows in the DataFrame
df_sales_June_2019.duplicated().sum()

83

In [156]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

In [157]:
# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("servername")
database = environment_variables.get("databasename")
username = environment_variables.get("user")
password = environment_variables.get("password")
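Because `.get()` returns `None` for a key that is absent, a missing entry in the '.env' file would only surface later as a confusing connection error. A small check such as the one below could catch that earlier. The helper name `missing_settings` is an assumption, and a plain dictionary stands in for the values loaded from '.env'.

```python
REQUIRED_KEYS = ("servername", "databasename", "user", "password")


def missing_settings(settings, required=REQUIRED_KEYS):
    """Hypothetical helper: list required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]


# Stand-in for the dictionary loaded from '.env' (made-up values).
example_settings = {"servername": "example-server", "databasename": "example-db", "user": "analyst"}

problems = missing_settings(example_settings)
if problems:
    print("Missing settings:", ", ".join(problems))
```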

In [158]:
# Define connection string with appropriate parameters
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"


In [159]:
# Establish a connection to the database using the provided connection string.
connection = pyodbc.connect(connection_string)

Loading server data
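The server tables queried in the following sections share one naming pattern, from dbo.Sales_July_2019 to dbo.Sales_November_2019, so the query strings could also be generated instead of typed per cell. The sketch below is only an illustration: the helper name `monthly_queries` and the `MONTHS` list are assumptions, and no database connection is opened.

```python
MONTHS = ["July", "August", "September", "October", "November"]


def monthly_queries(months, year=2019, schema="dbo"):
    """Hypothetical helper: map each month name to its SELECT statement."""
    return {
        month: f"SELECT * FROM {schema}.Sales_{month}_{year}"
        for month in months
    }


queries = monthly_queries(MONTHS)
for month, query in queries.items():
    print(month, "->", query)

# With an open connection, each query could then be read as in the cells below:
# frames = {m: pd.read_sql(q, connection) for m, q in queries.items()}
```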

## Dataset of dbo.Sales_July_2019 on server

In [160]:
# SQL query to fetch data from the 'dbo.Sales_July_2019' table
query = "Select * from dbo.Sales_July_2019"

# Read data from the SQL query result
dap_july = pd.read_sql(query, connection)

# Display the first few rows of the DataFrame
dap_july.head()




Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,222910.0,Apple Airpods Headphones,1.0,150.0,2026-07-19 16:51:00.0000000,"389 South St, Atlanta, GA 30301"
1,222911.0,Flatscreen TV,1.0,300.0,2005-07-19 08:55:00.0000000,"590 4th St, Seattle, WA 98101"
2,222912.0,AA Batteries (4-pack),1.0,3.84,2029-07-19 12:41:00.0000000,"861 Hill St, Atlanta, GA 30301"
3,222913.0,AA Batteries (4-pack),1.0,3.84,2028-07-19 10:15:00.0000000,"190 Ridge St, Atlanta, GA 30301"
4,222914.0,AAA Batteries (4-pack),5.0,2.99,2031-07-19 02:13:00.0000000,"824 Forest St, Seattle, WA 98101"


In [161]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
dap_july.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14371 entries, 0 to 14370
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          14291 non-null  float64
 1   Product           14326 non-null  object 
 2   Quantity_Ordered  14291 non-null  float64
 3   Price_Each        14291 non-null  float64
 4   Order_Date        14291 non-null  object 
 5   Purchase_Address  14326 non-null  object 
dtypes: float64(3), object(3)
memory usage: 673.8+ KB


In [162]:
# View the column names
dap_july.columns

Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')

In [163]:
# Iterate over columns and print the number of unique values in each column
for column in dap_july:
    unique_values = dap_july[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order_ID': 13760
Number of unique values in 'Product': 20
Number of unique values in 'Quantity_Ordered': 9
Number of unique values in 'Price_Each': 17
Number of unique values in 'Order_Date': 11348
Number of unique values in 'Purchase_Address': 13472


In [164]:
# Iterate over columns and view unique values
for column in dap_july.columns:
    unique_values = dap_july[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)

Unique values in 'Order_ID':
[222910. 222911. 222912. ... 236667. 236668. 236669.]
Unique values in 'Product':
['Apple Airpods Headphones' 'Flatscreen TV' 'AA Batteries (4-pack)'
 'AAA Batteries (4-pack)' 'Bose SoundSport Headphones' 'Google Phone'
 'LG Dryer' 'USB-C Charging Cable' 'Lightning Charging Cable'
 '34in Ultrawide Monitor' 'Wired Headphones' 'Vareebadd Phone'
 '27in FHD Monitor' '20in Monitor' 'ThinkPad Laptop'
 '27in 4K Gaming Monitor' 'Macbook Pro Laptop' 'iPhone' None
 'LG Washing Machine' 'Product']
Unique values in 'Quantity_Ordered':
[ 1.  5.  2.  3. nan  4.  6.  7.  8.  9.]
Unique values in 'Price_Each':
[ 150.          300.            3.83999991    2.99000001   99.98999786
  600.           11.94999981   14.94999981  379.98999023   11.98999977
  400.          149.99000549  109.98999786  999.98999023  389.98999023
 1700.          700.                   nan]
Unique values in 'Order_Date':
['2026-07-19 16:51:00.0000000' '2005-07-19 08:55:00.0000000'
 '2029-07-19 12:41:0

Observations:

1. Order_ID: The 'Order_ID' column contains numerical order identifiers, stored as float64 (for example 222910.0). The identifiers are not unique per row: only 13760 of the 14291 non-null values are distinct, and 80 entries are missing.
2. Product: The 'Product' column contains the names of the products sold. It also holds two non-product entries: None (missing values) and 'Product', which looks like a header row repeated inside the data.
3. Quantity_Ordered: The 'Quantity_Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some missing values (NaN) present.
4. Price_Each: The 'Price_Each' column contains the unit price of each product. Most entries are numerical values representing prices, but there are some missing values (NaN) present. Several prices also carry floating-point noise (for example 3.83999991 instead of 3.84) and may need rounding to two decimals.
5. Order_Date: The 'Order_Date' column contains the date and time when each order was placed. Entries are stored as strings (object dtype) in a 'YYYY-MM-DD HH:MM:SS' layout, not as datetimes. In the rows shown, the day is always 19 while the year varies (2005, 2026, 2029, ...), which suggests that the day and the two-digit year were swapped when the original 'MM/DD/YY' values were imported.
6. Purchase_Address: The 'Purchase_Address' column contains the address recorded for each purchase. Each entry includes the street address, city, state, and ZIP code.

The main issues are the suspect dates in the 'Order_Date' column, the missing values (80 each in 'Order_ID', 'Quantity_Ordered', 'Price_Each', and 'Order_Date'; 45 each in 'Product' and 'Purchase_Address'), and the None and 'Product' entries in the 'Product' column. Further data cleaning and preprocessing is necessary to ensure the integrity of the dataset.
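One possible explanation for the odd 'Order_Date' values is that the original 'MM/DD/YY' text was read as 'YY-MM-DD': in the rows printed above the day is always 19 while the year varies. Under that assumption, which the output does not confirm, a value could be repaired by swapping the two fields back. The helper name `repair_order_date` is hypothetical.

```python
from datetime import datetime


def repair_order_date(text):
    """Hypothetical repair, assuming day and two-digit year were swapped."""
    # Keep 'YYYY-MM-DD HH:MM:SS' and drop the 7-digit fractional part,
    # which strptime's %f (at most 6 digits) cannot parse.
    parsed = datetime.strptime(text[:19], "%Y-%m-%d %H:%M:%S")
    # The stored day (19) is really the two-digit year; the last two digits
    # of the stored year are really the day of the month.
    return parsed.replace(year=2000 + parsed.day, day=parsed.year % 100)


print(repair_order_date("2026-07-19 16:51:00.0000000"))
```

For the first July row this yields 26 July 2019 at 16:51, which falls inside the month the table is named after.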

In [165]:
# View the dimensions of the DataFrame
dap_july.shape

(14371, 6)

In [166]:
# Display descriptive statistics of the DataFrame
dap_july.describe()

Unnamed: 0,Order_ID,Quantity_Ordered,Price_Each
count,14291.0,14291.0,14291.0
mean,229788.516269,1.124414,184.149922
std,3970.663121,0.460838,332.954499
min,222910.0,1.0,2.99
25%,226347.5,1.0,11.95
50%,229783.0,1.0,14.95
75%,233228.5,1.0,150.0
max,236669.0,9.0,1700.0


In [167]:
# Count the number of missing values in each column
dap_july.isnull().sum()

Order_ID            80
Product             45
Quantity_Ordered    80
Price_Each          80
Order_Date          80
Purchase_Address    45
dtype: int64

In [168]:
# Count the number of duplicate rows in the DataFrame
dap_july.duplicated().sum()

96

## Dataset of dbo.Sales_August_2019 on server

In [169]:
query = "Select * from dbo.Sales_August_2019"

dap_august = pd.read_sql(query, connection)

dap_august.head()




Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,236670.0,Wired Headphones,2.0,11.99,2031-08-19 22:21:00.0000000,"359 Spruce St, Seattle, WA 98101"
1,236671.0,Bose SoundSport Headphones,1.0,99.989998,2015-08-19 15:11:00.0000000,"492 Ridge St, Dallas, TX 75001"
2,236672.0,iPhone,1.0,700.0,2006-08-19 14:40:00.0000000,"149 7th St, Portland, OR 97035"
3,236673.0,AA Batteries (4-pack),2.0,3.84,2029-08-19 20:59:00.0000000,"631 2nd St, Los Angeles, CA 90001"
4,236674.0,AA Batteries (4-pack),2.0,3.84,2015-08-19 19:53:00.0000000,"736 14th St, New York City, NY 10001"


In [170]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
dap_august.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12011 entries, 0 to 12010
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          11957 non-null  float64
 1   Product           11983 non-null  object 
 2   Quantity_Ordered  11957 non-null  float64
 3   Price_Each        11957 non-null  float64
 4   Order_Date        11957 non-null  object 
 5   Purchase_Address  11983 non-null  object 
dtypes: float64(3), object(3)
memory usage: 563.1+ KB


In [171]:
# View the column names
dap_august.columns

Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')

In [172]:
# Iterate over columns and print the number of unique values in each column
for column in dap_august.columns:
    unique_values = dap_august[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order_ID': 11481
Number of unique values in 'Product': 20
Number of unique values in 'Quantity_Ordered': 8
Number of unique values in 'Price_Each': 17
Number of unique values in 'Order_Date': 9732
Number of unique values in 'Purchase_Address': 11296


In [173]:
# Iterate over columns and view unique values
for column in dap_august.columns:
    unique_values = dap_august[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)

Unique values in 'Order_ID':
[236670. 236671. 236672. ... 248148. 248149. 248150.]
Unique values in 'Product':
['Wired Headphones' 'Bose SoundSport Headphones' 'iPhone'
 'AA Batteries (4-pack)' '34in Ultrawide Monitor' '20in Monitor'
 'Macbook Pro Laptop' 'LG Washing Machine' '27in FHD Monitor'
 'Lightning Charging Cable' 'Apple Airpods Headphones'
 'AAA Batteries (4-pack)' 'USB-C Charging Cable' '27in 4K Gaming Monitor'
 'ThinkPad Laptop' 'Flatscreen TV' 'Google Phone' 'Vareebadd Phone'
 'Product' None 'LG Dryer']
Unique values in 'Quantity_Ordered':
[ 2.  1.  3.  4. nan  6.  7.  5.  8.]
Unique values in 'Price_Each':
[  11.98999977   99.98999786  700.            3.83999991  379.98999023
  109.98999786 1700.          600.          149.99000549   14.94999981
  150.            2.99000001   11.94999981  389.98999023  999.98999023
  300.          400.                   nan]
Unique values in 'Order_Date':
['2031-08-19 22:21:00.0000000' '2015-08-19 15:11:00.0000000'
 '2006-08-19 14:40:00.00

Observations:

1. Order_ID: The 'Order_ID' column contains numerical order identifiers, stored as float64. The identifiers are not unique per row: only 11481 of the 11957 non-null values are distinct, and 54 entries are missing.
2. Product: The 'Product' column contains the names of the products sold. It also holds two non-product entries: None (missing values) and 'Product', which looks like a header row repeated inside the data.
3. Quantity_Ordered: The 'Quantity_Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some missing values (NaN) present.
4. Price_Each: The 'Price_Each' column contains the unit price of each product. Most entries are numerical values representing prices, but there are some missing values (NaN) present.
5. Order_Date: The 'Order_Date' column contains the date and time when each order was placed. Entries are stored as strings (object dtype), not as datetimes. As in the July table, the day is always 19 and the year varies in the rows shown, which suggests that the day and the two-digit year were swapped on import.
6. Purchase_Address: The 'Purchase_Address' column contains the address recorded for each purchase. Each entry includes the street address, city, state, and ZIP code.

The data has the same issues as the July table. Every column has missing values: the info() output shows 11957 non-null entries out of 12011 for 'Order_ID', 'Quantity_Ordered', 'Price_Each', and 'Order_Date' (54 missing each), and 11983 for 'Product' and 'Purchase_Address' (28 missing each). The 'Product' column also contains None and 'Product' entries. Further cleaning and preprocessing is necessary to ensure the integrity of the dataset.

In [174]:
# View the dimensions of the DataFrame
dap_august.shape

(12011, 6)

In [175]:
# Display descriptive statistics of the DataFrame
dap_august.describe()

Unnamed: 0,Order_ID,Quantity_Ordered,Price_Each
count,11957.0,11957.0,11957.0
mean,242420.339299,1.124195,186.526442
std,3313.683368,0.44958,332.301934
min,236670.0,1.0,2.99
25%,239551.0,1.0,11.95
50%,242427.0,1.0,14.95
75%,245281.0,1.0,150.0
max,248150.0,8.0,1700.0


In [176]:
# Count the number of missing values in each column
dap_august.isnull().sum()

Order_ID            54
Product             28
Quantity_Ordered    54
Price_Each          54
Order_Date          54
Purchase_Address    28
dtype: int64
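The counts above differ by column: 54 missing values in the numeric and date columns against 28 in 'Product' and 'Purchase_Address'. One possible reading, which is an assumption rather than something the output confirms, is that 28 rows are entirely empty and the remaining 26 are repeated header rows whose numeric fields could not be stored as numbers. The sketch below shows how the two groups could be separated, using a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows): one valid order, one header-like row,
# one fully empty row, and another valid order.
toy = pd.DataFrame(
    {
        "Order_ID": [1.0, np.nan, np.nan, 2.0],
        "Product": ["iPhone", "Product", None, "LG Dryer"],
        "Quantity_Ordered": [1.0, np.nan, np.nan, 1.0],
        "Purchase_Address": ["1 A St", "Purchase Address", None, "2 B St"],
    }
)

fully_empty = toy.isnull().all(axis=1)      # every column missing
header_like = toy["Product"].eq("Product")  # column name stored as a value

summary = {
    "fully_empty_rows": int(fully_empty.sum()),
    "header_like_rows": int(header_like.sum()),
    "order_id_nulls": int(toy["Order_ID"].isnull().sum()),
}
print(summary)
```

If the reading is right, the two row counts should add up to the number of nulls in 'Order_ID', as they do in the toy frame.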

In [177]:
# Count the number of duplicate rows in the DataFrame
dap_august.duplicated().sum()

70

## Dataset of dbo.Sales_September_2019 on server

In [178]:
query = "Select * from dbo.Sales_September_2019"

dap_sept = pd.read_sql(query, connection)

dap_sept.head()





Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,248151.0,AA Batteries (4-pack),4.0,3.84,2017-09-19 14:44:00.0000000,"380 North St, Los Angeles, CA 90001"
1,248152.0,USB-C Charging Cable,2.0,11.95,2029-09-19 10:19:00.0000000,"511 8th St, Austin, TX 73301"
2,248153.0,USB-C Charging Cable,1.0,11.95,2016-09-19 17:48:00.0000000,"151 Johnson St, Los Angeles, CA 90001"
3,248154.0,27in FHD Monitor,1.0,149.990005,2027-09-19 07:52:00.0000000,"355 Hickory St, Seattle, WA 98101"
4,248155.0,USB-C Charging Cable,1.0,11.95,2001-09-19 19:03:00.0000000,"125 5th St, Atlanta, GA 30301"


In [179]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
dap_sept.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11686 entries, 0 to 11685
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          11629 non-null  float64
 1   Product           11646 non-null  object 
 2   Quantity_Ordered  11629 non-null  float64
 3   Price_Each        11629 non-null  float64
 4   Order_Date        11629 non-null  object 
 5   Purchase_Address  11646 non-null  object 
dtypes: float64(3), object(3)
memory usage: 547.9+ KB


In [180]:
# View the column names
dap_sept.columns

Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')

In [181]:
# Iterate over columns and print the number of unique values in each column
for column in dap_sept:
    unique_values = dap_sept[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order_ID': 11207
Number of unique values in 'Product': 20
Number of unique values in 'Quantity_Ordered': 6
Number of unique values in 'Price_Each': 17
Number of unique values in 'Order_Date': 9494
Number of unique values in 'Purchase_Address': 11032


In [182]:
# Iterate over columns and view unique values
for column in dap_sept.columns:
    unique_values = dap_sept[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)

Unique values in 'Order_ID':
[248151. 248152. 248153. ... 259355. 259356. 259357.]
Unique values in 'Product':
['AA Batteries (4-pack)' 'USB-C Charging Cable' '27in FHD Monitor'
 '34in Ultrawide Monitor' 'Lightning Charging Cable' 'Vareebadd Phone'
 'Wired Headphones' 'AAA Batteries (4-pack)' 'Apple Airpods Headphones'
 'Google Phone' '20in Monitor' 'Bose SoundSport Headphones' 'iPhone'
 'ThinkPad Laptop' 'Macbook Pro Laptop' 'Flatscreen TV'
 '27in 4K Gaming Monitor' None 'LG Dryer' 'LG Washing Machine' 'Product']
Unique values in 'Quantity_Ordered':
[ 4.  2.  1.  3. nan  5.  6.]
Unique values in 'Price_Each':
[   3.83999991   11.94999981  149.99000549  379.98999023   14.94999981
  400.           11.98999977    2.99000001  150.          600.
  109.98999786   99.98999786  700.          999.98999023 1700.
  300.          389.98999023           nan]
Unique values in 'Order_Date':
['2017-09-19 14:44:00.0000000' '2029-09-19 10:19:00.0000000'
 '2016-09-19 17:48:00.0000000' ... '2023-09-19 07

Observations:

1. Order_ID: The 'Order_ID' column contains numerical order identifiers, stored as float64. The identifiers are not unique per row: only 11207 of the 11629 non-null values are distinct, and 57 entries are missing.
2. Product: The 'Product' column contains the names of the products sold. It also holds two non-product entries: None (missing values) and 'Product', which looks like a header row repeated inside the data.
3. Quantity_Ordered: The 'Quantity_Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some missing values (NaN) present.
4. Price_Each: The 'Price_Each' column contains the unit price of each product. Most entries are numerical values representing prices, but there are some missing values (NaN) present.
5. Order_Date: The 'Order_Date' column contains the date and time when each order was placed. Entries are stored as strings (object dtype), not as datetimes. As in the July table, the day is always 19 and the year varies in the rows shown, which suggests that the day and the two-digit year were swapped on import.
6. Purchase_Address: The 'Purchase_Address' column contains the address recorded for each purchase. Each entry includes the street address, city, state, and ZIP code.

As with the previous tables, the data requires cleaning and preprocessing to address missing values and the None and 'Product' entries in the 'Product' column. Every column has missing values: the info() output shows 11629 non-null entries out of 11686 for 'Order_ID', 'Quantity_Ordered', 'Price_Each', and 'Order_Date' (57 missing each), and 11646 for 'Product' and 'Purchase_Address' (40 missing each).

In [183]:
# View the dimensions of the DataFrame
dap_sept.shape

(11686, 6)

In [184]:
# Display descriptive statistics of the DataFrame
dap_sept.describe()

Unnamed: 0,Order_ID,Quantity_Ordered,Price_Each
count,11629.0,11629.0,11629.0
mean,253751.814429,1.128128,179.400006
std,3235.175359,0.435077,328.595041
min,248151.0,1.0,2.99
25%,250947.0,1.0,11.95
50%,253751.0,1.0,14.95
75%,256552.0,1.0,150.0
max,259357.0,6.0,1700.0


In [185]:
# Count the number of missing values in each column
dap_sept.isnull().sum()

Order_ID            57
Product             40
Quantity_Ordered    57
Price_Each          57
Order_Date          57
Purchase_Address    40
dtype: int64

In [186]:
# Count the number of duplicate rows in the DataFrame
dap_sept.duplicated().sum()

73
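The duplicate count above should be read with care: `duplicated()` treats rows that are entirely missing as duplicates of one another, so the figure mixes empty rows (and probably repeated header rows) with genuinely repeated orders. A rough way to separate the two, shown on a made-up toy frame, is to count duplicates again after dropping rows without an order ID.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows): one repeated order, two fully empty rows,
# and one unique order.
toy = pd.DataFrame(
    {
        "Order_ID": [10.0, 10.0, np.nan, np.nan, 11.0],
        "Product": ["iPhone", "iPhone", None, None, "LG Dryer"],
        "Quantity_Ordered": [1.0, 1.0, np.nan, np.nan, 1.0],
    }
)

# Counts the repeated order AND the second empty row.
all_duplicates = int(toy.duplicated().sum())

# Counts only duplicates among rows that have an order ID.
valid_rows = toy.dropna(subset=["Order_ID"])
valid_duplicates = int(valid_rows.duplicated().sum())

print(all_duplicates, valid_duplicates)
```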

## Dataset of dbo.Sales_October_2019 on server

In [187]:
query = "Select * from dbo.Sales_October_2019"

dap_oct = pd.read_sql(query, connection)

dap_oct.head()





Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,259358.0,34in Ultrawide Monitor,1.0,379.98999,2028-10-19 10:56:00.0000000,"609 Cherry St, Dallas, TX 75001"
1,259359.0,27in 4K Gaming Monitor,1.0,389.98999,2028-10-19 17:26:00.0000000,"225 5th St, Los Angeles, CA 90001"
2,259360.0,AAA Batteries (4-pack),2.0,2.99,2024-10-19 17:20:00.0000000,"967 12th St, New York City, NY 10001"
3,259361.0,27in FHD Monitor,1.0,149.990005,2014-10-19 22:26:00.0000000,"628 Jefferson St, New York City, NY 10001"
4,259362.0,Wired Headphones,1.0,11.99,2007-10-19 16:10:00.0000000,"534 14th St, Los Angeles, CA 90001"


In [188]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
dap_oct.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20379 entries, 0 to 20378
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          20284 non-null  float64
 1   Product           20317 non-null  object 
 2   Quantity_Ordered  20284 non-null  float64
 3   Price_Each        20284 non-null  float64
 4   Order_Date        20284 non-null  object 
 5   Purchase_Address  20317 non-null  object 
dtypes: float64(3), object(3)
memory usage: 955.4+ KB


In [189]:
# View the column names
dap_oct.columns

Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')

In [190]:
# Iterate over columns and print the number of unique values in each column
for column in dap_oct.columns:
    unique_values = dap_oct[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order_ID': 19439
Number of unique values in 'Product': 20
Number of unique values in 'Quantity_Ordered': 8
Number of unique values in 'Price_Each': 17
Number of unique values in 'Order_Date': 14847
Number of unique values in 'Purchase_Address': 18918


In [191]:
# Iterate over columns and view unique values
for column in dap_oct.columns:
    unique_values = dap_oct[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)

Unique values in 'Order_ID':
[259358. 259359. 259360. ... 278794. 278795. 278796.]
Unique values in 'Product':
['34in Ultrawide Monitor' '27in 4K Gaming Monitor'
 'AAA Batteries (4-pack)' '27in FHD Monitor' 'Wired Headphones'
 'Lightning Charging Cable' 'Apple Airpods Headphones'
 'USB-C Charging Cable' '20in Monitor' 'iPhone'
 'Bose SoundSport Headphones' 'ThinkPad Laptop' 'AA Batteries (4-pack)'
 'Google Phone' 'Vareebadd Phone' 'Flatscreen TV' 'Macbook Pro Laptop'
 'LG Dryer' None 'LG Washing Machine' 'Product']
Unique values in 'Quantity_Ordered':
[ 1.  2.  3.  4.  5. nan  8.  6.  7.]
Unique values in 'Price_Each':
[ 379.98999023  389.98999023    2.99000001  149.99000549   11.98999977
   14.94999981  150.           11.94999981  109.98999786  700.
   99.98999786  999.98999023    3.83999991  600.          400.
  300.         1700.                   nan]
Unique values in 'Order_Date':
['2028-10-19 10:56:00.0000000' '2028-10-19 17:26:00.0000000'
 '2024-10-19 17:20:00.0000000' ... '2009

Observations:

1. Order_ID: The 'Order_ID' column contains numerical order identifiers, stored as float64. The identifiers are not unique per row: only 19439 of the 20284 non-null values are distinct, and 95 entries are missing.
2. Product: The 'Product' column contains the names of the products sold. It also holds two non-product entries: None (missing values) and 'Product', which looks like a header row repeated inside the data.
3. Quantity_Ordered: The 'Quantity_Ordered' column contains the number of units ordered for each product. Most entries are numerical values representing quantities, but there are some missing values (NaN) present.
4. Price_Each: The 'Price_Each' column contains the unit price of each product. Most entries are numerical values representing prices, but there are some missing values (NaN) present.
5. Order_Date: The 'Order_Date' column contains the date and time when each order was placed. Entries are stored as strings (object dtype), not as datetimes. As in the July table, the day is always 19 and the year varies in the rows shown, which suggests that the day and the two-digit year were swapped on import.
6. Purchase_Address: The 'Purchase_Address' column contains the address recorded for each purchase. Each entry includes the street address, city, state, and ZIP code.

As with the previous tables, the data requires cleaning and preprocessing to address missing values and the None and 'Product' entries in the 'Product' column. Every column has missing values: the info() output shows 20284 non-null entries out of 20379 for 'Order_ID', 'Quantity_Ordered', 'Price_Each', and 'Order_Date' (95 missing each), and 20317 for 'Product' and 'Purchase_Address' (62 missing each).
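The 'Price_Each' values printed above, such as 3.83999991 and 99.98999786, are what 3.84 and 99.99 look like after passing through 32-bit floating-point storage. That the server column uses a 32-bit type is an assumption, since the table schema is not shown. The effect can be reproduced with the standard library, and rounding to two decimals recovers the listed price. The helper name `as_float32` is hypothetical.

```python
import struct


def as_float32(value):
    """Round-trip a Python float through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


stored = as_float32(3.84)
print(stored)            # close to, but not exactly, 3.84
print(round(stored, 2))  # 3.84
```

Rounding is safe here because every price in the data has at most two decimal places.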

In [192]:
# View the dimensions of the DataFrame
dap_oct.shape

(20379, 6)

In [193]:
# Display descriptive statistics of the DataFrame
dap_oct.describe()

Unnamed: 0,Order_ID,Quantity_Ordered,Price_Each
count,20284.0,20284.0,20284.0
mean,269078.523122,1.119355,183.183939
std,5612.651509,0.436922,334.005122
min,259358.0,1.0,2.99
25%,264210.75,1.0,11.95
50%,269081.5,1.0,14.95
75%,273942.25,1.0,150.0
max,278796.0,8.0,1700.0


In [194]:
# Count the number of missing values in each column
dap_oct.isnull().sum()

Order_ID            95
Product             62
Quantity_Ordered    95
Price_Each          95
Order_Date          95
Purchase_Address    62
dtype: int64

In [195]:
# Count the number of duplicate rows in the DataFrame
dap_oct.duplicated().sum()

126

## Dataset of dbo.Sales_November_2019 on server

In [196]:

query = "Select * from dbo.Sales_November_2019"

dap_nov = pd.read_sql(query, connection)

dap_nov.head()





Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,278797.0,Wired Headphones,1.0,11.99,2021-11-19 09:54:00.0000000,"46 Park St, New York City, NY 10001"
1,278798.0,USB-C Charging Cable,2.0,11.95,2017-11-19 10:03:00.0000000,"962 Hickory St, Austin, TX 73301"
2,278799.0,Apple Airpods Headphones,1.0,150.0,2019-11-19 14:56:00.0000000,"464 Cherry St, Los Angeles, CA 90001"
3,278800.0,27in FHD Monitor,1.0,149.990005,2025-11-19 22:24:00.0000000,"649 10th St, Seattle, WA 98101"
4,278801.0,Bose SoundSport Headphones,1.0,99.989998,2009-11-19 13:56:00.0000000,"522 Hill St, Boston, MA 02215"


In [197]:
# View information about the DataFrame
# This line displays information about the DataFrame, including the data types of columns and memory usage
dap_nov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17661 entries, 0 to 17660
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          17580 non-null  float64
 1   Product           17616 non-null  object 
 2   Quantity_Ordered  17580 non-null  float64
 3   Price_Each        17580 non-null  float64
 4   Order_Date        17580 non-null  object 
 5   Purchase_Address  17616 non-null  object 
dtypes: float64(3), object(3)
memory usage: 828.0+ KB


In [198]:
# View the column names
dap_nov.columns

Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')

In [199]:
# Iterate over columns and print the number of unique values in each column
for column in dap_nov.columns:
    unique_values = dap_nov[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order_ID': 16868
Number of unique values in 'Product': 20
Number of unique values in 'Quantity_Ordered': 8
Number of unique values in 'Price_Each': 17
Number of unique values in 'Order_Date': 13196
Number of unique values in 'Purchase_Address': 16492


In [200]:
# Iterate over columns and view unique values
for column in dap_nov.columns:
    unique_values = dap_nov[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)

Unique values in 'Order_ID':
[278797. 278798. 278799. ... 295662. 295663. 295664.]
Unique values in 'Product':
['Wired Headphones' 'USB-C Charging Cable' 'Apple Airpods Headphones'
 '27in FHD Monitor' 'Bose SoundSport Headphones'
 'Lightning Charging Cable' 'ThinkPad Laptop' 'AAA Batteries (4-pack)'
 'AA Batteries (4-pack)' 'Macbook Pro Laptop' 'iPhone' '20in Monitor'
 '34in Ultrawide Monitor' 'Vareebadd Phone' 'Flatscreen TV'
 '27in 4K Gaming Monitor' None 'Google Phone' 'LG Washing Machine'
 'LG Dryer' 'Product']
Unique values in 'Quantity_Ordered':
[ 1.  2.  3. nan  4.  5.  6.  7.  8.]
Unique values in 'Price_Each':
[  11.98999977   11.94999981  150.          149.99000549   99.98999786
   14.94999981  999.98999023    2.99000001    3.83999991 1700.
  700.          109.98999786  379.98999023  400.          300.
  389.98999023           nan  600.        ]
Unique values in 'Order_Date':
['2021-11-19 09:54:00.0000000' '2017-11-19 10:03:00.0000000'
 '2019-11-19 14:56:00.0000000' ... '2023

Observations:

1. Order_ID: The 'Order_ID' column contains numerical order identifiers stored as float64. Per `info()`, 17,580 of the 17,661 rows have a value (81 missing). Only 16,868 of those values are distinct, so some IDs repeat across rows; this may reflect orders that contain several products, exact duplicate rows, or both.
2. Product: The 'Product' column contains 19 product names plus two problem entries: `None`, which marks missing values, and the literal string 'Product', which is most likely a column header repeated inside the data.
3. Quantity_Ordered: The 'Quantity_Ordered' column contains the number of units ordered for each product. Values are whole numbers from 1 to 8 stored as float64, with some missing values (NaN).
4. Price_Each: The 'Price_Each' column contains the unit price of each product: 17 distinct prices plus NaN. Values such as 11.98999977 and 149.99000549 are floating-point approximations of 11.99 and 149.99 and can be rounded to two decimals for reporting.
5. Order_Date: The 'Order_Date' column holds the date and time of each order, but it is stored as strings (dtype object), not as datetimes. In the rows shown, the year field varies (2021, 2017, 2025, 2009) while the day field is always 19, even though this table covers November 2019. This suggests that the day and the two-digit year were swapped when the data was loaded, so the column needs to be re-parsed before any time-based analysis.
6. Purchase_Address: The 'Purchase_Address' column contains the addresses where the purchases were made. Each entry includes the street address, city, state, and ZIP code.

As with the previous months, the data requires cleaning and preprocessing to address missing values and the inconsistent entries in the 'Product' column. No column is complete: per `info()`, 'Order_ID', 'Quantity_Ordered', 'Price_Each', and 'Order_Date' each have 81 missing values, and 'Product' and 'Purchase_Address' each have 45. The 36 rows that have a 'Product' value but no 'Order_ID' are possibly the repeated header rows, which should be confirmed during cleaning.
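The issues noted above (blank rows and the stray 'Product' entry, together with exact duplicate rows) could be handled along the following lines. This is a minimal sketch, not the cleaning used later in the project: the function name `clean_month` and the sample rows are made up for illustration, and it assumes that a 'Product' value of 'Product' always marks a repeated header row.

```python
import numpy as np
import pandas as pd


def clean_month(df: pd.DataFrame) -> pd.DataFrame:
    """Drop blank rows, repeated header rows, and exact duplicate rows."""
    out = df.dropna(how="all")              # rows where every column is missing
    out = out[out["Product"] != "Product"]  # assumed to be repeated header rows
    out = out.drop_duplicates()             # keep the first copy of identical rows
    return out.reset_index(drop=True)


# Made-up sample that mimics the problems seen in the monthly tables
sample = pd.DataFrame(
    {
        "Order_ID": [278797.0, np.nan, np.nan, 278797.0, 278798.0],
        "Product": ["Wired Headphones", None, "Product",
                    "Wired Headphones", "USB-C Charging Cable"],
        "Quantity_Ordered": [1.0, np.nan, np.nan, 1.0, 2.0],
        "Price_Each": [11.99, np.nan, np.nan, 11.99, 11.95],
        "Purchase_Address": ["46 Park St, New York City, NY 10001", None,
                             "Purchase Address",
                             "46 Park St, New York City, NY 10001",
                             "962 Hickory St, Austin, TX 73301"],
    }
)

cleaned = clean_month(sample)
print(cleaned)
```

Dropping only fully blank rows first keeps the header-row filter explicit, which makes it easier to count how many rows each step removes.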

In [201]:
# View the dimensions of the DataFrame
dap_nov.shape

(17661, 6)

In [202]:
# Display descriptive statistics of the DataFrame
dap_nov.describe()

Unnamed: 0,Order_ID,Quantity_Ordered,Price_Each
count,17580.0,17580.0,17580.0
mean,287235.962799,1.126735,180.881967
std,4866.884258,0.452011,330.175894
min,278797.0,1.0,2.99
25%,283023.75,1.0,11.95
50%,287236.5,1.0,14.95
75%,291449.25,1.0,150.0
max,295664.0,8.0,1700.0


In [203]:
# Count the number of missing values in each column
dap_nov.isnull().sum()

Order_ID            81
Product             45
Quantity_Ordered    81
Price_Each          81
Order_Date          81
Purchase_Address    45
dtype: int64

In [204]:
# Count the number of duplicate rows in the DataFrame
dap_nov.duplicated().sum()

108

## Dataset of dbo.Sales_December_2019 on server

In [205]:
query = "Select * from dbo.Sales_December_2019"

dap_dec = pd.read_sql(query, connection)

dap_dec.head()



Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,295665.0,Macbook Pro Laptop,1.0,1700.0,2030-12-19 00:01:00.0000000,"136 Church St, New York City, NY 10001"
1,295666.0,LG Washing Machine,1.0,600.0,2029-12-19 07:03:00.0000000,"562 2nd St, New York City, NY 10001"
2,295667.0,USB-C Charging Cable,1.0,11.95,2012-12-19 18:21:00.0000000,"277 Main St, New York City, NY 10001"
3,295668.0,27in FHD Monitor,1.0,149.990005,2022-12-19 15:13:00.0000000,"410 6th St, San Francisco, CA 94016"
4,295669.0,USB-C Charging Cable,1.0,11.95,2018-12-19 12:38:00.0000000,"43 Hill St, Atlanta, GA 30301"


In [206]:
# View DataFrame information: column data types, non-null counts, and memory usage
dap_dec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25117 entries, 0 to 25116
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          24989 non-null  float64
 1   Product           25037 non-null  object 
 2   Quantity_Ordered  24989 non-null  float64
 3   Price_Each        24989 non-null  float64
 4   Order_Date        24989 non-null  object 
 5   Purchase_Address  25037 non-null  object 
dtypes: float64(3), object(3)
memory usage: 1.1+ MB


In [207]:
# View the column names
dap_dec.columns

Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')

In [208]:
# Iterate over columns and print the number of unique values in each column
for column in dap_dec.columns:
    unique_values = dap_dec[column].nunique()
    print(f"Number of unique values in '{column}': {unique_values}")

Number of unique values in 'Order_ID': 24006
Number of unique values in 'Product': 20
Number of unique values in 'Quantity_Ordered': 7
Number of unique values in 'Price_Each': 17
Number of unique values in 'Order_Date': 17305
Number of unique values in 'Purchase_Address': 23215


In [209]:
# Iterate over columns and view unique values
for column in dap_dec.columns:
    unique_values = dap_dec[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)

Unique values in 'Order_ID':
[295665. 295666. 295667. ... 319668. 319669. 319670.]
Unique values in 'Product':
['Macbook Pro Laptop' 'LG Washing Machine' 'USB-C Charging Cable'
 '27in FHD Monitor' 'AA Batteries (4-pack)' 'Bose SoundSport Headphones'
 'AAA Batteries (4-pack)' 'ThinkPad Laptop' 'Lightning Charging Cable'
 'Google Phone' 'Wired Headphones' 'Apple Airpods Headphones'
 'Vareebadd Phone' 'iPhone' '20in Monitor' '34in Ultrawide Monitor'
 'Flatscreen TV' '27in 4K Gaming Monitor' 'Product' None 'LG Dryer']
Unique values in 'Quantity_Ordered':
[ 1.  2.  4.  3. nan  7.  5.  6.]
Unique values in 'Price_Each':
[1700.          600.           11.94999981  149.99000549    3.83999991
   99.98999786    2.99000001  999.98999023   14.94999981   11.98999977
  150.          400.          700.          109.98999786  379.98999023
  300.          389.98999023           nan]
Unique values in 'Order_Date':
['2030-12-19 00:01:00.0000000' '2029-12-19 07:03:00.0000000'
 '2012-12-19 18:21:00.0000000

Observations:

1. Order_ID: The 'Order_ID' column contains numerical order identifiers stored as float64. Per `info()`, 24,989 of the 25,117 rows have a value (128 missing). Only 24,006 of those values are distinct, so some IDs repeat across rows; this may reflect orders that contain several products, exact duplicate rows, or both.
2. Product: The 'Product' column contains 19 product names plus two problem entries: `None`, which marks missing values, and the literal string 'Product', which is most likely a column header repeated inside the data.
3. Quantity_Ordered: The 'Quantity_Ordered' column contains the number of units ordered for each product. Values are whole numbers from 1 to 7 stored as float64, with some missing values (NaN).
4. Price_Each: The 'Price_Each' column contains the unit price of each product: 17 distinct prices plus NaN. Values such as 11.94999981 and 149.99000549 are floating-point approximations of 11.95 and 149.99 and can be rounded to two decimals for reporting.
5. Order_Date: The 'Order_Date' column holds the date and time of each order, but it is stored as strings (dtype object), not as datetimes. In the rows shown, the year field varies (2030, 2029, 2012, 2022, 2018) while the day field is always 19, even though this table covers December 2019. This suggests that the day and the two-digit year were swapped when the data was loaded, so the column needs to be re-parsed before any time-based analysis.
6. Purchase_Address: The 'Purchase_Address' column contains the addresses where the purchases were made. Each entry includes the street address, city, state, and ZIP code.

As with the previous months, the data requires cleaning and preprocessing to address missing values and the inconsistent entries in the 'Product' column. No column is complete: per `info()`, 'Order_ID', 'Quantity_Ordered', 'Price_Each', and 'Order_Date' each have 128 missing values, and 'Product' and 'Purchase_Address' each have 80. The 48 rows that have a 'Product' value but no 'Order_ID' are possibly the repeated header rows, which should be confirmed during cleaning.
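The `head()` output for this table shows 'Order_Date' values such as `2030-12-19 00:01:00.0000000`, although the table holds December 2019 sales. One reading is that the source stored dates as day-month-year with a two-digit year and that they were loaded as year-month-day. The sketch below rebuilds the dates under that assumption; it is an interpretation of the displayed values, not something the source confirms, and the function name `repair_order_date` is hypothetical.

```python
import pandas as pd

# Pattern for strings such as '2030-12-19 00:01:00.0000000'.
# Assumption: the last two digits of the displayed year are really the day,
# and the displayed day is really the two-digit year (19 -> 2019).
PATTERN = (
    r"^\d{2}(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}).*$"
)


def repair_order_date(raw: pd.Series) -> pd.Series:
    """Swap the day and year fields back, then parse to datetime64."""
    text = raw.str.replace(
        PATTERN, r"20\g<year>-\g<month>-\g<day> \g<time>", regex=True
    )
    # Anything that still does not match the format becomes NaT
    return pd.to_datetime(text, format="%Y-%m-%d %H:%M:%S", errors="coerce")


# Two values copied from the head() outputs above, plus a missing value
sample = pd.Series(
    ["2021-11-19 09:54:00.0000000", "2030-12-19 00:01:00.0000000", None]
)
fixed = repair_order_date(sample)
print(fixed)
```

Before applying such a repair to the full tables, it is worth checking that every rebuilt date falls in the expected month of 2019; rows that do not would contradict the assumption.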

In [210]:
# View the dimensions of the DataFrame
dap_dec.shape

(25117, 6)

In [211]:
# Display descriptive statistics of the DataFrame
dap_dec.describe()

Unnamed: 0,Order_ID,Quantity_Ordered,Price_Each
count,24989.0,24989.0,24989.0
mean,307655.02317,1.125335,183.845649
std,6932.795456,0.445414,333.077036
min,295665.0,1.0,2.99
25%,301653.0,1.0,11.95
50%,307656.0,1.0,14.95
75%,313654.0,1.0,150.0
max,319670.0,7.0,1700.0


In [212]:
# Count the number of missing values in each column
dap_dec.isnull().sum()

Order_ID            128
Product              80
Quantity_Ordered    128
Price_Each          128
Order_Date          128
Purchase_Address     80
dtype: int64

In [213]:
# Count the number of duplicate rows in the DataFrame
dap_dec.duplicated().sum()

166
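The same checks (shape, missing values, duplicate rows) are repeated cell by cell for each monthly table. They could also be collected into one summary table, as in the sketch below. The helper name `quality_report` and the toy frames are illustrative only; the notebook's own DataFrames (for example `dap_oct`, `dap_nov`, `dap_dec`) would be passed in instead.

```python
import numpy as np
import pandas as pd


def quality_report(frames: dict) -> pd.DataFrame:
    """Summarise row count, blank rows, missing IDs, and duplicates per table."""
    rows = []
    for name, df in frames.items():
        rows.append(
            {
                "table": name,
                "rows": len(df),
                # rows in which every column is missing
                "blank_rows": int(df.isnull().all(axis=1).sum()),
                "missing_order_id": int(df["Order_ID"].isnull().sum()),
                # rows identical to an earlier row (first copy not counted)
                "duplicate_rows": int(df.duplicated().sum()),
            }
        )
    return pd.DataFrame(rows).set_index("table")


# Toy frames standing in for the monthly tables
toy_a = pd.DataFrame(
    {"Order_ID": [1.0, 1.0, np.nan, 2.0], "Product": ["A", "A", None, "B"]}
)
toy_b = pd.DataFrame({"Order_ID": [3.0, 4.0], "Product": ["C", "D"]})

report = quality_report({"toy_a": toy_a, "toy_b": toy_b})
print(report)

# With the notebook's data this would be, for example:
# quality_report({"October": dap_oct, "November": dap_nov, "December": dap_dec})
```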