### Business Intelligence Solution for 2019 Sales Analysis

##### CRISP-DM Outline:

1. Business Understanding:
 Objective:
Analyze the client’s transactional data from 2019 to identify opportunities to drive more sales and improve operational efficiency.

Key Business Questions:
1. How much revenue was generated in 2019?
2. Are there any seasonal trends in the sales data?
3. Which products performed the best and worst?
4. How do sales in different months and weeks compare?
5. Which cities received the most product deliveries?
6. How do different product categories perform in terms of revenue and quantity sold?
7. What additional insights can inform business strategy?

Constraints:
Products priced above $99.99 are categorized as "High-Level," while others are "Basic-Level."

2. Data Understanding:
Data Sources:
CSV Files: Sales data from January to June 2019 stored in CSV format.
Database: Sales data from July to December 2019 stored in a remote SQL database.
Data Collection:
Import CSV files for the first 6 months.
Connect to the remote database and extract data for the last 6 months.
Initial Data Exploration:
Check for missing data, inconsistencies, and data types.
Get an overview of sales trends, product lists, pricing, and delivery locations.

3. Data Preparation:
Data Cleaning:
Handle missing values and correct data inconsistencies.
Standardize formats across CSV and database data.
Data Integration:
Combine data from both sources into a unified dataset for the entire year.
Feature Engineering:
Create new features such as product level ("High-Level" or "Basic-Level") based on unit price.
Generate time-based features (e.g., month, week) to analyze trends.
Label and categorize products by their revenue contribution and quantity sold.
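
The feature-engineering steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual implementation: it assumes the cleaned column names `Price_Each`, `Quantity_Ordered`, and `Order_Date`, and the helper name `add_features` and the new columns (`Product_Level`, `Month`, `Week`, `Revenue`) are hypothetical.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with product-level and time-based features (hypothetical helper)."""
    out = df.copy()
    # Constraint from the brief: priced above $99.99 -> "High-Level", otherwise "Basic-Level"
    out["Product_Level"] = np.where(out["Price_Each"] > 99.99, "High-Level", "Basic-Level")
    # Time-based features for monthly and weekly comparisons
    out["Month"] = out["Order_Date"].dt.month
    out["Week"] = out["Order_Date"].dt.isocalendar().week.astype(int)
    # Line-item revenue: quantity times unit price
    out["Revenue"] = out["Quantity_Ordered"] * out["Price_Each"]
    return out


# Small stand-in sample (rows taken from the notebook's previews)
sample = pd.DataFrame({
    "Product": ["iPhone", "Bose SoundSport Headphones", "Wired Headphones"],
    "Quantity_Ordered": [1, 1, 2],
    "Price_Each": [700.00, 99.99, 11.99],
    "Order_Date": pd.to_datetime(
        ["2019-01-22 21:25", "2019-03-17 05:39", "2019-01-17 13:33"]
    ),
})
features = add_features(sample)
print(features[["Product", "Product_Level", "Month", "Week", "Revenue"]])
```

Note that a product priced at exactly $99.99 falls into "Basic-Level" under the "above $99.99" rule.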

4. Modeling:
Sales Analysis:
Calculate total revenue and compare sales by month and by week.
Identify seasonality in sales using time-series analysis.
Rank products by sales performance (best-selling vs. worst-selling).
Compare sales across cities and product categories.
Product Categorization:
Apply conditions to label products as "High-Level" or "Basic-Level."
Visualization:
Use visual tools to illustrate trends, seasonality, and comparisons (e.g., bar charts, line graphs, heat maps).
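
As a rough sketch of the sales-analysis calculations described above, the snippet below computes total revenue, monthly revenue, and the best- and worst-selling products by quantity. It assumes a cleaned DataFrame with `Product`, `Quantity_Ordered`, `Price_Each`, and `Order_Date` columns; the function name `summarize_sales` and the sample rows are illustrative only.

```python
import pandas as pd


def summarize_sales(df: pd.DataFrame) -> dict:
    """Compute headline sales metrics from line-item data (hypothetical helper)."""
    sales = df.assign(Revenue=df["Quantity_Ordered"] * df["Price_Each"])
    # Revenue per calendar month
    monthly = sales.groupby(sales["Order_Date"].dt.month)["Revenue"].sum()
    # Units sold per product, highest first
    units = sales.groupby("Product")["Quantity_Ordered"].sum().sort_values(ascending=False)
    return {
        "total_revenue": sales["Revenue"].sum(),
        "monthly_revenue": monthly,
        "best_seller": units.index[0],
        "worst_seller": units.index[-1],
    }


# Illustrative rows only, not the client's data
sample = pd.DataFrame({
    "Product": ["iPhone", "Wired Headphones", "AA Batteries (4-pack)"],
    "Quantity_Ordered": [1, 3, 2],
    "Price_Each": [700.00, 11.99, 3.84],
    "Order_Date": pd.to_datetime(["2019-01-22", "2019-01-17", "2019-02-13"]),
})
summary = summarize_sales(sample)
print(summary["total_revenue"])
print(summary["monthly_revenue"])
```

Ranking by units sold and ranking by revenue can give different answers (low-priced items often sell in volume), so it may be worth reporting both.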

5. Evaluation:
Validation:
Ensure all business questions are addressed with the analysis.
Validate the consistency and accuracy of the data.
Business Insight:
Summarize key findings, such as peak sales periods, top-performing products, and geographic hotspots.
Provide recommendations for increasing sales and improving operational efficiency based on insights.

6. Deployment:
Report Generation:
Create a comprehensive report that answers all key business questions.
Include visualizations and key metrics to support decision-making.
Presentation:
Present the findings to the client, highlighting actionable insights.
BI Solution:
Recommend setting up a dashboard for real-time sales monitoring.
Suggest future data collection improvements for better analysis.

This outline provides a structured approach to solving the client's business problem using the CRISP-DM methodology, ensuring a thorough and systematic analysis of their 2019 sales data.

Overview
This project provides a business intelligence solution that uses Power BI to analyze and visualize a client’s 2019 sales data. The goal is to deliver actionable insights into total revenue, seasonal trends, product performance, city-wise deliveries, and category comparisons, so that the client can increase sales and improve operational efficiency.

Key Features
- Total Revenue Analysis: Understand the total revenue generated throughout the year.
- Seasonality Trends: Identify seasonal sales patterns and trends.
- Product Performance: Analyze the best and worst-selling products.
- Sales Comparisons: Compare sales data across different months and weeks.
- City-wise Deliveries: Determine which cities received the most deliveries.
- Category Comparisons: Evaluate revenue and quantities for different product categories.
- Product Categorization: Classify products as high-level or basic-level based on unit price.

Methodology: CRISP-DM

The project follows the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology, structured as follows:

- Business Understanding: Define objectives and requirements from the client to drive sales and improve efficiency.
- Data Understanding: Collect and explore the dataset, which includes sales data for 2019 across various months and cities.
- Data Preparation: Clean and transform the data, integrating sources from CSV files and a remote database.
- Modeling: Apply appropriate data analysis techniques to uncover insights.
- Evaluation: Assess the results against business objectives to ensure they meet the client’s needs.
- Deployment: Develop and deploy a Power BI dashboard to visualize the insights.

Installation

1. Clone the repository:
   git clone https://github.com/yourusername/comprehensive-power-bi-dashboard.git
2. Navigate to the project directory:
   cd comprehensive-power-bi-dashboard
3. Open the Power BI project file (.pbix) using Power BI Desktop.

Usage

1. Open the Power BI file in Power BI Desktop.
2. Explore the dashboard pages, including Total Revenue, Seasonal Trends, Product Performance, City-wise Deliveries, and Category Comparisons.
3. Use the interactive filters and visuals to derive insights from the data.

Contributing

In [163]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import re

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.colors as mcolors
import folium
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from matplotlib.ticker import FuncFormatter

# Statistical tests
from scipy.stats import f_oneway, shapiro, levene

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Database connectivity
import pyodbc
from sqlalchemy import create_engine

# Managing environment variables
from dotenv import dotenv_values

# Handling HTTP requests
import requests

# Handling file paths and directories
import os
from pathlib import Path

# Map clustering and geocoding (folium itself is imported above)
from folium.plugins import MarkerCluster
from geopy.geocoders import Nominatim

# Suppressing warnings
import warnings 

warnings.filterwarnings('ignore')

##### Loading Data into the Python Environment (VS Code)

In [164]:
# Access the second dataset, downloaded from OneDrive into the data folder.

data_jan = pd.read_csv('../data/Sales_January_2019.csv')

data_jan.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,141234,iPhone,1,700.0,01/22/19 21:25,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,01/28/19 14:15,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,01/17/19 13:33,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,01/05/19 20:33,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,01/25/19 11:59,"387 10th St, Austin, TX 73301"


In [165]:
data_feb = pd.read_csv('../data/Sales_February_2019.csv')

data_feb.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,150502,iPhone,1,700.0,02/18/19 01:35,"866 Spruce St, Portland, ME 04101"
1,150503,AA Batteries (4-pack),1,3.84,02/13/19 07:24,"18 13th St, San Francisco, CA 94016"
2,150504,27in 4K Gaming Monitor,1,389.99,02/18/19 09:46,"52 6th St, New York City, NY 10001"
3,150505,Lightning Charging Cable,1,14.95,02/02/19 16:47,"129 Cherry St, Atlanta, GA 30301"
4,150506,AA Batteries (4-pack),2,3.84,02/28/19 20:32,"548 Lincoln St, Seattle, WA 98101"


In [166]:
data_mar = pd.read_csv('../data/Sales_March_2019.csv')

data_mar.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,162009,iPhone,1,700.0,03/28/19 20:59,"942 Church St, Austin, TX 73301"
1,162009,Lightning Charging Cable,1,14.95,03/28/19 20:59,"942 Church St, Austin, TX 73301"
2,162009,Wired Headphones,2,11.99,03/28/19 20:59,"942 Church St, Austin, TX 73301"
3,162010,Bose SoundSport Headphones,1,99.99,03/17/19 05:39,"261 10th St, San Francisco, CA 94016"
4,162011,34in Ultrawide Monitor,1,379.99,03/10/19 00:01,"764 13th St, San Francisco, CA 94016"


In [167]:
data_apr = pd.read_csv('../data/Sales_April_2019.csv')

data_apr.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,176558.0,USB-C Charging Cable,2.0,11.95,04/19/19 08:46,"917 1st St, Dallas, TX 75001"
1,,,,,,
2,176559.0,Bose SoundSport Headphones,1.0,99.99,04/07/19 22:30,"682 Chestnut St, Boston, MA 02215"
3,176560.0,Google Phone,1.0,600.0,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"
4,176560.0,Wired Headphones,1.0,11.99,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"


In [168]:
data_may = pd.read_csv('../data/Sales_May_2019.csv')

data_may.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,194095,Wired Headphones,1,11.99,05/16/19 17:14,"669 2nd St, New York City, NY 10001"
1,194096,AA Batteries (4-pack),1,3.84,05/19/19 14:43,"844 Walnut St, Dallas, TX 75001"
2,194097,27in FHD Monitor,1,149.99,05/24/19 11:36,"164 Madison St, New York City, NY 10001"
3,194098,Wired Headphones,1,11.99,05/02/19 20:40,"622 Meadow St, Dallas, TX 75001"
4,194099,AAA Batteries (4-pack),2,2.99,05/11/19 22:55,"17 Church St, Seattle, WA 98101"


In [169]:
data_jun = pd.read_csv('../data/Sales_June_2019.csv')

data_jun.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,209921,USB-C Charging Cable,1,11.95,06/23/19 19:34,"950 Walnut St, Portland, ME 04101"
1,209922,Macbook Pro Laptop,1,1700.0,06/30/19 10:05,"80 4th St, San Francisco, CA 94016"
2,209923,ThinkPad Laptop,1,999.99,06/24/19 20:18,"402 Jackson St, Los Angeles, CA 90001"
3,209924,27in FHD Monitor,1,149.99,06/05/19 10:21,"560 10th St, Seattle, WA 98101"
4,209925,Bose SoundSport Headphones,1,99.99,06/25/19 18:58,"545 2nd St, San Francisco, CA 94016"


In [170]:
# Merging January 2019 to June 2019 Data

# List of all the DataFrames to concatenate
data_frames = [data_jan, data_feb, data_mar, data_apr, data_may, data_jun]

# Concatenate all DataFrames into one
data1 = pd.concat(data_frames, ignore_index=True)
data1

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,141234,iPhone,1,700,01/22/19 21:25,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,01/28/19 14:15,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,01/17/19 13:33,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,01/05/19 20:33,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,01/25/19 11:59,"387 10th St, Austin, TX 73301"
...,...,...,...,...,...,...
85620,222905,AAA Batteries (4-pack),1,2.99,06/07/19 19:02,"795 Pine St, Boston, MA 02215"
85621,222906,27in FHD Monitor,1,149.99,06/01/19 19:29,"495 North St, New York City, NY 10001"
85622,222907,USB-C Charging Cable,1,11.95,06/22/19 18:57,"319 Ridge St, San Francisco, CA 94016"
85623,222908,USB-C Charging Cable,1,11.95,06/26/19 18:35,"916 Main St, San Francisco, CA 94016"
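
The six `read_csv` calls and the concatenation above follow one file-naming pattern, so they could also be written as a loop. The sketch below is one way to do that, assuming the `Sales_<Month>_2019.csv` naming seen in the cells above; the helper name `load_monthly_csvs` is hypothetical, and the demonstration writes two tiny stand-in files to a temporary directory rather than reading the client's data.

```python
import tempfile
from pathlib import Path

import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June"]


def load_monthly_csvs(data_dir: Path, months=MONTHS, year: int = 2019) -> pd.DataFrame:
    """Read Sales_<Month>_<year>.csv for each month and stack the rows in order."""
    frames = [pd.read_csv(data_dir / f"Sales_{month}_{year}.csv") for month in months]
    return pd.concat(frames, ignore_index=True)


# Demonstration with two stand-in files (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for month, order_id in [("January", 141234), ("February", 150502)]:
        pd.DataFrame({"Order ID": [order_id], "Product": ["iPhone"]}).to_csv(
            tmp_dir / f"Sales_{month}_2019.csv", index=False
        )
    demo = load_monthly_csvs(tmp_dir, months=["January", "February"])

print(demo)
```

With the notebook's layout, the equivalent call would presumably be `load_monthly_csvs(Path("../data"))`.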


In [171]:
print(data1.info())  # Check data structure and types
print(data1.head())  # Preview the first few rows of the concatenated data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85625 entries, 0 to 85624
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          85380 non-null  object
 1   Product           85380 non-null  object
 2   Quantity Ordered  85380 non-null  object
 3   Price Each        85380 non-null  object
 4   Order Date        85380 non-null  object
 5   Purchase Address  85380 non-null  object
dtypes: object(6)
memory usage: 3.9+ MB
None
  Order ID                   Product Quantity Ordered Price Each  \
0   141234                    iPhone                1        700   
1   141235  Lightning Charging Cable                1      14.95   
2   141236          Wired Headphones                2      11.99   
3   141237          27in FHD Monitor                1     149.99   
4   141238          Wired Headphones                1      11.99   

       Order Date                       Purchase Address  
0  01/22/19

In [172]:
# Create DataFrame
data1 = pd.DataFrame(data1)

# Convert 'Order Date' to datetime and handle errors
data1['Order Date'] = pd.to_datetime(data1['Order Date'], format='%m/%d/%y %H:%M', errors='coerce')

# Check for any NaT values (invalid dates)
if data1['Order Date'].isna().any():
    print("There are invalid dates in the 'Order Date' column.")

# Format 'Order Date' as a string in the desired format (the column becomes object dtype again)
data1['Order Date'] = data1['Order Date'].dt.strftime('%Y-%m-%d %H:%M:%S.%f')

# Display the DataFrame
data1

There are invalid dates in the 'Order Date' column.


Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,141234,iPhone,1,700,2019-01-22 21:25:00.000000,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,2019-01-28 14:15:00.000000,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,2019-01-17 13:33:00.000000,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,2019-01-05 20:33:00.000000,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,2019-01-25 11:59:00.000000,"387 10th St, Austin, TX 73301"
...,...,...,...,...,...,...
85620,222905,AAA Batteries (4-pack),1,2.99,2019-06-07 19:02:00.000000,"795 Pine St, Boston, MA 02215"
85621,222906,27in FHD Monitor,1,149.99,2019-06-01 19:29:00.000000,"495 North St, New York City, NY 10001"
85622,222907,USB-C Charging Cable,1,11.95,2019-06-22 18:57:00.000000,"319 Ridge St, San Francisco, CA 94016"
85623,222908,USB-C Charging Cable,1,11.95,2019-06-26 18:35:00.000000,"916 Main St, San Francisco, CA 94016"


In [173]:
# Identify rows with NaT values
invalid_dates = data1[data1['Order Date'].isna()]

# Display rows with invalid dates
print("Rows with invalid dates:")
invalid_dates


Rows with invalid dates:


Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
664,,,,,,
678,,,,,,
797,,,,,,
876,,,,,,
1073,Order ID,Product,Quantity Ordered,Price Each,,Purchase Address
...,...,...,...,...,...,...
84643,,,,,,
84662,,,,,,
84735,,,,,,
85307,Order ID,Product,Quantity Ordered,Price Each,,Purchase Address
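
The output above shows two kinds of unusable rows: rows that are entirely empty, and rows where the CSV header was repeated as data (`Order ID`, `Product`, ...). A minimal sketch of removing both is given below; the function name `drop_blank_and_header_rows` is hypothetical, and the sample frame only mimics the pattern seen in the output.

```python
import pandas as pd


def drop_blank_and_header_rows(df: pd.DataFrame, id_col: str = "Order ID") -> pd.DataFrame:
    """Drop fully empty rows and rows that repeat the header text (hypothetical helper)."""
    # Remove rows where every column is missing
    cleaned = df.dropna(how="all")
    # A repeated header row has the column name itself as its value
    is_header = cleaned[id_col].astype(str).str.strip() == id_col
    return cleaned[~is_header].reset_index(drop=True)


# Stand-in frame reproducing the two problem patterns
raw = pd.DataFrame({
    "Order ID": ["141234", None, "Order ID", "141235"],
    "Product": ["iPhone", None, "Product", "Lightning Charging Cable"],
    "Quantity Ordered": ["1", None, "Quantity Ordered", "1"],
})
clean = drop_blank_and_header_rows(raw)
print(clean)
```

Dropping these rows explicitly, before any type conversion, makes it easier to confirm that only blank and header rows were removed rather than genuine orders.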


In [174]:
# Column headers of data1

print(data1.columns)


Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')


In [175]:
# Replace spaces in column names with underscores

# Display original column names
print("Original column names:")
print(data1.columns)

# Rename columns to replace spaces with underscores
data1.columns = [col.replace(' ', '_') for col in data1.columns]

# Display updated column names
print("\nUpdated column names:")
print(data1.columns)
data1

Original column names:
Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

Updated column names:
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')


Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,141234,iPhone,1,700,2019-01-22 21:25:00.000000,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,2019-01-28 14:15:00.000000,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,2019-01-17 13:33:00.000000,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,2019-01-05 20:33:00.000000,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,2019-01-25 11:59:00.000000,"387 10th St, Austin, TX 73301"
...,...,...,...,...,...,...
85620,222905,AAA Batteries (4-pack),1,2.99,2019-06-07 19:02:00.000000,"795 Pine St, Boston, MA 02215"
85621,222906,27in FHD Monitor,1,149.99,2019-06-01 19:29:00.000000,"495 North St, New York City, NY 10001"
85622,222907,USB-C Charging Cable,1,11.95,2019-06-22 18:57:00.000000,"319 Ridge St, San Francisco, CA 94016"
85623,222908,USB-C Charging Cable,1,11.95,2019-06-26 18:35:00.000000,"916 Main St, San Francisco, CA 94016"


In [176]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")

In [177]:
# Create a connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"

In [178]:
# Connect to the server; this may take a few seconds to complete.
connection = pyodbc.connect(connection_string)
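
If any of the four entries is missing from `.env`, `dotenv_values(...).get(...)` returns `None` and the f-string silently embeds the text `None` in the connection string, so the failure only surfaces at connect time. A small guard like the sketch below can catch that earlier. The helper name `build_connection_string` and the placeholder values are hypothetical; the key names and connection-string fields are copied from the cells above.

```python
REQUIRED_KEYS = ("SERVER", "DATABASE", "UID", "PWD")


def build_connection_string(env: dict) -> str:
    """Build the ODBC connection string, failing early on missing credentials."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing .env entries: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['UID']};PWD={env['PWD']};"
        "MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
    )


# Placeholder values for illustration only
example_env = {
    "SERVER": "example-server",
    "DATABASE": "example-db",
    "UID": "example-user",
    "PWD": "example-password",
}
connection_string = build_connection_string(example_env)
print(connection_string.replace(example_env["PWD"], "***"))  # avoid echoing the password
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in directly.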

In [179]:
query = "SELECT * FROM dbo.Sales_July_2019"

data_jul = pd.read_sql(query, connection)
data_jul.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,222910.0,Apple Airpods Headphones,1.0,150.0,2026-07-19 16:51:00.0000000,"389 South St, Atlanta, GA 30301"
1,222911.0,Flatscreen TV,1.0,300.0,2005-07-19 08:55:00.0000000,"590 4th St, Seattle, WA 98101"
2,222912.0,AA Batteries (4-pack),1.0,3.84,2029-07-19 12:41:00.0000000,"861 Hill St, Atlanta, GA 30301"
3,222913.0,AA Batteries (4-pack),1.0,3.84,2028-07-19 10:15:00.0000000,"190 Ridge St, Atlanta, GA 30301"
4,222914.0,AAA Batteries (4-pack),5.0,2.99,2031-07-19 02:13:00.0000000,"824 Forest St, Seattle, WA 98101"


In [180]:
query = "SELECT * FROM dbo.Sales_August_2019"

data_aug = pd.read_sql(query, connection)
data_aug.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,236670.0,Wired Headphones,2.0,11.99,2031-08-19 22:21:00.0000000,"359 Spruce St, Seattle, WA 98101"
1,236671.0,Bose SoundSport Headphones,1.0,99.989998,2015-08-19 15:11:00.0000000,"492 Ridge St, Dallas, TX 75001"
2,236672.0,iPhone,1.0,700.0,2006-08-19 14:40:00.0000000,"149 7th St, Portland, OR 97035"
3,236673.0,AA Batteries (4-pack),2.0,3.84,2029-08-19 20:59:00.0000000,"631 2nd St, Los Angeles, CA 90001"
4,236674.0,AA Batteries (4-pack),2.0,3.84,2015-08-19 19:53:00.0000000,"736 14th St, New York City, NY 10001"


In [181]:
query = "SELECT * FROM dbo.Sales_September_2019"

data_sep = pd.read_sql(query, connection)
data_sep.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,248151.0,AA Batteries (4-pack),4.0,3.84,2017-09-19 14:44:00.0000000,"380 North St, Los Angeles, CA 90001"
1,248152.0,USB-C Charging Cable,2.0,11.95,2029-09-19 10:19:00.0000000,"511 8th St, Austin, TX 73301"
2,248153.0,USB-C Charging Cable,1.0,11.95,2016-09-19 17:48:00.0000000,"151 Johnson St, Los Angeles, CA 90001"
3,248154.0,27in FHD Monitor,1.0,149.990005,2027-09-19 07:52:00.0000000,"355 Hickory St, Seattle, WA 98101"
4,248155.0,USB-C Charging Cable,1.0,11.95,2001-09-19 19:03:00.0000000,"125 5th St, Atlanta, GA 30301"


In [182]:
query = "SELECT * FROM dbo.Sales_October_2019"

data_oct = pd.read_sql(query, connection)
data_oct.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,259358.0,34in Ultrawide Monitor,1.0,379.98999,2028-10-19 10:56:00.0000000,"609 Cherry St, Dallas, TX 75001"
1,259359.0,27in 4K Gaming Monitor,1.0,389.98999,2028-10-19 17:26:00.0000000,"225 5th St, Los Angeles, CA 90001"
2,259360.0,AAA Batteries (4-pack),2.0,2.99,2024-10-19 17:20:00.0000000,"967 12th St, New York City, NY 10001"
3,259361.0,27in FHD Monitor,1.0,149.990005,2014-10-19 22:26:00.0000000,"628 Jefferson St, New York City, NY 10001"
4,259362.0,Wired Headphones,1.0,11.99,2007-10-19 16:10:00.0000000,"534 14th St, Los Angeles, CA 90001"


In [183]:
query = "SELECT * FROM dbo.Sales_November_2019"

data_nov = pd.read_sql(query, connection)
data_nov.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,278797.0,Wired Headphones,1.0,11.99,2021-11-19 09:54:00.0000000,"46 Park St, New York City, NY 10001"
1,278798.0,USB-C Charging Cable,2.0,11.95,2017-11-19 10:03:00.0000000,"962 Hickory St, Austin, TX 73301"
2,278799.0,Apple Airpods Headphones,1.0,150.0,2019-11-19 14:56:00.0000000,"464 Cherry St, Los Angeles, CA 90001"
3,278800.0,27in FHD Monitor,1.0,149.990005,2025-11-19 22:24:00.0000000,"649 10th St, Seattle, WA 98101"
4,278801.0,Bose SoundSport Headphones,1.0,99.989998,2009-11-19 13:56:00.0000000,"522 Hill St, Boston, MA 02215"


In [184]:
query = "SELECT * FROM dbo.Sales_December_2019"

data_dec = pd.read_sql(query, connection)
data_dec.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,295665.0,Macbook Pro Laptop,1.0,1700.0,2030-12-19 00:01:00.0000000,"136 Church St, New York City, NY 10001"
1,295666.0,LG Washing Machine,1.0,600.0,2029-12-19 07:03:00.0000000,"562 2nd St, New York City, NY 10001"
2,295667.0,USB-C Charging Cable,1.0,11.95,2012-12-19 18:21:00.0000000,"277 Main St, New York City, NY 10001"
3,295668.0,27in FHD Monitor,1.0,149.990005,2022-12-19 15:13:00.0000000,"410 6th St, San Francisco, CA 94016"
4,295669.0,USB-C Charging Cable,1.0,11.95,2018-12-19 12:38:00.0000000,"43 Hill St, Atlanta, GA 30301"


In [185]:
# List of all the DataFrames to concatenate
data_frames = [data_jul, data_aug, data_sep, data_oct, data_nov, data_dec]

# Concatenate all DataFrames into one
data2 = pd.concat(data_frames, ignore_index=True)
data2

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,222910.0,Apple Airpods Headphones,1.0,150.000000,2026-07-19 16:51:00.0000000,"389 South St, Atlanta, GA 30301"
1,222911.0,Flatscreen TV,1.0,300.000000,2005-07-19 08:55:00.0000000,"590 4th St, Seattle, WA 98101"
2,222912.0,AA Batteries (4-pack),1.0,3.840000,2029-07-19 12:41:00.0000000,"861 Hill St, Atlanta, GA 30301"
3,222913.0,AA Batteries (4-pack),1.0,3.840000,2028-07-19 10:15:00.0000000,"190 Ridge St, Atlanta, GA 30301"
4,222914.0,AAA Batteries (4-pack),5.0,2.990000,2031-07-19 02:13:00.0000000,"824 Forest St, Seattle, WA 98101"
...,...,...,...,...,...,...
101220,319666.0,Lightning Charging Cable,1.0,14.950000,2011-12-19 20:58:00.0000000,"14 Madison St, San Francisco, CA 94016"
101221,319667.0,AA Batteries (4-pack),2.0,3.840000,2001-12-19 12:01:00.0000000,"549 Willow St, Los Angeles, CA 90001"
101222,319668.0,Vareebadd Phone,1.0,400.000000,2009-12-19 06:43:00.0000000,"273 Wilson St, Seattle, WA 98101"
101223,319669.0,Wired Headphones,1.0,11.990000,2003-12-19 10:39:00.0000000,"778 River St, Dallas, TX 75001"
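
The six monthly queries above differ only in the table name, so they could be collapsed into a loop. The following is a sketch under stated assumptions: the helper name `load_monthly_tables` is hypothetical, and the demonstration uses an in-memory SQLite database as a stand-in for the remote SQL Server, which is why no schema prefix is passed. Table names come from a fixed list, not from user input, which is what makes string formatting acceptable here.

```python
import sqlite3

import pandas as pd

MONTHS = ["July", "August", "September", "October", "November", "December"]


def load_monthly_tables(connection, months=MONTHS, year: int = 2019, schema=None) -> pd.DataFrame:
    """Read Sales_<Month>_<year> for each month and stack the rows in order."""
    prefix = f"{schema}." if schema else ""
    frames = [
        pd.read_sql(f"SELECT * FROM {prefix}Sales_{month}_{year}", connection)
        for month in months
    ]
    return pd.concat(frames, ignore_index=True)


# Stand-in database with two tiny tables (not the client's data)
conn = sqlite3.connect(":memory:")
for month, order_id in [("July", 222910), ("August", 236670)]:
    pd.DataFrame({"Order_ID": [order_id], "Product": ["Wired Headphones"]}).to_sql(
        f"Sales_{month}_2019", conn, index=False
    )
demo = load_monthly_tables(conn, months=["July", "August"])
conn.close()
print(demo)
```

Against the notebook's connection, the call would presumably be `load_monthly_tables(connection, schema="dbo")`.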


In [186]:
print(data2.info())  # Check data structure and types
print(data2.head())  # Preview the first few rows of the concatenated data


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101225 entries, 0 to 101224
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Order_ID          100730 non-null  float64
 1   Product           100925 non-null  object 
 2   Quantity_Ordered  100730 non-null  float64
 3   Price_Each        100730 non-null  float64
 4   Order_Date        100730 non-null  object 
 5   Purchase_Address  100925 non-null  object 
dtypes: float64(3), object(3)
memory usage: 4.6+ MB
None
   Order_ID                   Product  Quantity_Ordered  Price_Each  \
0  222910.0  Apple Airpods Headphones               1.0      150.00   
1  222911.0             Flatscreen TV               1.0      300.00   
2  222912.0     AA Batteries (4-pack)               1.0        3.84   
3  222913.0     AA Batteries (4-pack)               1.0        3.84   
4  222914.0    AAA Batteries (4-pack)               5.0        2.99   

                    Or

In [187]:
# Identify rows with missing Order_Date values
invalid_date = data2[data2['Order_Date'].isna()]

# Display rows with invalid dates
print("Rows with invalid dates:")
invalid_date


Rows with invalid dates:


Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
199,,,,,,
315,,,,,,
599,,,,,,
675,,,,,,
775,,Product,,,,Purchase Address
...,...,...,...,...,...,...
99856,,Product,,,,Purchase Address
100104,,,,,,
100300,,Product,,,,Purchase Address
100330,,Product,,,,Purchase Address


In [188]:
print(data1.columns)
print(data2.columns)


Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')


In [189]:
# Concatenate data1 and data2 (both DataFrames have the same columns)

# Concatenate the DataFrames
df = pd.concat([data1, data2], ignore_index=True)

# Display the concatenated DataFrame
print("Concatenated DataFrame:")

df

Concatenated DataFrame:


Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,141234,iPhone,1,700,2019-01-22 21:25:00.000000,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,2019-01-28 14:15:00.000000,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,2019-01-17 13:33:00.000000,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,2019-01-05 20:33:00.000000,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,2019-01-25 11:59:00.000000,"387 10th St, Austin, TX 73301"
...,...,...,...,...,...,...
186845,319666.0,Lightning Charging Cable,1.0,14.95,2011-12-19 20:58:00.0000000,"14 Madison St, San Francisco, CA 94016"
186846,319667.0,AA Batteries (4-pack),2.0,3.84,2001-12-19 12:01:00.0000000,"549 Willow St, Los Angeles, CA 90001"
186847,319668.0,Vareebadd Phone,1.0,400.0,2009-12-19 06:43:00.0000000,"273 Wilson St, Seattle, WA 98101"
186848,319669.0,Wired Headphones,1.0,11.99,2003-12-19 10:39:00.0000000,"778 River St, Dallas, TX 75001"


In [190]:
# Export df to a CSV file




In [191]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186850 entries, 0 to 186849
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Order_ID          186110 non-null  object
 1   Product           186305 non-null  object
 2   Quantity_Ordered  186110 non-null  object
 3   Price_Each        186110 non-null  object
 4   Order_Date        185950 non-null  object
 5   Purchase_Address  186305 non-null  object
dtypes: object(6)
memory usage: 8.6+ MB


In [192]:
# Using pandas' unique() function
unique_values = df['Order_ID'].unique()
print(unique_values)

['141234' '141235' '141236' ... 319668.0 319669.0 319670.0]


In [193]:

# Step 1: Convert the 'Order_ID' column to numeric, setting errors='coerce' to convert strings to NaN
df['Order_ID'] = pd.to_numeric(df['Order_ID'], errors='coerce')

# Step 2: Drop NaN values (which were originally strings)
df = df.dropna(subset=['Order_ID'])

# Step 3: Convert the remaining numeric values to integers
df['Order_ID'] = df['Order_ID'].astype(int)

# Step 4: Use the unique() function if needed
unique_values = df['Order_ID'].unique()
print(unique_values)

[141234 141235 141236 ... 319668 319669 319670]


In [194]:
# Using pandas' unique() function
unique_values = df['Product'].unique()
print(unique_values)

['iPhone' 'Lightning Charging Cable' 'Wired Headphones' '27in FHD Monitor'
 'AAA Batteries (4-pack)' '27in 4K Gaming Monitor' 'USB-C Charging Cable'
 'Bose SoundSport Headphones' 'Apple Airpods Headphones'
 'Macbook Pro Laptop' 'Flatscreen TV' 'Vareebadd Phone'
 'AA Batteries (4-pack)' 'Google Phone' '20in Monitor'
 '34in Ultrawide Monitor' 'ThinkPad Laptop' 'LG Dryer'
 'LG Washing Machine']


In [195]:
# Using pandas' unique() function
unique_values = df['Quantity_Ordered'].unique()
print(unique_values)

['1' '2' '3' '5' '4' '7' '6' '9' 1.0 5.0 2.0 3.0 4.0 6.0 7.0 8.0 9.0]


In [196]:

# Step 1: Convert 'Quantity_Ordered' to numeric, coercing non-numeric values to NaN
df['Quantity_Ordered'] = pd.to_numeric(df['Quantity_Ordered'], errors='coerce')

# Step 2: Drop NaN values that result from invalid conversions
df = df.dropna(subset=['Quantity_Ordered'])

# Step 3: Convert the column to integers
df['Quantity_Ordered'] = df['Quantity_Ordered'].astype(int)

# Step 4: Get the unique values
unique_values = df['Quantity_Ordered'].unique()

# Step 5: Print the unique integer values
print(unique_values)

[1 2 3 5 4 7 6 9 8]


In [197]:
# Using pandas' unique() function
unique_values = df['Price_Each'].unique()
print(unique_values)


['700' '14.95' '11.99' '149.99' '2.99' '389.99' '11.95' '99.99' '150'
 '1700' '300' '400' '3.84' '600' '109.99' '379.99' '999.99' '600.0'
 '700.0' '150.0' '300.0' '1700.0' 150.0 300.0 3.8399999141693115
 2.990000009536743 99.98999786376953 600.0 11.949999809265137
 14.949999809265137 379.989990234375 11.989999771118164 400.0
 149.99000549316406 109.98999786376953 999.989990234375 389.989990234375
 1700.0 700.0]


In [198]:
# Clean the 'Price_Each' column
df['Price_Each'] = pd.to_numeric(df['Price_Each'], errors='coerce')  # Convert to numeric, forcing errors to NaN

# Drop NaN values
df_cleaned = df.dropna(subset=['Price_Each']).copy()  # copy so the assignment below does not act on a view of df

# Round to 2 decimal places
df_cleaned['Price_Each'] = df_cleaned['Price_Each'].round(2)

# Check the unique values after cleaning
unique_values_cleaned = df_cleaned['Price_Each'].unique()
print(unique_values_cleaned)

[ 700.     14.95   11.99  149.99    2.99  389.99   11.95   99.99  150.
 1700.    300.    400.      3.84  600.    109.99  379.99  999.99]
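
The long decimals in the raw values (for example `3.8399999141693115`) are consistent with prices stored as 32-bit floats and then widened to 64-bit on read, although the source does not state the database column type. Under that assumption, the short sketch below shows why rounding to two decimals recovers the intended prices.

```python
import numpy as np

# Intended prices, stored as 32-bit floats (assumed storage type)
stored = np.array([3.84, 99.99, 149.99], dtype=np.float32)

# Widening to 64-bit exposes the 32-bit rounding error
widened = stored.astype(np.float64)
print(widened)

# Rounding to cents restores the intended values
restored = np.round(widened, 2)
print(restored)
```

Because the error is far smaller than half a cent, rounding to two decimals is safe for these prices.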


In [199]:
# Round values in the 'Price_Each' column to 2 decimal places
df['Price_Each'] = df['Price_Each'].round(2)

# Verify the result
print(df['Price_Each'].head())

0    700.00
1     14.95
2     11.99
3    149.99
4     11.99
Name: Price_Each, dtype: float64


In [200]:
# Using pandas' unique() function
unique_values = df['Order_Date'].unique()
print(unique_values)

['2019-01-22 21:25:00.000000' '2019-01-28 14:15:00.000000'
 '2019-01-17 13:33:00.000000' ... '2001-12-19 12:01:00.0000000'
 '2009-12-19 06:43:00.0000000' '2003-12-19 10:39:00.0000000']


In [201]:
df['Order_Date'].value_counts()

Order_Date
2015-12-19 20:16:00.0000000    8
2011-12-19 13:24:00.0000000    7
2030-10-19 21:28:00.0000000    7
2019-04-02 13:24:00.000000     7
2015-09-19 23:41:00.0000000    6
                              ..
2019-05-08 15:57:00.000000     1
2019-05-13 15:35:00.000000     1
2019-05-23 23:16:00.000000     1
2019-05-11 12:07:00.000000     1
2003-12-19 10:39:00.0000000    1
Name: count, Length: 142395, dtype: int64

In [202]:
# Convert 'Order_Date' to datetime if it's not already
df['Order_Date'] = pd.to_datetime(df['Order_Date'])

# Function to replace year with 2019, keeping month, day, and time the same
def replace_year(date):
    if date.year != 2019:
        return date.replace(year=2019)
    else:
        return date

# Apply the function to the 'Order_Date' column
df['Order_Date'] = df['Order_Date'].apply(replace_year)

# Verify the result
print(df['Order_Date'].head())


0   2019-01-22 21:25:00
1   2019-01-28 14:15:00
2   2019-01-17 13:33:00
3   2019-01-05 20:33:00
4   2019-01-25 11:59:00
Name: Order_Date, dtype: datetime64[ns]


In [203]:
df['Order_Date'].value_counts()

Order_Date
2019-12-19 18:56:00    44
2019-12-19 19:25:00    44
2019-12-19 19:01:00    43
2019-12-19 19:16:00    42
2019-12-19 19:46:00    42
                       ..
2019-03-12 02:15:00     1
2019-03-25 07:21:00     1
2019-03-17 20:35:00     1
2019-03-31 17:03:00     1
2019-12-19 03:15:00     1
Name: count, Length: 74739, dtype: int64
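
The counts above are concentrated on the 19th of the month, and the database rows previewed earlier all show day 19 with "years" ranging from 2001 to 2031. One possible reading, which the source does not confirm, is that the database strings are in day-month-year order with a two-digit year, so the parsed year is really the day of the month. If that is the case, replacing only the year leaves every July to December order on the 19th. The sketch below shows an alternative fix under that assumption; the function name `swap_year_and_day` is hypothetical.

```python
import pandas as pd


def swap_year_and_day(ts, target_year: int = 2019):
    """Hypothetical fix: treat the last two digits of a misparsed year as the day."""
    # Leave missing values and dates already in the target year untouched
    if pd.isna(ts) or ts.year == target_year:
        return ts
    try:
        return ts.replace(year=target_year, day=ts.year % 100)
    except ValueError:
        # The implied day does not exist in that month; flag it for review
        return pd.NaT


dates = pd.Series(
    pd.to_datetime(["2019-01-22 21:25", "2026-07-19 16:51", "2031-07-19 02:13"])
)
fixed = dates.apply(swap_year_and_day)
print(fixed)
```

A database row whose true day is the 19th parses with year 2019 and is left unchanged, which is still correct under this reading. The assumption should be checked against the source system before the fix is relied on for weekly or seasonal comparisons.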

In [204]:
# Using pandas' unique() function
unique_values = df['Purchase_Address'].unique()
print(unique_values)

['944 Walnut St, Boston, MA 02215' '185 Maple St, Portland, OR 97035'
 '538 Adams St, San Francisco, CA 94016' ...
 '273 Wilson St, Seattle, WA 98101' '778 River St, Dallas, TX 75001'
 '747 Chestnut St, Los Angeles, CA 90001']


In [205]:
df

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,141234,iPhone,1,700.00,2019-01-22 21:25:00,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,2019-01-28 14:15:00,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,2019-01-17 13:33:00,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,2019-01-05 20:33:00,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,2019-01-25 11:59:00,"387 10th St, Austin, TX 73301"
...,...,...,...,...,...,...
186845,319666,Lightning Charging Cable,1,14.95,2019-12-19 20:58:00,"14 Madison St, San Francisco, CA 94016"
186846,319667,AA Batteries (4-pack),2,3.84,2019-12-19 12:01:00,"549 Willow St, Los Angeles, CA 90001"
186847,319668,Vareebadd Phone,1,400.00,2019-12-19 06:43:00,"273 Wilson St, Seattle, WA 98101"
186848,319669,Wired Headphones,1,11.99,2019-12-19 10:39:00,"778 River St, Dallas, TX 75001"


In [206]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 185950 entries, 0 to 186849
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Order_ID          185950 non-null  int32         
 1   Product           185950 non-null  object        
 2   Quantity_Ordered  185950 non-null  int32         
 3   Price_Each        185950 non-null  float64       
 4   Order_Date        185950 non-null  datetime64[ns]
 5   Purchase_Address  185950 non-null  object        
dtypes: datetime64[ns](1), float64(1), int32(2), object(2)
memory usage: 8.5+ MB


In [207]:
# Check for null values in the entire DataFrame
null_counts = df.isnull().sum()
print("Null values in each column:")
print(null_counts)

Null values in each column:
Order_ID            0
Product             0
Quantity_Ordered    0
Price_Each          0
Order_Date          0
Purchase_Address    0
dtype: int64


In [208]:
# Loop through each column and print unique values
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values in column '{column}':")
    print(unique_values)
    print()  # Blank line for better readability

Unique values in column 'Order_ID':
[141234 141235 141236 ... 319668 319669 319670]

Unique values in column 'Product':
['iPhone' 'Lightning Charging Cable' 'Wired Headphones' '27in FHD Monitor'
 'AAA Batteries (4-pack)' '27in 4K Gaming Monitor' 'USB-C Charging Cable'
 'Bose SoundSport Headphones' 'Apple Airpods Headphones'
 'Macbook Pro Laptop' 'Flatscreen TV' 'Vareebadd Phone'
 'AA Batteries (4-pack)' 'Google Phone' '20in Monitor'
 '34in Ultrawide Monitor' 'ThinkPad Laptop' 'LG Dryer'
 'LG Washing Machine']

Unique values in column 'Quantity_Ordered':
[1 2 3 5 4 7 6 9 8]

Unique values in column 'Price_Each':
[ 700.     14.95   11.99  149.99    2.99  389.99   11.95   99.99  150.
 1700.    300.    400.      3.84  600.    109.99  379.99  999.99]

Unique values in column 'Order_Date':
<DatetimeArray>
['2019-01-22 21:25:00', '2019-01-28 14:15:00', '2019-01-17 13:33:00',
 '2019-01-05 20:33:00', '2019-01-25 11:59:00', '2019-01-29 20:22:00',
 '2019-01-26 12:16:00', '2019-01-05 12:04:00', ...

In [215]:
# Check for duplicated Order_IDs (shows True for duplicates)
duplicated_ids = df['Order_ID'].duplicated()
duplicated_ids


0         False
1         False
2         False
3         False
4         False
          ...  
186845    False
186846    False
186847    False
186848    False
186849    False
Name: Order_ID, Length: 185950, dtype: bool

In [216]:
# Display duplicated Order_ID rows
print("Duplicated Order_ID rows:")
print(df[duplicated_ids])

Duplicated Order_ID rows:
        Order_ID                     Product  Quantity_Ordered  Price_Each  \
42        141275            Wired Headphones                 1       11.99   
58        141290       AA Batteries (4-pack)                 3        3.84   
134       141365            Wired Headphones                 1       11.99   
154       141384        USB-C Charging Cable                 1       11.95   
221       141450  Bose SoundSport Headphones                 1       99.99   
...          ...                         ...               ...         ...   
186711    319536            Wired Headphones                 1       11.99   
186732    319556            Wired Headphones                 1       11.99   
186761    319584            Wired Headphones                 1       11.99   
186774    319596    Lightning Charging Cable                 1       14.95   
186810    319631    Lightning Charging Cable                 1       14.95   

                Order_Date           

In [217]:
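# Drop duplicate Order_ID rows, keeping the first occurrence.
# .copy() returns an independent DataFrame, so adding columns later
# does not raise SettingWithCopyWarning.
df_unique = df.drop_duplicates(subset='Order_ID', keep='first').copy()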

# Verify the result
print("DataFrame after removing duplicate Order_ID rows:")
print(df_unique)


DataFrame after removing duplicate Order_ID rows:
        Order_ID                     Product  Quantity_Ordered  Price_Each  \
0         141234                      iPhone                 1      700.00   
1         141235    Lightning Charging Cable                 1       14.95   
2         141236            Wired Headphones                 2       11.99   
3         141237            27in FHD Monitor                 1      149.99   
4         141238            Wired Headphones                 1       11.99   
...          ...                         ...               ...         ...   
186845    319666    Lightning Charging Cable                 1       14.95   
186846    319667       AA Batteries (4-pack)                 2        3.84   
186847    319668             Vareebadd Phone                 1      400.00   
186848    319669            Wired Headphones                 1       11.99   
186849    319670  Bose SoundSport Headphones                 1       99.99   

             

In [218]:
# Count the number of duplicated Order_IDs
num_duplicates = df['Order_ID'].duplicated().sum()

print(f"Number of duplicated Order_IDs: {num_duplicates}")


Number of duplicated Order_IDs: 7513
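
A repeated Order_ID is not necessarily a duplicated record: one order can contain several products, in which case each product is its own row under the same ID. Dropping on Order_ID alone would then discard legitimate line items and understate revenue. The sketch below uses a small hypothetical frame (the rows are illustrative, not taken from the dataset) to separate exact duplicate rows from multi-item orders before deciding what to drop.

```python
import pandas as pd

# Hypothetical toy data: order 1 holds two different products,
# order 2 is the same row recorded twice, order 3 is a normal order.
orders = pd.DataFrame({
    "Order_ID": [1, 1, 2, 2, 3],
    "Product": ["iPhone", "Wired Headphones",
                "20in Monitor", "20in Monitor", "LG Dryer"],
    "Quantity_Ordered": [1, 1, 1, 1, 1],
})

# Rows identical to an earlier row in every column
exact_dupes = orders.duplicated().sum()

# Rows whose Order_ID already appeared, regardless of the other columns
id_dupes = orders["Order_ID"].duplicated().sum()

# Repeated IDs that are not exact repeats: likely extra items in an order
multi_item_rows = id_dupes - exact_dupes

# Drop only exact duplicates, keeping every distinct line item
deduped = orders.drop_duplicates().copy()

print(f"Exact duplicate rows: {exact_dupes}")
print(f"Repeated Order_IDs:   {id_dupes}")
print(f"Multi-item rows:      {multi_item_rows}")
print(f"Rows after dedup:     {len(deduped)}")
```

If most of the 7513 repeated IDs turn out to be multi-item rows, `drop_duplicates()` without `subset` may be the safer choice for revenue calculations.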


In [223]:
df_unique

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,141234,iPhone,1,700.00,2019-01-22 21:25:00,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,2019-01-28 14:15:00,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,2019-01-17 13:33:00,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,2019-01-05 20:33:00,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,2019-01-25 11:59:00,"387 10th St, Austin, TX 73301"
...,...,...,...,...,...,...
186845,319666,Lightning Charging Cable,1,14.95,2019-12-19 20:58:00,"14 Madison St, San Francisco, CA 94016"
186846,319667,AA Batteries (4-pack),2,3.84,2019-12-19 12:01:00,"549 Willow St, Los Angeles, CA 90001"
186847,319668,Vareebadd Phone,1,400.00,2019-12-19 06:43:00,"273 Wilson St, Seattle, WA 98101"
186848,319669,Wired Headphones,1,11.99,2019-12-19 10:39:00,"778 River St, Dallas, TX 75001"


In [224]:
# Ensure that Order_Date is in datetime format
df_unique['Order_Date'] = pd.to_datetime(df_unique['Order_Date'])

# Create a new column 'Month' by extracting the month from 'Order_Date'
df_unique['Month'] = df_unique['Order_Date'].dt.month

# Verify the result
print(df_unique[['Order_Date', 'Month']].head())

           Order_Date  Month
0 2019-01-22 21:25:00      1
1 2019-01-28 14:15:00      1
2 2019-01-17 13:33:00      1
3 2019-01-05 20:33:00      1
4 2019-01-25 11:59:00      1


In [225]:
df_unique

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address,Month
0,141234,iPhone,1,700.00,2019-01-22 21:25:00,"944 Walnut St, Boston, MA 02215",1
1,141235,Lightning Charging Cable,1,14.95,2019-01-28 14:15:00,"185 Maple St, Portland, OR 97035",1
2,141236,Wired Headphones,2,11.99,2019-01-17 13:33:00,"538 Adams St, San Francisco, CA 94016",1
3,141237,27in FHD Monitor,1,149.99,2019-01-05 20:33:00,"738 10th St, Los Angeles, CA 90001",1
4,141238,Wired Headphones,1,11.99,2019-01-25 11:59:00,"387 10th St, Austin, TX 73301",1
...,...,...,...,...,...,...,...
186845,319666,Lightning Charging Cable,1,14.95,2019-12-19 20:58:00,"14 Madison St, San Francisco, CA 94016",12
186846,319667,AA Batteries (4-pack),2,3.84,2019-12-19 12:01:00,"549 Willow St, Los Angeles, CA 90001",12
186847,319668,Vareebadd Phone,1,400.00,2019-12-19 06:43:00,"273 Wilson St, Seattle, WA 98101",12
186848,319669,Wired Headphones,1,11.99,2019-12-19 10:39:00,"778 River St, Dallas, TX 75001",12


In [227]:
# Ensure that 'Quantity_Ordered' and 'Price_Each' are numeric
# (both conversions are written back to df_unique, not the original df)
df_unique['Quantity_Ordered'] = pd.to_numeric(df_unique['Quantity_Ordered'], errors='coerce')
df_unique['Price_Each'] = pd.to_numeric(df_unique['Price_Each'], errors='coerce')

# Create a new column 'Sales' by multiplying 'Quantity_Ordered' and 'Price_Each'
df_unique['Sales'] = df_unique['Quantity_Ordered'] * df_unique['Price_Each']

# Verify the result
print(df_unique[['Quantity_Ordered', 'Price_Each', 'Sales']].head())

df_unique

   Quantity_Ordered  Price_Each   Sales
0                 1      700.00  700.00
1                 1       14.95   14.95
2                 2       11.99   23.98
3                 1      149.99  149.99
4                 1       11.99   11.99


Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address,Month,Sales
0,141234,iPhone,1,700.00,2019-01-22 21:25:00,"944 Walnut St, Boston, MA 02215",1,700.00
1,141235,Lightning Charging Cable,1,14.95,2019-01-28 14:15:00,"185 Maple St, Portland, OR 97035",1,14.95
2,141236,Wired Headphones,2,11.99,2019-01-17 13:33:00,"538 Adams St, San Francisco, CA 94016",1,23.98
3,141237,27in FHD Monitor,1,149.99,2019-01-05 20:33:00,"738 10th St, Los Angeles, CA 90001",1,149.99
4,141238,Wired Headphones,1,11.99,2019-01-25 11:59:00,"387 10th St, Austin, TX 73301",1,11.99
...,...,...,...,...,...,...,...,...
186845,319666,Lightning Charging Cable,1,14.95,2019-12-19 20:58:00,"14 Madison St, San Francisco, CA 94016",12,14.95
186846,319667,AA Batteries (4-pack),2,3.84,2019-12-19 12:01:00,"549 Willow St, Los Angeles, CA 90001",12,7.68
186847,319668,Vareebadd Phone,1,400.00,2019-12-19 06:43:00,"273 Wilson St, Seattle, WA 98101",12,400.00
186848,319669,Wired Headphones,1,11.99,2019-12-19 10:39:00,"778 River St, Dallas, TX 75001",12,11.99
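
With Price_Each confirmed numeric, the product-level constraint from the business understanding step (products priced above $99.99 are "High-Level", all others "Basic-Level") can be expressed as one more derived column. This is a minimal sketch on a few product/price pairs that appear in the data above. The column name `Product_Level` is an assumption, not something defined in the notebook so far.

```python
import numpy as np
import pandas as pd

# A few product/price pairs that appear in the dataset
prices = pd.DataFrame({
    "Product": ["iPhone", "Bose SoundSport Headphones",
                "27in FHD Monitor", "Wired Headphones"],
    "Price_Each": [700.00, 99.99, 149.99, 11.99],
})

# Strictly greater than 99.99 is High-Level; exactly 99.99 stays Basic-Level
prices["Product_Level"] = np.where(
    prices["Price_Each"] > 99.99, "High-Level", "Basic-Level"
)

print(prices)
```

The same `np.where` expression could be applied to `df_unique['Price_Each']` directly.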
