## Business Understanding

### `Project Description`
Imagine being at the helm of a business with a treasure trove of transactional data, brimming with untapped potential, yet unable to harness its power. That's where our client finds themselves—a company sitting on a goldmine of 2019 transactional data, eager to uncover insights that could propel their business to new heights. They've turned to us, seeking a transformative business intelligence solution that not only answers their pressing questions but also illuminates hidden opportunities to boost sales and streamline operations.

Enter getINNOtized, a dynamic organization dedicated to connecting talented data professionals with businesses in need of innovative data solutions. Through their platform, companies can leverage top-tier analytical expertise to unlock the full potential of their data. getINNOtized has assigned us this mission, confident in our ability to deliver a comprehensive, actionable business intelligence report that will empower the client to make informed, strategic decisions. Our task is clear: dive deep into this data, decode its secrets, and deliver insights that guide the client towards increased revenue and enhanced efficiency.

#### **Key Stakeholders**
The stakeholders include the client's executives, the sales and marketing team, the product management team, the logistics and supply chain team, the IT and data team, getINNOtized, and external stakeholders (investors and suppliers).

#### **Success Criteria**

The success of this project will be measured by the ability to deliver a detailed and actionable business intelligence report that answers the client's questions and provides insights for driving sales and efficiency. The report should be clear, visually appealing, and easy to understand, providing the client with a solid foundation for making data-driven decisions.

#### **Constraints and Considerations**

- Data Quality: Ensure the data is clean, accurate, and complete before analysis.
- Timeliness: The analysis and report should be delivered within the agreed-upon timeframe.
- Client Collaboration: Regular communication with the client to understand their needs and provide updates on progress.
- Tool Selection: Utilize appropriate data analysis and visualization tools to generate insights and present findings effectively.

#### **Data Requirements**
- Use the data collected for each month of 2019. The data for the first half of the year (January to June) was collected in Excel and saved as CSV files before management decided to store data in a database for analysis.

**<i>NB</i>** Additionally, categorize products based on their unit prices:
- Products with unit prices above $99.99 should be labeled as high-level products.
- Products with unit prices $99.99 and below should be labeled as basic-level products.
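
The price rule above can be sketched as follows. This is a minimal illustration on hypothetical sample rows; the `Product_Level` column name and the label strings are assumptions, not names taken from the client's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; Price_Each follows the column naming used in the sales tables.
sample = pd.DataFrame({
    "Product": ["Macbook Pro Laptop", "Bose SoundSport Headphones", "USB-C Charging Cable"],
    "Price_Each": [1700.00, 99.99, 11.95],
})

# Unit prices above 99.99 are high-level; 99.99 and below are basic-level.
sample["Product_Level"] = np.where(sample["Price_Each"] > 99.99, "High-level", "Basic-level")
print(sample)
```

Note that a product priced at exactly $99.99 falls in the basic-level group because the comparison is strictly greater than.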

#### **Business Impact**
- Enhance customer satisfaction through better product availability.
- Optimize inventory management, leading to cost savings and improved operational efficiency.

### `Hypothesis`

*Null Hypothesis (Ho):* 

*Alternate Hypothesis (Ha):* 

### `Analytical Business Questions`

1. How much money did we make this year?
2. Can we identify any seasonality in the sales?
3. What are our best and worst-selling products?
4. How do sales compare to previous months or weeks? 
5. Which cities are our products delivered to most? 
6. How do product categories compare in revenue generated and quantities ordered?
7. What additional insights can be drawn from the data?
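
As a rough sketch of how the first two questions might be approached, revenue can be derived as quantity multiplied by unit price and then summed overall and per month. The rows below are hypothetical, and the `Revenue` column is an assumed name introduced only for illustration.

```python
import pandas as pd

# Hypothetical order lines using the column names from the sales tables.
orders = pd.DataFrame({
    "Quantity_Ordered": [1, 2, 1],
    "Price_Each": [700.00, 11.99, 99.99],
    "Order_Date": pd.to_datetime([
        "2019-03-28 20:59:00",
        "2019-03-28 20:59:00",
        "2019-06-25 18:58:00",
    ]),
})

# Revenue per order line (assumed column name).
orders["Revenue"] = orders["Quantity_Ordered"] * orders["Price_Each"]

# Total revenue, and revenue grouped by calendar month number.
total_revenue = orders["Revenue"].sum()
monthly_revenue = orders.groupby(orders["Order_Date"].dt.month)["Revenue"].sum()
print(total_revenue)
print(monthly_revenue)
```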

### `Importations`

In [127]:
# Import the necessary libraries

# Data Connection
import pyodbc
from dotenv import dotenv_values

# Data Manipulation
import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Stats Packages
from scipy import stats


# Others
from datetime import datetime
import os
from itertools import product
pd.options.display.float_format = '{:.2f}'.format
import warnings
warnings.filterwarnings('ignore')

print("PACKAGE SUCCESS!! 🎉")


PACKAGE SUCCESS!! 🎉


### `Data Connection`

In [128]:
# Load environment variables from .env file into a dictionary
environment_variables=dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("server_name")
database = environment_variables.get("database_name")
username = environment_variables.get("user")
password = environment_variables.get("password")

connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"

# Use the connect method of the pyodbc library and pass in the connection string.
connection = pyodbc.connect(connection_string)
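
If a key is missing from the `.env` file, `environment_variables.get(...)` returns `None` and the connection string silently contains the text `None`. One possible guard is sketched below; `build_connection_string` is a hypothetical helper, not part of `pyodbc` or `python-dotenv`, and it takes a plain dictionary so it can be tried without a database.

```python
def build_connection_string(env):
    """Build the SQL Server connection string, failing early on missing keys."""
    required = ["server_name", "database_name", "user", "password"]
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError(f"Missing credentials in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['server_name']};"
        f"DATABASE={env['database_name']};"
        f"UID={env['user']};"
        f"PWD={env['password']}"
    )


# Placeholder values for illustration only.
example_env = {"server_name": "srv", "database_name": "db", "user": "u", "password": "p"}
print(build_connection_string(example_env))
```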

In [129]:
# Get the cursor
# The connection cursor is used to execute statements to communicate with the SQL Server database
cursor = connection.cursor()
# Retrieve the table names
table_names = cursor.tables(tableType='TABLE')
# Fetch all the table names
tables = table_names.fetchall()
# Print the table names
for table in tables:
    print(table.table_name)

Sales_August_2019
Sales_December_2019
Sales_July_2019
Sales_November_2019
Sales_October_2019
Sales_September_2019
change_streams_destination_type
change_streams_partition_scheme
trace_xe_action_map
trace_xe_event_map


In [130]:
# sql query to get the datasets
query = "SELECT * FROM Sales_August_2019"
query2 = "SELECT * FROM Sales_December_2019"
query3 = "SELECT * FROM Sales_July_2019"
query4 = "SELECT * FROM Sales_November_2019"
query5 = "SELECT * FROM Sales_October_2019"
query6 = "SELECT * FROM Sales_September_2019"

data=pd.read_sql(query,connection)
data2=pd.read_sql(query2,connection)
data3=pd.read_sql(query3,connection)
data4=pd.read_sql(query4,connection)
data5=pd.read_sql(query5,connection)
data6=pd.read_sql(query6,connection)
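
The six near-identical queries could also be written as a loop that stores each table in a dictionary keyed by table name. The sketch below uses an in-memory SQLite database as a stand-in for the SQL Server connection, purely so it can run anywhere; the table contents are hypothetical and `monthly_tables` is an assumed name.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the pyodbc connection (assumption for illustration).
connection = sqlite3.connect(":memory:")
table_names = ["Sales_July_2019", "Sales_August_2019"]

# Create small hypothetical tables so the queries below have something to read.
for name in table_names:
    pd.DataFrame({
        "Order_ID": [1.0, 2.0],
        "Product": ["iPhone", "Wired Headphones"],
    }).to_sql(name, connection, index=False)

# One query per table, collected in a dictionary keyed by table name.
monthly_tables = {
    name: pd.read_sql(f"SELECT * FROM {name}", connection)
    for name in table_names
}
print({name: frame.shape for name, frame in monthly_tables.items()})
```

Keeping the frames in a dictionary avoids numbered variables such as `data2` and `data3`, whose month is not obvious from the name.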

In [131]:
# save tables to csv
# Forward slashes avoid invalid escape sequences such as '\S' in the path strings
data.to_csv('data/Sales_August_2019.csv', index=False)
data2.to_csv('data/Sales_December_2019.csv', index=False)
data3.to_csv('data/Sales_July_2019.csv', index=False)
data4.to_csv('data/Sales_November_2019.csv', index=False)
data5.to_csv('data/Sales_October_2019.csv', index=False)
data6.to_csv('data/Sales_September_2019.csv', index=False)

In [132]:
data3.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,222910.0,Apple Airpods Headphones,1.0,150.0,2026-07-19 16:51:00.0000000,"389 South St, Atlanta, GA 30301"
1,222911.0,Flatscreen TV,1.0,300.0,2005-07-19 08:55:00.0000000,"590 4th St, Seattle, WA 98101"
2,222912.0,AA Batteries (4-pack),1.0,3.84,2029-07-19 12:41:00.0000000,"861 Hill St, Atlanta, GA 30301"
3,222913.0,AA Batteries (4-pack),1.0,3.84,2028-07-19 10:15:00.0000000,"190 Ridge St, Atlanta, GA 30301"
4,222914.0,AAA Batteries (4-pack),5.0,2.99,2031-07-19 02:13:00.0000000,"824 Forest St, Seattle, WA 98101"


In [133]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14371 entries, 0 to 14370
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          14291 non-null  float64
 1   Product           14326 non-null  object 
 2   Quantity_Ordered  14291 non-null  float64
 3   Price_Each        14291 non-null  float64
 4   Order_Date        14291 non-null  object 
 5   Purchase_Address  14326 non-null  object 
dtypes: float64(3), object(3)
memory usage: 673.8+ KB


In [134]:
# Read CSV and change date column from object to date type
sales_Aug = pd.read_csv("data/Sales_August_2019.csv", parse_dates=['Order_Date'])
sales_Dec = pd.read_csv("data/Sales_December_2019.csv", parse_dates=['Order_Date'])
sales_Jul = pd.read_csv("data/Sales_July_2019.csv", parse_dates=['Order_Date'])
sales_Nov = pd.read_csv("data/Sales_November_2019.csv", parse_dates=['Order_Date'])
sales_Oct = pd.read_csv("data/Sales_October_2019.csv", parse_dates=['Order_Date'])
sales_Sept = pd.read_csv("data/Sales_September_2019.csv", parse_dates=['Order_Date'])

In [135]:
sales_Nov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17661 entries, 0 to 17660
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order_ID          17580 non-null  float64       
 1   Product           17616 non-null  object        
 2   Quantity_Ordered  17580 non-null  float64       
 3   Price_Each        17580 non-null  float64       
 4   Order_Date        17580 non-null  datetime64[ns]
 5   Purchase_Address  17616 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 828.0+ KB


In [136]:
# Remove sub-second precision by flooring each timestamp to the second
# ('s' is the current alias; the uppercase 'S' is deprecated in recent pandas versions)
sales_Aug['Order_Date'] = sales_Aug['Order_Date'].dt.floor('s')
sales_Dec['Order_Date'] = sales_Dec['Order_Date'].dt.floor('s')
sales_Jul['Order_Date'] = sales_Jul['Order_Date'].dt.floor('s')
sales_Nov['Order_Date'] = sales_Nov['Order_Date'].dt.floor('s')
sales_Oct['Order_Date'] = sales_Oct['Order_Date'].dt.floor('s')
sales_Sept['Order_Date'] = sales_Sept['Order_Date'].dt.floor('s')
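
To illustrate what flooring to the second does, here is a small sketch on hypothetical timestamps that carry fractional seconds. Flooring always rounds down, so `08:55:59.999999` becomes `08:55:59`, not `08:56:00`.

```python
import pandas as pd

# Hypothetical timestamps with fractional seconds.
stamps = pd.Series(pd.to_datetime([
    "2019-07-19 16:51:00.123456",
    "2019-07-19 08:55:59.999999",
]))

# Floor each timestamp down to the whole second.
floored = stamps.dt.floor("s")
print(floored)
```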

In [137]:
sales_Sept.head()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,248151.0,AA Batteries (4-pack),4.0,3.84,2017-09-19 14:44:00,"380 North St, Los Angeles, CA 90001"
1,248152.0,USB-C Charging Cable,2.0,11.95,2029-09-19 10:19:00,"511 8th St, Austin, TX 73301"
2,248153.0,USB-C Charging Cable,1.0,11.95,2016-09-19 17:48:00,"151 Johnson St, Los Angeles, CA 90001"
3,248154.0,27in FHD Monitor,1.0,149.99,2027-09-19 07:52:00,"355 Hickory St, Seattle, WA 98101"
4,248155.0,USB-C Charging Cable,1.0,11.95,2001-09-19 19:03:00,"125 5th St, Atlanta, GA 30301"


In [168]:
# Remove leading/trailing spaces from column names for all DataFrames
dataframes = [sales_Aug, sales_Dec, sales_Jul, sales_Nov, sales_Oct, sales_Sept]
for df in dataframes:
    df.columns = df.columns.str.strip()

def fill_missing_object_columns(df):
    # Iterate through each column in the DataFrame
    for col in df.columns:
        # Check if the column data type is object
        if df[col].dtype == 'object':
            # Note: fillna(np.nan) replaces missing values with NaN,
            # so the column is left unchanged
            df[col] = df[col].fillna(np.nan)
    return df

# Apply the function to each DataFrame
sales_Aug = fill_missing_object_columns(sales_Aug)
sales_Dec = fill_missing_object_columns(sales_Dec)
sales_Jul = fill_missing_object_columns(sales_Jul)
sales_Nov = fill_missing_object_columns(sales_Nov)
sales_Oct = fill_missing_object_columns(sales_Oct)
sales_Sept = fill_missing_object_columns(sales_Sept)

In [169]:
sales_Sept.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11686 entries, 0 to 11685
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order_ID          11629 non-null  float64       
 1   Product           11646 non-null  object        
 2   Quantity_Ordered  11629 non-null  float64       
 3   Price_Each        11629 non-null  float64       
 4   Order_Date        11629 non-null  datetime64[ns]
 5   Purchase_Address  11646 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 547.9+ KB


In [170]:
def fill_missing_float_columns(df):
    # Iterate through each column in the DataFrame
    for col in df.columns:
        # Check if the column data type is float
        if df[col].dtype == 'float':
            # Note: fillna(np.nan) replaces missing values with NaN,
            # so the column is left unchanged
            df[col] = df[col].fillna(np.nan)
    return df

# Apply the function to each DataFrame
sales_Aug = fill_missing_float_columns(sales_Aug)
sales_Dec = fill_missing_float_columns(sales_Dec)
sales_Jul = fill_missing_float_columns(sales_Jul)
sales_Nov = fill_missing_float_columns(sales_Nov)
sales_Oct = fill_missing_float_columns(sales_Oct)
sales_Sept = fill_missing_float_columns(sales_Sept)
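
Because filling with `np.nan` leaves missing values as they are, the `info()` counts before and after these cells are identical. A quick way to inspect the missing values directly is sketched below on hypothetical rows; `fully_empty` is an assumed name for the count of rows where every column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including one that is entirely empty.
frame = pd.DataFrame({
    "Order_ID": [248151.0, np.nan, 248153.0],
    "Product": ["AA Batteries (4-pack)", None, "USB-C Charging Cable"],
})

# Missing values per column before and after filling with NaN.
before = frame.isna().sum()
frame["Order_ID"] = frame["Order_ID"].fillna(np.nan)
after = frame.isna().sum()

# Rows where every column is missing are candidates for dropping later.
fully_empty = int(frame.isna().all(axis=1).sum())
print(before, after, fully_empty, sep="\n")
```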

In [171]:
sales_Nov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17661 entries, 0 to 17660
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order_ID          17580 non-null  float64       
 1   Product           17616 non-null  object        
 2   Quantity_Ordered  17580 non-null  float64       
 3   Price_Each        17580 non-null  float64       
 4   Order_Date        17580 non-null  datetime64[ns]
 5   Purchase_Address  17616 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 828.0+ KB


## Loading the First Six Months of Data

The files for the first six months (January to June) can be found on OneDrive.

In [138]:
data7 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_April_2019.xlsx')
data8 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_February_2019.xlsx')
data9 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_January_2019.xlsx')
data10 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_June_2019.xlsx')
data11 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_March_2019.xlsx')
data12 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_May_2019.xlsx')

In [139]:
data10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13622 entries, 0 to 13621
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          13579 non-null  object
 1   Product           13579 non-null  object
 2   Quantity Ordered  13579 non-null  object
 3   Price Each        13579 non-null  object
 4   Order Date        13579 non-null  object
 5   Purchase Address  13579 non-null  object
dtypes: object(6)
memory usage: 638.7+ KB


In [140]:
data10.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,209921,USB-C Charging Cable,1,11.95,2019-06-23 19:34:00,"950 Walnut St, Portland, ME 04101"
1,209922,Macbook Pro Laptop,1,1700.0,2019-06-30 10:05:00,"80 4th St, San Francisco, CA 94016"
2,209923,ThinkPad Laptop,1,999.99,2019-06-24 20:18:00,"402 Jackson St, Los Angeles, CA 90001"
3,209924,27in FHD Monitor,1,149.99,2019-06-05 10:21:00,"560 10th St, Seattle, WA 98101"
4,209925,Bose SoundSport Headphones,1,99.99,2019-06-25 18:58:00,"545 2nd St, San Francisco, CA 94016"


In [141]:
# save tables to csv
# Forward slashes avoid invalid escape sequences such as '\S' in the path strings
data7.to_csv('data/Sales_April_2019.csv', index=False)
data8.to_csv('data/Sales_February_2019.csv', index=False)
data9.to_csv('data/Sales_January_2019.csv', index=False)
data10.to_csv('data/Sales_June_2019.csv', index=False)
data11.to_csv('data/Sales_March_2019.csv', index=False)
data12.to_csv('data/Sales_May_2019.csv', index=False)

In [142]:
# Read CSV and change date column from object to date type
sales_Apr = pd.read_csv("data/Sales_April_2019.csv", parse_dates=['Order Date'])
sales_Feb = pd.read_csv("data/Sales_February_2019.csv", parse_dates=['Order Date'])
sales_Jan = pd.read_csv("data/Sales_January_2019.csv", parse_dates=['Order Date'])
sales_Jun = pd.read_csv("data/Sales_June_2019.csv", parse_dates=['Order Date'])
sales_Mar = pd.read_csv("data/Sales_March_2019.csv", parse_dates=['Order Date'])
sales_May = pd.read_csv("data/Sales_May_2019.csv", parse_dates=['Order Date'])

In [143]:
# Function to safely parse the dates and remove sub-second precision
def safe_to_datetime(df, column_name):
    # Use the column passed in rather than a hard-coded name;
    # values that cannot be parsed become NaT
    df[column_name] = pd.to_datetime(df[column_name], format='mixed', errors='coerce')
    df[column_name] = df[column_name].dt.floor('s')
    return df

# Apply the function to each DataFrame
sales_Apr = safe_to_datetime(sales_Apr, 'Order Date')
sales_Feb = safe_to_datetime(sales_Feb, 'Order Date')
sales_Jan = safe_to_datetime(sales_Jan, 'Order Date')
sales_Jun = safe_to_datetime(sales_Jun, 'Order Date')
sales_Mar = safe_to_datetime(sales_Mar, 'Order Date')
sales_May = safe_to_datetime(sales_May, 'Order Date')
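
The effect of `errors='coerce'` can be seen on a small hypothetical column: anything that is not a valid date becomes `NaT` instead of raising an error. This sketch uses an explicit format string rather than `format='mixed'` so that it also runs on pandas versions older than 2.0; the stray text entry is an invented example of a bad value.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one stray text entry, one missing value.
raw = pd.Series(["2019-03-28 20:59:00", "Order Date", None])

# Unparseable and missing entries are coerced to NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
print(parsed)
print(parsed.isna().sum())
```

This is consistent with `sales_Mar.info()` reporting fewer non-null `Order Date` values than the other columns after parsing.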

In [144]:
sales_Mar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15226 entries, 0 to 15225
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order ID          15189 non-null  object        
 1   Product           15189 non-null  object        
 2   Quantity Ordered  15189 non-null  object        
 3   Price Each        15189 non-null  object        
 4   Order Date        15154 non-null  datetime64[ns]
 5   Purchase Address  15189 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 713.8+ KB


In [145]:
sales_Mar.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,162009,iPhone,1,700.0,2019-03-28 20:59:00,"942 Church St, Austin, TX 73301"
1,162009,Lightning Charging Cable,1,14.95,2019-03-28 20:59:00,"942 Church St, Austin, TX 73301"
2,162009,Wired Headphones,2,11.99,2019-03-28 20:59:00,"942 Church St, Austin, TX 73301"
3,162010,Bose SoundSport Headphones,1,99.99,2019-03-17 05:39:00,"261 10th St, San Francisco, CA 94016"
4,162011,34in Ultrawide Monitor,1,379.99,2019-03-10 00:01:00,"764 13th St, San Francisco, CA 94016"


## Data Cleaning

In [163]:
# Remove leading/trailing spaces from column names for all DataFrames
dataframes = [sales_Apr, sales_Feb, sales_Jan, sales_Jun, sales_Mar, sales_May]
for df in dataframes:
    df.columns = df.columns.str.strip()

# Function to clean and convert columns
def clean_and_convert(df, columns):
    for column in columns:
        # Remove leading/trailing spaces from the values (only if they are strings)
        if df[column].dtype == 'object':
            df[column] = df[column].str.strip()
        
        # Replace non-numeric values with NaN and convert to float
        df[column] = pd.to_numeric(df[column], errors='coerce')
    
    return df

# Columns to clean and convert
columns_to_convert = ['Order ID', 'Quantity Ordered', 'Price Each']

# Clean and convert the columns for each DataFrame
sales_Apr = clean_and_convert(sales_Apr, columns_to_convert)
sales_Feb = clean_and_convert(sales_Feb, columns_to_convert)
sales_Jan = clean_and_convert(sales_Jan, columns_to_convert)
sales_Jun = clean_and_convert(sales_Jun, columns_to_convert)
sales_Mar = clean_and_convert(sales_Mar, columns_to_convert)
sales_May = clean_and_convert(sales_May, columns_to_convert)
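
To show what the strip-then-coerce step does to an object column, here is a minimal sketch on hypothetical values. Padded numbers are converted, while text and missing entries end up as `NaN`.

```python
import pandas as pd

# Hypothetical raw values: a padded number, stray text, a decimal, and a missing value.
raw = pd.Series([" 2", "Quantity Ordered", "11.99", None])

# Strip whitespace first, then coerce anything non-numeric to NaN.
cleaned = pd.to_numeric(raw.str.strip(), errors="coerce")
print(cleaned)
```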



In [164]:
sales_May.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,194095.0,Wired Headphones,1.0,11.99,2019-05-16 17:14:00,"669 2nd St, New York City, NY 10001"
1,194096.0,AA Batteries (4-pack),1.0,3.84,2019-05-19 14:43:00,"844 Walnut St, Dallas, TX 75001"
2,194097.0,27in FHD Monitor,1.0,149.99,2019-05-24 11:36:00,"164 Madison St, New York City, NY 10001"
3,194098.0,Wired Headphones,1.0,11.99,2019-05-02 20:40:00,"622 Meadow St, Dallas, TX 75001"
4,194099.0,AAA Batteries (4-pack),2.0,2.99,2019-05-11 22:55:00,"17 Church St, Seattle, WA 98101"


In [165]:
# Define the renaming dictionary
rename_dict = {
    'Order ID': 'Order_ID',
    'Quantity Ordered': 'Quantity_Ordered',
    'Price Each':'Price_Each',
    'Order Date':'Order_Date',
    'Purchase Address':'Purchase_Address',
}

# List of DataFrames
dataframes = [sales_Apr, sales_Feb, sales_Jan, sales_Jun, sales_Mar, sales_May]

# Rename columns in each DataFrame
for df in dataframes:
    df.rename(columns=rename_dict, inplace=True)
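
An alternative to an explicit renaming dictionary is to replace spaces with underscores in every column name. The sketch below is an assumption about how the two halves of the year might then be checked and stacked; the frames are hypothetical and `combined` is an illustrative name.

```python
import pandas as pd

# Hypothetical frames: first-half columns use spaces, second-half columns use underscores.
first_half = pd.DataFrame({"Order ID": [162009.0], "Price Each": [700.0]})
second_half = pd.DataFrame({"Order_ID": [222910.0], "Price_Each": [150.0]})

# Normalise the first-half column names to match the database tables.
first_half.columns = first_half.columns.str.strip().str.replace(" ", "_", regex=False)

# With matching column names, the frames can be stacked without creating extra columns.
same_columns = list(first_half.columns) == list(second_half.columns)
combined = pd.concat([first_half, second_half], ignore_index=True)
print(same_columns, combined.shape)
```

The dictionary approach used above is more explicit; the string replacement is shorter but renames every column that contains a space.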


In [167]:
#checking the dataframe
sales_May.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16635 entries, 0 to 16634
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order_ID          16554 non-null  float64       
 1   Product           16587 non-null  object        
 2   Quantity_Ordered  16554 non-null  float64       
 3   Price_Each        16554 non-null  float64       
 4   Order_Date        16554 non-null  datetime64[ns]
 5   Purchase_Address  16587 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 779.9+ KB


In [172]:
# fill_missing_float_columns is defined in an earlier cell and is reused here.
# Note: it calls fillna(np.nan), which leaves missing values unchanged.

# Apply the function to each DataFrame
sales_Apr = fill_missing_float_columns(sales_Apr)
sales_Feb = fill_missing_float_columns(sales_Feb)
sales_Jan = fill_missing_float_columns(sales_Jan)
sales_Jun = fill_missing_float_columns(sales_Jun)
sales_Mar = fill_missing_float_columns(sales_Mar)
sales_May = fill_missing_float_columns(sales_May)
