## Business Understanding

### `Project Description`
Imagine being at the helm of a business with a treasure trove of transactional data, brimming with untapped potential, yet unable to harness its power. That's where our client finds themselves—a company sitting on a goldmine of 2019 transactional data, eager to uncover insights that could propel their business to new heights. They've turned to us, seeking a transformative business intelligence solution that not only answers their pressing questions but also illuminates hidden opportunities to boost sales and streamline operations.

Enter getINNOtized, a dynamic organization dedicated to connecting talented data professionals with businesses in need of innovative data solutions. Through their platform, companies can leverage top-tier analytical expertise to unlock the full potential of their data. getINNOtized has assigned us this mission, confident in our ability to deliver a comprehensive, actionable business intelligence report that will empower the client to make informed, strategic decisions. Our task is clear: dive deep into this data, decode its secrets, and deliver insights that guide the client towards increased revenue and enhanced efficiency.

#### **Key Stakeholders**
The stakeholders include the client's company executives, the sales and marketing team, the product management team, the logistics and supply chain team, the IT and data team, getINNOtized, and external stakeholders (investors and suppliers).

#### **Success Criteria**

The success of this project will be measured by the ability to deliver a detailed and actionable business intelligence report that answers the client's questions and provides insights for driving sales and efficiency. The report should be clear, visually appealing, and easy to understand, providing the client with a solid foundation for making data-driven decisions.

#### **Constraints and Considerations**

- Data Quality: Ensure the data is clean, accurate, and complete before analysis.
- Timeliness: The analysis and report should be delivered within the agreed-upon timeframe.
- Client Collaboration: Regular communication with the client to understand their needs and provide updates on progress.
- Tool Selection: Utilize appropriate data analysis and visualization tools to generate insights and present findings effectively.

#### **Data Requirements**
- Use the data collected for each month of 2019. The data for the first half of the year (January to June) was collected in Excel and saved as CSV files, before management decided to store their data in databases for analysis.

**<i>NB</i>** Additionally, categorize products based on their unit prices:
- Products with unit prices above $99.99 should be labeled as high-level products.
- Products with unit prices $99.99 and below should be labeled as basic-level products.
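
The labeling rule above can be sketched as follows. This is a minimal illustration, not the project's final implementation: the sample rows are hypothetical, the `Price_Each` column name follows the database tables, and the `Product_Level` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; Price_Each follows the naming used in the database tables.
sample = pd.DataFrame({
    "Product": ["USB-C Charging Cable", "Bose SoundSport Headphones", "27in FHD Monitor"],
    "Price_Each": [11.95, 99.99, 149.99],
})

# Strictly above 99.99 is high-level; 99.99 and below is basic-level.
# "Product_Level" is an assumed column name, not one defined by the client.
sample["Product_Level"] = np.where(sample["Price_Each"] > 99.99, "High-level", "Basic-level")

print(sample)
```

Note that a product priced at exactly $99.99 falls in the basic-level group, which is why the comparison is strict.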

#### **Business Impact**
- Enhance customer satisfaction through better product availability.
- Optimize inventory management, leading to cost savings and improved operational efficiency.

### `Hypothesis`

*Null Hypothesis (Ho):* 

*Alternate Hypothesis (Ha):* 

### `Analytical Business Questions`

1. How much money did we make this year?
2. Can we identify any seasonality in the sales?
3. What are our best and worst-selling products?
4. How do sales compare to previous months or weeks?
5. Which cities are our products delivered to most?
6. How do product categories compare in revenue generated and quantities ordered?
7. You are required to show additional details from your findings in your data.
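
As a rough sketch of how questions 1, 4, and 5 might be approached once the data is clean, the example below derives revenue, order month, and delivery city. The rows are hypothetical, and the `Revenue`, `Month`, and `City` column names are assumptions introduced for illustration. It also assumes every address has the form "street, city, state zip".

```python
import pandas as pd

# Hypothetical sample in the cleaned schema (numeric quantities and prices, parsed dates).
orders = pd.DataFrame({
    "Quantity_Ordered": [1, 2, 1],
    "Price_Each": [11.95, 99.99, 1700.0],
    "Order_Date": pd.to_datetime(
        ["2019-06-23 19:34:00", "2019-06-25 18:58:00", "2019-07-01 10:05:00"]
    ),
    "Purchase_Address": [
        "950 Walnut St, Portland, ME 04101",
        "545 2nd St, San Francisco, CA 94016",
        "80 4th St, San Francisco, CA 94016",
    ],
})

# Revenue per order line is quantity times unit price.
orders["Revenue"] = orders["Quantity_Ordered"] * orders["Price_Each"]

# The month number supports month-over-month comparisons.
orders["Month"] = orders["Order_Date"].dt.month

# The city is the second comma-separated field of the address.
orders["City"] = orders["Purchase_Address"].str.split(",").str[1].str.strip()

total_revenue = orders["Revenue"].sum()
monthly_revenue = orders.groupby("Month")["Revenue"].sum()
orders_per_city = orders["City"].value_counts()

print(total_revenue)
print(monthly_revenue)
print(orders_per_city)
```

City names alone can be ambiguous (for example, Portland exists in more than one state), so keeping the state alongside the city may be worth considering.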

### `Importations`

In [57]:
# Import the necessary libraries

# Data Connection
import pyodbc
from dotenv import dotenv_values

# Data Manipulation
import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Stats Packages
from scipy import stats


# Others
import os
from itertools import product
pd.options.display.float_format = '{:.2f}'.format
import warnings
warnings.filterwarnings('ignore')

print("PACKAGE SUCCESS!! 🎉")


PACKAGE SUCCESS!! 🎉


### `Data Connection`

In [58]:
# Load environment variables from .env file into a dictionary
environment_variables=dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("server_name")
database = environment_variables.get("database_name")
username = environment_variables.get("user")
password = environment_variables.get("password")

connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"

# Use the connect method of the pyodbc library and pass in the connection string.
connection = pyodbc.connect(connection_string)
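
If a key is missing from the `.env` file, `dotenv_values` simply returns no value for it and the failure only surfaces later as a driver error. One possible safeguard, sketched here under the assumption that the same four key names are used, is to validate the settings before building the connection string. The helper name `build_connection_string` and the example values are hypothetical.

```python
def build_connection_string(settings: dict) -> str:
    """Build a SQL Server ODBC connection string, failing early on missing keys."""
    required = ["server_name", "database_name", "user", "password"]
    missing = [key for key in required if not settings.get(key)]
    if missing:
        raise KeyError(f"Missing .env entries: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={settings['server_name']};"
        f"DATABASE={settings['database_name']};"
        f"UID={settings['user']};"
        f"PWD={settings['password']}"
    )


# Hypothetical values standing in for the contents of the .env file.
example_settings = {
    "server_name": "example-server",
    "database_name": "example_db",
    "user": "analyst",
    "password": "secret",
}
example_string = build_connection_string(example_settings)
print(example_string)
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in place of `example_settings`.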

In [59]:
# Get the cursor
# The connection cursor is used to execute statements to communicate with the SQL Server database
cursor = connection.cursor()
# Retrieve the table names
table_names = cursor.tables(tableType='TABLE')
# Fetch all the table names
tables = table_names.fetchall()
# Print the table names
for table in tables:
    print(table.table_name)

Sales_August_2019
Sales_December_2019
Sales_July_2019
Sales_November_2019
Sales_October_2019
Sales_September_2019
change_streams_destination_type
change_streams_partition_scheme
trace_xe_action_map
trace_xe_event_map


In [60]:
# sql query to get the datasets
query = "SELECT * FROM Sales_August_2019"
query2 = "SELECT * FROM Sales_December_2019"
query3 = "SELECT * FROM Sales_July_2019"
query4 = "SELECT * FROM Sales_November_2019"
query5 = "SELECT * FROM Sales_October_2019"
query6 = "SELECT * FROM Sales_September_2019"

data=pd.read_sql(query,connection)
data2=pd.read_sql(query2,connection)
data3=pd.read_sql(query3,connection)
data4=pd.read_sql(query4,connection)
data5=pd.read_sql(query5,connection)
data6=pd.read_sql(query6,connection)
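
The six near-identical queries above could also be expressed as a loop over the table names. The sketch below shows the idea with an in-memory SQLite database standing in for the SQL Server connection, so that it runs anywhere; the helper name `load_tables` and the inserted row are hypothetical.

```python
import sqlite3

import pandas as pd

# Stand-in connection for illustration; the notebook would pass its pyodbc connection instead.
demo_connection = sqlite3.connect(":memory:")
demo_connection.execute("CREATE TABLE Sales_July_2019 (Order_ID REAL, Product TEXT)")
demo_connection.execute("CREATE TABLE Sales_August_2019 (Order_ID REAL, Product TEXT)")
# Hypothetical row, not taken from the client's data.
demo_connection.execute("INSERT INTO Sales_July_2019 VALUES (1, 'Example Product')")
demo_connection.commit()


def load_tables(connection, table_names):
    """Read each named table into a DataFrame, keyed by table name."""
    # The names come from a fixed list rather than user input,
    # so interpolating them into the query text is acceptable here.
    return {name: pd.read_sql(f"SELECT * FROM {name}", connection) for name in table_names}


frames = load_tables(demo_connection, ["Sales_July_2019", "Sales_August_2019"])
for name, frame in frames.items():
    print(name, frame.shape)
```

Keeping the frames in a dictionary keyed by table name makes later steps, such as saving each one to CSV, a single loop as well.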

In [61]:
# Save the tables to CSV (forward slashes avoid invalid escape sequences such as '\S')
data.to_csv('data/Sales_August_2019.csv', index=False)
data2.to_csv('data/Sales_December_2019.csv', index=False)
data3.to_csv('data/Sales_July_2019.csv', index=False)
data4.to_csv('data/Sales_November_2019.csv', index=False)
data5.to_csv('data/Sales_October_2019.csv', index=False)
data6.to_csv('data/Sales_September_2019.csv', index=False)

In [62]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14371 entries, 0 to 14370
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Order_ID          14291 non-null  float64
 1   Product           14326 non-null  object 
 2   Quantity_Ordered  14291 non-null  float64
 3   Price_Each        14291 non-null  float64
 4   Order_Date        14291 non-null  object 
 5   Purchase_Address  14326 non-null  object 
dtypes: float64(3), object(3)
memory usage: 673.8+ KB


In [63]:
# Read the CSVs and parse the date column from object to datetime
sales_Aug = pd.read_csv("data/Sales_August_2019.csv", parse_dates=['Order_Date'])
sales_Dec = pd.read_csv("data/Sales_December_2019.csv", parse_dates=['Order_Date'])
sales_Jul = pd.read_csv("data/Sales_July_2019.csv", parse_dates=['Order_Date'])
sales_Nov = pd.read_csv("data/Sales_November_2019.csv", parse_dates=['Order_Date'])
sales_Oct = pd.read_csv("data/Sales_October_2019.csv", parse_dates=['Order_Date'])
sales_Sept = pd.read_csv("data/Sales_September_2019.csv", parse_dates=['Order_Date'])

In [64]:
sales_Nov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17661 entries, 0 to 17660
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Order_ID          17580 non-null  float64       
 1   Product           17616 non-null  object        
 2   Quantity_Ordered  17580 non-null  float64       
 3   Price_Each        17580 non-null  float64       
 4   Order_Date        17580 non-null  datetime64[ns]
 5   Purchase_Address  17616 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 828.0+ KB


Load the data for the first six months of the year, which can be found on OneDrive.

In [65]:
data7 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_April_2019.xlsx')
data8 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_February_2019.xlsx')
data9 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_January_2019.xlsx')
data10 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_June_2019.xlsx')
data11 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_March_2019.xlsx')
data12 = pd.read_excel('C:\\Users\\KEMUNTO\\Downloads\\Sales_May_2019.xlsx')

In [66]:
data10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13622 entries, 0 to 13621
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          13579 non-null  object
 1   Product           13579 non-null  object
 2   Quantity Ordered  13579 non-null  object
 3   Price Each        13579 non-null  object
 4   Order Date        13579 non-null  object
 5   Purchase Address  13579 non-null  object
dtypes: object(6)
memory usage: 638.7+ KB
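
Every column in this frame has dtype `object`, which suggests that at least some cells hold text that is not a plain number or date. One way to investigate is to list the values that fail numeric conversion. The sketch below assumes, purely for illustration, that a stray header row is mixed into the data; the output above does not confirm that cause. The helper name `non_numeric_values` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows mimicking an all-object frame:
# one valid order, one stray header row, and one empty row.
raw = pd.DataFrame({
    "Order ID": ["209921", "Order ID", None],
    "Quantity Ordered": ["1", "Quantity Ordered", None],
    "Price Each": ["11.95", "Price Each", None],
})


def non_numeric_values(series: pd.Series) -> pd.Series:
    """Return the non-null entries that cannot be converted to a number."""
    converted = pd.to_numeric(series, errors="coerce")
    return series[converted.isna() & series.notna()]


bad_prices = non_numeric_values(raw["Price Each"])
print(bad_prices.value_counts())
```

Running the same helper on `data10['Price Each']` would show which values, if any, keep the real column from being numeric.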


In [67]:
data10.head()

| | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
|---|---|---|---|---|---|---|
| 0 | 209921 | USB-C Charging Cable | 1 | 11.95 | 2019-06-23 19:34:00 | 950 Walnut St, Portland, ME 04101 |
| 1 | 209922 | Macbook Pro Laptop | 1 | 1700.0 | 2019-06-30 10:05:00 | 80 4th St, San Francisco, CA 94016 |
| 2 | 209923 | ThinkPad Laptop | 1 | 999.99 | 2019-06-24 20:18:00 | 402 Jackson St, Los Angeles, CA 90001 |
| 3 | 209924 | 27in FHD Monitor | 1 | 149.99 | 2019-06-05 10:21:00 | 560 10th St, Seattle, WA 98101 |
| 4 | 209925 | Bose SoundSport Headphones | 1 | 99.99 | 2019-06-25 18:58:00 | 545 2nd St, San Francisco, CA 94016 |


In [68]:
# Save the tables to CSV (forward slashes avoid invalid escape sequences such as '\S')
data7.to_csv('data/Sales_April_2019.csv', index=False)
data8.to_csv('data/Sales_February_2019.csv', index=False)
data9.to_csv('data/Sales_January_2019.csv', index=False)
data10.to_csv('data/Sales_June_2019.csv', index=False)
data11.to_csv('data/Sales_March_2019.csv', index=False)
data12.to_csv('data/Sales_May_2019.csv', index=False)

In [74]:
# Read the CSVs back in. These files still carry the original spaced headers
# ('Order Date', not 'Order_Date'), so passing parse_dates=['Order_Date'] raises
# "ValueError: Missing column provided to 'parse_dates': 'Order_Date'".
# The date column is left as text here and can be converted once the columns are renamed.
sales_Apr = pd.read_csv("data/Sales_April_2019.csv")
sales_Feb = pd.read_csv("data/Sales_February_2019.csv")
sales_Jan = pd.read_csv("data/Sales_January_2019.csv")
sales_Jun = pd.read_csv("data/Sales_June_2019.csv")
sales_Mar = pd.read_csv("data/Sales_March_2019.csv")
sales_May = pd.read_csv("data/Sales_May_2019.csv")

In [73]:
sales_Jun.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13622 entries, 0 to 13621
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order_ID          13579 non-null  object
 1   Product           13579 non-null  object
 2   Quantity_Ordered  13579 non-null  object
 3   Price_Each        13579 non-null  object
 4   Order_Date        13579 non-null  object
 5   Purchase_Address  13579 non-null  object
dtypes: object(6)
memory usage: 638.7+ KB


### `Data Cleaning`

In [70]:
# Rename the columns of the OneDrive dataframes to match the database tables
column_names = {'Order ID': 'Order_ID', 'Quantity Ordered': 'Quantity_Ordered', 'Price Each': 'Price_Each', 'Order Date': 'Order_Date', 'Purchase Address': 'Purchase_Address'}
sales_Apr = sales_Apr.rename(columns=column_names)
sales_Feb = sales_Feb.rename(columns=column_names)
sales_Jan = sales_Jan.rename(columns=column_names)
sales_Jun = sales_Jun.rename(columns=column_names)
sales_Mar = sales_Mar.rename(columns=column_names)
sales_May = sales_May.rename(columns=column_names)

In [71]:
sales_May.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16635 entries, 0 to 16634
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order_ID          16587 non-null  object
 1   Product           16587 non-null  object
 2   Quantity_Ordered  16587 non-null  object
 3   Price_Each        16587 non-null  object
 4   Order_Date        16587 non-null  object
 5   Purchase_Address  16587 non-null  object
dtypes: object(6)
memory usage: 779.9+ KB


In [None]:
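# Convert Order_ID, Quantity_Ordered and Price_Each to numeric; errors='coerce' turns
# any non-numeric strings into NaN instead of raising, as astype(float) would
numeric_columns = ['Order_ID', 'Quantity_Ordered', 'Price_Each']
sales_Apr[numeric_columns] = sales_Apr[numeric_columns].apply(pd.to_numeric, errors='coerce')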
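
Since the same type conversion applies to every month, it may be simpler to wrap the cleaning steps in one function and then combine the months into a single frame. The following is a sketch under stated assumptions rather than the project's final pipeline: the helper name `clean_month`, the sample rows, and the decision to drop rows that fail conversion are all illustrative choices.

```python
import pandas as pd

RENAME_MAP = {
    "Order ID": "Order_ID",
    "Quantity Ordered": "Quantity_Ordered",
    "Price Each": "Price_Each",
    "Order Date": "Order_Date",
    "Purchase Address": "Purchase_Address",
}
NUMERIC_COLUMNS = ["Order_ID", "Quantity_Ordered", "Price_Each"]


def clean_month(frame: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names, convert types, and drop rows that fail conversion."""
    cleaned = frame.rename(columns=RENAME_MAP).copy()
    cleaned[NUMERIC_COLUMNS] = cleaned[NUMERIC_COLUMNS].apply(pd.to_numeric, errors="coerce")
    cleaned["Order_Date"] = pd.to_datetime(cleaned["Order_Date"], errors="coerce")
    # Assumption: rows without a valid ID, quantity, price, or date are not usable orders.
    return cleaned.dropna(subset=NUMERIC_COLUMNS + ["Order_Date"])


# Hypothetical frames: one with spaced headers and a junk row, one already in the database schema.
first_half = pd.DataFrame({
    "Order ID": ["209921", "Order ID", None],
    "Product": ["USB-C Charging Cable", "Product", None],
    "Quantity Ordered": ["1", "Quantity Ordered", None],
    "Price Each": ["11.95", "Price Each", None],
    "Order Date": ["2019-06-23 19:34:00", "Order Date", None],
    "Purchase Address": ["950 Walnut St, Portland, ME 04101", "Purchase Address", None],
})
second_half = pd.DataFrame({
    "Order_ID": [222910.0],
    "Product": ["27in FHD Monitor"],
    "Quantity_Ordered": [1.0],
    "Price_Each": [149.99],
    "Order_Date": ["2019-07-26 16:51:00"],
    "Purchase_Address": ["560 10th St, Seattle, WA 98101"],
})

sales_2019 = pd.concat(
    [clean_month(first_half), clean_month(second_half)], ignore_index=True
)
print(sales_2019.dtypes)
```

Before dropping rows in the real data, it would be prudent to count how many are affected per month, so that the removal can be reported to the client.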