# Programming for Data Analytics in the Higher Diploma in Science in Data Analytics.

## Project_Superstore

**by Grainne Boyle**

This notebook contains a project that demonstrates what I have learned in this module.

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import datetime
# Suppress warnings (for example, about figure layout changes) using the warnings module.
import warnings
warnings.filterwarnings('ignore')


In [2]:
# Reading the file in raised an error, so I used chardet, a library that can detect file encodings.
# The file looked okay, but it probably contained characters stored in one encoding while pandas was trying to read it using a different one.
import chardet  # Library that guesses a file's encoding from its raw bytes

# Open the file in binary mode ('rb') to read the raw bytes
with open('sample_superstore.csv', 'rb') as file:
    raw_data = file.read()  # Reads the entire file as raw bytes
    result = chardet.detect(raw_data)  # Analyzes the bytes to guess the encoding
    encoding = result['encoding']  # Gets the detected encoding type
    
# Uses the detected encoding to read the CSV file correctly
storedf = pd.read_csv('sample_superstore.csv', encoding=encoding)
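
If chardet is unavailable or returns no guess, one possible fallback is to try a short list of likely encodings in order. This is a minimal sketch and not what the notebook itself does: the helper name `guess_encoding` and the candidate list are assumptions for illustration, and it uses only the standard library.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper for illustration; not part of chardet or pandas.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding; try the next
        return enc
    return None  # none of the candidates could decode the bytes


# Small made-up byte strings standing in for the contents of a file
ascii_guess = guess_encoding(b"Order ID,Sales")
accent_guess = guess_encoding(b"caf\xe9")  # 0xE9 is an accented e in cp1252, invalid here as UTF-8
print(ascii_guess, accent_guess)
```

The result could then be passed to `pd.read_csv(..., encoding=...)`. Note that `latin-1` accepts any byte sequence, so it never fails but can silently produce the wrong characters, which is why it is placed last.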

In [3]:
print(storedf.head())

   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
1       2  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
2       3  CA-2016-138688   6/12/2016   6/16/2016    Second Class    DV-13045   
3       4  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   
4       5  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   

     Customer Name    Segment        Country             City  ...  \
0      Claire Gute   Consumer  United States        Henderson  ...   
1      Claire Gute   Consumer  United States        Henderson  ...   
2  Darrin Van Huff  Corporate  United States      Los Angeles  ...   
3   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   
4   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   

  Postal Code  Region       Product ID         Category Sub-Category  \
0       42420   Sout

In [4]:
# Change the columns to datetime
storedf['Order Date'] = pd.to_datetime(storedf['Order Date'], format='%m/%d/%Y')
storedf['Ship Date'] = pd.to_datetime(storedf['Ship Date'], format='%m/%d/%Y') 
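
The format string `'%m/%d/%Y'` tells pandas to read each value as month/day/year, so `6/12/2016` is 12 June and not 6 December. The sketch below shows this on a small sample, and adds `errors="coerce"` (an assumption, not used in the notebook) so that a value that does not fit the format becomes `NaT` instead of raising an error. The third date is made up to show the failure case.

```python
import pandas as pd

# Two dates taken from the data and one made-up value that does not fit month/day/year
raw_dates = pd.Series(["11/8/2016", "6/12/2016", "31/12/2016"])

# errors="coerce" turns unparseable values into NaT rather than raising
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")
print(parsed)

# Count how many values could not be parsed
n_bad = int(parsed.isna().sum())
print("Unparseable dates:", n_bad)
```

Counting the `NaT` values after conversion is a quick way to check whether every date in a column really follows the expected format.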

In [5]:
# After viewing the data in Excel, I decided to remove the following columns because they had either too much detail to analyse (e.g. Customer ID) or too little variation (e.g. Country is only United States).
storedf = storedf.drop(columns=['Row ID', 'Order ID', 'Customer ID', 'Customer Name', 'Country', 'Postal Code', 'Product ID', 'Product Name'])
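
The same check can be done in pandas rather than by eye in Excel. As a rough sketch, `nunique()` counts the distinct values in each column, which makes constant columns such as Country easy to spot. The small frame below is built from the five rows shown in the output above; the name `sample` is only for illustration.

```python
import pandas as pd

# The first five rows of three columns, copied from the printed output
sample = pd.DataFrame({
    "Customer ID": ["CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-20335"],
    "Segment": ["Consumer", "Consumer", "Corporate", "Consumer", "Consumer"],
    "Country": ["United States"] * 5,
})

# Number of distinct values per column
counts = sample.nunique()
print(counts)

# Columns with a single distinct value carry no information for comparisons
constant_cols = counts[counts == 1].index.tolist()
print("Constant columns:", constant_cols)
```

On the full dataset the counts will differ, but a column whose count is 1 is a candidate for dropping, and one whose count is close to the number of rows is likely an identifier.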

In [6]:
# I added some columns to broaden my analysis:

# Add a column for the order processing time: the number of days between the order date and the ship date.
storedf.insert(loc=2, column='Order Processing Days', value=(storedf['Ship Date'] - storedf['Order Date']).dt.days)

# Add a column showing the order month only, so I can see whether sales are higher or lower in certain months.

storedf.insert(loc=3, column='Month', value=storedf['Order Date'].dt.month)

# Add a column showing profit as a percentage of sales (the profit margin).

storedf['Profit Margin (%)'] = (storedf['Profit'] / storedf['Sales']) * 100

# Note: I am assuming the Sales figure is after the discount has been applied.
# Add a column showing the sales price per unit (again assuming it is after the discount).

storedf.insert(loc=13, column='SP per unit', value= (storedf['Sales'] / storedf['Quantity']))
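
As a worked check of the derived columns, the calculations can be repeated by hand for the first row of the data (order date 11/8/2016, ship date 11/11/2016, Sales 261.96, Quantity 2, Profit 41.9136). This sketch uses only the standard library; the variable names are for illustration.

```python
from datetime import date

# Values copied from the first row of the dataset as printed in this notebook
order_date = date(2016, 11, 8)
ship_date = date(2016, 11, 11)
sales, quantity, profit = 261.96, 2, 41.9136

processing_days = (ship_date - order_date).days  # days between order and shipment
month = order_date.month                         # month number of the order
profit_margin_pct = profit / sales * 100         # profit as a percentage of sales
sp_per_unit = sales / quantity                   # sales price per unit

print(processing_days, month, round(profit_margin_pct, 2), sp_per_unit)
```

The processing days, month, and price per unit agree with the first row of the printed DataFrame. One thing to keep in mind is that the margin calculation divides by Sales, so a row with zero sales would not give a usable percentage.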

# Define a function to build the final column. It classifies the sales price per unit as low value (below 100), high value (above 1000) or medium value (everything else).
def sales_cat(value):
    if value < 100:
        return "Low value"
    elif value > 1000:
        return "High value"
    else:
        return "Medium value"
    
storedf['Sales Category'] = storedf['SP per unit'].apply(sales_cat)
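
To make the thresholds concrete, the sketch below repeats the function and applies it to a few prices around the boundaries. The test prices are made up, apart from 130.98, which is the price per unit of the first row.

```python
def sales_cat(value):
    # Same thresholds as the function in the notebook
    if value < 100:
        return "Low value"
    elif value > 1000:
        return "High value"
    else:
        return "Medium value"


# Prices just below, on, and just above each threshold
boundary_checks = {
    price: sales_cat(price) for price in (99.99, 100, 130.98, 1000, 1000.01)
}
for price, label in boundary_checks.items():
    print(price, "->", label)
```

Prices of exactly 100 and exactly 1000 both fall into the medium category, because the comparisons are strict. A missing value (NaN) would also be labelled "Medium value", since both comparisons evaluate to False, so it may be worth checking for missing prices before applying the function.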





In [7]:
print(storedf.head())

  Order Date  Ship Date  Order Processing Days  Month       Ship Mode  \
0 2016-11-08 2016-11-11                      3     11    Second Class   
1 2016-11-08 2016-11-11                      3     11    Second Class   
2 2016-06-12 2016-06-16                      4      6    Second Class   
3 2015-10-11 2015-10-18                      7     10  Standard Class   
4 2015-10-11 2015-10-18                      7     10  Standard Class   

     Segment             City       State Region         Category  \
0   Consumer        Henderson    Kentucky  South        Furniture   
1   Consumer        Henderson    Kentucky  South        Furniture   
2  Corporate      Los Angeles  California   West  Office Supplies   
3   Consumer  Fort Lauderdale     Florida  South        Furniture   
4   Consumer  Fort Lauderdale     Florida  South  Office Supplies   

  Sub-Category     Sales  Quantity  SP per unit  Discount    Profit  \
0    Bookcases  261.9600         2     130.9800      0.00   41.9136   
1   

In [8]:
## Research

#[Datetime](https://www.statology.org/convert-columns-to-datetime-pandas/) - How to convert columns to datetime so the dates can be used for analysis.  
#[Chardet](https://stackoverflow.com/questions/54389780/using-chardet-to-detect-encoding) - Detecting a file's encoding. The file looked okay, but it probably contained characters stored in one encoding while pandas was trying to read it using a different one.  
#[Adding Columns](https://realpython.com/pandas-dataframe/#inserting-and-deleting-columns) - Inserting columns into a DataFrame. I used this tutorial to add some columns relevant to my analysis.  
#[Datetime](https://stackoverflow.com/questions/69375868/extract-month-from-datetime-column-in-pandas-dataframe) - Extracting the month from the order date.  
#[Adding Columns](https://stackoverflow.com/questions/59642338/creating-new-column-based-on-condition-on-other-column-in-pandas-dataframe) - Adding a column based on data in another column, in this case categorising sales by value using the selling price per unit column.  