## **Univariate Statistics Python Project**
**Supervisor: Zion Pibowei**

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Supermarket sales dataset
<p>The dataset contains the historical sales of a supermarket company, recorded at three different branches over a three-month period. It is well suited to practising predictive data analytics methods.</p>

<strong>Attribute information</strong>
1. Invoice id: Computer-generated sales slip invoice identification number
2. Branch: Branch of supercenter (3 branches are available, identified by A, B and C)
3. City: Location of supercenters
4. Customer type: Type of customer, recorded as <strong>Member</strong> for customers using a member card and <strong>Normal</strong> for customers without one
5. Gender: Gender of the customer
6. Product line: General item categorization groups
7. Unit price: Price of each product in dollars
8. Quantity: Number of products purchased by the customer
9. Tax: 5% tax charged on the customer's purchase
10. Total: Total price including tax
11. Date: Date of purchase (records available from January 2019 to March 2019)
12. Time: Purchase time (10am to 9pm)
13. Payment: Payment method used by the customer (3 methods are available: Cash, Credit card and Ewallet)
14. COGS: Cost of goods sold
15. Gross margin percentage: Gross margin percentage
16. Gross income: Gross income
17. Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)
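
The monetary attributes above are arithmetically linked. The sketch below checks those links on the first record of the dataset (invoice 750-67-8428). The relationships themselves (cogs = unit price × quantity, tax = 5% of cogs, total = cogs + tax) are assumptions that happen to be consistent with that row; they are not taken from the dataset's documentation.

```python
import math

# Values copied from the first record of the dataset (invoice 750-67-8428)
unit_price, quantity = 74.69, 7
cogs, tax, total, gross_margin_pct = 522.83, 26.1415, 548.9715, 4.761905

# Assumed relationships between the columns (verified on this row only)
cogs_check = unit_price * quantity   # cost of goods sold
tax_check = 0.05 * cogs              # 5% tax
total_check = cogs + tax             # total including tax
margin_check = 100 * tax / total     # gross income equals the tax amount in this row

checks = {
    "cogs": math.isclose(cogs_check, cogs, rel_tol=1e-9),
    "tax": math.isclose(tax_check, tax, rel_tol=1e-9),
    "total": math.isclose(total_check, total, rel_tol=1e-9),
    "gross margin %": math.isclose(margin_check, gross_margin_pct, abs_tol=1e-5),
}
print(checks)
```

If these relationships hold for every row, the gross margin percentage is the constant 100 × 0.05 / 1.05, which would explain why that column shows no spread in the summary statistics.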

In [39]:
sales = pd.read_csv("../data/Univariate_Statistics/supermarket_sales.csv")
sales.head()

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33,Ewallet,465.76,4.761905,23.288,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3


In [3]:
sales.shape

(1000, 17)

### Objectives

<p><strong>0. About the Data </strong> <br>
    Create a cell at the beginning of your notebook and use markdown to give a brief description of the dataset and the agenda of your project. You can find more details about the data on Kaggle: <a href="https://www.kaggle.com/aungpyaeap/supermarket-sales">https://www.kaggle.com/aungpyaeap/supermarket-sales</a></p>

<p><strong>1. Describe each of the numerical variables using summary measures </strong> <br>
    (a) Summary measures to be computed: mean, median, variance, standard deviation, skewness, kurtosis <br>
    (b) Use Sympy to display the formula for each of the summary measures above <br>
    (c) Write custom functions to compute each of these summary measures, and apply the functions on each of the numerical variables</p>
<p><strong> 2. Describe each of the numerical variables using graphical methods </strong><br>
    (a) Use histograms, strip plots and box plots to plot the distribution of each of the variables <br>
    (b) Carry out your visualisations using pandas, matplotlib and seaborn for each of the variables.</p>
<p><strong>3. Describe each of the categorical variables using numerical methods </strong><br>
    (a) Obtain the unique categories and the frequency of occurrence of each category for each of the categorical variables</p>
<p><strong> 4. Describe each of the categorical variables using graphical methods </strong><br>
    (a) Use pie charts and bar charts to plot the distribution of each of the categorical variables <br>
    (b) Carry out your visualisations using pandas, matplotlib and seaborn.</p><br>
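
As a pointer for objective 3(a), the sketch below shows one possible way to list the unique categories and their frequencies with pandas. The small `sample` frame is hypothetical and only mirrors the `Branch` and `Payment` values of the first five rows; on the real data the same calls would be applied to `sales`.

```python
import pandas as pd

# Hypothetical five-row frame mirroring the first rows of the dataset
sample = pd.DataFrame({
    "Branch": ["A", "C", "A", "A", "A"],
    "Payment": ["Ewallet", "Cash", "Credit card", "Ewallet", "Ewallet"],
})

for column in sample.columns:
    categories = sample[column].unique()         # distinct labels, in order of appearance
    frequencies = sample[column].value_counts()  # count per label, most frequent first
    print(column, list(categories))
    print(frequencies.to_string())
```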

### 1. Describe each of the numerical variables using summary measures

In [5]:
sales.describe()

Unnamed: 0,Unit price,Quantity,Tax 5%,Total,cogs,gross margin percentage,gross income,Rating
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,55.67213,5.51,15.379369,322.966749,307.58738,4.761905,15.379369,6.9727
std,26.494628,2.923431,11.708825,245.885335,234.17651,6.131498e-14,11.708825,1.71858
min,10.08,1.0,0.5085,10.6785,10.17,4.761905,0.5085,4.0
25%,32.875,3.0,5.924875,124.422375,118.4975,4.761905,5.924875,5.5
50%,55.23,5.0,12.088,253.848,241.76,4.761905,12.088,7.0
75%,77.935,8.0,22.44525,471.35025,448.905,4.761905,22.44525,8.5
max,99.96,10.0,49.65,1042.65,993.0,4.761905,49.65,10.0


In [9]:
sales.dtypes

Invoice ID                  object
Branch                      object
City                        object
Customer type               object
Gender                      object
Product line                object
Unit price                 float64
Quantity                     int64
Tax 5%                     float64
Total                      float64
Date                        object
Time                        object
Payment                     object
cogs                       float64
gross margin percentage    float64
gross income               float64
Rating                     float64
dtype: object
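
The dtypes listed above suggest a simple way to separate numerical from categorical variables. The sketch below uses `select_dtypes` on a small hypothetical frame (`frame`, not the full dataset) to build the two column lists.

```python
import pandas as pd

# Hypothetical three-row frame with one text column and two numeric columns
frame = pd.DataFrame({
    "Branch": ["A", "C", "A"],
    "Quantity": [7, 5, 7],
    "Unit price": [74.69, 15.28, 46.33],
})

# Numeric columns (int64, float64) versus everything else
numeric_cols = frame.select_dtypes(include="number").columns.tolist()
categorical_cols = frame.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the real data, `Date`, `Time` and `Invoice ID` are also stored as `object`, so they would land in the non-numeric list and may need to be dropped before treating the rest as categorical variables.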

In [11]:
sales["Quantity"].mean()

5.51

In [26]:
import statistics

In [27]:
sales["Quantity"].median()

5.0

In [28]:
sales["Total"].median()

253.848

In [12]:
import sympy as sym
from sympy.abc import i, k, m, n, x

In [13]:
xi, xm = sym.symbols("x_i x_m")

In [14]:
xi

x_i

In [15]:
xm

x_m

### a. Mean

In [16]:
sym.Sum(xi, (i, 1, n)) / n

Sum(x_i, (i, 1, n))/n

In [17]:
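def mean(data):
    # Arithmetic mean: sum of the observations divided by their count
    input_sum = data.sum()
    count = data.count()  # count() skips missing values in a pandas Series
    return input_sum / count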

In [31]:
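# numeric_only=True restricts the mean to numeric columns (required in recent pandas versions)
sales.mean(numeric_only=True)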

Unit price                  55.672130
Quantity                     5.510000
Tax 5%                      15.379369
Total                      322.966749
cogs                       307.587380
gross margin percentage      4.761905
gross income                15.379369
Rating                       6.972700
dtype: float64

In [18]:
mean(sales["Unit price"])

55.67213

In [19]:
mean(sales["Total"])

322.966749

In [20]:
mean(sales["Tax 5%"])

15.379368999999999

In [21]:
mean(sales["Quantity"])

5.51

In [22]:
mean(sales["cogs"])

307.58738

In [23]:
mean(sales["gross margin percentage"])

4.761904762

In [24]:
mean(sales["gross income"])

15.379368999999999

In [25]:
mean(sales["Rating"])

6.9727
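
Rather than calling the custom function once per column, it can be applied to all numerical columns in one step. The sketch below assumes a function with the same logic as the custom mean (named `column_mean` here to keep the example self-contained) and a small hypothetical frame whose values mirror the first five rows.

```python
import pandas as pd

def column_mean(series):
    # Sum of the observations divided by the number of non-missing observations
    return series.sum() / series.count()

# Hypothetical frame mirroring the first five rows of the dataset
frame = pd.DataFrame({
    "Branch": ["A", "C", "A", "A", "A"],
    "Quantity": [7, 5, 7, 8, 7],
    "Rating": [9.1, 9.6, 7.4, 8.4, 5.3],
})

# apply() calls the function once per column and collects the results in a Series
means = frame.select_dtypes(include="number").apply(column_mean)
print(means)
```

The same pattern should work for any of the other summary functions, as long as each takes a single column and returns a single number.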

### b. Median

In [81]:
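from IPython.display import Math, display
# Odd n: the median is the middle order statistic x_((n+1)/2), counting positions from 1
median_O = (n + 1) / 2
display(Math(r'\mathrm{median}_O = x_{\left(' + sym.latex(median_O) + r'\right)}'))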

<IPython.core.display.Math object>

In [84]:
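# Even n: the median is the mean of the two middle order statistics x_(n/2) and x_(n/2 + 1)
median_E_lo, median_E_hi = n / 2, n / 2 + 1
display(Math(r'\mathrm{median}_E = \frac{x_{\left(' + sym.latex(median_E_lo)
             + r'\right)} + x_{\left(' + sym.latex(median_E_hi) + r'\right)}}{2}'))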

<IPython.core.display.Math object>

In [46]:
def median(data):
    # Sort a copy so the function also works on unsorted input
    data = sorted(data)
    n = len(data)

    # Sample with an even number of observations: average the two middle values
    if n % 2 == 0:
        upper = data[n // 2]
        lower = data[n // 2 - 1]
        return (lower + upper) / 2

    # Sample with an odd number of observations: take the middle value
    else:
        return data[n // 2]

In [29]:
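# numeric_only=True restricts the median to numeric columns (required in recent pandas versions)
sales.median(numeric_only=True)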

Unit price                  55.230000
Quantity                     5.000000
Tax 5%                      12.088000
Total                      253.848000
cogs                       241.760000
gross margin percentage      4.761905
gross income                12.088000
Rating                       7.000000
dtype: float64

In [48]:
median(sorted(sales["Quantity"]))

5.0

In [49]:
median(sorted(sales["Unit price"]))

55.230000000000004

In [50]:
median(sorted(sales["Tax 5%"]))

12.088000000000001

In [51]:
median(sorted(sales["Total"]))

253.848

In [52]:
median(sorted(sales["cogs"]))

241.76

In [53]:
median(sorted(sales["gross margin percentage"]))

4.761904762

In [54]:
median(sorted(sales["gross income"]))

12.088000000000001

In [55]:
median(sorted(sales["Rating"]))

7.0
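
A custom median is easy to get subtly wrong, so it helps to cross-check it against the standard library. The sketch below uses a hypothetical helper, `median_sorted_copy`, that sorts its input itself, and compares it with `statistics.median` on one odd-sized and one even-sized sample taken from the first rows of the dataset.

```python
import statistics

def median_sorted_copy(values):
    # Work on a sorted copy so the caller's data is left untouched
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # Even count: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    # Odd count: the middle value
    return ordered[mid]

odd_sample = [7, 5, 7, 8, 7]                 # Quantity, first five rows
even_sample = [74.69, 15.28, 46.33, 58.22]   # Unit price, first four rows

for sample in (odd_sample, even_sample):
    print(median_sorted_copy(sample), statistics.median(sample))
```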

In [None]:
### c. Variance
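
A minimal sketch of a custom variance function, offered as one possible starting point for this section. It assumes the sample variance (sum of squared deviations from the mean divided by n - 1), which matches the pandas default of `ddof=1`; the `ddof` parameter and the example values are illustrative choices.

```python
import math
import statistics

def variance(values, ddof=1):
    # ddof=1 gives the sample variance, ddof=0 the population variance
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - ddof)

example = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from the dataset
print(variance(example))            # sample variance
print(variance(example, ddof=0))    # population variance
print(math.isclose(variance(example), statistics.variance(example)))
```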

In [None]:
### d. Standard deviation
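
The standard deviation is the square root of the variance. The sketch below is one possible implementation, again assuming the sample version (`ddof=1`) so that results are comparable with the `std` row of `sales.describe()`; the function name and example values are illustrative.

```python
import math
import statistics

def standard_deviation(values, ddof=1):
    # Square root of the variance; ddof=1 gives the sample standard deviation
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - ddof)
    return math.sqrt(variance)

example = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from the dataset
print(standard_deviation(example))
print(standard_deviation(example, ddof=0))
```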

In [None]:
### e. Skewness
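
A possible sketch for skewness, assuming the moment-based definition: the third central moment divided by the second central moment raised to the power 3/2. This is the uncorrected estimator; pandas' `Series.skew()` applies a bias correction, so its values can differ slightly from this function on the same data.

```python
import math

def skewness(values):
    # Moment-based (uncorrected) skewness: m3 / m2**1.5
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        # A constant sample has no spread, so skewness is undefined
        return float("nan")
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # illustrative values, not from the dataset
right_skewed = [1, 2, 3, 10]
print(skewness(symmetric), skewness(right_skewed))
```

The guard for zero spread is relevant to this dataset because the gross margin percentage column is constant.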

In [None]:
### f. kurtosis
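
A possible sketch for kurtosis, assuming the moment-based excess kurtosis: the fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores 0. As with skewness, this is the uncorrected estimator; pandas' `Series.kurt()` also reports excess kurtosis but applies a bias correction, so values can differ slightly.

```python
import math

def excess_kurtosis(values):
    # Moment-based (uncorrected) excess kurtosis: m4 / m2**2 - 3
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        # A constant sample has no spread, so kurtosis is undefined
        return float("nan")
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]  # illustrative values, not from the dataset
print(excess_kurtosis(flat))
```

A negative result, as for the evenly spaced values above, indicates lighter tails than a normal distribution.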