# Pricing

A game company gave its users gift coins for purchasing in-game items. Users spend these virtual coins on various 
items for their characters. The company did not set a price for an item; instead, it let users buy the item at whatever 
price they chose. For example, one user might pay 30 units of the virtual currency for the item named shield, while 
another user pays 45 units. Each user therefore buys the item for the amount they are willing to pay.

## Data Preparation

### Import Libraries

In [1]:
import pandas as pd
import itertools
import statsmodels.stats.api as sms
from scipy.stats import shapiro
import scipy.stats as stats
import warnings
warnings.filterwarnings("ignore")

### Load Data

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/SatyamSarmah/Dynamic_Pricing-Feynn-Labs/main/pricing.csv", sep=";")
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,category_id,price
0,489756,32.117753
1,361254,30.71137
2,361254,31.572607
3,489756,34.54384
4,489756,47.205824


### Data Analysis

In [4]:
def data_analysis(dataframe):
    print(f'Data: {dataframe.head(5)} \n')
    print(f'Shape:{dataframe.shape} \n')
    print(f'Number of unique categories: {dataframe["category_id"].nunique()} \n')
    print(f'Name of categories: {dataframe.category_id.unique()} \n')
    print(f'Null value: {dataframe.isnull().sum()} \n')
    print(dataframe.describe([0.01, 0.05, 0.50, 0.95, 0.99]).T)
    
    
data_analysis(df)

Data:    category_id      price
0       489756  32.117753
1       361254  30.711370
2       361254  31.572607
3       489756  34.543840
4       489756  47.205824 

Shape:(3448, 2) 

Number of unique categories: 6 

Name of categories: [489756 361254 874521 326584 675201 201436] 

Null value: category_id    0
price          0
dtype: int64 

              count           mean            std       min        1%  \
category_id  3448.0  542415.171984  192805.689911  201436.0  201436.0   
price        3448.0    3254.475770   25235.799009      10.0      30.0   

                   5%            50%            95%            99%  \
category_id  326584.0  489756.000000  874521.000000  874521.000000   
price            30.0      34.798544      92.978218  201436.464204   

                       max  
category_id  874521.000000  
price        201436.991255  


The 95th and 99th percentiles of price differ sharply (about 93 versus about 201,436), which points to extreme outliers in the upper tail.

### Outlier Values

In [5]:
def outlier_thresholds(dataframe, variable):
    # Uses the 5th and 95th percentiles (not the 25th/75th quartiles),
    # so only extreme values fall outside the limits
    lower_quantile = dataframe[variable].quantile(0.05)
    upper_quantile = dataframe[variable].quantile(0.95)
    quantile_range = upper_quantile - lower_quantile
    up_limit = upper_quantile + 1.5 * quantile_range
    low_limit = lower_quantile - 1.5 * quantile_range
    return low_limit, up_limit

low, up = outlier_thresholds(df, variable="price")
print(f'Low Limit: {low}  Up Limit: {up}')

Low Limit: -64.46732638496242  Up Limit: 187.44554397493738


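These are the lower and upper outlier thresholds for the price variable. They are computed from the 5th and 95th percentiles, so only extreme values are flagged.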
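
To see how such limits behave, here is a minimal sketch on made-up prices. The `toy` series is an assumption for illustration and is not taken from the dataset; the rule is the same percentile-based one used for the price variable.

```python
import pandas as pd

# Hypothetical prices: 19 ordinary values plus one extreme value
toy = pd.Series([30.0 + i for i in range(19)] + [5000.0])

q_low = toy.quantile(0.05)    # 5th percentile
q_high = toy.quantile(0.95)   # 95th percentile
spread = q_high - q_low

low_limit = q_low - 1.5 * spread
up_limit = q_high + 1.5 * spread

# Keep only the values that fall inside [low_limit, up_limit]
kept = toy[(toy >= low_limit) & (toy <= up_limit)]
print(f'Low Limit: {low_limit}  Up Limit: {up_limit}  Kept: {len(kept)}')
```

The lower limit can be negative, as it is for the real data; for a price variable that simply means no value is flagged on the low side.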

In [6]:
def has_outliers(dataframe, numeric_columns):
    for col in numeric_columns:
        low_limit, up_limit = outlier_thresholds(dataframe, col)
        if dataframe[(dataframe[col] > up_limit) | (dataframe[col] < low_limit)].any(axis=None):
            number_of_outliers = dataframe[(dataframe[col] > up_limit) | (dataframe[col] < low_limit)].shape[0]
            print(col, ":", number_of_outliers, "outliers")

has_outliers(df, ['price'])

price : 77 outliers


In [7]:
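def remove_outliers(dataframe, numeric_columns):
    # Filter cumulatively so that every listed column is cleaned, not only the last one
    dataframe_without_outliers = dataframe
    for variable in numeric_columns:
        low_limit, up_limit = outlier_thresholds(dataframe, variable)
        dataframe_without_outliers = dataframe_without_outliers[
            ~((dataframe_without_outliers[variable] < low_limit) | (dataframe_without_outliers[variable] > up_limit))]
    return dataframe_without_outliers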

df = remove_outliers(df, ["price"])
data_analysis(df)

Data:    category_id      price
0       489756  32.117753
1       361254  30.711370
2       361254  31.572607
3       489756  34.543840
4       489756  47.205824 

Shape:(3371, 2) 

Number of unique categories: 6 

Name of categories: [489756 361254 874521 326584 675201 201436] 

Null value: category_id    0
price          0
dtype: int64 

              count           mean            std       min        1%  \
category_id  3371.0  541235.913082  192847.981205  201436.0  201436.0   
price        3371.0      40.398652      18.205540      10.0      30.0   

                   5%           50%            95%            99%  \
category_id  326584.0  489756.00000  874521.000000  874521.000000   
price            30.0      34.74272      73.680507     126.786865   

                       max  
category_id  874521.000000  
price           187.445135  


In [8]:
# Mean prices per category invite comparisons between groups, but any difference must be confirmed statistically
df.groupby("category_id").agg({"price": "mean"}).reset_index()

Unnamed: 0,category_id,price
0,201436,36.175498
1,326584,35.69317
2,361254,35.477261
3,489756,43.603983
4,675201,37.443592
5,874521,39.273175


The category means differ on inspection, but these observed differences are not yet statistically supported. 
We therefore test the categories in pairs to obtain statistical results.

## Testing

### 1. Checking Assumptions

- 1.1 Normal Distribution
- 1.2 Homogeneity of Variance

1.1 Normal Distribution

H0: The sample distribution does not differ significantly from a theoretical normal distribution.

H1: The sample distribution differs significantly from a theoretical normal distribution.

In [9]:
print("Shapiro Wilks Test Result \n")
for x in df["category_id"].unique():
    test_statistic, pvalue = shapiro(df.loc[df["category_id"] == x, "price"])
    if (pvalue<0.05):
        print(f'{x}:')
        print('Test statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue), "H0 is rejected")
    else:
        print(f'{x}:')
        print('Test statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue), "H0 is not rejected")

Shapiro Wilks Test Result 

489756:
Test statistic = 0.6328, p-value = 0.0000 H0 is rejected
361254:
Test statistic = 0.4757, p-value = 0.0000 H0 is rejected
874521:
Test statistic = 0.5116, p-value = 0.0000 H0 is rejected
326584:
Test statistic = 0.5026, p-value = 0.0000 H0 is rejected
675201:
Test statistic = 0.6382, p-value = 0.0000 H0 is rejected
201436:
Test statistic = 0.6190, p-value = 0.0000 H0 is rejected


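H0 is rejected for every category, so the normality assumption does not hold. The homogeneity of variance check (1.2) is therefore skipped and a non-parametric method, the Mann-Whitney U test, is applied.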
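
The homogeneity of variance check is not run in this notebook. If it were needed, for example when normality holds and a parametric test is planned, it might look like the sketch below. It uses `scipy.stats.levene` on two made-up samples; `narrow` and `wide` are hypothetical names and values, not data from the notebook.

```python
from scipy.stats import levene

# Hypothetical samples with very different spread
narrow = [30 + 0.1 * (i % 5) for i in range(40)]
wide = [30 + 10 * (i % 5) for i in range(40)]

# H0: the groups have equal variances
test_statistic, pvalue = levene(narrow, wide)
decision = "H0 is rejected" if pvalue < 0.05 else "H0 is not rejected"
print('Test statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue), decision)
```

With the real data, one price series per category could be passed instead, since `levene` accepts any number of samples.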

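### 2. Implementing Hypothesis

For each pair of categories:

H0: There is no statistically significant difference between the prices of the two categories.

H1: There is a statistically significant difference between the prices of the two categories.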

In [10]:
# All unordered pairs of categories
groups = list(itertools.combinations(df["category_id"].unique(), 2))

result = []
print("Mann-Whitney U Test Result ")
for x in groups:
    # Call the public scipy.stats function; the scipy.stats.stats namespace is deprecated
    test_statistic, pvalue = stats.mannwhitneyu(df.loc[df["category_id"] == x[0], "price"],
                                                df.loc[df["category_id"] == x[1], "price"])
    decision = "H0 is rejected" if pvalue < 0.05 else "H0 is not rejected"
    result.append((x[0], x[1], decision))
    print('\n', "{0} - {1} ".format(x[0], x[1]))
    print('Test statistic= %.4f, p-value= %.4f' % (test_statistic, pvalue), decision)

Mann-Whitney U Test Result 

 489756 - 361254 
Test statistic= 371652.5000, p-value= 0.0000 H0 is rejected

 489756 - 874521 
Test statistic= 482405.0000, p-value= 0.0000 H0 is rejected

 489756 - 326584 
Test statistic= 68317.0000, p-value= 0.0000 H0 is rejected

 489756 - 675201 
Test statistic= 83360.5000, p-value= 0.0000 H0 is rejected

 489756 - 201436 
Test statistic= 60158.0000, p-value= 0.0000 H0 is rejected

 361254 - 874521 
Test statistic= 214411.0000, p-value= 0.0909 H0 is not rejected

 361254 - 326584 
Test statistic= 32541.0000, p-value= 0.0000 H0 is rejected

 361254 - 675201 
Test statistic= 38936.0000, p-value= 0.3708 H0 is not rejected

 361254 - 201436 
Test statistic= 29521.0000, p-value= 0.4354 H0 is not rejected

 874521 - 326584 
Test statistic= 38009.0000, p-value= 0.0000 H0 is rejected

 874521 - 675201 
Test statistic= 46044.0000, p-value= 0.3623 H0 is not rejected

 874521 - 201436 
Test statistic= 34006.0000, p-value= 0.2772 H0 is not rejected

 326584 - 67

In [11]:
result_df = pd.DataFrame()
result_df["Category 1"] = [x[0] for x in result]
result_df["Category 2"] = [x[1] for x in result]
result_df["H0"] = [x[2] for x in result]
result_df

Unnamed: 0,Category 1,Category 2,H0
0,489756,361254,H0 is rejected
1,489756,874521,H0 is rejected
2,489756,326584,H0 is rejected
3,489756,675201,H0 is rejected
4,489756,201436,H0 is rejected
5,361254,874521,H0 is not rejected
6,361254,326584,H0 is rejected
7,361254,675201,H0 is not rejected
8,361254,201436,H0 is not rejected
9,874521,326584,H0 is rejected


## Problems

### Does the price of the item differ by category?

In [21]:
result_df[result_df["H0"] == "H0 is not rejected"]

Unnamed: 0,Category 1,Category 2,H0
5,361254,874521,H0 is not rejected
7,361254,675201,H0 is not rejected
8,361254,201436,H0 is not rejected
10,874521,675201,H0 is not rejected
11,874521,201436,H0 is not rejected
14,675201,201436,H0 is not rejected


For 6 of the 15 category pairs, there is no statistically significant difference in price.

In [20]:
result_df[result_df["H0"] == "H0 is rejected"]

Unnamed: 0,Category 1,Category 2,H0
0,489756,361254,H0 is rejected
1,489756,874521,H0 is rejected
2,489756,326584,H0 is rejected
3,489756,675201,H0 is rejected
4,489756,201436,H0 is rejected
6,361254,326584,H0 is rejected
9,874521,326584,H0 is rejected
12,326584,675201,H0 is rejected
13,326584,201436,H0 is rejected


For 9 of the 15 category pairs, there is a statistically significant difference in price. Every one of these pairs involves category 489756 or category 326584.

### What should the item cost?
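The six pairs for which H0 is not rejected are exactly the pairs formed by categories 361254, 874521, 675201 and 201436, so these 4 categories are treated as statistically equivalent. The price is set to the average of their mean prices.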
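
The choice of these 4 categories can be checked against the pairs listed above as "H0 is not rejected". The sketch below copies those pairs by hand (the `candidates` and `not_rejected_pairs` names are introduced here for illustration) and confirms that every pair drawn from the 4 categories is among them.

```python
import itertools

# Pairs reported above as "H0 is not rejected"
not_rejected_pairs = {
    (361254, 874521), (361254, 675201), (361254, 201436),
    (874521, 675201), (874521, 201436), (675201, 201436),
}
candidates = [361254, 874521, 675201, 201436]

# Every pair drawn from the candidates must be among the not-rejected pairs
all_pairs = set(itertools.combinations(candidates, 2))
print(all_pairs == not_rejected_pairs)
```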

In [13]:
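not_rejected = [361254, 874521, 675201, 201436]
# Unweighted average of the four category means (avoids shadowing the built-in sum)
category_means = [df.loc[df["category_id"] == i, "price"].mean() for i in not_rejected]
PRICE = sum(category_means) / len(category_means)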

print("Price : %.4f" % PRICE)

Price : 37.0924
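
This figure can be reproduced from the category means printed earlier. The sketch below copies those four means by hand. It is an unweighted average, so each category counts equally regardless of how many purchases it contains.

```python
# Category means copied from the groupby output above
category_means = {
    361254: 35.477261,
    874521: 39.273175,
    675201: 37.443592,
    201436: 36.175498,
}

price = sum(category_means.values()) / len(category_means)
print("Price : %.4f" % price)
```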


### Flexible Price Range

In [14]:
prices = []
for category in not_rejected:
    for i in df.loc[df["category_id"] == category, "price"]:
        prices.append(i)

print(f'Flexible Price Range: {sms.DescrStatsW(prices).tconfint_mean()}')

Flexible Price Range: (36.7109597897918, 38.17576299427283)


This range is the 95% confidence interval for the mean price, computed from the individual prices of the 4 categories selected for pricing.
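
For readers who want to see what goes into such an interval, here is a minimal sketch that builds a two-sided 95% t-interval for a mean by hand. The `sample` values are made up, and the calculation is the textbook formula, which should agree with `tconfint_mean()` up to rounding.

```python
import math
from scipy import stats

# Hypothetical sample of prices
sample = [30.0, 32.5, 35.0, 36.0, 38.5, 40.0, 41.5, 45.0]

n = len(sample)
mean = sum(sample) / n
# Sample standard deviation (n - 1 in the denominator) and standard error of the mean
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
se = sd / math.sqrt(n)

# Two-sided 95% interval: mean +/- t(0.975, n - 1) * se
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f'Confidence interval: ({lower}, {upper})')
```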

### Income Simulation

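We calculate the income that would be obtained at the lower bound of the confidence interval, at the price we set,
and at the upper bound of the confidence interval. In each case, users who paid at least that price are counted as buyers.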

In [15]:
# for minimum price in confidence interval
freq = len(df[df["price"] >= 36.7109597897918])
# number of sales equal to or greater than this price
income = freq * 36.7109597897918
print(f'Income: {income}')

Income: 38436.374899912014


In [16]:
# for the price we set
freq = len(df[df["price"] >= 37.0924])
# number of sales equal to or greater than this price
income = freq * 37.0924
print(f'Income: {income}')

Income: 37611.6936


In [17]:
# for maximum price in confidence interval
freq = len(df[df["price"] >= 38.17576299427283])
# number of sales equal to or greater than this price
income = freq * 38.17576299427283
print(f'Income: {income}')

Income: 35388.93229569092
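
The three calculations above follow the same pattern, so they could be wrapped in a small helper. The sketch below is one way to do that; `income_at` and the `paid` list are hypothetical names and values. It assumes that a user buys at a fixed price only if the amount they originally paid was at least that price.

```python
def income_at(price, paid_prices):
    """Income if every user who paid at least `price` buys at exactly `price`."""
    buyers = sum(1 for p in paid_prices if p >= price)
    return buyers * price

# Hypothetical amounts paid by six users
paid = [30.0, 32.0, 35.0, 38.0, 40.0, 45.0]

for candidate in (32.0, 36.0, 40.0):
    print(f'Price: {candidate}  Income: {income_at(candidate, paid)}')
```

With the real data, the same helper could be called as `income_at(PRICE, df["price"])`.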
