# Hypothesis testing
A case study of hypothesis testing, comparing mobile phone prices in two markets

## Contents
1. Importing the database using pandas
2. Cleaning up the database
3. Finding the mean and the standard deviation
4. Taking consistent currency units
5. Hypothesis and alternative hypothesis
6. Finding level of significance
7. Conclusion

## Importing the database
We use pandas to read two CSV files, `inr_pricing.csv` and `usd_pricing.csv`. These are two separate datasets acquired from [Kaggle](https://www.kaggle.com/) that we use for the hypothesis test.

In [1]:
import pandas as pd

df_in = pd.read_csv("datasets/inr_pricing.csv")
df_us = pd.read_csv("datasets/usd_pricing.csv")
df_in.head()

Unnamed: 0.1,Unnamed: 0,Name,Brand,Model,Battery capacity (mAh),Screen size (inches),Touchscreen,Resolution x,Resolution y,Processor,...,Rear camera,Front camera,Operating system,Wi-Fi,Bluetooth,GPS,Number of SIMs,3G,4G/ LTE,Price
0,0,OnePlus 7T Pro McLaren Edition,OnePlus,7T Pro McLaren Edition,4085,6.67,Yes,1440,3120,8,...,48.0,16.0,Android,Yes,Yes,Yes,2,Yes,Yes,58998
1,1,Realme X2 Pro,Realme,X2 Pro,4000,6.5,Yes,1080,2400,8,...,64.0,16.0,Android,Yes,Yes,Yes,2,Yes,Yes,27999
2,2,iPhone 11 Pro Max,Apple,iPhone 11 Pro Max,3969,6.5,Yes,1242,2688,6,...,12.0,12.0,iOS,Yes,Yes,Yes,2,Yes,Yes,106900
3,3,iPhone 11,Apple,iPhone 11,3110,6.1,Yes,828,1792,6,...,12.0,12.0,iOS,Yes,Yes,Yes,2,Yes,Yes,62900
4,4,LG G8X ThinQ,LG,G8X ThinQ,4000,6.4,Yes,1080,2340,8,...,12.0,32.0,Android,Yes,Yes,Yes,1,No,No,49990


In [2]:
df_us.head()

Unnamed: 0,Brand,Model,Storage,RAM,Screen Size (inches),Camera (MP),Battery Capacity (mAh),Price
0,Apple,iPhone 13 Pro,128 GB,6 GB,6.1,12 + 12 + 12,3095,999
1,Samsung,Galaxy S21 Ultra,256 GB,12 GB,6.8,108 + 10 + 10 + 12,5000,1199
2,OnePlus,9 Pro,128 GB,8 GB,6.7,48 + 50 + 8 + 2,4500,899
3,Xiaomi,Redmi Note 10 Pro,128 GB,6 GB,6.67,64 + 8 + 5 + 2,5020,279
4,Google,Pixel 6,128 GB,8 GB,6.4,50 + 12.2,4614,799


## Cleaning up the database
There are many columns in both datasets, but we are only comparing prices, so we extract the price column from each. We renamed the price column in the raw US data so that it is easier to understand. Notice that the analysis needs plain numbers; since some values in the US dataframe contain the **$** sign, we need to remove it.

In [3]:
# working with the price column of each dataset
price_in = df_in["Price"].to_frame()
price_us = df_us["Price"].to_frame()
# stripping the $ and , signs from the US prices
price_us = price_us["Price"].str.replace(r"[$,]", "", regex=True).to_frame()
price_in.head()

Unnamed: 0,Price
0,58998
1,27999
2,106900
3,62900
4,49990


In [4]:
price_us.head()

Unnamed: 0,Price
0,999
1,1199
2,899
3,279
4,799
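
As a small illustration of the cleaning step, the sketch below strips the `$` and `,` characters from a few hypothetical price strings (made up for this example, not taken from the dataset) and converts the result to numbers with `pd.to_numeric`, so that `.mean()` and `.std()` can be applied afterwards.

```python
import pandas as pd

# Hypothetical raw price strings (not taken from the dataset)
raw = pd.Series(["$999", "$1,199", "899"], name="Price")

# Remove "$" and "," with a character-class regex, then convert to numbers
clean = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True))

print(clean.tolist())
print(clean.mean())
```

Passing `regex=True` explicitly avoids depending on the default, which differs between pandas versions.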


## Finding the mean and standard deviation
Now that we have reduced both datasets to the prices for each market, we need the mean and standard deviation of each. pandas already provides functions for both. We will also count the number of entries in each dataset.

Without Python, working only from the mathematical definitions, the mean of grouped data is \
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$ \
For ungrouped data it is \
$\bar{x} = \frac{\sum x}{n}$

$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

$\bar{x}_1 = \frac{58998 + 27999 + \dots + 3999}{1359}$

$\bar{x}_2 = \frac{999 + 1199 + \dots + 649}{407}$

$\sigma_1 = \sqrt{\frac{(58998 - 11465.83)^2 + \dots}{1359}}$

$\sigma_2 = \sqrt{\frac{(999 - 413)^2 + \dots}{407}}$
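
To make the formulas concrete, here is a minimal sketch that computes the mean and standard deviation by hand for four hypothetical prices (chosen only to keep the arithmetic simple). It also shows the difference between dividing by n, as in the formula above, and dividing by n − 1, which is the default used by pandas' `Series.std()`.

```python
import math
import statistics

# Hypothetical prices, chosen only to keep the arithmetic easy to follow
prices = [200, 400, 600, 800]
n = len(prices)

# Mean: sum of the values divided by their count
mean = sum(prices) / n

# Population SD: divide the squared deviations by n, as in the formula above
sd_population = math.sqrt(sum((x - mean) ** 2 for x in prices) / n)

# Sample SD: divide by n - 1 instead (the default of pandas' Series.std)
sd_sample = math.sqrt(sum((x - mean) ** 2 for x in prices) / (n - 1))

# The standard library gives the same results
assert math.isclose(sd_population, statistics.pstdev(prices))
assert math.isclose(sd_sample, statistics.stdev(prices))

print(mean, round(sd_population, 2), round(sd_sample, 2))
```

With more than a thousand entries, as in the Indian dataset, the two versions of the standard deviation differ only slightly.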

In [5]:
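# finding the mean (Indian market, computed from the data)
mean_in = price_in["Price"].mean()
# mean_us = price_us["Price"].mean()

# finding the standard deviation (pandas uses the sample formula, ddof=1, by default)
sd_in = price_in["Price"].std()
# sd_us = price_us["Price"].std()

# US mean and SD are entered manually here instead of being computed from price_us
mean_us = 413
sd_us = 326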

# counting the number of entries
count_in = len(price_in["Price"])
count_us = len(price_us["Price"])

# printing the values
print(f"The mean of Indian market is ₹{round(mean_in, 2)} and the mean of US market is ${round(mean_us, 2)}.")
print(f"The SD of Indian market is ₹{round(sd_in, 2)} and the SD of US market is ${round(sd_us,2)}")
print(f"There are a total of {count_in} entries in the Indian dataset and {count_us} in the US dataset.")

The mean of Indian market is ₹11465.83 and the mean of US market is $413.
The SD of Indian market is ₹13857.5 and the SD of US market is $326
There are a total of 1359 entries in the Indian dataset and 407 in the US dataset.


From the calculation we get \
x̄1 = 11465.83 \
x̄2 = 413 \
σ1 = 13857.5 \
σ2 = 326 \
n1 = 1359 \
n2 = 407

## Taking consistent currency units
Since Indian prices are in INR and US prices are in USD, we need a consistent currency unit for the phone prices. As USD is widely used as a reference currency, we convert the INR values into USD. At the time of this case study the conversion rate was ₹1 = $0.012. We therefore multiply the mean and standard deviation of the Indian prices by this rate.

x̄1 = x̄1 * 0.012 \
σ1 = σ1 * 0.012

In [6]:
mean_in = mean_in * 0.012
sd_in = sd_in * 0.012
print("Mean India (USD):", mean_in)
print("SD India (USD):", sd_in)

Mean India (USD): 137.58990728476823
SD India (USD): 166.2899649198617
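
Scaling only the mean and standard deviation is valid because both scale linearly when every value is multiplied by the same positive constant. The sketch below checks this on the first five Indian prices shown earlier, using the rate of 0.012 from this case study; it is an illustration of the property, not a recomputation of the full dataset.

```python
import math
import statistics

RATE = 0.012  # USD per INR, the rate used in this case study

# First five Indian prices from the dataset preview, in INR
prices_inr = [58998, 27999, 106900, 62900, 49990]

# Option 1: convert every price, then summarise
prices_usd = [p * RATE for p in prices_inr]
mean_usd = statistics.mean(prices_usd)
sd_usd = statistics.stdev(prices_usd)

# Option 2: summarise in INR, then scale the summary statistics
assert math.isclose(mean_usd, statistics.mean(prices_inr) * RATE)
assert math.isclose(sd_usd, statistics.stdev(prices_inr) * RATE)

print(round(mean_usd, 2), round(sd_usd, 2))
```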


## Hypothesis and alternative hypothesis

- H0: the two markets have the same mean price (μ1 = μ2)
- H1: the two markets have different mean prices (μ1 ≠ μ2)

We shall test at the 1%, 5% and 10% levels of significance using a **two-tailed test**. We opt for the two-tailed test because we are only checking whether the two means differ, not whether one is specifically higher or lower than the other.

## Finding level of significance

The test statistic z for the difference between two means is given by

$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

Using the values in USD:

$z = \frac{137.59 - 413}{\sqrt{\frac{166.29^2}{1359} + \frac{326^2}{407}}}$

The critical values for a two-tailed test at the **1%, 5% and 10%** levels are **2.58, 1.96 and 1.645** respectively.
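
The two-tailed critical values can be reproduced from the standard normal distribution instead of being read from a table. This is a short sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the dictionary name `critical_values` is only illustrative.

```python
from statistics import NormalDist

# Two-tailed test: the significance level alpha is split equally between both tails,
# so the critical value is the (1 - alpha / 2) quantile of the standard normal
critical_values = {
    alpha: NormalDist().inv_cdf(1 - alpha / 2)
    for alpha in (0.01, 0.05, 0.10)
}

for alpha, critical in critical_values.items():
    print(f"alpha = {alpha:.2f}: reject if |z| > {critical:.3f}")
```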

In [7]:
import math

# z statistic for the difference between the two sample means
z = (mean_in - mean_us) / math.sqrt((sd_in**2/count_in) + (sd_us**2/count_us))
print("Value of z:", z)
# decide whether to accept or reject the hypothesis, using two-tailed critical values;
# the smallest critical value (10%) is checked first so that every branch is reachable
if abs(z) < 1.645:
    print("The hypothesis is accepted at the 1%, 5% and 10% levels of significance")
elif abs(z) < 1.96:
    print("The hypothesis is accepted at the 1% and 5% levels of significance")
elif abs(z) < 2.58:
    print("The hypothesis is accepted at the 1% level of significance")
else:
    print("The hypothesis is rejected at all levels")

Value of z: -16.415925162146806
The hypothesis is rejected at all levels
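
An equivalent way to reach the decision is through a p-value. The sketch below wraps the calculation in a hypothetical helper, `two_sample_z` (a name introduced here for illustration, not part of the notebook), and applies it to the rounded summary statistics from this case study.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Hypothetical helper: return (z, two-tailed p-value) for a difference of means."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / standard_error
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Rounded summary statistics from this case study (Indian values already in USD)
z, p_value = two_sample_z(137.59, 166.29, 1359, 413, 326, 407)
print(f"z = {z:.3f}, p = {p_value:.3g}")

for alpha in (0.01, 0.05, 0.10):
    decision = "reject" if p_value < alpha else "do not reject"
    print(f"At the {alpha:.0%} level: {decision} the null hypothesis")
```

Comparing the p-value with each significance level gives the same decision as comparing |z| with the corresponding critical value.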


## Conclusion

Since the null hypothesis is rejected at all three levels (1%, 5% and 10%), we accept the alternative hypothesis. We can infer that the mean prices in the two markets are significantly different.