In [10]:
from my_statistics import stat
import pandas as pd

In [11]:
reviews_hybrid = pd.read_csv("../reviews_hybrid.csv")

In [12]:
# Define your price categories
price_categories = ["Budget-Friendly", "Mid-Range", "Premium", "Luxury", "Collectible/Investment"]

# Compute the per-category statistics once instead of once per category
grouped_stats = reviews_hybrid.groupby('Price_Category')['Price'].apply(stat)

# For each category, select its block, transpose it, and label the row with the category name
stat_results = [
    grouped_stats.loc[category, :].T.rename(index={"Value": category})
    for category in price_categories
]

# Concatenate all the DataFrames into a single DataFrame
combined_stats = pd.concat(stat_results)

In [13]:
combined_stats.columns
col = ['Count', 'Mean', 'Standard Deviation', 'Skewness', 'Kurtosis', 'Coefficient of Variation', 'Mode',
        'Q1 (25th percentile)', 'Median (50th percentile)', 'Q3 (75th percentile)', 'IQR', 'Range, excl outliers',
        'Whisker Bottom', 'Whisker Top', 'Max value', 'Min value', 'Range, incl. outliers',
        'Number of Outliers', 'IQR midpoint', 'Whiskers midpoint']

In [14]:
combined_stats[col]

| Price Category | Count | Mean | Standard Deviation | Skewness | Kurtosis | Coefficient of Variation | Mode | Q1 (25th percentile) | Median (50th percentile) | Q3 (75th percentile) | IQR | Range, excl outliers | Whisker Bottom | Whisker Top | Max value | Min value | Range, incl. outliers | Number of Outliers | IQR midpoint | Whiskers midpoint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Budget-Friendly | 62879 | 19.03 | 6.59 | 0.22 | -0.94 | 0.35 | 20.0 | 14.0 | 18.0 | 24.6 | 10.6 | 28.8 | 4.0 | 32.8 | 32.8 | 4.0 | 28.8 | 0 | 19.3 | 18.4 |
| Mid-Range | 27568 | 46.93 | 10.41 | 0.64 | -0.62 | 0.22 | 35.0 | 38.0 | 45.0 | 55.0 | 17.0 | 39.6 | 33.0 | 72.6 | 72.6 | 33.0 | 39.6 | 0 | 46.5 | 52.8 |
| Premium | 6417 | 98.52 | 24.05 | 1.12 | 0.39 | 0.24 | 75.0 | 80.0 | 90.0 | 110.0 | 30.0 | 82.2 | 72.8 | 155.0 | 171.4 | 72.8 | 98.6 | 204 | 95.0 | 113.9 |
| Luxury | 809 | 245.31 | 62.48 | 0.93 | 0.07 | 0.25 | 200.0 | 195.0 | 228.0 | 281.0 | 86.0 | 238.0 | 172.0 | 410.0 | 430.0 | 172.0 | 258.0 | 9 | 238.0 | 291.0 |
| Collectible/Investment | 173 | 625.37 | 253.56 | 3.82 | 18.93 | 0.41 | 537.0 | 500.0 | 550.0 | 672.0 | 172.0 | 464.2 | 435.8 | 900.0 | 2300.0 | 435.8 | 1864.2 | 10 | 586.0 | 667.9 |


<font color="brown">**1. What are Quartiles?**</font>  
- **Definition:** Quartiles are values that divide a sample of data into four equal parts. With them you can quickly evaluate a data set's spread and central tendency, which are important first steps in understanding your data.   
    - **1st quartile (Q1) or 25th percentile:** 25% of the data are less than or equal to this value.
    - **2nd quartile (Q2) or 50th percentile:** The median, a common measure of the center of your data. Half the observations are less than or equal to it, and half are greater than or equal to it. In a boxplot, the median is the line inside the box.
    - **3rd quartile (Q3) or 75th percentile:** 75% of the data are less than or equal to this value.
    - **Interquartile range (IQR):** The distance between the 1st and 3rd quartiles (Q3 - Q1); it spans the middle 50% of the data.
- **Quartiles are not necessarily observations:** Quartiles are calculated values. It is often necessary to interpolate between two observations to calculate a quartile accurately.
- **Quartiles are robust to extreme observations:** Because quartiles are barely affected by extreme observations, the median and interquartile range are better measures of central tendency and spread for highly skewed data than the mean and standard deviation.
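
As a rough illustration of the points above, the sketch below computes the quartiles and the IQR for a small made-up sample. The `prices` values are hypothetical and are not taken from `reviews_hybrid.csv`; the only library call used is pandas' `Series.quantile`, which interpolates linearly by default.

```python
import pandas as pd

# Hypothetical toy prices for illustration; not taken from reviews_hybrid.csv
prices = pd.Series([10, 12, 14, 15, 18, 20, 22, 25, 30, 120])

# pandas interpolates linearly between neighbouring observations by default
q1, q2, q3 = prices.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"mean={prices.mean()}, median={prices.median()}")
```

In this sample Q1 comes out as 14.25, a value that does not occur in the data, which shows the interpolation at work. The single extreme price of 120 pulls the mean up to 28.6, above Q3, while the median stays at 19.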
   

<font color="brown">**2. Whiskers**</font>
   - **Definition:** The whiskers are the two lines that extend from either side of the box. The lower whisker runs from the lower quartile (the start of the box) down to the smallest value that is not an outlier, and the upper whisker runs from the upper quartile (the end of the box) up to the largest value that is not an outlier. When a group has no outliers, the whiskers end at the minimum and maximum, so they cover the bottom 25% and the top 25% of the data values.
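
The internals of `stat` are not shown here, so the following is only a sketch of one common convention (Tukey's rule), under which a value counts as an outlier when it lies more than 1.5 × IQR beyond a quartile. The helper name `whiskers`, its parameter `k`, and the sample data are assumptions made for illustration.

```python
import numpy as np

def whiskers(values, k=1.5):
    """Hypothetical helper: whisker ends and outlier count under the k * IQR rule."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme observations that are still inside the fences
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return inside.min(), inside.max(), int(x.size - inside.size)

# Hypothetical toy prices, not taken from reviews_hybrid.csv
prices = [10, 12, 14, 15, 18, 20, 22, 25, 30, 120]
bottom, top, n_out = whiskers(prices)
print(bottom, top, n_out)
```

Here the fences are -0.75 and 39.25, so the whiskers end at 10 and 30 and the value 120 is counted as the only outlier. The summary table looks consistent with this rule (for Luxury, 281.0 + 1.5 × 86.0 = 410.0, the reported Whisker Top), but that is an inference from the numbers rather than something the notebook states.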

<font color="brown">**3. Spreads: Comparing the IQR (Interquartile Range) of different groups means looking for differences between the spreads (variability) of the groups.**</font>
- **Definition:** The IQR is the range within which the central 50% of the data lies, specifically between the 25th percentile (Q1) and the 75th percentile (Q3). It shows the distance between the first and third quartiles (Q3-Q1). A larger box indicates that the middle 50% of the data is more spread out or has a wider range.
- **Measure of Spread:** The IQR measures the spread of the middle 50% of your data. It tells you how tightly or widely data points are clustered around the median. A larger IQR indicates greater variability, while a smaller IQR suggests less variability. You can use the IQR to compare the **variability** of two or more datasets. If one dataset has a larger IQR than another, it indicates that the data in the first dataset is more **spread out** or **variable**.
- **Given Information:**
    - **For Mid-Range:** The IQR is 17.0, calculated as Q3 (55.0) - Q1 (38.0). This tells us that the middle 50% of the prices fall between $\$38.0$ and $\$55.0$.
    - **For Budget-Friendly:** The IQR is 10.6, calculated as Q3 (24.6) - Q1 (14.0). This tells us that the middle 50% of the prices fall between $\$14.0$ and $\$24.6$.
- **Explanation:**
    - **Mid-Range vs. Budget-Friendly:** The IQR describes the variability of the central half of wine prices. Budget-Friendly has the smaller IQR (10.6 versus 17.0), so its prices are more consistent: the middle 50% of Budget-Friendly prices is more tightly concentrated, while the middle 50% of Mid-Range prices spans a wider range.
    - **Main Check of Price Distribution:** Mid-Range has a wider spread of prices, while Budget-Friendly is more concentrated.

<font color="brown">**4. Comparing IQR to the Range**</font>

- **Definition:** The IQR represents the central 50% of your data's spread. The Range indicates the full spread from the minimum to the maximum.
- **Given Information:**
  - **For Mid-Range:** The Range excluding outliers is 39.6, and the IQR is 17.0. The IQR occupies 42.9% of the total range, indicating a significant spread within the middle 50%.
  - **For Budget-Friendly:** The Range excluding outliers is 28.8, and the IQR is 10.6. The IQR covers 36.8% of the total range, showing a more concentrated middle 50%.

This comparison sets the spread of the central data points against the full range for each category. In both, the central half of the prices takes up well under half of the range, and a smaller share of it in Budget-Friendly than in Mid-Range.
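
The arithmetic behind the IQR values and their share of the range can be reproduced with a short sketch. The numbers are copied from the summary table; the dictionary layout and the helper name `iqr_share` are assumptions made for illustration.

```python
# Values copied from the summary table above
summary = {
    "Budget-Friendly": {"Q1": 14.0, "Q3": 24.6, "range_excl_outliers": 28.8},
    "Mid-Range": {"Q1": 38.0, "Q3": 55.0, "range_excl_outliers": 39.6},
}

def iqr_share(row):
    """Hypothetical helper: return the IQR and its share of the range in percent."""
    iqr = row["Q3"] - row["Q1"]
    return iqr, 100 * iqr / row["range_excl_outliers"]

for name, row in summary.items():
    iqr, share = iqr_share(row)
    print(f"{name}: IQR={iqr:.1f}, share of range={share:.1f}%")
```

This gives an IQR of 10.6 (36.8% of the range) for Budget-Friendly and 17.0 (42.9%) for Mid-Range.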

<font color="brown">**5. Comparing Medians (Q2)**</font>  
- **Definition:** The Median (Q2) represents the central value of the data set.
- **Given Information:**
  - **For Mid-Range:** The median price is $\$45.0$.  
  - **For Budget-Friendly:** The median price is $\$18.0$.
- **Explanation:** The median for Mid-Range is higher than for Budget-Friendly, so a typical Mid-Range wine is priced higher than a typical Budget-Friendly wine. The Mid-Range median is 2.5 times the Budget-Friendly median (150% higher); equivalently, the Budget-Friendly median is 60% lower than the Mid-Range median.

This comparison provides an insight into the typical price point within each category, highlighting the differences in their market positioning and value proposition.
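
A relative difference between two medians depends on which one is taken as the baseline. The sketch below makes that explicit using the two medians from the table; the helper name `pct_difference` is an assumption made for illustration.

```python
def pct_difference(value, baseline):
    """Hypothetical helper: difference of value from baseline, in percent of baseline."""
    return 100 * (value - baseline) / baseline

# Medians copied from the summary table above
median_mid_range = 45.0
median_budget = 18.0

print(pct_difference(median_mid_range, median_budget))  # Mid-Range relative to Budget-Friendly
print(pct_difference(median_budget, median_mid_range))  # Budget-Friendly relative to Mid-Range
```

With Budget-Friendly as the baseline the result is 150.0, and with Mid-Range as the baseline it is -60.0, so the direction of the comparison should always be stated alongside the percentage.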

<font color="brown">**6.0 First Check of Symmetry of Whole Data. Is distribution Skewed? Is Mean equal to Median?**</font>   
Skewness is a measure of the asymmetry or lack of symmetry in a dataset's distribution. It indicates the degree and direction of skew (tilt) in the data. There are three types of skewness:

1. **Negatively Skewed (Left Skewed):**
   - The left tail is longer or fatter than the right tail.
   - The majority of the data points are concentrated on the right side of the distribution.
   - The mean is typically less than the median.
   - If you know that your data are not naturally skewed, investigate possible causes.

2. **Positively Skewed (Right Skewed):**
   - The right tail is longer or fatter than the left tail.
   - The majority of the data points are concentrated on the left side of the distribution.
   - The mean is typically greater than the median.
   - If you know that your data are not naturally skewed, investigate possible causes.

3. **Symmetric:**
   - The distribution is perfectly balanced, and both tails are equal in length.
   - The mean and median are approximately equal.

- **Given Information:**
  - **For Mid-Range:** The mean is 46.93, and the median is 45.0, indicating the mean is greater, suggesting a slight positive skew.
  - **For Budget-Friendly:** The mean is 19.03, and the median is 18.0, also indicating a slight positive skew.

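Both categories show a slight positive skew: a longer right tail of higher prices pulls the mean above the median. Neither category has any flagged outliers (Number of Outliers is 0 for both), so the skew comes from the overall shape of the distribution rather than from a few extreme values. This analysis helps in understanding the data's central tendency and distribution shape.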
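
One way to put the mean versus median gap on a common scale is Pearson's second skewness coefficient, 3 × (mean - median) / standard deviation. It is not part of the `stat` output and is added here only as an illustrative sketch; the input values are copied from the summary table and the helper name `pearson_median_skewness` is an assumption.

```python
def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

# Mean, median, and standard deviation copied from the summary table above
categories = {
    "Budget-Friendly": {"mean": 19.03, "median": 18.0, "std": 6.59},
    "Mid-Range": {"mean": 46.93, "median": 45.0, "std": 10.41},
}

for name, s in categories.items():
    coef = pearson_median_skewness(s["mean"], s["median"], s["std"])
    print(f"{name}: mean - median = {s['mean'] - s['median']:.2f}, coefficient = {coef:.2f}")
```

Both coefficients are positive (about 0.47 for Budget-Friendly and 0.56 for Mid-Range), in line with the mean sitting above the median in each category.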

<font color="brown">**6.1 Second Check of Symmetry of Whole Data. Is Midpoint of IQR equal to Midpoint of Whiskers?**</font>
   - **Definition:** In a symmetric distribution, the IQR should be centered between the upper and lower whiskers, meaning the midpoint of the IQR is equal to the midpoint of the whiskers.  
For "Mid-Range," the whiskers' midpoint is 52.8 and the IQR's midpoint is 46.5: the upper whisker is longer than the lower one, which points to a positive skew. For "Budget-Friendly," these midpoints are 18.4 and 19.3, respectively: a much smaller gap in the opposite direction, meaning the lower whisker is slightly longer. Both discrepancies indicate a deviation from perfect symmetry, more pronounced for Mid-Range.
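
The midpoint comparison can be reproduced from the quartiles and whisker ends in the summary table. The sketch below is illustrative only; the dictionary layout and the helper name `midpoint` are assumptions.

```python
def midpoint(low, high):
    """Hypothetical helper: halfway point between two values."""
    return (low + high) / 2

# Quartiles and whisker ends copied from the summary table above
boxes = {
    "Budget-Friendly": {"q1": 14.0, "q3": 24.6, "whisker_bottom": 4.0, "whisker_top": 32.8},
    "Mid-Range": {"q1": 38.0, "q3": 55.0, "whisker_bottom": 33.0, "whisker_top": 72.6},
}

gaps = {}
for name, b in boxes.items():
    iqr_mid = midpoint(b["q1"], b["q3"])
    whisker_mid = midpoint(b["whisker_bottom"], b["whisker_top"])
    # A positive gap means the upper whisker is longer than the lower one
    gaps[name] = whisker_mid - iqr_mid
    print(f"{name}: IQR midpoint={iqr_mid:.1f}, whiskers midpoint={whisker_mid:.1f}, gap={gaps[name]:+.1f}")
```

The gap is +6.3 for Mid-Range and -0.9 for Budget-Friendly, matching the IQR midpoint and Whiskers midpoint columns of the table.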

<font color="brown">**7. Symmetry of Central 50% (Comparing Median to Midpoint IQR)**</font>
   - **Definition:** If the distribution is symmetric, the median should be close to the midpoint of the IQR (halfway between Q1 and Q3).  
For "Mid-Range," the median is $\$45.0$ and the midpoint of the IQR is $\$46.5$. The median is slightly below the midpoint, so it sits closer to Q1 than to Q3, suggesting a slight positive skew within the central 50%. For "Budget-Friendly," the median is $\$18.0$ and the midpoint of the IQR is $\$19.3$, again indicating a slight positive skew within the central 50%. In both categories the upper half of the box is a little wider than the lower half, which is consistent with the mean exceeding the median.
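
The position of the median inside the box can be summarised with the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 × median) / (Q3 - Q1). It is not part of the `stat` output and is sketched here only as an illustration; the quartiles are copied from the summary table and the helper name `bowley_skewness` is an assumption.

```python
def bowley_skewness(q1, median, q3):
    """Quartile skewness: positive when the median is closer to Q1 than to Q3."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Quartiles copied from the summary table above
quartiles = {
    "Budget-Friendly": (14.0, 18.0, 24.6),
    "Mid-Range": (38.0, 45.0, 55.0),
}

for name, (q1, median, q3) in quartiles.items():
    print(f"{name}: quartile skewness = {bowley_skewness(q1, median, q3):.2f}")
```

The coefficient ranges from -1 to 1 and is 0 when the median lies exactly at the IQR midpoint. Both categories give a small positive value (about 0.25 for Budget-Friendly and 0.18 for Mid-Range), meaning the upper half of the box is wider than the lower half.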

<font color="Blue">**General Comparison Summary**</font>
The analysis of the "Mid-Range" and "Budget-Friendly" wine categories reveals slight asymmetries in both price distributions. Both have a mean above the median, which points to a slight positive skew; Mid-Range shows it more clearly (skewness 0.64 versus 0.22). Budget-Friendly is the more concentrated of the two, with a smaller IQR (10.6 versus 17.0) and a smaller range (28.8 versus 39.6). These differences in price variability and distribution shape are useful context for consumers and marketers trying to understand wine pricing.

The statistical analysis of the five wine categories (Budget-Friendly, Mid-Range, Premium, Luxury, and Collectible/Investment) reveals distinct market segments. Budget-Friendly wines have the lowest mean price (19.03) with minimal skew (0.22), indicating a consistent, affordable range. Mid-Range wines show a larger absolute spread (standard deviation 10.41 versus 6.59) and a slight positive skew (0.64), suggesting a mix of standard and higher-priced offerings. Premium and Luxury categories exhibit higher means and greater skewness (1.12 and 0.93), indicating a long tail of high-value wines. Collectible/Investment wines, with the highest mean (625.37), skewness (3.82), and kurtosis (18.93), represent a niche market with exceptional prices and variability, catering to collectors and investors. This analysis highlights the wine market's complexity, from everyday affordability to exclusive investment-grade offerings.