Subject: Data Hypotheses Development Code Submission.



Date: December 26, 2024.

Dear Professor Ilia Tetin,

I am writing on behalf of our presentation team, which consists of two members:


*   LE TRAN NHA TRAN - JASMINE (Student ID: 11285100M);
*   DINH VAN LONG - BRAD (Student ID: 11285109M).

In this section, we conducted tests on nine hypotheses, categorized into three distinct groups for clarity and focus:

- Price-Related Hypotheses: These hypotheses examine the impact of pricing strategies on consumer behavior, such as price sensitivity and pricing trends for used smartphones.
- Seller/Rating-Related Hypotheses: These explore the influence of seller credibility and customer ratings on purchase decisions, providing insights into trust-building factors in e-commerce transactions.
- Geographic-Related Hypotheses: These focus on regional variations in consumer preferences and purchasing patterns, comparing trends between urban and rural areas.

Each hypothesis was tested using visualizations and preliminary hypothesis development frameworks, leading to conclusions aligned with the observed data.

In [None]:
import pandas as pd
import plotly.express as px
import polars as pl
import statsmodels.api as sm
from scipy import stats

In [None]:
alpha = 0.05
high_end_price = 600  # USD

df = pl.read_csv("cleaned_info.csv")

In [None]:
def create_categorical_plot(
    df: pd.DataFrame,
    x_col: str,
    y_col: str,
    title: str,
    plot_type: str = "box",
    color_col: str = None,
):
    """
    Creates a box plot or violin plot for visualizing numerical variable
    distributions across categorical variables using Plotly Express.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        x_col (str): The column for the x-axis (categorical variable).
        y_col (str): The column for the y-axis (numerical variable).
        title (str): The title of the plot.
        plot_type (str, optional): The type of plot ('box', 'violin').
                                   Defaults to 'box'.
        color_col (str, optional) : The column that contains the colors. Defaults to None.

    Returns:
        plotly.graph_objects.Figure: The Plotly Figure object.
    """

    if plot_type == "box":
        fig = px.box(df, x=x_col, y=y_col, title=title, color=color_col)
    elif plot_type == "violin":
        fig = px.violin(df, x=x_col, y=y_col, title=title, color=color_col)
    else:
        raise ValueError("Invalid plot_type. Choose either 'box' or 'violin'.")
    return fig


def create_scatter_plot(
    df: pd.DataFrame, x_col: str, y_col: str, title: str, color_col: str = None
):
    """
    Creates a scatter plot using Plotly Express.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        x_col (str): The column for the x-axis.
        y_col (str): The column for the y-axis.
        title (str): The title of the plot.
        color_col (str, optional) : The column that contains the colors. Defaults to None.


    Returns:
        plotly.graph_objects.Figure: The Plotly Figure object.
    """
    fig = px.scatter(df, x=x_col, y=y_col, title=title, color=color_col)
    return fig


def create_heatmap(
    df: pd.DataFrame, title: str, x_axis: str = None, y_axis: str = None
):
    """
    Creates a heatmap from a dataframe using plotly.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        title (str): The title of the plot.
        x_axis (str, optional): The label shown on the x-axis. Defaults to None.
        y_axis (str, optional): The label shown on the y-axis. Defaults to None.

    Returns:
        plotly.graph_objects.Figure: The Plotly Figure object.
    """
    fig = px.imshow(df, text_auto=True, title=title, labels=dict(x=x_axis, y=y_axis))
    return fig


def create_bar_chart(
    df: pd.DataFrame,
    x_col: str,
    y_col: str,
    title: str,
    color_col: str = None,
):
    """
    Creates a bar chart using Plotly Express.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        x_col (str): The column for the x-axis.
        y_col (str): The column for the y-axis.
        title (str): The title of the plot.
        color_col (str, optional) : The column that contains the colors. Defaults to None.

    Returns:
        plotly.graph_objects.Figure: The Plotly Figure object.
    """
    fig = px.bar(df, x=x_col, y=y_col, title=title, color=color_col)
    return fig


def create_stacked_bar_chart(
    df: pd.DataFrame,
    x_col: str,
    y_col: str,
    title: str,
    color_col: str = None,
):
    """
    Creates a stacked bar chart using Plotly Express.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        x_col (str): The column for the x-axis.
        y_col (str): The column for the y-axis.
        title (str): The title of the plot.
        color_col (str, optional) : The column that contains the colors. Defaults to None.

    Returns:
        plotly.graph_objects.Figure: The Plotly Figure object.
    """
    fig = px.bar(df, x=x_col, y=y_col, title=title, color=color_col, barmode="stack")
    return fig

# **HYPOTHESES DEVELOPMENT:**

## 1. Price-related Hypotheses:

*   **Hypothesis 1:** *For each additional GB of storage, the price increases by a statistically significant percentage, and this elasticity differs across brands.*
    *   **Statistical Test:** A linear regression model with interaction to test price elasticity of storage, where storage is measured in GB and interacts with brands.
    *   **Null Hypothesis:** There is no statistically significant change in price with an increase in storage capacity, or the elasticity does not differ across brands.
    *   **Observation:** The storage capacity distribution chart shows a clear trend where higher storage (e.g., 256GB, 512GB) correlates with higher prices across brands. This trend is especially strong for premium models (e.g., Apple and Samsung).

In [None]:
# Apply log transformation to price.
df = df.with_columns(pl.col("price").log().alias("log_price"))

# Convert capacity to numeric (GB), handling non-numeric values.
capacity_mapping = {
    "less_than_8": 4,  # Assign a representative value.
    "8": 8,
    "16": 16,
    "32": 32,
    "64": 64,
    "128": 128,
    "256": 256,
    "512": 512,
    "1024": 1024,
    "more_than_2048": 2048,  # Assign a representative value.
}

df = df.with_columns(
    pl.col("capacity").replace(capacity_mapping).cast(pl.Int64).alias("capacity_gb")
)

# One-hot encode brand.
brands = df["brand"].unique()
for brand in brands:
    df = df.with_columns(
        pl.when(pl.col("brand") == brand).then(1).otherwise(0).alias(f"brand_{brand}"),
    )

# Define the independent variables (including interactions).
independent_vars = ["capacity_gb"] + [f"brand_{brand}" for brand in brands]
interaction_vars = [f"capacity_gb*brand_{brand}" for brand in brands]

# Add interaction terms.
for brand in brands:
    df = df.with_columns(
        (pl.col("capacity_gb") * pl.col(f"brand_{brand}")).alias(
            f"capacity_gb*brand_{brand}"
        )
    )

independent_vars += interaction_vars
# Prepare the dependent and independent variables.
dependent_var = df["log_price"]
independent_vars_df = df.select(independent_vars).to_pandas()

# Add a constant term for the intercept.
independent_vars_df = sm.add_constant(independent_vars_df)

# Perform the multiple linear regression.
model = sm.OLS(dependent_var, independent_vars_df)
results = model.fit()

# Print the regression results.
print(results.summary())

# Check the p value for capacity, and interaction variables.
p_value_capacity = results.pvalues["capacity_gb"]
p_value_interactions = results.pvalues[interaction_vars]

# Decision based on p-value for capacity.
if p_value_capacity < alpha:
    print(
        f"\nReject the first part of the null hypothesis. Conclude that: There is a significant effect of capacity on price. (p-value: {p_value_capacity:.4f})"
    )
else:
    print(
        f"\nFail to reject the first part of the null hypothesis. Conclude that: There is no significant effect of capacity on price. (p-value: {p_value_capacity:.4f})"
    )

# Decision based on p-value for interaction.
if any(p < alpha for p in p_value_interactions):
    print(
        "\nReject the second part of the null hypothesis. Conclude that: There is significant effect of the interactions between capacity and brand on price."
    )
else:
    print(
        "\nFail to reject the second part of the null hypothesis. Conclude that: There is no significant effect of the interactions between capacity and brand on price."
    )

fig = create_scatter_plot(
    df,
    x_col="capacity_gb",
    y_col="price",
    title="Price vs. Capacity with different brands",
    color_col="brand",
)
fig.show()

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.396
Model:                            OLS   Adj. R-squared:                  0.394
Method:                 Least Squares   F-statistic:                     181.9
Date:                Sat, 04 Jan 2025   Prob (F-statistic):               0.00
Time:                        12:39:11   Log-Likelihood:                -19904.
No. Observations:               16405   AIC:                         3.993e+04
Df Residuals:                   16345   BIC:                         4.039e+04
Df Model:                          59                                         
Covariance Type:            nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const     


- Storage Elasticity:

  - The interaction term capacity_gb*brand tests whether the relationship between storage capacity and price differs across brands.

  - Significant interaction terms (e.g., capacity_gb*brand_Google, capacity_gb*brand_Sony) indicate that for some brands, price increases more significantly with higher storage.
- Null Hypothesis Rejection:

  - The results indicate statistically significant price elasticity for some brands (e.g., Apple, Samsung, Sony, Xiaomi). However, it is not universally significant across all brands, as seen with capacity_gb*brand_Pocophone and capacity_gb*brand_Honor (p > 0.05).

- R-squared: The model explains 39.6% of price variance, suggesting additional factors (e.g., condition, warranty, or seller rating) could improve the model.
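The decision rule in the code rejects the interaction part of the null hypothesis as soon as any single interaction p-value falls below alpha. With dozens of interaction terms, a few can cross 0.05 by chance alone. A minimal sketch of a stricter, Bonferroni-adjusted check is shown below; the function name and the example p-values are hypothetical and are not taken from our regression output.

```python
def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Flag each term as significant after a Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests.
    """
    n_tests = len(p_values)
    if n_tests == 0:
        return {}
    threshold = alpha / n_tests
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values for illustration only (not from the model above).
example_p_values = {
    "capacity_gb*brand_A": 0.0004,
    "capacity_gb*brand_B": 0.0300,
    "capacity_gb*brand_C": 0.4100,
}

flags = bonferroni_significant(example_p_values, alpha=0.05)
for term, is_significant in flags.items():
    print(f"{term}: significant after correction = {is_significant}")
```

In this toy case, a term with p = 0.03 would pass the unadjusted 0.05 threshold but not the adjusted one, which is the situation the correction is meant to guard against.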

The regression analysis indicates that storage capacity (measured in GB) is positively correlated with price. Higher storage models, such as 256GB and 512GB, are priced significantly higher than models with lower storage capacities.

The regression model also has an R-squared value of 0.396, suggesting that around 39.6% of the variance in price is explained by the model, including storage capacity and brand interactions.


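Overall, capacity_gb on its own is not significant (p = 0.257), so we fail to reject the first part of the null hypothesis. The significant interaction terms, however, lead us to reject the null hypothesis for the interaction effects. This suggests that storage alone doesn't universally determine price without considering brand interactions.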

*Significant Brand Interactions*

 - Positive Elasticity: Higher storage significantly increases price for premium brands (e.g., Google, Sony, Samsung). For instance, brands like Apple (coef = 1.238) and Google (coef = 0.569) command a higher base price, with additional storage enhancing this premium further.
 - Lower Elasticity or Insignificance: Mid-range brands like Oppo and Vivo show smaller or insignificant elasticity.
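Because the model regresses log price on capacity in GB, the slope for a given brand is the sum of the capacity_gb coefficient and that brand's interaction coefficient, and it converts to a percentage change in price per additional GB. The sketch below shows that conversion under this assumption; the coefficient values are hypothetical placeholders, not values from our regression output, and the function name is ours.

```python
import math


def pct_change_per_gb(beta_capacity: float, beta_interaction: float = 0.0) -> float:
    """Percentage change in price for one additional GB in a log-price model.

    The brand-specific slope is the base capacity coefficient plus the
    brand's interaction coefficient; exp(slope) - 1 is the relative change.
    """
    slope = beta_capacity + beta_interaction
    return (math.exp(slope) - 1.0) * 100.0


# Hypothetical coefficients for illustration only.
base_slope = 0.0010          # assumed coefficient on capacity_gb
brand_interaction = 0.0005   # assumed coefficient on capacity_gb*brand_X

print(f"Baseline: {pct_change_per_gb(base_slope):.3f}% per GB")
print(f"Brand X:  {pct_change_per_gb(base_slope, brand_interaction):.3f}% per GB")
```

For small slopes the result is close to 100 times the slope, which is why log-price coefficients are often read directly as percentages.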

The plot highlights the tiered segmentation in the market:

1. Distribution Across Capacity:
  - The scatterplot clearly shows that higher storage capacities (e.g., 512GB, 1TB, 2TB) correspond to higher prices, particularly for premium brands like Apple and Samsung.
  - Lower-capacity models (e.g., 32GB, 64GB) cluster at the lower price range, appealing to budget-conscious consumers or entry-level users.
2. Brand Differentiation:
  - Apple and Samsung: Represent the most expensive models in every storage category. These brands dominate in the higher storage capacities, with prices reaching significantly higher levels than other brands.
  - Other Brands (e.g., Realme, Oppo, Xiaomi): Spread across the lower and middle price ranges, with limited representation in high-storage, high-price categories. This suggests that these brands focus on affordability and cost-efficiency for mid-range and budget consumers.
  - Local Brands (e.g., Vsmart): Predominantly lower price points, appealing to highly price-sensitive consumers.
3. Outliers:
  - Some dots at the lower storage levels (e.g., 0GB or small capacity) correspond to unexpected high prices. These may represent flagship models with unique features or errors in data recording.

*Human Behavior in the Vietnam Market*

Consumers in Vietnam likely perceive storage as a key feature, especially for premium brands like Apple and Samsung. It shows that consumers willing to pay higher prices are likely targeting premium brands, which bundle storage upgrades with enhanced features (e.g., better cameras, higher performance).

Other brands like Realme, Oppo, and Xiaomi show more competitive pricing for similar storage capacities, suggesting a value-driven strategy to attract cost-conscious consumers.

Local and regional brands like Vsmart and BKAV likely appeal to practical buyers who prioritize value over prestige.

*   **Hypothesis 2:** *The mean price of phones sold by companies is significantly higher than the mean price of phones sold by individuals.*
    *   **Statistical Test:** A two-sample t-test (or a Mann-Whitney U test) to compare the mean prices of company vs. individual sellers.
    *   **Null Hypothesis:** There is no statistically significant difference between the mean price of phones sold by companies and individuals.
    *   **Observation:** While company sellers exhibit less variance and generally consistent higher prices (supported by the boxplot), individual sellers occasionally list rare or premium devices as outliers. On average, companies maintain higher pricing due to new inventory and warranties.

In [None]:
# Separate prices for company sellers and individual sellers.
company_prices = df.filter(pl.col("is_company"))["price"].to_numpy()  # Extract prices for company sellers.
individual_prices = df.filter(~pl.col("is_company"))["price"].to_numpy()  # Extract prices for individual sellers.

# Perform Mann-Whitney U test.
u_stat, p_value = stats.mannwhitneyu(
    company_prices, individual_prices, alternative="greater"  # Test if company prices are greater.
)
test_used = "Mann-Whitney U test"  # Specify the test used.

# Output test results.
print(f"Test used: {test_used}")
print(f"P-value: {p_value:.4f}")

# Make decision based on p-value.
if p_value < alpha:
    print(
        "Reject the null hypothesis. Conclude that: The mean price of phones sold by companies is significantly higher than those sold by individuals."
    )
else:
    print(
        "Fail to reject the null hypothesis. Conclude that: There is no significant difference in the mean prices of phones sold by companies and individuals."
    )

fig = create_categorical_plot(
    df,
    x_col="is_company",
    y_col="price",
    title="Price Distribution by Seller Type",
    plot_type="box",
)
fig.show()

Test used: Mann-Whitney U test
P-value: 0.0000
Reject the null hypothesis. Conclude that: The mean price of phones sold by companies is significantly higher than those sold by individuals.


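1. Mann-Whitney U Test Results:
- The p-value is printed as 0.0000 (that is, below 0.00005 after rounding to four decimals), which is well below the significance threshold of 0.05.
- This leads to the rejection of the null hypothesis. Because the Mann-Whitney U test compares rank distributions rather than means, the precise conclusion is that phones sold by companies tend to be priced significantly higher than those sold by individuals.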


This supports the observation that companies price phones higher due to factors like:

- New inventory.
- Warranties.
- Standardized pricing strategies.

2. Box Plot Analysis:

- Individual Sellers (false): The price distribution is concentrated at the lower end, with a smaller interquartile range (IQR). Most of the phones sold by individuals are priced below 500, with fewer outliers reaching higher price points.
- Companies (true): The price distribution has a wider IQR and a higher median compared to individual sellers. Companies exhibit a more consistent pricing strategy for higher-priced phones, with fewer extreme outliers.

3. Price Variability:

The variability in prices for companies is larger than that for individual sellers, suggesting that companies sell a broader range of products (from budget to premium).
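To make the Mann-Whitney U test used for this hypothesis more concrete, the sketch below computes the U statistic by direct pair counting: it counts how often a price from the first group exceeds a price from the second, with ties counted as one half. This is an illustrative implementation with hypothetical prices, not the code used for our results; the quadratic loop is too slow for the full dataset, where `stats.mannwhitneyu` remains the practical choice.

```python
def mann_whitney_u(x: list[float], y: list[float]) -> float:
    """U statistic for x: number of (x_i, y_j) pairs with x_i > y_j.

    Tied pairs contribute 0.5 each.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical prices for illustration only.
company_sample = [500.0, 700.0, 900.0]
individual_sample = [200.0, 300.0, 600.0]

u_company = mann_whitney_u(company_sample, individual_sample)
n_pairs = len(company_sample) * len(individual_sample)

# Share of pairs in which the company price is the higher one.
share_company_higher = u_company / n_pairs
print(f"U = {u_company}, share of pairs won by companies = {share_company_higher:.3f}")
```

Dividing U by the number of pairs gives a direct reading of the result: the estimated probability that a randomly chosen company listing is priced above a randomly chosen individual listing.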


In the Vietnam market:
- Companies tend to sell premium and mid-range phones, reflected in higher median and overall prices.
- Individuals focus on selling used, refurbished, or budget devices, catering to price-sensitive consumers.

Hence, buyers in Vietnam appear willing to pay a premium when purchasing from companies, likely due to the perceived assurance of quality, authenticity, and post-purchase support. In contrast, a significant portion of the market prefers buying from individuals due to affordability, even if it involves potential trade-offs like lack of warranty or uncertain quality.

*   **Hypothesis 3:**  *There is a statistically significant difference in prices among phones with different colors.*
    *   **Statistical Test:** ANOVA test to check whether the group means are all the same.
    *   **Null Hypothesis:** The mean price across all colors is the same.
    *   **Observation:** The hypothesis aligns with consumer behavior. Normal colors (black, white, gold) are expected to have more standardized prices, while niche or rare colors may exhibit price premiums due to exclusivity and limited availability.

In [None]:
# Drop rows where color is null to make the analysis feasible.
df = df.filter(pl.col("color").is_not_null())

# Group prices by color and collect them into lists for analysis.
prices_by_color = (
    df.group_by("color").agg(pl.col("price").alias("prices"))["prices"].to_list()
)

# Perform Kruskal-Wallis test (non-parametric).
h_stat, p_kruskal = stats.kruskal(*prices_by_color)

print(f"Kruskal Wallis H test p-value: {p_kruskal:.4f}")

# Perform ANOVA test only if Kruskal-Wallis test shows significance.
if p_kruskal < alpha:
    # Perform ANOVA test (parametric).
    f_stat, p_anova = stats.f_oneway(*prices_by_color)
    print(f"ANOVA test p-value: {p_anova:.4f}")

    # Decision based on ANOVA test.
    if p_anova < alpha:
        print(
            "Reject the null hypothesis. Conclude that: There is a statistically significant difference in prices among phones with different colors."
        )
    else:
        print(
            "Fail to reject the null hypothesis. Conclude that: There is no statistically significant difference in prices among phones with different colors."
        )
else:
    print(
        "Fail to reject the null hypothesis. Conclude that: There is no statistically significant difference in prices among phones with different colors. (from Kruskal Wallis test)"
    )

fig = create_categorical_plot(
    df,
    x_col="color",
    y_col="price",
    title="Price Distribution by Color",
    plot_type="box",
)
fig.show()

Kruskal Wallis H test p-value: 0.0000
ANOVA test p-value: 0.0000
Reject the null hypothesis. Conclude that: There is a statistically significant difference in prices among phones with different colors.


Tests Used:
- ANOVA Test: Assumes normality and equal variance, tests if group means are significantly different.
- Kruskal-Wallis H Test: Non-parametric, suitable for non-normal distributions, examines if at least one group differs.

In this case, both the ANOVA test and Kruskal-Wallis test yield a p-value of 0.0000, indicating strong evidence against the null hypothesis. This confirms that color significantly impacts phone prices, supporting the notion that consumer preferences and exclusivity drive price variation.
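As an illustration of what the Kruskal-Wallis test measures, the sketch below pools all observations, ranks them (averaging ranks for ties), and compares the rank sums of the groups. It is a simplified version that omits the tie correction library implementations typically apply, and the example groups are hypothetical rather than drawn from our data.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: list[list[float]]) -> float:
    """Kruskal-Wallis H statistic without tie correction."""
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3.0 * (n_total + 1)


# Hypothetical price groups (one list per color) for illustration only.
example_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_example = kruskal_wallis_h(example_groups)
print(f"H = {h_example:.2f}")
```

A larger H means the groups' rank sums are further apart than would be expected if all colors shared the same price distribution.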

Key Observations:
- Consumer Behavior:

  - Standard Colors:
    - Gold and Silver: These colors have higher price medians, suggesting they are perceived as luxurious and are often used in premium phone designs.
    - Black and White: These colors are widely available and show a lower median price, likely due to their association with both entry-level and mid-range models.
  - Rare Colors: Exclusive shades (e.g., gradient, blue, or custom finishes) may command price premiums due to limited supply and increased demand.
  - Purple and Gray: These colors exhibit the highest median prices and the widest interquartile range (IQR), indicating they are commonly associated with premium models. These may be limited-edition colors or tied to flagship devices.
- Price Variance:

  - ANOVA suggests variability in mean prices among colors.
  - Kruskal-Wallis confirms this variability, even if assumptions of normality or homogeneity are not met.

- Outliers:
  - Premium models in all colors can reach prices above 1500, as indicated by the outliers in the plot.
  - Some colors (e.g., purple and gray) have more consistent high-price outliers.
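The outliers discussed above are the points a box plot draws beyond its whiskers, conventionally those more than 1.5 times the IQR away from the quartiles. The sketch below applies that rule to a hypothetical list of prices. Quartile conventions differ between tools, so the fences may not match the Plotly figure exactly; the function name is ours.

```python
import statistics


def tukey_outliers(values: list[float], k: float = 1.5) -> tuple[list[float], tuple[float, float]]:
    """Return values outside the Tukey fences, plus the (lower, upper) fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return outliers, (lower_fence, upper_fence)


# Hypothetical prices for one color, for illustration only.
example_prices = [100, 200, 300, 400, 500, 600, 700, 800, 900, 10000]
outliers, (lower_fence, upper_fence) = tukey_outliers(example_prices)
print(f"Fences: ({lower_fence}, {upper_fence}); outliers: {outliers}")
```

Listing the flagged rows this way would let us check whether high-price outliers are genuine flagship devices or data-entry errors.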

Human Behavior in the Vietnam Market:
- Color as a status indicator: Colors such as gold, silver, purple, and gray are perceived as premium and luxurious, leading to higher price points. Consumers opting for these colors likely associate them with status and exclusivity.
- Appeal of Black and White: Black and white are popular among both budget-conscious buyers and mid-range consumers due to their neutrality and availability across price tiers.
- Bright colors like blue, green, red, and orange may appeal more to younger demographics seeking uniqueness but are often priced in the lower to mid-range.

## 2. Seller/Rating-related Hypotheses:

*   **Hypothesis 4:** *There is a statistically significant positive correlation between seller ratings and phone prices.*
    *   **Statistical Test:** Calculate Pearson correlation coefficient to measure linear correlation (or Spearman rank correlation if data is not normally distributed).
    *   **Null Hypothesis:** There is no statistically significant correlation between seller ratings and phone prices.
    *   **Observation:** The rating distribution chart suggests that high ratings dominate, but the connection between pricing and seller ratings has not been explicitly shown. We consider that this hypothesis aligns with expectations, but additional data is needed to confirm it.

In [None]:
# Extract price and average_rating columns.
prices = df["price"].to_numpy()  # Convert prices to NumPy array.
ratings = df["average_rating"].to_numpy()  # Convert average ratings to NumPy array.

# Calculate Spearman rank correlation coefficient.
corr_coef, p_value = stats.spearmanr(prices, ratings)  # Perform Spearman correlation.
test_used = "Spearman correlation"  # Specify the test used.

# Output test results.
print(f"Test used: {test_used}")
print(f"Correlation coefficient: {corr_coef:.4f}")
print(f"P-value: {p_value:.4f}")

# Make decision based on p-value.
if p_value < alpha and corr_coef > 0:
    print(
        "Reject the null hypothesis. Conclude that: There is a statistically significant positive correlation between seller ratings and phone prices."
    )
elif p_value < alpha and corr_coef < 0:
    print(
        "Reject the null hypothesis. Conclude that: There is a statistically significant negative correlation between seller ratings and phone prices."
    )
elif p_value < alpha and corr_coef == 0:
    print(
        "Reject the null hypothesis. Conclude that: There is a statistically significant but not positive correlation between seller ratings and phone prices"
    )
else:
    print(
        "Fail to reject the null hypothesis. Conclude that: There is no statistically significant correlation between seller ratings and phone prices."
    )

fig = create_scatter_plot(
    df, x_col="average_rating", y_col="price", title="Price vs. Seller Rating"
)
fig.show()

Test used: Spearman correlation
Correlation coefficient: 0.1249
P-value: 0.0000
Reject the null hypothesis. Conclude that: There is a statistically significant positive correlation between seller ratings and phone prices.


Tests Used:

Spearman correlation, a rank-based measure that is suitable for ordinal data and for monotonic relationships that need not be linear.

A Spearman coefficient of 0.1249 suggests a weak positive correlation between seller ratings and phone prices. Moreover, the p-value of 0.0000 confirms the correlation is statistically significant, rejecting the null hypothesis.

Higher seller ratings are associated with higher phone prices, but the relationship is relatively weak.
This aligns with expectations that trusted sellers (indicated by high ratings) can command slightly higher prices, potentially due to buyer confidence in quality and service.
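For readers unfamiliar with the rank-based approach, the sketch below shows one way to compute a Spearman coefficient by hand: both variables are converted to ranks (ties receive their average rank), and the Pearson correlation is then taken on those ranks. The ratings and prices are hypothetical, and the helper names are ours; the reported results come from `stats.spearmanr`.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical seller ratings and listing prices for illustration only.
example_ratings = [3.0, 4.5, 4.0, 5.0, 2.5]
example_prices = [200.0, 150.0, 400.0, 900.0, 100.0]
rho_example = spearman(example_ratings, example_prices)
print(f"Spearman rho = {rho_example:.2f}")
```

Because only the ordering matters, a few extremely expensive listings cannot dominate the coefficient the way they could with a Pearson correlation on raw prices.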

From the Scatter Plot:
1. Distribution of Prices by Ratings:

- Low Ratings (0-2): Phones sold by sellers with low ratings are clustered in the lower price range (<500), indicating that these sellers primarily deal in budget or used phones.
- Moderate Ratings (3-4): Sellers with moderate ratings show a broader price range, covering both budget and mid-range phones.
- High Ratings (4-5): Sellers with high ratings are more likely to sell premium and high-priced phones (>1000), suggesting a correlation between trust (ratings) and the ability to command higher prices.

2. Outliers:

Some high-priced phones are sold by low-rated sellers. These could represent niche products, unique offerings, or potentially fraudulent listings that require further investigation.

Consumers in Vietnam appear to associate high seller ratings with trustworthiness, making them more likely to purchase higher-priced items from such sellers, while lower-rated sellers might attract price-sensitive consumers who prioritize affordability over trust or reputation.

*   **Hypothesis 5:** *Company sellers have a statistically significant higher average rating than individual sellers.*
    *   **Statistical Test:** Two-sample t-test (or Mann-Whitney U test) to compare mean ratings of company vs. individual sellers.
    *   **Null Hypothesis:** There is no statistically significant difference between the mean ratings of company sellers and individual sellers.
    *   **Observation:** Company sellers are often perceived as more reliable due to warranties and consistent pricing, which could result in higher average ratings. However, this needs specific validation from rating data.

In [None]:
# Separate ratings for company sellers and individual sellers.
company_ratings = df.filter(pl.col("is_company"))["average_rating"].to_numpy()  # Extract ratings for company sellers.
individual_ratings = df.filter(~pl.col("is_company"))["average_rating"].to_numpy()  # Extract ratings for individual sellers.

# Perform Mann-Whitney U test.
u_stat, p_value = stats.mannwhitneyu(
    company_ratings, individual_ratings, alternative="greater"  # Test if company ratings are greater.
)
test_used = "Mann-Whitney U test"  # Specify test used.

# Output test results.
print(f"Test used: {test_used}")
print(f"P-value: {p_value:.4f}")

# Make decision based on p-value.
if p_value < alpha:
    print(
        "Reject the null hypothesis. Conclude that: Company sellers have a significantly higher average rating than individual sellers."
    )
else:
    print(
        "Fail to reject the null hypothesis. Conclude that: There is no significant difference in average ratings between company and individual sellers."
    )

fig = create_categorical_plot(
    df,
    x_col="is_company",
    y_col="average_rating",
    title="Seller Rating by Seller Type",
    plot_type="box",
)
fig.show()

Test used: Mann-Whitney U test
P-value: 0.0000
Reject the null hypothesis. Conclude that: Company sellers have a significantly higher average rating than individual sellers.


Test Used:

Mann-Whitney U test, a non-parametric test suitable for comparing the distributions of two independent groups when assumptions of normality are not met.

A p-value of 0.0000 strongly supports rejecting the null hypothesis, confirming a significant difference in ratings between the two groups.

Company sellers have significantly higher ratings than individual sellers.
This finding aligns with the observation that companies are perceived as more reliable due to warranties, consistent pricing, and professional service.
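With a sample this large, a p-value of 0.0000 says that a difference exists but not how large it is. One common companion measure for the Mann-Whitney U test is the rank-biserial correlation, which rescales U to the range from -1 to 1. The sketch below is a minimal illustration with hypothetical ratings; it is not part of our reported analysis, and the function name is ours.

```python
def rank_biserial(x: list[float], y: list[float]) -> float:
    """Rank-biserial correlation derived from the Mann-Whitney U statistic.

    Returns 1.0 when every x exceeds every y, -1.0 in the opposite case,
    and 0.0 when wins and losses balance out.
    """
    n_pairs = len(x) * len(y)
    wins = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    u = wins + 0.5 * ties
    return 2.0 * u / n_pairs - 1.0


# Hypothetical ratings for illustration only.
company_sample = [4.8, 4.9, 4.5, 5.0]
individual_sample = [4.5, 3.0, 4.9, 2.0]

effect_size = rank_biserial(company_sample, individual_sample)
print(f"Rank-biserial correlation = {effect_size:.3f}")
```

Reporting such an effect size alongside the p-value would show whether the rating gap between seller types is large or merely detectable.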

*From the Box Plot:*
1. Company Sellers (true):

- Company sellers have consistently high ratings, with most ratings clustering around the upper limit of the scale (close to 5).
- The interquartile range (IQR) is narrow, suggesting less variation in ratings and a consistent reputation for quality and reliability.
- Few outliers exist in the lower range, which could be due to occasional negative experiences or new companies without established reputations.
2. Individual Sellers (false):

- Individual sellers exhibit lower average ratings overall, with a wider IQR and more variation in ratings.
- Ratings span a much broader range, with many sellers receiving ratings below 3, indicating variability in trustworthiness or quality of service.
- Some individual sellers achieve high ratings (close to 5), showing that exceptional service from individuals can still rival companies.

In reality, Vietnamese consumers are likely to associate company sellers with professionalism, reliability, and better after-sales service, leading to consistently higher ratings. Conversely, individual sellers may face challenges in building trust due to inconsistent service or lack of brand recognition. Buyers might accept lower-rated individual sellers if the price difference is substantial, especially for budget-conscious consumers.

*   **Hypothesis 6:** *There is a statistically significant positive correlation between the number of `sold_ads` and seller average ratings.*
    *   **Statistical Test:** Pearson correlation coefficient (or Spearman) to measure the relationship.
    *   **Null Hypothesis:** There is no statistically significant correlation between the number of `sold_ads` and seller average ratings.
    *   **Observation:** The hypothesis aligns with the idea that experienced sellers with many sales build trust, resulting in higher ratings. However, we consider that more data linking sold_ads and ratings is necessary for confirmation.

In [None]:
# Extract sold_ads and average_rating columns.
sold_ads = df["sold_ads"].to_numpy()  # Convert number of sold ads to NumPy array.
ratings = df["average_rating"].to_numpy()  # Convert average ratings to NumPy array.

# Calculate Spearman rank correlation coefficient.
corr_coef, p_value = stats.spearmanr(sold_ads, ratings)
test_used = "Spearman correlation"  # Specify test used.

# Output test results.
print(f"Test used: {test_used}")
print(f"Correlation coefficient: {corr_coef:.4f}")
print(f"P-value: {p_value:.4f}")

# Make decision based on p-value and correlation coefficient.
if p_value < alpha and corr_coef > 0:
    print(
        "Reject the null hypothesis. Conclude that: There is a statistically significant positive correlation between the number of sold_ads and seller average ratings."
    )
elif p_value < alpha and corr_coef < 0:
    print(
        "Reject the null hypothesis. Conclude that: There is a statistically significant negative correlation between the number of sold_ads and seller average ratings."
    )
elif p_value < alpha and corr_coef == 0:
    print(
        "Reject the null hypothesis. Conclude that: There is a statistically significant but not positive correlation between the number of sold_ads and seller average ratings."
    )
else:
    print(
        "Fail to reject the null hypothesis. Conclude that: There is no statistically significant correlation between the number of sold_ads and seller average ratings."
    )

fig = create_scatter_plot(
    df,
    x_col="sold_ads",
    y_col="average_rating",
    title="Seller Rating vs. Number of Sold Ads",
)
fig.show()

Test used: Spearman correlation
Correlation coefficient: 0.3814
P-value: 0.0000
Reject the null hypothesis. Conclude that: There is a statistically significant positive correlation between the number of sold_ads and seller average ratings.


Test Used:

Spearman correlation, which measures the strength and direction of a rank-based (monotonic) relationship and is therefore suitable when the relationship is not linear.
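To illustrate why a rank-based coefficient suits monotonic but non-linear relationships, the following minimal sketch compares Pearson and Spearman on synthetic data. The values and the helper names `pearson`, `average_ranks`, and `spearman` are illustrative assumptions, not part of the dataset or the analysis above. It uses only the standard library so that each step is explicit; `stats.spearmanr` computes the same kind of coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # Average of 1-based positions i+1..j+1.
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y grows exponentially with x (monotonic, not linear).
x = list(range(1, 11))
y = [math.exp(v) for v in x]

print(f"Pearson:  {pearson(x, y):.4f}")
print(f"Spearman: {spearman(x, y):.4f}")
```

Here the ordering of `y` matches the ordering of `x` exactly, so the Spearman coefficient is 1, while the Pearson coefficient is noticeably lower because the relationship is not a straight line.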

Results:
- Correlation Coefficient:

A Spearman coefficient of 0.3814 suggests a moderate positive correlation between the number of sold ads and seller ratings.
Sellers with higher sales tend to have better ratings, supporting the idea that experience builds trust and reliability.
- P-value:

The reported p-value of 0.0000 is rounded to four decimals (that is, p < 0.0001), so the correlation is statistically significant at α = 0.05 and the null hypothesis is rejected.

There is a statistically significant positive correlation between the number of sold ads and seller ratings. Sellers with more sales may therefore enjoy higher buyer trust due to proven reliability, with their ratings reflecting positive feedback accumulated over time. A correlation alone does not establish this causal direction.

- Moderate Correlation:

While there is a clear relationship, a coefficient of 0.3814 suggests other factors also influence seller ratings (e.g., product quality, customer service).

*From the Scatter Plot:*
1. High-Volume Sellers:

Sellers with a large number of sold ads (e.g., above 1000) consistently maintain high ratings (close to 4.5 or 5).
This indicates that high-volume sellers likely provide reliable and consistent service, contributing to better customer satisfaction and trust.
2. Low-Volume Sellers:

Sellers with fewer sold ads (<100) show a wider range of ratings, with many having lower ratings (below 3).
The variability in ratings suggests inconsistent performance, likely due to less experience or occasional negative interactions.
3. Outliers:

A few outliers with low ratings and high numbers of sold ads may indicate sellers who prioritize volume over quality, leading to customer dissatisfaction.
Similarly, low-volume sellers with high ratings show that quality-focused individuals can still build strong reputations despite limited activity.


Hence, the data are consistent with consumers in Vietnam being more likely to trust experienced sellers who have handled numerous transactions successfully.

## 3. Geographic-related Hypotheses:

*   **Hypothesis 7:** *The mean price of phones in major cities (Hanoi, HCMC) is statistically significantly higher than the mean price of phones in other regions.*
    *   **Statistical Test:** Kruskal-Wallis H test (non-parametric) and one-way ANOVA to compare prices across Ho Chi Minh City, Hanoi, and the remaining regions. Post-hoc tests (e.g., Tukey) can then identify which specific groups differ.
    *   **Null Hypothesis:** There is no statistically significant difference in mean prices among the regions.
    *   **Observation:** The price distribution by region shows that major cities like Ho Chi Minh City and Hanoi have higher average prices than other regions. This reflects higher purchasing power and demand for premium products.

In [None]:
# Define major cities for analysis.
major_cities = ["ho_chi_minh_city", "hanoi"]

# Extract prices for major cities.
major_cities_prices = (
    df.filter(pl.col("region_name").is_in(major_cities))  # Filter rows for major cities.
    .group_by("region_name")  # Group data by region.
    .agg(pl.col("price").alias("prices"))["prices"]  # Collect prices as a list.
    .to_list()
)

# Extract prices for other regions.
other_regions_prices = df.filter(
    ~pl.col("region_name").is_in(major_cities)  # Exclude major cities.
)["price"].to_numpy()

# Ensure sufficient data for statistical testing.
if len(major_cities_prices) > 1 and len(other_regions_prices) > 0:
    # Perform Kruskal-Wallis test (non-parametric).
    prices_to_test = major_cities_prices + [other_regions_prices]
    h_stat, p_kruskal = stats.kruskal(*prices_to_test)
    print(f"Kruskal Wallis H test p-value: {p_kruskal:.4f}")

    if p_kruskal < alpha:
        # Perform ANOVA test (parametric).
        f_stat, p_anova = stats.f_oneway(*prices_to_test)
        print(f"ANOVA test p-value: {p_anova:.4f}")

        # Decision based on ANOVA test.
        if p_anova < alpha:
            print(
                "Reject the null hypothesis. Conclude that: The mean price of phones in major cities (Hanoi, HCMC) is significantly higher than the mean price of phones in other regions."
            )
        else:
            print(
                "Fail to reject the null hypothesis. Conclude that: There is no significant difference in mean prices among the regions. (from ANOVA test)"
            )
    else:
        print(
            "Fail to reject the null hypothesis. Conclude that: There is no significant difference in mean prices among the regions. (from Kruskal Wallis test)"
        )
else:
    print("There is not enough data for both groups. Cannot perform statistical tests.")


# Classify the regions before plotting
def classify_region_for_plot(area_name):
    if any(city in area_name for city in major_cities):
        return "major_cities"
    else:
        return "other"


df = df.with_columns(
    pl.col("region_name")
    .map_elements(classify_region_for_plot, return_dtype=pl.String)
    .alias("region_type")
)
fig = create_categorical_plot(
    df,
    x_col="region_type",
    y_col="price",
    title="Price Distribution Major Cities vs. Other regions",
    plot_type="box",
)
fig.show()

Kruskal Wallis H test p-value: 0.0000
ANOVA test p-value: 0.0031
Reject the null hypothesis. Conclude that: The mean price of phones in major cities (Hanoi, HCMC) is significantly higher than the mean price of phones in other regions.


- P-values:

  - ANOVA: p-value = 0.0031, indicating statistically significant differences among the group means.
  - Kruskal-Wallis: p-value = 0.0000 (that is, p < 0.0001 after rounding), confirming the result even if the assumptions of normality or homogeneity of variance are violated.

Both tests reject the null hypothesis of equal prices across the three groups compared (Ho Chi Minh City, Hanoi, and all other regions). These omnibus tests establish only that the groups differ; the direction, with major cities priced higher than other regions, is read from the box plot.
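An omnibus test such as ANOVA or Kruskal-Wallis does not by itself establish which group is higher. One way to test a directional claim is a one-sided permutation test on the difference in means, sketched below. This is a swapped-in technique for illustration, not the method used in this notebook; the function name `one_sided_permutation_test` and the price lists are hypothetical. A rank-based alternative would be `stats.mannwhitneyu` with `alternative="greater"`.

```python
import random
from statistics import mean


def one_sided_permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate P(mean(a) - mean(b) >= observed) under random relabeling."""
    rng = random.Random(seed)  # Fixed seed for reproducibility.
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical prices in USD, not taken from cleaned_info.csv.
major_city_prices = [700, 650, 800, 720, 690, 750, 810, 670]
other_region_prices = [400, 450, 380, 420, 500, 430, 410, 460]

observed_diff, p_one_sided = one_sided_permutation_test(
    major_city_prices, other_region_prices
)
print(f"Observed difference in means: {observed_diff:.2f}")
print(f"One-sided permutation p-value: {p_one_sided:.4f}")
```

A small one-sided p-value supports the claim that the first group has the higher mean, which is the directional statement Hypothesis 7 makes.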

*From the Box Plot:*
1. Price Distribution in Major Cities:

- There is a wider interquartile range (IQR) in major cities, reflecting greater variation in prices. This suggests a mix of premium and budget phones available in urban markets.
- Outliers in major cities reach prices above 1,500 USD, indicating the presence of high-end flagship models catering to affluent customers.
2. Price Distribution in Other Regions:

- The median price is lower, and the IQR is narrower, showing that phones in other regions are predominantly in the budget to mid-range categories.
- Fewer high-priced outliers suggest limited demand or availability for premium models in these areas.

Major cities often serve as hubs for new product releases, which further drives prices up. The higher prices reflect greater purchasing power, stronger demand for premium devices, and possibly higher operational costs for sellers. In contrast, consumers in rural and suburban areas are more price-sensitive, and prices there are likely lower due to:

- Less demand for premium devices.
- More competitive pricing strategies to attract buyers.

*   **Hypothesis 8:** *There is a statistically significant association between brands and regions.*
    *   **Statistical Test:** Chi-squared test to test association of categorical variables.
    *   **Null Hypothesis:** There is no statistically significant association between brand and regions.
    *   **Observation:** While the listing distribution shows geographic concentration, specific brand preferences by region were not provided in the visualizations.

In [None]:
# Create contingency table.
contingency_table = (
    df.group_by(["brand", "region_name"])  # Group data by brand and region.
    .len()  # Calculate counts for each group.
    .pivot(index="brand", on="region_name", values="len")  # Reshape data into a pivot table.
    .fill_null(0)  # Replace missing values with 0.
    .drop("brand")  # Remove the brand index to get only the numerical data.
    .to_numpy()  # Convert to a NumPy array for statistical testing.
)

# Perform Chi-squared test.
chi2_stat, p_value, _, _ = stats.chi2_contingency(contingency_table)

# Output test results.
print(f"Chi-squared statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Make a decision based on p-value.
if p_value < alpha:
    print(
        "Reject the null hypothesis. Conclude that: There is a statistically significant association between brand and region."
    )
else:
    print(
        "Fail to reject the null hypothesis. Conclude that: There is no statistically significant association between brand and region."
    )

Chi-squared statistic: 2249.7385
P-value: 0.0000
Reject the null hypothesis. Conclude that: There is a statistically significant association between brand and region.


Test used:

Chi-Squared Test, which is suitable for testing independence between two categorical variables (brands and regions).

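The statistic of 2249.7385 indicates a substantial deviation of the observed counts from those expected under independence. Because the chi-squared statistic grows with sample size, it does not by itself measure how strong the association is; an effect size such as Cramér's V would be needed for that. In addition, a p-value of 0.0000 (p < 0.0001 after rounding) confirms the association is statistically significant, leading to rejection of the null hypothesis.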
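The chi-squared statistic scales with the number of listings, so an effect size helps when judging the strength of an association. The sketch below computes Cramér's V, defined as $\sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for an $r \times c$ table, on hypothetical counts. The tables and the function names `chi_squared_statistic` and `cramers_v` are assumptions for illustration and do not reproduce the notebook's data. No continuity correction is applied, which matches what `stats.chi2_contingency` does for tables larger than 2×2.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), in [0, 1]."""
    chi2, n = chi_squared_statistic(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical brand-by-region counts (rows: brands, columns: regions).
small_table = [[30, 20, 10], [20, 30, 20], [10, 20, 40]]
# Same proportions with ten times as many listings.
large_table = [[count * 10 for count in row] for row in small_table]

for name, table in [("small", small_table), ("large", large_table)]:
    chi2, n = chi_squared_statistic(table)
    print(f"{name}: n={n}, chi2={chi2:.2f}, Cramér's V={cramers_v(table):.4f}")
```

The two tables have identical proportions, so Cramér's V is the same for both, while the chi-squared statistic of the larger table is ten times that of the smaller one.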

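Certain brands are likely more concentrated in specific regions, reflecting geographic preferences, availability, or marketing strategies. The test does not show which brand-region pairs drive the association, so the patterns below are plausible interpretations rather than tested results.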

1. Regional Brand Preferences:

- Urban Areas (Hanoi, HCMC):
Likely dominated by premium brands (e.g., Apple, Samsung) due to higher purchasing power and demand for high-end devices.
- Rural Areas:
May prefer budget or mid-range brands (e.g., Oppo, Xiaomi), reflecting affordability and functional priorities.
2. Geographic Concentration:

- Premium brands may focus their presence in regions with higher demand and purchasing capacity.
- Domestic or lesser-known brands might have a stronger presence in less competitive or rural areas.

*   **Hypothesis 9:** *The proportion of high-end phones is significantly higher in urban regions compared to rural regions.*
    *   **Statistical Test:** Chi-squared test to compare proportions. This requires a price threshold that defines "high-end".
    *   **Null Hypothesis:** The proportion of high-end phones is not significantly different between urban and rural regions.
    *   **Observation:** The data suggests urban areas like Ho Chi Minh City dominate in listings and high average prices, indicating a higher proportion of premium devices, while rural areas tend to have lower-priced listings, likely reflecting demand for budget-conscious devices.

Definition of High-End Phones *(in this used market)*: A specific price threshold is required to categorize "high-end" devices. This analysis uses `high_end_price = 600` USD, as set at the top of the notebook, so phones priced above $600 count as high-end.

In [None]:
# Define urban and rural postfixes.
urban_postfixes = ["city", "district", "town"]
rural_postfixes = ["rural_district", "commune", "township"]

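# Define a function to classify regions.
def classify_region(area_name):
    # Check rural postfixes first: "rural_district" contains "district" and
    # "township" contains "town", so the order of the checks matters.
    if any(postfix in area_name for postfix in rural_postfixes):
        return "rural"
    elif any(postfix in area_name for postfix in urban_postfixes):
        return "urban"
    else:
        return "unknown"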

# Apply the classification to the dataset.
df = df.with_columns(
    pl.col("area_name")
    .map_elements(classify_region, return_dtype=pl.String)
    .alias("region_type")
)

# Filter the data to include only rural and urban regions.
df = df.filter(pl.col("region_type").is_in(["urban", "rural"]))

# Create a new column to classify phones as high-end based on price threshold.
df = df.with_columns((pl.col("price") > high_end_price).alias("is_high_end"))

# Create a contingency table for region type and high-end phone classification.
contingency_table = (
    df.group_by(["region_type", "is_high_end"])
    .len()
    .pivot(index="region_type", on="is_high_end", values="len")
    .fill_null(0)
    .drop("region_type")
    .to_numpy()
)

# Perform the Chi-squared test.
chi2_stat, p_value, _, _ = stats.chi2_contingency(contingency_table)

# Output results.
print(f"Chi-squared statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Decision based on the p-value.
if p_value < alpha:
    print(
        "Reject the null hypothesis. Conclude that: The proportion of high-end phones is significantly higher in urban regions compared to rural regions."
    )
else:
    print(
        "Fail to reject the null hypothesis. Conclude that: There is no significant difference in the proportion of high-end phones between urban and rural regions."
    )

Chi-squared statistic: 44.3843
P-value: 0.0000
Reject the null hypothesis. Conclude that: The proportion of high-end phones is significantly higher in urban regions compared to rural regions.


Test Used:

Chi-Squared Test, which is suitable for comparing categorical proportions (high-end vs. non-high-end phones across regions).

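A statistic of 44.3843 indicates that the observed counts of high-end phones deviate from those expected if region type and price tier were independent. Moreover, a p-value of 0.0000 (p < 0.0001 after rounding) confirms statistical significance, leading to rejection of the null hypothesis.


The proportion of high-end phones differs significantly between urban and rural regions. The chi-squared test is two-sided, so the direction of the difference (a higher proportion in urban regions) rests on the counts in the contingency table and on the price patterns noted in the observation for this hypothesis.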

1. Urban Dominance in High-End Phones:

- Urban areas like Ho Chi Minh City and Hanoi likely have:
  - Greater purchasing power and demand for premium devices.
  - A higher density of company sellers and flagship stores offering high-end models.
- Rural regions prioritize affordability, leading to a lower proportion of high-end phones.
2. Proportional Differences: Urban areas are not only dominant in total listings but also skew towards higher-priced segments, reinforcing this finding.
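Because Hypothesis 9 is directional, a one-sided two-proportion z-test is one way to check it directly. The sketch below is an illustration under stated assumptions, not the test run in this notebook: the counts are hypothetical and the function name `one_sided_two_proportion_ztest` is introduced here only for the example.

```python
import math
from statistics import NormalDist


def one_sided_two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Test H1: the proportion in group A is larger than in group B."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 1 - NormalDist().cdf(z)  # Upper-tail probability.
    return p_a, p_b, z, p_value


# Hypothetical counts: (high-end listings, total listings) per region type.
urban_high_end, urban_total = 300, 2000
rural_high_end, rural_total = 40, 500

p_urban, p_rural, z_stat, p_one_sided = one_sided_two_proportion_ztest(
    urban_high_end, urban_total, rural_high_end, rural_total
)
print(f"Urban proportion: {p_urban:.3f}, rural proportion: {p_rural:.3f}")
print(f"z = {z_stat:.3f}, one-sided p-value = {p_one_sided:.6f}")
```

Printing both proportions alongside the test makes the direction of the difference explicit, which a chi-squared statistic alone does not show.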