About the homework: In some tasks, in addition to writing code, answers to questions and conclusions are required (there are special Markdown cells marked with **Answer**).

The ability to analyze the results of experiments is an important skill. Therefore, answers carry more weight than the code: the code accounts for 30% of the task grade, while answers to questions account for 70%.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

Download the [train.csv](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv) file with the House Prices data.

In [None]:
df = pd.read_csv('train.csv')
df.sample(3)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
205,206,20,RL,99.0,11851,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2009,WD,Normal,180500
1403,1404,20,RL,49.0,15256,Pave,,IR1,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,282922
1098,1099,50,RM,50.0,6000,Pave,,Reg,Lvl,AllPub,...,0,,,,0,7,2009,WD,Normal,128000


# Task 1 (2 points)

Test the hypothesis that the expected value ($\mu$) of the garage size in square feet (`'GarageArea'`) is 485. Use a two-sided alternative hypothesis. Choose the test that is suitable for the data (and explain why).

Do we reject/accept the hypothesis if $\alpha = 5\%$ ? And if $\alpha = 1\%$ ?

In [None]:
#YOUR CODE

from scipy import stats

# Testing if mean GarageArea is 485 square feet
garage_area = df['GarageArea'].dropna()  # Drop NaN values to avoid errors in calculation

# Setting the hypothesized mean and significance levels
hypothesized_mean = 485
alpha_5_percent = 0.05
alpha_1_percent = 0.01

# Performing the one-sample t-test
t_stat, p_value = stats.ttest_1samp(garage_area, hypothesized_mean)

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

# Decision based on the significance level
if p_value < alpha_5_percent:
    print("Reject the null hypothesis at the 5% significance level.")
else:
    print("Fail to reject the null hypothesis at the 5% significance level.")

if p_value < alpha_1_percent:
    print("Reject the null hypothesis at the 1% significance level.")
else:
    print("Fail to reject the null hypothesis at the 1% significance level.")


t-statistic: -2.1481193678999415
p-value: 0.03186850064864378
Reject the null hypothesis at the 5% significance level.
Fail to reject the null hypothesis at the 1% significance level.


**Answer** \#YOUR ANSWER

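A one-sample t-test is suitable here: we compare the mean of a single numeric sample with a fixed value, the population variance is unknown, and the sample is large, so the sample mean is approximately normal even if `'GarageArea'` itself is not. The p-value is the probability of observing a test statistic at least as extreme as ours if the null hypothesis ($\mu = 485$) were true.

With $p \approx 0.032$, we reject the null hypothesis at the 5% significance level ($0.032 < 0.05$) but fail to reject it at the 1% significance level ($0.032 > 0.01$).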
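
As a sanity check on the choice of test, the one-sample t-statistic can be recomputed by hand from $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$. The sketch below uses synthetic data (the `sample` array and its parameters are made up for illustration, not taken from `train.csv`) and compares the manual values with `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a numeric column; values are illustrative only
rng = np.random.default_rng(0)
sample = rng.normal(loc=470, scale=200, size=1000)
mu_0 = 485  # hypothesized mean

# Manual one-sample t-statistic: (sample mean - mu_0) / standard error
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu_0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Both pairs of numbers should agree, which shows that the library call is just this formula plus a lookup in the t distribution.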

# Task 2 (2 points)

Is the condition of the material on the exterior (`'ExterCond'`) independent of the exterior material quality (`'ExterQual'`)?

Find it out using hypothesis testing ($\alpha = 5\%$)

In [None]:
#YOUR CODE

from scipy.stats import chi2_contingency


# Creating a contingency table of ExterCond and ExterQual
contingency_table = pd.crosstab(df['ExterCond'], df['ExterQual'])

# Performing the chi-square test of independence
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print("Chi-square Statistic:", chi2_stat)
print("p-value:", p_value)

# Determining if we reject or fail to reject the null hypothesis at α = 5%
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: ExterCond and ExterQual are not independent.")
else:
    print("Fail to reject the null hypothesis: no evidence that ExterCond and ExterQual are dependent.")


Chi-square Statistic: 156.2895311162874
p-value: 2.9908872405484838e-27
Reject the null hypothesis: ExterCond and ExterQual are not independent.


**Answer** \#YOUR ANSWER.
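The exterior condition and the exterior quality are not independent of each other: the chi-square test of independence gives $p \approx 3.0 \times 10^{-27} < 0.05$, so we reject the null hypothesis of independence.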
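
The chi-square approximation is usually considered reliable only when the expected counts are not too small (a common rule of thumb is at least 5 in most cells). A minimal sketch of how this could be checked is shown below; the toy table and its counts are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table; the counts are invented for illustration
table = pd.DataFrame(
    [[30, 10, 2], [20, 40, 3], [5, 15, 1]],
    index=["Fa", "TA", "Gd"],
    columns=["Fa", "TA", "Gd"],
)

chi2_stat, p_value, dof, expected = chi2_contingency(table)

# Degrees of freedom are (rows - 1) * (columns - 1)
print("dof:", dof)

# Share of cells whose expected count is below 5
small_share = (expected < 5).mean()
print("Share of cells with expected count < 5:", small_share)
```

If a large share of cells falls below the threshold, merging rare categories before testing is one possible remedy.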

# Task 3 (2 points)

The *United Housing Journal™* conducted theoretical research and calculated the probabilities that a house is located in a particular zone (`'MSZoning'`):

* $0.01$ - Agriculture
* $0.01$ - Commercial
* $0.05$ - Floating Village Residential
* $0.01$ - Industrial
* $0.01$ - Residential High Density
* $0.8$ - Residential Low Density
* $0.01$ - Residential Low Density Park
* $0.1$ - Residential Medium Density

Does the data from our dataset follow this distribution?

Find it out using hypothesis testing ($\alpha = 5\%$)

In [None]:
#YOUR CODE


from scipy.stats import chisquare

# Defining the theoretical probabilities from the *United Housing Journal™*
theoretical_probs = {
    'A': 0.01,  # Agriculture
    'C': 0.01,  # Commercial
    'FV': 0.05, # Floating Village Residential
    'I': 0.01,  # Industrial
    'RH': 0.01, # Residential High Density
    'RL': 0.8,  # Residential Low Density
    'RP': 0.01, # Residential Low Density Park
    'RM': 0.1   # Residential Medium Density
}

# Calculating the observed frequencies from the dataset
# (zones that are absent from the data, or spelled differently there, get a count of 0)
observed_counts = df['MSZoning'].value_counts().reindex(theoretical_probs.keys(), fill_value=0)

# Calculating the expected frequencies based on the theoretical probabilities
total_count = observed_counts.sum()
expected_counts = [total_count * prob for prob in theoretical_probs.values()]

# Performing the chi-square goodness-of-fit test
chi2_stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)

print("Chi-square Statistic:", chi2_stat)
print("p-value:", p_value)

# Decision based on significance level α = 5%
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The data does not follow the theoretical distribution.")
else:
    print("Fail to reject the null hypothesis: The data is consistent with the theoretical distribution.")


Chi-square Statistic: 95.75258620689655
p-value: 8.111526055720843e-18
Reject the null hypothesis: The data does not follow the theoretical distribution.


**Answer** \#YOUR ANSWER

At the 5% significance level we reject the null hypothesis ($p \approx 8.1 \times 10^{-18} < 0.05$): the data does not follow the theoretical distribution.
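
One thing worth checking in a goodness-of-fit test is that the category labels in the data match the keys of the theoretical distribution exactly, because `reindex(..., fill_value=0)` silently turns any unmatched label into a zero count. The sketch below illustrates this on synthetic data. It assumes, purely for illustration, that the data spells the commercial zone as `'C (all)'`; the probabilities in `probs` are hypothetical as well.

```python
import pandas as pd
from scipy.stats import chisquare

# Synthetic zone labels; 'C (all)' is an assumed spelling of the Commercial zone
zones = pd.Series(["RL"] * 80 + ["RM"] * 12 + ["FV"] * 5 + ["C (all)"] * 3)

# Hypothetical probabilities for this toy example (they sum to 1)
probs = {"C": 0.05, "FV": 0.05, "RL": 0.8, "RM": 0.1}

# Reindexing by the raw keys silently drops 'C (all)': its count becomes 0
raw_counts = zones.value_counts().reindex(list(probs), fill_value=0)

# Renaming the label first keeps those rows in the observed counts
fixed_counts = (
    zones.replace({"C (all)": "C"}).value_counts().reindex(list(probs), fill_value=0)
)

expected = [fixed_counts.sum() * p for p in probs.values()]
chi2_stat, p_value = chisquare(f_obs=fixed_counts, f_exp=expected)

print(raw_counts.to_dict())
print(fixed_counts.to_dict())
print(chi2_stat, p_value)
```

Comparing `df['MSZoning'].unique()` with the dictionary keys before running the test is a quick way to catch such a mismatch.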

# Task 4 (2 points)

Let's compare houses that have access to different types of alleys (`'Alley'`). Do they have the same linear feet of street connected to property (`'LotFrontage'`)?

Find it out using hypothesis testing ($\alpha = 5\%$)

In [None]:
#YOUR CODE


from scipy.stats import f_oneway

# Dropping rows where 'LotFrontage' or 'Alley' is NaN to ensure valid comparison.
# A NaN in 'Alley' means no alley access, so only gravel and paved alleys are compared.
# A separate variable is used so that the full `df` stays intact for later tasks.
alley_df = df[['LotFrontage', 'Alley']].dropna()

# Grouping 'LotFrontage' values by different types of 'Alley'
gravel_frontage = alley_df[alley_df['Alley'] == 'Grvl']['LotFrontage']
paved_frontage = alley_df[alley_df['Alley'] == 'Pave']['LotFrontage']

# Performing the one-way ANOVA test
f_stat, p_value = f_oneway(gravel_frontage, paved_frontage)

print("F-statistic:", f_stat)
print("p-value:", p_value)

# Decision based on the significance level α = 5%
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in LotFrontage among different alley types.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in LotFrontage among different alley types.")


F-statistic: 14.11276801696159
p-value: 0.0003167170921464645
Reject the null hypothesis: There is a significant difference in LotFrontage among different alley types.


**Answer** \#YOUR ANSWER

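At the 5% significance level there is a significant difference in mean `'LotFrontage'` between houses with gravel and paved alleys ($p \approx 0.0003 < 0.05$), so we reject the null hypothesis of equal means. Houses without alley access (`NaN` in `'Alley'`) were not included in the comparison.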
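
Since only two groups are compared, a one-way ANOVA is equivalent to a two-sample Student's t-test with pooled variance: the F-statistic equals the squared t-statistic and the p-values coincide. The sketch below demonstrates this on synthetic groups (`group_a` and `group_b` are invented and unrelated to the dataset).

```python
import numpy as np
from scipy import stats

# Two synthetic groups; sizes and parameters are invented for illustration
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=40)
group_b = rng.normal(loc=58, scale=10, size=35)

f_stat, p_anova = stats.f_oneway(group_a, group_b)

# Student's t-test with pooled variance (equal_var=True is the default)
t_stat, p_ttest = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f_stat, t_stat ** 2)  # the F-statistic equals t squared
print(p_anova, p_ttest)     # the p-values coincide
```

Both tests assume roughly equal variances in the groups; if that looks doubtful, Welch's t-test (`equal_var=False`) is one possible alternative.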

# Task 5 (2 points)

Find features with the maximal positive/negative correlation. Does it seem logical? Why/why not?

In [None]:

# Selecting only the numeric columns for correlation calculation
numeric_df = df.select_dtypes(include=['float64', 'int64'])

# Calculating the correlation matrix
correlation_matrix = numeric_df.corr()

# Find the maximal positive and negative correlation
# Unstack the matrix into a Series of pairwise correlations and filter out self-correlations
correlation_pairs = correlation_matrix.unstack()
correlation_pairs = correlation_pairs[correlation_pairs.index.get_level_values(0) != correlation_pairs.index.get_level_values(1)]

# Finding the max positive and min negative correlation
max_positive_correlation = correlation_pairs.idxmax(), correlation_pairs.max()
max_negative_correlation = correlation_pairs.idxmin(), correlation_pairs.min()

print("Max positive correlation:", max_positive_correlation)
print("Max negative correlation:", max_negative_correlation)


Max positive correlation: (('GarageCars', 'GarageArea'), 0.882475414281462)
Max negative correlation: (('BsmtFinSF1', 'BsmtUnfSF'), -0.49525146925701125)


**Answer** \#YOUR ANSWER
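The maximal positive correlation (about 0.88) is between `'GarageCars'` and `'GarageArea'`. This is logical: the more cars a garage is built to hold, the larger its area has to be.

The maximal negative correlation (about -0.50) is between `'BsmtFinSF1'` and `'BsmtUnfSF'`. This also seems logical: both measure parts of the same basement, so for a given total basement area, the more of it is finished, the less remains unfinished.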
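
A correlation matrix is symmetric, so unstacking it lists every pair twice. One possible way to get each pair once is to keep only the strict upper triangle, as in the sketch below. The DataFrame `data` and its column names are synthetic placeholders, not columns from `train.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic numeric data; column names are placeholders
rng = np.random.default_rng(2)
x = rng.normal(size=200)
data = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=200),   # strongly positively related to a
    "c": -x + rng.normal(scale=0.5, size=200),  # negatively related to a
    "d": rng.normal(size=200),                  # unrelated noise
})

corr = data.corr()

# Keep only the strict upper triangle: each pair appears once, the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

print("Most positive:", pairs.idxmax(), pairs.max())
print("Most negative:", pairs.idxmin(), pairs.min())
```

Sorting `pairs` with `sort_values()` would additionally show the runner-up pairs, which helps when judging whether the extreme values are isolated or part of a cluster of related features.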
