<a href="https://colab.research.google.com/github/Kartavya-Jharwal/Kartavya_Business_Analytics2025/blob/main/Class_Assignments/week10/Week_10_Session_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Fundamentals of Business Analytics - Week 10

## Notes

- Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the data.

- A scatter plot can be used to visualize the relationship between the X and Y variables.

- In simple linear regression:
	- There is only one independent variable, X.
	- The relationship between X and Y is described by a linear function.
	- Changes in Y are assumed to be related to changes in X.


#### Simple Linear Regression Model  

- Yi = β0 + β1Xi + εi
- Dependent Variable = Population Y intercept + (Population Slope Coefficient * Independent Variable) + Random Error term  
- The simple linear regression equation, Ŷi = b0 + b1Xi, provides an estimate of the population regression line.

- b0 is the estimated mean value of Y when the value of X is zero.
- b1 is the estimated change in the mean value of Y as a result of a one-unit increase in X.

- Homoscedasticity: the random error term is assumed to have the same variance at every value of X.
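As a rough illustration of how b0 and b1 are obtained, the sketch below applies the standard least-squares formulas to a small made-up dataset. The arrays `x` and `y` are hypothetical values chosen for easy arithmetic; they are not taken from `house_prices.csv`.

```python
import numpy as np

# Hypothetical toy data (assumption, not from house_prices.csv):
# x = house size in sq ft, y = price in thousands of currency units
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([200.0, 270.0, 330.0, 410.0, 480.0])

x_bar, y_bar = x.mean(), y.mean()

# Least-squares slope: sum of cross-deviations over sum of squared X deviations
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Least-squares intercept: the fitted line passes through (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

# Fitted values and residuals (observed minus fitted)
y_hat = b0 + b1 * x
residuals = y - y_hat

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print("residuals:", residuals.round(4))
```

For this toy data the residuals sum to zero, which is a property of any least-squares line that includes an intercept.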



# House Price Prediction Using Simple Linear Regression

This notebook reads house price data from a CSV file, visualizes it with a scatter plot, and performs simple linear regression to model the relationship between house size and price.

In [None]:
# IMPORT OUR LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# plotting defaults
sns.set(style='whitegrid')
%matplotlib inline

## STEP 1 - SET OBJECTIVE

**t test for a population slope:**

Is there a linear relationship between House prices and Size?

Null and alternative hypotheses:

H0: β1 = 0 (no linear relationship)

H1: β1 ≠ 0 (linear relationship does exist)

Significance level: α = 0.05 (equivalently, a 95% confidence interval for the slope)
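The test statistic for this hypothesis is t = b1 / SE(b1), compared against a t distribution with n − 2 degrees of freedom. The sketch below works through that calculation by hand on a small hypothetical dataset (the `x` and `y` arrays are made up for illustration, and `scipy` is assumed to be installed, as it is wherever `statsmodels` is).

```python
import numpy as np
from scipy import stats

# Hypothetical toy data (assumption, not from house_prices.csv)
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([200.0, 270.0, 330.0, 410.0, 480.0])
n = len(x)
alpha = 0.05

# Least-squares estimates of slope and intercept
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()

# Standard error of the slope: sqrt(MSE / Sxx), with MSE = SSE / (n - 2)
residuals = y - (b0 + b1 * x)
mse = np.sum(residuals ** 2) / (n - 2)
se_b1 = np.sqrt(mse / sxx)

# Two-sided t test of H0: beta1 = 0
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval for the slope
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3e}, reject H0: {reject_h0}")
print(f"95% CI for slope: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

The decision rule can be read either way: reject H0 when the p-value is below α, or equivalently when the 95% confidence interval for the slope does not contain zero.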

In [None]:
# Main analysis: load data, visualize, fit OLS, report t-test and 95% CI for slope
try:
    df = pd.read_csv('house_prices.csv')
    print("Loaded 'house_prices.csv'.")
except FileNotFoundError:
    print("'house_prices.csv' not found — creating a synthetic example dataset for demonstration.")
    np.random.seed(0)
    n = 100
    df = pd.DataFrame({
        'Size': np.random.normal(1500, 300, n).round(1),
    })
    # true linear relation: Price = 50 * Size + noise
    df['Price'] = (50 * df['Size'] + np.random.normal(0, 20000, n)).round(1)

# quick head
print(df.head())

# scatter plot of the raw data
plt.figure(figsize=(8,6))
sns.scatterplot(x='Size', y='Price', data=df)
plt.title('House Price vs Size')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (currency units)')
plt.show()  # render the scatter plot before the regression output is printed

# fit OLS using formula API
model = smf.ols('Price ~ Size', data=df).fit()
print('\n=== OLS Summary ===')
print(model.summary())

# t-test for slope (Size)
slope_t = model.tvalues['Size']
slope_p = model.pvalues['Size']
ci = model.conf_int(alpha=0.05).loc['Size']

print(f"\nSlope t-statistic: {slope_t:.4f}")
print(f"Slope p-value: {slope_p:.4e}")
print(f"95% CI for slope: [{ci[0]:.4f}, {ci[1]:.4f}]")

alpha = 0.05
if slope_p < alpha:
    print('\nConclusion: Reject H0 — there is evidence of a linear relationship between Size and Price at alpha=0.05.')
else:
    print('\nConclusion: Fail to reject H0 — no evidence of linear relationship at alpha=0.05.')

# plot regression line
plt.figure(figsize=(8,6))
sns.regplot(x='Size', y='Price', data=df, ci=95, scatter_kws={'s':40, 'alpha':0.6})
plt.title('House Price vs Size with OLS fit')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (currency units)')
plt.show()

# ANOVA table (optional)
try:
    anova_results = anova_lm(model)
    print('\nANOVA results:\n', anova_results)
except Exception as e:
    print('\nANOVA could not be computed:', e)
