# Data Preparation

In [2]:
import pandas as pd

housing = pd.read_csv(r"C:\Users\lenovo\Downloads\archive (3)housing.csv")

print(housing.info())
print(housing.head())


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

X = housing[['median_income']]
y = housing['median_house_value']

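def perform_linear_regression(test_size, random_state, train_fraction):
    """Fit on a fraction of the training split, plot the fit, and return the test MSE."""
    # Shuffle and split; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    
    # Keep only the first train_fraction of the (already shuffled) training rows
    train_size = int(len(X_train) * train_fraction)
    X_train = X_train.iloc[:train_size]
    y_train = y_train.iloc[:train_size]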
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    
    plt.figure(figsize=(10, 6))
    plt.scatter(X_test, y_test, color='blue', label='Actual values')
    plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression line')
    plt.xlabel('Median Income')
    plt.ylabel('Median House Value')
    plt.title(f'Train Fraction: {train_fraction:.0%}, Test Size: {test_size:.0%}, Random State: {random_state}\nMSE: {mse:.2f}')
    plt.legend()
    plt.show()
    
    return mse

mse = perform_linear_regression(test_size=0.2, random_state=42, train_fraction=1.0)
print(f'Mean Squared Error: {mse:.2f}')


**3. Experimentation**

We'll analyze how the regression line and its performance change by varying:

- **Train-test split sizes**: 70-30, 80-20, 90-10
- **Random seeds**: Different values to observe variability
- **Dataset size**: Training on 25%, 50%, 75%, and 100% of the training data

For each combination, we'll plot the regression line and note the Mean Squared Error (MSE).
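The combinations listed above can be enumerated in a loop. The following is a minimal sketch, not the notebook's own code: `run_grid` is a hypothetical helper, the seed values `0, 1, 42` are arbitrary, and the data is a synthetic stand-in so the block runs without the CSV file. Plotting is left out to keep it short.

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def run_grid(X, y, test_sizes=(0.3, 0.2, 0.1), seeds=(0, 1, 42),
             train_fractions=(0.25, 0.5, 0.75, 1.0)):
    """Return one result dict per (test_size, seed, train_fraction) combination."""
    results = []
    for test_size, seed, fraction in product(test_sizes, seeds, train_fractions):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        # Use only the first `fraction` of the shuffled training rows
        n_used = int(len(X_train) * fraction)
        model = LinearRegression().fit(X_train[:n_used], y_train[:n_used])
        mse = mean_squared_error(y_test, model.predict(X_test))
        results.append({"test_size": test_size, "seed": seed,
                        "train_fraction": fraction, "mse": mse})
    return results


# Synthetic stand-in data (an assumption; replace with X and y from the housing data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.5, 15.0, size=(1000, 1))
y_demo = 40000.0 * X_demo[:, 0] + 45000.0 + rng.normal(0.0, 80000.0, size=1000)

grid_results = run_grid(X_demo, y_demo)
for row in grid_results[:4]:
    print(row)
```

With the notebook's data, the call would presumably be `run_grid(X, y)`; the returned list can be passed to `pd.DataFrame` for easier comparison.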

**Observations**

- **Train-test split sizes**: As the training share grows (e.g., from 70% to 90%), the model has more data to fit, which can reduce the MSE. The test set shrinks at the same time, so the MSE is computed from fewer points and becomes a noisier estimate of performance on unseen data.

- **Random seeds**: Changing the random seed changes which rows land in the training and test sets. The MSE therefore varies from seed to seed, and the variation is larger when the dataset (or the test set) is small. Fixing `random_state` makes a given split reproducible, and comparing several seeds shows how much of an MSE difference is just split-to-split noise.

- **Dataset size**: Training on a smaller fraction of the training data (e.g., 25%) estimates the slope and intercept from fewer points, so the fitted line varies more from run to run and the test MSE can be higher. As the fraction grows toward 100%, the fit stabilizes and the MSE typically decreases or levels off.
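One way to check these observations rather than assume them is to repeat each setting over several seeds and compare the mean and spread of the MSE. The sketch below does this under stated assumptions: `mse_for` and `summarize` are hypothetical helpers, 20 seeds is an arbitrary choice, and the data is synthetic so the block runs on its own. The last lines also confirm that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def mse_for(X, y, test_size, seed, train_fraction=1.0):
    """Fit on a fraction of the training split and return the test-set MSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    n_used = int(len(X_train) * train_fraction)
    model = LinearRegression().fit(X_train[:n_used], y_train[:n_used])
    return mean_squared_error(y_test, model.predict(X_test))


def summarize(X, y, seeds, test_size=0.2, train_fraction=1.0):
    """Mean and standard deviation of the test MSE across random seeds."""
    scores = np.array([mse_for(X, y, test_size, s, train_fraction) for s in seeds])
    return scores.mean(), scores.std()


# Synthetic stand-in data (an assumption; not the housing CSV)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.5, 15.0, size=(2000, 1))
y_demo = 40000.0 * X_demo[:, 0] + 45000.0 + rng.normal(0.0, 80000.0, size=2000)

seeds = range(20)
split_summary = {ts: summarize(X_demo, y_demo, seeds, test_size=ts)
                 for ts in (0.3, 0.2, 0.1)}
fraction_summary = {f: summarize(X_demo, y_demo, seeds, train_fraction=f)
                    for f in (0.25, 0.5, 0.75, 1.0)}

for ts, (mean, std) in split_summary.items():
    print(f"test_size={ts:.0%}: mean MSE={mean:.3e}, std={std:.3e}")
for f, (mean, std) in fraction_summary.items():
    print(f"train_fraction={f:.0%}: mean MSE={mean:.3e}, std={std:.3e}")

# The same seed yields the same test rows; a different seed yields different ones
idx = np.arange(100)
test_a = train_test_split(idx, test_size=0.2, random_state=42)[1]
test_b = train_test_split(idx, test_size=0.2, random_state=42)[1]
test_c = train_test_split(idx, test_size=0.2, random_state=7)[1]
same_split = np.array_equal(test_a, test_b)
different_split = not np.array_equal(test_a, test_c)
```

The standard deviation across seeds gives a rough scale for judging differences: a gap in MSE between two configurations that is smaller than this spread is hard to distinguish from split-to-split noise.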

**Conclusion**

The measured performance of a linear regression model depends on the train-test split ratio, the random seed used for splitting, and the amount of training data. Fixing the seed and stating the split ratio alongside each reported MSE keeps results reproducible and comparable.