
# Modeling
Load the American Housing dataset (`American_Housing_Data_20231209.csv`), prepare the data for linear regression to predict house prices, train a linear regression model, and evaluate its performance.

## Load the dataset


In [13]:
import pandas as pd

df = pd.read_csv('/content/Dataset/American_Housing_Data_20231209.csv')
display(df.head())

| Unnamed: 0 | Zip Code | Price | Beds | Baths | Living Space | Address | City | State | Zip Code Population | Zip Code Density | County | Median Household Income | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10013 | 3999000.0 | 2 | 3 | 1967 | 74 GRAND ST APT 3 | New York | New York | 29563 | 20967.9 | New York | 370046.0 | 40.72001 | -74.00472 |
| 1 | 10013 | 3999000.0 | 2 | 3 | 1967 | 74 GRAND ST APT 3 | New York | New York | 29563 | 20967.9 | New York | 370046.0 | 40.72001 | -74.00472 |
| 2 | 10014 | 1650000.0 | 1 | 1 | 718 | 140 CHARLES ST APT 4D | New York | New York | 29815 | 23740.9 | New York | 249880.0 | 40.73407 | -74.00601 |
| 3 | 10014 | 760000.0 | 3 | 2 | 1538 | 38 JONES ST | New York | New York | 29815 | 23740.9 | New York | 249880.0 | 40.73407 | -74.00601 |
| 4 | 10014 | 1100000.0 | 1 | 1 | 600 | 81 BEDFORD ST APT 3F | New York | New York | 29815 | 23740.9 | New York | 249880.0 | 40.73407 | -74.00601 |


In [17]:
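import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer  # used by both transformers below
from sklearn.linear_model import LinearRegression  # used by model_pipeline below
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder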


# Separate target variable
X = df.drop('Price', axis=1)
y = df['Price']

# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=np.number).columns

# Create transformers for preprocessing
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a full preprocessing and modeling pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', LinearRegression())])

# Fit the preprocessor on the features and transform them
X_processed = preprocessor.fit_transform(X)

# Display the shape of the processed data
print("Shape of processed data:", X_processed.shape)

Shape of processed data: (39981, 39196)
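
The processed matrix has almost as many columns (39,196) as rows (39,981). A likely cause, assuming `Address` is close to unique per listing, is that one-hot encoding creates one column per distinct address. The sketch below uses a small toy frame built from the addresses shown in `df.head()` (the name `toy` is illustrative, not part of the notebook) to show how counting distinct values per column predicts the one-hot width.

```python
import pandas as pd

# Toy stand-in for the categorical columns; addresses are taken from df.head().
toy = pd.DataFrame({
    "Address": [
        "74 GRAND ST APT 3",
        "140 CHARLES ST APT 4D",
        "38 JONES ST",
        "81 BEDFORD ST APT 3F",
    ],
    "City": ["New York"] * 4,
    "State": ["New York"] * 4,
})

# Distinct values per column: each one becomes a one-hot column.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# One-hot encode and compare the width with the summed cardinalities.
encoded = pd.get_dummies(toy)
print("One-hot width:", encoded.shape[1], "expected:", int(cardinality.sum()))
```

Running the same `nunique()` check on the notebook's `categorical_features` would show which columns dominate the 39,196 features and whether they are worth keeping as one-hot inputs.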


## Split the data

### Subtask:
Split the data into training and testing sets to evaluate the model's performance.


**Reasoning**:
Split the processed data and target variable into training and testing sets.



In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (31984, 39196)
Shape of X_test: (7997, 39196)
Shape of y_train: (31984,)
Shape of y_test: (7997,)


In [20]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np


# Re-split the raw features: model_pipeline applies its own preprocessing,
# so it must be fit on the unprocessed X rather than on X_processed
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Fit the pipeline to the training data
model_pipeline.fit(X_train, y_train)

print("Linear Regression model trained successfully with imputation.")

Linear Regression model trained successfully with imputation.


## Evaluate the model

### Subtask:
Evaluate the trained model on the testing data using Mean Squared Error (MSE) and R-squared.


**Reasoning**:
Evaluate the trained model by making predictions on the test set and calculating the MSE and R-squared metrics.



In [21]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the testing data
y_pred = model_pipeline.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Calculate R-squared
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 778537232811.3337
R-squared: 0.2783630652893986


## Summary:

### Data Analysis Key Findings

*   The Mean Squared Error (MSE) of the linear regression model on the test data is approximately 778,537,232,811.33 (about 7.79 × 10^11), expressed in squared price units.
*   The R-squared score of the model on the test data is approximately 0.278, indicating that the model explains about 27.8% of the variance in house prices.
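
Because MSE is in squared price units, it is hard to read directly. A minimal sketch of converting the reported value to a Root Mean Squared Error (RMSE), which is in the same units as `Price`, follows; the variable `rmse` is introduced here for illustration and is not computed in the notebook.

```python
import math

# Test-set MSE reported above (squared price units).
mse = 778537232811.3337

# RMSE is the square root of the MSE, so it is in the same units as Price.
rmse = math.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:,.0f}")
```

The result is on the order of 880,000, which gives a rough sense of the typical prediction error relative to the listing prices shown in `df.head()`.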

### Insights or Next Steps

*   An R-squared of 0.278 is low: the linear regression model leaves roughly 72% of the variance in house prices unexplained. Further feature engineering, non-linear models, or additional relevant data may be needed to improve performance.
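*   Inspect the coefficients with the largest absolute values in the trained linear regression model to see which features are most strongly associated with house prices. The numerical features are standardized, so their coefficients are directly comparable with each other; coefficients of one-hot columns for rare categories should be read with caution. This could guide further model improvement or business understanding.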
