In [None]:

import pandas as pd

# Load the downloaded training data
df = pd.read_csv(r"C:\Users\Admin\Downloads\train.csv")

# Show the first few rows of the DataFrame as a sanity check
print(df.head())




The "House Prices: Advanced Regression Techniques" dataset from Kaggle is designed for practicing feature engineering and applying advanced regression techniques to predict housing prices. It contains 79 explanatory variables that describe various aspects of residential homes in Ames, Iowa. These variables cover a wide range of features, including:

- Structural: Overall quality, size, and condition of the house.
- Functional: Design and layout aspects.
- Exterior: Materials and finish of the house's exterior.
- Interior: Details about the interior finishes and features.
- Location: Geographical and neighborhood information.
- Utilities: Types of available utilities.
- Garage: Garage size, condition, and features.
- Basement: Information about the basement, if present.
- Lot: Lot size and related features.
- Pool: Presence and quality of a pool, if any.
- Miscellaneous: Other features like outbuildings or additional amenities.

This comprehensive dataset provides a rich foundation for practicing data preprocessing, feature selection, and applying various regression models to predict house sale prices. 

In [None]:
# Displays the data types of each column
df.info()



df.info() gives us an overview of the dataset: the number of rows, each column's name, its count of non-null values, and its data type. The data type indicates whether a column is numerical (like int64 or float64) or categorical (like object).
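The numerical/categorical split described above can also be obtained programmatically. The following is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset); on the real data the same calls would be applied to df.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real training data
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000],
    "Neighborhood": ["A", "B", "A"],
    "SalePrice": [200000, 180000, 225000],
})

# Columns with a numeric dtype (int64, float64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, here the string-valued (categorical) columns
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```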

In [None]:
# Displays column names to identify the target variable
df.columns


df.columns returns an Index containing all column names. Based on the problem description, the column SalePrice is the target variable.
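Once the target column is known, it is common to separate it from the explanatory variables. A small sketch of that step, assuming a made-up two-row frame (the names X, y, and target_col are illustrative and not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "GrLivArea": [1500, 1800],
    "OverallQual": [6, 7],
    "SalePrice": [180000, 220000],
})

target_col = "SalePrice"

# Explanatory variables: every column except the target
X = toy.drop(columns=[target_col])

# Target variable as a Series
y = toy[target_col]

print(X.columns.tolist())
print(y.name)
```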

In [None]:
# Displays mean, standard deviation, and quartiles of numerical features
df.describe()


df.describe() will give us summary statistics for all numerical columns in the dataset. This includes the mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles (quartiles) for each numerical feature.

In [None]:
!pip install missingno

# Import necessary libraries
import pandas as pd
import missingno as msno

# Load the dataset
df = pd.read_csv(r"C:\Users\Admin\Downloads\train.csv")

# Visualize missing values in the dataset
msno.matrix(df)

# Calculate and display the percentage of missing values for each column
missing_percentage = df.isnull().mean() * 100
print("Percentage of missing values in each column:")
print(missing_percentage)

# Drop columns with more than 5 missing values
# (thresh keeps only columns with at least len(df) - 5 non-null entries)
df_cleaned = df.dropna(axis=1, thresh=len(df) - 5)
print("\nColumns after removing those with more than 5 missing values:")
print(df_cleaned.columns)

# Drop rows with any missing values
df_cleaned = df_cleaned.dropna(axis=0)

# Check if any missing values remain
print("\nRemaining missing values after dropping columns and rows:")
print(df_cleaned.isnull().sum())
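The thresh argument of dropna counts required non-null values, not allowed missing values, which makes it easy to be off by one. The sketch below illustrates the behaviour on a made-up frame; max_missing is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 0, 1, and 2 missing values per column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, 3.0, 4.0],
})

# Allow at most this many missing values per column
max_missing = 1

# thresh is the minimum number of non-null values a column needs to survive
kept = toy.dropna(axis=1, thresh=len(toy) - max_missing)

print(kept.columns.tolist())
```

Column b has exactly max_missing missing values and is kept; only c, which exceeds the limit, is dropped.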




Kurtosis measures the "tailedness" of a distribution: heavy tails (high kurtosis) mean extreme values and outliers are more likely than under a normal distribution, while light tails (low kurtosis) mean they are less likely.

Skewness measures the asymmetry of a distribution, showing if the data tends to have a long tail on the right (positive skew) or left (negative skew).

Both kurtosis and skewness provide valuable insights into the shape of a data distribution, helping in understanding the underlying nature of the data.
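To make the two measures concrete, the sketch below compares a symmetric sample with a right-skewed one. The data are synthetic (a normal sample and its exponential), not taken from the housing dataset, and the seed is arbitrary.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)

# Symmetric, light-tailed reference sample
normal_sample = rng.normal(size=10_000)

# Exponentiating a normal sample yields a right-skewed, heavy-tailed one
lognormal_sample = np.exp(normal_sample)

for name, sample in [("normal", normal_sample), ("lognormal", lognormal_sample)]:
    # scipy's kurtosis() reports excess kurtosis by default (normal = 0)
    print(f"{name}: skewness={skew(sample):.2f}, excess kurtosis={kurtosis(sample):.2f}")
```

The normal sample should score close to 0 on both measures, while the lognormal sample should show clearly positive skewness and excess kurtosis.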

In [None]:
# Imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import kurtosis, skew
import numpy as np

# Loading dataset
df = pd.read_csv(r"C:\Users\Admin\Downloads\train.csv")

#plotting style
sns.set(style="whitegrid")

# Target variable
target = df['SalePrice']

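# Original distribution using displot and histplot

# displot is a figure-level function that creates its own figure,
# so its size is set with height/aspect instead of plt.figure()
sns.displot(target, kde=True, height=6, aspect=14 / 6)
plt.title("Original Distribution of SalePrice (displot)")
plt.xlabel("SalePrice")
plt.ylabel("Frequency")
plt.show()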

#histplot
plt.figure(figsize=(10, 5))
sns.histplot(target, kde=True, bins=30)
plt.title("Original Distribution of SalePrice (histplot)")
plt.xlabel("SalePrice")
plt.ylabel("Frequency")
plt.show()

# Calculating kurtosis and skewness
# (scipy's kurtosis() returns excess kurtosis by default, so a normal distribution scores 0)
orig_kurtosis = kurtosis(target)
orig_skewness = skew(target)
print(f"Original Kurtosis: {orig_kurtosis:.2f}")
print(f"Original Skewness: {orig_skewness:.2f}")

#Log transformation
log_target = np.log1p(target)  # log(1 + x) to avoid log(0)

# Log-transformed distribution

# displot creates its own figure, so size it with height/aspect
sns.displot(log_target, kde=True, height=6, aspect=14 / 6)
plt.title("Log-Transformed Distribution of SalePrice (displot)")
plt.xlabel("Log(SalePrice)")
plt.ylabel("Frequency")
plt.show()

#histplot
plt.figure(figsize=(10, 5))
sns.histplot(log_target, kde=True, bins=30)
plt.title("Log-Transformed Distribution of SalePrice (histplot)")
plt.xlabel("Log(SalePrice)")
plt.ylabel("Frequency")
plt.show()

#kurtosis and skewness for log-transformed data
log_kurtosis = kurtosis(log_target)
log_skewness = skew(log_target)
print(f"Log-Transformed Kurtosis: {log_kurtosis:.2f}")
print(f"Log-Transformed Skewness: {log_skewness:.2f}")





Before Log Transformation:

The distribution of SalePrice is right-skewed (positively skewed).

This means most house prices are lower, but a few very high-priced houses stretch the tail to the right.

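Skewness and kurtosis are both well above 0, the value each takes for a normal distribution (scipy reports excess kurtosis by default). This indicates the data is not normally distributed and has a heavy right tail with some extreme high-priced outliers.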

After Log Transformation:

The distribution becomes much closer to a normal distribution (bell-shaped curve).

Skewness and kurtosis decrease significantly.

This makes the target more symmetric and reduces the influence of the few very expensive houses. That tends to help models such as linear regression, whose standard assumptions concern normally distributed residuals rather than the target itself.
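One practical consequence of modelling the log-transformed target is that predictions come out on the log scale and have to be mapped back to prices. A minimal sketch, using made-up prices, of how np.expm1 undoes np.log1p:

```python
import numpy as np

# Hypothetical sale prices, made up for illustration
prices = np.array([100_000.0, 250_000.0, 755_000.0])

# Forward transform, as applied to SalePrice above: log(1 + x)
log_prices = np.log1p(prices)

# Inverse transform: exp(x) - 1 maps log-scale values back to prices
recovered = np.expm1(log_prices)

print(np.allclose(recovered, prices))
```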



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


numeric_df = df.select_dtypes(include=['number'])

#correlation matrix
corr_matrix = numeric_df.corr()

plt.figure(figsize=(16, 12))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title("Correlation Heatmap (Numeric Features Only)")
plt.show()


# Absolute correlations with SalePrice, strongest first
target_corr = corr_matrix['SalePrice'].abs().sort_values(ascending=False)


print(target_corr.head(11))

#top 10 features excluding 'SalePrice' itself
top_features = target_corr.index[1:11]
print("\nTop 10 features most correlated with SalePrice:\n", top_features.tolist())



# Correlation matrix for selected features
selected_corr = df[top_features.tolist() + ['SalePrice']].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(selected_corr, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title("Top 10 Correlated Features + SalePrice Heatmap")
plt.show()



What the Top Features Represent (from the Kaggle Data Description)

| Feature | Description |
| --- | --- |
| OverallQual | Overall material and finish quality |
| GrLivArea | Above ground living area in square feet |
| GarageCars | Size of garage in car capacity |
| GarageArea | Size of garage in square feet |
| TotalBsmtSF | Total basement area in square feet |
| 1stFlrSF | First floor square feet |
| FullBath | Full bathrooms above grade |
| TotRmsAbvGrd | Total rooms above grade (does not include bathrooms) |
| YearBuilt | Original construction date |
| YearRemodAdd | Remodel date (same as construction date if no remodeling) |

These features mostly represent the size, quality, and age of the house — all factors that logically affect the SalePrice. For instance, higher overall quality, more square footage, and newer construction/remodel years typically command higher prices.

In [None]:
import numpy as np

# Threshold for "high" correlation
threshold = 0.8

# Pairwise correlations among the top features only (SalePrice excluded)
feature_corr = selected_corr.loc[top_features, top_features]

# Keep the strict upper triangle so self-correlations and mirrored
# duplicates of each pair are excluded, regardless of ties in the values
upper = np.triu(np.ones(feature_corr.shape, dtype=bool), k=1)
top_corr_pairs = feature_corr.where(upper).stack().dropna().sort_values(ascending=False)

# Feature pairs above the threshold (at most the top 3 are shown)
high_corr_pairs = top_corr_pairs[top_corr_pairs > threshold]
print("\nHighly correlated feature pairs (r > 0.8):")
print(high_corr_pairs.head(3))
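A correlation matrix is symmetric, so every pair appears twice, alongside a diagonal of ones. One way to list each pair exactly once is to keep only the strict upper triangle. The sketch below shows this on a small synthetic frame; the column names x, y, and z and the seed are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Synthetic data: y is nearly a copy of x, z is unrelated
toy = pd.DataFrame({
    "x": x,
    "y": x + 0.1 * rng.normal(size=200),
    "z": rng.normal(size=200),
})

corr = toy.corr()

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Mask, flatten to (feature, feature) pairs, and drop the masked entries
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

# Pairs whose correlation exceeds the threshold
high = pairs[pairs > 0.8]
print(high)
```

With three columns there are exactly three unique pairs, and only the (x, y) pair should exceed the threshold.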



