# **AxxendCorp AI/ML Assessment**

***Project Goal:***
Build a machine learning model to predict house prices based on different features such as
size, location, and number of rooms. The project should demonstrate data preprocessing,
model training, evaluation, and reporting of insights.

In [1]:
# Download the Kaggle dataset with kagglehub (install it first: pip install kagglehub)
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lespin/house-prices-dataset")

print("Path to dataset files:", path)

ModuleNotFoundError: No module named 'kagglehub'
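
The `ModuleNotFoundError` above indicates that `kagglehub` is not installed in the active environment; running `%pip install kagglehub` in a notebook cell (or `pip install kagglehub` in a shell) and re-running the cell should normally resolve it. As a minimal sketch, a check like the one below could be run before the download cell. The helper name `module_available` is hypothetical and not part of the original notebook.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


# Hypothetical pre-flight check before running the download cell.
if not module_available("kagglehub"):
    print("kagglehub is missing; install it with: pip install kagglehub")
```

The check only reports whether the package can be found; it does not install anything.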

**DATA UNDERSTANDING AND CLEANING**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

In [None]:
# Load the training data with pandas
data = pd.read_csv(path + "/train.csv")
data.head()

In [None]:
data.tail()

**The head and tail of the data are displayed above to show which features are involved.**

In [None]:
data.shape  # 1460 observations (rows) and 81 columns, including Id and the target SalePrice

In [None]:
data.nunique()  # Number of distinct values in each feature

In [None]:
pd.set_option('display.max_rows', None)
data.info()

**The output above lists, for each feature, the number of non-null observations and the data type. Comparing the non-null counts with the total number of rows shows which features contain missing values.**

In [None]:
data.describe()

**Summary statistics for the numeric features: the count of non-null observations, the mean, the standard deviation, the minimum, the quartiles, and the maximum. These will help later when cleaning the dataset.**

In [None]:
data.isnull().sum()  # Number of null values in each feature
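
With 81 columns, the raw null counts are easier to read when restricted to columns that actually have missing values and shown with a percentage. The following is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper, and the toy frame uses made-up values that only borrow column names from the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})
print(missing_summary(toy))
```

On the real data, the same function could be called as `missing_summary(data)`.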

In [None]:
print(data.dtypes)  # Check which features have object or boolean data types

In [None]:
cate_data = data.select_dtypes(include=["object", "bool"]).columns.tolist()
print(len(cate_data))
print(cate_data)

**The output above lists the columns that hold categorical values. We convert them to numeric codes because the machine learning model needs numerical input.**

In [None]:
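from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer codes. Casting to str first turns
# missing values into their own "nan" category and avoids mixed-type errors.
encode = LabelEncoder()
for x in cate_data:
    data[x] = encode.fit_transform(data[x].astype(str))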
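
Label encoding replaces each category with an integer assigned in sorted order, which gives nominal features an order they do not really have. Tree-based models usually tolerate this, while linear models may read the codes as magnitudes. As a hedged illustration (not the method used in this notebook), the sketch below compares integer codes with one-hot encoding via `pd.get_dummies` on a toy column with made-up values.

```python
import pandas as pd

# Toy column standing in for a nominal feature (values are made up).
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Integer codes assigned in sorted order of the categories: Grvl -> 0, Pave -> 1.
codes = toy["Street"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(toy["Street"], prefix="Street", dtype=int)

print(codes.tolist())
print(one_hot.columns.tolist())
```

One-hot encoding adds one column per category, so it may be better suited to features with few distinct values.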

In [None]:
data.head()

**Handle the remaining missing values now that the categorical features have been converted to numeric codes.**

In [None]:
print(data.isnull().sum()[data.isnull().sum() > 0])

In [None]:
data.fillna({"LotFrontage": data["LotFrontage"].mean()}, inplace = True)
data.fillna({"MasVnrArea": data["MasVnrArea"].mean()}, inplace = True)
data.fillna({"GarageYrBlt": data["GarageYrBlt"].mean()}, inplace = True)

In [None]:
print(data.isnull().sum()[data.isnull().sum() > 0])

**Replaced the missing values in each of these features with that feature's mean.**
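
The mean is sensitive to extreme values, so for a skewed feature the median is a common alternative fill value. The sketch below contrasts the two on a toy series with made-up values; it illustrates the trade-off and is not a claim about which choice suits this dataset.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value and one extreme value (values are made up).
toy = pd.Series([60.0, 65.0, 70.0, np.nan, 313.0], name="LotFrontage")

# Both statistics skip NaN by default.
mean_value = toy.mean()
median_value = toy.median()

filled_mean = toy.fillna(mean_value)
filled_median = toy.fillna(median_value)

print("mean fill:", filled_mean[3])
print("median fill:", filled_median[3])
```

Here the single extreme value pulls the mean well above the typical observations, while the median stays close to them.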

In [None]:
# Select numerical columns for visualization. After label encoding every column is
# numeric, so exclude the encoded categorical columns (cate_data), 'Id', and the
# target variable 'SalePrice'.
excluded = set(cate_data) | {'Id', 'SalePrice'}
numerical_cols = [col for col in data.select_dtypes(include=np.number).columns
                  if col not in excluded]

# Plot a subset of these columns, for example the first 10
cols_to_plot = numerical_cols[:10]

plt.figure(figsize=(15, 10))
for i, col in enumerate(cols_to_plot):
    plt.subplot(2, 5, i + 1)
    sns.boxplot(y=data[col])
    plt.title(col)
plt.tight_layout()
plt.show()
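
By default, the whiskers in these boxplots extend to 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside them are drawn as outliers. The same rule can be applied numerically to count the flagged points per feature. The following is a minimal sketch; `iqr_outlier_mask` is a hypothetical helper, and the toy series uses made-up values.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy series with one obvious extreme value (values are made up).
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="LotArea")
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())
```

On the real data, something like `{col: iqr_outlier_mask(data[col]).sum() for col in cols_to_plot}` would give a count per plotted feature.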