# Business Understanding

### Questions to be answered

* Who are the stakeholders in this project? Who will be directly affected by the creation of this project?
* What business problem(s) will this Data Science project solve for the organization?
* What problems are inside the scope of this project?
* What problems are outside the scope of this project?
* What data sources are available to us?

# Data Understanding

### Questions to be answered

* What data is available to us?
* What is our target?
* What predictors are available to us?
* What data types are the predictors we'll be working with?
* What is the distribution of our data?
* How many observations does our dataset contain? Do we have a lot of data? Only a little?
* How do we know the data is correct? How is the data collected? Is there a chance the data could be wrong?
* Do we have enough data to build a model? Will we need to use resampling methods?
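Several of the questions above (data types, missing values, number of observations, distributions) can be answered with a couple of small summary helpers. The following is a minimal sketch, assuming the data is already loaded into a pandas DataFrame; `summarize_frame` and `numeric_distribution` are hypothetical helper names introduced here for illustration, not part of pandas.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
        "n_unique": frame.nunique(dropna=True),
    })


def numeric_distribution(frame: pd.DataFrame) -> pd.DataFrame:
    """Describe the numeric columns and append skewness as a rough shape indicator."""
    numeric = frame.select_dtypes(include="number")
    stats = numeric.describe().T      # count, mean, std, min, quartiles, max
    stats["skew"] = numeric.skew()    # values far from 0 suggest a long tail
    return stats
```

Once `df` has been loaded, `summarize_frame(df)` and `numeric_distribution(df)` give a per-column overview, and `len(df)` gives the number of observations.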

In [None]:
import pandas as pd

In [None]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'G:\Moringa\Moringa-Phase-2\Phase_2_Project\Phase_2_project\Data\kc_house_data.csv')
print("First five rows of the dataset:")
df.head()

In [None]:
# Total missing values per column
print('Total missing values per column')
missing_values_per_column = df.isna().sum()
missing_values_per_column

In [None]:
# Display data types and missing values
print("\nData types and missing values:")
df.info()

In [None]:
# Converting date into datetime
df['date'] = pd.to_datetime(df['date'])


In [None]:
df['sqft_basement'].unique()

In [None]:
# Convert to numeric ('?' placeholders are coerced to NaN), then fill NaN with the column median
df['sqft_basement'] = pd.to_numeric(df['sqft_basement'], errors='coerce')
df['sqft_basement'] = df['sqft_basement'].fillna(df['sqft_basement'].median())


In [None]:
df.info()

In [None]:
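# Fill missing values: 0 for waterfront and yr_renovated, the most frequent value for view.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
df['waterfront'] = df['waterfront'].fillna(0)
df['view'] = df['view'].fillna(df['view'].mode()[0])
df['yr_renovated'] = df['yr_renovated'].fillna(0)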

In [None]:
missing_values_per_column = df.isna().sum()
print("Missing values per column:")
missing_values_per_column

# Data Preparation

### Tasks to be completed

* Detecting and dealing with missing values
* Data type conversions (e.g. numeric data mistakenly encoded as strings)
* Checking for and removing multicollinearity (correlated predictors)
* Normalizing our numeric data
* Converting categorical data to numeric format through one-hot encoding
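The multicollinearity check, normalization, and one-hot encoding steps listed above can be sketched with pandas and NumPy. This is only an illustration under stated assumptions: the helper names (`correlated_pairs`, `standardize`, `one_hot`) are hypothetical, the 0.75 correlation threshold is an arbitrary default rather than a recommended value, and standardization (z-scores) is just one of several ways to normalize numeric data.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.75) -> pd.DataFrame:
    """List numeric column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair appears once and self-pairs are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "abs_corr"]
    return pairs[pairs["abs_corr"] > threshold].sort_values("abs_corr", ascending=False)


def standardize(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with the given columns rescaled to zero mean and unit variance."""
    out = frame.copy()
    out[columns] = (out[columns] - out[columns].mean()) / out[columns].std(ddof=0)
    return out


def one_hot(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """One-hot encode the given columns, dropping the first level of each
    so the dummy columns are not perfectly collinear with an intercept."""
    return pd.get_dummies(frame, columns=columns, drop_first=True, dtype=int)
```

If the data is later split into training and test sets, the means and standard deviations used for scaling should be computed on the training portion only and then applied to the test portion, so that no information leaks from the held-out data.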

# Modeling

### Questions to be answered

* Is this a classification task? A regression task? Something else?
* What models will we try?
* How do we deal with overfitting?
* Do we need to use regularization or not?
* What sort of validation strategy will we be using to check that our model works well on unseen data?
* What loss functions will we use?
* What threshold of performance do we consider as successful?
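The questions about regularization, validation strategy, and loss functions can be made concrete with a small example. The sketch below assumes the task turns out to be regression with a numeric target, which this notebook has not yet established. It uses ridge (L2) regularization, shuffled k-fold cross-validation, and RMSE as the loss, all chosen for illustration only; the function names are hypothetical.

```python
import numpy as np


def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    """Closed-form ridge regression; returns [intercept, coefficients...].

    The intercept is left unpenalized. alpha=0 reduces to ordinary least squares.
    """
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict(weights: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Apply fitted weights to a feature matrix."""
    return weights[0] + X @ weights[1:]


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def cross_validated_rmse(X, y, alpha, n_splits=5, seed=0) -> float:
    """Average held-out RMSE over shuffled k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        weights = fit_ridge(X[train_idx], y[train_idx], alpha)
        scores.append(rmse(y[test_idx], predict(weights, X[test_idx])))
    return float(np.mean(scores))
```

Comparing `cross_validated_rmse` across several `alpha` values shows whether regularization helps on unseen data; a large gap between training error and cross-validated error is a sign of overfitting. In practice, scikit-learn's `Ridge` and `cross_val_score` offer the same workflow without hand-written linear algebra.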