## Data Description
This machine learning model predicts house prices from a set of property features: the number of bedrooms and bathrooms, 
the living area and lot size in square feet, the number of floors, whether the property is on the waterfront, and whether it has a view.

The goal of this project is to train a model that predicts a house's price from these features. 
The model should be useful to prospective home buyers, current homeowners, and real estate businesses that want a price estimate for a wide range of houses. 

## How can this model help businesses save money?

A real estate business could run this tool on available properties to guide its pricing and propose an offer that fits the property and is likely to win. This would help the business avoid overpaying for a property, give prospective buyers a good sense of the price range of properties, and help current homeowners understand the fair market value of their homes.

## Download the dataset
i. Install all required libraries

ii. Download the dataset from Kaggle

iii. Explore the dataset size, shape, and descriptions

iv. Load the training set with Pandas

v. Load the test set with Pandas

In [1]:
!pip install opendatasets xgboost pandas numpy scikit-learn --quiet



In [24]:
import opendatasets as od

In [25]:
my_url = 'https://www.kaggle.com/datasets/shree1992/housedata'

In [26]:
od.download(my_url)

Skipping, found downloaded files in ".\housedata" (use force=True to force download)


In [38]:
mydata = "./housedata/data.csv"
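
Backslashes inside an ordinary Python string literal are read as escape characters. Unrecognized escapes pass through unchanged but trigger a warning in recent Python versions, and a folder name starting with `t` or `n` would silently corrupt the path. A minimal sketch of a portable alternative using `pathlib` follows; the names `data_dir` and `data_path` are illustrative, and the folder and file names are assumed to match what the download step reported.

```python
from pathlib import Path

# Folder and file names as reported by the download step (assumed unchanged)
data_dir = Path("housedata")
data_path = data_dir / "data.csv"

# as_posix() renders the path with forward slashes on any operating system
print(data_path.as_posix())

# Check for the file before reading it, so a missing download fails with a clear message
if not data_path.exists():
    print(f"{data_path} not found; run the download step first")
```

`pd.read_csv` accepts a `Path` object directly, so `data_path` can be passed in place of the string.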

In [27]:
import os
import pandas as pd

In [39]:
df = pd.read_csv(mydata)

In [41]:
df.head()

                  date      price  bedrooms  bathrooms  sqft_living  \
0  2014-05-02 00:00:00   313000.0       3.0       1.50         1340   
1  2014-05-02 00:00:00  2384000.0       5.0       2.50         3650   
2  2014-05-02 00:00:00   342000.0       3.0       2.00         1930   
3  2014-05-02 00:00:00   420000.0       3.0       2.25         2000   
4  2014-05-02 00:00:00   550000.0       4.0       2.50         1940   

      sqft_lot  floors  waterfront  view  con

In [42]:
df.shape

(4600, 18)

In [43]:
df.describe()

In [44]:
df.values

array([['2014-05-02 00:00:00', 313000.0, 3.0, ..., 'Shoreline',
        'WA 98133', 'USA'],
       ['2014-05-02 00:00:00', 2384000.0, 5.0, ..., 'Seattle',
        'WA 98119', 'USA'],
       ['2014-05-02 00:00:00', 342000.0, 3.0, ..., 'Kent', 'WA 98042',
        'USA'],
       ...,
       ['2014-07-09 00:00:00', 416904.166667, 3.0, ..., 'Renton',
        'WA 98059', 'USA'],
       ['2014-07-10 00:00:00', 203400.0, 4.0, ..., 'Seattle', 'WA 98178',
        'USA'],
       ['2014-07-10 00:00:00', 220600.0, 3.0, ..., 'Covington',
        'WA 98042', 'USA']], dtype=object)
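
The array above has `dtype=object` because the frame mixes strings and numbers. The pandas documentation recommends `DataFrame.to_numpy()` over `.values`, and selecting only numeric columns first yields a float array. A minimal sketch on three rows copied from the preview (the name `sample` and the column subset are illustrative):

```python
import pandas as pd

# First three rows of the preview, restricted to a few visible columns
sample = pd.DataFrame({
    "date": ["2014-05-02 00:00:00"] * 3,
    "price": [313000.0, 2384000.0, 342000.0],
    "bedrooms": [3.0, 5.0, 3.0],
})

# Strings and floats together force a generic object array
mixed = sample.to_numpy()

# Numeric columns only give a float64 array
numeric = sample[["price", "bedrooms"]].to_numpy()

print(mixed.dtype, numeric.dtype, numeric.shape)
```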

In [45]:
df.info()
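
`head`, `describe`, and `info` are methods, so they need parentheses to run; without them the notebook only echoes the bound method together with the frame's text representation. A minimal sketch of what each call returns, using a five-row sample copied from the preview (the name `sample` is illustrative, and only the columns visible in the preview are included):

```python
import io
import pandas as pd

# First five rows of the preview, restricted to the visible columns
sample = pd.DataFrame({
    "date": ["2014-05-02 00:00:00"] * 5,
    "price": [313000.0, 2384000.0, 342000.0, 420000.0, 550000.0],
    "bedrooms": [3.0, 5.0, 3.0, 3.0, 4.0],
    "bathrooms": [1.50, 2.50, 2.00, 2.25, 2.50],
    "sqft_living": [1340, 3650, 1930, 2000, 1940],
})

first_rows = sample.head(3)  # first three rows as a new DataFrame
stats = sample.describe()    # count, mean, std, min, quartiles, max per numeric column

# info() prints to stdout by default; a buffer captures the text instead
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()

print(first_rows)
print(stats)
print(info_text)
```

By default `describe()` summarizes only the numeric columns, so the text column `date` does not appear in `stats`.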

In [46]:
for col in df.columns:
    print(col)

date
price
bedrooms
bathrooms
sqft_living
sqft_lot
floors
waterfront
view
condition
sqft_above
sqft_basement
yr_built
yr_renovated
street
city
statezip
country
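
With the column names known, the features named in the data description can be separated from the `price` target and the rows divided into training and test sets. The sketch below is one possible approach, not necessarily the one this notebook uses: it relies on a pandas-only random split rather than scikit-learn's `train_test_split`, and the 80/20 ratio, the seed, and the names `preview`, `feature_cols`, and `target_col` are assumptions. It uses the ten rows shown in the preview, with prices as displayed there (rounded), and only the feature columns whose values are visible; `sqft_lot`, `floors`, `waterfront`, and `view` would be added to `feature_cols` in the same way.

```python
import pandas as pd

# First and last five rows of the preview; prices as displayed there (rounded)
preview = pd.DataFrame({
    "price": [313000.0, 2384000.0, 342000.0, 420000.0, 550000.0,
              308166.7, 534333.3, 416904.2, 203400.0, 220600.0],
    "bedrooms": [3.0, 5.0, 3.0, 3.0, 4.0, 3.0, 3.0, 3.0, 4.0, 3.0],
    "bathrooms": [1.50, 2.50, 2.00, 2.25, 2.50, 1.75, 2.50, 2.50, 2.00, 2.50],
    "sqft_living": [1340, 3650, 1930, 2000, 1940, 1510, 1460, 3010, 2090, 1490],
})

feature_cols = ["bedrooms", "bathrooms", "sqft_living"]  # visible subset of the features
target_col = "price"

# Assumed 80/20 split: sample the training rows, keep the remainder for testing
train = preview.sample(frac=0.8, random_state=42)
test = preview.drop(train.index)

X_train, y_train = train[feature_cols], train[target_col]
X_test, y_test = test[feature_cols], test[target_col]

print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which makes later model comparisons easier to interpret.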
