<a href="https://colab.research.google.com/github/MuneefMumthas/CO538-Muneef-22206529/blob/main/CO538_CW1_MLProject_Muneef_22206529.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**

This is the .ipynb notebook for a Machine Learning project for **CO538**.

**Student Name:** Muneef Ahamed Mohamed Mumthas

**Student ID:** 22206529

**Course:** BSc (Hons) Artificial Intelligence with Foundation Year

**Module:** CO538 - Machines And Their Languages

# **Research Question**

## Can we predict house prices based on other factors, despite dealing with an unknown data format and potential quality issues?

Predicting house prices is a critical task in the real estate industry, enabling stakeholders to make informed decisions. However, it becomes challenging when dealing with datasets of unknown format and potential data quality issues. This research aims to investigate whether it is possible to predict house prices using various factors, even in the presence of these challenges.

# **Dataset Summary**

Real estate markets give data analysts an interesting opportunity to analyze and predict where property prices are heading. Predicting property prices is increasingly important because prices are a good indicator of both overall market conditions and the economic health of a country. The data provided is a large set of property sales records stored in an unknown format and with unknown data quality issues, so it has to be wrangled before any modelling.

## Columns Explanation

- **Date**: The date when the property information was recorded.
- **Price**: The price of the property.
- **Bedrooms**: The number of bedrooms in the property.
- **Bathrooms**: The number of bathrooms in the property.
- **Sqft_living**: The square footage of living space in the property.
- **Sqft_lot**: The square footage of the lot where the property is located.
- **Floors**: The number of floors in the property.
- **Waterfront**: Whether the property has a waterfront view (binary: 0 for no, 1 for yes).
- **View**: An index from 0 to 4 representing the level of view the property has.
- **Condition**: An index from 1 to 5 representing the overall condition of the property.
- **Sqft_above**: The square footage of the interior space above ground level.
- **Sqft_basement**: The square footage of the basement space.
- **Yr_built**: The year the property was built.
- **Yr_renovated**: The year the property was last renovated.
- **Street**: The street address of the property.
- **City**: The city where the property is located.
- **Statezip**: The state and ZIP code of the property.
- **Country**: The country where the property is located.

**This column description is based on my analysis of the dataset CSV file.**
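Because the descriptions above come from reading the CSV rather than from official documentation, it can help to check the coded columns against their stated ranges. The following is a minimal sketch of such a check; the `sample` frame, the `expected_ranges` dictionary and the `out_of_range_counts` helper are hypothetical names used only for illustration, and in the notebook the real DataFrame would take the place of `sample`.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the coded columns described above;
# the real DataFrame loaded from the CSV would be used in the notebook.
sample = pd.DataFrame({
    "waterfront": [0, 0, 1],
    "view": [0, 4, 2],
    "condition": [3, 5, 1],
})

# Expected (min, max) per column, taken from the column descriptions above.
expected_ranges = {"waterfront": (0, 1), "view": (0, 4), "condition": (1, 5)}


def out_of_range_counts(frame, ranges):
    """Count, per column, the values that fall outside the documented range."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }


violations = out_of_range_counts(sample, expected_ranges)
print(violations)
```

A non-zero count for any column would suggest that either the description or the data needs a closer look.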


# **Importing The Libraries**

In [147]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [148]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Importing The Dataset**


In [149]:
path = "/content/drive/MyDrive/HousingData.csv"

df = pd.read_csv(path)
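Since the file format is described as unknown, one possible precaution is to confirm the delimiter before relying on the `pd.read_csv` defaults. The sketch below uses the standard-library `csv.Sniffer` on a short text sample; `raw_text` is a hypothetical stand-in built from a few columns of the first rows, not the actual file contents, and the candidate delimiter list is an assumption.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the first lines of the file (three columns only).
raw_text = (
    "date,price,bedrooms\n"
    "2014-05-02 00:00:00,313000.0,3.0\n"
    "2014-05-02 00:00:00,2384000.0,5.0\n"
)

# Let the sniffer choose among a few common delimiters (assumed candidates).
dialect = csv.Sniffer().sniff(raw_text, delimiters=",;\t|")

# Pass the detected delimiter to pandas explicitly.
sample_df = pd.read_csv(io.StringIO(raw_text), sep=dialect.delimiter)
print(repr(dialect.delimiter), sample_df.shape)
```

For a file on disk, the same idea would apply to the first few kilobytes read from the path before calling `pd.read_csv`.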



In [150]:
df.head()
# sampling the dataset by showing the first five rows

| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.5 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.5 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.0 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.5 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |


In [151]:
# changing the float format to make it easier to understand
pd.set_option('display.float_format', lambda x: format(x, '.2f'))
df.describe()

| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 | 4600.0 |
| mean | 551962.99 | 3.4 | 2.16 | 2139.35 | 14852.52 | 1.51 | 0.01 | 0.24 | 3.45 | 1827.27 | 312.08 | 1970.79 | 808.61 |
| std | 563834.7 | 0.91 | 0.78 | 963.21 | 35884.44 | 0.54 | 0.08 | 0.78 | 0.68 | 862.17 | 464.14 | 29.73 | 979.41 |
| min | 0.0 | 0.0 | 0.0 | 370.0 | 638.0 | 1.0 | 0.0 | 0.0 | 1.0 | 370.0 | 0.0 | 1900.0 | 0.0 |
| 25% | 322875.0 | 3.0 | 1.75 | 1460.0 | 5000.75 | 1.0 | 0.0 | 0.0 | 3.0 | 1190.0 | 0.0 | 1951.0 | 0.0 |
| 50% | 460943.46 | 3.0 | 2.25 | 1980.0 | 7683.0 | 1.5 | 0.0 | 0.0 | 3.0 | 1590.0 | 0.0 | 1976.0 | 0.0 |
| 75% | 654962.5 | 4.0 | 2.5 | 2620.0 | 11001.25 | 2.0 | 0.0 | 0.0 | 4.0 | 2300.0 | 610.0 | 1997.0 | 1999.0 |
| max | 26590000.0 | 9.0 | 8.0 | 13540.0 | 1074218.0 | 3.5 | 1.0 | 4.0 | 5.0 | 9410.0 | 4820.0 | 2014.0 | 2014.0 |
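The summary reports a minimum of 0 for price, bedrooms and bathrooms, which is a hint that zeros may be standing in for missing values. A quick way to size the problem is to count zeros per column. The sketch below does this on a hypothetical miniature frame named `toy`; its values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame; the values are illustrative, not real rows.
toy = pd.DataFrame({
    "price": [313000.0, 0.0, 342000.0, 0.0],
    "bedrooms": [3.0, 5.0, 0.0, 3.0],
    "bathrooms": [1.5, 2.5, 0.0, 2.25],
})

# Comparing the whole frame with 0 gives a boolean frame;
# summing it counts the zeros in each column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Applied to the real DataFrame, the same expression would show which columns contain suspicious zeros and how many.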


In [152]:
# checking the number of rows and columns
df.shape

(4600, 18)

In [153]:
# checking whether the columns have appropriate data types
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float64(4), int64(9), object(5)

We can see that the datatypes for bedrooms, bathrooms and floors are not the expected types: they are stored as float64, whereas this analysis treats them as whole-number counts.

In [154]:
# changing the dtype of floors, bedrooms and bathrooms to 'int'
# note: astype('int') truncates fractional values (e.g. 1.5 bathrooms becomes 1)
df[['bedrooms','bathrooms', 'floors']] = df[['bedrooms','bathrooms', 'floors']].astype('int')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   int64  
 3   bathrooms      4600 non-null   int64  
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   int64  
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float64(1), int64(12), object(5)
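Casting with `astype('int')` discards the fractional part, so it is worth checking first whether a column actually holds fractional values. The sketch below is one way to do that, using the bathrooms and floors values from the first four rows shown earlier; the `rooms` frame and the variable names are hypothetical, and whether to truncate, round or keep the floats is a modelling choice that this sketch does not settle.

```python
import pandas as pd

# Bathrooms and floors values from the first four rows of the dataset preview.
rooms = pd.DataFrame({
    "bathrooms": [1.5, 2.5, 2.0, 2.25],
    "floors": [1.5, 2.0, 1.0, 1.0],
})

# True for each column that contains at least one non-integer value.
has_fraction = (rooms % 1 != 0).any()

# What the integer cast does to those values: the fraction is dropped.
truncated = rooms.astype("int")

print(has_fraction)
print(truncated)
```

If `has_fraction` is True for a column, the cast changes the data rather than just its type, which may or may not be acceptable for the model.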

In [155]:
# checking the number of rows where the price is 0
(df['price'] == 0).sum()

49

There are **49** rows where the price is 0. Since the price of a house cannot be 0, let's treat these as missing values and replace them with the average price.

In [156]:
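# replacing the price 0 with NaN
# (assigning the result back; inplace=True on a column selection may not update df in newer pandas versions)
df['price'] = df['price'].replace(0, np.nan)

# replacing the NaN values of price with the average price
df['price'] = df['price'].fillna(df['price'].mean())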
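Mean imputation is only one way to handle the zero prices, and the mean is sensitive to very expensive properties. The sketch below compares it with two alternatives, median imputation and dropping the affected rows, on a hypothetical five-row frame named `toy`. It illustrates the options rather than recommending one, and the notebook itself uses the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame with one zero price standing in for a missing value.
toy = pd.DataFrame({"price": [313000.0, 0.0, 342000.0, 420000.0, 2384000.0]})

# Treat zero prices as missing.
price = toy["price"].replace(0, np.nan)

# Option 1: fill with the mean of the known prices (pulled up by the outlier).
mean_filled = price.fillna(price.mean())

# Option 2: fill with the median, which is less affected by the outlier.
median_filled = price.fillna(price.median())

# Option 3: drop the rows whose price is unknown.
dropped = toy[price.notna()]

print(mean_filled.iloc[1], median_filled.iloc[1], len(dropped))
```

Because price is the value being predicted, any imputed price becomes a training label, so dropping the rows is also a common choice.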

In [139]:
df.head()

| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3 | 1 | 1340 | 7912 | 1 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5 | 2 | 3650 | 9050 | 2 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3 | 2 | 1930 | 11947 | 1 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3 | 2 | 2000 | 8030 | 1 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4 | 2 | 1940 | 10500 | 1 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |


In [157]:
# dropping the columns that are not used as predictors in this analysis
df.drop(['street', 'date', 'country'], axis=1, inplace=True)
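One way to support the choice of columns to drop is to look at how many distinct values each text column has: a column with a single value carries no information, and a column that is unique on nearly every row behaves like an identifier. The sketch below shows this check on a hypothetical four-row frame named `toy`, whose values are made up for illustration; whether the real street, date and country columns fit these patterns would need to be confirmed on the full dataset.

```python
import pandas as pd

# Hypothetical four-row frame; addresses and cities are made up for illustration.
toy = pd.DataFrame({
    "street": ["1 A St", "2 B Ave", "3 C Rd", "4 D Ln"],
    "city": ["Seattle", "Seattle", "Kent", "Kent"],
    "country": ["USA", "USA", "USA", "USA"],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()

# Columns with one value everywhere cannot help a model distinguish rows.
constant_cols = unique_counts[unique_counts == 1].index.tolist()

# Columns with a different value on every row act like identifiers.
identifier_like = unique_counts[unique_counts == len(toy)].index.tolist()

reduced = toy.drop(columns=constant_cols + identifier_like)
print(constant_cols, identifier_like, list(reduced.columns))
```

On real data an exact match with the row count is a strict test, so a threshold such as "more than 90% unique" (an assumed figure) may be more practical.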


In [158]:
df.head()

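| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | city | statezip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 313000.0 | 3 | 1 | 1340 | 7912 | 1 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | Shoreline | WA 98133 |
| 1 | 2384000.0 | 5 | 2 | 3650 | 9050 | 2 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | Seattle | WA 98119 |
| 2 | 342000.0 | 3 | 2 | 1930 | 11947 | 1 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | Kent | WA 98042 |
| 3 | 420000.0 | 3 | 2 | 2000 | 8030 | 1 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | Bellevue | WA 98008 |
| 4 | 550000.0 | 4 | 2 | 1940 | 10500 | 1 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | Redmond | WA 98052 |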


# **Exploratory Data Analysis (EDA)**
