## This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.


If you run the lab locally using Anaconda, you can install the correct library versions by uncommenting the following:


In [None]:
# All libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented out.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [None]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

You will require the following libraries:


In [None]:
import piplite
await piplite.install(['pandas','matplotlib','scikit-learn','seaborn', 'numpy'])


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

# Importing Data Sets


The function below downloads the dataset into your browser:


In [None]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

In [None]:
file_name='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv'

You will need to download the dataset; if you are running locally, please comment out the following code: 


In [None]:
await download(file_name, "kc_house_data_NaN.csv")
file_name="kc_house_data_NaN.csv"

Use the pandas function <b>read_csv()</b> to load the data from the file or web address stored in <code>file_name</code>.


In [None]:
df = pd.read_csv(file_name)

We use the method <code>head</code> to display the first 5 rows of the dataframe.


In [None]:
df.head()


Display the data type of each column using the attribute <code>dtypes</code>, then take a screenshot and submit it. Include your code in the image.


In [None]:
df.dtypes

We use the method describe to obtain a statistical summary of the dataframe.


In [None]:
df.describe()

# Data Wrangling




Drop the columns <code>"id"</code> and <code>"Unnamed: 0"</code> along axis 1 using the method <code>drop()</code>, making sure the <code>inplace</code> parameter is set to <code>True</code>. Then use the method <code>describe()</code> to obtain a statistical summary of the data. Take a screenshot and submit it.


In [None]:
df.columns = df.columns.str.strip()
columns_to_drop = ["id", "Unnamed: 0"]
df.drop(columns_to_drop, axis=1, inplace=True)
summary = df.describe()
print(summary)

We can see that the columns <code>bedrooms</code> and <code>bathrooms</code> have missing values.


In [None]:
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())


We can replace the missing values of the column <code>'bedrooms'</code> with the mean of the column <code>'bedrooms'</code> using the method <code>replace()</code>. Don't forget to set the <code>inplace</code> parameter to <code>True</code>.


In [None]:
mean=df['bedrooms'].mean()
df['bedrooms'].replace(np.nan,mean, inplace=True)

We also replace the missing values of the column <code>'bathrooms'</code> with the mean of the column <code>'bathrooms'</code> using the method <code>replace()</code>. Don't forget to set the <code>inplace</code> parameter to <code>True</code>.


In [None]:
mean=df['bathrooms'].mean()
df['bathrooms'].replace(np.nan,mean, inplace=True)

In [None]:
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
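The same mean imputation can also be written with <code>fillna()</code> and a plain assignment, which avoids calling an in-place method on a selected column. The following is a minimal sketch on a small made-up dataframe; the name <code>toy</code> and its values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real columns (values are made up)
toy = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 4.0, 2.0],
    "bathrooms": [1.0, 2.0, np.nan, 3.0],
})

# Replace NaN in each column with that column's mean (NaN is skipped when averaging)
for col in ["bedrooms", "bathrooms"]:
    toy[col] = toy[col].fillna(toy[col].mean())

# Count the remaining missing values per column
print(toy.isnull().sum())
```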

# Exploratory Data Analysis




Use the method <code>value_counts</code> to count the number of houses for each unique floor value, then use the method <code>.to_frame()</code> to convert the result to a dataframe.


In [None]:
floor_counts = df['floors'].value_counts()
floor_counts_df = floor_counts.to_frame()
floor_counts_df = floor_counts_df.rename(columns={'floors': 'Number of Houses'})
floor_counts_df['Floor'] = floor_counts_df.index
floor_counts_df.reset_index(drop=True, inplace=True)
print(floor_counts_df)



Use the function <code>boxplot</code> in the seaborn library  to  determine whether houses with a waterfront view or without a waterfront view have more price outliers.


In [None]:
sns.boxplot(x="waterfront",y="price", data=df)
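To back up the visual impression from the boxplot, the outliers in each group can also be counted. The sketch below assumes the common 1.5 × IQR rule, which matches the default whisker length of a seaborn boxplot. The dataframe <code>toy</code>, its prices, and the helper <code>count_outliers</code> are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Made-up prices (in thousands) for houses without (0) and with (1) a waterfront view
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "price": [300, 320, 310, 330, 305, 900, 1000, 1100, 1050, 1200, 1150],
})

def count_outliers(prices):
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return int(outside.sum())

# One outlier count per waterfront group
outlier_counts = toy.groupby("waterfront")["price"].apply(count_outliers)
print(outlier_counts)
```

On the real data, the same helper could be applied to <code>df.groupby("waterfront")["price"]</code>.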


Use the function <code>regplot</code>  in the seaborn library  to  determine if the feature <code>sqft_above</code> is negatively or positively correlated with price.


In [None]:
sns.regplot(x="sqft_above",y="price",data=df)

We can use the Pandas method <code>corr()</code>  to find the feature other than price that is most correlated with price.


In [None]:
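# Keep only numeric columns: newer pandas versions raise an error if corr() meets a non-numeric column
df.select_dtypes(include='number').corr()['price'].sort_values()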

# Model Development


We can fit a linear regression model using the longitude feature <code>'long'</code> and calculate the R^2.


In [None]:
X = df[['long']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X, Y)
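For a regression model, <code>score()</code> returns the R^2, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following sketch recomputes it by hand on synthetic data to show that both values agree; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy linear relationship (not the housing data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both numbers should match up to floating-point error
print(r2_manual, model.score(x, y))
```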



Fit a linear regression model to predict the <code>'price'</code> using the feature <code>'sqft_living'</code> then calculate the R^2. Take a screenshot of your code and the value of the R^2.


In [None]:
a=df[['sqft_living']]
b=df['price']
lg=LinearRegression()
lg.fit(a,b)
price=lg.predict(a)
print("predicted price: ", price)

print("R^2 value: ", lg.score(a,b))



Fit a linear regression model to predict the <code>'price'</code> using the list of features:


In [None]:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]     

Then calculate the R^2. Take a screenshot of your code.


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split  # used below; the lab otherwise imports it only in a later cell

c = df[features]
d = df['price']
# Hold out 20% of the rows for testing
c_train, c_test, d_train, d_test = train_test_split(c, d, test_size=0.2, random_state=42)
# Fit the imputer on the training split only, then apply it to both splits
imputer = SimpleImputer(strategy='mean')
c_train_imputed = imputer.fit_transform(c_train)
c_test_imputed = imputer.transform(c_test)
lg = LinearRegression()
lg.fit(c_train_imputed, d_train)
predict = lg.predict(c_test_imputed)
print("Predicted Prices:", predict)
# R^2 on the held-out test split
r_squared = lg.score(c_test_imputed, d_test)
print("R-squared:", r_squared)




Create a list of tuples. The first element in each tuple contains the name of the estimator:

<code>'scale'</code>

<code>'polynomial'</code>

<code>'model'</code>

The second element in the tuple  contains the model constructor

<code>StandardScaler()</code>

<code>PolynomialFeatures(include_bias=False)</code>

<code>LinearRegression()</code>


In [None]:
Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]



Use the list to create a pipeline object to predict the 'price', fit the object using the features in the list <code>features</code>, and calculate the R^2.


In [None]:
features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms",
            "sqft_living15", "sqft_above", "grade", "sqft_living"]
# Drop rows with missing values; chaining dropna() avoids modifying a slice of df in place
data = df[features + ['price']].dropna()
X = data[features]
y = data['price']
pipeline = Pipeline([
    ('scale', StandardScaler()),  # Standardize features
    ('polynomial', PolynomialFeatures(include_bias=False)),  # Create polynomial features
    ('model', LinearRegression())  # Linear regression model
])
pipeline.fit(X, y)
predicted_prices = pipeline.predict(X)
r_squared = pipeline.score(X, y)
print("Predicted Prices:", predicted_prices)
print("R-squared:", r_squared)
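It helps to know how many columns the polynomial step produces. With its default degree of 2 and <code>include_bias=False</code>, <code>PolynomialFeatures</code> turns 11 input features into 11 linear terms, 11 squares, and 55 pairwise products. The sketch below checks this on random data; <code>X_demo</code> is a made-up array, not the housing data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Random stand-in for a design matrix with 11 features
X_demo = np.random.default_rng(0).normal(size=(5, 11))

# Degree-2 expansion without the constant (bias) column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

print(X_demo.shape, "->", X_poly.shape)  # (5, 11) -> (5, 77)
```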


# Model Evaluation and Refinement


Import the necessary modules:


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")

We will split the data into training and testing sets:


In [None]:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]    
X = df[features]
Y = df['price']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)


print("number of test samples:", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
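The module <code>cross_val_score</code> imported above offers an alternative to a single train/test split: it fits the model on several folds and returns one score per fold. The following is a minimal sketch on synthetic data; the arrays <code>X_demo</code> and <code>y_demo</code> and the choice of 4 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with three features (not the housing data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

# For a regressor, the default score is R^2; cv=4 gives four folds
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=4)
print(scores, scores.mean())
```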



Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data.


In [None]:
from sklearn.linear_model import Ridge

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms",
            "sqft_living15", "sqft_above", "grade", "sqft_living"]
# Drop rows with missing values; chaining dropna() avoids modifying a slice of df in place
data = df[features + ['price']].dropna()
X = data[features]
y = data['price']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)
ridge = Ridge(alpha=0.1)
ridge.fit(x_train, y_train)
predicted_prices = ridge.predict(x_test)
r_squared = r2_score(y_test, predicted_prices)
print("R-squared using Ridge regression:", r_squared)




Perform a second order polynomial transform on both the training data and testing data. Create and fit a Ridge regression object using the training data, set the regularisation parameter to 0.1, and calculate the R^2 utilising the test data provided. Take a screenshot of your code and the R^2.


In [None]:
features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms",
            "sqft_living15", "sqft_above", "grade", "sqft_living"]
# Drop rows with missing values; chaining dropna() avoids modifying a slice of df in place
data = df[features + ['price']].dropna()
X = data[features]
y = data['price']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)
poly = PolynomialFeatures(degree=2)
x_train_poly = poly.fit_transform(x_train)
x_test_poly = poly.transform(x_test)
ridge = Ridge(alpha=0.1)
ridge.fit(x_train_poly, y_train)
predicted_prices = ridge.predict(x_test_poly)
r_squared = r2_score(y_test, predicted_prices)
print("R-squared using Polynomial Ridge regression:", r_squared)
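The regularization parameter <code>alpha</code> controls how strongly Ridge regression shrinks the coefficients: the larger the value, the smaller the coefficient vector. The sketch below illustrates this on synthetic data; the alpha values other than 0.1 and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with four features (not the housing data)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([5.0, -3.0, 2.0, 1.0]) + rng.normal(0, 0.5, size=60)

# Fit Ridge for increasing alpha and record the length of the coefficient vector
coef_norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```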
