<a href="https://colab.research.google.com/github/ApoorvAkash/MachineLearning/blob/main/Assessment_California.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**California House Pricing Prediction**

In [None]:
# Importing the Libraries
import numpy as np
import pandas as pd

# Plotting Libs
import seaborn as sns
import matplotlib.pyplot as plt

# sklearn libs
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error

np.linspace(5,15,10).size

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Load the data :**

In [None]:
# Importing Data
housing = pd.read_excel('housing.xlsx')
housing.info()

FileNotFoundError: ignored

In [None]:
# Displaying the first few rows of the data
print(housing.head())

**Visual EDA**

In [None]:
# Plotting the Housing Data
pairObject = sns.PairGrid(housing)
# sns.distplot is deprecated in recent seaborn releases; histplot replaces it
pairObject.map_diag(sns.histplot)
pairObject.map_lower(sns.scatterplot)
pairObject.map_upper(sns.scatterplot)


**Deductions**

*   median_house_value is **continuous**, therefore it is a **regression** problem.
*   ocean_proximity is a **categorical** column.
*   The visual EDA shows that the data is quite skewed, so the accuracy of a linear model may be low.
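
The skew noted above can also be checked numerically. The sketch below is illustrative only: it uses synthetic columns (the names `symmetric` and `right_skewed` are made up for this example, not taken from the housing data) and shows how `DataFrame.skew()` and a `log1p` transform might be used to quantify and reduce right skew.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration, not the housing file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

# Sample skewness per column: near 0 for symmetric data, large and positive for a long right tail
skewness = demo.skew()

# A log transform is one common way to pull in a long right tail
log_skewness = np.log1p(demo["right_skewed"]).skew()

print(skewness)
print("Skewness after log1p:", log_skewness)
```

On the real data, the same call would be `housing.skew(numeric_only=True)`, assuming `housing` loaded successfully.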



**Handle missing values :**

In [None]:
# Handling Missing Data
housing.isna().sum()

total_bedrooms has 207 missing values. We will replace the NaN values with the mean of the column.

In [None]:
# Using SimpleImputer for Handling Missing Values
total_bedrooms = housing.iloc[:, 4:5]
# Create SimpleImputer object to replace NaN values with Mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(total_bedrooms)
housing.iloc[:, 4:5] = imputer.transform(total_bedrooms)
housing.isna().sum()
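
To make the effect of mean imputation concrete, here is a minimal sketch on a toy column (the values are invented for illustration). It also fits a median imputer for comparison, since the mean is pulled toward large values when a column is skewed; the median variant is an alternative shown for contrast, not what this notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value and one large value (illustrative data only)
toy = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 4.0, 30.0]})

# Mean strategy, as used in the notebook
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
mean_filled = mean_imputer.fit_transform(toy[["total_bedrooms"]])

# Median strategy, shown only for comparison
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
median_filled = median_imputer.fit_transform(toy[["total_bedrooms"]])

print("Mean fill value:  ", mean_filled[1, 0])
print("Median fill value:", median_filled[1, 0])
```

The learned fill value is stored in `statistics_`, so the same imputer can later be applied to unseen rows with `transform`.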

**Encode categorical data :**

In [None]:
# Encoding Categorical Column ocean_proximity using get_dummies method from Pandas
# Dividing Dataset into Label and Features
label = housing.iloc[:, -1]
features = pd.concat([pd.get_dummies(housing.ocean_proximity), housing.iloc[:, 0:8] ], axis=1)
print ("Labels: \n{}".format(label.head()))
print ("Features: \n{}".format(features.head()))
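
The one-hot encoding step can be illustrated on a toy column (the category values below are placeholders for this example). Each row gets exactly one active indicator, which means the full set of indicator columns always sums to 1; passing `drop_first=True` is one common way to remove that redundancy when the model also fits an intercept. This is a sketch of the option, not a change to the notebook's approach.

```python
import pandas as pd

# Toy categorical column (illustrative values only)
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]})

# One indicator column per category, as in the notebook
full = pd.get_dummies(toy["ocean_proximity"])

# Same encoding with the first category dropped to avoid a redundant column
reduced = pd.get_dummies(toy["ocean_proximity"], drop_first=True)

print(full)
print(reduced)
```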

**Split the dataset**

In [None]:
# Train and Test Split
X_train, X_test, Y_train, Y_test = train_test_split(features, label, test_size = 0.2, random_state=5)

**Standardize data**

In [None]:
# Standardizing the Features
independent_scaler = StandardScaler()
X_train = independent_scaler.fit_transform(X_train)
X_test = independent_scaler.transform(X_test)
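
The scaler is fitted on the training set only and then reused on the test set, so the test data does not influence the scaling parameters. A small sketch with made-up numbers shows what that means in practice: the training data ends up with mean 0 and unit variance, while the test data is shifted and scaled by the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative arrays (not taken from the housing data)
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and std from the training data
test_scaled = scaler.transform(X_test_demo)        # reuses the training mean and std

print("Training mean/std learned:", scaler.mean_[0], scaler.scale_[0])
print("Scaled test value:", test_scaled[0, 0])
```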

**Perform Linear Regression**

In [None]:
# As Label is continuous, using Linear Regression
linearRegressionModel = LinearRegression()
linearRegressionModel.fit(X_train, Y_train)

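# Training Data Score (printed explicitly; a bare expression mid-cell is not displayed)
print ("Model Training Score: {}".format(linearRegressionModel.score(X_train, Y_train)))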

# Test Data Score
print ("Model Test Score: {}".format(linearRegressionModel.score(X_test, Y_test)))

# Intercept and Coefficients
print("Intercept is "+str(linearRegressionModel.intercept_))
print("Coefficients are "+str(linearRegressionModel.coef_))

# Prediction Based on X_test
y_pred = linearRegressionModel.predict(X_test)
comparison = pd.DataFrame({'Predicted':y_pred,'Actual':Y_test})
sns.jointplot(x='Actual',y='Predicted',data=comparison,kind='reg');
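
The value returned by `score` for `LinearRegression` is the coefficient of determination, R². The sketch below, on synthetic data generated only for this example, recomputes R² by hand as 1 minus the ratio of residual to total sum of squares and compares it with `score` and `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("Manual R^2:", r2_manual)
print("model.score:", model.score(X_demo, y_demo))
```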

In [None]:
# Root Mean Squared Error between actual and predicted values, on the test and training sets
print("Root mean squared error for Y_test and y_pred is: " + str(np.sqrt(metrics.mean_squared_error(Y_test,y_pred))))
print("Root mean squared error for Y_train and the predictions on X_train is: " + str(np.sqrt(metrics.mean_squared_error(Y_train,linearRegressionModel.predict(X_train)))))
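
As a worked example of the RMSE calculation, the sketch below uses three made-up actual and predicted values and computes the metric both through `mean_squared_error` and directly from its definition. RMSE is expressed in the same units as the label, which makes it easier to interpret than the squared error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

# Square root of sklearn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# The same quantity from the definition: sqrt(mean((actual - predicted)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("RMSE via sklearn:", rmse_sklearn)
print("RMSE by hand:    ", rmse_manual)
```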

**Perform Linear Regression with one independent variable**

In [None]:
# Perform Linear Regression using only one independent variable, median_income
# (the last column of the feature matrix built above)
X_train2 = X_train[:,[-1]]
X_test2 = X_test[:,[-1]]

linRegModelForMedianIncome = LinearRegression()
linRegModelForMedianIncome.fit(X_train2,Y_train)

Y_pred2 = linRegModelForMedianIncome.predict(X_test2)

# Test Data Score
print ("Model Test Score for One Independent Variable: {}".format(linRegModelForMedianIncome.score(X_test2, Y_test)))

# Jointplot for Actual Y_test and Y_pred2

comparison2 = pd.DataFrame({'Predicted':Y_pred2,'Actual':Y_test})
sns.jointplot(x='Actual',y='Predicted',data=comparison2,kind='reg');

# Scatterplot of actual vs. predicted median_house_value for the test set (circles) and the training set (stars)

fig = plt.figure(figsize=(30,10))
plt.scatter(Y_test,Y_pred2,marker="o",edgecolors ="b",s=60)
plt.scatter(Y_train,linRegModelForMedianIncome.predict(X_train2),marker="*",s=50,alpha=0.5)
plt.xlabel(" Actual median_house_value")
plt.ylabel(" Predicted median_house_value")
