# Machine Learning Process

- Outline
    - Problem Statement
    - Machine Learning Problem
    - Data Set
    - Pre-processing Data
    - Machine Learning Model
    - Model Evaluation 

## Problem Statement

A real estate agency wants to estimate _fair_ prices of houses in a locality. In its daily business, the agency uses various parameters/features to evaluate a property and arrive at a fair price. The problem with the current approach is that it requires human expertise and is at times subjective.

The agency wants to design a mechanism that estimates a _fair_ price for a house based on its features/attributes.

![image.png](attachment:2f12ecc2-6b81-49d7-bf13-d700796decd3.png)

## Machine Learning Problem

![image.png](attachment:8004b026-bfcc-4e9e-9de0-4dac7990f45b.png)

### What is the nature of the problem?

- Regression vs Classification

## Data Set

- https://www.kaggle.com/datasets/yasserh/housing-prices-dataset/data

![image.png](attachment:e3a83b6f-9237-4d1c-8991-9a97a38a3410.png)

# Python Packages for Data Science

![image.png](attachment:02f5fecf-7fff-4ad3-9b84-c49216f42574.png)

![image.png](attachment:723565cc-f57f-49ee-828d-2a9b2d21d9c8.png)

![image.png](attachment:e88442e3-cad2-427e-be98-172f14de83ba.png)

![image.png](attachment:b3ba1d76-8274-4823-bef4-ec9481f98cb7.png)

# Reading & Writing Data in Python

In [None]:
import pandas as pd
import numpy as np

In [None]:
path = "Data/HousingPrices.csv"
df = pd.read_csv(path)

## Printing the Data Frame
* df prints the entire data frame
* df.head(n) prints the first n rows
* df.tail(n) prints the last n rows

In [None]:
df.head() # Prints the top 5 rows

In [None]:
df.head(10)
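
`df.tail(n)` from the list above is not run in the cells here. A minimal sketch on a small made-up frame (the `toy` values are illustrative assumptions, not rows from the housing file):

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "Area": [7420, 8960, 9960, 7500, 7420, 7500, 8580],
    "Bedrooms": [4, 4, 3, 4, 4, 3, 4],
})

first_three = toy.head(3)  # first 3 rows
last_two = toy.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```

With no argument, both `head()` and `tail()` return 5 rows.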

![image.png](attachment:5d528a6e-6889-4361-a2a6-a447cf609ce8.png)

# Analyzing Data in Python

![image.png](attachment:c4398c9f-983c-4978-8996-01f2c1d1d696.png)

## Checking the Data Types of Columns
* df.dtypes

In [None]:
df.dtypes

In [None]:
df.describe() 

# Data Preprocessing

![image.png](attachment:99f2e557-0725-4cb0-ab98-3710b14b567d.png)

![image.png](attachment:59655133-cdd7-4a01-89a7-903286e34a64.png)

## Accessing columns of a data frame

In [None]:
df["Area"]

In [None]:
df["Bedrooms"]

## Dealing with Missing Values

![image.png](attachment:e504a4dd-761a-45e9-ac82-e4ced31cb660.png)

![image.png](attachment:4ff8a5f1-2648-4427-b8c9-6f6d12c38a18.png)

## Replace ? in the _Area_ column with NaN

In [None]:
df.head()

In [None]:
df["Area"]

In [None]:
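# Assign the result back: an in-place replace on a selected column may not update df in recent pandas versions
df["Area"] = df["Area"].replace("?", np.nan)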

In [None]:
df.head()

In [None]:
df["Area"] = pd.to_numeric(df["Area"])

In [None]:
df.dtypes

In [None]:
mean = df["Area"].mean()
mean

In [None]:
# Fill the missing values with the column mean and assign the result back
df["Area"] = df["Area"].fillna(mean)

In [None]:
df["Area"]
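
The same clean-up can be checked end to end on a small made-up column. This is only a sketch: the `toy` values are assumptions, and `errors="coerce"` is an alternative that turns any unparseable entry (here `?`) into NaN in one step.

```python
import pandas as pd

# Hypothetical column with "?" marking missing values
toy = pd.DataFrame({"Area": ["7420", "?", "9960", "?", "7500"]})

# Convert to numbers; entries that cannot be parsed become NaN
toy["Area"] = pd.to_numeric(toy["Area"], errors="coerce")
n_missing = int(toy["Area"].isna().sum())

# mean() skips NaN by default, so it averages the three known values
area_mean = toy["Area"].mean()
toy["Area"] = toy["Area"].fillna(area_mean)

print(n_missing, area_mean)
print(toy)
```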

## How to drop rows/columns with missing values?

![image.png](attachment:dfc765a4-3b64-4881-ae05-c4a4ebeb6212.png)

In [None]:
df.dropna(subset=["Price"], axis=0, inplace = True)
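
A small sketch of how `dropna` behaves for rows versus columns, on a made-up frame (all values below are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in Area and Price
toy = pd.DataFrame({
    "Area": [7420.0, np.nan, 9960.0, 7500.0],
    "Bedrooms": [4, 4, 3, 4],
    "Price": [13300000.0, 12250000.0, np.nan, 12215000.0],
})

# Drop only the rows where Price is missing, then renumber the index
rows_kept = toy.dropna(subset=["Price"]).reset_index(drop=True)

# Drop every column that contains at least one missing value
cols_kept = toy.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

Without `inplace=True`, `dropna` returns a new frame and leaves `toy` unchanged.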

## Drop the non-numeric columns

In [None]:
df.drop(['Mainroad', 'Guestroom', 'Basement', 'Hotwaterheating', 'Airconditioning', 'Furnishingstatus' ], axis=1, inplace=True)

In [None]:
df.head()

## Visualization

In [None]:
import pandas as pd
path = "Data/HousingPricesClean.csv"
df = pd.read_csv(path)
df.head()

In [None]:
import matplotlib.pyplot as plt  # if missing, install with: pip install matplotlib
plt.scatter(df["Area"], df["Price"])
plt.title("Relationship b/w Area and Price")
plt.xlabel("Area")
plt.ylabel("Price")

In [None]:
plt.scatter(df["Bedrooms"], df["Price"])
plt.title("Relationship b/w Bedrooms and Price")
plt.xlabel("Bedrooms")
plt.ylabel("Price")

In [None]:
import seaborn as sns
sns.regplot(x="Area", y = "Price", data=df)
plt.ylim(0,)
plt.xlabel("Area")
plt.title("Correlation b/w Area and Price")

## Dividing the Data into Two Parts

In [None]:
X = df.drop(columns=["Price"])  # features: every column except the target
Y = df["Price"]                 # target: the house price

In [None]:
print(X,Y)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2, random_state=42)

In [None]:
df.shape

In [None]:
print(X_train.shape)
print(Y_train.shape)

In [None]:
print(X_test.shape)
print(Y_test.shape)

In [None]:
help(train_test_split)

## Model Development

In [None]:
from sklearn import linear_model

In [None]:
model = linear_model.LinearRegression()

In [None]:
# Training the model with train set
model.fit(X_train, Y_train)

In [None]:
# The coefficients
print("Coefficients: \n", model.coef_)

In [None]:
# Training Error
from sklearn.metrics import mean_squared_error

Y_train_pred = model.predict(X_train)

train_error = mean_squared_error(Y_train, Y_train_pred)


print("Training mean squared error: %.3f" % train_error)

In [None]:
#Test Error
Y_test_pred = model.predict(X_test)

test_error = mean_squared_error(Y_test, Y_test_pred)


print("Test mean squared error: %.3f" % test_error)

In [None]:
# Ratio of train to test error; a value well below 1 suggests overfitting
print(train_error/test_error)
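
Mean squared error is measured in squared price units, so its raw value is hard to judge when prices are large. Two common companions are the root mean squared error (same units as the price) and the R² score. The sketch below uses synthetic data, not the housing file; the names `X_demo`, `demo_model` and the generated numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price grows linearly with area, plus noise (made-up numbers)
rng = np.random.default_rng(0)
area = rng.uniform(1500, 15000, size=200)
price = 500 * area + rng.normal(0, 500_000, size=200)
X_demo = area.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

Xtr, Xte, ytr, yte = train_test_split(X_demo, price, test_size=0.2, random_state=42)
demo_model = LinearRegression().fit(Xtr, ytr)
pred = demo_model.predict(Xte)

mse = mean_squared_error(yte, pred)
rmse = np.sqrt(mse)       # back in price units
r2 = r2_score(yte, pred)  # 1.0 is a perfect fit

print("MSE: %.3f" % mse)
print("RMSE: %.3f" % rmse)
print("R^2: %.3f" % r2)
```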

## Way Forward

- Both train and test errors are too high
- What should be done?
- Add more features?
    - How to encode non-numeric data?
    - 0,1 for binary data columns
    - Is that a good approach?
- What if the results are still not good?
- Feature Engineering?
- Model Optimization
- More Advanced Models
    - Polynomial Models
    - Neural Networks 
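
As a starting point for the encoding question above, here is a minimal sketch: map a two-valued column to 0/1 and one-hot encode a column with more than two categories. The `toy` frame and the labels (`yes`/`no`, `furnished`, ...) are assumptions about what the dropped columns contain.

```python
import pandas as pd

# Hypothetical sample of the non-numeric columns
toy = pd.DataFrame({
    "Area": [7420, 8960, 9960],
    "Mainroad": ["yes", "yes", "no"],
    "Furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Binary column: map the two labels to 1 and 0
toy["Mainroad"] = toy["Mainroad"].map({"yes": 1, "no": 0})

# Multi-category column: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["Furnishingstatus"], dtype=int)

print(encoded)
```

Coding three or more unordered categories as 0, 1, 2 in a single column would impose an order on them, which is why the sketch uses indicator columns instead.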