## House Prices Prediction

- `SalePrice`: the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month sold
- YrSold: Year sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

In [None]:
df = pd.read_csv('train.csv')
target = 'SalePrice'

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df['SaleType'].dtypes

In [None]:
missing_per = df.isnull().sum()*100/len(df)

In [None]:
# for index, value in missing_per.items():
#     print(value)


In [None]:
missing_less_than_10_per = []
missing_greater_tha_10_per = []
missing_greater_tha_20_per = []

# Bucket columns by their percentage of missing values.
for col, value in missing_per.items():
    if 0 < value <= 10:        # (0, 10]
        missing_less_than_10_per.append(col)
    elif 10 < value <= 20:     # (10, 20]
        missing_greater_tha_10_per.append(col)
    elif value > 20:           # above 20
        missing_greater_tha_20_per.append(col)

In [None]:
# 0< value <= 10


In [None]:
df[missing_greater_tha_20_per].isnull().sum()

In [None]:
df.drop(columns=missing_greater_tha_20_per, inplace=True) 

In [None]:
cat_columns = []
num_columns = []
yes_no_columns  = []

for column in df.columns:
    if column == target:
        continue
    elif df[column].nunique()==2:
        yes_no_columns.append(column)
    elif df[column].dtypes =='O':
        cat_columns.append(column)
    else:
        num_columns.append(column)

In [None]:
(df.nunique()==2).sum()

In [None]:
# df[yes_no_columns]

In [None]:
yes_no_columns

In [None]:
df.shape

In [None]:
num_columns

In [None]:
df[num_columns].shape

In [None]:
# cat_columns

In [None]:
df[cat_columns].shape

In [None]:
# df.info()

### Feature Engineering Pipeline

In [None]:
df = df.dropna()

In [None]:
df.shape

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression

In [None]:
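preprocessing = ColumnTransformer([
    # Standardize numeric features to zero mean and unit variance.
    ('scaling', StandardScaler(), num_columns),
    # Ignore categories that appear only at prediction time instead of raising.
    ('oneHot', OneHotEncoder(handle_unknown='ignore'), cat_columns),
    # Map two-valued columns to 0/1.
    ('yes_no', OrdinalEncoder(), yes_no_columns)
])
Training_Pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('model', LinearRegression())
])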

In [None]:
x, y = df.drop(columns=target), df[target]
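# Hold out 20% of the rows for evaluation; fix the seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)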

In [None]:
Training_Pipeline.fit(X_train, y_train)

In [None]:
df['LotShape'].unique()

In [None]:
X_test

In [None]:
y_prediction = Training_Pipeline.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
mean_absolute_error(y_test, y_prediction)
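
To make the two imported metrics concrete, here is a minimal sketch that computes the mean absolute error and the root mean squared error by hand with NumPy. The prices in `y_true_demo` and `y_pred_demo` are made-up illustrative values, not results from this notebook; `mean_absolute_error` and `mean_squared_error` from scikit-learn return the same MAE and MSE quantities.

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only.
y_true_demo = np.array([200000.0, 150000.0, 320000.0])
y_pred_demo = np.array([210000.0, 140000.0, 300000.0])

errors = y_pred_demo - y_true_demo

# MAE: average size of the error, in dollars.
mae_demo = np.mean(np.abs(errors))

# MSE squares each error, so large misses weigh more; RMSE brings it back to dollars.
mse_demo = np.mean(errors ** 2)
rmse_demo = np.sqrt(mse_demo)

print(f"MAE:  {mae_demo:.2f}")
print(f"RMSE: {rmse_demo:.2f}")
```

Because RMSE penalizes large misses more heavily than MAE, comparing the two gives a rough sense of whether a few houses are predicted very badly.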

In [None]:
y_prediction

In [None]:
cat_columns

### Detecting Multicollinearity
The **variance inflation factor (VIF)** measures how much collinearity a predictor has with the other predictors in a multiple regression model. By a common rule of thumb:
- A VIF of 1 means the predictor is not correlated with the other predictors.
- A VIF between 1 and 5 indicates moderate correlation.
- A VIF between 5 and 10 indicates high correlation; values above 10 are usually treated as serious multicollinearity.
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def VIF(x):
    """Return the VIF of every column of the numeric DataFrame x, highest first."""
    vif = pd.DataFrame({
        'Features': x.columns,
        'VIF': [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
    })
    vif['VIF'] = vif['VIF'].round(2)
    vif = vif.sort_values(by='VIF', ascending=False).reset_index(drop=True)
    return vif
```
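
The thresholds above follow from the definition of the statistic: the VIF of feature j is 1 / (1 - R²), where R² comes from regressing feature j on all the other features. The sketch below computes it directly with NumPy on synthetic data. The helper name `vif_from_definition` and the columns `a`, `b`, `c` are hypothetical and exist only for this illustration; it is not the statsmodels implementation.

```python
import numpy as np

def vif_from_definition(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of a 2-D array."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept so R^2 is the usual centered R^2.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features: `c` is almost a copy of `a`, while `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

demo_vifs = vif_from_definition(np.column_stack([a, b, c]))
print(demo_vifs.round(2))  # order: a, b, c
```

The two nearly identical columns should show a VIF far above 10, while the independent column stays close to 1. Note that this sketch adds an intercept itself, whereas `variance_inflation_factor` regresses on exactly the columns it is given, so a constant column is typically added to the data before calling it.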