<a href="https://colab.research.google.com/github/JATC1024/Kaggle-House-Prices/blob/master/Kaglge_house_prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KAGGLE HOUSE PRICES
This notebook is an attempt at the [Kaggle House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition.<br>
The goal of the competition is to predict house sale prices from the provided features.<br>
The data comes from the competition itself.

## Data collection
The train data (**train.csv**) and the test data (**test.csv**) can be downloaded from the competition's [data page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

In [0]:
import pandas as pd

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [22]:
# Check if the collected data sets are correct
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [23]:
test_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
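
Looking at the first rows is a quick visual check. As a complement, here is a minimal sketch of a programmatic check, assuming that the test file should contain every train column except the target and that the two files should not share any Id. The helper name `compare_splits` and the tiny demo frames are made up for illustration; in the notebook you would pass `train_data` and `test_data` instead.

```python
import pandas as pd

def compare_splits(train, test, target="SalePrice", id_col="Id"):
    """Return basic consistency facts about a train/test pair (hypothetical helper)."""
    train_cols = set(train.columns)
    test_cols = set(test.columns)
    return {
        # Columns present in train but absent from test (ideally only the target).
        "only_in_train": sorted(train_cols - test_cols),
        # Columns present in test but absent from train (ideally none).
        "only_in_test": sorted(test_cols - train_cols),
        # Ids that appear in both files (ideally none).
        "shared_ids": sorted(set(train[id_col]) & set(test[id_col])),
    }

# Tiny made-up frames that mimic the layout of the real files.
demo_train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
demo_test = pd.DataFrame({"Id": [1461, 1462], "LotArea": [11622, 14267]})

report = compare_splits(demo_train, demo_test)
print(report)
```

If `only_in_train` lists anything besides the target, or `shared_ids` is not empty, the files are probably not the ones we expect.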


## Data understanding
Now we'll explore the data to understand it better. Some steps we can follow:
1. Make an investigation on the data description provided by Kaggle.
2. Visualize the data.

**Note**: The following code requires **train.csv**, **test.csv**, **data_description.txt** and **brief.txt** to be placed in the current directory.
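
To catch a missing file early, a small check like the one below can be run first. This is only a sketch: the helper name `missing_files` and the constant `REQUIRED_FILES` are not part of the original notebook.

```python
from pathlib import Path

# The four files that the rest of the notebook reads.
REQUIRED_FILES = ["train.csv", "test.csv", "data_description.txt", "brief.txt"]

def missing_files(names=REQUIRED_FILES, directory="."):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```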

**How big is the data?**<br>
It's good to know how much data we're dealing with.

In [33]:
train_data.shape

(1460, 81)

As we can see, the training data contains 1460 observations and 81 columns: the Id, 79 features, and the target SalePrice. That is a lot of features, so let's go through them slowly and carefully.
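
With this many columns, it also helps to know how many are numeric and how many are not. Here is a minimal sketch, assuming a plain pandas DataFrame as input. The helper `summarize_columns` and the demo frame are made up for illustration; in the notebook you would call it on `train_data`.

```python
import pandas as pd

def summarize_columns(df):
    """Count rows, columns, and numeric vs. non-numeric columns of a DataFrame."""
    numeric = df.select_dtypes(include="number").columns
    other = df.columns.difference(numeric)
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "numeric": len(numeric),
        "non_numeric": len(other),
    }

# Tiny made-up frame with two numeric columns and one text column.
demo = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSZoning": ["RL", "RL", "RH"],
    "LotArea": [8450, 9600, 11250],
})

summary = summarize_columns(demo)
print(summary)
```

Note that a numeric column is not necessarily a quantity: a column can store category codes as numbers, so the counts are only a first orientation.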

**Let's take a look at the data description:**<br>
A brief description:

In [30]:
with open("brief.txt") as file:
  print(file.read())

Here's a brief version of what you'll find in the data description file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof m

Since the detailed description is rather long, we will not show it here. Instead, let's write a function that prints the details of a feature along with some of its statistical characteristics. When we want to learn more about a feature shown in the brief description, we just need to call the function.

In [0]:
import re

def show_details(feature):
  """Print the description, listed values, summary statistics and missing-value counts of a feature."""
  with open("data_description.txt", "r") as file:
    lines = file.readlines()

  # A feature header starts at the beginning of a line with a letter
  # and looks like "Name: description"; value lines are indented.
  pattern = re.compile(r"[A-Za-z].+:.+")
  flag = False  # True while we are inside the block of the wanted feature.
  for line in lines:
    if pattern.match(line):
      # Split on the first colon only, so a colon in the description is harmless.
      cur_feature, _, des = line.partition(":")
      if cur_feature.strip().lower() == feature.lower():
        flag = True  # Turn on the flag
        print("Feature name: ", cur_feature)
        print("Description: ", des.lstrip())
        print("Values: ")
      else:
        flag = False
    elif flag:
      print(line)

  # Describe the data
  print(train_data[feature].describe())
  # Check for missing values:
  missing_values = train_data[feature].isnull()
  print("Missing values:")
  print(missing_values.value_counts())
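
The function above reports missing values one feature at a time. To see at a glance which columns have gaps, a table over all columns can help. The sketch below is an addition, not part of the original notebook: the helper `missing_value_table` and the demo frame are hypothetical, and in the notebook you would call it on `train_data`.

```python
import pandas as pd

def missing_value_table(df):
    """Return the columns with at least one missing value, sorted by missing count."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    # "share" is the fraction of rows in which the column is missing.
    return pd.DataFrame({"missing": counts, "share": counts / len(df)})

# Tiny made-up frame: Alley is mostly missing, LotArea is complete.
demo = pd.DataFrame({
    "LotFrontage": [65.0, None, 68.0, 60.0],
    "Alley": [None, None, None, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

table = missing_value_table(demo)
print(table)
```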

We will learn about the features in the following cell. Simply change the feature name to whichever feature you want to learn about.

In [69]:
show_details("MSSubClass")

Feature name:  MSSubClass
Description:  Identifies the type of dwelling involved in the sale.	

Values: 


        20	1-STORY 1946 & NEWER ALL STYLES

        30	1-STORY 1945 & OLDER

        40	1-STORY W/FINISHED ATTIC ALL AGES

        45	1-1/2 STORY - UNFINISHED ALL AGES

        50	1-1/2 STORY FINISHED ALL AGES

        60	2-STORY 1946 & NEWER

        70	2-STORY 1945 & OLDER

        75	2-1/2 STORY ALL AGES

        80	SPLIT OR MULTI-LEVEL

        85	SPLIT FOYER

        90	DUPLEX - ALL STYLES AND AGES

       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER

       150	1-1/2 STORY PUD - ALL AGES

       160	2-STORY PUD - 1946 & NEWER

       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER

       190	2 FAMILY CONVERSION - ALL STYLES AND AGES



count    1460.000000
mean       56.897260
std        42.300571
min        20.000000
25%        20.000000
50%        50.000000
75%        70.000000
max       190.000000
Name: MSSubClass, dtype: float64
Missing values:
False    1460
Na