### <h1>Zillow Town Prediction based on Housing Sale Record<h1>

Author: Leah Shin 

Date Created: 1/27/2023

Data on about 11,000 homes sold in a single year are extracted from Zillow for 50 campus towns in the U.S. (https://www.zillow.com). The House_Prices data represent a highly simplified/modified sample of anonymized homes, excluding homes with observations that seem erroneous and/or not pertaining to sales.

### STEP 1

1. How many rows and columns are there?

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

#Read csv file in pandas dataframe and print first 5 rows.
Data =  pd.read_csv('House_Prices.csv')
Data.head()

Unnamed: 0.1,Unnamed: 0,Record,Sale_amount,Sale_date,Beds,Baths,Sqft_home,Sqft_lot,Type,Build_year,Town,University,Type2
0,1,1,295000.0,42521,5,3.0,2020,38332.8,3,1976,1,10,3
1,2,2,240000.0,42541,4,2.0,1498,54014.4,3,2002,1,10,3
2,3,3,385000.0,42521,5,4.0,4000,85813.2,3,2001,1,10,3
3,4,4,268000.0,42472,3,2.5,2283,118918.8,3,1972,1,10,3
4,5,5,186000.0,42465,3,1.25,1527,15681.6,3,1975,1,10,3


2. Looks like there are some extra columns with row markers that appeared over the cleaning process.  How many "actual" columns are there?

In [5]:
Data=Data.drop(['Unnamed: 0', 'Type2'], axis=1)
print(len(Data.index))
print(len(Data.columns))

10659
11


Seems like there is column that does not contain useful information: `Unnamed`,`type2`.  I will go ahead and remove those columns. 
After cleaning columns, I have 10659 rows and 11 actual columns that I can use for my model.

In [6]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10659 entries, 0 to 10658
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Record       10659 non-null  int64  
 1   Sale_amount  10659 non-null  float64
 2   Sale_date    10659 non-null  int64  
 3   Beds         10659 non-null  int64  
 4   Baths        10659 non-null  float64
 5   Sqft_home    10659 non-null  int64  
 6   Sqft_lot     10659 non-null  float64
 7   Type         10659 non-null  int64  
 8   Build_year   10659 non-null  int64  
 9   Town         10659 non-null  int64  
 10  University   10659 non-null  int64  
dtypes: float64(3), int64(8)
memory usage: 916.1 KB


Seems like the `Sale_date` is not formated as datetime. Normally this happens when the user do not specify the format. Since we already know Lee's number format of the date, we do not need to change the format.

### STEP 2

Now I will be using classification model for the analysis. Before creating X and Y for the model, I will be cleaning some columns such as `Record` and `University` column.

1. Before creating X and Y for the model, Since I am going to predict the town and including the closest university would make the test too easy, I will remove `University` column, and any record IDs.

In [7]:
Data = Data.drop(['Record','University'], axis=1)
Data.head()

Unnamed: 0,Sale_amount,Sale_date,Beds,Baths,Sqft_home,Sqft_lot,Type,Build_year,Town
0,295000.0,42521,5,3.0,2020,38332.8,3,1976,1
1,240000.0,42541,4,2.0,1498,54014.4,3,2002,1
2,385000.0,42521,5,4.0,4000,85813.2,3,2001,1
3,268000.0,42472,3,2.5,2283,118918.8,3,1972,1
4,186000.0,42465,3,1.25,1527,15681.6,3,1975,1


Next, I will create X and Y for the model. 

In [8]:
X = Data.drop(['Town'],axis=1)
Y = Data[['Town']]


2. Now, Split to a Training, Test set and choose the model.

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Initialize the Random Forest Classifier
model = RandomForestClassifier()


I decided to use RandomForest due to following reasons:

* Perform both regression and classification tasks. 
* Produces good predictions that can be understood easily. 
* Can handle large datasets efficiently.
* Algorithm provides a higher level of accuracy in predicting outcomes over the decision tree algorithm.


3. Grid Search CV is a great method to find the optimal hyperparameters.

In [10]:
# Define the hyperparameters to be searched
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [5, 10, 15, 20]
}

# Use grid search to find the optimal hyperparameters
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, Y_train.values.ravel())

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [5, 10, 15, 20],
                         'n_estimators': [10, 50, 100, 200]})

4. Also, I will be using Cross-Validation to keep the Test data clean and hoose one best model to use for test set. (`accracy_score()` will be used to pick a good model.)

In [11]:
# Use cross-validation to keep the test data clean
scores = cross_val_score(grid_search.best_estimator_, X_train, Y_train.values.ravel(), cv=5)

# Train the model using the training set
grid_search.best_estimator_.fit(X_train, Y_train.values.ravel())

# Use accuracy score to evaluate the model
train_acc = accuracy_score(Y_train, grid_search.best_estimator_.predict(X_train))
test_acc = accuracy_score(Y_test, grid_search.best_estimator_.predict(X_test))


6. Next, I will ask it to show me the answer and explain/present the findings. 

In [12]:

# Present the findings
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
print("Best hyperparameters:", grid_search.best_params_)

if test_acc < train_acc:
    print("The testing accuracy is lower than the training accuracy.")
    print("This could be due to overfitting, where the model is too complex and fits the training data too well.")
else:
    print("The testing accuracy is equal to or higher than the training accuracy.")

Train accuracy: 0.9861616043157031
Test accuracy: 0.399624765478424
Best hyperparameters: {'max_depth': 15, 'n_estimators': 200}
The testing accuracy is lower than the training accuracy.
This could be due to overfitting, where the model is too complex and fits the training data too well.


7. Lets try to do some prediction to see if it the model works well. 


* Lee purchased a 1,450 sq ft Single Family home (coded as 1) on 2018-04-11, (43201) for $350,000.  The house has 3 bedrooms and 2 baths.  It was built in 1992, and is on a 40,000 square foot lot.  What town it will be located?

In [13]:
property = pd.DataFrame({

    'Sale_amount': [3500000],
    'Sale_date':[43201],
    'Beds': [3],
    'Baths': [2],
    'Sqft_home': [1450],
    'Sqft_lot': [40000],
    'Type': [3],
    'Build_year': [1992],
})

# Use the predict method of the trained model to predict the town
prediction = grid_search.best_estimator_.predict(property)

# The prediction will be an array with one element, the predicted town
print("The predicted town is:", prediction[0])

The predicted town is: 43
