# ***1.BUSINESS UNDERSTANDING***
## ***Goal:***

- Identify the top 5 zip codes for real estate investment, considering potential return on investment, market stability, and future growth prospects.

## ***Key Variables (Model Targets):***
- Price Appreciation: Growth in property values over time.
- Market Stability: Consistency in price trends, indicating lower risk.
- Demand Indicators: Factors influencing the desirability of the area (e.g., demographics, economic growth).
- Investment Return Potential: Estimated return based on historical and forecasted data.

## ***Data Source Identification:***

-Primary Source: Zillow Research dataset, providing historical real estate prices by zip code.

## ***Objectives:***
#### ***Project Goals:***
- Quantitative Analysis: Analyze historical price trends and forecast future growth.

### ***Success Metrics:***
- Financial Returns: Target a specific return rate or price appreciation percentage.
- Risk Assessment: Evaluate and limit the investment risk based on market stability.

----------------------
# ***2.DATA ACQUISATION AND UNDERSTANDING***

The Zillow dataset provides detailed real estate data, with each row representing a unique zip code. Here's an overview of the dataset structure:

- RegionID: A unique identifier for each region.
- RegionName: The zip code for the region.
- City: The city where the region is located.
- State: The state where the region is located.
- Metro: The metropolitan area associated with the region.
- CountyName: The name of the county where the region is located.
- SizeRank: A ranking of the region based on size.
- Monthly Price Data: Starting from April 1996 to April 2018, this dataset includes monthly real estate prices for each zip code.

We'll analyze historical price trends at a zip code level, which is crucial for our objective of identifying the top 5 zip codes for real estate investment. The analysis will involve:

- Trend Analysis: Evaluating the long-term price trends in each zip code.
-Volatility Assessment: Understanding the stability or variability in prices over time.
-Comparative Analysis: Comparing zip codes across different regions, cities, or states.
- Forecasting: Applying statistical or machine learning models to predict future price trends.


In [1]:
import pandas as pd

# Loading the dataset
zillow_data = pd.read_csv('zillow_data.csv')



In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Zillow dataset

zillow_data = pd.read_csv('zillow_data.csv')

# Basic Data Overview
print("Data Shape:", zillow_data.shape)
print("Data Info:")
zillow_data.info()

# Summary Statistics
print("\nSummary Statistics:")
zillow_data.describe()

# Check for Missing Values
missing_values = zillow_data.isnull().sum()
print("\nMissing Values:")
print(missing_values[missing_values > 0])

# Function to reshape the dataset from wide to long format
def melt_data(df):
    # Convert date columns
    date_columns = pd.to_datetime(df.columns[7:], format='%Y-%m')
    df.columns = list(df.columns[:7]) + list(date_columns)

    melted = pd.melt(df, id_vars=['RegionName', 'RegionID', 'SizeRank', 'City', 'State', 'Metro', 'CountyName'], var_name='time', value_name='value')
    melted = melted.dropna(subset=['value'])
    return melted

# Reshape the dataset
long_format_data = melt_data(zillow_data)

# Ensure the time column is sorted
long_format_data = long_format_data.sort_values(by='time')

# Displaying the first few rows of the reshaped and processed dataset
print("\nProcessed Data:")
long_format_data.head()



Data Shape: (14723, 272)
Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 30.6+ MB

Summary Statistics:



Missing Values:
Metro      1043
1996-04    1039
1996-05    1039
1996-06    1039
1996-07    1039
           ... 
2014-02      56
2014-03      56
2014-04      56
2014-05      56
2014-06      56
Length: 220, dtype: int64

Processed Data:


Unnamed: 0,RegionName,RegionID,SizeRank,City,State,Metro,CountyName,time,value
0,60657,84654,1,Chicago,IL,Chicago,Cook,1996-04-01 00:00:00,334200.0
9494,34602,73138,9495,Hill 'n Dale,FL,Tampa,Hernando,1996-04-01 00:00:00,49600.0
9495,53128,81233,9496,Pell Lake,WI,Whitewater,Walworth,1996-04-01 00:00:00,112600.0
9496,7462,60685,9497,Vernon,NJ,New York,Sussex,1996-04-01 00:00:00,135300.0
9497,97048,99049,9498,Rainier,OR,Portland,Columbia,1996-04-01 00:00:00,95300.0


### ***Missing Values***
Metro Column: Contains 1,043 missing values. Since this column represents the metropolitan area, missing values could indicate that the zip code does not belong to a metropolitan area. We'll fill these with a placeholder value like "Non-Metro" 

### ***Data Types***
- Date Columns (Monthly Prices): These are currently in integer format. For time series analysis, converting these to a datetime format would be beneficial.
- Other Columns: RegionID, RegionName, City, State, and CountyName are correctly classified. However, we might not need RegionID for our analysis.

### ***Next Steps in Preprocessing***
- Handle Missing Values
- Convert Date Columns: Transform the monthly price columns into a datetime format.
- Drop Irrelevant Columns: 
- Data Normalization/Standardization: 