# **Data Load And Check**

In [None]:
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path=os.path.join(dirname, filename)
        print(path)

In [None]:
path  # path of the last file found by the loop above

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from warnings import filterwarnings
filterwarnings("ignore")

In [None]:
realtor= pd.read_csv(path)
df=realtor.copy()

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

# **EDA**

### **Missing Values and Feature Extraction**
First, we need to handle the **missing values** that are **structurally missing**, that is, values that are absent for a known reason and that we can therefore determine ourselves. Let's examine them.

In [None]:
nulls=df.isnull().sum()
fig = px.bar(x=nulls.index.values, y=nulls.values,text_auto=True,height=500,width=600)
fig.update_layout(title="Null Variables",
                 xaxis_title="Features",
                 yaxis_title="Count")
fig.update_xaxes(tickangle=70)
fig.show()

Listings that have never been sold have **NaN** in the sale-date variable (**prev_sold_date**; the code below refers to it as `sold_date`), which is a structural gap rather than a data error. Therefore, I record this information in a new variable, **status_of_sale**, with the values **"sold"** and **"not_sold"**, and drop the original date column.

In [None]:
df["status_of_sale"]=np.where(df.sold_date.isnull(),"not_sold","sold")
df.drop(columns="sold_date",inplace=True)
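The flagging step can be checked on a small frame. This is only a sketch: the three rows are invented, and the column name `sold_date` simply follows the code above.

```python
import numpy as np
import pandas as pd

# Toy data (invented): two listings with a sale date, one without.
toy = pd.DataFrame({"sold_date": ["2020-01-15", None, "2019-07-01"]})

# A missing sale date is read as "never sold", not as an unknown value.
toy["status_of_sale"] = np.where(toy["sold_date"].isnull(), "not_sold", "sold")

counts = toy["status_of_sale"].value_counts()
print(counts)
```

Because the flag is derived from the date column, dropping that column afterwards loses only the dates themselves, not the sold/unsold information.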

To visualize the **'state'** variable, I am grouping the states into **five categories**: **the top 4 states** with the **highest number of listings**, and a single category for all **other states**.

In [None]:
top_states=["Massachusetts","New Hampshire","Connecticut","Vermont"]
df["grouped_states"]=np.where(df.state.isin(top_states),df.state,"Other States")
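The four state names above are typed by hand. As a possible alternative, they could be derived from the listing counts. The sketch below uses an invented `states` Series with abbreviated names and is not the notebook's own method.

```python
import pandas as pd

# Toy state column (invented values).
states = pd.Series(["MA"] * 5 + ["NH"] * 4 + ["CT"] * 3 + ["VT"] * 2 + ["ME", "NY"])

# The four most frequent states, taken from the data instead of typed by hand.
top_states = states.value_counts().nlargest(4).index

# Keep the name for states in the top group, replace everything else.
grouped = states.where(states.isin(top_states), "Other States")
print(grouped.value_counts())
```

This keeps the grouping correct if the dataset is updated and the ranking of states changes.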

### **Number Of Status Variable**
The majority of the listings are **ready for sale**.

In [None]:
ct=df.groupby(['status']).size()
fig = px.bar(x=ct.index.values, y=ct.values,text_auto=True,height=500,width=600)
fig.update_layout(title="Number of Status Variable",
                 xaxis_title="Status",
                 yaxis_title="Count")
fig.show()


### **Bed - Bath Density Heatmap**
The most commonly listed houses have **3 bedrooms** and **2 bathrooms**.
Both variables span a wide scale, so the listings range from a **simple house** to **an entire apartment complex** put up for sale.

In [None]:
fig = px.density_heatmap(df, x="bed", y="bath",marginal_x="histogram", marginal_y="histogram",width=500,height=500)

fig.update_layout(yaxis_range=[0,10],xaxis_range=[0,10],title="Bed-Bath Compare")
fig.show()

The **'bed'** and **'bath'** variables represent the rooms of a house, so we can combine them into a single variable and compare them with other variables.

In [None]:
df["bed_bath"] = df.bed + df.bath
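One detail of this sum is worth checking: with plain addition, a listing that lacks either count gets a missing `bed_bath`. The sketch below shows this on invented rows, together with one possible alternative (`fill_value=0`) that is an assumption and not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy rows (invented): one complete listing, two with a missing count.
toy = pd.DataFrame({"bed": [3, 2, np.nan], "bath": [2, np.nan, 1]})

# Plain addition: any missing operand makes the sum missing.
toy["bed_bath"] = toy["bed"] + toy["bath"]

# Alternative (assumption): treat a single missing count as 0.
toy["bed_bath_filled"] = toy["bed"].add(toy["bath"], fill_value=0)
print(toy)
```

Which behaviour is preferable depends on whether a missing count means "none" or "unknown" in the data.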

### **Bed_Bath - House Size Comparison**
In general, as the number of rooms increases, the living area also tends to increase. However, as seen in the graph, there are also listings with **very high numbers of rooms**. These listings are likely to be a **hotel** or a **large commercial property**.

In [None]:
# numeric_only=True skips the text columns, which cannot be averaged
fig = px.bar(df.groupby(["bed_bath"]).mean(numeric_only=True).reset_index(), x='bed_bath', y='house_size',width=700)
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'},
                 title="Comparison of the number of rooms and the variables of living areas.")
fig.update_traces(marker_color='rgb(50,150,200)', marker_line_color='rgb(0,0,0)',
                  marker_line_width=1.5, opacity=0.6)
fig.update_xaxes(type="category")
fig.show()

### **Bed_Bath - Price Comparison**
This graph again shows that **the price** generally increases with **the number of rooms**. For certain room counts, however, prices deviate **widely** from this trend, which may indicate that houses of similar size are listed in both **wealthier** and **poorer states**.

In [None]:
fig = px.bar(df.groupby(["bed_bath"]).mean(numeric_only=True).reset_index(), x='bed_bath', y='price',width=700)
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'},
                 title="Comparing the number of rooms with the prices.")
fig.update_xaxes(type="category")
fig.show()

### **Distribution Of The Acre Lot Variable**
The **'acre_lot'** variable is the size of the land (lot) of a listing, in acres. Its maximum value is **far higher than the median**, which indicates that the majority of the listings are **single apartments** or **detached houses** on small lots, alongside a few very large parcels.

In [None]:
fig = px.violin(df, y="acre_lot",width=500)
fig.update_traces(points=False)
fig.show()
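The gap between the maximum and the median can also be read from a few quantiles. The numbers below are invented lot sizes, chosen only to mimic a strongly right-skewed variable.

```python
import pandas as pd

# Invented lot sizes in acres: mostly small lots plus one very large parcel.
acre_lot = pd.Series([0.1, 0.2, 0.25, 0.3, 0.5, 1.0, 400.0])

# Median, 95th percentile and maximum side by side.
summary = acre_lot.quantile([0.5, 0.95, 1.0])

# How many times larger the maximum is than the typical (median) lot.
ratio = acre_lot.max() / acre_lot.median()
print(summary)
print(ratio)
```

On the real data, the same two lines applied to `df.acre_lot` would quantify the skew that the violin plot shows.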

### **State - House Size Comparison**
In **Pennsylvania**, the average living area of the listings is **larger** than in other states. Two explanations are possible: either houses in this state generally have **spacious living areas**, or a large share of its listings are for **hotels**, **guesthouses**, or **large establishments**.

In [None]:
fig = px.bar(df.groupby(["state"]).mean(numeric_only=True).reset_index(), x='state', y='house_size',width=700)
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'},
                 title="Comparison of states based on the sizes of houses.")
fig.update_xaxes(type="category",tickangle=-70)
fig.show()

### **State - Price Comparison**
The **state** and **price** comparison displays the average price in each state; a **lower average**, however, does not mean that a state has no **houses at higher prices**.

In [None]:
fig = px.bar(df.groupby(["state"]).mean(numeric_only=True).reset_index(), x='state', y='price',width=700)
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'},
                 title="State-Price Compare.")
fig.update_xaxes(type="category",tickangle=-70)
fig.show()

### **Price Distribution For Each State**
Unlike the averages in the previous graph, this graph shows the full distribution: even in states with **lower average** prices, **high-priced houses** are listed.

In [None]:
fig = px.box(df, x="price",y="state",width=700,points=False)
fig.update_layout(xaxis_range=[0,2*10**6])
fig.show()
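The difference between an average and a distribution can be made concrete with a small table. The two states and their prices below are invented for illustration.

```python
import pandas as pd

# Invented prices: state A is uniformly expensive, state B is mostly cheap
# but contains one very expensive listing.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "price": [900_000, 1_000_000, 1_100_000, 100_000, 200_000, 1_500_000],
})

# Mean, median and maximum price per state.
stats = toy.groupby("state")["price"].agg(["mean", "median", "max"])
print(stats)
```

Here state B has the lower mean but the higher maximum, which is the pattern the box plot reveals and the bar chart of averages hides.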

### **State - Bed_Bath Comparison**
In the **bed_bath - price** comparison, prices generally increased with the number of rooms, although some houses with many rooms had lower prices. This graph shows that **Pennsylvania**, **Georgia**, and the **Virgin Islands** have a higher average number of rooms than **Massachusetts**, which suggests that these three may have **larger** and more **affordable** houses than **Massachusetts**.

In [None]:
fig = px.bar(df.groupby(["state"]).mean(numeric_only=True).reset_index(), x='state', y='bed_bath',width=700)
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'},
                 title="State - Bed_Bath Compare.")
fig.update_xaxes(type="category",tickangle=-70)
fig.show()

### **Number of Listings For Each State**
The majority of listings are in **Massachusetts**. With so many listings, **Massachusetts** can be expected to cover a **wide range** of prices.

In [None]:
ct=df.groupby(['state']).size()
fig = px.bar(x=ct.index.values, y=ct.values,text_auto=True,height=600,width=700)
fig.update_layout(title="Number of States Variable",
                xaxis={'categoryorder':'total descending'},
                 xaxis_title="States",
                 yaxis_title="Count")
fig.show()

### **Number of Listings By Price Ranges For Each State**
This graph displays the number of listings in specific price ranges for the four states with the highest number of listings. It can be observed that, apart from **Massachusetts**, **the other states** are mostly listed in **lower price ranges**.

In [None]:
df["price_cat"]=pd.cut(df.price,bins=[0,500000,1000000,5000000,10000000,60000000],right=False,ordered=False,labels=["0-500000","500000-1000000","1000000-5000000","5000000-10000000","10000000-60000000"])
barr=pd.crosstab(df.price_cat,df.grouped_states,rownames=["price_cat"],colnames=["grouped_states"])
barr.reset_index(inplace=True,level="price_cat")
fig = px.bar(barr, x="price_cat", y=["Massachusetts","New Hampshire","Connecticut","Vermont","Other States"], title="Number of listings by price range according to states",width=700,height=600)
fig.update_layout(xaxis_title="Price Ranges",yaxis_title="Number of Listings",legend_title="Top 4 Most Listings States")
fig.show()
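The binning and cross-tabulation can be followed on a handful of rows. This is a reduced sketch with invented prices and only three of the five bins; it mainly shows how `right=False` treats a price that sits exactly on a bin edge.

```python
import pandas as pd

# Invented listings; 500_000 sits exactly on a bin edge.
toy = pd.DataFrame({
    "price": [250_000, 499_999, 500_000, 750_000, 2_000_000],
    "grouped_states": ["Massachusetts", "Vermont", "Massachusetts",
                       "Other States", "Massachusetts"],
})

bins = [0, 500_000, 1_000_000, 5_000_000]
labels = ["0-500000", "500000-1000000", "1000000-5000000"]

# right=False closes each bin on the left: [0, 500000), [500000, 1000000), ...
toy["price_cat"] = pd.cut(toy["price"], bins=bins, labels=labels, right=False)

# Rows are price ranges, columns are state groups, cells are listing counts.
table = pd.crosstab(toy["price_cat"], toy["grouped_states"])
print(table)
```

A price of exactly 500,000 therefore falls into the second range, not the first.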

### **City Proportions In The State of Massachusetts**
The largest share of listings in **Massachusetts** is in **Boston**, indicating that the data is primarily centered on real estate around **Boston**.

***You can experiment interactively between states and city numbers.***

In [None]:
@interact
def show_cities(state=df.state.value_counts().index.values,city_number=20):
    # Listing counts of the most frequent cities in the selected state
    counts = df[df.state==state].city.value_counts().head(city_number)
    fig = px.pie(values=counts.values, names=counts.index.values,width=700,height=700)
    fig.update_layout(legend_title="Cities",title_text="Top {} Most Posted Cities in the State of {}".format(city_number,state))
    fig.update_traces(textposition='inside', textinfo='percent+label')
    fig.show()

### **Grouped States, Price and Status of Sale Comparison**
In this graph, it is observed that the average selling prices of **Massachusetts** and **New Hampshire** states are higher compared to the other states.

In [None]:
fig = px.bar(df.groupby(["grouped_states","status_of_sale"]).mean(numeric_only=True).reset_index(), x="grouped_states", y="price",color="status_of_sale", title="Comparison of Sales Status by State and Prices.",width=700)

fig.show()

#### **States - Price - Count - Status of Sale Comparison**
Looking more closely, **Massachusetts**, the state with the most listings, has a high number of sales for houses priced between **\$200,000** and **\$600,000**.

In [None]:
fig = px.histogram(df, x="price",nbins=1000,color="grouped_states",pattern_shape="status_of_sale")
fig.update_layout(xaxis_range=[0,2*10**6],width=800,bargap=0.2)
fig.show()

### **Data Correlation**
The correlation matrix shows how strongly pairs of variables move together. The relationship between the **number of rooms** and **house prices** is **stronger** than the relationship between price and the **size of the house**. The correlations between the room variables themselves, and between each of them and **the price** and **the size of the house**, can be read from the same matrix.

In [None]:
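# Correlate numeric columns only; text columns cannot be correlated
df_corr = df.select_dtypes("number").corr().round(1)
# Mask the upper triangle of the matrix, including the diagonal
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Viz: drop the row and column that the mask left completely empty
df_corr_viz = df_corr.mask(mask).dropna(how='all').dropna(axis='columns', how='all')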
fig = px.imshow(df_corr_viz, text_auto=True,width=700)
fig.show()
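The masking step is easier to follow on a 3x3 matrix. The sketch below uses random toy columns `a`, `b`, `c` (invented names) and builds the mask with `np.triu`, which is equivalent to the index-based assignment above.

```python
import numpy as np
import pandas as pd

# Three random toy columns; the values themselves do not matter here.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

corr = toy.corr().round(1)

# True on and above the diagonal: those cells duplicate the lower triangle.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Hide the masked cells, then drop the first row and last column,
# which are now entirely empty.
lower = corr.mask(mask).dropna(how="all").dropna(axis="columns", how="all")
print(lower)
```

Each remaining cell is one distinct pair of variables, so no correlation appears twice in the heatmap.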

### **Comparison Of Living Areas With Price**
This graph shows how price increases with house size in each state group. **New Hampshire** is the state where the price increases the most as the size of the house increases, while **Vermont** shows the weakest relationship between house size and price among these 4 states.

In [None]:
g=sns.lmplot(x="house_size",y="price",hue="grouped_states",data=df,scatter_kws={'alpha':1},scatter=False);
plt.xlim(0,df.house_size.median()*10)
plt.ylim(0,df.price.median()*10)
g._legend.set_title("States")
g._legend.set_bbox_to_anchor((.35, 0.8))
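The lines drawn above are one linear fit per state group, so "price increases the most" corresponds to the steepest slope. As a rough sketch on invented data (groups `X` and `Y` are hypothetical), the slopes can be computed directly with `np.polyfit`:

```python
import numpy as np
import pandas as pd

# Invented data: price grows by 200 per unit of size in X and by 50 in Y.
toy = pd.DataFrame({
    "grouped_states": ["X"] * 4 + ["Y"] * 4,
    "house_size": [1000, 1500, 2000, 2500] * 2,
    "price": [200_000, 300_000, 400_000, 500_000,
              200_000, 225_000, 250_000, 275_000],
})

# Degree-1 fit per group; the first coefficient is the slope.
slopes = {
    state: np.polyfit(group["house_size"], group["price"], deg=1)[0]
    for state, group in toy.groupby("grouped_states")
}
print(slopes)
```

Comparing these numbers gives the same ranking as comparing the steepness of the plotted lines, without reading it off the figure.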

### **Comparison Of Room Numbers With Price**
Similarly, when comparing the number of rooms and prices, it can be understood from this graph that houses in **New Hampshire** generally do not have a wide range of room numbers. The state with the highest range of room numbers is **Massachusetts**.

In [None]:
g=sns.lmplot(x="bed_bath",y="price",hue="grouped_states",data=df,scatter_kws={'alpha':1},scatter=False);
plt.xlim(0,df.bed_bath.median()*15)
plt.ylim(0,df.price.median()*10)
g._legend.set_title("States")
g._legend.set_bbox_to_anchor((.7, 0.3))

### **Price Densities By States**
Here, you can see the price distributions of the four states that are most listed as well as the other states. It is evident that homes in **Massachusetts** have a wide range of prices, while homes in **Vermont** generally have similar prices.

In [None]:
(sns
 .FacetGrid(df,hue="grouped_states",height=5,xlim=(0,11**6))
 .map(sns.kdeplot,"price",fill=True)
 .add_legend()
);

### **OLS Regression**
Looking at the data as a whole, **OLS (Ordinary Least Squares)** regression provides the **R-squared**, the **F-statistic**, and the **variable coefficients**. The **R-squared** value is the proportion of the variance in the dependent variable that is explained by the independent variables. The **F-statistic**, on the other hand, tests whether the model as a whole is statistically significant, that is, whether the independent variables jointly explain the dependent variable.

In [None]:
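import statsmodels.api as sm
sdf=df.copy()
# Keep complete rows and numeric (float) columns only
sdf.dropna(inplace=True)
sdf=sdf.select_dtypes("float64")
X=sdf.drop(columns="price")
y=sdf.price
# No intercept column is added here (sm.add_constant(X) would add one),
# so statsmodels reports the uncentered R-squared for this model.
sms=sm.OLS(y,X)
model=sms.fit()
model.summary()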
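To make the two statistics concrete, the sketch below computes them by hand with NumPy on synthetic data. It includes an intercept column, which differs from the model above, and the coefficients 3, 2 and -1 are invented for the simulation.

```python
import numpy as np

# Synthetic data (invented): y depends linearly on two predictors plus noise.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix: intercept column followed by the two predictors.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
ss_res = resid @ resid                      # unexplained variation
ss_tot = ((y - y.mean()) ** 2).sum()        # total variation around the mean

# Share of the variance of y explained by the predictors.
r_squared = 1 - ss_res / ss_tot

# F-statistic: explained variation per predictor over residual variation
# per remaining degree of freedom.
k = X.shape[1] - 1
f_stat = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
print(beta, r_squared, f_stat)
```

A large F-statistic here means the two predictors together explain far more variation than noise alone would, which is what the summary table's F-test reports.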

# **Conclusion**

* The majority of prices were observed to be concentrated within a specific range, with a decrease in higher price ranges.
* Listings were unevenly distributed across states, with the majority in **Massachusetts**.
* The listings for houses varied widely in terms of the number of rooms, but generally, there were a significant number of listings with **3 bedrooms** and **2 bathrooms** or close to that.
* We compared the prices and room numbers of the listed houses based on their sizes. As the number of rooms increased, the living area generally increased, but the same trend did not apply to prices due to variations in prices across states.
* **Pennsylvania** had the highest average living area, while **New York** had the highest average price.
* Due to the high number of listings in **Massachusetts**, the price range was **quite wide**, while the price averages of other states were relatively **close to each other**.
* **Massachusetts** and **New Hampshire** had more sold than unsold properties, but the opposite was observed in other states.
* Based on the sales and unsold ratios, the price range with the highest sales rate was between **\$200,000** and **\$600,000**.
* In the comparison of **price** and **living area**, price rose most steeply with living area in **New Hampshire** and least steeply in **Vermont**.
* In the comparison of **price** and **room numbers**, price was most sensitive to the number of rooms in **New Hampshire** and least sensitive in **Massachusetts**.
* It was observed that the distribution of prices varied in intensity across states. **Massachusetts** had a wider price range, while houses in **Vermont** had prices **closer** to each other.

***Thanks For Reading :)***