# EDA - HOUSING PRICING DATASET
In this notebook we perform an Exploratory Data Analysis (EDA) of the housing.csv dataset.
The dataset contains 545 observations (houses) and 13 variables:
- price: price of the house;
- area: area of the house (m^2);
- bedrooms: number of bedrooms;
- bathrooms: number of bathrooms;
- stories: number of storeys (a storey is any level of a building with a floor that could be used by people);
- mainroad: whether the house is connected to the main road;
- guestroom: whether the house has a guest room;
- basement: whether the house has a basement;
- hotwaterheating: whether the house has a hot water heater;
- airconditioning: whether the house has air conditioning;
- parking: number of parking spaces;
- prefarea: whether the house is in a preferred area of the town;
- furnishingstatus: furnishing status of the house.

### OBJECTIVE
The objective of the analysis is to import, clean and prepare the data for a regression problem.
Keep in mind that you should deal with null values (choose whether to drop them or fill them, and with what).
Also, a regression model cannot take categorical variables directly, so we need some kind of encoding for the variables that are not numeric.
Try to get the most out of this dataset by studying how the variables are correlated with each other and which variables influence the price most.
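
The encoding step mentioned above could look like the following minimal sketch. It uses a small made-up dataframe rather than the real file, and it assumes that the binary columns are coded as "yes"/"no" and that furnishingstatus has more than two categories; check the actual values in your data before reusing it.

```python
import pandas as pd

# Toy data standing in for a few housing columns (all values are made up).
toy = pd.DataFrame({
    "price": [100, 150, 120, 200],
    "mainroad": ["yes", "no", "yes", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished", "furnished"],
})

# Binary yes/no columns can simply be mapped to 1/0.
toy["mainroad"] = toy["mainroad"].map({"yes": 1, "no": 0})

# Columns with more than two categories can be one-hot encoded.
# drop_first=True drops one category so the new columns are not redundant.
encoded = pd.get_dummies(toy, columns=["furnishingstatus"], drop_first=True, dtype=int)
print(encoded)
```

One-hot encoding is only one option; for a variable with a natural order, an ordinal mapping may be a better choice.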

##### KEEP IN MIND THAT THIS IS ONLY A DEMO OF WHAT WE CAN PERFORM IN AN EDA AND THERE ARE MANY OTHER THINGS THAT CAN BE DONE TO IMPROVE THE ANALYSIS. USE THIS AS A GUIDE OR INSPIRATION FOR YOUR OWN ANALYSIS AND DO NOT JUST COPY AND PASTE THE CODE BUT TRY TO UNDERSTAND WHAT IS HAPPENING AND WHY.

### IMPORT AND FIRST CHECK
We need to import both the pandas and numpy libraries to work with the data. We also import the matplotlib and seaborn libraries to plot it.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#### Importing the data 
We import the data and check that the procedure was successful by looking at the first rows.


In [None]:
housing = pd.read_csv('housing_price_na.csv')
housing.head()

#### Checking for null values
Check whether there are null values in the dataframe, both numerically and graphically.

In [None]:
# check for na value
housing.info()
# visualizing missing values
sns.heatmap(housing.isna(),cbar=False)
plt.show()
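
As a complement to the heatmap, the number and percentage of missing values per column can be computed directly. This is a minimal sketch on a made-up dataframe (the values are not from the real file); on the real data you would apply the same calls to `housing`.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (all values are made up).
toy = pd.DataFrame({
    "price": [100, 150, 120, 200],
    "mainroad": ["yes", np.nan, "yes", np.nan],
    "area": [50.0, 60.0, np.nan, 80.0],
})

# isna() gives a boolean table; summing counts the True values per column.
missing_count = toy.isna().sum()
# The mean of a boolean column is the fraction of True values.
missing_pct = toy.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Show only the columns that actually have missing values.
print(summary[summary["missing"] > 0])
```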

#### Other general information

In [None]:
housing.describe()


### DEALING WITH MISSING VALUES
We have seen that there are some missing values in our dataset, so let's see how we can deal with them.
Most of the time people just drop the NA values or fill them with the mean of the column, but this is not always the best solution.
In fact, we first need to verify whether the missing data carries a meaning. For example, does a missing value in mainroad mean that the house is not connected to the main road, or that there is no main road in the area? If so, the missing value is itself giving us information about the house.

So before dropping or filling the missing values we need to understand whether they are missing at random or not.
To check this, we label each row according to whether the variable is present ("value") or missing ("na"), and then compare the distribution of the target (price) between the two groups.
If the two distributions look the same, the missingness does not appear to be related to the price: we can treat the values as missing at random and fill them with the mean or the median of the variable.

In [None]:
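# work on a copy so that we do not modify (or get warnings about) the original dataframe
tmp = housing[["mainroad","price"]].copy()
# label each row: "na" if mainroad is missing, "value" otherwise
tmp["mainroad"] = tmp["mainroad"].isna().map({True: "na", False: "value"})
# plotting the price distribution of the two groups
sns.displot(data=tmp,x="price",hue="mainroad",kde=True)
plt.show()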

As we see from the graph, the distribution of price is the same for the houses where mainroad is present and for those where it is missing. The missing data in mainroad is therefore not informative about our target variable, and we can fill it in. Since mainroad is a categorical variable, we fill it with the mode of the variable rather than the mean.
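
Filling a categorical column with its mode could be done as in the sketch below. The data is made up and the "yes"/"no" coding is an assumption; on the real data the same two calls would be applied to `housing["mainroad"]`.

```python
import numpy as np
import pandas as pd

# Toy categorical column with missing values (values are made up).
toy = pd.DataFrame({"mainroad": ["yes", "yes", np.nan, "no", np.nan, "yes"]})

# mode() returns a Series (there can be ties), so we take the first entry.
most_common = toy["mainroad"].mode()[0]

# Replace every missing value with the most frequent category.
toy["mainroad"] = toy["mainroad"].fillna(most_common)
print(toy["mainroad"].value_counts())
```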

### UNDERSTANDING THE DATA
Let's now spend some time doing what is arguably the most important step - understanding the data.

#### Visualizing the data 
We want to get a general grasp of the composition of our dataset.
So we want something that shows us the distribution of each single variable, but also how it relates to the other variables.
We can do that by using a pairplot from the seaborn library.
We can also see the correlation between the variables by using the heatmap function from the seaborn library.

In [None]:
# Pairplot
sns.pairplot(housing)
plt.show()

In [None]:
# Correlation heatmap (numeric columns only: corr() cannot handle the text columns)
plt.figure(figsize=(12,8))
sns.heatmap(housing.corr(numeric_only=True),annot=True,cmap='coolwarm')
plt.show()

We see that the numeric variables are not strongly correlated with each other. This is a good thing, because it means that we can keep all of them in our model.
Keep in mind that if many variables were strongly correlated with each other, we would have to drop some of them to avoid multicollinearity. Also note that categorical variables are not shown in the heatmap or in the pairplot.
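
To see which numeric variables are most related to the price, one option is to take the price column of the correlation matrix and sort it. The sketch below uses made-up values; note how the text column is left out by `numeric_only=True`, which is why categorical variables need to be encoded before they can appear here.

```python
import pandas as pd

# Toy data (all values are made up).
toy = pd.DataFrame({
    "price": [100, 150, 120, 200, 180],
    "area": [50, 70, 55, 95, 90],
    "bedrooms": [2, 3, 2, 4, 3],
    "airconditioning": ["no", "yes", "no", "yes", "yes"],
})

# Correlation of every numeric column with price, strongest first.
corr_with_price = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")                 # the correlation of price with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_price)
```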

#### Visualizing our target variable: price
Since our target variable is numeric, we plot its distribution.
We also extract the mean of the house prices.

What else can we say from this plot?

In [None]:
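mean_price = housing["price"].mean()
sns.displot(data=housing,x="price",kde=True)
# mark the mean with a labelled vertical line so that the legend has something to show
plt.axvline(mean_price,color="red",linestyle="--",label="mean={:,.0f}".format(mean_price))
plt.legend()
plt.show()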


#### Example of analysis for a categorical variable
In our dataset we have many categorical variables, so let's analyze them.
In this case we want to use a bar plot.

What else can we say about the AC variable? Can we apply this kind of analysis to other variables? If yes, try to do this on your own.

In [None]:
housing['airconditioning'].value_counts().sort_values().plot(kind='barh')
plt.show()

Since the majority of the houses in our df have no air conditioning, we are curious to see if this affects the price:
- in the first graph we plot the two distributions in a box plot, which helps us see if we have OUTLIERS (meaning datapoints that are very distant from the others in terms of the variable we are studying; these points have very different values and we are usually interested in understanding why);
- the second example is a violin plot, which works more or less like a box plot but also lets us see where the median is positioned and the shape of the distribution of the values we are analyzing.

In [None]:
# example one: boxplot
sns.catplot(data=housing,x="airconditioning",y="price",kind="box")
plt.show()
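
The points that a box plot draws as outliers are, with seaborn's default settings, those lying more than 1.5 times the interquartile range (IQR) away from the box. The sketch below applies the same rule by hand to a made-up price series, so that the outliers can be listed and inspected.

```python
import pandas as pd

# Toy prices with one obviously extreme value (values are made up).
prices = pd.Series([100, 110, 120, 130, 140, 150, 400], name="price")

# The box spans the first and third quartiles.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the box is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```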

In [None]:
# example two: violinplot
sns.violinplot(x = 'airconditioning', y = 'price', data = housing)
plt.show()

We notice that the price of the houses with AC is generally higher than the price of the houses without it.
So the two variables are actually associated, which makes AC interesting for our analysis.
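
What the two plots show can also be backed up with numbers by grouping the prices by the AC variable. This is a minimal sketch on made-up data, assuming a "yes"/"no" coding of the column.

```python
import pandas as pd

# Toy data (all values are made up).
toy = pd.DataFrame({
    "airconditioning": ["yes", "no", "no", "yes", "no"],
    "price": [200, 100, 120, 180, 110],
})

# Count, mean and median price for each AC group.
price_by_ac = toy.groupby("airconditioning")["price"].agg(["count", "mean", "median"])
print(price_by_ac)
```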


#### Example of analysis for a numerical variable against the target variable
In this case we want to check how a numerical variable (say area) affects the price of the house.
We can do this by plotting the two variables against each other using either a scatter plot or a joint plot.
The jointplot is a combination of a scatter plot and a histogram; it is very useful to see the distributions of the two variables and how they are correlated.

In [None]:
# kind="reg" adds a regression line that shows the relationship between the two variables
# area goes on the x axis and price (the target) on the y axis
sns.jointplot(data=housing,x="area",y="price",kind="reg")
plt.show()
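
The relationship drawn by the regression line can also be summarized with two numbers: the Pearson correlation coefficient and the slope of a straight-line fit. The sketch below uses made-up area and price values; `np.polyfit` is used here as one simple way to fit a line, not necessarily the method seaborn uses internally.

```python
import numpy as np
import pandas as pd

# Toy data (all values are made up).
toy = pd.DataFrame({
    "area": [50, 60, 70, 80, 90],
    "price": [100, 125, 135, 165, 175],
})

# Pearson correlation between the two columns (between -1 and 1).
r = toy["area"].corr(toy["price"])

# Least-squares straight line: price ≈ slope * area + intercept.
slope, intercept = np.polyfit(toy["area"], toy["price"], deg=1)
print(f"r={r:.3f}, slope={slope:.2f}, intercept={intercept:.2f}")
```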

#### Example of the analysis of three variables against each other (price, area and airconditioning)
In this case we want to check how a numerical variable (say area) affects the price of the house, and whether this changes with a third, categorical variable.
For this we use a scatter plot and we color the points according to the value of the third variable (airconditioning).
In seaborn we can do this using the hue parameter of the lmplot function.
The lmplot is a combination of a scatter plot and a regression line (one per hue group); it is very useful to see how the two variables are related within each group.

In [None]:
sns.lmplot(data=housing,x="area",y="price",hue="airconditioning")
plt.show()

In this case we can see that, as expected, houses with AC tend to be priced higher than houses without AC, and that the lowest prices belong to the houses with the smallest area and no AC. The AC variable also splits the data cloud in two parts, one for the houses with AC and one for the houses without. This suggests that AC is associated with price, so including it in the model should lead to a better prediction of the price of the house.

### CONCLUSION
I recommend that you look up the definitions of what we have seen in this example of EDA (for instance, the types of graphs and what each one plots).
Also, try experimenting with other kinds of analysis on this and other datasets; you can easily find a lot of datasets on https://www.kaggle.com/datasets.