## 1.Importing Data To .CSV File

In [4]:
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
 os.makedirs(housing_path, exist_ok=True)
 tgz_path = os.path.join(housing_path, "housing.tgz")
 urllib.request.urlretrieve(housing_url, tgz_path)
 # The context manager closes the archive even if extraction fails
 with tarfile.open(tgz_path) as housing_tgz:
  housing_tgz.extractall(path=housing_path)

In [5]:
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
 csv_path = os.path.join(housing_path, "housing.csv")
 return pd.read_csv(csv_path)

## 2.Analyzing the data
Let's first import our data and take a quick look at the first 5 rows using the head() method.

In [6]:
fetch_housing_data() #fetching data
housing = load_housing_data()
housing.head()

URLError: <urlopen error [Errno 11001] getaddrinfo failed>

Now let's look at a summary of our data

In [None]:
housing.info()

We notice that some columns contain non-numerical values, such as ocean_proximity. Let's see which values this column contains and how often each one occurs.

In [None]:
housing["ocean_proximity"].value_counts()

In [None]:
housing.describe()

Another way to analyze the data is to plot a histogram for each numerical attribute

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
housing.hist(bins=50, figsize=(20,15))
plt.show()

## 3.Creating The test set 
Now we need to create a test set, so we will build a function that splits our data into a training set and a test set

In [None]:
import numpy as np
def split_train_test(data, test_ratio):
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]


In [None]:
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")

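But this solution will break the next time you fetch an updated dataset: the shuffled indices change when rows are added, so rows that were in the training set can end up in the test set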
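One common way to keep a split stable across dataset updates is to decide each row's membership from a hash of a unique identifier. The sketch below is a minimal illustration on toy data, not the method used in this notebook; the helper names `is_in_test_set` and `split_train_test_by_id` are hypothetical, and it assumes the row index is a stable identifier (new rows are only appended, never deleted).

```python
import zlib

import numpy as np
import pandas as pd


def is_in_test_set(identifier, test_ratio):
    # Hash the identifier to an unsigned 32-bit integer; the row goes to the
    # test set when the hash falls in the lowest `test_ratio` share of the range.
    return zlib.crc32(np.int64(identifier).tobytes()) < test_ratio * 2**32


def split_train_test_by_id(data, test_ratio, id_column):
    in_test_set = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]


# Toy data (not the housing dataset): the row index serves as the identifier.
toy = pd.DataFrame({"value": np.arange(1000)}).reset_index()
train_toy, test_toy = split_train_test_by_id(toy, 0.2, "index")

# Simulate an updated dataset with 200 appended rows.
bigger = pd.DataFrame({"value": np.arange(1200)}).reset_index()
_, test_bigger = split_train_test_by_id(bigger, 0.2, "index")

# Rows that existed before the update keep their original assignment.
old_ids_in_new_test = set(test_bigger["index"]) & set(toy["index"])
print(old_ids_in_new_test == set(test_toy["index"]))
```

Because membership depends only on the identifier, the test set is only approximately `test_ratio` of the data, but no previously seen row ever moves between sets.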

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
 bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
 labels=[1, 2, 3, 4, 5])

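Now let's use stratified sampling from scikit-learn. Basically, stratification means splitting your dataset into many homogeneous subgroups (strata) and sampling from each one in proportion to its size, so that the test set is representative of the whole dataset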

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
 strat_train_set = housing.loc[train_index]
 strat_test_set = housing.loc[test_index]

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
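To see why these proportions matter, it can help to compare a stratified test set with a purely random one. The sketch below does this on synthetic data, not on the housing dataset; the column names `feature` and `category` are made up for illustration, and it uses the `stratify` argument of `train_test_split` as a shortcut instead of `StratifiedShuffleSplit`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with an imbalanced categorical column (assumption for the demo).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "category": rng.choice([1, 2, 3], size=1000, p=[0.6, 0.3, 0.1]),
})

# Purely random split versus a split stratified on the category.
random_train, random_test = train_test_split(toy, test_size=0.2, random_state=42)
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["category"])


def category_proportions(data):
    return data["category"].value_counts() / len(data)


comparison = pd.DataFrame({
    "overall": category_proportions(toy),
    "stratified": category_proportions(strat_test),
    "random": category_proportions(random_test),
}).sort_index()
comparison["stratified_error"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_error"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column is free to drift, especially for the rarest category.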


Now we should remove the income_cat attribute so the data is back to its original
state

In [None]:
for set_ in (strat_train_set, strat_test_set):
 set_.drop("income_cat", axis=1, inplace=True)

## 4.Visualize Data

In [None]:
housing = strat_train_set.copy()

Since there is geographical information, we can use latitude and longitude to plot the data

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")


Let's set alpha to 0.1 to better visualize the density of points

In [None]:
housing.plot(kind="scatter",x="longitude", y="latitude",alpha=0.1)

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
 s=housing["population"]/100, label="population", figsize=(10,7),
 c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()


## 5.Looking for Correlations

In [None]:
corr_matrix = housing.corr(numeric_only=True)

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

If the correlation is close to 1, y tends to go up when x goes up; if it is close to -1, y tends to go down when x goes up; and if it is close to 0, there is no linear relationship between x and y
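A small numeric check of the three cases, using made-up arrays and `np.corrcoef` (which computes the same Pearson coefficient that `corr()` uses by default). Note the last case: y depends entirely on x, yet the coefficient is essentially zero because the relationship is not linear.

```python
import numpy as np

# Made-up values, symmetric around zero.
x = np.arange(-5, 6, dtype=float)

r_up = np.corrcoef(x, 2 * x + 1)[0, 1]      # y rises linearly with x
r_down = np.corrcoef(x, -3 * x)[0, 1]       # y falls linearly as x rises
r_nonlinear = np.corrcoef(x, x ** 2)[0, 1]  # y depends on x, but not linearly

print(r_up, r_down, r_nonlinear)
```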

Another way to check for correlation between attributes is to use the pandas
scatter_matrix() function

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
 "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

The most promising attribute to predict the median house value is the median
income, so let’s zoom in on their correlation scatterplot

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
 alpha=0.1)


This plot reveals a few things. First, the correlation is indeed very strong; you can
clearly see the upward trend, and the points are not too dispersed. Second, the price
cap that we noticed earlier is clearly visible as a horizontal line at $500,000. But this
plot reveals other less obvious straight lines: a horizontal line around $450,000,
another around $350,000, perhaps one around $280,000, and a few more below that.
You may want to try removing the corresponding districts to prevent your algorithms
from learning to reproduce these data quirks.
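Removing such districts can be done with a boolean mask. The sketch below uses a tiny made-up DataFrame, and `PRICE_CAP` is an assumed threshold for this toy data only; in the real dataset you would first inspect the most frequent values of median_house_value to decide which ones to filter out.

```python
import pandas as pd

# Toy data (not the housing dataset): two districts sit at the assumed cap.
toy = pd.DataFrame({
    "median_income": [2.5, 8.1, 4.0, 9.3, 3.2],
    "median_house_value": [150_000, 500_000, 230_000, 500_000, 180_000],
})

PRICE_CAP = 500_000  # assumed cap value for this illustration

# Keep only the districts strictly below the cap.
uncapped = toy[toy["median_house_value"] < PRICE_CAP]
print(len(toy), "->", len(uncapped))
```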

Now that we have analyzed the data, we can create some attributes that may be more interesting for our case

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]


Now let's see their correlation 

In [None]:
corr_matrix = housing.corr(numeric_only=True)

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

Hey, not bad! The new bedrooms_per_room attribute is much more correlated with
the median house value than the total number of rooms or bedrooms. Apparently
houses with a lower bedroom/room ratio tend to be more expensive. The number of
rooms per household is also more informative than the total number of rooms in a
district: obviously, the larger the houses, the more expensive they are.

## 6.Prepare the Data for Machine Learning Algorithms

Let’s also separate the predictors and the labels, since we don’t necessarily want to
apply the same transformations to the predictors and the target values

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()