# I : notebook_EDA introduction

### I / A : Data dictionary

1. longitude
2. latitude
3. housingMedianAge: Median age of a house within a block; a lower number means a newer building.
4. totalRooms: Total number of rooms within a block
5. totalBedrooms: Total number of bedrooms within a block
6. population: Total number of people residing within a block
7. households: Total number of households, i.e. a group of people residing within a home unit, for a block
8. medianIncome: Median household income within a block of houses (measured in tens of thousands of US dollars)
9. medianHouseValue: Median house value for households within a block (measured in US dollars)
10. oceanProximity: Location of the house relative to the ocean

### I / B : Goal of this notebook.
In this notebook, we carry out the Exploratory Data Analysis (EDA) for the Silicon Valley project.
We proceed in a fixed order and comment each step to explain what it does and why.

# II : Preliminary steps

### II / A : Importing libraries 

In [1]:
# Here, we import the libraries that we will use later.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# import os
# print(os.listdir("../input"))

# import matplotlib
# matplotlib.rc('figure', figsize = (20, 8))
# matplotlib.rc('font', size = 14)
# matplotlib.rc('axes.spines', top = False, right = False)
# matplotlib.rc('axes', grid = False)
# matplotlib.rc('axes', facecolor = 'white')

### II / B : Importing and copying our dataset 

In [2]:
# We import the dataset from our folder, using pd.read_csv()
imported_data = pd.read_csv('data/4054a881-9509-4cc0-9501-1174d5bbf6fc.txt')
imported_data

Unnamed: 0.1,Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,2072,-119.84,36.77,6.0,1853.0,473.0,1397.0,417.0,1.4817,72000.0,INLAND
1,10600,-117.80,33.68,8.0,2032.0,349.0,862.0,340.0,6.9133,274100.0,<1H OCEAN
2,2494,-120.19,36.60,25.0,875.0,214.0,931.0,214.0,1.5536,58300.0,INLAND
3,4284,-118.32,34.10,31.0,622.0,229.0,597.0,227.0,1.5284,200000.0,<1H OCEAN
4,16541,-121.23,37.79,21.0,1922.0,373.0,1130.0,372.0,4.0815,117900.0,INLAND
...,...,...,...,...,...,...,...,...,...,...,...
16507,1099,-121.90,39.59,20.0,1465.0,278.0,745.0,250.0,3.0625,93800.0,INLAND
16508,18898,-122.25,38.11,49.0,2365.0,504.0,1131.0,458.0,2.6133,103100.0,NEAR BAY
16509,11798,-121.22,38.92,19.0,2531.0,461.0,1206.0,429.0,4.4958,192600.0,INLAND
16510,6637,-118.14,34.16,39.0,2776.0,840.0,2546.0,773.0,2.5750,153500.0,<1H OCEAN


In [3]:
# Let's get rid of the first column ("Unnamed: 0"), which is not part of the dataset that we use
imported_data = imported_data.drop("Unnamed: 0", axis=1)
imported_data

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-119.84,36.77,6.0,1853.0,473.0,1397.0,417.0,1.4817,72000.0,INLAND
1,-117.80,33.68,8.0,2032.0,349.0,862.0,340.0,6.9133,274100.0,<1H OCEAN
2,-120.19,36.60,25.0,875.0,214.0,931.0,214.0,1.5536,58300.0,INLAND
3,-118.32,34.10,31.0,622.0,229.0,597.0,227.0,1.5284,200000.0,<1H OCEAN
4,-121.23,37.79,21.0,1922.0,373.0,1130.0,372.0,4.0815,117900.0,INLAND
...,...,...,...,...,...,...,...,...,...,...
16507,-121.90,39.59,20.0,1465.0,278.0,745.0,250.0,3.0625,93800.0,INLAND
16508,-122.25,38.11,49.0,2365.0,504.0,1131.0,458.0,2.6133,103100.0,NEAR BAY
16509,-121.22,38.92,19.0,2531.0,461.0,1206.0,429.0,4.4958,192600.0,INLAND
16510,-118.14,34.16,39.0,2776.0,840.0,2546.0,773.0,2.5750,153500.0,<1H OCEAN
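
The extra column dropped above usually appears when a CSV was saved together with its row index. As a minimal sketch, assuming that is what happened here, the snippet below compares dropping the column after loading with reading it as the index directly. The tiny CSV text and the variable names are hypothetical and only stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical CSV that mimics a file saved with its index:
# the first header field is empty.
csv_text = (
    ",longitude,latitude,median_house_value\n"
    "2072,-119.84,36.77,72000.0\n"
    "10600,-117.80,33.68,274100.0\n"
)

# Default read: the saved index comes back as a column named "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# Option 1: drop the column explicitly, as done in the cell above.
dropped = with_extra.drop(columns="Unnamed: 0")

# Option 2: read the first column as the index, then discard that index.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0).reset_index(drop=True)

print(list(with_extra.columns))
print(list(dropped.columns))
print(list(as_index.columns))
```

Both options leave the same feature columns; the second one simply avoids creating the extra column in the first place.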


In [4]:
# Now we create a copy of this dataframe, so that the imported data stays untouched and all the changes are made on the copy.
# This copy is the dataframe that we will use from now on.
data_df = imported_data.copy()
data_df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-119.84,36.77,6.0,1853.0,473.0,1397.0,417.0,1.4817,72000.0,INLAND
1,-117.80,33.68,8.0,2032.0,349.0,862.0,340.0,6.9133,274100.0,<1H OCEAN
2,-120.19,36.60,25.0,875.0,214.0,931.0,214.0,1.5536,58300.0,INLAND
3,-118.32,34.10,31.0,622.0,229.0,597.0,227.0,1.5284,200000.0,<1H OCEAN
4,-121.23,37.79,21.0,1922.0,373.0,1130.0,372.0,4.0815,117900.0,INLAND
...,...,...,...,...,...,...,...,...,...,...
16507,-121.90,39.59,20.0,1465.0,278.0,745.0,250.0,3.0625,93800.0,INLAND
16508,-122.25,38.11,49.0,2365.0,504.0,1131.0,458.0,2.6133,103100.0,NEAR BAY
16509,-121.22,38.92,19.0,2531.0,461.0,1206.0,429.0,4.4958,192600.0,INLAND
16510,-118.14,34.16,39.0,2776.0,840.0,2546.0,773.0,2.5750,153500.0,<1H OCEAN


In [5]:
# We can use .info() on this new dataframe to get a quick overview of the data.
data_df.info()

# 1st info: We have 10 columns: 9 features and 1 target (median_house_value).
# 2nd info: one of the features (total_bedrooms) has missing data: 16336 non-null values out of 16512 rows.
# 3rd info: all the columns are of type float64, except ocean_proximity, which is of type object.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  float64
 3   total_rooms         16512 non-null  float64
 4   total_bedrooms      16336 non-null  float64
 5   population          16512 non-null  float64
 6   households          16512 non-null  float64
 7   median_income       16512 non-null  float64
 8   median_house_value  16512 non-null  float64
 9   ocean_proximity     16512 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.3+ MB
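
Since ocean_proximity is the only non-numeric column, a natural follow-up is to look at the categories it contains. The snippet below is a small sketch on a hypothetical four-row stand-in for `data_df` (the labels are taken from the table output above); `sample_df` and the other names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for data_df, with labels seen in the output above.
sample_df = pd.DataFrame({
    "median_income": [1.4817, 6.9133, 1.5536, 2.6133],
    "ocean_proximity": ["INLAND", "<1H OCEAN", "INLAND", "NEAR BAY"],
})

# Find the non-numeric columns by excluding every numeric dtype.
non_numeric_columns = sample_df.select_dtypes(exclude="number").columns.tolist()

# Count how many rows fall into each category.
category_counts = sample_df["ocean_proximity"].value_counts()

print(non_numeric_columns)
print(category_counts)
```

On the real dataframe, the same two calls would be applied to `data_df` instead of `sample_df`.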


#### We are done with the preliminary steps and can now move on to the EDA

# III : Exploratory Data Analysis

### III / A : Basic exploration

In [6]:
# We use .describe() to get summary statistics for the numeric columns of our dataset
data_df.describe()

# 1st info: the features are on very different scales (e.g. median_income vs. total_rooms), so we may need to scale the data later.

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,16512.0,16512.0,16512.0,16512.0,16336.0,16512.0,16512.0,16512.0,16512.0
mean,-119.564046,35.626523,28.624516,2644.170603,539.31954,1435.01726,501.135962,3.864091,206509.251453
std,2.005033,2.13915,12.59798,2213.946369,425.207704,1158.151967,385.650673,1.893244,115225.957661
min,-124.35,32.54,1.0,6.0,2.0,3.0,2.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1446.0,296.0,788.0,280.0,2.5625,119400.0
50%,-118.49,34.25,29.0,2116.0,435.0,1168.0,410.0,3.5313,179300.0
75%,-118.01,37.71,37.0,3154.0,647.0,1738.0,606.0,4.733225,264500.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0
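
To illustrate what scaling would do about the scale difference noted above, here is a minimal sketch of standardization (z-scores) in plain pandas. It is one possible approach, not necessarily the one this project will use; `sample_df` is a hypothetical five-row sample built from values shown in the tables above.

```python
import pandas as pd

# Hypothetical sample with two features on very different scales.
sample_df = pd.DataFrame({
    "total_rooms": [1853.0, 2032.0, 875.0, 622.0, 1922.0],
    "median_income": [1.4817, 6.9133, 1.5536, 1.5284, 4.0815],
})

# Standardize each column: subtract its mean, divide by its standard deviation.
scaled_df = (sample_df - sample_df.mean()) / sample_df.std()

# After scaling, every column has mean ~0 and standard deviation 1.
print(scaled_df.describe().loc[["mean", "std"]])
```

If the data is later split into training and test sets, the mean and standard deviation are normally computed on the training set only and then reused on the test set.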


### III / B : Missing and duplicated values

In [7]:
# We will use .isnull().sum() to check how many missing values we have for each feature
data_df.isnull().sum()

# 1st info: all the missing data are in the total_bedrooms feature.

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        176
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [8]:
# We could impute these missing values (mean, median, ...), but I prefer to remove the rows that contain them.
# We have more than 16,000 rows, so I assume that removing only 176 of them (about 1%) will not have a large negative impact.
data_df = data_df.dropna(axis=0)
data_df.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
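
For comparison with the row removal above, the following sketch shows the median imputation mentioned as the alternative. It runs on a hypothetical four-row frame (`sample_df`) with one missing total_bedrooms value; it is not applied to `data_df` in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in total_bedrooms.
sample_df = pd.DataFrame({
    "total_rooms": [1853.0, 2032.0, 875.0, 622.0],
    "total_bedrooms": [473.0, np.nan, 214.0, 229.0],
})

# Choice made in this notebook: drop every row that has a missing value.
dropped_df = sample_df.dropna(axis=0)

# Alternative: keep all rows and fill the gap with the column median.
bedrooms_median = sample_df["total_bedrooms"].median()
imputed_df = sample_df.assign(
    total_bedrooms=sample_df["total_bedrooms"].fillna(bedrooms_median)
)

print(len(dropped_df), len(imputed_df))
print(imputed_df["total_bedrooms"].isnull().sum())
```

Dropping rows loses a little data but adds no artificial values; imputation keeps every row at the cost of inserting an estimate.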

In [9]:
# We can also check whether we have duplicated rows. If so, we will remove the duplicates,
# because they could bias the predictions of our model.
data_df.duplicated().sum()

# 1st info: the dataset has no duplicated rows.

0
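
The dataset has no duplicates, so nothing needs to be removed here. For reference, this is a small sketch of how the removal could look if the count were not zero, using a hypothetical three-row frame in which the last row repeats the first.

```python
import pandas as pd

# Hypothetical sample: the third row is an exact copy of the first.
sample_df = pd.DataFrame({
    "longitude": [-119.84, -117.80, -119.84],
    "latitude": [36.77, 33.68, 36.77],
})

# Count rows that repeat an earlier row.
n_duplicates = sample_df.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated_df = sample_df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated_df))
```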

### III / C : Distribution visualization

In [None]:
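# We draw one box plot per numeric feature to visualize the distributions and spot potential outliers.
data_df.plot(kind='box', subplots=True, layout=(4, 4), figsize = (12, 12));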
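
Box plots usually mark as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is matplotlib's default whisker setting. As a rough numeric counterpart to the plot, the sketch below applies that rule to a hypothetical sample of total_rooms values assembled from numbers visible in the outputs above; the names are illustrative.

```python
import pandas as pd

# Hypothetical sample of total_rooms values, including the maximum from .describe().
values = pd.Series(
    [1853.0, 2032.0, 875.0, 622.0, 1922.0, 1465.0, 2365.0, 2531.0, 2776.0, 39320.0],
    name="total_rooms",
)

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds of the 1.5 * IQR rule.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside the bounds would be drawn as individual outlier markers.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound)
print(outliers.tolist())
```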