# Experimentation notebook - Gabriel Tumbaco

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from housing_price_prediction.utils.paths import data_raw_dir

train_path = data_raw_dir() / "train.csv"
df_raw = pd.read_csv(train_path)

## Exploration and cleaning of the following variables

Variables that describe the lot, the location, and the general type of the property.

- MSSubClass
- MSZoning
- LotFrontage
- LotArea
- Street
- Alley
- LotShape
- LandContour
- Utilities
- LotConfig
- LandSlope
- Neighborhood
- Condition1
- Condition2
- BldgType
- HouseStyle

**Experimenting with the dataset**

In [3]:
variables = [
    'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 
    'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 
    'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 
    'BldgType', 'HouseStyle', 'SalePrice' 
]

# Select the variables of interest and create a working copy for experimentation
df = df_raw[variables].copy()

In [5]:
print(df.shape)
df.info()

(1460, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSSubClass    1460 non-null   int64  
 1   MSZoning      1460 non-null   object 
 2   LotFrontage   1201 non-null   float64
 3   LotArea       1460 non-null   int64  
 4   Street        1460 non-null   object 
 5   Alley         91 non-null     object 
 6   LotShape      1460 non-null   object 
 7   LandContour   1460 non-null   object 
 8   Utilities     1460 non-null   object 
 9   LotConfig     1460 non-null   object 
 10  LandSlope     1460 non-null   object 
 11  Neighborhood  1460 non-null   object 
 12  Condition1    1460 non-null   object 
 13  Condition2    1460 non-null   object 
 14  BldgType      1460 non-null   object 
 15  HouseStyle    1460 non-null   object 
 16  SalePrice     1460 non-null   int64  
dtypes: float64(1), int64(3), object(13)
memory usage: 194.0+ KB


# Missing values

In [6]:
df.isna()

,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,SalePrice
0,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
1456,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
1457,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False
1458,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False


Fun fact: in pandas, *isnull* and *isna* are aliases for the same function; there is no difference between them.
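The equivalence can be checked directly. This is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not part of the housing dataset); it only compares the masks that the two methods return.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "Alley": [None, "Grvl", None],
})

# Both methods return the same boolean mask
same_mask = toy.isna().equals(toy.isnull())
print(same_mask)

# notna / notnull are the corresponding negations and also agree
same_negation = toy.notna().equals(toy.notnull())
print(same_negation)
```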

In [11]:
print("Null / NA count per variable")
df.isnull().sum()

Null / NA count per variable


MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
Street             0
Alley           1369
LotShape           0
LandContour        0
Utilities          0
LotConfig          0
LandSlope          0
Neighborhood       0
Condition1         0
Condition2         0
BldgType           0
HouseStyle         0
SalePrice          0
dtype: int64
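Raw counts are easier to compare as a share of the rows. In the output above, Alley is missing in 1369 of 1460 rows (about 94%) and LotFrontage in 259 of 1460 (about 18%). The sketch below shows one way to get those shares, using a hypothetical four-row frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isna() gives a boolean mask; its column-wise mean is the fraction missing
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the notebook's frame, the same expression would be `df.isna().mean()`.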

## Numeric variables

### LotFrontage variable
LotFrontage: Linear feet of street connected to property

In [None]:
print(df['LotFrontage'].isnull().sum())

259


In [23]:
df['LotFrontage'].describe()

count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64

The missing values could be imputed with the global median, but after some research it seemed worth grouping by neighborhood and checking how the median changes.
The median is chosen because it is a measure of central tendency that is robust to outliers.
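A small illustration of that robustness, assuming a handful of made-up frontage values plus one extreme value equal to the maximum reported by `describe()` (313):

```python
import pandas as pd

# Hypothetical frontage values, in feet
frontage = pd.Series([60, 65, 70, 75, 80])

# Same values plus one extreme observation (313 is the max seen in describe())
with_outlier = pd.concat([frontage, pd.Series([313])], ignore_index=True)

mean_before, median_before = frontage.mean(), frontage.median()
mean_after, median_after = with_outlier.mean(), with_outlier.median()

print(mean_before, median_before)
print(mean_after, median_after)
```

In this toy case a single extreme value moves the mean from 70.0 to 110.5, while the median only moves from 70.0 to 72.5.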

In [38]:
print(df.groupby('Neighborhood')['LotFrontage'].median())

Neighborhood
Blmngtn    43.0
Blueste    24.0
BrDale     21.0
BrkSide    52.0
ClearCr    80.0
CollgCr    70.0
Crawfor    74.0
Edwards    65.5
Gilbert    65.0
IDOTRR     60.0
MeadowV    21.0
Mitchel    73.0
NAmes      73.0
NPkVill    24.0
NWAmes     80.0
NoRidge    91.0
NridgHt    88.5
OldTown    60.0
SWISU      60.0
Sawyer     71.0
SawyerW    66.5
Somerst    73.5
StoneBr    61.5
Timber     85.0
Veenker    68.0
Name: LotFrontage, dtype: float64


The median varies widely by neighborhood, from 21.0 in BrDale and MeadowV to 91.0 in NoRidge, while the global median is 69.0. Imputing the missing values with the global median would therefore not be a good idea. A better strategy is to impute each missing value with the median of its own neighborhood.

In [None]:
# Use transform to get a row-aligned column of neighborhood medians for filling the nulls
mediana_vecindarios = df.groupby('Neighborhood')['LotFrontage'].transform('median')
print(mediana_vecindarios)

0       70.0
1       68.0
2       70.0
3       74.0
4       91.0
        ... 
1455    65.0
1456    80.0
1457    74.0
1458    73.0
1459    65.5
Name: LotFrontage, Length: 1460, dtype: float64


In [37]:
df['LotFrontage'] = df['LotFrontage'].fillna(mediana_vecindarios)
print(df['LotFrontage'].isnull().sum())

0
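The two steps above (group-wise `transform`, then `fillna`) can be wrapped in a small helper. This is a sketch, not the notebook's method: the name `fill_with_group_median` is hypothetical, and the fallback to the global median is an added assumption for groups where every value is missing. The real data did not need it, since no nulls remained after the neighborhood-level fill.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(frame, column, group):
    """Return `column` with NaN replaced by the median of its `group`."""
    # transform("median") returns a Series aligned row by row with `frame`
    group_median = frame.groupby(group)[column].transform("median")
    filled = frame[column].fillna(group_median)
    # Assumption: fall back to the global median when a whole group is NaN
    return filled.fillna(frame[column].median())


# Hypothetical toy data: group "C" has no observed value at all
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, 80.0, np.nan, 20.0, np.nan, np.nan],
})
toy["LotFrontage"] = fill_with_group_median(toy, "LotFrontage", "Neighborhood")
print(toy)
```

Here the missing value in group A becomes 70.0 (the median of 60 and 80), the one in B becomes 20.0, and the one in C falls back to 60.0, the median of all observed values.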


## Categorical variables

### Alley variable
Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access

In [None]:
print(df['Alley'].value_counts(dropna=False))

Alley
NaN     1369
Grvl      50
Pave      41
Name: count, dtype: int64


So most houses (1369 of 1460) have no alley access. Since the data description defines NA as "No alley access" rather than an unknown value, NaN is replaced with the string "None".

In [15]:
df['Alley'] = df['Alley'].fillna("None")
print(df['Alley'].value_counts(dropna=False))

Alley
None    1369
Grvl      50
Pave      41
Name: count, dtype: int64
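One likely reason the documented `NA` code shows up as NaN is that `pd.read_csv` treats the literal text `NA` as a missing value by default. The sketch below reproduces this on a hypothetical three-row extract. It assumes the raw file stores the code as the text `NA`, which the notebook does not verify.

```python
import io

import pandas as pd

# Hypothetical three-row extract in the same style as the raw file
csv_text = "Id,Alley\n1,NA\n2,Grvl\n3,NA\n"

# By default, read_csv parses the literal text "NA" as a missing value
parsed = pd.read_csv(io.StringIO(csv_text))
n_missing = int(parsed["Alley"].isna().sum())
print(n_missing)

# An explicit label keeps "no alley access" as its own category
parsed["Alley"] = parsed["Alley"].fillna("None")
print(parsed["Alley"].value_counts())
```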


### MSZoning variable
MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

In [16]:
print(df['MSZoning'].value_counts(dropna=False))

MSZoning
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: count, dtype: int64
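The output contains the label `C (all)`, while the description above lists the code `C`. A quick way to spot such mismatches is to compare the documented codes with the observed labels. This is only a sketch: the two collections are typed in by hand from the description and from the `value_counts()` output.

```python
import pandas as pd

# Codes listed in the data description for MSZoning
documented = {"A", "C", "FV", "I", "RH", "RL", "RP", "RM"}

# Labels observed in the value_counts() output
observed = pd.Series(["RL", "RM", "FV", "RH", "C (all)"])

# Labels present in the data but absent from the description
undocumented = sorted(set(observed) - documented)
# Documented codes that never appear in the data
unused = sorted(documented - set(observed))

print(undocumented)
print(unused)
```

Whether `C (all)` should be renamed to `C` is a separate decision that this check does not make.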
