### Worksheet 03, Data Mining
**Authors:** [Melissa Perez](https://github.com/MelissaPerez09), [Adrian Flores](https://github.com/adrianRFlores), [Andrea Ramirez](https://github.com/Andrea-gt)

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.stats.diagnostic as diag
from scipy.stats import spearmanr

## Variable Description

In [15]:
data = pd.read_csv('Data/train.csv', encoding='unicode_escape')

In [16]:
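# Note: column names are case-sensitive. The identifier column is named 'Id',
# so comparing against 'id' keeps every column and 'Id' still appears in the summary.
data_describe = data.loc[:, data.columns != 'id']
data_describe.describe(include=[np.number])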

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |
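
In the summary above, the `count` row is below 1460 for `LotFrontage` (1201) and `MasVnrArea` (1452), which points to 259 and 8 missing values respectively. A minimal sketch of how such gaps could be tabulated directly is shown here. The `toy` frame and its values are hypothetical stand-ins for `data`, and `missing_report` is a name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data` (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    "LotArea": [8000, 9500, 11000, 7200],
})

# Count missing values per column and express them as a share of all rows
missing = toy.isna().sum()
missing_report = pd.DataFrame({
    "missing": missing,
    "share": missing / len(toy),
})

# Keep only the columns that actually have gaps, largest first
missing_report = (
    missing_report[missing_report["missing"] > 0]
    .sort_values("missing", ascending=False)
)
print(missing_report)
```

Running the same steps on `data` instead of `toy` should list every column whose count falls short of the number of rows.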


| Variable                            | Description                 |  Type |
|:-----------------------------------:|:---------------------------:|:----------------------:|
| Id                                  | Unique identifier | Qualitative |
| MSSubClass                          | Building class | Qualitative |
| MSZoning                            | General zoning classification of the property | Qualitative |
| LotFrontage                         | Length of the property's street frontage | Quantitative |
| LotArea                             | Lot size in square feet | Quantitative |
| Street                              | Type of road access to the property | Qualitative |
| Alley                               | Type of alley access | Qualitative |
| LotShape                            | General shape of the lot | Qualitative |
| LandContour                         | Contour of the property | Qualitative |
| Utilities                           | Available utilities | Qualitative |
| LotConfig                           | Lot configuration | Qualitative |
| LandSlope                           | Slope of the property | Qualitative |
| Neighborhood                        | Physical location within the city limits | Qualitative |
| Condition1 and Condition2           | Proximity to various features | Qualitative |
| BldgType                            | Type of dwelling | Qualitative |
| HouseStyle                          | Style of dwelling | Qualitative |
| OverallQual                         | Overall quality of the house's materials and finish | Qualitative (ordinal) |
| OverallCond                         | Overall condition of the house | Qualitative (ordinal) |
| YearBuilt                           | Original construction year | Quantitative |
| YearRemodAdd                        | Remodel year | Quantitative |
| RoofStyle                           | Type of roof | Qualitative |
| RoofMatl                            | Roof material | Qualitative |
| Exterior1st and Exterior2nd         | Exterior covering of the house | Qualitative |
| MasVnrType                          | Masonry veneer type | Qualitative |
| MasVnrArea                          | Masonry veneer area in square feet | Quantitative |
| ExterQual and ExterCond             | Quality and condition of the exterior material | Qualitative |
| Foundation                          | Type of foundation | Qualitative |
| BsmtQual                            | Basement quality (height of the basement) | Qualitative |
| Heating                             | Type of heating | Qualitative |
| HeatingQC                           | Heating quality and condition | Qualitative |
| CentralAir                          | Central air conditioning | Qualitative |
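
The table classifies `MSSubClass`, `OverallQual` and `OverallCond` as qualitative, yet they are stored as integers, so `describe(include=[np.number])` summarizes them as if they were quantities. One possible way to align the dtypes with the table is to cast such columns to pandas' `category` dtype. The sketch below assumes a hypothetical `toy` frame with made-up values; which columns to cast is a modeling decision, not something the source prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data` (values are made up)
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 50],
    "OverallQual": [5, 7, 6, 8],
    "LotArea": [8000, 9500, 11000, 7200],
})

# Assumption: these integer-coded columns are treated as qualitative
toy = toy.astype({"MSSubClass": "category", "OverallQual": "category"})

# Only truly numeric columns remain in the numeric summary
numeric_summary = toy.describe(include=[np.number])
print(numeric_summary.columns.tolist())

# Categorical columns get their own summary (count, unique, top, freq)
category_summary = toy.describe(include=["category"])
print(category_summary)
```

After the cast, a dtype check such as `select_dtypes(include=["object", "category"])` could select qualitative columns without naming `MSSubClass` explicitly.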

## Exploratory Analysis

### Normality Analysis

In [17]:
# Select only the numeric variables from the DataFrame
numeric_df = data.select_dtypes(include=[np.number])

# Iterate over each column in the numeric DataFrame
for column in numeric_df:

    # Skip the 'Id' column: it is an identifier, not a measurement
    if column == 'Id':
        continue

    # Drop any null values from the current column
    columnSeriesObj = numeric_df[column].dropna()

    # Apply the Lilliefors test (H0: the data come from a normal distribution).
    # With pvalmethod='table' (the default in recent statsmodels versions),
    # the reported p-value is bounded to the range [0.001, 0.2].
    stat, p_value = diag.lilliefors(columnSeriesObj.values)

    # Reject normality at the 5% significance level
    if p_value <= 0.05:
        print(f'{column} does not follow a normal distribution. ({p_value})\n')
    else:
        print(f'{column} follows a normal distribution.\n')

MSSubClass does not follow a normal distribution. (0.0009999999999998899)

LotFrontage does not follow a normal distribution. (0.0009999999999998899)

LotArea does not follow a normal distribution. (0.0009999999999998899)

OverallQual does not follow a normal distribution. (0.0009999999999998899)

OverallCond does not follow a normal distribution. (0.0009999999999998899)

YearBuilt does not follow a normal distribution. (0.0009999999999998899)

YearRemodAdd does not follow a normal distribution. (0.0009999999999998899)

MasVnrArea does not follow a normal distribution. (0.0009999999999998899)

BsmtFinSF1 does not follow a normal distribution. (0.0009999999999998899)

BsmtFinSF2 does not follow a normal distribution. (0.0009999999999998899)

BsmtUnfSF does not follow a normal distribution. (0.0009999999999998899)

TotalBsmtSF does not follow a normal distribution. (0.0009999999999998899)

1stFlrSF does not follow a normal distribution. (0.0009999999999998899)

2ndFlrSF does not follow a
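
Every column listed above reports the same p-value of roughly 0.001, which appears to be the lower bound of the table-based p-value in `diag.lilliefors` and not an exact probability. To illustrate how the test behaves, the following sketch runs it on two synthetic samples: one drawn from a normal distribution and one from a right-skewed log-normal distribution. The samples, the seed, and the `results` dictionary are assumptions made for the example and are unrelated to the housing data.

```python
import numpy as np
import statsmodels.stats.diagnostic as diag

# Synthetic samples (assumption: sizes and parameters chosen only for illustration)
rng = np.random.default_rng(0)
samples = {
    "normal_sample": rng.normal(loc=0.0, scale=1.0, size=500),
    "skewed_sample": rng.lognormal(mean=0.0, sigma=1.0, size=500),
}

results = {}
for name, sample in samples.items():
    # Lilliefors test against a normal distribution with estimated mean and variance
    stat, p_value = diag.lilliefors(sample, dist="norm", pvalmethod="table")
    results[name] = (stat, p_value)
    verdict = "reject normality" if p_value <= 0.05 else "fail to reject normality"
    print(f"{name}: statistic={stat:.4f}, p-value={p_value:.4f} -> {verdict}")
```

A larger test statistic means a larger gap between the empirical distribution and the fitted normal, so the skewed sample should produce the larger statistic of the two.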

### Frequency Tables

In [18]:
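# Iterate over every column in the DataFrame
for column in data.columns:

    # Keep only qualitative columns: object dtype, plus MSSubClass,
    # which is stored as integers but encodes a category
    if data[column].dtype != 'O' and column != 'MSSubClass':
        continue

    # Count occurrences of each value (missing values are excluded by default)
    freqTable = data[column].value_counts().reset_index()

    # Name the columns explicitly, whatever names reset_index() produced
    freqTable.columns = [column, 'Frequency']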

    # Table formatting
    freqTable[column] = freqTable[column].astype(str).str.center(20)
    freqTable['Frequency'] = freqTable['Frequency'].astype(str).str.center(20)
    freqTable.columns = [col.center(20) for col in freqTable.columns]

    print(f"Frequency Table for {column}:\n{freqTable}\n")

Frequency Table for MSSubClass:
         MSSubClass            Frequency      
0            20                   536         
1            60                   299         
2            50                   144         
3           120                    87         
4            30                    69         
5           160                    63         
6            70                    60         
7            80                    58         
8            90                    52         
9           190                    30         
10           85                    20         
11           75                    16         
12           45                    12         
13          180                    10         
14           40                    4          

Frequency Table for MSZoning:
         MSZoning             Frequency      
0           RL                   1151        
1           RM                   218         
2           FV                    65         
3
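
The tables above show absolute counts only, and `value_counts()` leaves out missing values by default. A small variation, sketched here under stated assumptions, adds relative frequencies and counts missing values as their own level. The helper `frequency_table` and the `toy_zoning` series are hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd

def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Absolute and relative frequencies, with missing values counted as a level."""
    # dropna=False keeps NaN as a separate entry in the counts
    table = series.value_counts(dropna=False).to_frame("Frequency")
    table["Relative"] = table["Frequency"] / table["Frequency"].sum()
    return table

# Hypothetical toy column standing in for a qualitative variable
toy_zoning = pd.Series(
    ["RL", "RL", "RM", "RL", np.nan, "FV", "RM", "RL"], name="MSZoning"
)
toy_table = frequency_table(toy_zoning)
print(toy_table)
```

Relative frequencies make columns with different numbers of non-missing rows easier to compare.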

### Relationships with the Response Variable

In [19]:
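# Select only the numeric variables
df_numeric = data.select_dtypes(include=[np.number])

salePriceData = data['SalePrice']

# Spearman rank correlation of each numeric column with SalePrice.
# Only coefficients >= 0.50 are printed, so strong negative correlations are not
# listed, and SalePrice itself appears with a coefficient of 1.0.
# Columns with missing values yield NaN under spearmanr's default
# nan_policy ('propagate') and are therefore never printed.
for col in df_numeric.columns:
    colData = df_numeric[col]
    corr, p_value = spearmanr(salePriceData, colData)
    if corr >= 0.50:
        print(f"Column '{col}' spearman correlation coefficient:", corr)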

Column 'OverallQual' spearman correlation coefficient: 0.8098285862017292
Column 'YearBuilt' spearman correlation coefficient: 0.6526815462850586
Column 'YearRemodAdd' spearman correlation coefficient: 0.5711589780582342
Column 'TotalBsmtSF' spearman correlation coefficient: 0.6027254448924096
Column '1stFlrSF' spearman correlation coefficient: 0.5754078354212824
Column 'GrLivArea' spearman correlation coefficient: 0.7313095834659141
Column 'FullBath' spearman correlation coefficient: 0.6359570562496957
Column 'TotRmsAbvGrd' spearman correlation coefficient: 0.5325859351169929
Column 'Fireplaces' spearman correlation coefficient: 0.5192474498367013
Column 'GarageCars' spearman correlation coefficient: 0.6907109670497434
Column 'GarageArea' spearman correlation coefficient: 0.6493785338868229
Column 'SalePrice' spearman correlation coefficient: 1.0
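
The listing above includes `SalePrice` correlated with itself and cannot include columns that contain missing values, because `spearmanr` returns NaN for them by default. The sketch below shows one possible variant that skips the target, drops incomplete pairs per column, and filters on the absolute coefficient. The function `spearman_with_target`, the `toy` frame, and its values are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def spearman_with_target(df: pd.DataFrame, target: str, threshold: float = 0.5) -> dict:
    """Spearman coefficients of numeric columns vs. `target` with |rho| >= threshold."""
    results = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        if col == target:
            continue  # the target always correlates perfectly with itself
        pair = df[[col, target]].dropna()  # keep rows where both values are present
        if pair[col].nunique() < 2:
            continue  # correlation is undefined for a constant column
        rho, p_value = spearmanr(pair[col], pair[target])
        if abs(rho) >= threshold:
            results[col] = rho
    return results

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300, 350],
    "Area": [10.0, 12.0, np.nan, 20.0, 25.0, 30.0],  # increasing, one missing value
    "Age": [60, 50, 40, 30, 20, 10],                 # strictly decreasing
    "Noise": [3, 1, 3, 1, 3, 1],                     # no monotonic relationship
})
strong = spearman_with_target(toy, target="SalePrice")
print(strong)
```

Using the absolute value keeps strongly negative relationships, such as `Age` in the toy data, which a `corr >= 0.50` filter would drop.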
