# I. Data Preparation Part 1
The classification target, `price_class`, is a stratified log price of low, average, or high, with threshold prices defined separately for each property type. The data will be scaled so that feature importance can be compared across features using model weights, and dimensionality reduction with Principal Component Analysis (PCA) will be explored. In addition, several variables that are not thought to be important will be removed from the dataset.
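
The price classes themselves are not built in this notebook; they appear to arrive with `Imputed_Dataset.csv`. As a rough sketch of how per-property-type thresholds could work, the example below assumes tercile cut points within each property type. The tercile rule, the toy data, and the helper name `add_price_class` are illustrative assumptions, not the thresholds actually used.

```python
import pandas as pd

def add_price_class(frame, value_col="log_price", group_col="property_type"):
    """Label each row Low/Average/High using terciles computed within its group (assumed rule)."""
    out = frame.copy()
    out["price_class"] = ""
    for _, idx in out.groupby(group_col).groups.items():
        # Cut points are computed separately for each property type
        classes = pd.qcut(out.loc[idx, value_col], q=3, labels=["Low", "Average", "High"])
        out.loc[idx, "price_class"] = classes.astype(str)
    return out

# Toy data: two property types with very different price levels
toy = pd.DataFrame({
    "property_type": ["Casa"] * 6 + ["Lote"] * 6,
    "log_price": [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 8.0, 8.2, 8.4, 8.6, 8.8, 9.0],
})
labeled = add_price_class(toy)
print(labeled)
```

With thresholds set per property type, a `Lote` at a log price of 9.0 is labeled High even though a `Casa` at 10.0 is labeled Low.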

In [2]:
import numpy as np
import pandas as pd

# Read in the imputed dataset

# Tom
#df = pd.read_csv('C:\\Users\\Tpeng\\OneDrive\\Documents\\SMU\\Term 3\\Machine Learning\\Lab1\\Imputed_Dataset.csv', sep = ',', header = 0)

#Quynh
df = pd.read_csv('Imputed_Dataset.csv', sep = ',', header = 0)

# Anish
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Michael
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Drop index column
df = df.drop(columns = 'Unnamed: 0')

In [3]:
# Reformat attributes, excluding categoricals, which aren't supported by the dummy variable generation method used.
ordinal_vars = ['rooms', 'bedrooms', 'bathrooms' ]
continuous_vars = ['lat', 'lon', 'surface_total', 'surface_covered', 'price', 'log_price']
string_vars = ['id', 'title', 'description']
time_vars = ['start_date', 'end_date', 'created_on']

# Change data types
df[ordinal_vars] = df[ordinal_vars].astype('uint8')
df[continuous_vars] = df[continuous_vars].astype(np.float64)
df[string_vars] = df[string_vars].astype(str)

# Remove observations missing l3 and price before encoding 
df2 = df.dropna(axis = 0, subset = ['price', 'l3'])

Create a transformed dataset with numeric variables square-root transformed to better meet model assumptions of feature distributions. This reduces the number and magnitude of outliers. In addition to this, both datasets will have the property_type, country, province, and department dummified, and all other attributes will be scaled. Both the transformed and non-transformed datasets will be used to create competing models.

In [4]:
# Create datasets with transformed variables for model selection methods
# Transform rooms, bedrooms, bathrooms, surface_total, and surface_covered using square root
df_transform = df2.copy()
df_transform['sqrt_surface_total'] = df_transform.surface_total.transform(func = 'sqrt')
df_transform['sqrt_surface_covered'] = df_transform.surface_covered.transform(func = 'sqrt')
df_transform['sqrt_bedrooms'] = df_transform.bedrooms.transform(func = 'sqrt')
df_transform['sqrt_bathrooms'] = df_transform.bathrooms.transform(func = 'sqrt')
df_transform['sqrt_rooms'] = df_transform.rooms.transform(func = 'sqrt')

df_transform = df_transform.drop(columns = ['surface_total', 'surface_covered', 'bedrooms', 'bathrooms', 'rooms'])
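
As a quick illustration of why the square-root transform helps, the sketch below applies it to a synthetic right-skewed column and compares skewness before and after. The lognormal sample and the toy frame are stand-ins for the real surface data, and looping over a column list is just one compact alternative to the per-column assignments above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Synthetic, right-skewed stand-in for a surface-area column (not the real data)
toy = pd.DataFrame({
    "surface_total": rng.lognormal(mean=4.5, sigma=0.8, size=5000),
    "rooms": rng.integers(1, 8, size=5000),
})

# Add a sqrt_ column for each listed feature, then drop the originals
cols_to_transform = ["surface_total", "rooms"]
toy_sqrt = toy.copy()
for col in cols_to_transform:
    toy_sqrt["sqrt_" + col] = np.sqrt(toy_sqrt[col])
toy_sqrt = toy_sqrt.drop(columns=cols_to_transform)

skew_before = toy["surface_total"].skew()
skew_after = toy_sqrt["sqrt_surface_total"].skew()
print(f"skewness before: {skew_before:.2f}, after sqrt: {skew_after:.2f}")
```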

In [5]:
# Get dummy variables for non-transformed dataset
data = pd.get_dummies(df2, columns = ['l1', 'l2', 'l3', 'property_type'], 
                      prefix = {'l1':'Country', 'l2':'Province', 'l3': 'Department', 'property_type': 'Property_type'}, 
                      sparse = True, drop_first = False)

# Drop reference levels for each dummified feature and unimportant or currently unusable features. 
data = data.drop(columns = ['Country_Argentina', 'Province_Misiones', 'Department_Posadas', 'Property_type_Casa'])
data = data.drop(columns = ['id', 'start_date', 'end_date', 'created_on', 'lat', 'lon', 'title', 'description', 'price'])

# Get dummy variables for transformed dataset
trans = pd.get_dummies(df_transform, columns = ['l1', 'l2', 'l3', 'property_type'], 
                          prefix = {'l1':'Country', 'l2':'Province', 'l3': 'Department', 'property_type': 'Property_type'}, 
                          sparse = True, drop_first = False)

# Drop reference levels for each dummified feature (same references as non-transformed data) and unimportant or currently unusable features. 
trans = trans.drop(columns = ['Country_Argentina', 'Province_Misiones', 'Department_Posadas', 'Property_type_Casa'])
trans = trans.drop(columns = ['id', 'start_date', 'end_date', 'created_on', 'lat', 'lon', 'title', 'description', 'price'])
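
Because the same encoding is applied to both datasets, the steps could be wrapped in a small helper. The sketch below shows the idea on a toy frame. The function name `encode_features` and the toy rows are hypothetical, while the prefixes and reference levels are taken from the cell above.

```python
import pandas as pd

PREFIXES = {"l1": "Country", "l2": "Province", "l3": "Department", "property_type": "Property_type"}
REFERENCE_LEVELS = ["Country_Argentina", "Province_Misiones", "Department_Posadas", "Property_type_Casa"]

def encode_features(frame, unused_cols):
    """One-hot encode the location and property-type columns, then drop reference levels and unused columns."""
    encoded = pd.get_dummies(frame, columns=list(PREFIXES), prefix=PREFIXES)
    return encoded.drop(columns=REFERENCE_LEVELS + unused_cols)

# Toy rows for illustration only
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "l1": ["Argentina", "Argentina", "Colombia"],
    "l2": ["Misiones", "Capital Federal", "Antioquia"],
    "l3": ["Posadas", "Palermo", "Bello"],
    "property_type": ["Casa", "Departamento", "Casa"],
    "log_price": [11.2, 12.1, 11.8],
})
encoded = encode_features(toy, unused_cols=["id"])
print(sorted(encoded.columns))
```

The first toy row sits at every reference level, so all of its dummy columns are 0. This is what dropping the reference levels means: that combination is represented by the absence of any indicator.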

# II. Data Preparation Part 2
The final dataset has 441,216 rows and 1,214 features considered relevant both to classifying properties into the price categories Low, Average, and High and to predicting price. Price was log transformed and the remaining numerical features were square-root transformed, because the original data were right-skewed and therefore deviated from normality. The transformed features are:  
price           ---> log_price
surface_total   ---> sqrt_surface_total
surface_covered ---> sqrt_surface_covered
bedrooms        ---> sqrt_bedrooms
bathrooms       ---> sqrt_bathrooms
rooms           ---> sqrt_rooms

In addition, dummy variables for country, province, department, and property type were created for use in our models.  


In [6]:
trans.head(5)

   log_price price_class  sqrt_surface_total  sqrt_surface_covered  sqrt_bedrooms  sqrt_bathrooms  sqrt_rooms  ...
0  12.860999        High           14.071247             12.247449       1.732422        1.414062         2.0  ...
1  12.860999        High           14.071247             12.247449       1.732422        1.414062         2.0  ...
2  12.180755     Average           13.152946             13.152946       1.732422        1.414062    2.646484  ...
3  11.350407         Low                 7.0              6.324555       1.732422             1.0    1.732422  ...
4  11.350407         Low                 7.0              6.324555       1.732422             1.0    1.732422  ...

The remaining columns are the 0/1 dummy variables prefixed Country_, Province_, Department_, and Property_type_.


In [23]:
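# Separate the features from the target and convert both to NumPy arrays.
# Note: X still includes log_price, the variable from which price_class is derived.
X = trans.drop(['price_class'], axis = 1)
X = np.asarray(X)
y = np.asarray(trans['price_class'])
X.shape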

(441216, 1214)

# III. Modeling and Evaluation 1: Evaluation Metrics
The metrics used to evaluate our models are F1 score, accuracy, and mean squared error (MSE). Since our dataset is to be used by investors to judge how good a property is as an investment from a price perspective, it is important to be accurate. For the classification tasks, F1 score is our primary metric because it combines precision and recall (it is their harmonic mean). Among the models with the top F1 scores, we use accuracy as the next most important measure because the price classes are roughly balanced. For the price prediction tasks, MSE is used to evaluate models.
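
For reference, the F1 score is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). The sketch below checks this against scikit-learn on a small made-up label set and also computes an MSE. The labels and log prices are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error, precision_score, recall_score

# Made-up labels for illustration
y_true = ["Low", "Low", "Average", "Average", "High", "High", "High", "Low"]
y_pred = ["Low", "Average", "Average", "High", "High", "High", "Low", "Low"]
labels = ["Low", "Average", "High"]

# Per-class precision and recall, then F1 as their harmonic mean
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred, labels=labels, average=None)
print("per-class F1:", dict(zip(labels, np.round(f1_sklearn, 3))))

# MSE for the price prediction task, on made-up log prices
log_price_true = np.array([11.35, 12.18, 12.86])
log_price_pred = np.array([11.50, 12.00, 12.70])
mse = mean_squared_error(log_price_true, log_price_pred)
print("MSE:", round(mse, 4))
```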
# IV. Modeling and Evaluation 2: Train-Test Split Data
The data are split 80/20 into training and test sets before standardization so that the test data do not influence the scale learned from the training data. The KNN models in a later section are instead evaluated with stratified 3-fold cross-validation, which keeps the proportions of the three price classes similar in every fold. Note that stratification is applied to the target only, not to country, province, department, or property type.  

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.20, random_state = 6)
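
If the class proportions should be preserved in the 80/20 split, `train_test_split` accepts a `stratify` argument. The sketch below illustrates it on synthetic labels. The class frequencies are made up, and the `_demo` names are used to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)

# Synthetic, deliberately imbalanced labels and random features
y_demo = rng.choice(["Low", "Average", "High"], size=3000, p=[0.2, 0.5, 0.3])
X_demo = rng.normal(size=(3000, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=6, stratify=y_demo
)

def class_shares(labels):
    """Return the share of each class in a label array."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print("train:", train_shares)
print("test: ", test_shares)
```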

In [28]:
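from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same scaling to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)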


In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Fit a random forest and score it on the test set
model = RandomForestClassifier(n_estimators = 100, random_state = 6)
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Confusion matrix and per-class precision, recall, and F1
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     Average       0.49      0.48      0.49     29451
        High       0.65      0.65      0.65     30724
         Low       0.63      0.64      0.63     28069

    accuracy                           0.59     88244
   macro avg       0.59      0.59      0.59     88244
weighted avg       0.59      0.59      0.59     88244



In [32]:
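# Note: this prints the bound method and the estimator's parameters, not the accuracy;
# call model.score(X_test, y_test) to compute the test accuracy.
print(model.score)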

<bound method ClassifierMixin.score of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=6, verbose=0,
                       warm_start=False)>


## IV. Modeling and Evaluation 3
Two tasks were carried out using the transformed dataset:  
1. Classification of properties into the price categories Low, Average, and High using KNN and a random forest classifier  
2. Prediction of property prices  

Hyperparameter tuning was carried out using GridSearchCV to find the optimal parameters for both the classification and prediction tasks.

## KNN using Transformed data
The following cells fit a KNN classifier to our dataset and evaluate it with stratified 3-fold cross-validation.

In [10]:
# Imports for standardizing the transformed data, applying PCA, and fitting KNN
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics as mt
from sklearn.decomposition import PCA
#from sklearn.pipeline import Pipeline

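# Create cross validation, standard scaler, and PCA objects
# (no shuffling is used, so random_state is omitted; newer scikit-learn versions
# raise an error if random_state is set while shuffle = False)
cv_obj = StratifiedKFold(n_splits = 3)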
clf = KNeighborsClassifier(n_jobs = -1)
ss = StandardScaler()

# Require the number of components used to have 99% variance explained
pca = PCA(n_components = .99, svd_solver = 'full', random_state = 6)

X = trans.drop(columns = ['price_class']).values
y = trans.price_class.values

In [8]:
# Split dataset and fit PCA to data
iter_num=0

# Iterate over the split data
for train_indices, test_indices in cv_obj.split(X,y): 

    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # Scale training and test data using the training data's mean and standard deviation
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    
    # Run the PCA algorithm on the data
    %time Xtrain_pca = pca.fit(X_train).transform(X_train)
    Xtest_pca = pca.transform(X_test)
    
    # Train the KNN model on the training data
    %time clf.fit(Xtrain_pca,y_train)
    y_hat = clf.predict(Xtest_pca)

    # Print the accuracy, precision, recall, fscore, and confusion matrix for each iteration
    # (precision_recall_fscore_support returns precision, recall, fscore, support, in that order)
    acc = mt.accuracy_score(y_test,y_hat)
    metrics = mt.precision_recall_fscore_support(y_test, y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    print("Precision, Recall, Fscore, Support\n", metrics)
    iter_num+=1

Wall time: 1min 1s
Wall time: 35.3 s
====Iteration 0  ====
accuracy 0.5895031718942294
confusion matrix
 [[28409 12493  8278]
 [15199 33471  2664]
 [17204  4535 24820]]
Precision, Recall, Fscore, Support
 (array([0.46716109, 0.6628052 , 0.69403277]), array([0.57765352, 0.652024  , 0.53308705]), array([0.51656484, 0.6573704 , 0.60300531]), array([49180, 51334, 46559], dtype=int64))
Wall time: 1min 1s
Wall time: 36.2 s
====Iteration 1  ====
accuracy 0.5927232919930374
confusion matrix
 [[26853 12579  9748]
 [14258 33882  3193]
 [15270  4851 26438]]
Precision, Recall, Fscore, Support
 (array([0.47627747, 0.66031338, 0.67137307]), array([0.54601464, 0.66004325, 0.56783866]), array([0.50876744, 0.66017828, 0.61528078]), array([49180, 51333, 46559], dtype=int64))
Wall time: 1min 1s
Wall time: 31.4 s
====Iteration 2  ====
accuracy 0.3939117841042762
confusion matrix
 [[ 7571  3325 38283]
 [ 7267  7598 36468]
 [ 2667  1128 42764]]
Precision, Recall, Fscore, Support
 (array([0.432505  , 0.6304871 , 0.36390248]), array([0
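
The per-class arrays in this output can be reproduced from the confusion matrix, which also shows which array is which. In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes, so precision divides the diagonal by the column sums and recall divides it by the row sums. The sketch below uses the iteration 0 matrix printed above.

```python
import numpy as np

# Confusion matrix from iteration 0 (rows = true class, columns = predicted class)
conf = np.array([
    [28409, 12493,  8278],
    [15199, 33471,  2664],
    [17204,  4535, 24820],
])

true_positives = np.diag(conf)
precision = true_positives / conf.sum(axis=0)  # correct predictions / all predictions of that class
recall = true_positives / conf.sum(axis=1)     # correct predictions / all true members of that class
fscore = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / conf.sum()

print("precision:", precision)
print("recall:   ", recall)
print("fscore:   ", fscore)
print("accuracy: ", accuracy)
```

The first array printed for iteration 0 matches the precision computed here and the second matches the recall, which is the order `precision_recall_fscore_support` returns them in.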

In [8]:
# Now repeat the above process using a grid search algorithm
from sklearn.model_selection import GridSearchCV

gs_clf = GridSearchCV(clf, param_grid = {'n_neighbors': [10, 20, 30]}, n_jobs = -1, verbose = 2, cv = 3)

# Split dataset and fit PCA to data
iter_num=0

# Iterate over the split data
for train_indices, test_indices in cv_obj.split(X,y): 

    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # Scale training and test data using the training data's mean and standard deviation
    %time X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    
    # Run the PCA algorithm on the data
    %time Xtrain_pca = pca.fit(X_train).transform(X_train)
    Xtest_pca = pca.transform(X_test)
    
    # Run the grid search over n_neighbors on the training data
    %time gs_clf.fit(Xtrain_pca,y_train)
    %time y_hat = gs_clf.predict(Xtest_pca)

    # Print the accuracy, precision, recall, fscore, and confusion matrix for each iteration
    # (precision_recall_fscore_support returns precision, recall, fscore, support, in that order)
    acc = mt.accuracy_score(y_test,y_hat)
    metrics = mt.precision_recall_fscore_support(y_test, y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    print("Precision, Recall, Fscore, Support\n", metrics)
    print("best estimator", gs_clf.best_params_)
    print("score", gs_clf.best_score_)
    iter_num+=1

Wall time: 11.6 s
Wall time: 1min 2s
Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 25.6min remaining: 12.8min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 34.9min finished


Wall time: 35min 29s
Wall time: 1min 39s
====Iteration 0  ====
accuracy 0.6194134885397048
confusion matrix
 [[29232 11005  8943]
 [14228 34304  2802]
 [15565  3431 27563]]
Precision, Recall, Fscore, Support
 (array([0.49524778, 0.70381617, 0.70120586]), array([0.59438796, 0.66825106, 0.59200155]), array([0.54030775, 0.68557268, 0.64199285]), array([49180, 51334, 46559], dtype=int64))
best estimator {'n_neighbors': 30}
score 0.5326796830113244
Wall time: 10.4 s
Wall time: 1min 2s
Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 23.8min remaining: 11.9min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 35.1min finished


Wall time: 35min 44s
Wall time: 1min 32s
====Iteration 1  ====
accuracy 0.6192069190600522
confusion matrix
 [[27487 11358 10335]
 [13242 34558  3533]
 [13565  3971 29023]]
Precision, Recall, Fscore, Support
 (array([0.5062622 , 0.69272556, 0.67666877]), array([0.55890606, 0.67321216, 0.62335961]), array([0.53128322, 0.68282948, 0.64892119]), array([49180, 51333, 46559], dtype=int64))
best estimator {'n_neighbors': 30}
score 0.5545277143167973
Wall time: 11 s
Wall time: 1min 3s
Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed: 14.0min remaining:  7.0min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 21.8min finished


Wall time: 22min 21s
Wall time: 5min 27s
====Iteration 2  ====
accuracy 0.38814586152266595
confusion matrix
 [[ 5079  2996 41104]
 [ 3509  7548 40276]
 [  722  1379 44458]]
Precision, Recall, Fscore, Support
 (array([0.54554243, 0.63306215, 0.35329551]), array([0.10327579, 0.14703992, 0.95487446]), array([0.17367368, 0.2386493 , 0.51576304]), array([49179, 51333, 46559], dtype=int64))
best estimator {'n_neighbors': 30}
score 0.6305359601557056
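
The scaling, PCA, and KNN steps above could also be chained in a scikit-learn `Pipeline` (its import is commented out in an earlier cell), so that `GridSearchCV` refits the scaler and PCA inside each training fold. The following is a sketch on synthetic data from `make_classification`, not on the notebook's dataset. The step names, the `_demo` variables, the `f1_macro` scoring choice, and the shuffled folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features and price classes
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=6
)

# Scale, reduce with PCA (keep 99% of the variance), then classify with KNN
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.99, svd_solver="full")),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [10, 20, 30]},
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=6),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Shuffled folds are used in this sketch because unshuffled folds follow the row order of the data, which may be worth checking given how much lower the accuracy is in iteration 2 above.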


# V. Modeling and Evaluation 4: Analyzing the Results Using F1 and Accuracy Metrics

# VI. Modeling and Evaluation 5: Model Comparisons: Relative Advantages and Statistical Significance

# VII. Modeling and Evaluation 6: Attribute Importance and Model Usefulness



# VIII. Deployment
Our model can be used by potential investors who want to do due diligence on properties for sale in South America.  To stay relevant, 
the model will need to be updated on a quarterly basis to reflect market fluctuations.  Since regular updates are needed, a subscription-based
deployment would be ideal.  

# IX. Exceptional Work

Exceptional work credit is requested for the following tasks:
1. Data wrangling
2. 


## Principal Component Analysis
Features in both datasets will be normalized, and Principal Component Analysis (PCA) will be conducted to reduce the dimensionality of our dataset. This will allow us to extract latent information that is thought to be contained within the country, province, and department features, while significantly reducing the size of our dataset and our model computation times.

# PCA for transformed dataset
from sklearn.decomposition import PCA

X = trans.drop(columns = ['price_class']).values
y = trans.price_class

pca = PCA(n_components = 10, random_state = 6)
X_pca = pca.fit(X).transform(X)

print('pca: ', pca.components_)

In [15]:
print('pca variance explained: ', pca.explained_variance_ratio_)
print('Cumulative ', sum(pca.explained_variance_ratio_))
print('first 3 ', sum(pca.explained_variance_ratio_[0:3]))

pca variance explained:  [7.25915920e-01 2.68862366e-01 1.16526088e-03 5.90946039e-04
 3.93789852e-04 2.69261263e-04 2.33105627e-04 2.02883871e-04
 1.62168838e-04 1.45526174e-04]
Cumulative  0.9979412291040144
first 3  0.9959435474386268


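The PCA on the transformed data with 10 components explains 99.8% of the variance in the data. This is a substantial reduction from the original 1,214 features while retaining nearly all of the variance. Further exploration shows that the first 3 principal components alone explain 99.6% of the variance. Because the features were not standardized before this PCA was fit, these leading components are likely dominated by the features with the largest variance rather than by the dummy variables.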
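
The cumulative figures can be checked directly from the printed ratios. The sketch below copies the `explained_variance_ratio_` values from the output above and finds how many components are needed to reach a target share of variance. The 99% target mirrors the `n_components = .99` setting used for the KNN models.

```python
import numpy as np

# explained_variance_ratio_ values copied from the PCA output above
ratios = np.array([
    7.25915920e-01, 2.68862366e-01, 1.16526088e-03, 5.90946039e-04,
    3.93789852e-04, 2.69261263e-04, 2.33105627e-04, 2.02883871e-04,
    1.62168838e-04, 1.45526174e-04,
])

cumulative = np.cumsum(ratios)

# Index of the first cumulative value that reaches the target, converted to a count
target = 0.99
n_needed = int(np.argmax(cumulative >= target)) + 1

print("cumulative variance explained:", np.round(cumulative, 4))
print("components needed to reach", target, ":", n_needed)
```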