# Workshop We Work Santiago 2022

### Today's goal

2022: We are launching a PropTech ("InmobiTech") startup. We need to collect information about rentals in Santiago.

![For rent](./Arriendo.jpeg)

👉 Our idea: go to [Portal Inmobiliario](https://www.portalinmobiliario.com) and pull all the info!

> By hand? 😳 🤯

> Definitely NOT! 😉

To do that, we will need to learn about:

- Data structures in *Python*: lists and dictionaries
- Data collection with web scraping (bs4)
- Visualization with Python libraries (Plotly / Seaborn)
- Calculations and statistics

### Data structures

**Lists**

- Index (position)
- Items can be read, added, modified, or deleted

In [None]:
students = ["Sebas", "Fede", "Camila"]

In [None]:
age = [32, 28, 26]

In [None]:
age[0]
age

In [None]:
students.append("Tomas")

In [None]:
students[1:4]
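The four list operations (read, add, modify, delete) can be sketched in one place. The names below are sample values only.

```python
# Illustrative list; the names are sample values.
students = ["Sebas", "Fede", "Camila"]

first = students[0]          # read by index (positions start at 0)
students.append("Tomas")     # add to the end
students[1] = "Federico"     # modify the item at position 1
removed = students.pop(2)    # delete (and return) the item at position 2

# Slicing past the end does not raise; it returns what is available.
tail = students[1:4]

print(first, removed, students, tail)
```

Note that `students[1:4]` asks for positions 1 to 3, but a slice simply stops at the end of the list.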

**Dictionaries**

- Pairs: `key` : `value`
- No positional indices
- Keys are unique

In [None]:
{'name': 'Sebas', 'age': 32}

**So...**

In [None]:
students = [
    {'name': 'Sebas', 'age': 32},
    {'name': 'Fede', 'age': 29},
    {'name': 'Camila', 'age': 26}
]
students

students.append({'name': 'Agustin', 'age': 24})
students[0]
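A short sketch of how a list of dictionaries is read and updated. The records are sample values, and the `city` and `email` keys are made up for illustration.

```python
# Illustrative records; names and ages are sample values.
students = [
    {'name': 'Sebas', 'age': 32},
    {'name': 'Fede', 'age': 29},
]

student = students[0]
name = student['name']            # read a value by its key
student['age'] = 33               # assigning to an existing key overwrites it (keys are unique)
student['city'] = 'Santiago'      # assigning to a new key adds a pair
missing = student.get('email')    # .get() returns None instead of raising KeyError

# Collect one field from every record in the list.
ages = [s['age'] for s in students]
print(name, missing, ages)
```

This is the shape the scraper produces later on: one dictionary per listing, collected in a list.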

### Web 101

![HTTP Request](./Web.png)

## How web scraping works

The idea behind this data extraction technique is to replace what we do in a web browser with a Python program.

- In a browser, we type a URL. The browser sends a request to a server using the HTTP protocol, and the server returns HTML code, which the browser interprets and renders as the page we see.
- With Python we can do the same: write code that sends requests to the server and receives the source code as HTML.
- The Python library BeautifulSoup helps us parse HTML documents and extract data from them. An HTML source contains a lot of information, and BeautifulSoup lets us access only the parts that matter for our study.
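To make the parsing step concrete, here is a minimal sketch that runs BeautifulSoup on a small hand-written HTML string instead of a real server response. The tag structure and class names (`listing`, `price`, `address`) are made up for illustration and are not the ones used by the real site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document standing in for a server response.
# The tag structure and class names here are made up for illustration.
html = """
<html><body>
  <div class="listing"><span class="price">450.000</span><p class="address">Providencia</p></div>
  <div class="listing"><span class="price">380.000</span><p class="address">Ñuñoa</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

listings = []
for item in soup.find_all(class_="listing"):
    price_text = item.find("span", class_="price").text
    listings.append({
        "price": int(price_text.replace(".", "")),  # "450.000" -> 450000
        "address": item.find(class_="address").text,
    })

print(listings)
```

The same pattern (find every container, then pull the fields out of each one) is what the collection function in this workshop applies to the real page.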


---

![HTML Tag](./Tags.png)

### OK, let's go!

##### Import the Python libraries we need

In [None]:
import requests
import numpy as np 
from bs4 import BeautifulSoup
import re

Request the page from the website:

In [None]:
url = "https://www.portalinmobiliario.com/arriendo/departamento/santiago-metropolitana/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly

In [None]:
# Starting offset of each results page: 1, 41, 81, ... (34 values, step of 40)
pages = np.arange(1, 34*40, 40).tolist()

##### Collection function

In [None]:
def transform_html_to_data(soup):
    # Each search result is a 'ui-search-layout__item' element.
    # The "restaurant" names come from the original code; each item is an apartment listing.
    restaurants_data = soup.find_all(class_='ui-search-layout__item')
    restaurants = []
    for restaurant in restaurants_data:
        price = restaurant.find('span', class_='price-tag-fraction').text
        price = int(price.replace(".", ""))  # "450.000" -> 450000
        address = restaurant.find(class_='ui-search-item__group__element ui-search-item__location shops__items-group-details').text
        space_information = restaurant.find(class_='ui-search-item__group ui-search-item__group--attributes shops__items-group').text
        # Reset per listing so a missing value becomes None instead of
        # reusing the previous listing's value (or raising NameError).
        size = None
        rooms = None
        if space_information:
            size_match = re.search(r'(\d+) m', space_information)
            if size_match:
                size = int(size_match.group(1))
            rooms_match = re.search(r'(\d+) dormitorio', space_information)
            if rooms_match:
                rooms = int(rooms_match.group(1))
        data = {'price (CLP)': price, 'rooms': rooms, 'size (m2)': size, 'address': address}
        restaurants.append(data)

    return restaurants
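The two regular expressions in the function can be tried on their own. A minimal sketch follows; the sample strings are assumptions about how the attribute text looks, and `parse_space_information` is a hypothetical helper, not part of the notebook.

```python
import re

# Sample attribute strings; the exact wording on the site is an assumption.
samples = ["45 m² útiles 2 dormitorios", "1 dormitorio", ""]

def parse_space_information(text):
    """Return (size, rooms) as ints, or None for any value not found."""
    size_match = re.search(r'(\d+) m', text)
    rooms_match = re.search(r'(\d+) dormitorio', text)
    size = int(size_match.group(1)) if size_match else None
    rooms = int(rooms_match.group(1)) if rooms_match else None
    return size, rooms

parsed = [parse_space_information(s) for s in samples]
print(parsed)
```

`re.search` returns `None` when nothing matches, which is why each value is checked before calling `.group(1)`.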

##### Iterate over the available `pages`

In [None]:
restaurants_list = []
for page in pages:
    url = f"https://www.portalinmobiliario.com/arriendo/departamento/santiago-metropolitana/_Desde_{page}_NoIndex_True"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly
    restaurants_list += transform_html_to_data(soup)
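The loop builds one URL per results page from a starting offset (1, 41, 81, ...). A small sketch of just that URL construction, with no network access. `build_page_urls` is a hypothetical helper, and the page size of 40 is inferred from the step used in `pages`, so treat it as an assumption.

```python
BASE_URL = "https://www.portalinmobiliario.com/arriendo/departamento/santiago-metropolitana/"

def build_page_urls(n_pages, page_size=40):
    """Build one URL per results page; n_pages and page_size are assumptions."""
    offsets = range(1, n_pages * page_size, page_size)  # 1, 41, 81, ...
    return [f"{BASE_URL}_Desde_{offset}_NoIndex_True" for offset in offsets]

urls = build_page_urls(n_pages=3)
for url in urls:
    print(url)
```

Printing the URLs before requesting them is a cheap way to check the pagination pattern against what the browser shows.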

**How many apartments did we retrieve?**

In [None]:
len(restaurants_list)

##### Convert the data to a DataFrame

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(restaurants_list)
df.head(20)

In [None]:
df.shape

##### Clean the data

- Missing information

In [None]:
df.isna().sum()

In [None]:
df.dropna(inplace=True)
df.shape

In [None]:
df.head(10)

In [None]:
df['price (CLP)'].max()

In [None]:
df['price (CLP)'].min()

## UF to CLP conversion: standardize the data to a single unit, in this case CLP

In [None]:
q_low = df["price (CLP)"].quantile(0.1)
q_hi  = df["price (CLP)"].quantile(0.98)

q_low1 = df["size (m2)"].quantile(0.1)
q_hi1  = df["size (m2)"].quantile(0.98)

df_filtered1= df[(df["size (m2)"] < q_hi1) & (df["size (m2)"] > q_low1)]
df_filtered1
# df_filtered = df[(df["price (CLP)"] < q_hi) & (df["price (CLP)"] > q_low)]


In [None]:
df_filtered = df_filtered1[(df_filtered1["price (CLP)"] < q_hi) & (df_filtered1["price (CLP)"] > q_low)]
df_filtered

In [None]:
# uf = 34750
# min_price_charters = 3
# # max_price_charters = 



# df_filtered['price (CLP)'] = df_filtered['price (CLP)'].map(lambda x : x if len(str(x)) > min_price_charters else x * uf)

# df['price (CLP)'] = df['price (CLP)'].map(lambda x : x if len(str(x)) > min_price_charters and len(str(x)) < max_price_charters else df.drop(['price (CLP)']))
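The commented-out cell hints at the conversion rule: a price with only a few digits is treated as a UF amount and multiplied by the UF value. A minimal plain-Python sketch of that heuristic, assuming the UF value of 34750 from the comment (the real UF changes daily). `to_clp` and `MAX_UF_DIGITS` are hypothetical names.

```python
UF_IN_CLP = 34750      # value taken from the commented-out cell; the real UF changes daily
MAX_UF_DIGITS = 3      # assumption: prices with up to 3 digits are quoted in UF

def to_clp(price):
    """Return the price in CLP, converting from UF when it looks like a UF amount."""
    if len(str(price)) > MAX_UF_DIGITS:
        return price            # already in CLP
    return price * UF_IN_CLP    # quoted in UF

prices = [450000, 12, 380000, 25]
prices_clp = [to_clp(p) for p in prices]
print(prices_clp)
```

On a DataFrame, the same function could be applied with `df['price (CLP)'].map(to_clp)`.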

In [None]:
df_filtered['price (CLP)'].max()

In [None]:
df_filtered['price (CLP)'].min()

In [None]:
df_filtered['price (CLP)'].mean()

In [None]:
df_filtered.to_csv('./RegionMetropolitanalimpio.csv')

- Outliers

> Rent price below 500000 CLP

In [None]:
condition = df_filtered['price (CLP)'] < 500000
df_filtered = df_filtered[condition]
df_filtered

In [None]:
df_filtered.shape

> Price above 300000 CLP

In [None]:
condition = df_filtered['price (CLP)'] > 300000
df_filtered = df_filtered[condition]
df_filtered

In [None]:
df_filtered.shape

> Fewer than 5 rooms

In [None]:
condition = df_filtered['rooms'] < 5
df_filtered = df_filtered[condition]
df_filtered

In [None]:
df_filtered.shape

### Data Visualization

In [None]:
import seaborn as sns
import plotly.express as px

In [None]:
fig = px.scatter(df_filtered, x="size (m2)", y="price (CLP)", size="rooms", title="Precio vs. Tamaño", width=800, height=400)
fig.show()

In [None]:
sns.set(rc={'figure.figsize':(15, 6)})  # set the figure size before plotting so it applies to this chart
sns.countplot(x="rooms", data=df_filtered)

# sns.catplot(x='rooms', y='price (CLP)', data=df, kind="box")

In [None]:
sns.regplot(x='size (m2)', y='price (CLP)', data=df_filtered)

In [None]:
condition = df_filtered['size (m2)'] < 200
df_max_size_200 = df_filtered[condition]

In [None]:
df.shape

In [None]:
df_max_size_200.shape

In [None]:
sns.regplot(x='size (m2)', y='price (CLP)', data=df_max_size_200, color='green')

##### Export to CSV

Thanks to **Pandas**, all we need is `.to_csv()`

In [None]:
df_max_size_200.to_csv('./RegionMetropolitana.csv')

## Machine Learning
### Linear regression

### Start with 1 variable to keep it simple
### Then explain that the model can be multivariate, and that preprocessing, normalization, and other steps are needed before training.

In [None]:
data = pd.read_csv('./RegionMetropolitanalimpio.csv')
data

In [None]:
data = data.drop(['Unnamed: 0'], axis=1)  # drop() returns a new DataFrame, so assign the result
data

In [None]:
livecode_data = data[['size (m2)','price (CLP)']]
livecode_data.head()

In [None]:
import matplotlib.pyplot as plt

# Plot apartment size vs. rent price
plt.scatter(data['size (m2)'], data['price (CLP)'])

# Labels
plt.xlabel("Size (m2)")
plt.ylabel("Rent price (CLP)")
plt.show()

In [None]:
sns.regplot(x='size (m2)', y='price (CLP)', data=data, color='green')

In [None]:
data[['size (m2)']].boxplot()

In [None]:
data[['price (CLP)']].boxplot()

## Training

### Training a Linear Regression model with Sklearn LinearRegression

In [None]:
from sklearn.linear_model import LinearRegression

# Instantiate the model
model = LinearRegression()

# Define X and y
X = data[['size (m2)']]  # a DataFrame of features
y = data['price (CLP)']  # a Series with the target

# Train the model on the data
model.fit(X, y)

### At this stage, the model has learned the optimal slope a and intercept b needed to map the relationship between X and y.
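With a single feature, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. A plain-Python sketch on made-up numbers, for intuition only:

```python
# Made-up (size, price) pairs that lie exactly on price = 5000 * size + 100000.
sizes = [30, 40, 50, 60]
prices = [250000, 300000, 350000, 400000]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Slope: covariance of x and y divided by the variance of x.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
# Intercept: the line passes through the point of means.
b = mean_y - a * mean_x

print(a, b)
```

Because the sample points sit exactly on a line, the computed slope and intercept recover that line.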

## Model Attributes
### a (slope) and b (intercept) are stored as model attributes and can be accessed.


In [None]:
 # View the model's slope (a)
model.coef_

In [None]:
# View the model's intercept (b)
model.intercept_

In [None]:
 # Evaluate the model's performance / R2
model.score(X,y)
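For a regression model, `score` returns R², the share of the variation in `y` that the predictions explain: R² = 1 − SS_res / SS_tot. A plain-Python sketch of that formula on made-up values:

```python
# Made-up observed prices and the values a model predicted for them.
y_true = [300000, 350000, 400000, 450000]
y_pred = [310000, 340000, 410000, 440000]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # error left unexplained
ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total variation around the mean

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

An R² close to 1 means the predictions track the data closely; a value near 0 means the model does no better than always predicting the mean.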

## Prediction

### The trained model lets us predict on new data

In [None]:
#  Predict on new data
model.predict([[50]])

### So, for an apartment with a surface area of 50 m2, the model predicts a rent of $x pesos per month

Steps

1. Import the model: from sklearn import model
2. Instantiate the model: model = model()
3. Train the model: model.fit(X, y)
4. Evaluate the model: model.score(new_X, new_y)
5. Make predictions: model.predict(new_X)


## Generalization

### The performance of a Machine Learning model is evaluated on its ability to generalize when predicting unseen data.

## The Holdout Method

### The Holdout Method is used to evaluate a model's ability to generalize. It consists of splitting the dataset into two sets:

- Training set (70%)
- Testing set (30%)

### The model is then evaluated with `model.score()` on the test set
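The split itself can be sketched by hand in plain Python: shuffle the rows, then cut off 30% as the test set. This illustrates the idea only; it is not how `train_test_split` is implemented, and `holdout_split` is a hypothetical helper.

```python
import random

def holdout_split(rows, test_size=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))          # stand-in for 10 listings
train, test = holdout_split(rows)
print(len(train), len(test))
```

Shuffling before the cut matters: listings scraped page by page may be ordered, and an unshuffled split would test on a different kind of listing than it trained on.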

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into train and test
train_data, test_data = train_test_split(livecode_data, test_size=0.3)

# Ready X's and y's

X_train = train_data[['size (m2)']]
y_train = train_data['price (CLP)']

X_test = test_data[['size (m2)']]
y_test = test_data['price (CLP)']

In [None]:
# Ready X and y
X = livecode_data[['size (m2)']]
y = livecode_data['price (CLP)']

# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
# Instantiate the model
model = LinearRegression()

# Train the model on the Training data
model.fit(X_train, y_train)

# Score the model on the Test data

model.score(X_test,y_test)


To improve performance, you can iterate over these steps:

- Collect data: increase the number of training examples.
- Feature processing: add more variables and improve the feature processing.

In [None]:
model.predict([[50]])

## Calculations to find good rental / investment opportunities

In [None]:
data1 = pd.read_csv('./RegionMetropolitanalimpio.csv')
data1

In [None]:
data2 = data1.drop(['Unnamed: 0'], axis=1)

data2

In [None]:
index = data2.index
number_of_rows = len(index)
number_of_rows

In [None]:
import math


# Price per square meter, expressed in units of 10,000 CLP
data2['clp/m2'] = (data2['price (CLP)'] / data2['size (m2)']) / 10000
data2['clp/m2'] = data2['clp/m2'].round(2)

# A size of 0 produces inf; treat those rows as missing and drop them
data2.replace([np.inf, -np.inf], np.nan, inplace=True)
data2.dropna(subset=['clp/m2'], inplace=True)
mean = data2['clp/m2'].mean()

data2['clp/m2 mean'] = mean

data2['over clp/m2 mean'] = data2['clp/m2'] > mean

data2['dif'] = data2['clp/m2'] - data2['clp/m2 mean']

data2['dif^2'] = data2['dif'] * data2['dif']

# Sample standard deviation; use the row count after dropna, not the earlier number_of_rows
data2['desv'] = math.sqrt(data2['dif^2'].sum() / (len(data2) - 1))

# Number of standard deviations from the mean
data2['sigma'] = (data2['dif'] / data2['desv'])

# Relative difference from the mean
data2['dif prom'] = data2['dif'] / data2['clp/m2 mean']

data2

In [None]:
order = data2.sort_values('sigma')   
order
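Sorting by `sigma` puts the listings that are cheapest per m2, relative to the average, at the top. The same calculation can be sketched with the standard library on made-up values, which also serves as a quick check of the formula:

```python
import statistics

# Made-up prices per m2 (in units of 10,000 CLP, as in the `clp/m2` column).
price_per_m2 = [0.8, 1.0, 1.2, 1.0, 0.5]

mean = statistics.mean(price_per_m2)
stdev = statistics.stdev(price_per_m2)   # sample standard deviation (divides by n - 1)

# Standard deviations away from the mean; negative means cheaper than average.
sigma = [(x - mean) / stdev for x in price_per_m2]

# Position of the listing furthest below the mean.
cheapest = min(range(len(sigma)), key=sigma.__getitem__)
print(cheapest, round(sigma[cheapest], 2))
```

A strongly negative `sigma` is only a candidate for a good deal; a low price per m2 can also point to a data error or a listing priced in a different unit.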

## Idea

- Add links to make searching easier