# Properties in Bogota, Colombia

The main goal of this immersion is to build a model that predicts the cost of a property in the different neighborhoods of Bogota, using the dataset "inmuebles_bogota.csv".

## Importing and describing our data

### Importing libraries that will be used

In [50]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data=pd.read_csv(r"C:\Users\Emmanuel\OneDrive - Instituto Politecnico Nacional\GITHUB\Inmersion-datos-aiura-Latam\Inmersion-datos-Aiura-LATAM\inmuebles_bogota.csv")
data.head()

Unnamed: 0,Tipo,Descripcion,Habitaciones,Baños,Área,Barrio,UPZ,Valor
0,Apartamento,Apartamento en venta en Zona Noroccidental,3,2,70,Zona Noroccidental,CHAPINERO: Pardo Rubio + Chapinero,$ 360.000.000
1,Casa,Casa en venta en Castilla,4,3,170,Castilla,KENNEDY: Castilla + Bavaria,$ 670.000.000
2,Apartamento,Apartamento en venta en Chico Reservado,3,3,144,Chico Reservado,CHAPINERO: Chicó Lago + El Refugio,$ 1.120.000.000
3,Apartamento,Apartamento en venta en Usaquén,3,2,154,Usaquén,Usaquén,$ 890.000.000
4,Apartamento,Apartamento en venta en Bella Suiza,2,3,128,Bella Suiza,USAQUÉN: Country Club + Santa Bárbara,$ 970.000.000


## We will try to answer the following questions:

Q1.- Are there any outliers in the columns "Habitaciones", "Baños", "Área"?

Q2.- Are all properties for sale?

Q3.- How many categories are in the "Tipo" column and how many properties are in each one?

Q4.- How many properties are there in each neighborhood?

Q5.- What is the mean cost per m² for each neighborhood?

Q6.- Which are the top 10 neighborhoods with the highest variation in cost?

Q7.- Which are the top 10 neighborhoods with the highest mean cost?

Q8.- Which are the top 10 neighborhoods with the lowest mean cost?

### 1.- Knowing and cleaning data

In this section I will remove possible sources of error such as:

- strange characters

- wrong data types

- values that do not make sense

In [51]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9520 entries, 0 to 9519
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Tipo          9520 non-null   object
 1   Descripcion   9520 non-null   object
 2   Habitaciones  9520 non-null   int64 
 3   Baños         9520 non-null   int64 
 4   Área          9520 non-null   int64 
 5   Barrio        9520 non-null   object
 6   UPZ           9478 non-null   object
 7   Valor         9520 non-null   object
dtypes: int64(3), object(5)
memory usage: 595.1+ KB
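
The `info()` output above reports 9478 non-null values for "UPZ" against 9520 rows, so 42 entries are missing in that column. A minimal sketch of how such gaps can be counted per column follows. The small frame is a toy stand-in built from rows shown earlier, with one "UPZ" value blanked out on purpose for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset: the second UPZ value is set to None
# on purpose (an assumption for illustration, not taken from the data).
toy = pd.DataFrame({
    "Barrio": ["Castilla", "Usaquén", "Bella Suiza"],
    "UPZ": ["KENNEDY: Castilla + Bavaria", None,
            "USAQUÉN: Country Club + Santa Bárbara"],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the full dataset the equivalent call would be `data.isna().sum()`.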


- The characters "ñ" and "Á" in the column names might be problematic, so I will replace them with plain ASCII letters.

In [52]:
data=data.rename(columns={"Área":"Area","Baños":"Banos"})
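
The explicit mapping above is enough for two columns. As an alternative sketch (not what this notebook does), accents can be stripped from every column name at once with the standard `unicodedata` module. The helper name `strip_accents` and the empty demo frame are assumptions made for illustration.

```python
import unicodedata

import pandas as pd


def strip_accents(name: str) -> str:
    """Return `name` with accents and tildes removed (hypothetical helper)."""
    # NFKD splits "ñ" into "n" plus a combining tilde, "Á" into "A" plus an accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only the base characters and drop the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Empty demo frame that only carries column names from the dataset.
demo = pd.DataFrame(columns=["Tipo", "Baños", "Área"])
demo = demo.rename(columns=strip_accents)
print(list(demo.columns))
```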

- I will convert the "Valor" column from object (string) to a numeric type and change the unit to millions, storing the result in a new column, "cost_millions".

In [53]:
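# Split "$ 360.000.000" on whitespace; expand=True returns a DataFrame and column 1 holds the digits
data["cost_millions"] = data.Valor.str.split(expand=True)[1]
# The dots are thousands separators; regex=False treats "." as a literal character
data.cost_millions = data.cost_millions.str.replace(".", "", regex=False)
# Convert to integer pesos, then scale to millions
data.cost_millions = data.cost_millions.astype("int64") / 1e6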
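
To make the conversion easier to follow, here is a self-contained sketch of the same idea on three "Valor" strings copied from the first rows of the dataset. It uses a single regular expression instead of the split-then-replace steps in the cell, which is a substitution made for brevity here, not the notebook's own code.

```python
import pandas as pd

# Sample values copied from the first rows of the dataset.
valor = pd.Series(["$ 360.000.000", "$ 670.000.000", "$ 1.120.000.000"])

# Drop every non-digit character (currency sign, space, thousands dots),
# then convert to integer pesos and scale to millions.
cost_millions = valor.str.replace(r"\D", "", regex=True).astype("int64") / 1e6
print(cost_millions.tolist())
```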



In [54]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9520 entries, 0 to 9519
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Tipo           9520 non-null   object 
 1   Descripcion    9520 non-null   object 
 2   Habitaciones   9520 non-null   int64  
 3   Banos          9520 non-null   int64  
 4   Area           9520 non-null   int64  
 5   Barrio         9520 non-null   object 
 6   UPZ            9478 non-null   object 
 7   Valor          9520 non-null   object 
 8   cost_millions  9520 non-null   float64
dtypes: float64(1), int64(3), object(5)
memory usage: 669.5+ KB


Now that the data types are right, I will check each numeric column for unusual values.

In [55]:
data.describe()

Unnamed: 0,Habitaciones,Banos,Area,cost_millions
count,9520.0,9520.0,9520.0,9520.0
mean,3.07,2.45,146.66,602.06
std,2.05,1.26,1731.38,704.6
min,1.0,0.0,2.0,60.0
25%,2.0,2.0,57.0,250.0
50%,3.0,2.0,80.0,409.18
75%,3.0,3.0,135.0,680.0
max,110.0,9.0,166243.0,16000.0
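
The summary above already hints at outliers: the maximum "Area" is 166243 while the 75th percentile is 135. One common way to flag such values is the 1.5 × IQR rule, which is an assumption here and not necessarily the method used later in this analysis. The sketch below uses a toy series: 57, 80, 135 and 166243 are the quartiles and maximum reported by `describe()`, while 50 is a made-up filler value. The function name `iqr_fences` is hypothetical.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy areas (see the note above about where the numbers come from).
area = pd.Series([50, 57, 80, 135, 166243])

lower, upper = iqr_fences(area)
# Anything outside the fences is flagged as a potential outlier.
outliers = area[(area < lower) | (area > upper)]
print(lower, upper)
print(outliers.tolist())
```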


I only need two decimals, so I will change how numbers are displayed.

In [57]:
pd.set_option("display.precision",2)
pd.set_option("display.float_format",lambda x: "%.2f" %x)
data.describe()

Unnamed: 0,Habitaciones,Banos,Area,cost_millions
count,9520.0,9520.0,9520.0,9520.0
mean,3.07,2.45,146.66,602.06
std,2.05,1.26,1731.38,704.6
min,1.0,0.0,2.0,60.0
25%,2.0,2.0,57.0,250.0
50%,3.0,2.0,80.0,409.18
75%,3.0,3.0,135.0,680.0
max,110.0,9.0,166243.0,16000.0
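
`pd.set_option` changes the display for the rest of the session. If the two-decimal format is only wanted for a single output, `pd.option_context` applies it temporarily. A small sketch with made-up numbers:

```python
import pandas as pd

# Made-up values, only used to show the formatting.
numbers = pd.Series([409.18123, 1731.384])

# Inside the block, floats are rendered with two decimals.
with pd.option_context("display.float_format", "{:.2f}".format):
    inside = repr(numbers)

# Outside the block, the previous display settings apply again.
outside = repr(numbers)

print(inside)
print(outside)
```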


For example, I can look up the row with the maximum number of rooms and check whether that value is consistent with its "Area" value.

In [58]:
data[data.Habitaciones == data.Habitaciones.max()]

Unnamed: 0,Tipo,Descripcion,Habitaciones,Banos,Area,Barrio,UPZ,Valor,cost_millions
897,Casa,Casa en venta en La Uribe,110,2,110,La Uribe,Usaquén,$ 480.000.000,480.0
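
The row above lists 110 rooms in 110 m², that is, 1 m² per room, which looks implausible and may be a data-entry error. A sketch of one way to surface such rows is to compute the area per room and flag anything below a threshold. The threshold `MIN_M2_PER_ROOM` and the column name `m2_per_room` are hypothetical choices for illustration; the sample rows are taken from the outputs shown earlier.

```python
import pandas as pd

# Rows 0-2 come from data.head(); row 897 is the one with the maximum room count.
rows = pd.DataFrame(
    {"Habitaciones": [3, 4, 3, 110], "Area": [70, 170, 144, 110]},
    index=[0, 1, 2, 897],
)

# Hypothetical threshold: less than 5 m² per room is treated as suspicious.
MIN_M2_PER_ROOM = 5

rows["m2_per_room"] = rows.Area / rows.Habitaciones
suspicious = rows[rows.m2_per_room < MIN_M2_PER_ROOM]
print(suspicious)
```

Whether flagged rows should be dropped or corrected is a separate decision that depends on how many there are and on what the listing description says.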
