# <b>New York City Airbnb EDA</b>
----
What is Exploratory Data Analysis (EDA)?

Exploratory data analysis (EDA) is the first, fundamental step of any data analysis: it aims to understand the main characteristics of a data set before more advanced analysis or modeling is performed.

EDA involves the following:

- Data visualization: Using plots such as histograms, box plots and scatter plots to visualize the distribution of the data, the relationships between variables, and any anomalies or peculiarities in the data.
- Anomaly identification: Detecting, and sometimes handling, outliers or missing data that could affect further analysis.
- Hypothesis formulation: From this exploration, analysts can begin to formulate hypotheses that will then be tested in more detailed analysis or modeling.

The main purpose of EDA is to see what the data can tell us beyond the formal modeling task. It helps ensure that the subsequent modeling or analysis phase is done properly and that any conclusions rest on a correct understanding of the structure of the data.
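
As a minimal sketch of the three activities listed above, assuming a small made-up DataFrame (the values below are invented for illustration and are not taken from the Airbnb data), pandas alone can summarize a distribution, count missing values, flag outliers with the common 1.5 × IQR rule, and produce a first grouped comparison that might suggest a hypothesis:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "price": [50, 60, 55, 65, 70, 1000, None],
    "room_type": ["Private", "Shared", "Private", "Entire",
                  "Entire", "Entire", "Private"],
})

# Distribution summary (a numeric stand-in for a histogram or box plot).
summary = toy["price"].describe()

# Anomaly identification, part 1: missing values per column.
missing = toy.isnull().sum()

# Anomaly identification, part 2: outliers by the 1.5 * IQR rule.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

# Hypothesis formulation: a grouped mean is one starting point,
# e.g. "does price differ by room type?"
mean_by_type = toy.groupby("room_type")["price"].mean()

print(summary)
print(missing)
print(outliers)
print(mean_by_type)
```

The 1.5 × IQR threshold is a convention rather than a rule derived from this dataset; other cut-offs may suit skewed price data better.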

Step 1: Problem statement and data collection
Step 2: Exploration and data cleaning
2.1 Summarize DF
2.2 Eliminate duplicates
2.3 Eliminate irrelevant information
Step 3: Analysis of univariate variables
3.1 Analysis of categorical variables
3.2 Analysis of numerical variables
Step 4: Analysis of multivariate variables
4.1 Numerical-numerical analysis
4.2 Categorical-categorical analysis
4.3 Other mixed analysis
4.4 From string to numerical transformation --> JSON saving
4.5 Correlation matrix
Step 5: Feature engineering
5.1 Outliers analysis
5.2 Upper and lower limits for variables with outliers --> JSON saving
5.3 Copy 2 DataFrames: with and without outliers
5.4 Missing value analysis
5.5 Inference of new features
5.6 Feature Scaling
5.6.1 Train/test split
5.6.2 Normalization
5.6.3 Min-Max Scaling
Step 6: Feature selection

In [9]:
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String
import pandas as pd
# Load the CSV from a URL or from a path in our repository
# and store it in a DataFrame variable
df_rawdata = pd.read_csv("/workspaces/JBGEDA1/data/raw/AB_NYC_2019.csv")

In [10]:
# Store the DataFrame's dimensions, a (rows, columns) tuple, in a variable
dimensions = df_rawdata.shape
# .info() prints basic information about the DataFrame
df_rawdata.info()
# Print the dimensions
print(dimensions)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [11]:
# Drop the columns that are not useful for correlating
# with our target variable
df_rawdata.drop(['name','host_name','id'], inplace=True, axis=1)
# Count the nulls per column, i.e. values that are
# invalid or were never entered for a row
nulls = df_rawdata.isnull().sum()
# Print the null counts to the console
print(nulls)

host_id                               0
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64


Our target is, naturally, the rental price per room of Airbnb listings in New York. Given the number of nulls in the ['last_review'] and ['reviews_per_month'] columns (10052 each), we are not interested in correlating against them: doing so would cost us a large number of rows that hold valid data in the other columns. We will therefore drop these two columns as well.
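
The trade-off behind this decision can be sketched on a toy DataFrame. The values below are invented and only the column names mirror the dataset: dropping rows that contain nulls discards valid data in the other columns, while dropping the sparse columns keeps every row.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the dataset.
toy = pd.DataFrame({
    "price": [100, 80, 150, 60],
    "availability_365": [365, 10, 200, 0],
    "last_review": ["2019-05-01", None, "2019-06-10", None],
    "reviews_per_month": [0.4, None, 1.2, None],
})

# Share of nulls per column, as a fraction between 0 and 1.
null_share = toy.isnull().mean()

# Option A: drop every row that contains a null (valid prices are lost too).
rows_dropped = toy.dropna()

# Option B: drop the sparse columns (every row is kept).
cols_dropped = toy.drop(columns=["last_review", "reviews_per_month"])

print(null_share)
print(len(rows_dropped), len(cols_dropped))
```

By the counts printed earlier, option A would remove 10052 of the 48895 rows in the real data, about 20.6%.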

In [12]:
# Drop more columns from the DataFrame.
df_rawdata.drop(['last_review','reviews_per_month'], inplace=True, axis=1)
# To confirm that they were removed,
# we can print the null counts again
nulls = df_rawdata.isnull().sum()
# Print the null counts to the console
print(nulls)

host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64
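
As an optional alternative to the `inplace=True` calls used in this notebook, here is a minimal sketch that leaves the raw frame untouched and checks the result. The helper name `drop_unused_columns` and the toy values are hypothetical and not part of the original notebook:

```python
import pandas as pd

def drop_unused_columns(df, columns):
    """Return a copy of df without the given columns; df itself is not modified."""
    return df.drop(columns=columns)

# Toy data, invented for illustration only.
raw = pd.DataFrame({
    "id": [1, 2],
    "name": ["a", None],
    "price": [100, 80],
    "last_review": [None, "2019-06-10"],
})

clean = drop_unused_columns(raw, ["id", "name", "last_review"])

# Total number of nulls left in the cleaned frame.
remaining_nulls = int(clean.isnull().sum().sum())

print(clean.columns.tolist())
print(remaining_nulls)
```

Keeping the raw DataFrame intact makes it easier to rerun individual cells without reloading the CSV, at the cost of holding a second copy in memory.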
