# Data Types

In [None]:
import numpy as np
import pandas as pd

## Setup

In this tutorial we use scraped Airbnb data for the city of Paris. The data is freely available at **Inside Airbnb**: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is available [here](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896).

We are going to use two datasets:

- listings dataset (`df_listings`): contains listing-level information
- pricing dataset (`df_prices`): contains daily pricing data for each listing, over time

In [5]:
# Import listings data
url_listings = "http://data.insideairbnb.com/france/ile-de-france/paris/2023-09-04/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

# Import pricing data
url_prices = "http://data.insideairbnb.com/france/ile-de-france/paris/2023-09-04/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")

In [6]:
df_prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24798862 entries, 0 to 24798861
Data columns (total 7 columns):
 #   Column          Dtype  
---  ------          -----  
 0   listing_id      int64  
 1   date            object 
 2   available       object 
 3   price           object 
 4   adjusted_price  object 
 5   minimum_nights  float64
 6   maximum_nights  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 1.3+ GB


In [9]:
df_prices.head()

|   | listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights |
|---|---|---|---|---|---|---|---|
| 0 | 3109 | 2023-09-05 | f | $110.00 | $110.00 | 2.0 | 30.0 |
| 1 | 3109 | 2023-09-06 | f | $110.00 | $110.00 | 2.0 | 30.0 |
| 2 | 3109 | 2023-09-07 | f | $110.00 | $110.00 | 2.0 | 30.0 |
| 3 | 3109 | 2023-09-08 | t | $110.00 | $110.00 | 2.0 | 30.0 |
| 4 | 3109 | 2023-09-09 | t | $110.00 | $110.00 | 2.0 | 30.0 |


## Numerical Data

> Methods
>
>- `+`, `-`, `*`, `/`
>- numpy functions
>- `pd.cut()`

Standard mathematical operations between columns are applied element-wise: each row of the result combines the values found in the same row of the two columns.

In [11]:
df_prices['maximum_nights'] - df_prices['minimum_nights']
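
As a minimal sketch of what this means in practice, the toy frame below (hypothetical values, not taken from the Airbnb files) subtracts two columns. Each result uses the two values from the same row, and a missing value in either column propagates to the result.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Airbnb calendar file)
toy = pd.DataFrame({
    "minimum_nights": [2.0, 1.0, np.nan],
    "maximum_nights": [30.0, 7.0, 365.0],
})

# Row i of the result is computed from row i of both columns
night_range = toy["maximum_nights"] - toy["minimum_nights"]
print(night_range)
```

The last row is `NaN` because `minimum_nights` is missing there.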

We can use most `numpy` operations element-wise on a single column.

In [14]:
np.log(df_listings['price'])

0        4.700480
1        4.941642
2        4.941642
3        5.192957
4        4.317488
           ...   
67937    5.010635
67938    5.645447
67939    4.454347
67940    5.257495
67941    5.662960
Name: price, Length: 67942, dtype: float64

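We can create a categorical variable from a numerical one using the `pd.cut()` function. By default the bins are closed on the right, so the bin edges below define the intervals (0, 50], (50, 100] and (100, ∞).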

In [16]:
pd.cut(df_listings['price'],
       bins=[0, 50, 100, np.inf],
       labels=['cheap', 'ok', 'expensive'])

0        expensive
1        expensive
2        expensive
3        expensive
4               ok
           ...    
67937    expensive
67938    expensive
67939           ok
67940    expensive
67941    expensive
Name: price, Length: 67942, dtype: category
Categories (3, object): ['cheap' < 'ok' < 'expensive']
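
To see how values that fall exactly on a bin edge are treated, here is a small sketch on made-up prices (the values are assumptions chosen for illustration). With the default settings, a value equal to an edge goes into the bin on its left, and a value equal to the lowest edge is not assigned to any bin.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, chosen to sit on and around the bin edges
prices = pd.Series([30, 50, 75, 100, 250])
bins = [0, 50, 100, np.inf]
labels = ["cheap", "ok", "expensive"]

# Default right=True: the intervals are (0, 50], (50, 100], (100, inf]
binned = pd.cut(prices, bins=bins, labels=labels)
print(binned.tolist())

# The lowest edge itself is excluded unless include_lowest=True is passed
zero_default = pd.cut(pd.Series([0]), bins=bins, labels=labels)
zero_included = pd.cut(pd.Series([0]), bins=bins, labels=labels, include_lowest=True)
print(zero_default.isna().iloc[0], zero_included.iloc[0])
```

If listings with a price of zero should count as `cheap`, passing `include_lowest=True` is one way to keep them.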

## String Data

> Methods
>
>- `+`
>- `.str.replace`
>- `.str.contains`
>- `.astype(str)`
>- `pd.get_dummies()`

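We can use the `+` operator between string columns to concatenate them element-wise.

**Note**: this only works when both columns contain strings. We cannot add a string column to a numeric one directly; the numeric column must first be converted with `.astype(str)`.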

In [17]:
df_listings['host_name'] + df_listings['neighbourhood']

0                  AnneObservatoire
1              BorzouHôtel-de-Ville
2              FranckHôtel-de-Ville
3                        AnaïsOpéra
4                  BernadetteLouvre
                    ...            
67937    Elisa TahinaPalais-Bourbon
67938                  JoffreyOpéra
67939                  JoffreyPassy
67940        Vinicius ParisianOpéra
67941                 RubenGobelins
Length: 67942, dtype: object
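
The output above glues the two values together without a separator. A short sketch on a toy frame (the rows and the `id` values are made up for illustration) shows how a literal string can be added in between, and how a numeric column can be included after converting it to string.

```python
import pandas as pd

# Toy listings (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "id": [1, 2],
    "host_name": ["Anne", "Borzou"],
    "neighbourhood": ["Observatoire", "Hôtel-de-Ville"],
})

# A scalar string is broadcast to every row
labelled = toy["host_name"] + " (" + toy["neighbourhood"] + ")"

# Numeric columns must be converted to string before concatenation
with_id = toy["host_name"] + "_" + toy["id"].astype(str)

print(labelled.tolist())
print(with_id.tolist())
```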

Pandas Series have a lot of vectorized string functions. You can find a list [here](https://pandas.pydata.org/docs/user_guide/text.html#method-summary).

For example, we want to remove the dollar symbol from the `price` variable in the `df_prices` dataset.

In [19]:
df_prices['price'].str.replace('$', '', regex=False)

0           110.00
1           110.00
2           110.00
3           110.00
4           110.00
             ...  
24798857     85.00
24798858     85.00
24798859     85.00
24798860     85.00
24798861     85.00
Name: price, Length: 24798862, dtype: object

Some of these functions use regular expressions.

- `match()`: Call `re.match()` on each element, returning a boolean.
- `extract()`: Call `re.search()` on each element, returning the matched groups as strings.
- `findall()`: Call `re.findall()` on each element.
- `replace()`: Replace occurrences of a pattern with some other string.
- `contains()`: Call `re.search()` on each element, returning a boolean.
- `count()`: Count occurrences of a pattern.
- `split()`: Equivalent to `str.split()`, but accepts regular expressions.
- `rsplit()`: Equivalent to `str.rsplit()`, splitting from the end of the string.
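
A few of the functions listed above can be tried on a toy Series of listing names. The names and the `sqm` pattern below are invented for illustration and are not taken from the Airbnb data.

```python
import pandas as pd

# Hypothetical listing names
names = pd.Series([
    "Studio 25 sqm near center",
    "Loft au centre, 2 rooms, 60 sqm",
    "Cosy flat",
])

# contains(): True where the pattern is found anywhere in the string
has_centre = names.str.contains("centre|center")

# extract(): first capture group of the first match; NaN when nothing matches
size = names.str.extract(r"(\d+) sqm", expand=False)

# findall(): list of all non-overlapping matches per element
numbers = names.str.findall(r"\d+")

# count(): number of matches per element
n_digits = names.str.count(r"\d")

print(pd.DataFrame({"has_centre": has_centre, "size": size,
                    "numbers": numbers, "n_digits": n_digits}))
```

Note that `extract()` returns strings, so `size` would still need a numeric conversion before doing arithmetic on it.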

For example, the next code checks whether the word `centre` or `center` is contained in the listing name.

In [20]:
df_listings['name'].str.contains('centre|center')

0        False
1        False
2        False
3        False
4        False
         ...  
67937    False
67938    False
67939    False
67940    False
67941    False
Name: name, Length: 67942, dtype: bool

Lastly, we can (try to) convert string variables to numeric using `astype(float)`.

In [21]:
# Inside a character class, `|` is a literal pipe, so `[$,]` is enough to match `$` and `,`
df_prices['price'].str.replace('[$,]', '', regex=True).astype(float)

0           110.0
1           110.0
2           110.0
3           110.0
4           110.0
            ...  
24798857     85.0
24798858     85.0
24798859     85.0
24798860     85.0
24798861     85.0
Name: price, Length: 24798862, dtype: float64
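
The conversion above only succeeds if every cleaned string is a valid number. As a hedged alternative, the sketch below uses `pd.to_numeric()` with `errors="coerce"` on made-up price strings; the `"n/a"` entry is an assumption added to show what happens with an unparseable value.

```python
import pandas as pd

# Hypothetical raw prices, including one value that is not a number
raw = pd.Series(["$110.00", "$1,250.00", "n/a"])

# Strip the dollar sign and the thousands separator
cleaned = raw.str.replace("[$,]", "", regex=True)

# errors="coerce" turns unparseable strings into NaN instead of raising
as_number = pd.to_numeric(cleaned, errors="coerce")
print(as_number)
```

On the same input, `cleaned.astype(float)` would raise a `ValueError` because of the `"n/a"` entry, which is why `astype` is better suited to columns that are already known to be clean.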

We can also convert numeric variables to strings using `astype(str)`.

In [None]:
df_listings['id'].astype(str)

We can generate dummies (one indicator column per category) from a categorical variable using `pd.get_dummies()`.

In [22]:
df_listings['neighbourhood'].head()

0      Observatoire
1    Hôtel-de-Ville
2    Hôtel-de-Ville
3             Opéra
4            Louvre
Name: neighbourhood, dtype: object

In [23]:
pd.get_dummies(df_listings['neighbourhood']).head()

|   | Batignolles-Monceau | Bourse | Buttes-Chaumont | Buttes-Montmartre | Entrepôt | Gobelins | Hôtel-de-Ville | Louvre | Luxembourg | Ménilmontant | Observatoire | Opéra | Palais-Bourbon | Panthéon | Passy | Popincourt | Reuilly | Temple | Vaugirard | Élysée |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False |
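
In practice the dummies are usually attached back to the original table. The sketch below does this on a toy frame (the rows and the `nbhd` prefix are assumptions for illustration), using the `prefix` and `dtype` options of `pd.get_dummies()`.

```python
import pandas as pd

# Toy listings (hypothetical rows)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood": ["Opéra", "Louvre", "Opéra"],
})

# prefix labels the new columns; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy["neighbourhood"], prefix="nbhd", dtype=int)

# Attach the indicator columns to the original frame, aligned on the index
toy_wide = pd.concat([toy, dummies], axis=1)
print(toy_wide)
```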


## Time Data

> Methods
>
>- `pd.to_datetime()`
>- `.dt.year`
>- `.dt.to_period()`
>- `pd.to_timedelta()`

In the `df_prices` dataset we have a date variable, `date`. Which format is it stored in? We can check with the `.dtypes` attribute.

In [None]:
df_prices['date'].dtypes

We can **convert** a variable into a date using the `pd.to_datetime()` function.

In [None]:
df_prices['datetime'] = pd.to_datetime(df_prices['date'])

Indeed, if we now check the type of the `datetime` variable, it is a `datetime64` type.

In [None]:
df_prices['datetime'].dtypes
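
When the layout of the date strings is known, it can be passed explicitly. The following sketch uses made-up strings, including one invalid entry added as an assumption, to show the `format` and `errors` options of `pd.to_datetime()`.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately invalid
raw_dates = pd.Series(["2023-09-05", "2023-09-06", "not a date"])

# format states the expected layout; errors="coerce" yields NaT for bad values
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
print(parsed.dtype)
```

Without `errors="coerce"`, the invalid entry would raise an error instead of becoming `NaT`.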

Once we have a variable in `datetime` format, we gain plenty of datetime operations through the `dt` accessor object for datetime-like properties. 

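For example, we can **extract the year** using `.dt.year`. We can do the same with `.dt.month` and `.dt.day`. For the week number, recent versions of pandas no longer provide `.dt.week`; use `.dt.isocalendar().week` instead.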

In [None]:
df_prices['datetime'].dt.year
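
As a small illustration, the sketch below collects several date components side by side for two example dates (chosen arbitrarily). It assumes a recent pandas version, where the ISO week number is obtained through `.dt.isocalendar()`.

```python
import pandas as pd

# Two example dates (arbitrary choices for illustration)
dates = pd.to_datetime(pd.Series(["2023-09-05", "2024-01-01"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "week": dates.dt.isocalendar().week,  # ISO 8601 week number
    "weekday": dates.dt.day_name(),
})
print(parts)
```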

We can change the **level of aggregation** of a date using `.dt.to_period()`. The option `M` converts to year-month level. 

In [None]:
df_prices['datetime'].dt.to_period('M')
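
A common reason to change the level of aggregation is to group observations by month. The sketch below, on three made-up dates, converts them to monthly periods and counts how many fall in each month.

```python
import pandas as pd

# Hypothetical dates spanning two months
dates = pd.to_datetime(pd.Series(["2023-09-05", "2023-09-30", "2023-10-01"]))

# Each date is mapped to the month that contains it
months = dates.dt.to_period("M")

# Number of observations per month, in chronological order
counts = months.value_counts().sort_index()
print(counts)
```

The resulting periods can also be used as a key in `groupby()`, for example to average prices by month.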

We can **add or subtract time periods** from a date using the `pd.to_timedelta()` function. We need to specify the unit of measurement with the `unit` option.

In [None]:
df_prices['datetime'] - pd.to_timedelta(3, unit='d')
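
Subtracting two dates goes in the other direction and produces a time difference. The sketch below uses two example dates and takes 2023-09-04, the snapshot date that appears in the download URLs, as a reference point; treating it as the scrape date is an assumption.

```python
import pandas as pd

# Two example dates (arbitrary choices for illustration)
dates = pd.to_datetime(pd.Series(["2023-09-05", "2023-10-01"]))

# Shift every date back by three days
three_days_earlier = dates - pd.to_timedelta(3, unit="D")

# Date minus date gives a timedelta; .dt.days extracts whole days
reference = pd.Timestamp("2023-09-04")  # assumed scrape date
days_since_reference = (dates - reference).dt.days

print(three_days_earlier.tolist())
print(days_since_reference.tolist())
```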

## Missing Data

> Methods
>
>- `.isna()`
>- `.dropna()`
>- `.fillna()`

The method `isna()` returns a table of the same shape with `True` wherever a value is missing.

In [None]:
df_listings.isna().head()

To get a quick count of the missing values in each column, we can sum the result of `isna()`.

In [None]:
df_listings.isna().sum()
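
Since `True` counts as 1, the same idea also gives the share of missing values per column by taking the mean instead of the sum. A minimal sketch on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy data with some missing entries (hypothetical values)
toy = pd.DataFrame({
    "price": [110.0, np.nan, 85.0, 60.0],
    "reviews_per_month": [np.nan, np.nan, 0.5, 1.2],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```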

We can drop missing values using `dropna()`. It drops all rows with at least one missing value.

In [None]:
df_listings.dropna().shape

In this case, unfortunately, it drops all the rows. If we want to drop only rows in which all values are missing, we can use the parameter `how='all'`.

In [None]:
df_listings.dropna(how='all').shape

If we want to drop only the rows with missing values in one particular variable, we can use the `subset` option.

In [None]:
df_listings.dropna(subset=['reviews_per_month']).shape
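
The different `dropna()` options are easier to compare on a small example. In the toy frame below (hypothetical values), the column `license` is entirely missing, which is the kind of column that makes a plain `dropna()` remove every row.

```python
import numpy as np
import pandas as pd

# Toy data: one column is completely empty (hypothetical values)
toy = pd.DataFrame({
    "price": [110.0, np.nan, 85.0],
    "reviews_per_month": [0.5, np.nan, np.nan],
    "license": [np.nan, np.nan, np.nan],
})

print(toy.dropna().shape)                              # any missing value drops the row
print(toy.dropna(how="all").shape)                     # only rows that are entirely missing
print(toy.dropna(subset=["reviews_per_month"]).shape)  # only look at one column
print(toy.dropna(axis=1, how="all").shape)             # drop empty columns instead of rows
```

Dropping fully empty columns first with `axis=1, how="all"` is one possible way to keep a later row-wise `dropna()` from discarding everything.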

We can also fill the missing values instead of dropping them, using `fillna()`.

In [None]:
df_listings.fillna(' -- This was NA  -- ').head()
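
Filling every column with the same string mixes text into numeric columns. As a sketch of a more targeted approach, `fillna()` also accepts a dictionary that maps column names to fill values; the toy data and the chosen fill values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a missing name and a missing numeric value (hypothetical)
toy = pd.DataFrame({
    "name": ["Loft", None],
    "reviews_per_month": [0.5, np.nan],
})

# One fill value per column keeps the numeric column numeric
filled = toy.fillna({"name": "unknown", "reviews_per_month": 0.0})
print(filled)
```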

We can also create missing values ourselves by assigning `np.nan` to a cell.

In [None]:
df_listings.iloc[2, 2] = np.nan
df_listings.iloc[:3, :3]