![Riga](https://cdn.getyourguide.com/img/tour_img-1814782-148.jpg)
# Introduction
[Riga](https://en.wikipedia.org/wiki/Riga) is a lovely city near the Baltic Sea, the capital of Latvia. 


This kernel was written by Riga Data Science Club - an international community of data scientists based in Riga and on Slack 😃
We are happy to welcome people from all over the world to our friendly chat. It is totally free. Please sign up here: [http://rigadsclub.com/join-us/](http://rigadsclub.com/join-us/)

Yours,
Riga DS Club

# Data exploration
First, let's load our dataset and get familiar with it by printing out several rows:

In [None]:
import pandas as pd
df = pd.read_csv('/kaggle/input/riga-real-estate-dataset/riga_re.csv')
# Printing top 5 rows
df.head(5)

In [None]:
# Checking the total number of rows in the given dataset
len(df)

Let's take a look at the **op_type** column. This abbreviation stands for "operation type". The values of this column might have a huge impact on our further work, since the sale price of any property differs greatly from its rent price.

Let's check if there are any other operation types in this column:

In [None]:
# Printing out unique values of a column
df.op_type.unique()

In [None]:
# Grouping by operation type and getting statistics within groups
df_by_op_type = df.groupby('op_type')
df_by_op_type.describe()


As you can see, there are also other values like "Buying", "Renting", "Change" and "Other". Before continuing, let's do the following:
1. Drop entries with operations "Change" and "Other" as irrelevant to our goal - price prediction
2. Drop entries with operations "Buying" and "Renting" as they are represented by very few samples
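
The claim that some operation types have very few samples can be checked by counting rows per type. Below is a minimal sketch on made-up data: the `toy` frame and its "For sale" / "For rent" labels are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values here are made up
toy = pd.DataFrame({
    "op_type": ["For sale", "For rent", "For sale", "Buying", "For rent", "Other"]
})

# Absolute number of listings per operation type, most frequent first
op_counts = toy["op_type"].value_counts()

# Share of each operation type in the whole frame
op_shares = toy["op_type"].value_counts(normalize=True)

print(op_counts)
print(op_shares.round(2))
```

On the real data the same two calls would be applied to `df['op_type']`.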

In [None]:
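# .copy() gives us an independent frame, so later assignments do not trigger SettingWithCopyWarning
df_filt = df[~df['op_type'].isin(['Change', 'Other', 'Buying', 'Renting'])].copy()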
len(df_filt)

Next, let's pay attention to the **district** column. Let's explore the unique districts first:

In [None]:
df_filt.district.unique()

Let's inspect the unique values of other columns as well:

In [None]:
for col in ['floor', 'total_floors']:
    print(col, ":", sorted(df_filt[col].unique()))

Floor values look fine.

In [None]:
for col in ['house_seria', 'house_type', 'condition']:
    print(col, ":", df_filt[col].unique())

Readers from outside Eastern Europe might be confused by the **house_seria** values, but believe us - they are fine. Although Riga has the highest concentration of [Art Nouveau architecture](https://en.wikipedia.org/wiki/Art_Nouveau_architecture_in_Riga) anywhere in the world, it also has many standardized apartment blocks constructed in the [Soviet period](https://en.wikipedia.org/wiki/Urban_planning_in_communist_countries), so **602**, **119**, **103.**, **467.**, **104.** are just odd-looking names of construction projects. We will treat them as ordinary categorical values.


Now let's check how our items look on the map:

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
viz=df_filt.plot(kind='scatter', x='lon', y='lat', alpha=0.4, figsize=(10,10))
viz.legend()

The latitude of Rīga, Latvia is 56.946285, and the longitude is 24.105078. While some of the values fall within a plausible range, there are broken values that make the plot look terribly zoomed out. Let's check how many samples have wrong coordinates. The previous plot suggests that all broken values deviate a lot from the real Riga coordinates, so a rough bounding-box comparison is enough to filter them out.

In [None]:
wrong = df_filt[(df_filt['lat'] < 55)|(df_filt['lat'] > 58)|(df_filt['lon'] < 24)|(df_filt['lon'] > 25)]
len(wrong)
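
As an alternative to the rough bounding box, one could flag outliers by their great-circle distance from the city centre. The sketch below uses the haversine formula on made-up points. The helper `km_from_riga`, the `toy` frame and the 30 km threshold `MAX_KM` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

RIGA_LAT, RIGA_LON = 56.946285, 24.105078  # city centre, as quoted in the text
EARTH_RADIUS_KM = 6371.0

def km_from_riga(lat, lon):
    """Great-circle (haversine) distance in km from the Riga city centre."""
    lat1, lon1 = np.radians(RIGA_LAT), np.radians(RIGA_LON)
    lat2, lon2 = np.radians(lat), np.radians(lon)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Made-up points: two near Riga, one with lat/lon swapped, one missing
toy = pd.DataFrame({
    "lat": [56.95, 56.99, 24.10, np.nan],
    "lon": [24.11, 24.31, 56.95, np.nan],
})
toy["km"] = km_from_riga(toy["lat"], toy["lon"])

MAX_KM = 30  # hypothetical threshold
# NaN distances compare as False, so rows with missing coordinates are not flagged
too_far = toy["km"] > MAX_KM
print(toy[too_far])
```

A distance threshold treats all directions equally, whereas the bounding box used in this kernel is simpler and good enough when the broken values are far off.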

There are not so many to worry about, so let's just drop them and see how the plot looks without the broken values:

In [None]:
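# Rows with missing coordinates are kept (comparisons with NaN are False); .copy() avoids SettingWithCopyWarning later
df_filt = df_filt[~((df_filt['lat'] < 55)|(df_filt['lat'] > 58)|(df_filt['lon'] < 24)|(df_filt['lon']>25))].copy()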
viz=df_filt.plot(kind='scatter', x='lon', y='lat', alpha=0.4, figsize=(10,10))
viz.legend()

Much better! All items are now concentrated within a single location matching the Riga coordinates. Let's overlay them on an actual map of Riga:

In [None]:
import folium
# Define helper function to plot over Riga map
def plot_on_riga_map(data_frame): 
    riga_map = folium.Map(
        location=[56.946285, 24.105078],
        tiles='cartodbpositron',
        zoom_start=12,
    )
    data_frame.apply(lambda row:folium.Marker(location=[row["lat"], row["lon"]]).add_to(riga_map), axis=1)
    return riga_map

In [None]:
plot_on_riga_map(df_filt[~df_filt['lon'].isna()].head(500))

# Handling missing values


Let's define a helper function that reports the count and share of missing values per column of a dataframe:

In [None]:
def missing(df):
    df_missing = pd.DataFrame(df.isna().sum().sort_values(ascending = False), columns = ['missing_count'])
    df_missing['missing_share'] = df_missing.missing_count / len(df)
    return df_missing

In [None]:
missing(df_filt)

## Missing geo coordinates
We see that most missing values come from the geo coordinate columns - **lon** and **lat**. Let's fix them using a geocoding utility.

In [None]:
# Let's take a look at some samples with missing coordinates
df_filt.loc[df_filt['lon'].isna()].head(10)

To find the missing geo coordinates we could use the **street** column, which in fact holds the address of the building. However, it seems to contain abbreviations that the geocoding utility might not understand. Let's check.

In [None]:
from geopandas.tools import geocode
def geocode_safely(address):
    try: 
        return geocode(address, provider="nominatim").geometry.iloc[0]
    except Exception:
        # Any failure (no match, network error) is reported as 'Not found'
        return 'Not found'

print("1.", geocode_safely('Viestura pr. 47'))
print("2.", geocode_safely('Viestura prospekts 47'))

The assumption was correct. The geocoder does not find the abbreviated street name, while the full name is processed correctly. We need to find a way to deal with this.

In [None]:
# Inspect all street names of samples without geo coordinates to find abbreviation patterns
df_filt.loc[df_filt['lon'].isna(), 'street'].tolist()

Fixing all the different kinds of street name abbreviations is a feature engineering task in its own right. Let us know if you wish to write a separate LSTM model to handle this 😃

In [None]:
# Constructing dictionary mappings from abbreviations to full values
abbrs = {
  "Asteres": "Aisteres iela",
  "M. Kuldīgas": "Mazā Kuldīgas iela",
  "M. Nometņu": "Mazā Nometņu iela",
  "Pulkv. Brieža": "Pulkveža Brieža iela",
  "J. Vācieša": "Jukuma Vācieša iela",
  "J. Daliņa": "Jāņa Daliņa iela",
  "pr.": "prospekts",
  "l.": "līnija",
  "š." : "šoseja",
  "d.": "dambis",
  "g.": "gatve",
  "lauk.": "laukums",
  "bulv.": "bulvāris",
  "krastm.": "krastmala",
  "šķ līnija": "šķērslīnija",
  "šķ. līnija": "šķērslīnija",
  "M.": "mazais",  
  "432k1": "432-k-1",
  "252k5": "252-k-5"
}
# Defining helper method to unabbreviate address
def unabbreviate(address):
    # 1. Replace abbreviations
    for abbr, full in abbrs.items():
        address = address.replace(abbr, full)
     
    street_types = list(abbrs.values())
    # 2. If the address contains neither "iela" ("street" in Latvian) nor any of the expanded values above
    # -> insert "iela" as the second word of the address
    if "iela" not in address and not any(s in address for s in street_types):
        words = address.split(" ")
        words.insert(1,"iela")
        address = " ".join(words)
    # 3. Finally, append "Rīga" at the end of address if not present
    if "Rīga" not in address:
        address += ", Rīga"
    return address
    
df_filt.loc[df_filt['lon'].isna(), 'street'] = df_filt.loc[df_filt['lon'].isna()].street.apply(unabbreviate)
df_filt.loc[df_filt['lon'].isna(), 'street'].tolist()
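
One caveat of plain `str.replace` is that it matches substrings, so a short key such as "l." could also fire inside a longer token. A possible refinement, sketched below, is to expand an abbreviation only when it forms a whole whitespace-delimited token. The `street_type_abbrs` subset, the `expand_tokens` helper and the "Skolas iel. 5" address are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical subset of the abbreviation mapping, for illustration only
street_type_abbrs = {
    "pr.": "prospekts",
    "l.": "līnija",
    "g.": "gatve",
    "bulv.": "bulvāris",
}

# Longer keys go first so the alternation prefers the most specific match
_keys = sorted(street_type_abbrs, key=len, reverse=True)
# (?<!\S) and (?!\S) require whitespace or string boundaries around the token
_pattern = re.compile(
    r"(?<!\S)(" + "|".join(re.escape(k) for k in _keys) + r")(?!\S)"
)

def expand_tokens(address):
    """Expand abbreviations that appear as whole tokens in the address."""
    return _pattern.sub(lambda m: street_type_abbrs[m.group(1)], address)

print(expand_tokens("Viestura pr. 47"))
print(expand_tokens("Skolas iel. 5"))
print("Skolas iel. 5".replace("l.", "līnija"))
```

The trade-off is that an abbreviation glued to the house number without a space would no longer be expanded, so this variant suits data where tokens are reliably separated.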

Looks good. Let's move to geocoding.

In [None]:
from geopandas.tools import geocode
from geopy.extra.rate_limiter import RateLimiter

# Keep geocode calls at least one second apart so the service does not reject them
geocode = RateLimiter(geocode, min_delay_seconds=1)

def get_lat_lon(address):
    try:
        point = geocode(address, provider='nominatim').geometry.iloc[0]
        return pd.Series({'lat': point.y, 'lon': point.x})
    except Exception:
        return pd.Series({'lat': None, 'lon': None})

# Running this will take roughly 3 minutes due to artificial delay between geocode calls
df_filt.loc[df_filt['lon'].isna(), ['lat','lon']] = df_filt.loc[df_filt['lon'].isna()].street.apply(get_lat_lon)
len(df_filt.loc[df_filt['lon'].isna()])

All right. We have fixed most of the geo coordinates - just 1 address could not be geocoded. Let's review it manually:

In [None]:
df_filt.loc[df_filt['lon'].isna()]

Let's check this address on [Google Maps](https://www.google.com/maps/place/Lauvu+iela+22,+Ber%C4%A3i,+Garkalnes+novads,+LV-1024/@56.9855,24.3108859,14z/data=!4m5!3m4!1s0x46eecc767c49e4f1:0x2ac3e039274560b6!8m2!3d56.995796!4d24.3074993). It turns out that it is located in Berģi, outside Riga's borders, so our "Rīga" suffix in fact made geocoding fail for this particular item. Since the property is located outside Riga, we will drop it.

In [None]:
df_filt = df_filt[df_filt.street != 'Lauvu iela 22, Rīga']

Let's verify all geo coordinates are corrected and review remaining missing values:

In [None]:
missing(df_filt)

## Missing districts
Let's take a look at the entries with missing district value:

In [None]:
df_filt.loc[df_filt['district'].isna()]

We can recover missing district names by looking at other rows on the same street:

In [None]:
df_filt.loc[df_filt.street.str.startswith('Ogļu')]

Great! There are multiple properties listed at the same address - Ogļu 32. Let's impute the missing value:

In [None]:
df_filt.loc[df_filt.street == 'Ogļu 32', 'district'] = 'Ķīpsala'

Let's try doing the same for **Pupuku iela 9**:

In [None]:
df_filt.loc[df_filt.street.str.startswith('Pupuku')]

No luck this time - this is the only property on **Pupuku** street in our dataset. We could use an alternative approach and search for the nearest points within some range using the **lat** and **lon** column values, but that would be overkill for a single row. Let's impute the district manually by finding **Pupuku iela 9** on [Google Maps](https://www.google.com/maps/place/Pupu%C4%B7u+iela+9,+Zemgales+priek%C5%A1pils%C4%93ta,+R%C4%ABga,+LV-1076/@56.9051591,24.1411307,17z/data=!3m1!4b1!4m5!3m4!1s0x46eed191e0607163:0xb7e8552585e17c39!8m2!3d56.9051591!4d24.1433194):

In [None]:
df_filt.loc[df_filt.street == 'Pupuku iela 9', 'district'] = 'Valdlauči'
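
For reference, the nearest-point alternative mentioned above could look roughly like this. It is a minimal sketch on made-up data: the `nearest_district` helper, the `toy` frame and the district labels "A" and "B" are assumptions, and the distance is a flat-earth approximation that is only reasonable at city scale.

```python
import numpy as np
import pandas as pd

def nearest_district(lat, lon, known):
    """Return the district of the row in `known` that is closest to (lat, lon)."""
    # Scale longitude differences by cos(latitude) so both axes are comparable
    dx = (known["lon"] - lon) * np.cos(np.radians(lat))
    dy = known["lat"] - lat
    return known.loc[(dx ** 2 + dy ** 2).idxmin(), "district"]

# Made-up listings; the last one has no district
toy = pd.DataFrame({
    "lat": [56.950, 56.951, 56.905, 56.906],
    "lon": [24.080, 24.081, 24.141, 24.143],
    "district": ["A", "A", "B", None],
})

# Rows with a known district and coordinates act as the reference set
known = toy.dropna(subset=["district", "lat", "lon"])
for idx, row in toy[toy["district"].isna()].iterrows():
    toy.loc[idx, "district"] = nearest_district(row["lat"], row["lon"], known)

print(toy)
```

A single nearest neighbour can be wrong close to district borders, so a majority vote over several neighbours may be more robust when many rows need imputing.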

Once again, let's review what else is missing:

In [None]:
missing(df_filt)

## Invalid or missing Rooms
There is just **one** row without a **rooms** value, so this might be easy! But not so fast: before fixing it, let's check the unique room values:

In [None]:
df_filt.rooms.unique()

It turns out this column is categorical due to the presence of the value "Citi". This is a problem, as room count is numerical by nature and might be an important input for correct price prediction in our model. So what does "Citi" really mean for **rooms**? "Citi" translates from Latvian as "Other". In our context this word might describe special architectural solutions where the room count can't be clearly defined.

For the sake of data integrity, let's treat "Citi" the same way as a missing value:

In [None]:
df_filt.loc[df_filt['rooms'] == 'Citi', 'rooms'] = None
df_filt.loc[df_filt['rooms'].isna()]
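
An alternative to replacing "Citi" by hand is to let pandas coerce every non-numeric label to a missing value in one step. The sketch below shows this on a made-up series; `toy_rooms` is an assumption and not data from the kernel.

```python
import pandas as pd

# Made-up room labels, including the non-numeric "Citi" and a missing entry
toy_rooms = pd.Series(["1", "2", "Citi", None, "3"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
rooms_numeric = pd.to_numeric(toy_rooms, errors="coerce")

print(rooms_numeric)
print("missing:", rooms_numeric.isna().sum())
```

Coercion would also silently hide any other unexpected label, so checking the unique values first, as done above, remains a useful step.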

So we have 15 rows to fix instead of 1. To do this correctly, we can take advantage of other samples with a similar area. Let's build a few helper functions to approximate the room count.

In [None]:
# Filter out only valid rows with rooms
df_with_rooms = df_filt.loc[~df_filt['rooms'].isna()]
# Calculate average dataset room area
average_room_area = (df_with_rooms['area']/df_with_rooms['rooms'].astype('int64')).mean()
average_room_area

In [None]:
import numpy as np
# Very rough room count estimation using average dataset room area
def estimate_room_count_rough(area):
    return np.ceil(area / average_room_area)

In [None]:
# Finer estimation: pick the room count that occurs most often among dataset samples of similar area
# If no samples of a similar area are found, fall back to the rough estimation
def estimate_room_count(area, delta = 10):
    # Defining lower and upper bounds to find similar area
    area_lo = area - delta
    area_up = area + delta
    df_similar_by_area = df_with_rooms[(df_with_rooms['area'] > area_lo) & (df_with_rooms['area'] < area_up)]
    if df_similar_by_area.empty:
        return estimate_room_count_rough(area)
    # Most frequent room count among the similar samples
    return df_similar_by_area['rooms'].value_counts().idxmax()
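
To see how the similar-area lookup behaves, here is a self-contained sketch on made-up data. The `most_common_rooms` helper and the `reference` frame are assumptions for illustration. Unlike the helper above, this version takes the reference frame as an argument and returns `None` instead of falling back to a rough estimate.

```python
import pandas as pd

def most_common_rooms(area, reference, delta=10):
    """Most frequent room count among reference rows within +/- delta of `area`."""
    similar = reference[(reference["area"] > area - delta)
                        & (reference["area"] < area + delta)]
    if similar.empty:
        return None  # no comparable samples in the reference set
    return similar["rooms"].value_counts().idxmax()

# Made-up reference listings with known room counts
reference = pd.DataFrame({
    "area": [30, 35, 38, 60, 65],
    "rooms": [1, 1, 2, 3, 3],
})

print(most_common_rooms(36, reference))   # neighbours have areas 30, 35 and 38
print(most_common_rooms(200, reference))  # nothing within 10 units of 200
```

Passing the reference frame explicitly makes the helper easier to test in isolation than one that reads a global dataframe.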

In [None]:
# Imputing helper that sets the most probable rooms value
def impute_most_probable_room_value(index):
    df_filt.loc[index, 'rooms'] = estimate_room_count(df_filt.loc[index].area)

We are ready!

In [None]:
# Fix missing rooms by imputing most probable room values
df_filt.loc[df_filt['rooms'].isna()].apply(lambda row: impute_most_probable_room_value(row.name), axis=1)

In [None]:
df_filt.loc[df_filt['rooms'].isna()]

In [None]:
# Change column type
df_filt['rooms'] = df_filt['rooms'].astype('int64')

# Verify
df_filt.rooms.unique()

Great! The **rooms** column is now numeric and contains no missing values.

Final check:

In [None]:
missing(df_filt)

In [None]:
df_filt.dtypes

We are done! Now it's time to save the corrected dataset.

In [None]:
df_filt.to_csv('riga.csv',index=False)