# Preprocessing

This notebook performs the initial preprocessing of the provided data set. First, the required packages are imported and the raw dataframe is loaded:

In [1]:
# Package import
import pandas as pd
import numpy as np
import os
from geopy.exc import GeocoderTimedOut, GeocoderUnavailable
from geopy.geocoders import Nominatim 
import pycountry_convert as pc

# Read data frame
df = pd.read_excel(os.path.abspath('../data/Raw/Cities.xls'), index_col=0)

## Replacing spaces with underscores

The column names contain many spaces, which are awkward to work with in code. All spaces in the column names are therefore replaced with underscores:

In [2]:
#%% Add _ instead of space
df.columns = df.columns.str.replace(' ', '_')
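
As a small illustration of the renaming step, the sketch below applies the same replacement to a toy dataframe. The column names are made up for the example, and `regex=False` is passed only to make explicit that the space is treated as a literal character.

```python
import pandas as pd

# Toy frame with hypothetical column names that contain spaces
demo = pd.DataFrame({"Bikeshare Stations": [10], "Car Modeshare (%)": [42.0]})

# Replace every literal space in the column names with an underscore
demo.columns = demo.columns.str.replace(" ", "_", regex=False)

print(list(demo.columns))
```

After the rename, columns without special characters can also be reached with attribute access, e.g. `demo.Bikeshare_Stations`.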

## Fixing object columns

Two columns in the data are object columns because a few cells that should be empty (i.e. nan) contain a space. `pandas` then interprets the whole column as object even though it should be numeric. This is fixed below, where the argument `errors='coerce'` ensures that the spaces are turned into nan, as they should be:

In [3]:
df['Bikeshare_Stations'] = pd.to_numeric(df['Bikeshare_Stations'], errors='coerce')
df['Bicycle_Modeshare_(%)'] = pd.to_numeric(df['Bicycle_Modeshare_(%)'], errors='coerce')
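
To see the effect of `errors='coerce'` in isolation, here is a minimal sketch on a made-up series in which one cell holds a single space, mimicking the situation described above.

```python
import pandas as pd

# Hypothetical column: two numbers stored as text and one "empty" cell holding a space
raw = pd.Series(["12", " ", "7.5"])

# Values that cannot be parsed as numbers become nan instead of raising an error
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
```

The converted series has a floating-point dtype, and the cell that held the space is now nan.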

## Getting longitude and latitude

As the data is based on cities, it could be interesting to get an overview of their locations. Therefore, the longitude and latitude of every city are added using `geopy`. The cell below takes some time to run due to the number of API calls made.

In [4]:
# Initialize empty lists
longitude = [] 
latitude = [] 
   
# Create the geolocator once and reuse it for every lookup
geolocator = Nominatim(user_agent="VS")

# Function to find the coordinates of a given city using geopy
def findGeocode(city):

    # try/except is used to retry when the geocoder
    # times out or is temporarily unavailable
    try:
        return geolocator.geocode(city)

    except (GeocoderTimedOut, GeocoderUnavailable):
        return findGeocode(city)
  
# Apply the function to all cities in the data.
for _, row in df[["City", "Country"]].iterrows():

    # Look each city up once and reuse the result
    loc = findGeocode(row["City"] + ', ' + row["Country"])

    if loc is not None:
        # Coordinates returned from the function are appended to the lists.
        latitude.append(loc.latitude)
        longitude.append(loc.longitude)

    # Insert nan if the city is not found
    else:
        latitude.append(np.nan)
        longitude.append(np.nan)
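
The recursive retry above keeps trying for as long as the service fails. As an alternative, the sketch below caps the number of attempts. It is not part of the original notebook: `geocode_with_retry` and its parameters are hypothetical names, and the geocoding function is passed in as an argument so that the helper can be tried without network access.

```python
import time

def geocode_with_retry(geocode, query, retry_on=(Exception,), max_retries=3, delay=1.0):
    """Call geocode(query), retrying on the given exceptions.

    Returns None if all max_retries attempts fail.
    """
    for attempt in range(max_retries):
        try:
            return geocode(query)
        except retry_on:
            # Wait before the next attempt, but not after the final one
            if attempt < max_retries - 1:
                time.sleep(delay)
    return None

# Stand-in for a geocoder that fails twice before answering
calls = {"n": 0}

def flaky_geocode(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return (55.68, 12.57)

result = geocode_with_retry(flaky_geocode, "Copenhagen, Denmark",
                            retry_on=(TimeoutError,), delay=0)
print(result, calls["n"])
```

With `geopy`, one would presumably pass the `geocode` method of a `Nominatim` instance and `retry_on=(GeocoderTimedOut, GeocoderUnavailable)`. A city that still fails after the last attempt then ends up as nan, just like a city that is not found.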

It is now checked whether `geopy` missed any cities:

In [5]:
print(f"The index of the missing cities are {np.argwhere(np.isnan(latitude)).reshape(-1)}, and they are:")
df.iloc[np.argwhere(np.isnan(latitude)).reshape(-1)].City

The index of the missing cities are [ 68 102 103 132 155], and they are:


282           Denver-Aurora(CO)
325               Valencia(VZL)
143            Osaka-Kobe-Kyoto
281    Minneapolis-St. Paul(MN)
283    Tampa-St. Petersburg(FL)
Name: City, dtype: object
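
The positions of the nan entries can also be found with `np.flatnonzero`, which returns the same flat index array as the `argwhere(...).reshape(-1)` combination used above. A minimal sketch on a made-up list:

```python
import numpy as np

# Hypothetical latitude list with two failed lookups
lat = [55.7, np.nan, 48.9, np.nan]

# Positions of the nan entries as a flat array
missing = np.flatnonzero(np.isnan(lat))

print(missing)
```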

Five cities were not found. Two of them, Valencia and Tampa-St. Petersburg, are filled in manually below (coordinates looked up on Google) before the new columns are appended to the dataframe. The remaining three keep nan as their coordinates.

In [6]:
# Fill in two of the missing cities manually
latitude[102], longitude[102] = 10.156421, -67.999718 #Valencia(VZL)
latitude[155], longitude[155] = 27.773056, -82.639999 #Tampa-St. Petersburg(FL)

# Add columns to data frame
df["Latitude"] = np.array(latitude)
df["Longitude"] = np.array(longitude)

## Adding continent

The continent of each city is also added. This is primarily so that the data can be split based on continent, as is required for one part of the prediction challenge.

In [7]:
def country_to_continent(country_name):
    # Get the two-letter abbreviation for the country
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)

    # Get continent code
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)

    # Get continent name from continent code
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name

# Get continent for all countries
df['Continent'] = [country_to_continent(con) for con in df.Country]
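
Once the `Continent` column exists, the split by continent mentioned above could look like the sketch below. The toy dataframe and the name `by_continent` are made up for illustration; the actual split used in the prediction challenge may differ.

```python
import pandas as pd

# Toy frame standing in for the processed data
demo = pd.DataFrame({
    "City": ["Copenhagen", "Osaka", "Aarhus"],
    "Continent": ["Europe", "Asia", "Europe"],
})

# One sub-dataframe per continent, keyed by the continent name
by_continent = {name: group for name, group in demo.groupby("Continent")}

print(by_continent["Europe"])
```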

## Returning the processed dataframe

The preprocessed dataframe is now written as a csv file to the processed folder:

In [8]:
# Write file
df.to_csv(os.path.abspath('../data/Processed/Cities.csv'))
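
Because `to_csv` writes the index as the first column, the file should be read back with `index_col=0` to restore the original index. The sketch below shows the round trip on a made-up dataframe, using an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Toy frame with a non-default index, standing in for the processed data
demo = pd.DataFrame({"City": ["Copenhagen"], "Latitude": [55.68]}, index=[17])

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first csv column back into the index
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```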