# Cleaning external dataset

Since we want to make a heatmap based on coordinates, we obtained a CSV that contains coordinates for most, if not all, of our ports. We can normalize and transform the data to get a single CSV with the coordinate information for visualization.

### Importing libraries

In [1]:
import pandas as pd

### Reading CSVs
Note that the files aren't in the project directory; the filenames below are the ones I gave them.

In [2]:
cities = pd.read_csv("worldcities.csv")
ports = pd.read_csv("port_names.csv")

In [None]:
# Split port_of_lading ("City, Country") into separate city and country columns so they match the external dataset
ports[['city_name', 'country_name']] = ports['port_of_lading'].str.split(', ', n=1, expand=True)

# Making everything lowercase, further normalizing
ports['city_name'] = ports['city_name'].str.lower()
ports['country_name'] = ports['country_name'].str.lower()
cities['city'] = cities['city'].str.lower()
cities['country'] = cities['country'].str.lower()
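
The split-and-lowercase step can be tried on a tiny stand-in table before running it on the real files. This is only a sketch: the sample values are made up, and the extra `.str.strip()` is an assumption on my part, in case some ports have stray spaces around the comma.

```python
import pandas as pd

# Toy stand-in for port_names.csv; the values are made up for illustration.
sample_ports = pd.DataFrame(
    {"port_of_lading": ["Shanghai, China", " Busan , South Korea", "Rotterdam"]}
)

# Split once on the first comma; a port with no comma gets a missing country.
parts = sample_ports["port_of_lading"].str.split(",", n=1, expand=True)

# Trim stray whitespace and lowercase both pieces so they compare cleanly.
sample_ports["city_name"] = parts[0].str.strip().str.lower()
sample_ports["country_name"] = parts[1].str.strip().str.lower()

print(sample_ports)
```

Ports without a country part end up with a missing `country_name`, so they will not match anything in the join.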


In [None]:
# We do a left join on the city and country names, so every port is kept even if it has no match
merged = pd.merge(
    ports, 
    cities, 
    left_on=['city_name', 'country_name'], 
    right_on=['city', 'country'],
    how='left'
)
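
To see which ports found no coordinates, one option is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below uses made-up toy frames (the coordinates are rough placeholder values, and "atlantis" is a deliberately unmatched example), not the real files.

```python
import pandas as pd

# Toy stand-ins for the normalized ports and cities tables.
sample_ports = pd.DataFrame(
    {
        "city_name": ["shanghai", "busan", "atlantis"],
        "country_name": ["china", "south korea", "nowhere"],
    }
)
sample_cities = pd.DataFrame(
    {
        "city": ["shanghai", "busan"],
        "country": ["china", "south korea"],
        "lat": [31.2, 35.2],    # rough placeholder values
        "lng": [121.5, 129.1],  # rough placeholder values
    }
)

# Same left join as above, plus a _merge column describing each row's match status.
checked = pd.merge(
    sample_ports,
    sample_cities,
    left_on=["city_name", "country_name"],
    right_on=["city", "country"],
    how="left",
    indicator=True,
)

# Rows marked left_only are ports with no matching city in the external data.
unmatched = checked.loc[checked["_merge"] == "left_only", ["city_name", "country_name"]]
print(unmatched)
```

Looking at the unmatched names is a quick way to spot spelling differences between the two datasets.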


In [None]:
# Drop the columns from the external dataset that we won't need. We just want the coordinates of the cities for our heat map
df = merged.drop(
    columns=['city_name', 'country_name', 'city', 'city_ascii', 'country', 'iso2', 'iso3', 'admin_name', 'capital', 'population', 'id']
)
# Drop rows with any missing value, which includes ports that had no match in the external dataset
df = df.dropna()
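
One thing to keep in mind is that a bare `dropna()` removes a row if any column is missing, not just the coordinates. If the ports table ever carries other columns with gaps, restricting the check with `subset` may be safer. The sketch below uses a made-up frame; the `lat` and `lng` column names and the `notes` column are assumptions for illustration.

```python
import pandas as pd

# Toy merged table; "notes" is a hypothetical extra column with gaps.
sample = pd.DataFrame(
    {
        "port_of_lading": ["Shanghai, China", "Atlantis, Nowhere", "Busan, South Korea"],
        "notes": [None, None, "busy"],
        "lat": [31.2, None, 35.2],    # assumed column name, placeholder values
        "lng": [121.5, None, 129.1],  # assumed column name, placeholder values
    }
)

# Drops every row with a missing value in any column.
all_cols = sample.dropna()

# Drops only the rows that are missing a coordinate.
coords_only = sample.dropna(subset=["lat", "lng"])

print(len(all_cols), len(coords_only))
```

Here the bare `dropna()` also discards Shanghai because of the empty `notes` value, while the `subset` version keeps every row that has coordinates.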

In [None]:
# Save the coordinates to a CSV so we can reuse them without going through this process again.
df.to_csv('city_coordinates.csv',index=False)