In [1]:
import pandas as pd

<font size="6"> Cleaning categories from cleaned poi data </font>

In [6]:
# Reading the cleaned poi data csv into a df
df_fixclean = pd.read_csv('cleaning_data/cleaned_all_poi_data.csv')
df_fixclean.head()

Unnamed: 0.1,Unnamed: 0,category,name,id,lat,lon
0,0,['atm'],Bank für Sozialwirtschaft,78252154,52.523744,13.398627
1,1,['atm'],Sparda-Bank,87036263,52.532985,13.384282
2,2,['atm'],Bankhaus August Lenz,89275133,52.518025,13.406956
3,3,['atm'],,213106623,52.54217,13.441137
4,4,['atm'],Berliner Sparkasse,213113204,52.54275,13.392862


<font size="5"> .str methods on the 'category' column </font>

In [7]:
# Slicing off the leading "['" and the trailing "']"
df_fixclean['category'] = df_fixclean['category'].str.slice(start=2, stop=-2)

In [8]:
# Counting how many poi have two categories
double_categories = ["bar', 'convenience", "bar', 'historic", "bench', 'viewpoint", "cafe', 'convenience", "historic', 'attraction", "historic', 'tree", "tree', 'attraction"]
df_fixclean['category'].isin(double_categories).value_counts()

False    212954
True         57
Name: category, dtype: int64

The isin() check shows that only 57 of 213,011 poi have more than one category. That number is small enough to justify simply assigning each of them the first category listed.
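The hard-coded list of double categories could also be replaced by a pattern check. This is a minimal sketch, assuming that multiple categories in the sliced strings are always separated by `', '` as in the list above; the toy series is made up for illustration and is not taken from the data.

```python
import pandas as pd

# Toy data mimicking the 'category' column after slicing off "['" and "']"
categories = pd.Series(["atm", "bar', 'convenience", "tree", "historic', 'tree"])

# A row holds more than one category if the separator "', '" is still present
has_multiple = categories.str.contains("', '", regex=False)

print(has_multiple.value_counts())
print(categories[has_multiple].unique())
```

On the real column, this approach would also catch combinations that are missing from a manually typed list.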

In [9]:
# Splitting the double categories on comma, keeping only the first part
filt = df_fixclean['category'].isin(double_categories) # boolean filter for finding double categories
df_fixclean['category'] = df_fixclean['category'].str.split("',").str[0] # splitting on "'," and keeping first part
df_fixclean['category'].loc[filt] # inspecting if it worked (it did)

1041           bar
1098           bar
2305         bench
8365         bench
21289         cafe
21498         cafe
21499         cafe
22267         cafe
22987         cafe
23319         cafe
28608          bar
28630     historic
28658     historic
29129         tree
29136         tree
29482         tree
30394         tree
30638         tree
30663     historic
34220         tree
50052         tree
50113         tree
97723         tree
183443    historic
211502         bar
211522        cafe
211588        cafe
211590        cafe
211728        cafe
211853        cafe
211865        cafe
211927        cafe
212129        cafe
212216        cafe
212287        cafe
212293    historic
212296        tree
212297    historic
212298        tree
212309        tree
212311    historic
212316    historic
212317        tree
212319        tree
212322    historic
212326    historic
212327        tree
212344    historic
212353        tree
212355        tree
212356    historic
212360        tree
212367    hi

Where a point of interest (poi) had more than one category (57 cases), it now keeps only the first one. Every category is now a plain, readable string.
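The slice-and-split steps can be checked on a small example. This is a minimal sketch on made-up strings in the same format as the raw `category` column. `first_category` is a hypothetical helper name, and the second approach via `ast.literal_eval` is an alternative for comparison, not the method used in this notebook.

```python
import ast
import pandas as pd

# Made-up values in the same format as the raw 'category' column
raw = pd.Series(["['atm']", "['bar', 'convenience']", "['historic', 'tree']"])

# Approach used in this notebook: slice off "['" and "']", then split on "',"
sliced = raw.str.slice(start=2, stop=-2)
first_by_split = sliced.str.split("',").str[0]

# Alternative (assumption): parse each string as a Python list, take its first item
def first_category(text):
    return ast.literal_eval(text)[0]

first_by_parse = raw.apply(first_category)

print(first_by_split.tolist())
print(first_by_parse.tolist())
```

Both approaches keep the first category; parsing is more robust to unusual spacing, while the string methods are vectorized and need no extra import.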

In [16]:
# Renaming one category
df_fixclean['category'] = df_fixclean['category'].replace({'attraction': 'tourist_attraction'})
df_fixclean['category'].value_counts() # checking the resulting category counts

tree                  182303
bench                  19032
restaurant              4687
cafe                    2491
atm                     1011
convenience              869
bar                      824
picnic_table             434
ice_cream                289
viewpoint                256
gallery                  189
museum                   156
drinking_water           148
nightclub                141
tourist_attraction        94
historic                  87
Name: category, dtype: int64

In [18]:
# Saving the cleaned df in a new csv file
df_fixclean.to_csv('cleaning_data/cleaned_all_poi_data_fixed.csv', index=False)