## World cities dataset

[original source](https://simplemaps.com/data/world-cities)

A dataset of world cities from SimpleMaps, last updated in July 2021. This is the version distributed on Kaggle.

The file loaded in this notebook contains 41,001 cities, with one entry for each city.

The format is CSV.

Columns:

* **city** - city name as a Unicode string
* **city_ascii** - city name as an ASCII string
* **lat** - latitude
* **lng** - longitude
* **country** - the country in which the city is located
* **iso2** - the alpha-2 ISO code of the country
* **iso3** - the alpha-3 ISO code of the country
* **admin_name** - name of the administrative region the city belongs to
* **capital** - three values: `primary` (country's capital), `admin` (first-level admin capital), `minor` (lower-level admin capital); missing for all other cities
* **population** - population of the city
* **id** - a 10-digit unique id generated by SimpleMaps

In [17]:
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("https://raw.githubusercontent.com/SalvatoreRa/tutorial/main/datasets/worldcities.csv")
df.head()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.6897,139.6922,Japan,JP,JPN,Tōkyō,primary,37977000.0,1392685764
1,Jakarta,Jakarta,-6.2146,106.8451,Indonesia,ID,IDN,Jakarta,primary,34540000.0,1360771077
2,Delhi,Delhi,28.66,77.23,India,IN,IND,Delhi,admin,29617000.0,1356872604
3,Mumbai,Mumbai,18.9667,72.8333,India,IN,IND,Mahārāshtra,admin,23355000.0,1356226629
4,Manila,Manila,14.6,120.9833,Philippines,PH,PHL,Manila,primary,23088000.0,1608618140


In [18]:
df.columns

Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3',
       'admin_name', 'capital', 'population', 'id'],
      dtype='object')

In [19]:
df.describe()

Unnamed: 0,lat,lng,population,id
count,41001.0,41001.0,40263.0,41001.0
mean,30.90985,-4.228119,111761.4,1487309000.0
std,23.504898,68.759032,724891.7,284720500.0
min,-54.9341,-179.59,0.0,1004003000.0
25%,19.1903,-71.85,8194.0,1250291000.0
50%,39.8854,3.3333,15831.0,1484693000.0
75%,47.3717,25.9833,39823.5,1807301000.0
max,81.7166,179.3667,37977000.0,1934000000.0


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41001 entries, 0 to 41000
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city        41001 non-null  object 
 1   city_ascii  41001 non-null  object 
 2   lat         41001 non-null  float64
 3   lng         41001 non-null  float64
 4   country     41001 non-null  object 
 5   iso2        40970 non-null  object 
 6   iso3        41001 non-null  object 
 7   admin_name  40902 non-null  object 
 8   capital     9483 non-null   object 
 9   population  40263 non-null  float64
 10  id          41001 non-null  int64  
dtypes: float64(3), int64(1), object(7)
memory usage: 3.4+ MB


In [21]:
#check and count the null values
df.isnull().sum()

city              0
city_ascii        0
lat               0
lng               0
country           0
iso2             31
iso3              0
admin_name       99
capital       31518
population      738
id                0
dtype: int64
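
How to treat these gaps depends on the analysis. A minimal sketch, assuming that a missing `capital` simply means the city is not a capital and that rows without a population can be dropped. The toy frame `toy` and the label `"none"` are illustrative and not part of the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "city": ["A", "B", "C"],
    "capital": ["primary", np.nan, "minor"],
    "population": [1000.0, np.nan, 250.0],
})

# Assumption: a missing capital means "not a capital"; "none" is a hypothetical label
filled = toy.assign(capital=toy["capital"].fillna("none"))

# Drop the rows whose population is unknown
with_population = filled.dropna(subset=["population"])

print(filled["capital"].tolist())
print(len(with_population))
```

The same two calls should work on `df` itself, although whether dropping or imputing is appropriate is a judgment call for each analysis.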

In [22]:
#checking the number of capitals
df.capital.value_counts()

minor      5579
admin      3659
primary     245
Name: capital, dtype: int64
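
The 245 `primary` rows are the national capitals. A small sketch of how they could be selected and ranked by population, using a made-up frame `toy` in place of `df`:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "capital": ["primary", "admin", np.nan, "primary"],
    "population": [500.0, 900.0, 100.0, 700.0],
})

# Keep only national capitals; rows with a missing capital compare as False
capitals = toy[toy["capital"] == "primary"]

# Largest national capitals first
top_capitals = capitals.nlargest(2, "population")
print(top_capitals["city"].tolist())
```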

In [23]:
#total population per country (sum over the cities listed in the dataset)
df.groupby("country")['population'].sum()

country
Afghanistan        7480569.0
Albania            1713382.0
Algeria           13447875.0
American Samoa       12576.0
Andorra              22151.0
                     ...    
West Bank                0.0
Western Sahara           0.0
Yemen              6481951.0
Zambia             4883949.0
Zimbabwe           4204935.0
Name: population, Length: 237, dtype: float64
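
A total of `0.0` may reflect missing values rather than a true zero, because `sum()` skips `NaN` and returns 0 for a group with no valid values. A hedged sketch of one way to tell the two cases apart and to rank countries, using a made-up frame `toy` instead of `df`:

```python
import numpy as np
import pandas as pd

# Toy frame: country "Y" has no known population at all (values are made up)
toy = pd.DataFrame({
    "country": ["X", "X", "Y", "Z"],
    "population": [100.0, 50.0, np.nan, 30.0],
})

# min_count=1 returns NaN instead of 0 when a group has no valid values
totals = toy.groupby("country")["population"].sum(min_count=1)

# Largest totals first; NaN totals are placed last by default
ranked = totals.sort_values(ascending=False)
print(ranked)
```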

In [24]:
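#sorting by country in alphabetical order
#(this result is not displayed: a notebook cell only shows its last expression)
df.sort_values(by=["country"], ascending=True)[:10]
#sorting by population, largest cities first
df.sort_values(by=["population"], ascending=False)[:10]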

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.6897,139.6922,Japan,JP,JPN,Tōkyō,primary,37977000.0,1392685764
1,Jakarta,Jakarta,-6.2146,106.8451,Indonesia,ID,IDN,Jakarta,primary,34540000.0,1360771077
2,Delhi,Delhi,28.66,77.23,India,IN,IND,Delhi,admin,29617000.0,1356872604
3,Mumbai,Mumbai,18.9667,72.8333,India,IN,IND,Mahārāshtra,admin,23355000.0,1356226629
4,Manila,Manila,14.6,120.9833,Philippines,PH,PHL,Manila,primary,23088000.0,1608618140
5,Shanghai,Shanghai,31.1667,121.4667,China,CN,CHN,Shanghai,admin,22120000.0,1156073548
6,São Paulo,Sao Paulo,-23.5504,-46.6339,Brazil,BR,BRA,São Paulo,admin,22046000.0,1076532519
7,Seoul,Seoul,37.56,126.99,"Korea, South",KR,KOR,Seoul,primary,21794000.0,1410836482
8,Mexico City,Mexico City,19.4333,-99.1333,Mexico,MX,MEX,Ciudad de México,primary,20996000.0,1484247881
9,Guangzhou,Guangzhou,23.1288,113.259,China,CN,CHN,Guangdong,admin,20902000.0,1156237133


In [25]:
#the default world map built into folium

folium.Map()

In [26]:
#plotting the biggest cities of Italy (population of at least 100,000)
#work on a copy so that df keeps all countries and pandas raises no SettingWithCopyWarning
italy_df = df[(df["country"] == "Italy") & (df["population"] >= 100000)].copy()
#radius of each circle in meters, chosen by population bracket
italy_df["size_c"] = np.where(italy_df["population"] > 1000000, 20000,
                              np.where(italy_df["population"] > 500000, 15000,
                                       np.where(italy_df["population"] > 200000, 10000, 5000)))
#plot it
italy = folium.Map(location=[41.5335, 12.2858], zoom_start=7)
for lat, lng, sz in zip(italy_df.lat, italy_df.lng, italy_df.size_c):
    italy.add_child(folium.Circle(location=[lat, lng],
                                  color='blue',
                                  radius=sz,
                                  fill=True,
                                  fill_opacity=0.5))
italy
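
The nested `np.where` calls above map population brackets to radii. As an alternative sketch (not the method used in this notebook), `pd.cut` can express the same brackets in one call. The series `population` below is made-up test data:

```python
import numpy as np
import pandas as pd

# Made-up populations covering every bracket, including a boundary value
population = pd.Series([150000, 200000, 300000, 800000, 2500000])

# Nested np.where with the same thresholds as the cell above
nested = np.where(population > 1000000, 20000,
                  np.where(population > 500000, 15000,
                           np.where(population > 200000, 10000, 5000)))

# Same brackets with pd.cut; intervals are right-closed by default,
# so exactly 200000 still falls in the smallest bracket
binned = pd.cut(population,
                bins=[0, 200000, 500000, 1000000, np.inf],
                labels=[5000, 10000, 15000, 20000]).astype(int)

print(binned.tolist())
```

Note that with these bins a population of 0 or less would fall outside every interval, which is not an issue after filtering for cities of at least 100,000 inhabitants.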