### Library Installation

In [None]:
#!pip install pandas


In [None]:
import pandas as pd


### File Importing

In [None]:
df = pd.read_csv("city_data.csv", delimiter='|')
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update
1,"Vienna, Austria",310,2983513,2018818,20.1,10.2,55770,3,"German, English, Turkish, Serbian",2500,1050,2061,340,2024-06-15 00:00:00
2,"Salzburg, Austria",243,375489,250472,20.3,3,66689,0,German,3200,1100,2186,,2023-11-03 00:00:00
3,"Brussels, Belgium",681,3284548,2137425,27.5,10.7,62500,3,"French, Dutch, Arabic, English",3350,1200,1900,,2023-04-22 00:00:00
4,"Antwerp, Belgium",928,1139663,723396,27.7,6.2,57595,3,"Dutch, French, Arabic",2609,900,1953,,2024-08-09 00:00:00


### Header Correction

In [None]:
new_header = df.iloc[0]

df.columns = new_header
df = df[1:]

df.head()

Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update
1,"Vienna, Austria",310,2983513,2018818,20.1,10.2,55770,3,"German, English, Turkish, Serbian",2500,1050,2061,340.0,2024-06-15 00:00:00
2,"Salzburg, Austria",243,375489,250472,20.3,3.0,66689,0,German,3200,1100,2186,,2023-11-03 00:00:00
3,"Brussels, Belgium",681,3284548,2137425,27.5,10.7,62500,3,"French, Dutch, Arabic, English",3350,1200,1900,,2023-04-22 00:00:00
4,"Antwerp, Belgium",928,1139663,723396,27.7,6.2,57595,3,"Dutch, French, Arabic",2609,900,1953,,2024-08-09 00:00:00
5,"Gent, Belgium",552,645813,417832,24.8,,53311,2,"Dutch, French",2400,827,1200,120.0,2023-07-17 00:00:00
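The header promotion above can also be sketched on a small stand-in frame. This is only an illustration under assumptions: the `raw` frame and its two columns are made up here, and the optional index reset is not something the notebook does (later cells rely on the original index labels).

```python
import pandas as pd

# Toy frame standing in for the raw import: the real header sits in row 0.
raw = pd.DataFrame(
    [
        ["City", "Population"],
        ["Vienna, Austria", "2983513"],
        ["Salzburg, Austria", "375489"],
    ]
)

fixed = raw.iloc[1:].copy()            # drop the header row from the data
fixed.columns = raw.iloc[0].tolist()   # use its values as plain column labels
fixed = fixed.reset_index(drop=True)   # optional: restart the index at 0

# In this toy frame the numbers are strings, so convert them explicitly.
fixed["Population"] = pd.to_numeric(fixed["Population"])
```

Passing the row through `.tolist()` keeps the old row label from becoming the name of the column index.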


## Data Wrangling and Analysis

1. How did you handle missing values and duplicate records in the dataset? Justify your approach.


2. 
    
        a) Which country appears most frequently in the dataset? How many cities are associated with it?

First we need to separate the city name from the country to which it belongs.

The problem is that not all pairs in the City column use the same formatting.

In [None]:
temp = df.copy()
print(temp.iloc[2,0])
print(temp.iloc[12,0])
print(temp.iloc[15,0])
print(temp.iloc[38,0])

Brussels, Belgium
Lemesos;Cyprus
Berlin. Germany
Lyon,  France


We fix this by normalizing the stray separators (';', '.', and double spaces) so that every pair uses the same ', ' separator.

In [None]:
corrected = []
for city in temp.City:
    if ';' in city:
        # 'Lemesos;Cyprus' -> 'Lemesos, Cyprus'
        corrected.append(city.replace(';', ', '))
    elif '.' in city:
        # 'Berlin. Germany' -> 'Berlin, Germany'
        corrected.append(city.replace('.', ','))
    elif '  ' in city:
        # 'Lyon,  France' -> 'Lyon, France'
        corrected.append(city.replace('  ', ' '))
    else:
        corrected.append(city)

temp.City = corrected

print(temp.iloc[2,0])
print(temp.iloc[12,0])
print(temp.iloc[15,0])
print(temp.iloc[38,0])

Brussels, Belgium
Lemesos, Cyprus
Berlin, Germany
Lyon, France
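The `if`/`elif` chain fixes only one kind of issue per string. A single regular expression can handle every separator variant at once, as in the hedged sketch below. The helper name `normalize_pair` is hypothetical, and the pattern assumes that no city or country name legitimately contains '.', ';' or ',' (a name such as "St. Gallen" would be split incorrectly).

```python
import re

def normalize_pair(text: str) -> str:
    """Collapse ';', '.' or ',' plus any surrounding spaces into ', '."""
    return re.sub(r"\s*[;.,]\s*", ", ", text.strip())

# The four formatting variants printed above.
samples = ["Brussels, Belgium", "Lemesos;Cyprus", "Berlin. Germany", "Lyon,  France"]
cleaned = [normalize_pair(s) for s in samples]
```

On a DataFrame, the same function could be applied with `temp.City.map(normalize_pair)`.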


Another problem is that the row at position 45 (index label 46) stores the pair in 'country, city' order instead of 'city, country'. Since it is a single row, we correct it manually.

In [None]:
temp.iloc[45]

0
City                                    Greece, Athens
Population Density                                1829
Population                                     3530371
Working Age Population                         2287174
Youth Dependency Ratio                              22
Unemployment Rate                                 17.2
GDP per Capita                                   38580
Days of very strong heat stress                     17
Main Spoken Languages                   Greek, English
Average Monthly Salary                            1050
Avgerage Rent Price                                600
Average Cost of Living                            1200
Average Price Groceries                            NaN
Last Data Update                   2024-07-16 00:00:00
Name: 46, dtype: object

In [None]:
temp.iloc[45, 0] = 'Athens, Greece'

temp.iloc[45]

0
City                                    Athens, Greece
Population Density                                1829
Population                                     3530371
Working Age Population                         2287174
Youth Dependency Ratio                              22
Unemployment Rate                                 17.2
GDP per Capita                                   38580
Days of very strong heat stress                     17
Main Spoken Languages                   Greek, English
Average Monthly Salary                            1050
Avgerage Rent Price                                600
Average Cost of Living                            1200
Average Price Groceries                            NaN
Last Data Update                   2024-07-16 00:00:00
Name: 46, dtype: object

Now that all pairs have consistent formatting, we can split them into CityName and Country.

In [None]:
#Extract the country

pairs = [city.split(', ') for city in temp.City]

cities = [pair[0] for pair in pairs]
countries = [pair[1] for pair in pairs]

print(cities)
print(countries)

['Vienna', 'Salzburg', 'Brussels', 'Antwerp', 'Gent', 'Bruges', 'Sofia', 'Dobrich', 'Zurich', 'Geneva', 'Basel', 'Lefkosia', 'Lemesos', 'Prague', 'Ostrava', 'Berlin', 'Berlin', 'Hamburg', 'Munich', 'Cologne', 'Frankfurt am Main', 'Stuttgart', 'Leipzig', 'Dresden', 'Dusseldorf', 'Hanover', 'Copenhagen', 'Odense', 'Madrid', 'Barcelona', 'Valencia', 'Seville', 'Malaga', 'Malaga', 'Tallinn', 'Helsinki', 'Tampere', 'Paris', 'Lyon', 'Toulouse', 'London', 'Leeds', 'Glasgow', 'Liverpool', 'Edinburgh', 'Athens', 'Thessaloniki', 'Zagreb', 'Split', 'Budapest', 'Miskolc', 'Debrecen', 'Dublin', 'Cork', 'Rome', 'Milan', 'Naples', 'Turin', 'Florence', 'Venice', 'Luxembourg', 'Riga', 'Malta', 'The Hague', 'Amsterdam', 'Rotterdam', 'Utrecht', 'Eindhoven', 'Oslo', 'Bergen', 'Stavanger', 'Warsaw', 'Lodz', 'Cracow', 'Lisbon', 'Porto', 'Braga', 'Coimbra', 'Giroc', 'Bratislava', 'Ljubljana', 'Stockholm', 'Gothenburg', 'Malmo', 'Ankara', 'Adana']
['Austria', 'Austria', 'Belgium', 'Belgium', 'Belgium', 'Belgi

Now that we have the city and country lists, we can add them to the DataFrame.

In [None]:
temp['CityName'] = cities
temp['Country'] = countries

temp.loc[:,['City', 'CityName', 'Country']]

Unnamed: 0,City,CityName,Country
1,"Vienna, Austria",Vienna,Austria
2,"Salzburg, Austria",Salzburg,Austria
3,"Brussels, Belgium",Brussels,Belgium
4,"Antwerp, Belgium",Antwerp,Belgium
5,"Gent, Belgium",Gent,Belgium
...,...,...,...
82,"Stockholm, Sweden",Stockholm,Sweden
83,"Gothenburg, Sweden",Gothenburg,Sweden
84,"Malmo, Sweden",Malmo,Sweden
85,"Ankara, Turkiye",Ankara,Turkiye
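As an alternative to building the two lists by hand, the split can be done with the vectorized string accessor. This is a minimal sketch on a toy frame rather than the notebook's data; `toy` and `parts` are names introduced only for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {"City": ["Vienna, Austria", "Frankfurt am Main, Germany", "The Hague, Netherlands"]}
)

# n=1 splits on the first ', ' only; expand=True returns one column per part.
parts = toy["City"].str.split(", ", n=1, expand=True)
toy["CityName"] = parts[0]
toy["Country"] = parts[1]
```

With `expand=True`, a row that lacks the separator gets a missing value in the second column instead of raising an `IndexError`, which makes leftover formatting problems easier to spot.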


As we can see, another problem arises: some cities are duplicated in the dataset.

In [None]:
temp[temp.duplicated(keep=False)]

Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update,CityName,Country
16,"Berlin, Germany",304,5303922,3481212,21.3,4.7,46548,3,"German, Turkish, Arabic, English",3200,1220,2200,,2023-06-29 00:00:00,Berlin,Germany
17,"Berlin, Germany",304,5303922,3481212,21.3,4.7,46548,3,"German, Turkish, Arabic, English",3200,1220,2200,,2023-06-29 00:00:00,Berlin,Germany
33,"Malaga, Spain",571,869096,585608,23.5,17.7,27694,0,"Spanish, English",2200,1312,1400,,2023-11-27 00:00:00,Malaga,Spain
34,"Malaga, Spain",571,869096,585608,23.5,17.7,27694,0,"Spanish, English",2200,1312,1400,,2023-11-27 00:00:00,Malaga,Spain


So we will drop the second occurrence of each of these cities.

In [None]:
temp.drop_duplicates(inplace=True)

temp[temp.duplicated(keep=False)]

Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update,CityName,Country


We still need to check whether any city name appears more than once with different data associated with it. As the following (empty) output shows, this does not happen in our dataset.

In [None]:
temp[temp.duplicated(subset=['CityName'])]

Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update,CityName,Country
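The two duplicate checks used above behave differently, which the following sketch illustrates on a toy frame. The population values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data: the two Berlin rows are identical in every column.
toy = pd.DataFrame(
    {
        "CityName": ["Berlin", "Berlin", "Malaga", "Hamburg"],
        "Country": ["Germany", "Germany", "Spain", "Germany"],
        "Population": [100, 100, 200, 300],
    }
)

# True for every row that has an identical twin (both copies are flagged).
full_dupes = toy.duplicated(keep=False)

# Keeps the first occurrence of each fully identical row by default.
deduped = toy.drop_duplicates()

# Flags rows that share a city name even if the other columns differ.
name_clashes = deduped.duplicated(subset=["CityName"], keep=False)
```

If `name_clashes` contained any `True` value, the same city would appear with conflicting data and would need a decision about which record to keep.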


Now we can make the changes to our DataFrame permanent by applying them to the original DataFrame.

In [None]:
df = temp

Finally, we can count the number of rows (cities) that share the same value in the 'Country' column.

In [None]:
df['Country'].value_counts()

Country
Germany            10
Italy               6
United Kingdom      5
Spain               5
Netherlands         5
Belgium             4
Portugal            4
Hungary             3
Poland              3
Sweden              3
France              3
Norway              3
Switzerland         3
Czechia             2
Turkiye             2
Greece              2
Denmark             2
Finland             2
Austria             2
Bulgaria            2
Cyprus              2
Croatia             2
Ireland             2
Estonia             1
Luxembourg          1
Latvia              1
Malta               1
Romania             1
Slovenia            1
Slovak Republic     1
Name: count, dtype: int64

We can conclude that the most frequent country is Germany, with 10 cities associated with it once the duplicated Berlin row is removed.
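Reading the answer off the table works, but it can also be extracted programmatically. The sketch below uses a small made-up Series in place of the 'Country' column.

```python
import pandas as pd

# Toy stand-in for df['Country'].
countries = pd.Series(
    ["Germany", "Italy", "Germany", "Spain", "Germany", "Italy"], name="Country"
)

counts = countries.value_counts()   # sorted from most to least frequent
top_country = counts.idxmax()       # label with the highest count
top_count = int(counts.max())       # how many rows carry that label
```

Taking both values from `counts` keeps the reported country and its count tied to the same computation.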

        b) How many cities are present in total? How many are associated with Greece?

Since we removed the duplicate entries in the previous item, the number of cities in the dataset equals the number of rows in our DataFrame.

In [None]:
df

Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update,CityName,Country
1,"Vienna, Austria",310,2983513,2018818,20.1,10.2,55770,3,"German, English, Turkish, Serbian",2500,1050,2061,340,2024-06-15 00:00:00,Vienna,Austria
2,"Salzburg, Austria",243,375489,250472,20.3,3,66689,0,German,3200,1100,2186,,2023-11-03 00:00:00,Salzburg,Austria
3,"Brussels, Belgium",681,3284548,2137425,27.5,10.7,62500,3,"French, Dutch, Arabic, English",3350,1200,1900,,2023-04-22 00:00:00,Brussels,Belgium
4,"Antwerp, Belgium",928,1139663,723396,27.7,6.2,57595,3,"Dutch, French, Arabic",2609,900,1953,,2024-08-09 00:00:00,Antwerp,Belgium
5,"Gent, Belgium",552,645813,417832,24.8,,53311,2,"Dutch, French",2400,827,1200,120,2023-07-17 00:00:00,Gent,Belgium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,"Stockholm, Sweden",334,2344124,1534225,28.5,6.2,70950,0,"Swedish, English",2700,1400,2300,,2024-09-11 00:00:00,Stockholm,Sweden
83,"Gothenburg, Sweden",245,1037675,672152,28.2,6.3,49588,0,"Swedish, English",2500,1200,2100,,2023-03-10 00:00:00,Gothenburg,Sweden
84,"Malmo, Sweden",368,680335,436271,29.4,9.2,44387,0,"Swedish, English",2400,1100,2000,,2024-07-07 00:00:00,Malmo,Sweden
85,"Ankara, Turkiye",1922,4843511,3417691,30,14.4,38916,3,Turkish,900,450,900,309,2023-06-08 00:00:00,Ankara,Turkiye


As we can see at the bottom of the previous table, there are 84 cities in our dataset, which matches the sum of the country counts above.

Two of these cities are associated with Greece, as the next output shows.

In [None]:
print(df.query('Country == "Greece"').shape[0], 'rows')
df.query('Country == "Greece"')

2 rows


Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update,CityName,Country
46,"Athens, Greece",1829,3530371,2287174,22.0,17.2,38580,17,"Greek, English",1050,600,1200,,2024-07-16 00:00:00,Athens,Greece
47,"Thessaloniki, Greece",381,1050568,684564,22.2,19.9,23940,10,Greek,1000,550,1100,,2023-12-21 00:00:00,Thessaloniki,Greece


        c) Which is the least spoken language in the dataset? Which are the top 3 most spoken languages?

As with the City/Country pairs, we need to split each 'Main Spoken Languages' string into the individual languages it contains.

In [None]:
df['Main Spoken Languages'][46]

'Greek, English'

In [None]:
temp = df.copy()

In [None]:
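# Drop rows without language data; dropna(inplace=True) on a selected column does not reliably change temp
temp = temp.dropna(subset=['Main Spoken Languages'])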

In [None]:
# Write the split result to a new helper column ('Language List' is a name introduced here),
# so the original strings stay intact and the cell can be re-run without failing on .split.
temp['Language List'] = temp['Main Spoken Languages'].str.split(', ')

In [None]:
#languages = [languagestr.split(', ') for languagestr in temp['Main Spoken Languages']]

In [None]:
temp['Main Spoken Languages'].value_counts()

Main Spoken Languages
Italian                                   6
Dutch, English                            5
German                                    4
Portuguese                                3
Norwegian, English                        3
German, English                           3
Swedish, English                          3
Hungarian                                 2
French                                    2
Croatian                                  2
German, English, Turkish                  2
Turkish                                   2
Polish, English                           2
English, Irish Gaelic                     2
Dutch, French                             2
Spanish                                   2
Bulgarian, English, Turkish               1
Bulgarian, Turkish                        1
Greek, Turkish, English                   1
Dutch, French, Arabic                     1
French, Dutch, Arabic, English            1
German, English, Turkish, Serbian         1
Spanish;Va
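The counts above are per combination of languages, not per language. One possible way to count individual languages is to split each string and `explode` the result, as sketched below on toy data. The assumption that ';' and ',' are the only separators is mine and should be checked against the real column.

```python
import pandas as pd

# Toy stand-in for the 'Main Spoken Languages' column, including messy separators.
toy = pd.DataFrame(
    {
        "Main Spoken Languages": [
            "German, English",
            "Dutch;French",
            "German",
            None,
            "English,  Turkish",
        ]
    }
)

languages = (
    toy["Main Spoken Languages"]
    .dropna()                             # skip rows without language data
    .str.replace(";", ",", regex=False)   # assumption: ';' is the only stray separator
    .str.split(",")                       # one list of languages per city
    .explode()                            # one row per (city, language)
    .str.strip()                          # remove leftover spaces
)

counts = languages.value_counts()
top3 = counts.head(3)                      # most frequently listed languages
least = counts[counts == counts.min()]     # all languages tied for the lowest count
```

Here "most spoken" means "listed for the most cities". Ties are possible at both ends, so `least` keeps every language with the minimum count rather than a single one.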

3. 

        a) Entries uploaded before April 2023 need to be updated. Which cities would require an update?


        b) How many days ago was the last update? On what day, month, and year did it occur?


4. 

        a) How are the Unemployment Rate and GDP per Capita distributed and related? What does this relationship suggest? Provide a visual representation.


        b) Which are the top 5 cities with the largest difference between the Average Monthly Salary and Average Cost of Living?
        What about the top 5 countries with the smallest average difference?
        Show these results with meaningful visualizations.


        c) Which is the best city for someone seeking:
                an average monthly salary above €1600,
                a cost of living below €900, and
                a country suitable for starting a family (with a relatively larger youth population)?


5. What are three additional insights you find meaningful when comparing the given cities?

## Advanced Topic - Building an Interactive Map

1. Web Scraping

2. Interactive Map

## Data Science In Action