### Library Installation

In [90]:
#!pip install pandas


In [91]:
import pandas as pd


### File Importing

In [92]:
df = pd.read_csv("city_data.csv", delimiter='|')
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update
1,"Vienna, Austria",310,2983513,2018818,20.1,10.2,55770,3,"German, English, Turkish, Serbian",2500,1050,2061,340,2024-06-15 00:00:00
2,"Salzburg, Austria",243,375489,250472,20.3,3,66689,0,German,3200,1100,2186,,2023-11-03 00:00:00
3,"Brussels, Belgium",681,3284548,2137425,27.5,10.7,62500,3,"French, Dutch, Arabic, English",3350,1200,1900,,2023-04-22 00:00:00
4,"Antwerp, Belgium",928,1139663,723396,27.7,6.2,57595,3,"Dutch, French, Arabic",2609,900,1953,,2024-08-09 00:00:00


### Header Correction

In [93]:
new_header = df.iloc[0]

df.columns = new_header
df = df[1:]

df.head()

Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update
1,"Vienna, Austria",310,2983513,2018818,20.1,10.2,55770,3,"German, English, Turkish, Serbian",2500,1050,2061,340.0,2024-06-15 00:00:00
2,"Salzburg, Austria",243,375489,250472,20.3,3.0,66689,0,German,3200,1100,2186,,2023-11-03 00:00:00
3,"Brussels, Belgium",681,3284548,2137425,27.5,10.7,62500,3,"French, Dutch, Arabic, English",3350,1200,1900,,2023-04-22 00:00:00
4,"Antwerp, Belgium",928,1139663,723396,27.7,6.2,57595,3,"Dutch, French, Arabic",2609,900,1953,,2024-08-09 00:00:00
5,"Gent, Belgium",552,645813,417832,24.8,,53311,2,"Dutch, French",2400,827,1200,120.0,2023-07-17 00:00:00
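
As an alternative to promoting the first data row by hand, `read_csv` can be told which line holds the real header. The sketch below assumes the file starts with one unnamed line followed by the actual column names; the in-memory text is a hypothetical stand-in for `city_data.csv`, not its real contents.

```python
import io
import pandas as pd

# Hypothetical stand-in for city_data.csv: an empty first line, then the real header.
raw = io.StringIO(
    "||\n"
    "City|Population|Unemployment Rate\n"
    "Vienna, Austria|2983513|10.2\n"
    "Salzburg, Austria|375489|3\n"
)

# header=1 uses the second line as column names and discards the line before it.
demo = pd.read_csv(raw, delimiter="|", header=1)

print(demo.columns.tolist())
print(demo.dtypes)
```

Because the header text is no longer mixed into the data rows, numeric columns in this sketch are parsed as numbers rather than as text.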


## Data Wrangling and Analysis

1. How did you handle missing values and duplicate records in the dataset? Justify your approach.


2. 
    
        a) Which country appears most frequently in the dataset? How many cities are associated with it?

First, we need to separate each city name from the country it belongs to.

The problem is that not all entries in the City column follow the same format:

In [94]:
temp = df.copy()
print(temp.iloc[2,0])
print(temp.iloc[12,0])
print(temp.iloc[15,0])
print(temp.iloc[38,0])

Brussels, Belgium
Lemesos;Cyprus
Berlin. Germany
Lyon,  France
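
Instead of picking rows by hand, the irregular entries can also be found with a pattern check. This is a minimal sketch that assumes the expected format is exactly `Name, Country` with a single comma and a single space; the `sample` Series and the `expected` pattern are illustrative names, and the four values are the ones printed above.

```python
import pandas as pd

# The four entries printed above, used here as a small stand-in for df.City.
sample = pd.Series(
    ["Brussels, Belgium", "Lemesos;Cyprus", "Berlin. Germany", "Lyon,  France"]
)

# Expected shape: text without separators, then ", ", then text without separators.
expected = r"^[^,;.]+, \S[^,;.]*$"

# Keep only the entries that do NOT match the expected shape.
irregular = sample[~sample.str.match(expected)]
print(irregular.tolist())
```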


We correct this by replacing the inconsistent separators (';', '.', and double spaces) so that every entry uses the same ', ' separator.

In [95]:
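corrected = []
for city in temp.City:
    if ';' in city:
        # "Lemesos;Cyprus" -> "Lemesos, Cyprus"
        corrected.append(city.replace(';', ', '))
    elif '.' in city:
        # "Berlin. Germany" -> "Berlin, Germany" (the space is already there)
        corrected.append(city.replace('.', ','))
    elif '  ' in city:
        # "Lyon,  France" -> "Lyon, France"
        corrected.append(city.replace('  ', ' '))
    else:
        # Already in the expected "City, Country" format
        corrected.append(city)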

temp.City = corrected

print(temp.iloc[2,0])
print(temp.iloc[12,0])
print(temp.iloc[15,0])
print(temp.iloc[38,0])

Brussels, Belgium
Lemesos, Cyprus
Berlin, Germany
Lyon, France
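
The same clean-up can be sketched as a single vectorized step with a regular expression. This assumes that no city or country name itself contains ';', '.' or ',' (a name such as "St. Gallen" would be split incorrectly); `sample` and `cleaned` are illustrative names.

```python
import pandas as pd

# The four entries shown above, used as a small stand-in for temp.City.
sample = pd.Series(
    ["Brussels, Belgium", "Lemesos;Cyprus", "Berlin. Germany", "Lyon,  France"]
)

# Replace any ';', '.' or ',' together with the whitespace around it by ", ".
cleaned = sample.str.replace(r"\s*[;.,]\s*", ", ", regex=True).str.strip()
print(cleaned.tolist())
```

Unlike an `if`/`elif` chain, this handles an entry that has more than one problem at once, for example a semicolon followed by extra spaces.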


Now that all pairs have consistent formatting, we can split each pair into CityName and Country.

In [102]:
# Extract the city name and the country from each pair

pairs = [city.split(', ') for city in temp.City]

cities = [pair[0] for pair in pairs]
countries = [pair[1] for pair in pairs]

print(cities)
print(countries)

['Vienna', 'Salzburg', 'Brussels', 'Antwerp', 'Gent', 'Bruges', 'Sofia', 'Dobrich', 'Zurich', 'Geneva', 'Basel', 'Lefkosia', 'Lemesos', 'Prague', 'Ostrava', 'Berlin', 'Berlin', 'Hamburg', 'Munich', 'Cologne', 'Frankfurt am Main', 'Stuttgart', 'Leipzig', 'Dresden', 'Dusseldorf', 'Hanover', 'Copenhagen', 'Odense', 'Madrid', 'Barcelona', 'Valencia', 'Seville', 'Malaga', 'Malaga', 'Tallinn', 'Helsinki', 'Tampere', 'Paris', 'Lyon', 'Toulouse', 'London', 'Leeds', 'Glasgow', 'Liverpool', 'Edinburgh', 'Greece', 'Thessaloniki', 'Zagreb', 'Split', 'Budapest', 'Miskolc', 'Debrecen', 'Dublin', 'Cork', 'Rome', 'Milan', 'Naples', 'Turin', 'Florence', 'Venice', 'Luxembourg', 'Riga', 'Malta', 'The Hague', 'Amsterdam', 'Rotterdam', 'Utrecht', 'Eindhoven', 'Oslo', 'Bergen', 'Stavanger', 'Warsaw', 'Lodz', 'Cracow', 'Lisbon', 'Porto', 'Braga', 'Coimbra', 'Giroc', 'Bratislava', 'Ljubljana', 'Stockholm', 'Gothenburg', 'Malmo', 'Ankara', 'Adana']
['Austria', 'Austria', 'Belgium', 'Belgium', 'Belgium', 'Belgi

Now that we have the city and country lists, we can add them to the DataFrame.

In [104]:
temp['CityName'] = cities
temp['Country'] = countries

temp.head()

Unnamed: 0,City,Population Density,Population,Working Age Population,Youth Dependency Ratio,Unemployment Rate,GDP per Capita,Days of very strong heat stress,Main Spoken Languages,Average Monthly Salary,Avgerage Rent Price,Average Cost of Living,Average Price Groceries,Last Data Update,Country,CityName
1,"Vienna, Austria",310,2983513,2018818,20.1,10.2,55770,3,"German, English, Turkish, Serbian",2500,1050,2061,340.0,2024-06-15 00:00:00,Austria,Vienna
2,"Salzburg, Austria",243,375489,250472,20.3,3.0,66689,0,German,3200,1100,2186,,2023-11-03 00:00:00,Austria,Salzburg
3,"Brussels, Belgium",681,3284548,2137425,27.5,10.7,62500,3,"French, Dutch, Arabic, English",3350,1200,1900,,2023-04-22 00:00:00,Belgium,Brussels
4,"Antwerp, Belgium",928,1139663,723396,27.7,6.2,57595,3,"Dutch, French, Arabic",2609,900,1953,,2024-08-09 00:00:00,Belgium,Antwerp
5,"Gent, Belgium",552,645813,417832,24.8,,53311,2,"Dutch, French",2400,827,1200,120.0,2023-07-17 00:00:00,Belgium,Gent
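
Splitting and assigning can also be done in one step with the pandas string accessor. This is a minimal sketch on a two-row stand-in DataFrame called `demo` (an illustrative name), assuming every entry already uses the ', ' separator.

```python
import pandas as pd

# Small stand-in for the cleaned DataFrame.
demo = pd.DataFrame({"City": ["Vienna, Austria", "Frankfurt am Main, Germany"]})

# n=1 splits only at the first ", "; expand=True returns one column per part.
demo[["CityName", "Country"]] = demo["City"].str.split(", ", n=1, expand=True)
print(demo)
```

With `expand=True`, an entry that lacks the separator gets a missing value in the second column instead of raising an `IndexError`, which makes malformed rows easier to spot.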


Now we can make the changes permanent by assigning temp back to the original df.

In [98]:
df = temp

Finally, we can count the number of rows (cities) that share the same value in the 'Country' column.

In [99]:
df['Country'].value_counts()

Country
Germany            11
Spain               6
Italy               6
Netherlands         5
United Kingdom      5
Belgium             4
Portugal            4
Hungary             3
Norway              3
Poland              3
Sweden              3
Switzerland         3
France              3
Czechia             2
Bulgaria            2
Turkiye             2
Ireland             2
Cyprus              2
Finland             2
Denmark             2
Austria             2
Croatia             2
Estonia             1
Greece              1
Athens              1
Luxembourg          1
Latvia              1
Malta               1
Romania             1
Slovenia            1
Slovak Republic     1
Name: count, dtype: int64

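From this we can conclude that the most frequent country is Germany, with 11 cities associated with it. Two caveats are visible in the output: Berlin and Malaga each appear twice in the city list, so if the second Berlin row is a true duplicate, Germany's count would drop to 10; and one entry has city and country swapped ('Greece' appears among the cities and 'Athens' among the countries).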
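
The top entry can also be read off programmatically rather than by eye. A minimal sketch on a made-up `countries` Series (illustrative values, not the real column):

```python
import pandas as pd

# Made-up stand-in for df['Country'].
countries = pd.Series(["Germany", "Germany", "Spain", "Austria"])

counts = countries.value_counts()

# idxmax() returns the label with the highest count; max() returns the count itself.
top_country = counts.idxmax()
top_count = int(counts.max())
print(top_country, top_count)
```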

        b) How many cities are present in total? How many are associated with Greece?

        c) Which is the least spoken language in the dataset? Which are the top 3 most spoken languages?

3. 

        a) Entries uploaded before April 2023 need to be updated. Which cities would require an update?


        b) How many days ago was the last update? On what day, month, and year did it occur?


4. 

        a) How are the Unemployment Rate and GDP per Capita distributed and related? What does this relationship suggest? Provide a visual representation.


        b) Which are the top 5 cities with the largest difference between the Average Monthly Salary and Average Cost of Living?
        What about the top 5 countries with the smallest average difference?
        Show these results with meaningful visualizations.


        c) Which is the best city for someone seeking:
                an average monthly salary above €1600,
                a cost of living below €900, and
                a country suitable for starting a family (with a relatively larger youth population)?


5. What are three additional insights you find meaningful when comparing the given cities?

## Advanced Topic - Building an Interactive Map

1. Web Scraping

2. Interactive Map

## Data Science In Action