# Exercises

## Dataset for testing

Use `unesco_heritage_sites.csv` to test the functions.


### Exercise 1: Basic Exploration
**Develop a function `get_column_names(df)`** that takes a pandas DataFrame as input and returns a list of column names.

---

### Exercise 2: Filtering Data
**Develop a function `filter_by_category(df, category)`** that takes a pandas DataFrame and a category (`"Cultural"`, `"Natural"`, or `"Mixed"`) and returns a new DataFrame containing only the rows whose `Category` column matches the given category.

---

### Exercise 3: Counting Rows
**Develop a function `count_sites_per_country(df, country)`** that takes a pandas DataFrame and a country name as input and returns the number of heritage sites located in that country.

---

### Exercise 4: Sorting Data
**Develop a function `get_top_visited_sites(df, n)`** that takes a DataFrame and a number `n` as input and returns the top `n` most visited heritage sites sorted in descending order.

---

### Exercise 5: Categorizing Sites by Visitors
**Develop a function `categorize_sites_by_visitors(df)`** that adds a new column `"VisitorCategory"` to the DataFrame based on the number of visitors per year:
- `"High"`: More than 5 million visitors
- `"Medium"`: Between 1 and 5 million visitors (inclusive)
- `"Low"`: Less than 1 million visitors

The function should return the modified DataFrame.

---

### Exercise 6: Counting Threatened Sites
**Develop a function `count_threatened_sites(df)`** that returns the total number of threatened heritage sites.

---

### Exercise 7: Grouping by Country
**Develop a function `get_site_counts_by_country(df)`** that returns a dictionary where each key is a country, and the value is the number of heritage sites in that country.

---

### Exercise 8: Finding Sites Inscribed Before a Given Year
**Develop a function `sites_before_year(df, year)`** that takes a DataFrame and a year as input, returning a new DataFrame containing only the sites inscribed before that year.

---

### Exercise 9: Adding New Sites from a Dictionary
**Develop a function `add_new_sites(df, new_sites_dict)`** that takes a DataFrame and a dictionary containing new heritage sites. The function should:
1. Convert the dictionary into a new DataFrame.
2. Concatenate it with the original DataFrame.
3. Return the updated DataFrame.

---

### Exercise 10: Finding the Largest Site per Country
**Develop a function `largest_site_per_country(df)`** that returns a new DataFrame containing only the largest heritage site (in terms of area) for each country.



In [1]:
# For exercise 9

new_sites_dict = {
    "Name": ["Ancient Ruins of Tikal", "Great Zimbabwe National Monument"],
    "Country": ["Guatemala", "Zimbabwe"],
    "Category": ["Cultural", "Cultural"],
    "Year Inscribed": [1979, 1986],
    "Visitors Per Year": [200000, 50000],
    "Threatened": [False, False],
    "Area (sq km)": [16, 7.2]
}
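
The keys of `new_sites_dict` do not match the column names of the CSV shown further down (`"Year Inscribed"` versus `YearInscribed`, for example), so a plain concatenation would produce extra, mostly empty columns. The sketch below is one possible way to follow the three steps of Exercise 9 while aligning the two. The rename mapping, the conversion of visitors to millions, the `"Yes"`/`"No"` encoding, and the continuation of the `SiteID` numbering are all assumptions made for illustration, not part of the exercise statement.

```python
import pandas as pd

# Hypothetical mapping from the dictionary's keys to the CSV's column names.
COLUMN_RENAMES = {
    "Year Inscribed": "YearInscribed",
    "Visitors Per Year": "VisitorsPerYear",
    "Area (sq km)": "Area_km2",
}


def add_new_sites(df, new_sites_dict):
    # 1. Convert the dictionary into a DataFrame and align its column names.
    new_sites = pd.DataFrame(new_sites_dict).rename(columns=COLUMN_RENAMES)
    # Assumption: the CSV stores visitors in millions and Threatened as "Yes"/"No".
    new_sites["VisitorsPerYear"] = new_sites["VisitorsPerYear"] / 1_000_000
    new_sites["Threatened"] = new_sites["Threatened"].map({True: "Yes", False: "No"})
    # Assumption: new sites continue the existing SiteID numbering.
    first_id = int(df["SiteID"].max()) + 1
    new_sites.insert(0, "SiteID", list(range(first_id, first_id + len(new_sites))))
    # 2. and 3. Concatenate with the original rows and return the result.
    return pd.concat([df, new_sites], ignore_index=True)
```

Once the CSV has been loaded, `add_new_sites(unesco, new_sites_dict)` would return a new DataFrame with two extra rows and leave `unesco` itself unchanged.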

In [2]:
import pandas as pd

unesco = pd.read_csv("unesco_heritage_sites.csv")

In [3]:
unesco

|  | SiteID | Name | Country | Category | YearInscribed | VisitorsPerYear | Area_km2 | Threatened |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Great Wall of China | China | Cultural | 1987 | 10.0 | 21196.0 | No |
| 1 | 2 | Machu Picchu | Peru | Cultural | 1983 | 1.5 | 32.0 | No |
| 2 | 3 | Pyramids of Giza | Egypt | Cultural | 1979 | 14.0 | 16.0 | No |
| 3 | 4 | Grand Canyon | USA | Natural | 1979 | 6.0 | 4927.0 | No |
| 4 | 5 | Colosseum | Italy | Cultural | 1980 | 7.6 | 0.02 | No |
| 5 | 6 | Serengeti National Park | Tanzania | Natural | 1981 | 1.5 | 14763.0 | No |
| 6 | 7 | Stonehenge | UK | Cultural | 1986 | 1.3 | 0.03 | No |
| 7 | 8 | Taj Mahal | India | Cultural | 1983 | 8.0 | 0.17 | No |
| 8 | 9 | Galápagos Islands | Ecuador | Natural | 1978 | 0.3 | 8010.0 | Yes |
| 9 | 10 | Angkor Wat | Cambodia | Cultural | 1992 | 2.6 | 162.6 | No |


In [4]:
def get_column_names(df):
    return df.columns.to_list() # this is the pandas command to retrieve the list of columns from a DataFrame

In [5]:
get_column_names(unesco)

['SiteID',
 'Name',
 'Country',
 'Category',
 'YearInscribed',
 'VisitorsPerYear',
 'Area_km2',
 'Threatened']

The function `get_column_names` takes a DataFrame as input and returns a list of its column names.

- Inputs: `df` (DataFrame).
- Outputs: list of strings.

For example, given a DataFrame with a single column called `"Name"`, the function is expected to return `["Name"]`.
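
The one-column example described above can be checked directly. The snippet repeats the function so that it runs on its own; the DataFrame `example` and its two rows are made up for this check.

```python
import pandas as pd


def get_column_names(df):
    # Same one-liner as in the notebook cell, repeated so the snippet is self-contained.
    return df.columns.to_list()


# Hypothetical one-column DataFrame, used only for this check.
example = pd.DataFrame({"Name": ["Petra", "Stonehenge"]})
print(get_column_names(example))  # prints ['Name']
```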

In [8]:
def filter_by_category(df, category):
    # One empty list per column; each list collects the values of the matching rows.
    kept = {column: [] for column in df.columns}
    for i, element in enumerate(list(df["Category"])):
        if element == category:  # keep only the rows in the requested category
            for column in df.columns:
                kept[column].append(df[column].iloc[i])
    # Build the result from the collected lists (it has no rows if nothing matched).
    return pd.DataFrame(kept)

In [11]:
filter_by_category(unesco, "Hello")

|  | SiteID | Name | Country | Category | YearInscribed | VisitorsPerYear | Area_km2 | Threatened |
|---|---|---|---|---|---|---|---|---|
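
The same filter can also be written without an explicit loop, using a boolean mask. The sketch below is an alternative rather than a replacement for the solution above; the name `filter_by_category_mask` is made up here, and the three-row `sample` reuses values from the table shown earlier.

```python
import pandas as pd


def filter_by_category_mask(df, category):
    # Comparing a column with a value yields one True/False per row (a boolean mask).
    mask = df["Category"] == category
    # Indexing with the mask keeps the True rows; reset_index renumbers them from 0.
    return df[mask].reset_index(drop=True)


sample = pd.DataFrame({
    "Name": ["Great Wall of China", "Grand Canyon", "Galápagos Islands"],
    "Category": ["Cultural", "Natural", "Natural"],
})
print(filter_by_category_mask(sample, "Natural")["Name"].tolist())
# prints ['Grand Canyon', 'Galápagos Islands']
```

An unknown category such as `"Hello"` gives a DataFrame with the same columns and no rows, matching the output above.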


In [15]:
def count_sites_per_country(df, country, country_column="Country"):
    country_list = list(df[country_column])  # all values of the country column, as a list
    return country_list.count(country)  # number of times the country appears in that list


count_sites_per_country(unesco, "Greece")

1

In [23]:
def get_top_visited_sites(df, n):
    df = df.sort_values(by='VisitorsPerYear')  # sort by visitors in ascending order (most visited last)
    name_list = list(df["Name"])  # names of the sites, in that order
    result_list = list()  # will hold the names of the top n sites
    for i in range(min(n, len(name_list))):  # at most n iterations, fewer if the DataFrame has fewer rows
        element = name_list.pop()  # remove the last (most visited remaining) name
        result_list.append(element)  # and append it to the result list
    return result_list

In [29]:
def get_top_visited_sites2(df, n):
    df = df.sort_values(by='VisitorsPerYear', ascending=False) # sorting the dataframe according to the column values (descending order)
    name_list = list(df["Name"]) # we make a list of all the names of the sites
    return name_list[:n] # we apply list slicing to only return the top n elements.

In [31]:
get_top_visited_sites2(unesco, 10)

['Pyramids of Giza',
 'Great Wall of China',
 'Taj Mahal',
 'Colosseum',
 'Eiffel Tower',
 'Grand Canyon',
 'Yellowstone National Park',
 'Angkor Wat',
 'Great Barrier Reef',
 'Acropolis of Athens']

In [25]:
get_top_visited_sites(unesco, 10)

['Pyramids of Giza',
 'Great Wall of China',
 'Taj Mahal',
 'Colosseum',
 'Eiffel Tower',
 'Grand Canyon',
 'Yellowstone National Park',
 'Angkor Wat',
 'Acropolis of Athens',
 'Great Barrier Reef']

In [18]:
unesco.sort_values(by='VisitorsPerYear')

|  | SiteID | Name | Country | Category | YearInscribed | VisitorsPerYear | Area_km2 | Threatened |
|---|---|---|---|---|---|---|---|---|
| 8 | 9 | Galápagos Islands | Ecuador | Natural | 1978 | 0.3 | 8010.0 | Yes |
| 13 | 14 | Petra | Jordan | Cultural | 1985 | 1.0 | 264.0 | No |
| 6 | 7 | Stonehenge | UK | Cultural | 1986 | 1.3 | 0.03 | No |
| 1 | 2 | Machu Picchu | Peru | Cultural | 1983 | 1.5 | 32.0 | No |
| 5 | 6 | Serengeti National Park | Tanzania | Natural | 1981 | 1.5 | 14763.0 | No |
| 12 | 13 | Great Barrier Reef | Australia | Natural | 1981 | 2.0 | 344400.0 | Yes |
| 11 | 12 | Acropolis of Athens | Greece | Cultural | 1987 | 2.0 | 3.04 | No |
| 9 | 10 | Angkor Wat | Cambodia | Cultural | 1992 | 2.6 | 162.6 | No |
| 10 | 11 | Yellowstone National Park | USA | Natural | 1978 | 4.0 | 8983.0 | No |
| 3 | 4 | Grand Canyon | USA | Natural | 1979 | 6.0 | 4927.0 | No |
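
The two versions of the function return the same ten sites but disagree on the order of 'Acropolis of Athens' and 'Great Barrier Reef', which both show 2.0 in the `VisitorsPerYear` column. A likely explanation is that the default sorting algorithm of `sort_values` does not promise to keep tied rows in their original order, and that popping from the end of an ascending list reverses whatever tie order the sort produced. One way to make the result deterministic, sketched below, is to add a second sort key. Breaking ties alphabetically by `Name` is an assumption made here, and `get_top_visited_sites_tiebreak` is a made-up name.

```python
import pandas as pd


def get_top_visited_sites_tiebreak(df, n):
    # Sort by visitors (descending); break ties alphabetically by name (ascending).
    ordered = df.sort_values(by=["VisitorsPerYear", "Name"], ascending=[False, True])
    # head(n) returns at most n rows, so n may exceed the number of sites.
    return ordered["Name"].head(n).tolist()


sample = pd.DataFrame({
    "Name": ["Great Barrier Reef", "Acropolis of Athens", "Angkor Wat"],
    "VisitorsPerYear": [2.0, 2.0, 2.6],
})
print(get_top_visited_sites_tiebreak(sample, 3))
# prints ['Angkor Wat', 'Acropolis of Athens', 'Great Barrier Reef']
```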


In [34]:
def categorize_sites_by_visitors(df):
    '''Add a "VisitorCategory" column based on the number of visitors per year (in millions).

    "High": more than 5 million visitors
    "Medium": between 1 and 5 million visitors (inclusive)
    "Low": less than 1 million visitors

    Returns the modified DataFrame.
    '''
    visitor_category_list = []
    for number in list(df["VisitorsPerYear"]):  # iterate over the values in this column
        if float(number) > 5.0:  # convert to float first, then compare against the two thresholds
            visitor_category_list.append("High")
        elif float(number) >= 1.0:
            visitor_category_list.append("Medium")
        else:
            visitor_category_list.append("Low")
    df["VisitorCategory"] = visitor_category_list  # the list is in row order, so it lines up with the DataFrame
    return df

In [35]:
categorize_sites_by_visitors(unesco)

|  | SiteID | Name | Country | Category | YearInscribed | VisitorsPerYear | Area_km2 | Threatened | VisitorCategory |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Great Wall of China | China | Cultural | 1987 | 10.0 | 21196.0 | No | High |
| 1 | 2 | Machu Picchu | Peru | Cultural | 1983 | 1.5 | 32.0 | No | Medium |
| 2 | 3 | Pyramids of Giza | Egypt | Cultural | 1979 | 14.0 | 16.0 | No | High |
| 3 | 4 | Grand Canyon | USA | Natural | 1979 | 6.0 | 4927.0 | No | High |
| 4 | 5 | Colosseum | Italy | Cultural | 1980 | 7.6 | 0.02 | No | High |
| 5 | 6 | Serengeti National Park | Tanzania | Natural | 1981 | 1.5 | 14763.0 | No | Medium |
| 6 | 7 | Stonehenge | UK | Cultural | 1986 | 1.3 | 0.03 | No | Medium |
| 7 | 8 | Taj Mahal | India | Cultural | 1983 | 8.0 | 0.17 | No | High |
| 8 | 9 | Galápagos Islands | Ecuador | Natural | 1978 | 0.3 | 8010.0 | Yes | Low |
| 9 | 10 | Angkor Wat | Cambodia | Cultural | 1992 | 2.6 | 162.6 | No | Medium |
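
The loop-based solution adds the column to the DataFrame it receives, so `unesco` itself now carries `VisitorCategory`. If that side effect is not wanted, the same thresholds can be applied to a copy with boolean masks. This is a sketch of an alternative, with a made-up function name, and it assumes as above that `VisitorsPerYear` is expressed in millions.

```python
import pandas as pd


def categorize_sites_by_visitors_vectorized(df):
    result = df.copy()  # leave the caller's DataFrame untouched
    visitors = result["VisitorsPerYear"].astype(float)
    # Start with "Low" everywhere, then overwrite the rows that reach each threshold.
    result["VisitorCategory"] = "Low"
    result.loc[visitors >= 1.0, "VisitorCategory"] = "Medium"
    result.loc[visitors > 5.0, "VisitorCategory"] = "High"
    return result


sample = pd.DataFrame({
    "Name": ["Great Wall of China", "Machu Picchu", "Galápagos Islands"],
    "VisitorsPerYear": [10.0, 1.5, 0.3],
})
print(categorize_sites_by_visitors_vectorized(sample)["VisitorCategory"].tolist())
# prints ['High', 'Medium', 'Low']
```

The order of the two assignments matters: rows above 5 are first marked `"Medium"` and then overwritten with `"High"`.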
