# Python for Spatial Analysis
## Lab Assignment 2 - GG3209 - Python Part
### GitHub Repository: https://github.com/FreyaJJ/MyFirstRepo.git

---
Student ID: 230006679 - University of St Andrews - School of Geography and Sustainable Development

### Introduction

This Jupyter notebook collects the four parts of Assessment 2, which were coded using ArcGIS Online and Google Colab. The contents are listed below.

### Content:

#### 1. Python Basics:

* Exercise 1: Create a script that calculates the average of a list of numbers
* Exercise 2: What's the purpose
* Exercise 3: Indentation
* Exercise 4: Strings
* Exercise 5: Nested *if* conditions
* Exercise 6: Functions
* Exercise 7: Create a script that generates a multiplication table
* Exercise 8: Loops
* Exercise 9: Read files

#### 2. NumPy and Pandas:

NumPy
* Exercise 1: Creating arrays
* Exercise 2: Create and reshape an array
* Exercise 3: Create a linearly spaced array
* Exercise 4: Create and index an array
* Exercise 5: Calculate the sum of all numbers
* Exercise 6: Calculate the sum of each row
* Exercise 7: Boolean mask: extract values greater than the mean

Pandas
* Exercise 1: Importing Pandas
* Exercise 2: Import a CSV dataset
* Exercise 3: Rows and columns in the dataframe
* Exercise 4: Mean of category in the whole dataset
* Exercise 5: Max of category in the whole dataset
* Exercise 6: Calculate which countries produce more emissions than 1000 Kg/CO2
* Exercise 7: Calculate which country consumes the least amount of beef
* Exercise 8: Calculate total emissions of meat products in dataset
* Exercise 9: Calculate total emissions of all other products in dataset

Final Exercise
1. Import pandas and read CSV dataset
2. Calculate new column
3. Remove column
4. Subset a city starting with the letter F
5. Subset the five biggest cities from the country where the F city is 
  
#### 3. GeoPandas and Rasterio

GeoPandas
* Exercise 1: Load the Shapefile into a GeoDataFrame
* Exercise 2: Subset relevant columns for analysis
* Exercise 3: Inspect the coordinate reference system (CRS)
* Exercise 4: Check the number of features in the layer
* Exercise 5: Ensure uniqueness of the LSOA Area Code (LSOA11CD)
* Exercise 6: Visualize the Layer Using .plot()
* Exercise 7: Explore the Layer with .explore()
* Exercise 8: Subset Areas with Population Greater Than 1500
* Exercise 9: Visualize the Subset with Population-Based Symbology
* Exercise 10: Count the Number of Areas in the Subset
* Exercise 11: Calculate the Total Population of the Subset

Rasterio
* Exercise 1: Read the Raster File Using Rasterio
* Exercise 2: Inspect the Coordinate Reference System (CRS) of the Raster
* Exercise 3: Check the Raster's Extent (Bounds) in Projected Coordinates
* Exercise 4: Determine the Number of Bands in the Raster Dataset
* Exercise 5: Visualize the Raster Image
* Exercise 6: Generate Histograms from the Raster Data
* Exercise 7: Create a False-Color Plot Using EarthPy (with Troubleshooting Tips)

#### 4. Spatial Clustering (K-Means - DBSCAN)

Part 1: Get and read the large dataset
1. Download the Road Accident (United Kingdom (UK)) dataset
2. Upload the dataset in your Google Drive
3. Mount the Drive to Notebook


Part 2: Exploratory Data Analysis and K-means Clustering

(A) - Data Exploration and Pre-Processing
* Exercise 1: Load the Dataset Using Pandas
* Exercise 2: Preview the Data Structure and Attributes
* Exercise 3: Select Relevant Numerical and Categorical Columns
* Exercise 4: Filter the Dataset to Include Only 2010 Records
* Exercise 5: Visualize Accidents by Day of the Week
* Exercise 6: Analyze Severity vs. Road Conditions with a Second Plot
* Exercise 7: Map All Accidents Using the Lonboard Library
* Exercise 8: Apply Spatial Filtering and Map the Glasgow-Edinburgh Region

(B) - K-means Clustering Implementation
* Exercise 1: Apply K-Means Clustering with Multiple k Values
* Exercise 2: Mapping Cluster Results Using the Lonboard Library
* Exercise 3: Interpretation: Effect of k on Cluster Structure
* Exercise 4: K-Means with Additional Attributes (Severity, Number of Vehicles, etc.)
* Exercise 5: Mapping Multivariate Clusters with Lonboard
* Exercise 6: Reflection: Coordinate-Only vs. Attribute-Enhanced Clustering


Part 3: Spatial Analysis and DBSCAN Clustering

(A) - Spatial Correlation
* Exercise 1: Create a New GeoDataFrame for DBSCAN Analysis
* Exercise 2: Filter the GeoDataFrame to Include Only Birmingham Accidents (Using BBox)
* Exercise 3: Map Birmingham Accident Data Using Lonboard
* Exercise 4: Inspect Attribute Data Types (Numerical vs. Categorical)
* Exercise 5: Compute Correlation Matrix for Numerical Features
* Exercise 6: Visualize the Correlation Matrix with a Heatmap
* Exercise 7: Install Required Library: pysal
* Exercise 8: Import Additional Required Libraries
* Exercise 9: Reproject the Dataset to the Appropriate UK Coordinate Reference System (EPSG)
* Exercise 10: Interpretation: Insights from the Correlation Analysis

(B) - DBSCAN Clustering Implementation
* Exercise 1: Apply DBSCAN Clustering with Varying eps and min_samples Values
* Exercise 2: Map DBSCAN Cluster Results Using Plotly
* Exercise 3: Interpretation: Effect of Parameter Changes on DBSCAN Clustering
* Exercise 4: Reflection: Comparing DBSCAN to K-Means Clustering
* Exercise 5: Discussion: Real-World Urban Planning Implications of Identified Clusters

---

## Part 1: Python Basics

### Exercise 1: Create a script that calculates the average of a list of numbers
#### Steps:
* In a new cell I created a list of numbers: [1,2,3,4,5,6,7,8,9,10]
* Then I wrote a function that takes a list of numbers as input, calculates the average, and returns the result.
* Finally I called the function with my list of numbers and printed the result.

In [4]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
average = sum(numbers) / len(numbers)
print(average)

5.5



The result was an average of 5.5: the numbers sum to 55, which is divided by the length of the numbers list (10).
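
The steps above describe a function, while the cell computes the average inline. A minimal sketch of a function-based version follows; the name `calculate_average` and the empty-list check are assumptions added for illustration and are not part of the original cell.

```python
def calculate_average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        # Assumption: an empty list is treated as an error rather than returning 0
        raise ValueError("cannot average an empty list")
    return sum(values) / len(values)


numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
average = calculate_average(numbers)
print(average)  # 5.5
```

Wrapping the calculation in a function makes it reusable for any other list without repeating the `sum(...) / len(...)` expression.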

### Exercise 2: What's the purpose
#### Steps:
* We were asked to explain the purpose of the first two expressions in the code below:
  
  ```
  import pandas as pd
  import geopandas as gpd
  dat = pd.read_csv('data/world_cities.csv')  ## Import CSV file
  geom = gpd.points_from_xy(dat['lon'], dat['lat'])
  geom = gpd.GeoSeries(geom)
  dat = gpd.GeoDataFrame(dat, geometry = gpd.GeoSeries(geom), crs = 4326)
  dat.to_file('output/world_cities.shp')      ## Export Shapefile
  ```

The first two expressions import the pandas (abbreviated to `pd`) and geopandas (abbreviated to `gpd`) libraries into Python. The rest of the code then converts a CSV file to a Shapefile: it reads the CSV file, creates point geometry from longitude and latitude, converts the geometry to a GeoSeries, creates a GeoDataFrame, and finally exports it to a Shapefile.

### Exercise 3: Indentation
#### Steps:
* We were given the following code cell and asked to fix the error it raises when run (the unexpected indent before `dogs = 0` causes an `IndentationError`):

```
name = 'Dave'
    dogs = 0
print('My name is', name, 'and I own', dogs, 'dogs.')
```

In [5]:
name = 'Dave'
dogs = 0
print('My name is', name, 'and I own', dogs, 'dogs.')

My name is Dave and I own 0 dogs.
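
To illustrate why the original cell fails, the sketch below compiles both versions as strings and reports the error class. The helper name `check_syntax` is hypothetical and only used here for demonstration.

```python
# The broken and the corrected versions of the cell, stored as source text
broken_cell = "name = 'Dave'\n    dogs = 0\nprint(name, dogs)\n"
fixed_cell = "name = 'Dave'\ndogs = 0\nprint(name, dogs)\n"


def check_syntax(source):
    """Return None if the source compiles, else the error's class name."""
    try:
        compile(source, "<cell>", "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return type(err).__name__
    return None


print(check_syntax(broken_cell))  # IndentationError
print(check_syntax(fixed_cell))   # None
```

Python uses indentation to mark blocks, so a line that is indented without a preceding block opener (such as `if` or `def`) is rejected before any code runs.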


### Exercise 4: Strings
#### Steps:
* Created two string variables
* Joined the two strings together
* Printed the strings and integer together

In [6]:
a = "Thank "
b = "you"
c = a + b
print(c)

Thank you
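
The last step mentions an integer, which the cell above does not include. A minimal sketch of two common ways to combine strings with an integer follows; the variable `times` and its value are hypothetical.

```python
a = "Thank "
b = "you"
times = 3  # hypothetical integer, not in the original cell

c = a + b

# print() converts each argument to text and separates them with a space
print(c, times)

# An f-string formats the integer inside the string
message = f"{c} x {times}"
print(message)

# With "+", the integer must be converted explicitly using str()
combined = c + " " + str(times)
print(combined)
```

Writing `c + times` without `str()` would raise a `TypeError`, because `+` cannot join a string and an integer.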


### Exercise 5: Nested *if* conditions
#### Steps:
* Created the variables where genre stores the movie genre as a string and duration stores the length of the movie in hours.
* Created outer 'if' statements for movie genres
* Created nested 'if' and 'else' statements that are executed depending on whether the duration condition is True or False.
* Created an extending if-else statement for more genres if previous statements are false
* Created another extending if-else statement if all of the above is false

In [7]:
genre = "Action"
duration = 2.5

if genre == "Action":
    if duration > 3:
        print("Buy lots of popcorn! This will be a long action packed movie!")
    else:
        print("Short action movie! Great for a quick escape during a busy day!")
elif genre == "Romcom":
    if duration > 2:
        print("A long romcom! Get ready for love triangles, meet-cutes and complex relationships!")
    else:
        print("Short and sweet! Perfect for a heartwarming midday pick-me-up!")
else:
    print("Hmm! I'm not sure about this genre! Maybe check the reviews before watching!")

Short action movie! Great for a quick escape during a busy day!
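
The cell above only exercises one branch. As a sketch, the same nested logic can be wrapped in a function so that every branch can be tried with different inputs; the function name `movie_advice` is an assumption made for illustration.

```python
def movie_advice(genre, duration):
    """Return the message for a genre (string) and duration (hours)."""
    if genre == "Action":
        if duration > 3:
            return "Buy lots of popcorn! This will be a long action packed movie!"
        return "Short action movie! Great for a quick escape during a busy day!"
    elif genre == "Romcom":
        if duration > 2:
            return "A long romcom! Get ready for love triangles, meet-cutes and complex relationships!"
        return "Short and sweet! Perfect for a heartwarming midday pick-me-up!"
    # Any other genre falls through to the default message
    return "Hmm! I'm not sure about this genre! Maybe check the reviews before watching!"


# Try one example per outer branch
for genre, duration in [("Action", 2.5), ("Romcom", 2.5), ("Horror", 1.5)]:
    print(genre, duration, "->", movie_advice(genre, duration))
```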


### Exercise 6: Functions
#### Steps:
* Defined the function
* Indented the function body, converted miles to kilometers (1 mile = 1.60934 km), and returned the result
* Called the function and printed the result (using an example of 8 miles being converted to kilometers)

In [8]:
def miles_to_kms(miles):
    return miles * 1.60934

print(miles_to_kms(8))

12.87472


### Exercise 7: Create a script that generates a multiplication table
#### Steps:
* Wrote a function that takes an integer as input and generates a multiplication table for that number, from 1 to 10 (using a loop)
* Called the function with a few different numbers and printed the results.

In [9]:
def multiplication_table(number):
    for i in range(1, 11):
        print(f"{number} x {i} = {number * i}")

numbers = [2, 4, 6]
for num in numbers:
    print(f"\nMultiplication table for {num}:")
    multiplication_table(num)
        


Multiplication table for 2:
2 x 1 = 2
2 x 2 = 4
2 x 3 = 6
2 x 4 = 8
2 x 5 = 10
2 x 6 = 12
2 x 7 = 14
2 x 8 = 16
2 x 9 = 18
2 x 10 = 20

Multiplication table for 4:
4 x 1 = 4
4 x 2 = 8
4 x 3 = 12
4 x 4 = 16
4 x 5 = 20
4 x 6 = 24
4 x 7 = 28
4 x 8 = 32
4 x 9 = 36
4 x 10 = 40

Multiplication table for 6:
6 x 1 = 6
6 x 2 = 12
6 x 3 = 18
6 x 4 = 24
6 x 5 = 30
6 x 6 = 36
6 x 7 = 42
6 x 8 = 48
6 x 9 = 54
6 x 10 = 60


### Exercise 8: Loops
#### Steps:
* Defined a list named lunch, as shown below
* Created an empty dictionary for the list
* Used a *for* loop to go over the items, and a conditional to add a new item with count 1 (if it is not yet in the dictionary), or to increment an existing item count.
* Printed results

In [11]:
lunch = ['Salad', 'Salad', 'Egg', 'Beef', 'Potato', 'Tea', 'chicken', 'Potato', 'Potato', 'Coffee']

lunch_count = {}

for item in lunch:
    if item in lunch_count:
        lunch_count[item] += 1
    else:
        lunch_count[item] = 1

print(lunch_count)

{'Salad': 2, 'Egg': 1, 'Beef': 1, 'Potato': 3, 'Tea': 1, 'chicken': 1, 'Coffee': 1}
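
As an alternative to the manual loop, the standard library's `collections.Counter` performs the same counting. This is a sketch of a different technique, not the method required by the exercise.

```python
from collections import Counter

lunch = ['Salad', 'Salad', 'Egg', 'Beef', 'Potato', 'Tea', 'chicken', 'Potato', 'Potato', 'Coffee']

# Counter builds the item -> count mapping in one step
lunch_count = Counter(lunch)

print(dict(lunch_count))
print(lunch_count.most_common(1))  # the most frequent item and its count
```

Counting is case-sensitive in both versions, so `'chicken'` and `'Chicken'` would be treated as different items.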


### Exercise 9: Read files
#### Steps:
* Created a script that reads and prints the `Latest_earthquake_world.csv` file, included in the data folder.

In [2]:
import pandas as pd

dataset = 'Latest_earthquake_world.csv'

# Read the CSV into a DataFrame; the first row holds the column names
earthquake_df = pd.read_csv(dataset, sep=",", header=0, encoding="ISO-8859-1")
earthquake_df.head(30)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2023-02-23T08:01:02.881Z,38.128,73.2184,10.0,4.7,mb,75.0,55.0,1.745,0.66,...,2023-02-23T08:35:43.040Z,"65 km W of Murghob, Tajikistan",earthquake,4.0,1.907,0.058,91.0,reviewed,us,us
1,2023-02-23T06:55:34.020Z,18.7946,-63.9205,10.0,3.67,md,19.0,228.0,1.1264,0.5,...,2023-02-23T08:02:57.172Z,Leeward Islands,earthquake,5.27,4.42,0.13,14.0,reviewed,pr,pr
2,2023-02-23T06:50:49.137Z,38.4879,72.8122,10.0,4.5,mb,21.0,171.0,1.264,0.56,...,2023-02-23T07:28:35.040Z,"106 km WNW of Murghob, Tajikistan",earthquake,8.0,2.0,0.14,15.0,reviewed,us,us
3,2023-02-23T06:18:13.280Z,-8.5483,-77.6254,66.54,4.7,mb,44.0,154.0,3.503,0.47,...,2023-02-23T07:15:40.040Z,"22 km SW of Quiches, Peru",earthquake,8.1,8.7,0.064,74.0,reviewed,us,us
4,2023-02-23T03:36:09.429Z,-18.287,-177.8261,524.983,4.9,mb,40.0,63.0,3.626,0.67,...,2023-02-23T03:50:54.040Z,,earthquake,13.43,10.963,0.108,27.0,reviewed,us,us
5,2023-02-23T03:18:57.663Z,-30.3916,-71.6856,34.097,4.4,ml,33.0,187.0,0.285,0.94,...,2023-02-23T03:30:52.786Z,"52 km WNW of Ovalle, Chile",earthquake,5.75,5.597,,,reviewed,us,guc
6,2023-02-23T02:18:00.671Z,38.3362,73.1067,10.0,4.7,mb,52.0,60.0,1.529,0.94,...,2023-02-23T02:37:56.040Z,Tajikistan-Xinjiang border region,earthquake,6.56,1.897,0.059,87.0,reviewed,us,us
7,2023-02-23T02:07:46.940Z,38.1822,73.2346,10.0,4.8,mb,21.0,180.0,1.712,0.81,...,2023-02-23T02:22:17.040Z,"64 km W of Murghob, Tajikistan",earthquake,6.8,1.992,0.226,6.0,reviewed,us,us
8,2023-02-23T02:03:42.690Z,18.0793,-68.0888,78.0,3.54,md,14.0,234.0,0.1577,0.2,...,2023-02-23T02:42:39.960Z,"64 km ESE of Boca de Yuma, Dominican Republic",earthquake,1.66,1.34,0.06,6.0,reviewed,pr,pr
9,2023-02-23T01:35:57.658Z,38.1824,73.2794,10.0,4.9,mb,27.0,166.0,1.735,0.7,...,2023-02-23T01:49:27.040Z,"60 km W of Murghob, Tajikistan",earthquake,7.86,1.985,0.177,10.0,reviewed,us,us


## Part 2: NumPy and Pandas
## NumPy

### Exercise 1: Creating arrays
#### Steps:
* Import numpy under the alias np
* Create an array of 10 ones
* Create an array of integers 1 to 20
* Create a 5 x 5 matrix of ones with a dtype int.

In [3]:
import numpy as np

arr1 = np.tile(1, 10)
print(arr1)

arr2 = np.arange(1, 21, 1, dtype="int")
print(arr2)

arr3 = np.ones((5, 5), dtype="int")
print(arr3)

[1 1 1 1 1 1 1 1 1 1]
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
[[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]]


### Exercise 2: Create and reshape an array
#### Steps:
* Create a 3D matrix of 3 x 3 x 3 full of random numbers drawn from a standard normal distribution using `np.random.randn()`
* Reshape the above array into shape (27,)

In [4]:
arr1 = np.random.rand(3, 3, 3)

arr1a= arr1.reshape(27,)
print(arr1a)

[0.38732295 0.86964684 0.47217119 0.26663197 0.06566132 0.4687501
 0.93009997 0.96642025 0.90483439 0.48938599 0.40672499 0.14459002
 0.84715336 0.32018714 0.10008607 0.24050824 0.68377639 0.56703338
 0.32072269 0.42664816 0.23461484 0.9285551  0.8160937  0.87834548
 0.2574642  0.72186535 0.43362229]
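
Note that the cell above calls `np.random.rand`, which samples uniformly from [0, 1), which is why every value in the output lies between 0 and 1. The exercise asks for a standard normal distribution; a minimal sketch using `np.random.randn` follows (the variable names `cube` and `flat` are illustrative).

```python
import numpy as np

# randn draws from a standard normal distribution (mean 0, standard deviation 1),
# so values can be negative or greater than 1
cube = np.random.randn(3, 3, 3)

# Flatten the 3 x 3 x 3 array into a 1D array of 27 values
flat = cube.reshape(27,)

print(cube.shape, flat.shape)
print(flat)
```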


### Exercise 3: Create a linearly spaced array
#### Steps:
* Create an array of 20 linearly spaced numbers between 1 and 10.

In [5]:
arr1 = np.linspace(1, 10, 20)
print(arr1)

[ 1.          1.47368421  1.94736842  2.42105263  2.89473684  3.36842105
  3.84210526  4.31578947  4.78947368  5.26315789  5.73684211  6.21052632
  6.68421053  7.15789474  7.63157895  8.10526316  8.57894737  9.05263158
  9.52631579 10.        ]


### Exercise 4: Create and index an array
#### Steps:
* Run the following code to create an array of shape 5 x 5
* Use indexing to produce the outputs shown below:
```python
20
```
```python
array([[ 9, 10],
       [14, 15],
       [19, 20],
       [24, 25]])
```
```python
array([ 6,  7,  8,  9, 10])
```

In [6]:
import numpy as np
a = np.arange(1, 26).reshape(5, -1)
print(a[3, 4])


20


In [7]:
a = np.arange(1, 26).reshape(5, -1)
print(a[1:, 3:5])


[[ 9 10]
 [14 15]
 [19 20]
 [24 25]]


In [8]:
a = np.arange(1, 26).reshape(5, -1)
print(a[1, 0:5])

[ 6  7  8  9 10]


### Exercise 5: Calculate the sum of all numbers
#### Steps:
* Calculate the sum of all the numbers in `a`

In [9]:
a = np.arange(1, 26).reshape(5, -1)
print(np.sum(a))

325


### Exercise 6: Calculate the sum of each row
#### Steps:
* Calculate the sum of each row in `a`

In [10]:
a = np.arange(1, 26).reshape(5, -1)
print(np.sum(a, axis=1))

[ 15  40  65  90 115]
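
The `axis` argument decides which dimension is collapsed. As a sketch for comparison, `axis=1` sums across the columns of each row, while `axis=0` sums down the rows of each column.

```python
import numpy as np

a = np.arange(1, 26).reshape(5, -1)

row_sums = a.sum(axis=1)  # one value per row
col_sums = a.sum(axis=0)  # one value per column

print(row_sums)  # [ 15  40  65  90 115]
print(col_sums)  # [55 60 65 70 75]

# Both sets of partial sums add up to the total of the whole array
print(row_sums.sum(), col_sums.sum())  # 325 325
```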


### Exercise 7: Boolean mask: extract values greater than the mean
#### Steps:
* Extract all values of `a` greater than the mean of `a` using a boolean mask

In [11]:
a = np.arange(1, 26).reshape(5, -1)
mean = np.mean(a)
print(mean)

boolmask = a > mean
print(a[boolmask])

13.0
[14 15 16 17 18 19 20 21 22 23 24 25]


## Pandas

### Exercise 1: Importing Pandas
#### Steps:
* Import pandas with the alias pd

In [12]:
import pandas as pd

### Exercise 2: Import a CSV dataset
#### Steps:
* Import the dataset as a dataframe named `df` from this url: <https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv>

In [16]:
dataset = 'food_consumption (1).csv'  # local copy of the dataset

food_consumption_df = pd.read_csv(dataset)

### Exercise 3: Rows and columns in the dataframe
#### Steps:
* Print the dataframe 

In [17]:
print(food_consumption_df)

         country             food_category  consumption  co2_emmission
0      Argentina                      Pork        10.51          37.20
1      Argentina                   Poultry        38.66          41.53
2      Argentina                      Beef        55.48        1712.00
3      Argentina               Lamb & Goat         1.56          54.63
4      Argentina                      Fish         4.36           6.96
...          ...                       ...          ...            ...
1425  Bangladesh        Milk - inc. cheese        21.91          31.21
1426  Bangladesh  Wheat and Wheat Products        17.47           3.33
1427  Bangladesh                      Rice       171.73         219.76
1428  Bangladesh                  Soybeans         0.61           0.27
1429  Bangladesh   Nuts inc. Peanut Butter         0.72           1.27

[1430 rows x 4 columns]


### Exercise 4: Mean of category in the whole dataset
#### Steps:
* Within the dataframe, select the CO2 emission column (`co2_emmission`) and calculate the mean of that column
* Print the calculated mean value to the output

In [18]:
co2_mn = food_consumption_df["co2_emmission"].mean()
print(co2_mn)

74.383993006993


### Exercise 5: Max of category in the whole dataset
#### Steps:
* Within the dataframe, select the CO2 emission column (`co2_emmission`) and calculate the highest value in that column
* Print the maximum CO2 emission value
* Select only the rows where the `co2_emmission` value equals the previously found maximum (`co2_max`)
* Print the result to the output

In [19]:
co2_max = food_consumption_df["co2_emmission"].max()
print(co2_max)

result = food_consumption_df[food_consumption_df["co2_emmission"] == co2_max]
print(result)

1712.0
     country food_category  consumption  co2_emmission
2  Argentina          Beef        55.48         1712.0


The maximum `co2_emmission` value (1712.0) is for Beef in Argentina.
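
An alternative way to locate the row with the maximum is `Series.idxmax()`, which returns the index label of the largest value. The sketch below uses a small sample frame (`sample_df`, a hypothetical name) built from the first five rows printed in Exercise 3, rather than the full dataset.

```python
import pandas as pd

# Sample rows copied from the dataframe printout in Exercise 3
sample_df = pd.DataFrame({
    "country": ["Argentina"] * 5,
    "food_category": ["Pork", "Poultry", "Beef", "Lamb & Goat", "Fish"],
    "consumption": [10.51, 38.66, 55.48, 1.56, 4.36],
    "co2_emmission": [37.20, 41.53, 1712.00, 54.63, 6.96],
})

# idxmax() gives the index label of the largest value; .loc fetches that row
max_label = sample_df["co2_emmission"].idxmax()
max_row = sample_df.loc[max_label]

print(max_row["country"], max_row["food_category"], max_row["co2_emmission"])
```

One design difference: `idxmax()` returns only the first occurrence, whereas the boolean filter used in the cell above would return every row tied for the maximum.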

### Exercise 6: Calculate which countries produce more emissions than 1000 Kg/CO2
#### Steps:
* Filter the dataframe to keep only rows with co2_emmission values greater than 1000 Kg/CO2
* Print filtered results

In [20]:
co2_py = food_consumption_df[food_consumption_df["co2_emmission"] > 1000]
print(co2_py)

       country food_category  consumption  co2_emmission
2    Argentina          Beef        55.48        1712.00
13   Australia          Beef        33.86        1044.85
57         USA          Beef        36.24        1118.29
90      Brazil          Beef        39.25        1211.17
123    Bermuda          Beef        33.15        1022.94



The results show that five countries produce more than 1000 kg CO2 per person per year from beef.

### Exercise 7: Calculate which country consumes the least amount of beef
#### Steps:
* Filter the dataset to include only beef data
* Find the minimum consumption value
* Print the country associated with that minimum consumption

In [21]:
beef_df = food_consumption_df[food_consumption_df["food_category"] == "Beef"][["country", "consumption"]]

beef_min = beef_df["consumption"].min()

country_beef_min = beef_df[beef_df["consumption"] == beef_min]

print(country_beef_min)

      country  consumption
1410  Liberia         0.78


### Exercise 8: Calculate total emissions of meat products in dataset
#### Steps:
* Define meat categories
* Filter the dataframe to include only those categories
* Sum CO2 emissions for each category
* Sum across all categories to get total emissions
* Print the total

In [22]:
meat_categories = ["Pork", "Poultry", "Fish", "Lamb & Goat", "Beef"]

meat_df = food_consumption_df[food_consumption_df["food_category"].isin(meat_categories)]

meat_sum = meat_df.groupby("food_category")["co2_emmission"].sum()

total_meat_emmiss = meat_sum.sum()

print(total_meat_emmiss)

74441.13


### Exercise 9: Calculate total emissions of all other products in dataset
#### Steps:
* Filter out meat categories to get only non-meat data
* Sum CO2 emissions for each non-meat category
* Sum across all non-meat categories to get the total
* Print the total

In [23]:
non_meat_df = food_consumption_df[~food_consumption_df["food_category"].isin(meat_categories)]

non_meat_sum = non_meat_df.groupby("food_category")["co2_emmission"].sum()

total_non_meat_emmiss = non_meat_sum.sum()

print(total_non_meat_emmiss)

31927.98
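
As a quick consistency check, the two totals can be combined and compared with the mean from Exercise 4. The sketch below copies the printed values from the outputs above and assumes the `co2_emmission` column has no missing values, so that the mean equals the grand total divided by the 1430 rows.

```python
# Values copied from the notebook outputs above
total_meat = 74441.13
total_non_meat = 31927.98
n_rows = 1430

grand_total = total_meat + total_non_meat

# If no values are missing, this should match the mean from Exercise 4 (74.383993...)
implied_mean = grand_total / n_rows

# Share of all emissions in the dataset attributed to the meat categories
meat_share = total_meat / grand_total

print(round(grand_total, 2))
print(round(implied_mean, 6))
print(round(meat_share, 3))
```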


## Final Exercise

### 1. Import pandas and read CSV dataset
#### Steps:
* Import pandas with the alias pd
* Import and read the world_cities.csv dataset as a dataframe (named `world_cities_df` here)

In [24]:
import pandas as pd

dataset = 'world_cities.csv'

world_cities_df = pd.read_csv(dataset)

print(world_cities_df)

                     city       country     pop    lat    lon  capital
0      'Abasan al-Jadidah     Palestine    5629  31.31  34.34        0
1      'Abasan al-Kabirah     Palestine   18999  31.32  34.35        0
2            'Abdul Hakim      Pakistan   47788  30.55  72.11        0
3      'Abdullah-as-Salam        Kuwait   21817  29.36  47.98        0
4                   'Abud     Palestine    2456  32.03  35.07        0
...                   ...           ...     ...    ...    ...      ...
43640           az-Zubayr          Iraq  124611  30.39  47.71        0
43641            az-Zulfi  Saudi Arabia   54070  26.30  44.80        0
43642       az-Zuwaytinah         Libya   21984  30.95  20.12        0
43643        s-Gravenhage   Netherlands  479525  52.07   4.30        0
43644     s-Hertogenbosch   Netherlands  135529  51.68   5.30        0

[43645 rows x 6 columns]


### 2. Calculate new column
#### Steps:
* Convert the pop column (population) to millions by dividing it by 1,000,000
* Save it in a new column called "pop_M"
* Print the first few rows to confirm that the new "pop_M" has appeared


In [25]:
world_cities_df['pop_M'] = world_cities_df['pop'] / 1_000_000
print(world_cities_df.head())

                 city    country    pop    lat    lon  capital     pop_M
0  'Abasan al-Jadidah  Palestine   5629  31.31  34.34        0  0.005629
1  'Abasan al-Kabirah  Palestine  18999  31.32  34.35        0  0.018999
2        'Abdul Hakim   Pakistan  47788  30.55  72.11        0  0.047788
3  'Abdullah-as-Salam     Kuwait  21817  29.36  47.98        0  0.021817
4               'Abud  Palestine   2456  32.03  35.07        0  0.002456


### 3. Remove column
#### Steps:
* Remove the original column named "pop" using "drop(columns= )"
* Assign the result back to the dataframe and print it to make sure the change is applied

In [26]:
world_cities_df = world_cities_df.drop(columns=['pop'])
print(world_cities_df.head())

                 city    country    lat    lon  capital     pop_M
0  'Abasan al-Jadidah  Palestine  31.31  34.34        0  0.005629
1  'Abasan al-Kabirah  Palestine  31.32  34.35        0  0.018999
2        'Abdul Hakim   Pakistan  30.55  72.11        0  0.047788
3  'Abdullah-as-Salam     Kuwait  29.36  47.98        0  0.021817
4               'Abud  Palestine  32.03  35.07        0  0.002456


### 4. Subset a city starting with the letter F
#### Steps:
* Filter rows where the city name starts with "F", by accessing the city column as strings (`.str`)
* Print the results

In [27]:
F_cities = world_cities_df[world_cities_df['city'].str.startswith('F')]
print(F_cities)

             city           country    lat     lon  capital     pop_M
11083       Fa'id             Egypt  30.32   32.31        0  0.019609
11084        Faaa  French Polynesia -17.54 -149.59        0  0.029740
11085      Faaaha  French Polynesia -16.60 -151.46        0  0.000451
11086     Faaborg           Denmark  55.10   10.25        0  0.007235
11087      Faaite  French Polynesia -16.75 -145.23        0  0.000366
...           ...               ...    ...     ...      ...       ...
11989      Fuzuli        Azerbaijan  39.60   48.15        0  0.026932
11990  Fyodorovka        Kazakhstan  51.21   51.98        0  0.005355
11991  Fyodorovka        Kazakhstan  53.64   62.69        0  0.008148
11992    Fyresdal            Norway  59.18    8.10        0  0.000359
11993        Fyti            Cyprus  34.93   32.55        0  0.000102

[911 rows x 6 columns]


### 5. Subset the five biggest cities from the country where F city is
#### Steps:
* I have chosen the city of Faaa in French Polynesia
* Filter cities for French Polynesia
* Sort the "pop_M" in descending order
* Select only the top five cities
* Print results

In [28]:
fpoly_cities = F_cities[F_cities['country'] == 'French Polynesia']

fpoly_cities_sorted = fpoly_cities.sort_values(by='pop_M', ascending=False)

top_five_cities = fpoly_cities_sorted.head(5)

print(top_five_cities)

         city           country    lat     lon  capital     pop_M
11084    Faaa  French Polynesia -17.54 -149.59        0  0.029740
11089  Faanui  French Polynesia -16.47 -151.74        0  0.002186
11090  Faaone  French Polynesia -17.66 -149.29        0  0.001798
11217    Fare  French Polynesia -16.69 -151.01        0  0.001575
11499   Fitii  French Polynesia -16.72 -151.02        0  0.001167


The results show the five most populous cities in French Polynesia, the country where Faaa is located. Because the filter was applied to `F_cities`, only cities starting with "F" are included.
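
If the intention is the five biggest cities of the country overall, the country filter would need to run on the full dataframe rather than on `F_cities`. The sketch below shows that approach on a small made-up table; all city names, country names, and populations in it are hypothetical placeholders, not values from world_cities.csv.

```python
import pandas as pd

# Hypothetical mini-table standing in for world_cities_df
cities = pd.DataFrame({
    "city": ["Fend", "Arla", "Bost", "Cray", "Dunn", "Eber", "Gant", "Zell"],
    "country": ["Xland"] * 7 + ["Yland"],
    "pop_M": [0.5, 2.0, 1.5, 0.1, 0.9, 0.3, 1.2, 5.0],
})

# Step 1: pick a city starting with "F" and read off its country
f_city = cities[cities["city"].str.startswith("F")].iloc[0]

# Step 2: filter the FULL table (not just the F cities) by that country
same_country = cities[cities["country"] == f_city["country"]]

# Step 3: keep the five rows with the largest population
top_five = same_country.nlargest(5, "pop_M")

print(top_five)
```

Here `nlargest(5, "pop_M")` is shorthand for sorting by `pop_M` in descending order and taking the first five rows.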