<a href="https://colab.research.google.com/github/CosiMichele/2503-carp-biat/blob/megh/BIAT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bring It All Together

All thanks go to Sarah Stueve and Megh Krishnaswamy for their work on the notebook.

The goal of this notebook is to exercise the Python, Git, and command-line knowledge acquired at the University of Arizona's Software Carpentries Spring workshop.

We will be using Google Colab as it allows for the use of the Command Line as well as Python.

The flow of this exercise is the following:

1. Setup
   - Open this Notebook in Google Colab
   - Obtain the required `csv`
   - Load the required packages
2. Clean the imported `csv` and extract years
3. Plot and visualize data
   - Create a list
   - Filtering
   - Plotting
4. Visualize text data
5. Saving things in Git


---

## 1. Download, Import and Clean Data

In the next cell, use a bash one-liner (`wget <raw GitHub URL>`) to download the dataset from the GitHub repository into Google Colab's file system. Later, use the correct functions to import the raw CSV file and index it by country.

To find the raw GitHub URL for a file, click the 'raw' button in the top right corner of the file's GitHub page.
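
As a hedged illustration of the import step, the sketch below loads a CSV with `pandas.read_csv` and uses `index_col` to choose the column that supplies the row labels. The CSV text and the names `city`, `region`, and `temp_2001` are made up for this example and are not part of the gapminder file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded file.
csv_text = """city,region,temp_2001,temp_2002
Tucson,Southwest,21.5,21.9
Seattle,Northwest,11.3,11.6
Boston,Northeast,10.9,11.1
"""

# index_col tells pandas which column supplies the row labels.
toy = pd.read_csv(io.StringIO(csv_text), index_col="city")

print(toy.index.tolist())
print(toy.loc["Tucson", "temp_2001"])
```

With a real file on disk, a file path would take the place of `io.StringIO(csv_text)`.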

![](https://raw.githubusercontent.com/CosiMichele/2503-carp-biat/main/github_raw_button.png)

Then download the CSV file from the web URL:

In [None]:
!wget https://raw.githubusercontent.com/CosiMichele/2503-carp-biat/refs/heads/main/gapminder_all.csv

---

## 2. Import required packages and data

In [1]:
# Use this cell to load packages (pandas and matplotlib)

import <insert package> as pd
import <insert package> as plt

In [4]:
# Import the data and set index

raw_data = pd.<insert code>

### 2.1 Use the right function to drop the `continent` column.

In [5]:
data = raw_data.<insert code>

### 2.2 Filter the columns

When filtering columns, you can use `data.drop(columns=data.filter(like="<column to drop>").columns)`. Here, `filter` with `like` selects every column whose name contains the given string, and `drop` removes those columns from the dataframe.
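
The pattern above can be sketched on a small made-up dataframe. The column names (`region`, `temp_*`, `rain_*`) and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "region": ["Southwest", "Northwest"],
        "temp_2001": [21.5, 11.3],
        "temp_2002": [21.9, 11.6],
        "rain_2001": [300, 950],
        "rain_2002": [280, 990],
    },
    index=["Tucson", "Seattle"],
)

# filter(like=...) SELECTS the columns whose names contain the substring.
rain_cols = toy.filter(like="rain_").columns

# drop(columns=...) then REMOVES those columns and returns a new dataframe.
no_rain = toy.drop(columns=rain_cols)

# A single column can be dropped by name in the same way.
temps_only = no_rain.drop(columns="region")

print(list(temps_only.columns))
```

Note that `drop` returns a new dataframe, so `toy` still has all of its columns afterwards; the result has to be assigned to a variable to be kept.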

In the next 2 cells, first drop the `lifeExp_` columns and then the `pop_` columns.

In [6]:
data = data.<insert code>

In [7]:
data = data.<insert code>

### 2.3 Print index

To continue, you will need each country's name in the exact format in which it appears in the dataframe. In the next cell, use a `for` loop to print the index (which, in this case, holds the country names).

In [None]:
for i in data.index:
    <insert code>

### 2.4 Extract years

Extract the year from the last 4 characters of each column name.

The current column names are structured as `gdpPercap_(year)`, so we want to keep the `(year)` part only for clarity when plotting GDP vs. years.

To do this we use `replace()`, which substitutes one substring for another; replacing the prefix with an empty string removes it. Because `replace()` is a string method, we call it through the `.str` accessor, which applies pandas' vectorized string functions to every column name at once.
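
A minimal sketch of this idea on made-up data follows. The `temp_` prefix and the city names are hypothetical; the same two steps (strip the prefix, then convert to integers) are what the next cell asks for.

```python
import pandas as pd

toy = pd.DataFrame(
    {"temp_2001": [21.5, 11.3], "temp_2002": [21.9, 11.6]},
    index=["Tucson", "Seattle"],
)

# .str applies a string method to every column name at once.
years = toy.columns.str.replace("temp_", "")

# The names are still strings here; astype(int) converts them to integers.
toy.columns = years.astype(int)

print(toy.columns.tolist())
```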

In [9]:
years = data.<insert code>

# Convert year values to integers, saving results back to dataframe

data.columns = years.<insert code>

---

## 3. Visualizing Numerical Data: Listing and Plotting

In the next cell, select a country to plot.

In [None]:

data.<insert code>

### 3.1 Create a list

Compare 5 countries of your choice. Create a list of countries that you are interested in.

In [11]:
sel_countries = [<insert countries>]

### 3.2 Use index.isin() to filter

`.isin(sel_countries)` checks whether each value in the index is present in the list `sel_countries`. Make sure to save to a different dataframe.
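
To make the mechanics concrete, here is a small sketch on a made-up dataframe (the city names and the `wanted` list are hypothetical). `isin` produces one True/False value per row label, and that mask is then used to select rows.

```python
import pandas as pd

toy = pd.DataFrame(
    {2001: [21.5, 11.3, 10.9], 2002: [21.9, 11.6, 11.1]},
    index=["Tucson", "Seattle", "Boston"],
)

wanted = ["Tucson", "Boston", "Denver"]  # "Denver" is not in the index

# One True/False value per row label.
mask = toy.index.isin(wanted)

# Boolean indexing keeps only the rows where the mask is True.
subset = toy[mask]

print(mask.tolist())
print(subset.index.tolist())
```

Names in the list that do not appear in the index are simply ignored, which is why checking the exact spelling of each country first is useful.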

In [12]:
data_countries = data[<insert code>]

Plot the data. Use "Years" as the x-axis label and "GDP per capita" as the y-axis label, and save the image with dpi=300.
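
As a rough sketch of labelling and saving a plot, the example below uses a made-up dataframe whose columns are years. The data, the label text, the output file name, and the use of the non-interactive `Agg` backend are assumptions for illustration; in Colab the backend line is not needed.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame(
    {2001: [21.5, 11.3], 2002: [21.9, 11.6], 2003: [22.4, 11.2]},
    index=["Tucson", "Seattle"],
)

# Transposing puts the years on the rows, so each city becomes one line.
ax = toy.T.plot()
ax.set_xlabel("Year")
ax.set_ylabel("Temperature")

# savefig writes the current figure to disk; dpi sets the resolution.
out_path = os.path.join(tempfile.mkdtemp(), "toy_plot.png")
plt.savefig(out_path, dpi=300)
plt.close("all")
```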

In [None]:
data_countries.<insert code>

---
## 4. Visualizing Text Data: Plotting and Representing Frequency Distributions

For continuously distributed data, plots are straightforward to generate. But what about **strings**?

Let's say you want to find out how many countries there are in each continent. For this, we need to access the `continent` column in each row and count how many rows fall into each continent bin. This gives us numerical data from strings, which can then be plotted.


### 4.1 Using the right data

In [None]:
print(raw_data.head(), "\n")

In [None]:
print(data.head())

Can we use the filtered dataframe we created in Step 2? Why/Why not?

### 4.2 What are the continent names in this dataframe?

Multiple countries can belong to the same continent, so the values in the `continent` column are repeated.

If we want to find out which continents are listed in the dataframe, we only need to list each possible value in the column `continent` once, i.e., unique values in the column.
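
As a small illustration of listing each value only once, here is a sketch that uses the pandas `unique()` method on a made-up dataframe (the `region` column and its values are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame(
    {"region": ["West", "West", "East", "West", "East"]},
    index=["Tucson", "Seattle", "Boston", "Denver", "Miami"],
)

# unique() returns each distinct value once, in order of first appearance.
regions = toy["region"].unique()

print(list(regions))
print(len(regions))
```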

We use the function `<dataframe>['col_name'].unique()` for this:

In [None]:
# Print unique values in the column
continents = raw_data[<insert col name>].unique()
print(continents)


How many continents are present in this dataframe? Are any continents missing?

### 4.2 Filtering rows by continent:

Create a list of continents you want to plot:

In [24]:
# Filter the DataFrame to keep only specified continents
allowed_continents = [<insert continents>]

Filter data using `.isin()`.

In [25]:
filtered_data = raw_data[<insert code>]


### 4.3 Counting Countries in each continent

Next, we want to use the `continent` column and count how many rows each continent name appears in. This gives us a frequency distribution!

For this, we use the `.value_counts()` method:

In [None]:
# Print frequency table for each continent in the filtered dataframe
filtered_continent_freq_distr = filtered_data[<insert column>].value_counts()
<insert code>


In [None]:
# Print frequency table for each continent in the raw dataframe
raw_continent_freq_distr = raw_data[<insert columns>].value_counts()
<insert code>

This also allows us to check whether our filter worked correctly.
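
The comparison can be sketched on made-up data as follows. The `region` column, its values, and the variable names are hypothetical; the point is only that a label excluded by the filter should disappear from the counts.

```python
import pandas as pd

toy = pd.DataFrame(
    {"region": ["West", "West", "East", "West", "South"]},
    index=["Tucson", "Seattle", "Boston", "Denver", "Miami"],
)

# Count how many rows carry each region label.
all_counts = toy["region"].value_counts()

# Keep only two regions, then count again.
kept = toy[toy["region"].isin(["West", "East"])]
kept_counts = kept["region"].value_counts()

print(all_counts.to_dict())
print(kept_counts.to_dict())

# If the filter worked, the excluded label no longer appears in the counts.
print("South" in kept_counts.index)
```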

### 4.4 Visualising word frequencies with Bar Plots

We will now use the frequency distributions saved in the variables `filtered_continent_freq_distr` and `raw_continent_freq_distr` to represent how many countries each continent has.

Plot the data. Use "Continent" as x-axis label, "Frequency" as y-axis label, and save the image with dpi=300. Add a title if you want!

In [None]:
# Generate bar plot for continent frequency in filtered dataframe:
filtered_continent_freq_distr.plot(kind='bar')
plt.xlabel('Continent')
plt.ylabel('Frequency')
plt.title('Continent Frequency Bar Plot')
plt.savefig("continents_comparison.png", dpi=300)

In [None]:
# Use the code above to generate a bar plot for all continents:
# Generate bar plot for continent frequency in raw dataframe:
raw_continent_freq_distr.plot(kind='bar')
plt.xlabel('Continent')
plt.ylabel('Frequency')
plt.title('Continent Frequency Bar Plot')

### 4.5 Visualising word frequencies with Wordclouds


Wordclouds are a fun tool to visualise how frequently a set of strings appear in a dataset. More frequent strings will be larger in size.

For this, we will import the `wordcloud` Python library and reuse the frequency distributions we printed in 4.3:

In [73]:
from wordcloud import WordCloud

In [74]:
filtered_wordcloud = WordCloud(width=500, height=500,
                      background_color='white').generate_from_frequencies(
                       filtered_continent_freq_distr)

In [None]:
# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(filtered_wordcloud)
plt.axis('off')
plt.savefig("continents_comparison_wordcloud.png", dpi=300)

### Which continent has the most countries?

---

## 5. Save your changes to GitHub

To save your changes to GitHub, you do not need to use the command line. Instead, use the button at the top of the page to save to a repository of your choice.

![](https://raw.githubusercontent.com/CosiMichele/2503-carp-biat/refs/heads/main/save-in-git.png)

![](https://raw.githubusercontent.com/CosiMichele/2503-carp-biat/refs/heads/main/save-in-git2.png)

---

## 6. On your computer, pull or clone the repository using git commands and open any saved images.

...aaaaand you're done! 

Thank you for taking part in this workshop! It's been a pleasure having you <3