In [54]:
import pandas as pd
import numpy as np

# Questions

<b>Notes:</b> 
<ul>
<li> Each of these questions can be solved in several ways. Which one you choose doesn't matter, as long as it works. Each is doable with simple, basic loops and conditions. They are also doable with other techniques that produce shorter, more concise code, but you'd need to search for and implement them. </li>
<li> Work incrementally - can you grab the inputs, can you print them, can you change one thing, can you return a dummy value, can you loop through the data without doing anything, can you print in that loop to make sure you are hitting all items, etc... each little bit is relatively easy on its own. </li>
<li> Build the logic in English (pseudo code) first, then translate it to Python. </li>
<li> Create some kind of test to confirm that you are producing the correct output, and so that you can see the output, perhaps with some extra detail, as you build. These should be pretty simple to test. </li>
</ul>
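
As a rough illustration of the incremental, test-as-you-go workflow described in the notes, here is a minimal sketch. The function `count_evens` is a hypothetical stand-in and is not one of the questions; the point is only the pattern of printing inside the loop while developing and checking known answers with `assert`.

```python
def count_evens(values):
    """Hypothetical example function: count the even numbers in a list."""
    count = 0
    for value in values:
        # Print inside the loop while developing to confirm every item is visited
        print("checking", value)
        if value % 2 == 0:
            count += 1
    return count

# Simple checks with known answers; an AssertionError means the output is wrong
assert count_evens([1, 2, 3, 4]) == 2
assert count_evens([]) == 0
print("All tests passed")
```

Once the checks pass, the debugging `print` inside the loop can be removed.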

In [55]:
import sys
# Grab Data
FILE_1 = "Internation_students_Canada.csv"
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    print("Code is running in Google Colab")
    #!wget -nc https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/Internation_students_Canada.csv
else:
    print("Code is not running in Google Colab")


Code is not running in Google Colab


## Standard Deviation

Write a function that takes in a data structure such as a list, and calculates the standard deviation of the data. If there are <b><i>any</i></b> non-numeric values in the list, the function should return -1. 

<b>Note:</b> you do not need to manually calculate the standard deviation, you can use an existing library.

In [56]:
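def calculate_standard_deviation(data):
    """Return the standard deviation of data, or -1 if any value is non-numeric."""
    # Check if the data contains non-numeric values
    if any(not isinstance(x, (int, float)) for x in data):
        return -1
    # np.std defaults to the population standard deviation (ddof=0)
    return np.std(data)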


In [57]:
test_list_1 = [1, 2, 3, 4, 5]
test_list_2 = [1, 2, 3, 4, 5, 'a']
print(calculate_standard_deviation(test_list_1))
print(calculate_standard_deviation(test_list_2))

1.4142135623730951
-1
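
Two details of the result above are worth checking. The sketch below assumes NumPy's default behaviour and is an aside, not part of the required answer. `np.std` divides by N unless `ddof` is set, so the value printed above is the population standard deviation. Also, `bool` is a subclass of `int`, so `True` and `False` would pass an `isinstance(x, (int, float))` check.

```python
import numpy as np

data = [1, 2, 3, 4, 5]

# Default ddof=0: divide by N (population standard deviation)
population_std = np.std(data)

# ddof=1: divide by N - 1 (sample standard deviation)
sample_std = np.std(data, ddof=1)

print(population_std)
print(sample_std)

# Caveat: bool is a subclass of int, so booleans count as numeric here
bool_is_numeric = isinstance(True, (int, float))
print(bool_is_numeric)
```

Which of the two standard deviations is wanted depends on whether the list is treated as the whole population or as a sample; the question does not say, so either reading is an assumption.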


## Written Digits

Write a function that takes a number as input and returns a list containing each of its digits (0-9) <b>written out as a string</b>. For example, if the input is 123, the output should be ['one', 'two', 'three'].

In [58]:
def number_to_string_digits(num):
    digit_map = {
        '0': 'zero',
        '1': 'one',
        '2': 'two',
        '3': 'three',
        '4': 'four',
        '5': 'five',
        '6': 'six',
        '7': 'seven',
        '8': 'eight',
        '9': 'nine'
    }
    return [digit_map[digit] for digit in str(num)]

# Example usage
number = 123
result = number_to_string_digits(number)
print(result)  # Output: ['one', 'two', 'three']


['one', 'two', 'three']
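
The same result can be reached with a basic loop, as the notes at the top suggest. The sketch below is one possible variant, not the required solution. The names `DIGIT_NAMES` and `number_to_string_digits_loop` are made up for illustration, and it assumes the input is a non-negative integer (a minus sign or decimal point would raise a `ValueError`).

```python
# Position in the list matches the digit, so DIGIT_NAMES[3] is 'three'
DIGIT_NAMES = ['zero', 'one', 'two', 'three', 'four',
               'five', 'six', 'seven', 'eight', 'nine']

def number_to_string_digits_loop(num):
    """Hypothetical loop-based variant; assumes a non-negative integer."""
    words = []
    for character in str(num):
        # int(character) doubles as the index into DIGIT_NAMES
        words.append(DIGIT_NAMES[int(character)])
    return words

# Simple tests with known answers
assert number_to_string_digits_loop(123) == ['one', 'two', 'three']
assert number_to_string_digits_loop(0) == ['zero']
assert number_to_string_digits_loop(9050) == ['nine', 'zero', 'five', 'zero']
print("All tests passed")
```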


## Stats Per Country

This dataset contains a listing of the number of international students in Canada from each country, for each year 2015-2023. Create a function that takes in a dataframe in the format of the one loaded here, and returns a dataframe with the highest amount, lowest amount, and the year of the highest amount for each country. For example, one row of output would be:

``` python
Country	    High    Low	    Highest Year
Afghanistan 170     80	    2022
```

<b>Notes:</b> 
<ul>
<li> This is doable both with a loop and directly with loop-free calculations. Looping is conceptually much easier: loop through each row, grab the data you need, manipulate it as needed, then add it to an output.</li>
<li> Some commands/functions that might be useful (not necessarily; depending on what you do, you may or may not need these):
    <ul>
    <li> pd.concat() </li>
    <li> dataframe.columns </li>
    </ul>
<li> A sample for loop that loops through each row in a dataframe:
    <ul>
    <li> for index, row in df.iterrows(): </li>
    </ul>
<li> Again, depending on what you do, you may get warnings about adding/combining data into a dataframe. You can ignore these; they just mean that the way a function works in pandas will change soon. </li>
</ul>

In [59]:
df_students = pd.read_csv(FILE_1)
df_students.head()

Unnamed: 0,Country of Citizenship,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Afghanistan,95,115,95,80,95,90,80,170,140
1,Albania,115,165,185,245,375,250,305,345,545
2,Algeria,1060,845,1020,1490,2690,2170,3165,5360,7180
3,Andorra,0,0,0,0,0,0,10,5,0
4,Angola,65,80,40,25,120,30,50,75,65


In [60]:
def calculate_range_and_highest_column(df):
    # Every column except the country name is a year column
    year_columns = df.columns.drop('Country of Citizenship')
    rows = []
    for index, row in df.iterrows():
        country = row['Country of Citizenship']
        highest = None
        lowest = None
        high_col = ""
        for column in year_columns:
            # Strict comparison keeps the earliest year when there is a tie
            if highest is None or row[column] > highest:
                highest = row[column]
                high_col = column
            if lowest is None or row[column] < lowest:
                lowest = row[column]
        rows.append({'Country': country, 'High': highest, 'Low': lowest, "Highest Year": high_col})
    # Build the output once; concatenating inside the loop is slow and can trigger pandas warnings
    return pd.DataFrame(rows, columns=['Country', 'High', 'Low', "Highest Year"])


In [61]:
calculate_range_and_highest_column(df_students)

Unnamed: 0,Country,High,Low,Highest Year
0,Afghanistan,170,80,2022
1,Albania,545,115,2023
2,Algeria,7180,845,2023
3,Andorra,10,0,2021
4,Angola,120,25,2019
...,...,...,...,...
212,"Virgin Islands, British",0,0,2015
213,Western Sahara,0,0,2015
214,Yemen,275,155,2018
215,Zambia,195,105,2023
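
The notes for this question mention that a loop-free solution is also possible. A minimal sketch of that idea follows, assuming every column other than `Country of Citizenship` is a numeric year column. The function name `stats_per_country_vectorized` and the small `df_sample` frame (three rows copied from the `head()` output above) are made up for illustration so the block runs without the CSV file.

```python
import pandas as pd

# Three rows copied from the head() output above, so the sketch runs without the CSV
columns = ['Country of Citizenship', '2015', '2016', '2017', '2018',
           '2019', '2020', '2021', '2022', '2023']
rows = [
    ['Afghanistan', 95, 115, 95, 80, 95, 90, 80, 170, 140],
    ['Albania', 115, 165, 185, 245, 375, 250, 305, 345, 545],
    ['Andorra', 0, 0, 0, 0, 0, 0, 10, 5, 0],
]
df_sample = pd.DataFrame(rows, columns=columns)

def stats_per_country_vectorized(df):
    """Hypothetical loop-free variant of the per-country statistics."""
    # Move the country into the index so only year columns remain
    years = df.set_index('Country of Citizenship')
    result = pd.DataFrame({
        'High': years.max(axis=1),             # largest value in each row
        'Low': years.min(axis=1),              # smallest value in each row
        'Highest Year': years.idxmax(axis=1),  # column label of the first maximum
    })
    # Turn the index back into a regular column named 'Country'
    return result.rename_axis('Country').reset_index()

print(stats_per_country_vectorized(df_sample))
```

`idxmax` returns the first column holding the maximum, so ties resolve to the earliest year, as in the looped output above where all-zero rows report 2015.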
