<font color='darkred'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file. If you like, you're welcome to adjust the *app\.py* file, but it is not required.

## Notes on Recursion

A [recursive function](https://www.w3schools.com/python/gloss_python_function_recursion.asp) is one which calls itself.

1. When the function is called, Python runs through each line of code until it reaches the point where the function calls itself.
2. At that point, the current call's variables are saved in memory, and a new call starts from the top with a different passed argument. This repeats every time the function calls itself.
3. Eventually, this process will stop at the "bottom of the **stack**", where the function returns without calling itself again (typically because the latest passed argument meets, or fails, some condition).
4. Then, Python will work its way back up the stack, finishing each waiting call in turn, until it reaches the final result. For example, take a look at [this visual example](https://realpython.com/python-recursion/#calculate-factorial) of calculating 4!.
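
The four steps above can be made visible with a small trace. This is only an illustrative sketch: the `depth` parameter is an assumption added here to indent the printed messages, and it is not something the exercises require.

```python
def factorial(n, depth=0):
    """Return n! recursively, printing each call and each return."""
    indent = "  " * depth  # deeper calls are indented further
    print(f"{indent}factorial({n}) called")
    if n <= 1:
        # Stopping point: no further recursive call is made here.
        print(f"{indent}factorial({n}) returns 1")
        return 1
    # This line waits until the inner call has returned its result.
    result = n * factorial(n - 1, depth + 1)
    print(f"{indent}factorial({n}) returns {result}")
    return result


answer = factorial(4)
print(answer)  # 24
```

Reading the printed lines from top to bottom shows the calls going down to `factorial(1)` and then the results coming back up in reverse order.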

When you write these functions, keep a few things in mind:

- You will need a built-in stopping point (i.e., the "bottom"), where your function returns a result without calling itself again.
- **Don't think too hard about this.** Recursion can be perplexing to conceptualize when writing the code. So, when you call the function inside the function, think about it as a magical "hidden" function that has already done what you want it to do.
- [Python Tutor](https://pythontutor.com/) ([editor](https://pythontutor.com/visualize.html#mode=edit)) can be a helpful resource for this exercise!
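
As a rough illustration of the first two points, here is a sketch that sums a list recursively. The function name `total` is hypothetical and unrelated to the graded exercises.

```python
def total(values):
    """Return the sum of a list of numbers using recursion."""
    if not values:
        # Stopping point: an empty list sums to 0, so no further call is needed.
        return 0
    # Treat total(values[1:]) as a "hidden" function that already
    # returns the sum of everything after the first item.
    return values[0] + total(values[1:])


print(total([3, 1, 4, 1, 5]))  # 14
```

The stopping point is checked first, and the recursive call is always made on a shorter list, so every chain of calls eventually reaches the empty list.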

## Exercise 1

The Fibonacci Series starts with 0 and 1. Each of the following numbers is the sum of the previous two numbers in the series:

`0 1 1 2 3 5 8 13 21 34 ...`

So, `fibonacci(9) = 34`.

Write a recursive function (`fibonacci`) that, given `n`, will return the `n`th number of the Fibonacci Series.

*Test your function using Google or any other tool that can calculate the Fibonacci Series.*

In [34]:
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

# Try test cases
print(fibonacci(10))  # Output: 55
print(fibonacci(20))  # Output: 6765
print(fibonacci(30))  # Output: 832040
print(fibonacci(35))  # Output: 9227465

55
6765
832040
9227465
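
The plain recursive version recomputes the same values many times, so it slows down quickly as `n` grows. One possible variation, sketched below, keeps the recursion but caches earlier results with `functools.lru_cache` from the standard library. The name `fibonacci_cached` and the check for negative input are assumptions added for illustration; the exercise itself only asks for `fibonacci`.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci_cached(n):
    """Return the n-th Fibonacci number, reusing results of earlier calls."""
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    if n < 2:
        # Stopping points: fibonacci(0) = 0 and fibonacci(1) = 1.
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)


print(fibonacci_cached(9))   # 34
print(fibonacci_cached(35))  # 9227465
```

With the cache in place, each value of `n` is computed only once, so the number of calls grows roughly linearly with `n` instead of exponentially.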



## Exercise 2

Write a (single) recursive function, `to_binary()`, that [converts](https://en.wikipedia.org/wiki/Binary_number#Conversion_to_and_from_other_numeral_systems) an integer into its [binary](https://en.wikipedia.org/wiki/Binary_number) representation. So, for example:

```python
to_binary(2)   -->  10
to_binary(12)  -->  1100
```

*Note: you can test your function with the built-in `bin()` function.*

In [None]:
def to_binary(n):
    if n == 0:
        return "0"
    elif n == 1:
        return "1"
    else:
        return to_binary(n // 2) + str(n % 2)

# Try test cases
print(to_binary(10))  # Output: "1010" 
print(to_binary(20))  # Output: "10100"
print(to_binary(30))  # Output: "11110"
print(to_binary(35))  # Output: "100011"


1010
10100
11110
100011
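
The comparison with `bin()` suggested in the note can be automated. The sketch below repeats the function from the cell above so that it runs on its own, and it assumes the string return type used there. Because `bin()` returns a string with a `0b` prefix, the prefix is sliced off before comparing. Only non-negative integers are checked, since the function has no stopping point for negative input.

```python
def to_binary(n):
    """Return the binary digits of a non-negative integer as a string."""
    if n == 0:
        return "0"
    elif n == 1:
        return "1"
    else:
        return to_binary(n // 2) + str(n % 2)


# Compare against the built-in for a range of inputs.
for n in range(0, 1025):
    assert to_binary(n) == bin(n)[2:], n

print("to_binary matches bin() for 0 through 1024")
```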


## Exercise 3 

Use the raw Bellevue Almshouse Dataset (`df_bellevue`) extracted at the top of the lab (i.e., with `pd.read_csv ...`).

**Write a function for each of the following tasks. Name these functions `task_i()`**, where `i` is the task number (i.e., `task_1()` through `task_4()`, without any input arguments).

1. Return a list of all column names, *sorted* such that the first column has the *fewest* missing values, and the last column has the *most* missing values (use the raw column names).
   - *Note: there is an issue with the `gender` column you'll need to remedy first ...*
2. Return a **data frame** with two columns:
   - the year (for each year in the data), `year`
   - the total number of entries (immigrant admissions) for each year, `total_admissions`
3. Return a **series** with:
   - Index: gender (for each gender in the data)
   - Values: the average age for the indexed gender.
4. Return a list of the 5 most common professions *in order of prevalence* (so, the most common is first).

For each of these, if there are messy data issues, use `print` to explain them.


In [36]:
import pandas as pd

# Load the dataset directly from the URL in exercises.ipynb
url = 'https://github.com/melaniewalsh/Intro-Cultural-Analytics/raw/master/book/data/bellevue_almshouse_modified.csv'
df_bellevue = pd.read_csv(url)

# Verify the data is loaded
print(df_bellevue.head())

      date_in first_name  last_name   age          disease profession gender  \
0  1847-04-17       Mary  Gallagher  28.0  recent emigrant    married      w   
1  1847-04-08       John  Sanin (?)  19.0  recent emigrant    laborer      m   
2  1847-04-17    Anthony      Clark  60.0  recent emigrant    laborer      m   
3  1847-04-08   Lawrence     Feeney  32.0  recent emigrant    laborer      m   
4  1847-04-13      Henry      Joyce  21.0  recent emigrant        NaN      m   

                     children  
0         Child Alana 10 days  
1              Catherine 2 mo  
2  Charles Riley afed 10 days  
3                       Child  
4                  Child 1 mo  


In [None]:
# Task 1: Return sorted column names by missing values
def task_1():
    # Fill missing 'gender' values with a placeholder before counting.
    # Caution: this changes the global df_bellevue in place, and any filled
    # cells no longer count as missing in the tally below.
    if 'gender' in df_bellevue.columns:
        df_bellevue['gender'] = df_bellevue['gender'].fillna('Unknown')
    
    # Sort columns based on the number of missing values (ascending order: least missing to most missing)
    missing_values = df_bellevue.isnull().sum()
    sorted_columns = missing_values.sort_values().index.tolist()
    
    print(f"Sorted columns by missing values: {sorted_columns}")
    return sorted_columns
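

The recorded Task 3 output further down lists `?` among the `gender` values. That suggests (as an assumption, not something the exercise states) that unknown genders are stored as a placeholder string rather than as a true missing value. Under that assumption, one possible approach is to convert the placeholder to a missing value on a copy before counting. The helper name `columns_by_missing`, its `placeholders` parameter, and the toy data are all hypothetical.

```python
import pandas as pd


def columns_by_missing(df, placeholders=("?",)):
    """List column names from fewest to most missing values."""
    # Work on a copy so the caller's data frame is left unchanged.
    cleaned = df.copy()
    # Assumption: these placeholder strings in 'gender' mean "unknown".
    is_placeholder = cleaned["gender"].isin(list(placeholders))
    cleaned["gender"] = cleaned["gender"].mask(is_placeholder)
    missing = cleaned.isna().sum()
    # mergesort is stable, so tied columns keep their original order.
    return missing.sort_values(kind="mergesort").index.tolist()


toy = pd.DataFrame({
    "date_in": ["1847-04-17", "1847-04-08", "1847-04-13"],
    "age": [28.0, None, None],
    "gender": ["w", "?", "m"],
})
print(columns_by_missing(toy))  # ['date_in', 'gender', 'age']
```

Passing the data frame in as an argument keeps the sketch testable on small examples; a `task_1()` wrapper without arguments could call it on `df_bellevue`.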

In [38]:
# Task 2: Return a DataFrame with year and total admissions
def task_2():
    if 'year' not in df_bellevue.columns or 'admissions' not in df_bellevue.columns:
        print("Error: 'year' or 'admissions' column missing.")
        return None
    
    # Group by 'year' and sum admissions
    total_admissions = df_bellevue.groupby('year')['admissions'].sum().reset_index()
    total_admissions.columns = ['year', 'total_admissions']
    
    print(f"Yearly admissions:\n{total_admissions}")
    return total_admissions
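

The `head()` output above shows a `date_in` column but no `year` or `admissions` column, which is why the recorded run of `task_2()` prints the error message instead of a data frame. One alternative is sketched below, assuming that the year can be parsed from `date_in` and that each row counts as one admission. The helper name `admissions_per_year` and the toy data are hypothetical.

```python
import pandas as pd


def admissions_per_year(df):
    """Count rows per admission year, using the year parsed from 'date_in'."""
    # Unparseable or missing dates become NaT instead of raising an error.
    dates = pd.to_datetime(df["date_in"], errors="coerce")
    skipped = int(dates.isna().sum())
    if skipped:
        print(f"Skipped {skipped} row(s) with a missing or unparseable date_in.")
    years = dates.dropna().dt.year
    # Each remaining row is one admission, so counting rows gives the total.
    return (
        years.value_counts()
        .sort_index()
        .rename_axis("year")
        .reset_index(name="total_admissions")
    )


toy = pd.DataFrame(
    {"date_in": ["1846-12-30", "1847-04-17", "1847-04-08", None]}
)
result = admissions_per_year(toy)
print(result)
```

On the toy data, the result has one row for 1846 with a total of 1 and one row for 1847 with a total of 2; the row without a date is reported and left out.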


In [39]:
# Task 3: Return average age for each gender
def task_3():
    if 'gender' not in df_bellevue.columns or 'age' not in df_bellevue.columns:
        print("Error: 'gender' or 'age' column missing.")
        return None
    
    # Group by 'gender' and calculate average age
    avg_age_by_gender = df_bellevue.groupby('gender')['age'].mean()
    
    print(f"Average age by gender:\n{avg_age_by_gender}")
    return avg_age_by_gender
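

In the recorded output for this task, `?` has no mean at all and the means for `g` and `h` are whole numbers, which hints that those codes may cover very few records. That is only a guess, and one way to check it is to report how many ages sit behind each mean. The sketch below uses hypothetical toy data rather than `df_bellevue`.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["m", "w", "w", "?", "g"],
    "age": [30.0, 20.0, 40.0, None, 59.0],
})

# Mean age plus the number of non-missing ages behind each mean.
summary = toy.groupby("gender")["age"].agg(["mean", "count"])
print(summary)

# Flag groups whose mean is undefined or rests on a single record.
suspect = summary[summary["count"] < 2].index.tolist()
print(f"Gender codes with fewer than two recorded ages: {suspect}")
```

A message like the last line is one way to meet the request to explain messy data with `print`; the returned series itself can stay as the plain per-gender mean.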


In [None]:
# Task 4: Return the 5 most common professions
def task_4():
    if 'profession' not in df_bellevue.columns:
        print("Error: 'profession' column missing.")
        return None
    
    # Get the top 5 most common professions
    most_common_professions = df_bellevue['profession'].value_counts().head(5)
    
    # Convert the Series to a list
    most_common_professions_list = most_common_professions.index.tolist()
    
    print(f"Top 5 most common professions:\n{most_common_professions}")
    return most_common_professions_list

In [None]:
# Task 1: Sorted column names by missing values
print("Task 1: Sorted column names by missing values:")
sorted_columns = task_1()
print(sorted_columns)
print("\n")

# Task 2: Yearly total admissions
print("Task 2: Yearly total admissions:")
total_admissions = task_2()
print(total_admissions)
print("\n")

# Task 3: Average age by gender
print("Task 3: Average age by gender:")
avg_age_by_gender = task_3()
print(avg_age_by_gender)
print("\n")

# Task 4: Top 5 most common professions
print("Task 4: Top 5 most common professions:")
most_common_professions = task_4()
print(most_common_professions)

Task 1: Sorted column names by missing values:
Sorted columns by missing values: ['date_in', 'last_name', 'gender', 'first_name', 'age', 'profession', 'disease', 'children']
['date_in', 'last_name', 'gender', 'first_name', 'age', 'profession', 'disease', 'children']


Task 2: Yearly total admissions:
Error: 'year' or 'admissions' column missing.
None


Task 3: Average age by gender:
Average age by gender:
gender
?          NaN
g    59.000000
h    56.000000
m    31.813433
w    28.725162
Name: age, dtype: float64
gender
?          NaN
g    59.000000
h    56.000000
m    31.813433
w    28.725162
Name: age, dtype: float64


Task 4: Top 5 most common professions:
Top 5 most common professions:
profession
laborer      3108
married      1584
spinster     1521
widow        1053
shoemaker     158
Name: count, dtype: int64
profession
laborer      3108
married      1584
spinster     1521
widow        1053
shoemaker     158
Name: count, dtype: int64
