# Lab 1 – Python, NumPy, and Pandas



**Importing code from `lab.py`**:

* Below, we import the `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
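    - `autoreload` is necessary because Python caches each module the first time it is imported (the loaded module is kept in `sys.modules`, and its compiled bytecode is stored in the `__pycache__` directory). Subsequent imports of `lab` reuse the already-loaded module rather than re-reading `lab.py`.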
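
To see the caching behavior that `autoreload` works around, here is a minimal sketch outside of any notebook magic. It writes a throwaway module (hypothetically named `scratch_mod`, which is not part of this lab) to a temporary directory, imports it, edits the file, and checks that a plain `import` still sees the old value until `importlib.reload` is called.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway module in a temporary directory (hypothetical name).
tmp_dir = tempfile.mkdtemp()
mod_path = Path(tmp_dir) / "scratch_mod.py"
mod_path.write_text("VALUE = 1\n")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import scratch_mod
first = scratch_mod.VALUE

# Edit the file on disk, the same way you would edit lab.py.
mod_path.write_text("VALUE = 22\n")

import scratch_mod  # no effect: the module is already cached in sys.modules
after_plain_import = scratch_mod.VALUE

importlib.reload(scratch_mod)  # re-executes the edited file
after_reload = scratch_mod.VALUE

print(first, after_plain_import, after_reload)
```

Roughly speaking, `%autoreload 2` performs this kind of reload for you before each cell runs.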

In [2]:
%load_ext autoreload
%autoreload 2

In [5]:
from lab import *

In [4]:
from pathlib import Path
import io
import pandas as pd
import numpy as np

Let's get started! 🎉

## Part 1: Python Basics 🐍

### Question 0 – Consecutive Integers

Complete the implementation of the function `consecutive_ints`, which takes in a possibly empty list of integers (`ints`) and returns `True` if there exist two adjacent list elements that are consecutive integers and `False` otherwise.

For example, `consecutive_ints([5, 3, 6, 4, 9, 8])` should evaluate to `True`, since `9` is next to `8` and the two are consecutive integers. On the other hand, `consecutive_ints([1, 3, 5, 7, 9])` should evaluate to `False`.
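
One possible way to look at neighboring elements is to pair the list with a copy of itself shifted by one position. The sketch below illustrates that idea under a hypothetical name (`has_adjacent_consecutive`); it is not the reference implementation, and your version in `lab.py` may look different.

```python
def has_adjacent_consecutive(ints):
    """Return True if any two neighboring elements differ by exactly 1."""
    # zip(ints, ints[1:]) pairs each element with its right-hand neighbor.
    # For lists of length 0 or 1 it yields nothing, so any() returns False.
    return any(abs(a - b) == 1 for a, b in zip(ints, ints[1:]))


print(has_adjacent_consecutive([5, 3, 6, 4, 9, 8]))
print(has_adjacent_consecutive([1, 3, 5, 7, 9]))
```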



In [None]:
# The cells below are here for you to write scratch work in. 
# You should write the code for your answer in `lab.py`, not here.


In [11]:
consecutive_ints([1, 2.5, 3, 4]) #This should evaluate to True

True

In [8]:
consecutive_ints([]) #This should evaluate to False

False

### Question 1 – Median vs. Mean

Complete the implementation of the function `median_vs_mean`, which takes in a non-empty list of numbers (`nums`) and returns `True` if the median of the list is less than or equal to the mean of the list and `False` otherwise.

Recall, if a list has even length, the median is the mean of the middle two elements.

***Note:*** In this question, you may only use built-in functions and methods in Python. You should not use `numpy` or `pandas` at all, nor should you import any additional packages.
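
As a rough sketch of the two quantities involved, using only built-ins, the helpers below (hypothetical names `median_builtin` and `mean_builtin`) compute the median by sorting and the mean by summing. They assume a non-empty list, as the question does.

```python
def median_builtin(nums):
    """Median of a non-empty list, using only built-in functions."""
    ordered = sorted(nums)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd length: the single middle element.
        return ordered[mid]
    # Even length: the mean of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


def mean_builtin(nums):
    """Mean of a non-empty list, using only built-in functions."""
    return sum(nums) / len(nums)


example = [4, 1, 3, 2]
print(median_builtin(example), mean_builtin(example))
```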

In [22]:
median_vs_mean([1])

True

In [25]:
median_vs_mean([-10, -5, 0, 5, 10])

True

### Question 2 – Same Difference

Complete the implementation of the function `same_diff_ints`, which takes in a list of integers (`ints`) and returns `True` if there exist two list elements $i$ positions apart, whose absolute difference as integers is also $i$. If there are no two elements satisfying this condition, `same_diff_ints` should return `False`.

For example, because `3` (position 1) and `5` (position 3) are 2 positions apart, and $|3-5| = 2$:
```py
>>> same_diff_ints([5, 3, 1, 5, 9, 8])
True
```
Whereas:
```py
>>> same_diff_ints([1, 3, 5, 7, 9])
False
```

***Important:*** While implementing `same_diff_ints`, we will assume that `ints` tends to satisfy the condition, and that the pair(s) satisfying the condition tend to be close together. As such, you must implement `same_diff_ints` such that it **runs faster in cases where the pairs are close together than in cases where the pairs are further apart**. While you will still likely need a nested `for`-loop, this will inform how you configure your loop variables. (Optimizing your code for an assumed distribution of incoming data is very common in data science.)

***Hints:*** 
- This is similar to Question 0.
- Make sure to define some extreme test cases, like when `ints` is an empty list. Also, use the `%%time` magic command to time your function, to make sure it satisfies the optimization requirement above.
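
To illustrate what "configuring your loop variables" can mean, the sketch below iterates over the gap between positions in the outer loop, so that all pairs 1 position apart are checked before any pair 2 positions apart, and so on. The name `same_diff_sketch` is hypothetical, and the sketch assumes that the two elements must be at different positions (a gap of at least 1).

```python
def same_diff_sketch(ints):
    """Return True if two elements `gap` positions apart differ by `gap`."""
    n = len(ints)
    # Outer loop over the gap: the closest pairs are examined first,
    # so the function returns early when a nearby pair qualifies.
    for gap in range(1, n):
        for start in range(n - gap):
            if abs(ints[start] - ints[start + gap]) == gap:
                return True
    return False


print(same_diff_sketch([5, 3, 1, 5, 9, 8]))
print(same_diff_sketch([1, 3, 5, 7, 9]))
```

With this ordering, a qualifying pair of neighbors is found after at most `len(ints) - 1` comparisons, whereas a pair at opposite ends of the list is only reached after every closer pair has been checked.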

In [28]:
same_diff_ints([5, 3, 1, 5, 9, 8])

True

In [31]:
same_diff_ints([1, 3, 5, 7, 9])

False

Make sure your function runs in under 5 seconds.

In [20]:
%%time
same_diff_ints([5, 3, 1, 5, 9, 8])

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 6.91 µs


True

## Part 2: Strings and Files 🧵

The following questions will familiarize you with the basics of working with strings and reading data from files. Remember that by default, data from files are stored as strings in Python.

### Question 3 – $n$ Prefixes

Complete the implementation of the function `n_prefixes`, which takes a string `s` and a positive integer `n`. It returns a string containing the first `n` consecutive prefixes of `s` in reverse order.

For example, let's suppose `s` is the string `'Billy!'` and `n` is `4`. The consecutive prefixes of `'Billy!'` are:
- `'B'`
- `'Bi'`
- `'Bil'`
- `'Bill'`
- `'Billy'`
- `'Billy!'`

The first 4 of these are `'B'`, `'Bi'`, `'Bil'`, and `'Bill'`. If we combine these 4 in reverse order, we get `'BillBilBiB'`, which is what `n_prefixes('Billy!', 4)` should return. As another example, `n_prefixes('Marina', 3)` should return `'MarMaM'`. **You may assume that `n` is no larger than the length of `s`.**

***Hint:*** Recall that [strings may be sliced](https://docs.python.org/3/tutorial/introduction.html#strings), like lists.
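
As a small illustration of the slicing hint, `s[:k]` is the prefix of `s` with length `k`. The sketch below (hypothetical name `prefixes_reversed`) joins those prefixes from longest to shortest; it assumes, as stated above, that `n` is no larger than the length of `s`.

```python
def prefixes_reversed(s, n):
    """Concatenate the first n prefixes of s, longest first."""
    # range(n, 0, -1) counts down n, n-1, ..., 1, and s[:k] is the
    # prefix of length k.
    return "".join(s[:k] for k in range(n, 0, -1))


print(prefixes_reversed("Billy!", 4))
print(prefixes_reversed("Marina", 3))
```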

In [33]:
n_prefixes('abcdef', 6)

'abcdefabcdeabcdabcaba'

In [8]:
n_prefixes('Marina!', 3)

'MarMaM'

### Question 4 – Exploded Numbers 💣

Complete the implementation of the function `exploded_numbers`, which takes in a list of integers (`ints`) and a non-negative integer (`n`) and **returns a list of strings** containing numbers from the list expanded by `n` numbers in both directions, separated by spaces. Each integer should be [zero padded](https://www.tutorialspoint.com/python/string_zfill.htm) so that all integers outputted have the same length.

For example, consider `exploded_numbers([3, 8, 15], 2)`.
- If we explode 3 by 2 numbers in both directions, we get 1, 2, 3, 4, 5.
- If we explode 8 by 2 numbers in both directions, we get 6, 7, 8, 9, 10.
- If we explode 15 by 2 numbers in both directions, we get 13, 14, 15, 16, 17.

The longest length of any of the exploded numbers above is 2, so all of the outputted integers should have length 2.

- The string corresponding to 3 in the input is `'01 02 03 04 05'`.
- The string corresponding to 8 in the input is `'06 07 08 09 10'`.
- The string corresponding to 15 in the input is `'13 14 15 16 17'`.

So, `exploded_numbers([3, 8, 15], 2)` should return `['01 02 03 04 05', '06 07 08 09 10', '13 14 15 16 17']`. 

As another example, `exploded_numbers([9, 99], 3)` should return `['006 007 008 009 010 011 012', '096 097 098 099 100 101 102']`.

***Note***: You can assume that negative numbers will never be encountered. That is, when testing your code, we will never explode a number so much that it becomes negative.
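
The padding logic can be sketched as follows: the widest number that can appear is the largest input plus `n`, so its number of digits gives the width to pass to `str.zfill`. The function name `explode_sketch` is hypothetical, and the sketch assumes that no exploded number is negative, as the note above guarantees.

```python
def explode_sketch(ints, n):
    """Explode each integer by n in both directions, zero-padded."""
    # The longest number overall is max(ints) + n; default=0 keeps the
    # function from failing on an empty list.
    width = len(str(max(ints, default=0) + n))
    return [
        " ".join(str(k).zfill(width) for k in range(i - n, i + n + 1))
        for i in ints
    ]


print(explode_sketch([3, 8, 15], 2))
```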

In [38]:
exploded_numbers([10, 50, 94], 5)

['05 06 07 08 09 10 11 12 13 14 15',
 '45 46 47 48 49 50 51 52 53 54 55',
 '89 90 91 92 93 94 95 96 97 98 99']

In [27]:
exploded_numbers([9, 99], 3)

['006 007 008 009 010 011 012', '096 097 098 099 100 101 102']

## Part 3: `numpy` exercises 🥧



### Question 5 – Array Methods

Complete the implementations of the functions `add_root` and `where_square`. Specifications are given below. Your solutions should **not** contain any loops or list comprehensions.

#### `add_root`

`add_root` should take in a `numpy` array, `A`, and return a new `numpy` array that contains the element-wise sum of the elements in `A` with the _square roots of the positions of the elements in `A`_. 

For instance, if `A` contains the values 5, 9, and 4, the output array should contain the values 5 (5 + $\sqrt{0}$), 10 (9 + $\sqrt{1}$), and 5.4142... (4 + $\sqrt{2}$).

<br>

#### `where_square`

`where_square` should take in a `numpy` array, `A`, and return a new `numpy` array of Booleans whose `i`th element is `True` if and only if the `i`th element of `A` is a perfect square. 

For instance, `where_square(np.array([2, 9, 16, 15]))` should return `array([False, True, True, False])`.
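
Both specifications can be approached with whole-array operations. The sketch below uses hypothetical names (`add_root_sketch`, `where_square_sketch`) and assumes that `A` is one-dimensional and that its values are non-negative and small enough for their square roots to be represented exactly as floats.

```python
import numpy as np


def add_root_sketch(A):
    """Add the square root of each position to the element there."""
    # np.arange(len(A)) holds the positions 0, 1, 2, ...
    return A + np.sqrt(np.arange(len(A)))


def where_square_sketch(A):
    """Boolean array: True where the element is a perfect square."""
    roots = np.sqrt(A)
    # A perfect square has a whole-number square root.
    return roots == np.round(roots)


print(add_root_sketch(np.array([5, 9, 4])))
print(where_square_sketch(np.array([2, 9, 16, 15])))
```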

In [42]:
add_root(np.array([]))

array([], dtype=float64)

In [49]:
A_2 = np.array([0.0, 9.0, 15.5, 16.0])
where_square(A_2)

array([ True,  True, False,  True])

In [44]:
# Don't change this cell 
A_1 = np.array([2, 4, 6, 7])
out_1 = add_root(A_1)

A_2 = np.array([1, 4, 9, 16, 25])
out_2 = where_square(A_2)

### Question 6 – Stock Prices 📈

Complete the implementations of the functions `growth_rates` and `with_leftover`. Specifications are given below. Your solutions should **not** contain any loops or list comprehensions.

#### `growth_rates`

`growth_rates` should take in a `numpy` array, `A`, of [stock prices](https://en.wikipedia.org/wiki/Stock) for a single stock on successive days in USD. It should return an array of growth rates. That is, the `i`th number of the returned array should contain the rate of growth in stock price from the $i^{th}$ day to the $(i+1)^{th}$ day. The growth rate between two values is defined as $\frac{\text{final} - \text{initial}}{\text{initial}}$. You should return growth rates as **proportions, rounded to two decimal places**.
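
As a worked instance of the formula, the sketch below applies it to a made-up array of three prices (these numbers are illustrative and do not come from `stocks.csv`). Slicing gives the "initial" and "final" price for each pair of successive days; the name `growth_rates_sketch` is hypothetical.

```python
import numpy as np


def growth_rates_sketch(A):
    """Day-over-day growth rates as proportions, rounded to 2 decimals."""
    initial = A[:-1]  # prices on days 0 .. n-2
    final = A[1:]     # prices on days 1 .. n-1
    return np.round((final - initial) / initial, 2)


prices = np.array([10.0, 11.0, 9.9])  # made-up prices
rates = growth_rates_sketch(prices)
print(rates)
```

Note that the result has one fewer element than the input, since each growth rate compares a day with the day that follows it.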

<br>

#### `with_leftover`

Again, suppose `A` is a `numpy` array of stock prices. Consider the following scheme: 

- Suppose that you start each day with \$20 to purchase stocks. 
- Each day, you purchase as many shares as possible of the stock. (The price changes each day, according to `A`.)
- Any money left-over after a given day is saved for possibly buying stock on a future day.

The function `with_leftover` should take in `A` and return the day (as an `int`) on which you can buy at least one full share using just "left-over" money. If this never happens, return `-1`. Note that the first stock purchase occurs on Day 0, and that you cannot purchase fractions of a share of a stock.

For example, if the stock price is \$3 every day, then the answer is `1` (corresponding to Day 1):
- Day 0: Buy 6 shares with \$20, and \$2 is added to the leftover. Your total leftover is currently \$2. This is not enough to buy one extra share, so you continue.
- Day 1: Buy 6 shares with \$20, and another \$2 is added to the leftover. Your total leftover is now \$4, so you can now buy one extra share. Hence, the answer is Day 1, and `with_leftover` should return `1`.

***Hint:*** `np.cumsum` may be helpful.
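
One way to read the hint: the money left over on each day is the remainder of \$20 divided by that day's price, and `np.cumsum` turns those remainders into a running total. The sketch below follows that reading under a hypothetical name (`with_leftover_sketch`). The `budget` parameter is an assumption added for illustration, and the remainder operation may be affected by floating-point rounding for non-integer prices.

```python
import numpy as np


def with_leftover_sketch(A, budget=20):
    """First day on which accumulated leftover money buys a full share."""
    leftover_per_day = budget % A          # money not spent each day
    running_leftover = np.cumsum(leftover_per_day)
    can_buy = running_leftover >= A        # compare with that day's price
    if not can_buy.any():
        return -1
    # argmax returns the position of the first True value.
    return int(np.argmax(can_buy))


print(with_leftover_sketch(np.array([3, 3, 3, 3])))
print(with_leftover_sketch(np.array([9, 9, 9, 9])))
```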

In [52]:
fp = Path('data') / 'stocks.csv'
stocks = np.array([float(x) for x in open(fp)])
print(stocks)

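# Day-over-day price differences (final - initial)
price_diff = stocks[1:] - stocks[:-1]
print(price_diff)
# Growth rates as proportions (not percentages), rounded to two decimal places
growth_rate = np.around(price_diff / stocks[:-1], 2)
print(growth_rate)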


[ 9.89  9.87  9.97  9.83  9.86  9.9   9.86 10.05 10.14 10.38 10.51 10.58
 10.6  10.62 10.7  10.69 10.61 10.59 10.62 10.48 10.54 10.54 10.52 10.68
 10.71 10.78 10.74 10.79 10.94 10.76 10.82 10.87 10.72 10.86 10.88 10.85
 10.79 10.9  11.19 11.12 11.1  11.23 11.3  11.33 11.35 11.32 11.42 11.52
 11.51 11.53 11.73 11.63 11.56 11.71 11.61 11.74 11.95 11.89 11.75 11.74
 11.8  11.81 11.79 11.8  11.96 11.95 12.04 12.01 12.12 12.22 12.31 12.29
 12.25 12.37 12.38 12.4  12.61 12.39 12.38 12.47 12.5  12.63 12.77 12.73
 12.48 12.33 12.26 12.11 11.99 12.01 12.11 12.18 12.27 12.25 12.25 12.2
 12.11 12.26 12.41 12.45]
[-0.02  0.1  -0.14  0.03  0.04 -0.04  0.19  0.09  0.24  0.13  0.07  0.02
  0.02  0.08 -0.01 -0.08 -0.02  0.03 -0.14  0.06  0.   -0.02  0.16  0.03
  0.07 -0.04  0.05  0.15 -0.18  0.06  0.05 -0.15  0.14  0.02 -0.03 -0.06
  0.11  0.29 -0.07 -0.02  0.13  0.07  0.03  0.02 -0.03  0.1   0.1  -0.01
  0.02  0.2  -0.1  -0.07  0.15 -0.1   0.13  0.21 -0.06 -0.14 -0.01  0.06
  0.01 -0.02  0.01  0.16 -

[0.  1.5 2.6 0. ]
[0.  1.5 4.1 4.1]
[False False False  True]
[3]
3
1


In [54]:
# Don't change this cell -- it is needed for the tests to work
fp = Path('data') / 'stocks.csv'
stocks = np.array([float(x) for x in open(fp)])
out_3_stocks = growth_rates(stocks)

A_4 = np.array([9, 9, 9, 9])
out_4 = with_leftover(A_4)

[ 0.   -0.    0.01 -0.01 -0.    0.   -0.    0.02  0.03  0.05  0.06  0.07
  0.07  0.07  0.08  0.08  0.07  0.07  0.07  0.06  0.07  0.07  0.06  0.08
  0.08  0.09  0.09  0.09  0.11  0.09  0.09  0.1   0.08  0.1   0.1   0.1
  0.09  0.1   0.13  0.12  0.12  0.14  0.14  0.15  0.15  0.14  0.15  0.16
  0.16  0.17  0.19  0.18  0.17  0.18  0.17  0.19  0.21  0.2   0.19  0.19
  0.19  0.19  0.19  0.19  0.21  0.21  0.22  0.21  0.23  0.24  0.24  0.24
  0.24  0.25  0.25  0.25  0.28  0.25  0.25  0.26  0.26  0.28  0.29  0.29
  0.26  0.25  0.24  0.22  0.21  0.21  0.22  0.23  0.24  0.24  0.24  0.23
  0.22  0.24  0.25  0.26]
-1


## Part 4: Introduction to `pandas` 🐼

This part will help build familiarity with DataFrames in `pandas`. 

As always for `pandas` questions:
1. Avoid writing loops through the rows of the DataFrame to do the problem, and
2. Test the output/correctness of your code with the help of the dataset given, but be sure your code will also run on data that is similar to but different from the dataset given. (One way to do this is to sample rows from the provided DataFrame using the `.sample` method).
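
As a short illustration of the `.sample` suggestion, the sketch below draws a random subset of rows from a tiny made-up DataFrame (the names and numbers are invented and are not taken from the lab's data). Passing `random_state` makes the draw reproducible.

```python
import pandas as pd

# Tiny made-up DataFrame, standing in for a real dataset.
toy = pd.DataFrame({
    "Player": ["A One", "B Two", "C Three", "D Four", "E Five"],
    "Team": ["X", "X", "Y", "Y", "Z"],
    "Salary": [100, 200, 300, 400, 500],
})

# Draw 3 rows without replacement; the original index labels are kept.
subset = toy.sample(n=3, random_state=0)
print(subset)
```

Running your function on several such subsets is one way to check that it does not depend on the exact rows of the provided file.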

The file `data/salary.csv` contains salary information for the 2021-22 National Basketball Association (NBA) season 🏀. Specifically, it contains the name, team, and salary of all players who played at least 15 games last season. We will load this file and store it as a DataFrame named `salary`.

In [54]:
# Do not edit this cell -- it is needed for the tests
salary_fp = Path('data') / 'salary.csv'
salary = pd.read_csv(salary_fp)
salary.head()

| Unnamed: 0 | Player | Position | Team | Salary |
|---|---|---|---|---|
| 0 | John Collins | PF | Atlanta Hawks | 23000000 |
| 1 | Danilo Gallinari | PF | Atlanta Hawks | 20475000 |
| 2 | Bogdan Bogdanović | SG | Atlanta Hawks | 18000000 |
| 3 | Clint Capela | C | Atlanta Hawks | 17103448 |
| 4 | Delon Wright | SG | Atlanta Hawks | 8526316 |


### Question 7 – `pandas` Basics

Your job is to complete the implementation of the function `salary_stats`, which takes in a DataFrame like `salary` and returns a **Series** containing the following statistics:
- `'num_players'`: The number of players.
- `'num_teams'`: The number of teams.
- `'total_salary'`: The total salary amount for all players.
- `'highest_salary'`: The name of the player with the highest salary. **Assume there are no ties.**
- `'avg_los'`: The average salary of players on the `'Los Angeles Lakers'`, rounded to two decimal places.
- `'fifth_lowest'`: The name and team of the player who has the fifth lowest salary, separated by a comma and a space (e.g. `'Billy Triton, Cleveland Cavaliers'`). **Assume there are no ties.**
- `'duplicates'`: A Boolean that is `True` if there are any duplicate last names, and `False` otherwise. Note that some players may have a suffix on their name, such as "Jr." or "III" -- you should ignore these. That is, "Billy Triton Jr." and "Tyler Triton" should be considered to have the same last name.
- `'total_highest'`: The total salary of the team that has the highest paid player.

The index of each element in the outputted Series is specified above.

***Notes***: 
- Your function should work on a dataset of the same format that contains information from other years. This means that `salary_stats` should not "hard-code" any numbers or strings, but should compute them all programmatically. In all cases, you may assume that none of the answers involving ranking involves a tie.
- The public tests don't test to see if your function actually returns the right numbers. You should manually inspect your result to make sure that all values seem appropriate.
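
For the `'duplicates'` statistic, one possible approach is to strip a trailing suffix before taking the final word of each name. The sketch below uses a hypothetical helper (`last_names_sketch`) and an assumed, non-exhaustive list of suffixes (`Jr.`, `Sr.`, `II`, `III`, `IV`); the real data may contain others.

```python
import pandas as pd

# Assumed suffix list, for illustration only.
SUFFIX_PATTERN = r"\s+(?:Jr\.|Sr\.|II|III|IV)$"


def last_names_sketch(players):
    """Last word of each name after removing a trailing suffix."""
    without_suffix = players.str.replace(SUFFIX_PATTERN, "", regex=True)
    return without_suffix.str.split().str[-1]


names = pd.Series(["Billy Triton Jr.", "Tyler Triton", "Sam Sun III"])
surnames = last_names_sketch(names)
print(surnames.tolist())
print(surnames.duplicated().any())
```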

In [58]:
def salary_stats(salary):
    num_players = salary["Player"].shape[0]
    num_teams = salary["Team"].nunique()  # ChatGPT was used to learn about .nunique()
    total_salary = salary["Salary"].sum()

    # Row for the highest-paid player (assumes no ties)
    top_player = salary.nlargest(1, "Salary").iloc[0]
    highest_salary = top_player["Player"]

    avg_los = salary.loc[salary["Team"] == "Los Angeles Lakers", "Salary"].mean().round(2)

    # Fifth-lowest salary: the last of the five smallest rows
    fifth_low = salary.nsmallest(5, "Salary").iloc[-1]
    fifth_lowest = fifth_low["Player"] + ', ' + fifth_low["Team"]

    # Treat the second word of each name as the last name, which skips suffixes such as "Jr."
    # ChatGPT was used to learn about .split, .apply(lambda x: x[1]), and .duplicated
    player_lastname = salary["Player"].str.split().apply(lambda x: x[1])
    duplicates = player_lastname.duplicated().any()

    # Total salary of the team that has the highest-paid player
    total_highest = salary.loc[salary["Team"] == top_player["Team"], "Salary"].sum()

    series = pd.Series({
        "num_players": num_players,
        "num_teams": num_teams,
        "total_salary": total_salary,
        "highest_salary": highest_salary,
        "avg_los": avg_los,
        "fifth_lowest": fifth_lowest,
        "duplicates": duplicates,
        "total_highest": total_highest,
    })
    print(series)
    return series

In [59]:
salary_fp = Path('data') / 'salary.csv'
salary = pd.read_csv(salary_fp)
stats = salary_stats(salary)

num_players                                   381
num_teams                                      30
total_salary                           3433118794
highest_salary                      Stephen Curry
avg_los                               13266896.82
fifth_lowest      Miye Oni, Oklahoma City Thunder
duplicates                                   True
total_highest                           130428103
dtype: object


In [60]:
# Do not edit this cell 
salary_fp = Path('data') / 'salary.csv'
salary = pd.read_csv(salary_fp)
stats = salary_stats(salary)

salary_sample_fp = Path('data') / 'salary_sample.csv'
salary_sample = pd.read_csv(salary_sample_fp)
sample_stats = salary_stats(salary_sample)

num_players                                   381
num_teams                                      30
total_salary                           3433118794
highest_salary                      Stephen Curry
avg_los                               13266896.82
fifth_lowest      Miye Oni, Oklahoma City Thunder
duplicates                                   True
total_highest                           130428103
dtype: object
num_players                                        50
num_teams                                          26
total_salary                                428424568
highest_salary                           Kevin Durant
avg_los                                     1789256.0
fifth_lowest      Keita Bates-Diop, San Antonio Spurs
duplicates                                       True
total_highest                                46202282
dtype: object


## Congratulations! You're done Lab 1! 🏁

As a reminder, all of the work you want to submit needs to be in `lab.py`.
