# Homework 2: Arrays and DataFrames

## Due Tuesday, October 11th at 11:59PM

Welcome to Homework 2! This week's homework will cover arrays and DataFrames in Python. You can find additional help on these topics in [BPD 7-11](https://notes.dsc10.com/02-data_sets/arrays.html) in the `babypandas` notes.

### Instructions

This assignment is due Tuesday, October 11th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the [Calendar](https://dsc10.com/calendar)) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

**Please do not use for-loops for any questions in this homework.** If you don't know what a for-loop is, don't worry – we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.

In [1]:
# Don't change this cell; just run it. Please don't import any additional packages.
import numpy as np
import babypandas as bpd

import otter
grader = otter.Notebook()

## 1. Arrays 🗃️

**Question 1.1.** Make an array called `quirky_numbers` containing the following numbers (in the given order):

1. The cube root of 31
2. 9 radians, in degrees
3. $2^7 - 4^5$
4. The mathematical constant of $e$ over 4: ($\frac{e}{4}$)
5. The factorial of 5

*Hint:* Check out the functions constants available in the `numpy` module, which has been imported as `np`. If you're unsure of what function to use, a quick Google search should do the trick.  Do **not** import `math`. 

*Note:* In this problem, as with all others, we'll only check that your answer is correct. There may be several valid ways to produce the correct answer.

In [2]:
quirky_numbers = np.array([(31**(1/3)), 9 * (180/np.pi), (2**7) - (4**5), np.e/4, 5*4*3*2*1])
quirky_numbers

array([ 3.14138065e+00,  5.15662016e+02, -8.96000000e+02,  6.79570457e-01,
        1.20000000e+02])

In [3]:
grader.check("q1_1")

**Question 1.2.** Make an array called `likes` containing the following three strings:
- `'I like planting'`
- `'my pets'`
- `'and my friends!'`

<!--
BEGIN QUESTION
name: q1_2
-->

In [4]:
likes = np.array(['I like planting', 'my pets', 'and my friends!'])
likes

array(['I like planting', 'my pets', 'and my friends!'], dtype='<U15')

In [5]:
grader.check("q1_2")

<center><img src=./data/cat_plant.jpeg width=400><a href="https://www.reddit.com/r/pottedcats/comments/xpokrc/blue_eyes/">source</a></center>


In [Lecture 4](https://dsc10.com/resources/lectures/lec04/lec04.html#String-methods), we looked at several string methods, like `lower` and `replace`. Strings have another method that we haven't looked at yet, called `join`. `join` takes one argument, an array of strings, and it returns a single string. Specifically, `some_string.join(some_array)` evaluates to a new string consisting of all of elements in `some_array`, with `some_string` inserted in between each element.

For example, `'-'.join(np.array(['call', '858', '534', '2230']))` evaluates to `'call-858-534-2230'`.

**Question 1.3.** Use the array `likes` and the method `join` to make two strings:

1. `'I like planting, my pets, and my friends!'` (call this one `by_comma`)
2. `'I like planting my pets and my family!'` (call this one `by_space`)

In [6]:
by_comma = ', '.join(likes)
by_space = ' '.join(likes)

# Don't change the lines below.
print(by_comma)
print(by_space)

I like planting, my pets, and my friends!
I like planting my pets and my friends!


In [7]:
grader.check("q1_3")

Now let's get some practice accessing individual elements of arrays.  In Python (and in many programming languages), elements are accessed by *integer position*, with the position of the first element being zero. That's probably not the way you learned to count, so it's easy to get mixed up here. Be careful!

**Question 1.4.** The cell below creates an array of strings.

In [8]:
some_strings = np.array(['🌻', '🌺', '🌼', '🌸', 'flower', 'plant', 'cat', '🐈', 'dog', '🐶'])
some_strings

array(['🌻', '🌺', '🌼', '🌸', 'flower', 'plant', 'cat', '🐈', 'dog', '🐶'],
      dtype='<U6')

What is the integer position of `'🐈'` in the array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

In [9]:
cat_emoji_position = 7
cat_emoji_position

7

In [10]:
grader.check("q1_4")

**Question 1.5.** Suppose you have an array with 500 elements. What is the integer position of the fifth-last element in this array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

_Note:_ Your answer should be a **positive** integer!

In [11]:
fifth_last_position = 499 - 4
print(np.arange(0, 500, 1)[-5])
fifth_last_position

495


495

In [12]:
grader.check("q1_5")

**Question 1.6.** Suppose you have an array with 229 elements. At what integer position is the middle element of this array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

_Note:_ Again, your answer should be a **positive** integer!

In [13]:
mid_position = int(np.floor(229 / 2))
print(np.arange(0, 229, 1)[int(229/2)])
mid_position

114


114

In [14]:
grader.check("q1_6")

By the way, it's also possible to use negative integer positions to access elements in an array, which can be easier than using positive integer positions sometimes.  If a position is negative, you count from the end of the array rather than from the beginning. Position -1 corresponds to the last element, -2 corresponds to the second-last element, and so on. For instance, to find the third-last element of some_strings, we could use:

In [15]:
some_strings[-3]

'🐈'

## 2. DSC 10 Enrollments 📈

It's time to get to know your classmates! The third column of the table below shows how many students in this quarter's offering of DSC 10 come from each of the seven colleges at UCSD. Looks like Sixth College is the most popular, with 93 students. The last column shows how many of these students are DSC majors. Of the 93 students from Sixth College, 54 are DSC majors. 

For comparison's sake, we also have the corresponding data from the Fall 2021 offering of DSC 10. You can see that the class has grown quite a bit since last fall!

Throughout this problem, we'll assume that all students in DSC 10 come from one of the seven colleges in the table.

|College|Fall 21 Students|Fall 21 DSC Major Students|Fall 22 Students|Fall 22 DSC Major Students|
|---|---|---|---|---|
|Seventh|28|19|37|21|
|Sixth|62|45|93|54|
|Roosevelt|22|9|67|39|
|Warren|31|19|61|40|
|Marshall|49|26|62|32|
|Muir|27|14|38|27|
|Revelle|30|12|52|31|

In this question, we'll be working with the data from this table as *arrays*. Here are those arrays:

In [16]:
students_21 = np.array([28, 62, 22, 31, 49, 27, 30])
students_21

array([28, 62, 22, 31, 49, 27, 30])

In [17]:
majors_21 = np.array([19, 45, 9, 19, 26, 14, 12])
majors_21

array([19, 45,  9, 19, 26, 14, 12])

In [18]:
students_22 = np.array([37, 93, 67, 61, 62, 38, 52])
students_22

array([37, 93, 67, 61, 62, 38, 52])

In [19]:
majors_22 = np.array([21, 54, 39, 40, 32, 27, 31])
majors_22

array([21, 54, 39, 40, 32, 27, 31])

Remember, the `numpy` package (`np` for short) provides many handy functions for working with arrays. These are specifically designed to work with arrays and are faster than using Python's built-in functions. 

Some frequently used array functions are `np.min()`, `np.max()`, `np.sum()`, `np.abs()`, and `np.round()`. There are many more, which you can browse by typing `np.` into a code cell and hitting the *tab* key, or by looking at the [documentation](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

**Question 2.1.** Assign `enrolled_21` and `enrolled_22` to the number of students that were enrolled in DSC 10 in Fall 2021 and Fall 2022, respectively.

In [20]:
enrolled_21 = majors_21.sum()
enrolled_22 = majors_22.sum()

# Don't change the lines below.
print('Students in Fall 2021:', enrolled_21)
print('Students in Fall 2022:', enrolled_22)

Students in Fall 2021: 144
Students in Fall 2022: 244


In [21]:
grader.check("q2_1")

**Question 2.2.** How many non-DSC major students from the Fall 2021 offering of DSC 10 came from each of the seven colleges? Your answer should be an array called `non_majors_21`, with the colleges in the same order as they appear in the table above. For instance, the first element of `non_majors_21` should be the number of non-DSC majors in the Fall 2021 offering of DSC 10 who were in Seventh College.

Similarly, how many non-DSC major students from the Fall 2022 offering of DSC 10 came from each of the seven colleges? Your answer should be an array called `non_majors_22`, with the colleges in the same order as they appear in the table above. 


In [22]:
non_majors_21 = students_21 - majors_21
non_majors_21

array([ 9, 17, 13, 12, 23, 13, 18])

In [23]:
grader.check("q2_2_1")

In [24]:
non_majors_22 = students_22 - majors_22
non_majors_22

array([16, 39, 28, 21, 30, 11, 21])

In [25]:
grader.check("q2_2_2")

**Question 2.3.** What percentage of Fall 2021 DSC 10 students from each college were DSC majors? Your answer should be an array called `percent_majors_21`, with the colleges in the same order as they appear in the table above, and with percentages rounded to two decimal places.

Similarly, what percentage of Fall 2022 DSC 10 students from each college were DSC majors? Your answer should be an array called `percent_majors_22`, with the colleges in the same order as they appear in the table above, and with percentages rounded to two decimal places.

In [26]:
percent_majors_21 = np.round(majors_21 / students_21 * 100, 2)
percent_majors_21

array([67.86, 72.58, 40.91, 61.29, 53.06, 51.85, 40.  ])

In [27]:
grader.check("q2_3_1")

In [28]:
percent_majors_22 = np.round(majors_22 / students_22 * 100, 2)
percent_majors_22

array([56.76, 58.06, 58.21, 65.57, 51.61, 71.05, 59.62])

In [29]:
grader.check("q2_3_2")

**Question 2.4.** For each college, what is the absolute difference in the percentage of students enrolled in DSC 10 that are DSC majors from Fall 2021 to Fall 2022? Use `percent_majors_21` and `percent_majors_22` to create an array called `abs_differences`, with the colleges in the same order as they appear in the table above. Make sure the values in your answer are rounded to two decimal places.

_Note:_ You _don't_ need to round again.

In [30]:
abs_differences = abs(percent_majors_22 - percent_majors_21)
abs_differences

array([11.1 , 14.52, 17.3 ,  4.28,  1.45, 19.2 , 19.62])

In [31]:
grader.check("q2_4")

For your convenience, we repeat the table from the start of the question below.

|College|Fall 21 Students|Fall 21 DSC Major Students|Fall 22 Students|Fall 22 DSC Major Students|
|---|---|---|---|---|
|Seventh|28|19|37|21|
|Sixth|62|45|93|54|
|Roosevelt|22|9|67|39|
|Warren|31|19|61|40|
|Marshall|49|26|62|32|
|Muir|27|14|38|27|
|Revelle|30|12|52|31|



**Question 2.5.** You might say that the most consistent college is the one with the smallest absolute difference in the percentage of students enrolled in DSC 10 that are DSC majors across the two years. Find the smallest value in the `abs_differences` array and save it as `smallest_abs_diff`. Referring back to the table, try to figure out which college that is. Assign `most_consistent_college` to the name of that college (as a string), exactly as it's displayed in the table.

_Note:_ You can type the name of the college manually.

In [32]:
smallest_abs_diff = abs_differences.min()
most_consistent_college = 'Marshall'

In [33]:
grader.check("q2_5")

## 3. Analyzing NBA Salaries 🏀

The National Basketball Association (NBA) is the premier men's basketball league in North America. The 2022-23 NBA regular season starts on October 18th, just a week after this homework is due!

<img src="data/nba.jpeg" width=60%>

The file `nba_salaries.csv` in the `data/` directory contains salary information for players who played in the NBA at some point between 1985-2018. See below for a description of all the data we have available. 

| Column      | Description |
| ----------- | ----------- |
| `'name'`      | Player name       |
| `'season'`   | NBA season        |
| `'salary'` | Salary (not adjusted for inflation) |
| `'team'` | Player's current team | 
| `'position'` | Position(s) the player was in throughout his career | 
| `'draft_pick'` | Overall draft rank (if drafted)|
| `'draft_year'` | Year drafted into the NBA (if drafted) |


The file `nba_17.csv` in the `data/` directory contains a subset of the data, for only the 2017-18 season.

Note that these player salaries are:
1. Not adjusted for inflation.
2. Not adjusted for the salary cap. The salary cap is the limit that an NBA team can pay its players in total. This limit gradually increased from around \\$35M in 2000 to around \$156M in 2022, so players playing this season are likely making much more than those in our dataset!

**Question 3.1.** Read the file containing all salaries from 1985-2018 into a DataFrame called `salaries`. Read the file containing the 2017-18 salaries into a DataFrame called `nba_17`.

In [34]:
salaries = bpd.read_csv('data/nba_salaries.csv')
nba_17 = bpd.read_csv('data/nba_17.csv')

In [35]:
grader.check("q3_1")

_Note:_ In Questions 3.2 to 3.6, you will use `nba_17`. From that point on, you will use `salaries`.

**Question 3.2.** Create a new DataFrame, `nba_17_id`, by setting the index of `nba_17` to `'player_id'`. Don't change `nba_17`.

In [36]:
nba_17_id = nba_17.set_index('player_id')
nba_17_id

Unnamed: 0_level_0,name,season,salary,team,position,draft_pick,draft_year
player_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
abrinal01,Alex Abrines,2017-18,5725000,Oklahoma City Thunder,Shooting Guard,32nd overall,2013.0
acyqu01,Quincy Acy,2017-18,1709538,Brooklyn Nets,Power Forward and Small Forward,37th overall,2012.0
adamsst01,Steven Adams,2017-18,22471910,Oklahoma City Thunder,Center,12th overall,2013.0
adebaba01,Bam Adebayo,2017-18,2490360,Miami Heat,Center,14th overall,2017.0
afflaar01,Arron Afflalo,2017-18,1500000,Sacramento Kings,Shooting Guard and Small Forward,27th overall,2007.0
...,...,...,...,...,...,...,...
zelleco01,Cody Zeller,2017-18,12584270,Charlotte Hornets,Center,4th overall,2013.0
zellety01,Tyler Zeller,2017-18,1709538,Milwaukee Bucks,Center,17th overall,2012.0
zipsepa01,Paul Zipser,2017-18,1312611,Chicago Bulls,Small Forward,48th overall,2016.0
zizican01,Ante Zizic,2017-18,1645200,Cleveland Cavaliers,Center,23rd overall,2016.0


In [37]:
grader.check("q3_2")

You should think about why we've chosen to set the index to `'player_id'` rather than `'name'`.

**Question 3.3.** In the 2017-18 season, Stephen Curry of the Golden State Warriors set a record for the most three-pointers made in an NBA Finals game. Using DataFrame operations, assign `curry_17` to his salary during the 2017-18 season. The `'player_id'` for Stephen Curry is `'curryst01'`.

In [38]:
curry_17 = nba_17_id.loc['curryst01'].get('salary')
curry_17

34682550

In [39]:
grader.check("q3_3")

**Question 3.4.** Assign `sixth_highest_salary` to the sixth highest salary during the 2017-18 season. Assign `sixth_player_name` to the name of this player.

Don't type in the salary or player name by hand; get Python to extract this information for you.

In [40]:
sixth_highest_salary = nba_17_id.sort_values('salary', ascending=False).get('salary').iloc[5]
sixth_player_name = nba_17_id.sort_values('salary', ascending=False).get('name').iloc[5]

#print(nba_17_id.sort_values('salary', ascending=False).iloc[0:7])

# Don't change the lines below.
print('Player:', sixth_player_name)
print('Salary:', sixth_highest_salary)

Player: Kyle Lowry
Salary: 28703704


In [41]:
grader.check("q3_4")

**Question 3.5.** Suppose we want to analyze the number of years of NBA experience for each player from the 2017-18 NBA season. We will define a player's **longevity** as follows: 

$$\text{longevity} = 2018 - \text{draft_year}$$

Starting with the `nba_17_id` DataFrame, create a new DataFrame called `longevity` that has an additional column called `'years_played'` containing the longevity of each NBA player as a `float`, sorted so that the players with the most years played are listed first.

_Note:_ For players who were not drafted, their `draft_year` entry will be `NaN`, so their longevity will be recorded as `NaN`, which is okay! `NaN` stands for "not a number."

In [42]:
longevity = bpd.DataFrame.assign(nba_17_id, years_played = 2018 - nba_17_id.get('draft_year'))
#help(bpd.DataFrame)
longevity

Unnamed: 0_level_0,name,season,salary,team,position,draft_pick,draft_year,years_played
player_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
abrinal01,Alex Abrines,2017-18,5725000,Oklahoma City Thunder,Shooting Guard,32nd overall,2013.0,5.0
acyqu01,Quincy Acy,2017-18,1709538,Brooklyn Nets,Power Forward and Small Forward,37th overall,2012.0,6.0
adamsst01,Steven Adams,2017-18,22471910,Oklahoma City Thunder,Center,12th overall,2013.0,5.0
adebaba01,Bam Adebayo,2017-18,2490360,Miami Heat,Center,14th overall,2017.0,1.0
afflaar01,Arron Afflalo,2017-18,1500000,Sacramento Kings,Shooting Guard and Small Forward,27th overall,2007.0,11.0
...,...,...,...,...,...,...,...,...
zelleco01,Cody Zeller,2017-18,12584270,Charlotte Hornets,Center,4th overall,2013.0,5.0
zellety01,Tyler Zeller,2017-18,1709538,Milwaukee Bucks,Center,17th overall,2012.0,6.0
zipsepa01,Paul Zipser,2017-18,1312611,Chicago Bulls,Small Forward,48th overall,2016.0,2.0
zizican01,Ante Zizic,2017-18,1645200,Cleveland Cavaliers,Center,23rd overall,2016.0,2.0


In [43]:
grader.check("q3_5")

**Question 3.6.** The 2017-18 NBA Finals featured the Golden State Warriors and Cleveland Cavaliers in their fourth straight Finals matchup. What proportion of players on the 2017-18 Warriors and Cavaliers rosters earned at least \$15M in salary? Assign this value to `prop_15m`.

*Hint:* First make a combined roster with all players from both teams, and then find the required proportion.

In [44]:
team_rosters = nba_17_id[nba_17_id.get('team').str.contains('Golden State Warriors') | nba_17_id.get('team').str.contains('Cleveland Cavaliers')]
prop_15m = team_rosters[team_rosters.get('salary') >= 15000000].shape[0] / team_rosters.shape[0]
prop_15m

0.25

In [45]:
grader.check("q3_6")

For the remainder of this section, we will use the full `salaries` DataFrame.

**Question 3.7.** Among all the salaries for players on the Atlanta Hawks from 1985-2018, what is the median salary? Assign this value to `hawks_median`.

In [46]:
hawks = salaries[salaries.get('team') == 'Atlanta Hawks']
#print(hawks.shape[0])
hawks_median = (hawks.get('salary').iloc[int(hawks.shape[0]/2)-1] + hawks.get('salary').iloc[int(hawks.shape[0]/2)]) / 2
hawks_median

1999380.0

In [47]:
grader.check("q3_7")

**Question 3.8.** Assign `highest_salary_tm` to the team in `salaries` that has the highest mean player salary. 

*Hint*: Our solution for this question used only one line of code (thanks, `groupby`)!

In [48]:
highest_salary_tm = salaries.groupby('team').mean().sort_values('salary', ascending=False).get('salary').index[0]
highest_salary_tm

'Brooklyn Nets'

In [49]:
grader.check("q3_8")

**Question 3.9.** In the NBA, there are five positions: center, power forward, small forward, point guard, and shooting guard. However, more recently, positions have become less important and players can fall into multiple roles. For example, LeBron James usually plays as a small forward, but played as a point guard in the 2020-21 season, and has played as other positions before (look [here](https://www.statmuse.com/nba/ask/what-position-is-lebron-playing-this-season) for a history).

Of all of the players from 1985-2018, we want to examine only the power forwards. Note that we are only considering _seasons_ in which the player was playing power forward. For example, if a player played power forward in 1999 and shooting guard in 2000, we would only consider their 1999 season. Assign `pw_forwards` to a DataFrame that includes all the power forwards from 1985-2018, sorted so the highest salaries appear first.

*Hints:*
- You may want to use `.str.contains`; the *Boolean Indexing* section of the [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view) shows you how to use it. Why are we using `.str.contains` and not simply checking for equality?

- The positions are strings, so they may have inconsistencies in how they're capitalized. If we want to include `'power forward'`, `'Power Forward'`, `'PoWeR FoRwaRD'`, and any other variations in capitalization, what operation should we call on the positions **first**? (You may end up using `.str` twice!)

- If you do this correctly, you'll see the same player appear twice in the first three rows! 🐐

In [50]:
# For a multi-step problem like this one, it's helpful to define intermediate variables. 
# Feel free to do that here, or for any problem!
unsorted_pw_forwards = salaries[salaries.get('position').str.lower().str.contains('power forward')]
pw_forwards = unsorted_pw_forwards.sort_values('salary', ascending=False)
pw_forwards

Unnamed: 0,player_id,name,season,salary,team,position,draft_pick,draft_year
6315,jamesle01,LeBron James,2017-18,33285709,Cleveland Cavaliers,Shooting Guard and Small Forward and Power For...,1st overall,2003.0
8731,millspa01,Paul Millsap,2017-18,31269231,Denver Nuggets,Power Forward,47th overall,2006.0
6314,jamesle01,LeBron James,2016-17,30963450,Cleveland Cavaliers,Shooting Guard and Small Forward and Power For...,1st overall,2003.0
5487,haywago01,Gordon Hayward,2017-18,29727900,Boston Celtics,Power Forward and Small Forward,9th overall,2010.0
5029,griffbl01,Blake Griffin,2017-18,29727900,Detroit Pistons,Power Forward,1st overall,2009.0
...,...,...,...,...,...,...,...,...
12775,uthofja01,Jarrod Uthoff,2017-18,7416,Indiana Pacers,Power Forward,,
13711,willilo01,Lorenzo Williams,1992-93,7000,Charlotte Hornets,Power Forward and Center,,
14064,wrighho02,Howard Wright,1992-93,5000,Orlando Magic,Power Forward,,
6377,jeffeam01,Amile Jefferson,2017-18,4608,Minnesota Timberwolves,Power Forward,,


In [51]:
grader.check("q3_9")

## 4. Are You Scared Yet? Analyzing Horror Movies 🎃😱

<center><img src="./data/hocus_pocus.jpg" width = 400/></center>

Spooky season is upon us! In honor of All Hallows' Eve, we've provided a file in the `data/` directory called `horror_movies.csv` that contains 464 movies, each with 10 columns (see the table below) that we'll use to generate some insights about the state of horror movies in recent years. 

| Column      | Description |
| ----------- | ----------- |
| `'Title'`      | Title of the movie       |
| `'Country'`   | Country the movie was originally released in        |
| `'Maturity Rating'` | A rating given to the movie by the Motion Picture Association |
| `'Review Rating'` | The IMDB rating of the film | 
| `'Language'` | The language the movie is in | 
| `'Filming Locations'` | The location in which the movie was filmed |
| `'Budget'` | The total amount spent on the movie |
| `'Release Month'` | The month the movie was released |
| `'Release Day'` | The day the movie was released |
| `'Run Time'` | The length of the film in minutes |

**Question 4.1.** Read the file containing all of the horror movies into a DataFrame called `horror`.

In [52]:
horror = bpd.read_csv('data/horror_movies.csv')
horror

Unnamed: 0,Title,Country,Maturity Rating,Review Rating,Language,Filming Locations,Budget,Release Month,Release Day,Run Time
0,Rise of the Animals (2011),USA,NOT RATED,3.6,English,"Rochester, New York, USA",7000,May,1,70
1,Zombie Resurrection (2014),UK,NOT RATED,2.7,English,"Hampshire, England, UK",100000,March,23,86
2,Before Dawn (2013),Japan,NOT RATED,4.7,English,"Yorkshire, England, UK",25000,June,8,82
3,Apparition (2015),USA,NOT RATED,4.0,English,"Philadelphia, Pennsylvania, USA",3000000,May,5,100
4,Her Cry: La Llorona Investigation (2013),USA,NOT RATED,5.4,English,"Houston, Texas, USA",60000,April,19,89
...,...,...,...,...,...,...,...,...,...,...
459,Insidious: Chapter 3 (2015),USA,PG-13,6.1,English,"929 South Broadway, Downtown, Los Angeles, Cal...",10000000,June,5,97
460,The Purge (2013),USA,R,5.7,English,"Chatsworth, Los Angeles, California, USA",3000000,June,7,85
461,13 Sins (2014),Poland,R,6.3,English,"New Orleans, Louisiana, USA",4000000,April,11,93
462,Victor Frankenstein (2015),USA,PG-13,6.0,English,"London, England, UK",40000000,November,25,110


In [53]:
grader.check("q4_1")

**Question 4.2.** Examine the columns available in `horror` and consider which would be the best choice of index for this DataFrame. Change the `horror` DataFrame so that it's indexed by the values in this column instead of the default index.

In [54]:
horror = horror.set_index('Title')
horror

Unnamed: 0_level_0,Country,Maturity Rating,Review Rating,Language,Filming Locations,Budget,Release Month,Release Day,Run Time
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Rise of the Animals (2011),USA,NOT RATED,3.6,English,"Rochester, New York, USA",7000,May,1,70
Zombie Resurrection (2014),UK,NOT RATED,2.7,English,"Hampshire, England, UK",100000,March,23,86
Before Dawn (2013),Japan,NOT RATED,4.7,English,"Yorkshire, England, UK",25000,June,8,82
Apparition (2015),USA,NOT RATED,4.0,English,"Philadelphia, Pennsylvania, USA",3000000,May,5,100
Her Cry: La Llorona Investigation (2013),USA,NOT RATED,5.4,English,"Houston, Texas, USA",60000,April,19,89
...,...,...,...,...,...,...,...,...,...
Insidious: Chapter 3 (2015),USA,PG-13,6.1,English,"929 South Broadway, Downtown, Los Angeles, Cal...",10000000,June,5,97
The Purge (2013),USA,R,5.7,English,"Chatsworth, Los Angeles, California, USA",3000000,June,7,85
13 Sins (2014),Poland,R,6.3,English,"New Orleans, Louisiana, USA",4000000,April,11,93
Victor Frankenstein (2015),USA,PG-13,6.0,English,"London, England, UK",40000000,November,25,110


In [55]:
grader.check("q4_2")

_Note:_ If you were to run the cell where you set the index of `horror` again, you'd see an error message. Stop and think about _why_ you'd run into an error. Once you've thought about it, click the thinking emoji below to see the reason for the error.

<br>

<details>
    <summary>Why would there be an error? 🤔</summary>
    There would be an error since you'd be trying to set the index of <code>horror</code> to a column that no longer exists in <code>horror</code> – the column wouldn't exist because it was converted to the index the first time you ran the cell (and the index is not a column)!
</details>

If you acually ran the cell twice and got an error message, don't worry. To get rid of it, re-run the cell in 4.1 where you defined the `horror` DataFrame, then run the cell in 4.2 just once, and you'll be good to go.

When you submit your work for autograding, the entire notebook will be run from start to finish. Each cell will run only once, so it's no problem if your code errors on the second run. In this case, it means you're doing something right!

**Question 4.3.** *The Purge*, released in 2013, has had a massive impact on pop culture. What is the `'Review Rating'` of `'The Purge (2013)'`? Assign the rating to the variable `purge_rating`.

In [56]:
purge_rating = horror.get('Review Rating').loc['The Purge (2013)']
purge_rating

5.7

In [57]:
grader.check("q4_3")

**Question 4.4.** Even spooky movies have production costs. Assign `most_expensive` to the name of the movie with the largest budget (including the year in parentheses), and set the total budget of that movie to `petrifying_pricetag`.

In [58]:
most_expensive = horror.sort_values('Budget', ascending=False).index[0]
#print(horror.sort_values('Budget', ascending=False).get('Budget'))
petrifying_pricetag = horror.get('Budget').loc[most_expensive]
most_expensive, petrifying_pricetag

('Train to Busan (2016)', 10000000000)

In [59]:
grader.check("q4_4")

**Question 4.5.** Wow, that's a lot of money to spend on a horror movie, but how does that compare to the other movies included in the dataset? Compute the difference between the budget of the most expensive movie and the average movie budget, and assign the result to the variable `above_average`.

In [60]:
#print(horror.get('Budget').mean())
above_average = petrifying_pricetag - horror.get('Budget').mean()
above_average

9969931881.67026

In [61]:
grader.check("q4_5")

**Question 4.6.** What proportion of movies in our dataset were released in October? Set the proportion equal to `october_prop`.

In [62]:
october_prop = horror[horror.get('Release Month').str.lower().str.contains('october')].shape[0] / horror.shape[0]
october_prop

0.14870689655172414

In [63]:
grader.check("q4_6")

**Question 4.7.** How many movies in our dataset were released on October 31st? Set the number equal to the variable `halloween_count`.

In [64]:
october_movies = horror[horror.get('Release Month').str.lower().str.contains('october')]
halloween_count = october_movies[october_movies.get('Release Day') == 31].shape[0]
halloween_count

5

In [65]:
grader.check("q4_7")

**Question 4.8.** Which movie titles contain the word `'zombie'`? Create an *array* called `zombie_movies` containing the titles of all such movies. Then assign `num_zombie_movies` to the number of such movies.

*Hints:*
- To convert a Series into an array, call the function `np.array` on the Series.
- The movie names are all strings, so they may have inconsistencies in how they're capitalized. If we want to account for variations in capitalization, what operation should we call on the names **first**? (You may end up using `.str` twice!)

In [66]:
zombie_horror = horror.reset_index()
zombie_horror = zombie_horror[zombie_horror.get('Title').str.lower().str.contains('zombie')].get('Title')
zombie_movies = np.array(zombie_horror)
num_zombie_movies = len(zombie_movies)
num_zombie_movies

18

In [67]:
grader.check("q4_8")

**Question 4.9.** What is the highest `'Review Rating'` of a movie shot in a language other than English? Assign this rating to the variable `foreign_rating`.

_Note:_ Some movies are shot in multiple languages, one of which might be English – for instance, the movie *Devoured* was filmed in `'English|Spanish'`. For the purposes of this question, consider such movies to be shot in a language other than English.

In [68]:
sorted_foreign = horror[horror.get('Language').str.lower().str.contains('english') == False].sort_values('Review Rating', ascending = False)
foreign_rating = sorted_foreign.get('Review Rating').iloc[0]
foreign_rating

7.5

In [69]:
grader.check("q4_9")

**Question 4.10.** Wow, this was a long question, with 10 parts 😓. You know what's longer? The run time of the longest horror movie in the dataset 😱! 

Assign `longest_film_name` to the name of the longest movie in the dataset (including the year in parentheses). Assign `longest_film_length` to the run time of this movie in minutes. 

Don't type in the name or number of minutes by hand; get Python to extract this information for you.

In [70]:
longest_film = horror.reset_index().sort_values('Run Time', ascending = False).iloc[0]
longest_film_name = longest_film.get('Title')
longest_film_length = longest_film.get('Run Time')

# Don't change the line below.
print('Longest film:', longest_film_name, '\nLength:', longest_film_length, 'minutes')

Longest film: A Cure for Wellness (2016) 
Length: 146 minutes


In [71]:
grader.check("q4_10")

## Finish Line 🏁

Congratulations! You are done with Homework 2.

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [72]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()

q1_1 results: All test cases passed!

q1_2 results: All test cases passed!

q1_3 results: All test cases passed!

q1_4 results: All test cases passed!

q1_5 results: All test cases passed!

q1_6 results: All test cases passed!

q2_1 results: All test cases passed!

q2_2_1 results: All test cases passed!

q2_2_2 results: All test cases passed!

q2_3_1 results: All test cases passed!

q2_3_2 results: All test cases passed!

q2_4 results: All test cases passed!

q2_5 results: All test cases passed!

q3_1 results: All test cases passed!

q3_2 results: All test cases passed!

q3_3 results: All test cases passed!

q3_4 results: All test cases passed!

q3_5 results: All test cases passed!

q3_6 results: All test cases passed!

q3_7 results: All test cases passed!

q3_8 results: All test cases passed!

q3_9 results: All test cases passed!

q4_1 results: All test cases passed!

q4_10 results: All test cases passed!

q4_2 results: All test cases passed!

q4_3 results: All test cases passed!

q4_