# Student and Problem Set Info
---


## Title: MGSC 310: Problem Set 1

Author:

Ben Labaschin, King of the Notebooks, Destroyer of Worlds.


# Libraries

In [None]:
from os import environ
from google.colab import drive

# Setup

In [None]:
# Ensure you run this cell or otherwise connect to your Google Drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# Question 1: Pandas for Data Manipulation

a) Load the IMDB movies dataset from `MGSC_310_shared_files_and_resources/Data` using the Pandas `.read_csv()` function.



In [None]:
from pandas import read_csv

movies = read_csv("/content/drive/MyDrive/Work/Chapman/MGSC_310/MGSC_310_shared_files_and_resources/Data/IMDB_movies.csv")

b) Use the Pandas `.drop()` function to remove the variable `plot_keywords`. Store the smaller dataset as the variable `movies_sub`. Use the `.shape` attribute to ensure that movies_sub has one fewer variable than movies.

More on `.drop()` [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html).\
More on `.shape` [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html).

In [None]:
movies_sub = movies.drop('plot_keywords', axis=1)
assert movies.shape[1] - movies_sub.shape[1] == 1  # exactly one fewer column
movies_sub.shape
# they can just print out movies_sub: (3889, 24)

(3889, 24)
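
To see how `.drop()` and `.shape` interact, here is a minimal sketch on a made-up three-column table. The `toy` frame and its values are illustrative assumptions, not the IMDB data. It suggests that `axis=1` and the `columns=` keyword are interchangeable here, and that `.drop()` returns a new `DataFrame` rather than modifying the original.

```python
import pandas as pd

# Hypothetical toy data; not the IMDB dataset.
toy = pd.DataFrame({
    "movie_title": ["A", "B"],
    "budget": [1_000_000, 2_500_000],
    "plot_keywords": ["x|y", "z"],
})

# Two equivalent spellings of "drop this column".
toy_sub = toy.drop("plot_keywords", axis=1)
toy_sub_alt = toy.drop(columns=["plot_keywords"])

# .drop() returns a new frame, so the original keeps all three columns.
print(toy.shape)      # (2, 3)
print(toy_sub.shape)  # (2, 2)
```

The second element of `.shape` is the column count, which is why the solution compares `shape[1]` values.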

c) Use Pandas' `.query()` method or `.loc[]` indexer (on the original `movies` dataset) to count how many movies in the dataset had a budget in excess of $250M USD.

Major Hint: The correct answer is not a `DataFrame` object, it will be an `int` object. You can use `.shape`, `count`, `len`...

More on `.query()` [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html). \
More on `.loc[]` [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html). You should also have read about `.loc[]` in the [book](https://python.datasciencebook.ca/wrangling.html#using-loc-to-filter-rows-and-select-columns).

In [None]:
len(movies.query("""budget > 250_000_000"""))

19

or

In [None]:
movies.loc[movies["budget"] > 250_000_000].shape[0]

19
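
The hint above mentions several ways to turn a filtered `DataFrame` into an `int`. A minimal sketch on made-up budgets (the `toy` frame and `threshold` name are assumptions for illustration) shows three approaches that should agree. Note that the comparison is strict, so a budget of exactly $250M is not counted.

```python
import pandas as pd

# Hypothetical toy budgets in USD; not the IMDB dataset.
toy = pd.DataFrame({"budget": [300_000_000, 90_000_000, 260_000_000, 250_000_000]})

threshold = 250_000_000

# Row count of the frame returned by .query()
n_query = len(toy.query(f"budget > {threshold}"))
# Same rows selected with a boolean mask inside .loc[]
n_loc = toy.loc[toy["budget"] > threshold].shape[0]
# Count the True values in the mask directly, without building a new frame
n_sum = int((toy["budget"] > threshold).sum())

print(n_query, n_loc, n_sum)  # 2 2 2
```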

d) Use Pandas' `.assign()` function to create two new variables, `budgetM` and `grossM`, that report budget and gross in millions of USD (e.g., if `budget` equals $3,500,000, `budgetM` would equal 3.5; ditto for `gross`).

Store the `DataFrame` containing these variables as `movies_millions`.

In [None]:
movies_millions = movies.assign(budgetM=lambda x: x['budget']/1_000_000, grossM=lambda x: x['gross']/1_000_000)
movies_millions[['budgetM', 'grossM']]

Unnamed: 0,budgetM,grossM
0,237.0000,760.505847
1,300.0000,309.404152
2,245.0000,200.074175
3,250.0000,448.130642
4,263.7000,73.058679
...,...,...
3884,0.0070,0.424760
3885,0.0070,0.070071
3886,0.0070,2.040920
3887,0.0090,0.004584
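
For a smaller view of what `.assign()` does, the sketch below repeats the same pattern on two made-up rows (the `toy` frame is an assumption, not the IMDB data). Each keyword becomes a new column name, each lambda receives the `DataFrame`, and the original frame is left unchanged.

```python
import pandas as pd

# Hypothetical toy values in USD; not the IMDB dataset.
toy = pd.DataFrame({"budget": [3_500_000, 7_000], "gross": [10_000_000, 424_760]})

toy_millions = toy.assign(
    budgetM=lambda df: df["budget"] / 1_000_000,  # dollars -> millions of dollars
    grossM=lambda df: df["gross"] / 1_000_000,
)

print(list(toy_millions.columns))  # ['budget', 'gross', 'budgetM', 'grossM']
print(list(toy.columns))           # ['budget', 'gross']
```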


# Question 2: Basic Plotting with Altair


a) Use the `altair` package to create a [scatter plot](https://altair-viz.github.io/gallery/scatter_tooltips.html) of `imdb_score` on the x-axis and movie gross on the y-axis (`grossM`). Use the `movies_millions` dataset from your work above.

In [None]:
import altair as alt

(alt.Chart(movies_millions)
    .mark_circle(size=20) # mark_point() is also fine
    .encode(
            x='imdb_score',
            y='grossM')
    .interactive()
)

Output hidden; open in https://colab.research.google.com to view.

b) There are so many points that it is hard to see how much underlying data each point represents.

Add an `opacity` [argument](https://www.w3schools.com/python/gloss_python_function_arguments.asp#:~:text=Information%20can%20be%20passed%20into,separate%20them%20with%20a%20comma.) to the `mark_circle` (or `mark_point`) function within `altair` to make each data point more transparent. Assign `opacity` a value of `.2`.

In [None]:
(alt.Chart(movies_millions)
    .mark_circle(size=20, opacity=.2)
    .encode(
        x='imdb_score',
        y='grossM')
    .interactive()
)

Output hidden; open in https://colab.research.google.com to view.

c) Create a scatter plot of `imdb_score` against `grossM` and use `altair`'s [`transform_loess`](https://altair-viz.github.io/user_guide/transform/loess.html) to add a smoothing line.

Is there a relationship between movie gross (in millions) and IMDb score? Write a short 1-2 sentence response in a text cell. Nothing fancy, just justify your claim.

In [None]:
chart = (alt.Chart(movies_millions)
    .mark_circle(size=20, opacity=.2)
    .encode(
        x='imdb_score',
        y='grossM')
    )

(chart + chart.transform_loess('imdb_score', 'grossM')
    .mark_line()
)

Output hidden; open in https://colab.research.google.com to view.

Response Options:

If yes, then you should argue that as imdb_score increases so too does grossM on average (even if small).

If no, then you should argue that the relationship is too small to matter. As imdb_score rises, grossM doesn't rise very much. There is more density at lower levels as demonstrated by opacity.

Fewer points to either answer if responses are not in a text cell.
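
A numeric check can back up either response, although the question does not ask for one. The sketch below computes a Pearson correlation with `Series.corr()` on made-up values (the `toy` frame is an assumption, not the IMDB data). On the real data, the same call on `movies_millions` would give one number to cite alongside the plot.

```python
import pandas as pd

# Hypothetical toy scores and grosses (in millions); not the IMDB dataset.
toy = pd.DataFrame({
    "imdb_score": [4.0, 5.5, 6.0, 7.0, 8.5],
    "grossM": [10.0, 20.0, 15.0, 60.0, 90.0],
})

# Pearson correlation: near 0 means little linear association,
# near 1 means a strong positive linear association.
r = toy["imdb_score"].corr(toy["grossM"])
print(round(r, 2))
```

A correlation only measures linear association, so it complements the LOESS line rather than replacing it.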

d) Use `altair` to create a scatter plot. The scatter plot should use budget (`budgetM`) for the x axis and `imdb_score` for the y axis.

The catch is that the chart should only include movies by the director `Steven Spielberg`.

Plot these movies as points and assign the color of the points to the `movie_title` column.

For information on how to change the color of your marks in `altair`, see this [documentation](https://altair-viz.github.io/user_guide/customization.html#color-schemes).

In [None]:
spielberg_data = movies_millions.loc[movies_millions['director_name'] == 'Steven Spielberg']

(alt.Chart(spielberg_data)
    .mark_point(size=20)
    .encode(
        x="budgetM",
        y='imdb_score',
        color='movie_title',
        tooltip=['movie_title'] # unnecessary, just doing it
        )
    .interactive()
    )