## Movie Data Exploration

In this file, you will explore some simple operations in Pandas, a Python package for handling tables of data, which it calls "data frames." We'll also be using Matplotlib, a popular tool for data visualization.

The data you are using is real Letterboxd data from one of your instructors. The included `diary.csv` file is a table of diary entries from a film log recording all of the movies Pat Healy has watched since 2020.

Before you start working with the data, we need to make sure the Python packages we're using are installed. Run the following block to import them (if you get errors, you need to install the packages with pip):

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

---
#### Read in the data

Pandas gives us an easy way to read in data if it's already formatted in a nice way (like in a spreadsheet). Running the following block will read the `diary.csv` spreadsheet into a dataframe:

In [2]:
data = pd.read_csv("diary.csv")
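As a small aside, here is a sketch of what `read_csv` produces, using a made-up three-row table instead of `diary.csv`. The table contents and the `toy` and `csv_text` names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for a file like diary.csv
csv_text = """Name,Year,Rating
Film A,2019,4.5
Film B,2021,
Film C,1999,3.0
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names taken from the header row
print(toy.dtypes)         # the type pandas inferred for each column
```

Checking `shape`, `columns`, and `dtypes` right after loading is a quick way to confirm the file was parsed the way you expected.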

A few dataframe methods give us a basic idea of what we're looking at, for example `head` and `describe`.

In [4]:
# Gives us the first five rows of the dataframe
data.head()

| | Date | Name | Year | Letterboxd URI | Rating | Rewatch | Tags | Watched Date |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-12-01 | Little Women | 2019.0 | https://boxd.it/1uiJTT | 5.0 | | | 2020-01-01 |
| 1 | 2020-12-06 | Uncut Gems | 2019.0 | https://boxd.it/1uQMXd | 4.5 | | | 2020-01-05 |
| 2 | 2020-12-06 | John Mulaney & The Sack Lunch Bunch | 2019.0 | https://boxd.it/1uQN7x | 4.5 | | | 2020-01-05 |
| 3 | 2020-12-06 | Seth Meyers: Lobby Baby | 2019.0 | https://boxd.it/1uQNPf | | | | 2020-01-06 |
| 4 | 2020-12-06 | John Mulaney & The Sack Lunch Bunch | 2019.0 | https://boxd.it/1uQMl9 | 4.5 | | | 2020-01-09 |


In [6]:
# gives us basic descriptive statistics of quantitative fields
data.describe()

| | Year | Rating |
|---|---|---|
| count | 1913.0 | 1787.0 |
| mean | 2002.11448 | 3.656407 |
| std | 20.424034 | 0.921023 |
| min | 1900.0 | 0.5 |
| 25% | 1988.0 | 3.0 |
| 50% | 2007.0 | 4.0 |
| 75% | 2020.0 | 4.5 |
| max | 2024.0 | 5.0 |


If you're interested in seeing more functions available to dataframes, check out [the official documentation](https://pandas.pydata.org/docs/reference/frame.html).

Sometimes we just want to look at a single column instead of the whole dataframe. When we do this, we are looking at what Pandas refers to as a "Series" (which has [its own interesting documentation](https://pandas.pydata.org/docs/reference/series.html)).

Here we are extracting out a series representing just the names of the films logged.

In [10]:
film_names = data["Name"]

# Series also have a describe method! For non-numeric data like this, it gives you the count,
#    the number of unique values, the most frequent value (top) and how often it occurs (freq),
#    along with the name and datatype of the series.
film_names.describe()

count                  1915
unique                 1648
top       Stop Making Sense
freq                     32
Name: Name, dtype: object

--- 
## Exercises

The following exercises will give you an opportunity to explore some of the basic data analysis functions we have available to us in pandas and matplotlib.

#### Exercise 1. Counting

When a log is created on Letterboxd without a rating, the rating is blank in the CSV file, and Pandas reads that blank in as `NaN`. By whatever means you like, count how many logs in `diary.csv` included no rating.

There are a lot of ways to do this! Here are some methods (with links to documentation) that you might find useful: 
- The `count` method returns the number of non-NA elements in the series ([documentation link](https://pandas.pydata.org/docs/reference/api/pandas.Series.count.html#pandas.Series.count))
- The `value_counts` method returns a series representing the frequency of every unique value in a series ([documentation link](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html#pandas.Series.value_counts))
- If you want to get the total number of rows of a dataframe, you can just call the built-in `len` function on it, like you would with any other data structure (like a list in the previous file).
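To see how the tools listed above treat missing values, here is a minimal sketch on a made-up five-element series. The values and the `ratings` name are assumptions, not the diary data.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; np.nan stands in for a blank cell
ratings = pd.Series([4.5, np.nan, 3.0, np.nan, 4.5], name="Rating")

total = len(ratings)     # counts every element, missing or not
rated = ratings.count()  # counts only the non-NA elements
print(total, rated)

# value_counts skips NaN by default; dropna=False lists it as its own entry
print(ratings.value_counts(dropna=False))
```

Notice that `len` and `count` disagree whenever the series contains missing values.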

In [None]:
# Your work here!

#### Exercise 2. Favorites
From the diary, find the names of all of the films that were awarded a rating of 5 stars.

There are a lot of ways to do this, too! Probably the easiest (but, behind the scenes, INCREDIBLY complicated) way of doing this is with a `query` call. Take a look at the [official documentation on query](https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.query.html).
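If the query syntax is new to you, the following sketch shows the general shape of a `query` call on a made-up dataframe. The rows and the `toy`, `high`, and `first_day` names are hypothetical.

```python
import pandas as pd

# Hypothetical three-row table, not the diary data
toy = pd.DataFrame({
    "Name": ["Film A", "Film B", "Film C"],
    "Rating": [5.0, 3.5, 5.0],
    "Watched Date": ["2020-01-01", "2021-06-15", "2021-07-04"],
})

# The expression is a string; bare words refer to column names
high = toy.query("Rating >= 4")
print(high["Name"].tolist())

# Column names containing spaces go inside backticks in the expression
first_day = toy.query("`Watched Date` == '2020-01-01'")
print(first_day["Name"].tolist())
```

A `query` call returns a new dataframe containing only the matching rows, so you can select a single column from the result as usual.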

In [13]:
# your work here!

#### Exercise 3. Average Rating per Year (by log date)

From the diary, find the average rating awarded per year when the film was LOGGED (not when the film was released) for 2020, 2021, 2022, 2023, and 2024.

Some things to note:
- To get the year from the dates in the `Watched Date` column, you'll need to either convert the column to datetime objects or just treat the dates as strings and look at their first four characters.
- If you plan on treating the column like strings, you should [check out the string startswith method for series](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.startswith.html). You can use that method inside a query!
- If you'd like to convert the column to datetimes, you can do so with the `pd.to_datetime` function ([explored in this tutorial](https://www.geeksforgeeks.org/convert-the-column-type-from-string-to-datetime-format-in-pandas-dataframe/)). Then, you can query on the year with `.dt.year` ([explored in-depth in this other tutorial](https://www.geeksforgeeks.org/how-to-filter-dataframe-rows-based-on-the-date-in-pandas/)).
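Both approaches from the notes above can be sketched on a made-up three-row table. The data and variable names here are assumptions, and this version filters with a boolean mask rather than a `query` call.

```python
import pandas as pd

# Hypothetical rows; the dates are plain strings, as they are after read_csv
toy = pd.DataFrame({
    "Watched Date": ["2020-01-01", "2020-03-10", "2021-06-15"],
    "Rating": [4.0, 2.0, 5.0],
})

# Option 1: treat the dates as strings and test their prefix
in_2020 = toy[toy["Watched Date"].str.startswith("2020")]
print(in_2020["Rating"].mean())

# Option 2: convert to datetimes, then read the year through .dt
watched = pd.to_datetime(toy["Watched Date"])
print(watched.dt.year.tolist())
print(toy[watched.dt.year == 2021]["Rating"].mean())
```

The string approach is quicker to write, while the datetime approach also gives you months, weekdays, and date arithmetic if you need them later.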

In [21]:
#your code here!

#### Exercise 4. Plotting Frequency per Year (by film release year)

In this exercise, we want to calculate the frequency of film logs per film release year (the `Year` column in the spreadsheet) AND plot them on a line graph. The previous exercise covered a fixed set of just a few years, but here your code should work for however many years appear in the data.

Some notes:
- The `value_counts` method will be very helpful! It can give you a series that counts frequencies per year.
- Your data should be presented in sorted order, sorted by year. You can sort a pandas series with either the `sort_values()` or `sort_index()` methods. Experiment with them to see which is appropriate.
- You can make a simple line graph out of any pandas series with the `.plot()` method ([documentation here](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html)). For this exercise, you can call it done when you reach this simple line graph, but if you'd like to explore plotting in a more stylish way (maybe adding a title, axis labels, or changing the color theme), [check out this documentation](https://pandas.pydata.org/pandas-docs/version/0.13.1/visualization.html).
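For the optional styling mentioned in the last note, here is a minimal sketch using a small hand-built series. The numbers, labels, and the `counts` name are made up, and the series is already in the order it should be plotted.

```python
import pandas as pd

# Hypothetical counts, indexed by year and already in plotting order
counts = pd.Series([2, 1, 3], index=[1999, 2007, 2019])

# Series.plot draws a line graph by default and returns a Matplotlib Axes
ax = counts.plot(title="Toy example", color="tab:red")

# The Axes object is where labels and other styling are set
ax.set_xlabel("Year")
ax.set_ylabel("Number of logs")
```

In a notebook the figure is displayed automatically when the cell finishes running. The index of the series becomes the x-axis, which is why the order of the index matters.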


In [28]:
# your work here!