# Exercise: Pandas (and Jupyter) Basics
In this lesson, we are going to introduce some of the basics of [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html), a Python library for working with tabular data, such as CSV files.

## Dataset: *The Pudding*'s Film Dialogue Data
The dataset that we're working with in this lesson is taken from Hannah Anderson and Matt Daniels's *Pudding* essay, ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/). The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. The authors included character gender information because they wanted to contribute data to a broader conversation about how "white men dominate movie roles."

## Importing Pandas
To use the Pandas library, we need to **import** it first. Do it the way you usually import a Python package.

In [3]:
import pandas as pd

# By default, Pandas will display 60 rows and 20 columns. I often change Pandas' default display settings to show more rows or columns.
# You don't have to worry about this for now, just run it.
pd.options.display.max_rows = 100
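
The display setting above can also be read and changed through `pd.get_option()` and `pd.set_option()`. The following is a minimal sketch of that alternative spelling; the column limit of 50 is an arbitrary value chosen for illustration and is not part of the lesson.

```python
import pandas as pd

# Same effect as `pd.options.display.max_rows = 100`
pd.set_option("display.max_rows", 100)

# Illustrative value only: widen the column limit as well
pd.set_option("display.max_columns", 50)

# Read the settings back to confirm them
print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_columns"))
```

These settings only change how much of a data frame is printed; the data itself is not affected.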

## Reading CSV Files as Data Frames
Pandas stores tabular data in [data frames](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#DataFrame), objects that act like spreadsheets and come with many built-in methods, several of which we will discuss in the next few lectures.

Read the *Pudding* CSV file into a data frame using a Pandas function called **.read_csv()**. Try it out here, with the name of the file (included in the exercise) in quotation marks inside the parentheses.

In [4]:
pudding_df = pd.read_csv("Pudding-Film-Dialogue.csv")
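
To see what **.read_csv()** does without needing the exercise file, here is a small sketch that reads CSV text from memory. The films, columns, and values are made up for illustration. It also shows one likely source of a column called `Unnamed: 0`, which appears in the outputs in this lesson: Pandas gives that name to a first column that has no header. Whether that is what happened in the *Pudding* file is an assumption.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
# The first header field is empty, as in a CSV saved together with its row index.
csv_text = """,title,release_year,words
0,Film A,1999,1200
1,Film A,1999,300
2,Film B,2005,850
"""

# Default behavior: the header-less column is named "Unnamed: 0"
plain_df = pd.read_csv(io.StringIO(csv_text))
print(plain_df.columns.tolist())

# index_col=0 uses that first column as the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_df.columns.tolist())
print(indexed_df.shape)  # (number of rows, number of columns)
```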

## Data Frame Overviews
To look at *n* randomly chosen rows in a data frame, you can use a method called **.sample()**. Use it here to view a sample of the rows.

In [5]:
pudding_df.sample(10)

Unnamed: 0,title,release_year,character,gender,words,proportion_of_dialogue,age,gross,script_id
16999,Interstellar,2014,Case,man,301,0.031808,37.0,200.0,6092
10617,Star Wars: Episode I - The Phantom Menace,1999,Anakin Skywalke,man,1697,0.158037,10.0,813.0,3437
14043,The Mirror Has Two Faces,1996,Gregory Larkin,man,3852,0.295807,47.0,81.0,4630
13640,Labor Day,2013,Officer Treadwe,man,218,0.022916,36.0,14.0,4495
6454,Fantastic Mr. Fox,2009,Franklin Bean,man,845,0.063505,69.0,24.0,2266
7179,Hall Pass,2011,Aunt Meg,woman,315,0.014221,,49.0,2465
22594,Liar Liar,1997,Fletcher Reede,man,4105,0.529063,35.0,343.0,8971
8811,The Master,2012,Susan Gregory,woman,233,0.01587,,17.0,2866
9250,Next Friday,2000,Smokey (A,man,132,0.016031,28.0,92.0,2996
13104,Hardcore,1979,Mary,woman,128,0.015405,37.0,,4333
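
Because **.sample()** picks rows at random, the output changes each time the cell runs. If a repeatable sample is needed, the `random_state` parameter fixes the random seed. A minimal sketch on a made-up data frame (the names and numbers are placeholders, not *Pudding* data):

```python
import pandas as pd

# Made-up rows for illustration
toy_df = pd.DataFrame({
    "character": ["Alex", "Sam", "Jo", "Kit", "Lee", "Max"],
    "words": [1200, 300, 850, 90, 410, 2300],
})

# The same seed returns the same rows every time
first_draw = toy_df.sample(3, random_state=42)
second_draw = toy_df.sample(3, random_state=42)
print(first_draw.equals(second_draw))

# frac samples a share of the rows instead of a fixed number
half_of_rows = toy_df.sample(frac=0.5, random_state=42)
print(len(half_of_rows))
```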


## Some Basic Pandas Methods
- **.sum()** calculates the sum of multiple values.
- **.mean()** calculates the average of all the values.
- **.median()** finds the median of all the values.
- **.max()** finds the maximum of all the values.
- **.min()** finds the minimum of all the values.
- **.mode()** finds the most common value (or values, if there is a tie) in the dataset.
- **.count()** finds the total number of non-blank values in the dataset.
- **.value_counts()** counts how often each unique value appears in the dataset.
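
The methods listed above can be tried on a tiny column first. This sketch uses a made-up `ages` Series with one missing value, to show that these methods skip blanks by default:

```python
import pandas as pd

# Made-up ages; None becomes a missing value (NaN)
ages = pd.Series([30, 45, 45, None, 60], name="age")

print(ages.sum())           # total of the four non-missing values
print(ages.mean())          # average, ignoring the missing value
print(ages.median())        # middle value
print(ages.max())           # largest value
print(ages.min())           # smallest value
print(ages.mode())          # returns a Series, because ties are possible
print(ages.count())         # number of non-missing values
print(ages.value_counts())  # how often each unique value appears
```

Note that **.count()** reports 4 here, not 5, because the missing value is not counted.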

Try these methods out with the dataset! Please answer the following questions using Pandas methods:

In [6]:
# How old (on average) are the characters in the dataset?
avg_age = pudding_df["age"].mean()
avg_age

42.2750520205892

In [10]:
# How old is the oldest character in the dataset?
max_age = pudding_df["age"].max()
print(f"The oldest person in this dataset is {max_age} years old.")

The oldest person in this dataset is 2009.0 years old.
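
A maximum of 2009.0 looks more like a year than an age, so a value like this is worth inspecting before trusting summary statistics such as the mean. The sketch below shows one possible way to separate implausible values, using made-up rows. The cutoff of 110 is an arbitrary assumption for illustration, not a rule from the dataset.

```python
import pandas as pd

# Made-up rows: one age that is really a year, and one missing age
toy_df = pd.DataFrame({
    "character": ["Alex", "Sam", "Jo", "Kit"],
    "age": [34.0, 2009.0, None, 61.0],
})

age_limit = 110  # assumed cutoff for a plausible human age

# Rows whose age is above the cutoff
suspicious_rows = toy_df[toy_df["age"] > age_limit]
print(suspicious_rows)

# Rows at or below the cutoff; missing ages fail both comparisons
plausible_rows = toy_df[toy_df["age"] <= age_limit]
print(plausible_rows["age"].max())
```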


In [11]:
# How young is the youngest character in the dataset?
min_age = pudding_df["age"].min()
print(f"The youngest person in this dataset is {min_age} years old.")

The youngest person in this dataset is 3.0 years old.


In [12]:
# Count how often each gender value appears in the dataset.
# Hint: use the method at the bottom of the explanatory list!
gender_frequency = pudding_df["gender"].value_counts()
gender_frequency

man      16131
woman     6911
?            5
Name: gender, dtype: int64
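
Raw counts can be easier to compare as proportions. In the output above, for example, 16131 of 23047 labeled rows (about 70%) are `man`. **.value_counts()** can compute such shares directly with `normalize=True`, and `dropna=False` additionally counts missing values. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up gender labels, including one missing value
genders = pd.Series(["man", "man", "woman", "man", "?", None], name="gender")

# Default: counts per unique value, missing values left out
counts = genders.value_counts()
print(counts)

# normalize=True returns shares that add up to 1
shares = genders.value_counts(normalize=True)
print(shares)

# dropna=False also counts the missing values
counts_with_missing = genders.value_counts(dropna=False)
print(counts_with_missing)
```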

## Examining Subsets
Write a conditional statement that will filter the data frame to only show rows that have characters from a movie of your choice. Explore the dataset to find one of your favorite movies!

This creates a **subset** of your data frame, based on one specific condition.

In [14]:
title_filter = pudding_df["title"] == "The Matrix"
pudding_df[title_filter]

Unnamed: 0,title,release_year,character,gender,words,proportion_of_dialogue,age,gross,script_id
2095,The Matrix,1999,Agent Smith,man,976,0.170838,39.0,292.0,1141
2096,The Matrix,1999,Morpheus,man,2123,0.371609,38.0,292.0,1141
2097,The Matrix,1999,Neo,man,995,0.174164,35.0,292.0,1141
2098,The Matrix,1999,Oracle,woman,208,0.036408,66.0,292.0,1141
2099,The Matrix,1999,Tank,man,535,0.093646,32.0,292.0,1141
2100,The Matrix,1999,Trinity,woman,876,0.153335,32.0,292.0,1141
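
The filter above works because a comparison such as `pudding_df["title"] == "The Matrix"` produces a column of `True`/`False` values, and indexing the data frame with that column keeps only the `True` rows. The sketch below shows this on made-up rows and also combines two conditions; the film titles, names, and the 500-word threshold are placeholders.

```python
import pandas as pd

# Made-up rows for illustration
toy_df = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B", "Film B"],
    "character": ["Alex", "Sam", "Jo", "Kit"],
    "gender": ["woman", "man", "woman", "man"],
    "words": [1200, 300, 850, 90],
})

# The comparison yields one True/False value per row
title_filter = toy_df["title"] == "Film A"
print(title_filter.tolist())

# Indexing with the filter keeps only the True rows
film_a_rows = toy_df[title_filter]
print(film_a_rows)

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
women_over_500_words = toy_df[(toy_df["gender"] == "woman") & (toy_df["words"] > 500)]
print(women_over_500_words)
```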


## ❓ What potential issues do you notice when you look closer at this data?

What do you think about The Pudding's approach to assigning gender in this dataset? What alternatives could we potentially use, if any?

Your answer here (in *italics*):
*I think one of the biggest problems comes when some characters are based on the ACTOR's gender rather than the CHARACTER's. Look at Roz in "Monsters, Inc.," as discussed in the textbook (voiced by a man, but the character is a woman), or the Oracle in "The Matrix" (played by a woman, but the character is nonbinary) - these are complexities that the dataset doesn't tell you, and I think that specifying the CHARACTER's gender is better than the actor's.*