# SI 618 Day 02 - Pandas I
Dr. Chris Teplovs, University of Michigan School of Information

Copyright &copy; 2024.  This notebook may not be shared outside of the course without permission.

Notebook version 2024.09.05.7.CT

## Overview

This notebook introduces pandas, a Python library for data analysis, and gives an overview of its core data structures.  We will use pandas extensively in this course.


## Outline
1. Welcome (5 minutes)
2. Q&A (15 minutes)
3. Coding basics (python stuff) (30 minutes)
4. Conceptual overview & lecture (50 minutes)
5. Break (10 minutes)
6. Hands-on coding (25 minutes)
7. Application of concepts (30 minutes)
8. Debrief & next steps (10 minutes)


## Welcome (5 minutes)

## Q&A (15 minutes)

Your questions, comments, and observations about the pre-class readings.

My questions:
1. Why are we using Python?  Why not R?  Or SAS?  Or SPSS?  Or Stata?  Or Excel?
2. Why are we using Jupyter notebooks?  Why not just write scripts?
3. What is pandas?  Why are we using it?  Why not just use Python?

## Coding basics (python stuff) (30 minutes)

Before we start, let's look at some Visual Studio Code extensions that will make our lives easier.

1. Python
2. Jupyter
3. GitHub Copilot
4. Markdown All in One
5. Code Spell Checker
6. GitLens
7. indent-rainbow
8. Pylance
9. Rainbow CSV
10. Trailing Spaces

### Data structures in python

You've already been introduced to the basic data structures in Python: lists, tuples, and dictionaries.  We'll review them briefly here.

Python lists are ordered collections of items.  The items can be of any type, and can be mixed types.  Lists are mutable, meaning that you can change the contents of a list after you create it.  You can add items to a list, remove items from a list, and change the value of items in a list.  You can also sort a list.

```
my_list = [1, 2, 3, 4, 5]
my_list.append(6) # add an item to the end of the list
my_list[0] = 0 # change the value of the first item in the list
my_list[1] = 'two' # lists can contain mixed types
```

A tuple is similar to a list, but is immutable.  Once you create a tuple, you cannot change its contents.  You can, however, create a new tuple from an existing tuple.

```
my_tuple = (1, 2, 3, 4, 5)
my_tuple[0] = 0 # this will cause an error
my_tuple = (0, 2, 3, 4, 5) # this is ok
```

A dictionary is a collection of key-value pairs.  The keys must be unique, but the values can be duplicated.  Dictionaries are mutable.

```
my_dict = {'a': 1, 'b': 2, 'c': 3}
my_dict['d'] = 4 # add a new key-value pair
my_dict['a'] = 0 # change the value of an existing key
my_dict['a'] = my_dict['a'] + 1 # increment the value of an existing key
my_dict['e'] = my_dict['e'] + 1 # error: key 'e' does not exist
```

Let's introduce a new data structure: the set.  A set is an unordered collection of unique items.  You can add items to a set, remove items from a set, and check whether an item is in a set.  You cannot change the value of an item in a set.

```
my_set = {1, 2, 3, 4, 5}
my_set.add(6) # add an item to the set
my_set.add(6) # this is ok, but the set will still only contain one 6
my_set.remove(1) # remove an item from the set
my_set[0] = 0 # this will cause an error
```
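
The membership check mentioned above isn't shown in the snippet, so here is a minimal sketch of it, along with the usual set operations.  The course numbers are made-up example data.

```python
# Hypothetical example data: two small sets of course numbers
fall_courses = {501, 506, 618}
winter_courses = {618, 649, 670}

# Membership tests use the `in` operator
has_618 = 618 in fall_courses
has_649 = 649 in fall_courses

# Set algebra: union, intersection, and difference
either_term = fall_courses | winter_courses   # in at least one term
both_terms = fall_courses & winter_courses    # in both terms
fall_only = fall_courses - winter_courses     # in fall but not in winter

print(has_618, has_649)
print(sorted(either_term))   # sets are unordered, so sort before printing
print(both_terms)
print(sorted(fall_only))
```

Because sets are unordered, we sort them before printing whenever we want a predictable display order.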



We can combine data types into more complex ones.  For example, we could have a list of lists, or a dictionary of sets, or a list of tuples of dictionaries of sets.  You get the idea.

```
my_list = [1, 2, 3, 4, 5]
my_list_of_lists = [my_list, my_list, my_list]

my_set = {1, 2, 3, 4, 5}
my_dict_of_sets = {'a': my_set, 'b': my_set, 'c': my_set}

my_tuple = (my_dict_of_sets, my_dict_of_sets, my_dict_of_sets)
my_list_of_tuples = [my_tuple, my_tuple, my_tuple]

```

#### Challenge:

Consider the following code:

In [None]:
names = ['Chris', 'Xin', 'Nithin']
scores = [75, 95, 85]

Write code to print the following output.  Document your code using comments, docstrings, and type hints.

```
Chris: 75
Xin: 95
Nithin: 85
```

In [None]:
# insert your code here

#### Challenge:
Now consider the case where we have multiple scores per person.  Write code to use an appropriate data structure to store the scores for each person and print the following, exactly as it appears below.  Document your code using comments, docstrings, and type hints.

```
Chris: 75, 80, 85
Xin: 95, 90, 85
Nithin: 85, 90, 95
```

In [None]:
# insert your code here

## Conceptual overview & lecture (50 minutes)

Whereas [PEP-8](https://pep8.org/) suggests that `import` statements should be at the top of the file, we're going to put them here so that we can see what we're importing.

In [None]:
import pandas as pd
import numpy as np

Before we dive into `pandas`, let's take a look at some `numpy` basics.

NumPy is a Python library for scientific computing.  Most relevant to us, it provides a multidimensional array object and an assortment of routines for fast operations on arrays.  The array object is called an `ndarray` and is similar to a list, but all of the elements must be of the same type.  The elements of an `ndarray` are accessed using square brackets, just like a list.

https://numpy.org/doc/stable/user/quickstart.html

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html



Let's create a small dataframe to play with.

In [None]:
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
df

Creating small DataFrames is a good practice when you're testing things out, but it can get a bit cumbersome if we want a slightly larger DataFrame.  So let's generate some random data using `numpy` and then create a DataFrame from that.

In [None]:
# np.random.randn returns a sample (or samples) from the “standard normal” distribution.
np.random.randn(10, 4)

In [None]:
# we can use it to create a DataFrame with random values
pd.DataFrame(np.random.randn(10, 4))

But those column names don't make much sense.  Let's change them.

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df

If you re-run the above cell, you'll get different values.  That's because we're using `numpy.random` to generate random values.  If we want to be able to reproduce our results, we need to set the seed for the (pseudo-)random number generator.

What does setting the seed do?  It sets the starting point for the random number generator.  A pseudo-random number generator is a deterministic function: given its current state, it produces the next number in a sequence that merely looks random.  If you set the seed to the same value, you'll get the same sequence of random numbers.  If you set the seed to a different value, you'll get a different sequence of random numbers.

With numpy, you can set the seed using `numpy.random.seed()`.  The argument to `numpy.random.seed()` is the seed value.  The seed value can be any integer from 0 to 2**32 - 1.  It doesn't matter which value you pick, as long as you use the same one every time.  Common values are 0, 1, 42, and 618 (for this course).
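
To make the idea concrete, here is a small sketch that resets the seed and draws again.  The variable names `first`, `second`, and `third` are just for illustration.

```python
import numpy as np

np.random.seed(618)            # set the starting point
first = np.random.randn(3)     # draw three values

np.random.seed(618)            # reset to the same starting point
second = np.random.randn(3)    # draw three values again

np.random.seed(42)             # a different starting point
third = np.random.randn(3)

print(np.array_equal(first, second))   # same seed, same sequence
print(np.array_equal(first, third))    # different seed, different sequence
```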



In [None]:
np.random.seed(618)
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df

#### Challenge:

Implement a function that takes the number of rows, number of columns, and column names as required parameters, as well as a random seed that defaults to 42, and returns a DataFrame with random data.  Document your function using a docstring, comments, and type hints.

In [None]:
# insert your code here

#### Challenge: 

Write a set of assertions that test your function.  You should test that the function returns a DataFrame, that the DataFrame has the correct number of rows and columns, and that the column names are correct.

In [None]:
# insert your code here

### Universal functions (ufuncs) from numpy

Let's say we had a simple list of numbers and we wanted to add 1 to each number.  We could do that using a for loop:
```
def add_one(numbers):
    for i in range(len(numbers)):
        numbers[i] += 1
    return numbers

add_one([1, 2, 3, 4, 5])
```

But that's cumbersome, and annoying.  It would be nice if we could just add 1 to the list and have it add 1 to each element.  We can do that with numpy.

```
import numpy as np
numbers = np.array([1, 2, 3, 4, 5])
numbers + 1
```
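
To see why the array matters, here is a rough sketch contrasting the same expression on a plain list and on an `ndarray`.  The names `plain_list` and `list_error` are illustrative only.

```python
import numpy as np

plain_list = [1, 2, 3, 4, 5]

# Adding an integer to a list is not defined, so Python raises a TypeError
try:
    plain_list + 1
    list_error = None
except TypeError as err:
    list_error = err

# The same expression on an ndarray is applied element by element
numbers = np.array(plain_list)
plus_one = numbers + 1     # equivalent to np.add(numbers, 1)
doubled = numbers * 2      # equivalent to np.multiply(numbers, 2)

print(type(list_error).__name__)
print(plus_one)
print(doubled)
```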

Numpy provides us not only with universal functions for arithmetic, but also for trigonometry, logarithms, and other mathematical functions.  For example, we can take the square root of each element in an array using `np.sqrt()`.

```
np.sqrt(numbers)
```

That's enough numpy for now.  Let's move on to pandas.  pandas is a Python library for data analysis.  It provides a DataFrame object that is similar to a spreadsheet or a database table.  It also provides a Series object that is similar to a single column in a spreadsheet or a database table.  Let's take a look at pandas Series objects and then move on to pandas DataFrame objects.

### pandas.Series

pandas Series are similar to numpy arrays, but they have an index.  The index is a label for each element in the Series.  The index can be a number, a string, or a date.  The index is used to access elements in the Series.  The index is also used to align Series when performing operations on multiple Series.

So, for example, we can create a Series from a list of numbers.  Unless we specify otherwise, the index will be a sequence of integers starting at 0 (called a RangeIndex):

```
import pandas as pd
numbers = pd.Series([1, 2, 3, 4, 5])
numbers
```


In [None]:
import pandas as pd
numbers = pd.Series([1, 2, 3, 4, 5])
numbers
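
Index alignment is easier to see with labels other than 0, 1, 2.  The sketch below uses made-up quiz scores; the student "Sam" and the names `quiz_1` and `quiz_2` are hypothetical.

```python
import pandas as pd

# Hypothetical quiz scores, labelled by student name
quiz_1 = pd.Series([8, 9, 7], index=['Chris', 'Xin', 'Nithin'])
quiz_2 = pd.Series([10, 6, 9], index=['Xin', 'Nithin', 'Sam'])

# Addition matches values by index label, not by position
total = quiz_1 + quiz_2
print(total)
```

Labels that appear in only one of the two Series (here `Chris` and `Sam`) end up as `NaN` in the result, because there is nothing to add them to.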

### pandas.DataFrame

pandas DataFrames consist of one or more Series.  Each Series is a column in the DataFrame.  The Series are aligned using the index.  So, for example, we can create a DataFrame from a dictionary of Series.  The keys of the dictionary are the column names.  The values of the dictionary are the Series.

```
df = pd.DataFrame({'numbers': numbers, 'squares': numbers ** 2})
df
```


We can extract a single column from a DataFrame using square brackets.  The column will be returned as a Series.

```
df['numbers']
```

We can also extract a single column from a DataFrame using dot notation.  The column will be returned as a Series.

```
df.numbers
```
Be careful with dot notation.  It only works if the column name is a valid Python identifier.  That means it can't start with a number, it can't contain spaces, and it can't contain any of the following characters: `~!@#$%^&*()-+={}[]|\:;"'<,>.?/`.  If the column name is not a valid Python identifier, or if it conflicts with any DataFrame method or attribute names, you must use square brackets.
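
Here is a short sketch of both pitfalls.  The DataFrame `df_demo` and its column names are made up specifically to cause trouble.

```python
import pandas as pd

# Hypothetical column names chosen to clash with dot notation
df_demo = pd.DataFrame({'count': [3, 1, 2], 'unit price': [0.5, 2.0, 1.25]})

# `count` is also a DataFrame method, so dot notation returns the method
dot_result = df_demo.count
bracket_result = df_demo['count']

print(callable(dot_result))            # True: we got the method, not the column
print(type(bracket_result).__name__)   # Series

# A name containing a space is only reachable with square brackets
prices = df_demo['unit price']
print(prices.tolist())
```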



We can perform arithmetic operations on Series, just like we can with numpy arrays.  We can also perform arithmetic operations on DataFrames.  The operations are performed on each element in the Series or DataFrame.

```
df + 1
```


Let's say we wanted to add a new column to our DataFrame.  We can do that using square brackets and the assignment operator.  The new column will be added to the end of the DataFrame.

```
df['cubes'] = numbers ** 3
df
```

We can also use more complex expressions to create new columns.  For example, we can create a new column that is the sum of two existing columns.

```
df['sum'] = df.numbers + df.squares
df
```

We can also apply our own functions.  For example, we can use the `map()` method to apply a function to each element in a Series (here, a single column of our DataFrame).

```
def is_even(x):
    return x % 2 == 0
df['numbers'].map(is_even)
```

For those who are comfortable with lambda functions, we can use a lambda function instead of a named function.

```
df['numbers'].map(lambda x: x % 2 == 0)
```

Sometimes we want to compute a value for each row using more than one column as input.  We can do that using the `apply()` method with `axis=1`.  The `apply()` method takes a function as an argument.  With `axis=1`, the function receives a single argument: a Series containing the values from one row of the DataFrame.  The function returns a single value, which becomes that row's entry in the result.

```
df.apply(max, axis=1) # how does this work?
```
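
As a sketch of a row-wise function that really does use two columns, consider the following.  The DataFrame `df_demo`, its columns, and the function `describe_shape` are all hypothetical.

```python
import pandas as pd

# Hypothetical data: two measurements per row
df_demo = pd.DataFrame({'width': [2, 3, 4], 'height': [5, 1, 4]})

def describe_shape(row: pd.Series) -> str:
    """Label a row as 'wide', 'tall', or 'square' by comparing two columns."""
    if row['width'] > row['height']:
        return 'wide'
    if row['width'] < row['height']:
        return 'tall'
    return 'square'

# axis=1 passes each row to the function as a Series indexed by column name
df_demo['shape'] = df_demo.apply(describe_shape, axis=1)
print(df_demo)
```

For simple arithmetic on columns, a vectorized expression such as `df_demo.width * df_demo.height` is usually faster than `apply()`; row-wise `apply()` is most useful when the logic involves branching like this.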

Note that, while less common than row-wise manipulation, we can also use the `apply()` method to apply a function to each column in a DataFrame.  We just need to specify `axis=0` (or `axis='index'`).  We name the function `value_range` rather than `range` so that we don't shadow Python's built-in `range()`.

```
def value_range(x):
    return x.max() - x.min()
df.apply(value_range, axis=0)
```

### Counting value frequencies

Let's say we wanted to count the number of times each value appears in a Series.  We can do that using the `value_counts()` method.

```
df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5, 5, 1, 2, 3, 4, 5, 6], 'squares': [1, 4, 9, 16, 25, 25, 1, 4, 9, 16, 25, 36], 'names': ['one', 'two', 'three', 'four', 'five', 'five', 'one', 'two', 'three', 'four', 'five', 'six']})
df.numbers.value_counts()
```

Note that the `value_counts()` method returns a Series, not a DataFrame.  The index of the Series is the unique values in the original Series.  The values of the Series are the counts of each unique value.  The Series is sorted in descending order by count.

In [None]:
df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5, 5, 1, 2, 3, 4, 5, 6], 'squares': [1, 4, 9, 16, 25, 25, 1, 4, 9, 16, 25, 36], 'names': ['one', 'two', 'three', 'four', 'five', 'five', 'one', 'two', 'three', 'four', 'five', 'six']})
df.numbers.value_counts()
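
Often we care about proportions rather than raw counts.  As a small sketch on the same numbers, `value_counts()` accepts `normalize=True` for that; the names `counts` and `shares` are ours.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5, 5, 1, 2, 3, 4, 5, 6])

counts = numbers.value_counts()                  # absolute counts
shares = numbers.value_counts(normalize=True)    # proportions that sum to 1

print(counts.index[0], counts.iloc[0])   # most frequent value and its count
print(shares.loc[5])                     # fraction of entries equal to 5
```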

### Describing DataFrames

Let's say we wanted to get some basic descriptive statistics for each column in a DataFrame.  We can do that using the `describe()` method.

```
df.describe()
```

We can also get information about the data types of each column in a DataFrame using the `info()` method.

```
df.info()
```

In [None]:
df.describe()


In [None]:
df.info()

The `dtypes` attribute of a DataFrame contains the data types of each column in the DataFrame.

```
df.dtypes
```

You can extract all columns of a particular data type using the `select_dtypes()` method, either by specifying the data type you want to include or by specifying the data type you want to exclude.

```
df.select_dtypes(include='number')
```

```
df.select_dtypes(include='object')
```

```
df.select_dtypes(exclude='number')
```

### Dropping rows and columns

Let's say we wanted to drop the `numbers` column from our DataFrame.  We can do that using the `drop()` method.

```
df.drop('numbers', axis=1)
```

Note that the `drop()` method returns a new DataFrame.  It does not modify the original DataFrame.  If we want to modify the original DataFrame, we need to use the `inplace=True` argument.

```
df.drop('numbers', axis=1, inplace=True)
```

We can also drop rows using the `drop()` method.  To drop rows, we need to specify the index labels of the rows we want to drop.  Our DataFrame still has the default integer index, so the labels are integers.

```
df.drop([0, 1], axis=0)
```

### Missing values

* a missing value is just that -- it's missing
* represented as nan, NaN, NAN, np.nan (you get the idea); note that the alias np.NaN was removed in NumPy 2.0
* many tools that we'll be learning can't handle missing values
* you need to decide what to do with it
* can leave it as is, replace it with a scalar value, replace it with the output of a function (like mean), or drop the row
* think of what that would mean if you were going to calculate the mean of 1,2,3,NaN,4,5,6
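
The last bullet can be checked directly.  This sketch compares how numpy and pandas treat the missing value by default; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, np.nan, 4, 5, 6]

numpy_mean = np.mean(values)              # NaN propagates through the arithmetic
numpy_nanmean = np.nanmean(values)        # ignores NaN explicitly
pandas_mean = pd.Series(values).mean()    # skips NaN by default (skipna=True)

print(numpy_mean, numpy_nanmean, pandas_mean)
```

Neither answer is "right" in general: skipping the missing value quietly changes the denominator from 7 to 6, which is exactly the kind of decision you need to make deliberately.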

Let's set up a DataFrame with some missing values.

In [None]:
df_missing = pd.DataFrame(
    {'a': [1, 2, 3, np.nan],
    'b': [2, 3, np.nan, 5],
    'c': [1, 2, 3, 4]})

In [None]:
df_missing

In [None]:
df_missing.info()

The `missingno` library provides some nice tools for visualizing missing values.

In [None]:
import missingno as msno # again, violates PEP-8
msno.matrix(df_missing)

We can use the `isna()` method to check for missing values.

In [None]:
df_missing.isna()

We can then use the `sum()` method to count the number of missing values in each column:

In [None]:
df_missing.isna().sum()

In [None]:
df_missing.dropna(how='any') # drop rows with any missing values

In [None]:
df_missing.dropna(axis=1, how='any') # drop columns with any missing values

In [None]:
# add a new column consisting of missing values
df_missing['d'] = np.nan
df_missing

In [None]:
df_missing.dropna(axis=1, how='all') # drop columns where all values are NaN

In [None]:
# drop rows that have missing values for column 'b'
df_missing.dropna(subset=['b'])

### Filling in missing values
Sometimes, you don't want to drop missing values, but rather fill them in with some other value.  For example, you might want to fill in missing values with a scalar value, or perhaps with the mean or median of the column.  You can do that using the `fillna()` method.

```
df_missing.fillna(0) # fill in missing values with 0
```

```
df_missing.fillna(df_missing.mean()) # fill in missing values with the mean of the column
```

```
df_missing.fillna(df_missing.median()) # fill in missing values with the median of the column
```

In [None]:
df_missing.fillna(0) # fill missing values with 0

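A slightly different approach, and one that's often used in time series analysis, is to fill in missing values with the last known value.  You can do that using the `ffill()` method.  (Older code often writes this as `fillna(method='ffill')`, but that form is deprecated in recent versions of pandas.)

```
df_missing.ffill()
```
and its counterpart, `bfill()`, which fills in missing values with the next known value:

```
df_missing.bfill()
```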
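
A tiny sketch makes the direction of each fill easier to see.  The Series `readings` is made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()    # carry the last known value forward
backward = readings.bfill()   # pull the next known value backward

print(forward.tolist())
print(backward.tolist())
```

Note that the final value stays missing after `bfill()`, because there is no later value to pull from.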

### Setting and resetting indexes

Let's say we wanted to use the `names` column as the index for our DataFrame.  We can do that using the `set_index()` method.

```
df.set_index('names')
```

Note that the `set_index()` method returns a new DataFrame.  It does not modify the original DataFrame.  If we want to modify the original DataFrame, we need to use the `inplace=True` argument.

```
df.set_index('names', inplace=True)
```

We can also reset the index (to move the index back into the DataFrame as a column) using the `reset_index()` method.

```
df.reset_index()
```

We'll use indexes in the next section when we talk about selecting rows and columns.

### Filtering rows

First, let's set the index to be the `names` column.

```
df.set_index('names', inplace=True)
```

The .loc and .iloc indexers are used to select rows and columns from a DataFrame.  The .loc indexer selects rows and columns based on the labels of the rows and columns.  The .iloc indexer selects rows and columns based on the position of the rows and columns.

Let's say we wanted to select the row with the label 'one'.  We can do that using the .loc indexer.

```
df.loc['one']
```

The .iloc indexer works the same way, but uses the position of the row instead of the label of the row.

```
df.iloc[0]
```
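
The two indexers also differ in how they slice.  Here is a minimal sketch on a made-up DataFrame called `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'numbers': [1, 2, 3, 4]},
    index=['one', 'two', 'three', 'four'],
)

by_label = df_demo.loc['two', 'numbers']   # row label, column label
by_position = df_demo.iloc[1, 0]           # row position, column position

# Label slices include the end label; position slices exclude the end position
label_slice = df_demo.loc['one':'three']   # rows 'one', 'two', and 'three'
position_slice = df_demo.iloc[0:2]         # rows at positions 0 and 1 only

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```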

#### Boolean filtering (IMPORTANT)

Let's say we wanted to select only the rows where the value in the `numbers` column is greater than 3.  We can do that using a boolean expression.

```
df[df.numbers > 3]
```

We can also use the `isin()` method to select rows whose name is in a list of values.  Because `names` is now the index rather than a column, we call `isin()` on `df.index`.

```
df[df.index.isin(['one', 'two', 'three'])]
```

We can also use the `between()` method to select rows where the value in the `numbers` column is between two values.

```
df[df.numbers.between(2, 4)]
```

We can also use the `str.contains()` method to select rows whose name contains a particular string.  Again, because `names` is now the index, we use `df.index`.

```
df[df.index.str.contains('o')]
```

Finally, pandas provides the `query()` method, which lets us write the filter condition as a string, in a style loosely reminiscent of a SQL `WHERE` clause.

```
df.query('numbers > 3')
```
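
Conditions can be combined.  The sketch below uses its own small DataFrame, `df_demo`, in which `names` is an ordinary column; the variable `threshold` is also just for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'numbers': [1, 2, 3, 4, 5, 6],
    'names': ['one', 'two', 'three', 'four', 'five', 'six'],
})

# & means "and", | means "or", ~ means "not"; wrap each condition in parentheses
middle = df_demo[(df_demo.numbers > 2) & (df_demo.numbers < 6)]
ends = df_demo[(df_demo.numbers == 1) | (df_demo.numbers == 6)]
no_o = df_demo[~df_demo.names.str.contains('o')]

# query() can refer to a Python variable by prefixing its name with @
threshold = 3
above = df_demo.query('numbers > @threshold')

print(middle.numbers.tolist())
print(ends.numbers.tolist())
print(no_o.names.tolist())
print(above.numbers.tolist())
```

The parentheses matter: `&` and `|` bind more tightly than comparison operators in Python, so leaving them out typically produces an error.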


### Sorting rows

Let's say we wanted to sort the rows in our DataFrame by the values in the `numbers` column.  We can do that using the `sort_values()` method.

```
df.sort_values('numbers')
```

By default, the `sort_values()` method sorts in ascending order.  We can sort in descending order by setting the `ascending` argument to `False`.

```
df.sort_values('numbers', ascending=False)
```

We can also sort by the index using the `sort_index()` method.

```
df.sort_index()
```


### Grouping rows

Let's say we wanted to group the rows in our DataFrame by the values in the `numbers` column.  We can do that using the `groupby()` method.

```
df.groupby('numbers')
```

By itself, the `groupby()` method doesn't do much.  It just creates a `DataFrameGroupBy` object.  We need to apply an aggregation function to the `DataFrameGroupBy` object to get anything useful.  For example, we can get the mean of each group using the `mean()` method.

```
df.groupby('numbers').mean()
```

We can also apply multiple aggregation functions to the `DataFrameGroupBy` object using the `agg()` method.

```
df.groupby('numbers').agg(['mean', 'median'])
```
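
Here is a small self-contained sketch of the split-then-summarize idea.  The DataFrame `df_demo`, its `team` and `score` columns, and the output column names are all hypothetical.

```python
import pandas as pd

# Hypothetical data: scores for two teams
df_demo = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b', 'b'],
    'score': [10, 20, 5, 15, 40],
})

# Split the rows by team, then summarize the score column within each group
mean_by_team = df_demo.groupby('team')['score'].mean()

# Named aggregation: each keyword becomes a column in the result
summary = df_demo.groupby('team').agg(
    n=('score', 'size'),
    mean_score=('score', 'mean'),
    max_score=('score', 'max'),
)

print(mean_by_team)
print(summary)
```

Named aggregation is one way to keep the output columns flat and readable; passing a list such as `['mean', 'median']` works too, but produces a two-level column index.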


### Basic Plots

pandas provides some basic plotting functionality.  We'll use the `plot()` method to create plots.  The `plot()` method takes a `kind` argument that specifies the type of plot to create.  The default is a line plot.

For example, using the survey data that we load in the hands-on section below, we can generate a histogram of the `Age` column, keeping only values strictly between 15 and 100.

```
df[(df.Age < 100) & (df.Age > 15)].Age.plot.hist()
```

Another example: generate a bar chart of the output from `value_counts()` of the `benefits` column.

```
df.benefits.value_counts().plot.bar()
```



## Break (10 minutes)

## Hands-on Coding (25 minutes)

From https://www.kaggle.com/osmi/mental-health-in-tech-survey


### Data Description

This dataset is from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace.

### Metadata
|Field|Description|
|:----|:----|
|**Timestamp**| |
|**Age**| |
|**Gender**| |
|**Country**| |
|**state**| If you live in the United States, which state or territory do you live in?
|**self_employed**| Are you self-employed?
|**family_history**| Do you have a family history of mental illness?
|**treatment**| Have you sought treatment for a mental health condition?
|**work_interfere**| If you have a mental health condition, do you feel that it interferes with your work?
|**no_employees**| How many employees does your company or organization have?
|**remote_work**| Do you work remotely (outside of an office) at least 50% of the time?
|**tech_company**| Is your employer primarily a tech company/organization?
|**benefits**| Does your employer provide mental health benefits?
|**care_options**| Do you know the options for mental health care your employer provides?
|**wellness_program**| Has your employer ever discussed mental health as part of an employee wellness program?
|**seek_help**| Does your employer provide resources to learn more about mental health issues and how to seek help?
|**anonymity**| Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
|**leave**| How easy is it for you to take medical leave for a mental health condition?
|**mental_health_consequence**| Do you think that discussing a mental health issue with your employer would have negative consequences?
|**phys_health_consequence**| Do you think that discussing a physical health issue with your employer would have negative consequences?
|**coworkers**| Would you be willing to discuss a mental health issue with your coworkers?
|**supervisor**| Would you be willing to discuss a mental health issue with your direct supervisor(s)?
|**mental_health_interview**| Would you bring up a mental health issue with a potential employer in an interview?
|**phys_health_interview**| Would you bring up a physical health issue with a potential employer in an interview?
|**mental_vs_physical**| Do you feel that your employer takes mental health as seriously as physical health?
|**obs_consequence**| Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
|**comments**| Any additional notes or comments



Read the CSV file into a DataFrame:

NOTE: macOS users may need to run the `Install Certificates.command` script located in the Python 3 folder in order to avoid SSL errors when loading data from a URL.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/umsi-data-science/data/main/survey.csv")

It's common to look at the resulting DataFrame using .head()

In [None]:
df.head()

If you want to look at a random sample, you can use .sample()

In [None]:
df.sample(5)

Finally, you can get some basic information about the size and shape of the DataFrame:

In [None]:
print("The number of rows of the dataset is: ", len(df))
print("The number of columns of the dataset is: ", len(df.columns))
print("The shape of the dataset is: ", df.shape)

You can list the columns:


In [None]:
df.columns

And you can extract one or more columns.  The first command below extracts a single column as a Series; the second extracts two columns as a DataFrame (note the double square brackets):

In [None]:
print(df['Country'])

In [None]:
country_state = df[['Country', 'state']]
country_state.head()

In [None]:
df.iloc[0]

In [None]:
df.loc[0]

In [None]:
df_gender = df.set_index('Gender')

In [None]:
df_gender.loc['Female']

Let's see what the possible values of the 'Country' column are:

In [None]:
df['Country'].value_counts()

Let's do some boolean filtering:

Let's say we wanted to select only the rows where the value in the `Country` column is "United States".

```
df[df.Country == 'United States']
```



Let's do some basic data cleaning.  First, let's examine the `Age` column.

```
df.Age
```
Do you see anything strange?  Let's look at the minimum and maximum values.

```
df.Age.min()
```

```
df.Age.max()
```

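Let's say we wanted to replace all of the values in the `Age` column that are greater than 100 with the median age.  We can do that using the `where()` method.  Note that `where()` keeps the values for which the condition is `True` and replaces the others, so the condition describes the values we want to keep.

```
df.Age.where(df.Age <= 100, df.Age.median())
```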
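
The direction of `where()` is easy to get backwards, so here is a small sketch on made-up ages.  The values and the names `ages` and `median_age` are ours, not from the survey.

```python
import pandas as pd

# Hypothetical ages, including one implausible entry
ages = pd.Series([25, 30, 329, 41])
median_age = ages.median()

# where() KEEPS values where the condition is True and replaces the rest
cleaned = ages.where(ages <= 100, median_age)

# mask() is the mirror image: it REPLACES values where the condition is True
cleaned_too = ages.mask(ages > 100, median_age)

print(median_age)
print(cleaned.tolist())
print(cleaned_too.tolist())
```

Neither call changes `ages` itself; each returns a new Series, so assign the result back if you want to keep it.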


#### Challenge: 
Use the `where()` method to replace all of the values in the `Age` column that are less than 16 with the median age.


In [None]:
# insert your code here

Example: Find people who reported a family history of mental health conditions.
Look at the data description above, and you'll see that we can use the `family_history` column to find people who reported a family history of mental health conditions.

```
df.family_history
```

In [None]:
# get a sense of the distribution of the values in the family_history column
df.family_history.value_counts()

In [None]:
# do the filtering
df_family_history = df[df.family_history == 'Yes']
df_family_history.head()

You can use a simple expression like `df.family_history == 'Yes'` to find people who reported a family history of mental health conditions, but you can also use more complex expressions.  For example, to find people who reported a family history of mental health conditions and are between the ages of 20 and 30, you could use the following expression:

```
df[(df.family_history == 'Yes') & (df.Age.between(20, 30))]
```


In [None]:
df[(df.family_history == 'Yes') & (df.Age.between(20, 30))]

Be careful to always look at the possible values of a column before you do any filtering.  For example, let's say that we wanted to extract the rows corresponding to 
individuals who are willing to discuss a mental health issue with their coworkers.  We might try the following:

```
df[df.coworkers == 'Yes']
```

However, if we look at the possible values of the `coworkers` column, we'll see that there are actually three possible values: `Yes`, `No`, and `Some of them`.  So we need to modify our expression to take that into consideration.


In [None]:
df.coworkers.value_counts()

#### Challenge: 
Report how many people are willing to discuss a mental health issue with their coworkers.

In [None]:
# insert your code here

#### Challenge:

What is the mean age of respondents?

In [None]:
# insert your code here


Does that look right?  What's going on? Can you fix it?

#### Challenge:
Use a better method to report the "expected" value of the `Age` column. (Hint: consider other measures of central tendency.)

In [None]:
# insert your code here

#### Challenge:
What are the unique values of the `Gender` column?

In [None]:
# insert your code here

#### Challenge:
Describe (using a markdown block that you insert below) how, if at all, you would clean the `Gender` column, implement your solution (using a code block that you insert below), and comment on any bias that might be introduced by your implementation (again, by inserting a markdown block below your code).

#### Challenge:
Find the unique categories of no_employees. What is the frequency of each category?

In [None]:
# insert your code here

### Aggregating data

Example: Find the number of respondents from each state.
```
df.state.value_counts()
```

Alternatively, we could use the `groupby()` method to group the rows by state and then use the `size()` method to get the size of each group.

```
df.groupby('state').size()
```
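
The two approaches give the same counts but order them differently, and both skip missing values by default.  A minimal sketch on made-up data (the DataFrame `df_demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical responses; np.nan stands for a missing state
df_demo = pd.DataFrame({'state': ['MI', 'CA', 'MI', np.nan, 'CA', 'MI', 'NY']})

by_count = df_demo.state.value_counts()       # sorted by frequency, NaN dropped
by_label = df_demo.groupby('state').size()    # sorted by group label, NaN dropped
with_missing = df_demo.state.value_counts(dropna=False)   # NaN counted too

print(by_count)
print(by_label)
print(with_missing.sum())
```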


## Application of Concepts (30 minutes)

Exploration of Movie Titles and Movie Cast

We're going to use two datasets from the Internet Movie Database (IMDb) to explore movie titles and movie cast.  The first dataset contains information about movies, including the title, year of release, and genre(s).  The second dataset contains information about the cast of each movie, including the name of the actor/actress, the character they played, and the billing position (i.e., the order in which they were listed in the credits).

We're going to load them via URLs, but you can also download them and load them from your local machine.  Note that if you get a weird "certificate error" or "SSL error", you'll need to follow the instructions from last class to fix it (i.e. click on "Install Certificates.command" in the Finder); alternatively you can use `%pip install certifi` in a new code block below.


In [None]:
titles = pd.read_csv('https://github.com/umsi-data-science/data/raw/main/titles.csv', index_col=None)
cast = pd.read_csv('https://github.com/umsi-data-science/data/raw/main/cast.zip', index_col=None)

#### Challenge:
Show the first 3 lines of both the titles and the cast DataFrames.

In [None]:
# insert your code here

#### Challenge:
How many movies are listed in the titles DataFrame?

In [None]:
# insert your code here

#### Challenge:
What are the earliest two films listed in the titles DataFrame?

In [None]:
# insert your code here

#### Challenge:
How many movies have the title "Hamlet"?

In [None]:
# insert your code here

#### Challenge:
List all of the "Treasure Island" movies from earliest to most recent.

In [None]:
# insert your code here

#### Challenge:
What are the 10 most common movie names of all time?

In [None]:
# insert your code here

#### Challenge:
Who are the 10 people most often credited as "Herself" in film history?

In [None]:
# insert your code here

#### Challenge:
What are the 10 most frequent roles that start with the word "Science"?


In [None]:
# insert your code here