<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Building a Scene Recognition Model form Video Frames</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/">https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Frames of a Video

Visual images are an important part of all media and Data Scientists are often using images as data sources.  In this MicroProject, you will create a simple model to detect the amount of time spent in two different "scenes" we used when creating office-hour style videos for Data Science DISCOVERY.

*This MicroProject was inspired by a podcast that we recently recorded with the team from the Center for Innovation in Teaching and Learning who helped produce our video.  To learn the background and hear from Karle and Wade about the journey of creating DISCOVERY, go over and listen to our episode on the "Teach Talk Listen Learn Podcast" where talk with TTLL host Bob Dignan and our CITL video producer Eric Schumacher: https://citl.illinois.edu/citl-101/teaching-learning/teach-talk-listen-learn*


### Loading a Video Frame

We have provided you with one frame every second from our video [*"Outliers Impact on Correlation (m6-02b)"*](https://www.youtube.com/watch?v=bd6hQ2UcIJc) that is used as part of our [DISCOVERY lecture covering Correlation](https://discovery.cs.illinois.edu/learn/Towards-Machine-Learning/Correlation/).

The `skimage` library is commonly used to load image data into Python.  Specifically:

- `skimage.io.imread(filename)` will read a filename and return the pixel color for every pixel in the image.
- To use the `imread` function, you will need to either do one of the following:

    1. Either import all of `skimage` by using the import line `import skimage`.  After importing all of `skimage`, you will call the function using it's fully qualified name: `skimage.io.imread(filename)`.
    
    **OR**
    
    2. Import only the `imread` function by using the import line `from skimage.io import imread`.  After importing only `imread`, you will call the function directly: `imread(filename)`

#### Read Pixel Data for `frames/frame_0001.jpg`

We have provided a `frames` directory with all of the frames.  In the following cell, store the pixel color data from the file named `frames/frame_0001.jpg` image in the variable `pixels` by using the `imread` function:


In [138]:
from skimage.io import imread

pixels = imread("frames/frame_0001.jpg")

### 🔬 Checkpoint Tests 🔬

In [139]:
### TEST CASE for Reading Video Frames
tada = "\N{PARTY POPPER}"

assert("pixels" in vars())
assert(pixels.shape == (360, 640, 3))
assert(pixels[0][0][0] == 91)

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 1: Storing Average Pixel Color

The **shape** of your data is the `rows` by `columns` by `color values` as 3-dimensional list.  Here's a formatted view of your `pixels` data:

```
[
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80] ], ... ],   # Row #1
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80] ], ... ],   # Row #2
  ...                                                    # ...
]
```

The current shape of `pixels` is 360 rows by 640 columns by 3 colors.  Each of the three colors represent the three color channels on a screen: red, green, and blue.

Using `pixel.mean()`, we find the average color grouping **ALL** the color channels (combining blues and reds and greens together).  Try it out:


In [140]:
pixels.mean()

np.float64(72.18011863425926)

In [141]:
pixels = pixels.reshape(-1, 3)
pixels


array([[ 91,  83,  80],
       [ 91,  83,  80],
       [ 91,  83,  80],
       ...,
       [162, 131, 110],
       [162, 131, 110],
       [162, 131, 110]], dtype=uint8)

To find the average of each color channel, the `pixels.resize(-1, 3).mean(axis=0)` function will find the mean of everything **except** the color channels.  Check out the new mean:

In [142]:
pixels.mean(axis=0)

array([88.65917535, 67.45620226, 60.4249783 ])

### Puzzle 1.1: Finding the Average Color of One Image

Store `pixel`'s average red value in `r`, average green value in `g`, and average blue value in `b`:

In [143]:
mean = pixels.mean(axis=0)
r = mean[0]
g = mean[1]
b = mean[2]
r

np.float64(88.65917534722222)

In [144]:
### TEST CASE for Puzzle 1.1: Finding the Average Color of One Image
tada = "\N{PARTY POPPER}"

import math
assert("r" in vars())
assert("g" in vars())
assert("b" in vars())
assert(math.isclose(r, 88.65917534722222))
assert(math.isclose(g, 67.45620225694445))
assert(math.isclose(b, 60.42497829861111))

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


### Puzzle 1.2: Finding the Average Color of All Images

The following code loops through every file in the `frames` directory -- this will include `frame_0001.jpg` (like you analyzed already) and also `frame_0002.jpg`, `frame_0003.jpg`, and all 300+ frames!

Create a DataFrame where each row is one frame with the following four columns:
- `frame`, the filename of the frame
- `r`, the average red color of the frame
- `g`, the average green color of the frame
- `b`, the average blue color of the frame

The structure of the code should be nearly identical to writing a simulation.  Instead of creating random variables for your real world data, your real world data will be the filename, and the average color values.

- See: https://discovery.cs.illinois.edu/learn/Simulation-and-Distributions/Simple-Simulations-in-Python/

In [145]:
import glob
import os
import pandas as pd

data = []
for frame in glob.glob(os.path.join("frames", "*.jpg")): 
  # `frame`` contains the filename of the frame (ex: "frames/frame_0001.jpg").  Use it for `imread` to read the frame image data.
  frame_pixels = imread(frame)
  frame_pixels = frame_pixels.reshape(-1, 3)
  r = pixels.mean(axis=0)[0]
  g = frame_pixels.mean(axis=0)[1]
  b = frame_pixels.mean(axis=0)[2]
  d = {"frame": frame, "r": r, "g": g, "b": b}
  data.append(d)
df = pd.DataFrame(data)
df

Unnamed: 0,frame,r,g,b
0,frames/frame_0028.jpg,88.659175,68.291619,60.394323
1,frames/frame_0014.jpg,88.659175,53.664076,45.780673
2,frames/frame_0216.jpg,88.659175,66.814054,61.303611
3,frames/frame_0202.jpg,88.659175,237.795842,235.532795
4,frames/frame_0148.jpg,88.659175,229.394049,227.089674
...,...,...,...,...
325,frames/frame_0227.jpg,88.659175,243.987266,242.135369
326,frames/frame_0233.jpg,88.659175,244.020039,242.136675
327,frames/frame_0019.jpg,88.659175,67.002426,59.679549
328,frames/frame_0025.jpg,88.659175,66.656454,59.014805


### 🔬 Checkpoint Tests 🔬

In [146]:
### TEST CASE for Puzzle 1.2: Finding the Average Color of All Images
tada = "\N{PARTY POPPER}"

import math
assert("df" in vars())
assert(len(df) == 330)
assert("r" in df)
assert("g" in df)
assert("b" in df)
assert("frame" in df)
assert( abs( df[ df.frame.str.endswith("_0001.jpg") ]["r"].sum() - 88 ) < 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 2: Create a Simple Classifier

In the DISCOVERY lecture videos, there are two primary "scenes" in the video:

1. **"Office Hours Studio Scene"**, where Karle and Wade are talking to each other and the audience,

2. **"Notebook Scene"**, where the notebook is displayed

View the `frames` folder on your computer and find **at least three more frames** that are in the "office hours studio scene" and **at least three more frames** that are in the "notebook scene".  Add the frames you found to the list below:

In [147]:
# List of at least four office hour frames by the filename's frame number:
office_hour_frames = [1, 29, 118, 299]

# List of at least four notebook frames by the filename's frame number:
notebook_frames = [30, 92, 189, 258]

### Observing the Average Colors of Your Frames

The following code uses your sample frames to display the average color values for your selected frames:

In [148]:
print("== Office Hour Frames ==")
print( df[ df["frame"].isin( [f"frames\\frame_{frame:04d}.jpg" for frame in office_hour_frames] ) ] )
print(df[(df["frame"] == "frames/frame_0001.jpg") | (df["frame"] == "frames/frame_0029.jpg") | (df["frame"] == "frames/frame_0118.jpg") | (df["frame"] == "frames/frame_0299.jpg")])
print("== Notebook Frames ==")
print( df[ df["frame"].isin( [f"frames\\frame_{frame:04d}.jpg" for frame in notebook_frames] ) ] )
print(df[(df["frame"] == "frames/frame_0030.jpg") | (df["frame"] == "frames/frame_0092.jpg") | (df["frame"] == "frames/frame_0189.jpg") | (df["frame"] == "frames/frame_0258.jpg")])

== Office Hour Frames ==
Empty DataFrame
Columns: [frame, r, g, b]
Index: []
                     frame          r          g          b
12   frames/frame_0001.jpg  88.659175  67.456202  60.424978
14   frames/frame_0029.jpg  88.659175  67.881858  59.730273
125  frames/frame_0299.jpg  88.659175  70.066285  62.480816
213  frames/frame_0118.jpg  88.659175  67.930573  60.564405
== Notebook Frames ==
Empty DataFrame
Columns: [frame, r, g, b]
Index: []
                     frame          r           g           b
25   frames/frame_0189.jpg  88.659175  232.358021  230.135013
134  frames/frame_0258.jpg  88.659175  243.130226  241.545473
225  frames/frame_0092.jpg  88.659175  230.046940  230.497027
310  frames/frame_0030.jpg  88.659175  236.513451  236.777122


### Create Your Classifier Function

A **classifier function** is a function that takes data and gives a classification for that data.  Create a new function, `classifyFrame` that receives an `r`, `g`, and `b` value.

Using information from your frames above, have the function return the string `"office hour"` or `"notebook"` based on the values of `r`, `g`, and `b`.

**IMPORTANT**: Make sure your classifier can handle **ANY** input -- even frames you have not seen before!  For example, you might decide that you will call a frame an `"office hour"` frame if the sum of `r`, `g` and `b` is greater than 100 and otherwise it's a `"notebook"` scene.

In [149]:
def classifyFrame(r, g, b):
  # Return either "office hour" or "notebook" based on the values of `r`, `g`, and `b`.
  sum_frame = r + g + b
  if sum_frame < 288:
    return "office hour" 
  else:
    return "notebook"

### 🔬 Checkpoint Tests 🔬

In [150]:
### TEST CASE for Part 2: Create a Simple Classifier
tada = "\N{PARTY POPPER}"

r = classifyFrame(0, 0, 0)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(255, 255, 255)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(0, 255, 255)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(255, 255, 0)
assert(r == "notebook" or r == "office hour")

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 3: Using Your Classifier!

Now that we have a classifier, we should run it on every frame!

The following cell runs your `classifyFrame` classifier on every frame and adds a new column `scene` and displayed 20 random rows:

In [151]:
df["scene"] = df.apply(lambda row: classifyFrame(row.r, row.g, row.b), axis=1)
df.sample(20)

Unnamed: 0,frame,r,g,b,scene
324,frames/frame_0186.jpg,88.659175,232.451589,230.251189,notebook
256,frames/frame_0021.jpg,88.659175,66.617891,59.294557,office hour
196,frames/frame_0054.jpg,88.659175,232.579991,232.816007,notebook
122,frames/frame_0312.jpg,88.659175,70.261745,62.89487,office hour
67,frames/frame_0275.jpg,88.659175,242.536293,241.133168,notebook
39,frames/frame_0167.jpg,88.659175,70.131476,62.044696,office hour
56,frames/frame_0170.jpg,88.659175,70.037899,62.148294,office hour
114,frames/frame_0307.jpg,88.659175,70.20556,62.805404,office hour
113,frames/frame_0313.jpg,88.659175,70.192526,62.526467,office hour
227,frames/frame_0090.jpg,88.659175,229.869128,230.489566,notebook


In [152]:
df[ (df.frame.str.endswith("0018.jpg"))]

Unnamed: 0,frame,r,g,b,scene
312,frames/frame_0018.jpg,88.659175,65.819722,58.584336,office hour


### 🔬 Checkpoint Tests 🔬

In [153]:
### TEST CASE for Part 3: Using Your Classifier
tada = "\N{PARTY POPPER}"

assert("scene" in df)

assert(len(df[ df.scene == "notebook" ]) > 100)
assert(len(df[ df.scene == "office hour" ]) > 75)
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) == len(df))

assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


## Observing Results

In the next 5 cells, we display a frame and you'll run code to check what your classifier classified the frame as being!  Make sure to run the code for each frame:

### Frame #0001: Office Hours

In [154]:
df[ df.frame.str.endswith("0001.jpg") ]

Unnamed: 0,frame,r,g,b,scene
12,frames/frame_0001.jpg,88.659175,67.456202,60.424978,office hour


![Frame 0001](frames/frame_0001.jpg)

### Frame #0081: Notebook

In [155]:
df[ df.frame.str.endswith("0081.jpg") ]

Unnamed: 0,frame,r,g,b,scene
150,frames/frame_0081.jpg,88.659175,229.915091,230.48303,notebook


![Frame 0001](frames/frame_0081.jpg)

### Frame #0191: Notebook

In [156]:
df[ df.frame.str.endswith("0191.jpg") ]

Unnamed: 0,frame,r,g,b,scene
304,frames/frame_0191.jpg,88.659175,232.354644,230.103359,notebook


![Frame 0001](frames/frame_0191.jpg)

### Frame #0306: Office Hours

In [157]:
df[ df.frame.str.endswith("0306.jpg") ]

Unnamed: 0,frame,r,g,b,scene
121,frames/frame_0306.jpg,88.659175,70.149223,62.83901,office hour


![Frame 0001](frames/frame_0306.jpg)

### Frame #0320: Data Science Duo Logo???

What did you classify the DUO logo as?  It's nether one, but we don't have that option!

In [158]:
df[ df.frame.str.endswith("0320.jpg") ]

Unnamed: 0,frame,r,g,b,scene
159,frames/frame_0320.jpg,88.659175,71.838433,54.457305,office hour


![Frame 0001](frames/frame_0320.jpg)

### Frame #328: Video Credits

What did you classify the DUO logo as?  It's another tricky one!


In [159]:
df[ df.frame.str.endswith("0328.jpg") ]

Unnamed: 0,frame,r,g,b,scene
78,frames/frame_0328.jpg,88.659175,7.481519,7.487826,office hour


![Frame 0328](frames/frame_0328.jpg)

<hr style="color: #DD3403;">

## Part 4: Update Your Classifier to Account with an "Other" Category

Create a second classifier -- `classifyFrame2` -- that returns either `"notebook"`, `"office hour"` or `"other"`.  Your classifier should correctly handle the "Data Science Duo" (ex: #0320) frames and the "Credit" frames (ex: #0328).

In [160]:
df[ (df.frame.str.endswith("0316.jpg")) | (df.frame.str.endswith("0320.jpg")) | (df.frame.str.endswith("0321.jpg"))]

Unnamed: 0,frame,r,g,b,scene
91,frames/frame_0316.jpg,88.659175,67.024722,48.608728,office hour
159,frames/frame_0320.jpg,88.659175,71.838433,54.457305,office hour
164,frames/frame_0321.jpg,88.659175,80.898411,63.118472,office hour


In [161]:
df_sorted = df.sort_values("frame").reset_index()
df_sorted = df_sorted.drop(columns="index")
df_sorted["sum"] = df_sorted["r"] + df_sorted["g"] + df_sorted["b"]
df_sorted[df_sorted["scene"] == "office hour"].nlargest(15, "g")

Unnamed: 0,frame,r,g,b,scene,sum
320,frames/frame_0321.jpg,88.659175,80.898411,63.118472,office hour,232.676059
318,frames/frame_0319.jpg,88.659175,72.007214,54.58467,office hour,215.251059
319,frames/frame_0320.jpg,88.659175,71.838433,54.457305,office hour,214.954913
174,frames/frame_0175.jpg,88.659175,71.015408,63.90668,office hour,223.581263
303,frames/frame_0304.jpg,88.659175,70.626554,63.144909,office hour,222.430638
309,frames/frame_0310.jpg,88.659175,70.553273,63.082947,office hour,222.295395
293,frames/frame_0294.jpg,88.659175,70.40457,62.928372,office hour,221.992118
302,frames/frame_0303.jpg,88.659175,70.382357,62.886363,office hour,221.927895
294,frames/frame_0295.jpg,88.659175,70.339766,63.31974,office hour,222.318681
308,frames/frame_0309.jpg,88.659175,70.325065,62.954852,office hour,221.939093


In [162]:
df[(df.g > 65) & (df.b < 55)]

Unnamed: 0,frame,r,g,b,scene
91,frames/frame_0316.jpg,88.659175,67.024722,48.608728,office hour
100,frames/frame_0317.jpg,88.659175,66.995486,47.76674,office hour
159,frames/frame_0320.jpg,88.659175,71.838433,54.457305,office hour
237,frames/frame_0319.jpg,88.659175,72.007214,54.58467,office hour
242,frames/frame_0318.jpg,88.659175,67.101398,48.556484,office hour


In [163]:
df_sorted.nsmallest(15, "sum")

Unnamed: 0,frame,r,g,b,scene,sum
324,frames/frame_0325.jpg,88.659175,0.030256,0.038095,office hour,88.727526
329,frames/frame_0330.jpg,88.659175,4.658776,4.665082,office hour,97.983034
326,frames/frame_0327.jpg,88.659175,7.472743,7.478576,office hour,103.610495
325,frames/frame_0326.jpg,88.659175,7.473355,7.479188,office hour,103.611719
328,frames/frame_0329.jpg,88.659175,7.481289,7.487595,office hour,103.62806
327,frames/frame_0328.jpg,88.659175,7.481519,7.487826,office hour,103.62852
321,frames/frame_0322.jpg,88.659175,41.149089,18.975074,office hour,148.783338
322,frames/frame_0323.jpg,88.659175,42.190686,19.63658,office hour,150.486441
323,frames/frame_0324.jpg,88.659175,42.935651,20.674631,office hour,152.269457
7,frames/frame_0008.jpg,88.659175,52.125681,37.592253,office hour,178.377109


In [164]:
def classifyFrame2(r, g, b):
  # Return either "office hour", "notebook", or "other" based on the values of `r`, `g`, and `b`.
  return_str = ""
  sum_frame = r + g + b
  return_str = "office hour"
  if (sum_frame < 155):
    return_str = "other" 
  if (b < 55) & (g > 65):
    return_str = "other"
  if (g > 80):
    return_str = "other"
  if (sum_frame > 300):
    return_str = "notebook"
  return return_str

## Apply your `classifyFrame2` function

Using `classifyFrame2`, this code replaces the value in the column `scene` with your `classifyFrame2` classification function.  The output of this cell shows the last frames of the video, which we expect to be `"other"`:

In [165]:
df["scene"] = df.apply(lambda row: classifyFrame2(row.r, row.g, row.b), axis=1)
df.sample(20)

Unnamed: 0,frame,r,g,b,scene
49,frames/frame_0039.jpg,88.659175,233.20829,233.665586,notebook
130,frames/frame_0072.jpg,88.659175,230.171858,230.673008,notebook
259,frames/frame_0182.jpg,88.659175,232.372378,230.151311,notebook
180,frames/frame_0280.jpg,88.659175,242.176402,240.907209,notebook
276,frames/frame_0208.jpg,88.659175,237.83477,235.548598,notebook
191,frames/frame_0322.jpg,88.659175,41.149089,18.975074,other
81,frames/frame_0248.jpg,88.659175,243.410126,241.684648,notebook
192,frames/frame_0256.jpg,88.659175,243.393129,241.721944,notebook
322,frames/frame_0145.jpg,88.659175,229.600955,227.319214,notebook
159,frames/frame_0320.jpg,88.659175,71.838433,54.457305,other


### 🔬 Checkpoint Tests 🔬

In [166]:
### TEST CASE for Part 4: Update Your Classifier to Account with an Other Category
tada = "\N{PARTY POPPER}"

assert("scene" in df)

assert(len(df[ df.scene == "notebook" ]) > 100)
assert(len(df[ df.scene == "office hour" ]) > 75)
assert(len(df[ df.scene == "other" ]) >= 15)
assert(len(df[ df.scene == "other" ]) <= 18)   # Okay to classify the intro screens as well, but not any others.
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) + len(df[ df.scene == "other" ]) == len(df))

assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0317.jpg")) & (df.scene == "other") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0325.jpg")) & (df.scene == "other") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0328.jpg")) & (df.scene == "other") ] ) == 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉