# Bob Ross Exploration

An exploration into the exciting world of Bob Ross' paintings and correlations between objects he chose to paint!

***

Start by downloading a CSV of all of Bob's episodes:
https://github.com/fivethirtyeight/data/blob/master/bob-ross/elements-by-episode.csv

For each episode, objects are tagged as present (1) or absent (0).

Save the CSV into the same folder as this Notebook.

Then, import pandas and get all the episode data into a DataFrame:

In [5]:
import pandas as pd

reviews = pd.read_csv("elements-by-episode.csv", index_col = 0)
reviews

EPISODE,TITLE,APPLE_FRAME,AURORA_BOREALIS,BARN,BEACH,BOAT,BRIDGE,BUILDING,BUSHES,CABIN,...,TOMB_FRAME,TREE,TREES,TRIPLE_FRAME,WATERFALL,WAVES,WINDMILL,WINDOW_FRAME,WINTER,WOOD_FRAMED
S01E01,"""A WALK IN THE WOODS""",0,0,0,0,0,0,0,1,0,...,0,1,1,0,0,0,0,0,0,0
S01E02,"""MT. MCKINLEY""",0,0,0,0,0,0,0,0,1,...,0,1,1,0,0,0,0,0,1,0
S01E03,"""EBONY SUNSET""",0,0,0,0,0,0,0,0,1,...,0,1,1,0,0,0,0,0,1,0
S01E04,"""WINTER MIST""",0,0,0,0,0,0,0,1,0,...,0,1,1,0,0,0,0,0,0,0
S01E05,"""QUIET STREAM""",0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
S31E09,"""EVERGREEN VALLEY""",0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
S31E10,"""BALMY BEACH""",0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
S31E11,"""LAKE AT THE RIDGE""",0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
S31E12,"""IN THE MIDST OF WINTER""",0,0,1,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,1,0


It's always helpful to use ```.info()``` on your DataFrame to check whether any columns are missing data before you start working with it. So do that now:

In [6]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 403 entries, S01E01 to S31E13
Data columns (total 68 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   TITLE               403 non-null    object
 1   APPLE_FRAME         403 non-null    int64 
 2   AURORA_BOREALIS     403 non-null    int64 
 3   BARN                403 non-null    int64 
 4   BEACH               403 non-null    int64 
 5   BOAT                403 non-null    int64 
 6   BRIDGE              403 non-null    int64 
 7   BUILDING            403 non-null    int64 
 8   BUSHES              403 non-null    int64 
 9   CABIN               403 non-null    int64 
 10  CACTUS              403 non-null    int64 
 11  CIRCLE_FRAME        403 non-null    int64 
 12  CIRRUS              403 non-null    int64 
 13  CLIFF               403 non-null    int64 
 14  CLOUDS              403 non-null    int64 
 15  CONIFER             403 non-null    int64 
 16  CUMULUS             403
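
As a complement to ```.info()```, a per-column count of missing values can make gaps easier to spot. The sketch below runs on a small made-up DataFrame (the name `toy` and its contents are assumptions for illustration, not the real episode data); in the notebook you would call the same method on `reviews`.

```python
import pandas as pd

# Toy stand-in for the episode data (made up for illustration).
toy = pd.DataFrame(
    {"TITLE": ['"A"', '"B"', None], "BARN": [0, 1, 0]},
    index=pd.Index(["S01E01", "S01E02", "S01E03"], name="EPISODE"),
)

# Number of missing values in each column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```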

## Correlation 

Now we can go ahead and get a correlation matrix by simply calling ```.corr()``` on the DataFrame.

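To see all the columns and rows, the next cell sets the pandas display options ```display.max_rows``` and ```display.max_columns``` to ```None```. Passing ```numeric_only=True``` to ```.corr()``` skips the text-valued TITLE column.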

In [13]:
correlation_matrix = reviews.corr(numeric_only=True)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
print(correlation_matrix)

                    APPLE_FRAME  AURORA_BOREALIS      BARN     BEACH  \
APPLE_FRAME            1.000000        -0.003522 -0.010467 -0.013365   
AURORA_BOREALIS       -0.003522         1.000000 -0.014821 -0.018925   
BARN                  -0.010467        -0.014821  1.000000 -0.056237   
BEACH                 -0.013365        -0.018925 -0.056237  1.000000   
BOAT                  -0.003522        -0.004988 -0.014821  0.122310   
BRIDGE                 0.375133        -0.009390 -0.027902 -0.035628   
BUILDING               1.000000        -0.003522 -0.010467 -0.013365   
BUSHES                -0.032478        -0.045988 -0.109660 -0.152792   
CABIN                 -0.022669         0.155379 -0.095385 -0.121798   
CACTUS                -0.004994        -0.007071 -0.021012 -0.026831   
CIRCLE_FRAME          -0.003522        -0.004988 -0.014821 -0.018925   
CIRRUS                -0.013629        -0.019298  0.039756  0.004843   
CLIFF                 -0.007098        -0.010051 -0.029866  0.31

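The correlation matrix is itself a DataFrame, so go back and save it as its own object. Name it ```bob_ross_corr``` (the cell above saved it as ```correlation_matrix```, which is the name the later cells in this notebook use).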
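
A minimal sketch of saving the matrix under that name, using a small made-up DataFrame (`toy` and its values are assumptions for illustration). In the notebook the same call would be made on `reviews`.

```python
import pandas as pd

# Toy stand-in for the episode data (made up for illustration).
toy = pd.DataFrame(
    {
        "TITLE": ['"A"', '"B"', '"C"', '"D"'],
        "SUN": [1, 0, 1, 0],
        "OCEAN": [1, 0, 1, 0],
        "BARN": [0, 1, 0, 1],
    }
)

# numeric_only=True leaves the text column TITLE out of the calculation.
bob_ross_corr = toy.corr(numeric_only=True)

# The result is a regular DataFrame, so the usual indexing works on it.
print(type(bob_ross_corr))
print(bob_ross_corr.loc["SUN", "OCEAN"])
```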

## Start the Investigation

Now that you have a DataFrame and a correlation matrix, try to use code to perform the following:

### Sunny Days

Output (as a DataFrame) the episode and title of every episode in which Bob painted the sun.

*Hint: use the SUN column where value == 1*

In [22]:
sunny_episodes_df = reviews.loc[reviews['SUN'] == 1, ['TITLE']]
print(sunny_episodes_df)


                              TITLE
EPISODE                            
S01E03               "EBONY SUNSET"
S02E02                 "WINTER SUN"
S02E03                  "EBONY SEA"
S02E09     "BLACK & WHITE SEASCAPE"
S05E06              "OCEAN SUNRISE"
S07E03       "EVERGREENS AT SUNSET"
S08E03            "WARM WINTER DAY"
S09E01          "WINTER EVERGREENS"
S10E08              "GOLDEN SUNSET"
S10E12               "WINTER FROST"
S11E10      "SUNSET OVER THE WAVES"
S12E01               "GOLDEN KNOLL"
S12E12        "MOUNTAIN IN AN OVAL"
S14E08             "ON A CLEAR DAY"
S15E01         "SPLENDOR OF WINTER"
S17E04                "STORMY SEAS"
S18E13            "RIPPLING WATERS"
S19E03   "FINAL EMBERS OF SUNLIGHT"
S19E09                   "EBB TIDE"
S20E02             "NEW DAY'S DAWN"
S20E12             "HIDDEN DELIGHT"
S21E02              "TRANQUIL DAWN"
S21E08                 "BY THE SEA"
S22E11            "PASTEL SEASCAPE"
S23E04        "REFLECTIONS OF GOLD"
S23E12               "CRIMSO

### Cones Please

What percentage of paintings included a conifer? Use code to calculate this. See if you can do it in one line of code.

It's okay to Google for ideas, but cite your source with a comment and full link to where you found it.

In [23]:
per_paint = (reviews['CONIFER'].sum() / len(reviews)) * 100
per_paint

np.float64(52.605459057071954)
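
Because the column holds only 0s and 1s, its mean is already the fraction of rows equal to 1, so the same percentage can be written a little more compactly. The sketch below uses a made-up four-row column (`toy` is an assumption for illustration).

```python
import pandas as pd

# Toy stand-in: three of four paintings include a conifer.
toy = pd.DataFrame({"CONIFER": [1, 0, 1, 1]})

# The mean of a 0/1 column is the share of rows that are 1.
pct_conifer = toy["CONIFER"].mean() * 100
print(pct_conifer)
```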

### Water

I want to know about episodes in which Bob might have painted water. Assume that any of the following objects would include water:

'BOAT', 'BEACH', 'OCEAN', 'LAKE', 'WATERFALL', 'WAVES', 'RIVER', 'DOCK'

Create a new column in the original DataFrame called "WATER" and set it to 1 if any of the above columns have 1, otherwise 0.

Hints: use a few code cells to do this in steps
- Turn my list of water columns into a list called ```water_cols```
- Output the DataFrame, but just the subset of water columns. You'll use this view to verify your work.
- Create a new column called WATER using this notation: ```df['WATER'] = ``` where df is the name of your DataFrame
- Now the tricky part. You want to set that new column to a boolean value based on whether the number 1 is in any of the water columns. You'll need to use ```.isin()``` and ```.any(axis='columns')```
- You can change the boolean values to ints using ```.astype(int)``` at the end of your expression

In [26]:
# Columns that imply water in the painting
water_cols = ['BOAT', 'BEACH', 'OCEAN', 'LAKE', 'WATERFALL', 'WAVES', 'RIVER', 'DOCK']
# 1 if any water column is 1 for the episode, otherwise 0
reviews['WATER'] = reviews[water_cols].isin([1]).any(axis='columns').astype(int)
reviews['WATER']

EPISODE
S01E01    1
S01E02    0
S01E03    0
S01E04    1
S01E05    1
S01E06    1
S01E07    1
S01E08    1
S01E09    1
S01E10    1
S01E11    1
S01E12    1
S01E13    0
S02E01    1
S02E02    1
S02E03    1
S02E04    1
S02E05    1
S02E06    1
S02E07    1
S02E08    1
S02E09    1
S02E10    1
S02E11    1
S02E12    1
S02E13    1
S03E01    1
S03E02    1
S03E03    1
S03E04    0
S03E05    1
S03E06    0
S03E07    1
S03E08    1
S03E09    1
S03E10    1
S03E11    1
S03E12    1
S03E13    1
S04E01    1
S04E02    1
S04E03    0
S04E04    1
S04E05    1
S04E06    1
S04E07    0
S04E08    1
S04E09    1
S04E10    1
S04E11    0
S04E12    1
S04E13    1
S05E01    1
S05E02    1
S05E03    0
S05E04    1
S05E05    1
S05E06    1
S05E07    1
S05E08    1
S05E09    1
S05E10    0
S05E11    1
S05E12    0
S05E13    1
S06E01    1
S06E02    1
S06E03    1
S06E04    1
S06E05    1
S06E06    0
S06E07    1
S06E08    1
S06E09    0
S06E10    0
S06E11    1
S06E12    1
S06E13    1
S07E01    0
S07E02    1
S07E03    1
S07E04    1
S07E05  
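
One way to verify the new column is to view it next to the water columns and cross-check it against a second calculation. The sketch below does this on a made-up three-row DataFrame (`toy`, its index labels, and the shortened column list are assumptions for illustration).

```python
import pandas as pd

# Toy stand-in with two water columns and one unrelated column.
toy = pd.DataFrame(
    {"BOAT": [0, 1, 0], "LAKE": [0, 0, 1], "BARN": [1, 0, 0]},
    index=["E1", "E2", "E3"],
)
water_cols = ["BOAT", "LAKE"]

# Same expression as in the hints: 1 if any water column is 1, else 0.
toy["WATER"] = toy[water_cols].isin([1]).any(axis="columns").astype(int)

# Cross-check: for 0/1 columns, the row-wise maximum is 1 exactly
# when at least one of the columns is 1.
check = toy[water_cols].max(axis="columns")
assert (toy["WATER"] == check).all()

# Side-by-side view for checking by eye.
print(toy[water_cols + ["WATER"]])
```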

### Super Bonus 🌶️

Can you find the highest and lowest correlation for any column? 

So, pick a column, like ROCKS. Other than ROCKS (which would have a correlation of 1.00 with itself) what are the most and least correlated objects?

Can you find that for every object?
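
For a single column, one possible approach is to take that column of the correlation matrix, drop its own entry, and ask for the labels of the largest and smallest values. The sketch below uses made-up data (`toy` and its values are assumptions) and reads "least correlated" as the lowest, most negative, correlation.

```python
import pandas as pd

# Toy stand-in for the episode data (made up for illustration).
toy = pd.DataFrame(
    {
        "ROCKS": [1, 1, 0, 0, 1, 0],
        "RIVER": [1, 1, 0, 0, 0, 0],
        "BARN": [0, 0, 1, 1, 0, 1],
        "SUN": [1, 0, 1, 0, 1, 0],
    }
)
corr = toy.corr()

# Correlations of ROCKS with every other column (its own 1.0 removed).
rocks = corr["ROCKS"].drop("ROCKS")

most = rocks.idxmax()   # label of the highest correlation
least = rocks.idxmin()  # label of the lowest correlation
print(most, rocks.max())
print(least, rocks.min())
```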

In [27]:
max_corr = {}
min_corr = {}

for column in correlation_matrix.columns:
    correlations = correlation_matrix[column].drop(column)
    max_corr[column] = correlations.idxmax(), correlations.max()
    min_corr[column] = correlations.idxmin(), correlations.min()

print(max_corr)
print(min_corr)

# Source: https://www.geeksforgeeks.org/python-pandas-dataframe-corr/

{'APPLE_FRAME': ('BUILDING', np.float64(1.0)), 'AURORA_BOREALIS': ('NIGHT', np.float64(0.4215892248333433)), 'BARN': ('STRUCTURE', np.float64(0.3453955165129267)), 'BEACH': ('OCEAN', np.float64(0.8555979618608421)), 'BOAT': ('DOCK', np.float64(0.7062267475148986)), 'BRIDGE': ('APPLE_FRAME', np.float64(0.37513323859009895)), 'BUILDING': ('APPLE_FRAME', np.float64(1.0)), 'BUSHES': ('TREES', np.float64(0.2295202063362597)), 'CABIN': ('STRUCTURE', np.float64(0.7338142166700508)), 'CACTUS': ('PATH', np.float64(0.2691207217334807)), 'CIRCLE_FRAME': ('FRAMED', np.float64(0.18148423239501615)), 'CIRRUS': ('CLOUDS', np.float64(0.286034840296712)), 'CLIFF': ('LIGHTHOUSE', np.float64(0.35046167134877554)), 'CLOUDS': ('CUMULUS', np.float64(0.5460948883216591)), 'CONIFER': ('MOUNTAIN', np.float64(0.4553426202459414)), 'CUMULUS': ('CLOUDS', np.float64(0.5460948883216591)), 'DECIDUOUS': ('TREE', np.float64(0.3873714764805649)), 'DIANE_ANDRE': ('GUEST', np.float64(0.2075573517707431)), 'DOCK': ('BOAT'



### Super Super Bonus 🌶️🌶️

And the icing on the cake: get the least and most correlated item for every item in the correlation matrix.

*Hint: you will want to turn your code above into a function that takes an item (like "SNOW") and outputs the answer. Then, to iterate over the items, use ```items()``` like this (```iteritems()``` was removed in pandas 2.0):*

```for item in bob_ross_corr.items():```

Each iteration of ```.items()``` yields a tuple of (column name, column Series), so you'll need to take the first element of the tuple and pass it to your function.
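
A minimal sketch of that hint, assuming pandas 2.x, where the iteration method is `DataFrame.items()`. The helper name `extremes` and the toy data are hypothetical and only for illustration.

```python
import pandas as pd


def extremes(corr, item):
    """Return ((most_name, most_value), (least_name, least_value)) for one item.

    Hypothetical helper: drops the item's correlation with itself,
    then picks the highest and lowest remaining correlations.
    """
    others = corr[item].drop(item)
    return (others.idxmax(), others.max()), (others.idxmin(), others.min())


# Toy stand-in for the episode data (made up for illustration).
toy = pd.DataFrame(
    {
        "SNOW": [1, 1, 0, 0, 1, 0],
        "WINTER": [1, 1, 0, 0, 0, 0],
        "BEACH": [0, 0, 1, 1, 0, 1],
        "SUN": [1, 0, 1, 0, 1, 0],
    }
)
bob_ross_corr = toy.corr()

results = {}
# .items() yields (column name, column Series); only the name is needed here.
for name, _column in bob_ross_corr.items():
    results[name] = extremes(bob_ross_corr, name)

for name, (most, least) in results.items():
    print(name, "most:", most, "least:", least)
```

Passing the matrix into the function, rather than reading a global, keeps the helper reusable on any correlation DataFrame.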