# Tutorial & Active Learning Notebook with the dataset
## "Film Noir: They Shot Dark Pictures, Didn't They?" (https://www.kaggle.com/datasets/kabhishm/film-noir-they-shot-dark-pictures-didnt-they)




The dataset provides information about more than 1,000 Film Noir films from 1940 to 2003, including their title, genre, year, runtime, rating, description, votes, director, and stars. 

    Note: The variable 'Rating' is based on contemporary ratings of the Film-Noir movies. In other words, the ratings reflect the views of a present-day audience, which often differ from the views of the audience at the time these movies were produced.

Have you ever heard of the term 'Film Noir'? Did you know this genre has long been regarded as one of cinema's most intriguing styles that defined post-war American cinematography? Did you know that this genre developed iconic cinematic motifs and tropes that inspire filmmakers in this day and age? 

These are the reasons why this Notebook, which contains both the tutorial and the active learning exercises, was created. The tutorial guides you through importing the dataset and cleaning it. Each active learning exercise has the following structure: 1) the question, and 2) an introduction to the exercise, which is meant to motivate you, the reader, to answer that particular question.
The answers can be checked with the answer sheet file found at https://github.com/catalina255/film_noir_exploratory_analysis/blob/706454f178eba82cee913cbe7a8a45916f1549fb/Solutions%20-%20Tutorial%20&%20Active%20Learning%20Notebook%20.ipynb. 

# Tutorial
As mentioned before, this tutorial guides you through everything you need in order to complete the active learning exercises later on.

### Step 1: Download the data 
The data is available in the following GitHub repository:
https://github.com/catalina255/film_noir_exploratory_analysis

From this repository, download the file named 'IMDB_noir_1000.csv'.

#### Overview of required packages
This is an overview of the libraries required to work with the data you just downloaded.
- The Pandas library
- The Seaborn library
- The NumPy library
- The Matplotlib library

Do not be alarmed: each step states what to import when it becomes necessary.

### Step 2: Loading the data

In [28]:
# In order to use the Pandas library, this first needs to be imported
import pandas as pd

To load the data, you specify the filepath as an argument to the pd.read_csv() function. 
- Note: the filepath to the 'IMDB_noir_1000.csv' file is relative to where you store this Notebook.

You assign the result to a variable with a descriptive name, as we did with 'film_noir_df', and show the DataFrame by calling the variable name.

In [49]:
film_noir_df = pd.read_csv('data/IMDB_noir_1000.csv')
film_noir_df

Unnamed: 0,Title,Genre,Year,Runtime,Rating,desc,Votes,Director,Stars
0,Angels Over Broadway,"Adventure, Comedy, Crime",1940,79 min,6.5,A cuckolded embezzler on the verge of suicide ...,1228,Ben Hecht,|
1,City for Conquest,"Drama, Music, Sport",1940,104 min,7.2,"Danny is a content truck driver, but his girl ...",2430,Anatole Litvak,|
2,The Letter,"Crime, Drama, Film-Noir",1940,95 min,7.6,The wife of a rubber plantation administrator ...,13469,William Wyler,Bette Davis
3,Rebecca,"Drama, Film-Noir, Mystery",1940,130 min,8.1,A self-conscious woman juggles adjusting to he...,138097,Alfred Hitchcock,Laurence Olivier
4,Stranger on the Third Floor,"Crime, Drama, Film-Noir",1940,64 min,6.8,An aspiring reporter is the key witness at the...,4268,Boris Ingster,Peter Lorre
...,...,...,...,...,...,...,...,...,...
1353,Who'll Stop the Rain,"Action, Crime, Drama",1978,126 min,6.7,A Vietnam veteran gets conned into helping an ...,3003,,
1354,The Wings of the Dove,"Drama, Romance",1997,102 min,7.1,An impoverished woman who has been forced to c...,12580,,
1355,Witness,"Drama, Romance, Thriller",1985,112 min,7.4,When a young Amish boy is sole witness to a mu...,97031,,
1356,The Zodiac,"Crime, Drama, Horror",2005,158 min,5.3,An elusive serial killer known as the Zodiac t...,7333,,


### Step 3: Cleaning the data

Take a minute to look at the DataFrame. The DataFrame consists of 1358 rows by 9 columns. Ask yourself 'What stands out?'. The first thing that might stand out to you is that there are a lot of values that correspond to NaN, while the second thing might be that, for the variable 'Stars', there is at least one Star named '|'.

It is possible to drop the rows that contain NaN values by running the following code.
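
Rather than judging by eye, you can let pandas report these observations. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its rows are hypothetical stand-ins, not taken from the dataset); the same calls should work on `film_noir_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for film_noir_df (made-up rows)
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'Director': ['Director 1', 'Director 2', np.nan, np.nan],
    'Stars': ['|', 'Star X', np.nan, 'Star Y'],
})

# Number of rows and columns
n_rows, n_cols = toy_df.shape

# Number of NaN values in each column
nan_per_column = toy_df.isna().sum()

# Number of rows whose Stars value is the '|' placeholder
n_pipe_stars = (toy_df['Stars'] == '|').sum()

print(n_rows, n_cols)
print(nan_per_column)
print(n_pipe_stars)
```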

In [50]:
film_noir_df = film_noir_df.dropna()
film_noir_df

Unnamed: 0,Title,Genre,Year,Runtime,Rating,desc,Votes,Director,Stars
0,Angels Over Broadway,"Adventure, Comedy, Crime",1940,79 min,6.5,A cuckolded embezzler on the verge of suicide ...,1228,Ben Hecht,|
1,City for Conquest,"Drama, Music, Sport",1940,104 min,7.2,"Danny is a content truck driver, but his girl ...",2430,Anatole Litvak,|
2,The Letter,"Crime, Drama, Film-Noir",1940,95 min,7.6,The wife of a rubber plantation administrator ...,13469,William Wyler,Bette Davis
3,Rebecca,"Drama, Film-Noir, Mystery",1940,130 min,8.1,A self-conscious woman juggles adjusting to he...,138097,Alfred Hitchcock,Laurence Olivier
4,Stranger on the Third Floor,"Crime, Drama, Film-Noir",1940,64 min,6.8,An aspiring reporter is the key witness at the...,4268,Boris Ingster,Peter Lorre
...,...,...,...,...,...,...,...,...,...
1317,Point Blank,"Crime, Drama, Thriller",1967,92 min,7.3,"After being double-crossed and left for dead, ...",2981,Karel Reisz,Nick Nolte
1318,Point of No Return,"Action, Crime, Drama",1993,109 min,6.1,A government fakes the death of a criminal to ...,12545,Iain Softley,Helena Bonham Carter
1319,The Postman Always Rings Twice,"Crime, Drama, Romance",1981,122 min,6.6,The sensuous wife of a lunch wagon proprietor ...,96420,Peter Weir,Harrison Ford
1320,Pretty Poison,"Comedy, Crime, Romance",1968,89 min,7.0,When a mentally disturbed young man tells a pr...,7327,Alexander Bulkley,Justin Chambers


Take another minute to look at the DataFrame. The DataFrame consists of 1246 rows by 9 columns. You can clearly see that we lost data due to dropping the NaN values.
But that is not the end. You still need to find a way to take care of the Stars that are named '|'. This can be achieved by running the following bit of code:

In [51]:
film_noir_df = film_noir_df[(film_noir_df.Stars != '|')]
film_noir_df

Unnamed: 0,Title,Genre,Year,Runtime,Rating,desc,Votes,Director,Stars
2,The Letter,"Crime, Drama, Film-Noir",1940,95 min,7.6,The wife of a rubber plantation administrator ...,13469,William Wyler,Bette Davis
3,Rebecca,"Drama, Film-Noir, Mystery",1940,130 min,8.1,A self-conscious woman juggles adjusting to he...,138097,Alfred Hitchcock,Laurence Olivier
4,Stranger on the Third Floor,"Crime, Drama, Film-Noir",1940,64 min,6.8,An aspiring reporter is the key witness at the...,4268,Boris Ingster,Peter Lorre
5,They Drive by Night,"Crime, Drama, Film-Noir",1940,95 min,7.2,When one of two truck-driving brothers loses a...,8201,Raoul Walsh,George Raft
6,Among the Living,"Drama, Film-Noir, Mystery",1941,67 min,6.4,A mentally-unstable man who has been kept in i...,716,Stuart Heisler,Albert Dekker
...,...,...,...,...,...,...,...,...,...
1317,Point Blank,"Crime, Drama, Thriller",1967,92 min,7.3,"After being double-crossed and left for dead, ...",2981,Karel Reisz,Nick Nolte
1318,Point of No Return,"Action, Crime, Drama",1993,109 min,6.1,A government fakes the death of a criminal to ...,12545,Iain Softley,Helena Bonham Carter
1319,The Postman Always Rings Twice,"Crime, Drama, Romance",1981,122 min,6.6,The sensuous wife of a lunch wagon proprietor ...,96420,Peter Weir,Harrison Ford
1320,Pretty Poison,"Comedy, Crime, Romance",1968,89 min,7.0,When a mentally disturbed young man tells a pr...,7327,Alexander Bulkley,Justin Chambers


Now, take yet another minute to look at the DataFrame. The DataFrame consists of 1199 rows by 9 columns. Removing the rows in which the Star is named '|' is not the only solution. Another, more time-consuming solution is to look up all the movies for which the Star is named '|' and manually insert the names of the Stars in the csv file.
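
A third option, sketched here as an assumption rather than as the approach used in this Notebook, is to treat '|' as a missing value first, so that a single `dropna()` call handles both problems. The DataFrame `toy_df` and its rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the freshly loaded film_noir_df (made-up rows)
toy_df = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'Director': ['Director 1', 'Director 2', np.nan, 'Director 4'],
    'Stars': ['|', 'Star X', 'Star Y', 'Star Z'],
})

# Replace the '|' placeholder with a missing value ...
stars_with_missing = toy_df['Stars'].mask(toy_df['Stars'] == '|')

# ... and drop every row that contains at least one missing value
cleaned_df = toy_df.assign(Stars=stars_with_missing).dropna()

print(cleaned_df)
```

On the toy data, 'Film A' is dropped because of the '|' placeholder and 'Film C' because of the missing director.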

### Step 4: Displaying the dataset and providing basic statistics about the dataset

In [8]:
# The following code gives a random sample of the dataset. 
film_noir_df.sample(10)

Unnamed: 0,Title,Genre,Year,Runtime,Rating,desc,Votes,Director,Stars
347,Take One False Step,"Crime, Drama, Mystery",1949,94 min,6.4,"During a conference-trip to L.A., an academic ...",3649,Fritz Lang,Louis Hayward
783,Bluebeard,"Crime, Horror, Thriller",1944,72 min,5.9,"In Paris, an artist hires portrait models, and...",5026,Robert Rossen,John Garfield
134,Nobody Lives Forever,"Crime, Drama, Film-Noir",1946,100 min,7.0,Ex-GI Nick Blake gets involved in a scheme to ...,597,Jules Dassin,Lucille Ball
78,Detour,"Crime, Drama, Film-Noir",1945,68 min,7.3,Chance events trap hitchhiking nightclub piani...,1191,Edwin L. Marin,George Raft
1273,Kill Me Again,"Action, Crime, Drama",1989,94 min,6.3,A young detective becomes involved with a beau...,41235,Roger Donaldson,Kevin Costner
592,The Big Knife,"Crime, Drama, Film-Noir",1955,111 min,6.8,Hollywood actor Charles Castle is pressured by...,1120,Allan Dwan,John Payne
855,Detour,"Crime, Drama, Film-Noir",1945,68 min,7.3,Chance events trap hitchhiking nightclub piani...,3576,George Cukor,Ronald Colman
23,Crossroads,"Crime, Drama, Film-Noir",1942,83 min,6.7,An amnesiac French diplomat is blackmailed for...,332,Robert Siodmak,Nancy Kelly
271,Pitfall,"Crime, Film-Noir, Thriller",1948,86 min,7.1,Married insurance adjuster John Forbes falls f...,693,Ted Tetzlaff,George Raft
186,The Gangster,"Crime, Drama, Film-Noir",1947,84 min,6.5,Shubunka (Barry Sulivan) is a cynical gangster...,5586,Curtis Bernhardt,Joan Crawford


In [9]:
# The following code provides some information about the data. 
film_noir_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1199 entries, 2 to 1321
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Title     1199 non-null   object 
 1   Genre     1199 non-null   object 
 2   Year      1199 non-null   object 
 3   Runtime   1199 non-null   object 
 4   Rating    1199 non-null   float64
 5   desc      1199 non-null   object 
 6   Votes     1199 non-null   object 
 7   Director  1199 non-null   object 
 8   Stars     1199 non-null   object 
dtypes: float64(1), object(8)
memory usage: 93.7+ KB
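
The `info()` output above lists 'Year', 'Runtime' and 'Votes' as object columns, which means pandas treats them as text rather than as numbers. If you need to calculate with them later, a conversion along the following lines may help. This is a sketch on made-up values; it assumes that the runtime always ends in ' min' and that anything that cannot be parsed may become NaN, which you should verify on the real columns.

```python
import pandas as pd

# Hypothetical stand-in for the text columns (made-up values)
toy_df = pd.DataFrame({
    'Year': ['1940', '1941', 'unknown'],
    'Runtime': ['79 min', '104 min', '95 min'],
    'Votes': ['1228', '2430', '13469'],
})

converted = toy_df.assign(
    # errors='coerce' turns unparseable entries into NaN instead of raising
    Year=pd.to_numeric(toy_df['Year'], errors='coerce'),
    # Strip the assumed ' min' suffix before converting
    Runtime=pd.to_numeric(
        toy_df['Runtime'].str.replace(' min', '', regex=False),
        errors='coerce',
    ),
    Votes=pd.to_numeric(toy_df['Votes'], errors='coerce'),
)

print(converted.dtypes)
```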


In [10]:
# The following code provides descriptive statistics for all the columns in the data.
film_noir_df.describe(include = 'all')

Unnamed: 0,Title,Genre,Year,Runtime,Rating,desc,Votes,Director,Stars
count,1199,1199,1199.0,1199,1199.0,1199,1199.0,1199,1199
unique,846,107,68.0,84,,860,793.0,394,446
top,Fear,"Crime, Drama, Film-Noir",1947.0,82 min,,"Jim Fletcher, waking up from a coma, finds he ...",296.0,Anthony Mann,Humphrey Bogart
freq,3,549,123.0,43,,2,4.0,27,27
mean,,,,,6.728023,,,,
std,,,,,0.63476,,,,
min,,,,,3.6,,,,
25%,,,,,6.3,,,,
50%,,,,,6.7,,,,
75%,,,,,7.2,,,,


# Start of Active Learning Exercises

## Exercise 1

In this exercise and the following exercise, you will explore what sort of subgenres were most popular throughout the Film Noir age. Take note that there are many combinations of genres, each consisting of 1 to 3 genres.

With this exercise, we try to demonstrate that the genre of crime is quite common within Film-Noir. Additionally, we aim to observe the tendencies of Film-Noir viewers in terms of which combinations of genres receive the most interest. In turn, this could prove helpful in determining future recommendations for viewers.

In the first part of this exercise, answer the following question. Q1a: How many distinct entries of genre combinations are there?
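
To see how often a single genre such as crime occurs, regardless of the combination it appears in, the combinations can be split into their individual genres. The sketch below uses a made-up Series (`toy_genres` is hypothetical) and assumes that the genres are always separated by a comma followed by a space.

```python
import pandas as pd

# Hypothetical stand-in for the 'Genre' column (made-up rows)
toy_genres = pd.Series([
    'Crime, Drama, Film-Noir',
    'Drama, Film-Noir, Mystery',
    'Crime, Drama, Film-Noir',
    'Drama, Romance',
], name='Genre')

# Split each combination into a list, give every genre its own row, then count
single_genre_counts = toy_genres.str.split(', ').explode().value_counts()

print(single_genre_counts)
```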

In [7]:
# Insert your code here for Q1a

The second part of this exercise continues where 1a left off. Q1b: Of those distinct entries, which are the 10 most popular genre combinations? 

In [8]:
# Insert your code here for Q1b

## Exercise 2

As mentioned before, you will explore what sort of subgenres were popular throughout the Film Noir age. However, unlike in Exercise 1, you will now attempt to answer this question by creating two additional variables. The first variable is the 'Score', which is created by multiplying 'Rating' by 'Votes'. The second variable is the 'Popularity Index', which is created by dividing the 'Score' by the number of distinct entries of genre.

Essentially, we want to rate how popular the Film-Noir movies rated on IMDB are by applying the Popularity Index. Exploring this can reveal interesting patterns surrounding which combinations of genres receive the highest ratings.


Q2: According to the variable 'Popularity Index', which subgenres were popular throughout the Film Noir time period? Provide a column chart to help support your answer.
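
To make the two definitions concrete, here is a small worked example on made-up numbers. It follows one possible reading of the definition, in which 'the number of distinct entries of genre' is the count asked for in Q1a; summing the index per genre combination at the end is likewise only one possible choice. Note that 'Votes' has to be numeric for the multiplication to work, whereas the `info()` output earlier lists it as an object column.

```python
import pandas as pd

# Hypothetical stand-in for film_noir_df (made-up rows)
toy_df = pd.DataFrame({
    'Genre': ['Crime, Drama', 'Crime, Drama', 'Drama, Romance'],
    'Rating': [7.0, 6.0, 8.0],
    'Votes': [100, 200, 50],
})

# Number of distinct genre combinations
n_distinct_genres = toy_df['Genre'].nunique()

# Score = Rating * Votes
toy_df['Score'] = toy_df['Rating'] * toy_df['Votes']

# Popularity Index = Score / number of distinct genre combinations
toy_df['Popularity Index'] = toy_df['Score'] / n_distinct_genres

# One possible aggregation: total Popularity Index per genre combination
popularity_by_genre = toy_df.groupby('Genre')['Popularity Index'].sum()

print(toy_df)
print(popularity_by_genre)
```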

In [11]:
# Insert your code here for Q2

## Exercise 3

In the following exercise, you will try to determine whether movies produced by several directors with a dominating presence were enjoyed by viewers. 

Q3a: Who are the directors with a dominating presence?

Q3b: Are the movies directed by those directors enjoyed by viewers?

The motivation behind both of these questions falls along the same lines: to see whether having the same select few people direct many movies affects the enjoyment of those movies.

In [None]:
# Insert your code here for Q3a

In [None]:
# Insert your code here for Q3b

## Exercise 4

In this exercise, you will look into which movie stars had the most success during their Film Noir career.

Q4: Which movie stars had the most success during their Film Noir career?

    Note: The variable 'Rating' is based on contemporary ratings of the Film-Noir movies. In other words, the ratings reflect the views of a present-day audience, which often differ from the views of the audience at the time these movies were produced.

The motivation for this exercise is that the results can be used for future research: perhaps these movie stars were able to further their careers thanks to their Film Noir work, or perhaps their careers came to an end because of it.

In [25]:
# Insert your code here for Q4

## Exercise 5
In the next exercise, you are going to work with the short plot descriptions found in the 'desc' column. The purpose is to observe the word frequency in the descriptions to see how well the words reflect the Film Noir genre. To that end, we are going to use two methods in order to see whether the results are the same.

Q5a: In the first method, you are asked to create a wordcloud visualisation of the most frequently used words in the 'desc' column. We suggest cleaning the text of redundant suffixes, so first take a brief look at the corpus. Also, make sure to install the wordcloud and pillow packages to generate the wordcloud.

Q5b: In the second method, you are asked to manually extract the frequencies. We suggest cleaning the text of redundant suffixes, so first take a brief look at the corpus.
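
As a starting point for the manual method, the idea of counting words can be sketched with Python's built-in `collections.Counter`. The descriptions and the small stop-word set below are made up for illustration; for the exercise you would take the texts from the 'desc' column and decide yourself which words and suffixes to remove.

```python
import re
from collections import Counter

# Hypothetical plot descriptions (not taken from the dataset)
toy_descriptions = [
    "A detective investigates a murder in the city.",
    "A murder witness hides from the killer.",
    "The detective falls for a mysterious woman.",
]

# Small, hand-picked set of words to ignore (an assumption for this sketch)
stop_words = {'a', 'the', 'in', 'from', 'for'}

words = []
for description in toy_descriptions:
    # Lowercase the text and keep only runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", description.lower())
    words.extend(token for token in tokens if token not in stop_words)

# Count how often each remaining word occurs
word_counts = Counter(words)

print(word_counts.most_common(5))
```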

In [27]:
# Insert your code here for Q5a

In [28]:
# Insert your code here for Q5b

## Exercise 6: Free style

If you haven't noticed by now, the dataset also contains movies from the Neo-Noir period. An interesting analysis could therefore look at how various statistics changed between the initial release of this kind of movie and the later revival of the genre.

In this exercise, we encourage you to come up with your own question or questions that compare the Noir movies with the Neo-Noir movies, based on a particular variable.
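
One way to set up such a comparison is to label every movie with a period based on its release year and then compare a statistic per period. The sketch below uses made-up rows, and the cutoff year is purely hypothetical: where exactly Noir ends and Neo-Noir begins is a choice you have to make and justify yourself. It also assumes that 'Year' is numeric.

```python
import pandas as pd

# Hypothetical stand-in for film_noir_df (made-up rows)
toy_df = pd.DataFrame({
    'Year': [1944, 1950, 1974, 1997],
    'Rating': [7.0, 8.0, 6.0, 7.0],
})

# Hypothetical boundary between the two periods; pick and motivate your own
CUTOFF_YEAR = 1960

# Label each movie as 'Noir' or 'Neo-Noir' depending on its release year
toy_df['Period'] = toy_df['Year'].lt(CUTOFF_YEAR).map(
    {True: 'Noir', False: 'Neo-Noir'}
)

# Compare the average rating of the two periods
mean_rating_by_period = toy_df.groupby('Period')['Rating'].mean()

print(mean_rating_by_period)
```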

In [52]:
# Insert your code here for Q6

## Exercise 7: Free style

In this exercise, we encourage you to come up with your own question, or even questions, without guidance. Feel free to be as creative as possible. 

In [None]:
# Insert your code here for Q7