# Investigating Netflix Movies

<center><img src="redpopcorn.jpg"></center>

## Problem Statement

**Netflix**! What started in 1997 as a DVD rental service has since exploded into one of the largest entertainment and media companies.

Given the large number of movies and series available on the platform, it is a perfect opportunity to flex your exploratory data analysis skills and dive into the entertainment industry.

You work for a production company that specializes in nostalgic styles. You want to do some research on movies released in the 1990s. You'll delve into Netflix data and perform exploratory data analysis to better understand this awesome movie decade!

You have been supplied with the dataset `netflix_data.csv`, along with the following table detailing the column names and descriptions. Feel free to experiment further after submitting!

### The data
#### **netflix_data.csv**
| Column | Description |
|--------|-------------|
| `show_id` | The ID of the show |
| `type` | Type of show |
| `title` | Title of the show |
| `director` | Director of the show |
| `cast` | Cast of the show |
| `country` | Country of origin |
| `date_added` | Date added to Netflix |
| `release_year` | Year of Netflix release |
| `duration` | Duration of the show in minutes |
| `description` | Description of the show |
| `genre` | Show genre |

## Objectives
Perform exploratory data analysis on the `netflix_data.csv` data to understand more about movies from the 1990s.
- What was the most frequent movie duration in the 1990s? Save an approximate answer as an integer called `duration`.
- A movie is considered short if it is less than 90 minutes long. Count the number of short action movies released in the 1990s and save this integer as `short_movie_count`.

---
## Solution

First, let's import the necessary packages:
- `numpy` for conducting quick operations on arrays
- `pandas` for reading the csv file and creating dataframes
- `matplotlib` for data visualization

In [20]:
# Importing pandas, matplotlib and numpy
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Read in the Netflix CSV as a DataFrame
netflix_df = pd.read_csv("netflix_data.csv")

Let's familiarize ourselves with the `netflix_data.csv` file by exploring its rows, columns, and shape.

In [21]:
# Getting to know the data
print(netflix_df.shape)
print(netflix_df)
print(netflix_df.loc[:5, ["date_added", "release_year", "duration"]])

(4812, 11)
     show_id     type       title           director  \
0         s2    Movie        7:19  Jorge Michel Grau   
1         s3    Movie       23:59       Gilbert Chan   
2         s4    Movie           9        Shane Acker   
3         s5    Movie          21     Robert Luketic   
4         s6  TV Show          46        Serdar Akar   
...      ...      ...         ...                ...   
4807   s7779    Movie  Zombieland    Ruben Fleischer   
4808   s7781    Movie         Zoo       Shlok Sharma   
4809   s7782    Movie        Zoom       Peter Hewitt   
4810   s7783    Movie        Zozo        Josef Fares   
4811   s7784    Movie      Zubaan        Mozez Singh   

                                                   cast        country  \
0     Demián Bichir, Héctor Bonilla, Oscar Serrano, ...         Mexico   
1     Tedd Chan, Stella Chung, Henley Hii, Lawrence ...      Singapore   
2     Elijah Wood, John C. Reilly, Jennifer Connelly...  United States   
3     Jim Sturgess, 
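For a more compact first look than printing the whole frame, `head()`, `dtypes`, and `isna().sum()` are common choices. The sketch below runs on a small made-up frame (`sample_df` and its rows are illustrative assumptions, not taken from `netflix_data.csv`), so it works without the CSV.

```python
import pandas as pd

# Hypothetical stand-in for netflix_df; the rows are invented for illustration.
sample_df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "type": ["Movie", "Movie", "TV Show", "Movie"],
        "release_year": [1994, 2008, 1999, 1997],
        "duration": [94, 120, 1, 88],
    }
)

print(sample_df.shape)         # (rows, columns)
print(sample_df.head(3))       # first three rows only
print(sample_df.dtypes)        # column types; duration should be numeric
print(sample_df.isna().sum())  # number of missing values per column

n_rows, n_cols = sample_df.shape
```

Checking `dtypes` early is useful here because the later comparisons (`>= 1990`, `< 90`) only make sense on numeric columns.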

### 1. What was the most frequent movie duration in the 1990s? Save an approximate answer as an integer called duration.
---
We can accomplish this task in two steps:

- Filtering - filtering the movies/shows that were released in the 1990s
- Counting - counting all the movie durations and storing the most frequent one


#### a) Filter
To solve this, we first need to filter the `netflix_df` dataframe to get the 'nostalgic movies': movies/shows released in the '90s. Since **pandas** is built on top of **NumPy**, we can use NumPy's `logical_and()` function to keep the rows where 1990 <= year < 2000.

Let's print out our nostalgic movies to confirm this.

In [22]:
# Using NumPy's logical_and() function to filter the dataframe
nostalgic_movies = netflix_df[np.logical_and(netflix_df["release_year"] >= 1990, netflix_df["release_year"] < 2000)]
print(nostalgic_movies.loc[:, ["date_added", "release_year", "duration"]])

            date_added  release_year  duration
6     November 1, 2019          1997       119
118      April 1, 2018          1993       101
145   December 1, 2019          1998        82
167   December 1, 2020          1996       108
194       June 1, 2017          1993       154
...                ...           ...       ...
4672  October 19, 2020          1999       106
4689   January 1, 2021          1993       118
4718   January 1, 2020          1999       106
4746   January 1, 2020          1994       191
4756      July 1, 2017          1994       148

[184 rows x 3 columns]
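As an alternative sketch, the same decade filter can be written with pandas' element-wise `&` operator or with `Series.between()`. The frame `sample_df` below is a made-up stand-in for `netflix_df` (an assumption for illustration), so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only release_year matters for this comparison.
sample_df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "release_year": [1989, 1990, 1995, 1999, 2000],
    }
)
years = sample_df["release_year"]

# Three equivalent ways to build the boolean mask for 1990 <= year < 2000
mask_numpy = np.logical_and(years >= 1990, years < 2000)
mask_ampersand = (years >= 1990) & (years < 2000)  # parentheses are required
mask_between = years.between(1990, 1999)           # inclusive on both ends by default

nineties = sample_df[mask_ampersand]
print(nineties)
```

All three masks select the same rows; `&` and `between()` simply avoid the extra NumPy call.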


#### b) Count
Now that we've filtered the dataframe down to the movies and shows that came out in the '90s, we need to count how often each duration occurs in `nostalgic_movies` and keep the most frequent one. How do I do this?
- First, I loop through the 'duration' column of `nostalgic_movies`, which is a pandas Series.
- I store every element of this Series (each an integer) in the list `minutes_list`.
- I then loop through `minutes_list` and count how often each value appears using the list's `.count()` method. (P.S. I remove each value from the list once it has been counted, so the same duration is not counted twice.)
- While counting, I record the value with the highest count in the `duration` variable and its count in the `highest_count` variable.
- Once the loop finishes, the final value of the `duration` variable is the most frequent movie duration.

In [23]:
duration = 0
highest_count = 0

minutes_list = []
for minutes in nostalgic_movies.loc[:, "duration"]:
    minutes_list.append(minutes)

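# Iterate over a copy: the loop body removes items from minutes_list,
# and mutating a list while iterating over it can skip elements
for minute in list(minutes_list):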
    minute_count = minutes_list.count(minute)
    
    if minute_count > highest_count:
        highest_count = minute_count
        duration = minute
        
        
    while minute in minutes_list:
        minutes_list.remove(minute)
        
    # print({str(minute): minute_count})
        
print(duration)


94
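As a cross-check on the loop above, pandas can compute the most frequent value directly. This is a minimal sketch on invented durations (`sample_durations` is hypothetical, not the Netflix data). `value_counts()` tallies each distinct value, and `mode()` returns every value tied for the highest count.

```python
import pandas as pd

# Hypothetical durations in minutes, invented for illustration
sample_durations = pd.Series([94, 101, 94, 118, 101, 94, 88], name="duration")

# Tally each distinct duration, then pick the one with the largest tally
counts = sample_durations.value_counts()
most_frequent = int(counts.idxmax())
highest_count = int(counts.max())

# mode() returns all values that share the highest count (one here)
modes = sample_durations.mode().tolist()

print(most_frequent, highest_count)
print(modes)
```

On the real data, the equivalent call would be applied to `nostalgic_movies["duration"]`; if `mode()` returns more than one value, the durations are tied and any of them is a valid answer.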


### 2. A movie is considered short if it is less than 90 minutes. Count the number of short action movies released in the 1990s and save this integer as short_movie_count.
---
Likewise, we solve this problem using two steps:
- Filtering - filtering the rows with the column 'type' as 'Movie' and the column 'genre' as 'Action'
- Counting - counting the filtered rows that have a duration shorter than 90 minutes

#### i) Filter
Just like in the first filter, we use NumPy's `logical_and()` function, this time to select the action movies.

In [24]:
nostalgic_action_movies = nostalgic_movies[np.logical_and(nostalgic_movies["type"] == "Movie", nostalgic_movies["genre"] == "Action")]
print(nostalgic_action_movies)

     show_id   type                                  title  \
352     s508  Movie                        Andaz Apna Apna   
431     s628  Movie  Austin Powers: The Spy Who Shagged Me   
468     s688  Movie                               Bad Boys   
515     s757  Movie                                Barsaat   
675    s1003  Movie                            Blue Streak   
815    s1236  Movie                          Casino Tycoon   
816    s1237  Movie                        Casino Tycoon 2   
1018   s1605  Movie                           Dante's Peak   
1179   s1850  Movie                            Dragonheart   
1288   s2039  Movie              EVANGELION: DEATH (TRUE)²   
1299   s2060  Movie                     Executive Decision   
1504   s2394  Movie                                 Ghayal   
1515   s2408  Movie                      Ghulam-E-Musthafa   
1548   s2466  Movie                              GoldenEye   
1599   s2551  Movie                                 Gumrah   
1661   s

#### ii) Count
Here, I loop through the filtered data and, for every movie that is shorter than 90 minutes, I add 1 to the variable `short_movie_count`, which initially holds the value `0`.

In [25]:
short_movie_count = 0

for minute in nostalgic_action_movies.loc[:, "duration"]:
    if minute < 90:
        short_movie_count += 1
        
print(short_movie_count)

7
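The filter and count steps can also be combined into a single boolean mask, because `True` values sum as 1. Below is a minimal sketch on made-up rows (the frame `sample_df` and the name `short_count` are assumptions for illustration, not part of the original notebook).

```python
import pandas as pd

# Hypothetical rows, invented for illustration
sample_df = pd.DataFrame(
    {
        "type": ["Movie", "Movie", "Movie", "TV Show", "Movie"],
        "genre": ["Action", "Action", "Dramas", "Action", "Action"],
        "release_year": [1994, 1998, 1995, 1996, 2005],
        "duration": [85, 120, 80, 1, 70],
    }
)

# One mask covering every condition: movie, action, 1990s, under 90 minutes
is_short_90s_action = (
    (sample_df["type"] == "Movie")
    & (sample_df["genre"] == "Action")
    & sample_df["release_year"].between(1990, 1999)
    & (sample_df["duration"] < 90)
)

# Summing a boolean Series counts the True entries
short_count = int(is_short_90s_action.sum())
print(short_count)
```

Only the first row satisfies all four conditions in this sample, so the count is 1; the explicit loop gives the same result and may be easier to step through when debugging.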
