In this notebook, we explore Netflix movies and TV shows with [kotlin/dataframe](https://github.com/Kotlin/dataframe). We also use the [kandy](https://github.com/Kotlin/kandy) library for data visualization.

## Table of contents
* [**Imports**](#Imports)
* [**Reading and first look**](#Reading-and-first-look)
* [**TV Shows and Movies**](#TV-Shows-and-Movies)
* [**Lifetimes and Release Times**](#Lifetimes-and-Release-Times)
* [**Actors**](#Actors)
* [**Countries**](#Countries)
* [**Duration**](#Duration)
* [**Ratings**](#Ratings)

## Imports

We use the latest available versions of the libraries. The following line magic is responsible for this:

In [1]:
%useLatestDescriptors

Importing dataframe

In [2]:
%use dataframe

Importing the visualization library

In [3]:
%use kandy

## Reading and first look

To get started, we need to read the data from a CSV file.

In [4]:
val rawDf = DataFrame.read("netflix_titles.csv")

Let's take a first look at its content.

In [5]:
// taking a look at types and columns
rawDf.schema()

show_id: String
type: String
title: String
director: String?
cast: String?
country: String?
date_added: String?
release_year: Int
rating: String?
duration: String
listed_in: String
description: String

In [6]:
rawDf.size() // rowsCount x columnsCount

7787 x 12

In [7]:
rawDf.head() // returns the first five rows

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV...",In a future where the elite inhabit a...
s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits M...
s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, h..."
s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movie...","In a postapocalyptic world, rag-doll ..."
s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosw...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become ...


In [8]:
// Getting general statistics and info for each column
rawDf.describe()

name,type,count,unique,nulls,top,freq,mean,std,min,median,max
show_id,String,7787,7787,0,s1,1,,,s1,s4502,s999
type,String,7787,2,0,Movie,5377,,,Movie,Movie,TV Show
title,String,7787,7787,0,3%,1,,,#Alive,Manglehorn,최강전사 미니특공대 : 영웅의 탄생
director,String?,7787,4050,2389,"Raúl Campos, Jan Suter",18,,,A. L. Vijay,Lance Bangs,Şenol Sönmez
cast,String?,7787,6832,718,David Attenborough,18,,,"'Najite Dede, Jude Chukwuka, Taiwo Ar...","Kay Kay Menon, Shiney Ahuja, Chitrang...","Ṣọpẹ́ Dìrísù, Wunmi Mosaku, Matt Smit..."
country,String?,7787,682,507,United States,2555,,,Argentina,Thailand,Zimbabwe
date_added,String?,7787,1566,10,"January 1, 2020",118,,,"April 15, 2018","July 6, 2020","September 9, 2020"
release_year,Int,7787,73,0,2018,1121,2013.932580,8.757395,1925,2017,2021
rating,String?,7787,15,7,TV-MA,2863,,,G,TV-MA,UR
duration,String,7787,216,0,1 Season,1608,,,1 Season,148 min,99 min


The data consists of Netflix TV shows and movies up to 2020. Each row describes one specific title and contains:
* `show_id` - unique show number
* `type` - ***TV Show*** or ***Movie***
* `title` - the name of a TV show or movie
* `director` - director's name
* `cast` - cast list
* `country` - the country where the title was released
* `date_added` - when the title was added to Netflix
* `release_year` - the year the title was released
* `rating` - rating of the title
* `duration` - length of the title: minutes for movies, number of seasons for TV shows
* `listed_in` - in which lists/genres the title is present on Netflix
* `description` - title description

Before we get started, let's process the dataframe. The `date_added` column is of type `String`, so let's [convert](https://kotlin.github.io/dataframe/convert.html) it to `LocalDate` for convenience. Kotlin DataFrame provides built-in type converters for the major types. We will use the `String` -> `LocalDate` conversion and specify the date format [pattern](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html).

In [12]:
val df = rawDf.dropNulls { date_added } // remove rows where `date_added` is not specified
    .convert { date_added }.toLocalDate("MMMM d, yyyy") // convert date_added to LocalDate; `d` accepts one- and two-digit days ("January 1, 2020")
    .sortBy { date_added } // and let's also sort by date for easy operation later
df

java.lang.IllegalStateException: Can't convert `August 14, 2020` to LocalDate
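
The error shown above is raised for `August 14, 2020`, a value that does fit a month-day-year pattern, so one plausible cause (an assumption, not verified here) is that month names were parsed with a non-English default locale. The following sketch uses plain `java.time` rather than the DataFrame converter to show how the pattern letters and the locale affect parsing. `parseAdded` is a hypothetical helper, and `java.time.LocalDate` is written out in full to avoid a clash with `kotlinx.datetime.LocalDate`.

```kotlin
import java.time.format.DateTimeFormatter
import java.util.Locale

// "MMMM" = full month name, "d" = day of month with one or two digits, "yyyy" = four-digit year.
// Locale.ENGLISH is set explicitly so that "August" parses regardless of the system locale.
val addedFormat: DateTimeFormatter = DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH)

// Hypothetical helper (not part of DataFrame): trims stray whitespace, then parses.
fun parseAdded(text: String): java.time.LocalDate =
    java.time.LocalDate.parse(text.trim(), addedFormat)

val firstAdded = parseAdded("August 14, 2020")   // two-digit day
val newYearAdded = parseAdded("January 1, 2020") // one-digit day, which "dd" would reject
```

With `dd` instead of `d`, the second call would fail, because `dd` expects exactly two digits.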

In [None]:
// let's check which type the column ended up with
df.date_added.type()

## TV Shows and Movies

First, let's see which there are more of: TV shows or movies.

In [None]:
rawDf
    .valueCounts(sort = false) { type }
    .plot {
        bars {
            x(type)
            y("count")
            fillColor(type) {
                scale = categorical(range = listOf(Color.hex("#00BCD4"), Color.hex("#009688")))
            }
        }

        layout {
            title = "Count of TV Shows and Movies"
            size = 900 to 550
        }
    }

Movies outnumber TV shows on Netflix by roughly two to one. But has it always been this way? To find out, let's check whether there was ever a year when more TV shows than movies were added, and then look at the cumulative counts for movies and TV shows.

In [None]:
val df_date_count = df
    .convert { date_added }.with { it.year } // converting `date_added` to extract `year`
    .groupBy { date_added } // grouping by `year` stored in `date_added`
    .aggregate {
        count { type == "TV Show" } into "tvshows" // counting TV Shows into column `tvshows`
        count { type == "Movie" } into "movies" // counting Movies into column `movies`
    }
df_date_count

Let's pause and see how we can simplify this expression using more advanced operations. First of all, we can combine the conversion of `date_added` into a year with the grouping by using the [`map`](https://kotlin.github.io/dataframe/map.html) function within a [column selector](https://kotlin.github.io/dataframe/columnselectors.html).

In [None]:
val df_date_count = df
    .groupBy { date_added.map { it.year } } // grouping by year added extracted from `date_added`
    .aggregate {
        count { type == "TV Show" } into "tvshows" // counting TV Shows into column `tvshows`
        count { type == "Movie" } into "movies" // counting Movies into column `movies`
    }
df_date_count

Our [groupBy aggregation](https://kotlin.github.io/dataframe/groupby.html#aggregation) adds new columns for "TV Show" and "Movie". This is exactly what [`pivot`](https://kotlin.github.io/dataframe/pivot.html) does: generates new columns for every unique value in `type`.

In [None]:
df.groupBy { date_added.map { it.year } }
    .pivot { type }

After the `type` column is pivoted, we call [`aggregate`](https://kotlin.github.io/dataframe/pivot.html#aggregation) to specify the metrics to be calculated for every data group.

In [None]:
df.groupBy { date_added.map { it.year } }
    .pivot { type }.aggregate { count() }

Simple statistics can be aggregated without `aggregate`:

In [None]:
df.groupBy { date_added.map { it.year } }
    .pivot { type }.count()

For the `count` statistic there is an even shorter API: [`pivotCounts`](https://kotlin.github.io/dataframe/pivot.html#pivotcounts).

Here is the final version:

In [None]:
val df_date_count = df
    .groupBy { date_added.map { it.year } }.pivotCounts { type }
df_date_count
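
To make the shape of this result concrete, here is a minimal plain-Kotlin sketch of what grouping by year and counting per `type` computes. It uses only the standard library, not the DataFrame API. The `Added` class and the sample rows are made up for illustration.

```kotlin
// Hypothetical, simplified record: only the two fields needed here.
data class Added(val year: Int, val type: String)

// For each year (the grouping key), count rows per distinct `type` (the pivoted columns).
fun countsByYearAndType(rows: List<Added>): Map<Int, Map<String, Int>> =
    rows.groupBy { it.year }
        .mapValues { (_, sameYear) -> sameYear.groupingBy { it.type }.eachCount() }

// Made-up sample data.
val sampleAdded = listOf(
    Added(2019, "Movie"), Added(2019, "Movie"), Added(2019, "TV Show"),
    Added(2020, "TV Show"),
)
val sampleCounts = countsByYearAndType(sampleAdded)
```

In this sketch a combination that never occurs (here, movies in 2020) is simply absent from the inner map, whereas a dataframe has a cell for every year and type.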

Now we can plot the result. After pivoting, the `TV Show` and `Movie` count columns are nested under the `type` column group, so the plot accesses them through `type`.

In [None]:
df_date_count.plot {
    x(date_added) { axis.name = "year" }

    area {
        y(type.`TV Show`) { axis.name = "count" }
        fillColor = Color.hex("#BF360C")
        borderLine.color = Color.hex("#BF360C")
        alpha = .5
    }

    area {
        y(type.Movie)
        fillColor = Color.hex("#01579B")
        borderLine.color = Color.hex("#01579B")
        alpha = .5
    }

    layout {
        title = "Number of titles by year"
        size = 800 to 500
        theme {
            panel {
                background {
                    fillColor = Color.hex("#ECEFF1")
                    borderLineColor = Color.hex("#ECEFF1")
                }
                grid.lineGlobal { blank = true }
            }
        }
    }
}

It can be seen that more movies than shows were added every year. Obviously, the cumulative count of movies was therefore also always higher than that of TV shows, but let's plot it anyway.

In [None]:
val df_cumsum_titles = df_date_count
    .sortBy { date_added } // sorting by date_added
    .cumSum { type.allCols() } // count cumulative sum for columns `TV Show` and `Movie` that are nested under column `type`
df_cumsum_titles
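
As a reminder of what a cumulative sum computes, here is a minimal standard-library sketch. It is not how `cumSum` is implemented, and the yearly counts are made up.

```kotlin
// Running total: element i of the result is the sum of inputs 0..i.
fun cumulativeSum(values: List<Int>): List<Int> =
    values.runningReduce { total, next -> total + next }

// Made-up yearly counts, already sorted by year (the order matters for a running total).
val yearlyMovies = listOf(3, 10, 25, 40)
val cumulativeMovies = cumulativeSum(yearlyMovies)
```

This is why the dataframe is sorted by `date_added` first: a running total is only meaningful when the rows are in chronological order.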

In [None]:
df_cumsum_titles.plot {
    x(date_added) { axis.name = "year" }

    area {
        y(type.`TV Show`) { axis.name = "cumulative count" }
        fillColor = Color.hex("#BF360C")
        borderLine.color = Color.hex("#BF360C")
        alpha = .5
    }

    area {
        y(type.Movie)
        fillColor = Color.hex("#01579B")
        borderLine.color = Color.hex("#01579B")
        alpha = .5
    }

    layout {
        title = "Cumulative count of titles by year"
        size = 800 to 500

        theme {
            panel {
                background {
                    fillColor = Color.hex("#ECEFF1")
                    borderLineColor = Color.hex("#ECEFF1")
                }
                grid.lineGlobal { blank = true }
            }
        }
    }
}

## Lifetimes and Release Times

Let's take a look at the distribution of how long titles have been on the platform. To do this, we find the date of the most recently added title and, for every title, calculate the difference between its `date_added` and that maximum date.

In [None]:
import kotlinx.datetime.*

In [None]:
val maxDate = df.date_added.max()
val df_days = df.add {
    "days_on_platform" from { date_added.daysUntil(maxDate) } // adding column for number of days on the platform
    "months_on_platform" from { date_added.monthsUntil(maxDate) } // adding column for number of months on the platform
    "years_on_platform" from { date_added.yearsUntil(maxDate) } // adding column for number of years on the platform
}
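
The three new columns count whole units between two dates. The sketch below shows the same idea with `java.time` as a stand-in for the `kotlinx.datetime` functions used in the cell above. The two dates are made up.

```kotlin
import java.time.temporal.ChronoUnit

// Made-up dates: when a title was added and the latest date in the data.
val sampleStart = java.time.LocalDate.of(2019, 11, 1)
val sampleEnd = java.time.LocalDate.of(2021, 1, 16)

// Each call counts complete units only; partial units are truncated.
val daysOn = ChronoUnit.DAYS.between(sampleStart, sampleEnd)
val monthsOn = ChronoUnit.MONTHS.between(sampleStart, sampleEnd)
val yearsOn = ChronoUnit.YEARS.between(sampleStart, sampleEnd)
```

Because partial units are truncated, a title added 14 and a half months before the maximum date counts as 1 year on the platform, not 2.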

In [None]:
val p1 = df_days.select { type and days_on_platform }.plot {
    histogram(days_on_platform, binsOption = BinsOption.byNumber(30)) {
        y(Stat.density)
        fillColor = Color.hex("#ef0b0b")
        borderLine.color = Color.hex("#ECEFF1")
    }

    statBin(days_on_platform, binsOption = BinsOption.byNumber(30)) {
        area {
            x(Stat.x)
            y(Stat.density)
            alpha = .5
            fillColor = Color.hex("#0befef")
        }
    }

    layout {
        xAxisLabel = "days"
        title = "Age distribution (in days) on Netflix"
    }
}

val p2 = df_days.select { type and days_on_platform }.plot {
    boxplot(x = type, y = days_on_platform) {
        boxes {
            fillColor(Stat.x) {
                scale = categorical(range = listOf(Color.hex("#792020"), Color.hex("#207979")))
            }
        }
    }
    layout {
        yAxisLabel = "days"
        title = "Boxplot for age (in days) by type"
    }
}

plotBunch {
    add(p1, 0, 0, 500, 450)
    add(p2, 500, 0, 500, 450)
}

The age distribution of titles on the platform is similar for movies and TV shows. However, the second plot shows that some movies have been on the platform far longer than any show. Let's take a closer look by plotting how many years movies and shows have been on the platform.

In [None]:
df_days.valueCounts(sort = false) { type and years_on_platform }.plot {
    bars {
        x(years_on_platform) { axis.name = "years" }
        y("count")
        fillColor(type) {
            scale = categorical(range = listOf(Color.hex("#bc3076"), Color.hex("#30bc76")))
        }
        position = Position.dodge()
    }
    layout {
        title = "Years of Movies and TV Shows on Netflix"
        size = 900 to 500
    }
}

As you can see, movies have usually been on the platform longer than TV shows.
Next, you might ask: how quickly are titles added to Netflix after their release? The answer is easy to find.

In [None]:
val df_years = df
    // adding a new column of the difference between the year of release and the year of addition
    .add("years_off_platform") {
        date_added.year - release_year
    }
    // dropping negative and zero values
    .filter { "years_off_platform"<Int>() > 0 }
df_years

We dropped negative values because titles are sometimes added to the platform while they are still in production. We also dropped the zero values, as they are of no interest here.

In [None]:
df_years.valueCounts(false) { years_off_platform }.plot {
    x(years_off_platform) { axis.name = "years" }
    points {
        y("count")
        size = 7.5
        color(years_off_platform) {
            scale = continuous(range = Color.hex("#97a6d9")..Color.hex("#00256e"))
        }
    }
    layout {
        title = "How long does it take for a title to be added to Netflix?"
        size = 1000 to 500
    }
}

Well, let's build the informal top charts for the oldest and newest movies and TV shows.

* ***Top 5 movies with the most days on Netflix***

In [None]:
// Top 5 oldest movies
df_days
    .filter { type == "Movie" } // filtering by type
    .sortByDesc { days_on_platform } // sorting by number of days on Netflix
    .select { cols(type, title, country, date_added, release_year, duration) } // selecting required columns
    .head() // taking first five rows

* ***Top 5 movies recently added on Netflix***

In [None]:
// Top 5 newest movies
df_days
    .filter { type == "Movie" }
    .sortBy { days_on_platform }
    .select { cols(type, title, country, date_added, release_year, duration) }
    .head()

* ***Top 5 TV Shows with most days on Netflix***

In [None]:
// Top 5 oldest shows
df_days
    .filter { type == "TV Show" }
    .sortByDesc { days_on_platform }
    .select { cols(type, title, country, date_added, release_year, duration) }
    .head()

* ***Top 5 TV Shows recently added on Netflix***

In [None]:
// Top 5 newest shows
df_days
    .filter { type == "TV Show" }
    .sortBy { days_on_platform }
    .select { cols(type, title, country, date_added, release_year, duration) }
    .head()

You might also wonder: in which months are titles added most often?

In [None]:
val df_split_date = df
    // splitting dates into four columns
    .split { date_added }.by { listOf(it, it.dayOfWeek, it.month, it.year) }
    .into("date", "day", "month", "year")
    .sortBy("month") // sorting by month
df_split_date

In [None]:
df_split_date
    .valueCounts(false) { year and month }
    .plot {
        tiles {
            x(year)
            y(month)
            width = .9
            height = .9
            fillColor("count") {
                scale = continuous(range = Color.hex("#FFF3E0")..Color.hex("#E65100"))
            }
        }

        layout {
            title = "Content additions by month and year"
            size = 900 to 700
            theme {
                panel {
                    background {
                        blank = true
                    }
                    grid.lineGlobal { blank = true }
                }
            }
        }
    }

## Actors

In this section, let's take a look at the actors and directors who make the content. First, let's determine the average number of actors in titles.

In [None]:
// splitting cast and counting the number of actors
val cast_df = df
    .split { cast }.by(',').inplace()
    .add("size_cast") { "cast"<List<String>>().size }
    // the plots need time in milliseconds since the epoch, so convert date_added to an Instant
    .convert { date_added }.with { it.atStartOfDayIn(TimeZone.UTC) }
cast_df

In [None]:
cast_df.plot { 
    histogram(size_cast, binsOption = BinsOption.byNumber(50)) {
        fillColor(Stat.count) { 
            scale = continuous(range = Color.hex("#E0F7FA")..Color.hex("#006064"))
            legend {
                type = LegendType.None
            }
        }
    }
    layout {
        xAxisLabel = "actors"
        title = "Number of people on cast"
        size = 950 to 650
    }
}

It can be seen that usually 8-9 people are included in the cast.

But who exactly is involved in creating the content? Let's take a look at these actors and how many movies and shows they took part in.

In [None]:
// counting the participation of each actor
val actors_df = cast_df.cast.explode().valueCounts()
actors_df
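
The split, explode and count steps can be sketched with the standard library alone. This is an illustration of the idea, not the DataFrame implementation. The names in `sampleCasts` are made up, and the sketch trims whitespace explicitly.

```kotlin
// Made-up cast strings in the same comma-separated shape as the `cast` column.
val sampleCasts = listOf(
    "Alice Example, Bob Example",
    "Bob Example, Carol Example",
    "Bob Example",
)

// split: one list of names per title; flatMap: "explode" the lists into one flat list;
// eachCount: occurrences per name; sortedByDescending: most frequent first.
fun appearanceCounts(casts: List<String>): List<Pair<String, Int>> =
    casts.flatMap { cast -> cast.split(",").map { it.trim() } }
        .groupingBy { it }
        .eachCount()
        .toList()
        .sortedByDescending { (_, count) -> count }

val sampleAppearances = appearanceCounts(sampleCasts)
```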

In [None]:
actors_df.take(30).plot { 
    barsH {
        y(cast) { scale = categorical() }
        x(count)
        fillColor(cast) { 
            scale = categoricalColorHue()
            legend {
                type = LegendType.None
            }
        }
    }
    layout.title = "Top 30 actors"
    layout.size = 950 to 900
}

Anupam Kher is definitely in the lead with 42 titles. Now let's split the counts by type to see who appears most in movies and who in TV shows.

In [None]:
val actors = cast_df.pivot { type }.aggregate {
    cast.explode().valueCounts()
}
actors

In [None]:
val p1 = actors.`TV Show`.take(30).plot {
    barsH {
        x(count)
        y(cast)
        fillColor(cast) {
            scale = continuous(Color.hex("#263238")..Color.hex("#ECEFF1"))
            legend { type = LegendType.None }
        }
    }
    layout.title = "Top 30 actors in Shows"
}

val p2 = actors.Movie.take(30).plot {
    barsH {
        x(count)
        y(cast)
        fillColor(cast) {
            scale = continuousColorGradientN(listOf(Color.hex("#006064"), Color.hex("#E0F7FA")))
            legend { type = LegendType.None }
        }
    }
    layout.title = "Top 30 actors in Movies"
}

plotBunch {
    add(p1, 0, 0, 500, 700)
    add(p2, 500, 0, 500, 700)
}

How about directors? Let's see the top 10 directors with the most titles in the Netflix catalog.

In [None]:
val directors_df = df.valueCounts { director }

In [None]:
directors_df.take(10).plot {
    barsH {
        x(count)
        y(director) { axis.name = "Name" }
        fillColor(director) { 
            scale = categoricalColorHue()
            legend { type = LegendType.None }
        }
    }
    layout.title = "Top 10 directors"
    layout.size = 850 to 500
}

These people work very productively.

## Countries

This section focuses on analyzing how content is distributed across various countries. To do so, we will need to import libraries that work with geospatial data and maps, and then perform the necessary manipulations to render the maps.

In [None]:
%use lets-plot
%use lets-plot-gt(gt=30.1)

In [None]:
USE {
    repository("https://repo.osgeo.org/repository/release")
    dependencies {
        implementation("org.geotools:gt-shapefile:30.1")
        implementation("org.geotools:gt-cql:30.1")
    }
}

In [None]:
import org.geotools.data.shapefile.ShapefileDataStoreFactory
import org.geotools.data.simple.SimpleFeatureCollection
import java.net.URL

In [None]:
val factory = ShapefileDataStoreFactory()

In [None]:
val worldFeatures: SimpleFeatureCollection = with("naturalearth_lowres") {
    val url = "https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/shp/${this}/${this}.shp"
    factory.createDataStore(URL(url)).featureSource.features
}

// Convert Feature Collection to SpatialDataset.
// Use 10 decimals to encode floating point numbers (this is the default).
val world = worldFeatures.toSpatialDataset(10)
val voidTheme = theme(
    axisTitle = "blank",
    axisLine = "blank",
    axisTicks = "blank",
    axisText = "blank",
)
val worldLimits = coordMap(ylim = -55 to 85)

Let's add another dataframe with country labels.

In [None]:
val countries = DataFrame.readCsv("country_codes.csv")
countries.head()

In [None]:
// counting the number of titles by country and joining them with the country codes dataframe
val df_country = df.valueCounts { country }.join(countries)
df_country
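
The join above matches rows of the two dataframes on their shared column. Assuming it behaves as an inner join on the country name (an assumption about the defaults and about the columns of `country_codes.csv`, which are not shown here), the effect can be sketched in plain Kotlin with made-up data:

```kotlin
// Made-up inputs: title counts per country and a country -> ISO code lookup.
val sampleCountryCounts = listOf("United States" to 3, "India" to 2, "Atlantis" to 1)
val sampleIsoCodes = mapOf("United States" to "USA", "India" to "IND")

// Inner join on the country name: rows without a matching code are dropped.
val sampleJoined: List<Triple<String, Int, String>> =
    sampleCountryCounts.mapNotNull { (country, count) ->
        sampleIsoCodes[country]?.let { iso -> Triple(country, count, iso) }
    }
```

Under that assumption, any `country` value without an exact match in the codes table, such as "Atlantis" here, does not appear in the joined result or on the map.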

In [None]:
ggplot() +
        geomMap(
            data = df_country.toMap(),
            map = world,
            mapJoin = "iso" to "iso_a3",
            color = "white",
        ) { fill = "count" } +
        scaleFillGradient(
            low = "#FFF3E0",
            high = "#E65100",
            name = "Number of Titles",
        ) +
        ggsize(width = 1000, height = 800) +
        voidTheme +
        worldLimits

The map clearly shows where the content that reaches Netflix is mainly produced. Let's take a closer look at the top countries.

In [None]:
df_country[0..9].sortByDesc { count }.plot {
    bars {
        x(country)
        y(count)
        fillColor = Color.hex("#00796B")
    }
    layout.title = "Top 10 Countries"
    layout.size = 900 to 450
}

## Duration

How long does content usually last?

In [None]:
val df_dur = df
    .split { duration }.by(" ").inward("duration_num", "duration_scale") // splitting duration by time and scale inward
    .convert { "duration"["duration_num"] }.toInt() // converting by column path
    .update { "duration"["duration_scale"] }.with { if (it == "Seasons") "Season" else it }
df_dur.head()
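
The same split, convert and normalise steps can be sketched on a single value in plain Kotlin. `ParsedDuration` and `parseDuration` are hypothetical names used only for this illustration.

```kotlin
// Hypothetical holder for the two parts of a duration such as "93 min" or "4 Seasons".
data class ParsedDuration(val num: Int, val scale: String)

fun parseDuration(text: String): ParsedDuration {
    // Split into the numeric part and the unit at the first space.
    val (num, scale) = text.trim().split(" ", limit = 2)
    // Normalise the plural so that "1 Season" and "4 Seasons" share one scale value.
    return ParsedDuration(num.toInt(), if (scale == "Seasons") "Season" else scale)
}

val movieDuration = parseDuration("93 min")
val showDuration = parseDuration("4 Seasons")
```

Keeping the unit next to the number matters because the two scales are not comparable: 93 minutes and 4 seasons should never end up in the same histogram.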

In [None]:
val durations = df_dur.pivot { type }.values { duration }
durations

In [None]:
val p1 = durations.Movie.plot {
    histogram(duration_num, binsOption = BinsOption.byNumber(100)) {
        y(Stat.density)
        fillColor = Color.hex("#00BCD4")
    }

    statBin(duration_num, binsOption = BinsOption.byNumber(25)) {
        line {
            x(Stat.x) { axis.name = "minutes" }
            y(Stat.density) { axis.name = "density" }
            alpha = 1.0
            width = 1.0
            color = Color.hex("#d41900")
        }
    }

    layout.title = "Distribution of movies duration in minutes"
}

val p2 = durations.`TV Show`.plot {
    statBin(duration_num, binsOption = BinsOption.byNumber(15)) {
        bars {
            x(Stat.x)
            y(Stat.count)
            fillColor = Color.hex("#00BCD4")
        }
    }
}

plotBunch {
    add(p1, 0, 0, 1000, 500)
    add(p2, 0, 500, 1000, 500)
}

And, as is tradition, here are the longest movies and TV shows.

* ***Top 5 movies with highest duration***

In [None]:
df_dur.xs("Movie") { type }
    .sortByDesc { duration.duration_num }.head()
    .select { title and country and date_added and release_year and duration.all() }

* ***Top 5 TV shows with most seasons***

In [None]:
df_dur.xs("TV Show") { type }
    .sortByDesc { duration.duration_num }.head()
    .select { title and country and date_added and release_year and duration.all() }

And in the top content-producing countries, how long are movies and TV shows?

In [None]:
val list_top_countries = df_country.country.take(10).toSet()

val df_cntr = df_dur
    .filter { country in list_top_countries }
    .pivot { type }.aggregate { 
        groupBy { country }.mean { duration.duration_num }
    }
df_cntr
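
For one value of `type`, the aggregation above amounts to a filter, a grouping by country and a mean. Here is a minimal standard-library sketch with made-up rows. `Title` and `meanDurationByCountry` are hypothetical names.

```kotlin
// Hypothetical, simplified record with only the fields used here.
data class Title(val type: String, val country: String, val durationNum: Int)

// Keep titles of one type from the selected countries, then average the duration per country.
fun meanDurationByCountry(titles: List<Title>, type: String, countries: Set<String>): Map<String, Double> =
    titles.filter { it.type == type && it.country in countries }
        .groupBy { it.country }
        .mapValues { (_, group) -> group.map { it.durationNum }.average() }

// Made-up sample data.
val sampleTitles = listOf(
    Title("Movie", "India", 120), Title("Movie", "India", 140),
    Title("Movie", "Japan", 100), Title("TV Show", "Japan", 2),
    Title("Movie", "Chile", 90),
)
val sampleMovieMeans = meanDurationByCountry(sampleTitles, "Movie", setOf("India", "Japan"))
```

Movies and TV shows are averaged separately because their durations use different units (minutes versus seasons).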

In [None]:
val p1 = df_cntr.Movie.sortBy { duration_num }.plot {
    bars {
        x(country) { axis.name = "Name" }
        y(duration_num) { axis.name = "Minute" }
        fillColor(duration_num) {
            scale = continuous(Color.hex("#ECEFF1")..Color.hex("#263238"))
            legend.type = LegendType.None
        }
    }
    layout.title = "Average movie duration (minutes) in the top 10 countries"
}

val p2 = df_cntr.`TV Show`.sortBy { duration_num }.plot {
    bars {
        x(country) { axis.name = "Name" }
        y(duration_num) { axis.name = "Season" }
        fillColor(duration_num) {
            scale = continuous(Color.hex("#E0F7FA")..Color.hex("#006064"))
            legend.type = LegendType.None
        }
    }
    layout.title = "Average number of TV show seasons in the top 10 countries"
}

plotBunch {
    add(p1, 0, 0, 900, 550)
    add(p2, 0, 550, 900, 550)
}

## Ratings

Finally, let's take a look at the rating column.
Here we will find out what is the most commonly assigned rating for films and shows.

In [None]:
val dfInstants = df.convert { date_added }.with { it.atStartOfDayIn(TimeZone.UTC) }

In [None]:
dfInstants.valueCounts(false) { rating }.sortBy("count").plot {
    bars {
        x(rating)
        y("count")
        fillColor(rating) { 
            scale = categoricalColorHue()
            legend.type = LegendType.None
        }
    }
    layout.title = "Rating of Titles"
    layout.size = 950 to 500
}

In [None]:
dfInstants.valueCounts(sort = false) { rating and type }.plot {
    bars {
        x(rating)
        y("count")
        fillColor(type) { scale = categorical(listOf(Color.hex("#607D8B"), Color.hex("#00BCD4"))) }
        position = Position.dodge()
    }
    layout.title = "Rating of Titles by type"
    layout.size = 950 to 500
}