# 2. Data Exploration

## 2.0. Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

This section imports the essential Python libraries required for data analysis, visualization, and statistical exploration. Each library serves a specific purpose:

- **NumPy (np)**: Provides support for multi-dimensional arrays and a wide range of mathematical operations, facilitating efficient numerical computations.
- **Pandas (pd)**: A powerful data manipulation library used for loading, cleaning, and analyzing structured data through its **DataFrame** and **Series** objects.
- **Matplotlib (plt)**: A versatile library for creating static, interactive, and animated visualizations, primarily used for plotting data in 2D.
- **Seaborn (sns)**: Built on top of Matplotlib, this library simplifies the creation of visually appealing and informative statistical graphics, offering a high-level interface for data visualization.

These libraries together enable efficient data processing and insightful visual representations to support data-driven decision-making.

## 2.1. Read data

In [2]:
data_path = './data/tvshows_raw.csv'
df = pd.read_csv(data_path)

In this section, we use the **Pandas** library to read data from a CSV file containing information about TV shows.

- First, the file path is stored in the variable `data_path`.
- Then, the `pd.read_csv()` function is called to load the data from the file and convert it into a **DataFrame** object, which allows for easier manipulation, analysis, and visualization of the data.
- The `df` variable holds the loaded dataset, which can be used for further processing in the subsequent steps of the analysis.

This approach simplifies data handling and prepares the dataset for deeper exploration and analysis.

## 2.2. How many rows and how many columns?

In [3]:
row, col = df.shape
print("Number of Row: ", row)
print("Number of Column: ", col)

Number of Row:  1138
Number of Column:  14


The dataset contains **1138 rows** and **14 columns**. Each row is one record corresponding to a single **TV show**, and the 14 columns hold attributes of that show, such as **title**, **broadcast years**, **rating**, **runtime**, **cast**, **genres**, and **production companies**.

With 1138 shows and 14 attributes per show, the dataset is large enough to look for patterns and relationships, for example how ratings vary across genres, certifications, or countries of origin.

## 2.3. What is the meaning of each row?

In [4]:
df.head()

Unnamed: 0,Title,Years,Certification,Runtime,Rating,Number of Votes,Emmys,Creators,Actors,Genres,Coutries of origins,Languages,Production companies,Link
0,Queen Cleopatra,2023,TV-14,45m,1.2,86K,0,,"Jada Pinkett Smith, Adele James, Craig Russell...","Documentary, Drama, History",United Kingdom,English,Nutopia,https://www.imdb.com/title/tt27528139/?ref_=sr...
1,Velma,2023–2024,TV-MA,25m,1.6,80K,0,Charlie Grandy,"Mindy Kaling, Glenn Howerton, Sam Richardson, ...","Animation, Adventure, Comedy, Crime, Horror, M...","United States, South Korea",English,"Charlie Grandy Productions, Kaling Internation...",https://www.imdb.com/title/tt14153790/?ref_=sr...
2,Keeping Up with the Kardashians,2007–2021,TV-14,44m,2.9,32K,0,"Ryan Seacrest, Eliot Goldberg","Khloé Kardashian, Kim Kardashian, Kourtney Kar...","Family, Reality-TV",United States,"English, Spanish","Bunim-Murray Productions (BMP), Ryan Seacrest ...",https://www.imdb.com/title/tt1086761/?ref_=sr_...
3,Batwoman,2019–2022,TV-14,45m,3.6,47K,0,Caroline Dries,"Camrus Johnson, Rachel Skarsten, Meagan Tandy,...","Action, Adventure, Crime, Drama, Sci-Fi",United States,English,"Berlanti Productions, DC Entertainment, Warner...",https://www.imdb.com/title/tt8712204/?ref_=sr_...
4,The Acolyte,2024,TV-14,35m,4.1,125K,0,Leslye Headland,"Lee Jung-jae, Amandla Stenberg, Manny Jacinto,...","Action, Adventure, Drama, Fantasy, Mystery, Sc...",United States,English,"Lucasfilm, Disney+, The Walt Disney Company",https://www.imdb.com/title/tt12262202/?ref_=sr...


Each **row** in the dataset describes **one TV show**. Its fields record the show's **title**, **broadcast years**, **certification**, **runtime**, **rating**, **number of votes**, **Emmy wins**, **creators**, **cast**, **genres**, **countries of origin**, **languages**, **production companies**, and **IMDb link**. In the preview above, long text fields appear shortened with `...`, which is how pandas displays long strings.

Because every row has the same structure, shows can be compared directly on features such as **rating**, **number of votes**, **genre**, and **production company**, which is the basis for finding **patterns** and **relationships** in later sections.
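Since one row is one show but a field such as `Genres` holds a comma-separated list, comparing shows by genre needs that list split apart first. The sketch below is one possible way to do it, using three rows copied from the preview rather than the full dataset; the helper column name `Genre` is an assumption made for illustration.

```python
import pandas as pd

# Illustrative sample copied from the preview above (not the full dataset).
sample = pd.DataFrame({
    "Title": ["Queen Cleopatra", "Keeping Up with the Kardashians", "Batwoman"],
    "Rating": [1.2, 2.9, 3.6],
    "Genres": [
        "Documentary, Drama, History",
        "Family, Reality-TV",
        "Action, Adventure, Crime, Drama, Sci-Fi",
    ],
})

# Split each comma-separated string into a list, then give every genre its own row.
# "Genre" is a hypothetical helper column, not part of the original data.
genre_rows = (
    sample.assign(Genre=sample["Genres"].str.split(", "))
          .explode("Genre")
          .drop(columns="Genres")
          .reset_index(drop=True)
)

# Mean rating per genre; a show contributes once to every genre it lists.
mean_rating_by_genre = genre_rows.groupby("Genre")["Rating"].mean()
print(mean_rating_by_genre)
```

After this step a show is counted once per genre it lists, so per-genre totals will not add up to the number of shows.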

## 2.4. Are there duplicated rows?

In [5]:
df[df.duplicated(keep=False)]

Unnamed: 0,Title,Years,Certification,Runtime,Rating,Number of Votes,Emmys,Creators,Actors,Genres,Coutries of origins,Languages,Production companies,Link


Checking for **duplicated rows** with `df[df.duplicated(keep=False)]` returns an empty DataFrame, so there are **no duplicated rows**. Note that `duplicated()` compares all columns by default, so this confirms only that no two rows are identical in every field.

The **absence of exact duplicates** matters because repeated records would give some shows extra weight in counts and averages. No **deduplication** step is needed for exact copies, although the same show could still appear twice with slightly different values, which this check would not detect.
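A full-row comparison can miss a show that was recorded twice with slightly different values. A stricter check, sketched below on hypothetical rows, is to look for duplicates on an identifying column only. Treating `Link` as a unique key is an assumption, and the URLs are placeholders.

```python
import pandas as pd

# Hypothetical rows: the same show recorded twice with different vote counts.
sample = pd.DataFrame({
    "Title": ["Velma", "Velma", "Batwoman"],
    "Number of Votes": ["80K", "81K", "47K"],
    "Link": [
        "https://example.com/title/a",  # placeholder URLs, not real data
        "https://example.com/title/a",
        "https://example.com/title/b",
    ],
})

# Full-row comparison: the two Velma rows differ in one field, so none are flagged.
full_row_dupes = int(sample.duplicated(keep=False).sum())

# Key-based comparison: rows sharing the same Link are flagged.
link_dupes = int(sample.duplicated(subset=["Link"], keep=False).sum())

print("Exact duplicate rows:", full_row_dupes)
print("Rows sharing a Link:", link_dupes)
```

On the real data, the same idea would be `df.duplicated(subset=["Link"], keep=False)`, provided the link really does identify a show uniquely.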

## 2.5. What is the meaning of each column?

In [6]:
list(df.columns.values)

['Title',
 'Years',
 'Certification',
 'Runtime',
 'Rating',
 'Number of Votes',
 'Emmys',
 'Creators',
 'Actors',
 'Genres',
 'Coutries of origins',
 'Languages',
 'Production companies',
 'Link']

Here is a detailed explanation of each column in the dataset:

| **Column**             | **Description**                                                                                   | **Meaning**                                                                                                             |
|------------------------|---------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| **Title**              | Title of the TV show or miniseries.                                                                | This column contains the name or title of the TV show or miniseries, which identifies the specific show in the dataset. |
| **Years**              | Year(s) of broadcast of the TV show.                                                               | Represents the year or range of years in which the TV show aired or was first released. This provides insight into the show's age or broadcast timeline. |
| **Certification**      | The certification or rating that the show has received.                                             | Indicates the age rating or certification level of the TV show, which is assigned based on content appropriateness for various age groups. |
| **Runtime**            | Duration of the TV show in terms of time.                                                           | Refers to the total length of the show or the individual episodes, usually measured in minutes, indicating how long the show lasts. |
| **Rating**             | The score or rating given to the TV show, usually on a scale of 1 to 10.                          | Represents the average score given by viewers or critics on a 10-point scale. It helps measure the show's popularity and perceived quality. |
| **Number of Votes**    | The number of votes or reviews the TV show has received.                                           | Indicates the total number of votes or reviews submitted by viewers or critics for the TV show, giving an idea of its audience size and engagement. |
| **Emmys**              | The number of Emmy awards the TV show has won.                                                     | This column shows how many Emmy awards the show has won, which signifies its recognition in the television industry for excellence in various categories. |
| **Creators**           | The individuals or teams responsible for creating the TV show.                                      | Lists the creators, such as writers, directors, or producers, who are credited with the development and creation of the TV show. |
| **Actors**             | The main actors or cast members of the TV show.                                                    | Contains the names of the key actors or cast members who appeared in the TV show, providing insight into the talent behind the production. |
| **Genres**             | The type or category of the TV show.                                                                | Represents the genre(s) of the TV show (e.g., drama, comedy, sci-fi, etc.), which helps categorize the show based on its content and themes. |
| **Coutries of origins**| The country or countries where the TV show was produced.                                            | Shows the country of origin, indicating where the TV show was filmed or produced, which can provide context for the show's cultural background and style. The header is misspelled in the source file and is written here as it appears in `df.columns`. |
| **Languages**          | The primary languages in which the TV show is available or performed.                              | Lists the languages spoken or used in the TV show, which can indicate the primary audience and market for the show. |
| **Production companies**| The company or companies responsible for producing the TV show.                                     | Contains the names of the production companies that were involved in the creation and distribution of the TV show. |
| **Link**               | A URL link to the TV show's IMDb page.                                                             | Provides a hyperlink to the IMDb page for the TV show, where users can find additional details, reviews, and related content. |

This structured format helps provide a comprehensive understanding of each TV show in the dataset, offering valuable insights into its background, audience reception, and industry recognition.
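The column list above shows that the raw file spells one header as `Coutries of origins`. If a corrected name is preferred for later steps, a rename along the following lines would do it. The replacement name `Countries of origin` is an assumption, not something the dataset defines.

```python
import pandas as pd

# One-row illustrative sample using the header exactly as it appears in the raw file.
sample = pd.DataFrame({
    "Title": ["Velma"],
    "Coutries of origins": ["United States, South Korea"],
})

# rename() returns a new DataFrame; the original keeps the misspelled header.
# "Countries of origin" is a hypothetical corrected name chosen for this sketch.
renamed = sample.rename(columns={"Coutries of origins": "Countries of origin"})

print(list(renamed.columns))
```

Any later code must then use one spelling consistently, since pandas matches column names exactly.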

## 2.6. What is the current data type of each column? Are there columns having inappropriate data types?

### 2.6.1. Overview of Data Types:

This subsection gives a **general overview** of the dataset's structure by listing the **data type** that pandas inferred for each **column**, before any detailed or statistical analysis.

In [7]:
df.dtypes

Title                    object
Years                    object
Certification            object
Runtime                  object
Rating                  float64
Number of Votes          object
Emmys                     int64
Creators                 object
Actors                   object
Genres                   object
Coutries of origins      object
Languages                object
Production companies     object
Link                     object
dtype: object

The **data types** of each **column** in the dataset provide useful insights into the **structure** of the data. Here’s a breakdown and interpretation of each column's **data type**:

| **Column Name**          | **Data Type**   | **Description**                                                                                     |
|--------------------------|-----------------|-----------------------------------------------------------------------------------------------------|
| **Title**                 | object          | The **Title** column is of type `object`, which is appropriate as it contains the names of TV shows or programs, stored as strings. |
| **Years**                 | object          | The **Years** column is of type `object` because values are either a single year (`2023`) or a range (`2007–2021`). A range does not map to a single **datetime**, so splitting it into numeric start and end years is one option. |
| **Certification**         | object          | The **Certification** column is of type `object`, suitable for storing categorical data such as age ratings or certifications. |
| **Runtime**               | object          | The **Runtime** column is of type `object` because values carry a unit suffix (e.g., `45m`). It can be converted to a numeric type holding minutes once the suffix is parsed. |
| **Rating**                | float64         | The **Rating** column is of type `float64`, which is appropriate for numerical values with decimal points, as ratings are typically represented with decimals. |
| **Number of Votes**       | object          | The **Number of Votes** column is of type `object` because counts are abbreviated (e.g., `86K`). It should be converted to `int64`, as it represents a count of votes. |
| **Emmys**                 | int64           | The **Emmys** column is of type `int64`, which is correct since the number of Emmys won is represented as an integer. |
| **Creators**              | object          | The **Creators** column is of type `object`, suitable as it contains the names of the creators of the show. |
| **Actors**                | object          | The **Actors** column is of type `object`, appropriate for storing the names of actors in the show. |
| **Genres**                | object          | The **Genres** column is of type `object`, which is correct for storing categorical data such as genres (e.g., comedy, drama). |
| **Coutries of origins**   | object          | The **Coutries of origins** column (spelled this way in the source file) is of type `object`, which is suitable for storing categorical data regarding the countries of origin for the show. |
| **Languages**             | object          | The **Languages** column is of type `object`, appropriate as it stores the languages in which the show is available. |
| **Production Companies**  | object          | The **Production Companies** column is of type `object`, which is correct for storing the names of production companies. |
| **Link**                  | object          | The **Link** column is of type `object`, appropriate for storing URL links to the show's IMDb page. |

#### **_Summary_**:
- **_Correct Data Types_**: Most **columns** have appropriate **data types**. The **_numerical columns_** **`Rating`** and **`Emmys`** are correctly represented as **`float64`** and **`int64`** respectively. **_Text columns_** like **`Title`**, **`Creators`**, and **`Actors`** are stored as **`object`**, which is how strings are stored in this DataFrame.
- **_Potential Data Type Issues_**:
  - **`Number of Votes`** is stored as **`object`** because values are abbreviated (e.g., `86K`); it should be converted to **`int64`** since it represents a count of votes.
  - **`Runtime`** is stored as **`object`** because values carry a unit suffix (e.g., `45m`); a **numeric type** holding minutes would be more appropriate.
  - **`Years`** is stored as **`object`** because it mixes single years (`2023`) and ranges (`2007–2021`); separate numeric start and end years are likely easier to work with than a single **datetime**.
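The conversions suggested for `Number of Votes` and `Runtime` could look roughly like the sketch below. Only the `86K` and `45m` styles are visible in the preview. Support for an `M` suffix and for an `h` (hours) component is an assumption about what the full file may contain, and the helper names `parse_votes` and `parse_runtime` are hypothetical.

```python
import re

import pandas as pd


def parse_votes(text):
    """Convert strings such as '86K' to an integer count.

    The 'M' suffix is an assumption; only 'K' appears in the preview.
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text.replace(",", ""))


def parse_runtime(text):
    """Convert '45m' (or, as an assumption, '1h 30m') to minutes; NaN if missing."""
    if pd.isna(text):
        return float("nan")
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    if match is None or not any(match.groups()):
        return float("nan")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return float(hours * 60 + minutes)


# Illustrative values taken from the preview, plus one missing runtime.
sample = pd.DataFrame({
    "Number of Votes": ["86K", "80K", "125K"],
    "Runtime": ["45m", "25m", None],
})

sample["votes"] = sample["Number of Votes"].map(parse_votes).astype("int64")
sample["runtime_min"] = sample["Runtime"].map(parse_runtime)
print(sample.dtypes)
```

Runtime is kept as a float so that missing values can stay as `NaN` until a filling strategy is chosen.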

#### **_Potential Improvements_**:
- **_Missing Data Check_**: It’s important to check for missing values, especially in **`Runtime`** and **`Number of Votes`**, because missing or malformed entries will affect the conversion to numeric types.
- **_Consistency in `Runtime` Column_**: Before converting **`Runtime`**, verify that every value follows the same pattern (minutes with an `m` suffix in the preview) so that no entry is silently misparsed.
- **_Consistency in `Years` Column_**: Before converting **`Years`**, verify how single years, closed ranges, and any other variants are written so that start and end years can be extracted reliably.
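For `Years`, one way to get usable numeric values is to split each entry into a start year and an end year. The sketch below assumes the formats seen in the preview (`2023`, `2023–2024`) plus an open-ended form such as `2020–`, which is a guess about how ongoing shows might be written. The new column names are hypothetical.

```python
import pandas as pd

# The first three values come from the preview; "2020–" is an assumed open-ended format.
sample = pd.DataFrame({"Years": ["2023", "2023–2024", "2007–2021", "2020–"]})

# Group 1: the first four-digit year.
# Group 2: an optional second year after an en dash or a hyphen.
parts = sample["Years"].str.extract(r"^(\d{4})(?:\s*[–-]\s*(\d{4})?)?$")

# The nullable Int64 dtype keeps missing end years as <NA> rather than forcing floats.
sample["start_year"] = pd.to_numeric(parts[0]).astype("Int64")
sample["end_year"] = pd.to_numeric(parts[1]).astype("Int64")

print(sample)
```

A missing `end_year` here covers both single-year entries and open-ended ranges, so telling them apart would need an extra flag.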

#### **_Overall_**:
- Most of the data types are appropriate for the nature of the data in each column. The numerical columns **`Rating`** and **`Emmys`** are correctly represented as **`float64`** and **`int64`**, while text and categorical columns like **`Title`**, **`Certification`**, and **`Genres`** are stored as **`object`**.
- The main improvements are converting **`Number of Votes`**, **`Runtime`**, and **`Years`** to numerical types, and optionally converting columns with few distinct values, such as **`Certification`**, to a categorical type.

### 2.6.2. Missing Values in Each Column:

In [8]:
df.isna().sum()

Title                     0
Years                     0
Certification            14
Runtime                  57
Rating                    0
Number of Votes           0
Emmys                     0
Creators                198
Actors                    1
Genres                    0
Coutries of origins       0
Languages                 0
Production companies     17
Link                      0
dtype: int64

The table below restates the count of missing values per column from the output above:

| **Column**              | **Missing Values** |
|-------------------------|--------------------|
| **Title**               | 0                  |
| **Years**               | 0                  |
| **Certification**       | 14                 |
| **Runtime**             | 57                 |
| **Rating**              | 0                  |
| **Number of Votes**     | 0                  |
| **Emmys**               | 0                  |
| **Creators**            | 198                |
| **Actors**              | 1                  |
| **Genres**              | 0                  |
| **Coutries of origins** | 0                  |
| **Languages**           | 0                  |
| **Production companies**| 17                 |
| **Link**                | 0                  |

#### **Observations**:
- **`Runtime`** and **`Creators`** have the most missing values: 57 (about 5% of rows) and 198 (about 17% of rows), respectively.
- The **`Creators`** column is not essential for this analysis, so its missing values can be ignored.
- The **`Runtime`** column is an important feature and needs more care. It is currently stored as text and must be converted to a numeric type first. Filling its missing values with **`0`** is possible, but a zero would read as a zero-minute runtime and distort averages, so another strategy may be more suitable depending on the context.

#### **Next Steps**:
- **Missing Data in Numerical Columns**: Among the columns intended to be numeric, only **`Runtime`** has missing values (**`Number of Votes`** has none). Its **`NaN`** values will be replaced with **0** or another suitable filler value.
- **Non-Numerical Columns**: For non-numerical columns like **`Certification`**, **`Creators`**, **`Actors`**, and **`Production companies`**, missing values can be handled with placeholders or by ignoring those records if needed.

In short, **Runtime** and **Creators** account for most of the missing values (NaN). **Creators** is set aside, while **Runtime** is handled in more detail in the subsequent steps.
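As an alternative to filling missing runtimes with 0, the sketch below fills them with the median and keeps a flag marking which values were imputed. This is a swapped-in technique shown for comparison, not the approach the notebook commits to. It assumes `Runtime` has already been converted to minutes, and the column names `runtime_min`, `runtime_missing`, and `runtime_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical runtimes in minutes, with one missing entry.
sample = pd.DataFrame({"runtime_min": [45.0, 25.0, np.nan, 44.0, 35.0]})

# Record which rows were missing before filling, so the information is not lost.
sample["runtime_missing"] = sample["runtime_min"].isna()

# median() skips NaN by default, so it is computed from the four known values.
median_runtime = sample["runtime_min"].median()
sample["runtime_filled"] = sample["runtime_min"].fillna(median_runtime)

print(sample)
```

Keeping the `runtime_missing` flag makes it possible to exclude imputed rows later or to check whether missingness itself relates to the rating.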