# 2. Data Exploration

## 2.0. Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

This section imports the essential Python libraries required for data analysis, visualization, and statistical exploration. Each library serves a specific purpose:

- **NumPy (np)**: Provides support for multi-dimensional arrays and a wide range of mathematical operations, facilitating efficient numerical computations.
- **Pandas (pd)**: A powerful data manipulation library used for loading, cleaning, and analyzing structured data through its **DataFrame** and **Series** objects.
- **Matplotlib (plt)**: A versatile library for creating static, interactive, and animated visualizations, primarily used for plotting data in 2D.
- **Seaborn (sns)**: Built on top of Matplotlib, this library simplifies the creation of visually appealing and informative statistical graphics, offering a high-level interface for data visualization.

Together, these libraries cover the numerical computation, data handling, and plotting needed for the exploration that follows.

## 2.1. Read data

In [2]:
data_path = './data/tvshows_raw.csv'
df = pd.read_csv(data_path)

In this section, we use the **Pandas** library to read data from a CSV file containing information about TV shows.

- First, the file path is stored in the variable `data_path`.
- Then, the `pd.read_csv()` function is called to load the data from the file and convert it into a **DataFrame** object, which allows for easier manipulation, analysis, and visualization of the data.
- The `df` variable holds the loaded dataset, which can be used for further processing in the subsequent steps of the analysis.

Because `data_path` is a relative path, the notebook must be run from the directory that contains the `data` folder. Once loaded, the dataset is ready for the exploration steps below.
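As a minimal sketch of what `pd.read_csv()` returns, the snippet below parses a small in-memory CSV instead of the real file, so it runs without `./data/tvshows_raw.csv`. The three sample rows and four columns are copied from this dataset's preview; the names `sample_csv` and `sample_df` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a few columns and rows from the dataset preview.
sample_csv = io.StringIO(
    "Title,Years,Rating,Number of Votes\n"
    "Queen Cleopatra,2023,1.2,86K\n"
    "Velma,2023–2024,1.6,80K\n"
    "Batwoman,2019–2022,3.6,47K\n"
)

# read_csv accepts a file path or any file-like object and returns a DataFrame.
sample_df = pd.read_csv(sample_csv)

print(type(sample_df).__name__)
print(sample_df.dtypes)
```

In this sketch only `Rating` is parsed as a floating-point column. `Years` and `Number of Votes` are read as text, because values like `2023–2024` and `86K` are not plain numbers.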

## 2.2. How many rows and how many columns?

In [3]:
row, col = df.shape
print("Number of Row: ", row)
print("Number of Column: ", col)

Number of Row:  1138
Number of Column:  14


The dataset contains **1138 rows** and **14 columns**. Each row is one record corresponding to a single **TV show**, and each column is one attribute of that show. As the preview in section 2.3 shows, the columns are `Title`, `Years`, `Certification`, `Runtime`, `Rating`, `Number of Votes`, `Emmys`, `Creators`, `Actors`, `Genres`, `Coutries of origins` (spelled this way in the source file), `Languages`, `Production companies`, and `Link`.

With 14 attributes per show, the dataset supports comparisons across **genres**, **countries of origin**, and **production companies**. At 1138 rows it is large enough to look for **patterns** and **trends** in ratings and vote counts. Whether it is sufficient for a particular analysis also depends on how complete each column is; the preview already shows an empty `Creators` value in the first row.
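The relationship between `df.shape` and the row and column counts can be sketched on a toy frame. The DataFrame below is an illustrative assumption that only borrows two column names and three values from the dataset; it is not the real data.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, but it holds only three rows.
toy = pd.DataFrame(
    {
        "Title": ["Queen Cleopatra", "Velma", "Batwoman"],
        "Rating": [1.2, 1.6, 3.6],
    }
)

# shape is a (rows, columns) tuple, so it unpacks into two integers.
n_rows, n_cols = toy.shape
print(f"Rows: {n_rows}, columns: {n_cols}")

# len() on a DataFrame counts rows; len() on .columns counts columns.
assert len(toy) == n_rows
assert len(toy.columns) == n_cols
```

Note that the index is not counted as a column, which is why the real dataset reports 14 columns even though its preview displays an extra leading index field.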

## 2.3. What is the meaning of each row?

In [4]:
df.head()

| | Title | Years | Certification | Runtime | Rating | Number of Votes | Emmys | Creators | Actors | Genres | Coutries of origins | Languages | Production companies | Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Queen Cleopatra | 2023 | TV-14 | 45m | 1.2 | 86K | 0 | | Jada Pinkett Smith, Adele James, Craig Russell... | Documentary, Drama, History | United Kingdom | English | Nutopia | https://www.imdb.com/title/tt27528139/?ref_=sr... |
| 1 | Velma | 2023–2024 | TV-MA | 25m | 1.6 | 80K | 0 | Charlie Grandy | Mindy Kaling, Glenn Howerton, Sam Richardson, ... | Animation, Adventure, Comedy, Crime, Horror, M... | United States, South Korea | English | Charlie Grandy Productions, Kaling Internation... | https://www.imdb.com/title/tt14153790/?ref_=sr... |
| 2 | Keeping Up with the Kardashians | 2007–2021 | TV-14 | 44m | 2.9 | 32K | 0 | Ryan Seacrest, Eliot Goldberg | Khloé Kardashian, Kim Kardashian, Kourtney Kar... | Family, Reality-TV | United States | English, Spanish | Bunim-Murray Productions (BMP), Ryan Seacrest ... | https://www.imdb.com/title/tt1086761/?ref_=sr_... |
| 3 | Batwoman | 2019–2022 | TV-14 | 45m | 3.6 | 47K | 0 | Caroline Dries | Camrus Johnson, Rachel Skarsten, Meagan Tandy,... | Action, Adventure, Crime, Drama, Sci-Fi | United States | English | Berlanti Productions, DC Entertainment, Warner... | https://www.imdb.com/title/tt8712204/?ref_=sr_... |
| 4 | The Acolyte | 2024 | TV-14 | 35m | 4.1 | 125K | 0 | Leslye Headland | Lee Jung-jae, Amandla Stenberg, Manny Jacinto,... | Action, Adventure, Drama, Fantasy, Mystery, Sc... | United States | English | Lucasfilm, Disney+, The Walt Disney Company | https://www.imdb.com/title/tt12262202/?ref_=sr... |


Each **row** in the dataset represents a single **TV show**. Its columns describe the show's title, years, certification, runtime, rating, number of votes, Emmys, creators, actors, genres, countries of origin, languages, production companies, and IMDb link. For example, the first row describes *Queen Cleopatra* (2023), a TV-14 show in the Documentary, Drama, and History genres with a rating of 1.2 from 86K votes.

Because every row has the same attributes, shows can be compared directly on features such as **rating**, **number of votes**, **genre**, and **production company**. Several columns (`Actors`, `Genres`, `Production companies`) hold comma-separated lists in a single cell, and `Runtime` and `Number of Votes` are stored as text such as `45m` and `86K`, so these columns will need parsing before they can be used in numerical analysis.
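To make the "one row, one show" idea concrete, the sketch below pulls a single row out of a small frame and reads its fields by column name. The two rows are copied from the preview above and restricted to four columns for brevity; the names `preview` and `first_show` are illustrative assumptions.

```python
import pandas as pd

# Two rows from the preview, limited to a few columns.
preview = pd.DataFrame(
    {
        "Title": ["Queen Cleopatra", "Batwoman"],
        "Years": ["2023", "2019–2022"],
        "Rating": [1.2, 3.6],
        "Genres": [
            "Documentary, Drama, History",
            "Action, Adventure, Crime, Drama, Sci-Fi",
        ],
    }
)

# iloc[0] returns the first row as a Series indexed by column name.
first_show = preview.iloc[0]
print(first_show["Title"])
print(first_show["Rating"])

# A multi-valued cell is a plain string, so it is split before per-genre work.
genres = [g.strip() for g in first_show["Genres"].split(",")]
print(genres)
```

Splitting on commas is one simple option for cells like `Genres`; it would not be safe for a column whose individual values can themselves contain commas.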

## 2.4. Are there duplicated rows?

In [5]:
df[df.duplicated(keep=False)]

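| | Title | Years | Certification | Runtime | Rating | Number of Votes | Emmys | Creators | Actors | Genres | Coutries of origins | Languages | Production companies | Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |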


The command `df[df.duplicated(keep=False)]` returns an empty DataFrame, so the dataset contains **no duplicated rows**: no row is identical to another across all 14 columns.

This matters because repeated records would give some shows extra weight and could **bias** counts and averages. Note that the check only detects rows that match in every column. The same show could still appear twice with slightly different values, which a check restricted to a key column such as `Title` or `Link` would reveal. For exact duplicates, no **data cleansing** is needed before further **analysis** and **modeling**.
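The behaviour of `duplicated(keep=False)` is easier to see on data that does contain duplicates. The toy frame below uses hypothetical titles and ratings (they are not from the dataset) and contrasts the exact-row check used above with a check restricted to one column through the `subset` parameter.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical; row 3 repeats a title with a new rating.
toy = pd.DataFrame(
    {
        "Title": ["Show A", "Show B", "Show A", "Show B"],
        "Rating": [7.5, 8.0, 7.5, 6.1],
    }
)

# keep=False marks every member of a duplicate group, not only the later copies.
exact_dupes = toy[toy.duplicated(keep=False)]
print(exact_dupes.index.tolist())

# Restricting the comparison to a key column also catches rows that differ elsewhere.
same_title = toy[toy.duplicated(subset=["Title"], keep=False)]
print(same_title.index.tolist())

# With the default keep="first", only the repeated copies are flagged.
print(int(toy.duplicated().sum()))
```

Here the exact check flags rows 0 and 2, while the title-only check flags all four rows. Applying the same `subset` idea to the real dataset would show whether any show is listed more than once under different values.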