# Data Processing and Cleaning

## 1. **Data Loading and Initial Inspection**

### Description:

Before cleaning and processing, we load the dataset and inspect its structure, including column names, data types, and summary statistics.

In [36]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("primary_data.csv")

# Display the first 5 rows
print(df.head())

# Display dataset information (columns, data types, non-null counts)
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

# Display summary statistics for numerical columns
print(df.describe())

# Load the dataset

In [37]:
df = pd.read_csv("primary_data.csv")

# Display the first 5 rows

In [38]:
df.head()

|   | Video ID | Title | Duration | Views | Likes | Comments | Language | Topic | Publication Time | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KbiwL74KyJQ | Shrek 5 Cast Announcement | PT28S | 7871600 | 156303 | 0 | English | Entertainment | 2025-02-27T16:01:18Z | US |
| 1 | 8B1EtVPBSMw | A Minecraft Movie \| Final Trailer | PT2M31S | 11843815 | 239419 | 22975 | English | Film & Animation | 2025-02-27T20:00:04Z | US |
| 2 | W7FTkUFU7nw | Pokémon Presents \| 2.27.2025 | PT19M16S | 3599910 | 156222 | 18513 | English | Gaming | 2025-02-27T14:00:06Z | US |
| 3 | r5VRqWkFpEQ | LISA - FUTW (Vixi Solo Version) (Official Musi... | PT3M48S | 8218102 | 1016119 | 66437 | English | Music | 2025-02-28T05:01:26Z | US |
| 4 | vONxgCQWZCA | YoungBoy Never Broke Again - 5 Night [Official... | PT3M16S | 540591 | 60888 | 5892 | English | Music | 2025-02-28T06:32:53Z | US |


# Display dataset information (columns, data types, non-null counts)

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2513 entries, 0 to 2512
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Video ID          2513 non-null   object
 1   Title             2513 non-null   object
 2   Duration          2513 non-null   object
 3   Views             2513 non-null   int64 
 4   Likes             2513 non-null   int64 
 5   Comments          2513 non-null   int64 
 6   Language          2513 non-null   object
 7   Topic             2513 non-null   object
 8   Publication Time  2513 non-null   object
 9   Region            2513 non-null   object
dtypes: int64(3), object(7)
memory usage: 196.5+ KB


# Display summary statistics for numerical columns

In [40]:
print(df.describe())

              Views         Likes       Comments
count  2.513000e+03  2.513000e+03    2513.000000
mean   2.578657e+06  9.285591e+04    5555.045762
std    8.352468e+06  3.028678e+05   15382.891750
min    0.000000e+00  0.000000e+00       0.000000
25%    3.243060e+05  9.028000e+03     533.000000
50%    7.442750e+05  2.301400e+04    1477.000000
75%    1.819733e+06  6.751600e+04    3986.000000
max    1.465875e+08  4.580506e+06  173803.000000


## 2. **Handling Missing Values**

### Description:
We check every column for missing values and handle them appropriately. For this dataset, the rule is to drop any row that is missing a critical field: `Views`, `Likes`, or `Comments`.

# Check for missing values

In [41]:
print(df.isnull().sum())

Video ID            0
Title               0
Duration            0
Views               0
Likes               0
Comments            0
Language            0
Topic               0
Publication Time    0
Region              0
dtype: int64


**No missing values found**, so no rows were dropped.
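Had any critical field been missing, the drop rule described in this section could be applied with `dropna`. The following is a minimal sketch on a small invented frame (`toy` and its values are hypothetical, not taken from `primary_data.csv`):

```python
import pandas as pd

# Hypothetical toy frame that reuses the dataset's critical column names
toy = pd.DataFrame({
    "Video ID": ["a1", "b2", "c3", "d4"],
    "Views": [100, None, 300, 400],
    "Likes": [10, 20, None, 40],
    "Comments": [1, 2, 3, 4],
})

critical_fields = ["Views", "Likes", "Comments"]

# Keep only the rows in which every critical field is present
cleaned = toy.dropna(subset=critical_fields).reset_index(drop=True)

print(f"Rows dropped: {len(toy) - len(cleaned)}")
```

Restricting `subset` to the critical fields means a missing value in a less important column would not, on its own, cause a row to be discarded.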

## 3. **Removing Duplicates**

### Description:
We check for duplicate rows based on the `Video ID` column and remove them to ensure data integrity.

# Check for duplicate Video IDs

In [48]:
print(f"Duplicate rows: {df.duplicated(subset='Video ID').sum()}")


Duplicate rows: 656


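**656 duplicate rows found.** Each of these rows repeats a `Video ID` that already appears earlier in the dataset, so they need to be removed before further analysis.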
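The check above prints a count of 656, so a removal step is needed. One possible way to do it is `drop_duplicates` on the `Video ID` column. The sketch below uses a small invented frame (`toy` and its `Region` values are hypothetical) and assumes that keeping the first occurrence of each ID is acceptable:

```python
import pandas as pd

# Hypothetical toy frame in which two Video IDs appear more than once
toy = pd.DataFrame({
    "Video ID": ["a1", "b2", "a1", "c3", "b2"],
    "Region": ["US", "US", "GB", "US", "CA"],
})

# Count rows whose Video ID repeats an earlier row
n_dupes = int(toy.duplicated(subset="Video ID").sum())

# Keep the first occurrence of each Video ID and discard the rest
deduped = toy.drop_duplicates(subset="Video ID", keep="first").reset_index(drop=True)

print(f"Duplicate rows: {n_dupes}")
print(f"Rows remaining: {len(deduped)}")
```

With `keep="first"`, whatever differs in the discarded rows (here the invented `Region` values) is lost, so it is worth inspecting the duplicated rows before choosing which copy to keep.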

## 4. **Data Type Conversion**
### Description:
We convert the `Duration` column from ISO 8601 format (for example, `PT2M31S`) to seconds for easier analysis. We also convert the `Publication Time` column to a datetime object.

In [55]:
# Function to convert an ISO 8601 duration (e.g. "PT2M31S") to seconds
def iso_to_seconds(duration):
    # Strip the "PT" prefix and add a space after each unit so the string splits into parts
    parts = duration.replace("PT", "").replace("H", "H ").replace("M", "M ").split()
    seconds = 0
    for part in parts:
        if 'H' in part:
            seconds += int(part.replace('H', '')) * 3600  # Convert hours to seconds
        elif 'M' in part:
            seconds += int(part.replace('M', '')) * 60    # Convert minutes to seconds
        elif 'S' in part:
            seconds += int(part.replace('S', ''))         # Add seconds
    return seconds

# Convert only if the original column still exists. Re-running this cell after
# "Duration" has been dropped would otherwise raise KeyError: 'Duration'.
if "Duration" in df.columns:
    # Apply the function to the Duration column
    df["Duration (seconds)"] = df["Duration"].apply(iso_to_seconds)

    # Drop the original Duration column
    df = df.drop(columns="Duration")
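The description for this step also calls for converting `Publication Time` to a datetime object. A minimal sketch of that conversion with `pd.to_datetime`, using two timestamps copied from the sample rows shown earlier (the `toy` frame itself is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame holding two timestamps from the sample rows
toy = pd.DataFrame({
    "Publication Time": ["2025-02-27T16:01:18Z", "2025-02-28T05:01:26Z"],
})

# Parse the ISO 8601 strings; utc=True yields a timezone-aware UTC column
toy["Publication Time"] = pd.to_datetime(toy["Publication Time"], utc=True)

# Datetime columns expose components through the .dt accessor
hours = toy["Publication Time"].dt.hour.tolist()
print(hours)
```

Once the column is a datetime, components such as the hour or weekday can be derived through `.dt` without further string parsing.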