# Netflix User Segmentation and Behavior EDA

#### Importing the Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#### Loading the data

In [5]:
# A forward slash works on Windows, macOS, and Linux; a backslash path is Windows-only
df = pd.read_csv('data/netflix_users.csv')

#### Displaying the first 10 rows for a quick glance at the data

In [6]:
df.head(10)

Unnamed: 0,User_ID,Name,Age,Country,Subscription_Type,Watch_Time_Hours,Favorite_Genre,Last_Login
0,1,James Martinez,18,France,Premium,80.26,Drama,2024-05-12
1,2,John Miller,23,USA,Premium,321.75,Sci-Fi,2025-02-05
2,3,Emma Davis,60,UK,Basic,35.89,Comedy,2025-01-24
3,4,Emma Miller,44,USA,Premium,261.56,Documentary,2024-03-25
4,5,Jane Smith,68,USA,Standard,909.3,Drama,2025-01-14
5,6,David Johnson,21,USA,Standard,615.93,Romance,2025-02-03
6,7,John Hernandez,57,Canada,Standard,755.47,Romance,2025-01-05
7,8,Katie Hernandez,68,USA,Standard,145.23,Sci-Fi,2024-10-30
8,9,James Williams,39,UK,Basic,950.14,Action,2024-04-16
9,10,Alex Davis,55,Mexico,Standard,696.66,Horror,2024-07-03


#### Displaying the number of rows and columns in the dataset

In [11]:
rows, columns = df.shape
print(f'This dataset contains {rows} rows and {columns} columns.')

This dataset contains 25000 rows and 8 columns.


#### Displaying the Column Names

In [8]:
df.columns

Index(['User_ID', 'Name', 'Age', 'Country', 'Subscription_Type',
       'Watch_Time_Hours', 'Favorite_Genre', 'Last_Login'],
      dtype='str')

#### Based on the Netflix User Database, the meaning of each column is as follows:

- User_ID – Unique identifier for each Netflix user.
- Name – Name of the user (may be anonymized in some datasets).
- Age – Age of the user in years.
- Country – Country of residence of the user.
- Subscription_Type – Type of subscription the user has (e.g., Basic, Standard, Premium).
- Watch_Time_Hours – Total hours the user has spent watching content on Netflix.
- Favorite_Genre – User’s preferred content genre (e.g., Drama, Action, Comedy).
- Last_Login – Timestamp or date of the user’s most recent login activity.

In [12]:
df.describe()

Unnamed: 0,User_ID,Age,Watch_Time_Hours
count,25000.0,25000.0,25000.0
mean,12500.5,46.48288,500.468858
std,7217.022701,19.594861,286.381815
min,1.0,13.0,0.12
25%,6250.75,29.0,256.5675
50%,12500.5,46.0,501.505
75%,18750.25,63.0,745.7325
max,25000.0,80.0,999.99


#### Checking the data type of the Columns

In [9]:
df.dtypes

User_ID                int64
Name                     str
Age                    int64
Country                  str
Subscription_Type        str
Watch_Time_Hours     float64
Favorite_Genre           str
Last_Login               str
dtype: object

#### Observations on Data Types

- Last_Login is stored as str instead of a datetime type, so it needs converting before any date-based analysis.
- Subscription_Type, Country and Favorite_Genre can be stored as categorical variables for better aggregation and segmentation.
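
The benefit of a categorical type can be illustrated with a minimal sketch. The `plans` series below is a hypothetical stand-in for Subscription_Type, not the real column, and the exact byte counts will vary with the pandas version.

```python
import pandas as pd

# Hypothetical toy column standing in for Subscription_Type (not the real data)
plans = pd.Series(['Basic', 'Standard', 'Premium'] * 1000)

# Memory used when every value is stored as its own Python string
as_object = plans.astype('object').memory_usage(deep=True)

# Memory used when values are stored as small integer codes plus a lookup table
as_category = plans.astype('category').memory_usage(deep=True)

print(f'object: {as_object} bytes, category: {as_category} bytes')
```

With only a few distinct labels repeated many times, the categorical version should come out noticeably smaller.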

#### Checking all unique values of each column

In [13]:
for col in df.columns:
    unique_vals = df[col].unique()
    print(f'Name of the Column : {col}')
    print(f'Number of unique values present : {len(unique_vals)}')
    print(f'Unique Values : {unique_vals}')
    print('-'*50)

Name of the Column : User_ID
Number of unique values present : 25000
Unique Values : [    1     2     3 ... 24998 24999 25000]
--------------------------------------------------
Name of the Column : Name
Number of unique values present : 100
Unique Values : <StringArray>
[   'James Martinez',       'John Miller',        'Emma Davis',
       'Emma Miller',        'Jane Smith',     'David Johnson',
    'John Hernandez',   'Katie Hernandez',    'James Williams',
        'Alex Davis',       'Jane Miller',     'Jane Martinez',
     'Alex Martinez',        'Alex Smith',     'Michael Jones',
      'Chris Miller',       'Chris Davis',     'Emma Williams',
      'Alex Johnson',        'John Jones',  'Michael Williams',
       'Sarah Davis',        'John Smith',        'Alex Brown',
     'Chris Johnson',       'James Jones',   'Michael Johnson',
       'Sarah Smith',       'Chris Jones',     'John Williams',
        'Jane Brown',       'David Davis',    'Emma Hernandez',
    'Sarah Williams',   

#### Summary from the observations

- Columns with correct types: User_ID, Name, Age, Country, Subscription_Type, Favorite_Genre and Watch_Time_Hours are all fine.
- Columns that need adjustment: Last_Login needs to be converted from str to datetime.


#### Checking for missing values per column (including empty strings)

In [14]:
# Count of NaN / None values per column
nan_count = df.isnull().sum()

# Count of empty strings per column
empty_count = (df == '').sum()

# Total missing values (NaN + empty strings)
total_missing = nan_count + empty_count

# Percentage of missing values
missing_percent = (total_missing / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Values': total_missing,
    'Percentage (%)': missing_percent
}).sort_values(by='Percentage (%)', ascending=False)

print(missing_df)


                   Missing Values  Percentage (%)
User_ID                         0             0.0
Name                            0             0.0
Age                             0             0.0
Country                         0             0.0
Subscription_Type               0             0.0
Watch_Time_Hours                0             0.0
Favorite_Genre                  0             0.0
Last_Login                      0             0.0
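
The check above counts NaN values and exact empty strings, so a cell containing only spaces would not be counted. A minimal sketch of a stricter check follows; the `sample` frame and the `is_blank` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample with one whitespace-only cell, one empty string, and one None
sample = pd.DataFrame({
    'Country': ['USA', '  ', 'UK', ''],
    'Favorite_Genre': ['Drama', 'Comedy', None, 'Action'],
    'Age': [18, 23, 60, 44],
})

def is_blank(value):
    """Return True for NaN/None and for strings that are empty after stripping."""
    if isinstance(value, str):
        return value.strip() == ''
    return pd.isna(value)

# Apply the helper cell by cell, then count the blanks in each column
blank_counts = sample.apply(lambda col: col.map(is_blank)).sum()
print(blank_counts)
```

On the real dataset the same pattern could be applied to `df` in place of `sample`.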


#### Checking for Duplicates in the dataset

In [16]:
# Check for duplicate rows
print(df.duplicated().sum())

0
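
Because every User_ID is unique, `df.duplicated()` cannot flag any full-row duplicates here. A minimal sketch of two complementary checks is shown below, using a hypothetical `sample` frame: one ignores User_ID when comparing rows, the other looks for repeated IDs.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 describe the same profile under different IDs
sample = pd.DataFrame({
    'User_ID': [1, 2, 3],
    'Name': ['Jane Smith', 'Jane Smith', 'Alex Davis'],
    'Age': [68, 68, 55],
    'Country': ['USA', 'USA', 'Mexico'],
})

# Full-row comparison: the unique ID makes every row distinct
full_row_dupes = sample.duplicated().sum()

# Compare on every column except the identifier
profile_cols = [c for c in sample.columns if c != 'User_ID']
profile_dupes = sample.duplicated(subset=profile_cols).sum()

# Repeated identifiers would indicate a data-integrity problem
id_dupes = sample['User_ID'].duplicated().sum()

print(full_row_dupes, profile_dupes, id_dupes)
```

Whether matching profiles count as true duplicates is a judgment call, since two different users can share a name, age, and country.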


#### Data Cleaning

In [17]:
# Convert Last_Login from string to datetime
df['Last_Login'] = pd.to_datetime(df['Last_Login'], errors='coerce')

# Convert categorical columns to 'category' type
categorical_cols = ['Country', 'Subscription_Type', 'Favorite_Genre']
for col in categorical_cols:
    df[col] = df[col].astype('category')

# Check data types after conversion
print(df.dtypes)

User_ID                       int64
Name                            str
Age                           int64
Country                    category
Subscription_Type          category
Watch_Time_Hours            float64
Favorite_Genre             category
Last_Login           datetime64[us]
dtype: object
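
Since `errors='coerce'` silently turns unparseable strings into NaT, it can be worth confirming that nothing was lost in the conversion. The sketch below uses a hypothetical `raw` series with one deliberately invalid entry to show the pattern.

```python
import pandas as pd

# Hypothetical raw values: two valid dates and one invalid entry
raw = pd.Series(['2024-05-12', '2025-02-05', 'not a date'])

# Invalid strings become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors='coerce')

# Values that were present before parsing but are missing afterwards
failed = raw[parsed.isna() & raw.notna()]

print(f'Unparseable values: {len(failed)}')
print(failed.tolist())
```

Applied to the notebook's data, this would require keeping a copy of the original string column before it is overwritten.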


## Time for Analysis

#### Univariate Analysis

#### Bivariate Analysis

#### Multivariate Analysis