########### Pandas - EDA (Exploratory Data Analysis) ###########

Metadata

In [0]:
## Metadata is simply “data about data.” 
    - It doesn’t hold the content itself, but it describes and gives context to that content so it’s easier to understand, organize, and use.



Install Dependencies

In [0]:
!pip install openpyxl -q

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Pandas and Numpy Libraries

In [0]:

## The pandas library is one of the most essential tools in Python for data analysis and manipulation.
    - It lets you read and write data, clean and transform it, and analyze and visualize the results.

## NumPy (short for Numerical Python) is a foundational library for scientific computing in Python.
    - Efficient handling of large arrays and matrices
    - Fast mathematical operations
    - Support for linear algebra, statistics, and random number generation

In [0]:
## The line import pandas as pd loads the pandas library and binds it to the alias pd.
    # With the alias, you can type pd.read_csv() instead of pandas.read_csv().

import pandas as pd

## The line import numpy as np loads the NumPy library and binds it to the alias np, so its functions can be called as np.array(), np.mean(), and so on.
import numpy as np

Data Load/Ingestion

In [0]:
## data_path is a variable that stores the path to the CSV file that contains the data you want to analyze.
    # CSV file location
data_path = "/Workspace/Users/ramslazio@gmail.com/Viewership Analysis.csv"



In [0]:
## Excel file location

# data_path = "/Workspace/Users/ramslazio@gmail.com/Viewership Analysis.xlsx"

In [0]:
## The pd.read_csv() function is used to read data from a CSV file into a pandas DataFrame. The function takes the path to the CSV file as an argument and returns a DataFrame object that contains the data from the CSV file.

# df = pd.read_csv(data_path)

In [0]:
## pd.read_csv(data_path) reads the CSV file, and the resulting DataFrame is assigned directly to survey_analysis. This is the variable used for the rest of the analysis (no intermediate df variable is created).

survey_analysis = pd.read_csv(data_path)
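Because the notebook keeps both a CSV path and a commented-out Excel path, a small loader that picks the reader from the file extension may be convenient. This is only a sketch: load_table is a hypothetical helper name, not part of pandas, and the set of handled extensions is an assumption.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Read a CSV or Excel file into a DataFrame, chosen by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xlsm"}:
        # pandas reads .xlsx files through the openpyxl engine,
        # presumably the reason openpyxl is installed at the top of the notebook.
        return pd.read_excel(path, engine="openpyxl")
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Hypothetical usage, reusing data_path from the cells above:
# survey_analysis = load_table(data_path)
```

With a helper like this, switching between the CSV and Excel versions of the file only requires changing data_path.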

In [0]:
# Display data

display(survey_analysis)

DateID,CustomerID,TotalTimeWatched,Platform,PlayEventType,VideoTitle
20201101,EW1DENH0EC1J3M9WAOZF9LSV004O,300.0,Leanback,LiveTV,F1 '20: Emilia Romagna GP
20201101,EW1DENH0EC1J3M9WAOZF9LSV004O,300.0,Leanback,LiveTV,F1 '20: Emilia Romagna GP
20201101,6TS2LLY0L3G66FVY86Q0JEZE000K,360.0,Leanback,Other,Chasing The Sun
20201101,6TS2LLY0L3G66FVY86Q0JEZE000K,360.0,Leanback,Other,Chasing The Sun
20201101,6PMV67PLJ2S47S68J0Y30XFK003C,120.0,Leanback,LiveTV,Sonic The Hedgehog
20201101,6PMV67PLJ2S47S68J0Y30XFK003C,120.0,Leanback,LiveTV,Frozen II
20201101,6PMV67PLJ2S47S68J0Y30XFK003C,120.0,iOS,LiveTV,PL 20/21: Man Utd v Arsenal
20201101,6PMV67PLJ2S47S68J0Y30XFK003C,120.0,iOS,LiveTV,PL 20/21: Man Utd v Arsenal
20201101,6PMV67PLJ2S47S68J0Y30XFK003C,60.0,iOS,LiveTV,F1 '20: Emilia Romagna GP
20201101,6PMV67PLJ2S47S68J0Y30XFK003C,600.0,iOS,LiveTV,F1 '20: Emilia Romagna GP


In [0]:
# survey_analysis.head() is a method that returns the first five rows of the DataFrame by default; pass a number, e.g. head(10), to see more.
survey_analysis.head()

Unnamed: 0,DateID,CustomerID,TotalTimeWatched,Platform,PlayEventType,VideoTitle
0,20201101,EW1DENH0EC1J3M9WAOZF9LSV004O,300.0,Leanback,LiveTV,F1 '20: Emilia Romagna GP
1,20201101,EW1DENH0EC1J3M9WAOZF9LSV004O,300.0,Leanback,LiveTV,F1 '20: Emilia Romagna GP
2,20201101,6TS2LLY0L3G66FVY86Q0JEZE000K,360.0,Leanback,Other,Chasing The Sun
3,20201101,6TS2LLY0L3G66FVY86Q0JEZE000K,360.0,Leanback,Other,Chasing The Sun
4,20201101,6PMV67PLJ2S47S68J0Y30XFK003C,120.0,Leanback,LiveTV,Sonic The Hedgehog


In [0]:
# survey_analysis.shape is an attribute (not a method, so no parentheses) that returns a tuple of (number of rows, number of columns).
survey_analysis.shape


(118534, 6)

In [0]:
# survey_analysis.dtypes is an attribute that returns the data type of each column.
survey_analysis.dtypes

DateID                int64
CustomerID           object
TotalTimeWatched    float64
Platform             object
PlayEventType        object
VideoTitle           object
dtype: object

In [0]:
# survey_analysis.describe() is a method that returns summary statistics for the numeric columns only (here DateID and TotalTimeWatched): count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
survey_analysis.describe()

Unnamed: 0,DateID,TotalTimeWatched
count,118534.0,118018.0
mean,20121470.0,2046.980054
std,1330494.0,3739.670014
min,2.0,1.0
25%,20210120.0,240.0
50%,20210220.0,1020.0
75%,20210320.0,2400.0
max,20210420.0,88500.0
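By default describe() skips the text columns, so Platform, PlayEventType, and the other object columns do not appear in the summary above. One way to summarize them is to exclude the numeric dtypes instead. The sketch below uses a small sample built from rows in the display output, not the full dataset.

```python
import numpy as np
import pandas as pd

# Small illustrative sample based on rows shown in the display output.
sample = pd.DataFrame({
    "DateID": [20201101, 20201101, 20201101, 20201101],
    "TotalTimeWatched": [300.0, 360.0, 120.0, 120.0],
    "Platform": ["Leanback", "Leanback", "Leanback", "iOS"],
    "PlayEventType": ["LiveTV", "Other", "LiveTV", "LiveTV"],
})

# Default behaviour: numeric columns only.
numeric_summary = sample.describe()

# Excluding numeric dtypes summarizes the remaining (text) columns
# with count, unique, top (most frequent value), and freq.
text_summary = sample.describe(exclude=[np.number])

print(text_summary)
```

On the real data the same call would be survey_analysis.describe(exclude=[np.number]).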


In [0]:
# survey_analysis.info() is a method that prints a concise overview of the DataFrame: the number of rows, the non-null count and data type of each column, and the memory usage.
survey_analysis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118534 entries, 0 to 118533
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   DateID            118534 non-null  int64  
 1   CustomerID        118018 non-null  object 
 2   TotalTimeWatched  118018 non-null  float64
 3   Platform          118018 non-null  object 
 4   PlayEventType     118018 non-null  object 
 5   VideoTitle        118018 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 5.4+ MB


In [0]:
# survey_analysis.columns is an attribute that returns the column names of the DataFrame as an Index object.
survey_analysis.columns

Index(['DateID', 'CustomerID', 'TotalTimeWatched', 'Platform', 'PlayEventType',
       'VideoTitle'],
      dtype='object')

In [0]:
# Check unique values in the 'PlayEventType' column using pandas (column names are case-sensitive)
survey_analysis['PlayEventType'].unique()

array(['LiveTV', 'Other', 'Catch Up', 'Downloads', nan], dtype=object)

In [0]:
# Check unique values in the 'Platform' column using pandas
survey_analysis['Platform'].unique()

array(['Leanback', 'iOS', 'Web', 'Android', nan], dtype=object)
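unique() lists the distinct values but not how often each occurs. A possible follow-up is value_counts(), which with dropna=False also counts the missing entries. The Series below is a made-up miniature of the Platform column, used only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for survey_analysis['Platform'].
platform = pd.Series(["Leanback", "iOS", "iOS", "Web", np.nan], name="Platform")

# Absolute frequency of each value, including NaN.
counts = platform.value_counts(dropna=False)

# Relative frequency (shares that sum to 1), including NaN.
shares = platform.value_counts(normalize=True, dropna=False)

print(counts)
print(shares)
```

The same pattern applies to PlayEventType or any other low-cardinality column.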

In [0]:
# Check unique values in the 'VideoTitle' column using pandas
survey_analysis['VideoTitle'].unique()

array(["F1 '20: Emilia Romagna GP", 'Chasing The Sun',
       'Sonic The Hedgehog', ..., 'Baby', 'Iron Man 3',
       'The Campaign Against The Climate'], dtype=object)

In [0]:
# Check unique values in the 'TotalTimeWatched' column using pandas
survey_analysis['TotalTimeWatched'].unique()

array([  300.,   360.,   120., ...,  4159.,  5194., 47660.])

In [0]:
# Smallest value in the 'TotalTimeWatched' column
survey_analysis['TotalTimeWatched'].min()

1.0

In [0]:
# Largest value in the 'TotalTimeWatched' column
survey_analysis['TotalTimeWatched'].max()

88500.0
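The minimum and maximum can also be requested in a single call with agg(). This is a minimal sketch on a made-up Series that reuses a few values from the outputs above.

```python
import pandas as pd

# Illustrative stand-in for survey_analysis['TotalTimeWatched'].
watched = pd.Series([300.0, 360.0, 120.0, 1.0, 88500.0], name="TotalTimeWatched")

# agg() with a list of function names returns a Series indexed by those names.
watch_range = watched.agg(["min", "max"])

print(watch_range)
```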

In [0]:

# Check for duplicate rows in the DataFrame using pandas
survey_analysis.duplicated().sum()


10372

In [0]:
# Check for missing values in the DataFrame using pandas

survey_analysis.isnull().sum()

DateID                0
CustomerID          516
TotalTimeWatched    516
Platform            516
PlayEventType       516
VideoTitle          516
dtype: int64
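Five columns each report exactly 516 missing values, which suggests (but does not prove) that the same rows are empty across all of them. A row-wise check along the following lines could confirm that. The sample frame is illustrative and only mimics the pattern.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the last row has every column except DateID missing.
sample = pd.DataFrame({
    "DateID": [20201101, 20201101, 20201101],
    "CustomerID": ["EW1DENH0EC1J3M9WAOZF9LSV004O", "6TS2LLY0L3G66FVY86Q0JEZE000K", np.nan],
    "TotalTimeWatched": [300.0, 360.0, np.nan],
    "Platform": ["Leanback", "Leanback", np.nan],
})

detail_cols = [c for c in sample.columns if c != "DateID"]

# True where every detail column is missing in the same row.
all_missing = sample[detail_cols].isnull().all(axis=1)

# True where at least one detail column is missing.
any_missing = sample[detail_cols].isnull().any(axis=1)

print(all_missing.sum(), any_missing.sum())

# If the two counts match, the affected rows carry no usable detail
# and can be filtered out without losing partial records.
cleaned = sample.loc[~all_missing]
```

If both counts equal 516 on the real data, the missing values are confined to 516 fully empty rows.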

In [0]:
# Drop duplicate rows in place, keeping the first occurrence of each
survey_analysis.drop_duplicates(inplace=True)
# Confirm that no duplicate rows remain (10372 were found above)
survey_analysis.duplicated().sum()

0
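An alternative to inplace=True is to assign the result to a new variable, which keeps the original frame available and makes it easy to report how many rows were removed. The sketch below uses a tiny made-up frame. The reset_index(drop=True) step is optional and only renumbers the rows so the index has no gaps.

```python
import pandas as pd

# Illustrative sample with one exact duplicate row.
sample = pd.DataFrame({
    "CustomerID": [
        "EW1DENH0EC1J3M9WAOZF9LSV004O",
        "EW1DENH0EC1J3M9WAOZF9LSV004O",
        "6TS2LLY0L3G66FVY86Q0JEZE000K",
    ],
    "TotalTimeWatched": [300.0, 300.0, 360.0],
})

rows_before = len(sample)

# drop_duplicates() returns a new DataFrame; the first occurrence is kept.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

rows_dropped = rows_before - len(deduplicated)
print(f"Dropped {rows_dropped} duplicate rows")
```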

In [0]:
# Show the built-in documentation for .loc, the label-based indexer
help(pd.Series.loc)

Help on property:

    Access a group of rows and columns by label(s) or a boolean array.

    ``.loc[]`` is primarily label based, but may also be used with a
    boolean array.

    Allowed inputs are:

    - A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
      interpreted as a *label* of the index, and **never** as an
      integer position along the index).
    - A list or array of labels, e.g. ``['a', 'b', 'c']``.
    - A slice object with labels, e.g. ``'a':'f'``.

          start and the stop are included

    - A boolean array of the same length as the axis being sliced,
      e.g. ``[True, False, True]``.
    - An alignable boolean Series. The index of the key will be aligned before
      masking.
    - An alignable Index. The Index of the returned selection will be the input.
    - A ``callable`` function with one argument (the calling Series or
      DataFrame) and that returns valid output for indexing (one of the above)

    See more at :ref:`Selection by Label