# <center> Cleaning youtube_comments.csv </center>

This notebook inspects the youtube_comments.csv file and cleans the dataset where necessary.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/youtube_comments.csv', encoding='ISO-8859-1')

In [3]:
df.head()

Unnamed: 0,videoId,commentId,comment,authorName,authorId,likeCount,date,totalReplyCount
0,lG4VkPoG3ko,Ugycx5y6oJzjxAt7AUB4AaABAg,I am a grad student in a mathematics-heavy fie...,Jamie,UCR9GVvuF4uEIw3EEIYZ4l_g,0.0,2021-02-17 03:46:35+00:00,0.0
1,lG4VkPoG3ko,UgwcSC6h7lidOETckkB4AaABAg,Iâm just here because I looked up my last na...,Christopher Bayes,UCVftL6EUiUndgwUVumjBfsA,0.0,2021-02-17 02:50:46+00:00,0.0
2,lG4VkPoG3ko,UgzclhYYZ1xzrNn2y754AaABAg,"Im searching for the Origin of a quote, and i ...",Moritz Roos,UCfj8n_LBuzPMPNdGmGeBjXw,0.0,2021-02-16 23:54:31+00:00,0.0
3,lG4VkPoG3ko,UgzvF8bgbgJxSRC7pAd4AaABAg,Why there is no exact formula for finding the ...,BIDISH DAS,UCLvYFzMHxPeDZNevr0eKvRQ,0.0,2021-02-15 21:14:54+00:00,0.0
4,lG4VkPoG3ko,UgyKJG7KQVbfAjA4ip14AaABAg,a nice and neat fishing web! (and a metaphor xD),Yu Gu,UC0MUvfY8QG02G5mNBNZyQVg,0.0,2021-02-15 12:24:37+00:00,0.0
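The comment in row 1 starts with "Iâm", which is the pattern that typically appears when UTF-8 bytes are decoded as ISO-8859-1. The sketch below assumes the file was originally written as UTF-8 (not confirmed here) and shows how such text could be repaired string by string. `repair_mojibake` is a hypothetical helper name, not a pandas function.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8 -> ISO-8859-1 mis-decoding; return the input unchanged if that fails."""
    try:
        # Recover the original bytes, then decode them with the assumed real encoding.
        return text.encode('ISO-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mis-decoded UTF-8, so leave it alone.
        return text


# Reproduce the damage on a known string, then repair it.
garbled = 'Évariste Galois'.encode('utf-8').decode('ISO-8859-1')
repaired = repair_mojibake(garbled)
```

If the assumption holds, the same helper could be applied to a text column with `df['comment'].map(repair_mojibake)`, or the file could be read with `encoding='utf-8'` in the first place.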


In [4]:
df.describe()

Unnamed: 0,likeCount,totalReplyCount
count,166971.0,166949.0
mean,13.404903,0.46967
std,200.606229,3.349839
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,26723.0,328.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166995 entries, 0 to 166994
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   videoId          166995 non-null  object 
 1   commentId        166993 non-null  object 
 2   comment          166992 non-null  object 
 3   authorName       166962 non-null  object 
 4   authorId         166971 non-null  object 
 5   likeCount        166971 non-null  float64
 6   date             166949 non-null  object 
 7   totalReplyCount  166949 non-null  float64
dtypes: float64(2), object(6)
memory usage: 10.2+ MB


In [6]:
df.isna().sum()

videoId             0
commentId           2
comment             3
authorName         33
authorId           24
likeCount          24
date               46
totalReplyCount    46
dtype: int64

The dataset has some missing values. Before looking into them, I checked the videoId field for invalid values. Some entries look like fragments of comments and do not belong in this field.

In [7]:
df['videoId'].unique()

array(['lG4VkPoG3ko', 'b3NxrZOu_CE', 'X8jsijhllIA', 'mH0oCDa74tE',
       ' WHO DEVELOPED GROUP THEORY BEING IN A PRISON AND DIED AT AGE OF JUST 20.',
       'wTJI_WuZSwE',
       ' and Amy Fowler? I thought they already tied the knot.',
       'QvuQH4_05LI', 'pq9LcwC7CoY', 'D__UaR5MQao', 'elQVZLLiod4',
       '4PDoT7jtxmw', 'cEvgcoyZvB4', 'IAEASE5GjdI', 'ZxYOEwM6Wbk',
       '5PcpBw5Hbwo', 'yBw67Fb31Cs', 'MHXO86wKeDY', 'ppWPuXsnf1Q',
       'privileged if you are learning Math for the first time with this cutiepie',
       'ZA4JkHKZM50', 'gxAaO2rsdIs', '8idr1WZ1A7Q', 'Kas0tIxDvrg',
       ' It takes about 13.0 doublings to get from 1 to 7993. It takes only about 12.6 doublings to get from 7993 to 50000000. 7993 is the current worldwide count of coronavirus deaths according the live update on Roylab Stats channel.',
       'U_85TaXbeIo', 'HZGCoVF3YvM', 'Agbh95KyWxY', 'EK32jo7i5LQ',
       'M64HUIJFTZM', '1Pivot.', 'v0YEaeIClKY', '#NAME?', 'r6sGWTCMz2k',
       ' has a similar explanati

Because of these strange-looking values, I opened the CSV file in Excel and manually removed the rows containing them. After reloading the file, the videoId field contains only the kind of values we expect.
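The manual Excel step could also be done in pandas, which keeps the cleaning reproducible. This is a minimal sketch, not the method used in this notebook. It assumes a valid video ID is exactly 11 characters drawn from letters, digits, `-` and `_`, which matches every valid ID shown above. `drop_invalid_video_ids` is a hypothetical helper name and the sample frame is toy data built from values seen in the output.

```python
import pandas as pd

# Assumption: a valid video ID is 11 characters from A-Z, a-z, 0-9, '-' and '_'.
VIDEO_ID_PATTERN = r'[A-Za-z0-9_-]{11}'


def drop_invalid_video_ids(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows whose videoId fully matches the assumed ID pattern."""
    is_valid = frame['videoId'].astype(str).str.fullmatch(VIDEO_ID_PATTERN)
    return frame[is_valid].reset_index(drop=True)


# Toy data: two valid IDs and three of the stray values seen in the output above.
sample = pd.DataFrame({
    'videoId': [
        'lG4VkPoG3ko',
        ' and Amy Fowler? I thought they already tied the knot.',
        '#NAME?',
        '1Pivot.',
        '-qgreAUpPwM',
    ],
})
cleaned = drop_invalid_video_ids(sample)
```

Note that this filter drops a row such as `#NAME?` outright. It cannot restore an ID that was overwritten, so rows like that would still need a manual look.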

In [11]:
df = pd.read_csv('data/youtube_comments.csv', encoding='ISO-8859-1')

In [12]:
df['videoId'].unique()

array(['lG4VkPoG3ko', 'b3NxrZOu_CE', 'X8jsijhllIA', 'mH0oCDa74tE',
       'wTJI_WuZSwE', 'QvuQH4_05LI', 'pq9LcwC7CoY', 'D__UaR5MQao',
       'elQVZLLiod4', '4PDoT7jtxmw', 'cEvgcoyZvB4', 'IAEASE5GjdI',
       'ZxYOEwM6Wbk', '5PcpBw5Hbwo', 'yBw67Fb31Cs', 'MHXO86wKeDY',
       'ppWPuXsnf1Q', 'ZA4JkHKZM50', 'gxAaO2rsdIs', '8idr1WZ1A7Q',
       'Kas0tIxDvrg', 'U_85TaXbeIo', 'HZGCoVF3YvM', 'Agbh95KyWxY',
       'EK32jo7i5LQ', 'M64HUIJFTZM', 'v0YEaeIClKY', '-qgreAUpPwM',
       'r6sGWTCMz2k', 'ToIXSwZ1pJU', 'ly4S0oi3Yz8', 'p_di4Zn4wz4',
       'jBsC34PxzoM', 'brU5yLm9DZM', 'jsYwFizhncE', 'HEfHFsfGXjs',
       'GNcFjFmqEc8', 'yuVqxCSsE7c', '_UoTTq651dE', 'zjMuIxRvygQ',
       'd4EgbgTm0Bg', 'Qe6o9j4IjTo', 'pQa_tWZmlGs', 'VcgJro0sTiM',
       'rB83DpBJQsE', 'CfW845LNObM', '8GPy_UMV-08', 'b7FxPsqfkOY',
       'bcPTiiiYDs8', 'd-o3eB9sfls', 'MBnnXbOM5S4', 'spUNpyF58BY',
       'VvCytJvd4H0', 'liL66CApESk', 'OkmNXy7er84', 'tIeHLnjs5U8',
       'Ilg3gGewQ5U', 'IHZwWFHWa-w', 'aircAruvnKk', 'MzRCDLre1

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166971 entries, 0 to 166970
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   videoId          166971 non-null  object 
 1   commentId        166971 non-null  object 
 2   comment          166970 non-null  object 
 3   authorName       166940 non-null  object 
 4   authorId         166949 non-null  object 
 5   likeCount        166949 non-null  float64
 6   date             166949 non-null  object 
 7   totalReplyCount  166949 non-null  float64
dtypes: float64(2), object(6)
memory usage: 10.2+ MB


In [13]:
df.isna().sum()

videoId             0
commentId           0
comment             1
authorName         31
authorId           22
likeCount          22
date               22
totalReplyCount    22
dtype: int64

The dataset still has some missing values. Looking more closely at the rows that are missing authorName, the ones shown below are also missing authorId, likeCount, date, and totalReplyCount.

We will use df.dropna() to drop every row that contains at least one missing value.

In [17]:
df[df['authorName'].isna()].head()

Unnamed: 0,videoId,commentId,comment,authorName,authorId,likeCount,date,totalReplyCount
4524,mH0oCDa74tE,UgyjI0Kyz8RtVxleFdJ4AaABAg,YOU SHOULD HAVE MENTIONED Ãvariste Galois,,,,,
6452,wTJI_WuZSwE,UgxFYCDlW952RPcytV54AaABAg,Now what kind of wedding do you go to and leav...,,,,,
16503,ppWPuXsnf1Q,UgyLDR00KL_t_e7SV7F4AaABAg,Oh my god You are so,,,,,
25856,Kas0tIxDvrg,UgyjAiZ_E_U7mRRrpCJ4AaABAg,7993 is logarithmically closer to fifty millio...,,,,,
35648,M64HUIJFTZM,UgwB5utB6kn4Xvp4JbF4AaABAg,Welcome to 3Blue1Brown,,,,,
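The head() output only shows five rows, and the counts from df.isna().sum() differ (31 missing authorName versus 22 for the other fields), so not every row without an authorName can be empty across the board. A small sketch of how the overlap could be counted is below. The frame is made-up toy data and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is missing everything, row 1 is missing only authorName.
sample = pd.DataFrame({
    'comment': ['first', 'second', 'third'],
    'authorName': [np.nan, np.nan, 'someone'],
    'authorId': [np.nan, 'id-2', 'id-3'],
    'likeCount': [np.nan, 0.0, 1.0],
    'date': [np.nan, '2021-02-15', '2021-02-16'],
    'totalReplyCount': [np.nan, 0.0, 0.0],
})

other_fields = ['authorId', 'likeCount', 'date', 'totalReplyCount']

# Rows with no authorName, and which of those lack all the other fields too.
missing_author = sample[sample['authorName'].isna()]
fully_empty = missing_author[other_fields].isna().all(axis=1)

n_missing_author = len(missing_author)
n_fully_empty = int(fully_empty.sum())
```

Running the same two counts on df would show how many rows are dropped only because of a missing authorName.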


In [22]:
df = df.dropna()

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 166939 entries, 0 to 166970
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   videoId          166939 non-null  object 
 1   commentId        166939 non-null  object 
 2   comment          166939 non-null  object 
 3   authorName       166939 non-null  object 
 4   authorId         166939 non-null  object 
 5   likeCount        166939 non-null  float64
 6   date             166939 non-null  object 
 7   totalReplyCount  166939 non-null  float64
dtypes: float64(2), object(6)
memory usage: 11.5+ MB


In [24]:
df.isna().sum()

videoId            0
commentId          0
comment            0
authorName         0
authorId           0
likeCount          0
date               0
totalReplyCount    0
dtype: int64

No missing values remain in the dataset, so we can save it back to the existing youtube_comments.csv file.

In [25]:
df.to_csv('data/youtube_comments.csv', index=False, encoding='utf-8')
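Since this step overwrites the original file, it may be worth checking that the data survives a write and re-read. The sketch below does this on toy data in a temporary directory. The file name `youtube_comments_clean.csv` and the sample rows are assumptions for illustration, not part of this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data with a comma, quotes and a non-ASCII character in the comment text.
cleaned = pd.DataFrame({
    'videoId': ['lG4VkPoG3ko', 'mH0oCDa74tE'],
    'comment': ['a nice and neat fishing web!', 'Évariste Galois, "quoted", with a comma'],
    'likeCount': [0.0, 3.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write to a scratch file rather than the real dataset.
    out_path = Path(tmp_dir) / 'youtube_comments_clean.csv'
    cleaned.to_csv(out_path, index=False, encoding='utf-8')
    reloaded = pd.read_csv(out_path, encoding='utf-8')

# Compare shape and text content before and after the round trip.
same_shape = reloaded.shape == cleaned.shape
same_comments = reloaded['comment'].tolist() == cleaned['comment'].tolist()
```

Writing to a separate file first and comparing it in this way leaves the original CSV untouched until the cleaned version has been checked.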