# Data cleaning

By the end of this week's tutorials, you should have acquired:

**Knowledge on:**
* Inspection of dataframes
* Identification and handling of missing values
* Usage of functions to clean columns
* Merging dataframes

**Skills:**
* .describe(), .dtypes
* .isna().sum(), .fillna(), .dropna()
* .apply() and selection of functions based on existing list
* .merge()

In [43]:
import pandas as pd

## Loading data

Instead of using Twitter data for this tutorial, I decided to use data from YouTube. I collected it using the [YouTube Data Tools](https://tools.digitalmethods.net/netvizz/youtube/), also created by the Digital Methods Initiative from the UvA.

The data I have is from a video search using the keyword "climate change".

It always helps if we note down what we want to do with the data *before* we start. We can always refine these objectives later.

### What do I want to achieve with this analysis?

Ideally I have a few research questions here. While we won't be able to do the visualisations (which we will learn in week 4) or the statistical testing (weeks 5 and 6) yet, I am noting down some research questions that may be interesting.

* RQ1. To what extent does the sentiment expressed in the title of the video influence user engagement (views, likes and dislikes)?
* RQ2. To what extent does the sentiment expressed in the title of the video vary depending on the category in which the video is published?
* RQ3. To what extent does the sentiment expressed in the title of the video vary depending on whether Greta Thunberg is mentioned?

**Important:** I only want to do this for videos published in 2019 and 2020. 

These are examples - which are probably not very sophisticated yet as a business challenge - but they imply that we need a few things:
1. We need a sentiment analysis performed on the titles of the videos so we have the **sentiment** variable(s)
2. I need to make sure I have the user engagement variables (**likes**, **dislikes**, and **views**)
3. I need to have a variable for the **category** of the video
4. I need to know **when the video was published**, and remove old videos
5. I need to know if Greta is mentioned in the title

With this noted down, I can start loading, inspecting and cleaning the data.

In [44]:
videos = pd.read_csv('videolist_search500_2020_01_25-12_34_16.tab')

ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6


Because this is a tab-delimited file (i.e., the separators are tabs, not commas), I need to specify this in the read_csv command.

In [None]:
videos = pd.read_csv('videolist_search500_2020_01_25-12_34_16.tab', sep='\t')
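To see why the separator matters, here is a minimal sketch with made-up data held in memory (the rows and column names are assumptions, not the real file). A title that contains commas makes the default comma parser see extra fields, which is the kind of mismatch reported in the error above.

```python
import io
import pandas as pd

# Made-up tab-delimited content; the second title contains a comma.
raw = (
    "videoId\tvideoTitle\tviewCount\n"
    "a1\tClimate change explained\t100\n"
    "b2\tFires, floods and droughts\t250\n"
)

# Default separator (comma): the field count is expected to be inconsistent.
try:
    pd.read_csv(io.StringIO(raw))
    default_read_failed = False
except pd.errors.ParserError:
    default_read_failed = True

# Tab separator: three columns, two rows.
toy = pd.read_csv(io.StringIO(raw), sep="\t")
print(default_read_failed)
print(toy.shape)
```

With the tab separator the commas inside the titles are treated as ordinary text.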

In [None]:
len(videos)

In [None]:
videos.head()

In [None]:
videos.columns

In [None]:
videos.describe()

In [None]:
videos.isna().sum()

*Some preliminary findings:*
1. It seems I have the engagement variables I need, but likeCount and dislikeCount seem to have missing values
2. I need to run sentiment analysis on videoTitle, but I don't seem to have a variable for language (so I cannot be sure if I just have titles in English)
3. The videoCategoryLabel column seems to be a starting point for the category variable.
4. The publishedAt column can probably help me filter videos from 2019 and 2020.

All of this still needs to be confirmed though...
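One possible way to get the data type and the number of missing values side by side is to combine the results of the inspection commands in a small overview table. The sketch below uses a made-up dataframe; the column names only mimic the real ones.

```python
import pandas as pd

# Made-up data standing in for the videos dataframe.
toy = pd.DataFrame({
    "videoTitle": ["a", "b", "c"],
    "viewCount": [10, 20, 30],
    "likeCount": [1.0, None, 3.0],
})

# One row per column: its dtype, how many values are missing, how many are distinct.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(overview)
```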

# Data cleaning

Now let's start preparing the data. The steps always depend on the dataset, but at a minimum we need to make sure that we:
1. Handle the missing values for relevant variables
2. Check if the variables are stored in the correct format/type
3. Create the variables we need (that may be not in the data yet)

### Missing values

In [None]:
videos.isna().sum()

Of the variables I am interested in, likeCount and dislikeCount seem to be the ones with an issue. Let me check what's happening with them.

One possibility would be that 0's are not included (i.e., if the video does not have a like, it will not appear). Let's see if that's the case...

In [4]:
videos[['likeCount', 'dislikeCount']].describe()

The minimum value is 0, so probably something else is going on. Let's see if this is related to channels (e.g., some channels not allowing users to like videos, perhaps?).
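A more direct check is to count the zeros and the missing values separately. This is a minimal sketch on a made-up series, not on the real data:

```python
import numpy as np
import pandas as pd

# Made-up like counts: two explicit zeros and two missing values.
likes = pd.Series([0, 12, np.nan, 340, np.nan, 0], name="likeCount")

n_zero = int((likes == 0).sum())     # zeros that were actually recorded
n_missing = int(likes.isna().sum())  # values that are absent
print(n_zero, n_missing, likes.min())
```

If zeros are recorded as zeros, the missing values have to mean something else.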

In [5]:
# Count, per channel, the videos whose likeCount is missing
videos[videos['likeCount'].isna()]['channelTitle'].value_counts()

OK, so here's the list of channels that have missing likes. Now let's see if they appear in a list of channels that have likes.

First, let's make a list of all channels that do have likes and call it channels_with_likes.

In [6]:
channels_with_likes = videos[videos['likeCount'].notna()]['channelTitle'].unique().tolist()

In [7]:
channels_with_likes

Using the ```in``` operator, we can check whether an element is present in a list.
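Here is a small sketch of the idea with made-up channel names (A, B and C are assumptions). It also shows one way to check every channel with missing likes in one go, using `.isin()` instead of testing the names one by one.

```python
import pandas as pd

# Made-up data: channel B has one video with likes and one without.
toy = pd.DataFrame({
    "channelTitle": ["A", "B", "B", "C"],
    "likeCount": [10.0, 5.0, None, None],
})

channels_with_likes = toy.loc[toy["likeCount"].notna(), "channelTitle"].unique().tolist()
b_has_likes = "B" in channels_with_likes  # single membership check
c_has_likes = "C" in channels_with_likes

# All channels with a missing value, checked against the list at once.
missing_channels = toy.loc[toy["likeCount"].isna(), "channelTitle"].drop_duplicates()
also_has_likes = dict(zip(missing_channels, missing_channels.isin(channels_with_likes).tolist()))
print(b_has_likes, c_has_likes, also_has_likes)
```

A channel that appears in both groups, like B here, suggests the missing values are not simply a channel-wide setting.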

In [8]:
'Intergovernmental Panel on Climate Change (IPCC)' in channels_with_likes

In [9]:
'The Lancet' in channels_with_likes

This seems to be the case. But we cannot be sure, so let's see what one of these video pages looks like.

In [10]:
videos[videos['likeCount'].isna()]['videoId']

Let me add the YouTube URL to some of these IDs and see what's going on there:

* https://www.youtube.com/watch?v=9Nw5zhsSgHQ
* https://www.youtube.com/watch?v=viZnvRnEvEc
* https://www.youtube.com/watch?v=8_69vy7ZBxE
* https://www.youtube.com/watch?v=NYstLwqtPlI


It seems that for most of them - but not all of them - the comments are turned off. But this does not necessarily mean that a missing value always stands for a specific number (e.g., a zero).

Ultimately, I generally have two options with missing values:
* I can replace them with another value (e.g., 0)
* I can drop them from the dataset

If I wanted to replace them with another value, I would use the following command:

In [11]:
videos['dislikeCount_no_na'] = videos['dislikeCount'].fillna(0)

In [12]:
videos.isna().sum()

But that doesn't seem appropriate here, as the number of likes for these videos is **not reported** - sometimes because the channel does not allow comments, sometimes for other reasons. So it is a limitation (that we need to acknowledge in our reporting), but dropping these rows is most likely the better option.
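To make the difference between the two options concrete, here is a minimal sketch on made-up numbers. Filling with 0 treats an unknown count as "no likes" and pulls the average down, while dropping leaves the known values untouched.

```python
import numpy as np
import pandas as pd

# Made-up data: two known like counts and two unknown ones.
toy = pd.DataFrame({
    "videoId": ["a", "b", "c", "d"],
    "likeCount": [100.0, np.nan, 300.0, np.nan],
})

mean_filled = toy["likeCount"].fillna(0).mean()                      # unknown treated as zero
mean_dropped = toy.dropna(subset=["likeCount"])["likeCount"].mean()  # unknown left out
print(mean_filled, mean_dropped)
```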

To drop the rows with missing likes or dislikes, I can run the following command:

In [13]:
videos = videos.dropna(subset=['likeCount', 'dislikeCount'])

In [14]:
videos.isna().sum()

In [15]:
len(videos)

### Checking the data types

It is also important to check if we have the data stored in the right format. Let's inspect it:

In [16]:
videos.dtypes

From my key variables so far (likeCount, dislikeCount, videoTitle, videoCategoryLabel), all looks OK. The numeric variables are in numeric form (int or float), and the text variables are in object form.

But the date variable (publishedAt) is stored as an object... and it should be a date.

In [17]:
videos['publishedAt'].head()

Yes, it looks like a date, but it is stored as an object. This is a problem, because I cannot filter the dataset by date.

In [18]:
videos['publishedAt'] = videos['publishedAt'].apply(pd.to_datetime)

In [19]:
videos['publishedAt'].head()

In [20]:
videos.dtypes

Now we can, for example, check the videos published in 2019 and 2020 (i.e., from 1 January 2019 onwards):

In [21]:
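# '2018-12-31' is read as midnight at the start of that day, so use >= '2019-01-01' to exclude all of 2018
videos[videos['publishedAt'] >= '2019-01-01']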

Or I can get the oldest and the latest date for the videos:

In [22]:
videos['publishedAt'].min()

In [23]:
videos['publishedAt'].max()

In [24]:
len(videos)

In [25]:
len(videos[videos['publishedAt'] >= '2019-01-01'])

In [26]:
videos = videos[videos['publishedAt'] >= '2019-01-01']

### Important!

Above we have used the ```.apply``` method to run a function in that column. Curious about other things you can do with it? Check out the notebook "UsefulFunctions" in the "UsefulScripts" folder.
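For this particular conversion, `pd.to_datetime` can also take the whole column at once, which avoids calling the function once per row. The sketch below compares both routes on made-up timestamps; the ISO format with a trailing "Z" is an assumption about how the dates are stored.

```python
import pandas as pd

# Made-up timestamps; the exact format of the real publishedAt column is an assumption.
raw = pd.Series(["2018-12-31T23:59:59Z", "2019-01-01T00:00:00Z", "2020-01-24T04:15:28Z"])

via_apply = raw.apply(pd.to_datetime)       # one call per value
vectorized = pd.to_datetime(raw, utc=True)  # one call for the whole column

same_values = bool((via_apply == vectorized).all())
years = vectorized.dt.year.tolist()
print(vectorized.dtype, same_values, years)
```

Once the column is a datetime, the `.dt` accessor gives access to parts of the date, so a filter such as `vectorized.dt.year >= 2019` is another way to select the years of interest.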

OK, just to recap, this is the status of the variables that we need:
* Engagement: likeCount and dislikeCount are in the right type (int or float) and we fixed the missing values
* Sentiment: the videoTitle column is in the right type (object), but we don't have sentiment yet
* Category: we have the videoCategoryLabel, but we're not sure if it is really that informative yet
* PublishedAt: we corrected the data type, and managed to slice the dataframe correctly. Yes!
* Greta: we still need to check if she is mentioned in the title...


## Requesting sentiment analysis

As I don't have the language of the videos, I am going to use a Python module to automatically detect the language. It's available in the ```UsefulScripts``` folder, in the ```AdvancedModules``` notebook. If you want to use it, make sure to read the details in that notebook, as you may need to install a few things!

In [27]:
from langdetect import detect

In [28]:
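def apply_langdetect(text):
    # Convert to string first, so non-string values (e.g., NaN) do not break detect()
    text = str(text)
    try:
        lang = detect(text)
    except Exception:
        # detect() raises an error when it finds nothing to analyse (e.g., an empty title)
        lang = 'error'

    return lang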
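The wrapper pattern can be tried without installing langdetect. In this sketch, `toy_detect` is a hypothetical stand-in (not part of langdetect) that raises an error on empty text, which is the kind of failure the try/except is there to absorb.

```python
import pandas as pd

def toy_detect(text):
    """Hypothetical stand-in for a language detector; not langdetect itself."""
    if not text.strip():
        raise ValueError("no text to analyse")
    return "en"

def safe_detect(text):
    try:
        return toy_detect(str(text))
    except Exception:
        return "error"  # flag the row instead of stopping the whole .apply()

titles = pd.Series(["Climate change explained", "   ", None])
langs = titles.apply(safe_detect).tolist()
print(langs)
```

Note that the missing title is converted to a string before detection, so it is not flagged as an error. It may be worth checking for missing titles separately.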

In [29]:
videos['lang_title'] = videos['videoTitle'].apply(apply_langdetect)

In [30]:
videos['lang_title'].value_counts()

In [31]:
videos[videos['lang_title']!='en'][['lang_title', 'videoTitle']]

The language detection module doesn't seem to be working all the time. But it did identify a video with a Russian title. So we can safely remove that video, and make a judgement call of what to do with the rest.

In [32]:
videos = videos[videos['lang_title']!= 'ru']

In [33]:
len(videos)

Now I can export it for sentiment analysis. When doing so, I also need to change the column name from videoTitle to text; otherwise the sentiment analysis script won't know what to do.

In [34]:
sent_export = videos[['videoId', 'videoTitle']].rename(columns={'videoTitle': 'text'})

In [35]:
sent_export.head()

In [36]:
sent_export.to_pickle('TheoAraujo_YouTubeClimateChange_EN.pkl')

Now I would usually upload this file to SurfDrive (see links on the homepage of the General Repository) and wait one or two workdays until the analysis is complete.

When it is complete, I will find the file in the ```SentimentAnalysisResults``` folder, also on SurfDrive. I have to download it and add it to the folder I am working in now.

In [37]:
sentiment = pd.read_pickle('TheoAraujo_YouTubeClimateChange_EN_completed.pkl')

In [38]:
sentiment.head()

Unnamed: 0,videoId,text,negative,positive,neutral
0,sGHq_EwXDn8,Australia’s Policies Going in Wrong Direction ...,-1,1,0
1,PRtn1W2RAVU,Nigel Farage compares President Trump and Prin...,-1,1,0
2,2CQvBGSiDvw,Climate change in the 2020s: What impacts to e...,-1,1,0
3,Cbwv1jg4gZU,Solution To Climate Change Is To Make It Profi...,-1,1,0
4,cWsCX_yxXqw,🌤 Climate Change from the Economic Point of View,-1,1,0


In [39]:
sentiment.dtypes

videoId     object
text        object
negative    object
positive    object
neutral     object
dtype: object

In [40]:
sentiment.isna().sum()

videoId     0
text        0
negative    0
positive    0
neutral     0
dtype: int64

### Important tip!

It seems I need to clean this variable (sentiment seems to be stored as an object). I will not show this in the video as this is part of the weekly challenge :-) 

*But I still need to merge the dataframes...*

## Merging the dataframes

Merging dataframes is an operation to bring the data of one dataframe into the other (or rather create a new dataframe). This is covered extensively in a video in the ```FAQs``` folder, and we will review it in class in a bit more detail (in DA3 and DA4).

Basically, we need to use the command ```.merge```. A few important tips:
* Make sure that you have one unique identifier (column) that is available in both dataframes
* Make sure that the unique identifier column has the same name in both dataframes
* Make sure that the unique identifier column is of the same data type in both dataframes
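The three tips above can be turned into quick checks before merging. This is a minimal sketch on two made-up dataframes (the names `left` and `right` and their contents are assumptions):

```python
import pandas as pd

# Made-up dataframes that share a videoId column.
left = pd.DataFrame({"videoId": ["a", "b", "c"], "viewCount": [10, 20, 30]})
right = pd.DataFrame({"videoId": ["a", "b"], "positive": ["1", "2"]})

key = "videoId"
key_in_both = key in left.columns and key in right.columns  # same name in both
same_dtype = left[key].dtype == right[key].dtype            # same data type
unique_left = left[key].is_unique                           # no duplicated identifiers
unique_right = right[key].is_unique
print(key_in_both, same_dtype, unique_left, unique_right)
```

If any of these come back False, it is usually easier to fix the column first than to debug the merged result.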

In our case, the unique identifier is videoId. Let's check the items above step by step:

In [41]:
videos.columns

In [42]:
sentiment.columns

Index(['videoId', 'text', 'negative', 'positive', 'neutral'], dtype='object')

In [50]:
videos.dtypes

position                            int64
channelId                          object
channelTitle                       object
videoId                            object
publishedAt           datetime64[ns, UTC]
publishedAtSQL                     object
videoTitle                         object
videoDescription                   object
videoCategoryId                     int64
videoCategoryLabel                 object
duration                           object
durationSec                         int64
dimension                          object
definition                         object
caption                              bool
thumbnail_maxres                   object
licensedContent                   float64
viewCount                           int64
likeCount                         float64
dislikeCount                      float64
favoriteCount                       int64
commentCount                      float64
lang_title                         object
dtype: object

In [51]:
sentiment.dtypes

videoId     object
text        object
negative    object
positive    object
neutral     object
dtype: object

OK, the column is available in both, and seems to be of the same data type. So I can merge.

In [52]:
len(videos)

245

In [53]:
len(sentiment)

244

In [54]:
videos.merge(sentiment, on='videoId')

Unnamed: 0,position,channelId,channelTitle,videoId,publishedAt,publishedAtSQL,videoTitle,videoDescription,videoCategoryId,videoCategoryLabel,...,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,lang_title,text,negative,positive,neutral
0,1,UCIALMKvObZNtJ6AmdCLP7Lg,Bloomberg Markets and Finance,sGHq_EwXDn8,2020-01-24 04:15:28+00:00,2020-01-24 04:15:28,Australia’s Policies Going in Wrong Direction ...,"Jan.23 -- Michael Mann, distinguished professo...",25,News & Politics,...,2017,31.0,24.0,0,57.0,en,Australia’s Policies Going in Wrong Direction ...,-1,1,0
1,2,UCb1Ti1WKPauPpXkYKVHNpsw,LBC,PRtn1W2RAVU,2020-01-23 10:32:38+00:00,2020-01-23 10:32:38,Nigel Farage compares President Trump and Prin...,This is Nigel Farage's reaction to President T...,25,News & Politics,...,65633,1637.0,100.0,0,1093.0,en,Nigel Farage compares President Trump and Prin...,-1,1,0
2,3,UC-SJ6nODDmufqBzPBwCvYvQ,CBS This Morning,2CQvBGSiDvw,2019-12-23 13:38:55+00:00,2019-12-23 13:38:55,Climate change in the 2020s: What impacts to e...,"In our series The 2020's, we're exploring the ...",25,News & Politics,...,34455,646.0,97.0,0,618.0,en,Climate change in the 2020s: What impacts to e...,-1,1,0
3,4,UCcyq283he07B7_KUX07mmtA,Business Insider,Cbwv1jg4gZU,2020-01-22 22:28:34+00:00,2020-01-22 22:28:34,Solution To Climate Change Is To Make It Profi...,Environmental problems rose to the top of the ...,25,News & Politics,...,24345,871.0,54.0,0,166.0,en,Solution To Climate Change Is To Make It Profi...,-1,1,0
4,5,UCWafKqurzE49MzZ6eHFwXvQ,EconClips,cWsCX_yxXqw,2020-01-22 20:01:12+00:00,2020-01-22 20:01:12,🌤 Climate Change from the Economic Point of View,Climate change from the economic point of view...,27,Education,...,2085,129.0,23.0,0,74.0,en,🌤 Climate Change from the Economic Point of View,-1,1,0
5,6,UCIALMKvObZNtJ6AmdCLP7Lg,Bloomberg Markets and Finance,pbiSuB3mzmo,2020-01-22 10:24:55+00:00,2020-01-22 10:24:55,What Davos Attendees Are Saying on Climate Change,Jan.22 -- Climate change and other environment...,25,News & Politics,...,1053,22.0,12.0,0,8.0,en,What Davos Attendees Are Saying on Climate Change,-1,1,0
6,7,UCO0akufu9MOzyz3nvGIXAAw,Sky News Australia,6d9ENk3NfBM,2020-01-23 11:39:09+00:00,2020-01-23 11:39:09,Al Gore's 'climate change hypocrisy' is 'nuts',Sky News host Chris Kenny says Al Gore and Pri...,25,News & Politics,...,50699,2614.0,97.0,0,1979.0,en,Al Gore's 'climate change hypocrisy' is 'nuts',-2,1,-1
7,8,UCo7a6riBFJ3tkeHjvkXPn1g,CNBC International,hzrFtZc9EkQ,2020-01-22 22:00:15+00:00,2020-01-22 22:00:15,Trump vs Greta: How climate change took over D...,U.S. President Donald Trump and climate activi...,25,News & Politics,...,48684,954.0,637.0,0,925.0,en,Trump vs Greta: How climate change took over D...,-1,1,0
8,9,UCFQgi22Ht00CpaOQLtvZx2A,ITV News,Je2l7Gw7uns,2020-01-21 23:07:40+00:00,2020-01-21 23:07:40,Trump v Thunberg as two deliver contrasting cl...,Climate activist Greta Thunberg has told the D...,25,News & Politics,...,9743,106.0,107.0,0,,en,Trump v Thunberg as two deliver contrasting cl...,-1,1,0
9,10,UCqOoboPm3uhY_YXhvhmL-WA,Discovery,8Rvl6z80baI,2020-01-10 19:36:00+00:00,2020-01-10 19:36:00,NASA's Research on Climate Change | Above and ...,ABOVE AND BEYOND examines the role NASA plays ...,24,Entertainment,...,32157,1122.0,139.0,0,659.0,en,NASA's Research on Climate Change | Above and ...,-1,1,0


In [55]:
len(videos.merge(sentiment, on='videoId'))

244

In [56]:
videos_sent = videos.merge(sentiment, on='videoId')
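The merged dataframe has 244 rows, one fewer than videos, because ```.merge``` keeps only the rows whose identifier appears in both dataframes by default (an inner join). To find out which rows are left out, one option is a left join with an indicator column. The sketch below uses made-up dataframes, so the names and values are assumptions:

```python
import pandas as pd

# Made-up data: video "c" has no sentiment result.
videos_toy = pd.DataFrame({"videoId": ["a", "b", "c"], "viewCount": [10, 20, 30]})
sentiment_toy = pd.DataFrame({"videoId": ["a", "b"], "positive": [1, 2]})

inner = videos_toy.merge(sentiment_toy, on="videoId")  # default: how="inner"

# Keep every video, mark where each row came from, and fail on duplicated identifiers.
checked = videos_toy.merge(sentiment_toy, on="videoId", how="left",
                           indicator=True, validate="one_to_one")
unmatched = checked.loc[checked["_merge"] == "left_only", "videoId"].tolist()
print(len(inner), len(checked), unmatched)
```

The `_merge` column shows whether a row was found in both dataframes or only in the left one, and `validate` raises an error if an identifier is duplicated on either side.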

We are almost there. Let's recap where we are:
* Engagement (OK): likeCount and dislikeCount are in the right type (int or float) and we fixed the missing values
* Sentiment (OK-ish): we have the sentiment analysis results for the title, but they are in the wrong dtype. We won't fix it now - as you need to do it for the weekly challenge ;)
* Category: we have the videoCategoryLabel, but we're not sure if it is really that informative yet
* PublishedAt (OK): we corrected the data type, and managed to slice the dataframe correctly. Yes!
* Greta: we still need to check if she is mentioned in the title...

So we just need to work on the category and the mentions of Greta now.


## Video Categories

In [57]:
videos_sent['videoCategoryLabel'].value_counts()

News & Politics          148
Education                 26
Science & Technology      21
Entertainment             14
Nonprofits & Activism     12
Comedy                    10
People & Blogs             9
Autos & Vehicles           1
Film & Animation           1
Travel & Events            1
Music                      1
Name: videoCategoryLabel, dtype: int64

We have way too many categories here to make informative comparisons, and some of them are very small (1 video), while others have a lot of videos. While that will almost always be the case for digital trace data, we can at least recategorize this a bit...

In [58]:
def recategorize(category):
    if category == 'News & Politics':
        return category
    if category == 'Education':
        return 'Education, Science and Technology'
    if category == 'Science & Technology':
        return 'Education, Science and Technology'
    if category == 'Nonprofits & Activism':
        return category
    else:
        return 'Other'

In [59]:
videos_sent['category'] = videos_sent['videoCategoryLabel'].apply(recategorize)

In [60]:
videos_sent['category'].value_counts()

News & Politics                      148
Other                                 37
Education, Science and Technology     26
Science & Technology                  21
Nonprofits & Activism                 12
Name: category, dtype: int64

This recategorization is not ideal, but at least we have five large(r) categories. One could argue that Nonprofits & Activism is still too small. But for now we'll keep it as is.
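An alternative to a chain of if statements is a dictionary lookup with ```.map```, followed by a cross-tabulation of the old and new labels to confirm that every label ended up where intended. This is a sketch on made-up labels; `CATEGORY_MAP` is a hypothetical name that mirrors the groups chosen above.

```python
import pandas as pd

# Hypothetical mapping; labels not listed here fall into "Other".
CATEGORY_MAP = {
    "News & Politics": "News & Politics",
    "Education": "Education, Science and Technology",
    "Science & Technology": "Education, Science and Technology",
    "Nonprofits & Activism": "Nonprofits & Activism",
}

labels = pd.Series(["News & Politics", "Education", "Science & Technology",
                    "Comedy", "Music", "Nonprofits & Activism"])
category = labels.map(CATEGORY_MAP).fillna("Other")

# Rows are the original labels, columns the new ones.
check = pd.crosstab(labels, category)
print(check)
```

If an original label still shows up as a column of its own in the cross-tabulation, it was not recoded as intended.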

## Mentions of Greta Thunberg

I'll also use a function available in the ```UsefulFunctions``` notebook inside the ```UsefulScripts``` folder.

In [61]:
def wordlist_any_present(text, query):
    import re
    text = str(text).lower()
    newquery = []
    for word in query:
        newquery.append(str(word).lower())
    tokens = re.findall(r"[\w']+|[.,!?;$@#]", text)
    
    for word in newquery:
        if word in tokens:
            return 1
    return 0

In [62]:
videos_sent['Greta'] = videos_sent['videoTitle'].apply(wordlist_any_present, args=(['Greta', 'Thunberg'],)) 

In [63]:
videos_sent.groupby(['Greta'])['videoTitle'].value_counts()

Greta  videoTitle                                                                                          
0      'Alarmist rhetoric' on climate change not borne out by figures: Coalition backbencher                   1
       'Climate warriors' use bushfires to push climate change inaction agenda: Credlin                        1
       11,000 scientists sign declaration of climate emergency                                                 1
       A Climate Change Sceptic Denies Global Warming Caused the Australian Fires | Good Morning Britain       1
       A climate change solution that's right under our feet | Asmeret Asefaw Berhe:                           1
       Al Gore Calls Climate Change the Most Serious Challenge Humanity Has Ever Faced                         1
       Al Gore's 'climate change hypocrisy' is 'nuts'                                                          1
       Amazon rainforest fires could devastate the fight against climate change                      

In [64]:
videos_sent['Greta'].value_counts()

0    217
1     27
Name: Greta, dtype: int64
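To see what this token-based match does and does not catch, here is a sketch on made-up titles. The function is a condensed copy of the one above; the regular-expression variant with ```.str.contains``` is shown only as a possible alternative, not as the method used in this tutorial.

```python
import re
import pandas as pd

def wordlist_any_present(text, query):
    # Same logic as the function above, condensed.
    tokens = re.findall(r"[\w']+|[.,!?;$@#]", str(text).lower())
    return int(any(str(word).lower() in tokens for word in query))

# Made-up titles.
titles = pd.Series([
    "Greta Thunberg speaks at Davos",
    "Thunberg's speech in full",
    "Gretal the climate robot",
    "GRETA at the UN",
])

by_token = titles.apply(wordlist_any_present, args=(["Greta", "Thunberg"],)).tolist()
by_regex = titles.str.contains(r"\b(?:greta|thunberg)\b", case=False, regex=True).tolist()
print(by_token)
print(by_regex)
```

Because the tokenizer keeps apostrophes inside a word, a possessive such as "Thunberg's" is not matched by the token-based function, so the count of mentions may be slightly too low. Both approaches ignore case and avoid matching longer words that merely start with "Greta".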

Let's recap where we are:
* Engagement (OK)
* Sentiment (OK-ish): we have the sentiment analysis results for the title, but they are in the wrong dtype. We won't fix it now - as you need to do it for the weekly challenge ;)
* Category (OK)
* PublishedAt (OK)
* Greta (OK)

Great! Now let's just confirm that the dataframe looks OK.


In [65]:
videos_sent.columns

Index(['position', 'channelId', 'channelTitle', 'videoId', 'publishedAt',
       'publishedAtSQL', 'videoTitle', 'videoDescription', 'videoCategoryId',
       'videoCategoryLabel', 'duration', 'durationSec', 'dimension',
       'definition', 'caption', 'thumbnail_maxres', 'licensedContent',
       'viewCount', 'likeCount', 'dislikeCount', 'favoriteCount',
       'commentCount', 'lang_title', 'text', 'negative', 'positive', 'neutral',
       'category', 'Greta'],
      dtype='object')

In [66]:
videos_sent.isna().sum()

position               0
channelId              0
channelTitle           0
videoId                0
publishedAt            0
publishedAtSQL         0
videoTitle             0
videoDescription       1
videoCategoryId        0
videoCategoryLabel     0
duration               0
durationSec            0
dimension              0
definition             0
caption                0
thumbnail_maxres      39
licensedContent       52
viewCount              0
likeCount              0
dislikeCount           0
favoriteCount          0
commentCount          18
lang_title             0
text                   0
negative               0
positive               0
neutral                0
category               0
Greta                  0
dtype: int64

In [67]:
videos_sent.describe()

Unnamed: 0,position,videoCategoryId,durationSec,licensedContent,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,Greta
count,244.0,244.0,244.0,192.0,244.0,244.0,244.0,244.0,226.0,244.0
mean,249.336066,25.139344,670.696721,1.0,208845.1,5175.627049,1481.770492,0.0,1895.00885,0.110656
std,156.023983,2.855406,710.273139,0.0,584693.0,13847.417301,8989.7186,0.0,6754.155581,0.31435
min,1.0,1.0,25.0,1.0,19.0,0.0,0.0,0.0,0.0,0.0
25%,115.5,25.0,249.0,1.0,5421.25,118.0,22.75,0.0,71.0,0.0
50%,255.0,25.0,405.5,1.0,32706.0,678.0,104.0,0.0,470.5,0.0
75%,391.25,25.0,767.75,1.0,148781.2,3430.5,358.5,0.0,1409.0,0.0
max,496.0,29.0,3574.0,1.0,5212827.0,119562.0,113733.0,0.0,88424.0,1.0


In [68]:
videos_sent['Greta'].value_counts()

0    217
1     27
Name: Greta, dtype: int64

In [69]:
videos_sent['Greta'].value_counts(normalize=True)

0    0.889344
1    0.110656
Name: Greta, dtype: float64

In [70]:
videos_sent['category'].value_counts()

News & Politics                      148
Other                                 37
Education, Science and Technology     26
Science & Technology                  21
Nonprofits & Activism                 12
Name: category, dtype: int64

In [71]:
videos_sent['category'].value_counts(normalize=True)

News & Politics                      0.606557
Other                                0.151639
Education, Science and Technology    0.106557
Science & Technology                 0.086066
Nonprofits & Activism                0.049180
Name: category, dtype: float64

## We're done!

## But just because I am curious...

Let's see quickly how engagement varies...

In [1]:
videos_sent.groupby('Greta')[['likeCount', 'dislikeCount', 'viewCount']].describe().transpose()

Pandas is using scientific notation because some of the values are very large. So let's change this.

In [73]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
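The command above changes the display for the rest of the session. If the format is only needed for one output, a possible alternative is `pd.option_context`, which applies the setting inside a with-block only. A minimal sketch on made-up values:

```python
import pandas as pd

values = pd.Series([1234567.0, 0.5])

# The float format applies only inside the with-block.
with pd.option_context("display.float_format", "{:.3f}".format):
    formatted = repr(values)

print(formatted)
```

After the block, the display option returns to whatever it was set to before.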

In [74]:
videos_sent.groupby('Greta')[['likeCount', 'dislikeCount', 'viewCount']].describe().transpose()

Unnamed: 0,Greta,0,1
likeCount,count,217.0,27.0
likeCount,mean,3023.512,22472.259
likeCount,std,6855.64,32439.759
likeCount,min,0.0,8.0
likeCount,25%,104.0,448.5
likeCount,50%,486.0,6594.0
likeCount,75%,2614.0,39452.0
likeCount,max,60703.0,119562.0
dislikeCount,count,217.0,27.0
dislikeCount,mean,275.682,11175.148


In [75]:
videos_sent.groupby('category')[['likeCount', 'dislikeCount', 'viewCount']].mean().transpose()

category,"Education, Science and Technology",News & Politics,Nonprofits & Activism,Other,Science & Technology
likeCount,6042.115,3873.905,4083.0,10663.405,4232.286
dislikeCount,588.154,1903.095,107.5,1341.243,651.714
viewCount,156157.0,181584.791,118376.5,415990.297,152924.333


In [76]:
videos_sent.groupby('category')[['likeCount', 'dislikeCount', 'viewCount']].std().transpose()

category,"Education, Science and Technology",News & Politics,Nonprofits & Activism,Other,Science & Technology
likeCount,12390.532,12182.607,9107.242,21182.785,11083.051
dislikeCount,1286.85,11394.488,212.598,3159.771,1819.082
viewCount,276236.548,603834.274,253732.835,783798.953,374773.207


The ```.transpose()``` at the end of these commands swaps rows and columns (columns become rows, and vice versa), which makes the output more readable.

In [77]:
videos_sent.to_pickle('videos_sent.pkl')