# Lab | Data Cleaning

## Introduction

A common claim is that 80% of a data scientist's work is data cleaning. We have no idea whether this number is accurate, but data scientists do spend a lot of time and effort collecting, cleaning, and preparing data for analysis, because datasets are usually messy and complex. Being able to refine and restructure a dataset into a usable state is therefore an essential skill before moving on to the analysis stage.

In this exercise, you will practice the data cleaning techniques we discussed in the lesson and learn new ones by looking up documentation and references. You will work on your own, but remember that the teaching staff is available whenever you run into problems.


## Resources

[Data Cleaning with Numpy and Pandas](https://realpython.com/python-data-cleaning-numpy-pandas/#python-data-cleaning-recap-and-resources)

[Data Preparation](https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html)

# Import libraries

In [52]:
# Your code here
import pandas as pd
import numpy as np


# Read the users dataset.

Check which separator `users.csv` uses before reading it.
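One way to check the separator without opening the file in an editor is to count candidate characters in the header line. The sketch below works on a small in-memory sample; the `raw` string and the list of candidates are assumptions made for illustration, not the real contents of `users.csv`.

```python
import io

import pandas as pd

# Hypothetical three-line sample standing in for the top of a CSV file.
raw = "Id#Reputation#DisplayName\n-1#1#Community\n2#101#Geoff Dalgas\n"

# Count how often each candidate separator appears in the header line.
first_line = raw.splitlines()[0]
candidates = [",", ";", "#", "\t", "|"]
counts = {sep: first_line.count(sep) for sep in candidates}

# The most frequent candidate is a reasonable first guess.
best_sep = max(counts, key=counts.get)

# Parse the sample with the guessed separator and check the column count.
sample = pd.read_csv(io.StringIO(raw), sep=best_sep)
print(best_sep, sample.shape)
```

With a real file, the same idea applies to the first line read from disk. A dataframe with a single wide column after `pd.read_csv` is usually a sign that the separator guess was wrong.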

In [79]:
# Your code here
path = 'data/users.csv'
tb_users = pd.read_csv(path, sep = '#' )
tb_users

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40498,6726,1,2011-10-09 13:16:20,AlexAtStack,2012-05-18 09:32:44,,,,0,0,0,203972,,
40499,53426,101,2014-08-05 07:54:54,John J. Camilleri,2014-08-05 08:54:37,http://johnjcamilleri.com,"Gothenburg, Sweden","<p>Accidental computational linguist, de facto...",1,2,0,34865,28.0,https://www.gravatar.com/avatar/5738c02070833b...
40500,21468,101,2013-03-02 07:50:03,Peter L.,2013-03-02 07:50:03,http://www.a1qa.com/,"Minsk, Belarus","<p>QA Manager with comprehensive, cold-blooded...",1,0,0,2211454,32.0,http://www.gravatar.com/avatar/cbd80a5b2a5257d...
40501,54132,1,2014-08-15 10:52:25,user54132,2014-08-15 10:52:25,,,,1,0,0,4894117,,


## Check its shape

See the number of rows and columns you're dealing with.

In [80]:
# Your code here
tb_users.shape

(40503, 14)

## Use `.head()` to see some rows of your dataframe.

In [81]:
# Your code here
tb_users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


## Get the data info. 

Which columns have a large number of missing values? How much memory does this dataframe occupy?

Expected output:
````
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               40503 non-null  int64  
 1   Reputation       40503 non-null  int64  
 2   CreationDate     40503 non-null  object 
 3   DisplayName      40497 non-null  object 
 4   LastAccessDate   40503 non-null  object 
 5   WebsiteUrl       8158 non-null   object 
 6   Location         11731 non-null  object 
 7   AboutMe          9424 non-null   object 
 8   Views            40503 non-null  int64  
 9   UpVotes          40503 non-null  int64  
 10  DownVotes        40503 non-null  int64  
 11  AccountId        40503 non-null  int64  
 12  Age              8352 non-null   float64
 13  ProfileImageUrl  16540 non-null  object 
dtypes: float64(1), int64(6), object(7)
memory usage: 4.3+ MB
````
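The `+` in `4.3+ MB` means that pandas did not measure the contents of the `object` columns, so the real footprint is larger. A minimal sketch of how to get the share of missing values per column and a deeper memory estimate, using a toy dataframe (the `toy` data below is made up for illustration):

```python
import pandas as pd

# Toy stand-in for the users table, with some missing values.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Location": ["New York, NY", None, None, "Minsk, Belarus"],
    "Age": [37.0, None, None, None],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Shallow estimate (what .info() reports by default) versus deep estimate,
# which also inspects the Python objects stored in object columns.
shallow_bytes = toy.memory_usage().sum()
deep_bytes = toy.memory_usage(deep=True).sum()

print(missing_share)
print(shallow_bytes, deep_bytes)
```

Calling `.info(memory_usage="deep")` on the dataframe prints the same deep estimate in the summary.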

In [82]:
# Your code here
tb_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               40503 non-null  int64  
 1   Reputation       40503 non-null  int64  
 2   CreationDate     40503 non-null  object 
 3   DisplayName      40497 non-null  object 
 4   LastAccessDate   40503 non-null  object 
 5   WebsiteUrl       8158 non-null   object 
 6   Location         11731 non-null  object 
 7   AboutMe          9424 non-null   object 
 8   Views            40503 non-null  int64  
 9   UpVotes          40503 non-null  int64  
 10  DownVotes        40503 non-null  int64  
 11  AccountId        40503 non-null  int64  
 12  Age              8352 non-null   float64
 13  ProfileImageUrl  16540 non-null  object 
dtypes: float64(1), int64(6), object(7)
memory usage: 4.3+ MB


## Rename Id column to user_id.

Remember to store your results back in the dataframe.
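The reason for storing the result is that `.rename()` returns a new dataframe and leaves the original untouched. A small sketch on toy data (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Id": [1, 2], "Reputation": [101, 6792]})

# rename() returns a renamed copy; toy itself still has the old column name.
renamed = toy.rename(columns={"Id": "user_id"})

print(list(toy.columns))      # original names
print(list(renamed.columns))  # renamed copy
```

Assigning the result back to the same variable, as in `toy = toy.rename(...)`, is what makes the change stick.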

In [83]:
# Your code here
dc01 = {'Id' : 'user_id'}
tb_users = tb_users.rename(dc01, axis = 1)
tb_users

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40498,6726,1,2011-10-09 13:16:20,AlexAtStack,2012-05-18 09:32:44,,,,0,0,0,203972,,
40499,53426,101,2014-08-05 07:54:54,John J. Camilleri,2014-08-05 08:54:37,http://johnjcamilleri.com,"Gothenburg, Sweden","<p>Accidental computational linguist, de facto...",1,2,0,34865,28.0,https://www.gravatar.com/avatar/5738c02070833b...
40500,21468,101,2013-03-02 07:50:03,Peter L.,2013-03-02 07:50:03,http://www.a1qa.com/,"Minsk, Belarus","<p>QA Manager with comprehensive, cold-blooded...",1,0,0,2211454,32.0,http://www.gravatar.com/avatar/cbd80a5b2a5257d...
40501,54132,1,2014-08-15 10:52:25,user54132,2014-08-15 10:52:25,,,,1,0,0,4894117,,


# Import the `posts_file.csv` dataset

In [84]:
# Your code here
path = 'data/posts_file.csv'
tb_posts = pd.read_csv(path)
tb_posts.head()


Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,OwnerUserId,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,67711,2,,2013-08-19 02:31:14,3,,"<p>At least in OLS, flipping the direction ($x...",805.0,2013-08-19 10:17:50,,...,,0,,805.0,2013-08-19 10:17:50,,67709.0,,,
1,92493,1,,2014-04-04 05:35:59,2,18.0,<p>I have used a psychometric survey of 10 ite...,43085.0,2014-04-04 05:35:59,Multiple Regression - Extreme F-statistic and ...,...,0.0,1,,,,,,,,
2,86981,2,,2014-02-18 08:39:04,1,,<p>I think that this is due to familywise erro...,38450.0,2014-02-18 09:04:57,,...,,1,,805.0,2014-02-18 09:04:57,,86889.0,,,
3,38717,2,,2012-10-05 09:49:08,2,,"<p>There's a <a href=""http://www.ncbi.nlm.nih....",4598.0,2014-02-15 11:26:53,,...,,3,,4598.0,2014-02-15 11:26:53,,38541.0,,,
4,113919,2,,2014-09-01 01:41:05,3,,"<p>For that data, the estimated regression equ...",805.0,2014-09-01 20:09:53,,...,,1,,805.0,2014-09-01 20:09:53,,113871.0,,,


## Perform the same as above to understand a bit of your data (head, info, shape)

In [85]:
# Your code here

In [86]:
# shape
tb_posts.shape

(8299, 21)

In [87]:
# head
tb_posts.head()

Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,OwnerUserId,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,67711,2,,2013-08-19 02:31:14,3,,"<p>At least in OLS, flipping the direction ($x...",805.0,2013-08-19 10:17:50,,...,,0,,805.0,2013-08-19 10:17:50,,67709.0,,,
1,92493,1,,2014-04-04 05:35:59,2,18.0,<p>I have used a psychometric survey of 10 ite...,43085.0,2014-04-04 05:35:59,Multiple Regression - Extreme F-statistic and ...,...,0.0,1,,,,,,,,
2,86981,2,,2014-02-18 08:39:04,1,,<p>I think that this is due to familywise erro...,38450.0,2014-02-18 09:04:57,,...,,1,,805.0,2014-02-18 09:04:57,,86889.0,,,
3,38717,2,,2012-10-05 09:49:08,2,,"<p>There's a <a href=""http://www.ncbi.nlm.nih....",4598.0,2014-02-15 11:26:53,,...,,3,,4598.0,2014-02-15 11:26:53,,38541.0,,,
4,113919,2,,2014-09-01 01:41:05,3,,"<p>For that data, the estimated regression equ...",805.0,2014-09-01 20:09:53,,...,,1,,805.0,2014-09-01 20:09:53,,113871.0,,,


In [88]:
# info
tb_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8299 entries, 0 to 8298
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Id                     8299 non-null   int64  
 1   PostTypeId             8299 non-null   int64  
 2   AcceptedAnswerId       1344 non-null   float64
 3   CreaionDate            8299 non-null   object 
 4   Score                  8299 non-null   int64  
 5   ViewCount              3966 non-null   float64
 6   Body                   8284 non-null   object 
 7   OwnerUserId            8197 non-null   float64
 8   LasActivityDate        8299 non-null   object 
 9   Title                  3966 non-null   object 
 10  Tags                   3966 non-null   object 
 11  AnswerCount            3966 non-null   float64
 12  CommentCount           8299 non-null   int64  
 13  FavoriteCount          1217 non-null   float64
 14  LastEditorUserId       4071 non-null   float64
 15  Last

## Rename Id column to post_id and OwnerUserId to user_id.

Again, remember to check that your results are correctly stored inside the dataframe.

In [89]:
# Your code here
dic02 = {'Id' : 'post_id', 'OwnerUserId' : 'user_id'}
tb_posts = tb_posts.rename(dic02, axis = 1)
tb_posts.head()

Unnamed: 0,post_id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,user_id,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,67711,2,,2013-08-19 02:31:14,3,,"<p>At least in OLS, flipping the direction ($x...",805.0,2013-08-19 10:17:50,,...,,0,,805.0,2013-08-19 10:17:50,,67709.0,,,
1,92493,1,,2014-04-04 05:35:59,2,18.0,<p>I have used a psychometric survey of 10 ite...,43085.0,2014-04-04 05:35:59,Multiple Regression - Extreme F-statistic and ...,...,0.0,1,,,,,,,,
2,86981,2,,2014-02-18 08:39:04,1,,<p>I think that this is due to familywise erro...,38450.0,2014-02-18 09:04:57,,...,,1,,805.0,2014-02-18 09:04:57,,86889.0,,,
3,38717,2,,2012-10-05 09:49:08,2,,"<p>There's a <a href=""http://www.ncbi.nlm.nih....",4598.0,2014-02-15 11:26:53,,...,,3,,4598.0,2014-02-15 11:26:53,,38541.0,,,
4,113919,2,,2014-09-01 01:41:05,3,,"<p>For that data, the estimated regression equ...",805.0,2014-09-01 20:09:53,,...,,1,,805.0,2014-09-01 20:09:53,,113871.0,,,


## Define new dataframes for users and posts with the following selected columns:
**users columns**: user_id, Reputation, Views, UpVotes, DownVotes  
**posts columns**: post_id, Score, user_id, ViewCount, CommentCount, Body

In [90]:
# Your code here
df_users = tb_users[['user_id','Reputation', 'Views', 'UpVotes', 'DownVotes']]
print(df_users.head())
df_posts = tb_posts[['post_id', 'Score', 'user_id', 'ViewCount', 'CommentCount', 'Body']]
print(df_posts.head())

   user_id  Reputation  Views  UpVotes  DownVotes
0       -1           1      0     5007       1920
1        2         101     25        3          0
2        3         101     22       19          0
3        4         101     11        0          0
4        5        6792   1145      662          5
   post_id  Score  user_id  ViewCount  CommentCount  \
0    67711      3    805.0        NaN             0   
1    92493      2  43085.0       18.0             1   
2    86981      1  38450.0        NaN             1   
3    38717      2   4598.0        NaN             3   
4   113919      3    805.0        NaN             1   

                                                Body  
0  <p>At least in OLS, flipping the direction ($x...  
1  <p>I have used a psychometric survey of 10 ite...  
2  <p>I think that this is due to familywise erro...  
3  <p>There's a <a href="http://www.ncbi.nlm.nih....  
4  <p>For that data, the estimated regression equ...  


## Merge the new dataframes you have created
- Create dataframe called `posts_from_users` merging users and posts.
- You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes. 
- Think carefully about which column(s) should be the key(s) for your merge.
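To see what an inner merge keeps and drops, it can help to try it on a tiny example first. The sketch below uses made-up `users` and `posts` tables; the `indicator=True` outer merge is only an auditing aid and is not required by the exercise.

```python
import pandas as pd

# Toy tables: user 3 has no posts, and post 13 belongs to an unknown user 99.
users = pd.DataFrame({"user_id": [1, 2, 3], "Reputation": [101, 6792, 1]})
posts = pd.DataFrame({
    "post_id": [10, 11, 12, 13],
    "user_id": [1, 1, 2, 99],
    "Score": [3, 2, 1, 0],
})

# Inner merge: only user_id values present in BOTH tables survive.
merged = users.merge(posts, on="user_id", how="inner")

# Outer merge with an indicator column shows which rows had no match.
audit = users.merge(posts, on="user_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(merged)
print(unmatched)
```

Note that a user with several posts appears once per post in the merged result, so user-level columns such as `Reputation` are repeated.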

In [91]:
# Your code here
posts_from_users = pd.merge(df_users, df_posts, on='user_id', how = 'inner')
posts_from_users.head()

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,10131,0,,0,
1,-1,1,0,5007,1920,16366,0,,0,
2,-1,1,0,5007,1920,40689,0,,0,
3,-1,1,0,5007,1920,28333,0,,0,
4,-1,1,0,5007,1920,32803,0,,0,


## Check the number of duplicated rows.

Remember that you can sum a boolean mask to count how many times `True` appears in it. This works because Python treats `True` as `1` and `False` as `0`.
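A minimal sketch of that counting trick on a toy dataframe (the data is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "post_id": [10, 10, 20, 21, 21],
})

# duplicated() marks every repeat of an earlier row as True.
mask = toy.duplicated()

# Summing the booleans counts the True values.
n_duplicates = int(mask.sum())
print(mask.tolist(), n_duplicates)
```

By default the first occurrence of each row is not flagged, so the count is the number of extra copies rather than the number of rows involved.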

In [93]:
# Your code here
posts_from_users.duplicated().sum()

351

## Find those duplicate values and try to understand what happened.

Hints:   
- You can pass `keep=False` to the `.duplicated()` method to flag every copy of a duplicated row, not only the repeats.
- You can sort the values `by=['user_id', 'post_id']` to see them in order.  
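The two hints combine as in the sketch below, again on a toy dataframe whose rows are deliberately out of order (the data is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [2, 1, 2, 2, 1],
    "post_id": [21, 10, 20, 21, 10],
})

# keep=False flags the first occurrence as well as the repeats.
all_copies = toy[toy.duplicated(keep=False)]

# Sorting places the copies of each row next to each other.
all_copies = all_copies.sort_values(by=["user_id", "post_id"])
print(all_copies)
```

Seeing the copies side by side makes it easier to judge whether they are true repeats or rows that merely look similar.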


In [94]:
# Your code here
mask = posts_from_users.duplicated(keep = False)
posts_from_users[mask].sort_values(by = ['user_id','post_id'])

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
735,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
739,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
734,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
736,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
738,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
...,...,...,...,...,...,...,...,...,...,...
8475,54711,4,18,0,0,114527,0,45.0,5,<p>From Shapiro-Wilk's test I see that the res...
8477,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
8478,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
8486,54911,1,1,0,0,113691,0,36.0,11,<p>I extract data related to a movie by sentim...


## Should you drop it? 
If you think it is reasonable to drop them, then drop them. Also think about how you would prevent the problem in the first place: what went wrong upstream to produce these duplicates?  
*Hint: There's a pandas method to drop duplicates. If you wanted to do it by hand, you could select the indexes of the duplicated rows and `.drop()` them.*
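Both routes from the hint give the same result, as the sketch below shows on toy data (made up for illustration). If the duplicates turn out to come from repeated rows in one of the source tables, which is an assumption to verify on the real data, deduplicating that table before merging addresses the cause rather than the symptom.

```python
import pandas as pd

toy = pd.DataFrame({"user_id": [1, 1, 2], "post_id": [10, 10, 20]})

# Route 1: the dedicated method.
deduped = toy.drop_duplicates()

# Route 2: by hand, dropping the index labels of the repeated rows.
dup_index = toy[toy.duplicated()].index
deduped_by_hand = toy.drop(index=dup_index)

print(deduped.equals(deduped_by_hand))
```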

In [95]:
# Your code here
posts_from_users = posts_from_users.drop_duplicates()
posts_from_users.head() 

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,10131,0,,0,
1,-1,1,0,5007,1920,16366,0,,0,
2,-1,1,0,5007,1920,40689,0,,0,
3,-1,1,0,5007,1920,28333,0,,0,
4,-1,1,0,5007,1920,32803,0,,0,


## How many missing values do you have in your merged dataframe? On which columns?

In [96]:
# Your code here
#posts_from_users.info()
#posts_from_users.shape
posts_from_users.isna().sum()


user_id            0
Reputation         0
Views              0
UpVotes            0
DownVotes          0
post_id            0
Score              0
ViewCount       4277
CommentCount       0
Body              15
dtype: int64

## Select only the rows in which there is at least one missing value.
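A general way to do this, without naming the affected columns one by one, is to build a row-wise mask with `.isna().any(axis=1)`. A sketch on toy data (the dataframe is made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "post_id": [1, 2, 3],
    "ViewCount": [18.0, np.nan, 45.0],
    "Body": ["<p>a</p>", "<p>b</p>", None],
})

# True for each row that has a missing value in at least one column.
row_has_missing = toy.isna().any(axis=1)
rows_with_missing = toy[row_has_missing]
print(rows_with_missing)
```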

In [97]:
# Your code here
posts_from_users.loc[posts_from_users['Body'].isna() | posts_from_users['ViewCount'].isna() ]  

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,10131,0,,0,
1,-1,1,0,5007,1920,16366,0,,0,
2,-1,1,0,5007,1920,40689,0,,0,
3,-1,1,0,5007,1920,28333,0,,0,
4,-1,1,0,5007,1920,32803,0,,0,
...,...,...,...,...,...,...,...,...,...,...
8504,55365,321,23,1,1,114888,1,,0,<p>You need to have some indication of the unc...
8507,55435,94,1,2,0,114837,-1,,4,"<p>Yes, it is. Identifiability means that if ..."
8511,55484,38,0,3,0,114869,1,,0,<p>I am battling similar problems at the momen...
8515,55599,31,2,0,0,115233,0,,0,<p>Before computing the variance-covariance ma...


## You will need to do something about the missing values. Will you drop them or fill them?

Pay attention: values can be missing for different reasons. Look at the `user_id` of some of the affected rows and at the body of the message. For which ones are you sure what the value should be, and for which ones can you only infer it? Don't rush; take a good look at your data.
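The sketch below contrasts dropping rows with two ways of filling a numeric column, on toy data made up for illustration. Which option is appropriate depends on why the values are missing, so treat this as a comparison of mechanics rather than a recommendation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "post_id": [1, 2, 3, 4, 5],
    "ViewCount": [10.0, np.nan, 20.0, 1000.0, 30.0],
    "Body": ["<p>a</p>", "<p>b</p>", None, "<p>d</p>", "<p>e</p>"],
})

# Drop only the rows where a specific column is missing.
with_body = toy.dropna(subset=["Body"])

# Fill a numeric column with the mean or with the median.
# On skewed data the mean is pulled up by the large value.
mean_fill = with_body["ViewCount"].fillna(with_body["ViewCount"].mean())
median_fill = with_body["ViewCount"].fillna(with_body["ViewCount"].median())

print(mean_fill.loc[1], median_fill.loc[1])
```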

In [98]:
posts_from_users = posts_from_users.dropna(subset = ['Body'])
posts_from_users.head()

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
5,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
8,-1,1,0,5007,1920,50527,0,,0,<blockquote>\n <p>An F-test is any statistica...
11,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
13,-1,1,0,5007,1920,65171,0,,0,"<p>A <a href=""http://en.wikipedia.org/wiki/Mix..."
16,5,6792,1145,662,5,3675,2,,0,<p>I agree with @ars that you are unlikely to ...


In [101]:
# Your code here
posts_from_users['ViewCount'] = posts_from_users['ViewCount'].fillna(value = posts_from_users['ViewCount'].mean())
posts_from_users.head()

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
5,-1,1,0,5007,1920,2175,0,549.57319,0,<p><strong>CrossValidated</strong> is for stat...
8,-1,1,0,5007,1920,50527,0,549.57319,0,<blockquote>\n <p>An F-test is any statistica...
11,-1,1,0,5007,1920,8981,0,549.57319,0,"<p>""Statistics"" can refer variously to the (wi..."
13,-1,1,0,5007,1920,65171,0,549.57319,0,"<p>A <a href=""http://en.wikipedia.org/wiki/Mix..."
16,5,6792,1145,662,5,3675,2,549.57319,0,<p>I agree with @ars that you are unlikely to ...


## Reset the index
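Like most pandas methods, `.reset_index()` returns a new dataframe, and by default it keeps the old index as an extra column named `index`. A small sketch on toy data (made up for illustration) showing the difference between displaying the result and storing it:

```python
import pandas as pd

# Toy dataframe with gaps in the index, as left behind by dropping rows.
toy = pd.DataFrame({"post_id": [2175, 50527, 8981]}, index=[5, 8, 11])

# Not assigned back: toy is unchanged, and the result gains an 'index' column.
not_stored = toy.reset_index()

# Assigned back, with drop=True to discard the old index entirely.
toy = toy.reset_index(drop=True)

print(list(not_stored.columns))
print(list(toy.index))
```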

In [102]:
# Your code here
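# Note: reset_index() returns a new dataframe. The result is displayed here but not
# assigned, so posts_from_users keeps its old index in the cells that follow.
posts_from_users.reset_index()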

Unnamed: 0,index,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,5,-1,1,0,5007,1920,2175,0,549.57319,0,<p><strong>CrossValidated</strong> is for stat...
1,8,-1,1,0,5007,1920,50527,0,549.57319,0,<blockquote>\n <p>An F-test is any statistica...
2,11,-1,1,0,5007,1920,8981,0,549.57319,0,"<p>""Statistics"" can refer variously to the (wi..."
3,13,-1,1,0,5007,1920,65171,0,549.57319,0,"<p>A <a href=""http://en.wikipedia.org/wiki/Mix..."
4,16,5,6792,1145,662,5,3675,2,549.57319,0,<p>I agree with @ars that you are unlikely to ...
...,...,...,...,...,...,...,...,...,...,...,...
8151,8517,55628,1,0,0,0,115148,0,21.00000,2,<p>I am currently doing research on social med...
8152,8518,55633,4,1,0,0,115162,0,25.00000,1,<p>I am new to using R. I am trying to figur...
8153,8519,55637,26,4,0,0,115170,1,549.57319,0,"<p>When you say class, I hope you mean 'output..."
8154,8520,55734,1,0,0,0,115352,0,16.00000,0,"<p>For example, I was looking at <a href=""http..."


## Adjust the data types in order to avoid future issues. Which ones should be changed? 
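Two conversions that often come up are identifier columns stored as floats because of missing values, and dates stored as strings. The sketch below shows one possible approach on toy data; the dataframe and the choice of the nullable `Int64` dtype are assumptions for illustration, not the expected answer.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_id": [805.0, np.nan, 4598.0],
    "CreationDate": [
        "2013-08-19 02:31:14",
        "2014-04-04 05:35:59",
        "2012-10-05 09:49:08",
    ],
})

# The nullable Int64 dtype stores integers and still allows missing values;
# a plain int64 cast would raise an error because of the NaN.
toy["user_id"] = toy["user_id"].astype("Int64")

# Parse the date strings into proper datetime values.
toy["CreationDate"] = pd.to_datetime(toy["CreationDate"])

print(toy.dtypes)
```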

In [108]:
# Your code here

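# Casting to int64 truncates the decimal part of the mean-filled values (549.57319 becomes 549).
posts_from_users['ViewCount'] = posts_from_users['ViewCount'].astype('int64')
# In a notebook only the last expression of a cell is displayed, so this line shows nothing.
posts_from_users.dtypes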
posts_from_users.head()

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
5,-1,1,0,5007,1920,2175,0,549,0,<p><strong>CrossValidated</strong> is for stat...
8,-1,1,0,5007,1920,50527,0,549,0,<blockquote>\n <p>An F-test is any statistica...
11,-1,1,0,5007,1920,8981,0,549,0,"<p>""Statistics"" can refer variously to the (wi..."
13,-1,1,0,5007,1920,65171,0,549,0,"<p>A <a href=""http://en.wikipedia.org/wiki/Mix..."
16,5,6792,1145,662,5,3675,2,549,0,<p>I agree with @ars that you are unlikely to ...


# Bonus (filtering) 
What is the average number of comments for users whose reputation is above average?  
*Hint: Calculate the average user Reputation, store it in a variable called `avg_reputation`, and then use that variable to filter the dataset (that is, keep the rows in which `Reputation > avg_reputation`).*
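The filtering pattern from the hint can be sketched on toy data as follows. The dataframe is made up for illustration, and the sketch averages `Reputation` over merged rows, so users with many posts weigh more; computing the average over one row per user is an equally defensible reading of the question.

```python
import pandas as pd

# Toy merged table: user 1 has two posts, users 2 and 3 have one each.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "Reputation": [100, 100, 10, 40],
    "CommentCount": [4, 2, 1, 5],
})

# Average reputation over all rows: (100 + 100 + 10 + 40) / 4.
avg_reputation = toy["Reputation"].mean()

# Keep only the rows above that average, then average their comment counts.
above = toy[toy["Reputation"] > avg_reputation]
avg_comments_above = above["CommentCount"].mean()

print(avg_reputation, avg_comments_above)
```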

In [None]:
# Your code here