# Pandas on Weibo COVID-19 data

Reading in the data (the path currently refers to a file on my computer):

In [4]:
import pandas as pd

data = pd.read_csv("C:/Users/kgk/OneDrive - Aalborg Universitet/CALDISS_projects/digital-literacy_E20/text-networks/data/2019-12.csv")

`data` is loaded into Python as a pandas `DataFrame`, a class that comes with many methods and attributes.

In [41]:
type(data)

pandas.core.frame.DataFrame
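The path above points to a file on the author's machine, so the cell cannot be run elsewhere as written. As a minimal sketch, the same loading step can be tried on a small in-memory CSV. The column names are taken from the real data, but the two rows and the name `sample` are made up for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for the real file: same column names, invented rows
csv_text = """_id,crawl_time,created_at,like_num,repost_num,comment_num,content,origin_weibo,location_map_info
a1,1587604890,2019-12-01 00:00:17,1,0,0,first post,,
a2,1587968450,2019-12-01 00:02:44,7,2,1,second post,a1,
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))  # the DataFrame class
print(sample.shape)  # (number of rows, number of columns)
```

Empty fields, such as `origin_weibo` in the first row, are read as missing values (`NaN`).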

`.head()` prints the first five rows. It is a quick way to check that the data has been imported correctly.

In [5]:
data.head()

Unnamed: 0,_id,crawl_time,created_at,like_num,repost_num,comment_num,content,origin_weibo,location_map_info
0,IiF4ShXQZ,1587604890,2019-12-01 00:00:17,1,0,0,《药品管理法》《疫苗管理法》👏👏,,
1,IiF5S5tKi,1587968450,2019-12-01 00:02:44,1,0,0,[微风]今日 周六 大风 第三天 打疫苗 实测体重2.6公斤,,
2,IiF6Zcfr3,1587968119,2019-12-01 00:05:29,0,0,0,明天新的《中华人民共和国药品管理法》和《疫苗管理法》就实行了。 友情提醒：虽然第九十八条把...,IiEEaDOkW,
3,IiF75FUT4,1587895953,2019-12-01 00:05:45,0,0,0,持续不断否定自己 变得一点信心也没有 对自己否定 对生活否定 对一切都怀疑 今天压力真的大到...,,
4,IiF84inVj,1587964711,2019-12-01 00:08:09,0,0,0,央视新闻频道，正在播出宇芽家暴的专题采访。 这个专题之前是高以翔事件的报道。 转发理由:看...,IiEBvb90s,


A DataFrame is the main data structure in pandas. It holds tabular data (data in rows and columns) to work with in Python.

Rows are automatically assigned a "row index" starting with 0.

If the data has a header row, pandas uses it as the column names.
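The row index and the column names can be inspected directly. The following is a small sketch on a made-up DataFrame (`toy` and its values are invented for illustration, not taken from the Weibo data):

```python
import pandas as pd

# Made-up example data; only the column names echo the real dataset
toy = pd.DataFrame({"like_num": [1, 0, 24], "content": ["a", "b", "c"]})

# The automatically assigned row index starts at 0
print(list(toy.index))    # [0, 1, 2]

# The column names
print(list(toy.columns))  # ['like_num', 'content']
```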

`.loc[]` is used for subsetting the data.

The syntax is `data.loc[rows, columns]`.

Rows can either be specified by their index or by boolean values (using a condition).

When selecting several columns, these can be specified as a list of columns.

Note that `.loc[]` does not change the data. It only returns a selection, which the notebook prints. The output can, however, be assigned to a new variable to keep the subset.

In [42]:
data.loc[0:5, ['like_num', 'created_at', 'content']] # Select first 6 rows and columns like_num, created_at, content

Unnamed: 0,like_num,created_at,content
0,1,2019-12-01 00:00:17,《药品管理法》《疫苗管理法》👏👏
1,1,2019-12-01 00:02:44,[微风]今日 周六 大风 第三天 打疫苗 实测体重2.6公斤
2,0,2019-12-01 00:05:29,明天新的《中华人民共和国药品管理法》和《疫苗管理法》就实行了。 友情提醒：虽然第九十八条把...
3,0,2019-12-01 00:05:45,持续不断否定自己 变得一点信心也没有 对自己否定 对生活否定 对一切都怀疑 今天压力真的大到...
4,0,2019-12-01 00:08:09,央视新闻频道，正在播出宇芽家暴的专题采访。 这个专题之前是高以翔事件的报道。 转发理由:看...
5,1,2019-12-01 00:08:19,终于在2019的结尾完成了一件“人生大事”，很幸运的不用预约就把九价疫苗给打了，不过疫苗好像...


In [43]:
data.loc[data['like_num'] > 5, ['like_num', 'created_at', 'content']] # Selects rows/posts with more than 5 likes

Unnamed: 0,like_num,created_at,content
8,24,2019-12-01 00:09:00,HAPPY HOMECOMING 内江 显示地图
86,20,2019-12-01 01:49:02,十一月 我的十一月是一个周日和四个周四组成的。十月三十一日，我被野猫抓破，去打了第一针狂犬疫...
89,34,2019-12-01 01:51:44,刚不小心被狗咬了一口🐶 请问需要打狂犬疫苗么～ 临沂·临沭县 显示地图
131,7,2019-12-01 05:58:07,回国前例行shopping 海关蜀黍求放过👮🏻♂️👮🏼♂️👮🏽♂️👮🏾♂️👮🏿♂️ 美国...
243,23,2019-12-01 08:15:50,#株洲# 【 “世界艾滋病日”将至 株洲建立了7个国家级艾滋病监测哨点】“艾滋病已成为严重危...
...,...,...,...
60010,26,2019-12-31 23:39:26,惯例#跨年小结# [喵喵] 2019上半年是这辈子最恶心最后悔的一段💩一样的日子 呕呕呕呕呕...
60019,9,2019-12-31 23:43:34,2019年 看了20场电影 听了1场演唱会 去了迪士尼故宫鸟巢 也去了长沙去了海边 把肥牛从...
60048,7,2019-12-31 23:56:03,#新年心愿#我和飞哥都能上岸！早日经济独立！我想打疫苗！爸爸妈妈姥姥姥爷身体健康！弟弟能懂事点！
60052,6,2019-12-31 23:57:56,2⃣️0⃣️1⃣️9⃣️ 全国去了天津 北戴河 武汉 福州 北京去了红砖美术馆 三里屯 荟聚...
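Two details of `.loc[]` are easy to miss: label slices include both endpoints, and several conditions can be combined into one filter. A minimal sketch on made-up data (`toy`, `first_rows` and `popular` are hypothetical names used only here):

```python
import pandas as pd

# Made-up example data; only the column names echo the real dataset
toy = pd.DataFrame({
    "like_num": [1, 0, 24, 7],
    "repost_num": [0, 0, 3, 0],
    "content": ["a", "b", "c", "d"],
})

# Label slices in .loc include both endpoints: 0:2 gives rows 0, 1 and 2
first_rows = toy.loc[0:2, ["like_num", "content"]]

# Conditions are combined with & (and) or | (or); each needs its own parentheses
popular = toy.loc[(toy["like_num"] > 5) & (toy["repost_num"] > 0), ["like_num", "content"]]

# Assigning the result keeps the subset; toy itself is unchanged
print(len(first_rows), len(popular), len(toy))  # 3 1 4
```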


### Dates

Pandas can work with dates by converting date columns to datetime objects.

Pandas usually does not recognize dates automatically when reading a CSV file, so the column has to be converted with the function `pd.to_datetime()`. If the dates are in a recognizable format, pandas converts the column without further arguments.

In [44]:
data['created_at'] = pd.to_datetime(data['created_at'])

Now that the column is recognized as a date, it can be used as a filter (here selecting posts from December 15 onwards).

In [45]:
data.loc[data['created_at'] > "2019-12-15", ['created_at','content']]

Unnamed: 0,created_at,content
27880,2019-12-15 00:00:03,emmm……忘记预约hpv疫苗的事了都[摊手] 有个大事上不含糊的人在身边真的非常有安全感。 ❤️
27881,2019-12-15 00:01:10,羽生理惠（将棋选手羽生善治的夫人）今早发twi： 杂技酱也被国内的声音为难。明明是为国家夺得...
27882,2019-12-15 00:01:25,【英国男受静脉曲张困扰20年，飞15小时到长沙：医生40分钟解决】英国的Peter受下肢静脉...
27883,2019-12-15 00:01:40,#宝宝疫苗# 宝宝出生后，疫苗怎么打，一类免费必须打，二类自费需要打吗？打疫苗需要注意些什...
27884,2019-12-15 00:01:59,【真实记录：日本普通程序员的一天】 看完了感觉和中国的程序员，生活状态基本一样啊..... ...
...,...,...
60063,2019-12-31 23:59:31,2020来临之际，@微博宠粉官 向我发起了新年祝福接力，现在我把祝福传递给你 @活累_ @生...
60064,2019-12-31 23:59:34,2019年的最后一刻[钟] 时间一眨眼就溜了溜了。 换了新的工作，认识了许多新的人。 细胞又...
60065,2019-12-31 23:59:44,【2019最后一次乳印乳俄乳髪乳英】具有大口径四面相控阵雷达和垂直发射远程舰空导弹系统的大型...
60066,2019-12-31 23:59:52,胰腺癌为何公认“癌中之王”？目前普遍公认胰腺癌是“癌中之王”，原因主要有三点： 1、恶性度...
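The comparison with `"2019-12-15"` treats the string as midnight on that day, which is why posts from December 15 itself are included. A small sketch on made-up timestamps illustrates this, along with one way to handle values that are not dates (`errors="coerce"` is an option of `pd.to_datetime()`; the values in `raw` are invented):

```python
import pandas as pd

# Made-up values: one before midnight, one after, one that is not a date
raw = pd.Series(["2019-12-14 23:59:59", "2019-12-15 00:00:03", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# A bare date string is read as midnight, so this keeps
# everything later than 2019-12-15 00:00:00
after = parsed[parsed > "2019-12-15"]
print(after)
```

Missing dates (`NaT`) never satisfy the condition, so they drop out of the filtered result.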


**Extracting date information**

Datetime objects contain attributes for the month, year, week, weekday and so on.

This can be used to create new variables/columns containing, for example, the weekday and the week number of the post.

In [50]:
data['post_weekday'] = data['created_at'].dt.weekday # 0 = Monday, 6 = Sunday
data['post_weeknumber'] = data['created_at'].dt.isocalendar().week

data.loc[data['created_at']> "2019-12-15", ['created_at', 'post_weekday', 'post_weeknumber']]

Unnamed: 0,created_at,post_weekday,post_weeknumber
27880,2019-12-15 00:00:03,6,50
27881,2019-12-15 00:01:10,6,50
27882,2019-12-15 00:01:25,6,50
27883,2019-12-15 00:01:40,6,50
27884,2019-12-15 00:01:59,6,50
...,...,...,...
60063,2019-12-31 23:59:31,1,1
60064,2019-12-31 23:59:34,1,1
60065,2019-12-31 23:59:44,1,1
60066,2019-12-31 23:59:52,1,1
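The output shows week number 1 for posts from December 31, 2019. That follows from the ISO calendar, where that date already belongs to week 1 of 2020. A short sketch using two timestamps from the output above (`stamps` and `parts` are names chosen here for illustration):

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2019-12-15 00:00:03", "2019-12-31 23:59:31"]))

parts = pd.DataFrame({
    "weekday": stamps.dt.weekday,              # 0 = Monday, 6 = Sunday
    "day_name": stamps.dt.day_name(),          # weekday as text
    "iso_year": stamps.dt.isocalendar().year,  # year the ISO week belongs to
    "iso_week": stamps.dt.isocalendar().week,
})
print(parts)
```

Keeping the ISO year next to the ISO week is one way to avoid mixing up week 1 of 2020 with week 1 of 2019 when data spans a year boundary.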


### Groupby

DataFrames support groupby operations. These work by grouping observations by each individual value in a column (or a combination of columns). After that, various summary statistics can be computed per group.

In [51]:
data_grouped = data.groupby('post_weeknumber') # Creates a groupby object - grouped by weeknumber

In [52]:
data_grouped.count() # Counts number of observations per week per column (counts differ because of missing values)

post_weeknumber,_id,crawl_time,created_at,like_num,repost_num,comment_num,content,origin_weibo,location_map_info,post_weekday
1,8395,8395,8395,8395,8395,8395,8395,5963,444,8395
48,1813,1813,1813,1813,1813,1813,1813,1499,57,1813
49,17071,17071,17071,17071,17071,17071,17071,13488,541,17071
50,10714,10714,10714,10714,10714,10714,10714,8141,450,10714
51,10226,10226,10226,10226,10226,10226,10226,7349,531,10226
52,11849,11849,11849,11849,11849,11849,11849,8859,572,11849


Summary statistics can also be computed for individual columns:

In [54]:
data_grouped['like_num'].mean()

post_weeknumber
1     1.140083
48    1.488141
49    0.654209
50    1.955572
51    1.252298
52    0.898388
Name: like_num, dtype: float64
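As a sketch of how these pieces fit together, the example below contrasts `.size()` (rows per group) with `.count()` (non-missing values per group) and computes several statistics at once with `.agg()`. The data in `toy` is made up for illustration.

```python
import pandas as pd

# Made-up example data with some missing values in origin_weibo
toy = pd.DataFrame({
    "week": [48, 48, 49, 49, 49],
    "like_num": [1, 3, 0, 6, 0],
    "origin_weibo": ["x", None, None, "y", "z"],
})

grouped = toy.groupby("week")

# Rows per group, missing values included
print(grouped.size())

# Non-missing values per group, which is why counts can differ between columns
print(grouped["origin_weibo"].count())

# Several summary statistics for one column in a single call
print(grouped["like_num"].agg(["mean", "max"]))
```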

In [1]:
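a_list = ['cat', 'dog', 'cow', 'window']

# .apply() is a pandas method; a plain Python list does not have it,
# so this line raises the AttributeError shown below
a_list.apply(lower)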

AttributeError: 'list' object has no attribute 'apply'
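The error arises because `.apply()` belongs to pandas objects, not to Python lists. A minimal sketch of two possible alternatives follows: a list comprehension in plain Python, or wrapping the list in a pandas Series. The mixed casing in `a_list` is changed from the cell above so that the effect of lowercasing is visible.

```python
import pandas as pd

a_list = ['cat', 'Dog', 'COW', 'window']

# Plain Python: a list comprehension calls the string method on each element
lowered_list = [word.lower() for word in a_list]

# pandas: wrap the list in a Series to get .apply() and the vectorised .str methods
a_series = pd.Series(a_list)
lowered_apply = a_series.apply(str.lower)
lowered_str = a_series.str.lower()

print(lowered_list)  # ['cat', 'dog', 'cow', 'window']
```

For text columns in a DataFrame, the `.str` accessor is usually the more direct choice, since it also passes missing values through without raising an error.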