# Data acquisition

In this section, we use pandas to read the dataset, which is stored as CSV files, with its read_csv() function, and then check some properties of the resulting dataframe.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# raw csv file collected from GitHub : @BDBC-KG-NLP/COVID-19-tracker
chinese_province_wise_data = {
    'sichuan'  : 'https://bit.ly/3fOW1BM',
    'anhui'    : 'https://bit.ly/2A7O2Ra',
    'shandong' : 'https://bit.ly/3g1Fo5J',
    'shanxi'   : 'https://bit.ly/3fWyCOJ',
    'jiangsu'  : 'https://bit.ly/2BHdCgk',
    'henan'    : 'https://bit.ly/2BDOhng',
    'zhejiang' : 'https://bit.ly/2NGddNX',
    'hainan'   : 'https://bit.ly/382c7oX',
    'hunan'    : 'https://bit.ly/31cDHyt',
    'guizhou'  : 'https://bit.ly/2Zn6T3l',
    'chongqing': 'https://bit.ly/2YwmvSy'
}

# province names, taken from dict.keys()
provinces = chinese_province_wise_data.keys()

In [3]:
# list to collect dataframes from each dataset
list_of_frames = []

# iterate and append frame in list_of_frames
for each_province in provinces:
  frame = pd.read_csv(chinese_province_wise_data[each_province], index_col=None, header=None)
  list_of_frames.append(frame)

# df : completely merged frame
df = pd.concat(list_of_frames, axis=0, ignore_index=True)
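Because the loop above reads every file with header=None, each file's own header row arrives as an ordinary data row. A minimal sketch of the alternative, assuming every file starts with a header row; the in-memory CSV strings and the `source` column are hypothetical stand-ins for the real URLs, not part of the original notebook:

```python
import io
import pandas as pd

# Hypothetical stand-ins for the per-province CSV files.
sample_files = {
    "province_a": "性别,年龄\n女,58\n男,40\n",
    "province_b": "性别,年龄\n男,33\n",
}

frames = []
for name, text in sample_files.items():
    # header=0 treats the first line as column names instead of data
    frame = pd.read_csv(io.StringIO(text), header=0)
    frame["source"] = name  # remember which file each row came from
    frames.append(frame)

merged = pd.concat(frames, axis=0, ignore_index=True)
print(merged.shape)
print(merged.dtypes)
```

With the header consumed at read time, the age column is parsed as integers straight away and no header text is left among the values.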

In [4]:
# drop the last column and the first row
# (the first file's header row, read as data because header=None)
df.drop(df.columns[len(df.columns)-1], axis=1, inplace=True)
df.drop(df.index[[0]], inplace=True)

# reset the row index
df.reset_index(drop=True, inplace=True)

# custom header
headers = ['Gender', 'Age', 'Work', 'Diagnosis Time', 'Hubei Contact', 'Case Number', 'Permanent Residence', 'Hubei Contact Description', 'Case Announcement Province', 'Case Announcement City', 'Time', 'Route Location', 'Event', 'Description', 'Transportation', 'Track Number', 'Number Of Tracks']

# set custom header to dataframe
df.columns = headers
df.head(3)

Unnamed: 0,Gender,Age,Work,Diagnosis Time,Hubei Contact,Case Number,Permanent Residence,Hubei Contact Description,Case Announcement Province,Case Announcement City,Time,Route Location,Event,Description,Transportation,Track Number,Number Of Tracks
0,女,58,,,未知,,,,四川省,泸州市,1月26日,"石河中心卫生院(107.328107,30.870779)",探望病人,"1月26日, 10：00，到石河中心卫生院探望病人",,轨迹1,4
1,女,58,,,未知,,,,四川省,泸州市,1月27日,"石河镇宝华村(107.327606,30.895882)",看望病危父亲,"1月27日, 16：00，到石河镇宝华村3组看望病危父亲",,轨迹2,4
2,女,58,,,未知,,,,四川省,泸州市,1月31日,"石河镇宝华村(107.327606,30.895882)",办丧事,"1月31日-2月5日, 在石河镇宝华村办丧事",,轨迹3,4


In [5]:
# check the shape of the dataframe
df.shape

(10790, 17)

# Data wrangling

In this section, we wrangle the dataframe, converting it from its initial format into one that is better suited to analysis. Data wrangling is sometimes referred to as **'data preprocessing'**. We begin by identifying the columns that need to be translated from **Chinese to English**.

Before translating, we need to drop the columns that are mostly NaN.



In [6]:
# find columns NaN percentage
def check_nan() : 
  percent_nan = df.isnull().sum() * 100 / len(df)
  missing_value_in_df = pd.DataFrame(
      {
          'percent_nan': round(percent_nan, 2)
      }
  )
  print(missing_value_in_df)


check_nan()

                            percent_nan
Gender                             0.00
Age                                8.02
Work                              96.60
Diagnosis Time                    72.91
Hubei Contact                      0.10
Case Number                       99.66
Permanent Residence               18.84
Hubei Contact Description         85.39
Case Announcement Province         0.28
Case Announcement City             0.77
Time                               0.07
Route Location                    34.25
Event                             45.76
Description                        0.01
Transportation                    69.94
Track Number                       0.00
Number Of Tracks                   0.00


### Drop columns

Looking at the NaN percentage per column, we can see that several columns are mostly NaN and need to be removed. But how much NaN is too much? A common rule of thumb is that a column with more than 50% missing values carries too little information to be useful, so we drop every column whose NaN percentage is above 50%.
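The 50% rule can also be applied programmatically instead of listing the columns by hand. A minimal sketch on a toy frame; the data and the names `toy`, `threshold` and `columns_over_threshold` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame: "Work" is 75% NaN, the other columns are 25% NaN.
toy = pd.DataFrame({
    "Age": [58, np.nan, 40, 33],
    "Work": [np.nan, np.nan, np.nan, "farmer"],
    "Event": ["a", "b", np.nan, "d"],
})

threshold = 50  # percent of missing values above which a column is dropped

# isna().mean() gives the fraction of missing values per column
percent_nan = toy.isna().mean() * 100
columns_over_threshold = percent_nan[percent_nan > threshold].index.tolist()

trimmed = toy.drop(columns=columns_over_threshold)
print(columns_over_threshold)
print(trimmed.columns.tolist())
```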

In [7]:
columns_to_drop = ['Work', 'Diagnosis Time', 'Case Number', 'Hubei Contact Description', 'Transportation']

'''
  axis = 1 : drop columns rather than rows
  inplace = True : modify the dataframe in place
'''
df.drop(columns=columns_to_drop, axis=1, inplace=True)

# find columns NaN percentage
check_nan()

                            percent_nan
Gender                             0.00
Age                                8.02
Hubei Contact                      0.10
Permanent Residence               18.84
Case Announcement Province         0.28
Case Announcement City             0.77
Time                               0.07
Route Location                    34.25
Event                             45.76
Description                        0.01
Track Number                       0.00
Number Of Tracks                   0.00


### Drop rows initially

We have a lot of rows in our dataframe. That's cool until we think about the performance of the translator API that we're gonna use next. Translating more than 10K rows in a dataframe might take **10-30 minutes**. Sad, but true 🥺. We cannot identify all the important features of our dataset before translation, but we can initially pick out some features that are a must for our analysis.

Initially, Age, Permanent Residence, Case Announcement Province, Case Announcement City, Time, Hubei Contact and Event are features so important that they must not be NaN. So, we drop every row in which any of them is missing.
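To make the row-dropping rule concrete, here is a small sketch on a toy frame (the data and the names `toy`, `required` and `kept` are assumptions for illustration). Only the columns listed in `subset` are checked, so a row may still contain NaN in other columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": ["58", np.nan, "40"],
    "Time": ["1月26日", "1月27日", np.nan],
    "Route Location": [np.nan, "somewhere", "elsewhere"],
})

# columns that must not be NaN for a row to be kept
required = ["Age", "Time"]

# drop rows with NaN in any required column, then renumber the rows
kept = toy.dropna(subset=required).reset_index(drop=True)
print(kept)
```

Only the first row survives here, and its missing Route Location is left alone because that column is not in `required`.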

In [8]:
drop_nan_rows = ['Age', 'Permanent Residence', 'Case Announcement Province', 'Case Announcement City', 'Time', 'Hubei Contact', 'Event']
df.dropna(subset=drop_nan_rows, axis=0, inplace=True)

# reset the index, because we dropped rows
df.reset_index(drop=True, inplace=True)

In [9]:
# find columns NaN percentage
check_nan()

                            percent_nan
Gender                             0.00
Age                                0.00
Hubei Contact                      0.00
Permanent Residence                0.00
Case Announcement Province         0.00
Case Announcement City             0.00
Time                               0.00
Route Location                    18.73
Event                              0.00
Description                        0.00
Track Number                       0.00
Number Of Tracks                   0.00


In [10]:
df.shape

(4587, 12)

### Correcting data type

In [11]:
df['Age'].unique()

array(['51', '50', '23', '28', '27', '56', '10', '79', '57', '61', '66',
       '64', '30', '43', '54', '65', '77', '6', '71', '68', '88', '49',
       '85', '21', '55', '36', '45', '18', '48', '39', '47', '24', '76',
       '37', '20', '72', '3', '12', '22', '25', '46', '41', '52', '29',
       '4', '33', '38', '35', '53', '31', '62', '26', '32', '42', '80',
       '17', '13', '67', '70', '34', '60', '9', '74', '年龄', '63', '40',
       '58', '44', '16', '11', '69', '59', '75', '90', '73', '14', '78',
       '15', '84', '89', '1', '5', '19', '8', '7', '91', '2', '86', '82'],
      dtype=object)

In [12]:
# '年龄' ("age") is a leftover header value; replace it with the average age (44.75, rounded to 45)
df["Age"].replace('年龄', '45', inplace=True)

In [13]:
df[["Age"]] = df[["Age"]].astype("int")

In [14]:
df['Number Of Tracks'].max()

'轨迹数量'

In [15]:
# '轨迹数量' ("number of tracks") is a leftover header value; replace it with '0' for now to find the max
df["Number Of Tracks"].replace('轨迹数量', '0', inplace=True)

In [16]:
df['Number Of Tracks'].max()

'9'

In [17]:
df['Number Of Tracks'].unique()

array(['4', '2', '5', '15', '3', '18', '10', '6', '8', '1', '11', '14',
       '7', '17', '22', '27', '16', '9', '13', '23', '12', '0'],
      dtype=object)

In [18]:
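# note: the column still holds strings, so max() compared them as text;
# '9' is the largest string, not the largest number (the values go up to '27')
df["Number Of Tracks"].replace('0', '9', inplace=True)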

In [19]:
# finally change the type
df[["Number Of Tracks"]] = df[["Number Of Tracks"]].astype("int")
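The max() call above returns '9' even though values such as '27' appear in the column, because strings are compared character by character. A minimal sketch of the difference on a toy Series (hypothetical data), using pd.to_numeric to get the numeric maximum and to spot entries that are not numbers:

```python
import pandas as pd

tracks = pd.Series(["4", "15", "27", "9", "轨迹数量"])

# max() on digit strings compares text, so "9" beats "27"
string_max = tracks[tracks.str.isdigit()].max()

# errors="coerce" turns anything that is not a number into NaN
numeric = pd.to_numeric(tracks, errors="coerce")
numeric_max = int(numeric.max())

# the entries that could not be parsed, e.g. a leftover header value
non_numeric = tracks[numeric.isna()].tolist()

print(string_max, numeric_max, non_numeric)
```

Converting first and inspecting `non_numeric` afterwards is one way to decide what a placeholder value should be replaced with before the final type change.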

In [20]:
# replace NaN with an empty string,
# because the translation API doesn't accept NaN
df['Route Location'].replace(np.nan, '', inplace=True)
check_nan()

                            percent_nan
Gender                              0.0
Age                                 0.0
Hubei Contact                       0.0
Permanent Residence                 0.0
Case Announcement Province          0.0
Case Announcement City              0.0
Time                                0.0
Route Location                      0.0
Event                               0.0
Description                         0.0
Track Number                        0.0
Number Of Tracks                    0.0


In [21]:
# swap the values of the first two columns so that the text columns are contiguous for the iloc slice used in translation
c = df.columns
df[[c[0], c[1]]] = df[[c[1], c[0]]]
df.head(3)

Unnamed: 0,Gender,Age,Hubei Contact,Permanent Residence,Case Announcement Province,Case Announcement City,Time,Route Location,Event,Description,Track Number,Number Of Tracks
0,51,女,有,泸县牛滩镇,四川省,泸州市,1月19日,"武汉(114.305215,30.592935)->泸县牛滩镇寿尊村(105.392379,...",回家,"1月19日, 雷某英独自从武汉返回泸县牛滩镇寿尊村10社家中",轨迹1,4
1,51,女,有,泸县牛滩镇,四川省,泸州市,1月23日,,出现咳嗽等症状,"1月23日, 出现咳嗽、头痛等症状，自行服药后缓解",轨迹2,4
2,51,女,有,泸县牛滩镇,四川省,泸州市,1月24日,"泸县牛滩镇寿尊村(105.392379,29.097620)->牛滩镇卫生院(105.340...",检查,"1月24日, 到牛滩镇卫生院检查未见异常（未如实提供武汉返泸史）",轨迹3,4


In [22]:
# rename the headers to match the swapped values
df = df.rename(columns={'Gender': 'Age', 'Age': 'Gender'})
df.head(3)

Unnamed: 0,Age,Gender,Hubei Contact,Permanent Residence,Case Announcement Province,Case Announcement City,Time,Route Location,Event,Description,Track Number,Number Of Tracks
0,51,女,有,泸县牛滩镇,四川省,泸州市,1月19日,"武汉(114.305215,30.592935)->泸县牛滩镇寿尊村(105.392379,...",回家,"1月19日, 雷某英独自从武汉返回泸县牛滩镇寿尊村10社家中",轨迹1,4
1,51,女,有,泸县牛滩镇,四川省,泸州市,1月23日,,出现咳嗽等症状,"1月23日, 出现咳嗽、头痛等症状，自行服药后缓解",轨迹2,4
2,51,女,有,泸县牛滩镇,四川省,泸州市,1月24日,"泸县牛滩镇寿尊村(105.392379,29.097620)->牛滩镇卫生院(105.340...",检查,"1月24日, 到牛滩镇卫生院检查未见异常（未如实提供武汉返泸史）",轨迹3,4


In [23]:
check_nan()

                            percent_nan
Age                                 0.0
Gender                              0.0
Hubei Contact                       0.0
Permanent Residence                 0.0
Case Announcement Province          0.0
Case Announcement City              0.0
Time                                0.0
Route Location                      0.0
Event                               0.0
Description                         0.0
Track Number                        0.0
Number Of Tracks                    0.0


In [24]:
df.dtypes

Age                            int64
Gender                        object
Hubei Contact                 object
Permanent Residence           object
Case Announcement Province    object
Case Announcement City        object
Time                          object
Route Location                object
Event                         object
Description                   object
Track Number                  object
Number Of Tracks               int64
dtype: object

# Data Translation
In this section, we translate our data from Chinese to English. Every column except Age and Number Of Tracks needs to be translated.
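Since each translation request is slow, it helps to translate every distinct value only once and then map the results back onto the frame. The sketch below illustrates that idea without network access: `fake_translate` and its `lookup` table are hypothetical stand-ins for a real translation call, and the toy frame is made up.

```python
import pandas as pd

# Hypothetical stand-in for a real translation service.
lookup = {"女": "female", "男": "male", "有": "yes", "无": "no"}
calls = []  # records every request so we can count them

def fake_translate(text):
    calls.append(text)
    return lookup.get(text, text)

toy = pd.DataFrame({
    "Gender": ["女", "男", "女", "女"],
    "Hubei Contact": ["有", "无", "有", "有"],
})

translations = {}
for column in toy.columns:
    for element in toy[column].unique():
        # skip values already translated for an earlier column
        if element not in translations:
            translations[element] = fake_translate(element)

# replace every cell using the dictionary built above
toy_en = toy.replace(translations)
print(len(calls))
print(toy_en)
```

Eight cells are translated with four requests here; on real data the saving depends on how often values repeat.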

In [None]:
! pip install googletrans

In [26]:
# import the library
import googletrans
from googletrans import Translator

In [27]:
# detect language
translator = Translator()

languages = translator.detect(['石河中心卫生院', 'English', 'বাংলা'])
for lang in languages:
    print(lang.lang, lang.confidence)

zh-CN 1.0
en 1.0
bn 1.0


In [28]:
df.shape

(4587, 12)

In [29]:
# copy dataframe for translation
df_en = df.copy()
df_en.iloc[:, 1:11].dtypes

Gender                        object
Hubei Contact                 object
Permanent Residence           object
Case Announcement Province    object
Case Announcement City        object
Time                          object
Route Location                object
Event                         object
Description                   object
Track Number                  object
dtype: object

In [30]:
# perform translation : skip first and last column
translations = {}
i = 0


for column in df_en.iloc[:, 1:11].columns:
    # unique elements of the column to avoid translating same thing multiple times
    unique_elements = df_en[column].unique()
    print(unique_elements)
    for element in unique_elements:
        # add translation to the dictionary
        translations[element] = translator.translate(element).text
        i = i + 1
        print('t => {}'.format(i))



# debug translation completion
print('Done with translation')

Streaming output truncated to the last 5000 lines.
t => 3183
t => 3184
t => 3185
...
t => 3273
t => 3274
t => 3275

In [32]:
translations

{'女': 'Female',
 '男': 'male',
 '性别': 'gender',
 '有': 'Have',
 '未知': '未知',
 '武汉（湖北）接触史': 'Wuhan (Hubei) Contact History',
 '无': 'no',
 ' 泸县牛滩镇': 'Niutan Town, Lu County',
 '成都市新都区': 'Xindu District, Chengdu',
 '顺庆区延安路海润滨江小区': "Hairun Binjiang Community, Yan'an Road, Shunqing District",
 '叙州区南岸街道中山生态园小区': 'Zhongshan Ecological Park Community, Nanan Street, Xuzhou District',
 '攀枝花市东区': 'East District of Panzhihua City',
 '西昌': 'Xichang',
 '锦江区': 'Jinjiang District',
 '道孚县': 'Daofu County',
 '遂宁市蓬溪县鸣凤镇青杠村': 'Qinggan Village, Mingfeng Town, Pengxi County, Suining City',
 '成都市锦江区': 'Jinjiang District, Chengdu',
 '成都市金堂县': 'Jintang County, Chengdu',
 '成都市温江区': 'Wenjiang District, Chengdu',
 '成都市天府新区': 'Tianfu New District, Chengdu',
 '成都市武侯区': 'Wuhou District, Chengdu',
 '成都市金牛区': 'Jinniu District of Chengdu',
 '巴中市温江区': 'Wenjiang District, Bazhong City',
 '上海': 'Shanghai',
 '巴州区永泰花园东街7栋': 'Building 7, Yongtai Garden East Street, Bazhou District',
 '岩峰镇凤阳村': 'Fengyang Village, Yanfeng Town',


In [33]:
# modify all the terms of the data frame by using the previously created dictionary
df_en.replace(translations, inplace = True)

# check translation
df_en.head()

Unnamed: 0,Age,Gender,Hubei Contact,Permanent Residence,Case Announcement Province,Case Announcement City,Time,Route Location,Event,Description,Track Number,Number Of Tracks
0,51,Female,Have,"Niutan Town, Lu County",四川省,Luzhou City,January 19,"武汉(114.305215,30.592935)->泸县牛滩镇寿尊村(105.392379,...",Come back home,"On January 19th, Lei Mouying returned from Wuh...",轨迹1,4
1,51,Female,Have,"Niutan Town, Lu County",四川省,Luzhou City,January 23,,出现咳嗽等症状,"1月23日, 出现咳嗽、头痛等症状，自行服药后缓解",轨迹2,4
2,51,Female,Have,"Niutan Town, Lu County",四川省,Luzhou City,January 24,"泸县牛滩镇寿尊村(105.392379,29.097620)->牛滩镇卫生院(105.340...",检查,"On January 24, went to the Niutan Township Hos...",Track 3,4
3,51,Female,Have,"Niutan Town, Lu County",四川省,Luzhou City,2月19日,"泸州市传染病医院(105.316716,28.862516)",Admission and isolation,"On February 19, the diagnosis was confirmed by...",Track 4,4
4,50,male,Have,"Niutan Town, Lu County",四川省,Luzhou City,January 7,"武汉(114.305215,30.592935)->牛滩镇寿尊村(105.392379,29...",Come back home,"1月7日, 石某辉自驾从武汉返回到牛滩镇寿尊村10社家中",轨迹1,2


In [34]:
from google.colab import files

df.to_csv('covid.csv')
files.download('covid.csv')


In [35]:
from google.colab import files

df_en.to_csv('covid-en-0.5.csv')
files.download('covid-en-0.5.csv')


### Fixing translation bugs

Most of the cells in our dataframe were translated by the Google Translate API, but some cells were not translated properly. What we can do at this point is find the unique elements of each column and fix the untranslated ones manually with explicit replacements.
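One way to find the cells that still need attention is to search each column for Chinese characters. A minimal sketch on a toy frame (hypothetical data); the character range used is the basic CJK Unified Ideographs block, which is an assumption about what "untranslated" looks like in this dataset:

```python
import pandas as pd

toy_en = pd.DataFrame({
    "Case Announcement Province": ["四川省", "Anhui Province", "河南省"],
    "Track Number": ["轨迹1", "Track 3", "Track 4"],
})

# basic CJK Unified Ideographs block
cjk = r"[\u4e00-\u9fff]"

leftover = {}
for column in toy_en.columns:
    # na=False treats missing cells as "no match"
    mask = toy_en[column].str.contains(cjk, regex=True, na=False)
    leftover[column] = sorted(toy_en.loc[mask, column].unique())

print(leftover)
```

The resulting dictionary lists, per column, the distinct values that still contain Chinese text and therefore need a manual replacement.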

In [36]:
# fixing hubei contact translation

df_en['Hubei Contact'].unique()

array(['Have', '未知', 'Wuhan (Hubei) Contact History', 'no'], dtype=object)

In [37]:
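# map the remaining values to booleans: '未知' ("unknown") is treated as False,
# and the leftover header value 'Wuhan (Hubei) Contact History' as True
df_en['Hubei Contact'].replace('未知', False, inplace=True)
df_en['Hubei Contact'].replace('Have', True, inplace=True)
df_en['Hubei Contact'].replace('Wuhan (Hubei) Contact History', True, inplace=True)
df_en['Hubei Contact'].replace('no', False, inplace=True)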

# finally changing data type to boolean
df_en[["Hubei Contact"]] = df_en[["Hubei Contact"]].astype("bool")

df_en.dtypes

Age                            int64
Gender                        object
Hubei Contact                   bool
Permanent Residence           object
Case Announcement Province    object
Case Announcement City        object
Time                          object
Route Location                object
Event                         object
Description                   object
Track Number                  object
Number Of Tracks               int64
dtype: object
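The conversion above folds "unknown" into False. If that distinction matters for later analysis, one possible alternative is pandas' nullable boolean dtype, which can hold a missing value next to True and False. This is only a sketch on a toy Series (hypothetical data and names), not the approach used in this notebook:

```python
import pandas as pd

contact = pd.Series(["Have", "未知", "no", "Have"])

# "未知" means "unknown", so it is mapped to a missing value
mapping = {"Have": True, "no": False, "未知": pd.NA}

# "boolean" (lowercase string) is the nullable boolean dtype
contact_bool = contact.map(mapping).astype("boolean")

print(contact_bool)
print(contact_bool.sum())  # counts True values, skipping missing ones
```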

In [40]:
# fixing Case Announcement Province translation
df_en['Case Announcement Province'].unique()

array(['四川省', '病例公布省份', 'Anhui Province', 'Shandong Province',
       'Shanxi Province', 'Jiangsu Province', '河南省', 'Zhejiang Province',
       '海南省', 'Hunan Province', 'Guizhou Province', 'Chongqing'],
      dtype=object)

In [41]:
df_en['Case Announcement Province'].replace({'四川省': 'Sichuan Province', '病例公布省份': 'Others', '河南省': 'Henan Province', '海南省': 'Hainan Province'}, inplace=True)

In [42]:
df_en['Case Announcement Province'].unique()

array(['Sichuan Province', 'Others', 'Anhui Province',
       'Shandong Province', 'Shanxi Province', 'Jiangsu Province',
       'Henan Province', 'Zhejiang Province', 'Hainan Province',
       'Hunan Province', 'Guizhou Province', 'Chongqing'], dtype=object)

In [43]:
# fixing Gender translation
df_en['Gender'].unique()

array(['Female', 'male', 'gender'], dtype=object)

In [45]:
df_en['Gender'].replace({'Female': 'female', 'gender': 'others'}, inplace=True)
df_en['Gender'].unique()

array(['female', 'male', 'others'], dtype=object)

In [46]:
# fixing Case Announcement City
df_en['Case Announcement City'].unique()

array(['Luzhou City', 'Ganzi Tibetan Autonomous Prefecture', 'Chengdu',
       'Yibin', 'Panzhihua', 'Liangshan Yi Autonomous Prefecture',
       'Suining', 'Nanchong', '巴中市', 'Dazhou', 'Guangyuan', 'Neijiang',
       "Guang'an", 'Meishan', 'Mianyang', 'Ziyang', 'Deyang',
       "Ya'an city", 'Xichang', 'Case announcement city', 'Fuyang',
       "Lu'an", 'Maanshan', 'Hefei', 'Guangde City', 'Bengbu', 'Bozhou',
       'Huainan', 'Suzhou', 'Chizhou', 'Wuhu', 'Tongling City', 'Anqing',
       'Huaibei City', '宣城市', '济宁市', 'Dezhou', 'Zibo', 'Rizhao',
       'Qingdao', 'Zaozhuang', 'Weifang', 'Linyi City', 'Binzhou',
       'Jinan City', 'Yantai', 'Weihai', '聊城市', 'Heze', "Tai'an",
       'Shuozhou', 'Jinzhong', 'Taiyuan', 'Xinzhou', 'Datong City', '阳泉市',
       'Jincheng', 'Yuncheng', 'Changzhi', 'Xuzhou', 'Changzhou City',
       '宿迁市', 'Nanjing', 'Yancheng', 'Yixing', 'Haimen', 'Nantong city',
       'Suzhou City', '镇江市', 'Yangzhou', '安阳市', '许昌市', 'Sanmenxia',
       'Xinxiang', 'Luoyang

In [47]:
df_en['Case Announcement City'].replace(
    {
        '巴中市': 'Bazhong',
        'Case announcement city': 'Others',
        '宣城市': 'Xuancheng',
        '济宁市': 'Jining City',
        '聊城市': 'Liaocheng',
        '阳泉市': 'Yangquan City',
        '宿迁市': 'Suqian City',
        '镇江市': 'Zhenjiang',
        '安阳市': 'Anyang City',
        '许昌市': 'Xuchang City',
        '焦作市': 'Jiaozuo City',
        '嘉兴市': 'Jiaxing City',
        '海口市': 'Haikou',
        '昌江黎族自治县': 'Changjiang Li Autonomous County',
        '文昌市': 'Wenchang City',
        '陵水县': 'Lingshui County',
        '岳阳市': 'Yueyang City',
        '万州区': 'Wanzhou District',
        '云阳县': 'Yunyang County',
        '荣昌区': 'Rongchang District',
        '长寿区': 'Changshou District',
        '城口县': 'Chengkou County',
        '渝中区': 'Yuzhong District',
        '沙坪坝区': 'Shapingba District'
    }, 
inplace=True)

In [48]:
df_en['Case Announcement City'].unique()

array(['Luzhou City', 'Ganzi Tibetan Autonomous Prefecture', 'Chengdu',
       'Yibin', 'Panzhihua', 'Liangshan Yi Autonomous Prefecture',
       'Suining', 'Nanchong', 'Bazhong', 'Dazhou', 'Guangyuan',
       'Neijiang', "Guang'an", 'Meishan', 'Mianyang', 'Ziyang', 'Deyang',
       "Ya'an city", 'Xichang', 'Others', 'Fuyang', "Lu'an", 'Maanshan',
       'Hefei', 'Guangde City', 'Bengbu', 'Bozhou', 'Huainan', 'Suzhou',
       'Chizhou', 'Wuhu', 'Tongling City', 'Anqing', 'Huaibei City',
       'Xuancheng', 'Jining City', 'Dezhou', 'Zibo', 'Rizhao', 'Qingdao',
       'Zaozhuang', 'Weifang', 'Linyi City', 'Binzhou', 'Jinan City',
       'Yantai', 'Weihai', 'Liaocheng', 'Heze', "Tai'an", 'Shuozhou',
       'Jinzhong', 'Taiyuan', 'Xinzhou', 'Datong City', 'Yangquan City',
       'Jincheng', 'Yuncheng', 'Changzhi', 'Xuzhou', 'Changzhou City',
       'Suqian City', 'Nanjing', 'Yancheng', 'Yixing', 'Haimen',
       'Nantong city', 'Suzhou City', 'Zhenjiang', 'Yangzhou',
       'Anyang City', 

In [50]:
# fixing track numbers translation
df_en['Track Number'].unique()

array(['轨迹1', '轨迹2', 'Track 3', 'Track 4', 'Track 5', 'Track 6', '轨迹7',
       '轨迹8', '轨迹9', 'Track 10', 'Track 11', 'Track 12', '轨迹13',
       'Track 14', 'Track 15', '轨迹16', 'Track 17', '轨迹18', '轨迹19', '轨迹20',
       'Track 21', 'Track 22', '轨迹23', '轨迹24', 'Track 25', '轨迹26', '轨迹27',
       'Track number', '轨迹35'], dtype=object)

In [54]:
df_en['Track Number'].replace({
    '轨迹1': 'Track 1',
    '轨迹2': 'Track 2',
    '轨迹7': 'Track 7',
    '轨迹8': 'Track 8',
    '轨迹9': 'Track 9',
    '轨迹13': 'Track 13',
    '轨迹16': 'Track 16',
    '轨迹18': 'Track 18',
    '轨迹19': 'Track 19',
    '轨迹20': 'Track 20',
    '轨迹23': 'Track 23',
    '轨迹24': 'Track 24',
    '轨迹26': 'Track 26',
    '轨迹27': 'Track 27',
    '轨迹35': 'Track 35',
    'Track number': 'Track Others',
}, inplace=True)
df_en['Track Number'].unique()

array(['Track 1', 'Track 2', 'Track 3', 'Track 4', 'Track 5', 'Track 6',
       'Track 7', 'Track 8', 'Track 9', 'Track 10', 'Track 11',
       'Track 12', 'Track 13', 'Track 14', 'Track 15', 'Track 16',
       'Track 17', 'Track 18', 'Track 19', 'Track 20', 'Track 21',
       'Track 22', 'Track 23', 'Track 24', 'Track 25', 'Track 26',
       'Track 27', 'Track Others', 'Track 35'], dtype=object)
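Because the untranslated track labels all follow the same pattern ('轨迹' followed by a number), a single regular expression can replace the per-value dictionary. A minimal sketch on a toy Series (hypothetical data), using Series.str.replace with a capture group:

```python
import pandas as pd

track = pd.Series(["轨迹1", "Track 3", "轨迹13", "Track number", "轨迹35"])

# capture the digits after "轨迹" and reuse them after "Track "
track_en = track.str.replace(r"^轨迹(\d+)$", r"Track \1", regex=True)

print(track_en.tolist())
```

Values that do not match the pattern, such as the leftover header 'Track number', pass through unchanged and still need an explicit replacement.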

#### Download partially translated frame as CSV

In [55]:
from google.colab import files

df_en.to_csv('covid-en-0.8.csv')
files.download('covid-en-0.8.csv')


With this, the initial data wrangling and translation ends here. The rest of the data, which couldn't be translated here because of its sheer volume, will be translated next using Google Sheets tools.

😀😀😀

