# Feature Engineering

`Author: YUAN Yanzhe`

- This notebook applies feature engineering to the raw data and writes the resulting features to CSV for later modeling.

- The code was originally run in Google Colab
    - uses packages such as re and datetime to extract features from the date column
    - uses the wwo-hist API to fetch weather data
        - [wwo-hist github](https://github.com/ekapope/WorldWeatherOnline)
        - [wwo-hist tutorial](https://towardsdatascience.com/obtain-historical-weather-forecast-data-in-csv-format-using-python-5a6c090fc828)

## Load Data

In [None]:
import pandas as pd
import numpy as np
import re
from datetime import datetime

In [None]:
# load our original data
train_data = pd.read_csv('/content/drive/My Drive/5001_kaggle/train.csv')
test_data = pd.read_csv('/content/drive/My Drive/5001_kaggle/test.csv')
sub_data = pd.read_csv('/content/drive/My Drive/5001_kaggle/sampleSubmission.csv')
train_data

Unnamed: 0,id,date,speed
0,0,1/1/2017 0:00,43.002930
1,1,1/1/2017 1:00,46.118696
2,2,1/1/2017 2:00,44.294158
3,3,1/1/2017 3:00,41.067468
4,4,1/1/2017 4:00,46.448653
...,...,...,...
14001,14001,31/12/2018 12:00,19.865269
14002,14002,31/12/2018 15:00,17.820375
14003,14003,31/12/2018 16:00,12.501851
14004,14004,31/12/2018 18:00,15.979319


## Feature Engineering

### Obtain Additional Features

- Weather conditions can affect traffic, so I obtain weather data through the wwo-hist API
- The following features are selected to build weather_features
    - temperature
    - visibility
    - wind direction
    - wind speed
    - humidity
    - cloudcover
    - wind chill temp

In [None]:
!pip install wwo-hist

Collecting wwo-hist
  Downloading https://files.pythonhosted.org/packages/5a/b4/19a4d6a0d131567cf4b2ffa3758710d867f7d7d3f0c6f94bd63fadf1d02a/wwo_hist-0.0.5-py3-none-any.whl
Installing collected packages: wwo-hist
Successfully installed wwo-hist-0.0.5


In [None]:
from wwo_hist import retrieve_hist_data
import urllib.request

# download extra weather information

frequency = 24  # interval in hours between records, i.e. one record per day
start_date = '01-Jan-2017'
end_date = '31-DEC-2018'
api_key = '8b9dd8eb5f174b05b0a143348200412'
location_list = ['hongkong']

hist_weather_data = retrieve_hist_data(api_key, location_list, start_date, end_date, frequency=frequency, export_csv=True, store_df=True)



Retrieving weather data for hongkong


Currently retrieving data for hongkong: from 2017-01-01 to 2017-01-31
Time elapsed (hh:mm:ss.ms) 0:00:01.311524
Currently retrieving data for hongkong: from 2017-02-01 to 2017-02-28
Time elapsed (hh:mm:ss.ms) 0:00:02.252394
Currently retrieving data for hongkong: from 2017-03-01 to 2017-03-31
Time elapsed (hh:mm:ss.ms) 0:00:03.235293
Currently retrieving data for hongkong: from 2017-04-01 to 2017-04-30
Time elapsed (hh:mm:ss.ms) 0:00:04.215726
Currently retrieving data for hongkong: from 2017-05-01 to 2017-05-31
Time elapsed (hh:mm:ss.ms) 0:00:05.200638
Currently retrieving data for hongkong: from 2017-06-01 to 2017-06-30
Time elapsed (hh:mm:ss.ms) 0:00:06.207318
Currently retrieving data for hongkong: from 2017-07-01 to 2017-07-31
Time elapsed (hh:mm:ss.ms) 0:00:07.282019
Currently retrieving data for hongkong: from 2017-08-01 to 2017-08-31
Time elapsed (hh:mm:ss.ms) 0:00:08.293757
Currently retrieving data for hongkong: from 2017-09-01 to 2017

In [None]:
# add weather features
dateparse = lambda dates: datetime.strptime(dates, '%Y-%m-%d')
df_weather = pd.read_csv('/content/hongkong.csv', parse_dates=['date_time'], index_col='date_time', date_parser=dateparse)
weather_features = df_weather[['tempC','visibility','winddirDegree','windspeedKmph','humidity','cloudcover', 'WindChillC']]
weather_features

date_time,tempC,visibility,winddirDegree,windspeedKmph,humidity,cloudcover,WindChillC
2017-01-01,21,10,76,16,80,30,19
2017-01-02,22,10,77,10,80,11,21
2017-01-03,22,10,81,14,81,8,21
2017-01-04,22,10,80,15,82,8,21
2017-01-05,22,10,72,9,83,42,21
...,...,...,...,...,...,...,...
2018-12-27,22,10,39,12,66,43,22
2018-12-28,20,10,36,19,61,27,18
2018-12-29,14,10,21,20,62,44,12
2018-12-30,13,10,10,20,59,70,9
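Newer pandas releases (2.0 and later) deprecate the `date_parser` argument used above. Since the weather file stores ISO-style dates, `parse_dates` alone should be enough. The following is a minimal sketch on an in-memory CSV whose contents are made up for illustration; only the column layout follows the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for hongkong.csv (values are illustrative, not real output).
csv_text = "date_time,tempC,humidity\n2017-01-01,21,80\n2017-01-02,22,80\n"

# parse_dates converts the ISO-formatted column; index_col makes it the index.
df_weather = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date_time"],
    index_col="date_time",
)

# Rows can then be looked up by calendar day.
temp_on_jan_2 = df_weather.loc[pd.Timestamp("2017-01-02"), "tempC"]
```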


### Obtain Features from Raw Data

- The dataset contains only a date column, so I extract more information (i.e. more features) from the **date data**:
    - hour, day, weekday, month, year: parsed from the date string with re and datetime.strptime
    - is_holiday: whether the date is a public holiday
    - also tried is_traffic_time (9 a.m., 10 a.m., 6 p.m., 7 p.m.) and quarter
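As an alternative to parsing each string by hand, the same features can be sketched with pandas' vectorized datetime accessors. This is an illustration on a toy frame, not the notebook's own method: `holiday_dates` is a hypothetical two-entry subset of the holiday list, and unlike the string-based helpers the resulting columns are integers.

```python
import pandas as pd

# Toy frame in the same 'day/month/year hour:minute' layout as train.csv.
df = pd.DataFrame({"date": ["1/1/2017 0:00", "2/1/2017 9:00", "31/12/2018 18:00"]})

# Hypothetical subset of the public-holiday list, for illustration only.
holiday_dates = pd.to_datetime(["2017-01-02", "2018-12-25"])

# Parse once, then derive every feature from the parsed timestamps.
ts = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M")
df["hour"] = ts.dt.hour
df["day"] = ts.dt.day
df["weekday"] = ts.dt.weekday  # Monday = 0, Sunday = 6
df["month"] = ts.dt.month
df["year"] = ts.dt.year
df["quarter"] = ts.dt.quarter

# normalize() drops the time of day so that whole days are compared.
df["is_holiday"] = ts.dt.normalize().isin(holiday_dates).astype(int)
df["is_traffic_time"] = ts.dt.hour.isin([9, 10, 18, 19]).astype(int)
```

Parsing the column once also avoids repeating the string splitting in every helper function.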

In [None]:
# this block is just a test of the date parsing, ignore!
def get_holiday(s):
  s=str(s)
  parts = re.split(' ', s)
  days = str(datetime.strptime(parts[0],"%d/%m/%Y"))
  day = re.split(' ', days)[0]
  return day


train_data["holiday"] = train_data["date"].apply(lambda x: get_holiday(x))
train_data["holiday"]

0        2017-01-01
1        2017-01-01
2        2017-01-01
3        2017-01-01
4        2017-01-01
            ...    
14001    2018-12-31
14002    2018-12-31
14003    2018-12-31
14004    2018-12-31
14005    2018-12-31
Name: holiday, Length: 14006, dtype: object

In [None]:
# define some functions to preprocess our data
# (each takes a 'day/month/year hour:minute' string and returns a string)
def get_hour(s):
  parts = re.split(' |:', str(s))
  return parts[1]

def get_month(s):
  parts = re.split('/', str(s))
  return parts[1]

def get_year(s):
  parts = re.split('/| ', str(s))
  return parts[2]

def get_day(s):
  parts = re.split('/| ', str(s))
  return parts[0]

def get_week(s):
  # weekday index as a string: Monday = '0', ..., Sunday = '6'
  parts = re.split(' ', str(s))
  week = datetime.strptime(parts[0], "%d/%m/%Y").weekday()
  return str(week)


# public holidays in 'YYYY-MM-DD' form, the format get_holiday compares against
holidays = ['2017-01-02','2017-01-28','2017-01-30','2017-01-31','2017-04-04','2017-04-14','2017-04-15','2017-04-17','2017-05-01','2017-05-03','2017-05-30','2017-07-01','2017-10-02','2017-10-05','2017-10-28','2017-12-25','2017-12-26','2018-01-01','2018-02-16','2018-02-17','2018-02-18','2018-03-30','2018-03-31','2018-04-02','2018-04-05','2018-05-01','2018-05-22','2018-06-18','2018-07-02','2018-09-25','2018-10-01','2018-10-17','2018-12-25','2018-12-26']

def get_holiday(s):
  # return 1 if the date is in the holidays list above, else 0
  date_part = str(s).split(' ')[0]
  day = datetime.strptime(date_part, "%d/%m/%Y").strftime("%Y-%m-%d")
  if day in holidays:
    return 1
  return 0

In [None]:
train_data["hour"] = train_data["date"].apply(lambda x : get_hour(x))
train_data["month"] = train_data["date"].apply(lambda x : get_month(x))
train_data["day"] = train_data["date"].apply(lambda x : get_day(x))
train_data["year"] = train_data["date"].apply(lambda x : get_year(x))
train_data["weekday"] = train_data["date"].apply(lambda x: get_week(x))
train_data["holiday"] = train_data["date"].apply(lambda x: get_holiday(x))
#train_fixed = train_data.drop(["date"], axis=1, inplace=False)
#train_data["hour"].describe()
#train_data["weekday"].describe()
#train_data.head(100)
train_data

Unnamed: 0,id,date,speed,holiday,hour,month,day,year,weekday
0,0,1/1/2017 0:00,43.002930,0,0,1,1,2017,6
1,1,1/1/2017 1:00,46.118696,0,1,1,1,2017,6
2,2,1/1/2017 2:00,44.294158,0,2,1,1,2017,6
3,3,1/1/2017 3:00,41.067468,0,3,1,1,2017,6
4,4,1/1/2017 4:00,46.448653,0,4,1,1,2017,6
...,...,...,...,...,...,...,...,...,...
14001,14001,31/12/2018 12:00,19.865269,0,12,12,31,2018,0
14002,14002,31/12/2018 15:00,17.820375,0,15,12,31,2018,0
14003,14003,31/12/2018 16:00,12.501851,0,16,12,31,2018,0
14004,14004,31/12/2018 18:00,15.979319,0,18,12,31,2018,0


### Concatenate All Features

In [None]:
# process weather data 
tempC = []
visibility = []
winddirDegree = []
windspeedKmph = []
humidity = []
cloudcover = []
WindChillC = []

for i in train_data['date']:
  # convert the date string to a datetime (time of day dropped)
  string_date = str(i)
  string_date=re.split(" ",string_date)[0]
  date = datetime.strptime(string_date, '%d/%m/%Y')
  tempC.append(weather_features['tempC'][date])
  visibility.append(weather_features['visibility'][date])
  winddirDegree.append(weather_features['winddirDegree'][date])
  windspeedKmph.append(weather_features['windspeedKmph'][date])
  humidity.append(weather_features['humidity'][date])
  cloudcover.append(weather_features['cloudcover'][date])
  WindChillC.append(weather_features['WindChillC'][date])

train_data['tempC'] = tempC
train_data['visibility'] = visibility
train_data['winddirDegree'] = winddirDegree
train_data['windspeedKmph'] = windspeedKmph
train_data['humidity'] = humidity
train_data['cloudcover'] = cloudcover
train_data['WindChillC'] = WindChillC
train_data

Unnamed: 0,id,date,speed,holiday,hour,month,day,year,weekday,tempC,visibility,winddirDegree,windspeedKmph,humidity,cloudcover,WindChillC
0,0,1/1/2017 0:00,43.002930,0,0,1,1,2017,6,21,10,76,16,80,30,19
1,1,1/1/2017 1:00,46.118696,0,1,1,1,2017,6,21,10,76,16,80,30,19
2,2,1/1/2017 2:00,44.294158,0,2,1,1,2017,6,21,10,76,16,80,30,19
3,3,1/1/2017 3:00,41.067468,0,3,1,1,2017,6,21,10,76,16,80,30,19
4,4,1/1/2017 4:00,46.448653,0,4,1,1,2017,6,21,10,76,16,80,30,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14001,14001,31/12/2018 12:00,19.865269,0,12,12,31,2018,0,12,10,138,18,69,79,10
14002,14002,31/12/2018 15:00,17.820375,0,15,12,31,2018,0,12,10,138,18,69,79,10
14003,14003,31/12/2018 16:00,12.501851,0,16,12,31,2018,0,12,10,138,18,69,79,10
14004,14004,31/12/2018 18:00,15.979319,0,18,12,31,2018,0,12,10,138,18,69,79,10
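The per-row lookup above can also be expressed as a single join on the calendar day. The sketch below uses toy stand-ins for `train_data` and `weather_features`, with made-up values and only two weather columns. It shows one possible alternative, not the approach the notebook actually ran.

```python
import pandas as pd

# Toy stand-ins; values are illustrative, column layout follows the notebook.
train = pd.DataFrame({
    "date": ["1/1/2017 0:00", "1/1/2017 1:00", "2/1/2017 0:00"],
    "speed": [43.0, 46.1, 40.2],
})
weather = pd.DataFrame(
    {"tempC": [21, 22], "humidity": [80, 80]},
    index=pd.to_datetime(["2017-01-01", "2017-01-02"]),
)
weather.index.name = "date_time"

# Calendar day of each traffic observation (time of day dropped).
day = pd.to_datetime(train["date"], format="%d/%m/%Y %H:%M").dt.normalize()

# A left join keeps every traffic row; a day with no weather record would get NaN.
merged = (
    train.assign(date_time=day)
    .merge(weather.reset_index(), on="date_time", how="left")
    .drop(columns="date_time")
)
```

A join like this attaches all weather columns at once, so the seven parallel lists are no longer needed.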


**I tried one-hot encoding features such as hour and date to increase the feature dimensionality, but the results were worse than without one-hot encoding**

In [None]:
#pd.get_dummies(train_data[["hour","weekday"]])

Unnamed: 0,hour_0,hour_1,hour_10,hour_11,hour_12,hour_13,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_2,hour_20,hour_21,hour_22,hour_23,hour_3,hour_4,hour_5,hour_6,hour_7,hour_8,hour_9,weekday_0,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14001,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
14002,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
14003,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
14004,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
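One practical detail of the one-hot attempt is that the train and test frames can end up with different dummy columns, for example when a year occurs only in the training data. A minimal sketch of one way to align them follows, assuming the features are stored as integers; the toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy frames: the test set lacks hour 0 and year 2017.
train = pd.DataFrame({"hour": [0, 2, 10], "year": [2017, 2018, 2018]})
test = pd.DataFrame({"hour": [2, 10], "year": [2018, 2018]})

cols = ["hour", "year"]
# Integer inputs give numerically ordered columns (hour_0, hour_2, hour_10).
train_oh = pd.get_dummies(train, columns=cols, dtype=int)
test_oh = pd.get_dummies(test, columns=cols, dtype=int)

# Give the test frame exactly the training columns, in the same order;
# categories absent from the test set (hour_0, year_2017) become all zeros.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)
```

With string-typed features, as in this notebook, the dummy columns sort lexicographically instead (hour_0, hour_1, hour_10, ...), which is the order seen in the output above.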


In [None]:
#pd.get_dummies(train_fixed[["hour","month","day","year"]])
#train_clean = pd.concat([train_data, pd.get_dummies(train_data[["hour","weekday","month","day","year"]])], axis=1)
#train_clean = train_clean.drop(["hour","weekday","month","day","date","year"], axis=1)
#features = train_clean.loc[:, "hour_0":"day_9"]
#train_clean.loc[:, "hour_0":"day_9"] = (features-features.mean())/features.std()
train_clean = train_data.drop(["date"], axis=1)
train_clean

Unnamed: 0,id,speed,holiday,hour,month,day,year,weekday,tempC,visibility,winddirDegree,windspeedKmph,humidity,cloudcover,WindChillC
0,0,43.002930,0,0,1,1,2017,6,21,10,76,16,80,30,19
1,1,46.118696,0,1,1,1,2017,6,21,10,76,16,80,30,19
2,2,44.294158,0,2,1,1,2017,6,21,10,76,16,80,30,19
3,3,41.067468,0,3,1,1,2017,6,21,10,76,16,80,30,19
4,4,46.448653,0,4,1,1,2017,6,21,10,76,16,80,30,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14001,14001,19.865269,0,12,12,31,2018,0,12,10,138,18,69,79,10
14002,14002,17.820375,0,15,12,31,2018,0,12,10,138,18,69,79,10
14003,14003,12.501851,0,16,12,31,2018,0,12,10,138,18,69,79,10
14004,14004,15.979319,0,18,12,31,2018,0,12,10,138,18,69,79,10


In [None]:
#save train data with features
train_clean.to_csv("/content/drive/My Drive/5001_kaggle/train_cleaned_data7.csv", index=False)

### The Same For Test Data

In [None]:
test_data["hour"] = test_data["date"].apply(lambda x : get_hour(x))
test_data["month"] = test_data["date"].apply(lambda x : get_month(x))
test_data["day"] = test_data["date"].apply(lambda x : get_day(x))
test_data["year"] = test_data["date"].apply(lambda x : get_year(x))
test_data["weekday"] = test_data["date"].apply(lambda x: get_week(x))
test_data["holiday"] = test_data["date"].apply(lambda x: get_holiday(x))


#train_fixed = train_data.drop(["date"], axis=1, inplace=False)
test_data

Unnamed: 0,id,date,hour,month,day,year,weekday,holiday
0,0,1/1/2018 2:00,2,1,1,2018,0,1
1,1,1/1/2018 5:00,5,1,1,2018,0,1
2,2,1/1/2018 7:00,7,1,1,2018,0,1
3,3,1/1/2018 8:00,8,1,1,2018,0,1
4,4,1/1/2018 10:00,10,1,1,2018,0,1
...,...,...,...,...,...,...,...,...
3499,3499,31/12/2018 17:00,17,12,31,2018,0,0
3500,3500,31/12/2018 19:00,19,12,31,2018,0,0
3501,3501,31/12/2018 21:00,21,12,31,2018,0,0
3502,3502,31/12/2018 22:00,22,12,31,2018,0,0


In [None]:
tempC = []
visibility = []
winddirDegree = []
windspeedKmph = []
humidity = []
cloudcover = []
WindChillC = []

for i in test_data['date']:
  # convert the date string to a datetime (time of day dropped)
  string_date = str(i)
  string_date=re.split(" ",string_date)[0]
  date = datetime.strptime(string_date, '%d/%m/%Y')
  tempC.append(weather_features['tempC'][date])
  visibility.append(weather_features['visibility'][date])
  winddirDegree.append(weather_features['winddirDegree'][date])
  windspeedKmph.append(weather_features['windspeedKmph'][date])
  humidity.append(weather_features['humidity'][date])
  cloudcover.append(weather_features['cloudcover'][date])
  WindChillC.append(weather_features['WindChillC'][date])

test_data['tempC'] = tempC
test_data['visibility'] = visibility
test_data['winddirDegree'] = winddirDegree
test_data['windspeedKmph'] = windspeedKmph
test_data['humidity'] = humidity
test_data['cloudcover'] = cloudcover
test_data['WindChillC'] = WindChillC
test_data

Unnamed: 0,id,date,hour,month,day,year,weekday,holiday,speed,tempC,visibility,winddirDegree,windspeedKmph,humidity,cloudcover,WindChillC
0,0,1/1/2018 2:00,2,1,1,2018,0,1,0,19,10,65,12,63,23,18
1,1,1/1/2018 5:00,5,1,1,2018,0,1,0,19,10,65,12,63,23,18
2,2,1/1/2018 7:00,7,1,1,2018,0,1,0,19,10,65,12,63,23,18
3,3,1/1/2018 8:00,8,1,1,2018,0,1,0,19,10,65,12,63,23,18
4,4,1/1/2018 10:00,10,1,1,2018,0,1,0,19,10,65,12,63,23,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3499,3499,31/12/2018 17:00,17,12,31,2018,0,0,0,12,10,138,18,69,79,10
3500,3500,31/12/2018 19:00,19,12,31,2018,0,0,0,12,10,138,18,69,79,10
3501,3501,31/12/2018 21:00,21,12,31,2018,0,0,0,12,10,138,18,69,79,10
3502,3502,31/12/2018 22:00,22,12,31,2018,0,0,0,12,10,138,18,69,79,10


In [None]:
test_data["speed"] = 0
#test_clean = pd.concat([test_data, pd.get_dummies(test_data[["hour","weekday","month","day","year"]])], axis=1)
#test_clean = test_clean.drop(["hour","weekday","month","day","date","year"], axis=1)
test_clean = test_data.drop(["date"], axis=1)
#test_clean["year_2017"] = 0

#features = test_clean.loc[:, "hour_0":"day_9"]
#test_clean.loc[:, "hour_0":"day_9"] = (features-features.mean())/features.std()
test_clean

Unnamed: 0,id,hour,month,day,year,weekday,holiday,speed,tempC,visibility,winddirDegree,windspeedKmph,humidity,cloudcover,WindChillC
0,0,2,1,1,2018,0,1,0,19,10,65,12,63,23,18
1,1,5,1,1,2018,0,1,0,19,10,65,12,63,23,18
2,2,7,1,1,2018,0,1,0,19,10,65,12,63,23,18
3,3,8,1,1,2018,0,1,0,19,10,65,12,63,23,18
4,4,10,1,1,2018,0,1,0,19,10,65,12,63,23,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3499,3499,17,12,31,2018,0,0,0,12,10,138,18,69,79,10
3500,3500,19,12,31,2018,0,0,0,12,10,138,18,69,79,10
3501,3501,21,12,31,2018,0,0,0,12,10,138,18,69,79,10
3502,3502,22,12,31,2018,0,0,0,12,10,138,18,69,79,10


In [None]:
#test_clean[['year_2018','year_2017']] = test_clean[['year_2017','year_2018']]
#test_clean.rename(columns={'year_2018':'year_2017','year_2017':'year_2018'}, inplace=True)
#test_clean

Unnamed: 0,id,holiday,speed,hour_0,hour_1,hour_10,hour_11,hour_12,hour_13,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_2,hour_20,hour_21,hour_22,hour_23,hour_3,hour_4,hour_5,hour_6,hour_7,hour_8,hour_9,weekday_0,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,month_1,month_10,month_11,month_12,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,day_1,day_10,day_11,day_12,day_13,day_14,day_15,day_16,day_17,day_18,day_19,day_2,day_20,day_21,day_22,day_23,day_24,day_25,day_26,day_27,day_28,day_29,day_3,day_30,day_31,day_4,day_5,day_6,day_7,day_8,day_9,year_2017,year_2018
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,4,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3499,3499,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
3500,3500,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
3501,3501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
3502,3502,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1


In [None]:
test_clean.to_csv("/content/drive/My Drive/5001_kaggle/test_cleaned_data7.csv", index=False)