# Preprocessing of the Engineered Music Recommendation System Data

**Submitted By:** Debojyoti Choudhury

**Enrollment Number:** 40AIML137-21/2

**Programme:** Diploma in AI and ML

The purpose of this notebook is to preprocess the engineered data into model-ready data.

## Importing necessary libraries

In [1]:
import numpy as np                                  # for fast numerical array operations
import pandas as pd                                 # data processing, CSV file I/O
import matplotlib.pyplot as plt                     # for data visualization
import seaborn as sns                               # for data visualization
import missingno                                    # for missing values visualization
import gc                                           # utility to free up the memory
from sklearn.preprocessing import StandardScaler    # for scaling the continuous features (standardizing) 
from sklearn.preprocessing import LabelEncoder      # for ordinal encoding of the categorical features 
from sklearn.preprocessing import MinMaxScaler      # for scaling the continuous features (min-max)

## Read the engineered data

In [2]:
train_data = pd.read_csv("../Data/engineered_data/fe_train_data.csv.gz", compression='gzip')

In [3]:
test_data = pd.read_csv("../Data/engineered_data/fe_test_data.csv.gz", compression='gzip')

In [4]:
validation_data = pd.read_csv("../Data/engineered_data/fe_validation_data.csv.gz", compression='gzip')

In [5]:
train_data.head()

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,registration_year,...,song_year,member_song_count,artist_song_count,composer_song_count,lyricist_song_count,genre_song_count,lang_song_count,song_member_count,age_song_count,target
0,XyqTyQdDkjyvyVlF9HQEQMC7jpTt6uhfP0OElaBLckE=,IXP1a2o3NL8WU4WK1X0WJAKaSW+LgGRpPDn4Gt1HrV8=,discover,others,online-playlist,2,28,male,9,2015,...,2016.0,836,57040,36465,37907,3814444,4044643,6033,232548,0
1,LRvwVwXrRasztB6Ujf84G1E1M3PtDd6YfzLEsqTyuPo=,Ru7n8Xw2s8LGDsgDhyzWqCWQRWQW9KNPY9qMOFAf5x0=,my library,Local playlist more,local-library,6,0,gender_missing,7,2013,...,2016.0,389,22567,15809,14400,329134,1864789,3488,2948104,1
2,cRMes5eNsNChPwtNTj1+42usZO8yH69BkfC9px884yc=,U8VgelI8G461f4FVJHnavdn4zx7XjmSFcNHZ3Hru3TU=,my library,Local playlist more,local-playlist,6,0,gender_missing,7,2015,...,2014.0,574,22118,13466,14417,3814444,4044643,5729,2948104,1
3,LfhLJImWDCzeEf93gxaGT/IXqrvasxsZfOwVEhFmHw0=,JiY2ckGkra5KGchu3Rh6KPH9FL2J0tTmAB3pGyVi4BQ=,discover,Online playlist more,online-playlist,4,28,male,9,2007,...,2016.0,425,30798,27417,9600,1266910,4044643,5820,232548,0
4,Qyq0USrm2n4w9QqZr/Iuuv24IETeO6lgVEhNrs0HHwI=,uNNIh/p+Ucr+CQBnmtVQFi9OMloGqLgnHe1mXggKle4=,search,others,others,2,18,female,4,2016,...,2016.0,512,303616,259,3178798,400596,1864789,249,124868,0


In [6]:
validation_data.head()

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,registration_year,...,song_year,member_song_count,artist_song_count,composer_song_count,lyricist_song_count,genre_song_count,lang_song_count,song_member_count,age_song_count,target
0,0JLOaW5BdEelTbeS4ZJkuuDVr0NZPP+SG+zcxG/5qHo=,RMv6xdzVwGLCucAtyRR7h5LWV2HKXjJMAtEL2pBeBG4=,my library,Local playlist more,local-library,6,0,gender_missing,9,2013,...,2013.0,822,26759,189,3178798,1266910,78621,143,2948104,0
1,crR6QHSbA22COjylVND+tTjvk25uQPW7QXOLZo/PBMg=,OAkciD8vFKjj3vNPxHd+QiPxVYPfq+IGHQEMEGz/JxM=,search,others,others,0,27,male,7,2012,...,2004.0,463,10197,558,3178798,3814444,4044643,84,253065,1
2,FgE2jvvDEMVNw4QWqdLjyIpAkgDqtRWLevWP/FP9Gj8=,6cl6GdZmBS8/qMJadkQJNM3c+NSiyXZmNIF3//QIDj8=,my library,Local playlist more,local-library,6,0,gender_missing,4,2017,...,2016.0,54,20731,336,411,59284,245136,247,2948104,1
3,lPJxYddCsyfyph9qB1Ni/MSPrIWG+P4ADUPrCaJTs0M=,ah833V1/nMgO+NNxtUpKn8m5bc9nCPi7AHV77/k1zNA=,my library,Local playlist more,local-playlist,1,26,male,3,2015,...,2001.0,527,13274,1665,3178798,400596,245136,16,257869,1
4,oxa+4yk7Lj3MvxFsP7wbTv4BGSKUxBQB7wHeqa0kWbM=,CkcGnJJYzv1tJVVEI0UlDBVuK1+5BH1Rk8PCuwcrkg8=,my library,Local playlist more,local-library,5,38,female,9,2011,...,2016.0,498,20617,9873,2341,3814444,1864789,2134,75166,1


In [7]:
test_data.head()

Unnamed: 0,id,msno,song_id,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,...,registration_code,song_year,member_song_count,artist_song_count,composer_song_count,lyricist_song_count,genre_song_count,lang_song_count,song_member_count,age_song_count
0,0,V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=,WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=,my library,others,local-library,6,0,gender_missing,7,...,UM7,2014.0,17,4009,205,1224744,384869,1311328,196,1046511
1,1,V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=,y/rsZ9DC7FwK5F2PK2D5mj+aOBUJAjuu3dZ14NgE0vM=,my library,Local playlist more,local-library,6,0,gender_missing,7,...,B67,2010.0,17,31571,25629,1542,1248185,1311328,1479,1046511
2,2,/uQAlrAkaczV+nWCd2sPF2ekvXPRipV7q0l+gbLuxjw=,8eZLFOdGVdXBSqoAv5nsLigeH2BvKXzTQYtUM53I0k4=,discover,Local playlist more,others,6,0,gender_missing,4,...,WP0,2010.0,1,149,125,125,56521,84227,2,1046511
3,3,1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=,ztCf8thYsS4YN3GcIL/bvoxLm/T5mYBVKOO4C9NiVfQ=,radio,Local playlist more,radio,0,30,male,9,...,AAN,2002.0,270,225,40,1224744,1248185,709024,14,68238
4,4,1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=,MKVMpslKcQhMaFEgcEQhEfi5+RZhMYlU3eRDpySrH8Y=,radio,others,radio,0,30,male,9,...,O10,2011.0,270,80,1895,1224744,2893,116856,3,68238


## Check for missing values

The following columns have missing values:

In [8]:
features_with_missing_values = train_data.columns[train_data.isna().any()].tolist()

features_with_missing_values

['first_artist_name',
 'first_lyricist',
 'first_composer',
 'country_code',
 'registration_code',
 'song_year']

In [9]:
def fill_missing_values(data):
    '''Fill missing categorical values with explicit placeholder labels.'''
    placeholders = {
        'first_artist_name': 'no_artist_name',
        'first_lyricist': 'no_lyricist',
        'first_composer': 'no_composer',
        'country_code': 'no_country_code',
        'registration_code': 'no_registration_code',
    }
    # Assign the result back instead of calling fillna(inplace=True) on a column,
    # which may not update the DataFrame in recent pandas versions
    for column, placeholder in placeholders.items():
        data[column] = data[column].fillna(placeholder)
    return data

In [10]:
train_data = fill_missing_values(train_data)
test_data = fill_missing_values(test_data)
validation_data = fill_missing_values(validation_data)

In [11]:
rep_song_year = train_data['song_year'].median()

In [12]:
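# Fill all three datasets with the median computed on the training data only
train_data['song_year'] = train_data['song_year'].fillna(rep_song_year)
test_data['song_year'] = test_data['song_year'].fillna(rep_song_year)
validation_data['song_year'] = validation_data['song_year'].fillna(rep_song_year)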

In [13]:
train_data.columns[train_data.isna().any()].tolist()

[]

In [14]:
test_data.columns[test_data.isna().any()].tolist()

[]

In [15]:
validation_data.columns[validation_data.isna().any()].tolist()

[]
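
The imputation steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual cells: the frames `train` and `test`, their values, and the helper `impute` are hypothetical. It shows placeholder labels for a categorical column and a median taken from the training frame only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the real datasets
train = pd.DataFrame({"first_composer": ["a", None], "song_year": [2010.0, np.nan]})
test = pd.DataFrame({"first_composer": [None, "b"], "song_year": [np.nan, 2001.0]})

# Placeholder label per categorical column (same idea as fill_missing_values)
placeholders = {"first_composer": "no_composer"}

# The representative year comes from the training frame only
train_median = train["song_year"].median()


def impute(frame, placeholders, year_fill):
    """Return a copy with placeholders and the training median filled in."""
    return frame.fillna({**placeholders, "song_year": year_fill})


train_filled = impute(train, placeholders, train_median)
test_filled = impute(test, placeholders, train_median)
```

Reusing the training median for the other datasets keeps their statistics out of the preprocessing step.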

## Preprocess the data

The preprocessing is done as follows:

- Continuous features are scaled with the standard scaler.
  If $x$ is a feature with mean $\bar{x}$ and standard deviation $\sigma_x$, the scaled feature is:
  $$x := \frac{x - \bar{x}}{\sigma_x}$$
  This type of scaling does not work well when the feature contains outliers or is highly skewed.

- Highly skewed continuous features are scaled with the min-max scaler instead.
  If $x$ is a feature with minimum $x_{min}$ and maximum $x_{max}$, the scaled feature is:
  $$x := \frac{x - x_{min}}{x_{max} - x_{min}}$$
  This ensures the scaled feature lies between 0 and 1 on the data used to fit the scaler.
 
- Categorical features are encoded with the label encoder.
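
The two scaling formulas above can be checked against scikit-learn on a small array. This is only a sketch: the values in `x` are hypothetical and do not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy feature with one large value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

# Standard scaling: (x - mean) / std, where std is the population
# standard deviation (ddof=0), which is also NumPy's default
standardized = StandardScaler().fit_transform(x)
manual_standardized = (x - x.mean()) / x.std()

# Min-max scaling: (x - min) / (max - min)
min_maxed = MinMaxScaler().fit_transform(x)
manual_min_maxed = (x - x.min()) / (x.max() - x.min())
```

Both scalers expect a 2-D array, which is why the single feature is reshaped into one column.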

### Selecting object type and numerical type features

In [16]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5901934 entries, 0 to 5901933
Data columns (total 42 columns):
 #   Column               Dtype  
---  ------               -----  
 0   msno                 object 
 1   song_id              object 
 2   source_system_tab    object 
 3   source_screen_name   object 
 4   source_type          object 
 5   city                 int64  
 6   bd                   int64  
 7   gender               object 
 8   registered_via       int64  
 9   registration_year    int64  
 10  registration_month   int64  
 11  registration_day     int64  
 12  expiration_year      int64  
 13  expiration_month     int64  
 14  expiration_day       int64  
 15  membership_days      int64  
 16  song_length          float64
 17  first_genre_id       float64
 18  second_genre_id      float64
 19  third_genre_id       float64
 20  genre_ids_count      float64
 21  artist_name          object 
 22  language             int64  
 23  is_featured          int64  
 24

In [17]:
object_type_features = train_data.select_dtypes(include=['object']).columns.to_list()

In [18]:
numerical_type_features = train_data.select_dtypes(exclude=['object']).columns.to_list()

### Preprocessing the categorical features

There are 12 object-type (categorical) columns:

In [19]:
object_type_features

['msno',
 'song_id',
 'source_system_tab',
 'source_screen_name',
 'source_type',
 'gender',
 'artist_name',
 'first_artist_name',
 'first_lyricist',
 'first_composer',
 'country_code',
 'registration_code']

In [20]:
train_data[object_type_features].nunique()

msno                   30540
song_id               324549
source_system_tab          5
source_screen_name         5
source_type                7
gender                     3
artist_name            37773
first_artist_name      36342
first_lyricist         22843
first_composer         45434
country_code             105
registration_code       6118
dtype: int64

`song_id` is not considered as a feature because its cardinality is very high (324,549 unique values).

Hence, we shall consider the following categorical features for label encoding.

In [21]:
cat_features = ['msno',
 'source_system_tab',
 'source_screen_name',
 'source_type',
 'gender',
 'artist_name',
 'first_artist_name',
 'first_lyricist',
 'first_composer',
 'country_code',
 'registration_code']

In [22]:
pd.set_option('mode.chained_assignment', None)
for feature in cat_features:
    print(f"Working on feature: {feature}")
    # Fit on the categories of all three datasets so that every value can be transformed
    # (Series.append was removed in pandas 2.0, so pd.concat is used instead)
    combined = pd.concat([train_data[feature], validation_data[feature], test_data[feature]])
    le = LabelEncoder().fit(combined.unique())
    # LabelEncoder expects 1-D input; a column vector triggers a DataConversionWarning
    train_data[feature] = le.transform(train_data[feature])
    validation_data[feature] = le.transform(validation_data[feature])
    test_data[feature] = le.transform(test_data[feature])

Working on feature: msno
Working on feature: source_system_tab
Working on feature: source_screen_name
Working on feature: source_type
Working on feature: gender
Working on feature: artist_name
Working on feature: first_artist_name
Working on feature: first_lyricist
Working on feature: first_composer
Working on feature: country_code
Working on feature: registration_code
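
The encoding loop above can be illustrated on a single toy column. This is a minimal sketch with hypothetical series `train`, `validation` and `test`. The encoder is fitted on the union of the three so that no value is unseen at transform time.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values for one categorical feature
train = pd.Series(["male", "female", "gender_missing"])
validation = pd.Series(["female", "male"])
test = pd.Series(["gender_missing", "male"])

# Fit on every category that appears in any of the three series
le = LabelEncoder()
le.fit(pd.concat([train, validation, test]).to_numpy())

# Classes are sorted, so each label maps to its position in le.classes_
train_codes = le.transform(train)
test_codes = le.transform(test)
```

`LabelEncoder.transform` raises an error for a label it was not fitted on, which is the reason for fitting on the combined values here.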


### Preprocessing the numerical features

There are 30 numerical columns:

In [42]:
numerical_type_features

['city',
 'bd',
 'registered_via',
 'registration_year',
 'registration_month',
 'registration_day',
 'expiration_year',
 'expiration_month',
 'expiration_day',
 'membership_days',
 'song_length',
 'first_genre_id',
 'second_genre_id',
 'third_genre_id',
 'genre_ids_count',
 'language',
 'is_featured',
 'artist_count',
 'lyricist_count',
 'composer_count',
 'song_year',
 'member_song_count',
 'artist_song_count',
 'composer_song_count',
 'lyricist_song_count',
 'genre_song_count',
 'lang_song_count',
 'song_member_count',
 'age_song_count',
 'target']

In [27]:
train_data[numerical_type_features].nunique()

city                       7
bd                        68
registered_via             5
registration_year         14
registration_month        12
registration_day          31
expiration_year           18
expiration_month          12
expiration_day            31
membership_days         4319
song_length            55399
first_genre_id           156
second_genre_id           82
third_genre_id            35
genre_ids_count            8
language                  11
is_featured                2
artist_count              10
lyricist_count            23
composer_count            27
song_year                100
member_song_count       1564
artist_song_count       1619
composer_song_count     1641
lyricist_song_count     1324
genre_song_count         125
lang_song_count           11
song_member_count       1798
age_song_count            68
target                     2
dtype: int64

We shall use the following features for scaling:

In [59]:
numerical_features = ['bd',
 'registered_via',
 'membership_days',
 'song_length',
 'composer_count',
 'member_song_count',
 'artist_song_count',
 'composer_song_count',
 'lyricist_song_count',
 'genre_song_count',
 'lang_song_count',
 'song_member_count',
 'age_song_count']

In [62]:
for f in numerical_features:
    print(f"Working on feature: {f}")
    # Min-max scaling for highly skewed features (|skewness| >= 1), standard scaling otherwise
    s = np.abs(train_data[f].skew())
    if s >= 1.0:
        scaler = MinMaxScaler()
    else:
        scaler = StandardScaler()
    # Fit on the training data only, then apply the same transform to all three datasets
    scaler = scaler.fit(train_data[f].values.reshape(-1,1))
    train_data[f] = scaler.transform(train_data[f].values.reshape(-1,1))
    validation_data[f] = scaler.transform(validation_data[f].values.reshape(-1,1))
    test_data[f] = scaler.transform(test_data[f].values.reshape(-1,1))

Working on feature: bd
Working on feature: registered_via
Working on feature: membership_days
Working on feature: song_length
Working on feature: composer_count
Working on feature: member_song_count
Working on feature: artist_song_count
Working on feature: composer_song_count
Working on feature: lyricist_song_count
Working on feature: genre_song_count
Working on feature: lang_song_count
Working on feature: song_member_count
Working on feature: age_song_count
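
The skewness rule used in the loop above can be sketched as a small helper. The function name `choose_scaler`, its `skew_threshold` parameter, and the toy series are assumptions made for illustration. The threshold of 1.0 mirrors the loop.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def choose_scaler(values: pd.Series, skew_threshold: float = 1.0):
    """Return MinMaxScaler when |skewness| >= threshold, else StandardScaler."""
    if abs(values.skew()) >= skew_threshold:
        return MinMaxScaler()
    return StandardScaler()


# Hypothetical toy features: one symmetric, one with a long right tail
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 50.0])

symmetric_scaler = choose_scaler(symmetric)
skewed_scaler = choose_scaler(skewed)

# Fit on the "training" values only, then reuse the fitted scaler on unseen values
skewed_scaler.fit(skewed.to_numpy().reshape(-1, 1))
unseen = pd.Series([2.0, 99.0])
unseen_scaled = skewed_scaler.transform(unseen.to_numpy().reshape(-1, 1)).ravel()
```

Because the scaler is fitted on the training values, an unseen value above the training maximum is mapped above 1, which can also happen for the validation and test sets.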


Dealing with year-based features: each year column is centered on its training-set median.

In [70]:
for f in numerical_type_features:
    if 'year' in f:
        print(f"Working on feature: {f}")
        mid_year = train_data[f].median()
        train_data[f] = (train_data[f] - mid_year).astype('int')
        validation_data[f] = (validation_data[f] - mid_year).astype('int')
        test_data[f] = (test_data[f] - mid_year).astype('int')

Working on feature: registration_year
Working on feature: expiration_year
Working on feature: song_year
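
The year centering above can be sketched on toy values. The series `train_years` and `test_years` are hypothetical. The median is taken from the training values and subtracted everywhere.

```python
import pandas as pd

# Hypothetical toy year columns
train_years = pd.Series([2010.0, 2014.0, 2016.0])
test_years = pd.Series([2002.0, 2016.0])

# Median of the training values only
mid_year = train_years.median()

# Offsets relative to the training median, stored as integers
train_centered = (train_years - mid_year).astype("int")
test_centered = (test_years - mid_year).astype("int")
```

The result is a small signed offset, such as the negative and positive values visible in the year columns of the preprocessed data.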


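Dealing with genre IDs: the three genre columns share a single label encoder fitted on every genre ID found in the three datasets.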

In [80]:
genres = set(train_data['first_genre_id'].astype('int')).union(set(train_data['second_genre_id'].astype('int'))).union(set(train_data['third_genre_id'].astype('int')))

In [83]:
genres = genres.union(set(validation_data['first_genre_id'].astype('int'))).union(set(validation_data['second_genre_id'].astype('int'))).union(set(validation_data['third_genre_id'].astype('int')))

In [85]:
genres = genres.union(set(test_data['first_genre_id'].astype('int'))).union(set(test_data['second_genre_id'].astype('int'))).union(set(test_data['third_genre_id'].astype('int')))

In [87]:
le = LabelEncoder()

In [88]:
le.fit(np.array(list(genres)))

LabelEncoder()

In [100]:
# LabelEncoder expects 1-D input, so the columns are passed without reshaping
train_data['first_genre_id'] = le.transform(train_data['first_genre_id'].astype('int'))
train_data['second_genre_id'] = le.transform(train_data['second_genre_id'].astype('int'))
train_data['third_genre_id'] = le.transform(train_data['third_genre_id'].astype('int'))


In [101]:
validation_data['first_genre_id'] = le.transform(validation_data['first_genre_id'].astype('int'))
validation_data['second_genre_id'] = le.transform(validation_data['second_genre_id'].astype('int'))
validation_data['third_genre_id'] = le.transform(validation_data['third_genre_id'].astype('int'))

In [102]:
test_data['first_genre_id'] = le.transform(test_data['first_genre_id'].astype('int'))
test_data['second_genre_id'] = le.transform(test_data['second_genre_id'].astype('int'))
test_data['third_genre_id'] = le.transform(test_data['third_genre_id'].astype('int'))
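
The shared genre encoding can be written more compactly with a loop over the frames and columns. The sketch below uses hypothetical toy frames and made-up genre IDs. It assumes, as in the cells above, that the three columns should map the same ID to the same code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

genre_cols = ["first_genre_id", "second_genre_id", "third_genre_id"]

# Hypothetical toy frames; the genre IDs are made up for illustration
train = pd.DataFrame({"first_genre_id": [465.0, 958.0],
                      "second_genre_id": [0.0, 465.0],
                      "third_genre_id": [0.0, 0.0]})
test = pd.DataFrame({"first_genre_id": [2022.0, 465.0],
                     "second_genre_id": [958.0, 0.0],
                     "third_genre_id": [0.0, 0.0]})

# Collect every genre ID from every genre column of every frame
all_ids = np.concatenate(
    [frame[genre_cols].to_numpy().ravel() for frame in (train, test)]
).astype("int")

# One encoder shared by the three columns
le = LabelEncoder().fit(all_ids)

for frame in (train, test):
    for col in genre_cols:
        frame[col] = le.transform(frame[col].astype("int"))
```

With one shared encoder, a genre that is first for one song and second for another receives the same integer code in both columns.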

## Saving the preprocessed (model-ready) data

In [95]:
pd.set_option('display.max_columns', 50)

In [103]:
train_data.head()

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,registration_year,registration_month,registration_day,expiration_year,expiration_month,expiration_day,membership_days,song_length,first_genre_id,second_genre_id,third_genre_id,genre_ids_count,artist_name,language,is_featured,artist_count,first_artist_name,lyricist_count,first_lyricist,composer_count,first_composer,country_code,registration_code,song_year,member_song_count,artist_song_count,composer_song_count,lyricist_song_count,genre_song_count,lang_song_count,song_member_count,age_song_count,target
0,19307,IXP1a2o3NL8WU4WK1X0WJAKaSW+LgGRpPDn4Gt1HrV8=,0,4,3,2,0.691596,2,0.9693,2,4,10,0,9,30,-0.641538,0.259599,41,0,0,1.0,44623,3,0,1,42871,1,12786,0.009259,33528,101,1738,2,0.14352,0.187866,0.02176,-0.858809,0.94303,0.856465,0.431721,-0.773667,0
1,12488,Ru7n8Xw2s8LGDsgDhyzWqCWQRWQW9KNPY9qMOFAf5x0=,1,1,1,6,-1.107765,1,0.090447,0,1,7,0,9,30,0.08754,0.243548,126,0,0,1.0,5760,0,0,1,5377,1,2380,0.009259,6342,33,2063,2,0.06669,0.074324,0.009434,-0.873839,-1.162382,-0.612357,0.249571,1.22446,1
2,21774,U8VgelI8G461f4FVJHnavdn4zx7XjmSFcNHZ3Hru3TU=,1,1,2,6,-1.107765,1,0.090447,2,11,25,0,9,20,-0.853263,0.312308,41,0,0,1.0,21606,3,0,1,20590,1,10369,0.009259,27036,101,4316,0,0.098487,0.072846,0.008035,-0.873829,0.94303,0.856465,0.409963,1.22446,1
3,12590,JiY2ckGkra5KGchu3Rh6KPH9FL2J0tTmAB3pGyVi4BQ=,0,2,3,4,0.691596,2,0.9693,-6,4,3,0,10,11,1.962944,0.264823,40,0,0,1.0,41281,3,0,1,39550,1,23823,0.018519,47315,101,1722,2,0.072877,0.101434,0.016361,-0.876909,-0.595889,0.856465,0.416476,-0.773667,0
4,15446,uNNIh/p+Ucr+CQBnmtVQFi9OMloGqLgnHe1mXggKle4=,4,4,4,2,0.048967,0,-1.227833,3,11,5,0,9,7,-1.171294,0.167287,77,0,0,1.0,36867,0,0,1,35217,0,17278,0.009259,34859,107,5401,2,0.087831,1.0,0.000154,1.149534,-1.119213,-0.612357,0.01775,-0.852899,0


In [104]:
validation_data.head()

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,registration_year,registration_month,registration_day,expiration_year,expiration_month,expiration_day,membership_days,song_length,first_genre_id,second_genre_id,third_genre_id,genre_ids_count,artist_name,language,is_featured,artist_count,first_artist_name,lyricist_count,first_lyricist,composer_count,first_composer,country_code,registration_code,song_year,member_song_count,artist_song_count,composer_song_count,lyricist_song_count,genre_song_count,lang_song_count,song_member_count,age_song_count,target
0,1218,RMv6xdzVwGLCucAtyRR7h5LWV2HKXjJMAtEL2pBeBG4=,1,1,1,6,-1.107765,1,0.9693,0,4,13,0,9,22,-0.004592,0.294378,40,0,0,1.0,41698,6,0,1,39968,0,17278,0.009259,32661,101,6234,-1,0.141114,0.088131,0.000112,1.149534,-0.595889,-1.815906,0.010163,1.22446,0
1,21990,OAkciD8vFKjj3vNPxHd+QiPxVYPfq+IGHQEMEGz/JxM=,4,4,4,0,0.627333,2,0.090447,-1,1,13,0,10,6,0.411771,0.230341,41,0,0,1.0,43280,3,0,1,41543,0,17278,0.009259,39959,101,1773,-10,0.079409,0.033582,0.000332,1.149534,0.94303,0.856465,0.00594,-0.758571,1
2,9393,6cl6GdZmBS8/qMJadkQJNM3c+NSiyXZmNIF3//QIDj8=,1,1,1,6,-1.107765,1,-1.227833,4,1,22,0,7,9,-1.293545,0.223491,37,0,0,1.0,10504,4,0,1,9902,2,782,0.037037,2356,54,2214,2,0.00911,0.068277,0.0002,-0.882784,-1.325394,-1.703706,0.017607,1.22446,1
3,26565,ah833V1/nMgO+NNxtUpKn8m5bc9nCPi7AHV77/k1zNA=,1,1,2,1,0.56307,2,-1.667259,2,4,15,0,10,1,-0.645082,0.26849,77,0,0,1.0,39782,4,0,1,38055,0,17278,0.009259,45549,54,6064,-13,0.090409,0.043717,0.000993,1.149534,-1.119213,-1.703706,0.001074,-0.755036,1
4,28486,CkcGnJJYzv1tJVVEI0UlDBVuK1+5BH1Rk8PCuwcrkg8=,1,1,1,5,1.334224,0,0.9693,-2,2,17,0,9,8,0.679306,0.234347,41,0,0,1.0,30813,0,0,1,29401,3,8695,0.027778,23368,107,5603,2,0.085425,0.067902,0.005891,-0.88155,0.94303,-0.612357,0.152662,-0.88947,1


In [105]:
test_data.head()

Unnamed: 0,id,msno,song_id,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,registration_year,registration_month,registration_day,expiration_year,expiration_month,expiration_day,membership_days,song_length,first_genre_id,second_genre_id,third_genre_id,genre_ids_count,artist_name,language,is_featured,artist_count,first_artist_name,lyricist_count,first_lyricist,composer_count,first_composer,country_code,registration_code,song_year,member_song_count,artist_song_count,composer_song_count,lyricist_song_count,genre_song_count,lang_song_count,song_member_count,age_song_count
0,0,17724,WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=,1,4,1,6,-1.107765,1,0.090447,3,2,19,0,9,18,-0.931221,0.249033,40,0,0,1.0,42813,3,0,1,41079,0,17278,0.009259,33433,101,6234,0,0.00275,0.013201,0.000122,-0.099924,-1.128714,-0.985288,0.013956,-0.174747
1,1,17724,y/rsZ9DC7FwK5F2PK2D5mj+aOBUJAjuu3dZ14NgE0vM=,1,1,1,6,-1.107765,1,0.090447,3,2,19,0,9,18,-0.931221,0.356078,41,0,0,1.0,42633,3,0,1,40900,2,19800,0.009259,49460,101,2221,-4,0.00275,0.10398,0.015294,-0.882061,-0.6072,-0.985288,0.105783,-0.174747
2,2,977,8eZLFOdGVdXBSqoAv5nsLigeH2BvKXzTQYtUM53I0k4=,0,1,4,6,-1.107765,1,-1.227833,3,11,17,-1,11,24,-1.436171,0.350999,140,0,0,1.0,38492,4,0,1,36789,1,16782,0.009259,43693,54,6585,-4,0.0,0.000487,7.4e-05,-0.882967,-1.327063,-1.812129,7.2e-05,-0.174747
3,3,1878,ztCf8thYsS4YN3GcIL/bvoxLm/T5mYBVKOO4C9NiVfQ=,3,1,5,0,0.820121,2,0.9693,-6,7,25,0,4,30,1.717556,0.3169,41,0,0,1.0,36423,0,0,1,34785,0,17278,0.037037,39854,33,1815,-12,0.046236,0.000738,2.3e-05,-0.099924,-0.6072,-1.39113,0.00093,-0.894568
4,4,1878,MKVMpslKcQhMaFEgcEQhEfi5+RZhMYlU3eRDpySrH8Y=,3,4,5,0,0.820121,2,0.9693,-6,7,25,0,4,30,1.717556,0.219544,72,0,0,1.0,38298,1,0,1,36603,0,17278,0.009259,30641,51,5056,-3,0.046236,0.00026,0.00113,-0.099924,-1.359458,-1.790143,0.000143,-0.894568


In [106]:
train_data.shape, validation_data.shape, test_data.shape

((5901934, 42), (1475484, 42), (2556790, 42))

In [107]:
train_data.to_csv("../Data/preprocessed_data/mrd_train_data.csv.gz", sep=',', index=False, compression='gzip')
test_data.to_csv("../Data/preprocessed_data/mrd_test_data.csv.gz", sep=',', index=False, compression='gzip')
validation_data.to_csv("../Data/preprocessed_data/mrd_validation_data.csv.gz", sep=',', index=False, compression='gzip')
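
A quick round trip on a toy frame can confirm that a gzip-compressed CSV written with `index=False` reads back unchanged. This is only a sketch: the frame, the temporary directory, and the file name `toy.csv.gz` are hypothetical and unrelated to the project paths.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with integer columns
frame = pd.DataFrame({"msno": [0, 1, 2], "target": [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv.gz")
    # index=False keeps the row index out of the file
    frame.to_csv(path, sep=",", index=False, compression="gzip")
    restored = pd.read_csv(path, compression="gzip")
```

Writing without the index means the file contains only the data columns when it is read back.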

## Sample the dataset

The dataset is too large for the available system configuration, so we sample it and use the sampled data to train our models.

We will use 20% of the data from the train and validation sets.

In [None]:
train_data = train_data.sample(frac=0.2, random_state=1996)
validation_data = validation_data.sample(frac=0.2, random_state=1996)

In [None]:
train_data.shape, validation_data.shape, test_data.shape

In [None]:
train_data.to_csv("../Data/sampled_data/sampled_train_data.csv.gz", sep=',', index=False, compression='gzip')
validation_data.to_csv("../Data/sampled_data/sampled_validation_data.csv.gz", sep=',', index=False, compression='gzip')
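
The sampling call can be sketched on a toy frame to show what `frac` and `random_state` do. The frame `data` is hypothetical. The seed 1996 is the one used above.

```python
import pandas as pd

# Hypothetical toy frame with ten rows
data = pd.DataFrame({"x": range(10), "target": [0, 1] * 5})

# frac=0.2 draws 20% of the rows without replacement;
# a fixed random_state makes the draw reproducible
sample_a = data.sample(frac=0.2, random_state=1996)
sample_b = data.sample(frac=0.2, random_state=1996)
```

The sampled rows keep their original index labels, so the sample can be traced back to the full data if needed.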