<p><span style="text-decoration: underline;"><strong>Problem Description:</strong></span></p>
<p>Real-world data is seldom clean; it tends to contain many noisy values. These range from values that distort the distribution of the data to null and infinite values. Your task here is to clean one such dataset.</p>
<p>The<strong> dataset</strong> can be downloaded from: <strong><a style="color: blue;" href="https://d3n0h9tb65y8q.cloudfront.net/public_assets/assets/000/002/000/original/input_data.csv?1638532904" target="_blank" rel="noopener">input_data.csv</a></strong></p>
<p>After clicking the data link above, you can download the file by right-clicking on the page and choosing "Save As". Give the file any name you like, with .csv as the extension.</p>
<p>Your task will be to:</p>
<ul>
<li>Replace all null/nan values with 0</li>
<li>Replace all negative infinite values with -1 and all positive infinite values with 1 (in NumPy these are usually represented as -np.inf and np.inf)</li>
<li>The remaining values (neither null nor infinite) are integers in the range 50-100. Replace each such value with its square root, rounded to 2 decimal places.</li>
</ul>
<p>For example:</p>
<p>Suppose the input data is [-inf, nan, 64, 81, 100, 93, inf, 70, nan]</p>
<p>The output data will then be [-1., 0., 8., 9., 10., 9.64, 1., 8.37, 0.]</p>
<p><span style="text-decoration: underline;"><strong>Submission Guidelines:<br /><br /></strong></span>Submit a CSV file with the columns “id” and “dataPoints”, keeping the "id"s in the same order as in the input_data.csv file. You can refer to the <a style="color: blue;" href="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/002/617/original/sample_submission.csv?1646667691" target="_blank" rel="noopener"><strong>sample_submission.csv</strong></a> file to see what the output CSV file should look like.</p>
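<p>The three rules can be checked on the example above before touching the real file. The following is a minimal sketch using NumPy only; the function name <code>clean</code> and the variable <code>example</code> are illustrative assumptions, not part of the task.</p>

```python
import numpy as np

def clean(values):
    """Apply the three rules from the task to a 1-D sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    finite = np.isfinite(arr)        # True only for ordinary numbers
    out = np.zeros_like(arr)         # NaN positions keep the value 0
    out[np.isposinf(arr)] = 1.0      # +inf -> 1
    out[np.isneginf(arr)] = -1.0     # -inf -> -1
    out[finite] = np.round(np.sqrt(arr[finite]), 2)
    return out

example = [-np.inf, np.nan, 64, 81, 100, 93, np.inf, 70, np.nan]
print(clean(example))
```

<p>The masks are computed from the original values, so the replacement values 0, 1 and -1 are never passed to the square root.</p>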

In [33]:
import pandas as pd
import numpy as np

In [34]:
df = pd.read_csv('input01.txt', sep=',')

In [35]:
df.isnull().sum()

id            0
dataPoints    8
dtype: int64

In [36]:
df['flag'] = 'others'

In [37]:
df.loc[df['dataPoints'].isnull(), 'flag'] = 'filled'

In [38]:
df['flag'].value_counts()

others    92
filled     8
Name: flag, dtype: int64

In [39]:
df.loc[df['dataPoints'].isin([np.inf, -np.inf]), 'flag'] = 'filled'
df['flag'].value_counts()

others    71
filled    29
Name: flag, dtype: int64

In [40]:
df['dataPoints'] = df['dataPoints'].fillna(0)
df.isnull().sum()

id            0
dataPoints    0
flag          0
dtype: int64

In [41]:
df['dataPoints'] = df['dataPoints'].replace(np.inf, 1)
df['dataPoints'] = df['dataPoints'].replace(-np.inf, -1)

In [42]:
count = np.isinf(df['dataPoints']).values.sum()
print("It contains " + str(count) + " infinite values")

It contains 0 infinite values


In [44]:
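# Take the square root only for rows that were not filled in earlier;
# access the row by column label rather than by position.
df['dataPoints'] = df[['dataPoints', 'flag']].apply(
    lambda x: round(np.sqrt(x['dataPoints']), 2) if x['flag'] == 'others' else x['dataPoints'],
    axis=1)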
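The row-wise apply can also be written with a boolean mask, which avoids calling a Python function for every row. This is only a sketch on a hypothetical toy frame (`toy` is an assumption standing in for `df` at this point, after NaN and infinite values have been replaced and flagged as 'filled').

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: replaced values are flagged 'filled'.
toy = pd.DataFrame({
    'dataPoints': [-1.0, 0.0, 64.0, 93.0, 1.0],
    'flag': ['filled', 'filled', 'others', 'others', 'filled'],
})

# Select only the rows holding original values.
mask = toy['flag'] == 'others'

# Square root and rounding are applied to the selected rows in one step.
toy.loc[mask, 'dataPoints'] = np.sqrt(toy.loc[mask, 'dataPoints']).round(2)
print(toy['dataPoints'].tolist())
```

The flag column matters here because the filled values are themselves finite: without it, the -1 that replaced negative infinity would go through the square root and become NaN.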

In [45]:
df

    id  dataPoints    flag
0    0        8.49  others
1    1        8.31  others
2    2        9.11  others
3    3        8.37  others
4    4        8.83  others
..  ..         ...     ...
95  95        9.95  others
96  96        7.81  others
97  97        8.89  others
98  98        9.95  others


In [46]:
df = df[['id', 'dataPoints']]

In [47]:
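# index=False keeps the row index out of the file, so the CSV
# contains only the required "id" and "dataPoints" columns.
df.to_csv('sample_submission.csv', index=False)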
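By default `to_csv` writes the row index as an extra, unnamed first column, while the submission format asks for only "id" and "dataPoints". The sketch below shows the effect of `index=False` on the header line; the frame `result` is a hypothetical two-row stand-in, and an in-memory buffer is used so that no file is written.

```python
import io
import pandas as pd

# Hypothetical two-row frame in the shape of the final result.
result = pd.DataFrame({'id': [0, 1], 'dataPoints': [8.49, 8.31]})

buffer = io.StringIO()
result.to_csv(buffer, index=False)   # omit the row index column

# The first line of the CSV text is the header.
header = buffer.getvalue().splitlines()[0]
print(header)
```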