Data wrangling

Given a CSV:

id,name,age,salary
1,Alice,25,100000
2,Bob,,90000
3,Charlie,30,NaN


👉 Task:

Read the file with pandas.

Fill missing ages with the mean age.

Drop rows with a missing salary.

Output JSON with the schema:

[{"id":1, "name":"Alice", "age":25, "salary":100000}, ...]

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('csv_sample.csv')
data

Unnamed: 0,id,name,age,salary
0,1,Alice,25.0,100000.0
1,2,Bob,,90000.0
2,3,Charlie,30.0,
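The table above can be reproduced without a file on disk. The following is a minimal sketch, assuming the sample CSV from the task is held in a string; the name `CSV_TEXT` and the use of `io.StringIO` are illustrative and not part of the original notebook. It shows that both the empty field and the literal `NaN` are parsed as missing values, which is why `age` and `salary` come back as floats.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the sample CSV from the task statement.
CSV_TEXT = """id,name,age,salary
1,Alice,25,100000
2,Bob,,90000
3,Charlie,30,NaN
"""

data = pd.read_csv(io.StringIO(CSV_TEXT))

# Count missing values per column: one in age (empty field), one in salary ("NaN").
missing = data.isna().sum()
print(missing)

# Columns that contain NaN are stored as float64; id has no gaps and stays int64.
print(data.dtypes)
```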


In [3]:
mean_age = data['age'].mean()
mean_age


np.float64(27.5)

In [4]:
data['age'].fillna(mean_age)

0    25.0
1    27.5
2    30.0
Name: age, dtype: float64
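Note that `fillna` returns a new Series and leaves the original untouched unless the result is assigned back. A small sketch of that behaviour, using a stand-in Series (`age` is an illustrative name) rather than the notebook's `data` frame:

```python
import numpy as np
import pandas as pd

# Stand-in for the age column of the sample data.
age = pd.Series([25.0, np.nan, 30.0], name="age")

# mean() skips NaN by default, so this is (25 + 30) / 2.
filled = age.fillna(age.mean())

print(filled.tolist())
# The original Series still contains its missing value.
print(age.isna().sum())
```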

In [14]:
data['age'] = data['age'].fillna(mean_age).astype(int)
data

Unnamed: 0,id,name,age,salary
0,1,Alice,25,100000.0
1,2,Bob,27,90000.0
2,3,Charlie,30,
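Bob's imputed age is 27 rather than 27.5 because `astype(int)` drops the fractional part. If rounding is preferred, one option is to call `round()` first. The sketch below compares the two on a stand-in Series; which behaviour is wanted is a judgment call the task statement does not settle.

```python
import pandas as pd

# Stand-in for the age column after fillna(mean_age).
filled = pd.Series([25.0, 27.5, 30.0], name="age")

# astype(int) truncates toward zero: 27.5 -> 27.
truncated = filled.astype(int)

# round() sends exact halves to the nearest even integer: 27.5 -> 28.
rounded = filled.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```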


In [15]:
data.dropna(subset=['salary'], inplace=True)

In [16]:
data['salary'] = data['salary'].astype(int)
data

Unnamed: 0,id,name,age,salary
0,1,Alice,25,100000
1,2,Bob,27,90000
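The order of these two steps matters: the integer cast only works once the missing salary is gone, because NaN has no integer representation. A minimal sketch on a stand-in frame (`df` and `clean` are illustrative names), also showing a non-mutating alternative to `inplace=True`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "salary": [100000.0, 90000.0, np.nan]})

# Casting while a NaN is still present raises an error (a ValueError subclass).
try:
    df["salary"].astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# dropna returns a new DataFrame; the original df keeps all three rows.
clean = df.dropna(subset=["salary"]).copy()
clean["salary"] = clean["salary"].astype(int)

print(cast_failed)
print(clean)
```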



In [None]:
# Step 4: Convert to list of dicts
records = data.to_dict(orient="records")
records

[{'id': 1, 'name': 'Alice', 'age': 25, 'salary': 100000},
 {'id': 2, 'name': 'Bob', 'age': 27, 'salary': 90000}]

What orient does

When you call DataFrame.to_dict() or DataFrame.to_json(), the orient argument controls how the DataFrame is converted into Python objects (dicts/lists) or JSON.

orient="records"

Returns a list of dictionaries, one per row, each mapping column names to that row's values.

This is the natural choice when the target is JSON with one object per row, as in the task schema.
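To make the effect of `orient` concrete, the sketch below converts the same small frame three ways. The frame `df` is a stand-in, not the notebook's `data`; `"records"`, `"list"`, and the default `"dict"` are all documented values of the argument.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

# One dict per row.
by_rows = df.to_dict(orient="records")

# One list per column.
by_column = df.to_dict(orient="list")

# Default (orient="dict"): column -> {index label -> value}.
default = df.to_dict()

print(by_rows)
print(by_column)
print(default)
```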

In [21]:
import json

# Write the list of row dicts to disk as pretty-printed JSON.
with open('output.json', mode='w') as f:
    json.dump(records, f, indent=4)
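pandas can also serialise the frame directly, skipping the intermediate list of dicts. A short sketch, assuming a cleaned frame like the one above (`df` is a stand-in); `to_json` returns a string when no path is given, and parsing it back confirms it matches the target schema.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2], "name": ["Alice", "Bob"], "age": [25, 27], "salary": [100000, 90000]}
)

# With no path argument, to_json returns the JSON text instead of writing a file.
text = df.to_json(orient="records", indent=4)

# Round-trip through the json module to check the structure.
parsed = json.loads(text)
print(parsed)
```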


In [18]:
import json

import pandas as pd


class CsvParser:
    """Load a CSV into a DataFrame, clean it, and export it as JSON records."""

    def __init__(self, path):
        self.data = pd.read_csv(path)

    def calculate_mean(self, col_name):
        # mean() ignores missing values by default.
        return self.data[col_name].mean()

    def impute_col(self, col_name):
        # Fill missing values with the column mean, then truncate to int.
        mean_val = self.calculate_mean(col_name)
        self.data[col_name] = self.data[col_name].fillna(mean_val).astype(int)

    def drop_missing_rows(self, col_name):
        # Drop every row in which col_name is missing.
        self.data = self.data.dropna(subset=[col_name])

    def save_json(self, out_put):
        # Write one JSON object per row.
        with open(out_put, mode='w') as f:
            json.dump(self.data.to_dict(orient='records'), f, indent=4)

    def print_data(self):
        # Returns the DataFrame; a notebook displays it only when this is
        # the last expression in the cell.
        return self.data


csv_obj = CsvParser('csv_sample.csv')
csv_obj.impute_col('age')
csv_obj.print_data()
csv_obj.drop_missing_rows('salary')
csv_obj.print_data()
csv_obj.save_json('output.json')
csv_obj.print_data()


Unnamed: 0,id,name,age,salary
0,1,Alice,25,100000.0
1,2,Bob,27,90000.0
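In this output the salary column is still float (100000.0), so the JSON written by `save_json` would contain `100000.0` rather than the `100000` in the task schema. One possible fix is sketched below; `cast_whole_floats` is a hypothetical helper, not part of the class above, and it only converts float columns that have no missing values and hold whole numbers.

```python
import pandas as pd


def cast_whole_floats(df):
    """Return a copy where complete, whole-number float columns are cast to int."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        values = out[col]
        # Only cast when nothing is missing and no fractional part would be lost.
        if values.notna().all() and (values % 1 == 0).all():
            out[col] = values.astype(int)
    return out


# Stand-in for the frame left after imputing age and dropping missing salaries.
df = pd.DataFrame(
    {"id": [1, 2], "name": ["Alice", "Bob"], "age": [25, 27], "salary": [100000.0, 90000.0]}
)

fixed = cast_whole_floats(df)
records = fixed.to_dict(orient="records")
print(records)
```

Calling a helper like this just before `to_dict` in `save_json` would be one way to bring the class output in line with the schema.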
