## Loading the Libraries

We need to import modules from the `preprocessor` package, which lives one directory above this notebook. To make it importable, we add the parent directory (`..`) to `sys.path`.

In [1]:
import sys
import os

# Add project root (one level up from notebooks/) to sys.path
sys.path.append(os.path.abspath(".."))
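
Re-running the cell above appends the same path again each time. A minimal sketch of a guard against that is shown here; `add_to_sys_path` is a hypothetical helper, not part of the project.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: str) -> str:
    """Append the resolved directory to sys.path once and return it."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        # Only append when the entry is not already present.
        sys.path.append(resolved)
    return resolved


# ".." is resolved relative to the current working directory.
project_root = add_to_sys_path("..")
print(project_root)
```

Because the path is resolved against the current working directory, this assumes the notebook is started from the `notebooks/` folder, as in the original cell.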

In [2]:
# Now let's try to import it
from preprocessor.main import Preprocessor

## Data Loading

In [3]:
data_file = "../notebooks/temp/data_clean.pkl"

In [4]:
import numpy as np
import pandas as pd

In [5]:
data = pd.read_pickle(data_file)

In [6]:
data.head()

Unnamed: 0,softwareType,industryDomain,numUsers,targetMarket,adminDashboard,contentManagement,extraFeatures,thirdPartyService,authentication,dataMigration,uiUxDesign,performance,security,availability,timeline_months,Price
0,Mobile,Restaurant_Management,500-1000,Global,Basic,[Workflow],"[Reporting, Analysis]","[Analytics, Payment_Gateway, Mail]",Multi_Factor,Null,Custom,Medium,High,Normal,9,1300.0
1,Desktop,Fintech,30-50,Local,Null,[Workflow],"[Reporting, Analysis]","[Payment_Gateway, Mail]",Social,Null,Advanced,High,Null,Always,3,1050.0
2,Web,Fintech,01-10,Local,Basic,"[Pages, Media]","[Search, Filter]",[Mail],Multi_Factor,Null,Basic,Basic,Null,Normal,30,310.0
3,Mobile,Hotel_Management,100-500,Both,Advanced,[Workflow],"[Search, Filter]","[Mail, Payment_Gateway]",Social,Null,Custom,Basic,Null,Normal,9,750.0
4,Desktop,Ecommerce,01-10,Both,Basic,[Null],[File_Handling],"[AI_integration, Payment_Gateway]",Null,No,Custom,Basic,Null,Normal,4,920.0


## Data Preprocessing

In [7]:
preprocess = Preprocessor(data)

Let us now transform the data into numerical values.

In [8]:
data = preprocess.transform()

In [9]:
# Let's see the transformed data
data.head()

Unnamed: 0,softwareType,industryDomain,numUsers,targetMarket,adminDashboard,contentManagement,extraFeatures,thirdPartyService,authentication,dataMigration,uiUxDesign,performance,security,availability,timeline_months,Price
0,3,4,750.0,2,1,1.0,0.0,2.333333,3,0,3.0,2,2,1.0,9,1300.0
1,2,9,40.0,1,0,1.0,0.0,3.0,2,0,2.0,3,0,2.0,3,1050.0
2,1,9,5.5,1,1,0.0,0.0,4.0,3,0,1.0,1,0,1.0,30,310.0
3,3,3,300.0,3,2,1.0,0.0,3.0,2,0,3.0,1,0,1.0,9,750.0
4,2,1,5.5,3,1,,4.0,1.0,0,1,3.0,1,0,1.0,4,920.0
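
The `numUsers` values above are consistent with each range label being replaced by its midpoint (`500-1000` becomes `750.0`, `01-10` becomes `5.5`). The actual logic lives inside `Preprocessor` and is not shown in this notebook, so the following is only a sketch of such a mapping; `range_midpoint` is a hypothetical helper.

```python
def range_midpoint(label: str) -> float:
    """Return the midpoint of a 'low-high' label such as '500-1000'."""
    low, high = (float(part) for part in label.split("-"))
    return (low + high) / 2


# Labels taken from the first rows of the raw data.
for label in ["500-1000", "30-50", "01-10", "100-500"]:
    print(label, "->", range_midpoint(label))
```

This sketch assumes every label has exactly two numeric bounds separated by a hyphen; an open-ended label would need separate handling.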


Some cells are missing (`NaN`), for example `contentManagement` in row 4. Let us first count the missing values per column.

In [10]:
nan_counts = data.isna().sum()
print(nan_counts)

softwareType          0
industryDomain        0
numUsers              0
targetMarket          0
adminDashboard        0
contentManagement    69
extraFeatures         8
thirdPartyService     7
authentication        0
dataMigration         0
uiUxDesign            1
performance           0
security              0
availability          5
timeline_months       0
Price                 5
dtype: int64
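
Raw counts are easier to judge relative to the number of rows. As a small sketch on a made-up toy frame (the column names follow the notebook, the values do not), `isna().mean()` gives the fraction of missing cells per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transformed data; the values are made up.
toy = pd.DataFrame(
    {
        "contentManagement": [1.0, np.nan, np.nan, 3.0],
        "uiUxDesign": [3.0, 2.0, np.nan, 1.0],
        "Price": [1300.0, 1050.0, 310.0, 750.0],
    }
)

# isna() yields booleans, so the column mean is the share of missing cells.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```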


The counts show that `contentManagement` has the most missing values (69). Let's fill the missing values in every column with `0`.

In [11]:
data = data.fillna(0)
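
Because `fillna(0)` is applied to the whole frame, the 5 missing `Price` values also become `0`. If `Price` is the value to be predicted in the next step (an assumption, the notebook does not say so), one alternative is to drop those rows first and zero-fill only what remains. A sketch on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow the notebook, the values are made up.
toy = pd.DataFrame(
    {
        "contentManagement": [1.0, np.nan, 3.0, np.nan],
        "availability": [1.0, 2.0, np.nan, 1.0],
        "Price": [1300.0, np.nan, 310.0, 750.0],
    }
)

# Drop rows whose Price is missing, then zero-fill the remaining columns.
cleaned = toy.dropna(subset=["Price"]).fillna(0)
print(cleaned)
```

Whether dropping or filling is the better choice depends on how the next step uses `Price`.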

Let's recheck the counts.

In [12]:
nan_counts = data.isna().sum()
print(nan_counts)

softwareType         0
industryDomain       0
numUsers             0
targetMarket         0
adminDashboard       0
contentManagement    0
extraFeatures        0
thirdPartyService    0
authentication       0
dataMigration        0
uiUxDesign           0
performance          0
security             0
availability         0
timeline_months      0
Price                0
dtype: int64


In [13]:
data.head(10)

Unnamed: 0,softwareType,industryDomain,numUsers,targetMarket,adminDashboard,contentManagement,extraFeatures,thirdPartyService,authentication,dataMigration,uiUxDesign,performance,security,availability,timeline_months,Price
0,3,4,750.0,2,1,1.0,0.0,2.333333,3,0,3.0,2,2,1.0,9,1300.0
1,2,9,40.0,1,0,1.0,0.0,3.0,2,0,2.0,3,0,2.0,3,1050.0
2,1,9,5.5,1,1,0.0,0.0,4.0,3,0,1.0,1,0,1.0,30,310.0
3,3,3,300.0,3,2,1.0,0.0,3.0,2,0,3.0,1,0,1.0,9,750.0
4,2,1,5.5,3,1,0.0,4.0,1.0,0,1,3.0,1,0,1.0,4,920.0
5,4,7,300.0,2,3,0.0,0.0,2.5,2,0,3.0,1,0,1.0,36,1630.0
6,4,5,40.0,3,3,3.0,2.0,3.0,3,2,2.0,2,0,1.0,9,1960.0
7,3,2,40.0,3,0,3.0,5.0,3.0,2,2,1.0,3,1,2.0,12,1285.0
8,4,5,20.0,1,0,3.0,0.0,1.5,0,1,3.0,1,2,1.0,24,1100.0
9,2,9,750.0,2,0,1.0,0.0,0.0,0,1,2.0,3,1,1.0,6,980.0


## Saving the Data

The data is now numeric and contains no missing values. Let us save it to a file for the next step.

In [14]:
final_data_path = "../notebooks/temp/dataset.pkl"
data.to_pickle(final_data_path)
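
To confirm that a pickle file round-trips cleanly, the saved frame can be read back and compared with the original. The sketch below uses a made-up toy frame and a temporary directory rather than the notebook's own path:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed data; the values are made up.
toy = pd.DataFrame(
    {"timeline_months": [9, 3, 30], "Price": [1300.0, 1050.0, 310.0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dataset.pkl"
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(toy, restored)
```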