# Software migration
In this notebook, I explore the data associated with the software migration.

## Data Gathering
Let's read the Excel sheets and CSV files into pandas DataFrames.
### Migration Mapping
Let's read the migration mapping data from the first sheet of the Excel workbook into a pandas DataFrame.

In [27]:
import pandas as pd
from typing import Dict

migration_mapping_xlsx = pd.ExcelFile(
    "dataset/Software_Migration_Mapping.xlsx"
)
migration_mapping_df = pd.read_excel(migration_mapping_xlsx, sheet_name=0)
migration_mapping_df.head()

Unnamed: 0,Field,SoftwareA,SoftwareB
0,Channel,channel1,Channel1
1,Channel,channel2,Channel2
2,Channel,channel3,Channel3
3,Language,en,en-US
4,Language,en_us,en-US


In [28]:
migration_mapping_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Field      10 non-null     object
 1   SoftwareA  10 non-null     object
 2   SoftwareB  10 non-null     object
dtypes: object(3)
memory usage: 372.0+ bytes


### Data to migrate
Let's read the data to migrate from the associated CSV file.

In [29]:
data_to_migrate_df = pd.read_csv("dataset/data_to_migrate.csv")
data_to_migrate_df.head()

Unnamed: 0,id,Channel,Language,CustomFields,Duration,PointsGained
0,1,channel1,en,Area=account;New=true,01:23:14,57
1,1,channel1,en_us,Area=account;New=true,00:13:04,12
2,1,channel2,en,Area=finance;New=false,00:37:21,30
3,2,channel3,es,Area=finance;Premium=premium-user;New=false,03:01:47,254
4,3,channel2,es,Area=customer;New=false,01:56:34,71


In [30]:
data_to_migrate_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            5 non-null      int64 
 1   Channel       5 non-null      object
 2   Language      5 non-null      object
 3   CustomFields  5 non-null      object
 4   Duration      5 non-null      object
 5   PointsGained  5 non-null      int64 
dtypes: int64(2), object(4)
memory usage: 372.0+ bytes


## Data Cleaning
Some of the data needs cleaning before it can be migrated.
Let's start by making a copy of the data to clean, so that the original DataFrame stays untouched.

In [31]:
clean_data_to_migrate_df = data_to_migrate_df.copy()
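
As a minimal sketch of why the copy matters, the snippet below converts a column on a copy and leaves the source frame as it was. The names `original_df` and `working_df` and the two sample rows are hypothetical and only stand in for the notebook's DataFrames.

```python
import pandas as pd

# Hypothetical two-row sample standing in for data_to_migrate_df.
original_df = pd.DataFrame({"Duration": ["01:23:14", "00:13:04"]})

# .copy() returns an independent DataFrame (a deep copy by default).
working_df = original_df.copy()
working_df["Duration"] = pd.to_timedelta(working_df["Duration"])

# The original still holds the raw strings; only the copy was converted.
print(original_df["Duration"].tolist())
print(working_df["Duration"].dtype)
```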

### Change `Duration` type to `Timedelta`
The `Duration` field is stored as a string. It must be converted to a `Timedelta` type before any further computation can be performed on it.

In [32]:
clean_data_to_migrate_df["Duration"] = pd.to_timedelta(
    clean_data_to_migrate_df["Duration"]
)
clean_data_to_migrate_df.head()

Unnamed: 0,id,Channel,Language,CustomFields,Duration,PointsGained
0,1,channel1,en,Area=account;New=true,0 days 01:23:14,57
1,1,channel1,en_us,Area=account;New=true,0 days 00:13:04,12
2,1,channel2,en,Area=finance;New=false,0 days 00:37:21,30
3,2,channel3,es,Area=finance;Premium=premium-user;New=false,0 days 03:01:47,254
4,3,channel2,es,Area=customer;New=false,0 days 01:56:34,71


In [33]:
clean_data_to_migrate_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype          
---  ------        --------------  -----          
 0   id            5 non-null      int64          
 1   Channel       5 non-null      object         
 2   Language      5 non-null      object         
 3   CustomFields  5 non-null      object         
 4   Duration      5 non-null      timedelta64[ns]
 5   PointsGained  5 non-null      int64          
dtypes: int64(2), object(3), timedelta64[ns](1)
memory usage: 372.0+ bytes


The `Duration` column now has the `timedelta64[ns]` dtype.
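
To illustrate the kind of computation this conversion enables, here is a short sketch on a hypothetical sample (`sample_df`) that mirrors the first three rows of the data. It is an illustration only, not part of the migration itself.

```python
import pandas as pd

# Hypothetical sample mirroring the first three rows of the data to migrate.
sample_df = pd.DataFrame(
    {
        "id": [1, 1, 1],
        "Duration": ["01:23:14", "00:13:04", "00:37:21"],
    }
)
sample_df["Duration"] = pd.to_timedelta(sample_df["Duration"])

# Timedelta columns can be summed and aggregated directly.
total_duration = sample_df["Duration"].sum()
per_id_total = sample_df.groupby("id")["Duration"].sum()

# The .dt accessor exposes each duration as a number of seconds.
duration_seconds = sample_df["Duration"].dt.total_seconds()

print(total_duration)
print(duration_seconds.tolist())
```

None of these operations would work on the original string column, where `sum()` would concatenate the strings instead of adding the durations.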

## Data Migration
Let's use the `migration_mapping_df` DataFrame to translate the SoftwareA values into their SoftwareB equivalents.

In [35]:
migrated_data_df = clean_data_to_migrate_df.copy()
for field in migration_mapping_df["Field"].unique():
    # Keep only the mapping rows that apply to the current field.
    field_mapping_df = migration_mapping_df[
        migration_mapping_df["Field"] == field
    ]
    # Build a SoftwareA -> SoftwareB lookup for this field.
    mapping_dict: Dict[str, str] = dict(
        zip(field_mapping_df["SoftwareA"], field_mapping_df["SoftwareB"])
    )
    # Replace each SoftwareA value in the column named after the field.
    migrated_data_df[field] = migrated_data_df[field].replace(mapping_dict)
migrated_data_df.head()

Unnamed: 0,id,Channel,Language,CustomFields,Duration,PointsGained
0,1,Channel1,en-US,Area=account;New=true,0 days 01:23:14,57
1,1,Channel1,en-US,Area=account;New=true,0 days 00:13:04,12
2,1,Channel2,en-US,Area=finance;New=false,0 days 00:37:21,30
3,2,Channel3,es-ES,Area=finance;Premium=premium-user;New=false,0 days 03:01:47,254
4,3,Channel2,es-ES,Area=customer;New=false,0 days 01:56:34,71
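
One thing worth checking with this approach: `Series.replace` leaves any value without a matching key unchanged, so a SoftwareA value that is missing from the mapping would pass through silently. The sketch below shows one possible way to list such values before migrating. The sample frames and the helper `find_unmapped_values` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping and data, shaped like the frames used in this notebook.
mapping_df = pd.DataFrame(
    {
        "Field": ["Language", "Language"],
        "SoftwareA": ["en", "en_us"],
        "SoftwareB": ["en-US", "en-US"],
    }
)
data_df = pd.DataFrame({"Language": ["en", "en_us", "fr"]})


def find_unmapped_values(data, mapping):
    """Return, per mapped field, the values in `data` with no SoftwareA entry."""
    unmapped = {}
    for field, field_mapping in mapping.groupby("Field"):
        known = set(field_mapping["SoftwareA"])
        missing = sorted(set(data[field].dropna()) - known)
        if missing:
            unmapped[field] = missing
    return unmapped


unmapped_values = find_unmapped_values(data_df, mapping_df)
print(unmapped_values)  # {'Language': ['fr']}
```

An empty dictionary would mean that every value in the mapped columns has a SoftwareB equivalent.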
