<div class="alert alert-block" style = "background-color: black">
    <p><b><font size="+4" color="orange">Data Transformation in Pandas</font></b></p>
    <p><b><font size="+1" color="white">by Jubril Davies</font></b></p>
    </div>

In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

$$\begin{align} \text{This work focuses on filtering, cleaning and transforming data} \end{align}$$
---
<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Removing Duplicates</font></b></p>
    </div>
    
---

A messy dataset may contain duplicate rows for many reasons. Depending on the context of the problem domain, these duplicates may be valid or invalid. When they are invalid, the DataFrame methods `duplicated` and `drop_duplicates` offer a convenient way to find and remove them.

> #### **Given the DataSet**

In [6]:
data = pd.DataFrame({'Name':['Adam','Bale','Chris','Dave','Edward','Adam','Bale','Adam'],
                    'Age':[25,30,35,40,45,25,30,28], 'Dept':['HR','IT','Sales','IT','Supply','HR','IT','HR'],
                    'Salary':[50000,60000,70000,80000,90000,50000,60000,52000]})
data

Unnamed: 0,Name,Age,Dept,Salary
0,Adam,25,HR,50000
1,Bale,30,IT,60000
2,Chris,35,Sales,70000
3,Dave,40,IT,80000
4,Edward,45,Supply,90000
5,Adam,25,HR,50000
6,Bale,30,IT,60000
7,Adam,28,HR,52000


In [11]:
data_no_duplicates = data.drop_duplicates()
data_no_duplicates

Unnamed: 0,Name,Age,Dept,Salary
0,Adam,25,HR,50000
1,Bale,30,IT,60000
2,Chris,35,Sales,70000
3,Dave,40,IT,80000
4,Edward,45,Supply,90000
7,Adam,28,HR,52000


In [8]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
6     True
7    False
dtype: bool
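
By default `duplicated` marks only the repeats, not the first occurrence. To inspect every row that belongs to a duplicated group, the method also accepts `keep=False`. The following is a minimal sketch that rebuilds the same `data` frame so the cell runs by itself; the names `all_copies` and `n_repeats` are illustrative and not part of the original notebook.

```python
import pandas as pd

data = pd.DataFrame({'Name': ['Adam', 'Bale', 'Chris', 'Dave', 'Edward', 'Adam', 'Bale', 'Adam'],
                     'Age': [25, 30, 35, 40, 45, 25, 30, 28],
                     'Dept': ['HR', 'IT', 'Sales', 'IT', 'Supply', 'HR', 'IT', 'HR'],
                     'Salary': [50000, 60000, 70000, 80000, 90000, 50000, 60000, 52000]})

# keep=False flags every member of a duplicated group, including the first occurrence
all_copies = data[data.duplicated(keep=False)]

# With the default keep='first', summing the boolean mask counts the rows
# that drop_duplicates() would remove
n_repeats = int(data.duplicated().sum())

print(all_copies)
print(n_repeats)
```

Checking the count before dropping rows is a cheap way to confirm that the amount of data removed matches expectations.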

> #### **Remove duplicates based on Specific Columns**
Remove duplicates based only on the `Name` and `Dept` columns, keeping the first occurrence

In [17]:
data_no_duplicates_specific = data.drop_duplicates(subset=['Name','Dept'])

data_no_duplicates_specific

Unnamed: 0,Name,Age,Dept,Salary
0,Adam,25,HR,50000
1,Bale,30,IT,60000
2,Chris,35,Sales,70000
3,Dave,40,IT,80000
4,Edward,45,Supply,90000


> #### **Passing `keep='last'` keeps the last occurrence of each duplicate instead of the first**

In [19]:
data_no_duplicates_keep_last = data.drop_duplicates(['Name','Dept'],keep='last')

data_no_duplicates_keep_last

Unnamed: 0,Name,Age,Dept,Salary
2,Chris,35,Sales,70000
3,Dave,40,IT,80000
4,Edward,45,Supply,90000
6,Bale,30,IT,60000
7,Adam,28,HR,52000


> #### **After removing duplicates, it might be necessary to reset the index for a cleaner DataFrame**

In [20]:
data_reset_index = data.drop_duplicates().reset_index(drop=True)

data_reset_index

Unnamed: 0,Name,Age,Dept,Salary
0,Adam,25,HR,50000
1,Bale,30,IT,60000
2,Chris,35,Sales,70000
3,Dave,40,IT,80000
4,Edward,45,Supply,90000
5,Adam,28,HR,52000
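
As an alternative to chaining `reset_index(drop=True)`, `drop_duplicates` accepts an `ignore_index` argument in pandas 1.0 and later. This is a minimal sketch under that version assumption; the variable name `deduped` is illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Name': ['Adam', 'Bale', 'Chris', 'Dave', 'Edward', 'Adam', 'Bale', 'Adam'],
                     'Age': [25, 30, 35, 40, 45, 25, 30, 28],
                     'Dept': ['HR', 'IT', 'Sales', 'IT', 'Supply', 'HR', 'IT', 'HR'],
                     'Salary': [50000, 60000, 70000, 80000, 90000, 50000, 60000, 52000]})

# ignore_index=True relabels the surviving rows 0..n-1 in a single call
# (assumes pandas >= 1.0)
deduped = data.drop_duplicates(ignore_index=True)
print(deduped)
```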


<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Transforming Data using a Function or Mapping</font></b></p>
    </div>

---

In some instances, it might be necessary to transform data based on the values in an array, Series or DataFrame. This can be achieved by applying a function to each element, for example with NumPy's `vectorize` for an array or the pandas `map` method for a Series or a DataFrame column.

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">1. Transforming Data in an Array</font></b></p>
    </div>

> #### **Given an array of temperatures**

In [21]:
celsius_temp = np.array([0,20,30,40,100])
celsius_temp

array([  0,  20,  30,  40, 100])

**Define a transformation function to convert celsius to fahrenheit**

In [30]:
def celsius_to_fahrenheit(celsius):
    return celsius * 9/5 + 32

# Apply the function to the array
vectorized_func = np.vectorize(celsius_to_fahrenheit)
fahrenheit_temp = vectorized_func(celsius_temp)
temp = pd.DataFrame({'celsius_temp':celsius_temp,'fahrenheit_temp':fahrenheit_temp})
temp

Unnamed: 0,celsius_temp,fahrenheit_temp
0,0,32.0
1,20,68.0
2,30,86.0
3,40,104.0
4,100,212.0
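
For a formula built only from arithmetic, as here, NumPy arrays already apply the operation element by element, so the `np.vectorize` wrapper is optional. The NumPy documentation describes `np.vectorize` as a convenience that is essentially a Python loop rather than a performance tool. A minimal sketch of the direct form:

```python
import numpy as np

celsius_temp = np.array([0, 20, 30, 40, 100])

# Arithmetic operators broadcast over the whole array, so no wrapper is needed
fahrenheit_temp = celsius_temp * 9 / 5 + 32
print(fahrenheit_temp)
```

`np.vectorize` remains useful when the function contains logic that does not operate on whole arrays, such as Python `if` statements on scalar values.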


<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">2. Transforming Data in a Series</font></b></p>
    </div>

> #### **Given a series of dates with the goal of applying transformations to extract weekdays and weekends**

In [34]:
dates = pd.Series(['2024-01-01','2024-01-02','2024-01-03','2024-01-04','2024-01-05','2024-01-06','2024-01-07'])
dates

0    2024-01-01
1    2024-01-02
2    2024-01-03
3    2024-01-04
4    2024-01-05
5    2024-01-06
6    2024-01-07
dtype: object

**Convert the strings to datetime objects**

In [36]:
dates = pd.to_datetime(dates)
dates

0   2024-01-01
1   2024-01-02
2   2024-01-03
3   2024-01-04
4   2024-01-05
5   2024-01-06
6   2024-01-07
dtype: datetime64[ns]

**Define a function to get the day of the week**

In [38]:
def get_day_week(date):
    # Each element passed in by map is a single Timestamp
    return date.day_name()

# Apply the function to each date
weekdays = dates.map(get_day_week)
weekdays

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
dtype: object

**Define a new function to check whether each date is a weekend**

In [40]:
def is_weekend(date):
    return date.weekday() >= 5

# Keep only the day names of the dates that fall on a weekend
weekends = weekdays[dates.map(is_weekend)]
weekends

5    Saturday
6      Sunday
dtype: object
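
The element-wise `map` calls above can also be written with the `.dt` accessor, which exposes datetime properties for a whole Series at once. This is a sketch of an equivalent approach, not a replacement for the functions defined above; the name `weekend_mask` is illustrative.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
                                  '2024-01-05', '2024-01-06', '2024-01-07']))

# .dt.day_name() returns the weekday name for every element
weekdays = dates.dt.day_name()

# .dt.dayofweek numbers Monday as 0, so 5 and 6 are Saturday and Sunday
weekend_mask = dates.dt.dayofweek >= 5
weekends = weekdays[weekend_mask]
print(weekends)
```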

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">3. Transforming Data in a DataFrame</font></b></p>
    </div>

> #### **Given a dataset of housing prices with the goal of normalizing housing prices**

In [45]:
housing = pd.DataFrame({'Location':['New York','San Francisco','Los Angeles','Chicago','Houston'],
                       'Price':[1250000,1450000,1650000,800000,650000]})
housing

Unnamed: 0,Location,Price
0,New York,1250000
1,San Francisco,1450000
2,Los Angeles,1650000
3,Chicago,800000
4,Houston,650000


**Define a transformation function to normalize the prices**

In [46]:
def normalize_price(price, mean_price, std_price):
    return (price - mean_price) / std_price

#Calculate mean and std of prices
mean_price = housing['Price'].mean()
std_price = housing['Price'].std()

#Apply the function to the housing prices
housing['Normalized_Prices'] = housing['Price'].map(lambda x: normalize_price(x,mean_price, std_price))
housing

Unnamed: 0,Location,Price,Normalized_Prices
0,New York,1250000,0.211838
1,San Francisco,1450000,0.682589
2,Los Angeles,1650000,1.15334
3,Chicago,800000,-0.847352
4,Houston,650000,-1.200415
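
Because the normalization is plain arithmetic, the same column can be computed on the whole Series without `map` and a lambda. A minimal sketch follows; note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which is what produces the values shown above.

```python
import pandas as pd

housing = pd.DataFrame({'Location': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Houston'],
                        'Price': [1250000, 1450000, 1650000, 800000, 650000]})

price = housing['Price']

# Subtracting a scalar and dividing by a scalar are applied to every element;
# std() defaults to ddof=1 (sample standard deviation)
housing['Normalized_Prices'] = (price - price.mean()) / price.std()
print(housing)
```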


<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Replacing Values</font></b></p>
    </div>

---

pandas offers several ways to replace values, including missing data:

* `fillna` fills in missing data, which is a special case of general value replacement
* `map` is used to modify a subset of values
* `replace` provides a simpler and more flexible way to modify a subset of values

> #### **Given the dataset**

In [48]:
dt = pd.Series([100, 120, -999, -1000, 350])
dt

0     100
1     120
2    -999
3   -1000
4     350
dtype: int64

**To replace several values at once with a single substitute, pass a list of the values followed by the substitute**

In [50]:
dt.replace([-999,-1000],np.nan)

0    100.0
1    120.0
2      NaN
3      NaN
4    350.0
dtype: float64

**To use a different replacement value for each value, pass a list of substitutes**

In [51]:
dt.replace([-999,-1000],[np.nan,0])

0    100.0
1    120.0
2      NaN
3      0.0
4    350.0
dtype: float64

**A dictionary of values and replacements can also be passed as arguments**

In [52]:
dt.replace({-999:np.nan,-1000:0})

0    100.0
1    120.0
2      NaN
3      0.0
4    350.0
dtype: float64
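
Once placeholder values have been replaced with `NaN`, `fillna` can fill the resulting gaps. The sketch below fills them with the mean of the remaining values; that choice of fill value is an assumption made for illustration only, and the names `cleaned` and `filled` are not from the original notebook.

```python
import numpy as np
import pandas as pd

dt = pd.Series([100, 120, -999, -1000, 350])

# Step 1: turn both placeholder values into NaN
cleaned = dt.replace([-999, -1000], np.nan)

# Step 2: fill the gaps; mean() skips NaN by default, so it averages 100, 120 and 350
filled = cleaned.fillna(cleaned.mean())
print(filled)
```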

<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Renaming Axis Indexes</font></b></p>
    </div>

---

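Axis labels are the row labels (the index) and the column names of a DataFrame. Like the values in a Series, they can be transformed by a function or a mapping, and they can also be modified in place without creating a new data structure.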

> #### **Given the DataFrame**

In [59]:
dm = pd.DataFrame(np.linspace(25000,300000,12).reshape((3,4)),index=['Lagos','Maryland','Ikoyi'],
                  columns=['one','two','three','four'])
dm

Unnamed: 0,one,two,three,four
Lagos,25000.0,50000.0,75000.0,100000.0
Maryland,125000.0,150000.0,175000.0,200000.0
Ikoyi,225000.0,250000.0,275000.0,300000.0


> #### **The index has a map method which can be used to modify its labels**
Assigning the result back to `dm.index` modifies the DataFrame in place

In [62]:
dm.index = dm.index.map(str.upper)
dm

Unnamed: 0,one,two,three,four
LAGOS,25000.0,50000.0,75000.0,100000.0
MARYLAND,125000.0,150000.0,175000.0,200000.0
IKOYI,225000.0,250000.0,275000.0,300000.0


> #### **Creating a transformed version rather than modifying the original dataframe**
#### Use rename: it returns a new DataFrame and leaves the original unchanged

In [63]:
dm.rename(index=str.title,columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Lagos,25000.0,50000.0,75000.0,100000.0
Maryland,125000.0,150000.0,175000.0,200000.0
Ikoyi,225000.0,250000.0,275000.0,300000.0


#### rename can also be used with a dictionary-like object to change a subset of the axis labels

In [66]:
dm.rename(index={'LAGOS': 'INDIANA'},columns={'three':'THIRD'})

Unnamed: 0,one,two,THIRD,four
INDIANA,25000.0,50000.0,75000.0,100000.0
MARYLAND,125000.0,150000.0,175000.0,200000.0
IKOYI,225000.0,250000.0,275000.0,300000.0
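
To confirm that `rename` does not touch the DataFrame it is called on, the labels of both objects can be compared after the call. This is a minimal sketch that rebuilds `dm` with the upper-case index from the earlier cell; the name `renamed` is illustrative.

```python
import numpy as np
import pandas as pd

dm = pd.DataFrame(np.linspace(25000, 300000, 12).reshape((3, 4)),
                  index=['LAGOS', 'MARYLAND', 'IKOYI'],
                  columns=['one', 'two', 'three', 'four'])

# rename returns a new DataFrame with the requested labels changed
renamed = dm.rename(index={'LAGOS': 'INDIANA'}, columns={'three': 'THIRD'})

print(list(dm.index), list(dm.columns))            # original labels are unchanged
print(list(renamed.index), list(renamed.columns))  # new labels live on the returned object
```

To keep the new labels, assign the result back to a variable, for example `dm = dm.rename(...)`.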
