# Data Transformation

## What is data transformation?

Real-world datasets often have quality issues. **Data transformation** is the process of modifying a dataset to eliminate these issues. Common transformation activities include:

- Splitting columns
- Converting dates to `datetime` objects, which are far easier to manipulate with `pandas`
- Encoding categorical variables
- Dealing with and replacing null or missing values
- Creating unique identifiers

The `pandas` library has many functions that can help with this task. You will also use the standard-library modules `string` and `base64`, and the third-party library `sklearn` (scikit-learn).

In [None]:
import sklearn
sklearn.__version__

In [None]:
import pandas as pd
import base64
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

pd.set_option('display.max_columns', None)
data = pd.read_csv('data.csv')
data.head()

## Splitting columns

You can see that the column **ZonaGeografica** includes two pieces of information separated by a hyphen. If you want to work with those pieces of information separately, you will have to split this column:

In [None]:
# See the unique values
data.ZonaGeografica.unique()

In [None]:
# Split the column on the first hyphen into two new columns
data[['Departamento', 'Ciudad']] = data.ZonaGeografica.str.split('-', n=1, expand=True)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
cols = ['ZonaGeografica', 'Departamento', 'Ciudad']
pd.concat([data[cols].head(2), data[cols].tail(2)])
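To see the split in isolation, here is a minimal sketch on a small in-memory DataFrame. The place names are hypothetical stand-ins for the real `ZonaGeografica` values, and the third row deliberately contains a second hyphen to show what `n=1` does:

```python
import pandas as pd

# Hypothetical toy values standing in for the real ZonaGeografica column
toy = pd.DataFrame(
    {"ZonaGeografica": ["Antioquia-Medellin", "Valle-Cali", "Bolivar-Cartagena-Centro"]}
)

# n=1 splits only on the first hyphen, so exactly two columns are produced
toy[["Departamento", "Ciudad"]] = toy["ZonaGeografica"].str.split("-", n=1, expand=True)
print(toy)
```

Without `n=1`, the third row would yield three pieces and the assignment to two columns would fail.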

As another example, you will often need to split a name field into its parts, here the first and second given names:

In [None]:
data[['Nombres']].head(10)

In [None]:
# Some people have more than two given names, so use the n parameter to limit the number of splits.
# With n=1, everything after the first whitespace goes into the second column.
data[['PrimerNombre', 'SegundoNombre']] = data.Nombres.str.split(n=1, expand=True)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
name_cols = ['Nombres', 'PrimerNombre', 'SegundoNombre']
pd.concat([data[name_cols].head(), data[name_cols].tail()])

### Exercise 1:

Split the `Apellidos` column into `PrimerApellido` and `SegundoApellido`.

In [None]:
# Your code

## Working with categorical variables and dates

Let's check the columns we have:

In [None]:
data.info()

### `datetime` conversion

You can see that `FechaMuestra` is structured as a date, but it is stored as text and cannot be manipulated as a date. Let's convert it to a `datetime` column so that we can use the `pandas` date accessors (`.dt`) on it:

In [None]:
data.FechaMuestra.unique()

In [None]:
data.FechaMuestra = pd.to_datetime(data.FechaMuestra)

In [None]:
data.FechaMuestra.head(10)
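The sketch below shows the conversion on a few hand-written strings. The date format and the sample values are assumptions for illustration; the real `FechaMuestra` column may use a different format:

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately invalid
fechas = pd.Series(["2020-03-15", "2020-04-01", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(fechas, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)
print(parsed.dt.day_name().tolist())
print("unparseable entries:", parsed.isna().sum())
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates, and counting `NaT` values afterwards is a quick check on how many rows failed to parse.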

In [None]:
# You can derive additional features from the date, such as the day of the week

data['WeekDay'] = data.FechaMuestra.dt.day_name()
data.head(2)

In [None]:
data.WeekDay.unique()

In [None]:
# You can convert the column to an ordered categorical type.
# The categories are listed in calendar order, Monday through Sunday.
cat_dtype = pd.api.types.CategoricalDtype(categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], ordered=True)
data.WeekDay = data.WeekDay.astype(cat_dtype)

In [None]:
data.WeekDay.unique()
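What the ordering buys you is easiest to see on a tiny Series. This is a sketch with made-up values, not the course dataset:

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
day_dtype = pd.api.types.CategoricalDtype(categories=days, ordered=True)

s = pd.Series(["Sunday", "Monday", "Friday"]).astype(day_dtype)

# Sorting follows the declared category order rather than alphabetical order
sorted_days = s.sort_values().tolist()
print(sorted_days)

# Comparisons against a category label are order-aware as well
from_friday = (s >= "Friday").tolist()
print(from_friday)
```

With a plain string column, the same sort would place Friday before Monday, because strings sort alphabetically.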

### Replacing null values

Now we can call `describe()` on the text (object) columns, which hold the categorical variables, to see more information about them:

In [None]:
data.describe(include='O')

For example, `Sexo` has four categories:

In [None]:
data.Sexo.unique()

You can see that there are problems with this column; namely, you need to unify the values and replace the null ones:

In [None]:
sexo = {'Mujer':'Femenino','Hombre':'Masculino','Femenino':'Femenino','Masculino':'Masculino'}

In [None]:
data.Sexo = data.Sexo.map(sexo)

In [None]:
data.Sexo.unique()

In [None]:
# Assign the result back; calling fillna(inplace=True) on a selected column may not update data
data['Sexo'] = data['Sexo'].fillna('Sin Información')

In [None]:
data.Sexo.unique()
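One detail of `map()` is worth seeing on its own: any value that is absent from the dictionary becomes `NaN`, not just the values that were already null. A minimal sketch with made-up values (the stray `"F"` is hypothetical):

```python
import pandas as pd

raw = pd.Series(["Mujer", "Hombre", "Femenino", None, "F"])
mapping = {
    "Mujer": "Femenino",
    "Hombre": "Masculino",
    "Femenino": "Femenino",
    "Masculino": "Masculino",
}

mapped = raw.map(mapping)
# Both the original null and the unmapped "F" are now missing
print(mapped.isna().tolist())

cleaned = mapped.fillna("Sin Información")
print(cleaned.value_counts())
```

Comparing `raw.isna().sum()` with `mapped.isna().sum()` before filling is a cheap way to detect labels the dictionary does not cover.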

### Encoding labels

Sometimes, it is helpful to encode categorical variable values as numbers instead of text:

In [None]:
data.info()

In [None]:
data.Sexo = data.Sexo.astype('category')

In [None]:
data.Sexo.unique()

In [None]:
data.head(2)

Then you can assign the encoded variable to a new column:

In [None]:
data.Sexo.cat.codes

In [None]:
# Store the integer codes in a new column
data['SexoCat'] = data.Sexo.cat.codes

In [None]:
data.head(2)

In [None]:
from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
data["CodeSexo"] = lb_make.fit_transform(data["Sexo"])
data[["Sexo", "CodeSexo"]].head(10)

In [None]:
data["CodeSexo"].value_counts()
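The relationship between categories and codes can be sketched on a toy Series (made-up values). Note how a missing value is encoded:

```python
import pandas as pd

s = pd.Series(["Masculino", "Femenino", None, "Femenino"]).astype("category")

# Categories are sorted by default, and each code is a position in that list
categories = list(s.cat.categories)
codes = s.cat.codes.tolist()
print(categories)
print(codes)  # a missing value is encoded as -1

# Build a lookup table to translate codes back into labels
lookup = dict(enumerate(s.cat.categories))
print(lookup)
```

Because missing values become `-1`, it is usually safer to fill or drop nulls before encoding, as was done for `Sexo` above.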

### One-hot encoding

We can go one step further: instead of replacing each value of a categorical variable with a number, we can create a *separate* column for each possible value and fill it with 1 or 0 (True or False). A 1 means that the row's value for the categorical variable equals the value that column represents; a 0 means it does not:

In [None]:
pd.get_dummies(data, columns=["Sexo"]).head(2)
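As a small illustration of the resulting layout, here is a sketch on a three-row toy DataFrame (the `id` column and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "Sexo": ["Femenino", "Masculino", "Femenino"]})

# dtype=int forces 0/1 output; recent pandas versions return booleans by default
dummies = pd.get_dummies(toy, columns=["Sexo"], dtype=int)
print(dummies)
```

Each row has exactly one 1 across the `Sexo_*` columns, which is where the name "one-hot" comes from.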

In [None]:
data["Sexo"].values

In [None]:
from sklearn.preprocessing import LabelBinarizer

jobs_encoder = LabelBinarizer()
jobs_encoder.fit(data['Sexo'])
transformed = jobs_encoder.transform(data['Sexo'])
ohe_df = pd.DataFrame(transformed)
ohe_df.head()

In [None]:
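# Same encoding as above, now appended to the original DataFrame
from sklearn.preprocessing import LabelBinarizer

jobs_encoder = LabelBinarizer()
transformed = jobs_encoder.fit_transform(data['Sexo'])
# Reuse data's index so the rows line up in concat, and prefix the column names
ohe_df = pd.DataFrame(transformed, index=data.index).add_prefix('Sexo_')
data = pd.concat([data, ohe_df], axis=1)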

In [None]:
data.head(2)

In [None]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()

In [None]:
pd.__version__

In [None]:
X = data['Sexo'].to_numpy().reshape(-1, 1)
X

In [None]:
X = data['Sexo'].to_numpy().reshape(-1, 1)
enc.fit(X)

In [None]:
enc.categories_

In [None]:
enc.transform(X).toarray()

In [None]:
enc.inverse_transform([[1., 0., 0.],[0., 1., 0.],[0., 0., 1.]])

In [None]:
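# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out (available since 1.0)
enc.get_feature_names_out(['Gender'])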

In [None]:
pd.set_option('display.max_columns', None)
data.head(2)

### Selecting columns by type

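If you want to work only with the categorical variables, you can select columns by data type with the `select_dtypes()` method:

In [None]:
# 'object' covers text columns; 'category' covers columns converted with astype('category'), such as Sexo
cat_data = data.select_dtypes(include=['object', 'category']).copy()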
cat_data.info()
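The `include` and `exclude` arguments can be sketched on a toy DataFrame with one column of each kind (column values are made up):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Nombres": ["Ana", "Luis"],
        "Edad": [34, 51],
        "Sexo": pd.Categorical(["Femenino", "Masculino"]),
    }
)

numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
category_cols = toy.select_dtypes(include=["category"]).columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

print(numeric_cols)
print(category_cols)
print(non_numeric_cols)
```

Using `exclude=["number"]` is a convenient way to pick up every non-numeric column regardless of whether it is stored as text or as a `category`.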

### Creating a unique identifier

In some situations, you will not have a unique identifier readily available for your data. However, you can create one from a combination of columns, provided that no two rows share the same combination. Here, for example, the columns `Nombres` and `Apellidos` are combined and converted into an ID:

In [None]:
data['id_unique'] = data.apply(lambda x: ':'.join([str(x['Nombres']), str(x['Apellidos'])]), axis=1)
data['id_unique'].head()

In [None]:
data['id_unique'] = data.apply(lambda x: ':'.join([str(x['Nombres']), str(x['Apellidos'])]), axis=1)
data['id_unique'] = data['id_unique'].apply(lambda x: base64.b64encode(x.encode()).decode())
data['id_unique'].unique()
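Two properties of this approach are worth checking, and the sketch below does so on made-up names: Base64 is a reversible encoding (not a hash), and the ID is only unique if the underlying combination is.

```python
import base64

import pandas as pd

# Hypothetical names; rows 0 and 2 are deliberately identical
toy = pd.DataFrame(
    {
        "Nombres": ["Ana Maria", "Luis", "Ana Maria"],
        "Apellidos": ["Gomez", "Perez", "Gomez"],
    }
)

key = toy["Nombres"].astype(str) + ":" + toy["Apellidos"].astype(str)
toy["id_unique"] = key.map(lambda k: base64.b64encode(k.encode()).decode())

# Decoding recovers the original text, so the ID does not hide the names
decoded = base64.b64decode(toy.loc[0, "id_unique"]).decode()
print(decoded)

# Verify uniqueness instead of assuming it
print(toy["id_unique"].is_unique)
print(toy[toy["id_unique"].duplicated(keep=False)])
```

If `is_unique` comes back `False`, more columns need to go into the key before it can serve as an identifier.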

### Working with null values

There are a few ways to go about handling null values in `pandas` DataFrames. Earlier, we simply replaced missing values with text that indicated that no information was available.

Here, we will use a different approach: **imputation**. We switch to a new dataset and first check which of its columns contain null values:

In [None]:
data = pd.read_csv('crypto-markets.txt')
data.head(2)

In [None]:
data.isnull().any()

We'll go ahead and **impute** the missing values; that is, estimate suitable replacement values from the rest of the data. `IterativeImputer` does this by modeling each column that has missing values as a function of the other columns:

In [None]:
imputer = IterativeImputer()

cols_to_impute = ['open', 'low']

# fit_transform returns a NumPy array, so restore the column names and the index
imputed_df = pd.DataFrame(
    imputer.fit_transform(data[cols_to_impute]),
    columns=cols_to_impute,
    index=data.index,
)
print(imputed_df.head(2))

data[cols_to_impute] = imputed_df

data.isnull().any()
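To make the behaviour concrete without the full dataset, here is a minimal sketch on a five-row toy frame. The numbers are made up, and `random_state=0` is only there to make repeated runs agree:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

# Hypothetical prices with one gap in each column
toy = pd.DataFrame(
    {
        "open": [1.0, 2.0, np.nan, 4.0, 5.0],
        "low": [0.9, 1.8, 2.7, np.nan, 4.5],
    }
)

imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
print("nulls left:", int(filled.isna().sum().sum()))
```

Only the cells that were missing are changed; observed values pass through untouched. The estimates are model-based guesses, so it is worth keeping a record of which cells were imputed.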

## Strings

String manipulation is another important component of data transformation.

In [None]:
text = "Hello, \n\tWorld!"
text

In [None]:
print(text)

When working with strings, we can use the `string` module to access some useful constants, such as the sets of ASCII letters, digits, and punctuation characters:

In [None]:
import string

# String constants
print('ascii_letters: ',string.ascii_letters)
print('ascii_lowercase: ',string.ascii_lowercase)
print('ascii_uppercase: ',string.ascii_uppercase)
print('digits: ',string.digits)
print('hexdigits: ',string.hexdigits)
print('whitespace: ',string.whitespace)  # ' \t\n\r\x0b\x0c'
print('punctuation: ',string.punctuation)
print('printable: ',string.printable)

In [None]:
# Changing text
print('hello world ds4a'.capitalize())
print('hello world ds4a'.upper())
print('HELLO WORLD DS4A'.lower())
print('  123456  '.lstrip())
print('  123456  '.rstrip())
print('  123456  '.strip())

In [None]:
# Looking in text
print('hello world ds4a'.count('o'))
print('hello world ds4a'.endswith('a'))
print('hello world ds4a'.startswith('a'))
print('hello world ds4a'.find('o'))
print('hello world ds4a'.find('z'))
print('hello world ds4a'.index('o'))
print('hello123'.isalnum()) # Return True if all characters in the string are alphanumeric and there is at least one character, False otherwise.
print('123456'.isdigit()) # Return True if all characters in the string are digits and there is at least one character, False otherwise. 
print('hello'.isalpha()) # Return True if all characters in the string are alphabetic and there is at least one character, False otherwise.
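Two of these methods have edge cases that are easy to trip over. The short sketch below contrasts `find()` with `index()` when the substring is absent, and shows what `isdigit()` makes of signed and decimal numbers:

```python
text = "hello world ds4a"

# find() signals "not found" with -1
missing_pos = text.find("z")
print(missing_pos)

# index() signals the same condition by raising ValueError
try:
    text.index("z")
    raised = False
except ValueError:
    raised = True
print(raised)

# isdigit() tests every character, so signs and decimal points fail it
print("-12".isdigit(), "3.5".isdigit())
```

Because `-1` is also a valid index (the last character), always test the result of `find()` before using it to slice.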

In [None]:
# Other functions
print('hello'.center(50,'*'))
print('123456'.zfill(10))
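These pieces combine naturally into a small cleaning helper. The function below is a hypothetical example (`normalize` is not part of any library) that uses `string.punctuation` together with the methods shown above:

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    # maketrans with a third argument builds a table that deletes those characters
    table = str.maketrans("", "", string.punctuation)
    # split() with no arguments splits on any whitespace, including \n and \t
    return " ".join(text.translate(table).lower().split())


print(normalize("Hello, \n\tWorld!"))
```

The same function can be applied to a whole `pandas` column with `Series.map(normalize)`.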