![data_tasks.jpg](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg)

# Transformations with Pandas

Sometimes we cannot answer a question about our dataset directly, even though we have all the data we need. We may first need a few intermediate steps that transform the data into a representation that is easier to analyze.

Today, we will learn how to modify our dataframe: transform the information we already have, add new columns, and make some calculations. We will also cover the topic of KPIs.

## Let's recall the cars dataset:
The dataset contains information on 10,000 cars. There are 9 different columns:
- Make (Car brand, example: Ford)
- Model (The Model of the Car, example: Focus)
- Year (The Year in which the car was built, example: 2012)
- Variant (The car model version showing the PS, example: 1.6 Trendline)
- Kms (The kilometers the car has been driven, example: 90000)
- Price (The offered price for the car, example: 10000)
- Doors (How many doors the car has, example: 4)
- Kind (Type of car, example: Pick-Up)
- Location (Where the car is located, example: Buenos Aires)


## Prepare the dataset

In [1]:
# Imports
import pandas as pd
import plotly

In [2]:
# install plotly
# https://plotly.com/python/pandas-backend/
# !pip install plotly==4.14.3
!pip install plotly
pd.options.plotting.backend = "plotly"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
# read in the data
cars = pd.read_csv("https://raw.githubusercontent.com/juliandnl/redi_ss20/master/cars.csv")


In [4]:
cars.sample(5)

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location
8367,Ford,Ka,2013,1.0 Fly Viral 63cv,61000,150000,3.0,Hatchback,Bs.as. G.b.a. Norte
1057,Honda,HR-V,2017,1.8 Ex-l 2wd Cvt,10500,825000,5.0,SUV,Capital Federal
5013,Volkswagen,Gol Trend,2014,1.6 Pack Ii 101cv,48000,210000,5.0,Hatchback,Bs.as. G.b.a. Oeste
9014,Honda,City,2011,1.5 Ex-l At 120cv Br,120000,215000,4.0,Sedán,Neuquén
9475,Volkswagen,Vento,2012,2.5 Luxury 170cv,80000,330000,4.0,Sedán,Neuquén


## Let's answer some questions!

1. How old (in years) are the cars in our dataset?
2. What are the minimum and maximum kilometers per year that the cars in the dataset traveled?
3. What is the distribution of horsepower?

### 1. How old (in years) are the cars in our dataset?

No column answers this question directly. However, we can derive the answer from the `Year` column.

Let's do it step by step!

What is the current date?

We can use a hardcoded value for that like this:

In [13]:
# create hard-coded date
today_hardcoded = 2022
print(today_hardcoded)

# is configured as integer
print(type(today_hardcoded))

2022
<class 'int'>


But we can also get this value programmatically. Then we can run the code at any time without checking whether the values we use are still up to date.

In [14]:
from datetime import date

today = date.today()
print(today)

# the value is a datetime.date object, not a plain integer
print(type(today))

2022-09-18
<class 'datetime.date'>


In [21]:
print('Year: ' + str(today.year))
print('Month: ' + str(today.month))
print('Day: ' + str(today.day))

Year: 2022
Month: 9
Day: 18


Now that we have the current date, let's calculate the age of the first car in our dataset:

In [22]:
first_car = cars.loc[0]

# You can also use the iloc indexer, like this:
# first_car = cars.iloc[0]
# Here both return the same row, but in general loc (label-based) and iloc (position-based) differ.
# You can read more about this here: https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/

first_car

Make              Volkswagen
Model                  Vento
Year                    2012
Variant     2.5 Luxury 170cv
Kms                    99950
Price                 360000
Doors                    4.0
Kind                   Sedán
Location             Córdoba
Name: 0, dtype: object
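
To make the difference between `loc` and `iloc` concrete, here is a minimal sketch on a made-up toy dataframe (the `toy` frame and its index labels 10, 20, 30 are assumptions for illustration, not part of the cars dataset). `loc` selects by index label, `iloc` by position. In `cars` the labels happen to be 0, 1, 2, ..., which is why both give the same first row there.

```python
import pandas as pd

# Toy frame with a non-default index (made up for illustration).
toy = pd.DataFrame(
    {"Make": ["Ford", "Honda", "Volkswagen"], "Year": [2012, 2017, 2011]},
    index=[10, 20, 30],
)

by_label = toy.loc[10]      # the row whose index label is 10
by_position = toy.iloc[0]   # the first row, whatever its label is

print(by_label.equals(by_position))  # both select the same row here
print(toy.iloc[2].Make)              # third row by position

# toy.loc[0] would raise a KeyError, because no row is labelled 0.
```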

In [23]:
car_age_in_years = today.year - first_car.Year
car_age_in_years

10

Alright! Now we know how to calculate the age of a single car. Let's see how to automate this process and calculate it for all of the cars at once.

### Functions

A function is a block of code that runs only when it is called. Data can be passed into a function in the form of parameters. A function can return data as a result.

* `def` creates a function and assigns it a name,
* `return` sends a result back to the caller,
* arguments are passed by assignment,
* arguments and return types are not declared,
* to execute a function it must be called.

``` python
def <name>(param1, param2, ..., paramN):
    <statements> 
    return <value> 

<name>(arg1, arg2, ..., argN)
```
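
As a concrete instance of the template above, here is a small sketch. The function name `kilometers_to_miles` and its `factor` parameter are hypothetical and chosen only for illustration.

```python
def kilometers_to_miles(kilometers, factor=0.621371):
    """Convert a distance in kilometers to miles."""
    # `factor` has a default value, so callers may omit it.
    return kilometers * factor


# Nothing runs until the function is called.
miles = kilometers_to_miles(100)
print(round(miles, 1))
```

Defining the function only stores it under its name. The body executes when we call it with an argument.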

In [24]:
today = date.today()

# A function calculating the age of a car.
def calculate_age_of_a_car(year):
    return today.year - year

cars['Age'] = cars.Year.apply(calculate_age_of_a_car)

In [None]:
cars.head()

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location,Age
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,4.0,Sedán,Córdoba,10
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,2.0,Pick-Up,Entre Ríos,10
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,5.0,Hatchback,Bs.as. G.b.a. Sur,11
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,4.0,Pick-Up,Neuquén,5
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,3.0,Hatchback,Córdoba,9


Let's answer our question: How old (in years) are the cars in our dataset?

In [25]:
# displays the ages on a bar plot
ages = cars.Age.value_counts()
ages.plot.bar()

### Anonymous (lambda) functions

*   An *anonymous function* is a function that is defined without a name.
*   While normal functions are defined using the `def` keyword, anonymous functions are defined using the `lambda` keyword.
*   Anonymous functions are also called *lambda functions*.





In [26]:
# Example of a lambda function:
double = lambda x:  x*2
double(5)

10

In [28]:
# is the same as:
def double2(x):
  return x*2

double2(5)

10

Coming back to the example of cars and their age, the lambda function would look like this:

In [29]:
today = date.today()

cars['Age'] = cars.Year.apply(lambda x: today.year - x)

In [30]:
cars.head()

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location,Age
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,4.0,Sedán,Córdoba,10
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,2.0,Pick-Up,Entre Ríos,10
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,5.0,Hatchback,Bs.as. G.b.a. Sur,11
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,4.0,Pick-Up,Neuquén,5
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,3.0,Hatchback,Córdoba,9
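
For simple arithmetic like this, pandas can also operate on the whole column at once, without `apply`. The sketch below uses a made-up five-row frame and a fixed `reference_year` (both assumptions, so that the result is reproducible) to show that the two approaches give the same ages.

```python
import pandas as pd

# Made-up sample of construction years (same as the first rows shown above).
toy = pd.DataFrame({"Year": [2012, 2012, 2011, 2017, 2013]})
reference_year = 2022  # fixed here instead of date.today().year

with_apply = toy.Year.apply(lambda x: reference_year - x)
vectorized = reference_year - toy.Year  # arithmetic on the whole column

print(vectorized.tolist())
print(with_apply.equals(vectorized))
```

The column-wise form is usually shorter and faster on large dataframes. `apply` remains useful when the logic does not reduce to simple arithmetic.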


### 2. What are the minimum and maximum kilometers per year that the cars in the dataset traveled?

The procedure here will be very similar. We will create a new column `KM_per_year` to answer this question.

In [35]:
# A function calculating kilometers per year.
def calculate_km_per_year(row):
    return row.Kms / row.Age

cars['KM_per_year'] = cars.apply(calculate_km_per_year, axis=1)

Notice that we passed an additional argument to the apply function: `axis`. We need it here because the new value depends on more than one column, so the function must receive a whole row. A pandas DataFrame is a 2D structure, and `axis` tells `apply` in which direction to traverse it: with `axis=1` the function is called once per row, and with the default `axis=0` it is called once per column.
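
The effect of `axis` is easiest to see on a tiny example. The two-row `toy` frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Kms": [99950, 140000], "Age": [10, 10]})

# axis=0 (the default): the function receives each column as a Series.
column_max = toy.apply(lambda column: column.max(), axis=0)

# axis=1: the function receives each row as a Series.
km_per_year = toy.apply(lambda row: row.Kms / row.Age, axis=1)

print(column_max.to_dict())
print(km_per_year.tolist())
```

With `axis=0` we get one result per column, with `axis=1` one result per row.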

In [36]:
cars.head()

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location,Age,KM_per_year
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,4.0,Sedán,Córdoba,10,9995.0
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,2.0,Pick-Up,Entre Ríos,10,14000.0
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,5.0,Hatchback,Bs.as. G.b.a. Sur,11,12000.0
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,4.0,Pick-Up,Neuquén,5,2600.0
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,3.0,Hatchback,Córdoba,9,11888.888889


In [37]:
round(cars.KM_per_year.max(), 1)

20000.0

In [38]:
cars.KM_per_year.min()

618.1818181818181

The maximum of `KM_per_year` is 20,000 km/year and the minimum is about 618 km/year.
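
Both values can also be computed in one call, and we can look up which row holds the maximum. This is a sketch on a made-up three-row frame, not the real dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Model": ["Vento", "Ranger", "Fox"],
        "KM_per_year": [9995.0, 14000.0, 12000.0],
    }
)

# agg accepts a list of aggregation names and returns one value per name.
summary = toy.KM_per_year.agg(["min", "max"])

# idxmax returns the index label of the largest value.
top_row = toy.loc[toy.KM_per_year.idxmax()]

print(summary["min"], summary["max"])
print(top_row.Model)
```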

### 3. What is the distribution of horsepower?

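How do we find the horsepower? If you look at the `Variant` column, you will see that it packs several pieces of information into one string. But the format is always the same: it starts with a number (e.g. 2.5, 2.3...), which this notebook treats as the horsepower, and we can extract it from there. (Strictly speaking, this leading number looks like the engine displacement in litres, while the actual horsepower appears in some variants as a suffix such as `170cv`. We keep the column name `Horsepower` to match the rest of the notebook.)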

In [40]:
# A function extracting horsepower from `Variant` column
def get_horsepower(variant):
    # Extract the first element from our string. (More on strings next week!)
    return variant.split(' ')[0]  

cars['Horsepower'] = cars.Variant.apply(get_horsepower)

In [42]:
cars.head()

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location,Age,KM_per_year,Horsepower
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,4.0,Sedán,Córdoba,10,9995.0,2.5
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,2.0,Pick-Up,Entre Ríos,10,14000.0,2.3
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,5.0,Hatchback,Bs.as. G.b.a. Sur,11,12000.0,1.6
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,4.0,Pick-Up,Neuquén,5,2600.0,3.2
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,3.0,Hatchback,Córdoba,9,11888.888889,1.4


In [41]:
# To answer the question, plot the distribution of horsepower:
cars.Horsepower.value_counts().plot.bar()

Most of the cars have a horsepower value of 1.6, followed by 2.0. Only very few cars have a value of 3.0 or higher.
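
The same extraction can be written with the pandas string accessor. The sketch below uses the five variants shown in the table above. It also pulls out the number in front of `cv`, on the assumption that this suffix marks horsepower (caballos de vapor). Not every variant carries it, so some results are missing.

```python
import pandas as pd

variants = pd.Series(
    [
        "2.5 Luxury 170cv",
        "2.3 Cd Xl Plus 4x2",
        "1.6 Trendline",
        "3.2 Cd Xls Tdci 200cv Automática",
        "1.4 Power 83cv 3 p",
    ]
)

# Leading number: split on spaces, take the first piece, convert to float.
leading = pd.to_numeric(variants.str.split(" ").str[0])

# Digits directly before "cv"; rows without a match become NaN.
cv = pd.to_numeric(variants.str.extract(r"(\d+)\s*cv", expand=False))

print(leading.tolist())
print(cv.dropna().tolist())
```

Converting to numbers with `pd.to_numeric` also lets the bar plot order the values numerically instead of as text.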

### Add a new column with random values to a dataframe:

Let's imagine that our dataset has an additional column `Sold_date` (we will fill it with random data ourselves).

1. How many cars are sold per month?

Let's create the `Sold_date` column! We already know how to add a new column to our dataframe. Here, we will assign each row a random date from 2019. To imitate a real-world scenario, where not every product is sold, we will leave some of the dates as NULL values.

In [46]:
from random import randrange
import datetime 
import numpy as np

start_date = datetime.datetime(2019, 1, 1, 12, 0)

def create_new_date(start=start_date):
    # Creates a random date within a year after the `start` date.

    # 30% of cars are not sold - they won't have a sold date.
    if randrange(10) > 2:
        # Generate a random date within 365 days of `start` (2019 by default).
        return start + datetime.timedelta(days=randrange(365))
    return np.nan

# Add a new column `Sold_date` to our dataframe.
cars['Sold_date'] = cars.Year.apply(lambda x: create_new_date())

In [47]:
cars.head()
# NaT is missing date-time value: https://pandas.pydata.org/docs/user_guide/missing_data.html

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location,Age,KM_per_year,Horsepower,Sold_date
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,4.0,Sedán,Córdoba,10,9995.0,2.5,NaT
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,2.0,Pick-Up,Entre Ríos,10,14000.0,2.3,2019-12-15 12:00:00
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,5.0,Hatchback,Bs.as. G.b.a. Sur,11,12000.0,1.6,2019-08-19 12:00:00
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,4.0,Pick-Up,Neuquén,5,2600.0,3.2,2019-05-01 12:00:00
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,3.0,Hatchback,Córdoba,9,11888.888889,1.4,2019-04-12 12:00:00
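
Because the dates are random, every run of the cell above produces a different column. If reproducible results matter, one option is a seeded NumPy generator. The sketch below is an alternative to the notebook's approach, not a replacement for it, and the names `rng`, `n_rows` and `sold_dates` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed: same dates on every run
n_rows = 10                           # would be len(cars) for the real data

start = pd.Timestamp("2019-01-01 12:00")

# One random offset of 0..364 days per row, added to the start date.
offsets = pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D")
sold_dates = pd.Series(start + offsets)

# Mark roughly 30% of the rows as unsold by setting them to NaT.
unsold = rng.random(n_rows) < 0.3
sold_dates[unsold] = pd.NaT

print(sold_dates.head())
```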


In [48]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Make         10000 non-null  object        
 1   Model        10000 non-null  object        
 2   Year         10000 non-null  int64         
 3   Variant      10000 non-null  object        
 4   Kms          10000 non-null  int64         
 5   Price        10000 non-null  int64         
 6   Doors        10000 non-null  float64       
 7   Kind         10000 non-null  object        
 8   Location     10000 non-null  object        
 9   Age          10000 non-null  int64         
 10  KM_per_year  10000 non-null  float64       
 11  Horsepower   10000 non-null  object        
 12  Sold_date    7021 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(4), object(6)
memory usage: 1015.8+ KB


## Saving the data to a file and reading it from a file  

Let's save the dataframe to a CSV file and read it back from the file!

In [49]:
cars.to_csv("cars.csv")

In [50]:
cars_from_file = pd.read_csv("cars.csv")

In [51]:
cars_from_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   10000 non-null  int64  
 1   Make         10000 non-null  object 
 2   Model        10000 non-null  object 
 3   Year         10000 non-null  int64  
 4   Variant      10000 non-null  object 
 5   Kms          10000 non-null  int64  
 6   Price        10000 non-null  int64  
 7   Doors        10000 non-null  float64
 8   Kind         10000 non-null  object 
 9   Location     10000 non-null  object 
 10  Age          10000 non-null  int64  
 11  KM_per_year  10000 non-null  float64
 12  Horsepower   10000 non-null  float64
 13  Sold_date    7021 non-null   object 
dtypes: float64(3), int64(5), object(6)
memory usage: 1.1+ MB


## Datetime columns

Notice that when datetime values are saved to a file and read back, the column changes to the `object` data type. But we can convert it to datetime again very easily!
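
The conversion can also happen while reading the file, through the `parse_dates` argument of `pd.read_csv`. The sketch below uses a made-up two-row frame and an in-memory buffer instead of a real file. It also passes `index=False` to `to_csv`, which avoids the extra `Unnamed: 0` column when the data is read back.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "Make": ["Ford", "Honda"],
        "Sold_date": pd.to_datetime(["2019-12-15 12:00:00", None]),
    }
)

buffer = io.StringIO()                # stands in for a file on disk
original.to_csv(buffer, index=False)  # do not write the index as a column
buffer.seek(0)

# parse_dates lists the columns that should be read as datetimes.
restored = pd.read_csv(buffer, parse_dates=["Sold_date"])

print(restored.dtypes)
```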

In [52]:
cars_from_file['Sold_date'] = pd.to_datetime(cars_from_file['Sold_date'])

In [53]:
cars_from_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Unnamed: 0   10000 non-null  int64         
 1   Make         10000 non-null  object        
 2   Model        10000 non-null  object        
 3   Year         10000 non-null  int64         
 4   Variant      10000 non-null  object        
 5   Kms          10000 non-null  int64         
 6   Price        10000 non-null  int64         
 7   Doors        10000 non-null  float64       
 8   Kind         10000 non-null  object        
 9   Location     10000 non-null  object        
 10  Age          10000 non-null  int64         
 11  KM_per_year  10000 non-null  float64       
 12  Horsepower   10000 non-null  float64       
 13  Sold_date    7021 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(5), object(5)
memory usage: 1.1+ MB


We will group our dataset by month (since all of our data comes from a single year) and count how many sold cars we have in each group:

In [54]:
grouped = cars_from_file.groupby(by=cars_from_file.Sold_date.dt.month)[['Year']].size()
grouped.index.rename('Sold_date_month', inplace=True)
grouped

Sold_date_month
1.0     608
2.0     534
3.0     618
4.0     563
5.0     592
6.0     602
7.0     584
8.0     588
9.0     574
10.0    557
11.0    596
12.0    605
dtype: int64

In [55]:
grouped.plot.bar()

Voila! Now we can communicate how the number of sold cars changed over time :)
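
The month labels in the result above appear as `1.0`, `2.0`, ... because the column contains missing dates, which turns the month numbers into floats. One way to get integer labels is to drop the missing dates first. This is a sketch on a made-up series of four dates.

```python
import pandas as pd

sold = pd.Series(pd.to_datetime(["2019-01-05", "2019-01-20", "2019-03-02", None]))

# Without missing values the month numbers stay integers.
months = sold.dropna().dt.month

# Count how many dates fall into each month.
per_month = months.groupby(months).size()

print(per_month)
```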

Extra materials:
- [Difference between loc and iloc functions](https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/)
- [Apply, Map and ApplyMap](https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff)
- [Transform function](https://towardsdatascience.com/when-to-use-pandas-transform-function-df8861aa0dcf)
- [Numpy library documentation](https://numpy.org/)
