# Data Manipulation in Python with `Pandas`

This article covers data manipulation in Python using `Pandas`, one of the most widely used Python packages for data analysis.


## Contents:
* [Load Packages](#1)
* [Vectors in Python](#2)
* [Data-frames in Python](#3)
  * [Load a Data-frame](#4)
  * [Export a data-frame in csv format](#5)
  * [Numeric Summary](#6)
  * [Create a Data-frame manually](#7)
  * [Data-frame dimensions](#8)
  * [Transpose a data-frame](#9)
  * [Add new variables to a data-frame](#10)
  * [Selecting rows and columns in a data-frame](#11)
  * [Filtering observations (rows) in a data-frame](#12)
  * [Deleting columns in a data-frame](#13)
  * [Working with NaN in a data-frame](#14)
  * [Grouping and summarizing in a data-frame](#15)
  * [Join data-frames](#16)
* [Numeric Sequences](#17)
* [Functions and loops](#18)


## Load Packages <a class="anchor" id="1"></a>

We load the Python packages that we are going to use:

In [19]:
import pandas as pd
import numpy as np
import array as arr

from IPython.display import display
pd.options.display.max_columns = None

In [20]:
import warnings
warnings.filterwarnings('ignore')

## Vectors in Python <a class="anchor" id="2"></a>

In Python, the objects most similar to R vectors are `Series`, which are provided by the `Pandas` package.




In [21]:
vector = pd.Series([0.25, 0.5, 0.75, 1.0])
vector

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [22]:
elements=[0.25, 0.5, 0.75, 1.0]
vector = pd.Series(elements)
vector

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

A *series* has two main attributes, `values` and `index`.

`values` holds the data of the vector.
`index` holds the names (labels) of the vector elements.
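
As a small sketch of these two attributes, the following builds a short series and reads both of them back. The variable names `populations`, `values` and `labels` are chosen only for this illustration.

```python
import pandas as pd

# Illustrative series; the names used here are assumptions for this sketch.
populations = pd.Series([46350000, 44380000], index=['Spain', 'Argentina'])

values = populations.values        # NumPy array holding the data
labels = list(populations.index)   # index labels as a plain Python list

print(values)
print(labels)
```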

In [23]:
vector_1 = pd.Series(data = [46350000, 44380000, 18120000, 31970000, 10330000], index=['Spain', 'Argentina', 'Chile', 'Peru', 'Cuba'])
vector_1

Spain        46350000
Argentina    44380000
Chile        18120000
Peru         31970000
Cuba         10330000
dtype: int64

Another way to create a vector in Python is from a dictionary, whose keys become the index:

In [24]:
vector_2 = {'Spain': 46350000, 'Argentina': 44380000, 'Chile': 18120000, 'Peru': 31970000, 'Cuba': 10330000}

vector_2 = pd.Series(vector_2)

vector_2

Spain        46350000
Argentina    44380000
Chile        18120000
Peru         31970000
Cuba         10330000
dtype: int64

We can select the elements of a vector (`series`) as follows:

**Warning:** Python numbers positions starting from zero (0, 1, 2, ...), not from one as R does, so the first element is at position 0.

In [25]:
vector_1[0]

46350000

In [26]:
vector_1[1]

44380000

In [27]:
vector_1[0:3]

Spain        46350000
Argentina    44380000
Chile        18120000
dtype: int64

In [28]:
vector_1["Spain":"Chile"]

Spain        46350000
Argentina    44380000
Chile        18120000
dtype: int64
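
Plain `[]` mixes lookups by position and by label, and recent pandas versions deprecate integer positions inside `[]` when the series has string labels. A minimal sketch of the explicit alternatives, `iloc` for positions and `loc` for labels, is shown below. The series `s` is a shortened copy made up for this example. Note that a position slice excludes its end, while a label slice includes it.

```python
import pandas as pd

s = pd.Series([46350000, 44380000, 18120000],
              index=['Spain', 'Argentina', 'Chile'])

first_by_position = s.iloc[0]        # element at position 0
first_by_label = s.loc['Spain']      # same element, looked up by label

by_position = s.iloc[0:2]            # end position excluded: 2 elements
by_label = s.loc['Spain':'Chile']    # end label included: 3 elements

print(len(by_position), len(by_label))
```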

## Data-frames in Python <a class="anchor" id="3"></a>

A data-frame is a set of `series` that share two indexes, one associated with the rows and the other with the columns.


### Load a Data-frame <a class="anchor" id="4"></a>
We load the data-set:


In [29]:
data = pd.read_csv("properties_data.csv")

### Export a Data-frame in csv format <a class="anchor" id="5"></a>

In [30]:
# We have the following data frame: df

# df.to_csv('filename.csv', index=False)  ## We create the file 'filename.csv' with the data frame 'df'
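
To see what `to_csv` produces without touching the disk, the sketch below writes to an in-memory buffer and reads the text back with `read_csv`. The `io.StringIO` buffer is only a stand-in for a real file path, and the small data-frame is invented for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({'country': ['Spain', 'Cuba'],
                   'population': [46350000, 10330000]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves the row index out of the output
csv_text = buffer.getvalue()

# Reading the text back recovers the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```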

### Number of rows and columns of a data-frame <a class="anchor" id="6"></a>

We can compute the number of rows and columns of the data set as follows:

In [31]:
ncol=len(data.columns)
nrow=len(data)

In [32]:
ncol 


38

In [33]:
nrow

1905

### Numeric Summary <a class="anchor" id="6"></a>
We can compute a numeric summary of the data-set variables:

In [34]:
data.describe()

Unnamed: 0,id,latitude,longitude,price,size_in_sqft,price_per_sqft,no_of_bedrooms,no_of_bathrooms
count,1905.0,1905.0,1905.0,1905.0,1905.0,1905.0,1905.0,1905.0
mean,7573308.0,25.116479,55.212338,2085830.0,1417.050394,1327.243785,1.793176,2.512861
std,192525.2,0.062704,0.068794,2913200.0,891.487639,668.473559,0.949489,1.063117
min,5528049.0,24.865992,55.069311,220000.0,294.0,361.87,0.0,1.0
25%,7560167.0,25.072697,55.145389,890000.0,840.0,870.92,1.0,2.0
50%,7631829.0,25.096545,55.207506,1400000.0,1271.0,1169.59,2.0,2.0
75%,7670328.0,25.18813,55.271797,2200000.0,1703.0,1622.5,2.0,3.0
max,7706643.0,25.273623,55.441623,35000000.0,9576.0,4805.87,5.0,6.0
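
As a smaller sketch of what `describe()` returns, the example below uses a hypothetical one-column data-frame named `prices` (not the properties data-set). The result is itself a data-frame, so individual statistics can be picked out with `loc`.

```python
import pandas as pd

# Hypothetical data, chosen so the statistics are easy to check by hand.
prices = pd.DataFrame({'price': [100.0, 200.0, 300.0, 400.0]})

summary = prices.describe()

# Rows of the summary are the statistic names, columns are the variables.
print(summary.loc[['count', 'mean', '50%'], 'price'])
```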


### Create a Data-frame manually <a class="anchor" id="7"></a>


We can create a data-frame manually as follows.

First we create the variables of the data-frame:

In [35]:
variable_1 = {'Spain': 46350000, 'Argentina': 44380000, 'Chile': 18120000, 'Peru': 31970000, 'Cuba': 10330000}

variable_1 = pd.Series(variable_1)
variable_1

Spain        46350000
Argentina    44380000
Chile        18120000
Peru         31970000
Cuba         10330000
dtype: int64

In [36]:
variable_2 = {'Spain': 5, 'Argentina': 6, 'Chile': 3, 'Peru': 2, 'Cuba': 1.5}

variable_2 = pd.Series(variable_2)
variable_2

Spain        5.0
Argentina    6.0
Chile        3.0
Peru         2.0
Cuba         1.5
dtype: float64

Now we create the data-frame using the above variables:

In [37]:
data_frame = pd.DataFrame({'variable 1': variable_1, 'variable 2': variable_2})
data_frame

Unnamed: 0,variable 1,variable 2
Spain,46350000,5.0
Argentina,44380000,6.0
Chile,18120000,3.0
Peru,31970000,2.0
Cuba,10330000,1.5


Another way to create a data-frame in Python:

In [47]:
data_frame = pd.DataFrame({'variable 1': [46350000, 44380000, 18120000, 31970000, 10330000], 
                           'variable 2': [5, 6, 3, 2, 1.5] 
                           } , index=['Spain','Argentina', 'Chile', 'Peru', 'Cuba'])
data_frame

Unnamed: 0,variable 1,variable 2
Spain,46350000,5.0
Argentina,44380000,6.0
Chile,18120000,3.0
Peru,31970000,2.0
Cuba,10330000,1.5
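
A third option, sketched here under the assumption that the data arrives as one record per row, is to pass a list of dictionaries. The column names `country` and `population` are invented for this example.

```python
import pandas as pd

# One dictionary per row; keys become column names.
records = [{'country': 'Spain', 'population': 46350000},
           {'country': 'Cuba', 'population': 10330000}]

# set_index moves the 'country' column into the row index.
from_records = pd.DataFrame(records).set_index('country')

print(from_records)
```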


###  Data-frame dimensions <a class="anchor" id="8"></a>

We can get the dimensions of the data-frame:

In [39]:
data_frame.shape

(5, 2)

###  Transpose a data-frame <a class="anchor" id="9"></a>

We can transpose the data-frame:

In [40]:
data_frame.T

Unnamed: 0,Spain,Argentina,Chile,Peru,Cuba
variable 1,46350000.0,44380000.0,18120000.0,31970000.0,10330000.0
variable 2,5.0,6.0,3.0,2.0,1.5


###  Adding new variables to a data-frame <a class="anchor" id="10"></a>

We can add new variables to the data-frame as follows:

In [41]:
data_frame['variable 3'] = ['Europe', 'South America', 'South America', 'South America', 'Caribbean']
data_frame

Unnamed: 0,variable 1,variable 2,variable 3
Spain,46350000,5.0,Europe
Argentina,44380000,6.0,South America
Chile,18120000,3.0,South America
Peru,31970000,2.0,South America
Cuba,10330000,1.5,Caribbean


In [42]:
new_variable = round(data_frame['variable 1']/data_frame['variable 2'])

data_frame['new_variable'] = new_variable

data_frame

Unnamed: 0,variable 1,variable 2,variable 3,new_variable
Spain,46350000,5.0,Europe,9270000.0
Argentina,44380000,6.0,South America,7396667.0
Chile,18120000,3.0,South America,6040000.0
Peru,31970000,2.0,South America,15985000.0
Cuba,10330000,1.5,Caribbean,6886667.0
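
Assigning with `data_frame[...] = ...` modifies the data-frame in place. When the original should stay untouched, `assign` is one alternative, sketched below. It returns a new data-frame with the extra column. The names `base`, `extended` and `ratio` are illustrative.

```python
import pandas as pd

base = pd.DataFrame({'population': [46350000, 10330000], 'units': [5, 1.5]},
                    index=['Spain', 'Cuba'])

# assign returns a new data-frame; `base` itself is left unchanged.
extended = base.assign(ratio=(base['population'] / base['units']).round())

print(extended)
```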


### Selecting rows and columns in a data-frame <a class="anchor" id="11"></a>

We can select rows and columns using `loc` and `iloc`: `loc` selects by label (the explicit index), while `iloc` selects by integer position (the implicit index).

In [43]:
data_frame.loc[['Spain']]

Unnamed: 0,variable 1,variable 2,variable 3,new_variable
Spain,46350000,5.0,Europe,9270000.0

In [None]:
data_frame.loc[['Spain', 'Cuba']]

Unnamed: 0,variable 1,variable 2,variable 3,new_variable
Spain,46350000,5.0,Europe,9270000.0
Cuba,10330000,1.5,Caribbean,6886667.0


In [None]:
data_frame.loc[['Spain', 'Cuba','Chile'] , ['variable 2'] ]

Unnamed: 0,variable 2
Spain,5.0
Cuba,1.5
Chile,3.0


In [None]:
data_frame.loc[ ['Spain', 'Cuba','Chile'] , ['variable 2' , 'variable 1'] ]

Unnamed: 0,variable 2,variable 1
Spain,5.0,46350000
Cuba,1.5,10330000
Chile,3.0,18120000


In [None]:
data_frame.iloc[[2] , [0,1,3] ] 

Unnamed: 0,variable 1,variable 2,new_variable
Chile,18120000,3.0,6040000.0


In [None]:
data_frame.iloc[[0, 2, 4] , [1 , 0]]

Unnamed: 0,variable 2,variable 1
Spain,5.0,46350000
Chile,3.0,18120000
Cuba,1.5,10330000
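
The difference between the two accessors is easiest to see when the row labels are integers that do not match the row positions. The following sketch uses a made-up data-frame `scores` whose labels are deliberately shuffled.

```python
import pandas as pd

# Integer labels that deliberately differ from the row positions.
scores = pd.DataFrame({'score': [10, 20, 30]}, index=[2, 0, 1])

by_label = scores.loc[0, 'score']    # row whose label is 0 (the second row)
by_position = scores.iloc[0, 0]      # first row, first column

print(by_label, by_position)
```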


###  Filtering observations (rows) in a data-frame <a class="anchor" id="12"></a>

We can filter observations as well:

In [None]:
data_frame['variable 2']>3

Spain         True
Argentina     True
Chile        False
Peru         False
Cuba         False
Name: variable 2, dtype: bool

In [None]:
data_frame.loc[data_frame['variable 2']>3 , : ]

Unnamed: 0,variable 1,variable 2,variable 3,new_variable
Spain,46350000,5.0,Europe,9270000.0
Argentina,44380000,6.0,South America,7396667.0


Another way of filtering rows:

In [None]:
X1 = pd.Series([ "France" , "Spain" , "Italy", "Russia", "Portugal" ])
X2 = pd.Series([10, 12, 50, 30, -13])
X3 = pd.Series([ 22 , 7, 40, 16, 45 ])
X4 = pd.Series([15, 22, 5, -30, 21])

df  = pd.DataFrame( {"Countries": X1 , "X2": X2 , "X3": X3 , "X4": X4} )

df

Unnamed: 0,Countries,X2,X3,X4
0,France,10,22,15
1,Spain,12,7,22
2,Italy,50,40,5
3,Russia,30,16,-30
4,Portugal,-13,45,21


In [None]:
df[df.Countries=="Spain"]

Unnamed: 0,Countries,X2,X3,X4
1,Spain,12,7,22


In [None]:
df[ (df.X2 > 15) & (df.X4 < 20) ]

Unnamed: 0,Countries,X2,X3,X4
2,Italy,50,40,5
3,Russia,30,16,-30


In [None]:
df[ (df.X2 > 15) | (df.X4 < 20) ]

Unnamed: 0,Countries,X2,X3,X4
0,France,10,22,15
2,Italy,50,40,5
3,Russia,30,16,-30


In [None]:
X1 = pd.Series([ "France" , "Spain" , "Italy", "Russia", "USA" ])
X2 = pd.Series([ "Europe" , "Europe" , "Europe", "Not Europe", "Not Europe" ])
X3 = pd.Series([10, 12, 50, 30, -13])
X4 = pd.Series([ 22 , 7, 40, 16, 45 ])
X5 = pd.Series([15, 22, 5, -30, 21])

df  = pd.DataFrame( {"Countries": X1 , "Continent": X2 , "X3": X3 , "X4": X4 ,"X5": X5} )
df

Unnamed: 0,Countries,Continent,X3,X4,X5
0,France,Europe,10,22,15
1,Spain,Europe,12,7,22
2,Italy,Europe,50,40,5
3,Russia,Not Europe,30,16,-30
4,USA,Not Europe,-13,45,21


In [None]:
df[ (df.Continent == "Europe") & (df.X4 < 25) ]

Unnamed: 0,Countries,Continent,X3,X4,X5
0,France,Europe,10,22,15
1,Spain,Europe,12,7,22
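
Two further filtering idioms may be useful here: `isin` for membership tests and `query` for conditions written as a string. This is only a sketch on a reduced, hypothetical data-frame `df_demo` that reuses some of the values above.

```python
import pandas as pd

df_demo = pd.DataFrame({'Countries': ['France', 'Spain', 'Italy', 'Russia'],
                        'X2': [10, 12, 50, 30],
                        'X4': [15, 22, 5, -30]})

# Rows whose country is in a given list.
subset_isin = df_demo[df_demo['Countries'].isin(['Spain', 'Italy'])]

# Same kind of condition as (df.X2 > 15) & (df.X4 < 20), written as a string.
subset_query = df_demo.query('X2 > 15 and X4 < 20')

print(subset_isin)
print(subset_query)
```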


###  Deleting columns in a data-frame <a class="anchor" id="13"></a>

Delete columns:

In [None]:
del df['X3']

df

Unnamed: 0,Countries,Continent,X4,X5
0,France,Europe,22,15
1,Spain,Europe,7,22
2,Italy,Europe,40,5
3,Russia,Not Europe,16,-30
4,USA,Not Europe,45,21
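
`del` removes one column in place. As a sketch of an alternative, `drop(columns=...)` removes several columns at once and returns a new data-frame, leaving the original intact. The data-frame `frame` below is invented for the example.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# drop returns a copy without the listed columns; `frame` keeps all three.
trimmed = frame.drop(columns=['B', 'C'])

print(list(trimmed.columns))
```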


### Working with NaN in a data-frame  <a class="anchor" id="14"></a>

We are going to create a data-frame containing NaNs and then deal with them.

In [None]:
X1 = pd.Series([ 55 , np.nan , 85, np.nan, 63 ])
X2 = pd.Series([10, 12, np.nan, 30, -13])
X3 = pd.Series([ 22 , 7, 40, 16, 45 ])
X4 = pd.Series([15, 22, 5, -30, np.nan])

df_nan  = pd.DataFrame( {"X1": X1 , "X2": X2 , "X3": X3 , "X4": X4} )

df_nan

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0
1,,12.0,7,22.0
2,85.0,,40,5.0
3,,30.0,16,-30.0
4,63.0,-13.0,45,


In [None]:
df_nan.isnull()

Unnamed: 0,X1,X2,X3,X4
0,False,False,False,False
1,True,False,False,False
2,False,True,False,False
3,True,False,False,False
4,False,False,False,True


In [None]:
df_nan.isnull().sum()

X1    2
X2    1
X3    0
X4    1
dtype: int64

In [None]:
df_nan.dropna()  # Drop rows with missing values

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0


In [None]:
df_nan.dropna(axis=1)  # Drop columns with missing values

Unnamed: 0,X3
0,22
1,7
2,40
3,16
4,45


In [None]:
df_nan.dropna(axis=1, how='all') 

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0
1,,12.0,7,22.0
2,85.0,,40,5.0
3,,30.0,16,-30.0
4,63.0,-13.0,45,


In [None]:
df_nan.dropna(thresh=2)

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0
1,,12.0,7,22.0
2,85.0,,40,5.0
3,,30.0,16,-30.0
4,63.0,-13.0,45,


In [None]:
df_nan.fillna(0)  # Fill with 0

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0
1,0.0,12.0,7,22.0
2,85.0,0.0,40,5.0
3,0.0,30.0,16,-30.0
4,63.0,-13.0,45,0.0


In [None]:
df_nan.bfill()  # Fill with the next valid value (replaces the deprecated fillna(method='bfill'))

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0
1,85.0,12.0,7,22.0
2,85.0,30.0,40,5.0
3,63.0,30.0,16,-30.0
4,63.0,-13.0,45,


In [None]:
df_nan.fillna(df_nan.mean())  # Fill with the mean of each column

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0
1,67.666667,12.0,7,22.0
2,85.0,9.75,40,5.0
3,67.666667,30.0,16,-30.0
4,63.0,-13.0,45,3.0


In [None]:
df_nan.fillna(df_nan.median())  # Fill with the median of each column

Unnamed: 0,X1,X2,X3,X4
0,55.0,10.0,22,15.0
1,63.0,12.0,7,22.0
2,85.0,11.0,40,5.0
3,63.0,30.0,16,-30.0
4,63.0,-13.0,45,10.0
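
Different columns often need different fill values. As a sketch, `fillna` also accepts a dictionary that maps column names to the value used for that column. The data-frame `df_gaps` is a shortened, hypothetical version of the one above.

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame({'X1': [55, np.nan, 85], 'X2': [10, 12, np.nan]})

# X1 is filled with its own median, X2 with a constant.
filled = df_gaps.fillna({'X1': df_gaps['X1'].median(), 'X2': 0})

print(filled)
```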


###  Grouping and summarizing in a data-frame <a class="anchor" id="15"></a>

Group by and summarize:

In [None]:
X1 = pd.Series([ "Masculino" , "Femenino" , "Masculino", "Masculino", "Femenino" ])
X2 = pd.Series([10, 12, 32, 30, -13])
X3 = pd.Series([ 22 , 7, 40, 16, 45 ])
X4 = pd.Series([15, 22, 5, -30, 23])

df = pd.DataFrame( {"Sex": X1 , "X2": X2 , "X3": X3 , "X4": X4} ) 
df

Unnamed: 0,Sex,X2,X3,X4
0,Masculino,10,22,15
1,Femenino,12,7,22
2,Masculino,32,40,5
3,Masculino,30,16,-30
4,Femenino,-13,45,23


In [None]:
df.groupby('Sex')['X2'].mean()

Sex
Femenino     -0.5
Masculino    24.0
Name: X2, dtype: float64

In [None]:
df.groupby('Sex')['X2'].std()

Sex
Femenino     17.677670
Masculino    12.165525
Name: X2, dtype: float64

In [None]:

df.groupby('Sex')['X2'].min()

Sex
Femenino    -13
Masculino    10
Name: X2, dtype: int64

In [None]:
df.groupby('Sex')['X2'].agg(['min', 'max', 'median', 'sum', 'std'])

Sex,min,max,median,sum,std
Femenino,-13,12,-0.5,-1,17.67767
Masculino,10,32,30.0,72,12.165525
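
When the summary columns need explicit names, named aggregation is one option. The sketch below passes `new_name=(column, function)` pairs to `agg`. The output names `x2_min`, `x2_max` and `x2_mean` are chosen for this example only.

```python
import pandas as pd

df_groups = pd.DataFrame({'Sex': ['Masculino', 'Femenino', 'Masculino',
                                  'Masculino', 'Femenino'],
                          'X2': [10, 12, 32, 30, -13]})

# Each keyword becomes a column of the result.
summary = df_groups.groupby('Sex').agg(x2_min=('X2', 'min'),
                                       x2_max=('X2', 'max'),
                                       x2_mean=('X2', 'mean'))

print(summary)
```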


###  Join data-frames <a class="anchor" id="16"></a>

In [None]:
X7 = pd.Series([ 22 , 34 , 13, 35, 23 ])
X8 = pd.Series([10, 12, 32, 30, -13])
 

df2 = pd.DataFrame( {"X7": X7 , "X8": X8} ) 
df2

Unnamed: 0,X7,X8
0,22,10
1,34,12
2,13,32
3,35,30
4,23,-13


In [None]:
df

Unnamed: 0,Sex,X2,X3,X4
0,Masculino,10,22,15
1,Femenino,12,7,22
2,Masculino,32,40,5
3,Masculino,30,16,-30
4,Femenino,-13,45,23


We can use `concat` to join data-frames as follows:

In [None]:
from pandas import concat

In [None]:
concat([df,df2])

Unnamed: 0,Sex,X2,X3,X4,X7,X8
0,Masculino,10.0,22.0,15.0,,
1,Femenino,12.0,7.0,22.0,,
2,Masculino,32.0,40.0,5.0,,
3,Masculino,30.0,16.0,-30.0,,
4,Femenino,-13.0,45.0,23.0,,
0,,,,,22.0,10.0
1,,,,,34.0,12.0
2,,,,,13.0,32.0
3,,,,,35.0,30.0
4,,,,,23.0,-13.0


In [None]:
concat([df,df2] , ignore_index=True)

Unnamed: 0,Sex,X2,X3,X4,X7,X8
0,Masculino,10.0,22.0,15.0,,
1,Femenino,12.0,7.0,22.0,,
2,Masculino,32.0,40.0,5.0,,
3,Masculino,30.0,16.0,-30.0,,
4,Femenino,-13.0,45.0,23.0,,
5,,,,,22.0,10.0
6,,,,,34.0,12.0
7,,,,,13.0,32.0
8,,,,,35.0,30.0
9,,,,,23.0,-13.0


The following is equivalent to the *cbind* function in R:

In [None]:
concat([df,df2] , axis=1) # equivalent to R's cbind function

Unnamed: 0,Sex,X2,X3,X4,X7,X8
0,Masculino,10,22,15,22,10
1,Femenino,12,7,22,34,12
2,Masculino,32,40,5,13,32
3,Masculino,30,16,-30,35,30
4,Femenino,-13,45,23,23,-13


In [None]:
X1 = pd.Series([ 2 , 4 , 3, 35, 23 ])
X2 = pd.Series([10, 12, 3, 3, -13])
X3 = pd.Series([22, 33, 1, 5, -3])
X4 = pd.Series([52, 2, 23, 6, -5])
X5 = pd.Series([ 22 , 34 , 13, 35, 23 ])
X6 = pd.Series([10, 12, 32, 30, -13])
X7 = pd.Series([22, 33, 1, 56, -13])
X8 = pd.Series([5, 12, 2, 66, -5])

df1 = pd.DataFrame( {"X1": X1 , "X2": X2 , "X3": X3 , "X4": X4} ) 

df2 = pd.DataFrame( {"X1": X5 , "X2": X6 , "X3": X7 , "X4": X8} ) 

In [None]:
df1

Unnamed: 0,X1,X2,X3,X4
0,2,10,22,52
1,4,12,33,2
2,3,3,1,23
3,35,3,5,6
4,23,-13,-3,-5


In [None]:
df2

Unnamed: 0,X1,X2,X3,X4
0,22,10,22,5
1,34,12,33,12
2,13,32,1,2
3,35,30,56,66
4,23,-13,-13,-5


The following is equivalent to the *rbind* function in R:

In [None]:
pd.concat([df1,df2]) # equivalent to R's rbind function

Unnamed: 0,X1,X2,X3,X4
0,2,10,22,52
1,4,12,33,2
2,3,3,1,23
3,35,3,5,6
4,23,-13,-3,-5
0,22,10,22,5
1,34,12,33,12
2,13,32,1,2
3,35,30,56,66
4,23,-13,-13,-5


In [None]:
df_keys = pd.concat([df1,df2], keys=['x', 'y'])
df_keys

Unnamed: 0,Unnamed: 1,X1,X2,X3,X4
x,0,2,10,22,52
x,1,4,12,33,2
x,2,3,3,1,23
x,3,35,3,5,6
x,4,23,-13,-3,-5
y,0,22,10,22,5
y,1,34,12,33,12
y,2,13,32,1,2
y,3,35,30,56,66
y,4,23,-13,-13,-5


In [None]:
df_keys.loc['x']

Unnamed: 0,X1,X2,X3,X4
0,2,10,22,52
1,4,12,33,2
2,3,3,1,23
3,35,3,5,6
4,23,-13,-3,-5


In [None]:
df_keys.loc['y']

Unnamed: 0,X1,X2,X3,X4
0,22,10,22,5
1,34,12,33,12
2,13,32,1,2
3,35,30,56,66
4,23,-13,-13,-5


Now we are going to show how to use `merge` to join data-frames:

In [None]:
id1 = list(range(1,6)) 
id2 = [1, 3, 6, 7, 2]
X1 = pd.Series([ 2 , 4 , 3, 35, 23 ])
X2 = pd.Series([10, 12, 3, 3, -13])
X3 = pd.Series([22, 33, 1, 5, -3])
X4 = pd.Series([52, 2, 23, 6, -5])
X5 = pd.Series([ 22 , 34 , 13, 35, 23 ])

df1 = pd.DataFrame( {"id": id1 , "X1": X1 , "X2": X2 , "X3": X3 } ) 

df2 = pd.DataFrame( {"id": id2 ,  "X4": X4 , "X5": X5  } ) 

In [None]:
df1

Unnamed: 0,id,X1,X2,X3
0,1,2,10,22
1,2,4,12,33
2,3,3,3,1
3,4,35,3,5
4,5,23,-13,-3


In [None]:
df2

Unnamed: 0,id,X4,X5
0,1,52,22
1,3,2,34
2,6,23,13
3,7,6,35
4,2,-5,23


In [None]:
pd.merge(df1, df2, on='id') # equivalent to an inner join

Unnamed: 0,id,X1,X2,X3,X4,X5
0,1,2,10,22,52,22
1,2,4,12,33,-5,23
2,3,3,3,1,2,34


Now we give the id column a different name in each data-frame:

In [None]:
df1 = pd.DataFrame( {"id1": id1 , "X1": X1 , "X2": X2 , "X3": X3 } ) 

df2 = pd.DataFrame( {"id2": id2 ,  "X4": X4 , "X5": X5  } ) 

In [None]:
# pd.merge(df1, df2, on='id')  ## if we run it, we get an error

The correct way to use `merge` when the key columns have different names is:

In [None]:
df_merge = pd.merge(df1, df2, left_on='id1' , right_on='id2')
df_merge

Unnamed: 0,id1,X1,X2,X3,id2,X4,X5
0,1,2,10,22,1,52,22
1,2,4,12,33,2,-5,23
2,3,3,3,1,3,2,34


In [None]:
del df_merge['id2']

df_merge

Unnamed: 0,id1,X1,X2,X3,X4,X5
0,1,2,10,22,52,22
1,2,4,12,33,-5,23
2,3,3,3,1,2,34


In [None]:
df1 = pd.DataFrame( {"id": id1 , "X1": X1 , "X2": X2 , "X3": X3 } ) 

df2 = pd.DataFrame( {"id": id2 ,  "X4": X4 , "X5": X5  } ) 

In [None]:
df1

Unnamed: 0,id,X1,X2,X3
0,1,2,10,22
1,2,4,12,33
2,3,3,3,1
3,4,35,3,5
4,5,23,-13,-3


In [None]:
df2

Unnamed: 0,id,X4,X5
0,1,52,22
1,3,2,34
2,6,23,13
3,7,6,35
4,2,-5,23


**Outer Join** using `merge`:

In [None]:
pd.merge(df1, df2, on='id', how='outer')

Unnamed: 0,id,X1,X2,X3,X4,X5
0,1,2.0,10.0,22.0,52.0,22.0
1,2,4.0,12.0,33.0,-5.0,23.0
2,3,3.0,3.0,1.0,2.0,34.0
3,4,35.0,3.0,5.0,,
4,5,23.0,-13.0,-3.0,,
5,6,,,,23.0,13.0
6,7,,,,6.0,35.0


**Inner Join** using `merge`:

In [None]:
pd.merge(df1, df2, on='id', how='inner')


Unnamed: 0,id,X1,X2,X3,X4,X5
0,1,2,10,22,52,22
1,2,4,12,33,-5,23
2,3,3,3,1,2,34


**Left Join** using `merge`:

In [None]:
pd.merge(df1, df2, on='id', how='left')

Unnamed: 0,id,X1,X2,X3,X4,X5
0,1,2,10,22,52.0,22.0
1,2,4,12,33,-5.0,23.0
2,3,3,3,1,2.0,34.0
3,4,35,3,5,,
4,5,23,-13,-3,,


In [None]:
df1[['id']]

Unnamed: 0,id
0,1
1,2
2,3
3,4
4,5


**Right Join** using `merge`:

In [None]:
pd.merge(df1, df2, on='id', how='right')

Unnamed: 0,id,X1,X2,X3,X4,X5
0,1,2.0,10.0,22.0,52,22
1,3,3.0,3.0,1.0,2,34
2,6,,,,23,13
3,7,,,,6,35
4,2,4.0,12.0,33.0,-5,23


In [None]:
df2[['id']]

Unnamed: 0,id
0,1
1,3
2,6
3,7
4,2


We can do all these joins even when the key columns have different names:

In [None]:
df1 = pd.DataFrame( {"id1": id1 , "X1": X1 , "X2": X2 , "X3": X3 } ) 

df2 = pd.DataFrame( {"id2": id2 ,  "X4": X4 , "X5": X5  } ) 

For example, we can apply the outer join in this context as follows:

In [None]:
pd.merge(df1, df2,  left_on='id1' , right_on='id2' , how='outer')

Unnamed: 0,id1,X1,X2,X3,id2,X4,X5
0,1.0,2.0,10.0,22.0,1.0,52.0,22.0
1,2.0,4.0,12.0,33.0,2.0,-5.0,23.0
2,3.0,3.0,3.0,1.0,3.0,2.0,34.0
3,4.0,35.0,3.0,5.0,,,
4,5.0,23.0,-13.0,-3.0,,,
5,,,,,6.0,23.0,13.0
6,,,,,7.0,6.0,35.0
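
In an outer join it can be hard to tell which rows matched and which came from only one side. As a sketch, `merge` accepts `indicator=True`, which adds a `_merge` column with the values `both`, `left_only` or `right_only`. The data-frames `left` and `right` are reduced, hypothetical versions of the ones above.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'X1': [2, 4, 3]})
right = pd.DataFrame({'id': [1, 3, 6], 'X4': [52, 2, 23]})

# indicator=True records the origin of every row in the '_merge' column.
merged = pd.merge(left, right, on='id', how='outer', indicator=True)

print(merged[['id', '_merge']])
```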


## Numeric Sequences <a class="anchor" id="17"></a>

In [None]:
repeat_unos = arr.array('i',(1 for i in range(0,15)))
repeat_unos

array('i', [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [None]:
list(range(4))

[0, 1, 2, 3]

In [None]:
list(range(2,10)) # equivalent to 2:9 in R

[2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
list(range(0, 10, 3)) # from 0 to 9 in steps of 3

[0, 3, 6, 9]

In [None]:
list(range(0, 10, 4)) # from 0 to 9 in steps of 4

[0, 4, 8]

In [None]:
np.random.randint(0, 11, size=25)  # random integers from 0 to 10 (the upper bound 11 is excluded)

array([1, 7, 1, 6, 1, 8, 7, 8, 4, 1, 6, 9, 5, 9, 4, 6, 4, 1, 2, 6, 4, 2,
       9, 7, 4])

In [None]:
np.random.randn(25)

array([ 1.74, -0.72, -0.99, -1.93, -0.25,  0.22,  0.27,  1.94, -0.59,
        0.96,  0.22,  1.54, -2.01, -1.31,  0.79,  0.58,  0.05,  1.28,
        0.46, -0.17, -0.  ,  1.3 ,  1.36, -0.4 ,  0.14])

In [None]:
np.round(np.random.uniform(3, 10, 25))  # 25 uniform draws between 3 and 10, rounded

array([ 8.,  9.,  9.,  7., 10., 10.,  7.,  8.,  9.,  7., 10.,  9.,  6.,
        4.,  5.,  6.,  8.,  5.,  5.,  4., 10.,  5.,  7.,  9.,  9.])

In [None]:
values = np.random.uniform(3, 10, 25)  # named `values` to avoid shadowing the built-in `set`
values

array([7.04, 5.38, 4.61, 5.97, 7.78, 8.02, 5.27, 8.69, 4.68, 5.82, 8.76,
       3.32, 6.27, 5.27, 7.07, 5.47, 6.91, 6.1 , 5.35, 7.04, 4.73, 8.06,
       3.59, 4.74, 8.65])

In [None]:
np.random.choice(values, 10)  # draw 10 elements at random, with replacement

array([3.32, 4.68, 5.27, 7.04, 8.65, 7.07, 7.07, 8.76, 7.07, 7.04])
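
The random outputs above change on every run. For reproducible results, one option is NumPy's `Generator` interface with a fixed seed, sketched below. The seed value 42 is arbitrary, and `replace=False` is added here to show sampling without repeated elements.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

integers = rng.integers(0, 11, size=25)        # integers 0..10, upper bound excluded
uniform_values = rng.uniform(3, 10, size=25)   # floats between 3 and 10

# replace=False draws each element at most once.
sample = rng.choice(uniform_values, size=10, replace=False)

print(integers)
print(sample)
```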

## Functions and loops <a class="anchor" id="18"></a>

### Functions :

In [None]:
def my_function(a, b):
    """
    a: number
    b: number
    """
    return 2*a + b


In [None]:
my_function(10, 1)

21

In [None]:
def my_function_2(a, b, c):
    
    return 2*a, b, 100+c

In [None]:
my_function_2(10, 2, 5)

(20, 2, 105)
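
A function that returns several values actually returns a single tuple, which can be unpacked into separate variables. A minimal sketch, with a function name invented for this example:

```python
def scale_and_shift(a, b, c):
    """Return three values at once, packed in a tuple."""
    return 2 * a, b, 100 + c


# Tuple unpacking: one variable per returned value.
doubled, unchanged, shifted = scale_and_shift(10, 2, 5)

print(doubled, unchanged, shifted)
```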

### If statement :

In [None]:
range(10)

range(0, 10)

In [None]:
x = 99

if x < 0:
    print(x, 'is negative')
elif x > 0:   # elif = else if
    print(x, 'is positive')
else:
    print('Then it must be 0')

99 is positive


### Loop For :

In [None]:
for i in range(10):
    if i - 2 > 4 :
        print(i)

7
8
9


In [None]:
for i in range(1, 5):
    if i == 3:
        break
    print(i)

1
2


In [None]:
for i in [1,2,3,4]:
    if i == 3:
        continue
    print(i)

1
2
4


### Loop While :

In [None]:
i = 0
x = True

while x:
    i += 1
    if i == 4:
        x = False

print(i)

4


Let's see how the `+=` operator works in Python:

In [None]:
i = 1

i += 5

i

6
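
The other arithmetic operators have the same shorthand form. A short sketch, using an illustrative variable `total`:

```python
total = 10

total += 5    # same as total = total + 5
total -= 3    # same as total = total - 3
total *= 2    # same as total = total * 2
total //= 5   # same as total = total // 5 (integer division)

print(total)
```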