<center><u><H1>Pandas: Data Structures and Operations</H1></u></center>

## The pandas library has two main data structures:

1. Series

2. DataFrame

## SERIES:

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.Series(np.random.randn(8))
# one-dimensional collection built with pd.Series
# np.random.randn() generates random numbers from a standard normal distribution
# a numeric index is generated automatically


0    0.200742
1   -0.495106
2   -1.532731
3   -0.352933
4    1.077676
5    0.363326
6    0.826306
7    2.654586
dtype: float64

### The index of a Series can be customized with index values (letters or numbers):


In [3]:
pd.Series(np.random.randn(5), index=[0,1,2,3,4])
# specify the index values explicitly
# the index values can also be letters

0    0.656234
1    0.792556
2   -0.689447
3    0.767158
4    0.986380
dtype: float64

In [4]:
pd.Series(np.random.randn(5), index=["A","B","C","D","E"])
# index made of letters

A    1.022310
B    1.395995
C   -1.081523
D   -0.426717
E   -0.219361
dtype: float64
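
Once a Series has a custom index, its values can be read either by label or by position. The following is a minimal sketch with fixed, hypothetical values (not the random data above) so the result is reproducible:

```python
import pandas as pd

# Hypothetical values chosen for illustration
s = pd.Series([10.0, 20.0, 30.0], index=["A", "B", "C"])

by_label = s.loc["B"]        # label-based lookup
by_position = s.iloc[1]      # position-based lookup (second element)
subset = s.loc[["A", "C"]]   # several labels at once returns a Series

print(by_label, by_position)
print(subset)
```

Both lookups reach the same element here because label "B" sits at position 1.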

### We can also use a Python dict:

In [5]:
d = {'A': 6, 'B': 3, 'C': 'abc', 'D': np.exp(2)}
pd.Series(d)
# dtype is object because the values have mixed types
# the dictionary d is passed to pd.Series
# the result still has a single dimension

A           6
B           3
C         abc
D    7.389056
dtype: object
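
To see how the value types drive the resulting dtype, the sketch below builds the same dict with and without the string entry. The variable names `mixed` and `numeric` are illustrative only:

```python
import numpy as np
import pandas as pd

# Same dict as above: the string 'abc' forces the generic object dtype
mixed = pd.Series({'A': 6, 'B': 3, 'C': 'abc', 'D': np.exp(2)})

# Without the string, integers and floats are stored together as floats
numeric = pd.Series({'A': 6, 'B': 3, 'D': np.exp(2)})

print(mixed.dtype)
print(numeric.dtype)
```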

## DataFrame:

<b>A DataFrame is a 2D data structure whose columns can hold different data types and whose rows are labeled by an index. It can be built from the following data structures:</b>

1. Numpy array
2. Lists
3. Dicts
4. Series

In [6]:
# using a dict of pandas Series
d = {'column_1': pd.Series([1,2,3]),
    'column_2': pd.Series(['abc',10.5,'xy'])}
df = pd.DataFrame(d)
df
# dict of Series: the dict keys become the column names

   column_1 column_2
0         1      abc
1         2     10.5
2         3       xy


In [7]:
#using dict of lists
d = {'column_1': [1,2,3],
    'column_2': ['abc',10.5,'xy'],
    'column_3': [14,15,26]}
df = pd.DataFrame(d)
df

   column_1 column_2  column_3
0         1      abc        14
1         2     10.5        15
2         3       xy        26


In [8]:
# using a NumPy array
array = np.array([[0.8, 5.5], [3.7, 12.4]])
df = pd.DataFrame({'Column1': array[:, 0], 'Column2': array[:, 1]}, index=['A','B'])
print(df)

   Column1  Column2
A      0.8      5.5
B      3.7     12.4
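
As an alternative sketch, a 2D NumPy array can also be passed to `pd.DataFrame` directly, with the labels given through the `columns` and `index` parameters. The name `df_direct` is illustrative:

```python
import numpy as np
import pandas as pd

array = np.array([[0.8, 5.5], [3.7, 12.4]])

# Pass the whole 2D array and name the columns and rows explicitly
df_direct = pd.DataFrame(array, columns=['Column1', 'Column2'], index=['A', 'B'])

# Same frame built column by column, as in the cell above
df_dict = pd.DataFrame({'Column1': array[:, 0], 'Column2': array[:, 1]}, index=['A', 'B'])

print(df_direct.equals(df_dict))
```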


### Selection and Indexing:

In [9]:
# selecting a column
df['Column1']

A    0.8
B    3.7
Name: Column1, dtype: float64

In [10]:
# selecting more than one column
df[['Column1','Column2']]

   Column1  Column2
A      0.8      5.5
B      3.7     12.4


### <u>loc and iloc:</u>
#### -loc works on labels in the index.
#### -iloc works on the positions in the index (so it only takes integers).

In [11]:
#selecting rows
df.loc['A']
# loc refers to the row by its label

Column1    0.8
Column2    5.5
Name: A, dtype: float64

In [12]:
df.iloc[0]
# iloc refers to the row by its integer position

Column1    0.8
Column2    5.5
Name: A, dtype: float64
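
The difference between `loc` and `iloc` is easiest to see when the index labels are integers that do not match the positions. The frame `df_int` below is a hypothetical example:

```python
import pandas as pd

# Integer labels 10, 20, 30 sit at positions 0, 1, 2
df_int = pd.DataFrame({'value': ['x', 'y', 'z']}, index=[10, 20, 30])

first_by_position = df_int.iloc[0]['value']   # position 0 is the row labeled 10
by_label = df_int.loc[20, 'value']            # label 20 is the second row

# No row carries the label 0, so loc raises KeyError
try:
    df_int.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_position, by_label, label_zero_exists)
```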

In [13]:
type(df['Column1'])

pandas.core.series.Series

### Inserting new column

In [14]:
df['new'] = df['Column1'] + df['Column2']
df
# new column computed from the other columns

   Column1  Column2   new
A      0.8      5.5   6.3
B      3.7     12.4  16.1


### Deleting a column

In [15]:
df.drop('new',axis=1,inplace=True) # use inplace to make changes permanent
df
# drop the column 'new'
# axis=1 refers to columns
# axis=0 refers to rows
# inplace=True modifies df itself instead of returning a new DataFrame

   Column1  Column2
A      0.8      5.5
B      3.7     12.4
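
Without `inplace=True`, `drop` leaves the original untouched and returns a new DataFrame. A minimal sketch, where `df_demo` and `trimmed` are illustrative names:

```python
import pandas as pd

df_demo = pd.DataFrame({'Column1': [0.8, 3.7], 'Column2': [5.5, 12.4]}, index=['A', 'B'])
df_demo['new'] = df_demo['Column1'] + df_demo['Column2']

# columns='new' is equivalent to passing 'new' with axis=1
trimmed = df_demo.drop(columns='new')

print(list(df_demo.columns))   # the original still has 'new'
print(list(trimmed.columns))   # the returned copy does not
```

Assigning the result to a new name keeps both versions available, which can be convenient while exploring data.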


### Selecting a subset of the dataframe with rows and columns


In [16]:
df.loc['A','Column1']
# row A, column Column1

0.8

In [17]:
# reset the index
df.reset_index(inplace=True)
df
# the index changes from A, B to 0, 1
# the old index labels are kept in a new column named 'index'

  index  Column1  Column2
0     A      0.8      5.5
1     B      3.7     12.4


In [18]:
df.loc[[0,1],['index','Column2']]

  index  Column2
0     A      5.5
1     B     12.4


### Selection by condition:

In [19]:
df[df['Column1']>1][['index','Column2']]
# rows where Column1 is greater than 1
# show only the 'index' and 'Column2' columns

  index  Column2
1     B     12.4


In [20]:
# AND condition
df[(df['Column1']>1) & (df['Column2'] > 5)]

  index  Column1  Column2
1     B      3.7     12.4


In [21]:
# OR condition
df[(df['Column1']>1) | (df['Column2'] <= 5.5)]

  index  Column1  Column2
0     A      0.8      5.5
1     B      3.7     12.4


In [22]:
# NOT condition
df[~((df['Column1']>1) | (df['Column2'] <= 5.5))]

Empty DataFrame
Columns: [index, Column1, Column2]
Index: []
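
The row condition and the column selection can also be combined in a single `loc` call instead of chaining two sets of brackets. A sketch on a rebuilt copy of the frame, where `df_demo`, `mask`, and `result` are illustrative names:

```python
import pandas as pd

df_demo = pd.DataFrame({'index': ['A', 'B'],
                        'Column1': [0.8, 3.7],
                        'Column2': [5.5, 12.4]})

# Boolean Series: True for rows where Column1 is greater than 1
mask = df_demo['Column1'] > 1

# Rows come from the mask, columns from the list of labels
result = df_demo.loc[mask, ['index', 'Column2']]
print(result)
```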


### Index properties:

In [23]:
#Array of index values
df.index.values

array([0, 1], dtype=int64)

In [24]:
# Using the string split() method to get a list of items
a = '10 abc'.split()
print(a)

['10', 'abc']


In [25]:
# Inserting a new column from a list of values
df['Column3'] = a
df

  index  Column1  Column2 Column3
0     A      0.8      5.5      10
1     B      3.7     12.4     abc


In [26]:
#Inserting new row
df.loc[2]=['C',32,11.8,'xyz']
df

  index  Column1  Column2 Column3
0     A      0.8      5.5      10
1     B      3.7     12.4     abc
2     C     32.0     11.8     xyz
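
Another way to add rows, sketched here as an alternative to assigning through `loc`, is `pd.concat` with a one-row DataFrame. The names `df_demo`, `new_row`, and `df_extended` are illustrative:

```python
import pandas as pd

df_demo = pd.DataFrame({'index': ['A', 'B'],
                        'Column1': [0.8, 3.7],
                        'Column2': [5.5, 12.4],
                        'Column3': ['10', 'abc']})

# The new row is described by column name, so the order of the keys does not matter
new_row = pd.DataFrame([{'index': 'C', 'Column1': 32, 'Column2': 11.8, 'Column3': 'xyz'}])

# ignore_index=True renumbers the rows 0, 1, 2
df_extended = pd.concat([df_demo, new_row], ignore_index=True)
print(df_extended)
```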


## OPERATIONS:

In [27]:
df = pd.DataFrame({'col1':[10,20,30,40,50],'col2':[4,5,6,5,2],'col3':['abc','def','ghi','xyz','123']})
df.head()
# head() displays the first rows of a DataFrame
# only the first 5 rows are shown by default

   col1  col2 col3
0    10     4  abc
1    20     5  def
2    30     6  ghi
3    40     5  xyz
4    50     2  123


In [28]:
df['col2'].sum()
# sum of the column

22

In [29]:
(df['col2']==5).sum()

2

In [30]:
df['col2'].count()
# counts the non-null elements in the column

5

In [31]:
df['col2'].value_counts()
# frequency of each value in the column

5    2
4    1
6    1
2    1
Name: col2, dtype: int64

In [32]:
df['col2'].values
# extract the values of the column
# returns a NumPy array

array([4, 5, 6, 5, 2], dtype=int64)

In [33]:
df['col3'].values
# extract the values of the column
# returns a NumPy array

array(['abc', 'def', 'ghi', 'xyz', '123'], dtype=object)

In [34]:
a = df.columns.values
# extract the column headers
s = sorted(a)
# sorted alphabetically
print(s)

['col1', 'col2', 'col3']


In [35]:
df.sort_values(by='col2')
# sort the whole DataFrame by the values of col2

   col1  col2 col3
4    50     2  123
0    10     4  abc
1    20     5  def
3    40     5  xyz
2    30     6  ghi
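
Ties in `col2` (the two rows with value 5) can be ordered by a second column. The sketch below sorts by `col2` ascending and then by `col1` descending. The names `df_demo` and `sorted_df` are illustrative:

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [10, 20, 30, 40, 50],
                        'col2': [4, 5, 6, 5, 2],
                        'col3': ['abc', 'def', 'ghi', 'xyz', '123']})

# One ascending flag per sort key
sorted_df = df_demo.sort_values(by=['col2', 'col1'], ascending=[True, False])
print(sorted_df)
```

`sort_values` returns a new DataFrame, so `df_demo` keeps its original row order.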


In [36]:
df.loc[5]=[np.nan,2,np.nan]
df
# add a row containing null values

   col1  col2 col3
0  10.0   4.0  abc
1  20.0   5.0  def
2  30.0   6.0  ghi
3  40.0   5.0  xyz
4  50.0   2.0  123
5   NaN   2.0  NaN


In [37]:
df.dropna(inplace=True)
df
# drop the rows that contain null values
# inplace=True modifies df itself

   col1  col2 col3
0  10.0   4.0  abc
1  20.0   5.0  def
2  30.0   6.0  ghi
3  40.0   5.0  xyz
4  50.0   2.0  123


In [38]:
df.loc[5]=[np.nan,2,np.nan]
df.fillna('sin valor',inplace=True)
df
# fill the null values with a new value

        col1  col2       col3
0       10.0   4.0        abc
1       20.0   5.0        def
2       30.0   6.0        ghi
3       40.0   5.0        xyz
4       50.0   2.0        123
5  sin valor   2.0  sin valor
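
Filling a numeric column with a string, as above, leaves text mixed in with the numbers in `col1`. One possible alternative, sketched on a small hypothetical frame, is to pass `fillna` a dict with one replacement per column so the numeric column stays numeric:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col1': [10.0, np.nan],
                        'col2': [4.0, 2.0],
                        'col3': ['abc', np.nan]})

# Keys are column names, values are the replacement for that column
filled = df_demo.fillna({'col1': 0.0, 'col3': 'sin valor'})

print(filled)
print(filled['col1'].dtype)
```

The choice of 0.0 as the numeric filler is an assumption for this sketch; a mean or median may be more appropriate depending on the data.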


In [39]:
df.loc[6]=[np.nan,2,np.nan]
df.isnull()
# returns True where the value is null

    col1   col2   col3
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False  False
5  False  False  False
6   True  False   True
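
The boolean frame returned by `isnull()` is often summarized rather than read cell by cell. A short sketch on a hypothetical frame, counting the nulls per column and selecting the rows that contain at least one:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col1': [10.0, np.nan, np.nan],
                        'col2': [4.0, 2.0, 2.0],
                        'col3': ['abc', None, 'xyz']})

# True counts as 1, so sum() gives the number of nulls in each column
missing_per_column = df_demo.isnull().sum()

# any(axis=1) is True for rows with at least one null value
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```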


In [40]:
#Reset index
df.reset_index(inplace=True)
df

   index       col1  col2       col3
0      0       10.0   4.0        abc
1      1       20.0   5.0        def
2      2       30.0   6.0        ghi
3      3       40.0   5.0        xyz
4      4       50.0   2.0        123
5      5  sin valor   2.0  sin valor
6      6        NaN   2.0        NaN


In [41]:
#Sort by index
df.sort_index(ascending=False,inplace=True)
df

   index       col1  col2       col3
6      6        NaN   2.0        NaN
5      5  sin valor   2.0  sin valor
4      4       50.0   2.0        123
3      3       40.0   5.0        xyz
2      2       30.0   6.0        ghi
1      1       20.0   5.0        def
0      0       10.0   4.0        abc


In [42]:
#Setting index
df.set_index('col3',inplace=True)
df

           index       col1  col2
col3
NaN            6        NaN   2.0
sin valor      5  sin valor   2.0
123            4       50.0   2.0
xyz            3       40.0   5.0
ghi            2       30.0   6.0
def            1       20.0   5.0
abc            0       10.0   4.0


In [43]:
df.reset_index(inplace=True)
df

        col3  index       col1  col2
0        NaN      6        NaN   2.0
1  sin valor      5  sin valor   2.0
2        123      4       50.0   2.0
3        xyz      3       40.0   5.0
4        ghi      2       30.0   6.0
5        def      1       20.0   5.0
6        abc      0       10.0   4.0


### STRING OPERATIONS:

In [44]:
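df['col3'].str.extract(r'(\w+)',expand=False)
# .str exposes string methods on the column values
# extract() returns the first match of the regex capture group
# the raw string r'...' keeps \w from being read as a Python escape sequence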

0    NaN
1    sin
2    123
3    xyz
4    ghi
5    def
6    abc
Name: col3, dtype: object
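
When the pattern has more than one capture group, `str.extract` returns a DataFrame with one column per group. The sketch below uses named groups, and the group names `first` and `second` are chosen for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(['sin valor', 'abc', np.nan])

# Two named groups separated by whitespace: each group becomes a column
parts = s.str.extract(r'(?P<first>\w+)\s+(?P<second>\w+)')
print(parts)
```

Rows that do not match the whole pattern, such as 'abc' (no second word) and the missing value, come back as NaN in every column.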

### Filtering:

In [45]:
df[df['col3']=='abc']
# rows where the value of col3 equals "abc"

  col3  index  col1  col2
6  abc      0  10.0   4.0


### Uppercase/Lowercase:

In [46]:
df['col3'].str.upper()  # use .str.lower() for lowercase

0          NaN
1    SIN VALOR
2          123
3          XYZ
4          GHI
5          DEF
6          ABC
Name: col3, dtype: object

### Length of the string:

In [47]:
df['col3'].str.len()

0    NaN
1    9.0
2    3.0
3    3.0
4    3.0
5    3.0
6    3.0
Name: col3, dtype: float64

### Split:

In [48]:
df['col3'].str.split('c')

0            NaN
1    [sin valor]
2          [123]
3          [xyz]
4          [ghi]
5          [def]
6         [ab, ]
Name: col3, dtype: object

### Replace:

In [49]:
df['col3'].str.replace(' ','__')

0           NaN
1    sin__valor
2           123
3           xyz
4           ghi
5           def
6           abc
Name: col3, dtype: object

### Contains:

In [50]:
c=df['col3'].str.contains('a')
c

0      NaN
1     True
2    False
3    False
4    False
5    False
6     True
Name: col3, dtype: object
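
Because the result above contains NaN for the missing entry, it generally cannot be used directly as a boolean mask. A minimal sketch, on a small hypothetical frame, uses the `na` parameter of `str.contains` to treat missing values as non-matches:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col3': [np.nan, 'sin valor', '123', 'abc']})

# na=False replaces the NaN result with False, giving a pure boolean mask
mask = df_demo['col3'].str.contains('a', na=False)

matches = df_demo[mask]
print(matches)
```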

### ONE-HOT ENCODING:

In [51]:
dfhot = pd.DataFrame({'gender':['male','female','male','female','male'],'age_range':['young','adult','senior','young','adult']})
dfhot
# categorical data in a DataFrame
# two categorical columns
# most models cannot work with these values directly,
# so a numeric value must be assigned to each possible class

   gender age_range
0    male     young
1  female     adult
2    male    senior
3  female     young
4    male     adult


In [52]:
data_dummies = pd.get_dummies(dfhot)
data_dummies
# from categorical to numeric (0/1)
# this encoding is widely used in data analysis and ML models
# the initial DataFrame grows in columns, not in rows

   gender_female  gender_male  age_range_adult  age_range_senior  age_range_young
0              0            1                0                 0                1
1              1            0                1                 0                0
2              0            1                0                 1                0
3              1            0                0                 0                1
4              0            1                1                 0                0
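
Depending on the pandas version, `get_dummies` may return boolean columns instead of 0/1 integers. The sketch below requests integers explicitly with `dtype=int` and also shows `drop_first=True`, which drops one level per category to avoid redundant columns. The names `dummies_int` and `dummies_reduced` are illustrative:

```python
import pandas as pd

dfhot = pd.DataFrame({'gender': ['male', 'female', 'male', 'female', 'male'],
                      'age_range': ['young', 'adult', 'senior', 'young', 'adult']})

# dtype=int forces 0/1 integer columns
dummies_int = pd.get_dummies(dfhot, dtype=int)

# drop_first=True removes the first level of each category
dummies_reduced = pd.get_dummies(dfhot, dtype=int, drop_first=True)

print(list(dummies_int.columns))
print(list(dummies_reduced.columns))
```

With `drop_first=True` the dropped level is implied when all remaining columns of that category are 0, for example `gender_male == 0` means female.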


## Reference:

https://pandas.pydata.org/pandas-docs/stable/dsintro.html

https://pandas.pydata.org/pandas-docs/stable/basics.html