## Pandas Library

It provides useful data structures built on NumPy arrays.

### Pandas Series
A Series is a one-dimensional sequence of homogeneous values (all of the same type). <br>
Unlike a plain array, it can carry an explicit index (instead of the implicit 0, 1, 2, etc.). <br>
Any values can be used as index labels, so each entry is associated with its own label, for example: <br>

"3-july": 0.8 <br>
"5-july": 1.2
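
A minimal sketch of the date-labelled idea above. The name `rain` and the meaning of the numbers are assumptions made only for illustration; the two label/value pairs are the ones listed in the notes.

```python
import pandas as pd

# Hypothetical measurements keyed by date labels instead of 0, 1, 2, ...
rain = pd.Series({"3-july": 0.8, "5-july": 1.2})

print(rain["5-july"])  # look up a value by its label
print(rain.iloc[0])    # positional access still works
print(rain.index)      # the labels are stored in an Index object
```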

In [2]:
import pandas as pd
import numpy as np
from math import log2

In [12]:
# This is a Series; mixing ints with a float makes pandas store every value as float64

# Use `index` to attach a label to each value
s = pd.Series([1,2 ,3.4], index=['a', 'b', 'c'])
print(s)
#print(s['a'])

# A dictionary can also be used to create a Series (the keys become the index)

s2 = pd.Series({"alpha":2, "bravo":5})
print(s2)

# .values gives us the underlying NumPy array
print(s2.values)

# .index gives us the Index object that holds the labels
print(s2.index)

a    1.0
b    2.0
c    3.4
dtype: float64
alpha    2
bravo    5
dtype: int64
[2 5]
Index(['alpha', 'bravo'], dtype='object')


What can we do with a pandas Series?

Watch out for loc (label-based) vs iloc (position-based)

In [32]:
s = pd.Series([1,2 ,3.4, 6, 5], index=['a', 'b', 'c', 'e','d'])
s2 = pd.Series({"alpha":2, "bravo":5})

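# Explicit (label-based) indexing
print(s2['bravo'])
print(s2.loc["bravo"])

# Implicit (position-based) indexing
print(s2.iloc[1])

# Slicing: the stop is included for label slices ([] with labels, loc) and excluded for iloc
print(s['a':'c'])
print(s.loc['a':'e']) # follows the index order and stops at 'e' (included), so 'd', which comes after 'e', is left out
print(s.iloc[0:3])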


# Masking as usual: this selects only the values for which the predicate is true
print(s[(s>1) & (s<5)])

# Select only a given list of labels
print(s.loc[['a', 'e']])


5
5
5
a    1.0
b    2.0
c    3.4
dtype: float64
a    1.0
b    2.0
c    3.4
e    6.0
dtype: float64
a    1.0
b    2.0
c    3.4
dtype: float64
b    2.0
c    3.4
dtype: float64
a    1.0
e    6.0
dtype: float64
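
The loc/iloc difference matters most when the labels are themselves integers. The sketch below uses a made-up Series `t` (not part of the notes above) whose integer labels do not match the positions.

```python
import pandas as pd

# Illustrative Series: integer labels in an order that differs from the positions
t = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_label = t.loc[0]      # the entry whose label is 0
by_position = t.iloc[0]  # the entry at position 0

label_slice = t.loc[2:1]      # from label 2 to label 1, stop included
position_slice = t.iloc[0:2]  # positions 0 and 1, stop excluded

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

With the default 0, 1, 2 index the two accessors happen to agree on single elements, which is why the distinction is easy to overlook.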


### Pandas DataFrame

They behave like 2D arrays (a collection of Series that share the same index)

<br>
<br>
A DataFrame can be created in several ways, for example from three Series with the same index
<br> from dictionaries
<br> from a list of dictionaries
<br> from NumPy arrays
<br> with the read functions of pandas (e.g. read_csv)
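
The Series-based construction is the only one in the list that the next cell does not show, so here is a small sketch of it. The column names and values are invented for illustration.

```python
import pandas as pd

labels = ["c1", "c2", "c3"]

# Three Series sharing the same index (illustrative values)
price = pd.Series([1, 2, 3], index=labels)
weight = pd.Series([4, 5, 6], index=labels)
stock = pd.Series([10, 0, 7], index=labels)

# Each dictionary key becomes a column; rows are aligned on the shared index
df = pd.DataFrame({"price": price, "weight": weight, "stock": stock})
print(df)
```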

<br><br>
Pandas can also handle missing values, which it represents as NaN
<br> we can also manually convert values that make little sense into NaN
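
One possible way to do the manual conversion on a DataFrame that is already loaded is sketched below. The data and the choice of -1 as a "makes no sense" placeholder are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data: assume -1 was used as a placeholder for an unknown quantity
df = pd.DataFrame({"quantity": [50, 1000, -1, 90], "price": [10, 80, 20, 5]})

# Replace the placeholder with NaN in one column only
cleaned = df.replace({"quantity": {-1: np.nan}})

# Alternative: mask every value that satisfies a condition
masked = df["quantity"].mask(df["quantity"] < 0)

print(cleaned)
print(masked)
```

When the data comes from a file, the `na_values` argument of `read_csv` achieves the same result at load time.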

In [15]:
x = {'price':[1,2,3], "weight":[4, 5, 6]}
print(pd.DataFrame(x, index=["c1", "c2", "c3"]))
x =[{"price":2, "age":2}, {"price":4, "age":15}]
print(pd.DataFrame(x))

y = np.random.random((4,3))
print(y)
display(pd.DataFrame(y, index=['a', 'b', 'c', 'd'], columns=[f"Column {i+1}" for i in range(0, 3)]))

display(pd.read_csv("try.csv")) # the missing field shows up as NaN

display(pd.read_csv("try.csv", na_values=[880])) # also treat the value 880 as NaN


    price  weight
c1      1       4
c2      2       5
c3      3       6
   price  age
0      2    2
1      4   15
[[0.88051331 0.13561913 0.36783279]
 [0.15409787 0.06877015 0.09643868]
 [0.09946422 0.77828218 0.98252722]
 [0.17276299 0.8583421  0.64004219]]


   Column 1  Column 2  Column 3
a  0.880513  0.135619  0.367833
b  0.154098  0.068770  0.096439
c  0.099464  0.778282  0.982527
d  0.172763  0.858342  0.640042


   #Patate  prezzo provenienza
0       50      10     catania
1     1000      80      ragusa
2      880      20         NaN
3       90       5     Messina


   #Patate  prezzo provenienza
0     50.0      10     catania
1   1000.0      80      ragusa
2      NaN      20         NaN
3     90.0       5     Messina


#### PANDAS 2D indexing

With plain [] indexing, pandas works in cascade (first the rows, then the columns); loc and iloc accept both at once<br>
In NumPy, rows and columns are indexed simultaneously <br>
Indexing can also be used to change all the selected values at once<br>
<br>
To add a new column, assign to a new column name (if the column is already present, it is replaced)
<br> To remove a column, use the drop method
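
A short sketch of these points on a made-up DataFrame. It compares the cascade form with the one-step `loc` form, then assigns to a selection and adds and drops a column.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [1, 2, 3], "weight": [4, 5, 6]},
    index=["c1", "c2", "c3"],
)

# Cascade: first select rows, then pick a column from the result
cascade = df.iloc[0:2]["price"]
# One step: rows and column in the same loc call (label slice, stop included)
one_step = df.loc["c1":"c2", "price"]
same = cascade.equals(one_step)

# Assign one value to every selected entry (use the one-step form for assignment)
df.loc["c1":"c2", "price"] = 0

# Adding a column: assigning to a new name creates it, an existing name is replaced
df["total"] = df["price"] + df["weight"]

# drop returns a new DataFrame and leaves df untouched
trimmed = df.drop(columns="weight")
print(df)
print(trimmed)
```

Assigning through the cascade form (for example `df.iloc[0:2]["price"] = 0`) may act on a temporary copy, which is why the one-step `loc` form is used for the assignment here.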

In [52]:
array = pd.read_csv('try.csv')

# slicing as expected
print(array[1:3]['#Patate'])
print(array.iloc[0:2][["#Patate", 'provenienza']])

array['Ranking'] = [2,1,4,3]
display(array)

# A new DataFrame is created and returned; the original is not altered
display(array.drop(columns='prezzo'))
display(array)

# With inplace=True we operate on the original DataFrame and nothing is returned (display shows None)
display(array.drop(columns='prezzo', inplace=True))
display(array)

# Rename a column
array.rename(columns={'#Patate':"Quantita"}, inplace=True)
display(array)


1    1000
2     880
Name: #Patate, dtype: int64
   #Patate provenienza
0       50     catania
1     1000      ragusa


   #Patate  prezzo provenienza  Ranking
0       50      10     catania        2
1     1000      80      ragusa        1
2      880      20         NaN        4
3       90       5     Messina        3


   #Patate provenienza  Ranking
0       50     catania        2
1     1000      ragusa        1
2      880         NaN        4
3       90     Messina        3


   #Patate  prezzo provenienza  Ranking
0       50      10     catania        2
1     1000      80      ragusa        1
2      880      20         NaN        4
3       90       5     Messina        3


None

   #Patate provenienza  Ranking
0       50     catania        2
1     1000      ragusa        1
2      880         NaN        4
3       90     Messina        3


   Quantita provenienza  Ranking
0        50     catania        2
1      1000      ragusa        1
2       880         NaN        4
3        90     Messina        3


#### OPERATIONS ON DF

Pretty much the same as in NumPy:
- Unary: applied to one element at a time
- Binary: e.g. the sum of two Series, which is computed only for the labels present in both indexes; <br>
if a label appears in one operand but not in the other, the result for that label is NaN <br>
NOTE: the resulting index is sorted by default
- The same holds in the 2D case (rows are aligned), but columns are aligned too, <br>
so if the columns do not match, the result will contain a lot of NaN

In [1]:
import numpy as np
import pandas as pd

y = np.random.random((4,3))
y = pd.DataFrame(y, index=['a', 'b', 'c', 'd'], columns=[f"Column {i+1}" for i in range(0, 3)])
display(y)

Broadcasting works just like in NumPy (e.g. summing objects with different shapes)
<br>
<br>
Values can also be aggregated:<br>
this is particularly easy on a Series (e.g. mean, min, max, sum...) <br>
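
A small sketch of broadcasting and aggregation on a DataFrame. It uses a deterministic array rather than random values so the results are easy to follow; the column names are placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=["a", "b", "c", "d"],
    columns=["c1", "c2", "c3"],
)

# Aggregations reduce along the rows by default: one value per column
col_mean = df.mean()
col_max = df.max()
# axis=1 aggregates across the columns instead: one value per row
row_sum = df.sum(axis=1)

# Broadcasting: a Series with one value per column is applied to every row
centered = df - col_mean
shifted = df - df.iloc[0]

print(col_mean)
print(row_sum)
print(centered)
```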


In [27]:
s = pd.Series([1,2,3])
print(np.mean(s))

df = pd.read_csv("iris.csv")
df = df.drop(index=4)
# the species column holds strings, so restrict the mean to the numeric columns
display(df.mean(numeric_only=True))


2.0

In a numeric Series, a None value is converted to np.nan, which is more efficient in floating-point operations
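
The sketch below shows the conversion and the main options for dealing with missing values in a table. The DataFrame is made up for the example, with row `b` entirely missing.

```python
import numpy as np
import pandas as pd

# None in numeric data becomes NaN and the Series is stored as float64
s = pd.Series([1, 2, None, 4])
print(s.dtype)

df = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0], "y": [4.0, np.nan, np.nan], "z": [7.0, np.nan, 9.0]},
    index=["a", "b", "c"],
)

rows_any = df.dropna()                             # drop rows with at least one NaN
rows_all = df.dropna(how="all")                    # drop rows only if every value is NaN
complete_cols = rows_all.dropna(axis="columns")    # then drop columns that still contain NaN

filled = df.fillna(0)  # replace NaN with a constant
forward = df.ffill()   # propagate the last valid value down each column

print(rows_any)
print(complete_cols)
print(forward)
```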

In [None]:
# isnull detects the missing values: it returns a boolean mask that is True where a value is missing and False otherwise
s = pd.Series([1,2,None, 4])
print(pd.isnull(s))
#s[s.isnull()]=0 # replaces null values with zeros
display(s)
display(s.dropna()) # will drop the missing values
# For tables it is trickier because you can't remove a single element: you have to choose whether to delete the entire row or the entire column
# how='any' (the default) drops a row if any of its values is missing, how='all' only if all of them are
# axis='columns' drops columns instead of rows
# Missing values can also be filled, e.g. with fillna()

df = pd.DataFrame([[1, 2], [5, np.nan], [3, np.nan]], index=['a', 'b', 'c'])
df = df.dropna(how='all')
display(df)

# ffill() and bfill() also exist: they fill each missing value with the previous/next valid value in the same column

display(df.bfill())

s1 =pd.Series([1,2], index=['a', 'b'])
s2 = pd.Series([3,4,5], index=['a', 'c', 'd'])
display(pd.concat((s1,s2), ignore_index=True)) # ignore_index discards the labels and builds a new integer index

# DataFrames can also be concatenated vertically (doing it horizontally is riskier)
# When stacking vertically, columns that are missing from one of the frames are filled with NaN

# merge() joins two DataFrames; by default they are joined on all the common columns
# Otherwise you can specify the column to merge on (on=...), or use the indexes (left_index/right_index)

# Joins can be 1-1, 1-N or N-N just like in databases (e.g. validate='1:1' checks the kind of join)


df2 = pd.DataFrame([1,2])


0    False
1    False
2     True
3    False
dtype: bool


0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

0    1.0
1    2.0
3    4.0
dtype: float64

   0    1
a  1  2.0
b  5  NaN
c  3  NaN


   0    1
a  1  2.0
b  5  NaN
c  3  NaN


0    1
1    2
2    3
3    4
4    5
dtype: int64

ValueError: Cannot merge a Series without a name
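
The merge and concat comments in the cell above have no runnable counterpart, so here is a hedged sketch with invented frames `left` and `right`. The last lines show that merge accepts a Series once it has a name, which is the condition the error message above complains about.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "city": ["catania", "ragusa", "messina"]})
right = pd.DataFrame({"id": [1, 2, 4], "price": [10, 80, 5]})

# Default inner join on the common column 'id': only ids present in both frames
inner = pd.merge(left, right, on="id")

# Outer join keeps every id and fills the gaps with NaN
outer = pd.merge(left, right, on="id", how="outer")

# validate raises if the keys are not unique on both sides
checked = pd.merge(left, right, on="id", validate="1:1")

# Vertical concatenation: columns missing from one frame become NaN
stacked = pd.concat([left, right], ignore_index=True)

# A Series can be merged only if it has a name (it becomes the column name)
rank = pd.Series([7, 8, 9], name="rank")
with_rank = pd.merge(left, rank, left_index=True, right_index=True)

print(inner)
print(outer)
print(stacked)
print(with_rank)
```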

Data can also be grouped (groupby) to
- Iterate over each group of values
- Aggregate groups and compute stats (min, max, mean) for a single group at a time
- Filter: e.g. grouped_df.filter(lambda x: x['c1'].mean() > 2.5) keeps only the groups whose mean of c1 is higher than 2.5
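
The three uses listed above can be sketched on the same tiny DataFrame used in the next cell. The variable names `grouped`, `stats` and `kept` are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a", "b"], "c1": [1, 2, 3, 4]})
grouped = df.groupby("k")

# Iterate: each step yields the group key and the matching sub-DataFrame
for key, group in grouped:
    print(key, group["c1"].tolist())

# Aggregate: several statistics per group at once
stats = grouped["c1"].agg(["min", "max", "mean"])

# Filter: keep the rows of the groups whose mean of c1 is above 2.5
kept = grouped.filter(lambda g: g["c1"].mean() > 2.5)

print(stats)
print(kept)
```

Note that `filter` is called on the grouped object and receives one group at a time; it returns the original rows of the groups that pass, not one row per group.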

In [None]:
df = pd.DataFrame({'k':['a','b', 'a', 'b'], 'c1':[1,2,3,4]})

# each item is a (key, sub-DataFrame) tuple
for f in df.groupby('k'):
    display(f)



('a',
    k  c1
 0  a   1
 2  a   3)

('b',
    k  c1
 1  b   2
 3  b   4)

ValueError: No axis named c1 for object type DataFrame

A pivot table can be used to analyze the correlation between columns, e.g.
given a DataFrame with columns type, class, fail (True/False),
we can study whether the type of sensor and the class are related to failures

df.pivot_table('fail', index='type', columns='class', aggfunc='sum')
this sums the total number of failures for each combination of sensor type and class
        class1  class2  class3
type a       0       0       1
type b     ...     ...     ...
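
A runnable version of the pivot table idea, under the assumption of a small invented sensor log named `readings`. Only the first row of the result (type a) mirrors the table above; the type b values are made up.

```python
import pandas as pd

# Hypothetical sensor log: one row per reading
readings = pd.DataFrame({
    "type": ["a", "a", "a", "b", "b", "b"],
    "class": ["class1", "class2", "class3", "class1", "class1", "class3"],
    "fail": [False, False, True, True, True, False],
})

# Summing booleans counts the True values; fill_value covers empty combinations
table = readings.pivot_table(
    "fail", index="type", columns="class", aggfunc="sum", fill_value=0
)
print(table)
```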

A MultiIndex allows a hierarchy of index levels for Series and DataFrames,
e.g. a multi-indexed Series

In [None]:
s = pd.Series([10,9,7,5, 6, 6], index=[["rome", "rome", "rome", "milan", "turin", "turin"], [2018, 2019, 2020, 2018, 2019, 2020]])
display(s)

# filtering on the outermost index level
display(s.loc['rome'])
# filter by year (second index)
display(s.loc[:, 2020])

# The same can be done with a DataFrame
# df.loc['Rome', 'c1'] accesses row/s Rome and column/s c1
# df.loc[:, ('c1', 'c1 sub1')] takes all the rows but just the sub-column 'c1 sub1' of the multi-column c1
# df.loc[('Rome', 2018), 'c1'] as before, with a full row key

# The problem is with slicing, which needs pd.IndexSlice (here ix = pd.IndexSlice)
# df.loc[ix['Rome', 2018], ix['c1':'c2', 'a']] # takes all the rows of Rome 2018 and the sub-column 'a' of every column from c1 to c2

# To go back to a single index use df.reset_index(): the previous index levels become new columns of the DataFrame
# and the new index is made of integers starting from 0
# To use a custom index after resetting: df_reset.set_index([new index])

# A multi-indexed Series can also be unstacked to convert it into a single DataFrame

rome   2018    10
       2019     9
       2020     7
milan  2018     5
turin  2019     6
       2020     6
dtype: int64

2018    10
2019     9
2020     7
dtype: int64

rome     7
turin    6
dtype: int64
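
The DataFrame operations described in the comments of the cell above can be tried on a small example. The frame below is invented for the purpose: city and year on the rows, columns c1 and c2 with sub-columns a and b. Both indexes are sorted, which label slicing on a MultiIndex needs.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["milan", "rome"], [2018, 2019]], names=["city", "year"]
)
columns = pd.MultiIndex.from_product([["c1", "c2"], ["a", "b"]])
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=index,
    columns=columns,
)

# Partial keys: all rows of rome, all sub-columns of c1
rome_c1 = df.loc["rome", "c1"]
# Full row key: a single row, restricted to c1
rome_2018_c1 = df.loc[("rome", 2018), "c1"]

# Slicing inside a level goes through pd.IndexSlice
idx = pd.IndexSlice
sub_a = df.loc[idx["rome", 2018:2019], idx["c1":"c2", "a"]]

# reset_index turns the index levels into columns; set_index rebuilds them
flat = df["c1"].reset_index()
restored = flat.set_index(["city", "year"])

# unstack moves the innermost index level (year) to the columns
wide = df[("c1", "a")].unstack()

print(rome_c1)
print(sub_a)
print(flat)
print(wide)
```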