## Pandas Library

pandas provides useful data structures built on top of NumPy arrays.

### Pandas Series
A Series is a one-dimensional sequence of homogeneous values (all of the same type). <br>
Unlike a NumPy array, it can carry an explicit index instead of the implicit positions 0, 1, 2, etc. <br>
Other values can be used as index labels, so that each label is associated with one value, for example: <br>

"3-july": 0.8 <br>
"5-july": 1.2
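As a minimal sketch of the difference between positional and labelled access, the two values above can be stored both in a NumPy array and in a Series. The variable names `rain_array` and `rain` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array can only be addressed by position (0, 1, 2, ...).
rain_array = np.array([0.8, 1.2])

# A Series holds the same values but attaches a label to each of them.
rain = pd.Series([0.8, 1.2], index=["3-july", "5-july"])

print(rain_array[1])   # positional access on the array
print(rain["5-july"])  # label-based access on the Series
print(rain.iloc[1])    # positional access is still available through iloc
```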

In [53]:
import pandas as pd
import numpy as np
from math import log2

In [12]:
# The values of a Series share one dtype: here the ints are converted to float

# The index argument associates a label with each value
s = pd.Series([1, 2, 3.4], index=['a', 'b', 'c'])
print(s)
#print(s['a'])

# A Series can also be created from a dictionary: the keys become the index labels

s2 = pd.Series({"alpha": 2, "bravo": 5})
print(s2)

# .values returns the data as a NumPy array
print(s2.values)

# .index returns the Index object that holds the labels
print(s2.index)

a    1.0
b    2.0
c    3.4
dtype: float64
alpha    2
bravo    5
dtype: int64
[2 5]
Index(['alpha', 'bravo'], dtype='object')


What can we do with a pandas Series?

Watch out for the difference between loc (label-based, explicit index) and iloc (position-based, implicit index).
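The distinction matters most when the labels are themselves integers. The following is a small sketch with a hypothetical Series whose integer labels do not start at 0, so that a label and a position with the same number point to different elements.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that do not start at 0.
s = pd.Series(["x", "y", "z"], index=[1, 2, 3])

by_label = s.loc[1]      # label 1 is the first element
by_position = s.iloc[1]  # position 1 is the second element
print(by_label, by_position)

# Label-based slices include the stop label; positional slices exclude the stop.
label_slice = s.loc[1:2]      # labels 1 and 2
position_slice = s.iloc[1:2]  # position 1 only
print(label_slice)
print(position_slice)
```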

In [32]:
s = pd.Series([1, 2, 3.4, 6, 5], index=['a', 'b', 'c', 'e', 'd'])
s2 = pd.Series({"alpha": 2, "bravo": 5})

# Explicit (label-based) indexing
print(s2['bravo'])
print(s2.loc["bravo"])

# Implicit (position-based) indexing
print(s2.iloc[1])

# Slicing: the stop is included for label-based slices (labels in [] or loc),
# but excluded for positional slices (iloc)
print(s['a':'c'])
print(s.loc['a':'e'])  # follows the order of the index, so 'd' (stored after 'e') is left out
print(s.iloc[0:3])


# Masking as usual: this selects only the values for which the predicate is true
print(s[(s > 1) & (s < 5)])

# Select only a given list of labels
print(s.loc[['a', 'e']])


5
5
5
a    1.0
b    2.0
c    3.4
dtype: float64
a    1.0
b    2.0
c    3.4
e    6.0
dtype: float64
a    1.0
b    2.0
c    3.4
dtype: float64
b    2.0
c    3.4
dtype: float64
a    1.0
e    6.0
dtype: float64


### Pandas DataFrame

A DataFrame behaves like a 2D array: it is a collection of Series that share the same index.

<br>
<br>
A DataFrame can be created, for example:
<br> from several Series with the same index
<br> from a dictionary
<br> from a list of dictionaries
<br> from a NumPy array
<br> with the pandas read functions (such as read_csv)
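The first option in the list, building a DataFrame from Series with the same index, could look like the sketch below. The three Series and their values are hypothetical.

```python
import pandas as pd

# Hypothetical data: three Series that share the same index labels.
index = ["c1", "c2", "c3"]
price = pd.Series([1, 2, 3], index=index)
weight = pd.Series([4, 5, 6], index=index)
age = pd.Series([2, 15, 7], index=index)

# Each dictionary key becomes a column name; rows are matched by index label.
df = pd.DataFrame({"price": price, "weight": weight, "age": age})
print(df)

# Every column of the DataFrame is itself a Series.
print(type(df["price"]))
```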

<br><br>
pandas also handles missing values, which it represents as NaN
<br> values that make little sense can also be converted to NaN manually
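Both ways of obtaining NaN can be sketched on a small in-memory CSV. The contents of `csv_text` are an assumption made for illustration (the actual file used in the notebook is not shown here), and `mask` is just one possible way to do the manual conversion.

```python
import io

import numpy as np
import pandas as pd

# Assumed contents of a small CSV file; the third row has no "provenienza".
csv_text = """#Patate,prezzo,provenienza
50,10,catania
1000,80,ragusa
880,20,
90,5,Messina
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.isna())  # True marks the cells that pandas read as NaN

# Option 1: declare the implausible value as missing while reading.
df_na = pd.read_csv(io.StringIO(csv_text), na_values=[880])

# Option 2: convert it after loading; mask() replaces values where the condition is True.
df_manual = df.copy()
df_manual["#Patate"] = df_manual["#Patate"].mask(df_manual["#Patate"] == 880, np.nan)

print(df_na)
print(df_manual)
```

Note that a column containing NaN can no longer stay an integer column, which is why the remaining quantities are shown as floats.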

In [15]:
x = {'price':[1,2,3], "weight":[4, 5, 6]}
print(pd.DataFrame(x, index=["c1", "c2", "c3"]))
x =[{"price":2, "age":2}, {"price":4, "age":15}]
print(pd.DataFrame(x))

y = np.random.random((4,3))
print(y)
display(pd.DataFrame(y, index=['a', 'b', 'c', 'd'], columns=[f"Column {i+1}" for i in range(0, 3)]))

display(pd.read_csv("try.csv")) # the empty cell is read as NaN

display(pd.read_csv("try.csv", na_values=[880])) # treat the value 880 as NaN


    price  weight
c1      1       4
c2      2       5
c3      3       6
   price  age
0      2    2
1      4   15
[[0.88051331 0.13561913 0.36783279]
 [0.15409787 0.06877015 0.09643868]
 [0.09946422 0.77828218 0.98252722]
 [0.17276299 0.8583421  0.64004219]]


   Column 1  Column 2  Column 3
a  0.880513  0.135619  0.367833
b  0.154098  0.068770  0.096439
c  0.099464  0.778282  0.982527
d  0.172763  0.858342  0.640042


   #Patate  prezzo provenienza
0       50      10     catania
1     1000      80      ragusa
2      880      20         NaN
3       90       5     Messina


   #Patate  prezzo provenienza
0     50.0      10     catania
1   1000.0      80      ragusa
2      NaN      20         NaN
3     90.0       5     Messina


#### Pandas 2D indexing

In pandas, indexing is done in cascade: rows are selected first, then columns<br>
In NumPy, rows and columns are indexed simultaneously <br>
Indexing can also be used on the left side of an assignment to change all the selected values at once<br>
<br>
To add a new column, assign to a new column name (if the column already exists, it is replaced)
<br> To remove a column, use the drop method
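The points above can be sketched on a small hypothetical table. Besides the chained form, `loc` and `iloc` also accept a row selector and a column selector in a single call, which is closer to the NumPy style.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
df = pd.DataFrame(
    {"price": [1, 2, 3], "weight": [4, 5, 6]},
    index=["c1", "c2", "c3"],
)

# Chained selection: rows first, then a column.
chained = df[0:2]["price"]
print(chained)

# loc and iloc accept a row selector and a column selector in one call.
single = df.loc["c2", "weight"]
block = df.iloc[0:2, 0:1]

# Assigning through loc changes every selected cell at once.
df.loc[df["price"] > 1, "price"] = 0

# Assigning to a new name adds a column; drop returns a copy without the column.
df["ranking"] = [2, 1, 3]
without_weight = df.drop(columns="weight")
print(df)
print(without_weight)
```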

In [52]:
array = pd.read_csv('try.csv')

# Row slicing followed by column selection
print(array[1:3]['#Patate'])
print(array.iloc[0:2][["#Patate", 'provenienza']])

# Assigning to a new column name adds the column
array['Ranking'] = [2, 1, 4, 3]
display(array)

# drop returns a new DataFrame; the original is not altered
display(array.drop(columns='prezzo'))
display(array)

# With inplace=True the original is modified and drop returns None
display(array.drop(columns='prezzo', inplace=True))
display(array)

# Rename a column
array.rename(columns={'#Patate': "Quantita"}, inplace=True)
display(array)


1    1000
2     880
Name: #Patate, dtype: int64
   #Patate provenienza
0       50     catania
1     1000      ragusa


   #Patate  prezzo provenienza  Ranking
0       50      10     catania        2
1     1000      80      ragusa        1
2      880      20         NaN        4
3       90       5     Messina        3


   #Patate provenienza  Ranking
0       50     catania        2
1     1000      ragusa        1
2      880         NaN        4
3       90     Messina        3


   #Patate  prezzo provenienza  Ranking
0       50      10     catania        2
1     1000      80      ragusa        1
2      880      20         NaN        4
3       90       5     Messina        3


None

   #Patate provenienza  Ranking
0       50     catania        2
1     1000      ragusa        1
2      880         NaN        4
3       90     Messina        3


   Quantita provenienza  Ranking
0        50     catania        2
1      1000      ragusa        1
2       880         NaN        4
3        90     Messina        3


#### Operations on DataFrames

These are essentially the same as in NumPy
- Unary: applied to one element at a time
- Binary: e.g. the sum of two Series or DataFrames. Values are aligned by index label before the operation, <br>
so a label that appears in only one of the two operands gets NaN in the result <br>
NOTE: the resulting index is sorted by default
- The 2D case works the same way (rows are aligned), but columns are aligned as well,<br>
so if the columns do not match, the result will contain many NaN values
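The three cases in the list can be sketched as follows. All Series, DataFrames and labels are hypothetical, and `add` with `fill_value` is shown only as one possible way to avoid NaN for labels present on a single side.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with partially overlapping labels.
a = pd.Series([1.0, 2.0, 4.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["z", "y"])

# Unary operation: applied element by element, the index is preserved.
logs = np.log2(a)

# Binary operation: values are matched by label, not by position.
total = a + b  # "x" exists only in a, so its result is NaN

# add() with fill_value treats a label missing from one side as 0.
filled = a.add(b, fill_value=0)

print(logs)
print(total)
print(filled)

# 2D case: hypothetical DataFrames sharing only row "b" and column "Column 2".
left = pd.DataFrame({"Column 1": [1, 2], "Column 2": [3, 4]}, index=["a", "b"])
right = pd.DataFrame({"Column 2": [10, 20], "Column 3": [30, 40]}, index=["b", "c"])

# Rows and columns are both aligned by label before the sum is computed.
result = left + right
print(result)
```

In the 2D result only the cell at row "b", "Column 2" has a value, because it is the only row and column pair present in both operands.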

In [57]:
y = np.random.random((4,3))
y = pd.DataFrame(y, index=['a', 'b', 'c', 'd'], columns=[f"Column {i+1}" for i in range(0, 3)])
display(y)




   Column 1  Column 2  Column 3
a  0.864734  0.297761  0.260702
b  0.780788  0.766764  0.469479
c  0.867504  0.189515  0.549727
d  0.665465  0.464273  0.637327

