# Intro to scientific Python

## Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
[Documentation](https://pandas.pydata.org/docs/getting_started/overview.html)

> The contents are adapted from [Tirendaz Academy notebook](https://github.com/TirendazAcademy/PANDAS-TUTORIAL/tree/main).

In [1]:
import pandas as pd # do not forget to install it

### Series

A Series can be thought of as a single column of values in which each value is paired with an index label.

In [2]:
obj = pd.Series([1, "John", 3.5, "Hey"])
obj

0       1
1    John
2     3.5
3     Hey
dtype: object

In [3]:
obj[0]

1

In [9]:
obj.values

array([1, 'John', 3.5, 'Hey'], dtype=object)

In [10]:
# specifying custom index
obj2 = pd.Series([1, "John", 3.5, "Hey"], index=["a","b","c","d"])
obj2

a       1
b    John
c     3.5
d     Hey
dtype: object

In [11]:
# accessing a row in this column by its custom index label
obj2["b"]

'John'

In [5]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')
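
The lookups above all go through plain `[]`. As a minimal sketch (the variable names `s`, `by_label`, and `by_position` are only for illustration), the same element can be requested explicitly by label with `.loc` or by integer position with `.iloc`, which avoids ambiguity when the index itself contains integers:

```python
import pandas as pd

# Same data as obj2 above, rebuilt so the snippet is self-contained.
s = pd.Series([1, "John", 3.5, "Hey"], index=["a", "b", "c", "d"])

by_label = s.loc["b"]     # lookup by index label
by_position = s.iloc[1]   # lookup by 0-based integer position

print(by_label, by_position)
print(list(s.index))
```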

In [8]:
score={"Jane":90, "Bill":80,"Elon":85,"Tom":75,"Tim":95}
names=pd.Series(score)
names

Jane    90
Bill    80
Elon    85
Tom     75
Tim     95
dtype: int64

In [9]:
names["Tim"] 

95

In [10]:
bool_idx = names >= 85  # same as for numpy
bool_idx

Jane     True
Bill    False
Elon     True
Tom     False
Tim      True
dtype: bool

In [11]:
names[bool_idx]
# or, equivalently:
names[names >= 85]

Jane    90
Elon    85
Tim     95
dtype: int64
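
Boolean masks can also be combined. The following is a small sketch, assuming we want scores inside or outside a range (the names `scores`, `mid_range`, and `extremes` are illustrative); it uses `&` and `|` rather than Python's `and` / `or`, and each comparison needs its own parentheses:

```python
import pandas as pd

scores = pd.Series({"Jane": 90, "Bill": 80, "Elon": 85, "Tom": 75, "Tim": 95})

# & is element-wise "and", | is element-wise "or".
mid_range = scores[(scores >= 80) & (scores < 95)]
extremes = scores[(scores < 80) | (scores >= 95)]

print(mid_range)
print(extremes)
```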

In [12]:
names["Tom"] = 60  # same as for numpy
names

Jane    90
Bill    80
Elon    85
Tom     60
Tim     95
dtype: int64

In [13]:
names[names<=80] = 83  # same as for numpy
names

Jane    90
Bill    83
Elon    85
Tom     83
Tim     95
dtype: int64

Series support membership tests on the index, element-wise arithmetic, and missing-value checks:

In [14]:
"Tom" in names  # checks the index labels, not the values

True

In [15]:
names / 10 

Jane    9.0
Bill    8.3
Elon    8.5
Tom     8.3
Tim     9.5
dtype: float64

In [16]:
names**2

Jane    8100
Bill    6889
Elon    7225
Tom     6889
Tim     9025
dtype: int64

In [17]:
names.isnull()

Jane    False
Bill    False
Elon    False
Tom     False
Tim     False
dtype: bool
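
In the Series above nothing is missing, so `isnull` returns `False` everywhere. Here is a sketch of a case where it does not, assuming we ask for a label that is absent from the source dict ("Ann" is a made-up name used only for this example):

```python
import pandas as pd

score = {"Jane": 90, "Bill": 80, "Elon": 85}

# "Ann" is not a key in the dict, so her value becomes NaN (missing).
s = pd.Series(score, index=["Jane", "Bill", "Ann"])

missing = s.isnull()   # True where the value is missing
filled = s.fillna(0)   # replace missing values with 0

print(missing)
print(filled)
```

Note that the presence of NaN makes the Series `float64`, even though the original values were integers.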

### DataFrame

A DataFrame is basically a table: a set of columns (Series) that share one index.
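
To illustrate the shared index, here is a minimal sketch (the names `score`, `sport`, and `table` are illustrative) that builds a DataFrame from two Series whose labels are listed in different orders. The columns are matched by label, not by position:

```python
import pandas as pd

score = pd.Series({"Bill": 90, "Tom": 80})
sport = pd.Series({"Tom": "Football", "Bill": "Wrestling"})  # different order

# Both columns end up attached to the same index.
table = pd.DataFrame({"score": score, "sport": sport})

print(table)
```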

In [18]:
data = {
    "name": ["Bill", "Tom", "Tim", "John", "Alex", "Vanessa", "Kate"],
    "score": [90, 80, 85, 75, 95, 60, 65],
    "sport":["Wrestling", "Football", "Skiing", "Swimming", "Tennis", "Karete", "Surfing"],
    "sex": ["M", "M", "M", "M", "F", "F", "F"],
}

In [19]:
df = pd.DataFrame(data)
df

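| | name | score | sport | sex |
|---|---|---|---|---|
| 0 | Bill | 90 | Wrestling | M |
| 1 | Tom | 80 | Football | M |
| 2 | Tim | 85 | Skiing | M |
| 3 | John | 75 | Swimming | M |
| 4 | Alex | 95 | Tennis | F |
| 5 | Vanessa | 60 | Karete | F |
| 6 | Kate | 65 | Surfing | F |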


In [20]:
df.head()

| | name | score | sport | sex |
|---|---|---|---|---|
| 0 | Bill | 90 | Wrestling | M |
| 1 | Tom | 80 | Football | M |
| 2 | Tim | 85 | Skiing | M |
| 3 | John | 75 | Swimming | M |
| 4 | Alex | 95 | Tennis | F |


In [21]:
df.tail(3)

| | name | score | sport | sex |
|---|---|---|---|---|
| 4 | Alex | 95 | Tennis | F |
| 5 | Vanessa | 60 | Karete | F |
| 6 | Kate | 65 | Surfing | F |


In [22]:
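# columns= selects and orders columns by name: "sex" is not listed, so it is dropped;
# "gender" and "age" are not keys in data, so they are created and filled with NaN
df = pd.DataFrame(
    data,
    columns=["name", "sport", "gender", "score", "age"],
    index=["one", "two", "three", "four", "five", "six", "seven"],
)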
df

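| | name | sport | gender | score | age |
|---|---|---|---|---|---|
| one | Bill | Wrestling | NaN | 90 | NaN |
| two | Tom | Football | NaN | 80 | NaN |
| three | Tim | Skiing | NaN | 85 | NaN |
| four | John | Swimming | NaN | 75 | NaN |
| five | Alex | Tennis | NaN | 95 | NaN |
| six | Vanessa | Karete | NaN | 60 | NaN |
| seven | Kate | Surfing | NaN | 65 | NaN |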


In [26]:
# indexing a DataFrame with [] and a single column name returns that column as a Series
df["sport"]

one      Wrestling
two       Football
three       Skiing
four      Swimming
five        Tennis
six         Karete
seven      Surfing
Name: sport, dtype: object

In [27]:
# querying two columns will return a new dataframe
my_columns = ["name", "sport"]
df[my_columns]  # same as df[["name", "sport"]]

| | name | sport |
|---|---|---|
| one | Bill | Wrestling |
| two | Tom | Football |
| three | Tim | Skiing |
| four | John | Swimming |
| five | Alex | Tennis |
| six | Vanessa | Karete |
| seven | Kate | Surfing |


In [28]:
df.sport

one      Wrestling
two       Football
three       Skiing
four      Swimming
five        Tennis
six         Karete
seven      Surfing
Name: sport, dtype: object

In [29]:
df.loc["one"]  # access a row by its index label; the result is a Series with one entry per column

name           Bill
sport     Wrestling
gender          NaN
score            90
age             NaN
Name: one, dtype: object

In [30]:
df.loc[["one", "two"]]  # select several rows

| | name | sport | gender | score | age |
|---|---|---|---|---|---|
| one | Bill | Wrestling | NaN | 90 | NaN |
| two | Tom | Football | NaN | 80 | NaN |
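
`.loc` also accepts a column selector as a second argument, and `.iloc` does the same job by integer position. A small sketch on a cut-down version of the table (the names `cell`, `block`, and `first_row` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Bill", "Tom", "Tim"], "score": [90, 80, 85]},
    index=["one", "two", "three"],
)

cell = df.loc["two", "score"]              # single value: row label, column label
block = df.loc[["one", "two"], ["name"]]   # sub-table: list of rows, list of columns
first_row = df.iloc[0]                     # row by 0-based integer position

print(cell)
print(block)
print(first_row)
```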


In [31]:
df["age"] = 18
df

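| | name | sport | gender | score | age |
|---|---|---|---|---|---|
| one | Bill | Wrestling | NaN | 90 | 18 |
| two | Tom | Football | NaN | 80 | 18 |
| three | Tim | Skiing | NaN | 85 | 18 |
| four | John | Swimming | NaN | 75 | 18 |
| five | Alex | Tennis | NaN | 95 | 18 |
| six | Vanessa | Karete | NaN | 60 | 18 |
| seven | Kate | Surfing | NaN | 65 | 18 |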


In [34]:
values = [18, 19, 20, 18, 17, 17, 18]
df["age"] = values
df

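| | name | sport | gender | score | age |
|---|---|---|---|---|---|
| one | Bill | Wrestling | NaN | 90 | 18 |
| two | Tom | Football | NaN | 80 | 19 |
| three | Tim | Skiing | NaN | 85 | 20 |
| four | John | Swimming | NaN | 75 | 18 |
| five | Alex | Tennis | NaN | 95 | 17 |
| six | Vanessa | Karete | NaN | 60 | 17 |
| seven | Kate | Surfing | NaN | 65 | 18 |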


In [35]:
df.score >= 70  # same as for numpy

one       True
two       True
three     True
four      True
five      True
six      False
seven    False
Name: score, dtype: bool

In [36]:
df["pass"] = df.score>=70
df

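| | name | sport | gender | score | age | pass |
|---|---|---|---|---|---|---|
| one | Bill | Wrestling | NaN | 90 | 18 | True |
| two | Tom | Football | NaN | 80 | 19 | True |
| three | Tim | Skiing | NaN | 85 | 20 | True |
| four | John | Swimming | NaN | 75 | 18 | True |
| five | Alex | Tennis | NaN | 95 | 17 | True |
| six | Vanessa | Karete | NaN | 60 | 17 | False |
| seven | Kate | Surfing | NaN | 65 | 18 | False |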


In [37]:
df[df.score >= 70]

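| | name | sport | gender | score | age | pass |
|---|---|---|---|---|---|---|
| one | Bill | Wrestling | NaN | 90 | 18 | True |
| two | Tom | Football | NaN | 80 | 19 | True |
| three | Tim | Skiing | NaN | 85 | 20 | True |
| four | John | Swimming | NaN | 75 | 18 | True |
| five | Alex | Tennis | NaN | 95 | 17 | True |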
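
Masks do not have to come from comparisons. As a sketch, assuming we want to keep rows whose sport is in a given list, `isin` builds the mask and `.loc` applies it while also choosing columns (the sports list and the names `mask` and `selected` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Bill", "Tim", "Alex", "Kate"],
        "sport": ["Wrestling", "Skiing", "Tennis", "Surfing"],
    },
    index=["one", "three", "five", "seven"],
)

mask = df["sport"].isin(["Tennis", "Skiing"])  # True where sport is in the list
selected = df.loc[mask, ["name"]]              # filter rows and pick columns in one step

print(selected)
```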


In [40]:
# removing a column (Series) from a DataFrame in place
del df["pass"]
df

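| | name | sport | gender | score | age |
|---|---|---|---|---|---|
| one | Bill | Wrestling | NaN | 90 | 18 |
| two | Tom | Football | NaN | 80 | 19 |
| three | Tim | Skiing | NaN | 85 | 20 |
| four | John | Swimming | NaN | 75 | 18 |
| five | Alex | Tennis | NaN | 95 | 17 |
| six | Vanessa | Karete | NaN | 60 | 17 |
| seven | Kate | Surfing | NaN | 65 | 18 |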
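
`del` modifies the DataFrame in place. If the original should stay intact, one alternative is `drop`, which returns a new DataFrame. A minimal sketch on a small stand-in table (the name `without_pass` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Bill", "Tom"], "score": [90, 80], "pass": [True, True]})

# drop returns a copy without the column; df itself is left unchanged.
without_pass = df.drop(columns=["pass"])

print(list(df.columns))
print(list(without_pass.columns))
```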


### Summarizing & Computing Descriptive Statistics

In [41]:
import numpy as np

data = [
    [   2.4, np.nan],
    [   6.3,   -5.4],
    [np.nan, np.nan],
    [  0.75,   -1.3],
]

df=pd.DataFrame(
    data,
    index=["a","b","c","d"],
    columns=["one","two"])
df

| | one | two |
|---|---|---|
| a | 2.4 | NaN |
| b | 6.3 | -5.4 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |


Pandas offers a wide range of useful functions for summarizing data. By default they skip missing values (NaN):

In [44]:
df.sum(axis=0)  # axis=0: sum down the rows, giving one total per column

one    9.45
two   -6.70
dtype: float64

In [43]:
df.sum(axis=1)  # axis=1: sum across the columns, giving one total per row

a    2.40
b    0.90
c    0.00
d   -0.55
dtype: float64

In [None]:
df.mean(axis=1)

In [45]:
df.mean(axis=1, skipna=False)

a      NaN
b    0.450
c      NaN
d   -0.275
dtype: float64
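
To see where the default (NaN-skipping) results come from, it can help to look at the sum and the count of non-missing values side by side. This is only a sketch on the same data; the names `row_sum`, `row_count`, and `row_mean` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[2.4, np.nan], [6.3, -5.4], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

row_sum = df.sum(axis=1)      # NaN is ignored; an all-NaN row sums to 0.0
row_count = df.count(axis=1)  # number of non-missing values in each row
row_mean = df.mean(axis=1)    # row_sum / row_count; NaN where nothing is present

print(pd.DataFrame({"sum": row_sum, "count": row_count, "mean": row_mean}))
```

Row `a` has a single valid value, so its mean is that value; row `c` has none, so its mean is NaN even with the default `skipna=True`.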

The `describe` method summarizes each numeric column: the count of non-missing values, the mean, the standard deviation, the minimum and maximum, and the quartiles.

In [46]:
df.describe()

| | one | two |
|---|---|---|
| count | 3.0 | 2.0 |
| mean | 3.15 | -3.35 |
| std | 2.85 | 2.899138 |
| min | 0.75 | -5.4 |
| 25% | 1.575 | -4.375 |
| 50% | 2.4 | -3.35 |
| 75% | 4.35 | -2.325 |
| max | 6.3 | -1.3 |
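
Each row of that summary can also be computed on its own. Here is a sketch for column `one` (rebuilt as a standalone Series named `col` for illustration), which reproduces a few of the numbers above:

```python
import numpy as np
import pandas as pd

col = pd.Series([2.4, 6.3, np.nan, 0.75])  # same values as column "one"

count = col.count()       # non-missing values only
mean = col.mean()
std = col.std()           # sample standard deviation (divides by n - 1)
q25 = col.quantile(0.25)  # first quartile, linearly interpolated
median = col.quantile(0.5)

print(count, mean, std, q25, median)
```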


### Data Reading & Writing

Pandas supports a lot of formats. The most popular are:
- CSV
- JSON
- pickle

CSV is usually the go-to option for text data.

In [None]:
df = pd.read_csv("high_scores.csv")  # reads file from filesystem
df


In [None]:
df['Level pass'] = df['Score'] >= 90
df

In [None]:
df.to_csv("high_scores_updated.csv")  # saves file to filesystem
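
One detail worth knowing: by default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this with an in-memory buffer instead of a real file; the one-column table is a made-up stand-in, since the contents of `high_scores.csv` are not shown here:

```python
import io

import pandas as pd

# Hypothetical stand-in for the data in high_scores.csv.
df = pd.DataFrame({"Score": [95, 70]})

with_index = df.to_csv()                # no path given: returns the CSV text
without_index = df.to_csv(index=False)  # leave the index out of the file

reloaded_a = pd.read_csv(io.StringIO(with_index))
reloaded_b = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_a.columns))
print(list(reloaded_b.columns))
```

Passing `index=False` is a reasonable default when the index is just the automatic 0, 1, 2, ... numbering.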

## Additional materials

[Pandas cookbook](https://github.com/jvns/pandas-cookbook)

[Pandas Workshop](https://github.com/stefmolin/pandas-workshop)