## Lecture pandas basics

* pandas.Series
* pandas.DataFrame
* read_csv
* indexing
* plotting


## Pandas series

- Can be created from a dictionary
- Can be created from a list
- Can be created from an np.array


In [62]:
import pandas as pd

programs_dict = dict(AI = 26, NET = 38, JAVA = 30, UX = 28)


programs_series = pd.Series(programs_dict)

programs_series

AI      26
NET     38
JAVA    30
UX      28
dtype: int64
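
The cell above covers the dictionary case. As a minimal sketch of the other two constructors from the list, the following builds the same data from a list and from a NumPy array. The variable names `from_list`, `from_array` and `from_dict` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index 0, 1, 2, ...
from_list = pd.Series([26, 38, 30, 28])

# From a NumPy array, with explicit labels passed through the index argument
from_array = pd.Series(np.array([26, 38, 30, 28]), index=["AI", "NET", "JAVA", "UX"])

# From a dictionary: the keys become the index
from_dict = pd.Series({"AI": 26, "NET": 38, "JAVA": 30, "UX": 28})

print(from_list.index.tolist())           # [0, 1, 2, 3]
print((from_array == from_dict).all())    # True
```

Passing `index=` is optional; without it the array version would get the same default integer index as the list version.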

In [63]:
# Extract values through indexing
print(f"{programs_series[0] = }") # Access a value by position
print(f"{programs_series[-1] = }")

print(f"{programs_series['UX'] = }")   # Access a value by label (revisit this)

# Get the keys; the keys are the index
print(f"{programs_series.keys() = }")
print(f"{programs_series.keys()[0] = }")  # Picks out a single key


programs_series[0] = 26
programs_series[-1] = 28
programs_series['UX'] = 28
programs_series.keys() = Index(['AI', 'NET', 'JAVA', 'UX'], dtype='object')
programs_series.keys()[0] = 'AI'
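
Integer access with plain `[]` on a label-indexed Series is ambiguous, and recent pandas versions deprecate treating such integer keys as positions. A hedged sketch of the explicit alternatives, using `.iloc` for positions and `.loc` for labels (the variable names `first`, `last`, `ux` and `labels` are only for illustration):

```python
import pandas as pd

programs_series = pd.Series({"AI": 26, "NET": 38, "JAVA": 30, "UX": 28})

first = programs_series.iloc[0]     # first value, by position
last = programs_series.iloc[-1]     # last value, by position
ux = programs_series.loc["UX"]      # value stored under the label "UX"
labels = programs_series.index.tolist()  # the keys of the Series

print(first, last, ux)   # 26 28 28
print(labels)            # ['AI', 'NET', 'JAVA', 'UX']
```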

In [None]:
programs_series["AI"]  # The value can be accessed through its label

26

In [None]:
import random as rnd

rnd.seed(1337)  # The same seed gives the same values (revisit this)

dice_series = pd.Series([rnd.randint(1,6) for _ in range(10)])
dice_series


# dtype int64 means each value is stored as a 64-bit integer. Not important here.

0    5
1    5
2    6
3    3
4    5
5    5
6    6
7    2
8    3
9    4
dtype: int64

In [None]:
import random as rnd

# Rows are counted downward
# Columns are counted left to right (the way you read)

# For reproducibility: get the same set of values
rnd.seed(1337)  # The same seed gives the same values on every run

dice_series = pd.Series([rnd.randint(1,6) for _ in range(10)])
#dice_series.head()  # Shows the first 5 rows, however large the dataset is
dice_series

0    5
1    5
2    6
3    3
4    5
5    5
6    6
7    2
8    3
9    4
dtype: int64
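
To make the effect of the seed concrete, here is a small sketch that rolls the dice twice with the same seed and compares the results. The helper `roll_dice` is hypothetical and not part of the lecture code.

```python
import random as rnd


def roll_dice(seed, n=10):
    """Reseed the generator, then roll a six-sided die n times."""
    rnd.seed(seed)
    return [rnd.randint(1, 6) for _ in range(n)]


first_run = roll_dice(1337)
second_run = roll_dice(1337)

# Same seed, same sequence of pseudo-random numbers
print(first_run == second_run)  # True
```

Without the call to `rnd.seed`, each run of the cell would typically produce a different sequence.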

In [None]:
# Summary statistics
print(f"{dice_series.min() = }") # smallest value
print(f"{dice_series.argmin() = }") # gives the position of the min value
print(f"{dice_series.max() = }") # largest value
print(f"{dice_series.mean() = }") # average
print(f"{dice_series.median() = }") # sort all values in order and pick the middle one; if there are 2 middle numbers, take their average



dice_series.min() = 2
dice_series.argmin() = 7
dice_series.max() = 6
dice_series.mean() = 4.4
dice_series.median() = 5.0
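
The comments above describe the average and the median in words. As a sketch, the same two statistics can be computed by hand and compared with the pandas methods. The ten dice values are copied from the output shown earlier; `mean_by_hand` and `median_by_hand` are illustrative names.

```python
import pandas as pd

dice_series = pd.Series([5, 5, 6, 3, 5, 5, 6, 2, 3, 4])

# Mean: sum of all values divided by the number of values
mean_by_hand = sum(dice_series) / len(dice_series)

# Median: sort, then take the middle value (or the average of the two middle values)
ordered = sorted(dice_series)
mid = len(ordered) // 2
if len(ordered) % 2 == 0:
    median_by_hand = (ordered[mid - 1] + ordered[mid]) / 2
else:
    median_by_hand = ordered[mid]

print(mean_by_hand, dice_series.mean())
print(median_by_hand, dice_series.median())
```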


## DataFrame

- tabular data with rows and columns
- analogous to 2D NumPy arrays, but with flexible row indices and column names (NumPy arrays hold plain tabular data without named indices; examples will follow)
- a "specialized" dictionary in which each column name maps to a Series object (every column is a Series)
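
A minimal sketch of the points above: the same numbers held in a 2D NumPy array and in a DataFrame. The `Semesters` column and its values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Plain 2D array: values are addressed by integer position only
values = np.array([[26, 4], [38, 3], [30, 2]])  # second column is made-up data

# DataFrame: the same values with row labels and column names
df = pd.DataFrame(values, index=["AI", "NET", "JAVA"], columns=["Students", "Semesters"])

print(values[1, 0])                 # 38
print(df.loc["NET", "Students"])    # 38

# Dictionary-like behaviour: the keys are the column names, each value is a Series
print(list(df.keys()))              # ['Students', 'Semesters']
print(isinstance(df["Students"], pd.Series))  # True
```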

In [None]:
#pd.DataFrame(programs_series)  # Gives a DataFrame whose single column is named 0

# Instantiate a DataFrame from a Series object (a list or dictionary works as well)
df_programs = pd.DataFrame(programs_series, columns=("Number_of_students",))  # The single column is now named Number_of_students
df_programs


      Number_of_students
AI                    26
NET                   38
JAVA                  30
UX                    28


In [None]:
# Create 2 Series objects

students = pd.Series({"AI": 26, "NET": 38, "UX": 28, "JAVA": 30})
skills = pd.Series({"AI": "Python", "net": "C#", "UX": "Figma", "JAVA": "Java"})


# Create a DataFrame from 2 Series objects
df_programs = pd.DataFrame({"Students": students, "Skills": skills}) 
df_programs  # Gives tabular data

# A NumPy array has one datatype throughout and no column names; here we have named columns
# Each column name maps to a pandas Series
# Specialized dictionary
# NET and net each get a NaN because the index labels do not match

      Students  Skills
AI        26.0  Python
JAVA      30.0    Java
NET       38.0     NaN
UX        28.0   Figma
net        NaN      C#
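
The NaN cells come from index alignment: pandas matches rows by label, and `"NET"` and `"net"` are different labels. One possible fix, sketched here as an assumption rather than as part of the lecture, is to normalise the labels before building the DataFrame. The name `skills_fixed` is illustrative.

```python
import pandas as pd

students = pd.Series({"AI": 26, "NET": 38, "UX": 28, "JAVA": 30})
skills = pd.Series({"AI": "Python", "net": "C#", "UX": "Figma", "JAVA": "Java"})

# Upper-case every index label so that "net" becomes "NET"
skills_fixed = skills.rename(index=str.upper)

df_fixed = pd.DataFrame({"Students": students, "Skills": skills_fixed})

print(df_fixed.shape)                 # (4, 2)
print(df_fixed.loc["NET", "Skills"])  # C#
print(df_fixed["Students"].dtype)     # int64
```

With matching labels there are no missing values, so the `Students` column keeps its integer dtype.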


In [None]:
df_programs["Students"]

# Look at the dtype: pandas has converted the values to float
# Because the column contains a NaN, we get a different dtype

AI      26.0
JAVA    30.0
NET     38.0
UX      28.0
net      NaN
Name: Students, dtype: float64
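
The conversion happens because NumPy's `int64` cannot represent a missing value, while `float64` can (as NaN). A small sketch of this, assuming the same student numbers as above; it also shows pandas' nullable `Int64` dtype as one way to keep integers alongside missing values.

```python
import pandas as pd

students = pd.Series({"AI": 26, "NET": 38, "UX": 28, "JAVA": 30})
print(students.dtype)   # int64

# Reindexing with a label that has no value introduces a NaN
aligned = students.reindex(["AI", "JAVA", "NET", "UX", "net"])
print(aligned.dtype)          # float64
print(aligned.isna().sum())   # 1

# Nullable integer dtype: missing values are stored as <NA> instead of NaN
nullable = aligned.astype("Int64")
print(nullable.dtype)   # Int64
```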

In [None]:
df_programs["Students"].mean(), (26+30+38+28)/4 # Gives the mean; the second expression checks it by hand (the NaN is skipped)

(30.5, 30.5)

In [None]:
median_student_number = df_programs["Students"].median()
print(f"Median number of students in the programs {df_programs.index.to_list()} is {median_student_number:.0f}")

Median number of students in the programs ['AI', 'JAVA', 'NET', 'UX', 'net'] is 29


We have extracted information from the table for the five programs and calculated the median.

In [None]:
df_programs["Skills"] # Returns a Series; note dtype object

AI      Python
JAVA      Java
NET        NaN
UX       Figma
net         C#
Name: Skills, dtype: object

In [None]:
df_programs["Skills"][0] # Returns a single value (positional access)

'Python'

In [None]:
df_programs["Skills"][0], df_programs["Skills"]["AI"], df_programs["Skills"]["UX"]

('Python', 'Python', 'Figma')

## Indexers
- loc
Slicing and indexing using the explicit index (labels)

- iloc
The counterpart: slicing and indexing using the implicit integer position
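
One practical difference between the two indexers is how slices end. The sketch below uses a small stand-in DataFrame (not the lecture's `df_programs`) to show that a `loc` slice includes its end label, while an `iloc` slice excludes its stop position, like a Python list.

```python
import pandas as pd

df = pd.DataFrame({"Students": [26, 30, 38, 28]}, index=["AI", "JAVA", "NET", "UX"])

by_label = df.loc["JAVA":"NET"]   # label slice: both endpoints are included
by_position = df.iloc[1:3]        # position slice: the stop position 3 is excluded

print(by_label.index.tolist())     # ['JAVA', 'NET']
print(by_position.index.tolist())  # ['JAVA', 'NET']

# Same start, "same" end: loc returns one more row than iloc
print(len(df.loc["AI":"NET"]), len(df.iloc[0:2]))  # 3 2
```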


In [None]:
df_programs.loc["AI"] #Returns a Series Object

Students      26.0
Skills      Python
Name: AI, dtype: object

In [None]:
df_programs.loc["JAVA"]

Students    30.0
Skills      Java
Name: JAVA, dtype: object

In [65]:
df_programs.loc[["AI", "UX"]]  # A list of labels returns a DataFrame instead of a Series

    Students  Skills
AI      26.0  Python
UX      28.0   Figma


In [None]:
# Integer location; try a different slice, e.g. 1:4
df_programs.iloc[1:3] # Returns a DataFrame object

      Students  Skills
JAVA      30.0    Java
NET       38.0     NaN


## Masking

In [None]:
df_programs

      Students  Skills
AI        26.0  Python
JAVA      30.0    Java
NET       38.0     NaN
UX        28.0   Figma
net        NaN      C#


In [None]:
df_programs["Students"] >= 30  # True where a program has 30 or more students

AI      False
JAVA     True
NET      True
UX      False
net     False
Name: Students, dtype: bool

In [None]:
# Using masking to filter the DataFrame
# This gives the filtered DataFrame: only rows with 30 or more students

df_programs[df_programs["Students"] >= 30]

      Students  Skills
JAVA      30.0    Java
NET       38.0     NaN


In [None]:
df_programs # The original DataFrame is unchanged by the filtering above

      Students  Skills
AI        26.0  Python
JAVA      30.0    Java
NET       38.0     NaN
UX        28.0   Figma
net        NaN      C#


In [64]:
df_programs_over_29 = df_programs[df_programs["Students"] >= 30]  # Assign the filtered result to a new variable
df_programs_over_29

      Students  Skills
JAVA      30.0    Java
NET       38.0     NaN
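
Masks can also be combined. As a sketch on a small stand-in DataFrame (the data is assumed, with the `NET` label already consistent), two conditions are joined with `&`, and `.loc` then selects matching rows and one column in a single step.

```python
import pandas as pd

df = pd.DataFrame(
    {"Students": [26, 30, 38, 28], "Skills": ["Python", "Java", "C#", "Figma"]},
    index=["AI", "JAVA", "NET", "UX"],
)

# Each condition needs its own parentheses; & means "and", | means "or", ~ means "not"
mask = (df["Students"] >= 28) & (df["Skills"] != "Java")

selected = df.loc[mask, "Skills"]   # rows where the mask is True, Skills column only
print(selected.index.tolist())      # ['NET', 'UX']
print(df.shape)                     # (4, 2), the original is not modified
```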


----
## Excel data



In [67]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_excel("../Data/calories.xlsx")
df.head()  # Shows the first 5 rows. We can also see it has 5 columns


   FoodCategory             FoodItem  per100grams  Cals_per100grams  KJ_per100grams
0   CannedFruit           Applesauce         100g            62 cal          260 kJ
1   CannedFruit      Canned Apricots         100g            48 cal          202 kJ
2   CannedFruit  Canned Blackberries         100g            92 cal          386 kJ
3   CannedFruit   Canned Blueberries         100g            88 cal          370 kJ
4   CannedFruit      Canned Cherries         100g            54 cal          227 kJ
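
The calorie values are stored as text such as `62 cal`, so they cannot be used in calculations directly. A possible cleaning step, sketched on three rows copied from the output above instead of the Excel file, strips the unit and converts to integers. The new column name `Calories` is an assumption made for this example.

```python
import pandas as pd

# Three rows copied from the head() output; the real data comes from the Excel file
df_sample = pd.DataFrame(
    {
        "FoodItem": ["Applesauce", "Canned Apricots", "Canned Blackberries"],
        "Cals_per100grams": ["62 cal", "48 cal", "92 cal"],
    }
)

# Remove the literal " cal" suffix, then convert the remaining digits to integers
df_sample["Calories"] = (
    df_sample["Cals_per100grams"].str.replace(" cal", "", regex=False).astype(int)
)

print(df_sample["Calories"].tolist())  # [62, 48, 92]
print(df_sample["Calories"].max())     # 92
```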
