# Introduction to Pandas
In this notebook, we will learn the basics of Python's **pandas** library. Pandas is a powerful tool for data manipulation and analysis.

We will cover:
- Creating and inspecting DataFrames
- Selecting and filtering data
- Adding and modifying columns
- Aggregation and group operations
- Handling missing data
- Reading from and writing to CSV files

[Docs - DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

In [None]:
!pip3 install pandas
# import pandas
import pandas as pd
# from pandas import DataFrame

linebreaks = '\n'*2
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


## 1. Creating DataFrames
You can create a DataFrame from a dictionary, from a list of lists, or by reading a CSV file.

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, None],
    'City': ['Vienna', 'Feldkirch', 'Feldkirch', 'Bregenz', None]
}

df = pd.DataFrame(data)
print(df)

df2 = pd.DataFrame([['Alice', 24], ['Bob']], columns=['Name', 'Age'], index=[100, 101])
df2

## 2. Inspecting DataFrames
- `head()`, `tail()` — view first/last rows
- `info()` — summary of DataFrame
- `describe()` — statistics for numeric columns
- `shape` — number of rows and columns
- `columns` — list column names

In [None]:
print(f"df.head(2):\n{df.head(2)}{linebreaks}")
print(f"df.tail(2):\n{df.tail(2)}{linebreaks}")

print(f"df.info(): ")
df.info()
print(f"{linebreaks}")

print(f"df.describe():\n{df.describe()}{linebreaks}")
print(f"df.shape: {df.shape}{linebreaks}")
print(f"df.columns: {df.columns}")



In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'Salary': [50000, 60000, 45000, 70000, 52000],
    'Bonus': [5000, 6000, 4000, 7000, 5500]
}
df3 = pd.DataFrame(data)
df3.describe()

## 3. Selecting and Filtering Data
- Selecting columns: `df['Column']` or `df.Column`
- Selecting rows: `iloc[]` (by integer position), `loc[]` (by label)
- Filtering: using boolean conditions

In [None]:
name = df['Name']
print(f"name:\n{name}")
print(f"type:\n{type(name)}{linebreaks}")


row = df.iloc[1]
print(f"row:\n{row}")
print(f"type:\n{type(row)}")
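
The selection forms listed above can be sketched on a small frame. The `demo` frame, its labels, and its values are made up for this illustration. Single brackets with one label return a Series, a list of labels returns a DataFrame, and `loc`/`iloc` accept a row selector and a column selector together.

```python
import pandas as pd

# Hypothetical data, used only for this illustration.
demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 27, 22]},
    index=["a", "b", "c"],
)

one_col = demo["Age"]              # single label -> Series
two_cols = demo[["Name", "Age"]]   # list of labels -> DataFrame

by_label = demo.loc["b", "Age"]    # row label "b", column label "Age"
by_position = demo.iloc[1, 1]      # second row, second column

print(type(one_col).__name__, type(two_cols).__name__)
print(by_label, by_position)
```

Both `by_label` and `by_position` point at the same cell here, because row `"b"` is also the second row.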

## 4. loc vs iloc

`loc` gets rows (and/or columns) with particular labels. <br>
`iloc` gets rows (and/or columns) at integer locations.

In [None]:
s = pd.Series(list("abcdefg"), index=[49, 48, 47, 0, 1, 2, "G"]) 
print(f"{s}\n")

print(f"{'-'*5}loc{'-'*6}")
print(f"loc: {s.loc[0]}")
print(f"loc: {s.loc[2]}")
print(f"loc: {s.loc['G']}\n")  # single quotes inside the f-string: nested double quotes need Python 3.12+

print(f"{'-'*5}iloc{'-'*5}")
print(f"iloc: {s.iloc[0]}")
print(f"iloc: {s.iloc[2]}\n")


print(f"{'-'*14}")
print(s.loc[[0, 2]])
s.iloc[[0, 2]]
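
Slicing is where `loc` and `iloc` differ most visibly. The sketch below uses a Series similar to the one above, without the `"G"` label so that the index is purely integer (a simplification made for this example). A label slice includes both endpoints, while a position slice excludes the stop, as with Python lists.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions.
s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])

# Label-based slice: from label 0 up to and including label 2.
by_label = s.loc[0:2]

# Position-based slice: positions 0 and 1, position 2 is excluded.
by_position = s.iloc[0:2]

print(by_label.tolist())
print(by_position.tolist())
```

The same `0:2` therefore selects three elements with `loc` and two different elements with `iloc`.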


## 5. Filtering

In [None]:
# The comparison returns a boolean Series (the mask); no list wrapper is needed
print(f"{df['Age'] > 25}\n\n")
 
new_df = df[df['Age'] > 25]
print(f"type(new_df): {type(new_df)}\n")
new_df

In [None]:
print(df[(df['Age'] > 25) & (df['City'] == "Bregenz")])
df[(df['Age'] < 25) | (df['City'] == "Vienna")]
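
Two details of boolean filtering are easy to miss, so here is a small sketch on a copy of the example data (the names `people`, `older`, and `not_feldkirch` are introduced only for this illustration). A comparison against a missing value evaluates to `False`, and `~` negates a mask, for example one built with `isin()`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [24, 27, 22, 32, None],
    "City": ["Vienna", "Feldkirch", "Feldkirch", "Bregenz", None],
})

# NaN > 25 is False, so Eve is never selected by this mask.
older = people["Age"] > 25
print(older.tolist())

# isin() tests membership; ~ inverts the resulting mask.
not_feldkirch = people[~people["City"].isin(["Feldkirch"])]
print(not_feldkirch["Name"].tolist())
```

Note that Eve, whose city is missing, does appear in `not_feldkirch`, because a missing value is not a member of the list.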

## 6. Adding and Modifying Columns

In [None]:
df

In [None]:
df['Student'] = [True, False, True, False, False]
print(df)

df['Age'] = df['Age'] + 1
df

In [None]:
df_new = df.copy()
df_new['City'] = df_new['City'].fillna('Unknown') + " text"
print(df_new)

# df_str_concat = df.assign(City=df['City'].fillna('Unknown') + " text")
# print(df_str_concat)

In [None]:
df[df['Name'].str.contains('o')]

In [None]:
print(df[df['City'].str.contains('Br', na=False)])

df[df['Name'].apply(lambda x: len(x) > 3)]
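
The `apply` filter above can usually be written with the vectorized `.str` accessor instead. A minimal sketch, assuming a plain Series of names without missing values (`names` is a made-up example):

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", "Charlie"])

# Element-wise Python function, as in the cell above.
via_apply = names[names.apply(lambda x: len(x) > 3)]

# Same filter with the vectorized string accessor.
via_str = names[names.str.len() > 3]

print(via_str.tolist())
```

Both forms select the same rows; the `.str` version avoids the lambda and reads closer to the other string filters in this section.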

In [None]:
# df.drop([1, 3])

## 7. Aggregation and Group Operations
- `mean()`, `sum()`, `min()`, `max()` — basic statistics
- `groupby()` — group data by column and aggregate

In [None]:
print(f"{df['Age'].mean()}\n")
print(f"{df.groupby('City')['Age'].mean()}\n")


grouped = df.groupby('Age')['City']  # a SeriesGroupBy object, not a generator
print(grouped)
# for age, cities in grouped:
#     print(f"Age: {age}")
#     print(cities)

# Iterating yields (group key, sub-Series) pairs
list(grouped)

In [None]:
print(f"Average age: {df['Age'].mean()}\n")                     
print(f"{df.groupby('City')['Age'].mean()}\n")                     
print(f"{df.groupby('Student')['Age'].mean()}\n")                  
print(f"{df.groupby('City')['Age'].agg(['mean', 'count'])}\n")     
print(f"{df.groupby('City').agg({'Age': 'mean', 'Student': 'sum'})}\n")  
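
As a sketch of two related `groupby` features, the block below uses named aggregation to control the output column names and `dropna=False` to keep rows whose group key is missing. The `staff` frame and the column names `mean_age` and `n_students` are made up for this example.

```python
import pandas as pd

staff = pd.DataFrame({
    "City": ["Vienna", "Feldkirch", "Feldkirch", "Bregenz", None],
    "Age": [25, 28, 23, 33, 30],
    "Student": [True, False, True, False, False],
})

# Named aggregation: output_column=(input_column, function).
summary = staff.groupby("City").agg(
    mean_age=("Age", "mean"),
    n_students=("Student", "sum"),
)
print(summary)

# By default, rows with a missing key are dropped; dropna=False keeps them.
with_missing = staff.groupby("City", dropna=False)["Age"].mean()
print(with_missing)
```

The default result has one row per known city, so the person without a city silently disappears from `summary` unless `dropna=False` is passed.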


## 8. Handling Missing Data
- `isnull()` — check for missing values
- `dropna()` — remove missing data
- `fillna()` — fill missing data

In [None]:
print(f"{df.isnull()}\n")

df['Age'] = df['Age'].fillna(df['Age'].mean())
print(f"{df}\n")

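# dropna() returns a new object. Assigning df['City'].dropna() back to df['City']
# would realign on the index and restore the NaN, so show the filtered rows instead.
print(f"{df.dropna(subset=['City'])}\n")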

df['City'] = df['City'].fillna("Unknown")
print(df)
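
One pitfall with `dropna()` is worth a short illustration: dropping values from a single column and assigning the result back does not remove anything, because pandas realigns the shorter Series on the index. A minimal sketch with a made-up two-row frame (`df_demo` is a name used only here):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Eve"],
    "City": ["Vienna", None],
})

# The shortened Series is realigned on the index, so the gap comes back.
df_demo["City"] = df_demo["City"].dropna()
still_missing = int(df_demo["City"].isna().sum())
print(still_missing)

# Dropping at the DataFrame level removes the incomplete row.
complete_rows = df_demo.dropna(subset=["City"])
print(complete_rows)
```

To remove rows, call `dropna()` on the DataFrame; to keep the rows and replace the gaps, use `fillna()`.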

## 9. Reading and Writing CSV Files
- `read_csv()` — read CSV
- `to_csv()` — write CSV

In [None]:
df.to_csv('example.csv', index=False)
# other to_csv options: sep, columns, header, encoding, na_rep, float_format, lineterminator, quotechar, date_format

df2 = pd.read_csv('example.csv')
# sep, header, names, usecols, index_col, nrows, skiprows, na_values, dtype, parse_dates, encoding
df2

In [None]:
df.to_csv('example.csv', index=False, sep=';')
df_loaded = pd.read_csv('example.csv', sep=';')
print(df_loaded.equals(df))
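
A round trip can also be tried without touching the file system. The sketch below writes to an in-memory `io.StringIO` buffer and uses `na_rep` to make missing values visible in the output; the `table` frame is made up for this example.

```python
import io

import pandas as pd

table = pd.DataFrame({"Name": ["Alice", "Eve"], "Age": [25.0, None]})

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
table.to_csv(buffer, index=False, sep=";", na_rep="NA")
text = buffer.getvalue()
print(text)

# "NA" is one of the strings read_csv treats as missing by default.
loaded = pd.read_csv(io.StringIO(text), sep=";")
print(loaded)
```

The separator has to match on both sides: reading this text with the default `sep=','` would produce a single column.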

## 10. Joining DataFrames

In [None]:
# Share a column
first_df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 28, 23]
})

second_df = pd.DataFrame({
    'ID': [1, 2, 4],
    'City': ['Vienna', 'Linz', 'Graz']
})

df_inner = pd.merge(first_df, second_df, on='ID', how='inner')  # other options for how: 'left', 'right', 'outer'
df_inner
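
To compare the join types, the sketch below runs the same merge with each value of `how` and records which IDs survive. It uses reduced copies of the two frames above; the `ids_by_how` dictionary is introduced only for this illustration. The last call adds `indicator=True`, which appends a `_merge` column showing where each row came from.

```python
import pandas as pd

first_df = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
second_df = pd.DataFrame({"ID": [1, 2, 4], "City": ["Vienna", "Linz", "Graz"]})

# Collect the IDs present in the result of each join type.
ids_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(first_df, second_df, on="ID", how=how)
    ids_by_how[how] = sorted(merged["ID"].tolist())
    print(how, ids_by_how[how])

# indicator=True marks each row as 'left_only', 'right_only', or 'both'.
outer = pd.merge(first_df, second_df, on="ID", how="outer", indicator=True)
print(outer)
```

Rows without a partner on the other side get missing values in the columns that come from that side.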

In [None]:
# Same columns, different rows: vertical concatenation
a = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 28]})
b = pd.DataFrame({'Name': ['David', 'Eva'], 'Age': [30, 26]})

df_concat = pd.concat([a, b], ignore_index=True)
df_concat

In [None]:
# horizontal concatenation
x = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']})
y = pd.DataFrame({'City': ['Vienna', 'Linz', 'Graz']})
pd.concat([x, y], axis=1)

In [None]:
# Add 'City' Column from y to x
x['City'] = y['City']
x
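
Assigning a column from another DataFrame matches rows by index label, not by position. In the cell above both frames use the default index 0, 1, 2, so the two coincide. The sketch below gives `y` a reversed index (an assumption made for this illustration) to show the difference, and uses `to_numpy()` for a purely positional assignment.

```python
import pandas as pd

x = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"]})
# Same values as above, but with a reversed index.
y = pd.DataFrame({"City": ["Vienna", "Linz", "Graz"]}, index=[2, 1, 0])

# A Series is aligned on index labels before assignment.
x["City"] = y["City"]

# A raw array carries no index, so it is assigned by position.
x["City_by_position"] = y["City"].to_numpy()

print(x)
```

When the indexes do not line up, label alignment can also leave missing values, so it is worth checking the index before combining frames this way.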