# Winter School 2024 - Pandas Tutorial

This notebook introduces the basic features of Pandas for analyzing data sets.

Authors: Christopher Katins, Mario Sänger, Christopher Lazik, Thomas Kosch
Credits to Patrick Schäfer (HU Berlin)

## Pandas

Documentation: [https://pandas.pydata.org/](https://pandas.pydata.org/)

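First, set up a virtual environment and install the required packages. Note that each `!` command runs in its own subshell, so the activation below does not carry over to later cells or to the running kernel; to actually use the virtual environment, run these commands in a terminal before starting Jupyter.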

In [175]:
!python -m venv env_ws_tutorial

In [176]:
!source env_ws_tutorial/bin/activate

In [177]:
!pip install pandas




### Series

A Pandas Series is a one-dimensional, array-like object that can hold data of any type (integers, strings, floats, Python objects, etc.), similar to a column in a spreadsheet or a SQL table. Each element in a Series is associated with a label; together these labels form the index, which allows for fast lookup and label-based data manipulation.

In [None]:
import pandas as pd
from pandas import Series

In [None]:
values = [7, 'Heisenberg', 3.14, -178971, 'Happy!']
s = pd.Series(values)
s

In [None]:
pd.Series(
    [7, 'Heisenberg', 3.14, -178971, 'Happy!'],
    index=['A', 'Z', 'C', 'Y', 'E']
)
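With custom labels in place, elements can be looked up either by label or by position. The following is a minimal sketch that reuses the values from the cell above; the variable names `by_label`, `by_position`, and `subset` are chosen here for illustration only.

```python
import pandas as pd

s = pd.Series(
    [7, 'Heisenberg', 3.14, -178971, 'Happy!'],
    index=['A', 'Z', 'C', 'Y', 'E'],
)

by_label = s.loc['Z']        # look up by index label
by_position = s.iloc[1]      # look up by integer position (0-based)
subset = s.loc[['A', 'E']]   # several labels at once, returns a Series

print(by_label, by_position)
print(subset)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.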

In [None]:
d = {'Chicago': 1000, 
     'New York': 1300,
     'Portland': 900,
     'San Francisco': 1100,
     'Austin': 450,
     'Boston': None} 
cities = pd.Series(d) 
cities

In [None]:
cities[cities < 1000]
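The `None` entry for Boston is stored as a missing value (`NaN`), and comparisons against a missing value evaluate to `False`, so Boston never appears in the filtered result. A small sketch of this behavior, with the illustrative names `missing` and `below` not taken from the notebook:

```python
import pandas as pd

cities = pd.Series({'Chicago': 1000, 'New York': 1300, 'Portland': 900,
                    'San Francisco': 1100, 'Austin': 450, 'Boston': None})

missing = cities.isna()          # True only where the value is missing
below = cities[cities < 1000]    # NaN < 1000 is False, so Boston is dropped

print(missing)
print(below)
```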

### DataFrames

A DataFrame is the primary Pandas data structure for representing two-dimensional, tabular data with labeled axes (rows and columns). DataFrames store real-world data sets and offer a wide range of functionality for data manipulation tasks such as filtering, aggregation, and visualization.

**Creating DataFrames**

In [None]:
data = {
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers',
             'Lions', 'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
    'losses': [5, 8, 6, 1, 5, 10, 6, 12],
}
football = pd.DataFrame(data)
football

In [None]:
football.set_index('year').head()

In [None]:
football.rename(columns={'year': 'season'})
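By default, methods such as `set_index` and `rename` return a new DataFrame and leave the original untouched, so the result has to be assigned to a name to keep it. The sketch below illustrates this with a shortened version of the football data; `renamed` and `original_columns` are names introduced here for illustration.

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears'],
    'wins': [11, 8, 10],
    'losses': [5, 8, 6],
})

renamed = football.rename(columns={'year': 'season'})
original_columns = list(football.columns)   # still contains 'year'

print(original_columns)
print(list(renamed.columns))

# Assign the result back to keep the change
football = football.rename(columns={'year': 'season'})
print(list(football.columns))
```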

**Loading data from a CSV file**

In [None]:
income_data = pd.read_csv("data_eng.csv", sep=",")
income_data

In [None]:
income_data.set_index("Loan_ID", inplace=True)
income_data
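The index column can also be chosen while reading the file, using the `index_col` parameter of `read_csv`. The following sketch reads from an in-memory string instead of `data_eng.csv` so that it runs on its own; the rows and the `ID00x` labels are made up for illustration and are not taken from the actual data set.

```python
import io
import pandas as pd

# Hypothetical sample rows; only the column names follow the notebook
csv_text = """Loan_ID,Gender,Married,ApplicantIncome
ID001,Male,No,5849
ID002,Male,Yes,4583
ID003,Female,No,3000
"""

# Read the data and use Loan_ID as index in a single step
loans = pd.read_csv(io.StringIO(csv_text), index_col="Loan_ID")

# Equivalent two-step variant: read first, then set the index
same = pd.read_csv(io.StringIO(csv_text)).set_index("Loan_ID")

print(loans)
```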

**Accessing / selecting data rows and columns**

In [None]:
# accessing row(s) via index label
income_data.loc["LP001005"]

In [None]:
# accessing a single row via its integer position
income_data.iloc[0]

In [None]:
# accessing multiple rows via a position slice (the end position is excluded)
income_data.iloc[0:5]

In [None]:
# accessing a column
income_data["Married"]

In [None]:
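# accessing a column via attribute notation (works only if the name is a valid Python identifier)
income_data.Gender

In [None]:
# accessing multiple columns
income_data[["Married", "Gender"]]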

In [None]:
# accessing selected columns of the first five rows
income_data.iloc[0:5][["Married", "Gender", "Education"]]
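Rows and columns can also be selected in a single step with `.loc[rows, columns]`. One difference worth keeping in mind is that label slices in `.loc` include the end label, whereas position slices in `.iloc` exclude the end position. The sketch below uses a small made-up DataFrame with hypothetical `ID00x` labels.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the notebook
loans = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female"],
        "Married": ["No", "Yes", "Yes", "No"],
        "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    },
    index=pd.Index(["ID001", "ID002", "ID003", "ID004"], name="Loan_ID"),
)

# Position slice: rows 0 and 1 (end position 2 is excluded)
by_position = loans.iloc[0:2][["Married", "Gender"]]

# Label slice: rows ID001 and ID002 (end label is included)
by_label = loans.loc["ID001":"ID002", ["Married", "Gender"]]

print(by_label)
```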

**Data inspection**

In [None]:
# Get basic information, e.g. number of entries, short column description, of the data frame
income_data.info()

In [None]:
# Show basic descriptive statistics of the numeric columns
income_data.describe()

In [None]:
# Access the value distribution of a column
income_data["Married"].value_counts(normalize=True)

In [None]:
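# Number of distinct values in a column (unique() counts NaN as a value, if present)
len(income_data["Dependents"].unique())

In [None]:
# Most frequent value(s) of a column
income_data["Dependents"].mode()

In [None]:
# Relative frequency of each value
income_data["Dependents"].value_counts(normalize=True)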

In [None]:
# Print common summary statistics of a single column
print(income_data["ApplicantIncome"].min())
print(income_data["ApplicantIncome"].max())
print(income_data["ApplicantIncome"].mean())
print(income_data["ApplicantIncome"].median())
print(income_data["ApplicantIncome"].quantile(0.25))

In [None]:
# Display missing values per column
income_data.isna().sum()
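Since `isna()` yields booleans, taking the mean instead of the sum gives the share of missing values per column, which is often easier to compare across columns. A minimal sketch on made-up data (the values are hypothetical; the column names follow the notebook):

```python
import numpy as np
import pandas as pd

loans = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Credit_History": [1.0, np.nan, np.nan, 0.0],
})

missing_counts = loans.isna().sum()    # absolute number of missing values
missing_share = loans.isna().mean()    # fraction of missing values

print(missing_counts)
print(missing_share)
```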

**Filtering data**

In [None]:
income_data["ApplicantIncome"]>10000

In [None]:
income_data[income_data["ApplicantIncome"]>10000]

In [None]:
income_data[(income_data["ApplicantIncome"]>10000) & (income_data["Self_Employed"] == "Yes")]
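Conditions are combined with the element-wise operators `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The sketch below uses made-up values and stores the masks in the illustrative variables `high` and `self_employed` for readability.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the notebook
loans = pd.DataFrame({
    "ApplicantIncome": [12000, 3000, 15000, 8000],
    "Self_Employed": ["Yes", "No", "No", "Yes"],
})

high = loans["ApplicantIncome"] > 10000
self_employed = loans["Self_Employed"] == "Yes"

both = loans[high & self_employed]        # both conditions hold
either = loans[high | self_employed]      # at least one condition holds
neither = loans[~(high | self_employed)]  # no condition holds

print(both)
print(either)
print(neither)
```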

In [None]:
# Accessing rows having no value in Credit_History
income_data[income_data["Credit_History"].isna()]

In [None]:
# Finding duplicated rows
income_data[income_data.duplicated()]

In [None]:
# Finding duplicated rows restricted to certain columns
income_data[income_data.duplicated(subset=["Gender", "Married", "Education", "Self_Employed"], keep="first")]
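The `keep` parameter controls which members of a group of duplicates are flagged: `keep="first"` flags every occurrence except the first, while `keep=False` flags all of them. A small sketch on made-up rows, also showing `drop_duplicates` as the counterpart for removing them:

```python
import pandas as pd

# Hypothetical sample data; rows 0, 1 and 3 are identical
loans = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "Married": ["Yes", "Yes", "No", "Yes"],
})

later_copies = loans.duplicated(keep="first")  # flags rows 1 and 3
all_copies = loans.duplicated(keep=False)      # flags rows 0, 1 and 3
deduplicated = loans.drop_duplicates()         # keeps rows 0 and 2

print(later_copies)
print(all_copies)
print(deduplicated)
```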


**Grouping data**

In [None]:
income_data.groupby("Gender")["ApplicantIncome"].mean()

In [None]:
income_data.groupby(["Gender", "Education"]).count()
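Note that `count()` reports the number of non-missing values per column within each group, so the numbers can differ between columns; `size()` reports the number of rows per group. Several statistics can be computed at once with `agg`. The sketch below uses made-up values to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names follow the notebook
loans = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "ApplicantIncome": [4000, 6000, 3000, 5000],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})

# Several aggregations of one column per group
summary = loans.groupby("Gender")["ApplicantIncome"].agg(["mean", "median", "max"])

non_null = loans.groupby("Gender").count()   # non-missing values per column
group_sizes = loans.groupby("Gender").size() # rows per group

print(summary)
print(non_null)
print(group_sizes)
```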

**Sorting data**

In [None]:
income_data.sort_values("ApplicantIncome")

In [None]:
income_data.sort_values(["Loan_Amount_Term", "ApplicantIncome"], ascending=[False, True])
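Rows with a missing sort key are placed at the end by default, regardless of the sort direction; the `na_position` parameter moves them to the front. Like most DataFrame methods, `sort_values` returns a new, sorted DataFrame. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names follow the notebook
loans = pd.DataFrame({
    "Loan_Amount_Term": [360.0, np.nan, 180.0, 360.0],
    "ApplicantIncome": [5000, 4000, 6000, 3000],
})

# Term descending, ties broken by income ascending; NaN rows come last
nan_last = loans.sort_values(
    ["Loan_Amount_Term", "ApplicantIncome"], ascending=[False, True]
)

# Same ordering, but rows with a missing term come first
nan_first = loans.sort_values(
    ["Loan_Amount_Term", "ApplicantIncome"],
    ascending=[False, True],
    na_position="first",
)

print(nan_last)
print(nan_first)
```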

**Transforming data**

In [None]:
import math
# Natural logarithm of each income; note that math.log raises a ValueError for non-positive values
income_data["log_income"] = income_data["ApplicantIncome"].apply(math.log)
income_data[["ApplicantIncome", "log_income"]]
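For element-wise numeric transformations, NumPy functions can be applied to the whole column at once, which is usually faster than calling a Python function per element with `apply`. The following sketch compares both variants on a few made-up incomes; it assumes NumPy is available, which is the case whenever Pandas is installed.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical income values for illustration
income = pd.Series([1000, 2500, 40000], name="ApplicantIncome")

via_apply = income.apply(math.log)  # one Python call per element
via_numpy = np.log(income)          # vectorized, returns a Series

print(via_numpy)
```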

In [None]:
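def concat(row: Series) -> str:
    # Combine two columns of a row into one "<Gender>-<Married>" label; str() also copes with missing values
    return str(row["Gender"]) + "-" + str(row["Married"])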

income_data["gender_married"] = income_data.apply(concat, axis=1)
income_data["gender_married"]
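When both columns contain only strings, the same result can be obtained without a row-wise `apply` by concatenating the columns directly. The sketch below assumes there are no missing values in either column (with missing values, the result of the direct concatenation is missing as well, unlike the `str()` based version); the sample values are made up.

```python
import pandas as pd

# Hypothetical sample data without missing values
loans = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Married": ["Yes", "No"],
})

# Row-wise variant, as in the cell above
via_apply = loans.apply(
    lambda row: str(row["Gender"]) + "-" + str(row["Married"]), axis=1
)

# Column-wise variant: string columns support + directly
vectorized = loans["Gender"] + "-" + loans["Married"]

print(vectorized)
```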