# Pandas Filtering

Video on this topic: https://youtu.be/Lw2rlcxScZY

## <span style="color:#7030A0">Our Setup</span>

In [2]:
import pandas as pd

In [5]:
# Prepare sample data
people = {
    'first': ['Gordon', 'Morgan', 'Naruto'],
    'last':  ['Freeman', 'Freeman', 'Uzumaki'],
    'email': ['gordonfreeman@valve.com', 'morganFreeman@email.com', 'narutoUzumaki@carlsen.com']
}

In [6]:
people_df = pd.DataFrame(people)

In [7]:
# Prepare the Stack Overflow survey data
survey_df = pd.read_csv('../data/survey_results_public.csv', index_col='Respondent')
schema_df = pd.read_csv('../data/survey_results_schema.csv', index_col='Column')
pd.set_option('display.max_columns', 85) # display up to 85 columns before pandas truncates the output
pd.set_option('display.max_rows', 85)    # display up to 85 rows before pandas truncates the output

In [7]:
people_df

|   | first | last | email |
|---|---|---|---|
| 0 | Gordon | Freeman | gordonfreeman@valve.com |
| 1 | Morgan | Freeman | morganFreeman@email.com |
| 2 | Naruto | Uzumaki | narutoUzumaki@carlsen.com |


## <span style="color:#7030A0">Filter</span>
Filtering is a core part of working with the pandas library, because we usually want to separate the data we need from the data we do not need.
Suppose we want all people whose last name is "Freeman". The comparison itself does not return a DataFrame; it returns a Series with one True or False value per row.

In [11]:
filt = (people_df["last"] == "Freeman")
filt

0     True
1     True
2    False
Name: last, dtype: bool
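
Because the filter is an ordinary boolean Series, it can be inspected before it is applied. The following is a minimal sketch, assuming the same sample data as in the setup; the names `mask`, `same_index` and `n_matches` are only chosen for illustration.

```python
import pandas as pd

# Same sample data as in the setup (email column omitted for brevity)
people_df = pd.DataFrame({
    "first": ["Gordon", "Morgan", "Naruto"],
    "last": ["Freeman", "Freeman", "Uzumaki"],
})

mask = people_df["last"] == "Freeman"

# The mask shares the DataFrame's index, so every flag belongs to exactly one row
same_index = mask.index.equals(people_df.index)

# True counts as 1, so sum() gives the number of matching rows
n_matches = int(mask.sum())

print(same_index, n_matches)
```

Counting the matches first is a cheap sanity check before filtering a large DataFrame.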

Now we can apply this filter of True and False values to our DataFrame as a "mask". Only the rows where the mask is True are kept:

In [12]:
people_df[filt]

|   | first | last | email |
|---|---|---|---|
| 0 | Gordon | Freeman | gordonfreeman@valve.com |
| 1 | Morgan | Freeman | morganFreeman@email.com |


Alternative notation using .loc:

In [14]:
people_df.loc[filt]

|   | first | last | email |
|---|---|---|---|
| 0 | Gordon | Freeman | gordonfreeman@valve.com |
| 1 | Morgan | Freeman | morganFreeman@email.com |


The second notation is preferable, because .loc lets us select columns in the same step:

In [15]:
people_df.loc[filt, "email"]

0    gordonfreeman@valve.com
1    morganFreeman@email.com
Name: email, dtype: object
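
The column argument of .loc accepts either a single label or a list of labels. As a small sketch on the same sample data (the variable names `emails` and `names` are illustrative), a single label yields a Series, while a list yields a DataFrame:

```python
import pandas as pd

people_df = pd.DataFrame({
    "first": ["Gordon", "Morgan", "Naruto"],
    "last": ["Freeman", "Freeman", "Uzumaki"],
    "email": ["gordonfreeman@valve.com", "morganFreeman@email.com",
              "narutoUzumaki@carlsen.com"],
})
filt = people_df["last"] == "Freeman"

# A single column label returns a Series
emails = people_df.loc[filt, "email"]

# A list of column labels returns a DataFrame
names = people_df.loc[filt, ["first", "last"]]

print(type(emails).__name__, type(names).__name__)
```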

In [26]:
filt = (people_df["last"] == "Freeman") & (people_df["first"] == "Gordon") # AND operator
people_df.loc[filt, "email"]

0    gordonfreeman@valve.com
Name: email, dtype: object

In [25]:
filt = (people_df["last"] == "Uzumaki") | (people_df["first"] == "Gordon") # OR operator
people_df.loc[filt, "email"]

0      gordonfreeman@valve.com
2    narutoUzumaki@carlsen.com
Name: email, dtype: object

In [24]:
people_df.loc[~filt, "email"] # negation: ~ inverts a boolean mask

1    morganFreeman@email.com
Name: email, dtype: object
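
The three operators can be compared side by side. The sketch below assumes the same sample data; the intermediate names (`is_freeman`, `is_gordon`, and so on) are illustrative. Each comparison is wrapped in parentheses in the cells above because `&` and `|` bind more tightly than `==` in Python. The sketch also shows that Python's keyword `and` is not a substitute for `&`: it asks for a single truth value of the whole Series, which pandas rejects with a ValueError.

```python
import pandas as pd

people_df = pd.DataFrame({
    "first": ["Gordon", "Morgan", "Naruto"],
    "last": ["Freeman", "Freeman", "Uzumaki"],
})

is_freeman = people_df["last"] == "Freeman"
is_gordon = people_df["first"] == "Gordon"

both = is_freeman & is_gordon      # AND: both conditions hold
either = is_freeman | is_gordon    # OR: at least one condition holds
neither = ~either                  # NOT: invert the mask

# The keyword `and` needs one truth value per operand, which a Series cannot give
try:
    is_freeman and is_gordon
    raised = False
except ValueError:
    raised = True

print(both.tolist(), either.tolist(), neither.tolist(), raised)
```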

## <span style="color:#7030A0">Usage Examples</span>

In [35]:
high_salary_filt = (survey_df["ConvertedComp"] > 70000)
survey_df.loc[high_salary_filt, ["Country", "LanguageWorkedWith", "ConvertedComp"]]

| Respondent | Country | LanguageWorkedWith | ConvertedComp |
|---|---|---|---|
| 6 | Canada | Java;R;SQL | 366420.0 |
| 9 | New Zealand | Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;P... | 95179.0 |
| 13 | United States | Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;... | 90000.0 |
| 16 | United Kingdom | Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;T... | 455352.0 |
| 22 | United States | Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;... | 103000.0 |
| ... | ... | ... | ... |
| 88876 | United States | Bash/Shell/PowerShell;C#;HTML/CSS;Java;Python;... | 180000.0 |
| 88877 | United States | Bash/Shell/PowerShell;C;Clojure;HTML/CSS;Java;... | 2000000.0 |
| 88878 | United States | HTML/CSS;JavaScript;Scala;TypeScript | 130000.0 |
| 88879 | Finland | Bash/Shell/PowerShell;C++;Python | 82488.0 |
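
Survey columns such as ConvertedComp can contain missing values. The following sketch uses a hypothetical miniature DataFrame (`mini_df`, with made-up values) to show that a comparison against NaN evaluates to False, so rows without a salary simply drop out of the result:

```python
import pandas as pd

# Hypothetical miniature of the survey data; the values are made up
mini_df = pd.DataFrame(
    {
        "Country": ["Canada", "Germany", "India", "Finland"],
        "ConvertedComp": [366420.0, float("nan"), 12000.0, 82488.0],
    },
    index=pd.Index([1, 2, 3, 4], name="Respondent"),
)

# NaN > 70000 evaluates to False, so respondent 2 is not selected
high_salary_filt = mini_df["ConvertedComp"] > 70000
high_earners = mini_df.loc[high_salary_filt, ["Country", "ConvertedComp"]]

print(high_earners)
```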


In [39]:
countries = ["United States", "India", "United Kingdom", "Germany", "Canada"]
filt = survey_df["Country"].isin(countries)
survey_df.loc[filt, "Country"]

Respondent
1        United Kingdom
4         United States
6                Canada
8                 India
10                India
              ...      
85642     United States
85961    United Kingdom
86012             India
88282     United States
88377            Canada
Name: Country, Length: 45008, dtype: object
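
isin() saves us from chaining many == comparisons with |. As a rough sketch on a hypothetical four-row sample (`mini_df` is made up; only the column name matches the survey data), the mask from isin() matches a hand-written chain, and combining it with ~ selects every country that is not in the list:

```python
import pandas as pd

# Hypothetical sample; only the column name matches the survey data
mini_df = pd.DataFrame({"Country": ["India", "France", "Canada", "Brazil"]})

countries = ["United States", "India", "United Kingdom", "Germany", "Canada"]
filt = mini_df["Country"].isin(countries)

# On this sample, the same mask written by hand with == and |
manual = (mini_df["Country"] == "India") | (mini_df["Country"] == "Canada")

selected = mini_df.loc[filt, "Country"]
excluded = mini_df.loc[~filt, "Country"]

print(selected.tolist(), excluded.tolist())
```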

In [40]:
survey_df["LanguageWorkedWith"]

Respondent
1                          HTML/CSS;Java;JavaScript;Python
2                                      C++;HTML/CSS;Python
3                                                 HTML/CSS
4                                      C;C++;C#;Python;SQL
5              C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
                               ...                        
88377                        HTML/CSS;JavaScript;Other(s):
88601                                                  NaN
88802                                                  NaN
88816                                                  NaN
88863    Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...
Name: LanguageWorkedWith, Length: 88883, dtype: object

In [4]:
filt = survey_df["LanguageWorkedWith"].str.contains("Python", na=False)
survey_df.loc[filt, "LanguageWorkedWith"]

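
The string filter above relies on two details that are easier to see on a small example. The sketch below uses a hypothetical Series (`langs`) in the same semicolon-separated format. The argument na=False treats missing answers as "no match", so the result is a purely boolean mask. The regex=False argument, which is an addition not used in the cell above, makes the search literal, which is the safer choice for names that contain regex metacharacters such as "C++".

```python
import pandas as pd

# Hypothetical sample in the same semicolon-separated format as the survey column
langs = pd.Series(
    ["HTML/CSS;Java;JavaScript;Python", "C++;HTML/CSS", None, "JavaScript;SQL"],
    name="LanguageWorkedWith",
)

# na=False turns the missing answer into False instead of a missing value
uses_python = langs.str.contains("Python", na=False)

# regex=False searches for the literal text "C++"
uses_cpp = langs.str.contains("C++", regex=False, na=False)

# Substring matching is not exact: "Java" also matches "JavaScript"
mentions_java = langs.str.contains("Java", na=False)

print(uses_python.tolist(), uses_cpp.tolist(), mentions_java.tolist())
```

The last mask is worth keeping in mind when filtering for a language whose name is a prefix of another one.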