In [1]:
# Enable code formatting using external plugin: nb_black.
%reload_ext nb_black


# Pandas Tutorial - PART 3

**Ref: [Pandas Tutorials][1] by [Corey Schafer][2]**

[1]: https://youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS
[2]: https://coreyms.com/

#### Load and configure `pandas` library

In [2]:
import pandas as pd

print("Pandas version:", pd.__version__)

# Limit output to 130 characters per line; longer output wraps onto the next line.
pd.options.display.width = 130

Pandas version: 1.3.4



#### Load _Stackoverflow_ data from csv file

In [3]:
df = pd.read_csv("./data/Stackoverflow/survey_results_public.csv", index_col="Respondent")

df.head(2)

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult



#### Load schema for the columns in _Stackoverflow_ data

Every column in the _Stackoverflow_ csv file has a matching row in the _Schema_ csv file. The _Schema_ csv has two columns, `Column` and `QuestionText`; `Column` is used as the index of the _Schema_ `DataFrame`.

In [4]:
# Load schema from csv file.
sdf = pd.read_csv("./data/Stackoverflow/survey_results_schema.csv", index_col="Column")

sdf.head(3)

Column,QuestionText
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?



Use the _Schema_ `DataFrame` (`sdf`) to understand some of the columns in the _Stackoverflow_ `DataFrame`.

In [5]:
df.columns

Index(['MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource', 'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode', 'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot',
       'MgrMoney', 'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz', 'JobFactors', 'ResumeUpdate', 'CurrencySymbol',
       'CurrencyDesc', 'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan', 'WorkChallenge', 'WorkRemote',
       'WorkLoc', 'ImpSyn', 'CodeRev', 'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat', 'LanguageWorkedWith',
       'LanguageDesireNextYear', 'DatabaseWorkedWith', 'DatabaseDesireNextYear', 'PlatformWorkedWith', 'PlatformDesireNextYear',
       'WebFrameWorkedWith', 'WebFrameDesireNextYear', 'MiscTechWorkedWith', 'MiscTechDesireNextYear', 'DevEnviron', 'OpSys',
       'Containers', 'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn', 'SocialMedia', 'Extraversion',
       'S


Let's check what the _MgrIdiot_ and _ImpSyn_ columns in the _Stackoverflow_ `DataFrame` mean.

In [6]:
sdf.loc[["MgrIdiot", "ImpSyn"], "QuestionText"]  # Also works but the `QuestionText` column is not displayed fully.

# Instead print separately to fully display the text.
print("MgrIdiot :-", sdf.loc["MgrIdiot", "QuestionText"])
print("ImpSyn :-", sdf.loc["ImpSyn", "QuestionText"])

MgrIdiot :- How confident are you that your manager knows what they’re doing?
ImpSyn :- For the specific work you do, and the years of experience you have, how do you rate your own level of competence?



### Analyze performance of `iloc` and `loc` indexers

#### Accessing data in a `DataFrame` with and without `loc`: which is faster?

In [7]:
fltr = df["Hobbyist"] == "Yes"

%timeit df["Gender"][fltr]
%timeit df.loc[fltr, "Gender"]

592 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



> Note: Timings will differ on your system depending on its specification.

1. It takes approximately **592 µs** to filter records **without `loc`**.
2. It takes approximately **1 ms** to filter records **using `loc`**.

**In this run, `loc` takes roughly 1.7 times as long to filter the records.**
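
The timing only compares speed, so it is worth confirming that both spellings return the same data. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not survey data) to check that.

```python
import pandas as pd

# Toy data standing in for the survey; the values are made up for illustration.
toy = pd.DataFrame(
    {"Hobbyist": ["Yes", "No", "Yes"], "Gender": ["Man", "Woman", "Woman"]},
    index=pd.Index([1, 2, 3], name="Respondent"),
)
mask = toy["Hobbyist"] == "Yes"

chained = toy["Gender"][mask]      # select the column first, then apply the mask
via_loc = toy.loc[mask, "Gender"]  # select rows and column in a single call

# Both spellings pick the same rows and values.
print(chained.equals(via_loc))
```

For reading data either form works. For assigning values, the single `loc` call is the safer choice, because chained indexing may write to a temporary copy and trigger a `SettingWithCopyWarning`.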

#### Accessing data in a `DataFrame` with and without `iloc`: which is faster?

In [8]:
import numpy as np

fltr = df["Hobbyist"] == "Yes"

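print("77th column is:", df.columns[77])

# Convert Respondent labels to 0-based row positions.
# This assumes the Respondent IDs run 1..N with no gaps.
fltd_idx = np.array(df["Gender"][fltr].index) - 1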

%timeit df["Gender"][fltr]
%timeit df.iloc[fltd_idx, 77]

77th column is: Gender
593 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
920 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



> Note: Timings will differ on your system depending on its specification.

1. It takes approximately **593 µs** to filter records **without `iloc`**.
2. It takes approximately **920 µs** to filter records **using `iloc`**.

**In this run, `iloc` takes roughly 1.5 times as long to filter the records.**
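
Subtracting 1 from the `Respondent` labels only yields valid row positions when the labels are exactly 1..N. A sketch of an approach that does not depend on the labels, assuming a made-up frame `toy` with a deliberately gapped index, is to derive positions from the boolean mask itself:

```python
import numpy as np
import pandas as pd

# Toy data with a non-contiguous index; the values are made up for illustration.
toy = pd.DataFrame(
    {"Hobbyist": ["Yes", "No", "Yes"], "Gender": ["Man", "Woman", "Woman"]},
    index=pd.Index([10, 20, 40], name="Respondent"),
)
mask = toy["Hobbyist"] == "Yes"

row_pos = np.flatnonzero(mask.to_numpy())  # 0-based positions of the True entries
col_pos = toy.columns.get_loc("Gender")    # 0-based position of the column

by_position = toy.iloc[row_pos, col_pos]
by_label = toy.loc[mask, "Gender"]

# Position-based and label-based selection agree.
print(by_position.equals(by_label))
```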

### Filtering data

In [9]:
fltr = df["Hobbyist"] == "Yes"


In [10]:
# df[fltr[:4]]  # Does not work: a boolean mask must cover every row of `df`.
df[fltr].head(4)

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy
6,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Canada,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,28.0,Man,No,Straight / Heterosexual,East Asian,No,Too long,Neither easy nor difficult



In [11]:
df.loc[fltr, ["Student", "OpenSourcer"]].head(3)

Respondent,Student,OpenSourcer
1,No,Never
3,No,Never
5,No,Once a month or more often



Respondents older than 5 and younger than 14 who code as a hobby.

In [12]:
fltr = (df["Hobbyist"] == "Yes") & (df["Age"] > 5) & (df["Age"] < 14)
df.loc[fltr, ["Country", "Age"]]

Respondent,Country,Age
204,China,12.0
673,Turkey,13.0
2517,Bosnia and Herzegovina,12.0
3089,France,12.0
3306,United States,12.0
...,...,...
88201,United States,12.0
88450,Spain,13.0
88621,Canada,12.0
14724,United Kingdom,12.0
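
The two age comparisons in the filter above can also be expressed with `Series.between`. The sketch below uses a made-up frame `toy` (an assumption for illustration) and checks that both forms give the same mask. String values for `inclusive` need pandas 1.3 or later.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame(
    {"Hobbyist": ["Yes", "Yes", "No", "Yes"], "Age": [12.0, 14.0, 13.0, None]}
)

# Each comparison needs its own parentheses because & binds tighter than > or ==.
explicit = (toy["Hobbyist"] == "Yes") & (toy["Age"] > 5) & (toy["Age"] < 14)

# inclusive="neither" excludes both endpoints, matching > 5 and < 14.
compact = (toy["Hobbyist"] == "Yes") & toy["Age"].between(5, 14, inclusive="neither")

print(explicit.equals(compact))
```

Missing ages compare as `False` in both forms, so respondents without an age are left out either way.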



In [13]:
fltr = (df["Hobbyist"] == "Yes") | (df["OpenSourcer"] != "Never")


In [14]:
df.loc[fltr, ["Hobbyist", "OpenSourcer"]]

Respondent,Hobbyist,OpenSourcer
1,Yes,Never
2,No,Less than once per year
3,Yes,Never
5,Yes,Once a month or more often
6,Yes,Never
...,...,...
88182,Yes,Once a month or more often
88212,No,Less than once per year
88282,Yes,Once a month or more often
88377,Yes,Less than once a month but more than once per ...



In [15]:
df.loc[~fltr, ["Hobbyist", "OpenSourcer"]]

Respondent,Hobbyist,OpenSourcer
4,No,Never
7,No,Never
12,No,Never
20,No,Never
25,No,Never
...,...,...
88062,No,Never
88076,No,Never
88601,No,Never
88802,No,Never
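
Negating an "or" filter with `~` keeps only the rows where both conditions fail, which is why every row above shows `No` and `Never`. A small sketch on made-up data (`toy` is an assumption for illustration) makes the equivalence explicit:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Hobbyist": ["Yes", "No", "No"],
        "OpenSourcer": ["Never", "Never", "Less than once per year"],
    }
)

either = (toy["Hobbyist"] == "Yes") | (toy["OpenSourcer"] != "Never")
# De Morgan: not (A or B) is the same as (not A) and (not B).
neither = (toy["Hobbyist"] != "Yes") & (toy["OpenSourcer"] == "Never")

print((~either).equals(neither))
```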



In [16]:
df["Hobbyist"].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64


In [17]:
df["ConvertedComp"].max()

2000000.0


In [18]:
high_salary = df["ConvertedComp"] > 70000
df.loc[high_salary, ["Country", "LanguageWorkedWith", "ConvertedComp"]]

Respondent,Country,LanguageWorkedWith,ConvertedComp
6,Canada,Java;R;SQL,366420.0
9,New Zealand,Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;P...,95179.0
13,United States,Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;...,90000.0
16,United Kingdom,Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;T...,455352.0
22,United States,Bash/Shell/PowerShell;C++;HTML/CSS;JavaScript;...,103000.0
...,...,...,...
88876,United States,Bash/Shell/PowerShell;C#;HTML/CSS;Java;Python;...,180000.0
88877,United States,Bash/Shell/PowerShell;C;Clojure;HTML/CSS;Java;...,2000000.0
88878,United States,HTML/CSS;JavaScript;Scala;TypeScript,130000.0
88879,Finland,Bash/Shell/PowerShell;C++;Python,82488.0



In [19]:
countries = ["United States", "India", "United Kingdom", "Germany", "Canada"]
fltr = df["Country"].isin(countries)

df.loc[fltr, "Country"].value_counts()

United States     20949
India              9061
Germany            5866
United Kingdom     5737
Canada             3395
Name: Country, dtype: int64
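
`isin` is shorthand for a chain of `==` comparisons joined with `|`. The sketch below, on a made-up `country` Series (an assumption for illustration), checks that equivalence and shows that `~` selects everything outside the list.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
country = pd.Series(["India", "Canada", "Brazil", None], name="Country")
wanted = ["India", "Canada"]

with_isin = country.isin(wanted)
with_or = (country == "India") | (country == "Canada")

print(with_isin.equals(with_or))
# Rows outside the list, including the missing value.
print(country[~with_isin])
```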


In [20]:
fltr = df["LanguageWorkedWith"].str.contains("Python", na=False)
df.loc[fltr, "Country"].value_counts()

United States                            10083
India                                     3105
Germany                                   2451
United Kingdom                            2384
Canada                                    1558
                                         ...  
Timor-Leste                                  1
Mali                                         1
Sierra Leone                                 1
Liechtenstein                                1
Democratic People's Republic of Korea        1
Name: Country, Length: 162, dtype: int64
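
`str.contains` matches substrings, and `na=False` turns rows with a missing `LanguageWorkedWith` into `False` so the result is usable as a boolean mask. Substring matching is fine for "Python", but a shorter name such as "Java" would also match "JavaScript". A sketch of an exact-match alternative, assuming a made-up `langs` Series with `;`-separated entries like the survey column:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
langs = pd.Series(
    ["Python;SQL", "JavaScript", None, "Java;Python"], name="LanguageWorkedWith"
)

# Substring match: "JavaScript" counts as a hit for "Java".
substring = langs.str.contains("Java", na=False)

# Exact match: split on ";" and test membership; missing rows become False.
exact = langs.str.split(";").apply(
    lambda items: isinstance(items, list) and "Java" in items
)

print(substring.tolist())
print(exact.tolist())
```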
