In [1]:
import pandas as pd

Previously, we looked at how we can load CSV files using Pandas' DataFrames. But what exactly is a DataFrame?

Let's start by looking at how we could create something similar to a DataFrame without using Pandas, using only plain Python:

In [8]:
person = {'first_name': 'Adam',
         'last_name': 'Smith',
         'email': 'AdamSmith@gmail.com'}

In the above dictionary, we store pieces of information (the dictionary values) under unique keywords (the dictionary keys). This resembles a single row of a DataFrame: each key corresponds to a column name, and each value is that row's entry in the column.

If we want to expand this example to include multiple rows, we could store multiple values under one key in a Python dictionary, using lists:

In [21]:
people = {'First name': ['Adam', 'Ahmed', 'Jake'],
         'Last name': ['Smith', 'Badr', 'Doe'],
         'Email': ['adamsmith@gmail.com', 'ahmedbadr@notreally.com', 'jakedoe@whydoyouevencaretoreadthis.org']}

In this example, we can access the value(s) stored within a key by calling that specific key, e.g.:

In [22]:
people['Email']

['adamsmith@gmail.com', 'ahmedbadr@notreally.com', 'jakedoe@whydoyouevencaretoreadthis.org']

A DataFrame is similar to the dictionary above, except it provides more functionality, as we're going to explore.
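
To get a feel for what the plain dictionary lacks, here is a minimal sketch that pulls one "row" out of a dictionary of lists. It uses a shortened copy of the people dictionary so that it runs on its own; the names row_index and second_row are only illustrative.

```python
# Same structure as the `people` dictionary above: one list per key.
people = {'First name': ['Adam', 'Ahmed', 'Jake'],
          'Last name': ['Smith', 'Badr', 'Doe']}

# A "column" is a single lookup...
last_names = people['Last name']

# ...but a "row" has to be assembled by hand, one key at a time.
row_index = 1
second_row = {key: values[row_index] for key, values in people.items()}

print(last_names)   # ['Smith', 'Badr', 'Doe']
print(second_row)   # {'First name': 'Ahmed', 'Last name': 'Badr'}
```

Nothing in the dictionary itself guarantees that all lists have the same length, so this kind of row lookup is something we would have to check and maintain ourselves.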

We can actually turn the dictionary above into a DataFrame, using:

In [17]:
people_df = pd.DataFrame(people)

In [18]:
people_df

Unnamed: 0,First name,Last name,Email
0,Adam,Smith,adamsmith@gmail.com
1,Ahmed,Badr,ahmedbadr@notreally.com
2,Jake,Doe,jakedoe@whydoyouevencaretoreadthis.org
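
As a quick check of what the conversion did, the following sketch rebuilds the same DataFrame and inspects its row labels and its shape. The dictionary is repeated here only so that the snippet is self-contained.

```python
import pandas as pd

people = {'First name': ['Adam', 'Ahmed', 'Jake'],
          'Last name': ['Smith', 'Badr', 'Doe'],
          'Email': ['adamsmith@gmail.com', 'ahmedbadr@notreally.com',
                    'jakedoe@whydoyouevencaretoreadthis.org']}

# Each key becomes a column; each position in the lists becomes a row.
people_df = pd.DataFrame(people)

print(list(people_df.index))  # [0, 1, 2]  (row labels added by Pandas)
print(people_df.shape)        # (3, 3)     (rows, columns)
```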


Similar to how one accesses the value(s) stored within a dictionary key, we can also call a column in Pandas using similar syntax.

To view the 'Email' column in our DataFrame, we use: 

In [23]:
people_df['Email']

0                       adamsmith@gmail.com
1                   ahmedbadr@notreally.com
2    jakedoe@whydoyouevencaretoreadthis.org
Name: Email, dtype: object

An alternative way to access the same column is using the dot notation, as in:

In [25]:
people_df.Email

0                       adamsmith@gmail.com
1                   ahmedbadr@notreally.com
2    jakedoe@whydoyouevencaretoreadthis.org
Name: Email, dtype: object

Note: Although both the bracket notation and the dot notation are viable options for accessing columns in Pandas, it is highly recommended that you stick to the bracket notation, as it avoids the errors that can arise when a column shares its name with a DataFrame method or attribute.

The bracket notation is also able to access columns with spaces in their names (such as 'First name'), which the dot notation cannot do, so using it keeps the syntax within a project consistent.
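
The name clash described in the note can be seen in a small sketch. The DataFrame below is hypothetical and was chosen only to cause trouble: one column is called 'count', which is also the name of a DataFrame method, and another contains a space.

```python
import pandas as pd

# Hypothetical columns picked to clash with dot notation.
df = pd.DataFrame({'First name': ['Adam', 'Ahmed'],
                   'count': [3, 5]})

# Bracket notation always returns the column.
print(df['count'].tolist())       # [3, 5]
print(df['First name'].tolist())  # ['Adam', 'Ahmed']

# Dot notation finds the DataFrame.count method instead of the column.
print(callable(df.count))         # True

# df.First name  is not even valid Python (SyntaxError).
```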

**Note: Although the above example shows there are similarities between a DataFrame and a Python dictionary, we must remember that a DataFrame is much more functional than a dictionary. It does consist of rows and columns, but it is not simply a Python dictionary.**

Let's explore what a DataFrame consists of, by examining the data type of the single column we just pulled:

In [24]:
type(people_df['Email'])

pandas.core.series.Series

A 'Series' is a one-dimensional, labeled array of data. A DataFrame is a collection/container of Series objects that share the same index, where each Series represents a column.
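
To illustrate the container idea, here is a minimal sketch that goes the other way: it starts from two separate Series and places them side by side with pd.concat to obtain a DataFrame. The variable names are only illustrative.

```python
import pandas as pd

first_names = pd.Series(['Adam', 'Ahmed', 'Jake'], name='First name')
last_names = pd.Series(['Smith', 'Badr', 'Doe'], name='Last name')

# axis=1 places the Series next to each other as columns;
# the Series names become the column names.
df = pd.concat([first_names, last_names], axis=1)

print(type(df).__name__)               # DataFrame
print(type(df['Last name']).__name__)  # Series
print(df['Last name'].tolist())        # ['Smith', 'Badr', 'Doe']
```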

Now that we know how to access a single column, let's look at how we access multiple columns at once.

We use the bracket notation for that as well, and we pass **a list** of the columns we want to call/view, in order:

In [26]:
people_df[['Email', 'First name']]

Unnamed: 0,Email,First name
0,adamsmith@gmail.com,Adam
1,ahmedbadr@notreally.com,Ahmed
2,jakedoe@whydoyouevencaretoreadthis.org,Jake


Note: The above table is no longer a Series, as a Series is a single column. Instead, the above table is a filtered DataFrame.

Note: The order in which we list the columns determines how they're ordered in the resulting DataFrame.
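
The difference between single and double brackets can be checked directly. The sketch below rebuilds a shortened version of people_df so that it runs on its own.

```python
import pandas as pd

people_df = pd.DataFrame({'First name': ['Adam', 'Ahmed'],
                          'Last name': ['Smith', 'Badr'],
                          'Email': ['adamsmith@gmail.com',
                                    'ahmedbadr@notreally.com']})

one_column = people_df['Email']          # a string -> Series
one_column_frame = people_df[['Email']]  # a list   -> DataFrame, even with one name
reordered = people_df[['Email', 'First name']]

print(type(one_column).__name__)        # Series
print(type(one_column_frame).__name__)  # DataFrame
print(list(reordered.columns))          # ['Email', 'First name']
```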

We can view all column names at once, by using:

In [27]:
people_df.columns

Index(['First name', 'Last name', 'Email'], dtype='object')

Now that we've seen how we can call a single column or multiple columns, let's look at calling specific rows.

For that, we can use either the **loc**ation indexer, loc, or the **i**nteger **loc**ation indexer, iloc. They are often loosely called functions, but note that they are used with square brackets rather than parentheses.


For the few simple uses we introduce in this notebook, both functions will look very similar. This will change in the next notebook as we explore indexing in detail.

To call a single row using iloc:

In [31]:
people_df.iloc[2]

First name                                      Jake
Last name                                        Doe
Email         jakedoe@whydoyouevencaretoreadthis.org
Name: 2, dtype: object

As we see in the above example, calling row number 2 returns the 3rd row, as the numbering convention starts with 0. We receive, therefore, a Series consisting of all columns of the row with the index '2'.

Note: The index of the Series returned in the example above consists of the DataFrame's column names.
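
A short sketch makes this visible: the row comes back as a Series whose index holds the column names and whose name is the row's label. The DataFrame is rebuilt here with two columns only so that the snippet is self-contained.

```python
import pandas as pd

people_df = pd.DataFrame({'First name': ['Adam', 'Ahmed', 'Jake'],
                          'Last name': ['Smith', 'Badr', 'Doe']})

row = people_df.iloc[2]

print(list(row.index))   # ['First name', 'Last name']
print(row.name)          # 2
print(row['Last name'])  # Doe
```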

Much like we did with columns, we can access multiple rows at once by passing **a list** of indices in the desired order to iloc:

In [33]:
people_df.iloc[[2, 0]]

Unnamed: 0,First name,Last name,Email
2,Jake,Doe,jakedoe@whydoyouevencaretoreadthis.org
0,Adam,Smith,adamsmith@gmail.com


Once again, we receive a filtered DataFrame with the rows we called, in the order we called them.

With loc and iloc, we can also specify the columns we would like to call, by passing them as a second argument after the list of rows. For example:

In [34]:
people_df.iloc[[1, 0], [1, 0]]

Unnamed: 0,Last name,First name
1,Badr,Ahmed
0,Smith,Adam


Once again, we get a filtered DataFrame with the rows and columns we called, and in the order we specified. If we call a single column in our 2nd argument, we receive a Series object with the required values:

In [45]:
people_df.iloc[[0, 1], 1]

0    Smith
1     Badr
Name: Last name, dtype: object
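
What comes back depends on whether each argument is a single integer or a list. The following sketch, using a shortened copy of people_df, shows the three cases side by side.

```python
import pandas as pd

people_df = pd.DataFrame({'First name': ['Adam', 'Ahmed', 'Jake'],
                          'Last name': ['Smith', 'Badr', 'Doe']})

cell = people_df.iloc[0, 1]               # integer, integer -> a single value
column_part = people_df.iloc[[0, 1], 1]   # list, integer    -> Series
frame_part = people_df.iloc[[0, 1], [1]]  # list, list       -> DataFrame

print(cell)                         # Smith
print(type(column_part).__name__)   # Series
print(type(frame_part).__name__)    # DataFrame
```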

Note: With **iloc**, we can **only** use **integers** (positions counted from 0) to specify the rows and columns we want to call.
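
If we only know a column's name, one way to still use iloc is to look up the column's position first. This is a minimal sketch using Index.get_loc on a shortened copy of people_df; the variable email_pos is only illustrative.

```python
import pandas as pd

people_df = pd.DataFrame({'First name': ['Adam', 'Ahmed'],
                          'Last name': ['Smith', 'Badr'],
                          'Email': ['adamsmith@gmail.com',
                                    'ahmedbadr@notreally.com']})

# Translate the column label into its integer position.
email_pos = people_df.columns.get_loc('Email')

print(email_pos)                     # 2
print(people_df.iloc[0, email_pos])  # adamsmith@gmail.com
```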

To call a single row using loc:

In [37]:
people_df.loc[0]

First name                   Adam
Last name                   Smith
Email         adamsmith@gmail.com
Name: 0, dtype: object

To access multiple rows at once with loc:

In [40]:
people_df.loc[[2, 1]]

Unnamed: 0,First name,Last name,Email
2,Jake,Doe,jakedoe@whydoyouevencaretoreadthis.org
1,Ahmed,Badr,ahmedbadr@notreally.com


With the loc function, we can specify the columns we would like to call, by passing **strings** as a second argument. For example:

In [41]:
people_df.loc[[1, 0], ['First name', 'Email']]

Unnamed: 0,First name,Email
1,Ahmed,ahmedbadr@notreally.com
0,Adam,adamsmith@gmail.com


Once again, if we call a single column in our 2nd argument, we receive a Series object with the required values:

In [43]:
people_df.loc[[1, 0], 'Email']

1    ahmedbadr@notreally.com
0        adamsmith@gmail.com
Name: Email, dtype: object

Note: With **loc**, we specify columns by **label**. Since the column labels in our DataFrame are **strings**, we can **only** use strings to call them.
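
So far the row labels have happened to be 0, 1, 2, which is why loc and iloc looked alike for rows. As a small preview, the sketch below uses a hypothetical DataFrame whose rows are labelled 10, 20 and 30 instead, so that labels and positions no longer coincide.

```python
import pandas as pd

# Hypothetical row labels, chosen so that labels differ from positions.
df = pd.DataFrame({'First name': ['Adam', 'Ahmed', 'Jake'],
                   'Last name': ['Smith', 'Badr', 'Doe']},
                  index=[10, 20, 30])

print(df.loc[10, 'First name'])  # Adam  (row labelled 10)
print(df.iloc[0, 0])             # Adam  (row at position 0)

# There is no row labelled 0, so loc cannot find it.
try:
    df.loc[0, 'First name']
except KeyError:
    print('no row with label 0')
```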

----------------------------------------------------------------------------------------------------

Now, let's apply what we've learned to the Stack Overflow dataset.

First, we load our files, then we change the default options of maximum visible rows and columns to 85 each:

In [46]:
df = pd.read_csv('survey_results_public.csv')
schema_df = pd.read_csv('survey_results_schema.csv')

In [47]:
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)
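
To confirm that the options took effect, we can read them back with pd.get_option. The sketch below also shows pd.option_context, which changes an option only inside a with block; the value 10 is an arbitrary choice for illustration.

```python
import pandas as pd

pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

print(pd.get_option('display.max_columns'))  # 85

# Temporarily show fewer rows; the previous value is restored afterwards.
with pd.option_context('display.max_rows', 10):
    temporary_rows = pd.get_option('display.max_rows')

rows_after = pd.get_option('display.max_rows')
print(temporary_rows, rows_after)  # 10 85
```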

Let's look at the head of our dataset again, then pick a column to call:

In [48]:
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",,,4.0,10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,HTML/CSS;Java;JavaScript;Python,C;C++;C#;Go;HTML/CSS;Java;JavaScript;Python;SQL,SQLite,MySQL,MacOS;Windows,Android;Arduino;Windows,Django;Flask,Flask;jQuery,Node.js,Node.js,IntelliJ;Notepad++;PyCharm,Windows,I do not use containers,,,Yes,"Fortunately, someone else has that title",Yes,Twitter,Online,Username,2017,A few times per month or weekly,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,31-60 minutes,No,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are",Neutral,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,,"Developer, desktop or enterprise applications;...",,17,,,,,,,I am actively looking for a job,I've never had a job,,,Financial performance or funding status of the...,"Something else changed (education, award, medi...",,,,,,,,,,,,,,,,,C++;HTML/CSS;Python,C++;HTML/CSS;JavaScript;SQL,,MySQL,Windows,Windows,Django,Django,,,Atom;PyCharm,Windows,I do not use containers,,Useful across many domains and could change ma...,Yes,Yes,Yes,Instagram,Online,Username,2017,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was much faster,11-30 minutes,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, and I don't know what those are","Yes, somewhat",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",100 to 499 employees,"Designer;Developer, back-end;Developer, front-...",3.0,22,1,Slightly satisfied,Slightly satisfied,Not at all confident,Not sure,Not sure,"I’m not actively looking, but I am open to new...",1-2 years ago,Interview with people in peer roles,No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,THB,Thai baht,23000.0,Monthly,8820.0,40.0,There's no schedule or spec; I work on what se...,Distracting work environment;Inadequate access...,Less than once per month / Never,Home,Average,No,,"No, but I think we should",Not sure,I have little or no influence,HTML/CSS,Elixir;HTML/CSS,PostgreSQL,PostgreSQL,,,,Other(s):,,,Vim;Visual Studio Code,Linux-based,I do not use containers,,,Yes,Yes,Yes,Reddit,In real life (in person),Username,2011,A few times per week,Find answers to specific questions;Learn how t...,6-10 times per week,They were about the same,,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,100 to 499 employees,"Developer, full-stack",3.0,16,Less than 1 year,Very satisfied,Slightly satisfied,Very confident,No,Not sure,I am not interested in new job opportunities,Less than a year ago,"Write code by hand (e.g., on a whiteboard);Int...",No,"Languages, frameworks, and other technologies ...",I was preparing for a job search,USD,United States dollar,61000.0,Yearly,61000.0,80.0,There's no schedule or spec; I work on what se...,,Less than once per month / Never,Home,A little below average,No,,"No, but I think we should",Developers typically have the most influence o...,I have little or no influence,C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,,,.NET,.NET,Eclipse;Vim;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,"Useful for decentralized currency (i.e., Bitcoin)",Yes,SIGH,Yes,Reddit,In real life (in person),Username,2014,Daily or almost daily,Find answers to specific questions;Pass the ti...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Less than once per month or monthly,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,"10,000 or more employees","Academic researcher;Developer, desktop or ente...",16.0,14,9,Very dissatisfied,Slightly dissatisfied,Somewhat confident,Yes,No,I am not interested in new job opportunities,Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,"Industry that I'd be working in;Languages, fra...",I was preparing for a job search,UAH,Ukrainian hryvnia,,,,55.0,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Inadequ...,A few days each month,Office,A little above average,"Yes, because I see value in code review",,"Yes, it's part of our process",Not sure,I have little or no influence,C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA,HTML/CSS;Java;JavaScript;SQL;WebAssembly,Couchbase;MongoDB;MySQL;Oracle;PostgreSQL;SQLite,Couchbase;Firebase;MongoDB;MySQL;Oracle;Postgr...,Android;Linux;MacOS;Slack;Windows,Android;Docker;Kubernetes;Linux;Slack,Django;Express;Flask;jQuery;React.js;Spring,Flask;jQuery;React.js;Spring,Cordova;Node.js,Apache Spark;Hadoop;Node.js;React Native,IntelliJ;Notepad++;Vim,Linux-based,"Outside of work, for personal projects",Not at all,,Yes,Also Yes,Yes,Facebook,In real life (in person),Username,I don't remember,Multiple times per day,Find answers to specific questions,More than 10 times per week,Stack Overflow was much faster,,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...","Yes, definitely",Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


If we want to look at all the columns we have and view them all without scrolling, we use:

In [50]:
df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
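       'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn',
       'SocialMedia', 'Extraversion', 'ScreenName', 'SOVisit1st',
       'SOVisitFreq', 'SOVisitTo', 'SOFindAnswer', 'SOTimeSaved',
       'SOHowMuchTime', 'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',
       'SOComm', 'WelcomeChange', 'SONewContent', 'Age', 'Gender', 'Trans',
       'Sexuality', 'Ethnicity', 'Dependents', 'SurveyLength', 'SurveyEase'],
      dtype='object')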

Let's call the 'Hobbyist' column:

In [49]:
df['Hobbyist']

0        Yes
1         No
2        Yes
3         No
4        Yes
        ... 
88878    Yes
88879     No
88880     No
88881     No
88882    Yes
Name: Hobbyist, Length: 88883, dtype: object

The following function is out of the scope of this notebook, but is used here to tease the abilities of Pandas.

If we want to count all answers present in the 'Hobbyist' column, we can use:

In [54]:
df['Hobbyist'].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64
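
From the counts above, roughly 80% of respondents answered 'Yes' (71257 out of 88883). value_counts can compute such shares directly with normalize=True. The sketch below uses a tiny made-up Series rather than the survey file, so its numbers are illustrative only.

```python
import pandas as pd

# Made-up answers, not taken from the survey data.
answers = pd.Series(['Yes', 'No', 'Yes', 'Yes'], name='Hobbyist')

counts = answers.value_counts()                # absolute counts
shares = answers.value_counts(normalize=True)  # fractions that sum to 1

print(counts['Yes'], counts['No'])  # 3 1
print(shares['Yes'])                # 0.75
```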

Now, going back to our example, let's grab the first row of the 'Hobbyist' column:

In [56]:
df.loc[0, 'Hobbyist']

'Yes'

Now, if we want to grab the first 3 rows of the 'Hobbyist' column:

In [57]:
df.loc[[0, 1, 2], 'Hobbyist']

0    Yes
1     No
2    Yes
Name: Hobbyist, dtype: object

Or alternatively, we can use slicing, which works similarly to Python lists, except the last value is included:

In [61]:
df.loc[0:2, 'Hobbyist']

0    Yes
1     No
2    Yes
Name: Hobbyist, dtype: object

**Note: do not wrap the values in brackets when slicing!**

**Note: when slicing with loc, the last value is included!**

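The last value is included when slicing with loc because loc slices by label rather than by position. This also lets us slice/grab multiple columns by label, where it would be awkward to have to name the column that comes after the last one we want.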
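
The difference in slicing behavior is easy to check on a small example. The sketch below uses a made-up five-row DataFrame and compares a loc slice, an iloc slice and an ordinary list slice over the same bounds.

```python
import pandas as pd

# Made-up data with the default row labels 0..4.
df = pd.DataFrame({'Hobbyist': ['Yes', 'No', 'Yes', 'No', 'Yes']})

by_label = df.loc[0:2, 'Hobbyist']        # labels 0, 1, 2 (end included)
by_position = df.iloc[0:2, 0]             # positions 0, 1 (end excluded)
plain_list = df['Hobbyist'].tolist()[0:2] # ordinary Python slicing

print(len(by_label))     # 3
print(len(by_position))  # 2
print(len(plain_list))   # 2
```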

Example: Let's grab all the columns from 'Hobbyist' to 'Employment' for the first 9 rows:

In [62]:
df.loc[0:8, 'Hobbyist':'Employment']

Unnamed: 0,Hobbyist,OpenSourcer,OpenSource,Employment
0,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work"
1,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work"
2,Yes,Never,The quality of OSS and closed source software ...,Employed full-time
3,No,Never,The quality of OSS and closed source software ...,Employed full-time
4,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time
5,Yes,Never,The quality of OSS and closed source software ...,Employed full-time
6,No,Never,The quality of OSS and closed source software ...,"Independent contractor, freelancer, or self-em..."
7,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work"
8,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed full-time
