In [1]:
import pandas as pd

In [6]:
people = {
    'first' : ['Corey', 'Jane', 'John'],
    'last' : ['Schafer', 'Doe', 'Doe'],
    'email' : ['Corey@gmail.com', 'Jane@gmail.com', 'John@gmail.com']
}

In [17]:
df = pd.DataFrame(people)
df

   first     last            email
0  Corey  Schafer  Corey@gmail.com
1   Jane      Doe   Jane@gmail.com
2   John      Doe   John@gmail.com


In [12]:
# Bracket notation is preferred, because a column name can collide with the name of a DataFrame attribute or method.
df['email']

0    Corey@gmail.com
1     Jane@gmail.com
2     John@gmail.com
Name: email, dtype: object

In [13]:
# Dot notation is another way to access the same column.
df.email

0    Corey@gmail.com
1     Jane@gmail.com
2     John@gmail.com
Name: email, dtype: object
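
The collision mentioned above can be seen in a minimal sketch. The `scores` frame and its `count` column are hypothetical and were chosen only because `count` is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: the column 'count' shares its name with DataFrame.count.
scores = pd.DataFrame({'name': ['Corey', 'Jane'], 'count': [3, 5]})

# Bracket notation always returns the column as a Series.
bracket_result = scores['count']

# Dot notation finds the method first, so we get a bound method, not the column.
dot_result = scores.count

print(type(bracket_result).__name__)
print(callable(dot_result))
```

Here the bracket form yields the data, while the dot form silently yields a method object, which is why the bracket form is the safer habit.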

In [15]:
type(df['email'])

pandas.core.series.Series

```df['email']``` is a Series object: it holds a list of values but offers far more functionality than a plain list. Formally, a Series is a one-dimensional labeled array; here it contains the rows of a single column. A DataFrame can then be viewed as a collection of Series that share the same index.

The Series keeps the index of the DataFrame it came from (0, 1, 2 here), so we can also look up individual values within it by label.
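
As a small sketch of these two points, the snippet below rebuilds the `people` frame from earlier, inspects the index of one column, and reassembles a DataFrame from individual Series. The variable names `emails`, `second_email`, and `rebuilt` are introduced here only for illustration.

```python
import pandas as pd

people = {
    'first': ['Corey', 'Jane', 'John'],
    'last': ['Schafer', 'Doe', 'Doe'],
    'email': ['Corey@gmail.com', 'Jane@gmail.com', 'John@gmail.com'],
}
df = pd.DataFrame(people)

# A single column is a Series that carries the DataFrame's index.
emails = df['email']
index_labels = list(emails.index)

# Look up one value in the Series by its index label.
second_email = emails.loc[1]

# A DataFrame can be assembled from Series that share an index.
rebuilt = pd.DataFrame({'first': df['first'], 'email': emails})

print(index_labels)
print(second_email)
print(rebuilt.equals(df[['first', 'email']]))
```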

DataFrames become more useful when we access multiple columns at once by passing a list of column names inside the brackets:
```df[[col_1, col_2, ..., col_n]]```. The output is no longer a Series but another DataFrame, FILTERED DOWN to those columns, because the result is no longer one-dimensional.

In [20]:
df[['last', 'email']]

      last            email
0  Schafer  Corey@gmail.com
1      Doe   Jane@gmail.com
2      Doe   John@gmail.com


In [22]:
# List all of the column names.
df.columns

Index(['first', 'last', 'email'], dtype='object')

To get rows, we use either ```loc``` or ```iloc```.

```iloc``` allows us to access rows by integer location. To get the first row (position 0), we write
```df.iloc[0]```.

When we access a single row, the result is a Series whose index is made up of the column names.

To select multiple rows, such as rows 0 and 1, we write ```df.iloc[[0, 1]]```. We can also specify the columns that we want. However, since ```iloc``` works only with integer positions, we cannot use column names and must pass the position of the column instead. For example, the column at position 2 (remember that we start counting from zero) is the *email* column. Thus, we write ```df.iloc[[0, 1], 2]``` to get the *email* values of the first and second rows.

In [23]:
df.iloc[0]

first              Corey
last             Schafer
email    Corey@gmail.com
Name: 0, dtype: object

In [24]:
df.iloc[[0, 1]]

   first     last            email
0  Corey  Schafer  Corey@gmail.com
1   Jane      Doe   Jane@gmail.com


In [26]:
df.iloc[[0, 1], 2]

0    Corey@gmail.com
1     Jane@gmail.com
Name: email, dtype: object

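Let us now look at ```loc```, which selects by label: index labels for rows and names for columns. With the default index, the row labels happen to coincide with the integer positions.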

In [27]:
df

   first     last            email
0  Corey  Schafer  Corey@gmail.com
1   Jane      Doe   Jane@gmail.com
2   John      Doe   John@gmail.com


In [28]:
df.loc[0]

first              Corey
last             Schafer
email    Corey@gmail.com
Name: 0, dtype: object

In [29]:
df.loc[[0, 1]]

   first     last            email
0  Corey  Schafer  Corey@gmail.com
1   Jane      Doe   Jane@gmail.com


In [32]:
# With loc we can select columns by name, so we pass 'email' directly.
# Notice that email comes before last in the output: the columns follow the ORDER of the list.
df.loc[[0, 1], ['email', 'last']]

             email     last
0  Corey@gmail.com  Schafer
1   Jane@gmail.com      Doe
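
With the default index, ```loc``` and ```iloc``` look almost identical for rows. The sketch below makes the difference visible by giving the same data a non-integer index. The labels `'a'`, `'b'`, `'c'` and the name `labeled` are assumptions made up for this illustration.

```python
import pandas as pd

people = {
    'first': ['Corey', 'Jane', 'John'],
    'last': ['Schafer', 'Doe', 'Doe'],
    'email': ['Corey@gmail.com', 'Jane@gmail.com', 'John@gmail.com'],
}

# Hypothetical string labels instead of the default 0, 1, 2 index.
labeled = pd.DataFrame(people, index=['a', 'b', 'c'])

# loc selects by label: row 'b', column 'email'.
by_label = labeled.loc['b', 'email']

# iloc selects by position: second row (1), third column (2).
by_position = labeled.iloc[1, 2]

print(by_label)
print(by_position)
```

Both lookups return the same cell, but only ```loc``` understands the labels and only ```iloc``` understands the positions.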


Let us now go back to the Survey results.

In [34]:
df = pd.read_csv('data/survey_results_public.csv')
schema_df = pd.read_csv('data/survey_results_schema.csv')

In [36]:
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

In [40]:
df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife'

In [42]:
df['Hobbyist'].head()

0    Yes
1     No
2    Yes
3     No
4    Yes
Name: Hobbyist, dtype: object

In [43]:
# Count how many respondents answered Yes and how many answered No.
df['Hobbyist'].value_counts()

Hobbyist
Yes    71257
No     17626
Name: count, dtype: int64

In [49]:
round(71257/(71257 + 17626), 3)

0.802
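
The same proportion can be obtained without typing the counts by hand, using the `normalize=True` option of `value_counts`. The sketch below uses a small made-up Series as a stand-in for ```df['Hobbyist']```; its values are illustrative and are not the survey data.

```python
import pandas as pd

# Illustrative stand-in for df['Hobbyist'] (not the real survey responses).
hobbyist = pd.Series(['Yes', 'No', 'Yes', 'Yes', 'No'], name='Hobbyist')

# normalize=True returns relative frequencies instead of raw counts.
shares = hobbyist.value_counts(normalize=True).round(3)

print(shares)
```

On the survey DataFrame, the equivalent call would be ```df['Hobbyist'].value_counts(normalize=True)```.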

In [51]:
# All of the responses from ONE respondent
df.loc[0]

Respondent                                                                1
MainBranch                           I am a student who is learning to code
Hobbyist                                                                Yes
OpenSourcer                                                           Never
OpenSource                The quality of OSS and closed source software ...
Employment                           Not employed, and not looking for work
Country                                                      United Kingdom
Student                                                                  No
EdLevel                                           Primary/elementary school
UndergradMajor                                                          NaN
EduOther                  Taught yourself a new language, framework, or ...
OrgSize                                                                 NaN
DevType                                                                 NaN
YearsCode   

In [62]:
# loc also accepts slices. Unlike ordinary Python slices, the last value is inclusive, and the slice does not need to be wrapped in brackets. Slicing works for columns too.
df.loc[0 : 20, 'Hobbyist':'UndergradMajor']

,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor
0,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,
1,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",
2,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design
3,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof..."
4,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof..."
5,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Canada,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics
6,No,Never,The quality of OSS and closed source software ...,"Independent contractor, freelancer, or self-em...",Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Another engineering discipline (ex. civil, ele..."
7,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, but looking for work",India,,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof..."
8,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed full-time,New Zealand,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof..."
9,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",
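
To make the inclusive end point concrete, here is a minimal sketch that compares a ```loc``` slice with the corresponding ```iloc``` slice on the small `people` data from earlier. The names `small`, `loc_rows`, and `iloc_rows` are introduced only for this example.

```python
import pandas as pd

small = pd.DataFrame({
    'first': ['Corey', 'Jane', 'John'],
    'last': ['Schafer', 'Doe', 'Doe'],
    'email': ['Corey@gmail.com', 'Jane@gmail.com', 'John@gmail.com'],
})

# loc slices are label-based and include the end label:
# rows 0 and 1, columns 'first' and 'last'.
loc_rows = small.loc[0:1, 'first':'last']

# iloc slices are position-based and exclude the end position:
# row 0 only, column 'first' only.
iloc_rows = small.iloc[0:1, 0:1]

print(loc_rows.shape)
print(iloc_rows.shape)
```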


In [63]:
df['UndergradMajor'].value_counts()

UndergradMajor
Computer science, computer engineering, or software engineering          47214
Another engineering discipline (ex. civil, electrical, mechanical)        6222
Information systems, information technology, or system administration     5253
Web development or web design                                             3422
A natural science (ex. biology, chemistry, physics)                       3232
Mathematics or statistics                                                 2975
A business discipline (ex. accounting, finance, marketing)                1841
A humanities discipline (ex. literature, history, philosophy)             1571
A social science (ex. anthropology, psychology, political science)        1352
Fine arts or performing arts (ex. graphic design, music, studio art)      1233
I never declared a major                                                   976
A health science (ex. nursing, pharmacy, radiology)                        323
Name: count, dtype: int64