# About

- [Documentation](https://pandas.pydata.org/docs/)
- [Python Pandas Tutorial by Corey Schafer](https://www.youtube.com/watch?v=ZyhVh-qRZPA&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS)

- Why Pandas?
    - For data analysis
    - Good performance, since it is built on top of NumPy
    - Can easily visualize data
    - DataFrame (df) : rows and columns of data
        - Analogy with a Python dictionary of lists:
            - df rows $ \equiv $ dictionary values (the entries stored under each key)
            - df cols $ \equiv $ dictionary keys (the column names)
        - pandas df $ \equiv $ 2D data structure
        - Made up of multiple Series
    - Series : the rows of a single column, so each column of a df is a Series
        - Series $ \equiv $ 1D data structure
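
The dictionary analogy above can be sketched with a tiny, made-up dataset. The names `people`, `toy_df`, and `language_col` are hypothetical and are not part of the survey data used later in these notes.

```python
import pandas as pd

# Hypothetical toy data: each key becomes a column name,
# each list holds the row values stored under that column
people = {
    "first": ["Ada", "Grace", "Linus"],
    "language": ["Python", "COBOL", "C"],
}

# A DataFrame is the 2D structure built from that dictionary
toy_df = pd.DataFrame(people)

# Selecting a single column returns a 1D Series
language_col = toy_df["language"]

print(type(toy_df).__name__)        # DataFrame
print(type(language_col).__name__)  # Series
print(toy_df.shape)                 # (3, 2)
```

Each key maps to one column and each list position maps to one row, which is why the shape is 3 rows by 2 columns.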

# TODOs
1. [x] Imports and Load Data (csv)
    - [x] See df shape with the `.shape` attribute
    - [x] See all columns, dtypes, etc. with the `.info()` method
    - [x] Get first and last X rows with `.head(X)` and `.tail(X)`
    - [x] View a single Series by selecting one of the df columns
        - [x] Count the different values in this series with `.value_counts()`
2. [ ] `.loc[]` and `.iloc[]`
3. [ ] [Filter mask](https://youtu.be/Lw2rlcxScZY?t=88) and [Filter data](https://youtu.be/Lw2rlcxScZY?t=199)

## 1. Imports and Load Data (csv)

In [1]:
import pandas as pd

In [5]:
# dataframe on survey results
df = pd.read_csv("datasets/survey_results_public.csv")
# df

In [4]:
# .shape is an attribute (rows and columns)
df.shape

(88883, 85)

In [7]:
# shape and data types
# df.info()

In [6]:
# show up to 85 columns when a df is displayed
# run this cell first, then re-run the cell that displays the df
pd.set_option('display.max_columns', 85)

# likewise, show up to 85 rows
pd.set_option('display.max_rows', 85)

In [8]:
# load the second file: the schema file (SF)
# the SF describes the structure of the data file (one row per survey column)
# use it as a reference; it can also be filtered (covered in a later video)
schema_df = pd.read_csv('datasets/survey_results_schema.csv')
# schema_df

In [9]:
# get first x rows
# df.head(10)

# get last x rows
# df.tail()

In [13]:
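# (rows, cols) of the schema df, plus its column summary
# .info() prints its report and returns None, hence the None in the tuple below
schema_df.shape, schema_df.info()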

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Column        85 non-null     object
 1   QuestionText  85 non-null     object
dtypes: object(2)
memory usage: 1.5+ KB


((85, 2), None)
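
The inspection steps from this section can be tried without the survey CSV. This is a minimal sketch on a hypothetical five-row frame; `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the survey data
toy_df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4, 5],
    "Hobbyist": ["Yes", "No", "Yes", "No", "Yes"],
})

# .shape is an attribute (no parentheses): (rows, columns)
n_rows, n_cols = toy_df.shape

# .head(X) and .tail(X) return the first and last X rows
first_two = toy_df.head(2)
last_two = toy_df.tail(2)

# .info() prints its summary and returns None
info_result = toy_df.info()

# .value_counts() counts how often each value appears in a Series
counts = toy_df["Hobbyist"].value_counts()
print(counts)
```

Because `.info()` returns `None`, calling it on its own line keeps the cell output free of a stray `None`.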

## 2. .loc[] and .iloc[]

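- `.loc[]` : label location, selects rows and columns by their labels
- `.iloc[]` : integer location, selects rows and columns by their integer positions
- syntax for both (use labels with `.loc[]`, integer positions with `.iloc[]`)
    - single row : `df.loc[row]`
    - single column : `df.loc[:, col]`
    - multiple rows : `df.loc[[row1, row2]]` or a slice `df.loc[start:stop]`
    - multiple columns : `df.loc[:, [col1, col2]]` or a slice `df.loc[:, start:stop]`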
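
The four selection patterns listed above can be sketched side by side for both indexers. The frame below is hypothetical and uses string row labels on purpose, so that labels (`.loc[]`) and integer positions (`.iloc[]`) are easy to tell apart.

```python
import pandas as pd

# Hypothetical frame with string row labels "a", "b", "c"
toy_df = pd.DataFrame(
    {"Hobbyist": ["Yes", "No", "Yes"], "Country": ["UK", "BA", "TH"]},
    index=["a", "b", "c"],
)

# single row: label "b" sits at integer position 1
single_row = toy_df.loc["b"]
same_row = toy_df.iloc[1]

# single column: label "Hobbyist" sits at column position 0
single_col = toy_df.loc[:, "Hobbyist"]
same_col = toy_df.iloc[:, 0]

# multiple rows: pass a list of labels or a list of positions
multi_rows = toy_df.loc[["a", "c"]]
same_rows = toy_df.iloc[[0, 2]]

# multiple columns: pass a list of labels or a list of positions
multi_cols = toy_df.loc[:, ["Hobbyist", "Country"]]
same_cols = toy_df.iloc[:, [0, 1]]

print(single_row)
print(multi_rows)
```

With the survey df the row labels are the default integers 0, 1, 2, ..., so `.loc[1]` and `.iloc[1]` happen to return the same row there.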

In [15]:
# the whole Hobbyist column; selecting one column returns a Series
df['Hobbyist']

0        Yes
1         No
2        Yes
3         No
4        Yes
        ... 
88878    Yes
88879     No
88880     No
88881     No
88882    Yes
Name: Hobbyist, Length: 88883, dtype: object

In [16]:
# count how many times each answer (Yes / No) appears
df['Hobbyist'].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64

In [35]:
# row with label 1 (the 2nd row, since labels start at 0): all the responses for this person
df.loc[1]

Respondent                                                                2
MainBranch                           I am a student who is learning to code
Hobbyist                                                                 No
OpenSourcer                                         Less than once per year
OpenSource                The quality of OSS and closed source software ...
Employment                               Not employed, but looking for work
Country                                              Bosnia and Herzegovina
Student                                                      Yes, full-time
EdLevel                   Secondary school (e.g. American high school, G...
UndergradMajor                                                          NaN
EduOther                  Taken an online course in programming or softw...
OrgSize                                                                 NaN
DevType                   Developer, desktop or enterprise applications;...
YearsCode   

In [36]:
# row with label 1 and Hobbyist col; passing lists returns a df instead of a Series
df.loc[[1], ['Hobbyist']]

|   | Hobbyist |
|---|----------|
| 1 | No       |


In [30]:
# 1st 4 rows (labels 0 to 3) and Hobbyist col
df.loc[[0, 1, 2, 3], ['Hobbyist']]

|   | Hobbyist |
|---|----------|
| 0 | Yes      |
| 1 | No       |
| 2 | Yes      |
| 3 | No       |


In [32]:
# 1st 3 rows by slicing (last value is inclusive w/ loc) and Hobbyist col
df.loc[0:2, ['Hobbyist']]

|   | Hobbyist |
|---|----------|
| 0 | Yes      |
| 1 | No       |
| 2 | Yes      |


In [33]:
# 1st 3 rows by slicing (last value is inclusive w/ loc) and more col
df.loc[0:2, ['Hobbyist', 'OpenSourcer', 'OpenSource', 'Employment']]

|   | Hobbyist | OpenSourcer | OpenSource | Employment |
|---|----------|-------------|------------|------------|
| 0 | Yes | Never | The quality of OSS and closed source software ... | Not employed, and not looking for work |
| 1 | No | Less than once per year | The quality of OSS and closed source software ... | Not employed, but looking for work |
| 2 | Yes | Never | The quality of OSS and closed source software ... | Employed full-time |


In [34]:
# 1st 3 rows by slicing (last value is inclusive w/ loc) and more col by slicing
df.loc[0:2, 'Hobbyist' : 'Employment']

|   | Hobbyist | OpenSourcer | OpenSource | Employment |
|---|----------|-------------|------------|------------|
| 0 | Yes | Never | The quality of OSS and closed source software ... | Not employed, and not looking for work |
| 1 | No | Less than once per year | The quality of OSS and closed source software ... | Not employed, but looking for work |
| 2 | Yes | Never | The quality of OSS and closed source software ... | Employed full-time |
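
The cells above note that the last value of a slice is inclusive with `.loc[]`. For comparison, here is a small sketch showing that the same-looking slice with `.iloc[]` excludes its end position, as ordinary Python slicing does. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical four-row frame with default integer labels 0..3
toy_df = pd.DataFrame({
    "Hobbyist": ["Yes", "No", "Yes", "No"],
    "OpenSourcer": ["Never", "Less than once per year", "Never", "Never"],
    "Employment": ["Student", "Employed", "Employed", "Student"],
})

# .loc[] slices by label and INCLUDES the end label:
# rows 0, 1, 2 and columns Hobbyist through OpenSourcer
by_label = toy_df.loc[0:2, "Hobbyist":"OpenSourcer"]

# .iloc[] slices by position and EXCLUDES the end position:
# rows 0, 1 and columns 0, 1
by_position = toy_df.iloc[0:2, 0:2]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```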


## 3. Filter mask and Filter data

- [Filter mask](https://youtu.be/Lw2rlcxScZY?t=88) : a boolean Series marking which rows meet a condition
- [Filter data](https://youtu.be/Lw2rlcxScZY?t=199) : apply the filter mask to the df to keep only the rows where the mask is `True`
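
A minimal sketch of both steps, assuming a hypothetical four-row frame rather than the survey data. The names `mask`, `hobbyists`, and `hobbyist_ids` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the survey data
toy_df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "Hobbyist": ["Yes", "No", "Yes", "No"],
})

# Filter mask: a boolean Series, True where the condition holds
mask = toy_df["Hobbyist"] == "Yes"

# Filter data: index the df with the mask to keep only the True rows
hobbyists = toy_df[mask]

# .loc[] also accepts a mask, optionally together with a column selection
hobbyist_ids = toy_df.loc[mask, "Respondent"]

print(mask.tolist())  # [True, False, True, False]
print(hobbyists)
```

Storing the mask in its own variable keeps the condition readable and lets it be reused with either `df[mask]` or `df.loc[mask, cols]`.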