# Introduction to pandas

Pandas is an open source data analysis and manipulation tool, built on top of the Python programming language.


In [2]:
import pandas as pd 

## read_csv: get data from flat files

Pandas lets you create dataframes from flat files such as CSVs. (Excel spreadsheets are loaded with a separate function, read_excel.)

In [4]:
# Here we create a new dataframe by calling the "read_csv" function. The file name is the first argument.
# In this case, we don't need to specify the sep argument since a comma is the default separator.
# If the file used another delimiter (for example, tabs in a tsv), we would pass it explicitly, e.g. sep="\t".

df_salary = pd.read_csv('Levels_Fyi_Salary_Data.csv')



df_salary.head()


Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,...,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education
0,6/7/2017 11:33:27,Oracle,L3,Product Manager,127000,"Redwood City, CA",1.5,1.5,,107000.0,...,0,0,0,0,0,0,0,0,,
1,6/10/2017 17:11:29,eBay,SE 2,Software Engineer,100000,"San Francisco, CA",5.0,3.0,,0.0,...,0,0,0,0,0,0,0,0,,
2,6/11/2017 14:53:57,Amazon,L7,Product Manager,310000,"Seattle, WA",8.0,0.0,,155000.0,...,0,0,0,0,0,0,0,0,,
3,6/17/2017 0:23:14,Apple,M1,Software Engineering Manager,372000,"Sunnyvale, CA",7.0,5.0,,157000.0,...,0,0,0,0,0,0,0,0,,
4,6/20/2017 10:58:51,Microsoft,60,Software Engineer,157000,"Mountain View, CA",5.0,3.0,,0.0,...,0,0,0,0,0,0,0,0,,
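
The comments above mention that other delimiters need the sep argument. Here is a minimal sketch of that case. The tab-separated text is made-up sample data held in memory (it is not part of the salary file), so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical tab-separated data standing in for a .tsv file.
tsv_text = (
    "company\ttitle\ttotalyearlycompensation\n"
    "Oracle\tProduct Manager\t127000\n"
    "eBay\tSoftware Engineer\t100000\n"
)

# sep="\t" tells read_csv to split fields on tabs instead of commas.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_tsv.head())
```

With a real file, the io.StringIO object would simply be replaced by the file path.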


### usecols: import a subset of columns

Let's say we only want to import a subset of columns. In that case, we can pass a list of column names to the "usecols" argument.

In [14]:
# Create a list of columns to use

cols = ["timestamp","company","Race_White","Race_Black","totalyearlycompensation"]

# Create a dataframe that uses only the selected columns.

data = pd.read_csv('Levels_Fyi_Salary_Data.csv', usecols = cols)

# Check which columns were loaded and their data types.

data.dtypes



timestamp                  object
company                    object
totalyearlycompensation     int64
Race_White                  int64
Race_Black                  int64
dtype: object
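
Notice that the columns come back in file order, not in the order of the list we passed. The sketch below illustrates this on a small in-memory sample (hypothetical rows that reuse the column names from the salary file) and shows one way to reorder afterwards.

```python
import io

import pandas as pd

# Hypothetical sample with the same column names as the salary file.
csv_text = (
    "timestamp,company,totalyearlycompensation,Race_White,Race_Black\n"
    "6/7/2017 11:33:27,Oracle,127000,0,0\n"
    "6/10/2017 17:11:29,eBay,100000,0,0\n"
)

cols = ["company", "Race_White", "totalyearlycompensation"]

# usecols selects the columns but keeps the order they have in the file.
subset = pd.read_csv(io.StringIO(csv_text), usecols=cols)
print(list(subset.columns))

# Index with the list to get the columns in the order we asked for.
reordered = subset[cols]
print(list(reordered.columns))
```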

## dtypes: learn about data types

Pandas tries to guess the data type for data in flat files but sometimes it guesses wrong.

This is often true of numbers that represent groups or qualities instead of quantities, such as ID codes or 0/1 indicator columns.

In [15]:
data.dtypes

# Notice how timestamp is an object but should be datetime

timestamp                  object
company                    object
totalyearlycompensation     int64
Race_White                  int64
Race_Black                  int64
dtype: object

## dtype argument: change data types

To override pandas' guesses, pass a dictionary to the dtype argument of read_csv. The keys are column names and the values are the data types to use.
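
As a minimal sketch, the cell below reads a small in-memory sample (hypothetical rows using column names from the salary file) twice: once with the defaults and once with a dtype dictionary. The specific type choices are only illustrative assumptions about how one might want to treat these columns.

```python
import io

import pandas as pd

# Hypothetical sample rows; "level" and "Race_White" look numeric
# but really label groups rather than quantities.
csv_text = (
    "company,level,Race_White,totalyearlycompensation\n"
    "Microsoft,60,0,157000\n"
    "Microsoft,61,1,180000\n"
)

# Default behaviour: pandas guesses integer types for the numeric-looking columns.
guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed.dtypes)

# Keys are column names, values are the data types we want instead.
col_types = {"company": "category", "level": str, "Race_White": "category"}

typed = pd.read_csv(io.StringIO(csv_text), dtype=col_types)
print(typed.dtypes)
```

Columns that are not listed in the dictionary, such as totalyearlycompensation here, are still inferred automatically.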

## parse_dates: parse simple dates

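By default, pandas does not recognize datetime data when reading a flat file; it loads those columns as strings (the object data type). To get real datetime values, we use the parse_dates argument within read_csv.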

In [21]:
# Pass the datetime column into the parse_dates argument.

data = pd.read_csv('Levels_Fyi_Salary_Data.csv', parse_dates = ["timestamp"])

data.timestamp.head()

0   2017-06-07 11:33:27
1   2017-06-10 17:11:29
2   2017-06-11 14:53:57
3   2017-06-17 00:23:14
4   2017-06-20 10:58:51
Name: timestamp, dtype: datetime64[ns]

In [29]:
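# Create a dict that maps the new datetime column to the columns to combine.
# This only works if the file actually contains the listed columns.

datetime_parsed = {"timestamp": ["start_date","start_time"]}

# Load file, supplying the dict to parse_dates.
# This file has no start_date or start_time columns, so the call raises the ValueError shown below.
survey_data = pd.read_csv('Levels_Fyi_Salary_Data.csv', parse_dates = datetime_parsed)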


ValueError: Missing column provided to 'parse_dates': 'start_date, start_time'
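
The call above fails because the salary file has no start_date or start_time columns. To show the idea on data where the pieces really are split, here is a sketch using a made-up in-memory sample with those two hypothetical columns. It combines them after loading with pd.to_datetime rather than with a parse_dates dict, since recent pandas releases have, as far as we know, deprecated combining columns through parse_dates.

```python
import io

import pandas as pd

# Hypothetical data in which date and time live in separate columns.
csv_text = (
    "start_date,start_time,company\n"
    "6/7/2017,11:33:27,Oracle\n"
    "6/10/2017,17:11:29,eBay\n"
)

# Both columns load as plain strings.
survey = pd.read_csv(io.StringIO(csv_text))

# Join the two strings with a space, then parse with an explicit format.
survey["timestamp"] = pd.to_datetime(
    survey["start_date"] + " " + survey["start_time"],
    format="%m/%d/%Y %H:%M:%S",
)

print(survey["timestamp"].head())
```

Passing an explicit format avoids ambiguity between month-first and day-first dates.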