# 100 days of Python
## Day 72: Data exploration using Pandas

Use pandas to open, explore, correct, and visualize data in a CSV file.

In [1]:
import pandas as pd

### We will use pandas DataFrame objects to manage the data

First, load the CSV file using the read_csv function from pandas

In [34]:
df = pd.read_csv("salaries_by_college_major.csv")
df

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business
5,Art History,35800.0,64900.0,28800.0,125000.0,HASS
6,Biology,38800.0,64800.0,36900.0,135000.0,STEM
7,Business Management,43000.0,72100.0,38800.0,147000.0,Business
8,Chemical Engineering,63200.0,107000.0,71900.0,194000.0,STEM
9,Chemistry,42600.0,79900.0,45300.0,148000.0,STEM
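To try read_csv without the salary file on disk, the sketch below reads a small CSV from an in-memory string. The three sample rows are copied from the table above; the variable names `csv_text` and `sample_df` are made up for this illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a few rows in the same layout.
csv_text = """Undergraduate Major,Starting Median Salary,Group
Accounting,46000.0,Business
Aerospace Engineering,57700.0,STEM
Agriculture,42600.0,Business
"""

# read_csv accepts a file path or a file-like object, so StringIO
# can play the role of the CSV file here.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

The first line of the text becomes the column names, and pandas assigns a default integer index starting at 0.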


- We can use the .head() method to view the first 5 rows of the dataframe
- In the same way, we can look at the last 5 rows using the .tail() method

In [6]:
df.head()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business
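Both methods also accept the number of rows to return. A minimal sketch, assuming a ten-row stand-in dataframe (`demo_df` is a hypothetical name) built from the first rows of the table above:

```python
import pandas as pd

# Hypothetical stand-in for df: the first ten majors and their starting salaries.
demo_df = pd.DataFrame({
    "Undergraduate Major": [
        "Accounting", "Aerospace Engineering", "Agriculture", "Anthropology",
        "Architecture", "Art History", "Biology", "Business Management",
        "Chemical Engineering", "Chemistry",
    ],
    "Starting Median Salary": [
        46000.0, 57700.0, 42600.0, 36800.0, 41600.0,
        35800.0, 38800.0, 43000.0, 63200.0, 42600.0,
    ],
})

first_three = demo_df.head(3)  # pass n to change how many rows come back
last_five = demo_df.tail()     # with no argument, n defaults to 5

print(first_three)
print(last_five)
```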


#### To find the dimensions of the dataframe (rows, columns), we can check the .shape attribute

In [7]:
df.shape

(51, 6)

#### And we can also check the columns in the dataframe using the attribute .columns

In [8]:
df.columns

Index(['Undergraduate Major', 'Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 90th Percentile Salary', 'Group'],
      dtype='object')

## Missing Values and Junk Data


Before we proceed with our analysis, we should check whether the dataframe contains any missing or junk data. That way we can avoid problems later on. In this case, we're going to look for NaN (Not a Number) values, which pandas uses to mark missing data such as blank cells. Use the .isna() method and see if you can spot a problem somewhere.

In [14]:
df.tail().isna()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
46,False,False,False,False,False,False
47,False,False,False,False,False,False
48,False,False,False,False,False,False
49,False,False,False,False,False,False
50,False,True,True,True,True,True
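Scanning a table of True/False values by eye gets tedious on larger datasets. One possible shortcut, sketched here on a small hypothetical dataframe (`demo_df` and its footer row are invented for illustration), is to count the missing values per column and filter for the rows that contain any:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: two real rows plus a footer-like row
# that has text in the first column and blanks everywhere else.
demo_df = pd.DataFrame({
    "Undergraduate Major": ["Sociology", "Spanish", "(footer note)"],
    "Starting Median Salary": [36500.0, 34000.0, np.nan],
    "Group": ["HASS", "HASS", np.nan],
})

# True counts as 1 when summed, so this gives the number of NaN cells per column.
missing_per_column = demo_df.isna().sum()

# any(axis=1) is True for every row that has at least one NaN cell.
rows_with_missing = demo_df[demo_df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```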


#### The last row in the dataframe contains blank cells. Since we are not interested in a row without any salary data, we can drop it from our dataframe with the .dropna() method

In [15]:
clean_df = df.dropna()

In [16]:
clean_df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
45,Political Science,40800.0,78200.0,41200.0,168000.0,HASS
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
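Two details of .dropna() are easy to miss: it returns a new dataframe rather than changing the original, and the surviving rows keep their original index labels. The sketch below illustrates both on a hypothetical three-row dataframe where the blank row sits in the middle (unlike the salary file, where it is the last row):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the middle row has no salary.
demo_df = pd.DataFrame({
    "Undergraduate Major": ["Sociology", "(blank row)", "Spanish"],
    "Starting Median Salary": [36500.0, np.nan, 34000.0],
})

cleaned = demo_df.dropna()  # drops every row that contains at least one NaN

print(len(demo_df))         # the original still has all of its rows
print(cleaned)              # the remaining rows keep labels 0 and 2
```

Because the labels are not renumbered, a later lookup such as `cleaned.loc[2]` still refers to the same row as before the drop.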


## Accessing Columns and Individual Cells in a Dataframe

### Find College Major with Highest Starting Salaries

To access a particular column from a data frame we can use the square bracket notation, like so:

In [18]:
clean_df['Starting Median Salary']

0     46000.0
1     57700.0
2     42600.0
3     36800.0
4     41600.0
5     35800.0
6     38800.0
7     43000.0
8     63200.0
9     42600.0
10    53900.0
11    38100.0
12    61400.0
13    55900.0
14    53700.0
15    35000.0
16    35900.0
17    50100.0
18    34900.0
19    60900.0
20    38000.0
21    37900.0
22    47900.0
23    39100.0
24    41200.0
25    43500.0
26    35700.0
27    38800.0
28    39200.0
29    37800.0
30    57700.0
31    49100.0
32    36100.0
33    40900.0
34    35600.0
35    49200.0
36    40800.0
37    45400.0
38    57900.0
39    35900.0
40    54200.0
41    39900.0
42    39900.0
43    74300.0
44    50300.0
45    40800.0
46    35900.0
47    34100.0
48    36500.0
49    34000.0
Name: Starting Median Salary, dtype: float64

#### To find the highest starting salary we can simply chain the .max() method. 

In [19]:
clean_df['Starting Median Salary'].max()

74300.0

#### The highest starting salary is $74,300. But which college major earns this much on average? For this, we need to know the row number or index so that we can look up the name of the major. Lucky for us, the .idxmax() method will give us the index of the row with the largest value.

In [21]:
clean_df['Starting Median Salary'].idxmax()

43

#### To see the name of the major in that row, we can use the .loc (location) property to look up the whole row, or a single value from it

In [25]:
clean_df.loc[43]

Undergraduate Major                  Physician Assistant
Starting Median Salary                           74300.0
Mid-Career Median Salary                         91700.0
Mid-Career 10th Percentile Salary                66400.0
Mid-Career 90th Percentile Salary               124000.0
Group                                               STEM
Name: 43, dtype: object

In [26]:
clean_df['Undergraduate Major'].loc[43]

'Physician Assistant'
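.loc also accepts a row label and a column label together, which retrieves a single cell in one step. A minimal sketch, assuming a three-row stand-in (`demo_df` is a hypothetical name) that reuses the index labels and values shown in this notebook:

```python
import pandas as pd

# Hypothetical stand-in for clean_df, keeping the original index labels.
demo_df = pd.DataFrame(
    {
        "Undergraduate Major": ["Chemical Engineering", "Physician Assistant", "Spanish"],
        "Starting Median Salary": [63200.0, 74300.0, 34000.0],
    },
    index=[8, 43, 49],
)

# idxmax returns the index label of the largest value, not its position.
top_label = demo_df["Starting Median Salary"].idxmax()

# .loc[row_label, column_label] looks up one cell.
top_major = demo_df.loc[top_label, "Undergraduate Major"]

print(top_label, top_major)
```

Note that .loc works with labels: here the label 43 points at the second row of the stand-in, which is why it pairs naturally with .idxmax().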

## Challenge:

Now that we've found the major with the highest starting salary, can you write the code to find the following:

- What college major has the highest mid-career salary? How much do graduates with this major earn? (Mid-career is defined as having 10+ years of experience).

- Which college major has the lowest starting salary and how much do graduates earn after university?

- Which college major has the lowest mid-career salary and how much can people expect to earn with this degree? 

### 1) College major with the highest mid-career salary

In [27]:
clean_df['Mid-Career Median Salary'].idxmax()

8

In [29]:
clean_df.loc[8]

Undergraduate Major                  Chemical Engineering
Starting Median Salary                            63200.0
Mid-Career Median Salary                         107000.0
Mid-Career 10th Percentile Salary                 71900.0
Mid-Career 90th Percentile Salary                194000.0
Group                                                STEM
Name: 8, dtype: object

### 2) College major with the lowest starting salary and how much graduates earn after university

In [37]:
clean_df['Starting Median Salary'].min()

34000.0

In [38]:
clean_df.loc[clean_df['Starting Median Salary'].idxmin()]

Undergraduate Major                  Spanish
Starting Median Salary               34000.0
Mid-Career Median Salary             53100.0
Mid-Career 10th Percentile Salary    31000.0
Mid-Career 90th Percentile Salary    96400.0
Group                                   HASS
Name: 49, dtype: object

### 3) College major with the lowest mid-career salary and how much people can expect to earn with this degree

In [41]:
clean_df.loc[clean_df['Mid-Career Median Salary'].idxmin()]

Undergraduate Major                  Education
Starting Median Salary                 34900.0
Mid-Career Median Salary               52000.0
Mid-Career 10th Percentile Salary      29300.0
Mid-Career 90th Percentile Salary     102000.0
Group                                     HASS
Name: 18, dtype: object
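All three answers follow the same pattern: find the index label with .idxmax() or .idxmin(), then look up the row with .loc. One way to avoid repeating it is a small helper. The function `extreme_major` below is a hypothetical name, not part of pandas, and the four-row dataframe is a stand-in built from values shown in this notebook:

```python
import pandas as pd


def extreme_major(frame, column, highest=True):
    """Return (major, value) for the row with the highest or lowest value in column."""
    series = frame[column]
    label = series.idxmax() if highest else series.idxmin()
    return frame.loc[label, "Undergraduate Major"], frame.loc[label, column]


# Hypothetical stand-in for clean_df, keeping the original index labels.
demo_df = pd.DataFrame(
    {
        "Undergraduate Major": [
            "Chemical Engineering", "Education", "Physician Assistant", "Spanish",
        ],
        "Starting Median Salary": [63200.0, 34900.0, 74300.0, 34000.0],
        "Mid-Career Median Salary": [107000.0, 52000.0, 91700.0, 53100.0],
    },
    index=[8, 18, 43, 49],
)

highest_mid = extreme_major(demo_df, "Mid-Career Median Salary")
lowest_start = extreme_major(demo_df, "Starting Median Salary", highest=False)
lowest_mid = extreme_major(demo_df, "Mid-Career Median Salary", highest=False)

print(highest_mid, lowest_start, lowest_mid)
```

One caveat: when several rows share the extreme value, .idxmin() and .idxmax() return only the first matching label. In the clean_df.tail() output earlier, Religion (row 47) also shows a mid-career median of 52000.0, so it ties with Education (row 18) for the lowest value.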