# Analyze Wage Data with Python

- [View Solution Notebook](./solutions.html)
- [View Project Page](https://www.codecademy.com/projects/practice/analyze-wage-data-with-python)

## Task Group 1 - Import and Clean

### Task 1

Display the first five lines of `df_wages`.

In [None]:
import pandas as pd

df_wages = pd.read_csv('wages.csv')

# Preview the data
df_wages.head()

### Task 2

Rename the `Occupation title (click on the occupation title to view its profile)` column to `Occupation title`. 

In [None]:
col_mapper = {'Occupation title (click on the occupation title to view its profile)': 'Occupation title'}
df_wages = df_wages.rename(columns=col_mapper)

# show output
df_wages.head()

### Task 3

Drop any redundant or otherwise unnecessary columns from `df_wages`. Make a note of why these columns are suitable for dropping!

In [None]:
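# Note (assumed rationale, confirm against the data before dropping):
# 'Index' repeats the DataFrame's own row index, and 'Year' is the same for every row
drop_column_labels = ['Index', 'Year']
df_wages = df_wages.drop(labels=drop_column_labels, axis=1)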

# show output
df_wages.head()
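
One way to back up the note on why a column can be dropped is to count its distinct values. The sketch below uses a small made-up frame named `demo` (its values are hypothetical, not taken from `wages.csv`) and assumes a column is redundant when it holds a single value for every row or simply repeats the row index.

```python
import pandas as pd

# Hypothetical stand-in for df_wages; the real values come from wages.csv
demo = pd.DataFrame({
    'Index': [0, 1, 2],
    'Year': [2018, 2018, 2018],
    'Average hourly wage': ['$20.50', '$31.10', '$18.75'],
})

# A column with one distinct value adds nothing when comparing rows
constant_cols = [col for col in demo.columns if demo[col].nunique() == 1]

# A column equal to the row index duplicates what pandas already stores
repeats_index = (demo['Index'].to_numpy() == demo.index.to_numpy()).all()

print(constant_cols)
print(repeats_index)
```

Running the same two checks on `df_wages` before the drop would show whether `Index` and `Year` actually meet these criteria.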

### Task 4

Display column information, including names, the number of non-null entries, and data types.

In [None]:
df_wages.info()
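
The summary printed by `info()` can also be pulled out piece by piece when only one part is needed. A minimal sketch on a made-up frame (the `demo` name and its values are hypothetical):

```python
import pandas as pd

# Hypothetical two-row frame with one missing title
demo = pd.DataFrame({
    'Occupation title': ['Industry A - Entry - Job A', None],
    'Average hourly wage': ['$20.50', '$31.10'],
})

# Number of non-null entries per column
non_null = demo.notna().sum()

# Data type pandas inferred for each column
dtypes = demo.dtypes

print(non_null)
print(dtypes)
```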

## Task Group 2 - Column Transformations

### Task 5

Use pandas to split the information in the `Occupation title` column into new columns `Industry`, `Level`, and `Occupation`. 

In [None]:
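# n=2 caps the split at three pieces, so a hyphen inside an occupation name stays in the last piece
title_split = df_wages['Occupation title'].str.split('-', n=2, expand=True)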
df_wages['Industry'] = title_split[0]
df_wages['Level'] = title_split[1]
df_wages['Occupation'] = title_split[2]

# show output
df_wages[['Occupation title', 'Industry', 'Level', 'Occupation']].head()
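
How `str.split` with `expand=True` lays out its pieces is easier to see on a toy Series. The titles below are invented and assume an `Industry - Level - Occupation` pattern. The `n=2` argument is an optional safeguard, not something the task requires: it caps the split at three pieces so a hyphen inside an occupation name is kept in the last piece.

```python
import pandas as pd

# Hypothetical titles; the second one has hyphens inside the occupation name
titles = pd.Series([
    'Healthcare - Entry - Nurse Aide',
    'Sales - Senior - Business-to-Business Representative',
])

# Without a limit, every hyphen starts a new column
unlimited = titles.str.split('-', expand=True)

# With n=2, at most two splits are made, giving exactly three columns
parts = titles.str.split('-', n=2, expand=True)

print(unlimited.shape)
print(parts[2].str.strip().tolist())
```

Each piece still carries the spaces that surrounded the hyphens, which is why a whitespace-stripping step is needed afterwards.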

### Task 6

Remove any leading and trailing whitespace in the columns `Industry`, `Level`, and `Occupation`.

In [None]:
df_wages['Industry'] = df_wages['Industry'].str.strip()
df_wages['Level'] = df_wages['Level'].str.strip()
df_wages['Occupation'] = df_wages['Occupation'].str.strip()

### Task 7

Replace the `'$'` character in the columns `Average hourly wage`, `Industry average`, and `Similar occupation average` with an empty string `''` (no space between the single quotes!).

In [None]:
df_wages['Average hourly wage'] = df_wages['Average hourly wage'].str.replace('$', '', regex=False)
df_wages['Industry average'] = df_wages['Industry average'].str.replace('$', '', regex=False)
df_wages['Similar occupation average'] = df_wages['Similar occupation average'].str.replace('$', '', regex=False)

# show output
df_wages[['Average hourly wage', 'Industry average', 'Similar occupation average']].head()
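
The `regex=False` argument matters here because `$` has a special meaning in regular expressions. The sketch below uses invented wage strings and Python's built-in `re` module to show the difference between a literal and a regex match.

```python
import re

import pandas as pd

# Hypothetical wage strings
wages = pd.Series(['$20.50', '$31.10'])

# regex=False: '$' is matched as a literal character and removed
literal = wages.str.replace('$', '', regex=False)

# In a regular expression, an unescaped '$' matches the end of the string,
# so replacing it with '' leaves the text unchanged
unescaped = re.sub('$', '', '$20.50')

# Escaping the dollar sign makes the pattern match the character itself
escaped = re.sub(r'\$', '', '$20.50')

print(literal.tolist())
print(unescaped, escaped)
```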

### Task 8

Convert the data types of the columns `Average hourly wage`, `Industry average`, and `Similar occupation average` from `object` to `float`.

In [None]:
df_wages['Average hourly wage'] = df_wages['Average hourly wage'].astype(float)
df_wages['Industry average'] = df_wages['Industry average'].astype(float)
df_wages['Similar occupation average'] = df_wages['Similar occupation average'].astype(float)

# show output
df_wages.info()
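
`astype(float)` raises a `ValueError` if any entry cannot be parsed as a number. Should the file turn out to contain placeholder text (the `'N/A'` entry below is a hypothetical example, not something known to be in `wages.csv`), `pd.to_numeric` with `errors='coerce'` is one alternative that converts such entries to `NaN` instead of failing.

```python
import pandas as pd

# Hypothetical column with one unparseable entry
raw = pd.Series(['20.50', '31.10', 'N/A'])

# Unparseable entries become NaN; the rest become floats
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)
print(converted.isna().sum())
```

The trade-off is that bad values are silently turned into missing ones, so counting the `NaN` entries afterwards is a sensible check.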

## Task Group 3 - Comparison to Industry Average

### Task 9

Calculate the difference between the average hourly wage and the industry average. Assign the difference to a new column `Industry wage difference`.

In [None]:
df_wages['Industry wage difference'] = df_wages['Average hourly wage'] - df_wages['Industry average']

df_wages[['Occupation', 'Average hourly wage', 'Industry average', 'Industry wage difference']].head()

### Task 10

Divide `Industry wage difference` by `Industry average` to express the difference as a percent change relative to the industry average. (You might want to multiply by `100` at the end to display it as a percentage.)

Assign the result to a new column called `Industry wage pctchg`.

In [None]:
df_wages['Industry wage pctchg'] = df_wages['Industry wage difference'] / df_wages['Industry average'] * 100

df_wages[['Industry', 'Occupation', 'Level', 'Average hourly wage', 'Industry average', 'Industry wage pctchg']].head()
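
A worked example with round numbers makes the arithmetic easy to check by hand. The wages below are invented: one occupation pays 30.00 and another 18.00 in an industry that averages 24.00.

```python
import pandas as pd

# Hypothetical wages chosen so the result is easy to verify
demo = pd.DataFrame({
    'Average hourly wage': [30.0, 18.0],
    'Industry average': [24.0, 24.0],
})

# Difference from the industry average: 6.0 and -6.0
difference = demo['Average hourly wage'] - demo['Industry average']

# Relative to the industry average, as a percentage: 25.0 and -25.0
pctchg = difference / demo['Industry average'] * 100

print(pctchg.tolist())
```

A positive value means the occupation pays above its industry average and a negative value means it pays below.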

### Task 11

Sort `df_wages` by the `Industry wage pctchg` column from *highest* to *lowest*. Assign the result to the variable `highest_industry_pctchg`.

In [None]:
highest_industry_pctchg = df_wages.sort_values(by='Industry wage pctchg', ascending=False)

highest_industry_pctchg[['Industry', 'Occupation', 'Level', 'Industry wage pctchg']].head(10)
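
When only the top few rows are needed, `DataFrame.nlargest` is an alternative to sorting the whole frame and calling `head`. The sketch below compares the two on invented values.

```python
import pandas as pd

# Hypothetical occupations and percent changes
demo = pd.DataFrame({
    'Occupation': ['A', 'B', 'C', 'D'],
    'Industry wage pctchg': [5.0, -12.5, 40.0, 18.0],
})

# Full sort, then keep the first two rows
by_sort = demo.sort_values(by='Industry wage pctchg', ascending=False).head(2)

# Ask directly for the two largest values
by_nlargest = demo.nlargest(2, 'Industry wage pctchg')

print(by_sort['Occupation'].tolist())
print(by_nlargest['Occupation'].tolist())
```

The task itself asks for the fully sorted frame, so `sort_values` remains the right call there.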

## Task Group 4 - Computer Jobs

### Task 12

Use the separate `Industry` column you created in Task 5 to investigate occupations in the **'Computer and Mathematical Occupations'** industry. Filter `df_wages` for this specific industry and create a new DataFrame named `cs_math_occupations`.

In [None]:
cs_math_occupations = df_wages[df_wages['Industry'] == 'Computer and Mathematical Occupations']
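
The `==` comparison only matches values that are exactly equal, which is why the whitespace cleanup in Task 6 matters for this filter. The sketch below uses an invented frame (the second industry name and the occupation labels are hypothetical) in which one entry still has surrounding spaces. It also calls `.copy()`, an optional step that makes the filtered result an independent DataFrame in case columns are added to it later.

```python
import pandas as pd

target = 'Computer and Mathematical Occupations'

# Hypothetical rows; the second industry value was never stripped
demo = pd.DataFrame({
    'Industry': [target, ' ' + target + ' ', 'Sales and Related Occupations'],
    'Occupation': ['Job A', 'Job B', 'Job C'],
})

# Exact match misses the row with surrounding spaces
exact = demo[demo['Industry'] == target]

# Stripping first catches both rows; .copy() detaches the result from demo
stripped = demo[demo['Industry'].str.strip() == target].copy()

print(len(exact), len(stripped))
```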

### Task 13

Sort `cs_math_occupations` by `Average hourly wage` from highest to lowest, and display the results.

In [None]:
cs_math_occupations_sorted = cs_math_occupations.sort_values(by='Average hourly wage', ascending=False)

# show output
cs_math_occupations_sorted.head(10)